Principal Product Manager
What You’ll Be Doing
- Take full responsibility for the strategic direction and roadmap of the break-fix automation system spanning multiple vendors, technologies, and CSPs.
- Define automation confidence thresholds, blocking issue criteria, and human-in-the-loop intervention points that balance speed with operational safety.
- Build the operator UX for repair queues, workflow transparency, and audit trails — ensuring on-call engineers have the context they need to act quickly and confidently.
- Drive the integration between failure attribution and automated repair actions, following through from detection to resolution.
- Define repair SLOs and own the metrics framework for time-to-drain, time-to-healthy, and overall fleet availability.
- Collaborate with NCP operators, SRE teams, and hardware vendor partners to integrate RMA processes and optimize repair workflows at scale.
What We Need To See
- 15+ years of product management experience in infrastructure, platform, or MLOps areas, or equivalent background.
- BS or MS in Computer Science, Engineering, or a related technical area, or equivalent experience.
- Demonstrated expertise with distributed systems, workflow orchestration, and the safety tradeoffs inherent in automation.
- Track record owning products with real-world operational consequences — you understand blast radius and build accordingly.
- Strong operator UX instincts — proven ability to translate complex system state into workflows that on-call engineers can act on under pressure.
- Ability to build alignment across engineering, SRE, and external vendor partner teams.
Ways To Stand Out From The Crowd
- Hands-on experience with GPU infrastructure, datacenter operations, or AI factory environments.
- Experience with RMA logistics, vendor SLA oversight, and hardware repair processes at scale.
- Background in reliability engineering, SLO design, or chaos/fault-injection testing.
- Prior experience at a cloud service provider or on a hyperscaler infrastructure team.
- Experience building agentic AI workflow software.
Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 240,000 USD - 379,500 USD.
You will also be eligible for equity and benefits.
Applications for this job will be accepted at least until May 18, 2026.
This posting is for an existing vacancy.
NVIDIA uses AI tools in its recruiting processes.
NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.
JR2017591