Senior AI Infrastructure & Platform Operations Engineer
Role Overview
As a Senior AI Infrastructure & Platform Operations Engineer, you will serve as a technical leader within the operations organization, providing deep expertise across infrastructure, networking, platform operations, and service reliability. You will be responsible for driving operational excellence across complex production environments while acting as a key escalation point for critical incidents and challenging technical issues.
What You Will Do
- Lead the investigation and resolution of complex infrastructure, networking, and platform-related incidents.
- Support large-scale NVIDIA GPU infrastructure and high-performance networking environments.
- Troubleshoot complex Linux, Kubernetes, networking, storage, and hardware-related issues.
Requirements
- 7+ years of experience in infrastructure operations, platform operations, site reliability engineering, network operations, cloud operations, datacenter operations, or related technical roles.
- Expert-level Linux administration and troubleshooting skills.
- Strong networking expertise, including experience diagnosing complex performance, connectivity, and reliability issues.
- Strong experience operating Kubernetes in production environments.
- Experience supporting large-scale production infrastructure and distributed systems.
- Proven experience leading technical investigations and managing complex incidents.
- Experience performing root cause analysis and driving long-term operational improvements.
- Strong understanding of observability, monitoring, and service reliability practices.
- Excellent troubleshooting and analytical skills across multiple infrastructure domains.
- Strong communication, collaboration, and stakeholder management skills.
Benefits
- Operate some of the most advanced AI infrastructure environments in production today.
- Work with the latest NVIDIA GPU technologies, Kubernetes platforms, and high-performance networking environments.
- Help define operational standards and reliability practices for next-generation AI infrastructure services.
- Influence the adoption of AI-powered operational capabilities through k0rdent AI.
- Work alongside highly skilled engineers solving complex infrastructure and platform challenges at scale.
- Join a growing organisation investing heavily in AI infrastructure, platform services, and operational innovation.
Originally posted on Himalayas