jobgether Lever · Posted 18d ago

Infrastructure Operations Engineer

US Full-time

This position is posted by Jobgether on behalf of a partner company. We are currently looking for an Infrastructure Operations Engineer in the United States.

In this role, you will help operate and scale large AI and GPU infrastructure that powers next-generation machine learning workloads across research, startup, and enterprise environments. You will work at the intersection of reliability engineering, cloud operations, and automation, ensuring that complex distributed systems remain performant, observable, and resilient. The position offers hands-on exposure to bare metal infrastructure, Kubernetes environments, and cloud platforms, with a strong emphasis on operational excellence and automation. You will collaborate closely with infrastructure engineers, network specialists, and software teams to resolve incidents, improve system reliability, and reduce operational friction. Operating in a fast-moving environment, you will contribute directly to platform stability and customer success. This is a highly technical role for engineers who thrive in complex infrastructure ecosystems and enjoy building scalable operational systems.

Accountabilities:

In this role, you will be responsible for ensuring the reliability, scalability, and efficiency of large-scale infrastructure systems supporting GPU and cloud-based workloads.

Operate, monitor, and maintain large-scale Linux-based and GPU-enabled infrastructure environments
Support provisioning, deployment, and lifecycle management of compute and storage systems
Build automation and tooling to reduce operational overhead and improve system reliability
Manage and optimize cloud infrastructure components across AWS and hybrid environments
Work with Kubernetes clusters and containerized workloads to ensure system stability and performance
Support incident response, troubleshooting, and root cause analysis in production environments
Implement and improve observability solutions using monitoring and logging tools such as Prometheus and ELK
Collaborate with engineering and network teams to improve infrastructure design and operational workflows
Participate in on-call rotations and ensure timely resolution of production issues
Contribute to infrastructure improvements, including GitOps workflows and configuration management

Requirements:

This role requires strong infrastructure engineering experience with deep expertise in systems operations, cloud platforms, and automation.

8+ years of experience working with Linux systems in production environments
5+ years of experience with AWS infrastructure and cloud services
2+ years of experience with Kubernetes and containerized workloads
Hands-on experience with Terraform and Ansible for infrastructure as code
Experience managing network-attached storage systems (e.g., NFS, Ceph, or similar)
Strong understanding of monitoring and observability tools such as Prometheus and ELK stack
Familiarity with GitOps workflows and modern infrastructure automation practices
Programming or scripting experience in Python, Go, Bash, or similar languages for automation
Strong networking fundamentals, including understanding of distributed systems and datacenter environments
Experience working with bare metal systems, GPU infrastructure, or large-scale compute environments is highly valued
Strong problem-solving skills and ability to operate effectively in ambiguous, fast-changing environments
Excellent communication skills and ability to collaborate across technical teams

Benefits:

Competitive salary ($160,000–$200,000 USD base range) plus equity and potential bonus
Fully flexible work environment (remote or hybrid within the United States)
Comprehensive medical, dental, and vision coverage (U.S. employees)
Retirement and financial wellness programs
Generous paid time off and company holidays
Paid parental leave
Professional development and learning support
Wellness, home-office, and work-from-home stipends
Opportunity to work on cutting-edge AI and GPU infrastructure at scale

How Jobgether works:

We use an AI-powered matching process to ensure your application is reviewed quickly, objectively, and fairly against the role's core requirements. Our system identifies the top-fitting candidates, and this shortlist is then shared directly with the hiring company. The final decision and next steps (interviews, assessments) are managed by their internal team. We appreciate your interest and wish you the best!

Data Privacy Notice:

By submitting your application, you acknowledge that Jobgether will process your personal data to evaluate your candidacy and share relevant information with the hiring employer. This processing is based on legitimate interest and pre-contractual measures under applicable data protection laws (including GDPR). You may exercise your rights (access, rectification, erasure, objection) at any time.

