jobgether Lever · Posted 18d ago

Infrastructure Operations Engineer

US Full-time

IT Security & IT Lever

Indexed description

This position is posted by Jobgether on behalf of a partner company. We are currently looking for an Infrastructure Operations Engineer in the United States.

In this role, you will help operate and scale the AI and GPU infrastructure that powers next-generation machine learning workloads across research, startup, and enterprise environments. You will work at the intersection of reliability engineering, cloud operations, and automation, ensuring that complex distributed systems remain performant, observable, and resilient. This position offers hands-on exposure to bare metal infrastructure, Kubernetes environments, and cloud platforms, with a strong emphasis on operational excellence and automation. You will collaborate closely with infrastructure engineers, network specialists, and software teams to resolve incidents, improve system reliability, and reduce operational friction. Operating in a fast-moving environment, you will contribute directly to platform stability and customer success. This is a highly technical, high-impact role for engineers who thrive in complex infrastructure ecosystems and enjoy building scalable operational systems.

Accountabilities:

In this role, you will be responsible for ensuring the reliability, scalability, and efficiency of large-scale infrastructure systems supporting GPU and cloud-based workloads.

    • Operate, monitor, and maintain large-scale Linux-based and GPU-enabled infrastructure environments
    • Support provisioning, deployment, and lifecycle management of compute and storage systems
    • Build automation and tooling to reduce operational overhead and improve system reliability
    • Manage and optimize cloud infrastructure components across AWS and hybrid environments
    • Work with Kubernetes clusters and containerized workloads to ensure system stability and performance
    • Support incident response, troubleshooting, and root cause analysis in production environments
    • Implement and improve observability solutions using monitoring and logging tools such as Prometheus and ELK
    • Collaborate with engineering and network teams to improve infrastructure design and operational workflows
    • Participate in on-call rotations and ensure timely resolution of production issues
    • Contribute to infrastructure improvements, including GitOps workflows and configuration management

Requirements:

This role requires strong infrastructure engineering experience with deep expertise in systems operations, cloud platforms, and automation.

    • 8+ years of experience working with Linux systems in production environments
    • 5+ years of experience with AWS infrastructure and cloud services
    • 2+ years of experience with Kubernetes and containerized workloads
    • Hands-on experience with Terraform and Ansible for infrastructure as code
    • Experience managing network-attached storage systems (e.g., NFS, Ceph, or similar)
    • Strong understanding of monitoring and observability tools such as Prometheus and ELK stack
    • Familiarity with GitOps workflows and modern infrastructure automation practices
    • Programming or scripting experience in Python, Go, Bash, or similar languages for automation
    • Strong networking fundamentals, including understanding of distributed systems and datacenter environments
    • Experience working with bare metal systems, GPU infrastructure, or large-scale compute environments is highly valued
    • Strong problem-solving skills and ability to operate effectively in ambiguous, fast-changing environments
    • Excellent communication skills and ability to collaborate across technical teams

Benefits:

    • Competitive salary ($160,000–$200,000 USD base range) plus equity and potential bonus
    • Fully flexible work environment (remote or hybrid within the United States)
    • Comprehensive medical, dental, and vision coverage (U.S. employees)
    • Retirement and financial wellness programs
    • Generous paid time off and company holidays
    • Paid parental leave
    • Professional development and learning support
    • Wellness, home-office, and work-from-home stipends
    • Opportunity to work on cutting-edge AI and GPU infrastructure at scale

How Jobgether works:

We use an AI-powered matching process to ensure your application is reviewed quickly, objectively, and fairly against the role's core requirements. Our system identifies the top-fitting candidates, and this shortlist is then shared directly with the hiring company. The final decision and next steps (interviews, assessments) are managed by their internal team. We appreciate your interest and wish you the best!

Data Privacy Notice:

By submitting your application, you acknowledge that Jobgether will process your personal data to evaluate your candidacy and share relevant information with the hiring employer. This processing is based on legitimate interest and pre-contractual measures under applicable data protection laws (including GDPR). You may exercise your rights (access, rectification, erasure, objection) at any time.