jobgether Lever · Posted 18d ago

Infrastructure Operations Engineer

US Full-time

IT Security & IT Lever

Indexed description

This position is posted by Jobgether on behalf of a partner company. We are currently looking for an Infrastructure Operations Engineer in the United States.

This is an exciting opportunity for a highly technical infrastructure professional to help operate and scale next-generation AI infrastructure platforms powering advanced machine learning and large-scale compute workloads. In this role, you will work hands-on with GPU environments, Linux systems, cloud infrastructure, and automation frameworks to ensure reliability, performance, and operational efficiency. You’ll collaborate across engineering, networking, and platform teams to solve complex infrastructure challenges and improve system scalability through automation and observability. The position offers exposure to cutting-edge AI technologies, large-scale distributed systems, and modern infrastructure operations in a fast-paced and innovation-driven environment. Ideal candidates are proactive problem-solvers who thrive in highly technical settings and enjoy building resilient systems that support mission-critical AI applications.

Accountabilities:

    • Design, implement, and maintain scalable infrastructure solutions that support large-scale AI and GPU computing environments.
    • Manage and optimize Linux-based systems, cloud infrastructure, Kubernetes clusters, and bare-metal environments to ensure platform reliability and operational efficiency.
    • Collaborate with infrastructure engineering, networking, software platform, and customer-facing teams to troubleshoot issues and improve service delivery.
    • Develop and maintain automation workflows using tools such as Terraform, Ansible, and scripting languages to reduce manual operational overhead.
    • Support infrastructure provisioning, monitoring, incident response, and break/fix operations across distributed systems and customer deployments.
    • Participate in an on-call rotation, handling critical operational incidents and ensuring high availability of infrastructure services.
    • Improve observability and operational tooling through monitoring systems, logging platforms, and infrastructure performance analysis.
    • Contribute to the deployment of new platforms, features, and operational patterns that enhance scalability, resilience, and customer experience.

Requirements:

    • 8+ years of experience managing Linux server or hosting environments, preferably including Ubuntu systems.
    • 5+ years of hands-on experience with AWS cloud infrastructure and services.
    • Strong expertise in Kubernetes, containerization technologies, and GitOps workflows.
    • Experience with infrastructure-as-code and automation tools such as Terraform and Ansible.
    • Knowledge of network-attached storage technologies including NFS, Ceph, or similar systems.
    • Experience with monitoring and observability platforms such as Prometheus and the ELK stack.
    • Proficiency in scripting or software development for automation using Python, Go, Bash, or similar languages.
    • Strong networking fundamentals; additional experience in data center networking, InfiniBand, or high-speed Ethernet is a plus.
    • Familiarity with GPU infrastructure, bare-metal provisioning, and hardware troubleshooting is highly desirable.
    • Excellent communication, analytical thinking, and problem-solving skills with the ability to navigate ambiguity and technical complexity.

Benefits:

    • Competitive compensation package with bonus opportunities and equity participation.
    • Annual base salary range of approximately $160,000 – $200,000 USD depending on experience and location.
    • Comprehensive medical, dental, and vision coverage.
    • Retirement and financial wellness support programs.
    • Generous paid time off and company holidays.
    • Paid parental leave and family support benefits.
    • Flexible remote or hybrid work environment within the United States.
    • Professional development resources and career growth opportunities.
    • Wellness and home office stipends to support remote productivity and employee well-being.
    • Inclusive and collaborative company culture focused on innovation and continuous learning.

How Jobgether works:

We use an AI-powered matching process to ensure your application is reviewed quickly, objectively, and fairly against the role's core requirements. Our system identifies the top-fitting candidates, and this shortlist is then shared directly with the hiring company. The final decision and next steps (interviews, assessments) are managed by their internal team. We appreciate your interest and wish you the best!

Data Privacy Notice:

By submitting your application, you acknowledge that Jobgether will process your personal data to evaluate your candidacy and share relevant information with the hiring employer. This processing is based on legitimate interest and pre-contractual measures under applicable data protection laws (including GDPR). You may exercise your rights (access, rectification, erasure, objection) at any time.