jobgether Lever · Posted 28d ago

Senior DevOps / Platform Reliability Engineer

US Full-time

IT Security & IT Lever

This position is posted by Jobgether on behalf of a partner company. We are currently looking for a Senior DevOps / Platform Reliability Engineer in the United States.

This role sits at the intersection of platform engineering, SRE, and AI-driven operations, supporting a next-generation intelligent automation platform used by enterprise-scale customers. You will be responsible for building and evolving the infrastructure backbone that powers production AI and multi-agent systems at scale. The environment is highly technical and fast-moving, requiring strong ownership of CI/CD, cloud infrastructure, observability, and security. You will work closely with engineering teams to ensure safe, reliable, and scalable deployments across complex distributed systems. A key aspect of the role involves integrating modern AI tools into DevOps workflows to reduce operational toil and improve system intelligence. This is a high-impact position where your work directly shapes platform reliability, developer velocity, and production safety.

Accountabilities:

    • Own and evolve CI/CD pipelines using modern tools such as GitHub Actions, ensuring safe, scalable, and reversible deployments for microservices and AI workloads
    • Design and manage Infrastructure as Code solutions using Terraform and CloudFormation to automate provisioning and environment consistency
    • Operate and scale Kubernetes-based infrastructure (EKS + Argo CD), including autoscaling, ingress, security controls, and multi-tenant isolation
    • Manage cloud networking and edge infrastructure including Cloudflare, AWS networking services, API gateways, load balancers, and DNS configurations
    • Oversee data and event infrastructure such as Aurora MySQL, Redis, S3, and Kafka (MSK), ensuring reliability, backups, and disaster recovery readiness
    • Build and maintain serverless and event-driven systems using AWS Lambda where appropriate
    • Develop observability platforms using Prometheus, Grafana, and OpenTelemetry, including telemetry for AI/LLM systems and agentic workflows
    • Strengthen security and compliance posture (SOC 2, HIPAA) through IAM design, secrets management, scanning, and policy-as-code enforcement
    • Drive FinOps initiatives including cost optimization, workload attribution, and LLM usage cost control
    • Partner with engineering teams to define deployment standards, operational SLOs, and platform best practices
    • Improve system reliability through monitoring, incident response, automation, and continuous infrastructure improvements
    • Document infrastructure, processes, and operational standards to enable scalability and knowledge sharing

Requirements:

    • 5+ years of experience in DevOps, SRE, or Platform Engineering supporting production systems on AWS
    • Strong hands-on experience with CI/CD systems such as GitHub Actions, GitLab CI, Jenkins, or CircleCI
    • Deep experience operating Kubernetes environments (EKS preferred), including scaling, upgrades, and production operations
    • Strong AWS networking knowledge including VPC design, routing, security groups, load balancing, and DNS management
    • Proficiency with Terraform and Infrastructure as Code practices, ideally using OIDC-based authentication
    • Experience with production databases and storage systems including Aurora/RDS MySQL, Redis, and S3
    • Strong observability expertise using Prometheus, Grafana, and OpenTelemetry
    • Experience with Argo CD for GitOps-based deployments
    • Strong understanding of Cloudflare and AWS edge/networking services
    • Experience with Kafka/MSK and event-driven architectures
    • Strong scripting skills in Python and Bash, and fluency in Linux environments
    • Solid understanding of security practices including IAM, KMS, secrets management, and supply chain security
    • Experience with compliance and vulnerability scanning tools
    • Ability to work independently while collaborating effectively in high-ownership engineering teams

Benefits:

    • Competitive compensation package
    • 100% employer-covered employee health premiums
    • 75%–80% coverage for dependent health, dental, and vision plans
    • 401(k) retirement plan
    • Paid parental leave
    • Unlimited PTO policy
    • Fully remote work flexibility across the United States
    • Up to $200/month co-working space reimbursement
    • Home office stipend up to $500 for setup
    • Monthly $100 stipend for internet, phone, and related expenses
    • Opportunity to work on cutting-edge AI-native infrastructure and agentic systems
    • High-autonomy engineering culture focused on ownership and innovation

How Jobgether works: We use an AI-powered matching process to ensure your application is reviewed quickly, objectively, and fairly against the role's core requirements. Our system identifies the top-fitting candidates, and this shortlist is then shared directly with the hiring company. The final decision and next steps (interviews, assessments) are managed by their internal team. We appreciate your interest and wish you the best!

Data Privacy Notice: By submitting your application, you acknowledge that Jobgether will process your personal data to evaluate your candidacy and share relevant information with the hiring employer. This processing is based on legitimate interest and pre-contractual measures under applicable data protection laws (including GDPR). You may exercise your rights (access, rectification, erasure, objection) at any time.