Lead DevOps Engineer
What You'll Do
Multi-Cloud & Multi-Tenant Infrastructure:
- Design and manage infrastructure across AWS and GCP, ensuring consistent networking, security, and deployment patterns across both clouds.
- Architect tenant-isolated environments with secure VPC networking — no public-facing IPs, private subnets, VPC peering, endpoints, and VPN connectivity.
- Build and operate production Kubernetes clusters to host containerized microservices at scale.
- Define the strategy for which workloads run where — cloud vs. on-premise — based on data sensitivity, latency, and compliance requirements.
- Own and evolve a centralized, modular CI/CD pipeline built on GitHub Actions as the single path to production.
- Eliminate direct developer access to production environments; implement controlled deployment workflows using session-based access tools (e.g., AWS SSM Session Manager).
- Establish branch protection, image signing, environment promotion gates, and tenant-aware deployment strategies.
- Oversee configuration management for client-site appliances using a tool such as Chef in a client-server architecture.
- Drive the strategy to progressively centralize microservices into cloud-hosted infrastructure, minimizing the on-premise footprint.
- Define remote access procedures, failure runbooks, and contingency workflows for on-premise hardware.
- Enforce infrastructure security best practices for a healthcare environment handling PHI and de-identified clinical data across tenant boundaries.
- Manage VPN-based access to private cloud networks and implement least-privilege IAM, secrets management, and policy-as-code across all environments.
- Ensure tenant data isolation at the network, storage, and compute layers.
- Build and maintain unified observability using Prometheus and Grafana across cloud and on-premise environments.
- Own the backup and disaster recovery strategy — container registries, automated snapshots, and cross-cloud resilience.
- Define and track SLOs for critical data pipelines and tenant-facing services.
- Mentor junior DevOps/infrastructure engineers and collaborate closely with data engineering, AI, and IT teams.
- Recommend and help hire for supporting roles (e.g., IT support for on-premise hardware operations).
- Establish DevOps standards, documentation, and runbooks for the team.
- 6+ years of DevOps/Infrastructure/SRE experience, with at least 2 years in a lead or senior capacity.
- Production experience across AWS and GCP — VPCs, IAM, compute, storage, and managed services on both platforms.
- Hands-on experience running Kubernetes in production — cluster lifecycle, Helm charts, service mesh, autoscaling, and troubleshooting.
- Deep expertise in CI/CD design using GitHub Actions (or comparable platforms) with a focus on security and governance.
- Strong understanding of multi-tenant architecture patterns — network isolation, tenant-aware deployments, and data segregation.
- Solid Docker and container lifecycle management experience.
- Infrastructure-as-Code proficiency with Terraform (multi-provider) or equivalent.
- Networking fundamentals — VPNs, VPCs, DNS, firewalls, load balancers, and zero-trust architectures.
- Comfort with Python and shell scripting for automation.
- A production-first, outcome-oriented mindset — you measure success by what's running reliably in production, not by what's in a slide deck. Customer value over story-point velocity.
- Excellent communication skills — you can translate complex technical concepts for both engineering peers and business stakeholders.
- 1+ years of DevOps team management experience: you've directly managed DevOps engineers, run standups, handled performance conversations, and built team culture.
- Experience with the AI-native stack: vector databases (Pinecone, Weaviate, pgvector), RAG pipelines, feature stores, LLM orchestration frameworks (LangChain, LlamaIndex), and ML pipeline tooling (MLflow, Kubeflow, SageMaker).
- Experience in healthcare, life sciences, or any environment with strict data privacy requirements (HIPAA, PHI handling).
- Experience with configuration management tools such as Chef, Ansible, or Puppet.
- Familiarity with Elasticsearch operations and management.
- Experience managing hybrid environments with on-premise hardware alongside cloud infrastructure.
- Exposure to Prometheus, Grafana, and alerting pipeline design.
- Background working with data engineering teams running ETL/ELT pipelines.