Senior DevOps & Site Reliability Engineer - Americas
Your Role as a Senior DevOps & Site Reliability Engineer:
Our Cloud Operations team is seeking a Senior DevOps & Site Reliability Engineer who will play a critical role in ensuring the reliability, performance, and scalability of our diverse SaaS applications. You are a problem-solver and an automator at heart. This role is a specialized hybrid, bridging the gap between legacy VM-based architectures and modern cloud-native standards through aggressive automation and development-focused operations.
Unlike a traditional SRE, this role is deeply integrated with the software development lifecycle, focusing on the consolidation and optimization of platform operations. You will be responsible for building the CI/CD frameworks, self-service tools, and AI-driven automation that allow our engineering teams to move faster while maintaining rock-solid stability. Your mission is to maximize the ROI of our existing infrastructure by "automating away" manual toil. On-call coverage is required on a weekly rotation.
A Day in the Life of a Senior DevOps & Site Reliability Engineer:
In this role, you will be the technical anchor for a global platform footprint that includes a mix of Azure IaaS/PaaS, Google Cloud Platform (GCP), Kubernetes, and various data platforms. Your day will consist of:
- Intelligent Automation & DevOps: Identifying manual "toil" and replacing it with automated workflows for monitoring, change management, and routine administration of large-scale VM environments to ensure a positive ROI.
- AI-Enhanced Operations: Leading the integration of AI tools for automated code reviews, development frameworks, and predictive log analysis to drive departmental velocity and efficiency.
- Scalable CI/CD & Provisioning: Designing and maintaining "self-service" deployment frameworks and CI/CD pipelines (GitHub Actions, Bamboo) using Infrastructure as Code (Bicep, Terraform).
- Strategic ROI Projects: Evaluating platform components to determine the most cost-effective path: automating the current state or migrating features to modern, shared architectures.
- Unified Observability: Designing and maintaining a comprehensive observability stack across Azure and GCP (metrics, logs, traces) to identify performance bottlenecks and proactively address system defects.
- Cross-Functional Collaboration: Partnering with engineering, security, and operations teams to ensure new features are "born" with reliability, security, and automated delivery in mind, and ensuring adherence to security best practices and compliance standards (SOC2, HIPAA, ISO 27001) while maintaining cost-efficient operational excellence.
- Root Cause Analysis & Forensics: Investigating complex performance defects by following log trails across web, application, and database tiers (SQL Server, MongoDB, MySQL).
- Governance & Security: Ensuring all platforms meet security standards (SOC2, HIPAA, ISO 27001) through automated policy enforcement across Azure and GCP.
- Must have a passion for lifelong learning.
- 6+ years in DevOps or SRE roles, with a proven track record of bridging development and operations in complex cloud environments.
- Extensive experience with Microsoft Azure (IaaS, PaaS, App Services, Networking) and/or Google Cloud Platform (GCP).
- Expert-level PowerShell and Python skills. Hands-on experience with Bicep or Terraform is required.
- Strong background in Windows/Linux Server OS, Kubernetes (AKS/GKE), Helm, and container orchestration.
- Familiarity with various middleware and PaaS technologies (e.g., Event Hub, Service Bus, CosmosDB, RabbitMQ, MongoDB).
- Expert-level troubleshooting and the ability to reason through complex process workflows to identify faults in large-scale platform environments.
- Experience with the Atlassian suite (Jira, Confluence, Bitbucket).
- Experience with AI-driven log analysis or automated incident remediation.
- Knowledge of database tuning (SQL Server, MySQL, MongoDB).
- Familiarity with compliance standards (SOC2, HIPAA, GDPR).
- Generous PTO
- 5 additional days off for training
- Flexible work schedules
- Remote work opportunities
- Appspace Quiet Fridays (No non-essential internal meetings scheduled)
- Paid company holidays