Sr. Site Reliability Engineer
This is a true Software Engineering-focused Site Reliability Engineering (SRE) role — not a traditional infrastructure support, operations, or system administration position.
The ideal candidate is a strong software engineer first, with deep experience building automation, improving reliability, and applying engineering principles to operational challenges within large-scale enterprise environments.
Position Overview
The team is responsible for:
- Production system reliability
- Supporting the software development lifecycle
- Maintaining highly available and resilient systems
- Monitoring, alerting, observability, and automation initiatives
- Driving permanent engineering solutions instead of temporary operational fixes
- Reducing operational toil through automation
- Building self-healing and auto-remediation capabilities
- Improving deployment reliability and production stability
- Supporting SLOs, error budgets, and observability standards
- Leveraging AI-assisted engineering and operational tooling
Key Responsibilities
- Design, develop, and operate automation platforms supporting CI/CD pipelines
- Develop backend automation services primarily using Python and Go
- Build automation to eliminate repetitive operational work
- Create self-healing and auto-remediation solutions
- Improve production reliability, scalability, and operational efficiency
- Implement monitoring, observability, and reliability standards
- Collaborate across engineering, infrastructure, and application teams to improve deployment quality and platform resilience
- Support AI-assisted tooling initiatives and agentic operational frameworks
- Manage cloud and hybrid infrastructure environments including Azure and on-prem/private cloud platforms
- Utilize Infrastructure-as-Code (IaC) tools including Terraform and Ansible
- Support Kubernetes and Docker-based container orchestration environments
Qualifications
- Strong software engineering and development background
- Experience automating operational workflows within enterprise environments
- Strong proficiency with:
  - Python
  - Go
- Experience working within:
  - Cloud-native environments
  - Kubernetes and containerized platforms
  - AWS and/or Azure
  - CI/CD automation frameworks
  - Monitoring and observability tooling
- Experience with:
  - GitLab
  - Jira
  - Logging and telemetry platforms
- Java development experience
- Terraform and Infrastructure-as-Code expertise
- Experience with:
  - AppDynamics
  - Splunk or centralized logging platforms
  - ServiceNow
- Exposure to AI-assisted engineering tools such as Windsurf or Claude
- AI-assisted development and productivity tools
- Agentic workflows
- Prompt engineering concepts
- Techniques for improving AI output quality and reducing hallucinations
Ideal Candidate Profile
- Senior-level engineer with strong technical depth
- Excellent communication and stakeholder management skills
- Strong collaborative mindset with the ability to work cross-functionally
- Deep problem-solving and production troubleshooting abilities
- Comfortable operating across multiple teams, applications, and business domains
- Experience working within large enterprise and regulated environments preferred