Azure CloudOps Engineer
This is a remote position.
We are looking for a CloudOps Engineer to operate and continuously improve the reliability, security, scalability, observability, and cost efficiency of our Azure-hosted SaaS products. Our products are deployed across development, QA, staging, and production environments, with infrastructure managed through Terraform and CI/CD automated through GitHub Actions. This role will work closely with engineering teams to ensure our SaaS platforms and AI-enabled solutions are deployed consistently, monitored effectively, secured properly, and operated reliably in production.
Environment and Technology Context
- Microsoft Azure-hosted SaaS products across dev, QA, staging, and production environments.
- Terraform for infrastructure as code and repeatable environment provisioning.
- GitHub Actions for application and infrastructure CI/CD workflows.
- Azure services including Static Web Apps, Container Apps, PostgreSQL, Storage Accounts, SignalR, Service Bus, Azure AI Foundry, Speech-to-Text services, Azure Arc, and related services.
- AI-enabled product capabilities including STT workloads, LLM integrations, AI service endpoints, quotas, usage monitoring, latency monitoring, and cost controls.
Key Responsibilities
Cloud Infrastructure Operations
- Manage and support Azure cloud infrastructure across dev, QA, staging, and production environments.
- Maintain operational health of Azure services including Static Web Apps, Container Apps, PostgreSQL, Storage Accounts, SignalR, Service Bus, Azure AI Foundry, Azure Arc, and related platform services.
- Ensure cloud resources are provisioned, configured, monitored, maintained, and retired according to company standards.
- Support environment setup for new products, customers, integrations, and internal initiatives.
- Identify and resolve infrastructure issues affecting performance, reliability, availability, or security.
Terraform and Infrastructure as Code
- Build, maintain, and improve Terraform modules and environment configurations.
- Ensure infrastructure changes are version-controlled, peer-reviewed, tested, approved, and repeatable.
- Manage Terraform state, workspaces, variables, secrets integration, and deployment workflows.
- Detect and resolve configuration drift between Terraform and deployed Azure resources.
- Standardize naming conventions, tagging, resource group structure, environment isolation, and module patterns.
- Support scalable provisioning of new SaaS environments using reusable infrastructure templates.
GitHub Actions and CI/CD
- Build, maintain, and troubleshoot GitHub Actions workflows for application and infrastructure deployments.
- Support CI/CD pipelines for multiple SaaS products and environments.
- Implement deployment promotion flows from development to QA to staging to production.
- Add deployment safeguards such as environment protection rules, approvals, rollback procedures, validation checks, release gates, and audit trails.
- Manage pipeline secrets, service principals, managed identities, and secure deployment credentials.
- Improve build and deployment reliability, speed, traceability, and auditability.
AI Service Operations
- Operate and monitor Azure AI services, including Azure AI Foundry and Speech-to-Text workloads.
- Support production operations for LLM-based integrations and AI-enabled product features.
- Monitor AI service availability, latency, quota usage, token consumption, API failures, throttling, and cost.
- Help define operational standards for AI workloads, including access control, logging, alerting, failover, usage governance, and provider disruption handling.
- Work with engineering teams to troubleshoot AI service issues, integration failures, degraded model responses, or provider-side service disruptions.
- Support secure handling of AI-related secrets, endpoints, keys, managed identities, and private network access where applicable.
Monitoring, Alerting, and Observability
- Implement and maintain monitoring using Azure Monitor, Log Analytics, Application Insights, and related tools.
- Create dashboards for infrastructure, application, database, messaging, storage, AI service, and deployment health.
- Configure alerts for availability, latency, errors, resource saturation, queue depth, failed jobs, failed deployments, database health, quota exhaustion, and cost anomalies.
- Improve signal quality by reducing alert noise and ensuring alerts are actionable.
- Partner with engineering teams to define service-level indicators, service-level objectives, and production health metrics.
Incident Response and Production Support
- Participate in production incident response for cloud infrastructure, deployments, integrations, and platform services.
- Triage and resolve issues across Azure services, CI/CD pipelines, Terraform, networking, databases, messaging, and AI integrations.
- Create and maintain runbooks for common operational issues.
- Support root cause analysis and post-incident reviews.
- Implement preventive actions after incidents to improve system reliability.
- Help define severity levels, escalation paths, response expectations, on-call processes, and production support procedures.
Security, Identity, and Access Management
- Implement cloud security best practices across Azure environments.
- Manage Azure RBAC, managed identities, service principals, Key Vault access, and least-privilege permissions.
- Secure GitHub Actions workflows, deployment credentials, environment secrets, and production access.
- Support secret rotation, certificate management, and secure configuration management.
- Help enforce network security using private endpoints, firewalls, IP restrictions, and environment-specific access rules.
- Support compliance readiness for audits, security reviews, customer due diligence, SOC 2, ISO 27001, or similar frameworks.
Database, Storage, and Messaging Operations
- Support operational management of Azure PostgreSQL databases, including backups, restores, performance monitoring, connection limits, high availability, and capacity planning.
- Monitor and maintain Azure Storage Accounts, lifecycle policies, access controls, backup strategy, and usage trends.
- Support Azure Service Bus operations, including queue/topic monitoring, dead-letter handling, retry behavior, and throughput issues.
- Support SignalR operational health, connection metrics, scaling behavior, and related production issues.
Cost Management and Optimization
- Monitor Azure spend across products, environments, services, and customers where applicable.
- Implement tagging standards to support cost allocation by product, environment, customer, or business unit.
- Create cost dashboards, budget alerts, anomaly detection processes, and recurring cost reviews.
- Identify underutilized resources and recommend right-sizing opportunities.
- Review AI service costs, LLM usage, token consumption, STT usage, storage growth, database sizing, and environment costs.
- Recommend savings plans, reservations, scaling rules, lifecycle policies, or shutdown schedules where appropriate.
Reliability, Backup, and Disaster Recovery
- Define and maintain backup and recovery procedures for critical cloud services.
- Test database restores and validate backup reliability.
- Help define recovery time objectives and recovery point objectives for production systems.
- Support disaster recovery planning for SaaS products and customer-facing services.
- Improve resilience through scaling rules, failover patterns, health checks, synthetic monitoring, and production readiness reviews.
Documentation and Operational Standards
- Create and maintain CloudOps documentation, runbooks, deployment guides, troubleshooting guides, and environment standards.
- Define standards for resource naming, tagging, logging, alerting, access control, Terraform structure, GitHub Actions workflow patterns, and production changes.
- Document operational procedures for cloud services, CI/CD workflows, AI services, and incident response.
- Enable engineering teams with reusable patterns, templates, and self-service guidance.
Requirements
Required Qualifications
- 7+ years of hands-on experience operating production workloads in Microsoft Azure.
- Strong experience with Terraform and infrastructure as code.
- Experience building and maintaining CI/CD pipelines using GitHub Actions.
- Experience supporting containerized workloads, preferably Azure Container Apps or similar platforms.
- Experience with Azure monitoring and observability tools such as Azure Monitor, Log Analytics, and Application Insights.
- Experience with Azure PostgreSQL or similar managed relational databases.
- Strong understanding of Azure networking, DNS, identity, RBAC, managed identities, Key Vault, and security best practices.
- Experience troubleshooting production incidents across infrastructure, application deployments, networking, and cloud services.
- Comfortable writing scripts using Bash, PowerShell, Python, or similar tools.
- Strong documentation, communication, and cross-functional collaboration skills.
Preferred Qualifications
- Experience operating AI-enabled applications or Azure AI services.
- Experience with Azure AI Foundry, Azure OpenAI, Speech-to-Text, or LLM-based integrations.
- Experience monitoring AI service usage, quotas, latency, throttling, token consumption, and cost.
- Experience with Azure Service Bus, SignalR, Storage Accounts, and Static Web Apps.
- Experience with Azure Arc.
- Experience supporting multi-product or multi-tenant SaaS platforms.
- Experience with SOC 2, ISO 27001, or similar compliance frameworks.
- Experience with FinOps, cloud cost governance, or Azure cost optimization.
- Experience designing production support processes, incident response workflows, on-call rotations, and operational runbooks.
Benefits
- Competitive salary commensurate with experience.
- Opportunities for career advancement and professional development.
- Opportunity to collaborate with a diverse, global team in a remote work setting.
Originally posted on Himalayas