Embrace Software Inc · Himalayas · Posted 10d ago

Azure CloudOps Engineer

USD · Full time · Remote

CloudOps Engineer · Azure Cloud Engineer · DevOps Engineer · Platform Engineer

This is a remote position.

We are looking for a CloudOps Engineer to operate and continuously improve the reliability, security, scalability, observability, and cost efficiency of our Azure-hosted SaaS products. Our products are deployed across development, QA, staging, and production environments, with infrastructure managed through Terraform and CI/CD automated through GitHub Actions.

This role will work closely with engineering teams to ensure our SaaS platforms and AI-enabled solutions are deployed consistently, monitored effectively, secured properly, and operated reliably in production.

Environment and Technology Context
  • Microsoft Azure-hosted SaaS products across dev, QA, staging, and production environments.
  • Terraform for infrastructure as code and repeatable environment provisioning.
  • GitHub Actions for application and infrastructure CI/CD workflows.
  • Azure services including Static Web Apps, Container Apps, PostgreSQL, Storage Accounts, SignalR, Service Bus, Azure AI Foundry, Speech-to-Text services, Azure Arc, and related services.
  • AI-enabled product capabilities including STT workloads, LLM integrations, AI service endpoints, quotas, usage monitoring, latency monitoring, and cost controls.

Key Responsibilities

Cloud Infrastructure Operations
  • Manage and support Azure cloud infrastructure across dev, QA, staging, and production environments.
  • Maintain operational health of Azure services including Static Web Apps, Container Apps, PostgreSQL, Storage Accounts, SignalR, Service Bus, Azure AI Foundry, Azure Arc, and related platform services.
  • Ensure cloud resources are provisioned, configured, monitored, maintained, and retired according to company standards.
  • Support environment setup for new products, customers, integrations, and internal initiatives.
  • Identify and resolve infrastructure issues affecting performance, reliability, availability, or security.

Terraform and Infrastructure as Code
  • Build, maintain, and improve Terraform modules and environment configurations.
  • Ensure infrastructure changes are version-controlled, peer-reviewed, tested, approved, and repeatable.
  • Manage Terraform state, workspaces, variables, secrets integration, and deployment workflows.
  • Detect and resolve configuration drift between Terraform and deployed Azure resources.
  • Standardize naming conventions, tagging, resource group structure, environment isolation, and module patterns.
  • Support scalable provisioning of new SaaS environments using reusable infrastructure templates.

GitHub Actions and CI/CD
  • Build, maintain, and troubleshoot GitHub Actions workflows for application and infrastructure deployments.
  • Support CI/CD pipelines for multiple SaaS products and environments.
  • Implement deployment promotion flows from development to QA to staging to production.
  • Add deployment safeguards such as environment protection rules, approvals, rollback procedures, validation checks, release gates, and audit trails.
  • Manage pipeline secrets, service principals, managed identities, and secure deployment credentials.
  • Improve build and deployment reliability, speed, traceability, and auditability.

AI Service Operations
  • Operate and monitor Azure AI services, including Azure AI Foundry and Speech-to-Text workloads.
  • Support production operations for LLM-based integrations and AI-enabled product features.
  • Monitor AI service availability, latency, quota usage, token consumption, API failures, throttling, and cost.
  • Help define operational standards for AI workloads, including access control, logging, alerting, failover, usage governance, and provider disruption handling.
  • Work with engineering teams to troubleshoot AI service issues, integration failures, degraded model responses, or provider-side service disruptions.
  • Support secure handling of AI-related secrets, endpoints, keys, managed identities, and private network access where applicable.

Monitoring, Alerting, and Observability
  • Implement and maintain monitoring using Azure Monitor, Log Analytics, Application Insights, and related tools.
  • Create dashboards for infrastructure, application, database, messaging, storage, AI service, and deployment health.
  • Configure alerts for availability, latency, errors, resource saturation, queue depth, failed jobs, failed deployments, database health, quota exhaustion, and cost anomalies.
  • Improve signal quality by reducing alert noise and ensuring alerts are actionable.
  • Partner with engineering teams to define service-level indicators, service-level objectives, and production health metrics.

Incident Response and Production Support
  • Participate in production incident response for cloud infrastructure, deployments, integrations, and platform services.
  • Triage and resolve issues across Azure services, CI/CD pipelines, Terraform, networking, databases, messaging, and AI integrations.
  • Create and maintain runbooks for common operational issues.
  • Support root cause analysis and post-incident reviews.
  • Implement preventive actions after incidents to improve system reliability.
  • Help define severity levels, escalation paths, response expectations, on-call processes, and production support procedures.

Security, Identity, and Access Management
  • Implement cloud security best practices across Azure environments.
  • Manage Azure RBAC, managed identities, service principals, Key Vault access, and least-privilege permissions.
  • Secure GitHub Actions workflows, deployment credentials, environment secrets, and production access.
  • Support secret rotation, certificate management, and secure configuration management.
  • Help enforce network security using private endpoints, firewalls, IP restrictions, and environment-specific access rules.
  • Support compliance readiness for audits, security reviews, and customer due diligence under SOC 2, ISO 27001, or similar frameworks.

Database, Storage, and Messaging Operations
  • Support operational management of Azure PostgreSQL databases, including backups, restores, performance monitoring, connection limits, high availability, and capacity planning.
  • Monitor and maintain Azure Storage Accounts, lifecycle policies, access controls, backup strategy, and usage trends.
  • Support Azure Service Bus operations, including queue/topic monitoring, dead-letter handling, retry behavior, and throughput issues.
  • Support SignalR operational health, connection metrics, scaling behavior, and related production issues.

Cost Management and Optimization
  • Monitor Azure spend across products, environments, services, and customers where applicable.
  • Implement tagging standards to support cost allocation by product, environment, customer, or business unit.
  • Create cost dashboards, budget alerts, anomaly detection processes, and recurring cost reviews.
  • Identify underutilized resources and recommend right-sizing opportunities.
  • Review AI service costs, LLM usage, token consumption, STT usage, storage growth, database sizing, and environment costs.
  • Recommend savings plans, reservations, scaling rules, lifecycle policies, or shutdown schedules where appropriate.

Reliability, Backup, and Disaster Recovery
  • Define and maintain backup and recovery procedures for critical cloud services.
  • Test database restores and validate backup reliability.
  • Help define recovery time objectives and recovery point objectives for production systems.
  • Support disaster recovery planning for SaaS products and customer-facing services.
  • Improve resilience through scaling rules, failover patterns, health checks, synthetic monitoring, and production readiness reviews.

Documentation and Operational Standards
  • Create and maintain CloudOps documentation, runbooks, deployment guides, troubleshooting guides, and environment standards.
  • Define standards for resource naming, tagging, logging, alerting, access control, Terraform structure, GitHub Actions workflow patterns, and production changes.
  • Document operational procedures for cloud services, CI/CD workflows, AI services, and incident response.
  • Enable engineering teams with reusable patterns, templates, and self-service guidance.


Requirements

Required Qualifications
  • 7+ years of hands-on experience operating production workloads in Microsoft Azure.
  • Strong experience with Terraform and infrastructure as code.
  • Experience building and maintaining CI/CD pipelines using GitHub Actions.
  • Experience supporting containerized workloads, preferably Azure Container Apps or similar platforms.
  • Experience with Azure monitoring and observability tools such as Azure Monitor, Log Analytics, and Application Insights.
  • Experience with Azure PostgreSQL or similar managed relational databases.
  • Strong understanding of Azure networking, DNS, identity, RBAC, managed identities, Key Vault, and security best practices.
  • Experience troubleshooting production incidents across infrastructure, application deployments, networking, and cloud services.
  • Comfortable writing scripts using Bash, PowerShell, Python, or similar tools.
  • Strong documentation, communication, and cross-functional collaboration skills.

Preferred Qualifications
  • Experience operating AI-enabled applications or Azure AI services.
  • Experience with Azure AI Foundry, Azure OpenAI, Speech-to-Text, or LLM-based integrations.
  • Experience monitoring AI service usage, quotas, latency, throttling, token consumption, and cost.
  • Experience with Azure Service Bus, SignalR, Storage Accounts, and Static Web Apps.
  • Experience with Azure Arc.
  • Experience supporting multi-product or multi-tenant SaaS platforms.
  • Experience with SOC 2, ISO 27001, or similar compliance frameworks.
  • Experience with FinOps, cloud cost governance, or Azure cost optimization.
  • Experience designing production support processes, incident response workflows, on-call rotations, and operational runbooks.


Benefits

  • Competitive salary commensurate with experience.
  • Opportunities for career advancement and professional development.
  • Opportunity to collaborate with a diverse, global team in a remote work setting.

Originally posted on Himalayas
