DevOps / Infrastructure & Field Support Engineer
We are looking for an experienced DevOps Engineer to join a team responsible for the maintenance and further development of a complex automation system deployed on-premises at customer sites.
The system is based on Linux (Ubuntu) and a containerized Kubernetes architecture.
The platform consists of multiple cooperating application and infrastructure components, including:
- backend services,
- GPU-based computing components (CUDA),
- communication layer,
- storage,
- networking components.
The DevOps role goes beyond reactive incident handling. A key objective of the project is to systematically reduce the need for on-site interventions by developing automated monitoring, diagnostics, and recovery mechanisms.
Responsibilities
Incident Handling and System Maintenance
- Diagnosing and resolving issues related to:
  - Kubernetes clusters,
  - containers (Docker),
  - Linux (Ubuntu) operating system,
  - networking,
  - storage (including NFS),
- Analyzing logs and service health across application and infrastructure layers,
- Restoring full system functionality in production environments,
- Performing system deployments and upgrades at customer sites,
- Participating in on-site interventions when issues cannot be resolved remotely.
- Designing and developing automated troubleshooting mechanisms,
- Early detection of infrastructure and application-level issues,
- Automated validation of the health of key system components:
  - OS,
  - Kubernetes,
  - containers,
  - storage,
  - networking,
- Building health checks and observability solutions (metrics, alerts, dashboards),
- Creating and maintaining:
  - runbooks,
  - standard recovery procedures,
  - automated self-healing mechanisms,
- Documenting common incidents, root causes, and resolution methods.
- Close cooperation with development and architecture teams,
- Contributing to architecture simplification and standardization,
- Improving overall system stability and reliability,
- Supporting long-term efforts to reduce operational overhead and manual interventions.
Additional Requirements
- Strong experience with Linux (Ubuntu) system administration and troubleshooting,
- Hands-on experience with Kubernetes, including cluster troubleshooting and container analysis,
- Practical knowledge of Docker,
- Solid understanding of networking and diagnosing network-related issues,
- Experience with NFS / storage troubleshooting,
- Operational knowledge of GPU / CUDA environments (compatibility, stability),
- Experience working with:
  - RabbitMQ,
  - PostgreSQL.
- Willingness to participate in an on-call / standby rotation,
- Readiness for business travel, including on-site customer visits,
- Ability to work independently in complex, distributed environments,
- Strong analytical and problem-solving skills.