xBerry - R&D House Linkedin · Posted 4mo ago

DevOps / Infrastructure & Field Support Engineer

Wrocław, Lower Silesia, Poland

Position Overview

We are looking for an experienced DevOps Engineer to join a team responsible for the maintenance and further development of a complex automation system deployed on-premises at customer sites.

The system is based on Linux (Ubuntu) and a containerized Kubernetes architecture.

The platform consists of multiple cooperating application and infrastructure components, including:

  • backend services,
  • GPU-based computing components (CUDA),
  • communication layer,
  • storage,
  • networking components.

The environment is characterized by high operational complexity and strong dependencies between system layers (OS, Kubernetes, applications, networking, storage). Systems are deployed across multiple locations worldwide and often operate in environments with limited local IT support, which requires high reliability and well-defined operational procedures.

The DevOps role goes beyond reactive incident handling. A key objective of the project is to systematically reduce the need for on-site interventions by developing automated monitoring, diagnostics, and recovery mechanisms.

Responsibilities

Incident Handling and System Maintenance

  • Diagnosing and resolving issues related to:
    • Kubernetes clusters,
    • containers (Docker),
    • Linux (Ubuntu) operating system,
    • networking,
    • storage (including NFS),
  • Analyzing logs and service health across application and infrastructure layers,
  • Restoring full system functionality in production environments,
  • Performing system deployments and upgrades at customer sites,
  • Participating in on-site interventions when issues cannot be resolved remotely.

Automation, Observability, and System Resilience

  • Designing and developing automated troubleshooting mechanisms,
  • Early detection of infrastructure and application-level issues,
  • Automated validation of the health of key system components:
    • OS,
    • Kubernetes,
    • containers,
    • storage,
    • networking,
  • Building health checks and observability solutions (metrics, alerts, dashboards),
  • Creating and maintaining:
    • runbooks,
    • standard recovery procedures,
    • automated self-healing mechanisms,
  • Documenting common incidents, root causes, and resolution methods.

Collaboration and Architecture Improvement

  • Close cooperation with development and architecture teams,
  • Contributing to architecture simplification and standardization,
  • Improving overall system stability and reliability,
  • Supporting long-term efforts to reduce operational overhead and manual interventions.

Requirements

Technical Requirements

  • Strong experience with Linux (Ubuntu) system administration and troubleshooting,
  • Hands-on experience with Kubernetes, including cluster troubleshooting and container analysis,
  • Practical knowledge of Docker,
  • Solid understanding of networking and diagnosing network-related issues,
  • Experience with NFS / storage troubleshooting,
  • Operational knowledge of GPU / CUDA environments (compatibility, stability),
  • Experience working with:
    • RabbitMQ,
    • PostgreSQL.

Additional Requirements

  • Willingness to participate in an on-call / standby rotation,
  • Readiness for business travel, including on-site customer visits,
  • Ability to work independently in complex, distributed environments,
  • Strong analytical and problem-solving skills.