GTN Technical Staffing · LinkedIn · Posted 19d ago

Senior Kubernetes Engineer

Dallas, Texas, United States

Senior Kubernetes Engineer (GPU / AI Platforms)

Location: Dallas, TX (Hybrid), relocation offered

Type: Direct Hire

• 175-250K + performance bonus

• 100% company-paid benefits

• Relocation available

Overview

We are seeking a Senior Kubernetes Engineer to design, build, and optimize GPU-accelerated container platforms supporting large-scale HPC, AI/ML workloads, and next-generation CaaS / GPUaaS environments.

This role is focused on enabling scalable, multi-tenant compute platforms that power GPU-as-a-Service (GPUaaS) and Container-as-a-Service (CaaS) offerings across hybrid and on-prem infrastructure. You will work at the intersection of Kubernetes and the NVIDIA ecosystem, driving performance, efficiency, and reliability for high-throughput, GPU-intensive workloads.

The ideal candidate brings deep hands-on experience building production-grade Kubernetes platforms for AI and HPC workloads, along with strong development skills and a passion for high-performance, distributed systems at scale.

Key Responsibilities

Kubernetes Platform Engineering

  • Architect, deploy, and operate Kubernetes clusters optimized for GPU-intensive and multi-tenant workloads
  • Design platforms supporting CaaS / GPUaaS delivery models, ensuring scalability, resilience, and performance
  • Leverage NVIDIA GPU Operator, Network Operator, and DCGM for cluster management and observability

GPU Enablement & Scheduling

  • Integrate NVIDIA device plugins, MIG, and GPU sharing capabilities into Kubernetes scheduling frameworks
  • Optimize GPU utilization and workload placement using scheduler extensions and batch schedulers (kube-scheduler plugins, Volcano, Slurm)
  • Support AI/ML training, LLM workloads, and scientific computing at scale

Automation & Platform Development

  • Develop and maintain Kubernetes operators and custom controllers
  • Automate platform provisioning and lifecycle management using Go or Python
  • Implement Infrastructure-as-Code using Terraform, Helm, and Kustomize

Observability & Performance Optimization

  • Implement monitoring and telemetry using Prometheus, Grafana, DCGM Exporter, and OpenTelemetry
  • Drive performance tuning, capacity planning, and optimization across GPU-enabled clusters
  • Support incident response and ensure production readiness

Security & Multi-Tenancy

  • Design secure, multi-tenant Kubernetes environments using RBAC, namespaces, and policy enforcement (OPA, Gatekeeper)
  • Ensure workload isolation and governance across shared GPU infrastructure
  • Support secure platform operations across CaaS / GPUaaS environments

DevOps & CI/CD

  • Build and maintain CI/CD pipelines using GitOps tools such as ArgoCD and FluxCD
  • Support continuous delivery and lifecycle management of Kubernetes-based platforms

Cross-Functional Collaboration

  • Partner with HPC, AI/ML, DevOps, and platform engineering teams to support high-performance workloads
  • Collaborate on platform architecture, optimization strategies, and operational best practices

Required Experience

  • Extensive experience operating Kubernetes in production-scale environments
  • Strong experience supporting HPC, AI/ML, or GPU-accelerated infrastructure
  • Experience designing or supporting CaaS, GPUaaS, or multi-tenant platform environments
  • Deep expertise with NVIDIA and Kubernetes ecosystems including GPU Operator, device plugins, NVML, MIG, and DCGM
  • Strong understanding of Kubernetes internals (CRDs, RBAC, controllers, scheduler extensions)
  • Proficiency in Go or Python for automation and operator development
  • Experience supporting GPU-intensive workloads (LLMs, AI/ML pipelines, HPC applications)
  • Hands-on experience with Helm, Kustomize, and GitOps workflows

Technical Skills

  • Monitoring and observability: Prometheus, Grafana, DCGM Exporter, OpenTelemetry
  • Networking: CNI plugins (NVIDIA CNI, Multus), service networking, cluster networking concepts
  • Infrastructure-as-Code: Terraform, Helm, Kustomize
  • CI/CD and GitOps practices

Preferred Experience

  • Experience with container runtimes (containerd, CRI-O, NVIDIA Container Toolkit)
  • Exposure to advanced networking solutions such as Cilium
  • Contributions to open-source projects within Kubernetes or NVIDIA ecosystems
  • Experience working in large-scale HPC or AI infrastructure environments

Additional Requirements

  • This position requires applicants to be currently authorized to work in the U.S. without employer sponsorship.
  • We are unable to sponsor or take over sponsorship of employment visas at this time.
