Senior Kubernetes Engineer

Dallas, Texas, United States

Senior Kubernetes Engineer (GPU / AI Platforms)

Location: Dallas, TX (Hybrid), relocation offered

Type: Direct Hire

• 175-250K + performance bonus

• 100% company-paid benefits

• Relocation available

Overview

We are seeking a Senior Kubernetes Engineer to design, build, and optimize GPU-accelerated container platforms supporting large-scale HPC and AI/ML workloads as well as next-generation CaaS / GPUaaS environments.

This role is focused on enabling scalable, multi-tenant compute platforms that power GPU-as-a-Service (GPUaaS) and Container-as-a-Service (CaaS) offerings across hybrid and on-prem infrastructure. You will work at the intersection of Kubernetes and the NVIDIA ecosystem, driving performance, efficiency, and reliability for high-throughput, GPU-intensive workloads.

The ideal candidate brings deep hands-on experience building production-grade Kubernetes platforms for AI and HPC workloads, along with strong development skills and a passion for high-performance, distributed systems at scale.

Key Responsibilities

Kubernetes Platform Engineering

Architect, deploy, and operate Kubernetes clusters optimized for GPU-intensive and multi-tenant workloads
Design platforms supporting CaaS / GPUaaS delivery models, ensuring scalability, resilience, and performance
Leverage NVIDIA GPU Operator, Network Operator, and DCGM for cluster management and observability

GPU Enablement & Scheduling

Integrate NVIDIA device plugins, MIG, and GPU sharing capabilities into Kubernetes scheduling frameworks
Optimize GPU utilization and workload placement using scheduler extensions and batch schedulers (kube-scheduler plugins, Volcano, Slurm)
Support AI/ML training, LLM workloads, and scientific computing at scale

Automation & Platform Development

Develop and maintain Kubernetes operators and custom controllers
Automate platform provisioning and lifecycle management using Go or Python
Implement Infrastructure-as-Code using Terraform, Helm, and Kustomize

Observability & Performance Optimization

Implement monitoring and telemetry using Prometheus, Grafana, DCGM Exporter, and OpenTelemetry
Drive performance tuning, capacity planning, and optimization across GPU-enabled clusters
Support incident response and ensure production readiness

Security & Multi-Tenancy

Design secure, multi-tenant Kubernetes environments using RBAC, namespaces, and policy enforcement (OPA, Gatekeeper)
Ensure workload isolation and governance across shared GPU infrastructure
Support secure platform operations across CaaS / GPUaaS environments

DevOps & CI/CD

Build and maintain CI/CD pipelines using GitOps tools such as ArgoCD and FluxCD
Support continuous delivery and lifecycle management of Kubernetes-based platforms

Cross-Functional Collaboration

Partner with HPC, AI/ML, DevOps, and platform engineering teams to support high-performance workloads
Collaborate on platform architecture, optimization strategies, and operational best practices

Required Experience

Extensive experience operating Kubernetes in production-scale environments
Strong experience supporting HPC, AI/ML, or GPU-accelerated infrastructure
Experience designing or supporting CaaS, GPUaaS, or multi-tenant platform environments
Deep expertise with the NVIDIA and Kubernetes ecosystems, including GPU Operator, device plugins, NVML, MIG, and DCGM
Strong understanding of Kubernetes internals (CRDs, RBAC, controllers, scheduler extensions)
Proficiency in Go or Python for automation and operator development
Experience supporting GPU-intensive workloads (LLMs, AI/ML pipelines, HPC applications)
Hands-on experience with Helm, Kustomize, and GitOps workflows

Technical Skills

Monitoring and observability: Prometheus, Grafana, DCGM Exporter, OpenTelemetry
Networking: CNI plugins (NVIDIA CNI, Multus), service networking, cluster networking concepts
Infrastructure-as-Code: Terraform, Helm, Kustomize
CI/CD and GitOps practices

Preferred Experience

Experience with container runtimes (containerd, CRI-O, NVIDIA Container Toolkit)
Exposure to advanced networking solutions such as Cilium
Contributions to open-source projects within Kubernetes or NVIDIA ecosystems
Experience working in large-scale HPC or AI infrastructure environments

Additional Requirements

This position requires applicants to be currently authorized to work in the U.S. without employer sponsorship.
We are unable to sponsor or take over sponsorship of employment visas at this time.

Company: GTN Technical Staffing

Source: LinkedIn