Senior GPU & LLM Infrastructure Engineer (NVIDIA, vLLM, OpenShift AI)
Our banking client is building a large-scale private GenAI environment and is seeking experienced engineers to support enterprise-grade on-prem inference platforms powered by NVIDIA H200 GPU clusters and OpenShift AI. This role focuses entirely on high-performance LLM inference and runtime optimization, not model training or fine-tuning.
What You’ll Do
- Optimize large-scale LLM inference performance across NVIDIA GPU environments.
- Drive runtime efficiency across token generation pipelines, including KV cache and prefill/decode optimization.
- Deploy and operate modern inference frameworks including vLLM and TensorRT-LLM.
- Manage GPU throughput, batching strategies, latency tuning, and workload orchestration using RunAI and Kubernetes.
- Oversee the full Hugging Face model lifecycle, including onboarding, deployment, versioning, and retirement.
- Operate and maintain OpenShift AI as the core enterprise GenAI platform.
- Support production-grade self-hosted open-source LLM environments, including Llama models.
Experience
- Strong background in AI infrastructure, GPU platforms, or LLM runtime engineering.
- Hands-on experience with NVIDIA H200 GPU clusters and large-scale inference optimization.
- Deep understanding of KV cache management, token serving pipelines, and inference latency optimization.
- Expertise with OpenShift AI, Kubernetes GPU orchestration, and RunAI.
- Strong experience with vLLM and TensorRT-LLM in production environments.
- Proven experience managing Hugging Face model deployment lifecycles.