AI Trace Generation Engineer
To support our growing team, we are looking for an experienced AI Trace Generation Engineer to join us. In this role, you will take both a strategic and hands-on approach to designing and building systems that enable deep visibility into distributed AI workloads. This includes developing trace collection, instrumentation, and simulation capabilities that help optimize performance across large-scale, multi-GPU environments. You will work at the intersection of machine learning and systems engineering, contributing to the core infrastructure powering next-generation AI workloads.
Your mission
- Design and implement a trace collection system for distributed LLM workloads, capturing compute operations, communication primitives, memory usage, and cluster topology across multi-GPU and multi-node setups
- Validate that collected traces accurately reflect real workload behavior by verifying operation completeness, timing consistency, and data integrity across inference and training pipelines
- Integrate with and instrument major LLM frameworks (vLLM, TensorRT-LLM, DeepSpeed, Megatron-LM, and others) to extract meaningful execution data without degrading performance
- Use collected traces as input to discrete event simulations that model and replay distributed AI workload behavior at scale
- Analyze trace data to surface bottlenecks and inefficiencies across the stack, from individual kernel execution to cluster-wide communication patterns
Your profile
- 3+ years of experience in AI systems, ML infrastructure, or a closely related area
- Hands-on experience with at least one major LLM serving or training framework
- Strong proficiency in Python and C++, with a solid understanding of GPU architecture, memory bandwidth, and the difference between compute-bound and memory-bound operations
- Solid understanding of distributed communication
- Familiarity with parallelism strategies and how they shape execution behavior across large clusters
- Open source contributions or published research in relevant areas
- Experience in startup environments, with the ability to move quickly, navigate ambiguity, and take ownership
What we offer
- Build something big: Help build and scale a fast-growing AI infrastructure startup
- Pay & perks: Competitive compensation with a performance-based incentive, subsidized Deutschlandticket, and access to a discount portal
- Work your way: Flexible hours with hybrid and remote-friendly options
- Fast lanes, no red tape: Flat hierarchies and rapid decision-making mean ideas ship quickly
- Global team: Work with a diverse, international team across Germany and the USA
- Modern headquarters: Well-stocked office near the Heidelberg Hauptbahnhof, available on a hybrid basis or as a place to connect during our quarterly team workshops
- Top setup: Your choice of high-quality hardware and equipment
- Relocation support: We’ll help make your move to join us as smooth as possible