AI Trace Generation Engineer
- Design and implement a trace collection system for distributed LLM workloads, capturing compute operations, communication primitives, memory usage, and cluster topology across multi-GPU and multi-node setups
- Validate that collected traces accurately reflect real workload behavior by verifying operation completeness, timing consistency, and data integrity across inference and training pipelines
- Integrate with and instrument major LLM frameworks (vLLM, TensorRT-LLM, DeepSpeed, Megatron-LM, and others) to extract meaningful execution data without degrading performance
- Use collected traces as input to discrete event simulations that model and replay distributed AI workload behavior at scale
- Analyze trace data to surface bottlenecks and inefficiencies across the stack, from individual kernel execution to cluster-wide communication patterns
- 3+ years of experience in AI systems, ML infrastructure, or a closely related area
- Hands-on experience with at least one major LLM serving or training framework
- Strong proficiency in Python and C++, with a solid understanding of GPU architecture, memory bandwidth, and the difference between compute-bound and memory-bound operations
- Solid understanding of distributed communication
- Familiarity with parallelism strategies and how they shape execution behavior across large clusters
- Open-source contributions or published research in relevant areas are a strong plus
- Previous startup experience is a plus: we move fast and value people who are comfortable with that pace
- Build something big: Help build and scale a fast-growing AI infrastructure startup
- Pay & perks: Competitive compensation with a performance-based incentive, subsidized Deutschlandticket, and access to a discount portal
- Work your way: Flexible hours with hybrid and remote-friendly options
- Fast lanes, no red tape: Flat hierarchies and rapid decision-making mean ideas ship quickly
- Global team: Work with a diverse, international team across Germany and the USA
- Modern headquarters: Well-stocked office near the Heidelberg Hauptbahnhof, available on a hybrid basis or as a place to connect during our quarterly team workshops
- Top setup: Your choice of high-quality hardware and equipment
- Relocation support: We’ll help make your move to join us as smooth as possible