turbalance · LinkedIn · Posted 17d ago

AI Trace Generation Engineer

Heidelberg, Baden-Württemberg, Germany

Your mission

To support our growing team, we are looking for an experienced AI Trace Generation Engineer to join us. In this role, you will take both a strategic and hands-on approach to designing and building systems that enable deep visibility into distributed AI workloads. This includes developing trace collection, instrumentation, and simulation capabilities that help optimize performance across large-scale, multi-GPU environments. You will work at the intersection of machine learning and systems engineering, contributing to the core infrastructure powering next-generation AI workloads.

  • Design and implement a trace collection system for distributed LLM workloads, capturing compute operations, communication primitives, memory usage, and cluster topology across multi-GPU and multi-node setups
  • Validate that collected traces accurately reflect real workload behavior by verifying operation completeness, timing consistency, and data integrity across inference and training pipelines
  • Integrate with and instrument major LLM frameworks (vLLM, TensorRT-LLM, DeepSpeed, Megatron-LM and others) to extract meaningful execution data without disrupting performance
  • Use collected traces as input to discrete event simulations that model and replay distributed AI workload behavior at scale
  • Analyze trace data to surface bottlenecks and inefficiencies across the stack, from individual kernel execution to cluster-wide communication patterns

Your profile

  • 3+ years of experience in AI systems, ML infrastructure, or a closely related area
  • Hands-on experience with at least one major LLM serving or training framework
  • Strong proficiency in Python and C++, with a solid understanding of GPU architecture, memory bandwidth, and the difference between compute-bound and memory-bound operations
  • Solid understanding of distributed communication
  • Familiarity with parallelism strategies and how they shape execution behavior across large clusters

Nice to have

  • Open source contributions or published research in relevant areas
  • Experience in startup environments, with the ability to move quickly, navigate ambiguity, and take ownership

Why us?

  • Build something big: Help build and scale a fast-growing AI infrastructure startup
  • Pay & perks: Competitive compensation with a performance-based incentive, subsidized Deutschlandticket, and access to a discount portal
  • Work your way: Flexible hours with hybrid and remote-friendly options
  • Fast lanes, no red tape: Flat hierarchies and rapid decision-making mean ideas ship quickly
  • Global team: Work with a diverse, international team across Germany and the USA
  • Modern headquarters: Well-stocked office near the Heidelberg Hauptbahnhof, available on a hybrid basis or as a place to connect during our quarterly team workshops
  • Top setup: Your choice of high-quality hardware and equipment
  • Relocation support: We’ll help make your move to join us as smooth as possible

About Us

turbalance is an innovative, emerging startup that transforms AI laws. We are a team of passionate problem-solvers who believe in what we’re building. We constantly push boundaries and embrace our inner nerds as we find new ways to tackle complex challenges. You will find a dynamic work environment here, with flat or even non-existent hierarchies and the chance to take on responsibility from day one.
