Software Development Engineer, ML Systems Integration, Machine Learning Israel (MLIL) — Integration Validation
The Integration team is looking for a Senior Software Development Engineer to lead the design and delivery of systems software for our next-generation ML accelerator servers. In this role you will own the design and implementation of CI/CD pipelines, test frameworks, and system-level validation for our ML inference accelerator platform. You will work across the full stack, from firmware interfaces through data-plane performance benchmarking to production fleet readiness, ensuring every component is validated end-to-end before it reaches customers.
This is a greenfield environment with rapidly growing scope: new silicon, new software stacks (vLLM, NKI, NIXL), and new fleet-scale challenges. We are looking for a senior IC who can independently drive technical decisions, scale our validation infrastructure, and raise the bar on engineering quality across the group.
Key job responsibilities
- Own and evolve CI/CD pipelines — from pre-merge gates through continuous deployment to fleet.
- Design and implement test frameworks that enable firmware and data-plane developers to write, run, and maintain tests with minimal friction.
- Architect system-level test suites that stress control-plane and data-plane components beyond provisioning and vetting flows.
- Build and maintain performance benchmarking infrastructure for LLM inference workloads (Prefill + Decode), including dashboarding and regression detection.
- Drive integration of third-party vendor code (nightly drops) into CI/CD, ensuring quality gates catch regressions early.
- Participate in feature design reviews, contributing test plans and challenging coverage gaps.
- Define and own Continuous Testing in production environments (CTS).
- Leverage AI-assisted development tools (Kiro, LLM-based code generation) to accelerate team velocity and pioneer new engineering workflows.
You'll start your day reviewing CI pipeline results from overnight runs, triaging failures to determine whether a regression came from a vendor code drop, a firmware change, or an ML serving stack update. Mid-morning you might pair with a hardware engineer to design test cases for a new bus-level reset flow, then pivot to extending the performance benchmarking framework to catch a latency regression. After lunch you'll join a feature design review — challenging test coverage gaps and deciding where system-level validation needs to live. The rest of your afternoon could be spent writing a new pipeline stage that gates deployment on accuracy checks, or building a dashboard that gives the group visibility into fleet-readiness metrics. Throughout the day you'll lean on AI-assisted development tools to accelerate everything from infrastructure code to root-cause analysis.
Basic Qualifications
- Experience as a mentor or tech lead, or leading an engineering team
- Experience leading the architecture and design (architecture, design patterns, reliability and scaling) of new and current systems
- Knowledge of Python and/or C++ programming
- Experience programming with at least one modern language such as Java, C++, or C# including object-oriented design
- Experience building test automation frameworks and tools
- Proven experience designing and operating CI/CD systems at scale (any platform — Jenkins, GitHub Actions, internal equivalents).
- Demonstrated early adopter of AI-assisted development tools — uses LLMs, code-generation agents, or similar tools as a core part of daily workflow.
- Strong Linux systems knowledge.
- Bachelor's degree in computer science or equivalent
Preferred Qualifications
- Experience with AWS Services including EC2, Lambda, S3, DynamoDB, SQS
- Experience with hardware/software integration and real-time systems
- Familiarity with ML inference serving stacks (vLLM, TensorRT-LLM, Triton, or similar).
- Knowledge of Amazon internal tooling (Brazil, Pipelines, Apollo, ToD).
- Experience with performance benchmarking and profiling of GPU/accelerator workloads.
- Track record of leading technical initiatives across multiple teams.
- Experience with fleet-scale operations — monitoring, dashboarding, incident response.
Company - Annapurna Labs Ltd.
Job ID: A10425227