ML Systems/Infrastructure Engineer

United Kingdom

London Office

Hybrid

Oriole is seeking a talented ML Systems/Infrastructure Engineer to help co-optimize our AI/ML software stack with cutting-edge network hardware. You’ll be a key contributor to a high-impact, agile team focused on integrating middleware communication libraries and modelling the performance of large-scale AI/ML workloads.

Key Responsibilities

Design and optimize custom GPU communication kernels to enhance performance and scalability across multi-node environments.
Develop and maintain distributed communication frameworks for large-scale deep learning models, ensuring efficient parallelization and optimal resource utilization.
Profile, benchmark, and debug GPU applications to identify and resolve bottlenecks in communication and computation pipelines.
Collaborate closely with hardware and software teams to integrate optimized kernels with Oriole’s next-generation network hardware and software stack.
Contribute to system-level architecture decisions for large-scale GPU clusters, with a focus on communication efficiency, fault tolerance, and novel architectures for advanced optical network infrastructure.

Required Skills & Experience

Proficient in C++ and Python, with a strong track record in high-performance computing or machine learning projects.
Expertise in GPU programming with CUDA, including deep knowledge of GPU memory hierarchies and kernel optimization.
Hands-on experience debugging GPU kernels using tools such as CUDA-GDB, CUDA-MEMCHECK, and Nsight Systems, and inspecting PTX and SASS.
Strong understanding of communication libraries and protocols, including NCCL, NVSHMEM, OpenMPI, UCX, or custom collective communication implementations.
Familiarity with HPC networking protocols and libraries such as RoCE, InfiniBand, libibverbs, and libfabric.
Experience with distributed deep learning/MoE frameworks, including PyTorch Distributed, vLLM, or DeepEP.
Solid understanding of deploying and optimizing large-scale distributed deep learning workloads in production environments, including familiarity with Linux, Kubernetes, SLURM, OpenMPI, GPU drivers, Docker, and CI/CD automation.

About Oriole Networks

Accelerating AI in a Low Carbon World – Oriole Networks is a photonic networking company, developing disruptive technologies for AI/ML and HPC networking that will revolutionise data centres.

Source: LinkedIn
Location: United Kingdom
Compensation: Not listed