Canva Smartrecruiters · Posted today

Senior Machine Learning Engineer - Research Optimisation

Australia Full-time Remote

Information Technology Computer Software Smartrecruiters

Company Description

Join the team redefining how the world experiences design.

Hey, g'day, kia ora, 你好, hallo, vítejte!

Thanks for stopping by. We know job hunting can be a little time-consuming and you're probably keen to find out what's on offer, so we'll get straight to the point.

Where and how you can work

Our flagship Sydney campus is uniquely Canva - an extension of our Surry Hills neighbourhood. It’s a thoughtfully designed space with plenty of room to collaborate, focus, and connect.

This role is based in Sydney, and we’re looking for someone who calls it home. Our hybrid way of working gives you the flexibility to work remotely, and to come together on campus for meaningful in-person collaboration and connection when it matters most. We trust our Canvanauts to choose the balance that empowers them and their team to achieve their goals.

Job Description

At Canva, our mission is to empower the world to design. To get cutting-edge research into the hands of millions of users faster, we're looking for a Machine Learning Engineer focused on research enablement and performance, turning promising experiments into stable, scalable, user-facing capabilities while making training and inference faster, cheaper, and more reliable.

About the role:

You'll be the bridge between research and production. Partnering closely with researchers, you'll ensure experimental code is production ready, integrate models into our monorepo, build shared libraries and services, and create the tooling and processes that let multiple model variants ship safely and quickly. You'll also work across the training stack, profiling and tuning PyTorch workloads, improving GPU utilisation, and shaping how we use distributed training and storage to get the most out of our compute. Your work shortens the research-to-user loop, reduces duplication, and ensures our ML features are reliable, observable, and easy for other teams to adopt.

At the moment, this role is focused on:

  • Research-to-Production Pipeline: Hardening experimental models (containerisation, tests, CI/CD), making them deployable for real users.

  • Training Performance and GPU Efficiency: Profiling PyTorch training jobs, improving GPU utilisation, and applying techniques like mixed precision, efficient data loading, and distributed training strategies (FSDP, DDP, DeepSpeed) to reduce time and cost per experiment.

  • Library development: Converting experiments into well-factored libraries with clear APIs, dependency hygiene, and versioning, so teams can import rather than copy-paste.

  • Developer Experience & Documentation: Creating templates, examples, and guidance; offering supportive, high-signal communication so others can adopt libraries confidently.

  • Reliability, Observability & Cost: Instrumenting services with metrics/logging/tracing, setting SLIs/SLOs, and optimising training and inference performance and spend.

Primary Responsibilities:

  • Productionise research models: refactor, test, containerise, and integrate them into the monorepo for scalable reuse.

  • Profile and optimise PyTorch training jobs, working with researchers to identify bottlenecks across compute, memory, I/O, and networking.

  • Improve distributed training setups (multi-GPU, multi-node) and help teams pick the right parallelism strategy for their workload.

  • Build and maintain inference services, SDKs, and shared libraries that standardise pre/post-processing and interfaces across variants.

  • Create multi-variant runners and rollout frameworks (feature flags, canaries, A/B testing, automated rollbacks).

  • Establish CI/CD workflows, artifact management, and reproducible builds for ML services and model assets.

  • Add robust observability (dashboards, alerts) and reliability practices (load tests, chaos/resiliency checks) across training and inference workloads.

  • Optimise inference (batching, caching, quantisation/compilation, hardware utilisation) to reduce latency and cost.

  • Work across the broader training stack, including Kubernetes orchestration, storage (e.g. Weka, Vast, Lustre), and data pipelines, to remove friction for research teams.

  • Partner with researchers and product engineers via code reviews, pair sessions, and clear documentation to accelerate adoption.

  • Drive good engineering hygiene in the research codebase: testing strategy, dependency management, and de-duplication across multiple model variants.

You're probably a match if you:

  • Have strong software engineering fundamentals and excellent Python skills; you're comfortable turning notebooks and prototypes into production-grade services.

  • Have shipped ML systems in production (containers, APIs, CI/CD), ideally within a monorepo environment.

  • Have hands-on experience optimising PyTorch training or inference, profiling workloads, and reasoning about GPU memory, compute, and throughput.

  • Are comfortable in containerised environments and understand Kubernetes concepts well enough to debug and improve ML workloads running on it.

  • Can read research code and refactor it into clean abstractions with stable, well-documented interfaces.

  • Understand service reliability and observability (metrics, tracing, logging) and how they apply to ML systems.

  • Think holistically about the stack, from storage and networking through to model code, and can hold a credible conversation with researchers, DevOps, and platform engineers alike.

  • Communicate clearly and empathetically, especially when guiding others to adopt libraries and best practices and mentoring engineers earlier in their ML journey.

  • Bring cloud experience (AWS a plus) without needing to be a deep specialist.

Nice to Have:

  • Familiarity with model-serving/optimisation tooling (e.g., ONNX, TorchScript, Triton, quantisation).

  • Experience writing or optimising CUDA kernels, or using compilation frameworks (torch.compile, Triton, TensorRT) to speed up models.

  • Experience with distributed training frameworks (FSDP, DDP, DeepSpeed, Megatron) at meaningful scale.

  • Familiarity with high-performance storage systems (Weka, Vast, Lustre) and the data loading patterns that make or break training throughput.

  • Experience with experimentation platforms (feature flags, A/B testing) and safe rollout strategies.

  • Background with multimodal/image generation stacks or LLM-adjacent tooling (not the core focus, but helpful).

  • Knowledge of MLOps practices (model registries, artifact stores, dependency/version management).

Impact you'll have:

You'll dramatically reduce the time it takes to move from a successful experiment to a reliable, observable feature in production. You'll eliminate copy-paste, unify interfaces, enable parallel variants, and build the shared foundations that let Canva ship ML innovation at scale. You'll also help our research teams get more out of every GPU hour, making training faster and inference cheaper as we scale up the work CORE is doing.

Additional Information

What's in it for you?

Achieving our crazy big goals motivates us to work hard - and we do - but you'll experience lots of moments of magic, connectivity and fun woven throughout life at Canva, too. We also offer a range of benefits to set you up for every success in and outside of work.

Here's a taste of what's on offer:

  • Equity packages - we want our success to be yours too
  • Inclusive parental leave policy that supports all parents & carers
  • An annual Vibe & Thrive allowance to support your wellbeing, social connection, office setup & more
  • Flexible leave options that empower you to be a force for good and take time to recharge, and that support you personally

Check out lifeatcanva.com for more info.

Other stuff to know

We make hiring decisions based on your experience, skills and passion, as well as how you can enhance Canva and our culture. We see AI as a powerful amplifier of creativity and technology at Canva. We’re evolving how we assess AI skills in our Technology hiring experience - you’ll tackle interactive, real-time challenges that reflect the kind of work we do. In some interviews, you may also be asked to solve a problem using an AI tool to show how you approach challenges with tech by your side.

Please note that interviews are conducted virtually.
