
Cerence Himalayas · Posted yesterday

Sr. Principal Software Engineer

USD 185000-280000 Full time Remote


Indexed description

A Moving Experience.

Who is Cerence AI?

Cerence AI is the global leader in AI for transportation, specializing in building AI and voice-powered companions for cars, two-wheelers, and more that enable people to focus on what matters most. With over 500 million cars shipped with Cerence AI's technology, we partner with leading automakers (such as Volkswagen, Mercedes, Audi, Toyota and many more), mobility providers, and technology companies to power intuitive, integrated experiences that create safer, more connected, and more enjoyable journeys for drivers and passengers alike.

Our Driving Force

Our team is dedicated to pushing the boundaries of AI innovation, working around the globe with headquarters in Burlington, Massachusetts, USA and 16 other offices across Europe, Asia, and North America. We bring together diverse backgrounds and varied skill sets with the shared goal of advancing the next generation of transportation user experiences. Our culture is customer-centric, collaborative, fast-paced, and fun, with continuous opportunities for learning and development to support your career growth.

Interested in having a significant impact in a dynamic industry with a high-performing global team? We’re looking for an exceptional Senior Principal Software Engineer who is ready to drive the future of mobility with us!

Job Description:

What You Will Work On

Optimize and deploy high‑performance LLM inference pipelines

Own inference runtimes across data center, edge, and embedded platforms

Push model performance through quantization, kernel fusion, and cache optimization

Drive latency and throughput improvements that directly impact production products

Enable efficient, reliable deployment without external vendor dependency

Core Responsibilities

Inference Engines & Runtime

Build deep expertise and ownership of:

vLLM

TensorRT‑LLM

llama.cpp

QAIRT

Extend and tune inference engines using custom CUDA kernels

Adapt runtimes for constrained and embedded deployment environments

Quantization & Numerical Optimisation

Implement and evaluate quantisation strategies:

INT8, INT4, FP4, FP8, mixed precision

AWQ

GPTQ

Balance accuracy, latency, memory footprint, and throughput

KV Cache Optimization

Optimize key–value cache performance through:

Paging

Prefix caching

Cache‑aware memory layout design

Reduce memory pressure while sustaining high throughput

Latency & Throughput Optimisation

Design and tune:

Batching strategies

Continuous batching

Speculative decoding

Optimize tail latency and tokens/sec under real production traffic patterns

What Success Looks Like

Models deploy efficiently on edge and embedded devices, not just servers

Tokens/sec significantly outperform baseline implementations

End‑to‑end latency is minimized and predictable

Inference cost per request is materially reduced

The company is no longer dependent on partners for inference optimization

Required Experience & Skills

Strongly Required

Proven experience optimizing ML inference performance in production

Deep understanding of GPU architecture and memory hierarchies

Hands‑on experience with CUDA and low‑level performance tuning

Experience deploying models beyond research environments

Critical Technical Skills

Inference engines: vLLM, TensorRT‑LLM, llama.cpp, QAIRT

CUDA kernel development and profiling

Quantisation techniques: INT8/INT4/FP4/FP8, AWQ, GPTQ

KV cache optimisation and memory layout design

Latency optimisation: batching, speculative decoding, continuous batching

Common Problems You’ll Be Solving

Deploy efficiently on edge or embedded targets

Achieve competitive tokens/sec

Reduce and stabilize inference latency

You will be responsible for closing these gaps, creating a major competitive advantage.

What we offer

We offer a generous compensation and benefits package (in addition to the base salary), including:

Salary range: $185,000.00 USD - $280,000.00 USD. It is not typical for offers to be made at or near the top of the range. The actual salary will be determined based on experience and other job-related factors.

Annual bonus opportunity

Insurance coverage (medical, dental, vision, life, and disability)

Paid time off

Paid holidays

Company contribution to the RRSP (Registered Retirement Savings Plan)

Equity awards for certain positions and levels

Remote and/or hybrid work available depending on the position

All compensation and benefits are subject to the terms and conditions of the underlying plans or programs, as applicable, and may be amended, terminated, or replaced from time to time.

Cerence Inc. (Nasdaq: CRNC and www.cerence.com) is the global industry leader in creating unique, moving experiences for the automotive world. Spun out from Nuance in October 2019, Cerence is a new, independent company that has quickly gained traction as a leader in the automotive voice assistant space, working with all of the world’s leading automakers – from Ford and Fiat Chrysler to Daimler, Audi and BMW to Geely and SAIC – to transform how a car feels, responds and learns. Its track record is built on more than 20 years of industry experience and leadership and more than 500 million cars on the road today across more than 70 languages.

As Cerence looks to the future and continues an ambitious growth agenda, we need someone to join the team and help build the future of voice and AI in cars. This is an exciting opportunity to join Cerence’s passionate, dedicated, global team and be a part of meaningful innovation in a rapidly growing industry.

EQUAL OPPORTUNITY EMPLOYER

Cerence is firmly committed to Equal Employment Opportunity (EEO) and to compliance with all federal, state and local laws that prohibit employment discrimination on the basis of age, race, color, gender, gender identity, gender expression, sex, sex stereotyping, pregnancy, national origin, ancestry, religion, physical or mental disability, medical condition, marital status, citizenship status, sexual orientation, protected military or veteran status, genetic information and other protected classifications. Cerence Equal Employment Opportunity Policy Statement.

All prospective and current Employees need to remain vigilant when it comes to executing security policies in the workplace. This includes:

- Following workplace security protocols and training programs to become familiar with the ways to maintain a safe workplace.
- Following security procedures to report any suspicious activity.
- Respecting corporate security procedures so that those procedures can be effective.
- Adhering to the company's compliance policies and regulations.
- Encouraging zero tolerance for workplace violence.
- Having basic knowledge of information security and data privacy requirements (e.g., how to protect and handle data).
- Demonstrating knowledge of information security through internal training programs.

Originally posted on Himalayas
