Machine Learning Engineer, Model Evaluations (Speech LLM) - San Francisco

Job Description - Plaud Inc.

Speech Evaluation Engineer (Speech LLM)

Company: Plaud Inc.

Location: San Francisco, CA

Type: Hybrid (on-site at least 3 days per week)

Machine Learning & Artificial Intelligence

Job Overview

Plaud is seeking a candidate to turn ambiguous concepts like voice naturalness and cadence into clear, automated metrics. You will partner with ML researchers to define benchmarks for Speech LLMs, build scalable data pipelines, and own dashboards that track model health and performance.

Key Responsibilities

  • Define and automate metrics for subjective concepts such as naturalness, expressiveness, and conversational cadence.
  • Build reliable distributed systems and data pipelines that run at scale against live model checkpoints.
  • Partner with ML researchers to translate Speech LLM capabilities (e.g., ASR robustness, TTS emotional steerability) into measurable benchmarks.
  • Develop and own dashboards to track model health during training, improve signal-to-noise ratios, and reduce evaluation latency.
  • Debug anomalous mid-training results to identify root causes (architecture, data, or infrastructure).
  • Communicate complex statistical results and model behaviors to technical and non-technical stakeholders.

Required Qualifications

  • Engineering Skills: Strong software engineering skills, particularly in Python, with experience in distributed systems and evaluation harnesses.
  • ML Collaboration: Ability to partner closely with researchers to define "good" performance for AI models.
  • Observability: Experience building trusted tracking dashboards (e.g., Weights & Biases, MLflow).
  • Communication: Ability to clearly articulate complex statistical results.

Preferred Qualifications

  • Speech Metrics: Familiarity with WER, CER, PESQ, and automated MOS scoring frameworks.
  • LLM-as-a-Judge: Experience using frontier or fine-tuned multi-modal LLMs to evaluate conversational logic, transcription accuracy, and audio quality.
  • Human Evaluation: Background in managing large-scale crowdsourcing for RLHF/DPO efforts.
  • Adversarial Datasets: Experience curating datasets to test edge cases (heavy accents, overlapping speech, noisy environments).

Compensation & Benefits

  • Salary: $180,000 - $270,000 base salary + performance bonus + equity.
  • Healthcare: Top-tier healthcare (employee + dependents) including dental and vision.
  • Retirement: 401(k) with company matching.
  • Time Off: Unlimited PTO plus 13 paid holidays.
  • Parental Leave: 12 weeks of paid leave for all new parents.
  • Equipment: Choice of top-of-the-line laptops/workstations.
  • Perks: Annual offsites and a fully stocked office.