Shizuku AI Linkedin · Posted 1mo ago

ML Engineer

Tokyo, Tokyo, Japan

MISSION

Lead the R&D of the AI models that power Shizuku’s voice and intelligence. With TTS (Text-to-Speech) as the core pillar, push the boundaries across NLP, speech recognition, and — looking ahead — computer vision and humanoid robotics, evolving Shizuku’s expressive capabilities across multiple modalities.

Balance continuous improvement of production TTS models with exploration and development of next-generation architectures, while owning the MLOps cycle to drive Shizuku’s ongoing evolution.

ABOUT SHIZUKU

Shizuku is a Japan-born AI companion actively engaging audiences on YouTube and X (formerly Twitter). Already running live streams and cultivating a growing community, Shizuku is now entering its next phase of rapid scale.

As the first Japanese startup to receive investment from a16z, we have closed our seed round and are on a mission to bring Japanese entertainment × AI to the global stage.

TEAM STRUCTURE

You will work directly alongside co-founder Aki — an ML engineer and researcher with experience at Meta and Luma AI — to drive Shizuku’s model development. Expect daily sparring sessions on research direction and architecture design with a founder who brings firsthand experience at the frontier. Initially, you’ll handle lightweight MLOps pipeline work yourself; as we hire a dedicated MLOps engineer, responsibilities will gradually separate.

DEVELOPMENT ENVIRONMENT & RESOURCES

  • Existing Models: A TTS model is already in production. You’ll drive improvements in parallel with next-gen model exploration
  • Training Data: Shizuku’s publicly available YouTube data serves as a foundational dataset. You’ll be involved from collection pipeline design onward
  • Evaluation Infrastructure: The TTS quality evaluation framework is greenfield; you’ll design evaluation criteria (MOS, PESQ, etc.) from scratch

KEY RESPONSIBILITIES

  • Own the full TTS model lifecycle: research, architecture design, training, evaluation, and iterative improvement
  • Continuously improve production TTS models while exploring and prototyping next-generation architectures
  • Design and build TTS quality evaluation infrastructure and define evaluation criteria
  • Expand into multimodal domains: NLP, speech recognition, and future frontiers including vision and humanoid robotics
  • Design training data collection pipelines, preprocessing workflows, and quality assurance processes
  • Build and operate the MLOps cycle — training, evaluation, and deployment — until a dedicated hire is in place
  • Collaborate with the SWE team on production integration: inference optimization, latency reduction, and more

REQUIREMENTS

  • 2+ years of deep expertise and hands-on experience in at least one of: NLP, speech (TTS/ASR), or computer vision
  • Experience training, evaluating, and improving models using deep learning frameworks such as PyTorch
  • End-to-end ownership of the ML workflow: from data preparation and experiment management to model deployment
  • Track record of independently surveying papers, reproducing implementations, and applying findings to production systems
  • Ability to work on-site at our Tokyo office (primarily in-office with flexible remote arrangements)

NICE TO HAVE

  • Research or development experience in TTS (VITS, Grad-TTS, NaturalSpeech, StyleTTS, etc.)
  • Development experience in robotics or autonomous driving domains
  • Technical knowledge in speaker adaptation, emotion control, and prosody modeling for speech synthesis
  • Experience developing ASR, NLP, or multimodal models
  • Experience building and operating GPU training environments (A100, L4, etc.) on AWS/GCP
  • Experience with model development in Slurm environments, particularly multi-node training setups
  • Proficiency with experiment tracking tools: MLflow, Weights & Biases, DVC, etc.
  • Experience with inference optimization using ONNX Runtime, TensorRT, vLLM, etc.
  • Peer-reviewed publications in related fields
  • Technical communication skills in English (currently Japanese-first internally; transitioning to a global environment in the mid-term)

WHO YOU ARE

  • Deep Expertise with Cross-Domain Reach — You bring rigorous depth in a specific modality while reaching across TTS, NLP, vision, and beyond. You don’t say “that’s outside my specialty” — you do what Shizuku’s evolution demands
  • Zero-to-One Explorer — You go beyond applying existing methods. You formulate hypotheses for uncharted technical challenges, iterate through validation cycles, and tackle questions that have no known answers
  • Purpose-Driven Ownership — You reverse-engineer from the goal of “making Shizuku’s models better,” crossing the boundaries of research, implementation, and operations to drive outcomes autonomously
  • Comfort with Ambiguity — You define your own success metrics and build collection pipelines from scratch in an environment where nothing is predefined
  • Humility & Respect — You collaborate authentically with teammates who bring different areas of expertise
