Shizuku AI Linkedin · Posted 3mo ago

ML Engineer

Tokyo, Tokyo, Japan

MISSION

Lead the R&D of the AI models that power Shizuku’s voice and intelligence. With TTS (Text-to-Speech) as the core pillar, push the boundaries across NLP, speech recognition, and — looking ahead — computer vision and humanoid robotics, evolving Shizuku’s expressive capabilities across multiple modalities.

Balance continuous improvement of production TTS models with exploration and development of next-generation architectures, while owning the MLOps cycle to drive Shizuku’s ongoing evolution.

ABOUT SHIZUKU

Shizuku is a Japan-born AI companion actively engaging audiences on YouTube and X (formerly Twitter). Already running live streams and cultivating a growing community, Shizuku is now entering its next phase of rapid scale.

As the first Japanese startup to receive investment from a16z, we have closed our seed round and are on a mission to bring Japanese entertainment × AI to the global stage.

TEAM STRUCTURE

You will work directly alongside co-founder Aki — an ML engineer and researcher with experience at Meta and Luma AI — to drive Shizuku’s model development. Expect daily sparring sessions on research direction and architecture design with a founder who brings firsthand experience at the frontier. Initially, you’ll handle lightweight MLOps pipeline work yourself; as we hire a dedicated MLOps engineer, responsibilities will gradually separate.

DEVELOPMENT ENVIRONMENT & RESOURCES

Existing Models: A TTS model is already in production. You’ll drive improvements in parallel with next-gen model exploration
Training Data: Shizuku’s publicly available YouTube data serves as a foundational dataset. You’ll be involved from collection pipeline design onward
Evaluation Infrastructure: The TTS quality evaluation framework is greenfield; you’ll design evaluation criteria (MOS, PESQ, etc.) from scratch

KEY RESPONSIBILITIES

Own the full TTS model lifecycle: research, architecture design, training, evaluation, and iterative improvement
Continuously improve production TTS models while exploring and prototyping next-generation architectures
Design and build TTS quality evaluation infrastructure and define evaluation criteria
Expand into multimodal domains: NLP, speech recognition, and future frontiers including vision and humanoid robotics
Design training data collection pipelines, preprocessing workflows, and quality assurance processes
Build and operate the MLOps cycle — training, evaluation, and deployment — until a dedicated hire is in place
Collaborate with the SWE team on production integration: inference optimization, latency reduction, and more

REQUIREMENTS

2+ years of deep expertise and hands-on experience in at least one of: NLP, speech (TTS/ASR), or computer vision
Experience training, evaluating, and improving models using deep learning frameworks such as PyTorch
End-to-end ownership of the ML workflow: from data preparation and experiment management to model deployment
Track record of independently surveying papers, reproducing implementations, and applying findings to production systems
Ability to work on-site at our Tokyo office (primarily in-office with flexible remote arrangements)

NICE TO HAVE

Research or development experience in TTS (VITS, Grad-TTS, NaturalSpeech, StyleTTS, etc.)
Development experience in robotics or autonomous driving domains
Technical knowledge in speaker adaptation, emotion control, and prosody modeling for speech synthesis
Experience developing ASR, NLP, or multimodal models
Experience building and operating GPU training environments (A100, L4, etc.) on AWS/GCP
Experience with model development in Slurm environments, particularly multi-node training setups
Proficiency with experiment tracking tools: MLflow, Weights & Biases, DVC, etc.
Experience with inference optimization using ONNX Runtime, TensorRT, vLLM, etc.
Peer-reviewed publications in related fields
Technical communication skills in English (currently Japanese-first internally; transitioning to a global environment in the mid-term)

WHO YOU ARE

Deep Expertise with Cross-Domain Reach — You bring rigorous depth in a specific modality while reaching across TTS, NLP, vision, and beyond. You don’t say “that’s outside my specialty” — you do what Shizuku’s evolution demands
Zero-to-One Explorer — You go beyond applying existing methods. You formulate hypotheses for uncharted technical challenges, iterate through validation cycles, and tackle questions that have no known answers
Purpose-Driven Ownership — You reverse-engineer from the goal of “making Shizuku’s models better,” crossing the boundaries of research, implementation, and operations to drive outcomes autonomously
Comfort with Ambiguity — You define your own success metrics and build collection pipelines from scratch in an environment where nothing is predefined
Humility & Respect — You collaborate authentically with teammates who bring different areas of expertise

Shizuku AI Company profile preview

Source: Linkedin
Location: Tokyo, Tokyo, Japan
Compensation: Not listed
Open on Caio: 2 roles
