Machine Learning Engineer - Video Generation
Key Duties & Responsibilities
- Design, scale, and maintain distributed data pipelines for machine learning preprocessing, training dataset creation, and periodic dataset updates. This covers end-to-end workflow management, scheduling, real-time monitoring, and failure recovery to keep pipelines running without interruption.
- Build and deploy containerized pipeline infrastructure on Kubernetes, and optimize cloud data storage and transfer (on AWS, GCS, or Azure) for cost, throughput, and operational efficiency.
- Develop specialized curation pipelines for video and image datasets, incorporating capabilities such as VLM-driven caption generation, metadata extraction, quality and aesthetic evaluation, CLIP-based content filtering, and large-scale duplicate removal.
- Analyze dataset structure and composition, identify quality issues, iterate on curation algorithms and logic, and establish clear benchmarks for high-quality, ML-ready video data.
- Define and promote best practices for dataset management, including storage, version control, caching, retention, and access policies, to support team collaboration.
Required Qualifications & Skills
- Proven practical experience designing and scaling large machine learning data systems, with a focus on dataset curation, quality improvement, and end-to-end pipeline optimization.
- Expertise in distributed data processing frameworks (such as PySpark or Ray) and workflow orchestration tools (including Airflow or comparable alternatives) to manage complex data workflows.
- Solid familiarity with containerization technologies (Docker), Kubernetes for infrastructure orchestration, cloud storage and computing services (AWS, GCS, Azure), and video/image processing tools (e.g., FFmpeg, OpenCV).
- Demonstrated experience with VLM-based captioning systems, quality scoring models, CLIP-powered content filtering, or the curation of image-text paired datasets.
- Exceptional proficiency in Python programming, strong analytical and problem-solving capabilities, and effective written and verbal communication skills, including the ability to create clear technical documentation.