ALS International Linkedin · Posted 28d ago

Machine Learning Engineer - Video Generation

Shangcai, Henan, China



Key Duties & Responsibilities


  • Design, scale, and maintain distributed data pipelines for machine learning preprocessing, training dataset creation, and periodic dataset updates. This covers end-to-end workflow management, scheduling, real-time monitoring, and failure recovery to keep pipelines running without interruption.
  • Build and deploy containerized pipeline infrastructure on Kubernetes, and optimize cloud data storage and transfer (on AWS, GCS, or Azure) to balance cost, throughput, and operational efficiency.
  • Develop specialized curation pipelines for video and image datasets, incorporating capabilities such as VLM-driven caption generation, metadata extraction, quality and aesthetic evaluation, CLIP-based content filtering, and large-scale duplicate removal.
  • Analyze dataset structure and composition, identify quality issues, iteratively refine curation algorithms and logic, and establish clear benchmarks for high-quality, ML-ready video data.
  • Define and promote best practices for dataset management, including storage conventions, version control, caching strategies, retention policies, and access controls, to support team collaboration.


Required Qualifications & Skills

  • Proven hands-on experience designing and scaling large machine learning data systems, with a focus on dataset curation, quality improvement, and end-to-end pipeline optimization.
  • Expertise in distributed data processing frameworks (such as PySpark or Ray) and workflow orchestration tools (including Airflow or comparable alternatives) to manage complex data workflows.
  • Solid familiarity with containerization technologies (Docker), Kubernetes for infrastructure orchestration, cloud storage and computing services (AWS, GCS, Azure), and video/image processing tools (e.g., FFmpeg, OpenCV).
  • Demonstrated experience with VLM-based captioning systems, quality scoring models, CLIP-powered content filtering, or the curation of image-text paired datasets.
  • Exceptional proficiency in Python programming, strong analytical and problem-solving capabilities, and effective written and verbal communication skills, including the ability to create clear technical documentation.
