Lead, Machine Learning Operations Engineer
Role Overview
You will serve as the operational backbone for our critical Machine Learning services within the Transformation and Data Group. Your mission is to mitigate operational risks and stabilize the production environment, ensuring our core models remain robust and reliable despite upstream data dependencies.
You Will Be Responsible For The Following
- Own the end-to-end, continuous monitoring and upkeep of production machine learning models, tracking performance and drift. Diagnose and debug issues across model deployment, runtime performance, data pipelines, and infrastructure. Build the automations and platforms required to scale model monitoring.
- Ensure robust, high-availability ML model serving (e.g., data pipelines, model APIs, GenAI apps) by designing, building, and maintaining scalable and observable ML Operations pipelines.
- Accelerate model deployment velocity by collaborating cross-functionally to implement and optimize automated CI/CD pipelines and streamline the end-to-end process.
- Proactively reduce incident volume and maintain target SLAs by establishing comprehensive model performance, data quality, and pipeline uptime monitoring and alerting systems.
- Drive cost efficiency and performance improvements by optimizing model serving infrastructure using cloud platforms and containerization (AWS, Docker, Kubernetes).
- Establish institutional knowledge and compliance by creating and maintaining clear, external-friendly documentation of MLOps processes.
Requirements
- Bachelor’s degree in Computer Science, Data Science, or a related field
- Minimum of 2 years of hands-on experience in a production environment covering MLOps, DevOps, Data Engineering, or Software Engineering
- Proven expertise in ML Operations (MLOps), specifically model deployment, proactive monitoring, and performance tuning
- Strong proficiency in containerization and orchestration, specifically Docker and Kubernetes
- Experience utilizing cloud platforms (e.g., AWS Cloud) to host and optimize model serving infrastructure
- Proficiency in core programming languages, especially Python, and exposure to Go or similar compiled languages
- Experience designing and implementing CI/CD pipelines for machine learning models
- Demonstrated capability to meet and exceed stringent Service Level Agreements (SLAs), particularly those related to model uptime and incident resolution
- Proficiency in software development methodologies (e.g., Agile/Scrum) and analytics practices for platform feature delivery
- Proficiency in React.js for frontend development and Go/Python for backend development