Machine Learning Engineer
Hi,
Good day!
If you are interested in the job role below, please reply with your updated resume and contact details.
Role: ML Engineer
Hybrid: 3 Days onsite – NYC, NY
Type: Contract
Duration: Long Term
Candidates whose primary background is MLOps platform work (DAG orchestration, Terraform, Kubernetes administration, generic CI/CD pipelines) will not be a fit. We need a senior-level engineer who can profile a transformer, rewrite its serving path for a 2–3x latency reduction, tune an HNSW index, and tell us which SageMaker instance type will hit our p95 target at the lowest cost.
Roles & Responsibilities
- Design, build, and scale ML-powered inference systems that process large volumes of text, image, and video data to power news-based intelligence products.
- Productionize and optimize state-of-the-art models and inference pipelines. These models include, but are not limited to:
  - DistilBERT for Named Entity Recognition (NER) over hundreds of thousands of search queries/day
  - TransNetV2 for video shot boundary detection at scale, for both archival and real-time video
  - SBERT for embedding generation from textual descriptions
  - External multimodal APIs for image/video captioning
- Support hybrid search architectures by defining embedding/re-ranking interfaces, evaluation metrics, and inference performance requirements; partner with search/platform engineers on index configuration, sharding, and cluster tuning.
- Design and implement scalable data processing pipelines across hybrid CPU/GPU environments to handle millions of media assets.
- Partner with MLOps and platform engineering to deploy and operate ML systems reliably, contributing to:
  - Distributed inference architectures
  - Cloud-based execution (e.g., AWS EC2, Batch, Lambda, SageMaker)
  - Efficient resource utilization across workloads
- Optimize inference latency and throughput across distributed workloads using cloud-based resources (AWS EC2, Batch, Lambda, SageMaker, etc.)
- Build resilient asynchronous processing systems for large-scale workloads, ensuring:
  - Reliability (retries, fault tolerance)
  - Efficiency (caching, deduplication)
  - Observability (metrics, logging, traceability)
- Work closely with data scientists and product teams to iterate on models, improve performance, and deliver measurable impact in production.
Requirements:
- 8+ years of experience building ML inference systems.
- Demonstrated ownership of deep-learning inference optimization in production (quantization, distillation, compilation, kernel/profile-level performance work) for transformer NLP and/or CV models.
- Experience with TensorFlow (SavedModel, tf.data, XLA, TFLite) and PyTorch (TorchScript, ONNX, FastAPI/TorchServe).
- Hands-on experience optimizing inference pipelines on AWS infrastructure, across different types of media assets.
- Experience with video frameworks/tools (e.g., FFmpeg) and with large-scale frame-level inference.
- Demonstrated experience monitoring and debugging model latency, memory, and pipeline throughput.
- Experience with hybrid search architectures (BM25 + vector search + cross-encoder reranking).
- Familiarity with OpenAI APIs or other foundation model providers.
- Familiarity with open-source Hugging Face LLMs.
- Experience with data pipeline and workflow orchestration tools (e.g., Airflow).