Dexian Linkedin · Posted 15d ago

AI/ML Observability Engineer #1004227

Coppell, Texas, United States

Job Title: AI/ML Observability Engineer

Location: Tampa, FL / Dallas, TX (Hybrid)

Employment Type: Contract-to-Hire


Role Overview

We are seeking a hands-on AI/ML Observability Engineer to design and build intelligent monitoring solutions that enhance system reliability and performance.

This role focuses heavily on observability engineering and creating anomaly detection models from scratch. You will develop AI/ML-driven capabilities to detect, diagnose, and prevent issues across distributed systems, enabling proactive and automated operations.

You will work at the intersection of machine learning, observability platforms, and automation, helping transform traditional monitoring into intelligent, self-improving systems.


Key Responsibilities

  • Design, build, and deploy custom anomaly detection models from the ground up using telemetry data (logs, metrics, traces)
  • Develop baselining, event correlation, and predictive analytics models to identify abnormal system behavior
  • Enhance enterprise observability platforms by integrating AI/ML-driven insights and intelligent alerting
  • Build solutions that enable early detection of issues and proactive system resiliency
  • Implement OpenTelemetry-based pipelines for collecting and analyzing telemetry across distributed systems
  • Create real-time and batch data pipelines to support ML-driven observability use cases
  • Develop AI-powered alerting and root cause analysis (RCA) capabilities
  • Build services/APIs in Python for model inference and operational integration
  • Partner with SRE, platform, and engineering teams to improve monitoring, diagnostics, and incident response
  • Contribute to observability best practices including SLOs, SLIs, and Golden Signals
  • Drive automation and intelligent workflows to improve incident detection and resolution times


Required Qualifications

Core AI/ML & Engineering

  • Strong experience with Python and ML libraries (NumPy, Pandas, scikit-learn, TensorFlow or PyTorch)
  • Proven experience building anomaly detection models from scratch (time series, statistical, or ML-based approaches)
  • Solid understanding of statistics, time series analysis, and pattern recognition
  • Experience deploying ML models in production environments (real-time and batch)

Observability & Telemetry (HIGHLY IMPORTANT)

  • Strong hands-on experience with observability concepts, including:
      • Metrics, logs, traces, spans
      • Baselining and anomaly detection
      • Event correlation
      • SLOs / SLIs / Golden Signals
  • Experience with tools such as:
      • Grafana
      • Dynatrace
      • Datadog, Splunk, or similar platforms
  • Experience implementing or working with OpenTelemetry

Data & Platform Engineering

  • Experience building data pipelines for telemetry ingestion and processing
  • Familiarity with Snowflake, AWS, or similar cloud platforms
  • Experience working with distributed systems and microservices environments

Automation & Integration

  • Experience building APIs or services for ML model integration
  • Exposure to automation, CI/CD, and infrastructure workflows
  • Ability to integrate ML outputs into alerting and operational systems


Preferred Qualifications

  • Experience with AI-driven observability or AIOps platforms
  • Exposure to Generative AI / LLMs (RAG, prompt engineering, etc.)
  • Experience building:
      • Self-healing systems
      • Automated remediation workflows
      • AI-driven alerting and RCA solutions
  • Experience working with large-scale telemetry data
