AI Reliability Engineer
You will work across evaluation pipelines, observability, cloud infrastructure, and CI/CD for Palona's AI agent platform. The role is equal parts DevOps and AI reliability: you will manage production infrastructure while building the tooling that keeps AI agents performing at a consistently high level of quality.
Responsibilities
As an AI Reliability Engineer, you will:
- Design and build observability systems that detect quality degradation, latency issues, and system anomalies in production, including instrumentation, dashboards, and alerting
- Write and maintain automated tests for agent output quality, including deterministic checks and LLM-as-judge evaluations
- Develop release and validation systems that automate deployments across environments and enforce quality gates for AI-powered products
- Build and evolve platform infrastructure across environments using infrastructure as code, with a focus on reliability, scalability, and cost efficiency
- Build and extend evaluation pipelines that assess AI agent conversation quality, accuracy, and safety, collaborating with product and engineering to evolve evaluation criteria
- Design and build internal tools and services that support AI reliability, evaluation, and operational workflows
- Architect new systems from scratch to address emerging reliability and quality challenges across the AI agent platform
- Write production-grade code for reliability and evaluation infrastructure, contributing as a software engineer, not just an operator
Requirements
- Minimum 2 years of professional software engineering experience
- Proficiency in Python
- Experience with cloud platforms (AWS, Azure, or GCP)
- Experience with monitoring and observability tools (Datadog, CloudWatch, Grafana, or similar)
- Familiarity with CI/CD pipelines and infrastructure as code
- Experience with APIs and distributed systems
- Willingness to learn new AI/LLM concepts, frameworks, and technologies
Nice to have
- Experience writing test frameworks or automated evaluation systems
- Experience building internal tools or developer platforms
- Exposure to LLMs, prompt engineering, or AI agent systems
- Startup experience, or ability to thrive in fast-paced environments
- Background in NLP, computer vision, or AI agent systems
Benefits
- Health Care Plan (Medical, Dental & Vision)
- Retirement Plan (401k, IRA)
- Life Insurance (Basic, Voluntary & AD&D)
- Paid Time Off (Vacation, Sick & Public Holidays)
- Family Leave (Maternity, Paternity)
- Short Term & Long Term Disability
- Training & Development
- Work From Home
- Free Food & Snacks
- Stock Option Plan