Stealth SU Linkedin · Posted 16d ago

Staff Site Reliability Engineer (SRE)

Herzliya, Tel Aviv, Israel

About Us

We are a fast-growing startup building a next-gen platform for the AI era.

The Role

We are looking for an exceptional Site Reliability Engineer to establish and lead the SRE discipline within our organization. This is a unique opportunity to define what reliability means at our scale — building the practices, standards, and tooling from the ground up, with high visibility and impact from day one.

We run a lean SRE team with high expectations — and we back that up with an AI-first approach to tooling and operations.

If you're the kind of engineer who thrives on ownership, thinks in systems, builds their own tools, and wants to leave a real mark on a company's technical foundation — this role is for you.

What You'll Do

  • Develop deep product knowledge across our platform — understanding its internals, failure modes, and operational behavior well enough to own incident resolution end-to-end.
  • Define and track SLOs/SLIs across critical platform services, and use error budgets to drive engineering decisions.
  • Own live site reliability — including on-call rotations, incident response, and post-mortems — with a focus on minimizing MTTR and preventing recurrence through systemic fixes, not just firefighting.
  • Lead capacity planning, performance analysis, and proactive risk identification across our multi-cloud environments.
  • Work hand-in-hand with engineering teams across the stack — infrastructure, application, and business layers — to embed reliability requirements everywhere.
  • Lay the groundwork for a future SRE team — designing processes and tooling that scale beyond a single person.

What We're Looking For

  • 6+ years of experience as an SRE in a high-scale production environment, with hands-on ownership across the full stack — infrastructure and application layers; business-level reliability experience is a bonus.
  • Strong AWS expertise is a must; GCP experience is a significant advantage.
  • Strong coding skills and a software engineering mindset: you build your own tools rather than waiting for someone else to build them.
  • Familiarity with infrastructure-as-code and container orchestration — enough to collaborate effectively with the teams that own them.
  • Rust experience is a strong bonus, since the majority of our codebase is written in Rust.
  • Experience building with AI — working with LLMs, designing agents, or integrating AI into operational tooling — is a strong bonus.
  • A true owner — you take end-to-end accountability for what you build and operate, and you don't wait to be asked.
  • A "can-do" partner to engineering and product — your default is to find a way, not to say no. You raise concerns early and constructively, but you're known for unblocking teams, not gatekeeping them.

Why Join Us

  • Be the person who defines what reliability means for a cutting-edge AI big-data platform.
  • Work alongside a world-class engineering team on genuinely hard problems across the full stack.
  • High ownership, real impact, and room to grow — from day one.
  • Based in Israel, with global reach and a multi-cloud environment at massive scale.