Senior Software Engineer, Engine & Distributed Systems
About Stack AI
Stack AI is a no-code platform for designing, testing, and deploying AI workflows powered by large language models. Our visual, drag-and-drop interface lets teams connect their data to AI models and ship production applications — from chatbots to document-processing pipelines to database Q&A tools — without writing code.
The role
Enterprises run real work on AI agents, and at Stack AI that work runs on a single engine. Some agents finish in a second. Others run for days, fan out into dozens of sub-agents, pause, resume, and recover from failures without losing a step. We're hiring a Senior Software Engineer, Engine & Distributed Systems, to own that engine: the durable runtime at the core of the platform that has to be correct every time, at any scale.
This is deep systems work at the heart of the product. When the engine is solid, agents simply run — and getting it there is one of the more interesting distributed-systems problems in AI today. You'll own it end to end, from the execution model to how it behaves in production.
What you'll do
Own the execution engine. The runtime, scheduling, and sub-agent parallelization that run every agent on the platform.
Make long-running work durable. Build checkpointing, resumption, and recovery so agents survive failures and restarts and pick up exactly where they left off.
Shape the execution model. Decide how work is scheduled, queued, and moved from synchronous to asynchronous, so the platform stays correct and responsive as load grows.
Engineer for scale and reliability. Hold the engine to strict health targets for worker freshness, deploy safety, and drain time, and keep latency low and throughput high as volume grows.
Keep the engine open to the ecosystem. Make it straightforward to bring new agent harnesses, orchestration frameworks, and model capabilities into the runtime.
What we're looking for
5+ years building backend systems in production, with real depth in distributed systems.
Hands-on experience with durable execution or workflow orchestration (Temporal, Cadence, Airflow, or equivalent), with a way of thinking rooted in idempotency, state machines, and failure recovery.
Strong command of concurrency, queueing, retries, and fault tolerance under load.
Strong in Python and modern backend frameworks (FastAPI or similar), with sound database fundamentals (Postgres or similar).
You're drawn to the correctness problems that everything else quietly depends on.
Distributed systems is a broad field. If you're strong on most of this and excited to grow into the rest, we'd like to hear from you, even if you don't check every box.
Bonus points
Operating Temporal at scale.
Event-driven architectures and message queues.
Experience with PydanticAI, LangGraph, or similar agent frameworks.
AI or agent runtimes: tool-calling, sub-agent orchestration, streaming.
Performance and cost optimization of high-throughput backends.
Startup or growth-stage experience.
Why Stack AI
You'll join a lean, high-impact team and own the engine that every customer's agents run on. Your work ships fast and is felt across the whole product.
Stack AI is an equal opportunity employer. We celebrate diversity and are committed to creating an inclusive environment for all employees.