Senior Backend Engineer — Customer Support Platform

The Senior Backend Engineer owns the services that keep Strata's customers unblocked — the support, exception-handling, and remediation backend that sits behind every customer-facing interaction with the Strata platform. When a tenant's purchase order fails validation, a document lands in the Expert-in-the-Loop queue, or a customer asks "what happened to my order?", the answer comes from the systems you build.

This is not a generic CRUD role. Strata is an agentic operating layer that ingests business documents (POs, ACKs, Invoices, Quotes), extracts and validates them, and delivers clean data downstream to OrderBahn and ERP systems. You will build the support and operations backend around that pipeline: the exception/HITL queue services, the customer-facing status and audit APIs, the reprocessing and replay tooling support engineers use to remediate stuck documents, the ticketing/CRM integrations, and the per-tenant configuration services. Your work is measured against hard reliability and data-quality bars — ≥99.9% availability, ≤0.5 P1 incidents/week, MTTR P1 ≤30 min, and ≥99.5% field-level data accuracy — so you build for resilience, observability, and graceful failure from day one.

You will work in AvantoDev's standard backend stack (NestJS/TypeScript and FastAPI/Python, PostgreSQL, AWS, SQS), integrate with the agent layer through MCP servers, and collaborate closely with the SRE team, the Context Engineering team, Customer Success, and the Head PM.


What You'll Build

  • Backend for the Expert-in-the-Loop queue (Strata's human-in-the-loop, or HITL, queue) — APIs that surface low-confidence documents, capture support/expert decisions, and resume the paused agent workflow via the SQS-backed control plane.
  • Reprocessing & replay tooling — services that let support safely re-run a document through the pipeline (full or targeted re-extraction), with idempotency and audit guarantees.
  • Exception triage APIs — classification, assignment, SLA tracking, and auto-resolution hooks (target: ≥70% auto-resolution, ≤2% exception rate).

Customer-Facing Status & Audit APIs

  • Document lifecycle / status APIs backed by the OpenSearch state machine (FORMAT_DETECTED → PRIMARY_EXTRACTED → DOC_CLASSIFIED → SCHEMA_MATCHED → RECOVERY_EVALUATED → routing), exposing where any document is and why.
  • Audit-trail APIs — full, per-tenant history of every decision, confidence score, and routing action for support investigation and customer transparency.

Integrations & Tenant Configuration

  • Ticketing / CRM integrations (e.g., support desk, customer comms) wired to pipeline events so issues are created, updated, and resolved automatically.
  • Per-tenant configuration services — schema/alias overrides, tolerance rules, routing thresholds, and notification preferences, exposed through governed APIs (not ad-hoc DB edits).
  • Delivery/bridge services between Strata and downstream systems (OrderBahn, ERP) with reconciliation and retry semantics.

MCP & Agent Integration

  • Build and consume MCP servers (FastAPI-based) so support tooling and agents invoke the same governed capabilities (validation, lookup, reprocessing) rather than duplicating logic.


What You'll Do Day-to-Day

  • Design and implement scalable APIs in NestJS/TypeScript and/or FastAPI/Python using Domain-Driven Design (DDD), with robust validation, auth, error handling, and OpenAPI docs.
  • Implement event-driven workflows over SQS (Standard + FIFO) with DLQ patterns, exponential backoff, and idempotent processing.
  • Model and optimize PostgreSQL schemas (Aurora) with migrations, indexing, and strict tenant isolation / row-level security.

Reliability & Operability

  • Build every service to be observable by default — structured logs, metrics, and traces with X-Correlation-ID / X-Trace-ID propagation (100% coverage is an org KPI).
  • Implement health checks, circuit breakers, timeouts, retries, and graceful degradation so a downstream agent or OCR engine failure never takes down support tooling.
  • Write runbooks for the services you own and participate in the on-call rotation alongside SRE.

Quality & Security

  • Maintain strong test coverage (pytest / Jest, integration tests, moto/localstack, SuperTest, e2e tests) and contribute to CI/CD via CodePipeline.
  • Enforce security bars: zero critical/high vulnerabilities, per-tenant rate limiting, OAuth2 or equivalent auth on 100% of endpoints, and ≥95% audit-log completeness toward SOC 2 readiness.

Collaboration

  • Partner with SRE on SLOs, dashboards, and incident response; with Context Engineering on MCP/agent contracts; and with Customer Success on what support actually needs.


Minimum Qualifications

  • 6+ years of backend engineering in production, shipping and operating real services (not just prototypes).
  • Strong in at least one of Node.js/TypeScript (NestJS or equivalent) and Python (FastAPI), and comfortable in both, with solid REST API design, validation, auth, and clean error handling.
  • Deep PostgreSQL — schema design, migrations, query optimization, indexing, and multi-tenant isolation / row-level security.
  • Event-driven & async patterns — message queues (SQS, Kafka or equivalent), DLQs, retries, idempotency, and designing for partial failure.
  • AWS proficiency — Lambda, ECS/Fargate, S3, SQS, API Gateway, RDS/Aurora. You can deploy and operate what you build.
  • Reliability mindset — you design for SLOs, instrument for observability (structured logs/metrics/traces, correlation IDs), and have carried a pager.
  • Testing discipline — unit + integration + e2e testing (pytest/Jest, moto/localstack, SuperTest), and CI/CD experience.
  • Security awareness — authn/authz, rate limiting, input validation, secrets management, and audit logging.
  • English proficiency: B2+ required (C1 preferred). You'll write docs/runbooks, join architecture reviews, and coordinate during incidents.


Nice to Have

  • Experience building support / operations tooling — ticketing integrations, exception queues, reprocessing/replay, admin consoles.
  • Familiarity with the Model Context Protocol (MCP) and exposing services as agent-callable tools.
  • Exposure to agentic / LLM pipelines and HITL (Human-in-the-Loop) patterns (SQS-backed pause/resume).
  • OpenSearch / Elasticsearch for state tracking and operational queries.
  • Experience with ERP / order-management integrations (OrderBahn, NetSuite, or similar) and reconciliation.
  • Familiarity with DORA metrics and a high-deployment-frequency, low-change-failure delivery culture.
  • Background in commercial furniture, logistics, distribution, or manufacturing operations.
  • Terraform / IaC familiarity for owning your service infrastructure.