Research Engineer (Evals)

TLDR: We're looking for a research engineer to build and maintain our internal suite of benchmarks, covering single/multi-turn content guardrails and agentic guardrails. You will also work with the team on projects studying agent behaviours in the wild.

About us

White Circle is an AI Safety company building the safety, reliability, and optimization layer for AI systems. At the core of our platform are policies – simple natural-language rules that define what an AI model should and shouldn’t do. We automatically test, enforce, and continuously improve these policies at scale.

We’ve raised $11M from top funds, founders, and senior leaders at OpenAI, Anthropic, HuggingFace, Mistral, DeepMind, Datadog, Sentry, and others
We process over 100M API calls every month
We fine-tune and train our own LLMs so they run faster and cheaper than any open or proprietary model

We’re a small, highly focused team. If you want to work deeply on hard problems, see your work ship to production quickly, and influence how AI safety is actually built – you’re the one we need.

About the team

White Circle's fundamental research team works on the science of how AI systems fail in the real world: where agents break, what misalignment actually looks like to the end user, and how it shows up in the model internals. We build the evals, benchmarks, environments, and tooling that empirically study high-impact agent reliability concerns. Some of this work becomes the guardrails shipped in our products, and some becomes public writeups.

You will:

Own and maintain our internal benchmark suite, covering single/multi-turn content guardrails and agentic safety.
Build benchmarks that distinguish specific model capabilities.
Work with the product team to build evals covering core functionality of our flagship models.
Build benchmarks for new features coming out of the research team.
Adapt and extend evals to new verticals and changing product data.
Work on research projects that study and quantify realistic agentic and LLM failure modes in the wild.

You’ll fit right in if you:

Have built an LLM benchmark from scratch that distinguished specific model capabilities (i.e., produced a measurable, defensible capability difference, not just a score).
Have built synthetic data for post-training textual or multimodal models.
Can reproduce a published benchmark result and identify where the original methodology is fragile or misleading.
Write Python that other people can build on. Our whole stack is Python; we want someone who has shipped and maintained production code and who factors messy problems into clean abstractions others can extend.
Can write efficient LLM inference setups, including sensible orchestration of parallel calls, retries, and rate-limit handling.
Are an AI power user, fluent with frontier models and coding agents day to day.

A big plus:

Automated red-teaming experience
Experience across a range of agentic scaffolds, including reproducing public benchmark results on them
Strong knowledge of existing reward-model / monitoring / safety benchmarks
One or more published papers in the evals / safety-evaluation space

Why White Circle

Paid time off in line with your local regulations, no matter where you work from.
Work from Paris (hybrid) with a relocation package available, or from London (note: for now, we are unable to provide relocation support or private medical insurance for London-based roles).
Comprehensive medical insurance for our France-based team
All the hardware, tools, and services you need
Covered subscriptions for AI agents and IDEs
Team off-sites twice a year: we’ve recently been to the Alps and to Saint-Tropez

How we hire

Introductory call with HR (25 min)
Take-home test task
Technical interview with Head of Fundamental Research (60 min)
Final conversation with our CEO (45 min)

Please submit your application in English.