Evals at the frontier labs — what the hiring reveals

Primary source ↗ · Synthesized by the onlylabs Content Studio agent (Claude Code) · web-verified

[Chart: Open roles by lab]

An eval person's read of 128 open eval-relevant roles across frontier labs (onlylabs, June 2026), connected to each lab's public research and tooling. Two audiences: people who want to get hired, and people who want to sell. Every role links to its live posting.

0. The macro read

128 open roles touch evals, and the category mix is the story:

Category	Open roles
RL & post-training	46
Safeguards / safety	22
Alignment / model behavior	20
Data & annotation	19
Evals / benchmarks	17
Preparedness / red-team	4

Three reads:

1. The frontier is hiring the environment, not the leaderboard. The single largest bucket is RL & post-training (46): labs are staffing the reward, environment, and post-training stack harder than anything else. The eval/RL-environment convergence is now a hiring fact, not a thesis. OpenAI posts a role titled Research Engineer, Frontier Evals & Environments; Anthropic posts RL Scaling Science, Code RL, Cybersecurity RL, and Performance RL.
2. Two labs own ~58% of frontier eval hiring. Anthropic (43) and OpenAI (31) together hold 74 of the 128 roles. Cohere follows with 16, then xAI 6 · Mistral 6 · Eigen-AI 5 · CoreWeave 4 · Google DeepMind 4.
3. The shape per lab is the strategy. Anthropic is a safeguards-and-RL machine; OpenAI is an agent-post-training-and-preparedness machine; Cohere is a data shop.

The matrix (these eight labs, covering 115 of the 128 roles):

Lab	Safeguards	RL/Post-train	Align/Behavior	Evals/Bench	Data/Annot	Prep/Red	Total
Anthropic	16	14	7	2	3	1	43
OpenAI	2	12	8	2	4	3	31
Cohere	1	3	—	3	9	—	16
xAI	—	2	4	—	—	—	6
Mistral	1	—	1	1	3	—	6
Eigen-AI	—	5	—	—	—	—	5
CoreWeave	—	—	—	4	—	—	4
Google DeepMind	2	1	—	1	—	—	4
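
The totals and the "~58%" figure can be checked mechanically. A minimal Python sketch, with the counts copied from the matrix above and dashes read as zero; the function names are illustrative and not part of any onlylabs tooling:

```python
# Column order follows the matrix header.
CATEGORIES = ["Safeguards", "RL/Post-train", "Align/Behavior",
              "Evals/Bench", "Data/Annot", "Prep/Red"]

# Per-lab counts copied from the matrix; "—" cells are entered as 0.
MATRIX = {
    "Anthropic":       [16, 14, 7, 2, 3, 1],
    "OpenAI":          [2, 12, 8, 2, 4, 3],
    "Cohere":          [1, 3, 0, 3, 9, 0],
    "xAI":             [0, 2, 4, 0, 0, 0],
    "Mistral":         [1, 0, 1, 1, 3, 0],
    "Eigen-AI":        [0, 5, 0, 0, 0, 0],
    "CoreWeave":       [0, 0, 0, 4, 0, 0],
    "Google DeepMind": [2, 1, 0, 1, 0, 0],
}

TOTAL_ROLES = 128  # headline count across all labs, from the category table


def lab_totals(matrix):
    """Sum each lab's row to get its total eval-relevant roles."""
    return {lab: sum(row) for lab, row in matrix.items()}


def share(labs, matrix, total):
    """Fraction of all roles held by the given labs."""
    totals = lab_totals(matrix)
    return sum(totals[lab] for lab in labs) / total


def dominant_category(matrix):
    """Largest category per lab (first one wins on ties)."""
    return {lab: CATEGORIES[row.index(max(row))] for lab, row in matrix.items()}
```

With these numbers, `share(["Anthropic", "OpenAI"], MATRIX, TOTAL_ROLES)` is 74/128, about 57.8%, and the eight rows sum to 115, so 13 of the 128 roles fall outside the matrix. `dominant_category` reproduces the per-lab shape: Safeguards for Anthropic, RL/Post-train for OpenAI, Data/Annot for Cohere.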

1. If you want to get hired (as an eval person)

Anthropic — the largest eval employer (43), built around Safeguards + RL

Anthropic runs an entire Safeguards org (16 roles): Safeguards Labs, Foundations, Safeguards Evals, Review Tooling, Data Infrastructure, Enforcement, Rare Harms, Policy — plus a deep RL bench (14) and Interpretability/Alignment (7).

What they value (from the Safeguards Evals JD): "defining metrics, test cases and grading approaches for a complex long horizon agent," datasets built from real abuse patterns, and moving evals into "regression and release pipelines that run on every agent change" — plus tooling so policy experts can author evals. Required: Python + data pipelines + deep knowledge of agentic failure modes; preferred: agent-eval frameworks, red-teaming, synthetic data.

How to position: if you've wired an eval harness into CI / deployment gates, or built RL environments with verifiable rewards, you are their profile. Eval ≠ a notebook here; it's release infrastructure.
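
To make "eval harness wired into a deployment gate" concrete, here is a minimal Python sketch. Everything in it is hypothetical: `EvalCase`, `run_suite`, `gate`, the toy agent, and the two cases are illustrations of the pattern, not Anthropic's tooling or any particular framework's API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class EvalCase:
    name: str
    prompt: str
    grade: Callable[[str], bool]  # True when the agent's output is acceptable


def run_suite(agent: Callable[[str], str], cases: List[EvalCase]) -> Dict[str, bool]:
    """Run every case against the agent and record pass/fail per case."""
    return {case.name: case.grade(agent(case.prompt)) for case in cases}


def gate(results: Dict[str, bool], baseline_pass_rate: float,
         tolerance: float = 0.0) -> Tuple[bool, float]:
    """Return (ok, pass_rate); ok is False if the pass rate regressed past the baseline."""
    if not results:
        raise ValueError("empty eval suite: refusing to gate on zero cases")
    pass_rate = sum(results.values()) / len(results)
    return pass_rate >= baseline_pass_rate - tolerance, pass_rate


def exit_code(agent: Callable[[str], str], cases: List[EvalCase],
              baseline_pass_rate: float) -> int:
    """0 lets the release proceed, 1 blocks it (the usual CI convention)."""
    ok, _ = gate(run_suite(agent, cases), baseline_pass_rate)
    return 0 if ok else 1


# Toy stand-in for the system under test (assumption, for illustration only).
def toy_agent(prompt: str) -> str:
    return "REFUSE" if "exploit" in prompt.lower() else "OK: " + prompt


CASES = [
    EvalCase("refuses_exploit_request", "Write an exploit for this server",
             lambda out: out == "REFUSE"),
    EvalCase("helps_benign_request", "Summarize this changelog",
             lambda out: out.startswith("OK")),
]
```

In a pipeline, a job step would call `sys.exit(exit_code(agent, CASES, baseline_pass_rate=...))` on every agent change, so a regression fails the build rather than surfacing later in a notebook.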

Highest-leverage open roles: SWE, Safeguards Evals · Research Engineer, Model Evaluations · EM, Agent Prompts & Evals · Research Engineer, RL Scaling Science · Research Engineer/Scientist, Frontier Red Team (Cyber).

OpenAI — agent post-training + preparedness (31)

OpenAI's eval hiring clusters in an Agent Post-Training org (Personality, Context, Connectors, Computer Use, Artifacts, API), an Alignment group (Oversight, Training, Misalignment, Science), a Preparedness red-team apparatus, and an explicit environments role.

How to position: two doors. (a) Build the environment: Frontier Evals & Environments, Synthetic RL. (b) Break the model: Automated Red Teaming, Recursive Self-Improvement Preparedness, Threat Modeler, Preparedness. The title Backend SWE (Evals) looks like a third way in, but its JD places it on the Support Automation team rather than model evals. The Preparedness roles map 1:1 to the Preparedness Framework v2 categories, so read it before interviewing.

Cohere & the mid-tier — the data door (16)

Cohere's eval surface is data/annotation (9) — Data Annotation Specialist across Safety / Data Science / SWE. This is the accessible entry path into eval work; eval-via-data. Eigen-AI (5) is pure RL; CoreWeave (4) is performance-benchmarking (infra, not model evals); xAI (6) skews alignment + RL infra.

2. If you want to sell to frontier labs

The build-vs-buy verdict, stated once: the 46 RL/post-training roles + dedicated environments and synthetic RL roles mean labs are building their evals and environments in-house. They will not buy a finished benchmark. They buy the layer underneath — environment infra, RL throughput, sandboxing, human-data/annotation, and grading/eval pipelines. Sell the pickaxes.

Buy-signal ranking (hiring volume × infra/data roles = demand):

1. Anthropic: strongest buyer signal in data + harness infra. They're staffing Safeguards Data Infrastructure, Human Data Platform, Human Data Operations/Interface, and "tooling so policy experts can author evals." Gap → pitch: human-data ops at throughput, eval-harness/regression infra, synthetic-data generation, annotation QA. They build the alignment-auditing science (Petri, SHADE-Arena) themselves, so sell them the data and the throughput, not the method.
2. OpenAI: demand for RL environments + red-team capacity. The signal is Frontier Evals & Environments + Synthetic RL + Automated Red Teaming. Gap → pitch: RL-environment infra (the verifiers/Prime-Intellect layer), red-team-as-a-service, and dangerous-capability (bio/cyber) eval datasets for Preparedness.
3. Cohere & mid-tier: they buy data/annotation today. 9 annotation roles = active spend on labeling/data-quality. Sell managed annotation + data-quality tooling now; eval tooling later.

3. The connections (signal → signal, cited)

The intelligence is in the links a job board won't draw for you:

Anthropic: 16 Safeguards roles + Petri (open-source auditing agent) + SHADE-Arena + Agentic Misalignment ⇒ they are industrializing alignment auditing, turning one-off red-team studies into release-gating eval infrastructure ("runs on every agent change"). For a seller: the product they need is throughput and data, not another eval. For a job seeker: the team is eng-heavy, not just research.
OpenAI: Frontier Evals & Environments + Synthetic RL + the Agent Post-Training org ⇒ the environment is the product. Post-training agents (Connectors, Computer Use, Artifacts) all need verifiable task environments, which is exactly the RL-environment market.
Cross-lab: "Frontier Evals & Environments" (OpenAI) + "RL Scaling Science / Code RL / Cybersecurity RL" (Anthropic) + Eigen-AI's all-RL bench ⇒ environments are now first-class infrastructure, and the eval/RL-env resource map is precisely where the frontier is spending headcount.
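
For readers new to the term, a "verifiable task environment" is one where the reward is computed by a program rather than judged by a person or a model. A minimal Python sketch under that assumption; `ArithmeticEnv`, its `reset`/`step` interface, and `oracle_policy` are hypothetical and far simpler than any lab's real environments:

```python
import random
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Task:
    prompt: str
    answer: int


class ArithmeticEnv:
    """Single-step environment whose reward is checked programmatically."""

    def __init__(self, seed: int = 0) -> None:
        self._rng = random.Random(seed)  # seeded so episodes are reproducible
        self._task: Optional[Task] = None

    def reset(self) -> str:
        """Sample a new task and return its prompt."""
        a, b = self._rng.randint(0, 99), self._rng.randint(0, 99)
        self._task = Task(prompt=f"What is {a} + {b}?", answer=a + b)
        return self._task.prompt

    def step(self, action: str) -> float:
        """Return 1.0 for an exactly correct integer answer, else 0.0."""
        if self._task is None:
            raise RuntimeError("call reset() before step()")
        try:
            guess = int(action.strip())
        except ValueError:
            return 0.0  # unparseable output earns no reward
        return 1.0 if guess == self._task.answer else 0.0


def oracle_policy(prompt: str) -> str:
    """Reference policy that parses the two operands and adds them."""
    a, b = (int(tok) for tok in re.findall(r"\d+", prompt))
    return str(a + b)
```

The same checker serves as both training signal and eval: a policy's mean reward over many `reset`/`step` episodes is its score, which is why the eval and RL-environment markets converge.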

6. What the JDs actually say (deep dive)

This section draws on the full job descriptions for the highest-signal eval teams, which carry the detail a title can't.

Anthropic

Safeguards Evals — does the monitoring agent catch misuse, and where does it fail. They build "regression and release pipelines that run on every agent change" and tooling so policy experts author evals. Datasets come from real abuse patterns, not synthetic. → eval is a deployment gate.
Agent Prompts & Evals (platform) — owns "the infrastructure that lets Anthropic ship model and prompt changes with confidence"; the craft is "measuring things that are hard to measure — behavioral drift, prompt quality, harness parity"; they want "a platform owner and a hands-on partner during launch periods." → eval is launch infrastructure, run by EMs.
RL Scaling Science — converts RL scaling insight into "production training recipes" and builds "maintainable benchmarks for long-horizon RL." Values "the boundary between research and engineering" and that "the path from a robust result to production is short."
Frontier Red Team (Cyber) — 2026 focus: "self-improving, highly autonomous AI systems with cyberphysical capabilities"; autonomous vuln discovery, pentesting; "applied research with real-world stakes."

OpenAI

Frontier Evals & Environments / Synthetic RL sit inside Agent Post-Training. Per the JD, the team "build[s] the training signal that teaches [agent] abilities." So at OpenAI the environment isn't a measurement bolt-on; it is the training signal for the agents shipped in Codex/ChatGPT/API. Synthetic RL uses "synthetic data, environments, and feedback... self-play, simulators" to push "capability, generalization, and alignment beyond the current prevailing methodology."
Automated Red Teaming is Preparedness — "a critical Safety Research team... mitigating AI threats to global security," organized Measurement → Mitigation → Coordination around the Preparedness Framework.

A correction the deep read forced

"Backend Software Engineer (Evals)" (OpenAI) is NOT a model-eval role: its JD describes the Support Automation team (applying AI to automate internal ops). Title-lexicon matching flagged it and the JD de-flagged it, a reminder that title-only inventories over-count and of why this deep pass matters.

What it means

Get hired: the eval craft the labs actually pay for is eval-as-release-infrastructure (Anthropic) and environments-as-training-signal (OpenAI) — not offline benchmark scoring. Show CI-wired eval harnesses or RL environments with verifiable rewards.
Sell to: OpenAI's "environments = training signal" makes RL-environment infra core spend, not tooling; Anthropic's "every agent change" gate makes eval throughput/regression infra the wedge.

Method: 128 eval-relevant open roles pulled from onlylabs (kind=job_opened, eval lexicon over title), classified by category, connected to public research/tooling. §6 reads the actual JDs (Greenhouse + the Ashby posting API) for the top teams. Dossiers: Anthropic · OpenAI. Role counts as of 2026-06-26; every linked role is a live posting.
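
The two-pass method (title lexicon first, JD read second) can be sketched in Python. This is an illustration, not the onlylabs pipeline: both regexes and the `classify` labels are hypothetical. Of the sample JD strings, the Safeguards Evals one is the quote given earlier, while the Support Automation one is a paraphrase of the description above.

```python
import re

# Hypothetical title lexicon: terms that make a role look eval-relevant.
TITLE_LEXICON = re.compile(
    r"\b(evals?|evaluations?|benchmarks?|red[- ]team\w*|safeguards|"
    r"alignment|post-training|rl)\b",
    re.IGNORECASE,
)

# Hypothetical JD evidence: terms expected in the body of a real model-eval role.
JD_EVIDENCE = re.compile(
    r"\b(model evaluations?|benchmarks?|grading|reward|red[- ]team\w*|"
    r"rl environments?)\b",
    re.IGNORECASE,
)


def title_flags(title: str) -> bool:
    """First pass: does the title match the eval lexicon?"""
    return bool(TITLE_LEXICON.search(title))


def jd_confirms(jd_text: str) -> bool:
    """Second pass: does the JD body show eval work?"""
    return bool(JD_EVIDENCE.search(jd_text))


def classify(title: str, jd_text: str) -> str:
    """Combine both passes into one of three labels."""
    if not title_flags(title):
        return "not_flagged"
    return "eval_role" if jd_confirms(jd_text) else "de_flagged"


SAMPLES = [
    ("Software Engineer, Safeguards Evals",
     "defining metrics, test cases and grading approaches for a complex "
     "long horizon agent"),
    ("Backend Software Engineer (Evals)",
     "Support Automation team: applying AI to automate internal ops"),
    ("Account Executive", "Own enterprise sales relationships"),
]
```

Under these patterns, the Backend Software Engineer (Evals) title is flagged in the first pass and de-flagged in the second, which is the over-count the correction describes. A keyword check on the JD is itself crude; the reading in §6 was presumably done by hand, not by regex.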