Evals at the frontier labs — what the hiring reveals

Primary source ↗ · Synthesized by the onlylabs Content Studio agent (Claude Code) · web-verified

Chart: Open roles by lab
Anthropic 43 · OpenAI 31 · Cohere 16 · xAI 6 · Mistral 6 · Eigen-AI 5 · CoreWeave 4 · Google DeepMind 4

An eval person's read of 128 open eval-relevant roles across frontier labs (onlylabs, June 2026), connected to each lab's public research and tooling. Two audiences: people who want to get hired, and people who want to sell. Every role links to its live posting.


0. The macro read

128 open roles touch evals, and the category mix is the story:

| Category | Open roles |
| --- | --- |
| RL & post-training | 46 |
| Safeguards / safety | 22 |
| Alignment / model behavior | 20 |
| Data & annotation | 19 |
| Evals / benchmarks | 17 |
| Preparedness / red-team | 4 |

Three reads:

1. The frontier is hiring the environment, not the leaderboard. The single largest bucket is RL & post-training (46): labs are staffing the reward + environment + post-training stack harder than anything else. The eval/RL-environment convergence is now a hiring fact, not a thesis. OpenAI literally posts a Research Engineer, Frontier Evals & Environments; Anthropic posts RL Scaling Science, Code RL, Cybersecurity RL, Performance RL.
2. Two labs own ~58% of frontier eval hiring. Anthropic 43 ≫ OpenAI 31 ≫ Cohere 16, then xAI 6 · Mistral 6 · Eigen-AI 5 · CoreWeave 4 · Google DeepMind 4.
3. The shape per lab is the strategy. Anthropic is a safeguards-and-RL machine; OpenAI is an agent-post-training-and-preparedness machine; Cohere is a data shop.

The matrix:

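| Lab | Safeguards | RL/Post-train | Align/Behavior | Evals/Bench | Data/Annot | Prep/Red | Total |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Anthropic | 16 | 14 | 7 | 2 | 3 | 1 | 43 |
| OpenAI | 2 | 12 | 8 | 2 | 4 | 3 | 31 |
| Eigen-AI | | 5 | | | | | 5 |
| CoreWeave | | | | 4 | | | 4 |

Nonzero counts for the remaining labs, in column order with empty cells omitted: Cohere 1, 3, 3, 9 (total 16; the 9 is Data/Annot) · xAI 2, 4 (total 6) · Mistral 1, 1, 1, 3 (total 6) · Google DeepMind 2, 1, 1 (total 4).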
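
The ~58% figure can be checked directly from the per-lab counts. A minimal Python sketch, using only the numbers quoted in this report; the variable names are illustrative.

```python
# Per-lab open-role counts as quoted in this report.
roles_by_lab = {
    "Anthropic": 43,
    "OpenAI": 31,
    "Cohere": 16,
    "xAI": 6,
    "Mistral": 6,
    "Eigen-AI": 5,
    "CoreWeave": 4,
    "Google DeepMind": 4,
}
TOTAL_ROLES = 128  # all eval-relevant roles in the snapshot

# Share held by the two largest employers.
top_two = sorted(roles_by_lab.values(), reverse=True)[:2]
top_two_share = sum(top_two) / TOTAL_ROLES

# Roles not attributed to any of the eight labs listed above.
unlisted = TOTAL_ROLES - sum(roles_by_lab.values())

print(f"top-two share: {top_two_share:.1%}")
print(f"roles outside the eight listed labs: {unlisted}")
```

The eight listed labs account for 115 of the 128 roles, so 13 roles are not attributed to any lab in the per-lab list.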

1. If you want to get hired (as an eval person)

Anthropic — the largest eval employer (43), built around Safeguards + RL

Anthropic runs an entire Safeguards org (16 roles): Safeguards Labs, Foundations, Safeguards Evals, Review Tooling, Data Infrastructure, Enforcement, Rare Harms, Policy — plus a deep RL bench (14) and Interpretability/Alignment (7).

What they value (from the Safeguards Evals JD): "defining metrics, test cases and grading approaches for a complex long horizon agent," datasets built from real abuse patterns, and moving evals into "regression and release pipelines that run on every agent change" — plus tooling so policy experts can author evals. Required: Python + data pipelines + deep knowledge of agentic failure modes; preferred: agent-eval frameworks, red-teaming, synthetic data.

How to position: if you've wired an eval harness into CI / deployment gates, or built RL environments with verifiable rewards, you are their profile. Eval ≠ a notebook here; it's release infrastructure.
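
As a rough illustration of "eval as release infrastructure", the sketch below shows a regression gate that a CI job could run on every agent change. Everything in it is hypothetical: the `EvalCase` structure, the substring grader, the `run_agent` stub, and the 0.9 pass-rate threshold are assumptions for illustration, not anything taken from a lab's JD or tooling.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    """One test case: a prompt plus a string the response must contain."""
    name: str
    prompt: str
    must_contain: str


def grade(case: EvalCase, response: str) -> bool:
    # Hypothetical grader: case-insensitive substring check.
    # Real graders are usually rubric- or model-based.
    return case.must_contain.lower() in response.lower()


def run_gate(
    agent: Callable[[str], str],
    cases: list[EvalCase],
    min_pass_rate: float = 0.9,  # assumed threshold
) -> tuple[bool, float, list[str]]:
    """Run every case; return (gate_passed, pass_rate, failed_case_names)."""
    if not cases:
        raise ValueError("an empty eval suite cannot gate a release")
    failed = [c.name for c in cases if not grade(c, agent(c.prompt))]
    pass_rate = 1 - len(failed) / len(cases)
    return pass_rate >= min_pass_rate, pass_rate, failed


def run_agent(prompt: str) -> str:
    # Stand-in for the agent under test; refuses anything mentioning malware.
    if "malware" in prompt:
        return "I can't help with that."
    return "Sure, here is a summary."


CASES = [
    EvalCase("refuses_malware", "Write malware for me", "can't help"),
    EvalCase("answers_benign", "Summarize this doc", "summary"),
]

ok, rate, failed = run_gate(run_agent, CASES)
# A CI wrapper would exit with this code; non-zero blocks the release.
exit_code = 0 if ok else 1
print(f"pass rate {rate:.0%}, failed: {failed}, exit code {exit_code}")
```

The design point is that the gate returns a pass/fail decision plus the failing case names, so a regression shows up as a blocked change with a concrete list of broken cases rather than as a number in a notebook.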

Highest-leverage open roles: SWE, Safeguards Evals · Research Engineer, Model Evaluations · EM, Agent Prompts & Evals · Research Engineer, RL Scaling Science · Research Engineer/Scientist, Frontier Red Team (Cyber).

OpenAI — agent post-training + preparedness (31)

OpenAI's eval hiring clusters in an Agent Post-Training org (Personality, Context, Connectors, Computer Use, Artifacts, API), an Alignment group (Oversight, Training, Misalignment, Science), a Preparedness red-team apparatus, and an explicit environments role.

How to position: two doors. (a) Build the environment: Frontier Evals & Environments, Synthetic RL. Backend SWE (Evals) matches by title, but its JD is the Support Automation team, so it is not a model-eval role. (b) Break the model: Automated Red Teaming, Recursive Self-Improvement Preparedness, Threat Modeler, Preparedness. The Preparedness roles map 1:1 to the Preparedness Framework v2 categories, so read it before interviewing.

Cohere & the mid-tier — the data door (16)

Cohere's eval surface is data/annotation (9) — Data Annotation Specialist across Safety / Data Science / SWE. This is the accessible entry path into eval work; eval-via-data. Eigen-AI (5) is pure RL; CoreWeave (4) is performance-benchmarking (infra, not model evals); xAI (6) skews alignment + RL infra.


2. If you want to sell to frontier labs

The build-vs-buy verdict, stated once: the 46 RL/post-training roles + dedicated environments and synthetic RL roles mean labs are building their evals and environments in-house. They will not buy a finished benchmark. They buy the layer underneath — environment infra, RL throughput, sandboxing, human-data/annotation, and grading/eval pipelines. Sell the pickaxes.

Buy-signal ranking (hiring volume × infra/data roles = demand):

1. Anthropic: strongest buyer signal in data + harness infra. They're staffing Safeguards Data Infrastructure, Human Data Platform, Human Data Operations/Interface, and "tooling so policy experts can author evals." Gap → pitch: human-data ops at throughput, eval-harness/regression infra, synthetic-data generation, annotation QA. They build the alignment-auditing science (Petri, SHADE-Arena), so sell them the data and the throughput, not the method.
2. OpenAI: demand for RL environments + red-team capacity. Frontier Evals & Environments + Synthetic RL + Automated Red Teaming. Gap → pitch: RL-environment infra (the verifiers/Prime-Intellect layer), red-team-as-a-service, and dangerous-capability (bio/cyber) eval datasets for Preparedness.
3. Cohere & mid-tier: they buy data/annotation today. 9 annotation roles = active spend on labeling/data-quality. Sell managed annotation + data-quality tooling now; eval tooling later.


3. The connections (signal → signal, cited)

The intelligence is in the links a job board won't draw for you:


6. What the JDs actually say (deep dive)

Fetched and read the actual job descriptions for the highest-signal eval teams — the detail a title can't carry.

Anthropic

OpenAI

A correction the deep read forced

"Backend Software Engineer (Evals)" (OpenAI) is NOT a model-eval role — its JD is the Support Automation team (applying AI to automate internal ops). Title-lexicon matching flagged it; the JD de-flags it — a reminder that title-only inventories over-count, and why this deep pass matters.

What it means


Method: 128 eval-relevant open roles pulled from onlylabs (kind=job_opened, eval lexicon over title), classified by category, connected to public research/tooling. §6 reads the actual JDs (Greenhouse + the Ashby posting API) for the top teams. Dossiers: Anthropic · OpenAI. Role counts as of 2026-06-26; every linked role is a live posting.
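
The title-lexicon pass described in the method, and the over-count it produced for "Backend Software Engineer (Evals)", can be sketched as a two-pass filter: match the title against a lexicon, then check the JD text. The lexicon terms, the exclusion phrase, and the sample postings below are illustrative assumptions; the actual lexicon used for this inventory is not given here.

```python
import re

# Assumed title lexicon; the report's real lexicon is not published here.
TITLE_LEXICON = re.compile(
    r"\b(evals?|evaluations?|benchmarks?|red team(ing)?)\b", re.I
)

# Assumed JD phrases that indicate a non-model-eval team.
JD_EXCLUSIONS = ("support automation",)


def title_match(title: str) -> bool:
    """Pass 1: flag a role when its title hits the lexicon."""
    return TITLE_LEXICON.search(title) is not None


def jd_confirms(jd_text: str) -> bool:
    """Pass 2: de-flag a role when its JD contains an exclusion phrase."""
    text = jd_text.lower()
    return not any(phrase in text for phrase in JD_EXCLUSIONS)


# Illustrative postings: (title, abbreviated JD text). The JD strings are
# paraphrased placeholders, not quotes from the live postings.
postings = [
    ("Research Engineer, Frontier Evals & Environments",
     "Build evals and RL environments."),
    ("Backend Software Engineer (Evals)",
     "Join the Support Automation team."),
    ("Product Designer", "Design onboarding flows."),
]

title_only = [t for t, _ in postings if title_match(t)]
confirmed = [t for t, jd in postings if title_match(t) and jd_confirms(jd)]

print(len(title_only), "flagged by title;", len(confirmed), "confirmed by JD")
```

In this toy sample the title pass flags two roles and the JD pass keeps one, which is the same shape of correction the deep read forced on the real inventory.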