Infra & systems at the frontier — what the hiring reveals
An infra person's read of 715 open infrastructure/systems roles across frontier labs and neoclouds (onlylabs, June 2026), for two audiences: people who want to get hired, and people who want to sell infra to the labs. Every cited role links to its live posting.
0. The macro read
The frontier infra buildout is physical first. The largest category by far is data center / hardware (199), covering the power, cooling, facilities, and networking of the GPU buildout, ahead of platform/SRE (95), inference/serving (87), and GPU/kernels/perf (81). Training systems is the smallest (15): labs guard the core training stack tightly and hire for it rarely, and at senior levels.
Two structural reads:
1. Neoclouds are data-center machines; labs are broad-stack. Nebius (53 of 86) and CoreWeave (50 of 96) are overwhelmingly data center/hardware because they're building physical GPU clouds. OpenAI (138) and Anthropic (74) spread across the whole stack (GPU, serving, platform, DC). The chip/serving shops (Cerebras, Baseten, Together) concentrate in inference/serving + kernels, which is their product.
2. The competitive layer is serving + performance. Inference/serving (87) and GPU/kernels/perf (81) are where Anthropic (15 serving), Cerebras (11+12), OpenAI (13 GPU), Together, and Baseten all hire; this is the layer where cost-per-token is won.
| Lab | Train sys | Inference/serving | GPU/kernels | Data center/HW | Platform/SRE | Total |
|---|---|---|---|---|---|---|
| OpenAI | 4 | 8 | 13 | 29 | 16 | 138 |
| CoreWeave | 1 | 4 | 8 | 50 | 14 | 96 |
| Nebius | — | 1 | 3 | 53 | 11 | 86 |
| Anthropic | 1 | 15 | 9 | 16 | 7 | 74 |
| Cerebras | — | 11 | 12 | 6 | 6 | 40 |
| DigitalOcean | — | 8 | 3 | 11 | — | 34 |
| Together | 1 | 8 | 6 | 4 | 8 | 34 |
| xAI | 1 | 1 | — | 12 | 3 | 29 |
| Baseten | 1 | 7 | 5 | 3 | 3 | 24 |
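
The "53 of 86" style ratios quoted above can be recomputed for every row. The following is a minimal sketch, not the author's tooling: the counts are copied from the table, "—" cells are treated as 0 (an assumption), and the names `ROWS`, `share`, and `ranked` are illustrative. Note that the five layer columns do not sum to the Total column, so shares are taken against Total rather than against the row sum.

```python
# Role counts copied from the table above; "—" cells are treated as 0.
# Tuple order: train, serving, gpu, dc, platform, total.
LAYERS = ("train", "serving", "gpu", "dc", "platform")
ROWS = {
    "OpenAI":       (4, 8, 13, 29, 16, 138),
    "CoreWeave":    (1, 4, 8, 50, 14, 96),
    "Nebius":       (0, 1, 3, 53, 11, 86),
    "Anthropic":    (1, 15, 9, 16, 7, 74),
    "Cerebras":     (0, 11, 12, 6, 6, 40),
    "DigitalOcean": (0, 8, 3, 11, 0, 34),
    "Together":     (1, 8, 6, 4, 8, 34),
    "xAI":          (1, 1, 0, 12, 3, 29),
    "Baseten":      (1, 7, 5, 3, 3, 24),
}


def share(company, *layers):
    """Fraction of a company's total infra roles that fall in the given layers."""
    *counts, total = ROWS[company]
    picked = sum(counts[LAYERS.index(name)] for name in layers)
    return picked / total


def ranked(*layers):
    """Companies sorted by descending share of roles in the given layers."""
    return sorted(ROWS, key=lambda c: share(c, *layers), reverse=True)


if __name__ == "__main__":
    for company in ranked("dc"):
        print(f"{company:<13}{share(company, 'dc'):6.1%}")
```

On these numbers, Nebius (about 62%) and CoreWeave (about 52%) lead on data-center share, while Cerebras leads on combined serving + GPU share (23 of 40, 57.5%), which is consistent with the two structural reads.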
1. If you want to get hired (as an infra/systems person)
Position by layer, then by lab:
- Physical / data center (the biggest door): the neoclouds. CoreWeave (Technology Design Manager, Data Centers) and Nebius (Data Center Security Specialist) are hiring facilities/power/networking at scale; xAI (Data Center Operations Technician) is staffing its own buildout. If you've run data-center capacity, power, or fabric, this is the volume hiring.
- GPU / kernels / performance (the prestige low-level door): OpenAI (SWE, GPU Infrastructure - HPC; SWE, Kernel Performance), Cerebras, CoreWeave (Sr. GPU Infrastructure SWE). CUDA/Triton/compiler depth lands here.
- Inference / serving (the cost-per-token door): Anthropic (15 roles), Cerebras, Baseten (Baseten Inference Stack), Together, DigitalOcean. Serving-optimization and latency work is in demand everywhere.
- Training systems (the rare, senior door): only 15 roles total — e.g. OpenAI SWE, RL Training Infra. The labs build this in-house and hire it sparingly; you get in via the GPU/perf door and move toward it.
2. If you want to sell to frontier labs (and neoclouds)
The build-vs-buy map:
- Neoclouds buy the physical supply chain, not software. CoreWeave's and Nebius's 50+ data-center roles each mean they're building GPU clouds: they buy power, cooling, networking hardware, optics, and DCIM, and they resell compute to the labs. Pitch them: data-center supply chain, networking/optics, capacity/DCIM tooling. Pitch the labs: they are themselves buying neocloud capacity (Anthropic even staffs an Inference Capacity Lead), so capacity brokerage / optimization is a real wedge.
- Frontier labs build training systems in-house (15 roles, guarded) — do not sell them a training stack. They buy at the edges: kernels/compilers, observability/profiling, serving optimization, and reliability tooling. The GPU/kernel and serving hiring (OpenAI 13 GPU, Anthropic 15 serving) marks where they'll consider best-of-breed over build.
- Cerebras / Baseten / Together are the "buy your serving" alternative — if you sell inference, you're competing with them, not just the labs' internal stacks. Their hiring (inference platform + kernels) shows the serving margin is contested.
Buy-signal ranking: Neoclouds (CoreWeave/Nebius/DigitalOcean) for physical + capacity; frontier labs (OpenAI/Anthropic) for kernels/observability/serving at the edges; serving shops (Cerebras/Baseten/Together) as both buyers of low-level tooling and competitors in serving.
3. The connections (signal → signal)
- Nebius 53 + CoreWeave 50 data-center roles ⇒ the GPU-cloud buildout is a real-estate/power race, not a software race — and the frontier labs are their customers (Anthropic's Inference Capacity Lead). Sell the neoclouds supply chain; sell the labs capacity optimization.
- OpenAI's GPU/kernel + RL-Training-Infra roles + Anthropic's serving cluster ⇒ the labs keep the core training stack in-house but contest the cost-per-token layer — that's the only place an outside vendor gets in.
- Cerebras/Baseten/Together's inference-platform hiring ⇒ "serving as a product" is a contested margin — relevant whether you're competing or selling components into it.
4. What the JDs actually say (deep dive)
This section reads the actual JDs for the top infra teams (source: Greenhouse + Ashby posting APIs).
- OpenAI is going vertical into silicon. The Kernel Performance & AI Tooling role sits in the Hardware org, which "develops AI-native silicon… building on efforts like Jalapeño… co-designing chips, systems, tools… production-ready hardware for OpenAI's supercomputing platform." OpenAI's infra hiring isn't just running GPUs; it's designing its own chips. Sell-side read: long-term, less off-the-shelf dependence; near-term, demand for chip/systems co-design and compiler/kernel talent.
- Baseten = serving-as-a-product, well-funded. The Baseten Inference Stack JD: "powers mission-critical inference for the world's most dynamic AI companies — Cursor, Notion, Abridge, Clay, Writer… recently raised our $1.5B Series F." If you sell inference, your competitor is Baseten, not the labs' internal stacks.
- CoreWeave = the neocloud reselling compute to the labs — "The Essential Cloud for AI… trusted by leading AI labs." Its GPU-infra + data-center hiring is the supply side of the buildout.
- Anthropic runs inference as an EM-led eng org (Engineering Manager, Inference) — the serving/cost layer is managed engineering, not ad-hoc research.
What it means: the stack is bifurcating — labs go vertical (OpenAI silicon; internal training systems), neoclouds own the physical resale (CoreWeave/Nebius), and a funded serving-product layer (Baseten/Together/Cerebras) fights for the cost-per-token margin. Sell into whichever layer you're not competing with.
Method: 715 infrastructure/systems open roles pulled from onlylabs (kind=job_opened, infra lexicon, de-noised of non-eng roles), classified by layer. §4 reads the actual JDs for the top teams. Dossiers: OpenAI · CoreWeave · Nebius · Anthropic. Counts as of 2026-06-26; every linked role is a live posting.
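
The "classified by layer" step can be pictured as an ordered keyword match over role titles. The sketch below is an illustration under stated assumptions, not the lexicon or pipeline actually used: the `RULES` keywords and the `classify` helper are hypothetical, and the sample titles are the ones cited earlier in this piece.

```python
from collections import Counter

# Hypothetical keyword rules (not the real infra lexicon).
# Rules are checked in order and the first match wins, so "training"
# is tested before the broader GPU and platform keywords.
RULES = [
    ("training systems",  ("training",)),
    ("inference/serving", ("inference", "serving")),
    ("gpu/kernels/perf",  ("gpu", "kernel", "cuda", "hpc")),
    ("data center/hw",    ("data center", "datacenter")),
    ("platform/sre",      ("platform", "sre", "reliability")),
]


def classify(title):
    """Return the first layer whose keyword appears in the lowercased title."""
    lowered = title.lower()
    for layer, keywords in RULES:
        if any(keyword in lowered for keyword in keywords):
            return layer
    return "unclassified"


# Role titles cited in the sections above.
TITLES = [
    "Technology Design Manager, Data Centers",
    "Data Center Security Specialist",
    "Data Center Operations Technician",
    "SWE, GPU Infrastructure - HPC",
    "SWE, Kernel Performance",
    "Sr. GPU Infrastructure SWE",
    "SWE, RL Training Infra",
    "Engineering Manager, Inference",
    "Inference Capacity Lead",
    "Baseten Inference Stack",
]

counts = Counter(classify(title) for title in TITLES)

if __name__ == "__main__":
    for layer, n in counts.most_common():
        print(f"{layer:<20}{n}")
```

Plain substring matching is deliberately crude (a short keyword such as "sre" can match inside unrelated words), so a real pass would likely need word-boundary matching and a manual review of the unclassified bucket.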