Repo · Databricks (DBRX) · published Nov 24, 2025

databricks/officeqa

Jupyter Notebook


databricks/officeqa

Description: Repository for getting started with the OfficeQA Benchmark.

Language: Jupyter Notebook

License: Apache-2.0

Stars: 129

Forks: 14

Open issues: 2

Created: 2025-11-24T18:20:06Z

Pushed: 2026-06-01T18:26:06Z

Default branch: main

Fork: no

Archived: no

README:

OfficeQA

A Grounded Reasoning Benchmark by Databricks

OfficeQA is a benchmark by Databricks for evaluating model and agent performance on end-to-end Grounded Reasoning tasks. The benchmark is split into two subsets:

1. OfficeQA Pro: The default for evaluating frontier models (N=133)
2. OfficeQA Full: A version of the benchmark containing additional easier questions to hillclimb systems on (N=246)

Additional details:

  • Questions require the [U.S. Treasury Bulletin](https://fraser.stlouisfed.org/title/treasury-bulletin-407?browse=1930s) documents to answer
  • Datasets are released under CC-BY-SA 4.0; code and scripts are released under the Apache 2.0 License.
  • For more information, see the [OfficeQA Technical Report](https://arxiv.org/abs/2603.08655)

Data Access

As of May 2026, all large files (benchmark CSVs, Treasury Bulletin PDFs, and parsed docs) have been moved from this GitHub repo to Hugging Face. The CSVs are gated so that agents browsing the web cannot access them. Request access on Hugging Face to get the benchmark questions and answers.

Once you've requested and been granted access, you can load the benchmark data:

from datasets import load_dataset
# Authenticate first: huggingface_hub.login() or set HF_TOKEN env var
dataset = load_dataset("databricks/officeqa", data_files="officeqa_pro.csv", split="train")

Overview

OfficeQA evaluates how well AI systems can reason over real-world documents to answer complex questions. The benchmark uses historical U.S. Treasury Bulletin PDFs (1939-2025), which contain dense financial tables, charts, and text data.

Repository Contents:

| File/Dir | Description |
|---|---|
| reward.py | Evaluation script for scoring model outputs |
| corpus_scripts/ | Scripts and notebooks for working with the Treasury Bulletin corpus |

All benchmark data (CSVs, PDFs, parsed docs) is on [Hugging Face](https://huggingface.co/datasets/databricks/officeqa).

Dataset Schema (officeqa_pro.csv / officeqa_full.csv):

| Column | Description |
|---|---|
| uid | Unique question identifier |
| question | The question to answer |
| answer | Ground truth answer |
| source_docs | Original URL(s) from the Federal Reserve Archive |
| source_files | Corresponding parsed filename(s) (e.g., treasury_bulletin_1941_01.txt) |
| difficulty | easy or hard |
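As a minimal sketch of working with this schema, the snippet below filters rows by the `difficulty` column and counts each level. The rows are hypothetical placeholders that only mirror the documented column names. The real rows come from the gated CSVs, and `filter_by_difficulty` is an illustrative helper, not part of the repository.

```python
from collections import Counter

# Hypothetical placeholder rows that mirror the documented columns.
# They are NOT real benchmark questions or answers.
rows = [
    {"uid": "demo-1", "question": "placeholder question A", "answer": "1.0",
     "source_docs": "placeholder-url-a",
     "source_files": "treasury_bulletin_1941_01.txt", "difficulty": "easy"},
    {"uid": "demo-2", "question": "placeholder question B", "answer": "2.0",
     "source_docs": "placeholder-url-b",
     "source_files": "treasury_bulletin_1941_01.txt", "difficulty": "hard"},
    {"uid": "demo-3", "question": "placeholder question C", "answer": "3.0",
     "source_docs": "placeholder-url-c",
     "source_files": "treasury_bulletin_1941_01.txt", "difficulty": "easy"},
]


def filter_by_difficulty(rows, difficulty):
    """Return the rows whose `difficulty` column equals `difficulty`."""
    if difficulty not in {"easy", "hard"}:
        raise ValueError(f"difficulty must be 'easy' or 'hard', got {difficulty!r}")
    return [row for row in rows if row["difficulty"] == difficulty]


hard_rows = filter_by_difficulty(rows, "hard")
difficulty_counts = Counter(row["difficulty"] for row in rows)
print(difficulty_counts)
```

The same helper should also work on a loaded Hugging Face `Dataset`, since iterating over one yields a dict per row.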

Results

Headline results on OfficeQA Pro (N=133). See the OfficeQA Technical Report for the full evaluation methodology and additional settings.

Agent Harness Performance

End-to-end performance of frontier agents operating over the Treasury Bulletin corpus.

GPT-5.1 and Opus 4.5 results are included as a reference point to the results from the OfficeQA blog; they were re-run on the latest OfficeQA Pro and recorded in the March 9, 2026 OfficeQA Technical Report.

GPT-5.4 and Opus 4.6 results were recorded in the March 9, 2026 OfficeQA Technical Report. Opus 4.7 results were recorded on April 21, 2026.

LLM with Oracle Page(s) + Web Search (PDF Only)

LLM performance when provided the oracle page(s) needed to answer each question along with web search access, evaluated across varying absolute relative error tolerances.
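To make "absolute relative error tolerance" concrete, here is a small sketch of the usual definition, |predicted − truth| / |truth|, checked against several tolerance levels. It illustrates the metric only and is not the implementation in `reward.py`. The function names and the zero-handling rule are assumptions made for this example.

```python
def absolute_relative_error(ground_truth: float, predicted: float) -> float:
    """Return |predicted - ground_truth| / |ground_truth|."""
    if ground_truth == 0:
        # Assumption for this sketch: a zero ground truth only matches exactly.
        return 0.0 if predicted == 0 else float("inf")
    return abs(predicted - ground_truth) / abs(ground_truth)


def within_tolerance(ground_truth: float, predicted: float, tolerance: float) -> bool:
    """True when the absolute relative error is at most `tolerance`."""
    return absolute_relative_error(ground_truth, predicted) <= tolerance


# A prediction of 201 against a ground truth of 200 is off by 0.5%.
tolerances = [0.0, 0.001, 0.01]  # exact match, 0.1%, 1%
results = {tol: within_tolerance(200.0, 201.0, tol) for tol in tolerances}
print(results)
```

In this example the prediction fails at the exact-match and 0.1% levels but passes at 1%, which is why a model's score can differ across tolerance settings.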

GPT-5.4 and Opus 4.6 results were recorded in the March 9, 2026 OfficeQA Technical Report. Opus 4.7 results were recorded on April 21, 2026.

Getting Started

1. Load the benchmark questions (from Hugging Face)

from datasets import load_dataset
# Authenticate first (dataset is gated)
# huggingface_hub.login() or set HF_TOKEN env var

# Pro subset — default for evaluating frontier models
dataset = load_dataset("databricks/officeqa", data_files="officeqa_pro.csv", split="train")

# Full benchmark — includes easier questions for hillclimbing
dataset = load_dataset("databricks/officeqa", data_files="officeqa_full.csv", split="train")

2. Clone the code repository (for reward.py and scripts)

git clone https://github.com/databricks/officeqa.git
cd officeqa

3. Download the corpus (from Hugging Face)

We recommend using parsed txt files found here:

from huggingface_hub import snapshot_download

# Download transformed text (recommended for LLM/RAG workflows, ~460MB)
local_dir = snapshot_download(
    repo_id="databricks/officeqa",
    repo_type="dataset",
    allow_patterns="treasury_bulletins_parsed/transformed/*.txt",
)

If you'd like to use the raw JSON parse or the original PDFs, you can also download them here:

# Download parsed JSON docs (~730MB, with bounding boxes, tables, metadata)
local_dir = snapshot_download(
    repo_id="databricks/officeqa",
    repo_type="dataset",
    allow_patterns="treasury_bulletins_parsed/jsons/*.json",
)

# Download original PDFs (~4GB)
local_dir = snapshot_download(
    repo_id="databricks/officeqa",
    repo_type="dataset",
    allow_patterns="treasury_bulletin_pdfs/*",
)

| Format | Best for | Size |
|---|---|---|
| PDFs | Systems with native PDF support, or you want to parse from scratch | ~4GB |
| Parsed JSON | Full structural information, coordinates | ~730MB |
| Transformed TXT | LLM/agent consumption, cleaner text | ~460MB |

See [corpus_scripts/](corpus_scripts/) for scripts to create alternative text representations from the parsed JSONs, and to visualize the parsed bounding boxes on top of the PDFs.
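Once the transformed text is downloaded, a question's `source_files` entry can be resolved to a local file. The sketch below assumes the snapshot keeps the `treasury_bulletins_parsed/transformed/` layout implied by the `allow_patterns` above and that the files are UTF-8 encoded. Neither assumption is confirmed by the README, and both helpers are hypothetical.

```python
from pathlib import Path

# Assumption: inferred from the allow_patterns used in the download step.
TRANSFORMED_SUBDIR = Path("treasury_bulletins_parsed") / "transformed"


def resolve_source_file(local_dir, filename: str) -> Path:
    """Map a `source_files` filename to its path inside the downloaded snapshot."""
    path = Path(local_dir) / TRANSFORMED_SUBDIR / filename
    if not path.is_file():
        raise FileNotFoundError(f"{filename} not found under {path.parent}")
    return path


def read_source_text(local_dir, filename: str) -> str:
    """Read one parsed bulletin as text (UTF-8 is an assumption)."""
    return resolve_source_file(local_dir, filename).read_text(encoding="utf-8")
```

For example, `read_source_text(local_dir, "treasury_bulletin_1941_01.txt")` would return the text of that bulletin, where `local_dir` is the value returned by `snapshot_download`.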

4. Evaluate your model outputs

from reward import score_answer

# Score a single prediction
score = score_answer(
    ground_truth="123.45",
    predicted="123.45",
    tolerance=0.01,  # 1% tolerance for numerical answers
)
print(f"Score: {score}") # 1.0 for correct, 0.0 for incorrect
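To score a whole run rather than a single prediction, one option is to average the per-question scores. This is a sketch only: `mean_score` is a hypothetical helper, and it assumes the scorer takes the `ground_truth`, `predicted`, and `tolerance` keyword arguments shown above and returns 1.0 or 0.0. A stand-in exact-match scorer keeps the example runnable without the repository. In practice you would pass `score_answer` from `reward.py` instead.

```python
from typing import Callable, Iterable, Tuple


def mean_score(
    pairs: Iterable[Tuple[str, str]],
    score_fn: Callable[..., float],
    tolerance: float = 0.01,
) -> float:
    """Average `score_fn` over (ground_truth, predicted) pairs."""
    scores = [
        score_fn(ground_truth=truth, predicted=pred, tolerance=tolerance)
        for truth, pred in pairs
    ]
    if not scores:
        raise ValueError("no predictions to score")
    return sum(scores) / len(scores)


def exact_match_scorer(ground_truth: str, predicted: str, tolerance: float) -> float:
    """Stand-in scorer for this demo only; it ignores `tolerance`."""
    return 1.0 if ground_truth.strip() == predicted.strip() else 0.0


# Hypothetical (ground_truth, predicted) pairs, not real benchmark data.
pairs = [("123.45", "123.45"), ("10", "11")]
accuracy = mean_score(pairs, exact_match_scorer)
print(f"Accuracy: {accuracy}")
```

Passing the scorer in as an argument makes it easy to re-run the same predictions at several tolerance levels.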

The reward.py script provides fuzzy matching for numerical answers with configurable tolerance levels:

  • 0.0% - Exact match
  • 0.1% -…

