Repo · Amazon (Nova) · published May 11, 2026

amazon-science/hallucination-benchmark-trivialplus

Python

Description: [ACL 2026 main] Long-Context Hallucination Detection Benchmark: Rethinking Evaluation for LLM Hallucination Detection: A Desiderata, A New RAG-based Benchmark, New Insights

Language: Python

License: NOASSERTION

Stars: 3

Forks: 1

Open issues: 0

Created: 2026-05-11T21:50:38Z

Pushed: 2026-05-13T17:03:55Z

Default branch: main

Fork: no

Archived: no

README:

TRIVIA+ Dataset

A rigorous benchmark for hallucination detection, built to close the gaps found in existing benchmarks.

  • 94K-char contexts (7–33x longer than prior benchmarks)
  • Human-verified, sentence-level labels
  • Controlled label noise for robustness testing
  • Satisfies all 7 desiderata for evaluation

Dataset Overview

| Split | Count |
|-------|-------|
| Train | 2,263 |
| Valid | 316 |
| Test | 645 |
| Total | 3,224 |

Data Sources

The dataset aggregates examples from multiple QA benchmarks:

| Source | Count | Description |
|--------|-------|-------------|
| drop | 1,339 (41.5%) | Discrete Reasoning Over Paragraphs |
| msmarco / ms_marco | 763 (23.7%) | Microsoft Machine Reading Comprehension |
| nq | 674 (20.9%) | Natural Questions |
| trivia | 309 (9.6%) | Trivia Question Answering |
| covid | 139 (4.3%) | COVID-19 scientific literature QA |

Note: The source column contains two spellings of the same origin dataset, msmarco (521) and ms_marco (242), which together account for the 763 examples listed above.
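Because both spellings refer to the same origin dataset, it may be convenient to merge them before grouping by source. The following is a minimal sketch that uses a toy frame in place of the real file and assumes the column is named `source`, as the note implies. The canonical spelling `msmarco` and the new column name `source_normalized` are arbitrary choices made for illustration.

```python
import pandas as pd

# Toy stand-in for the real dataset; only the `source` column matters here.
df = pd.DataFrame({"source": ["msmarco", "ms_marco", "drop", "nq", "ms_marco"]})

# Map the variant spelling onto one canonical name.
# ASSUMPTION: "msmarco" as the canonical form is an arbitrary choice.
df["source_normalized"] = df["source"].replace({"ms_marco": "msmarco"})

# Per-source counts after merging the two spellings.
source_counts = df["source_normalized"].value_counts()
print(source_counts)
```

Writing to a new column keeps the original `source` values available for provenance checks.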

LLM Response Sources

Responses were generated by three LLMs:

| Model | Count | Description | |-------|-------|-------------| | mixtral_8x7b | 1,686 (52.3%) | Mixtral 8x7B | | claude | 1,006 (31.2%) | Claude (SOTA LLM) | | gemma | 532 (16.5%) | Gemma 7B |

Human Annotation

Each sample was annotated at the sentence level by multiple annotators (up to 6 per sample) through a rigorous multi-stage pipeline:

1. Two annotators label each sample independently
2. On disagreement, two additional annotators provide labels
3. If still no clear majority, two more labels are gathered
4. Labels are aggregated via majority vote with strictest-label tiebreaking
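The escalation steps above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the strictness ordering in `STRICTNESS` is hypothetical, "clear majority" is assumed to mean strictly more than half of the votes collected so far, and the function names are invented for this example. The actual aggregation logic is described in DATA_DETAILS.md.

```python
from collections import Counter

# ASSUMPTION: hypothetical ordering from least to most strict; see DATA_DETAILS.md for the real rule.
STRICTNESS = ["Supported", "Supplementary", "Not Mentioned", "Contradicted"]

def has_clear_majority(labels):
    """Assumed definition: one label holds strictly more than half of the votes."""
    top_count = Counter(labels).most_common(1)[0][1]
    return top_count > len(labels) / 2

def aggregate(labels):
    """Majority vote; ties are broken in favour of the strictest tied label."""
    counts = Counter(labels)
    best = max(counts.values())
    tied = [label for label, n in counts.items() if n == best]
    return max(tied, key=STRICTNESS.index)

def annotate(vote_rounds):
    """Consume pairs of labels (up to three rounds) until a clear majority appears."""
    labels = []
    for pair in vote_rounds:
        labels.extend(pair)
        if has_clear_majority(labels):
            break
    return aggregate(labels), len(labels)

# Agreement in the first round: no escalation needed.
print(annotate([("Supported", "Supported")]))
# Disagreement, then resolved by the second pair.
print(annotate([("Supported", "Contradicted"), ("Supported", "Supported")]))
# Still tied after six votes: the strictest tied label wins.
print(annotate([("Supported", "Contradicted"),
                ("Supported", "Contradicted"),
                ("Not Mentioned", "Supplementary")]))
```

Under these assumptions, a sentence uses two, four, or six votes depending on how quickly the annotators converge, which matches the "up to 6 per sample" figure.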

Annotators were trained over two rounds with author audits. Low-performing annotators were removed using the Dawid-Skene model. Each sentence receives one of four labels: Supported, Contradicted, Not Mentioned, or Supplementary.

Figure: Multi-vote annotation pipeline with escalating review stages and Dawid-Skene quality filtering.
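For readers unfamiliar with the Dawid-Skene model mentioned above, the sketch below shows a generic EM implementation on toy data. It is not the authors' filtering code: the README does not say how annotators were scored or what cutoff was used, so the accuracy summary here (mean of each confusion-matrix diagonal) is an assumption, as are the function name `dawid_skene`, the `smoothing` parameter, and the two-class toy setup (the dataset itself uses four labels).

```python
import numpy as np

def dawid_skene(votes, n_classes, n_iter=50, smoothing=0.01):
    """Generic Dawid-Skene EM.

    votes: iterable of (item, annotator, label) integer triples; every item needs >= 1 vote.
    Returns (post, conf): post[i, k] = P(item i has true class k),
    conf[a, k, l] = P(annotator a reports l | true class k).
    """
    votes = np.asarray(votes, dtype=int)
    n_items = votes[:, 0].max() + 1
    n_annotators = votes[:, 1].max() + 1

    # Initialise posteriors with per-item vote proportions (a soft majority vote).
    post = np.zeros((n_items, n_classes))
    for i, a, l in votes:
        post[i, l] += 1.0
    post /= post.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # M-step: class priors and smoothed per-annotator confusion matrices.
        prior = post.mean(axis=0)
        conf = np.full((n_annotators, n_classes, n_classes), smoothing)
        for i, a, l in votes:
            conf[a, :, l] += post[i]
        conf /= conf.sum(axis=2, keepdims=True)

        # E-step: recompute posteriors in log space for numerical stability.
        log_post = np.tile(np.log(prior + 1e-12), (n_items, 1))
        for i, a, l in votes:
            log_post[i] += np.log(conf[a, :, l])
        log_post -= log_post.max(axis=1, keepdims=True)
        post = np.exp(log_post)
        post /= post.sum(axis=1, keepdims=True)
    return post, conf

# Toy data: annotators 0 and 1 are always right, annotator 2 always answers class 0.
truth = [0, 1, 0, 1, 0, 1, 0, 1]
votes = []
for item, t in enumerate(truth):
    votes += [(item, 0, t), (item, 1, t), (item, 2, 0)]

post, conf = dawid_skene(votes, n_classes=2)

# ASSUMPTION: summarise each annotator by the mean diagonal of their confusion matrix.
accuracy = np.diagonal(conf, axis1=1, axis2=2).mean(axis=1)
print(post.argmax(axis=1), accuracy.round(2))
```

An annotator whose estimated accuracy falls well below that of their peers, like annotator 2 in this toy run, would be a candidate for removal under whatever threshold a project chooses.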

File

`triviaplus_dataset.parquet` — Cleaned dataset with all annotations.

See [DATA_DETAILS.md](DATA_DETAILS.md) for complete column descriptions, label aggregation logic, and label distributions.

Loading the Dataset

import pandas as pd

# Load the dataset
df = pd.read_parquet("triviaplus_dataset.parquet")

# Filter by split
train = df[df['split'] == 'train']
valid = df[df['split'] == 'valid']
test = df[df['split'] == 'test']

# Access sentence-level labels
for idx, row in df.head(3).iterrows():
    print(f"Question: {row['question'][:50]}...")
    print(f"Answer: {row['answer'][:50]}...")
    print(f"Sentences: {row['answer_sentence_list']}")
    print(f"Labels: {row['sentence_level_majority_vote']}")
    print(f"Response label: {row['response_level_label_binary']}")
    print()
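For sentence-level analysis it can help to flatten the per-response lists into one row per sentence. The sketch below uses a toy frame with the column names shown above. The example sentences and label strings are illustrative only, the output column names `sentence`, `label`, and `sample_idx` are made up for this example, and the actual label encoding is documented in DATA_DETAILS.md. Exploding several columns at once requires pandas 1.3 or later.

```python
import pandas as pd

# Toy stand-in using the README's column names; the contents are illustrative only.
toy = pd.DataFrame({
    "question": ["Who wrote Hamlet?", "When did WWII end?"],
    "answer_sentence_list": [
        ["Shakespeare wrote Hamlet.", "It was first staged in 1950."],
        ["WWII ended in 1945."],
    ],
    "sentence_level_majority_vote": [
        ["Supported", "Contradicted"],
        ["Supported"],
    ],
})

# One row per (sample, sentence); both list columns are exploded in lockstep.
sentences = (
    toy.explode(["answer_sentence_list", "sentence_level_majority_vote"])
       .rename(columns={"answer_sentence_list": "sentence",
                        "sentence_level_majority_vote": "label"})
       .rename_axis("sample_idx")
       .reset_index()
)

# Sentence-level label distribution.
label_counts = sentences["label"].value_counts()
print(sentences[["sample_idx", "sentence", "label"]])
print(label_counts)
```

The `sample_idx` column preserves the link back to the original response, so sentence-level results can still be grouped per sample.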

Verification

Run the label consistency check:

python verify_label_consistency.py triviaplus_dataset.parquet
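The README does not describe what `verify_label_consistency.py` checks, so the following is not a reimplementation of it. It is a small, independent sanity check one could run on a loaded frame, assuming only that each response should have exactly one label per sentence. The helper name `find_length_mismatches` and the toy data are hypothetical.

```python
import pandas as pd

def find_length_mismatches(df):
    """Return rows where the number of sentences differs from the number of labels."""
    n_sentences = df["answer_sentence_list"].map(len)
    n_labels = df["sentence_level_majority_vote"].map(len)
    return df[n_sentences != n_labels]

# Toy frame: the second row is deliberately missing one label.
toy = pd.DataFrame({
    "answer_sentence_list": [["A.", "B."], ["C.", "D."]],
    "sentence_level_majority_vote": [["Supported", "Supported"], ["Supported"]],
})

mismatches = find_length_mismatches(toy)
print(f"{len(mismatches)} row(s) with mismatched lengths")
```

On the real dataset, an empty result would indicate that the sentence and label lists line up for every sample.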

Citation

If you use this dataset, please cite our paper:

@article{chen2025rethinking,
  title={Rethinking Evaluation for LLM Hallucination Detection: A Desiderata, A New RAG-based Benchmark, New Insights},
  author={Chen, Wenbo and Padmanabhan, Veena and Giyahchi, Tootiya and Wong, Elaine and Akoglu, Leman},
  journal={arXiv preprint arXiv:2605.11330},
  year={2025}
}

License

This dataset is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License (CC BY-NC-ND 4.0).

See the [LICENSE](LICENSE) file for the full license text.

Notability

notability 3.0/10

Low traction, routine research repo

Amazon (Nova) has a repo signal matching data demand, evals and quality.