[NAACL 2025 Best Paper Award] BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of…
Captured source
source ↗[NAACL 2025 Best Paper Award] BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models | by LG AI Research | Medium
[NAACL 2025 Best Paper Award] BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models
LG AI Research
8 min read
May 2, 2025
Image 1. LG AI Research’s BiGGen Bench Study Selected as Best Paper at NAACL 2025[1]
In the era of AI, the importance of language models continues to grow. Reflecting this, a wide range of language models have been introduced, accompanied by the development of numerous benchmarks to evaluate their capabilities. However, many of these benchmarks rely on abstract criteria such as preference, helpfulness, and harmlessness — making it difficult to distinguish models with high granularity and reliability[2,3,4].
Faced with this limitation, we posed a critical question: “How can we identify the language model that best fits our needs and offers the highest utility?” To address this, LG AI Research’s Super Intelligence Lab partnered with Professor Minjoon Seo’s research team at KAIST to develop BiGGen Bench, a new benchmark for evaluating generative AI models. The project was a global collaboration involving researchers from Yonsei University, Carnegie Mellon University, Cornell University, MIT, the University of Washington, and the University of Illinois. To create BiGGen Bench, we defined nine core competencies of language models and 77 detailed task types, designing a total of 765 prompts and corresponding evaluation rubrics.
The study titled “The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models”[1] was presented at NAACL 2025, one of the most prestigious conferences in natural language processing, and was awarded the Best Paper Award, an honor given to only one paper among more than 2,000 submissions.
This recognition is significant. Of the 2,000+ papers submitted to NAACL 2025, roughly 1,400 were accepted, and only one received the Best Paper distinction. The selection of our work underscores the research value and practical impact of BiGGen Bench, and places it alongside past NAACL Best Paper recipients such as ELMo (2018) and BERT (2019), two groundbreaking studies in the history of language models.
Using BiGGen Bench, we also evaluated EXAONE 3.5, LG AI Research’s LLM released in December 2024. Among recent non-reasoning (“non-thinking”) models, EXAONE 3.5 demonstrated top-tier performance, achieving an average score of 4.189. The recognition of BiGGen Bench by a top-tier academic conference and EXAONE 3.5’s strong performance on the benchmark further validate our progress and commitment to excellence.
At NAACL 2025, we delivered an oral presentation on the BiGGen Bench study and had the opportunity to engage in valuable discussions with global researchers. This post takes a closer look at the motivations and methodology behind BiGGen Bench.
The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models[1]
Motivation for the Study
Continued progress in large language models (LLMs) depends on our ability to precisely diagnose their capabilities, identify areas for improvement, and guide model enhancement through structured evaluation. While machine learning offers a range of evaluation metrics, the generative AI domain — especially long-form generation — still lacks a standardized, reliable framework for fair and fine-grained assessment.
At LG AI Research, we are investigating self-improvement mechanisms in LLMs, guided by the following key research questions:
“How can we conduct high-resolution evaluation of language model outputs? And can a language model evolve through structured feedback?”
This led us to design and test a framework for automated self-improvement, built around a loop of generation, evaluation, and feedback. However, we soon confronted a fundamental trade-off in evaluation. Human experts can deliver highly reliable and nuanced assessments, but the process is labor-intensive and difficult to scale. LLM-based evaluators allow for scalable automation, but face challenges in accuracy, consistency, and interpretability.
To bridge this gap, we proposed a framework in which humans first define detailed evaluation rubrics across domains and task types, and then LLMs perform evaluations based on those guidelines. This hybrid approach allows for automated yet trustworthy evaluation and serves as the foundation for sustainable self-improvement in LLMs.
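The division of labor described above (humans write the rubric, an LLM applies it) can be sketched in a few lines of Python. This is a minimal illustration, not the BiGGen Bench implementation: the `Rubric` class, the prompt wording, the `Score: <n>` output convention, and the `judge` callable are all assumptions made for this example, and the 1-to-5 score range is likewise assumed.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class Rubric:
    """Hypothetical human-written rubric: one criterion, one description per score."""
    criterion: str
    score_descriptions: Dict[int, str]  # assumed scale, e.g. {1: "...", 5: "..."}


def build_judge_prompt(task: str, response: str, rubric: Rubric) -> str:
    """Assemble the text shown to the evaluator LLM (wording is an assumption)."""
    levels = "\n".join(
        f"Score {score}: {text}"
        for score, text in sorted(rubric.score_descriptions.items())
    )
    return (
        f"Task:\n{task}\n\n"
        f"Response to evaluate:\n{response}\n\n"
        f"Criterion: {rubric.criterion}\n{levels}\n\n"
        "Write brief feedback, then end with a line 'Score: <n>'."
    )


def parse_score(judge_output: str, rubric: Rubric) -> int:
    """Read the final 'Score: <n>' line and check it against the rubric."""
    match = re.search(r"Score:\s*(\d+)\s*$", judge_output.strip())
    if match is None:
        raise ValueError("judge output has no final 'Score: <n>' line")
    score = int(match.group(1))
    if score not in rubric.score_descriptions:
        raise ValueError(f"score {score} is not defined by the rubric")
    return score


def evaluate(task: str, response: str, rubric: Rubric,
             judge: Callable[[str], str]) -> int:
    """Run one rubric-guided evaluation; `judge` wraps any evaluator LLM."""
    return parse_score(judge(build_judge_prompt(task, response, rubric)), rubric)
```

For a quick check without a real model, a stub such as `lambda prompt: "The steps are valid.\nScore: 5"` can be passed as `judge`. Rejecting scores the rubric does not define keeps a malformed evaluator reply from silently entering the results.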
Key Contributions
As LLMs continue to take on more sophisticated tasks, the importance of granular and task-specific evaluation grows accordingly. Yet, many existing benchmarks have been criticized for their overly abstract criteria[2,3,4]. While some efforts have introduced domain-specific scoring rubrics[5], they often lack the resolution necessary for instance-level evaluation.
To overcome these challenges, we introduce BiGGen Bench, a new benchmark designed to comprehensively evaluate nine core abilities of LLMs. One of the key features of BiGGen Bench is its granular scoring logic, which is tailored to question types and designed to reflect subtle human judgment. For instance, when evaluating math problems, BiGGen Bench emphasizes logical reasoning and accuracy of computation, rather than relying solely on subjective helpfulness ratings.
Specifically, BiGGen Bench evaluates LLMs across the following nine competencies: Instruction Following, Grounding, Planning, Reasoning, Refinement, Safety, Theory of Mind, Tool Usage, and Multilingualism.
These are assessed through 77 task types and 765 test items, offering a multidimensional, high-resolution evaluation framework. See Image 2 for an overview.
Image 2. The 77 Task Types Included in BiGGen Bench[1]
1) Components and Evaluation Methodology of BiGGen Bench
Each BiGGen Bench instance consists of four elements: System Message, Input Prompt, Reference Answer, and a Scoring Rubric. The System Message defines the evaluator’s role (e.g., that of a teacher or expert). The Input Prompt describes the specific task or query the language model must respond to. The Reference Answer provides an ideal response to help guide…
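As a rough illustration of the four-element structure named above, one instance could be represented as a small record. The class name, field names, and JSON layout below are hypothetical and chosen for this sketch only; they are not taken from the released dataset.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class BenchInstance:
    """One benchmark item with the four elements listed in the text.

    Field names are illustrative assumptions, not the dataset's schema.
    """
    system_message: str
    input_prompt: str
    reference_answer: str
    scoring_rubric: str

    def __post_init__(self) -> None:
        # Every element is required, so reject empty or whitespace-only fields.
        for name, value in asdict(self).items():
            if not value.strip():
                raise ValueError(f"{name} must not be empty")

    def to_json(self) -> str:
        """Serialize the instance to a JSON object string."""
        return json.dumps(asdict(self), ensure_ascii=False)

    @classmethod
    def from_json(cls, text: str) -> "BenchInstance":
        """Rebuild an instance from the output of `to_json`."""
        return cls(**json.loads(text))
```

Keeping the reference answer and rubric in the same record as the prompt means each item carries everything an evaluator needs, which is what allows scoring at the level of individual instances.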