databricks/judges
Description: A small library of LLM judges
Language: Python
License: Apache-2.0
Stars: 338
Forks: 35
Open issues: 0
Created: 2024-09-17T04:21:03Z
Pushed: 2025-07-31T15:07:36Z
Default branch: main
Fork: no
Archived: yes
README:
judges ⚖️
1. [Overview](#overview)
2. [Installation](#installation)
3. [API](#api)
- [Types of Judges](#types-of-judges)
- [Classifiers](#classifiers)
- [Graders](#graders)
- [Using Judges](#using-judges)
- [Classifier Judges](#classifier-judges)
- [Combining Judges](#combining-judges)
- [Jury Object](#jury-object)
4. [Usage](#usage)
- [Pick a model](#pick-a-model)
- [Send data to an LLM](#send-data-to-an-llm)
- [Use a judges classifier LLM as an evaluator model](#use-a-judges-classifier-llm-as-an-evaluator-model)
- [Use a Jury for averaging and diversification](#use-a-jury-for-averaging-and-diversification)
- [Use AutoJudge to create a custom LLM judge](#use-autojudge-to-create-a-custom-llm-judge)
5. [Creating Custom Judges](#creating-custom-judges)
6. [CLI](#cli)
7. [Appendix of Judges](#appendix)
- [Classifiers](#classifiers)
- [Graders](#graders)
Overview
judges is a small library to use and create LLM-as-a-Judge evaluators. The purpose of judges is to have a curated set of LLM evaluators in a low-friction format across a variety of use cases that are backed by research, and can be used off-the-shelf or serve as inspiration for building your own LLM evaluators.
Installation
pip install judges
API
Types of Judges
The library provides two types of judges:
1. Classifiers: Return boolean values.
- True indicates the inputs passed the evaluation.
- False indicates the inputs did not pass the evaluation.
2. Graders: Return scores on a numerical or Likert scale.
- Numerical scale: 1 to 5
- Likert scale: terrible, bad, average, good, excellent
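The two grader scales have the same number of steps, so one way to compare them is to line them up by position. The sketch below does that under a loud assumption: the README lists both scales but does not say how they correspond, so the mapping and both helper names are hypothetical and are not part of the judges API.

```python
# Hypothetical mapping: the README lists both scales but does not state
# how they correspond. Position in this list is assumed to give the score.
LIKERT_LABELS = ["terrible", "bad", "average", "good", "excellent"]


def likert_to_number(label: str) -> int:
    """Map a Likert label to an integer from 1 to 5 by list position."""
    normalized = label.strip().lower()
    if normalized not in LIKERT_LABELS:
        raise ValueError(f"unknown Likert label: {label!r}")
    return LIKERT_LABELS.index(normalized) + 1


def number_to_likert(score: int) -> str:
    """Map an integer from 1 to 5 back to its assumed Likert label."""
    if not 1 <= score <= 5:
        raise ValueError(f"score must be between 1 and 5, got {score}")
    return LIKERT_LABELS[score - 1]
```

A conversion like this is only useful for comparing graders that report on different scales; check each grader's own prompt before relying on it.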
Using Judges
All judges can be used by calling the .judge() method. This method accepts the following parameters:
- input: The input to be evaluated.
- output: The output to be evaluated.
- expected (optional): The expected result for comparison.
The .judge() method returns a Judgment object with the following attributes:
- reasoning: The reasoning behind the judgment.
- score: The score assigned by the judge.
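To make the call shape concrete, here is a minimal sketch of a judge that follows the same contract without calling an LLM. `SketchJudgment` and `ExactMatchSketchJudge` are hypothetical stand-ins written for this illustration; they are not the library's classes, and a real judge would send a prompt to a model instead of comparing strings.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class SketchJudgment:
    """Hypothetical stand-in for the Judgment object (not the library's class)."""
    reasoning: str
    score: Union[bool, int, str]


class ExactMatchSketchJudge:
    """Toy judge that compares strings locally instead of calling an LLM."""

    def judge(self, input: str, output: str, expected: Optional[str] = None) -> SketchJudgment:
        # Without an expected answer this toy judge has nothing to compare against.
        if expected is None:
            return SketchJudgment(reasoning="No expected answer was provided.", score=False)
        matched = output.strip().lower() == expected.strip().lower()
        reasoning = (
            "The output matches the expected answer."
            if matched
            else "The output differs from the expected answer."
        )
        return SketchJudgment(reasoning=reasoning, score=matched)


judgment = ExactMatchSketchJudge().judge(
    input="What is the name of the rabbit?",
    output="I don't know",
    expected="I don't know",
)
print(judgment.reasoning)
print(judgment.score)
```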
Classifier Judges
If the underlying prompt for a classifier judge produces a label equivalent to True or False (e.g., good or bad, yes or no, 0 or 1), the judges library automatically resolves it so that the Judgment's score is always a boolean.
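The resolution step can be pictured as a small lookup. The sketch below is an assumption about how such normalization might work, not the library's implementation; `resolve_label` is a hypothetical helper, and the label sets contain only the pairs named above.

```python
# Label pairs taken from the examples above; the library may accept others.
TRUE_LABELS = {"true", "good", "yes", "1"}
FALSE_LABELS = {"false", "bad", "no", "0"}


def resolve_label(raw) -> bool:
    """Resolve a raw classifier label to a boolean (hypothetical helper)."""
    if isinstance(raw, bool):
        return raw
    text = str(raw).strip().lower()
    if text in TRUE_LABELS:
        return True
    if text in FALSE_LABELS:
        return False
    # Refuse to guess when the label is outside the known pairs.
    raise ValueError(f"cannot resolve label to a boolean: {raw!r}")
```

Raising on unknown labels, rather than defaulting to False, keeps a malformed model response from being silently counted as a failed evaluation.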
Combining Judges
The library also provides an interface to combine multiple judges through the Jury object. The Jury object has a .vote() method that produces a Verdict.
Jury Object
- .vote(): Combines the judgments of multiple judges and produces a Verdict.
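As a rough illustration of what combining judgments can mean, the sketch below aggregates a list of scores. It is not the Jury implementation: `vote_sketch` is a hypothetical function, "average" is the only voting method named in this README, and the "majority" rule is an assumption added here for contrast.

```python
from typing import Sequence, Union


def vote_sketch(
    scores: Sequence[Union[bool, int, float]],
    voting_method: str = "average",
) -> Union[float, bool]:
    """Combine several judges' scores into one value (illustrative only)."""
    if not scores:
        raise ValueError("at least one score is required")
    if voting_method == "average":
        # Booleans count as 1.0 (True) and 0.0 (False).
        numeric = [float(s) for s in scores]
        return sum(numeric) / len(numeric)
    if voting_method == "majority":
        # Assumed rule: pass only when more than half of the scores are truthy.
        return sum(1 for s in scores if s) > len(scores) / 2
    raise ValueError(f"unknown voting method: {voting_method!r}")
```

With two classifier judges that disagree, the average is 0.5, which signals disagreement; the assumed majority rule treats the same tie as a failure.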
Usage
Pick a model
By default, judges uses `instructor` for structured outputs and model access because it is widely used. To get started, set your OPENAI_API_KEY, or the API key for whichever model provider you want to use. Refer to the instructor docs for more providers.
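A missing key usually surfaces as an authentication error deep inside the first model call, so it can help to check for it up front. The sketch below uses only the standard library; `require_api_key` is a hypothetical helper, not something the judges library provides.

```python
import os


def require_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Return the API key from the environment, or fail with a clear message."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"{env_var} is not set; export it before creating a judge")
    return key
```

Calling `require_api_key()` at the top of a script fails fast; pass a different variable name for other providers.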
Send data to an LLM
Next, if you'd like to use this package, you can follow the examples in the examples directory, or follow the code below:
from openai import OpenAI
client = OpenAI()
question = "What is the name of the rabbit in the following story. Respond with 'I don't know' if you don't know."
story = """
Fig was a small, scruffy dog with a big personality. He lived in a quiet little town where everyone knew his name. Fig loved adventures, and every day he would roam the neighborhood, wagging his tail and sniffing out new things to explore.
One day, Fig discovered a mysterious trail of footprints leading into the woods. Curiosity got the best of him, and he followed them deep into the trees. As he trotted along, he heard rustling in the bushes and suddenly, out popped a rabbit! The rabbit looked at Fig with wide eyes and darted off.
But instead of chasing it, Fig barked in excitement, as if saying, "Nice to meet you!" The rabbit stopped, surprised, and came back. They sat together for a moment, sharing the calm of the woods.
From that day on, Fig had a new friend. Every afternoon, the two of them would meet in the same spot, enjoying the quiet companionship of an unlikely friendship. Fig's adventurous heart had found a little peace in the simple joy of being with his new friend.
"""
# set up the input prompt
input = f'{story}\n\nQuestion:{question}'
# write down what the model is expected to respond with
# NOTE: not all judges require an expected answer. refer to the implementations
expected = "I don't know"
# get the model output
output = client.chat.completions.create(
    model='gpt-4o-mini',
    messages=[
        {
            'role': 'user',
            'content': input,
        },
    ],
).choices[0].message.content
Use a judges classifier LLM as an evaluator model
from judges.classifiers.correctness import PollMultihopCorrectness
# use the correctness classifier to determine if the first model
# answered correctly
correctness = PollMultihopCorrectness(model='openai/gpt-4o-mini')
judgment = correctness.judge(
    input=input,
    output=output,
    expected=expected,
)
print(judgment.reasoning)
# The 'Answer' provided ('I don't know') matches the 'Reference' text which also states 'I don't know'. Therefore, the 'Answer' correctly corresponds with the information given in the 'Reference'.
print(judgment.score)
# True
Use a Jury for averaging and diversification
A jury of LLMs can produce more diverse results and lets you combine the judgments of multiple LLMs.
from judges import Jury
from judges.classifiers.correctness import PollMultihopCorrectness, RAFTCorrectness
poll = PollMultihopCorrectness(model='openai/gpt-4o')
raft = RAFTCorrectness(model='openai/gpt-4o-mini')
jury = Jury(judges=[poll, raft], voting_method="average")
verdict = jury.vote(
    input=input,
    output=output,
    expected=expected,
)
print(verdict.score)
Use AutoJudge to create a custom LLM judge
autojudge is an extension to the judges library that builds on our previous work aligning judges to human feedback -- given a labeled dataset with feedback and a natural language description of an evaluation task, autojudge...
Notability
5.0/10. Solid new repo from Databricks, moderate stars.