Introducing judges: A Library of Research-Backed LLM-as-a-Judge Evaluators
AI evaluation is a critical component in building reliable, impactful models. We're excited to introduce judges, an open-source library of LLM-as-a-judge evaluators designed to help bootstrap your evaluation process. Complementing judges is autojudge, an extension that automatically creates evaluators aligned with human feedback.

judges: Research-Backed LLM-as-a-Judge

judges provides a curated collection of evaluators, backed by published research, to help jumpstart your LLM evaluation process. These LLM-as-a-judge evaluators can be used either out-of-the-box or as a foundation for your specific needs.

Key Features of judges:

- Curated, Research-Backed LLM-as-a-Judge Prompts: Every judge prompt is thoughtfully designed based on published research and curated to ensure high-quality evaluations.
- Juries: A jury of LLMs enables more diverse results by combining judgments from multiple LLMs (see the sketch after this list).
- Flexible Model Integration: Compatible with both open-source and closed-source models through OpenAI and LiteLLM integrations.
- Human-Aligned Evaluators: autojudge automatically builds human-aligned LLM-as-a-judge prompts from small labeled datasets.
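
For instance, a jury might be assembled roughly like the minimal sketch below. This assumes the Jury class, its vote() method, and a second correctness judge (RAFTCorrectness) are available in the library; the input, output, and expected values are placeholders.

from judges import Jury
from judges.classifiers.correctness import PollMultihopCorrectness, RAFTCorrectness

# combine two correctness judges backed by different models
poll = PollMultihopCorrectness(model="gpt-4o")
raft = RAFTCorrectness(model="gpt-4o-mini")

jury = Jury(judges=[poll, raft], voting_method="average")

# placeholder data to evaluate
verdict = jury.vote(
    input="What is the capital of France?",
    output="Paris",
    expected="Paris",
)
print(verdict.score)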

Getting started with judges

Install the library:
pip install judges

Pick a model:
- OpenAI: By default, judges uses the OpenAI client and models. To get started, you'll need an OpenAI API key set as the environment variable OPENAI_API_KEY.
- LiteLLM: judges also integrates with litellm to allow access to most other models. Run pip install "judges[litellm]", and set the appropriate API keys based on the LiteLLM Docs (a brief setup sketch follows this list).
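
As a minimal setup sketch, assuming API keys are read from environment variables (as with the default OpenAI client) and that non-OpenAI model names are routed through LiteLLM once the extra is installed; the keys and the commented-out model string below are illustrative assumptions, not verified names:

import os

from judges.classifiers.correctness import PollMultihopCorrectness

# illustrative placeholder keys; set only the providers you actually use
os.environ["OPENAI_API_KEY"] = "sk-..."          # default OpenAI path
os.environ["ANTHROPIC_API_KEY"] = "sk-ant-..."   # example LiteLLM-backed provider (assumption)

# default: an OpenAI model
correctness = PollMultihopCorrectness(model="gpt-4o-mini")

# assumption: with judges[litellm] installed, a model name recognized by LiteLLM
# can be passed the same way; check the LiteLLM Docs for the exact string
# correctness = PollMultihopCorrectness(model="claude-3-5-sonnet-20240620")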
Send data to an LLM:
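The judge in the next step expects an input, the model output to grade, and an expected reference answer. Here is a minimal sketch of producing them, assuming the standard OpenAI Python client; the question, story, and reference answer are illustrative placeholders.

from openai import OpenAI

client = OpenAI()

# illustrative placeholders for the data being evaluated
question = "What is the name of the rabbit in the following story? Respond with 'I don't know' if you don't know."
story = "A pig and a dog lived happily together on a small farm."
input = f"{story}\n\nQuestion: {question}"
expected = "I don't know"

# send the data to an LLM to get the output we want to evaluate
output = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": input}],
).choices[0].message.content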

Use a judges classifier LLM as an evaluator model:
from judges.classifiers.correctness import PollMultihopCorrectness

# use the correctness classifier to determine if the first model
# answered correctly
correctness = PollMultihopCorrectness(model='gpt-4o-mini')

judgment = correctness.judge(
    input=input,
    output=output,
    expected=expected,
)

print(judgment.reasoning)
# The 'Answer' provided ('I don't know') matches the 'Reference' text which also states 'I don't know'. Therefore, the 'Answer' correctly corresponds with the information given in the 'Reference'.

print(judgment.score)
# True

autojudge: Automating Human-Aligned Evaluations

While judges provides ready-to-use evaluators, autojudge extends this functionality by automating evaluator creation. Given a labeled dataset with feedback and a natural language description of an evaluation task, it generates grading notes for an evaluator prompt, streamlining the process of building new evaluators.

How autojudge Works:

Install the library extension:
pip install "judges[auto]"

Prepare your dataset:
Your dataset can be either a list of dictionaries or a path to a CSV file with the following fields:
- input: The input provided to your model
- output: The model's response
- label: 1 for correct, 0 for incorrect
- feedback: Feedback explaining why the response is correct or incorrect
dataset = [
    {
        "input": "What's the best time to visit Paris?",
        "output": "The best time to visit Paris is during the spring or fall.",
        "label": 1,
        "feedback": "Provides accurate and detailed advice."
    },
    {
        "input": "Can I ride a dragon in Scotland?",
        "output": "Yes, dragons are commonly seen in the highlands and can be ridden with proper training.",
        "label": 0,
        "feedback": "Dragons are mythical creatures; the information is fictional."
    }
]

Initialize your autojudge:
from judges.classifiers.auto import AutoJudge
dataset = [
    {
        "input": "Can I ride a dragon in Scotland?",
        "output": "Yes, dragons are commonly seen in the highlands and can be ridden with proper training.",
        "label": 0,
        "feedback": "Dragons are mythical creatures; the information is fictional.",
    },
    {
        "input": "Can you recommend a good hotel in Tokyo?",
        "output": "Certainly! Hotel Sunroute Plaza Shinjuku is highly rated for its location and amenities. It offers comfortable rooms and excellent service.",
        "label": 1,
        "feedback": "Offers a specific and helpful recommendation.",
    },
    {
        "input": "Can I drink tap water in London?",
        "output": "Yes, tap water in London is safe to drink and meets high quality standards.",
        "label": 1,
        "feedback": "Gives clear and reassuring information.",
    },
    {
        "input": "What's the boiling point of water on the moon?",
        "output": "The boiling point of water on the moon is 100°C, the same as on Earth.",
        "label": 0,
        "feedback": "Boiling point varies with pressure; the moon's vacuum affects it.",
    }
]

# Task description
task = "Evaluate responses for accuracy, clarity, and helpfulness."

# Initialize autojudge
autojudge = AutoJudge.from_dataset(
    dataset=dataset,
    task=task,
    model="gpt-4-turbo-2024-04-09",
    # increase workers for speed
    # max_workers=2,
    # generated prompts are automatically saved to disk
    # save_to_disk=False,
)
Use your judge to evaluate new input-output pairs:
# Input-output pair to evaluate
input_ = "What are the top attractions in New York City?"
output = "Some top attractions in NYC include the Statue of Liberty and Central Park."
# Get the judgment
judgment = autojudge.judge(input=input_, output=output)
# Print the judgment
print(judgment.reasoning)
# The response accurately lists popular attractions like the Statue of Liberty and Central Park, which are well-known and relevant to the user's query.
print(judgment.score)
# True (correct)

Why autojudge Matters:

Human-aligned evaluations are essential for developing models that meet user expectations. autojudge provides a seamless way to automate high-quality evaluations and integrate them into development pipelines.
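
As a rough sketch of what that integration could look like, you might run the evaluator built above over a batch of logged input/output pairs and track a pass rate; the records below are hypothetical examples.

# hypothetical batch of records pulled from application logs
records = [
    {"input": "What are the top attractions in New York City?",
     "output": "Some top attractions in NYC include the Statue of Liberty and Central Park."},
    {"input": "Is the Eiffel Tower in Rome?",
     "output": "Yes, the Eiffel Tower is one of Rome's most famous landmarks."},
]

# score every record with the autojudge built above
judgments = [autojudge.judge(input=r["input"], output=r["output"]) for r in records]

# boolean scores make it easy to track a pass rate over time
pass_rate = sum(j.score for j in judgments) / len(judgments)
print(f"pass rate: {pass_rate:.0%}")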

How judges and autojudge Work Together

Together, judges and autojudge provide a comprehensive framework for bootstrapping AI evaluation:
- judges provides a research-backed foundation of ready-to-use LLM-as-a-judge evaluators.
- autojudge automates the creation of new human-aligned evaluators, enabling scalable and consistent assessments across diverse tasks.

This combination helps AI developers and researchers quickly kick off their evaluation process and scale it for real-world applications.

Getting Started

Ready? Here's how to begin:
- Explore judges: Visit the GitHub repository to learn more about using LLM-as-a-judge evaluators.
- Experiment with autojudge: Use autojudge to create scalable, human-aligned evaluations that fit your workflow.
- Join the community: Have LLM-as-a-judge evaluators you'd like to contribute? Consider making a pull request and helping expand our collection of evaluation tools.