Introducing judges: A Library of Research-Backed LLM-as-a-Judge Evaluators

January 9, 2025
Research
Julia Neagu
Freddie Vargus

AI evaluation is a critical component in building reliable, impactful models. We're excited to introduce judges, an open-source library of LLM-as-a-judge evaluators designed to help bootstrap your evaluation process. Complementing judges is autojudge, an extension that automatically creates evaluators aligned with human feedback.

judges: Research-Backed LLM-as-a-Judge

judges provides a curated collection of evaluators, backed by published research, to help jumpstart your LLM evaluation process. These LLM-as-a-judge evaluators can be used either out-of-the-box or as a foundation for your specific needs.

Key Features of judges:

  • Curated, Research-Backed LLM-as-a-judge Prompts: Every judge prompt is thoughtfully designed from published research and curated to ensure high-quality evaluations.
  • Juries: A jury of LLMs produces more diverse results by combining judgments from multiple models (see the sketch after this list).
  • Flexible Model Integration: Compatible with both open-source and closed-source models through OpenAI and LiteLLM integrations.
  • Human-Aligned Evaluators: autojudge automatically builds human-aligned LLM-as-a-judge prompts from small labeled datasets.
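
To illustrate the jury feature, here is a minimal sketch that combines two correctness judges backed by different underlying models and averages their votes. It assumes the Jury interface described in the project README (a list of judges, a voting_method, and a vote method); check the repository for the exact signature.

from judges import Jury
from judges.classifiers.correctness import PollMultihopCorrectness

# two correctness judges backed by different underlying models
judge_a = PollMultihopCorrectness(model='gpt-4o')
judge_b = PollMultihopCorrectness(model='gpt-4o-mini')

# average the individual votes into a single verdict
jury = Jury(judges=[judge_a, judge_b], voting_method='average')

# input, output, and expected are the user query, the model's response,
# and the reference answer, as in the example further below
verdict = jury.vote(
    input=input,
    output=output,
    expected=expected,
)

print(verdict.score)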

Getting started with judges

Install the library:

pip install judges

Pick a model:

  • OpenAI:
    • By default, judges uses the OpenAI client and models. To get started, you'll need an OpenAI API key set as the environment variable OPENAI_API_KEY (see the sketch after this list).
  • LiteLLM:
    • judges also integrates with LiteLLM to provide access to most other models. Run pip install "judges[litellm]", and set the appropriate API keys based on the LiteLLM docs.
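
As a minimal sketch of the default OpenAI setup, the key can be set from Python before constructing a judge (the value below is a placeholder; in practice, export OPENAI_API_KEY in your shell or load it from a secrets manager):

import os

# placeholder value; never hard-code a real key in source
os.environ["OPENAI_API_KEY"] = "sk-..."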

Send data to an LLM:

from judges.classifiers.correctness import PollMultihopCorrectness

# use the correctness classifier to determine whether the model's response was correct
correctness = PollMultihopCorrectness(model='gpt-4o-mini')

# input, output, and expected are strings holding the user query,
# the model's response, and the reference answer, respectively
judgment = correctness.judge(
    input=input,
    output=output,
    expected=expected,
)

print(judgment.reasoning)
# The 'Answer' provided ('I don't know') matches the 'Reference' text which also states 'I don't know'. Therefore, the 'Answer' correctly corresponds with the information given in the 'Reference'.

print(judgment.score)
# True

autojudge: Automating Human-Aligned Evaluations

While judges provides ready-to-use evaluators, autojudge extends this functionality by automating evaluator creation. Given a labeled dataset with feedback and a natural language description of an evaluation task, it generates grading notes for an evaluator prompt, streamlining the process of building new evaluators.

How autojudge Works:

Install the library extension:

pip install "judges[auto]"

Prepare your dataset:

Your dataset can be either a list of dictionaries or a path to a CSV file with the following fields:

  • input: The input provided to your model
  • output: The model's response
  • label: 1 for correct, 0 for incorrect
  • feedback: Feedback explaining why the response is correct or incorrect

dataset = [
    {
        "input": "What's the best time to visit Paris?",
        "output": "The best time to visit Paris is during the spring or fall.",
        "label": 1,
        "feedback": "Provides accurate and detailed advice."
    },
    {
        "input": "Can I ride a dragon in Scotland?",
        "output": "Yes, dragons are commonly seen in the highlands and can be ridden with proper training",
        "label": 0,
        "feedback": "Dragons are mythical creatures; the information is fictional."
    }
]

Initialize your autojudge:

from judges.classifiers.auto import AutoJudge

dataset = [
    {
        "input": "Can I ride a dragon in Scotland?",
        "output": "Yes, dragons are commonly seen in the highlands and can be ridden with proper training.",
        "label": 0,
        "feedback": "Dragons are mythical creatures; the information is fictional.",
    },
    {
        "input": "Can you recommend a good hotel in Tokyo?",
        "output": "Certainly! Hotel Sunroute Plaza Shinjuku is highly rated for its location and amenities. It offers comfortable rooms and excellent service.",
        "label": 1,
        "feedback": "Offers a specific and helpful recommendation.",
    },
    {
        "input": "Can I drink tap water in London?",
        "output": "Yes, tap water in London is safe to drink and meets high quality standards.",
        "label": 1,
        "feedback": "Gives clear and reassuring information.",
    },
    {
        "input": "What's the boiling point of water on the moon?",
        "output": "The boiling point of water on the moon is 100Β°C, the same as on Earth.",
        "label": 0,
        "feedback": "Boiling point varies with pressure; the moon's vacuum affects it.",
    }
]

# Task description
task = "Evaluate responses for accuracy, clarity, and helpfulness."

# Initialize autojudge
autojudge = AutoJudge.from_dataset(
    dataset=dataset,
    task=task,
    model="gpt-4-turbo-2024-04-09",
    # increase workers for speed ⚑
    # max_workers=2,
    # generated prompts are automatically saved to disk
    # save_to_disk=False,
)

Use your judge to evaluate new input-output pairs:

# Input-output pair to evaluate
input_ = "What are the top attractions in New York City?"
output = "Some top attractions in NYC include the Statue of Liberty and Central Park."

# Get the judgment
judgment = autojudge.judge(input=input_, output=output)

# Print the judgment
print(judgment.reasoning)
# The response accurately lists popular attractions like the Statue of Liberty and Central Park, which are well-known and relevant to the user's query.

print(judgment.score)
# True (correct)

Why autojudge Matters:

Human-aligned evaluations are essential for developing models that meet user expectations. autojudge provides a seamless way to automate high-quality evaluations and integrate them into development pipelines, for example by scoring a held-out test set as in the sketch below.
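
As a minimal sketch, a judge generated by autojudge can be dropped into a simple evaluation loop. Here, test_set is a hypothetical held-out set of input-output pairs, and autojudge is the judge built with AutoJudge.from_dataset above:

# hypothetical held-out set of input-output pairs to score
test_set = [
    {"input": "Is the Eiffel Tower in Berlin?", "output": "No, the Eiffel Tower is in Paris."},
    {"input": "How long is a flight from London to New York?", "output": "Typically around 7 to 8 hours."},
]

# score every pair with the judge built above
judgments = [autojudge.judge(input=row["input"], output=row["output"]) for row in test_set]

# judgment.score is a boolean, so the mean is the pass rate
pass_rate = sum(j.score for j in judgments) / len(judgments)
print(f"pass rate: {pass_rate:.0%}")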

How judges and autojudge Work Together

Together, judges and autojudge provide a comprehensive framework for bootstrapping AI evaluation:

  • judges provides a research-backed foundation of ready-to-use LLM-as-a-judge evaluators.
  • autojudge automates the creation of new human-aligned evaluators, enabling scalable and consistent assessments across diverse tasks.

This combination helps AI developers and researchers quickly kick-off their evaluation process and scale it for real-world applications.

Getting Started

Ready? Here's how to begin:

  1. Explore judges: Visit the GitHub repository to learn more about using LLM-as-a-judge evaluators.
  2. Experiment with autojudge: Use autojudge to create scalable, human-aligned evaluations that fit your workflow.
  3. Join the community: Have LLM-as-a-judge evaluators you'd like to contribute? Consider making a pull request and helping expand our collection of evaluation tools.