Building enterprise-grade customer support AI requires domain-specific evaluation

Deanna Emery
April 25, 2024
Research

Early adopters of generative AI solutions have focused on transforming how businesses interact with their customers. While there have been a number of successes, the prevailing stories are of companies deploying LLM-powered tools prematurely, with significant negative consequences for themselves and their customers. These failures repeatedly highlight the risks of shipping generative AI products without comprehensive, real-world, domain-specific testing.

Good evaluations must replicate real product usage as closely as possible. Benchmarks can be a good starting point for evaluating LLMs, but they fail to capture the real-world performance of a generative AI product.

Realistic, domain-specific evaluation is the single most impactful step AI developers can take to ensure their products are suited for real-world applications and reduce deployment risks.

Quotient’s tools enable AI developers to rapidly build evaluation frameworks that account for their particular tasks, domains, and even organizational knowledge, and to make decisions that actually correlate with ultimate product performance.

In this blog, we compared benchmark evaluation to domain-specific evaluation using Quotient’s platform, starting with one of the most ubiquitous enterprise generative AI use-cases: customer support agent augmentation.

Here’s what we found:

1️⃣ Evaluating on benchmarks selects the wrong models for domain-specific tasks.

2️⃣ Open source LLMs can outperform proprietary models on domain-specific benchmarks.

3️⃣ Benchmarks can overestimate the risk of hallucinations by 15x.

And here’s how we got to these results:

We generated a domain-specific evaluation dataset for customer support


We opted for a reference-based evaluation setup. Here, the quality metrics of the LLM system are calculated by comparing its outputs to those from a reference dataset.
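Conceptually, this boils down to scoring each model output against its corresponding reference output and aggregating the scores. Below is a minimal sketch of that loop; the summarize function and score_summary metric are hypothetical stand-ins, not Quotient's implementation.

```python
# Minimal sketch of a reference-based evaluation loop.
# summarize() and score_summary() are hypothetical stand-ins for the model
# under test and a quality metric (e.g. ROUGE or BERTScore).

def evaluate(dataset, summarize, score_summary):
    """dataset: iterable of {"conversation": str, "reference_summary": str} records."""
    scores = []
    for example in dataset:
        candidate = summarize(example["conversation"])       # model output
        reference = example["reference_summary"]             # ground-truth summary
        scores.append(score_summary(candidate, reference))   # compare the two
    return sum(scores) / len(scores)                         # average quality score
```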

To ensure representative and realistic evaluations, we used Quotient to generate synthetic datasets for customer support use-cases, starting from a seed dataset of existing customer support logs.

For this experiment, we chose a customer support summarization dataset, containing 95 realistic chat conversations between a customer and a support agent, including a diverse range of customer issues, product types, and chat sentiments.

This dataset enabled us to evaluate our AI agents on their ability to assist human agents by providing concise summaries of customer support conversations.

We set up multiple evaluation jobs 


We tested three models: GPT-4 and two of the most commonly used open-source models, Llama-2 and Mistral-Instruct-v0.2.
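For illustration, here is one way to generate summaries from GPT-4 via the OpenAI API and from an open-source model via Hugging Face transformers. The prompt template, model identifiers, and generation settings below are assumptions made for the sketch, not the exact configuration behind our evaluation jobs.

```python
from openai import OpenAI
from transformers import pipeline

# Assumed prompt template; the actual prompts live in Quotient's AI recipes.
PROMPT = "Summarize the following customer support conversation:\n\n{conversation}"

def summarize_with_gpt4(conversation: str) -> str:
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": PROMPT.format(conversation=conversation)}],
    )
    return response.choices[0].message.content

def summarize_with_open_source(conversation: str,
                               model_id: str = "mistralai/Mistral-7B-Instruct-v0.2") -> str:
    # e.g. "mistralai/Mistral-7B-Instruct-v0.2" or "meta-llama/Llama-2-7b-chat-hf"
    generator = pipeline("text-generation", model=model_id)
    outputs = generator(
        PROMPT.format(conversation=conversation),
        max_new_tokens=256,
        return_full_text=False,
    )
    return outputs[0]["generated_text"]
```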

We kept track of our models and prompts using Quotient’s AI recipes.

We used three evaluation datasets: two of the top open-source summarization benchmarks, CNN Daily Mail (news articles from CNN and the Daily Mail, paired with article highlights) and SAMSum (chat-message dialogues paired with human-written summaries), plus Quotient’s evaluation dataset for customer support.
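The two public benchmarks are available on the Hugging Face Hub; the customer support dataset is generated on Quotient's platform, so the local file below is a hypothetical placeholder. A rough loading sketch:

```python
from datasets import load_dataset

# Public summarization benchmarks (column names per their Hugging Face dataset cards)
cnn_dm = load_dataset("cnn_dailymail", "3.0.0", split="test")  # "article" / "highlights"
samsum = load_dataset("samsum", split="test")                   # "dialogue" / "summary"

# The customer support dataset comes from Quotient's platform; a local JSONL file
# with "conversation" / "reference_summary" fields is assumed here as a stand-in.
support = load_dataset("json", data_files="customer_support_eval.jsonl", split="train")
```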

As part of each evaluation job, we computed commonly-used metrics for evaluating summarization tasks (a sketch of how they can be computed follows the list):

  • ROUGE
  • BERTScore
  • BERT sentence similarity, which measures how semantically similar the model output is to the reference output
  • Knowledge F1 score (KF1), which measures the overlap of words between the model's output and the provided context
  • Hallucination Rate, which captures factually incorrect, misleading, or unfounded statements in the model output. We used SelfCheckGPT, which measures model hallucinations by quantifying the extent to which sentences in the model's answer are based on information provided in the context. For plotting purposes, we report Faithfulness (i.e., 1 - Hallucination Rate).
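As a rough sketch, the first four metrics can be reproduced with common open-source libraries (rouge-score, bert-score, sentence-transformers); the KF1 function below is a straightforward token-overlap F1, and SelfCheckGPT is omitted. Quotient's own implementations may differ.

```python
from collections import Counter

from bert_score import score as bert_score
from rouge_score import rouge_scorer
from sentence_transformers import SentenceTransformer, util

def knowledge_f1(candidate: str, context: str) -> float:
    """Token-overlap F1 between the model output and the provided context."""
    cand, ctx = candidate.lower().split(), context.lower().split()
    overlap = sum((Counter(cand) & Counter(ctx)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(cand), overlap / len(ctx)
    return 2 * precision * recall / (precision + recall)

def compute_metrics(candidate: str, reference: str, context: str) -> dict:
    # ROUGE-L F-measure between the model output and the reference summary
    rouge = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True).score(reference, candidate)

    # BERTScore F1: token-level semantic similarity against the reference
    _, _, f1 = bert_score([candidate], [reference], lang="en")

    # Sentence-level semantic similarity from sentence-transformers embeddings
    embedder = SentenceTransformer("all-MiniLM-L6-v2")
    similarity = util.cos_sim(embedder.encode(candidate), embedder.encode(reference)).item()

    return {
        "rougeL": rouge["rougeL"].fmeasure,
        "bertscore_f1": f1.item(),
        "sentence_similarity": similarity,
        "kf1": knowledge_f1(candidate, context),
    }
```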

Here are the results we got:

Evaluating on benchmarks selects the wrong models for domain-specific tasks


We observed that the LLMs that best summarized the customer support dataset were not the ones that benchmark evaluations pointed to. Additionally, domain-specific evaluation indicates that different models may be the optimal choice for different use-cases and tasks – this can be a hidden-in-plain-sight opportunity for developers to quickly improve their LLM products.

For all metrics except the hallucination metric (Faithfulness), the top-performing model for Customer Support was not the best for either benchmark dataset. For example, if we focus on the ROUGE and BERTScore metrics, Mistral performs the best on Quotient’s Customer Support Dataset. However, when we ask the models to summarize the SAMSum dataset, these same metrics tell us that GPT-4 is the better performer. And when we have the models summarize the CNN Daily Mail dataset, these metrics show that Llama-2 outperforms the others.

Evaluating models on their ability to summarize benchmark datasets can yield inaccurate results, which could lead developers to select suboptimal models for specific tasks. For the best results, evaluation should be domain-specific. This means that teams will likely need to use different models for different use cases, and will need to adopt a comprehensive evaluation strategy that they can employ at each step of the way.

Open source LLMs can outperform proprietary models on domain-specific benchmarks


Surprisingly, our research indicates that proprietary models like OpenAI’s GPT-4 do not always perform better than open-source options. For example, Mistral-7b-Instruct performs better than GPT-4 on customer support summarization tasks when considering quality metrics such as BERTScore and BERT sentence similarity; and, across the board, both Mistral-7b-Instruct and Llama-2 are neck and neck with GPT-4.

Open-source models are often cheaper and pose fewer data privacy concerns than proprietary alternatives. Additionally, users can easily adapt open-source LLMs to suit their specific needs and applications through self-hosted deployments or fine-tuning. The possibility that open-source LLMs could be outperforming proprietary models on domain-specific tasks fundamentally changes the status quo in the AI industry today.

Benchmark datasets can overestimate the risk of hallucinations by 15x


Our experiments showed that relying on benchmark leaderboard datasets like CNN Daily Mail can overestimate the risk of hallucinations for domain-specific applications. On a scale from 0 (faithful) to 1 (hallucinated), the hallucination rate across the three models averaged 0.018 on the customer support dataset versus 0.279 on the CNN Daily Mail benchmark, more than 15 times higher (0.279 / 0.018 ≈ 15.5).

Hallucinations are instances where AI systems generate incorrect or misleading information that appears plausible but is not grounded in reality. They are a primary concern for enterprises building AI solutions because they pose critical risks such as misinformation, reputational damage, legal and regulatory issues, and loss of customer trust and satisfaction. While developers should not underestimate the likelihood that their AI products will hallucinate, they should not overestimate it either, as this can lead to unnecessary time and resources spent on hallucination mitigation.

Hallucination evaluation therefore needs to be accurate and robust, and we observed that a necessary condition for this is that the evaluation be domain-specific.

If you are working on similar problems, we’d love to hear from you!

Reach out at contact@quotientai.co.
