Supercharge Your AI Evaluation Pipeline with Quotient's SDK
We're thrilled to announce the release of evaluations in Quotient's Python SDK!
Whether you're testing different models, iterating on prompts, or validating outputs against ground-truth data, our platform makes evaluating AI applications so easy that integrating it into your routine developer workflow is a no-brainer.
We believe developers should focus on building amazing products, not babysitting infrastructure. We've been committed from day one to making comprehensive evaluation as simple as possible - just a few lines of code and a few minutes. Behind the scenes, our distributed infrastructure handles all the heavy lifting asynchronously, so you can get back to what matters.
Quotient Evaluations in a Nutshell
- Simple Python Interface: Write evaluations in familiar Python syntax and execute them on our robust infrastructure, eliminating the complexity of setting up evaluation environments.
- Asynchronous Evaluation Engine: Launch large-scale evaluations and retrieve results on your schedule - no need to provision or maintain compute resources.
- Built-in Version Control: Keep track of every change to your prompts and datasets as you iterate, ensuring reproducibility and clear documentation of your evaluation process.
- Integrated Development Flow: Move effortlessly between interactive prompt development in PromptLab and systematic testing at scale with the SDK.
- Research-Backed Evaluators: Leverage our carefully curated suite of evaluation metrics, developed by AI researchers for common use cases.
A Real-World Example: Tax Form Q&A
With tax season upon us in the US, and new state-of-the-art models being released every other day, we have a perfect opportunity to showcase Evaluations in Quotient's SDK.
We'll evaluate how two of the latest and most capable models, DeepSeek R1 and OpenAI's o1, handle tax form instruction questions, comparing their ability to provide clear, accurate tax guidance.
The full cookbook is available here.
The Evaluation Dataset
For this evaluation, we're using a dataset of 200 synthetic question-answer pairs about two critical IRS tax documents: Form 1040 (U.S. Individual Income Tax Return) and Schedule C (the key form for self-employed individuals and small business owners), available on Hugging Face.
Each Q&A pair includes the question, the correct answer based on official IRS documentation, and relevant context from the tax forms. This makes it an ideal test case for evaluating both comprehension and accuracy.
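If you want to poke at the raw data first, you can pull it down with the Hugging Face `datasets` library. A quick sketch (the repository id and column names below are placeholders, since the post links to the dataset rather than naming it; use the actual id from the link above):

```python
# Sketch: inspect the tax form Q&A dataset locally.
# "quotientai/tax-form-qa" and the column names are hypothetical
# placeholders -- use the real repo id from the Hugging Face link.
from datasets import load_dataset

tax_qa = load_dataset("quotientai/tax-form-qa", split="train")
print(tax_qa[0])  # e.g. keys like 'question', 'answer', 'context'
```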
Let's walk through the evaluation process:
Step 1: Initialize Your Evaluation
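The original post illustrated this step with a screenshot. As a minimal sketch of what it covers, you authenticate the client and register your dataset. The import path, `QuotientAI` entry point, and `datasets.create` call here are illustrative assumptions; check the SDK docs for the exact names and signatures:

```python
# Minimal sketch; QuotientAI and datasets.create are assumed names --
# consult the SDK docs for the exact client API.
from quotientai import QuotientAI

# The client picks up your API key from the QUOTIENT_API_KEY env var
quotient = QuotientAI()

# Register the Q&A pairs (from the snippet above) as an evaluation dataset
dataset = quotient.datasets.create(
    name="irs-tax-form-qa",
    rows=[
        {
            "input": row["question"],
            "expected": row["answer"],
            "context": row["context"],
        }
        for row in tax_qa
    ],
)
```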
Step 2: Create Your Prompt
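This step was also shown as a screenshot. A sketch of defining the prompt used for both models might look like the following; `prompts.create` and its parameters are assumed names, so verify them against the docs:

```python
# Sketch: define the prompt template used for both models.
# prompts.create and its parameters are assumed names -- check the docs.
prompt = quotient.prompts.create(
    name="tax-form-qa",
    system_prompt=(
        "You are a tax assistant. Answer the question using only the "
        "provided IRS form context. Be clear, accurate, and concise."
    ),
    user_prompt="Context:\n{context}\n\nQuestion: {question}",
)
```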
Step 3: Run the Evaluation
For tax form Q&A, we use evaluators specifically designed to measure factual accuracy and answer completeness against IRS documentation.
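Here is a sketch of launching one asynchronous run per model. The `runs.create` call, the model identifiers, and the evaluator names are illustrative assumptions; the SDK docs list the exact evaluators available:

```python
# Sketch: launch an asynchronous evaluation run per model.
# runs.create, the model ids, and the evaluator names are assumptions.
runs = []
for model_name in ("deepseek-r1", "o1"):
    run = quotient.runs.create(
        prompt=prompt,
        dataset=dataset,
        model=model_name,
        evaluators=["faithfulness", "answer_completeness"],
    )
    runs.append(run)
    print(f"Launched run {run.id} for {model_name}")
```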
Step 4: Analyze the Results
Our evaluation revealed that DeepSeek-R1 significantly outperforms OpenAI's o1 model across most metrics for tax-related Q&A tasks.
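Since runs execute asynchronously, you retrieve results whenever they're ready. A sketch of polling for completion and comparing aggregate metrics, where `runs.get`, `status`, and `metrics` are hypothetical accessors:

```python
# Sketch: poll until each run finishes, then compare aggregate metrics.
# runs.get, run.status, and run.metrics are hypothetical accessors.
import time

for run in runs:
    while run.status not in ("completed", "failed"):
        time.sleep(30)
        run = quotient.runs.get(run.id)
    print(run.model, run.metrics)  # per-metric averages for this run
```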
Interestingly, we discovered specific types of hallucinations that affected the faithfulness scores:
- Both models correctly expanded acronyms (like "EIC" to "Earned Income Credit") but were penalized because these expansions weren't explicitly in the context
- DeepSeek-R1 occasionally included URLs not present in the context, impacting its faithfulness score
Based on performance, size, and cost considerations, DeepSeek-R1 appears to be the better candidate for tax preparation assistance, with some prompt refinement to address the faithfulness issues.
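One way to attempt that refinement is a small addition to the system prompt targeting the two failure modes above. The wording here is purely illustrative:

```python
# Sketch: tighten the system prompt to curb context-unfaithful additions.
refined_prompt = quotient.prompts.create(
    name="tax-form-qa-v2",
    system_prompt=(
        "You are a tax assistant. Answer using only the provided IRS form "
        "context. Do not include URLs or any information that is not in "
        "the context, and use acronyms exactly as they appear."
    ),
    user_prompt="Context:\n{context}\n\nQuestion: {question}",
)
```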

Get Started
- Install the SDK: `pip install quotientai`
- Sign up for an API key at app.quotientai.co
- Check out our documentation
We're eager to see how you'll use the Quotient SDK to enhance your AI evaluation workflow. Share your experience with us!
Bonus: Seamless integration with PromptLab
The SDK seamlessly connects with Quotient's PromptLab, enabling you to:
- Use prompts developed in PromptLab directly in your evaluations (see the sketch after this list)
- Build evaluation datasets iteratively as you develop by saving manual PromptLab runs
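As a sketch of the first integration point, you might fetch a PromptLab-authored prompt by name and drop it straight into a run. The `prompts.list` call and the attribute names are assumptions; check the docs for the exact lookup API:

```python
# Sketch: reuse a prompt authored interactively in PromptLab.
# prompts.list and the attribute names are assumptions -- check the docs.
promptlab_prompt = next(
    p for p in quotient.prompts.list() if p.name == "tax-form-qa"
)
run = quotient.runs.create(
    prompt=promptlab_prompt,
    dataset=dataset,
    model="deepseek-r1",
    evaluators=["faithfulness"],
)
```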
P.S. Ready to see Quotient in action? Need implementation support? Interested in enterprise features?
Schedule a call with us