Rigorous Unit Testing for LLM Outputs with DeepEval

A practical guide to testing LLM outputs with DeepEval using hallucination checks, custom metrics, test suites, CI/CD integration, and production evaluation patterns.

Introduction to the DeepEval Framework

DeepEval is a Python framework for evaluating large language model outputs using test cases, metrics, thresholds, and repeatable evaluation workflows. It extends traditional unit testing patterns into areas that matter for LLM systems, including hallucination detection, semantic similarity, factual consistency, answer relevance, and response quality.

In production LLM applications, evaluation should not be treated only as an offline experiment. It should be part of the engineering lifecycle: local testing, regression testing, CI/CD checks, release gates, and ongoing monitoring.

The following example shows a basic hallucination test case.

from deepeval.metrics import HallucinationMetric
from deepeval.test_case import LLMTestCase

# Build the test case; HallucinationMetric compares actual_output against context
test_case = LLMTestCase(
    input="What is the capital of France?",
    actual_output="Paris is the capital of France",
    context=["Paris is the capital and largest city of France."]
)

# Lower is better for this metric: the test passes when score <= threshold
metric = HallucinationMetric(threshold=0.5)
metric.measure(test_case)  # calls the judge model, so an API key must be configured
print(f"Hallucination Score: {metric.score}")  # a score near 0 means no contradiction was found
print(f"Reason: {metric.reason}")

Setting Up the DeepEval Environment

A reliable evaluation setup requires controlled configuration for model access, credentials, thresholds, metric behavior, and test execution. For production teams, these settings should be externalized through environment variables or configuration files rather than hard-coded into test logic.
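
One way to externalize these settings is to read them from environment variables into a small typed object. The sketch below uses only the standard library; the variable names (`EVAL_THRESHOLD`, `EVAL_JUDGE_MODEL`, `EVAL_MAX_CONCURRENCY`) and the `EvalSettings` class are hypothetical conventions for this guide, not names that DeepEval defines.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class EvalSettings:
    """Evaluation settings read from the environment (hypothetical names)."""
    judge_model: str
    threshold: float
    max_concurrency: int


def load_settings(env: Mapping[str, str] = os.environ) -> EvalSettings:
    # Fall back to defaults when a variable is unset; fail fast on bad values.
    threshold = float(env.get("EVAL_THRESHOLD", "0.7"))
    if not 0.0 <= threshold <= 1.0:
        raise ValueError(f"EVAL_THRESHOLD must be in [0, 1], got {threshold}")
    return EvalSettings(
        judge_model=env.get("EVAL_JUDGE_MODEL", "gpt-4o-mini"),
        threshold=threshold,
        max_concurrency=int(env.get("EVAL_MAX_CONCURRENCY", "5")),
    )


# Passing a plain dict keeps the example independent of the real environment.
settings = load_settings({"EVAL_THRESHOLD": "0.85"})
print(settings)
```

Accepting any mapping instead of reading `os.environ` directly makes the loader easy to exercise in tests without touching the process environment.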

The following example outlines a basic environment configuration pattern.

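import os
from deepeval.metrics import AnswerRelevancyMetric

# Read credentials from the environment instead of hard-coding them
if not os.environ.get("OPENAI_API_KEY"):
    raise RuntimeError("Set OPENAI_API_KEY before running evaluations")

# Externalized settings; these variable names are this guide's convention,
# not names defined by DeepEval
EVAL_MODEL = os.environ.get("EVAL_MODEL", "gpt-4o-mini")
EVAL_THRESHOLD = float(os.environ.get("EVAL_THRESHOLD", "0.7"))

# Each metric takes its judge model and threshold as constructor arguments
metric = AnswerRelevancyMetric(
    threshold=EVAL_THRESHOLD,
    model=EVAL_MODEL,
    include_reason=True
)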

Implementing Basic Test Cases

DeepEval test cases define the expected behavior of an LLM output against clear evaluation criteria. A good test case should include the input, expected output or reference behavior, actual output, context when relevant, and the metrics used to determine pass or fail status.

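The following example demonstrates a pytest-style test that builds a test case and checks it against multiple evaluation metrics.

from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric, GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

def generate_response(prompt: str) -> str:
    # Placeholder: call your own LLM application here
    raise NotImplementedError

def test_response():
    prompt = "What is the capital of France?"

    # Generate the LLM response and wrap it in a test case
    test_case = LLMTestCase(
        input=prompt,
        actual_output=generate_response(prompt),
        expected_output="Paris is the capital of France"
    )

    # Apply multiple evaluation metrics
    metrics = [
        AnswerRelevancyMetric(threshold=0.7),
        GEval(
            name="Correctness",
            criteria="Determine whether the actual output is factually consistent with the expected output.",
            evaluation_params=[
                LLMTestCaseParams.ACTUAL_OUTPUT,
                LLMTestCaseParams.EXPECTED_OUTPUT
            ]
        )
    ]

    # assert_test raises an AssertionError if any metric fails its threshold
    assert_test(test_case, metrics)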

Custom Metric Development

Custom metrics are useful when generic quality checks are not enough for the application domain. Examples include policy compliance, domain terminology accuracy, regulatory language checks, citation quality, response completeness, and product-specific business rules.
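
As an illustration of domain terminology accuracy, the scoring logic inside such a metric can be as simple as checking which required terms appear in the output. The function below is a standalone sketch with hypothetical names; it is not part of DeepEval and would be called from a custom metric's scoring method.

```python
import re
from typing import Iterable


def terminology_coverage(output: str, required_terms: Iterable[str]) -> float:
    """Return the fraction of required terms that appear in the output."""
    terms = [t.lower() for t in required_terms]
    if not terms:
        return 1.0
    text = output.lower()
    # Match on word boundaries so "rate" does not match inside "accurate".
    hits = sum(1 for t in terms if re.search(rf"\b{re.escape(t)}\b", text))
    return hits / len(terms)


score = terminology_coverage(
    "The annual percentage rate applies after the grace period.",
    ["annual percentage rate", "grace period", "late fee"],
)
print(round(score, 2))  # 0.67
```

A lexical check like this is cheap and deterministic, which makes it a reasonable first gate before more expensive model-judged metrics.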

The following example shows the structure of a domain-specific metric.

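from difflib import SequenceMatcher

from deepeval.metrics import BaseMetric
from deepeval.test_case import LLMTestCase

class DomainSpecificMetric(BaseMetric):
    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold

    def measure(self, test_case: LLMTestCase) -> float:
        # Custom evaluation logic; the score and pass/fail state live on the metric
        self.score = self._evaluate_domain_knowledge(
            test_case.actual_output,
            test_case.expected_output
        )
        self.success = self.score >= self.threshold
        return self.score

    async def a_measure(self, test_case: LLMTestCase) -> float:
        # No I/O is involved, so the async variant reuses the sync logic
        return self.measure(test_case)

    def is_successful(self) -> bool:
        return self.success

    @property
    def __name__(self):
        return "Domain Specific"

    def _evaluate_domain_knowledge(self, actual: str, expected: str) -> float:
        # Placeholder scoring: character-level similarity between 0 and 1.
        # Replace with real domain-specific evaluation logic.
        return SequenceMatcher(None, actual.lower(), expected.lower()).ratio()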

Implementing Test Suites

Test suites organize related evaluation cases and provide a repeatable way to test model behavior across scenarios, prompts, datasets, application versions, and release candidates. They are especially useful for regression testing after prompt, retrieval, model, or tool changes.
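
For regression testing, the core operation is comparing per-case scores from a baseline run against a candidate run. The following sketch assumes scores are already collected into dictionaries keyed by a case ID; the function name, the `tolerance` parameter, and the sample IDs are illustrative.

```python
from typing import Dict, List


def find_regressions(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    tolerance: float = 0.05,
) -> List[str]:
    """Return IDs of cases whose score dropped by more than `tolerance`."""
    regressions = []
    for case_id, old_score in baseline.items():
        new_score = candidate.get(case_id)
        # A case missing from the candidate run is treated as a regression.
        if new_score is None or old_score - new_score > tolerance:
            regressions.append(case_id)
    return sorted(regressions)


baseline = {"refund-policy": 0.92, "shipping-time": 0.88, "warranty": 0.81}
candidate = {"refund-policy": 0.93, "shipping-time": 0.70}
print(find_regressions(baseline, candidate))  # ['shipping-time', 'warranty']
```

The tolerance absorbs small run-to-run noise from model-judged metrics, so only meaningful drops are flagged.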

The following example outlines a simple test suite structure.

from typing import List

from deepeval import evaluate
from deepeval.metrics import BaseMetric
from deepeval.test_case import LLMTestCase

class LLMTestSuite:
    # A thin grouping helper defined in this guide; it is not a DeepEval class
    def __init__(self, name: str, metrics: List[BaseMetric]):
        self.name = name
        self.metrics = metrics
        self.test_cases: List[LLMTestCase] = []

    def add_test_case(self, input_text: str, actual_output: str, expected_output: str):
        test_case = LLMTestCase(
            input=input_text,
            actual_output=actual_output,
            expected_output=expected_output
        )
        self.test_cases.append(test_case)

    def run_suite(self):
        # evaluate() runs every metric against every test case
        return evaluate(test_cases=self.test_cases, metrics=self.metrics)

Use Case: Sentiment Analysis Evaluation

This example evaluates sentiment classification behavior using labeled test data and a sentiment-specific metric. The same pattern can be applied to classification, extraction, routing, and moderation tasks.

The following example creates and evaluates sentiment test cases from a dataset.

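import pandas as pd
from deepeval.metrics import BaseMetric
from deepeval.test_case import LLMTestCase

class SentimentMatchMetric(BaseMetric):
    # Custom exact-match metric defined for this example
    def __init__(self, threshold: float = 1.0):
        self.threshold = threshold

    def measure(self, test_case: LLMTestCase) -> float:
        predicted = test_case.actual_output.strip().lower()
        expected = test_case.expected_output.strip().lower()
        self.score = 1.0 if predicted == expected else 0.0
        self.success = self.score >= self.threshold
        return self.score

    async def a_measure(self, test_case: LLMTestCase) -> float:
        return self.measure(test_case)

    def is_successful(self) -> bool:
        return self.success

    @property
    def __name__(self):
        return "Sentiment Match"

def classify_sentiment(text: str) -> str:
    # Placeholder: call the model under test and return its predicted label
    raise NotImplementedError

# Load test dataset; expects 'text', 'sentiment', and 'category' columns
df = pd.read_csv('sentiment_data.csv')
metric = SentimentMatchMetric()

# Create and score one test case per labeled row
for _, row in df.iterrows():
    test_case = LLMTestCase(
        input=row['text'],
        actual_output=classify_sentiment(row['text']),
        expected_output=row['sentiment'],
        additional_metadata={"category": row['category']}
    )

    score = metric.measure(test_case)
    print(f"Text: {row['text']}")
    print(f"Expected: {row['sentiment']}")
    print(f"Score: {score}\n")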

Metrics Configuration

DeepEval metrics should be configured with explicit thresholds that reflect the risk level of the use case. A customer-support chatbot, internal summarizer, financial recommendation assistant, and compliance workflow should not share the same acceptance criteria.
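
One way to make risk-based acceptance criteria explicit is a lookup table from risk tier to minimum scores. The sketch below is plain Python; the tier names, metric names, and threshold values are illustrative assumptions and should be calibrated against labeled data for a real system.

```python
from enum import Enum
from typing import Dict


class RiskTier(Enum):
    LOW = "low"        # e.g. internal summarizer
    MEDIUM = "medium"  # e.g. customer-support chatbot
    HIGH = "high"      # e.g. financial or compliance workflow


# Illustrative minimum scores per tier; not recommended values.
THRESHOLDS: Dict[RiskTier, Dict[str, float]] = {
    RiskTier.LOW: {"relevancy": 0.6, "faithfulness": 0.7},
    RiskTier.MEDIUM: {"relevancy": 0.7, "faithfulness": 0.8},
    RiskTier.HIGH: {"relevancy": 0.8, "faithfulness": 0.9},
}


def passes(tier: RiskTier, scores: Dict[str, float]) -> bool:
    """True when every metric required by the tier meets its minimum score."""
    required = THRESHOLDS[tier]
    # A missing metric counts as a score of 0.0, so it fails the check.
    return all(scores.get(name, 0.0) >= minimum for name, minimum in required.items())


scores = {"relevancy": 0.82, "faithfulness": 0.84}
print(passes(RiskTier.MEDIUM, scores), passes(RiskTier.HIGH, scores))  # True False
```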

The following example configures multiple metrics with different evaluation parameters.

from deepeval.metrics import (
    AnswerRelevancyMetric,
    ContextualRelevancyMetric,
    FaithfulnessMetric
)

# Configure multiple metrics with custom thresholds.
# ContextualRelevancyMetric and FaithfulnessMetric need retrieval_context
# to be set on the test case.
metrics_config = {
    'contextual': ContextualRelevancyMetric(
        threshold=0.75,
        include_reason=True
    ),
    'relevancy': AnswerRelevancyMetric(
        threshold=0.7,
        include_reason=True
    ),
    'faithfulness': FaithfulnessMetric(
        threshold=0.8,
        include_reason=True
    )
}

async def evaluate_with_metrics(test_case, metrics_config):
    # a_measure is the async counterpart of measure and returns the score
    results = {}
    for name, metric in metrics_config.items():
        score = await metric.a_measure(test_case)
        results[name] = {"score": score, "passed": metric.is_successful()}
    return results

Implementing an Async Evaluation Pipeline

Asynchronous evaluation helps process larger test sets without blocking execution on each model call or metric calculation. This pattern is useful for nightly evaluation jobs, pre-release test runs, and large regression suites.
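
Large evaluation runs usually need a cap on in-flight requests to stay within provider rate limits. The sketch below shows one way to do that with `asyncio.Semaphore`; `score_fn` stands in for any async metric call, and `fake_score` is a hypothetical stub used only to make the example runnable.

```python
import asyncio
from typing import Awaitable, Callable, List


async def evaluate_all(
    items: List[str],
    score_fn: Callable[[str], Awaitable[float]],
    max_concurrency: int = 5,
) -> List[float]:
    """Score every item while running at most `max_concurrency` calls at once."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(item: str) -> float:
        async with semaphore:
            return await score_fn(item)

    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*(bounded(item) for item in items))


async def fake_score(text: str) -> float:
    # Stand-in for a metric call; the sleep simulates network latency.
    await asyncio.sleep(0.01)
    return min(len(text) / 10, 1.0)


results = asyncio.run(
    evaluate_all(["short", "a longer answer"], fake_score, max_concurrency=2)
)
print(results)  # [0.5, 1.0]
```

Compared with fixed-size batches, a semaphore keeps the pipeline saturated: a new call starts as soon as any running call finishes, rather than waiting for the slowest item in a batch.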

The following example batches test cases through an async evaluation pipeline.

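import asyncio
from typing import Callable, Dict, List

from deepeval.metrics import BaseMetric
from deepeval.test_case import LLMTestCase

class AsyncTestPipeline:
    def __init__(self, metric_factories: Dict[str, Callable[[], BaseMetric]]):
        # Factories build a fresh metric per test case, because a metric
        # stores its latest score on the instance
        self.metric_factories = metric_factories
        self.test_queue: List[LLMTestCase] = []

    def add_test(self, test_case: LLMTestCase):
        self.test_queue.append(test_case)

    async def _evaluate(self, test_case: LLMTestCase) -> Dict[str, float]:
        scores = {}
        for name, factory in self.metric_factories.items():
            scores[name] = await factory().a_measure(test_case)
        return scores

    async def run_pipeline(self, batch_size: int = 5):
        results = []
        # Evaluate one batch at a time; cases within a batch run concurrently
        for i in range(0, len(self.test_queue), batch_size):
            batch = self.test_queue[i:i + batch_size]
            batch_results = await asyncio.gather(
                *[self._evaluate(test) for test in batch]
            )
            results.extend(batch_results)
        return results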

Data Validation and Preprocessing

Evaluation quality depends on the quality of test inputs, reference outputs, labels, and metadata. Validation should catch empty inputs, malformed examples, duplicate records, inconsistent labels, and unsupported output formats before metrics are calculated.
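
Duplicate records and inconsistent labels can be caught with a single pass over the dataset. The sketch below assumes each example is an `(input_text, label)` pair; the function name and the sample rows are illustrative.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def audit_examples(rows: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Find duplicated inputs and inputs that carry conflicting labels.

    Inputs are compared after whitespace and case normalization.
    """
    labels_by_input: Dict[str, Set[str]] = defaultdict(set)
    counts: Dict[str, int] = defaultdict(int)
    for text, label in rows:
        key = " ".join(text.split()).lower()
        labels_by_input[key].add(label)
        counts[key] += 1
    return {
        "duplicates": sorted(k for k, n in counts.items() if n > 1),
        "conflicts": sorted(k for k, v in labels_by_input.items() if len(v) > 1),
    }


rows = [
    ("Great product", "positive"),
    ("great  product", "negative"),
    ("Arrived late", "negative"),
]
report = audit_examples(rows)
print(report)
# {'duplicates': ['great product'], 'conflicts': ['great product']}
```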

The following example validates and normalizes test data before evaluation.

from typing import List, Union


class TestDataValidator:
    def validate_input(self, input_text: str) -> bool:
        if not isinstance(input_text, str) or not input_text.strip():
            raise ValueError("Invalid input text")
        return True
        
    def preprocess_text(self, text: str) -> str:
        # Remove extra whitespace
        text = " ".join(text.split())
        # Normalize case
        text = text.lower()
        return text
        
    def validate_expected_output(
        self,
        expected: Union[str, List[str]],
        output_type: str = "text"
    ) -> bool:
        if output_type == "text":
            return self.validate_input(expected)
        elif output_type == "list":
            return all(self.validate_input(item) for item in expected)
        return False

Use Case: Question Answering Evaluation

Question-answering evaluation should test more than whether the answer sounds correct. It should evaluate context relevance, answer relevance, factual consistency, source grounding, and whether the model stayed within the provided evidence.
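
A crude lexical proxy for "stayed within the provided evidence" is the share of the answer's content words that also occur in the context. The sketch below is a heuristic written for this guide, not a DeepEval metric: it cannot detect paraphrase or negation, so treat it as a cheap pre-filter ahead of model-judged checks. The stopword list is a deliberately small assumption.

```python
import re
from typing import Set

# Minimal illustrative stopword list; extend it for real use.
STOPWORDS = {"the", "a", "an", "is", "are", "was", "of", "in", "to", "and", "it"}


def content_tokens(text: str) -> Set[str]:
    """Lowercased alphanumeric tokens with stopwords removed."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS}


def grounding_ratio(answer: str, context: str) -> float:
    """Fraction of the answer's content tokens that also occur in the context."""
    answer_tokens = content_tokens(answer)
    if not answer_tokens:
        return 0.0
    return len(answer_tokens & content_tokens(context)) / len(answer_tokens)


context = "The Eiffel Tower was completed in 1889 and stands in Paris."
print(grounding_ratio("The Eiffel Tower stands in Paris.", context))    # 1.0
print(grounding_ratio("The Eiffel Tower was built in 1925.", context))  # 0.5
```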

The following example evaluates a QA response using context and answer metrics.

from typing import Awaitable, Callable, Dict

from deepeval.metrics import (
    AnswerRelevancyMetric,
    ContextualRelevancyMetric,
    FaithfulnessMetric
)
from deepeval.test_case import LLMTestCase

class QAEvaluator:
    def __init__(
        self,
        generate_answer: Callable[[str, str], Awaitable[str]],
        golden_answers: Dict[str, str]
    ):
        self.generate_answer = generate_answer  # your application's async answer function
        self.golden_answers = golden_answers    # maps question -> reference answer
        self.metrics = {
            "context_relevance": ContextualRelevancyMetric(),
            "answer_relevance": AnswerRelevancyMetric(),
            "factual_consistency": FaithfulnessMetric()
        }

    async def evaluate_qa(self, question: str, context: str):
        # input must be a string; the evidence goes in retrieval_context
        test_case = LLMTestCase(
            input=question,
            actual_output=await self.generate_answer(question, context),
            expected_output=self.golden_answers.get(question),
            retrieval_context=[context]
        )

        # Evaluate multiple aspects
        results = {}
        for name, metric in self.metrics.items():
            score = await metric.a_measure(test_case)
            results[name] = {"score": score, "passed": metric.is_successful()}

        return results

Statistical Analysis of LLM Performance

Statistical analysis helps convert individual metric scores into a broader performance view across datasets, prompt versions, model versions, and use cases. This is important for detecting regression, variance, unstable behavior, and weak segments in the evaluation set.
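
When metric scores are bounded and skewed, a percentile bootstrap is a common alternative to a t-based interval because it makes no normality assumption. The following standard-library sketch is illustrative; the function name, resample count, and sample scores are assumptions rather than recommended settings.

```python
import random
import statistics
from typing import List, Tuple


def bootstrap_mean_ci(
    scores: List[float],
    confidence: float = 0.95,
    n_resamples: int = 2000,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean score."""
    rng = random.Random(seed)  # seeded so repeated runs give the same interval
    means = sorted(
        statistics.fmean(rng.choices(scores, k=len(scores)))
        for _ in range(n_resamples)
    )
    lower_index = int((1 - confidence) / 2 * n_resamples)
    upper_index = int((1 + confidence) / 2 * n_resamples) - 1
    return means[lower_index], means[upper_index]


scores = [0.91, 0.84, 0.77, 0.95, 0.62, 0.88, 0.90, 0.71]
low, high = bootstrap_mean_ci(scores)
print(f"mean={statistics.fmean(scores):.3f}, 95% CI=({low:.3f}, {high:.3f})")
```

If the intervals for two prompt or model versions overlap heavily, the observed difference in mean score may be noise rather than a real regression.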

The following example calculates aggregate metrics and confidence intervals from evaluation results.

import pandas as pd
import numpy as np
from scipy import stats
from typing import Dict, List

class PerformanceAnalyzer:
    def __init__(self, results: List[Dict[str, float]]):
        self.results_df = pd.DataFrame(results)
        
    def calculate_metrics(self):
        stats_results = {
            "mean_scores": self.results_df.mean(),
            "std_dev": self.results_df.std(),
            "confidence_intervals": self._calculate_ci(),
            "performance_distribution": self._analyze_distribution()
        }
        
        return stats_results
        
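    def _calculate_ci(self, confidence=0.95):
        # Student's t interval for the mean of each metric column
        ci_results = {}
        for column in self.results_df.columns:
            data = self.results_df[column].dropna()
            ci = stats.t.interval(
                confidence,
                len(data) - 1,
                loc=np.mean(data),
                scale=stats.sem(data)
            )
            ci_results[column] = ci
        return ci_results

    def _analyze_distribution(self):
        # Per-metric summary: count, mean, std, min, quartiles, and max
        return self.results_df.describe()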

Performance Visualization Pipeline

Visualization helps teams inspect score distributions, temporal drift, recurring failure categories, and metric correlations. In production evaluation, charts should support decision-making around release readiness and model quality trends.

The following example builds a simple dashboard for evaluation results.

import matplotlib.pyplot as plt
import seaborn as sns
from typing import Dict, Any

class PerformanceVisualizer:
    def __init__(self, results_data: Dict[str, Any]):
        self.results = results_data
        self.fig_size = (15, 12)
        
    def create_performance_dashboard(self):
        fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=self.fig_size)
        
        # Metric distribution plot
        sns.violinplot(data=self.results['metric_scores'], ax=ax1)
        ax1.set_title('Metric Score Distribution')
        
        # Time series of performance
        sns.lineplot(
            data=self.results['temporal_scores'],
            x='timestamp',
            y='score',
            ax=ax2
        )
        ax2.set_title('Performance Over Time')
        
        # Error analysis heatmap
        sns.heatmap(
            self.results['error_correlation'],
            annot=True,
            cmap='coolwarm',
            ax=ax3
        )
        ax3.set_title('Error Correlation Matrix')
        
        # The fourth panel is unused; hide its empty axes
        ax4.axis('off')
        fig.tight_layout()
        
        return fig

Integration with CI/CD Pipelines

CI/CD integration turns LLM evaluation into a release control. Instead of relying on manual prompt testing, teams can block unsafe changes when scores fall below thresholds or when critical test suites fail.
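
At its simplest, a release gate compares each suite's aggregate score against a minimum and turns any shortfall into a non-zero exit code. The sketch below is plain Python with hypothetical suite names and thresholds; it assumes the scores were produced by an earlier evaluation step.

```python
from typing import Dict, List


def gate(suite_scores: Dict[str, float], minimums: Dict[str, float]) -> List[str]:
    """Return a failure message for every suite below its minimum score."""
    failures = []
    for suite, minimum in minimums.items():
        score = suite_scores.get(suite)
        if score is None:
            # A required suite that produced no score must not pass silently.
            failures.append(f"{suite}: no score reported")
        elif score < minimum:
            failures.append(f"{suite}: {score:.2f} < {minimum:.2f}")
    return failures


def exit_code(failures: List[str]) -> int:
    """Map gate failures to a process exit code (0 = pass, 1 = block)."""
    return 1 if failures else 0


failures = gate({"safety": 0.97, "qa": 0.78}, {"safety": 0.95, "qa": 0.80})
for message in failures:
    print(f"FAIL {message}")  # FAIL qa: 0.78 < 0.80
print("exit code:", exit_code(failures))  # exit code: 1
```

In a CI script, passing the result of `exit_code(failures)` to `sys.exit` makes the job fail, which is what blocks the merge or release.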

The following example outlines a CI test runner pattern.

import json
from typing import Callable

# DeepEval also provides a `deepeval test run <test_file>.py` command for
# pytest-style test files; the class below is a custom wrapper for
# config-driven suites.
class DeepEvalCI:
    def __init__(self, config_path: str, run_suite: Callable[[dict], dict]):
        self.config = self._load_config(config_path)
        # run_suite is supplied by the caller: it runs one configured suite
        # and returns a dict that includes a boolean 'passed' key
        self.run_suite = run_suite

    def run_ci_tests(self) -> dict:
        test_results = [
            self.run_suite(test_suite)
            for test_suite in self.config['test_suites']
        ]
        return self._generate_ci_report(test_results)

    def _load_config(self, path: str) -> dict:
        with open(path, 'r') as f:
            return json.load(f)

    def _generate_ci_report(self, results: list) -> dict:
        return {
            'total_tests': len(results),
            'passed_tests': sum(1 for r in results if r['passed']),
            'failed_tests': sum(1 for r in results if not r['passed']),
            'detailed_results': results
        }

Closing Thoughts

LLM testing requires more than checking whether a response looks reasonable. Production systems need repeatable evaluation suites, explicit thresholds, domain-specific metrics, regression tests, CI/CD release gates, and monitoring for quality drift.

DeepEval is useful because it brings LLM evaluation closer to established software engineering practices. The practical goal is not to make LLM output perfectly deterministic. The goal is to define acceptable behavior, detect regressions early, and prevent unreliable model behavior from reaching users or downstream systems.
