Our evaluations feature is not yet live. It is under active development and will be available soon; once deployed, it will work exactly as described in this section.
New Feature: We are integrating evaluations powered by DeepEval from Confident AI within the claim normalization pipeline to evaluate the quality of extracted claims.
CheckThat AI now includes automated evaluation capabilities that assess the quality and accuracy of claim extraction and normalization processes. This integration provides comprehensive evaluation metrics without requiring any custom code from users.

Overview

The evaluation system automatically:
  • Analyzes claim extraction quality using industry-standard metrics
  • Generates detailed evaluation reports in CSV format for local analysis
  • Provides cloud dashboard access for advanced trace analysis
  • Requires no custom code - evaluations run automatically in the pipeline
  • Integrates with Confident AI for comprehensive evaluation tracking

Getting Started

API Key Setup

To access cloud dashboard features and detailed traces, you’ll need a Confident AI API key:
1. Get your Confident AI API Key
   Visit confident-ai.com and create an account to obtain your API key.

2. Set your environment variable
   export CONFIDENT_API_KEY="your-confident-ai-api-key"

3. Include in API requests
   Pass your Confident AI API key along with your CheckThat AI requests to enable dashboard tracking.

Basic Usage

Evaluations run automatically when you use CheckThat AI’s claim normalization features:
from checkthat_ai import CheckThatAI
import os

client = CheckThatAI(api_key=os.getenv("OPENAI_API_KEY"))

# Evaluations run automatically during claim processing
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": "Extract and evaluate claims from: 'Studies show that 95% of doctors recommend this new treatment for diabetes management.'"}
    ],
    # Optional: Include Confident AI key for dashboard access
    extra_headers={
        "X-Confident-API-Key": os.getenv("CONFIDENT_API_KEY")
    }
)

print(response.choices[0].message.content)

Evaluation Metrics

The evaluation system assesses multiple dimensions of claim extraction quality:
Claim Identification Accuracy
  • Precision: How many extracted claims are actually valid claims
  • Recall: How many valid claims were successfully identified
  • F1-Score: Harmonic mean of precision and recall (see the sketch after this section)
Content Fidelity
  • Semantic similarity between original text and extracted claims
  • Factual consistency of extracted information
  • Preservation of original meaning and context
Completeness
  • Coverage of all relevant claims in the source text
  • Identification of implicit vs. explicit claims
  • Handling of compound and nested claims
Coherence
  • Logical consistency of extracted claims
  • Proper claim boundaries and segmentation
  • Maintenance of causal relationships
Extraction Confidence
  • Model confidence in claim identification
  • Uncertainty quantification for ambiguous cases
  • Reliability scores for different claim types
Verification Readiness
  • Assessment of how well-formed claims are for fact-checking
  • Identification of claims requiring additional context
  • Flagging of unverifiable or opinion-based statements
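
To make the claim-identification metrics concrete, here is a minimal sketch of how precision, recall, and F1 could be computed against a hand-labelled reference set. The exact-match comparison and the helper name are illustrative assumptions, not the DeepEval implementation used in the pipeline.

# Minimal sketch: precision/recall/F1 for claim identification.
# Exact string matching is an illustrative assumption; the production
# pipeline relies on DeepEval's own metrics.
def claim_identification_scores(extracted, reference):
    """Compare extracted claims against a hand-labelled reference set."""
    extracted_set = {c.strip().lower() for c in extracted}
    reference_set = {c.strip().lower() for c in reference}

    true_positives = len(extracted_set & reference_set)
    precision = true_positives / len(extracted_set) if extracted_set else 0.0
    recall = true_positives / len(reference_set) if reference_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Example: two of three reference claims were recovered
print(claim_identification_scores(
    extracted=["95% of doctors recommend the treatment",
               "the treatment is for diabetes management"],
    reference=["95% of doctors recommend the treatment",
               "the treatment is for diabetes management",
               "studies show 95% efficacy"],
))  # precision=1.0, recall≈0.67, f1=0.8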

CSV Evaluation Reports

Receive detailed evaluation reports that you can save locally for analysis:

Report Structure

Sample Evaluation Report
claim_id,original_text,extracted_claim,accuracy_score,completeness_score,coherence_score,confidence,category,verifiable,timestamp
claim_001,"Studies show 95% efficacy","Study reports 95% treatment efficacy",0.92,0.88,0.95,0.89,medical,true,2024-01-15T10:30:00Z
claim_002,"Doctors recommend treatment","95% of doctors recommend this treatment",0.87,0.91,0.93,0.85,medical,true,2024-01-15T10:30:00Z
claim_003,"New diabetes management","Treatment effective for diabetes management",0.85,0.82,0.90,0.82,medical,true,2024-01-15T10:30:00Z

Downloading Reports

import requests
import csv
from datetime import datetime

def download_evaluation_report(evaluation_id, api_key):
    """Download evaluation report as CSV"""
    
    response = requests.get(
        f'https://api.checkthat-ai.com/v1/evaluations/{evaluation_id}/report',
        headers={
            'Authorization': f'Bearer {api_key}',
            'Accept': 'text/csv'
        }
    )
    
    if response.status_code == 200:
        # Save report locally
        filename = f"evaluation_report_{datetime.now().strftime('%Y%m%d_%H%M%S')}.csv"
        with open(filename, 'w', newline='', encoding='utf-8') as file:
            file.write(response.text)
        
        print(f"✅ Report saved as {filename}")
        return filename
    else:
        print(f"❌ Failed to download report: {response.status_code}")
        return None

# Usage
report_file = download_evaluation_report("eval_123abc", "your-api-key")

# Analyze report
if report_file:
    with open(report_file, 'r', encoding='utf-8') as file:
        reader = csv.DictReader(file)
        for row in reader:
            print(f"Claim: {row['extracted_claim']}")
            print(f"Accuracy: {row['accuracy_score']}")
            print(f"Verifiable: {row['verifiable']}")
            print("---")

Confident AI Dashboard Access

Access detailed traces and advanced analytics through the Confident AI cloud dashboard:

Dashboard Features

  • Test Run Tracking: View detailed test runs with input/output traces and performance metrics
  • Evaluation Analytics: Comprehensive analytics with visualizations and trend analysis
  • Model Comparison: Compare performance across different models and configurations
  • Custom Metrics: Configure custom evaluation metrics for specific use cases

Accessing the Dashboard

1. Ensure API Key is Set
   Make sure your CONFIDENT_API_KEY is included in your requests:

   # In your CheckThat AI requests
   extra_headers = {
       "X-Confident-API-Key": os.getenv("CONFIDENT_API_KEY")
   }

2. Visit Confident AI Dashboard
   Go to confident-ai.com and log in with your account.

3. View Your Test Runs
   Navigate to your test runs to see detailed traces of your claim evaluation processes.

4. Analyze Performance
   Use the dashboard's analytics tools to identify patterns and optimize your claim extraction workflows.

Dashboard Screenshots

[Screenshot: Confident AI dashboard showing evaluation metrics and traces]

[Screenshot: Detailed test run view with claim extraction analysis]

Integration Examples

Batch Evaluation Workflow

from checkthat_ai import CheckThatAI
import csv
import os

client = CheckThatAI(api_key=os.getenv("OPENAI_API_KEY"))

def evaluate_claims_batch(texts, output_file="batch_evaluation.csv"):
    """Process multiple texts and generate evaluation report"""
    
    results = []
    
    for i, text in enumerate(texts):
        print(f"Processing text {i+1}/{len(texts)}...")
        
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "user", "content": f"Extract and evaluate claims: {text}"}
            ],
            extra_headers={
                "X-Confident-API-Key": os.getenv("CONFIDENT_API_KEY")
            }
        )
        
        # Parse response and collect evaluation data
        # (Response would include evaluation metrics)
        results.append({
            'text_id': i+1,
            'original_text': text,
            'response': response.choices[0].message.content,
            'evaluation_id': response.id  # Use response ID to fetch detailed eval
        })
    
    # Save batch results
    with open(output_file, 'w', newline='', encoding='utf-8') as file:
        if results:
            writer = csv.DictWriter(file, fieldnames=results[0].keys())
            writer.writeheader()
            writer.writerows(results)
    
    print(f"✅ Batch evaluation complete. Results saved to {output_file}")
    return results

# Usage
sample_texts = [
    "Clinical trials show 95% efficacy in reducing symptoms.",
    "The new policy will create 1 million jobs by 2025.",
    "Scientists confirm that this treatment is 100% safe."
]

batch_results = evaluate_claims_batch(sample_texts)
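
Each evaluation_id collected above can then be passed to the download_evaluation_report helper from the CSV reports section to fetch detailed per-claim metrics. A short sketch, assuming both pieces are defined as shown earlier:

# Sketch: pull the per-claim CSV for each batch item, reusing
# download_evaluation_report from the "Downloading Reports" example.
for result in batch_results:
    detailed_report = download_evaluation_report(
        result["evaluation_id"],
        "your-api-key",  # the same key used in the download example above
    )
    if detailed_report:
        print(f"Detailed report for text {result['text_id']}: {detailed_report}")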

Best Practices

Regular Quality Monitoring
  • Set up automated evaluation for production workflows
  • Monitor evaluation trends over time
  • Set quality thresholds and alerts for low-performance cases
Metric Selection
  • Choose evaluation metrics that align with your use case
  • Balance accuracy, completeness, and processing speed
  • Consider domain-specific evaluation criteria
Batch Processing
  • Process multiple claims together for better efficiency
  • Use batch APIs when available for large-scale evaluation
  • Implement proper rate limiting and error handling (see the sketch after this section)
Cost Management
  • Monitor evaluation costs alongside regular API usage
  • Use sampling for large datasets to control evaluation costs
  • Balance evaluation frequency with budget constraints
Report Analysis
  • Regularly review CSV reports for quality trends
  • Identify patterns in low-quality extractions
  • Use insights to improve prompt engineering
Dashboard Utilization
  • Leverage Confident AI dashboard for deep analysis
  • Set up custom metrics for your specific domain
  • Use trace analysis to debug extraction issues
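
As a starting point for the rate-limiting and error-handling advice under Batch Processing, here is a simple pacing-and-retry sketch. The one-second pause, three retries, and linear backoff are illustrative defaults rather than documented limits, and the snippet reuses the client and sample_texts from the batch example above.

import time

# Sketch: simple pacing plus retry around each claim-extraction request.
# The pause length, retry count, and backoff are illustrative defaults.
def extract_with_retry(client, text, retries=3, backoff=1.0):
    for attempt in range(1, retries + 1):
        try:
            return client.chat.completions.create(
                model="gpt-4o",
                messages=[{"role": "user",
                           "content": f"Extract and evaluate claims: {text}"}],
            )
        except Exception as exc:  # narrow to the SDK's error types in practice
            if attempt == retries:
                raise
            print(f"Attempt {attempt} failed ({exc}); retrying...")
            time.sleep(backoff * attempt)

# Space out successive requests to stay under rate limits
for text in sample_texts:
    response = extract_with_retry(client, text)
    time.sleep(1.0)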

FAQ

Do I need to write custom evaluation code?
No! Evaluations run automatically within the claim normalization pipeline. You just need to include your CONFIDENT_API_KEY to access dashboard features and detailed traces.

How do I get the CSV evaluation reports?
CSV reports are generated automatically and can be downloaded via the API or accessed through the Confident AI dashboard. Reports include detailed metrics for each extracted claim.

What if I don't provide a CONFIDENT_API_KEY?
Basic evaluations will still run automatically. However, you'll need a CONFIDENT_API_KEY to access the cloud dashboard, detailed traces, and advanced analytics features.

How often should I review evaluation results?
For production systems, we recommend daily or weekly review of evaluation metrics. For development and testing, review after each significant change to your claim extraction workflow.
Getting Help: Visit confident-ai.com for detailed documentation on the Confident AI platform, or deepeval.com for information about the evaluation framework.
Pro Tip: Use the evaluation insights to continuously improve your claim extraction prompts and configurations. Low-scoring claims often reveal opportunities for prompt optimization.