Our evaluations feature is not yet live. It is under active development and will be available soon; once deployed, it will work exactly as described in this section.
New Feature: We are integrating evaluations powered by DeepEval from Confident AI within the claim normalization pipeline to evaluate the quality of extracted claims.
CheckThat AI now includes automated evaluation capabilities that assess the quality and accuracy of claim extraction and normalization processes. This integration provides comprehensive evaluation metrics without requiring any custom code from users.

Overview

The evaluation system automatically:
  • Analyzes claim extraction quality using industry-standard metrics
  • Generates detailed evaluation reports in CSV format for local analysis
  • Provides cloud dashboard access for advanced trace analysis
  • Requires no custom code - evaluations run automatically in the pipeline
  • Integrates with Confident AI for comprehensive evaluation tracking

Getting Started

API Key Setup

To access cloud dashboard features and detailed traces, you’ll need a Confident AI API key:
1. Get your Confident AI API Key
   Visit confident-ai.com and create an account to obtain your API key.

2. Set your environment variable
   export CONFIDENT_API_KEY="your-confident-ai-api-key"

3. Include in API requests
   Pass your Confident AI API key along with your CheckThat AI requests to enable dashboard tracking.

Basic Usage

Evaluations run automatically when you use CheckThat AI’s claim normalization features:
from checkthat_ai import CheckThatAI
import os

client = CheckThatAI(api_key=os.getenv("OPENAI_API_KEY"))

# Evaluations run automatically during claim processing
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": "Extract and evaluate claims from: 'Studies show that 95% of doctors recommend this new treatment for diabetes management.'"}
    ],
    # Optional: Include Confident AI key for dashboard access
    extra_headers={
        "X-Confident-API-Key": os.getenv("CONFIDENT_API_KEY")
    }
)

print(response.choices[0].message.content)

Evaluation Metrics

The evaluation system assesses multiple dimensions of claim extraction quality:
Claim Identification Accuracy
  • Precision: How many extracted claims are actually valid claims
  • Recall: How many valid claims were successfully identified
  • F1-Score: Harmonic mean of precision and recall (see the sketch after this section)
Content Fidelity
  • Semantic similarity between original text and extracted claims
  • Factual consistency of extracted information
  • Preservation of original meaning and context
Completeness
  • Coverage of all relevant claims in the source text
  • Identification of implicit vs. explicit claims
  • Handling of compound and nested claims
Coherence
  • Logical consistency of extracted claims
  • Proper claim boundaries and segmentation
  • Maintenance of causal relationships
Extraction Confidence
  • Model confidence in claim identification
  • Uncertainty quantification for ambiguous cases
  • Reliability scores for different claim types
Verification Readiness
  • Assessment of how well-formed claims are for fact-checking
  • Identification of claims requiring additional context
  • Flagging of unverifiable or opinion-based statements
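
To make the claim-identification metrics concrete, here is a minimal sketch of how precision, recall, and F1 could be computed against a hand-labelled reference set. The exact-match comparison and the helper name are illustrative assumptions, not the DeepEval implementation used in the pipeline.

# Minimal sketch: precision/recall/F1 for claim identification.
# Exact string matching is an illustrative assumption; the production
# pipeline relies on DeepEval's own metrics.
def claim_identification_scores(extracted, reference):
    """Compare extracted claims against a hand-labelled reference set."""
    extracted_set = {c.strip().lower() for c in extracted}
    reference_set = {c.strip().lower() for c in reference}

    true_positives = len(extracted_set & reference_set)
    precision = true_positives / len(extracted_set) if extracted_set else 0.0
    recall = true_positives / len(reference_set) if reference_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Example: two of three reference claims were recovered
print(claim_identification_scores(
    extracted=["95% of doctors recommend the treatment",
               "the treatment is for diabetes management"],
    reference=["95% of doctors recommend the treatment",
               "the treatment is for diabetes management",
               "studies show 95% efficacy"],
))  # precision=1.0, recall≈0.67, f1=0.8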

CSV Evaluation Reports

Receive detailed evaluation reports that you can save locally for analysis:

Report Structure

Sample Evaluation Report
claim_id,original_text,extracted_claim,accuracy_score,completeness_score,coherence_score,confidence,category,verifiable,timestamp
claim_001,"Studies show 95% efficacy","Study reports 95% treatment efficacy",0.92,0.88,0.95,0.89,medical,true,2024-01-15T10:30:00Z
claim_002,"Doctors recommend treatment","95% of doctors recommend this treatment",0.87,0.91,0.93,0.85,medical,true,2024-01-15T10:30:00Z
claim_003,"New diabetes management","Treatment effective for diabetes management",0.85,0.82,0.90,0.82,medical,true,2024-01-15T10:30:00Z

Downloading Reports

import requests
import csv
from datetime import datetime

def download_evaluation_report(evaluation_id, api_key):
    """Download evaluation report as CSV"""
    
    response = requests.get(
        f'https://api.checkthat-ai.com/v1/evaluations/{evaluation_id}/report',
        headers={
            'Authorization': f'Bearer {api_key}',
            'Accept': 'text/csv'
        }
    )
    
    if response.status_code == 200:
        # Save report locally
        filename = f"evaluation_report_{datetime.now().strftime('%Y%m%d_%H%M%S')}.csv"
        with open(filename, 'w', newline='', encoding='utf-8') as file:
            file.write(response.text)
        
        print(f"✅ Report saved as {filename}")
        return filename
    else:
        print(f"❌ Failed to download report: {response.status_code}")
        return None

# Usage
report_file = download_evaluation_report("eval_123abc", "your-api-key")

# Analyze report
if report_file:
    with open(report_file, 'r', encoding='utf-8') as file:
        reader = csv.DictReader(file)
        for row in reader:
            print(f"Claim: {row['extracted_claim']}")
            print(f"Accuracy: {row['accuracy_score']}")
            print(f"Verifiable: {row['verifiable']}")
            print("---")

Confident AI Dashboard Access

Access detailed traces and advanced analytics through the Confident AI cloud dashboard:

Dashboard Features

  • Test Run Tracking: View detailed test runs with input/output traces and performance metrics
  • Evaluation Analytics: Comprehensive analytics with visualizations and trend analysis
  • Model Comparison: Compare performance across different models and configurations
  • Custom Metrics: Configure custom evaluation metrics for specific use cases

Accessing the Dashboard

1. Ensure API Key is Set
   Make sure your CONFIDENT_API_KEY is included in your requests:

   # In your CheckThat AI requests
   extra_headers = {
       "X-Confident-API-Key": os.getenv("CONFIDENT_API_KEY")
   }

2. Visit Confident AI Dashboard
   Go to confident-ai.com and log in with your account.

3. View Your Test Runs
   Navigate to your test runs to see detailed traces of your claim evaluation processes.

4. Analyze Performance
   Use the dashboard's analytics tools to identify patterns and optimize your claim extraction workflows.

Dashboard Screenshots

[Screenshot: Confident AI dashboard showing evaluation metrics and traces]

[Screenshot: Detailed test run view with claim extraction analysis]

Integration Examples

Batch Evaluation Workflow

from checkthat_ai import CheckThatAI
import csv
import os

client = CheckThatAI(api_key=os.getenv("OPENAI_API_KEY"))

def evaluate_claims_batch(texts, output_file="batch_evaluation.csv"):
    """Process multiple texts and generate evaluation report"""
    
    results = []
    
    for i, text in enumerate(texts):
        print(f"Processing text {i+1}/{len(texts)}...")
        
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "user", "content": f"Extract and evaluate claims: {text}"}
            ],
            extra_headers={
                "X-Confident-API-Key": os.getenv("CONFIDENT_API_KEY")
            }
        )
        
        # Parse response and collect evaluation data
        # (Response would include evaluation metrics)
        results.append({
            'text_id': i+1,
            'original_text': text,
            'response': response.choices[0].message.content,
            'evaluation_id': response.id  # Use response ID to fetch detailed eval
        })
    
    # Save batch results
    with open(output_file, 'w', newline='', encoding='utf-8') as file:
        if results:
            writer = csv.DictWriter(file, fieldnames=results[0].keys())
            writer.writeheader()
            writer.writerows(results)
    
    print(f"✅ Batch evaluation complete. Results saved to {output_file}")
    return results

# Usage
sample_texts = [
    "Clinical trials show 95% efficacy in reducing symptoms.",
    "The new policy will create 1 million jobs by 2025.",
    "Scientists confirm that this treatment is 100% safe."
]

batch_results = evaluate_claims_batch(sample_texts)
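
Each evaluation_id collected above can then be passed to the download_evaluation_report helper from the CSV reports section to fetch detailed per-claim metrics. A short sketch, assuming both pieces are defined as shown earlier:

# Sketch: pull the per-claim CSV for each batch item, reusing
# download_evaluation_report from the "Downloading Reports" example.
for result in batch_results:
    detailed_report = download_evaluation_report(
        result["evaluation_id"],
        "your-api-key",  # the same key used in the download example above
    )
    if detailed_report:
        print(f"Detailed report for text {result['text_id']}: {detailed_report}")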

Best Practices

Regular Quality Monitoring
  • Set up automated evaluation for production workflows
  • Monitor evaluation trends over time
  • Set quality thresholds and alerts for low-performance cases
Metric Selection
  • Choose evaluation metrics that align with your use case
  • Balance accuracy, completeness, and processing speed
  • Consider domain-specific evaluation criteria
Batch Processing
  • Process multiple claims together for better efficiency
  • Use batch APIs when available for large-scale evaluation
  • Implement proper rate limiting and error handling (see the sketch after this section)
Cost Management
  • Monitor evaluation costs alongside regular API usage
  • Use sampling for large datasets to control evaluation costs
  • Balance evaluation frequency with budget constraints
Report Analysis
  • Regularly review CSV reports for quality trends
  • Identify patterns in low-quality extractions
  • Use insights to improve prompt engineering
Dashboard Utilization
  • Leverage Confident AI dashboard for deep analysis
  • Set up custom metrics for your specific domain
  • Use trace analysis to debug extraction issues
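
As a starting point for the rate-limiting and error-handling advice under Batch Processing, here is a simple pacing-and-retry sketch. The one-second pause, three retries, and linear backoff are illustrative defaults rather than documented limits, and the snippet reuses the client and sample_texts from the batch example above.

import time

# Sketch: simple pacing plus retry around each claim-extraction request.
# The pause length, retry count, and backoff are illustrative defaults.
def extract_with_retry(client, text, retries=3, backoff=1.0):
    for attempt in range(1, retries + 1):
        try:
            return client.chat.completions.create(
                model="gpt-4o",
                messages=[{"role": "user",
                           "content": f"Extract and evaluate claims: {text}"}],
            )
        except Exception as exc:  # narrow to the SDK's error types in practice
            if attempt == retries:
                raise
            print(f"Attempt {attempt} failed ({exc}); retrying...")
            time.sleep(backoff * attempt)

# Space out successive requests to stay under rate limits
for text in sample_texts:
    response = extract_with_retry(client, text)
    time.sleep(1.0)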

FAQ

Do I need to write custom evaluation code?
No! Evaluations run automatically within the claim normalization pipeline. You just need to include your CONFIDENT_API_KEY to access dashboard features and detailed traces.

How do I get the CSV evaluation reports?
CSV reports are generated automatically and can be downloaded via the API or accessed through the Confident AI dashboard. Reports include detailed metrics for each extracted claim.

What if I don't provide a CONFIDENT_API_KEY?
Basic evaluations will still run automatically. However, you'll need a CONFIDENT_API_KEY to access the cloud dashboard, detailed traces, and advanced analytics features.

How often should I review evaluation results?
For production systems, we recommend daily or weekly review of evaluation metrics. For development and testing, review after each significant change to your claim extraction workflow.
Getting Help: Visit confident-ai.com for detailed documentation on the Confident AI platform, or deepeval.com for information about the evaluation framework.
Pro Tip: Use the evaluation insights to continuously improve your claim extraction prompts and configurations. Low-scoring claims often reveal opportunities for prompt optimization.