Build a Groundedness Verification Tool Using Upstage API and LangChain
Understanding the Target Audience
The primary audience for this tutorial includes AI developers, data scientists, and business managers who need to ensure the reliability of AI-generated content. Their main pain points are the accuracy of AI outputs and the need for trustworthy information in decision-making. They want to strengthen the credibility of their AI systems without sacrificing efficiency in content generation, and they value clear, concise explanations backed by practical, real-world examples.
Introduction to Upstage’s Groundedness Check Service
Upstage’s Groundedness Check service offers a robust API for verifying that AI-generated responses are anchored in reliable source material. By submitting context–answer pairs to the Upstage endpoint, users can instantly determine whether the supplied context supports a given answer and receive a confidence assessment of that grounding. This tutorial demonstrates how to utilize Upstage’s core capabilities, including single-shot verification, batch processing, and multi-domain testing, to ensure that AI systems produce factual and trustworthy content across diverse subject areas.
Setting Up the Environment
To begin, install the necessary packages:
pip install -qU langchain-core langchain-upstage
Next, set your Upstage API key in the environment to authenticate all subsequent groundedness check requests:
import os
os.environ["UPSTAGE_API_KEY"] = "Use Your API Key Here"
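Hardcoding a key is fine for a quick experiment, but for shared code it is safer to prompt for the key only when it is missing. Below is a minimal sketch using Python's standard getpass module; UPSTAGE_API_KEY is the environment variable the langchain-upstage integration reads:

import os
from getpass import getpass

# Prompt for the key only when it is not already set, so the script
# itself never needs to contain the secret.
if "UPSTAGE_API_KEY" not in os.environ:
    os.environ["UPSTAGE_API_KEY"] = getpass("Enter your Upstage API key: ")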
Creating the AdvancedGroundednessChecker Class
The AdvancedGroundednessChecker class wraps Upstage's groundedness API in a reusable interface that supports both single and batch context–answer checks while accumulating results. It includes helper methods to extract a confidence label from each response and to compute overall accuracy statistics across all checks.
from typing import Any, Dict, List

from langchain_upstage import UpstageGroundednessCheck

class AdvancedGroundednessChecker:
    """Wrapper around Upstage's groundedness check that accumulates results."""

    def __init__(self):
        self.checker = UpstageGroundednessCheck()
        self.results = []

    def check_single(self, context: str, answer: str) -> Dict[str, Any]:
        """Run one groundedness check and record the result."""
        request = {"context": context, "answer": answer}
        response = self.checker.invoke(request)
        result = {
            "context": context,
            "answer": answer,
            "grounded": response,
            "confidence": self._extract_confidence(response)
        }
        self.results.append(result)
        return result

    def batch_check(self, test_cases: List[Dict[str, str]]) -> List[Dict[str, Any]]:
        """Run a groundedness check for each context-answer pair."""
        batch_results = []
        for case in test_cases:
            result = self.check_single(case["context"], case["answer"])
            batch_results.append(result)
        return batch_results

    def _extract_confidence(self, response) -> str:
        """Map the raw verdict to a coarse confidence label.

        Test for the not-grounded verdict first: "notgrounded" contains
        "grounded" as a substring, so checking "grounded" first would
        mislabel every not-grounded response as high confidence.
        """
        if hasattr(response, "lower"):
            text = response.lower()
            if "notgrounded" in text or "not grounded" in text:
                return "low"
            if "grounded" in text:
                return "high"
        return "medium"

    def analyze_results(self) -> Dict[str, Any]:
        """Aggregate counts and the overall groundedness rate."""
        total = len(self.results)
        # Compare against the exact verdict so not-grounded responses
        # are not counted as grounded by substring matching.
        grounded = sum(
            1 for r in self.results
            if str(r["grounded"]).strip().lower() == "grounded"
        )
        return {
            "total_checks": total,
            "grounded_count": grounded,
            "not_grounded_count": total - grounded,
            "accuracy_rate": grounded / total if total > 0 else 0
        }
Running Groundedness Checks
Here are examples of running single groundedness checks. First, instantiate the wrapper; the rest of the tutorial reuses this single checker instance:

checker = AdvancedGroundednessChecker()

# Case 1: the answer introduces a specific height the context never states.
result1 = checker.check_single(
    context="Mauna Kea is an inactive volcano on the island of Hawai'i.",
    answer="Mauna Kea is 5,207.3 meters tall."
)

# Case 2: the creator claim matches the context; the readability claim goes beyond it.
result2 = checker.check_single(
    context="Python is a high-level programming language created by Guido van Rossum in 1991.",
    answer="Python was made by Guido van Rossum & focuses on code readability."
)

# Case 3: a vague answer that is consistent with the context.
result3 = checker.check_single(
    context="The Great Wall of China is approximately 13,000 miles long.",
    answer="The Great Wall of China is very long."
)

# Case 4: the answer contradicts the boiling point given in the context.
result4 = checker.check_single(
    context="Water boils at 100 degrees Celsius at sea level atmospheric pressure.",
    answer="Water boils at 90 degrees Celsius at sea level."
)
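Each call returns the dictionary assembled in check_single. The raw verdict string comes from the Upstage service; in the LangChain integration it is typically a value such as "grounded" or "notGrounded", so treat the exact strings in the comments below as an assumption:

# Inspect the first result: 'grounded' holds the raw verdict from the
# service, 'confidence' the coarse label derived from it.
print(result1["grounded"])    # e.g. "notGrounded" (assumed verdict format)
print(result1["confidence"])  # one of "low" / "medium" / "high"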
Batch Processing Example
Batch processing allows for multiple checks at once:
test_cases = [
    {
        "context": "Shakespeare wrote Romeo and Juliet in the late 16th century.",
        "answer": "Romeo and Juliet was written by Shakespeare."
    },
    {
        "context": "The speed of light is approximately 299,792,458 meters per second.",
        "answer": "Light travels at about 300,000 kilometers per second."
    },
    {
        "context": "Earth has one natural satellite called the Moon.",
        "answer": "Earth has two moons."
    }
]
batch_results = checker.batch_check(test_cases)
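Each entry in batch_results has the same shape as a single-check result, so you can print a one-line verdict per test case:

# Print a short verdict for each batch case.
for i, result in enumerate(batch_results, start=1):
    print(f"Case {i}: {result['grounded']} (confidence: {result['confidence']})")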
Results Analysis
After running the checks, analyze the results:
analysis = checker.analyze_results()
print(f"Total checks performed: {analysis['total_checks']}")
print(f"Grounded responses: {analysis['grounded_count']}")
print(f"Not grounded responses: {analysis['not_grounded_count']}")
print(f"Groundedness rate: {analysis['accuracy_rate']:.2%}")
Multi-domain Testing
Conduct multi-domain validations to illustrate how Upstage handles groundedness across different subject areas. Note that the History answer deliberately contradicts its context (wrong year and wrong surrendering country) to exercise the not-grounded path:
domains = {
    "Science": {
        "context": "Photosynthesis is the process by which plants convert sunlight, carbon dioxide, & water into glucose and oxygen.",
        "answer": "Plants use photosynthesis to make food from sunlight and CO2."
    },
    "History": {
        "context": "World War II ended in 1945 after the surrender of Japan following the atomic bombings.",
        "answer": "WWII ended in 1944 with Germany's surrender."
    },
    "Geography": {
        "context": "Mount Everest is the highest mountain on Earth, located in the Himalayas at 8,848.86 meters.",
        "answer": "Mount Everest is the tallest mountain and is located in the Himalayas."
    }
}
for domain, test_case in domains.items():
    result = checker.check_single(test_case["context"], test_case["answer"])
    print(f"{domain}: {result['grounded']} (confidence: {result['confidence']})")
Creating a Test Report
Generate a detailed test report summarizing the performance:
def create_test_report(checker_instance):
    """Summarize all checks run so far and attach simple recommendations."""
    report = {
        "summary": checker_instance.analyze_results(),
        "detailed_results": checker_instance.results,
        "recommendations": []
    }
    accuracy = report["summary"]["accuracy_rate"]
    if accuracy < 0.7:
        report["recommendations"].append("Consider reviewing answer generation process")
    if accuracy > 0.9:
        report["recommendations"].append("High accuracy - system performing well")
    return report
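Calling the function on the checker instance used throughout this tutorial produces the report; one way to print it:

# Build the report from every check run so far and show the summary
# plus any recommendations.
report = create_test_report(checker)
print("Summary:", report["summary"])
for rec in report["recommendations"]:
    print("Recommendation:", rec)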
Conclusion
This tutorial demonstrated the following:
- Basic groundedness checking
- Batch processing capabilities
- Multi-domain testing
- Results analysis and reporting
- Advanced wrapper implementation
With Upstage’s Groundedness Check, users gain a scalable, domain-agnostic solution for real-time fact verification and confidence scoring. By integrating this service into workflows, organizations can enhance the reliability of AI-generated outputs and maintain rigorous standards of factual integrity across all applications.
For further exploration, check out the Upstage website for more resources and documentation.