A Coding Guide to Build Modular and Self-Correcting QA Systems with DSPy
In this tutorial, we explore how to build an intelligent and self-correcting question-answering system using the DSPy framework, integrated with Google’s Gemini 1.5 Flash model. We begin by defining structured Signatures that clearly outline input-output behavior, which DSPy uses as its foundation for building reliable pipelines. With DSPy’s declarative programming approach, we construct composable modules, such as AdvancedQA and SimpleRAG, to answer questions using both context and retrieval-augmented generation. By combining DSPy’s modularity with Gemini’s powerful reasoning, we craft an AI system capable of delivering accurate, step-by-step answers. As we progress, we also leverage DSPy’s optimization tools, such as BootstrapFewShot, to automatically enhance performance based on training examples.
Installation
We start by installing the required libraries: dspy-ai for declarative AI pipelines and google-generativeai for access to Google's Gemini models. After importing the necessary modules, we configure Gemini with our API key and then set up DSPy to use the Gemini 1.5 Flash model as our language model backend.
!pip install dspy-ai google-generativeai
import dspy
import google.generativeai as genai
import random
from typing import List, Optional
GOOGLE_API_KEY = "Use Your Own API Key"
genai.configure(api_key=GOOGLE_API_KEY)
dspy.configure(lm=dspy.LM(model="gemini/gemini-1.5-flash", api_key=GOOGLE_API_KEY))
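Before moving on, a quick smoke test can confirm that the API key and model name work. The snippet below is an optional, illustrative check rather than part of the tutorial's own code; it relies on the fact that a dspy.LM instance is directly callable and returns a list of completions.
# Optional sanity check (illustrative): call the configured LM directly.
lm = dspy.LM(model="gemini/gemini-1.5-flash", api_key=GOOGLE_API_KEY)
print(lm("Reply with the single word: ready")[0])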
Defining Signatures
We define two DSPy Signatures to structure our system's inputs and outputs. First, QuestionAnswering expects a context and a question, and it returns both reasoning and a final answer, allowing the model to explain its thought process. Next, FactualityCheck verifies the truthfulness of an answer by returning a simple boolean, helping us build a self-correcting QA system.
class QuestionAnswering(dspy.Signature):
    """Answer questions based on given context with reasoning."""
    context: str = dspy.InputField(desc="Relevant context information")
    question: str = dspy.InputField(desc="Question to answer")
    reasoning: str = dspy.OutputField(desc="Step-by-step reasoning")
    answer: str = dspy.OutputField(desc="Final answer")

class FactualityCheck(dspy.Signature):
    """Verify if an answer is factually correct given context."""
    context: str = dspy.InputField()
    question: str = dspy.InputField()
    answer: str = dspy.InputField()
    is_correct: bool = dspy.OutputField(desc="True if answer is factually correct")
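Before composing these signatures into larger modules, it helps to see how DSPy turns a Signature into a callable predictor. The snippet below is a minimal, illustrative sketch; the context and question strings are placeholders rather than part of the original tutorial.
# Wrap the signature in a simple predictor and call it with keyword inputs.
qa_probe = dspy.Predict(QuestionAnswering)
result = qa_probe(
    context="The Eiffel Tower is located in Paris, France and stands 330 meters tall.",
    question="Where is the Eiffel Tower located?"
)
print(result.reasoning)  # step-by-step reasoning field
print(result.answer)     # final answer field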
Creating the AdvancedQA Module
We create an AdvancedQA module to add self-correction capability to our QA system. It first uses a Chain-of-Thought predictor to generate an answer with reasoning. Then, it checks the factual accuracy using a fact-checking predictor. If the answer is incorrect, we refine the context and retry, up to a specified number of times, to ensure more reliable outputs.
class AdvancedQA(dspy.Module):
    def __init__(self, max_retries: int = 2):
        super().__init__()
        self.max_retries = max_retries
        self.qa_predictor = dspy.ChainOfThought(QuestionAnswering)
        self.fact_checker = dspy.Predict(FactualityCheck)

    def forward(self, context: str, question: str) -> dspy.Prediction:
        prediction = self.qa_predictor(context=context, question=question)

        for attempt in range(self.max_retries):
            fact_check = self.fact_checker(
                context=context,
                question=question,
                answer=prediction.answer
            )
            if fact_check.is_correct:
                break
            refined_context = f"{context}\n\nPrevious incorrect answer: {prediction.answer}\nPlease provide a more accurate answer."
            prediction = self.qa_predictor(context=refined_context, question=question)

        return prediction
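As a quick illustration of how the module is used, AdvancedQA is called like any other DSPy module; the inputs below are placeholders rather than part of the original tutorial.
# Illustrative call: forward() runs via __call__ on any dspy.Module.
qa = AdvancedQA(max_retries=2)
pred = qa(
    context="Python was created by Guido van Rossum and first released in 1991.",
    question="Who created Python?"
)
print(pred.answer)     # expected: "Guido van Rossum"
print(pred.reasoning)  # chain-of-thought behind the answer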
Implementing the SimpleRAG Module
We build a SimpleRAG module to simulate Retrieval-Augmented Generation using DSPy. We provide a knowledge base and implement a basic keyword-based retriever that fetches the most relevant documents for a given question. These documents serve as context for the AdvancedQA module, which then performs reasoning and self-correction to produce an accurate answer.
class SimpleRAG(dspy.Module):
    def __init__(self, knowledge_base: List[str]):
        super().__init__()
        self.knowledge_base = knowledge_base
        self.qa_system = AdvancedQA()

    def retrieve(self, question: str, top_k: int = 2) -> str:
        scored_docs = []
        question_words = set(question.lower().split())

        for doc in self.knowledge_base:
            doc_words = set(doc.lower().split())
            score = len(question_words.intersection(doc_words))
            scored_docs.append((score, doc))

        scored_docs.sort(reverse=True)
        return "\n\n".join([doc for _, doc in scored_docs[:top_k]])

    def forward(self, question: str) -> dspy.Prediction:
        context = self.retrieve(question)
        return self.qa_system(context=context, question=question)
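To make the keyword retriever concrete, here is a small illustrative sketch with a two-document knowledge base; the documents are placeholders, not the tutorial's own list.
# Illustrative mini knowledge base to exercise the word-overlap retriever.
docs = [
    "The Eiffel Tower is located in Paris, France and stands 330 meters tall.",
    "Python is a high-level programming language created by Guido van Rossum.",
]
rag_demo = SimpleRAG(docs)
# The Eiffel Tower document shares more words with the question, so it ranks first.
print(rag_demo.retrieve("How tall is the Eiffel Tower?", top_k=1))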
Knowledge Base and Training Examples
We define a small knowledge base containing diverse facts across various topics, including history, programming, and science. This serves as our context source for retrieval. Alongside, we prepare a set of training examples to guide DSPy’s optimization process. Each example includes a question, its relevant context, and the correct answer, helping our system learn how to respond more accurately.
knowledge_base = [
    "Use Your Context and Knowledge Base Here"
]

training_examples = [
    dspy.Example(
        question="What is the height of the Eiffel Tower?",
        context="The Eiffel Tower is located in Paris, France. It was constructed from 1887 to 1889 and stands 330 meters tall including antennas.",
        answer="330 meters"
    ).with_inputs("question", "context"),
    dspy.Example(
        question="Who created Python programming language?",
        context="Python is a high-level programming language created by Guido van Rossum. It was first released in 1991 and emphasizes code readability.",
        answer="Guido van Rossum"
    ).with_inputs("question", "context"),
    dspy.Example(
        question="What is machine learning?",
        context="ML focuses on algorithms that can learn from data without being explicitly programmed.",
        answer="Machine learning focuses on algorithms that learn from data without explicit programming."
    ).with_inputs("question", "context")
]
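The knowledge base above is a placeholder for your own documents. For the later steps to run as written (the pre-optimization test reads its context from knowledge_base[0], and the RAG demo retrieves over the list), it needs factual snippets such as the contexts reused in the training examples; a minimal illustrative filling could look like this:
# Illustrative knowledge base entries mirroring the training-example contexts;
# knowledge_base[0] covers the Eiffel Tower question used in the tests below.
knowledge_base = [
    "The Eiffel Tower is located in Paris, France. It was constructed from 1887 to 1889 and stands 330 meters tall including antennas.",
    "Python is a high-level programming language created by Guido van Rossum. It was first released in 1991 and emphasizes code readability.",
    "Machine learning focuses on algorithms that can learn from data without being explicitly programmed.",
]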
Evaluating the System
We begin by defining a simple accuracy metric to check whether the predicted answer contains the correct response. After initializing our SimpleRAG system and a baseline ChainOfThought QA module, we test the baseline on a sample question before any optimization. Then, using DSPy's BootstrapFewShot optimizer, we fine-tune the QA system with our training examples. This enables the model to automatically generate more effective prompts, leading to improved accuracy, which we verify by comparing responses before and after optimization.
def accuracy_metric(example, prediction, trace=None):
    """Simple accuracy metric for evaluation"""
    return example.answer.lower() in prediction.answer.lower()

print("Initializing DSPy QA System with Gemini...")
print("Note: Using Google's Gemini 1.5 Flash (free tier)")

rag_system = SimpleRAG(knowledge_base)
basic_qa = dspy.ChainOfThought(QuestionAnswering)

print("\nBefore Optimization:")
test_question = "What is the height of the Eiffel Tower?"
test_context = knowledge_base[0]
initial_prediction = basic_qa(context=test_context, question=test_question)

print(f"Q: {test_question}")
print(f"A: {initial_prediction.answer}")
print(f"Reasoning: {initial_prediction.reasoning}")

print("\nOptimizing with BootstrapFewShot...")
optimizer = dspy.BootstrapFewShot(metric=accuracy_metric, max_bootstrapped_demos=2)
optimized_qa = optimizer.compile(basic_qa, trainset=training_examples)

print("\nAfter Optimization:")
optimized_prediction = optimized_qa(context=test_context, question=test_question)

print(f"Q: {test_question}")
print(f"A: {optimized_prediction.answer}")
print(f"Reasoning: {optimized_prediction.reasoning}")
Final Evaluation
We run an Advanced RAG demo by asking multiple questions across different domains. For each question, the SimpleRAG system retrieves the most relevant context and then uses the self-correcting AdvancedQA module to generate a well-reasoned answer. We print the answers along with a preview of the reasoning, showcasing how DSPy combines retrieval and thoughtful generation to deliver reliable responses.
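As a minimal sketch of this demo loop, the snippet below asks a couple of sample questions and prints each answer with a short reasoning preview; the questions are illustrative and assume the knowledge base covers them.
# Illustrative RAG demo: ask a few questions, print the answer and a reasoning preview.
demo_questions = [
    "What is the height of the Eiffel Tower?",
    "Who created the Python programming language?",
]
for q in demo_questions:
    pred = rag_system(question=q)  # retrieve context, then answer with self-correction
    print(f"\nQ: {q}")
    print(f"A: {pred.answer}")
    print(f"Reasoning preview: {pred.reasoning[:150]}...")  # first 150 characters
With the demo in place, we measure accuracy on the training examples using a small helper that reuses accuracy_metric: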
def evaluate_system(qa_module, test_cases):
    """Evaluate QA system performance"""
    correct = 0
    total = len(test_cases)

    for example in test_cases:
        prediction = qa_module(context=example.context, question=example.question)
        if accuracy_metric(example, prediction):
            correct += 1

    return correct / total

print("\nEvaluation Results:")
print(f"Basic QA Accuracy: {evaluate_system(basic_qa, training_examples):.2%}")
print(f"Optimized QA Accuracy: {evaluate_system(optimized_qa, training_examples):.2%}")
print("n Tutorial Complete! Key DSPy Concepts Demonstrated:")
print("1. Signatures - Defined input/output schemas")
print("2. Modules - Built composable QA systems")
print("3. Self-correction - Implemented iterative improvement")
print("4. RAG - Created retrieval-augmented generation")
print("5. Optimization - Used BootstrapFewShot to improve prompts")
print("6. Evaluation - Measured system performance")
print("7. Free API - Powered by Google Gemini 1.5 Flash")
In conclusion, we have successfully demonstrated the full potential of DSPy for building advanced QA pipelines. We see how DSPy simplifies the design of intelligent modules with clear interfaces, supports self-correction loops, integrates basic retrieval, and enables few-shot prompt optimization with minimal code. With just a few lines, we configure and evaluate our models using real-world examples, measuring performance gains. This hands-on experience shows how DSPy, when combined with Google’s Gemini API, empowers us to rapidly prototype, test, and scale sophisticated language applications without boilerplate or complex logic.