
Build a Gemini-Powered DataFrame Agent for Natural Language Data Analysis with Pandas and LangChain


In this tutorial, we will learn to harness the power of Google’s Gemini models alongside the flexibility of Pandas to perform both straightforward and sophisticated data analyses on the classic Titanic dataset. By combining the ChatGoogleGenerativeAI client with LangChain’s experimental Pandas DataFrame agent, we will set up an interactive “agent” that can interpret natural-language queries. This agent will be capable of inspecting data, computing statistics, uncovering correlations, and generating visual insights without writing manual code for each task.

Target Audience Analysis

The target audience for this tutorial consists of data analysts, data scientists, and business professionals interested in leveraging AI for data analysis. Their pain points include:

  • Difficulty in interpreting large datasets without extensive coding skills.
  • Time constraints in manually analyzing data and generating insights.
  • A need for more intuitive tools that simplify data exploration and visualization.

Their goals include:

  • Enhancing their data analysis capabilities with AI tools.
  • Improving decision-making through data-driven insights.
  • Learning how to utilize emerging technologies like Gemini and LangChain.

Interests of this audience may include:

  • Natural language processing and its applications in data analysis.
  • Integration of AI models into existing data workflows.
  • Exploring innovative methods to visualize and interpret data.

Communication preferences include:

  • Clear, concise, and technical language with practical examples.
  • Step-by-step instructions that are easy to follow.
  • Access to code snippets and tutorials that can be directly applied.

Setting Up Your Environment

First, we install the required libraries (langchain_experimental, langchain_google_genai, and pandas) using pip; they provide the DataFrame agent and the Google Gemini integration. Use the following command:

!pip install langchain_experimental langchain_google_genai pandas

Next, import the core modules:

import os
import pandas as pd
import numpy as np
from langchain.agents.agent_types import AgentType
from langchain_experimental.agents.agent_toolkits import create_pandas_dataframe_agent
from langchain_google_genai import ChatGoogleGenerativeAI

Set your GOOGLE_API_KEY environment variable, and we’re ready to instantiate a Gemini-powered Pandas agent for conversational data analysis.
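For example, inside a notebook you can set the key for the current process only (the placeholder string below is not a real key; obtain yours from Google AI Studio):

```python
import os

# Set the key for this process if it isn't already defined.
# Replace the placeholder with your actual Gemini API key.
os.environ.setdefault("GOOGLE_API_KEY", "your-api-key-here")

print("Key configured:", bool(os.getenv("GOOGLE_API_KEY")))
```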

Creating the Gemini Agent

def setup_gemini_agent(df, temperature=0, model="gemini-1.5-flash"):
    llm = ChatGoogleGenerativeAI(
        model=model,
        temperature=temperature,
        convert_system_message_to_human=True
    )
    agent = create_pandas_dataframe_agent(
        llm=llm,
        df=df,
        verbose=True,
        agent_type=AgentType.OPENAI_FUNCTIONS,
        allow_dangerous_code=True
    )
    return agent

This helper function initializes a Gemini LLM client with our chosen model and temperature, then wraps it in a LangChain Pandas DataFrame agent that can execute natural-language queries against our DataFrame. Note that allow_dangerous_code=True is required because the agent runs the Python it generates; only enable it in environments you trust.

Loading and Exploring Data

def load_and_explore_data():
    print("Loading Titanic Dataset...")
    df = pd.read_csv(
        "https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv"
    )
    print(f"Dataset shape: {df.shape}")
    print(f"Columns: {list(df.columns)}")
    return df

This function fetches the Titanic CSV directly from the Pandas GitHub repository, providing a quick overview of its dimensions and column names.
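The same inspection steps work on any DataFrame; the miniature stand-in below mimics a few Titanic columns so the snippet runs without a network connection (it is not the real dataset):

```python
import pandas as pd

# A tiny stand-in for the Titanic data, for offline illustration only.
df = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 2],
    "Age": [22.0, 38.0, None],
})

print(f"Dataset shape: {df.shape}")    # Dataset shape: (3, 3)
print(f"Columns: {list(df.columns)}")  # Columns: ['Survived', 'Pclass', 'Age']
```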

Basic Analysis Demo

def basic_analysis_demo(agent):
    print("\nBASIC ANALYSIS DEMO")
    print("=" * 50)
    queries = [
        "How many rows and columns are in the dataset?",
        "What's the survival rate (percentage of people who survived)?",
        "How many people have more than 3 siblings?",
        "What's the square root of the average age?",
        "Show me the distribution of passenger classes"
    ]
    for query in queries:
        print(f"\nQuery: {query}")
        try:
            result = agent.invoke(query)
            print(f"Result: {result['output']}")
        except Exception as e:
            print(f"Error: {e}")

This demo routine runs basic exploratory queries against our Titanic DataFrame agent, printing each result or any error encountered.
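Behind the scenes, the agent answers a question like "What's the survival rate?" by generating and executing ordinary Pandas code, roughly equivalent to this sketch (toy data, not the real dataset):

```python
import pandas as pd

# Toy stand-in: 3 of 5 passengers survived.
df = pd.DataFrame({"Survived": [0, 1, 1, 0, 1]})

# Survived is 0/1, so its mean is the survival fraction.
survival_rate = df["Survived"].mean() * 100
print(f"Survival rate: {survival_rate:.1f}%")  # Survival rate: 60.0%
```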

Advanced Analysis Demo

def advanced_analysis_demo(agent):
    print("\nADVANCED ANALYSIS DEMO")
    print("=" * 50)
    advanced_queries = [
        "What's the correlation between age and fare?",
        "Create a survival analysis by gender and class",
        "What's the median age for each passenger class?",
        "Find passengers with the highest fares and their details",
        "Calculate the survival rate for different age groups (0-18, 18-65, 65+)"
    ]
    for query in advanced_queries:
        print(f"\nQuery: {query}")
        try:
            result = agent.invoke(query)
            print(f"Result: {result['output']}")
        except Exception as e:
            print(f"Error: {e}")

This function runs a series of sophisticated queries, computing correlations and performing stratified survival analyses.
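The correlation and per-class median queries reduce to standard Pandas operations; here is a sketch on made-up numbers (illustrative only, not values from the real dataset):

```python
import pandas as pd

df = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3],
    "Age":    [38.0, 35.0, 27.0, 22.0, 30.0],
    "Fare":   [71.3, 53.1, 13.0, 7.25, 8.05],
})

# Pearson correlation between age and fare.
corr = df["Age"].corr(df["Fare"])

# Median age for each passenger class.
median_age = df.groupby("Pclass")["Age"].median()

print(f"Age-Fare correlation: {corr:.2f}")
print(median_age)
```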

Multi-DataFrame Demo

def multi_dataframe_demo():
    print("\nMULTI-DATAFRAME DEMO")
    print("=" * 50)
    df = pd.read_csv(
        "https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv"
    )
    df_filled = df.copy()
    df_filled["Age"] = df_filled["Age"].fillna(df_filled["Age"].mean())
    agent = setup_gemini_agent([df, df_filled])
    queries = [
        "How many rows in the age column are different between the two datasets?",
        "Compare the average age in both datasets",
        "What percentage of age values were missing in the original dataset?",
        "Show summary statistics for age in both datasets"
    ]
    for query in queries:
        print(f"\nQuery: {query}")
        try:
            result = agent.invoke(query)
            print(f"Result: {result['output']}")
        except Exception as e:
            print(f"Error: {e}")

This demo illustrates cross-dataset comparisons, allowing for deeper insights into data differences and imputation impacts.
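The comparison questions map onto simple vectorized checks. For instance, counting how many Age values changed under mean imputation can be sketched like this (toy data for illustration):

```python
import pandas as pd

df = pd.DataFrame({"Age": [22.0, None, 30.0, None]})
df_filled = df.copy()
df_filled["Age"] = df_filled["Age"].fillna(df_filled["Age"].mean())

# Rows that differ between the two datasets are exactly the imputed ones;
# the -1 sentinel makes NaN comparable for the inequality check.
changed = int((df["Age"].fillna(-1) != df_filled["Age"]).sum())
missing_pct = df["Age"].isna().mean() * 100

print(changed, missing_pct)  # 2 50.0
```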

Custom Analysis Demo

def custom_analysis_demo(agent):
    print("\nCUSTOM ANALYSIS DEMO")
    print("=" * 50)
    custom_queries = [
        "Create a risk score for each passenger based on: Age (higher age = higher risk), Gender (male = higher risk), Class (3rd class = higher risk), Family size (alone or large family = higher risk). Then show the top 10 highest risk passengers who survived",
        "Analyze the 'deck' information from the cabin data: Extract deck letter from cabin numbers, Show survival rates by deck, Which deck had the highest survival rate?",
        "Find interesting patterns: Did people with similar names (same surname) tend to survive together? What's the relationship between ticket price and survival? Were there any age groups that had 100% survival rate?"
    ]
    for i, query in enumerate(custom_queries, 1):
        print(f"\nCustom Analysis {i}:")
        print(f"Query: {query[:100]}...")
        try:
            result = agent.invoke(query)
            print(f"Result: {result['output']}")
        except Exception as e:
            print(f"Error: {e}")

This routine demonstrates how to handle complex, multi-step prompts for bespoke analyses.
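As a sense of what the agent produces for the deck prompt, the analysis boils down to string slicing plus a groupby; the cabin values below are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "Cabin":    ["C85", None, "E46", "C123", "B58"],
    "Survived": [1, 0, 0, 1, 1],
})

# The deck is the first character of the cabin code; missing cabins stay NaN
# and are dropped by groupby automatically.
df["Deck"] = df["Cabin"].str[0]
deck_survival = df.groupby("Deck")["Survived"].mean()
print(deck_survival)
```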

Main Function

def main():
    print("Advanced Pandas Agent with Gemini Tutorial")
    print("=" * 60)
    if not os.getenv("GOOGLE_API_KEY"):
        print("Warning: GOOGLE_API_KEY not set!")
        print("Please set your Gemini API key as an environment variable.")
        return
    try:
        df = load_and_explore_data()
        print("\nSetting up Gemini Agent...")
        agent = setup_gemini_agent(df)
        basic_analysis_demo(agent)
        advanced_analysis_demo(agent)
        multi_dataframe_demo()
        custom_analysis_demo(agent)
        print("\nTutorial completed successfully!")
    except Exception as e:
        print(f"Error: {e}")
        print("Make sure you have installed all required packages and set your API key.")


if __name__ == "__main__":
    main()

The main function serves as the starting point for the tutorial, verifying the API key, loading data, and running analysis demos sequentially.

Conclusion

Combining Pandas with Gemini via a LangChain DataFrame agent transforms data exploration from writing boilerplate code into crafting clear, natural-language queries. Users can compute summary statistics, build custom risk scores, compare multiple DataFrames, or conduct nuanced survival analyses, achieving all this with minimal setup. This interactive analytics assistant can adapt to new questions on the fly, surfacing hidden patterns and accelerating workflows.

