How to Build an Advanced BrightData Web Scraper with Google Gemini for AI-Powered Data Extraction
In this tutorial, we walk you through building an enhanced web scraping tool that leverages BrightData's powerful proxy network alongside Google's Gemini API for intelligent data extraction. You'll learn how to structure your Python project, install and import the necessary libraries, and encapsulate scraping logic within a clean, reusable BrightDataScraper class. Whether you're targeting Amazon product pages, bestseller listings, or LinkedIn profiles, the scraper's modular methods demonstrate how to configure scraping parameters, handle errors gracefully, and return structured JSON results. An optional ReAct-style AI agent integration also shows you how to combine LLM-driven reasoning with real-time scraping, empowering you to pose natural-language queries for on-the-fly data analysis.
Installation of Required Libraries
We install all of the key libraries needed for the tutorial in one step: langchain-brightdata for BrightData web scraping, langchain-google-genai and google-generativeai for Google Gemini integration, langgraph for agent orchestration, and langchain-core for the core LangChain framework.
!pip install langchain-brightdata langchain-google-genai langgraph langchain-core google-generativeai
Importing Necessary Libraries
import os
import json
from typing import Dict, Any, Optional
from langchain_brightdata import BrightDataWebScraperAPI
from langchain_google_genai import ChatGoogleGenerativeAI
from langgraph.prebuilt import create_react_agent
These imports prepare your environment and core functionality: os and json handle system operations and data serialization, while typing provides structured type hints. You then bring in BrightDataWebScraperAPI for BrightData scraping, ChatGoogleGenerativeAI to interface with Google's Gemini LLM, and create_react_agent to orchestrate these components in a ReAct-style agent.
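If you prefer not to hard-code credentials in the script, the os module imported above also lets you read the API keys from environment variables. This is a minimal sketch, assuming you have exported the two variables shown (the names mirror the ones used later in this tutorial):
import os

# Read API keys from the environment instead of hard-coding them;
# these variable names match the ones used later in the tutorial
BRIGHT_DATA_API_KEY = os.environ.get("BRIGHT_DATA_API_KEY", "")
GOOGLE_API_KEY = os.environ.get("GOOGLE_API_KEY", "")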
Creating the BrightDataScraper Class
class BrightDataScraper:
    """Enhanced web scraper using BrightData API"""

    def __init__(self, api_key: str, google_api_key: Optional[str] = None):
        """Initialize scraper with API keys"""
        self.api_key = api_key
        self.scraper = BrightDataWebScraperAPI(bright_data_api_key=api_key)

        if google_api_key:
            self.llm = ChatGoogleGenerativeAI(
                model="gemini-2.0-flash",
                google_api_key=google_api_key
            )
            self.agent = create_react_agent(self.llm, [self.scraper])
The BrightDataScraper class encapsulates all BrightData web-scraping logic and optional Gemini-powered intelligence under a single, reusable interface. Its methods let you easily fetch Amazon product details, bestseller lists, and LinkedIn profiles; they take care of API calls, error handling, and JSON formatting, and can even stream natural-language agent queries when a Google API key is provided.
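For example, instantiating the class looks like this (a hypothetical snippet; substitute your own keys, and omit google_api_key if you only need raw scraping without the agent):
# Hypothetical usage; replace the placeholder strings with your own keys
scraper = BrightDataScraper(
    api_key="YOUR_BRIGHT_DATA_API_KEY",
    google_api_key="YOUR_GOOGLE_API_KEY"  # optional, enables the AI agent
)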
Scraping Methods
    def scrape_amazon_product(self, url: str, zipcode: str = "10001") -> Dict[str, Any]:
        """Scrape Amazon product data"""
        try:
            results = self.scraper.invoke({
                "url": url,
                "dataset_type": "amazon_product",
                "zipcode": zipcode
            })
            return {"success": True, "data": results}
        except Exception as e:
            return {"success": False, "error": str(e)}

    def scrape_amazon_bestsellers(self, region: str = "in") -> Dict[str, Any]:
        """Scrape Amazon bestsellers"""
        try:
            url = f"https://www.amazon.{region}/gp/bestsellers/"
            results = self.scraper.invoke({
                "url": url,
                "dataset_type": "amazon_product"
            })
            return {"success": True, "data": results}
        except Exception as e:
            return {"success": False, "error": str(e)}

    def scrape_linkedin_profile(self, url: str) -> Dict[str, Any]:
        """Scrape LinkedIn profile data"""
        try:
            results = self.scraper.invoke({
                "url": url,
                "dataset_type": "linkedin_person_profile"
            })
            return {"success": True, "data": results}
        except Exception as e:
            return {"success": False, "error": str(e)}
Running AI Agent Queries
    def run_agent_query(self, query: str) -> None:
        """Run AI agent with natural language query"""
        if not hasattr(self, 'agent'):
            print("Error: Google API key required for agent functionality")
            return

        try:
            # Wrap the raw query string in the (role, content) message format
            # that langgraph's prebuilt ReAct agent expects
            for step in self.agent.stream(
                {"messages": [("user", query)]},
                stream_mode="values"
            ):
                step["messages"][-1].pretty_print()
        except Exception as e:
            print(f"Agent error: {e}")
Printing Results
    def print_results(self, results: Dict[str, Any], title: str = "Results") -> None:
        """Pretty print results"""
        print(f"\n{'='*50}")
        print(f"{title}")
        print(f"{'='*50}")

        if results["success"]:
            print(json.dumps(results["data"], indent=2, ensure_ascii=False))
        else:
            print(f"Error: {results['error']}")
        print()
Main Execution Function
def main():
    """Main execution function"""
    BRIGHT_DATA_API_KEY = "Use Your Own API Key"
    GOOGLE_API_KEY = "Use Your Own API Key"

    scraper = BrightDataScraper(BRIGHT_DATA_API_KEY, GOOGLE_API_KEY)

    print("Scraping Amazon India Bestsellers...")
    bestsellers = scraper.scrape_amazon_bestsellers("in")
    scraper.print_results(bestsellers, "Amazon India Bestsellers")

    print("Scraping Amazon Product...")
    product_url = "https://www.amazon.com/dp/B08L5TNJHG"
    product_data = scraper.scrape_amazon_product(product_url, "10001")
    scraper.print_results(product_data, "Amazon Product Data")

    print("Scraping LinkedIn Profile...")
    linkedin_url = "https://www.linkedin.com/in/satyanadella/"
    linkedin_data = scraper.scrape_linkedin_profile(linkedin_url)
    scraper.print_results(linkedin_data, "LinkedIn Profile Data")

    print("Running AI Agent Query...")
    agent_query = """
    Scrape Amazon product data for https://www.amazon.com/dp/B0D2Q9397Y?th=1
    in New York (zipcode 10001) and summarize the key product details.
    """
    scraper.run_agent_query(agent_query)
The main() function ties everything together by setting your BrightData and Google API keys, instantiating the BrightDataScraper, and then demonstrating each feature: it scrapes Amazon India's bestsellers, fetches details for a specific product, retrieves a LinkedIn profile, and finally runs a natural-language agent query, printing neatly formatted results after each step.
Final Execution Block
if __name__ == "__main__":
    print("Installing required packages...")
    os.system("pip install -q langchain-brightdata langchain-google-genai langgraph")

    os.environ["BRIGHT_DATA_API_KEY"] = "Use Your Own API Key"

    main()
Finally, this entry-point block ensures that, when the file is run as a standalone script, the required scraping libraries are quietly installed and the BrightData API key is set in the environment before main() kicks off all the scraping and agent workflows.
Conclusion
By the end of this tutorial, you’ll have a ready-to-use Python script that automates tedious data collection tasks, abstracts away low-level API details, and optionally taps into generative AI for advanced query handling. You can extend this foundation by adding support for other dataset types, integrating additional LLMs, or deploying the scraper as part of a larger data pipeline or web service. With these building blocks in place, you’re now equipped to gather, analyze, and present web data more efficiently, whether for market research, competitive intelligence, or custom AI-driven applications.
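As one illustration of that extensibility, a generic method can expose any BrightData dataset type without writing a dedicated wrapper for each. This is a minimal sketch following the same invoke pattern as the methods above; scrape_custom is a hypothetical name, and the dataset_type values you pass must be ones your BrightData account actually supports:
    def scrape_custom(self, url: str, dataset_type: str, **params) -> Dict[str, Any]:
        """Scrape any supported BrightData dataset type (hypothetical extension)"""
        try:
            results = self.scraper.invoke({
                "url": url,
                "dataset_type": dataset_type,  # e.g. "amazon_product"
                **params                       # extra fields, e.g. zipcode="10001"
            })
            return {"success": True, "data": results}
        except Exception as e:
            return {"success": False, "error": str(e)}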