
Getting Started with Mirascope: Removing Semantic Duplicates using an LLM


Mirascope is a Python library that provides a unified interface to a wide range of Large Language Model (LLM) providers, including OpenAI, Anthropic, Mistral, Google (Gemini and Vertex AI), Groq, Cohere, LiteLLM, Azure AI, and Amazon Bedrock. It covers use cases ranging from text generation and structured data extraction to multi-step AI workflows and agent systems.

This guide focuses on using Mirascope’s OpenAI integration to identify and remove semantic duplicates—entries that may differ in wording but carry the same meaning—from a list of customer reviews.

Installing the Dependencies

To install Mirascope with OpenAI support, use the following command:

pip install "mirascope[openai]"

OpenAI Key

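To obtain an OpenAI API key, visit the OpenAI API Keys page and generate a new key. New users may need to add billing details and make a minimum payment of 5 USD to activate API access. The snippet below reads the key with getpass, so it is not echoed to the screen, and stores it in the OPENAI_API_KEY environment variable: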

import os
from getpass import getpass
os.environ['OPENAI_API_KEY'] = getpass('Enter OpenAI API Key: ')
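
When the script is run repeatedly, it can be convenient to prompt only if the key is not already set. The helper below is a minimal sketch of that idea; the name ensure_api_key and its prompt parameter are illustrative assumptions, not part of Mirascope or the OpenAI SDK.

```python
import os
from getpass import getpass
from typing import Callable


def ensure_api_key(
    var: str = "OPENAI_API_KEY",
    prompt: Callable[[str], str] = getpass,  # injectable so the helper can be tested
) -> str:
    """Return the key stored in `var`, prompting for it only when it is missing."""
    key = os.environ.get(var, "").strip()
    if not key:
        # Nothing usable in the environment: ask the user without echoing input.
        key = prompt(f"Enter {var}: ").strip()
        if not key:
            raise ValueError(f"{var} must not be empty")
        os.environ[var] = key
    return key
```

Calling ensure_api_key() once at the top of a notebook would have the same effect as the getpass lines above, while skipping the prompt when the variable is already exported in the shell.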

Defining the List of Customer Reviews

The following list contains 15 customer reviews covering six themes: sound quality, battery life, ease of use, build quality, call and microphone issues, and value for money. Several reviews express the same sentiment in different words, which makes the list a useful test case for semantic deduplication:

customer_reviews = [
    "Sound quality is amazing!",
    "Audio is crystal clear and very immersive.",
    "Incredible sound, especially the bass response.",
    "Battery doesn't last as advertised.",
    "Needs charging too often.",
    "Battery drains quickly -- not ideal for travel.",
    "Setup was super easy and straightforward.",
    "Very user-friendly, even for my parents.",
    "Simple interface and smooth experience.",
    "Feels cheap and plasticky.",
    "Build quality could be better.",
    "Broke within the first week of use.",
    "People say they can't hear me during calls.",
    "Mic quality is terrible on Zoom meetings.",
    "Great product for the price!"
]

Defining a Pydantic Schema

This Pydantic model defines the structure of the response for the deduplication task. The duplicates field holds groups of semantically equivalent reviews, and the reviews field holds the deduplicated list of distinct feedback themes. The schema is used to parse and validate the language model’s output:

from pydantic import BaseModel, Field

class DeduplicatedReviews(BaseModel):
    duplicates: list[list[str]] = Field(
        ..., description="A list of semantically equivalent customer review groups"
    )
    reviews: list[str] = Field(
        ..., description="The deduplicated list of core customer feedback themes"
    )
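
To see what the schema accepts and rejects without calling an LLM, the sketch below validates two hand-written payloads with Pydantic v2's model_validate. The payload contents are made-up illustrations, not actual model output, and the class is repeated only so the example runs on its own.

```python
from pydantic import BaseModel, Field, ValidationError


class DeduplicatedReviews(BaseModel):
    duplicates: list[list[str]] = Field(
        ..., description="A list of semantically equivalent customer review groups"
    )
    reviews: list[str] = Field(
        ..., description="The deduplicated list of core customer feedback themes"
    )


# Illustrative payload with the expected shape (hand-written, not LLM output).
good_payload = {
    "duplicates": [
        ["Needs charging too often.", "Battery drains quickly -- not ideal for travel."]
    ],
    "reviews": ["Battery life is shorter than expected."],
}
parsed = DeduplicatedReviews.model_validate(good_payload)

# A payload whose `duplicates` field is a plain string should be rejected.
bad_payload = {"duplicates": "not a list of groups", "reviews": ["x"]}
try:
    DeduplicatedReviews.model_validate(bad_payload)
    rejected = False
except ValidationError:
    rejected = True
```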

Defining a Mirascope @openai.call for Semantic Deduplication

The following code defines the deduplication function using Mirascope’s @openai.call decorator, which sends the prompt to OpenAI’s gpt-4o model. Passing response_model=DeduplicatedReviews tells Mirascope to return the result as an instance of that schema. The deduplicate_customer_reviews function takes a list of customer reviews, and the @prompt_template decorator supplies the system and user messages that guide the LLM in identifying and grouping semantically similar reviews. The function body is left as ... because the decorators provide the implementation.

from mirascope.core import openai, prompt_template

@openai.call(model="gpt-4o", response_model=DeduplicatedReviews)
@prompt_template(
    """
    SYSTEM:
    You are an AI assistant helping to analyze customer reviews. 
    Your task is to group semantically similar reviews together -- even if they are worded differently.

    - Use your understanding of meaning, tone, and implication to group duplicates.
    - Return two lists:
      1. A deduplicated list of the key distinct review sentiments.
      2. A list of grouped duplicates that share the same underlying feedback.

    USER:
    {reviews}
    """
)
def deduplicate_customer_reviews(reviews: list[str]): ...

Executing the Deduplication Function

The following code executes the deduplicate_customer_reviews function using the list of customer reviews and prints the structured output:

response = deduplicate_customer_reviews(customer_reviews)

# Ensure response format
assert isinstance(response, DeduplicatedReviews)

# Print Output
print("Distinct Customer Feedback:")
for item in response.reviews:
    print("-", item)

print("Grouped Duplicates:")
for group in response.duplicates:
    print("-", group)

The output summarizes the customer feedback by grouping semantically similar reviews. The "Distinct Customer Feedback" section lists the key themes, while the "Grouped Duplicates" section shows the different phrasings of each sentiment. This eliminates redundancy and makes the feedback easier to analyze. Because the grouping is produced by an LLM, the exact themes and groups can vary between runs.
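
Since the model is not guaranteed to quote every input review verbatim, a quick consistency check against the original list can be useful before relying on the groups. The function below is a minimal sketch using only the standard library; check_groups and the sample data are illustrative assumptions rather than part of Mirascope.

```python
from collections import Counter


def check_groups(original: list[str], duplicates: list[list[str]]) -> dict[str, list[str]]:
    """Compare grouped reviews against the original input list."""
    known = set(original)
    # Count how often each review text appears across all groups.
    counts = Counter(text for group in duplicates for text in group)
    return {
        # Texts in the groups that are not in the input (reworded or invented).
        "unknown": sorted(t for t in counts if t not in known),
        # Input reviews that were not placed in any group.
        "ungrouped": [t for t in original if t not in counts],
        # Texts that were placed in more than one group.
        "repeated": sorted(t for t, n in counts.items() if n > 1),
    }


# Hand-written sample: the second grouped review differs from the input by one dash.
sample_reviews = [
    "Needs charging too often.",
    "Battery drains quickly -- not ideal for travel.",
    "Great product for the price!",
]
sample_groups = [
    ["Needs charging too often.", "Battery drains quickly - not ideal for travel."]
]
report = check_groups(sample_reviews, sample_groups)
```

With the real objects from this guide, the call would be check_groups(customer_reviews, response.duplicates). Reviews without a semantic duplicate may legitimately show up as ungrouped, so that entry is a prompt for inspection, not necessarily an error.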
