«`html

Getting Started with Microsoft’s Presidio: A Step-by-Step Guide to Detecting and Anonymizing Personally Identifiable Information (PII) in Text

In this tutorial, we will explore how to use Microsoft’s Presidio, an open-source framework designed for detecting, analyzing, and anonymizing personally identifiable information (PII) in free-form text. Built on top of the efficient spaCy NLP library, Presidio is both lightweight and modular, making it easy to integrate into real-time applications and pipelines.

Target Audience Analysis

The target audience for this guide includes data scientists, software developers, and business analysts who are involved in data privacy and compliance. They are likely to work in industries such as finance, healthcare, and technology, where handling PII is critical.

Pain Points: Concerns about data breaches, compliance with regulations like GDPR and CCPA, and the complexity of implementing PII detection and anonymization.
Goals: To effectively detect and anonymize PII in text, ensure compliance with data protection regulations, and maintain data utility.
Interests: Best practices in data privacy, tools for data protection, and advancements in natural language processing (NLP).
Communication Preferences: Technical documentation, step-by-step guides, and practical examples.

Installation of Presidio Libraries

To get started with Presidio, you’ll need to install the following key libraries:

presidio-analyzer: This is the core library responsible for detecting PII entities in text using built-in and custom recognizers.
presidio-anonymizer: This library provides tools to anonymize detected PII using configurable operators.
spaCy NLP model (en_core_web_lg): Presidio uses spaCy for natural language processing tasks like named entity recognition. The en_core_web_lg model provides high-accuracy results and is recommended for English-language PII detection.

To install the libraries, run the following commands:

pip install presidio-analyzer presidio-anonymizer
python -m spacy download en_core_web_lg

Basic PII Detection with Presidio Analyzer

In this section, we initialize the Presidio Analyzer Engine and run a basic analysis to detect a U.S. phone number from a sample text. We also suppress lower-level log warnings from the Presidio library for cleaner output.

import logging
logging.getLogger("presidio-analyzer").setLevel(logging.ERROR)

from presidio_analyzer import AnalyzerEngine

# Set up the engine
analyzer = AnalyzerEngine()

# Call analyzer to get results
results = analyzer.analyze(text="My phone number is 212-555-5555",
                           entities=["PHONE_NUMBER"],
                           language='en')
print(results)

Creating a Custom PII Recognizer

This code block shows how to create a custom PII recognizer in Presidio using a simple deny list, ideal for detecting fixed terms like academic titles (e.g., “Dr.”, “Prof.”).

from presidio_analyzer import AnalyzerEngine, PatternRecognizer, RecognizerRegistry

# Step 1: Create a custom pattern recognizer using deny_list
academic_title_recognizer = PatternRecognizer(
    supported_entity="ACADEMIC_TITLE",
    deny_list=["Dr.", "Dr", "Professor", "Prof."]
)

# Step 2: Add it to a registry
registry = RecognizerRegistry()
registry.load_predefined_recognizers()
registry.add_recognizer(academic_title_recognizer)

# Step 3: Create analyzer engine with the updated registry
analyzer = AnalyzerEngine(registry=registry)

# Step 4: Analyze text
text = "Prof. John Smith is meeting with Dr. Alice Brown."
results = analyzer.analyze(text=text, language="en")

for result in results:
    print(result)

Using the Presidio Anonymizer

This code block demonstrates how to use the Presidio Anonymizer Engine to anonymize detected PII entities in a given text.

from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import RecognizerResult, OperatorConfig

# Initialize the engine
engine = AnonymizerEngine()

# Anonymize detected entities
result = engine.anonymize(
    text="My name is Bond, James Bond",
    analyzer_results=[
        RecognizerResult(entity_type="PERSON", start=11, end=15, score=0.8),
        RecognizerResult(entity_type="PERSON", start=17, end=27, score=0.8),
    ],
    operators={"PERSON": OperatorConfig("replace", {"new_value": "BIP"})},
)

print(result)

Custom Entity Recognition and Hash-Based Anonymization

In this example, we define custom PII entities (e.g., Aadhaar and PAN numbers) using regex-based PatternRecognizers and anonymize sensitive data using a custom hash-based operator.

from presidio_analyzer import AnalyzerEngine, PatternRecognizer, Pattern

# Define custom recognizers
pan_recognizer = PatternRecognizer(
    supported_entity="IND_PAN",
    name="PAN Recognizer",
    patterns=[Pattern(name="pan", regex=r"\b[A-Z]{5}[0-9]{4}[A-Z]\b", score=0.8)],
    supported_language="en"
)

aadhaar_recognizer = PatternRecognizer(
    supported_entity="AADHAAR",
    name="Aadhaar Recognizer",
    patterns=[Pattern(name="aadhaar", regex=r"\b\d{4}[- ]?\d{4}[- ]?\d{4}\b", score=0.8)],
    supported_language="en"
)

Analyzing and Anonymizing Input Texts

We analyze two separate texts that both include the same PAN and Aadhaar values. The custom operator ensures they’re anonymized consistently across both inputs.

from pprint import pprint

# Example texts
text1 = "My PAN is ABCDE1234F and Aadhaar number is 1234-5678-9123."
text2 = "His Aadhaar is 1234-5678-9123 and PAN is ABCDE1234F."

# Analyze and anonymize first text
results1 = analyzer.analyze(text=text1, language="en")
anon1 = anonymizer.anonymize(
    text1,
    results1,
    {
        "DEFAULT": OperatorConfig("reanonymizer", {"entity_mapping": entity_mapping})
    }
)

# Analyze and anonymize second text
results2 = analyzer.analyze(text=text2, language="en")
anon2 = anonymizer.anonymize(
    text2,
    results2,
    {
        "DEFAULT": OperatorConfig("reanonymizer", {"entity_mapping": entity_mapping})
    }
)

# View results
print(" Original 1:", text1)
print(" Anonymized 1:", anon1.text)
print(" Original 2:", text2)
print(" Anonymized 2:", anon2.text)

Check out the Presidio GitHub repository for more resources. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and subscribe to our newsletter.

«`

Getting Started with Microsoft’s Presidio: A Step-by-Step Guide to Detecting and Anonymizing Personally Identifiable Information PII in Text