Getting Started with Microsoft’s Presidio: A Step-by-Step Guide to Detecting and Anonymizing Personally Identifiable Information (PII) in Text
In this tutorial, we will explore how to use Microsoft’s Presidio, an open-source framework designed for detecting, analyzing, and anonymizing personally identifiable information (PII) in free-form text. Built on top of the efficient spaCy NLP library, Presidio is both lightweight and modular, making it easy to integrate into real-time applications and pipelines.
Target Audience Analysis
The target audience for this guide includes data scientists, software developers, and business analysts who are involved in data privacy and compliance. They are likely to work in industries such as finance, healthcare, and technology, where handling PII is critical.
- Pain Points: Concerns about data breaches, compliance with regulations like GDPR and CCPA, and the complexity of implementing PII detection and anonymization.
- Goals: To effectively detect and anonymize PII in text, ensure compliance with data protection regulations, and maintain data utility.
- Interests: Best practices in data privacy, tools for data protection, and advancements in natural language processing (NLP).
- Communication Preferences: Technical documentation, step-by-step guides, and practical examples.
Installation of Presidio Libraries
To get started with Presidio, you’ll need to install the following key libraries:
- presidio-analyzer: This is the core library responsible for detecting PII entities in text using built-in and custom recognizers.
- presidio-anonymizer: This library provides tools to anonymize detected PII using configurable operators.
- spaCy NLP model (en_core_web_lg): Presidio uses spaCy for natural language processing tasks like named entity recognition. The en_core_web_lg model provides high-accuracy results and is recommended for English-language PII detection.
To install the libraries, run the following commands:
pip install presidio-analyzer presidio-anonymizer
python -m spacy download en_core_web_lg
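Before moving on, it can help to confirm that the installed packages are importable. The following is a minimal sketch using only the standard library. The helper name `check_installed` is hypothetical, and the import names are assumed to be `presidio_analyzer`, `presidio_anonymizer`, and `en_core_web_lg`.

```python
from importlib.util import find_spec

# Import names assumed to correspond to the packages installed above.
REQUIRED_MODULES = ("presidio_analyzer", "presidio_anonymizer", "en_core_web_lg")


def check_installed(modules=REQUIRED_MODULES):
    """Return a dict mapping each module name to True if it can be found."""
    return {name: find_spec(name) is not None for name in modules}


if __name__ == "__main__":
    for name, ok in check_installed().items():
        print(f"{name}: {'found' if ok else 'MISSING'}")
```

If any entry is reported as missing, re-run the corresponding install command in the same Python environment.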
Basic PII Detection with Presidio Analyzer
In this section, we initialize the Presidio Analyzer Engine and run a basic analysis to detect a U.S. phone number in a sample text. We also raise the log level of the Presidio analyzer logger to suppress lower-level warnings and keep the output clean.
import logging
logging.getLogger("presidio-analyzer").setLevel(logging.ERROR)

from presidio_analyzer import AnalyzerEngine

# Set up the engine
analyzer = AnalyzerEngine()

# Call analyzer to get results
results = analyzer.analyze(
    text="My phone number is 212-555-5555",
    entities=["PHONE_NUMBER"],
    language="en",
)
print(results)
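Each item in `results` carries an entity type, start and end character offsets, and a confidence score. The sketch below shows one way to turn such spans back into the matched text and to filter by score. To keep it runnable without Presidio, it uses a hypothetical stand-in class `Span` whose field names mirror those of a Presidio result. The helper `extract_matches` and the score value are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """Hypothetical stand-in mirroring the fields of a Presidio result."""
    entity_type: str
    start: int
    end: int
    score: float


def extract_matches(text, spans, min_score=0.0):
    """Return (entity_type, matched_text) pairs for spans at or above min_score."""
    return [
        (s.entity_type, text[s.start:s.end])
        for s in spans
        if s.score >= min_score
    ]


text = "My phone number is 212-555-5555"
number = "212-555-5555"
start = text.find(number)
# The score here is an illustrative value, not one produced by Presidio.
spans = [Span("PHONE_NUMBER", start, start + len(number), 0.75)]

print(extract_matches(text, spans, min_score=0.5))
```

The same slicing, `text[result.start:result.end]`, should work on the objects returned by `analyzer.analyze`, since they expose `start`, `end`, `entity_type`, and `score` attributes.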
Creating a Custom PII Recognizer
This code block shows how to create a custom PII recognizer in Presidio using a simple deny list, ideal for detecting fixed terms like academic titles (e.g., “Dr.”, “Prof.”).
from presidio_analyzer import AnalyzerEngine, PatternRecognizer, RecognizerRegistry

# Step 1: Create a custom pattern recognizer using deny_list
academic_title_recognizer = PatternRecognizer(
    supported_entity="ACADEMIC_TITLE",
    deny_list=["Dr.", "Dr", "Professor", "Prof."],
)

# Step 2: Add it to a registry
registry = RecognizerRegistry()
registry.load_predefined_recognizers()
registry.add_recognizer(academic_title_recognizer)

# Step 3: Create analyzer engine with the updated registry
analyzer = AnalyzerEngine(registry=registry)

# Step 4: Analyze text
text = "Prof. John Smith is meeting with Dr. Alice Brown."
results = analyzer.analyze(text=text, language="en")
for result in results:
    print(result)
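Conceptually, a deny list behaves like a regular expression that matches any of the listed terms as a standalone token. The following is a standard-library approximation of that idea, not Presidio's own implementation. The helper names `deny_list_pattern` and `find_terms` are hypothetical, and unlike Presidio, whose matching flags may differ, this sketch is case-sensitive.

```python
import re


def deny_list_pattern(deny_list):
    """Build a regex that matches any deny-list term as a standalone token."""
    # Longest terms first so "Dr." is preferred over "Dr".
    terms = sorted(deny_list, key=len, reverse=True)
    alternation = "|".join(re.escape(t) for t in terms)
    # A term must start at the beginning of the text or after a non-word
    # character, and end before a non-word character or the end of the text.
    return re.compile(r"(?:^|(?<=\W))(" + alternation + r")(?:(?=\W)|$)")


def find_terms(text, deny_list):
    """Return (term, start, end) for every deny-list hit in text."""
    pattern = deny_list_pattern(deny_list)
    return [(m.group(1), m.start(1), m.end(1)) for m in pattern.finditer(text)]


text = "Prof. John Smith is meeting with Dr. Alice Brown."
for hit in find_terms(text, ["Dr.", "Dr", "Professor", "Prof."]):
    print(hit)
```

The boundary checks are what keep a word such as "Drive" from being reported as a title, which is the main reason to prefer token-level matching over a plain substring search.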
Using the Presidio Anonymizer
This code block demonstrates how to use the Presidio Anonymizer Engine to anonymize detected PII entities in a given text.
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import RecognizerResult, OperatorConfig

# Initialize the engine
engine = AnonymizerEngine()

# Anonymize detected entities
result = engine.anonymize(
    text="My name is Bond, James Bond",
    analyzer_results=[
        RecognizerResult(entity_type="PERSON", start=11, end=15, score=0.8),
        RecognizerResult(entity_type="PERSON", start=17, end=27, score=0.8),
    ],
    operators={"PERSON": OperatorConfig("replace", {"new_value": "BIP"})},
)
print(result)
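To see why the `start` and `end` offsets matter, here is a small standard-library sketch of what a "replace" operation amounts to. It is an illustration under the assumption that spans do not overlap, not Presidio's implementation, and the function name `replace_spans` is hypothetical.

```python
def replace_spans(text, spans, new_value):
    """Replace each (start, end) span in text with new_value."""
    # Work from the last span to the first so earlier offsets stay valid
    # even when the replacement has a different length.
    for start, end in sorted(spans, reverse=True):
        text = text[:start] + new_value + text[end:]
    return text


print(replace_spans("My name is Bond, James Bond", [(11, 15), (17, 27)], "BIP"))
# prints: My name is BIP, BIP
```

Replacing from left to right instead would shift every later offset as soon as the first replacement changed the string length.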
Custom Entity Recognition and Hash-Based Anonymization
In this example, we define custom PII entities (e.g., Aadhaar and PAN numbers) using regex-based PatternRecognizers and anonymize sensitive data using a custom hash-based operator.
from presidio_analyzer import AnalyzerEngine, PatternRecognizer, Pattern

# Define custom recognizers
pan_recognizer = PatternRecognizer(
    supported_entity="IND_PAN",
    name="PAN Recognizer",
    patterns=[Pattern(name="pan", regex=r"\b[A-Z]{5}[0-9]{4}[A-Z]\b", score=0.8)],
    supported_language="en",
)

aadhaar_recognizer = PatternRecognizer(
    supported_entity="AADHAAR",
    name="Aadhaar Recognizer",
    patterns=[Pattern(name="aadhaar", regex=r"\b\d{4}[- ]?\d{4}[- ]?\d{4}\b", score=0.8)],
    supported_language="en",
)
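It is worth checking what these two patterns accept before relying on them. The sketch below exercises the same regular expressions with Python's `re` module alone. The sample strings are made up for illustration, and results inside Presidio may differ if the recognizer applies additional regex flags.

```python
import re

# Same expressions as in the recognizers above.
PAN_REGEX = re.compile(r"\b[A-Z]{5}[0-9]{4}[A-Z]\b")
AADHAAR_REGEX = re.compile(r"\b\d{4}[- ]?\d{4}[- ]?\d{4}\b")

samples = [
    "My PAN is ABCDE1234F",
    "Aadhaar: 1234-5678-9123",
    "Aadhaar: 1234 5678 9123",
    "Aadhaar: 123456789123",
    "an unbroken 16-digit number 1234567891234567",
    "a hyphenated 16-digit number 1234-5678-9123-4567",
]

for s in samples:
    print(s, "->", PAN_REGEX.findall(s), AADHAAR_REGEX.findall(s))
```

Note the last sample: because a hyphen counts as a word boundary, the Aadhaar pattern matches the first twelve digits of a longer hyphenated number. A pattern-only recognizer like this one can therefore produce false positives, which is one reason the patterns carry a score of 0.8 rather than 1.0.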
Analyzing and Anonymizing Input Texts
We analyze two separate texts that both include the same PAN and Aadhaar values. The custom operator ensures they’re anonymized consistently across both inputs.
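import hashlib
from typing import Dict

from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig
from presidio_anonymizer.operators import Operator, OperatorType


# Assumption: this section names a "reanonymizer" operator but never defines it.
# The class below is one possible implementation, not necessarily the original.
class ReAnonymizer(Operator):
    """Replace a value with a short SHA-256 tag, reusing tags via a shared dict."""

    def operate(self, text: str, params: Dict = None) -> str:
        seen = params["entity_mapping"].setdefault(params.get("entity_type", "DEFAULT"), {})
        if text not in seen:
            seen[text] = "<HASH_" + hashlib.sha256(text.encode()).hexdigest()[:10] + ">"
        return seen[text]

    def validate(self, params: Dict = None) -> None:
        if not params or "entity_mapping" not in params:
            raise ValueError("Missing `entity_mapping` in params")

    def operator_name(self) -> str:
        return "reanonymizer"

    def operator_type(self) -> OperatorType:
        return OperatorType.Anonymize


# Register the recognizers defined earlier and the custom operator
# (add_anonymizer requires a Presidio version that provides it)
analyzer = AnalyzerEngine()
analyzer.registry.add_recognizer(pan_recognizer)
analyzer.registry.add_recognizer(aadhaar_recognizer)
anonymizer = AnonymizerEngine()
anonymizer.add_anonymizer(ReAnonymizer)

# Shared across calls so the same value always maps to the same tag
entity_mapping = {}
operators = {"DEFAULT": OperatorConfig("reanonymizer", {"entity_mapping": entity_mapping})}

# Example texts
text1 = "My PAN is ABCDE1234F and Aadhaar number is 1234-5678-9123."
text2 = "His Aadhaar is 1234-5678-9123 and PAN is ABCDE1234F."

# Analyze and anonymize both texts with the same mapping
results1 = analyzer.analyze(text=text1, language="en")
anon1 = anonymizer.anonymize(text1, results1, operators)
results2 = analyzer.analyze(text=text2, language="en")
anon2 = anonymizer.anonymize(text2, results2, operators)

# View results
print("Original 1:", text1)
print("Anonymized 1:", anon1.text)
print("Original 2:", text2)
print("Anonymized 2:", anon2.text)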
Check out the Presidio GitHub repository for more resources.