Understanding the Target Audience
The target audience for «A Coding Guide to Build a Functional Data Analysis Workflow Using Lilac» consists mainly of data scientists, data analysts, and business intelligence developers. This audience typically works in industries that rely heavily on data-driven decision-making, such as finance, healthcare, technology, and marketing.
Pain Points:
- Struggling with inefficient data workflows that are difficult to maintain.
- Lack of modularity and scalability in existing data analysis pipelines.
- Challenges in filtering and exporting structured insights effectively.
Goals:
- To build efficient and reusable data analysis workflows.
- To leverage functional programming principles for cleaner and more manageable code.
- To transform and extract actionable insights from datasets with ease.
Interests:
- Utilizing new libraries and frameworks, such as Lilac, for data management.
- Staying updated on best practices in data analysis and visualization.
- Engaging in communities focused on data science and programming.
Communication Preferences:
- Prefer technical documentation that is concise and practical.
- Engage with content that includes code examples and hands-on tutorials.
- Appreciate peer-reviewed research and case studies that provide real-world applications.
Coding Guide for a Functional Data Analysis Workflow Using Lilac
This tutorial presents a comprehensive and modular data analysis pipeline utilizing the Lilac library. This approach not only facilitates dataset management but also integrates Python’s functional programming paradigm, fostering a clean and extensible workflow. The tutorial encompasses all stages, from project setup and data generation to insight extraction and output exporting, while focusing on reusable and testable code structures.
Getting Started
To begin, install the necessary libraries by executing the following command:
!pip install "lilac[all]" pandas numpy
This installs the complete Lilac suite along with Pandas and NumPy, both of which are central to the data handling and analysis steps that follow. (Quoting "lilac[all]" keeps shells from treating the brackets as a glob pattern.)
Importing Essential Libraries
Next, import the required libraries:
import json
import uuid
import pandas as pd
from pathlib import Path
from typing import List, Dict, Any, Tuple, Optional
from functools import reduce, partial
import lilac as ll
Here, json handles record serialization, uuid generates unique project names, Pandas manages structured data, and Path from pathlib manages directories. The type hints clarify function signatures, while reduce and partial from functools power the composition patterns used throughout. Lastly, the core Lilac library is imported as ll.
Creating Functional Utilities
Define reusable functional utilities to streamline the data processing:
def pipe(*functions):
    return lambda x: reduce(lambda acc, f: f(acc), functions, x)

def map_over(func, iterable):
    return list(map(func, iterable))

def filter_by(predicate, iterable):
    return list(filter(predicate, iterable))
The pipe function enables left-to-right function composition, while map_over and filter_by apply functional transformations and filtering to iterable data, as the short sketch below shows.
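For instance, a small composition can double a list of numbers, keep only the larger results, and sum what remains in a single pass (the helper names double_all and keep_large are illustrative, not part of the tutorial's pipeline):

# Compose three steps left to right: map, filter, then aggregate.
double_all = partial(map_over, lambda x: x * 2)
keep_large = partial(filter_by, lambda x: x > 5)
process = pipe(double_all, keep_large, sum)

print(process([1, 2, 3, 4]))  # [2, 4, 6, 8] -> [6, 8] -> 14

With these utilities in place, realistic sample data can be generated: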
def create_sample_data() -> List[Dict[str, Any]]:
    return [
        {"id": 1, "text": "What is machine learning?", "category": "tech", "score": 0.9, "tokens": 5},
        {"id": 2, "text": "Machine learning is AI subset", "category": "tech", "score": 0.8, "tokens": 6},
        {"id": 3, "text": "Contact support for help", "category": "support", "score": 0.7, "tokens": 4},
        {"id": 4, "text": "What is machine learning?", "category": "tech", "score": 0.9, "tokens": 5},
        {"id": 5, "text": "Deep learning neural networks", "category": "tech", "score": 0.85, "tokens": 4},
        {"id": 6, "text": "How to optimize models?", "category": "tech", "score": 0.75, "tokens": 5},
        {"id": 7, "text": "Performance tuning guide", "category": "guide", "score": 0.6, "tokens": 3},
        {"id": 8, "text": "Advanced optimization techniques", "category": "tech", "score": 0.95, "tokens": 3},
        {"id": 9, "text": "Gradient descent algorithm", "category": "tech", "score": 0.88, "tokens": 3},
        {"id": 10, "text": "Model evaluation metrics", "category": "tech", "score": 0.82, "tokens": 3},
    ]
This function creates a sample dataset consisting of fields like text, category, score, and token counts, providing an essential resource for demonstrating Lilac’s capabilities.
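These records can already be exercised with the functional utilities before Lilac enters the picture. The sketch below computes the average score of the tech items; the names is_tech, get_scores, and avg_tech_score are illustrative and not part of the pipeline that follows:

# Filter to tech records, project out the scores, then average them.
is_tech = partial(filter_by, lambda item: item["category"] == "tech")
get_scores = partial(map_over, lambda item: item["score"])
avg_tech_score = pipe(is_tech, get_scores, lambda scores: sum(scores) / len(scores))

print(f"Average tech score: {avg_tech_score(create_sample_data()):.2f}")  # about 0.86 for this sample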
Setting Up the Lilac Project
Establish the Lilac project directory:
def setup_lilac_project(project_name: str) -> str:
    project_dir = f"./{project_name}-{uuid.uuid4().hex[:6]}"
    Path(project_dir).mkdir(exist_ok=True)
    ll.set_project_dir(project_dir)
    return project_dir
This function initializes a unique directory for the project, ensuring organized management of data files.
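As a quick check, calling the helper creates the directory and returns its path; the hex suffix shown in the comment is made up and will differ on every run:

project_dir = setup_lilac_project("demo_project")
print(project_dir)  # e.g. ./demo_project-3f9a1c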
Creating and Transforming Datasets
Generate a dataset from the sample data:
def create_dataset_from_data(name: str, data: List[Dict]) -> ll.Dataset:
    data_file = f"{name}.jsonl"
    with open(data_file, 'w') as f:
        for item in data:
            f.write(json.dumps(item) + '\n')
    config = ll.DatasetConfig(
        namespace="tutorial",
        name=name,
        source=ll.sources.JSONSource(filepaths=[data_file])
    )
    return ll.create_dataset(config)
This function converts the sample data into a JSON Lines file and creates a Lilac dataset for structured analysis.
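The intermediate JSON Lines step is plain Python, so it can be previewed on its own; preview.jsonl below is a throwaway filename used only for illustration:

# Each record becomes one JSON object per line.
with open("preview.jsonl", "w") as f:
    for item in create_sample_data()[:2]:
        f.write(json.dumps(item) + "\n")

with open("preview.jsonl") as f:
    print(f.read())
# {"id": 1, "text": "What is machine learning?", "category": "tech", "score": 0.9, "tokens": 5}
# {"id": 2, "text": "Machine learning is AI subset", "category": "tech", "score": 0.8, "tokens": 6}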
Data Extraction and Filtering
Extract the data into a Pandas DataFrame:
def extract_dataframe(dataset: ll.Dataset, fields: List[str]) -> pd.DataFrame:
    return dataset.to_pandas(fields)
Then, apply functional filters:
def apply_functional_filters(df: pd.DataFrame) -> Dict[str, pd.DataFrame]:
    filters = {
        'high_score': lambda df: df[df['score'] >= 0.8],
        'tech_category': lambda df: df[df['category'] == 'tech'],
        'min_tokens': lambda df: df[df['tokens'] >= 4],
        'no_duplicates': lambda df: df.drop_duplicates(subset=['text'], keep='first'),
        'combined_quality': lambda df: df[(df['score'] >= 0.8) & (df['tokens'] >= 3) & (df['category'] == 'tech')]
    }
    return {name: filter_func(df.copy()) for name, filter_func in filters.items()}
This function produces several filtered views of the data in a single pass, making it easy to compare subsets against different quality criteria.
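Because the filters operate on a plain DataFrame, they can be previewed without the Lilac dataset step, a shortcut used here purely for illustration:

df = pd.DataFrame(create_sample_data())
views = apply_functional_filters(df)

for name, view in views.items():
    print(f"{name}: {len(view)} records")
# no_duplicates, for instance, drops the repeated "What is machine learning?" row.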
Analyzing Data Quality
Assess the quality of the dataset with the following function:
def analyze_data_quality(df: pd.DataFrame) -> Dict[str, Any]:
    return {
        'total_records': len(df),
        'unique_texts': df['text'].nunique(),
        'duplicate_rate': 1 - (df['text'].nunique() / len(df)),
        'avg_score': df['score'].mean(),
        'category_distribution': df['category'].value_counts().to_dict(),
        'score_distribution': {
            'high': len(df[df['score'] >= 0.8]),
            'medium': len(df[(df['score'] >= 0.6) & (df['score'] < 0.8)]),
            'low': len(df[df['score'] < 0.6])
        },
        'token_stats': {
            'mean': df['tokens'].mean(),
            'min': df['tokens'].min(),
            'max': df['tokens'].max()
        }
    }
This function provides essential metrics, allowing users to gauge the dataset’s integrity and readiness for analysis.
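Applied to the raw sample data, the report immediately surfaces the duplicated question and the category skew; json.dumps with default=str is used here only to pretty-print the mixed numeric types:

report = analyze_data_quality(pd.DataFrame(create_sample_data()))
print(json.dumps(report, indent=2, default=str))
# For this sample: 10 records, 9 unique texts, a 10% duplicate rate, and 8 of 10 rows in the tech category.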
Transformations and Exporting Data
Define transformations to enrich the dataset:
def create_data_transformations() -> Dict[str, callable]:
    return {
        'normalize_scores': lambda df: df.assign(norm_score=df['score'] / df['score'].max()),
        'add_length_category': lambda df: df.assign(
            length_cat=pd.cut(df['tokens'], bins=[0, 3, 5, float('inf')], labels=['short', 'medium', 'long'])
        ),
        'add_quality_tier': lambda df: df.assign(
            quality_tier=pd.cut(df['score'], bins=[0, 0.6, 0.8, 1.0], labels=['low', 'medium', 'high'])
        ),
        'add_category_rank': lambda df: df.assign(
            category_rank=df.groupby('category')['score'].rank(ascending=False)
        )
    }
Apply these transformations to the DataFrame:
def apply_transformations(df: pd.DataFrame, transform_names: List[str]) -> pd.DataFrame:
    transformations = create_data_transformations()
    selected_transforms = [transformations[name] for name in transform_names if name in transformations]
    return pipe(*selected_transforms)(df.copy()) if selected_transforms else df
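A quick way to confirm the composition behaves as intended is to apply two transformations to the sample data and inspect the new columns (a sketch that reuses only the functions defined above):

enriched = apply_transformations(pd.DataFrame(create_sample_data()), ["normalize_scores", "add_quality_tier"])
print(enriched[["id", "score", "norm_score", "quality_tier"]].head())
# norm_score divides each score by the maximum (0.95 here); quality_tier buckets scores
# into low (<= 0.6), medium (<= 0.8), and high (<= 1.0).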
Finally, export filtered datasets to files:
def export_filtered_data(filtered_datasets: Dict[str, pd.DataFrame], output_dir: str) -> None:
    Path(output_dir).mkdir(exist_ok=True)
    for name, df in filtered_datasets.items():
        output_file = Path(output_dir) / f"{name}_filtered.jsonl"
        with open(output_file, 'w') as f:
            for _, row in df.iterrows():
                f.write(json.dumps(row.to_dict()) + '\n')
        print(f"Exported {len(df)} records to {output_file}")
This function writes each filtered subset to its own JSON Lines file, keeping the exports organized for downstream applications.
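Since each export is standard JSON Lines, it can be read straight back into pandas for downstream work; the path below assumes project_dir from setup_lilac_project and the high_score view produced earlier:

export_path = Path(project_dir) / "exports" / "high_score_filtered.jsonl"
reloaded = pd.read_json(export_path, lines=True)
print(f"Reloaded {len(reloaded)} high-score records")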
Main Analysis Pipeline
The main function orchestrates the full workflow:
def main_analysis_pipeline():
    print("Setting up Lilac project...")
    project_dir = setup_lilac_project("advanced_tutorial")

    print("Creating sample dataset...")
    sample_data = create_sample_data()
    dataset = create_dataset_from_data("sample_data", sample_data)

    print("Extracting data...")
    df = extract_dataframe(dataset, ['id', 'text', 'category', 'score', 'tokens'])

    print("Analyzing data quality...")
    quality_report = analyze_data_quality(df)
    print(f"Original data: {quality_report['total_records']} records")
    print(f"Duplicates: {quality_report['duplicate_rate']:.1%}")
    print(f"Average score: {quality_report['avg_score']:.2f}")

    print("Applying transformations...")
    transformed_df = apply_transformations(df, ['normalize_scores', 'add_length_category', 'add_quality_tier'])

    print("Applying filters...")
    filtered_datasets = apply_functional_filters(transformed_df)

    print("\nFilter Results:")
    for name, filtered_df in filtered_datasets.items():
        print(f" {name}: {len(filtered_df)} records")

    print("Exporting filtered datasets...")
    export_filtered_data(filtered_datasets, f"{project_dir}/exports")

    print("\nTop Quality Records:")
    best_quality = filtered_datasets['combined_quality'].head(3)
    for _, row in best_quality.iterrows():
        print(f" • {row['text']} (score: {row['score']}, category: {row['category']})")

    return {
        'original_data': df,
        'transformed_data': transformed_df,
        'filtered_data': filtered_datasets,
        'quality_report': quality_report
    }

if __name__ == "__main__":
    results = main_analysis_pipeline()
    print("\nAnalysis complete! Check the exports folder for filtered datasets.")
This pipeline showcases the integration of the Lilac library with functional programming principles, enabling the development of modular and expressive data workflows.
In conclusion, users will acquire a practical understanding of creating a reproducible data pipeline that leverages Lilac’s dataset abstractions and functional programming patterns for scalable and clean analysis. The tutorial covers crucial stages such as dataset creation, transformation, filtering, quality analysis, and export, providing flexibility for both experimentation and deployment.