data-quality-auditor
Assess data quality with checks for missing values, duplicates, type issues, and inconsistencies. Use for data validation, ETL pipelines, or dataset documentation.
$ Install
git clone https://github.com/dkyazzentwatwa/chatgpt-skills /tmp/chatgpt-skills && cp -r /tmp/chatgpt-skills/data-quality-auditor ~/.claude/skills/chatgpt-skills
Tip: Run this command in your terminal to install the skill.
SKILL.md
name: data-quality-auditor
description: Assess data quality with checks for missing values, duplicates, type issues, and inconsistencies. Use for data validation, ETL pipelines, or dataset documentation.
Data Quality Auditor
Comprehensive data quality assessment for CSV/Excel datasets.
Features
- Completeness: Missing values analysis
- Uniqueness: Duplicate detection
- Validity: Type validation and constraints
- Consistency: Pattern and format checks
- Quality Score: Overall data quality metric
- Reports: Detailed HTML/JSON reports
Quick Start
from data_quality_auditor import DataQualityAuditor
auditor = DataQualityAuditor()
auditor.load_csv("customers.csv")
# Run full audit
report = auditor.audit()
print(f"Quality Score: {report['quality_score']}/100")
# Check specific issues
missing = auditor.check_missing()
duplicates = auditor.check_duplicates()
CLI Usage
# Full audit
python data_quality_auditor.py --input data.csv
# Generate HTML report
python data_quality_auditor.py --input data.csv --report report.html
# Check specific aspects
python data_quality_auditor.py --input data.csv --missing
python data_quality_auditor.py --input data.csv --duplicates
python data_quality_auditor.py --input data.csv --types
# JSON output
python data_quality_auditor.py --input data.csv --json
# Validate against rules
python data_quality_auditor.py --input data.csv --rules rules.json
API Reference
DataQualityAuditor Class
import pandas as pd

class DataQualityAuditor:
    def __init__(self): ...

    # Data loading
    def load_csv(self, filepath: str, **kwargs) -> 'DataQualityAuditor': ...
    def load_dataframe(self, df: pd.DataFrame) -> 'DataQualityAuditor': ...

    # Full audit
    def audit(self) -> dict: ...
    def quality_score(self) -> float: ...

    # Individual checks
    def check_missing(self) -> dict: ...
    def check_duplicates(self, subset: list = None) -> dict: ...
    def check_types(self) -> dict: ...
    def check_uniqueness(self) -> dict: ...
    def check_patterns(self, column: str, pattern: str) -> dict: ...

    # Validation
    def validate_column(self, column: str, rules: dict) -> dict: ...
    def validate_dataset(self, rules: dict) -> dict: ...

    # Reports
    def generate_report(self, output: str, format: str = "html") -> str: ...
    def summary(self) -> str: ...
Quality Checks
Missing Values
missing = auditor.check_missing()
# Returns:
{
    "total_cells": 10000,
    "missing_cells": 150,
    "missing_percent": 1.5,
    "by_column": {
        "email": {"count": 50, "percent": 5.0},
        "phone": {"count": 100, "percent": 10.0}
    },
    "rows_with_missing": 120
}
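As a rough illustration of how figures like these can be derived, the sketch below computes the same fields directly with pandas. It is not the skill's actual implementation: the function name `summarize_missing` and the sample data are assumptions made for this example.

```python
import pandas as pd

def summarize_missing(df: pd.DataFrame) -> dict:
    """Compute missing-value statistics shaped like the check_missing() example.

    Hypothetical helper for illustration; not part of data_quality_auditor.
    """
    n_rows, n_cols = df.shape
    total_cells = n_rows * n_cols
    missing_per_column = df.isna().sum()          # missing count per column
    missing_cells = int(missing_per_column.sum())

    # Report only columns that actually contain missing values
    by_column = {
        col: {"count": int(count), "percent": float(round(100 * count / n_rows, 2))}
        for col, count in missing_per_column.items()
        if count > 0
    }
    missing_percent = round(100 * missing_cells / total_cells, 2) if total_cells else 0.0
    return {
        "total_cells": total_cells,
        "missing_cells": missing_cells,
        "missing_percent": missing_percent,
        "by_column": by_column,
        # A row counts once, however many of its cells are missing
        "rows_with_missing": int(df.isna().any(axis=1).sum()),
    }

# Made-up sample data
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Cy", "Di"],
    "email": ["a@x.org", None, "c@x.org", None],
    "phone": [None, None, "555-0100", "555-0101"],
})
result = summarize_missing(df)
print(result)
```

Note that `rows_with_missing` can be smaller than `missing_cells`, because one row may have several empty cells.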
Duplicates
dups = auditor.check_duplicates()
# Returns:
{
    "total_rows": 1000,
    "duplicate_rows": 25,
    "duplicate_percent": 2.5,
    "duplicate_groups": [...],
    "by_columns": {
        "email": {"duplicates": 15},
        "phone": {"duplicates": 20}
    }
}
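A comparable summary can be sketched with `DataFrame.duplicated`. The helper below is hypothetical and rests on one assumption the source does not confirm: that a "duplicate" is every repeat after the first occurrence. It also omits the `duplicate_groups` field.

```python
import pandas as pd

def summarize_duplicates(df: pd.DataFrame, subset=None) -> dict:
    """Count duplicate rows and per-column duplicate values.

    Hypothetical helper for illustration; not part of data_quality_auditor.
    """
    total_rows = len(df)
    # keep="first" flags each repeat but not the first occurrence
    duplicate_rows = int(df.duplicated(subset=subset, keep="first").sum())

    by_columns = {}
    for col in df.columns:
        # Drop missing values first so repeated NaNs are not counted as duplicates
        n_dups = int(df[col].dropna().duplicated(keep="first").sum())
        if n_dups:
            by_columns[col] = {"duplicates": n_dups}

    percent = round(100 * duplicate_rows / total_rows, 2) if total_rows else 0.0
    return {
        "total_rows": total_rows,
        "duplicate_rows": duplicate_rows,
        "duplicate_percent": percent,
        "by_columns": by_columns,
    }

# Made-up sample data: row 2 repeats row 0; phone "2" also repeats in row 3
df = pd.DataFrame({
    "email": ["a@x.org", "b@x.org", "a@x.org", "c@x.org"],
    "phone": ["1", "2", "1", "2"],
})
dups = summarize_duplicates(df)
print(dups)
```

Passing `subset=["email"]` restricts the row-level comparison to that column, which mirrors the `subset` parameter of `check_duplicates`.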
Type Validation
types = auditor.check_types()
# Returns:
{
    "columns": {
        "age": {
            "detected_type": "int64",
            "unique_values": 75,
            "sample_values": [25, 30, 45],
            "issues": []
        },
        "date": {
            "detected_type": "object",
            "unique_values": 365,
            "sample_values": ["2023-01-01", "invalid"],
            "issues": ["Mixed date formats detected"]
        }
    }
}
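One way such a per-column profile could be produced is sketched below. The heuristic is an assumption made for this example, not the skill's documented behaviour: a text column is flagged only when some, but not all, of its values parse under a single expected date format. The name `profile_column` and its `date_format` parameter are hypothetical.

```python
import pandas as pd
from pandas.api import types as ptypes

def profile_column(series: pd.Series, date_format: str = "%Y-%m-%d") -> dict:
    """Profile one column: dtype, distinct count, samples, and date-format issues.

    Hypothetical helper for illustration; not part of data_quality_auditor.
    """
    values = series.dropna()
    issues = []

    is_text = ptypes.is_object_dtype(series) or ptypes.is_string_dtype(series)
    if is_text and len(values) > 0:
        # Values that do not match the format become NaT instead of raising
        parsed = pd.to_datetime(values.astype(str), format=date_format, errors="coerce")
        n_bad = int(parsed.isna().sum())
        # If nothing parses, the column is probably not a date column at all
        if 0 < n_bad < len(values):
            issues.append(f"{n_bad} value(s) do not match date format {date_format}")

    return {
        "detected_type": str(series.dtype),
        "unique_values": int(values.nunique()),
        "sample_values": values.head(3).tolist(),
        "issues": issues,
    }

# Made-up sample columns
date_profile = profile_column(pd.Series(["2023-01-01", "2023-01-02", "invalid"]))
age_profile = profile_column(pd.Series([25, 30, 45]))
name_profile = profile_column(pd.Series(["Ann", "Bob"]))
print(date_profile["issues"], age_profile["issues"], name_profile["issues"])
```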
Validation Rules
Define custom validation rules:
{
    "columns": {
        "email": {
            "required": true,
            "unique": true,
            "pattern": "^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$"
        },
        "age": {
            "type": "integer",
            "min": 0,
            "max": 120
        },
        "status": {
            "allowed_values": ["active", "inactive", "pending"]
        },
        "created_at": {
            "type": "date",
            "format": "%Y-%m-%d"
        }
    }
}
# rules is the structure above loaded as a Python dict (for example with json.load)
results = auditor.validate_dataset(rules)
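To make the rule semantics concrete, here is a minimal sketch of how a subset of these rules might be applied to a DataFrame. It covers only `required`, `unique`, `pattern`, `min`, `max`, and `allowed_values`; the `type` and `format` rules are left out. The function names and the returned shape (column name mapped to a list of issue strings) are assumptions for illustration and may differ from what `validate_dataset` returns.

```python
import pandas as pd

def check_column(series: pd.Series, rules: dict) -> list[str]:
    """Apply a subset of the rule keys to one column and return issue messages."""
    issues = []
    present = series.dropna()  # most rules only make sense for non-missing values

    if rules.get("required") and series.isna().any():
        issues.append(f"{int(series.isna().sum())} missing value(s)")
    if rules.get("unique") and present.duplicated().any():
        issues.append(f"{int(present.duplicated().sum())} duplicate value(s)")
    if "pattern" in rules:
        # fullmatch requires the whole string to match the regular expression
        bad = ~present.astype(str).str.fullmatch(rules["pattern"])
        if bad.any():
            issues.append(f"{int(bad.sum())} value(s) do not match the pattern")
    if "min" in rules and (present < rules["min"]).any():
        issues.append(f"{int((present < rules['min']).sum())} value(s) below {rules['min']}")
    if "max" in rules and (present > rules["max"]).any():
        issues.append(f"{int((present > rules['max']).sum())} value(s) above {rules['max']}")
    if "allowed_values" in rules:
        bad = ~present.isin(rules["allowed_values"])
        if bad.any():
            issues.append(f"{int(bad.sum())} value(s) outside the allowed set")
    return issues

def check_dataset(df: pd.DataFrame, rules: dict) -> dict:
    """Run check_column for every column named in rules["columns"]."""
    report = {}
    for column, column_rules in rules["columns"].items():
        if column not in df.columns:
            report[column] = ["column not found"]
        else:
            report[column] = check_column(df[column], column_rules)
    return report

# Made-up sample data and a trimmed version of the rules above
rules = {"columns": {
    "email": {"required": True, "unique": True,
              "pattern": r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$"},
    "age": {"min": 0, "max": 120},
    "status": {"allowed_values": ["active", "inactive", "pending"]},
}}
df = pd.DataFrame({
    "email": ["a@x.org", "bad", None, "a@x.org"],
    "age": [30, -1, 45, 130],
    "status": ["active", "inactive", "archived", "pending"],
})
report = check_dataset(df, rules)
print(report)
```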
Quality Score
The quality score (0-100) is calculated from:
- Completeness (30%): Missing value ratio
- Uniqueness (25%): Duplicate row ratio
- Validity (25%): Type and constraint compliance
- Consistency (20%): Format and pattern adherence
score = auditor.quality_score()
# 85.5
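Read literally, the weights above describe a weighted average of four component scores. The sketch below shows that arithmetic under the assumption that each component is itself scored from 0 to 100. The skill may apply additional adjustments, so this is not necessarily the exact formula it uses; `weighted_quality_score` and the sample scores are hypothetical.

```python
# Weights taken from the list above; they sum to 1.0
WEIGHTS = {
    "completeness": 0.30,
    "uniqueness": 0.25,
    "validity": 0.25,
    "consistency": 0.20,
}

def weighted_quality_score(component_scores: dict) -> float:
    """Combine per-dimension scores (each 0-100) into a single 0-100 score."""
    total = sum(WEIGHTS[name] * component_scores[name] for name in WEIGHTS)
    return round(total, 2)

# Made-up component scores
scores = {"completeness": 90.0, "uniqueness": 100.0, "validity": 80.0, "consistency": 70.0}
overall = weighted_quality_score(scores)
print(overall)  # 0.30*90 + 0.25*100 + 0.25*80 + 0.20*70
```

Because completeness carries the largest weight, a drop of 10 points there lowers the overall score by 3 points, while the same drop in consistency lowers it by 2.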
Output Formats
Audit Report
{
    "file": "data.csv",
    "rows": 1000,
    "columns": 15,
    "quality_score": 85.5,
    "completeness": {
        "score": 92.0,
        "missing_cells": 800,
        "details": {...}
    },
    "uniqueness": {
        "score": 97.5,
        "duplicate_rows": 25,
        "details": {...}
    },
    "validity": {
        "score": 78.0,
        "type_issues": [...],
        "details": {...}
    },
    "consistency": {
        "score": 80.0,
        "pattern_issues": [...],
        "details": {...}
    },
    "recommendations": [
        "Column 'phone' has 10% missing values",
        "25 duplicate rows detected",
        "Column 'date' has inconsistent formats"
    ]
}
Example Workflows
Pre-Import Validation
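auditor = DataQualityAuditor()
auditor.load_csv("import_data.csv")
report = auditor.audit()
# Abort the import when the score falls below the chosen threshold (80 here)
if report['quality_score'] < 80:
    print("Data quality below threshold!")
    for rec in report['recommendations']:
        print(f" - {rec}")
    raise SystemExit(1)  # non-zero exit status signals failure to the caller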
ETL Pipeline Check
auditor = DataQualityAuditor()
# transformed_df is the DataFrame produced by the pipeline's transform step
auditor.load_dataframe(transformed_df)
# Check critical columns
email_check = auditor.validate_column("email", {
    "required": True,
    "unique": True,
    "pattern": r"^[\w.+-]+@[\w-]+\.[\w.-]+$"
})
# Stop the pipeline if the email column violates any rule
if email_check['issues']:
    raise ValueError(f"Email validation failed: {email_check['issues']}")
Generate Documentation
auditor = DataQualityAuditor()
auditor.load_csv("dataset.csv")
# Generate comprehensive report
auditor.generate_report("quality_report.html", format="html")
# Or get summary text
print(auditor.summary())
Dependencies
- pandas>=2.0.0
- numpy>=1.24.0
Repository
- Repository: dkyazzentwatwa/chatgpt-skills/data-quality-auditor
- Author: dkyazzentwatwa