Data Analysis Skill Example

A skill for data science workflows and analytics best practices.

Use Cases

This skill helps with:

  • Data exploration and cleaning
  • Statistical analysis
  • Visualization best practices
  • Python/Pandas workflows

Complete SKILL.md

```markdown
---
name: Data Analysis Best Practices
description: Guide for exploratory data analysis and statistical workflows
version: 1.0.0
author: Data Science Community
platforms:
  - claude-code
  - codex
categories:
  - data
  - ai-ml
tags:
  - python
  - pandas
  - statistics
  - visualization
---

# Data Analysis Best Practices

## Workflow Overview

1. **Understand** - Define the question
2. **Explore** - Examine the data
3. **Clean** - Handle issues
4. **Analyze** - Apply methods
5. **Visualize** - Communicate findings
6. **Document** - Record process
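
As a rough illustration (not part of the original skill), the six steps might chain together like this; the file name, column names, and cleaning choices are hypothetical:

\`\`\`python
import pandas as pd

# Hypothetical end-to-end pass over the six steps. "sales.csv",
# "revenue", and "region" are illustrative names only.
df = pd.read_csv("sales.csv")                       # Understand/load
print(df.head(), df.dtypes, sep="\n")               # Explore
df = df.dropna(subset=["revenue"])                  # Clean
by_region = df.groupby("region")["revenue"].mean()  # Analyze
by_region.plot.bar(title="Mean revenue by region")  # Visualize
# Document: record each decision alongside the code as you go
\`\`\`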

## Data Exploration

### Initial Assessment

\`\`\`python
import pandas as pd

def explore_dataset(df: pd.DataFrame) -> None:
    """Comprehensive initial data exploration."""
    print(f"Shape: {df.shape}")
    print(f"\nColumn types:\n{df.dtypes}")
    print(f"\nMissing values:\n{df.isnull().sum()}")
    print(f"\nBasic stats:\n{df.describe()}")
    print(f"\nSample rows:\n{df.head()}")
\`\`\`
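
Hypothetical usage, assuming the data is already on disk as a CSV:

\`\`\`python
# Hypothetical file name; any DataFrame works here.
df = pd.read_csv("customers.csv")
explore_dataset(df)
\`\`\`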

### Column Analysis

\`\`\`python
def analyze_column(df: pd.DataFrame, column: str) -> dict:
    """Detailed analysis of a single column."""
    col = df[column]

    analysis = {
        'dtype': col.dtype,
        'missing': col.isnull().sum(),
        'unique': col.nunique(),
        'memory': col.memory_usage(deep=True),
    }

    if pd.api.types.is_numeric_dtype(col):
        analysis.update({
            'mean': col.mean(),
            'median': col.median(),
            'std': col.std(),
            'min': col.min(),
            'max': col.max(),
        })

    return analysis
\`\`\`
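
To scan a whole frame, one option (a sketch, not part of the original skill) is to collect the per-column results into a summary table:

\`\`\`python
# Sketch: run analyze_column on every column and tabulate the results.
# Numeric-only keys (mean, median, ...) come back as NaN for other dtypes.
summary = pd.DataFrame({col: analyze_column(df, col) for col in df.columns}).T
print(summary[['dtype', 'missing', 'unique']])
\`\`\`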

## Data Cleaning

### Missing Values

\`\`\`python
# Strategy 1: Drop rows with any missing values
df_clean = df.dropna()

# Strategy 2: Fill with mean/median/mode
df['column'] = df['column'].fillna(df['column'].median())

# Strategy 3: Forward/backward fill (time series)
df['column'] = df['column'].ffill()  # fillna(method=...) is deprecated in pandas 2.x

# Strategy 4: Interpolation
df['column'] = df['column'].interpolate(method='linear')
\`\`\`
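
Which strategy fits depends on how much is missing and why. A quick per-column report (a sketch, not from the original skill) helps guide the choice:

\`\`\`python
# Sketch: missing-value fraction per column, worst first, to decide
# between dropping, filling, and interpolating.
missing_pct = df.isnull().mean().sort_values(ascending=False)
print(missing_pct[missing_pct > 0].map("{:.1%}".format))
\`\`\`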

### Outliers

\`\`\`python
def detect_outliers_iqr(df: pd.DataFrame, column: str) -> pd.Series:
    """Detect outliers using IQR method."""
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1

    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    return (df[column] < lower_bound) | (df[column] > upper_bound)
\`\`\`
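
Typical usage (a sketch; the 'price' column is hypothetical) is to inspect the flagged rows before deciding whether to drop or cap them:

\`\`\`python
# Flag outliers in a hypothetical 'price' column, report the count,
# then filter; capping (winsorizing) is often preferable to deletion.
mask = detect_outliers_iqr(df, 'price')
print(f"{mask.sum()} outliers out of {len(df)} rows")
df_filtered = df[~mask]
\`\`\`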

## Statistical Analysis

### Hypothesis Testing

\`\`\`python
from scipy import stats

# T-test for comparing two groups
t_stat, p_value = stats.ttest_ind(group_a, group_b)

# Chi-square test for categorical data
chi2, p_value, dof, expected = stats.chi2_contingency(contingency_table)

# Correlation
correlation, p_value = stats.pearsonr(x, y)
\`\`\`
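
The plain t-test assumes roughly equal group variances; when that is doubtful, Welch's variant is the safer default. A sketch:

\`\`\`python
# Check the equal-variance assumption with Levene's test, then fall
# back to Welch's t-test (equal_var=False) when it does not hold.
_, levene_p = stats.levene(group_a, group_b)
t_stat, p_value = stats.ttest_ind(group_a, group_b,
                                  equal_var=levene_p > 0.05)
\`\`\`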

### Effect Size

\`\`\`python
import numpy as np

def cohens_d(group1, group2):
    """Calculate Cohen's d effect size."""
    n1, n2 = len(group1), len(group2)
    var1, var2 = group1.var(), group2.var()

    pooled_std = np.sqrt(((n1-1)*var1 + (n2-1)*var2) / (n1+n2-2))
    return (group1.mean() - group2.mean()) / pooled_std
\`\`\`
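
A quick sanity check with synthetic data (illustrative only): two normal samples one standard deviation apart should give d close to 1.

\`\`\`python
# Illustrative check: shifted normal samples should yield d close to 1.
rng = np.random.default_rng(0)
a = pd.Series(rng.normal(0, 1, 500))
b = pd.Series(rng.normal(1, 1, 500))
print(f"Cohen's d: {cohens_d(b, a):.2f}")
\`\`\`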

## Visualization

### Best Practices

1. **Clear titles** - State what the chart shows
2. **Labeled axes** - Include units
3. **Appropriate chart type** - Match data to visual
4. **Color with purpose** - Meaningful, accessible
5. **Minimal clutter** - Remove unnecessary elements

### Chart Selection

| Data Type | Relationship | Chart Type |
|-----------|--------------|------------|
| Categorical | Distribution | Bar chart |
| Numerical | Distribution | Histogram |
| Two numerical | Correlation | Scatter plot |
| Time series | Trend | Line chart |
| Parts of whole | Composition | Pie/Stacked bar |
| Comparison | Ranking | Horizontal bar |

### Example Visualization

\`\`\`python
import matplotlib.pyplot as plt
import seaborn as sns

def create_distribution_plot(df: pd.DataFrame, column: str):
    """Create a clean distribution visualization."""
    fig, ax = plt.subplots(figsize=(10, 6))

    sns.histplot(data=df, x=column, kde=True, ax=ax)

    ax.set_title(f'Distribution of {column}', fontsize=14, fontweight='bold')
    ax.set_xlabel(column, fontsize=12)
    ax.set_ylabel('Frequency', fontsize=12)

    # Add summary statistics
    mean_val = df[column].mean()
    median_val = df[column].median()
    ax.axvline(mean_val, color='red', linestyle='--', label=f'Mean: {mean_val:.2f}')
    ax.axvline(median_val, color='green', linestyle='--', label=f'Median: {median_val:.2f}')

    ax.legend()
    plt.tight_layout()

    return fig
\`\`\`
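
Usage might look like this (column and file names are hypothetical):

\`\`\`python
# Hypothetical usage: render the plot and save it to disk.
fig = create_distribution_plot(df, 'price')
fig.savefig('price_distribution.png', dpi=150)
\`\`\`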

## Documentation

### Notebook Structure

1. **Title and Overview**
2. **Data Loading**
3. **Exploration**
4. **Cleaning**
5. **Analysis**
6. **Conclusions**
7. **Next Steps**

### Code Comments

\`\`\`python
# Good: Explains WHY
# Remove entries before 2020 as data quality is unreliable
df = df[df['date'] >= '2020-01-01']

# Bad: Explains WHAT (obvious from code)
# Filter dataframe by date
df = df[df['date'] >= '2020-01-01']
\`\`\`

```

Next Steps