# Data Analysis Skill Example

A skill for data science workflows and analysis best practices.

## Use Cases

This skill helps with:

- Data exploration and cleaning
- Statistical analysis
- Visualization best practices
- Python/Pandas workflows

## Complete SKILL.md
---
name: Data Analysis Best Practices
description: Guide for exploratory data analysis and statistical workflows
version: 1.0.0
author: Data Science Community
platforms:
- claude-code
- codex
categories:
- data
- ai-ml
tags:
- python
- pandas
- statistics
- visualization
---
# Data Analysis Best Practices
## Workflow Overview
1. **Understand** - Define the question
2. **Explore** - Examine the data
3. **Clean** - Handle issues
4. **Analyze** - Apply methods
5. **Visualize** - Communicate findings
6. **Document** - Record process
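The six steps can be sketched end to end. The tiny dataset and the question below are invented purely for illustration; a real project would split these into separate notebook sections:

```python
import pandas as pd

# Hypothetical end-to-end run of the six workflow steps
raw = pd.DataFrame({
    "age": [25, 31, None, 47, 52],
    "income": [40_000, 52_000, 61_000, None, 88_000],
})

# 1-2. Understand / Explore: the question is "does income rise with age?"
print(raw.describe())

# 3. Clean: fill missing values with each column's median
clean = raw.fillna(raw.median(numeric_only=True))

# 4. Analyze: a simple correlation addresses the question
corr = clean["age"].corr(clean["income"])

# 5-6. Visualize / Document: report the finding with context
print(f"age-income correlation: {corr:.2f}")
```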
## Data Exploration
### Initial Assessment
```python
import pandas as pd

def explore_dataset(df: pd.DataFrame) -> None:
    """Comprehensive initial data exploration."""
    print(f"Shape: {df.shape}")
    print(f"\nColumn types:\n{df.dtypes}")
    print(f"\nMissing values:\n{df.isnull().sum()}")
    print(f"\nBasic stats:\n{df.describe()}")
    print(f"\nSample rows:\n{df.head()}")
```
### Column Analysis
```python
def analyze_column(df: pd.DataFrame, column: str) -> dict:
    """Detailed analysis of a single column."""
    col = df[column]
    analysis = {
        'dtype': col.dtype,
        'missing': col.isnull().sum(),
        'unique': col.nunique(),
        'memory': col.memory_usage(deep=True),
    }
    if pd.api.types.is_numeric_dtype(col):
        analysis.update({
            'mean': col.mean(),
            'median': col.median(),
            'std': col.std(),
            'min': col.min(),
            'max': col.max(),
        })
    return analysis
```
## Data Cleaning
### Missing Values
```python
# Strategy 1: Drop rows with any missing values
df_clean = df.dropna()

# Strategy 2: Fill with mean/median/mode
df['column'] = df['column'].fillna(df['column'].median())

# Strategy 3: Forward/backward fill (time series)
# fillna(method='ffill') is deprecated in pandas 2.x; use .ffill()/.bfill()
df['column'] = df['column'].ffill()

# Strategy 4: Interpolation
df['column'] = df['column'].interpolate(method='linear')
```
### Outliers
```python
def detect_outliers_iqr(df: pd.DataFrame, column: str) -> pd.Series:
    """Detect outliers using the IQR method."""
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    return (df[column] < lower_bound) | (df[column] > upper_bound)
```
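Detection alone does not decide treatment. One common option, sketched here on a toy series with the same 1.5×IQR bounds as `detect_outliers_iqr`, is to clip (winsorize) extreme values rather than drop rows, which preserves row count for downstream joins:

```python
import pandas as pd

df = pd.DataFrame({"value": [1, 2, 3, 4, 5, 100]})

# IQR bounds (same 1.5 * IQR rule as detect_outliers_iqr)
q1, q3 = df["value"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Clip instead of dropping: the outlier 100 is pulled back to the upper bound
df["value_clipped"] = df["value"].clip(lower, upper)
print(df["value_clipped"].tolist())
```

Whether clipping, dropping, or keeping outliers is appropriate depends on whether they are data errors or genuine extreme observations.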
## Statistical Analysis
### Hypothesis Testing
```python
from scipy import stats

# T-test for comparing two groups
t_stat, p_value = stats.ttest_ind(group_a, group_b)

# Chi-square test for categorical data
chi2, p_value, dof, expected = stats.chi2_contingency(contingency_table)

# Correlation
correlation, p_value = stats.pearsonr(x, y)
```
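As a self-contained illustration, here is the t-test on synthetic data; the 0.05 significance threshold mentioned in the comment is the usual convention, not a scipy default:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(10.0, 2.0, 200)  # synthetic "control" group
group_b = rng.normal(10.6, 2.0, 200)  # synthetic "treatment" group

# Two-sided independent t-test; pass equal_var=False for Welch's
# variant when the groups may have unequal variances
t_stat, p_value = stats.ttest_ind(group_a, group_b)

# Convention: reject the null of equal means when p < 0.05
print(f"t={t_stat:.2f}, p={p_value:.4f}")
```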
### Effect Size
```python
import numpy as np

def cohens_d(group1, group2):
    """Calculate Cohen's d effect size (pooled standard deviation)."""
    n1, n2 = len(group1), len(group2)
    var1, var2 = group1.var(), group2.var()
    pooled_std = np.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))
    return (group1.mean() - group2.mean()) / pooled_std
```
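Conventional rough benchmarks (due to Cohen) read |d| ≈ 0.2 as small, 0.5 as medium, and 0.8 as large. A quick self-contained check on made-up data, repeating the pooled-standard-deviation computation inline:

```python
import numpy as np
import pandas as pd

group1 = pd.Series([5.0, 6.0, 7.0, 8.0])
group2 = pd.Series([7.0, 8.0, 9.0, 10.0])

# Same pooled-standard-deviation formula as cohens_d
n1, n2 = len(group1), len(group2)
pooled_std = np.sqrt(((n1 - 1) * group1.var() + (n2 - 1) * group2.var())
                     / (n1 + n2 - 2))
d = (group1.mean() - group2.mean()) / pooled_std

# |d| is well past the 0.8 "large" benchmark here
print(f"Cohen's d: {d:.2f}")
```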
## Visualization
### Best Practices
1. **Clear titles** - State what the chart shows
2. **Labeled axes** - Include units
3. **Appropriate chart type** - Match data to visual
4. **Color with purpose** - Meaningful, accessible
5. **Minimal clutter** - Remove unnecessary elements
### Chart Selection
| Data Type | Relationship | Chart Type |
|-----------|--------------|------------|
| Categorical | Distribution | Bar chart |
| Numerical | Distribution | Histogram |
| Two numerical | Correlation | Scatter plot |
| Time series | Trend | Line chart |
| Parts of whole | Composition | Pie/Stacked bar |
| Comparison | Ranking | Horizontal bar |
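The table above can also be encoded as a lookup for tooling or review checklists. The dictionary and its key names below are hypothetical, not part of any plotting library:

```python
# Hypothetical (data kind, relationship) -> chart type mapping,
# transcribing the chart-selection table
CHART_FOR = {
    ("categorical", "distribution"): "bar chart",
    ("numerical", "distribution"): "histogram",
    ("two numerical", "correlation"): "scatter plot",
    ("time series", "trend"): "line chart",
    ("parts of whole", "composition"): "pie or stacked bar",
    ("comparison", "ranking"): "horizontal bar",
}

print(CHART_FOR[("time series", "trend")])  # line chart
```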
### Example Visualization
```python
import matplotlib.pyplot as plt
import seaborn as sns

def create_distribution_plot(df: pd.DataFrame, column: str):
    """Create a clean distribution visualization."""
    fig, ax = plt.subplots(figsize=(10, 6))
    sns.histplot(data=df, x=column, kde=True, ax=ax)
    ax.set_title(f'Distribution of {column}', fontsize=14, fontweight='bold')
    ax.set_xlabel(column, fontsize=12)
    ax.set_ylabel('Frequency', fontsize=12)

    # Add summary statistics
    mean_val = df[column].mean()
    median_val = df[column].median()
    ax.axvline(mean_val, color='red', linestyle='--', label=f'Mean: {mean_val:.2f}')
    ax.axvline(median_val, color='green', linestyle='--', label=f'Median: {median_val:.2f}')
    ax.legend()

    plt.tight_layout()
    return fig
```
## Documentation
### Notebook Structure
1. **Title and Overview**
2. **Data Loading**
3. **Exploration**
4. **Cleaning**
5. **Analysis**
6. **Conclusions**
7. **Next Steps**
### Code Comments
```python
# Good: explains WHY
# Remove entries before 2020 as data quality is unreliable
df = df[df['date'] >= '2020-01-01']

# Bad: explains WHAT (obvious from the code)
# Filter dataframe by date
df = df[df['date'] >= '2020-01-01']
```