
model-selection

Select appropriate AI/ML models based on capability matching, benchmarks, cost-performance tradeoffs, and deployment constraints.

allowed_tools: Read, Write, Glob, Grep, Task

Install

git clone https://github.com/melodic-software/claude-code-plugins /tmp/claude-code-plugins && cp -r /tmp/claude-code-plugins/plugins/ai-ml-planning/skills/model-selection ~/.claude/skills/claude-code-plugins

Tip: Run this command in your terminal to install the skill.


name: model-selection
description: Select appropriate AI/ML models based on capability matching, benchmarks, cost-performance tradeoffs, and deployment constraints.
allowed-tools: Read, Write, Glob, Grep, Task

Model Selection Framework

When to Use This Skill

Use this skill when:

  • Model Selection tasks - Selecting appropriate AI/ML models based on capability matching, benchmarks, cost-performance tradeoffs, and deployment constraints
  • Planning or design - Need guidance on Model Selection approaches
  • Best practices - Want to follow established patterns and standards

Overview

Model selection is the systematic process of choosing the right AI/ML model based on task requirements, performance characteristics, cost constraints, and deployment considerations. Poor model selection leads to suboptimal performance, excessive costs, or deployment failures.

Model Selection Decision Tree

┌─────────────────────────────────────────────────────────────────┐
│                   MODEL SELECTION FRAMEWORK                      │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  1. TASK ANALYSIS                                               │
│     What are the core capabilities needed?                      │
│     ├── Text Generation → LLM                                   │
│     ├── Classification → Traditional ML / Small LM              │
│     ├── Code Generation → Code-specialized LLM                  │
│     ├── Vision → Multimodal / Vision Model                      │
│     ├── Embeddings → Embedding Model                            │
│     └── Structured Output → Instruction-tuned LLM               │
│                                                                  │
│  2. REQUIREMENTS MAPPING                                         │
│     ├── Quality: Accuracy, coherence, factuality                │
│     ├── Latency: Real-time vs batch                             │
│     ├── Cost: Per-token, per-request budgets                    │
│     ├── Privacy: Data residency, local deployment               │
│     └── Scale: Requests per second, concurrent users            │
│                                                                  │
│  3. MODEL EVALUATION                                             │
│     ├── Benchmark analysis                                       │
│     ├── Task-specific testing                                   │
│     └── Cost-performance optimization                           │
│                                                                  │
│  4. DEPLOYMENT PLANNING                                          │
│     ├── Cloud API vs self-hosted                                │
│     ├── Hardware requirements                                   │
│     └── Scaling strategy                                        │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘
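
As a minimal sketch of step 1 (task analysis), the task-to-model-family mapping can be expressed as a simple lookup. The enum and helper below are illustrative assumptions, not part of this skill's tooling:

public enum TaskType
{
    TextGeneration,
    Classification,
    CodeGeneration,
    Vision,
    Embeddings,
    StructuredOutput
}

public static class ModelFamilyRouter
{
    // Mirrors step 1 of the decision tree above; names are hypothetical.
    public static string SuggestFamily(TaskType task) => task switch
    {
        TaskType.TextGeneration   => "General-purpose LLM",
        TaskType.Classification   => "Traditional ML or small LM",
        TaskType.CodeGeneration   => "Code-specialized LLM",
        TaskType.Vision           => "Multimodal / vision model",
        TaskType.Embeddings       => "Embedding model",
        TaskType.StructuredOutput => "Instruction-tuned LLM with schema enforcement",
        _ => "General-purpose LLM"
    };
}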

LLM Capability Matrix

General Purpose Models (December 2025)

| Model | Provider | Context | Strengths | Weaknesses |
|-------|----------|---------|-----------|------------|
| GPT-4o | OpenAI | 128K | Multimodal, fast, reliable | Cost for high volume |
| GPT-4o-mini | OpenAI | 128K | Cost-effective, good quality | Less capable than full GPT-4o |
| Claude 3.5 Sonnet | Anthropic | 200K | Long context, coding, analysis | Availability |
| Claude 3.5 Haiku | Anthropic | 200K | Fast, cost-effective | Less capable |
| Gemini 1.5 Pro | Google | 1M | Massive context, multimodal | Latency variance |
| Gemini 1.5 Flash | Google | 1M | Fast, cost-effective | Quality tradeoffs |
| o1 | OpenAI | 128K | Deep reasoning, math, coding | Slow, expensive |
| o1-mini | OpenAI | 128K | Reasoning, cost-effective | Narrower than o1 |

Specialized Models

| Use Case | Recommended Models | Notes |
|----------|--------------------|-------|
| Code Generation | GPT-4o, Claude 3.5 Sonnet, Codex | Claude excels at refactoring |
| Long Documents | Claude 3.5, Gemini 1.5 | 200K-1M context |
| Embeddings | text-embedding-3-large, Cohere embed-v3 | Quality vs cost |
| Vision | GPT-4o, Claude 3.5, Gemini 1.5 | All support images |
| Structured Output | GPT-4o (JSON mode), Claude | Schema enforcement |
| Reasoning | o1, o1-mini | Chain of thought |

Local/Open Models

| Model | Parameters | VRAM Required | Use Case |
|-------|------------|---------------|----------|
| Llama 3.2 | 1B-90B | 2GB-180GB | General, local deployment |
| Mistral | 7B-8x22B | 14GB-180GB | European data residency |
| Phi-3 | 3.8B-14B | 8GB-28GB | Edge, mobile |
| Qwen 2.5 | 0.5B-72B | 1GB-144GB | Multilingual |
| CodeLlama | 7B-70B | 14GB-140GB | Code-specific |

Model Comparison Framework

Benchmark Interpretation

| Benchmark | Measures | Weight |
|-----------|----------|--------|
| MMLU | General knowledge | Medium |
| HumanEval | Code generation | High for coding tasks |
| GSM8K | Math reasoning | High for analytical |
| MT-Bench | Conversation quality | High for chat |
| GPQA | Graduate-level QA | Domain expertise |
| Arena ELO | Human preference | Overall quality |
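
One way to apply these weights is a weighted average over normalized benchmark scores, as in the sketch below; the specific weights are illustrative for a coding-heavy use case, not prescribed values.

public static class BenchmarkScorer
{
    // Illustrative weights for a coding-heavy use case; tune per task.
    private static readonly Dictionary<string, double> Weights = new()
    {
        ["MMLU"] = 0.15,
        ["HumanEval"] = 0.35,
        ["GSM8K"] = 0.20,
        ["MT-Bench"] = 0.10,
        ["GPQA"] = 0.10,
        ["Arena ELO"] = 0.10
    };

    // scores: benchmark name -> normalized score in [0, 1]
    public static double CompositeScore(IReadOnlyDictionary<string, double> scores)
    {
        double weightedSum = 0, weightTotal = 0;

        foreach (var (benchmark, weight) in Weights)
        {
            if (scores.TryGetValue(benchmark, out var score))
            {
                weightedSum += weight * score;
                weightTotal += weight;
            }
        }

        // Renormalize in case some benchmarks are missing for a model.
        return weightTotal > 0 ? weightedSum / weightTotal : 0;
    }
}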

Task-Specific Evaluation

public class ModelEvaluator
{
    public async Task<EvaluationReport> EvaluateModels(
        List<ModelConfig> candidates,
        EvaluationDataset dataset,
        CancellationToken ct)
    {
        var results = new Dictionary<string, ModelMetrics>();

        foreach (var model in candidates)
        {
            var metrics = new ModelMetrics
            {
                ModelId = model.Id,
                Provider = model.Provider
            };

            // Run task-specific tests
            foreach (var testCase in dataset.TestCases)
            {
                var stopwatch = Stopwatch.StartNew();

                var response = await CallModel(model, testCase.Prompt, ct);

                stopwatch.Stop();

                metrics.AddResult(new TestResult
                {
                    TestId = testCase.Id,
                    LatencyMs = stopwatch.ElapsedMilliseconds,
                    InputTokens = CountTokens(testCase.Prompt),
                    OutputTokens = CountTokens(response),
                    Score = await EvaluateResponse(response, testCase.Expected),
                    Cost = CalculateCost(model, testCase.Prompt, response)
                });
            }

            results[model.Id] = metrics;
        }

        return new EvaluationReport
        {
            Results = results,
            Recommendation = SelectBestModel(results, dataset.Requirements)
        };
    }

    private ModelRecommendation SelectBestModel(
        Dictionary<string, ModelMetrics> results,
        Requirements requirements)
    {
        // Score based on requirements weights
        var scores = results.Select(r => new
        {
            Model = r.Key,
            Score = CalculateWeightedScore(r.Value, requirements)
        }).OrderByDescending(s => s.Score);

        return new ModelRecommendation
        {
            Primary = scores.First().Model,
            Fallback = scores.Skip(1).FirstOrDefault()?.Model,
            Reasoning = GenerateReasoning(scores, requirements)
        };
    }
}
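
The evaluator above assumes a CalculateWeightedScore helper. A minimal sketch, assuming Requirements carries per-dimension weights and targets and ModelMetrics exposes aggregated quality, latency, and cost (all of these member names are hypothetical):

// Hypothetical companion to ModelEvaluator above; member names are assumptions.
private double CalculateWeightedScore(ModelMetrics metrics, Requirements requirements)
{
    // Normalize each dimension so that higher is better.
    var qualityScore = metrics.AverageScore;                                    // assumed 0..1
    var latencyScore = 1.0 / (1.0 + metrics.P95LatencyMs / requirements.TargetLatencyMs);
    var costScore = 1.0 / (1.0 + (double)(metrics.TotalCost / requirements.MonthlyBudget));

    // Combine using requirement weights (assumed to sum to 1).
    return requirements.QualityWeight * qualityScore
         + requirements.LatencyWeight * latencyScore
         + requirements.CostWeight * costScore;
}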

Cost-Performance Analysis

Pricing Comparison (December 2025, per 1M tokens)

| Model | Input Cost | Output Cost | Notes |
|-------|------------|-------------|-------|
| GPT-4o | $2.50 | $10.00 | Standard |
| GPT-4o-mini | $0.15 | $0.60 | Budget option |
| Claude 3.5 Sonnet | $3.00 | $15.00 | Premium |
| Claude 3.5 Haiku | $0.25 | $1.25 | Budget |
| Gemini 1.5 Pro | $1.25 | $5.00 | Pay-as-you-go |
| Gemini 1.5 Flash | $0.075 | $0.30 | High volume |
| o1 | $15.00 | $60.00 | Reasoning tasks |
| o1-mini | $3.00 | $12.00 | Reasoning budget |
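
As a rough worked example using the list prices above: a workload of 10M input tokens and 2M output tokens per month costs about 10 × $0.15 + 2 × $0.60 = $2.70 on GPT-4o-mini, versus 10 × $2.50 + 2 × $10.00 = $45.00 on GPT-4o, before any caching or batch discounts.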

Cost Optimization Strategies

| Strategy | Savings | Trade-off |
|----------|---------|-----------|
| Smaller model for simple tasks | 80-95% | Quality on complex tasks |
| Prompt caching | 50-90% | Cache management complexity |
| Batch processing | 50% | Latency increase |
| Prompt optimization | 20-40% | Development effort |
| Response length limits | 10-30% | Potentially truncated output |

ROI Calculator

public class ModelCostCalculator
{
    public CostProjection CalculateMonthlyCost(
        ModelConfig model,
        UsageEstimate usage)
    {
        // Convert token volumes to millions and apply per-million list prices.
        var inputCost = usage.MonthlyInputTokens / 1_000_000m
            * model.InputPricePerMillion;

        var outputCost = usage.MonthlyOutputTokens / 1_000_000m
            * model.OutputPricePerMillion;

        // Prompt caching only discounts the cached share of input tokens.
        var cachedSavings = usage.CacheHitRate * inputCost
            * model.CacheDiscount;

        return new CostProjection
        {
            GrossInputCost = inputCost,
            GrossOutputCost = outputCost,
            CacheSavings = cachedSavings,
            NetMonthlyCost = inputCost + outputCost - cachedSavings,
            CostPerRequest = (inputCost + outputCost - cachedSavings)
                / usage.MonthlyRequests
        };
    }

    public ModelComparison CompareModels(
        List<ModelConfig> models,
        UsageEstimate usage,
        QualityRequirements requirements)
    {
        var comparisons = models.Select(m => new
        {
            Model = m,
            Cost = CalculateMonthlyCost(m, usage),
            MeetsRequirements = EvaluateQuality(m, requirements)
        }).Where(c => c.MeetsRequirements)
          .OrderBy(c => c.Cost.NetMonthlyCost)
          .ToList();

        return new ModelComparison
        {
            CheapestQualified = comparisons.FirstOrDefault()?.Model,
            AllOptions = comparisons,
            Savings = CalculateSavings(comparisons)
        };
    }
}
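
A usage sketch for the calculator above, assuming UsageEstimate and ModelConfig expose the members referenced in CalculateMonthlyCost (the figures are illustrative, not measured):

// Illustrative invocation; all numbers below are assumptions for the example.
var calculator = new ModelCostCalculator();

var usage = new UsageEstimate
{
    MonthlyInputTokens = 50_000_000,   // 50M input tokens/month
    MonthlyOutputTokens = 10_000_000,  // 10M output tokens/month
    MonthlyRequests = 100_000,
    CacheHitRate = 0.4m                // 40% of input tokens served from cache
};

var gpt4oMini = new ModelConfig
{
    Id = "gpt-4o-mini",
    InputPricePerMillion = 0.15m,
    OutputPricePerMillion = 0.60m,
    CacheDiscount = 0.5m               // assumed 50% discount on cached input tokens
};

var projection = calculator.CalculateMonthlyCost(gpt4oMini, usage);
Console.WriteLine($"Net monthly cost: {projection.NetMonthlyCost:F2}");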

Fine-Tuning Decision Framework

When to Fine-Tune

| Consider Fine-Tuning | Use Prompting Instead |
|----------------------|------------------------|
| Consistent specific format | Few-shot examples work |
| Domain vocabulary | General vocabulary |
| Latency critical (shorter prompts) | Latency acceptable |
| High volume (amortize cost) | Low volume |
| Specialized behavior | Standard behavior |
| Proprietary knowledge | Public knowledge |

Fine-Tuning ROI Analysis

## Fine-Tuning Decision: [Use Case]

### Current State (Prompting)
- Prompt tokens: [X] tokens
- Output quality: [Score]
- Cost per request: $[X]
- Monthly cost: $[X]

### Projected State (Fine-Tuned)
- Prompt tokens: [Y] tokens (reduced)
- Output quality: [Score] (maintained/improved)
- Fine-tuning cost: $[X] (one-time)
- Inference cost per request: $[Y]
- Monthly cost: $[Y]

### Break-Even Analysis
- Monthly savings: $[X]
- Break-even: [N] months
- 12-month ROI: [X]%

### Recommendation
[Fine-tune / Continue prompting / Hybrid approach]
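
For instance, with illustrative numbers: if fine-tuning costs $500 one-time and shorter prompts save $200/month at current volume, break-even is $500 / $200 = 2.5 months and the 12-month ROI is (12 × $200 - $500) / $500 ≈ 380%.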

Deployment Considerations

Cloud vs Self-Hosted Decision

| Factor | Cloud API | Self-Hosted |
|--------|-----------|-------------|
| Initial cost | Low (pay-per-use) | High (infrastructure) |
| Scaling | Automatic | Manual |
| Latency | Network dependent | Controlled |
| Data privacy | Limited | Full control |
| Customization | Limited | Full |
| Maintenance | None | Significant |

Hardware Requirements (Self-Hosted)

| Model Size | GPU Memory | Recommended GPU |
|------------|------------|-----------------|
| 7B | 14GB | RTX 4090, A10 |
| 13B | 26GB | A100-40GB |
| 30B | 60GB | 2x A100-40GB |
| 70B | 140GB | 2x A100-80GB |
| 8x7B (MoE) | 100GB+ | Multiple A100s |
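
These figures roughly follow the rule of thumb of about 2 bytes per parameter for FP16 weights (e.g., 7B × 2 bytes ≈ 14GB), plus headroom for the KV cache and activations; 4-bit quantization cuts the weight footprint to roughly 0.5 bytes per parameter at some quality cost.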

Model Selection Template

# Model Selection: [Project Name]

## Task Requirements
- **Primary Use Case**: [Description]
- **Quality Requirements**: [Accuracy/coherence targets]
- **Latency Requirements**: [P95 target]
- **Volume**: [Requests/day]
- **Budget**: [Monthly budget]

## Evaluation Results

| Model | Quality Score | P95 Latency | Monthly Cost | Notes |
|-------|--------------|-------------|--------------|-------|
| [Model A] | [Score] | [ms] | $[X] | [Notes] |
| [Model B] | [Score] | [ms] | $[X] | [Notes] |
| [Model C] | [Score] | [ms] | $[X] | [Notes] |

## Recommendation

**Primary Model**: [Model]
- Rationale: [Why this model]

**Fallback Model**: [Model]
- Use when: [Conditions]

**Cost Projection**: $[X]/month

## Implementation Notes
- [Deployment approach]
- [Monitoring strategy]
- [Scaling considerations]
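
To make the primary/fallback recommendation in the template concrete, here is a minimal routing sketch; the callModelAsync delegate and the broad exception handling are assumptions standing in for a real provider client and retry policy:

// Hypothetical primary/fallback routing; the delegate stands in for a real provider client.
public class FallbackRouter
{
    private readonly Func<string, string, CancellationToken, Task<string>> _callModelAsync;

    public FallbackRouter(Func<string, string, CancellationToken, Task<string>> callModelAsync)
    {
        _callModelAsync = callModelAsync;
    }

    public async Task<string> CompleteAsync(
        string primaryModel,
        string fallbackModel,
        string prompt,
        CancellationToken ct)
    {
        try
        {
            // Prefer the recommended primary model.
            return await _callModelAsync(primaryModel, prompt, ct);
        }
        catch (Exception)   // e.g., rate limits, timeouts, provider outage
        {
            // Degrade to the fallback model instead of failing the request.
            return await _callModelAsync(fallbackModel, prompt, ct);
        }
    }
}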

Validation Checklist

  • Task requirements clearly defined
  • Capability mapping completed
  • Candidate models identified
  • Benchmarks reviewed
  • Task-specific evaluation conducted
  • Cost analysis completed
  • Latency requirements validated
  • Deployment constraints considered
  • Fine-tuning decision made
  • Fallback strategy defined

Integration Points

Inputs from:

  • Business requirements → Task definition
  • ml-project-lifecycle skill → Project constraints

Outputs to:

  • token-budgeting skill → Cost estimation
  • rag-architecture skill → Embedding model selection
  • Application code → Model configuration

Repository

Author: melodic-software
melodic-software/claude-code-plugins/plugins/ai-ml-planning/skills/model-selection