
llm-basics

LLM architecture, tokenization, transformers, and inference optimization. Use for understanding and working with language models.

Install

git clone https://github.com/pluginagentmarketplace/custom-plugin-ai-engineer /tmp/custom-plugin-ai-engineer && cp -r /tmp/custom-plugin-ai-engineer/skills/llm-basics ~/.claude/skills/custom-plugin-ai-engineer

Tip: Run this command in your terminal to install the skill.


---
name: llm-basics
description: LLM architecture, tokenization, transformers, and inference optimization. Use for understanding and working with language models.
sasmp_version: "1.3.0"
bonded_agent: 01-llm-fundamentals
bond_type: PRIMARY_BOND
---

LLM Basics

Master the fundamentals of Large Language Models.

Quick Start

Using OpenAI API

from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain transformers briefly."}
    ],
    temperature=0.7,
    max_tokens=500
)

print(response.choices[0].message.content)
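
For interactive applications you can stream tokens as they are generated instead of waiting for the full completion; with the OpenAI client this is a one-flag change (a sketch reusing the client from above):

# Stream tokens as they arrive instead of waiting for the full response
stream = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Explain transformers briefly."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)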

Using Hugging Face

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "meta-llama/Llama-2-7b-hf"  # gated model: requires approved access on Hugging Face
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Half precision roughly halves memory use versus the default float32
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

inputs = tokenizer("Hello, how are", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
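
generate() defaults to greedy decoding. To sample instead, pass do_sample=True along with the usual sampling knobs (a sketch reusing the model and tokenizer above):

# Sampled decoding instead of greedy search
outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    do_sample=True,     # enable sampling
    temperature=0.7,    # soften the distribution
    top_p=0.9,          # nucleus sampling
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))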

Core Concepts

Transformer Architecture

Input → Embedding → [N × Transformer Block] → Output

Transformer Block:
┌───────────────────────────┐
│ Multi-Head Self-Attention │
├───────────────────────────┤
│   Layer Normalization     │
├───────────────────────────┤
│   Feed-Forward Network    │
├───────────────────────────┤
│   Layer Normalization     │
└───────────────────────────┘

Each sub-layer's output is added back to its input (a residual connection) before layer normalization; the diagram shows the post-norm ordering.
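
As a minimal sketch of one such block in PyTorch (the layer sizes and the use of nn.MultiheadAttention are illustrative assumptions, not taken from any particular model):

import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Post-norm transformer block: attention + FFN, each with a residual."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Self-attention sub-layer: residual add, then LayerNorm
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)
        # Feed-forward sub-layer: residual add, then LayerNorm
        return self.norm2(x + self.ffn(x))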

Tokenization

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
text = "Hello, world!"

# Encode
tokens = tokenizer.encode(text)
print(tokens)  # [15496, 11, 995, 0]

# Decode
decoded = tokenizer.decode(tokens)
print(decoded)  # "Hello, world!"
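
Tokenizers split text into subword pieces rather than whole words; continuing from the block above, you can inspect the pieces directly. The output shown is for the GPT-2 tokenizer, where "Ġ" marks a piece that begins with a space:

pieces = tokenizer.tokenize("Hello, world!")
print(pieces)  # ['Hello', ',', 'Ġworld', '!']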

Key Parameters

# Generation parameters
params = {
    'temperature': 0.7,      # Randomness (0-2)
    'max_tokens': 1000,      # Output length limit
    'top_p': 0.9,            # Nucleus sampling
    'top_k': 50,             # Top-k sampling
    'frequency_penalty': 0,  # Reduce repetition
    'presence_penalty': 0    # Encourage new topics
}
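
Temperature works by rescaling the logits before the softmax: p = softmax(logits / T). A quick sketch with made-up logits shows how low T sharpens the distribution and high T flattens it:

import numpy as np

def softmax_with_temperature(logits, temperature):
    # Divide logits by T: T < 1 sharpens the distribution, T > 1 flattens it
    scaled = np.array(logits) / temperature
    exp = np.exp(scaled - scaled.max())  # subtract max for numerical stability
    return exp / exp.sum()

logits = [2.0, 1.0, 0.5]  # hypothetical next-token logits
print(softmax_with_temperature(logits, 0.2))  # near-deterministic
print(softmax_with_temperature(logits, 1.0))  # the model's raw distribution
print(softmax_with_temperature(logits, 2.0))  # closer to uniform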

Model Comparison

| Model      | Parameters | Context | Best For            |
|------------|------------|---------|---------------------|
| GPT-4      | ~1.7T      | 128K    | Complex reasoning   |
| GPT-3.5    | 175B       | 16K     | General tasks       |
| Claude 3   | N/A        | 200K    | Long context        |
| Llama 2    | 7-70B      | 4K      | Open source         |
| Mistral 7B | 7B         | 32K     | Efficient inference |

Local Inference

With Ollama

# Install Ollama
curl -fsSL https://ollama.ai/install.sh | sh

# Run a model
ollama run llama2

# API usage
curl http://localhost:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "Why is the sky blue?"
}'
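
The same endpoint is easy to call from Python. Note that /api/generate streams JSON lines by default, so this sketch sets "stream": false to get a single JSON response (assumes the requests package and a running Ollama server):

import requests

# stream=False returns one JSON object instead of a stream of JSON lines
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama2", "prompt": "Why is the sky blue?", "stream": False},
)
print(resp.json()["response"])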

With vLLM

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-hf")
sampling = SamplingParams(temperature=0.8, max_tokens=100)

outputs = llm.generate(["Hello, my name is"], sampling)
for output in outputs:
    print(output.outputs[0].text)  # first (and here only) completion per prompt
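
vLLM is built for batched serving (its PagedAttention KV-cache manager keeps GPU memory utilization high), so passing a list of many prompts to generate() in one call is the intended usage pattern and is far faster than looping one prompt at a time.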

Best Practices

  1. Start simple: Use a hosted API before deploying locally
  2. Mind context: Stay within context window limits (see the token-counting sketch after this list)
  3. Temperature tuning: Lower for facts, higher for creativity
  4. Token efficiency: Shorter prompts = lower costs
  5. Streaming: Use for better UX in applications
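
To check a prompt against a context window, count tokens locally. A minimal sketch with OpenAI's tiktoken library (the 8K budget below is an illustrative assumption, not a universal limit):

import tiktoken

encoding = tiktoken.encoding_for_model("gpt-4")
prompt = "Explain transformers briefly."
n_tokens = len(encoding.encode(prompt))
print(f"{n_tokens} prompt tokens")

# Input tokens plus max_tokens must fit within the model's context window
assert n_tokens + 500 <= 8192  # illustrative 8K budget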

Error Handling & Retry

from tenacity import retry, stop_after_attempt, wait_exponential

@retry(stop=stop_after_attempt(3), wait=wait_exponential(min=1, max=10))
def call_llm_with_retry(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content  # return text, matching the str annotation
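
To avoid retrying failures that will never succeed (for example, malformed requests), tenacity's retry_if_exception_type can restrict retries to transient errors such as openai.RateLimitError.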

Troubleshooting

| Symptom           | Cause             | Solution                |
|-------------------|-------------------|-------------------------|
| Rate limit errors | Too many requests | Add exponential backoff |
| Empty response    | max_tokens=0      | Check parameter values  |
| High latency      | Large model       | Use smaller model       |
| Timeout           | Prompt too long   | Reduce input size       |

Unit Test Template

def test_llm_completion():
    response = call_llm("Hello")
    assert response is not None
    assert len(response) > 0
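
Live API calls make tests slow, costly, and flaky. A sketch using unittest.mock instead, assuming (hypothetically) that the wrapper takes the client as an argument so a fake can be injected:

from unittest.mock import MagicMock

def call_llm(prompt: str, client) -> str:
    # Same wrapper shape as above, with the client injected for testability
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

def test_llm_completion_offline():
    # Fake client that mimics the shape of the OpenAI response object
    fake_client = MagicMock()
    fake_client.chat.completions.create.return_value.choices = [
        MagicMock(message=MagicMock(content="Hi there!"))
    ]
    assert call_llm("Hello", fake_client) == "Hi there!"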