semtools

This skill provides semantic search for code and text using embedding-based similarity matching. It enables meaning-based search beyond keyword matching, with optional document parsing (PDF, DOCX, PPTX).

Install

Run this command in your terminal to install the skill:

git clone https://github.com/massgen/MassGen /tmp/MassGen && cp -r /tmp/MassGen/massgen/skills/semtools ~/.claude/skills/MassGen


name: semtools
description: This skill provides semantic search capabilities using embedding-based similarity matching for code and text. Enables meaning-based search beyond keyword matching, with optional document parsing (PDF, DOCX, PPTX) support.
license: MIT

Semtools: Semantic Search

Perform semantic (meaning-based) search across code and documents using embedding-based similarity matching.

Purpose

The semtools skill provides access to Semtools, a high-performance Rust-based CLI for semantic search and document processing. Traditional text search (ripgrep) matches literal strings or regex patterns, and structural search (ast-grep) matches syntax patterns. Semtools instead compares text by semantic meaning through embeddings.

Key capabilities:

  1. Semantic Search: Find code/text by meaning, not just keywords
  2. Workspace Management: Index large codebases for fast repeated searches
  3. Document Parsing: Convert PDFs, DOCX, PPTX to searchable text (requires API key)

Semtools excels at discovery - finding relevant code when you don't know the exact keywords, function names, or syntax patterns.

When to Use This Skill

Use the semtools skill when you need meaning-based search:

Semantic Code Discovery:

  • Finding code that implements a concept ("error handling", "data validation")
  • Discovering similar functionality across different modules
  • Locating examples of a pattern when you don't know exact names
  • Understanding what code does without reading everything

Documentation & Knowledge:

  • Searching documentation by concept, not keywords
  • Finding related discussions in comments or docs
  • Discovering similar issues or solutions
  • Analyzing technical documents (PDFs, reports)

Use Cases:

  • "Find all authentication-related code" (without knowing function names)
  • "Show me error handling patterns" (regardless of specific error types)
  • "Find code similar to this implementation" (semantic similarity)
  • "Search research papers for 'distributed consensus'" (document search)

Choose semtools over file-search (ripgrep/ast-grep) when:

  • You know the concept but not the keywords
  • Exact string matching misses relevant results
  • You want semantically similar code, not exact matches
  • Searching across languages or mixed content

Still use file-search when:

  • You know exact keywords, function names, or patterns
  • You need structural code matching (ast-grep)
  • Speed is critical (ripgrep is faster for exact matches)
  • You're searching for specific symbols or references

Available Commands

Semtools provides three CLI commands you can use via execute_command:

  • search - Semantic search across code and text files
  • workspace - Manage workspaces for caching embeddings
  • parse - Convert documents (PDF, DOCX, PPTX) to searchable text

All commands work out-of-the-box in your execution environment. Document parsing requires the LLAMA_CLOUD_API_KEY environment variable to be set.

Core Operations

1. Semantic Search (search)

Find files and code sections by semantic meaning:

# Basic semantic search
search "authentication logic" src/

# Search with more context (5 lines before/after)
search "error handling" --n-lines 5 src/

# Get more results (default: 3)
search "database queries" --top-k 10 src/

# Control the distance threshold (0.0-1.0, lower = stricter)
search "API endpoints" --max-distance 0.4 src/

Parameters:

  • --n-lines N: Show N lines of context around matches (default: 3)
  • --top-k K: Return top K most similar matches (default: 3)
  • --max-distance D: Maximum embedding distance (0.0-1.0, default: 0.3)
  • -i: Case-insensitive matching

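Output format (despite the "similarity" label, lower values indicate closer matches, consistent with --max-distance):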

Match 1 (similarity: 0.12)
File: src/auth/handlers.py
Lines: 42-47
----
def authenticate_user(username: str, password: str) -> Optional[User]:
    """Authenticate user credentials against database."""
    user = get_user_by_username(username)
    if user and verify_password(password, user.password_hash):
        return user
    return None
----

Match 2 (similarity: 0.18)
File: src/middleware/auth.py
...
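Because each match block carries a File: line, the output can be reduced to a list of matched files for follow-up tools. The following is a minimal sketch, assuming the real output matches the sample above exactly. The matched_files helper is a hypothetical name, not part of Semtools.

```shell
# Hypothetical helper: read search output on stdin and print each matched
# file path once, in first-seen order. Assumes "File: <path>" lines as in
# the sample output above.
matched_files() {
  awk '/^File: / {
    path = substr($0, 7)            # drop the 6-character "File: " prefix
    if (!(path in seen)) { seen[path] = 1; print path }
  }'
}
```

Under that assumption, search "authentication logic" src/ | matched_files would print one path per line, ready to pass to rg or cat through xargs.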

2. Workspace Management (workspace)

For large codebases, create workspaces to cache embeddings and enable fast repeated searches:

# Create/activate workspace
workspace use my-project

# Set workspace via environment variable
export SEMTOOLS_WORKSPACE=my-project

# Index files in workspace (workspace auto-detected from env var)
search "query" src/

# Check workspace status
workspace status

# Clean up old workspaces
workspace prune

Benefits:

  • Fast repeated searches: Embeddings cached, no re-computation
  • Large codebases: IVF_PQ indexing for scalability
  • Session persistence: Maintain context across multiple searches

When to use workspaces:

  • Searching the same codebase multiple times
  • Very large projects (1000+ files)
  • Interactive exploration sessions
  • CI/CD pipelines with repeated searches
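For scripted or CI use, it can help to set the workspace variable only when the caller has not already chosen one. This is a small sketch under stated assumptions: ensure_workspace is a hypothetical helper, and deriving the workspace name from the directory name is an illustrative convention rather than Semtools behavior. The text above does not say whether a workspace must first be created with workspace use, so the sketch only manages the environment variable.

```shell
# Hypothetical helper: export SEMTOOLS_WORKSPACE if it is not already set,
# deriving the name from a directory's basename (default: current directory).
# Prints the workspace name that is in effect.
ensure_workspace() {
  if [ -z "${SEMTOOLS_WORKSPACE:-}" ]; then
    SEMTOOLS_WORKSPACE=$(basename "${1:-$PWD}")
    export SEMTOOLS_WORKSPACE
  fi
  printf '%s\n' "$SEMTOOLS_WORKSPACE"
}
```

Calling ensure_workspace /path/to/my-project directly (not inside a command substitution) leaves SEMTOOLS_WORKSPACE=my-project exported for the searches that follow.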

3. Document Parsing (parse) ⚠️ Requires API Key

Convert documents to searchable markdown (requires LlamaParse API key):

# Parse PDFs to markdown
parse research_papers/*.pdf

# Parse Word documents
parse reports/*.docx

# Parse presentations
parse slides/*.pptx

# Parse and pipe to search
parse docs/*.pdf | xargs search "neural networks"

Supported formats:

  • PDF (.pdf)
  • Word (.docx)
  • PowerPoint (.pptx)

Configuration:

# Via environment variable
export LLAMA_CLOUD_API_KEY="llx-..."

# Via config file
cat > ~/.parse_config.json << EOF
{
  "api_key": "llx-...",
  "max_concurrent_requests": 10,
  "timeout_seconds": 3600
}
EOF

Important: Document parsing is optional. Semantic search works without it.
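Since parsing depends on a key from one of the two sources above, a pre-flight check can avoid a failed parse call. This is a rough sketch: parse_ready is a hypothetical helper, and the grep only confirms that the config file mentions an api_key field, not that the key is valid.

```shell
# Hypothetical pre-flight check: succeed if an API key is available either
# from the environment or from a config file (path overridable as $1).
parse_ready() {
  config_file=${1:-$HOME/.parse_config.json}
  if [ -n "${LLAMA_CLOUD_API_KEY:-}" ]; then
    return 0
  fi
  # Rough check only: the file exists and contains an "api_key" field.
  [ -f "$config_file" ] && grep -q '"api_key"' "$config_file"
}
```

A script could then guard the optional step with parse_ready && parse docs/*.pdf and fall back to plain semantic search otherwise.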

Workflow Patterns

Pattern 1: Concept Discovery

When you know what you're looking for conceptually but not by name:

# Step 1: Broad semantic search
search "rate limiting implementation" src/

# Step 2: Review results, refine query
search "throttle requests per user" src/ --top-k 10

# Step 3: Use ripgrep for exact follow-up
rg "RateLimiter" --type py src/

Pattern 2: Similar Code Finder

When you want to find code similar to a reference implementation:

# Step 1: Extract key concepts from reference code
# [Read example_auth.py and identify key concepts]

# Step 2: Search for similar implementations
search "user authentication with JWT tokens" src/

# Step 3: Compare implementations
# [Review semantic matches to find similar approaches]

Pattern 3: Documentation Search

When researching concepts in documentation or comments:

# Search code comments semantically
search "thread safety guarantees" src/ --n-lines 10

# Search markdown documentation
search "deployment best practices" docs/

# Combined search
search "performance optimization" src/ docs/ --top-k 20

Pattern 4: Cross-Language Search

When searching for concepts across different languages:

# Semantic search works across languages
search "connection pooling" src/

# May find:
# - Java: "ConnectionPool manager"
# - Python: "database connection reuse"
# - Go: "pool of persistent connections"
# All semantically related despite different terminology

Pattern 5: Document Analysis (with API key)

When analyzing PDFs or documents:

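# Step 1: Parse documents to markdown and save the list of generated files
# (the xargs pipelines in this document imply that parse prints file paths)
parse research/*.pdf > parsed_files.txt

# Step 2: Search converted content
xargs search "transformer architecture" < parsed_files.txt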

# Step 3: Combine with code search
search "attention mechanism implementation" src/

Integration with file-search

Semtools and file-search (ripgrep/ast-grep) are complementary tools. Use them together for comprehensive search:

Search Strategy Matrix

You Know | Use First | Then Use | Why
--- | --- | --- | ---
Exact keywords | ripgrep | search | Fast exact match, then find similar
Concept only | search | ripgrep | Find relevant code, then search specifics
Function name | ripgrep | search | Find definition, then find similar usage
Code pattern | ast-grep | search | Find structure, then find similar logic
Approximate idea | search | ripgrep + ast-grep | Discover, then drill down

Layered Search Approach

# Layer 1: Semantic discovery (what's related?)
search "user session management" src/ --top-k 10

# Layer 2: Exact text search (what's the implementation?)
rg "SessionManager|session_store" --type py

# Layer 3: Structural search (how is it used?)
sg --pattern 'session.$METHOD($$$)' --lang python

# Layer 4: Reference tracking (where is it called?)
# [Use serena skill for symbol-level tracking]

Best Practices

1. Start Broad, Then Narrow

Use semantic search for discovery, then narrow with exact search:

# GOOD: Broad semantic discovery first
search "authentication" src/ --top-k 10
# [Review results to learn terminology]
rg "authenticate|verify_credentials" --type py src/

# AVOID: Starting too narrow and missing variations
rg "authenticate" --type py  # Misses "verify_credentials", "check_auth", etc.

2. Adjust Similarity Threshold

Tune --max-distance based on results:

# Too many irrelevant results? Decrease distance (more strict)
search "query" src/ --max-distance 0.2

# Missing relevant results? Increase distance (more lenient)
search "query" src/ --max-distance 0.5

# Default (0.3) works well for most cases
search "query" src/
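The tuning loop described here can be automated by starting strict and loosening the threshold only when nothing comes back. The sketch below assumes that search prints nothing when no match passes the threshold, which this document does not confirm. The search_widening helper and the chosen steps (0.2 to 0.5) are illustrative.

```shell
# Hypothetical helper: retry the same query with progressively larger
# --max-distance values and stop at the first threshold that prints anything.
# Usage: search_widening "query" path...
search_widening() {
  query=$1
  shift
  for distance in 0.2 0.3 0.4 0.5; do
    result=$(search "$query" --max-distance "$distance" "$@")
    if [ -n "$result" ]; then
      printf '# matched at --max-distance %s\n%s\n' "$distance" "$result"
      return 0
    fi
  done
  return 1   # nothing matched even at the most lenient threshold
}
```

Starting strict keeps the first non-empty result set as precise as possible. A non-zero exit status signals that the query should be rephrased instead of loosened further.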

3. Use Workspaces for Repeated Searches

For interactive exploration, always use workspaces:

# GOOD: Create workspace once, search many times
export SEMTOOLS_WORKSPACE=my-analysis
search "concept1" src/
search "concept2" src/
search "concept3" src/

# INEFFICIENT: Without SEMTOOLS_WORKSPACE set, embeddings are re-computed every time
search "concept1" src/
search "concept2" src/

4. Combine with Context Tools

Get more context around semantic matches:

# Find semantically similar code
search "retry logic" src/ --n-lines 2

# Get more context with ripgrep
rg -C 10 "retry" src/specific_file.py

# Or read the full file
cat src/specific_file.py

5. Phrase Queries Conceptually

Write queries as concepts, not exact keywords:

# GOOD: Conceptual queries
search "handling network timeouts" src/
search "user input validation" src/
search "concurrent data access" src/

# LESS EFFECTIVE: Exact keyword queries (use ripgrep instead)
search "timeout" src/  # Use: rg "timeout" src/
search "validate" src/  # Use: rg "validate" src/
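The same rule of thumb can be written as a small dispatcher. This is only a heuristic sketch: choose_tool is a hypothetical helper, and treating any query containing a space as conceptual is a simplification of the guidance above.

```shell
# Hypothetical heuristic: multi-word queries are treated as concepts and sent
# to semantic search; single tokens are treated as exact keywords for ripgrep.
choose_tool() {
  case "$1" in
    *" "*) echo "search" ;;   # contains a space: concept phrase
    *)     echo "rg" ;;       # one token: exact keyword or symbol name
  esac
}
```

For example, choose_tool "handling network timeouts" prints search, while choose_tool "authenticate_user" prints rg.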

Understanding Semantic Distance

Semtools uses embedding vectors to measure semantic similarity:

  • Distance 0.0: Identical meaning
  • Distance 0.1-0.2: Very similar (synonyms, paraphrases)
  • Distance 0.2-0.3: Related concepts (default threshold)
  • Distance 0.3-0.5: Loosely related
  • Distance 0.5+: Weakly related or unrelated
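When scanning many results, it can be convenient to label each reported distance with its band. The sketch below encodes the ranges listed above as rules of thumb, not guarantees. The distance_band helper is hypothetical. It treats values above 0 and up to 0.2 as very similar, and values above 0.3 and below 0.5 as loosely related, both of which are assumptions that fill gaps in the list.

```shell
# Hypothetical helper: map a distance (0.0-1.0) to a descriptive band.
distance_band() {
  awk -v d="$1" 'BEGIN {
    if (d < 0 || d > 1) { print "out of range"; exit }
    if (d == 0)         { print "identical"; exit }
    if (d <= 0.2)       { print "very similar"; exit }
    if (d <= 0.3)       { print "related"; exit }
    if (d < 0.5)        { print "loosely related"; exit }
    print "weakly related or unrelated"
  }'
}
```

For example, distance_band 0.12 prints very similar, and distance_band 0.45 prints loosely related.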

Practical guidelines:

# Strict matching (only close matches)
--max-distance 0.2

# Balanced matching (default, recommended)
--max-distance 0.3

# Lenient matching (exploratory search)
--max-distance 0.4

# Very lenient (may include false positives)
--max-distance 0.5

Local vs. Cloud Embeddings

Semantic Search (Local):

  • Uses local embeddings (model2vec, potion-multilingual-128M)
  • No API calls or cloud dependencies
  • Fast, private, no cost
  • Works offline

Document Parsing (Cloud):

  • Uses LlamaParse API (cloud-based)
  • Requires API key and internet connection
  • Processes PDFs, DOCX, PPTX
  • Usage-based pricing (check LlamaIndex pricing)

Privacy consideration: Semantic search runs entirely locally. Only document parsing sends data to the LlamaParse API.

Performance Considerations

Speed Characteristics

Without workspace:

  • First search: ~2-5 seconds (embedding computation)
  • Subsequent searches: ~2-5 seconds each (re-compute embeddings)

With workspace (cached embeddings):

  • First search: ~2-5 seconds (builds index)
  • Subsequent searches: ~0.1-0.5 seconds (cached)
  • Large codebases: IVF_PQ indexing for scalability

Comparison:

  • ripgrep: 0.01-0.1 seconds (fastest, exact match)
  • ast-grep: 0.1-0.5 seconds (fast, structural)
  • semtools (cached): 0.1-0.5 seconds (fast, semantic)
  • semtools (uncached): 2-5 seconds (slower, semantic)
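To see when a workspace pays off, the quoted ranges can be turned into a back-of-the-envelope estimate for a session of N searches. This is arithmetic on the approximate figures above, not a benchmark. The estimate_seconds helper is hypothetical.

```shell
# Rough estimate from the figures quoted above: every uncached search costs
# 2-5 s; with a workspace only the first search does, and each later search
# costs 0.1-0.5 s.
# Prints "<uncached min>-<uncached max> <cached min>-<cached max>" in seconds.
estimate_seconds() {
  LC_ALL=C awk -v n="$1" 'BEGIN {
    printf "%.1f-%.1f %.1f-%.1f\n", n * 2, n * 5, 2 + (n - 1) * 0.1, 5 + (n - 1) * 0.5
  }'
}
```

For a 20-search session, estimate_seconds 20 prints 40.0-100.0 3.9-14.5, so caching cuts the estimated total from 40-100 seconds to about 4-15 seconds. For a single search the two ranges are identical.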

Optimization Tips

# 1. Use workspaces for repeated searches
export SEMTOOLS_WORKSPACE=my-project

# 2. Limit search scope to relevant files (here, skipping test paths via find)
find src/ -name "*.py" ! -path "*/tests/*" | xargs search "query"

# 3. Use --top-k to control result count
search "query" src/ --top-k 5

# 4. Pipe to head for quick preview
search "query" src/ | head -50

Unix Pipeline Integration

Semtools is designed for Unix-style composition:

# Find and parse PDFs, then search
find docs/ -name "*.pdf" | xargs parse | xargs search "topic"

# Search and filter with grep
search "authentication" src/ | grep -i "jwt"

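# Count matches (assumes each result starts with a "Match N" header line)
search "error handling" src/ | grep -c "^Match "

# Combine with other tools: keep only matched files that also mention "REST"
# (assumes the "File: <path>" lines shown in the output format)
search "API" src/ | grep "^File: " | cut -d" " -f2- | sort -u | xargs rg -l "REST"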

Limitations

When NOT to Use Semtools

  1. Exact keyword search: Use ripgrep for known keywords

    # WRONG TOOL: Semantic search for exact function name
    search "authenticate_user"
    
    # RIGHT TOOL: Use ripgrep for exact matches
    rg "authenticate_user" --type py
    
  2. Structural code patterns: Use ast-grep for syntax matching

    # WRONG TOOL: Semantic search for code structure
    search "class with constructor"
    
    # RIGHT TOOL: Use ast-grep for structure
    sg --pattern 'class $NAME { constructor($$$) { $$$ } }'
    
  3. Symbol references: Use serena for LSP-based tracking

    # WRONG TOOL: Semantic search for all usages
    search "MyClass usage"
    
    # RIGHT TOOL: Use serena for precise references
    serena find_referencing_symbols --name 'MyClass'
    
  4. Small codebases: Overhead not worth it for <100 files

    • ripgrep is faster and simpler for small projects

Known Edge Cases

  • Ambiguous queries: Vague concepts return broad results
  • Technical jargon: Domain-specific terms may have lower accuracy
  • Short code snippets: Limited context reduces embedding quality
  • Mixed languages: Embeddings work best on English text, although a multilingual model is used
  • Generated code: Repetitive patterns may cluster together

Troubleshooting

No Semantic Matches Found

If semantic search returns zero results:

  1. Verify files exist: Use ripgrep to confirm content

    rg "concept" src/
    
  2. Increase the distance threshold: Be more lenient

    search "query" --max-distance 0.5
    
  3. Rephrase query: Try different terminology

    search "user authentication"
    search "verify user credentials"
    search "login validation"
    
  4. Check file types: Ensure searching correct extensions

    search "query" src/*.py  # Target specific types
    

Too Many Irrelevant Results

If semantic search returns too much noise:

  1. Decrease the distance threshold: Be more strict

    search "query" --max-distance 0.2
    
  2. Limit result count: Review top matches only

    search "query" --top-k 3
    
  3. Narrow directory scope: Search specific paths

    search "query" src/specific_module/
    
  4. Refine query: Add more specific concepts

    # Vague
    search "data"
    
    # Specific
    search "data validation with regex patterns"
    

Document Parsing Fails

If parse fails:

  1. Verify API key is set:

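    # Report whether the key is set without printing the secret itself
    [ -n "$LLAMA_CLOUD_API_KEY" ] && echo "API key is set" || echo "API key is missing"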
    
  2. Check file format: Ensure supported format (PDF, DOCX, PPTX)

    file document.pdf  # Verify file type
    
  3. Check file size: Large files may timeout

    du -h document.pdf  # Check size
    
  4. Review parse config: Adjust timeouts if needed

    cat ~/.parse_config.json
    

Workspace Issues

If workspace commands fail:

# Check workspace status
workspace status

# Prune corrupted workspaces
workspace prune

# Recreate workspace
rm -rf ~/.semtools/workspaces/my-workspace
export SEMTOOLS_WORKSPACE=my-workspace

Resources