doc-scraper

Generic web scraper for extracting and organizing Snowflake documentation with intelligent caching and configurable spider depth. Scrapes any section of docs.snowflake.com, selected via --base-path.

Install

git clone https://github.com/sfc-gh-dflippo/snowflake-dbt-demo /tmp/snowflake-dbt-demo && cp -r /tmp/snowflake-dbt-demo/.claude/skills/doc-scraper ~/.claude/skills/doc-scraper

Tip: Run this command in your terminal to install the skill.


---
name: doc-scraper
description: Generic web scraper for extracting and organizing Snowflake documentation with intelligent caching and configurable spider depth. Scrapes any section of docs.snowflake.com, selected via --base-path.
---

Snowflake Documentation Scraper

Scrapes docs.snowflake.com sections to Markdown with SQLite caching (7-day expiration).
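The cache behavior is worth understanding: a page fetched within the last 7 days is served from SQLite instead of being re-downloaded. A minimal sketch of that expiration check, assuming a hypothetical pages(url, content, fetched_at) table (the scraper's actual schema may differ):

```python
import sqlite3
from datetime import datetime, timedelta

EXPIRATION_DAYS = 7  # mirrors scraped_pages.expiration_days in scraper_config.yaml

def get_cached_page(db_path: str, url: str) -> str | None:
    """Return cached content for url if fetched within the last 7 days, else None.

    Assumes a hypothetical pages(url, content, fetched_at) table with
    fetched_at stored as a naive UTC ISO string; the real schema may differ.
    """
    cutoff = datetime.utcnow() - timedelta(days=EXPIRATION_DAYS)
    with sqlite3.connect(db_path) as conn:
        row = conn.execute(
            "SELECT content, fetched_at FROM pages WHERE url = ?", (url,)
        ).fetchone()
    if row is None:
        return None
    content, fetched_at = row
    if datetime.fromisoformat(fetched_at) < cutoff:
        return None  # stale entry: caller re-fetches and overwrites it
    return content
```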

Usage

First-time setup (auto-installs uv and doc-scraper):

python3 .claude/skills/doc-scraper/scripts/doc_scraper.py
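The first run bootstraps its own tooling. As a rough illustration of the pattern only (not the script's exact logic, which may use uv's standalone installer instead):

```python
import shutil
import subprocess
import sys

def ensure_uv() -> None:
    """Install uv with pip if it is not already on PATH.

    Illustrative sketch of the bootstrap pattern; doc_scraper.py's
    actual install logic may differ.
    """
    if shutil.which("uv") is None:
        subprocess.run([sys.executable, "-m", "pip", "install", "uv"], check=True)
```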

Subsequent runs:

doc-scraper --output-dir=./snowflake-docs
doc-scraper --output-dir=./snowflake-docs --base-path="/en/sql-reference/"
doc-scraper --output-dir=./snowflake-docs --spider-depth=2

Command Options

| Option | Default | Description |
| --- | --- | --- |
| `--output-dir` | Required | Output directory for scraped docs |
| `--base-path` | `/en/migrations/` | URL section to scrape |
| `--spider-depth` | `1` | Link depth: 0 = seed pages only, 1 = seeds plus linked pages, 2 = two levels of links |
| `--limit` | None | Cap the number of URLs (for testing) |
| `--dry-run` | Off | Preview without writing files |
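When pointing the scraper at a new section, the testing flags combine well for a cheap preview before a full run:

doc-scraper --output-dir=./snowflake-docs --base-path="/en/sql-reference/" --limit=10 --dry-run

This resolves at most 10 URLs and previews what would be written without touching disk.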

Output

output-dir/
├── SKILL.md              # Auto-generated index
├── scraper_config.yaml   # Editable config (auto-created)
├── .cache/               # SQLite cache (auto-managed)
└── en/migrations/*.md    # Scraped pages with frontmatter
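Each scraped .md file starts with YAML frontmatter. The exact fields are defined by the scraper; a hypothetical example of the shape (field names and values here are illustrative, not guaranteed):

---
title: Example page title
source_url: https://docs.snowflake.com/en/migrations/example
scraped_at: 2025-01-01
---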

Configuration

Auto-created at {output-dir}/scraper_config.yaml:

rate_limiting:
  max_concurrent_threads: 4
spider:
  max_pages: 1000
  allowed_paths: ["/en/"]
scraped_pages:
  expiration_days: 7
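Edits take effect on the next run. For example, to scope spidering to the SQL reference and refresh pages more often (values are illustrative):

spider:
  max_pages: 500
  allowed_paths: ["/en/sql-reference/"]
scraped_pages:
  expiration_days: 3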

Troubleshooting

| Issue | Solution |
| --- | --- |
| Too many pages | Lower `--spider-depth` or tighten `spider.allowed_paths` / `max_pages` in the config |
| Missing pages | Increase `--spider-depth` |
| Cache corruption (rare) | Delete `{output-dir}/.cache/` |
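For the cache-corruption case, deleting the directory is safe: the cache is auto-managed and rebuilt on the next run. With the example output directory used above:

rm -rf ./snowflake-docs/.cache/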