comp-scout-scrape
Scrape competition websites, extract structured data, and auto-persist to GitHub issues. Creates issues for new competitions, adds comments for duplicates.
Installation

git clone https://github.com/majiayu000/claude-skill-registry /tmp/claude-skill-registry && cp -r /tmp/claude-skill-registry/skills/data/comp-scout-scrape ~/.claude/skills/claude-skill-registry/

Tip: run this command in your terminal to install the skill.
name: comp-scout-scrape
description: Scrape competition websites, extract structured data, and auto-persist to GitHub issues. Creates issues for new competitions, adds comments for duplicates.
Competition Scraper
Scrape creative writing competitions from Australian aggregator sites and automatically persist to GitHub.
What This Skill Does
- Scrapes competitions.com.au and netrewards.com.au
- Extracts structured data (dates, prompts, prizes)
- Checks for duplicates against existing GitHub issues (by URL and title similarity)
- Creates issues for NEW competitions only
- Adds comments to existing issues when same competition found on another site
- Skips competitions that are already tracked
The scraper already filters out sponsored/lottery ads. Your job is to check for duplicates, then persist only new competitions.
What Counts as "New"
A competition is NEW if:
- Its URL is not found in any existing issue body (check the full body text, not just the primary URL field)
- AND its normalized title is <80% similar to all existing issue titles
A competition is a DUPLICATE if:
- Its URL appears anywhere in an existing issue (body text, comments) → already tracked, skip
- Its normalized title is >80% similar to an existing issue title → likely same competition, skip
- Same competition found on a different aggregator site → add comment to existing issue noting the alternate URL
Note: An issue body may contain multiple URLs (one per aggregator site). When checking for duplicates, search the entire issue body for the scraped URL, not just a specific field.
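A minimal Python sketch of this decision, assuming similarity is measured with difflib.SequenceMatcher on normalized titles (the skill does not prescribe a specific similarity algorithm):

```python
from difflib import SequenceMatcher

def title_similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] between two normalized titles."""
    return SequenceMatcher(None, a, b).ratio()

def classify(comp: dict, existing_issues: list[dict]):
    """Classify a scraped competition against existing issues.

    `comp` uses the listing output fields (url, normalized_title);
    `existing_issues` are dicts from `gh issue list --json number,title,body`.
    Returns ("duplicate", issue) or ("new", None).
    """
    for issue in existing_issues:
        # URL anywhere in the issue body -> already tracked.
        if comp["url"] in (issue.get("body") or ""):
            return "duplicate", issue
        # Titles more than 80% similar -> likely the same competition.
        # In practice, normalize the issue title with the same rules described
        # under "Title Normalization" below; plain lowercasing is the minimum.
        if title_similarity(comp["normalized_title"], issue["title"].lower()) > 0.8:
            return "duplicate", issue
    return "new", None
```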
Word Limit Clarification
"25WOL" is a category name, NOT a filter. Competitions with 25, 50, or 100 word limits are all valid creative writing competitions - persist them all (if new).
Prerequisites
pip install playwright
playwright install chromium
Also requires:
- gh CLI authenticated
- Target repository for competition data (not this skills repo)
Workflow
Step 1: Determine Target Repository
The target repo stores competition issues. Specify or get from config:
# From workspace config (if hiivmind-pulse-gh initialized)
TARGET_REPO=$(yq '.repositories[0].full_name' .hiivmind/github/config.yaml 2>/dev/null)
# Or use default/specified
TARGET_REPO="${TARGET_REPO:-discreteds/competition-data}"
Step 2: Scrape Listings
Run the scraper to get structured competition data:
python skills/comp-scout-scrape/scraper.py listings
Output:
{
"competitions": [
{
"url": "https://competitions.com.au/win-example/",
"site": "competitions.com.au",
"title": "Win a $500 Gift Card",
"normalized_title": "500 gift card",
"brand": "Example Brand",
"prize_summary": "$500",
"prize_value": 500,
"closing_date": "2024-12-31"
}
],
"scrape_date": "2024-12-09",
"errors": []
}
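If you drive the workflow from Python instead of the shell, the listings output can be captured and parsed directly; a minimal sketch using the field names from the example above:

```python
import json
import subprocess

# Run the listings command and parse its JSON output.
result = subprocess.run(
    ["python", "skills/comp-scout-scrape/scraper.py", "listings"],
    capture_output=True, text=True, check=True,
)
data = json.loads(result.stdout)

for comp in data["competitions"]:
    print(comp["site"], comp["title"], comp.get("closing_date"))
if data["errors"]:
    print("Scrape errors:", data["errors"])
```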
Step 3: Check for Existing Issues
For each scraped competition, check if it already exists:
# Get all open competition issues
gh issue list -R "$TARGET_REPO" \
--label "competition" \
--state open \
--json number,title,body \
--limit 200
Match by:
- URL in issue body (exact match = definite duplicate)
- Normalized title similarity (>80% = likely duplicate)
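A sketch of wiring this check up from Python, assuming the classify() helper sketched under "What Counts as New", the parsed listings data from Step 2, and an authenticated gh CLI:

```python
import json
import os
import subprocess

target_repo = os.environ.get("TARGET_REPO", "discreteds/competition-data")

# Fetch all open competition issues as JSON.
issues = json.loads(subprocess.run(
    ["gh", "issue", "list", "-R", target_repo,
     "--label", "competition", "--state", "open",
     "--json", "number,title,body", "--limit", "200"],
    capture_output=True, text=True, check=True,
).stdout)

new_comps, duplicates = [], []
for comp in data["competitions"]:            # `data` from the listings step above
    status, issue = classify(comp, issues)   # sketched under "What Counts as New"
    if status == "duplicate":
        duplicates.append((comp, issue))
    else:
        new_comps.append(comp)
```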
Step 4: Fetch Details for New Competitions
For competitions not already tracked, get full details:
python skills/comp-scout-scrape/scraper.py detail "https://competitions.com.au/win-example/"
For multiple new competitions, use batch mode:
echo '{"urls": ["url1", "url2", ...]}' | python skills/comp-scout-scrape/scraper.py details-batch
Step 4.5: Apply Auto-Tagging Rules (NOT Filtering)
IMPORTANT: Auto-tagging is for LABELING issues, not for skipping/excluding competitions.
Check competitions against user preferences from the data repo's CLAUDE.md to determine which labels to apply.
- Fetch preferences:
gh api repos/$TARGET_REPO/contents/CLAUDE.md -H "Accept: application/vnd.github.raw" 2>/dev/null
- Parse the Detection Keywords section for tagging rules
- For each competition, check if title/prize matches any keywords:

```python
# tag_rules as parsed from CLAUDE.md; the keyword sets here are illustrative,
# e.g. {"for-kids": ["lego", "toy"], "cruise": ["p&o", "cruise"]}
labels = []
haystack = (comp["title"] + " " + comp["prize_summary"]).lower()
for label, keywords in tag_rules.items():
    if any(keyword.lower() in haystack for keyword in keywords):
        labels.append(label)
```
- ALL competitions are ALWAYS persisted as issues. Tagged competitions:
  - Get the relevant label applied (e.g., for-kids, cruise)
  - Are closed immediately with an explanation comment
  - But they ARE STILL CREATED as issues (for record-keeping and potential review)
Step 5: Auto-Persist Results
For New Competitions → Create Issue
gh issue create -R "$TARGET_REPO" \
--title "$TITLE" \
--label "competition" \
--label "25wol" \
--body "$(cat <<'EOF'
## Competition Details
**URL:** {url}
**Brand:** {brand}
**Prize:** {prize_summary}
**Word Limit:** {word_limit} words
**Closes:** {closing_date}
**Draw Date:** {draw_date}
**Winners Notified:** {notification_info}
## Prompt
> {prompt}
---
*Scraped from {site} on {scrape_date}*
EOF
)"
Then set milestone by closing month:
gh issue edit $ISSUE_NUMBER -R "$TARGET_REPO" --milestone "December 2024"
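A small sketch for deriving the milestone name, assuming closing_date is in YYYY-MM-DD format as in the listing output:

```python
from datetime import date

def milestone_for(closing_date: str) -> str:
    """Map an ISO closing date to a milestone name like 'December 2024'."""
    return date.fromisoformat(closing_date).strftime("%B %Y")

# milestone_for("2024-12-31") -> "December 2024"
```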
For Duplicates → Add Comment
If competition URL found on another site:
gh issue comment $EXISTING_ISSUE -R "$TARGET_REPO" --body "$(cat <<'EOF'
### Also found on {other_site}
**URL:** {url}
**Title on this site:** {title}
*Discovered: {date}*
EOF
)"
For Filtered Competitions → Create Issue + Close
If competition matched auto-filter keywords:
# Create the issue first (for record-keeping)
ISSUE_URL=$(gh issue create -R "$TARGET_REPO" \
--title "$TITLE" \
--label "competition" \
--label "25wol" \
--label "$FILTER_LABEL" \
--body "...")
# Extract issue number
ISSUE_NUMBER=$(echo "$ISSUE_URL" | grep -oE '[0-9]+$')
# Close with explanation
# Use an unquoted heredoc so $KEYWORD and $FILTER_RULE expand
gh issue close $ISSUE_NUMBER -R "$TARGET_REPO" --comment "$(cat <<EOF
Auto-filtered: matches '$KEYWORD' in $FILTER_RULE preferences.
See CLAUDE.md in this repository for filter settings.
EOF
)"
Step 6: Report Results
Present confirmation to user:
✅ Scrape complete!
**Created 3 new issues:**
- #42: Win a $500 Coles Gift Card (closes Dec 31)
- #43: Win a Trip to Bali (closes Jan 15)
- #44: Win a Year's Supply of Coffee (closes Dec 20)
**Auto-filtered 2 (created + closed):**
- #45: Win Lego Set (for-kids: matched "Lego")
- #46: Win P&O Cruise (cruise: matched "P&O")
**Found 2 duplicates (added as comments):**
- #38: Win Woolworths Gift Cards (also on netrewards.com.au)
- #39: Win Dreamworld Experience (also on netrewards.com.au)
**Skipped 7 already tracked**
IMPORTANT: Do NOT ask "Would you like me to analyze these?" at the end. When invoked by comp-scout-daily, the workflow will automatically invoke analyze/compose skills next. Report results and stop.
Output Fields
Listing Output
| Field | Type | Description |
|---|---|---|
| url | string | Full URL to competition detail page |
| site | string | Source site (competitions.com.au or netrewards.com.au) |
| title | string | Competition title as displayed |
| normalized_title | string | Lowercase, prefixes stripped, for matching |
| brand | string | Sponsor/brand name (if available) |
| prize_summary | string | Prize description or value badge |
| prize_value | int/null | Numeric value in dollars |
| closing_date | string/null | YYYY-MM-DD format |
Detail Output
All listing fields plus:
| Field | Type | Description |
|---|---|---|
| prompt | string | The actual competition question/prompt |
| word_limit | int | Maximum words (default 25) |
| entry_method | string | How to submit entry |
| winner_notification | object/null | Notification details from JSON-LD |
| scraped_at | string | ISO timestamp of scrape |
Winner Notification Object
| Field | Type | Description |
|---|---|---|
| notification_text | string | Raw notification text |
| notification_date | string/null | Specific date if mentioned |
| notification_days | int/null | Days after close/draw |
| selection_text | string | How winners are selected |
| selection_date | string/null | When judging occurs |
Title Normalization
Titles are normalized for deduplication:
- Lowercase
- Strip prefixes: "Win ", "Win a ", "Win an ", "Win the ", "Win 1 of "
- Remove punctuation
- Collapse whitespace
Example:
Original: "Win a $500 Coles Gift Card"
Normalized: "500 coles gift card"
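The scraper emits normalized_title itself; when existing issue titles need the same treatment for comparison, a sketch of these rules:

```python
import re

# Prefixes are checked longest-first so "Win an ..." is not caught by "Win a ".
PREFIXES = ("win 1 of ", "win the ", "win an ", "win a ", "win ")

def normalize_title(title: str) -> str:
    t = title.lower().strip()
    for prefix in PREFIXES:
        if t.startswith(prefix):
            t = t[len(prefix):]
            break
    t = re.sub(r"[^\w\s]", "", t)          # remove punctuation
    return re.sub(r"\s+", " ", t).strip()  # collapse whitespace

# normalize_title("Win a $500 Coles Gift Card") -> "500 coles gift card"
```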
Example Session
User: Scrape competitions
Claude: I'll scrape competitions and persist new ones to GitHub.
[Runs: python skills/comp-scout-scrape/scraper.py listings]
Found 12 competitions from both sites.
[Runs: gh issue list -R discreteds/competition-data --label competition --json number,title,body]
Checking against 45 existing issues...
- 3 are new
- 2 are duplicates (same competition, different source)
- 7 already tracked
Fetching details for 3 new competitions...
[Creates issues and adds comments]
✅ Scrape complete!
**Created 3 new issues:**
- #46: Win a $500 Coles Gift Card (closes Dec 31)
- Milestone: December 2024
- #47: Win a Trip to Bali (closes Jan 15)
- Milestone: January 2025
- #48: Win a Year's Supply of Coffee (closes Dec 20)
- Milestone: December 2024
**Added 2 duplicate comments:**
- #38: Also found on netrewards.com.au
- #39: Also found on netrewards.com.au
CLI Commands Reference
# Scrape all listing pages
python skills/comp-scout-scrape/scraper.py listings
# Get full details for one competition
python skills/comp-scout-scrape/scraper.py detail "URL"
# Get full details for multiple competitions (batch mode)
echo '{"urls": ["url1", "url2"]}' | python skills/comp-scout-scrape/scraper.py details-batch
# Debug: just get URLs
python skills/comp-scout-scrape/scraper.py urls
Batch Details Output
{
"details": [
{
"url": "...",
"title": "...",
"prompt": "Tell us in 25 words...",
"word_limit": 25,
...
}
],
"scrape_date": "2024-12-09",
"errors": []
}
Persistence Details
This skill handles all GitHub persistence. The separate comp-scout-persist skill is deprecated - its functionality is merged here.
Issue Creation Template
## Competition Details
**URL:** {url}
**Brand:** {brand}
**Prize:** {prize_summary}
**Word Limit:** {word_limit} words
**Closes:** {closing_date}
**Draw Date:** {draw_date}
**Winners Notified:** {notification_info}
## Prompt
> {prompt}
---
*Scraped from {site} on {scrape_date}*
Labels
| Label | Description | Auto-applied |
|---|---|---|
| competition | All competition issues | Always |
| 25wol | 25 words or less type | Always |
| for-kids | Auto-filtered (kids competitions) | When keyword matches |
| cruise | Auto-filtered (cruise competitions) | When keyword matches |
| closing-soon | Closes within 3 days | By separate check |
| entry-drafted | Entry has been composed | By comp-scout-compose |
| entry-submitted | Entry has been submitted | Manually |
Milestones
Issues are assigned to milestones by closing date month:
- "December 2024"
- "January 2025"
- etc.
# Create milestone if needed
gh api repos/$TARGET_REPO/milestones \
--method POST \
--field title="$MONTH_YEAR" \
--field due_on="$LAST_DAY_OF_MONTH"
# Assign to issue
gh issue edit $ISSUE_NUMBER -R "$TARGET_REPO" --milestone "$MONTH_YEAR"
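A sketch for computing LAST_DAY_OF_MONTH as the ISO 8601 timestamp the GitHub API expects for due_on:

```python
import calendar
from datetime import date

def last_day_of_month(closing_date: str) -> str:
    """Last day of the closing month, formatted for the milestone due_on field."""
    d = date.fromisoformat(closing_date)
    last = calendar.monthrange(d.year, d.month)[1]
    return f"{d.year:04d}-{d.month:02d}-{last:02d}T23:59:59Z"

# last_day_of_month("2024-12-09") -> "2024-12-31T23:59:59Z"
```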
Duplicate Comment Template
### Also found on {other_site}
**URL:** {url}
**Title on this site:** {title}
*Discovered: {date}*
Filtered Issue Handling
When a competition matches filter keywords:
- Issue is created (for record-keeping)
- Filter label is applied (e.g., for-kids)
- Issue is immediately closed with an explanation
gh issue close $ISSUE_NUMBER -R "$TARGET_REPO" \
--comment "Auto-filtered: matches '$KEYWORD' in $FILTER_RULE preferences."
Integration
This skill is invoked by comp-scout-daily as the first step in the workflow.
After scraping, you can:
- Use comp-scout-analyze to generate entry strategies
- Use comp-scout-compose to write actual entries
- Both will auto-persist their results as comments on the issue