comp-scout-scrape

Scrape competition websites, extract structured data, and auto-persist to GitHub issues. Creates issues for new competitions, adds comments for duplicates.

$ Installation

git clone https://github.com/majiayu000/claude-skill-registry /tmp/claude-skill-registry && cp -r /tmp/claude-skill-registry/skills/data/comp-scout-scrape ~/.claude/skills/comp-scout-scrape

// tip: Run this command in your terminal to install the skill


name: comp-scout-scrape
description: Scrape competition websites, extract structured data, and auto-persist to GitHub issues. Creates issues for new competitions, adds comments for duplicates.

Competition Scraper

Scrape creative writing competitions from Australian aggregator sites and automatically persist to GitHub.

What This Skill Does

  1. Scrapes competitions.com.au and netrewards.com.au
  2. Extracts structured data (dates, prompts, prizes)
  3. Checks for duplicates against existing GitHub issues (by URL and title similarity)
  4. Creates issues for NEW competitions only
  5. Adds comments to existing issues when same competition found on another site
  6. Skips competitions that are already tracked

The scraper already filters out sponsored/lottery ads. Your job is to check for duplicates, then persist only new competitions.

What Counts as "New"

A competition is NEW if:

  • Its URL is not found in any existing issue body (check the full body text, not just the primary URL field)
  • AND its normalized title is <80% similar to all existing issue titles

A competition is a DUPLICATE if:

  • Its URL appears anywhere in an existing issue (body text, comments) → already tracked, skip
  • Its normalized title is >80% similar to an existing issue title → likely same competition, skip
  • Same competition found on a different aggregator site → add comment to existing issue noting the alternate URL

Note: An issue body may contain multiple URLs (one per aggregator site). When checking for duplicates, search the entire issue body for the scraped URL, not just a specific field.
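The two checks above can be sketched in Python. This is a minimal sketch, not the skill's actual code: `classify` is a hypothetical helper, the issue-dict shape assumes the `gh issue list --json number,title,body` output used in Step 3, and ideally both titles would be normalized the same way before comparing.

```python
from difflib import SequenceMatcher

def title_similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] between two titles."""
    return SequenceMatcher(None, a, b).ratio()

def classify(comp: dict, issues: list) -> tuple:
    """Classify a scraped competition against existing issues.

    Returns ("duplicate", issue) if the URL appears in an issue body,
    ("similar", issue) if a title is >80% similar, else ("new", None).
    """
    # URL anywhere in the issue body => already tracked
    for issue in issues:
        if comp["url"] in issue["body"]:
            return ("duplicate", issue)
    # Fall back to fuzzy title matching (>80% = likely duplicate)
    for issue in issues:
        if title_similarity(comp["normalized_title"], issue["title"].lower()) > 0.8:
            return ("similar", issue)
    return ("new", None)
```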

Word Limit Clarification

"25WOL" is a category name, NOT a filter. Competitions with 25, 50, or 100 word limits are all valid creative writing competitions - persist them all (if new).

Prerequisites

pip install playwright
playwright install chromium

Also requires:

  • gh CLI authenticated
  • Target repository for competition data (not this skills repo)

Workflow

Step 1: Determine Target Repository

The target repo stores competition issues. Specify it explicitly or read it from the workspace config:

# From workspace config (if hiivmind-pulse-gh initialized)
TARGET_REPO=$(yq '.repositories[0].full_name' .hiivmind/github/config.yaml 2>/dev/null)

# Or use default/specified
TARGET_REPO="${TARGET_REPO:-discreteds/competition-data}"

Step 2: Scrape Listings

Run the scraper to get structured competition data:

python skills/comp-scout-scrape/scraper.py listings

Output:

{
  "competitions": [
    {
      "url": "https://competitions.com.au/win-example/",
      "site": "competitions.com.au",
      "title": "Win a $500 Gift Card",
      "normalized_title": "500 gift card",
      "brand": "Example Brand",
      "prize_summary": "$500",
      "prize_value": 500,
      "closing_date": "2024-12-31"
    }
  ],
  "scrape_date": "2024-12-09",
  "errors": []
}

Step 3: Check for Existing Issues

For each scraped competition, check if it already exists:

# Get all open competition issues
gh issue list -R "$TARGET_REPO" \
  --label "competition" \
  --state open \
  --json number,title,body \
  --limit 200

Match by:

  1. URL in issue body (exact match = definite duplicate)
  2. Normalized title similarity (>80% = likely duplicate)

Step 4: Fetch Details for New Competitions

For competitions not already tracked, get full details:

python skills/comp-scout-scrape/scraper.py detail "https://competitions.com.au/win-example/"

For multiple new competitions, use batch mode:

echo '{"urls": ["url1", "url2", ...]}' | python skills/comp-scout-scrape/scraper.py details-batch
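The batch payload can be assembled from the listings output rather than typed by hand. A sketch, assuming the listing JSON shape shown in Step 2; `batch_payload` is a hypothetical helper:

```python
import json

def batch_payload(listing_json: str, new_urls: list) -> str:
    """Build the details-batch stdin payload, keeping only new URLs."""
    comps = json.loads(listing_json)["competitions"]
    wanted = set(new_urls)
    urls = [c["url"] for c in comps if c["url"] in wanted]
    return json.dumps({"urls": urls})
```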

Step 4.5: Apply Auto-Tagging Rules (NOT Filtering)

IMPORTANT: Auto-tagging is for LABELING issues, not for skipping/excluding competitions.

Check competitions against user preferences from the data repo's CLAUDE.md to determine which labels to apply.

  1. Fetch preferences:

gh api repos/$TARGET_REPO/contents/CLAUDE.md -H "Accept: application/vnd.github.raw" 2>/dev/null

  2. Parse the Detection Keywords section for tagging rules

  3. For each competition, check if title/prize matches any keywords:

For each tag_rule in [for-kids, cruise]:
  For each keyword in tag_rule.keywords:
    If keyword.lower() in (competition.title + competition.prize_summary).lower():
      Add tag_rule.label to issue labels

  4. ALL competitions are ALWAYS persisted as issues. Tagged competitions:
    • Get the relevant label applied (e.g., for-kids, cruise)
    • Are closed immediately with an explanation comment
    • But they ARE STILL CREATED as issues (for record-keeping and potential review)
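The keyword-matching loop above can be sketched as runnable Python. The rule structure (label mapped to a keyword list) is an assumption based on the pseudocode; the real rules come from the data repo's CLAUDE.md.

```python
def apply_tags(comp: dict, tag_rules: dict) -> list:
    """Return labels whose keywords match the competition's title or prize.

    tag_rules maps a label (e.g. "for-kids") to its keyword list.
    """
    haystack = (comp.get("title", "") + " " + comp.get("prize_summary", "")).lower()
    return [label for label, keywords in tag_rules.items()
            if any(kw.lower() in haystack for kw in keywords)]
```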

Step 5: Auto-Persist Results

For New Competitions → Create Issue

gh issue create -R "$TARGET_REPO" \
  --title "$TITLE" \
  --label "competition" \
  --label "25wol" \
  --body "$(cat <<'EOF'
## Competition Details

**URL:** {url}
**Brand:** {brand}
**Prize:** {prize_summary}
**Word Limit:** {word_limit} words
**Closes:** {closing_date}
**Draw Date:** {draw_date}
**Winners Notified:** {notification_info}

## Prompt

> {prompt}

---
*Scraped from {site} on {scrape_date}*
EOF
)"

Then set the milestone based on the closing month:

gh issue edit $ISSUE_NUMBER -R "$TARGET_REPO" --milestone "December 2024"
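The milestone name can be derived from `closing_date`. A sketch, assuming the YYYY-MM-DD format from the listing output (`milestone_for` is a hypothetical helper; `%B` yields English month names under the default C locale):

```python
from datetime import date

def milestone_for(closing_date: str) -> str:
    """Map a closing date like "2024-12-31" to a milestone name like "December 2024"."""
    d = date.fromisoformat(closing_date)
    return d.strftime("%B %Y")
```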

For Duplicates → Add Comment

If competition URL found on another site:

gh issue comment $EXISTING_ISSUE -R "$TARGET_REPO" --body "$(cat <<'EOF'
### Also found on {other_site}

**URL:** {url}
**Title on this site:** {title}
*Discovered: {date}*
EOF
)"

For Filtered Competitions → Create Issue + Close

If competition matched auto-filter keywords:

# Create the issue first (for record-keeping)
ISSUE_URL=$(gh issue create -R "$TARGET_REPO" \
  --title "$TITLE" \
  --label "competition" \
  --label "25wol" \
  --label "$FILTER_LABEL" \
  --body "...")

# Extract issue number
ISSUE_NUMBER=$(echo "$ISSUE_URL" | grep -oE '[0-9]+$')

# Close with explanation
gh issue close $ISSUE_NUMBER -R "$TARGET_REPO" --comment "$(cat <<EOF
Auto-filtered: matches '$KEYWORD' in $FILTER_RULE preferences.

See CLAUDE.md in this repository for filter settings.
EOF
)"

Step 6: Report Results

Present confirmation to user:

✅ Scrape complete!

**Created 3 new issues:**
- #42: Win a $500 Coles Gift Card (closes Dec 31)
- #43: Win a Trip to Bali (closes Jan 15)
- #44: Win a Year's Supply of Coffee (closes Dec 20)

**Auto-filtered 2 (created + closed):**
- #45: Win Lego Set (for-kids: matched "Lego")
- #46: Win P&O Cruise (cruise: matched "P&O")

**Found 2 duplicates (added as comments):**
- #38: Win Woolworths Gift Cards (also on netrewards.com.au)
- #39: Win Dreamworld Experience (also on netrewards.com.au)

**Skipped 7 already tracked**

IMPORTANT: Do NOT ask "Would you like me to analyze these?" at the end. When invoked by comp-scout-daily, the workflow will automatically invoke analyze/compose skills next. Report results and stop.

Output Fields

Listing Output

| Field | Type | Description |
|-------|------|-------------|
| url | string | Full URL to competition detail page |
| site | string | Source site (competitions.com.au or netrewards.com.au) |
| title | string | Competition title as displayed |
| normalized_title | string | Lowercase, prefixes stripped, for matching |
| brand | string | Sponsor/brand name (if available) |
| prize_summary | string | Prize description or value badge |
| prize_value | int/null | Numeric value in dollars |
| closing_date | string/null | YYYY-MM-DD format |

Detail Output

All listing fields plus:

| Field | Type | Description |
|-------|------|-------------|
| prompt | string | The actual competition question/prompt |
| word_limit | int | Maximum words (default 25) |
| entry_method | string | How to submit entry |
| winner_notification | object/null | Notification details from JSON-LD |
| scraped_at | string | ISO timestamp of scrape |

Winner Notification Object

| Field | Type | Description |
|-------|------|-------------|
| notification_text | string | Raw notification text |
| notification_date | string/null | Specific date if mentioned |
| notification_days | int/null | Days after close/draw |
| selection_text | string | How winners are selected |
| selection_date | string/null | When judging occurs |

Title Normalization

Titles are normalized for deduplication:

  1. Lowercase
  2. Strip prefixes: "Win ", "Win a ", "Win an ", "Win the ", "Win 1 of "
  3. Remove punctuation
  4. Collapse whitespace

Example:

Original: "Win a $500 Coles Gift Card"
Normalized: "500 coles gift card"
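The four steps above can be sketched as a small function. This is an illustrative implementation of the stated rules, not the scraper's actual code; longer prefixes are checked first so "Win 1 of " is stripped before "Win " would match.

```python
import re

# Longest-first so "win 1 of " is stripped before the bare "win " prefix
PREFIXES = ["win 1 of ", "win the ", "win an ", "win a ", "win "]

def normalize_title(title: str) -> str:
    t = title.lower()                       # 1. lowercase
    for p in PREFIXES:                      # 2. strip one leading prefix
        if t.startswith(p):
            t = t[len(p):]
            break
    t = re.sub(r"[^\w\s]", "", t)           # 3. remove punctuation
    return re.sub(r"\s+", " ", t).strip()   # 4. collapse whitespace
```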

Example Session

User: Scrape competitions

Claude: I'll scrape competitions and persist new ones to GitHub.

[Runs: python skills/comp-scout-scrape/scraper.py listings]

Found 12 competitions from both sites.

[Runs: gh issue list -R discreteds/competition-data --label competition --json number,title,body]

Checking against 45 existing issues...
- 3 are new
- 2 are duplicates (same competition, different source)
- 7 already tracked

Fetching details for 3 new competitions...

[Creates issues and adds comments]

✅ Scrape complete!

**Created 3 new issues:**
- #46: Win a $500 Coles Gift Card (closes Dec 31)
  - Milestone: December 2024
- #47: Win a Trip to Bali (closes Jan 15)
  - Milestone: January 2025
- #48: Win a Year's Supply of Coffee (closes Dec 20)
  - Milestone: December 2024

**Added 2 duplicate comments:**
- #38: Also found on netrewards.com.au
- #39: Also found on netrewards.com.au

CLI Commands Reference

# Scrape all listing pages
python skills/comp-scout-scrape/scraper.py listings

# Get full details for one competition
python skills/comp-scout-scrape/scraper.py detail "URL"

# Get full details for multiple competitions (batch mode)
echo '{"urls": ["url1", "url2"]}' | python skills/comp-scout-scrape/scraper.py details-batch

# Debug: just get URLs
python skills/comp-scout-scrape/scraper.py urls

Batch Details Output

{
  "details": [
    {
      "url": "...",
      "title": "...",
      "prompt": "Tell us in 25 words...",
      "word_limit": 25,
      ...
    }
  ],
  "scrape_date": "2024-12-09",
  "errors": []
}

Persistence Details

This skill handles all GitHub persistence. The separate comp-scout-persist skill is deprecated - its functionality is merged here.

Issue Creation Template

## Competition Details

**URL:** {url}
**Brand:** {brand}
**Prize:** {prize_summary}
**Word Limit:** {word_limit} words
**Closes:** {closing_date}
**Draw Date:** {draw_date}
**Winners Notified:** {notification_info}

## Prompt

> {prompt}

---
*Scraped from {site} on {scrape_date}*

Labels

| Label | Description | Auto-applied |
|-------|-------------|--------------|
| competition | All competition issues | Always |
| 25wol | 25 words or less type | Always |
| for-kids | Auto-filtered (kids competitions) | When keyword matches |
| cruise | Auto-filtered (cruise competitions) | When keyword matches |
| closing-soon | Closes within 3 days | By separate check |
| entry-drafted | Entry has been composed | By comp-scout-compose |
| entry-submitted | Entry has been submitted | Manually |

Milestones

Issues are assigned to milestones by closing date month:

  • "December 2024"
  • "January 2025"
  • etc.

# Create milestone if needed
gh api repos/$TARGET_REPO/milestones \
  --method POST \
  --field title="$MONTH_YEAR" \
  --field due_on="$LAST_DAY_OF_MONTH"

# Assign to issue
gh issue edit $ISSUE_NUMBER -R "$TARGET_REPO" --milestone "$MONTH_YEAR"
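`$LAST_DAY_OF_MONTH` can be computed with the standard library. A sketch under the assumption that the GitHub API accepts an ISO 8601 `due_on` timestamp; `last_day_of_month` is a hypothetical helper:

```python
import calendar
from datetime import date

def last_day_of_month(closing_date: str) -> str:
    """Map "2024-12-09" to an ISO timestamp for that month's last day."""
    d = date.fromisoformat(closing_date)
    last = calendar.monthrange(d.year, d.month)[1]  # handles leap years
    return f"{d.year:04d}-{d.month:02d}-{last:02d}T23:59:59Z"
```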

Duplicate Comment Template

### Also found on {other_site}

**URL:** {url}
**Title on this site:** {title}
*Discovered: {date}*

Filtered Issue Handling

When a competition matches filter keywords:

  1. Issue is created (for record-keeping)
  2. Filter label is applied (e.g., for-kids)
  3. Issue is immediately closed with explanation

gh issue close $ISSUE_NUMBER -R "$TARGET_REPO" \
  --comment "Auto-filtered: matches '$KEYWORD' in $FILTER_RULE preferences."

Integration

This skill is invoked by comp-scout-daily as the first step in the workflow.

After scraping, you can:

  • Use comp-scout-analyze to generate entry strategies
  • Use comp-scout-compose to write actual entries
  • Both will auto-persist their results as comments on the issue