pdf-extractor

Extract text, tables, and metadata from PDF files. Use when working with PDFs, document extraction, or parsing PDF content.

$ インストール

git clone https://github.com/liushuang393/serverlessAIAgents /tmp/serverlessAIAgents && cp -r /tmp/serverlessAIAgents/.agentflow/skills/pdf-extractor ~/.claude/skills/serverlessAIAgents

// tip: Run this command in your terminal to install the skill


name: pdf-extractor description: Extract text, tables, and metadata from PDF files. Use when working with PDFs, document extraction, or parsing PDF content. version: 1.0.0 author: agentflow triggers:

  • pdf
  • extract text
  • parse document
  • read pdf requirements:
  • pypdf
  • pdfplumber tags:
  • document
  • extraction
  • pdf

PDF Extraction Instructions

Overview

This skill extracts text and data from PDF files using Python libraries.

Usage

Basic Text Extraction

import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    for page in pdf.pages:
        text = page.extract_text()
        print(text)

Extract Tables

import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    for page in pdf.pages:
        tables = page.extract_tables()
        for table in tables:
            print(table)

Get Metadata

from pypdf import PdfReader

reader = PdfReader("document.pdf")
metadata = reader.metadata
print(f"Title: {metadata.title}")
print(f"Author: {metadata.author}")
print(f"Pages: {len(reader.pages)}")

Requirements

Install the required packages:

pip install pypdf pdfplumber

Notes

  • For scanned PDFs, consider using OCR libraries like pytesseract
  • Large PDFs should be processed page by page to manage memory