name: ocr-document-processor description: Extract text from images and scanned PDFs using OCR. Supports 100+ languages, table detection, structured output (markdown/JSON), and batch processing.

OCR Document Processor

Extract text from images, scanned PDFs, and photographs using Optical Character Recognition (OCR). Supports multiple languages, structured output formats, and intelligent document parsing.

Core Capabilities

Image OCR: Extract text from PNG, JPEG, TIFF, BMP images
PDF OCR: Process scanned PDFs page by page
Multi-language: Support for 100+ languages
Structured Output: Plain text, Markdown, JSON, or HTML
Table Detection: Extract tabular data to CSV/JSON
Batch Processing: Process multiple documents at once
Quality Assessment: Confidence scoring for OCR results

Quick Start

from scripts.ocr_processor import OCRProcessor

# Simple text extraction
processor = OCRProcessor("document.png")
text = processor.extract_text()
print(text)

# Extract to structured format
result = processor.extract_structured()
print(result['text'])
print(result['confidence'])
print(result['blocks'])  # Text blocks with positions

Core Workflow

1. Basic Text Extraction

from scripts.ocr_processor import OCRProcessor

# From image
processor = OCRProcessor("scan.png")
text = processor.extract_text()

# From PDF
processor = OCRProcessor("scanned.pdf")
text = processor.extract_text()  # All pages

# Specific pages
text = processor.extract_text(pages=[1, 2, 3])

2. Structured Extraction

# Get detailed results
result = processor.extract_structured()

# Result contains:
# - text: Full extracted text
# - blocks: Text blocks with bounding boxes
# - lines: Individual lines
# - words: Individual words with confidence
# - confidence: Overall confidence score
# - language: Detected language

3. Export Formats

# Export to Markdown
processor.export_markdown("output.md")

# Export to JSON
processor.export_json("output.json")

# Export to searchable PDF
processor.export_searchable_pdf("searchable.pdf")

# Export to HTML
processor.export_html("output.html")

Language Support

# Specify language for better accuracy
processor = OCRProcessor("german_doc.png", lang='deu')

# Multiple languages
processor = OCRProcessor("mixed_doc.png", lang='eng+fra+deu')

# Auto-detect language
processor = OCRProcessor("document.png", lang='auto')

Supported Languages (Common)

Code	Language	Code	Language
eng	English	fra	French
deu	German	spa	Spanish
ita	Italian	por	Portuguese
rus	Russian	chi_sim	Chinese (Simplified)
chi_tra	Chinese (Traditional)	jpn	Japanese
kor	Korean	ara	Arabic
hin	Hindi	nld	Dutch

Image Preprocessing

Preprocessing improves OCR accuracy on low-quality images.

# Enable preprocessing
processor = OCRProcessor("noisy_scan.png")
processor.preprocess(
    deskew=True,        # Fix rotation
    denoise=True,       # Remove noise
    threshold=True,     # Binarize image
    contrast=1.5        # Enhance contrast
)
text = processor.extract_text()

Available Preprocessing Options

Option	Description	Default
`deskew`	Correct skewed/rotated images	False
`denoise`	Remove noise and artifacts	False
`threshold`	Convert to black/white	False
`threshold_method`	'otsu', 'adaptive', 'simple'	'otsu'
`contrast`	Contrast factor (1.0 = no change)	1.0
`sharpen`	Sharpen factor (0 = none)	0
`scale`	Upscale factor for small text	1.0
`remove_shadows`	Remove shadow artifacts	False

Table Extraction

# Extract tables from document
tables = processor.extract_tables()

# Each table is a list of rows
for table in tables:
    for row in table:
        print(row)

# Export tables to CSV
processor.export_tables_csv("tables/")

# Export to JSON
processor.export_tables_json("tables.json")

PDF Processing

Multi-Page PDFs

# Process all pages
processor = OCRProcessor("document.pdf")
full_text = processor.extract_text()

# Process specific pages
page_3 = processor.extract_text(pages=[3])

# Get per-page results
results = processor.extract_by_page()
for page_num, text in results.items():
    print(f"Page {page_num}: {len(text)} characters")

Create Searchable PDF

# Convert scanned PDF to searchable PDF
processor = OCRProcessor("scanned.pdf")
processor.export_searchable_pdf("searchable.pdf")

Batch Processing

from scripts.ocr_processor import batch_ocr

# Process directory of images
results = batch_ocr(
    input_dir="scans/",
    output_dir="extracted/",
    output_format="markdown",
    lang="eng",
    recursive=True
)

print(f"Processed: {results['success']} files")
print(f"Failed: {results['failed']} files")

Receipt/Document Parsing

Receipt Extraction

# Parse receipt structure
processor = OCRProcessor("receipt.jpg")
receipt_data = processor.parse_receipt()

# Returns structured data:
# - vendor: Store name
# - date: Transaction date
# - items: List of items with prices
# - subtotal: Subtotal amount
# - tax: Tax amount
# - total: Total amount

Business Card Parsing

# Extract business card info
processor = OCRProcessor("card.jpg")
contact = processor.parse_business_card()

# Returns:
# - name: Person's name
# - title: Job title
# - company: Company name
# - email: Email addresses
# - phone: Phone numbers
# - address: Physical address
# - website: Website URLs

Configuration

processor = OCRProcessor("document.png")

# Configure OCR settings
processor.config.update({
    'psm': 3,           # Page segmentation mode
    'oem': 3,           # OCR engine mode
    'dpi': 300,         # DPI for processing
    'timeout': 30,      # Timeout in seconds
    'min_confidence': 60,  # Minimum word confidence
})

Page Segmentation Modes (PSM)

Mode	Description
0	Orientation and script detection only
1	Automatic page segmentation with OSD
3	Fully automatic page segmentation (default)
4	Assume single column of text
6	Assume single uniform block of text
7	Treat image as single text line
8	Treat image as single word
11	Sparse text. Find as much text as possible
12	Sparse text with OSD

Quality Assessment

# Get confidence scores
result = processor.extract_structured()

# Overall confidence (0-100)
print(f"Confidence: {result['confidence']}%")

# Per-word confidence
for word in result['words']:
    print(f"{word['text']}: {word['confidence']}%")

# Filter low-confidence words
high_conf_words = [w for w in result['words'] if w['confidence'] > 80]

Output Formats

Markdown Export

processor.export_markdown("output.md")

Output includes:

Document title (if detected)
Structured headings
Paragraphs
Tables (as Markdown tables)
Page breaks for multi-page docs

JSON Export

processor.export_json("output.json")

Output structure:

{
  "source": "document.pdf",
  "pages": 5,
  "language": "eng",
  "confidence": 92.5,
  "text": "Full extracted text...",
  "blocks": [
    {
      "type": "paragraph",
      "text": "Block text...",
      "bbox": [x, y, width, height],
      "confidence": 95.2
    }
  ],
  "tables": [...]
}

HTML Export

processor.export_html("output.html")

Creates styled HTML with:

Preserved layout approximation
Highlighted low-confidence regions
Embedded images (optional)
Print-friendly styling

CLI Usage

# Basic extraction
python ocr_processor.py image.png -o output.txt

# Extract to markdown
python ocr_processor.py document.pdf -o output.md --format markdown

# Specify language
python ocr_processor.py german.png --lang deu

# Batch processing
python ocr_processor.py scans/ -o extracted/ --batch

# With preprocessing
python ocr_processor.py noisy.png --preprocess --deskew --denoise

Error Handling

from scripts.ocr_processor import OCRProcessor, OCRError

try:
    processor = OCRProcessor("document.png")
    text = processor.extract_text()
except OCRError as e:
    print(f"OCR failed: {e}")
except FileNotFoundError:
    print("File not found")

Performance Tips

Image Quality: Higher resolution (300+ DPI) improves accuracy
Preprocessing: Use for low-quality scans
Language: Specifying language improves speed and accuracy
PSM Mode: Choose appropriate mode for document type
Large Files: Process PDFs page by page for memory efficiency

Limitations

Handwritten text: Limited accuracy
Complex layouts: May lose structure
Very low quality: Preprocessing helps but has limits
Non-Latin scripts: Require specific language packs

Dependencies

pytesseract>=0.3.10
Pillow>=10.0.0
PyMuPDF>=1.23.0
opencv-python>=4.8.0
numpy>=1.24.0

System Requirements

Tesseract OCR engine must be installed
Language data files for non-English languages

ocr-document-processor

$ 安裝