docx-advanced-patterns
Advanced python-docx patterns for handling nested tables, complex cell structures, and content extraction beyond basic .text property. Complements the official docx skill with specialized techniques for forms, checklists, and complex layouts.
$ Install
git clone https://github.com/belumume/claude-skills /tmp/claude-skills && cp -r /tmp/claude-skills/docx-advanced-patterns ~/.claude/skills/claude-skills
Tip: Run this command in your terminal to install the skill.
name: docx-advanced-patterns
description: Advanced python-docx patterns for handling nested tables, complex cell structures, and content extraction beyond the basic .text property. Complements the official docx skill with specialized techniques for forms, checklists, and complex layouts.
version: 1.0.0
dependencies:
- python>=3.8
- python-docx>=0.8.11
DOCX Advanced Patterns Skill
Specialized patterns for python-docx that handle complex document structures not covered by basic .text extraction.
When to Use This Skill
Invoke this skill when working with DOCX files that have:
- Nested tables within table cells
- Forms with checkbox options
- Complex multi-row cell layouts
- Checklists with embedded options
- Cell content that doesn't appear with the .text property
Use alongside the official docx skill for comprehensive document handling.
Core Pattern: Nested Table Extraction
Problem
python-docx's cell.text property only extracts direct paragraph text - it does not traverse nested tables within cells.
Symptom:
cell.text # Returns: '' or '\n'
# But cell visually contains content!
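The gap is easier to see at the XML level. The sketch below uses only the standard library and a hand-written `w:tc` fragment (an illustrative assumption, not taken from a real document). It contrasts the paragraphs that are direct children of the cell, which is roughly what `cell.text` covers, with every paragraph under the cell, including those inside a nested `w:tbl`. The helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Hand-written cell fragment: a nested table followed by one empty paragraph.
CELL_XML = f"""
<w:tc xmlns:w="{W}">
  <w:tbl>
    <w:tr>
      <w:tc><w:p><w:r><w:t>High potential</w:t></w:r></w:p></w:tc>
    </w:tr>
  </w:tbl>
  <w:p/>
</w:tc>
"""

def paragraph_text(p):
    """Concatenate the w:t runs of one paragraph element."""
    return "".join(t.text or "" for t in p.iter(f"{{{W}}}t"))

def direct_text(tc):
    """Text of paragraphs that are direct children of the cell."""
    return "\n".join(paragraph_text(p) for p in tc.findall("w:p", NS))

def all_text(tc):
    """Text of every non-empty paragraph under the cell, nested tables included."""
    texts = (paragraph_text(p) for p in tc.iter(f"{{{W}}}p"))
    return "\n".join(t for t in texts if t)

tc = ET.fromstring(CELL_XML)
print(repr(direct_text(tc)))
print(repr(all_text(tc)))
```

Only the second helper reaches the label, because the label's paragraph belongs to the inner cell rather than to the outer one.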
Detection
Check if a cell contains nested tables:
if cell.tables:
    print(f"Found {len(cell.tables)} nested table(s)")
    # Cell has nested content - need special extraction
Solution (Simple)
def extract_cell_content_with_nested_tables(cell):
    """
    Extract all text from a cell, including text from nested tables.

    Args:
        cell: python-docx _Cell object

    Returns:
        str: Combined text from cell paragraphs and nested tables
    """
    text_parts = []
    # Get direct paragraph text (not inside nested tables)
    for para in cell.paragraphs:
        para_text = para.text.strip()
        if para_text:
            text_parts.append(para_text)
    # Get content from nested tables
    if cell.tables:
        for nested_table in cell.tables:
            for nested_row in nested_table.rows:
                # For checkbox lists: Column 0 = label, Column 1 = checkbox
                # Extract text from first column only
                if nested_row.cells:
                    first_col_text = nested_row.cells[0].text.strip()
                    # Filter out checkbox characters (adjust the list to the glyphs your documents use)
                    if first_col_text and first_col_text not in ['☐', '☑', '☒', '□']:
                        text_parts.append(first_col_text)
    return '\n'.join(text_parts) if text_parts else ''
Solution (Recursive for Deep Nesting)
For documents with multiple levels of table nesting:
def extract_cell_content_recursively(cell):
    """
    Recursively extract text from cell including deeply nested tables.
    Handles arbitrary nesting depth.
    """
    text_parts = []

    def _extract_recursive(cell_obj):
        # Get direct paragraphs
        for para in cell_obj.paragraphs:
            para_text = para.text.strip()
            # Skip checkbox glyphs (adjust the list to the glyphs your documents use)
            if para_text and para_text not in ['☐', '☑', '☒', '□']:
                text_parts.append(para_text)
        # Recursively get nested tables
        for nested_table in cell_obj.tables:
            for nested_row in nested_table.rows:
                for nested_cell in nested_row.cells:
                    _extract_recursive(nested_cell)

    _extract_recursive(cell)
    return '\n'.join(text_parts) if text_parts else ''
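One caveat with walking `row.cells`: python-docx reports a merged cell once for every grid position it spans, so a recursive walk can emit the same text more than once. The sketch below is one possible way to visit each underlying cell only once. It relies on the private `_tc` attribute of python-docx cells, which is an assumption that may change between versions, and it falls back to the object itself when that attribute is missing. The function name and the `SimpleNamespace` stand-ins are hypothetical. The stand-ins only let the sketch run without a .docx file.

```python
from types import SimpleNamespace

def extract_unique_text(cell):
    """Depth-first text extraction that visits each underlying cell once."""
    seen = {}   # id -> object; holding the object keeps its id from being reused
    parts = []

    def visit(c):
        key = getattr(c, "_tc", c)  # private attribute; assumption, see lead-in
        if id(key) in seen:
            return
        seen[id(key)] = key
        for para in c.paragraphs:
            text = para.text.strip()
            if text:
                parts.append(text)
        for table in c.tables:
            for row in table.rows:
                for inner in row.cells:
                    visit(inner)

    visit(cell)
    return "\n".join(parts)

# Stand-in objects that mimic the attributes used above
def make_cell(texts=(), tables=()):
    paragraphs = [SimpleNamespace(text=t) for t in texts]
    return SimpleNamespace(paragraphs=paragraphs, tables=list(tables))

merged = make_cell(["Spans two columns"])
row = SimpleNamespace(cells=[merged, merged, make_cell(["Single"])])
outer = make_cell(["Header"], [SimpleNamespace(rows=[row])])
result = extract_unique_text(outer)
print(result)
```

Unlike the functions above, this variant does not filter checkbox glyphs. Add that filter if your nested tables contain them.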
Usage Examples
Example 1: Extracting Form Checkbox Options
Document Structure:
Table Cell contains:
    Nested Table:
        Row 1: "High potential" | ☐
        Row 2: "Moderate potential" | ☐
        Row 3: "Low potential" | ☐
Extraction:
from docx import Document
doc = Document('form.docx')
table = doc.tables[0]
cell = table.rows[1].cells[0]
# Wrong way - returns empty
basic_text = cell.text
print(basic_text) # Output: '' or '\n'
# Right way - extracts nested content
full_text = extract_cell_content_with_nested_tables(cell)
print(full_text)
# Output:
# High potential
# Moderate potential
# Low potential
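The extraction above keeps only the labels. If you also need to know which option is ticked, a minimal sketch along the following lines may help. It assumes the two-column layout from the document structure shown earlier, and it assumes that U+2611 and U+2612 mean "ticked" while U+2610 and U+25A1 mean "empty". Both assumptions should be checked against your documents. `read_checkbox_rows` is a hypothetical helper, and the `SimpleNamespace` objects stand in for python-docx tables so the sketch runs without a file.

```python
from types import SimpleNamespace

CHECKED = {"\u2611", "\u2612"}    # ballot box with check / with X (assumed "ticked")
UNCHECKED = {"\u2610", "\u25a1"}  # empty ballot box / white square (assumed "empty")

def read_checkbox_rows(nested_table):
    """Return (label, is_checked) for each label/checkbox row of a nested table."""
    results = []
    for row in nested_table.rows:
        cells = row.cells
        if len(cells) < 2:
            continue  # not a label/checkbox row
        label = cells[0].text.strip()
        mark = cells[1].text.strip()
        if label and (mark in CHECKED or mark in UNCHECKED):
            results.append((label, mark in CHECKED))
    return results

def fake_row(label, mark):
    """Stand-in for a python-docx row with two cells."""
    return SimpleNamespace(cells=[SimpleNamespace(text=label), SimpleNamespace(text=mark)])

demo = SimpleNamespace(rows=[
    fake_row("High potential", "\u2611"),
    fake_row("Moderate potential", "\u2610"),
    fake_row("Low potential", "\u2610"),
])
options = read_checkbox_rows(demo)
print(options)
```

With a real document, pass each entry of `cell.tables` to the helper instead of the stand-in.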
Example 2: Processing All Cells in a Table
def process_table_with_nested_content(table):
    """Process all cells, handling nested tables"""
    for row in table.rows:
        for cell in row.cells:
            # Extract with nested table support
            content = extract_cell_content_with_nested_tables(cell)
            if content:
                # Process content (translate, analyze, etc.)
                # do_something_with is a placeholder for your own processing function
                processed = do_something_with(content)
                print(f"Cell content: {processed}")
Example 3: Detecting Nested Tables
def analyze_document_structure(doc):
    """Find all cells with nested tables"""
    nested_cells = []
    for t_idx, table in enumerate(doc.tables):
        for r_idx, row in enumerate(table.rows):
            for c_idx, cell in enumerate(row.cells):
                if cell.tables:
                    nested_cells.append({
                        'table': t_idx,
                        'row': r_idx,
                        'col': c_idx,
                        'nested_count': len(cell.tables)
                    })
    return nested_cells

# Usage
doc = Document('complex_form.docx')
nested = analyze_document_structure(doc)
for item in nested:
    print(f"Table {item['table']}, Row {item['row']}, Col {item['col']}: "
          f"{item['nested_count']} nested table(s)")
Common Use Cases
1. Government Forms
Forms often use nested tables for checkbox grids:
def extract_form_responses(doc):
    """Extract all form checkbox options"""
    responses = {}
    for table in doc.tables:
        for row in table.rows:
            cells = row.cells
            if len(cells) < 2:
                continue  # Skip rows without an options column
            # First cell = question
            question = cells[0].text.strip()
            # Second cell = checkbox options (nested table)
            if cells[1].tables:
                options = extract_cell_content_with_nested_tables(cells[1])
                # splitlines() returns [] for empty text, where split('\n') returns ['']
                responses[question] = options.splitlines()
    return responses
2. Evaluation Forms
Extract rating scales and options:
def extract_evaluation_items(doc):
    """Extract evaluation criteria and options"""
    evaluations = []
    for table in doc.tables:
        for row in table.rows[1:]:  # Skip the header row
            cells = row.cells
            if len(cells) < 2:
                continue  # Skip rows without a rating column
            # Get criterion
            criterion = cells[0].text.strip()
            # Get rating options (often nested)
            rating_options = extract_cell_content_with_nested_tables(cells[1])
            evaluations.append({
                'criterion': criterion,
                # splitlines() returns [] for empty text, where split('\n') returns ['']
                'options': rating_options.splitlines()
            })
    return evaluations
3. Complex Data Tables
Extract structured data from cells with nested layouts:
def extract_complex_cell_data(cell):
    """Extract data from cells with complex nested structures"""
    data = {
        'main_content': '',
        'nested_items': []
    }
    # Direct paragraphs: keep the first non-empty one as the main content
    for para in cell.paragraphs:
        if para.text.strip():
            data['main_content'] = para.text.strip()
            break
    # Nested table data: one list of cell texts per nested row
    if cell.tables:
        for nested_table in cell.tables:
            for nested_row in nested_table.rows:
                row_data = [c.text.strip() for c in nested_row.cells]
                data['nested_items'].append(row_data)
    return data
Integration with Official docx Skill
This skill complements the official docx skill:
Official docx skill provides:
- Document creation (docx-js)
- Basic text extraction (pandoc)
- Tracked changes workflows
- Comment handling
- XML access for complex cases
This skill provides:
- Nested table extraction
- Complex cell content handling
- Form and checklist processing
- Advanced content extraction patterns
Use together:
# For basic operations: use official skill
from docx import Document
# For nested table handling: use this skill.
# docx_advanced is assumed to be a local module (docx_advanced.py) holding the
# functions defined above; it is not a published package.
from docx_advanced import extract_cell_content_with_nested_tables

# Combine both
doc = Document('complex_form.docx')  # Official
for table in doc.tables:  # Official
    for row in table.rows:  # Official
        for cell in row.cells:  # Official
            # Advanced extraction:
            content = extract_cell_content_with_nested_tables(cell)
Performance Considerations
For Large Documents:
Cache nested table checks:
def build_nested_table_cache(doc):
    """Pre-compute which cells have nested tables"""
    cache = {}
    for t_idx, table in enumerate(doc.tables):
        for r_idx, row in enumerate(table.rows):
            for c_idx, cell in enumerate(row.cells):
                if cell.tables:
                    cache[(t_idx, r_idx, c_idx)] = len(cell.tables)
    return cache

# Usage
cache = build_nested_table_cache(doc)
for t_idx, table in enumerate(doc.tables):
    for r_idx, row in enumerate(table.rows):
        for c_idx, cell in enumerate(row.cells):
            if (t_idx, r_idx, c_idx) in cache:
                # This cell has nested tables
                content = extract_cell_content_with_nested_tables(cell)
            else:
                # Regular extraction
                content = cell.text
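When the nested-table positions are needed only once, a single traversal can make the same decision per cell without building a cache first. The following is a sketch under that assumption, not a measured optimization. `iter_cell_content` is a hypothetical helper that takes the nested-table extractor as a parameter, and the `SimpleNamespace` objects stand in for a python-docx document so the sketch runs on its own.

```python
from types import SimpleNamespace

def iter_cell_content(doc, extract_nested):
    """Yield ((table, row, col), text) for every cell in one traversal."""
    for t_idx, table in enumerate(doc.tables):
        for r_idx, row in enumerate(table.rows):
            for c_idx, cell in enumerate(row.cells):
                # Use the nested-table extractor only where it is needed
                text = extract_nested(cell) if cell.tables else cell.text
                yield (t_idx, r_idx, c_idx), text

# Stand-in document: one table, one row, a plain cell and a cell with a nested table
plain = SimpleNamespace(text="Name", tables=[])
nested = SimpleNamespace(text="", tables=["placeholder nested table"])
fake_doc = SimpleNamespace(tables=[SimpleNamespace(rows=[SimpleNamespace(cells=[plain, nested])])])

found = dict(iter_cell_content(fake_doc, lambda cell: "from nested table"))
print(found)
```

In real use, pass `extract_cell_content_with_nested_tables` as the second argument. The cache remains the better fit when the same positions are consulted repeatedly.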
Troubleshooting
Issue: Extraction returns empty despite visible content
Diagnosis:
cell = table.rows[1].cells[0]
print(f"cell.text: '{cell.text}'")
print(f"cell.tables: {len(cell.tables)}")
if not cell.text.strip() and cell.tables:
    print("Content is in nested tables!")
Fix: Use extract_cell_content_with_nested_tables(cell)
Issue: Checkbox characters (such as ☐ or ☑) appear in output
Fix: Filter them out:
text = cell.text.strip()
# Remove checkbox unicode characters (adjust to the glyphs your documents use)
clean_text = text.replace('☐', '').replace('☑', '').replace('☒', '').replace('□', '')
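Chained `replace` calls leave stray spaces and empty lines behind when a glyph shares a line with a label. A small alternative is sketched below using `str.translate`. The four code points are an assumption about which checkbox glyphs occur in your documents, and `strip_checkboxes` is a hypothetical helper name.

```python
# Assumption: these four code points cover the checkbox glyphs in your documents.
CHECKBOX_CHARS = "\u2610\u2611\u2612\u25a1"
_STRIP_TABLE = str.maketrans("", "", CHECKBOX_CHARS)

def strip_checkboxes(text: str) -> str:
    """Remove checkbox glyphs, collapse leftover spaces, and drop emptied lines."""
    lines = (" ".join(line.translate(_STRIP_TABLE).split()) for line in text.splitlines())
    return "\n".join(line for line in lines if line)

cleaned = strip_checkboxes("\u2610 High potential\n\u2611\nLow  potential \u2612")
print(cleaned)
```

Line breaks between remaining lines are kept, so the result can still be split into one option per line.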
Issue: Multi-line content not preserved
Fix: Join with newlines:
'\n'.join(text_parts) # Preserves line structure
Best Practices
- Always check for nested tables first:
  if cell.tables:
      content = extract_cell_content_with_nested_tables(cell)
  else:
      content = cell.text
- Handle checkbox characters:
  CHECKBOX_CHARS = ['☐', '☑', '☒', '□']  # adjust to the glyphs your documents use
  if text not in CHECKBOX_CHARS:
      ...  # Process text
- Preserve structure:
  # Use newlines to maintain line breaks
  '\n'.join(lines)
- Test with sample documents:
  def test_extraction():
      doc = Document('sample_form.docx')
      cell = doc.tables[0].rows[1].cells[0]
      extracted = extract_cell_content_with_nested_tables(cell)
      assert 'High potential' in extracted
      assert 'Moderate potential' in extracted
Reference Implementation
See REFERENCE.md for:
- Complete working examples
- Integration patterns
- Advanced recursive extraction
- Performance optimization techniques
Contributing to Anthropic Skills
This pattern is not currently in the official docx skill. If you find it useful, consider contributing:
- Fork https://github.com/anthropics/skills
- Add to document-skills/docx/SKILL.md
- Submit pull request with:
  - Pattern description
  - Code examples
  - Use cases
Success Criteria
Pattern is working if:
- Cells with nested tables return full content
- Checkbox options are extracted correctly
- Form fields are readable
- No content is lost during extraction
- Structure is preserved (line breaks maintained)
