process-faq
Process and transform FAQ documents (xlsx, word, pdf, txt) into RAG-optimized format. Use when working with FAQ files, knowledge base documents, or when the user needs to analyze and restructure FAQ content for RAG systems.
$ 설치
git clone https://github.com/joneqian/claude-skills-suite /tmp/claude-skills-suite && cp -r /tmp/claude-skills-suite/output/process-faq ~/.claude/skills/claude-skills-suite// tip: Run this command in your terminal to install the skill
name: process-faq description: Process and transform FAQ documents (xlsx, word, pdf, txt) into RAG-optimized format. Use when working with FAQ files, knowledge base documents, or when the user needs to analyze and restructure FAQ content for RAG systems. tools: Read, Write, Bash, AskUserQuestion
Process FAQ - FAQ Knowledge Base Processor
Transform raw FAQ documents into RAG-optimized structured format with intelligent content expansion and analysis.
What This Skill Does
This skill helps you:
- Convert FAQ documents to readable Markdown format (script)
- Analyze FAQ content and identify expansion opportunities (Claude)
- Expand content: split complex questions, rewrite answers (Claude)
- Standardize format and generate keywords automatically (script)
Supported Input Formats
- Excel (.xlsx)
- Word (.docx)
- PDF (.pdf)
- Text (.txt)
Workflow Overview (3-Step Process)
Step 1: Convert → Markdown (script)
Step 2: Expand → Enhanced FAQ (Claude - THIS IS THE KEY STEP!)
Step 3: Standardize → Final RAG format (script)
New Workflow (Claude + Script Collaboration)
Step 1: Convert to Markdown (Script)
First, convert the input file to Markdown so Claude can read and analyze it:
python process-faq/scripts/convert_to_markdown.py <input_file>
This creates a *_for_analysis.md file with structured FAQ content.
Why Markdown?
- Claude can directly read and understand the content
- Better for analyzing content quality vs just checking format
- Allows for nuanced, intelligent analysis
Step 2: Claude Analyzes and Expands Content (CRITICAL!)
This is the most important step where YOU (Claude) create value!
Read the Markdown file and perform deep content analysis and expansion:
Phase A: Content Quality Analysis (质量检查)
IMPORTANT: Do this BEFORE expanding content!
Identify and document issues:
-
Logical Issues (逻辑问题)
- Contradictions: Do different answers give conflicting information?
- Example: Q1 says "支持退货" but Q2 says "不支持退货"
- Inconsistencies: Do similar questions have different answers?
- Example: "配送时间 3 天" vs "配送时间 5-7 天"
- Outdated Information: References to old products, prices, or policies?
- Contradictions: Do different answers give conflicting information?
-
Duplicate/Redundant Content (重复内容)
- Exact Duplicates: Same question appears multiple times
- Semantic Duplicates: "如何登录?" vs "怎么登录?" (same meaning)
- Overlapping Answers: Multiple questions share 80%+ identical content
-
Missing or Incomplete Information (缺失信息)
- Incomplete Answers: Too brief, missing key steps or details
- Missing Context: Assumes knowledge users may not have
- Broken Logic: Answer doesn't actually address the question
-
Clarity Issues (表达问题)
- Unclear Questions: Too vague ("支持什么?")
- Ambiguous Terms: Undefined jargon or acronyms
- Poor Structure: Wall of text without formatting
Action Required: Document all issues found and decide:
- ✅ Fix: Resolve contradictions, merge duplicates, clarify ambiguities
- ⚠️ Flag: Note issues that need user clarification
- ❌ Remove: Delete truly useless or incorrect content
Phase B: Content Expansion Strategy
Goal: Transform a small FAQ into a comprehensive, high-quality knowledge base
CRITICAL: Expansion must happen AFTER quality analysis!
Key expansion techniques:
-
Resolve Issues First (基于 Phase A 的发现)
- Fix Contradictions: Choose the correct information, note uncertainty for user
- Merge Duplicates: Combine semantically identical questions into one best version
- Complete Incomplete: Fill in missing steps, add context
- Clarify Ambiguities: Reword vague questions to be specific
-
Split Complex Questions
- If one question like "如何使用产品?" contains multiple sub-topics
- Break it into specific questions: "如何安装?", "如何配置?", "如何维护?"
-
Extract Knowledge Points from Long Answers
- If one answer is 500+ characters and covers multiple topics
- Identify each distinct knowledge point
- Create a dedicated Q&A for each point
- BUT: Ensure each extracted point is accurate and consistent
-
Identify Missing Common Questions
- Based on the domain, what would users naturally ask?
- Add questions that should exist but don't
- Consider user journey: pre-purchase → purchase → usage → troubleshooting
-
Rewrite Answers for Clarity
- Make each answer concise and focused
- Remove sales language if not appropriate
- Add structure (numbered lists, bullet points)
- Ensure consistency with other related answers
Example: Quality Analysis + Expansion
Original (with issues):
Q1: 你们是怎么调理睡眠的?
A1: [500字,包含:产品介绍、使用流程、手环说明、"手环180元"、售后政策等]
Q2: 先用后付是什么意思?
A2: [与Q1相同的500字回答]
Q3: 华为手环多少钱?
A3: 手环200元左右
Phase A Analysis - Issues Found:
- ❌ Contradiction: Q1 说"手环 180 元",Q3 说"手环 200 元左右"
- ❌ Duplicate: Q1 和 Q2 的回答完全相同
- ⚠️ Overlapping: 三个问题都提到手环,信息散乱
Phase B Expansion - After Fixes:
Q1: 你们是怎么调理睡眠的?
A1: 我们通过太赫兹能量睡垫来调理睡眠,能帮助疏通经络、改善气血循环...
[简洁,只讲核心调理原理,不再包含价格等无关信息]
Q2: 先用后付是什么意思?
A2: 先用后付就是您可以先把产品拿回家免费体验,有效果再付款...
[独立回答,不再重复Q1的内容]
Q3: 华为手环多少钱?
A3: 华为手环180元左右(已统一价格,解决矛盾)
Q4: 为什么要用华为手环测睡眠?
A4: 手环能精准测出入睡时间、深睡时长等数据...
[新增问题,补充手环相关信息]
Q5: 手环怎么使用?
A5: 充电后戴在手腕上,连接手机APP即可...
[新增问题,完善手环知识点]
Summary:
- Fixed 1 contradiction (价格统一)
- Merged 1 duplicate (Q1 和 Q2)
- Expanded 3 → 5 FAQs (提取知识点)
- Each answer is now focused and consistent
Phase C: Categorization
Design a clear category structure:
- Group related questions together
- Use domain-appropriate category names
- Aim for 5-10 main categories
Step 3: User Consultation (Optional)
Use the AskUserQuestion tool if you need clarification on:
- Domain-specific terminology
- Tone preferences (formal vs casual)
- Whether to keep sales language
- Priority topics to expand
In most cases, you can proceed directly to Step 4 based on your analysis.
Step 4: Generate Expanded FAQ (Excel)
CRITICAL: This is where you do the actual content expansion!
Create a new Excel file with the expanded FAQ content:
File naming: <original_name>_expanded.xlsx
Required columns:
分类(Category)问题(Question)回答(Answer)
Optional column (script will generate if missing):
关键词(Keywords) - you can leave this empty, script will auto-generate
How to create the file:
IMPORTANT (Cross-Platform Compatibility):
- DO NOT use
python -c "..."to run inline Python code - this causes quote escaping issues on Windows - ALWAYS use the Write tool to create a
.pyscript file first, then run it withpython script.py
Step-by-step approach:
- First, use the Write tool to create a Python script (e.g.,
create_faq.py):
# create_faq.py - Use Write tool to create this file
import pandas as pd
data = [
{
"分类": "睡眠问题咨询",
"问题": "你们是怎么调理睡眠的?",
"回答": "我们通过太赫兹能量睡垫来调理睡眠..."
},
{
"分类": "睡眠问题咨询",
"问题": "我总是入睡困难怎么办?",
"回答": "入睡困难通常和气血不畅有关..."
},
# ... 添加所有扩展后的FAQ
]
df = pd.DataFrame(data)
df.to_excel("filename_expanded.xlsx", index=False)
print("Successfully created filename_expanded.xlsx")
- Then run the script using Bash:
python create_faq.py
- Clean up the temporary script after use:
rm create_faq.py # or 'del create_faq.py' on Windows CMD
Quality checklist before saving:
- Each question is specific and focused
- Each answer is concise (typically 50-200 characters)
- Questions are grouped by logical categories
- All important knowledge points are covered
- No redundant or duplicate questions
Step 5: Standardize Format with Script
Use the script to process the expanded file:
python process-faq/scripts/generate_rag_faq.py <expanded_file> <final_output_file>
What the script does:
- Auto-generates keywords using jieba TF-IDF
- Applies professional Excel formatting
- Sets proper column widths and styles
- Performs final duplicate check
- Creates the final RAG-optimized knowledge base
Example:
python process-faq/scripts/generate_rag_faq.py 申花太赫兹_expanded.xlsx 申花太赫兹_RAG_优化版.xlsx
Complete Example Workflow
User: "Please process 申花太赫兹知识库.xlsx and convert it to RAG format"
You (Claude):
Step 1: Convert to Markdown
python process-faq/scripts/convert_to_markdown.py 申花太赫兹知识库.xlsx
Step 2: Read and Analyze
- Read the generated
申花太赫兹知识库_for_analysis.mdusing Read tool - Analyze: "I found that the original 7 FAQs have very long answers (500+ characters each)"
- Identify: "Each answer actually covers 3-5 different topics"
Step 3: Expand Content
- Extract knowledge points from long answers
- Create dedicated Q&A for each point
- Example: From 1 question about "如何调理睡眠", expand to:
- 你们是怎么调理睡眠的?(调理原理)
- 我总是入睡困难怎么办?(具体症状)
- 先用后付是什么意思?(购买政策)
- 为什么要用华为手环?(设备说明)
- 手环怎么使用?(使用指南)
- 等等...
Step 4: Generate Expanded Excel
Use pandas to create 申花太赫兹知识库_expanded.xlsx with 31 focused FAQs (from original 7)
Step 5: Standardize Format
python process-faq/scripts/generate_rag_faq.py 申花太赫兹知识库_expanded.xlsx 申花太赫兹知识库_RAG_优化版.xlsx
Step 6: Report Results
- "Successfully expanded 7 FAQs into 31 focused entries"
- "Organized into 8 categories"
- "Auto-generated keywords for all entries"
- "Final file ready for RAG system"
Key Principles
1. Content Expansion is Key
The main value you provide is:
- Expanding small FAQs into comprehensive knowledge bases
- Extracting knowledge points from long answers
- Creating focused, specific Q&A pairs
- NOT just cleaning up format or removing duplicates
2. Quality Over Quantity (But More is Often Better)
- Each FAQ should be focused and specific
- Better to have 30 focused FAQs than 5 long ones
- Each answer should ideally be 50-200 characters
- Long answers (500+) should be split into multiple FAQs
3. Think Like a RAG System
- How would users search for this information?
- What specific questions would they ask?
- Would this answer be found by semantic search?
- Is the question specific enough to match user intent?
4. Division of Labor
Claude does (creative work):
- Content understanding
- Knowledge point extraction
- Question splitting and rewording
- Answer rewriting
- Category design
Script does (mechanical work):
- Keyword extraction (jieba TF-IDF)
- Format standardization
- Excel styling
- Final duplicate check
Quality Analysis and Expansion Checklist
Phase A: Quality Analysis (Must do FIRST!)
- Check for contradictions: Do different FAQs give conflicting information?
- Identify duplicates: Exact or semantic duplicates (same meaning, different wording)
- Find inconsistencies: Similar questions with different answers (e.g., different prices, timeframes)
- Spot incomplete info: Answers missing key steps or context
- Flag unclear content: Vague questions, ambiguous terms, undefined jargon
- Note outdated info: References to old products, policies, or prices
Action: Document all issues and plan how to resolve them
Phase B: Content Expansion (After quality fixes!)
- Contradictions resolved: Unified conflicting information
- Duplicates merged: Combined semantically identical questions
- Long answers split: Any answer >300 characters covering multiple topics
- Each Q&A focused: One question = one specific topic
- All knowledge points extracted: No information lost from original
- Questions are specific: Avoided vague questions like "如何使用?"
- Answers are concise: Typically 50-200 characters per answer
- Answers are consistent: Related FAQs give aligned information
- Categories are clear: 5-10 logical categories based on the domain
- Common questions added: Anticipated natural user questions
- Proper structure: Used lists, numbering, or bullet points where appropriate
Output Format
The final Excel file will have:
| 分类 | 问题 | 回答 | 关键词 |
|---|---|---|---|
| Category | Question | Answer | Keywords |
Example:
| 分类 | 问题 | 回答 | 关键词 |
|---|---|---|---|
| 账户管理 | 如何重置密码? | 1. 点击"忘记密码"\n2. 输入邮箱\n3. 查收重置链接 | 密码,重置,账户 |
| 支付问题 | 支持哪些支付方式? | 我们支持:\n- 支付宝\n- 微信支付\n- 银行卡 | 支付,方式,支付宝 |
Best Practices
- Always Convert First: Don't try to analyze binary files directly
- Read Thoroughly: Actually read the Markdown file, understand the domain
- Quality BEFORE Expansion: Analyze issues first, then expand
- Don't expand broken content - fix it first!
- Resolve contradictions before creating more FAQs
- Merge duplicates before splitting long answers
- Look for Logic Issues:
- Contradicting information across FAQs
- Inconsistent answers to similar questions
- Missing prerequisites or context
- Extract Knowledge Points: Identify every distinct topic in long answers
- Ensure Consistency: Related FAQs should give aligned, non-conflicting information
- Create Focused FAQs: Each Q&A should cover one specific topic
- Think Like Users: What would they search for? What questions would they ask?
- Generate Intermediate File: Create
*_expanded.xlsxbefore running the script - Let Script Handle Keywords: Don't manually generate keywords, let jieba do it
Common Expansion Scenarios
Scenario 1: Long Answer with Multiple Topics
Original:
Q: 你们的产品怎么样?
A: 我们的产品质量好、价格实惠、支持30天退货、全国包邮、还有24小时客服...
Expanded:
Q1: 你们的产品质量如何?
A1: 产品经过严格质检,质量可靠...
Q2: 产品价格贵吗?
A2: 价格实惠,性价比高...
Q3: 支持退货吗?
A3: 支持30天无理由退货...
Q4: 包邮吗?
A4: 全国包邮,无需额外运费...
Q5: 有客服支持吗?
A5: 提供24小时在线客服...
Scenario 2: Vague Question Needs Specificity
Original:
Q: 如何使用?
A: [Long explanation covering installation, configuration, daily use, troubleshooting...]
Expanded:
Q1: 如何安装产品?
A1: [Installation steps]
Q2: 如何进行初始配置?
A2: [Configuration guide]
Q3: 日常使用注意事项有哪些?
A3: [Daily usage tips]
Q4: 遇到问题怎么排查?
A4: [Troubleshooting steps]
Scenario 3: Contradictions and Inconsistencies
Original (with logic issues):
Q1: 配送需要多久?
A1: 一般3-5个工作日送达
Q2: 什么时候能收到货?
A2: 通常7天内送达
Q3: 支持退货吗?
A3: 支持7天无理由退货
Q4: 可以退款吗?
A4: 不支持退款,只能换货
Issues Found:
- ❌ Contradiction: Q1 说 3-5 天,Q2 说 7 天
- ❌ Contradiction: Q3 支持退货,Q4 说不支持退款
- ⚠️ Semantic duplicate: Q1 和 Q2 问的是同一件事
Fixed and Expanded:
Q1: 配送需要多久?
A1: 一般3-5个工作日送达(偏远地区可能需要7天)
[统一时间信息,说明例外情况]
Q2: 支持退货吗?
A2: 支持7天无理由退货退款
[解决矛盾:统一退货和退款政策]
Q3: 如何申请退货?
A3: 联系客服说明原因,获得退货地址后寄回即可
[新增:补充退货流程]
Q4: 退货运费谁承担?
A4: 质量问题我们承担,非质量问题需您承担
[新增:补充退货细节]
Summary:
- Resolved 2 contradictions
- Merged 1 semantic duplicate
- Expanded with 2 new related questions
- All information now consistent
Scenario 4: Missing Obvious Questions
Original: Only has "如何注册账户?"
Should also add:
- 注册需要提供哪些信息?
- 可以用手机号注册吗?
- 忘记密码怎么办?
- 如何修改个人信息?
- 如何注销账户?
Error Handling
If conversion fails:
- Check file format is supported
- Verify file is not corrupted
- Try opening the file manually first
- Check file encoding (should be UTF-8)
If content is unstructured:
- The Markdown will show raw text
- You'll need to manually identify Q&A pairs
- Suggest restructuring the source document
Dependencies
Required Python packages:
- pandas (data handling)
- openpyxl (Excel support)
- python-docx (Word support)
- PyPDF2 (PDF support)
- jieba (Chinese text processing)
Install with:
pip install -r process-faq/requirements.txt
Repository
