name: process-faq
description: Process and transform FAQ documents (xlsx, word, pdf, txt) into RAG-optimized format. Use when working with FAQ files, knowledge base documents, or when the user needs to analyze and restructure FAQ content for RAG systems.
tools: Read, Write, Bash, AskUserQuestion

Process FAQ - FAQ Knowledge Base Processor

Transform raw FAQ documents into RAG-optimized structured format with intelligent content expansion and analysis.

What This Skill Does

This skill helps you:

Convert FAQ documents to readable Markdown format (script)
Analyze FAQ content and identify expansion opportunities (Claude)
Expand content: split complex questions, rewrite answers (Claude)
Standardize format and generate keywords automatically (script)

Supported Input Formats

Excel (.xlsx)
Word (.docx)
PDF (.pdf)
Text (.txt)

Workflow Overview (3 Stages)

Stage 1: Convert → Markdown (script, Step 1)
Stage 2: Expand → Enhanced FAQ (Claude, Steps 2-4 - THIS IS THE KEY STAGE!)
Stage 3: Standardize → Final RAG format (script, Step 5)

New Workflow (Claude + Script Collaboration)

Step 1: Convert to Markdown (Script)

First, convert the input file to Markdown so Claude can read and analyze it:

python process-faq/scripts/convert_to_markdown.py <input_file>

This creates a *_for_analysis.md file with structured FAQ content.

Why Markdown?

Claude can directly read and understand the content
Better for analyzing content quality vs just checking format
Allows for nuanced, intelligent analysis
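
To make the target of this step concrete, here is a minimal sketch of rendering FAQ entries as Markdown for analysis. It is not the code of convert_to_markdown.py, whose output layout is not shown here; the `FaqEntry` type, the `render_markdown` function, and the heading layout are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FaqEntry:
    """One raw FAQ row (hypothetical structure, for illustration only)."""
    question: str
    answer: str
    category: str = ""


def render_markdown(entries: list[FaqEntry], title: str = "FAQ for analysis") -> str:
    """Render FAQ entries as Markdown, one numbered section per entry."""
    lines = [f"# {title}", ""]
    for number, entry in enumerate(entries, start=1):
        answer = entry.answer.strip()
        lines.append(f"## Q{number}: {entry.question.strip()}")
        if entry.category:
            lines.append(f"Category: {entry.category}")
        lines.append("")
        lines.append(answer)
        # The character count makes overly long answers easy to spot in Step 2.
        lines.append(f"(answer length: {len(answer)} characters)")
        lines.append("")
    return "\n".join(lines)
```

Showing the answer length next to each entry is a design choice of this sketch: it surfaces split candidates (500+ characters) before any deeper reading.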

Step 2: Claude Analyzes and Expands Content (CRITICAL!)

This is the most important step where YOU (Claude) create value!

Read the Markdown file and perform deep content analysis and expansion:

Phase A: Content Quality Analysis

IMPORTANT: Do this BEFORE expanding content!

Identify and document issues:

Logical Issues
- Contradictions: Do different answers give conflicting information?
  - Example: Q1 says "支持退货" (returns accepted) but Q2 says "不支持退货" (returns not accepted)
- Inconsistencies: Do similar questions have different answers?
  - Example: "配送时间 3 天" (delivery takes 3 days) vs "配送时间 5-7 天" (delivery takes 5-7 days)
- Outdated Information: References to old products, prices, or policies?
Duplicate/Redundant Content
- Exact Duplicates: Same question appears multiple times
- Semantic Duplicates: "如何登录?" vs "怎么登录?" (both mean "How do I log in?")
- Overlapping Answers: Multiple questions share 80%+ identical content
Missing or Incomplete Information
- Incomplete Answers: Too brief, missing key steps or details
- Missing Context: Assumes knowledge users may not have
- Broken Logic: Answer doesn't actually address the question
Clarity Issues
- Unclear Questions: Too vague ("支持什么?", i.e. "What is supported?")
- Ambiguous Terms: Undefined jargon or acronyms
- Poor Structure: Wall of text without formatting

Action Required: Document all issues found and decide:

✅ Fix: Resolve contradictions, merge duplicates, clarify ambiguities
⚠️ Flag: Note issues that need user clarification
❌ Remove: Delete truly useless or incorrect content
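
A mechanical pre-check can support the duplicate review above. The sketch below uses only the standard library and assumes each FAQ is a dict with hypothetical "question" and "answer" keys. Character-level similarity catches exact duplicates and heavily overlapping answers; semantic duplicates such as "如何登录?" vs "怎么登录?" still need your own judgment.

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalize_question(question: str) -> str:
    """Drop punctuation and whitespace so trivially different spellings compare equal."""
    return "".join(ch for ch in question if ch.isalnum()).lower()


def find_exact_duplicates(faqs: list[dict]) -> list[tuple[int, int]]:
    """Return (first_index, repeat_index) pairs for questions that normalize to the same text."""
    first_seen: dict[str, int] = {}
    duplicates = []
    for index, faq in enumerate(faqs):
        key = normalize_question(faq["question"])
        if key in first_seen:
            duplicates.append((first_seen[key], index))
        else:
            first_seen[key] = index
    return duplicates


def find_overlapping_answers(faqs: list[dict], threshold: float = 0.8) -> list[tuple[int, int, float]]:
    """Return (i, j, ratio) for answer pairs whose character similarity is >= threshold."""
    overlaps = []
    for (i, first), (j, second) in combinations(enumerate(faqs), 2):
        ratio = SequenceMatcher(None, first["answer"], second["answer"]).ratio()
        if ratio >= threshold:
            overlaps.append((i, j, round(ratio, 2)))
    return overlaps
```

The 0.8 default mirrors the "80%+ identical content" guideline; treat the hits as candidates to review, not as automatic merges.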

Phase B: Content Expansion Strategy

Goal: Transform a small FAQ into a comprehensive, high-quality knowledge base

CRITICAL: Expansion must happen AFTER quality analysis!

Key expansion techniques:

Resolve Issues First (based on the Phase A findings)
- Fix Contradictions: Choose the correct information; if it is unclear which version is correct, flag it for the user
- Merge Duplicates: Combine semantically identical questions into one best version
- Complete Incomplete: Fill in missing steps, add context
- Clarify Ambiguities: Reword vague questions to be specific
Split Complex Questions
- If one question like "如何使用产品?" ("How do I use the product?") contains multiple sub-topics
- Break it into specific questions: "如何安装?" (install), "如何配置?" (configure), "如何维护?" (maintain)
Extract Knowledge Points from Long Answers
- If one answer is 500+ characters and covers multiple topics
- Identify each distinct knowledge point
- Create a dedicated Q&A for each point
- BUT: Ensure each extracted point is accurate and consistent
Identify Missing Common Questions
- Based on the domain, what would users naturally ask?
- Add questions that should exist but don't
- Consider user journey: pre-purchase → purchase → usage → troubleshooting
Rewrite Answers for Clarity
- Make each answer concise and focused
- Remove sales language if not appropriate
- Add structure (numbered lists, bullet points)
- Ensure consistency with other related answers
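
As a rough aid for finding expansion candidates, the sketch below flags answers that are long (500+ characters) or that enumerate several points. The "question"/"answer" keys and the `expansion_hints` function are hypothetical, and counting segments separated by "、" or semicolons is only a heuristic. It hints at where to look and does not replace reading the answer.

```python
import re

# The enumeration comma and semicolons often separate distinct points in one answer.
SEGMENT_SEPARATORS = re.compile(r"[、；;]")


def expansion_hints(faqs: list[dict], long_answer_chars: int = 500, min_segments: int = 3) -> list[dict]:
    """List FAQs whose answers look like they cover multiple knowledge points."""
    hints = []
    for faq in faqs:
        answer = faq["answer"]
        segments = [part for part in SEGMENT_SEPARATORS.split(answer) if part.strip()]
        if len(answer) >= long_answer_chars or len(segments) >= min_segments:
            hints.append(
                {"question": faq["question"], "chars": len(answer), "segments": len(segments)}
            )
    return hints
```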

Example: Quality Analysis + Expansion

Original (with issues):

Q1: 你们是怎么调理睡眠的?
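A1: [About 500 characters covering: product introduction, usage flow, wristband explanation, "手环180元" (wristband costs 180 yuan), after-sales policy, etc.]

Q2: 先用后付是什么意思?
A2: [The same 500-character answer as Q1]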

Q3: 华为手环多少钱？
A3: 手环200元左右

Phase A Analysis - Issues Found:

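❌ Contradiction: Q1 says "手环 180 元" (wristband costs 180 yuan), Q3 says "手环 200 元左右" (about 200 yuan)
❌ Duplicate: Q1 and Q2 have identical answers
⚠️ Overlapping: All three questions mention the wristband, so the information is scattered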
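
A price contradiction like the one above can also be surfaced mechanically. The sketch below is an assumption-laden helper, not part of the skill's scripts: `conflicting_prices` and the "question"/"answer" keys are hypothetical, and it only recognizes amounts written as a number followed by "元". An answer that mentions the term next to the price of a different product would be a false positive, so review each hit.

```python
import re
from collections import defaultdict

# Matches amounts such as "180元" or "199.5 元".
PRICE_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*元")


def conflicting_prices(faqs: list[dict], term: str) -> dict[float, list[str]]:
    """Map each price found in FAQs mentioning `term` to the questions that state it.

    Returns an empty dict when zero or one distinct price is found (no conflict).
    """
    found: dict[float, list[str]] = defaultdict(list)
    for faq in faqs:
        if term not in faq["question"] + faq["answer"]:
            continue
        for amount in PRICE_PATTERN.findall(faq["answer"]):
            found[float(amount)].append(faq["question"])
    return dict(found) if len(found) > 1 else {}
```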

Phase B Expansion - After Fixes:

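Q1: 你们是怎么调理睡眠的？
A1: 我们通过太赫兹能量睡垫来调理睡眠，能帮助疏通经络、改善气血循环...
[Concise: covers only the core principle, with price and other unrelated details removed]

Q2: 先用后付是什么意思？
A2: 先用后付就是您可以先把产品拿回家免费体验，有效果再付款...
[Standalone answer that no longer repeats Q1's content]

Q3: 华为手环多少钱？
A3: 华为手环180元左右
[Price unified to resolve the contradiction; flag it for the user if the correct figure is uncertain]

Q4: 为什么要用华为手环测睡眠？
A4: 手环能精准测出入睡时间、深睡时长等数据...
[New question: supplements the wristband information]

Q5: 手环怎么使用？
A5: 充电后戴在手腕上，连接手机APP即可...
[New question: completes the wristband knowledge points]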

Summary:

Fixed 1 contradiction (price unified)
Resolved 1 duplicate (Q1 and Q2 now have distinct answers)
Expanded 3 → 5 FAQs (knowledge points extracted)
Each answer is now focused and consistent

Phase C: Categorization

Design a clear category structure:

Group related questions together
Use domain-appropriate category names
Aim for 5-10 main categories

Step 3: User Consultation (Optional)

Use the AskUserQuestion tool if you need clarification on:

Domain-specific terminology
Tone preferences (formal vs casual)
Whether to keep sales language
Priority topics to expand

In most cases, you can proceed directly to Step 4 based on your analysis.

Step 4: Generate Expanded FAQ (Excel)

CRITICAL: This is where you do the actual content expansion!

Create a new Excel file with the expanded FAQ content:

File naming: <original_name>_expanded.xlsx

Required columns:

分类 (Category)
问题 (Question)
回答 (Answer)

Optional column (script will generate if missing):

关键词 (Keywords) - you can leave this empty, script will auto-generate

How to create the file:

IMPORTANT (Cross-Platform Compatibility):

DO NOT use python -c "..." to run inline Python code - this causes quote escaping issues on Windows
ALWAYS use the Write tool to create a .py script file first, then run it with python script.py

Step-by-step approach:

First, use the Write tool to create a Python script (e.g., create_faq.py):

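# create_faq.py - create this file with the Write tool, then run it with Python
# Requires pandas and an Excel writer engine such as openpyxl.
import pandas as pd

# Follow the <original_name>_expanded.xlsx naming convention.
OUTPUT_FILE = "filename_expanded.xlsx"

data = [
    {
        "分类": "睡眠问题咨询",
        "问题": "你们是怎么调理睡眠的？",
        "回答": "我们通过太赫兹能量睡垫来调理睡眠..."
    },
    {
        "分类": "睡眠问题咨询",
        "问题": "我总是入睡困难怎么办？",
        "回答": "入睡困难通常和气血不畅有关..."
    },
    # ... add every expanded FAQ entry here
]

# Fix the column order so the file matches the required 分类 / 问题 / 回答 layout.
df = pd.DataFrame(data, columns=["分类", "问题", "回答"])
df.to_excel(OUTPUT_FILE, index=False)
print(f"Successfully created {OUTPUT_FILE} ({len(df)} rows)")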

Then run the script using Bash:

python create_faq.py

Clean up the temporary script after use:

rm create_faq.py  # or 'del create_faq.py' on Windows CMD

Quality checklist before saving:

Each question is specific and focused
Each answer is concise (typically 50-200 characters)
Questions are grouped by logical categories
All important knowledge points are covered
No redundant or duplicate questions
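
Parts of this checklist can be checked mechanically before saving. The sketch below is a hypothetical helper (`check_expanded_faq` is not one of the skill's scripts) that takes the rows as dicts keyed by the required column names and returns warnings. The 50-200 character range and the 5-10 category target are soft guidelines from this document, so treat the output as hints. Whether "all important knowledge points are covered" still requires your own review.

```python
from collections import Counter

REQUIRED_COLUMNS = ("分类", "问题", "回答")


def check_expanded_faq(rows: list[dict], min_chars: int = 50, max_chars: int = 200) -> list[str]:
    """Return human-readable warnings; an empty list means nothing was flagged."""
    warnings = []
    for number, row in enumerate(rows, start=1):
        for column in REQUIRED_COLUMNS:
            if not str(row.get(column, "")).strip():
                warnings.append(f"row {number}: missing {column}")
        length = len(str(row.get("回答", "")).strip())
        if length and not (min_chars <= length <= max_chars):
            warnings.append(
                f"row {number}: answer has {length} characters (target {min_chars}-{max_chars})"
            )
    # Exact duplicate questions only; semantic duplicates need manual review.
    question_counts = Counter(str(row.get("问题", "")).strip() for row in rows)
    for question, count in question_counts.items():
        if question and count > 1:
            warnings.append(f"duplicate question ({count}x): {question}")
    categories = {str(row.get("分类", "")).strip() for row in rows} - {""}
    if not 5 <= len(categories) <= 10:
        warnings.append(f"{len(categories)} categories (target 5-10)")
    return warnings
```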

Step 5: Standardize Format with Script

Use the script to process the expanded file:

python process-faq/scripts/generate_rag_faq.py <expanded_file> <final_output_file>

What the script does:

Auto-generates keywords using jieba TF-IDF
Applies professional Excel formatting
Sets proper column widths and styles
Performs final duplicate check
Creates the final RAG-optimized knowledge base
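
For intuition about the keyword step, here is a small standard-library sketch of TF-IDF ranking over already tokenized text. It is a stand-in, not the logic of generate_rag_faq.py: the function name `tfidf_keywords` is hypothetical, the IDF is computed from the FAQ set itself with simple smoothing, and jieba ships its own IDF table, so its scores and keywords will differ.

```python
import math
from collections import Counter


def tfidf_keywords(token_lists: list[list[str]], top_k: int = 3) -> list[list[str]]:
    """Return the top_k tokens of each document ranked by TF-IDF.

    token_lists holds one token list per FAQ entry; in practice the tokens
    could come from a Chinese tokenizer such as jieba.lcut(text).
    """
    n_docs = len(token_lists)
    # Document frequency: in how many entries each token appears at least once.
    doc_freq = Counter(token for tokens in token_lists for token in set(tokens))
    keywords = []
    for tokens in token_lists:
        term_freq = Counter(tokens)
        scores = {
            token: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[token])) + 1)
            for token, count in term_freq.items()
        }
        # Sort by descending score, then by token, to keep the result deterministic.
        ranked = sorted(scores, key=lambda token: (-scores[token], token))
        keywords.append(ranked[:top_k])
    return keywords
```

Tokens that appear in many entries (such as "账户" in the test data) score lower, which is why the script's keywords tend to be the terms that distinguish one FAQ from the others.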

Example:

python process-faq/scripts/generate_rag_faq.py 申花太赫兹_expanded.xlsx 申花太赫兹_RAG_优化版.xlsx
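
The two script invocations and the intermediate file names can be derived from the input file. The sketch below assumes the analysis and expanded files sit next to the input, which this document does not confirm; `plan_pipeline` and the default `_RAG_优化版` suffix (taken from the example above) are illustrative choices.

```python
from pathlib import Path

SCRIPTS_DIR = "process-faq/scripts"


def plan_pipeline(input_file: str, final_suffix: str = "_RAG_优化版") -> dict:
    """Derive file names and argv lists for the convert and standardize steps."""
    source = Path(input_file)
    analysis_md = source.with_name(f"{source.stem}_for_analysis.md")
    expanded = source.with_name(f"{source.stem}_expanded.xlsx")
    final = source.with_name(f"{source.stem}{final_suffix}.xlsx")
    return {
        "convert": ["python", f"{SCRIPTS_DIR}/convert_to_markdown.py", str(source)],
        "analysis_md": str(analysis_md),
        "expanded": str(expanded),
        "standardize": [
            "python",
            f"{SCRIPTS_DIR}/generate_rag_faq.py",
            str(expanded),
            str(final),
        ],
    }
```

Each command is an argument list rather than a shell string, so it can be passed to `subprocess.run(plan["convert"], check=True)` without the quoting problems that non-ASCII file names or Windows shells can cause.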

Complete Example Workflow

User: "Please process 申花太赫兹知识库.xlsx and convert it to RAG format"

You (Claude):

Step 1: Convert to Markdown

python process-faq/scripts/convert_to_markdown.py 申花太赫兹知识库.xlsx

Step 2: Read and Analyze

Read the generated 申花太赫兹知识库_for_analysis.md using Read tool
Analyze: "I found that the original 7 FAQs have very long answers (500+ characters each)"
Identify: "Each answer actually covers 3-5 different topics"

Step 3: Expand Content

Extract knowledge points from long answers
Create dedicated Q&A for each point
Example: From 1 question about "如何调理睡眠", expand to:
- 你们是怎么调理睡眠的？(treatment principle)
- 我总是入睡困难怎么办？(specific symptom)
- 先用后付是什么意思？(purchase policy)
- 为什么要用华为手环？(device explanation)
- 手环怎么使用？(usage guide)
- and so on...

Step 4: Generate Expanded Excel

Use pandas to create 申花太赫兹知识库_expanded.xlsx with 31 focused FAQs (from original 7)

Step 5: Standardize Format

python process-faq/scripts/generate_rag_faq.py 申花太赫兹知识库_expanded.xlsx 申花太赫兹知识库_RAG_优化版.xlsx

Step 6: Report Results

"Successfully expanded 7 FAQs into 31 focused entries"
"Organized into 8 categories"
"Auto-generated keywords for all entries"
"Final file ready for RAG system"

Key Principles

1. Content Expansion is Key

The main value you provide is:

Expanding small FAQs into comprehensive knowledge bases
Extracting knowledge points from long answers
Creating focused, specific Q&A pairs
NOT just cleaning up format or removing duplicates

2. Quality Over Quantity (But More is Often Better)

Each FAQ should be focused and specific
Better to have 30 focused FAQs than 5 long ones
Each answer should ideally be 50-200 characters
Long answers (500+) should be split into multiple FAQs

3. Think Like a RAG System

How would users search for this information?
What specific questions would they ask?
Would this answer be found by semantic search?
Is the question specific enough to match user intent?

4. Division of Labor

Claude does (creative work):

Content understanding
Knowledge point extraction
Question splitting and rewording
Answer rewriting
Category design

Script does (mechanical work):

Keyword extraction (jieba TF-IDF)
Format standardization
Excel styling
Final duplicate check

Quality Analysis and Expansion Checklist

Phase A: Quality Analysis (Must do FIRST!)

Check for contradictions: Do different FAQs give conflicting information?
Identify duplicates: Exact or semantic duplicates (same meaning, different wording)
Find inconsistencies: Similar questions with different answers (e.g., different prices, timeframes)
Spot incomplete info: Answers missing key steps or context
Flag unclear content: Vague questions, ambiguous terms, undefined jargon
Note outdated info: References to old products, policies, or prices

Action: Document all issues and plan how to resolve them

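Phase B: Content Expansion (After quality fixes!)

Split complex questions: Break multi-topic questions into specific ones
Extract knowledge points: Give each distinct point in a long answer (500+ characters) its own Q&A
Add missing common questions: Cover the user journey (pre-purchase → purchase → usage → troubleshooting)
Rewrite answers: Keep each answer concise (typically 50-200 characters), structured, and consistent with related answers
Categorize: Group related questions into 5-10 main categories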

Output Format

The final Excel file will have:

分类	问题	回答	关键词
Category	Question	Answer	Keywords

Example:

分类	问题	回答	关键词
账户管理	如何重置密码?	1. 点击"忘记密码"\n2. 输入邮箱\n3. 查收重置链接	密码,重置,账户
支付问题	支持哪些支付方式?	我们支持:\n- 支付宝\n- 微信支付\n- 银行卡	支付,方式,支付宝
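
The row layout above can be expressed as a small helper. This is only a sketch of the data shape: `to_output_row` is a hypothetical name, and joining answer lines with newlines and keywords with commas is inferred from the example rows, not from the script's source.

```python
COLUMNS = ("分类", "问题", "回答", "关键词")


def to_output_row(category: str, question: str, answer_lines: list[str], keywords: list[str]) -> dict:
    """Build one output row in the four-column layout shown above."""
    return {
        "分类": category,
        "问题": question,
        # Multi-step answers stay in one cell, one step per line.
        "回答": "\n".join(answer_lines),
        # Keywords are stored as a single comma-separated string.
        "关键词": ",".join(keywords),
    }
```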

Best Practices

Always Convert First: Don't try to analyze binary files directly
Read Thoroughly: Actually read the Markdown file, understand the domain
Quality BEFORE Expansion: Analyze issues first, then expand
- Don't expand broken content - fix it first!
- Resolve contradictions before creating more FAQs
- Merge duplicates before splitting long answers
Look for Logic Issues:
- Contradicting information across FAQs
- Inconsistent answers to similar questions
- Missing prerequisites or context
Extract Knowledge Points: Identify every distinct topic in long answers
Ensure Consistency: Related FAQs should give aligned, non-conflicting information
Create Focused FAQs: Each Q&A should cover one specific topic
Think Like Users: What would they search for? What questions would they ask?
Generate Intermediate File: Create *_expanded.xlsx before running the script
Let Script Handle Keywords: Don't manually generate keywords, let jieba do it

Common Expansion Scenarios

Scenario 1: Long Answer with Multiple Topics

Original:

Q: 你们的产品怎么样?
A: 我们的产品质量好、价格实惠、支持30天退货、全国包邮、还有24小时客服...

Expanded:

Q1: 你们的产品质量如何?
A1: 产品经过严格质检，质量可靠...

Q2: 产品价格贵吗?
A2: 价格实惠，性价比高...

Q3: 支持退货吗?
A3: 支持30天无理由退货...

Q4: 包邮吗?
A4: 全国包邮，无需额外运费...

Q5: 有客服支持吗?
A5: 提供24小时在线客服...

Scenario 2: Vague Question Needs Specificity

Original:

Q: 如何使用?
A: [Long explanation covering installation, configuration, daily use, troubleshooting...]

Expanded:

Q1: 如何安装产品?
A1: [Installation steps]

Q2: 如何进行初始配置?
A2: [Configuration guide]

Q3: 日常使用注意事项有哪些?
A3: [Daily usage tips]

Q4: 遇到问题怎么排查?
A4: [Troubleshooting steps]

Scenario 3: Contradictions and Inconsistencies

Original (with logic issues):

Q1: 配送需要多久?
A1: 一般3-5个工作日送达

Q2: 什么时候能收到货?
A2: 通常7天内送达

Q3: 支持退货吗?
A3: 支持7天无理由退货

Q4: 可以退款吗?
A4: 不支持退款，只能换货

Issues Found:

❌ Contradiction: Q1 says 3-5 working days, Q2 says within 7 days
❌ Contradiction: Q3 says returns are supported, Q4 says refunds are not
⚠️ Semantic duplicate: Q1 and Q2 ask the same thing

Fixed and Expanded:

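Q1: 配送需要多久?
A1: 一般3-5个工作日送达（偏远地区可能需要7天）
[Delivery time unified, with the exception stated]

Q2: 支持退货吗?
A2: 支持7天无理由退货退款
[Contradiction resolved: return and refund policies unified]

Q3: 如何申请退货?
A3: 联系客服说明原因，获得退货地址后寄回即可
[New: adds the return process]

Q4: 退货运费谁承担?
A4: 质量问题我们承担，非质量问题需您承担
[New: adds return details]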