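Lecture 16: Building an OCR Text Recognition Skill

Build an OCR skill that extracts text from images, scans, and PDFs, so that digitizing paper documents becomes simple and efficient.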

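1. Scenario Analysis

1.1 User Pain Points

Everyday office work often involves text that cannot be copied directly:

Paper document entry: contracts, invoices, and ID documents have to be typed in by hand, which is slow and error-prone
Text in images: screenshots and photos contain important text that cannot be copied
Scanned documents: scanned PDFs cannot be searched or copied until OCR has been run on them
Table recognition: tables in images have complex structure, and rebuilding them by hand is time-consuming
Batch processing: large numbers of images need consistent recognition, which is impractical to do by hand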

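1.2 Typical Use Cases

Scenario	Requirement	Value of the Skill
Invoice recognition	Read the amount, date, tax ID, and other fields from an invoice image	Extracts structured data automatically
Business card recognition	Extract the name, phone number, company, and other fields from a business card	One-step entry into the address book
Contract scanning	Turn scanned paper contracts into editable text	Digital archiving
ID document recognition	Read ID cards, business licenses, and similar documents	Fills in forms automatically
Table recognition	Recognize a table in an image and convert it to Excel	Preserves the table structure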

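2. Core Feature Design

2.1 Skill Feature Architecture

👁️ OCR Smart Recognition
├── Text recognition
│   ├── Printed text
│   ├── Handwriting
│   ├── Multiple languages
│   └── Skew correction
├── Structured recognition
│   ├── Invoices
│   ├── Business cards
│   ├── ID cards
│   ├── Bank cards
│   └── Business licenses
├── Table recognition
│   ├── Table detection
│   ├── Cell recognition
│   ├── Table reconstruction
│   └── Excel export
├── Batch processing
│   ├── Batch recognition
│   ├── Batch export
│   ├── Result proofreading
│   └── Error flagging
└── Advanced features
    ├── Image preprocessing
    ├── Region-of-interest recognition
    ├── Confidence scoring
    └── Result formatting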

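2.2 Technology Choices

The core technology stack for OCR processing:

Function	Technology	Notes
Open-source OCR	Tesseract / PaddleOCR	Free; can run offline
Cloud OCR	Baidu / Tencent / Alibaba Cloud OCR APIs	High accuracy; handles complex scenarios
Table recognition	PaddleOCR-Table / ExcelNet	Dedicated to table recognition
Image processing	OpenCV / Pillow	Image preprocessing and enhancement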

3. Technical Implementation

3.1 Implementation on Coze

3.1.1 Basic OCR Code

Using Tesseract:

import pytesseract
from PIL import Image
import cv2
import numpy as np

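def ocr_image(image_path, lang='chi_sim+eng', preprocess=True):
    """
    Recognize the text in an image.

    Args:
        image_path: path to the image
        lang: recognition languages (chi_sim: Simplified Chinese, eng: English)
        preprocess: whether to preprocess the image first

    Returns:
        The recognized text
    """
    # Read the image; cv2.imread returns None instead of raising on failure
    image = cv2.imread(image_path)
    if image is None:
        raise FileNotFoundError(f"Cannot read image: {image_path}")

    if preprocess:
        # Preprocessing returns a single-channel (binarized) image
        image = preprocess_image(image)

    # Convert to a PIL Image; only 3-channel BGR input needs a color conversion
    if image.ndim == 2:
        pil_image = Image.fromarray(image)
    else:
        pil_image = Image.fromarray(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))

    # Run OCR
    text = pytesseract.image_to_string(pil_image, lang=lang)

    return text.strip()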

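def preprocess_image(image):
    """
    Preprocess an image to improve recognition accuracy.

    Args:
        image: OpenCV image (BGR)

    Returns:
        A deskewed, binarized single-channel image
    """
    # Convert to grayscale
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

    # Denoise
    denoised = cv2.fastNlMeansDenoising(gray)

    # Binarize with Otsu's threshold (dark text on a light page becomes 0 on 255)
    _, binary = cv2.threshold(denoised, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    # Correct skew
    corrected = deskew(binary)

    return corrected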

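def deskew(image):
    """Skew correction for a binarized image with dark text on a light background."""
    # Estimate the angle from the text pixels (dark), not the background;
    # cv2.minAreaRect needs float32 or int32 points
    coords = np.column_stack(np.where(image < 128)).astype(np.float32)
    if len(coords) == 0:
        return image  # blank page: nothing to correct
    angle = cv2.minAreaRect(coords)[-1]

    # NOTE: the angle range returned by minAreaRect differs between OpenCV
    # versions; this branch assumes the older [-90, 0) convention, so check
    # the rotation direction on your installed version
    if angle < -45:
        angle = -(90 + angle)
    else:
        angle = -angle

    (h, w) = image.shape[:2]
    center = (w // 2, h // 2)
    M = cv2.getRotationMatrix2D(center, angle, 1.0)
    rotated = cv2.warpAffine(image, M, (w, h),
                             flags=cv2.INTER_CUBIC,
                             borderMode=cv2.BORDER_REPLICATE)

    return rotated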
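
The feature list mentions confidence scoring, which image_to_string does not expose. The sketch below shows one possible way to filter words by confidence. It assumes `data` is the dictionary returned by `pytesseract.image_to_data(image, lang=..., output_type=pytesseract.Output.DICT)`, whose `text` and `conf` lists are aligned by index; the helper name `filter_words` and the threshold of 60 are illustrative choices, not part of the course code.

```python
def filter_words(data, min_conf=60.0):
    """Return (word, confidence) pairs whose confidence is at least min_conf."""
    words = []
    for text, conf in zip(data["text"], data["conf"]):
        conf = float(conf)  # depending on the version, conf may be a string or a number
        # Non-word entries carry empty text and a confidence of -1
        if text.strip() and conf >= min_conf:
            words.append((text, conf))
    return words


# Hand-written sample in the same shape as the image_to_data dictionary
sample = {
    "text": ["", "Invoice", "N0.", "12345"],
    "conf": ["-1", "96", "41", 88.5],
}
print(filter_words(sample))  # → [('Invoice', 96.0), ('12345', 88.5)]
```

Words that fall below the threshold can be shown to the user for manual proofreading instead of being dropped silently.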

Using PaddleOCR (recommended):

from paddleocr import PaddleOCR
import cv2

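# Initialize the OCR engine (parameter names follow the PaddleOCR 2.x API)
ocr_engine = PaddleOCR(
    use_angle_cls=True,  # enable the text-direction classifier
    lang='ch',           # Chinese model
    use_gpu=False        # run on CPU
)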

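def paddle_ocr(image_path):
    """
    Recognize an image with PaddleOCR.

    Args:
        image_path: path to the image

    Returns:
        A list of results, each with the text, its position, and a confidence score
    """
    result = ocr_engine.ocr(image_path, cls=True)

    recognized_text = []
    # result holds one entry per input image; that entry can be None when no text is found
    for line in (result[0] if result and result[0] else []):
        if line:
            bbox = line[0]           # corner points of the text box
            text = line[1][0]        # recognized text
            confidence = line[1][1]  # confidence score

            recognized_text.append({
                'text': text,
                'bbox': bbox,
                'confidence': confidence
            })

    return recognized_text

def extract_text_only(image_path):
    """Extract only the text content, one recognized line per output line."""
    result = paddle_ocr(image_path)
    return '\n'.join([item['text'] for item in result])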
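
Because paddle_ocr returns a confidence score for every line, the results can be split into lines to accept and lines to proofread. The following is a minimal sketch under stated assumptions: the helper name `summarize_confidence` and the 0.8 threshold are hypothetical, and the input is any list of dictionaries with a `confidence` key, as produced by paddle_ocr above.

```python
def summarize_confidence(items, threshold=0.8):
    """Split OCR items into accepted and needs-review lists and report the mean confidence."""
    accepted = [item for item in items if item["confidence"] >= threshold]
    review = [item for item in items if item["confidence"] < threshold]
    mean = sum(item["confidence"] for item in items) / len(items) if items else 0.0
    return {"accepted": accepted, "review": review, "mean_confidence": mean}


# Hand-written sample in the same shape as the paddle_ocr output
sample = [
    {"text": "Invoice No. 12345678", "bbox": None, "confidence": 1.0},
    {"text": "Am0unt 1,2E4.50", "bbox": None, "confidence": 0.5},
]
summary = summarize_confidence(sample)
print(len(summary["accepted"]), len(summary["review"]), summary["mean_confidence"])
# → 1 1 0.75
```

A suitable threshold depends on the documents involved, so it is worth tuning against a few manually checked samples.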

3.1.2 Structured Recognition Code

Invoice recognition:

import re

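def recognize_invoice(image_path):
    """
    Recognize the fields of an invoice.

    Args:
        image_path: path to the invoice image

    Returns:
        Structured invoice information (fields that are not found are None)
    """
    # Run OCR
    text = extract_text_only(image_path)

    # Extract the key fields; the patterns match the Chinese labels printed on the invoice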
    invoice_info = {
        'invoice_code': extract_pattern(text, r'发票代码[:：]\s*(\d+)'),
        'invoice_number': extract_pattern(text, r'发票号码[:：]\s*(\d+)'),
        'date': extract_pattern(text, r'(\d{4}年\d{1,2}月\d{1,2}日|\d{4}-\d{2}-\d{2})'),
        'amount': extract_pattern(text, r'(?:价税合计|金额)[:：]?\s*[¥￥]?\s*([\d,\.]+)'),
        'seller': extract_pattern(text, r'销售方.*?名称[:：]\s*([^\n]+)'),
        'buyer': extract_pattern(text, r'购买方.*?名称[:：]\s*([^\n]+)'),
        'tax_id': extract_pattern(text, r'纳税人识别号[:：]\s*([A-Z0-9]+)')
    }
    
    return invoice_info

def extract_pattern(text, pattern):
    """Return the first capture group of a regex match, or None if nothing matches."""
    match = re.search(pattern, text, re.DOTALL)
    return match.group(1).strip() if match else None
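
The regular expressions above return raw strings, so a downstream system usually needs them normalized and checked for completeness. The sketch below is one possible approach, not part of the course code: the helper names and the `REQUIRED_FIELDS` tuple are assumptions, chosen to match the keys returned by recognize_invoice.

```python
import re
from datetime import date
from decimal import Decimal, InvalidOperation

# Assumed set of mandatory fields; adjust to the target system
REQUIRED_FIELDS = ("invoice_code", "invoice_number", "date", "amount")


def normalize_amount(raw):
    """Convert '1,234.50' to Decimal('1234.50'); return None if it is not a number."""
    if not raw:
        return None
    try:
        return Decimal(raw.replace(",", ""))
    except InvalidOperation:
        return None


def normalize_date(raw):
    """Convert '2024年3月5日' or '2024-03-05' to a date; return None on failure."""
    if not raw:
        return None
    parts = re.findall(r"\d+", raw)
    if len(parts) != 3:
        return None
    try:
        year, month, day = (int(part) for part in parts)
        return date(year, month, day)
    except ValueError:  # impossible dates such as month 13
        return None


def missing_fields(invoice_info):
    """List the required fields that are absent or could not be normalized."""
    cleaned = dict(invoice_info)
    cleaned["amount"] = normalize_amount(invoice_info.get("amount"))
    cleaned["date"] = normalize_date(invoice_info.get("date"))
    return [name for name in REQUIRED_FIELDS if cleaned.get(name) is None]


print(missing_fields({"invoice_code": None, "invoice_number": "12345678",
                      "date": "2024年3月5日", "amount": "1,234.50"}))
# → ['invoice_code']
```

Using Decimal rather than float avoids rounding surprises when amounts are later summed or compared.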

Business card recognition:

def recognize_business_card(image_path):
    """
    Recognize the fields of a business card.

    Args:
        image_path: path to the business card image

    Returns:
        Structured business card information (fields that are not found are None)
    """
    text = extract_text_only(image_path)
    lines = text.split('\n')

    card_info = {
        'name': None,
        'company': None,
        'title': None,
        'phone': None,
        'email': None,
        'address': None
    }

    for line in lines:
        line = line.strip()
        if not line:
            continue

        # Mobile number (mainland China format: 11 digits starting with 1)
        phone_match = re.search(r'1[3-9]\d{9}', line)
        if phone_match:
            card_info['phone'] = phone_match.group()

        # Email address
        email_match = re.search(r'[\w.-]+@[\w.-]+\.\w+', line)
        if email_match:
            card_info['email'] = email_match.group()

        # A contact line may hold both a phone number and an email address
        if phone_match or email_match:
            continue

        # Company (usually contains words such as "公司" or "集团")
        if any(keyword in line for keyword in ['公司', '集团', '企业', '科技']):
            card_info['company'] = line
            continue

        # Job title (usually contains words such as "经理" or "总监")
        if any(keyword in line for keyword in ['经理', '总监', '主管', '工程师']):
            card_info['title'] = line
            continue

        # Name (usually a short string of 2-4 characters with no digits)
        if 2 <= len(line) <= 4 and not any(c.isdigit() for c in line):
            if not card_info['name']:
                card_info['name'] = line

    return card_info

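3.1.3 Table Recognition Code

def recognize_table(image_path):
    """
    Recognize the tables in an image.

    Args:
        image_path: path to the image

    Returns:
        A list of tables, each one a 2D list of cell texts
    """
    # Use PaddleOCR's table recognition (PP-Structure, PaddleOCR 2.x API)
    import cv2
    from paddleocr import PPStructure

    table_engine = PPStructure(
        show_log=False,
        layout=False,
        table=True
    )

    img = cv2.imread(image_path)
    if img is None:
        raise FileNotFoundError(f"Cannot read image: {image_path}")
    result = table_engine(img)

    tables = []
    for region in result:
        if region['type'] == 'table':
            # Get the table as HTML
            html = region['res']['html']
            # Parse the HTML into a 2D list
            table_data = parse_table_html(html)
            tables.append(table_data)

    return tables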

def parse_table_html(html):
    """Parse table HTML into a 2D list of cell texts."""
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, 'html.parser')
    table = soup.find('table')
    if table is None:
        return []  # no <table> element in the HTML

    data = []
    for row in table.find_all('tr'):
        row_data = []
        for cell in row.find_all(['td', 'th']):
            row_data.append(cell.get_text().strip())
        data.append(row_data)
    
    return data

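def table_to_excel(table_data, output_path):
    """Save one table (a 2D list whose first row is the header) as an Excel file."""
    import pandas as pd

    if not table_data:
        raise ValueError("table_data is empty")

    # Merged cells can leave rows with different lengths; pad them to the same width
    width = max(len(row) for row in table_data)
    rows = [row + [''] * (width - len(row)) for row in table_data]

    df = pd.DataFrame(rows[1:], columns=rows[0])
    df.to_excel(output_path, index=False)

    return output_path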

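3.1.4 PDF OCR Code

import os
import tempfile

from pdf2image import convert_from_path

def ocr_pdf(pdf_path, output_format='text'):
    """
    Recognize the text in a scanned PDF.

    Args:
        pdf_path: path to the PDF file
        output_format: output format ('text', 'json', 'docx')

    Returns:
        The recognition result in the requested format
    """
    if output_format not in ('text', 'json', 'docx'):
        raise ValueError(f"Unsupported output_format: {output_format}")

    # Render each PDF page to an image (pdf2image requires Poppler to be installed)
    images = convert_from_path(pdf_path, dpi=300)

    all_results = []

    # Keep page images in a temporary directory that is removed even if OCR fails
    with tempfile.TemporaryDirectory() as temp_dir:
        for i, image in enumerate(images):
            temp_image = os.path.join(temp_dir, f'page_{i}.png')
            image.save(temp_image, 'PNG')

            # Run OCR on the page
            result = paddle_ocr(temp_image)
            all_results.append({
                'page': i + 1,
                'content': result
            })

    # Format the output
    if output_format == 'text':
        return format_ocr_results_as_text(all_results)
    if output_format == 'json':
        return all_results
    # 'docx': create_docx_from_ocr is not defined in this lecture and must be supplied separately
    docx_path = os.path.splitext(pdf_path)[0] + '.docx'
    return create_docx_from_ocr(all_results, docx_path)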

def format_ocr_results_as_text(results):
    """Format OCR results as plain text, with a header line before each page."""
    text = ""
    for page in results:
        text += f"\n--- Page {page['page']} ---\n"
        for item in page['content']:
            text += item['text'] + "\n"
    return text

3.2 Implementation on OpenClaw

An example OCR Skill for OpenClaw:

from openclaw import Skill, Tool
from paddleocr import PaddleOCR
import cv2

class OCRSkill(Skill):
    name = "OCR Smart Recognition"
    description = "Recognize text in images and documents"

    def __init__(self):
        self.ocr = PaddleOCR(use_angle_cls=True, lang='ch', use_gpu=False)

    @Tool
    def recognize_text(self, image_path: str) -> str:
        """Recognize the text in an image."""
        result = self.ocr.ocr(image_path, cls=True)
        # result[0] can be None when no text is detected
        lines = result[0] if result and result[0] else []
        texts = [line[1][0] for line in lines if line]
        return '\n'.join(texts)

    @Tool
    def recognize_invoice(self, image_path: str) -> dict:
        """Recognize the fields of an invoice."""
        text = self.recognize_text(image_path)
        # Extract structured fields (the patterns match the Chinese labels on the invoice)
        return {
            'invoice_code': self._extract(text, r'发票代码[:：]\s*(\d+)'),
            'invoice_number': self._extract(text, r'发票号码[:：]\s*(\d+)'),
            'amount': self._extract(text, r'价税合计.*?([\d,\.]+)'),
            'date': self._extract(text, r'(\d{4}年\d{1,2}月\d{1,2}日)')
        }

    @Tool
    def recognize_table(self, image_path: str, output: str) -> str:
        """Recognize the table in an image and export it to Excel."""
        # Table recognition logic (placeholder; nothing is exported until it is filled in)
        # ...
        return f"Table exported to {output}"

    def _extract(self, text, pattern):
        """Return the first capture group of a regex match, or None."""
        import re
        match = re.search(pattern, text)
        return match.group(1) if match else None

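4. Prompt Design

4.1 System Prompt

You are an OCR assistant that helps users extract text from images and documents.

You can perform the following operations:
1. Text recognition: recognize printed and handwritten text in images
2. Structured recognition: recognize structured documents such as invoices, business cards, and ID documents
3. Table recognition: recognize tables in images and export them to Excel
4. PDF recognition: recognize the text in scanned PDFs
5. Batch processing: recognize multiple images in one run

Workflow:
1. Understand what the user wants recognized
2. Ask for the image or document
3. Choose the appropriate recognition mode
4. Run OCR
5. Return structured results

Notes:
- Remind the user that image clarity affects recognition accuracy
- Include confidence scores for reference
- For complex layouts, suggest recognizing the document section by section
- Protect privacy when handling sensitive documents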
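
Outside the prompt itself, step 3 of the workflow (choosing a recognition mode) is normally left to the model, but application code may want a simple fallback. The sketch below is a naive keyword router under stated assumptions: the intent labels, the English keywords, and the function name `detect_intent` are all hypothetical and would need to match the languages and tools of a real Skill.

```python
# Ordered list: the first intent with a matching keyword wins
INTENT_KEYWORDS = [
    ("invoice_ocr", ("invoice",)),
    ("table_ocr", ("table", "excel")),
    ("pdf_ocr", ("pdf",)),
    ("batch_ocr", ("batch", "folder")),
]


def detect_intent(utterance):
    """Map a user request to a recognition mode; default to general OCR."""
    text = utterance.lower()
    for intent, keywords in INTENT_KEYWORDS:
        if any(keyword in text for keyword in keywords):
            return intent
    return "general_ocr"


print(detect_intent("Convert this table to Excel"))       # → table_ocr
print(detect_intent("Recognize the text in this image"))  # → general_ocr
```

Because the first match wins, a request that mentions both a batch and invoices resolves to invoice_ocr; ambiguous requests are better handled by asking the user.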

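4.2 Intent Recognition Examples

User input	Recognized intent	Extracted parameters
"Recognize the text in this image"	General OCR	Image path
"Extract the information from this invoice"	Invoice recognition	Image path
"Convert this table to Excel"	Table recognition	Image path, output path
"Recognize this scanned PDF"	PDF OCR	PDF path
"Batch-recognize these images"	Batch OCR	Folder path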

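5. Hands-On Cases

5.1 Case 1: Automated Invoice Entry

Scenario: the finance department needs to process invoices in bulk and enter the extracted information into a system.

Solution: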

def batch_process_invoices(image_folder):
    """
    Process a folder of invoices in one batch.

    Args:
        image_folder: folder containing the invoice images

    Returns:
        Path to the Excel file with the results
    """
    import os
    import pandas as pd

    results = []

    # Sort the file names so the output order is reproducible
    for filename in sorted(os.listdir(image_folder)):
        if filename.lower().endswith(('.png', '.jpg', '.jpeg')):
            image_path = os.path.join(image_folder, filename)

            # Recognize the invoice
            invoice_info = recognize_invoice(image_path)
            invoice_info['filename'] = filename
            # Flag invoices whose code could not be extracted for manual review
            invoice_info['status'] = 'OK' if invoice_info['invoice_code'] else 'needs manual review'

            results.append(invoice_info)

    # Export to Excel
    df = pd.DataFrame(results)
    output_path = os.path.join(image_folder, 'invoice_results.xlsx')
    df.to_excel(output_path, index=False)

    return output_path
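
In a large batch, a single unreadable file should not abort the whole run. The sketch below shows one possible way to record failures per file; the helper `run_batch` and the stand-in `fake_recognize` are hypothetical, and in practice the `recognize` argument would be a function such as recognize_invoice.

```python
def run_batch(paths, recognize):
    """Apply recognize(path) to every path and record failures instead of stopping."""
    results, errors = [], []
    for path in paths:
        try:
            results.append({"path": path, "data": recognize(path)})
        except Exception as exc:  # one bad file should not abort the batch
            errors.append({"path": path, "error": f"{type(exc).__name__}: {exc}"})
    return results, errors


def fake_recognize(path):
    """Stand-in recognizer used only for this demonstration."""
    if path.endswith("bad.png"):
        raise ValueError("unreadable image")
    return {"invoice_code": "011001900111"}


results, errors = run_batch(["a.png", "bad.png", "c.png"], fake_recognize)
print(len(results), errors)
# → 2 [{'path': 'bad.png', 'error': 'ValueError: unreadable image'}]
```

The error list can be exported alongside the results so that failed files are routed to manual processing.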

5.2 Case 2: Business Card Entry

Scenario: after a trade show, the sales team needs to organize the business cards it collected.

Solution:

def batch_process_business_cards(image_folder):
    """Process a folder of business cards in one batch."""
    import os
    import pandas as pd

    contacts = []

    # Sort the file names so the output order is reproducible
    for filename in sorted(os.listdir(image_folder)):
        if filename.lower().endswith(('.png', '.jpg', '.jpeg')):
            image_path = os.path.join(image_folder, filename)

            # Recognize the business card
            card_info = recognize_business_card(image_path)
            card_info['filename'] = filename

            contacts.append(card_info)

    # Export to Excel
    df = pd.DataFrame(contacts)
    output_path = os.path.join(image_folder, 'business_cards.xlsx')
    df.to_excel(output_path, index=False)

    # Also generate a vCard file
    vcard_path = os.path.join(image_folder, 'contacts.vcf')
    generate_vcards(contacts, vcard_path)

    return output_path, vcard_path

def generate_vcards(contacts, output_path):
    """Write the contacts to a vCard 3.0 file."""
    with open(output_path, 'w', encoding='utf-8') as f:
        for contact in contacts:
            # vCard 3.0 requires both FN and N, so fall back to a placeholder name
            name = contact.get('name') or contact.get('company') or 'Unknown'
            f.write('BEGIN:VCARD\n')
            f.write('VERSION:3.0\n')
            f.write(f"FN:{name}\n")
            # The family/given split is unknown, so the whole name goes in the first N component
            f.write(f"N:{name};;;;\n")
            if contact.get('phone'):
                f.write(f"TEL:{contact['phone']}\n")
            if contact.get('email'):
                f.write(f"EMAIL:{contact['email']}\n")
            if contact.get('company'):
                f.write(f"ORG:{contact['company']}\n")
            if contact.get('title'):
                f.write(f"TITLE:{contact['title']}\n")
            f.write('END:VCARD\n')
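
OCR output can contain commas, semicolons, or line breaks, all of which have special meaning inside vCard values. The sketch below shows one way to escape a text value following the vCard 3.0 escaping rules; the helper name `escape_vcard_text` is an assumption and is not part of the course code.

```python
def escape_vcard_text(value):
    """Escape backslash, newline, comma, and semicolon for a vCard 3.0 text value."""
    return (
        value.replace("\\", "\\\\")  # escape backslashes first so later escapes are not doubled
        .replace("\r\n", "\\n")
        .replace("\n", "\\n")
        .replace(",", "\\,")
        .replace(";", "\\;")
    )


print(escape_vcard_text("Acme, Inc.; Beijing"))  # → Acme\, Inc.\; Beijing
```

Each field would be passed through this helper before being written, for example `f.write(f"ORG:{escape_vcard_text(company)}\n")`.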

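6. Practice Exercises

Exercise 1: General-Purpose OCR Tool

Create a Skill that does the following:

Accept an image uploaded by the user
Run OCR on it
Return the recognized text
Show the confidence score of each recognized text line

Exercise 2: Invoice Recognition Assistant

Create a Skill that does the following:

Recognize an invoice image
Extract the invoice code, number, amount, and date
Check that the invoice information is complete
Export the result to Excel

Exercise 3: Table Recognition and Export

Create a Skill that does the following:

Recognize the table in an image
Preserve the table structure
Export it to an Excel file
Support batch processing of multiple images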

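7. FAQ

Q1: What if recognition accuracy is low?

Solutions:

Increase the image resolution (300 DPI or higher is recommended)
Preprocess the image (denoising, binarization, skew correction)
Use a cloud OCR API (such as Baidu or Tencent) for higher accuracy
Use a dedicated recognition model for specialized scenarios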
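
For the resolution advice, the pixel size needed to reach 300 DPI is simple arithmetic. The sketch below assumes the current DPI of the image is known (for example from scanner settings); the helper name `target_size` is hypothetical.

```python
def target_size(width, height, current_dpi, target_dpi=300):
    """Pixel size needed to bring an image from current_dpi up to target_dpi."""
    if current_dpi <= 0:
        raise ValueError("current_dpi must be positive")
    scale = max(1.0, target_dpi / current_dpi)  # never downscale
    return round(width * scale), round(height * scale)


print(target_size(800, 600, 150))    # → (1600, 1200)
print(target_size(2480, 3508, 300))  # → (2480, 3508)
```

The result can be passed to an image resize call such as `cv2.resize` or Pillow's `Image.resize`. Upscaling cannot restore detail that was never captured, so rescanning at a higher resolution is preferable when the paper original is available.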

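Q2: What if handwriting recognition is poor?

Solutions:

Use a dedicated handwriting OCR model
Ask users to write neatly
Consider a cloud handwriting recognition service
Have important content proofread by a person

Q3: What if the layout is scrambled after table recognition?

Solutions:

Use a dedicated table recognition model (such as PaddleOCR-Table)
Make sure the table lines are clear
Check and adjust the result manually after recognition
Recognize complex tables section by section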

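8. Next Lecture

In the next lecture we will cover building an email automation Skill, including:

Sending email automatically
Templating email content
Handling attachments automatically
Sending email in bulk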

This is Lecture 16 of the course series 《AI Skills 从入门到实践》 (AI Skills: From Beginner to Practice).
