Lecture 4: Python Text and Bytes | A Complete Guide to String Encoding, Unicode, bytes, and str

Hi everyone, I'm 程序员晚枫, a programmer working hands-on with all kinds of AI projects.

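🎬 Opening: one garbled string, one big mess

Have you ever hit one of these moments?

# Scenario 1: reading a file raises an error
with open('data.txt', 'r') as f:
    content = f.read()
# UnicodeDecodeError: 'utf-8' codec can't decode byte...

# Scenario 2: scraped data comes back garbled
response = requests.get(url)
text = response.text  # nothing but mojibake

# Scenario 3: trouble storing text in a database
name = "张三"
cursor.execute("INSERT INTO users VALUES (?)", (name,))
# The stored value is garbled, or the call raises an error

All of these problems trace back to one core concept: character encoding.

Today we'll work through Python's text and bytes from the ground up, so you can put the mojibake nightmare behind you.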

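🔤 Unicode: what character encoding is really about

What is Unicode?

Unicode is a character set: it assigns a unique number, called a code point, to every character it covers.

# Look up a character's Unicode code point
print(ord('A'))      # 65
print(ord('中'))     # 20013
print(ord('😀'))     # 128512

# Get the character for a code point
print(chr(65))       # 'A'
print(chr(20013))    # '中'
print(chr(128512))   # '😀'

# Unicode escapes
print('\u4e2d')      # '中'
print('\U0001F600')  # '😀'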

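How Unicode is encoded

Unicode only defines which number belongs to which character. How those numbers are stored as bytes is the job of an encoding:

Encoding	Characteristics	Bytes per character
UTF-8	Most widely used, variable-length	1-4 bytes
UTF-16	Used internally by Windows	2 or 4 bytes
UTF-32	Fixed-length	4 bytes
GBK	Legacy Chinese encoding (not a Unicode encoding, listed for comparison)	1-2 bytes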
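
To make the table concrete, here is a minimal sketch that measures the encoded size of a few characters under each encoding. The `samples`, `encodings`, and `sizes` names are illustrative, and the `-le` codec variants are an assumption made so that no byte order mark inflates the counts.

```python
# Compare how many bytes one character takes under each encoding in the table.
# The '-le' variants are used so that no byte order mark (BOM) is added.
samples = ['A', '中', '😀']
encodings = ['utf-8', 'utf-16-le', 'utf-32-le', 'gbk']

sizes = {}
for ch in samples:
    for enc in encodings:
        try:
            sizes[(ch, enc)] = len(ch.encode(enc))
        except UnicodeEncodeError:
            # GBK has no mapping for this character (for example, emoji)
            sizes[(ch, enc)] = None

for (ch, enc), size in sizes.items():
    print(f"{ch!r} in {enc}: {size}")
```

A `None` entry means the encoding cannot represent that character at all, which is one practical reason to prefer UTF-8 for new data.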

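UTF-8: the standard encoding of the internet

UTF-8 is a variable-length encoding that is backward compatible with ASCII:

ASCII characters (0-127): 1 byte
Accented Latin, Greek, Cyrillic, and similar characters (U+0080-U+07FF): 2 bytes
Common Chinese characters: 3 bytes
Most emoji (code points above U+FFFF): 4 bytes

# Look at the encoded bytes
print('A'.encode('utf-8'))     # b'A' - 1 byte
print('中'.encode('utf-8'))    # b'\xe4\xb8\xad' - 3 bytes
print('😀'.encode('utf-8'))    # b'\xf0\x9f\x98\x80' - 4 bytes

# Count the bytes
print(len('A'.encode('utf-8')))    # 1
print(len('中'.encode('utf-8')))   # 3
print(len('Hello'.encode('utf-8')))  # 5
print(len('你好'.encode('utf-8')))   # 6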
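
The byte counts above follow from how UTF-8 packs a code point's bits into lead and continuation bytes. As a rough illustration, the hypothetical `utf8_encode` helper below does the packing by hand and checks itself against Python's built-in codec. It is a sketch only: it does not reject surrogates or out-of-range values the way the real codec does.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode a single code point to UTF-8 by hand (illustrative only)."""
    if code_point < 0x80:
        # 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:
        # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:
        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])

for ch in ['A', '\u00e9', '中', '😀']:
    manual = utf8_encode(ord(ch))
    assert manual == ch.encode('utf-8')   # agrees with the built-in codec
    print(ch, manual, len(manual))
```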

🔄 str and bytes: Python's two string-like types

The core difference

Python 3 draws a clear line between two types:

Type	Description	Example
`str`	Unicode string (human-readable)	`'Hello 中文'`
`bytes`	Byte sequence (machine-readable)	`b'Hello'`

# str: text
s = 'Hello 中文'
print(type(s))  # <class 'str'>
print(s)        # Hello 中文

# bytes: a sequence of bytes
b = b'Hello'
print(type(b))  # <class 'bytes'>
print(b)        # b'Hello'

# A bytes literal may only contain ASCII characters
# b = b'中文'  # SyntaxError!

# Chinese text has to be encoded first
b = '中文'.encode('utf-8')
print(b)  # b'\xe4\xb8\xad\xe6\x96\x87'

Encoding and decoding

# Encoding: str → bytes
s = 'Hello 中文'
b_utf8 = s.encode('utf-8')
b_gbk = s.encode('gbk')

print(f"UTF-8: {b_utf8}")  # b'Hello \xe4\xb8\xad\xe6\x96\x87'
print(f"GBK:   {b_gbk}")   # b'Hello \xd6\xd0\xce\xc4'

# Decoding: bytes → str
s1 = b_utf8.decode('utf-8')
s2 = b_gbk.decode('gbk')

print(s1)  # Hello 中文
print(s2)  # Hello 中文

# What not to do: decode with the wrong encoding
# s3 = b_utf8.decode('gbk')  # mojibake or a UnicodeDecodeError
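
Decoding with the wrong codec fails in two different ways, and the quiet one is more dangerous. The sketch below shows both: GBK bytes decoded strictly as UTF-8 raise an error, while UTF-8 bytes decoded as Latin-1 (chosen here only because it accepts every byte value) yield mojibake without any warning. Variable names are illustrative.

```python
text = '中文'

# GBK bytes are not valid UTF-8, so a strict decode fails loudly
gbk_bytes = text.encode('gbk')
try:
    gbk_bytes.decode('utf-8')
    failed = False
except UnicodeDecodeError as exc:
    failed = True
    print(f"decode failed: {exc.reason}")

# Latin-1 maps every byte value to a character, so decoding never fails;
# the result is silent mojibake instead
mojibake = text.encode('utf-8').decode('latin-1')
print(repr(mojibake))

# As long as no bytes were lost, reversing the wrong step restores the text
recovered = mojibake.encode('latin-1').decode('utf-8')
print(recovered)
```

The recovery step only works because Latin-1 is lossless for bytes. Once a decode has used `errors='ignore'` or `errors='replace'`, the original bytes are gone.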

bytearray: a mutable byte sequence

# bytes is immutable
b = b'hello'
# b[0] = ord('H')  # TypeError

# bytearray is mutable
ba = bytearray(b'hello')
ba[0] = ord('H')
print(ba)  # bytearray(b'Hello')

# Typical use: building up binary data
data = bytearray()
data.extend(b'\x89PNG')  # start of the PNG file signature
data.extend(b'\r\n\x1a\n')
print(data)  # bytearray(b'\x89PNG\r\n\x1a\n')
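
One detail that often surprises people coming from `str`: indexing a `bytes` or `bytearray` object gives an integer, not a one-byte string. A short sketch of that behavior, with illustrative variable names:

```python
data = b'hello'

first = data[0]        # indexing yields an int
head = data[:1]        # slicing yields bytes
print(first, head)

# Iterating also yields ints
values = list(data)

# bytes.hex() and bytes.fromhex() convert to and from a readable hex string
hex_text = data.hex()
round_trip = bytes.fromhex(hex_text)

# A bytearray converts back to immutable bytes once it is complete
buffer = bytearray(b'hello')
buffer[0] = ord('H')
frozen = bytes(buffer)
print(frozen)
```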

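⚠️ Common encoding problems and how to fix them

Problem 1: reading a file raises an error

# ❌ Wrong: relying on the default encoding
with open('data.txt', 'r') as f:
    content = f.read()
# UnicodeDecodeError: 'utf-8' codec can't decode...

# ✅ Fix 1: specify the correct encoding
with open('data.txt', 'r', encoding='gbk') as f:
    content = f.read()

# ✅ Fix 2: ignore errors (undecodable bytes are silently dropped)
with open('data.txt', 'r', encoding='utf-8', errors='ignore') as f:
    content = f.read()

# ✅ Fix 3: replace bad characters
with open('data.txt', 'r', encoding='utf-8', errors='replace') as f:
    content = f.read()  # undecodable bytes are replaced with �

# ✅ Fix 4: read as binary, then decode by hand
with open('data.txt', 'rb') as f:
    raw = f.read()
    
# Try several encodings in turn; a decode that succeeds is still only a guess
for encoding in ['utf-8', 'gbk', 'gb18030', 'big5']:
    try:
        content = raw.decode(encoding)
        print(f"Decoded successfully with: {encoding}")
        break
    except UnicodeDecodeError:
        continue
else:
    raise ValueError("none of the candidate encodings could decode the file")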
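
The `errors=` argument works the same way on `bytes.decode()`, which makes it easy to see what each handler actually does to bad input. A small sketch, using a made-up byte string that is valid Latin-1 but not valid UTF-8:

```python
raw = b'caf\xe9 ok'   # 'café ok' encoded as Latin-1, not UTF-8

results = {}
for handler in ('ignore', 'replace', 'backslashreplace'):
    results[handler] = raw.decode('utf-8', errors=handler)
    print(f"{handler:>16}: {results[handler]!r}")

try:
    raw.decode('utf-8')          # errors='strict' is the default
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
print("strict raised:", strict_failed)
```

`ignore` loses data without a trace, `replace` at least leaves a visible marker, and `backslashreplace` keeps the original byte value readable for debugging.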

Problem 2: detecting a file's encoding

def detect_encoding(file_path, sample_size=1024):
    """Guess a file's encoding from its first sample_size bytes"""
    import chardet  # third-party: pip install chardet
    
    with open(file_path, 'rb') as f:
        raw = f.read(sample_size)
    
    result = chardet.detect(raw)
    return result['encoding']  # may be None if chardet cannot decide

# Usage
# encoding = detect_encoding('unknown_file.txt')
# with open('unknown_file.txt', 'r', encoding=encoding) as f:
#     content = f.read()
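
Before reaching for statistical detection, it can be worth checking for a byte order mark, which identifies the encoding with certainty when present. The `sniff_bom` helper below is a hypothetical standard-library sketch, not part of chardet. It returns `None` when there is no BOM, in which case a detector or a known default is still needed.

```python
import codecs

# Checked in this order because the UTF-32-LE BOM begins with the UTF-16-LE BOM
_BOMS = [
    (codecs.BOM_UTF8, 'utf-8-sig'),
    (codecs.BOM_UTF32_LE, 'utf-32'),
    (codecs.BOM_UTF32_BE, 'utf-32'),
    (codecs.BOM_UTF16_LE, 'utf-16'),
    (codecs.BOM_UTF16_BE, 'utf-16'),
]

def sniff_bom(raw: bytes):
    """Return a codec name if raw starts with a known BOM, else None."""
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name
    return None

sample = '中文'.encode('utf-8-sig')
encoding = sniff_bom(sample)
print(encoding, sample.decode(encoding))
```

The returned codec names (`utf-8-sig`, `utf-16`, `utf-32`) all consume the BOM while decoding, so the mark does not leak into the resulting text.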

Problem 3: garbled text from network requests

import requests

# ❌ Risky: using response.text as-is
response = requests.get('http://example.com')
text = response.text  # may be garbled if the declared or assumed charset is wrong

# ✅ Better: let requests guess the encoding from the body, then read .text
response = requests.get('http://example.com')
response.encoding = response.apparent_encoding  # auto-detected
text = response.text

# Or handle the bytes yourself
response = requests.get('http://example.com')
content = response.content  # bytes
text = content.decode('gbk')  # assuming the page is GBK
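
When decoding the body by hand, the charset the server declares in its `Content-Type` header is usually the first thing to check. The sketch below parses that header with the standard library's `email.message.Message`. The `charset_from_content_type` and `decode_body` helpers and the UTF-8 fallback are assumptions for illustration, not part of requests.

```python
from email.message import Message

def charset_from_content_type(header_value: str):
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg['Content-Type'] = header_value
    return msg.get_content_charset()   # lowercased name, or None if absent

def decode_body(body: bytes, header_value: str, fallback: str = 'utf-8') -> str:
    """Decode body with the declared charset, or the fallback if none is given."""
    charset = charset_from_content_type(header_value) or fallback
    return body.decode(charset, errors='replace')

body = '中文网页'.encode('gbk')
print(decode_body(body, 'text/html; charset=GBK'))
```

A declared charset that Python does not recognize would raise `LookupError` here; real code would need to catch that and fall back as well.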

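Problem 4: garbled text in the database

import sqlite3

# SQLite stores TEXT as UTF-8 by default; text_factory only controls
# how TEXT values are handed back to Python
conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE users (name TEXT)")
conn.execute("INSERT INTO users VALUES (?)", ("张三",))

conn.text_factory = str  # the default: return str rather than bytes

# Or keep the raw bytes and decode them yourself
conn.text_factory = bytes
cursor = conn.execute("SELECT name FROM users")
for row in cursor:
    name = row[0].decode('utf-8')
    print(name)  # 张三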

📝 Regular expressions, a step further

Compiling regular expressions

import re

# Compile the pattern once for reuse (re also caches recently used
# patterns internally, so the speedup is usually modest)
phone_pattern = re.compile(r'\d{3}-\d{4}-\d{4}')

# Use the compiled pattern
text = "Office: 010-1234-5678, mobile: 139-1234-5678"
phones = phone_pattern.findall(text)
print(phones)  # ['010-1234-5678', '139-1234-5678']

# Performance comparison
import timeit

pattern_str = r'\d{3}-\d{4}-\d{4}'
pattern_compiled = re.compile(pattern_str)
text = "010-1234-5678" * 100

time_str = timeit.timeit(
    lambda: re.findall(pattern_str, text),
    number=10000
)

time_compiled = timeit.timeit(
    lambda: pattern_compiled.findall(text),
    number=10000
)

print(f"Uncompiled: {time_str:.4f}s")
print(f"Compiled:   {time_compiled:.4f}s")
print(f"Speedup: {(time_str/time_compiled - 1)*100:.1f}%")

Groups and named groups

import re

# Plain groups
date_str = "2024-01-15"
match = re.search(r'(\d{4})-(\d{2})-(\d{2})', date_str)
if match:
    year, month, day = match.groups()
    print(f"Year: {year}, month: {month}, day: {day}")

# Named groups
match = re.search(r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})', date_str)
if match:
    print(f"Year: {match.group('year')}")
    print(f"Month: {match.group('month')}")
    print(f"Day: {match.group('day')}")

# Non-capturing groups
text = "hello world"
# (?:...) groups without capturing
match = re.search(r'(hello)(?:\s+)(world)', text)
print(match.groups())  # ('hello', 'world') - only two groups

# Lookahead and lookbehind
# (?=...) positive lookahead
# (?!...) negative lookahead
# (?<=...) positive lookbehind
# (?<!...) negative lookbehind

# Example: match the username that follows an @ sign
text = "users @alice and @bob"
mentions = re.findall(r'(?<=@)\w+', text)
print(mentions)  # ['alice', 'bob']

# Match letters that are not preceded by a digit
text = "a1b2c3d"
letters = re.findall(r'(?<!\d)[a-z]', text)
print(letters)  # ['a'] - b, c, and d each follow a digit
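
The two lookahead forms are listed above but not demonstrated, so here is a small sketch with made-up sample text. It also shows a common trap: without the `\b` anchors, a negative lookahead lets the engine backtrack and match part of a number.

```python
import re

text = "10 USD, 20 EUR, 30 USD"

# Positive lookahead: numbers that are followed by " USD"
usd_amounts = re.findall(r'\d+(?= USD)', text)

# Negative lookahead: whole numbers that are NOT followed by " USD".
# The \b anchors stop the engine from backtracking to a partial number.
other_amounts = re.findall(r'\b\d+\b(?! USD)', text)

print(usd_amounts)
print(other_amounts)
```

In both cases the lookahead text itself is not part of the match, so only the digits are returned.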

Common regex patterns

import re

# 1. Email (a loose match, not full validation)
email_pattern = r'[\w\.-]+@[\w\.-]+\.\w+'
emails = re.findall(email_pattern, "Contact: test@example.com")

# 2. Mobile number (China)
phone_pattern = r'1[3-9]\d{9}'
phones = re.findall(phone_pattern, "Mobile: 13912345678")

# 3. URL
url_pattern = r'https?://[\w\.-]+(?:/[\w\.-]*)*'
urls = re.findall(url_pattern, "Visit https://example.com/path")

# 4. IP address (checks the shape only, not that each part is 0-255)
ip_pattern = r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}'
ips = re.findall(ip_pattern, "IP: 192.168.1.1")

# 5. HTML tags
tag_pattern = r'<(\w+)[^>]*>.*?</\1>'
# Matches a paired opening and closing tag; \1 refers back to the tag name.
# With findall, only the captured tag name is returned.

# 6. Chinese characters (the basic CJK Unified Ideographs block)
chinese_pattern = r'[\u4e00-\u9fff]+'
chinese = re.findall(chinese_pattern, "Hello 世界 World")

# 7. Non-greedy matching
text = "<div>content</div><div>more</div>"
# Greedy
greedy = re.findall(r'<div>.*</div>', text)
print(greedy)  # ['<div>content</div><div>more</div>']

# Non-greedy
non_greedy = re.findall(r'<div>.*?</div>', text)
print(non_greedy)  # ['<div>content</div>', '<div>more</div>']
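
The IP pattern above happily accepts something like `999.1.1.1`. One way to tighten it, sketched here under the assumption that only IPv4 is needed, is to keep the regex as a cheap candidate finder and let the standard `ipaddress` module do the real validation. The `extract_ipv4` name is hypothetical.

```python
import ipaddress
import re

ip_candidate = re.compile(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}')

def extract_ipv4(text: str):
    """Return only the regex candidates that parse as real IPv4 addresses."""
    valid = []
    for candidate in ip_candidate.findall(text):
        try:
            ipaddress.IPv4Address(candidate)
        except ValueError:
            continue   # e.g. an octet above 255
        valid.append(candidate)
    return valid

print(extract_ipv4("ok: 192.168.1.1, bad: 999.1.1.1, ok: 10.0.0.255"))
```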

Commonly used re functions

import re

text = "Hello 123 World 456"

# search: find the first match
match = re.search(r'\d+', text)
if match:
    print(match.group())  # '123'

# findall: find all matches
numbers = re.findall(r'\d+', text)
print(numbers)  # ['123', '456']

# finditer: return an iterator of match objects
for match in re.finditer(r'\d+', text):
    print(f"Position {match.start()}: {match.group()}")

# sub: replace
new_text = re.sub(r'\d+', '[number]', text)
print(new_text)  # 'Hello [number] World [number]'

# sub with a function
def double(match):
    return str(int(match.group()) * 2)

new_text = re.sub(r'\d+', double, text)
print(new_text)  # 'Hello 246 World 912'

# split: split on a pattern
parts = re.split(r'\s+', "a  b   c")
print(parts)  # ['a', 'b', 'c']

# match: match at the start of the string
match = re.match(r'Hello', text)
if match:
    print("Matched at the start")

# fullmatch: the whole string must match
if re.fullmatch(r'Hello \d+ World \d+', text):
    print("Full match")
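
Named groups and `sub` combine well: the replacement string can refer back to a group by name with `\g<name>`. A brief sketch with made-up sample text:

```python
import re

date_pattern = re.compile(r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})')

text = "Released 2024-01-15, patched 2024-02-03"

# \g<name> in the replacement refers back to a named group
reordered = date_pattern.sub(r'\g<day>/\g<month>/\g<year>', text)
print(reordered)

# groupdict() turns each match into a plain dict
records = [m.groupdict() for m in date_pattern.finditer(text)]
print(records)
```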

🔧 Hands-on example: a text processing toolkit

import re
from collections import Counter

class TextProcessor:
    """A small collection of text processing helpers"""
    
    def __init__(self, text):
        self.text = text
    
    def clean_text(self):
        """Clean the text: unify line endings and collapse extra whitespace"""
        # Unify line endings
        text = self.text.replace('\r\n', '\n')
        # Collapse extra whitespace
        text = re.sub(r'[ \t]+', ' ', text)
        text = re.sub(r'\n+', '\n', text)
        return text.strip()
    
    def extract_emails(self):
        """Extract email addresses"""
        pattern = r'[\w\.-]+@[\w\.-]+\.\w+'
        return re.findall(pattern, self.text)
    
    def extract_urls(self):
        """Extract URLs"""
        pattern = r'https?://[\w\.-]+(?:/[\w\./-]*)?(?:\?[\w=&]*)?'
        return re.findall(pattern, self.text)
    
    def extract_phone_numbers(self):
        """Extract Chinese mobile numbers"""
        # The lookarounds avoid matching inside a longer run of digits
        pattern = r'(?<!\d)1[3-9]\d{9}(?!\d)'
        return re.findall(pattern, self.text)
    
    def word_frequency(self, top_n=10):
        """Word frequency count"""
        # Simple English tokenization
        words = re.findall(r'\b[a-zA-Z]+\b', self.text.lower())
        counter = Counter(words)
        return counter.most_common(top_n)
    
    def chinese_frequency(self, top_n=10):
        """Chinese character frequency count"""
        chinese = re.findall(r'[\u4e00-\u9fff]', self.text)
        counter = Counter(chinese)
        return counter.most_common(top_n)
    
    def remove_html_tags(self):
        """Remove HTML tags"""
        clean = re.sub(r'<[^>]+>', '', self.text)
        return clean
    
    def normalize_whitespace(self):
        """Normalize whitespace"""
        return ' '.join(self.text.split())

# Usage example
sample_text = """
<div>
    <h1>Contact us</h1>
    <p>Email: contact@example.com</p>
    <p>Phone: 13912345678</p>
    <p>Website: https://www.example.com/contact</p>
</div>
"""

processor = TextProcessor(sample_text)
print("Emails:", processor.extract_emails())
print("Phones:", processor.extract_phone_numbers())
print("URLs:", processor.extract_urls())
print("Without HTML:", processor.remove_html_tags())

📊 An encoding detection and conversion tool

import chardet  # third-party: pip install chardet
from pathlib import Path

class EncodingConverter:
    """Encoding conversion utilities"""
    
    @staticmethod
    def detect(file_path, sample_size=10000):
        """Guess a file's encoding from its first sample_size bytes"""
        with open(file_path, 'rb') as f:
            raw = f.read(sample_size)
        result = chardet.detect(raw)
        return result
    
    @staticmethod
    def convert(input_path, output_path, 
                from_encoding=None, to_encoding='utf-8'):
        """Convert a file from one encoding to another"""
        # Auto-detect the source encoding
        if from_encoding is None:
            result = EncodingConverter.detect(input_path)
            from_encoding = result['encoding']
            confidence = result['confidence']
            if from_encoding is None:
                # chardet could not make a guess (for example, an empty file)
                raise ValueError(f"Could not detect the encoding of {input_path}")
            print(f"Detected encoding: {from_encoding} (confidence: {confidence:.2%})")
        
        # Read and convert
        with open(input_path, 'r', encoding=from_encoding, errors='replace') as f:
            content = f.read()
        
        with open(output_path, 'w', encoding=to_encoding) as f:
            f.write(content)
        
        print(f"Converted: {input_path} → {output_path}")
    
    @staticmethod
    def batch_convert(directory, to_encoding='utf-8', 
                      extensions=('.txt', '.csv', '.md')):
        """Convert every matching file under a directory"""
        path = Path(directory)
        # Snapshot the listing first so files written below are not revisited
        for file_path in list(path.rglob('*')):
            if not file_path.is_file() or file_path.suffix.lower() not in extensions:
                continue
            if file_path.stem.endswith('.utf8'):
                continue  # already a converted copy from an earlier run
            output_path = file_path.with_suffix('.utf8' + file_path.suffix)
            EncodingConverter.convert(file_path, output_path, to_encoding=to_encoding)

# Usage example
# result = EncodingConverter.detect('unknown_file.txt')
# print(f"Encoding: {result['encoding']}, confidence: {result['confidence']}")

# EncodingConverter.convert('gbk_file.txt', 'utf8_file.txt', 'gbk', 'utf-8')
# EncodingConverter.batch_convert('./data')
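
When the source encoding is already known, the conversion itself needs nothing beyond the standard library. The sketch below is a self-contained variant that runs in a temporary directory. The `convert_file` name is hypothetical, and it decodes strictly on purpose so that a wrong `from_encoding` fails instead of quietly writing damaged text.

```python
import tempfile
from pathlib import Path

def convert_file(src: Path, dst: Path, from_encoding: str,
                 to_encoding: str = 'utf-8') -> None:
    """Re-encode a text file; strict decoding so a wrong guess fails loudly."""
    text = src.read_text(encoding=from_encoding)
    dst.write_text(text, encoding=to_encoding)

with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / 'gbk_file.txt'
    dst = Path(tmp) / 'utf8_file.txt'
    src.write_bytes('你好，世界'.encode('gbk'))   # create a GBK-encoded sample

    convert_file(src, dst, from_encoding='gbk')

    converted_bytes = dst.read_bytes()
    converted_text = converted_bytes.decode('utf-8')
    print(converted_text)
```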

⚠️ Pitfalls to avoid

Pitfall 1: mixing up str and bytes

# ❌ Wrong: concatenating str and bytes
s = 'Hello'
b = b'World'
# result = s + b  # TypeError

# ✅ Right: convert to a single type first
result = s + b.decode('utf-8')  # str + str
# or
result = s.encode('utf-8') + b  # bytes + bytes

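Pitfall 2: writing and reading a file with different encodings

# ❌ Wrong: the write and read encodings do not match
with open('test.txt', 'w', encoding='utf-8') as f:
    f.write('中文')

with open('test.txt', 'r', encoding='gbk') as f:
    content = f.read()  # garbled text, or a UnicodeDecodeError

# ✅ Right: use the same encoding on both sides
with open('test.txt', 'r', encoding='utf-8') as f:
    content = f.read()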

Pitfall 3: encoding problems in regular expressions

import re

# ❌ Wrong: using a str pattern on bytes
data = b'Hello World'
# pattern = re.compile(r'hello', re.I)  # str pattern
# match = pattern.search(data)  # TypeError

# ✅ Right: use a bytes pattern
pattern = re.compile(b'hello', re.I)
match = pattern.search(data)

# Or decode first
text = data.decode('utf-8')
pattern = re.compile(r'hello', re.I)
match = pattern.search(text)
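
Choosing between a bytes pattern and a str pattern is not only about avoiding the `TypeError`. Character classes behave differently: in a bytes pattern `\w` covers ASCII only, while in a str pattern it covers Unicode word characters. A short sketch with made-up input:

```python
import re

raw = '中文 abc_123'.encode('utf-8')

# bytes pattern: \w matches only ASCII letters, digits, and underscore
bytes_words = re.findall(rb'\w+', raw)

# str pattern: \w matches Unicode word characters
str_words = re.findall(r'\w+', raw.decode('utf-8'))

print(bytes_words)
print(str_words)
```

For text that may contain non-ASCII characters, decoding first and matching on `str` is usually the safer route.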

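Pitfall 4: Unicode normalization

# The same text can have more than one Unicode representation
s1 = 'caf\u00e9'   # the single precomposed character 'é'
s2 = 'cafe\u0301'  # 'e' followed by a combining acute accent

print(s1 == s2)  # False, even though they look identical
print(s1, s2)    # café café

# ✅ Normalize
import unicodedata

s1_norm = unicodedata.normalize('NFC', s1)
s2_norm = unicodedata.normalize('NFC', s2)

print(s1_norm == s2_norm)  # True

# Normalization forms
# NFC: canonical composition (recommended)
# NFD: canonical decomposition
# NFKC: compatibility composition
# NFKD: compatibility decomposition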
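
To see what the four forms actually change, the sketch below compares lengths before and after normalization and shows how the compatibility forms fold look-alike characters. The `same_text` helper is a hypothetical example of a normalization-aware comparison, not a standard function.

```python
import unicodedata

composed = 'caf\u00e9'        # é as one code point
decomposed = 'cafe\u0301'     # e + combining acute accent

lengths = (len(composed), len(decomposed))
nfd_matches = unicodedata.normalize('NFD', composed) == decomposed

# NFKC also folds compatibility characters into their plain equivalents
ligature = unicodedata.normalize('NFKC', '\ufb01le')       # 'fi' ligature + 'le'
fullwidth = unicodedata.normalize('NFKC', '\uff21\uff22')  # fullwidth 'AB'

def same_text(a: str, b: str) -> bool:
    """Compare two strings after NFC normalization and case folding."""
    def norm(s: str) -> str:
        return unicodedata.normalize('NFC', s).casefold()
    return norm(a) == norm(b)

print(lengths, nfd_matches, ligature, fullwidth)
print(same_text('CAF\u00c9', decomposed))
```

NFKC and NFKD discard distinctions such as ligatures and fullwidth forms, so they suit searching and comparison better than storing text you need to reproduce exactly.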

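🎯 Lecture summary

In this lecture we covered:

Topic	Key point
Unicode	A character set that assigns each character a unique number
UTF-8	Variable-length encoding, the internet standard, compatible with ASCII
str vs bytes	`str` is a Unicode string, `bytes` is a byte sequence
Encoding/decoding	`str.encode()` → bytes, `bytes.decode()` → str
Common encodings	UTF-8 (universal), GBK (Chinese), Latin-1 (Western European)
Regular expressions	Compilation, groups, named groups, lookahead and lookbehind
Encoding detection	The chardet library guesses a file's encoding
Unicode normalization	NFC/NFD reconcile equivalent representations

Remember this:

The essence of encoding: str is for people to read, bytes is for machines to store and transmit, and encode/decode is the bridge between the two.

Learning path: complete beginner → Python Crash Course → Fluent Python → this course → CPython Internals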
