Hi everyone, I'm 程序员晚枫 (Programmer Wanfeng), currently building all kinds of AI projects hands-on.
🎬 Opening: The Case of the Garbled Text

Have you ever hit one of these maddening moments?
```python
# Scenario 1: reading a file blows up with UnicodeDecodeError
with open('data.txt', 'r') as f:
    content = f.read()

# Scenario 2: a web response comes back as mojibake
response = requests.get(url)
text = response.text

# Scenario 3: Chinese text turns to garbage in the database
name = "张三"
cursor.execute("INSERT INTO users VALUES (?)", (name,))
```
All of these problems trace back to one core concept: character encoding.
Today we'll get to the bottom of text vs. bytes in Python, so the mojibake nightmare never haunts you again!
🔤 Unicode: The Essence of Character Encoding

What is Unicode?

Unicode is a character set: it assigns every character in the world a unique number, called a code point.
```python
# Character → code point
print(ord('A'))    # 65
print(ord('中'))   # 20013
print(ord('😀'))   # 128512

# Code point → character
print(chr(65))      # A
print(chr(20013))   # 中
print(chr(128512))  # 😀

# Code-point escape sequences
print('\u4e2d')      # 中 (4 hex digits)
print('\U0001F600')  # 😀 (8 hex digits)
```
How Unicode gets stored

Unicode only defines the mapping between characters and numbers; how those numbers are stored in memory or on disk is the job of an encoding:
| Encoding | Characteristics | Bytes per character |
| --- | --- | --- |
| UTF-8 | Most popular; variable length | 1-4 bytes |
| UTF-16 | Used internally by Windows | 2 or 4 bytes |
| UTF-32 | Fixed length | 4 bytes |
| GBK | Chinese-specific | 1-2 bytes |
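The byte counts in the table are easy to verify yourself. A minimal sketch using Python's built-in codecs (the `-le` variants are chosen so the BOM doesn't inflate the count):

```python
# Encode the same character under each scheme and compare sizes.
char = '中'
for codec in ('utf-8', 'utf-16-le', 'utf-32-le', 'gbk'):
    encoded = char.encode(codec)
    print(f"{codec:>9}: {len(encoded)} bytes -> {encoded}")
# utf-8: 3 bytes, utf-16-le: 2 bytes, utf-32-le: 4 bytes, gbk: 2 bytes
```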
UTF-8: the standard encoding of the internet

UTF-8 is a variable-length encoding that is backward-compatible with ASCII:
- ASCII characters (0-127): 1 byte
- European characters: 2 bytes
- Common CJK characters: 3 bytes
- Emoji: 4 bytes
```python
# Byte representations under UTF-8
print('A'.encode('utf-8'))    # b'A'
print('中'.encode('utf-8'))   # b'\xe4\xb8\xad'
print('😀'.encode('utf-8'))   # b'\xf0\x9f\x98\x80'

# Byte lengths
print(len('A'.encode('utf-8')))      # 1
print(len('中'.encode('utf-8')))     # 3
print(len('Hello'.encode('utf-8')))  # 5
print(len('你好'.encode('utf-8')))   # 6
```
🔄 str vs. bytes: Python's Two String Types

The core distinction

Python 3 strictly separates two types:
| Type | Meaning | Example |
| --- | --- | --- |
| `str` | Unicode string (human-readable) | `'Hello 中文'` |
| `bytes` | byte sequence (machine-readable) | `b'Hello'` |
```python
# str: a sequence of Unicode characters
s = 'Hello 中文'
print(type(s))  # <class 'str'>
print(s)

# bytes: a sequence of raw bytes
b = b'Hello'
print(type(b))  # <class 'bytes'>
print(b)

# str → bytes via encode()
b = '中文'.encode('utf-8')
print(b)  # b'\xe4\xb8\xad\xe6\x96\x87'
```
Encoding and decoding

```python
s = 'Hello 中文'

# Encoding: str → bytes
b_utf8 = s.encode('utf-8')
b_gbk = s.encode('gbk')
print(f"UTF-8: {b_utf8}")
print(f"GBK:   {b_gbk}")

# Decoding: bytes → str (must use the matching encoding!)
s1 = b_utf8.decode('utf-8')
s2 = b_gbk.decode('gbk')
print(s1)  # Hello 中文
print(s2)  # Hello 中文
```
bytearray: a mutable byte sequence

```python
# bytes is immutable; bytearray is its mutable counterpart
b = b'hello'
ba = bytearray(b'hello')
ba[0] = ord('H')  # in-place modification works on bytearray
print(ba)         # bytearray(b'Hello')

# Useful for building binary data incrementally
data = bytearray()
data.extend(b'\x89PNG')
data.extend(b'\r\n\x1a\n')
print(data)
```
⚠️ Common Encoding Problems and Their Fixes

Problem 1: reading a file raises an error

```python
# Problem: relies on the platform default encoding (may raise UnicodeDecodeError)
with open('data.txt', 'r') as f:
    content = f.read()

# Fix 1: specify the correct encoding explicitly
with open('data.txt', 'r', encoding='gbk') as f:
    content = f.read()

# Fix 2: skip undecodable bytes
with open('data.txt', 'r', encoding='utf-8', errors='ignore') as f:
    content = f.read()

# Fix 3: replace undecodable bytes with U+FFFD (�)
with open('data.txt', 'r', encoding='utf-8', errors='replace') as f:
    content = f.read()

# Fix 4: read raw bytes, then try candidate encodings in turn
with open('data.txt', 'rb') as f:
    raw = f.read()

for encoding in ['utf-8', 'gbk', 'gb18030', 'big5']:
    try:
        content = raw.decode(encoding)
        print(f"Decoded successfully, encoding is: {encoding}")
        break
    except UnicodeDecodeError:
        continue
```
Problem 2: detecting a file's encoding

```python
def detect_encoding(file_path, sample_size=1024):
    """Detect a file's encoding (requires the third-party chardet package)."""
    import chardet

    with open(file_path, 'rb') as f:
        raw = f.read(sample_size)
    result = chardet.detect(raw)
    return result['encoding']
```
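If installing chardet isn't an option, a byte-order-mark (BOM) check covers the easy cases with the standard library alone. A minimal sketch (the helper name `sniff_bom` is my own; it only detects files that actually begin with a BOM, so it complements chardet rather than replacing it):

```python
def sniff_bom(file_path):
    """Return an encoding name if the file starts with a known BOM, else None."""
    with open(file_path, 'rb') as f:
        head = f.read(4)
    # Check the 4-byte UTF-32 BOMs before UTF-16,
    # because the UTF-32-LE BOM also starts with FF FE.
    if head.startswith(b'\xef\xbb\xbf'):
        return 'utf-8-sig'
    if head.startswith(b'\xff\xfe\x00\x00'):
        return 'utf-32-le'
    if head.startswith(b'\x00\x00\xfe\xff'):
        return 'utf-32-be'
    if head.startswith(b'\xff\xfe'):
        return 'utf-16-le'
    if head.startswith(b'\xfe\xff'):
        return 'utf-16-be'
    return None
```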
Problem 3: mojibake in web responses

```python
import requests

# Problem: requests guesses the encoding from HTTP headers and may guess wrong
response = requests.get('http://example.com')
text = response.text  # may be mojibake

# Fix 1: let requests sniff the encoding from the response body
response = requests.get('http://example.com')
response.encoding = response.apparent_encoding
text = response.text

# Fix 2: take the raw bytes and decode them yourself
response = requests.get('http://example.com')
content = response.content          # bytes
text = content.decode('gbk')        # if you know the site uses GBK
```
Problem 4: mojibake in database storage

```python
import sqlite3

conn = sqlite3.connect(':memory:')

# text_factory controls how TEXT values are returned
conn.text_factory = str    # default: return str (recommended)
conn.text_factory = bytes  # return raw bytes instead

cursor = conn.execute("SELECT name FROM users")
for row in cursor:
    name = row[0].decode('utf-8')  # needed when text_factory = bytes
    print(name)
```
📝 Regular Expressions, Beyond the Basics

Compiling patterns

```python
import re

# Compile once, reuse many times
phone_pattern = re.compile(r'\d{3}-\d{4}-\d{4}')

text = "电话:010-1234-5678,手机:139-1234-5678"
phones = phone_pattern.findall(text)
print(phones)  # ['010-1234-5678', '139-1234-5678']

# Benchmark: compiled vs. uncompiled
import timeit

pattern_str = r'\d{3}-\d{4}-\d{4}'
pattern_compiled = re.compile(pattern_str)
text = "010-1234-5678" * 100

time_str = timeit.timeit(
    lambda: re.findall(pattern_str, text),
    number=10000
)
time_compiled = timeit.timeit(
    lambda: pattern_compiled.findall(text),
    number=10000
)

print(f"Uncompiled: {time_str:.4f}s")
print(f"Compiled:   {time_compiled:.4f}s")
print(f"Speedup:    {(time_str/time_compiled - 1)*100:.1f}%")
```
Grouping and named groups

```python
import re

# Positional groups
date_str = "2024-01-15"
match = re.search(r'(\d{4})-(\d{2})-(\d{2})', date_str)
if match:
    year, month, day = match.groups()
    print(f"Year: {year}, Month: {month}, Day: {day}")

# Named groups
match = re.search(r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})', date_str)
if match:
    print(f"Year:  {match.group('year')}")
    print(f"Month: {match.group('month')}")
    print(f"Day:   {match.group('day')}")

# Non-capturing group (?:...)
text = "hello world"
match = re.search(r'(hello)(?:\s+)(world)', text)
print(match.groups())  # ('hello', 'world')

# Lookbehind: words preceded by @
text = "用户 @alice 和 @bob"
mentions = re.findall(r'(?<=@)\w+', text)
print(mentions)  # ['alice', 'bob']

# Negative lookbehind: letters not preceded by a digit
text = "a1b2c3d"
letters = re.findall(r'(?<!\d)[a-z]', text)
print(letters)  # ['a']
```
Common regex patterns

```python
import re

# Email addresses (simplified)
email_pattern = r'[\w\.-]+@[\w\.-]+\.\w+'
emails = re.findall(email_pattern, "联系: test@example.com")

# Chinese mobile numbers
phone_pattern = r'1[3-9]\d{9}'
phones = re.findall(phone_pattern, "手机: 13912345678")

# URLs
url_pattern = r'https?://[\w\.-]+(?:/[\w\.-]*)*'
urls = re.findall(url_pattern, "访问 https://example.com/path")

# IPv4 addresses (loose: does not validate the 0-255 range)
ip_pattern = r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}'
ips = re.findall(ip_pattern, "IP: 192.168.1.1")

# Matching HTML tag pairs with a backreference
tag_pattern = r'<(\w+)[^>]*>.*?</\1>'

# Chinese characters
chinese_pattern = r'[\u4e00-\u9fff]+'
chinese = re.findall(chinese_pattern, "Hello 世界 World")

# Greedy vs. non-greedy
text = "<div>content</div><div>more</div>"
greedy = re.findall(r'<div>.*</div>', text)
print(greedy)      # ['<div>content</div><div>more</div>']
non_greedy = re.findall(r'<div>.*?</div>', text)
print(non_greedy)  # ['<div>content</div>', '<div>more</div>']
```
Essential re functions

```python
import re

text = "Hello 123 World 456"

# search: first match anywhere in the string
match = re.search(r'\d+', text)
if match:
    print(match.group())  # 123

# findall: all matches as a list
numbers = re.findall(r'\d+', text)
print(numbers)  # ['123', '456']

# finditer: iterator of match objects (with positions)
for match in re.finditer(r'\d+', text):
    print(f"Position {match.start()}: {match.group()}")

# sub: replace with a string
new_text = re.sub(r'\d+', '[number]', text)
print(new_text)  # Hello [number] World [number]

# sub: replace with a function
def double(match):
    return str(int(match.group()) * 2)

new_text = re.sub(r'\d+', double, text)
print(new_text)  # Hello 246 World 912

# split: split on a pattern
parts = re.split(r'\s+', "a b c")
print(parts)  # ['a', 'b', 'c']

# match: only matches at the start of the string
match = re.match(r'Hello', text)
if match:
    print("Matched at the start")

# fullmatch: must match the entire string
if re.fullmatch(r'Hello \d+ World \d+', text):
    print("Full match")
```
🔧 Hands-On: A Text-Processing Toolkit

```python
import re
from collections import Counter


class TextProcessor:
    """A small collection of text-processing utilities."""

    def __init__(self, text):
        self.text = text

    def clean_text(self):
        """Clean text: collapse extra whitespace, normalize newlines."""
        text = self.text.replace('\r\n', '\n')
        text = re.sub(r'[ \t]+', ' ', text)
        text = re.sub(r'\n+', '\n', text)
        return text.strip()

    def extract_emails(self):
        """Extract email addresses."""
        pattern = r'[\w\.-]+@[\w\.-]+\.\w+'
        return re.findall(pattern, self.text)

    def extract_urls(self):
        """Extract URLs."""
        pattern = r'https?://[\w\.-]+(?:/[\w\./-]*)?(?:\?[\w=&]*)?'
        return re.findall(pattern, self.text)

    def extract_phone_numbers(self):
        """Extract Chinese mobile numbers."""
        pattern = r'1[3-9]\d{9}'
        return re.findall(pattern, self.text)

    def word_frequency(self, top_n=10):
        """Word-frequency statistics (English words)."""
        words = re.findall(r'\b[a-zA-Z]+\b', self.text.lower())
        return Counter(words).most_common(top_n)

    def chinese_frequency(self, top_n=10):
        """Character-frequency statistics (Chinese characters)."""
        chinese = re.findall(r'[\u4e00-\u9fff]', self.text)
        return Counter(chinese).most_common(top_n)

    def remove_html_tags(self):
        """Strip HTML tags."""
        return re.sub(r'<[^>]+>', '', self.text)

    def normalize_whitespace(self):
        """Normalize all whitespace to single spaces."""
        return ' '.join(self.text.split())


sample_text = """
<div>
  <h1>联系我们</h1>
  <p>邮箱: contact@example.com</p>
  <p>电话: 13912345678</p>
  <p>网站: https://www.example.com/contact</p>
</div>
"""

processor = TextProcessor(sample_text)
print("Emails:", processor.extract_emails())
print("Phones:", processor.extract_phone_numbers())
print("URLs:", processor.extract_urls())
print("Stripped HTML:", processor.remove_html_tags())
```
📊 An Encoding Detection and Conversion Tool

```python
import chardet  # third-party: pip install chardet
from pathlib import Path


class EncodingConverter:
    """File-encoding conversion utilities."""

    @staticmethod
    def detect(file_path, sample_size=10000):
        """Detect a file's encoding from a sample of its bytes."""
        with open(file_path, 'rb') as f:
            raw = f.read(sample_size)
        return chardet.detect(raw)

    @staticmethod
    def convert(input_path, output_path, from_encoding=None, to_encoding='utf-8'):
        """Convert a file from one encoding to another."""
        if from_encoding is None:
            result = EncodingConverter.detect(input_path)
            from_encoding = result['encoding']
            confidence = result['confidence']
            print(f"Detected encoding: {from_encoding} (confidence: {confidence:.2%})")

        with open(input_path, 'r', encoding=from_encoding, errors='replace') as f:
            content = f.read()
        with open(output_path, 'w', encoding=to_encoding) as f:
            f.write(content)
        print(f"Converted: {input_path} → {output_path}")

    @staticmethod
    def batch_convert(directory, to_encoding='utf-8',
                      extensions=('.txt', '.csv', '.md')):
        """Convert every matching file under a directory (recursively)."""
        path = Path(directory)
        for file_path in path.rglob('*'):
            if file_path.suffix.lower() in extensions:
                output_path = file_path.with_suffix('.utf8' + file_path.suffix)
                EncodingConverter.convert(file_path, output_path,
                                          to_encoding=to_encoding)
```
⚠️ Pitfalls to Avoid

Pitfall 1: mixing str and bytes

```python
s = 'Hello'
b = b'World'

# TypeError: can't concatenate str and bytes directly
# result = s + b

# Fix: bring both to the same type first
result = s + b.decode('utf-8')   # work in str
result = s.encode('utf-8') + b   # or work in bytes
```
Pitfall 2: writing and reading with different encodings

```python
# Written as UTF-8 ...
with open('test.txt', 'w', encoding='utf-8') as f:
    f.write('中文')

# ... but read back as GBK: mojibake or UnicodeDecodeError
with open('test.txt', 'r', encoding='gbk') as f:
    content = f.read()

# Fix: always read with the encoding the file was written in
with open('test.txt', 'r', encoding='utf-8') as f:
    content = f.read()
```
Pitfall 3: encodings in regular expressions

```python
import re

# bytes data requires a bytes pattern
data = b'Hello World'
pattern = re.compile(b'hello', re.I)
match = pattern.search(data)

# str data requires a str pattern — decode first
text = data.decode('utf-8')
pattern = re.compile(r'hello', re.I)
match = pattern.search(text)
```
Pitfall 4: Unicode normalization

```python
# Two visually identical strings that compare unequal
s1 = 'café'        # é as one precomposed code point (U+00E9)
s2 = 'cafe\u0301'  # e followed by a combining acute accent (U+0301)
print(s1 == s2)  # False
print(s1, s2)    # café café

# Fix: normalize both to the same form before comparing
import unicodedata

s1_norm = unicodedata.normalize('NFC', s1)
s2_norm = unicodedata.normalize('NFC', s2)
print(s1_norm == s2_norm)  # True
```
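Normalization also changes string length, which matters for slicing, length validation, and hashing. A quick sketch with the standard unicodedata module:

```python
import unicodedata

s = 'café'  # precomposed é (U+00E9)
nfc = unicodedata.normalize('NFC', s)  # composed form
nfd = unicodedata.normalize('NFD', s)  # decomposed form: e + combining accent

print(len(nfc))  # 4
print(len(nfd))  # 5 -- same visible text, one extra code point
print(nfc == nfd)           # False
print(nfc.encode('utf-8'))  # b'caf\xc3\xa9'
print(nfd.encode('utf-8'))  # b'cafe\xcc\x81'
```

Normalize user input once at the boundary (NFC is the usual choice), and comparisons everywhere else become simple `==` checks.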
🎯 Summary

In this lesson we covered:
| Topic | Key points |
| --- | --- |
| Unicode | A character set assigning every character a unique code point |
| UTF-8 | Variable-length encoding, internet standard, ASCII-compatible |
| str vs. bytes | `str` is a Unicode string; `bytes` is a byte sequence |
| Encoding/decoding | `str.encode()` → `bytes`; `bytes.decode()` → `str` |
| Common encodings | UTF-8 (universal), GBK (Chinese), Latin-1 (Western Europe) |
| Regular expressions | Compilation, grouping, named groups, lookaround |
| Encoding detection | The chardet library auto-detects file encodings |
| Unicode normalization | NFC/NFD for handling equivalent characters |
Remember this:
The essence of encoding: str is for people to read, bytes is for machines to transmit, and encode/decode is the bridge between them.
📚 Recommended Reading

Python Crash Course (3rd ed.) | Fluent Python (2nd ed.) | CPython Internals
Learning path: beginner → Python Crash Course → Fluent Python → this course → CPython Internals
🎓 Join the Fluent Python Live Reading Group

If you've read this far and want to master the book systematically, you're welcome to join my live guided-reading course.
- Weekly live sessions breaking down the key points chapter by chapter
- A dedicated study group for Q&A and discussion any time
- Trial-run special: ¥499 → ¥299

👉 Sign up for the Fluent Python reading course: https://mp.weixin.qq.com/s/ivHJwn1nNx5ug4TFrapvGg
🔗 Course Navigation

← Previous: Sets and Mappings | Next: Functions as Objects →
💬 Contact Me

Services: AI programming training, corporate training, technical consulting
🎓 AI Programming Bootcamp

Want to learn AI programming systematically? 程序员晚枫's hands-on AI programming course will take you from zero to productive!