Hi everyone, I'm 程序员晚枫, a programmer working hands-on with all kinds of AI projects.
Today's topic is a skill that intimidates beginners but becomes extremely powerful once you learn it: regular expressions (regex).
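A real-world text-processing disaster

Last year a student asked me: "Teacher 晚枫, I need to extract email addresses from 100,000 HTML files. What should I do?"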
Here is the code he wrote:

```python
def extract_emails(text):
    emails = []
    at_pos = text.find('@')
    # ...
    return emails
```
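With a regular expression:

```python
import re

def extract_emails(text):
    # local part, "@", domain, then a dot and a TLD of at least two letters
    # (the original class [A-Z|a-z] also accepted a literal "|")
    pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
    return re.findall(pattern, text)
```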
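You may find regex hard to remember and cryptic. In practice, once you know the 10 most common patterns, you can handle roughly 90% of everyday text-processing tasks.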
This article collects the regex techniques I use most often in data processing, to help you get productive quickly.
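Why learn regular expressions?

Regex vs. the traditional approach

```python
# Traditional approach: scan the text character by character
def extract_numbers(text):
    numbers = []
    current = ''
    for char in text:
        if char.isdigit():
            current += char
        else:
            if current:
                numbers.append(int(current))
                current = ''
    if current:
        numbers.append(int(current))
    return numbers

# Regex approach: one line
import re

text = 'abc123def456'  # sample input so the snippet runs on its own
numbers = [int(n) for n in re.findall(r'\d+', text)]
print(numbers)  # [123, 456]
```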
Regex is the Swiss Army knife of text processing.
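What can regex do?

✅ Data extraction: emails, phone numbers, URLs, prices, and so on
✅ Data validation: format checks, input validation
✅ Data cleaning: removing extra whitespace, punctuation, special characters
✅ Data transformation: formatting, replacing, renaming
✅ Log analysis: extracting key information, locating errors

The re module in Python

Core functions

```python
import re

# match: only matches at the start of the string
result = re.match(r'hello', 'hello world')
print(result.group())  # hello

# search: finds the first match anywhere in the string
result = re.search(r'world', 'hello world')
print(result.group())  # world

# findall: returns all matches as a list
results = re.findall(r'\d+', 'abc123def456')
print(results)  # ['123', '456']

# finditer: yields match objects (useful when you need positions)
for match in re.finditer(r'\d+', 'abc123def456'):
    print(match.group(), match.span())  # 123 (3, 6) then 456 (9, 12)

# sub: replace matches
result = re.sub(r'\d+', 'X', 'abc123def456')
print(result)  # abcXdefX

# split: split on a pattern
parts = re.split(r'[,;\s]+', 'a,b;c d')
print(parts)  # ['a', 'b', 'c', 'd']

# compile: build a reusable pattern object
pattern = re.compile(r'\d+')
results = pattern.findall('abc123def456')
```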
Match object methods

```python
import re

text = "Email: alice@example.com, Phone: 13812345678"
pattern = r'(\w+)@(\w+\.\w+)'
match = re.search(pattern, text)

if match:
    print(match.group())   # alice@example.com (the whole match)
    print(match.group(0))  # alice@example.com (same as group())
    print(match.group(1))  # alice (first group)
    print(match.group(2))  # example.com (second group)
    print(match.groups())  # ('alice', 'example.com')
    print(match.start())   # 7
    print(match.end())     # 24
    print(match.span())    # (7, 24)
```
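Regex syntax cheat sheet

Basic metacharacters

| Symbol | Meaning | Example |
|---|---|---|
| `.` | Any character (except newline) | `a.c` matches "abc", "a1c" |
| `\d` | Digit, `[0-9]` | `\d+` matches "123" |
| `\D` | Non-digit | `\D+` matches "abc" |
| `\w` | Word character, `[a-zA-Z0-9_]` (for `str` patterns in Python 3 it also matches Unicode letters and digits unless `re.ASCII` is set) | `\w+` matches "hello_123" |
| `\W` | Non-word character | `\W+` matches "!@#" |
| `\s` | Whitespace (space, tab, newline) | `\s+` matches " " |
| `\S` | Non-whitespace character | `\S+` matches "hello" |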
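A quick way to get a feel for these classes is to run each one over the same input. The snippet below is a minimal sketch; the sample strings are made up for illustration and are not taken from the examples in this article.

```python
import re

sample = "ID_42: ok!\tdone"  # hypothetical sample string

digits = re.findall(r'\d+', sample)   # runs of digits
words = re.findall(r'\w+', sample)    # runs of word characters
spaces = re.findall(r'\s', sample)    # individual whitespace characters
# '.' matches any character except a newline by default
dot_hits = re.findall(r'a.c', "abc a1c a\nc")

print(digits)    # ['42']
print(words)     # ['ID_42', 'ok', 'done']
print(spaces)    # [' ', '\t']
print(dot_hits)  # ['abc', 'a1c']
```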
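Quantifiers

| Symbol | Meaning | Example |
|---|---|---|
| `*` | 0 or more times | `a*` matches "", "a", "aaa" |
| `+` | 1 or more times | `a+` matches "a", "aaa" |
| `?` | 0 or 1 time | `a?` matches "", "a" |
| `{n}` | Exactly n times | `a{3}` matches "aaa" |
| `{n,}` | At least n times | `a{2,}` matches "aa", "aaa", ... |
| `{n,m}` | Between n and m times | `a{2,4}` matches "aa", "aaa", "aaaa" |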
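To check a quantifier against whole strings, `re.fullmatch` is convenient because it only succeeds when the entire string matches. The candidate list below is an illustrative assumption.

```python
import re

candidates = ["a", "aa", "aaa", "aaaa", "aaaaa"]  # hypothetical test strings

# Keep only the strings that consist of 2 to 4 "a" characters
matched = [s for s in candidates if re.fullmatch(r'a{2,4}', s)]
print(matched)  # ['aa', 'aaa', 'aaaa']

# '*' and '?' can match the empty string, so findall may return empty matches
star_matches = re.findall(r'a*', "baa")
print(star_matches)
```

The empty strings in the second result are a common surprise; switching to `a+` avoids them.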
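Anchors

| Symbol | Meaning | Example |
|---|---|---|
| `^` | Start of string | `^hello` matches "hello" at the start |
| `$` | End of string | `world$` matches "world" at the end |
| `\b` | Word boundary | `\bhello\b` matches the word "hello" |
| `\B` | Non-word boundary | `\Bhello` matches "hello" that does not start at a word boundary |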
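The difference between `\b` and `\B` is easiest to see side by side. This is a small sketch on a made-up string, not one of the article's own examples.

```python
import re

text = "hello helloworld sayhello"  # hypothetical sample string

whole_words = re.findall(r'\bhello\b', text)  # only the standalone word
prefix_hits = re.findall(r'\bhello', text)    # "hello" and the start of "helloworld"
inside_hits = re.findall(r'\Bhello', text)    # only the "hello" inside "sayhello"

starts_with_hello = bool(re.search(r'^hello', text))
ends_with_world = bool(re.search(r'world$', text))

print(len(whole_words), len(prefix_hits), len(inside_hits))  # 1 2 1
print(starts_with_hello, ends_with_world)                    # True False
```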
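Character sets

| Symbol | Meaning | Example |
|---|---|---|
| `[abc]` | Any one of a, b, c | `[aeiou]` matches a vowel |
| `[^abc]` | Any character except a, b, c | `[^0-9]` matches a non-digit |
| `[a-z]` | Any character from a to z | `[A-Za-z]` matches a letter |
| `[0-9]` | Any character from 0 to 9 | Same as `\d` |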
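As a rough illustration of how sets, ranges, and negation behave, the snippet below applies a few of them to a made-up order string (the string and variable names are assumptions for the example).

```python
import re

text = "Order #A-17, qty: 3x"  # hypothetical sample string

vowels = re.findall(r'[aeiou]', text)             # lowercase vowels only
letters = re.findall(r'[A-Za-z]+', text)          # runs of ASCII letters
no_digits = re.sub(r'[0-9]', '', text)            # remove every digit
symbols = re.findall(r'[^A-Za-z0-9\s]', text)     # negated set: punctuation

print(vowels)     # ['e']
print(letters)    # ['Order', 'A', 'qty', 'x']
print(no_digits)  # Order #A-, qty: x
print(symbols)    # ['#', '-', ',', ':']
```

Note that `[aeiou]` is case-sensitive, which is why the capital "O" in "Order" is not reported.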
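Groups

| Symbol | Meaning | Example |
|---|---|---|
| `()` | Capturing group | `(ab)+` matches "ab", "abab" |
| `(?:)` | Non-capturing group | `(?:ab)+` groups without capturing |
| `(?P<name>)` | Named group | `(?P<email>\w+@\w+)` |

The `|` operator means "or" (alternation).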
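Capturing and non-capturing groups match the same text but change what `re.findall` returns, which is a frequent source of confusion. The sketch below shows the difference, plus alternation and a named group; the sample strings and the group names `user` and `host` are invented for the example.

```python
import re

text = "ab abab cat dog"  # hypothetical sample string

# Capturing group: findall returns the group's content (its last repetition)
captured = re.findall(r'(ab)+', text)
# Non-capturing group: findall returns the whole match
whole = re.findall(r'(?:ab)+', text)
# Alternation: match either word
pets = re.findall(r'cat|dog', text)
# Named groups: access the parts by name
m = re.search(r'(?P<user>\w+)@(?P<host>\w+)', "mail: bob@example")

print(captured)  # ['ab', 'ab']
print(whole)     # ['ab', 'abab']
print(pets)      # ['cat', 'dog']
print(m.group('user'), m.group('host'))  # bob example
```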
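10 essential regex patterns

Pattern 1: matching numbers

```python
import re

text = "年龄:25,身高:175cm,体重:70.5kg,温度:-5°C"

# Integers: every run of digits (so 70.5 is split into '70' and '5')
integers = re.findall(r'\d+', text)
print(integers)  # ['25', '175', '70', '5', '5']

# Decimals only
decimals = re.findall(r'\d+\.\d+', text)
print(decimals)  # ['70.5']

# Integers and decimals, with an optional minus sign
numbers = re.findall(r'-?\d+\.?\d*', text)
print(numbers)  # ['25', '175', '70.5', '-5']

# Percentages
text2 = "完成度:85%,进度:99.5%"
percents = re.findall(r'\d+\.?\d*%', text2)
print(percents)  # ['85%', '99.5%']

# Prices with a currency symbol
text3 = "价格:$99.99,原价:¥199.00"
prices = re.findall(r'[$¥]\d+\.?\d*', text3)
print(prices)  # ['$99.99', '¥199.00']
```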
Pattern 2: matching email addresses

```python
import re

# Basic email pattern
pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'

text = """
联系我们:
- 技术支持:support@example.com
- 销售咨询:sales@company.co.uk
- 客服热线:service123@mail.test-site.cn
"""

emails = re.findall(pattern, text)
print(emails)
# ['support@example.com', 'sales@company.co.uk', 'service123@mail.test-site.cn']

def is_valid_email(email):
    """Validate the email format."""
    pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
    return bool(re.match(pattern, email))

print(is_valid_email('test@example.com'))  # True
print(is_valid_email('invalid-email'))     # False
print(is_valid_email('test@.com'))         # False
```
Pattern 3: matching phone numbers (mainland China)

```python
import re

text = """
联系方式:
- 手机:13800138000
- 座机:021-12345678
- 手机:15912345678
- 手机:18600001111
"""

# Mobile numbers: 11 digits, starting with 1, second digit 3-9
phones = re.findall(r'1[3-9]\d{9}', text)
print(phones)  # ['13800138000', '15912345678', '18600001111']

# Mobile numbers with separators (3-4-4 grouping).
# The third digit needs its own \d before the first separator.
text2 = "电话:138-0013-8000 或 159 1234 5678"
phones = re.findall(r'1[3-9]\d[- ]?\d{4}[- ]?\d{4}', text2)
print(phones)  # ['138-0013-8000', '159 1234 5678']

# Landlines: area code, hyphen, 7-8 digits
landlines = re.findall(r'\d{3,4}-\d{7,8}', text)
print(landlines)  # ['021-12345678']

def is_valid_phone(phone):
    """Validate a mobile number."""
    pattern = r'^1[3-9]\d{9}$'
    return bool(re.match(pattern, phone))

print(is_valid_phone('13800138000'))  # True
print(is_valid_phone('12800138000'))  # False
```
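Pattern 4: matching URLs

```python
import re

text = """
网站链接:
- 官网:https://www.example.com
- 论坛:http://bbs.test-site.org/page?id=123
- 文档:https://docs.python.org/3/library/re.html
- 图片:https://cdn.example.com/images/logo.png
"""

# Basic URL match: scheme, then anything that is not whitespace or an unsafe character
urls = re.findall(r'https?://[^\s<>"{}|\\^`[\]]+', text)
for url in urls:
    print(url)

# Extract domains. \s is excluded so a URL without a path
# cannot run on into the next line.
domains = re.findall(r'https?://([^/\s]+)', text)
print(domains)
# ['www.example.com', 'bbs.test-site.org', 'docs.python.org', 'cdn.example.com']

# Extract paths (an empty string means the URL has no path)
paths = re.findall(r'https?://[^/\s]+(/[^\s]*)?', text)
print(paths)
# ['', '/page?id=123', '/3/library/re.html', '/images/logo.png']

def is_valid_url(url):
    """Validate a URL."""
    pattern = r'^https?://[^\s<>"{}|\\^`[\]]+$'
    return bool(re.match(pattern, url))
```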
Pattern 5: extracting content from HTML tags

```python
import re

html = """
<html>
<head><title>我的网站</title></head>
<body>
<h1>欢迎</h1>
<p class="intro">这是一段介绍文字</p>
<a href="https://example.com">链接</a>
<img src="logo.png" alt="Logo">
</body>
</html>
"""

# Extract the title
title = re.search(r'<title>(.*?)</title>', html, re.DOTALL)
if title:
    print(title.group(1))  # 我的网站

# Extract every link as (href, link text)
links = re.findall(r'<a[^>]*href="([^"]*)"[^>]*>(.*?)</a>', html, re.DOTALL)
for url, text in links:
    print(f"{text}: {url}")  # 链接: https://example.com

# Extract images as (src, alt)
images = re.findall(r'<img[^>]*src="([^"]*)"[^>]*alt="([^"]*)"', html)
for src, alt in images:
    print(f"{alt}: {src}")  # Logo: logo.png

def extract_tag_content(html, tag):
    """Extract the content of the given tag."""
    # \b keeps tag='p' from also matching tags such as <pre>
    pattern = rf'<{tag}\b[^>]*>(.*?)</{tag}>'
    return re.findall(pattern, html, re.DOTALL)

print(extract_tag_content(html, 'h1'))  # ['欢迎']
print(extract_tag_content(html, 'p'))   # ['这是一段介绍文字']
```
Pattern 6: checking password strength

```python
import re

def check_password_strength(password):
    """Check password strength."""
    if len(password) < 8:
        return "Weak: fewer than 8 characters"

    has_lower = re.search(r'[a-z]', password)
    has_upper = re.search(r'[A-Z]', password)
    has_digit = re.search(r'\d', password)
    has_special = re.search(r'[!@#$%^&*(),.?":{}|<>]', password)

    if all([has_lower, has_upper, has_digit, has_special]):
        return "Strong: upper and lower case letters, digits, and special characters"
    elif all([has_lower, has_upper, has_digit]):
        return "Medium: upper and lower case letters and digits"
    else:
        return "Weak: missing required character types"

print(check_password_strength("abc"))        # Weak: fewer than 8 characters
print(check_password_strength("hello123"))   # Weak: missing required character types
print(check_password_strength("Hello123"))   # Medium: ...
print(check_password_strength("Hello123!"))  # Strong: ...

def validate_password(password):
    """Strict validation (returns True/False)."""
    # Each lookahead requires one character type; the final class limits the alphabet
    pattern = r'^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[!@#$%^&*])[A-Za-z\d!@#$%^&*]{8,}$'
    return bool(re.match(pattern, password))

print(validate_password("Hello123!"))  # True
print(validate_password("hello123"))   # False
```
Pattern 7: reformatting strings

```python
import re

def camel_to_snake(name):
    """Convert camelCase to snake_case."""
    # Insert "_" before every capital letter except at the start
    result = re.sub(r'(?<!^)(?=[A-Z])', '_', name).lower()
    return result

print(camel_to_snake('myVariableName'))   # my_variable_name
print(camel_to_snake('getHTTPResponse'))  # get_h_t_t_p_response (not ideal)

def camel_to_snake_better(name):
    """Improved version: handles runs of capital letters."""
    result = re.sub(r'([a-z\d])([A-Z])', r'\1_\2', name)
    result = re.sub(r'([A-Z]+)([A-Z][a-z])', r'\1_\2', result)
    return result.lower()

print(camel_to_snake_better('getHTTPResponse'))  # get_http_response

def snake_to_camel(name):
    """Convert snake_case to camelCase."""
    components = name.split('_')
    return components[0] + ''.join(x.title() for x in components[1:])

print(snake_to_camel('my_variable_name'))  # myVariableName

def snake_to_pascal(name):
    """Convert snake_case to PascalCase."""
    return ''.join(x.title() for x in name.split('_'))

print(snake_to_pascal('my_variable_name'))  # MyVariableName

def format_phone(phone):
    """Format a mobile number: 13800138000 -> 138-0013-8000"""
    phone = re.sub(r'\D', '', phone)  # drop everything that is not a digit
    if len(phone) == 11:
        return f"{phone[:3]}-{phone[3:7]}-{phone[7:]}"
    return phone

print(format_phone('13800138000'))    # 138-0013-8000
print(format_phone('138-0013-8000'))  # 138-0013-8000
```
Pattern 8: cleaning text

```python
import re

text = " Hello!!! World??? \n\n 这是一个 测试\t文本... "

# Collapse runs of whitespace into a single space
cleaned = re.sub(r'\s+', ' ', text).strip()
print(cleaned)  # Hello!!! World??? 这是一个 测试 文本...

# Remove punctuation
no_punct = re.sub(r'[^\w\s]', '', cleaned)
print(no_punct)  # Hello World 这是一个 测试 文本

# Keep only Chinese characters, ASCII letters, and whitespace
chinese_and_letters = re.sub(r'[^\u4e00-\u9fa5a-zA-Z\s]', '', text)
print(chinese_and_letters)

# Keep only digits
text2 = "价格:¥199.00,数量:100"
numbers_only = re.sub(r'[^\d]', '', text2)
print(numbers_only)  # 19900100

# Strip HTML tags
html_text = "<p>Hello <b>World</b></p>"
no_html = re.sub(r'<[^>]+>', '', html_text)
print(no_html)  # Hello World

# Remove control characters
text3 = "Hello\x00World\x1FTest"
cleaned = re.sub(r'[\x00-\x1F\x7F]', '', text3)
print(cleaned)  # HelloWorldTest

# Normalize curly quotes to straight quotes
# (\u201c \u201d are curly double quotes, \u2018 \u2019 curly single quotes)
text4 = '他说:\u201cHello\u201d,她回答\u2018Hi\u2019'
normalized = re.sub('[\u201c\u201d]', '"', text4)
normalized = re.sub('[\u2018\u2019]', "'", normalized)
print(normalized)  # 他说:"Hello",她回答'Hi'
```
Pattern 9: parsing logs

```python
import re
from collections import Counter

# An Apache/Nginx-style access log line
log_line = '192.168.1.1 - - [15/Jan/2024:10:30:45 +0800] "GET /index.html HTTP/1.1" 200 1234'

# IP, timestamp, method, path, status code, response size
pattern = r'^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) ([^"]+) HTTP/\d\.\d" (\d+) (\d+)'
match = re.match(pattern, log_line)

if match:
    ip = match.group(1)
    timestamp = match.group(2)
    method = match.group(3)
    path = match.group(4)
    status = match.group(5)
    size = match.group(6)

    print(f"IP: {ip}")                # 192.168.1.1
    print(f"Time: {timestamp}")       # 15/Jan/2024:10:30:45 +0800
    print(f"Method: {method}")        # GET
    print(f"Path: {path}")            # /index.html
    print(f"Status code: {status}")   # 200
    print(f"Size: {size}")            # 1234

def analyze_logs(log_file):
    """Analyze a log file."""
    ip_counter = Counter()
    path_counter = Counter()
    status_counter = Counter()

    # Groups: 1 = IP, 2 = method, 3 = path, 4 = status code
    pattern = r'^(\S+) .+? "(\S+) ([^"]+) .+?" (\d+)'

    with open(log_file, 'r') as f:
        for line in f:
            match = re.match(pattern, line)
            if match:
                ip_counter[match.group(1)] += 1
                path_counter[match.group(3)] += 1
                status_counter[match.group(4)] += 1

    return {
        'top_ips': ip_counter.most_common(10),
        'top_paths': path_counter.most_common(10),
        'status_codes': dict(status_counter)
    }
```
Pattern 10: batch renaming

```python
import re
from pathlib import Path

def batch_rename(directory, pattern, replacement):
    """Rename files in bulk."""
    directory = Path(directory)

    # Snapshot the listing first so renames do not disturb the iteration
    for filepath in sorted(directory.iterdir()):
        if filepath.is_file():
            new_name = re.sub(pattern, replacement, filepath.name)
            if new_name != filepath.name:
                new_path = filepath.parent / new_name
                filepath.rename(new_path)
                print(f"Renamed: {filepath.name} -> {new_name}")

# Example 1: add a "2024_" prefix to every .jpg file
batch_rename('./images', r'^(.*)\.jpg$', r'2024_\1.jpg')

# Example 2: IMG_1234.JPG -> img_1234.jpg
batch_rename('./photos', r'IMG_(\d+)\.JPG', r'img_\1.jpg')

# Example 3: replace special characters in file names with "_"
batch_rename('./files', r'[^\w\-.]', r'_')

def batch_replace(directory, file_pattern, old_text, new_text):
    """Replace text in file contents in bulk."""
    directory = Path(directory)

    for filepath in directory.rglob(file_pattern):
        if filepath.is_file():
            content = filepath.read_text(encoding='utf-8')
            new_content = re.sub(old_text, new_text, content)

            if content != new_content:
                filepath.write_text(new_content, encoding='utf-8')
                print(f"Updated: {filepath}")

# Example: replace old_api.call with new_api.run in every .py file
batch_replace('./project', '*.py', r'old_api\.call', r'new_api.run')
```
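Advanced techniques

Greedy vs. non-greedy

```python
import re

text = "<div>内容1</div><div>内容2</div>"

# Greedy (default): matches as much as possible
greedy = re.findall(r'<div>.*</div>', text)
print(greedy)  # ['<div>内容1</div><div>内容2</div>']

# Non-greedy: add ? to match as little as possible
non_greedy = re.findall(r'<div>.*?</div>', text)
print(non_greedy)  # ['<div>内容1</div>', '<div>内容2</div>']

# A practical case
text2 = "价格:$10.99,优惠:$5.50"

# Wrong: spans from the first $ to the last $
greedy = re.findall(r'\$.+\$', text2)
print(greedy)  # ['$10.99,优惠:$']

# Still wrong: non-greedy does not help, the pattern itself requires a closing $
non_greedy = re.findall(r'\$.+?\$', text2)
print(non_greedy)  # ['$10.99,优惠:$']

# Right: describe exactly what a price looks like
correct = re.findall(r'\$[\d.]+', text2)
print(correct)  # ['$10.99', '$5.50']
```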
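Lookahead and lookbehind

```python
import re

text = "hello123world456"

# Positive lookbehind (?<=...): digits preceded by a letter
result = re.findall(r'(?<=[a-z])\d+', text)
print(result)  # ['123', '456']

# Positive lookahead (?=...): letters followed by a digit
result = re.findall(r'[a-z]+(?=\d)', text)
print(result)  # ['hello', 'world']

# Negative lookahead (?!...): letters NOT followed by a digit.
# Backtracking gives up the last letter, so the result may surprise you.
result = re.findall(r'[a-z]+(?!\d)', text)
print(result)  # ['hell', 'worl']

text2 = "价格:$100,数量:5"

# Digits preceded by $ (the $ itself is not part of the match)
result = re.findall(r'(?<=\$)\d+', text2)
print(result)  # ['100']

# Digits NOT preceded by $. The '00' of '100' also qualifies,
# because those digits are preceded by '1', not by '$'.
result = re.findall(r'(?<!\$)\d+', text2)
print(result)  # ['00', '5']
```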
Named groups

```python
import re

text = "张三,男,25岁,来自北京"

# Named groups make the pattern easier to read
pattern = r'(?P<name>\w+),(?P<gender>\w),(?P<age>\d+)岁,来自(?P<city>\w+)'
match = re.search(pattern, text)

if match:
    print(match.group('name'))    # 张三
    print(match.group('gender'))  # 男
    print(match.group('age'))     # 25
    print(match.group('city'))    # 北京

    # All named groups as a dict
    print(match.groupdict())
    # {'name': '张三', 'gender': '男', 'age': '25', 'city': '北京'}

# Referencing named groups in a replacement
text2 = "2024-01-15"
pattern = r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})'
result = re.sub(pattern, r'\g<month>/\g<day>/\g<year>', text2)
print(result)  # 01/15/2024
```
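Conditional matching

```python
import re

# Match a word that may or may not be wrapped in double quotes
text = '名称:"测试" 或 测试'

# (?(1)"|) means: if group 1 (the opening quote) matched, require a closing quote;
# otherwise match nothing. \w+ is used for the word so the pattern
# cannot succeed on an empty string.
pattern = r'(")?(\w+)(?(1)"|)'
matches = re.findall(pattern, text)
print(matches)  # [('', '名称'), ('"', '测试'), ('', '或'), ('', '测试')]
```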
Flags

```python
import re

text = """Hello
World
Test"""

# re.DOTALL (re.S): "." also matches newlines
result = re.findall(r'.+', text, re.DOTALL)
print(result)  # ['Hello\nWorld\nTest']

# re.IGNORECASE (re.I): case-insensitive matching
result = re.findall(r'hello', 'Hello World', re.IGNORECASE)
print(result)  # ['Hello']

# re.MULTILINE (re.M): ^ and $ match at every line
text2 = """第一行
第二行
第三行"""
result = re.findall(r'^第', text2, re.MULTILINE)
print(result)  # ['第', '第', '第']

# re.VERBOSE (re.X): allows whitespace and comments inside the pattern
pattern = re.compile(r"""
    \b          # word boundary
    \d{3}       # first three digits
    [-\s]?      # separator
    \d{4}       # middle four digits
    [-\s]?      # separator
    \d{4}       # last four digits
    \b          # word boundary
""", re.VERBOSE)

result = pattern.findall("电话:138-0013-8000")
print(result)  # ['138-0013-8000']

# Combine flags with |
result = re.findall(r'hello', 'HELLO\nHello', re.IGNORECASE | re.MULTILINE)
```
Performance tuning

Compiling regular expressions

```python
import re
import time

text = "这是一段测试文本,包含邮箱 test@example.com"

# Without precompiling: the pattern string is passed on every call
def test_no_compile():
    start = time.time()
    for _ in range(100000):
        re.search(r'\w+@\w+\.\w+', text)
    return time.time() - start

# Precompiled: compile once, reuse the pattern object
pattern = re.compile(r'\w+@\w+\.\w+')

def test_compile():
    start = time.time()
    for _ in range(100000):
        pattern.search(text)
    return time.time() - start

# The re module caches recently used patterns, so the gap is mostly
# per-call lookup overhead rather than repeated compilation.
print(f"Without compile: {test_no_compile():.3f}s")
print(f"Precompiled: {test_compile():.3f}s")
```
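Avoiding backtracking

```python
import re
import time

# Dangerous: nested quantifiers can cause catastrophic backtracking
bad_pattern = r'(a+)+b'

# Safe: matches the same strings without the nesting
good_pattern = r'a+b'

# The standard-library re functions take no timeout argument, so the
# input is kept short here: each extra 'a' can roughly double the work
# for the dangerous pattern when no 'b' follows.
text = 'a' * 20 + 'c'

start = time.perf_counter()
re.search(bad_pattern, text)
print(f"Dangerous pattern: {time.perf_counter() - start:.3f}s")

start = time.perf_counter()
re.search(good_pattern, text)
print(f"Safe pattern: {time.perf_counter() - start:.6f}s")
```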
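Using re.Scanner

The example below gets scanner-style tokenizing from `re.finditer` and an alternation of named groups; `match.lastgroup` reports which token type matched.

```python
import re

def scanner_example():
    # Order matters: NUMBER comes before WORD because \w also matches digits
    tokens = [
        ('NUMBER', r'\d+'),
        ('WORD', r'\w+'),
        ('SPACE', r'\s+'),
        ('PUNCT', r'[^\w\s]'),
    ]

    # Build one pattern: (?P<NUMBER>\d+)|(?P<WORD>\w+)|...
    pattern = '|'.join(f'(?P<{name}>{regex})' for name, regex in tokens)

    text = "Hello 123 world!"
    for match in re.finditer(pattern, text):
        kind = match.lastgroup   # name of the group that matched
        value = match.group()
        print(f"{kind}: {value}")

scanner_example()
```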
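For comparison, here is a minimal sketch of the `re.Scanner` class itself. It ships with CPython's `re` module but is not covered in the official documentation, so treat its behavior as an implementation detail; the token names below are just illustrative labels.

```python
import re

# Each lexicon entry is (pattern, action). The action receives the scanner
# and the matched text; an action of None silently drops the match.
scanner = re.Scanner([
    (r'\d+', lambda s, tok: ('NUMBER', tok)),
    (r'\w+', lambda s, tok: ('WORD', tok)),
    (r'\s+', None),  # skip whitespace
    (r'[^\w\s]', lambda s, tok: ('PUNCT', tok)),
])

# scan() returns the list of action results and any unmatched remainder
tokens, remainder = scanner.scan("Hello 123 world!")
print(tokens)
# [('WORD', 'Hello'), ('NUMBER', '123'), ('WORD', 'world'), ('PUNCT', '!')]
print(repr(remainder))  # ''
```

Because the class is undocumented, the `finditer` approach is probably the safer choice for code that has to last.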
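Pitfalls to avoid

Pitfall 1: forgetting to escape

```python
import re

text = "file.txt file2txt"

# Wrong: an unescaped "." matches any character
result = re.findall(r'file.txt', text)
print(result)  # ['file.txt', 'file2txt']

# Right: escape the dot
result = re.findall(r'file\.txt', text)
print(result)  # ['file.txt']
```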
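Pitfall 2: the greedy-matching trap

```python
import re

text = "<div>内容1</div><div>内容2</div>"

# Wrong: greedy matching swallows both blocks
result = re.findall(r'<div>.*</div>', text)
print(result)  # ['<div>内容1</div><div>内容2</div>']

# Right: non-greedy matching
result = re.findall(r'<div>.*?</div>', text)
print(result)  # ['<div>内容1</div>', '<div>内容2</div>']
```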
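Pitfall 3: matching Chinese characters

```python
import re

text = "Hello 世界 Python"

# In Python 3, \w on a str pattern is Unicode-aware, so it matches Chinese too
result = re.findall(r'\w+', text)
print(result)  # ['Hello', '世界', 'Python']

# With re.ASCII, \w is limited to [a-zA-Z0-9_]
result = re.findall(r'\w+', text, re.ASCII)
print(result)  # ['Hello', 'Python']

# Match only Chinese characters (common CJK range)
result = re.findall(r'[\u4e00-\u9fa5]+', text)
print(result)  # ['世界']

# Chinese characters plus word characters
result = re.findall(r'[\u4e00-\u9fa5\w]+', text)
print(result)  # ['Hello', '世界', 'Python']
```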
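Pitfall 4: multi-line matching

```python
import re

text = """第一行
第二行
第三行"""

# Wrong: by default ^ only matches at the start of the whole string
result = re.findall(r'^第', text)
print(result)  # ['第']

# Right: use re.MULTILINE so ^ matches at the start of every line
result = re.findall(r'^第', text, re.MULTILINE)
print(result)  # ['第', '第', '第']
```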
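Pitfall 5: handling special characters

```python
import re

# User input may contain regex metacharacters
user_input = "file.txt"

# re.escape escapes every special character for you
pattern = re.escape(user_input)
result = re.findall(pattern, "file.txt file2txt")
print(result)  # ['file.txt']
```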
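PS: Regular expressions are an essential skill for programmers. It's fine if you can't memorize them; bookmark this article as a quick reference. And remember: if a simple method does the job, don't reach for a complex regex!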