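Hi everyone, I'm 程序员晚枫, a programmer working hands-on on all kinds of AI projects.

🎬 Opening: same data processing, so why is someone else's code faster and lighter on memory?

Have you ever run into this situation? You are both processing a CSV file with 1 million rows:

- Colleague A's code finishes in a few seconds and uses less than 100 MB of memory.
- Your code runs for 2 minutes and memory usage climbs to 2 GB.

Why is the gap so large?

The answer is in today's topics: list comprehensions, generator expressions, and how Python's containers work under the hood.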
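A real case

In 2024 I helped a financial company optimize a data-processing script. The original code used a traditional for loop to process the daily report and needed 3 minutes for 1 million records. After I rewrote it with generator expressions plus batch processing, the same data took only 15 seconds, and memory usage dropped from 1.8 GB to 200 MB.

That is the difference container usage makes in Python.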
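🚀 List comprehensions: faster, more concise data processing

What is a list comprehension?

A list comprehension is Python syntactic sugar that lets you create and transform a list in a single line of code.

```python
# Traditional approach: build the list with an explicit loop
squares = []
for x in range(10):
    squares.append(x ** 2)
print(squares)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

# List comprehension: the same result in one line
squares = [x ** 2 for x in range(10)]
print(squares)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```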
List comprehension syntax

```python
# Basic form
[expression for item in iterable]

# With a filter condition
[expression for item in iterable if condition]

# With a conditional expression (if/else)
[expression_if_true if condition else expression_if_false for item in iterable]

# With multiple loops
[expression for item1 in iterable1 for item2 in iterable2]
```
Basic usage in detail

1. Simple transformation

```python
# Square each number
squares = [x ** 2 for x in range(10)]

# Upper-case each string
words = ['hello', 'world', 'python']
upper_words = [word.upper() for word in words]

# Extract one field from a list of dicts
users = [{'name': '张三', 'age': 25}, {'name': '李四', 'age': 30}]
names = [user['name'] for user in users]
```
2. Filtering with a condition

```python
# Keep the even numbers
evens = [x for x in range(20) if x % 2 == 0]

# Drop empty strings and None (both are falsy)
texts = ['hello', '', 'world', None, 'python', '']
valid_texts = [s for s in texts if s]

# Filter a list of dicts by a field
products = [
    {'name': 'iPhone', 'price': 5999},
    {'name': '小米', 'price': 1999},
    {'name': '华为', 'price': 4999},
]
expensive = [p for p in products if p['price'] > 3000]
```
3. Conditional expression (ternary operator)

```python
# Label each number as even or odd
numbers = range(10)
labels = ['even' if x % 2 == 0 else 'odd' for x in numbers]

# Absolute values
values = [-3, -1, 0, 2, 4, -5]
abs_values = [x if x >= 0 else -x for x in values]

# Replace empty or missing names with a default
names = ['Alice', '', 'Bob', None, 'Charlie']
valid_names = [name if name else 'anonymous' for name in names]
```
4. Multiple loops

```python
# Cartesian product
colors = ['red', 'blue', 'green']
sizes = ['S', 'M', 'L']
combinations = [(color, size) for color in colors for size in sizes]

# Flatten a 2D list
matrix = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
flat = [item for row in matrix for item in row]

# Multiple loops plus a condition: all pairs with x < y
pairs = [(x, y) for x in range(5) for y in range(5) if x < y]
```
Nested list comprehensions

```python
# Nested comprehension: a 3x3 multiplication table
matrix = [[i * j for j in range(1, 4)] for i in range(1, 4)]
# [[1, 2, 3], [2, 4, 6], [3, 6, 9]]

# Equivalent explicit loops
matrix = []
for i in range(1, 4):
    row = []
    for j in range(1, 4):
        row.append(i * j)
    matrix.append(row)

# Transpose the matrix
transposed = [[row[i] for row in matrix] for i in range(len(matrix[0]))]
```
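The transpose written as a nested comprehension assumes a rectangular matrix. A common alternative is `zip(*matrix)`; the sketch below compares the two on a small example (the variable names are illustrative only).

```python
matrix = [[1, 2, 3], [4, 5, 6]]

# Nested comprehension: build one new row per column index.
transposed_comp = [[row[i] for row in matrix] for i in range(len(matrix[0]))]

# zip(*matrix) groups the i-th items of every row. Each group is a tuple,
# so convert back to lists to get the same shape as above.
transposed_zip = [list(col) for col in zip(*matrix)]

print(transposed_comp)  # [[1, 4], [2, 5], [3, 6]]
print(transposed_zip)   # [[1, 4], [2, 5], [3, 6]]
```

Note that `zip` silently stops at the shortest row, while the comprehension raises `IndexError` on a ragged matrix, so the two only agree for rectangular input.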
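Performance comparison

A list comprehension is usually faster than the equivalent for loop with `append()`. The figure commonly quoted is 1.5-2x, but the exact ratio depends on the Python version and on how much work is done per element. The reasons:

- No `append` method call: the loop has to look up and call `append()` on every iteration.
- Dedicated bytecode: CPython compiles a comprehension to a specialized `LIST_APPEND` instruction rather than a general method call.
- Fewer bytecode instructions per element.

```python
import timeit

# List comprehension
list_comp = '[x ** 2 for x in range(1000)]'
list_comp_time = timeit.timeit(list_comp, number=10000)

# Plain loop with append
loop = '''
result = []
for x in range(1000):
    result.append(x ** 2)
'''
loop_time = timeit.timeit(loop, number=10000)

# map + lambda
map_time = timeit.timeit('list(map(lambda x: x ** 2, range(1000)))', number=10000)

print(f"List comprehension: {list_comp_time:.4f}s")
print(f"Plain loop: {loop_time:.4f}s")
print(f"map+lambda: {map_time:.4f}s")
print(f"Comprehension faster than loop by: {(loop_time / list_comp_time - 1) * 100:.1f}%")
```

Sample output (your numbers will vary with machine and Python version):

```
List comprehension: 0.3214s
Plain loop: 0.5623s
map+lambda: 0.4891s
Comprehension faster than loop by: 75.0%
```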
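The bytecode difference behind these timings can be inspected with the standard `dis` module. The sketch below is CPython-specific, and the helper `opnames` is a hypothetical name introduced here. It collects the opcode names used by a loop version and a comprehension version and checks which one contains `LIST_APPEND`.

```python
import dis
import types


def with_loop():
    result = []
    for x in range(10):
        result.append(x ** 2)
    return result


def with_comprehension():
    return [x ** 2 for x in range(10)]


def opnames(code):
    """Collect opcode names from a code object and any nested code objects."""
    names = {ins.opname for ins in dis.get_instructions(code)}
    for const in code.co_consts:
        # Before Python 3.12 a comprehension lives in its own nested code object.
        if isinstance(const, types.CodeType):
            names |= opnames(const)
    return names


loop_ops = opnames(with_loop.__code__)
comp_ops = opnames(with_comprehension.__code__)

print("LIST_APPEND in loop version:", "LIST_APPEND" in loop_ops)
print("LIST_APPEND in comprehension:", "LIST_APPEND" in comp_ops)
```

On CPython the comprehension should report `LIST_APPEND` while the loop version, which goes through an ordinary method call, should not. Other opcode names differ between versions, so the sketch deliberately checks only this one.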
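When should you use a list comprehension?

✅ Good fit:

- Simple, single-step transformations and filters

❌ Poor fit:

- Complex processing logic
- Code that needs try-except to handle exceptions
- Multi-step work that needs intermediate variables

```python
# Good fit: a simple one-step conversion
prices = [99.9, 199.9, 299.9]
int_prices = [int(p) for p in prices]

# Poor fit: multi-step logic with error handling reads better as a loop.
# data, complex_calculation, validate and process are placeholders.
results = []
for item in data:
    try:
        value = complex_calculation(item)
        if validate(value):
            results.append(process(value))
    except ValueError:
        results.append(None)
```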
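💡 Generator expressions: the power of lazy evaluation

What is a generator expression?

A generator expression is the "lazy version" of a list comprehension:

- It does not compute all values up front
- It computes each value only when it is needed
- It saves memory, especially on large data

```python
# List comprehension: builds all 1,000,000 values immediately
squares_list = [x ** 2 for x in range(1000000)]

# Generator expression: creates only a generator object; values are computed on demand
squares_gen = (x ** 2 for x in range(1000000))
```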
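Syntax comparison

```python
# List comprehension: square brackets
squares_list = [x ** 2 for x in range(10)]

# Generator expression: parentheses
squares_gen = (x ** 2 for x in range(10))

# Dict comprehension: braces with key: value
squares_dict = {x: x ** 2 for x in range(10)}

# Set comprehension: braces
squares_set = {x ** 2 for x in range(10)}
```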
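Memory usage comparison

```python
import sys

# List: getsizeof reports the list's pointer array (about 8 MB on 64-bit CPython),
# not the integer objects it refers to
big_list = [x for x in range(1000000)]
print(f"List size: {sys.getsizeof(big_list) / 1024 / 1024:.2f} MB")

# Generator: a small fixed-size object (on the order of 100-200 bytes), whatever the range
big_gen = (x for x in range(1000000))
print(f"Generator size: {sys.getsizeof(big_gen)} bytes")
```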
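To make "computed only when needed" concrete, the following sketch records every call made while a generator expression is consumed. The helper `square` and the `calls` log are illustrative names, not part of the original example.

```python
calls = []


def square(x):
    """Square x and record that the call happened."""
    calls.append(x)
    return x ** 2


lazy = (square(x) for x in range(5))
calls_after_creation = list(calls)  # nothing has been computed yet

first = next(lazy)                  # runs square(0) only
calls_after_first = list(calls)

rest = list(lazy)                   # runs the remaining four calls

print(calls_after_creation)  # []
print(first, calls_after_first)  # 0 [0]
print(rest)  # [1, 4, 9, 16]
```

Creating the generator costs almost nothing; the work happens one element at a time as the consumer asks for values.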
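Use cases in detail

1. Sum, max, and min

```python
import timeit

# Wasteful: builds the whole list first, then sums it
total = sum([x ** 2 for x in range(1000000)])

# Better: feed values to sum() one at a time (no extra parentheses needed)
total = sum(x ** 2 for x in range(1000000))

# Timing comparison. The generator's main win is memory; its run time is
# often similar to, or slightly slower than, the list version.
list_sum = timeit.timeit('sum([x**2 for x in range(10000)])', number=1000)
gen_sum = timeit.timeit('sum(x**2 for x in range(10000))', number=1000)
print(f"List sum: {list_sum:.4f}s")
print(f"Generator sum: {gen_sum:.4f}s")
```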
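2. File processing

```python
# process() is a placeholder for your own per-line handler.

# Wasteful: reads every line into memory first
with open('large_file.txt') as f:
    lines = [line.strip() for line in f]
for line in lines:
    process(line)

# Better: a generator expression handles one line at a time
with open('large_file.txt') as f:
    for line in (line.strip() for line in f):
        process(line)

# Simplest: file objects are already lazy iterators
with open('large_file.txt') as f:
    for line in f:
        process(line.strip())
```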
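3. Pipeline-style processing

```python
def process_logs(filename):
    """Return the timestamp (first field) of every ERROR line, via a generator pipeline."""
    with open(filename) as f:
        # Stage 1: keep only the error lines
        errors = (
            line for line in f
            if 'ERROR' in line
        )
        # Stage 2: extract the timestamp
        timestamps = (
            line.split()[0] for line in errors
        )
        # Consume the pipeline while the file is still open
        return list(timestamps)


def process_logs_verbose(filename):
    """Equivalent version with an explicit loop."""
    with open(filename) as f:
        result = []
        for line in f:
            if 'ERROR' in line:
                timestamp = line.split()[0]
                result.append(timestamp)
        return result
```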
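The same two-stage pipeline can be tried without a log file by feeding it an in-memory stream. This is a minimal sketch: the log lines and the function name `error_timestamps` are made up for illustration.

```python
import io

# Hypothetical log content; the first whitespace-separated field is the timestamp.
log_text = """\
2024-01-15T10:00:01 INFO service started
2024-01-15T10:00:05 ERROR database timeout
2024-01-15T10:00:09 INFO retrying
2024-01-15T10:00:12 ERROR database timeout
"""


def error_timestamps(lines):
    """Chain two generator expressions: filter ERROR lines, then take the first field."""
    errors = (line for line in lines if 'ERROR' in line)
    return (line.split()[0] for line in errors)


stream = io.StringIO(log_text)          # behaves like an open text file
timestamps = list(error_timestamps(stream))
print(timestamps)  # ['2024-01-15T10:00:05', '2024-01-15T10:00:12']
```

Because each stage pulls one line at a time from the stage before it, no intermediate list of error lines is ever built.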
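Generator expression vs list comprehension

| Feature | List comprehension | Generator expression |
| --- | --- | --- |
| Syntax | `[...]` | `(...)` |
| Memory usage | High (stores every element) | Low (stores only the iteration state) |
| When values are computed | Immediately | Lazily |
| Number of passes | Unlimited | One only |
| Indexing | Supported | Not supported |
| Slicing | Supported | Not supported |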
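The "not supported" rows for indexing and slicing can be seen directly, and `itertools.islice` offers a lazy stand-in for slicing. A small sketch, with illustrative variable names:

```python
from itertools import islice

squares = (x ** 2 for x in range(100))

index_error = None
try:
    squares[10]                      # generators are not subscriptable
except TypeError as exc:
    index_error = str(exc)

# islice skips items 0-9, yields items 10-12, and consumes all of them.
window = list(islice(squares, 10, 13))
next_value = next(squares)           # the generator continues from item 13

print(index_error)
print(window)      # [100, 121, 144]
print(next_value)  # 169
```

Unlike a list slice, `islice` advances the underlying generator, so the skipped items are gone afterwards.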
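How to choose:

```python
# Need the data more than once: use a list
data = [x ** 2 for x in range(100)]
print(sum(data))
print(max(data))
print(min(data))

# Need it only once: use a generator
total = sum(x ** 2 for x in range(1000000))

# Need indexing or slicing: use a list
data = [x ** 2 for x in range(100)]
print(data[10])     # 100
print(data[10:20])

# Very large input: use a generator (process and huge_dataset are placeholders)
big_data = (process(x) for x in huge_dataset)
```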
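📦 Tuples: more than immutable lists

Two uses of tuples

Python tuples serve two quite different purposes:

- Immutable list: stores data that must not be modified
- Record: stores items of different types (like a database row)

```python
# As an immutable list: a sequence of same-kind values
coordinates = (10, 20, 30)
rgb = (255, 128, 0)

# As a record: each position has its own meaning
person = ('张三', 25, 'engineer', 'zhang@example.com')
```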
Tuple unpacking

Tuple unpacking is one of Python's most useful features:

```python
# Basic unpacking
point = (3, 4)
x, y = point
print(x, y)  # 3 4

# Swap two variables
a, b = 1, 2
a, b = b, a
print(a, b)  # 2 1

# Unpack a record
person = ('张三', 25, 'engineer')
name, age, job = person
print(f"{name} is a {age}-year-old {job}")

# Capture the remaining items with *
first, *rest = [1, 2, 3, 4, 5]
print(first)  # 1
print(rest)   # [2, 3, 4, 5]

head, *middle, tail = [1, 2, 3, 4, 5]
print(head)    # 1
print(middle)  # [2, 3, 4]
print(tail)    # 5

# Ignore a field with _
name, _, email = ('张三', 25, 'zhang@example.com')
print(name, email)
```
Nested unpacking

```python
# The unpacking target mirrors the structure of the data
person = ('张三', (25, 'engineer'), ['Python', 'Java'])
name, (age, job), skills = person
print(name)    # 张三
print(age)     # 25
print(job)     # engineer
print(skills)  # ['Python', 'Java']

# Unpacking inside a for loop
for index, (name, score) in enumerate([('Alice', 95), ('Bob', 87)]):
    print(f"{index}: {name} - {score}")
```
Named tuples: namedtuple

Naming the fields of a tuple makes the code more readable:

```python
from collections import namedtuple

# Define a named tuple (field names as a list or a space-separated string)
Point = namedtuple('Point', ['x', 'y'])
Person = namedtuple('Person', 'name age job')

p = Point(3, 4)
print(p.x, p.y)     # 3 4  (access by name)
print(p[0], p[1])   # 3 4  (access by index still works)
print(p._fields)    # ('x', 'y')
print(p._asdict())  # {'x': 3, 'y': 4}

# _replace returns a new tuple; the original is unchanged
p2 = p._replace(x=5)
print(p2)           # Point(x=5, y=4)

# A practical record type
Employee = namedtuple('Employee', 'id name department salary')
emp = Employee(1, '张三', 'Engineering', 15000)
print(f"Employee {emp.name} salary: {emp.salary}")

# Plain tuple: positions are easy to mix up
record = ('张三', 25, 'engineer')
print(f"Name: {record[0]}, age: {record[1]}")

# Named tuple: self-documenting
Person = namedtuple('Person', 'name age job')
record = Person('张三', 25, 'engineer')
print(f"Name: {record.name}, age: {record.age}")
```
Advanced named tuple usage

```python
from collections import namedtuple
from dataclasses import dataclass
from typing import NamedTuple

# Default values (Python 3.7+); defaults apply to the rightmost fields
Person = namedtuple('Person', 'name age', defaults=['unknown', 0])
p = Person('张三')
print(p)  # Person(name='张三', age=0)


# typing.NamedTuple: class syntax with type hints and methods
class Point(NamedTuple):
    x: float
    y: float

    def distance(self):
        return (self.x ** 2 + self.y ** 2) ** 0.5


p = Point(3, 4)
print(p.distance())  # 5.0


# dataclass: a mutable alternative (Python 3.7+)
@dataclass
class Person:
    name: str
    age: int = 0
    job: str = 'unemployed'

    def __str__(self):
        return f"{self.name} ({self.age}, {self.job})"


p = Person('张三', job='engineer')
print(p)  # 张三 (0, engineer)
```
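One practical difference between these options is mutability. The sketch below uses hypothetical class names (`PointNT`, `PointDC`, `FrozenPoint`) and shows that a `NamedTuple` rejects attribute assignment, a plain dataclass allows it, and `@dataclass(frozen=True)` opts back into immutability.

```python
from dataclasses import dataclass, FrozenInstanceError
from typing import NamedTuple


class PointNT(NamedTuple):
    x: float
    y: float


@dataclass
class PointDC:
    x: float
    y: float


@dataclass(frozen=True)
class FrozenPoint:
    x: float
    y: float


nt = PointNT(3, 4)
try:
    nt.x = 10
    nt_mutable = True
except AttributeError:          # tuples cannot be modified in place
    nt_mutable = False

dc = PointDC(3, 4)
dc.x = 10                       # allowed: dataclasses are mutable by default

fp = FrozenPoint(3, 4)
try:
    fp.x = 10
    frozen_mutable = True
except FrozenInstanceError:     # raised by frozen dataclasses
    frozen_mutable = False

x, y = nt                       # a NamedTuple still unpacks like a tuple
print(nt_mutable, dc.x, frozen_mutable, (x, y))
```

If the record should never change after creation, a named tuple or a frozen dataclass makes that explicit; if fields need updating in place, a regular dataclass is the simpler fit.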
🗂️ Dictionaries: Python's most powerful data structure

Dict comprehensions

```python
# Map each word to its length
words = ['apple', 'banana', 'cherry']
word_lengths = {word: len(word) for word in words}
print(word_lengths)  # {'apple': 5, 'banana': 6, 'cherry': 6}

# Build a dict from two lists
keys = ['a', 'b', 'c']
values = [1, 2, 3]
d = {k: v for k, v in zip(keys, values)}
print(d)  # {'a': 1, 'b': 2, 'c': 3}

# Filter by value
scores = {'Alice': 95, 'Bob': 67, 'Charlie': 82, 'David': 55}
passed = {name: score for name, score in scores.items() if score >= 60}
print(passed)  # {'Alice': 95, 'Bob': 67, 'Charlie': 82}

# Transform the values
prices = {'apple': 3.5, 'banana': 2.8, 'cherry': 4.2}
formatted = {fruit: f'¥{price:.2f}' for fruit, price in prices.items()}
print(formatted)  # {'apple': '¥3.50', 'banana': '¥2.80', 'cherry': '¥4.20'}
```
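Several ways to merge dicts

```python
dict1 = {'a': 1, 'b': 2}
dict2 = {'c': 3, 'd': 4}
dict3 = {'b': 20, 'e': 5}

# Option 1: copy + update (any Python version)
result = dict1.copy()
result.update(dict2)
print(result)  # {'a': 1, 'b': 2, 'c': 3, 'd': 4}

# Option 2: ** unpacking (Python 3.5+)
result = {**dict1, **dict2}
print(result)  # {'a': 1, 'b': 2, 'c': 3, 'd': 4}

# Option 3: | operator (Python 3.9+)
result = dict1 | dict2
print(result)  # {'a': 1, 'b': 2, 'c': 3, 'd': 4}

# On duplicate keys, the right-hand value wins
result = dict1 | dict3
print(result)  # {'a': 1, 'b': 20, 'e': 5}

# In-place merge (Python 3.9+)
dict1 |= dict2
print(dict1)  # {'a': 1, 'b': 2, 'c': 3, 'd': 4}
```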
Common dict techniques

1. The setdefault pattern

```python
from collections import Counter, defaultdict

fruits = ['apple', 'banana', 'apple', 'cherry', 'banana', 'apple']

# Manual check for a missing key
word_counts = {}
for word in fruits:
    if word not in word_counts:
        word_counts[word] = 0
    word_counts[word] += 1

# setdefault: insert the default only if the key is missing
word_counts = {}
for word in fruits:
    word_counts.setdefault(word, 0)
    word_counts[word] += 1

# defaultdict: missing keys start at int() == 0
word_counts = defaultdict(int)
for word in fruits:
    word_counts[word] += 1

# Counter: purpose-built for counting
word_counts = Counter(fruits)
print(word_counts)  # Counter({'apple': 3, 'banana': 2, 'cherry': 1})
```
2. Advanced defaultdict usage

```python
from collections import defaultdict

# Group words by first letter
words = ['apple', 'apricot', 'banana', 'blueberry', 'cherry']
by_first_letter = defaultdict(list)
for word in words:
    by_first_letter[word[0]].append(word)
print(dict(by_first_letter))
# {'a': ['apple', 'apricot'], 'b': ['banana', 'blueberry'], 'c': ['cherry']}

# Multi-value dict
multi_dict = defaultdict(list)
multi_dict['fruits'].extend(['apple', 'banana'])
multi_dict['fruits'].append('cherry')
print(dict(multi_dict))  # {'fruits': ['apple', 'banana', 'cherry']}


# Custom default factory
def default_person():
    return {'name': 'unknown', 'age': 0}


people = defaultdict(default_person)
print(people['nonexistent'])  # {'name': 'unknown', 'age': 0}
```
3. Dict views

```python
d = {'a': 1, 'b': 2, 'c': 3}

# Views are live: they reflect later changes to the dict
keys = d.keys()
values = d.values()
items = d.items()

d['d'] = 4
print(list(keys))    # ['a', 'b', 'c', 'd']
print(list(values))  # [1, 2, 3, 4]

# Key views support set operations
d1 = {'a': 1, 'b': 2}
d2 = {'b': 20, 'c': 3}

common_keys = d1.keys() & d2.keys()
print(common_keys)  # {'b'}

all_keys = d1.keys() | d2.keys()
print(all_keys)     # {'a', 'b', 'c'} (set order may vary)

diff_keys = d1.keys() - d2.keys()
print(diff_keys)    # {'a'}
```
Dict memory optimization

```python
import sys

# A plain dict with 1,000 entries (getsizeof counts the hash table, not the values)
d = {}
for i in range(1000):
    d[i] = i * 2
print(f"Plain dict: {sys.getsizeof(d)} bytes")


# __slots__ removes the per-instance __dict__, shrinking each object
class Point:
    __slots__ = ['x', 'y']

    def __init__(self, x, y):
        self.x = x
        self.y = y


points = [Point(i, i * 2) for i in range(1000)]
print(f"One __slots__ object: {sys.getsizeof(points[0])} bytes")
```
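A more direct way to see what `__slots__` saves is to compare two otherwise identical classes. This is a rough sketch: `PointDict` and `PointSlots` are hypothetical names, and `sys.getsizeof` figures are CPython-specific and shallow, so treat the numbers as indicative only.

```python
import sys


class PointDict:
    """Regular class: each instance carries its own __dict__."""

    def __init__(self, x, y):
        self.x = x
        self.y = y


class PointSlots:
    """Same fields, but stored in fixed slots instead of a __dict__."""

    __slots__ = ('x', 'y')

    def __init__(self, x, y):
        self.x = x
        self.y = y


regular = PointDict(1, 2)
slotted = PointSlots(1, 2)

# Count the instance plus its attribute dict for the regular class.
regular_size = sys.getsizeof(regular) + sys.getsizeof(regular.__dict__)
slotted_size = sys.getsizeof(slotted)
has_dict = hasattr(slotted, '__dict__')

print(f"Regular instance + __dict__: {regular_size} bytes")
print(f"__slots__ instance: {slotted_size} bytes")
print(f"Slotted instance has __dict__: {has_dict}")
```

The saving is per instance, so it matters mostly when a program holds a very large number of small objects. The trade-off is that slotted instances cannot gain new attributes at runtime.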
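🔪 Slicing: Python's most elegant way to extract data

Slicing basics

```python
nums = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

print(nums[2:5])   # [2, 3, 4]
print(nums[:3])    # [0, 1, 2]
print(nums[7:])    # [7, 8, 9]
print(nums[-3:])   # [7, 8, 9]
print(nums[:-3])   # [0, 1, 2, 3, 4, 5, 6]
print(nums[::2])   # [0, 2, 4, 6, 8]
print(nums[1::2])  # [1, 3, 5, 7, 9]
print(nums[::-1])  # [9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
```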
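Slice assignment

```python
nums = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

# Replace a range
nums[2:5] = [20, 30, 40]
print(nums)  # [0, 1, 20, 30, 40, 5, 6, 7, 8, 9]

# Delete a range
nums[2:5] = []
print(nums)  # [0, 1, 5, 6, 7, 8, 9]

# Insert without replacing anything
nums[2:2] = [2, 3, 4]
print(nums)  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

# Assign to a stepped slice (the lengths must match exactly)
nums = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
nums[::2] = [0, 0, 0, 0, 0]
print(nums)  # [0, 1, 0, 3, 0, 5, 0, 7, 0, 9]
```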
Multi-dimensional slicing

```python
import numpy as np

# NumPy supports multi-dimensional slicing natively
arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(arr[:2, 1:])  # rows 0-1, columns 1-2: 2 3 / 5 6


# A custom class can support m[row, col] by handling tuple keys
class Matrix:
    def __init__(self, data):
        self.data = data

    def __getitem__(self, key):
        if isinstance(key, tuple):
            row, col = key
            if isinstance(row, slice):
                # Several rows: apply col (an int or a slice) to each one
                return [r[col] for r in self.data[row]]
            # Single row: col may be an int or a slice
            return self.data[row][col]
        return self.data[key]


m = Matrix([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(m[0, 1])  # 2
print(m[0, :])  # [1, 2, 3]
print(m[:, 0])  # [1, 4, 7]
```
slice objects

```python
items = list(range(10))

# Named slices make the intent explicit
first_three = slice(0, 3)
last_three = slice(-3, None)
every_other = slice(None, None, 2)

print(items[first_three])  # [0, 1, 2]
print(items[last_three])   # [7, 8, 9]
print(items[every_other])  # [0, 2, 4, 6, 8]

# Parsing fixed-width records: every field must start at the same column,
# so the names here are padded to 6 characters
records = [
    "2024-01-15 Alice  95",
    "2024-01-16 Bob    87",
    "2024-01-17 Carol  92",
]

date_slice = slice(0, 10)
name_slice = slice(11, 17)
score_slice = slice(18, 20)

for record in records:
    name = record[name_slice].strip()
    print(f"Date: {record[date_slice]}, name: {name}, score: {record[score_slice]}")
```
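⚠️ Pitfalls to avoid

Pitfall 1: modifying an outer variable inside a comprehension

```python
# Avoid: := rebinds the outer x on every iteration, a hidden side effect (Python 3.8+)
x = 10
result = [x := x + i for i in range(5)]  # [10, 11, 13, 16, 20]; x is now 20


# Prefer: make the running state explicit
def accumulate(start, values):
    result = []
    current = start
    for v in values:
        current += v
        result.append(current)
    return result


result = accumulate(10, range(5))
print(result)  # [10, 11, 13, 16, 20]
```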
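For running totals specifically, the standard library already provides `itertools.accumulate`. The sketch below reproduces the same result as the hand-written helper, assuming Python 3.8+ for the `initial` parameter.

```python
from itertools import accumulate

# accumulate yields the initial value first, then each running total.
running = list(accumulate(range(5), initial=10))

# Drop the leading initial value to match the helper's output.
result = running[1:]

print(running)  # [10, 10, 11, 13, 16, 20]
print(result)   # [10, 11, 13, 16, 20]
```

Like a generator expression, `accumulate` is lazy, so it can be chained into a pipeline without building an intermediate list.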
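Pitfall 2: a generator can be iterated only once

```python
# A generator is exhausted after one pass
gen = (x ** 2 for x in range(5))
print(list(gen))  # [0, 1, 4, 9, 16]
print(list(gen))  # [] (already exhausted)

# Fix: materialize it as a list if you need it more than once
gen = (x ** 2 for x in range(5))
lst = list(gen)
print(lst)  # [0, 1, 4, 9, 16]
print(lst)  # [0, 1, 4, 9, 16]
```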
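When the data is too large to materialize as a list, another option is to wrap the generator expression in a function and call it again for each pass. A minimal sketch, with `squares` as an illustrative name:

```python
def squares(n):
    """Return a fresh generator each time it is called."""
    return (x ** 2 for x in range(n))


# Each call creates a new, unconsumed generator.
first_pass = sum(squares(5))
second_pass = max(squares(5))

# A single generator object is still one-shot.
gen = squares(5)
consumed = list(gen)
again = list(gen)

print(first_pass, second_pass)  # 30 16
print(consumed, again)          # [0, 1, 4, 9, 16] []
```

This trades recomputation for memory: every pass repeats the work, but nothing is stored between passes.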
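Pitfall 3: a slice creates a new object, but only a shallow copy

```python
import copy

# A slice is a new list object...
a = [1, 2, 3, 4, 5]
b = a[:]
print(a is b)  # False

# ...but the copy is shallow: nested lists are still shared
a = [[1, 2], [3, 4]]
b = a[:]
b[0][0] = 999
print(a)  # [[999, 2], [3, 4]]

# Fix: deep copy when the nested objects must be independent
b = copy.deepcopy(a)
```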
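Pitfall 4: dict iteration order

```python
# Since Python 3.7, dicts preserve insertion order
d = {}
d['a'] = 1
d['c'] = 3
d['b'] = 2
print(list(d.keys()))  # ['a', 'c', 'b']

# Insertion order is not sorted order
d = {'c': 3, 'a': 1, 'b': 2}
print(list(d.keys()))  # ['c', 'a', 'b']

# Sort explicitly when you need sorted keys
sorted_keys = sorted(d.keys())
print(sorted_keys)  # ['a', 'b', 'c']
```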
🎯 Hands-on case: processing a large CSV file

```python
import csv
import random
from collections import defaultdict


def process_large_csv(filename):
    """
    Process a large CSV file and total the sales amount per category.
    Uses a generator so the whole file is never held in memory.
    """
    category_sales = defaultdict(float)

    with open(filename, 'r', encoding='utf-8') as f:
        reader = csv.DictReader(f)

        # Lazily skip rows with an empty or non-positive amount
        valid_rows = (
            row for row in reader
            if row['amount'] and float(row['amount']) > 0
        )

        for row in valid_rows:
            category = row['category']
            amount = float(row['amount'])
            category_sales[category] += amount

    return dict(category_sales)


def generate_test_data(filename, rows=1000000):
    """Write a CSV of random sales rows for testing."""
    categories = ['Electronics', 'Clothing', 'Food', 'Home', 'Books']

    with open(filename, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['category', 'amount', 'date'])

        for _ in range(rows):
            category = random.choice(categories)
            amount = random.uniform(10, 1000)
            date = f'2024-{random.randint(1, 12):02d}-{random.randint(1, 28):02d}'
            writer.writerow([category, f'{amount:.2f}', date])
```
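The opening case mentions combining generator expressions with batch processing, but the batching step is never shown. One possible shape for it is sketched below. The helper name `in_batches` and the stand-in data are assumptions, not the code used in that project; on Python 3.12+ the built-in `itertools.batched` covers the same need.

```python
from itertools import islice


def in_batches(iterable, size):
    """Yield lists of at most `size` items without loading everything at once."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Stand-in for a lazy stream of parsed CSV amounts.
rows = (n * 1.5 for n in range(10))

# Only one batch of 4 values is held in memory at any time.
batch_totals = [sum(batch) for batch in in_batches(rows, 4)]
print(batch_totals)  # [9.0, 33.0, 25.5]
```

Batching is useful when the per-item overhead of a downstream step (a database insert, an API call) is high, while still keeping memory bounded by the batch size.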
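🎯 Summary

In this lesson we covered:

| Topic | Key points |
| --- | --- |
| List comprehensions | Faster and more concise; best for simple transformations; often quoted as 1.5-2x faster than a plain loop |
| Generator expressions | Lazy evaluation, saves memory; can be iterated only once |
| Tuple unpacking | An elegant way to extract data; `*` captures the remaining items |
| Named tuples | Name the tuple fields to improve readability |
| Dict comprehensions | A concise way to create and transform dicts |
| Dict merging | Use the `\|` operator on Python 3.9+, `**` unpacking on older versions |
| Slicing | A powerful way to extract and modify data; supports step and negative indexes |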
Remember this:

Choosing the right data container and the right operation makes your code faster, lighter on memory, and easier to read.
📚 Recommended books

Python Crash Course (3rd edition) | Fluent Python (2nd edition) | CPython Internals

Learning path: complete beginner → Python Crash Course → Fluent Python → this course → CPython Internals