大家好,我是正在实战各种 AI 项目的程序员晚枫。
🎬 开篇:一个去重问题引发的思考 你有没有写过这样的代码?
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 text = "the quick brown fox jumps over the lazy dog the fox" words = text.split() unique_words = [] for word in words: if word not in unique_words: unique_words.append(word) print (unique_words)unique_words = list (set (words)) print (unique_words)
但问题来了 :为什么 set 去重这么快?底层原理是什么?
今天我们就深入理解 Python 的集合(set)和映射(dict),它们都基于哈希表 实现,是 Python 最强大的数据结构之一。
🔥 集合(set):去重和集合运算的神器 什么是集合? 集合(set)是 Python 的内置类型,特点:
无序 :元素没有固定顺序唯一 :自动去重可变 :可以添加和删除元素(frozenset 是不可变版本)元素必须可哈希 :列表、字典等不能作为集合元素1 2 3 4 5 6 7 8 9 10 11 12 s1 = {1 , 2 , 3 } s2 = set ([1 , 2 , 2 , 3 ]) s3 = set ('hello' ) print (s1) print (s2) print (s3) empty = set ()
集合的底层原理:哈希表 Python 的集合基于哈希表 实现,这决定了它的性能特点:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 import timelst = list (range (100000 )) start = time.time() for _ in range (1000 ): 99999 in lst print (f"列表检查: {time.time() - start:.4 f} s" ) s = set (range (100000 )) start = time.time() for _ in range (1000 ): 99999 in s print (f"集合检查: {time.time() - start:.4 f} s" )
为什么集合查找这么快?
哈希表的查找过程:
计算元素的哈希值:hash(element) 根据哈希值定位到"桶"(bucket) 直接访问该位置(O(1) 时间) 1 2 3 4 5 6 7 print (hash (42 )) print (hash ('hello' )) print (hash ((1 , 2 , 3 )))
集合运算 集合支持数学中的各种集合运算:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 a = {1 , 2 , 3 , 4 , 5 } b = {4 , 5 , 6 , 7 , 8 } print (a & b) print (a.intersection(b)) print (a | b) print (a.union(b)) print (a - b) print (a.difference(b)) print (a ^ b) print (a.symmetric_difference(b)) c = {1 , 2 } print (c.issubset(a)) print (a.issuperset(c)) print (a.isdisjoint(b))
集合运算的实际应用 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 article1_words = {'python' , 'programming' , 'code' , 'data' } article2_words = {'java' , 'programming' , 'code' , 'web' } common = article1_words & article2_words print (f"共同词汇: {common} " ) unique_to_article1 = article1_words - article2_words print (f"文章1独有: {unique_to_article1} " ) all_words = article1_words | article2_words print (f"所有词汇: {all_words} " )alice_friends = {'Bob' , 'Charlie' , 'David' , 'Eve' } bob_friends = {'Charlie' , 'David' , 'Frank' , 'Grace' } mutual_friends = alice_friends & bob_friends print (f"共同好友: {mutual_friends} " ) user_permissions = {'read' , 'write' , 'execute' } required_permissions = {'read' , 'execute' } if required_permissions.issubset(user_permissions): print ("权限充足,允许操作" ) else : print ("权限不足" )
集合的性能对比 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 import timedef test_membership (container, item, iterations=10000 ): start = time.time() for _ in range (iterations): item in container return time.time() - start lst = list (range (100000 )) s = set (range (100000 )) print ("查找存在的元素:" )print (f" 列表: {test_membership(lst, 50000 ):.4 f} s" )print (f" 集合: {test_membership(s, 50000 ):.4 f} s" )print ("查找不存在的元素:" )print (f" 列表: {test_membership(lst, 200000 ):.4 f} s" )print (f" 集合: {test_membership(s, 200000 ):.4 f} s" )
结论 :集合的成员检查比列表快 100-1000 倍 !
frozenset:不可变集合 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 fs = frozenset ([1 , 2 , 3 ]) d = {frozenset ([1 , 2 ]): 'value' } print (d[frozenset ([1 , 2 ])]) nested = {frozenset ([1 , 2 ]), frozenset ([3 , 4 ])} print (nested)
📊 字典(dict):Python 的核心数据结构 字典的底层原理 字典和集合一样,基于哈希表 实现:
1 2 3 4 5 6 7 8 d = {i: i * 2 for i in range (100000 )} import timestart = time.time() for _ in range (10000 ): d[50000 ] print (f"字典查找: {time.time() - start:.4 f} s" )
字典的内部结构(简化版) :
1 2 3 4 5 6 7 8 9 10 11 12 字典 = { key1: value1, key2: value2, } 哈希表: +-------+-------+-------+ | hash | key | value | +-------+-------+-------+ | 12345 | key1 | value1| | 67890 | key2 | value2| +-------+-------+-------+
字典的高级用法 1. 字典视图 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 d = {'a' : 1 , 'b' : 2 , 'c' : 3 } keys = d.keys() values = d.values() items = d.items() print (type (keys)) print (type (values)) d['d' ] = 4 print (list (keys)) d1 = {'a' : 1 , 'b' : 2 } d2 = {'b' : 20 , 'c' : 3 } common_keys = d1.keys() & d2.keys() print (common_keys) common_items = d1.items() & d2.items() print (common_items)
2. 字典的合并与更新 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 d1 = {'a' : 1 , 'b' : 2 } d2 = {'c' : 3 , 'd' : 4 } merged = d1 | d2 print (merged) d1 |= d2 print (d1) d1 = {'a' : 1 , 'b' : 2 } d2 = {'b' : 20 , 'c' : 3 } merged = d1 | d2 print (merged) merged = {**d1, **d2} print (merged)
3. 字典的 get、setdefault、pop 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 d = {'a' : 1 , 'b' : 2 } print (d.get('a' )) print (d.get('c' )) print (d.get('c' , 0 )) d.setdefault('a' , 100 ) d.setdefault('c' , 3 ) print (d) value = d.pop('b' ) print (value) print (d) value = d.pop('x' , '不存在' ) print (value) d = {'a' : 1 , 'b' : 2 , 'c' : 3 } key, value = d.popitem() print (key, value)
🛠️ collections 模块的高级容器 defaultdict:自动初始化的字典 defaultdict 解决了字典键不存在时的处理问题:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 from collections import defaultdictword_counts = {} words = ['apple' , 'banana' , 'apple' , 'cherry' , 'banana' , 'apple' ] for word in words: if word not in word_counts: word_counts[word] = 0 word_counts[word] += 1 word_counts = defaultdict(int ) for word in words: word_counts[word] += 1 print (dict (word_counts))
defaultdict 的常用场景 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 from collections import defaultdictstudents = [ ('Alice' , 'Math' ), ('Bob' , 'Physics' ), ('Alice' , 'Physics' ), ('Charlie' , 'Math' ), ('Bob' , 'Chemistry' ), ] by_student = defaultdict(list ) for student, subject in students: by_student[student].append(subject) print (dict (by_student))def tree (): return defaultdict(tree) config = tree() config['database' ]['host' ] = 'localhost' config['database' ]['port' ] = 5432 config['cache' ]['redis' ]['host' ] = 'redis-server' print (config['database' ]['host' ]) print (config['database' ]['port' ]) print (config['cache' ]['redis' ]['host' ]) from collections import defaultdictcounter = defaultdict(int ) for char in 'hello world' : counter[char] += 1 print (dict (counter))from collections import defaultdicttags_by_article = defaultdict(set ) tags_by_article['article1' ].add('python' ) tags_by_article['article1' ].add('programming' ) tags_by_article['article2' ].add('python' ) tags_by_article['article2' ].add('web' ) print (dict (tags_by_article))
Counter:专业的计数器 Counter 是专门用于计数的字典子类:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 from collections import Counterwords = ['apple' , 'banana' , 'apple' , 'cherry' , 'banana' , 'apple' ] counter = Counter(words) print (counter) print (counter.most_common(2 )) print (counter.elements()) print (list (counter.elements())) counter.update(['apple' , 'durian' ]) print (counter) counter.subtract(['apple' , 'apple' ]) print (counter)
Counter 的高级用法 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 from collections import Countertext = "the quick brown fox jumps over the lazy dog" word_counter = Counter(text.split()) print (word_counter.most_common(3 ))char_counter = Counter('mississippi' ) print (char_counter) c1 = Counter(['a' , 'b' , 'c' , 'a' ]) c2 = Counter(['a' , 'b' , 'd' ]) print (c1 + c2) print (c1 - c2) print (c1 & c2) print (c1 | c2) data = [1 , 2 , 2 , 3 , 3 , 3 , 4 , 4 , 4 , 4 ] counter = Counter(data) duplicates = [item for item, count in counter.items() if count > 1 ] print (duplicates) def similarity (text1, text2 ): c1 = Counter(text1.split()) c2 = Counter(text2.split()) common = c1 & c2 return sum (common.values()) / max (sum (c1.values()), sum (c2.values())) t1 = "the quick brown fox" t2 = "the quick blue fox" print (f"相似度: {similarity(t1, t2):.2 %} " )
OrderedDict:有序字典 Python 3.7+ 的普通字典已经保持插入顺序,但 OrderedDict 仍有独特功能:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 from collections import OrderedDictod = OrderedDict() od['a' ] = 1 od['b' ] = 2 od['c' ] = 3 print (list (od.keys())) od.move_to_end('a' ) print (list (od.keys())) od.move_to_end('a' , last=False ) print (list (od.keys())) class LRUCache : def __init__ (self, capacity ): self.capacity = capacity self.cache = OrderedDict() def get (self, key ): if key not in self.cache: return -1 self.cache.move_to_end(key) return self.cache[key] def put (self, key, value ): if key in self.cache: self.cache.move_to_end(key) self.cache[key] = value if len (self.cache) > self.capacity: self.cache.popitem(last=False ) cache = LRUCache(2 ) cache.put(1 , 'a' ) cache.put(2 , 'b' ) cache.get(1 ) cache.put(3 , 'c' ) print (cache.cache)
ChainMap:链式字典 ChainMap 可以将多个字典合并为一个视图:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 from collections import ChainMapdefaults = {'theme' : 'dark' , 'language' : 'en' , 'timezone' : 'UTC' } user_config = {'theme' : 'light' , 'language' : 'zh' } system_config = {'timezone' : 'Asia/Shanghai' } config = ChainMap(user_config, system_config, defaults) print (config['theme' ]) print (config['language' ]) print (config['timezone' ]) print (config['debug' ]) print (list (config.keys()))config['debug' ] = True print ('debug' in user_config) import osfrom collections import ChainMapdefaults = {'host' : 'localhost' , 'port' : 8080 } env_config = { 'host' : os.environ.get('APP_HOST' ), 'port' : os.environ.get('APP_PORT' ), } env_config = {k: v for k, v in env_config.items() if v is not None } config = ChainMap({}, env_config, defaults) print (f"Server running on {config['host' ]} :{config['port' ]} " )
📊 性能对比与选择指南 各种容器的性能对比 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 import timefrom collections import defaultdict, Counterdef benchmark (func, iterations=10000 ): start = time.time() for _ in range (iterations): func() return time.time() - start data = list (range (10000 )) lst = data s = set (data) d = {x: x for x in data} print ("成员检查性能(查找存在的元素):" )print (f" list: {benchmark(lambda : 5000 in lst):.4 f} s" )print (f" set: {benchmark(lambda : 5000 in s):.4 f} s" )print (f" dict: {benchmark(lambda : 5000 in d):.4 f} s" )words = ['word' ] * 1000 def count_manual (): counts = {} for word in words: counts[word] = counts.get(word, 0 ) + 1 return counts def count_defaultdict (): counts = defaultdict(int ) for word in words: counts[word] += 1 return counts def count_counter (): return Counter(words) print ("\n计数性能:" )print (f" 手动计数: {benchmark(count_manual, 100 ):.4 f} s" )print (f" defaultdict: {benchmark(count_defaultdict, 100 ):.4 f} s" )print (f" Counter: {benchmark(count_counter, 100 ):.4 f} s" )
容器选择指南 需求 推荐容器 原因 去重 setO(1) 自动去重 快速成员检查 set / dictO(1) 查找 保持插入顺序 dict / listPython 3.7+ dict 有序 计数 Counter专业计数工具 分组 defaultdict(list)自动初始化列表 配置合并 ChainMap不复制数据 LRU 缓存 OrderedDict支持移动元素 不可变集合 frozenset可作为字典键
🎯 实战案例:文本分析 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 from collections import Counter, defaultdictimport reclass TextAnalyzer : """文本分析工具""" def __init__ (self, text ): self.text = text self.words = self._tokenize(text) def _tokenize (self, text ): """分词""" words = re.findall(r'\b\w+\b' , text.lower()) return words def word_frequency (self, top_n=10 ): """词频统计""" counter = Counter(self.words) return counter.most_common(top_n) def word_length_distribution (self ): """单词长度分布""" length_dist = Counter(len (word) for word in self.words) return dict (sorted (length_dist.items())) def ngrams (self, n=2 ): """N-gram 分析""" ngram_counter = Counter() for i in range (len (self.words) - n + 1 ): ngram = tuple (self.words[i:i+n]) ngram_counter[ngram] += 1 return ngram_counter.most_common(10 ) def vocabulary_richness (self ): """词汇丰富度(类型/标记比)""" return len (set (self.words)) / len (self.words) def find_collocations (self, word, window=2 ): """找出与指定词共现的词""" collocations = defaultdict(int ) for i, w in enumerate (self.words): if w == word: start = max (0 , i - window) end = min (len (self.words), i + window + 1 ) for j in range (start, end): if j != i: collocations[self.words[j]] += 1 return Counter(collocations).most_common(10 ) text = """ Python is an interpreted, high-level, general-purpose programming language. Created by Guido van Rossum and first released in 1991, Python's design philosophy emphasizes code readability with its notable use of significant whitespace. Python is dynamically typed and garbage-collected. """ analyzer = TextAnalyzer(text) print ("高频词:" , analyzer.word_frequency(5 ))print ("\n词长分布:" , analyzer.word_length_distribution())print ("\n二元语法:" , analyzer.ngrams(2 )[:5 ])print (f"\n词汇丰富度: {analyzer.vocabulary_richness():.2 %} " )print ("\n'python' 共现词:" , analyzer.find_collocations('python' ))
⚠️ 避坑指南 陷阱 1:修改集合时遍历 1 2 3 4 5 6 7 8 9 10 s = {1 , 2 , 3 , 4 , 5 } for item in s: if item % 2 == 0 : s.remove(item) s = {1 , 2 , 3 , 4 , 5 } s = {item for item in s if item % 2 != 0 } print (s)
陷阱 2:集合元素必须是可哈希的 1 2 3 4 5 6 7 8 9 10 11 12 s = {(1 , 2 ), (3 , 4 )} print (s) s = {frozenset ([('a' , 1 )]), frozenset ([('b' , 2 )])}
陷阱 3:Counter 的算术运算 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 from collections import Counterc1 = Counter('aabbcc' ) c2 = Counter('aabb' ) print (c1 - c2) print (c1 - c2 + c2) c3 = Counter('aaa' ) c4 = Counter('aaaaa' ) print (c3 - c4)
陷阱 4:defaultdict 的默认值陷阱 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 from collections import defaultdictd = defaultdict(list ) print (d['new_key' ]) print ('new_key' in d) d = defaultdict(int ) if d['count' ] > 0 : print ("有值" ) print (d) if 'count' in d and d['count' ] > 0 : print ("有值" )
🎯 本讲总结 通过本讲,我们掌握了:
知识点 核心要点 集合(set) 基于哈希表,O(1) 查找,自动去重,支持集合运算 frozenset 不可变集合,可作为字典键 字典(dict) Python 3.7+ 保持插入顺序,O(1) 查找 defaultdict 自动初始化默认值,适合分组、计数 Counter 专业计数工具,支持算术运算 OrderedDict 有序字典,支持 move_to_end ChainMap 链式映射,合并多个字典视图 性能关键 成员检查用 set,计数用 Counter
记住这句话 :
选择正确的容器类型,能让你的代码性能提升 100 倍,同时更简洁易读。
📚 推荐教材 《Python 编程从入门到实践(第 3 版)》 | 《流畅的 Python(第 2 版)》 | 《CPython 设计与实现》
学习路线: 零基础 → 《从入门到实践》 → 《流畅的 Python》 → 本门课程 → 《CPython 设计与实现》
🎓 加入《流畅的 Python》直播共读营 学到这里,如果你想系统吃透这本书——欢迎加入我的直播共读课。
每周直播精讲,逐章拆解核心知识点 专属学习群,随时答疑交流 试运营特惠:499 元 → 299 元 👉 【立即报名《流畅的 Python》共读课】 :https://mp.weixin.qq.com/s/ivHJwn1nNx5ug4TFrapvGg
🔗 课程导航 ← 上一讲:数据容器深度解析 | 下一讲:文本与字节 →
💬 联系我 主营业务 :AI 编程培训、企业内训、技术咨询
🎓 AI 编程实战课程 想系统学习 AI 编程?程序员晚枫的 AI 编程实战课 帮你从零上手!