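Lecture 3: Python Sets and Mappings | A Deep Dive into dict, set, defaultdict, and Counter

Hi everyone, I'm 程序员晚枫, a programmer working hands-on with all kinds of AI projects.

🎬 Opening: What a Deduplication Problem Can Teach Us

Have you ever written code like this?

# Requirement: collect every distinct word that appears in a text
text = "the quick brown fox jumps over the lazy dog the fox"
words = text.split()

# ❌ Traditional approach: every `not in` check scans the whole list
unique_words = []
for word in words:
    if word not in unique_words:
        unique_words.append(word)
print(unique_words)
# ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'lazy', 'dog']

# ✅ set approach: a single line (note: the resulting order is arbitrary)
unique_words = list(set(words))
print(unique_words)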
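
One caveat about the set one-liner: a set does not remember the order in which words first appeared. When that order matters, a common alternative is to deduplicate through dict keys. The snippet below is a minimal sketch of that idea, reusing the same sample sentence:

```python
text = "the quick brown fox jumps over the lazy dog the fox"
words = text.split()

# dict keys are unique and (since Python 3.7) keep insertion order,
# so this removes duplicates while preserving first-seen order.
unique_words = list(dict.fromkeys(words))
print(unique_words)
# ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'lazy', 'dog']
```

Like the set version, this relies on hashing, so each word is handled in roughly constant time.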

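But that raises a question: why is deduplication with a set so fast, and what is going on underneath?

Today we take a close look at Python's set and its mapping type, dict. Both are built on hash tables, and they are among the most powerful data structures in Python.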

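🔥 Sets (set): The Go-To Tool for Deduplication and Set Operations

What Is a Set?

A set is a built-in Python type with the following properties:

Unordered: elements have no fixed order
Unique: duplicates are removed automatically
Mutable: elements can be added and removed (frozenset is the immutable version)
Elements must be hashable: lists, dicts, and other mutable containers cannot be set elements

# Creating sets
s1 = {1, 2, 3}          # set literal with curly braces
s2 = set([1, 2, 2, 3])  # from a list
s3 = set('hello')       # from a string

print(s1)  # {1, 2, 3}
print(s2)  # {1, 2, 3} - duplicates removed
print(s3)  # e.g. {'h', 'e', 'l', 'o'} - only one 'l'; the order may vary

# An empty set must be created with set()
empty = set()    # correct
# empty = {}     # wrong: this is an empty dict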

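How Sets Work Under the Hood: Hash Tables

Python sets are implemented with hash tables, which determines their performance characteristics:

import time

# List membership check: O(n), slows down as the data grows
lst = list(range(100000))
start = time.time()
for _ in range(1000):
    99999 in lst
print(f"list check: {time.time() - start:.4f}s")  # roughly 0.5s (machine-dependent)

# Set membership check: O(1) on average, roughly constant time
s = set(range(100000))
start = time.time()
for _ in range(1000):
    99999 in s
print(f"set check: {time.time() - start:.4f}s")  # roughly 0.0001s (machine-dependent)

Why is set lookup so fast?

A hash table lookup works like this:

Compute the element's hash value: hash(element)
Use the hash value to locate a "bucket"
Access that slot directly and compare the stored element for equality (O(1) on average; collisions cost a few extra probes)

# Understanding hash values
print(hash(42))        # 42 - in CPython, small integers hash to themselves (except -1)
print(hash('hello'))   # stable within one process; differs between runs unless PYTHONHASHSEED is set
print(hash((1, 2, 3))) # tuples of hashable items are hashable

# Lists are not hashable
# hash([1, 2, 3])  # TypeError: unhashable type: 'list'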
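
The three lookup steps can be sketched with a toy hash set. This is an illustration only: the class name ToyHashSet and its fixed bucket count are assumptions made for the example, and it uses separate chaining, whereas CPython's real set uses open addressing and resizes as it grows.

```python
class ToyHashSet:
    """A toy hash set using separate chaining (illustrative only)."""

    def __init__(self, n_buckets=8):
        self._buckets = [[] for _ in range(n_buckets)]

    def _bucket_for(self, item):
        # Steps 1 and 2: hash the item, then map the hash to a bucket index.
        return self._buckets[hash(item) % len(self._buckets)]

    def add(self, item):
        bucket = self._bucket_for(item)
        if item not in bucket:  # equality check inside one small bucket only
            bucket.append(item)

    def __contains__(self, item):
        # Step 3: search only the single bucket, never the whole table.
        return item in self._bucket_for(item)


toy = ToyHashSet()
for word in ["the", "quick", "brown", "fox", "the"]:
    toy.add(word)

print("fox" in toy)  # True
print("dog" in toy)  # False
```

Because each lookup touches one bucket, the cost stays roughly constant as long as the buckets stay short, which is why real implementations grow the table as elements are added.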

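Set Operations

Sets support the standard operations of mathematical set theory:

a = {1, 2, 3, 4, 5}
b = {4, 5, 6, 7, 8}

# Intersection: elements present in both sets
print(a & b)        # {4, 5}
print(a.intersection(b))  # same result

# Union: all elements from both sets
print(a | b)        # {1, 2, 3, 4, 5, 6, 7, 8}
print(a.union(b))   # same result

# Difference: elements in a but not in b
print(a - b)        # {1, 2, 3}
print(a.difference(b))  # same result

# Symmetric difference: elements in exactly one of the two sets
print(a ^ b)        # {1, 2, 3, 6, 7, 8}
print(a.symmetric_difference(b))  # same result

# Subset and superset tests
c = {1, 2}
print(c.issubset(a))      # True - c is a subset of a
print(a.issuperset(c))    # True - a is a superset of c
print(a.isdisjoint(b))    # False - a and b share elements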

Set Operations in Practice

# Scenario 1: find the vocabulary shared by two articles
article1_words = {'python', 'programming', 'code', 'data'}
article2_words = {'java', 'programming', 'code', 'web'}

common = article1_words & article2_words
print(f"shared words: {common}")  # {'code', 'programming'}

# Scenario 2: find the words that appear only in article 1
unique_to_article1 = article1_words - article2_words
print(f"only in article 1: {unique_to_article1}")  # {'python', 'data'}

# Scenario 3: merge all words (deduplicated)
all_words = article1_words | article2_words
print(f"all words: {all_words}")

# Scenario 4: find the mutual friends of two people
alice_friends = {'Bob', 'Charlie', 'David', 'Eve'}
bob_friends = {'Charlie', 'David', 'Frank', 'Grace'}

mutual_friends = alice_friends & bob_friends
print(f"mutual friends: {mutual_friends}")  # {'Charlie', 'David'}

# Scenario 5: check permissions
user_permissions = {'read', 'write', 'execute'}
required_permissions = {'read', 'execute'}

if required_permissions.issubset(user_permissions):
    print("sufficient permissions, operation allowed")
else:
    print("insufficient permissions")
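
The same operators combine naturally when comparing two snapshots of something, such as a directory listing before and after a sync. The sketch below is a hypothetical helper (the name diff_snapshots and the file names are made up for illustration) that reports what was added, removed, and kept:

```python
def diff_snapshots(before, after):
    """Return (added, removed, unchanged) between two collections of names."""
    before, after = set(before), set(after)
    return after - before, before - after, before & after


added, removed, unchanged = diff_snapshots(
    ["a.txt", "b.txt", "c.txt"],
    ["b.txt", "c.txt", "d.txt"],
)
print(sorted(added))      # ['d.txt']
print(sorted(removed))    # ['a.txt']
print(sorted(unchanged))  # ['b.txt', 'c.txt']
```

Sorting before printing keeps the output stable, since sets themselves have no defined order.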

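Set vs. List Performance

import time

# Test: membership checks on a list vs. a set
def test_membership(container, item, iterations=10000):
    start = time.time()
    for _ in range(iterations):
        item in container
    return time.time() - start

# Build a large data set
lst = list(range(100000))
s = set(range(100000))

# Look up an element that exists
print("looking up an existing element:")
print(f"  list: {test_membership(lst, 50000):.4f}s")
print(f"  set:  {test_membership(s, 50000):.4f}s")

# Look up an element that does not exist
print("looking up a missing element:")
print(f"  list: {test_membership(lst, 200000):.4f}s")
print(f"  set:  {test_membership(s, 200000):.4f}s")

Conclusion: set membership checks are orders of magnitude faster than list checks, often 100x to 1000x or more. The exact ratio depends on the size of the data and on where the element sits in the list.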
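
For more dependable numbers than hand-rolled time.time() loops, the standard library's timeit module is the usual choice. The following is a minimal sketch of the same comparison; the data size and repeat count are arbitrary choices for the example, and the measured ratio will differ from machine to machine.

```python
import timeit

n = 100_000
lst = list(range(n))
s = set(lst)
missing = -1  # worst case for the list: every element gets compared

# timeit.timeit calls the function `number` times and returns total seconds.
list_time = timeit.timeit(lambda: missing in lst, number=50)
set_time = timeit.timeit(lambda: missing in s, number=50)

print(f"list: {list_time:.6f}s")
print(f"set:  {set_time:.6f}s")
print(f"ratio: about {list_time / set_time:.0f}x")
```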

frozenset: The Immutable Set

# A frozenset cannot be modified after creation
fs = frozenset([1, 2, 3])
# fs.add(4)  # AttributeError - cannot add
# fs.remove(1)  # AttributeError - cannot remove

# Use case: as a dict key or as an element of another set
# A regular set cannot be a dict key
# d = {{1, 2}: 'value'}  # TypeError

# A frozenset can
d = {frozenset([1, 2]): 'value'}
print(d[frozenset([1, 2])])  # 'value'

# Nested sets
nested = {frozenset([1, 2]), frozenset([3, 4])}
print(nested)
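
One practical use of frozenset keys is a lookup table for unordered pairs, where (x, y) and (y, x) should mean the same thing. The sketch below assumes a made-up distance table; the place names and numbers are purely illustrative.

```python
# Hypothetical symmetric distance table; names and values are made up.
distances = {
    frozenset({"A", "B"}): 120,
    frozenset({"B", "C"}): 75,
}


def distance(x, y):
    """Look up a symmetric distance regardless of argument order."""
    return distances[frozenset({x, y})]


print(distance("A", "B"))  # 120
print(distance("B", "A"))  # 120
```

With tuple keys the table would need both orderings, or the caller would have to sort the pair first.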

📊 Dictionaries (dict): Python's Core Data Structure

How Dictionaries Work Under the Hood

Like sets, dictionaries are built on hash tables:

# Dictionary lookup is O(1) on average
d = {i: i * 2 for i in range(100000)}

import time
start = time.time()
for _ in range(10000):
    d[50000]
print(f"dict lookup: {time.time() - start:.4f}s")  # roughly 0.001s (machine-dependent)

The internal structure of a dictionary (simplified):

dict = {
    key1: value1,
    key2: value2,
}

Hash table:
+-------+-------+-------+
| hash  | key   | value |
+-------+-------+-------+
| 12345 | key1  | value1|
| 67890 | key2  | value2|
+-------+-------+-------+
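
Because each entry stores both the hash and the key, a lookup matches on two things: equal hash and equal key. The sketch below illustrates the consequence with built-in numbers and with a small Point class, which is a hypothetical example type defined here only for illustration.

```python
# Keys that compare equal and hash equal are treated as the same key.
d = {1: "int"}
d[1.0] = "float"  # 1 == 1.0 and hash(1) == hash(1.0), so the value is overwritten
d[True] = "bool"  # True == 1 as well
print(d)          # {1: 'bool'}


class Point:
    """Hypothetical value type that is usable as a dict key."""

    def __init__(self, x, y):
        self.x, self.y = x, y

    def __eq__(self, other):
        return isinstance(other, Point) and (self.x, self.y) == (other.x, other.y)

    def __hash__(self):
        # Objects that compare equal must produce equal hashes.
        return hash((self.x, self.y))


visits = {Point(0, 0): "origin"}
print(visits[Point(0, 0)])  # origin
```

The second lookup uses a different Point object, yet it finds the entry because the hash and the equality check both agree.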

Advanced Dictionary Usage

1. Dictionary views

d = {'a': 1, 'b': 2, 'c': 3}

# keys(), values(), and items() return views, not lists
keys = d.keys()
values = d.values()
items = d.items()

print(type(keys))   # <class 'dict_keys'>
print(type(values)) # <class 'dict_values'>

# Views are dynamic
d['d'] = 4
print(list(keys))   # ['a', 'b', 'c', 'd'] - updated automatically

# keys() views (and items() views with hashable values) support set operations
d1 = {'a': 1, 'b': 2}
d2 = {'b': 20, 'c': 3}

# Find the keys the two dicts share
common_keys = d1.keys() & d2.keys()
print(common_keys)  # {'b'}

# Find the items whose key and value both match
common_items = d1.items() & d2.items()
print(common_items)  # set() - no identical key-value pairs
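
Set operations on views make it easy to compare two versions of a dict. The following sketch diffs two hypothetical configuration dicts (the keys and values are invented for the example):

```python
old = {"host": "localhost", "port": 8080, "debug": True}
new = {"host": "example.org", "port": 8080, "timeout": 30}

added = new.keys() - old.keys()
removed = old.keys() - new.keys()
# Keys present in both dicts whose values differ.
changed = {k for k in old.keys() & new.keys() if old[k] != new[k]}

print(sorted(added))    # ['timeout']
print(sorted(removed))  # ['debug']
print(sorted(changed))  # ['host']
```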

2. Merging and updating dictionaries

# Merge operators, Python 3.9+
d1 = {'a': 1, 'b': 2}
d2 = {'c': 3, 'd': 4}

# Merge (creates a new dict)
merged = d1 | d2
print(merged)  # {'a': 1, 'b': 2, 'c': 3, 'd': 4}

# In-place update
d1 |= d2
print(d1)  # {'a': 1, 'b': 2, 'c': 3, 'd': 4}

# Handling key conflicts
d1 = {'a': 1, 'b': 2}
d2 = {'b': 20, 'c': 3}

merged = d1 | d2
print(merged)  # {'a': 1, 'b': 20, 'c': 3} - d2's value wins over d1's

# Unpacking works on Python 3.5+ and was the usual idiom before 3.9
merged = {**d1, **d2}
print(merged)  # {'a': 1, 'b': 20, 'c': 3}

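3. get, setdefault, and pop

d = {'a': 1, 'b': 2}

# get(): safe lookup that returns a default when the key is missing
print(d.get('a'))      # 1
print(d.get('c'))      # None
print(d.get('c', 0))   # 0

# setdefault(): return the existing value, or insert the default and return it
d.setdefault('a', 100)  # returns 1, no change (key already exists)
d.setdefault('c', 3)    # returns 3, adds the new key-value pair
print(d)  # {'a': 1, 'b': 2, 'c': 3}

# pop(): remove the key and return its value
value = d.pop('b')
print(value)  # 2
print(d)      # {'a': 1, 'c': 3}

# pop() with a default
value = d.pop('x', 'missing')
print(value)  # missing - no KeyError raised

# popitem(): remove and return the most recently inserted key-value pair (Python 3.7+)
d = {'a': 1, 'b': 2, 'c': 3}
key, value = d.popitem()
print(key, value)  # c 3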

🛠️ Advanced Containers in the collections Module

defaultdict: A Dictionary That Initializes Itself

defaultdict handles the missing-key case for you:

from collections import defaultdict

# The problem with a plain dict
word_counts = {}
words = ['apple', 'banana', 'apple', 'cherry', 'banana', 'apple']

for word in words:
    # The key must be checked first
    if word not in word_counts:
        word_counts[word] = 0
    word_counts[word] += 1

# defaultdict takes care of it
word_counts = defaultdict(int)  # int() returns 0, so the default value is 0
for word in words:
    word_counts[word] += 1  # no check needed

print(dict(word_counts))  # {'apple': 3, 'banana': 2, 'cherry': 1}

Common defaultdict Scenarios

from collections import defaultdict

# Scenario 1: grouping
students = [
    ('Alice', 'Math'),
    ('Bob', 'Physics'),
    ('Alice', 'Physics'),
    ('Charlie', 'Math'),
    ('Bob', 'Chemistry'),
]

by_student = defaultdict(list)
for student, subject in students:
    by_student[student].append(subject)

print(dict(by_student))
# {'Alice': ['Math', 'Physics'], 'Bob': ['Physics', 'Chemistry'], 'Charlie': ['Math']}

# Scenario 2: building a tree structure
def tree():
    return defaultdict(tree)

# Nested levels are created automatically
config = tree()
config['database']['host'] = 'localhost'
config['database']['port'] = 5432
config['cache']['redis']['host'] = 'redis-server'

print(config['database']['host'])  # localhost
print(config['database']['port'])  # 5432
print(config['cache']['redis']['host'])  # redis-server

# Scenario 3: a counter (simplified)
from collections import defaultdict

counter = defaultdict(int)
for char in 'hello world':
    counter[char] += 1

print(dict(counter))
# {'h': 1, 'e': 1, 'l': 3, 'o': 2, ' ': 1, 'w': 1, 'r': 1, 'd': 1}

# Scenario 4: a dict of sets
from collections import defaultdict

tags_by_article = defaultdict(set)
tags_by_article['article1'].add('python')
tags_by_article['article1'].add('programming')
tags_by_article['article2'].add('python')
tags_by_article['article2'].add('web')

print(dict(tags_by_article))
# {'article1': {'python', 'programming'}, 'article2': {'python', 'web'}} (order inside each set may vary)
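
A self-nesting defaultdict keeps creating keys on every read, and its repr is noisy. Once the structure is built, it can help to freeze it into plain dicts. The helper below, to_plain, is a hypothetical name introduced for this sketch and not part of the standard library.

```python
import json
from collections import defaultdict


def tree():
    return defaultdict(tree)


def to_plain(node):
    """Recursively convert nested defaultdicts into plain dicts."""
    if isinstance(node, defaultdict):
        return {key: to_plain(value) for key, value in node.items()}
    return node


config = tree()
config["database"]["host"] = "localhost"
config["cache"]["redis"]["host"] = "redis-server"

plain = to_plain(config)
print(json.dumps(plain, indent=2))
```

After the conversion, a typo such as plain["databse"] raises KeyError instead of silently creating an empty branch.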

Counter: A Purpose-Built Counting Tool

Counter is a dict subclass designed specifically for counting:

from collections import Counter

# Create a Counter
words = ['apple', 'banana', 'apple', 'cherry', 'banana', 'apple']
counter = Counter(words)

print(counter)  # Counter({'apple': 3, 'banana': 2, 'cherry': 1})

# Common methods
print(counter.most_common(2))  # [('apple', 3), ('banana', 2)] - the 2 most common
print(counter.elements())       # an iterator (itertools.chain object) that repeats each element by its count
print(list(counter.elements())) # ['apple', 'apple', 'apple', 'banana', 'banana', 'cherry']

# Add counts
counter.update(['apple', 'durian'])
print(counter)  # Counter({'apple': 4, 'banana': 2, 'cherry': 1, 'durian': 1})

# Subtract counts
counter.subtract(['apple', 'apple'])
print(counter)  # Counter({'apple': 2, 'banana': 2, 'cherry': 1, 'durian': 1})

Advanced Counter Usage

from collections import Counter

# 1. Counting words in a string
text = "the quick brown fox jumps over the lazy dog"
word_counter = Counter(text.split())
print(word_counter.most_common(3))

# 2. Character frequency
char_counter = Counter('mississippi')
print(char_counter)  # Counter({'i': 4, 's': 4, 'p': 2, 'm': 1})

# 3. Counter arithmetic
c1 = Counter(['a', 'b', 'c', 'a'])
c2 = Counter(['a', 'b', 'd'])

print(c1 + c2)  # Counter({'a': 3, 'b': 2, 'c': 1, 'd': 1})
print(c1 - c2)  # Counter({'a': 1, 'c': 1}) - keeps only positive counts
print(c1 & c2)  # Counter({'a': 1, 'b': 1}) - intersection (minimum of each count)
print(c1 | c2)  # Counter({'a': 2, 'b': 1, 'c': 1, 'd': 1}) - union (maximum of each count)

# 4. Finding duplicated elements
data = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4]
counter = Counter(data)
duplicates = [item for item, count in counter.items() if count > 1]
print(duplicates)  # [2, 3, 4]

# 5. String similarity (comparing word frequencies)
def similarity(text1, text2):
    c1 = Counter(text1.split())
    c2 = Counter(text2.split())
    common = c1 & c2
    return sum(common.values()) / max(sum(c1.values()), sum(c2.values()))

t1 = "the quick brown fox"
t2 = "the quick blue fox"
print(f"similarity: {similarity(t1, t2):.2%}")  # similarity: 75.00%

OrderedDict: The Ordered Dictionary

Since Python 3.7 the plain dict preserves insertion order, but OrderedDict still offers features of its own:

from collections import OrderedDict

# Create an ordered dict
od = OrderedDict()
od['a'] = 1
od['b'] = 2
od['c'] = 3
print(list(od.keys()))  # ['a', 'b', 'c']

# A method plain dicts lack: move_to_end
od.move_to_end('a')  # move 'a' to the end
print(list(od.keys()))  # ['b', 'c', 'a']

od.move_to_end('a', last=False)  # move 'a' to the front
print(list(od.keys()))  # ['a', 'b', 'c']

# Implementing an LRU cache
class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.cache = OrderedDict()

    def get(self, key):
        if key not in self.cache:
            return -1
        # On access, move the key to the end (most recently used)
        self.cache.move_to_end(key)
        return self.cache[key]

    def put(self, key, value):
        if key in self.cache:
            # Existing key: move it to the end before updating
            self.cache.move_to_end(key)
        self.cache[key] = value
        # Over capacity: evict the least recently used entry (the first one)
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)

# Usage
cache = LRUCache(2)
cache.put(1, 'a')
cache.put(2, 'b')
cache.get(1)        # returns 'a'; 1 is now the most recently used
cache.put(3, 'c')   # over capacity, so the least recently used key (2) is evicted
print(cache.cache)  # OrderedDict([(1, 'a'), (3, 'c')]); Python 3.12+ prints OrderedDict({1: 'a', 3: 'c'})
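
When the goal is simply to cache the results of a function, the standard library already ships an LRU cache as a decorator, functools.lru_cache. The sketch below mirrors the access pattern of the LRUCache example; the function square and the calls list are made up to make evictions visible.

```python
from functools import lru_cache

calls = []  # records which arguments actually reached the function body


@lru_cache(maxsize=2)
def square(n):
    calls.append(n)
    return n * n


square(1)
square(2)
square(1)  # cache hit; 1 becomes the most recently used entry
square(3)  # cache is full, so the least recently used entry (2) is evicted
square(2)  # miss: computed again

print(calls)  # [1, 2, 3, 2]
info = square.cache_info()
print(info.hits, info.misses)  # 1 4
```

A hand-written OrderedDict cache remains useful when you need explicit get and put operations or want to inspect and modify the cached entries directly.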

ChainMap: Chained Dictionaries

ChainMap groups several dicts into a single view without copying them:

from collections import ChainMap

# Create a chained mapping
defaults = {'theme': 'dark', 'language': 'en', 'timezone': 'UTC'}
user_config = {'theme': 'light', 'language': 'zh'}
system_config = {'timezone': 'Asia/Shanghai'}

config = ChainMap(user_config, system_config, defaults)

# Lookups search the dicts in order and return the first match
print(config['theme'])     # light - from user_config
print(config['language'])  # zh - from user_config
print(config['timezone'])  # Asia/Shanghai - from system_config
# print(config['debug'])   # KeyError - not found in any dict

# List all keys
print(list(config.keys()))
# ['theme', 'language', 'timezone']

# Writes always go to the first dict
config['debug'] = True
print('debug' in user_config)  # True

# Practical application: configuration management
import os
from collections import ChainMap

# Priority: command-line arguments > environment variables > defaults
defaults = {'host': 'localhost', 'port': 8080}
env_config = {
    'host': os.environ.get('APP_HOST'),
    'port': os.environ.get('APP_PORT'),  # note: environment values are strings
}
# Drop the entries that are None (variable not set)
env_config = {k: v for k, v in env_config.items() if v is not None}

# The empty first dict stands in for command-line arguments
config = ChainMap({}, env_config, defaults)
print(f"Server running on {config['host']}:{config['port']}")
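
ChainMap also supports temporary, scoped overrides through new_child() and parents. The sketch below assumes a made-up session configuration and shows that writes to a child never leak into the layers underneath.

```python
from collections import ChainMap

defaults = {"theme": "dark", "language": "en"}
session = ChainMap({}, defaults)

# new_child() puts a fresh, empty dict at the front of the chain.
request_scope = session.new_child()
request_scope["theme"] = "light"  # stored in the new front dict only

print(request_scope["theme"])          # light
print(request_scope["language"])       # en (falls through to defaults)
print(session["theme"])                # dark (the parent chain is untouched)
print(request_scope.parents["theme"])  # dark
```

Dropping the child object is enough to discard the overrides, which makes this pattern convenient for per-request or per-test settings.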

📊 Performance Comparison and Selection Guide

Comparing Container Performance

import time
from collections import defaultdict, Counter

def benchmark(func, iterations=10000):
    start = time.time()
    for _ in range(iterations):
        func()
    return time.time() - start

# Membership checks
data = list(range(10000))
lst = data
s = set(data)
d = {x: x for x in data}

print("membership check (element exists):")
print(f"  list:    {benchmark(lambda: 5000 in lst):.4f}s")
print(f"  set:     {benchmark(lambda: 5000 in s):.4f}s")
print(f"  dict:    {benchmark(lambda: 5000 in d):.4f}s")

# Counting
words = ['word'] * 1000

def count_manual():
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return counts

def count_defaultdict():
    counts = defaultdict(int)
    for word in words:
        counts[word] += 1
    return counts

def count_counter():
    return Counter(words)

print("\ncounting:")
print(f"  manual:       {benchmark(count_manual, 100):.4f}s")
print(f"  defaultdict:  {benchmark(count_defaultdict, 100):.4f}s")
print(f"  Counter:      {benchmark(count_counter, 100):.4f}s")

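Container Selection Guide

Need	Recommended container	Why
Deduplication	`set`	Removes duplicates automatically, O(1) average per insert
Fast membership checks	`set` / `dict`	O(1) average lookup
Preserving insertion order	`dict` / `list`	dict is ordered in Python 3.7+
Counting	`Counter`	Purpose-built counting tool
Grouping	`defaultdict(list)`	Initializes the list automatically
Merging configuration	`ChainMap`	Does not copy the data
LRU cache	`OrderedDict`	Supports reordering entries
Immutable set	`frozenset`	Can be used as a dict key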

🎯 Hands-On Example: Text Analysis

from collections import Counter, defaultdict
import re

class TextAnalyzer:
    """Text analysis tool."""

    def __init__(self, text):
        self.text = text
        self.words = self._tokenize(text)

    def _tokenize(self, text):
        """Split the text into lowercase words."""
        # Simple English tokenization (an apostrophe splits a word, e.g. "Python's" -> "python", "s")
        words = re.findall(r'\b\w+\b', text.lower())
        return words

    def word_frequency(self, top_n=10):
        """Word frequency statistics."""
        counter = Counter(self.words)
        return counter.most_common(top_n)

    def word_length_distribution(self):
        """Distribution of word lengths."""
        length_dist = Counter(len(word) for word in self.words)
        return dict(sorted(length_dist.items()))

    def ngrams(self, n=2):
        """N-gram analysis."""
        ngram_counter = Counter()
        for i in range(len(self.words) - n + 1):
            ngram = tuple(self.words[i:i+n])
            ngram_counter[ngram] += 1
        return ngram_counter.most_common(10)

    def vocabulary_richness(self):
        """Vocabulary richness (type/token ratio); 0.0 for empty text."""
        return len(set(self.words)) / len(self.words) if self.words else 0.0

    def find_collocations(self, word, window=2):
        """Find the words that co-occur with the given word."""
        collocations = defaultdict(int)
        for i, w in enumerate(self.words):
            if w == word:
                # Look at the words inside the window
                start = max(0, i - window)
                end = min(len(self.words), i + window + 1)
                for j in range(start, end):
                    if j != i:
                        collocations[self.words[j]] += 1
        return Counter(collocations).most_common(10)

# Usage example
text = """
Python is an interpreted, high-level, general-purpose programming language.
Created by Guido van Rossum and first released in 1991, Python's design philosophy
emphasizes code readability with its notable use of significant whitespace.
Python is dynamically typed and garbage-collected.
"""

analyzer = TextAnalyzer(text)

print("top words:", analyzer.word_frequency(5))
print("\nword length distribution:", analyzer.word_length_distribution())
print("\nbigrams:", analyzer.ngrams(2)[:5])
print(f"\nvocabulary richness: {analyzer.vocabulary_richness():.2%}")
print("\nwords co-occurring with 'python':", analyzer.find_collocations('python'))

⚠️ Pitfalls to Avoid

Pitfall 1: Modifying a set while iterating over it

# ❌ Wrong: modifying the set during iteration
s = {1, 2, 3, 4, 5}
for item in s:
    if item % 2 == 0:
        s.remove(item)  # RuntimeError: Set changed size during iteration

# ✅ Right: build a new set (or use set operations)
s = {1, 2, 3, 4, 5}
s = {item for item in s if item % 2 != 0}
print(s)  # {1, 3, 5}
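
The comprehension rebinds s to a new set. If other code holds a reference to the original set, an in-place fix may be needed instead. Two possible approaches are sketched below; the variable alias exists only to show that the same object is modified.

```python
s = {1, 2, 3, 4, 5}

# Option 1: iterate over a snapshot, mutate the original.
for item in list(s):
    if item % 2 == 0:
        s.discard(item)
print(s)  # {1, 3, 5}

# Option 2: collect what to drop first, then remove it in one in-place step.
t = {1, 2, 3, 4, 5}
alias = t
t -= {item for item in t if item % 2 == 0}
print(alias is t)  # True
print(alias)       # {1, 3, 5}
```

In option 2 the comprehension finishes before the in-place difference runs, so the set is never resized while it is being iterated.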

Pitfall 2: Set elements must be hashable

# ❌ Wrong: a list cannot be a set element
# s = {[1, 2], [3, 4]}  # TypeError

# ✅ Right: use tuples
s = {(1, 2), (3, 4)}
print(s)  # {(1, 2), (3, 4)}

# ❌ Wrong: a dict cannot be a set element
# s = {{'a': 1}, {'b': 2}}  # TypeError

# ✅ Right: use a frozenset (or tuple) of the dict's items
s = {frozenset([('a', 1)]), frozenset([('b', 2)])}
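
The frozenset-of-items trick is handy for deduplicating a list of dicts. The sketch below uses made-up records and assumes every value in them is hashable; nested lists or dicts as values would need further conversion.

```python
records = [
    {"name": "Alice", "lang": "python"},
    {"lang": "python", "name": "Alice"},  # same content, different key order
    {"name": "Bob", "lang": "go"},
]

seen = set()
unique = []
for record in records:
    key = frozenset(record.items())  # hashable as long as every value is hashable
    if key not in seen:
        seen.add(key)
        unique.append(record)

print(unique)
# [{'name': 'Alice', 'lang': 'python'}, {'name': 'Bob', 'lang': 'go'}]
```

Using frozenset rather than a tuple of items makes the key independent of the order in which the dict was built.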

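Pitfall 3: Counter arithmetic

from collections import Counter

c1 = Counter('aabbcc')
c2 = Counter('aabb')

# Subtraction keeps only positive counts
print(c1 - c2)  # Counter({'c': 2})

# This means c1 - c2 + c2 is not always equal to c1
print(c1 - c2 + c2)  # Counter({'c': 2, 'a': 2, 'b': 2}) - equal to c1 in this case

# But when a count would go negative...
c3 = Counter('aaa')
c4 = Counter('aaaaa')
print(c3 - c4)  # Counter() - empty!
print(c3 - c4 + c4)  # Counter({'a': 5}) - not equal to c3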
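
If the negative counts matter, for example when tracking a shortfall, the subtract() method keeps them, and the unary + and - operators filter them afterwards. The stock and orders counters below are made-up data for the sketch.

```python
from collections import Counter

stock = Counter(apple=3, pear=1)
orders = Counter(apple=5, pear=1)

# The - operator drops zero and negative results.
print(stock - orders)  # Counter()

# subtract() works in place and keeps zero and negative counts.
balance = Counter(stock)
balance.subtract(orders)
print(balance)  # Counter({'pear': 0, 'apple': -2})

# Unary + keeps strictly positive counts; unary - negates, then keeps the positives.
print(+balance)  # Counter()
print(-balance)  # Counter({'apple': 2})
```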

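Pitfall 4: defaultdict creates keys on access

from collections import defaultdict

d = defaultdict(list)

# Accessing a missing key creates an empty list
print(d['new_key'])  # []
print('new_key' in d)  # True - the key was created!

# This can cause surprising behavior
d = defaultdict(int)
if d['count'] > 0:  # the access itself creates the key
    print("has a value")
print(d)  # defaultdict(<class 'int'>, {'count': 0}) - the key was created

# Fix: check with `in` first
if 'count' in d and d['count'] > 0:
    print("has a value")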
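
Another way to read from a defaultdict without side effects is get(), which never calls the default factory. A minimal sketch:

```python
from collections import defaultdict

d = defaultdict(int)

value = d.get("count", 0)  # get() does not call the default factory
created = "count" in d
print(value, created)      # 0 False
print(len(d))              # 0

d["count"] += 1            # subscript access is what triggers the factory
print(dict(d))             # {'count': 1}
```

The factory runs only for subscript access on a missing key, so get() and `in` are both safe for read-only checks.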

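🎯 Lecture Summary

In this lecture we covered:

Topic	Key points
Sets (set)	Hash-table based, O(1) average lookup, automatic deduplication, set operations
frozenset	Immutable set, usable as a dict key
Dictionaries (dict)	Preserve insertion order in Python 3.7+, O(1) average lookup
defaultdict	Initializes default values automatically; good for grouping and counting
Counter	Purpose-built counting tool with arithmetic support
OrderedDict	Ordered dictionary with move_to_end
ChainMap	Chained mapping that views several dicts as one
Performance essentials	Use set for membership checks, Counter for counting

Remember this:

Choosing the right container type can make your code orders of magnitude faster (100x is not unusual for membership checks on large data), and more concise and readable at the same time.

Learning path: complete beginner → 《从入门到实践》 (Python Crash Course) → 《流畅的 Python》 (Fluent Python) → this course → 《CPython 设计与实现》 (CPython Internals)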
