Lecture 11: Implementing the String Type (Unicode and the Intern Mechanism)

Hi everyone, I'm Wanfeng (晚枫), a programmer currently building all kinds of AI projects hands-on.

Why can comparing strings with == and with is give different results? What is string interning all about? This lecture clears it up for good.

📖 Opening: A String Is Not a C char Array

In C, a string is just an array of characters. In Python, a string is a full-fledged object:

s = "hello"
print(len(s))     # 5
print(s[0])       # 'h'
print(id(s))      # object identity (the memory address in CPython)

# Strings are immutable!
# s[0] = 'H'  # TypeError!

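Immutability is a core property of Python strings: because the content can never change, the hash stays stable, which is what makes strings safe to use as dictionary keys and set elements.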
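To make the link between immutability and hashing concrete, here is a minimal sketch. The helper `try_mutate` and the `counts` dictionary are names made up for this illustration.

```python
def try_mutate(s):
    """Attempt item assignment on a string; return the exception name, or None."""
    try:
        s[0] = "H"
    except TypeError as exc:
        return type(exc).__name__
    return None


key = "hello"
counts = {key: 1}
counts["hello"] += 1  # an equal string finds the same dictionary entry

mutation_error = try_mutate(key)
hash_is_stable = hash(key) == hash("hel" + "lo")  # equal content, equal hash

print(counts)          # {'hello': 2}
print(mutation_error)  # TypeError
print(hash_is_stable)  # True
```

Since a string can never be modified after it has been placed in a dictionary, its hash can never go stale.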

🔤 PyASCIIObject / PyUnicodeObject

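// Include/cpython/unicodeobject.h (simplified; exact fields vary across CPython versions)
typedef struct {
    PyObject_HEAD
    Py_ssize_t length;      // number of code points
    Py_hash_t hash;         // cached hash (-1 means not yet computed)
    struct {
        unsigned int interned:2;  // intern state
        unsigned int kind:3;      // bytes per character: 1, 2, or 4
        unsigned int compact:1;   // whether the compact layout is used
        unsigned int ascii:1;     // whether the string is pure ASCII
    } state;
    // Compact strings have no data pointer here: the characters follow the struct directly.
    // The data pointer lives in PyUnicodeObject and is used only by non-compact strings.
} PyASCIIObject;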

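Python 3.3 (PEP 393) introduced a major optimization: a string is stored with 1, 2, or 4 bytes per character depending on its widest character, and pure-ASCII strings get the most compact layout: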

import sys

s1 = "hello"       # ASCII: kind=1, ascii=1
s2 = "你好"        # non-ASCII: kind=2 (2 bytes per character), ascii=0

print(sys.getsizeof(s1))  # smaller than s2, even though s1 has more characters
print(sys.getsizeof(s2))

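Compact layout: the character data is stored directly after the struct, in the same memory block, which avoids an extra pointer indirection.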
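The per-character width can be observed from Python without reading the C source. The sketch below assumes CPython, where `sys.getsizeof` reports the header plus the character data; the helper `bytes_per_char` is a name made up for this illustration. It grows a string by one character and measures the difference, so the header size cancels out.

```python
import sys


def bytes_per_char(ch, n=10):
    """Estimate storage per character by growing a string by one character."""
    return sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch * n)


samples = {
    "ASCII (U+0061)": "a",
    "Latin-1 (U+00E9)": "\u00e9",
    "BMP (U+4F60)": "\u4f60",
    "astral (U+1F600)": "\U0001F600",
}
widths = {label: bytes_per_char(ch) for label, ch in samples.items()}

for label, width in widths.items():
    print(f"{label}: {width} byte(s) per character")
```

On CPython the four widths should come out as 1, 1, 2, and 4, matching the possible values of the kind field.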

🎯 The Intern Mechanism

What is interning?

Interning means that strings with identical content share a single object in memory:

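# Strings interned automatically (literals known at compile time)
a = "hello"
b = "hello"
print(a is b)  # True in CPython: the same object

# Strings built at run time are not interned
prefix = "hel"
c = prefix + "lo"   # note: the literal "hel" + "lo" would be folded into "hello" at compile time
print(a == c)  # True: same content
print(a is c)  # False in CPython: a separate object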

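Why intern?

Saves memory: identical strings are stored only once
Faster comparison: is compares two pointers, whereas == may have to compare character by character
Dictionary key optimization: equal key strings share one object, so lookups can succeed on an identity check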
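The points above come down to the difference between content equality and object identity. A minimal sketch, assuming CPython: the helper `make_name` is made up for this illustration and exists only to build strings at run time, so the compiler cannot merge them.

```python
import sys


def make_name(parts):
    """Build a string at run time so it is not a compile-time constant."""
    return "".join(parts)


a = make_name(["user", "_", "id"])
b = make_name(["user", "_", "id"])

same_content = a == b                # compares content
same_object_before = a is b          # compares identity (False on CPython)
a, b = sys.intern(a), sys.intern(b)  # map both to the canonical interned copy
same_object_after = a is b

print(same_content, same_object_before, same_object_after)
```

This is why == is the right operator for comparing string values: is only answers whether two names refer to the same object, and that depends on whether interning happened.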

Manual interning

import sys

# Ordinary strings built at run time
n = 100  # a variable, so the compiler cannot fold the expression into one shared constant
s1 = "hello world" * n
s2 = "hello world" * n
print(s1 is s2)  # False in CPython

# Manual interning
s1_interned = sys.intern(s1)
s2_interned = sys.intern(s2)
print(s1_interned is s2_interned)  # True

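Typical use case

# Scenario: parsing text in which the same keywords appear many times
import sys

keywords = {sys.intern("def"), sys.intern("class"), sys.intern("return")}

text = """def foo():
    return 42
class Bar:
    pass"""
# Intern each token so that repeated words share one object
tokens = [sys.intern(word) for word in text.split()]
for token in tokens:
    print(token, token in keywords)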

🔍 Memory Layout of Strings

import sys

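# ASCII string
s1 = "abc"
print(sys.getsizeof(s1))   # e.g. 52 bytes on 64-bit CPython 3.11 (struct + compact data); varies by version

# Non-ASCII string (Chinese)
s2 = "你好"
print(sys.getsizeof(s2))   # e.g. 78 bytes on 64-bit CPython 3.11 (larger header, 2 bytes per character)

# Empty string (a single shared object)
s3 = ""
s4 = ""
print(s3 is s4)  # True in CPython: the empty string is a singleton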

📝 Performance of String Concatenation

import time

# Approach 1: += in a loop (can be slow: each step may build a new string)
start = time.perf_counter()
s = ""
for _ in range(10000):
    s += "x"
end = time.perf_counter()
print(f"+ concatenation: {end - start:.3f}s")

# Approach 2: join (fast: the total size is computed up front)
start = time.perf_counter()
s = "".join(["x"] * 10000)
end = time.perf_counter()
print(f"join: {end - start:.3f}s")

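Why: strings are immutable, so in the general case each + creates a new string object and copies the existing content, which makes the loop O(n²) overall. join computes the total length first, allocates once, and copies each piece exactly once. CPython can sometimes extend the string in place when nothing else references it, so the measured gap may be small, but that is an implementation detail and should not be relied on.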
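In real code the pieces usually arrive one at a time inside a loop, so a common pattern is to collect them in a list and join once at the end. The sketch below compares the two approaches with timeit; the function names `build_with_plus` and `build_with_join` are made up for this illustration, and the timings depend on the machine and Python version.

```python
import timeit


def build_with_plus(n):
    """Accumulate with +=; in the general case each step may copy the whole string."""
    s = ""
    for i in range(n):
        s += str(i % 10)
    return s


def build_with_join(n):
    """Collect the pieces in a list, then copy each piece once in join."""
    parts = []
    for i in range(n):
        parts.append(str(i % 10))
    return "".join(parts)


n = 10_000
same_result = build_with_plus(n) == build_with_join(n)
t_plus = timeit.timeit(lambda: build_with_plus(n), number=20)
t_join = timeit.timeit(lambda: build_with_join(n), number=20)

print(f"same result: {same_result}")
print(f"+=   : {t_plus:.4f}s")
print(f"join : {t_join:.4f}s")
```

Both functions produce the same string. The list-and-join version has linear cost no matter how the interpreter handles +=.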

⚠️ The Difference Between str and bytes

# str: a sequence of Unicode characters
s = "你好"
print(len(s))        # 2 (number of characters)
print(s.encode())     # b'\xe4\xbd\xa0\xe5\xa5\xbd' (UTF-8 bytes)

# bytes: a sequence of raw bytes
b = b"\xe4\xbd\xa0\xe5\xa5\xbd"
print(len(b))         # 6 (number of bytes)
print(b.decode())     # 你好
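One consequence of the distinction above is that the same two characters take a different number of bytes in different encodings, and that a byte sequence cut in the middle of a character cannot be decoded. A minimal sketch follows; the helper `safe_decode` is a name made up for this illustration.

```python
text = "\u4f60\u597d"  # the same two characters as "你好" above

utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-le")
lengths = (len(text), len(utf8), len(utf16))
print(lengths)  # (2, 6, 4)


def safe_decode(data):
    """Decode UTF-8 bytes, returning None when the bytes are not valid UTF-8."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return None


truncated = utf8[:4]  # cuts the second character after its first byte
full_result = safe_decode(utf8)
truncated_result = safe_decode(truncated)
print(truncated_result)  # None

# errors="replace" substitutes U+FFFD for undecodable bytes instead of raising
lenient = truncated.decode("utf-8", errors="replace")
print(len(lenient))
```

Slicing a str always yields whole characters, while slicing bytes works on raw bytes and can split a character in half.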