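The io Module and Stream Processing

Advanced Python Programming Series · Python's I/O System and Streaming Data Processing

Series: A Systematic Course in Advanced Python Programming

Keywords: Python, io module, BytesIO, StringIO, stream processing, buffers, large-file processing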

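I. Overview of the Python I/O Architecture

Python's io module is the core of the standard library's input and output handling, and it defines a cleanly layered stream architecture. In Python 3 it backs the built-in open() and replaces the less consistent file model of Python 2 with a unified, object-oriented I/O interface. The central idea is to abstract data as a stream: a disk file, a network socket, and an in-memory buffer can all be read and written through the same interface.

The I/O system has three layers. At the bottom is raw I/O, which handles plain byte sequences. In the middle is buffered I/O, which adds a buffer on top of raw I/O to improve performance. At the top is text I/O, which encodes and decodes between bytes and strings. Because of this layering, each layer can be optimized or replaced independently, which is separation of concerns in practice.

Key concept: the I/O stack follows the decorator pattern. A TextIOWrapper wraps a buffered stream, and a buffered stream wraps a raw stream. Each layer adds behavior on top of the layer beneath it, and together they form the complete I/O stack.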

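Class hierarchy of the io module

The io module uses IOBase as its abstract base class, from which three main branches derive:

RawIOBase: raw byte reads and writes, unbuffered; a thin wrapper over an operating-system file descriptor
BufferedIOBase: adds a buffering layer on top of a raw stream, supporting block reads, read-ahead, and deferred writes
TextIOBase: handles text data and encapsulates an encoder and a decoder

This inheritance structure is the key to the whole I/O system. FileIO implements RawIOBase; BufferedReader, BufferedWriter, BufferedRandom, and BytesIO implement BufferedIOBase; StringIO and TextIOWrapper implement TextIOBase. The familiar open() function is effectively a factory: it picks the classes that match the mode and buffering arguments and assembles the complete I/O stack.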
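To make the factory behavior concrete, the following sketch opens one scratch file in three ways and records the class of each layer. The temporary file and the variable names are illustrative assumptions, not part of the original text.

```python
import os
import tempfile

# Create a small scratch file to open in different modes
fd, path = tempfile.mkstemp()
os.close(fd)

try:
    # Text mode: three layers stacked on top of each other
    with open(path, "w", encoding="utf-8") as f:
        f.write("hello\n")
        text_stack = (
            type(f).__name__,
            type(f.buffer).__name__,
            type(f.buffer.raw).__name__,
        )

    # Binary mode: buffered layer over the raw layer
    with open(path, "rb") as f:
        binary_stack = (type(f).__name__, type(f.raw).__name__)

    # buffering=0 is only allowed in binary mode and returns the raw layer itself
    with open(path, "rb", buffering=0) as f:
        raw_stack = (type(f).__name__,)
finally:
    os.remove(path)

print(text_stack)
print(binary_stack)
print(raw_stack)
```

The text-mode object exposes the layer below it as .buffer, and a buffered object exposes its raw stream as .raw, which is how the sketch walks down the stack.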

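II. The IOBase Abstract Base Class Hierarchy

IOBase is the abstract base class of every I/O class and defines the basic interface. Its main members are close(), closed (an attribute), __enter__ and __exit__ (context-manager support), fileno(), flush(), isatty(), readable(), seekable(), writable(), readline(), readlines(), writelines(), seek(), tell(), and truncate(). IOBase itself does not declare read() or write(), because their signatures differ between the raw, buffered, and text layers; each branch defines its own.

The point most worth noting is that IOBase implements the context-manager protocol. Every I/O object can therefore be used in a with statement so that its resources are released automatically, which is the standard way to handle file resources in Python.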
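A minimal sketch of the context-manager behavior, using an in-memory BytesIO so that nothing touches the disk (the variable names are illustrative):

```python
import io

with io.BytesIO(b"payload") as stream:
    first = stream.read(3)
    open_inside = not stream.closed   # still open inside the with block

closed_after = stream.closed          # closed automatically on exit

# Any further operation on the closed stream raises ValueError
try:
    stream.read()
    error = None
except ValueError as exc:
    error = exc

print(first, open_inside, closed_after, type(error).__name__)
```

The same pattern applies to objects returned by open(), sockets wrapped with makefile(), and custom stream classes, since they all inherit __enter__ and __exit__ from IOBase.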

from io import IOBase, RawIOBase, BufferedIOBase, TextIOBase

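# Inspect the inheritance relationships
print("IOBase MRO:", [c.__name__ for c in IOBase.__mro__])
print("TextIOBase is a subclass of IOBase:", issubclass(TextIOBase, IOBase))
# Prints False: the buffered and raw branches are siblings, not parent and child
print("BufferedIOBase is a subclass of RawIOBase:", issubclass(BufferedIOBase, RawIOBase))

# List the methods formally marked abstract (expected to be empty in CPython,
# where the io base classes ship default implementations instead)
abstracts = [m for m in dir(IOBase) if getattr(getattr(IOBase, m, None), '__isabstractmethod__', False)]
print("IOBase abstract methods:", abstracts)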

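Design note: the io base classes are abstract base classes (ABCs), but in CPython the contract is enforced when a method is called rather than when an object is created. Most methods have default implementations, and an operation that a stream does not support raises io.UnsupportedOperation (a subclass of both OSError and ValueError). A subclass therefore overrides only what it needs, for example readable() and readinto() for a read-only raw stream, and still follows the same behavioral conventions as every other I/O class.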

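RawIOBase in detail

RawIOBase inherits from IOBase and is the abstract base class for raw binary I/O. The methods a subclass normally implements are readinto(b) and write(b); read(n) is built on readinto(), reads at most n bytes, and returns a bytes object, while write(b) writes a bytes-like object and returns the number of bytes actually written. RawIOBase provides no buffering: in FileIO, each read or write call maps directly to an operating-system call.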

from io import FileIO

# FileIO is the concrete RawIOBase implementation
f = FileIO('/tmp/raw_demo.bin', 'w')
f.write(b'Hello Raw IO World!')
f.close()

# Unbuffered read: every read() goes straight to an OS system call
f = FileIO('/tmp/raw_demo.bin', 'r')
data = f.read(5)
print("Read 5 bytes:", data)
print("Position after read:", f.tell())
print("File descriptor:", f.fileno())
f.close()

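Note: FileIO is unbuffered. If you read a large file one byte at a time, every read triggers a system call and performance suffers badly. This is exactly why the buffered layer exists: buffering reduces the number of system calls.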
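The effect can be observed without measuring time. The sketch below uses a hypothetical CountingRaw class (not part of the io module) that counts how often its readinto() is called, first when read directly and then when wrapped in a BufferedReader.

```python
import io

class CountingRaw(io.RawIOBase):
    """Hypothetical raw stream that counts calls to readinto()."""

    def __init__(self, data: bytes):
        super().__init__()
        self._src = io.BytesIO(data)
        self.calls = 0

    def readable(self) -> bool:
        return True

    def readinto(self, b) -> int:
        # Stands in for one system call on a real file
        self.calls += 1
        chunk = self._src.read(len(b))
        b[:len(chunk)] = chunk
        return len(chunk)

payload = b"x" * 1000

# Unbuffered: 1000 one-byte reads reach the raw layer 1000 times
direct = CountingRaw(payload)
for _ in range(1000):
    direct.read(1)
direct_calls = direct.calls

# Buffered: the same 1000 reads are served mostly from the buffer
wrapped = CountingRaw(payload)
reader = io.BufferedReader(wrapped, buffer_size=4096)
for _ in range(1000):
    reader.read(1)
buffered_calls = wrapped.calls

print(f"direct: {direct_calls} raw reads, buffered: {buffered_calls} raw reads")
```

With a real FileIO each raw read would be a system call, so the gap between the two counters is a rough proxy for the saving that buffering provides.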

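III. BytesIO and StringIO: Stream Operations in Memory

BytesIO and StringIO are two of the most useful classes in the io module. They let you treat an in-memory byte buffer or string as a file object. This is handy whenever code expects a file but the data lives in memory, for example in unit tests, string parsing, and protocol encoding and decoding.

BytesIO: a binary in-memory stream

BytesIO inherits from BufferedIOBase and wraps an in-memory bytes buffer in the stream interface. Because the data never leaves memory, reads and writes are usually much faster than disk file I/O.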

from io import BytesIO

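# Create a BytesIO and write data
buf = BytesIO()
buf.write(b'Python IO module ')
buf.write(b'is powerful.')

# Before reading, move the position back to the start
buf.seek(0)
print("Full content:", buf.read())

# Read a fixed number of bytes, just as with a file
buf.seek(0)
print("First 6 bytes:", buf.read(6))

# Get the whole buffer as bytes, regardless of the current position
print("Buffer value:", buf.getvalue())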

buf.close()

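StringIO: a text in-memory stream

StringIO inherits from TextIOBase and provides an in-memory stream in text mode. Unlike BytesIO, it works with str (Unicode strings) directly, so no encoding or decoding takes place and its encoding attribute is None.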

from io import StringIO

# Initialize a StringIO from a string
csv_data = """\
name,age,city
Alice,30,New York
Bob,25,London
Charlie,35,Beijing
"""

f = StringIO(csv_data)
# It can be read line by line
header = f.readline().strip().split(',')
print("Header:", header)

for line in f:
    row = line.strip().split(',')
    record = dict(zip(header, row))
    print("Record:", record)

f.close()

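Practical tip: both StringIO and BytesIO can be used as context managers (in a with statement). More importantly, they can feed functions that expect a file object without creating a temporary file, which makes them very useful for unit tests.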
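As a sketch of that testing pattern, the two functions below (count_words and write_report, both hypothetical names) accept any text file-like object, so a test can pass StringIO instances instead of real files.

```python
import io

def count_words(stream) -> int:
    """Count whitespace-separated words in any text file-like object."""
    return sum(len(line.split()) for line in stream)

def write_report(stream, total: int) -> None:
    """Write a one-line summary to any writable text stream."""
    stream.write(f"total words: {total}\n")

# Input comes from memory instead of a file on disk
source = io.StringIO("alpha beta\ngamma\n")
total = count_words(source)

# Output is captured in memory and inspected with getvalue()
sink = io.StringIO()
write_report(sink, total)
report = sink.getvalue()

print(total, repr(report))
```

In production code the same functions would receive objects returned by open(); the functions themselves do not need to change.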

In practice: handling image data with BytesIO

from io import BytesIO
import requests
from PIL import Image

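# Download an image into memory
resp = requests.get('https://example.com/sample.jpg', timeout=10)
resp.raise_for_status()  # fail early on HTTP errors
img_data = BytesIO(resp.content)

# BytesIO can be passed straight to PIL; no file on disk is needed
img = Image.open(img_data)
print(f"Image size: {img.size}, format: {img.format}")

# The processed image can be written back to a BytesIO as well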
output = BytesIO()
img.save(output, format='PNG')
processed_data = output.getvalue()
print(f"Processed image size: {len(processed_data)} bytes")

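IV. Binary Streams vs. Text Streams vs. Raw Streams

The mode argument of open() determines which kind of I/O stack is built. Knowing how the three kinds of stream differ, and when each applies, is the basis for using the I/O system correctly.

Feature	Raw stream (RawIOBase)	Binary stream (BufferedIOBase)	Text stream (TextIOBase)
Data unit	bytes	bytes	str (Unicode)
Construction	FileIO("f", "rb")	open("f", "rb")	open("f", "r")
Buffering	None	Yes (configurable)	Yes (via the underlying buffered stream)
Encoding	None	None	Automatic encoding/decoding
Typical use	Low-level system operations	Binary files, network data	Text files, CSV, JSON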

The three streams in practice

import io

# 1. Raw stream: what you read is always bytes
raw = io.FileIO('/tmp/demo.txt', 'w')
raw.write(b'hello\nworld\n')
raw.close()

raw = io.FileIO('/tmp/demo.txt', 'r')
data = raw.read()
print(f"RawIO type: {type(data).__name__}, value: {data}")
raw.close()

# 2. Buffered binary stream: byte reads and writes with buffering
with open('/tmp/demo.txt', 'rb') as f:
    print(f"Binary stream type: {type(f).__name__}")
    data = f.read()
    print(f"Data type: {type(data).__name__}")

# 3. Text stream: decoded to str automatically
with open('/tmp/demo.txt', 'r') as f:
    print(f"Text stream type: {type(f).__name__}")
    text = f.read()
    print(f"Text type: {type(text).__name__}")
    print(f"Encoding: {f.encoding}")

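Performance comparison: with a raw stream, reading 1 byte costs 1 system call. A buffered stream asked for 1 byte reads a whole buffer from the OS (io.DEFAULT_BUFFER_SIZE, 8 KiB on many Python versions) and serves the following bytes from memory. A text stream adds a codec on top of the buffered stream, so it also pays for encoding and decoding. Most application code uses text streams (for text) or buffered binary streams (for images, audio, or serialized data); raw streams are reserved for low-level code that needs fine control over I/O behavior.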

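V. Buffering Strategies in Detail

Buffering is one of the most important optimizations in an I/O system. The io module provides several buffered implementations to suit different read and write patterns.

BufferedReader: buffered reading

BufferedReader wraps a raw byte stream (RawIOBase) and maintains an internal read buffer. When read(n) is called, it tries to serve the data from the buffer; when the buffer does not hold enough, it reads a larger block from the underlying raw stream in one go (one buffer's worth, 8 KiB by default on many Python versions) and refills the buffer.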

from io import FileIO, BufferedReader

raw = FileIO('/tmp/demo.txt', 'r')
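buf_reader = BufferedReader(raw, buffer_size=4096)  # 4 KiB buffer

# First read(10): fills the buffer from the raw stream (up to 4096 bytes), returns the first 10
chunk1 = buf_reader.read(10)
print(f"First read: {chunk1}")

# Second read(10): served from the buffer, no system call
chunk2 = buf_reader.read(10)
print(f"Second read: {chunk2}")

# peek() returns buffered data without advancing the position;
# it may return more or fewer bytes than requested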
peeked = buf_reader.peek(5)
print(f"Peeked: {peeked}")

buf_reader.close()

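BufferedWriter: buffered writing

BufferedWriter optimizes writes with a similar strategy. Written data is first stored in a write buffer and is passed to the underlying raw stream in one piece only when the buffer fills up or flush() is called explicitly.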

from io import FileIO, BufferedWriter

raw = FileIO('/tmp/bufwrite_demo.txt', 'w')
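buf_writer = BufferedWriter(raw, buffer_size=8192)  # 8 KiB buffer

# Write 4 KiB: the data only enters the in-memory buffer, not the file yet
data = b'X' * 4096
buf_writer.write(data)

# The buffer is flushed automatically when full; it can also be flushed by hand
buf_writer.flush()

# close() calls flush() automatically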
buf_writer.close()

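Important: when using BufferedWriter, make sure close() is called (a with statement is the recommended way). If the program crashes before flush(), the data in the write buffer is lost. flush() only hands the data to the operating system; in extreme cases such as power loss, data still held in the kernel's cache can be lost as well. Calling os.fsync() on the file descriptor after flush() asks the OS to write that data to the storage device.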
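A minimal sketch of a write that is pushed past both buffering levels. The helper name durable_write and the temporary file are assumptions made for illustration.

```python
import os
import tempfile

def durable_write(path: str, payload: bytes) -> None:
    """Write payload and ask the OS to persist it before returning."""
    with open(path, "wb") as f:
        f.write(payload)
        f.flush()              # Python's buffer -> operating system
        os.fsync(f.fileno())   # operating system cache -> storage device

fd, demo_path = tempfile.mkstemp()
os.close(fd)
try:
    durable_write(demo_path, b"important record\n")
    with open(demo_path, "rb") as f:
        stored = f.read()
finally:
    os.remove(demo_path)

print(stored)
```

os.fsync() is comparatively slow, so it is usually reserved for data that must survive a crash, such as journal or checkpoint records.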

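BufferedRandom: a buffered stream for random access

BufferedRandom combines the capabilities of BufferedReader and BufferedWriter. It suits cases that need both reading and writing with seek support, typically database files and random-access binary formats.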

from io import BufferedRandom, FileIO

raw = FileIO('/tmp/random_demo.bin', 'w+b')

# Write 100 bytes
raw.write(b'\x00' * 100)
raw.close()

# Use BufferedRandom for random-access reads and writes
raw = FileIO('/tmp/random_demo.bin', 'r+b')
buf = BufferedRandom(raw)

# Write 4 bytes at offset 20
buf.seek(20)
buf.write(b'\x41\x42\x43\x44')

# Read them back to verify
buf.seek(20)
data = buf.read(4)
print(f"Data at offset 20: {data.hex()}")

buf.close()

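Choosing a buffer size

Buffer size can have a noticeable effect on I/O performance. The following are starting points for different scenarios; measure before settling on a value:

Small files (< 1 MB): the default buffer size is enough; I/O is not the bottleneck
Large sequential reads: a larger buffer (64 KB to 1 MB) reduces the number of system calls
Large random reads: a small buffer (4 KB to 8 KB) avoids wasted read-ahead
High-throughput writes: a larger buffer (64 KB to 256 KB), ideally a multiple of the file system block size
Network streams: the default, or a small value such as 4 KB, is usually adequate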
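Since these figures are only starting points, a quick measurement is the safer guide. The sketch below times the same sequence of small reads with several values of the buffering argument of open(); the helper timed_read, the 2 MiB test file, and the chosen sizes are illustrative assumptions, and the timings will vary by machine and file system.

```python
import io
import os
import tempfile
import time

def timed_read(path: str, buffering: int, request: int = 512) -> tuple[int, float]:
    """Read the file in small requests; return (bytes read, seconds taken)."""
    total = 0
    start = time.perf_counter()
    with open(path, "rb", buffering=buffering) as f:
        while True:
            chunk = f.read(request)
            if not chunk:
                break
            total += len(chunk)
    return total, time.perf_counter() - start

fd, path = tempfile.mkstemp()
try:
    with os.fdopen(fd, "wb") as f:
        f.write(os.urandom(2 * 1024 * 1024))  # 2 MiB of test data

    # buffering=0 disables buffering entirely (binary mode only)
    results = {}
    for size in (0, 4096, io.DEFAULT_BUFFER_SIZE, 256 * 1024):
        results[size] = timed_read(path, size)
finally:
    os.remove(path)

for size, (total, seconds) in results.items():
    print(f"buffering={size:>7}: {total} bytes in {seconds:.4f}s")
```

On a file that is already in the OS page cache the differences mostly reflect system-call overhead; on slow or remote storage they can be much larger.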

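VI. TextIOWrapper: The Text Encoding Wrapper

TextIOWrapper is the core implementation of text I/O. It wraps a BufferedIOBase object and adds an encoding and decoding layer on top of it. It is the class actually used behind open("file.txt", "r"). Understanding how it works matters for handling files in multiple encodings, for performance tuning, and for processing large files in chunks.

Internal structure of TextIOWrapper

When open("file.txt", "r", encoding="utf-8") is called, Python builds the following I/O stack:

TextIOWrapper → BufferedReader → FileIO → operating-system file descriptor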

from io import TextIOWrapper, BufferedReader, FileIO

# Prepare a UTF-8 file through the raw layer
raw = FileIO('/tmp/utf8_demo.txt', 'w')
raw.write('你好世界 Hello World'.encode('utf-8'))
raw.close()

# Build the I/O stack by hand to see what open() does internally
raw = FileIO('/tmp/utf8_demo.txt', 'r')
buf = BufferedReader(raw)
text = TextIOWrapper(buf, encoding='utf-8')

# The data can now be read as text
content = text.read()
print(f"Text content: {content}")
print(f"Encoding: {text.encoding}")
print(f"Errors policy: {text.errors}")
print(f"Newlines mode: {text.newlines}")

# Check how the layers of the stack relate to each other
print(f"Is wrapped: {isinstance(text, TextIOWrapper)}")
print(f"Underlying buffer: {type(text.buffer).__name__}")
print(f"Underlying raw: {type(text.buffer.raw).__name__}")

text.close()

Handling encoding errors

TextIOWrapper offers several strategies for handling encoding errors, selected with the errors parameter:

from io import TextIOWrapper, BytesIO

# Simulate a stream that contains invalid UTF-8 bytes
bad_bytes = b'Hello\xffWorld\xfeGood'
buf = BytesIO(bad_bytes)

# strict: an invalid byte raises UnicodeDecodeError
try:
    text = TextIOWrapper(buf, encoding='utf-8', errors='strict')
    print(text.read())
except UnicodeDecodeError as e:
    print(f"Strict mode error: {e}")

# ignore: invalid bytes are silently dropped
buf = BytesIO(bad_bytes)
text = TextIOWrapper(buf, encoding='utf-8', errors='ignore')
print(f"Ignore mode: '{text.read()}'")

# replace: invalid bytes become the replacement character U+FFFD
buf = BytesIO(bad_bytes)
text = TextIOWrapper(buf, encoding='utf-8', errors='replace')
print(f"Replace mode: '{text.read()}'")

# surrogateescape: invalid bytes are preserved (PEP 383)
buf = BytesIO(bad_bytes)
text = TextIOWrapper(buf, encoding='utf-8', errors='surrogateescape')
print(f"Surrogateescape mode: '{text.read()}'")

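PEP 383 and surrogateescape: on Unix systems a file name is not necessarily valid UTF-8. The surrogateescape error handler, introduced in Python 3.1, maps each undecodable byte to a surrogate code point in the range U+DC80..U+DCFF, which guarantees a lossless round trip. This is the mechanism that lets functions such as os.listdir() handle arbitrary file names.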
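The round trip can be checked directly with bytes.decode() and str.encode(). In this sketch the byte string raw_name is a made-up file name containing two bytes that are not valid UTF-8.

```python
raw_name = b"report-\xff\xfe.txt"   # hypothetical file name, not valid UTF-8

# Undecodable bytes become lone surrogates instead of raising an error
decoded = raw_name.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly
restored = decoded.encode("utf-8", errors="surrogateescape")

# Each invalid byte 0xNN is represented as U+DCNN
escaped = [hex(ord(ch)) for ch in decoded if 0xDC80 <= ord(ch) <= 0xDCFF]

# A strict encode refuses the surrogates, which guards against leaking them
try:
    decoded.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

print(restored == raw_name, escaped, strict_ok)
```

The practical consequence is that such a string can be passed back to file-system functions unchanged, but it should not be written to a strict UTF-8 output without deciding how to display the escaped bytes.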

Line-buffered mode

from io import TextIOWrapper, BytesIO

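# Line buffering: flush as soon as a newline is written
buf = BytesIO()
text = TextIOWrapper(buf, encoding='utf-8', line_buffering=True)

# Text without a newline stays in the wrapper's pending buffer
text.write('Waiting for newline...')

# Writing a newline triggers an immediate flush
text.write('\n')  # flushed automatically here

# Look at what reached the underlying buffer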
buf.seek(0)
print("Buffer content:", buf.read())

text.close()

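VII. Streaming Large Files

When a file is larger than the available memory, streaming is the only workable approach. The core idea is simple: do not read the whole file into memory at once; process it in small batches (chunks) instead.

Chunked streaming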

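def process_large_file(filepath: str, chunk_size: int = 65536):
    """
    Stream a large text file chunk by chunk instead of loading it whole.

    Args:
        filepath: path to the file
        chunk_size: maximum number of characters per chunk (default 65536);
            in text mode read(n) counts characters, not bytes
    """
    with open(filepath, 'r', encoding='utf-8') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            yield chunk

def process_chunk(chunk: str) -> None:
    """Placeholder handler (hypothetical); replace with real processing."""
    print(f"Got {len(chunk)} characters")

# Consume the generator; 'huge_log.txt' stands in for a real file
for chunk in process_large_file('huge_log.txt'):
    process_chunk(chunk)  # at most 65536 characters at a time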

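Performance considerations for line-by-line reading

Python's file iterator is already buffered. A loop such as for line in f fetches one line per iteration, much like calling f.readline(), and the data comes from the buffer that BufferedReader maintains. Watch out for one extreme case, though: if the file contains a single very long line (for example a 2 GB JSON document on one line), reading line by line will exhaust memory too.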

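def read_large_file_safely(filepath, max_line_length=10_000_000):
    """Read a large file line by line, capping how much of one line is held in memory.

    readline(size) returns at most `size` characters; the rest of an over-long
    line is returned by the following calls, so nothing is dropped.
    """
    with open(filepath, 'r', encoding='utf-8') as f:
        while True:
            line = f.readline(max_line_length)
            if not line:
                break
            if len(line) >= max_line_length and not line.endswith('\n'):
                print(f"Warning: long line split at position {f.tell()}")
            yield line

# Fixed-size chunks for binary files
def copy_file_streaming(src: str, dst: str, chunk_size=65536):
    """Copy a file in chunks (memory use stays around chunk_size)"""
    with open(src, 'rb') as fin, open(dst, 'wb') as fout:
        while True:
            chunk = fin.read(chunk_size)
            if not chunk:
                break
            fout.write(chunk)

# Reuse one buffer with readinto() and memoryview to avoid per-chunk allocations
def copy_streaming_zero_copy(src: str, dst: str, chunk_size=65536):
    """Copy a file through a single preallocated buffer"""
    buf = bytearray(chunk_size)
    view = memoryview(buf)
    with open(src, 'rb') as fin, open(dst, 'wb') as fout:
        while True:
            n = fin.readinto(buf)
            if not n:
                break
            fout.write(view[:n])  # slicing a memoryview does not copy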

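Rule of thumb: for text formats such as log files and CSV files, the line iterator (for line in f) is enough. For binary files (images, video, archives), read in fixed-size chunks (typically 64 KB to 1 MB). Never call f.read() without an argument when memory is tight, because it reads the entire file into memory.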
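A common use of the fixed-size chunk rule is hashing a file that may not fit in memory. The sketch below uses the two-argument form of iter() to keep calling read() until it returns an empty bytes object; the function name sha256_of_file and the generated test data are illustrative.

```python
import hashlib
import os
import tempfile
from functools import partial

def sha256_of_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in chunks so memory use stays near chunk_size."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter(callable, sentinel) calls f.read(chunk_size) until it returns b""
        for chunk in iter(partial(f.read, chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

payload = b"line of data\n" * 100_000
fd, path = tempfile.mkstemp()
try:
    with os.fdopen(fd, "wb") as f:
        f.write(payload)
    streamed = sha256_of_file(path, chunk_size=64 * 1024)
finally:
    os.remove(path)

# Reference value computed with the whole payload in memory
in_memory = hashlib.sha256(payload).hexdigest()
print(streamed == in_memory)
```

The same loop shape works for any incremental consumer, such as a compressor, a checksum, or an upload that accepts data piece by piece.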

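VIII. Pipes and Stream Chaining

Pipes and stream chaining are advanced but very powerful features. They let you connect several processing stages, much like a Unix pipeline, to build a data-processing pipeline.

os.pipe(): operating-system pipes

os.pipe() creates a pair of file descriptors: a read end and a write end. Wrapped with io classes, they can carry a data stream between threads or between processes.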

import os
from io import FileIO, BufferedReader
import threading

# Create an operating-system pipe
r_fd, w_fd = os.pipe()

# Wrap both ends in Python I/O objects
reader = FileIO(r_fd, 'r', closefd=True)
writer = FileIO(w_fd, 'w', closefd=True)

# Write data from a worker thread
def writer_thread(w):
    w.write(b'Hello from pipe!\n')
    w.write(b'Second line\n')
    w.close()

t = threading.Thread(target=writer_thread, args=(writer,))
t.start()

# Read in the main thread
buf_reader = BufferedReader(reader)
print(buf_reader.readline().decode())
print(buf_reader.readline().decode())
buf_reader.close()
t.join()

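The stream chaining pattern

Because the I/O system is built on the decorator pattern, stream objects from different layers can be combined freely. The following shows how to build a complete I/O pipeline: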

from io import (
    BytesIO, BufferedReader, TextIOWrapper, BufferedWriter
)

# Build a processing pipeline:
# 1. Raw data source: an in-memory byte buffer
raw_source = BytesIO(b'{"name": "Stream", "value": 42}\n{"name": "Pipeline", "value": 100}')

# 2. Wrap it in a buffered reader
buffered = BufferedReader(raw_source)

# 3. Wrap that in a text stream (decodes automatically)
text_stream = TextIOWrapper(buffered, encoding='utf-8')

# Process the JSON data line by line through the pipeline
import json
for line in text_stream:
    data = json.loads(line.strip())
    print(f"Processed: name={data['name']}, value={data['value']}")

# The output direction works the same way
output_buffer = BytesIO()
output_buffered = BufferedWriter(output_buffer)
output_text = TextIOWrapper(output_buffered, encoding='utf-8')

output_text.write('Written through a chain of streams\n')
output_text.flush()  # make sure the data reaches the underlying BytesIO

output_buffer.seek(0)
print("Output:", output_buffer.read())

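Key insight: the main advantage of stream chaining is that each layer does exactly one job. TextIOWrapper does not care whether the data comes from a file or from memory, BufferedReader does not care whether the data is text or binary, and FileIO does not care whether there is a buffer above it. This separation of concerns makes testing and extension straightforward.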
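The standard library's gzip module plugs into the same chain. In the sketch below, gzip.open() in text mode stacks a TextIOWrapper on a GzipFile, which in turn writes to an in-memory BytesIO; the variable names are illustrative.

```python
import gzip
import io

storage = io.BytesIO()

# Text layer -> gzip compression layer -> in-memory byte buffer
with gzip.open(storage, "wt", encoding="utf-8") as out:
    out.write("first line\n")
    out.write("second line\n")

# The BytesIO passed in stays open after the gzip layer is closed
compressed = storage.getvalue()

# Reading reverses the chain: bytes -> decompression -> decoding
with gzip.open(io.BytesIO(compressed), "rt", encoding="utf-8") as src:
    lines = [line.rstrip("\n") for line in src]

print(lines, len(compressed))
```

Neither the writing code nor the reading code knows that compression is involved beyond the call that builds the stack, which is the separation of concerns described above.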

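IX. Implementing Custom Streams

When the standard I/O classes do not meet a specific need, you can implement a custom stream by inheriting from RawIOBase, BufferedIOBase, or TextIOBase. This is one of the most powerful extension points of the io module.

Implementing a RawIOBase subclass

The following implements a simple transforming input stream that applies ROT13 to the data as it is read (ROT13 is an obfuscation used here for illustration, not real encryption):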

from io import BytesIO, RawIOBase

class Rot13Stream(RawIOBase):
    """Raw stream that applies ROT13 to everything read from the inner stream"""

    def __init__(self, inner_stream):
        super().__init__()
        self._inner = inner_stream

    def readable(self) -> bool:
        return True

    def readinto(self, buffer) -> int:
        # RawIOBase.read() and readall() are built on readinto()
        data = self._inner.read(len(buffer))
        if not data:
            return 0
        n = len(data)
        buffer[:n] = _rot13_transform(data)
        return n

    def seekable(self) -> bool:
        return self._inner.seekable()

    def seek(self, offset: int, whence: int = 0) -> int:
        return self._inner.seek(offset, whence)

    def tell(self) -> int:
        return self._inner.tell()

    def close(self):
        # Close the wrapped stream, then mark this stream as closed
        self._inner.close()
        super().close()

def _rot13_transform(data: bytes) -> bytes:
    """Apply ROT13 to ASCII letters; leave every other byte unchanged"""
    result = bytearray(len(data))
    for i, b in enumerate(data):
        if 97 <= b <= 122:   # a-z
            result[i] = ((b - 97) + 13) % 26 + 97
        elif 65 <= b <= 90:  # A-Z
            result[i] = ((b - 65) + 13) % 26 + 65
        else:
            result[i] = b
    return bytes(result)

# Use the custom stream on ROT13-encoded data
source = BytesIO(b'Uryyb, Jbeyq!')  # ROT13 of "Hello, World!"
rot13_stream = Rot13Stream(source)
print("Decoded:", rot13_stream.read().decode())

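Implementing a wrappable custom stream

A more common approach is to implement a filter stream, which can be inserted into the standard I/O stack as an intermediate processing layer: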

from io import RawIOBase

class ChunkedChecksumStream(RawIOBase):
    """
    Wraps a raw stream and prints a simple checksum for every block read.
    It sits at the raw layer, so it can be placed under a BufferedReader.
    """

    def __init__(self, inner: RawIOBase, chunk_size=4096):
        super().__init__()
        self._inner = inner
        self._chunk_size = chunk_size
        self._checksums = []

    def readable(self) -> bool:
        # BufferedReader refuses to wrap a raw stream that is not readable
        return True

    def readinto(self, b) -> int:
        # BufferedReader pulls data through readinto(); cap each read at chunk_size
        data = self._inner.read(min(len(b), self._chunk_size))
        if not data:
            return 0
        cs = sum(data) % 256
        self._checksums.append(cs)
        print(f"Read {len(data)} bytes, checksum={cs}")
        b[:len(data)] = data
        return len(data)

    def seekable(self) -> bool:
        return self._inner.seekable()

    def seek(self, pos, whence=0) -> int:
        self._checksums.clear()
        return self._inner.seek(pos, whence)

    def close(self):
        self._inner.close()
        super().close()

# Insert it into the standard I/O stack
from io import FileIO, BufferedReader, TextIOWrapper

raw = FileIO('/tmp/test.txt', 'w')
raw.write(b'Hello World! This is a test file for streaming.\n' * 100)
raw.close()

raw = FileIO('/tmp/test.txt', 'r')
checksum_stream = ChunkedChecksumStream(raw, chunk_size=32)
buf = BufferedReader(checksum_stream)
text = TextIOWrapper(buf, encoding='utf-8')

line_count = 0
for line in text:
    line_count += 1
print(f"Total lines: {line_count}")
text.close()  # closes every layer down to the FileIO

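Ideas for extension: the same approach can be used to build compression streams (transparent decompression), encryption streams (transparent decryption), rate-limiting streams (throttled reads), and monitoring streams (I/O throughput statistics). This middleware style is very common in more advanced I/O systems.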
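As one sketch of the monitoring idea on the write side, the hypothetical ByteCountingWriter below forwards every write to an inner stream and keeps running totals. It is placed at the raw layer, underneath a BufferedWriter and a TextIOWrapper.

```python
import io

class ByteCountingWriter(io.RawIOBase):
    """Hypothetical monitoring stream: forwards writes and counts bytes."""

    def __init__(self, inner):
        super().__init__()
        self._inner = inner
        self.bytes_written = 0
        self.write_calls = 0

    def writable(self) -> bool:
        return True

    def write(self, b) -> int:
        # Forward the data, then record how much the inner stream accepted
        n = self._inner.write(b)
        self.bytes_written += n
        self.write_calls += 1
        return n

sink = io.BytesIO()
monitor = ByteCountingWriter(sink)
text = io.TextIOWrapper(io.BufferedWriter(monitor), encoding="utf-8")

for i in range(100):
    text.write(f"record {i}\n")
text.flush()  # push pending data through every layer

expected = sum(len(f"record {i}\n") for i in range(100))
print(monitor.bytes_written, monitor.write_calls)
```

Because the layers above it buffer the data, the monitor sees far fewer write calls than the application issues, which is itself a useful thing to observe.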

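X. Low-Level File Descriptor Interaction Between io and os

The high-level abstractions of the io module are built on operating-system file descriptors. Understanding how io and os interact is important for writing high-performance or cross-platform code.

Bridging file descriptors and Python objects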

import os
import fcntl  # Unix only
from io import FileIO

# os.open returns a raw file descriptor (an integer)
fd = os.open('/tmp/os_io_demo.txt', os.O_RDWR | os.O_CREAT, 0o644)
print(f"OS file descriptor: {fd}")

# Wrap it in a Python I/O object; closefd=False leaves fd open after close()
f = FileIO(fd, 'r+', closefd=False)
print(f"Python IO object: {f}")
print(f"fileno(): {f.fileno()}")
f.close()  # the fd itself is still open

# os.fdopen also turns an fd into a Python file object.
# With closefd=True (the default), closing the object closes the fd as well,
# so only one wrapper should own a given fd
f2 = os.fdopen(fd, 'w')
f2.write('Written via os.fdopen\n')
f2.close()

# fileno() lets a Python I/O object be passed to functions that need an fd
with open('/tmp/os_io_demo.txt', 'r') as f:
    fd = f.fileno()
    # Query the status flags of the underlying fd
    flags = fcntl.fcntl(fd, fcntl.F_GETFL)
    # O_RDONLY is 0, so compare the access-mode bits instead of testing a bit
    access_mode = flags & os.O_ACCMODE
    print(f"Opened read-only: {access_mode == os.O_RDONLY}")

dup2 and file descriptor redirection

import os
import sys

# Open the file that will receive standard output
output_fd = os.open('/tmp/redirected_output.txt',
                    os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)

# Save a copy of the original stdout fd
saved_stdout_fd = os.dup(sys.stdout.fileno())

# Flush Python-level buffers before swapping the descriptor
sys.stdout.flush()
# Redirect: make fd 1 (stdout) point at our file
os.dup2(output_fd, sys.stdout.fileno())

# From here on, print output lands in the file
print("This goes to the file, not the terminal!")
print("So does this line.")

# Flush again so buffered text reaches the file, then restore stdout
sys.stdout.flush()
os.dup2(saved_stdout_fd, sys.stdout.fileno())
os.close(saved_stdout_fd)
os.close(output_fd)

# Verify the file content
with open('/tmp/redirected_output.txt', 'r') as f:
    print("File content:", f.read())

socket.makefile()

import socket

# socket.makefile() wraps a network connection in a file object
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.connect(('example.com', 80))

# Send an HTTP request through the file interface;
# iso-8859-1 decodes any byte, so header parsing cannot fail on decoding
sock_file = sock.makefile('rw', buffering=4096, encoding='iso-8859-1')
sock_file.write('GET / HTTP/1.0\r\nHost: example.com\r\n\r\n')
sock_file.flush()

# Read the response line by line
status_line = sock_file.readline()
print("Status:", status_line.strip())

# Read the headers (an empty line ends them)
for line in sock_file:
    line = line.strip()
    if not line:
        break
    print("Header:", line)

sock_file.close()
sock.close()

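Key point: on Unix-like systems the file descriptor is the common abstraction underneath all I/O. Socket descriptors, pipe descriptors, and regular file descriptors are handled uniformly at the system level. The io module builds on this: a file, a socket, or a pipe can be wrapped in the same kind of I/O object and used through the same read and write interface. This is the Unix philosophy that everything is a file.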

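XI. Summary of Key Points

1. Three layers: the Python I/O system is organized as RawIOBase → BufferedIOBase → TextIOBase, and each layer adds functionality on top of the one beneath it.

2. Streaming: large files must be processed as a stream (chunk by chunk or line by line) rather than read into memory in one piece. A chunk size of 64 KB to 1 MB is a reasonable default.

3. In-memory streams: BytesIO and StringIO are stand-in files that emulate file behavior in memory. They fit unit tests, data conversion, and protocol encoding and decoding best.

4. Buffering strategy: buffering reduces the number of system calls. BufferedReader reads ahead, BufferedWriter defers writes, and BufferedRandom supports random access. The buffer size directly affects I/O performance.

5. Encoding wrapper: TextIOWrapper is the core of text I/O and supports several error-handling strategies (strict, ignore, replace, surrogateescape). Line-buffered mode suits interactive applications.

6. Decorator pattern: the I/O system is built on the decorator pattern, so stream objects can be combined freely into processing pipelines. Custom streams are implemented by inheriting from RawIOBase, BufferedIOBase, or TextIOBase.

7. File descriptors: a Python I/O object backed by the operating system corresponds to a file descriptor (an integer). fileno(), os.fdopen(), and socket.makefile() let you move between the high-level abstraction and low-level system calls.

8. What open() does: the open() function is a factory. mode="r" builds TextIOWrapper(BufferedReader(FileIO(...))); mode="rb" builds BufferedReader(FileIO(...)). Understanding this construction gives you the overall picture of the Python I/O system.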

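XII. Further Thoughts and Practice

The I/O system is one of the parts of Python that best reflects a Pythonic design philosophy. It hides the complexity of operating-system I/O behind a concise object-oriented interface while keeping enough flexibility for advanced needs. In real projects, knowing how the I/O system works internally pays off in several ways.

Practical suggestions:

Performance tuning: when file reads and writes turn out to be the bottleneck, check the buffer size first. Adjusting the buffer_size parameter may be the simplest effective optimization.
Pipe pattern: subprocess.PIPE combined with the io module can build elaborate inter-process communication pipelines, for example reading a child process's output stream and handling it line by line.
Asynchronous I/O: the io module is synchronous. High-concurrency scenarios call for the asyncio event loop together with asyncio.StreamReader and asyncio.StreamWriter.
Memory mapping: for random access to very large files, the stream interface of the io module may not be efficient enough. Consider the mmap module, which maps the file into virtual memory and relies on the operating system's page cache.
Serialization compatibility: pickle and json accept file objects as arguments, and struct works on the bytes you read from or write to a stream, so they all cooperate smoothly with BytesIO and StringIO for in-memory serialization and deserialization.

Mastering the io module means more than knowing how to call open(), read(), and write(); it means understanding the stream-oriented way of thinking behind it. Once you can treat any data source (file, memory, network, pipe) uniformly as a stream, your ability to work with abstractions has taken an important step forward.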
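The pipe pattern from the list above can be sketched in a few lines. Here the child process is simply another Python interpreter running a one-line script (an illustrative assumption); its standard output arrives in the parent as a text stream that is iterated line by line.

```python
import subprocess
import sys

# Hypothetical child program: prints three lines and exits
child_code = "for i in range(3): print(f'line {i}')"

with subprocess.Popen(
    [sys.executable, "-c", child_code],
    stdout=subprocess.PIPE,
    text=True,  # wrap the pipe in a text stream
) as proc:
    # proc.stdout is an ordinary file object, so the usual iteration works
    received = [line.rstrip("\n") for line in proc.stdout]

# Leaving the with block waits for the child and sets returncode
exit_code = proc.returncode
print(received, exit_code)
```

Because the lines are consumed as they arrive, the parent never needs to hold the child's entire output in memory, which matters for long-running or very verbose child processes.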

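Questions for reflection:

How would you implement a custom stream that supports both reading and writing and is seekable? Which base class should it inherit from, and which methods must it implement?
How does the line_buffering parameter of TextIOWrapper interact with the buffer size of BufferedWriter?
Why is the surrogateescape error handler so important for Unix system programming? What problem does it solve?
What should you watch out for when several threads share the same file descriptor? Does Python's GIL affect the performance of concurrent I/O?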