文本分析与自然语言处理基础

用Python分析文本数据 -- 从预处理到深度学习的完整技术栈

学习日期：2026年5月 | 技术栈：Python 3.12 + NLTK + spaCy + scikit-learn + gensim + Transformers

核心主题： 文本分析与自然语言处理（NLP）基础知识体系

主要内容： 文本预处理、文本向量化、情感分析、主题建模、文本分类、词云生成、命名实体识别

技术工具： jieba / NLTK / spaCy / scikit-learn / gensim / Word2Vec / BERT / pyLDAvis / wordcloud

关键词： NLP, 文本分析, jieba, TF-IDF, Word2Vec, LDA, 情感分析, 词云, 文本分类, 命名实体识别

一、文本预处理

文本预处理是NLP流水线的第一环，也是最关键的一环。原始文本包含大量噪声，只有经过系统清洗和标准化，后续的向量化、建模才能得到可靠结果。预处理通常包含以下步骤：

1.1 中文分词（jieba）

中文文本没有天然的空格分隔，必须借助分词工具将连续的汉字序列切分成有意义的词语。jieba（结巴分词）是Python中最主流的中文分词库，支持精确模式、全模式和搜索引擎模式三种分词策略。

import jieba

text = "自然语言处理是人工智能领域的重要分支"

# 精确模式（最常用）
words = jieba.lcut(text, cut_all=False)
print(words)
# ['自然语言', '处理', '是', '人工智能', '领域', '的', '重要', '分支']

# 全模式（速度快，但有冗余）
words_all = jieba.lcut(text, cut_all=True)
print(words_all)
# ['自然', '自然语言', '语言', '处理', '是', '人工', '人工智能', '智能', '领域', '的', '重要', '分支']

# 搜索引擎模式（精确模式基础上再切分长词）
words_search = jieba.lcut_for_search(text)
print(words_search)
# ['自然', '语言', '自然语言', '处理', '是', '人工', '智能', '人工智能', '领域', '的', '重要', '分支']

# 添加自定义词典
jieba.add_word("自然语言处理")
jieba.add_word("人工智能领域")

1.2 英文分词与NLTK

NLTK（Natural Language Toolkit）是Python最经典的NLP库，提供分词、词性标注、句法分析等全套工具。英文分词相对简单，按空格和标点切分即可，但需处理缩写（don't、U.S.）等特殊情况。

import nltk
nltk.download('punkt_tab')
nltk.download('punkt')
from nltk.tokenize import word_tokenize, sent_tokenize

text = "Natural Language Processing (NLP) is a fascinating field. It enables computers to understand human language."

# 分句
sentences = sent_tokenize(text)
print(sentences)
# ['Natural Language Processing (NLP) is a fascinating field.',
#  'It enables computers to understand human language.']

# 分词
tokens = word_tokenize(text)
print(tokens)
# ['Natural', 'Language', 'Processing', '(', 'NLP', ')', 'is',
#  'a', 'fascinating', 'field', '.', 'It', 'enables', 'computers',
#  'to', 'understand', 'human', 'language', '.']

1.3 spaCy 流水线处理

spaCy 是工业级NLP框架，提供端到端的文本处理流水线，包括分词、词性标注、依存句法分析、命名实体识别等，且速度极快。

import spacy

# 使用预训练模型（首次需下载）
# python -m spacy download zh_core_web_sm
# python -m spacy download en_core_web_sm
nlp = spacy.load("zh_core_web_sm")

doc = nlp("上海市浦东新区张江高科技园区祖冲之路100号")

# 分词与词性标注
for token in doc:
    print(f"{token.text:8s} | {token.pos_:6s} | {token.dep_:10s}")
# 输出样例：
# 上海市      | PROPN | nsubj
# 浦东新区    | PROPN | compound
# 张江        | PROPN | compound
# 高科技      | NOUN  | compound
# 园区        | NOUN  | attr
# 祖冲之路    | PROPN | compound
# 100         | NUM   | nummod
# 号          | PART  | dep

1.4 去停用词

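Stop words are words that occur very frequently but carry little topical information, such as "的", "了", "是", "在", "a", "the", "is". Removing them shrinks the feature space and often helps bag-of-words models. Tasks that depend on function words are an exception: the sample list below contains "不", which the lexicon-based sentiment scorer in section 3.1 relies on as a negator, so choose the stop-word list per task rather than globally.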

import jieba
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')

# 英文停用词
en_stops = set(stopwords.words('english'))
print(f"英文停用词数量: {len(en_stops)}")
# e.g. 179 with older NLTK data; the exact count depends on the installed stopwords corpus version

# 中文停用词（常用集合约2000词）
# 可以从 github.com/goto456/stopwords 下载
zh_stops = set([
    "的", "了", "在", "是", "我", "有", "和", "就", "不", "人",
    "都", "一", "一个", "上", "也", "很", "到", "说", "要", "去",
    "你", "会", "着", "没有", "看", "好", "自己", "这", "他", "她"
])

# 过滤停用词
text = "人工智能技术正在深刻地改变我们的生活方式"
words = jieba.lcut(text)
filtered = [w for w in words if w not in zh_stops and w.strip()]
print(filtered)
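# The exact tokens depend on jieba's dictionary. "的" is removed, but tokens such as
# "我们" are not in zh_stops above, so they survive unless the list is extended.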

1.5 词干提取与词形还原

词干提取（Stemming） 通过粗略的规则去掉单词的词缀，得到词干。优点是速度快，缺点是可能产生非真实词。词形还原（Lemmatization） 基于词典将单词还原为基本形式（lemma），结果一定是真实单词，但速度较慢。

from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import wordnet
import nltk

nltk.download('wordnet')
nltk.download('omw-1.4')

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

words = ["running", "runner", "ran", "easily", "better", "studies"]

print(f"{'原始':12s} | {'词干提取':16s} | {'词形还原':14s}")
print("-" * 44)
for w in words:
    stem = stemmer.stem(w)
    lemma = lemmatizer.lemmatize(w, pos='v')  # v=动词
    print(f"{w:12s} | {stem:16s} | {lemma:14s}")

# 输出：
# 原始          | 词干提取          | 词形还原
# running       | run              | run
# runner        | runner           | runner
# ran           | ran              | run
# easily        | easili           | easily
# better        | better           | better
# studies       | studi            | study

# WordNetLemmatizer 的 pos 参数
print(lemmatizer.lemmatize("leaves", pos='n'))  # leaf（名词）
print(lemmatizer.lemmatize("leaves", pos='v'))  # leave（动词）

1.6 大小写统一与标点去除

大小写统一将所有字母转为小写，消除因大小写导致的同一词语被错误视为不同词条的问题。标点符号通常不含语义信息（某些场景除外），需要一并去除。

import re
import string

def clean_text(text):
    """执行基础文本清洗"""
    # 1. 统一小写
    text = text.lower()
    # 2. 去除 HTML 标签
    text = re.sub(r'<[^>]+>', '', text)
    # 3. 去除 URL
    text = re.sub(r'http[s]?://\S+', '', text)
    # 4. 去除 @ 提及和 # 话题标签
    text = re.sub(r'@\w+|#\w+', '', text)
    # 5. Remove ASCII punctuation only (string.punctuation does not cover full-width marks such as "！" or "，")
    text = re.sub(f'[{re.escape(string.punctuation)}]', '', text)
    # 6. 合并多余空格
    text = re.sub(r'\s+', ' ', text).strip()
    return text

raw = "Hello World! Check out https://example.com for #NLP tips. @user 你好！"
cleaned = clean_text(raw)
print(cleaned)
# "hello world check out for tips 你好！"
# (the URL, "#NLP" and "@user" are removed entirely; the full-width "！" is kept)
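
`string.punctuation` covers only ASCII, so full-width marks such as "！" or "，" pass through `clean_text` untouched. The following is a minimal sketch of a Unicode-aware alternative, assuming it is acceptable to drop every character whose Unicode general category starts with `P`. The helper name `strip_unicode_punct` is hypothetical and not part of the original notes.

```python
import unicodedata

def strip_unicode_punct(text: str) -> str:
    """Drop every character whose Unicode general category is punctuation (P*)."""
    return "".join(
        ch for ch in text
        if not unicodedata.category(ch).startswith("P")
    )

print(strip_unicode_punct("你好！Hello, world。"))
# 你好Hello world
```

Currency signs such as "$" belong to the symbol categories (`S*`), so they are left alone; extend the test if those should go as well.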

1.7 正则表达式深度清洗

正则表达式是文本清洗最强大的武器，可以处理邮箱、电话号码、身份证号、金额等特定模式的匹配和过滤。

import re

def advanced_clean(text):
    """高级文本清洗"""
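    # Replace e-mail addresses with a placeholder
    text = re.sub(r'\S+@\S+\.\S+', '[EMAIL]', text)
    # Replace mainland-China mobile numbers with a placeholder
    text = re.sub(r'1[3-9]\d{9}', '[PHONE]', text)
    # Replace amounts such as ¥100.50 or 100元; multi-character units need
    # alternation, because a character class would match only one character
    text = re.sub(r'[¥￥$€]\d+(?:\.\d+)?|\d+(?:\.\d+)?(?:美元|欧元|元)', '[AMOUNT]', text)
    # Collapse runs of 3+ identical characters ("哈哈哈" -> "哈", "！！！" -> "！")
    text = re.sub(r'(.)\1{2,}', r'\1', text)
    # Remove control characters, keeping tab, newline and carriage return
    text = re.sub(r'[\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\x9f]', '', text)
    # Normalize curly quotes to straight quotes
    text = (text.replace('\u201c', '"').replace('\u201d', '"')
                .replace('\u2018', "'").replace('\u2019', "'"))
    return text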

sample = "联系我 test@example.com 或 13800138000，支付 ¥99.99。哈哈哈！！！"
print(advanced_clean(sample))
# "联系我 [EMAIL] 或 [PHONE]，支付 [AMOUNT]。哈！"

预处理流水线总结

一套完整的文本预处理流水线按以下顺序执行：

去噪： HTML标签 → URL → 邮箱/电话 → 特殊字符
标准化： 大小写统一 → Unicode规范化 → 繁简转换
分词： 中文用jieba，英文用NLTK/spaCy
过滤： 停用词 → 低频词 → 过短词（长度小于2的token）
归一化： 词干提取（英文）或词形还原（推荐）
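
The ordering above can be sketched with the standard library alone. This is a simplified English-only illustration, not the full pipeline: the stop-word list is a tiny placeholder, the tokenizer is a plain regular expression, and the function name `preprocess` is made up for this example.

```python
import re

STOP_WORDS = {"the", "a", "is", "to", "and", "of", "for"}  # tiny illustrative list

def preprocess(text: str) -> list[str]:
    """Denoise -> normalize -> tokenize -> filter, in that order."""
    text = re.sub(r"<[^>]+>", " ", text)        # denoise: strip HTML tags
    text = re.sub(r"https?://\S+", " ", text)   # denoise: strip URLs
    text = text.lower()                         # normalize: case folding
    tokens = re.findall(r"[a-z0-9]+", text)     # tokenize: crude English tokenizer
    # filter: drop stop words and tokens shorter than 2 characters
    return [t for t in tokens if t not in STOP_WORDS and len(t) >= 2]

print(preprocess("<p>Visit https://example.com for the BEST tips!</p>"))
# ['visit', 'best', 'tips']
```

Denoising runs before tokenization on purpose: once a URL has been split into tokens it is much harder to recognize and remove.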

二、文本向量化

计算机无法直接理解文字，必须将文本转换为数值向量。不同的向量化方法从不同角度捕捉文本的语义信息，从简单的词频统计到上下文感知的深度语义向量。

2.1 词袋模型（CountVectorizer）

词袋模型（Bag of Words, BoW）是最基础的向量化方法。它构建一个词汇表（Vocabulary），然后统计每个文档中各词汇的出现频次，忽略词序信息。

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "我 喜欢 自然 语言 处理",
    "自然 语言 处理 是 人工智能 的 重要 分支",
    "深度 学习 推动 了 自然 语言 处理 的 发展"
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

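print("词汇表:", vectorizer.get_feature_names_out())
# ['人工智能' '分支' '发展' '喜欢' '处理' '学习' '推动' '深度' '自然' '语言' '重要']
# The default token_pattern keeps only tokens of 2+ characters, so the
# single-character tokens 我 / 是 / 的 / 了 are dropped.

print("词袋矩阵形状:", X.shape)   # (3, 11)
print("词袋矩阵(密集):")
print(X.toarray())
# [[0 0 0 1 1 0 0 0 1 1 0]
#  [1 1 0 0 1 0 0 0 1 1 1]
#  [0 0 1 0 1 1 1 1 1 1 0]]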

2.2 TF-IDF（TfidfVectorizer）

TF-IDF（词频-逆文档频率）是对词袋模型的重要改进。一个词的重要性不仅取决于它在当前文档中的出现频率（TF），还取决于它在整个语料库中的稀有程度（IDF）。常见词（如"的"、"是"）的IDF值低，而领域专用词的IDF值高。

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets"
]

vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(corpus)

print("词汇表:", vectorizer.get_feature_names_out())
# ['cat', 'cats', 'dog', 'dogs', 'log', 'mat', 'pets', 'sat']

print("TF-IDF矩阵形状:", X.shape)   # (3, 8)
print("TF-IDF矩阵(保留3位小数):")
import numpy as np
np.set_printoptions(precision=3, suppress=True)
print(X.toarray())
# [[0.623 0.    0.    0.    0.    0.623 0.    0.474]
#  [0.    0.    0.623 0.    0.623 0.    0.    0.474]
#  [0.    0.577 0.    0.577 0.    0.    0.577 0.   ]]

# 配置 n-gram 范围（考虑词组）
vectorizer_ngram = TfidfVectorizer(
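    ngram_range=(1, 2),    # unigrams + bigrams
    max_features=1000,      # cap the vocabulary size
    min_df=2,               # keep terms that appear in at least 2 documents
    max_df=0.8              # drop terms that appear in more than 80% of documents (very frequent terms)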
)
X_ngram = vectorizer_ngram.fit_transform(corpus)
print("n-gram词汇量:", len(vectorizer_ngram.get_feature_names_out()))
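
To make the weighting concrete, the sketch below recomputes the first row of the stop-word-filtered example by hand. It assumes scikit-learn's default settings (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, L2 normalization per document); the token lists are the corpus after English stop-word removal, written out manually for this illustration.

```python
import math

# The example corpus after stop-word removal (typed by hand for this sketch)
docs = [["cat", "sat", "mat"], ["dog", "sat", "log"], ["cats", "dogs", "pets"]]
n_docs = len(docs)

def idf(term: str) -> float:
    """Smoothed IDF: ln((1 + n) / (1 + df)) + 1."""
    df = sum(term in doc for doc in docs)
    return math.log((1 + n_docs) / (1 + df)) + 1

def tfidf(doc: list[str]) -> dict[str, float]:
    """Raw term count times IDF, then L2-normalized per document."""
    weights = {t: doc.count(t) * idf(t) for t in set(doc)}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

row = tfidf(docs[0])
print({t: round(w, 3) for t, w in sorted(row.items())})
# {'cat': 0.623, 'mat': 0.623, 'sat': 0.474}
```

"sat" appears in two of the three documents, so its IDF is lower and it ends up with a smaller weight than "cat" and "mat", which are unique to the first document.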

2.3 HashingVectorizer

HashingVectorizer 使用哈希技巧（hashing trick）将特征映射到固定维度的向量空间，无需维护词汇表，内存效率极高，适用于大规模文本流处理。缺点是结果不可逆，无法知道哪个维度对应哪个词。

from sklearn.feature_extraction.text import HashingVectorizer

# 固定输出维度为 2^12 = 4096
vectorizer = HashingVectorizer(n_features=4096, alternate_sign=False)

corpus = ["text mining and analysis", "machine learning nlp"]
X = vectorizer.fit_transform(corpus)

print("哈希向量形状:", X.shape)   # (2, 4096)
# 每个维度不对应具体词汇，但文档间的相似度计算有效

# 应用场景：大规模流式文本处理
# 在线学习 + 哈希向量化，无需提前构建词汇表
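
The hashing trick itself fits in a few lines. The sketch below is an illustration only: it uses MD5 from the standard library as a stand-in hash (scikit-learn uses MurmurHash3), a deliberately tiny `n_features`, and a hypothetical function name `hash_vectorize`.

```python
import hashlib

def hash_vectorize(tokens: list[str], n_features: int = 16) -> list[int]:
    """Map each token to a bucket via a stable hash and count hits per bucket."""
    vec = [0] * n_features
    for tok in tokens:
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "little") % n_features
        vec[bucket] += 1
    return vec

v = hash_vectorize("text mining and text analysis".split())
print(len(v), sum(v))
# 16 5
```

No vocabulary is stored anywhere, which is why the mapping cannot be inverted and why two different tokens may collide in the same bucket when `n_features` is small.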

2.4 Word2Vec 词向量

word2vec由Google的Mikolov等人于2013年提出，通过神经网络训练得到稠密词向量（通常100-300维），能够捕捉词语之间的语义相似度。核心思想："You shall know a word by the company it keeps"（一个词的含义由其上下文决定）。

from gensim.models import Word2Vec

# 训练语料（已分词的句子列表）
sentences = [
    ["自然", "语言", "处理", "是", "人工智能", "核心"],
    ["机器", "学习", "是", "人工智能", "分支"],
    ["深度", "学习", "是", "机器", "学习", "子领域"],
    ["自然", "语言", "处理", "包含", "文本", "分类", "情感", "分析"],
]

# 训练 Word2Vec 模型
model = Word2Vec(
    sentences=sentences,
    vector_size=100,        # 词向量维度
    window=5,               # 上下文窗口大小
    min_count=1,            # 忽略出现次数少于1的词
    workers=4,              # 并行训练线程数
    epochs=50,               # 训练轮次
    sg=0                    # 0=CBOW, 1=Skip-gram
)

# 获取词向量
vector = model.wv["人工智能"]
print("词向量维度:", vector.shape)           # (100,)

# 寻找相似词
similar_words = model.wv.most_similar("自然", topn=3)
print("与'自然'最相似的词:")
for word, score in similar_words:
    print(f"  {word}: {score:.4f}")

# Word analogy: 国王 - 男人 + 女人 ≈ 女王 (needs a model trained on a large corpus)
# model.wv.most_similar(positive=['国王', '女人'], negative=['男人'])

# 保存与加载
model.save("word2vec.model")
loaded_model = Word2Vec.load("word2vec.model")
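
Similarity queries and analogies both reduce to cosine similarity between vectors. The sketch below shows the idea with made-up 3-dimensional vectors; the numbers are hypothetical and chosen so the analogy works, and gensim's own implementation additionally normalizes the input vectors before combining them.

```python
import math

def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors, for illustration only
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.2, 0.8],
    "apple": [0.5, 0.5, 0.5],
}

def most_similar(query: list[float], exclude: set[str], topn: int = 1):
    """Rank all words except the excluded ones by cosine similarity to the query."""
    scored = [(w, cosine(query, vec)) for w, vec in vectors.items() if w not in exclude]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

# king - man + woman
target = [k - m + w for k, m, w in zip(vectors["king"], vectors["man"], vectors["woman"])]
result = most_similar(target, exclude={"king", "man", "woman"})
print(result[0][0])
# queen
```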

2.5 GloVe 与 FastText

GloVe（Global Vectors for Word Representation）由斯坦福大学提出，结合了全局矩阵分解（如LSA）和局部上下文窗口（如Word2Vec）的优点，利用词-词共现矩阵进行训练。FastText 由Facebook提出，在Word2Vec基础上引入子词（subword）信息——将每个词拆分为字符n-gram，因此能处理未登录词（OOV）。

# 使用预训练的 FastText 模型（中文）
# 下载地址：https://fasttext.cc/docs/en/crawl-vectors.html

import gensim.downloader as api

# 方式一：使用 gensim 内置下载
# model = api.load("glove-wiki-gigaword-100")
# result = model.most_similar("computer", topn=5)
# 输出：[('computers', 0.77), ('software', 0.66), ...]

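# Option 2: load a local FastText model in Facebook's native .bin format.
# gensim 4.x provides load_facebook_model for this; the older
# FastText.load_fasttext_format was deprecated in its favor.
# from gensim.models.fasttext import load_facebook_model
# model = load_facebook_model("cc.zh.300.bin")

# FastText and out-of-vocabulary (OOV) words:
# Word2Vec can only return vectors for words seen during training, e.g. "playing".
# FastText can still build a vector for an unseen word such as "playfulness"
# by combining the vectors of its character n-grams.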

print("向量化方法对比概览：")
print("=" * 80)
print(f"{'方法':18s} | {'维度':8s} | {'上下文感知':10s} | {'OOV支持':8s} | {'特点'}")
print("-" * 80)
print(f"{'CountVectorizer':18s} | {'Varies':8s} | {'No':10s} | {'No':8s} | 简单稀疏，可解释性强")
print(f"{'TfidfVectorizer':18s} | {'Varies':8s} | {'No':10s} | {'No':8s} | 加权词频，降低常见词权重")
print(f"{'HashingVectorizer':18s} | {'Fixed':8s} | {'No':10s} | {'Yes':8s} | 固定维度，内存高效")
# Word2Vec, GloVe and FastText are trained on context windows but assign each
# word a single static vector, so they are not context-aware the way BERT is.
print(f"{'Word2Vec':18s} | {'100-300':8s} | {'No':10s} | {'No':8s} | 稠密静态词向量，词类比")
print(f"{'GloVe':18s} | {'100-300':8s} | {'No':10s} | {'No':8s} | 全局共现统计")
print(f"{'FastText':18s} | {'100-300':8s} | {'No':10s} | {'Yes':8s} | 子词信息，OOV支持")
print("=" * 80)
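
The subword idea behind FastText can be shown without any library. The sketch below extracts character n-grams with the `<` and `>` boundary markers described in the FastText paper, using the library's default range of 3 to 6 characters. The function name `char_ngrams` is hypothetical, and the real model also keeps the whole word as one extra unit.

```python
def char_ngrams(word: str, n_min: int = 3, n_max: int = 6) -> list[str]:
    """Character n-grams of '<word>' including the boundary markers."""
    marked = f"<{word}>"
    return [
        marked[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(marked) - n + 1)
    ]

print(char_ngrams("where", 3, 3))
# ['<wh', 'whe', 'her', 'ere', 're>']

# An unseen word shares n-grams with known words, which is what lets
# FastText assemble a vector for it.
shared = set(char_ngrams("playfulness")) & set(char_ngrams("playing"))
print(sorted(shared))
```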

三、情感分析

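Sentiment analysis is one of the most widely used NLP applications. It automatically determines the polarity of a text (positive / negative / neutral) or a specific emotion (joy, anger, sadness, and so on). The subsections below cover five approaches: a hand-built lexicon, VADER, TextBlob, classical machine learning, and BERT.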

3.1 基于词典的情感分析

使用预构建的情感词典（Sentiment Lexicon）对文本中的情感词进行匹配和计分。中文常用知网Hownet情感词典或大连理工大学情感词汇本体库。英文常用LIWC或AFINN。

# 基于情感词典的简单情感分析实现

import jieba  # used by sentiment_score below for word segmentation

# 简化的情感词典
positive_words = {
    "好": 1.0, "优秀": 2.0, "完美": 2.5, "喜欢": 1.5, "推荐": 1.5,
    "满意": 1.5, "赞": 2.0, "棒": 2.0, "开心": 1.5, "值得": 1.0,
    "厉害": 1.5, "漂亮": 1.0, "舒服": 1.5
}

negative_words = {
    "差": -1.5, "烂": -2.5, "垃圾": -2.5, "失望": -2.0, "后悔": -1.5,
    "贵": -1.0, "慢": -1.0, "难用": -2.0, "糟糕": -2.0, "恶心": -2.0,
    "坑": -2.0, "差评": -2.5, "不值": -1.5
}

# 程度副词（增强或减弱情感强度）
intensifiers = {"非常": 1.5, "很": 1.3, "太": 1.8, "极其": 2.0, "有点": 0.5}
negators = {"不", "没", "别", "勿", "未"}  # 否定词会反转情感极性

def sentiment_score(text):
    """基于词典的情感得分计算"""
    score = 0.0
    words = jieba.lcut(text)
    i = 0
    while i < len(words):
        multiplier = 1.0
        # 检查前面是否有程度副词
        if i > 0 and words[i-1] in intensifiers:
            multiplier = intensifiers[words[i-1]]
        # 检查前面是否有否定词
        negated = i > 0 and words[i-1] in negators

        if words[i] in positive_words:
            s = positive_words[words[i]] * multiplier
            score += -s if negated else s
        elif words[i] in negative_words:
            s = negative_words[words[i]] * multiplier
            score += -s if negated else s
        i += 1
    return score

# 测试
reviews = [
    "这个产品非常好，我很喜欢",
    "质量太差了，极其失望",
    "东西不错，但价格有点贵"
]

for rev in reviews:
    score = sentiment_score(rev)
    sentiment = "正面" if score > 0.5 else ("负面" if score < -0.5 else "中性")
    print(f"评论: {rev}")
    print(f"得分: {score:.2f} -> 情感: {sentiment}\n")

3.2 VADER 情感分析

VADER（Valence Aware Dictionary and sEntiment Reasoner）是专为社交媒体文本设计的情感分析工具，对表情符号、大写字母、标点重复（如"good!!!"）等社交媒体特征有特殊处理，无需训练即可使用。

import nltk
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

texts = [
    "This product is absolutely amazing!!!",
    "The service was terrible, I'm very disappointed.",
    "It's okay, nothing special.",
    "The movie was not bad at all :)",
    "OMG I can't believe how GREAT this is!!! <3"
]

for text in texts:
    scores = analyzer.polarity_scores(text)
    sentiment = "正面" if scores['compound'] > 0.05 else ("负面" if scores['compound'] < -0.05 else "中性")
    print(f"文本: {text}")
    print(f"  VADER: neg={scores['neg']:.2f}, neu={scores['neu']:.2f}, ",
          end="")
    print(f"pos={scores['pos']:.2f}, compound={scores['compound']:.2f} -> {sentiment}")

3.3 TextBlob 情感分析

TextBlob 是一个更易用的NLP库，内置情感分析接口，返回极性和主观性两个指标。

from textblob import TextBlob

texts = [
    "I love this beautiful weather!",
    "This is a terrible experience.",
    "The book is interesting but quite long."
]

for text in texts:
    blob = TextBlob(text)
    polarity = blob.sentiment.polarity      # -1.0 ~ 1.0
    subjectivity = blob.sentiment.subjectivity  # 0.0 ~ 1.0
    sentiment = "正面" if polarity > 0 else ("负面" if polarity < 0 else "中性")
    print(f"文本: {text}")
    print(f"  Polar={polarity:.2f}, Subj={subjectivity:.2f} -> {sentiment}")

3.4 基于机器学习的情感分类

使用标注好的情感数据训练分类模型。典型流程：TF-IDF向量化 + 逻辑回归/朴素贝叶斯/SVM分类器。

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report, accuracy_score

# 模拟训练数据
texts = [
    "这个产品太棒了，质量很好",
    "非常满意，下次还会购买",
    "物流很快，包装也很好",
    "性价比很高，推荐购买",
    "质量很差，用了一次就坏了",
    "客服态度恶劣，非常失望",
    "价格太贵了，不值这个价",
    "收到就有问题，差评"
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1=正面, 0=负面

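import jieba

# Split first, then vectorize, so the vectorizer never sees the test texts
train_texts, test_texts, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# Chinese has no spaces, so the default tokenizer would treat each clause as
# one token; jieba supplies the word segmentation instead
vectorizer = TfidfVectorizer(tokenizer=jieba.lcut, token_pattern=None, max_features=1000)
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)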

# 训练多种模型对比
models = {
    "逻辑回归": LogisticRegression(max_iter=1000),
    "朴素贝叶斯": MultinomialNB(),
    "线性SVM": LinearSVC(max_iter=1000)
}

for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    print(f"{name}: 准确率 = {acc:.2%}")

# 预测新文本
new_texts = ["非常不错，值得推荐", "质量太差了"]
X_new = vectorizer.transform(new_texts)
for t, pred in zip(new_texts, models["逻辑回归"].predict(X_new)):
    sentiment = "正面" if pred == 1 else "负面"
    print(f"'{t}' -> {sentiment}")

3.5 BERT 情感分析

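Pre-trained Transformer models such as BERT and RoBERTa generally outperform lexicon-based and bag-of-words baselines on sentiment benchmarks. The HuggingFace Transformers library makes it straightforward to load such a model for inference or for fine-tuning.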

# BERT 情感分析示例
# 需要安装：pip install transformers torch

from transformers import pipeline

# Load a pre-trained sentiment model (this checkpoint is multilingual)
sentiment_pipeline = pipeline(
    "sentiment-analysis",
    model="lxyuan/distilbert-base-multilingual-cased-sentiments-student",
    # Or a Chinese-only model: model="uer/roberta-base-finetuned-dianping-chinese"
)

# Inference
reviews = [
    "这家餐厅的服务非常好，菜品也很美味！",
    "体验太糟糕了，再也不会来了。"
]
results = sentiment_pipeline(reviews)

for text, result in zip(reviews, results):
    label = result['label']
    score = result['score']
    # Label names vary by model ("positive" vs "POSITIVE"), so compare case-insensitively
    emoji = "😊" if label.lower().startswith("pos") else "😞"
    print(f"文本: {text}")
    print(f"  BERT结果: {label} ({score:.2%}) {emoji}\n")

# 批量处理优化（GPU加速）
# sentiments = sentiment_pipeline(large_text_list, batch_size=32)

情感分析技术选型建议

场景	推荐方法	理由
社交媒体/评论快速分析	VADER	无需训练，速度快，适合短文本
中文电商评论分析	词典法 + 机器学习	词典灵活可定制，LR/SVM效果稳定
长文本/新闻分析	BERT微调	能理解上下文语境，处理复杂语义
多语言/跨领域	XLM-R / mBERT	跨语言迁移能力强
细粒度情绪检测	微调预训练模型	可输出多种情绪类别

四、主题建模

主题建模是一种无监督学习方法，自动从大量文本文档中发现隐藏的主题结构。每个主题由一组词语的概率分布表示，每个文档由一组主题的概率分布表示。

4.1 LDA（隐含狄利克雷分配）

LDA（Latent Dirichlet Allocation）是最经典的主题模型算法。它假设每个文档由若干主题的混合构成，每个主题由若干词语的概率分布构成。通过Gibbs采样或变分推断来估计这些分布。

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

# 模拟文档集
documents = [
    "苹果发布了新款iPhone手机，摄像头和处理器都有升级",
    "华为Mate系列手机搭载麒麟芯片性能强劲",
    "特斯拉电动车续航里程突破800公里",
    "比亚迪发布刀片电池新技术",
    "央行调整存款准备金率影响股市走势",
    "A股市场今日放量上涨，北向资金持续流入",
    "北京发布人工智能产业发展白皮书",
    "上海加快推进科技创新中心建设"
]

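# Vectorize with a bag-of-words model. Chinese has no spaces, so jieba must
# supply the tokens; otherwise each whole clause becomes a single "word".
import jieba
vectorizer = CountVectorizer(tokenizer=jieba.lcut, token_pattern=None, max_features=1000, min_df=1)
doc_term_matrix = vectorizer.fit_transform(documents)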

# 训练 LDA 模型
n_topics = 3
lda = LatentDirichletAllocation(
    n_components=n_topics,
    random_state=42,
    learning_method='batch',  # 'batch' 或 'online'
    max_iter=10,
    n_jobs=-1
)
lda.fit(doc_term_matrix)

# Top words per topic
n_top_words = 8
feature_names = vectorizer.get_feature_names_out()
for topic_idx, topic in enumerate(lda.components_):
    top_words_idx = topic.argsort()[::-1][:n_top_words]
    top_words = [feature_names[i] for i in top_words_idx]
    print(f"主题 {topic_idx+1}: {' | '.join(top_words)}")

# Illustrative output of the kind a larger corpus produces; on this
# 8-document toy set the topics will be much noisier:
# 主题 1: 手机 | 苹果 | 发布 | iphone | 摄像头 | 处理器 | 华为 | 芯片
# 主题 2: 市场 | 股市 | 央行 | 资金 | 上涨 | 调整 | 存款 | a股
# 主题 3: 人工智能 | 科技 | 创新 | 产业 | 建设 | 推进 | 上海 | 北京

# 获取文档的主题分布
doc_topic_dist = lda.transform(doc_term_matrix)
for i, dist in enumerate(doc_topic_dist):
    dominant_topic = np.argmax(dist) + 1
    print(f"文档{i+1}: 主题分布={np.round(dist, 2)}, 主导主题=主题{dominant_topic}")
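
The generative story that LDA assumes can be simulated directly. The sketch below is a toy forward simulation, not an inference algorithm: the two topic-word distributions are fixed by hand (in the full model they also have a Dirichlet prior), and all names and probabilities are made up for illustration.

```python
import random

def sample_dirichlet(alpha: list[float], rng: random.Random) -> list[float]:
    """Draw from Dirichlet(alpha) by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topics, alpha, length, rng):
    """Draw theta ~ Dir(alpha); for each word draw a topic z, then a word from topic z."""
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(length):
        z = rng.choices(range(len(topics)), weights=theta)[0]
        vocab, probs = zip(*topics[z].items())
        words.append(rng.choices(vocab, weights=probs)[0])
    return theta, words

# Hypothetical topic-word distributions
topics = [
    {"手机": 0.5, "芯片": 0.3, "发布": 0.2},
    {"股市": 0.5, "央行": 0.3, "资金": 0.2},
]
rng = random.Random(42)
theta, doc = generate_document(topics, alpha=[0.5, 0.5], length=8, rng=rng)
print([round(t, 2) for t in theta], doc)
```

Fitting LDA runs this story in reverse: given only the words, it estimates the per-document mixture `theta` and the per-topic word distributions.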

4.2 NMF（非负矩阵分解）

NMF（Non-negative Matrix Factorization）是另一种有效的主题建模方法。它将文档-词项矩阵分解为两个非负矩阵的乘积，分别对应文档-主题和主题-词项的表示。NMF的主题通常比LDA更稀疏、更易解释。

from sklearn.decomposition import NMF

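# Use TF-IDF features (NMF usually works better with TF-IDF than with raw counts)
import jieba
from sklearn.feature_extraction.text import TfidfVectorizer

# jieba supplies the Chinese word segmentation
tfidf_vectorizer = TfidfVectorizer(tokenizer=jieba.lcut, token_pattern=None, max_features=1000, min_df=1)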
tfidf_matrix = tfidf_vectorizer.fit_transform(documents)

# 训练 NMF 模型
n_components = 3
nmf = NMF(
    n_components=n_components,
    random_state=42,
    init='nndsvdar',
    beta_loss='frobenius',
    max_iter=200
)
W = nmf.fit_transform(tfidf_matrix)  # 文档-主题矩阵
H = nmf.components_                  # 主题-词项矩阵

# 展示每个主题的关键词（NMF 主题更稀疏）
feature_names = tfidf_vectorizer.get_feature_names_out()
for topic_idx, topic in enumerate(H):
    top_words_idx = topic.argsort()[:-8:-1]
    top_words = [feature_names[i] for i in top_words_idx]
    scores = [topic[i] for i in top_words_idx]
    print(f"NMF 主题 {topic_idx+1}:")
    for word, score in zip(top_words, scores):
        print(f"  {word}: {score:.4f}")

4.3 gensim 实现 LDA

gensim 提供了更丰富的LDA实现，支持流式训练（可处理超大规模语料）、模型持久化、以及更多的评估指标。

from gensim import corpora, models
import jieba

# 中文文本分词
texts = [
    "苹果发布了新款iPhone手机摄像头和处理器都有升级",
    "华为Mate系列手机搭载麒麟芯片性能强劲",
    "特斯拉电动车续航里程突破800公里",
    "央行调整存款准备金率影响股市走势",
    "北京发布人工智能产业发展白皮书"
]

# 分词
tokenized_texts = [jieba.lcut(t) for t in texts]
print("分词结果:", tokenized_texts[:2])

# 创建词典和语料
dictionary = corpora.Dictionary(tokenized_texts)
print("词典大小:", len(dictionary))

# 过滤极端词
dictionary.filter_extremes(no_below=1, no_above=0.8)
print("过滤后词典大小:", len(dictionary))

# 转换为词袋语料
corpus = [dictionary.doc2bow(text) for text in tokenized_texts]

# 训练 gensim LDA
gensim_lda = models.LdaModel(
    corpus=corpus,
    id2word=dictionary,
    num_topics=3,
    random_state=42,
    update_every=1,      # update the model after each chunk of documents
    passes=10,           # 训练轮次
    alpha='auto',        # 自动学习主题-文档先验
    eta='auto',          # 自动学习主题-词项先验
    per_word_topics=True
)

# 展示主题
topics = gensim_lda.show_topics(num_topics=3, num_words=8, formatted=False)
for topic_id, words in topics:
    print(f"主题 {topic_id + 1}:")
    for word, prob in words:
        print(f"  {word}: {prob:.4f}")

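# Perplexity (lower is better). log_perplexity returns the per-word
# likelihood bound, so convert it with 2 ** (-bound)
bound = gensim_lda.log_perplexity(corpus)
print(f"困惑度: {2 ** (-bound):.2f}")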

# 计算主题一致性（Coherence，越高越好）
from gensim.models.coherencemodel import CoherenceModel
coherence_model = CoherenceModel(
    model=gensim_lda,
    texts=tokenized_texts,
    dictionary=dictionary,
    coherence='c_v'
)
print(f"主题一致性: {coherence_model.get_coherence():.4f}")

4.4 pyLDAvis 可视化

pyLDAvis 基于LDAvis（Sievert & Shirley, 2014），提供交互式主题模型可视化。它通过多维缩放（MDS）将主题投影到二维平面，并展示每个主题的关键词分布，帮助分析师理解和评估主题模型质量。

# pyLDAvis 可视化（需安装: pip install pyLDAvis）
import pyLDAvis.gensim_models as gensimvis
import pyLDAvis

# 准备可视化数据
vis_data = gensimvis.prepare(gensim_lda, corpus, dictionary)

# 保存为交互式HTML文件
pyLDAvis.save_html(vis_data, 'lda_visualization.html')

# 在Jupyter中显示
# pyLDAvis.display(vis_data)

print("可视化已保存: lda_visualization.html")
print("左侧面板: 主题间距离（MDS投影）")
print("右侧面板: 主题关键词柱状图（红色=该词在所选主题内的估计词频，蓝色=该词在整个语料中的词频）")
print("用法: 调整 λ 滑块（0-1）控制关键词排序权重")
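
The λ slider ranks terms by the relevance measure from Sievert & Shirley (2014): `relevance = λ · log p(w|t) + (1 − λ) · log(p(w|t) / p(w))`. The sketch below evaluates it at both ends of the slider; the two terms and their probabilities are hypothetical values picked for illustration.

```python
import math

def relevance(p_w_given_t: float, p_w: float, lam: float) -> float:
    """lam * log p(w|t) + (1 - lam) * log(p(w|t) / p(w))."""
    return lam * math.log(p_w_given_t) + (1 - lam) * math.log(p_w_given_t / p_w)

# Hypothetical values: term -> (p(w|t), p(w))
terms = {"发布": (0.10, 0.08), "芯片": (0.04, 0.005)}

rankings = {}
for lam in (1.0, 0.0):
    rankings[lam] = sorted(terms, key=lambda w: relevance(*terms[w], lam), reverse=True)
    print(lam, rankings[lam])
# 1.0 ['发布', '芯片']
# 0.0 ['芯片', '发布']
```

At λ = 1 the ranking follows raw within-topic probability, which favors globally common words; at λ = 0 it follows lift, which favors words that are rare elsewhere.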

4.5 最优主题数选择

选择最优主题数是主题建模的关键决策。常用方法：计算不同主题数下的困惑度（Perplexity）和主题一致性（Coherence），找到拐点或最大值。

import numpy as np  # needed for np.argmax below
import matplotlib.pyplot as plt
from gensim.models import LdaModel
from gensim.models.coherencemodel import CoherenceModel

def evaluate_topics(corpus, dictionary, texts, min_topics=2, max_topics=10):
    """评估不同主题数的效果"""
    perplexities = []
    coherences = []
    models = []

    for k in range(min_topics, max_topics + 1):
        print(f"正在训练 K={k} 的主题模型...")
        model = LdaModel(
            corpus=corpus,
            id2word=dictionary,
            num_topics=k,
            random_state=42,
            passes=10,
            alpha='auto'
        )
        models.append(model)

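        # Perplexity (lower is better). log_perplexity returns the per-word
        # likelihood bound, so convert it with 2 ** (-bound)
        perplexity = 2 ** (-model.log_perplexity(corpus))
        perplexities.append(perplexity)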

        # 主题一致性（越高越好）
        cm = CoherenceModel(
            model=model,
            texts=texts,
            dictionary=dictionary,
            coherence='c_v'
        )
        coherence = cm.get_coherence()
        coherences.append(coherence)

        print(f"  K={k}: Perplexity={perplexity:.2f}, Coherence={coherence:.4f}")

    # 绘制评估曲线
    ks = list(range(min_topics, max_topics + 1))

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

    ax1.plot(ks, perplexities, 'bo-')
    ax1.set_xlabel('主题数 K')
    ax1.set_ylabel('困惑度 Perplexity')
    ax1.set_title('主题数 vs 困惑度')
    ax1.grid(True)

    ax2.plot(ks, coherences, 'ro-')
    ax2.set_xlabel('主题数 K')
    ax2.set_ylabel('主题一致性 Coherence')
    ax2.set_title('主题数 vs 主题一致性')
    ax2.grid(True)

    plt.tight_layout()
    plt.show()

    # 选择最佳模型
    best_idx = np.argmax(coherences)
    best_k = ks[best_idx]
    print(f"\n推荐主题数: K={best_k} (Coherence={coherences[best_idx]:.4f})")

    return models[best_idx]

五、文本分类

文本分类是NLP中最成熟的监督学习任务之一，应用包括垃圾邮件检测、新闻分类、情感分类、意图识别等。根据标签数量可分为二分类、多分类和多标签分类。

5.1 朴素贝叶斯分类器

朴素贝叶斯（Naive Bayes）基于贝叶斯定理和特征独立假设，训练速度极快，在小样本和高维稀疏数据上表现优秀。常用变体包括MultinomialNB（多项式朴素贝叶斯，适合计数特征）和BernoulliNB（伯努利朴素贝叶斯，适合二值特征）。

from sklearn.naive_bayes import MultinomialNB, BernoulliNB, ComplementNB
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

# 新闻分类示例数据
news_data = [
    ("普京与拜登举行视频会晤讨论乌克兰局势", "国际"),
    ("日本首相访问印度加强双边合作", "国际"),
    ("A股三大指数集体收涨创业板指涨逾2%", "财经"),
    ("央行宣布定向降准释放资金约5000亿", "财经"),
    ("中国男足1比3不敌越南队无缘世界杯", "体育"),
    ("CBA联赛广东队夺得总冠军实现三连冠", "体育"),
    ("嫦娥五号月球样品研究成果发布", "科技"),
    ("华为发布鸿蒙操作系统3.0版本", "科技"),
]

texts, labels = zip(*news_data)

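import jieba

# Build the pipeline. The headlines contain no spaces, so without jieba
# each whole headline would become a single token.
nb_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(tokenizer=jieba.lcut, token_pattern=None, max_features=1000)),
    ('clf', MultinomialNB(alpha=1.0))  # alpha: Laplace smoothing parameter
])

# Cross-validation: each class has only 2 samples, so stratified CV supports at most 2 folds
scores = cross_val_score(nb_pipeline, texts, labels, cv=2)
print(f"朴素贝叶斯交叉验证准确率: {scores.mean():.2%}")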

# 训练完整模型并预测
nb_pipeline.fit(texts, labels)
pred = nb_pipeline.predict(["OpenAI发布GPT-5模型"])
print(f"预测结果: {pred[0]}")

# 预测概率
probs = nb_pipeline.predict_proba(["特斯拉股价突破1000美元"])
classes = nb_pipeline.named_steps['clf'].classes_
for cls, prob in zip(classes, probs[0]):
    print(f"  {cls}: {prob:.2%}")
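
The `alpha` parameter is easiest to understand on raw counts. The sketch below computes Laplace-smoothed word likelihoods `P(w|c) = (count + α) / (total + α·|V|)` for a single class. The token lists and vocabulary are hypothetical, and the pipeline above feeds TF-IDF weights rather than integer counts, so its numbers will differ.

```python
from collections import Counter

def smoothed_likelihoods(docs, vocab, alpha=1.0):
    """P(w | class) with add-alpha smoothing over a fixed vocabulary."""
    counts = Counter(tok for doc in docs for tok in doc)
    total = sum(counts[w] for w in vocab)
    return {w: (counts[w] + alpha) / (total + alpha * len(vocab)) for w in vocab}

# Hypothetical tokenized documents of one class ("体育")
sports_docs = [["男足", "世界杯"], ["联赛", "冠军", "冠军"]]
vocab = ["男足", "世界杯", "联赛", "冠军", "股市"]

probs = smoothed_likelihoods(sports_docs, vocab)
print(probs["冠军"], probs["股市"])
# 0.3 0.1
```

"股市" never occurs in the sports documents, yet it keeps a non-zero probability; without smoothing a single unseen word would zero out the whole class posterior.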

5.2 逻辑回归与SVM

逻辑回归（Logistic Regression）和SVM（Support Vector Machine）是文本分类的另外两种高效算法。逻辑回归通过softmax输出概率分布，SVM通过最大化分类间隔提高泛化能力。线性核SVM在高维文本数据上通常表现优秀。

from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split

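# Enlarge the data set. Repeating the 8 samples 5 times only makes the split
# possible: identical texts land in both train and test, so the scores below are inflated.
texts = texts * 5
labels = labels * 5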

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.3, random_state=42, stratify=labels
)

import jieba

# Vectorizer + classifier pipelines; jieba supplies the Chinese tokenization
models = {
    'LogisticRegression': Pipeline([
        ('tfidf', TfidfVectorizer(tokenizer=jieba.lcut, token_pattern=None,
                                  max_features=2000, ngram_range=(1, 2))),
        ('clf', LogisticRegression(max_iter=1000, C=1.0))  # multinomial (softmax) by default
    ]),
    'LinearSVM': Pipeline([
        ('tfidf', TfidfVectorizer(tokenizer=jieba.lcut, token_pattern=None,
                                  max_features=2000, ngram_range=(1, 2))),
        ('clf', LinearSVC(C=1.0, max_iter=2000, dual='auto'))
    ])
}

# The loop variable is named `pipe` so it does not shadow transformers.pipeline imported earlier
for name, pipe in models.items():
    pipe.fit(X_train, y_train)
    y_pred = pipe.predict(X_test)
    print(f"\n{name} 分类报告:")
    print(f"  准确率: {accuracy_score(y_test, y_pred):.2%}")
    print(f"  宏平均F1: {f1_score(y_test, y_pred, average='macro'):.2%}")
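
Macro and weighted F1 can diverge sharply on imbalanced data, so the label on a report should match the `average` argument. The sketch below computes both by hand on a small made-up example in which the classifier never predicts the minority class.

```python
from collections import Counter

def f1_per_class(y_true, y_pred):
    """Per-class F1 = 2*TP / (2*TP + FP + FN); 0.0 when the class has no true positives."""
    scores = {}
    for c in set(y_true):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        scores[c] = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return scores

# Hypothetical labels: 4 "财经" samples, 1 "体育" sample, everything predicted as "财经"
y_true = ["财经"] * 4 + ["体育"]
y_pred = ["财经"] * 5

per_class = f1_per_class(y_true, y_pred)
support = Counter(y_true)
macro = sum(per_class.values()) / len(per_class)
weighted = sum(per_class[c] * support[c] for c in per_class) / len(y_true)
print(round(macro, 3), round(weighted, 3))
# 0.444 0.711
```

Macro averaging treats every class equally and exposes the failure on "体育"; weighted averaging follows class frequency and hides most of it.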

5.3 深度学习文本分类

使用深度神经网络（如TextCNN、LSTM、Transformer）进行文本分类，能够自动学习文本特征，无需手工特征工程。以下使用HuggingFace Transformers库快速微调一个文本分类模型。

# 深度学习文本分类（使用 Transformers）
# pip install transformers torch datasets

from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer
)
import torch

# 1. 加载预训练模型和分词器
model_name = "bert-base-chinese"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=4  # 4个分类类别
)

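# 2. Preprocessing: tokenize, pad and truncate to a fixed length
def tokenize_function(examples):
    return tokenizer(
        examples["text"],
        padding="max_length",
        truncation=True,
        max_length=128
    )

# Toy placeholder data so that train_dataset / eval_dataset exist; swap in a real labeled corpus
from datasets import Dataset
toy = Dataset.from_dict({"text": ["日本首相访问印度", "央行宣布定向降准", "广东队夺得总冠军", "华为发布鸿蒙系统"] * 2,
                         "label": [0, 1, 2, 3] * 2})
splits = toy.map(tokenize_function, batched=True).train_test_split(test_size=0.25, seed=42)
train_dataset, eval_dataset = splits["train"], splits["test"]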

# 3. 训练参数配置
training_args = TrainingArguments(
    output_dir="./text_classification_model",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    eval_strategy="epoch",           # named evaluation_strategy in transformers releases before 4.41
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=False,
)

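# 4. Create the trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,     # tokenized Dataset objects
    eval_dataset=eval_dataset,
    # Inputs are already padded to max_length, so no tokenizer is needed for batching.
    # To save it with checkpoints, pass processing_class=tokenizer
    # (tokenizer=tokenizer on older transformers releases).
)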

# 5. 开始训练
# trainer.train()

# 6. 保存模型
# model.save_pretrained("./my_text_classifier")
# tokenizer.save_pretrained("./my_text_classifier")

print("深度学习文本分类流水线已配置就绪")
print(f"模型: {model_name}")
print(f"类别数: {model.config.num_labels}")

# 推理示例
def predict(text, model, tokenizer):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        outputs = model(**inputs)
    logits = outputs.logits
    predicted_class = torch.argmax(logits, dim=1).item()
    return predicted_class
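
`predict` returns only the arg-max class. When a confidence value is needed, the logits can be turned into probabilities with a softmax. The sketch below is a plain-Python version with made-up logits for a 4-class model; with PyTorch tensors, `torch.softmax(logits, dim=1)` does the same thing.

```python
import math

def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for the 4 classes
probs = softmax([2.0, 1.0, 0.1, -1.0])
print([round(p, 3) for p in probs])
# [0.638, 0.235, 0.095, 0.032]
```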

5.4 多标签文本分类

多标签分类（Multi-label Classification）中每个文档可能同时属于多个类别（如一篇新闻同时属于"科技"和"财经"）。常用方法包括问题转化法（Binary Relevance、Classifier Chains）和算法自适应法（ML-kNN）。

from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import hamming_loss, jaccard_score

# 多标签数据：每个文档可属于多个类别
multi_texts = [
    "苹果发布新款MacBook Pro搭载M3芯片",
    "央行降息对房地产和股市的影响分析",
    "皇马夺得欧冠冠军本泽马梅开二度",
    "特斯拉FSD自动驾驶系统在中国获批测试"
]

# 多标签编码：[科技, 财经, 体育, 汽车]
multi_labels = [
    [1, 0, 0, 0],  # 只有科技
    [0, 1, 0, 0],  # 只有财经
    [0, 0, 1, 0],  # 只有体育
    [1, 0, 0, 1],  # 科技+汽车
]

# 使用多输出分类器（为每个标签训练一个二分类器）
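import jieba
# jieba supplies the Chinese tokenization; without it each headline is one token
vectorizer = TfidfVectorizer(tokenizer=jieba.lcut, token_pattern=None, max_features=1000)
X = vectorizer.fit_transform(multi_texts)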

base_clf = RandomForestClassifier(n_estimators=100, random_state=42)
multi_clf = MultiOutputClassifier(base_clf, n_jobs=-1)
multi_clf.fit(X, multi_labels)

# 预测
pred = multi_clf.predict(vectorizer.transform(["小米SU7电动车发布智能座舱系统"]))
print("预测标签:", pred[0])  # one 0/1 flag per label in the order [科技, 财经, 体育, 汽车]; with 4 training samples the prediction is not reliable

# 评估指标
print(f"Hamming Loss: {hamming_loss(multi_labels, multi_clf.predict(X)):.4f}")
print(f"Jaccard Score: {jaccard_score(multi_labels, multi_clf.predict(X), average='samples'):.4f}")
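
Both metrics are simple enough to compute by hand, which helps when reading the numbers. The sketch below uses hypothetical label matrices and the same definitions: Hamming loss is the fraction of wrong label cells, and the sample-averaged Jaccard score is intersection over union per document, averaged. Returning 1.0 for a document with no true and no predicted labels is a choice made for this sketch.

```python
def hamming_loss_manual(y_true, y_pred):
    """Fraction of individual label cells that are predicted wrongly."""
    cells = [(t, p) for rt, rp in zip(y_true, y_pred) for t, p in zip(rt, rp)]
    return sum(t != p for t, p in cells) / len(cells)

def jaccard_samples_manual(y_true, y_pred):
    """Per-document |intersection| / |union| of label sets, averaged over documents."""
    scores = []
    for rt, rp in zip(y_true, y_pred):
        inter = sum(1 for t, p in zip(rt, rp) if t and p)
        union = sum(1 for t, p in zip(rt, rp) if t or p)
        scores.append(inter / union if union else 1.0)
    return sum(scores) / len(scores)

# Hypothetical labels in the order [科技, 财经, 体育, 汽车]
y_true = [[1, 0, 0, 1], [0, 1, 0, 0]]
y_pred = [[1, 0, 0, 0], [0, 1, 0, 0]]

print(hamming_loss_manual(y_true, y_pred), jaccard_samples_manual(y_true, y_pred))
# 0.125 0.75
```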

5.5 文本分类流水线

将完整的文本分类流程封装为可复用的流水线，包括预处理、特征提取、模型训练、超参数调优和模型持久化。

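# End-to-end text classification pipeline
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
import joblib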

class TextClassifier:
    """端到端文本分类器"""

    def __init__(self, vectorizer=None, classifier=None):
        self.vectorizer = vectorizer or TfidfVectorizer(
            max_features=5000,
            ngram_range=(1, 2),
            min_df=2,
            max_df=0.8,
            sublinear_tf=True  # 使用 1 + log(tf)
        )
        self.classifier = classifier or LogisticRegression(
            max_iter=2000,
            C=1.0,
            class_weight='balanced',
            random_state=42
        )
        self.pipeline = Pipeline([
            ('vec', self.vectorizer),
            ('clf', self.classifier)
        ])

    def train(self, X_train, y_train):
        self.pipeline.fit(X_train, y_train)
        return self

    def predict(self, texts):
        return self.pipeline.predict(texts)

    def predict_proba(self, texts):
        return self.pipeline.predict_proba(texts)

    def evaluate(self, X_test, y_test):
        from sklearn.metrics import classification_report
        y_pred = self.pipeline.predict(X_test)
        return classification_report(y_test, y_pred)

    def tune_hyperparams(self, X, y, cv=5):
        """超参数网格搜索"""
        param_grid = {
            'vec__max_features': [1000, 3000, 5000],
            'vec__ngram_range': [(1, 1), (1, 2), (1, 3)],
            'clf__C': [0.1, 1.0, 10.0],
            'clf__penalty': ['l2']
        }
        grid_search = GridSearchCV(
            self.pipeline,
            param_grid,
            cv=cv,
            scoring='f1_weighted',
            n_jobs=-1,
            verbose=1
        )
        grid_search.fit(X, y)
        print(f"最佳参数: {grid_search.best_params_}")
        print(f"最佳得分: {grid_search.best_score_:.4f}")
        self.pipeline = grid_search.best_estimator_
        return self

    def save(self, filepath):
        joblib.dump(self.pipeline, filepath)
        print(f"模型已保存: {filepath}")

    def load(self, filepath):
        self.pipeline = joblib.load(filepath)
        print(f"模型已加载: {filepath}")
        return self

# 使用示例
clf = TextClassifier()
# clf.train(train_texts, train_labels)
# report = clf.evaluate(test_texts, test_labels)
# clf.save("text_classifier.pkl")

六、词云生成

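A word cloud is a quick, intuitive way to visualize text data. The size of each word reflects its weight in the text (usually its frequency), while color and layout are purely aesthetic, which makes word clouds a handy tool for exploratory data analysis.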

6.1 基础词云

使用Python的wordcloud库可以快速生成高质量的词云图。支持自定义停用词、颜色映射、形状掩码和字体。

# pip install wordcloud matplotlib

from wordcloud import WordCloud
import matplotlib.pyplot as plt
import jieba
from collections import Counter

# 示例文本
text = """
自然语言处理是人工智能的重要分支，它研究如何让计算机理解和生成人类语言。
机器学习深度学习是自然语言处理的核心技术，通过大量文本数据的训练，
模型可以完成文本分类情感分析机器翻译问答系统等多种任务。
近年来预训练语言模型如BERT和GPT系列在自然语言处理领域取得了突破性进展。
"""

# 分词并统计词频
words = jieba.lcut(text)
# 过滤停用词和单字词
stopwords = {"的", "了", "是", "在", "和", "就", "也", "都", "与",
             "为", "等", "之", "及", "或", "更", "被", "由", "所"}
filtered = [w for w in words if len(w) > 1 and w not in stopwords]
word_freq = Counter(filtered)

# 基础词云
wordcloud = WordCloud(
    font_path='C:/Windows/Fonts/msyh.ttc',  # 中文字体路径
    width=800,
    height=600,
    background_color='white',
    max_words=100,
    max_font_size=200,
    min_font_size=10,
    colormap='viridis',        # 颜色映射
    random_state=42,
    margin=5
).generate_from_frequencies(word_freq)

# 显示
plt.figure(figsize=(12, 8))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('文本词云分析', fontsize=16)
plt.show()

# 保存
wordcloud.to_file("basic_wordcloud.png")

6.2 形状词云（Mask）

使用图片的轮廓作为词云的形状掩码，使词云更具视觉冲击力。例如使用五角星、动物轮廓或公司Logo形状。

import numpy as np
from PIL import Image
from wordcloud import WordCloud, ImageColorGenerator
import matplotlib.pyplot as plt

# Load a shape mask from an image file
# mask = np.array(Image.open("heart_shape.png"))

# A mask can also be built directly as a NumPy array.
# wordcloud treats pure white (255) pixels as masked out and draws
# words only on the remaining pixels.
def create_circle_mask(height, width):
    """Create a mask whose drawable region is a centered circle."""
    mask = np.full((height, width), 255, dtype=np.uint8)  # start fully masked out
    center_y, center_x = height // 2, width // 2
    radius = min(height, width) // 3
    y, x = np.ogrid[:height, :width]
    inside = (y - center_y)**2 + (x - center_x)**2 <= radius**2
    mask[inside] = 0  # words may be drawn inside the circle
    return mask

circle_mask = create_circle_mask(600, 800)

# Word cloud constrained to the shape
masked_wordcloud = WordCloud(
    font_path='C:/Windows/Fonts/msyh.ttc',
    width=800,                   # width and height are ignored when a mask is given
    height=600,
    background_color='black',    # a black background makes the shape stand out
    mask=circle_mask,            # shape mask
    contour_width=2,             # outline width
    contour_color='steelblue',   # outline color
    max_words=150,
    colormap='coolwarm',
    random_state=42
).generate_from_frequencies(word_freq)

plt.figure(figsize=(12, 10))
plt.imshow(masked_wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Shaped Word Cloud (Circular Mask)', fontsize=16)
plt.show()

# Take word colors from an image (the image must be at least as large as the canvas)
# image_colors = ImageColorGenerator(np.array(Image.open("image.jpg")))
# wordcloud.recolor(color_func=image_colors)

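6.3 Advanced Customization

A custom color function and a custom stop-word list give fine-grained control over how the word cloud looks and which words it contains.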

from wordcloud import WordCloud
import random

# Custom color function
def custom_color_func(word, font_size, position, orientation,
                       random_state=None, **kwargs):
    """Pick a color from the rendered font size (larger words get stronger reds)."""
    colors = ['#e74c3c', '#3498db', '#2ecc71', '#f39c12',
              '#9b59b6', '#1abc9c', '#e67e22', '#34495e']
    # Use the generator wordcloud passes in so random_state=42 stays reproducible
    rng = random_state or random
    if font_size > 150:
        return '#c0392b'
    elif font_size > 100:
        return '#e74c3c'
    elif font_size > 60:
        return '#f39c12'
    else:
        return rng.choice(colors)

# Extended custom stop-word list
custom_stopwords = set([
    "的", "了", "是", "在", "和", "就", "也", "都", "与", "为",
    "等", "之", "及", "或", "更", "被", "由", "所", "有", "对",
    "以", "从", "到", "让", "会", "但", "这", "那", "它", "们",
    "要", "可以", "没有", "这个", "那个", "自己", "进行", "使用",
    "通过", "以及", "因此", "然而", "所以", "如果", "因为"
])

advanced_wordcloud = WordCloud(
    font_path='C:/Windows/Fonts/msyh.ttc',
    width=1200,
    height=800,
    background_color='white',
    stopwords=custom_stopwords,    # applied by generate(), not by generate_from_frequencies()
    max_words=200,
    max_font_size=300,
    min_font_size=12,
    color_func=custom_color_func,
    prefer_horizontal=0.7,     # share of words placed horizontally
    relative_scaling=0.5,      # how strongly frequency drives font size
    collocations=False,        # do not add two-word collocations (bigrams)
    random_state=42
)

# generate() splits raw text on whitespace and punctuation only; it does not
# segment Chinese, so segment with jieba first and join the tokens with spaces
# advanced_wordcloud.generate(" ".join(jieba.lcut(text)))

# recolor() can swap in a different color function after generation
# advanced_wordcloud.recolor(color_func=custom_color_func)
# advanced_wordcloud.to_file("advanced_wordcloud.png")

print("Word cloud customization options:")
print("- color_func: custom color function")
print("- colormap: a built-in matplotlib colormap")
print("- ImageColorGenerator: take colors from an image")
print("- relative_scaling: how strongly frequency affects font size")
print("- prefer_horizontal: share of words placed horizontally (0 to 1)")
print("- collocations: whether to include two-word collocations")

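Word Cloud Best Practices

Font choice: For Chinese word clouds, always set font_path to a font that contains Chinese glyphs (such as msyh.ttc); otherwise the words render as empty boxes.
Stop-word filtering: Filter out high-frequency function words, or the cloud will be dominated by words such as "的" and "了".
Color control: Use colormap or a custom color_func to improve readability.
Resolution: Use at least 1200x800 for print; 800x600 is enough for web pages.
Shape choice: Plain black-and-white silhouettes work best as masks; avoid images with busy backgrounds.
Text volume: Prefer texts of roughly 5,000 characters or more. In shorter texts most words occur only once or twice, so font sizes barely differ and the cloud looks cluttered.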
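
The font and filtering advice above can be sketched with the standard library alone. The helper names `pick_font` and `prune_frequencies`, the candidate font paths, and the `min_count` threshold are illustrative assumptions, not part of the wordcloud API.

```python
import os
from collections import Counter

# Hypothetical candidate paths; adjust them to the fonts installed on your system.
FONT_CANDIDATES = [
    "C:/Windows/Fonts/msyh.ttc",                               # Windows (Microsoft YaHei)
    "/System/Library/Fonts/PingFang.ttc",                      # macOS
    "/usr/share/fonts/opentype/noto/NotoSansCJK-Regular.ttc",  # Linux (Noto Sans CJK)
]

def pick_font(candidates=FONT_CANDIDATES):
    """Return the first font path that exists on this machine, or None."""
    for path in candidates:
        if os.path.exists(path):
            return path
    return None

def prune_frequencies(tokens, stopwords, min_count=2, min_len=2):
    """Count tokens, dropping stop words, short tokens, and rare tokens."""
    counts = Counter(t for t in tokens if len(t) >= min_len and t not in stopwords)
    return {word: n for word, n in counts.items() if n >= min_count}

tokens = ["自然语言", "处理", "的", "模型", "自然语言", "处理", "是", "模型", "翻译"]
freq = prune_frequencies(tokens, stopwords={"的", "是"})
print(freq)  # {'自然语言': 2, '处理': 2, '模型': 2}
```

The resulting dictionary can be passed to generate_from_frequencies(), and the value returned by pick_font() can be used as font_path once you have checked that it is not None.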

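7. Named Entity Recognition (NER)

Named entity recognition (NER) identifies spans of text that refer to specific kinds of entities, such as people, places, organizations, times, dates, and monetary amounts. NER underpins higher-level applications such as information extraction, knowledge graph construction, and question answering.

7.1 spaCy NER

spaCy ships NER out of the box, with pretrained models for many languages. The Chinese models recognize entity types such as people (PERSON), geopolitical entities (GPE), organizations (ORG), dates (DATE), and monetary amounts (MONEY).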

import spacy

# Load the Chinese model (download first: python -m spacy download zh_core_web_sm)
# nlp = spacy.load("zh_core_web_sm")

# English NER example (download first: python -m spacy download en_core_web_sm)
nlp_en = spacy.load("en_core_web_sm")

text_en = """
Apple Inc. is planning to open a new store in New York City next month.
Tim Cook announced the expansion during a conference in San Francisco in March 2024.
The project is expected to cost $5 million.
"""

doc_en = nlp_en(text_en)

print("English NER results:")
print(f"{'Entity':20s} | {'Label':12s} | {'Description':20s}")
print("-" * 58)
for ent in doc_en.ents:
    # spacy.explain() returns None for labels it does not know
    print(f"{ent.text:20s} | {ent.label_:12s} | {spacy.explain(ent.label_) or '':20s}")

# Typical output (exact entities can vary with the model version):
# Apple Inc.           | ORG         | Companies, agencies, institutions
# New York City        | GPE         | Countries, cities, states
# next month           | DATE        | Absolute or relative dates or periods
# Tim Cook             | PERSON      | People, including fictional
# San Francisco        | GPE         | Countries, cities, states
# March 2024           | DATE        | Absolute or relative dates or periods
# $5 million           | MONEY       | Monetary values, including unit

# Visualize the annotations (renders inline in Jupyter)
# from spacy import displacy
# displacy.render(doc_en, style="ent", jupyter=True)

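7.2 Chinese NER

Chinese NER is harder than English NER because Chinese has no spaces between words and entity boundaries are often ambiguous. The spaCy Chinese models support many entity types, but accuracy depends on how well the training data matches your text.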

# Chinese named entity recognition

# Option 1: the spaCy Chinese model
# nlp_zh = spacy.load("zh_core_web_sm")
# doc_zh = nlp_zh("华为技术有限公司成立于1987年，总部位于深圳龙岗区。任正非是创始人。")

# Option 2: HanLP (a stronger Chinese NLP toolkit)
# pip install hanlp
# import hanlp
# zh_ner = hanlp.load(hanlp.pretrained.ner.MSRA_NER_ELECTRA_SMALL_ZH)
# The recognizer takes pre-tokenized sentences (a list of token lists)
# result = zh_ner([["华为", "总部", "位于", "深圳", "。"]])

# Option 3: a hand-written rule-based recognizer
import re

class RuleBasedNER:
    """Rule-based Chinese named entity recognition."""

    def __init__(self):
        # Regex patterns for entities with a fixed surface form
        self.patterns = {
            # Mainland mobile number: 11 digits, not embedded in a longer digit run
            'PHONE': r'(?<!\d)1[3-9]\d{9}(?!\d)',
            # ASCII-only email, so adjacent Chinese characters are not swallowed
            'EMAIL': r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}',
            # 18-character resident ID; the last character may be X
            'ID_CARD': r'(?<!\d)\d{17}[\dXx](?!\d)',
            # Currency symbol before the number, or a unit word after it
            'MONEY': r'[¥￥$€]\d[\d,]*(?:\.\d+)?[万亿]?|\d[\d,]*(?:\.\d+)?[万亿]?(?:美元|欧元|元)',
        }

        # Dictionaries of known names
        self.known_organizations = [
            "阿里巴巴", "腾讯", "百度", "华为", "字节跳动", "京东",
            "美团", "小米", "中兴", "中国科学院", "清华大学"
        ]
        self.known_locations = [
            "北京", "上海", "广州", "深圳", "杭州", "南京", "武汉",
            "成都", "西安", "长三角", "珠三角", "京津冀"
        ]

    def extract_by_regex(self, text):
        """Extract entities with regular expressions."""
        entities = []
        for label, pattern in self.patterns.items():
            for match in re.finditer(pattern, text):
                entities.append({
                    'text': match.group(),
                    'label': label,
                    'start': match.start(),
                    'end': match.end()
                })
        return entities

    def extract_by_dictionary(self, text):
        """Extract entities by dictionary lookup, covering every occurrence."""
        entities = []
        lexicons = [('ORG', self.known_organizations), ('LOC', self.known_locations)]
        for label, names in lexicons:
            for name in names:
                for match in re.finditer(re.escape(name), text):
                    # Skip spans that overlap an entity already found
                    if any(e['start'] < match.end() and match.start() < e['end'] for e in entities):
                        continue
                    entities.append({
                        'text': name,
                        'label': label,
                        'start': match.start(),
                        'end': match.end()
                    })
        return entities

ner_engine = RuleBasedNER()
sample_text = "华为公司在深圳和北京都有研发中心，联系方式 13800138000，预算 ¥500万"
all_entities = ner_engine.extract_by_dictionary(sample_text) + ner_engine.extract_by_regex(sample_text)

print("Rule-based NER results:")
for ent in sorted(all_entities, key=lambda x: x['start']):
    print(f"  {ent['text']:12s} -> {ent['label']}")
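
The example above simply concatenates the dictionary results and the regex results, so two rules can claim overlapping characters. One possible way to reconcile them is to keep the longest span among overlapping candidates. The sketch below assumes entities are dictionaries with `text`, `label`, `start`, and `end` keys as above; the function name `resolve_overlaps` and the longest-span policy are illustrative choices, not the only reasonable ones.

```python
def resolve_overlaps(entities):
    """Keep the longest span among overlapping candidates; ties go to the earlier start."""
    # Rank candidates: longest first, then leftmost
    ranked = sorted(entities, key=lambda e: (-(e["end"] - e["start"]), e["start"]))
    kept = []
    for cand in ranked:
        overlaps = any(cand["start"] < k["end"] and k["start"] < cand["end"] for k in kept)
        if not overlaps:
            kept.append(cand)
    # Return the survivors in reading order
    return sorted(kept, key=lambda e: e["start"])

# Candidate spans for the text "清华大学在北京"
candidates = [
    {"text": "清华", "label": "LOC", "start": 0, "end": 2},
    {"text": "清华大学", "label": "ORG", "start": 0, "end": 4},
    {"text": "北京", "label": "LOC", "start": 5, "end": 7},
]
merged = resolve_overlaps(candidates)
print([(e["text"], e["label"]) for e in merged])
# [('清华大学', 'ORG'), ('北京', 'LOC')]
```

Preferring the longest span is a common heuristic for dictionary matching; a priority order over labels would be an equally valid policy when certain rules are more trustworthy than others.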

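7.3 Custom Entity Recognition

In real projects, pretrained models usually do not cover domain-specific entity types (drug names, disease names, product models, legal clause numbers, and so on), so they need to be supplemented with custom training or rules.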

# Add custom entity patterns to spaCy (rule-based)
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("zh_core_web_sm")

# Method 1: EntityRuler for pattern-based custom entities
# In spaCy v3 the component is created by name; placing it before "ner"
# lets the rules take precedence over the statistical model
ruler = nlp.add_pipe("entity_ruler", before="ner")

# Each pattern below matches a single token, so a hit depends on how
# the Chinese tokenizer segments the text
patterns = [
    {"label": "DRUG", "pattern": [{"TEXT": {"REGEX": r"(阿司匹林|布洛芬|青霉素|头孢|二甲双胍)"}}]},
    {"label": "DISEASE", "pattern": [{"TEXT": {"REGEX": r"(高血压|糖尿病|冠心病|肺炎|胃炎)"}}]},
    {"label": "PRODUCT", "pattern": [{"TEXT": {"REGEX": r"(iPhone\d+|Mate\d+|SU\d+)"}}]},
]
ruler.add_patterns(patterns)

# Test
text = "患者患有高血压和糖尿病，建议服用二甲双胍治疗。最新款iPhone16已发布。"
doc = nlp(text)

print("Custom entity recognition results:")
for ent in doc.ents:
    print(f"  {ent.text:12s} -> {ent.label_:10s}")

# Method 2: Matcher for more complex token patterns
matcher = Matcher(nlp.vocab)

# A proper noun followed by the token "公司"
pattern = [
    {"POS": "PROPN"},            # proper noun
    {"TEXT": "公司", "OP": "+"}  # one or more "公司" tokens
]
matcher.add("COMPANY", [pattern])

# Find matches; the test text above contains no "公司", so nothing is printed for it,
# and hits on other text depend on the tagger labeling the name as PROPN
matches = matcher(doc)
print("\nMatcher results:")
for match_id, start, end in matches:
    span = doc[start:end]
    print(f"  {span.text}")

# Method 3: update the statistical model with training
# nlp.update(examples)  # requires annotated training data
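
Method 3 needs annotated data. For Chinese, annotations are often stored as character offsets and converted to one BIO tag per character, which sidesteps word segmentation. The sketch below is framework-neutral and is not spaCy's own training format; the function name `spans_to_bio` and the sample sentence are illustrative assumptions.

```python
def spans_to_bio(text, spans):
    """Convert (start, end, label) character spans into one BIO tag per character."""
    tags = ["O"] * len(text)
    for start, end, label in spans:
        if not (0 <= start < end <= len(text)):
            raise ValueError(f"span out of range: {(start, end, label)}")
        if any(tag != "O" for tag in tags[start:end]):
            raise ValueError(f"overlapping span: {(start, end, label)}")
        tags[start] = f"B-{label}"          # first character of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining characters of the entity
    return tags

text = "患者服用二甲双胍"
spans = [(4, 8, "DRUG")]  # "二甲双胍" occupies characters 4..7
tags = spans_to_bio(text, spans)
for ch, tag in zip(text, tags):
    print(ch, tag)
# 患 O
# 者 O
# 服 O
# 用 O
# 二 B-DRUG
# 甲 I-DRUG
# 双 I-DRUG
# 胍 I-DRUG
```

Rejecting out-of-range and overlapping spans up front tends to catch annotation mistakes before they reach training.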

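7.4 Layout-Aware NER and Applications

In real document analysis, the input is often not plain text but PDFs, scanned pages, tables, and other documents with complex layouts. These cases call for OCR and layout analysis to locate the text first, followed by NER on the extracted text.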

# NER on complex documents (layout-aware)
# Scenario: extracting fields from semi-structured documents such as invoices, contracts, and resumes
import spacy

class DocumentNER:
    """Document entity extraction that keeps layout information."""

    def __init__(self):
        self.nlp = spacy.load("zh_core_web_sm")
        # Add document-specific entity rules
        self._add_custom_rules()

    def _add_custom_rules(self):
        """Add entity rules specific to this document type."""
        ruler = self.nlp.add_pipe("entity_ruler", before="ner")
        # REGEX is searched within a single token, so anchor the patterns with ^ and $;
        # the length bounds are illustrative and should follow your own documents
        invoice_patterns = [
            {"label": "INVOICE_NO", "pattern": [{"TEXT": {"REGEX": r"^\d{8,20}$"}}]},
            {"label": "TAX_ID", "pattern": [{"TEXT": {"REGEX": r"^(?=.*[A-Za-z])[A-Za-z0-9]{15,20}$"}}]},
        ]
        ruler.add_patterns(invoice_patterns)

    def extract_from_text(self, text):
        """Extract entities from plain text."""
        doc = self.nlp(text)
        return [
            {"text": ent.text, "label": ent.label_, "start": ent.start_char, "end": ent.end_char}
            for ent in doc.ents
        ]

    def extract_with_layout(self, text_blocks):
        """Extract entities from text blocks that carry layout information (for example, PDF parser output)."""
        results = []
        for block in text_blocks:
            # Block structure: {"text": "...", "bbox": [x1, y1, x2, y2], "page": 1}
            entities = self.extract_from_text(block["text"])
            for ent in entities:
                # Each entity inherits the position of the block it came from
                ent["bbox"] = block.get("bbox")
                ent["page"] = block.get("page")
            results.extend(entities)
        return results

# Simulated text blocks with layout information, as a PDF parser might return them
pdf_blocks = [
    {"text": "发票号码：20240500123456", "bbox": [100, 50, 400, 70], "page": 1},
    {"text": "开票日期：2024年5月1日", "bbox": [100, 80, 400, 100], "page": 1},
    {"text": "购买方：阿里巴巴集团", "bbox": [100, 120, 400, 140], "page": 1},
    {"text": "统一社会信用代码：91330100MA2AX12345", "bbox": [100, 150, 500, 170], "page": 1},
    {"text": "金额：¥99,999.00", "bbox": [100, 180, 300, 200], "page": 1},
]

doc_ner = DocumentNER()
layout_entities = doc_ner.extract_with_layout(pdf_blocks)
print("Layout-aware NER results:")
for ent in layout_entities:
    print(f"  [{ent['label']:12s}] {ent['text']:20s} (bbox: {ent.get('bbox')})")
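
For form-like documents such as the invoice above, a plain rule pass over the blocks is often a useful complement to model-based NER. The sketch below sorts blocks into reading order and splits each "label：value" line on its first colon. It assumes the same block structure as above and a coordinate system whose y axis grows downward; the function names `reading_order` and `extract_fields` are hypothetical.

```python
import re

def reading_order(blocks):
    """Sort blocks by page, then top edge (y1), then left edge (x1)."""
    return sorted(blocks, key=lambda b: (b["page"], b["bbox"][1], b["bbox"][0]))

def extract_fields(blocks):
    """Split 'label：value' blocks on the first full-width or ASCII colon."""
    fields = []
    for block in reading_order(blocks):
        parts = re.split(r"[：:]", block["text"], maxsplit=1)
        # Keep only blocks that have both a non-empty key and a non-empty value
        if len(parts) == 2 and parts[0].strip() and parts[1].strip():
            fields.append({
                "key": parts[0].strip(),
                "value": parts[1].strip(),
                "bbox": block["bbox"],
                "page": block["page"],
            })
    return fields

blocks = [
    {"text": "金额：¥99,999.00", "bbox": [100, 180, 300, 200], "page": 1},
    {"text": "发票号码：20240500123456", "bbox": [100, 50, 400, 70], "page": 1},
    {"text": "备注", "bbox": [100, 220, 300, 240], "page": 1},
]
for field in extract_fields(blocks):
    print(field["key"], "->", field["value"])
# 发票号码 -> 20240500123456
# 金额 -> ¥99,999.00
```

Real layouts often place a label and its value in separate blocks; pairing them by bounding-box proximity would be the natural next step and is left out of this sketch.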

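NER Method Selection Guide

Scenario	Recommended approach	Pros and cons
General entities (people, places)	spaCy pretrained model	Works out of the box and covers common entity types
Domain-specific entities (drugs, diseases)	EntityRuler + custom patterns	Flexible and configurable; needs no annotated data
High-accuracy NER (research or production grade)	Fine-tuned BERT-CRF	Best accuracy, but needs a large amount of annotated data
Structured documents (invoices, contracts)	Layout analysis + rule matching	Precise extraction that uses position information
Large-scale mining of unstructured text	Dictionary matching + regex	Fast; well suited to high-recall candidate retrieval over massive corpora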

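8. Key Takeaways

Preprocessing is the foundation: Data cleaning often takes the largest share of an NLP project's time (80% is a commonly quoted figure). A standard preprocessing pipeline covers denoising, normalization, tokenization, stop-word filtering, and stemming or lemmatization. Chinese text needs a word segmenter such as jieba; for English, use NLTK or spaCy.
Each vectorization method has its strengths: CountVectorizer is simple and transparent, TF-IDF down-weights common words, Word2Vec provides dense semantic representations, and FastText handles out-of-vocabulary words. Weigh interpretability, semantic richness, and compute cost when choosing.
Sentiment analysis has a clear path: Prototype quickly with VADER (English) or a lexicon-based method (Chinese), and fine-tune BERT for production. Classical machine learning (TF-IDF + LR/SVM) gives the best cost-to-performance ratio when labeled data is plentiful.
Judge topic models by coherence: LDA and NMF each have strengths. LDA is a solid default for discovering topics in general document collections, while NMF tends to give sparser, easier-to-interpret topics. Choose the number of topics using both perplexity and topic coherence; pyLDAvis is a handy tool for inspecting the model.
Text classification goes from simple to complex: Naive Bayes makes a good baseline, logistic regression and SVM suit medium-sized datasets, and deep learning (BERT and similar models) suits complex scenarios. Wrap the estimator in MultiOutputClassifier for multi-label classification.
Word clouds enrich visual reports: The wordcloud library is capable, shape masks add visual appeal, and custom color functions and stop-word filtering are the key tuning points.
NER goes from rules to deep learning: Use spaCy pretrained models for general scenarios, a custom EntityRuler for domain scenarios, and a fine-tuned BERT-CRF when high accuracy is required. Structured documents also need layout information.
Think in pipelines: Package each NLP task as a reusable pipeline that covers preprocessing, feature engineering, model training, and evaluation. scikit-learn Pipeline and joblib model persistence help keep the engineering quality consistent.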

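9. Further Thoughts and Practical Advice

Suggested Learning Path

Beginner: Learn jieba segmentation, TF-IDF vectorization, and the naive Bayes classifier, and complete a simple text classification or sentiment analysis project.
Intermediate: Study Word2Vec and LDA topic modeling, learn spaCy's NER and dependency parsing, and build a complete NLP pipeline.
Advanced: Study fine-tuning of pretrained models such as BERT and GPT in depth, and learn prompt engineering and RAG (retrieval-augmented generation).
Engineering: Learn model deployment (FastAPI/Flask), performance optimization (ONNX/TensorRT), and large-scale text processing (Spark NLP).

Cautions

Data privacy: De-identify user text and stay compliant when processing it; do not upload sensitive data to third-party APIs.
Model bias: Pretrained models can carry biases from their training data (gender, region, and so on), so run fairness evaluations.
Resource consumption: Deep learning inference usually needs a GPU to meet latency targets, so large-scale deployment has to account for cost and latency.
Continuous iteration: NLP models can degrade sharply on out-of-domain data, so they need ongoing monitoring and updates.
Explainability: In high-risk settings such as healthcare and finance, the model's decision process needs to be explainable.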
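
As a rough illustration of the data privacy point, obviously sensitive fields can be masked before any text leaves your environment. This is a minimal sketch using only regular expressions: the placeholder tokens, the rule list, and the function name `mask_sensitive` are assumptions, and pattern-based masking does not by itself satisfy any particular compliance requirement.

```python
import re

# Order matters: mask 18-character ID numbers before 11-digit phone numbers
MASK_RULES = [
    ("<ID_CARD>", re.compile(r"(?<!\d)\d{17}[\dXx](?!\d)")),
    ("<PHONE>", re.compile(r"(?<!\d)1[3-9]\d{9}(?!\d)")),
    ("<EMAIL>", re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")),
]

def mask_sensitive(text):
    """Replace phone numbers, ID numbers, and emails with placeholder tokens."""
    for placeholder, pattern in MASK_RULES:
        text = pattern.sub(placeholder, text)
    return text

raw = "联系人电话13800138000，邮箱zhang.san@example.com"
print(mask_sensitive(raw))
# 联系人电话<PHONE>，邮箱<EMAIL>
```

Names, addresses, and free-form identifiers are not covered by patterns like these; an NER model or a manual review step would be needed for them.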