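Data Masking Hook: Automatic Masking of Sensitive Data

Core idea: The data masking hook intercepts each request before the user prompt is sent to the AI model, automatically detects sensitive information (PII, credentials, IP addresses, and so on) and replaces it. This protects private data while preserving enough context for the AI to understand the request. A restoration strategy can be configured for the masked data, so the original values never have to be handed to a third-party AI service.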

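1. Design of the Data Masking Hook

The core value of the data masking hook is that sensitive information is masked automatically before the user prompt reaches the AI. It acts as a middleware layer: it intercepts every API request, scans the text content, and masks what it finds.

Automatic detection

Identifies sensitive information in the text, including PII, credentials, and network information

Format-preserving masking

Keeps part of the original format so the AI can follow the context without seeing the real data

Configurable rules

Supports custom masking rules, with different strategies per data sensitivity level

Audit log

Records every masking operation in an audit log for traceability and security review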

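// Basic architecture of the data masking hook
class DataMaskingHook {
    constructor(options = {}) {
        this.rules = options.rules || [];
        this.auditLog = [];
        this.whitelist = options.whitelist || [];
        this.enabled = options.enabled !== false;
    }

    // Apply every enabled rule in order and return the masked text plus the
    // number of replacements. The original listing called this method without
    // defining it; this is a minimal version so the class runs. Rules are
    // expected to use global (g) regexes. Whitelist filtering is not applied here.
    applyMasking(text) {
        let masked = text;
        let count = 0;
        for (const rule of this.rules) {
            if (rule.enabled === false) continue;
            const matches = masked.match(rule.pattern);
            if (!matches) continue;
            count += matches.length;
            masked = masked.replace(rule.pattern, rule.replacement);
        }
        return { masked, count };
    }

    // Mask the prompt before it is submitted
    async beforeUserPromptSubmit(context) {
        if (!this.enabled) return context;

        const originalContent = context.prompt;
        const { masked, count } = this.applyMasking(originalContent);

        // Audit record: lengths and counts only, never the raw sensitive values
        this.auditLog.push({
            timestamp: new Date().toISOString(),
            originalLength: originalContent.length,
            maskedCount: count
        });

        return { ...context, prompt: masked };
    }

    // Optionally post-process the AI response after it comes back
    async afterAssistantMessage(context) {
        // If a restoration strategy is configured, turn placeholders back into readable descriptions
        return context;
    }
}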

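Design note: The hook runs at before:user-prompt-submit, so it sees the prompt after the user submits it and before it leaves for the AI service. The masked content stays semantically complete: the AI can follow the context but cannot obtain the real sensitive values.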
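To make the timing concrete, here is a minimal sketch of how a host application might chain hooks before sending a prompt. The names submitWithHooks, demoHook, and fakeSend are hypothetical and exist only for illustration; a real host would supply its own hook registry and transport.

```javascript
// Hypothetical wiring: run each hook's beforeUserPromptSubmit in order, then send.
async function submitWithHooks(context, hooks, sendPrompt) {
    let current = context;
    for (const hook of hooks) {
        // Each hook receives the output of the previous one.
        current = await hook.beforeUserPromptSubmit(current);
    }
    // Only the fully processed prompt ever reaches the transport.
    return sendPrompt(current.prompt);
}

// Toy hook and sender, used only to show that masking happens before sending.
const demoHook = {
    async beforeUserPromptSubmit(ctx) {
        return { ...ctx, prompt: ctx.prompt.replace(/\d{11}/g, '[PHONE]') };
    }
};
const sent = [];
const fakeSend = async (prompt) => {
    sent.push(prompt);
    return 'ok';
};
```

Because the sender is called only with the final prompt, a hook that throws stops the submission instead of letting unmasked text through, which is usually the safer failure mode.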

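2. PII Detection and Masking Hook (before:user-prompt-submit)

Personally identifiable information (PII) is the most common kind of sensitive data. This hook detects and masks PII before the user prompt is submitted, covering names, ID card numbers, phone numbers, email addresses, and bank card numbers. Detection and replacement are done with regular-expression pattern matching.

2.1 PII detection rules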

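// PII detection and masking rules.
// Order matters: the ID card rule runs before the bank card rule, because an
// 18-digit ID number would otherwise also match the 16-20 digit card pattern.
const piiRules = [
    {
        name: 'phone number',
        // (?<!\d) and (?!\d) stop the rule from firing inside longer digit runs such as ID numbers
        pattern: /(?<!\d)(1[3-9]\d)\d{4}(\d{4})(?!\d)/g,
        replacement: '$1****$2',
        description: 'Keep the first 3 and last 4 digits; hide the middle'
    },
    {
        name: 'ID card number',
        // 18 characters; the final check character may be X
        pattern: /(?<!\d)(\d{6})\d{8}(\d{3}[\dXx])(?!\d)/g,
        replacement: '$1********$2',
        description: 'Keep the first 6 and last 4 characters; hide the middle'
    },
    {
        name: 'email',
        // Local part may contain dots, hyphens, etc.; the domain must end in an
        // alphabetic TLD so that user:pass@192.168.1.1 is not treated as an email
        pattern: /([A-Za-z0-9._%+-]{2})[A-Za-z0-9._%+-]+@((?:[A-Za-z0-9-]+\.)+[A-Za-z]{2,})/g,
        replacement: '$1***@$2',
        description: 'Keep the first 2 characters of the local part and the domain'
    },
    {
        name: 'bank card number',
        pattern: /(?<!\d)(\d{4})\d{8,12}(\d{4})(?!\d)/g,
        replacement: '$1********$2',
        description: 'Keep the first 4 and last 4 digits'
    },
    {
        name: 'name',
        // Keys on the Chinese label 姓名 followed by 2-4 Chinese characters.
        // Limitation: Chinese text directly after the name is captured as part of it.
        pattern: /(?:姓名[：:]\s*)([一-龥]{2,4})/g,
        replacement: (match, name) => {
            if (name.length === 2) return '姓名：' + name[0] + '某';
            return '姓名：' + name[0] + '某' + name[name.length - 1];
        },
        description: 'Keep the first character (and the last, for 3+ characters); replace the rest with 某'
    }
];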
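Digit-run patterns such as the bank card rule tend to over-match (order numbers, timestamps, tracking codes). One common way to cut false positives, not part of the rules above, is to accept a candidate only if it passes the Luhn checksum that payment card numbers use. The sketch below is an assumption about how such a filter could look; passesLuhn and maskCardNumbers are hypothetical names, and the 16 to 19 digit range is an assumption as well.

```javascript
// Luhn checksum: double every second digit from the right, subtract 9 from
// results above 9, and require the total to be a multiple of 10.
function passesLuhn(digits) {
    let sum = 0;
    let doubleNext = false;
    for (let i = digits.length - 1; i >= 0; i--) {
        let d = digits.charCodeAt(i) - 48; // '0' is char code 48
        if (d < 0 || d > 9) return false;  // reject non-digit input
        if (doubleNext) {
            d *= 2;
            if (d > 9) d -= 9;
        }
        sum += d;
        doubleNext = !doubleNext;
    }
    return digits.length > 0 && sum % 10 === 0;
}

// Mask only those 16-19 digit runs that also pass the checksum.
function maskCardNumbers(text) {
    return text.replace(/(?<!\d)\d{16,19}(?!\d)/g, (candidate) =>
        passesLuhn(candidate)
            ? candidate.slice(0, 4) + '********' + candidate.slice(-4)
            : candidate
    );
}
```

The trade-off is that made-up sample digits which do not satisfy the checksum are left untouched, so test fixtures need checksum-valid numbers.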

2.2 Masking examples

Data type	Original	Masked
Phone number	13812345678	138****5678
ID card number	110101199001011234	110101********1234
Email	zhangsan@example.com	zh***@example.com
Bank card number	6222021234567890	6222********7890
Name	姓名：张三丰	姓名：张某丰

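Design rationale: Masking keeps part of the original format (for example, the first 3 and last 4 digits of a phone number). The AI can still infer things such as the region or issuing bank the value belongs to, but it never receives the complete value. Each masking operation is also written to the audit log for later security review.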
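The keep-a-prefix-and-suffix idea can be expressed as one small helper instead of a hand-written replacement string per rule. This is a sketch only; maskMiddle is a hypothetical name and is not used by the rules above.

```javascript
// Keep `keepStart` leading and `keepEnd` trailing characters and hide the rest.
// If the value is too short to hide anything, mask it completely.
function maskMiddle(value, keepStart, keepEnd, maskChar = '*') {
    const hidden = value.length - keepStart - keepEnd;
    if (hidden <= 0) return maskChar.repeat(value.length);
    return (
        value.slice(0, keepStart) +
        maskChar.repeat(hidden) +
        value.slice(value.length - keepEnd)
    );
}

console.log(maskMiddle('13812345678', 3, 4));        // 138****5678
console.log(maskMiddle('110101199001011234', 6, 4)); // 110101********1234
```

Unlike a fixed replacement string, the helper preserves the original length, which tells the AI how long the value was; whether that is desirable depends on how much structure you are willing to reveal.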

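3. Credential Masking Hook

Credentials such as API keys, tokens, and passwords are high-risk sensitive data; leaking them to an AI service can lead to a serious security incident. The credential masking hook is dedicated to detecting and replacing them.

3.1 Credential detection and replacement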

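// Credential masking rules
const credentialRules = [
    {
        name: 'API Key',
        // Matches common API key shapes: sk-... tokens (hyphens and underscores
        // allowed inside the key) and api_key=... / api-key: ... assignments
        pattern: /(\bsk-[a-zA-Z0-9_-]{20,})|(api[_-]?key[=:]\s*['"]?[a-zA-Z0-9]{16,})/gi,
        replacement: '[API_KEY]',
        description: 'Replace the API key with a placeholder'
    },
    {
        name: 'Bearer Token',
        // Trailing = covers base64 padding
        pattern: /Bearer\s+[A-Za-z0-9\-._~+/]{20,}=*/g,
        replacement: 'Bearer [TOKEN]',
        description: 'Replace the bearer token'
    },
    {
        name: 'database connection string',
        // \s is excluded so a match cannot run across spaces or line breaks
        pattern: /(postgresql|mysql|mongodb):\/\/[^:\s/]+:[^@\s]+@/g,
        replacement: '$1://[USER]:[PASSWORD]@',
        description: 'Mask the user name and password in a database connection string'
    },
    {
        name: 'password field',
        pattern: /(password|passwd|pwd)[=:]\s*['"]?[^'"&\s]{4,}['"]?/gi,
        replacement: '$1=[PASSWORD]',
        description: 'Replace the password with a placeholder'
    },
    {
        name: 'private key',
        // Match the whole PEM block, including the END line
        pattern: /-----BEGIN (?:RSA |EC )?PRIVATE KEY-----[\s\S]+?-----END (?:RSA |EC )?PRIVATE KEY-----/g,
        replacement: '[PRIVATE_KEY]',
        description: 'Detect and replace private key content'
    }
];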

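Security reminder: Never send API keys, tokens, or passwords of any kind to a third-party AI service. The credential masking hook should be the last line of defense; developers should also keep credentials out of prompts at the source.

3.2 Extending with custom credential patterns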

// Support for registering custom credential patterns
class CredentialMaskingHook {
    constructor() {
        this.customPatterns = new Map();
    }

    // Register a custom pattern. `pattern` is a regex source string. Flags default
    // to 'g' (case-sensitive) so that uppercase-only formats such as AKIA... do not
    // match lowercase text; pass 'gi' explicitly when case should be ignored.
    registerPattern(name, pattern, replacement, flags = 'g') {
        this.customPatterns.set(name, {
            pattern: new RegExp(pattern, flags),
            replacement: replacement || `[${name.toUpperCase()}]`
        });
    }

    // Register several custom patterns at once
    registerPatterns(patterns) {
        for (const [name, config] of Object.entries(patterns)) {
            this.registerPattern(name, config.pattern, config.replacement, config.flags);
        }
    }

    // Apply every registered custom pattern in registration order
    maskCredentials(text) {
        let masked = text;
        for (const [, rule] of this.customPatterns) {
            masked = masked.replace(rule.pattern, rule.replacement);
        }
        return masked;
    }
}

// Usage example
const hook = new CredentialMaskingHook();
hook.registerPattern('aws_key', 'AKIA[0-9A-Z]{16}', '[AWS_ACCESS_KEY]');
hook.registerPattern('github_token', 'ghp_[a-zA-Z0-9]{36}', '[GITHUB_TOKEN]');

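Best practice: Replace credentials with meaningful placeholders (such as [API_KEY] or [PASSWORD]). The AI then knows a credential was present at that position and is less likely to misread the replaced content.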
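When the response should later be mapped back to the original values on the local side, placeholders need to be distinguishable from one another. The following sketch assumes numbered placeholders and an in-memory lookup table; PlaceholderVault and its methods are hypothetical names, and restoring raw values (rather than readable descriptions) is a design choice you may not want for credentials.

```javascript
// Maps each distinct sensitive value to a stable, numbered placeholder and back.
class PlaceholderVault {
    constructor() {
        this.byPlaceholder = new Map(); // '[AWS_KEY_1]' -> original value
        this.byValue = new Map();       // label + value -> placeholder
        this.counters = new Map();      // label -> last number used
    }

    // Return the existing placeholder for a value, or allocate the next number.
    placeholderFor(label, value) {
        const key = label + '\u0000' + value;
        if (this.byValue.has(key)) return this.byValue.get(key);
        const n = (this.counters.get(label) || 0) + 1;
        this.counters.set(label, n);
        const placeholder = `[${label}_${n}]`;
        this.byValue.set(key, placeholder);
        this.byPlaceholder.set(placeholder, value);
        return placeholder;
    }

    // Replace every match of a global regex with its placeholder.
    mask(text, pattern, label) {
        return text.replace(pattern, (match) => this.placeholderFor(label, match));
    }

    // Swap known placeholders back; unknown ones are left untouched.
    restore(text) {
        return text.replace(/\[[A-Z_]+_\d+\]/g, (p) =>
            this.byPlaceholder.has(p) ? this.byPlaceholder.get(p) : p
        );
    }
}
```

The same value always maps to the same placeholder, so the AI can still tell that two occurrences refer to one credential. The vault holds raw secrets in memory, so it should live only as long as the conversation and must never be written to the audit log.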

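4. Masking IP Addresses and Network Information

Network information such as IP addresses and domain names can expose internal network architecture and server details. The network masking hook applies a different strategy to each type of network address.

4.1 IP address masking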

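// IP address and network information masking rules
const networkRules = [
    {
        name: 'internal IPv4 (10.x and 172.16-31.x)',
        // Private ranges 10.x.x.x and 172.16-31.x.x; 192.168.x.x is handled by the next rule
        pattern: /\b(?:10\.\d{1,3}|172\.(?:1[6-9]|2\d|3[01]))\.\d{1,3}\.\d{1,3}\b/g,
        replacement: '[INTERNAL_IP]',
        description: 'Replace internal IPs with a placeholder'
    },
    {
        name: 'internal IPv4 (192.168 range)',
        pattern: /\b(192\.168\.\d{1,3}\.\d{1,3})\b/g,
        replacement: '[LOCAL_IP]',
        description: 'Mask 192.168.x.x addresses'
    },
    {
        name: 'public IPv4 (optional)',
        // Matches any remaining IPv4 address, so it must come after the internal
        // rules and is switched off by default
        pattern: /\b(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\b/g,
        replacement: '[PUBLIC_IP]',
        enabled: false,
        description: 'Optional masking of public IPs'
    },
    {
        name: 'IPv6 address',
        // Full eight-group form only; compressed forms using :: are not matched
        pattern: /([0-9a-fA-F]{1,4}:){7}[0-9a-fA-F]{1,4}/g,
        replacement: '[IPV6_ADDRESS]',
        description: 'Mask IPv6 addresses'
    },
    {
        name: 'sensitive part of internal domain names',
        pattern: /(https?:\/\/)?([a-z0-9-]+)\.internal\.([a-z]+)/g,
        replacement: '$1[SUB_DOMAIN].internal.$3',
        description: 'Mask the subdomain of internal domain names'
    }
];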

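4.2 Comparison of masking strategies

Network information type	Example	Masked	Strategy
Internal IPv4	192.168.1.100	[LOCAL_IP]	Fully replaced, to avoid leaking the internal topology
Public IPv4	203.0.113.10	[PUBLIC_IP] (when enabled)	Optional; kept as-is by default
IPv6	2001:0db8:85a3:0000:0000:8a2e:0370:7334	[IPV6_ADDRESS]	Fully replaced (full eight-group form only)
Internal domain	db01.internal.company.com	[SUB_DOMAIN].internal.company.com	Only the sensitive subdomain is masked
URL with port	http://10.0.0.5:8080	http://[INTERNAL_IP]:8080	The IP is replaced; the rules above leave the port as-is, so masking it needs a separate rule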

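Note: Public DNS server IPs (such as 8.8.8.8 and 1.1.1.1) and the loopback address (127.0.0.1, localhost) should be added to the whitelist to avoid over-masking. A whitelist is an effective way to reduce false positives.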
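One way to combine the whitelist with IP classification is to decide per matched address inside a replacement callback, rather than checking the whole prompt. The sketch below is an assumption about how that could look; ipWhitelist, isValidIPv4, placeholderForIP, and maskIPs are hypothetical names, and the placeholders mirror the ones used in this section.

```javascript
// Addresses that should never be masked.
const ipWhitelist = new Set(['127.0.0.1', '8.8.8.8', '1.1.1.1']);

// Reject dotted numbers whose parts exceed 255 (for example 999.1.1.1).
function isValidIPv4(candidate) {
    return candidate.split('.').every((part) => Number(part) <= 255);
}

// Pick a placeholder based on the private address ranges.
function placeholderForIP(ip) {
    const [a, b] = ip.split('.').map(Number);
    if (a === 10) return '[INTERNAL_IP]';
    if (a === 172 && b >= 16 && b <= 31) return '[INTERNAL_IP]';
    if (a === 192 && b === 168) return '[LOCAL_IP]';
    return '[PUBLIC_IP]';
}

// Mask internal addresses always, public ones only when maskPublic is set.
function maskIPs(text, { maskPublic = false } = {}) {
    return text.replace(/\b\d{1,3}(?:\.\d{1,3}){3}\b/g, (ip) => {
        if (!isValidIPv4(ip) || ipWhitelist.has(ip)) return ip;
        const placeholder = placeholderForIP(ip);
        if (placeholder === '[PUBLIC_IP]' && !maskPublic) return ip;
        return placeholder;
    });
}
```

Checking each match for exact membership avoids two pitfalls of substring checks: a whitelisted address elsewhere in the prompt does not exempt other addresses, and 11.1.1.10 is not mistaken for the whitelisted 1.1.1.1.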

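5. Configuring and Managing Masking Rules

Flexible rule configuration is a core capability of the data masking hook. Through a configuration file or a management UI, developers can define custom masking rules, assign strategies per sensitivity level, maintain a whitelist, and test and verify the masking behavior.

5.1 Masking rule configuration file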

File: masking-config.json (masking rule configuration)
{
    "version": "2.0",
    "enabled": true,

    "sensitivityLevels": {
        "high": ["ID card number", "bank card number", "password", "private key"],
        "medium": ["phone number", "email", "API Key", "Token"],
        "low": ["IP address", "name", "domain"]
    },

    "rules": {
        "PII": {
            "enabled": true,
            "level": "high",
            "patterns": [
                { "name": "phone number", "pattern": "...", "replacement": "..." },
                { "name": "ID card number", "pattern": "...", "replacement": "..." },
                { "name": "email", "pattern": "...", "replacement": "..." }
            ]
        },
        "credentials": {
            "enabled": true,
            "level": "high",
            "patterns": [
                { "name": "API Key", "pattern": "...", "replacement": "[API_KEY]" },
                { "name": "password", "pattern": "...", "replacement": "[PASSWORD]" }
            ]
        },
        "network": {
            "enabled": true,
            "level": "low",
            "patterns": [
                { "name": "internal IP", "pattern": "...", "replacement": "[INTERNAL_IP]" }
            ]
        }
    },

    "whitelist": {
        "enabled": true,
        "patterns": [
            "127.0.0.1",
            "localhost",
            "8.8.8.8",
            "1.1.1.1"
        ],
        "domains": [
            "example.com",
            "test.com"
        ]
    },

    "audit": {
        "enabled": true,
        "logLevel": "info",
        "retentionDays": 90
    }
}
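Because JSON stores patterns as strings, they have to be compiled into regular expressions before use. The sketch below shows one possible loader that skips disabled groups and collects invalid patterns instead of failing on the first one. compileRules is a hypothetical name, and the optional flags field on an entry is an assumption that is not part of the configuration shown above.

```javascript
// Turn the "rules" section of a masking config into compiled rule objects.
function compileRules(config) {
    const compiled = [];
    const errors = [];
    if (!config.enabled) return { compiled, errors };

    for (const [group, settings] of Object.entries(config.rules || {})) {
        if (!settings.enabled) continue; // skip disabled rule groups
        for (const entry of settings.patterns || []) {
            try {
                compiled.push({
                    group,
                    name: entry.name,
                    level: settings.level,
                    // entry.flags is an assumed, optional field; default to global
                    pattern: new RegExp(entry.pattern, entry.flags || 'g'),
                    replacement: entry.replacement
                });
            } catch (err) {
                // An invalid regex source throws a SyntaxError; record it and continue
                errors.push(`${group}/${entry.name}: ${err.message}`);
            }
        }
    }
    return { compiled, errors };
}
```

Note that a literal "..." is itself a valid regular expression (any three characters), so placeholder patterns left in a config file would compile without error and match almost everything. A loader used in practice would want to reject them explicitly.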

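5.2 Different strategies per sensitivity level

// Sensitivity level strategy management
class SensitivityManager {
    constructor(config) {
        this.levels = config.sensitivityLevels || {};
        this.currentLevel = 'medium'; // default strictness
    }

    setLevel(level) {
        if (this.levels[level]) {
            this.currentLevel = level;
        }
    }

    // Strictness is cumulative: 'low' masks only high-sensitivity data,
    // 'medium' adds medium-sensitivity data, and 'high' masks every listed type.
    // (Checking only the list named by currentLevel would leave phone numbers
    // unmasked in production.)
    shouldMask(ruleName) {
        const covered = {
            low: ['high'],
            medium: ['high', 'medium'],
            high: ['high', 'medium', 'low']
        };
        return (covered[this.currentLevel] || []).some(
            (tier) => (this.levels[tier] || []).includes(ruleName)
        );
    }

    // Adjust the level automatically from the context.
    // Keyword matching is only a heuristic: 'test' also occurs inside other words.
    adjustForContext(context) {
        if (context.includes('生产环境') || context.includes('production')) {
            this.setLevel('high');
        } else if (context.includes('测试') || context.includes('test')) {
            this.setLevel('low');
        }
    }
}

// Example strategies for different scenarios
const strategies = {
    development: { level: 'low', audit: false },
    staging: { level: 'medium', audit: true },
    production: { level: 'high', audit: true, retentionDays: 90 },
    compliance: { level: 'high', audit: true, alertOnMatch: true }
};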

5.3 Whitelist mechanism

// Whitelist management
class WhitelistManager {
    constructor(whitelistConfig) {
        this.whitelistPatterns = whitelistConfig.patterns || [];
        this.whitelistDomains = whitelistConfig.domains || [];
    }

    // `value` is a single matched candidate (an IP, host name, or email address),
    // not the whole prompt. Exact and suffix comparisons are used because a
    // substring check would treat 11.1.1.10 as containing the whitelisted 1.1.1.1.
    isWhitelisted(value) {
        if (this.whitelistPatterns.includes(value)) return true;
        for (const domain of this.whitelistDomains) {
            if (value === domain || value.endsWith('@' + domain) || value.endsWith('.' + domain)) {
                return true;
            }
        }
        return false;
    }

    addToWhitelist(pattern) {
        this.whitelistPatterns.push(pattern);
    }

    removeFromWhitelist(pattern) {
        this.whitelistPatterns = this.whitelistPatterns.filter(p => p !== pattern);
    }
}

5.4 Masking test and verification tool

// Masking test and verification tool
class MaskingTestTool {
    constructor(maskingHook) {
        this.hook = maskingHook;
        this.testResults = [];
    }

    // Run a list of masking test cases
    runTests(testCases) {
        this.testResults = []; // start fresh so repeated runs do not accumulate
        for (const testCase of testCases) {
            const result = this.hook.applyMasking(testCase.input);
            const passed = result.masked === testCase.expected;

            this.testResults.push({
                name: testCase.name,
                input: testCase.input,
                expected: testCase.expected,
                actual: result.masked,
                passed,
                maskedCount: result.count
            });
        }

        return this.generateReport();
    }

    // Build the test report
    generateReport() {
        const total = this.testResults.length;
        const passed = this.testResults.filter(r => r.passed).length;

        return {
            total,
            passed,
            failed: total - passed,
            // Avoid NaN% when no test cases were run
            passRate: total === 0 ? 'n/a' : ((passed / total) * 100).toFixed(1) + '%',
            details: this.testResults
        };
    }

    // Runs the built-in test case library. Despite the name, it does not read
    // user input; a real interactive loop would have to be added separately.
    async interactiveTest() {
        console.log('=== Data masking hook test tool ===');
        console.log('Running the built-in test cases');

        // Built-in test case library
        const testCases = [
            {
                name: 'contains a phone number and an email address',
                input: 'Contact me: phone 13812345678, email zhangsan@example.com',
                expected: 'Contact me: phone 138****5678, email zh***@example.com'
            },
            {
                name: 'contains an API key',
                input: 'Use API Key: sk-abcdefghijklmnopqrstuvwxyz',
                expected: 'Use API Key: [API_KEY]'
            },
            {
                name: 'contains an internal IP and port',
                input: 'Database address: 192.168.1.100:3306',
                expected: 'Database address: [LOCAL_IP]:3306'
            }
        ];

        return this.runTests(testCases);
    }
}

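// Complete integration example
// mockWhitelistConfig was not defined in the original listing; a placeholder value is assumed here
const mockWhitelistConfig = { patterns: ['127.0.0.1', 'localhost'], domains: [] };

const maskingHook = new DataMaskingHook({
    rules: [...piiRules, ...credentialRules, ...networkRules],
    whitelist: mockWhitelistConfig,
    enabled: true
});

// Test input (the name line keeps the Chinese label 姓名 because the name rule keys on it)
const userInput = `User info:
姓名：李四
Phone: 13912345678
Email: lisi@company.com
ID card: 110101199003071234

Server configuration:
Database mysql://admin:pass123@192.168.1.100:3306/prod_db
API Key: sk-abcdefghijklmnopqrstuvwxyz
Internal address: 10.0.0.5:8080`;

const result = maskingHook.applyMasking(userInput);
console.log('Masked:');
console.log(result.masked);
console.log(`Masked ${result.count} sensitive items`);

// Expected masked output (the listed rules do not mask ports or database names):
User info:
姓名：李某
Phone: 139****5678
Email: li***@company.com
ID card: 110101********1234

Server configuration:
Database mysql://[USER]:[PASSWORD]@[LOCAL_IP]:3306/prod_db
API Key: [API_KEY]
Internal address: [INTERNAL_IP]:8080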

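Summary: The data masking hook is a key security component in AI applications. By automatically detecting and replacing sensitive information before data is sent to the AI model, it helps prevent privacy leaks. A good masking design protects sensitive data while keeping the text semantically complete so the AI can still understand the context. Combined with flexible rule configuration, a whitelist, audit logging, and a test and verification tool, it forms the basis of a robust and reliable data masking system.

Best-practice checklist:
1. Always run masking at the before:user-prompt-submit stage
2. Keep part of the format for PII (the first few and last few characters) so the AI can follow the context
3. Use meaningful placeholders for credentials (such as [API_KEY])
4. Apply different masking strategies to internal and public IP addresses
5. Whitelist public DNS IPs and localhost to avoid over-masking
6. Record every masking operation in the audit log
7. Provide a test and verification tool to confirm the rules behave correctly
8. Configure different sensitivity level strategies per environment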