
@KuRRe8
Last active June 6, 2025 17:35
Python tutorials, organized into separate files by topic

Python Tutorials

Python is a beginner-friendly language, and the machine-learning community now depends heavily on Python, C++, CUDA C, R, and similar languages, which has kept Python firmly in first place in popularity rankings. This Gist collects Python tutorials that can be run directly in Jupyter Notebook.

  1. Language-level tutorials, generally not covering beginner topics;
  2. Standard-library tutorials: basic usage of the most common standard libraries;
  3. Third-party-library tutorials, mainly common libraries such as numpy and pytorch, covering basic usage only, without newer features.

Other content will not go into this Gist. Note that Gists are still version-controlled by git, so you can git clone it locally, or open the corresponding .ipynb files directly in Google Colab or Kaggle.

When browsing on the web there is no file list, so press Ctrl + F to search for the section you want, or click the hyperlinks below.

If you want to contribute, just leave a comment; questions also go in the comments ^.^

Table of Contents - Language

Table of Contents - Libraries

Table of Contents - Domain Libraries (this tutorial focuses on machine learning and deep learning)

Table of Contents - Appendix

  • sigh.md - personal thoughts on Python as a dynamic language
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# SHAP - 模型解释性与可解释性教程\n",
"\n",
"欢迎来到 SHAP 教程!在机器学习中,特别是在高风险决策领域(如医疗、金融),仅仅得到一个预测结果往往是不够的,理解模型为什么会做出这样的预测(模型的可解释性)变得越来越重要。SHAP (SHapley Additive exPlanations) 是一个基于博弈论中 Shapley 值理论的强大框架,旨在提供一种统一的方法来解释任何机器学习模型的输出。\n",
"\n",
"**为什么需要模型解释性 (XAI - Explainable AI)?**\n",
"\n",
"1. **建立信任**: 用户和利益相关者需要理解模型的决策依据。\n",
"2. **调试模型**: 发现模型可能存在的偏见、错误或意想不到的行为。\n",
"3. **满足合规性**: 某些法规(如 GDPR)要求对自动化决策提供解释。\n",
"4. **改进模型**: 通过理解特征的重要性,可以指导特征工程和模型选择。\n",
"5. **科学发现**: 在科学研究中,理解模型学到了什么模式本身就很有价值。\n",
"\n",
"**SHAP 的核心思想:**\n",
"\n",
"* 将模型的每个预测视为一个“合作博弈”的结果,其中每个输入特征都是一个“玩家”。\n",
"* SHAP 值量化了每个特征(玩家)对特定预测结果(相对于所有特征的平均预测或基线预测)的**贡献度**。\n",
"* SHAP 值具有良好的理论性质(如一致性、加性),使得解释更加可靠。\n",
"\n",
"**本教程将涵盖 SHAP 库的核心用法:**\n",
"\n",
"1. 安装 SHAP\n",
"2. SHAP 值的基本概念\n",
"3. 不同类型的 Explainer (TreeExplainer, KernelExplainer, DeepExplainer - 简介)\n",
"4. 计算 SHAP 值\n",
"5. 可视化解释 (`force_plot`, `summary_plot`, `dependence_plot`)\n",
"6. 解释 Scikit-learn 模型示例 (树模型、线性模型)\n",
"7. (简介) 解释深度学习模型"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. 安装 SHAP\n",
"\n",
"```bash\n",
"pip install shap\n",
"\n",
"# 示例中可能用到的其他库\n",
"pip install numpy pandas scikit-learn matplotlib ipython\n",
"# 对于深度学习示例 (可选)\n",
"# pip install torch torchvision # or tensorflow\n",
"```\n",
"**注意**: `DeepExplainer` 可能对特定版本的 TensorFlow 或 PyTorch 有要求。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import shap\n",
"import numpy as np\n",
"import pandas as pd\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.ensemble import RandomForestClassifier\n",
"from sklearn.linear_model import LogisticRegression\n",
"from sklearn.datasets import fetch_california_housing, load_breast_cancer # 使用不同的数据集\n",
"import matplotlib.pyplot as plt\n",
"\n",
"# 让 SHAP 的 Plot 在 Notebook 中正确显示\n",
"shap.initjs()\n",
"\n",
"print(f\"SHAP version: {shap.__version__}\")\n",
"\n",
"# --- 准备数据 (乳腺癌数据集 - 分类) ---\n",
"cancer = load_breast_cancer()\n",
"X_cancer = pd.DataFrame(cancer.data, columns=cancer.feature_names)\n",
"y_cancer = cancer.target # 0: malignant, 1: benign\n",
"\n",
"X_cancer_train, X_cancer_test, y_cancer_train, y_cancer_test = train_test_split(\n",
" X_cancer, y_cancer, test_size=0.2, random_state=42\n",
")\n",
"\n",
"print(f\"Breast Cancer dataset loaded: X_train shape={X_cancer_train.shape}, X_test shape={X_cancer_test.shape}\")\n",
"print(\"Sample features:\")\n",
"print(X_cancer_train.head(2))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. SHAP 值的基本概念\n",
"\n",
"对于模型的一个特定预测 `f(x)`,SHAP 值 `phi_i` 表示第 `i` 个特征对该预测**相对于基线预测 `E[f(X)]` (所有样本的平均预测值) 的贡献**。\n",
"\n",
"它们满足**加性 (Additivity)** 的重要性质:\n",
"`f(x) = E[f(X)] + phi_1 + phi_2 + ... + phi_n`\n",
"其中 `n` 是特征的数量。\n",
"\n",
"这意味着,一个预测值可以分解为基线值加上每个特征的贡献值。\n",
"* **正的 SHAP 值**: 表示该特征将预测值**推高** (相对于基线)。\n",
"* **负的 SHAP 值**: 表示该特征将预测值**推低** (相对于基线)。"
]
},
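{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A minimal added sketch (not part of the original flow): check the\n",
"# additivity property on a plain linear regression, where SHAP values are\n",
"# exact. The names used here (X_demo, lin_model, ...) are local to this\n",
"# cell and assumed for illustration only.\n",
"import numpy as np\n",
"import shap\n",
"from sklearn.linear_model import LinearRegression\n",
"\n",
"rng = np.random.RandomState(0)\n",
"X_demo = rng.randn(200, 3)\n",
"y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.randn(200)\n",
"\n",
"lin_model = LinearRegression().fit(X_demo, y_demo)\n",
"explainer_demo = shap.LinearExplainer(lin_model, X_demo)\n",
"sv_demo = explainer_demo.shap_values(X_demo)\n",
"\n",
"# f(x) should equal E[f(X)] + sum_i phi_i for every sample\n",
"reconstructed = explainer_demo.expected_value + sv_demo.sum(axis=1)\n",
"print(np.allclose(reconstructed, lin_model.predict(X_demo)))  # expect True"
]
},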
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. 不同类型的 Explainer\n",
"\n",
"SHAP 库提供了针对不同类型模型的优化 Explainer:\n",
"\n",
"* **`shap.TreeExplainer`**: \n",
" * 用于基于树的模型(如 Scikit-learn 的 `RandomForest`, `GradientBoosting`, XGBoost, LightGBM, CatBoost)。\n",
" * 利用树的结构高效精确地计算 SHAP 值。\n",
" * **推荐用于支持的树模型**。\n",
"* **`shap.KernelExplainer`**: \n",
" * 模型无关的方法,理论上可以解释任何返回数值输出的函数(黑盒模型)。\n",
" * 通过对输入特征进行扰动并观察模型输出的变化来近似计算 SHAP 值。\n",
" * 计算成本较高,特别是对于高维数据。\n",
" * 需要一个**背景数据集 (background dataset)** 来表示特征的基线分布。\n",
"* **`shap.DeepExplainer`**: \n",
" * 专门用于解释深度学习模型 (PyTorch, TensorFlow)。\n",
" * 结合了 DeepLIFT 和 Shapley 值的思想。\n",
" * 通常比 `KernelExplainer` 更快,但对模型架构和激活函数有一定要求。\n",
"* **`shap.LinearExplainer`**: \n",
" * 用于解释线性模型,SHAP 值直接与特征系数相关。\n",
"* **其他**: 还有针对特定场景的 Explainer,如 `PermutationExplainer`。"
]
},
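{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Schematic sketch (added for illustration): how each explainer type is\n",
"# typically constructed. The models and background data named below are\n",
"# placeholders, not objects defined in this notebook, so the lines stay\n",
"# commented out.\n",
"# tree_expl   = shap.TreeExplainer(tree_model)\n",
"# kernel_expl = shap.KernelExplainer(black_box.predict, background_data)\n",
"# deep_expl   = shap.DeepExplainer(torch_model, background_tensor)\n",
"# linear_expl = shap.LinearExplainer(linear_model, background_data)\n",
"# Newer SHAP versions also provide a unified entry point that auto-selects\n",
"# an algorithm (check your installed version):\n",
"# auto_expl = shap.Explainer(model, background_data)"
]
},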
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. 计算 SHAP 值\n",
"\n",
"基本流程:\n",
"1. 训练你的机器学习模型。\n",
"2. 根据模型类型选择并创建合适的 SHAP `Explainer` 实例。\n",
"3. 调用 `explainer.shap_values(X_to_explain)` 计算 SHAP 值。\n",
" * `X_to_explain`: 你想要解释其预测的数据样本(通常是测试集或特定实例)。\n",
" * 返回的 `shap_values` 的结构取决于模型输出:\n",
" * 对于回归或二分类(通常解释正类的概率),返回一个数组 `[n_samples, n_features]`。\n",
" * 对于多分类,通常返回一个列表,列表中的每个元素是对应一个类别的 SHAP 值数组 `[n_samples, n_features]`。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(\"--- Calculating SHAP values for RandomForestClassifier ---\")\n",
"\n",
"# 1. 训练模型 (Random Forest)\n",
"rf_model = RandomForestClassifier(n_estimators=100, random_state=42)\n",
"rf_model.fit(X_cancer_train, y_cancer_train)\n",
"print(\"RandomForestClassifier trained.\")\n",
"\n",
"# 2. 创建 TreeExplainer\n",
"explainer_rf = shap.TreeExplainer(rf_model)\n",
"print(\"TreeExplainer created.\")\n",
"\n",
"# 3. 计算测试集的 SHAP 值\n",
"# 对于二分类,TreeExplainer 通常返回一个列表 [shap_values_class_0, shap_values_class_1]\n",
"# 或者只返回正类 (类别 1) 的 SHAP 值,取决于模型的输出结构和 explainer 配置\n",
"# Let's compute for the test set\n",
"shap_values_rf = explainer_rf.shap_values(X_cancer_test)\n",
"\n",
"# 检查返回的 shap_values 结构\n",
"if isinstance(shap_values_rf, list):\n",
" print(f\"SHAP values returned as a list of length: {len(shap_values_rf)}\")\n",
" print(f\"Shape of SHAP values for class 0: {shap_values_rf[0].shape}\") # (n_samples, n_features)\n",
" print(f\"Shape of SHAP values for class 1: {shap_values_rf[1].shape}\") # (n_samples, n_features)\n",
" # 我们通常关注解释正类 (Benign, 类别 1) 的 SHAP 值\n",
" shap_values_rf_pos_class = shap_values_rf[1]\n",
"else:\n",
" print(f\"Shape of SHAP values (likely for positive class): {shap_values_rf.shape}\")\n",
" shap_values_rf_pos_class = shap_values_rf\n",
"\n",
"# explainer 对象还有一个 expected_value 属性,代表基线预测值\n",
"# 对于二分类,它通常也是一个列表 [expected_value_class_0, expected_value_class_1]\n",
"if isinstance(explainer_rf.expected_value, (list, np.ndarray)) and len(explainer_rf.expected_value) > 1:\n",
" print(f\"Explainer expected value (base prediction for class 1): {explainer_rf.expected_value[1]:.4f}\")\n",
" base_value_rf = explainer_rf.expected_value[1]\n",
"else:\n",
" print(f\"Explainer expected value (base prediction): {explainer_rf.expected_value:.4f}\")\n",
" base_value_rf = explainer_rf.expected_value\n",
"\n",
"# 验证加性原理 (对第一个测试样本)\n",
"first_sample_prediction_proba = rf_model.predict_proba(X_cancer_test.iloc[[0]])[0, 1] # 预测为正类(1)的概率\n",
"sum_shap_values_first_sample = np.sum(shap_values_rf_pos_class[0, :])\n",
"print(f\"\\nVerifying additivity for first test sample:\")\n",
"# 注意:TreeExplainer 的输出通常是对 log-odds 或其他内部表示的解释,不直接等于概率。\n",
"# 需要检查 explainer 的 `model_output` 属性或文档来确定解释的是哪个输出。\n",
"# 这里我们仅展示 SHAP 值求和与基线的关系,与直接的概率输出可能需要转换。\n",
"# print(f\" Prediction probability (class 1): {first_sample_prediction_proba:.4f}\")\n",
"print(f\" Base value (expected output): {base_value_rf:.4f}\")\n",
"print(f\" Sum of SHAP values for class 1: {sum_shap_values_first_sample:.4f}\")\n",
"print(f\" Base value + Sum of SHAP values: {base_value_rf + sum_shap_values_first_sample:.4f}\")\n",
"# 这个值应该接近模型内部对于该样本的原始输出(例如 log-odds),不一定是概率。\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. 可视化解释\n",
"\n",
"SHAP 提供了多种强大的可视化工具来帮助理解 SHAP 值。\n",
"\n",
"* **`shap.force_plot(base_value, shap_values[sample_index], features[sample_index])`**: \n",
" * 解释**单个预测**。\n",
" * 显示基线值、每个特征如何将预测推高(红色)或推低(蓝色)以及最终的预测值。\n",
"* **`shap.force_plot(base_value, shap_values, features)`**: \n",
" * 解释**多个预测** (通常是整个数据集)。\n",
" * 生成一个可交互的图,可以将样本按相似的 SHAP 值模式聚类。\n",
"* **`shap.summary_plot(shap_values, features, plot_type=\"bar\"|\"dot\"|\"violin\")`**: \n",
" * 展示**全局特征重要性**。\n",
" * `plot_type=\"bar\"`: 显示每个特征 SHAP 值绝对值的平均值(总体重要性)。\n",
" * `plot_type=\"dot\"` (默认) 或 `\"violin\"`: 显示每个特征的 SHAP 值分布。点根据特征值着色(通常高值红色,低值蓝色),可以揭示特征值与预测贡献方向的关系。\n",
"* **`shap.dependence_plot(feature_index, shap_values, features, interaction_index=\"auto\")`**: \n",
" * 展示特定特征的值如何影响其自身的 SHAP 值。\n",
" * 可以自动或手动选择另一个特征进行着色,以揭示潜在的**交互效应 (interaction effects)**。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(\"--- SHAP Visualizations (Random Forest) ---\")\n",
"\n",
"# --- 1. Force Plot (Single Prediction) ---\n",
"# 解释测试集中第一个样本的预测 (解释类别 1 - Benign)\n",
"sample_index = 0\n",
"print(f\"\\nExplaining prediction for test sample {sample_index} (Class 1 - Benign):\")\n",
"shap.force_plot(base_value_rf, \n",
" shap_values_rf_pos_class[sample_index, :], \n",
" X_cancer_test.iloc[sample_index, :],\n",
" matplotlib=True) # Use matplotlib=True for static plot in some environments\n",
"# plt.show() # Might be needed if matplotlib=True and plot doesn't show\n",
"# 交互式版本通常直接在 Notebook 输出中显示\n",
"\n",
"# --- 2. Force Plot (Multiple Predictions) ---\n",
"# 解释测试集中前 N 个样本 (可能需要滚动查看)\n",
"# print(\"\\nExplaining multiple predictions (interactive plot):\")\n",
"# shap.force_plot(base_value_rf, shap_values_rf_pos_class[:50,:], X_cancer_test.iloc[:50,:])\n",
"# ^^ This plot is interactive and might be large, commented out by default\n",
"\n",
"# --- 3. Summary Plot (Global Feature Importance) ---\n",
"print(\"\\nShowing Summary Plot (dot plot):\")\n",
"shap.summary_plot(shap_values_rf_pos_class, X_cancer_test, plot_type=\"dot\")\n",
"# X轴: SHAP 值 (对类别 1 的贡献)\n",
"# Y轴: 特征名称 (按重要性排序)\n",
"# 颜色: 特征值的高低 (红高蓝低)\n",
"\n",
"print(\"\\nShowing Summary Plot (bar plot):\")\n",
"shap.summary_plot(shap_values_rf_pos_class, X_cancer_test, plot_type=\"bar\")\n",
"# 显示平均绝对 SHAP 值\n",
"\n",
"# --- 4. Dependence Plot (Feature Dependence & Interaction) ---\n",
"# 查看 'worst perimeter' 特征如何影响 SHAP 值,并自动选择交互特征着色\n",
"feature_to_plot = \"worst perimeter\"\n",
"print(f\"\\nShowing Dependence Plot for '{feature_to_plot}':\")\n",
"shap.dependence_plot(feature_to_plot, \n",
" shap_values_rf_pos_class, \n",
" X_cancer_test, \n",
" interaction_index=\"auto\") # 自动寻找交互特征\n",
"# X轴: 特征 'worst perimeter' 的值\n",
"# Y轴: 该特征的 SHAP 值\n",
"# 颜色: 交互特征的值"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 6. 解释 Scikit-learn 模型示例 (其他类型)\n",
"\n",
"SHAP 可以用于解释多种模型。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(\"\\n--- Explaining Logistic Regression ---\")\n",
"\n",
"# 1. 训练模型 (需要标准化特征)\n",
"from sklearn.preprocessing import StandardScaler\n",
"scaler = StandardScaler()\n",
"X_cancer_train_scaled = scaler.fit_transform(X_cancer_train)\n",
"X_cancer_test_scaled = scaler.transform(X_cancer_test)\n",
"\n",
"lr_model = LogisticRegression(max_iter=5000, random_state=42)\n",
"lr_model.fit(X_cancer_train_scaled, y_cancer_train)\n",
"print(\"LogisticRegression trained.\")\n",
"\n",
"# 2. 创建 Explainer\n",
"# a) LinearExplainer (最适合线性模型)\n",
"# explainer_lr = shap.LinearExplainer(lr_model, X_cancer_train_scaled)\n",
"\n",
"# b) KernelExplainer (模型无关,较慢,需要背景数据)\n",
"# 通常从训练集中采样一部分作为背景数据 (e.g., k-means 聚类中心或随机样本)\n",
"# background_data = shap.sample(X_cancer_train_scaled, 50) # Sample 50 points\n",
"# 或者使用 kmeans\n",
"background_data_kmeans = shap.kmeans(X_cancer_train_scaled, 10) # Use 10 k-means centroids\n",
"explainer_lr_kernel = shap.KernelExplainer(lr_model.predict_proba, background_data_kmeans)\n",
"print(\"KernelExplainer created for Logistic Regression.\")\n",
"\n",
"# 3. 计算 SHAP 值 (解释类别 1 的概率)\n",
"# KernelExplainer 可能需要较长时间,我们只解释前几个样本\n",
"num_samples_to_explain = 10\n",
"print(f\"Calculating SHAP values for first {num_samples_to_explain} test samples using KernelExplainer...\")\n",
"shap_values_lr_kernel = explainer_lr_kernel.shap_values(X_cancer_test_scaled[:num_samples_to_explain])\n",
"print(\"SHAP values calculated.\")\n",
"\n",
"# KernelExplainer for predict_proba returns list [class0_shap, class1_shap]\n",
"shap_values_lr_kernel_pos_class = shap_values_lr_kernel[1]\n",
"base_value_lr_kernel = explainer_lr_kernel.expected_value[1]\n",
"\n",
"# 4. 可视化 (单个样本)\n",
"print(f\"\\nExplaining prediction for test sample 0 (Logistic Regression - Kernel):\")\n",
"shap.force_plot(base_value_lr_kernel, \n",
" shap_values_lr_kernel_pos_class[0,:],\n",
" X_cancer_test.iloc[0,:], # Use original feature names for display\n",
" matplotlib=True)\n",
"# plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 7. (简介) 解释深度学习模型\n",
"\n",
"对于 PyTorch 和 TensorFlow 模型,可以使用 `shap.DeepExplainer`。\n",
"\n",
"**基本流程 (PyTorch 示例):**\n",
"1. 训练你的 PyTorch 模型。\n",
"2. 选择一个背景数据集 (通常是训练集的一个子集,或代表性的样本)。\n",
"3. 创建 `shap.DeepExplainer(model, background_data_tensor)`。\n",
"4. 计算 SHAP 值: `shap_values = explainer.shap_values(input_data_tensor)`。\n",
"5. 使用 SHAP 可视化工具。\n",
"\n",
"```python\n",
"# import torch\n",
"# import shap\n",
"\n",
"# # 假设 model 是你训练好的 PyTorch 模型\n",
"# # 假设 background_data 是一个 PyTorch 张量\n",
"# # 假设 input_data 是你想解释的输入张量\n",
"\n",
"# model.eval()\n",
"# background_data = background_data.to(device)\n",
"# input_data = input_data.to(device)\n",
"\n",
"# explainer = shap.DeepExplainer(model, background_data)\n",
"# shap_values = explainer.shap_values(input_data)\n",
"\n",
"# # shap_values 的结构取决于模型输出层\n",
"# # 例如,对于多分类,可能是 list of tensors [num_classes, num_samples, *input_shape]\n",
"\n",
"# # 可视化 (可能需要将 shap_values 和 input_data 移回 CPU 并转为 NumPy)\n",
"# shap.summary_plot(shap_values[class_index].cpu().numpy(), input_data.cpu().numpy())\n",
"```\n",
"**注意**: `DeepExplainer` 的细节和行为可能因模型架构、激活函数和 PyTorch/TensorFlow 版本而异,需要仔细阅读 SHAP 文档。"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 总结\n",
"\n",
"SHAP 提供了一个强大且理论基础扎实的框架来解释机器学习模型的预测。通过计算和可视化 SHAP 值,我们可以深入了解每个特征对单个预测和模型整体行为的贡献。\n",
"\n",
"**关键要点:**\n",
"* SHAP 值量化了每个特征对预测结果(相对于基线)的贡献。\n",
"* 选择合适的 Explainer (`TreeExplainer`, `KernelExplainer`, `DeepExplainer`等) 取决于你的模型类型。\n",
"* `force_plot` 用于解释单个或多个预测。\n",
"* `summary_plot` 用于展示全局特征重要性和特征值与贡献的关系。\n",
"* `dependence_plot` 用于探索特征依赖性和交互效应。\n",
"* 模型可解释性对于建立信任、调试模型和满足合规性要求至关重要。\n",
"\n",
"熟练使用 SHAP 可以极大地增强你理解和沟通机器学习模型的能力。"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
},
"orig_nbformat": 4
},
"nbformat": 4,
"nbformat_minor": 5
}

Some Reflections on Python as a Dynamic Language

As everyone knows, Python is a fully dynamic language, which shows up in the following (see the short demonstration after this list):

  1. Dynamic type binding
  2. Runtime checking
  3. Object structure and contents can be modified at runtime (not just their values)
  4. Reflection
  5. Everything is an object (instances, classes, methods)
  6. Dynamic code execution (eval, exec)
  7. Duck typing
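
A short, runnable demonstration of the features above (the Point class and all names here are invented for illustration):

```python
class Point:
    def __init__(self, x):
        self.x = x

p = Point(1)
p.y = 2                                # (3) add an attribute at runtime
Point.norm = lambda self: abs(self.x)  # patch a method onto the class itself
print(isinstance(Point, type))         # (5) the class is itself an object
print(getattr(p, "y"))                 # (4) reflection
print(eval("p.x + p.y"))               # (6) dynamic code execution

def magnitude(obj):                    # (7) duck typing: anything with .norm() works
    return obj.norm()

print(magnitude(p))
```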

Dynamic languages impose fewer constraints and are therefore easier to pick up, but the trade-off is substantial runtime overhead, and execution is so decoupled from the underlying assembly that you cannot really tell how the code actually runs.

On top of that, there are several flaws I consider fairly serious, laid out below.

It breaks OOP semantics

Most popular programming languages support the OOP paradigm, i.e. inheritance and polymorphism. Python is no exception: simple tasks can be written in a purely imperative style (Imperative Programming), while complex ones can use full object orientation.

However, its dynamic nature undermines the structure of OOP:

  1. Blurred types: attributes and methods can be added to or removed from any instance at runtime (whereas a static language can only change their values at runtime). An instance modified this way arguably no longer belongs to its original type, since it now differs visibly from it; yet its built-in __class__ attribute still points to the original type, which muddles any understanding of what the type is. Conforming to a class should mean conforming in substance, not just in name.
  2. Broken inheritance, in two ways (see the Protocol sketch after this list):
    1. Most practice skips virtual-interface inheritance. The abc module provides the virtual-interface base class ABC; the classic approach is to derive your abstract class from ABC, derive concrete classes from that abstract class, and implement the abstract methods. But PEP guidance holds that the Pythonic way is to use typing.Protocol instead of ABC: the concrete class inherits from no virtual base at all, and as long as it implements the required methods, static checkers consider it to satisfy the Protocol.
    2. No need to inherit from a concrete parent either. As with the previous point, even a class with no parent (other than object) can define methods with the same names, exposing the same call interface as the would-be parent class. Semantically, the class definition then reveals no relationship to any other class; the whole codebase can be a loose collection of classes with no inheritance relations between any two of them.
  3. Broken polymorphism: no parameter or return value is inherently type-constrained. Passing a subtype where a parent type is expected loses its meaning, again because any type can be dynamically modified to satisfy the requirements.
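
A minimal sketch of the Protocol-based structural approach from point 2.1 (the class and method names are invented for illustration):

```python
from typing import Protocol

class SupportsGreet(Protocol):
    def greet(self) -> str: ...

class Greeter:  # declares no base class at all
    def greet(self) -> str:
        return "hello"

def welcome(g: SupportsGreet) -> str:
    return g.greet()

print(welcome(Greeter()))  # satisfies the Protocol structurally, with no inheritance
```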

It undermines design patterns

Classic patterns such as Factory, Abstract Factory, and Visitor depend heavily on inheritance and polymorphism, but Python's dynamic capabilities render such design patterns largely moot. Among widely used libraries, transformers does employ one: its from_pretrained family is a factory, where a string name selects the concrete constructor and produces a concrete subclass, while the factory's declared output type is the common base class of all models. (A minimal sketch of this pattern follows.)
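
A minimal string-keyed factory in that spirit (the classes and registry below are invented placeholders, not the actual transformers implementation):

```python
class BaseModel: ...
class BertModel(BaseModel): ...
class GPT2Model(BaseModel): ...

# the string name selects the concrete constructor
_REGISTRY = {"bert": BertModel, "gpt2": GPT2Model}

def from_name(name: str) -> BaseModel:
    # declared return type is the common base class; the actual object
    # is whichever concrete subclass the registry maps the name to
    return _REGISTRY[name]()

print(type(from_name("bert")).__name__)  # BertModel
```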

Security issues

At the source level Python generally does not manage pointers directly, so out-of-bounds accesses, wild pointers, and dangling pointers are mostly non-issues, and the gc mechanism handles memory reclamation, so these safety concerns need not occupy the programmer. In exchange, Python has safety problems of its own. Attacking unmanaged (native) code is comparatively hard: injected code must run reliably without corrupting the original structures and crashing the process (segfault). Python, by contrast, allows arbitrary code to be injected to alter the original logic directly, and since nothing sits in a fixed code segment, an attacker needs no such extra care. The contents of globals() and locals() can also be modified by hand at runtime, which carries its own risk (a small demonstration follows below). Another danger is code failing on type mismatches: because types are only determined at runtime, no guarantee can be made ahead of time, and a type error can be raised and crash the program.
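
A small demonstration of the globals() risk mentioned above (toy function, purely illustrative):

```python
def transfer(amount):
    return f"transferred {amount}"

# at runtime, any code path can silently replace existing behavior
globals()["transfer"] = lambda amount: "hijacked!"
print(transfer(100))  # prints "hijacked!", not the original logic
```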

Conclusion

I come from a C++ background but have been programming in Python for years. Python has held first place in language market share for many years now, by a wide margin, and that is inseparable from its flexibility; for a language aimed at the general public, such traits are necessary. Despite all the looseness criticized above, a programmer can still choose a rigorous object-oriented style. In the end, the quality of a program depends not on the language but on the programmer, and programmers have a duty to write maintainable, clear, well-disciplined code~

KuRRe8 commented May 8, 2025

Back to top

Insights, questions, or plain banter: all are welcome here!

Because there are many documents, an ipynb sometimes fails to render; that is a browser performance issue, and refreshing the page fixes it.

Or git clone the Gist locally to read it.
