@KuRRe8
Last active June 6, 2025 17:35
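Tutorials on using Python, split into separate files by category

Python Tutorials

Python is a beginner-friendly language, and the machine learning community now relies heavily on Python, C++, CUDA C, R, and similar languages, which has kept Python's popularity firmly in first place. This Gist provides Python tutorials that can be run directly in Jupyter Notebook.

  1. Language-level tutorials, which generally skip beginner topics;
  2. Standard library tutorials, covering basic usage of the most common standard library modules;
  3. Third-party library tutorials, mainly for common libraries such as numpy and pytorch, covering basic usage only and not new features

Other content is not included in this Gist. Note that a Gist is still version-controlled by git, so you can git clone it locally, or open the relevant ipynb file directly in Google Colab or Kaggle.

When browsing on the web there is no file list, so press Ctrl + F to search for the relevant section, or click the hyperlinks below.

If you would like to contribute, leave a comment; if you have questions, ask them in the comments as well ^.^

Contents - Language

Contents - Libraries

Contents - Domain-Specific Libraries (this tutorial focuses mostly on machine learning and deep learning)

Contents - Appendix

  • sigh.md: personal views on Python as a dynamic language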
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Scikit-learn - Python 通用机器学习库教程\n",
"\n",
"欢迎来到 Scikit-learn 教程!Scikit-learn (也写作 sklearn) 是 Python 中最流行、最全面的“经典”机器学习库。它提供了大量用于数据预处理、模型选择、模型训练、评估以及常见机器学习算法(分类、回归、聚类、降维)的工具。\n",
"\n",
"**为什么 Scikit-learn 对 ML/DL/数据科学很重要?**\n",
"\n",
"1. **一致的 API**:提供了简单、一致的接口 (`fit`, `predict`, `transform`) 来使用不同的算法。\n",
"2. **广泛的算法覆盖**:包含了绝大多数常用的非深度学习算法。\n",
"3. **数据预处理工具**:提供了丰富的特征缩放、编码、缺失值处理等工具。\n",
"4. **模型选择与评估**:内置交叉验证、超参数调优和各种评估指标。\n",
"5. **与其他库集成良好**:与 NumPy, SciPy, Pandas, Matplotlib 等紧密集成。\n",
"6. **优秀的文档和社区支持**:文档清晰,示例丰富,社区活跃。\n",
"7. **学习基础**:Scikit-learn 的设计思想和 API 风格对许多其他机器学习库(甚至深度学习库的部分接口)产生了深远影响。\n",
"\n",
"**本教程将涵盖 Scikit-learn 的核心工作流程和常用功能:**\n",
"\n",
"1. 加载数据集\n",
"2. 数据预处理 (特征缩放, 编码)\n",
"3. 数据集划分 (训练集/测试集)\n",
"4. 模型训练 (`fit`)\n",
"5. 模型预测 (`predict`, `predict_proba`)\n",
"6. 常用模型示例 (分类: Logistic Regression, Random Forest; 回归: Linear Regression; 聚类: KMeans)\n",
"7. 模型评估 (常用指标)\n",
"8. 交叉验证 (`cross_val_score`)\n",
"9. 超参数调优 (`GridSearchCV`)\n",
"10. 管道 (`Pipeline`)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 准备工作:导入必要的库\n",
"\n",
"我们将导入 Scikit-learn 中需要的模块,以及 NumPy 和 Pandas。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Standard Libraries\n",
"import numpy as np\n",
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"\n",
"# Scikit-learn modules\n",
"from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV\n",
"from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder, LabelEncoder\n",
"from sklearn.impute import SimpleImputer\n",
"from sklearn.linear_model import LinearRegression, LogisticRegression\n",
"from sklearn.tree import DecisionTreeClassifier\n",
"from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier\n",
"from sklearn.svm import SVC\n",
"from sklearn.cluster import KMeans\n",
"from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, \n",
" mean_squared_error, r2_score, confusion_matrix, classification_report\n",
"from sklearn.pipeline import Pipeline\n",
"from sklearn.datasets import load_iris, load_boston # Example datasets (load_boston deprecated, use fetch_california_housing or others)\n",
"from sklearn.compose import ColumnTransformer\n",
"\n",
"# Set plotting style\n",
"sns.set_theme(style=\"whitegrid\")\n",
"\n",
"print(\"Libraries imported.\")\n",
"\n",
"# Handle potential deprecation of load_boston\n",
"try:\n",
" from sklearn.datasets import fetch_california_housing\n",
" california_housing = fetch_california_housing(as_frame=True) # Use as_frame=True to get a Pandas DataFrame\n",
" print(\"Using California Housing dataset.\")\n",
" regression_data = california_housing\n",
"except ImportError:\n",
" print(\"fetch_california_housing not available. Some regression examples might be limited.\")\n",
" regression_data = None\n",
"\n",
"# Load Iris dataset for classification\n",
"iris = load_iris(as_frame=True)\n",
"iris_df = iris.data\n",
"iris_df['target'] = iris.target\n",
"iris_df['species'] = iris_df['target'].map({i: name for i, name in enumerate(iris.target_names)}) # Add species names\n",
"print(\"\\nIris dataset loaded into DataFrame:\")\n",
"print(iris_df.head())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. 数据预处理\n",
"\n",
"机器学习模型通常需要数值型、标准化的输入数据。预处理步骤包括处理缺失值、转换分类特征和缩放数值特征。\n",
"\n",
"* **特征缩放 (Scaling)**:将数值特征缩放到相似的范围,防止某些特征因数值范围过大而主导模型。\n",
" * `StandardScaler`: 标准化 (均值为0,方差为1)。\n",
" * `MinMaxScaler`: 归一化到 [0, 1] (或指定范围)。\n",
"* **分类特征编码 (Encoding)**:将文本或类别标签转换为数值表示。\n",
" * `LabelEncoder`: 将标签编码为 0 到 n_classes-1 的整数 (通常用于目标变量)。\n",
" * `OneHotEncoder`: 将具有 k 个类别的特征转换为 k 个二元 (0/1) 特征 (通常用于输入特征,避免引入错误的顺序关系)。\n",
"* **缺失值处理 (Imputation)**:用特定策略(如均值、中位数、众数)填充缺失值 (`NaN`)。\n",
" * `SimpleImputer`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(\"--- Data Preprocessing Examples ---\")\n",
"\n",
"# --- Feature Scaling ---\n",
"print(\"\\n--- Feature Scaling ---\")\n",
"data_scale = np.array([[1., -1., 2.], [2., 0., 0.], [0., 1., -1.]])\n",
"print(f\"Original data:\\n{data_scale}\")\n",
"\n",
"scaler_standard = StandardScaler()\n",
"scaled_standard = scaler_standard.fit_transform(data_scale)\n",
"print(f\"\\nStandardized data (mean ~0, std ~1):\\n{scaled_standard}\")\n",
"print(f\"Mean after scaling: {scaled_standard.mean(axis=0)}\")\n",
"print(f\"Std after scaling: {scaled_standard.std(axis=0)}\")\n",
"\n",
"scaler_minmax = MinMaxScaler()\n",
"scaled_minmax = scaler_minmax.fit_transform(data_scale)\n",
"print(f\"\\nMinMax scaled data (range [0, 1]):\\n{scaled_minmax}\")\n",
"\n",
"# --- Categorical Encoding ---\n",
"print(\"\\n--- Categorical Encoding ---\")\n",
"categorical_feature = [['Male', 'Low'], ['Female', 'Medium'], ['Female', 'High'], ['Male', 'Low']]\n",
"df_cat = pd.DataFrame(categorical_feature, columns=['Gender', 'Level'])\n",
"print(f\"Original categorical data:\\n{df_cat}\")\n",
"\n",
"# OneHotEncoder for input features\n",
"onehot_encoder = OneHotEncoder(sparse_output=False) # sparse=False returns numpy array\n",
"encoded_features = onehot_encoder.fit_transform(df_cat)\n",
"print(f\"\\nOneHot encoded features:\\n{encoded_features}\")\n",
"print(f\"Feature names: {onehot_encoder.get_feature_names_out()}\")\n",
"\n",
"# LabelEncoder for target variable (example)\n",
"target_labels = ['Cat', 'Dog', 'Cat', 'Fish', 'Dog']\n",
"label_encoder = LabelEncoder()\n",
"encoded_labels = label_encoder.fit_transform(target_labels)\n",
"print(f\"\\nOriginal labels: {target_labels}\")\n",
"print(f\"Label encoded labels: {encoded_labels}\")\n",
"print(f\"Encoded classes: {label_encoder.classes_}\")\n",
"\n",
"# --- Imputation --- \n",
"print(\"\\n--- Imputation (Missing Values) ---\")\n",
"data_missing = np.array([[1, 2, np.nan], [4, np.nan, 6], [7, 8, 9]])\n",
"print(f\"Data with missing values:\\n{data_missing}\")\n",
"\n",
"imputer_mean = SimpleImputer(strategy='mean')\n",
"imputed_data = imputer_mean.fit_transform(data_missing)\n",
"print(f\"\\nImputed data (using mean):\\n{imputed_data}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. 数据集划分 (训练集/测试集)\n",
"\n",
"将数据集分为训练集和测试集是评估模型泛化能力的关键步骤。\n",
"`train_test_split` 是常用的工具。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(\"--- Train/Test Split --- \")\n",
"# 使用 Iris 数据集\n",
"X = iris_df[['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']] # 特征\n",
"y = iris_df['target'] # 目标变量 (类别标签 0, 1, 2)\n",
"\n",
"# test_size: 测试集比例或数量\n",
"# random_state: 随机种子,确保每次划分结果一致\n",
"# stratify=y: 对于分类问题,确保训练集和测试集中各类别比例与原始数据一致\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)\n",
"\n",
"print(f\"Original dataset shape: X={X.shape}, y={y.shape}\")\n",
"print(f\"Training set shape: X_train={X_train.shape}, y_train={y_train.shape}\")\n",
"print(f\"Test set shape: X_test={X_test.shape}, y_test={y_test.shape}\")\n",
"print(f\"\\nProportion of classes in original y:\\n{y.value_counts(normalize=True)}\")\n",
"print(f\"\\nProportion of classes in y_train:\\n{y_train.value_counts(normalize=True)}\")\n",
"print(f\"\\nProportion of classes in y_test:\\n{y_test.value_counts(normalize=True)}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. 模型训练 (`fit`)\n",
"\n",
"Scikit-learn 的核心 API 非常一致:\n",
"1. 选择一个模型类并实例化。\n",
"2. 使用训练数据调用实例的 `fit(X_train, y_train)` 方法。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(\"--- Model Training Example (Logistic Regression) ---\")\n",
"\n",
"# 0. (Optional but often needed) Scale the features\n",
"scaler = StandardScaler()\n",
"X_train_scaled = scaler.fit_transform(X_train) # Fit on training data, then transform\n",
"# IMPORTANT: Use the SAME scaler fitted on training data to transform the test data\n",
"X_test_scaled = scaler.transform(X_test)\n",
"print(\"Features scaled using StandardScaler.\")\n",
"\n",
"# 1. Instantiate the model\n",
"log_reg = LogisticRegression(random_state=42, max_iter=200) # Increase max_iter if needed\n",
"print(f\"Model instantiated: {log_reg}\")\n",
"\n",
"# 2. Train the model using scaled training data\n",
"log_reg.fit(X_train_scaled, y_train)\n",
"print(\"Model training completed (fitted).\")\n",
"\n",
"# 模型已训练完成,其内部参数(如系数、截距)已被学习"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. 模型预测 (`predict`, `predict_proba`)\n",
"\n",
"训练好的模型可以用来对新数据(通常是测试集)进行预测。\n",
"* `predict(X_test)`: 对 `X_test` 中的每个样本预测类别标签(分类)或数值(回归)。\n",
"* `predict_proba(X_test)`: (仅限分类模型)返回每个样本属于各个类别的概率。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(\"--- Model Prediction --- \")\n",
"\n",
"# 使用训练好的 log_reg 模型和缩放后的测试数据 X_test_scaled\n",
"y_pred = log_reg.predict(X_test_scaled)\n",
"print(f\"Predicted labels for test set (first 10): {y_pred[:10]}\")\n",
"print(f\"Actual labels for test set (first 10): {y_test.values[:10]}\")\n",
"\n",
"# 获取预测概率\n",
"y_pred_proba = log_reg.predict_proba(X_test_scaled)\n",
"print(f\"\\nPredicted probabilities for test set (first 5 samples):\\n{y_pred_proba[:5].round(3)}\")\n",
"# 每一行对应一个样本,每一列对应一个类别的概率 (顺序由 model.classes_ 决定)\n",
"print(f\"Model classes: {log_reg.classes_}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 6. 常用模型示例\n",
"\n",
"Scikit-learn 提供了多种模型。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(\"--- Other Model Examples ---\")\n",
"\n",
"# --- Random Forest Classifier ---\n",
"print(\"\\nTraining Random Forest Classifier...\")\n",
"rf_clf = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1) # n_jobs=-1 使用所有CPU核心\n",
"rf_clf.fit(X_train_scaled, y_train)\n",
"y_pred_rf = rf_clf.predict(X_test_scaled)\n",
"print(\"Random Forest training complete.\")\n",
"# Evaluation will be done later\n",
"\n",
"# --- Support Vector Classifier (SVC) ---\n",
"print(\"\\nTraining Support Vector Classifier...\")\n",
"svc_clf = SVC(probability=True, random_state=42) # probability=True enables predict_proba\n",
"svc_clf.fit(X_train_scaled, y_train)\n",
"y_pred_svc = svc_clf.predict(X_test_scaled)\n",
"print(\"SVC training complete.\")\n",
"\n",
"# --- Linear Regression (using California Housing) ---\n",
"if regression_data is not None:\n",
" print(\"\\n--- Linear Regression Example --- \")\n",
" X_reg = regression_data.data\n",
" y_reg = regression_data.target\n",
" \n",
" X_reg_train, X_reg_test, y_reg_train, y_reg_test = train_test_split(X_reg, y_reg, test_size=0.3, random_state=42)\n",
" \n",
" # Scale regression features\n",
" reg_scaler = StandardScaler()\n",
" X_reg_train_scaled = reg_scaler.fit_transform(X_reg_train)\n",
" X_reg_test_scaled = reg_scaler.transform(X_reg_test)\n",
" \n",
" lin_reg = LinearRegression()\n",
" lin_reg.fit(X_reg_train_scaled, y_reg_train)\n",
" y_pred_reg = lin_reg.predict(X_reg_test_scaled)\n",
" print(\"Linear Regression training complete.\")\n",
" print(f\"First 5 regression predictions: {y_pred_reg[:5].round(2)}\")\n",
" print(f\"First 5 actual regression values: {y_reg_test.values[:5].round(2)}\")\n",
"else:\n",
" print(\"\\nSkipping Linear Regression example (dataset not loaded).\")\n",
"\n",
"# --- K-Means Clustering (Unsupervised) ---\n",
"print(\"\\n--- K-Means Clustering Example --- \")\n",
"# 使用未标记的 Iris 特征 (假设我们不知道类别)\n",
"X_iris_unscaled = X # Use original unscaled data for clustering interpretation, or scale it\n",
"kmeans = KMeans(n_clusters=3, random_state=42, n_init=10) # n_clusters 通常需要预先确定或尝试不同值; n_init for stability\n",
"kmeans.fit(X_iris_unscaled) # 无监督学习,只需要 X\n",
"cluster_labels = kmeans.labels_\n",
"print(f\"K-Means training complete. Cluster labels (first 20): {cluster_labels[:20]}\")\n",
"print(f\"Cluster centers (centroids):\\n{kmeans.cluster_centers_.round(2)}\")\n",
"\n",
"# 比较 K-Means 找到的簇与真实类别 (需要注意簇标签与真实标签可能不对应)\n",
"df_cluster_comparison = pd.DataFrame({'TrueLabel': y, 'ClusterLabel': cluster_labels})\n",
"print(\"\\nCross-tabulation of True Labels vs K-Means Clusters:\")\n",
"print(pd.crosstab(df_cluster_comparison['TrueLabel'], df_cluster_comparison['ClusterLabel']))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 7. 模型评估\n",
"\n",
"评估模型在测试集上的性能至关重要。\n",
"\n",
"**分类常用指标:**\n",
"* `accuracy_score`: 准确率 (正确预测的比例)。\n",
"* `precision_score`: 精确率 (预测为正的样本中,实际为正的比例)。\n",
"* `recall_score`: 召回率 (实际为正的样本中,被正确预测为正的比例)。\n",
"* `f1_score`: F1 分数 (精确率和召回率的调和平均数)。\n",
"* `confusion_matrix`: 混淆矩阵。\n",
"* `classification_report`: 包含精确率、召回率、F1 分数的文本报告。\n",
"* `roc_auc_score`: ROC 曲线下面积 (适用于二分类或多分类的 OvR/OvO)。\n",
"\n",
"**回归常用指标:**\n",
"* `mean_squared_error (MSE)`: 均方误差。\n",
"* `mean_absolute_error (MAE)`: 平均绝对误差。\n",
"* `r2_score`: R 方(决定系数),表示模型解释了多少因变量的方差。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(\"--- Model Evaluation ---\")\n",
"\n",
"# --- Classification Evaluation (using Logistic Regression results) ---\n",
"print(\"\\n--- Logistic Regression Evaluation ---\")\n",
"accuracy_logreg = accuracy_score(y_test, y_pred)\n",
"# For multiclass precision/recall/f1, need 'average' parameter (e.g., 'macro', 'micro', 'weighted')\n",
"precision_logreg = precision_score(y_test, y_pred, average='weighted')\n",
"recall_logreg = recall_score(y_test, y_pred, average='weighted')\n",
"f1_logreg = f1_score(y_test, y_pred, average='weighted')\n",
"\n",
"print(f\"Accuracy: {accuracy_logreg:.4f}\")\n",
"print(f\"Precision (weighted): {precision_logreg:.4f}\")\n",
"print(f\"Recall (weighted): {recall_logreg:.4f}\")\n",
"print(f\"F1 Score (weighted): {f1_logreg:.4f}\")\n",
"\n",
"print(\"\\nConfusion Matrix:\")\n",
"print(confusion_matrix(y_test, y_pred))\n",
"\n",
"print(\"\\nClassification Report:\")\n",
"print(classification_report(y_test, y_pred, target_names=iris.target_names))\n",
"\n",
"# ROC AUC (requires probabilities, use OvR for multiclass)\n",
"# roc_auc = roc_auc_score(y_test, y_pred_proba, multi_class='ovr', average='weighted')\n",
"# print(f\"ROC AUC (weighted OvR): {roc_auc:.4f}\")\n",
"\n",
"# --- Regression Evaluation (using Linear Regression results) ---\n",
"if regression_data is not None:\n",
" print(\"\\n--- Linear Regression Evaluation ---\")\n",
" mse_linreg = mean_squared_error(y_reg_test, y_pred_reg)\n",
" r2_linreg = r2_score(y_reg_test, y_pred_reg)\n",
" print(f\"Mean Squared Error (MSE): {mse_linreg:.4f}\")\n",
" print(f\"R-squared (R2): {r2_linreg:.4f}\")\n",
"else:\n",
" print(\"\\nSkipping Regression Evaluation.\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 8. 交叉验证 (`cross_val_score`)\n",
"\n",
"简单的训练/测试集划分可能因划分的随机性导致评估结果不稳定。交叉验证通过将数据分成多个“折”(fold),轮流使用其中一折作为验证集,其余作为训练集,然后对多次评估结果取平均,来提供更稳健的模型性能估计。\n",
"\n",
"`cross_val_score` 是一个便捷的函数。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(\"--- Cross-Validation Example (Random Forest on Iris) ---\")\n",
"\n",
"# 使用完整的、缩放后的 Iris 数据集进行交叉验证\n",
"X_iris_scaled = scaler.fit_transform(X) # Scale the full dataset\n",
"y_iris = y\n",
"\n",
"rf_clf_cv = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)\n",
"\n",
"# cv 参数指定折数 (e.g., 5-fold cross-validation)\n",
"# scoring 参数指定评估指标 (e.g., 'accuracy', 'f1_weighted', 'neg_mean_squared_error' for regression)\n",
"cv_scores = cross_val_score(rf_clf_cv, X_iris_scaled, y_iris, cv=5, scoring='accuracy')\n",
"\n",
"print(f\"Cross-validation scores (accuracy): {cv_scores}\")\n",
"print(f\"Average CV accuracy: {cv_scores.mean():.4f}\")\n",
"print(f\"Standard deviation of CV accuracy: {cv_scores.std():.4f}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 9. 超参数调优 (`GridSearchCV`)\n",
"\n",
"机器学习模型通常有许多**超参数**(在训练前设置的参数,如 Random Forest 的 `n_estimators` 或 SVC 的 `C` 和 `gamma`),它们会影响模型性能。\n",
"\n",
"`GridSearchCV` 通过系统地尝试超参数网格中的所有组合,并使用交叉验证来评估每种组合的性能,从而找到最佳的超参数设置。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(\"--- Hyperparameter Tuning with GridSearchCV (SVC on Iris) ---\")\n",
"\n",
"# 定义要搜索的参数网格\n",
"param_grid = {\n",
" 'C': [0.1, 1, 10, 100], # 正则化参数\n",
" 'gamma': [1, 0.1, 0.01, 0.001], # RBF 核的系数 ('rbf'是默认核)\n",
" 'kernel': ['rbf', 'linear'] # 尝试不同的核函数\n",
"}\n",
"\n",
"# 创建 GridSearchCV 对象\n",
"# estimator: 要调优的模型实例\n",
"# param_grid: 参数网格\n",
"# cv: 交叉验证折数\n",
"# scoring: 评估指标\n",
"# n_jobs=-1: 使用所有CPU核心\n",
"grid_search = GridSearchCV(SVC(random_state=42), \n",
" param_grid, \n",
" cv=5, \n",
" scoring='accuracy', \n",
" n_jobs=-1, \n",
" verbose=1) # verbose 控制输出信息的详细程度\n",
"\n",
"# 在训练数据上执行网格搜索 (它会自动进行交叉验证)\n",
"print(\"Starting Grid Search...\")\n",
"grid_search.fit(X_train_scaled, y_train) # Fit on the training set\n",
"print(\"Grid Search complete.\")\n",
"\n",
"# 查看最佳参数和最佳分数\n",
"print(f\"\\nBest parameters found: {grid_search.best_params_}\")\n",
"print(f\"Best cross-validation accuracy: {grid_search.best_score_:.4f}\")\n",
"\n",
"# 获取最佳模型\n",
"best_svc = grid_search.best_estimator_\n",
"\n",
"# 使用最佳模型在测试集上评估\n",
"y_pred_best_svc = best_svc.predict(X_test_scaled)\n",
"accuracy_best_svc = accuracy_score(y_test, y_pred_best_svc)\n",
"print(f\"\\nAccuracy on test set with best SVC: {accuracy_best_svc:.4f}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 10. 管道 (`Pipeline`)\n",
"\n",
"管道 (`Pipeline`) 可以将多个数据处理步骤(如缩放、特征选择)和一个最终的模型估计器链接在一起。这有几个好处:\n",
"* **方便性**:只需对管道调用一次 `fit` 和 `predict`。\n",
"* **防止数据泄露**:确保预处理步骤(如缩放)只在训练数据上 `fit`,然后应用于测试数据,这在交叉验证和网格搜索中尤其重要。\n",
"* **代码简洁**:将工作流封装在一个对象中。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(\"--- Pipeline Example (Scaling + SVC) ---\")\n",
"\n",
"# 创建管道:包含一个缩放器和一个分类器\n",
"pipeline = Pipeline([\n",
" ('scaler', StandardScaler()), # 步骤1:命名为 'scaler',使用 StandardScaler\n",
" ('svc', SVC(random_state=42)) # 步骤2:命名为 'svc',使用 SVC\n",
"])\n",
"\n",
"# 直接在原始训练数据上训练管道\n",
"# 管道会自动处理:\n",
"# 1. scaler.fit_transform(X_train)\n",
"# 2. svc.fit(scaled_X_train, y_train)\n",
"pipeline.fit(X_train, y_train) \n",
"print(\"Pipeline trained.\")\n",
"\n",
"# 使用管道进行预测\n",
"# 管道会自动处理:\n",
"# 1. scaler.transform(X_test) (使用在训练集上fit的scaler)\n",
"# 2. svc.predict(scaled_X_test)\n",
"y_pred_pipeline = pipeline.predict(X_test)\n",
"accuracy_pipeline = accuracy_score(y_test, y_pred_pipeline)\n",
"print(f\"Accuracy using pipeline on test set: {accuracy_pipeline:.4f}\")\n",
"\n",
"# 管道也可以与 GridSearchCV 结合使用,调优模型参数甚至预处理步骤的参数\n",
"print(\"\\n--- Pipeline with GridSearchCV ---\")\n",
"param_grid_pipeline = {\n",
" 'svc__C': [0.1, 1, 10], # 注意参数名前缀:步骤名 + 双下划线 + 参数名\n",
" 'svc__gamma': [0.1, 0.01]\n",
"}\n",
"\n",
"grid_search_pipe = GridSearchCV(pipeline, param_grid_pipeline, cv=3, scoring='accuracy', verbose=0)\n",
"print(\"Starting Grid Search with Pipeline...\")\n",
"grid_search_pipe.fit(X_train, y_train) # Fit on original X_train\n",
"print(\"Grid Search with Pipeline complete.\")\n",
"print(f\"Best params for pipeline: {grid_search_pipe.best_params_}\")\n",
"print(f\"Best CV score for pipeline: {grid_search_pipe.best_score_:.4f}\")\n",
"\n",
"# 评估最佳管道\n",
"best_pipeline = grid_search_pipe.best_estimator_\n",
"y_pred_best_pipe = best_pipeline.predict(X_test)\n",
"accuracy_best_pipe = accuracy_score(y_test, y_pred_best_pipe)\n",
"print(f\"Accuracy on test set with best pipeline: {accuracy_best_pipe:.4f}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 总结\n",
"\n",
"Scikit-learn 是 Python 中进行经典机器学习任务的强大而全面的库。其一致的 API、丰富的算法和工具集,使其成为数据科学家和机器学习工程师的必备技能。\n",
"\n",
"**关键要点:**\n",
"* 遵循**数据加载 -> 预处理 -> 划分 -> 训练 -> 预测 -> 评估**的基本流程。\n",
"* 熟练使用 `StandardScaler`, `OneHotEncoder` 等预处理工具。\n",
"* 掌握 `fit()`, `predict()`, `transform()` 核心方法。\n",
"* 了解常用分类、回归、聚类算法的 Scikit-learn 实现。\n",
"* 使用交叉验证和网格搜索进行稳健的模型评估和超参数调优。\n",
"* 利用 `Pipeline` 简化工作流程并避免数据泄露。\n",
"\n",
"Scikit-learn 的功能远不止于此,还包括特征选择、降维 (PCA, t-SNE)、更复杂的模型集成、半监督学习等。强烈建议深入学习其官方文档和用户指南。"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
},
"orig_nbformat": 4
},
"nbformat": 4,
"nbformat_minor": 5
}

Some Reflections on Python as a Dynamic Language

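As is well known, Python is a fully dynamic language, which shows in the following:

  1. Dynamic type binding
  2. Runtime checking
  3. Object structure, not just values, can be modified dynamically
  4. Reflection
  5. Everything is an object (instance, class, method)
  6. Code can be executed dynamically (eval, exec)
  7. Support for duck typing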
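
A minimal sketch of several of these dynamic features in plain Python may make the list more concrete. The names Point, Bag, and x_val are invented for illustration only and do not come from any library.

```python
class Point:
    """A plain class with two attributes."""
    def __init__(self, x, y):
        self.x = x
        self.y = y


# Dynamic type binding: the same name can refer to objects of different types.
value = 42
first_type = type(value).__name__
value = "forty-two"
second_type = type(value).__name__

# Object structure can change at runtime, not just attribute values.
p = Point(1, 2)
p.z = 3          # add an attribute the class never declared
del p.y          # remove one that it did declare
attrs_after_edit = sorted(vars(p))

# Reflection: look up members by name at runtime.
x_via_reflection = getattr(p, "x")
has_y = hasattr(p, "y")

# Everything is an object, including classes and functions.
class_is_object = isinstance(Point, object)
function_is_object = isinstance(Point.__init__, object)

# Dynamic code execution: evaluate an expression built as a string.
evaluated = eval("x_val * 2", {"x_val": 21})


# Duck typing: len() accepts anything that implements __len__.
class Bag:
    def __len__(self):
        return 5


bag_length = len(Bag())
```

None of these lines would compile as written in a typical statically typed language, which is the flexibility (and the cost) discussed in the rest of this note.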

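Dynamic languages impose fewer constraints and are easier for newcomers to pick up, but the price is high runtime overhead and complete decoupling from the underlying machine-level execution, so it is hard to know how the code actually runs.

There are also several points that I consider fairly serious flaws. They are laid out below.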

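It Undermines the Semantics of OOP

Most popular programming languages support the OOP paradigm, that is, inheritance and polymorphism. Python likewise can be purely imperative for simple tasks, or use full object-oriented programming for complex ones.

However, its dynamic features undermine the structure of OOP: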

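  1. Blurred types: any instance of any type can have attributes or methods added or removed at runtime (a statically typed language, by contrast, only lets you change their values at runtime). An instance modified this way arguably no longer belongs to its original type, since it now differs visibly from it. Yet its built-in __class__ attribute still points to the original type, which muddies what the type means. Conforming to a class should be a matter of content, not just of name.
  2. Broken inheritance, in two respects:
    1. Most code in practice does not inherit from abstract interfaces. The abc module provides ABC as a base class for abstract interfaces; the classic approach is to derive your own abstract class from ABC, derive concrete classes from that, and implement the abstract methods. PEP 544, however, introduced typing.Protocol, which is often regarded as the more Pythonic choice: a concrete class inherits from no abstract class at all, and as long as it implements the required methods, a static checker accepts it as conforming to the Protocol.
    2. Inheriting from a concrete parent class is unnecessary. As with the previous point, a class with no parent other than object can still define methods with the same names and so expose the same call interface as the would-be parent. Semantically, the class definition then reveals nothing about how it relates to other classes. The code can be a loose collection in which no two classes are related by inheritance.
  3. Broken polymorphism: no parameter or return value is restricted to a type by default. Passing a subtype where a parent type is expected therefore loses its meaning, again because any object can be modified dynamically to meet the requirement.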
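
The three points above can be sketched in a few lines. This is only an illustration under assumed names (Duck, Speaker, Dog, Robot, SupportsSpeak, and announce are all hypothetical); it contrasts nominal typing through abc.ABC with structural typing through typing.Protocol, and shows an instance being reshaped while its class stays the same.

```python
from abc import ABC, abstractmethod
from typing import Protocol, runtime_checkable


class Duck:
    def speak(self):
        return "Quack"


# Blurred types: reshape a single instance at runtime.
d = Duck()
d.wheels = 4                         # attribute the class never declared
d.speak = lambda: "Beep"             # instance attribute shadows the class method
still_a_duck = isinstance(d, Duck)   # the nominal type is unchanged
class_name = d.__class__.__name__


# Nominal interface: concrete classes must inherit from the ABC.
class Speaker(ABC):
    @abstractmethod
    def speak(self): ...


class Dog(Speaker):
    def speak(self):
        return "Woof"


# Structural interface: no inheritance needed, only matching methods.
@runtime_checkable
class SupportsSpeak(Protocol):
    def speak(self) -> str: ...


class Robot:                         # inherits from nothing but object
    def speak(self) -> str:
        return "Bleep"


def announce(s: SupportsSpeak) -> str:
    # The annotation is not enforced at runtime; any object with speak() works.
    return s.speak().upper()


robot_is_speaker = isinstance(Robot(), Speaker)              # not a subclass of the ABC
robot_matches_protocol = isinstance(Robot(), SupportsSpeak)  # has the required method

sounds = []
for obj in (Dog(), Robot(), d):
    sounds.append(announce(obj))
```

Here announce accepts a Dog, an unrelated Robot, and a modified Duck alike, so the inheritance relationship between Dog and Speaker carries no weight at the call site.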

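It Undermines Design Patterns

Classic patterns such as Factory, Abstract Factory, and Visitor depend heavily on inheritance and polymorphism. In Python's design, however, the language's dynamic capabilities make design patterns largely nominal. Among well-known libraries that do use design patterns is transformers: its from_pretrained family of methods is a factory pattern, where a string name selects the concrete constructor and yields a concrete subclass, while the factory's declared return type is a base class common to all models.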
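
As a rough sketch of the factory idea described here (a string name selects a concrete subclass, and the caller only sees the base type), the following uses a small class registry. It is not how transformers implements from_pretrained; BaseModel, TinyModel, LargeModel, from_name, and the registry are all hypothetical names chosen for the example.

```python
class BaseModel:
    """Hypothetical common base class that the factory returns."""

    registry = {}   # maps a string name to a concrete subclass

    def __init_subclass__(cls, name=None, **kwargs):
        # Called automatically whenever a subclass is defined.
        super().__init_subclass__(**kwargs)
        if name is not None:
            cls.registry[name] = cls   # shared dict inherited from BaseModel

    @classmethod
    def from_name(cls, name):
        """Factory: pick the concrete class from a string and instantiate it."""
        try:
            concrete = cls.registry[name]
        except KeyError:
            raise ValueError(f"unknown model name: {name!r}") from None
        return concrete()

    def describe(self):
        raise NotImplementedError


class TinyModel(BaseModel, name="tiny"):
    def describe(self):
        return "tiny model"


class LargeModel(BaseModel, name="large"):
    def describe(self):
        return "large model"


model = BaseModel.from_name("large")
description = model.describe()
```

The caller writes BaseModel.from_name("large") and receives a LargeModel, so which concrete type comes back depends on a runtime string rather than on anything visible in the static structure of the code.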

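Security Issues

Python code generally does not manage pointers directly, so out-of-bounds pointers, wild pointers, and dangling pointers are mostly absent. The gc mechanism also reclaims memory automatically, so these safety issues need no attention while coding. On the other hand, Python has security problems of its own. Attacking unmanaged code has traditionally been hard: injected code must avoid corrupting the existing structure, or the program simply crashes (segmentation fault). In Python, by contrast, arbitrary code can be injected to alter the original logic, and because that logic is not fixed content in a code segment, the attacker needs no such extra care. The contents of globals() and locals() can be modified by hand at runtime, which also carries some risk. Another danger is type mismatch: because types are only determined at runtime, nothing can be guaranteed in advance, and a type error may be raised and crash the program.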
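
Two of these risks can be shown with a small, self-contained sketch. The function names is_admin, can_delete, and total_length are made up for this example; the first part rebinds a name in a function's global namespace, and the second triggers a type error that only surfaces when the offending call runs.

```python
source = '''
def is_admin(user):
    return user == "root"

def can_delete(user):
    return is_admin(user)
'''

namespace = {}
exec(source, namespace)           # the dict becomes the functions' global namespace
can_delete = namespace["can_delete"]

before = can_delete("guest")      # the original check runs

# Rebinding the name in the function's globals changes behavior
# without touching can_delete itself.
can_delete.__globals__["is_admin"] = lambda user: True
after = can_delete("guest")       # the injected check runs instead


def total_length(items):
    """Sum of len() over the items; element types are not checked up front."""
    return sum(len(item) for item in items)


ok_total = total_length(["ab", "cde"])
try:
    total_length(["ab", 3])       # int has no len(); fails only when this call runs
    error_name = None
except TypeError as exc:
    error_name = type(exc).__name__
```

Type hints plus a static checker can catch the second kind of mistake before the program runs, but the interpreter itself does not enforce them.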

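Summary

My background is in C++, but in recent years I have been programming in Python all the time. Python's market share has ranked first for years, and by a wide margin, and that cannot be separated from its flexibility. For a programming language aimed at the general public, this kind of flexibility is necessary. Despite all the ways described above in which Python lacks rigor, programmers can still choose to write rigorous object-oriented code. The quality of a program therefore depends not on the language but on the programmer, who is responsible for writing code that is maintainable, clear, and well structured~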

KuRRe8 commented May 8, 2025

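Back to top

Whether you have insights, questions, or just want to chat, feel free to post here!

Because there are many documents, an ipynb file sometimes fails to render; this is a browser performance issue, and refreshing the page fixes it.

Alternatively, git clone the Gist and read it locally.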
