@KuRRe8
Last active June 6, 2025 17:35
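Tutorials on using Python, split into separate files by category

Python Tutorials

Python is a beginner-friendly language, and the machine learning community now relies heavily on Python together with C++, CUDA C, R, and similar languages, which keeps Python at the top of popularity rankings. This Gist collects Python tutorials that can be run directly in Jupyter Notebook.

  1. Language-level tutorials, which generally skip beginner topics;
  2. Standard library tutorials, covering basic usage of the most common standard modules;
  3. Third-party library tutorials, mainly common libraries such as numpy and pytorch, covering basic usage only and not new features.

Other content will not be added to this Gist. Note that a Gist is still version-controlled by git, so you can git clone it locally, or open the relevant ipynb files directly in Google Colab or Kaggle.

When browsing on the web there is no file list, so press Ctrl + F to search for the relevant section, or click the links below.

To contribute, leave a note in the comments; questions are welcome there too ^.^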

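Contents: Language

Contents: Libraries

Contents: Domain-specific libraries (this tutorial focuses mainly on machine learning and deep learning)

Contents: Appendix

  • sigh.md: personal views on Python as a dynamic language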
{
"cells": [
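{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Pandas: A Tutorial on Python's Data Processing and Analysis Toolkit\n",
"\n",
"Welcome to the Pandas tutorial! Pandas (the name derives from \"panel data\") is a Python library built on top of NumPy that provides high-performance, easy-to-use data structures and data analysis tools. It is one of the core components of the Python data science ecosystem.\n",
"\n",
"**Why does Pandas matter so much for ML/DL/data science?**\n",
"\n",
"1. **Powerful data structures**: it provides two main data structures:\n",
"    * **`Series`**: a one-dimensional labeled array that can hold any data type.\n",
"    * **`DataFrame`**: a two-dimensional labeled data structure, similar to a spreadsheet, an SQL table, or a data.frame in R. Columns can have different data types.\n",
"2. **Data processing**: handle missing data, cleaning, transformation, merging, reshaping, and slicing and dicing with ease.\n",
"3. **Efficient data alignment**: during operations, Pandas automatically aligns data by label (index and column names).\n",
"4. **Integrated I/O tools**: read and write data in many formats (CSV, Excel, SQL databases, JSON, and more).\n",
"5. **Time series support**: strong built-in capabilities for time series data.\n",
"6. **Tight integration with libraries such as NumPy and Matplotlib**.\n",
"\n",
"**This tutorial covers the core concepts and common operations of Pandas:**\n",
"\n",
"1. Creating `Series` and `DataFrame`\n",
"2. Viewing data and basic inspection\n",
"3. Data selection and indexing (`loc`, `iloc`)\n",
"4. Data cleaning (missing values, duplicates)\n",
"5. Transforming data and applying functions\n",
"6. Merging and joining data\n",
"7. Grouping and aggregation (`groupby`)\n",
"8. Time series basics\n",
"9. Reading and writing data files (with a focus on CSV)"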
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setup: Importing Pandas and NumPy\n",
"\n",
"By convention, Pandas is imported under the alias `pd`, and NumPy is usually imported alongside it as `np`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"\n",
"print(f\"Pandas version: {pd.__version__}\")\n",
"print(f\"NumPy version: {np.__version__}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Creating `Series` and `DataFrame`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# --- Create a Series (one-dimensional labeled array) ---\n",
"print(\"--- Creating Series ---\")\n",
"# Built from a list; Pandas creates a default integer index automatically\n",
"s1 = pd.Series([10, 20, 30, 40])\n",
"print(f\"Series from list:\\n{s1}\")\n",
"print(f\"Series index: {s1.index}\")\n",
"print(f\"Series values: {s1.values}\") # returns a NumPy array\n",
"\n",
"# Specify the index explicitly\n",
"s2 = pd.Series([100, 200, 300], index=['a', 'b', 'c'])\n",
"print(f\"\\nSeries with custom index:\\n{s2}\")\n",
"print(f\"Access element by label 'b': {s2['b']}\")\n",
"\n",
"# Built from a dict (the dict keys become the index)\n",
"data_dict_s = {'x': 1.1, 'y': 2.2, 'z': 3.3}\n",
"s3 = pd.Series(data_dict_s)\n",
"print(f\"\\nSeries from dict:\\n{s3}\")\n",
"\n",
"# --- Create a DataFrame (two-dimensional labeled table) ---\n",
"print(\"\\n--- Creating DataFrame ---\")\n",
"# Built from a dict (keys are column names; values are lists/Series/arrays)\n",
"data_dict_df = {\n",
"    'col1': [1, 2, 3, 4],\n",
"    'col2': ['A', 'B', 'C', 'D'],\n",
"    'col3': [1.1, 2.2, 3.3, 4.4]\n",
"}\n",
"df1 = pd.DataFrame(data_dict_df)\n",
"print(f\"DataFrame from dict of lists:\\n{df1}\")\n",
"\n",
"# The index can be specified explicitly\n",
"df2 = pd.DataFrame(data_dict_df, index=['row1', 'row2', 'row3', 'row4'])\n",
"print(f\"\\nDataFrame with custom index:\\n{df2}\")\n",
"\n",
"# Built from a NumPy array; column names and index can be specified\n",
"np_array = np.random.randint(0, 10, size=(3, 4))\n",
"df3 = pd.DataFrame(np_array, columns=['W', 'X', 'Y', 'Z'], index=['r1', 'r2', 'r3'])\n",
"print(f\"\\nDataFrame from NumPy array:\\n{df3}\")\n",
"\n",
"# Built from a list of dicts (each dict is one row)\n",
"list_of_dicts = [\n",
"    {'name': 'Alice', 'age': 30},\n",
"    {'name': 'Bob', 'age': 25, 'city': 'LA'},\n",
"    {'name': 'Charlie', 'age': 35}\n",
"]\n",
"df4 = pd.DataFrame(list_of_dicts)\n",
"print(f\"\\nDataFrame from list of dicts:\\n{df4}\") # Pandas fills the missing 'city' values with NaN"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Viewing Data and Basic Inspection\n",
"\n",
"Understanding the structure and contents of a DataFrame is essential."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Uses df4 created earlier\n",
"print(f\"DataFrame to inspect:\\n{df4}\")\n",
"\n",
"# View the first few rows (default: 5)\n",
"print(\"\\n--- df4.head(2) ---\")\n",
"print(df4.head(2))\n",
"\n",
"# View the last few rows (default: 5)\n",
"print(\"\\n--- df4.tail(1) ---\")\n",
"print(df4.tail(1))\n",
"\n",
"# Get the shape of the DataFrame (rows, columns)\n",
"print(f\"\\nShape (rows, columns): {df4.shape}\")\n",
"\n",
"# Get the index\n",
"print(f\"Index: {df4.index}\")\n",
"\n",
"# Get the column names\n",
"print(f\"Columns: {df4.columns}\")\n",
"\n",
"# Get dtype information and non-null counts (info() prints directly)\n",
"print(\"\\n--- df4.info() ---\")\n",
"df4.info()\n",
"\n",
"# Get basic descriptive statistics for numeric columns\n",
"print(\"\\n--- df4.describe() (for numerical columns) ---\")\n",
"print(df4.describe())\n",
"\n",
"# Get descriptive statistics for all columns (including object/categorical)\n",
"print(\"\\n--- df4.describe(include='all') ---\")\n",
"print(df4.describe(include='all'))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Data Selection and Indexing (`loc`, `iloc`)\n",
"\n",
"Pandas offers several ways to select data from a DataFrame:\n",
"\n",
"* **Selecting columns**: `df['column_name']` (returns a Series), `df[['col1', 'col2']]` (returns a DataFrame)\n",
"* **Label-based selection (`loc`)**: select by row labels and column labels.\n",
"    * `df.loc[row_label]`\n",
"    * `df.loc[[row1, row2]]`\n",
"    * `df.loc[row_label, column_label]`\n",
"    * `df.loc[start_row:end_row, start_col:end_col]` (note: `loc` slices include the end label)\n",
"* **Integer-position selection (`iloc`)**: select by integer position (as in NumPy).\n",
"    * `df.iloc[row_index]`\n",
"    * `df.iloc[[idx1, idx2]]`\n",
"    * `df.iloc[row_index, col_index]`\n",
"    * `df.iloc[start_row:end_row, start_col:end_col]` (note: `iloc` slices exclude the end position)\n",
"* **Conditional selection / boolean indexing**: `df[boolean_condition]`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Create a fresh DataFrame with labeled rows and columns for the selection examples\n",
"df_sel = pd.DataFrame(np.random.randn(5, 4), index='A B C D E'.split(), columns='W X Y Z'.split())\n",
"print(f\"DataFrame for selection:\\n{df_sel}\")\n",
"\n",
"# --- Selecting columns ---\n",
"print(\"\\n--- Selecting Columns ---\")\n",
"print(f\"Column 'W' (Series):\\n{df_sel['W']}\")\n",
"print(f\"\\nColumns 'W' and 'Z' (DataFrame):\\n{df_sel[['W', 'Z']]}\")\n",
"\n",
"# --- Label-based selection (loc) ---\n",
"print(\"\\n--- Selection using loc ---\")\n",
"print(f\"Row 'A':\\n{df_sel.loc['A']}\") # returns a Series\n",
"print(f\"\\nRows 'B' and 'D':\\n{df_sel.loc[['B', 'D']]}\") # returns a DataFrame\n",
"print(f\"\\nElement at row 'C', column 'Y': {df_sel.loc['C', 'Y']}\")\n",
"print(f\"\\nRows 'A' to 'C', columns 'W' to 'Y':\\n{df_sel.loc['A':'C', 'W':'Y']}\") # includes 'C' and 'Y'\n",
"\n",
"# --- Integer-position selection (iloc) ---\n",
"print(\"\\n--- Selection using iloc ---\")\n",
"print(f\"Row at index 0:\\n{df_sel.iloc[0]}\")\n",
"print(f\"\\nRows at index 1 and 3:\\n{df_sel.iloc[[1, 3]]}\")\n",
"print(f\"\\nElement at row index 2, column index 3: {df_sel.iloc[2, 3]}\")\n",
"print(f\"\\nRows 0 to 2 (exclusive), columns 1 to 3 (exclusive):\\n{df_sel.iloc[0:2, 1:3]}\")\n",
"\n",
"# --- Conditional selection / boolean indexing ---\n",
"print(\"\\n--- Conditional Selection ---\")\n",
"print(f\"Rows where column 'W' > 0:\\n{df_sel[df_sel['W'] > 0]}\")\n",
"print(f\"\\nRows where column 'Y' > 0 and 'X' < 0:\\n{df_sel[(df_sel['Y'] > 0) & (df_sel['X'] < 0)]}\") # use & (and), | (or); parenthesize each condition\n",
"\n",
"# Conditions can be combined with loc/iloc\n",
"print(f\"\\nSelecting columns 'Y', 'Z' for rows where 'W' > 0:\\n{df_sel.loc[df_sel['W'] > 0, ['Y', 'Z']]}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Data Cleaning (Missing Values, Duplicates)\n",
"\n",
"Real-world data is rarely perfect; it often contains missing values or duplicate records."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Create a DataFrame containing missing values and duplicates\n",
"data_messy = {\n",
"    'A': [1, 2, np.nan, 4, 5, 5],\n",
"    'B': [10, np.nan, 30, 40, 50, 50],\n",
"    'C': ['X', 'Y', 'Z', 'X', 'Y', 'Y']\n",
"}\n",
"df_messy = pd.DataFrame(data_messy)\n",
"print(f\"Messy DataFrame:\\n{df_messy}\")\n",
"\n",
"# --- Handling missing values (NaN) ---\n",
"print(\"\\n--- Handling Missing Values ---\")\n",
"# Check for missing values\n",
"print(f\"Check for NaN values:\\n{df_messy.isnull()}\")\n",
"print(f\"\\nSum of NaN values per column:\\n{df_messy.isnull().sum()}\")\n",
"\n",
"# Drop rows that contain NaN (dropna)\n",
"df_dropped_rows = df_messy.dropna() # by default drops any row containing NaN\n",
"print(f\"\\nDataFrame after dropping rows with NaN:\\n{df_dropped_rows}\")\n",
"\n",
"# Drop columns that contain NaN (dropna)\n",
"df_dropped_cols = df_messy.dropna(axis=1) # axis=1 operates on columns\n",
"print(f\"\\nDataFrame after dropping columns with NaN:\\n{df_dropped_cols}\")\n",
"\n",
"# Fill NaN values (fillna)\n",
"# Fill with a specific value\n",
"df_filled_value = df_messy.fillna(value=0) # fill every NaN with 0\n",
"print(f\"\\nDataFrame after filling NaN with 0:\\n{df_filled_value}\")\n",
"\n",
"# Fill NaN with the column mean (numeric columns only)\n",
"# df_messy['A'].fillna(df_messy['A'].mean(), inplace=True) # modifies in place; recent pandas versions warn about this chained form, so assign the result back instead\n",
"df_filled_mean = df_messy.copy() # Create a copy to avoid modifying df_messy directly here\n",
"df_filled_mean['A'] = df_filled_mean['A'].fillna(df_filled_mean['A'].mean())\n",
"df_filled_mean['B'] = df_filled_mean['B'].fillna(df_filled_mean['B'].mean())\n",
"print(f\"\\nDataFrame after filling NaN with column mean:\\n{df_filled_mean}\")\n",
"\n",
"# --- Handling duplicates ---\n",
"print(\"\\n--- Handling Duplicates ---\")\n",
"# Check for duplicate rows\n",
"print(f\"Check for duplicate rows:\\n{df_messy.duplicated()}\")\n",
"\n",
"# Drop duplicate rows (keeps the first occurrence by default)\n",
"df_no_duplicates = df_messy.drop_duplicates()\n",
"print(f\"\\nDataFrame after dropping duplicate rows:\\n{df_no_duplicates}\")\n",
"\n",
"# Drop duplicates based on specific columns\n",
"df_no_duplicates_c = df_messy.drop_duplicates(subset=['C'], keep='last') # keep the last occurrence\n",
"print(f\"\\nDataFrame after dropping duplicates based on column 'C' (keep last):\\n{df_no_duplicates_c}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. Transforming Data and Applying Functions\n",
"\n",
"Pandas offers several ways to transform data and apply custom functions."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_trans = pd.DataFrame({'colA': [1, 2, 3, 4], 'colB': [10, 20, 30, 40], 'colC': ['low', 'medium', 'high', 'medium']})\n",
"print(f\"DataFrame for transformation:\\n{df_trans}\")\n",
"\n",
"# --- Applying functions ---\n",
"print(\"\\n--- Applying Functions ---\")\n",
"# Apply a function to a column (usually via Series methods or apply/map)\n",
"def times_two(x):\n",
"    return x * 2\n",
"df_trans['colA_doubled'] = df_trans['colA'].apply(times_two)\n",
"print(f\"\\nAfter applying times_two to colA:\\n{df_trans}\")\n",
"\n",
"# Use a lambda function\n",
"df_trans['colB_plus_1'] = df_trans['colB'].apply(lambda x: x + 1)\n",
"print(f\"\\nAfter applying lambda x+1 to colB:\\n{df_trans}\")\n",
"\n",
"# Use map to map/replace the values in a Series\n",
"level_map = {'low': 1, 'medium': 2, 'high': 3}\n",
"df_trans['colC_mapped'] = df_trans['colC'].map(level_map)\n",
"print(f\"\\nAfter mapping colC:\\n{df_trans}\")\n",
"\n",
"# Use applymap to apply a function to every element of a DataFrame\n",
"# (pandas 2.1+ deprecates applymap in favor of DataFrame.map)\n",
"# def add_suffix(val):\n",
"#     return str(val) + \"_suffix\"\n",
"# df_applymap = df_trans[['colA', 'colB']].applymap(add_suffix)\n",
"# print(f\"\\nAfter applymap:\\n{df_applymap}\")\n",
"\n",
"# --- Data type conversion ---\n",
"print(\"\\n--- Data Type Conversion ---\")\n",
"print(f\"Original dtypes:\\n{df_trans.dtypes}\")\n",
"df_trans['colA_doubled'] = df_trans['colA_doubled'].astype(float)\n",
"print(f\"\\nDtypes after converting colA_doubled to float:\\n{df_trans.dtypes}\")\n",
"\n",
"# --- Renaming columns/index ---\n",
"print(\"\\n--- Renaming Columns/Index ---\")\n",
"df_renamed = df_trans.rename(columns={'colA': 'Alpha', 'colB': 'Beta'},\n",
"                             index={0: 'zero', 1: 'one'})\n",
"print(f\"DataFrame after renaming:\\n{df_renamed}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 6. Merging and Joining Data (`merge`, `concat`, `join`)\n",
"\n",
"Pandas offers several ways to combine different DataFrames."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],\n",
"                        'A': ['A0', 'A1', 'A2', 'A3'],\n",
"                        'B': ['B0', 'B1', 'B2', 'B3']})\n",
"\n",
"df_right = pd.DataFrame({'key': ['K0', 'K1', 'K4', 'K5'],\n",
"                         'C': ['C0', 'C1', 'C4', 'C5'],\n",
"                         'D': ['D0', 'D1', 'D4', 'D5']})\n",
"\n",
"df_upper = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],\n",
"                         'A': ['A0_up', 'A1_up', 'A2_up', 'A3_up'],\n",
"                         'B': ['B0_up', 'B1_up', 'B2_up', 'B3_up']})\n",
"\n",
"print(f\"Left DataFrame:\\n{df_left}\")\n",
"print(f\"\\nRight DataFrame:\\n{df_right}\")\n",
"print(f\"\\nUpper DataFrame:\\n{df_upper}\")\n",
"\n",
"# --- Merge (similar to SQL JOIN) ---\n",
"print(\"\\n--- Merge (on 'key') ---\")\n",
"# Inner merge (default): keep only keys present on both sides\n",
"merged_inner = pd.merge(df_left, df_right, on='key', how='inner')\n",
"print(f\"Inner Merge:\\n{merged_inner}\")\n",
"\n",
"# Left merge: keep all left keys; unmatched right columns are filled with NaN\n",
"merged_left = pd.merge(df_left, df_right, on='key', how='left')\n",
"print(f\"\\nLeft Merge:\\n{merged_left}\")\n",
"\n",
"# Right merge: keep all right keys\n",
"merged_right = pd.merge(df_left, df_right, on='key', how='right')\n",
"print(f\"\\nRight Merge:\\n{merged_right}\")\n",
"\n",
"# Outer merge: keep all keys from both sides\n",
"merged_outer = pd.merge(df_left, df_right, on='key', how='outer')\n",
"print(f\"\\nOuter Merge:\\n{merged_outer}\")\n",
"\n",
"# --- Concatenate (stacking) ---\n",
"print(\"\\n--- Concatenate (stacking rows, axis=0) ---\")\n",
"# Stack by rows (default axis=0)\n",
"concatenated_rows = pd.concat([df_left, df_upper], ignore_index=True) # ignore_index resets the index\n",
"print(f\"Concatenated Rows:\\n{concatenated_rows}\")\n",
"\n",
"# Concatenate by columns (axis=1); rows are aligned on the index\n",
"df_left_indexed = df_left.set_index('key')\n",
"df_right_indexed = df_right.set_index('key')\n",
"concatenated_cols = pd.concat([df_left_indexed, df_right_indexed], axis=1)\n",
"print(f\"\\nConcatenated Columns (axis=1 on index 'key'):\\n{concatenated_cols}\")\n",
"\n",
"# --- Join (index-based merge; a convenience wrapper around merge) ---\n",
"print(\"\\n--- Join (based on index) ---\")\n",
"# Defaults to a left join (how='left'); the suffixes only apply to overlapping column names\n",
"joined_df = df_left_indexed.join(df_right_indexed, lsuffix='_left', rsuffix='_right')\n",
"print(f\"Left Join on index:\\n{joined_df}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 7. Grouping and Aggregation (`groupby`)\n",
"\n",
"The `groupby` operation is central to data analysis. It follows the \"Split-Apply-Combine\" pattern:\n",
"1. **Split**: divide the data into groups according to some criterion.\n",
"2. **Apply**: apply a function to each group independently (for example, a sum or a mean).\n",
"3. **Combine**: combine the results into a new data structure."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data_group = {'Company': ['GOOG', 'GOOG', 'MSFT', 'MSFT', 'FB', 'FB'],\n",
"              'Person': ['Sam', 'Charlie', 'Amy', 'Vanessa', 'Carl', 'Sarah'],\n",
"              'Sales': [200, 120, 340, 124, 243, 350]}\n",
"df_group = pd.DataFrame(data_group)\n",
"print(f\"DataFrame for grouping:\\n{df_group}\")\n",
"\n",
"# Group by 'Company'\n",
"grouped_by_company = df_group.groupby('Company')\n",
"print(f\"\\nType of grouped object: {type(grouped_by_company)}\")\n",
"\n",
"# Apply aggregation functions to the groups\n",
"print(f\"\\nMean sales by company:\\n{grouped_by_company['Sales'].mean()}\") # or grouped_by_company.mean(numeric_only=True)\n",
"print(f\"\\nSum of sales by company:\\n{grouped_by_company['Sales'].sum()}\")\n",
"print(f\"\\nCount of entries by company:\\n{grouped_by_company.count()}\") # count() gives non-null counts for every column\n",
"print(f\"\\nSize of groups by company:\\n{grouped_by_company.size()}\") # size() returns the number of rows in each group\n",
"\n",
"# Use agg to compute several aggregations at once\n",
"agg_results = grouped_by_company['Sales'].agg(['sum', 'mean', 'std', 'count'])\n",
"print(f\"\\nMultiple aggregations on 'Sales':\\n{agg_results}\")\n",
"\n",
"# Different aggregation functions can be applied to different columns\n",
"agg_dict = {'Sales': ['sum', 'mean'], 'Person': 'count'}\n",
"agg_multi_col = grouped_by_company.agg(agg_dict)\n",
"print(f\"\\nAggregating different columns differently:\\n{agg_multi_col}\")\n",
"\n",
"# The grouped object can also be iterated directly\n",
"print(\"\\nIterating through groups:\")\n",
"for name, group_df in grouped_by_company:\n",
"    print(f\"\\nGroup Name (Company): {name}\")\n",
"    print(group_df)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 8. Time Series Basics\n",
"\n",
"Pandas is very capable when it comes to time series data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Create a date-range index\n",
"date_index = pd.date_range('2023-10-01', periods=6, freq='D') # 6 days\n",
"print(f\"Date Range Index (Daily):\\n{date_index}\")\n",
"\n",
"date_index_monthly = pd.date_range('2023-01-01', periods=4, freq='M') # month-end frequency; pandas 2.2+ prefers the alias 'ME'\n",
"print(f\"\\nDate Range Index (Monthly End):\\n{date_index_monthly}\")\n",
"\n",
"# Create a Series or DataFrame with a time series index\n",
"ts_data = np.random.randn(6)\n",
"ts_series = pd.Series(ts_data, index=date_index)\n",
"print(f\"\\nTime Series:\\n{ts_series}\")\n",
"\n",
"# A time series index supports more flexible slicing\n",
"print(f\"\\nSelecting data for '2023-10-03':\\n{ts_series['2023-10-03']}\") # a single day\n",
"print(f\"\\nSelecting data from '2023-10-02' to '2023-10-04':\\n{ts_series['2023-10-02':'2023-10-04']}\")\n",
"print(f\"\\nSelecting data for year 2023:\\n{ts_series['2023']}\")\n",
"\n",
"# Resampling: change the time frequency (for example, daily -> monthly)\n",
"# There is too little data here, so this is for illustration only\n",
"# monthly_mean = ts_series.resample('M').mean() # resample by month and take the mean ('ME' in pandas 2.2+)\n",
"# print(f\"\\nMonthly Mean Resampling:\\n{monthly_mean}\")\n",
"\n",
"# Rolling-window computation\n",
"rolling_mean_2d = ts_series.rolling(window=2).mean() # moving average over a 2-day window\n",
"print(f\"\\nRolling Mean (window=2):\\n{rolling_mean_2d}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 9. Reading and Writing Data Files (with a Focus on CSV)\n",
"\n",
"Pandas can read and write many file formats with little effort."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"\n",
"# Create an example DataFrame\n",
"df_io = pd.DataFrame({'col1': [1, 2, 3], 'col2': ['a', 'b', 'c'], 'col3': [True, False, True]})\n",
"csv_io_path = \"pandas_example.csv\"\n",
"excel_io_path = \"pandas_example.xlsx\"\n",
"\n",
"print(f\"DataFrame to write:\\n{df_io}\")\n",
"\n",
"# --- Write CSV ---\n",
"df_io.to_csv(csv_io_path, index=False) # index=False keeps the DataFrame index out of the file\n",
"print(f\"\\nDataFrame written to {csv_io_path}\")\n",
"\n",
"# --- Read CSV ---\n",
"df_read_csv = pd.read_csv(csv_io_path)\n",
"print(f\"\\nDataFrame read from {csv_io_path}:\\n{df_read_csv}\")\n",
"\n",
"# --- Write Excel (requires openpyxl or xlsxwriter) ---\n",
"# !pip install openpyxl\n",
"try:\n",
"    df_io.to_excel(excel_io_path, sheet_name='Sheet1', index=False)\n",
"    print(f\"\\nDataFrame written to {excel_io_path}\")\n",
"\n",
"    # --- Read Excel ---\n",
"    df_read_excel = pd.read_excel(excel_io_path, sheet_name='Sheet1')\n",
"    print(f\"\\nDataFrame read from {excel_io_path}:\\n{df_read_excel}\")\n",
"except ImportError:\n",
"    print(\"\\nSkipping Excel I/O test. Need 'openpyxl' library installed.\")\n",
"except Exception as e:\n",
"    print(f\"\\nError during Excel I/O: {e}\")\n",
"finally:\n",
"    # Clean up the files written above\n",
"    if os.path.exists(csv_io_path):\n",
"        os.remove(csv_io_path)\n",
"    if os.path.exists(excel_io_path):\n",
"        os.remove(excel_io_path)"
]
},
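{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Summary\n",
"\n",
"Pandas is the Swiss Army knife of the Python data science toolbox. It provides powerful data structures (`Series`, `DataFrame`) and a rich feature set for cleaning, transforming, and analyzing data and preparing it for visualization.\n",
"\n",
"**Key takeaways:**\n",
"* Be fluent with `Series` and `DataFrame`.\n",
"* Master `loc` and `iloc` for precise data selection.\n",
"* Understand how to handle missing values and duplicates.\n",
"* Use `apply`, `map`, and related methods to transform data.\n",
"* Use `merge`, `concat`, and `join` to combine data.\n",
"* Use `groupby` for grouped aggregation.\n",
"* Know the basic time series operations.\n",
"* Be able to read and write common data formats (CSV in particular).\n",
"\n",
"Pandas has a very large feature set, and this tutorial covers only the basics. The best way to go deeper is to work with real datasets and consult the thorough official documentation."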
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
},
"orig_nbformat": 4
},
"nbformat": 4,
"nbformat_minor": 5
}

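Some Reflections on Python as a Dynamic Language

As is well known, Python is a fully dynamic language, which shows in:

  1. Dynamic type binding
  2. Runtime checking
  3. Object structure can be modified dynamically (not just values)
  4. Reflection
  5. Everything is an object (instance, class, method)
  6. Code can be executed dynamically (eval, exec)
  7. Duck typing support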
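
Several of the traits listed above can be seen in a few lines. This is only a minimal sketch: the `Point` class, the `norm1` method, and the `total_length` helper are hypothetical names made up for illustration.

```python
class Point:
    """A plain class that declares exactly two attributes."""

    def __init__(self, x, y):
        self.x = x
        self.y = y


# Dynamic type binding: the same name is rebound to objects of different types.
value = 42
value = "forty-two"

# Object structure (not just values) can change at runtime.
p = Point(1, 2)
p.z = 3      # add an attribute that Point never declared
del p.x      # remove one that it did declare
still_point = isinstance(p, Point)  # the class link is unaffected

# Reflection: inspect and read attributes by name.
attrs = sorted(vars(p))
y_value = getattr(p, "y")

# Classes are objects too: attach a new method to the class after the fact.
Point.norm1 = lambda self: abs(self.y) + abs(self.z)
norm = p.norm1()

# Dynamic code execution: evaluate an expression supplied as a string.
result = eval("p.y * p.z", {"p": p})


# Duck typing: anything that supports len() is accepted.
def total_length(items):
    return sum(len(item) for item in items)


lengths = total_length(["ab", (1, 2, 3), {"k": 1}])
```

Note that `p` ends up with a different attribute set from any freshly constructed `Point`, yet `isinstance` still reports it as one.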

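A dynamic language imposes fewer constraints and is easier for newcomers to pick up, but the price is high runtime overhead, and execution is so decoupled from the underlying machine code that it is hard to know how the code actually runs.

There are also several points that I consider more serious flaws. They are laid out below.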

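It breaks the semantics of OOP

Most popular programming languages support the OOP paradigm, that is, inheritance and polymorphism. Python likewise can be purely imperative (Imperative Programming) for simple tasks, or use full object-oriented programming for complex ones.

However, its dynamic features undermine the structure of OOP:

  1. Blurred types: an instance of any type can have attributes or methods added or removed at runtime (in a static language, by contrast, only their values can change at runtime). An instance modified this way arguably no longer belongs to its original type, since it now clearly differs from it. Yet the instance's built-in __class__ attribute still points to the original type, which muddies what the type means. Conforming to a class should not be merely nominal; the content should conform as well.
  2. Broken inheritance, in the following two respects:
    1. Most code in practice has no abstract-interface inheritance. The abc module provides ABC, the base class for abstract interfaces; the classic approach is to derive your own abstract class from ABC, then derive concrete classes from that abstract class and implement its abstract methods. But PEP 544 introduced typing.Protocol, which is often treated as the more Pythonic alternative to ABC: a concrete class inherits from no abstract class at all, and as long as it implements the required methods, a static checker considers it to conform to the Protocol.
    2. No inheritance from a concrete parent class is needed. As in the previous point, a class with no parent (other than object) can still define methods with the same names and so expose the same call interface as the parent's methods. Semantically, the class definition then shows no relationship to any other class. The result can be a completely loose organization in which no two classes are related by inheritance.
  3. Broken polymorphism: no parameter or return value is inherently restricted in type. Passing a subtype where a parent type is required therefore loses its meaning, again because any type can be modified dynamically to meet the requirement.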
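
The contrast between ABC-based and Protocol-based interfaces can be sketched as follows. All class and function names here (`Shape`, `Square`, `HasArea`, `Circle`, `Blob`, `total_area`) are hypothetical, chosen only for illustration; `runtime_checkable` is added so the structural check can also be observed with `isinstance`.

```python
from abc import ABC, abstractmethod
from typing import Iterable, Protocol, runtime_checkable


class Shape(ABC):
    """Nominal interface: concrete classes must inherit from it."""

    @abstractmethod
    def area(self) -> float: ...


class Square(Shape):
    def __init__(self, side: float) -> None:
        self.side = side

    def area(self) -> float:
        return self.side ** 2


@runtime_checkable
class HasArea(Protocol):
    """Structural interface: any object with an area() method conforms."""

    def area(self) -> float: ...


class Circle:  # no base class other than object
    def __init__(self, radius: float) -> None:
        self.radius = radius

    def area(self) -> float:
        return 3.14159 * self.radius ** 2


def total_area(shapes: Iterable[HasArea]) -> float:
    # Nothing is enforced at runtime; only area() has to exist.
    return sum(s.area() for s in shapes)


total = total_area([Square(2.0), Circle(1.0)])

# Even an unrelated object patched at runtime is accepted by the call.
class Blob:
    pass


blob = Blob()
blob.area = lambda: 1.0
patched_total = total_area([blob])
```

`Circle` satisfies `HasArea` without being related to `Shape` at all, and `blob` works only because an attribute was attached after construction, which is the looseness the list above describes.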

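It breaks design patterns

Classic patterns such as factory, abstract factory, and visitor rely heavily on inheritance and polymorphism. In Python's design, however, the dynamic capabilities make design patterns largely nominal. Among commonly used libraries, transformers does use design patterns: its from_pretrained family is a factory pattern that picks a concrete constructor from a string name and returns a concrete subclass, while the factory's declared output type is a base class shared by all models.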
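
A string-keyed factory of this kind can be sketched in a few lines. This is not the transformers implementation or API; `BaseModel`, `TinyBert`, `TinyGpt`, `_REGISTRY`, and `from_name` are hypothetical names that only illustrate the pattern of mapping a name to a concrete subclass.

```python
from typing import Dict, Type


class BaseModel:
    """Hypothetical common base class that the factory promises to return."""

    name = "base"

    def describe(self) -> str:
        return f"{type(self).__name__}({self.name})"


class TinyBert(BaseModel):
    name = "tiny-bert"


class TinyGpt(BaseModel):
    name = "tiny-gpt"


# Registry mapping a string name to the concrete constructor.
_REGISTRY: Dict[str, Type[BaseModel]] = {
    cls.name: cls for cls in (TinyBert, TinyGpt)
}


def from_name(name: str) -> BaseModel:
    """Factory: declared to return the base type, actually returns a subclass."""
    try:
        return _REGISTRY[name]()
    except KeyError:
        raise ValueError(f"unknown model name: {name!r}") from None


model = from_name("tiny-gpt")
description = model.describe()
```

The caller only sees `BaseModel` in the signature, while the object it receives is a `TinyGpt`; the inheritance relationship is what makes that return annotation honest.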

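Security issues

Python code generally does not manage pointers directly, so out-of-bounds pointers, wild pointers, and dangling pointers are mostly non-issues. The gc mechanism also handles garbage collection automatically, so these safety concerns need no attention while coding. In exchange, Python has security issues of its own. Attacking unmanaged code used to be difficult: injected code had to avoid corrupting the existing structure, or the program would simply crash (segmentation fault) before it could run reliably. In Python, by contrast, arbitrary code can be injected to change the original logic, and because the logic is not fixed content in a code segment, the attacker needs no such care. The contents of globals() and locals() can be modified by hand at runtime, which also carries some risk. Another danger is type mismatch during execution: because types are only determined at runtime, nothing can be guaranteed in advance, so a type error exception may be raised and crash the program.

Summary

My background is in C++, but in recent years I have been programming in Python. Python's market share has been number one for years, by a wide margin, and that is inseparable from its flexibility. For a programming language aimed at the general public, such traits are necessary. Despite the many ways listed above in which Python is not rigorous, programmers can still choose a rigorous object-oriented style. So the quality of a program depends not on the language but on the programmer, who is responsible for writing maintainable, clear, and well-structured code~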
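
Both risks can be illustrated with a small sketch. The `Auth` class and the `add_tax` function are hypothetical; rebinding a method on a class is used here as a stand-in for the same kind of runtime rebinding that editing globals() allows.

```python
class Auth:
    """Hypothetical access check used only for this illustration."""

    def is_admin(self, user: str) -> bool:
        return user == "root"


auth = Auth()
before = auth.is_admin("guest")  # original logic

# Any code running in the same interpreter can rebind the method.
# Existing instances pick up the new behavior immediately.
Auth.is_admin = lambda self, user: True
after = auth.is_admin("guest")


def add_tax(price: float) -> float:
    # The annotation is not enforced; a bad argument fails only when this runs.
    return price * 1.2


try:
    add_tax(None)
    type_error_raised = False
except TypeError:
    type_error_raised = True
```

Nothing in the interpreter objects to either step ahead of time: the replaced check silently changes behavior, and the type mismatch surfaces only as an exception at the moment of the call.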


KuRRe8 commented May 8, 2025

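Back to top

Insights, questions, or just idle chatter are all welcome here!

Because there are many documents, an ipynb sometimes fails to render; that is a browser performance issue, and refreshing fixes it.

Or git clone it locally to read.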
