@KuRRe8
Last active June 6, 2025 17:35
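A collection of Python tutorials, split into separate files by category

Python Tutorials

Python is a beginner-friendly language, and the machine learning community now relies heavily on Python alongside C++, CUDA C, and R, which has kept Python at the top of popularity rankings. This Gist offers Python tutorials that can be run directly in Jupyter Notebook.

  1. Language-level tutorials, which generally skip beginner topics;
  2. Standard library tutorials, covering basic usage of the most common standard library modules;
  3. Third-party library tutorials, mainly common libraries such as numpy and pytorch, covering basic usage only and not new features.

Nothing else goes into this Gist. Note that a Gist is still version-controlled by git, so you can git clone it locally, or open the relevant ipynb files directly in Google Colab or Kaggle.

When browsing on the web there is no file list, so press Ctrl + F to search for the relevant section, or click the hyperlinks below.

If you want to contribute, leave a note in the comments; if you have questions, ask them there too ^.^

Contents - Language

Contents - Libraries

Contents - Domain-specific libraries (this tutorial focuses mostly on machine learning and deep learning)

Contents - Appendix

  • sigh.md: personal thoughts on Python as a dynamic language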

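Some Reflections on Python as a Dynamic Language

As is well known, Python is a thoroughly dynamic language. This shows up in the following ways: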

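  1. Dynamic type binding
  2. Type checks happen at runtime
  3. The structure of an object, not just its values, can be modified dynamically
  4. Reflection
  5. Everything is an object (instances, classes, methods)
  6. Code can be executed dynamically (eval, exec)
  7. Duck typing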
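
Several of these features fit in a few lines. The following is a minimal sketch; the `Point` class and the `norm1` method are hypothetical names introduced only for illustration.

```python
class Point:
    """A tiny class used only for illustration (hypothetical)."""

    def __init__(self, x, y):
        self.x = x
        self.y = y


# Dynamic type binding: the same name can refer to values of different types.
value = 42
value = "forty-two"

# Object structure can change at runtime.
p = Point(1, 2)
p.z = 3                                                # new attribute on this instance only
Point.norm1 = lambda self: abs(self.x) + abs(self.y)   # new method on the class

# Reflection: look attributes up by name.
has_z = hasattr(p, "z")
norm = getattr(p, "norm1")()

# Everything is an object: a class is a value that can be stored and called.
make = Point
q = make(3, 4)

# Dynamic code execution (never do this with untrusted input).
product = eval("q.x * q.y", {"q": q})
```

Here `has_z` is `True`, `norm` is 3, and `product` is 12. None of these steps needs a declaration or a recompile, which is both the convenience and the risk discussed below.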

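A dynamic language imposes fewer constraints and is easier for newcomers to pick up. The price is high runtime overhead, and code so decoupled from the underlying machine-level execution that it is hard to tell how it actually runs.

There are also a few points that I consider more serious flaws. They are laid out below.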

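It undermines OOP semantics

Most popular programming languages support the OOP paradigm, that is, inheritance and polymorphism. Likewise, Python can be written in a purely imperative style for simple tasks, or in a full object-oriented style for complex ones.

However, its dynamic features undermine the structure of OOP: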

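  1. Blurred types: an instance of an ordinary class (one that does not define `__slots__`) can have attributes or methods added or removed at runtime, whereas a statically typed language only lets you change their values. After such a change the instance arguably no longer belongs to its original type, since it now differs visibly from it. Yet its built-in `__class__` attribute still points to the original type, which confuses what the type means. Conforming to a class should be a matter of content, not just of name.
  2. Undermined inheritance, in two respects:
    1. Most code in practice does not inherit from abstract interfaces. The abc module provides ABC, a base class for abstract interfaces. The classic approach is for your abstract class to inherit from ABC, and for concrete classes to inherit from that abstract class and implement its abstract methods. PEP 544, however, adds typing.Protocol as a structural alternative: the concrete class inherits from no abstract class at all, and as long as it implements the required methods a static checker treats it as conforming to the Protocol.
    2. There is no need to inherit from a concrete parent class. As in the previous point, a class with no parent other than object can still define methods with the same names and expose the same call interface as the would-be parent. Semantically, the class definition then shows no relationship to any other class. A code base can be a loose collection in which no two classes are related by inheritance.
  3. Undermined polymorphism: parameters and return values carry no type restriction by default. Passing a subtype where a parent type is expected then seems pointless, again because any object can be modified dynamically to meet the requirement.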
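
The three points above can be sketched in one short example. All class and function names here (`Shape`, `Square`, `HasArea`, `Circle`, `total_area`) are hypothetical and chosen only for illustration. `runtime_checkable` is used so the structural check is also visible at runtime; a static checker would not need it.

```python
from abc import ABC, abstractmethod
from typing import Iterable, Protocol, runtime_checkable


class Shape(ABC):
    """Nominal interface: subclasses must inherit and implement area()."""

    @abstractmethod
    def area(self) -> float: ...


class Square(Shape):
    def __init__(self, side: float) -> None:
        self.side = side

    def area(self) -> float:
        return self.side ** 2


@runtime_checkable
class HasArea(Protocol):
    """Structural interface: any object with an area() method conforms."""

    def area(self) -> float: ...


class Circle:  # inherits from nothing but object
    def __init__(self, r: float) -> None:
        self.r = r

    def area(self) -> float:
        return 3.14159 * self.r ** 2


def total_area(shapes: Iterable[HasArea]) -> float:
    # Duck typing: only the presence of area() matters, not the hierarchy.
    return sum(s.area() for s in shapes)


# Blurred types: the instance is reshaped, yet its class is unchanged.
sq = Square(2.0)
del sq.side              # remove an attribute from this one instance
sq.label = "mutated"     # add an unrelated attribute
still_square = type(sq) is Square and isinstance(sq, Shape)
try:
    sq.area()
    broken = False
except AttributeError:   # area() needs the attribute that was deleted
    broken = True

# No inheritance needed: Circle conforms structurally, not nominally.
circle = Circle(1.0)
conforms = isinstance(circle, HasArea)
not_a_shape = not isinstance(circle, Shape)
```

`still_square`, `broken`, `conforms`, and `not_a_shape` all end up `True`: the mutated object is still nominally a `Square` although it can no longer compute its area, while `Circle` is accepted by `total_area` without being related to `Shape`.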

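It undermines design patterns

Classic patterns such as Factory, Abstract Factory, and Visitor rely heavily on inheritance and polymorphism. In Python, however, the dynamic features leave much of that machinery existing in name only. One widely used library that does apply design patterns is transformers: its from_pretrained family is a factory pattern that picks the concrete constructor from a string name and returns a concrete subclass, while the factory's declared return type is a base class shared by all models.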
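
A string-keyed factory of this kind can be sketched as follows. This is not how transformers implements from_pretrained; it is a minimal illustration of the general pattern, and every name in it (`BaseModel`, `register`, `create_model`, the two model classes) is hypothetical.

```python
from typing import Callable, Dict, Type


class BaseModel:
    """Common base type returned by the factory (hypothetical)."""

    def describe(self) -> str:
        return type(self).__name__


_REGISTRY: Dict[str, Type[BaseModel]] = {}


def register(name: str) -> Callable[[Type[BaseModel]], Type[BaseModel]]:
    """Class decorator that maps a string name to a concrete class."""

    def decorator(cls: Type[BaseModel]) -> Type[BaseModel]:
        _REGISTRY[name] = cls
        return cls

    return decorator


@register("tiny-encoder")
class TinyEncoder(BaseModel):
    pass


@register("tiny-decoder")
class TinyDecoder(BaseModel):
    pass


def create_model(name: str) -> BaseModel:
    """Factory: choose the concrete constructor from a string name."""
    try:
        cls = _REGISTRY[name]
    except KeyError:
        raise ValueError(f"unknown model name: {name!r}") from None
    return cls()


model = create_model("tiny-decoder")
print(model.describe())  # prints "TinyDecoder"
```

The caller only sees the `BaseModel` return type; which subclass it gets is decided at runtime by the string. Because classes are ordinary objects in Python, the registry is just a dict, with no separate factory hierarchy.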

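Security issues

Python code generally does not manage pointers directly, so out-of-bounds pointers, wild pointers, and dangling pointers are mostly a non-issue. The garbage collector also reclaims memory automatically, so you rarely need to think about this class of safety problem while coding. Python has security problems of its own, though. Attacking unmanaged (native) code has traditionally been hard: for injected code to run reliably, it must avoid corrupting the existing structure and crashing the program with a segmentation fault. In Python, once an attacker can get text evaluated, arbitrary code can be injected to change the original logic, and since that logic is not fixed content in a code segment, the attack needs no such extra care. The dict returned by globals() can be modified by hand at runtime (changes made through locals() inside a function are not reliably reflected), which carries some risk as well. Another hazard is code that fails on a type mismatch: types are only determined at runtime, so nothing can be guaranteed in advance, and a TypeError may be raised and crash the program.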
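
The following sketch illustrates these three risks under simplified assumptions. The `Checkout` class, the `namespace` dict, and the `untrusted` string are hypothetical stand-ins; the explicit dict passed to `exec` plays the role that `globals()` plays for ordinary module code.

```python
class Checkout:
    """Hypothetical business logic used only for illustration."""

    def total(self, price):
        return price + price * 0.1   # price plus 10% tax


# 1. Dynamic execution: text evaluated at runtime can rewrite program state.
namespace = {"balance": 100}
untrusted = "balance = 10 ** 6"      # stand-in for attacker-controlled input
exec(untrusted, namespace)           # the dict acts as the globals for that code
tampered_balance = namespace["balance"]

# 2. Monkey patching: replacing a method changes behavior for every caller.
original_total = Checkout.total
Checkout.total = lambda self, price: 0.0
patched_total = Checkout().total(100)
Checkout.total = original_total      # restore the original method

# 3. Type mismatch: nothing stops a str from arriving where a number is expected.
try:
    Checkout().total("100")
    type_error = None
except TypeError as exc:             # only detected when the line actually runs
    type_error = exc
```

After running this, `tampered_balance` is 1000000, `patched_total` is 0.0, and `type_error` holds the `TypeError` raised by multiplying a string by a float. Static type checkers and avoiding `eval`/`exec` on external input reduce these risks but do not remove the underlying dynamism.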

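Summary

My background is in C++, but in recent years I have been programming in Python. Python has also held first place in popularity rankings for years, by a wide margin, and that owes a great deal to its flexibility. For a language aimed at a broad audience, such features are necessary. Despite all the ways listed above in which Python is not rigorous, a programmer can still choose to write rigorous object-oriented code. So the quality of a program depends not on the language but on the programmer, who is responsible for writing maintainable, clear, and well-structured code~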

{
"cells": [
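{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Transformers (Hugging Face) Tutorial - The Pretrained Model Hub\n",
"\n",
"Welcome to the Hugging Face Transformers tutorial! The `transformers` library has become the de facto standard in modern natural language processing (NLP). It provides tens of thousands of pretrained models (especially Transformer-based models such as BERT, GPT, and T5), along with convenient tools to download, load, and use them for inference and fine-tuning.\n",
"\n",
"**Why does Hugging Face Transformers matter so much?**\n",
"\n",
"1. **A huge model hub (Hugging Face Hub)**: easy access to a large number of SOTA (state-of-the-art) models covering NLP, computer vision, audio, and more.\n",
"2. **Easy-to-use API**: a high-level `pipeline` API for quick inference, plus unified classes such as `AutoModel` and `AutoTokenizer` for loading models and tokenizers.\n",
"3. **Framework compatibility**: supports PyTorch, TensorFlow, and JAX.\n",
"4. **Standardization and reproducibility**: encourages model sharing and reproducible research.\n",
"5. **A strong community**: an active community contributes many models, datasets, and tutorials.\n",
"6. **A transfer learning workhorse**: makes it easy to fine-tune large pretrained models on downstream tasks.\n",
"\n",
"**This tutorial covers the core concepts and usage of the Transformers library:**\n",
"\n",
"1. Installation and setup\n",
"2. Core concepts: Pipelines, Tokenizers, Models\n",
"3. Quick inference with the `pipeline` API (zero-shot / few-shot)\n",
"4. Loading pretrained models and tokenizers (`AutoModel`, `AutoTokenizer`)\n",
"5. Tokenizing and encoding text\n",
"6. Running inference with a loaded model (getting logits or hidden states)\n",
"7. (Overview) The fine-tuning workflow\n",
"8. A brief introduction to the Hugging Face Hub"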
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. Installation and Setup\n",
"\n",
"You need to install the `transformers` library. Installing PyTorch or TensorFlow (or both) alongside it is usually recommended.\n",
"\n",
"```bash\n",
"pip install transformers\n",
"\n",
"# Install PyTorch (get the command for your system and CUDA version from pytorch.org)\n",
"# pip install torch torchvision torchaudio\n",
"\n",
"# Or install TensorFlow (get the command for your system and CUDA version from tensorflow.org)\n",
"# pip install tensorflow\n",
"```\n",
"Some tasks or models need extra dependencies, for example `sentencepiece` (used by some tokenizers) or `datasets` (for loading and processing datasets).\n",
"```bash\n",
"pip install sentencepiece datasets\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Import the required libraries\n",
"import transformers\n",
"from transformers import pipeline  # high-level API\n",
"from transformers import AutoTokenizer, AutoModelForSequenceClassification, AutoModel  # lower-level API\n",
"import torch  # the examples mainly use PyTorch\n",
"\n",
"print(f\"Transformers version: {transformers.__version__}\")\n",
"print(f\"PyTorch version: {torch.__version__}\")"
]
},
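{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Core Concepts: Pipelines, Tokenizers, Models\n",
"\n",
"* **Pipeline**: the simplest interface. It bundles preprocessing (such as tokenization), model inference, and postprocessing, so it can take raw input (such as text) and return results that are easy to read. It is well suited to quick applications and prototyping.\n",
"* **Tokenizer**: converts raw text into the numeric inputs a model understands (usually token IDs, an attention mask, and so on). Every pretrained model has its own tokenizer, and the two must be used as a matching pair.\n",
"* **Model**: the architecture and weights of a pretrained model. `transformers` provides model classes for different tasks (for example `AutoModelForSequenceClassification` for sequence classification, `AutoModelForQuestionAnswering` for question answering, and `AutoModel` for the outputs of the base Transformer)."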
]
},
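{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. Quick Inference with the `pipeline` API\n",
"\n",
"The `pipeline()` function automatically downloads and caches the model and tokenizer it needs, so you can run a wide range of tasks with very little code."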
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(\"--- Using pipeline API ---\")\n",
"\n",
"# --- Task 1: Sentiment Analysis ---\n",
"print(\"\\n--- Sentiment Analysis ---\")\n",
"try:\n",
"    # The model is downloaded on first run (the default is usually distilbert-base-uncased-finetuned-sst-2-english)\n",
"    sentiment_pipeline = pipeline(\"sentiment-analysis\")\n",
"\n",
"    text1 = \"This movie was absolutely fantastic! Highly recommended.\"\n",
"    text2 = \"The plot was predictable and the acting was mediocre.\"\n",
"\n",
"    results = sentiment_pipeline([text1, text2])\n",
"    for i, result in enumerate(results):\n",
"        print(f\"Text {i+1}: Label='{result['label']}', Score={result['score']:.4f}\")\n",
"\n",
"except Exception as e:\n",
"    print(f\"Error running sentiment analysis pipeline (maybe model download failed?): {e}\")\n",
"\n",
"# --- Task 2: Text Generation ---\n",
"print(\"\\n--- Text Generation ---\")\n",
"try:\n",
"    # Use a relatively small GPT-2 model\n",
"    generator = pipeline('text-generation', model='gpt2')\n",
"    prompt = \"In a hole in the ground there lived a\"\n",
"    generated_texts = generator(prompt, max_length=30, num_return_sequences=2)\n",
"\n",
"    print(f\"Prompt: '{prompt}'\")\n",
"    for i, text in enumerate(generated_texts):\n",
"        print(f\"Generated sequence {i+1}: {text['generated_text']}\")\n",
"\n",
"except Exception as e:\n",
"    print(f\"Error running text generation pipeline: {e}\")\n",
"\n",
"# --- Task 3: Zero-Shot Classification ---\n",
"print(\"\\n--- Zero-Shot Classification ---\")\n",
"try:\n",
"    # Classifies text without any fine-tuning on the candidate labels\n",
"    zero_shot_classifier = pipeline(\"zero-shot-classification\")\n",
"    sequence_to_classify = \"Who are you voting for in 2024?\"\n",
"    candidate_labels = ['politics', 'economy', 'entertainment', 'environment']\n",
"\n",
"    result = zero_shot_classifier(sequence_to_classify, candidate_labels)\n",
"    scores = [f'{s:.3f}' for s in result['scores']]\n",
"    print(f\"Sequence: '{sequence_to_classify}'\")\n",
"    print(f\"Predicted labels and scores: {result['labels']} - {scores}\")\n",
"\n",
"except Exception as e:\n",
"    print(f\"Error running zero-shot classification pipeline: {e}\")\n",
"\n",
"# --- Task 4: Fill-Mask ---\n",
"print(\"\\n--- Fill-Mask ---\")\n",
"try:\n",
"    unmasker = pipeline('fill-mask')  # uses a masked language model (BERT/RoBERTa family)\n",
"    # The mask token depends on the checkpoint (\"<mask>\" for RoBERTa-style models, \"[MASK]\" for BERT), so read it from the tokenizer\n",
"    masked_text = f\"Paris is the {unmasker.tokenizer.mask_token} city of France.\"\n",
"    results = unmasker(masked_text, top_k=3)  # get the 3 most likely fillers\n",
"\n",
"    print(f\"Masked text: '{masked_text}'\")\n",
"    for result in results:\n",
"        print(f\"  Prediction: '{result['token_str']}' (Score: {result['score']:.4f}, Sequence: {result['sequence']})\")\n",
"\n",
"except Exception as e:\n",
"    print(f\"Error running fill-mask pipeline: {e}\")\n",
"\n",
"# Many other tasks exist: 'ner' (named entity recognition), 'question-answering', 'summarization', 'translation_xx_to_yy', etc."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Loading Pretrained Models and Tokenizers (`AutoModel`, `AutoTokenizer`)\n",
"\n",
"The `AutoClasses` (such as `AutoTokenizer`, `AutoModel`, and `AutoModelForSequenceClassification`) are very handy factory classes. You only supply a model identifier (usually a model name on the Hugging Face Hub, such as `bert-base-uncased` or `distilbert-base-uncased`), and they infer the architecture and load the matching tokenizer and model class."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(\"--- Loading Tokenizer and Model using AutoClasses ---\")\n",
"\n",
"# Choose a model identifier (checkpoint)\n",
"model_checkpoint = \"distilbert-base-uncased-finetuned-sst-2-english\"  # same as the default sentiment analysis model\n",
"# model_checkpoint = \"bert-base-uncased\"\n",
"\n",
"try:\n",
"    # 1. Load the tokenizer\n",
"    tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)\n",
"    print(f\"Tokenizer loaded for '{model_checkpoint}'\")\n",
"    print(f\"Tokenizer class: {type(tokenizer)}\")\n",
"\n",
"    # 2. Load the model\n",
"    # Pick the model class that fits the task, here sequence classification\n",
"    model_for_classification = AutoModelForSequenceClassification.from_pretrained(model_checkpoint)\n",
"    print(f\"\\nModel loaded for sequence classification: '{model_checkpoint}'\")\n",
"    print(f\"Model class: {type(model_for_classification)}\")\n",
"\n",
"    # If you only want the hidden states (embeddings), use AutoModel\n",
"    # base_model = AutoModel.from_pretrained(model_checkpoint)\n",
"    # print(f\"\\nBase model loaded: '{model_checkpoint}'\")\n",
"    # print(f\"Base model class: {type(base_model)}\")\n",
"\n",
"except OSError as e:\n",
"    print(f\"\\nError loading model/tokenizer '{model_checkpoint}'. Check model name or internet connection.\")\n",
"    print(f\"Error details: {e}\")\n",
"    tokenizer = None\n",
"    model_for_classification = None\n",
"except Exception as e:\n",
"    print(f\"\\nAn unexpected error occurred: {e}\")\n",
"    tokenizer = None\n",
"    model_for_classification = None"
]
},
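{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. Tokenizing and Encoding Text\n",
"\n",
"The tokenizer converts text into a format the model can process.\n",
"\n",
"* **Tokenization**: split the text into subword units (tokens).\n",
"* **Conversion to IDs**: map each token to its unique integer ID in the model vocabulary.\n",
"* **Adding special tokens**: add the special tokens the model needs, such as `[CLS]` (classification token), `[SEP]` (separator token), and `[PAD]` (padding token).\n",
"* **Generating the attention mask**: build a binary mask with the same length as the input ID sequence that marks which tokens are real input and which are padding (`1` for real, `0` for padding).\n",
"\n",
"Calling `tokenizer(text, ...)` performs all of these steps."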
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(\"--- Tokenization and Encoding ---\")\n",
"\n",
"if tokenizer:\n",
"    text_example = \"This is a sample sentence for tokenization.\"\n",
"\n",
"    # Basic tokenization\n",
"    tokens = tokenizer.tokenize(text_example)\n",
"    print(f\"\\nText: '{text_example}'\")\n",
"    print(f\"Tokens: {tokens}\")  # a ## prefix marks a subword continuation\n",
"\n",
"    # Convert to IDs\n",
"    token_ids = tokenizer.convert_tokens_to_ids(tokens)\n",
"    print(f\"Token IDs: {token_ids}\")\n",
"\n",
"    # Decode back to text\n",
"    decoded_text = tokenizer.decode(token_ids)\n",
"    print(f\"Decoded text: '{decoded_text}'\")  # may differ slightly from the original (e.g. lowercased)\n",
"\n",
"    # --- Encode in one step with tokenizer() (recommended) ---\n",
"    print(\"\\n--- Encoding using tokenizer() ---\")\n",
"    text_batch = [\n",
"        \"First sentence.\",\n",
"        \"This is a slightly longer second sentence.\"\n",
"    ]\n",
"\n",
"    # padding=True: pad every sentence in the batch to the length of the longest one\n",
"    # truncation=True: truncate sentences that exceed the model's maximum length\n",
"    # return_tensors=\"pt\": return PyTorch tensors ('tf' for TensorFlow, 'np' for NumPy)\n",
"    encoded_input = tokenizer(text_batch, padding=True, truncation=True, return_tensors=\"pt\")\n",
"\n",
"    print(\"Encoded input (PyTorch Tensors):\")\n",
"    # .items() makes the fields easy to print\n",
"    for key, value in encoded_input.items():\n",
"        print(f\"  {key}:\")\n",
"        print(value)\n",
"\n",
"    print(f\"\\nShape of input_ids: {encoded_input['input_ids'].shape}\")\n",
"    print(f\"Shape of attention_mask: {encoded_input['attention_mask'].shape}\")\n",
"\n",
"    # Inspect the special tokens\n",
"    print(f\"\\nSpecial tokens: {tokenizer.special_tokens_map}\")\n",
"    print(f\"CLS token: {tokenizer.cls_token}, ID: {tokenizer.cls_token_id}\")\n",
"    print(f\"SEP token: {tokenizer.sep_token}, ID: {tokenizer.sep_token_id}\")\n",
"    print(f\"PAD token: {tokenizer.pad_token}, ID: {tokenizer.pad_token_id}\")\n",
"else:\n",
"    print(\"Tokenizer not loaded, skipping encoding examples.\")"
]
},
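{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 6. Running Inference with a Loaded Model\n",
"\n",
"Pass the encoded input to the loaded model to get its output.\n",
"* For classification tasks (`AutoModelForSequenceClassification`), the output usually contains `logits` (raw scores).\n",
"* For base models (`AutoModel`), the output usually contains `last_hidden_state` (the hidden states / embeddings of the last layer)."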
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(\"--- Model Inference ---\")\n",
"\n",
"if model_for_classification and tokenizer:\n",
"    texts_for_inference = [\n",
"        \"This library is incredibly useful!\",\n",
"        \"I am not sure if I like this product.\"\n",
"    ]\n",
"\n",
"    # 1. Encode the input\n",
"    inputs = tokenizer(texts_for_inference, padding=True, truncation=True, return_tensors=\"pt\")\n",
"    print(f\"Encoded inputs for inference:\\n{inputs}\")\n",
"\n",
"    # 2. Move the model and inputs to the same device (if using a GPU)\n",
"    # device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n",
"    # model_for_classification.to(device)\n",
"    # inputs = {k: v.to(device) for k, v in inputs.items()}\n",
"    print(f\"\\nModel is on device: {next(model_for_classification.parameters()).device}\")\n",
"\n",
"    # 3. Run inference (under torch.no_grad(), since no gradients are needed)\n",
"    with torch.no_grad():\n",
"        outputs = model_for_classification(**inputs)  # ** unpacks the dict into keyword arguments\n",
"\n",
"    # 4. Process the output\n",
"    print(f\"\\nModel output type: {type(outputs)}\")\n",
"    print(f\"Model output keys: {outputs.keys()}\")  # usually includes 'logits'\n",
"\n",
"    logits = outputs.logits\n",
"    print(f\"\\nLogits (raw scores) shape: {logits.shape}\")  # [batch_size, num_labels]\n",
"    print(f\"Logits:\\n{logits}\")\n",
"\n",
"    # Convert logits to probabilities (with softmax)\n",
"    probabilities = torch.softmax(logits, dim=-1)\n",
"    print(f\"\\nProbabilities:\\n{probabilities.round(decimals=3)}\")\n",
"\n",
"    # Get the predicted class (index of the highest probability)\n",
"    predictions = torch.argmax(probabilities, dim=-1)\n",
"    print(f\"\\nPredicted class indices: {predictions}\")\n",
"\n",
"    # Map indices back to labels\n",
"    # model.config holds the model configuration, including the label mapping\n",
"    id2label = model_for_classification.config.id2label\n",
"    predicted_labels = [id2label[idx.item()] for idx in predictions]\n",
"    print(f\"Predicted labels: {predicted_labels}\")\n",
"\n",
"else:\n",
"    print(\"Model or Tokenizer not loaded, skipping inference example.\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 7. (Overview) The Fine-tuning Workflow\n",
"\n",
"Fine-tuning means continuing to train a large pretrained model on a task-specific dataset so that it adapts to that task. It is a form of transfer learning and is usually more efficient than training a model from scratch.\n",
"\n",
"**Basic steps:**\n",
"1. **Load the pretrained model and tokenizer**: use an `AutoModelFor<Task>` class (such as `AutoModelForSequenceClassification`) to load a model suited to the downstream task.\n",
"2. **Prepare the dataset**: load your task-specific dataset (the `datasets` library can help) and encode it with the model's tokenizer.\n",
"3. **Define the training arguments**: use the `TrainingArguments` class to set the training hyperparameters (learning rate, number of epochs, batch size, and so on).\n",
"4. **Create a `Trainer`**: the `Trainer` class wraps the training and evaluation loops.\n",
"5. **Start training**: call `trainer.train()`.\n",
"6. **(Optional) Evaluate**: call `trainer.evaluate()`.\n",
"\n",
"**Example (pseudocode / conceptual; names such as `NUM_YOUR_CLASSES` and `your_dataset_name` are placeholders):**\n",
"```python\n",
"from datasets import load_dataset\n",
"from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer\n",
"\n",
"# 1. Load the model and tokenizer\n",
"model_name = \"bert-base-uncased\"\n",
"tokenizer = AutoTokenizer.from_pretrained(model_name)\n",
"model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=NUM_YOUR_CLASSES)\n",
"\n",
"# 2. Load and preprocess the dataset\n",
"raw_datasets = load_dataset(\"your_dataset_name\")  # or load from files\n",
"def tokenize_function(examples):\n",
"    return tokenizer(examples[\"text\"], padding=\"max_length\", truncation=True)\n",
"tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)\n",
"# ... you may also need to set the format or drop unneeded columns ...\n",
"train_dataset = tokenized_datasets[\"train\"]\n",
"eval_dataset = tokenized_datasets[\"validation\"]\n",
"\n",
"# 3. Define the training arguments\n",
"training_args = TrainingArguments(\n",
"    output_dir=\"./results\",          # output directory\n",
"    eval_strategy=\"epoch\",           # evaluate after every epoch (named evaluation_strategy in older releases)\n",
"    learning_rate=2e-5,              # learning rate\n",
"    per_device_train_batch_size=16,  # training batch size\n",
"    per_device_eval_batch_size=16,   # evaluation batch size\n",
"    num_train_epochs=3,              # number of training epochs\n",
"    weight_decay=0.01,               # weight decay\n",
"    push_to_hub=False,               # whether to push to the Hub (optional)\n",
")\n",
"\n",
"# 4. Create the Trainer\n",
"trainer = Trainer(\n",
"    model=model,\n",
"    args=training_args,\n",
"    train_dataset=train_dataset,\n",
"    eval_dataset=eval_dataset,\n",
"    # compute_metrics=compute_metrics_function  # (optional) custom metrics function\n",
")\n",
"\n",
"# 5. Start training\n",
"trainer.train()\n",
"\n",
"# 6. Evaluate\n",
"trainer.evaluate()\n",
"```\n",
"Fine-tuning is a deeper topic that involves details such as data preparation and hyperparameter selection. The official Hugging Face documentation and tutorials provide more detailed guidance."
]
},
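{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 8. A Brief Introduction to the Hugging Face Hub\n",
"\n",
"The Hugging Face Hub ([huggingface.co](https://huggingface.co/)) is a collaboration platform that hosts:\n",
"* **Tens of thousands of pretrained models**: loadable directly in `transformers` through their model identifiers.\n",
"* **Thousands of datasets**: easy to load with the `datasets` library.\n",
"* **Spaces**: a platform for hosting and running ML application demos.\n",
"* **Evaluation metrics**.\n",
"\n",
"You can browse the Hub to find models and datasets that suit your task, and read the model cards to learn about each model's details, usage, and limitations."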
]
},
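{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Summary\n",
"\n",
"The Hugging Face Transformers library greatly simplifies using and deploying state-of-the-art pretrained models, especially Transformer-based ones. Its easy-to-use `pipeline` API, powerful `AutoClasses`, and solid integration with PyTorch/TensorFlow make it an indispensable tool for research and application development in NLP, CV, audio, and other fields.\n",
"\n",
"**Key takeaways:**\n",
"* `pipeline` is the simplest way to run quick inference.\n",
"* `AutoTokenizer` and `AutoModel` (and their variants) load the model components.\n",
"* Tokenization is the core step that turns text into model input.\n",
"* A loaded model lets you run inference in detail and inspect internal states.\n",
"* The library supports fine-tuning pretrained models for specific tasks.\n",
"* The Hugging Face Hub is a key resource for finding and sharing models and datasets.\n",
"\n",
"Mastering the Transformers library lets you put powerful pretrained models to work on a wide range of complex machine learning tasks."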
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
},
"orig_nbformat": 4
},
"nbformat": 4,
"nbformat_minor": 5
}

KuRRe8 commented May 8, 2025

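Back to top

Insights, questions, or just want to chat and bump the thread? Post them all here!

There are a lot of documents, so when an ipynb fails to render it is a browser performance issue; refreshing fixes it.

Or git clone the Gist and read it locally.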
