quip2.ipynb
@eramax · Last active March 18, 2025 02:32
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"provenance": [],
"gpuType": "T4",
"authorship_tag": "ABX9TyNlea0mLbdhxyDdVsqNUYtv",
"include_colab_link": true
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"language_info": {
"name": "python"
},
"accelerator": "GPU"
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/gist/eramax/64d59644d600b3db3fcb6acf0133721b/quip2.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
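{
"cell_type": "markdown",
"metadata": {},
"source": [
"This notebook clones the Cornell-RelaxML [quip-sharp](https://github.com/Cornell-RelaxML/quip-sharp) repository, installs its dependencies, builds the quiptools CUDA kernels, and runs interactive generation with the 2-bit QuIP#-quantized `relaxml/Llama-2-13b-chat-E8P-2Bit` model on a Colab T4 GPU."
]
},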
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "L2YvOEO-_pdF"
},
"outputs": [],
"source": [
"!git clone --depth 1 https://github.com/Cornell-RelaxML/quip-sharp.git\n",
"%cd quip-sharp\n",
"!pip install transformers torch accelerate einops flash-attn glog huggingface-hub fast-hadamard-transform lm-eval primefac"
]
},
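{
"cell_type": "markdown",
"metadata": {},
"source": [
"Optional sanity check (not part of the original gist): confirm that the dependencies installed above import cleanly and that a CUDA GPU is visible before building the quiptools kernels in the next cell."
]
},
{
"cell_type": "code",
"source": [
"# Minimal environment check (assumes torch, transformers and accelerate were\n",
"# installed by the previous cell).\n",
"import torch, transformers, accelerate\n",
"print(\"torch:\", torch.__version__, \"| CUDA available:\", torch.cuda.is_available())\n",
"print(\"transformers:\", transformers.__version__)\n",
"print(\"accelerate:\", accelerate.__version__)\n",
"if torch.cuda.is_available():\n",
"    print(\"GPU:\", torch.cuda.get_device_name(0))"
],
"metadata": {},
"execution_count": null,
"outputs": []
},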
{
"cell_type": "code",
"source": [
"!cd quiptools && python setup.py install"
],
"metadata": {
"id": "77ft9SQTCnBY"
},
"execution_count": null,
"outputs": []
},
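{
"cell_type": "markdown",
"metadata": {},
"source": [
"Optional check (not part of the original gist): verify that the CUDA extension compiled in the previous cell can be imported. The module name `quiptools_cuda` is an assumption based on the quip-sharp repository layout; adjust it if the build registers a different name."
]
},
{
"cell_type": "code",
"source": [
"# Assumption: the quiptools build installs an extension named `quiptools_cuda`.\n",
"try:\n",
"    import quiptools_cuda  # noqa: F401\n",
"    print(\"quiptools CUDA extension imported successfully\")\n",
"except ImportError as e:\n",
"    print(\"quiptools extension not importable:\", e)"
],
"metadata": {},
"execution_count": null,
"outputs": []
},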
{
"cell_type": "code",
"source": [
"!nvidia-smi -pm 1\n",
"!nvcc --version\n",
"!lsb_release -a\n",
"!nvidia-smi"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "mdVeHUdTCRwU",
"outputId": "e3542a07-8103-4dec-f91e-e561301cba8c"
},
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Enabled persistence mode for GPU 00000000:00:04.0.\n",
"All done.\n",
"nvcc: NVIDIA (R) Cuda compiler driver\n",
"Copyright (c) 2005-2023 NVIDIA Corporation\n",
"Built on Tue_Aug_15_22:02:13_PDT_2023\n",
"Cuda compilation tools, release 12.2, V12.2.140\n",
"Build cuda_12.2.r12.2/compiler.33191640_0\n",
"No LSB modules are available.\n",
"Distributor ID:\tUbuntu\n",
"Description:\tUbuntu 22.04.3 LTS\n",
"Release:\t22.04\n",
"Codename:\tjammy\n",
"Mon Dec 18 08:13:24 2023 \n",
"+---------------------------------------------------------------------------------------+\n",
"| NVIDIA-SMI 535.104.05 Driver Version: 535.104.05 CUDA Version: 12.2 |\n",
"|-----------------------------------------+----------------------+----------------------+\n",
"| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |\n",
"| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |\n",
"| | | MIG M. |\n",
"|=========================================+======================+======================|\n",
"| 0 Tesla T4 On | 00000000:00:04.0 Off | 0 |\n",
"| N/A 39C P8 9W / 70W | 0MiB / 15360MiB | 0% Default |\n",
"| | | N/A |\n",
"+-----------------------------------------+----------------------+----------------------+\n",
" \n",
"+---------------------------------------------------------------------------------------+\n",
"| Processes: |\n",
"| GPU GI CI PID Type Process name GPU Memory |\n",
"| ID ID Usage |\n",
"|=======================================================================================|\n",
"| No running processes found |\n",
"+---------------------------------------------------------------------------------------+\n"
]
}
]
},
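{
"cell_type": "markdown",
"metadata": {},
"source": [
"Rough memory budget (not part of the original gist): a 2-bit 13B-parameter model stores about 13e9 * 2 / 8 bytes, roughly 3 GiB of weights, so it fits comfortably within the T4's ~15 GiB reported above. The numbers below are back-of-the-envelope estimates, not measurements."
]
},
{
"cell_type": "code",
"source": [
"# Back-of-the-envelope estimate only; real usage also includes activations,\n",
"# the KV cache and quantization codebook overhead.\n",
"import torch\n",
"if torch.cuda.is_available():\n",
"    total_gib = torch.cuda.get_device_properties(0).total_memory / 1024**3\n",
"    approx_weights_gib = 13e9 * 2 / 8 / 1024**3\n",
"    print(f\"GPU total memory: {total_gib:.1f} GiB\")\n",
"    print(f\"Approx. 2-bit 13B weight footprint: {approx_weights_gib:.1f} GiB\")"
],
"metadata": {},
"execution_count": null,
"outputs": []
},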
{
"cell_type": "code",
"source": [
"!huggingface-cli login"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "4Auj_Ds-_NDb",
"outputId": "f66c47ad-7f67-4280-8344-4c57fb23fcd9"
},
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"\n",
" _| _| _| _| _|_|_| _|_|_| _|_|_| _| _| _|_|_| _|_|_|_| _|_| _|_|_| _|_|_|_|\n",
" _| _| _| _| _| _| _| _|_| _| _| _| _| _| _| _|\n",
" _|_|_|_| _| _| _| _|_| _| _|_| _| _| _| _| _| _|_| _|_|_| _|_|_|_| _| _|_|_|\n",
" _| _| _| _| _| _| _| _| _| _| _|_| _| _| _| _| _| _| _|\n",
" _| _| _|_| _|_|_| _|_|_| _|_|_| _| _| _|_|_| _| _| _| _|_|_| _|_|_|_|\n",
"\n",
" To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .\n",
"Token: \n",
"Add token as git credential? (Y/n) n\n",
"Token is valid (permission: read).\n",
"Your token has been saved to /root/.cache/huggingface/token\n",
"Login successful\n"
]
}
]
},
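{
"cell_type": "markdown",
"metadata": {},
"source": [
"Alternative to the interactive CLI login above (not part of the original gist): authenticate programmatically with `huggingface_hub`, which avoids the input prompt in Colab. Reading the token from an `HF_TOKEN` environment variable is an assumption; store it however you prefer (e.g. Colab secrets)."
]
},
{
"cell_type": "code",
"source": [
"# Assumption: the token is exposed as the HF_TOKEN environment variable.\n",
"import os\n",
"from huggingface_hub import login, whoami\n",
"token = os.environ.get(\"HF_TOKEN\")\n",
"if token:\n",
"    login(token=token)\n",
"print(whoami())  # confirms the stored token is valid"
],
"metadata": {},
"execution_count": null,
"outputs": []
},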
{
"cell_type": "code",
"source": [
"!CUDA_VISIBLE_DEVICES=0 python interactive_gen.py --hf_path \"relaxml/Llama-2-13b-chat-E8P-2Bit\" --max_length 100 --no_use_flash_attn"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "4EzJQIRfBhQ4",
"outputId": "89449f6d-a2db-44f0-cc07-889c550ff036"
},
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"I1218 08:25:01.935575 5096 utils.py:160] NumExpr defaulting to 2 threads.\n",
"Using `is_flash_attn_available` is deprecated and will be removed in v4.38. Please use `is_flash_attn_2_available` instead.\n",
"Using `is_flash_attn_available` is deprecated and will be removed in v4.38. Please use `is_flash_attn_2_available` instead.\n",
"Using `is_flash_attn_available` is deprecated and will be removed in v4.38. Please use `is_flash_attn_2_available` instead.\n",
"I1218 08:25:10.277577 5096 modeling.py:835] We will use 90% of the memory on device 0 for storing the model, and 10% for the buffer to avoid OOM. You can set `max_memory` in to a higher value to use more memory (at your own risk).\n",
"\n",
"Please enter your prompt or 'quit' (without quotes) to quit: hi\n",
"\n",
"Model Output: hi everyone! I'm a long time lurker on this forum, but I've never posted before. I've been reading the posts and learning from the community for years, but I've never had the courage to post until now. So, here's my first post!\n",
"\n",
"I've been a fan of the site for a long time, and I've learned so much from all of you. I've been following the site for years,\n",
"\n",
"Please enter your prompt or 'quit' (without quotes) to quit: Retrieval Augmented Generation (RAG) \n",
"\n",
"Model Output: Retrieval Augmented Generation (RAG) \n",
"\n",
"The goal of Retrieval Augmented Generation (RAG) is to improve the performance of the generator network by leveraging the strengths of both the generator and the retrieval networks. The generator network generates the output, while the retrieval network is trained to recognize the correct output. The RAG uses the retrieval network to guide the generator network to produce better outputs.\n",
"\n",
"The RAG architecture consists of two\n",
"\n",
"Please enter your prompt or 'quit' (without quotes) to quit: \n"
]
}
]
}
]
}