quip2.ipynb
@eramax · Last active March 18, 2025 02:32
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"provenance": [],
"gpuType": "T4",
"authorship_tag": "ABX9TyNlea0mLbdhxyDdVsqNUYtv",
"include_colab_link": true
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"language_info": {
"name": "python"
},
"accelerator": "GPU"
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/gist/eramax/64d59644d600b3db3fcb6acf0133721b/quip2.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
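{
"cell_type": "markdown",
"metadata": {},
"source": [
"This notebook clones the Cornell-RelaxML [quip-sharp](https://github.com/Cornell-RelaxML/quip-sharp) repository, installs its dependencies, builds the quiptools CUDA kernels, and runs interactive generation with the 2-bit QuIP#-quantized `relaxml/Llama-2-13b-chat-E8P-2Bit` model on a Colab T4 GPU."
]
},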
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "L2YvOEO-_pdF"
},
"outputs": [],
"source": [
"!git clone --depth 1 https://github.com/Cornell-RelaxML/quip-sharp.git\n",
"%cd quip-sharp\n",
"!pip install transformers torch accelerate einops flash-attn glog huggingface-hub fast-hadamard-transform lm-eval primefac"
]
},
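{
"cell_type": "markdown",
"metadata": {},
"source": [
"Optional sanity check (not part of the original gist): confirm that the dependencies installed above import cleanly and that a CUDA GPU is visible before building the quiptools kernels in the next cell."
]
},
{
"cell_type": "code",
"source": [
"# Minimal environment check (assumes torch, transformers and accelerate were\n",
"# installed by the previous cell).\n",
"import torch, transformers, accelerate\n",
"print(\"torch:\", torch.__version__, \"| CUDA available:\", torch.cuda.is_available())\n",
"print(\"transformers:\", transformers.__version__)\n",
"print(\"accelerate:\", accelerate.__version__)\n",
"if torch.cuda.is_available():\n",
"    print(\"GPU:\", torch.cuda.get_device_name(0))"
],
"metadata": {},
"execution_count": null,
"outputs": []
},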
{
"cell_type": "code",
"source": [
"!cd quiptools && python setup.py install"
],
"metadata": {
"id": "77ft9SQTCnBY"
},
"execution_count": null,
"outputs": []
},
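{
"cell_type": "markdown",
"metadata": {},
"source": [
"Optional check (not part of the original gist): verify that the CUDA extension compiled in the previous cell can be imported. The module name `quiptools_cuda` is an assumption based on the quip-sharp repository layout; adjust it if the build registers a different name."
]
},
{
"cell_type": "code",
"source": [
"# Assumption: the quiptools build installs an extension named `quiptools_cuda`.\n",
"try:\n",
"    import quiptools_cuda  # noqa: F401\n",
"    print(\"quiptools CUDA extension imported successfully\")\n",
"except ImportError as e:\n",
"    print(\"quiptools extension not importable:\", e)"
],
"metadata": {},
"execution_count": null,
"outputs": []
},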
{
"cell_type": "code",
"source": [
"!nvidia-smi -pm 1\n",
"!nvcc --version\n",
"!lsb_release -a\n",
"!nvidia-smi"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "mdVeHUdTCRwU",
"outputId": "e3542a07-8103-4dec-f91e-e561301cba8c"
},
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Enabled persistence mode for GPU 00000000:00:04.0.\n",
"All done.\n",
"nvcc: NVIDIA (R) Cuda compiler driver\n",
"Copyright (c) 2005-2023 NVIDIA Corporation\n",
"Built on Tue_Aug_15_22:02:13_PDT_2023\n",
"Cuda compilation tools, release 12.2, V12.2.140\n",
"Build cuda_12.2.r12.2/compiler.33191640_0\n",
"No LSB modules are available.\n",
"Distributor ID:\tUbuntu\n",
"Description:\tUbuntu 22.04.3 LTS\n",
"Release:\t22.04\n",
"Codename:\tjammy\n",
"Mon Dec 18 08:13:24 2023 \n",
"+---------------------------------------------------------------------------------------+\n",
"| NVIDIA-SMI 535.104.05 Driver Version: 535.104.05 CUDA Version: 12.2 |\n",
"|-----------------------------------------+----------------------+----------------------+\n",
"| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |\n",
"| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |\n",
"| | | MIG M. |\n",
"|=========================================+======================+======================|\n",
"| 0 Tesla T4 On | 00000000:00:04.0 Off | 0 |\n",
"| N/A 39C P8 9W / 70W | 0MiB / 15360MiB | 0% Default |\n",
"| | | N/A |\n",
"+-----------------------------------------+----------------------+----------------------+\n",
" \n",
"+---------------------------------------------------------------------------------------+\n",
"| Processes: |\n",
"| GPU GI CI PID Type Process name GPU Memory |\n",
"| ID ID Usage |\n",
"|=======================================================================================|\n",
"| No running processes found |\n",
"+---------------------------------------------------------------------------------------+\n"
]
}
]
},
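{
"cell_type": "markdown",
"metadata": {},
"source": [
"Rough memory budget (not part of the original gist): a 2-bit 13B-parameter model stores about 13e9 * 2 / 8 bytes, roughly 3 GiB of weights, so it fits comfortably within the T4's ~15 GiB reported above. The numbers below are back-of-the-envelope estimates, not measurements."
]
},
{
"cell_type": "code",
"source": [
"# Back-of-the-envelope estimate only; real usage also includes activations,\n",
"# the KV cache and quantization codebook overhead.\n",
"import torch\n",
"if torch.cuda.is_available():\n",
"    total_gib = torch.cuda.get_device_properties(0).total_memory / 1024**3\n",
"    approx_weights_gib = 13e9 * 2 / 8 / 1024**3\n",
"    print(f\"GPU total memory: {total_gib:.1f} GiB\")\n",
"    print(f\"Approx. 2-bit 13B weight footprint: {approx_weights_gib:.1f} GiB\")"
],
"metadata": {},
"execution_count": null,
"outputs": []
},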
{
"cell_type": "code",
"source": [
"!huggingface-cli login"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "4Auj_Ds-_NDb",
"outputId": "f66c47ad-7f67-4280-8344-4c57fb23fcd9"
},
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"\n",
" _| _| _| _| _|_|_| _|_|_| _|_|_| _| _| _|_|_| _|_|_|_| _|_| _|_|_| _|_|_|_|\n",
" _| _| _| _| _| _| _| _|_| _| _| _| _| _| _| _|\n",
" _|_|_|_| _| _| _| _|_| _| _|_| _| _| _| _| _| _|_| _|_|_| _|_|_|_| _| _|_|_|\n",
" _| _| _| _| _| _| _| _| _| _| _|_| _| _| _| _| _| _| _|\n",
" _| _| _|_| _|_|_| _|_|_| _|_|_| _| _| _|_|_| _| _| _| _|_|_| _|_|_|_|\n",
"\n",
" To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .\n",
"Token: \n",
"Add token as git credential? (Y/n) n\n",
"Token is valid (permission: read).\n",
"Your token has been saved to /root/.cache/huggingface/token\n",
"Login successful\n"
]
}
]
},
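{
"cell_type": "markdown",
"metadata": {},
"source": [
"Alternative to the interactive CLI login above (not part of the original gist): authenticate programmatically with `huggingface_hub`, which avoids the input prompt in Colab. Reading the token from an `HF_TOKEN` environment variable is an assumption; store it however you prefer (e.g. Colab secrets)."
]
},
{
"cell_type": "code",
"source": [
"# Assumption: the token is exposed as the HF_TOKEN environment variable.\n",
"import os\n",
"from huggingface_hub import login, whoami\n",
"token = os.environ.get(\"HF_TOKEN\")\n",
"if token:\n",
"    login(token=token)\n",
"print(whoami())  # confirms the stored token is valid"
],
"metadata": {},
"execution_count": null,
"outputs": []
},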
{
"cell_type": "code",
"source": [
"!CUDA_VISIBLE_DEVICES=0 python interactive_gen.py --hf_path \"relaxml/Llama-2-13b-chat-E8P-2Bit\" --max_length 100 --no_use_flash_attn"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "4EzJQIRfBhQ4",
"outputId": "89449f6d-a2db-44f0-cc07-889c550ff036"
},
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"I1218 08:25:01.935575 5096 utils.py:160] NumExpr defaulting to 2 threads.\n",
"Using `is_flash_attn_available` is deprecated and will be removed in v4.38. Please use `is_flash_attn_2_available` instead.\n",
"Using `is_flash_attn_available` is deprecated and will be removed in v4.38. Please use `is_flash_attn_2_available` instead.\n",
"Using `is_flash_attn_available` is deprecated and will be removed in v4.38. Please use `is_flash_attn_2_available` instead.\n",
"I1218 08:25:10.277577 5096 modeling.py:835] We will use 90% of the memory on device 0 for storing the model, and 10% for the buffer to avoid OOM. You can set `max_memory` in to a higher value to use more memory (at your own risk).\n",
"\n",
"Please enter your prompt or 'quit' (without quotes) to quit: hi\n",
"\n",
"Model Output: hi everyone! I'm a long time lurker on this forum, but I've never posted before. I've been reading the posts and learning from the community for years, but I've never had the courage to post until now. So, here's my first post!\n",
"\n",
"I've been a fan of the site for a long time, and I've learned so much from all of you. I've been following the site for years,\n",
"\n",
"Please enter your prompt or 'quit' (without quotes) to quit: Retrieval Augmented Generation (RAG) \n",
"\n",
"Model Output: Retrieval Augmented Generation (RAG) \n",
"\n",
"The goal of Retrieval Augmented Generation (RAG) is to improve the performance of the generator network by leveraging the strengths of both the generator and the retrieval networks. The generator network generates the output, while the retrieval network is trained to recognize the correct output. The RAG uses the retrieval network to guide the generator network to produce better outputs.\n",
"\n",
"The RAG architecture consists of two\n",
"\n",
"Please enter your prompt or 'quit' (without quotes) to quit: \n"
]
}
]
}
]
}