@M1ndBlast
Last active November 28, 2023 16:33
autoavsr.ipynb
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/gist/M1ndBlast/64a46e60107d0319efd3a29d3d28da92/autoavsr.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "89tk8KwgJIEX"
},
"source": [
"# Auto-AVSR Tutorial\n",
"**Authors**: [Pingchuan Ma](https://mpc001.github.io/), [Alexandros Haliassos](https://dblp.org/pid/257/3052.html), [Adriana Fernandez-Lopez](https://scholar.google.com/citations?user=DiVeQHkAAAAJ), [Honglie Chen](https://scholar.google.com/citations?user=HPwdvwEAAAAJ), [Stavros Petridis](https://ibug.doc.ic.ac.uk/people/spetridis), [Maja Pantic](https://ibug.doc.ic.ac.uk/people/mpantic).\n",
"\n",
"This tutorial shows how to use the Auto-AVSR model to perform speech recognition (ASR, VSR, and AV-ASR), crop mouth ROIs, or extract visual speech features.\n",
"\n",
"**Disclaimer**: Both the VSR and AV-ASR models were trained on videos pre-processed with RetinaFace. To speed up inference, this tutorial uses mediapipe instead."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "kbRr5DNhJed7",
"outputId": "0a23c774-4c6a-4acc-b9a5-f6acf1e1f963"
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"/content\n",
"Cloning into 'Visual_Speech_Recognition_for_Multiple_Languages'...\n",
"remote: Enumerating objects: 277, done.\u001b[K\n",
"remote: Counting objects: 100% (100/100), done.\u001b[K\n",
"remote: Compressing objects: 100% (74/74), done.\u001b[K\n",
"remote: Total 277 (delta 33), reused 81 (delta 22), pack-reused 177\u001b[K\n",
"Receiving objects: 100% (277/277), 69.77 MiB | 15.68 MiB/s, done.\n",
"Resolving deltas: 100% (58/58), done.\n",
"/content/Visual_Speech_Recognition_for_Multiple_Languages\n"
]
}
],
"source": [
"%cd \"/content/\"\n",
"!git clone https://github.com/mpc001/Visual_Speech_Recognition_for_Multiple_Languages.git\n",
"%cd \"Visual_Speech_Recognition_for_Multiple_Languages\""
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"id": "JRR0bdqNLXTc",
"outputId": "000aa245-c7c1-422f-e7af-4c7ca1b6a297",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Requirement already satisfied: torch in /usr/local/lib/python3.10/dist-packages (2.1.0+cu118)\n",
"Requirement already satisfied: torchvision in /usr/local/lib/python3.10/dist-packages (0.16.0+cu118)\n",
"Requirement already satisfied: torchaudio in /usr/local/lib/python3.10/dist-packages (2.1.0+cu118)\n",
"Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from torch) (3.13.1)\n",
"Requirement already satisfied: typing-extensions in /usr/local/lib/python3.10/dist-packages (from torch) (4.5.0)\n",
"Requirement already satisfied: sympy in /usr/local/lib/python3.10/dist-packages (from torch) (1.12)\n",
"Requirement already satisfied: networkx in /usr/local/lib/python3.10/dist-packages (from torch) (3.2.1)\n",
"Requirement already satisfied: jinja2 in /usr/local/lib/python3.10/dist-packages (from torch) (3.1.2)\n",
"Requirement already satisfied: fsspec in /usr/local/lib/python3.10/dist-packages (from torch) (2023.6.0)\n",
"Requirement already satisfied: triton==2.1.0 in /usr/local/lib/python3.10/dist-packages (from torch) (2.1.0)\n",
"Requirement already satisfied: numpy in /usr/local/lib/python3.10/dist-packages (from torchvision) (1.23.5)\n",
"Requirement already satisfied: requests in /usr/local/lib/python3.10/dist-packages (from torchvision) (2.31.0)\n",
"Requirement already satisfied: pillow!=8.3.*,>=5.3.0 in /usr/local/lib/python3.10/dist-packages (from torchvision) (9.4.0)\n",
"Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.10/dist-packages (from jinja2->torch) (2.1.3)\n",
"Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests->torchvision) (3.3.2)\n",
"Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests->torchvision) (3.4)\n",
"Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests->torchvision) (2.0.7)\n",
"Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests->torchvision) (2023.7.22)\n",
"Requirement already satisfied: mpmath>=0.19 in /usr/local/lib/python3.10/dist-packages (from sympy->torch) (1.3.0)\n",
"Requirement already satisfied: opencv-python in /usr/local/lib/python3.10/dist-packages (4.8.0.76)\n",
"Requirement already satisfied: numpy>=1.21.2 in /usr/local/lib/python3.10/dist-packages (from opencv-python) (1.23.5)\n",
"Requirement already satisfied: scipy in /usr/local/lib/python3.10/dist-packages (1.11.3)\n",
"Requirement already satisfied: numpy<1.28.0,>=1.21.6 in /usr/local/lib/python3.10/dist-packages (from scipy) (1.23.5)\n",
"Requirement already satisfied: scikit-image in /usr/local/lib/python3.10/dist-packages (0.19.3)\n",
"Requirement already satisfied: numpy>=1.17.0 in /usr/local/lib/python3.10/dist-packages (from scikit-image) (1.23.5)\n",
"Requirement already satisfied: scipy>=1.4.1 in /usr/local/lib/python3.10/dist-packages (from scikit-image) (1.11.3)\n",
"Requirement already satisfied: networkx>=2.2 in /usr/local/lib/python3.10/dist-packages (from scikit-image) (3.2.1)\n",
"Requirement already satisfied: pillow!=7.1.0,!=7.1.1,!=8.3.0,>=6.1.0 in /usr/local/lib/python3.10/dist-packages (from scikit-image) (9.4.0)\n",
"Requirement already satisfied: imageio>=2.4.1 in /usr/local/lib/python3.10/dist-packages (from scikit-image) (2.31.6)\n",
"Requirement already satisfied: tifffile>=2019.7.26 in /usr/local/lib/python3.10/dist-packages (from scikit-image) (2023.9.26)\n",
"Requirement already satisfied: PyWavelets>=1.1.1 in /usr/local/lib/python3.10/dist-packages (from scikit-image) (1.4.1)\n",
"Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.10/dist-packages (from scikit-image) (23.2)\n",
"Collecting av\n",
" Downloading av-11.0.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (32.9 MB)\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m32.9/32.9 MB\u001b[0m \u001b[31m49.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[?25hInstalling collected packages: av\n",
"Successfully installed av-11.0.0\n",
"Requirement already satisfied: six in /usr/local/lib/python3.10/dist-packages (1.16.0)\n",
"Collecting mediapipe\n",
" Downloading mediapipe-0.10.8-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (34.5 MB)\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m34.5/34.5 MB\u001b[0m \u001b[31m48.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[?25hRequirement already satisfied: absl-py in /usr/local/lib/python3.10/dist-packages (from mediapipe) (1.4.0)\n",
"Requirement already satisfied: attrs>=19.1.0 in /usr/local/lib/python3.10/dist-packages (from mediapipe) (23.1.0)\n",
"Requirement already satisfied: flatbuffers>=2.0 in /usr/local/lib/python3.10/dist-packages (from mediapipe) (23.5.26)\n",
"Requirement already satisfied: matplotlib in /usr/local/lib/python3.10/dist-packages (from mediapipe) (3.7.1)\n",
"Requirement already satisfied: numpy in /usr/local/lib/python3.10/dist-packages (from mediapipe) (1.23.5)\n",
"Requirement already satisfied: opencv-contrib-python in /usr/local/lib/python3.10/dist-packages (from mediapipe) (4.8.0.76)\n",
"Requirement already satisfied: protobuf<4,>=3.11 in /usr/local/lib/python3.10/dist-packages (from mediapipe) (3.20.3)\n",
"Collecting sounddevice>=0.4.4 (from mediapipe)\n",
" Downloading sounddevice-0.4.6-py3-none-any.whl (31 kB)\n",
"Requirement already satisfied: CFFI>=1.0 in /usr/local/lib/python3.10/dist-packages (from sounddevice>=0.4.4->mediapipe) (1.16.0)\n",
"Requirement already satisfied: contourpy>=1.0.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib->mediapipe) (1.2.0)\n",
"Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.10/dist-packages (from matplotlib->mediapipe) (0.12.1)\n",
"Requirement already satisfied: fonttools>=4.22.0 in /usr/local/lib/python3.10/dist-packages (from matplotlib->mediapipe) (4.44.3)\n",
"Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib->mediapipe) (1.4.5)\n",
"Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.10/dist-packages (from matplotlib->mediapipe) (23.2)\n",
"Requirement already satisfied: pillow>=6.2.0 in /usr/local/lib/python3.10/dist-packages (from matplotlib->mediapipe) (9.4.0)\n",
"Requirement already satisfied: pyparsing>=2.3.1 in /usr/local/lib/python3.10/dist-packages (from matplotlib->mediapipe) (3.1.1)\n",
"Requirement already satisfied: python-dateutil>=2.7 in /usr/local/lib/python3.10/dist-packages (from matplotlib->mediapipe) (2.8.2)\n",
"Requirement already satisfied: pycparser in /usr/local/lib/python3.10/dist-packages (from CFFI>=1.0->sounddevice>=0.4.4->mediapipe) (2.21)\n",
"Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-packages (from python-dateutil>=2.7->matplotlib->mediapipe) (1.16.0)\n",
"Installing collected packages: sounddevice, mediapipe\n",
"Successfully installed mediapipe-0.10.8 sounddevice-0.4.6\n",
"Collecting ffmpeg-python\n",
" Downloading ffmpeg_python-0.2.0-py3-none-any.whl (25 kB)\n",
"Requirement already satisfied: future in /usr/local/lib/python3.10/dist-packages (from ffmpeg-python) (0.18.3)\n",
"Installing collected packages: ffmpeg-python\n",
"Successfully installed ffmpeg-python-0.2.0\n"
]
}
],
"source": [
"!pip install torch torchvision torchaudio\n",
"!pip install opencv-python\n",
"!pip install scipy\n",
"!pip install scikit-image\n",
"!pip install av\n",
"!pip install six\n",
"\n",
"!pip install mediapipe\n",
"!pip install ffmpeg-python"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "wIoPeIizMxVi"
},
"source": [
"## Video preparation"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "QBO8QaRHSCIJ"
},
"source": [
"1. Download a video."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "Mtwa6fV4NHX4",
"outputId": "07d2ea68-9fe2-490f-f07c-00efcb7745fd"
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"--2023-11-28 15:55:22-- http://www.doc.ic.ac.uk/~pm4115/autoAVSR/autoavsr_demo_video.mp4\n",
"Resolving www.doc.ic.ac.uk (www.doc.ic.ac.uk)... 146.169.13.6\n",
"Connecting to www.doc.ic.ac.uk (www.doc.ic.ac.uk)|146.169.13.6|:80... connected.\n",
"HTTP request sent, awaiting response... 200 OK\n",
"Length: 3644186 (3.5M) [video/mp4]\n",
"Saving to: ‘/content/data/clip.mp4’\n",
"\n",
"/content/data/clip. 100%[===================>] 3.47M 3.41MB/s in 1.0s \n",
"\n",
"2023-11-28 15:55:24 (3.41 MB/s) - ‘/content/data/clip.mp4’ saved [3644186/3644186]\n",
"\n"
]
}
],
"source": [
"!mkdir -p /content/data/\n",
"!wget --content-disposition http://www.doc.ic.ac.uk/~pm4115/autoAVSR/autoavsr_demo_video.mp4 -O /content/data/clip.mp4"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"id": "fArWyDh2NIqI"
},
"outputs": [],
"source": [
"from IPython.display import HTML\n",
"from base64 import b64encode\n",
"\n",
"## play_video function based on: https://colab.research.google.com/drive/1bNXkfpHiVHzXQH8WjGhzQ-fsDxolpUjD\n",
"\n",
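"def play_video(video_path, width=200):\n",
"    \"\"\"Embed an MP4 file inline in the notebook as a base64 data URL.\"\"\"\n",
"    # Use a context manager so the file handle is closed after reading.\n",
"    with open(video_path, 'rb') as f:\n",
"        mp4 = f.read()\n",
"    data_url = \"data:video/mp4;base64,\" + b64encode(mp4).decode()\n",
"    return HTML(f\"\"\"\n",
"    <video width={width} controls>\n",
"      <source src=\"{data_url}\" type=\"video/mp4\">\n",
"    </video>\n",
"    \"\"\")"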
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"id": "VAX2S30tNTzC",
"outputId": "e6fb17ae-311f-4f14-acc5-58352bdca38e",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 555
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"<IPython.core.display.HTML object>"
],
"text/html": [
"\n",
" <video width=300 controls>\n",