rnett · December 2, 2020 03:47
diff --git a/tutorial.ipynb b/tutorial.ipynb
 {
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "name": "Tutorial.ipynb",
      "provenance": [],
      "collapsed_sections": [],
      "toc_visible": true,
      "include_colab_link": true
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "view-in-github",
        "colab_type": "text"
      },
      "source": [
        "<a href=\"https://colab.research.google.com/gist/rnett/e80d481d1f8a22a4e8d129019bcfd546/tutorial.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Px2VD_550WpB"
      },
      "source": [
        "# UniProt\n",
        "\n",
        "### Objective\n",
        " \n",
        "Use UniProt to assist us with gathering data to answer the following question: How do mutations in the p53 sequence alter the shape of the protein?\n",
        "\n",
        "### Introduction\n",
        "UniProt is an online database of protein sequence and function information. We will be using UniProt to download FASTA files for protein sequences we are interested in for later analyses. \n",
        "\n",
        "### Tutorial \n",
        "\n",
        "**Step 1:** To begin, navigate to UniProt (https://www.uniprot.org/). Enter the protein you are interested in. We will be searching for “p53”. \n",
        "\n",
        "![](https://i.ibb.co/sWZDBHr/fs01kpf-YHk9s3u3-G98-FKEj-PSm4-BEIv4-Ni5-XYo-JFe-Fv-Apj63-X3lmn-N8y-a9z7-Utgf-Tr-WAFMy417no-Gp-Hyhi9.png)\n",
        "\n",
        "**Step 2:** Explore the results! \n",
        "\n",
        "![](https://i.ibb.co/RCx1Lqn/image.png)\n",
        "\n",
        "\n",
        "![](https://i.ibb.co/j8nyS9R/image.png)\n",
        "\n",
        "Learn more about the protein function, mutations, similar proteins, and more from the search result page of each entry\n",
        "\n",
        "![](https://i.ibb.co/FKgD74b/image.png)\n",
        "\n",
        "From here, you can download a fasta file by scrolling down to the sequence section.\n",
        "\n",
        "![](https://i.ibb.co/hmkf1NF/image.png)\n",
        "\n",
        "Click on the <img src=\"https://i.ibb.co/Y2bm9zs/image.png\" height=40 /> button.\n",
        "\n",
        "You can now copy the link to the fasta entry or copy and paste the entire fasta file. We will use the link to the fasta entry for later tutorials. \n",
        "\n",
        "![](https://i.ibb.co/s5Pmtx7/image.png)\n",
        "\n",
        "There is a lot of useful information on Uniprot. Below will showcase some of the most common sections that you might use for research.\n",
        "\n",
        "This first section is names and taxonomy.\n",
        "\n",
        "![](https://i.ibb.co/9wrSyc7/image.png)\n",
        "\n",
        "The subcellular location section shows where the protein is found. \n",
        "\n",
        "![](https://i.ibb.co/k6ybc7Z/image.png)\n",
        "\n",
        "The pathology and biotech section offers information about the genes involvement in different diseases. \n",
        "\n",
        "![](https://i.ibb.co/0fHVg4x/image.png)\n",
        "\n",
        "The interaction section offers information about the genes interaction with other subunits of proteins\n",
        "\n",
        "![](https://i.ibb.co/G3N5dXk/image.png)\n",
        "\n",
        "The structure section offers information about the proteins 3D structure. \n",
        "\n",
        "![](https://i.ibb.co/9vv42YV/image.png)\n",
        "\n",
        "### Conclusion\n",
        "It's important to know what your research question is and what information you need to answer your question. This will help guide you through online sources like Uniprot during data extraction. Let's remind ourselves again of our original question: How do mutations in the p53 sequence alter the shape of the protein? Although UniProt cannot directly give this answer, we can already see how this knowledge base provides us with the data needed in later analyses. Next, jump to the MUSCLE tutorial to see how to use the data extracted from a UniProt for a multiple sequence alignment. "
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "95qAhhM-57Zr"
      },
      "source": [
        "## Code\n",
        "\n",
        "To work with FASTA files from UniProt, we need packages for downloading and reading them.  We also make a class and several utility methods for working with them.\n",
        "\n",
        "First, we install the `biopython` (for reading) and `wget` (for downloading) packages, and import the required modules.\n",
        "\n",
        "In Jupyter notebooks, which includes Colab notebooks, you can run shell commands by starting the line with `!`.  You can include python variables using `$my_variable`, which will be stringified using `str`.\n",
        "\n",
        "So to install packages with `pip`, we just use `pip install ...`.\n",
        "\n",
        "We will use similar methods to install BLAST and MUSCLE later."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "Z6FeqblC6ksy",
        "outputId": "9d04894a-d5a5-4352-b7c1-b4be8445c15d"
      },
      "source": [
        "!pip install biopython wget\n",
        "\n",
        "import os\n",
        "from pathlib import Path\n",
        "from Bio import SeqIO\n",
        "import wget"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "Collecting biopython\n",
            "\u001b[?25l  Downloading https://files.pythonhosted.org/packages/76/02/8b606c4aa92ff61b5eda71d23b499ab1de57d5e818be33f77b01a6f435a8/biopython-1.78-cp36-cp36m-manylinux1_x86_64.whl (2.3MB)\n",
            "\u001b[K     |████████████████████████████████| 2.3MB 6.0MB/s \n",
            "\u001b[?25hCollecting wget\n",
            "  Downloading https://files.pythonhosted.org/packages/47/6a/62e288da7bcda82b935ff0c6cfe542970f04e29c756b0e147251b2fb251f/wget-3.2.zip\n",
            "Requirement already satisfied: numpy in /usr/local/lib/python3.6/dist-packages (from biopython) (1.18.5)\n",
            "Building wheels for collected packages: wget\n",
            "  Building wheel for wget (setup.py) ... \u001b[?25l\u001b[?25hdone\n",
            "  Created wheel for wget: filename=wget-3.2-cp36-none-any.whl size=9682 sha256=94d80a8f7a883851ee350dcb5d11e7af2a8dc72f07d20159e826f8293c187a47\n",
            "  Stored in directory: /root/.cache/pip/wheels/40/15/30/7d8f7cea2902b4db79e3fea550d7d7b85ecb27ef992b618f3f\n",
            "Successfully built wget\n",
            "Installing collected packages: biopython, wget\n",
            "Successfully installed biopython-1.78 wget-3.2\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "V1vxHzvR6ivS"
      },
      "source": [
        "Here, we create a class representing a FASTA file.  You could do everything with filenames without much trouble, but this allows us to make utility methods for accessing sequences and combining files.\n",
        "\n",
        "It is important we define `__str__`, as this is what python uses to convert an object to a string, such as in `str()` or when used in a magic command.\n",
        "\n",
        "`__add__` works similairly, except with `+` instead of `str()`.  `a + b` gets converted to `a.__add__(b)`.  This allows for combining files by putting all of their sequences in one file."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "BQCoo8a-54te"
      },
      "source": [
        "# A utility function for printing sequences\n",
        "def _display_seq(d, labels):\n",
        "  if labels:\n",
        "    return \">\" + d.description + \"\\n\" + str(d.seq)\n",
        "  else:\n",
        "    return str(d.seq)\n",
        "  \n",
        "class FastaFile:\n",
        "  def __init__(self, file):\n",
        "    self.file = str(file)\n",
        "    with open(file) as f:\n",
        "      self.text = f.read()\n",
        "\n",
        "    self.data = list(SeqIO.parse(file, \"fasta\"))\n",
        "\n",
        "  # utility method for getting the sequence from a single-sequence file\n",
        "  @property\n",
        "  def single(self):\n",
        "    if len(self.data) == 1:\n",
        "      return self.data[0]\n",
        "    else:\n",
        "      raise ValueError(f\"Fasta file {self.file} doesn't have exactly one entry\")\n",
        "  \n",
        "  # convert this class to a string\n",
        "  def __str__(self):\n",
        "    return self.file\n",
        "\n",
        "  # get display text\n",
        "  def display(self, labels=True, spacing=True):\n",
        "    line = \"\\n\\n\" if spacing else \"\\n\"\n",
        "    text = line.join([_display_seq(d, labels) for d in self.data])\n",
        "    return text\n",
        "\n",
        "  # allows using + to combine files\n",
        "  def __add__(self, other):\n",
        "    return combine(self, other)\n",
        "\n",
        "def file_to_str(file):\n",
        "  if isinstance(file, FastaFile):\n",
        "    return file.file\n",
        "  else:\n",
        "    return str(file)"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Lifkifxu6n4U"
      },
      "source": [
        "We can now create a utlity method to load a fasta file if it exists, or download it and save it if not."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "1cf5MFOP6oVJ"
      },
      "source": [
        "def load_fasta(url, file=None) -> FastaFile:\n",
        "  if file is None:\n",
        "    if not os.path.exists(\"./data\"):\n",
        "      os.mkdir(\"./data\")\n",
        "    file = \"./data/\" + url.split(\"/\")[-1]\n",
        "\n",
        "  if not os.path.exists(file):\n",
        "      wget.download(url, file)\n",
        "  \n",
        "  return FastaFile(file)"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "1VI6FQA96rat"
      },
      "source": [
        "We also need methods to combine sequences into a single file for muscle.  We use python's vararg support with `*args`.  When a function uses `*args` as a parameter, you can pass in as many arguments as you want, and python will turn them into a list you can reference as `args`.  To pass a list as `args` to a function that uses `*args`, you can expand the list using `*`."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "6H3SW_br6r7_"
      },
      "source": [
        "def combine(*args):\n",
        "  names = [Path(str(a)).stem for a in args]\n",
        "  new_name = \"+\".join(names) + \".fasta\"\n",
        "  new_file = Path(str(args[0])).parent / new_name\n",
        "  return combineToNew(new_file, *args)\n",
        "\n",
        "def combineToNew(new_file, *args):\n",
        "  files = [str(a) for a in args]\n",
        "  texts = [open(f).read() for f in files]\n",
        "  text = \"\\n\".join(texts)\n",
        "  \n",
        "  with open(new_file, 'w+') as of:\n",
        "    of.write(text)\n",
        "  \n",
        "  return FastaFile(new_file)"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "l7uKSNUk6vvy"
      },
      "source": [
        "We can now load the fasta files we want to work with, and use our `FastaFile` utility methods to show the human p53 sequence.."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "ul_39ZPP6xoJ",
        "outputId": "244d0312-1c19-44e6-cc63-0984a58b7cb6"
      },
      "source": [
        "human = load_fasta(\"https://www.uniprot.org/uniprot/P04637.fasta\")\n",
        "mouse = load_fasta(\"https://www.uniprot.org/uniprot/P02340.fasta\")\n",
        "rat = load_fasta(\"https://www.uniprot.org/uniprot/P10361.fasta\")\n",
        "\n",
        "print(human.single.seq)"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLSPDDIEQWFTEDPGPDEAPRMPEAAPPVAPAPAAPTPAAPAPAPSWPLSSSVPSQKTYQGSYGFRLGFLHSGTAKSVTCTYSPALNKMFCQLAKTCPVQLWVDSTPPPGTRVRAMAIYKQSQHMTEVVRRCPHHERCSDSDGLAPPQHLIRVEGNLRVEYLDDRNTFRHSVVVPYEPPEVGSDCTTIHYNYMCNSSCMGGMNRRPILTIITLEDSSGNLLGRNSFEVRVCACPGRDRRTEEENLRKKGEPHHELPPGSTKRALPNNTSSSPQPKKKPLDGEYFTLQIRGRERFEMFRELNEALELKDAQAGKEPGGSRAHSSHLKSKKGQSTSRHKKLMFKTEGPDSD\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "dV-wC4mG7GVx"
      },
      "source": [
        "# BLAST\n",
        "\n",
        "### Objective\n",
        "To use BLAST to align sequences in order to answer the question, how do the p53 sequences of different species compare? \n",
        "\n",
        "### Introduction\n",
        "BLAST is used to align and compare multiple sequences. BLAST is short for basic local alignment search tool. We will use BLAST here to compare the p53 sequence of humans and mice to see how similar they are.\n",
        "\n",
        "### Tutorial\n",
        "We will use NCBI's BLAST web-interface for this tutorial, and then show how to use blast from our notebook.\n",
        "\n",
        "**Step 1:**\n",
        "Get the FASTA file or sequence for the protein you are interested, using UniProt like we showed in the previous tutorial.\n",
        "\n",
        "We will use the human p53.\n",
        "\n",
        "**Step 2:**\n",
        "Go to NCBI BLAST at https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastp&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome which will take you to the standard protein BLAST page.\n",
        "\n",
        "![](https://i.ibb.co/BBnkRnw/image.png)\n",
        "\n",
        "**Step 3:**\n",
        "Next, enter the FASTA sequence that you got from UniProt into the box labeled “Enter accession number(s), gif(s), or FASTA sequence(s)”.  You can also upload a FASTA file. Now press the blue button that says “BLAST”.\n",
        "\n",
        "![](https://i.ibb.co/DQ1Tr5f/image.png)\n",
        "\n",
        "**Step 4:**\n",
        "Give the database a minute or so to pull up information on the FASTA sequence. Once it does, lots of information is displayed about the p53 protein. \n",
        "\n",
        "![](https://i.ibb.co/478MWBf/image.png)\n",
        "\n",
        "The description tab can give you information on what your sequence most likely belongs to. This is especially useful if you have a sequence, but you're not sure what it is. The graphic summary tab can give detailed information on where and what things are going on within the sequence. It also compares the proteins listed under the description tab in a figure that makes it easy to visualize. The alignments tab shows comparisons between the sequence and the proteins listed under the description tab. Finally, the taxonomy tab shows a lineage of taxa that are most closely related to the sequence.\n",
        "\n",
        "**Step 5:**\n",
        "The next step is to use BLAST to run a pairwise alignment. Back on UniProt, you need to get both the human p53 FASTA file and the mouse p53 FASTA file (or just use our Python variables to print them).\n",
        "\n",
        "Now go to BLAST and enter the human p53 FASTA file into the same box as before. Then check the button that says “Align two or more sequences” and enter the mouse p53 FASTA file into the second box that appears. Now press the blue button that says “BLAST”.\n",
        "\n",
        "![](https://i.ibb.co/mym9v9w/image.png)\n",
        "\n",
        "**Step 6:**\n",
        "\n",
        "Under the alignments tab, a pairwise alignment with information on it will show how related the two sequences are.\n",
        "\n",
        "![](https://i.ibb.co/xY7yYW2/image.png)\n",
        "\n",
        "### Conclusion\n",
        "\n",
        "In conclusion, we are able to see that mice and humans have pretty similar sequences. More than three fourths of the sequences matched, which we can identify at 77 percent and there were only 1 percent gaps. Both mice and humans have relatively high cancer rates and are highly susceptible to cancerous tumor growth. Using BLAST, we are able to see that they have similar p53 sequences which means that they may have similar difficulties with tumor suppression. To answer the question above, I would say that the p53 sequences of these two species are very similar, but they are not the same. Since the sequences are different. This may be why the two species, and other species, have different levels of tumor suppresion and respond to mutations in the p53 sequence differently. BLAST is a good basic tool to detect similarities and differences between sequences, but MUSCLE can be more valuable when it comes to comparing many different sequences.\n",
        "\n",
        "### References\n",
        "Allan Balmain, Curtis C.Harris, Carcinogenesis in mouse and human cells: parallels and paradoxes, Carcinogenesis, Volume 21, Issue 3, March 2000, Pages 371–377, https://doi.org/10.1093/carcin/21.3.371\n",
        "\n",
        "“BLAST: Basic Local Alignment Search Tool.” National Center for Biotechnology Information, U.S. National Library of Medicine, blast.ncbi.nlm.nih.gov/Blast.cgi. "
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "xueTIukW_TUR"
      },
      "source": [
        "## Algorithm\n",
        "\n",
        "\n",
        "\n",
        "### Sequence Alignment\n",
        "\n",
        "Our goal for this section of the tutorial is to find the difference between two amino acid chains, and to find the most optimal way to align them (more on that later). These chains are represented as strings, e.g. \"PAPAPSWPLSSSVPSQKTYQGSYGFRLGF\". For the sake of simplicity, we will start by considering pairwise alignment, and expand it to multiple alignment near the end.  This is essentially what BLAST is doing.\n",
        "\n",
        "The naive approach is obvious: iterate through both strings, add to the score when they don't match. But even this has issues: what if they aren't the same size? We could just count the over-extended part as mismatches, but what if the smaller sequences match the middle of the larger one? We would get a more optimal score by having part of the over-extension at the beginning and part at the end than having it fully at either end. The real problem with this approach is that it doesn't do any alignment. Due to biological processes, it is not uncommon for amino acids to change here and there, or be inserted or removed. We want our method to be able to handle this gracefully, essentially saying \"ignore this one, but keep matching things around it\".\n",
        "\n",
        "To this end, we use indels. An indel is an insertion or deletion in one of the strings, rather than matching a character. It will occur when removing a character is cheaper than trying to match it, generally because it makes the surroundings match well. We represent indels in the string with \"-\". For example, the optimal alignment of \"ABC\" and \"ABBC\" is \"AB-C\" and \"ABBC\"; we use an indel to match the extra \"B\". The reason they are called indels instead of insertion or deletions is that an indel is the same as inserting on one string or deleting on the other. In our example, it could be handled by deleting the \"B\" in string 2, or by inserting a \"B\" in string 1. Indels also conveniently handle the size issues we mentioned earlier.\n",
        "\n",
        "\n",
        "A further complication is that we can't simply assign a 1 for a match and a 0 for a mismatch.  Due to biological factors, some matches are worth a lot, and some have huge penalties.  We use a scoring matrix for this.  These matrices give a score for each match, and assumes that mismatches are treated as 0.  One of the more common ones is BLOSUM62, pictured below.\n",
        "\n",
        "<img src=\"https://upload.wikimedia.org/wikipedia/commons/thumb/0/02/BLOSUM62.png/800px-BLOSUM62.png\" width=800px>\n",
        "\n",
        "Now we just need a way to actually do this comparison.  It's easy enough to adapt our naive approach from earlier: iterate through the strings, choose to match or add an indel in the way that adds the most score.  The problem is that sometimes we want to take sub-optimal solutions for a while, to line up a big match later.  To do this, we use dynamic programming.  I'm not going to give a full explanation of dynamic programming (there's tutorials out there for that), but the basics is that we want to solve the problem optimally for a 1-smaller chunk, then add on the optimal choice for our current position.  Instead of re-calculating the solution to the sub-problem each time, we will store the results in a table.\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "FwY_BWVsAJRZ"
      },
      "source": [
        "### Dynamic Programming Intro - Making Change\n",
        "\n",
        "A good introductory problem for dynamic programming is the change making problem: given an integer amount of money and a list of coin denominations, find the way to make change with the least amount of coins.  Consider making change for 42 cents, with coins with values 6, 5, and 1.  There are only three possible ways to make change for 42: 6 + `change(36)`, 5 + `change(37)`, and 1 + `change(41)`, where `change` is our function.\n",
        "\n",
        "We can proceed down this branching path until we get `change(0)`, which is 0.  However, if we do it like this, there is a lot of overlap.  We can reduce the computational cost by starting at 0 and working our way up, storing the result each time.\n",
        "\n",
        "Here is an example algorithm with extra code to print the list we are building up:\n"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "Yg_Tftt2AL0v",
        "outputId": "27bccde7-5ba0-4d56-bc6c-41d1b3b7732d"
      },
      "source": [
        "import numpy as np\n",
        "\n",
        "def make_change(money,coins):\n",
        "    min_coins = [np.Inf] * (money + 1)\n",
        "    min_coins[0] = 0\n",
        "\n",
        "    arrays = [np.copy(min_coins)]\n",
        "\n",
        "    for i in range(1, money + 1):\n",
        "        costs = [min_coins[i - c] + 1 for c in coins if i - c >= 0]\n",
        "        min_coins[i] = np.min(costs)\n",
        "\n",
        "        arrays.append(np.copy(min_coins))\n",
        "\n",
        "    print(np.array2string(np.stack(arrays), 200))\n",
        "\n",
        "    return min_coins[money]\n",
        "\n",
        "\n",
        "make_change(27,[6,5,1])"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "[[ 0. inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf]\n",
            " [ 0.  1. inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf]\n",
            " [ 0.  1.  2. inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf]\n",
            " [ 0.  1.  2.  3. inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf]\n",
            " [ 0.  1.  2.  3.  4. inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf]\n",
            " [ 0.  1.  2.  3.  4.  1. inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf]\n",
            " [ 0.  1.  2.  3.  4.  1.  1. inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf]\n",
            " [ 0.  1.  2.  3.  4.  1.  1.  2. inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf]\n",
            " [ 0.  1.  2.  3.  4.  1.  1.  2.  3. inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf]\n",
            " [ 0.  1.  2.  3.  4.  1.  1.  2.  3.  4. inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf]\n",
            " [ 0.  1.  2.  3.  4.  1.  1.  2.  3.  4.  2. inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf]\n",
            " [ 0.  1.  2.  3.  4.  1.  1.  2.  3.  4.  2.  2. inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf]\n",
            " [ 0.  1.  2.  3.  4.  1.  1.  2.  3.  4.  2.  2.  2. inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf]\n",
            " [ 0.  1.  2.  3.  4.  1.  1.  2.  3.  4.  2.  2.  2.  3. inf inf inf inf inf inf inf inf inf inf inf inf inf inf]\n",
            " [ 0.  1.  2.  3.  4.  1.  1.  2.  3.  4.  2.  2.  2.  3.  4. inf inf inf inf inf inf inf inf inf inf inf inf inf]\n",
            " [ 0.  1.  2.  3.  4.  1.  1.  2.  3.  4.  2.  2.  2.  3.  4.  3. inf inf inf inf inf inf inf inf inf inf inf inf]\n",
            " [ 0.  1.  2.  3.  4.  1.  1.  2.  3.  4.  2.  2.  2.  3.  4.  3.  3. inf inf inf inf inf inf inf inf inf inf inf]\n",
            " [ 0.  1.  2.  3.  4.  1.  1.  2.  3.  4.  2.  2.  2.  3.  4.  3.  3.  3. inf inf inf inf inf inf inf inf inf inf]\n",
            " [ 0.  1.  2.  3.  4.  1.  1.  2.  3.  4.  2.  2.  2.  3.  4.  3.  3.  3.  3. inf inf inf inf inf inf inf inf inf]\n",
            " [ 0.  1.  2.  3.  4.  1.  1.  2.  3.  4.  2.  2.  2.  3.  4.  3.  3.  3.  3.  4. inf inf inf inf inf inf inf inf]\n",
            " [ 0.  1.  2.  3.  4.  1.  1.  2.  3.  4.  2.  2.  2.  3.  4.  3.  3.  3.  3.  4.  4. inf inf inf inf inf inf inf]\n",
            " [ 0.  1.  2.  3.  4.  1.  1.  2.  3.  4.  2.  2.  2.  3.  4.  3.  3.  3.  3.  4.  4.  4. inf inf inf inf inf inf]\n",
            " [ 0.  1.  2.  3.  4.  1.  1.  2.  3.  4.  2.  2.  2.  3.  4.  3.  3.  3.  3.  4.  4.  4.  4. inf inf inf inf inf]\n",
            " [ 0.  1.  2.  3.  4.  1.  1.  2.  3.  4.  2.  2.  2.  3.  4.  3.  3.  3.  3.  4.  4.  4.  4.  4. inf inf inf inf]\n",
            " [ 0.  1.  2.  3.  4.  1.  1.  2.  3.  4.  2.  2.  2.  3.  4.  3.  3.  3.  3.  4.  4.  4.  4.  4.  4. inf inf inf]\n",
            " [ 0.  1.  2.  3.  4.  1.  1.  2.  3.  4.  2.  2.  2.  3.  4.  3.  3.  3.  3.  4.  4.  4.  4.  4.  4.  5. inf inf]\n",
            " [ 0.  1.  2.  3.  4.  1.  1.  2.  3.  4.  2.  2.  2.  3.  4.  3.  3.  3.  3.  4.  4.  4.  4.  4.  4.  5.  5. inf]\n",
            " [ 0.  1.  2.  3.  4.  1.  1.  2.  3.  4.  2.  2.  2.  3.  4.  3.  3.  3.  3.  4.  4.  4.  4.  4.  4.  5.  5.  5.]]\n"
          ],
          "name": "stdout"
        },
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "5"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 6
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "XJ5OkXENAPHG"
      },
      "source": [
        "Ignore everything with `arrays`, it is only there to provide you with the full table output.  Each row in that table is the `min_coins` array on an iteration, you can see how it is built up. \n",
        "\n",
        "Note that we use numpy, importing it as `np`.  Numpy is a python library for working with multi-dimensional arrays.  You could do everything with python's lists, but numpy makes things much easier.  Numpy's documentation is located [here](https://numpy.org/doc/stable/) (look at API reference mostly) and if you google \"numpy ___\" you will probably get good results.  Most functions, such as min and max, are called using `np.min` or `np.max`.  Normally we would have to install it using `pip`, but it is already installed in Colab's runtime.\n",
        "\n",
        "The algorithm itself is not terribly complicated.  We create our result storing array, `min_coins`, and fill it with infinity.  We make it `money + 1` long so that there is a slot for 0, all the way up to a slot for `money` (remember python's arrays are 0-indexed).  We start with `min_coins[0] = 0`, and the work our way up to money (remember, `range` excludes the end value, thus the `money + 1`).  On each iteration, we can use the relationship we pointed out earlier: for each coin, look at the result that many steps ago.  We then chose the previous result that is the smallest, and add one.\n",
        "\n",
        "This is shown in the table.  For example, look at the 6th row.  Remember the table is 0-indexed, so this corresponds to making change of 5.  We have two options here, add a 1 to change of 4, or add a 5 to change of 0.  Since the change of 4 is 4, and the change of 0 is 0, we take the change of 0 and add 1.\n",
        "\n",
        "### Dynamic Programming for Sequence Alignment\n",
        "\n",
        "We want to use this strategy for sequence alignment.  However, we rather quickly run into a problem: what do we use as an index.  It's not as simple as just using the current money, we have two strings to compare.  Additionally, we can insert an indel into either string, so we can't just step back in one direction.  The solution here is to use a 2D array, where dimension corresponds to one of the strings, each index corresponds to using that much of it's string.\n",
        "\n",
        "For the strings \"KNAT\" and \"TSPN\", this would look like: \n"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "cellView": "form",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 204
        },
        "id": "JJFAPRXlAXQc",
        "outputId": "f4df726f-6112-4c61-e9c9-3ebb36cf425a"
      },
      "source": [
        "#@title\n",
        "import pandas as pd\n",
        "s1,s2=\"KNAT\",\"TSPN\"\n",
        "scores = pd.DataFrame(index=[\"-\"]+[s1[:i+1] for i in range(len(s1))],columns=[\"-\"]+[s2[:i+1] for i in range(len(s2))])\n",
        "scores"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/html": [
              "<div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>-</th>\n",
              "      <th>T</th>\n",
              "      <th>TS</th>\n",
              "      <th>TSP</th>\n",
              "      <th>TSPN</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>-</th>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>K</th>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>KN</th>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>KNA</th>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>KNAT</th>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "      <td>NaN</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>"
            ],
            "text/plain": [
              "        -    T   TS  TSP TSPN\n",
              "-     NaN  NaN  NaN  NaN  NaN\n",
              "K     NaN  NaN  NaN  NaN  NaN\n",
              "KN    NaN  NaN  NaN  NaN  NaN\n",
              "KNA   NaN  NaN  NaN  NaN  NaN\n",
              "KNAT  NaN  NaN  NaN  NaN  NaN"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 7
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "CPFlJnx1AbP6"
      },
      "source": [
        "You can ignore the code used to show this.  For convenience, we won't actually store the strings, just the indices.  We know that an index of `x` means \"the first x characters\".\n",
        "\n",
        "When comparing characters in sequence alignment, we have three options, each with a score:\n",
        "*   Compare the characters: use the value from the the scoring matrix\n",
        "*   Add an indel to string 1: 0\n",
        "*   Add an indel to string 2: 0\n",
        "\n",
        "These map rather cleanly onto our scoring table.  Let's say we are at point `[i, j]` in the table.  Therefore, the characters we are comparing are `i-1` from string 1 and `j-1` from string 2 (remember, the 0 index of our table corresponds to the empty string, and the 1 index corresponds to the first character, which is index 0 in the string).  If we want to add the score based on comparing the characters, we need to get the alignment of the strings without those two characters.  That is at index `[i-1, j-1]`.  Remember to add the score from the scoring matrix!\n",
        "\n",
        "If we are adding an indel to string 1 we want to look at `[i, j-1]`, since the indel matches a character from string 2, essentially removing it.  If we are adding to string 2, we do the same thing and end up looking at `[i-1, j]`.\n",
        "\n",
        "Now that we have the structure of our table, and the moves we can make on it, the algorithm is pretty simple to write.  For the sake of simplicity, we won't use a scoring matrix and just add 1 if there is a match.  It is easy to add later.\n",
        "\n"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "pFSothSTAc9K",
        "outputId": "a014d581-10b0-41b8-b113-3eff9bac1e7d"
      },
      "source": [
        "def align_dynamic(s1,s2):\n",
        "    scores = np.zeros((len(s1) + 1, len(s2) + 1), np.int)\n",
        "\n",
        "    nrows,ncols = scores.shape\n",
        "    for i in range(1,nrows):\n",
        "        for j in range(1,ncols):\n",
        "            match = scores[i-1, j-1] + 1 if s1[i - 1] == s2[j - 1] else 0\n",
        "            ins_1 = scores[i, j-1]\n",
        "            ins_2 = scores[i-1, j]\n",
        "\n",
        "            scores[i, j] = max(match, ins_1, ins_2)\n",
        "            \n",
        "    return scores[nrows - 1, ncols - 1]\n",
        "\n",
        "\n",
        "align_dynamic(\"MEEPQSDPSVEPPLSQETFSDLWKLL\",\"MTAMEESQSDISLELPLSQETFSGLWKLLPP\")"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "21"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 8
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Isa5xE8fAfY1"
      },
      "source": [
        "Like with the change making algorithm, we start by making the array. We don't bother to fill everything with infinity, since we know that each cell will be calculated by the time we want to use it. We want the left and top sides to be initialized to 0, so we just initialize the whole array to 0. We then iterate across the table, calculating our three options at each point and taking the largest. Once we've done the whole thing, our score for the entirety of both strings is at the bottom right corner.\n",
        "\n",
        "However, we have a problem: what is the alignment!?  We only calculated the score.  To calculate the alignment, we need to store what option we took at each point, and then backtrack through the table from the bottom right.  The algorithm for that isn't that hard, if kind of uguly."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "allEFIcOAhyo",
        "outputId": "dc1e540b-932e-4714-9f5b-929c3f7d89e1"
      },
      "source": [
        "def really_align_dynamic(s1,s2):\n",
        "    scores = np.zeros((len(s1) + 1, len(s2) + 1), np.int)\n",
        "    backtrack = np.zeros((len(s1) + 1, len(s2) + 1), np.int)\n",
        "\n",
        "    for i in range(len(s1) + 1):\n",
        "        backtrack[i, 0] = 2\n",
        "\n",
        "    for j in range(len(s2) + 1):\n",
        "        backtrack[0, j] = 1\n",
        "\n",
        "    nrows,ncols = scores.shape\n",
        "    for i in range(1,nrows):\n",
        "        for j in range(1,ncols):\n",
        "            match = scores[i-1, j-1] + 1 if s1[i - 1] == s2[j - 1] else 0\n",
        "            ins_1 = scores[i, j-1]\n",
        "            ins_2 = scores[i-1, j]\n",
        "\n",
        "            choices = [match, ins_1, ins_2]\n",
        "            taken = np.argmax(choices)\n",
        "\n",
        "            backtrack[i, j] = taken\n",
        "            scores[i, j] = choices[taken]\n",
        "\n",
        "    s1_align = \"\"\n",
        "    s2_align = \"\"\n",
        "    i, j = nrows - 1, ncols - 1\n",
        "    while i != 0 or j != 0:\n",
        "        step = backtrack[i, j]\n",
        "        if step == 0:\n",
        "            s1_align = s1[i - 1] + s1_align\n",
        "            s2_align = s1[i - 1] + s2_align\n",
        "            i, j = i-1, j-1\n",
        "        elif step == 1:\n",
        "            s1_align = '-' + s1_align\n",
        "            s2_align = s2[j - 1] + s2_align\n",
        "            i, j = i, j-1\n",
        "        elif step == 2:\n",
        "            s1_align = s1[i - 1] + s1_align\n",
        "            s2_align = '-' + s2_align\n",
        "            i, j = i-1, j\n",
        "\n",
        "    return scores[nrows - 1, ncols - 1], s1_align, s2_align\n",
        "\n",
        "really_align_dynamic(\"MEEPQSDPSVEPPLSQETFSDLWKLL\",\"MTAMEESQSDISLELPLSQETFSGLWKLLPP\")"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "(21,\n",
              " '---MEEP-QSDP-SV-EP-PLSQETFSD-LWKLL--',\n",
              " 'MTAMEE-SQSD-IS-LE-LPLSQETFS-GLWKLLPP')"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 9
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "GE4ePhFuAkbc"
      },
      "source": [
        "We have to initialize the 0 sides of the backtrack array (we don't have to do this for the score since we want them at 0). To recover the strings from the backtrack array, we do what we described earlier and step backwards according to the backtrack values, adding characters or indels to the front of the strings.\n",
        "\n",
        "Argmax, used as `np.argmax`, gets the index of the maximum value instead of getting the maximum value.  So `np.argmax(choices)` gets the index of the choice we're taking, which will be either 0, 1, or 2.  We can get the score of the array by indexing back into `choices`, like `choices[np.argmax(choices)]`.\n",
        "\n",
        "This algorithm is the core of BLAST.  Of course, BLAST does much more, and works by looking for a target sequence in a database sequence rather than doing full sequence comparison, but it is still doing sequence alignment.\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "NfV3c--79YDp"
      },
      "source": [
        "## Code\n",
        "\n",
        "We want to be able to use BLAST from code.  While the web interface is nice, if you are comparing large amounts of sequences or are offline, doing BLAST locally is nessecary.\n",
        "\n",
        "Lucky for us, BLAST is available as part of the `apt` package repository, which is used by Ubuntu.  Since colab notebooks are ran on Ubuntu servers, we can install it easily.  If you are hosting this notebook yourself, you will need to make the `blast` and `makeblastdb` commands accessible.\n",
        "\n",
        "So to install blast, we need to run `sudo apt-get install ncbi-blast+`.  We also need to run `sudo apt-get update` beforehand to update the repository."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "vkkT9Lkm-g4w",
        "outputId": "8de2f1c5-620e-4b67-dd88-f878e0d41cb2"
      },
      "source": [
        "!sudo apt-get update\n",
        "!sudo apt-get install ncbi-blast+"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "\r0% [Working]\r            \rHit:1 http://archive.ubuntu.com/ubuntu bionic InRelease\n",
            "\r0% [Connecting to security.ubuntu.com (91.189.88.152)] [Connected to cloud.r-pr\r0% [1 InRelease gpgv 242 kB] [Waiting for headers] [Connecting to security.ubun\r                                                                               \rGet:2 http://archive.ubuntu.com/ubuntu bionic-updates InRelease [88.7 kB]\n",
            "\r0% [1 InRelease gpgv 242 kB] [2 InRelease 14.2 kB/88.7 kB 16%] [Connecting to s\r                                                                               \rIgn:3 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  InRelease\n",
            "\r0% [1 InRelease gpgv 242 kB] [2 InRelease 47.5 kB/88.7 kB 54%] [Waiting for hea\r0% [1 InRelease gpgv 242 kB] [Waiting for headers] [Waiting for headers] [Waiti\r                                                                               \rGet:4 https://cloud.r-project.org/bin/linux/ubuntu bionic-cran40/ InRelease [3,626 B]\n",
            "\r0% [1 InRelease gpgv 242 kB] [Waiting for headers] [Waiting for headers] [4 InR\r0% [1 InRelease gpgv 242 kB] [Waiting for headers] [Waiting for headers] [Waiti\r                                                                               \rGet:5 http://archive.ubuntu.com/ubuntu bionic-backports InRelease [74.6 kB]\n",
            "\r0% [1 InRelease gpgv 242 kB] [5 InRelease 4,052 B/74.6 kB 5%] [Waiting for head\r0% [1 InRelease gpgv 242 kB] [Waiting for headers] [Waiting for headers] [Waiti\r                                                                               \rGet:6 http://ppa.launchpad.net/c2d4u.team/c2d4u4.0+/ubuntu bionic InRelease [15.9 kB]\n",
            "\r0% [1 InRelease gpgv 242 kB] [Waiting for headers] [6 InRelease 9,828 B/15.9 kB\r                                                                               \rIgn:7 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  InRelease\n",
            "\r0% [1 InRelease gpgv 242 kB] [Waiting for headers] [6 InRelease 14.2 kB/15.9 kB\r                                                                               \rGet:8 http://security.ubuntu.com/ubuntu bionic-security InRelease [88.7 kB]\n",
            "\r0% [1 InRelease gpgv 242 kB] [8 InRelease 14.2 kB/88.7 kB 16%] [6 InRelease 14.\r                                                                               \rGet:9 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  Release [697 B]\n",
            "\r0% [1 InRelease gpgv 242 kB] [8 InRelease 14.2 kB/88.7 kB 16%] [6 InRelease 14.\r0% [1 InRelease gpgv 242 kB] [8 InRelease 14.2 kB/88.7 kB 16%] [6 InRelease 14.\r                                                                               \rHit:10 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  Release\n",
            "Get:11 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  Release.gpg [836 B]\n",
            "Get:12 http://archive.ubuntu.com/ubuntu bionic-updates/restricted amd64 Packages [252 kB]\n",
            "Hit:13 http://ppa.launchpad.net/graphics-drivers/ppa/ubuntu bionic InRelease\n",
            "Get:14 http://archive.ubuntu.com/ubuntu bionic-updates/multiverse amd64 Packages [54.3 kB]\n",
            "Get:15 http://archive.ubuntu.com/ubuntu bionic-updates/universe amd64 Packages [2,130 kB]\n",
            "Get:16 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 Packages [2,208 kB]\n",
            "Ign:18 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  Packages\n",
            "Get:18 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  Packages [442 kB]\n",
            "Get:19 http://ppa.launchpad.net/c2d4u.team/c2d4u4.0+/ubuntu bionic/main Sources [1,690 kB]\n",
            "Get:20 http://security.ubuntu.com/ubuntu bionic-security/main amd64 Packages [1,781 kB]\n",
            "Get:21 http://security.ubuntu.com/ubuntu bionic-security/universe amd64 Packages [1,365 kB]\n",
            "Get:22 http://ppa.launchpad.net/c2d4u.team/c2d4u4.0+/ubuntu bionic/main amd64 Packages [865 kB]\n",
            "Fetched 11.1 MB in 3s (4,143 kB/s)\n",
            "Reading package lists... Done\n",
            "Reading package lists... Done\n",
            "Building dependency tree       \n",
            "Reading state information... Done\n",
            "The following additional packages will be installed:\n",
            "  ncbi-data\n",
            "The following NEW packages will be installed:\n",
            "  ncbi-blast+ ncbi-data\n",
            "0 upgraded, 2 newly installed, 0 to remove and 59 not upgraded.\n",
            "Need to get 13.1 MB of archives.\n",
            "After this operation, 66.7 MB of additional disk space will be used.\n",
            "Get:1 http://archive.ubuntu.com/ubuntu bionic/universe amd64 ncbi-data all 6.1.20170106-2 [3,645 kB]\n",
            "Get:2 http://archive.ubuntu.com/ubuntu bionic/universe amd64 ncbi-blast+ amd64 2.6.0-1 [9,446 kB]\n",
            "Fetched 13.1 MB in 1s (9,688 kB/s)\n",
            "debconf: unable to initialize frontend: Dialog\n",
            "debconf: (No usable dialog-like program is installed, so the dialog based frontend cannot be used. at /usr/share/perl5/Debconf/FrontEnd/Dialog.pm line 76, <> line 2.)\n",
            "debconf: falling back to frontend: Readline\n",
            "debconf: unable to initialize frontend: Readline\n",
            "debconf: (This frontend requires a controlling tty.)\n",
            "debconf: falling back to frontend: Teletype\n",
            "dpkg-preconfigure: unable to re-open stdin: \n",
            "Selecting previously unselected package ncbi-data.\n",
            "(Reading database ... 144793 files and directories currently installed.)\n",
            "Preparing to unpack .../ncbi-data_6.1.20170106-2_all.deb ...\n",
            "Unpacking ncbi-data (6.1.20170106-2) ...\n",
            "Selecting previously unselected package ncbi-blast+.\n",
            "Preparing to unpack .../ncbi-blast+_2.6.0-1_amd64.deb ...\n",
            "Unpacking ncbi-blast+ (2.6.0-1) ...\n",
            "Setting up ncbi-data (6.1.20170106-2) ...\n",
            "Setting up ncbi-blast+ (2.6.0-1) ...\n",
            "Processing triggers for hicolor-icon-theme (0.17-2) ...\n",
            "Processing triggers for man-db (2.8.3-2ubuntu0.1) ...\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "_EfQmd1z9sen"
      },
      "source": [
        "Here we define methods to create a blast db, and to call blast.  These both use magic commands, and we create the blast db as part of the blast call since our sequences are small enough.\n",
        "\n",
        "We don't want to see the database building output, so we redirect it to `/dev/null`."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "aCDtUB0S-394"
      },
      "source": [
        "def build_blast_db(file):\n",
        "  !makeblastdb -in $file -dbtype prot > /dev/null\n",
        "\n",
        "def blast(query, db, outfile = \"blast.txt\"):\n",
        "  build_blast_db(db)\n",
        "  !blastp -query $query -db $db -evalue 1e-6 -num_threads 4 -out $outfile\n",
        "  text = open(outfile).read()\n",
        "  return text"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Ms2Ishrv-7L3"
      },
      "source": [
        "Note that our implementation builds a new database every time.  Since we are only calling it a few times, this if fine, but if you're making a lot of calls against the same database you should pull `build_blast_db` out of `blast`.\n",
        "\n",
        "We now call blast to get the alignment between human and mouse versions of p53, and pring the output."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "DeqNLyjW-9oT",
        "outputId": "01053cfb-9f25-4674-cfa9-8ad619812f8b"
      },
      "source": [
        "blast_text = blast(human, mouse)\n",
        "print(blast_text)"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "BLASTP 2.6.0+\n",
            "\n",
            "\n",
            "Reference: Stephen F. Altschul, Thomas L. Madden, Alejandro A.\n",
            "Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J.\n",
            "Lipman (1997), \"Gapped BLAST and PSI-BLAST: a new generation of\n",
            "protein database search programs\", Nucleic Acids Res. 25:3389-3402.\n",
            "\n",
            "\n",
            "Reference for composition-based statistics: Alejandro A. Schaffer,\n",
            "L. Aravind, Thomas L. Madden, Sergei Shavirin, John L. Spouge, Yuri\n",
            "I. Wolf, Eugene V. Koonin, and Stephen F. Altschul (2001),\n",
            "\"Improving the accuracy of PSI-BLAST protein database searches with\n",
            "composition-based statistics and other refinements\", Nucleic Acids\n",
            "Res. 29:2994-3005.\n",
            "\n",
            "\n",
            "\n",
            "Database: ./data/P02340.fasta\n",
            "           1 sequences; 390 total letters\n",
            "\n",
            "\n",
            "\n",
            "Query= sp|P04637|P53_HUMAN Cellular tumor antigen p53 OS=Homo sapiens\n",
            "OX=9606 GN=TP53 PE=1 SV=4\n",
            "\n",
            "Length=393\n",
            "                                                                      Score     E\n",
            "Sequences producing significant alignments:                          (Bits)  Value\n",
            "\n",
            "  sp|P02340|P53_MOUSE Cellular tumor antigen p53 OS=Mus musculus ...  598     0.0  \n",
            "\n",
            "\n",
            "> sp|P02340|P53_MOUSE Cellular tumor antigen p53 OS=Mus musculus \n",
            "OX=10090 GN=Tp53 PE=1 SV=4\n",
            "Length=390\n",
            "\n",
            " Score = 598 bits (1543),  Expect = 0.0, Method: Compositional matrix adjust.\n",
            " Identities = 304/393 (77%), Positives = 326/393 (83%), Gaps = 6/393 (2%)\n",
            "\n",
            "Query  1    MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLSPDDIEQWFTEDPGP  60\n",
            "            MEE QSD S+E PLSQETFS LWKLLP  ++L P P   MDDL+L P D+E++F    GP\n",
            "Sbjct  4    MEESQSDISLELPLSQETFSGLWKLLPPEDIL-PSP-HCMDDLLL-PQDVEEFFE---GP  57\n",
            "\n",
            "Query  61   DEAPRMPEAAPPVAPAPAAPTPAAPAPAPSWPLSSSVPSQKTYQGSYGFRLGFLHSGTAK  120\n",
            "             EA R+  A     P    P P APAPA  WPLSS VPSQKTYQG+YGF LGFL SGTAK\n",
            "Sbjct  58   SEALRVSGAPAAQDPVTETPGPVAPAPATPWPLSSFVPSQKTYQGNYGFHLGFLQSGTAK  117\n",
            "\n",
            "Query  121  SVTCTYSPALNKMFCQLAKTCPVQLWVDSTPPPGTRVRAMAIYKQSQHMTEVVRRCPHHE  180\n",
            "            SV CTYSP LNK+FCQLAKTCPVQLWV +TPP G+RVRAMAIYK+SQHMTEVVRRCPHHE\n",
            "Sbjct  118  SVMCTYSPPLNKLFCQLAKTCPVQLWVSATPPAGSRVRAMAIYKKSQHMTEVVRRCPHHE  177\n",
            "\n",
            "Query  181  RCSDSDGLAPPQHLIRVEGNLRVEYLDDRNTFRHSVVVPYEPPEVGSDCTTIHYNYMCNS  240\n",
            "            RCSD DGLAPPQHLIRVEGNL  EYL+DR TFRHSVVVPYEPPE GS+ TTIHY YMCNS\n",
            "Sbjct  178  RCSDGDGLAPPQHLIRVEGNLYPEYLEDRQTFRHSVVVPYEPPEAGSEYTTIHYKYMCNS  237\n",
            "\n",
            "Query  241  SCMGGMNRRPILTIITLEDSSGNLLGRNSFEVRVCACPGRDRRTEEENLRKKGEPHHELP  300\n",
            "            SCMGGMNRRPILTIITLEDSSGNLLGR+SFEVRVCACPGRDRRTEEEN RKK     ELP\n",
            "Sbjct  238  SCMGGMNRRPILTIITLEDSSGNLLGRDSFEVRVCACPGRDRRTEEENFRKKEVLCPELP  297\n",
            "\n",
            "Query  301  PGSTKRALPNNTSSSPQPKKKPLDGEYFTLQIRGRERFEMFRELNEALELKDAQAGKEPG  360\n",
            "            PGS KRALP  TS+SP  KKKPLDGEYFTL+IRGR+RFEMFRELNEALELKDA A +E G\n",
            "Sbjct  298  PGSAKRALPTCTSASPPQKKKPLDGEYFTLKIRGRKRFEMFRELNEALELKDAHATEESG  357\n",
            "\n",
            "Query  361  GSRAHSSHLKSKKGQSTSRHKKLMFKTEGPDSD  393\n",
            "             SRAHSS+LK+KKGQSTSRHKK M K  GPDSD\n",
            "Sbjct  358  DSRAHSSYLKTKKGQSTSRHKKTMVKKVGPDSD  390\n",
            "\n",
            "\n",
            "\n",
            "Lambda      K        H        a         alpha\n",
            "   0.314    0.131    0.404    0.792     4.96 \n",
            "\n",
            "Gapped\n",
            "Lambda      K        H        a         alpha    sigma\n",
            "   0.267   0.0410    0.140     1.90     42.6     43.6 \n",
            "\n",
            "Effective search space used: 129958\n",
            "\n",
            "\n",
            "  Database: ./data/P02340.fasta\n",
            "    Posted date:  Nov 22, 2020  10:45 PM\n",
            "  Number of letters in database: 390\n",
            "  Number of sequences in database:  1\n",
            "\n",
            "\n",
            "\n",
            "Matrix: BLOSUM62\n",
            "Gap Penalties: Existence: 11, Extension: 1\n",
            "Neighboring words threshold: 11\n",
            "Window for multiple hits: 40\n",
            "\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "EqFiw7L9AsYO"
      },
      "source": [
        "# MUSCLE\n",
        "\n",
        "### Objective\n",
        "\n",
        "Use MUSCLE to perform multiple sequence alignments and answer the following question: Are mutations in the p53 protein sequence associated with the development of cancer?\n",
        "\n",
        "### Introduction\n",
        "\n",
        "MUSCLE stands for Multiple Sequence Comparison by Log Expectation. MUSCLE is an open-source algorithm used to perform multiple alignments of protein sequences. In other words, MUSCLE compares two or more sequences to a reference sequence, assigning each a score of how closely the sequence matches the reference. These scores can be used to determine evolutionary relationships, predict protein structure and function, identify residues, and more.\n",
        "\n",
        "In this tutorial, we will continue our exploration of the p53 protein by performing multiple alignments between p53 sequences using MUSCLE. We will compare p53 protein sequences found in species with different cancer rates to test the theory that p53 mutations are associated with the development of cancerous cells. The species with high cancer rates that will be studied in this tutorial are rats and mice. The species with low cancer rates that will be studied are cows (bovine), horses, and beluga whales. We will also study humans and dogs to determine how these compare to the species with known high and low cancer rates.\n",
        "\n",
        "Like with BLAST, we'll use the web interface at first and then show how to run MUSCLE from the notebook.\n",
        "\n",
        "### Tutorial\n",
        "\n",
        "**Step 1:**\n",
        "\n",
        "The first step is to download FASTA files of the protein sequences from UniProt.  We want to download files for the following species in UniProt:\n",
        "\n",
        " * P53_HUMAN: Homo sapiens (Human)\n",
        " * P53_MOUSE: Mus musculus (Mouse)\n",
        " * P53_RAT: Rattus norvegicus (Rat)\n",
        " * P53_DELLE: Delphinapterus leucas (Beluga whale)\n",
        " * P53_HORSE: Equus caballus (Horse)\n",
        " * P53_BOVIN: Bos taurus (Bovine)\n",
        " * P53_CANLF: Canis lupus familiaris (Dog) (Canis familiaris)\n",
        "\n",
        "You can use the notebook or the web interface, we won't use all of these for our in-notebook MUSCLE, just the human, mouse, and rat we already have.\n",
        "\n",
        "**Step 2:**\n",
        "\n",
        "Navigate to MUSCLE (https://www.ebi.ac.uk/Tools/msa/muscle/). Copy the FASTA file text from UniProt and paste into the MUSCLE “STEP 1” text box, shown below. Be sure to include the headings of the FASTA files for each species, just as it appears in UniProt.\n",
        "\n",
        "![](https://i.ibb.co/r7qQLf8/image.png)\n",
        "\n",
        "**Step 3:**\n",
        "\n",
        "Select your desired output format using the drop down shown below. By default, MUSCLE selects the ClustalW output format, which is commonly used among bioinformaticians. For this tutorial, we will keep the output format set to ClustalW.\n",
        "\n",
        "![](https://i.ibb.co/Bn91WK5/image.png)\n",
        "\n",
        "**Step 4:**\n",
        "\n",
        "Notice the box under STEP 3 to be notified by email when results are available. This could be a desirable option if you are performing a computationally-intensive multiple alignment. For this tutorial, the computation should take no more than 10 seconds, so checking this box is unnecessary. Click SUBMIT.\n",
        "\n",
        "![](https://i.ibb.co/g3n0zSf/image.png)\n",
        "\n",
        "**Step 5:**\n",
        "\n",
        "The algorithm will perform the multiple alignment using your given protein sequences and display the results in ClustalW format. Click on the “Phylogenetic Tree” tab to visualize the relationships between the p53 sequences of these species. The “Cladogram” is a simple relational tree with no meaning to the length of the branches, whereas the “Real” tree uses the degree of relatedness to determine the length of each branch.\n",
        "\n",
        "![](https://i.ibb.co/CnCYLs7/image.png)\n",
        "\n",
        "**Step 6:**\n",
        "\n",
        "We will now create a real, interactive phylogenetic tree that more clearly presents the relationships between these species’ p53 sequences. Scrolling down in this tab, you will find the “Tree Data” for this multiple alignment. Copy this text file. If needed, refer to *Appendix A* for the format of the phylogenetic tree text.\n",
        "\n",
        "![](https://i.ibb.co/RB0HPfs/image.png)\n",
        "\n",
        "**Step 7:**\n",
        "\n",
        "Navigate to Interactive Tree of Life’s upload feature (https://itol.embl.de/upload.cgi). This is a useful tool for displaying, annotating, and customizing phylogenetic trees. Enter a tree name if desired and paste the tree text from MUSCLE into the “Tree text” box shown below. Click Upload.\n",
        "\n",
        "![](https://i.ibb.co/bNcBwxJ/image.png)\n",
        "\n",
        "The tool will create the phylogenetic tree shown below. To interpret this tree, think of a horizontal axis of time. The placement of the nodes represents the point in time at which those species may have evolved from a common ancestor. The species with nodes closer to the right are more closely related, while species with nodes closer to the left are less closely related.\n",
        "\n",
        "![](https://i.ibb.co/YX3zCzt/image.png)\n",
        "\n",
        "### Conclusion\n",
        "\n",
        "Let’s think about the implications of this tree with respect to our original question: Are mutations in the p53 protein sequence associated with the development of cancer? To answer this question, recall that rats and mice are known to have high cancer rates, cows (bovine), horses, and beluga whales are known to have low cancer rates, and humans and dogs are our unknowns in this study.\n",
        "\n",
        "The results of the multiple alignment show a close relationship between the rat and mouse p53 sequences and a somewhat close relationship between cow and beluga whale p53 sequences. This could mean that rats and mice have similar mutations in the p53 sequence, resulting in a less effective tumor suppressor protein. On the other hand, cows and beluga whales may have fewer mutations in the p53 sequence, resulting in a more effective tumor suppressor protein. Considering our unknowns, it appears that human p53 sequences are somewhere in between the two extremes, while dog sequences are most similar to horse, cow, and beluga whale sequences.\n",
        "\n",
        "It seems there is a correlation between p53 mutations and cancer rates, however we cannot draw a causal relationship with this data alone.\n",
        "\n",
        "You have now learned to perform multiple sequence alignments using MUSCLE and visualize the results using Interactive Tree of Life.\n",
        "\n",
        "### References\n",
        "\n",
        "Edgar, Robert C. “MUSCLE: multiple sequence alignment with high accuracy and high throughput.” Nucleic acids research vol. 32,5 1792-7. 19 Mar. 2004, doi:10.1093/nar/gkh340\n",
        "\n",
        "Seluanov, Andrei et al. “Mechanisms of cancer resistance in long-lived mammals.” Nature reviews. Cancer vol. 18,7 (2018): 433-441. doi:10.1038/s41568-018-0004-9\n",
        "\n",
        "\n",
        "### Appendix A: Text of the Phylogenetic Tree\n",
        "```\n",
        "(\n",
        "(\n",
        "(\n",
        "(\n",
        "sp|P02340|P53_MOUSE:0.05888,\n",
        "sp|P10361|P53_RAT:0.05252)\n",
        ":0.08665,\n",
        "sp|P04637|P53_HUMAN:0.07571)\n",
        ":0.00908,\n",
        "(\n",
        "sp|P67939|P53_BOVIN:0.08569,\n",
        "sp|Q8SPZ3|P53_DELLE:0.03608)\n",
        ":0.01438)\n",
        ":0.00733,\n",
        "sp|Q29537|P53_CANLF:0.08228,\n",
        "sp|P79892|P53_HORSE:0.03242);\n",
        "```"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "_Ko0dNj8AuPy"
      },
      "source": [
        "## Algorithm\n",
        "\n",
        "We already covered how to align two sequences as part of the BLAST tutorial.  Conceptually, aligning more is the same sort of problem.  However, it is significantly more complicated to actually implement.  The problem is called Multiple Sequence Alignment.\n",
        "\n",
        "While our sequence alignment algorithm can handle BLAST like problems, MUSCLE does something different: it aligns multiple sequences at the same time.  This is quite a bit harder to do, and we aren't going to give a full algorithm for it, but we will describe how it is done.  The key lies in our earlier construction of the table for sequence alignment.  It's not 2D by chance, it's 2D because we have two strings and need an index for each string.  That naturally extends itself to multiple strings: the table will have as many dimensions as we have strings.  Of course, that complicates the possible moves.  At each point, we could add an indel to any combination of the strings, from none to all of them.  The indices for each of those choices can be made using the same pattern that we used for two strings: use the current index if we are inserting an indel, otherwise use 1 before.\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "LgY9_P2qEEi8"
      },
      "source": [
        "## Code\n",
        "\n",
        "We want to be able to preform MUSCLE alignment from code for the same reasons as for BLAST.  Like BLAST, MUSCLE is available on `apt`, so it is easy to install.\n"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "8WiFgwMVERpN",
        "outputId": "d129506e-8d93-48e8-9300-5fe8b4157c74"
      },
      "source": [
        "!sudo apt-get update\n",
        "!sudo apt-get install muscle"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "\r0% [Working]\r            \rHit:1 http://security.ubuntu.com/ubuntu bionic-security InRelease\n",
            "Ign:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  InRelease\n",
            "Hit:3 https://cloud.r-project.org/bin/linux/ubuntu bionic-cran40/ InRelease\n",
            "Hit:4 http://archive.ubuntu.com/ubuntu bionic InRelease\n",
            "Ign:5 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  InRelease\n",
            "Hit:6 http://archive.ubuntu.com/ubuntu bionic-updates InRelease\n",
            "Hit:7 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  Release\n",
            "Hit:8 http://ppa.launchpad.net/c2d4u.team/c2d4u4.0+/ubuntu bionic InRelease\n",
            "Hit:9 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  Release\n",
            "Hit:10 http://archive.ubuntu.com/ubuntu bionic-backports InRelease\n",
            "Hit:11 http://ppa.launchpad.net/graphics-drivers/ppa/ubuntu bionic InRelease\n",
            "Reading package lists... Done\n",
            "Reading package lists... Done\n",
            "Building dependency tree       \n",
            "Reading state information... Done\n",
            "The following NEW packages will be installed:\n",
            "  muscle\n",
            "0 upgraded, 1 newly installed, 0 to remove and 59 not upgraded.\n",
            "Need to get 156 kB of archives.\n",
            "After this operation, 484 kB of additional disk space will be used.\n",
            "Get:1 http://archive.ubuntu.com/ubuntu bionic/universe amd64 muscle amd64 1:3.8.31+dfsg-3 [156 kB]\n",
            "Fetched 156 kB in 0s (1,228 kB/s)\n",
            "debconf: unable to initialize frontend: Dialog\n",
            "debconf: (No usable dialog-like program is installed, so the dialog based frontend cannot be used. at /usr/share/perl5/Debconf/FrontEnd/Dialog.pm line 76, <> line 1.)\n",
            "debconf: falling back to frontend: Readline\n",
            "debconf: unable to initialize frontend: Readline\n",
            "debconf: (This frontend requires a controlling tty.)\n",
            "debconf: falling back to frontend: Teletype\n",
            "dpkg-preconfigure: unable to re-open stdin: \n",
            "Selecting previously unselected package muscle.\n",
            "(Reading database ... 145014 files and directories currently installed.)\n",
            "Preparing to unpack .../muscle_1%3a3.8.31+dfsg-3_amd64.deb ...\n",
            "Unpacking muscle (1:3.8.31+dfsg-3) ...\n",
            "Setting up muscle (1:3.8.31+dfsg-3) ...\n",
            "Processing triggers for man-db (2.8.3-2ubuntu0.1) ...\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "NXgJzwtIEW10"
      },
      "source": [
        "We now have to define functions to call MUSCLE.  Since it takes multiple sequences, we want to have it take multiple files and combine them, using args.  This is why we made `combine` functions earlier.\n",
        "\n",
        "Since muscle produces an output fasta file, we also want to return it as a `FastaFile`.\n",
        "\n",
        "Muscle writes progress output to stderr, so we redirect stderr to `/dev/null` (which is the same as ignoring it) using ` 2> /dev/null`."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "WJRaGHXrEfr-"
      },
      "source": [
        "def muscle(*args):\n",
        "  infile = combine(*args)\n",
        "  in_name = \"MUSCLE_\" + Path(str(infile)).name\n",
        "  outfile = str(in_name)\n",
        "\n",
        "  !muscle -in $infile -out $outfile 2> /dev/null\n",
        "  return FastaFile(outfile)"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "cjQfop1zEk0q"
      },
      "source": [
        "We can now run muscle on human, mouse, and rat versions of p53."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "HTcHxnnJEi38",
        "outputId": "65842662-ec4a-404f-ed15-f445ba29b3ac"
      },
      "source": [
        "muscle_result = muscle(human, mouse, rat)\n",
        "print(muscle_result.display())"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            ">sp|P04637|P53_HUMAN Cellular tumor antigen p53 OS=Homo sapiens OX=9606 GN=TP53 PE=1 SV=4\n",
            "---MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPL---PSQAMDDLMLSPDDIEQWFTEDPGPDEAPRMPEAAPPVAPAPAAPTPAAPAPAPSWPLSSSVPSQKTYQGSYGFRLGFLHSGTAKSVTCTYSPALNKMFCQLAKTCPVQLWVDSTPPPGTRVRAMAIYKQSQHMTEVVRRCPHHERCSDSDGLAPPQHLIRVEGNLRVEYLDDRNTFRHSVVVPYEPPEVGSDCTTIHYNYMCNSSCMGGMNRRPILTIITLEDSSGNLLGRNSFEVRVCACPGRDRRTEEENLRKKGEPHHELPPGSTKRALPNNTSSSPQPKKKPLDGEYFTLQIRGRERFEMFRELNEALELKDAQAGKEPGGSRAHSSHLKSKKGQSTSRHKKLMFKTEGPDSD\n",
            "\n",
            ">sp|P02340|P53_MOUSE Cellular tumor antigen p53 OS=Mus musculus OX=10090 GN=Tp53 PE=1 SV=4\n",
            "MTAMEESQSDISLELPLSQETFSGLWKLLPPEDILP-----SPHCMDDLLL-PQDVEEFFE---GPSEALRVSGAPAAQDPVTETPGPVAPAPATPWPLSSFVPSQKTYQGNYGFHLGFLQSGTAKSVMCTYSPPLNKLFCQLAKTCPVQLWVSATPPAGSRVRAMAIYKKSQHMTEVVRRCPHHERCSDGDGLAPPQHLIRVEGNLYPEYLEDRQTFRHSVVVPYEPPEAGSEYTTIHYKYMCNSSCMGGMNRRPILTIITLEDSSGNLLGRDSFEVRVCACPGRDRRTEEENFRKKEVLCPELPPGSAKRALPTCTSASPPQKKKPLDGEYFTLKIRGRKRFEMFRELNEALELKDAHATEESGDSRAHSSYLKTKKGQSTSRHKKTMVKKVGPDSD\n",
            "\n",
            ">sp|P10361|P53_RAT Cellular tumor antigen p53 OS=Rattus norvegicus OX=10116 GN=Tp53 PE=1 SV=1\n",
            "---MEDSQSDMSIELPLSQETFSCLWKLLPPDDILPTTATGSPNSMEDLFL-PQDVAELLE---GPEEALQVS-APAAQEPGTEAPAPVAPASATPWPLSSSVPSQKTYQGNYGFHLGFLQSGTAKSVMCTYSISLNKLFCQLAKTCPVQLWVTSTPPPGTRVRAMAIYKKSQHMTEVVRRCPHHERCSDGDGLAPPQHLIRVEGNPYAEYLDDRQTFRHSVVVPYEPPEVGSDYTTIHYKYMCNSSCMGGMNRRPILTIITLEDSSGNLLGRDSFEVRVCACPGRDRRTEEENFRKKEEHCPELPPGSAKRALPTSTSSSPQQKKKPLDGEYFTLKIRGRERFEMFRELNEALELKDARAAEESGDSRAHSSYPKTKKGQSTSRHKKPMIKKVGPDSD\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "MEUhtRYqEhuX"
      },
      "source": [
        "We can display the fasta file without labels or spacing, to more easily see the alignment."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "WItdNBwmEokt",
        "outputId": "78f393ed-0041-4298-b5b7-051e21239ad8"
      },
      "source": [
        "print(muscle_result.display(labels=False, spacing=False))"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "---MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPL---PSQAMDDLMLSPDDIEQWFTEDPGPDEAPRMPEAAPPVAPAPAAPTPAAPAPAPSWPLSSSVPSQKTYQGSYGFRLGFLHSGTAKSVTCTYSPALNKMFCQLAKTCPVQLWVDSTPPPGTRVRAMAIYKQSQHMTEVVRRCPHHERCSDSDGLAPPQHLIRVEGNLRVEYLDDRNTFRHSVVVPYEPPEVGSDCTTIHYNYMCNSSCMGGMNRRPILTIITLEDSSGNLLGRNSFEVRVCACPGRDRRTEEENLRKKGEPHHELPPGSTKRALPNNTSSSPQPKKKPLDGEYFTLQIRGRERFEMFRELNEALELKDAQAGKEPGGSRAHSSHLKSKKGQSTSRHKKLMFKTEGPDSD\n",
            "MTAMEESQSDISLELPLSQETFSGLWKLLPPEDILP-----SPHCMDDLLL-PQDVEEFFE---GPSEALRVSGAPAAQDPVTETPGPVAPAPATPWPLSSFVPSQKTYQGNYGFHLGFLQSGTAKSVMCTYSPPLNKLFCQLAKTCPVQLWVSATPPAGSRVRAMAIYKKSQHMTEVVRRCPHHERCSDGDGLAPPQHLIRVEGNLYPEYLEDRQTFRHSVVVPYEPPEAGSEYTTIHYKYMCNSSCMGGMNRRPILTIITLEDSSGNLLGRDSFEVRVCACPGRDRRTEEENFRKKEVLCPELPPGSAKRALPTCTSASPPQKKKPLDGEYFTLKIRGRKRFEMFRELNEALELKDAHATEESGDSRAHSSYLKTKKGQSTSRHKKTMVKKVGPDSD\n",
            "---MEDSQSDMSIELPLSQETFSCLWKLLPPDDILPTTATGSPNSMEDLFL-PQDVAELLE---GPEEALQVS-APAAQEPGTEAPAPVAPASATPWPLSSSVPSQKTYQGNYGFHLGFLQSGTAKSVMCTYSISLNKLFCQLAKTCPVQLWVTSTPPPGTRVRAMAIYKKSQHMTEVVRRCPHHERCSDGDGLAPPQHLIRVEGNPYAEYLDDRQTFRHSVVVPYEPPEVGSDYTTIHYKYMCNSSCMGGMNRRPILTIITLEDSSGNLLGRDSFEVRVCACPGRDRRTEEENFRKKEEHCPELPPGSAKRALPTSTSSSPQQKKKPLDGEYFTLKIRGRERFEMFRELNEALELKDARAAEESGDSRAHSSYPKTKKGQSTSRHKKPMIKKVGPDSD\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "6g0slebgEqjG"
      },
      "source": [
        "# SWISS-MODEL\n",
        "\n",
        "### Objective\n",
        "\n",
        "Use SWISS-MODEL to generate a homology-based 3-D model of the p53 protein so we can associate changes in amino acid sequences with changes to protein structure.\n",
        "\n",
        "### Introduction\n",
        "\n",
        "SWISS-MODEL is a bioinformatics tool used for homology-based 3-D protein modeling. \n",
        "\n",
        "### Tutorial\n",
        "\n",
        "**Step 1:**\n",
        "\n",
        "Access SWISS-MODEL at https://swissmodel.expasy.org/ and select “Start Modeling” \n",
        "\n",
        "![](https://i.ibb.co/hFqSR66/image.png)\n",
        "\n",
        "**Step 2:**\n",
        "\n",
        "This selection will bring you to this window, where you can begin building your protein model by inputting a sequence in FASTA, Clustal, plain string, or UniProtKB AC format. \n",
        "\n",
        "![](https://ibb.co/k62cM9X)\n",
        "\n",
        "**Step 3:**\n",
        "\n",
        "Upload sequence in one of the previously listed accepted formats and click “Build Model”\n",
        "\n",
        "![](https://i.ibb.co/M8h37f4/image.png)\n",
        "\n",
        "**Step 4:**\n",
        "\n",
        "After the program generates homology-based 3-D models, you can view them in this window along with oligo-states, ligands, residue interactions, etc. \n",
        "\n",
        "![](https://i.ibb.co/5hVMBmp/image.png)\n",
        "\n",
        "**Step 5:**\n",
        "\n",
        "You can also observe the alignment the sequence of the particular model you are observing has with the model template sequence, allowing you to see where substitutions with the model template affect the structure of your protein of interest.\n",
        "\n",
        "![](https://i.ibb.co/cvMD5Yw/image.png)\n",
        "\n",
        "**Step 6:**\n",
        "\n",
        "Along with these features, alterations can be made to the 3-D model itself, for applications such as more easily viewing hydrophobicity in different regions of the protein. This can be done by selecting from the toolbar below the model, with this particular selection being “surface”. \n",
        "\n",
        "![](https://i.ibb.co/85sTFx1/image.png)\n",
        "\n",
        "**Step 7:**\n",
        "\n",
        "Switching from the “Models” tab to the “Templates” tab allows you to view more information on homology between your protein models such as oligomeric state, stoichiometry, topology, and interface similarity. The particular selection below is the “stoichiometry” selection. \n",
        "\n",
        "![](https://i.ibb.co/PW6bfp7/image.png)\n",
        "\n",
        "### Conclusion\n",
        "\n",
        "SWISS-MODEL is a powerful tool for investigating the three-dimensional structure emergent from strings of amino acid sequences, allowing the user insight into features of a protein model like sterics, binding affinities, hydrophobicity, and sequence alignment. Along with these features, SWISS-MODEL’s most powerful feature is undoubtedly the ability to group these like protein models by sequence and structural homology, allowing the user an entire dataset from which to view the evolutionary differences between seemingly identical proteins. This tool proves extremely useful in answering our overarching question: How do mutations in the p53 sequence alter the shape of the protein? Through the use of this tool along with support from the other bioinformatic tools we’ve used so far, we can confidently answer that question. \n",
        "\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "yaZ9zsQyHGbt"
      },
      "source": [
        "## Algorithm\n",
        "\n",
        "Protein folding is an extremely complicated problem.  Tools like SWISS-Model do full molecular dynamics simulations to figure out the shape.  Something like that is far beyond the scope of this tutorial.  However, we can tackle the sub problem of secondary structure prediction: predicting whether a sequence of amino acids is an alpha helix, beta sheet, coil, or turn.  Most modern methods use machine learning for this, which is also out of the scope of this tutorial, so we will be focusing on an older but still relatively good method known as GOR, which is a statistics based method.\n",
        "\n",
        "We calculate our statistics using a protein database used by the original GOR, which must be downloaded."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "rSN7S-ewHNmg"
      },
      "source": [
        "gor_data_dir = Path(\"gor/data\")\n",
        "gor_data_dir.mkdir(parents=True, exist_ok=True)\n",
        "\n",
        "import enum\n",
        "class GorType(enum.Enum):\n",
        "  Helix = 0\n",
        "  Coil = 1\n",
        "  Sheet = 2\n",
        "\n",
        "def type_for_code(ss):\n",
        "  if ss == 'C':\n",
        "    return GorType.Coil\n",
        "  elif ss == \"E\":\n",
        "    return GorType.Sheet\n",
        "  elif ss == \"H\":\n",
        "    return GorType.Helix\n",
        "  raise ValueError\n",
        "\n",
        "used_types = [GorType.Helix, GorType.Coil, GorType.Sheet]\n",
        "\n",
        "def other_types(type: GorType):\n",
        "  return [t for t in list(GorType) if not t is type]\n",
        "\n",
        "db_file = Path(\"gor/db.txt\")\n",
        "\n",
        "if not db_file.exists():\n",
        "  wget.download(\"https://raw.githubusercontent.com/martin-steinegger/gor/master/data/GOR/CB513DSSP.db\", \"gor/db.txt\")"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "bB8ZpvWxHPmD"
      },
      "source": [
        "GOR works on 17-amino-acid windows, and uses three count matrices of size 17x20, one for each substructure except turn.  These matrices are constructed from a ground truth set of labeled sequences, which must be found manually.  Each matrix is initialized with ones, and as the training data is iterated over, if the center amino acid is part of the matrix’s substructure, `[aa, i]` is incremented for each position `i` in the window and the amino acid `aa` at that position.\n",
        "\n",
        "Here we initalize our matrices by taking windows from each sequence and initalizeing them as we just described."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "tOjtd29aHWu7"
      },
      "source": [
        "aas = ['A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L', 'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y']\n",
        "\n",
        "with db_file.open('r') as f:\n",
        "  text = [l.split(\"\\n\")[1:] for l in f.read().split(\"\\n\\n\")]\n",
        "  text = [l for l in text if len(l) > 0]\n",
        "  text = [(l[0].split(\" \")[1], l[1].split(\" \")[1]) for l in text]\n",
        "\n",
        "gor_data = {t: np.ones([len(aas), 17], dtype='float32') for t in used_types}\n",
        "\n",
        "for seq, gt in text:\n",
        "  i = 8\n",
        "  while i + 8 < len(seq):\n",
        "    aa_window = seq[i-8:i+9]\n",
        "\n",
        "    ss = gt[i]\n",
        "    mat = gor_data[type_for_code(ss)]\n",
        "\n",
        "    for j in range(len(aa_window)):\n",
        "      aa = aa_window[j]\n",
        "      if not aa in aas:\n",
        "        continue\n",
        "      \n",
        "      aa_idx = aas.index(aa)\n",
        "      mat[aa_idx, j] += 1\n",
        "    i += 1"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "NQBWMp1YHZI2"
      },
      "source": [
        "We can then predict the substructure for an amino acid using information theory: $I(S; R) = log(f(S,R) / f(n-S,R)) + log(f(n-S) / f(S))$.  f is a function to get the frequency, using our calculated data.  S is the substructure, n-S is the other substructures, and R is the residue.  So that function can be said to be the log of the counted matches over the counted non-matches, plus a normalization factor that is the log of the total number of other substructure matches over the number of this substructure’s matches.  We pre-calculate the normalization factors, since they only depend on our matrices, not the input.  "
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "hspNl-OsHbRy",
        "outputId": "0f2fd9a4-dccb-4c71-c497-d07162592060"
      },
      "source": [
        "def norm(overall, type):\n",
        "  return np.log(sum(overall[t] for t in other_types(type)) / overall[type])\n",
        "\n",
        "def compute_normalization_factors(gor_data):\n",
        "  overall = {k: np.sum(v) for k, v in gor_data.items()}\n",
        "  diffs = {t: norm(overall, t) for t in list(GorType)}\n",
        "\n",
        "  return diffs\n",
        "\n",
        "def single_info(type: GorType, aa_idx, index, gor_data, norms):\n",
        "  data = gor_data[type]\n",
        "  norm = norms[type]\n",
        "  fSR = data[aa_idx, index]\n",
        "  fnSR = sum(gor_data[t][aa_idx, index] for t in other_types(type))\n",
        "  return np.log(fSR / fnSR) + norm\n",
        "\n",
        "norms = compute_normalization_factors(gor_data)\n",
        "norms"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "{<GorType.Coil: 1>: 0.32557217228433927,\n",
              " <GorType.Helix: 0>: 0.600947339587764,\n",
              " <GorType.Sheet: 2>: 1.2278688616978475}"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 19
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "J_ckV8emHdG7"
      },
      "source": [
        "Of course, we aren’t just working with one amino acid, so we approximate $I(S ; R_1, ... R_n)$ as $\\sum_{m} I(S; R_m)$ where m is the index in the sequence.  We can then predict the substructure at each amino acid by going through the sequence in windows like we did to initalize the matrices."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 52
        },
        "id": "X1jZzfmQHffT",
        "outputId": "27b21aff-25f3-4b06-fee3-a17f73774fe9"
      },
      "source": [
        "def information(seq, type: GorType, gor_data):\n",
        "  assert len(seq) == 17, f\"Sequence must be of length 17, got {len(seq)}\"\n",
        "  info = 0\n",
        "  for i in range(len(seq)):\n",
        "    aa = seq[i]\n",
        "    if aa not in aas:\n",
        "      continue\n",
        "    aa_idx = aas.index(aa)\n",
        "    info += single_info(type, aa_idx, i, gor_data, norms)\n",
        "  \n",
        "  return info\n",
        "\n",
        "def predict(sequence, gor_data=gor_data):\n",
        "  if len(sequence) < 17:\n",
        "    raise ValueError(\"Sequence is to small\")\n",
        "  \n",
        "  pred = \"NNNNNNNN\"\n",
        "  i = 8\n",
        "  while i + 8 < len(sequence):\n",
        "    aa_window = sequence[i-8:i+9]\n",
        "    infos = [information(aa_window, t, gor_data) for t in list(GorType)]\n",
        "    max_info = np.argmax(infos)\n",
        "    type = list(GorType)[max_info]\n",
        "    pred += type.name[0]\n",
        "    i += 1\n",
        "  \n",
        "  pred += \"NNNNNNNN\"\n",
        "  \n",
        "  return pred\n",
        "\n",
        "predict(human.single.seq)"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "application/vnd.google.colaboratory.intrinsic+json": {
              "type": "string"
            },
            "text/plain": [
              "'NNNNNNNNCCCCCCCHHHHHHHHHSCCCCCCCCCCCHHHHHHHHCCCCCHHSSSSCCCCCCCCCCCHHCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCSSSCCCCCSSSSSSSCCCCSSSSSSSCCCHHHHHHHHCCCCCSSSSSCCCCCCCCSSSHHHHHHHHHCCSSSSSSCCCCCSSCCCCCCCCCHHHSHHHCCCHSSSHCCCCCCSSSSSSCCCCCCCCCCSSSSSSSSSSCCCCCCCCCCCCSSSSSSSCCCCCCSSCCCCCSSSSSSCCCCCCHHHHHHHHHCCCCCSSCCCCCCSSSSCCCCCCCCCCCCCCCCCCSSHHHHHHHHHHHHHHHHHHHHHHHHHHHCCCCCCCSSSSHSSSCCCCCCHHHHHHSSSNNNNNNNN'"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 20
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "UeToQuV5HmdC"
      },
      "source": [
        "Since we use windowing, there are some N/A values at the start and end, but we can see that according to GOR, human p53 is mostly coils and helixes."
      ]
    }
  ]
 }