@mhfowler
Created July 14, 2020 16:55
> "cells": [
> {
> "cell_type": "code",
> "execution_count": 407,
> "metadata": {},
> "outputs": [],
> "source": [
> "text = \"yellow orange red purple blue green\""
> ]
> },
> {
> "cell_type": "code",
> "execution_count": 408,
> "metadata": {},
> "outputs": [
> {
> "name": "stdout",
> "output_type": "stream",
> "text": [
> "yellow orange red purple blue green\n"
> ]
> }
> ],
> "source": [
> "print(text)"
> ]
> },
> {
> "cell_type": "markdown",
> "metadata": {},
> "source": [
> "# Random & String Methods"
> ]
> },
> {
> "cell_type": "markdown",
> "metadata": {},
> "source": [
> "### shuffle words"
> ]
> },
> {
> "cell_type": "code",
> "execution_count": 409,
> "metadata": {},
> "outputs": [
> {
> "name": "stdout",
> "output_type": "stream",
> "text": [
> "green\n",
> "red\n",
> "orange\n",
> "blue\n",
> "yellow\n",
> "purple\n"
> ]
> }
> ],
> "source": [
> "# shuffle words using random.shuffle\n",
> "import random\n",
> "\n",
> "# first we turn text into a list of words\n",
> "words = text.split(' ')\n",
> "\n",
> "# then we shuffle the list\n",
> "random.shuffle(words)\n",
> "\n",
> "# then we print each word in the list\n",
> "for word in words:\n",
> "    print(word)"
> ]
> },
> {
> "cell_type": "markdown",
> "metadata": {},
> "source": [
> "### shuffle letters"
> ]
> },
> {
> "cell_type": "code",
> "execution_count": 410,
> "metadata": {},
> "outputs": [
> {
> "name": "stdout",
> "output_type": "stream",
> "text": [
> "wellyo\n",
> "egraon\n",
> "red\n",
> "puelpr\n",
> "bleu\n",
> "egnre\n"
> ]
> }
> ],
> "source": [
> "# shuffle letters\n",
> "words = text.split(' ')\n",
> "for word in words:\n",
> "    # to use shuffle, you need a list.\n",
> "    # word is a string, so we turn word into a list before shuffling\n",
> "    letter_list = list(word)\n",
> "    random.shuffle(letter_list)\n",
> "    # now we turn letter_list back into a string\n",
> "    new_word = ''.join(letter_list)\n",
> "    print(new_word)"
> ]
> },
> {
> "cell_type": "markdown",
> "metadata": {},
> "source": [
> "### slicing"
> ]
> },
> {
> "cell_type": "code",
> "execution_count": 411,
> "metadata": {},
> "outputs": [
> {
> "name": "stdout",
> "output_type": "stream",
> "text": [
> "ye\n",
> "or\n",
> "re\n",
> "pu\n",
> "bl\n",
> "gr\n"
> ]
> }
> ],
> "source": [
> "# slicing strings: word[:2] keeps only the first two letters\n",
> "words = text.split(' ')\n",
> "for word in words:\n",
> "    new_word = word[:2]\n",
> "    print(new_word)"
> ]
> },
> {
> "cell_type": "markdown",
> "metadata": {},
> "source": [
> "You could change 2 to whatever you like."
> ]
> },
> {
> "cell_type": "code",
> "execution_count": 412,
> "metadata": {},
> "outputs": [
> {
> "name": "stdout",
> "output_type": "stream",
> "text": [
> "llow\n",
> "ange\n",
> "d\n",
> "rple\n",
> "ue\n",
> "een\n"
> ]
> }
> ],
> "source": [
> "# slicing strings: word[2:] drops the first two letters and keeps the rest\n",
> "words = text.split(' ')\n",
> "for word in words:\n",
> "    new_word = word[2:]\n",
> "    print(new_word)"
> ]
> },
> {
> "cell_type": "markdown",
> "metadata": {},
> "source": [
> "### random.sample"
> ]
> },
> {
> "cell_type": "code",
> "execution_count": 413,
> "metadata": {},
> "outputs": [
> {
> "name": "stdout",
> "output_type": "stream",
> "text": [
> "\n",
> "\n",
> "*****\n",
> "yellow\n",
> "red\n",
> "\n",
> "\n",
> "*****\n",
> "orange\n",
> "blue\n",
> "\n",
> "\n",
> "*****\n",
> "blue\n",
> "green\n",
> "purple\n"
> ]
> }
> ],
> "source": [
> "# sampling from a list: random.sample(all_words, 2) picks 2 items at random, never the same item twice\n",
> "import random\n",
> "\n",
> "all_words = text.split(' ')\n",
> "\n",
> "print(\"\")\n",
> "print(\"\")\n",
> "print(\"*****\")\n",
> "a_few_words = random.sample(all_words, 2)\n",
> "for word in a_few_words:\n",
> "    print(word)\n",
> "\n",
> "print(\"\")\n",
> "print(\"\")\n",
> "print(\"*****\")\n",
> "a_few_words = random.sample(all_words, 2)\n",
> "for word in a_few_words:\n",
> "    print(word)\n",
> "\n",
> "print(\"\")\n",
> "print(\"\")\n",
> "print(\"*****\")\n",
> "a_few_words = random.sample(all_words, 3)\n",
> "for word in a_few_words:\n",
> "    print(word)\n"
> ]
> },
> {
> "cell_type": "markdown",
> "metadata": {},
> "source": [
> "### capitalize, upper, lower"
> ]
> },
> {
> "cell_type": "code",
> "execution_count": 414,
> "metadata": {},
> "outputs": [
> {
> "name": "stdout",
> "output_type": "stream",
> "text": [
> "Yellow\n",
> "YELLOW\n",
> "yellow\n",
> "\n",
> "\n",
> "Orange\n",
> "ORANGE\n",
> "orange\n",
> "\n",
> "\n",
> "Red\n",
> "RED\n",
> "red\n",
> "\n",
> "\n",
> "Purple\n",
> "PURPLE\n",
> "purple\n",
> "\n",
> "\n",
> "Blue\n",
> "BLUE\n",
> "blue\n",
> "\n",
> "\n",
> "Green\n",
> "GREEN\n",
> "green\n",
> "\n",
> "\n"
> ]
> }
> ],
> "source": [
> "# capitalize, uppercase, and lowercase\n",
> "\n",
> "words = text.split(' ')\n",
> "for word in words:\n",
> "    capitalize = word.capitalize()\n",
> "    print(capitalize)\n",
> "    upper = word.upper()\n",
> "    print(upper)\n",
> "    lowercase = word.lower()\n",
> "    print(lowercase)\n",
> "    print(\"\")\n",
> "    print(\"\")\n"
> ]
> },
> {
> "cell_type": "markdown",
> "metadata": {},
> "source": [
> "# Cleaning Up Text"
> ]
> },
> {
> "cell_type": "code",
> "execution_count": 415,
> "metadata": {},
> "outputs": [
> {
> "name": "stdout",
> "output_type": "stream",
> "text": [
> " red yellow green blue \n",
> "red yellow green blue\n"
> ]
> }
> ],
> "source": [
> "# if you are working with found text, or any kind of data or stuff you copied and pasted,\n",
> "# you may need to clean it up in some ways\n",
> "# this is sometimes called data cleaning, or data munging\n",
> "\n",
> "weird_text = \" red yellow green blue \"\n",
> "print(weird_text)\n",
> "\n",
> "# strip removes leading and trailing whitespace (spaces, tabs, newlines)\n",
> "stripped_text = weird_text.strip()\n",
> "print(stripped_text)"
> ]
> },
> {
> "cell_type": "code",
> "execution_count": 416,
> "metadata": {},
> "outputs": [],
> "source": [
> "# for cleaning up text,\n",
> "# you may also need to remove or replace strange characters that are not just whitespace\n",
> "# replace is useful for this"
> ]
> },
> {
> "cell_type": "markdown",
> "metadata": {},
> "source": [
> "# Find & Replace "
> ]
> },
> {
> "cell_type": "code",
> "execution_count": 417,
> "metadata": {},
> "outputs": [
> {
> "name": "stdout",
> "output_type": "stream",
> "text": [
> "redyellowbluegreen\n"
> ]
> }
> ],
> "source": [
> "weird_text = \"%%%%%%%%%%%%%%%%%%%%%%red%%%%yellow%%%%blue%%%green%%\"\n",
> "\n",
> "new_text = weird_text.replace(\"%\", \"\")\n",
> "print(new_text)"
> ]
> },
> {
> "cell_type": "code",
> "execution_count": 418,
> "metadata": {},
> "outputs": [
> {
> "name": "stdout",
> "output_type": "stream",
> "text": [
> "purple purple blue blue purple purple blue blue purple purple\n",
> "red red blue blue red red blue blue red red\n"
> ]
> }
> ],
> "source": [
> "source_string = \"purple purple blue blue purple purple blue blue purple purple\"\n",
> "print(source_string)\n",
> "\n",
> "new_colors = source_string.replace('purple', 'red')\n",
> "print(new_colors)"
> ]
> },
> {
> "cell_type": "markdown",
> "metadata": {},
> "source": [
> "# Regular Expressions"
> ]
> },
> {
> "cell_type": "markdown",
> "metadata": {},
> "source": [
> "regular expressions are complicated \\\n",
> "whether you like them or hate them probably depends on your personality\n",
> "\n",
> "what they basically do is let you find and replace things \\\n",
> "but instead of just replacing purple with red, you can replace patterns with other patterns\n",
> "\n",
> "you could replace \\\n",
> "aaaaaaaabbbbbbbbbaaabbbbbbbbbbbaaab \\\n",
> "with \\\n",
> "ababab\n",
> "\n",
> "in english we might say, \\\n",
> "replace every sequence of \"a\" with one \"a\" and every sequence of \"b\" with one \"b\", no matter how long the sequence is\n",
> "\n",
> "in regex we would write, "
> ]
> },
> {
> "cell_type": "code",
> "execution_count": 419,
> "metadata": {},
> "outputs": [
> {
> "name": "stdout",
> "output_type": "stream",
> "text": [
> "ababab\n"
> ]
> }
> ],
> "source": [
> "import re\n",
> "\n",
> "source = 'aaaaaaaabbbbbbbbbaaabbbbbbbbbbbaaab'\n",
> "output = re.sub('a+b+', 'ab', source)\n",
> "print(output)"
> ]
> },
> {
> "cell_type": "code",
> "execution_count": 420,
> "metadata": {},
> "outputs": [],
> "source": [
> "# the following is a regex I once wrote for work\n",
> "r = r'\\|([^\\|]+)\\|\\n\\s*(\\S+)'\n",
> "\n",
> "# reading it is more like slowly decrypting what it does. I don't remember exactly what it was for\n",
> "# but it was for finding and replacing some type of text\n",
> "\n",
> "# they can be cumbersome and cryptic, but they also have a certain beauty, like an alien language\n",
> "\n",
> "# when you are writing regular expressions, it's nice to use a regex editor\n",
> "# it lets you test what it's finding as you write it, and gives you a cheat sheet\n",
> "# like this https://pythex.org/"
> ]
> },
> {
> "cell_type": "markdown",
> "metadata": {},
> "source": [
> "as a related sidenote, \\\n",
> "there is a whole branch of theoretical computer science, \\\n",
> "which among other things explores ways of describing \"how complicated a problem is\"\n",
> "\n",
> "one way of describing how complicated a problem is, is saying how long it would take a computer to find the answer \\\n",
> "this often involves showing that one problem is in some way the same as another problem, for which we already know how long it takes a computer to solve\n",
> "\n",
> "there is a whole class of problems that can be shown to be equivalent to matching a regular expression \\\n",
> "which gives them a sort of special status of being pretty easy for a computer to solve\n",
> "\n",
> "a set of strings that can be described by a regular expression is called a regular language \\\n",
> "notably, regular expressions (in the theoretical sense) are not turing-complete\n",
> "\n",
> "a turing-complete language is a language that can (theoretically) represent anything you would want to compute \\\n",
> "most programming languages including python are turing-complete, but regular expressions (a small language you can use from inside python) are not\n",
> "\n",
> "again this is all from the perspective of theoretical computer science. if you start asking questions like \"can all knowledge or wisdom be represented by a series of text?\" you have left the conventional realm of theoretical computer science, which mostly takes a series of textual symbols as its starting point\n",
> "\n"
> ]
> },
> {
> "cell_type": "markdown",
> "metadata": {},
> "source": [
> "# Sorting\n",
> "\n",
> "sorting is a kind of classic computer science thing for some reason.\n",
> "there are many algorithms for sorting with different names, but in terms of what they output they all do the same thing, and someone else already wrote them, so now in python you can just call .sort()\n",
> "\n",
> "here are some of the cute names of the ways of sorting:\n",
> "\n",
> "Bubble Sort \\\n",
> "Heap Sort \\\n",
> "Insertion Sort \\\n",
> "Merge Sort \\\n",
> "Quick Sort \\\n",
> "Radix Sort \\\n",
> "Selection Sort\n",
> "\n",
> "if you want to sort by something other than alphabetical order, it's possible. I won't explain it here, but it does come up, e.g. this nice project by Sam Lavigne about other ways one might sort their social media feed besides the default feed algorithms: http://otherorders.net"
> ]
> },
> {
> "cell_type": "code",
> "execution_count": 421,
> "metadata": {},
> "outputs": [
> {
> "name": "stdout",
> "output_type": "stream",
> "text": [
> "blue\n",
> "green\n",
> "orange\n",
> "purple\n",
> "red\n",
> "yellow\n"
> ]
> }
> ],
> "source": [
> "words = text.split(' ')\n",
> "words.sort()\n",
> "for word in words:\n",
> "    print(word)"
> ]
> },
> {
> "cell_type": "markdown",
> "metadata": {},
> "source": [
> "# Markov Chains"
> ]
> },
> {
> "cell_type": "code",
> "execution_count": 422,
> "metadata": {},
> "outputs": [
> {
> "name": "stdout",
> "output_type": "stream",
> "text": [
> "Requirement already satisfied: markovify in /Users/maxfowler/opt/anaconda3/lib/python3.7/site-packages (0.8.2)\r\n",
> "Requirement already satisfied: unidecode in /Users/maxfowler/opt/anaconda3/lib/python3.7/site-packages (from markovify) (1.1.1)\r\n"
> ]
> }
> ],
> "source": [
> "!pip install markovify"
> ]
> },
> {
> "cell_type": "code",
> "execution_count": 429,
> "metadata": {},
> "outputs": [
> {
> "name": "stdout",
> "output_type": "stream",
> "text": [
> "In the mood to start an ambient electronic boy band In the mood to start an ambient electronic boy band In the mood to start an ambient electronic boy band In the mood to start every day with calisthenics I just save them for tweets now, it is awesome.\n",
> "The bible states that our bodies are made from dust of the human mind to correlate all its contents.\n",
> "None\n",
> "I think it's rad Really opened my eyes to what sex-positive feminist sex-work should be a default way IDE's display multiple white space in a good way to show your identitys constancy.\n",
> "With sex magick we can actively send healing to the mystical aphorisms of the paint I used to paint models when I was suggesting that social media is like asmr but for bees Writing helps us connect by writing is a reaction let's start a kabutz in Bushwick put your cigarette out in the sun I want to have felt very deeply Sunrise 6: return of the key technologies of the key technologies of the earth.\n",
> "I felt like Seeing and hearing it made me want to listen to music that sounds like a distorted vibraphone.\n",
> "But why read something the same as its opposite.\n",
> "Feel like I'm dancing with the devil nicotine delayed publishing of tweets feels like nothing matters hypnotize yourself that work is a reaction let's start a kabutz in Bushwick put your cigarette out in the modern world.\n"
> ]
> }
> ],
> "source": [
> "import markovify\n",
> "\n",
> "# Get raw text as string.\n",
> "with open(\"/Users/maxfowler/Desktop/corpus.txt\") as f:\n",
> "    text = f.read()\n",
> "\n",
> "# Build the model.\n",
> "text_model = markovify.Text(text)\n",
> "\n",
> "# Print five randomly-generated sentences\n",
> "# (make_sentence returns None when it fails to build a sentence, so \"None\" can show up in the output)\n",
> "for i in range(5):\n",
> "    print(text_model.make_sentence())\n",
> "\n",
> "# Print three randomly-generated sentences of no more than 280 characters\n",
> "for i in range(3):\n",
> "    print(text_model.make_short_sentence(280))"
> ]
> },
> {
> "cell_type": "markdown",
> "metadata": {},
> "source": [
> "# Documentation\n",
> "for complete documentation on random and string methods:\n",
> "- https://docs.python.org/3/library/random.html\n",
> "- https://docs.python.org/3/library/stdtypes.html#string-methods\n",
> "\n",
> "^ there will be lots of methods here you probably won't use often, but it can be helpful for seeing what's available. "
> ]
> },
> {
> "cell_type": "code",
> "execution_count": null,
> "metadata": {},
> "outputs": [],
> "source": []
> }
> ],
> "metadata": {
> "kernelspec": {
> "display_name": "Python 3",
> "language": "python",
> "name": "python3"
> },
> "language_info": {
> "codemirror_mode": {
> "name": "ipython",
> "version": 3
> },
> "file_extension": ".py",
> "mimetype": "text/x-python",
> "name": "python",
> "nbconvert_exporter": "python",
> "pygments_lexer": "ipython3",
> "version": "3.7.6"
> }
> },
> "nbformat": 4,
> "nbformat_minor": 4
> }
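The Regular Expressions cell in the notebook above describes its goal in English as turning every run of "a" into one "a" and every run of "b" into one "b", while the pattern it uses, `a+b+`, matches a run of "a" followed by a run of "b". Here is a minimal sketch of a pattern that sits closer to the English wording, assuming only Python's standard `re` module. The helper name `squeeze_runs` is made up for this example and is not part of the notebook.

```python
import re

def squeeze_runs(s):
    """Collapse every run of a repeated character into a single character."""
    # (.) captures one character; \1+ matches one or more repeats of that same character
    return re.sub(r'(.)\1+', r'\1', s)

source = 'aaaaaaaabbbbbbbbbaaabbbbbbbbbbbaaab'

# both approaches agree on the notebook's example
print(squeeze_runs(source))              # ababab
print(re.sub('a+b+', 'ab', source))      # ababab

# they differ when the b-run comes first
print(squeeze_runs('bbbaaa'))            # ba
print(re.sub('a+b+', 'ab', 'bbbaaa'))    # bbbaaa (no a-run followed by a b-run, so nothing changes)
```

The `\1` backreference is a feature of Python's `re` engine. It goes beyond what the theoretical regular expressions in the sidenote can do, which is part of why real-world regex engines are more powerful than the textbook version.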
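The Sorting cell in the notebook above says that sorting by something other than alphabetical order is possible but does not show how. Here is a minimal sketch using the `key` and `reverse` arguments of Python's built-in `sorted` and `list.sort`, on the same `text` string the notebook starts with. Sorting by word length and by last letter are only illustrative choices, and the helper `last_letter` is made up for this example.

```python
text = "yellow orange red purple blue green"
words = text.split(' ')

# sort by word length instead of alphabetically
# (sorted is stable, so words of equal length keep their original order)
by_length = sorted(words, key=len)
print(by_length)

# sort by the last letter of each word
def last_letter(word):
    return word[-1]

by_last_letter = sorted(words, key=last_letter)
print(by_last_letter)

# sort alphabetically, but backwards; list.sort changes the list in place
words.sort(reverse=True)
print(words)
```

`key` takes any function that turns an item into the value you want to compare, so the same pattern works for sorting by vowel count, by a score looked up in a dictionary, and so on.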
(base) maxfowler@MacBook-Pro-8
~/computer/projects/dll/class-materials $ cat Bag\ Of\ Text\ Methods.ipynb
{
"cells": [
{
"cell_type": "code",
"execution_count": 407,
"metadata": {},
"outputs": [],
"source": [
"text = \"yellow orange red purple blue green\""
]
},
{
"cell_type": "code",
"execution_count": 408,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"yellow orange red purple blue green\n"
]
}
],
"source": [
"print(text)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Random & String Methods"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### shuffle words"
]
},
{
"cell_type": "code",
"execution_count": 409,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"green\n",
"red\n",
"orange\n",
"blue\n",
"yellow\n",
"purple\n"
]
}
],
"source": [
"# shuffle words using random.shuffle\n",
"import random\n",
"\n",
"# first we turn text into a list of words\n",
"words = text.split(' ')\n",
"\n",
"# then we shuffle the list\n",
"random.shuffle(words)\n",
"\n",
"# then we print each word in the list\n",
"for word in words:\n",
"    print(word)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### shuffle letters"
]
},
{
"cell_type": "code",
"execution_count": 410,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"wellyo\n",
"egraon\n",
"red\n",
"puelpr\n",
"bleu\n",
"egnre\n"
]
}
],
"source": [
"# shuffle letters\n",
"words = text.split(' ')\n",
"for word in words:\n",
"    # to use shuffle, you need a list.\n",
"    # word is a string, so we turn word into a list before shuffling\n",
"    letter_list = list(word)\n",
"    random.shuffle(letter_list)\n",
"    # now we turn letter_list back into a string\n",
"    new_word = ''.join(letter_list)\n",
"    print(new_word)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### slicing"
]
},
{
"cell_type": "code",
"execution_count": 411,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"ye\n",
"or\n",
"re\n",
"pu\n",
"bl\n",
"gr\n"
]
}
],
"source": [
"# slicing strings: word[:2] keeps the first two letters\n",
"words = text.split(' ')\n",
"for word in words:\n",
"    new_word = word[:2]\n",
"    print(new_word)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You could change 2 to whatever you like."
]
},
{
"cell_type": "code",
"execution_count": 412,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"llow\n",
"ange\n",
"d\n",
"rple\n",
"ue\n",
"een\n"
]
}
],
"source": [
"# slicing strings: word[2:] keeps everything after the first two letters\n",
"words = text.split(' ')\n",
"for word in words:\n",
"    new_word = word[2:]\n",
"    print(new_word)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### random.sample"
]
},
{
"cell_type": "code",
"execution_count": 413,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"\n",
"*****\n",
"yellow\n",
"red\n",
"\n",
"\n",
"*****\n",
"orange\n",
"blue\n",
"\n",
"\n",
"*****\n",
"blue\n",
"green\n",
"purple\n"
]
}
],
"source": [
"# sampling from a list\n",
"# random.sample(some_list, n) picks n items at random, never the same item twice\n",
"import random\n",
"\n",
"all_words = text.split(' ')\n",
"\n",
"print(\"\")\n",
"print(\"\")\n",
"print(\"*****\")\n",
"a_few_words = random.sample(all_words, 2)\n",
"for word in a_few_words:\n",
"    print(word)\n",
"\n",
"print(\"\")\n",
"print(\"\")\n",
"print(\"*****\")\n",
"a_few_words = random.sample(all_words, 2)\n",
"for word in a_few_words:\n",
"    print(word)\n",
"\n",
"print(\"\")\n",
"print(\"\")\n",
"print(\"*****\")\n",
"a_few_words = random.sample(all_words, 3)\n",
"for word in a_few_words:\n",
"    print(word)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### capitalize, upper, lower"
]
},
{
"cell_type": "code",
"execution_count": 414,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Yellow\n",
"YELLOW\n",
"yellow\n",
"\n",
"\n",
"Orange\n",
"ORANGE\n",
"orange\n",
"\n",
"\n",
"Red\n",
"RED\n",
"red\n",
"\n",
"\n",
"Purple\n",
"PURPLE\n",
"purple\n",
"\n",
"\n",
"Blue\n",
"BLUE\n",
"blue\n",
"\n",
"\n",
"Green\n",
"GREEN\n",
"green\n",
"\n",
"\n"
]
}
],
"source": [
"# capitalize, uppercase, and lowercase\n",
"\n",
"words = text.split(' ')\n",
"for word in words:\n",
"    capitalize = word.capitalize()\n",
"    print(capitalize)\n",
"    upper = word.upper()\n",
"    print(upper)\n",
"    lowercase = word.lower()\n",
"    print(lowercase)\n",
"    print(\"\")\n",
"    print(\"\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Cleaning Up Text"
]
},
{
"cell_type": "code",
"execution_count": 415,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" red yellow green blue \n",
"red yellow green blue\n"
]
}
],
"source": [
"# if you are working with found text, or any kind of data or stuff you copy and pasted, \n",
"# you may need to clean it up in some ways \n",
"# this is sometimes called data cleaning, or data munging\n",
"\n",
"weird_text = \" red yellow green blue \"\n",
"print(weird_text)\n",
"\n",
"# strip removes leading and trailing whitespace (spaces, tabs, newlines)\n",
"stripped_text = weird_text.strip()\n",
"print(stripped_text)"
]
},
{
"cell_type": "code",
"execution_count": 416,
"metadata": {},
"outputs": [],
"source": [
"# for cleaning up text, \n",
"# you may also need to remove and replace strange characters that are not just white space\n",
"# replace is useful for this"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Find & Replace "
]
},
{
"cell_type": "code",
"execution_count": 417,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"redyellowbluegreen\n"
]
}
],
"source": [
"weird_text = \"%%%%%%%%%%%%%%%%%%%%%%red%%%%yellow%%%%blue%%%green%%\"\n",
"\n",
"new_text = weird_text.replace(\"%\", \"\")\n",
"print(new_text)"
]
},
{
"cell_type": "code",
"execution_count": 418,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"purple purple blue blue purple purple blue blue purple purple\n",
"red red blue blue red red blue blue red red\n"
]
}
],
"source": [
"source_string = \"purple purple blue blue purple purple blue blue purple purple\"\n",
"print(source_string)\n",
"\n",
"new_colors = source_string.replace('purple', 'red')\n",
"print(new_colors)"
]
},
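{
"cell_type": "markdown",
"metadata": {},
"source": [
"a small sketch that combines strip and replace for cleaning up text \\\n",
"strip can also be given the characters to remove from both ends, and replace calls can be chained \\\n",
"the variable names and the messy text below are made up for this example"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# made-up messy text for this sketch\n",
"messy_text = \"**red, yellow;; green,, blue**\"\n",
"\n",
"# strip('*') removes any * characters from the start and end of the string\n",
"no_stars = messy_text.strip('*')\n",
"\n",
"# each replace returns a new string, so the calls can be chained\n",
"cleaned = no_stars.replace(';', '').replace(',', '')\n",
"print(cleaned)"
]
},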
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Regular Expressions"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"regular expressions are complicated \\\n",
"whether you like them or hate them probably depends on your personality\n",
"\n",
"what they basically do, is let you find and replace things \\\n",
"but instead of just replacing purple with red, you can replace patterns with other patterns\n",
"\n",
"you could replace \\\n",
"aaaaaaaabbbbbbbbbaaabbbbbbbbbbbaaab \\\n",
"with \\\n",
"ababab\n",
"\n",
"in english we might say, \\\n",
"replace every sequence of \"a\" with one \"a\" and every sequence of \"b\" with one \"b\", no matter how long the sequence is\n",
"\n",
"in regex we would write, "
]
},
{
"cell_type": "code",
"execution_count": 419,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"ababab\n"
]
}
],
"source": [
"import re\n",
"\n",
"source = 'aaaaaaaabbbbbbbbbaaabbbbbbbbbbbaaab'\n",
"output = re.sub('a+b+', 'ab', source)\n",
"print(output)"
]
},
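{
"cell_type": "markdown",
"metadata": {},
"source": [
"the pattern above only works when every run of \"a\" is followed by a run of \"b\" \\\n",
"a minimal sketch of the more literal version of the english sentence (collapse every run of \"a\" or \"b\" down to one letter) could look like the next cell \\\n",
"the names `collapse_runs` and `messy` are made up for this example"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import re\n",
"\n",
"# hypothetical helper for this sketch, not part of the re module\n",
"def collapse_runs(s):\n",
"    # 'a+|b+' matches one whole run of a's or one whole run of b's\n",
"    # the lambda keeps only the first letter of each run\n",
"    return re.sub('a+|b+', lambda match: match.group(0)[0], s)\n",
"\n",
"# made-up input that starts with b, which 'a+b+' alone would not fully handle\n",
"messy = 'bbbaaaaabbbbaab'\n",
"print(collapse_runs(messy))"
]
},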
{
"cell_type": "code",
"execution_count": 420,
"metadata": {},
"outputs": [],
"source": [
"# the following is a regex I once wrote for work\n",
"r = r'\\|([^\\|]+)\\|\\n\\s*(\\S+)'\n",
"\n",
"# reading it is more like slowly decrypting what it does. I don't remember exactly what it was for\n",
"# but it was for finding and replacing some type of text \n",
"\n",
"# they can be cumbersome and cryptic, but they also have a certain beauty like an alien language \n",
"\n",
"# when you are writing regular expressions, it's nice to use a regex editor\n",
"# it lets you test what it's finding as you write it, and gives you a cheat sheet\n",
"# like this https://pythex.org/"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"as a related sidenote, \\\n",
"there is a whole branch of theoretical computer science, \\\n",
"which among other things explores ways of describing \"how complicated a problem is\"\n",
"\n",
"one way of describing how complicated a problem is, is saying how long it would take a computer to find the answer \\\n",
"this often involves showing that one problem is in some way the same as another problem which we already know how long it takes a computer to solve \n",
"\n",
"there is a whole class of problems that can be shown to be equivalent to regular expressions \\\n",
"which gives them a sort of special status of being pretty easy to solve by a computer\n",
"\n",
"the set of strings that a regular expression can match is called a regular language \\\n",
"notably, regular expressions are not a turing-complete language \n",
"\n",
"a turing-complete language is a language that can (theoretically) represent anything you would want to compute \\\n",
"most programming languages including python are turing complete, but regular expressions (a small pattern language you can use from inside python) are not\n",
"\n",
"again this is all from the perspective of theoretical computer science. if you start asking questions like \"can all knowledge or wisdom be represented by a series of text?\" you have left the conventional realm of theoretical computer science, which mostly takes a series of textual symbols as its starting point\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Sorting\n",
"\n",
"sorting is a kind of classic computer science thing for some reason.\n",
"there are many algorithms for sorting with different names, but in terms of what they output they all do the same thing and someone else already wrote them so now in python you can just call .sort\n",
"\n",
"here are some of the cute names of the ways of sorting:\n",
"\n",
"Bubble Sort \\\n",
"Heap Sort \\\n",
"Insertion Sort \\\n",
"Merge Sort \\\n",
"Quick Sort \\\n",
"Radix Sort \\\n",
"Selection Sort\n",
"\n",
"if you want to sort by something other than alphabetical order, it's possible. I won't explain it here, but it does come up, e.g. this nice project by Sam Lavigne about other ways one might sort their social media feed besides the default feed algorithms http://otherorders.net"
]
},
{
"cell_type": "code",
"execution_count": 421,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"blue\n",
"green\n",
"orange\n",
"purple\n",
"red\n",
"yellow\n"
]
}
],
"source": [
"words = text.split(' ')\n",
"words.sort()\n",
"for word in words:\n",
"    print(word)"
]
},
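{
"cell_type": "markdown",
"metadata": {},
"source": [
"as a rough sketch of sorting by something other than alphabetical order: \\\n",
"python's `sorted` and `.sort` both accept a `key` function, and the list is ordered by whatever that function returns for each item \\\n",
"the list `colors` and the other variable names below are made up for this example"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# example data, made up for this sketch\n",
"colors = ['yellow', 'orange', 'red', 'purple', 'blue', 'green']\n",
"\n",
"# sort by the length of each word, shortest first\n",
"# words of the same length keep their original order\n",
"by_length = sorted(colors, key=len)\n",
"print(by_length)\n",
"\n",
"# sort by the last letter of each word\n",
"by_last_letter = sorted(colors, key=lambda word: word[-1])\n",
"print(by_last_letter)"
]
},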
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Markov Chains"
]
},
{
"cell_type": "code",
"execution_count": 422,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Requirement already satisfied: markovify in /Users/maxfowler/opt/anaconda3/lib/python3.7/site-packages (0.8.2)\r\n",
"Requirement already satisfied: unidecode in /Users/maxfowler/opt/anaconda3/lib/python3.7/site-packages (from markovify) (1.1.1)\r\n"
]
}
],
"source": [
"!pip install markovify"
]
},
{
"cell_type": "code",
"execution_count": 429,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"In the mood to start an ambient electronic boy band In the mood to start an ambient electronic boy band In the mood to start an ambient electronic boy band In the mood to start every day with calisthenics I just save them for tweets now, it is awesome.\n",
"The bible states that our bodies are made from dust of the human mind to correlate all its contents.\n",
"None\n",
"I think it's rad Really opened my eyes to what sex-positive feminist sex-work should be a default way IDE's display multiple white space in a good way to show your identitys constancy.\n",
"With sex magick we can actively send healing to the mystical aphorisms of the paint I used to paint models when I was suggesting that social media is like asmr but for bees Writing helps us connect by writing is a reaction let's start a kabutz in Bushwick put your cigarette out in the sun I want to have felt very deeply Sunrise 6: return of the key technologies of the key technologies of the earth.\n",
"I felt like Seeing and hearing it made me want to listen to music that sounds like a distorted vibraphone.\n",
"But why read something the same as its opposite.\n",
"Feel like I'm dancing with the devil nicotine delayed publishing of tweets feels like nothing matters hypnotize yourself that work is a reaction let's start a kabutz in Bushwick put your cigarette out in the modern world.\n"
]
}
],
"source": [
"import markovify\n",
"\n",
"# Get raw text as string.\n",
"with open(\"/Users/maxfowler/Desktop/corpus.txt\") as f:\n",
"    text = f.read()\n",
"\n",
"# Build the model.\n",
"text_model = markovify.Text(text)\n",
"\n",
"# Print five randomly-generated sentences\n",
"# (make_sentence returns None when it can't build a sentence that differs enough from the corpus)\n",
"for i in range(5):\n",
"    print(text_model.make_sentence())\n",
"\n",
"# Print three randomly-generated sentences of no more than 280 characters\n",
"for i in range(3):\n",
"    print(text_model.make_short_sentence(280))"
]
},
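{
"cell_type": "markdown",
"metadata": {},
"source": [
"markovify hides how a markov chain works, so here is a tiny sketch of the idea \\\n",
"it assumes the simplest possible version, where the next word depends only on the current word \\\n",
"this is not how markovify itself is written, and `build_chain`, `generate`, and `sample_text` are names made up for this example"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import random\n",
"\n",
"# made-up sample text for this sketch\n",
"sample_text = \"the red bird sees the blue bird and the blue bird sees the red sky\"\n",
"\n",
"def build_chain(source_text):\n",
"    # map each word to the list of words that follow it in the text\n",
"    chain = {}\n",
"    words = source_text.split(' ')\n",
"    for current_word, next_word in zip(words, words[1:]):\n",
"        chain.setdefault(current_word, []).append(next_word)\n",
"    return chain\n",
"\n",
"def generate(chain, start_word, max_words):\n",
"    # walk the chain, picking a random follower at each step\n",
"    output = [start_word]\n",
"    while len(output) < max_words:\n",
"        followers = chain.get(output[-1])\n",
"        if not followers:\n",
"            # the last word never had a follower in the text, so we stop\n",
"            break\n",
"        output.append(random.choice(followers))\n",
"    return ' '.join(output)\n",
"\n",
"chain = build_chain(sample_text)\n",
"print(generate(chain, 'the', 10))"
]
},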
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Documentation\n",
"for complete documentation on random and string methods:\n",
"- https://docs.python.org/3/library/random.html\n",
"- https://docs.python.org/3/library/stdtypes.html#string-methods\n",
"\n",
"^ there will be lots of methods here you probably won't use often, but it can be helpful for seeing what's available. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.6"
}
},
"nbformat": 4,
"nbformat_minor": 4
}