{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Old dutch spelling to new dutch spelling\n",
"On February 10 2017 the 27th edition of the Computational Linguistics in the Netherlands conference will be held in Leuven. As part of this conference they organise a \"shared task\". This year, this task is a competition in \"translating\" the bible from old dutch to new dutch. \n",
"\n",
"In 1637 the first translation of the bible from the original Hebrew, Aramaic, and Greek languages to Dutch was published (https://en.wikipedia.org/wiki/Statenvertaling). As the dutch language changed over the years (under influence of Spain and other countries that conquered us) the bible was translated to newer and newer dutch. \n",
"\n",
"The difference can be spotted easily between the 1637 and 1888 version: \n",
"1637: De Aerde nu was woest ende ledich, ende duysternisse was op den afgront: ende de Geest Godts sweefde op de Wateren.\n",
"1888: De aarde nu was woest en ledig, en duisternis was op den afgrond; en de Geest Gods zweefde op de wateren.\n",
"If you are dutch you can probably read the 1637 version, although you will need some time to find the right words. The letters ae (aerde) changed to aa (aarde), ch (ledich) changed to g (ledig) and some words ending in t (afgront) now end with the d (afgrond). Even the word Godt changed to God in just 250 years. \n",
"\n",
"Apart from the spelling some words completely changed. For example: \n",
"1637: Ende sy waren beyde naeckt, Adam ende sijn wijf:\n",
"1888: En zij waren beiden naakt, Adam en zijn vrouw;\n",
"Nowadays calling a woman a \"wijf\" would be an insult..\n",
"\n",
"As it takes historians a long time to read old texts in their original form, we would like to make them a bit more readable. What is interesting is that the Google search bar understands what you want to say with their autocorrect: \n",
"<img src=\"google_result.png\">\n",
"How you can implement Google's autocorrect was written down by Peter Norvig in this excellent post: http://norvig.com/spell-correct.html . \n",
"\n",
"Using the \"Autocorrect function\" you can build a dictionary of words from old dutch to new dutch (if you would like to see this in a blogpost, please contact me). What I wanted to do was train a dictionary on a small number of words, and use neural networks to generalise the conversion of old dutch to new dutch. \n",
"\n",
"### Preparation\n",
"I automatically created a dictionary with the old dutch and new dutch version of 20.852 words. If this is enough for deep neural networks is something we will find out at the end of this project. Adding more data is difficult as the only aligned old-new data I have is the bible with 37.235 lines of text. \n",
"\n",
"### The network\n",
"My plan is to use a recurrent neural network (https://en.wikipedia.org/wiki/Recurrent_neural_network) encoder that reads all characters, and a recurrent neural network decoder that generates characters. The data should be preprocessed with this idea in mind... This means setting a max length to the words that we want to transform, and some other tricks I will discuss later. "
]
},
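{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Building the dictionary (a sketch)\n",
"The following cell is a minimal sketch of the dictionary-building step mentioned above, in the spirit of Norvig's spelling corrector; it is not the exact procedure used to create dictionary_old_new_dutch.csv. The corpus file names (bible_1637.txt, bible_1888.txt) and the helper functions are assumptions for illustration only. The idea: treat the modern (1888) text as the vocabulary of \"correct\" words and map every old (1637) word to the most frequent modern word within two edits."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Sketch only: one plausible way to build the old->new dictionary with a Norvig-style corrector.\n",
"# The corpus file names below are assumptions, not files from the original notebook.\n",
"import re\n",
"from collections import Counter\n",
"\n",
"def words(text):\n",
"    return re.findall(r'[a-z]+', text.lower())\n",
"\n",
"# Frequencies of modern (1888) Dutch words act as the \"known word\" model\n",
"MODERN = Counter(words(open('bible_1888.txt').read()))\n",
"\n",
"def edits1(word):\n",
"    # All strings that are one edit (delete, transpose, replace, insert) away from word\n",
"    letters = 'abcdefghijklmnopqrstuvwxyz'\n",
"    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]\n",
"    deletes = [L + R[1:] for L, R in splits if R]\n",
"    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]\n",
"    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]\n",
"    inserts = [L + c + R for L, R in splits for c in letters]\n",
"    return set(deletes + transposes + replaces + inserts)\n",
"\n",
"def known(candidates):\n",
"    return {w for w in candidates if w in MODERN}\n",
"\n",
"def modernise(old_word):\n",
"    # Prefer the word itself, then words one edit away, then two edits away, else keep it unchanged\n",
"    candidates = (known([old_word]) or known(edits1(old_word))\n",
"                  or known(e2 for e1 in edits1(old_word) for e2 in edits1(e1))\n",
"                  or [old_word])\n",
"    return max(candidates, key=lambda w: MODERN.get(w, 0))\n",
"\n",
"# Write one old,new pair per line, the format read further down in this notebook\n",
"with open('dictionary_old_new_dutch.csv', 'w') as out_file:\n",
"    for old_word in sorted(set(words(open('bible_1637.txt').read()))):\n",
"        out_file.write(old_word + ',' + modernise(old_word) + '\\n')"
]
},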
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Using TensorFlow backend.\n"
]
}
],
"source": [
"import random\n",
"from keras.preprocessing import sequence\n",
"import tensorflow as tf\n",
"import numpy as np\n"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"heyligden -> heiligden\n",
"arsaces -> arsaces\n",
"avondt -> avond\n",
"uythouwt -> uithouwt\n",
"helam -> hem\n",
"gemaeckt -> gemaakt\n",
"onechte -> onechte\n",
"vuyst -> vast\n",
"imri -> imri\n",
"hemelsche -> hemelse\n"
]
}
],
"source": [
"Xdata = []\n",
"Ydata = []\n",
"MAX_LENGTH_WORD = 10\n",
"\n",
"\n",
"feature_dict= dict()\n",
"feature_list = list()\n",
"\n",
"PADDING_CHARACTER = '~'\n",
"feature_dict[PADDING_CHARACTER]=0\n",
"feature_list.append(PADDING_CHARACTER)\n",
"max_features = 1\n",
"\n",
"def get_vector_from_string(input_s):\n",
" global max_features\n",
" vector_x = []\n",
" for i in input_s:\n",
" if i not in feature_dict:\n",
" feature_dict[i]=max_features\n",
" feature_list.append(i)\n",
" max_features += 1\n",
" vector_x.append(feature_dict[i])\n",
" return vector_x\n",
"def add_to_data(input_s,output_s):\n",
" if len(input_s) < MAX_LENGTH_WORD and len(output_s) < MAX_LENGTH_WORD:\n",
" vector_x = get_vector_from_string(input_s)\n",
" vector_y = get_vector_from_string(output_s)\n",
" Xdata.append(vector_x)\n",
" Ydata.append(vector_y)\n",
"\n",
" \n",
"def print_vector(vector,end_token='\\n'):\n",
" print(''.join([feature_list[i] for i in vector]),end=end_token)\n",
" \n",
"with open(\"dictionary_old_new_dutch.csv\") as in_file:\n",
" for line in in_file:\n",
" in_s,out_s = line.strip().split(\",\")\n",
" add_to_data(in_s,out_s)\n",
"for i in range(10):\n",
" print_vector(Xdata[i],end_token='')\n",
" print(' -> ', end='')\n",
" print_vector(Ydata[i])\n",
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Data preprocessing\n",
"As mentioned above I would like to use a sequence to sequence approach. Important for this approach is having a certain length of words. Words that are longer than that length have been discarded in de data-reading step above. Now we will add paddings to the words that are not long enough. \n",
"\n",
"\n",
"Another important step is creating a train and a test set. We only show the network examples from the train set. At the end I will manually evaluate some of the examples in the testset and discuss what the network learned. During training we train in batches with a small amount of data. With a random data splitter we get a different trainset every run. \n"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"~heyligden -> after padding: ~heyligden\n"
]
}
],
"source": [
"\n",
"before_padding = Xdata[0]\n",
"Xdata = sequence.pad_sequences(Xdata, maxlen=MAX_LENGTH_WORD)\n",
"Ydata = sequence.pad_sequences(Ydata, maxlen=MAX_LENGTH_WORD)\n",
"after_padding = Xdata[0]\n",
"\n",
"print_vector(before_padding,end_token='')\n",
"print(\" -> after padding: \", end='')\n",
"print_vector(after_padding)\n",
"\n",
"class DataSplitter:\n",
" def __init__(self,percentage):\n",
" self.percentage = percentage\n",
" def split_data(self,data):\n",
" splitpoint = int(len(data)*self.percentage)\n",
" return data[:splitpoint], data[splitpoint:]\n",
"splitter = DataSplitter(0.8)\n",
"Xdata,Xtest = splitter.split_data(Xdata)\n",
"Ydata,Ytest = splitter.split_data(Ydata)\n",
"\n",
"def get_random_reversed_dataset(Xdata,Ydata,batch_size):\n",
" newX = []\n",
" newY = []\n",
" for _ in range(batch_size):\n",
" index_taken = random.randint(0,len(Xdata)-1)\n",
" newX.append(Xdata[index_taken])\n",
" newY.append(Ydata[index_taken][::-1])\n",
" return newX,newY"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### The network\n",
"\n",
"Time to implement everything in Tensorflow. We use the embedding_attention_seq2seq function. This function:\n",
"- embeds our characters\n",
"- has an encoder which returns a sequence of outputs\n",
"- has an attention model which uses this sequence to generate output characters"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"\n",
"\n",
"batch_size = 64\n",
"memory_dim = 256\n",
"embedding_dim = 32\n",
"\n",
"enc_input = [tf.placeholder(tf.int32, shape=(None,)) for i in range(MAX_LENGTH_WORD)]\n",
"dec_output = [tf.placeholder(tf.int32, shape=(None,)) for t in range(MAX_LENGTH_WORD)]\n",
"\n",
"weights = [tf.ones_like(labels_t, dtype=tf.float32) for labels_t in enc_input]\n",
"\n",
"dec_inp = ([tf.zeros_like(enc_input[0], dtype=np.int32)]+[dec_output[t] for t in range(MAX_LENGTH_WORD-1)])\n",
"empty_dec_inp = ([tf.zeros_like(enc_input[0], dtype=np.int32,name=\"empty_dec_input\") for t in range(MAX_LENGTH_WORD)])\n",
"\n",
"cell = tf.nn.rnn_cell.GRUCell(memory_dim)\n",
"\n",
"# Create a train version of encoder-decoder, and a test version which does not feed the previous input\n",
"with tf.variable_scope(\"decoder1\") as scope:\n",
" outputs, _ = tf.nn.seq2seq.embedding_attention_seq2seq(enc_input,dec_inp,\n",
" cell,max_features,max_features,\n",
" embedding_dim, feed_previous=False)\n",
"with tf.variable_scope(\"decoder1\",reuse=True) as scope:\n",
" runtime_outputs, _ = tf.nn.seq2seq.embedding_attention_seq2seq(enc_input,empty_dec_inp,\n",
" cell,max_features,max_features,\n",
" embedding_dim,feed_previous=True)\n",
"\n",
"loss = tf.nn.seq2seq.sequence_loss(outputs, dec_output, weights, max_features)\n",
"\n",
"optimizer = tf.train.AdamOptimizer()\n",
"train_op = optimizer.minimize(loss)\n",
"\n",
"# Init everything\n",
"sess = tf.InteractiveSession()\n",
"sess.run(tf.initialize_all_variables())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Training\n",
"Time for training! I will show the network 64.000 words. This means that each word is seen around 2-3 times. Every 100 batches I will print the loss to see how well the network is doing."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"3.19854\n",
"1.4743\n",
"0.787886\n",
"0.646052\n",
"0.432828\n",
"0.271531\n",
"0.320864\n",
"0.228205\n",
"0.382146\n",
"0.253352\n",
"0.319629\n"
]
}
],
"source": [
"for index_now in range(1002):\n",
" Xin,Yin = get_random_reversed_dataset(Xdata,Ydata,batch_size)\n",
" Xin = np.array(Xin).T\n",
" Yin = np.array(Yin).T\n",
" feed_dict = {enc_input[t]: Xin[t] for t in range(MAX_LENGTH_WORD)}\n",
" feed_dict.update({dec_output[t]: Yin[t] for t in range(MAX_LENGTH_WORD)})\n",
" _, l = sess.run([train_op,loss], feed_dict)\n",
" if index_now%100==1:\n",
" print(l)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Train analysis\n",
"Looks like we are learning something! The loss is going down the first 500 steps. After that the loss is not reduced a lot anymore. This is possible because natural language is difficult, and rules of old dutch to new dutch are not always consistent. Without an additional dictionary to verify the solutions the network made I think it will be difficult to train a perfect network. \n",
"\n",
"Now it's test time! Let's input some words the network has not seen before and see what rules it learned. "
]
},
{
"cell_type": "code",
"execution_count": 74,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"input: ~~~~wreket\n",
"expected: ~~~~wreekt\n",
"output: ~~~~wreekt\n",
"==============\n",
"input: ~~~cleedet\n",
"expected: ~~~~kleedt\n",
"output: ~~~~kledet\n",
"==============\n",
"input: ~~leydende\n",
"expected: ~~leidende\n",
"output: ~~leidende\n",
"==============\n",
"input: ~~~~yemant\n",
"expected: ~~~~iemand\n",
"output: ~~~~jemand\n",
"==============\n",
"input: ~frisscher\n",
"expected: ~~~frisser\n",
"output: ~~frissher\n",
"==============\n",
"aerde becomes: aarde\n",
"duyster becomes: duister\n",
"salfde becomes: zalfde\n",
"ontstondt becomes: ontstond\n",
"tusschen becomes: tuschen\n",
"wacker becomes: wakker\n",
"voorraet becomes: vorraat\n",
"gevreeset becomes: gevreet\n",
"cleopatra becomes: kleopara\n",
"mochtmen becomes: mochtmen\n"
]
}
],
"source": [
"def get_reversed_max_string_logits(logits):\n",
" string_logits = logits[::-1]\n",
" concatenated_string = \"\"\n",
" for logit in string_logits:\n",
" val_here = np.argmax(logit)\n",
" concatenated_string += feature_list[val_here]\n",
" return concatenated_string\n",
"def print_out(out):\n",
" out = list(zip(*out))\n",
" out = out[:10] # only show the first 10 samples\n",
" \n",
" for index,string_logits in enumerate(out):\n",
" print(\"input: \",end='')\n",
" print_vector(Xin[index])\n",
" print(\"expected: \",end='')\n",
" expected= Yin[index][::-1]\n",
" print_vector(expected)\n",
" \n",
" output = get_reversed_max_string_logits(string_logits)\n",
" print(\"output: \" + output)\n",
" \n",
" \n",
" print(\"==============\")\n",
"\n",
"\n",
"# Now run a small test to see what our network does with words\n",
"RANDOM_TESTSIZE = 5\n",
"Xin,Yin = get_random_reversed_dataset(Xtest,Ytest,RANDOM_TESTSIZE)\n",
"Xin_transposed = np.array(Xin).T\n",
"Yin_transposed = np.array(Yin).T\n",
"feed_dict = {enc_input[t]: Xin_transposed[t] for t in range(MAX_LENGTH_WORD)}\n",
"out = sess.run(runtime_outputs, feed_dict)\n",
"print_out(out)\n",
"\n",
"\n",
"def translate_single_word(word):\n",
" Xin = [get_vector_from_string(word)]\n",
" Xin = sequence.pad_sequences(Xin, maxlen=MAX_LENGTH_WORD)\n",
" Xin_transposed = np.array(Xin).T\n",
" feed_dict = {enc_input[t]: Xin_transposed[t] for t in range(MAX_LENGTH_WORD)}\n",
" out = sess.run(runtime_outputs, feed_dict)\n",
" return get_reversed_max_string_logits(out)\n",
"\n",
"interesting_words = [\"aerde\",\"duyster\", \"salfde\", \"ontstondt\", \"tusschen\",\"wacker\",\"voorraet\",\"gevreeset\",\"cleopatra\"]\n",
"for word in interesting_words:\n",
" print(word + \" becomes: \" + translate_single_word(word).replace(\"~\",\"\"))\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Test analysis\n",
"Looks like our network learned simple rules such as (ae) -> (aa), (uy) -> (ui), (s) -> (z) if at the start of a word. There are also difficult cases, such as tusschen (tussen). Some words (voorraet) are difficult, as in modern dutch the last t changed to a d (but this is not a hard rule, you simply have to learn it).\n",
"\n",
"Translating \"cleopatra\" is an interesting case. As it is a name, you don't want to change it... The network can't know this, and simply renames her to \"kleopatra\" \n",
"\n",
"\n",
"## Conclusion\n",
"Using a neural network to go from old dutch to new dutch has been mildly succesful. Some words are \"translated\" correctly, while others unfortunately are mistranslated. It has been a fun experiment, and it is still interesting to see what rules were \"easy\" to learn, and what rules are difficult to learn. \n",
"\n",
"If you want to toy around with this model, or have any questions, please let me know!\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.4.3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}