@PhilipMay
Created March 2, 2020 21:36
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"from transformers import AutoTokenizer"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"transformers.tokenization_bert.BertTokenizer"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tokenizer = AutoTokenizer.from_pretrained(\"bert-base-german-cased\")\n",
"type(tokenizer)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"text = 'Ist das eine Frage?'"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['Ist', 'das', 'eine', 'Frage', '[UNK]']"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tokens = tokenizer.tokenize(text)\n",
"tokens"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `?` should not be converted to `[UNK]`."
]
},
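{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick sanity check (untested sketch): compare the ID of a bare `?` with the tokenizer's `unk_token_id`. If the two match, the standalone `?` is simply not in the vocabulary."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: a bare '?' should map to a real vocab ID.\n",
"# If this equals unk_token_id instead, '?' is missing from the vocab.\n",
"tokenizer.convert_tokens_to_ids('?'), tokenizer.unk_token_id"
]
},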
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"3 \t [ C L S ]\n",
"4352 \t I s t\n",
"93 \t d a s\n",
"155 \t e i n e\n",
"1685 \t F r a g e\n",
"2 \t [ U N K ]\n",
"4 \t [ S E P ]\n"
]
}
],
"source": [
"tokens = tokenizer.encode(text)\n",
"for t in tokens:\n",
" print(t, '\\t', tokenizer.decode(t))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Same problem here."
]
},
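{
"cell_type": "markdown",
"metadata": {},
"source": [
"The mapping can also be inspected in one call with `convert_ids_to_tokens` (untested sketch); ID 2 should come back as `[UNK]`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: map the encoded IDs back to their vocab tokens in one call.\n",
"tokenizer.convert_ids_to_tokens(tokenizer.encode(text))"
]
},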
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[198, 1020, 26972]"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"token_ids = tokenizer.convert_tokens_to_ids(['Ein', 'Haus', '##?'])\n",
"token_ids "
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Ein Haus?'"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tokenizer.decode(token_ids)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Works for `##?` But somehow the tokenizer does not create `##?` but `?`."
]
},
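{
"cell_type": "markdown",
"metadata": {},
"source": [
"A plausible explanation (an assumption, not verified against the vocab file): BERT's basic tokenizer splits punctuation off into standalone tokens before WordPiece runs, so `?` is looked up on its own and never as `##?`. If the vocab contains `##?` but no standalone `?`, the lookup fails and yields `[UNK]`. This can be checked directly against `tokenizer.vocab` (untested sketch):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: check which variants of the question mark the vocab contains.\n",
"'?' in tokenizer.vocab, '##?' in tokenizer.vocab"
]
}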
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.6"
}
},
"nbformat": 4,
"nbformat_minor": 4
}