@vaclavdekanovsky
Created January 2, 2021 21:05
Pool parameter in Julia CSV parser.
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Pool parameter of the Julia CSV parser\n",
"Written in [Julia](https://julialang.org/). See [CSV.jl](https://csv.juliadata.org/stable/) and [DataFrames.jl](https://dataframes.juliadata.org/stable/) for more details.\n",
"\n",
"Pooling is similar to the pandas `category` dtype. Repeated strings are stored in a `PooledArrays.PooledArray`, which can make some operations much faster. The `pool` argument sets the threshold: a string column whose fraction of unique values does not exceed it is converted to a pooled array."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"using CSV, DataFrames"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"\"unique,cat\\nA18E9,AT\\nBF392,GC\\n93EBC,AT\\n54EE1,AT\\n8CD2E,GC\""
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data = \"\"\"unique,cat\n",
"A18E9,AT\n",
"BF392,GC\n",
"93EBC,AT\n",
"54EE1,AT\n",
"8CD2E,GC\"\"\""
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<table class=\"data-frame\"><thead><tr><th></th><th>unique</th><th>cat</th></tr><tr><th></th><th>String</th><th>String</th></tr></thead><tbody><p>5 rows × 2 columns</p><tr><th>1</th><td>A18E9</td><td>AT</td></tr><tr><th>2</th><td>BF392</td><td>GC</td></tr><tr><th>3</th><td>93EBC</td><td>AT</td></tr><tr><th>4</th><td>54EE1</td><td>AT</td></tr><tr><th>5</th><td>8CD2E</td><td>GC</td></tr></tbody></table>"
],
"text/latex": [
"\\begin{tabular}{r|cc}\n",
"\t& unique & cat\\\\\n",
"\t\\hline\n",
"\t& String & String\\\\\n",
"\t\\hline\n",
"\t1 & A18E9 & AT \\\\\n",
"\t2 & BF392 & GC \\\\\n",
"\t3 & 93EBC & AT \\\\\n",
"\t4 & 54EE1 & AT \\\\\n",
"\t5 & 8CD2E & GC \\\\\n",
"\\end{tabular}\n"
],
"text/plain": [
"5×2 DataFrame\n",
"│ Row │ unique │ cat │\n",
"│ │ \u001b[90mString\u001b[39m │ \u001b[90mString\u001b[39m │\n",
"├─────┼────────┼────────┤\n",
"│ 1 │ A18E9 │ AT │\n",
"│ 2 │ BF392 │ GC │\n",
"│ 3 │ 93EBC │ AT │\n",
"│ 4 │ 54EE1 │ AT │\n",
"│ 5 │ 8CD2E │ GC │"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = CSV.read(IOBuffer(data), DataFrame; pool=0.4)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"5-element Array{String,1}:\n",
" \"A18E9\"\n",
" \"BF392\"\n",
" \"93EBC\"\n",
" \"54EE1\"\n",
" \"8CD2E\""
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# column `unique` has a different value in each row, so it is not pooled\n",
"df[:,:unique]"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"5-element PooledArrays.PooledArray{String,UInt32,1,Array{UInt32,1}}:\n",
" \"AT\"\n",
" \"GC\"\n",
" \"AT\"\n",
" \"AT\"\n",
" \"GC\""
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# column `cat` has 2 distinct values in 5 rows; 2/5 = 0.4, which does not exceed pool=0.4, so the column is pooled\n",
"df[:,:cat]"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Julia 1.4.1",
"language": "julia",
"name": "julia-1.4"
},
"language_info": {
"file_extension": ".jl",
"mimetype": "application/julia",
"name": "julia",
"version": "1.4.1"
}
},
"nbformat": 4,
"nbformat_minor": 4
}