Created
January 2, 2021 21:05
Pool parameter in Julia CSV parser.
{ | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"# Pool parameter of the Julia CSV parser\n", | |
"Written in [Julia](https://julialang.org/). See [CSV.jl](https://csv.juliadata.org/stable/) and [DataFrames.jl](https://dataframes.juliadata.org/stable/) for more details.\n", | |
"\n", | |
"Pooling is similar to the pandas `category` dtype. The strings are stored in a `PooledArrays.PooledArray`, which can make some operations much faster. The `pool` argument sets the threshold: a string column whose fraction of unique values does not exceed it is converted to a pooled array. " | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 2, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"using CSV, DataFrames" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 3, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"\"unique,cat\\nA18E9,AT\\nBF392,GC\\n93EBC,AT\\n54EE1,AT\\n8CD2E,GC\"" | |
] | |
}, | |
"execution_count": 3, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"data = \"\"\"unique,cat\n", | |
"A18E9,AT\n", | |
"BF392,GC\n", | |
"93EBC,AT\n", | |
"54EE1,AT\n", | |
"8CD2E,GC\"\"\"" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 4, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<table class=\"data-frame\"><thead><tr><th></th><th>unique</th><th>cat</th></tr><tr><th></th><th>String</th><th>String</th></tr></thead><tbody><p>5 rows × 2 columns</p><tr><th>1</th><td>A18E9</td><td>AT</td></tr><tr><th>2</th><td>BF392</td><td>GC</td></tr><tr><th>3</th><td>93EBC</td><td>AT</td></tr><tr><th>4</th><td>54EE1</td><td>AT</td></tr><tr><th>5</th><td>8CD2E</td><td>GC</td></tr></tbody></table>" | |
], | |
"text/latex": [ | |
"\\begin{tabular}{r|cc}\n", | |
"\t& unique & cat\\\\\n", | |
"\t\\hline\n", | |
"\t& String & String\\\\\n", | |
"\t\\hline\n", | |
"\t1 & A18E9 & AT \\\\\n", | |
"\t2 & BF392 & GC \\\\\n", | |
"\t3 & 93EBC & AT \\\\\n", | |
"\t4 & 54EE1 & AT \\\\\n", | |
"\t5 & 8CD2E & GC \\\\\n", | |
"\\end{tabular}\n" | |
], | |
"text/plain": [ | |
"5×2 DataFrame\n", | |
"│ Row │ unique │ cat │\n", | |
"│ │ \u001b[90mString\u001b[39m │ \u001b[90mString\u001b[39m │\n", | |
"├─────┼────────┼────────┤\n", | |
"│ 1 │ A18E9 │ AT │\n", | |
"│ 2 │ BF392 │ GC │\n", | |
"│ 3 │ 93EBC │ AT │\n", | |
"│ 4 │ 54EE1 │ AT │\n", | |
"│ 5 │ 8CD2E │ GC │" | |
] | |
}, | |
"execution_count": 4, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"df = CSV.read(IOBuffer(data), DataFrame; pool=0.4)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 5, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"5-element Array{String,1}:\n", | |
" \"A18E9\"\n", | |
" \"BF392\"\n", | |
" \"93EBC\"\n", | |
" \"54EE1\"\n", | |
" \"8CD2E\"" | |
] | |
}, | |
"execution_count": 5, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# column `unique` has a different value on each row (5/5 = 1.0 > 0.4), so it is not pooled\n", | |
"df[:,:unique]" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 6, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"5-element PooledArrays.PooledArray{String,UInt32,1,Array{UInt32,1}}:\n", | |
" \"AT\"\n", | |
" \"GC\"\n", | |
" \"AT\"\n", | |
" \"AT\"\n", | |
" \"GC\"" | |
] | |
}, | |
"execution_count": 6, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# column `cat` contains 2 distinct values in 5 rows; 2/5 = 0.4, so the column is pooled\n", | |
"df[:,:cat]" | |
] | |
} | |
], | |
"metadata": { | |
"kernelspec": { | |
"display_name": "Julia 1.4.1", | |
"language": "julia", | |
"name": "julia-1.4" | |
}, | |
"language_info": { | |
"file_extension": ".jl", | |
"mimetype": "application/julia", | |
"name": "julia", | |
"version": "1.4.1" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 4 | |
} |
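The arithmetic behind the two comments in the notebook can be reproduced without the CSV parser. The following is a minimal sketch, assuming the PooledArrays package is installed; `unique_ratio` is a hypothetical helper written for illustration and is not part of CSV.jl. It computes the fraction of distinct values in each column and shows what a pooled array stores for the `cat` column.

```julia
# Sketch only: assumes the PooledArrays package is installed alongside CSV.jl.
using PooledArrays

# Same two columns as in the notebook's sample data.
unique_col = ["A18E9", "BF392", "93EBC", "54EE1", "8CD2E"]
cat_col    = ["AT", "GC", "AT", "AT", "GC"]

# Hypothetical helper (not part of CSV.jl): fraction of distinct values in a column.
unique_ratio(col) = length(unique(col)) / length(col)

ratio_unique = unique_ratio(unique_col)  # 5 distinct values / 5 rows
ratio_cat    = unique_ratio(cat_col)     # 2 distinct values / 5 rows

# A pooled array keeps each distinct string once (`pool`) and stores
# one small integer reference per row (`refs`).
pooled = PooledArray(cat_col)

println("ratios: ", ratio_unique, " and ", ratio_cat)
println("pool:   ", pooled.pool)
println("refs:   ", Int.(pooled.refs))
```

With `pool=0.4`, the notebook run pooled the column whose ratio is 0.4 and left the column whose ratio is 1.0 as a plain string vector. The exact comparison rule and the default threshold may differ between CSV.jl versions, so it is worth checking the documentation of the installed release.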