Created
December 19, 2020 22:09
Julia, parsing CSV using dateformat parameter
{ | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"# Dateformat parameter of the Julia CSV parser\n", | |
"Written in [Julia](https://julialang.org/). See [CSV.jl](https://csv.juliadata.org/stable/) and [DataFrames.jl](https://dataframes.juliadata.org/stable/) for more details" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 1, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"using CSV, DataFrames" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"All examples are based on string input, which is passed to Julia's CSV reader through `IOBuffer`.\n", | |
"\n", | |
"Passing a single string as `dateformat` applies that format to every date column." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 2, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<table class=\"data-frame\"><thead><tr><th></th><th>c1</th><th>c2</th><th>c3</th><th>c4</th><th>d1</th></tr><tr><th></th><th>String</th><th>Int64</th><th>String</th><th>Float64</th><th>Date…</th></tr></thead><tbody><p>2 rows × 5 columns</p><tr><th>1</th><td>XY</td><td>2</td><td>c</td><td>1.5</td><td>2020-01-05</td></tr><tr><th>2</th><td>AB</td><td>16</td><td>x</td><td>2.33</td><td>2021-01-05</td></tr></tbody></table>" | |
], | |
"text/latex": [ | |
"\\begin{tabular}{r|ccccc}\n", | |
"\t& c1 & c2 & c3 & c4 & d1\\\\\n", | |
"\t\\hline\n", | |
"\t& String & Int64 & String & Float64 & Date…\\\\\n", | |
"\t\\hline\n", | |
"\t1 & XY & 2 & c & 1.5 & 2020-01-05 \\\\\n", | |
"\t2 & AB & 16 & x & 2.33 & 2021-01-05 \\\\\n", | |
"\\end{tabular}\n" | |
], | |
"text/plain": [ | |
"2×5 DataFrame\n", | |
"│ Row │ c1 │ c2 │ c3 │ c4 │ d1 │\n", | |
"│ │ \u001b[90mString\u001b[39m │ \u001b[90mInt64\u001b[39m │ \u001b[90mString\u001b[39m │ \u001b[90mFloat64\u001b[39m │ \u001b[90mDates.Date\u001b[39m │\n", | |
"├─────┼────────┼───────┼────────┼─────────┼────────────┤\n", | |
"│ 1 │ XY │ 2 │ c │ 1.5 │ 2020-01-05 │\n", | |
"│ 2 │ AB │ 16 │ x │ 2.33 │ 2021-01-05 │" | |
] | |
}, | |
"execution_count": 2, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"data = \"\"\"c1|c2|c3|c4|d1\n", | |
"\"XY\"|2|c|1.5|2020-01-05\n", | |
"\"AB\"|16|x|2.33|2021-01-05\n", | |
"\"\"\"\n", | |
"\n", | |
"CSV.read(IOBuffer(data), DataFrame; \n", | |
" dateformat=\"yyyy-mm-dd\")" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"If the file mixes several date formats, you can fall back on the default parser by declaring the columns as dates (`types=Dict(\"d1\"=>Date, ...)`), but it will probably parse only some of them. You also need `using Dates` to be able to name the `Date` type. When a value cannot be parsed, the parser emits a warning and returns `missing`, Julia's marker for absent data (roughly what `NaN` or `NA` stands for in other tools)." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 3, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stderr", | |
"output_type": "stream", | |
"text": [ | |
"┌ Warning: thread = 1 warning: error parsing Date around row = 2, col = 6: \"01/12/20\n", | |
"│ \", error=INVALID: OK | NEWLINE | INVALID_DELIMITER \n", | |
"└ @ CSV /home/vaclav/.julia/packages/CSV/la2cd/src/file.jl:606\n", | |
"┌ Warning: thread = 1 warning: error parsing Date around row = 3, col = 6: \"15/10/20\n", | |
"│ \", error=INVALID: OK | NEWLINE | EOF | INVALID_DELIMITER \n", | |
"└ @ CSV /home/vaclav/.julia/packages/CSV/la2cd/src/file.jl:606\n" | |
] | |
}, | |
{ | |
"data": { | |
"text/html": [ | |
"<table class=\"data-frame\"><thead><tr><th></th><th>c1</th><th>c2</th><th>c3</th><th>c4</th><th>d1</th><th>d2</th></tr><tr><th></th><th>String</th><th>Int64</th><th>String</th><th>Float64</th><th>Date</th><th>Date?</th></tr></thead><tbody><p>2 rows × 6 columns</p><tr><th>1</th><td>XY</td><td>2</td><td>c</td><td>1.5</td><td>2020-01-05</td><td><em>missing</em></td></tr><tr><th>2</th><td>AB</td><td>16</td><td>x</td><td>2.33</td><td>2021-01-05</td><td><em>missing</em></td></tr></tbody></table>" | |
], | |
"text/latex": [ | |
"\\begin{tabular}{r|cccccc}\n", | |
"\t& c1 & c2 & c3 & c4 & d1 & d2\\\\\n", | |
"\t\\hline\n", | |
"\t& String & Int64 & String & Float64 & Date & Date?\\\\\n", | |
"\t\\hline\n", | |
"\t1 & XY & 2 & c & 1.5 & 2020-01-05 & \\emph{missing} \\\\\n", | |
"\t2 & AB & 16 & x & 2.33 & 2021-01-05 & \\emph{missing} \\\\\n", | |
"\\end{tabular}\n" | |
], | |
"text/plain": [ | |
"2×6 DataFrame\n", | |
"│ Row │ c1 │ c2 │ c3 │ c4 │ d1 │ d2 │\n", | |
"│ │ \u001b[90mString\u001b[39m │ \u001b[90mInt64\u001b[39m │ \u001b[90mString\u001b[39m │ \u001b[90mFloat64\u001b[39m │ \u001b[90mDate\u001b[39m │ \u001b[90mDate?\u001b[39m │\n", | |
"├─────┼────────┼───────┼────────┼─────────┼────────────┼─────────┤\n", | |
"│ 1 │ XY │ 2 │ c │ 1.5 │ 2020-01-05 │ \u001b[90mmissing\u001b[39m │\n", | |
"│ 2 │ AB │ 16 │ x │ 2.33 │ 2021-01-05 │ \u001b[90mmissing\u001b[39m │" | |
] | |
}, | |
"execution_count": 3, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"using Dates\n", | |
"\n", | |
"data = \"\"\"c1|c2|c3|c4|d1|d2\n", | |
"\"XY\"|2|c|1.5|2020-01-05|01/12/20\n", | |
"\"AB\"|16|x|2.33|2021-01-05|15/10/20\n", | |
"\"\"\"\n", | |
"\n", | |
"# declare d1 and d2 as Date columns; no dateformat is given, so the default parser is used\n", | |
"CSV.read(IOBuffer(data), DataFrame; \n", | |
" types=Dict(\"d1\"=>Date, \"d2\"=>Date), \n", | |
")" | |
] | |
}, | |
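{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"To see why `d2` comes back as `missing`, it can help to try the same strings with the `Dates` standard library directly. This is only a sketch by analogy: CSV.jl has its own parsing code, and the variable names below are made up for the illustration.\n", | |
"\n", | |
"```julia\n", | |
"using Dates\n", | |
"\n", | |
"# ISO-style input matches the default Date format (yyyy-mm-dd)\n", | |
"iso = tryparse(Date, \"2020-01-05\")\n", | |
"\n", | |
"# dd/mm/yy input does not match the default format, so tryparse returns nothing\n", | |
"slash_default = tryparse(Date, \"01/12/20\")\n", | |
"\n", | |
"# with an explicit format the same string parses, but the year is 20, not 2020\n", | |
"slash_explicit = tryparse(Date, \"01/12/20\", dateformat\"dd/mm/yy\")\n", | |
"```\n", | |
"\n", | |
"`tryparse` signals failure with `nothing`, whereas the CSV reader stores `missing` in the column." | |
] | |
}, | |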
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
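"Passing a `Dict` to `dateformats` lets you specify a different format for each column. You don't have to set the types explicitly; a column that is given a date format is parsed as `Date`. This notebook was run with a CSV.jl version that has a separate `dateformats` keyword; newer releases may expect the `Dict` under `dateformat` instead, so check the documentation for your version." | |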
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 4, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<table class=\"data-frame\"><thead><tr><th></th><th>c1</th><th>c2</th><th>c3</th><th>c4</th><th>d1</th><th>d2</th></tr><tr><th></th><th>String</th><th>Int64</th><th>String</th><th>Float64</th><th>Date</th><th>Date</th></tr></thead><tbody><p>2 rows × 6 columns</p><tr><th>1</th><td>XY</td><td>2</td><td>c</td><td>1.5</td><td>2020-01-05</td><td>0020-12-01</td></tr><tr><th>2</th><td>AB</td><td>16</td><td>x</td><td>2.33</td><td>2021-01-05</td><td>0020-10-15</td></tr></tbody></table>" | |
], | |
"text/latex": [ | |
"\\begin{tabular}{r|cccccc}\n", | |
"\t& c1 & c2 & c3 & c4 & d1 & d2\\\\\n", | |
"\t\\hline\n", | |
"\t& String & Int64 & String & Float64 & Date & Date\\\\\n", | |
"\t\\hline\n", | |
"\t1 & XY & 2 & c & 1.5 & 2020-01-05 & 0020-12-01 \\\\\n", | |
"\t2 & AB & 16 & x & 2.33 & 2021-01-05 & 0020-10-15 \\\\\n", | |
"\\end{tabular}\n" | |
], | |
"text/plain": [ | |
"2×6 DataFrame\n", | |
"│ Row │ c1 │ c2 │ c3 │ c4 │ d1 │ d2 │\n", | |
"│ │ \u001b[90mString\u001b[39m │ \u001b[90mInt64\u001b[39m │ \u001b[90mString\u001b[39m │ \u001b[90mFloat64\u001b[39m │ \u001b[90mDate\u001b[39m │ \u001b[90mDate\u001b[39m │\n", | |
"├─────┼────────┼───────┼────────┼─────────┼────────────┼────────────┤\n", | |
"│ 1 │ XY │ 2 │ c │ 1.5 │ 2020-01-05 │ 0020-12-01 │\n", | |
"│ 2 │ AB │ 16 │ x │ 2.33 │ 2021-01-05 │ 0020-10-15 │" | |
] | |
}, | |
"execution_count": 4, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"data = \"\"\"c1|c2|c3|c4|d1|d2\n", | |
"\"XY\"|2|c|1.5|2020-01-05|01/12/20\n", | |
"\"AB\"|16|x|2.33|2021-01-05|15/10/20\n", | |
"\"\"\"\n", | |
"\n", | |
"# give each date column its own format; the Date type follows from the format\n", | |
"df = CSV.read(IOBuffer(data), DataFrame; \n", | |
" dateformats=Dict(\n", | |
" \"d1\"=>\"yyyy-mm-dd\",\n", | |
" \"d2\"=>\"dd/mm/yy\"\n", | |
" )\n", | |
")\n", | |
"df" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Note that when the year is given only as `20`, the parser really sets it to the year 20 (`0020`), not 2020. You have to add 2000 years to get the intended value." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 5, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<table class=\"data-frame\"><thead><tr><th></th><th>c1</th><th>c2</th><th>c3</th><th>c4</th><th>d1</th><th>d2</th></tr><tr><th></th><th>String</th><th>Int64</th><th>String</th><th>Float64</th><th>Date</th><th>Date</th></tr></thead><tbody><p>2 rows × 6 columns</p><tr><th>1</th><td>XY</td><td>2</td><td>c</td><td>1.5</td><td>2020-01-05</td><td>2020-12-01</td></tr><tr><th>2</th><td>AB</td><td>16</td><td>x</td><td>2.33</td><td>2021-01-05</td><td>2020-10-15</td></tr></tbody></table>" | |
], | |
"text/latex": [ | |
"\\begin{tabular}{r|cccccc}\n", | |
"\t& c1 & c2 & c3 & c4 & d1 & d2\\\\\n", | |
"\t\\hline\n", | |
"\t& String & Int64 & String & Float64 & Date & Date\\\\\n", | |
"\t\\hline\n", | |
"\t1 & XY & 2 & c & 1.5 & 2020-01-05 & 2020-12-01 \\\\\n", | |
"\t2 & AB & 16 & x & 2.33 & 2021-01-05 & 2020-10-15 \\\\\n", | |
"\\end{tabular}\n" | |
], | |
"text/plain": [ | |
"2×6 DataFrame\n", | |
"│ Row │ c1 │ c2 │ c3 │ c4 │ d1 │ d2 │\n", | |
"│ │ \u001b[90mString\u001b[39m │ \u001b[90mInt64\u001b[39m │ \u001b[90mString\u001b[39m │ \u001b[90mFloat64\u001b[39m │ \u001b[90mDate\u001b[39m │ \u001b[90mDate\u001b[39m │\n", | |
"├─────┼────────┼───────┼────────┼─────────┼────────────┼────────────┤\n", | |
"│ 1 │ XY │ 2 │ c │ 1.5 │ 2020-01-05 │ 2020-12-01 │\n", | |
"│ 2 │ AB │ 16 │ x │ 2.33 │ 2021-01-05 │ 2020-10-15 │" | |
] | |
}, | |
"execution_count": 5, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# add 2000 years to column d2, which currently holds 0020-MM-DD\n", | |
"# careful to run this only once: the assignment overwrites the column,\n", | |
"# so every extra run adds another 2000 years\n", | |
"df[!, :d2] = df[!, :d2] .+ Dates.Year(2000)\n", | |
"df" | |
] | |
}, | |
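{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"One way to make the correction safe to re-run is to shift only the dates whose year still looks like a two-digit year. A minimal sketch, assuming every two-digit year belongs to the 2000s; `fix_century` is a hypothetical helper, not part of `Dates` or CSV.jl.\n", | |
"\n", | |
"```julia\n", | |
"using Dates\n", | |
"\n", | |
"# hypothetical helper: add 2000 years only when the year is below 100\n", | |
"fix_century(d::Date) = year(d) < 100 ? d + Year(2000) : d\n", | |
"\n", | |
"d2 = [Date(20, 12, 1), Date(20, 10, 15)]\n", | |
"fixed_once = fix_century.(d2)\n", | |
"fixed_twice = fix_century.(fixed_once)  # a second pass leaves the dates unchanged\n", | |
"```\n", | |
"\n", | |
"On the data frame above this could be applied as `df[!, :d2] = fix_century.(df[!, :d2])`." | |
] | |
}, | |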
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Parsing DateTime\n", | |
"See [Dates](https://docs.julialang.org/en/v1/stdlib/Dates/#Dates.format-Tuple{TimeType,AbstractString}) module documentation to understand the format." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 6, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/html": [ | |
"<table class=\"data-frame\"><thead><tr><th></th><th>c1</th><th>c2</th><th>c3</th><th>c4</th><th>d1</th><th>d2</th><th>time</th></tr><tr><th></th><th>String</th><th>Int64</th><th>String</th><th>Float64</th><th>Date</th><th>Date</th><th>DateTime</th></tr></thead><tbody><p>2 rows × 7 columns</p><tr><th>1</th><td>XY</td><td>2</td><td>c</td><td>1.5</td><td>2020-01-05</td><td>0020-12-01</td><td>2020-01-15T10:55:03</td></tr><tr><th>2</th><td>AB</td><td>16</td><td>x</td><td>2.33</td><td>2021-01-05</td><td>0020-10-15</td><td>2020-01-15T23:08:59</td></tr></tbody></table>" | |
], | |
"text/latex": [ | |
"\\begin{tabular}{r|ccccccc}\n", | |
"\t& c1 & c2 & c3 & c4 & d1 & d2 & time\\\\\n", | |
"\t\\hline\n", | |
"\t& String & Int64 & String & Float64 & Date & Date & DateTime\\\\\n", | |
"\t\\hline\n", | |
"\t1 & XY & 2 & c & 1.5 & 2020-01-05 & 0020-12-01 & 2020-01-15T10:55:03 \\\\\n", | |
"\t2 & AB & 16 & x & 2.33 & 2021-01-05 & 0020-10-15 & 2020-01-15T23:08:59 \\\\\n", | |
"\\end{tabular}\n" | |
], | |
"text/plain": [ | |
"2×7 DataFrame. Omitted printing of 1 columns\n", | |
"│ Row │ c1 │ c2 │ c3 │ c4 │ d1 │ d2 │\n", | |
"│ │ \u001b[90mString\u001b[39m │ \u001b[90mInt64\u001b[39m │ \u001b[90mString\u001b[39m │ \u001b[90mFloat64\u001b[39m │ \u001b[90mDate\u001b[39m │ \u001b[90mDate\u001b[39m │\n", | |
"├─────┼────────┼───────┼────────┼─────────┼────────────┼────────────┤\n", | |
"│ 1 │ XY │ 2 │ c │ 1.5 │ 2020-01-05 │ 0020-12-01 │\n", | |
"│ 2 │ AB │ 16 │ x │ 2.33 │ 2021-01-05 │ 0020-10-15 │" | |
] | |
}, | |
"execution_count": 6, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"data = \"\"\"c1|c2|c3|c4|d1|d2|time\n", | |
"\"XY\"|2|c|1.5|2020-01-05|01/12/20|2020Jan15T10:55:03\n", | |
"\"AB\"|16|x|2.33|2021-01-05|15/10/20|2020Jan15T23:08:59\n", | |
"\"\"\"\n", | |
"\n", | |
"# per-column formats; the time column uses a DateFormat with an abbreviated month name (uuu)\n", | |
"df = CSV.read(IOBuffer(data), DataFrame; \n", | |
" dateformats=Dict(\n", | |
" \"d1\"=>\"yyyy-mm-dd\",\n", | |
" \"d2\"=>\"dd/mm/yy\",\n", | |
" \"time\"=>DateFormat(\"yyyyuuuddTHH:MM:SS\")\n", | |
" )\n", | |
")\n", | |
"df" | |
] | |
}, | |
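{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"A format string can be checked on a single value before it is handed to the CSV reader. The sketch below uses only the `Dates` standard library; the variable names are illustrative.\n", | |
"\n", | |
"```julia\n", | |
"using Dates\n", | |
"\n", | |
"# same format as used for the time column: 4-digit year, abbreviated month, day, literal T, time\n", | |
"fmt = DateFormat(\"yyyyuuuddTHH:MM:SS\")\n", | |
"\n", | |
"# parse one value from the sample data\n", | |
"dt = DateTime(\"2020Jan15T10:55:03\", fmt)\n", | |
"\n", | |
"# write it back with the same format to confirm the format describes the input\n", | |
"roundtrip = Dates.format(dt, fmt)\n", | |
"```\n", | |
"\n", | |
"If the round trip does not reproduce the input string, the format most likely does not match the data." | |
] | |
}, | |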
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Details about date parsing" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 10, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"2020-01-01T00:00:00" | |
] | |
}, | |
"execution_count": 10, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"# DateFormat accepts a locale name as its second argument; only \"english\" is available by default\n", | |
"DateTime(\"2020Jan\", Dates.DateFormat(\"yyyyuuu\", \"english\"))" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 11, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"Base.KeySet for a Dict{String,Dates.DateLocale} with 1 entry. Keys:\n", | |
" \"english\"" | |
] | |
}, | |
"execution_count": 11, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"keys(Dates.LOCALES)" | |
] | |
}, | |
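{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Since only `\"english\"` is registered, month names in another language need a locale of their own. The sketch below follows the pattern from the `Dates` documentation; the key `\"spanish\"` and the `spanish_*` variables are names chosen for this example.\n", | |
"\n", | |
"```julia\n", | |
"using Dates\n", | |
"\n", | |
"spanish_months = [\"enero\", \"febrero\", \"marzo\", \"abril\", \"mayo\", \"junio\",\n", | |
"                  \"julio\", \"agosto\", \"septiembre\", \"octubre\", \"noviembre\", \"diciembre\"]\n", | |
"spanish_months_abbr = [\"ene\", \"feb\", \"mar\", \"abr\", \"may\", \"jun\",\n", | |
"                       \"jul\", \"ago\", \"sep\", \"oct\", \"nov\", \"dic\"]\n", | |
"spanish_days = [\"lunes\", \"martes\", \"miércoles\", \"jueves\", \"viernes\", \"sábado\", \"domingo\"]\n", | |
"\n", | |
"# register the locale; [\"\"] leaves the day abbreviations unset\n", | |
"Dates.LOCALES[\"spanish\"] = Dates.DateLocale(spanish_months, spanish_months_abbr, spanish_days, [\"\"])\n", | |
"\n", | |
"# parse an abbreviated Spanish month and print the full month name\n", | |
"d = Date(\"2020-ene-15\", DateFormat(\"yyyy-uuu-dd\", \"spanish\"))\n", | |
"name = Dates.monthname(d; locale=\"spanish\")\n", | |
"```\n", | |
"\n", | |
"The registration changes global state in `Dates`, so it has to run before any format that refers to the new locale is constructed." | |
] | |
}, | |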
{ | |
"cell_type": "code", | |
"execution_count": 12, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"Dict{String,Int64} with 24 entries:\n", | |
" \"Aug\" => 8\n", | |
" \"May\" => 5\n", | |
" \"may\" => 5\n", | |
" \"Jul\" => 7\n", | |
" \"Dec\" => 12\n", | |
" \"Apr\" => 4\n", | |
" \"nov\" => 11\n", | |
" \"jul\" => 7\n", | |
" \"Oct\" => 10\n", | |
" \"apr\" => 4\n", | |
" \"Feb\" => 2\n", | |
" \"feb\" => 2\n", | |
" \"Mar\" => 3\n", | |
" \"oct\" => 10\n", | |
" \"mar\" => 3\n", | |
" \"Sep\" => 9\n", | |
" \"Jun\" => 6\n", | |
" \"dec\" => 12\n", | |
" \"Jan\" => 1\n", | |
" \"aug\" => 8\n", | |
" \"jan\" => 1\n", | |
" \"jun\" => 6\n", | |
" \"Nov\" => 11\n", | |
" \"sep\" => 9" | |
] | |
}, | |
"execution_count": 12, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"Dates.LOCALES[\"english\"].month_abbr_value" | |
] | |
} | |
], | |
"metadata": { | |
"kernelspec": { | |
"display_name": "Julia 1.4.1", | |
"language": "julia", | |
"name": "julia-1.4" | |
}, | |
"language_info": { | |
"file_extension": ".jl", | |
"mimetype": "application/julia", | |
"name": "julia", | |
"version": "1.4.1" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 4 | |
} |