@vaclavdekanovsky
Created December 19, 2020 22:09
Julia, parsing CSV using dateformat parameter
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Dateformat parameter of the Julia CSV parser\n",
"Written in [Julia](https://julialang.org/). See [CSV.jl](https://csv.juliadata.org/stable/) and [DataFrames.jl](https://dataframes.juliadata.org/stable/) for more details."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"using CSV, DataFrames"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"All examples are based on string input, which is passed to Julia's CSV reader through `IOBuffer`.\n",
"\n",
"Passing a single format string sets the `dateformat` for every date column."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<table class=\"data-frame\"><thead><tr><th></th><th>c1</th><th>c2</th><th>c3</th><th>c4</th><th>d1</th></tr><tr><th></th><th>String</th><th>Int64</th><th>String</th><th>Float64</th><th>Date…</th></tr></thead><tbody><p>2 rows × 5 columns</p><tr><th>1</th><td>XY</td><td>2</td><td>c</td><td>1.5</td><td>2020-01-05</td></tr><tr><th>2</th><td>AB</td><td>16</td><td>x</td><td>2.33</td><td>2021-01-05</td></tr></tbody></table>"
],
"text/latex": [
"\\begin{tabular}{r|ccccc}\n",
"\t& c1 & c2 & c3 & c4 & d1\\\\\n",
"\t\\hline\n",
"\t& String & Int64 & String & Float64 & Date…\\\\\n",
"\t\\hline\n",
"\t1 & XY & 2 & c & 1.5 & 2020-01-05 \\\\\n",
"\t2 & AB & 16 & x & 2.33 & 2021-01-05 \\\\\n",
"\\end{tabular}\n"
],
"text/plain": [
"2×5 DataFrame\n",
"│ Row │ c1 │ c2 │ c3 │ c4 │ d1 │\n",
"│ │ \u001b[90mString\u001b[39m │ \u001b[90mInt64\u001b[39m │ \u001b[90mString\u001b[39m │ \u001b[90mFloat64\u001b[39m │ \u001b[90mDates.Date\u001b[39m │\n",
"├─────┼────────┼───────┼────────┼─────────┼────────────┤\n",
"│ 1 │ XY │ 2 │ c │ 1.5 │ 2020-01-05 │\n",
"│ 2 │ AB │ 16 │ x │ 2.33 │ 2021-01-05 │"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data = \"\"\"c1|c2|c3|c4|d1\n",
"\"XY\"|2|c|1.5|2020-01-05\n",
"\"AB\"|16|x|2.33|2021-01-05\n",
"\"\"\"\n",
"\n",
"CSV.read(IOBuffer(data), DataFrame; \n",
" dateformat=\"yyyy-mm-dd\")"
]
},
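{
"cell_type": "markdown",
"metadata": {},
"source": [
"A format string can be checked on a single sample value before reading a whole file. The cell below is a minimal sketch that uses only the standard `Dates` module; the sample strings and the second format (`dd.mm.yyyy`) are made up for illustration."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"using Dates\n",
"\n",
"# the same format string that was passed to CSV.read above\n",
"Date(\"2020-01-05\", dateformat\"yyyy-mm-dd\")\n",
"\n",
"# a hypothetical second format, day first with dots as separators\n",
"Date(\"05.01.2020\", dateformat\"dd.mm.yyyy\")"
]
},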
{
"cell_type": "markdown",
"metadata": {},
"source": [
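"If the columns use more than one date format, you can declare them as dates through the `types` argument (for example `\"d2\"=>Date`) and rely on the default parser, but it will probably recognize only some of the formats. You also need `using Dates` so that the `Date` type is available. Where parsing fails, the parser prints a warning and returns `missing`, Julia's marker for absent data (comparable to `NaN` or `NA` in other tools)."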
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"┌ Warning: thread = 1 warning: error parsing Date around row = 2, col = 6: \"01/12/20\n",
"\", error=INVALID: OK | NEWLINE | INVALID_DELIMITER \n",
"└ @ CSV /home/vaclav/.julia/packages/CSV/la2cd/src/file.jl:606\n",
"┌ Warning: thread = 1 warning: error parsing Date around row = 3, col = 6: \"15/10/20\n",
"\", error=INVALID: OK | NEWLINE | EOF | INVALID_DELIMITER \n",
"└ @ CSV /home/vaclav/.julia/packages/CSV/la2cd/src/file.jl:606\n"
]
},
{
"data": {
"text/html": [
"<table class=\"data-frame\"><thead><tr><th></th><th>c1</th><th>c2</th><th>c3</th><th>c4</th><th>d1</th><th>d2</th></tr><tr><th></th><th>String</th><th>Int64</th><th>String</th><th>Float64</th><th>Date</th><th>Date?</th></tr></thead><tbody><p>2 rows × 6 columns</p><tr><th>1</th><td>XY</td><td>2</td><td>c</td><td>1.5</td><td>2020-01-05</td><td><em>missing</em></td></tr><tr><th>2</th><td>AB</td><td>16</td><td>x</td><td>2.33</td><td>2021-01-05</td><td><em>missing</em></td></tr></tbody></table>"
],
"text/latex": [
"\\begin{tabular}{r|cccccc}\n",
"\t& c1 & c2 & c3 & c4 & d1 & d2\\\\\n",
"\t\\hline\n",
"\t& String & Int64 & String & Float64 & Date & Date?\\\\\n",
"\t\\hline\n",
"\t1 & XY & 2 & c & 1.5 & 2020-01-05 & \\emph{missing} \\\\\n",
"\t2 & AB & 16 & x & 2.33 & 2021-01-05 & \\emph{missing} \\\\\n",
"\\end{tabular}\n"
],
"text/plain": [
"2×6 DataFrame\n",
"│ Row │ c1 │ c2 │ c3 │ c4 │ d1 │ d2 │\n",
"│ │ \u001b[90mString\u001b[39m │ \u001b[90mInt64\u001b[39m │ \u001b[90mString\u001b[39m │ \u001b[90mFloat64\u001b[39m │ \u001b[90mDate\u001b[39m │ \u001b[90mDate?\u001b[39m │\n",
"├─────┼────────┼───────┼────────┼─────────┼────────────┼─────────┤\n",
"│ 1 │ XY │ 2 │ c │ 1.5 │ 2020-01-05 │ \u001b[90mmissing\u001b[39m │\n",
"│ 2 │ AB │ 16 │ x │ 2.33 │ 2021-01-05 │ \u001b[90mmissing\u001b[39m │"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"using Dates\n",
"\n",
"data = \"\"\"c1|c2|c3|c4|d1|d2\n",
"\"XY\"|2|c|1.5|2020-01-05|01/12/20\n",
"\"AB\"|16|x|2.33|2021-01-05|15/10/20\n",
"\"\"\"\n",
"\n",
"# declare d1 and d2 as Date columns; no dateformat is given, so the default parser is used\n",
"CSV.read(IOBuffer(data), DataFrame; \n",
" types=Dict(\"d1\"=>Date, \"d2\"=>Date), \n",
")"
]
},
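{
"cell_type": "markdown",
"metadata": {},
"source": [
"`missing` propagates through most operations, so it usually has to be counted, skipped or replaced before further work. The sketch below imitates the fallback with plain `Dates`; `parse_or_missing` is a hypothetical helper written for this example, not part of CSV.jl."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"using Dates\n",
"\n",
"# hypothetical helper: return `missing` when a value does not match the format\n",
"parse_or_missing(s, fmt) = something(tryparse(Date, s, fmt), missing)\n",
"\n",
"iso = dateformat\"yyyy-mm-dd\"\n",
"col = [parse_or_missing(s, iso) for s in [\"2020-01-05\", \"01/12/20\"]]\n",
"\n",
"# number of values that failed to parse\n",
"n_failed = count(ismissing, col)\n",
"\n",
"# keep only the dates that were parsed successfully\n",
"parsed = collect(skipmissing(col))"
]
},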
{
"cell_type": "markdown",
"metadata": {},
"source": [
"By passing a `Dict` to `dateformats` you can specify a different format for each column. You don't have to set the types explicitly; a column that has its own date format is parsed as `Date`."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<table class=\"data-frame\"><thead><tr><th></th><th>c1</th><th>c2</th><th>c3</th><th>c4</th><th>d1</th><th>d2</th></tr><tr><th></th><th>String</th><th>Int64</th><th>String</th><th>Float64</th><th>Date</th><th>Date</th></tr></thead><tbody><p>2 rows × 6 columns</p><tr><th>1</th><td>XY</td><td>2</td><td>c</td><td>1.5</td><td>2020-01-05</td><td>0020-12-01</td></tr><tr><th>2</th><td>AB</td><td>16</td><td>x</td><td>2.33</td><td>2021-01-05</td><td>0020-10-15</td></tr></tbody></table>"
],
"text/latex": [
"\\begin{tabular}{r|cccccc}\n",
"\t& c1 & c2 & c3 & c4 & d1 & d2\\\\\n",
"\t\\hline\n",
"\t& String & Int64 & String & Float64 & Date & Date\\\\\n",
"\t\\hline\n",
"\t1 & XY & 2 & c & 1.5 & 2020-01-05 & 0020-12-01 \\\\\n",
"\t2 & AB & 16 & x & 2.33 & 2021-01-05 & 0020-10-15 \\\\\n",
"\\end{tabular}\n"
],
"text/plain": [
"2×6 DataFrame\n",
"│ Row │ c1 │ c2 │ c3 │ c4 │ d1 │ d2 │\n",
"│ │ \u001b[90mString\u001b[39m │ \u001b[90mInt64\u001b[39m │ \u001b[90mString\u001b[39m │ \u001b[90mFloat64\u001b[39m │ \u001b[90mDate\u001b[39m │ \u001b[90mDate\u001b[39m │\n",
"├─────┼────────┼───────┼────────┼─────────┼────────────┼────────────┤\n",
"│ 1 │ XY │ 2 │ c │ 1.5 │ 2020-01-05 │ 0020-12-01 │\n",
"│ 2 │ AB │ 16 │ x │ 2.33 │ 2021-01-05 │ 0020-10-15 │"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data = \"\"\"c1|c2|c3|c4|d1|d2\n",
"\"XY\"|2|c|1.5|2020-01-05|01/12/20\n",
"\"AB\"|16|x|2.33|2021-01-05|15/10/20\n",
"\"\"\"\n",
"\n",
"# give each date column its own format; the types are inferred from the formats\n",
"df = CSV.read(IOBuffer(data), DataFrame; \n",
" dateformats=Dict(\n",
" \"d1\"=>\"yyyy-mm-dd\",\n",
" \"d2\"=>\"dd/mm/yy\"\n",
" )\n",
")\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As you can see, when the year is given only as `20`, the parser takes it literally as the year 20. You have to add 2000 years to get the intended value."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<table class=\"data-frame\"><thead><tr><th></th><th>c1</th><th>c2</th><th>c3</th><th>c4</th><th>d1</th><th>d2</th></tr><tr><th></th><th>String</th><th>Int64</th><th>String</th><th>Float64</th><th>Date</th><th>Date</th></tr></thead><tbody><p>2 rows × 6 columns</p><tr><th>1</th><td>XY</td><td>2</td><td>c</td><td>1.5</td><td>2020-01-05</td><td>2020-12-01</td></tr><tr><th>2</th><td>AB</td><td>16</td><td>x</td><td>2.33</td><td>2021-01-05</td><td>2020-10-15</td></tr></tbody></table>"
],
"text/latex": [
"\\begin{tabular}{r|cccccc}\n",
"\t& c1 & c2 & c3 & c4 & d1 & d2\\\\\n",
"\t\\hline\n",
"\t& String & Int64 & String & Float64 & Date & Date\\\\\n",
"\t\\hline\n",
"\t1 & XY & 2 & c & 1.5 & 2020-01-05 & 2020-12-01 \\\\\n",
"\t2 & AB & 16 & x & 2.33 & 2021-01-05 & 2020-10-15 \\\\\n",
"\\end{tabular}\n"
],
"text/plain": [
"2×6 DataFrame\n",
"│ Row │ c1 │ c2 │ c3 │ c4 │ d1 │ d2 │\n",
"│ │ \u001b[90mString\u001b[39m │ \u001b[90mInt64\u001b[39m │ \u001b[90mString\u001b[39m │ \u001b[90mFloat64\u001b[39m │ \u001b[90mDate\u001b[39m │ \u001b[90mDate\u001b[39m │\n",
"├─────┼────────┼───────┼────────┼─────────┼────────────┼────────────┤\n",
"│ 1 │ XY │ 2 │ c │ 1.5 │ 2020-01-05 │ 2020-12-01 │\n",
"│ 2 │ AB │ 16 │ x │ 2.33 │ 2021-01-05 │ 2020-10-15 │"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# add 2000 years to the column d2 containing 0020-MM-DD\n",
"# be careful to run this only once: both df[:, :d2] and df[!, :d2] modify the column, so every run adds another 2000 years\n",
"df[!, :d2] .+= Dates.Year(2000)\n",
"df"
]
},
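{
"cell_type": "markdown",
"metadata": {},
"source": [
"Because the shift above changes the column on every run, a guarded variant can be safer. The following sketch uses a hypothetical helper, `fix_century`, that only shifts dates whose year is still below 100, so running it twice gives the same result. The cut-off of 100 and the default of 2000 years are assumptions made for this example."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"using Dates\n",
"\n",
"# hypothetical helper: shift only dates whose year is still below 100\n",
"fix_century(d::Date; century::Int=2000) = year(d) < 100 ? d + Year(century) : d\n",
"\n",
"d2 = [Date(20, 12, 1), Date(20, 10, 15)]\n",
"once = fix_century.(d2)\n",
"\n",
"# applying the helper again leaves the already corrected dates untouched\n",
"twice = fix_century.(once)"
]
},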
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Parsing DateTime\n",
"See the [Dates](https://docs.julialang.org/en/v1/stdlib/Dates/#Dates.format-Tuple{TimeType,AbstractString}) module documentation for the meaning of the format codes."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<table class=\"data-frame\"><thead><tr><th></th><th>c1</th><th>c2</th><th>c3</th><th>c4</th><th>d1</th><th>d2</th><th>time</th></tr><tr><th></th><th>String</th><th>Int64</th><th>String</th><th>Float64</th><th>Date</th><th>Date</th><th>DateTime</th></tr></thead><tbody><p>2 rows × 7 columns</p><tr><th>1</th><td>XY</td><td>2</td><td>c</td><td>1.5</td><td>2020-01-05</td><td>0020-12-01</td><td>2020-01-15T10:55:03</td></tr><tr><th>2</th><td>AB</td><td>16</td><td>x</td><td>2.33</td><td>2021-01-05</td><td>0020-10-15</td><td>2020-01-15T23:08:59</td></tr></tbody></table>"
],
"text/latex": [
"\\begin{tabular}{r|ccccccc}\n",
"\t& c1 & c2 & c3 & c4 & d1 & d2 & time\\\\\n",
"\t\\hline\n",
"\t& String & Int64 & String & Float64 & Date & Date & DateTime\\\\\n",
"\t\\hline\n",
"\t1 & XY & 2 & c & 1.5 & 2020-01-05 & 0020-12-01 & 2020-01-15T10:55:03 \\\\\n",
"\t2 & AB & 16 & x & 2.33 & 2021-01-05 & 0020-10-15 & 2020-01-15T23:08:59 \\\\\n",
"\\end{tabular}\n"
],
"text/plain": [
"2×7 DataFrame. Omitted printing of 1 columns\n",
"│ Row │ c1 │ c2 │ c3 │ c4 │ d1 │ d2 │\n",
"│ │ \u001b[90mString\u001b[39m │ \u001b[90mInt64\u001b[39m │ \u001b[90mString\u001b[39m │ \u001b[90mFloat64\u001b[39m │ \u001b[90mDate\u001b[39m │ \u001b[90mDate\u001b[39m │\n",
"├─────┼────────┼───────┼────────┼─────────┼────────────┼────────────┤\n",
"│ 1 │ XY │ 2 │ c │ 1.5 │ 2020-01-05 │ 0020-12-01 │\n",
"│ 2 │ AB │ 16 │ x │ 2.33 │ 2021-01-05 │ 0020-10-15 │"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data = \"\"\"c1|c2|c3|c4|d1|d2|time\n",
"\"XY\"|2|c|1.5|2020-01-05|01/12/20|2020Jan15T10:55:03\n",
"\"AB\"|16|x|2.33|2021-01-05|15/10/20|2020Jan15T23:08:59\n",
"\"\"\"\n",
"\n",
"# give each date column its own format; the time column uses a DateFormat object\n",
"df = CSV.read(IOBuffer(data), DataFrame; \n",
" dateformats=Dict(\n",
" \"d1\"=>\"yyyy-mm-dd\",\n",
" \"d2\"=>\"dd/mm/yy\",\n",
" \"time\"=>DateFormat(\"yyyyuuuddTHH:MM:SS\")\n",
" )\n",
")\n",
"df"
]
},
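{
"cell_type": "markdown",
"metadata": {},
"source": [
"The same `DateFormat` can be tried outside of CSV.jl with the `Dates` module alone. This is a small sketch, assuming the first sample value from the data above; it parses one string and writes it back in the same format."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"using Dates\n",
"\n",
"fmt = DateFormat(\"yyyyuuuddTHH:MM:SS\")\n",
"\n",
"# parse one value of the time column\n",
"dt = DateTime(\"2020Jan15T10:55:03\", fmt)\n",
"\n",
"# write the DateTime back as a string using the same format\n",
"roundtrip = Dates.format(dt, fmt)"
]
},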
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Details about date parsing"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"2020-01-01T00:00:00"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# DateFormat also accepts a locale name; by default only \"english\" is available\n",
"DateTime(\"2020Jan\", Dates.DateFormat(\"yyyyuuu\", \"english\"))"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Base.KeySet for a Dict{String,Dates.DateLocale} with 1 entry. Keys:\n",
" \"english\""
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"keys(Dates.LOCALES)"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Dict{String,Int64} with 24 entries:\n",
" \"Aug\" => 8\n",
" \"May\" => 5\n",
" \"may\" => 5\n",
" \"Jul\" => 7\n",
" \"Dec\" => 12\n",
" \"Apr\" => 4\n",
" \"nov\" => 11\n",
" \"jul\" => 7\n",
" \"Oct\" => 10\n",
" \"apr\" => 4\n",
" \"Feb\" => 2\n",
" \"feb\" => 2\n",
" \"Mar\" => 3\n",
" \"oct\" => 10\n",
" \"mar\" => 3\n",
" \"Sep\" => 9\n",
" \"Jun\" => 6\n",
" \"dec\" => 12\n",
" \"Jan\" => 1\n",
" \"aug\" => 8\n",
" \"jan\" => 1\n",
" \"jun\" => 6\n",
" \"Nov\" => 11\n",
" \"sep\" => 9"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"Dates.LOCALES[\"english\"].month_abbr_value"
]
}
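,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Other languages can be added by registering a new `Dates.DateLocale` in `Dates.LOCALES`, as described in the `Dates` documentation. The sketch below registers a locale under the name `spanish`; the locale name and the word lists are assumptions typed in for this example, not something shipped with Julia."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"using Dates\n",
"\n",
"spanish_months = [\"enero\", \"febrero\", \"marzo\", \"abril\", \"mayo\", \"junio\",\n",
"                  \"julio\", \"agosto\", \"septiembre\", \"octubre\", \"noviembre\", \"diciembre\"]\n",
"spanish_months_abbr = [\"ene\", \"feb\", \"mar\", \"abr\", \"may\", \"jun\",\n",
"                       \"jul\", \"ago\", \"sep\", \"oct\", \"nov\", \"dic\"]\n",
"spanish_days = [\"lunes\", \"martes\", \"miércoles\", \"jueves\", \"viernes\", \"sábado\", \"domingo\"]\n",
"spanish_days_abbr = [\"lun\", \"mar\", \"mié\", \"jue\", \"vie\", \"sáb\", \"dom\"]\n",
"\n",
"# register the locale so that DateFormat can refer to it by name\n",
"Dates.LOCALES[\"spanish\"] = Dates.DateLocale(\n",
"    spanish_months, spanish_months_abbr, spanish_days, spanish_days_abbr)\n",
"\n",
"# month abbreviations are now resolved through the new locale\n",
"parsed_es = Date(\"2020ene15\", DateFormat(\"yyyyuuudd\", \"spanish\"))"
]
}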
],
"metadata": {
"kernelspec": {
"display_name": "Julia 1.4.1",
"language": "julia",
"name": "julia-1.4"
},
"language_info": {
"file_extension": ".jl",
"mimetype": "application/julia",
"name": "julia",
"version": "1.4.1"
}
},
"nbformat": 4,
"nbformat_minor": 4
}