Skip to content

Instantly share code, notes, and snippets.

@amotl
Last active April 10, 2026 23:10
Show Gist options
  • Select an option

  • Save amotl/949547787e116c8cafabe2959281e7ec to your computer and use it in GitHub Desktop.

Select an option

Save amotl/949547787e116c8cafabe2959281e7ec to your computer and use it in GitHub Desktop.
Import a CSV file with special features into CrateDB
"""
### About
Task: Import a CSV file with special features into CrateDB.
Source: https://guided-path.s3.us-east-1.amazonaws.com/demo_climate_data_export.csv
The program applies two transformations before the data is ready for importing into
CrateDB. For the import procedure, it uses a performance path with pandas/SQLAlchemy.
- Convert coordinates in JSON list format to WKT POINT format.
- Convert a dictionary encoded in proprietary Python format into standard JSON format.
- Use `sqlalchemy_cratedb.insert_bulk` in `pandas.to_sql(method=insert_bulk)`.
### Usage
```shell
uv run cratedb_climate_data_import.py \
"https://guided-path.s3.us-east-1.amazonaws.com/demo_climate_data_export.csv" \
"crate://crate:crate@localhost:4200/?ssl=false"
```
### Details
Source format:
```
┌───────────────┬─────────────────────────────────┬─────────────────────────────────┐
│ timestamp ┆ geo_location ┆ data │
│ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str │
╞═══════════════╪═════════════════════════════════╪═════════════════════════════════╡
│ 1754784000000 ┆ [14.988999953493476, 51.102999… ┆ {'temperature': 19.70482788085… │
│ 1754784000000 ┆ [7.088122218847275, 51.0029999… ┆ {'temperature': 19.34780273437… │
│ 1754784000000 ┆ [7.58817776106298, 51.00299998… ┆ {'temperature': 17.71303710937… │
└───────────────┴─────────────────────────────────┴─────────────────────────────────┘
```
Intermediary format:
```
┌───────────────┬─────────────────────────┬─────────────────────────────────┐
│ timestamp ┆ geo_location ┆ data │
│ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str │
╞═══════════════╪═════════════════════════╪═════════════════════════════════╡
│ 1754784000000 ┆ POINT (14.989 51.103) ┆ {"temperature":19.704827880859… │
│ 1754784000000 ┆ POINT (7.088122 51.003) ┆ {"temperature":19.347802734375… │
│ 1754784000000 ┆ POINT (7.588178 51.003) ┆ {"temperature":17.713037109375… │
└───────────────┴─────────────────────────┴─────────────────────────────────┘
```
### References
- https://github.com/crate/crate/issues/19192
- https://github.com/crate/crate-clients-tools/issues/319
- https://github.com/pola-rs/polars/issues/27266
### Recommendation
Please convert the `data` column to valid JSON, see `FIXME` admonition below.
"""
# /// script
# requires-python = ">=3.9"
# dependencies = [
# "pandas",
# "polars",
# "polars-st",
# "sqlalchemy-cratedb",
# ]
# ///
import ast
import sys
import orjson
import polars as pl
import polars_st as st
import sqlalchemy as sa
from sqlalchemy_cratedb import insert_bulk
def climate_data_to_cratedb(source: str, target: str):
# Read CSV.
df = pl.scan_csv(source, quote_char='"')
# Apply transformations.
transformations = [
# Transformation rule to convert from coordinates to WKT POINT.
st.point(pl.col("geo_location").str.json_decode(dtype=pl.List(pl.Float64))).st.to_wkt(),
# Transformation rule to decode Python-serialized dictionary.
# FIXME: Please adjust the data source to use single quotes in the CSV,
# so the `data` column can be valid JSON, including double quotes.
# Then, this rule can go away after adjusting to use `scan_csv`
# with `quote_char="'"`. Thank you!
# Example CSV line: 1754784000000,'[14.988999953493476, 51.10299998894334]','{"temperature": 19.704827880859398, "pressure": 99310.625, "v10": -1.545882225036621, "u10": 1.7978938817977905, "latitude": 51.102999999999945, "longitude": 14.989}' # noqa: E501
pl.col("data").map_elements(lambda x: orjson.dumps(ast.literal_eval(x)).decode(), return_dtype=pl.String), # noqa: S307
]
df = df.with_columns(*transformations)
# Write to CrateDB.
engine = sa.create_engine(target, echo=True)
with engine.connect() as connection:
connection.execute(sa.text("""
CREATE TABLE IF NOT EXISTS climate_data
(
"timestamp" TIMESTAMP WITHOUT TIME ZONE,
"geo_location" GEO_POINT,
"data" OBJECT(DYNAMIC) AS (
"temperature" DOUBLE PRECISION,
"u10" DOUBLE PRECISION,
"v10" DOUBLE PRECISION,
"pressure" DOUBLE PRECISION,
"latitude" DOUBLE PRECISION,
"longitude" DOUBLE PRECISION,
"humidity" DOUBLE PRECISION
)
);
"""))
df.collect().to_pandas().to_sql(
name="climate_data",
schema="doc",
con=connection,
if_exists="append",
index=False,
chunksize=50_000,
method=insert_bulk,
)
if __name__ == "__main__":
try:
source = sys.argv[1]
target = sys.argv[2]
except IndexError:
example = """
uv run cratedb_climate_data_import.py \\
"https://guided-path.s3.us-east-1.amazonaws.com/demo_climate_data_export.csv" \\
"crate://crate:crate@localhost:4200/?ssl=false"
"""
print(
f"ERROR: Please provide two positional arguments <SOURCE> <TARGET>.\n\nExample:\n{example}", file=sys.stderr
)
sys.exit(1)
climate_data_to_cratedb(source, target)
@amotl
Copy link
Copy Markdown
Author

amotl commented Apr 10, 2026

Hi. Removing the second transformation rule to convert the Python dictionary into JSON will be nice to make the import less awkward, regardless of the tool or machinery we will optimally use to conduct the import.

When possible, please adjust your data source file by using single quotes to wrap the data cells, to be able to convey valid JSON within the data column. Thank you!

Example CSV line:

1754784000000,'[14.988999953493476, 51.10299998894334]','{"temperature": 19.704827880859398, "pressure": 99310.625, "v10": -1.545882225036621, "u10": 1.7978938817977905, "latitude": 51.102999999999945, "longitude": 14.989}'

/cc @zolbatar, @hammerhead, @grbade, @karynzv

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment