Last active
April 10, 2026 23:10
Import a CSV file with special features into CrateDB
| """ | |
| ### About | |
| Task: Import a CSV file with special features into CrateDB. | |
| Source: https://guided-path.s3.us-east-1.amazonaws.com/demo_climate_data_export.csv | |
| The program applies two transformations before the data is ready for importing into | |
| CrateDB. For the import procedure, it uses a performance path with pandas/SQLAlchemy. | |
| - Convert coordinates in JSON list format to WKT POINT format. | |
| - Convert a dictionary encoded in proprietary Python format into standard JSON format. | |
| - Use `sqlalchemy_cratedb.insert_bulk` in `pandas.to_sql(method=insert_bulk)`. | |
| ### Usage | |
| ```shell | |
| uv run cratedb_climate_data_import.py \ | |
| "https://guided-path.s3.us-east-1.amazonaws.com/demo_climate_data_export.csv" \ | |
| "crate://crate:crate@localhost:4200/?ssl=false" | |
| ``` | |
| ### Details | |
| Source format: | |
| ``` | |
| ┌───────────────┬─────────────────────────────────┬─────────────────────────────────┐ | |
| │ timestamp ┆ geo_location ┆ data │ | |
| │ --- ┆ --- ┆ --- │ | |
| │ i64 ┆ str ┆ str │ | |
| ╞═══════════════╪═════════════════════════════════╪═════════════════════════════════╡ | |
| │ 1754784000000 ┆ [14.988999953493476, 51.102999… ┆ {'temperature': 19.70482788085… │ | |
| │ 1754784000000 ┆ [7.088122218847275, 51.0029999… ┆ {'temperature': 19.34780273437… │ | |
| │ 1754784000000 ┆ [7.58817776106298, 51.00299998… ┆ {'temperature': 17.71303710937… │ | |
| └───────────────┴─────────────────────────────────┴─────────────────────────────────┘ | |
| ``` | |
| Intermediary format: | |
| ``` | |
| ┌───────────────┬─────────────────────────┬─────────────────────────────────┐ | |
| │ timestamp ┆ geo_location ┆ data │ | |
| │ --- ┆ --- ┆ --- │ | |
| │ i64 ┆ str ┆ str │ | |
| ╞═══════════════╪═════════════════════════╪═════════════════════════════════╡ | |
| │ 1754784000000 ┆ POINT (14.989 51.103) ┆ {"temperature":19.704827880859… │ | |
| │ 1754784000000 ┆ POINT (7.088122 51.003) ┆ {"temperature":19.347802734375… │ | |
| │ 1754784000000 ┆ POINT (7.588178 51.003) ┆ {"temperature":17.713037109375… │ | |
| └───────────────┴─────────────────────────┴─────────────────────────────────┘ | |
| ``` | |
| ### References | |
| - https://github.com/crate/crate/issues/19192 | |
| - https://github.com/crate/crate-clients-tools/issues/319 | |
| - https://github.com/pola-rs/polars/issues/27266 | |
| ### Recommendation | |
| Please convert the `data` column to valid JSON, see `FIXME` admonition below. | |
| """ | |
# /// script
# requires-python = ">=3.9"
# dependencies = [
#     "orjson",
#     "pandas",
#     "polars",
#     "polars-st",
#     "sqlalchemy-cratedb",
# ]
# ///
import ast
import sys

import orjson
import polars as pl
import polars_st as st
import sqlalchemy as sa
from sqlalchemy_cratedb import insert_bulk
def climate_data_to_cratedb(source: str, target: str):
    # Read CSV lazily.
    df = pl.scan_csv(source, quote_char='"')

    # Apply transformations.
    transformations = [
        # Transformation rule to convert from coordinates to WKT POINT.
        st.point(pl.col("geo_location").str.json_decode(dtype=pl.List(pl.Float64))).st.to_wkt(),
        # Transformation rule to decode the Python-serialized dictionary.
        # FIXME: Please adjust the data source to use single quotes in the CSV,
        #        so the `data` column can be valid JSON, including double quotes.
        #        Then, this rule can go away after adjusting to use `scan_csv`
        #        with `quote_char="'"`. Thank you!
        #        Example CSV line: 1754784000000,'[14.988999953493476, 51.10299998894334]','{"temperature": 19.704827880859398, "pressure": 99310.625, "v10": -1.545882225036621, "u10": 1.7978938817977905, "latitude": 51.102999999999945, "longitude": 14.989}'  # noqa: E501
        pl.col("data").map_elements(lambda x: orjson.dumps(ast.literal_eval(x)).decode(), return_dtype=pl.String),  # noqa: S307
    ]
    df = df.with_columns(*transformations)

    # Write to CrateDB.
    engine = sa.create_engine(target, echo=True)
    with engine.connect() as connection:
        connection.execute(sa.text("""
            CREATE TABLE IF NOT EXISTS climate_data (
                "timestamp" TIMESTAMP WITHOUT TIME ZONE,
                "geo_location" GEO_POINT,
                "data" OBJECT(DYNAMIC) AS (
                    "temperature" DOUBLE PRECISION,
                    "u10" DOUBLE PRECISION,
                    "v10" DOUBLE PRECISION,
                    "pressure" DOUBLE PRECISION,
                    "latitude" DOUBLE PRECISION,
                    "longitude" DOUBLE PRECISION,
                    "humidity" DOUBLE PRECISION
                )
            );
        """))
        df.collect().to_pandas().to_sql(
            name="climate_data",
            schema="doc",
            con=connection,
            if_exists="append",
            index=False,
            chunksize=50_000,
            method=insert_bulk,
        )
if __name__ == "__main__":
    try:
        source = sys.argv[1]
        target = sys.argv[2]
    except IndexError:
        example = """
    uv run cratedb_climate_data_import.py \\
        "https://guided-path.s3.us-east-1.amazonaws.com/demo_climate_data_export.csv" \\
        "crate://crate:crate@localhost:4200/?ssl=false"
    """
        print(
            f"ERROR: Please provide two positional arguments <SOURCE> <TARGET>.\n\nExample:\n{example}",
            file=sys.stderr,
        )
        sys.exit(1)
    climate_data_to_cratedb(source, target)
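The two transformation rules can be exercised in isolation, without polars or a running CrateDB. A minimal sketch using only the standard library (`json` stands in for `orjson`, and the WKT rendering is a plain f-string rather than polars-st; the sample cells are made up):

```python
import ast
import json


def coords_to_wkt(cell: str) -> str:
    # Parse the JSON-style "[lon, lat]" list and render it as a WKT POINT.
    lon, lat = json.loads(cell)
    return f"POINT ({lon} {lat})"


def pydict_to_json(cell: str) -> str:
    # Safely evaluate the Python-repr dictionary, then re-serialize it as JSON.
    return json.dumps(ast.literal_eval(cell))


print(coords_to_wkt("[14.989, 51.103]"))        # POINT (14.989 51.103)
print(pydict_to_json("{'temperature': 19.7}"))  # {"temperature": 19.7}
```

`ast.literal_eval` only accepts Python literals, so unlike `eval` it cannot execute arbitrary code from a malicious CSV cell; that is why the script carries the `noqa: S307` marker rather than a plain `eval`.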
Hi. Removing the second transformation rule, which converts the Python dictionary into JSON, would make the import less awkward, regardless of the tool or machinery we ultimately use to conduct the import.
When possible, please adjust your data source file to wrap the data cells in single quotes, so the `data` column can convey valid JSON. Thank you!
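To illustrate the request: once the export wraps cells in single quotes, the `data` column parses as JSON directly, with no `ast.literal_eval` round-trip. A stdlib sketch with a made-up sample row:

```python
import csv
import io
import json

# Hypothetical CSV content once cells are wrapped in single quotes.
csv_text = (
    "timestamp,geo_location,data\n"
    "1754784000000,'[14.989, 51.103]','{\"temperature\": 19.7, \"pressure\": 99310.625}'\n"
)

reader = csv.DictReader(io.StringIO(csv_text), quotechar="'")
row = next(reader)
# The data cell is now valid JSON as-is.
data = json.loads(row["data"])
print(data["temperature"])  # 19.7
```

The same effect is obtained in the script above by switching to `pl.scan_csv(source, quote_char="'")` and dropping the second transformation rule.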
/cc @zolbatar, @hammerhead, @grbade, @karynzv