Skip to content

Instantly share code, notes, and snippets.

@namessanti
Forked from andy-esch/index.md
Last active October 23, 2022 15:22

Revisions

  1. namessanti revised this gist May 20, 2016. 1 changed file with 1 addition and 12 deletions.
    13 changes: 1 addition & 12 deletions index.md
    Original file line number Diff line number Diff line change
    @@ -1,15 +1,4 @@
    # CartoCamp Intermediate SQL

    ## January 11th, 2016

    ## Instructors

    Andy Eschbacher, Map Scientist, @MrEPhysics

    Stuart Lynn, Map Scientist, @Stuart_Lynn

    ## Wifi: CartoDB
    ## Pass: Ge0graphy
    # Intermediate SQL

    ## Workshop: http://bit.ly/cdb-int-workshop

  2. Andy Eschbacher revised this gist Jan 13, 2016. 1 changed file with 6 additions and 6 deletions.
    12 changes: 6 additions & 6 deletions index.md
    Original file line number Diff line number Diff line change
    @@ -291,18 +291,18 @@ We can use WHERE to match on strings like we have already seen but we dont have
    operator

    ```sql
    SELECT * FROM nyc_rat_sightings
    WHERE location_type LIKE '%Family%'
    SELECT * FROM nyc_rat_sightings
    WHERE location_type LIKE '%Family%'
    ```
    The LIKE command allows you to write a string with wildcards. A _ matches any one character and % matches any number of characters, so '%Family%' matches any string with 'Family' in it at any point. This is really powerful to get more info on pattern matching check out the [docs](http://www.postgresql.org/docs/8.3/static/functions-matching.html).

    Finally lets aggregate by the number of people dwelling in the buiilding. This is contained at the start of the string '1-2 Family Dwelling' so how do we get at it? Well we can split strings using the
    split_part function

    ```sql
    SELECT split_part(location_type, ' ', 1)
    FROM nyc_rat_sightings
    WHERE location_type LIKE '%Family%'
    SELECT split_part(location_type, ' ', 1)
    FROM nyc_rat_sightings
    WHERE location_type LIKE '%Family%'
    ```

    Split part takes a string to split on, what to split it with ' ' and which part of the split to return. Finally lets aggregate
    @@ -329,7 +329,7 @@ Awesome so we now see that for family homes with 1-2 people there where 15,166 s

    For more info on string functions in PostgreSQL cehck out the [documentation](http://www.postgresql.org/docs/9.3/static/functions-string.html).

    ## Visualize the number or rat complaints in each borough
    ## Visualize the number of rat complaints in each borough

    Let's use our `GROUP BY`s to look at how boroughs compare. Let's do the normal group by, but this time find the center of the complaints in each borough (a kind of center of mass of complaints).

  3. Andy Eschbacher revised this gist Jan 13, 2016. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion index.md
    Original file line number Diff line number Diff line change
    @@ -21,7 +21,7 @@ https://data.cityofnewyork.us/Social-Services/Rat-Sightings/3q43-55fe

    We can import it more easily via CartoDB's cached infrastructure:
    ```text
    http://eschbacher.cartodb.com/api/v2/sql?q=SELECT%20the_geom,unique_key,created_date,closed_date,agency,agency_name,complaint_type,descriptor,location_type,incident_zip,incident_addess,street_name,address_type,city,status,due_date,resolution_action_updated_date,latitude,longitude,borough,community_board%20FROM%20nyc_rat_sightings&format=csv&filename=nyc_rat_sightings
    http://eschbacher.cartodb.com/api/v2/sql?q=SELECT%20the_geom,unique_key,created_date,closed_date,agency,agency_name,complaint_type,descriptor,location_type,incident_zip,incident_address,street_name,address_type,city,status,due_date,resolution_action_updated_date,latitude,longitude,borough,community_board%20FROM%20nyc_rat_sightings&format=csv&filename=nyc_rat_sightings
    ```

    ## Getting Help
  4. Andy Eschbacher revised this gist Jan 12, 2016. 1 changed file with 3 additions and 0 deletions.
    3 changes: 3 additions & 0 deletions index.md
    Original file line number Diff line number Diff line change
    @@ -8,6 +8,9 @@ Andy Eschbacher, Map Scientist, @MrEPhysics

    Stuart Lynn, Map Scientist, @Stuart_Lynn

    ## Wifi: CartoDB
    ## Pass: Ge0graphy

    ## Workshop: http://bit.ly/cdb-int-workshop

    ## Data
  5. Andy Eschbacher revised this gist Jan 12, 2016. 1 changed file with 2 additions and 0 deletions.
    2 changes: 2 additions & 0 deletions index.md
    Original file line number Diff line number Diff line change
    @@ -8,6 +8,8 @@ Andy Eschbacher, Map Scientist, @MrEPhysics

    Stuart Lynn, Map Scientist, @Stuart_Lynn

    ## Workshop: http://bit.ly/cdb-int-workshop

    ## Data

    ### 311 Complaints about Rats
  6. Andy Eschbacher created this gist Jan 12, 2016.
    458 changes: 458 additions & 0 deletions index.md
    Original file line number Diff line number Diff line change
    @@ -0,0 +1,458 @@
    # CartoCamp Intermediate SQL

    ## January 11th, 2016

    ## Instructors

    Andy Eschbacher, Map Scientist, @MrEPhysics

    Stuart Lynn, Map Scientist, @Stuart_Lynn

    ## Data

    ### 311 Complaints about Rats
    The source of our data is NYC's Open Data:
    https://data.cityofnewyork.us/Social-Services/Rat-Sightings/3q43-55fe

    We can import it more easily via CartoDB's cached infrastructure:
    ```text
    http://eschbacher.cartodb.com/api/v2/sql?q=SELECT%20the_geom,unique_key,created_date,closed_date,agency,agency_name,complaint_type,descriptor,location_type,incident_zip,incident_addess,street_name,address_type,city,status,due_date,resolution_action_updated_date,latitude,longitude,borough,community_board%20FROM%20nyc_rat_sightings&format=csv&filename=nyc_rat_sightings
    ```

    ## Getting Help

    The internet is full of people who have had the same problems as you! Use their experince to solve your problems!

    **Google foo**

    If I want some date to string functions, I search:

    ```text
    postgresql string date
    ```

    Aggregate functions:

    ```text
    postgresql aggregate functions
    ```

    **Always know your version()**

    Find your PostgreSQL version:

    ```sql
    SELECT version()
    ```

    **StackExchange**

    * Queries and Database questions: http://dba.stackexchange.com/
    * Geospatial database questions: http://gis.stackexchange.com/

    **Learning resources**

    CartoDB's Map/Data Academy
    http://academy.cartodb.com/courses/sql-postgis/

    Codecademy:
    https://www.codecademy.com/learn/learn-sql

    **Other resources**

    If you don't have your own PostgreSQL instance set up, [SQLFiddle](http://sqlfiddle.com/) is a great place to try out queries. You can even share them with other people.
    http://sqlfiddle.com/

    [CartoCamp Office Hours!](http://www.meetup.com/CartoCamp/)


    ## Explore our data using SQL

    First, let's see how many rat sightings were reported:

    ```sql
    SELECT count(*)
    FROM nyc_rat_sightings
    ```

    | count |
    |-------|
    | 70576 |

    We see that there are 70,576 complaints about rats (or we assume they're all complaints).

    `count(*)` gives a count of the number of rows. If we want to find out how many rat sightings have a valid position (i.e., `the_geom` is not null), then we can do `count(the_geom)`, like this:

    ```sql
    SELECT
    count(the_geom) As num_sightings_with_location,
    count(*) As num_sightings
    FROM
    nyc_rat_sightings
    ```

    (Here we are giving each column an alias using `As new_name` after we define the operation. Also notice that we can liberally use whitespace to reformat for readability.)

    | num_sightings_with_location | num_sightings |
    |--------------------------|--------------------|
    | 70576 | 70131 |

    Notice that the numbers are different. We can check with a `WHERE` condition to see if we get 70576 - 70131 = 445:

    ```sql
    SELECT
    count(*),
    CASE WHEN count(*) = 445 THEN '\o/' ELSE ':(' END As result
    FROM
    nyc_rat_sightings
    WHERE
    the_geom IS NULL
    ```

    Result:

    | count | result |
    |-------|--------|
    | 445 | \o/ |

    ## Exploring time

    Now that we know the number of sightings (~70k), how many is that per year?

    We can use the aggregate functions `max` and `min` on the date columns to find out!

    ```sql
    SELECT
    max(created_date),
    min(created_date)
    FROM nyc_rat_sightings
    ```

    | max | min |
    |---------------------|----------------------------|
    | 12/31/2015 12:00:00 AM | 01/01/2010 02:15:27 PM |

    So we see that this dataset covers 2010 through 2015 sightings -- six complete years but is strangely tied to beginning and end of the year. Upon closer investigation, we have dates in 2016.

    But notice that `created_date` is a **string**! Postgres supports some fancy operations on `date` types, so we should probably convert to the `timestamp` format. CartoDB has an easy type converter built-in, but if you want to be careful over how it's converted, you may want to use a PostgreSQL function like [`to_timestamp`](http://www.postgresql.org/docs/9.3/static/functions-formatting.html) like this:

    ```sql
    -- Create a new column of type timestamp
    ALTER TABLE nyc_rat_sightings
    ADD COLUMN created_date_timestamp TIMESTAMP;

    -- Fill in string -> timestamp converted values
    UPDATE nyc_rat_sightings
    SET created_date_timestamp = to_timestamp(created_date, 'MM/DD/YYYY HH12:MI:SS AM')
    WHERE created_date IS NOT NULL AND created_date <> '';
    ```

    That's the safe way to do it because you don't alter the original column. If you want to live on the edge and alter the column type in place, you can run this:

    ```sql
    ALTER TABLE
    nyc_rat_sightings
    ALTER COLUMN
    created_date TYPE TIMESTAMP USING to_timestamp(created_date,'MM/DD/YYYY HH12:MI:SS AM')
    WHERE created_date IS NOT NULL AND created_date <> ''
    ```

    Let's try finding the max/min times again:

    ```sql
    SELECT
    max(created_date_timestamp),
    min(created_date_timestamp)
    FROM nyc_rat_sightings
    ```

    | max | min |
    |----------------------|----------------------|
    | 2016-01-10T01:55:47Z | 2010-01-01T08:29:58Z |

    Now we have the real max min from our data set, not the dates alphabetically.


    We can also use these max/min values to find the number of days in the range:

    ```sql
    SELECT
    max(created_date_timestamp) - min(created_date_timestamp) As num_days
    FROM
    nyc_rat_sightings
    ```

    | num_days |
    |----------|
    | 2200 |

    ## Finding which borough has the largest number of rat sightings

    All of the aggregate functions used so far have acted on the entire table... What if we want to look at aggregates over groups, such as which borough the sighting was in? We use the SQL clause `GROUP BY`!

    ```sql
    SELECT borough, count(*)
    FROM nyc_rat_sightings
    GROUP BY borough
    ```
    Result:

    |count|borough|
    |-------|----------|
    |23302|BROOKLYN|
    |18953|MANHATTAN|
    |14501|BRONX|
    |10242|QUEENS|
    |3577|STATEN ISLAND|
    |1|Unspecified|


    Notice after the `SELECT` that we need to either include the column(s) which appear in the GROUP BY, or use an aggregate function over the other column(s). For example, this would fail because `community_board` isn't in an aggregate or the `GROUP BY`:

    ```sql
    SELECT
    borough,
    count(*),
    community_board
    FROM
    nyc_rat_sightings
    GROUP BY
    borough
    ```

    If you wanted to aggregate over a string as well, you could do [`string_agg`](http://www.postgresql.org/docs/9.3/static/functions-aggregate.html).

    BUT, if you want groups within groups, you can specify multiple columns in the GROUP BY. This gives you counts

    ## Working with dates and time

    Dates can be tricky to work with but thankfully PostgreSQL has a number of functions that can help us out. The two most useful of which are `date_part` and `date_trunc`.

    `date_part` allows you to grab part of the date. For example, the year, the month, the day of the week, etc. Let's say we wanted to know how many rats where sighted on Mondays (day number 1 in PostgreSQL -- the number of days starts at Sunday):

    ```sql
    SELECT
    count(*)
    FROM
    nyc_rat_sightings
    WHERE
    date_part('dow', created_date_timestamp) = 1
    ```

    Or let's say you wanted to know how many rat sightings for each hour of the day there where you could use `date_part('hour', created_date_timestamp)`

    ```sql
    SELECT
    date_part('hour', created_date_timestamp) as hour,
    count(*) as count_per_hour
    FROM
    nyc_rat_sightings
    GROUP BY
    date_part('hour', created_date_timestamp)
    ORDER BY
    hour
    ```

    Finally, we can use `date_trunc` to truncate a date to a given precision. So for example let's say we wanted to create counts for rat sightings by hour each day, we could do:

    ```sql
    SELECT
    date_trunc('hour', created_date_timestamp) as time,
    count(*) as count_per_hour
    FROM
    nyc_rat_sightings
    GROUP BY
    date_trunc('hour', created_date_timestamp)
    ORDER BY
    date_trunc('hour', created_date_timestamp)
    ```

    For a full list of functions that can be used with dates and times check out the [documentation](http://www.postgresql.org/docs/9.3/static/functions-datetime.html).


    ## Working with Strings

    Just like dates SQL has a number of functions for matching and working with strings. The simplest of which is to concatenate strings. Lets say we wanted to create a full address from the table, we currenlty have it spread out over the incident_address, borough and city fields. Lets smoosh these in to one field with the parts of the address joined by the character '/' using the || operator in PostgreSQL


    ```sql
    SELECT
    incident_address || ' / ' || borough || ' / ' || city AS full_address
    FROM
    nyc_rat_sightings
    ```

    We can use WHERE to match on strings like we have already seen but we dont have to match to the exact string. Lets try and get all of the sightings in places where famalies live. In the Location_type column we have entries like "1-2 Family Dwelling" or "3+ Family Apt. Building". To get all of them we could make a list of strings that indicate a family lives there or we can simple us the like
    operator

    ```sql
    SELECT * FROM nyc_rat_sightings
    WHERE location_type LIKE '%Family%'
    ```
    The LIKE command allows you to write a string with wildcards. A _ matches any one character and % matches any number of characters, so '%Family%' matches any string with 'Family' in it at any point. This is really powerful to get more info on pattern matching check out the [docs](http://www.postgresql.org/docs/8.3/static/functions-matching.html).

    Finally lets aggregate by the number of people dwelling in the buiilding. This is contained at the start of the string '1-2 Family Dwelling' so how do we get at it? Well we can split strings using the
    split_part function

    ```sql
    SELECT split_part(location_type, ' ', 1)
    FROM nyc_rat_sightings
    WHERE location_type LIKE '%Family%'
    ```

    Split part takes a string to split on, what to split it with ' ' and which part of the split to return. Finally lets aggregate


    ```sql
    SELECT
    split_part(location_type, ' ', 1) as occupancy,
    count(*) as rat_num
    FROM
    nyc_rat_sightings
    WHERE
    location_type LIKE '%Family%'
    GROUP BY
    split_part(location_type, ' ', 1)
    ```
    Awesome so we now see that for family homes with 1-2 people there where 15,166 sightings and for 3+ there where 33875.

    | occupancy | rat_num |
    |-----------|---------|
    | 1-2 | 15166 |
    | 3+ | 33875 |


    For more info on string functions in PostgreSQL cehck out the [documentation](http://www.postgresql.org/docs/9.3/static/functions-string.html).

    ## Visualize the number or rat complaints in each borough

    Let's use our `GROUP BY`s to look at how boroughs compare. Let's do the normal group by, but this time find the center of the complaints in each borough (a kind of center of mass of complaints).

    ```sql
    SELECT
    min(cartodb_id) As cartodb_id,
    count(*) As num_sightings,
    ST_Transform(ST_Centroid(ST_Collect(the_geom)),3857) As the_geom_webmercator,
    borough
    FROM
    nyc_rat_sighting
    GROUP BY
    borough
    ```

    Some of the magic happening here is in the PostGIS function [`ST_Centroid`](http://postgis.net/docs/ST_Centroid.html), which calculates the 'center of mass' of the geometry into a latitude/longitude point. The aggregate function that allows us to use `the_geom` is [`ST_Collect()`](http://postgis.net/docs/ST_Centroid.html).

    We can visualize this in CartoDB by making a 'bubble map' that sizes the marker by an attribute, such as the number of sightings. But we can do more.

    ## Exploring time and space

    Now that we have some summary information about rats and boroughs, let's go farther!

    What about breaking them out by borough _and_ year? This way we can see trends in complaint data.

    One thing we need ahead of time is the center of all the boroughs that we used in the last query. Since this will change year by year, we should use one for all years so they are neatly stacked.

    ```sql
    -- Find centers of sightings by borough (all years)
    WITH centers As (
    SELECT
    ST_Transform(ST_Centroid(ST_Collect(the_geom)),3857) As borough_centers,
    borough
    FROM
    nyc_rat_sightings
    GROUP BY
    borough
    )

    -- grab all of the values we need
    SELECT
    min(n.cartodb_id) As cartodb_id,
    n.borough As borough,
    count(*) As num_sightings,
    date_part('year', created_date_timestamp) As year,
    150.0 * sqrt(count(*) / 6000.0) As symbol_size,
    c.borough_centers As the_geom_webmercator
    FROM
    nyc_rat_sightings As n
    JOIN
    centers As c
    ON c.borough = n.borough
    GROUP BY
    n.borough,
    date_part('year', created_date_timestamp),
    c.borough_centers
    ORDER BY
    num_sightings DESC
    ```

    Proportional symbol maps symbolize attributes by having the marker have an area proportional to a quantity -- in our case that's number of complaints. In our case, we know roughly that the max number of complaints in any of the boroughs is around 6000, so we normalize all years by that number and then take the square root (since area is proportional to r^2).

    Now we can make a category map off of the `year` column, and then do a rough proportional symbol map off of the number of complaints. Here's a good styling for that (with help from our colleague Mamata):

    ```css
    @1: #75445C;
    @2: #AF6458;
    @3: #D5A75B;
    @4: #D5C155;
    @5: #736F4C;
    @6: #5b788e;
    @7: #4C4E8F;

    Map {
    buffer-size: 255;
    }

    #nyc_rat_sightings {
    marker-fill-opacity: 1;
    marker-line-color: #FFF;
    marker-line-width: 0;
    marker-line-opacity: 1;
    marker-placement: point;
    marker-type: ellipse;
    marker-width: [symbol_size];
    marker-allow-overlap: true;
    [zoom=8]{marker-width: [symbol_size]*0.25;}
    [zoom=9]{marker-width: [symbol_size]*0.5;}
    [zoom=11]{marker-width: [symbol_size]*2;}
    [zoom=12]{marker-width: [symbol_size]*4;}
    [zoom=13]{marker-width: [symbol_size]*8;}
    }

    #nyc_rat_sightings[year=2010] {
    marker-fill: @1;
    }
    #nyc_rat_sightings[year=2011] {
    marker-fill: @2;
    }
    #nyc_rat_sightings[year=2012] {
    marker-fill: @3;
    }
    #nyc_rat_sightings[year=2013] {
    marker-fill: @4;
    }
    #nyc_rat_sightings[year=2014] {
    marker-fill: @5;
    }
    #nyc_rat_sightings[year=2015] {
    marker-fill: @6;
    }
    #nyc_rat_sightings[year=2016] {
    marker-fill: @7;
    }
    ```

    If you want to make way better proportional symbol maps, check out [this blogpost](http://blog.cartodb.com/proportional-symbol-maps/) by our colleague Mamata Akella.

    [![](http://i.imgur.com/WBNqKls.png)](https://team.cartodb.com/u/eschbacher/viz/09fcb7e0-b948-11e5-a80a-0ea31932ec1d/embed_map)

    Check out the [final map](https://team.cartodb.com/u/eschbacher/viz/09fcb7e0-b948-11e5-a80a-0ea31932ec1d/embed_map "Proportional symbol map visualizing the number of reported rat sightings").

    ## Resources

    * CartoDB's Map Academy: http://academy.cartodb.com/courses/sql-postgis/
    * PostgreSQL documentation (the good stuff): http://www.postgresql.org/docs/9.3/static/
    * PostGIS documentation: http://postgis.net
    * Andy: [email protected] or @MrEPhysics
    * Stuart: [email protected] or @Stuart_Lynn
    * DBA StackExchange: http://dba.stackexchange.com
    * GIS StackExchange: http://gis.stackexchange.com