@mheberger · Last active June 27, 2025
Python script for delineating watersheds in the continental United States with the USGS NLDI API
"""
# Demo of watershed delineation in the US using the Python library pynhd and
data and methods from the USGS
Revision #1, incorporating feedback from Taher Chegini, email of 2022-11-21.
Revision #2, 2025-02-23, updating the new URLs of the USGS API, the NLDI.
Revision #3, 2025-06-11, making it work with updated versions of libraries.
This demo shows you how to use Python to find the watershed for any point in the
continental United States. We'll use a new water data service created by the US
Geological Survey (USGS), called the Hydro Network Linked Data Index
(NLDI). For more info, see: https://waterdata.usgs.gov/blog/nldi-intro/
The service is an application programming interface (API), and lets you query
geodata and other information from the National Hydrographic Dataset (NHD),
version 2. The NHD is "the nation's water data backbone," and gives direct
access to tons of useful information about rivers and drainage areas (or
catchments) that have been compiled by state and federal agencies over the
course of decades.
Their goal in creating the API was to "make the data in the NHD available to new
audiences," i.e. programmers without specialized skills in GIS. Still, it was a
little tricky to get the data I wanted, and took quite a few steps, though it is
probably much easier than it was a few years ago!
Unfortunately, the NLDI API is **not** based on the latest hydrography dataset
covering rivers and watersheds in the US. The naming of the different NHD
versions is confusing and inconsistent. As best as I can figure out, the NLDI
is based on NHDPlus v2, the version that was released in 2012, which was digitized
from maps at the 1:100,000-scale. There is a later version called NHDPlus HR,
for "high-resolution," which was never quite completed for the whole nation.
As of 2023, this is no longer being developed, and the USGS is now focused on
creating a hydrography data product called 3DHP.
The newer datasets (NHDPlus HR and 3DHP) cannot be accessed via the NLDI API.
If you want to use these data to delineate watersheds, you'll have to download
the geodata and use GIS software to do it. There's a [toolbox](
https://github.com/ACWI-SSWD/nhdplushr_tools) available for ArcMap, but if
you're using open-source software like QGIS, it will require more work.
While NHDPlus v2 is detailed enough for regional and watershed-level analysis,
its medium resolution means smaller streams and water bodies may not be represented,
particularly intermittent or ephemeral streams less than about 1 mile in
length.
"""
# First, import the libraries we'll be using
# I'm noting versions here, because I previously had a lot of difficulty when
# trying to update this script with incompatible versions of different
# libraries. Best thing seems to be to start with a fresh virtual environment
# and install the required libraries from scratch.
import contextily # 1.6.2
import geopandas # 1.1.0
import matplotlib.pyplot as plt # 3.10.3
from matplotlib_scalebar.scalebar import ScaleBar
import pynhd # 0.19.4
import shapely.geometry as sgeom # 2.1.1
from shapely.geometry import Polygon, MultiPolygon
import time
import xyzservices.providers as xyz
# We'll use the Python library `pynhd` that allows for convenient access to the
# USGS NLDI API: https://github.com/hyriver/pynhd
pygeoapi = pynhd.PyGeoAPI()
nldi = pynhd.NLDI()
# Tell the script where to put the outputs (a folder name)
out_folder = 'out'
from pathlib import Path
Path(out_folder).mkdir(exist_ok=True)  # make sure the output folder exists
"""
## Specify the watershed outlet
Here are the coordinates for the watershed outlet, in latitude and longitude.
In my experimenting, I found that it's hard to get good results for watersheds
with the API. When you enter coordinates on a large river and expect a big
watershed, it often gives you a small watershed of a little local drainage area.
The reason is explained in this [USGS blog post]
(https://waterdata.usgs.gov/blog/nldi-processes/):
> the split catchment algorithm requires that a **precise location** along a
> flowline be provided (as can be derived from the raindrop trace algorithm).
> [emphasis added]
It turns out that we need to do our own "snap to river centerline" in order to
get expected results. The good news is that the API provides a function for this,
called `hydrolocation`. One of the challenges in using the API is that it has a
peculiar vocabulary, and the developers seem to assume you'll know what it means
to call a function called `hydrolocation` or `navigation`. I've been working in
hydrology for a long time, and I had no idea what those mean...
The hydrolocation routine will take our REQUESTED point, and find the closest
river centerline. Note that the closest flowline may not be the one you are
looking for! This is one of the reasons you have to always carefully inspect the
results of automated watershed delineation!
This outlet is on the Snake River in Washington State. It has a watershed
with an interesting shape!
"""
lat, lng = 46.666, -117.44
# pynhd's `comid_byloc()` function will snap the outlet to a nearby river centerline
# It is important to tell the function the map projection of our coordinates.
# In this case it is EPSG:4326, for standard lat/lng coordinates.
# It expects the coordinates as a tuple, in the order lng, lat (take note!)
coords = (lng, lat)
outlet_gdf = nldi.comid_byloc(coords=coords, loc_crs=4326)
# The function returns a GeoPandas GeoDataFrame.
print(outlet_gdf)
# Here is how to extract the coordinates of the snapped outlet, if needed
snapped_lng = outlet_gdf.geometry.iloc[0].x
snapped_lat = outlet_gdf.geometry.iloc[0].y
print(f"Snapped outlet: ({snapped_lat}, {snapped_lng})")
# You might be curious how far the snapped outlet is from your requested outlet
# This will be somewhat approximate, because the coordinates are unprojected.
snap_distance = float(outlet_gdf.measure.iloc[0])
print(f"Snapped to a river {snap_distance:.1f} meters away")
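"""
The `measure` distance reported above is approximate. If you want to compute the
great-circle distance between the requested and snapped points yourself, here is
a small pure-Python haversine sketch. The helper name `haversine_m` is mine, not
part of pynhd or the NLDI API; it assumes a spherical Earth.
"""
import math

def haversine_m(lat1: float, lng1: float, lat2: float, lng2: float) -> float:
    """Great-circle distance in meters between two lat/lng points (spherical Earth)."""
    r = 6_371_000.0  # mean Earth radius in meters
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# In context, you could compare it to the API's `measure` value:
# print(f"Great-circle snap distance: {haversine_m(lat, lng, snapped_lat, snapped_lng):.1f} m")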
# We can also get the `comid` of the unit catchment our outlet is in.
# The `comid` or "common identifier" is a unique ID for river reaches
# (or flowlines) and unit catchments. Conveniently, these map 1:1.
comid_outlet = outlet_gdf.comid.iloc[0]
print(f"COMID of the terminal unit catchment: {comid_outlet}")
"""
## Get the watershed
pynhd's `pygeoapi` function can call the "split_catchment" service, which returns
the upstream watershed boundary as a GeoPandas GeoDataFrame.
Following the example [in the docs](https://docs.hyriver.io/examples/notebooks/pygeoapi.html),
the `pygeoapi` function in `PyNHD` requires two inputs, a `geopandas.GeoDataFrame`
and a string, either "flow_trace" or "split_catchment". There is an additional
parameter `upstream`. If you set upstream to False, you will get just the split
unit catchment. We want the *entire* upstream watershed, so we set upstream to
True. This parameter needs to go into the outlet GeoDataFrame, which I found a
little confusing as a way to pass parameters to a function, but it works!
"""
# Let's create a new GeoDataFrame for the outlet, with only the information we need.
outlet_gdf = geopandas.GeoDataFrame(
    {"upstream": [True]},
    geometry=[sgeom.Point((snapped_lng, snapped_lat))],
    crs=4326,
)
# Get the watershed
watershed_gdf = pynhd.pygeoapi(outlet_gdf, "split_catchment")
print(watershed_gdf)
# Plot the watershed boundary -- GeoPandas makes this easy using Matplotlib
fig, (ax1, ax2) = plt.subplots(ncols=2)
# Plot the entire watershed on the first set of axes
watershed_gdf.plot(ax=ax1, alpha=0.5)
# Add the outlet point to the map
outlet_gdf.plot(ax=ax1, edgecolor="red", facecolor="red", markersize=10)
ax1.title.set_text('Whole watershed')
# On the second plot, zoom in on the area around the outlet
watershed_gdf.plot(ax=ax2, alpha=0.5, edgecolor='k', facecolor=['yellow', 'blue'])
ax2.set_xlim([lng - 0.05, lng + 0.05])
ax2.set_ylim([lat - 0.05, lat + 0.05])
outlet_gdf.plot(ax=ax2, edgecolor="red", facecolor="red", markersize=15)
ax2.title.set_text('Zoomed in on the outlet')
plt.show()
"""
When we zoom in on the area around the outlet, we can see two polygons returned
by `split_catchment`. The smaller one is the "home" unit catchment that the
outlet is in. This includes area that is downstream of the outlet.
It's interesting, but we don't need it. So, let's delete the small polygon from
the result set. Unfortunately, the order in which the two polygons is returned
by the API is not consistent. Sometimes the small polygon is in the first row,
and sometimes it is in the second row.
We could calculate the sizes of the two polygons and delete the smaller one.
However, it's simpler to note that the full upstream watershed is the row whose
catchmentID is NaN, so we can just delete the row whose catchmentID is not NaN.
"""
# Before we delete the unit catchment, get its comid.
watershed_gdf.sort_values('catchmentID', inplace=True)
# Use .iloc for positional indexing; the row labels are not in a predictable order
comid_unit_catchment = watershed_gdf['catchmentID'].iloc[0]
print(f'COMID of the home unit catchment: {comid_unit_catchment}')
# The comids of the unit catchments and river reaches map 1:1. These should match!
assert comid_unit_catchment == comid_outlet
# The API returns the comid as a string, but it should be an integer
comid = int(comid_unit_catchment)
# Drop the home unit catchment, i.e. the row whose catchmentID is not NaN
watershed_gdf = watershed_gdf[watershed_gdf['catchmentID'].isna()]
print(watershed_gdf)
# Make a plot of the upstream watershed
_, (ax1) = plt.subplots(ncols=1)
watershed_gdf.plot(ax=ax1, alpha=0.5)
# Add the outlet point to the map
outlet_gdf.plot(ax=ax1, edgecolor="red", facecolor="red", markersize=10)
ax1.title.set_text('Upstream only')
plt.show()
"""
Sometimes the watersheds from NLDI have little spurious areas, often very small,
that are disconnected. In my testing, these are usually undesirable and should
be cleaned up. One way to do that is to remove all but the largest shapes,
converting any MultiPolygon features to Polygons. Below is a function that I
wrote to do this. It works on Shapely objects. Shapely is a Python library,
and it is the data format that is used by GeoPandas to store its geometries.
"""
def get_largest(input_poly: MultiPolygon | Polygon) -> Polygon:
    """
    Converts a Shapely MultiPolygon to a Shapely Polygon.
    For multipart polygons, keeps only the largest polygon by area.
    In my testing, this was usually good enough.
    Note: you can also do this via a PostGIS query... see myqueries_merit.py,
    query19a and query19b. Not sure one approach is better than the other;
    they both seem to work well.
    Args:
        input_poly: A Shapely Polygon or MultiPolygon
    Returns:
        A Shapely Polygon
    """
    if input_poly.geom_type == "MultiPolygon":
        # Keep only the polygon with the largest area
        return max(input_poly.geoms, key=lambda p: p.area)
    return input_poly
watershed_gdf['geometry'] = watershed_gdf['geometry'].apply(get_largest)
"""
Sometimes, the watershed created by NLDI will have lots of little donut holes in
it. This happens when there are "sinks" inside of the watershed, or little
internal _endorheic_ basins. In these areas, water drains to a depression and
gets stuck there and either infiltrates into the ground or evaporates, rather
than flowing into a river or stream and toward the watershed outlet.
How to deal with these little holes is somewhat of an unsettled question in
hydrology. I have seen complex routines where they fill with water and then
overflow. My view is that, for simple watershed delineation, most small donut
holes are data artifacts. Larger ones may be important for overland flow, but
perhaps less so for studying groundwater. In this case, let's fill them in.
I adapted this [code snippet]
(https://stackoverflow.com/questions/61427797/filling-a-hole-in-a-multipolygon-with-shapely-netherlands-2-digit-postal-codes)
from Stack Overflow to do the job. (This function gives you the option of
keeping holes above a certain area threshold.)
"""
def close_holes(poly: Polygon | MultiPolygon, area_max: float) -> Polygon | MultiPolygon:
    """
    Close polygon holes by limitation to the exterior ring.
    Updated to accept a MultiPolygon as input.
    Args:
        poly: Input Shapely Polygon or MultiPolygon
        area_max: keep holes that are larger than this;
            fill any holes less than or equal to this.
            We're working with unprojected lat/lng coordinates,
            so this needs to be in square decimal degrees.
    Example:
        gdf.geometry.apply(lambda p: close_holes(p, 0))
    """
    if isinstance(poly, Polygon):
        if area_max == 0:
            # Fill every hole by keeping only the exterior ring
            if poly.interiors:
                return Polygon(list(poly.exterior.coords))
            return poly
        # Keep only the holes that are larger than the threshold
        list_interiors = []
        for interior in poly.interiors:
            p = Polygon(interior)
            if p.area > area_max:
                list_interiors.append(interior)
        return Polygon(poly.exterior.coords, holes=list_interiors)
    elif isinstance(poly, MultiPolygon):
        # Apply recursively to each member polygon
        return MultiPolygon([close_holes(sub_poly, area_max) for sub_poly in poly.geoms])
    else:
        raise ValueError("Unsupported geometry type")
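"""
A note on units: `area_max` is in square decimal degrees, which is hard to reason
about. As a rough rule of thumb, you can convert an area in km² to square degrees
at a given latitude with the sketch below. It assumes a spherical Earth and a
constant 111.32 km per degree of latitude; the helper name `sq_km_to_sq_deg` is mine.
"""
import math

def sq_km_to_sq_deg(area_km2: float, lat: float) -> float:
    """Approximate conversion from km² to square decimal degrees at a given latitude."""
    km_per_deg_lat = 111.32  # roughly constant everywhere
    km_per_deg_lng = 111.32 * math.cos(math.radians(lat))  # shrinks toward the poles
    return area_km2 / (km_per_deg_lat * km_per_deg_lng)

# Example: fill holes smaller than about 1 km² at our outlet's latitude
# threshold = sq_km_to_sq_deg(1.0, lat)
# watershed_gdf['geometry'] = watershed_gdf.geometry.apply(lambda p: close_holes(p, threshold))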
# Make a plot of the filled in watershed polygon
_, (ax1) = plt.subplots(ncols=1)
watershed_gdf['geometry'] = watershed_gdf.geometry.apply(lambda p: close_holes(p, 0))
watershed_gdf.plot(ax=ax1)
ax1.title.set_text('Watershed with the donut holes filled in')
plt.show()
"""
## Get the upstream river reaches
We can use pynhd's `navigate_byid()` or `navigate_byloc()` function to get
the upstream river reaches. The function `navigate_byid` will return a list
of the comids of the upstream river reaches. The function `navigate_byloc()`
is better as it returns the geometries (geodata) as well.
You also need to tell it how far upstream you want to search for tributaries.
Since we want all of them, I just put 9,999 km (about 6,200 miles). That is
longer than the world's longest river, so we should be good!
Sometimes the NLDI API times out. I don't know any solution other than to wait
a minute and then try again. You could code a recursive function to try this
a few times... This exercise is left to the reader.
"""
print("Getting the river network geodata...")
try:
    rivers_gdf = nldi.navigate_byloc((lng, lat), navigation='upstreamTributaries',
                                     source='flowlines', trim_start=True, distance=9999)
except Exception:
    print("Failed. Trying again in one second...")
    time.sleep(1)
    rivers_gdf = nldi.navigate_byloc((lng, lat), navigation='upstreamTributaries',
                                     source='flowlines', trim_start=True, distance=9999)
# You could program this to try again more than once I suppose!
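"""
If you do want an automatic retry, a small generic helper like the one below will
do. This is a sketch of my own (the name `retry` and its parameters are not part
of pynhd); it simply calls a function until it succeeds or runs out of attempts.
"""
import time

def retry(func, attempts: int = 3, delay: float = 1.0):
    """Call `func` up to `attempts` times, sleeping `delay` seconds between failures."""
    for i in range(attempts):
        try:
            return func()
        except Exception as exc:
            if i == attempts - 1:
                raise  # out of attempts; re-raise the last error
            print(f"Attempt {i + 1} failed ({exc}); retrying in {delay} s...")
            time.sleep(delay)

# Example usage:
# rivers_gdf = retry(lambda: nldi.navigate_byloc((lng, lat),
#                    navigation='upstreamTributaries', source='flowlines',
#                    trim_start=True, distance=9999))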
print(rivers_gdf)
# Plot the upstream rivers
ax = rivers_gdf.plot(edgecolor="blue", linewidth=0.5, figsize=(7, 8))
# Add the outlet point
outlet_gdf.plot(ax=ax, edgecolor="red", facecolor="red", markersize=15, zorder=2)
plt.show()
"""
We now have both: (a) the watershed boundary and (b) the rivers. We could stop
here. But you'll notice that, for a large watershed like this one, there is an
incredible amount of detail in our river network. The lines won't show up
clearly unless you make a poster-size map. To make an attractive map, we need
to "prune" the river network, or get rid of the small streams and just keep
the big rivers. In the next step, we'll add extra information or "attributes"
to each river reach to help us do just this.
The bad news is that the rivers dataset is very bare bones. For each river reach,
we have its ID and geometry. The good news is that the USGS has tabulated lots of
information about these rivers, like stream order, name, length, slope,
and much more. They call these "Value Added Attributes" (VAA). You can find this
data table online, and load the data in your scripts.
The pynhd library has a handy function to get the VAA data, so we'll use that.
https://docs.hyriver.io/autoapi/pynhd/nhdplus_derived/index.html#pynhd.nhdplus_derived.nhdplus_vaa
The first time you run this function, it is slow, as it downloads a 245MB file.
Look for a .parquet file in the folder `cache`.
After the first download, it's cached and should be much faster.
"""
df_attribs = pynhd.nhdplus_derived.nhdplus_vaa()
print(df_attribs.head())
"""
It was a little hard to find documentation for what all the VAA fields mean,
but I found them in a table on page 6 of this document:
https://edap-ow-data-commons.s3.amazonaws.com/NHDPlusV21/Data/NationalData/0Release_Notes_NationalData_Seamless_GeoDatabase.pdf
There is a lot of valuable information in here, including estimated discharge, or flow.
(Incidentally, do _not_ pay attention to several of the first Google hits for
"NHD VAA." The first few results describe NHDPlus HR, which is the newer
_high-resolution_ version of the NHD, with only partial coverage and no plans
to be completed.)
Next, we will **join** this information to the tributaries geodataframe,
`rivers_gdf`.
The join is easier to do if the join fields have the same name and data type.
Note: Loading the full VAA dataset into memory is pretty inefficient. I found a
faster way to get just the information you need using a library called `duckdb`
that lets you run SQL queries on a parquet file. If you are worried about speed,
this is a good option. For this demo, the simple method works well enough.
"""
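"""
For the curious, the duckdb approach looks roughly like the sketch below. The
query-builder function is pure Python; the column names come from the VAA table
described above, while the parquet path and the helper name `vaa_query` are my
own assumptions, so check them against the file in your `cache` folder.
"""
def vaa_query(parquet_path: str, comids, columns=("comid", "streamorde", "slope")) -> str:
    """Build a SQL query that pulls selected VAA columns for a set of comids."""
    cols = ", ".join(columns)
    id_list = ", ".join(str(int(c)) for c in comids)
    return f"SELECT {cols} FROM '{parquet_path}' WHERE comid IN ({id_list})"

# Example usage (requires `pip install duckdb`); point it at the cached .parquet file:
# import duckdb
# sql = vaa_query("cache/<your-vaa-file>.parquet", rivers_gdf['nhdplus_comid'])
# df_attribs = duckdb.sql(sql).df()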
# In the rivers GeoDataFrame, rename the column "nhdplus_comid" to "comid"
# and convert it from a floating point number to a 64-bit integer
rivers_gdf['comid'] = rivers_gdf['nhdplus_comid'].astype('int64')
rivers_gdf = rivers_gdf.drop(['nhdplus_comid'], axis=1)
# GeoPandas makes it easy to do a table join.
# This adds the attributes columns to our tributaries geodataframe
rivers_attrs_gdf = rivers_gdf.merge(df_attribs, on='comid')
print(rivers_attrs_gdf)
"""
## Prune the river network
Next we will prune the river network by removing smaller tributaries. I found
that keeping a total of 4 or 5 orders of streams results in maps that are
visually pleasing and easier to read. We will use the field `streamorde`,
which is the Strahler Stream Order. Small headwater streams have order = 1,
and as you move downstream past a confluence with another stream, the order
increases. The highest stream order in NHDv2 is the Mississippi River,
downstream of its confluence with the Missouri River, with order = 10.
We can also see on our maps that the final river reach extends a bit downstream
of the watershed boundary. You could clip the river flowlines using the watershed
polygons to fix this, but I'll leave this as an exercise to the reader.
Hint: GeoPandas has a `.clip()` method, but it is painfully slow.
"""
# Find the minimum stream order in our river network
min_order = rivers_attrs_gdf.streamorde.min()
max_order = rivers_attrs_gdf.streamorde.max()
print(f"Stream order ranges from {min_order} to {max_order}")
# We'll keep a total of 4 orders of streams
min_order_to_keep = max_order - 4
print(f"The map will show streams from order {min_order_to_keep + 1} to {max_order}")
# Filter the table, keeping only those rows where stream order is above the minimum we calculated above
rivers_attrs_gdf = rivers_attrs_gdf[rivers_attrs_gdf.streamorde > min_order_to_keep]
len(rivers_attrs_gdf)
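"""
As for the clipping hint above: one alternative to GeoPandas' `.clip()` is to
intersect each flowline with the watershed polygon directly with Shapely. Below
is a small sketch of that idea on synthetic geometries; the helper name
`clip_lines_to_poly` is mine. In context, the vectorized
`rivers_attrs_gdf.geometry.intersection(watershed_polygon)` does the same thing.
"""
import shapely.geometry as sgeom

def clip_lines_to_poly(lines, poly):
    """Intersect each line with the polygon, dropping empty results."""
    out = []
    for line in lines:
        piece = line.intersection(poly)
        if not piece.is_empty:
            out.append(piece)
    return out

# Tiny demo: a 2x2 square and a line that pokes out of it
square = sgeom.Polygon([(0, 0), (2, 0), (2, 2), (0, 2)])
line = sgeom.LineString([(1, 1), (3, 1)])
clipped = clip_lines_to_poly([line], square)  # the clipped piece has length 1.0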
"""
## Make our final map
We'll continue using geopandas' built-in plotting methods, which rely on
matplotlib, to make a nice map that we can print or embed in a document.
We'll add a basemap to give it some more context. To do this, we have to
reproject our data from unprojected lat/lng coordinates to the "Web Mercator"
projection that is used by all the online maps and tile providers.
When we plot the rivers on a map, we'll give big rivers a heavier line,
and use a lighter line for small streams. Setting the line width proportional
to the square root of stream order looks nice.
"""
# Add a field for the river width and calculate its value
rivers_attrs_gdf['linewidth'] = (rivers_attrs_gdf['streamorde'] - min_order_to_keep - 0.5) ** 0.5
# Reproject to Web Mercator so we can show the basemap
watershed_webmercator = watershed_gdf.to_crs(epsg=3857)
# Plot the watershed boundary, and save the "axes" object that is returned
# so we can plot other layers on the same plot
ax = watershed_webmercator.plot(alpha=1, facecolor="none", edgecolor='red',
                                zorder=1, linewidth=2, figsize=(7, 8))
# Use the contextily library to add a basemap
# The default basemap is nice, but there are many options:
# https://contextily.readthedocs.io/en/latest/providers_deepdive.html
contextily.add_basemap(ax, attribution=False)
# Setting attribution to False avoids putting a long text that makes the map unviewable.
# For example, this line will give you aerial imagery, courtesy of the GIS company ESRI
# contextily.add_basemap(ax, source="https://server.arcgisonline.com/ArcGIS/rest/services/World_Imagery/MapServer/tile/{z}/{y}/{x}")
# Or this line will load the USGS topographic map baselayer.
# contextily.add_basemap(ax, source = xyz.USGS.USTopo)
# Plot the rivers
rivers_attrs_webmercator = rivers_attrs_gdf.to_crs(epsg=3857)
rivers_attrs_webmercator.plot(ax=ax, linewidth=rivers_attrs_webmercator['linewidth'],
                              edgecolor="blue", zorder=2)
# Plot the outlet point
point_webmercator = outlet_gdf.to_crs(epsg=3857)
point_webmercator.plot(ax=ax, linewidth=2, edgecolor="purple",
                       facecolor="none", markersize=40, alpha=1, zorder=3)
# Remove the axis tick mark labels
ax.xaxis.set_ticklabels([])
ax.yaxis.set_ticklabels([])
ax.tick_params(left=False, bottom=False)
# Add a scalebar - although note that these are not all that accurate
# for the Web Mercator map projection
ax.add_artist(ScaleBar(1, location='lower left', box_alpha=0))
plt.show()
"""
We can also create an interactive Leaflet map with Folium. This is a map you
view in a web browser, like Google Maps. This is very handy to pan and zoom
around your watershed!
We can use the `explore()` method of GeoPandas to put the features on a folium map.
The function returns a folium map object that we can reuse to add other layers.
"""
folium_map = watershed_gdf.explore(color="red", popup=False, style_kwds={'fill': False})
# Add a marker to show the outlet
outlet_gdf.explore(m=folium_map, style_kwds=dict(stroke=True, weight=3, color='purple', radius=5))
# Add the rivers to the map. The style function varies the line width
# using the attribute we added above, based on the square root of the stream order.
rivers_attrs_gdf.explore(
    m=folium_map,
    style_kwds=dict(
        color='blue',
        style_function=lambda x: {"weight": x["properties"]['linewidth']}
    )
)
# Save the map to an html file
folium_map.save(f"{out_folder}/map.html")
"""
# Save your results
If you're happy with the results and want to save them for later, GeoPandas makes it easy
to write them out in a number of different formats, like GeoPackage (recommended)
or shapefiles. Here I'm using GeoJSON. A list of supported file formats is here:
https://geopandas.org/en/stable/docs/user_guide/io.html#writing-spatial-data
"""
# Export the watershed boundary and rivers as GeoJSON files
print("Writing output geodata files")
watershed_gdf.to_file(f'{out_folder}/watershed.geojson', driver='GeoJSON')
"""
I could not get the dataframe with the joined attributes to write, because two of the columns
were of type 'categorical' and GeoPandas threw an error trying to handle them.
Here I'm just deleting those columns with drop(), but converting them to a
string or integer type would probably fix the problem too.
"""
rivers_attrs_gdf.drop(['reachcode', 'wbareatype'], axis=1, inplace=True)
# Try saving to a GeoPackage this time.
rivers_attrs_gdf.to_file(f'{out_folder}/rivers.gpkg')
"""
# Conclusion
In this script, we saw how to use Python to:
* delineate a watershed for anywhere in the continental United States with the USGS NLDI API
* find all upstream river reaches in the watershed
* attach "value added attributes" to learn more about those rivers
* create attractive plots and an interactive map
* save the results in a format that can be shared or used in other software
**Happy watershed delineation!**
"""
@mheberger commented Jun 27, 2025: Updated 2025-06-26 for the revised NLDI API.