@mheberger · Last active June 27, 2025
Python script for delineating watersheds in the continental United States with the USGS NLDI API
"""
# Demo of watershed delineation in the US using the Python library pynhd and
data and methods from the USGS
Revision #1, incorporating feedback from Taher Chegini, email of 2022-11-21.
Revision #2, 2025-02-23, updating the new URLs of the USGS API, the NLDI.
Revision #3, 2025-06-11, making it work with updated versions of libraries.
This demo shows you how to use Python to find the watershed for any point in the
continental United States. We'll use a new water data service created by the US
Geological Survey (USGS), called the Hydro Network Linked Data Index
(NLDI). For more info, see: https://waterdata.usgs.gov/blog/nldi-intro/
The service is an application programming interface (API), and lets you query
geodata and other information from the National Hydrographic Dataset (NHD),
version 2. The NHD is "the nation's water data backbone," and gives direct
access to tons of useful information about rivers and drainage areas (or
catchments) that have been compiled by state and federal agencies over the
course of decades.
Their goal in creating the API was to "make the data in the NHD available to new
audiences," i.e. programmers without specialized skills in GIS. Still, it was a
little tricky to get the data I wanted, and took quite a few steps, though it is
probably much easier than it was a few years ago!
Unfortunately, the NLDI API is **not** based on the latest hydrography dataset
covering rivers and watersheds in the US. The naming of the different NHD
versions is confusing and inconsistent. As best as I can figure out, the NLDI
is based on NHDPlus v2, the version that was released in 2012, which was digitized
from maps at the 1:100,000-scale. There is a later version called NHDPlus HR,
for "high-resolution," which was never quite completed for the whole nation.
As of 2023, this is no longer being developed, and the USGS is now focused on
creating a hydrography data product called 3DHP.
The newer datasets (NHDPlus HR and 3DHP) cannot be accessed via the NLDI API.
If you want to use these data to delineate watersheds, you'll have to download
the geodata and use GIS software to do it. There's a [toolbox](
https://github.com/ACWI-SSWD/nhdplushr_tools) available for ArcMap, but if
you're using open-source software like QGIS, it will require more work.
While NHDPlus v2 is detailed enough for regional and watershed-level analysis,
its medium resolution means smaller streams and water bodies may not be represented,
particularly intermittent or ephemeral streams less than about 1 mile in
length.
"""
# First, import the libraries we'll be using
# I'm noting versions here, because I previously had a lot of difficulty when
# trying to update this script with incompatible versions of different
# libraries. Best thing seems to be to start with a fresh virtual environment
# and install the required libraries from scratch.
import contextily # 1.6.2
import geopandas # 1.1.0
import matplotlib.pyplot as plt # 3.10.3
from matplotlib_scalebar.scalebar import ScaleBar
import pynhd # 0.19.4
import shapely.geometry as sgeom # 2.1.1
from shapely.geometry import Polygon, MultiPolygon
import time
import xyzservices.providers as xyz
# We'll use the Python library `pynhd` that allows for convenient access to the
# USGS NLDI API: https://github.com/hyriver/pynhd
pygeoapi = pynhd.PyGeoAPI()
nldi = pynhd.NLDI()
# Tell the script where to put the outputs (a folder name)
out_folder = 'out'
from pathlib import Path
Path(out_folder).mkdir(exist_ok=True)  # make sure the output folder exists
"""
## Specify the watershed outlet
Here are the coordinates for the watershed outlet, in latitude and longitude.
In my experimenting, I found that it's hard to get good results for watersheds
with the API. When you enter coordinates on a large river and expect a big
watershed, it often gives you a small watershed of a little local drainage area.
The reason is explained in this [USGS blog post]
(https://waterdata.usgs.gov/blog/nldi-processes/):
> the split catchment algorithm requires that a **precise location** along a
> flowline be provided (as can be derived from the raindrop trace algorithm).
> [emphasis added]
It turns out that we need to do our own "snap to river centerline" in order to
get expected results. The good news is that the API provides a function for this,
called `hydrolocation`. One of the challenges in using the API is that it has a
peculiar vocabulary, and the developers seem to assume you'll know what it means
to call a function called `hydrolocation` or `navigation`. I've been working in
hydrology for a long time, and I had no idea what those mean...
The hydrolocation routine will take our REQUESTED point, and find the closest
river centerline. Note that the closest flowline may not be the one you are
looking for! This is one of the reasons you have to always carefully inspect the
results of automated watershed delineation!
This outlet is on the Snake River in Washington State. It has a watershed
with an interesting shape!
"""
lat, lng = 46.666, -117.44
# pynhd's `comid_byloc()` function will snap the outlet to a nearby river centerline
# It is important to tell the function the map projection of our coordinates.
# In this case it is EPSG:4326, for standard lat/lng coordinates.
# It expects the coordinates as a tuple, in the order lng, lat (take note!)
coords = (lng, lat)
outlet_gdf = nldi.comid_byloc(coords=coords, loc_crs=4326)
# The function returns a GeoPandas GeoDataFrame.
print(outlet_gdf)
# Here is how to extract the coordinates of the snapped outlet, if needed
snapped_lng = outlet_gdf.geometry.iloc[0].x
snapped_lat = outlet_gdf.geometry.iloc[0].y
print(f"Snapped outlet: ({snapped_lat}, {snapped_lng})")
# You might be curious how far the snapped outlet is from your requested outlet
# This will be somewhat approximate, because the coordinates are unprojected.
snap_distance = float(outlet_gdf.measure.iloc[0])
print(f"Snapped to a river {snap_distance:.1f} meters away")
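"""
The `measure` distance reported above is approximate. If you want to compute the
great-circle distance between the requested and snapped points yourself, here is
a small pure-Python haversine sketch. The helper name `haversine_m` is mine, not
part of pynhd or the NLDI API; it assumes a spherical Earth.
"""
import math

def haversine_m(lat1: float, lng1: float, lat2: float, lng2: float) -> float:
    """Great-circle distance in meters between two lat/lng points (spherical Earth)."""
    r = 6_371_000.0  # mean Earth radius in meters
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# In context, you could compare it to the API's `measure` value:
# print(f"Great-circle snap distance: {haversine_m(lat, lng, snapped_lat, snapped_lng):.1f} m")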
# We can also get the `comid` of the unit catchment our outlet is in.
# The `comid` or "common identifier" is a unique ID for river reaches
# (or flowlines) and unit catchments. Conveniently, these map 1:1.
comid_outlet = outlet_gdf.comid.iloc[0]
print(f"COMID of the terminal unit catchment: {comid_outlet}")
"""
## Get the watershed
pynhd's `pygeoapi` function can call the "split_catchment" service, which returns
the upstream watershed boundary as a GeoPandas GeoDataFrame.
Following the example [in the docs](https://docs.hyriver.io/examples/notebooks/pygeoapi.html),
the `pygeoapi` function in `PyNHD` requires two inputs, a `geopandas.GeoDataFrame`
and a string, either "flow_trace" or "split_catchment". There is an additional
parameter `upstream`. If you set upstream to False, you will get just the split
unit catchment. We want the *entire* upstream watershed, so we set upstream to
True. This parameter needs to go into the outlet GeoDataFrame, which I found a
little confusing as a way to pass parameters to a function, but it works!
"""
# Let's create a new GeoDataFrame for the outlet, with only the information we need.
outlet_gdf = geopandas.GeoDataFrame(
    {"upstream": [True]},
    geometry=[sgeom.Point((snapped_lng, snapped_lat))],
    crs=4326,
)
# Get the watershed
watershed_gdf = pynhd.pygeoapi(outlet_gdf, "split_catchment")
print(watershed_gdf)
# Plot the watershed boundary -- GeoPandas makes this easy using Matplotlib
fig, (ax1, ax2) = plt.subplots(ncols=2)
# Plot the entire watershed on the first set of axes
watershed_gdf.plot(ax=ax1, alpha=0.5)
# Add the outlet point to the map
outlet_gdf.plot(ax=ax1, edgecolor="red", facecolor="red", markersize=10)
ax1.title.set_text('Whole watershed')
# On the second plot, zoom in on the area around the outlet
watershed_gdf.plot(ax=ax2, alpha=0.5, edgecolor='k', facecolor=['yellow', 'blue'])
ax2.set_xlim([lng - 0.05, lng + 0.05])
ax2.set_ylim([lat - 0.05, lat + 0.05])
outlet_gdf.plot(ax=ax2, edgecolor="red", facecolor="red", markersize=15)
ax2.title.set_text('Zoomed in on the outlet')
plt.show()
"""
When we zoom in on the area around the outlet, we can see two polygons returned
by `split_catchment`. The smaller one is the "home" unit catchment that the
outlet is in. This includes area that is downstream of the outlet.
It's interesting, but we don't need it. So, let's delete the small polygon from
the result set. Unfortunately, the order in which the two polygons is returned
by the API is not consistent. Sometimes the small polygon is in the first row,
and sometimes it is in the second row.
We could calculate the sizes of the two polygons and delete the smaller one.
However, it's simpler to note that the full upstream watershed is the row whose
catchmentID is NaN, so we can just delete the row whose catchmentID is not NaN.
"""
# Before we delete the unit catchment, get its comid.
watershed_gdf.sort_values('catchmentID', inplace=True)
# Use .iloc for positional indexing; the row labels are not in a predictable order
comid_unit_catchment = watershed_gdf['catchmentID'].iloc[0]
print(f'COMID of the home unit catchment: {comid_unit_catchment}')
# The comids of the unit catchments and river reaches map 1:1. These should match!
assert comid_unit_catchment == comid_outlet
# The API returns the comid as a string, but it should be an integer
comid = int(comid_unit_catchment)
# Drop the home unit catchment, i.e. the row whose catchmentID is not NaN
watershed_gdf = watershed_gdf[watershed_gdf['catchmentID'].isna()]
print(watershed_gdf)
# Make a plot of the upstream watershed
_, (ax1) = plt.subplots(ncols=1)
watershed_gdf.plot(ax=ax1, alpha=0.5)
# Add the outlet point to the map
outlet_gdf.plot(ax=ax1, edgecolor="red", facecolor="red", markersize=10)
ax1.title.set_text('Upstream only')
plt.show()
"""
Sometimes the watersheds from NLDI have little spurious areas, often very small,
that are disconnected. In my testing, these are usually undesirable and should
be cleaned up. One way to do that is to remove all but the largest shapes,
converting any MultiPolygon features to Polygons. Below is a function that I
wrote to do this. It works on Shapely objects. Shapely is a Python library,
and it is the data format that is used by GeoPandas to store its geometries.
"""
def get_largest(input_poly: MultiPolygon | Polygon) -> Polygon:
    """
    Converts a Shapely MultiPolygon to a Shapely Polygon.
    For multipart polygons, keeps only the largest polygon by area.
    In my testing, this was usually good enough.
    Note: you can also do this via a PostGIS query... see myqueries_merit.py,
    query19a and query19b. Not sure one approach is better than the other;
    they both seem to work well.
    Args:
        input_poly: A Shapely Polygon or MultiPolygon
    Returns:
        A Shapely Polygon
    """
    if input_poly.geom_type == "MultiPolygon":
        # Keep only the polygon with the largest area
        return max(input_poly.geoms, key=lambda p: p.area)
    return input_poly
watershed_gdf['geometry'] = watershed_gdf['geometry'].apply(get_largest)
"""
Sometimes, the watershed created by NLDI will have lots of little donut holes in
it. This happens when there are "sinks" inside of the watershed, or little
internal _endorheic_ basins. In these areas, water drains to a depression and
gets stuck there and either infiltrates into the ground or evaporates, rather
than flowing into a river or stream and toward the watershed outlet.
How to deal with these little holes is somewhat of an unsettled question in
hydrology. I have seen complex routines where they fill with water and then
overflow. My view is that, for simple watershed delineation, most small donut
holes are data artifacts. Larger ones may be important for overland flow, but
perhaps less so for studying groundwater. In this case, let's fill them in.
I adapted this [code snippet]
(https://stackoverflow.com/questions/61427797/filling-a-hole-in-a-multipolygon-with-shapely-netherlands-2-digit-postal-codes)
from Stack Overflow to do the job. (This function gives you the option of
keeping holes above a certain area threshold.)
"""
def close_holes(poly: Polygon | MultiPolygon, area_max: float) -> Polygon | MultiPolygon:
    """
    Close polygon holes by limitation to the exterior ring.
    Updated to accept a MultiPolygon as input.
    Args:
        poly: Input Shapely Polygon or MultiPolygon
        area_max: keep holes that are larger than this;
            fill any holes less than or equal to this.
            We're working with unprojected lat/lng coordinates,
            so this needs to be in square decimal degrees.
    Example:
        gdf.geometry.apply(lambda p: close_holes(p, 0))
    """
    if isinstance(poly, Polygon):
        if area_max == 0:
            # Fill every hole by keeping only the exterior ring
            if poly.interiors:
                return Polygon(list(poly.exterior.coords))
            return poly
        # Keep only the holes that are larger than the threshold
        list_interiors = []
        for interior in poly.interiors:
            p = Polygon(interior)
            if p.area > area_max:
                list_interiors.append(interior)
        return Polygon(poly.exterior.coords, holes=list_interiors)
    elif isinstance(poly, MultiPolygon):
        # Apply recursively to each member polygon
        return MultiPolygon([close_holes(sub_poly, area_max) for sub_poly in poly.geoms])
    else:
        raise ValueError("Unsupported geometry type")
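"""
A note on units: `area_max` is in square decimal degrees, which is hard to reason
about. As a rough rule of thumb, you can convert an area in km² to square degrees
at a given latitude with the sketch below. It assumes a spherical Earth and a
constant 111.32 km per degree of latitude; the helper name `sq_km_to_sq_deg` is mine.
"""
import math

def sq_km_to_sq_deg(area_km2: float, lat: float) -> float:
    """Approximate conversion from km² to square decimal degrees at a given latitude."""
    km_per_deg_lat = 111.32  # roughly constant everywhere
    km_per_deg_lng = 111.32 * math.cos(math.radians(lat))  # shrinks toward the poles
    return area_km2 / (km_per_deg_lat * km_per_deg_lng)

# Example: fill holes smaller than about 1 km² at our outlet's latitude
# threshold = sq_km_to_sq_deg(1.0, lat)
# watershed_gdf['geometry'] = watershed_gdf.geometry.apply(lambda p: close_holes(p, threshold))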
# Make a plot of the filled in watershed polygon
_, (ax1) = plt.subplots(ncols=1)
watershed_gdf['geometry'] = watershed_gdf.geometry.apply(lambda p: close_holes(p, 0))
watershed_gdf.plot(ax=ax1)
ax1.title.set_text('Watershed with the donut holes filled in')
plt.show()
"""
## Get the upstream river reaches
We can use pynhd's `navigate_byid()` or `navigate_byloc()` function to get
the upstream river reaches. The function `navigate_byid` will return a list
of the comids of the upstream river reaches. The function `navigate_byloc()`
is better as it returns the geometries (geodata) as well.
You also need to tell it how far upstream you want to search for tributaries.
Since we want all of them, I just put 9,999 km (about 6,200 miles). That is
longer than the world's longest river, so we should be good!
Sometimes the NLDI API times out. I don't know any solution other than to wait
a minute and then try again. You could code a recursive function to try this
a few times... This exercise is left to the reader.
"""
print("Getting the river network geodata...")
try:
    rivers_gdf = nldi.navigate_byloc((lng, lat), navigation='upstreamTributaries',
                                     source='flowlines', trim_start=True, distance=9999)
except Exception:
    print("Failed. Trying again in one second...")
    time.sleep(1)
    rivers_gdf = nldi.navigate_byloc((lng, lat), navigation='upstreamTributaries',
                                     source='flowlines', trim_start=True, distance=9999)
# You could program this to try again more than once I suppose!
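"""
If you do want an automatic retry, a small generic helper like the one below will
do. This is a sketch of my own (the name `retry` and its parameters are not part
of pynhd); it simply calls a function until it succeeds or runs out of attempts.
"""
import time

def retry(func, attempts: int = 3, delay: float = 1.0):
    """Call `func` up to `attempts` times, sleeping `delay` seconds between failures."""
    for i in range(attempts):
        try:
            return func()
        except Exception as exc:
            if i == attempts - 1:
                raise  # out of attempts; re-raise the last error
            print(f"Attempt {i + 1} failed ({exc}); retrying in {delay} s...")
            time.sleep(delay)

# Example usage:
# rivers_gdf = retry(lambda: nldi.navigate_byloc((lng, lat),
#                    navigation='upstreamTributaries', source='flowlines',
#                    trim_start=True, distance=9999))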
print(rivers_gdf)
# Plot the upstream rivers
ax = rivers_gdf.plot(edgecolor="blue", linewidth=0.5, figsize=(7, 8))
# Add the outlet point
outlet_gdf.plot(ax=ax, edgecolor="red", facecolor="red", markersize=15, zorder=2)
plt.show()
"""
We now have both: (a) the watershed boundary and (b) the rivers. We could stop
here. But you'll notice that, for a large watershed like this one, there is an
incredible amount of detail in our river network. The lines won't show up
clearly unless you make a poster-size map. To make an attractive map, we need
to "prune" the river network, or get rid of the small streams and just keep
the big rivers. In the next step, we'll add extra information or "attributes"
to each river reach to help us do just this.
The bad news is that the rivers dataset is very bare bones. For each river reach,
we have its ID and geometry. The good news is that the USGS has tabulated lots of
information about these rivers, like stream order, name, length, slope,
and much more. They call these "Value Added Attributes" (VAA). You can find this
data table online, and load the data in your scripts.
The pynhd library has a handy function to get the VAA data, so we'll use that.
https://docs.hyriver.io/autoapi/pynhd/nhdplus_derived/index.html#pynhd.nhdplus_derived.nhdplus_vaa
The first time you run this function, it is slow, as it downloads a 245MB file.
Look for a .parquet file in the folder `cache`.
After the first download, it's cached and should be much faster.
"""
df_attribs = pynhd.nhdplus_derived.nhdplus_vaa()
print(df_attribs.head())
"""
It was a little hard to find documentation for what all the VAA fields mean,
but I found them in a table on page 6 of this document:
https://edap-ow-data-commons.s3.amazonaws.com/NHDPlusV21/Data/NationalData/0Release_Notes_NationalData_Seamless_GeoDatabase.pdf
There is a lot of valuable information in here, including estimated discharge, or flow.
(Incidentally, do _not_ pay attention to several of the first Google hits for
"NHD VAA." The first few results describe NHDPlus HR, which is the newer
_high-resolution_ version of the NHD, with only partial coverage and no plans
to be completed.)
Next, we will **join** this information to the tributaries geodataframe,
`rivers_gdf`.
The join is easier to do if the join fields have the same name and data type.
Note: Loading the full VAA dataset into memory is pretty inefficient. I found a
faster way to get just the information you need using a library called `duckdb`
that lets you run SQL queries on a parquet file. If you are worried about speed,
this is a good option. For this demo, the simple method works well enough.
"""
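"""
For the curious, the duckdb approach looks roughly like the sketch below. The
query-builder function is pure Python; the column names come from the VAA table
described above, while the parquet path and the helper name `vaa_query` are my
own assumptions, so check them against the file in your `cache` folder.
"""
def vaa_query(parquet_path: str, comids, columns=("comid", "streamorde", "slope")) -> str:
    """Build a SQL query that pulls selected VAA columns for a set of comids."""
    cols = ", ".join(columns)
    id_list = ", ".join(str(int(c)) for c in comids)
    return f"SELECT {cols} FROM '{parquet_path}' WHERE comid IN ({id_list})"

# Example usage (requires `pip install duckdb`); point it at the cached .parquet file:
# import duckdb
# sql = vaa_query("cache/<your-vaa-file>.parquet", rivers_gdf['nhdplus_comid'])
# df_attribs = duckdb.sql(sql).df()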
# In the rivers GeoDataFrame, rename the column "nhdplus_comid" to "comid"
# and convert it from a floating point number to a 64-bit integer
rivers_gdf['comid'] = rivers_gdf['nhdplus_comid'].astype('int64')
rivers_gdf = rivers_gdf.drop(['nhdplus_comid'], axis=1)
# GeoPandas makes it easy to do a table join.
# This adds the attributes columns to our tributaries geodataframe
rivers_attrs_gdf = rivers_gdf.merge(df_attribs, on='comid')
print(rivers_attrs_gdf)
"""
## Prune the river network
Next we will prune the river network by removing smaller tributaries. I found
that keeping a total of 4 or 5 orders of streams results in maps that are
visually pleasing and easier to read. We will use the field `streamorde`,
which is the Strahler Stream Order. Small headwater streams have order = 1,
and as you move downstream past a confluence with another stream, the order
increases. The highest stream order in NHDv2 is the Mississippi River,
downstream of its confluence with the Missouri River, with order = 10.
We can also see on our maps that the final river reach extends a bit downstream
of the watershed boundary. You could clip the river flowlines using the watershed
polygons to fix this, but I'll leave this as an exercise to the reader.
Hint: GeoPandas has a `.clip()` method, but it is painfully slow.
"""
# Find the minimum stream order in our river network
min_order = rivers_attrs_gdf.streamorde.min()
max_order = rivers_attrs_gdf.streamorde.max()
print(f"Stream order ranges from {min_order} to {max_order}")
# We'll keep a total of 4 orders of streams
min_order_to_keep = max_order - 4
print(f"The map will show streams from order {min_order_to_keep + 1} to {max_order}")
# Filter the table, keeping only those rows where stream order is above the minimum we calculated above
rivers_attrs_gdf = rivers_attrs_gdf[rivers_attrs_gdf.streamorde > min_order_to_keep]
len(rivers_attrs_gdf)
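"""
As for the clipping hint above: one alternative to GeoPandas' `.clip()` is to
intersect each flowline with the watershed polygon directly with Shapely. Below
is a small sketch of that idea on synthetic geometries; the helper name
`clip_lines_to_poly` is mine. In context, the vectorized
`rivers_attrs_gdf.geometry.intersection(watershed_polygon)` does the same thing.
"""
import shapely.geometry as sgeom

def clip_lines_to_poly(lines, poly):
    """Intersect each line with the polygon, dropping empty results."""
    out = []
    for line in lines:
        piece = line.intersection(poly)
        if not piece.is_empty:
            out.append(piece)
    return out

# Tiny demo: a 2x2 square and a line that pokes out of it
square = sgeom.Polygon([(0, 0), (2, 0), (2, 2), (0, 2)])
line = sgeom.LineString([(1, 1), (3, 1)])
clipped = clip_lines_to_poly([line], square)  # the clipped piece has length 1.0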
"""
## Make our final map
We'll continue using geopandas' built-in plotting methods, which rely on
matplotlib, to make a nice map that we can print or embed in a document.
We'll add a basemap to give it some more context. To do this, we have to
reproject our data from unprojected lat/lng coordinates to the "Web Mercator"
projection that is used by all the online maps and tile providers.
When we plot the rivers on a map, we'll give big rivers a heavier line,
and use a lighter line for small streams. Setting the line width proportional
to the square root of stream order looks nice.
"""
# Add a field for the river width and calculate its value
rivers_attrs_gdf['linewidth'] = (rivers_attrs_gdf['streamorde'] - min_order_to_keep - 0.5) ** 0.5
# Reproject to Web Mercator so we can show the basemap
watershed_webmercator = watershed_gdf.to_crs(epsg=3857)
# Plot the watershed boundary, and save the "axes" object that is returned
# so we can plot other layers on the same plot
ax = watershed_webmercator.plot(alpha=1, facecolor="none", edgecolor='red',
                                zorder=1, linewidth=2, figsize=(7, 8))
# Use the contextily library to add a basemap
# The default basemap is nice, but there are many options:
# https://contextily.readthedocs.io/en/latest/providers_deepdive.html
contextily.add_basemap(ax, attribution=False)
# Setting attribution to False avoids putting a long text that makes the map unviewable.
# For example, this line will give you aerial imagery, courtesy of the GIS company ESRI
# contextily.add_basemap(ax, source="https://server.arcgisonline.com/ArcGIS/rest/services/World_Imagery/MapServer/tile/{z}/{y}/{x}")
# Or this line will load the USGS topographic map baselayer.
# contextily.add_basemap(ax, source = xyz.USGS.USTopo)
# Plot the rivers
rivers_attrs_webmercator = rivers_attrs_gdf.to_crs(epsg=3857)
rivers_attrs_webmercator.plot(ax=ax, linewidth=rivers_attrs_webmercator['linewidth'],
                              edgecolor="blue", zorder=2)
# Plot the outlet point
point_webmercator = outlet_gdf.to_crs(epsg=3857)
point_webmercator.plot(ax=ax, linewidth=2, edgecolor="purple",
                       facecolor="none", markersize=40, alpha=1, zorder=3)
# Remove the axis tick mark labels
ax.xaxis.set_ticklabels([])
ax.yaxis.set_ticklabels([])
ax.tick_params(left=False, bottom=False)
# Add a scalebar - although note that these are not all that accurate
# for the Web Mercator map projection
ax.add_artist(ScaleBar(1, location='lower left', box_alpha=0))
plt.show()
"""
We can also create an interactive Leaflet map with Folium. This is a map you
view in a web browser, like Google Maps. This is very handy to pan and zoom
around your watershed!
We can use the `explore()` method of GeoPandas to put the features on a folium map.
The function returns a folium map object that we can reuse to add other layers.
"""
folium_map = watershed_gdf.explore(color="red", popup=False, style_kwds={'fill': False})
# Add a marker to show the outlet
outlet_gdf.explore(m=folium_map, style_kwds=dict(stroke=True, weight=3, color='purple', radius=5))
# Add the rivers to the map. The style function varies the line width
# using the attribute we added above, based on the square root of the stream order.
rivers_attrs_gdf.explore(
    m=folium_map,
    style_kwds=dict(
        color='blue',
        style_function=lambda x: {"weight": x["properties"]['linewidth']}
    )
)
# Save the map to an html file
folium_map.save(f"{out_folder}/map.html")
"""
# Save your results
If you're happy with the results and want to save them for later, GeoPandas makes it easy
to write them out in a number of different formats, like GeoPackage (recommended)
or shapefiles. Here I'm using GeoJSON. A list of supported file formats is here:
https://geopandas.org/en/stable/docs/user_guide/io.html#writing-spatial-data
"""
# Export the watershed boundary and rivers as GeoJSON files
print("Writing output geodata files")
watershed_gdf.to_file(f'{out_folder}/watershed.geojson', driver='GeoJSON')
"""
I could not get the dataframe with the joined attributes to write, because two of the columns
were of type 'categorical' and GeoPandas threw an error trying to handle them.
Here I'm just deleting those columns with drop(), but converting them to a
string or integer type would probably fix the problem too.
"""
rivers_attrs_gdf.drop(['reachcode', 'wbareatype'], axis=1, inplace=True)
# Try saving to a GeoPackage this time.
rivers_attrs_gdf.to_file(f'{out_folder}/rivers.gpkg')
"""
# Conclusion
In this script, we saw how to use Python to:
* delineate a watershed for anywhere in the continental United States with the USGS NLDI API
* find all upstream river reaches in the watershed
* attach "value added attributes" to learn more about those rivers
* create attractive plots and an interactive map
* save the results in a format that can be shared or used in other software
**Happy watershed delineation!**
"""
@mheberger commented Jun 27, 2025: Updated 2025-06-26 for the revised NLDI API.