Notes about feature IDs (FIDs)

about feature IDs

I couldn't find a complete specification for FIDs, but GDAL's vector data model docs say:

The feature id (FID) of a feature is intended to be a unique identifier for the feature within the layer it is a member of. Freestanding features, or features not yet written to a layer may have a null (OGRNullFID) feature id. The feature ids are modeled in OGR as a 64-bit integer; however, this is not sufficiently expressive to model the natural feature ids in some formats. For instance, the GML feature id is a string.

This suggests that every feature read from a layer will have a defined FID, and that every feature, including one not yet written to a layer, will have an FID that is either defined or null (though it's not clear that this is guaranteed).

Format drivers may or may not enforce further properties of the FID. For instance, shapefile FIDs start at 0 while GeoPackage FIDs start at 1. Every time a shapefile changes, the FIDs are re-ordered sequentially starting from 0 with the modified features moved to the end. I haven't researched other formats, but they almost certainly do things differently.

Therefore, if we're working with a GDAL layer read from an unknown vector format, we should not assume much about its FIDs.

guaranteed / non-guaranteed properties of FIDs

Seemingly guaranteed (for data read from an unknown format):

the function feature.GetFID exists
it will return an integer
the integer will be unique among features in the layer
the function layer.GetFeature(fid) will return the feature with the given FID, if it exists
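if that FID doesn't exist, the lookup will fail (depending on the driver and on whether exceptions are enabled in the bindings, it may raise an error or return None)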
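
Since a failed lookup might surface as either an exception or a None return, a defensive wrapper can normalize the two. This is only a minimal sketch: get_feature_or_none is a hypothetical helper name, and layer is assumed to be an osgeo.ogr.Layer (or anything with a compatible GetFeature method).

```python
def get_feature_or_none(layer, fid):
    """Return the feature with the given FID, or None if the lookup fails.

    Assumption: `layer` behaves like an osgeo.ogr.Layer.
    """
    try:
        # With exceptions disabled, a missing FID may simply give None.
        feature = layer.GetFeature(fid)
    except RuntimeError:
        # With exceptions enabled, a failed lookup may raise instead.
        # Note: this also swallows other RuntimeErrors from the driver.
        return None
    return feature
```

Callers can then test the result with `is None` regardless of which driver produced the layer.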

Not guaranteed (for data read from an unknown format):

the FID 0 or 1 (or any other number) will exist
every FID in the range [0, layer.GetFeatureCount()) will exist
FIDs will be consecutive
FIDs will be preserved if the dataset is copied to another format
FIDs will be preserved if features are added or deleted
FIDs will be preserved if other attributes are changed
etc
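
Rather than assuming any of these properties, it's cheap to inspect a layer's FIDs directly. A rough sketch, where summarize_fids is a hypothetical helper and layer is assumed to be an iterable osgeo.ogr.Layer whose features expose GetFID:

```python
def summarize_fids(layer):
    """Report basic facts about the FIDs actually present in a layer.

    Assumption: `layer` behaves like an osgeo.ogr.Layer.
    """
    layer.ResetReading()  # make sure iteration starts at the first feature
    fids = [feature.GetFID() for feature in layer]
    layer.ResetReading()  # leave the layer ready for the next reader

    ordered = sorted(fids)
    return {
        "count": len(fids),
        "unique": len(set(fids)) == len(fids),
        "min": ordered[0] if ordered else None,
        "max": ordered[-1] if ordered else None,
        # True only if sorted FIDs increase by exactly 1 each step
        "consecutive": all(b - a == 1 for a, b in zip(ordered, ordered[1:])),
    }
```

Code that needs to visit every feature can iterate over the collected FIDs (or the layer itself) instead of looping over range(layer.GetFeatureCount()).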

using FIDs safely

FIDs are valuable when used correctly. Just make sure that FIDs are only used as identifiers within a single unmodified layer.

A few models assume that FIDs match across two identical output vectors created in the same format. This seems to hold up in practice, but it may not be guaranteed and is vulnerable to error. For instance, if attributes are changed in one vector, causing the FIDs to re-order, are we silently mixing up data?
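
One way to catch that kind of silent mix-up is to verify the assumption before relying on it. The sketch below is hypothetical (find_fid_mismatches and key_field are names I made up) and assumes both layers behave like osgeo.ogr.Layer and share an attribute that should be identical for matching features:

```python
def find_fid_mismatches(layer_a, layer_b, key_field):
    """Return FIDs from layer_a whose key_field value differs in layer_b.

    Assumption: both layers behave like osgeo.ogr.Layer. FIDs that exist
    only in layer_b are not reported by this one-directional check.
    """
    mismatched = []
    layer_a.ResetReading()
    for feature_a in layer_a:
        fid = feature_a.GetFID()
        try:
            feature_b = layer_b.GetFeature(fid)
        except RuntimeError:
            # A failed lookup may raise when exceptions are enabled.
            feature_b = None
        if (feature_b is None or
                feature_a.GetField(key_field) != feature_b.GetField(key_field)):
            mismatched.append(fid)
    return mismatched
```

An empty result is evidence that the FIDs line up for that attribute, not proof that the two vectors are identical.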

Common use cases for FIDs:

key in the dictionary returned by pygeoprocessing.zonal_statistics
geometry ID in an rtree spatial index
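
For the first use case, the safe pattern is to look up each feature's stats by its own FID rather than by a loop counter. A minimal sketch, assuming the stats dictionary maps each FID to a dict with "sum" and "count" keys (which is my understanding of what pygeoprocessing.zonal_statistics returns); mean_by_fid is a hypothetical helper:

```python
def mean_by_fid(layer, stats_by_fid):
    """Map each feature's FID to its zonal mean, or None if unavailable.

    Assumptions: `layer` behaves like an osgeo.ogr.Layer, and
    `stats_by_fid` maps FID -> {"sum": ..., "count": ...}.
    """
    means = {}
    layer.ResetReading()
    for feature in layer:
        fid = feature.GetFID()  # use the real FID, never a loop index
        stats = stats_by_fid.get(fid)
        if stats is None or not stats["count"]:
            # No stats for this FID, or no valid pixels under the feature.
            means[fid] = None
        else:
            means[fid] = stats["sum"] / stats["count"]
    return means
```

This stays within the rule above as long as the stats were computed from the same, unmodified layer that is being iterated.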

For post-processing, e.g. in GIS software, a separate unique key attribute should be created and used instead. Other output data should not reference a vector's FIDs: while that does work, it's prone to user error.

emlys/featureIDs.md
