This is a list of general QA-QC procedures for evaluating GIS vector data. These are methods I've had success with for catching common errors and anomalies. These techniques focus on identifying higher level issues that can be overlooked when the data is peer-reviewed at a “down in the weeds” level specific to the content of the data layers.
Most often, granular data edit errors are caught in peer-level review. After all, that is the whole purpose of these reviews: "Did all the sewer lines get added? Did all of the attributes on the sewer lines get filled out correctly?" These reviews are good at catching a feature that was missed and never added to the shapefile, an attribute that was left empty or filled in with an incorrect value, and so on.
With all the hard work to make sure the nitty-gritty details were captured in the feature class, there can be bigger picture issues that cause problems which may not be immediately evident. Features with null geometry, corrupt spatial indexes, duplicated features, or final data unintentionally cropped in an export can all cause problems down the line for the end users or client.
- Check for geometry anomalies {Check Geometry, Repair Geometry, Select Layer By Attribute}
- Compare datasets pre- and post-edits to confirm intended changes {Feature Compare, Table Compare}
- Check feature counts and for the presence of duplicates {Find Identical, Delete Identical}
- Rebuild spatial index {Add Spatial Index}
- Ensure the schema was adhered to
Geometry issues occur when features in a feature class or shapefile have corrupt or null geometry. These can surface as geoprocessing tools failing, edit sessions refusing to save, or other weird behavior.
A few common scenarios I have come across where geometry issues can arise:
- When editing in an attribute table, you click in the last blank row of the table and it creates a new feature with NULL geometry. You can select and edit this table row as usual, but there is no corresponding geographic feature on the map.
- Importing data from sources not originally generated in GIS, such as DWG, DGN, KMZ, etc. This is a common scenario where features can show up with invalid or corrupt (but not necessarily null) geometry, e.g. a self-intersecting polygon.
- Weird artifacts from editing or geoprocessing that create features with valid geometry that makes no sense in the context of the feature, e.g. a sewer line with a shape_length of 0.1' or a tax parcel with a shape_area of 1.0 square foot.
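The first scenario above, a table row with no geometry, can be sketched as a plain-Python scan. This is not arcpy; features are modeled here as simple dicts, standing in for rows you would actually read with a cursor in ArcGIS.

```python
# Minimal sketch of a null-geometry scan, independent of any GIS library.
# Each feature is a dict with an "id" and a "geometry" value; a row created
# by clicking in the blank line of an attribute table has geometry None.

def find_null_geometry(features):
    """Return the ids of features whose geometry is missing."""
    return [f["id"] for f in features if f.get("geometry") is None]

features = [
    {"id": 1, "geometry": [(0, 0), (10, 0)]},   # normal line feature
    {"id": 2, "geometry": None},                # row added in the attribute table
    {"id": 3, "geometry": [(5, 5), (5, 15)]},
]

print(find_null_geometry(features))  # [2]
```

Feature id 2 is exactly the kind of row that edits and selects normally in the table but never draws on the map.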
Some good tools to use to check for these issues:
- Check Geometry geoprocessing tool
- Repair Geometry geoprocessing tool if issues are found (make a copy of the dataset first because this tool alters the input dataset)
- Select Layer By Attribute against the shape_length and shape_area fields to identify quantities that do not make sense
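The Select Layer By Attribute check in the last bullet boils down to a threshold filter. Here is a hedged sketch of that logic in plain Python; the 1-foot and 10-square-foot cutoffs are made-up illustrations, not recommendations, and you would tune them to your data.

```python
# Flag features whose measured length or area is implausibly small,
# mimicking a Select Layer By Attribute query like "Shape_Length < 1".
# Thresholds below are illustrative only.

SUSPECT_LENGTH = 1.0   # feet
SUSPECT_AREA = 10.0    # square feet

def suspicious_features(rows):
    """rows: dicts with 'id', 'shape_length', and optionally 'shape_area'."""
    flagged = []
    for r in rows:
        if r.get("shape_length", float("inf")) < SUSPECT_LENGTH:
            flagged.append(r["id"])
        elif r.get("shape_area", float("inf")) < SUSPECT_AREA:
            flagged.append(r["id"])
    return flagged

rows = [
    {"id": 101, "shape_length": 250.0},
    {"id": 102, "shape_length": 0.1},                       # 0.1' sewer line
    {"id": 103, "shape_length": 400.0, "shape_area": 1.0},  # 1 sq ft parcel
]
print(suspicious_features(rows))  # [102, 103]
```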
A good high-level check when updates are made to datasets is to compare the corresponding layers pre- and post-edits. Knowing what work was supposed to be completed, the question to ask is: "we were supposed to change this, that, and the other thing; did these data edits actually do that?" This check, in conjunction with the Editor Tracking dates, should give you a good idea of all the changes that were made. It is not a replacement for a careful visual comparison of the two data layers in ArcMap, but that visual comparison alone is not an exhaustive check either. Do both.
Two geoprocessing tools that are helpful here are the Feature Compare and Table Compare tools. These will run a comparison against two data layers and report the differences in the geometries, attribute tables, or both. For example, input the original water line feature class and the edited water line feature class, and the tool will return what all the changes between them are.
These tools can also be helpful for identifying changes when one dataset "drifts" from another because one was edited and the other was not. For example, we have a final shapefile that we sent to the client. They had that shapefile for a while, made some edits to it, and now we are tasked with doing project updates. They send us that shapefile and we want to see what they changed because they weren't using Editor Tracking.
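The gist of what Table Compare reports can be sketched as a diff of two OBJECTID-keyed tables. This is not the arcpy tool itself, just the underlying logic; the field name and values are invented for illustration.

```python
# Rough sketch of a table comparison: given two attribute tables keyed by
# OBJECTID, report which rows were added, removed, or changed.

def compare_tables(base, test):
    """base/test: {objectid: {field: value}}. Returns (added, removed, changed)."""
    added = sorted(set(test) - set(base))
    removed = sorted(set(base) - set(test))
    changed = sorted(oid for oid in set(base) & set(test) if base[oid] != test[oid])
    return added, removed, changed

original = {1: {"MATERIAL": "PVC"}, 2: {"MATERIAL": "DIP"}, 3: {"MATERIAL": "PVC"}}
edited   = {1: {"MATERIAL": "PVC"}, 2: {"MATERIAL": "HDPE"}, 4: {"MATERIAL": "PVC"}}

print(compare_tables(original, edited))  # ([4], [3], [2])
```

Reading the result: feature 4 was added, feature 3 was deleted, and feature 2 had an attribute edited. That is the kind of "what did they change" answer you want when the other party wasn't using Editor Tracking.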
A simple check of feature counts can sometimes tip you off to other issues. Say an original shapefile had 100 features and the task was to add a bunch more. You check the final data that was published to ArcGIS Online, and there are only 75 features in the feature service. That doesn't seem right; what happened?
There is not really any single cause of, or remedy for, feature count mismatches. A mismatched feature count between two datasets is just the canary tipping you off that something is wrong; you'll have to investigate further to get to the bottom of it.
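The check itself is just arithmetic: original features plus planned additions should equal the published count. A minimal sketch, with the counts as plain integers (in ArcGIS they would come from the Get Count tool):

```python
# Sanity-check a published feature count against what the edit plan implies.

def count_check(original, added, published):
    expected = original + added
    if published == expected:
        return "OK: {} features".format(published)
    return "MISMATCH: expected {}, found {}".format(expected, published)

print(count_check(original=100, added=25, published=75))
# MISMATCH: expected 125, found 75
```

A mismatch here doesn't tell you *why*; it just tells you to go digging through the scenarios below.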
Here are a couple of scenarios where feature count issues can pop up:
- The most common scenario I have seen: you are all done making your data edits in a working GDB and it's time to copy your data to the final GDB. You run a Data Export to the final GDB, but don't notice that you have features selected. Only the selected features are written to the final GDB instead of the entire feature class.
- Null or corrupt feature geometries in your feature class cause inconsistent fluctuation in the feature count.
- You are in an edit session and have features or table rows selected that are out of view. You copy and paste your feature of interest, unknowingly also copying and pasting a bunch of other features you didn't mean to. See identical features below...
Similar to feature count issues, you can also run into problems when there are duplicate identical features in a feature class. This can happen when a feature gets accidentally copied and pasted twice in an edit session, the Append geoprocessing tool is accidentally run twice, etc. Duplicate features can cause insidious problems later, particularly with attribution, when one copy's attributes get updated while the others' don't.
Fortunately, duplicate features are easy to identify using the Find Identical geoprocessing tool (choose the Shape field to find duplicate geometries), and you can clean them up with the Delete Identical tool (make a copy of the dataset because this tool alters the input dataset you give it).
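What Find Identical does on the Shape field can be sketched as grouping features by geometry and reporting any group with more than one member. Tuples of vertices stand in for real geometries here; this is the logic, not the arcpy tool.

```python
# Group features by geometry and report duplicate groups.

from collections import defaultdict

def find_identical(features):
    """features: (objectid, geometry) pairs; geometry must be hashable."""
    groups = defaultdict(list)
    for oid, geom in features:
        groups[geom].append(oid)
    return [oids for oids in groups.values() if len(oids) > 1]

features = [
    (1, ((0, 0), (10, 0))),
    (2, ((0, 0), (10, 0))),   # same line pasted twice in an edit session
    (3, ((5, 5), (5, 15))),
]
print(find_identical(features))  # [[1, 2]]
```

Each reported group is a set of features occupying the same geometry; Delete Identical would keep one per group and drop the rest, which is why you copy the dataset first.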
Corrupt spatial indexes can, on the surface, cause the "scariest" problems, but in reality they are probably the simplest to fix of all the issues mentioned in this document.
A corrupt or outdated spatial index is usually evident in the telltale behavior of features appearing and disappearing as you zoom in and out of the map. Another sign is when you zoom to the layer extent and the map zooms way out to include areas where the layer has no features at all.
I've noticed this behavior usually occurring when a feature class originally had the wrong coordinate system defined for the area of work and then features were digitized. This can also sometimes happen when pulling layers from ArcGIS Online down to local GDB copies. Behind the scenes, when you publish your feature class to ArcGIS Online, Esri takes your feature class and reprojects it to WGS84 Web Mercator for viewing on the web. When you then export your feature service to a GDB that you can download locally, Esri does the reverse and projects the feature class from WGS84 Web Mercator back down to whatever the original coordinate system was, like NAD83 NY State Plane. If features were added to the service through ArcGIS Online, the spatial index can get weird in your local GDB copy.
The fix for all of these problems is very simple: rebuild the spatial index. You can do this either by right-clicking the feature class, opening its properties, and going to the Indexes tab, or by running the Add Spatial Index geoprocessing tool.
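The "zooms way out to empty areas" symptom is the stored layer extent disagreeing with the features actually present. A small sketch of that diagnosis, with made-up coordinates and a deliberately bogus whole-world stored extent:

```python
# Compare a layer's stored extent against bounds recomputed from the
# feature coordinates themselves. A stored extent far larger than the
# recomputed one suggests the extent/spatial index is stale.

def recompute_extent(points):
    xs = [x for x, y in points]
    ys = [y for x, y in points]
    return (min(xs), min(ys), max(xs), max(ys))

stored_extent = (-180.0, -90.0, 180.0, 90.0)   # whole world: suspicious
points = [(585000, 4510000), (586200, 4511500), (585750, 4510900)]

actual = recompute_extent(points)
stale = stored_extent != actual
print(actual)  # (585000, 4510000, 586200, 4511500)
print("rebuild the spatial index" if stale else "extent looks current")
```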
Some projects require that we use a specific schema (i.e. data formatting) for the GIS layers, analogous to CAD standards on that platform. Other times, we are able to decide what the schema will be. The schema includes which feature classes are created and their naming convention, which attribute fields each feature class uses, and which coordinate system the feature classes use.
Checking schemas can be tricky since there are so many small details. Usually the schema is established before the actual GIS work is done, so it is typically not an issue; just be aware of it before starting any new GIS work.
The most common problem I see here is with coordinate systems, typically with data layers brought into the GIS database from other sources. For example, a federal wetlands shapefile was downloaded and then imported into the GDB with the other project data. All the other data layers are in State Plane, but this wetlands layer is in UTM.
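A schema check like this can be automated as a comparison against an agreed-upon spec. In the sketch below, the layer name, field names, and the specific EPSG codes (2260 is NAD83 / New York East ftUS; 26918 is NAD83 / UTM zone 18N) are illustrative assumptions, not part of any real project spec.

```python
# Minimal schema check: verify a layer's fields and coordinate system
# against the agreed-upon schema. Names and codes are illustrative.

EXPECTED = {
    "SewerLines": {"fields": {"MATERIAL", "DIAMETER"}, "epsg": 2260},
}

def schema_problems(layer_name, fields, epsg):
    spec = EXPECTED[layer_name]
    problems = []
    missing = spec["fields"] - set(fields)
    if missing:
        problems.append("missing fields: {}".format(sorted(missing)))
    if epsg != spec["epsg"]:
        problems.append(
            "wrong CRS: EPSG:{} (expected EPSG:{})".format(epsg, spec["epsg"])
        )
    return problems

# A wetlands download imported in UTM instead of State Plane shows up as:
print(schema_problems("SewerLines", ["MATERIAL"], 26918))
```

An empty result means the layer matches the spec; anything else is a schema deviation to chase down before delivery.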