5.5 Data Quality
Not all geospatial data are created equal. Data quality refers to the ability of a given dataset to satisfy the objective for which it was created. With the voluminous amounts of geospatial data being created and served to the cartographic community, individual geographic information system (GIS) users must take care to ensure that the data employed for a project are suitable for the task at hand.
Two primary attributes characterize data quality. Accuracy describes how close a measurement is to its actual value and is often expressed as a probability (e.g., 80 percent of all points are within +/− 5 meters of their true locations). Precision refers to the variance of a value when repeated measurements are taken. For example, a watch may be correct to 1/1000th of a second (precise) but may be 30 minutes slow (not accurate). In the classic dartboard analogy, darts clustered tightly on the bull's-eye are both precise and accurate, while darts clustered tightly off to one side are precise but inaccurate.
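To make the distinction concrete, here is a minimal sketch in Python that separates the two measures for a set of repeated GPS readings of a known benchmark; the coordinate values are invented for illustration:

```python
from statistics import mean, stdev

# Hypothetical repeated GPS eastings (meters) of a benchmark
# whose true easting is known to be 500000.0 m.
true_easting = 500000.0
readings = [500004.8, 500005.1, 500004.9, 500005.0, 500005.2]

bias = mean(readings) - true_easting   # accuracy: closeness to the true value
spread = stdev(readings)               # precision: variability across repeats

print(f"bias (accuracy): {bias:+.2f} m")      # ~ +5 m off: not accurate
print(f"spread (precision): {spread:.2f} m")  # ~ 0.16 m: very precise
```

The mean deviation from truth captures accuracy (bias), while the standard deviation of the repeated readings captures precision, independent of where the cluster of measurements sits.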
Several errors can arise when accuracy and precision requirements are not met during data capture and creation. Positional accuracy is the probability that a feature lies within +/− some stated number of units of its true location on the earth (absolute positional accuracy) or of its location relative to other mapped features (relative positional accuracy). For example, a particular mapping effort may result in 95 percent of trees being mapped to within +/− 5 feet of their true locations (absolute), or 95 percent of trees being mapped to within +/− 5 feet of their locations as observed on a digital ortho quarter quadrangle (relative).
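As a sketch, absolute positional accuracy can be estimated by measuring the distance between each mapped feature and its surveyed true location, then reporting the share that falls within the stated tolerance. All coordinates here are invented:

```python
import math

# Hypothetical (mapped, true) coordinate pairs for trees, in feet.
pairs = [((10.0, 20.0), (12.0, 21.0)),
         ((35.0, 42.0), (34.0, 40.5)),
         ((61.0, 15.0), (68.0, 15.0))]   # this tree is ~7 ft off

tolerance_ft = 5.0
within = sum(1 for (mx, my), (tx, ty) in pairs
             if math.hypot(mx - tx, my - ty) <= tolerance_ft)

print(f"{100 * within / len(pairs):.0f}% of features within "
      f"+/- {tolerance_ft} ft of their true locations")
```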
Speaking of absolute positional error raises the question: what exactly is the true location of an object? As discussed in Chapter 2, “Map Anatomy,” differing conceptions of the earth’s shape have led to many projections, datums, and spheroids, each attempting to reduce positional error for locations on the earth. To begin to address this question, the US National Map Accuracy Standards (NMAS) state that, to meet the horizontal accuracy requirement, no more than 10 percent of a paper map’s measurable points may fall outside the specified accuracy tolerance. Similarly, to meet the vertical accuracy requirement, no more than 10 percent of the elevations tested on a contour map may be in error by more than one-half the contour interval. Any map that does not meet these horizontal and vertical accuracy standards is deemed unacceptable for publication.
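The vertical test reduces to a simple pass/fail computation. A minimal sketch of that check, with invented elevation values and contour interval:

```python
contour_interval = 10.0           # feet
threshold = contour_interval / 2  # max allowable vertical error under NMAS

# Hypothetical (mapped elevation, surveyed elevation) test points, in feet.
tests = [(102.0, 100.4), (250.3, 248.0), (75.5, 81.2), (310.0, 309.1)]

errors = [abs(mapped - surveyed) for mapped, surveyed in tests]
pct_failing = 100 * sum(e > threshold for e in errors) / len(errors)

print(f"{pct_failing:.0f}% of points exceed {threshold} ft; "
      f"{'PASS' if pct_failing <= 10 else 'FAIL'} NMAS vertical test")
```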
Positional errors arise from multiple sources. First, the process of digitizing paper maps commonly introduces such inaccuracies. Second, errors can arise while registering the map on the digitizing board. Third, a paper map can shrink, stretch, or tear over time, changing the dimensions of the scene. Fourth, input errors created from hastily digitized points are common. Finally, converting between coordinate systems and transforming between datums may also introduce errors to the dataset.
The root-mean-square (RMS) error is frequently used to evaluate the degree of inaccuracy in a digitized map. This statistic measures the deviation between the actual (true) and estimated (digitized) locations of the control points. For example, errors in the locations of input control points propagate to the lines representing soil types in a digitized soil map. By applying an RMS error calculation to the dataset, one can quantify the accuracy of the digitized map and thus determine its suitability for inclusion in a given study.
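A minimal sketch of the RMS error calculation over a set of control points; the coordinates are invented for illustration:

```python
import math

# Hypothetical (true, digitized) control point coordinates in map units.
control_points = [((0.0, 0.0), (0.3, -0.1)),
                  ((100.0, 0.0), (99.8, 0.4)),
                  ((0.0, 100.0), (0.2, 100.3)),
                  ((100.0, 100.0), (100.5, 99.7))]

# RMS error: the square root of the mean squared deviation between
# the true and digitized locations of the control points.
squared = [(tx - dx) ** 2 + (ty - dy) ** 2
           for (tx, ty), (dx, dy) in control_points]
rms = math.sqrt(sum(squared) / len(squared))

print(f"RMS error: {rms:.3f} map units")
```

A large RMS error signals that the registration or digitizing process should be repeated before the layer is used in analysis.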
Positional errors can also arise when features to be mapped are inherently vague. Take the example of a wetland. What defines a wetland boundary? Wetlands are determined by hydrologic, vegetative, and edaphic factors. Although the US Army Corps of Engineers is currently responsible for defining the boundary of wetlands throughout the country, this task is not as simple as it may seem.
Moreover, regional differences in the characteristics of a wetland make delineating these features particularly troublesome. For example, the definition of a wetland boundary for the riverine wetlands in the eastern United States, where water is abundant, is often useless when delineating similar types of wetlands in the desert southwest United States. Indeed, the complexity and confusion associated with the conception of what a “wetland” is may result in difficulties defining the feature in the field, which subsequently leads to positional accuracy errors in the GIS database.
In addition to positional accuracy, attribute accuracy is a common source of error in a GIS. Attribute errors can occur when an incorrect value is recorded within the attribute field or when a field is missing a value. Misspelled words and other typographical errors are common as well. Similarly, a common inaccuracy occurs when developers enter “0” in an attribute field when the value is “null.” This is common in count data where “0” would represent zero findings, while a “null” would represent a locale where no data collection effort was undertaken. Finally, in the case of categorical values, inaccuracies occasionally occur when attributes are mislabeled. For example, a land-use/land-cover map may list a polygon as “agricultural” when it is, in fact, “residential.” This is particularly true if the dataset is out of date, which leads us to our next source of error.
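The zero-versus-null distinction is easy to preserve in code by representing “no data collected” as None rather than 0. A short sketch with invented survey counts shows why conflating the two distorts results:

```python
# Hypothetical bird counts per survey site: 0 means the site was surveyed
# and nothing was found; None means the site was never surveyed.
counts = {"site_a": 4, "site_b": 0, "site_c": None, "site_d": 12}

surveyed = {k: v for k, v in counts.items() if v is not None}
mean_count = sum(surveyed.values()) / len(surveyed)

print(f"sites surveyed: {len(surveyed)} of {len(counts)}")
print(f"mean count (surveyed sites only): {mean_count:.2f}")
# Coding the unsurveyed site as 0 instead would wrongly drag the mean down.
```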
Temporal accuracy addresses the age or timeliness of a dataset. No dataset is ever wholly current; in the time it takes to create a dataset, it has already become outdated. Regardless, there are several dates to be aware of while using a dataset, and these should be found within the metadata. The publication date tells you when the dataset was created and released. The field date is the date and time the data were collected. If the dataset contains any future predictions, there should also be a forecast period and date. To address temporal accuracy, many datasets undergo a regular update regimen. For example, the California Department of Fish and Game updates its sensitive species databases on a near-monthly basis as new findings are continually being made. As an end-user, it is essential to ensure that you are always using the most up-to-date data for your GIS application.
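One simple safeguard is an automated staleness check against the dates recorded in the metadata. In this sketch, the metadata field names and the update interval are assumptions for illustration, not part of any metadata standard:

```python
from datetime import date, timedelta

# Hypothetical metadata dates for a sensitive-species layer.
metadata = {"publication_date": date(2024, 1, 15),
            "field_date": date(2023, 11, 2)}
max_age = timedelta(days=31)  # assumed near-monthly update regimen

age = date.today() - metadata["publication_date"]
if age > max_age:
    print(f"WARNING: dataset is {age.days} days old; check for a newer release")
else:
    print("dataset is within its expected update cycle")
```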
The fourth type of accuracy in a GIS is logical consistency. Logical consistency requires that the data be topologically correct. For example, does a stream segment of a line shapefile fall within the floodplain of the corresponding polygon shapefile? Do roadways connect at nodes? Do all the connections and flows point in the correct direction in a network? Regarding the last question, the author recently used an unnamed smartphone application to navigate a busy city roadway and was twice told to turn in the wrong direction down one-way streets. So beware, errors in logical consistency may lead to traffic violations, or worse!
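One basic logical-consistency test is whether line segments in a network actually share endpoint nodes. The sketch below, with invented road segments, counts how often each endpoint is used; endpoints used by only one segment (“dangles”) include legitimate network ends, but a near-coincident pair of dangles reveals an unsnapped gap:

```python
from collections import Counter

# Hypothetical road segments as (start, end) coordinate tuples.
segments = [((0, 0), (1, 0)),
            ((1, 0), (1, 1)),
            ((1, 1), (2, 1)),
            ((2.0001, 1), (3, 1))]  # tiny gap: fails to snap to (2, 1)

# Count how many segments touch each node.
node_use = Counter()
for start, end in segments:
    node_use[start] += 1
    node_use[end] += 1

dangles = [node for node, n in node_use.items() if n == 1]
print(f"dangling endpoints (legitimate ends or gaps): {dangles}")
# (2, 1) and (2.0001, 1) sit nearly on top of one another, flagging a
# topology error; production tools apply a snapping tolerance for this.
```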
The final type of accuracy is data completeness, which requires the comprehensive inclusion of all features within the GIS database. A dataset cannot support accurate mapping results if features are missing. Are all the counties in the state represented? Are all the stream segments included in the river network? Is every convenience store listed in the database, or only certain types? Indeed, incomplete data will inevitably lead to incomplete or insufficient analysis.
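Completeness can be screened with a simple set comparison against an authoritative feature list. A sketch with invented county names:

```python
# Hypothetical authoritative list of counties versus what the dataset holds.
expected = {"Adams", "Baker", "Clark", "Douglas", "Elmore"}
in_dataset = {"Adams", "Baker", "Douglas"}

missing = expected - in_dataset   # gaps to resolve before analysis
extra = in_dataset - expected     # features that should not be there

print(f"missing features: {sorted(missing)}")
print(f"unexpected features: {sorted(extra)}")
```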