
5.2 Geographic Data Acquisition

Acquiring geographic data is crucial in any geographic information system (GIS) effort. It has been estimated that data acquisition typically consumes 60 to 80 percent of the time and money spent on any given project. Therefore, care must be taken to ensure that GIS projects remain mindful of their stated goals so that spatial data collection proceeds efficiently and effectively. This chapter outlines the many forms and sources of geospatial data available for use in a GIS.

Data Types

The type of data that we employ to help us understand a given entity is determined by (1) what we are examining, (2) what we want to know about that entity, and (3) our ability to measure that entity at the desired scale. The most common data types available in a GIS are alphanumeric strings, numbers, Boolean values, dates, and binaries.

The alphanumeric string, or text, data type is any simple combination of letters and numbers that may or may not form coherent words. The number data type can be subcategorized as either floating-point or integer. A floating-point value is any data value containing decimal digits, while an integer is any data value not containing decimal digits. Integers can be short or long, depending on the number of significant digits they must hold. Both are based on the concept of the “bit” in a computer. As you may recall, a bit is a computer’s most basic unit of information and stores values in one of two states: 1 or 0.

Therefore, an 8-bit attribute would consist of eight 1s or 0s in any combination (e.g., 10010011, 00011011, 11100111).

Short integers are 16-bit values and, therefore, can characterize numbers ranging from −32,768 to 32,767 or from 0 to 65,535 depending on whether the number is signed or unsigned (i.e., contains a + or − sign). Alternatively, long integers are 32-bit values and, therefore, can characterize numbers ranging from −2,147,483,648 to 2,147,483,647 or from 0 to 4,294,967,295.
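The ranges quoted above follow directly from the number of bits. The short Python sketch below reproduces them; the function name integer_range is purely illustrative, and the signed ranges assume the common two's-complement representation, which the text does not state explicitly.

```python
def integer_range(bits, signed=True):
    """Return the (minimum, maximum) values an integer of the given bit width can hold."""
    if signed:
        # Assumes two's complement: half of the bit patterns encode negative values.
        return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    # Unsigned: every bit pattern encodes a non-negative value.
    return 0, 2 ** bits - 1


short_signed = integer_range(16, signed=True)     # short integer, signed
short_unsigned = integer_range(16, signed=False)  # short integer, unsigned
long_signed = integer_range(32, signed=True)      # long integer, signed
long_unsigned = integer_range(32, signed=False)   # long integer, unsigned

# One of the 8-bit patterns from the text, read as an unsigned integer.
pattern_value = int("10010011", 2)

print(short_signed, short_unsigned)
print(long_signed, long_unsigned)
print(pattern_value)
```

Choosing the smallest integer type whose range covers the attribute's expected values keeps attribute tables compact, but a value outside the range cannot be stored in that field.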

A single-precision floating-point value occupies 32 bits, like the long integer. However, those bits are divided among a sign bit, an 8-bit exponent, and a 23-bit fraction, which together provide approximately seven significant decimal digits of precision. A double-precision floating-point value occupies 64 bits, twice the storage of a single-precision value. Those bits are divided among a sign bit, an 11-bit exponent, and a 52-bit fraction, which together provide approximately 16 significant decimal digits of precision.
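The practical difference between the two precisions can be seen by storing a coordinate at single precision and reading it back. The sketch below uses only Python's standard struct module; the longitude value is a made-up example, and the helper name to_single is illustrative.

```python
import struct


def to_single(value):
    """Round-trip a Python float (double precision) through 32-bit single precision."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


# Storage sizes in bytes for single ("f") and double ("d") precision.
single_bytes = struct.calcsize("<f")
double_bytes = struct.calcsize("<d")

longitude = -118.2436849  # hypothetical coordinate in decimal degrees
as_single = to_single(longitude)

# The difference is the information lost by storing the value at single precision.
error_degrees = abs(as_single - longitude)

print(single_bytes, double_bytes)
print(repr(longitude), repr(as_single), error_degrees)
```

For values of this magnitude, single precision cannot distinguish differences much smaller than about 0.000008 degrees, which is on the order of a meter on the ground at the equator. This is one reason coordinates are often stored at double precision.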

Boolean, date, and binary values are less complex. Boolean values are simply those deemed true or false based on applying a Boolean operator such as AND, OR, and NOT. The date data type is self-explanatory, while the binary data type represents attributes whose values are either 1 or 0.

Measurement Scale

In addition to defining data by type, a measurement scale acts to group data according to the level of complexity (Stevens, 1946). For GIS analyses, measurement scales can be grouped into two broad categories. Nominal and ordinal data represent categorical data; interval and ratio data represent numeric data.

The most straightforward data measurement scale is the nominal or named scale. The nominal scale makes statements about what to call data points but does not allow for scalar comparisons between one object and another. For example, attributing nominal information to points representing cities will describe whether the given locale is “Los Angeles” or “New York.” However, no further denotations can be made about those locales, such as population or voting history. Other examples of nominal data include last name, eye color, land-use type, ethnicity, and gender.

Ordinal data places attribute information into ranks and yields more precisely scaled information than nominal data. Ordinal data describes the position in which data occur, such as first, second, third, etc. These scales may also take on names such as “very unsatisfied,” “unsatisfied,” “satisfied,” and “very satisfied.” Although this measurement scale indicates the ranking of each data point relative to other data points, the ordinal scale does not explicitly denote the exact quantitative difference between these rankings. For example, if an ordinal attribute represents which runner came in first, second, or third place, it does not state by how much time the winning runner beat the second-place runner. Therefore, one cannot undertake arithmetic operations with ordinal data. Only the sequence is explicit.

The interval measurement scale allows precise quantitative statements about attributes. Interval data are measured along a scale in which each position is equidistant from the next. Elevation and temperature readings are typical representations of interval data. For example, this scale can determine that 30 degrees Fahrenheit is 5 degrees Fahrenheit warmer than 25 degrees Fahrenheit. A notable property of the interval scale is that zero is an arbitrary reference point: it does not represent nothingness or the absence of a value. Indeed, 0 degrees Fahrenheit does not indicate that no temperature exists. Similarly, an elevation of 0 feet does not indicate a lack of elevation but indicates mean sea level.

Ratio data are like interval data but are based on a meaningful zero value. Population density is an example of ratio data, whereby a population density of 0 indicates that no people live in the area of interest. Similarly, the Kelvin temperature scale is a ratio scale, as 0 K does imply that no heat (temperature) is measurable within the given attribute.
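The distinction between interval and ratio scales can be checked with simple arithmetic. In the sketch below, the two temperature readings are arbitrary examples and the function name is illustrative. Differences are meaningful on both scales, but a statement such as “twice as warm” only makes sense on a scale with a true zero.

```python
def fahrenheit_to_kelvin(deg_f):
    """Convert a temperature from degrees Fahrenheit to kelvins."""
    return (deg_f - 32.0) * 5.0 / 9.0 + 273.15


cool_f, warm_f = 25.0, 50.0  # arbitrary example readings

# Differences are meaningful on an interval scale.
difference_f = warm_f - cool_f

# This ratio depends on the arbitrary zero of the Fahrenheit scale,
# so it does not mean the warmer reading is "twice as warm".
naive_ratio = warm_f / cool_f

# On the Kelvin scale, which has a true zero, the ratio is meaningful.
kelvin_ratio = fahrenheit_to_kelvin(warm_f) / fahrenheit_to_kelvin(cool_f)

print(difference_f, naive_ratio, round(kelvin_ratio, 4))
```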

Specific to numeric datasets, data values can also be discrete or continuous. Discrete data can take on only a finite number of values, while continuous data can take on an infinite number of values. For example, the number of mature trees on a small property will necessarily be a whole number between one and one hundred (for argument’s sake). However, the height of those trees represents a continuous data value as there are infinite potential values (e.g., one tree could be 20 feet tall, 20.1 feet, 20.15 feet, 20.157 feet, etc.).

Primary Data Capture

Now that we understand the different data types and measurement scales available in a GIS, we must direct our thoughts to how this data can be acquired. Primary data capture is a direct data acquisition methodology usually associated with in-the-field effort. In the case of vector data, directly captured data commonly comes from a global positioning system (GPS) or other types of surveying equipment, such as a total station. Total stations are specialized primary data capture instruments that combine a theodolite (or transit), which measures horizontal and vertical angles, with a tool to measure the slope distance from the unit to an observed point. A total station allows field crews to derive the topography of a particular landscape quickly and accurately.

In the case of GPS, handheld units access positional data from satellites and log the information for subsequent retrieval. A network of twenty-four navigation satellites is situated around the globe and provides precise coordinate information for any point on the Earth’s surface. Maintaining a line of sight to four or more satellites provides the user with reasonably accurate location information. Depending on user preference, these locations can be collected as individual points or linked together to form lines or polygons. In addition, the user can enter attribute data such as land-use type, telephone pole number, and river name. This location and attribute data can be uploaded to the GIS for visualization. Depending on the GPS make and model, this upload often requires some intermediate file conversion via software provided by the manufacturer of the GPS unit. However, some free online resources can convert GPS data from one format to another. GPS Babel is an example of such an online resource (http://www.gpsvisualizer.com/gpsbabel).
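As a rough illustration of how logged fixes and their attributes might be packaged for upload, the sketch below collects points individually and links them into a line and a polygon. The choice of GeoJSON-style dictionaries as the exchange format is an assumption, as are the coordinates, attribute values, and function names; an actual GPS unit and its conversion software may use different formats.

```python
import json


def make_feature(geometry_type, coordinates, **attributes):
    """Wrap coordinates and attribute data in a GeoJSON-style Feature dictionary."""
    return {
        "type": "Feature",
        "geometry": {"type": geometry_type, "coordinates": coordinates},
        "properties": attributes,
    }


def points_to_line(fixes, **attributes):
    """Link a sequence of (longitude, latitude) fixes into a single line."""
    if len(fixes) < 2:
        raise ValueError("a line needs at least two fixes")
    return make_feature("LineString", [list(f) for f in fixes], **attributes)


def points_to_polygon(fixes, **attributes):
    """Link at least three distinct fixes into a closed polygon ring."""
    if len(fixes) < 3:
        raise ValueError("a polygon needs at least three distinct fixes")
    ring = [list(f) for f in fixes]
    if ring[0] != ring[-1]:
        ring.append(list(ring[0]))  # close the ring by repeating the first fix
    return make_feature("Polygon", [ring], **attributes)


# Hypothetical logged fixes as (longitude, latitude) in decimal degrees.
fixes = [(-118.2437, 34.0522), (-118.2440, 34.0530), (-118.2451, 34.0541)]

pole = make_feature("Point", list(fixes[0]), pole_number="17")
river = points_to_line(fixes, river_name="Example Creek")
parcel = points_to_polygon(fixes, land_use="residential")

collection = json.dumps({"type": "FeatureCollection", "features": [pole, river, parcel]})
print(collection)
```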

In addition to the typical GPS unit, GPS is becoming increasingly incorporated into other innovative technologies. For example, smartphones now embed GPS capabilities as a standard technological component. These phone/GPS units maintain comparable accuracy to similarly priced stand-alone GPS units. They are responsible for a renaissance in portable, real-time data capture and sharing for the masses. Furthermore, the ubiquity of this technology has led to a proliferation of crowdsourced data acquisition alternatives. Crowdsourcing is a data collection method whereby users contribute freely to building spatial databases. This rapidly expanding methodology is utilized in such applications as TomTom’s MapShare application, Google Earth, Bing Maps, and ArcGIS.

Raster data obtained via direct capture comes more commonly from remotely sensed sources. Remotely sensed data can obviate the need for physical access to the imaged area. In addition, the researcher can characterize vast tracts of land with little to no additional time and labor. On the other hand, validation is required for remotely sensed data to ensure that the sensor is operating correctly and adequately calibrated to collect the desired information. Satellites and aerial cameras provide the most ubiquitous sources of direct-capture raster data.

Secondary Data Capture

Secondary data capture is an indirect methodology that utilizes the vast amount of existing geospatial data in digital and hard-copy formats. Before initiating any GIS effort, it is always wise to mine online resources for existing GIS data that may fulfill your mapping needs without the potentially intensive step of creating the data from scratch. Such digital GIS data are available from a variety of sources, including international agencies (CGIAR, CIESIN, United Nations, World Bank, etc.); federal governments (USGS, USDA, NOAA, USFWS, NASA, EPA, US Census, etc.); state governments (CDFG, Teale Data Center, INGIS, MARIS, NH GIS Resources, etc.); local governments (SANDAG, RCLIS, etc.); university websites (UCLA, Duke, Stanford, University of Chicago, Indiana Spatial Data Portal, etc.); and commercial websites (ESRI, GeoEye, Geocomm, etc.). These secondary data are available in various file types, extents, and sizes but are ready-made for use in most GIS software packages. Often, these data are free, but many sites will charge a fee for access to the proprietary information they have developed.

Although these data sources are all cases where the information has been converted to digital format and projected adequately for use in a GIS, much spatial information can be gleaned from existing, non-digital sources. Paper maps, for example, may contain current or historical information on a locale that cannot be found in digital format. In this case, digitization can create digital files from the original paper copy. Three primary methods exist for digitizing spatial information: two are manual, and one is automated.

Tablet digitizing is a manual data capture method whereby a user enters coordinate information into a computer using a digitizing tablet and a digitizing puck. To begin, a paper map is secured to a backlit digitizing tablet. The backlight allows all features on the map to be easily observed, which reduces eyestrain. The coordinates of the point, line, and polygon features on the paper map are then entered into a digital file as the user employs a puck, which is like a multi-buttoned mouse with a crosshair, to “click” their way around the vertices of each desired feature. The resulting digital file will need to be correctly georeferenced following the completion of the digitization task to ensure that this information will adequately align with existing datasets.

The second manual data capture method, heads-up digitizing, is also called “on-screen” digitizing. Heads-up digitizing can be used on either paper maps or existing digital files. In the case of a paper map, the map must first be scanned into the computer at a high enough resolution to resolve all pertinent features. Second, the now-digital image must be registered so the map will conform to an existing coordinate system. To do this, the user enters control points on the screen and transforms, or “rubber sheets,” the scanned image into real-world coordinates. Finally, the user zooms to specific areas on the map and traces the points, lines, and polygons, as in the tablet digitizing example.
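The registration step can be sketched in its simplest form as an affine transformation fitted to exactly three control points, each pairing a pixel position on the scanned image with a known real-world coordinate. This is a minimal illustration rather than the method any particular GIS package uses: real software typically accepts more control points and may apply higher-order transformations. The function names and control point values below are hypothetical.

```python
def solve_3x3(m, v):
    """Solve the 3x3 linear system m * u = v using Cramer's rule."""
    def det(a):
        return (a[0][0] * (a[1][1] * a[2][2] - a[1][2] * a[2][1])
                - a[0][1] * (a[1][0] * a[2][2] - a[1][2] * a[2][0])
                + a[0][2] * (a[1][0] * a[2][1] - a[1][1] * a[2][0]))

    d = det(m)
    if abs(d) < 1e-12:
        raise ValueError("control points are collinear; choose points that form a triangle")
    solution = []
    for i in range(3):
        mi = [row[:] for row in m]
        for r in range(3):
            mi[r][i] = v[r]  # replace column i with the right-hand side
        solution.append(det(mi) / d)
    return solution


def fit_affine(control_points):
    """Fit x = a*col + b*row + c and y = d*col + e*row + f from three control points.

    control_points is a list of three ((col, row), (x, y)) pairs.
    """
    m = [[float(col), float(row), 1.0] for (col, row), _ in control_points]
    xs = [world[0] for _, world in control_points]
    ys = [world[1] for _, world in control_points]
    return solve_3x3(m, xs), solve_3x3(m, ys)


def apply_affine(params, col, row):
    """Convert a pixel position on the scan to real-world coordinates."""
    (a, b, c), (d, e, f) = params
    return a * col + b * row + c, d * col + e * row + f


# Hypothetical control points: pixel (col, row) -> projected (x, y) in meters.
control_points = [
    ((0, 0), (500000.0, 4100000.0)),
    ((1000, 0), (502000.0, 4100000.0)),
    ((0, 1000), (500000.0, 4098000.0)),
]

params = fit_affine(control_points)
traced_vertex = apply_affine(params, 500, 500)  # a vertex traced on screen
print(traced_vertex)
```

Once the transformation is known, every vertex the user traces on the scanned image can be converted to real-world coordinates in the same way.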

Heads-up digitizing is particularly simple when existing GIS files, satellite images, or aerial photographs are used as a baseline. For example, if a user plans to digitize the boundary of a lake as seen from a georeferenced satellite image, the steps of scanning and registering can be skipped, and projection information from the originating image can be copied over to the digitized file.

The third method of secondary data capture is automated and requires the user to scan a paper map and vectorize the information therein. Vectorization typically requires a specific software package to convert a raster scan to vector lines, and it requires a very high-resolution, clean scan. If the image is not clean, all the imperfections on the map will be converted to false points, lines, and polygons in the digital version. If a clean scan is not available, it is often faster to use a manual digitization methodology. When a clean scan is available, however, the automated method is much quicker than the manual methods and may be the best option if multiple maps must be digitized and time is a limiting factor. A semiautomatic approach is often employed whereby a map is scanned and vectorized, followed by a heads-up digitizing session to edit and repair any errors that occurred during automation.

The final secondary data capture method worth noting is the use of information from reports and documents. Via this method, one enters information from reports and documents into the attribute table of an existing digital GIS file containing all the pertinent points, lines, and polygons. For example, new information specific to census tracts may become available following a scientific study. The GIS user needs only to download the existing GIS file of census tracts and enter the study’s report/document information directly into the attribute table. If the data tables are available digitally, the use of the “join” and “relate” functions in a GIS (Section 5.2.2 “Joins and Relates”) is often extremely helpful as it will automate much of the data entry effort.
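The idea behind a join can be illustrated without any GIS software: rows in the existing attribute table are matched to rows in the new table through a shared key field. In the sketch below, the tract identifiers, field names, and values are invented for illustration, and the function join_attributes is a simplified stand-in for the join tool a GIS package would provide.

```python
def join_attributes(rows, lookup, key):
    """Append matching fields from lookup to each row, matched on the key field.

    Rows without a match are kept unchanged, so no feature is dropped.
    """
    joined = []
    for row in rows:
        extra = lookup.get(row[key], {})
        joined.append({**row, **extra})
    return joined


# Hypothetical attribute table of an existing census tract layer.
tract_table = [
    {"tract_id": "0601", "name": "Tract A"},
    {"tract_id": "0602", "name": "Tract B"},
    {"tract_id": "0603", "name": "Tract C"},
]

# Hypothetical study results available digitally, keyed by tract identifier.
study_results = {
    "0601": {"households_surveyed": 120},
    "0603": {"households_surveyed": 85},
}

joined_table = join_attributes(tract_table, study_results, key="tract_id")
for row in joined_table:
    print(row)
```

Tracts that the study did not cover, such as the second tract here, simply keep their original attributes, which makes gaps in the new data easy to spot.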