5.4 File Formats
Geospatial data are stored in many different file formats. Each geographic information system (GIS) software package, and each version of these software packages, supports different formats. This is true for both vector and raster data. Although several more common file formats are summarized here, many other formats exist for use in various GIS programs.
Vector File Formats
The most common vector file format is the shapefile. Shapefiles, developed by ESRI in the early 1990s for use with the dBASE III database management software package in ArcView 2, are simple, nontopological files developed to store the geometric location and attribute information of geographic features.
Shapefiles cannot store null values, annotations, or network features. Field names within the attribute table are limited to ten characters, and each shapefile can represent only one feature type: point, line, or polygon. Supported data types are limited to floating-point, integer, date, and text. Nearly all commercial and open-source GIS software supports shapefiles.
Despite being called a “shapefile,” this format is a compilation of many different files. Table 5.1 “Shapefile File Types” lists and describes the different file formats associated with the shapefile. Only the SHP, SHX, and DBF file formats are mandatory to create a functioning shapefile, while all others are conditionally required. As a rule, the names for each file should conform to the MS-DOS 8.3 convention when using older versions of GIS software packages. According to this convention, the filename prefix can contain eight characters, and the filename suffix contains three characters. However, the more recent GIS software packages have relaxed this requirement and will accept longer filename prefixes.
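The two rules in this paragraph (three mandatory component files and the 8.3 naming convention) can be checked with a few lines of Python. The following is a minimal sketch, not part of any GIS package; the function names and the sample file list are hypothetical and chosen only for illustration.

```python
from pathlib import Path

# Extensions the text names as mandatory for a functioning shapefile.
MANDATORY_EXTENSIONS = (".shp", ".shx", ".dbf")


def missing_components(shp_path, existing_names):
    """Return the mandatory extensions that are missing for a shapefile.

    existing_names is an iterable of file names found in the same directory.
    It is passed in explicitly so the check runs without touching the disk.
    """
    stem = Path(shp_path).stem.lower()
    present = {name.lower() for name in existing_names}
    return [ext for ext in MANDATORY_EXTENSIONS if stem + ext not in present]


def is_8_3_name(filename):
    """True if the prefix has 1 to 8 characters and the suffix has exactly 3."""
    prefix, dot, suffix = filename.rpartition(".")
    return dot == "." and 1 <= len(prefix) <= 8 and len(suffix) == 3


# Hypothetical directory listing: the DBF attribute table is absent.
files = ["roads.shp", "roads.shx", "roads.prj"]
print(missing_components("roads.shp", files))  # ['.dbf']
print(is_8_3_name("roads.shp"))                # True
print(is_8_3_name("county_roads.shp"))         # False
```

A check like this is useful before copying or e-mailing a shapefile, since moving only the SHP file leaves the dataset unusable.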
The ArcInfo coverage is the earliest vector file format developed for GIS software packages that is still in use today. This georelational file format supports multiple feature types (e.g., points, lines, polygons, annotations) while storing the topological information associated with those features. Attribute data are stored as multiple files in a separate directory labeled “Info.” Because the format was created in an MS-DOS environment, these files maintain strict naming conventions. File names cannot be longer than thirteen characters, cannot contain spaces, cannot start with a number, and must be entirely in lowercase. Coverages cannot be edited in ArcGIS 9.x or later versions of ESRI’s software package.
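The four naming rules listed above translate directly into a validation routine. This is an illustrative sketch only; the function name is hypothetical, and it checks just the rules stated in this section.

```python
def coverage_name_problems(name):
    """List which of the coverage naming rules described above a name breaks."""
    problems = []
    if len(name) > 13:
        problems.append("longer than thirteen characters")
    if " " in name:
        problems.append("contains a space")
    if name[:1].isdigit():
        problems.append("starts with a number")
    if name != name.lower():
        problems.append("contains uppercase letters")
    return problems


print(coverage_name_problems("roads"))       # []
print(coverage_name_problems("2020 Roads"))
# ['contains a space', 'starts with a number', 'contains uppercase letters']
```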
The US Census Bureau maintains a specific type of shapefile referred to as TIGER or TIGER/Line (Topologically Integrated Geographic Encoding and Referencing system). Although these open-source files do not contain actual census information, they map census tracts, roads, railroads, buildings, rivers, and other features that support and improve the Bureau’s ability to collect census information. TIGER/Line shapefiles, first released in 1990, are topologically explicit and linked to the Census Bureau’s Master Address File (MAF), enabling the geocoding of street addresses. These files are accessible to the public and can be freely downloaded from private vendors that support the format.
The AutoCAD DXF (Drawing Interchange Format or Drawing Exchange Format) is a proprietary vector file format developed by Autodesk to allow interchange between engineering-based CAD (computer-aided design) software and other mapping software packages. DXF files were initially released in 1982 to provide an exact representation of AutoCAD’s native DWG format. Although the DXF is still commonly used, newer versions of AutoCAD have incorporated more complex data types (e.g., regions, dynamic blocks) that are not supported in the DXF format. The DXF format may therefore become less popular in geospatial analysis over time.
Finally, the US Geological Survey (USGS) maintains an open-source vector file format that details physical and cultural features across the United States. These topologically explicit DLGs (Digital Line Graphs) come in large, intermediate, and small scales depending on whether they are derived from 1:24,000-; 1:100,000-; or 1:2,000,000-scale USGS topographic quadrangle maps. The features available in a DLG depend on its scale but include data such as administrative and political boundaries, hydrography, transportation systems, hypsography, and land cover.
Vector data files can also be structured to represent surface elevation information. A TIN (Triangulated Irregular Network) is an open-source vector data structure that uses contiguous, nonoverlapping triangles to represent geographic surfaces. Whereas the raster depiction of a surface represents elevation as an average value over the spatial extent of the individual pixel (see Section 5.3.2 “Raster File Formats”), the TIN data structure models each vertex of a triangle as an exact elevation value at a specific point on the earth. The arcs between vertices approximate the elevation between those two points. These arcs are then aggregated into triangles from which information on elevation, slope, aspect, and surface area can be derived across the entire extent of the model’s space. Note that the term “irregular” in the name of the data model refers to the fact that the vertices are typically laid out in a scattered fashion.
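To make the idea of deriving elevation and slope from a TIN triangle concrete, the sketch below treats one triangle as a flat plane through its three vertices. It uses barycentric weights for interpolation and the plane normal for slope; this is a standard geometric approach rather than the method of any particular GIS package, and the function names and sample coordinates are hypothetical.

```python
import math


def interpolate_elevation(triangle, x, y):
    """Linearly interpolate elevation at (x, y) inside one TIN triangle.

    triangle is a sequence of three (x, y, z) vertices. Returns None when
    the point lies outside the triangle.
    """
    (x1, y1, z1), (x2, y2, z2), (x3, y3, z3) = triangle
    det = (y2 - y3) * (x1 - x3) + (x3 - x2) * (y1 - y3)
    if det == 0:
        raise ValueError("degenerate triangle: vertices are collinear")
    # Barycentric weights: each vertex's share of influence on the point.
    w1 = ((y2 - y3) * (x - x3) + (x3 - x2) * (y - y3)) / det
    w2 = ((y3 - y1) * (x - x3) + (x1 - x3) * (y - y3)) / det
    w3 = 1.0 - w1 - w2
    if min(w1, w2, w3) < 0:
        return None
    return w1 * z1 + w2 * z2 + w3 * z3


def slope_degrees(triangle):
    """Slope of the triangle's plane, in degrees from horizontal."""
    (x1, y1, z1), (x2, y2, z2), (x3, y3, z3) = triangle
    ux, uy, uz = x2 - x1, y2 - y1, z2 - z1
    vx, vy, vz = x3 - x1, y3 - y1, z3 - z1
    # Normal vector of the plane (cross product of two edge vectors).
    nx = uy * vz - uz * vy
    ny = uz * vx - ux * vz
    nz = ux * vy - uy * vx
    if nz == 0:
        raise ValueError("triangle is vertical or degenerate")
    return math.degrees(math.atan2(math.hypot(nx, ny), abs(nz)))


# Hypothetical triangle: coordinates and elevations in meters.
tri = [(0.0, 0.0, 100.0), (10.0, 0.0, 110.0), (0.0, 10.0, 120.0)]
print(interpolate_elevation(tri, 5.0, 5.0))    # 115.0
print(interpolate_elevation(tri, 10.0, 10.0))  # None (outside the triangle)
print(slope_degrees(tri))
```

The vertices keep their exact elevations (the weights reduce to 1, 0, 0 at a vertex), which is the property that distinguishes a TIN from a pixel-averaged raster surface.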
The use of TINs confers certain advantages over raster-based elevation models (see Section 5.3.2 “Raster File Formats”). First, linear topographic features are represented more accurately than in their raster counterparts. Second, a comparatively small number of data points is needed to represent a surface, so file sizes are typically much smaller. This is particularly true because vertices can be clustered in areas where relief is complex and kept sparse in areas where relief is simple. Third, specific elevation data can be incorporated into the data model post hoc by placing additional vertices if the original is deemed insufficient or inadequate. Finally, TINs support certain analyses that cannot be performed with a raster-based elevation model, such as flood plain delineation, storage capacity curves for reservoirs, and time-area curves for hydrographs.
Raster File Formats
A multitude of raster file format types is available for use in GIS. The selection of raster formats has dramatically increased with the widespread availability of imagery from digital cameras, video recorders, satellites, etc. Raster imagery is typically 8-bit (256 colors) or 24-bit (16 million colors). Due to ongoing technological advancements, raster image file sizes have been getting larger and larger. Two types of file compression are commonly used to deal with this potential constraint: lossless and lossy. Lossless compression reduces file size without decreasing image quality. Lossy compression exploits the human eye’s limitations by removing information from the image that cannot be sensed, resulting in smaller file sizes than lossless compression.
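The color counts and the lossless/lossy distinction above can be illustrated with a short experiment. The sketch below uses Python's standard zlib module for lossless compression. The "lossy" step is a crude stand-in (rounding pixel values down to multiples of 16) and is not how JPEG or any other real lossy format works; the function names and the synthetic pixel data are hypothetical.

```python
import zlib

# Number of distinct colors for the two bit depths mentioned in the text.
print(2 ** 8)   # 256
print(2 ** 24)  # 16777216


def lossless_roundtrip(data):
    """Compress and decompress data; the restored bytes match the input."""
    compressed = zlib.compress(data)
    return compressed, zlib.decompress(compressed)


def quantize(data, step=16):
    """Crude lossy step: round every value down to a multiple of step."""
    return bytes((value // step) * step for value in data)


# Synthetic 8-bit "image": a repeating gradient of all 256 gray levels.
pixels = bytes(range(256)) * 4

compressed, restored = lossless_roundtrip(pixels)
print(restored == pixels)            # True: nothing was lost

lossy_pixels = quantize(pixels)
lossy_compressed = zlib.compress(lossy_pixels)
print(lossy_pixels == pixels)        # False: detail was discarded
print(len(lossy_compressed) < len(compressed))
```

Discarding detail before compression is what allows the lossy result to be smaller, at the cost of never being able to recover the original values.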
Among the most common raster files used on the web are the JPEG, TIFF, and PNG formats, all of which are open source and can be used with most GIS software packages. The JPEG (Joint Photographic Experts Group) and TIFF (Tagged Image File Format) raster formats are most frequently used by digital cameras to store 8-bit values for each of the red, green, and blue color channels (and sometimes 16-bit values, in the case of TIFF images). JPEGs support lossy compression, while TIFFs can be either lossy or lossless. Unlike JPEG, TIFF images can be saved in either RGB or CMYK color spaces. PNG (Portable Network Graphics) files are typically 24-bit images that use lossless compression. PNG files are designed for efficient viewing in web-based browsers such as Internet Explorer, Mozilla Firefox, Netscape, and Safari.
Native JPEG, TIFF, and PNG files do not have georeferenced information associated with them and, therefore, cannot be used directly in geospatial mapping efforts. A world file must first be created to employ these files in a GIS. A world file is a separate, plaintext data file that specifies the location, scale, and rotation needed to place the image in a standard coordinate system (e.g., Universal Transverse Mercator [UTM] or State Plane). The filename of the world file is based on the name of the raster file, while the extension is typically formed from the first and last letters of the raster file’s extension followed by a w. For example, the world file extension for a JPEG (.jpg) is JGW; for a TIFF (.tif), it is TFW; and for a PNG, PGW.
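By common convention, a world file holds six numeric lines: the pixel width in map units, two rotation terms, the pixel height (negative, because rows count downward), and the x and y map coordinates of the center of the upper-left pixel. The sketch below reads such a file and converts a pixel position to map coordinates. The function names and the sample values (a hypothetical 30-meter image in a UTM zone) are assumptions made for illustration.

```python
def parse_world_file(text):
    """Read the six parameters of a world file in their on-disk order.

    Line order: A (pixel width), D (rotation), B (rotation),
    E (pixel height, usually negative), C (x of upper-left pixel center),
    F (y of upper-left pixel center).
    """
    values = [float(token) for token in text.split()]
    if len(values) != 6:
        raise ValueError("a world file must contain exactly six numbers")
    return tuple(values)


def pixel_to_map(params, col, row):
    """Apply the affine transform to a pixel's column and row indices."""
    a, d, b, e, c, f = params
    x = a * col + b * row + c
    y = d * col + e * row + f
    return x, y


# Hypothetical world file contents for an unrotated 30-meter image.
sample = """30.0
0.0
0.0
-30.0
500015.0
4100985.0
"""

params = parse_world_file(sample)
print(pixel_to_map(params, 0, 0))    # (500015.0, 4100985.0)
print(pixel_to_map(params, 10, 20))  # (500315.0, 4100385.0)
```

Note that the world file carries only this transform; it does not name the coordinate system itself, which must be known from another source.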
An example of a raster file format with explicit georeferencing information is the proprietary MrSID (Multiresolution Seamless Image Database) format. LizardTech, Inc. developed this lossless compression format for large aerial photographs or satellite images, whereby portions of a compressed image can be viewed quickly without decompressing the entire file. In addition, the MrSID format is frequently used for visualizing orthophotos.
Like MrSID, the proprietary ECW (Enhanced Compression Wavelet) format also includes georeferencing information within the file structure. This lossy compression format was developed by Earth Resource Mapping and supports up to 255 layers of image information. Due to the potentially substantial file sizes associated with an image that supports so many layers, ECW files represent an excellent option for performing rapid analysis on large images while using a small amount of the computer’s RAM (Random Access Memory), thus accelerating computation speed.
Like the open-source, vector-based DLGs, DRGs (Digital Raster Graphics) are derived from USGS topographic maps. DRGs are scanned versions of these maps and include all the collar material from the original print versions. The geospatial information found within the image’s neat line is georeferenced to the surface of the earth, fits the Universal Transverse Mercator (UTM) projection, and is based on the NAD27 datum (NAD stands for North American Datum). These graphics are scanned at a minimum of 250 dpi (dots per inch) and therefore have a spatial resolution of approximately 2.4 meters. DRGs contain thirteen colors and may look slightly different from the originals.
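The 2.4-meter figure follows from the scanning resolution once a map scale is fixed. The short calculation below assumes a 1:24,000-scale source map; that scale is an assumption made here for illustration, since the passage states only the dpi. The function name is hypothetical.

```python
METERS_PER_INCH = 0.0254


def ground_resolution_m(scale_denominator, dpi):
    """Ground distance covered by one scanned dot, in meters.

    One inch on a 1:scale_denominator map spans scale_denominator inches
    on the ground, and that inch is divided into dpi dots.
    """
    return scale_denominator * METERS_PER_INCH / dpi


print(round(ground_resolution_m(24_000, 250), 2))  # 2.44
```

Scanning the same map at a higher dpi, or scanning a larger-scale map, yields a proportionally finer ground resolution.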
Like the TIN vector format, some raster file formats are developed explicitly for modeling elevation. These include the USGS DEM, USGS SDTS, and DTED file formats. The USGS DEM (US Geological Survey Digital Elevation Model) is a popular file format due to its widespread availability, the simplicity of the model, and the extensive software support for the format. Each pixel value in these grid-based DEMs denotes a spot elevation on the ground, usually in feet or meters. Care must be taken when using grid-based DEMs because the volume of data grows enormously as the spatial extent covered by the image increases. DEMs are referred to as digital terrain models (DTMs) when they represent a simple, bare-earth model and digital surface models (DSMs) when they include the heights of landscape features such as buildings and trees.
USGS DEMs can be classified into one of four levels of quality (labeled 1 to 4) depending on their source data and resolution. This source data can be 1:24,000-; 1:63,360-; or 1:250,000-scale topographic quadrangles. The DEM format is a single file of ASCII text composed of three data blocks: A, B, and C. The A block contains header information such as data origin, type, and measurement systems. The B block contains contiguous elevation data described as six-character integers. The C block contains trailer information such as the scene’s root-mean-square (RMS) error. The USGS DEM format has recently been succeeded by the USGS SDTS (Spatial Data Transfer Standard) DEM format. The SDTS format [1] was specifically developed as a distribution format for transferring data from one computer to another with zero data loss.
The DTED (Digital Terrain Elevation Data) format is another elevation-specific raster file format. It was developed in the 1970s for military purposes such as line-of-sight analysis, 3-D visualization, and mission planning. The DTED format maintains three levels of data over five different latitudinal zones. Level 0 data have a resolution of approximately 900 meters, Level 1 data have a resolution of approximately 90 meters, and Level 2 data have a resolution of approximately 30 meters.
Hybrid File Formats
A geodatabase is a recently developed, proprietary ESRI file format that supports vector and raster feature datasets (e.g., points, lines, polygons, annotation, JPEG, TIFF) within a single file. In addition, this format maintains topological relationships and is stored as an MDB file. The geodatabase was developed to be a comprehensive model for representing and modeling geospatial information.
There are three types of geodatabases. The personal geodatabase was developed for single-user editing, whereby two editors cannot work on the same geodatabase at a given time. The personal geodatabase employs the Microsoft Access DBMS file format and maintains a size limit of 2 gigabytes per file, although it has been noted that performance begins to degrade as file size approaches 250 megabytes. The personal geodatabase is currently being phased out by ESRI and is therefore not used for new data creation.
The file geodatabase similarly allows only single-user editing, but this restriction applies only to individual feature datasets within a geodatabase. The file geodatabase incorporates new tools such as domains (rules applied to attributes), subtypes (groups of objects within a feature class or table), and split/merge policies (rules to control and define the output of split and merge operations). This format stores information as binary files with a size limit of 1 terabyte and has been noted to perform and scale much more efficiently than the personal geodatabase (it requires approximately one-third of the feature geometry storage required by shapefiles and personal geodatabases). File geodatabases are not tied to any specific relational database management system and can be employed on Windows and UNIX platforms. Finally, file geodatabases can be compressed to read-only formats that further reduce file size without reducing performance.
The third hybrid ESRI format is the ArcSDE geodatabase, which allows multiple editors to work simultaneously on feature datasets within a single geodatabase (a.k.a. versioning). Like the file geodatabase, this format can be employed on Windows and UNIX platforms. However, the file size is limited to 4 gigabytes, and its proprietary nature requires an ArcInfo or ArcEditor license for use. The ArcSDE geodatabase is implemented on the SQL Server Express software package, a free DBMS platform developed by Microsoft.
In addition to the geodatabase, Adobe Systems Incorporated’s geospatial PDF (Portable Document Format) is an open-source format that can represent geometric entities such as points, lines, and polygons. Geospatial PDFs can also be used to find and mark coordinate pairs, measure distances, reproject files, and georegister raster images. This format is handy because the PDF is widely accepted as the preferred standard for printable web documents. Although the two are functionally similar, the geospatial PDF should not be confused with the GeoPDF format developed by TerraGo Technologies. Rather, the GeoPDF is a branded version of the geospatial PDF.
Finally, Google Earth supports a new, open-source, hybrid file format called KML (Keyhole Markup Language). KML files associate points, lines, polygons, images, 3-D models, longitude and latitude values, and other view information such as tilt, heading, and altitude. KMZ files, which are zipped versions of KML files, are also commonly encountered.
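Because KML is plain XML and a KMZ is a zip archive containing it, both can be produced with Python's standard library alone. The following is a minimal sketch: the function names and the placemark values are hypothetical, and the archive member name doc.kml follows the usual convention for the main file in a KMZ.

```python
import io
import zipfile
from xml.sax.saxutils import escape


def point_kml(name, lon, lat, alt=0.0):
    """Build a minimal KML document holding one point placemark.

    KML lists coordinates in longitude, latitude, altitude order.
    """
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<kml xmlns="http://www.opengis.net/kml/2.2">\n'
        '  <Placemark>\n'
        f'    <name>{escape(name)}</name>\n'
        '    <Point>\n'
        f'      <coordinates>{lon},{lat},{alt}</coordinates>\n'
        '    </Point>\n'
        '  </Placemark>\n'
        '</kml>\n'
    )


def to_kmz_bytes(kml_text):
    """Zip a KML document into an in-memory KMZ archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("doc.kml", kml_text)
    return buffer.getvalue()


# Hypothetical placemark; the coordinates are illustrative only.
kml = point_kml("Sample point", -105.0, 40.0)
kmz = to_kmz_bytes(kml)

with zipfile.ZipFile(io.BytesIO(kmz)) as archive:
    print(archive.namelist())  # ['doc.kml']
```

Writing the returned bytes to a file with a .kmz extension would give an archive that unzips back to the original KML text.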