3.3 Metadata

Files and Formats

When we collect data about your home, rainforests, or anything else, we usually need to store them somewhere. Though we may scribble numbers and measures on the back of an envelope or write them down on a pad of paper, if we want to update, share, analyze, or map them in the future, it is often helpful to record them in digital form so a computer can read them. So, though we will not bother ourselves with the bits and bytes of computing, it is necessary to discuss some fundamental elements of computing that are both relevant and required when learning and working with a GIS.

A file is one of the most common elements of working with computers and computing. Files in a computer can contain any number of things, from a complex set of instructions (e.g., a computer program) to a list of numbers and letters (e.g., an address book). Furthermore, computer files come in many sizes and types. One of the clues we can use to distinguish one type of file from another is the file extension. A file extension refers to the letters that follow the final period (“.”) in the file’s name. The table below contains some of the most common file extensions and the types of files with which they are associated.

File name        File type
filename.txt     Simple text file
filename.docx    Microsoft Word document
filename.pdf     Adobe portable document format
filename.jpg     Compressed image file
filename.tif     Tagged image format
filename.html    Hypertext markup language (used to create websites)
filename.xml     Extensible markup language
filename.zip     Zipped/compressed archive
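
The lookup described by the table can be sketched in a few lines of Python. This is only an illustration: the helper name `describe_file` and the sample file names are hypothetical, and the dictionary simply restates the table above.

```python
from pathlib import Path

# Lookup table that restates the extensions listed in the table above.
FILE_TYPES = {
    ".txt": "Simple text file",
    ".docx": "Microsoft Word document",
    ".pdf": "Adobe portable document format",
    ".jpg": "Compressed image file",
    ".tif": "Tagged image format",
    ".html": "Hypertext markup language",
    ".xml": "Extensible markup language",
    ".zip": "Zipped/compressed archive",
}


def describe_file(filename: str) -> str:
    """Return a description based on the file extension (hypothetical helper)."""
    # Path.suffix returns the text from the final period onward, e.g. ".txt".
    extension = Path(filename).suffix.lower()
    return FILE_TYPES.get(extension, "Unknown file type")


print(describe_file("rainforest_survey.TXT"))
print(describe_file("notes.xyz"))
```

Note that an extension is only a clue. Renaming a file does not change its contents, so a program may still fail to read a file whose extension looks familiar.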

Some computer programs can read, write, or analyze only specific file types, while others can read many file formats. As you work with information technology and GIS, you will realize that familiarity with different file types is essential. Learning how to convert or export one file type to another is also a valuable skill. Knowing how to recognize and identify different and unfamiliar file types will increase your proficiency with computers and GIS.

Of the numerous file types, one of the most common and widely accessed is the simple text file, also called a plain text file or just a text file. This is because word processing programs, spreadsheet and database programs, and web browsers can all read simple text files. Often ending with the extension “.txt” (i.e., filename.txt), text files contain no special formatting (e.g., bold, italic, underlining) and contain only characters such as letters, numbers, and punctuation. In other words, text files are not well suited for images or sophisticated graphics. Text files, however, are ideal for recording, sharing, and exchanging data because most computers and operating systems can recognize and read simple text files with programs called text editors.

When a text file contains organized or structured data in some fashion, it is sometimes called a flat file (but the file extension remains the same, i.e., .txt). Flat files are organized in a tabular format or line by line. In other words, each line or row of the file contains one and only one record. So, if we collected height measurements on three people, Tim, Jake, and Harry, the file might look something like this:

Name     Height
Tim      6’1″
Jake     5’7″
Harry    5’5″

Each row corresponds to one and only one record, observation, or case. There are two other essential elements to know about this file. First, note that the first row does not contain any data; instead, it describes the data contained in each column. When the first row of a file contains such descriptors, it is referred to as a header row or just a header. Columns in a flat file are fields, variables, or attributes. For example, “Height” is the attribute, field, or variable that we are interested in, and the observations or cases in our data set are “Tim,” “Jake,” and “Harry.” In short, rows are for records; columns are for fields.
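
The relationship between the header, records, and fields can be sketched in Python. This is a minimal illustration rather than a general-purpose parser: the file is held in memory as a list of lines, the names follow the prose above, and the height values are illustrative.

```python
# A tab-delimited flat file held in memory as a list of lines.
# The names follow the prose above; the heights are illustrative.
lines = [
    "Name\tHeight",
    "Tim\t6'1\"",
    "Jake\t5'7\"",
    "Harry\t5'5\"",
]

# The first row is the header: it names the fields (columns).
header = lines[0].split("\t")

# Every remaining row is exactly one record (observation or case).
records = [dict(zip(header, line.split("\t"))) for line in lines[1:]]

print(header)
for record in records:
    # Each record exposes its values by field name.
    print(record["Name"], record["Height"])
```

Pairing each value with its field name is what makes it possible to ask for “Height” without having to remember which column it occupies.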

The file’s second unseen but critical element is the spaces between each column or field. For example, a space separates the “name” column from the “height” column in the example. Upon closer inspection, however, note how the initial values of the “height” column are aligned. The height column would not be aligned if a single space separated each column. In this case, a tab is used to separate the columns of each row. The delimiter or separator is the character used to separate columns within a flat file. Though any character can be used as a delimiter, the most common delimiters are the tab, the comma, and a single space. The following are examples of each.

Tab-Delimited        Single-Space-Delimited    Comma-Delimited
Name     Height      Name Height               Name, Height
Tim      6.1         Tim 6.1                   Tim, 6.1
Sara     5.6         Sara 5.6                  Sara, 5.6
Harry    5.9         Harry 5.9                 Harry, 5.9

Knowing the delimiter to a flat file is essential because it enables us to distinguish and separate the columns efficiently and without error. Sometimes such files are referred to by their delimiters, such as “comma-separated values” or a “tab-delimited” file.
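
As a rough sketch of why the delimiter matters, the following Python code reads the same three records written with each of the three delimiters, using the `csv` module from the standard library. The helper name `parse` is hypothetical. The last lines show what happens when the wrong delimiter is assumed.

```python
import csv
import io

# The same data written with three different delimiters.
samples = {
    "tab": "Name\tHeight\nTim\t6.1\nSara\t5.6\nHarry\t5.9\n",
    "space": "Name Height\nTim 6.1\nSara 5.6\nHarry 5.9\n",
    "comma": "Name, Height\nTim, 6.1\nSara, 5.6\nHarry, 5.9\n",
}
delimiters = {"tab": "\t", "space": " ", "comma": ","}


def parse(text, delimiter):
    """Split each line of text into fields using the given delimiter."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    # strip() removes the blank that follows each comma in the comma sample.
    return [[value.strip() for value in row] for row in reader]


parsed = {name: parse(text, delimiters[name]) for name, text in samples.items()}
for name, rows in parsed.items():
    print(name, rows)

# Assuming the wrong delimiter leaves every row as a single, unsplit field.
wrong = parse(samples["tab"], ",")
print(len(wrong[0]))
```

With the correct delimiter, all three versions yield the same rows and columns. With the wrong one, the columns are never separated.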

The same general format is applied when recording and working with geographic data. Rows are reserved for records (in the case of geographic data, locations), and columns or fields are used for the attributes or variables associated with each location. For example, the following tab-delimited flat file contains data for three places (i.e., countries) and three attributes or characteristics of each country (i.e., population, language, continent), as noted by the header.

Country    Population     Languages     Continent
France     65,000,000     French        Europe
Brazil     192,000,000    Portuguese    South America
Jordan     9,531,712      Arabic        Southwest Asia
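
The country file is a useful case for a short sketch because its values contain both commas (“65,000,000”) and spaces (“South America”), so neither of those characters could serve as the delimiter here. The Python code below assumes the file is tab-delimited, as stated above, and reads it with the standard `csv` module. The variable names are illustrative.

```python
import csv
import io

# The tab-delimited country file from above, held in memory.
country_file = (
    "Country\tPopulation\tLanguages\tContinent\n"
    "France\t65,000,000\tFrench\tEurope\n"
    "Brazil\t192,000,000\tPortuguese\tSouth America\n"
    "Jordan\t9,531,712\tArabic\tSouthwest Asia\n"
)

# DictReader uses the header row to label the fields of each record.
reader = csv.DictReader(io.StringIO(country_file), delimiter="\t")

countries = []
for row in reader:
    # Remove the thousands separators so population can be used as a number.
    row["Population"] = int(row["Population"].replace(",", ""))
    countries.append(row)

# Each record is one location; each field is one attribute of that location.
largest = max(countries, key=lambda row: row["Population"])
print(largest["Country"], largest["Population"])
```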

Files like those presented here are the building blocks of the various tables, charts, reports, graphs, and other visualizations we see online, in print, and on television daily. They are also vital components of GIS maps and geographic representations. Rarely, if ever, will you work with one and only one file or file type. Often, especially when working with GIS, you will work with multiple files. Such a grouping of multiple files is called a database. Since the files within a database may be of varied sizes, shapes, and even formats, we need a system that allows us to work with, update, edit, integrate, share, and display the various data within the database. Such a system is called a database management system (DBMS). A geodatabase, in turn, is a database that holds a collection of geographic data. Databases and DBMSs are crucial to GIS, and a later chapter is dedicated to them.

Metadata

Consider the following comma-delimited file:

city, sun, temp, precip
Los Angeles, 300, 70, 10
London, 50, 55, 40
Singapore, 330, 80, 60

Looking at the file’s contents, we can see that it contains data about the cities of Los Angeles, London, and Singapore. A comma separates each field or attribute, and the file also includes a header row that tells us about the data contained in each column. Or does it? What does the column “sun” refer to? Is it the number of sunny days this year, last year, annually, or when? What about “temp”? Does this refer to the average daytime, evening, or annual temperature? For that matter, how is temperature measured? In Celsius? Fahrenheit? Kelvin? The column “precip” refers to precipitation, but what are the units or time frames for such measures and data? Finally, where did these data come from? Who collected them, when were they collected, and for what purpose?

It is remarkable that such a small text file can lead to so many questions. Now imagine extending the example to a file with one hundred records on ten variables, one thousand records on one hundred variables, or ten thousand records on one thousand variables. Through this simple example, several general but central issues related to data emerge. These issues range from the relatively mundane, such as the naming conventions used to identify individual records (i.e., rows) and distinguish one field (i.e., column) from another, to the question of documentation: what data are included in a given file, when the data were collected, for what purpose the data are to be used, who collected them, and, of course, where the data came from.

The previous simple text file illustrates how we cannot and should not take data and information for granted. It also highlights two important concepts: the source of data and the documentation of the contents of data files. Regarding sources, data can be put into two distinct categories. The first category is called primary data. Primary data refer to data collected directly or on a firsthand basis. For example, if you wanted to examine the variability of local temperatures in May and recorded the temperature at noon every day in May, you would be constructing a primary data set. Conversely, secondary data refer to data collected by someone else or another party. For instance, we use secondary data when working with census or economic data collected and distributed by the government.

Several factors influence the decision to construct and use primary data sets versus secondary data sets. Data acquisition costs in terms of money, availability, and time are essential factors. The data acquisition and integration phases of most geographic information system (GIS) projects are often the most time-consuming. In other words, locating, obtaining, and putting together the data for a GIS project may take up most of your time, whether you collect the data yourself or use secondary data. Of course, depending on the purpose, availability, and need, it may not be necessary to construct an entirely new data set (i.e., primary data set). Considering the vast amounts of data and information that are publicly available, for example, via the Internet, the cost and time savings of secondary data often outweigh the benefits associated with primary data collection.

Now that we understand the difference between primary and secondary data and the rationale for choosing between them, how do we find the data and information we need? As noted earlier, an incredibly vast and growing amount of data and information is available to us. For example, an online search for “deforestation data” will return hundreds, if not thousands, of results. To overcome this data and information overload, we need to turn to even more data. We are looking for a special kind of data called metadata, which simply means data about data. At one level, a header row in a simple text file like those discussed in the previous section is analogous to metadata. The header row provides data (e.g., names and labels) about the subsequent rows of data.

However, header rows may need an additional explanation, as previously illustrated. Furthermore, when working with or searching through several data sets, it can be tedious or impossible to open every file to determine its contents and usability. Enter metadata. Many files, particularly secondary data sets, come with a metadata file. These metadata files contain items such as general descriptions of the contents of the file, definitions for the various terms used to identify records (rows) and fields (columns), the range of values for fields, the quality or reliability of the data and measurements, how the data were collected, when the data were collected, and who collected the data. Though not all data are accompanied by metadata, it is easy to see why metadata is essential and valuable when searching for secondary data and when constructing primary data that may be shared in the future.
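
A metadata file of this kind can be sketched as a simple data dictionary for the city file discussed earlier. Everything in the `metadata` dictionary below is an assumption made for illustration: the source never defines “sun,” “temp,” or “precip,” so the definitions and units are invented, and the helper name `undocumented_fields` is hypothetical.

```python
import csv
import io

# The comma-delimited city file from earlier in this section.
data_file = (
    "city, sun, temp, precip\n"
    "Los Angeles, 300, 70, 10\n"
    "London, 50, 55, 40\n"
    "Singapore, 330, 80, 60\n"
)

# Hypothetical metadata: the original file does not define its columns, so
# every definition and unit below is an assumption made for illustration.
metadata = {
    "description": "Illustrative climate summary for three cities",
    "collected_by": "unknown",
    "fields": {
        "city": {"definition": "City name", "units": None},
        "sun": {"definition": "Sunny days per year (assumed)", "units": "days"},
        "temp": {"definition": "Average annual temperature (assumed)",
                 "units": "degrees Fahrenheit"},
        "precip": {"definition": "Annual precipitation (assumed)",
                   "units": "inches"},
    },
}


def undocumented_fields(header, metadata):
    """Return the header names that have no entry in the metadata."""
    return [name for name in header if name not in metadata["fields"]]


# skipinitialspace drops the blank that follows each comma.
reader = csv.reader(io.StringIO(data_file), skipinitialspace=True)
header = next(reader)

print(undocumented_fields(header, metadata))
for name in header:
    field = metadata["fields"][name]
    print(name, "-", field["definition"], "-", field["units"])
```

Checking the header against the metadata is one simple way to notice a column that nobody has documented before the file is shared.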

Just as simple files come in all shapes, sizes, and formats, so do metadata. As the amount and availability of data and information increase daily, metadata plays a critical role in making sense of it all. Geospatial metadata is the class of metadata we are most concerned with when working with a GIS. As the name suggests, geospatial metadata is data about geographical and spatial data. According to the Federal Geographic Data Committee (FGDC) in the United States, “Geospatial metadata are used to document digital geographic resources such as GIS files, geospatial databases, and earth imagery. A geospatial metadata record includes core library catalog elements such as Title, Abstract, and Publication Data; geographic elements such as Geographic Extent and Projection Information; and database elements such as Attribute Label Definitions and Attribute Domain Values.” This definition emphasizes improving transparency regarding data and promoting standards. Take a few moments to explore and examine the contents of a geospatial metadata file that conforms to the FGDC standard.
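
To make the elements named in the FGDC quotation more concrete, the sketch below reads a title, an abstract, a geographic extent, and an attribute definition from a tiny XML record using Python's standard `xml.etree.ElementTree` module. The record is invented and heavily abbreviated. Its tag names are modeled on the short names used by the FGDC Content Standard for Digital Geospatial Metadata, which is an assumption that should be checked against the standard itself, and a real record contains many more elements.

```python
import xml.etree.ElementTree as ET

# Invented, abbreviated record. Tag names are modeled on FGDC short names
# (an assumption); all values are made up for this sketch.
record = """
<metadata>
  <idinfo>
    <citation>
      <citeinfo>
        <origin>Example Mapping Agency</origin>
        <pubdate>20200101</pubdate>
        <title>Example Land Cover Layer</title>
      </citeinfo>
    </citation>
    <descript>
      <abstract>Invented abstract used only for this sketch.</abstract>
    </descript>
    <spdom>
      <bounding>
        <westbc>-118.7</westbc>
        <eastbc>-117.6</eastbc>
        <northbc>34.8</northbc>
        <southbc>33.7</southbc>
      </bounding>
    </spdom>
  </idinfo>
  <eainfo>
    <detailed>
      <attr>
        <attrlabl>temp</attrlabl>
        <attrdef>Invented definition: mean annual temperature.</attrdef>
      </attr>
    </detailed>
  </eainfo>
</metadata>
"""

root = ET.fromstring(record.strip())

# Library catalog elements: title and abstract.
title = root.findtext("idinfo/citation/citeinfo/title")
abstract = root.findtext("idinfo/descript/abstract")

# Geographic element: the bounding coordinates of the data set.
extent = {
    tag: float(root.findtext(f"idinfo/spdom/bounding/{tag}"))
    for tag in ("westbc", "eastbc", "northbc", "southbc")
}

# Database elements: attribute labels and their definitions.
attributes = {
    attr.findtext("attrlabl"): attr.findtext("attrdef")
    for attr in root.iter("attr")
}

print(title)
print(abstract)
print(extent)
print(attributes)
```

Because the record follows a shared structure, the same few lines could summarize many metadata files without opening the underlying data sets.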

Standards refer to widely promoted, accepted, and followed rules and practices. Given the range and variability of data and data sources, identifying a common thread to locate and understand the contents of any given file can be challenging. However, just as the rules of grammar and mathematics provide the foundations for communication and numeric calculations, metadata provides similar frameworks for working with and sharing data and information from various sources.

The central point behind metadata is that it facilitates data and information sharing. Within the context of large organizations such as governments, data and information sharing can eliminate redundancies and increase efficiencies. Moreover, access to data and information promotes the integration of different data to improve analyses, inform decisions, and shape policy. The role that metadata, and geospatial metadata in particular, plays in GIS is critical and offers enormous benefits in terms of cost and time savings. The sharing, widespread distribution, and integration of various geographic and nongeographic data and information enabled by metadata drive some of the most exciting and compelling innovations in GIS and the broader geospatial information technology community. More importantly, widespread access to, distribution of, and sharing of geographic data and information carry significant social costs and benefits and yield better analyses and more informed decisions.

License

Geographic Information Systems and Cartography Copyright © 2022 by adamdastrup is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License, except where otherwise noted.
