"

5.3 Geodatabases

A database is a structured collection of data files. A database management system (DBMS) is a software package that allows creating, storing, maintaining, manipulating, and retrieving large datasets distributed over one or more files. A DBMS and its associated functions are usually accessed through commercial software packages such as Microsoft Access, Oracle, FileMaker Pro, or Avanquest MyDataBase. Database management usually refers to storing tabular data in row and column format and is frequently used for personal, business, government, and scientific endeavors. Alternatively, geospatial database management systems include the functionality of a DBMS and contain specific geographic information about each data point, such as identity, location, shape, and orientation. Integrating this geographic information with the tabular attribute data of a classical DBMS provide users with powerful tools to visualize and answer the spatially explicit questions that arise in an increasingly technological society.

Several database models exist, such as the flat, hierarchical, network, and relational models (Worboys, 1995; Jackson, 1999). A flat database is a spreadsheet storing all data in a single, large table. A hierarchical database is also a simple model that organizes data into a “one-to-many” association across levels

. Common examples of this model include phylogenetic trees for classifying plants and animals and familial genealogical trees showing parent-child relationships. However, network databases are like hierarchical databases because they also support “many-to-many” relationships. This expanded capability allows greater search flexibility within the dataset and reduces the potential redundancy of information. Alternatively, both the hierarchical and network models can become incredibly complex depending on the size of the databases and the number of interactions between the data points. Modern geographic information system (GIS) software typically employs a fourth model referred to as a relational database (Codd, 1970).

Relational Database Management Systems

A relational database management system (RDBMS) is a collection of connected tables so that data can be accessed without reorganization of the tables. The tables are created such that each column represents a particular attribute (e.g., soil type, PIN number, last name, acreage), and each row contains a unique instance of data for that columnar attribute (e.g., Delhi Sands Soils, 5555, Smith, 412.3 acres)

In the relational model, each table (not surprisingly called a relation) is linked to each other table via predetermined keys (Date 1995). [4] The primary key represents the attribute (column) whose value uniquely identifies a particular record (row) in the relation (table). The primary key may not contain missing values as multiple missing values would represent nonunique entities that violate the basic rule of the primary key. The primary key corresponds to an identical attribute in a secondary table (and possibly third, fourth, fifth, etc.) called a foreign key. This results in all the information in the first table being related to the information in the second table via the primary and foreign keys, hence the term “relational” DBMS. With these links in place, tables within the database can be kept relatively simple, resulting in minimal computation time and file complexity. This process can be repeated over many tables if each contains a foreign key corresponding to another table’s primary key.

The relational model has two primary advantages over the other database models described earlier. First, each table can now be separately prepared, maintained, and edited. This is particularly useful when considering the potentially massive size of many of today’s modern databases. Second, the tables may be maintained separately until the need for a particular query or analysis calls for the tables to be related. This creates a significant degree of efficiency for processing information within a given database.

It may become apparent to the reader that there is immense potential for redundancy in this model as each table must contain an attribute that corresponds to an attribute in every other related table. Therefore, redundancy must actively be monitored and managed in an RDBMS. A set of rules called normal forms was developed (Codd, 1970). There are three primary normal forms. The first normal form refers to five conditions that must be met (Date 1995). [6] They are as follows:

  • There is no sequence to the ordering of the rows.
  • There is no sequence to the ordering of the columns.
  • Each row is unique.
  • Every cell contains one and only one value.
  • All values in a column pertain to the same subject.

The second normal form states that any column, not a primary key, must depend on the primary key. This reduces redundancy by eliminating the potential for multiple primary keys throughout multiple tables. However, this step often involves the creation of new tables to maintain normalization.

The third normal form states that all nonprimary keys must depend on the primary key, while the primary key remains independent of all nonprimary keys. This form was wittily summed up by Kent (1983) [7], who quipped that all nonprimary keys “must provide a fact about the key, the whole key, and nothing but the key.” Echoing this quote is the rejoinder: “so help me, Codd” (personal communication with Foresman 1989).

Joins and Relations

An additional advantage of an RDBMS is that it allows attribute data in separate tables to be linked together. The two operations commonly used to accomplish this are the join and relate.

The join operation appends the fields of one table into a second table using an attribute or field that is common to both tables. This is commonly utilized to combine attribute information from one or more nonspatial data tables (i.e., information taken from reports or documents) with a spatially explicit GIS feature layer. The second type of join combines feature information based on spatial location and association rather than common attributes. In ArcGIS, three types of spatial joins are available. Users may (1) match each feature to the closest feature, (2) match each feature to the feature that it is part of, or (3) match each feature to the feature that it intersects.

Alternatively, the related operation temporarily associates two map layers or tables while keeping them physically separate. Relates are bidirectional, so data can be accessed from one of the tables by selecting records in the other table. The related operation also allows for the association of three or more tables.

Sometimes it can be unclear as to which operation one should use. As a general rule, joins are most suitable for instances involving one-to-one or many-to-one relationships. Joins are also advantageous since the data from the two tables are readily observable in the single output table. On the other hand, the use of relates is suitable for all table relationships (one-to-one, one-to-many, many-to-one, and many-to-many); however, they can slow down computer access time if the tables are extensive or spread out over remote locations.