WE EXPLORE Missouri Census Data Center

Overview of the "We Explore" Tutorial

The uexplore program is a cgi-bin application that lets web browsers access directory structures on UNIX file servers in much the same way that the Windows Explorer program lets Windows users explore the files stored on their PC drives. Though conceptually similar, the two data navigation tools differ in a great many ways. The purpose of this tutorial is to provide background and assistance to help users get the full benefit of uexplore and the applications that it can invoke.

Directories that can be accessed using uexplore must have their permission flags set so that they are universally readable and "executable" within their UNIX file systems. In addition, they must contain a special "table of contents" file, named "Contents" (note upper case "C"), that lists the names of the files within the directory. Though the software does not actually require it, Contents files should also include descriptive text with each file name. The idea is to give explorers some guidance about what they can expect to find as they peruse and choose files for further exploration.
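The directory requirements described above can be checked with a short script. The following Python sketch is an illustration only and is not part of uexplore; the function name is_explorable is hypothetical. It assumes that "universally readable and executable" corresponds to the "other" read and execute permission bits, and it only tests that a file named exactly "Contents" exists, since the layout of that file is not specified here.

```python
import os
import stat


def is_explorable(directory):
    """Return True if `directory` meets the requirements described above.

    Assumption: "universally readable and executable" means the
    "other" read (r) and execute (x) permission bits are both set.
    """
    mode = os.stat(directory).st_mode
    if not stat.S_ISDIR(mode):
        return False
    world_readable = bool(mode & stat.S_IROTH)
    world_executable = bool(mode & stat.S_IXOTH)
    # Compare names exactly so that "contents" does not count, even on
    # a file system that ignores case.
    has_contents = "Contents" in os.listdir(directory) and os.path.isfile(
        os.path.join(directory, "Contents")
    )
    return world_readable and world_executable and has_contents
```

A directory created with mode 755 that holds a Contents file passes this check; the same directory with mode 750, or without the Contents file, does not.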

The Missouri Census Data Center (MCDC) is an organization whose mission is to collect a wide array of public information and make it available to researchers and the general public. As a participant in the Census Bureau-sponsored State Data Center program, it has a special mission to receive data files and data products produced by the U.S. Bureau of the Census and make them available within its state. The majority of the files within the MCDC collection are based on data received from the Census Bureau. In most cases the MCDC has transformed or converted these files in some way.

Using a commercial software package called SAS(r), the MCDC has converted most of the raw data files it received into SAS datasets. These SAS datasets are similar to other file structures such as .dbf (dBASE) files or .xls (Excel spreadsheets) in that the data are structured into rows and columns, also called records and fields. In SAS terminology the rows are observations and the columns are variables; each observation is made up of variables. Access to such datasets requires special software that knows how to decipher the structure of SAS dataset files.

The uexplore application provides a means of accessing these files by dynamically invoking the SAS software and passing it information about what the client (the web client that invoked the application) has requested. The resulting information is then formatted per the client's specifications and passed back to the client via the browser.

The user need know very little if anything about the SAS software in order to make use of these exploration tools.

While uexplore is the name of the application that the user invokes to begin and to control the overall direction of an exploration, most of the actual work is done by special application programs. In fact, the uexplore program is written in a language called Perl, and it never directly opens or accesses any of the SAS datasets within the archive. Instead, it simply recognizes when a client has asked to "see" a SAS dataset and provides the user with a menu of appropriate sub-applications that are written in SAS and can access the data directly. Much of this tutorial deals with these special sub-applications, especially the two basic (and universal) ones:

  • hypercon -- a utility that provides information about the structure and content of SAS datasets
  • xtract -- a utility that performs the actual extraction of data from the SAS datasets; this program allows selecting observations (records) and variables (fields) and can deliver the results in any of five commonly used formats, which the user can specify.
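To make the roles of these two utilities concrete, here is a minimal Python sketch of the ideas behind them. It is not the actual hypercon or xtract code (those are SAS programs); the function names describe, extract, and to_csv and the sample values are hypothetical. A dataset is modeled as a list of observations, each a dictionary of variables. CSV stands in for one possible output format, chosen only for illustration.

```python
import csv
import io

# Made-up sample data: each dict is one observation, each key one variable.
DATASET = [
    {"area": "Area A", "kind": "county", "total": 100},
    {"area": "Area B", "kind": "place", "total": 200},
    {"area": "Area C", "kind": "county", "total": 300},
]


def describe(observations):
    """hypercon-like summary: observation count and variable names."""
    variables = list(observations[0].keys()) if observations else []
    return {"n_obs": len(observations), "variables": variables}


def extract(observations, keep, where=lambda obs: True):
    """xtract-like subset: filter observations, then keep chosen variables."""
    return [
        {name: obs[name] for name in keep}
        for obs in observations
        if where(obs)
    ]


def to_csv(rows, fieldnames):
    """Render the extracted rows as CSV text (one illustrative format)."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    print(describe(DATASET))
    rows = extract(
        DATASET,
        keep=["area", "total"],
        where=lambda obs: obs["kind"] == "county",
    )
    print(to_csv(rows, ["area", "total"]))
```

The point of the split is the same as in the tutorial: first inspect the structure of a dataset to learn which variables exist, then request only the observations and variables you need.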

In order to be able to use the uexplore tool effectively, there are two basic kinds of things that the explorer has to understand:

  • The mechanics of using the software. This is the usual sort of thing you have to cope with when using any piece of software. You need to understand how to get started, how to respond to the prompts, where to click when you want to do something, what the various parameter options really mean, etc. While not trivial, this is the (relatively) easy part of the exploration process.
  • The structure and content of the data archive itself. You may be able to master the navigational tools, know how to move about in the system, use the special application programs to select information and formats, and get the results back to your desktop and loaded into your favorite local data processing package. But this will be of little value to you unless you also have some understanding of the structure and content of the data itself. Unless you are already well versed in the arcane file-naming schemes and summary table style presentations used in many Census Bureau products, you will probably find that making your way through these conventions is a lot tougher than mastering the relatively simple mechanics of using xtract.

We should warn you now that the data archive you'll be exploring has been built over a period of 15-20 years by different people, for different kinds of access (e.g. batch access by programmers on a mainframe), on different computing platforms. Certain conventions for organizing and naming files that worked well when data was stored on computer tapes for an all-batch mainframe system may not be very convenient for an on-line browsing application. Many of the datasets are really too big -- too many observations and/or too many variables. Datasets with over 3000 variables are not uncommon, and datasets over ten megabytes -- a few even over one hundred megabytes -- sit alongside a few that are more reasonable in size. You may, for example, find that accessing any file in a directory with a name beginning "stf" (for "summary tape file") can be an adventure. As their names imply, these files were designed to be stored on and accessed from magnetic tape media.

As a would-be user of such datasets, you may need to spend some time searching the Tools or Docs subdirectories for codebooks or other metadata describing such files. Once you understand the structure and content of these beasts, there is no reason why you cannot use the uexplore tools to capture the information you need. But you may need to do some off-line preparation in order to do so efficiently. Many would-be users of this system have only a single request for a specific kind of data for a specific kind of geography, and they may well find that the initial start-up costs of getting familiar with the data are simply not worth the effort. For those people, the MCDC staff are available to help with their data requests.

Even if you decide after some initial exploration that the data archive is too archaic and complex for your time-constrained data needs, you may still find it worthwhile to become familiar with the exploration tools. You can save time and money (MCDC personnel often do charge for their services when extracting data for clients) by doing your own extracts, not from datasets you discovered through your own explorations, but from specific datasets identified for you in a phone call to one of the MCDC's public information specialists. See the ETC chapter for more details.

