WE EXPLORE Missouri State Census Data Center

Basics of the MCDC Data Space

What We'll Be Talking About

In this chapter we want to provide a low-tech overview of some basic organization and naming conventions used in the MCDC data archive. We'll be talking about UNIX directories and subdirectories, paths, filenames, file name extensions, special filenames that always mean the same kind of data, etc. You really do not have to know much, if anything, about UNIX to understand this material. However, you should have some familiarity with tree-type directory structures, such as those used with most PC operating systems, including DOS and Windows. These structures were modeled after the UNIX conventions, which have been around much longer. The biggest practical difference for a user is that UNIX uses forward slashes ( / ) where the PC systems use back slashes ( \ ). And, in UNIX, case matters: "Contents" and "contents" are two different files.

Directory Basics

Directories are the same thing as folders in the Windows 95-and-above world. It is unlikely that you would be reading this if you were not already familiar with this concept since it is fundamental to accessing information on the Web. Directories can contain other directories -- referred to, of course, as subdirectories -- as well as files. It is the files we usually want to get to; the directories are just things we use to organize them. By carefully placing our files in certain folders we make it much easier to find what we are looking for. When we refer to a specific file, we sometimes include references to the folders in which it is contained. In the UNIX world we use the "/" character in file references to indicate that what appears to the left of it is a directory (folder) name. So when we refer to the file "/mscdc/data/stf903/usi.ssd01", what we are really talking about is a file named "usi.ssd01" which is contained in a folder named "stf903", which in turn is contained in a folder named "data", which in turn is contained in a folder named "mscdc". If the person creating all these folders knows what (s)he is doing, there will be a logical structure to all of this. If stf903 is in a folder named data, then we would hope that it refers to some kind of sub-category of data; and similarly, since the data folder is stored within the mscdc folder, we would expect that the data referred to are somehow related to something called "mscdc". (Of course, we all know that "mscdc" is just lower case for "Missouri State Census Data Center".) The full "/"-delimited list of folders in which a file is contained is referred to as the path of the file. In our example the file usi.ssd01 has a path of /mscdc/data/stf903.
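For readers who like to see such things spelled out, here is a minimal sketch in Python (our choice for illustration only; nothing in the archive depends on it) that takes the example file reference apart into its path, its filename and the chain of folders. The function name split_reference is hypothetical.

```python
from pathlib import PurePosixPath


def split_reference(reference):
    """Split a UNIX-style file reference into (path, filename, folders).

    PurePosixPath only manipulates the text of the reference; the file
    does not have to exist on the machine running this code.
    """
    ref = PurePosixPath(reference)
    path = str(ref.parent)            # everything to the left of the last "/"
    filename = ref.name               # the final component
    folders = ref.parent.parts[1:]    # drop the leading "/" (the root)
    return path, filename, folders


path, filename, folders = split_reference("/mscdc/data/stf903/usi.ssd01")
print(path)      # /mscdc/data/stf903
print(filename)  # usi.ssd01
print(folders)   # ('mscdc', 'data', 'stf903')
```

Reading the folders tuple from left to right gives the same nesting described in the text: mscdc contains data, which contains stf903, which contains the file.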

The /mscdc/data Directory and Filetypes

The MCDC data archive is a collection of files and directories stored within the directory /mscdc/data. More precisely, the files are stored within the network of subdirectories of this parent directory. There is actually only a single file directly in the /mscdc/data directory; apart from that one file, the directory consists entirely of a series of subdirectories that we refer to as filetypes. These subdirectories correspond to the major categories of data within the archive. Examples of filetypes are cbp (County Business Patterns), stf803 (1980 Summary Tape File 3), stf901 (1990 Summary Tape File 1) and pums90 (1990 Public Use Microdata Sample). Every data file in the archive is associated with one and only one of these filetypes. We have attempted to assign names to these directories which are mnemonic, i.e. that correspond to the names by which these data files are commonly known. Many of the filetypes will be very familiar to people used to working with census data (e.g. the "stf" names are widely known and used). But there are some filetypes that the MCDC has more or less invented to contain data that come from local resources (such as Missouri state agencies). Most users of the archive will have no desire or need to get a handle on the entire collection of filetypes. Practically speaking, only the most serious data users will need to access more than a small fraction of the filetypes contained in the archive. A small number of filetypes (for example, the 1990 census summary files) will typically account for a very large percentage of user interest and access.
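Since every data file belongs to exactly one filetype, and the filetype is simply the first folder below /mscdc/data, the filetype can be read directly off a file's path. The following Python sketch illustrates the idea; the function name filetype_of is hypothetical and the code is not part of the archive's own software.

```python
from pathlib import PurePosixPath

# Parent directory of the archive, as given in the text.
ARCHIVE_ROOT = PurePosixPath("/mscdc/data")


def filetype_of(file_path):
    """Return the filetype (first subdirectory under the archive root) of a file.

    Returns None for a file stored directly in the root directory.
    Raises ValueError if the path is not inside the archive at all.
    """
    relative = PurePosixPath(file_path).relative_to(ARCHIVE_ROOT)
    if len(relative.parts) < 2:
        return None  # no filetype folder between the root and the file
    return relative.parts[0]


print(filetype_of("/mscdc/data/stf903/usi.ssd01"))  # stf903
print(filetype_of("/mscdc/data/Contents"))          # None
```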

Contents Files

We stated in the previous section that the "parent" directory of the data archive (/mscdc/data) contained a collection of subdirectories and only a single actual file. That file is named Contents. There are lots of files with this name in the archive; these are special files that are used to control access to, and provide additional annotation for, the data entities within a directory. If a directory does not contain a properly formatted Contents file, then uexplore will not permit access. As a user of the system, you really do not need to know this, and you certainly do not need to be concerned with the mechanics of just how a Contents file is formatted and interacts with uexplore. But if you are curious you can point uexplore to the top directory of the archive using

http://www.oseda.missouri.edu/cgi-bin/uexplore?/mscdc/data@secure

(you probably ought to go ahead and bookmark this) and then click on the "contents@" entry. Normally, you cannot actually browse a Contents file, but in this case we created a pseudo-file named "contents" (note the lower-case "c" instead of "C") which is really just a pointer to the Contents file. It's a very simple file. It just has the names of the files contained in the directory and some text describing what each file contains. When you invoke uexplore and specify a directory to start exploring, the program looks for the Contents file in that directory and uses it to generate an HTML "menu" page where the information is slightly reformatted and where each filename becomes a hyper-link to that file or subdirectory. Subdirectories are shown with a trailing slash. Most other files are shown with no special character appended (the "contents" file has a trailing "@" symbol which tells us that this is not an actual file but rather just a pointer, or link, to another file). When you select a subdirectory (by clicking on its hyper-link name) all you are really doing is re-invoking uexplore with a new value after the "?" to tell it which (sub)directory is to be "explored".
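The two mechanical ideas in this section, the address that re-invokes uexplore and the trailing "/" and "@" markers on the menu, can be sketched in a few lines of Python. This is only an illustration and not the actual uexplore program: the function names are hypothetical, and we assume the "@secure" suffix seen in the example URLs is also appended for subdirectories.

```python
import os
import stat

# Base address copied from the example URL in the text.
UEXPLORE = "http://www.oseda.missouri.edu/cgi-bin/uexplore"


def uexplore_url(directory, suffix="@secure"):
    """Build an address asking uexplore to explore `directory`.

    The suffix simply mirrors the example URLs (an assumption for
    subdirectories); the directory goes right after the "?".
    """
    return f"{UEXPLORE}?{directory}{suffix}"


def display_name(name, mode):
    """Append the marker described in the text, given a file mode."""
    if stat.S_ISLNK(mode):
        return name + "@"   # a pointer (symbolic link) to another file
    if stat.S_ISDIR(mode):
        return name + "/"   # a subdirectory
    return name             # most other files: no special character


def menu_names(directory):
    """List a real directory's entries with their markers (lstat does not follow links)."""
    return [
        display_name(name, os.lstat(os.path.join(directory, name)).st_mode)
        for name in sorted(os.listdir(directory))
    ]


print(uexplore_url("/mscdc/data/stf903"))
```

Selecting a subdirectory on a menu page then amounts to calling uexplore_url again with a longer directory value.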

File Extensions

As with DOS and Windows systems, filenames used in the archive use special conventions with regard to the final portion of each name (the portion following the rightmost period). Such filename extensions can serve as a signal to the uexplore program, telling it what kind of information or what format the file is in. The following is a list of the extension values that the program recognizes:

  • sas SAS program file. Contains SAS statements in simple ASCII format. Browsable.
  • ssd01 SAS dataset. Contains actual data stored in the proprietary SAS software format.
  • snx01 SAS index. Linked to a SAS dataset of the same name. Cannot be uexplore'd.
  • ssv01 SAS view. Acts like a SAS dataset, but contains instructions on how to create a SAS dataset instead of the actual data -- in this archive it is almost always used to describe a specific subset of another SAS dataset.
  • log, lst These extensions are usually paired with a .sas file. When a SAS program is executed it produces files with the same name as the file that contained the program but with these extensions for the SAS "log" file and the SAS "listing" file. So, for example, if there is a file called cnvt2.sas, there will usually be files called cnvt2.log and cnvt2.lst.
  • zip, gz Special compressed files created using the zip or gzip programs, respectively. Such files will not be directly accessible via uexplore.
  • arc* Some files are either so large and/or seldom used that they are not kept in immediately accessible storage; instead they are "archived" to a backup media. Such files need to be restored before they can be accessed by uexplore. Restoring an archived file can only be done by authorized MCDC personnel. (The trailing "*" indicates that the file actually contains executable code. The file contains the commands necessary to restore the original file.)
  • html The usual meaning - HyperText Markup Language web document.
  • csv Comma-separated value file. ASCII file in which the fields are separated by commas.
  • xpt SAS export format file. Rarely used in the archive (but the xtract application can create them for you).
  • dbf dBASE format. Rare.
  • just about anything else Unless recognized as something undisplayable the program will attempt to display the file as a simple text file.
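The list above is essentially a lookup table keyed on the text after the rightmost period. As a rough illustration, here it is as a Python dictionary with a small lookup function. The names EXTENSIONS and describe are hypothetical, the descriptions are abbreviated from the list, and this is not how uexplore itself is written.

```python
# Extension values and short descriptions, abbreviated from the list above.
EXTENSIONS = {
    "sas": "SAS program file",
    "ssd01": "SAS dataset",
    "snx01": "SAS index",
    "ssv01": "SAS view",
    "log": "SAS log file",
    "lst": "SAS listing file",
    "zip": "zip compressed file",
    "gz": "gzip compressed file",
    "arc": "archived file (must be restored by MCDC personnel)",
    "html": "HTML web document",
    "csv": "comma-separated value file",
    "xpt": "SAS export format file",
    "dbf": "dBASE file",
}

FALLBACK = "displayed as simple text"


def describe(filename):
    """Describe a file by the portion of its name after the rightmost period."""
    # Drop display markers such as the trailing "*" on arc* entries.
    name = filename.rstrip("*@/")
    _, dot, extension = name.rpartition(".")
    if not dot:
        return FALLBACK  # no period at all, e.g. "Contents"
    # Case matters in UNIX, so the extension is looked up exactly as written.
    return EXTENSIONS.get(extension, FALLBACK)


for example in ("usi.ssd01", "cnvt2.log", "bigfile.arc*", "notes.txt"):
    print(example, "->", describe(example))
```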

Special Files and Directories

There are certain special filenames or subdirectory names that we use throughout the archive. It will help you navigate through the data if you recognize these special naming conventions.

  • Tools is a special subdirectory that should be present within each filetype subdirectory. As the name suggests, a Tools directory contains non-data modules that are related to the data. In here you may find things such as SAS programs that were used to create the datasets in the filetype directory, codebook files, sample applications, sample listings, etc.
  • Docs is a subdirectory that has been copied from a Census Bureau CD-ROM, from a directory that was almost always named DOCUMENT on the CD-ROM. These files may not always be directly relevant to the archive, since they tend to describe a desktop CD-ROM environment. But frequently they include codebook files and various metadata files (e.g. table subject index files for STFs) that may be useful in working with the data in the archive.
  • a, b or c Such single-letter subdirectories are used to correspond to the Census Bureau's convention of breaking certain filetypes down into "files". For example the /a subdirectory of /mscdc/data/stf904 contains data distributed by the Bureau designated as "Summary Tape File 4, File A". We have actually tried to avoid these subfile distinctions (as in the case of the stf903 filetype, for example) but there are a few cases where we have used the Bureau's convention.
  • Readme(.html) files are just what you'd expect. They provide an introduction or overview of what to expect in the directory.
  • Metadata.(filename.)html files provide variable-level documentation for the observations of the datasets in the directory. If there is a filename level before the ".html" extension then it means the documentation is specific to that SAS dataset. For example, the file /mscdc/data/saipe/Metadata.html has variable-level descriptions of any SAS datasets in the SAIPE (Small Area Income and Population Estimates) directory. The file /mscdc/data/popests/Metadata.uscom96.html has variable-level descriptions of the variables in the /mscdc/data/popests/uscom96.ssd01 dataset. (Note: the variable descriptions from these files are added to the reports generated by the hypercon application in the description column.)
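The Metadata naming convention lends itself to a small worked example. Given the path of a SAS dataset, the Python sketch below lists the two documentation files one would look for in the same directory. The function name is hypothetical, and trying the dataset-specific file before the directory-wide one is our assumption, not a documented rule.

```python
from pathlib import PurePosixPath


def metadata_candidates(dataset_path):
    """Return possible Metadata file paths for a dataset, most specific first."""
    dataset = PurePosixPath(dataset_path)
    directory = dataset.parent
    basename = dataset.stem  # "uscom96" for "uscom96.ssd01"
    return [
        str(directory / f"Metadata.{basename}.html"),  # specific to this dataset
        str(directory / "Metadata.html"),              # covers the whole directory
    ]


for candidate in metadata_candidates("/mscdc/data/popests/uscom96.ssd01"):
    print(candidate)
# /mscdc/data/popests/Metadata.uscom96.html
# /mscdc/data/popests/Metadata.html
```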

Not Exactly the Data Archive

We have stated that the MCDC Data Archive consists of the collection of files and directories in or below the /mscdc/data directory. That is one way to look at it; that definition will cover all the files that actually contain data (i.e. specific information about geographic areas or persons/households in a sample survey, etc.) But there are some other files stored in other subdirectories of the /mscdc directory that are related and may be of some interest. We'll not go into great detail here, but users are welcome to browse. The URL to point uexplore to this level is:

http://www.oseda.missouri.edu/cgi-bin/uexplore?/mscdc@secure

Mostly what you will find here are collections of things like SAS macros, SAS format modules, old report files, etc. Probably not of interest to the general user. More likely to appeal to anyone who understands SAS and would like to look at some of the tools we have used to build the archive.

Speaking of Tools, the subdirectory /mscdc/metadata is actually where all of the Tools subdirectories physically reside. It's all done with pointers (symbolic links.) If this makes absolutely no sense to you, forget about it. Not important then. Many (most?) of the files that you will find in these directories were created on and for the IBM MVS operating system. You'll be able to tell this when you open files and see 8-digit sequence number fields in cols. 73-80 and the text entered in all uppercase. Certain directories, like appsmvs (applications for MVS) are obsolete -- they point to files and printers and tape cartridges that in many cases no longer exist. We keep them around as models for new UNIX applications and for historical reference.
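If you want to reuse one of these old MVS files, the sequence numbers in columns 73-80 usually need to come off first. Here is a minimal Python sketch of how one might detect and strip them. The function names are hypothetical, and the test for "looks like an MVS file" (every non-blank line is at least 80 characters with digits in columns 73-80) is a simplifying assumption.

```python
def has_sequence_numbers(lines):
    """Guess whether lines carry 8-digit sequence numbers in columns 73-80."""
    checked = [line.rstrip("\n") for line in lines if line.strip()]
    # Columns 73-80 (counting from 1) are the slice [72:80] in Python.
    return bool(checked) and all(
        len(line) >= 80 and line[72:80].isdigit() for line in checked
    )


def strip_sequence_numbers(lines):
    """Keep only columns 1-72 of each line and drop trailing blanks."""
    return [line.rstrip("\n")[:72].rstrip() for line in lines]


# A made-up card image: a statement padded to 72 columns plus a sequence number.
card = "DATA NEW; SET OLD;".ljust(72) + "00010000"
print(has_sequence_numbers([card]))    # True
print(strip_sequence_numbers([card]))  # ['DATA NEW; SET OLD;']
```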

Summary

We have provided only a very brief overview of some basic features of how the MCDC archive is structured and what sorts of information are stored there. We need to do a much longer tutorial or maybe a short book to cover this subject in sufficient detail. But using the tools described in this tutorial you should be able to make your way around the archive and sample its contents. Maybe the best navigation tool you have is the pervasive feedback/comments buttons that appear near the bottom of most of the HTML pages associated with this system. You can always use these tools to ask for assistance in finding what you are looking for, or in trying to make sense of what you have already found in the archive. Or you can just contact: John or Evelyn, Ph. 573-882-7396.

