Basics of the MCDC Data Space
What We'll Be Talking About
In this chapter we want to provide a low-tech overview of some basic organization and naming conventions used in the MCDC data archive. We'll be talking about UNIX directories and subdirectories, paths, filenames, filename extensions, special filenames that always mean the same kind of data, etc. You really do not have to know much, if anything, about UNIX to understand this material. However, you should have some familiarity with tree-type directory structures, such as those used with most PC operating systems, including DOS and Windows. These structures were modeled after the UNIX conventions, which have been around much longer. The biggest practical difference for a user is that UNIX uses forward slashes ( / ) where the PC systems use backslashes ( \ ). And, in UNIX, case matters: "Contents" and "contents" are two different files.
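The two differences just mentioned (slash direction and case sensitivity) can be seen in a small Python sketch. It uses the standard pathlib module purely for illustration; the PC-style path with a C: drive is a made-up example, not a real location of the archive.

```python
from pathlib import PurePosixPath, PureWindowsPath

# UNIX-style references separate names with "/"; DOS/Windows ones use "\".
unix_ref = PurePosixPath("/mscdc/data/Contents")
pc_ref = PureWindowsPath(r"C:\mscdc\data\Contents")  # hypothetical PC path

print(unix_ref.parts)
print(pc_ref.parts)

# Case matters under UNIX rules but not under Windows rules.
unix_same = PurePosixPath("Contents") == PurePosixPath("contents")
pc_same = PureWindowsPath("Contents") == PureWindowsPath("contents")
print(unix_same, pc_same)
```

Under UNIX rules the two spellings compare as different names, which is why "Contents" and "contents" can sit side by side in the same archive directory.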
Directory Basics
Directories are the same thing as folders in the Windows 95-and-above world. It is unlikely that you would be reading this if you were not already familiar with this concept, since it is fundamental to accessing information on the Web. Directories can contain other directories -- referred to, of course, as subdirectories -- as well as files. It is the files we usually want to get to; the directories are just things we use to organize them. By carefully placing our files in certain folders we make it much easier to find what we are looking for. When we refer to a specific file, we sometimes include references to the folders in which it is contained. In the UNIX world we use the "/" character in file references to indicate that what appears to the left of it is a directory (folder) name. So when we refer to the file "/mscdc/data/stf903/usi.ssd01", what we are really talking about is a file named "usi.ssd01" which is contained in a folder named "stf903", which in turn is contained in a folder named "data", which in turn is contained in a folder named "mscdc". If the person creating all these folders knows what (s)he is doing, there will be a logical structure to all of this. If stf903 is in a folder named data, then we would hope that it refers to some kind of sub-category of data; similarly, since the data folder is stored within the mscdc folder, we would expect that the data referred to is somehow related to something called "mscdc". (Of course, we all know that "mscdc" is just lower case for "Missouri State Census Data Center".) The full "/"-delimited list of folders in which a file is contained is referred to as the path of the file. In our example the file usi.ssd01 has a path of /mscdc/data/stf903.
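The way a file reference breaks down into a filename, a path and a chain of folders can be sketched in a few lines of Python. This is only an illustration using the standard pathlib module; the variable names are our own and nothing here is part of uexplore.

```python
from pathlib import PurePosixPath

# The example file reference from the text above.
ref = PurePosixPath("/mscdc/data/stf903/usi.ssd01")

filename = ref.name                    # the file itself
path = str(ref.parent)                 # the "/"-delimited list of folders
folders = list(ref.parent.parts[1:])   # folder names, outermost first

print(filename)   # usi.ssd01
print(path)       # /mscdc/data/stf903
print(folders)    # ['mscdc', 'data', 'stf903']
```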
The /mscdc/data Directory and Filetypes
The MCDC data archive is a collection of files and paths stored within the directory /mscdc/data. More precisely, the files are stored within the network of subdirectories of this parent directory. There is actually only a single file directly in the /mscdc/data directory; everything else at that level is a subdirectory, and we refer to these subdirectories as filetypes. They correspond to the major categories of data within the archive. Examples of filetypes are cbp (County Business Patterns), stf803 (1980 Summary Tape File 3), stf901 (1990 Summary Tape File 1) and pums90 (1990 Public Use Microdata Sample). Every data file in the archive is associated with one and only one of these filetypes. We have attempted to assign names to these directories which are mnemonic, i.e. that correspond to the names by which these data files are commonly known. Many of the filetypes will be very familiar to people used to working with census data (e.g. the "stf" names are widely known and used). But there are some filetypes that the MCDC has more or less invented to contain data that come from local resources (such as Missouri state agencies). Most users of the archive will have no desire or need to get a handle on the entire collection of filetypes. Practically speaking, only the most serious data users are going to need to access more than a small fraction of the filetypes contained in the archive. A small number of filetypes (for example, the 1990 census summary files) will typically account for a very large percentage of user interest and access.
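Since every data file belongs to exactly one filetype, and the filetype is simply the first directory below /mscdc/data, the rule can be written down as a short Python sketch. The function name filetype_of is hypothetical and is not part of the archive software; it only restates the convention described above.

```python
from pathlib import PurePosixPath
from typing import Optional

ARCHIVE_ROOT = PurePosixPath("/mscdc/data")

def filetype_of(file_ref: str) -> Optional[str]:
    """Return the filetype (first directory below /mscdc/data) of a file
    reference, or None if the reference is not inside a filetype directory."""
    ref = PurePosixPath(file_ref)
    try:
        relative = ref.relative_to(ARCHIVE_ROOT)
    except ValueError:
        return None  # not below /mscdc/data at all
    # A data file needs at least a filetype directory plus a filename.
    if len(relative.parts) < 2:
        return None
    return relative.parts[0]

print(filetype_of("/mscdc/data/stf903/usi.ssd01"))
print(filetype_of("/mscdc/data/Contents"))
```

The second call returns None because the Contents file sits directly in the parent directory rather than inside any filetype.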
Contents Files
We stated in the previous section that the "parent" directory of the data archive (/mscdc/data) contains a collection of subdirectories and only a single actual file. That file is named Contents. There are many files with this name in the archive; these are special files that are used to control access to, and provide additional annotation for, the data entities within a directory. If a directory does not contain a properly formatted Contents file, then uexplore will not permit access. As a user of the system, you really do not need to know this, and you certainly do not need to be concerned with the mechanics of just how a Contents file is formatted and interacts with uexplore. But if you are curious you can point uexplore to the top directory of the archive using http://www.oseda.missouri.edu/cgi-bin/uexplore?/mscdc/data@secure (you probably ought to go ahead and bookmark this) and then click on the "contents@" entry. Normally, you cannot actually browse a Contents file, but in this case we created a pseudo-file named "contents" (note the lower-case "c" instead of "C") which is really just a pointer to the Contents file. It's a very simple file. It just has the names of the files contained in the directory and some text describing what each file contains. When you invoke uexplore and specify a directory to start exploring, the program looks for the Contents file in that directory and uses it to generate an HTML "menu" page where the information is slightly reformatted and where each filename becomes a hyper-link to that file or subdirectory. Subdirectories are shown with a trailing slash. Most other files are shown with no special character appended (the "contents" file has a trailing "@" symbol, which tells us that this is not an actual file but rather just a pointer, or link, to another file).
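For the curious, here is a minimal Python sketch of the two conventions just described: how a uexplore URL is put together, and how entries are marked on the menu page. It is not the uexplore program itself. The function names and the "directory"/"link"/"file" labels are our own inventions, the "@secure" suffix is simply copied from the URLs shown in this chapter, and we are assuming that other directories follow the same URL pattern as the top-level one.

```python
UEXPLORE_BASE = "http://www.oseda.missouri.edu/cgi-bin/uexplore"

def uexplore_url(directory: str, suffix: str = "@secure") -> str:
    """Build a URL of the form base + "?" + directory + suffix.
    The suffix default is copied from the URLs in the text (an assumption
    that it applies to every directory)."""
    if not directory.startswith("/"):
        raise ValueError("expected an absolute UNIX-style path, e.g. /mscdc/data")
    return f"{UEXPLORE_BASE}?{directory.rstrip('/')}{suffix}"

def menu_label(name: str, kind: str) -> str:
    """Append the marker shown on the menu page: "/" for a subdirectory,
    "@" for a pointer (link), nothing for an ordinary file."""
    markers = {"directory": "/", "link": "@", "file": ""}
    return name + markers[kind]

print(uexplore_url("/mscdc/data"))
print(menu_label("stf903", "directory"))
print(menu_label("contents", "link"))
```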
When you select a subdirectory (by clicking on its hyper-link name) all you are really doing is re-invoking uexplore with a new value after the "?" to tell it which (sub)directory is to be "explored".
File Extensions
As with DOS and Windows systems, filenames used in the archive use special conventions with regard to the final portion of each name (the portion following the rightmost period). Such filename extensions can serve as a signal to the uexplore program, telling it what kind of information the file contains or what format it is in. The following is a list of the extension values that the program recognizes:
Special Files and Directories
There are certain special filenames or subdirectory names that we use throughout the archive. It will help you navigate through the data if you recognize these special naming conventions.
Not Exactly the Data Archive
We have stated that the MCDC Data Archive consists of the collection of files and directories in or below the /mscdc/data directory. That is one way to look at it; that definition will cover all the files that actually contain data (i.e. specific information about geographic areas or persons/households in a sample survey, etc.). But there are some other files stored in other subdirectories of the /mscdc directory that are related and may be of some interest. We'll not go into great detail here, but users are welcome to browse. The URL to point uexplore to this level is: http://www.oseda.missouri.edu/cgi-bin/uexplore?/mscdc@secure . Mostly what you will find here are collections of things like SAS macros, SAS format modules, old report files, etc. These are probably not of interest to the general user; they are more likely to appeal to anyone who understands SAS and would like to look at some of the tools we have used to build the archive. Speaking of Tools, the subdirectory /mscdc/metadata is actually where all of the Tools subdirectories physically reside. It's all done with pointers (symbolic links). If this makes absolutely no sense to you, forget about it; it is not important. Many (most?) of the files that you will find in these directories were created on and for the IBM MVS operating system. You'll be able to tell this when you open files and see 8-digit sequence number fields in cols. 73-80 and the text entered in all uppercase. Certain directories, like appsmvs (applications for MVS), are obsolete -- they point to files and printers and tape cartridges that in many cases no longer exist. We keep them around as models for new UNIX applications and for historical reference.
Summary
We have provided only a very brief overview of some basic features of how the MCDC archive is structured and what sorts of information are stored there. We need to do a much longer tutorial or maybe a short book to cover this subject in sufficient detail.
But using the tools described in this tutorial you should be able to make your way around the archive and sample its contents. Maybe the best navigation tools you have are the feedback/comments buttons that appear near the bottom of most of the HTML pages associated with this system. You can always use these to ask for assistance in finding what you are looking for, or in trying to make sense of what you have already found in the archive. Or you can just contact John or Evelyn, Ph. 573-882-7396.