WE EXPLORE Missouri Census Data Center

The xtract Application

Overview

This is probably the most important, and most complex, component of the whole exploration system. The other components are just helper modules to guide you to the data and help you to understand what's there and how it is structured, named, identified, formatted, etc. But it is the xtract application that lets you actually view and/or retrieve the data you want.

As data query engines go, xtract is rather basic. It has facilities to let you:

  • retrieve data in any of five common formats (printed report, comma-separated value, .dbf (dBASE), HTML table, and SAS export file.)
  • do basic observation filtering of the chosen data set, i.e. you can specify that certain variables must take on certain values or you don't want that observation.
  • choose which variables you want.

That's basically all it does. It only takes data from a single data set and it does no combining (aggregation, merging, ...) of data stored in different observations. But what is hoped is that if you can extract the basic numbers you need, and get it into a format that you are used to dealing with (a spreadsheet, a SAS dataset, or even a printed report) that you will be able turn the raw data into useful information. Or, another possibility, that you will call the Urban Information Center or other MCDC core agency, and let them do some custom programming for you using the full power of the SAS software. It's not a free service, but it's very reasonable.

In most cases the hardest part of using xtract is knowing how to find the data you want, and being able to use the hypercon and metadata tools to understand what the data are about. The mechanics of using xtract to get what you want -- once you have figured out just what that is -- is not trivial, but it's really not as difficult as it might look. It requires filling out three forms before you get any output , and there are places where there are no pull-down menus and you actually have to enter a code or list of codes. We hope to change that at some point -- the application could and should be enhanced to make it easier and more reliable. But for now, it's an application that even though it is quite simple, can give you access to an awful lot of very useful information and in what most people will find to be a rather convenient format. You just may have to know a little bit more about the data archive than you'd really like to.

Jumping In: A Quick Example

A good way to get a feel for what this application is about is to go ahead and do a quick and simple example. We'll avoid any unnecessary options and focus on how simple it can be if you you have a simple request. Start by invoking the uexplore application and pointing to the population estimates directory. The URL is :

http://www.oseda.missouri.edu/cgi-bin/uexplore?/mscdc/data/popests@secure

(If you are already in the uexplore application, just back up to the main directory and select the popests filetype.)

There are two basic modes in which people operate when accessing the data archive. They either

  • have a specific piece of data or dataset in mind that they are searching for, or
  • they have some general idea of a subject area (such as population estimates) in mind, but what they are really doing is looking around in the archive to see what they can find that might be helpful.

Unfortunately, while the latter mode is quite a good approach and is what most of us are used to doing on the Web, most users will probably come away a little frustrated if they try to do this with the MCDC archive right now. The end-user navigation tools are just not good enough yet, at least not for first-time or casual users. You can, and people do, find good stuff by surfing, but you have to be very interested and very patient.

This said, let's assume you are not in that surfing mode, but already have a notion of what you are looking for. Let's assume that you are looking for the latest set of population estimates for all the counties in the state of Missouri. This explains why you are in the popests directory. Maybe you used the phone or e-mail to an MCDC public information specialist and they already gave you the name of the dataset you need to look for. Or maybe you are just a very good and very patient reader and you studied the Contents page for the popests directory and have discovered the dataset and description:

mocom96.ssv01
Missouri subset of uscom96 with population estimates with components of change,
1990-1996
.

And you somehow understood that "subset of uscom96" was actually referring to another SAS dataset in the same directory so you scrolled on down the alphabetized Contents-based page and saw:

uscom96.ssd01
SAS data set with the US population estimates with components of change data as
released by the Bureau and created by the Federal State Coop for Population Estimates
(FSCPE.) First obs is US totals (SLVL='010'), then observations for states and
counties. SLVL='040' for state totals, SLVL='050' for county data. Sorted and indexed
by state (2-digit FIPS state code.) Created by Tools/uscom96.cnvt.sas program. To
access data just for Missouri use the mocom96.ssv01 SAS view.

That last sentence actually tells you which dataset to use. Of course, it refers to something called a "SAS View" just to confuse you. Try not to worry about it. A SAS View acts just like a regular SAS dataset except that it's a software trick. When you access the view, it just has instructions in it that tell SAS what data it has to get out of the uscom96 dataset. Go ahead and click on the mocom96.ssd01 link. When the intermediate sasapps menu page comes up, just take the default application by immediately clicking on the Submit Request button. This takes you to the xtract application's first (of 3) menu page.

As usual, uexplore has done much of the work for you by already filling in the path and filename entries. What you have to do on this page is simply tell the application what format(s) you are interested in. You can have anywhere from 1 to 5 different output files. The first line in this section has a pair of checkboxes, the left one for "Comma Delimited File", and then the parenthesized "(Tab Delimited") box. If you select the Tab Delimited option it just means you want to substitute tab characters for commas in your delimited output file; you get only one delimited file.

If there are format options that you do not recognize then it is unlikely that you'll be interested in that format. If you don't have a clue about what a SAS export file is, then just skip it. The one option you should be careful about is the HTML one, since it creates some restrictions that could cause your query to fail. Do not choose HTML format if your extract will involve more than 200 observations or more than 15 variables. For the sake of our example, we'll assume we are going to be getting only about 115 observations (the number of counties in Missouri) and we certainly should not want more than 15 variables. We'll want to sample a variety of output formats. So -- select (click on the boxes next to)

  • Printed list file
  • HTML Table
  • (leave comma-delimited choice as is, i.e. selected)

Click on the Continue button.

Next comes the Filter Specifications and Sort Criteria form page. This is where the real work takes place. The first thing you may want to make use of is the box near the top where you can limit the number of observations that will be selected. When requesting HTML format output you may as well enter 200 in here since that is the limit anyway, but if you enter it here the request will not abort as it will otherwise if you exceed that limit. The next line presents you with an interesting opportunity/dilemma:

Click here to see hypercon report on mocom96 data set.

What you are about to do is tell this application what observations and variables you'll be wanting to extract from this dataset. We thought this would be a good place to let you use the application that lets you have a look at the names, labels and some sample values for those observations and variables. So go ahead and click on the word here to bring up the hypercon report page.

One of the first pieces of information you'll see on this page is that this set contains -1 observations. This is because the SAS program does not know how many observations will be selected when a view is executed (we know there are 116, one for the state totals, and 115 county summaries.) For regular SAS datasets the observation count will be reported accurately. We also see that there are 112 variables in this dataset. We get the file description, and then the table with 112 rows, one for each variable. The rightmost 3 columns of the table correspond to the first 3 observations on the dataset. So we can see that the first observation contains data summarizing the entire state, the second observation has data summarizing Adair County, etc. We can see the names of the variables and the descriptive labels. All we want to do on this page is mentally (or we're allowed to take notes) choose the variables we'll want to be keeping on our output files. If we were thinking of being selective about observations we would need to look carefully at any fields that we were thinking of using to define the filtering conditions. We'll come back to this. For now, lets assume we want to keep all the observations.

After you have absorbed all the information about what the dataset contains, use the back button to return to the xtract filtering page. Since we are not doing any filtering we can just ignore all the hard parts and scroll down towards the bottom of the page. In the text entry box in the TITLE OPTION section type in a title for the report that you are about to generate. Something like

My Very First xtract-generated Report . Then click on the Continue button.

The final page of the xtract menu trilogy is the Extract Variable Selection Form.. Pretty simple page, though you may spend some time deciding just which variables you want to keep. Any variable used in filtering (from the previous page) will be pre-selected but otherwise you'll see a select list with the names of all the variables in the set, with none of them selected. The variables will be in the same order as what you were just looking at in the hypercon dataset listing report. For this example just go ahead and click on any 15 variables that interest you. Be sure you pick enough so that you can identify the data you are retrieving. In this case the crucial variables are COUNTY - the FIPS county code, and AREANAME - containing the county name (or "Missouri" in the case of the state summary observation.) You need at least one of these and keeping both would be a good idea. Note that we have organized this (and many others, though not quite all) of the datasets in the archive so that the observation keys are at the beginning of the observation. Related data items are usually stored in adjacent variable positions. In this case, we have packed some of the most frequently used variables -- the current population estimate, the change and pct change figues, etc. near the start of the observation. Noticed that once again, there is a button that will allow you to bring up the Hypercon dataset report if you would like some assistance in interpreting the meaning of the variable names.

On most browsers you'll need to hold down the control key while clicking on your variables choices after the first. Otherwise each new click unselects the previous select and replaces it with the current choice instead of adding it to the list of chosen entries. Be careful not to select more than 15 variables. Hit the Submit button to execute the request (this is not a Continue button as we had on the previous two pages.)

It should take less than ten seconds for your request to be processed (unless you are doing this in a classroom setting where 15 or 20 other people are submitting the same request on the same computer -- that could slow things down a bit.) The next page that appears in your browser window should look something like Figure 3.

Fig. 3 Xtract Output Menu Page (Sample)

All that rather strange "Directory:..." title is about is telling you where the application has stored the results of your query. In this case, it has created a set of 5 output files for you. Yes, you only requested 3, but you always get a summary.log file. It contains mostly techno-babble stuff but you may find it slightly useful to help document just what specifications you typed in to produce these results. The other "extra" file is the varlist.lst file which is a companion to the xtract.csv (comma delimited file - csv stands for comma separated value, the Windows standard extension for such a file); it contains the variable names on the file and their labels. You should click on each of the 5 files and carefully examine what they contain. Note, for example, that the first line on the .csv file contains the names of the variables (or fields) contained in the file. Handy if you import the file into a spreadsheet since it stores the variable names as the first row of the sheet. The xtract.lst file is not very sexy looking -- very plain text report on default gray background with variable names as column headers. But it has your information in a reasonably useful format. Not very good when you want dozens of variables but good for small requests like this one. The xtract.html file is the most interesting. It's in color and uses the variable labels (if present) instead of the names as the column headings. But you may notice it takes a while to load this file, much more than it takes to load the simple text listing file. What you may find surprising about all this is how relatively quick the actual steps to do the extraction are. It typically takes longer to generate a hypercon report for a dataset with over 100 variables than it does to generate a small to medium-sized extract report set. You'll also find that adding requests for extra output formats has very little effect on how long it takes. The exception to that is the HTML format, but that is restricted to smaller requests.

If you would like to simulate the experience of reviewing the results of this xtract query without having to do all the clicking we have described, we have saved a copy of this output menu page and copies of the files it references at http://www.oseda.missouri.edu/uic/uicapps/weexplore/sample.xtract/index.htm


Filtering The Data

In our previous example, we did an extract that involved taking only selected variables, but we were not selective about observations -- we accepted the default of taking all of them. But the real power of using the xtract application is the ability to subset the sometimes enormous datasets stored in the archive, to get just the observations (rows, records, geogaphic areas, households, etc.) that you need. The tools for doing this are on the second of the three menu pages associated with the xtract application, the one titled Filter Specifications and Sort Criteria Form. You can get to this through the front door by invoking uexplore and selecting a dataset, and then selecting the xtract application, and then filling out the first menu page to specify your output file options. Or, if you are still running the previous example and are browsing your output files, you can back into the page by hitting the back key repeatedly to get back to this page. When you do this, the values you entered on the form when you came thru here initially will still be there. (Likewise, the values you entered on the first form regarding output file selections are still invisibly present and "remembered" as you fill out this form again.) So you do not have to go all the way back to the beginning if you decide to change how you filled out this second form -- you can always go back in the application. What you will find, however, is that when you go back and then come forward the application will not remember what you previously entered on the later page. So if I go back to the second menu page (filtering) and fill it out again and submit it, the next Variable Selection menu I see is not going to have the selections I made previously still "remembered" for me -- I'll have to repeat those selections. Similarly, if I go back to the first page, then anything I have entered on the second or third menu pages is gone and will have to be re-done when I come back through. We may try to fix this in a future release, but that is how it works now (in the fall of 1997).

The crucial portion of the Filter Specifications form is a set of five rows of Variable-Operator-Value items. Variable is a pull-down menu box that lets you select a variable from the dataset. The initial value is none meaning you are not using the row as part of the filtering specs. When xtract detects a value of user for this item in any row, it assumes the filtering specifications are finished (it won't check any later rows either.) The Operator item is also a pull-down menu that allows you to select a logical operator to be inserted between the selected variable and the constant value that you need to enter in the Value text entry box (the third column.) Figure 4 shows a sample of this section of the form, after it has been "filled out". We are still working here in the popests.mocom96 dataset from our previous example. We got POP90 to appear in the first-column box by clicking on the pull-down menu there and then selecting (clicking on) the POP90 item in that drop-down list. Similarly, we caused Greater Than or Equal to (>=) to appear in the 2nd-column box by using that pull-down menu and clicking on the entry with this value. The "5000" value in the 3rd-column box we typed in. So what does this mean?

It means that we only want observations for our extract that satisfy the logical condition we have specified: the value of variable pop90 must be at least 5000. Since we know, from a careful reading of the Hypercon report, that the pop90 variable contains the 1990 population of the area, that what we are really saying is we only want areas that had at least 5000 people living in them according to the 1990 census.

Separating the rows of this "filter specification table" you'll notice a set of 3 radio buttons labels: And, or and And Not. The And button is "pushed" (by default, and we have not altered that default value in our sample) saying that if we now specify another logical condition by filling out another row of the form, we'll want that condition to logically "anded" with the one that follows. In this example we did a similar sequence of points, clicks and typing to turn the second row into the logical condition saying that the change in population (variable POPCHANG) has to be more than 500 or we don't want to see it. Because we have anded the conditions, we have said to only give us observations where both conditions are met: the 1990 population is at least 5000 and the change is over 500.

Fig. 4 A Filled-Out Filter Specifications Form

Go ahead and see if you can replicate this filter screen and submit the screen. We entered a title in the box and then selected about 10 variables on the Variable Selection page. Figure 5 shows an example of what we saw when we browsed our report output file (after setting our fixed-font size to the smallest possible setting).

Fig. 5 Sample Listing File With Filtering Per Fig. 4

Filtering Using FIPS Codes

Lets do one more xtract example with filtering. But this time, we'll make it a little harder. We'll apply a filter that requires knowing some of the geographic codes used in the dataset.

Hit the back key several times to get back to the original first page for the xtract application. We are still working with the dataset popests.mocom96. This time, go ahead and select all of the possible output formats, and select the Tab-delimited option. Submit this form.

On the Filter Specifications and Sort Criteria form we have decided that we'll ask for a report showing only counties that are in the St. Louis or Kansas City metropolitan statistical areas (MSA's). We see these values in the reports so we know they are on the dataset, so it should be easy to filter based on them. The tricky part here is that while in the report (see Fig. 5) we see "Kansas City, MO-KS" listed in the column labeled MSACMSA, that is not actually the value coded in the dataset. To refresh your memory click on the here button near the top of the form to take you to a hypercon report screen for the dataset. Look at the entry for the MSACMSA variable. It shows that it has a LEN value of "$4" and a FORMAT value of "$METRO.". We talked about this earlier, but to refresh your memory what this means is that the value stored on the dataset is a 4-character code, but we have used something called a SAS format code to specify that whenever the value is displayed an automatic lookup process will take place using the format code and the label specified for the code will be displayed instead of the code. Recall also, that the format name in the display is a hyper-link to the format source module. So go ahead and click on it. When the format module is displayed use your browser's find command to locate "kansas" and then "louis" (case does not matter). Note that the codes for the two metro areas are "3760" for Kansas City and "7040" for St. Louis. Go back twice to get to the Filter page. You are now ready to point-click-and-enter the 2-row filtering specications as shown if Figure 6.

Fig. 6 Filtering Using FIPS MSA Codes
You'll notice we set our browser proportional font as small as it would go so that we could get this much information in the screen image.

Important things to note here are the value entered -- 3760 and 7040, and also that we clicked on the or radio button between the first two rows. If we left this as an And operand we'd have been saying we only wanted observations where the MSACMSA variable was equal to both 7040 and 3760. This is, of course, logically impossible. You would get no observations selected. It's also a very common error to make.

An equivalent way to obtain exactly the same results using only one row of the filtering matrix would be to enter:

MSACMSA as the Variable. IN as the Operator. 3760:7040 as the Value

The key here is the use of the "IN" operator, which allows you to specify a list of values in the Value box, each separated from the previous value with a colon (:). This gets translated (ICYC) to the SAS statement

where msacmsa in ('3760','7040');

Note also in fig. 6 that we have entered something in the sort specifications box. We typed msacmsa popest (D) which means we want the observations sorted first by the metro area codes, and then within metro area by the value of the popest variable but in descending order. The instructions for doing this are on the form.

Figure 7 shows what we get when we select the resulting HTML file on our output menu after submitting the request (we had to re-specify our variable selections again, of course, before submitting the request.) You can see the results of our typing in a value in the TITLE entry box of the Filter .... form. Note that the two sort values appear at the left and are color-highlighted.

Fig. 7 HTML Report for Two MSA's With Sorting
HTML output is not always as practical or useful as plain text reports, but users really like them.

Filtering Using Blank Values

Occasionally you may want to apply a filter that requires specifying that a certain variable does or does not have a blank value. The Xtract application has a special convention for representing blanks when entering them in the VALUE column of the Filters form. You need to type a single underscore (_) character to indicate a blank value. If you select COUNTY as your VARIABLE, EQUALS as your OPERATOR and then enter _ as your VALUE, you are specifying that a condition for inclusion in this extract is that the variable county have a blank value (also referred to a "missing" in SAS software terminology.) This convention only applies to character type variables (identified in hypercon reports by have a "$" in the LEN column.) You can also use this convention when using the IN operator and with the NOT EQUALS operator. An example of using it with the IN operator would be to specify a value of "29:17:_" . This would say that the value of the VARIABLE specified must be either '29', '17' or ' ' (blank) for the condition to be true.

Processing .dbf Files

Since we asked for all five output formats in this example, you should also have an entry on your output menu screen (the one returned to you after you submit the final xtract form). When you click on this file the results can vary depending on your browser and how you have it configured. You can configure most browsers to have plugin applications that automatically kick in when you attempt to browse files with specified filename extensions. ".dbf" is a common and widely used extension that is frequently set up for plugin processing. You can, for example, specify that when you select a .dbf file for browsing that Excel will begin execution and will automatically load the file into an Excel spreadsheet. Even if you do not have your browser configured this way, you will get a special screen presenting you with options regarding how you want to process the file. One option is to save the file to a local disk -- i.e. download it. You can then access the file from your desktop using any application that can read such files. This includes most spreadsheet and database programs. (ICYC -- it also includes SAS, using PROC DBF).

Processing .csv Files

Most of what we said about .dbf files can also be said about comma-delimited files. You can set up your browser to read them automatically, or you can just download them and later load them into many desktop applications. But you can also just browse them, of course, as they are somewhat eye-readable (though not nearly as convenient to peruse as a listing file.) The tab-delimited option enhances their readability on the screen. You can download them after browsing them using the browser's file-save command.

Processing SAS Xport Files

Many/most of you do not know and do not probably want to know what a SAS Xport file is, much less how to process one. Skip to the Summary. If you have SAS on your local system, and want to process the data with SAS, then this is your best output format. When you click on the filename for the ".xpt" file you should get a dialog window asking you how you want to handle this file. (If you do not get such a dialog, try holding down the shift key as you click on the filename link.) You can tell it you want the file saved to a local disk. Once the file is transferred to your site, you can access it from SAS using the following process:

  • Issue a SAS libname command to specify that this is an export file, such as:
    • libname sasxpt xport 'c:\temp\xtract.xpt'; *<--change to where you saved the file-;
  • This xpt file actually has a SAS dataset inside of it waiting to get out and that dataset has a name that is stored on the file. That name is always (assuming it was created by the xtract application, that is) extract. Yes, the filename name xtract.xpt but the name inside is extract. (The extra "e" just to confuse you. ) You'll probably want to convert this to a regular SAS dataset on your local system, but you do not have to. Here are three things you could do to access it:
    • proc copy in=sasxpt out=work; *<---copies all the datasets in sasxpt to the work library. Creates work.extract;
    • proc print data=sasxpt.extract(obs=10); title 'First 10 observations!'; run; *<---access it directly with a proc-;
    • filename dbfile 'c:\temp\sasxpt.dbf'; proc dbf data=sasxpt.extract db3=dbfile; *<--convert to a dbase file--;
    Possible glitch with Permanent Formats

Those value-labeling permanent format codes that we use on the archive can cause some problems when you transer the dataset to your local system. If you download a dataset that has a builtin reference to something like $msacmsa. it can lead to SAS error messages and denying you access to the data set. Type and submit the command options nofmterr; to avoid the error message. But you still are not getting the benefit of the value labeling. You can always use uexplore to browse the /mscdc/sasfmats directory (at oseda.missouri.edu/cgi-bin/uexplore?/mscdc/sasfmats@secure and download the format modules you need. ($MSACMSA will be stored in the file Smsacmsa.sas -- i.e. the $ becomes an uppercase "S", the format name is lowercase, and it has an extension of ".sas".) Once you get that file onto your local system you'll need to use it as part of a program to turn it back into a SAS format code using something such as :

  • filename fmt 'c:\sas\sasfmats\Smsacmsa.sas';
  • proc format; %include fmt; run; *<---now you can use the $msacmsa format code-;

Summary

You are probably either very tired or very excited about all of this by now. If you are tired we'll let you go. If you are excited, then you may just want to keep playing around a bit. There's a lot of data in the archive to explore and extract . Try looking at the stf903x and stf803x filetypes. These are just our standard extracts from the almost-too-large-for-this complete stf803 and stf903 files, which are also here. Or maybe take a look at the recent poverty estimates data in the saipe filetype, or, if you really feel ambitious and are interested in county to county migration data (1985-1990 - sorry, nothing more current) take a look at the stp28 filetype. You'll spend most of your time there in the Tools directory trying to figure out just what we have.

But we digress from a true summary. We've gone through just a few examples of the mechanics of the xtract application. Hopefully, these are enough to get you going. If not, you can always try to put your frustrations into writing and use those feedback keys at the bottom of most of the forms. Also, if you are not frustrated - if you are actually able to figure out what we are talking about and get to some of the data you want, you can always send us a note about that, too.


Main Page of We Explore
|| Overview || Invoking uexplore || Basics of the /mscdc/data space
|| Using uexplore and sasapps || The hypercon application || The xtract application