
4. Merging Variables in the Annual Academic Library Survey datasets

4.1. Abstract

In their seminal Cumulated ARL University Library Statistics, 1962-63 through 1978-79, Kendon Stubbs and David Buxton observed, after a discussion of aspects of the Association of Research Libraries data they had compiled:

These vagaries in data collection over the years stand as they appeared in the original annual issues of the statistics (unless corrections were reported by individual libraries) (p. v).

I have rephrased this considered admonition to form what I have called the Hippocratic Oath of data compilation: "first do no harm." Stubbs and Buxton had observed numbers that were clearly wrong or odd, and there is a temptation to "do something." However, doing anything is usually worse than leaving the data with their vagaries intact.

Unfortunately, definitions and practices often differ from year to year. In merging data from different years to create a dataset that one might use for time series analysis, these differences must be taken into account and changes made in the received text of the data, but only with understanding and good reason. These vagaries present a different kind of problem from the one discussed by Stubbs and Buxton in that they are introduced by compilers after the data exist. And compiling practices change. The purpose of this page is to describe the cases found in the ALS variables where changes in the reported data, as reflected in the documentation, have had an effect on the longitudinal file. Nothing here replaces the NCES documentation, so the analyst is still obliged to be familiar with it. This is a complex series.

The method followed in untangling these data is to focus first on 2004, 2002, and 2000, and then to broaden what is learned from these three years into principles that can be used in joining these years and, with luck, the others. Generally, these three years share a common infrastructure, one that differs from that of the previous years.

What follows, then, is a taxonomy of the variables: the types and characteristics that, considering their likely use in the library world, will have an effect on a longitudinal file of the ALS data.

In addition to this discussion, the documentation of these variables will reflect the results discussed here without the accompanying details as found in the NCES documentation. I fear that most people who use data do not have the patience to slog through the kinds of details that I discuss here. In any case, for those who do, this discussion arrives at principles that I hope to implement through computer code. Given that erring is human, this code is also made available for anyone to check.


4.2. Crosswalks

The ALS data since 1998 have included crosswalk tables matching the named variables in each year with the named variables from the previous year. The following selected examples come from the 2004 crosswalk, which tells us this year's variable name and its match from the year before:

Simple Examples From 2004 Crosswalk
Variable | Short definition | 2004 | 2002
DUNS | Dun and Bradstreet identification number | DUNS | DUNS
GROFFER | Graduate offering (generated, based on response to IC) | GROFFER | GROFFER
OBEREG | Bureau of Economic Analysis Region Code | OBEREG | OBEREG

This kind of comparison must be done for all years of the data; the example here covers just two years for now. A form of this process for all years of the ALS data can be glimpsed at the ALS main page, where three variables (INSTNM, ICLEVEL, EXBIB) are outlined for the entire period. Remember that there are over 500 such variables, and what begins here is a measured, systematic step toward compiling a dataset in which one can examine trends by resolving such yearly changes. This page will change as more years are added to the dataset. For now, the task is to compare several years in an attempt to create reasonable and explicit principles to guide us through the shoals that lie ahead.

Be that as it may, in this crosswalk example these three variable names report the same thing in the two years listed. But do they report it the same way each year? And what other surprises lurk?
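A crosswalk of this kind can be thought of as a lookup table from one year's names to another's. The following is only a minimal sketch, assuming each year's record is held as a dictionary keyed by variable name; the function name, the sample record, and the EXTRA field are made up for illustration and are not part of the NCES files.

```python
# Hypothetical crosswalk: 2004 variable name -> matching 2002 variable name.
# Only these three pairs are quoted from the 2004 crosswalk; the data
# structure itself is an assumption made for illustration.
CROSSWALK_2004_TO_2002 = {
    "DUNS": "DUNS",
    "GROFFER": "GROFFER",
    "OBEREG": "OBEREG",
}


def align_to_2004_names(record_2002, crosswalk):
    """Return a copy of a 2002 record keyed by the matching 2004 names.

    Variables that have no entry in the crosswalk are left out, so they
    cannot be silently merged under the wrong name.
    """
    aligned = {}
    for name_2004, name_2002 in crosswalk.items():
        if name_2002 in record_2002:
            aligned[name_2004] = record_2002[name_2002]
    return aligned


# A made-up 2002 record; EXTRA stands for any variable without a match.
record_2002 = {"DUNS": "", "GROFFER": "1", "OBEREG": "5", "EXTRA": "x"}
aligned = align_to_2004_names(record_2002, CROSSWALK_2004_TO_2002)
```

Matching names in this way says nothing about whether the values are coded the same way in both years.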


4.3. Variable coding

Consider three examples of differences in the way variables reported in both years were coded, starting with these simple ones:

DUNS
Variable | Short definition | 2004 | 2002
DUNS | Dun and Bradstreet identification number | -3 - Not available | Blank - Not available

When there is no DUNS number, in 2004 the fact is recorded with a "-3," while in 2002, there is a blank. But, note this example, where "-3" is used in both years:


GROFFER
Variable | Short definition | 2004 | 2002
GROFFER | Graduate offering (generated, based on response to IC) | -3 - Not available | -3 - Not available

And another using -3 both years:

OBEREG
Variable: OBEREG
Short definition: Bureau of Economic Analysis Region Code
2004:
0-U.S. Service Schools
1-New England (CT ME MA NH RI VT)
2-Mid East (DE DC MD NJ NY PA)
3-Great Lakes (IL IN MI OH WI)
4-Plains (IA KS MN MO NE ND SD)
5-Southeast (AL AR FL GA KY LA MS NC SC TN VA WV)
6-Southwest (AZ NM OK TX)
7-Rocky Mountains (CO ID MT UT WY)
8-Far West (AK CA HI NV OR WA)
9-Outlying Areas (AS FM GU MH MP PR PW VI)
-3 - Not available
2002:
0-U.S. Service Schools
1-New England (CT ME MA NH RI VT)
2-Mid East (DE DC MD NJ NY PA)
3-Great Lakes (IL IN MI OH WI)
4-Plains (IA KS MN MO NE ND SD)
5-Southeast (AL AR FL GA KY LA MS NC SC TN VA WV)
6-Southwest (AZ NM OK TX)
7-Rocky Mountains (CO ID MT UT WY)
8-Far West (AK CA HI NV OR WA)
9-Outlying Areas (AS FM GU MH MP PR PW VI)
-3 - Not available

Looking at these examples, we can see that DUNS uses the "-3" in 2004 but not in 2002 while GROFFER and OBEREG use it in both years. Some of the 2002 variables use the blank to indicate that the number is not available and some use -3. In fact, the ALS data often use the -3 convention to indicate data are not available—but, as we see here, not always.

And for good measure, let's look at the important variable CYPARCH over three years:

CYPARCH coding
Variable: CYPARCH
Short definition: Current year parent/child indicator
2004:
1 - Parent (Combined data respondent; record contains data for more than one institution)
2 - Child (Data reported on another institution’s record)
N - No response
2002:
1 - Parent (Combined data respondent; record contains data for more than one institution)
2 - Child (Data reported on another institution’s record)
N - No response
2000:
1 - Parent (Combined data respondent; record contains data for more than one institution)
2 - Child (Data reported on another institution’s record)
-2 - No response

2004 and 2002 code "No response" with an N, while 2000 codes it with a -2. The same state is merely coded differently in 2000 than in the other two years.


4.4. Towards principles of compiling data

An obligation of the compiler is to intrude as little as possible on the communication between the analyst and the person in the institution filling out the form. Between us and that person, data editors made an intellectual decision in these examples about how to handle a very common fact in the data world, missing data, and came up with different answers that were not consistently applied.

For the analyst, these two methods of encoding "Not available" or "No response" could cause problems in a dataset spanning these years, particularly for an analyst who does not read the documentation or is not careful. For DUNS, he or she would have to do one thing in one year and something else in another. As a general rule, it is better to be consistent, if possible. For these three variables, -3 is as good a way as any of encoding this state. And with CYPARCH, N would work as well for the three years. Why N and not -2? I believe it better to follow common practice whenever possible in such cases because the analyst is most likely to have the current documentation at hand and, too, NCES's practices are constantly improving, and I think it is good practice to follow their lead when nothing else intervenes.

But let's look deeper, because these variables present different analytical problems. OBEREG is a categorical variable and -3 is just another category represented by a number. Because these are categories, calculating the average region for a set of libraries would have no meaning, for example, but knowing how many institutions of a given type are in a region might be of value, depending on the question at hand. The -3 could as easily be represented by a letter rather than a number. Hence, OBEREG is listed as a character variable, not a numeric variable, in the NCES documentation. DUNS also looks like a number but is likewise treated as a character variable because arithmetic operations on it would have no more meaning than such operations would have on a set of telephone numbers. What would an "average" phone number mean?

CYPARCH is also categorical with two states for non-response in the raw series.
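To make the categorical point concrete, here is a small sketch, assuming the OBEREG codes are read as character strings. The list of codes is made up for illustration; only the labels come from the documentation quoted above.

```python
from collections import Counter

# Labels quoted from the OBEREG documentation (a subset, for brevity).
OBEREG_LABELS = {
    "1": "New England",
    "5": "Southeast",
    "8": "Far West",
    "-3": "Not available",
}

# Made-up OBEREG codes for six institutions, kept as strings.
regions = ["1", "5", "5", "-3", "8", "5"]

# Counting institutions per category is meaningful; averaging the codes
# is not, and keeping them as strings makes that mistake harder to commit.
counts = Counter(regions)
by_label = {OBEREG_LABELS[code]: n for code, n in counts.items()}
```

Here "-3" is simply one more category with its own count, which is why it does no harm in a categorical variable.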

It would seem reasonable to change the blanks in 2002 for DUNS to -3 so the analyst could write one program to handle the years in the same manner. Given that -3 is used conventionally throughout these datasets for such a state, this treatment would result in a change that is consistent with common practice. Similarly, changing CYPARCH's 2000 -2 to N seems indicated.
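The two recodings proposed here are small enough to sketch. This is a hedged illustration, not the input program itself: it assumes each record is a dictionary of character values, and the function name and sample records are hypothetical.

```python
def harmonize_missing_codes(record, year):
    """Return a copy of a record with the two recodings discussed above."""
    fixed = dict(record)
    # DUNS: 2002 leaves a blank for "Not available"; 2004 uses -3.
    if year == 2002 and "DUNS" in fixed and fixed["DUNS"].strip() == "":
        fixed["DUNS"] = "-3"
    # CYPARCH: 2000 uses -2 for "No response"; 2002 and 2004 use N.
    if year == 2000 and fixed.get("CYPARCH") == "-2":
        fixed["CYPARCH"] = "N"
    return fixed


# Made-up records for illustration only.
rec_2002 = harmonize_missing_codes({"DUNS": "  ", "CYPARCH": "1"}, 2002)
rec_2000 = harmonize_missing_codes({"DUNS": "123456789", "CYPARCH": "-2"}, 2000)
rec_2004 = harmonize_missing_codes({"DUNS": "-3", "CYPARCH": "N"}, 2004)
```

The recoding is tied to the year so that a value that is legitimate in one year's coding scheme is never rewritten by a rule meant for another.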

Here is another case that looks similar to GROFFER, CYPARCH and OBEREG but is not:

PCTMIN1
Variable: PCTMIN1
Short definition: Percent minority, generated from 2004 Fall Enrollment survey - responding institutions only (does not include imputed data)
Percent Black, non-Hispanic
2004:
-1 - Not reported
-3 - Not available
2002:
-1 - Not reported
-3 - Not available
2000:
-1 - Not Reported
-2 - Not Applicable
-3 - Not Available

This variable is the same in all three years. The three years share two codes, while 2000 has an additional -2 for "Not Applicable." I will deal with this latter code in the next section. That aside, what is the problem with -1 and -3?

Well, this is a numeric, not categorical, variable. Here the analyst who does not read his or her documentation or look at the data might average a -3 from one institution with the data from another and get a meaningless average. So, for numeric variables, the -3 presents a problem if it is left alone by the compiler because, in the library world, the most common use of these data will be to drop subsets of them into Excel without examining the data or reading the documentation. That is just our reality. Thus, the recompiler is forced into a choice between two bad alternatives. The compiler should do no harm, but should that compiler attempt to protect others from doing harm when they use the data? Glib answers aside, there is an issue here.

Practically, such protection from bad analysis is impossible, but I think it is defensible to argue that changing the negative numbers to, say, spaces in numeric variables is a tiny change that could have disproportionate benefits. That has been done in the two longitudinal series and derived spreadsheets. I have tried to match codes of categorical variables to the most current practice, as discussed with CYPARCH.
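The arithmetic behind this worry is easy to show. The sketch below uses four made-up PCTMIN1 values, two of them missing-data codes, and assumes that a blank is represented as None; the function names are hypothetical.

```python
from statistics import mean

# The negative codes documented for PCTMIN1 across 2000-2004.
MISSING_CODES = {-1, -2, -3}


def blank_missing(values):
    """Replace missing-data codes with None, standing in for a blank."""
    return [None if v in MISSING_CODES else v for v in values]


def mean_of_reported(values):
    """Average only the values that were actually reported."""
    reported = [v for v in values if v is not None]
    return mean(reported) if reported else None


pctmin1 = [12, -3, 30, -1]           # made-up percentages and codes
naive = mean(pctmin1)                # codes treated as data: 9.5
cleaned = blank_missing(pctmin1)     # [12, None, 30, None]
careful = mean_of_reported(cleaned)  # reported values only: 21
```

The naive figure of 9.5 is lower than either reported percentage, which is the kind of meaningless average that blanking the codes is meant to prevent.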

The first question in contemplating this change is: are there cases in these data where this policy would lead to ambiguity? For instance, are there cases where negative numbers and spaces seem to indicate lack of availability? I find no cases. NCES is careful about such things and the data are scrutinized for internal inconsistencies.
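A check of this kind can be sketched as a scan of each numeric column, as read in raw character form, for the presence of both markers. The column names and values below are made up, and HYPOTHETICAL_VAR is not an ALS variable.

```python
def has_mixed_missing_markers(raw_values):
    """True if a column contains both blank entries and negative codes."""
    has_blank = any(v.strip() == "" for v in raw_values)
    has_negative = any(v.strip().startswith("-") for v in raw_values)
    return has_blank and has_negative


# Made-up raw columns, read as character strings before any conversion.
columns = {
    "PCTMIN1": ["12", "-3", "30"],
    "HYPOTHETICAL_VAR": ["5", "", "-1"],
}

ambiguous = [name for name, values in columns.items()
             if has_mixed_missing_markers(values)]
```

An empty result for a given year would support the policy; any name that turns up deserves a look at the documentation before its codes are blanked.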


4.5. Not available, not applicable, unavailable, not reported

The question of changing negative numbers used in coding brings up another, interwoven conundrum and it would be useful to deal with this now before proceeding.

It is a common fact of datasets that there are missing data. Data can be missing for a host of reasons. The two most common are that the person filling out the form does not know the answer and that a question is asked about something not done at the institution, such as the number of graduate students at an undergraduate-only institution. The Association of Research Libraries drew the distinction in the 1980s between these two states as "Unavailable" ("U/A") and "Not applicable" ("N/A"). That distinction makes intuitive sense in the abstract but often makes little difference analytically. NCES generally uses "Not available" for both states but not always. In any case, the abbreviation "N/A" is clearly ambiguous: does it mean not available or not applicable?

We have a similar kind of distinction here with PCTMIN1. It has a code for "Not reported," and "Not available," and in 2000 it had a code for "Not Applicable." Note how for this variable, NCES's "N/A" is equivalent to ARL's "U/A." For the life of me, I cannot see any difference between the two in this case because they mean the same thing. Every institution has a number for this category and Not Applicable would be 0%. I would bet that is why -2 was dropped in 2002 and 2004—it just has no meaning.

Given the ambiguity with these various codes, I am inclined to regard changing them in numerical variables to spaces as a minor change. I have never seen any analysis that uses these distinctions and I am going to make the changes but I am mindful of another general dictum I have followed over the years in doing these compilations—to treat analysis and compilation as separate things. There are several reasons:

  1. Just because I know of no case where someone has or has not done something does not mean that it has not happened.
  2. Other analysts with more imagination will see things that I cannot see.
  3. I learned a cautionary tale from a piano tuner who told me that the best piano tuners are ones who do not know how to play the piano. The reasoning is that if I play the piano and tune them, I will tune pianos to the sounds I like, not to a neutral set of sounds. Given that the sole reason I do these data compilations is to analyze the resulting data, I am mindful of this warning.

I would welcome any discussion or disagreement on this course of action.

4.6. In-band/out-of-band metadata



An aspect of the variables discussed so far is that they report data and the state of the data in the same field. Hence, we might have an institution with 100 graduate students and another with -3, indicating the status of that institution's reporting. The phone company used what was called "in-band" signalling in a similar way. One used to be able to hear signals related to managing calls in the calls themselves. As a principle, this is not a robust system for all the reasons that we have wrestled with so far and more than a few the phone company discovered to its sorrow. Networks also use "out-of-band" signalling, and this method has a parallel in the ALS data with status flags, a better method for encoding the state of data than negative numbers in a number field. This convention is used by 65 variables in 2004 and by 76 in 2004-2000. Of course, there is no such thing as a free lunch, and with this new status variable, alas, there are ten different states that can be assigned to a variable. I will deal with these next.
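The difference between the two conventions can be sketched as follows. This is an illustration only: the Cell class and the status labels "reported" and "not available" are hypothetical stand-ins, not the ten NCES status codes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Cell:
    """A value with its reporting status carried in a separate field."""
    value: Optional[int]
    status: str  # hypothetical label, not an actual NCES status code


# In-band: the -3 code sits in the same field as the real count.
in_band = [100, -3]
in_band_total = sum(in_band)  # 97, silently wrong

# Out-of-band: the count and its status travel separately.
out_of_band = [Cell(100, "reported"), Cell(None, "not available")]
out_of_band_total = sum(c.value for c in out_of_band if c.value is not None)
```

With the status held apart, the numeric field contains only numbers or nothing, so a careless sum cannot absorb a code.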

4.7. Removing Imputations

There are two main sets of longitudinal files that are being made available, with subsidiary, derivative datasets. The two are: NCES summary data with imputations and a dataset with imputations (I hope) removed. Imputations have been removed two ways:


4.8. Variables not changed in these compilations

There are variables that have been left as they were in the original compilations. They are hard to classify. CARNEGIE has identical codes for 2004 and 2002, but in 2000 there are changes, and I suspect there will be further changes going back in time, so I will leave that to the individual investigator. The link is to a page that will detail all the codes for this variable. DATASRC records the medium used to collect the data for the survey. In 2000, considerably more such methods were used, and this is just a historical fact.


4.9. Small year-to-year changes in definition

Collecting data is a dynamic process, one that will change each time the process begins. In 2004, seven new variables were added to the ALS collection reflecting changes in the way academic libraries operate. Other changes include dropping variables and tweaking definitions. Such changes are inevitable in a healthy data series and may affect analysis of these data. The ALS documentation is copious and generally clear so it behooves the serious analyst to be familiar with it. As has been said here, the documentation to this longitudinal series is an adjunct, not a replacement, to the ALS documentation.


4.10. Reused Variable Names

The variable name DOCDIGYN is reused:

DOCDIGYN
2004-2002: Documents digitized by the library staff
1 - Yes
2 - No
N - No response
2000: Access of electronic files other than the catalog from within the library
1 - Yes
2 - No
-1 - Not Reported

The same name is used to record two different variables. I must say this is a troubling discovery. I will change the variable name in the 2000 input program. In the four library data series from NCES that I have worked on, this is the first time I have seen this kind of thing happen.
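The renaming step might look like the sketch below. The replacement name DOCDIGYN_2000 is purely hypothetical, chosen here for illustration; the name actually used in the 2000 input program is not given on this page.

```python
def rename_variable(record, old_name, new_name):
    """Return a copy of a record with one variable renamed."""
    renamed = dict(record)
    if old_name in renamed:
        renamed[new_name] = renamed.pop(old_name)
    return renamed


# Made-up 2000 record; in 2000 DOCDIGYN records access to electronic
# files, not documents digitized by library staff.
record_2000 = {"DOCDIGYN": "1", "OBEREG": "5"}
fixed_2000 = rename_variable(record_2000, "DOCDIGYN", "DOCDIGYN_2000")
```

Renaming only on the 2000 input keeps the 2004-2002 meaning of DOCDIGYN intact in the merged file.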



October 29, 2007