Set definition by subject in library and information science entails two major statistical consequences. The first of these consequences relates to the concept of a statistical set and the interaction between subject fields as described by Bradford's and Garfield's laws. In his classic statistics textbook, Hays (1994, 973-74) places the concept of a set at the basis of all modern mathematics and probability, giving the following definition of a set: "Any well-defined collection of objects is a set" (bold in original). He then goes on to point out that the qualification "well-defined" means that "it must be possible, at least in principle, to specify the set so that one can decide whether any given object does or does not belong" (italics in original). To make things more complicated, Hays goes on to point out that the word "object" denotes not only an object in the usual sense but also a "phenomenon," "happening," or "logical possibility." For example, the fact that there are no females in the set of U.S. presidents might not mean that there are none in the set but simply that one has not yet "happened."
Due to the interaction of Bradford's and Garfield's laws, it is extremely difficult, if not impossible, to follow Hays' rules for set definition. The principle behind these laws is that subjects intermix, and the problem of subject intermixing is compounded, when one uses a library classification system to define subject sets, by the flaws inherent in such a system as described by Kelley (1937). Due to these factors, defining sets by subject in library and information science brings one face to face with the statistical problem of "outliers."
As defined by Barnett and Lewis (1984, 4), an outlier in a set of data is "an observation (or subset of observations) which appears to be inconsistent with the remainder of that set of data" (italics in original). As such, the appearance of outliers depends upon the logic underlying the definition of the set. In their literature review of outliers, Beckman and Cook (1983) describe outliers as a "subjective, post-data concept," and they divide them into two types: (1) "discordant observations": any observations that appear discordant or discrepant to the investigator, and (2) "contaminants": any observations that are not a realization from the target population. Given the operation of Bradford's and Garfield's laws, contaminants or observations foreign to the population under investigation are a common problem in library and information science, and it is often impossible to exclude them on a logical basis. When contaminants appear at the extreme end of a distribution, they can cause major difficulties in attempts to represent the population by grossly distorting the parameter estimates in some model of the population. Often the only alternative open to an investigator in library and information science is to do the test with and without the contaminants to determine their effects.
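The practice of running a test with and without suspected contaminants can be illustrated by a minimal sketch in Python. The data, the variable names (`holdings`, `uses`), and the helper functions `pearson_r` and `with_and_without` are hypothetical and chosen only for illustration; the Pearson correlation stands in for whatever test an investigator might actually be running.

```python
from statistics import mean


def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def with_and_without(xs, ys, suspect_indices):
    """Return the correlation on all observations and on the observations
    that remain after dropping the suspected contaminants."""
    keep = [i for i in range(len(xs)) if i not in suspect_indices]
    r_all = pearson_r(xs, ys)
    r_trimmed = pearson_r([xs[i] for i in keep], [ys[i] for i in keep])
    return r_all, r_trimmed


# Hypothetical data: five observations from the target population plus one
# extreme observation (index 5) suspected of being a contaminant.
holdings = [1, 2, 3, 4, 5, 40]
uses = [5, 4, 3, 2, 1, 40]

r_all, r_trim = with_and_without(holdings, uses, {5})
print(f"with suspect: {r_all:.3f}, without suspect: {r_trim:.3f}")
```

In this contrived case a single extreme observation is enough to reverse the sign of the correlation, which is the kind of gross distortion of parameter estimates described above. Comparing the two results does not show which one is correct; it only shows how much the conclusion depends on the suspect observation.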
The other major statistical consequence brought forward by subject set definition in library and information science relates to the differing levels of consensus in the various fields of human knowledge. This problem was most succinctly defined by Kuhn (1970) in his famous book, The Structure of Scientific Revolutions. In this book Kuhn advanced two closely interrelated concepts: scientific community and paradigm. Scientific community was described by him as follows (1970, 177):
Kuhn defined his concept of a paradigm in the following way (1970, 175):
Kuhn distinguished between disciplines having a paradigm and those in a preparadigmatic phase. A preparadigmatic discipline has no generally accepted theory and is split into several competing schools. For example, he considered it an open question whether the social sciences had yet acquired any paradigms at all and noted, "History suggests that the road to a firm research consensus is extraordinarily arduous" (1970, 15).
The two statistical consequences of subject set definition, namely contaminants and differing levels of consensus, have important implications for the analysis of the skewed distributions that dominate library and information science. Attention will now be turned to this analysis.
Skewed Distributions
Absence of the Normal Distribution in Library and Information Science
It is with great trepidation that mere practitioners of statistics undertake a discussion of probability distributions. This is a world where statisticians conduct dogfights in the mathematical stratosphere, and a ground observer in the trenches has extreme difficulty in deriving conclusions about the course of the combat from the formulaic contrails in the skies overhead. Yet it is a necessary exercise. Standard parametric statistical operations such as correlation and regression assume the so-called normal distribution, which is virtually absent in library and information science. In this respect, library and information science is like many areas of human knowledge, particularly in the biological and social sciences. The relatively infrequent occurrence of the normal distribution was noted by Geary (1947, 240-41), who attributed the use of it in statistics largely to its mathematical characteristics as well as its applicability predominantly in astronomy and games of chance, areas suitable for the mathematical model. However, as a result of its rarity, Geary advised that the following warning be printed in bold type in all statistics textbooks to make amends to future generations of students: "Normality is a myth; there never was, and never will be, a normal distribution."
Given this clash between statistical theory and much of reality, one must have some concept of the probability distribution underlying the data, so that it can be transformed mathematically into at least an approximation of the normal distribution in order to obtain correct results from standard statistical operations. As if this were not complicated enough, many sets of data in library and information science are what is known technically as "truncated on the left." This means that a group of observations, the so-called "zero class," should have been included in them but were not counted because they either did not happen or were excluded by the system of measurement. The zero class can be the source of enormous difficulties.
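Both points can be illustrated by a minimal sketch in Python. The counts are hypothetical, the logarithm is assumed here as one common choice of transformation (the appropriate one depends on the distribution actually underlying the data), and the size of the zero class, `n_zero`, is invented purely for illustration.

```python
import math
from statistics import mean


def skewness(values):
    """Moment coefficient of skewness: third central moment divided by
    the second central moment raised to the power 1.5."""
    m = mean(values)
    n = len(values)
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical counts (e.g., uses per title), truncated on the left:
# titles with zero uses were never recorded.
counts = [1, 1, 1, 2, 2, 3, 4, 6, 10, 25, 80]

# Assumed transformation: natural logarithm. It is defined only for
# positive values, so it cannot be applied directly to a zero class.
logged = [math.log(c) for c in counts]

raw_skew = skewness(counts)
log_skew = skewness(logged)
print(f"skewness before: {raw_skew:.2f}, after log: {log_skew:.2f}")

# Effect of the missing zero class on a simple parameter estimate.
n_zero = 9  # hypothetical number of unrecorded zero observations
mean_observed = mean(counts)
mean_with_zeros = mean([0] * n_zero + counts)
print(f"mean observed: {mean_observed:.2f}, "
      f"mean with zero class: {mean_with_zeros:.2f}")
```

With these invented figures the logarithm reduces the skewness considerably without removing it, and restoring the zero class lowers the mean substantially. The second comparison presupposes that the size of the zero class is known, which in practice is often exactly what is missing.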