LSU Libraries

Bibliometric Laws, Stochastic Processes, and the Biological Model: The Negative Binomial Distribution

Library and information science has been marked by a number of empirical, eponymous laws describing the skewed distributions inherent within it. Not only are there "Bradford's law of scattering" and "Garfield's law of concentration" described above, but there are also "Lotka's law of scientific productivity," later modified by Price (1986, 38–44, 222–23), on the distribution of authorship over scientists; "Zipf's law of word frequency" on the occurrence of words in a text; and "Trueswell's 80/20 rule" on library circulation. A major trend in library and information science literature has been to treat these laws as particular manifestations of more general statistical distributions and to develop stochastic models to represent them (Oluić-Vuković 1997).

In a series of papers worthy of being termed an intellectual tour de force, Bookstein (1990a; 1990b; 1995; 1997) compared these "informetric" laws to similar laws in the biological and social sciences, such as those of John Christopher Willis on the distribution of species and Vilfredo Pareto on the distribution of income. According to Bookstein, all these laws are similar in that they describe the distribution of the yield in a population of discrete entities over a time-like variable. He defined "yield" as a quantity, such as income or journal citations, that can be cumulated. In his view, the underlying similarity of these laws has been obscured by their differing subject content as well as by their different ways of describing the distribution of yields. Bookstein then subjected Bradford's law, the Leimkuhler variant of Bradford's law, Lotka's law, Zipf's law, and Pareto's law to rigorous mathematical analysis and concluded that all these distributions were "variants of a single distribution." He further found this distribution to be extremely robust and resilient to ambiguity, in that it was not sensitive to the time period or to the way the data are counted or conceptualized. Bookstein finished by locating this single informetric distribution in the family of compound Poisson distributions.

A workable candidate for the single informetric distribution posited by Bookstein appears to be the negative binomial distribution (NBD). Although Bookstein did not endorse the distribution, he did indicate that the NBD has been successfully applied to many problems in the information sciences (Bookstein 1997, 8). An interesting feature of the NBD is its malleability, i.e., its capability of being shaped into other probability distributions by the adjustment of its parameters. In the biological sciences, the NBD is usually presented in conjunction with the binomial and Poisson distributions (Elliot 1977, 14–66; Williams 1964, 15–16; Bliss 1953, 176–77). Here it serves to model concentration, in contrast to the binomial (which models uniformity) and the Poisson (which models randomness). The probability series of the binomial is given by the expansion of $(p+q)^{k}$, where p and q are the chances of two alternative happenings in k repetitions. Its defining characteristic is that the variance is less than the mean. The NBD is the mathematical counterpart of the binomial, and therefore the probability series of the NBD is given by the expansion of $(q-p)^{-k}$.
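The expansion of $(q-p)^{-k}$ can be sketched numerically. The following is a minimal illustration, not a formula taken from the sources cited above: it assumes the common parameterization in which p equals the mean divided by k and q = 1 + p, so that the terms of the expansion sum to one. The function names and the example values (mean 3, k = 0.5) are hypothetical.

```python
import math


def nbd_pmf(x, mean, k):
    """P(X = x) for an NBD with the given mean and exponent k.

    Assumed parameterization: p = mean / k and q = 1 + p, so the term for x
    in the expansion of (q - p)^(-k) is
        Gamma(k + x) / (x! * Gamma(k)) * (p / q)^x * q^(-k).
    """
    p = mean / k
    q = 1.0 + p
    # Work in logs so that large x and non-integer k stay numerically stable.
    log_coeff = math.lgamma(k + x) - math.lgamma(k) - math.lgamma(x + 1)
    return math.exp(log_coeff + x * math.log(p / q) - k * math.log(q))


def moments(probabilities):
    """Return (total probability, mean, variance) of a tabulated distribution."""
    total = sum(probabilities)
    mean = sum(x * px for x, px in enumerate(probabilities))
    variance = sum((x - mean) ** 2 * px for x, px in enumerate(probabilities))
    return total, mean, variance


# Hypothetical example: mean 3 with a small exponent k (strong concentration).
probs = [nbd_pmf(x, mean=3.0, k=0.5) for x in range(2000)]
total, mean, variance = moments(probs)
print(f"total={total:.6f} mean={mean:.4f} variance={variance:.4f}")
```

Under this parameterization the variance works out to mean + mean²/k, so it always exceeds the mean, and the excess grows as k shrinks.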

The defining characteristic of the NBD is that the variance is greater than the mean, and it has two parameters, the arithmetic mean and the exponent k. However, unlike in the binomial, k does not measure the number of repetitions but the degree of concentration. As k approaches infinity, the NBD converges to the Poisson, whose defining characteristic is that the variance equals the mean. On the other side, as k approaches 0, the NBD converges to the logarithmic series, which models superconcentration. The geometric distribution is a particular case of the NBD with k=1 (Cooper and Weekes 1983, 137; Haight 1978, 158). However, perhaps the most useful feature of the NBD is that it can be converted into the normal distribution for standard parametric statistical operations by a series of logarithmic transformations whose form depends upon the size of the exponent k and whether the data contain zero counts (Elliot 1977, 30–36). In the study utilizing survey data gathered by the 1993 pilot project with the LSU Department of Chemistry, it was found that all the quantitative variables (faculty ratings, total citations, impact factor, source items, journal age, library holdings, and price) satisfied the basic NBD criterion of overdispersion, i.e., the variances significantly exceeded the means (Bensman 1996, 154–56).
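The two limiting cases named above, the Poisson for very large k and the geometric for k = 1, can be checked numerically. This is a rough sketch under the assumed parameterization p = mean / k and q = 1 + p; the function names, the mean of 3, and the grid of k values are all hypothetical choices made for illustration.

```python
import math


def nbd_pmf(x, mean, k):
    """NBD probability of x, assuming p = mean / k and q = 1 + p."""
    p = mean / k
    log_coeff = math.lgamma(k + x) - math.lgamma(k) - math.lgamma(x + 1)
    log_q = math.log1p(p)  # log(q) with q = 1 + p, accurate for tiny p
    return math.exp(log_coeff + x * (math.log(p) - log_q) - k * log_q)


def poisson_pmf(x, mean):
    """Poisson probability of x: variance equals the mean."""
    return math.exp(x * math.log(mean) - mean - math.lgamma(x + 1))


def geometric_pmf(x, mean):
    """Geometric probability of x failures before the first success."""
    success = 1.0 / (1.0 + mean)
    return success * (1.0 - success) ** x


def largest_gap(pmf_a, pmf_b, upper=50):
    """Largest absolute difference between two pmfs over 0..upper-1."""
    return max(abs(pmf_a(x) - pmf_b(x)) for x in range(upper))


MEAN = 3.0  # hypothetical mean used throughout
# Gap between the NBD and the Poisson for increasing k.
poisson_gaps = [
    largest_gap(lambda x, k=k: nbd_pmf(x, MEAN, k), lambda x: poisson_pmf(x, MEAN))
    for k in (1.0, 10.0, 100.0, 1e6)
]
# Gap between the NBD with k = 1 and the geometric distribution.
geometric_gap = largest_gap(
    lambda x: nbd_pmf(x, MEAN, 1.0), lambda x: geometric_pmf(x, MEAN)
)
print(poisson_gaps, geometric_gap)
```

The gap to the Poisson should become negligible at the largest k, while the k = 1 case should agree with the geometric distribution up to floating-point rounding.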

The NBD satisfies one of the major conditions that Bookstein (1990a, 369) posited to account for the robustness of his single informetric distribution, i.e., that it be the consequence of a wide variety of underlying models. In a review of the chance mechanisms causing the NBD, Boswell and Patil (1970) described no fewer than 12 stochastic models that lead to the full NBD plus two more leading to its zero-truncated form. This multitude of causal processes is probably behind its apparent ubiquity. However, of all these models, two have proven to be the most influential: the compound gamma-Poisson model and the Polya-Eggenberger model derived from the Polya urn scheme.

The first can perhaps be simply presented in the following way. A Poisson distribution arises from counts of random occurrences happening over time or space at a given rate in a population, and a compound Poisson distribution arises when there is a mixed population of different elements, each having a different rate of occurrence, with the rates distributed according to some function. If that function is the gamma distribution, the model is called gamma-Poisson. In contrast, the Polya-Eggenberger model is derived by drawing balls of two different colors from an urn. As the balls are drawn, they are not only replaced, but new balls of the same color are added. In this way, numerous drawings of balls of one color greatly increase the probability of that color being drawn.
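The gamma-Poisson mechanism can be sketched as a small simulation: each element of a mixed population receives its own rate drawn from a gamma distribution, and its count is then a Poisson draw at that rate. This is an illustrative sketch only. The shape k, the scale mean / k, the sample size, and all function names are assumptions, and the Poisson draw uses Knuth's multiplication method because the Python standard library has no Poisson sampler.

```python
import math
import random
import statistics


def poisson_draw(rate, rng):
    """One Poisson count via Knuth's multiplication method (fine for modest rates)."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def gamma_poisson_sample(n, mean, k, seed=0):
    """Counts for n elements whose individual rates are gamma distributed.

    Assumption: gamma shape k and scale mean / k, so the rates average `mean`.
    """
    rng = random.Random(seed)
    scale = mean / k
    return [poisson_draw(rng.gammavariate(k, scale), rng) for _ in range(n)]


def nbd_zero_probability(mean, k):
    """P(X = 0) for the NBD under p = mean / k, q = 1 + p, i.e. q^(-k)."""
    return (1.0 + mean / k) ** (-k)


counts = gamma_poisson_sample(20000, mean=3.0, k=0.5)
sample_mean = statistics.fmean(counts)
sample_variance = statistics.variance(counts)
zero_share = counts.count(0) / len(counts)
print(sample_mean, sample_variance, zero_share, nbd_zero_probability(3.0, 0.5))
```

Every individual count here is a plain Poisson draw, independent of all the others. The excess of variance over mean in the pooled sample comes entirely from the spread of rates across the population.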

The conceptual interest of the negative binomial distribution for library and information science lies in the conundrum posed by Feller (1943) about apparent contagion and true contagion with respect to these two models. As Feller pointed out, the Poisson distribution describes mutually independent occurrences that have no influence on each other. Because of this feature, the compound Poisson distribution arises solely as a result of the inhomogeneity of the population. With the Polya-Eggenberger urn model, by contrast, the occurrence of an event increases the likelihood of its happening again. Describing the first model as apparent contagion and the second as true contagion, Feller pointed out that because both models lead to the same result, it is impossible to know which process is taking place if the data conform to the NBD.
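True contagion can be illustrated with a small urn simulation. The sketch below is hypothetical in all its specifics (urn contents, number of draws, number of urns, function names). It compares identical urns with and without reinforcement, so any overdispersion in the reinforced case cannot come from heterogeneity. A finite urn of this kind yields Polya-Eggenberger counts; the NBD is usually obtained as a limiting form, which this sketch does not attempt to reproduce.

```python
import random
import statistics


def polya_urn_count(draws, red, black, added, rng):
    """Number of red balls drawn from one urn.

    After each draw the ball is replaced and `added` extra balls of the same
    color are put in. With added=0 this is ordinary sampling with replacement.
    """
    reds_drawn = 0
    for _ in range(draws):
        if rng.random() < red / (red + black):
            reds_drawn += 1
            red += added
        else:
            black += added
    return reds_drawn


rng = random.Random(1)
# Every urn starts identically (1 red, 9 black); only the reinforcement differs.
contagious = [polya_urn_count(50, 1, 9, 1, rng) for _ in range(5000)]
independent = [polya_urn_count(50, 1, 9, 0, rng) for _ in range(5000)]

for label, counts in (("contagious", contagious), ("independent", independent)):
    print(label, statistics.fmean(counts), statistics.variance(counts))
```

With reinforcement the variance of the red counts exceeds their mean, whereas without it the counts are binomial and the variance stays below the mean. Set beside a heterogeneous population of independent Poisson processes, which is also overdispersed, this shows why an overdispersed sample by itself does not reveal which mechanism produced it.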


Copyright © 1997-2009 LSU Libraries
URL: http://www.lib.lsu.edu/collserv/lrts/ST8.html