CSU Hayward Statistics Department
Derivation and Applications of the Poisson Distribution
Preliminaries
Because our derivation of the Poisson distribution involves taking limits, it is best to put some standard results about limits on record as we begin. All limits in this section are taken as n approaches infinity. We begin with limiting expressions that involve the fundamental mathematical constant e = 2.71828..., sometimes referred to as the base of natural logarithms.
- For any constant c, it is clear that $\lim (1 + 1/n)^c = 1$. As the term 1/n shrinks to zero, the quantity (1 + 1/n) approaches 1, and so the limit is of the type $1^c$.
- For any positive constant b, it is clear that $\lim (1 + 1/b)^n$ is infinite. The quantity (1 + 1/b) is greater than 1, so increasingly large powers of it increase without bound.
- What then is the value of $\lim (1 + 1/n)^n$? Intuitively from the above, it should be between 1 and infinity. In fact, it can be shown that this limit is the irrational number e = 2.71828... (the base of natural logarithms).
- It turns out that for any constant c, $\lim (1 + c/n)^n = e^c$. This result is customarily proved in calculus courses.
The following table, generated by Minitab from illustrative numbers put into the first column of a worksheet, shows how the limiting process proceeds. First, look at the column headed c = 1. As n increases, $(1 + 1/n)^n$ increases also, approaching, but never exceeding, e. If c = 3, the limit is $e^3 = 20.0855$; if c = -1, the limit is $e^{-1} = 1/e = 0.367879$.
MTB > set c1
DATA> 1 2 5 10 100 1000 10000
DATA> end
MTB > let c2 = (1+1/c1)**c1
MTB > let c3 = (1+3/c1)**c1
MTB > let c4 = (1-1/c1)**c1
MTB > print c1-c4
Row      n      c=1      c=3      c=-1
  1      1  2.00000   4.0000  0.000000
  2      2  2.25000   6.2500  0.250000
  3      5  2.48832  10.4858  0.327680
  4     10  2.59374  13.7858  0.348678
  5    100  2.70481  19.9955  0.366032
  6   1000  2.71692  20.0765  0.367695
  7  10000  2.71815  20.0846  0.367861
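For readers without Minitab, the same table can be sketched in Python (a choice made here for convenience, not part of the original worksheet). The function names `compound_power` and `limit_table` are illustrative only.

```python
import math


def compound_power(c, n):
    """Return (1 + c/n)**n, the quantity whose limit is e**c."""
    return (1 + c / n) ** n


def limit_table(c_values=(1, 3, -1), n_values=(1, 2, 5, 10, 100, 1000, 10000)):
    """Build one row per n: (n, value for each c), mirroring the printout."""
    return [(n, *[compound_power(c, n) for c in c_values]) for n in n_values]


if __name__ == "__main__":
    for row in limit_table():
        print(row[0], *(f"{value:.6f}" for value in row[1:]))
    # The limiting values e**c, for comparison with the last row.
    print("limit", *(f"{math.exp(c):.6f}" for c in (1, 3, -1)))
```

Adding larger values of n to `n_values` shows each column moving closer to its limit.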
Next we deal with a limit involving the binomial coefficient C(n, k), the "combinations of n things taken k at a time."
\[
C(n, k) = \frac{n!}{k!\,(n-k)!} = \frac{n(n-1)(n-2)\cdots(n-k+1)}{k!},
\]
where there are k factors in the numerator of the right-hand expression. From this it is easy to see that, as n approaches infinity,
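\[
\begin{aligned}
\lim \frac{C(n, k)}{n^k}
  &= \lim \frac{1}{k!}\,\frac{n(n-1)(n-2)\cdots(n-k+1)}{n^k} \\
  &= \frac{1}{k!} \lim \left[\frac{n}{n}\right]\left[\frac{n-1}{n}\right]\left[\frac{n-2}{n}\right]\cdots\left[\frac{n-k+1}{n}\right] \\
  &= \frac{1}{k!} \lim \left[1 - \frac{1}{n}\right]\left[1 - \frac{2}{n}\right]\cdots\left[1 - \frac{k-1}{n}\right] \\
  &= \frac{1}{k!},
\end{aligned}
\]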
because each of the $k - 1$ bracketed factors in the next-to-last expression approaches 1. (Notice that k is a fixed number as n becomes larger.)
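A quick numerical check of this limit can be sketched in Python (used here for convenience; the helper name `scaled_binomial_coefficient` is illustrative). It relies only on the standard-library function `math.comb`.

```python
import math


def scaled_binomial_coefficient(n, k):
    """Return C(n, k) / n**k, which should approach 1/k! as n grows."""
    return math.comb(n, k) / n ** k


if __name__ == "__main__":
    k = 4  # k stays fixed while n increases
    for n in (10, 100, 1000, 10**6):
        print(n, scaled_binomial_coefficient(n, k))
    print("1/k! =", 1 / math.factorial(k))
```

For fixed k the printed ratios increase toward 1/k! from below, because every bracketed factor is slightly less than 1.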
An Approximately Binomial Model
Suppose that a radioactive source of very long half-life emits particles into a Geiger counter at an average rate of 3 per second. The number X of particles actually seen in any particular one-second interval is a random variable. The average number seen is 3, but the actual number will frequently be 2 or 4, and any of the values 0, 1, 2, 3, 4, 5, ... is possible. We can use the binomial distribution to approximate the distribution of the random variable X.
In order to construct the binomial approximation, let a one-second interval of time be divided into 100 consecutive intervals of length 0.01 sec. each. Consider each of these small intervals as a binomial trial. Because n = 100 and E(X) = np = 3, we conclude that we must have p = 0.03. This construction requires that particles arrive in the small intervals independently, that we regard a small interval as a "success" if it contains a particle, and that the probability P(Success) is the same for each small interval, namely P(Success) = p = 0.03.
Thus, the approximate distribution of X is given by the expression
\[ P(X = k) = C(100, k)\,(0.03)^k\,(0.97)^{100-k}. \]
This expression is only approximate because it does not take into account the possibility that there might be two or more particles in one of the small intervals. A "double hit" in a small interval is very unlikely because the probability of a single hit is about 0.03 and so, by independence, the probability of a double hit in any one small interval should be something like $(0.03)^2 = 0.0009$. Three or more hits in a small interval are even less likely.
Thus there are two reasons that the above expression for P(X = k) is somewhat unsatisfactory.
- The computation is tedious (at least without a computer).
- There is a small, but nonetheless positive probability of multiple hits. Multiple hits would violate the binomial model (which does not make provision for "super-successes") and thus lead to some small degree of inaccuracy.
A Minitab printout of this approximating binomial distribution is shown in the column headed n = 100 below. (The next column is explained just below the printout. The last column refers to the Poisson distribution derived in the next section.)
MTB > set c11
DATA> 0:16
DATA> end
MTB > pdf c11 c12;
SUBC> bino 100 .03.
MTB > pdf c11 c13;
SUBC> bino 1000 .003.
MTB > pdf c11 c14;
SUBC> pois 3.
MTB > print c11-c14
Row   k     n=100    n=1000   Pois(3)
  1   0  0.047553  0.049563  0.049787
  2   1  0.147070  0.149137  0.149361
  3   2  0.225153  0.224154  0.224042
  4   3  0.227474  0.224379  0.224042
  5   4  0.170606  0.168284  0.168031
  6   5  0.101308  0.100869  0.100819
  7   6  0.049610  0.050333  0.050409
  8   7  0.020604  0.021507  0.021604
  9   8  0.007408  0.008033  0.008102
 10   9  0.002342  0.002664  0.002701
 11  10  0.000659  0.000794  0.000810
 12  11  0.000167  0.000215  0.000221
 13  12  0.000038  0.000053  0.000055
 14  13  0.000008  0.000012  0.000013
 15  14  0.000002  0.000003  0.000003
 16  15  0.000000  0.000001  0.000001
 17  16  0.000000  0.000000  0.000000
We could minimize the inaccuracy due to multiple hits by making the small intervals more numerous and the probability of success in any one of them correspondingly smaller. For example, we could deal with 1000 intervals of a millisecond each to obtain
\[ P(X = k) = C(1000, k)\,(0.003)^k\,(0.997)^{1000-k}. \]
The resulting computation is even more tedious, but the possibility of multiple hits is now truly minuscule. The results are given in the column headed n = 1000 of the printout above.
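The two binomial columns of the printout can also be reproduced without Minitab. The following is a minimal Python sketch (Python is a choice made here; `binomial_pmf` and `binomial_column` are illustrative names), which takes p = 3/n so that np stays equal to the average rate of 3 per second.

```python
import math


def binomial_pmf(k, n, p):
    """P(X = k) for n independent trials, each with success probability p."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)


def binomial_column(n, mean=3.0, k_max=16):
    """P(X = k) for k = 0..k_max when one second is cut into n small intervals."""
    p = mean / n  # chosen so that E(X) = n * p equals the average rate
    return [binomial_pmf(k, n, p) for k in range(k_max + 1)]


if __name__ == "__main__":
    coarse = binomial_column(100)   # intervals of 0.01 sec.
    fine = binomial_column(1000)    # intervals of 0.001 sec.
    for k, (a, b) in enumerate(zip(coarse, fine)):
        print(f"{k:2d}  {a:.6f}  {b:.6f}")
```

Calling `binomial_column` with still larger n shows how little the probabilities change once the small intervals are short enough.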
An Exact Probability Model
Suppose now that the average number of particles arriving in an interval of length 1 sec. is λ, so that we seek the distribution of a random variable X with E(X) = λ. If we let the small intervals of the previous section get ever smaller and more numerous, we are talking about binomial distributions with n trials and P(Success) = λ/n. Taking the limit as n approaches infinity, we have
\[
\begin{aligned}
P(X = k)
  &= \lim C(n, k)\left(\frac{\lambda}{n}\right)^{k}\left(1 - \frac{\lambda}{n}\right)^{n-k} \\
  &= \lambda^{k} \lim \left[\frac{C(n, k)}{n^{k}}\right]\left[1 + \frac{(-\lambda)}{n}\right]^{n}\left[1 - \frac{\lambda}{n}\right]^{-k}.
\end{aligned}
\]
We have shown above that the first bracketed factor approaches 1/k!, that the second approaches $e^{-\lambda}$, and that the third approaches 1.
Thus,
\[ P(X = k) = e^{-\lambda}\lambda^{k} / k!, \quad k = 0, 1, 2, 3, \ldots . \]
This is the probability density function of the Poisson distribution. Because we derived the Poisson distribution as a limit of binomials all with mean λ, it is not surprising that λ is the mean of the Poisson distribution as well. It can be shown that the variance of the Poisson distribution is also numerically equal to λ. Thus, a Poisson random variable with λ = 4 counts per minute will have a standard deviation of 2 counts per minute.
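The claims about the mean and variance can be checked numerically. The sketch below is in Python (a choice made here), with illustrative function names; it truncates the infinite sums at `k_max`, an assumption that is harmless for moderate λ because the omitted terms are negligible.

```python
import math


def poisson_pmf(k, lam):
    """P(X = k) = exp(-lam) * lam**k / k! for a Poisson mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def poisson_moments(lam, k_max=100):
    """Total probability, mean, and variance, summing terms k = 0..k_max."""
    probs = [poisson_pmf(k, lam) for k in range(k_max + 1)]
    total = sum(probs)
    mean = sum(k * p for k, p in enumerate(probs))
    variance = sum((k - mean) ** 2 * p for k, p in enumerate(probs))
    return total, mean, variance


if __name__ == "__main__":
    total, mean, variance = poisson_moments(4.0)
    print(total, mean, variance, math.sqrt(variance))
```

With λ = 4 the printed mean and variance should both be very close to 4, and the standard deviation very close to 2, in line with the example in the text.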
Siméon Poisson was a French mathematician of the 19th century who developed this distribution. (Poisson is the French word for fish; it is pronounced something like PWAH-ssohn, and nothing at all like the English word poison. You probably don't say FROOD for Freud or BATCH for Bach, so it's only fair not to say POY-sson for Poisson.)
Applications of the Poisson Distribution
An extraordinarily large number of natural and social phenomena have been successfully modeled using the Poisson distribution.
- The "domain" in which counts are observed can be an interval of time, as for the radioactive counts mentioned above or for cases arriving at the emergency room of a hospital during a one-hour period in mid-afternoon.
- The domain can be a volume. In the volume represented by a beaker containing cells in suspension, the number of cells that divide in a particular unit of time may be modeled with the Poisson distribution. Also, the number of red giant stars in a volume of interstellar space has been shown to be Poisson distributed.
- The domain can be linear. The number of defects in a length of wire and the number of armadillos killed by traffic on a length of an Arizona highway have both been shown to have Poisson distributions.
- The domain can be an area. Bomb hits in acre tracts of metropolitan London during World War II, the number of pollen grains collected in regions of a sticky plate exposed to the open air, and (under the right conditions) the number of bird nests in tracts of San Francisco Bay marshes have all been successfully modeled as Poisson.
The conditions in each case are the same.
- The rate at which particles occur over the domain must be constant throughout.
  - If the half-life of the radioactive sample is so short that the rate of decay changes noticeably during one second, this condition is violated. Likewise, it would not do to apply the same Poisson distribution to emergency arrivals for an hour in the middle of a Sunday afternoon as to the middle of a Saturday night, if the former time period has a reputation in the community for being safe and peaceful and the latter for being somewhat violent and dangerous.
  - It would not do to extend the area in which bomb hits were counted too far into rural areas around London where the intensity of bombing was systematically less severe.
- The particles must arrive independently of one another.
  - The armadillos must cross the highway at random places, not in "herds" at instinctively programmed "armadillo crossings."
  - The emergency room arrivals must not be due to a common cause such as an explosion, hurricane, or earthquake.
  - The birds building the nests must be neither social (building nests in clumps) nor territorial (spacing the nests evenly over the available space, instead of randomly).
  - The pollen grains must not be "magnetic," neither attracting nor repelling one another.
As with any probability model, the application of the Poisson family of distributions to any particular situation may not be perfect (except perhaps for radioactive decay of stable samples), but the situations that can be satisfactorily modeled by Poisson distributions are extraordinarily many and varied.
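The two conditions can be illustrated with a small simulation, sketched here in Python under stated assumptions: 3000 "particles" are dropped independently and uniformly onto 1000 unit cells, so the rate is constant (3 per cell on average) and arrivals do not influence one another. The function names, the seed, and these particular sizes are illustrative choices, and because the total is fixed the counts are only approximately Poisson.

```python
import random
from collections import Counter


def simulate_cell_counts(n_points=3000, n_cells=1000, seed=1):
    """Drop n_points uniformly and independently on n_cells; return counts per cell."""
    rng = random.Random(seed)
    hits = Counter(rng.randrange(n_cells) for _ in range(n_points))
    return [hits.get(cell, 0) for cell in range(n_cells)]


def mean_and_variance(counts):
    """Sample mean and (population-style) variance of a list of counts."""
    n = len(counts)
    mean = sum(counts) / n
    variance = sum((c - mean) ** 2 for c in counts) / n
    return mean, variance


if __name__ == "__main__":
    counts = simulate_cell_counts()
    print(mean_and_variance(counts))
    # How many cells received 0, 1, 2, ... particles.
    print(sorted(Counter(counts).items()))
```

If the counts behave like Poisson counts, the sample variance should come out close to the sample mean of 3. Clumped or evenly spaced arrivals would push the variance well above or below the mean.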
Problems
1. In a one-second period of time, a radioactive source emits a random number X of particles into a counter, where X has a Poisson distribution with mean 3.
- What is the probability that no particles are actually seen?
- What is the probability that fewer than 3 particles are actually seen? [Answer: Sum three probabilities of the appropriate Poisson distribution, including the result just above, to obtain P(X < 3) = P(X ≤ 2) = 0.4232.]
2. Consider the radioactive source in Problem 1. Suppose that we are interested in the number of particles counted in a two-second time period.
- What is the probability that no particles are actually seen? [Answer: For a two-second interval the mean will be 6, so that the required probability is 0.0025.]
- What is the probability that fewer than 3 particles are actually seen?
3. A particular batch of steel wire has flaws occurring along the length of the wire at random locations according to a Poisson distribution at a rate of 2 flaws per meter.
- A prospective purchaser takes five sample sections of wire at random from the batch, each of them 10 cm long, and inspects them for flaws. The shipment will be rejected if any of the five sections has any flaws. What is the probability that the shipment will be rejected? [Answer: $1 - e^{-1} = 0.6321$. This answer can be obtained by two somewhat different arguments: first, by finding the average rate of flaws in a sample of total length 50 cm; second, by finding the average rate of flaws in one section of length 10 cm and then arguing that the 5 sections are independent.]
- What is the minimum number of 10 cm sections that needs to be inspected so that the probability of rejecting this batch exceeds 90%?
4. One hundred specimens of volume 0.1 ml each are taken from a liquid suspension containing bacteria of a particular kind. Each specimen is spread onto nutrient gel in a different culture dish. Two days later we find that cultures grew in 60 of the dishes. Assume that a culture grows if, and only if, the specimen happened to have one or more bacteria in it.
- Give an estimate of the number of bacteria per ml of the suspension. [Answer: $-10 \ln 0.4 = 9.16$, where "ln" denotes a logarithm to the base e.]
- If specimens of volume 0.5 ml are used in the next assay, then in how many out of 100 culture dishes would you expect to see cultures? [Answer: About 99.]
5. Let Y be a Poisson random variable with E(Y) = 50.
- What is the standard deviation σ of this random variable?
- For many distributions, about 95% of the probability is contained within an interval that extends 2 standard deviations on either side of the mean. This is called the "Empirical Rule"; it is a rough rule of thumb, not a mathematical theorem. Which specific integer values y in the distribution of Y lie in this interval? What is the exact probability that Y falls in this interval? Use tables of the Poisson distribution, if available. Alternatively, use Minitab (command cdf; and subcommand pois 50.) to find this probability. [Answer: 0.960.]
- Does the Empirical Rule work well for a Poisson distribution with λ = 10? With λ = 2? [Answer: You should find the exact probabilities. Generally, the Empirical Rule works best for symmetrical distributions, and the Poisson becomes more nearly symmetrical for large values of λ.]
6. The number of "permutations (ordered samples) of n things taken k at a time" is $P(n, k) = n! / (n - k)!$. Show that $P(n, k)/n^k$ approaches 1 as n approaches infinity (for fixed k).
7. Try Quiz Question 7 on this site.
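The bracketed answers to Problems 1 through 4 can be checked with a few lines of Python (a choice made here; tables or Minitab serve equally well). The helper names and dictionary keys are illustrative, and only answers already stated above are computed.

```python
import math


def poisson_pmf(k, lam):
    """P(X = k) for a Poisson random variable with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def poisson_cdf(k, lam):
    """P(X <= k), obtained by summing the probabilities for 0..k."""
    return sum(poisson_pmf(j, lam) for j in range(k + 1))


answers = {
    "1b": poisson_cdf(2, 3),        # fewer than 3 particles, mean 3
    "2a": poisson_pmf(0, 6),        # no particles in two seconds, mean 6
    "3a": 1 - math.exp(-1),         # at least one flaw in 50 cm, mean 1
    "4a": -10 * math.log(0.4),      # bacteria per ml from 40% empty dishes
    "4b": 100 * (1 - 0.4 ** 5),     # expected dishes with growth at 0.5 ml
}

if __name__ == "__main__":
    for label, value in answers.items():
        print(label, round(value, 4))
```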
Copyright © 1999 by Bruce E. Trumbo. All rights reserved. Intended for instructional use at California State University, Hayward. Please request permission in advance for other uses: btrumbo@csuhayward.edu.
BT/CF: Posted 11/13/1998, Last revised (mainly to include problems) 07/08/99