Annex № 1 to the Order № February 15- "Rules of Carriage of Passengers, Baggage and Cargo" "Azerbaijan Airlines" Closed Joint Stock Company Table of contents 1. General provisions Terms, Definitions and abbreviations Compliance with laws and requirements of state authorities Electronic Ticket
UntitledIEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 56, NO. 9, SEPTEMBER 2010 Thinning, Entropy, and the Law of Thin Numbers Peter Harremoës, Member, IEEE, Oliver Johnson, and Ioannis Kontoyiannis, Senior Member, IEEE Abstract—Rényi's thinning operation on a discrete random vari-
where the random variables are independent and identi- able is a natural discrete analog of the scaling operation for contin-
cally distributed (i.i.d.) each with a Bernoulli distribution with uous random variables. The properties of thinning are investigated
parameter , denoted , and also independent of in an information-theoretic context, especially in connection with
information-theoretic inequalities related to Poisson approxima-
usual, we take the empty sum to be equal to zero.] An tion results. The classical Binomial-to-Poisson convergence (some-
explicit representation of times referred to as the "law of small numbers") is seen to be a
special case of a thinning limit theorem for convolutions of discrete
distributions. A rate of convergence is provided for this limit, and
nonasymptotic bounds are also established. This development par-
allels, in part, the development of Gaussian inequalities leading
When it causes no ambiguity, the thinned distribution to the information-theoretic version of the central limit theorem.
In particular, a "thinning Markov chain" is introduced, and it is
shown to play a role analogous to that of the Ornstein-Uhlenbeck
For any random variable with distribution process in connection to the entropy power inequality.
-fold convolution of with itself, i.e., the distribution of the sum of . For example, if Index Terms—Binomial distribution, compound Poisson distri-
bution, entropy, information divergence, law of small numbers, law
, the binomial distribution of thin numbers, Poisson distribution, Poisson-Charlier polyno-
and . It is easy to see that its ; see Example 2.2. Therefore, the classical Binomial-to-Poisson convergence result—some- times referred to as the "law of small numbers"—can be phrased A PPROXIMATING the distribution of a sum of weakly as saying that, if
dependent discrete random variables by a Poisson distri- bution is an important and well-studied problem in probability;see  and the references therein for an extensive account.
denotes the Poisson distribution with parameter Strong connections between these results and information-the- oretic techniques were established , . In particular, for One of the main points of this paper is to show that this result the special case of approximating a binomial distribution by holds for very wide class of distributions , and to provide con- a Poisson, some of the sharpest results to date are established ditions under which several stronger and more general versions using a combination of the techniques , , and Pinsker's of (3) can be obtained. We refer to results of the form (3) as laws inequality , , . Earlier work on information-theoretic of thin numbers.
bounds for Poisson approximation is reported in , , .
Section II contains numerous examples that illustrate how The thinning operation, which we define next, was introduced particular families of random variables behave on thinning, by Rényi in , who used it to provide an alternative charac- and it also introduces some of the particular classes of random terization of Poisson measures.
variables that will be considered in the rest of the paper. In Definition 1.1: Given and a discrete random vari- Sections III and IV several versions of the law of thin numbers with distribution , the -thinning are formulated; first for i.i.d. random variables in Section III, is the distribution and then for general classes of (not necessarily independentor identically distributed) random variables in Section IV. Forexample, in the simplest case where and with mean , so that the distribution Manuscript received June 03, 2009; revised April 22, 2010. Date of current version August 18, 2010. The work of P. Harremoës was supported by a grantfrom the Danish Natural Science Research Council and by the European Pascal Network of Excellence. The work of I. Kontoyiannis was supported in part bya Marie Curie International Outgoing Fellowship, PIOF-GA-2009-235837. Thematerial in this paper was presented in part at the 2007 IEEE International Sym- , where, as usual, posium on Information Theory, Nice, France, June 2007.
notes the information divergence, or relative entropy, from P. Harremoës is with Copenhagen Business College, 1175 Copenhagen, Den- mark (e-mail: [email protected]).
O. Johnson is with the Department of Mathematics, University of Bristol, Bristol, BS8 1TW, United Kingdom (e-mail: [email protected]).
I. Kontoyiannis is with the Department of Informatics, Athens University of Economics and Business, Athens 10434, Greece (e-mail: [email protected]).
Communicated by E. Ordentlich, Associate Editor for Source Coding.
1Throughout the paper, log denotes the natural logarithm to base e, and we Digital Object Identifier 10.1109/TIT.2010.2053893 adopt the usual convention that 0 log 0 = 0.
0018-9448/$26.00 2010 IEEE HARREMOËS et al.: THINNING, ENTROPY, AND THE LAW OF THIN NUMBERS Note that, unlike most classical Poisson convergence results, the ically, we employ the scaled Fisher information functional in- law of thin numbers in (4) proves a Poisson limit theorem for the troduced in  to give precise, explicit bounds on the diver- sum of a single sequence of random variables, rather than for a . An example of the type of result triangular array.
we prove is the following: Suppose is an ultra bounded (see It may be illuminating to compare the result (4) with the in- Section II, Definition 2.1) random variable, with distribution formation-theoretic version of the central limit theorem (CLT); mean , and finite variance see, e.g.,  and . Suppose are i.i.d. continuous random variables with density , and with zero mean and unit variance. Then the density of their sum , is the -fold convolution with itself. Write for a nonzero constant we explicitly identify; cf. Corollary 6.1.
for the standard scaling operation in the CLT regime: If a Similarly, in Section VIII we give both finite- continuous random variable has density , then totic bounds on the law of small numbers in terms of the total density of the scaled random variable , and, in particular, variation distance, the density of the standardized sum distribution. In particular, Theorem 8.1 states that formation-theoretic CLT states that, if and finite variance is the standard Normal density. Note the close analogy [Corresponding lower bounds will be presented in the com- between the statements of the law of thin numbers in (4) and the panion paper .] A closer examination of the monotonicity properties of the Before describing the rest of our results, we mention that scaled Fisher information in relation to the thinning operation is there is a significant thread in the literature on thinning limit described in Section VII. Finally, Section IX shows how the idea theorems and associated results for point processes. Conver- of thinning can be extended to compound Poisson distributions.
gence theorems of the "law of thin numbers" type, as in (3) TheAppendix contains the proofs of some of the more technical and (4), were first examined in the context of queueing theory by Palm  and Khinchin , while more general results Finally we mention that, after the announcement of (weaker were established by Grigelionis . See the discussion in the and somewhat more specialized versions of) the present results text, [12, pp. 146–166], for details and historical remarks; also in , Yu  also obtained some interesting, related results. In see the comments following Section IV, Theorem 4.1. More particular, he showed that the conditions of the strong and ther- specifically, this line of work considered asymptotic results, pri- modynamic versions of the law of thin numbers (see Theorems marily in the sense of weak convergence, for the distribution 3.3 and 3.2) can be weakened, and he also provided conditions of a superposition of the sample paths of independent (or ap- under which the convergence in these limit theorems is mono- propriately weakly dependent) point processes. Here we take a different direction and, instead of considering the full infi-nite-dimensional distribution of a point process, we focus on II. EXAMPLES OF THINNING AND DISTRIBUTION CLASSES finer results—e.g., convergence in information divergence and This section contains several examples of the thinning opera- nonasymptotic bounds—for the one-dimensional distribution of tion, statements of its more basic properties, and the definitions the thinned sum of integer-valued random variables.
of some important classes of distributions that will play a cen- With these goals in mind, before examining the finite- tral role in the rest of this paper. The proofs of all the lemmas , in Section V we study a simpler but re- and propositions of this section are given in the Appendix.
lated problem, on the convergence of a continuous-time "thin- Note, first, two important properties of thinning that are im- ning" Markov chain on , which acts as the mediate from its definition: It is well known that this Markov chain has the Poisson law as its 1. The thinning of a sum of independent random variables is unique invariant measure (see for example,  or ). In The- the convolution of the corresponding thinnings.
orem 5.1 we characterize precisely the rate at which it converges and any distribution to the Poisson law in terms of the distance, which leads to the following.
an upper bound on its convergence in information divergence. Anew characterization of the Poisson distribution in terms of thin- ning is also obtained. The main technical tool used here is basedon an examination of the properties of the Poisson-Char- Example 2.1: Thinning preserves the Poisson family of laws, lier polynomials in the thinning context. In the present context, . This follows from (2), since as described by Chafaï , this Markov chain plays a role par-allel to that of the Ornstein-Uhlenbeck process in the context ofGaussian convergence and the entropy power inequality ,, .
In Section VI we give both asymptotic and finite- on the rate of convergence for the law of thin numbers. Specif- IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 56, NO. 9, SEPTEMBER 2010 , denotes the Poisson mass As it turns out, the factorial moments of a thinned distribu- tion are easier to work with than ordinary moments. Recall thatthe th factorial moment of i.i.d. geometrics has a negative binomial distribu- falling factorial tion. Thus, in view of this example and property 1 stated in thebeginning of this section, the thinning of a negative binomialdistribution is also negative binomial.
The factorial moments of an -thinning are easy as given by Partly motivated by these examples, we describe certain the following result.
classes of random variables (some of which are new). Theseappear as natural technical assumptions in the subsequent Lemma 2.1: For any random variable with distribution development of our results. The reader may prefer to skip the for a random variable with remainder of this section and only refer back to the definitions when necessary.
Definition 2.1: 1) A Bernoulli sum is a distribution that can be obtained from That is, thinning scales factorial moments in the same way as the sum of finitely many independent Bernoulli random ordinary multiplication scales ordinary moments.
variables with possibly different parameters. The class of We will use the following result, which is a multinomial ver- Bernoulli sums with mean sion of Vandermonde's identity and is easily proved by induc- tion. The details are omitted.
2) A distribution Lemma 2.2: The falling factorial satisfies the multinomial expansion; that is, for any positive integer is said to be ultra log-concave (ULC); cf. . The set ofultra log-concave distributions with mean , and we also write . Note that (8) is satisfied for a single value if and only if it is satisfied for all 3) The distribution of a random variable The following is a basic regularity property of the thinning will be said to be ultra bounded (UB) with ratio . The set of ultra boundeddistributions with this ratio is denoted Proposition 2.1: For any 4) The distribution of a random variable will be said to be Poisson bounded (PB) with ratio . The set of Poisson bounded Example 2.2: Thinning preserves the class of Bernoulli sums.
distributions with this ratio is denoted That is, the thinned version of the distribution of a finite sum of 5) A random variable will be said to be ULC, UB, or PB, if independent Bernoulli random variables (with possibly different its distribution is ULC, UB or PB, respectively.
parameters) is also such a sum. This follows from property 1 First we mention some simple relationships between these stated in the beginning of this section, combined with the ob- classes. Walkup  showed that if servation that the distribution is the distribution. In particular, thinning preserves the bi- . In  it was shown that, if Example 2.3: Thinning by transforms a geometric distri- is Poisson bounded if and only if the -thinning into a geometric distribution with mean is Poisson bounded, for some . The same holds for ultra Recalling that the geometric distribution with mean Proposition 2.2: In the notation of Definition 2.1, the class . That is, if the distribution of HARREMOËS et al.: THINNING, ENTROPY, AND THE LAW OF THIN NUMBERS The next result states that the PB and UB properties are pre- is bounded below by served on summing and thinning.
Proposition 2.3: Now, for any fixed value of tending to infinity Formally, the above discussion can be summarized as Finally, we note that each of these classes of distributions is"thinning-convex," i.e., if are element of a set then and by monotone convergence is also an element of the same set. In partic- ular, thinning maps each of these sets into itself, since , the point mass at zero, has III. LAWS OF THIN NUMBERS: THE I.I.D. CASE In this section we state and prove three versions of the law of thin numbers, under appropriate conditions; recall the relevant are probability mass functions and so is discussion in the Introduction. Theorem 3.1 proves convergence is necessarily a limit.
in total variation, Theorem 3.2 in entropy, and Theorem 3.3 ininformation divergence.
As usual, the entropy of a probability distribution Recall that the total variation distance Recall that the entropy of the Poisson distribution cannot beexpressed in closed form, although useful bounds do exist; see,e.g.,  and the references therein.
Theorem 3.1 (Weak Version): For any distribution Theorem 3.2 (Thermodynamic Version): If which is Poisson bounded with mean , then Proof: In view of Scheffé's lemma, pointwise convergence Proof: The distribution converges pointwise of discrete distributions is equivalent to convergence in total to the Poisson distribution so, by dominated convergence, it is variation, so it suffices to show that, sufficient to prove that dominated by a summable function. This easily follows from , and that (2) implies the simple bound in the following lemma.
the following elementary bounds for all , using Jensen's in- Lemma 3.1: Suppose is Poisson bounded with ratio Proof: Note that, for all so that, in particular, Since for i.i.d. variables , the probability obtain that the convolution From the proof of [24, Theorem 2.5] we know that, is ultra log-concave, so for such distributions the theorem states that the entropy converges IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 56, NO. 9, SEPTEMBER 2010 to its maximum. For ultra log-concave distributions the ther- where the first term (corresponding to modynamic version also implies convergence in information the Poisson measures form an exponential family, they satisfy a divergence. This also holds for Poisson bounded distributions, Pythagorean identity  which, together with the bound which is easily proved using dominated convergence. As shownin the next theorem, convergence in information divergence can be established under quite general conditions.
Theorem 3.3 (Strong Version): For any distribution see, e.g.,  or , gives, for each The proof of Theorem 3.3 is given in the Appendix; it is based on a straightforward but somewhat technical application of thefollowing general bound.
Proposition 3.1: Let be a random variable with distribu- and with finite mean Since the final bound clearly remains valid for tuting it into (12) gives (10).
IV. LAWS OF THIN NUMBERS: THE NON-I.I.D. CASE where the right-hand side is always finite.
In this section we state and prove more general versions of the Proof: First note that, since has finite mean, its entropy law of thin numbers, for sequences of random variables that are is bounded by the entropy of a geometric with the same mean, not necessarily independent or identically distributed. Although which is finite, so is finite. Therefore, the divergence some of the results in this section are strict generalizations of can be expanded as Theorems 3.1 and 3.3, their proofs are different.
We begin by showing that, using a general proof technique introduced in , the weak law of thin numbers can be estab-lished under weaker conditions than those in Theorem 3.1. Themain idea is to use the data-processing inequality on the totalvariation distance between an appropriate pair of distributions.
Theorem 4.1 (Weak Version, Non-I.I.D.): Let an arbitrary sequence of distributions on for the convolution of the first where the last inequality follows from the Stirling bound as long as the following three conditions are satisfied as denotes the function is finite, the bound (11) implies that is finite. [Recall the convention that Note that Theorem 4.1 can be viewed as a one-dimensional Also note that the representation of version of Grigelionis' Theorem 1 in ; recall the relevant comments in the Introduction. Recently, Schuhmacher established nonasymptotic, quantitative versions of this result,in terms of the Barbour-Brown distance, which metrizes weakconvergence in the space of probability measures of point Using this and the joint convexity of information divergence in processes. As the information divergence is a finer functional its two arguments (see, e.g., [9, Th. 2.7.2]), the divergence of than the Barbour-Brown distance, Schuhmacher's results are interest can be bounded as not directly comparable with the finite- bounds we obtain in Propositions 3.1, 4.1, and Corollary 6.1.
Before giving the proof of the theorem, we state a simple lemma on a well-known bound for proof is included for completeness.
Lemma 4.1: For any , HARREMOËS et al.: THINNING, ENTROPY, AND THE LAW OF THIN NUMBERS Proof: Suppose, without loss of generality, that and define two independent random variables coupling inequality  The second inequality in the lemma is trivial.
and, by assumption, this converges to zero as Proof of Theorem 4.1: First we introduce some convenient pleting the proof.
be independent random variables with for all ; for each Recall that, in the i.i.d. case, the weak law of thin numbers dent random variables with for all ; and simi- only required the first moment of to be finite, while the strong version also required that the divergence from . Also we define the distribution be finite. For a sum of independent, non-identicallydistributed random variables with finite second moments, Proposition 3.1 can be used as in the proof of Theorem 3.3 to prove the following result. Note that the precise conditions required are somewhat analogous to those in Theorem 4.1.
Theorem 4.2 (Strong Version, Non-i.i.d.): Let an arbitrary sequence of distributions on and finite variance. Writing and, by assumption, With these definitions in place, we approximate as long as the following two conditions are satisfied: where, by Lemma 4.1, the second term is bounded by which vanishes as . Therefore, it suffices to show that The proof of Theorem 4.2 is given in the Appendix, and it is the first term in (14) goes to zero. For that term based on Proposition 3.1. It turns out that under the additionalcondition of finite second moments, the proof of Proposition3.1 can be refined to produce a stronger upper bound on thedivergence.
Proposition 4.1: If is a distribution on Proof: Recall that in the proof of Proposition 3.1 it was where the first inequality above follows from the fact that, beingan -divergence, the total variation distance satisfies the data- processing inequality ; the second inequality comes fromthe well-known bound on the total variation distance betweentwo product measures as the sum of the distances between their respective marginals; and the third bound is simply the triangleinequality.
Finally, noting that, for any random variable , and also recalling the simple IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 56, NO. 9, SEPTEMBER 2010 and where in the last step above we used the simple bound The first and third terms tend to zero by assumptions (b) and (c), respectively. And using assumption (a), the second term is Substituting (17) into (16) yields which also tends to zero by assumption (b).
V. THE THINNING MARKOV CHAIN Before examining the rate of convergence in the law of thin numbers, we consider a related and somewhat simpler problemfor a Markov chain. Several of the results in this section maybe of independent interest. The Markov chain we will discuss represents the evolution of the queue. Its properties Using the bound (15) instead of Proposition 3.1, the following were used by Chafaï  to develop a family of inequalities more general version of the law of thin numbers can be estab- which extends the logarithmic Sobolev inequalities of Bobkov and Ledoux , and other authors. Chafaï argues in [7, Sec. 1.2,1.3] that this process is a natural discrete analog of the Ornstein- Theorem 4.3 (Strong Version, Non-I.I.D.): Let Uhlenbeck process associated with the Gaussian distribution.
a sequence of (not necessarily independent or identicallydistributed) random variables on Definition 5.1: Let be a distribution on distribution of the partial sum for the distribution . Assume that the have finite means and variances, a) They are "uniformly ultra bounded," in that, for all , with a common is often written simply as b) Their means satisfy , and that, obviously, c) Their covariances satisfy probability distributions to probability distributions. Therefore,if for a fixed of linear operators on the space of probability defines a Markov transition semigroup. Specif- , the transition probabilities define a continuous-time Markov chain is intuitively clear that, as (or, equivalently, should converge to the Indeed, the following two well-known results (see for example, Proof: Obviously it suffices to prove the general statement.
 or ) state that is ergodic, with unique invariant Proposition 4.1 applied to . Theorem 5.1 gives the rate at which it converges in terms of the moments of the underlying distribution.
This complements results such as [7, Th. 3.1], which gives abound in terms of the tightest value of the log-Sobolev constant.
Proposition 5.1: For any distribution verges in total variation to Proof: From the definition of HARREMOËS et al.: THINNING, ENTROPY, AND THE LAW OF THIN NUMBERS Definition 5.2: For given , the Poisson-Charlier polynomial Some well-known properties of the Poisson-Charlier polyno- mials are listed in the following lemma without proof. Note that their exact form depends on the chosen normalization; other au-thors present similar results, but with different normalizations.
Lemma 5.1: For any , where (18) follows from the fact that convolution with any dis- tribution is a contraction with respect to the lows from the triangle inequality, and (20) converges to zero because of the bound (9).
Using this, we can give a characterization of the Poisson Corollary 5.1: Let denote a discrete distribution with mean is the unique invariant measure of the Markov chain Proof: Assume that , by Proposition 5.1, large. The strengthened convergence of can be proved using standard arguments along the lines of the corresponding discrete-time Observe that, since the Poisson-Charlier polynomials form an results in , , and .
orthonormal set, any function can be expanded as Next we shall study the rate of convergence of Poisson distribution. It is easy to check that the Markov chain is in fact reversible with respect to its invariant measure . Therefore, the natural setting for the study of its con- space of functions It will be convenient to be able to translate between factorial mo- . This space is also endowed ments and the "Poisson-Charlier moments," with the usual inner product in (21) shows that . More generally, the following proposition shows that the role of the Poisson-Charlier momentswith respect to the Markov chain is analogous to the role played by the factorial moments with respect to the pure thin- and the linear operators ning operation; cf. Lemma 2.1. Its proof, given in the Appendix, is similar to that of Lemma 2.1.
Proposition 5.2: Let be a random variable with for a random variable with distribution The reversibility of and assume that the thinning is a self-adjoint linear operator on , therefore, its eigenvec- has initial distribution tors are orthogonal functions. In this context, we introduce the , then, Proposition 5.2 states that Poisson-Charlier family of orthogonal polynomials for a broad introduction.
IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 56, NO. 9, SEPTEMBER 2010 that is, the Poisson-Charlier moments of Proof: The proof is based on a Hilbert space argument . Similarly, expanding any func- using the fact that the Poisson-Charlier polynomials are orthog- in terms of Poisson-Charlier polynomials, . Using Proposition 5.3 , and using Proposition 5.2 Thus, the rate of convergence of will be dominated by the term corresponding to The following proposition (proved in the Appendix) will be used in the proof of Theorem 5.1, which shows that this is indeed where the last step follows from the orthogonality relation (22).
the right rate in terms of the distance. Note that there is no restriction on the mean of in the proposition.
Proposition 5.3: If is Poisson bounded, then the can be expanded as which is finite. From the previous expansion we see that , combining Propositions 5.2 , which implies the finite- and 5.3, we obtain that ness claim. Moreover, that expansion has its dominant term, implying the stated limit.
Theorem 5.1 readily leads to upper bounds on the rate of con- vergence in terms of information divergence via the standardbound which follows from direct applications of Jensen's inequality.
where, as before, denotes the first integer Furthermore, replacing this bound by the well-known approxi- . This sum can be viewed as a discrete analog of the well-known Edgeworth expansion for the distribution ofa continuous random variable. A technical disadvantage of boththis and the standard Edgeworth expansion is that, although thesum converges in , truncating it to a finite number of terms in gives the estimate general produces an expression which may take negative values.
By a more detailed analysis we shall see in the following twosections how to get around this problem.
For now, we determine the rate of convergence of We shall later prove that, in certain cases, this approximation can indeed be rigorously justified.
recall the definition of the distance between two probability VI. THE RATE OF CONVERGENCE IN THE STRONG LAW OF be a random variable on Theorem 3.3 we showed that, if Theorem 5.1: If is Poisson bounded, then the distance is finite for all also has finite variance , then Proposition 4.1 implies denotes the smallest HARREMOËS et al.: THINNING, ENTROPY, AND THE LAW OF THIN NUMBERS suggesting a convergence rate of order . In this section, we Theorem 6.1: Suppose prove more precise upper bounds on the rate of convergence in bounded with ratio . Let denote the smallest integer the strong law of thin numbers (27). For example, if ultra bounded random variable with , then we show that Combining Theorem 6.1 with (31) immediately yields the lows from the more general result of Corollary 6.1; its proof is Corollary 6.1: Suppose based on a detailed analysis of the scaled Fisher information in- bounded with ratio . Let denote the smallest integer troduced in . We begin by briefly reviewing some properties of the scaled Fisher information.
Definition 6.1: The scaled Fisher information of a random with mean , is defined by VII. MONOTONICITY RESULTS FOR THE SCALED denotes the scaled score function FISHER INFORMATION In this section we establish a finer result for the behavior of the scaled Fisher information upon thinning, and use that to de-duce a stronger finite- upper bound for the strong law of thinnumbers. Specifically, if is ULC with mean , and In [28, Prop. 2] it was shown, using a logarithmic Sobolev denotes a random variable with distribution inequality of Bobkov and Ledoux , that for any . This implies that, for all ULC random , we have the following finite- version of the strong law of thin numbers under mild conditions on the support of . Also, [28, Prop. 3] satisfies a subadditivity property: For indepen- dent random variables Note that, unlike the more general result in (28) which gives abound of order , the above bound is of order The key observation for these results is in the following . In particular, recalling that the thinning of a convolution is the convolution of the corresponding thin- Lemma 7.1: Suppose is a ULC random variable with dis- are i.i.d. random variables with mean then the bounds in (29) and (30) imply random variable with distribution . Then the derivative of Therefore, our next goal is to determine the rate at which tending to 0. We begin with the where, for a random variable with mass function following proposition; its proof is given in Appendix.
Proposition 6.1: If is Poisson bounded, then mits the representation Moreover, the truncated sum from is an upper bound is even, and a lower bound if Proof: This result follows on using the expression for the An important consequence of this proposition is that the prob- arising as the case tends to zero like Prop. 3.6], that is leads to the following asymptotic result for the scaled Fisher in-formation, also proved in theAppendix.
IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 56, NO. 9, SEPTEMBER 2010 Using this, for each we deduce that the derivative Consider the finite difference operator . We require a result suggested by can be expressed as the sum of relevant results in , . Its proof is given in the Appendix.
Lemma 7.2: Let be ULC random variable with distribution . Then for any function , defining VIII. BOUNDS IN TOTAL VARIATION In this section, we show that a modified version of the argu- ment used in the proof of Proposition 4.1 gives an upper boundto the rate of convergence in the weak law of small numbers. If , then combining the bound (15) of Proposition 4.1 with Pinsker's inequality we obtain The result follows (with the term-by-term differentiation of theinfinite sum justified) if the sum of these terms in convergent. The first terms are positive, and their sum is abso-lutely convergent to by assumption. The second terms form a which gives an upper bound of order . From the asymp- collapsing sum, which is absolutely convergent assuming that totic upper bound on information divergence, Corollary 6.1, weknow that one should be able to obtain upper bounds of order . Here we derive an upper bound on total variation using the same technique used in the proof of Proposition 4.1.
Note that, for any ULC distribution , by definition we have for Theorem 8.1: Let be a distribution on above sum is bounded above by, which is finite by Proposition 2.2.
The proof uses the following simple bound, which follows easily from a result of Yannaros, [40, Theorem 2.3]; the details We now deduce the following theorem, which parallels, re- are omitted.
spectively, [41, Th. 8], where a corresponding result is provedfor the information divergence.
Lemma 8.1: For any Theorem 7.1: Let be a ULC random variable with for a random variable with distribution Proof: The first inequality in the proof of Proposition 4.1 remains valid due to the convexity of the total variation norm(since it is an -divergence). The next equality becomes an in-equality triangle, and it is justified by the triangle, and we have The first part follows from the observation , since, by Lemma 7.1, its derivative is in the more technical Lemma 7.2 below, we deduce that , and this proves (i). Then (ii) immediately follows from (i) combined with the earlier bound (31), upon recallingthat thinning preserves the ULC property .
HARREMOËS et al.: THINNING, ENTROPY, AND THE LAW OF THIN NUMBERS And using Lemma 8.1 leads to are as before, and able that is independent of the Perhaps the most natural way in which the compound Poisson distribution arises is as the limit of compound binomials. That is, , or, equivalently and the result follows by an application of Hölder's inequality.
As with the strong law of thin numbers, this result remains true for general distributions , and the convergence can be es- tablished in the sense of information divergence.
IX. COMPOUND THINNING Theorem 9.1: Let be a distribution on There is a natural generalization of the thinning operation, and finite variance . Then, for any probability measure via a process which closely parallels the generalization of thePoisson distribution to the compound Poisson. Starting with arandom variable is obtained by writing and then keeping each of these 1s with probability dently of all the others; cf. (1) above.
The proof is very similar to that of Theorem 3.3 and thus More generally, we choose and fix a "compounding" distribu- omitted. In fact, the same argument as that proof works for non- integer-valued compounding. That is, if is an arbitrary prob- then the compound -thinned version of ability measure on , then compound thinning a -thinned version of , is the random variable as in (35) gives a probability measure which results from first thinning as above and then replacing of the 1s that are kept by an independent random sample from It is somewhat remarkable that the statement and proof of most of our results concerning the information divergence re- main essentially unchanged in this case. For example, we easilyobtain the following analog of Proposition 4.1.
where all the random variables involved are independent. For Proposition 9.1: If is a distribution on for the distribution of the , then, for any prob- -thinned version of pressed as a mixture of "compound binomials" in the same wayas is a mixture of binomials. The compound binomial distribution with parameters is the distribution of the sum of i.i.d. random variables, each of which is the product of a random variable and an The details of the argument of the proof of the proposition are random variable. In other words, it is the straightforward extensions of the corresponding proof of Propo- -thinned version of the point mass at , i.e., the distribu- tion of (35) with w.p.1. Then we can express the prob- -thinned version of The following two observations are immediate from the Proof of Lemma 2.1: Simply apply Lemma 2.2 to Defini- 1) Compound thinning maps a Bernoulli sum into a com- pound Bernoulli sum: If is the distribution of the is the distribution of the "com- pound Bernoulli sum," are i.i.d. with distribu- , independent of the 2) Compound thinning maps the Poisson to the compound Poisson distribution, that is, the compound Poisson distribution with rate pounding distribution as the distribution of IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 56, NO. 9, SEPTEMBER 2010 using the fact that the sequence of factorial moments of the Proof of Proposition 2.1: Assume that . Then, recalling the property stated in (6), it The second property is easily checked using Lemma 2.1.
Proof of Theorem 3.3: In order to apply Proposition 3.1 , which is only possible if , we need to check that denote the sum of is the distribution of Proof of Proposition 2.2: Note that the expectation ables. Therefore, using the data-processing inequality  as in implies that finite by assumption.
by the Chebyshev rearrangement lemma, since it is the covari- Proposition 3.1 gives ance between an increasing and a decreasing function. Rear-ranging this inequality gives The law of large numbers implies that, as required.
complete the proof it suffices to show that Proof of Proposition 2.3: For part (a), using Lemma 2.2, , or, equivalently, that the se- is uniformly integrable. We will actually show that the nonnegative random variables bounded above by a different uniformly integrable sequence. In-deed, by the log-sum inequality It is straightforward to check, using Lemma 2.1, that Arguing as in the beginning of the proof of Proposition 3.1 shows that the mean is finite, so the law To prove part (b), using Lemma 2.2, Pascal's identity and of large numbers implies that the averages in (36) converge to relabelling, yields . Hence, they form a uniformly integrable se- quence; this implies that the are also uniformly integrable, completing the proof.
Proof of Theorem 4.2: The proof is similar to that of The- orem 3.3, so some details are omitted. For each , where the random are independent, with each First, to see that plying the data-processing inequality  as in  gives , and it is easy to check that each of these terms is finite because all second moments. As before, Proposition 3.1 gives HARREMOËS et al.: THINNING, ENTROPY, AND THE LAW OF THIN NUMBERS for each , the independent random vari- have zero mean and where we have used the fact that the factorial moments of a which is finite by assumption (b). Then, by the general version . Simplifying and of the law of large numbers on [14, p. 239] interchanging the two sums, a.s., and hence, by assumption (a), a.s., so that also, . Moreover, since for every integer Proof of Proposition 5.3: First we have to prove that is Poisson bounded with ratio which is uniformly bounded over by our assumptions.
say. Using the bound in Lemma 3.1 Therefore, the sequence , which implies that it is uniformly integrable, therefore it converges to Finally, recalling once more that the Poisson measures form an exponential family, they satisfy a Pythagorean identity ,so that which is finite.
where the first term was just shown to go to zero as Now, recalling the general expansion (25), it suffices to show and the second term is actually equal to as required.
which also vanishes as by assumption (a).
Proof of Proposition 6.1: We need the following simple Proof of Proposition 5.2: Let lemma; for a proof see, e.g., .
dent random variables with distributions Lemma A.1: If respectively. Then from the definitions, and using Lemmas 2.2and 2.1 Turning to the proof of Proposition 6.1, assume Poisson bounded with ratio . Then the series in the statementconverges, since IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 56, NO. 9, SEPTEMBER 2010 Therefore, using this and (38), for small enough term in (39) is bounded above by which, divided by , tends to zero as For the other two terms we use the full expansion of Propo- sition 6.1, together with Lemma 2.1, to obtain a more accurateexpression for the score function A similar argument holds for Proof of Theorem 6.1: Let have distribution Using Lemma 2.1, Proposition 6.1, and the fact that Since, by assumption, bounded, for small enough the score function of , the first terms in the series in the numerator above vanish. Therefore, , the numerator and denominator above are both bounded functions of , and the denominator is bounded away from zero (because of the term corresponding to , the score function . For the first term in (39) we thus have Since the lower bound is obvious, it follows that, which, again, when divided by , tends to zero as Thus only the second term in (39) contributes. For this term, we similarly obtain that the limit For the third term note that, applying Markov's inequality to thefunction , which increases on the integers, we obtain, HARREMOËS et al.: THINNING, ENTROPY, AND THE LAW OF THIN NUMBERS This means that (with the reversal of order of summation justi-fied by Fubini, since all the terms have the same sign) Finally, combining the above limits with (39) yields Proof of Lemma 7.2: The key is to observe that for is decreasing in , and creasing in , there exists an integer and the result holds. Note that the inequality in (44) follows by(42) and (43), and the inequality in (45) by the discussion above.
The authors wish to thank E. Telatar and C. Vignat for hosting a small workshop in January 2006, during which someof these ideas developed. J. Swart also provided us with usefulcomments.
 J. A. Adell, A. Lekuona, and Y. Yu, "Sharp bounds on the entropy of the Poisson law and related quantities," IEEE Trans. Inf. Theory, vol.
56, no. 5, pp. 2299–2306, 2010.
 A. D. Barbour, L. Holst, and S. Janson, Poisson Approximation.
Further, by Cauchy-Schwarz, for ford, U.K.: Clarendon , 1992.
 A. R. Barron, "Entropy and the central limit theorem," Ann. Probab. Theory, vol. 14, no. 1, pp. 336–342, 1986.
 A. R. Barron, "Limits of information, Markov chains, and projections," in Proc. 2000 Int. Symp. Inf. Theory, 2000, pp. 25–25.
 S. G. Bobkov and M. Ledoux, "On modified logarithmic Sobolev in- equalities for Bernoulli and Poisson measures," J. Funct. Anal., vol.
156, no. 2, pp. 347–365, 1998.
 A. A. Borovkov and S. A. Utev, "An inequality and a characteriza- tion of the normal distribution connected with it," Teor. Veroyatnost. iPrimenen, vol. 28, no. 2, pp. 209–218, 1983.
IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 56, NO. 9, SEPTEMBER 2010  D. Chafaï, "Binomial-Poisson entropic inequalities and the M=M=1  D. Schuhmacher, "Distance estimates for dependent superpositions queue," ESAIM Probab. Statist., vol. 10, pp. 317–339, 2006.
of point processes," Stochastic Process. Appl., vol. 115, no. 11, pp.
 T. S. Chihara, An Introduction to Orthogonal Polynomials.
York: Gordon and Breach, 1978.
 C. E. Shannon, "A mathematical theory of communication," Bell Syst.  T. Cover and J. A. Thomas, Elements of Information Theory.
Tech. J., vol. 27, no. 379–423, pp. 623–656, 1948.
York: Wiley, 1991.
 A. J. Stam, "Some inequalities satisfied by the quantities of information  I. Csiszár, "Information-type measures of difference of probability dis- of Fisher and Shannon," Inf. Contr., vol. 2, pp. 101–112, Jun. 1959.
tributions and indirect observations," Studia Sci. Math. Hungar., vol.
 D. W. Walkup, "Pólya sequences, binomial convolution and the union 2, pp. 299–318, 1967.
of random sets," J. Appl. Probabil., vol. 13, no. 1, pp. 76–85, 1976.
 I. Csiszár and P. Shields, "Information theory and statistics: A tutorial,"  N. Yannaros, "Poisson approximation for random sums of Bernoulli Found. Trends in Commun. Inf. Theory, vol. 1, pp. 1–111, 2004.
random variables," Statist. Probab. Lett., vol. 11, no. 2, pp. 161–165,  D. J. Daley and D. Vere-Jones, An Introduction To The Theory of Point Processes, Second ed.
New York: Springer, 2008, vol. II.
 Y. Yu, "Monotonic convergence in an information-theoretic law of small  A. Fedotov, P. Harremoës, and F. Topsøe, "Refinements of Pinsker's numbers," IEEE Trans. Inf. Theory, vol. 55, pp. 5412–5422, 2009.
inequality," IEEE Trans. Inf. Theory, vol. 49, pp. 1491–1498, Jun. 2003.
 V. M. Zolotarev, "Probability metrics," Teor. Veroyatnost. i Primenen,  W. Feller, An Introduction to Probability Theory and Its Applications, vol. 28, no. 2, pp. 264–287, 1983.
New York: Wiley, 1971, vol. II.
 J. Fritz, "An information-theoretical proof of limit theorems for re- versible Markov processes," in Proc. 6th Prague Conf. Inf. Theory,Statist. Decision Functions, Random Processes, Prague, Sep. 1971.
Peter Harremoës (M'00) received the B.Sc. degree in mathematics in 1984, the
 L. Gerber, "An extension of Bernoulli's inequality," Amer. Math. Exam. Art. degree in archaeology in 1985, and the M.Sc. degree in mathematics Monthly, vol. 75, pp. 875–876, 1968.
in 1988, all from the University of Copenhagen, Denmark. He received the Ph.D.
 B. Grigelionis, "The convergence of stepwise random processes to a degree in natural sciences in 1993 from Roskilde University, Denmark.
Poisson process," Teor. Verojatnost. i Primenen, vol. 8, pp. 189–194, From 1993 to 1998, he worked as a mountaineer. From 1998 to 2000, he held various teaching positions in mathematics. From 2001 to 2006, he was  P. Harremoës, "Binomial and Poisson distributions as maximum a Postdoctoral Fellow with the University of Copenhagen, with a longer visit entropy distributions," IEEE Trans. Inf. Theory, vol. IT-47, pp.
to Zentrum für Interdiziplinäre Forschung, Bielefeld, Germany, 2003. From 2039–2041, Jul. 2001.
2006 to 2009, he was affiliated with Centrum Wiskunde and Informatica, Am-  P. Harremoës and K. K. Holst, "Convergence of Markov chains in in- sterdam, The Netherlands, under the European Pascal Network of Excellence.
formation divergence," J. Theoret. Probab., vol. 22, no. 1, pp. 186–202, Since then he has been affiliated with Niels Brock, Copenhagen Business College, Denmark.
 O. Johnson and I. Kontoyiannis, "Thinning and the law of small Dr. Harremoës has been Editor-in-Chief of the journal Entropy since 2007.
numbers," in Proc. IEEE Int. Symp. Inf. Theory, Jun. 2007, pp.
 P. Harremoës, O. Johnson, and I. Kontoyiannis, "Thinning and infor- mation projections," Preparation, 2010.
 P. Harremoës and P. Ruzankin, "Rate of convergence to Poisson law Oliver Johnson received the B.A. degree in 1995, Part III Mathematics in 1996
in terms of information divergence," IEEE Trans. Inf. Theory, vol. 50, and the Ph.D. degree in 2000, all from the University of Cambridge.
pp. 2145–2149, 2004.
He was a Clayton Research Fellow of Christ's College and Max Newman Re-  O. Johnson, Information Theory and Central Limit Theorem.
search Fellow of Cambridge University until 2006, during which time he pub- London, U.K.: Imperial College Press, 2004.
lished the book Information Theory and the Central Limit Theorem in 2004.
 O. Johnson, "Log-concavity and the maximum entropy property of the Since 2006, he has been Lecturer in Statistics at Bristol University, U.K.
Poisson distribution," Stochastic Process. Appl., vol. 117, no. 6, pp.
 I. M. Johnstone and B. MacGibbon, "Une mesure d'information car- actérisant la loi de Poisson," Séminaire de Probab., XXI, pp. 563–573, Ioannis Kontoyiannis (S'94–M'98–SM"08) was born in Athens, Greece, in
1972. He received the B.Sc. degree in mathematics in 1992 from Imperial Col-  A. Y. Khintchine, Mathematical Methods in the Theory of Queueing.
lege (University of London), U.K., and in 1993 he obtained a distinction in Part New York: Hafner, 1960.
III of the Cambridge University Pure Mathematics Tripos. He received the M.S.
 C. A. J. Klaassen, "On an inequality of Chernoff," Ann. Probab., vol.
degree in statistics and the Ph.D. degree in electrical engineering, both from 13, no. 3, pp. 966–974, 1985.
Stanford University, Stanford, CA, in 1997 and 1998, respectively.
 I. Kontoyiannis, P. Harremoës, and O. Johnson, "Entropy and the law Between June and December 1995, he was with IBM Research working on of small numbers," IEEE Trans. Inf. Theory, vol. IT-51, pp. 466–472, a NASA-IBM satellite image processing and compression project. From June 1998 to August 2001, we was an Assistant Professor with the Department of  E. H. Lieb, "Proof of an entropy conjecture by Wehrl," Commun. Math. Statistics, Purdue University, West Lafayette, IN (and also, by courtesy, with the Phys., vol. 62, pp. 35–41, 1978.
Department of Mathematics, and the School of Electrical and Computer Engi-  T. Lindvall, Lectures on the Coupling Method.
New York: Wiley , neering). From August 2000 until July 2005, he was an Assistant, then Associate Professor, with the Division of Applied Mathematics and with the Department  D. Mejzler, "A note on Erlang's formulas," Israel J. Math., vol. 3, pp.
of Computer Science, Brown University, Providence, RI. Since March 2005, he 157–162, 1965.
has been an Associate Professor with the Department of Informatics, Athens  G. F. Newell, "The M=G=1 queue," SIAM J. Appl. Math., vol. 14, University of Economics and Business.
pp. 86–88, 1966.
Dr. Kontoyiannis was awarded the Manning endowed Assistant Professorship  C. Palm, "Intensitätsschwankungen im Fernsprechverkehr," Ericsson in 2002, and was awarded an honorary Master of Arts degree Ad Eundem, in Technics No, vol. 44, pp. 189–189, 1943.
2005, both by Brown University. In 2004, he was awarded a Sloan Foundation  R.-D. Reiss, Approximate Distributions of Order Statistics, ser.
Research Fellowship. Currently he serves as an Associate Editor for the IEEE Springer Series in Statistics.
New York: Springer-Verlag, 1989.
TRANSACTIONS ON INFORMATION THEORY. His research interests include data  A. Rényi, "A characterization of Poisson processes," Magyar Tud. compression, applied probability, information theory, statistics, simulation, and Akad. Mat. Kutaló Int. Közl., vol. 1, pp. 519–527, 1956.
Hippocampal microRNA-132 mediates stress-inducible cognitive deficits through its acetylcholinesterase target G. Shaltiel, M. Hanan, Y. Wolf, S. Barbash, E. Kovalev, S. Shoham & Brain Structure and Function ISSN 1863-2653Brain Struct FunctDOI 10.1007/s00429-011-0376-z Your article is published under the Creative