123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229 |
- [section:hypergeometric_dist Hypergeometric Distribution]
- ``#include <boost/math/distributions/hypergeometric.hpp>``
- namespace boost{ namespace math{
- template <class RealType = double,
- class ``__Policy`` = ``__policy_class`` >
- class hypergeometric_distribution;
- template <class RealType, class Policy>
- class hypergeometric_distribution
- {
- public:
- typedef RealType value_type;
- typedef Policy policy_type;
- // Construct:
- hypergeometric_distribution(unsigned r, unsigned n, unsigned N);
- // Accessors:
- unsigned total()const;
- unsigned defective()const;
- unsigned sample_count()const;
- };
- typedef hypergeometric_distribution<> hypergeometric;
- }} // namespaces
- The hypergeometric distribution describes the number of "events" /k/
- from a sample /n/ drawn from a total population /N/ ['without replacement].
- Imagine we have a sample of /N/ objects of which /r/ are "defective"
- and N-r are "not defective"
- (the terms "success\/failure" or "red\/blue" are also used). If we sample /n/
- items /without replacement/ then what is the probability that exactly
- /k/ items in the sample are defective? The answer is given by the pdf of the
- hypergeometric distribution `f(k; r, n, N)`, whilst the probability of
- /k/ defectives or fewer is given by F(k; r, n, N), where F(k) is the
- CDF of the hypergeometric distribution.
- [note Unlike almost all of the other distributions in this library,
- the hypergeometric distribution is strictly discrete: it can not be
- extended to real valued arguments of its parameters or random variable.]
- The following graph shows how the distribution changes as the proportion
- of "defective" items changes, while keeping the population and sample sizes
- constant:
- [graph hypergeometric_pdf_1]
- Note that since the distribution is symmetrical in parameters /n/ and /r/, if we
- change the sample size and keep the population and proportion "defective" the same
- then we obtain basically the same graphs:
- [graph hypergeometric_pdf_2]
- [h4 Member Functions]
- hypergeometric_distribution(unsigned r, unsigned n, unsigned N);
- Constructs a hypergeometric distribution with a population of /N/ objects,
- of which /r/ are defective, and from which /n/ are sampled.
- unsigned total()const;
- Returns the total number of objects /N/.
- unsigned defective()const;
- Returns the number of objects /r/ in population /N/ which are defective.
- unsigned sample_count()const;
- Returns the number of objects /n/ which are sampled from the population /N/.
- [h4 Non-member Accessors]
- All the [link math_toolkit.dist_ref.nmp usual non-member accessor functions]
- that are generic to all distributions are supported: __usual_accessors.
- The domain of the random variable is the unsigned integers in the range
- \[max(0, n + r - N), min(n, r)\]. A __domain_error is raised if the
- random variable is outside this range, or is not an integral value.
- [caution
- The quantile function will by default return an integer result that has been
- /rounded outwards/. That is to say lower quantiles (where the probability is
- less than 0.5) are rounded downward, and upper quantiles (where the probability
- is greater than 0.5) are rounded upwards. This behaviour
- ensures that if an X% quantile is requested, then /at least/ the requested
- coverage will be present in the central region, and /no more than/
- the requested coverage will be present in the tails.
- This behaviour can be changed so that the quantile functions are rounded
- differently using
- [link math_toolkit.pol_overview Policies]. It is strongly
- recommended that you read the tutorial
- [link math_toolkit.pol_tutorial.understand_dis_quant
- Understanding Quantiles of Discrete Distributions] before
- using the quantile function on the Hypergeometric distribution. The
- [link math_toolkit.pol_ref.discrete_quant_ref reference docs]
- describe how to change the rounding policy
- for these distributions.
- However, note that the implementation method of the quantile function
- always returns an integral value, therefore attempting to use a __Policy
- that requires (or produces) a real valued result will result in a
- compile time error.
- ] [/ caution]
- [h4 Accuracy]
- For small N such that
- `N < boost::math::max_factorial<RealType>::value` then table based
- lookup of the results gives an accuracy to a few epsilon.
- `boost::math::max_factorial<RealType>::value` is 170 at double or long double
- precision.
- For larger N such that `N < boost::math::prime(boost::math::max_prime)`
- then only basic arithmetic is required for the calculation
- and the accuracy is typically < 20 epsilon. This takes care of N
- up to 104729.
- For `N > boost::math::prime(boost::math::max_prime)` then accuracy quickly
- degrades, with 5 or 6 decimal digits being lost for N = 110000.
- In general for very large N, the user should expect to lose log[sub 10]N
- decimal digits of precision during the calculation, with the results
- becoming meaningless for N >= 10[super 15].
- [h4 Testing]
- There are three sets of tests: our implementation is tested against a table of values
- produced by Mathematica's implementation of this distribution. We also sanity check
- our implementation against some spot values computed using the online calculator
- here [@http://stattrek.com/Tables/Hypergeometric.aspx http://stattrek.com/Tables/Hypergeometric.aspx].
- Finally we test accuracy against some high precision test data using
- this implementation and NTL::RR.
- [h4 Implementation]
- The PDF can be calculated directly using the formula:
- [equation hypergeometric1]
- However, this can only be used directly when the largest of the factorials
- is guaranteed not to overflow the floating point representation used.
- This formula is used directly when `N < max_factorial<RealType>::value`
- in which case table lookup of the factorials gives a rapid and accurate
- implementation method.
- For larger /N/ the method described in
- "An Accurate Computation of the Hypergeometric Distribution Function",
- Trong Wu, ACM Transactions on Mathematical Software, Vol. 19, No. 1,
- March 1993, Pages 33-43 is used. The method relies on the fact that
- there is an easy method for factorising a factorial into the product
- of prime numbers:
- [equation hypergeometric2]
- Where p[sub i] is the i'th prime number, and e[sub i] is a small
- positive integer or zero, which can be calculated via:
- [equation hypergeometric3]
- Further we can combine the factorials in the expression for the PDF
- to yield the PDF directly as the product of prime numbers:
- [equation hypergeometric4]
- With this time the exponents e[sub i] being either positive, negative
- or zero. Indeed such a degree of cancellation occurs in the calculation
- of the e[sub i] that many are zero, and typically most have a magnitude
- or no more than 1 or 2.
- Calculation of the product of the primes requires some care to prevent
- numerical overflow, we use a novel recursive method which splits the
- calculation into a series of sub-products, with a new sub-product
- started each time the next multiplication would cause either overflow
- or underflow. The sub-products are stored in a linked list on the
- program stack, and combined in an order that will guarantee no overflow
- or unnecessary-underflow once the last sub-product has been calculated.
- This method can be used as long as N is smaller than the largest prime
- number we have stored in our table of primes (currently 104729). The method
- is relatively slow (calculating the exponents requires the most time), but
- requires only a small number of arithmetic operations to
- calculate the result (indeed there is no shorter method involving only basic
- arithmetic once the exponents have been found), the method is therefore
- much more accurate than the alternatives.
- For much larger N, we can calculate the PDF from the factorials using
- either lgamma, or by directly combining lanczos approximations to avoid
- calculating via logarithms. We use the latter method, as it is usually
- 1 or 2 decimal digits more accurate than computing via logarithms with
- lgamma. However, in this area where N > 104729, the user should expect
- to lose around log[sub 10]N decimal digits during the calculation in
- the worst case.
- The CDF and its complement is calculated by directly summing the PDF's.
- We start by deciding whether the CDF, or its complement, is likely to be
- the smaller of the two and then calculate the PDF at /k/ (or /k+1/ if we're
- calculating the complement) and calculate successive PDF values via the
- recurrence relations:
- [equation hypergeometric5]
- Until we either reach the end of the distributions domain, or the next
- PDF value to be summed would be too small to affect the result.
- The quantile is calculated in a similar manner to the CDF: we first guess
- which end of the distribution we're nearer to, and then sum PDFs starting
- from the end of the distribution this time, until we have some value /k/ that
- gives the required CDF.
- The median is simply the quantile at 0.5, and the remaining properties are
- calculated via:
- [equation hypergeometric6]
- [endsect]
- [/ hypergeometric.qbk
- Copyright 2008 John Maddock.
- Distributed under the Boost Software License, Version 1.0.
- (See accompanying file LICENSE_1_0.txt or copy at
- http://www.boost.org/LICENSE_1_0.txt).
- ]
|