    #include <boost/math/distributions/kolmogorov_smirnov.hpp>

    namespace boost{ namespace math{

    template <class RealType = double,
              class Policy   = policies::policy<> >
    class kolmogorov_smirnov_distribution;

    typedef kolmogorov_smirnov_distribution<> kolmogorov_smirnov;

    template <class RealType, class Policy>
    class kolmogorov_smirnov_distribution
    {
    public:
       typedef RealType  value_type;
       typedef Policy    policy_type;

       // Constructor:
       kolmogorov_smirnov_distribution(RealType n);

       // Accessor to parameter:
       RealType number_of_observations()const;
    };

    }} // namespaces
The Kolmogorov-Smirnov test in statistics compares two empirical distributions, or an empirical distribution against any theoretical distribution.[1] It makes use of a specific distribution, informally known in the literature as the Kolmogorov-Smirnov distribution, which is implemented here.
Formally, if n observations are taken from a theoretical distribution G(x), and if G_n(x) represents the empirical CDF of those n observations, then the test statistic

$$D_n = \sup_x \left| G(x) - G_n(x) \right|$$

will be distributed according to a Kolmogorov-Smirnov distribution parameterized by n.
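As a minimal sketch of how such a test statistic might be computed, the following helper takes a sample and a theoretical CDF and returns the largest absolute gap between G(x) and the empirical CDF. The name `ks_statistic` is hypothetical and not part of Boost.Math, and the sketch assumes G is continuous, so that the supremum is attained at one side of a jump of the empirical CDF.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of Boost.Math): returns
// sup_x |G(x) - G_n(x)| for the given sample and theoretical CDF G.
template <class Cdf>
double ks_statistic(std::vector<double> sample, Cdf G)
{
    assert(!sample.empty());
    std::sort(sample.begin(), sample.end());
    const double n = static_cast<double>(sample.size());
    double d = 0.0;
    for (std::size_t i = 0; i < sample.size(); ++i)
    {
        const double g = G(sample[i]);
        // The empirical CDF jumps from i/n to (i+1)/n at the i-th
        // (zero-based) order statistic; check both sides of the jump.
        const double below = g - static_cast<double>(i) / n;
        const double above = static_cast<double>(i + 1) / n - g;
        d = std::max(d, std::max(below, above));
    }
    return d;
}
```

For the sample {0.7, 0.1, 0.4} tested against the uniform distribution on [0, 1], where G(x) = x, the largest gap occurs just after the last observation and the statistic is 0.3.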
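The exact form of a Kolmogorov-Smirnov distribution is the subject of a large, decades-old literature.[2] In the interest of simplicity, Boost implements the first-order, limiting form of this distribution (the same form originally identified by Kolmogorov[3]), namely

$$\lim_{n \to \infty} F_n\!\left(x / \sqrt{n}\right) = 1 + 2 \sum_{k=1}^{\infty} (-1)^k e^{-2 k^2 x^2}$$

where F_n denotes the CDF of the test statistic for n observations.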
Note that while the exact distribution only has support over [0, 1], this limiting form has positive mass above unity, particularly for small n. The following graph illustrates how the distribution changes for different values of n:
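To make the mass above unity concrete, here is a rough sketch that sums the limiting series directly, written for n observations as F_n(x) = 1 + 2 Σ_{k≥1} (-1)^k exp(-2 k² x² n). The function name `ks_limiting_cdf` is hypothetical, and this is not how Boost evaluates the CDF: direct summation is numerically naive for small x² n, where many terms nearly cancel.

```cpp
#include <cassert>
#include <cmath>

// Illustrative only: direct summation of
//   F_n(x) = 1 + 2 * sum_{k>=1} (-1)^k * exp(-2 k^2 x^2 n).
double ks_limiting_cdf(double x, double n)
{
    assert(n > 0.0);
    const double t = x * x * n;
    if (x <= 0.0 || t == 0.0)
        return 0.0;
    double sum = 0.0;
    double sign = -1.0;
    for (int k = 1; k <= 100000; ++k)
    {
        const double kk = static_cast<double>(k) * k;
        const double term = std::exp(-2.0 * kk * t);
        sum += sign * term;
        sign = -sign;
        if (term < 1e-17)   // remaining terms no longer affect a double
            break;
    }
    return 1.0 + 2.0 * sum;
}
```

With this sketch, `1 - ks_limiting_cdf(1.0, 1.0)` is about 0.27, so for n = 1 roughly a quarter of the mass lies above unity, whereas for n = 100 the mass above unity is far below double precision.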
kolmogorov_smirnov_distribution(RealType n);
Constructs a Kolmogorov-Smirnov distribution with n observations.
Requires n > 0, otherwise calls domain_error.
RealType number_of_observations()const;
Returns the parameter n from which this object was constructed.
All the usual non-member accessor functions that are generic to all distributions are supported: Cumulative Distribution Function, Probability Density Function, Quantile, Hazard Function, Cumulative Hazard Function, mean, median, mode, variance, standard deviation, skewness, kurtosis, kurtosis_excess, range and support.
The domain of the random variable is [0, +∞].
The CDF of the Kolmogorov-Smirnov distribution is implemented in terms of the fourth Jacobi Theta function; please refer to the accuracy ULP plots for that function.
The PDF is implemented separately, and the following ULP plot illustrates its accuracy:
Because PDF values are simply scaled out and up by the square root of n, the above plot is representative for all values of n. Note that for present purposes, "accuracy" refers to deviations from the limiting approximation, rather than deviations from the exact distribution.
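The scaling can be checked with a small sketch. It assumes the PDF is obtained by differentiating the limiting series term by term, which gives f_n(x) = 8 x n Σ_{k≥1} (-1)^(k+1) k² exp(-2 k² x² n); whether Boost's manual derivative takes exactly this form is an assumption here, and the name `ks_limiting_pdf` is hypothetical. Direct summation is inaccurate for small x² n.

```cpp
#include <cassert>
#include <cmath>

// Illustrative only: term-by-term derivative of
//   F_n(x) = 1 + 2 * sum_{k>=1} (-1)^k * exp(-2 k^2 x^2 n),
// i.e. f_n(x) = 8 x n * sum_{k>=1} (-1)^(k+1) k^2 exp(-2 k^2 x^2 n).
double ks_limiting_pdf(double x, double n)
{
    assert(n > 0.0);
    const double t = x * x * n;
    if (x <= 0.0 || t == 0.0)
        return 0.0;
    double sum = 0.0;
    double sign = 1.0;
    for (int k = 1; k <= 100000; ++k)
    {
        const double kk = static_cast<double>(k) * k;
        const double term = kk * std::exp(-2.0 * kk * t);
        sum += sign * term;
        sign = -sign;
        if (term < 1e-17)   // past the peak, terms only shrink
            break;
    }
    return 8.0 * x * n * sum;
}
```

In this sketch, `ks_limiting_pdf(0.1, 100.0)` equals `10.0 * ks_limiting_pdf(1.0, 1.0)` up to rounding, which is the relation f_n(x) = sqrt(n) * f_1(x * sqrt(n)) for n = 100.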
In the following table, n is the number of observations, x is the random variable, π is Archimedes' constant, and ζ(3) is Apéry's constant.
| Function | Implementation Notes |
|---|---|
| cdf | Using the relation: cdf = jacobi_theta4tau(0, 2*x*x*n/π) |
| pdf | Using a manual derivative of the CDF |
| cdf complement | When x*x*n == 0: 1. When 2*x*x*n <= π: 1 - jacobi_theta4tau(0, 2*x*x*n/π). When 2*x*x*n > π: -jacobi_theta4m1tau(0, 2*x*x*n/π) |
| quantile | Using a Newton-Raphson iteration |
| quantile from the complement | Using a Newton-Raphson iteration |
| mode | Using a run-time PDF maximizer |
| mean | sqrt(π/2) * ln(2) / sqrt(n) |
| variance | (π²/12 - π/2*ln²(2))/n |
| skewness | (9/16*sqrt(π/2)*ζ(3)/n^(3/2) - 3 * mean * variance - mean³) / (variance^(3/2)) |
| kurtosis | (7/720*π⁴/n² - 4 * mean * skewness * variance^(3/2) - 6 * mean² * variance - mean⁴) / (variance²) |
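The Newton-Raphson approach to the quantile can be illustrated with a self-contained sketch. Everything here is an assumption made for illustration: the function names are hypothetical, the CDF and PDF are summed directly from the limiting series rather than through the theta functions, and the starting guess, bracketing and bisection fallback are choices of this sketch rather than a description of Boost's iteration.

```cpp
#include <cassert>
#include <cmath>

// Limiting CDF by direct series summation (illustrative only).
double sketch_cdf(double x, double n)
{
    const double t = x * x * n;
    if (x <= 0.0 || t == 0.0) return 0.0;
    double sum = 0.0, sign = -1.0;
    for (int k = 1; k <= 100000; ++k, sign = -sign)
    {
        const double kk = static_cast<double>(k) * k;
        const double term = std::exp(-2.0 * kk * t);
        sum += sign * term;
        if (term < 1e-17) break;
    }
    return 1.0 + 2.0 * sum;
}

// Term-by-term derivative of the series above (illustrative only).
double sketch_pdf(double x, double n)
{
    const double t = x * x * n;
    if (x <= 0.0 || t == 0.0) return 0.0;
    double sum = 0.0, sign = 1.0;
    for (int k = 1; k <= 100000; ++k, sign = -sign)
    {
        const double kk = static_cast<double>(k) * k;
        const double term = kk * std::exp(-2.0 * kk * t);
        sum += sign * term;
        if (term < 1e-17) break;
    }
    return 8.0 * x * n * sum;
}

// Newton-Raphson on cdf(x) - p, kept inside a bracket [lo, hi] and
// falling back to bisection whenever a Newton step leaves the bracket.
double sketch_quantile(double p, double n)
{
    assert(n > 0.0 && p > 0.0 && p < 1.0);
    double lo = 0.0;
    double hi = 1.0;
    while (sketch_cdf(hi, n) < p) hi *= 2.0;   // widen until the root is bracketed
    double x = 0.5 * (lo + hi);
    for (int iter = 0; iter < 100; ++iter)
    {
        const double f = sketch_cdf(x, n) - p;
        if (f == 0.0) return x;
        if (f > 0.0) hi = x; else lo = x;
        const double d = sketch_pdf(x, n);
        double next = (d > 0.0) ? x - f / d : 0.5 * (lo + hi);
        if (!(next > lo && next < hi)) next = 0.5 * (lo + hi);
        if (std::fabs(next - x) <= 1e-14 * std::fabs(x)) return next;
        x = next;
    }
    return x;
}
```

Under these assumptions, `sketch_quantile(0.5, 1.0)` lands near 0.8276, and feeding a returned quantile back into `sketch_cdf` recovers the requested probability.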
[2] Simard, R. and L'Ecuyer, P. (2011) "Computing the Two-Sided Kolmogorov-Smirnov Distribution". Journal of Statistical Software, vol. 39, no. 11.
[3] Kolmogorov, A. (1933) "Sulla determinazione empirica di una legge di distribuzione". G. Ist. Ital. Attuari, vol. 4, pp. 83–91.