DISCO
class hyppo.ksample.DISCO(compute_distance='euclidean', bias=False, **kwargs)
Distance Components (DISCO) test statistic and p-value.
DISCO is a powerful multivariate k-sample test. It operates on distance matrices, much like distance correlation (Dcorr). In fact, the DISCO statistic is equivalent to our 2-sample formulation of nonparametric MANOVA via independence testing, i.e. hyppo.ksample.KSample, and to hyppo.independence.Dcorr, hyppo.ksample.Energy, hyppo.independence.Hsic, and hyppo.ksample.MMD [1] [2].
Traditionally, the formulation for the DISCO statistic is as follows [3]:
Define \(\{ u^i_1 \stackrel{iid}{\sim} F_{U_1},\ i = 1, ..., n_1 \}\) up to \(\{ u^j_k \stackrel{iid}{\sim} F_{U_k},\ j = 1, ..., n_k \}\) as k groups of samples drawn from possibly different distributions with the same dimensionality. If \(d(\cdot, \cdot)\) is a distance metric (e.g. Euclidean), \(N = \sum_{i = 1}^k n_i\), and \(\mathrm{Energy}\) is the Energy test statistic from hyppo.ksample.Energy, then
\[\mathrm{DISCO}_N(\mathbf{u}_1, \ldots, \mathbf{u}_k) = \sum_{1 \leq a < b \leq k} \frac{n_a n_b}{2N} \mathrm{Energy}_{n_a + n_b} (\mathbf{u}_a, \mathbf{u}_b)\]
The implementation in the
hyppo.ksample.KSample class (using hyppo.independence.Dcorr) is in fact equivalent to this implementation (for p-values), and the statistics are equivalent up to a scaling factor [1].
The p-value returned is calculated using a permutation test via hyppo.tools.perm_test. The fast version of the test uses hyppo.tools.chi2_approx.
- Parameters
compute_distance (str, callable, or None, default: "euclidean") -- A function that computes the distance among the samples within each data matrix. Valid strings for compute_distance are, as defined in sklearn.metrics.pairwise_distances,
From scikit-learn: ["euclidean", "cityblock", "cosine", "l1", "l2", "manhattan"] See the documentation for scipy.spatial.distance for details on these metrics.
From scipy.spatial.distance: ["braycurtis", "canberra", "chebyshev", "correlation", "dice", "hamming", "jaccard", "kulsinski", "mahalanobis", "minkowski", "rogerstanimoto", "russellrao", "seuclidean", "sokalmichener", "sokalsneath", "sqeuclidean", "yule"] See the documentation for scipy.spatial.distance for details on these metrics.
Set to None or "precomputed" if x and y are already distance matrices. To call a custom function, either create the distance matrix beforehand or create a function of the form metric(x, **kwargs) where x is the data matrix for which pairwise distances are calculated and **kwargs are extra arguments to send to your custom function.
bias (bool, default: False) -- Whether to use the biased or unbiased test statistic.
**kwargs -- Arbitrary keyword arguments for compute_distance.
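The traditional formulation in the class description, a weighted sum of pairwise Energy statistics, can be sketched in plain Python. This is only an illustration and not hyppo's code: the function names `mean_pairwise_distance`, `energy`, and `disco_sketch` are hypothetical, the distance is assumed to be Euclidean, and the Energy statistic is assumed to take the biased form (twice the mean between-group distance minus the two mean within-group distances). Because the description states that hyppo's statistic agrees with this formulation only up to a scaling factor, the numbers below should not be expected to match the output of `DISCO().statistic`.

```python
from itertools import combinations
from math import dist


def mean_pairwise_distance(a, b):
    """Average Euclidean distance over all pairs (one point from a, one from b)."""
    return sum(dist(p, q) for p in a for q in b) / (len(a) * len(b))


def energy(a, b):
    """Two-sample energy statistic in its biased form (an assumed convention)."""
    return (
        2 * mean_pairwise_distance(a, b)
        - mean_pairwise_distance(a, a)
        - mean_pairwise_distance(b, b)
    )


def disco_sketch(*groups):
    """Sum of n_a * n_b / (2N) * Energy(u_a, u_b) over all pairs of groups."""
    n_total = sum(len(g) for g in groups)
    return sum(
        len(a) * len(b) / (2 * n_total) * energy(a, b)
        for a, b in combinations(groups, 2)
    )


# Three small groups of 2-dimensional points (made-up data).
g1 = [(0.0, 0.0), (1.0, 0.0)]
g2 = [(0.0, 1.0), (1.0, 1.0)]
g3 = [(5.0, 5.0), (6.0, 5.0)]

same = disco_sketch(g1, g1)          # identical groups: the statistic is 0
shifted = disco_sketch(g1, g2, g3)   # separated groups: the statistic is positive
```

Under these assumptions the statistic grows as the groups move apart, which is the behaviour a k-sample test relies on.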
Methods Summary
| DISCO.statistic(*args) | Calculates the DISCO test statistic. |
| DISCO.test(*args[, reps, workers, auto]) | Calculates the DISCO test statistic and p-value. |
DISCO.statistic(*args)
Calculates the DISCO test statistic.
DISCO.test(*args, reps=1000, workers=1, auto=True)
Calculates the DISCO test statistic and p-value.
- Parameters
*args (ndarray) -- Variable length input data matrices. All inputs must have the same number of samples and dimensions. That is, the shapes must be (n, p) where n is the number of samples and p is the number of dimensions.
reps (int, default: 1000) -- The number of replications used to estimate the null distribution when the p-value is calculated with the permutation test.
workers (int, default: 1) -- The number of cores to parallelize the p-value computation over. Supply -1 to use all cores available to the Process.
auto (bool, default: True) -- Whether to automatically use the fast approximation for larger samples. If True and the sample size n is greater than 20, hyppo.tools.chi2_approx will be run, and the parameters reps and workers are irrelevant. Otherwise, hyppo.tools.perm_test will be run.
- Returns
stat -- The computed DISCO statistic.
pvalue -- The computed DISCO p-value.
Examples
>>> import numpy as np
>>> from hyppo.ksample import DISCO
>>> x = np.arange(7)
>>> y = x
>>> stat, pvalue = DISCO().test(x, y)
>>> '%.3f, %.1f' % (stat, pvalue)
'-1.566, 1.0'
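To make the role of `reps` concrete, here is a minimal sketch of a generic k-sample permutation test: pool the samples, reshuffle the group membership `reps` times, and compare each permuted statistic with the observed one. It is not `hyppo.tools.perm_test`; the names `statistic` and `permutation_pvalue`, the Euclidean energy-based statistic, and the add-one p-value convention are all assumptions made for illustration.

```python
import random
from itertools import combinations
from math import dist


def mean_dist(a, b):
    """Average Euclidean distance over all pairs (one point from a, one from b)."""
    return sum(dist(p, q) for p in a for q in b) / (len(a) * len(b))


def statistic(groups):
    """Weighted sum of pairwise energy statistics (assumed biased form)."""
    n_total = sum(len(g) for g in groups)
    return sum(
        len(a) * len(b) / (2 * n_total)
        * (2 * mean_dist(a, b) - mean_dist(a, a) - mean_dist(b, b))
        for a, b in combinations(groups, 2)
    )


def permutation_pvalue(groups, reps=1000, seed=0):
    """Estimate a p-value by shuffling points across groups `reps` times."""
    rng = random.Random(seed)
    observed = statistic(groups)
    pooled = [p for g in groups for p in g]
    sizes = [len(g) for g in groups]
    exceed = 0
    for _ in range(reps):
        rng.shuffle(pooled)
        # Re-split the shuffled pool into groups of the original sizes.
        permuted, start = [], 0
        for size in sizes:
            permuted.append(pooled[start:start + size])
            start += size
        if statistic(permuted) >= observed:
            exceed += 1
    # Add-one convention (an assumption here): the estimate is never exactly 0.
    return (1 + exceed) / (1 + reps)


# Two well-separated groups of 1-dimensional points (made-up data).
a = [(float(v),) for v in range(5)]
b = [(float(v),) for v in range(100, 105)]
p_separated = permutation_pvalue([a, b], reps=200, seed=0)
```

With this convention the smallest attainable p-value is 1 / (1 + reps), so a larger `reps` gives a finer-grained estimate at a proportionally higher cost.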