Synthetic dataset
First, let’s import a few packages to do the job.
import uuid
import pandas as pd
from scipy import stats
Now we can create a synthetic dataset to support the discussion:
def create_dataset(id_gen=uuid.uuid4, law=stats.norm, size=300):
    return pd.DataFrame({"value": law.rvs(size=size)}).assign(id=id_gen())
data = pd.concat([create_dataset() for _ in range(5)])
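As a quick sanity check (a self-contained rerun of the snippet above), the concatenated frame should hold 5 × 300 rows and 5 distinct ids:

```python
import uuid

import pandas as pd
from scipy import stats


def create_dataset(id_gen=uuid.uuid4, law=stats.norm, size=300):
    # Each call draws a fresh sample and stamps it with a single new id.
    return pd.DataFrame({"value": law.rvs(size=size)}).assign(id=id_gen())


data = pd.concat([create_dataset() for _ in range(5)])

print(data.shape)            # (1500, 2): 5 samples of 300 rows each
print(data["id"].nunique())  # 5 distinct group labels
```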
Pandas aggregation magic
The key to easy and flexible aggregation with pandas is to define functions with a specific signature:
- the first argument receives the slice of the dataframe you want to work on;
- the next arguments are parameters controlling the function;
- the returned structure must be a pandas Series.
If we match those signature criteria, our aggregation will rock!
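As a minimal illustration of this signature (the `describe_group` name and its columns are my own), a function returning a Series becomes, through `groupby.apply`, one row per group, with the Series index as columns:

```python
import uuid

import pandas as pd
from scipy import stats


def describe_group(x):
    # First argument: the slice of the dataframe for one group.
    # Return value: a Series, whose index becomes the result columns.
    return pd.Series({"count": len(x), "mean": x["value"].mean()})


data = pd.concat(
    [pd.DataFrame({"value": stats.norm.rvs(size=300)}).assign(id=uuid.uuid4())
     for _ in range(3)]
)

summary = data.groupby("id").apply(describe_group)
print(summary)  # one row per id, columns "count" and "mean"
```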
Inferring statistics
Let’s say we want to assess the normality of our samples by fitting them with the MLE procedure implemented in scipy.stats. We just need to wrap it to match the signature requirements:
def fit(x, law=stats.norm):
    parameters = law.fit(x["value"])
    return pd.Series({
        "mean": parameters[0],
        "std": parameters[1],
        "data": x["value"].values,
        "rv": law(*parameters),
    })
Now we can apply it to our dataframe through the groupby accessor:
grouped = data.groupby("id").apply(fit)
The result of this operation is a dataframe:
id | mean | std | data | rv |
---|---|---|---|---|
06a6aeae-1415-434f-8183-70214fe6e86e | 0.000852 | 0.977518 | [0.765054, …] | <scipy.stats._distn_…> |
7c904f58-640d-4f68-8710-14d9b2ea9177 | -0.029186 | 1.026545 | [-1.085630, …] | <scipy.stats._distn_…> |
91af2683-ab09-495b-a429-00d994a8ec77 | 0.089522 | 0.872507 | [1.111701, …] | <scipy.stats._distn_…> |
a4a2e0c7-5f42-4637-9f2e-13e844d9edd3 | -0.045189 | 0.968537 | [0.551302, …] | <scipy.stats._distn_…> |
f0bb5a17-c731-467b-b2ee-d95a15801ec1 | -0.048523 | 1.011672 | [1.140656, …] | <scipy.stats._distn_…> |
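Because extra arguments control the function, they can be forwarded through `apply` to fit any two-parameter scipy law with the same code. A self-contained sketch, swapping in `stats.laplace` (the "mean"/"std" labels then read as location/scale):

```python
import uuid

import pandas as pd
from scipy import stats


def fit(x, law=stats.norm):
    # MLE fit of the chosen law on the "value" column of the group slice.
    parameters = law.fit(x["value"])
    return pd.Series({
        "mean": parameters[0],
        "std": parameters[1],
        "data": x["value"].values,
        "rv": law(*parameters),
    })


data = pd.concat(
    [pd.DataFrame({"value": stats.norm.rvs(size=300)}).assign(id=uuid.uuid4())
     for _ in range(2)]
)

# Keyword arguments after the function are forwarded by apply:
grouped = data.groupby("id").apply(fit, law=stats.laplace)
print(grouped[["mean", "std"]])
```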
Applying hypothesis test
Now we can create a function with the same signature that performs a hypothesis test:
def hypothesis_test(x, test=stats.kstest):
    check = test(x["data"], x["rv"].cdf)
    return pd.Series({
        "statistic": check.statistic,
        "pvalue": check.pvalue,
    })
Then we apply it to our grouped dataframe and assemble the result:
final = pd.concat([
    grouped,
    grouped.apply(hypothesis_test, axis=1)
], axis=1)
The final result looks like:
id | mean | std | data | rv | statistic | pvalue |
---|---|---|---|---|---|---|
06a6aeae-1415-434f-8183-70214fe6e86e | 0.000852 | 0.977518 | [0.765054, …] | <scipy.stats._distn_…> | 0.024662 | 0.991204 |
7c904f58-640d-4f68-8710-14d9b2ea9177 | -0.029186 | 1.026545 | [-1.085630, …] | <scipy.stats._distn_…> | 0.022014 | 0.998044 |
91af2683-ab09-495b-a429-00d994a8ec77 | 0.089522 | 0.872507 | [1.111701, …] | <scipy.stats._distn_…> | 0.028399 | 0.963259 |
a4a2e0c7-5f42-4637-9f2e-13e844d9edd3 | -0.045189 | 0.968537 | [0.551302, …] | <scipy.stats._distn_…> | 0.046312 | 0.525440 |
f0bb5a17-c731-467b-b2ee-d95a15801ec1 | -0.048523 | 1.011672 | [1.140656, …] | <scipy.stats._distn_…> | 0.035174 | 0.838864 |
This exposes all the usual steps of a classical statistical test: fitted parameters, the fitted distribution, the test statistic, and the p-value.
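Putting the pieces together, the p-value column makes the decision step a one-liner. A self-contained sketch of the whole pipeline, filtering at an (assumed) 5% significance level:

```python
import uuid

import pandas as pd
from scipy import stats


def fit(x, law=stats.norm):
    parameters = law.fit(x["value"])
    return pd.Series({
        "mean": parameters[0],
        "std": parameters[1],
        "data": x["value"].values,
        "rv": law(*parameters),
    })


def hypothesis_test(x, test=stats.kstest):
    # Kolmogorov-Smirnov test of the raw sample against the fitted CDF.
    check = test(x["data"], x["rv"].cdf)
    return pd.Series({"statistic": check.statistic, "pvalue": check.pvalue})


data = pd.concat(
    [pd.DataFrame({"value": stats.norm.rvs(size=300)}).assign(id=uuid.uuid4())
     for _ in range(5)]
)

grouped = data.groupby("id").apply(fit)
final = pd.concat([grouped, grouped.apply(hypothesis_test, axis=1)], axis=1)

# Samples whose p-value exceeds the threshold are not rejected
# as draws from the fitted normal law:
compatible = final[final["pvalue"] > 0.05]
print(len(compatible), "of", len(final), "samples pass the test")
```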