Synthetic dataset
First, let’s import a few packages to do the job.
import uuid
import pandas as pd
from scipy import stats
Now we can create a synthetic dataset to support the discussion:
def create_dataset(id_gen=uuid.uuid4, law=stats.norm, size=300):
    return pd.DataFrame({"value": law.rvs(size=size)}).assign(id=id_gen())
data = pd.concat([create_dataset() for _ in range(5)])
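As a quick sanity check (a self-contained rerun of the snippet above), the concatenated frame should hold 5 × 300 rows and 5 distinct ids:

```python
import uuid

import pandas as pd
from scipy import stats


def create_dataset(id_gen=uuid.uuid4, law=stats.norm, size=300):
    # Each call draws a fresh sample and stamps it with a single new id.
    return pd.DataFrame({"value": law.rvs(size=size)}).assign(id=id_gen())


data = pd.concat([create_dataset() for _ in range(5)])

print(data.shape)            # (1500, 2): 5 samples of 300 rows each
print(data["id"].nunique())  # 5 distinct group labels
```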
Pandas aggregation magic
The key to easy and flexible aggregation with pandas is to define functions with a specific signature:
- the first argument receives the slice of the dataframe you want to work on;
- the next arguments are parameters controlling the function;
- the returned structure must be a pandas Series.
If we match those signature criteria, our aggregation will rock!
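As a minimal illustration of this signature (the `describe_group` name and its columns are my own), a function returning a Series becomes, through `groupby.apply`, one row per group, with the Series index as columns:

```python
import uuid

import pandas as pd
from scipy import stats


def describe_group(x):
    # First argument: the slice of the dataframe for one group.
    # Return value: a Series, whose index becomes the result columns.
    return pd.Series({"count": len(x), "mean": x["value"].mean()})


data = pd.concat(
    [pd.DataFrame({"value": stats.norm.rvs(size=300)}).assign(id=uuid.uuid4())
     for _ in range(3)]
)

summary = data.groupby("id").apply(describe_group)
print(summary)  # one row per id, columns "count" and "mean"
```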
Inferring statistics
Let’s say we want to assess the normality of our samples by fitting them with the MLE procedure implemented in scipy.stats. We just need to wrap it to match the signature requirements:
def fit(x, law=stats.norm):
    parameters = law.fit(x["value"])
    return pd.Series({
        "mean": parameters[0],
        "std": parameters[1],
        "data": x["value"].values,
        "rv": law(*parameters),
    })
Now we can apply it to our dataframe through the groupby accessor:
grouped = data.groupby("id").apply(fit)
The result of this operation is a dataframe:
id | mean | std | data | rv |
---|---|---|---|---|
06a6aeae-1415-434f-8183-70214fe6e86e | 0.000852 | 0.977518 | [0.765054, …] | <scipy.stats._distn_…> |
7c904f58-640d-4f68-8710-14d9b2ea9177 | -0.029186 | 1.026545 | [-1.085630, …] | <scipy.stats._distn_…> |
91af2683-ab09-495b-a429-00d994a8ec77 | 0.089522 | 0.872507 | [1.111701, …] | <scipy.stats._distn_…> |
a4a2e0c7-5f42-4637-9f2e-13e844d9edd3 | -0.045189 | 0.968537 | [0.551302, …] | <scipy.stats._distn_…> |
f0bb5a17-c731-467b-b2ee-d95a15801ec1 | -0.048523 | 1.011672 | [1.140656, …] | <scipy.stats._distn_…> |
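Because extra arguments control the function, they can be forwarded through `apply` to fit any two-parameter scipy law with the same code. A self-contained sketch, swapping in `stats.laplace` (the "mean"/"std" labels then read as location/scale):

```python
import uuid

import pandas as pd
from scipy import stats


def fit(x, law=stats.norm):
    # MLE fit of the chosen law on the "value" column of the group slice.
    parameters = law.fit(x["value"])
    return pd.Series({
        "mean": parameters[0],
        "std": parameters[1],
        "data": x["value"].values,
        "rv": law(*parameters),
    })


data = pd.concat(
    [pd.DataFrame({"value": stats.norm.rvs(size=300)}).assign(id=uuid.uuid4())
     for _ in range(2)]
)

# Keyword arguments after the function are forwarded by apply:
grouped = data.groupby("id").apply(fit, law=stats.laplace)
print(grouped[["mean", "std"]])
```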
Applying hypothesis test
Now we can create a function with the same signature that performs a hypothesis test:
def hypothesis_test(x, test=stats.kstest):
    check = test(x["data"], x["rv"].cdf)
    return pd.Series({
        "statistic": check.statistic,
        "pvalue": check.pvalue,
    })
Then we apply it to our grouped dataframe and assemble the result:
final = pd.concat([
    grouped,
    grouped.apply(hypothesis_test, axis=1)
], axis=1)
The final result looks like:
id | mean | std | data | rv | statistic | pvalue |
---|---|---|---|---|---|---|
06a6aeae-1415-434f-8183-70214fe6e86e | 0.000852 | 0.977518 | [0.765054, …] | <scipy.stats._distn_…> | 0.024662 | 0.991204 |
7c904f58-640d-4f68-8710-14d9b2ea9177 | -0.029186 | 1.026545 | [-1.085630, …] | <scipy.stats._distn_…> | 0.022014 | 0.998044 |
91af2683-ab09-495b-a429-00d994a8ec77 | 0.089522 | 0.872507 | [1.111701, …] | <scipy.stats._distn_…> | 0.028399 | 0.963259 |
a4a2e0c7-5f42-4637-9f2e-13e844d9edd3 | -0.045189 | 0.968537 | [0.551302, …] | <scipy.stats._distn_…> | 0.046312 | 0.525440 |
f0bb5a17-c731-467b-b2ee-d95a15801ec1 | -0.048523 | 1.011672 | [1.140656, …] | <scipy.stats._distn_…> | 0.035174 | 0.838864 |
This exposes all the usual steps of a classical statistical test: fitted parameters, the fitted distribution, the test statistic, and the p-value.
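Putting the pieces together, the p-value column makes the decision step a one-liner. A self-contained sketch of the whole pipeline, filtering at an (assumed) 5% significance level:

```python
import uuid

import pandas as pd
from scipy import stats


def fit(x, law=stats.norm):
    parameters = law.fit(x["value"])
    return pd.Series({
        "mean": parameters[0],
        "std": parameters[1],
        "data": x["value"].values,
        "rv": law(*parameters),
    })


def hypothesis_test(x, test=stats.kstest):
    # Kolmogorov-Smirnov test of the raw sample against the fitted CDF.
    check = test(x["data"], x["rv"].cdf)
    return pd.Series({"statistic": check.statistic, "pvalue": check.pvalue})


data = pd.concat(
    [pd.DataFrame({"value": stats.norm.rvs(size=300)}).assign(id=uuid.uuid4())
     for _ in range(5)]
)

grouped = data.groupby("id").apply(fit)
final = pd.concat([grouped, grouped.apply(hypothesis_test, axis=1)], axis=1)

# Samples whose p-value exceeds the threshold are not rejected
# as draws from the fitted normal law:
compatible = final[final["pvalue"] > 0.05]
print(len(compatible), "of", len(final), "samples pass the test")
```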