Understanding Z-Test and P-Value with ML Use Cases

Learn about z-test and p-value in statistics with detailed examples and Python code. Understand how they apply to Machine Learning and Deep Learning for model evaluation.

What is a P-Value?

The p-value is a probability that measures the strength of the evidence against the null hypothesis. Specifically, it is the probability of observing a test statistic (like the z-score) at least as extreme as the one computed from your sample, assuming that the null hypothesis is true.

A smaller p-value indicates stronger evidence against the null hypothesis. The thresholds below are common conventions for rejecting the null hypothesis, not strict rules:

p < 0.05: statistically significant
p < 0.01: highly significant
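
As a minimal sketch of the definition above, the snippet below converts a z-score into one-sided and two-sided p-values under a standard normal null distribution. The helper name `p_value_from_z` and the example z-score of 1.96 are illustrative assumptions, not part of the original example.

```python
from scipy import stats

def p_value_from_z(z, alternative="two-sided"):
    """P-value of a z-score, assuming a standard normal null distribution."""
    if alternative == "two-sided":
        # Probability of a value at least as extreme in either tail
        return 2 * stats.norm.sf(abs(z))
    if alternative == "greater":
        # Upper tail only: sf(z) = 1 - cdf(z)
        return stats.norm.sf(z)
    if alternative == "less":
        # Lower tail only
        return stats.norm.cdf(z)
    raise ValueError("alternative must be 'two-sided', 'greater', or 'less'")

z_example = 1.96  # illustrative value
p_two_sided = p_value_from_z(z_example)
p_upper_tail = p_value_from_z(z_example, "greater")

print("Two-sided p-value:", p_two_sided)
print("One-sided p-value:", p_upper_tail)
```

A two-sided test asks whether the statistic is extreme in either direction, so its p-value is twice the one-sided value for the same z-score.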

Python Example of Z-Test

Let’s assume we want to test whether the mean of a sample differs from a known population mean, given that the population standard deviation is also known:


import numpy as np
from scipy import stats

# Sample data
sample = [2.9, 3.0, 2.5, 3.2, 3.8, 3.5]
mu = 3.0       # Population mean
sigma = 0.5    # Population std deviation
n = len(sample)
x_bar = np.mean(sample)

# Calculate z-score
z = (x_bar - mu) / (sigma / np.sqrt(n))
p_value = 2 * stats.norm.sf(abs(z))  # two-sided; sf(z) = 1 - cdf(z), more precise in the tails

print("Z-score:", z)
print("P-value:", p_value)
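
For reference, the arithmetic behind this example can be written out by hand. In the sketch below, \Phi denotes the standard normal cumulative distribution function, and the values are rounded to two decimals:

```latex
\bar{x} = \frac{2.9 + 3.0 + 2.5 + 3.2 + 3.8 + 3.5}{6} = 3.15
\qquad
z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}} = \frac{3.15 - 3.0}{0.5 / \sqrt{6}} \approx 0.73
\qquad
p = 2\,\bigl(1 - \Phi(|z|)\bigr) \approx 0.46
```

Since the p-value is well above 0.05, this sample gives no grounds to reject the null hypothesis that the population mean is 3.0.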

Using Z-Test and P-Value in ML/DL

In Machine Learning (ML) and Deep Learning (DL), z-tests and p-values help validate experimental results, such as whether a new model significantly outperforms a baseline model. Without statistical testing, we might mistake random fluctuations in performance for real improvements.

Compare two models: Test if the performance difference between two models (e.g., accuracy) is statistically significant.
A/B testing: Evaluate changes in algorithms, UI components, or features based on user interactions.
Feature selection: Check whether the mean of a feature differs significantly between classes, which may indicate predictive power.
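
To make the A/B testing case concrete, here is a minimal sketch of a two-proportion z-test. The conversion counts are hypothetical numbers chosen for illustration, and the variable names are assumptions rather than part of any particular library.

```python
import numpy as np
from scipy import stats

# Hypothetical A/B test counts (illustrative numbers, not real data)
conversions_a, visitors_a = 120, 1000   # control
conversions_b, visitors_b = 150, 1000   # variant

rate_a = conversions_a / visitors_a
rate_b = conversions_b / visitors_b

# Pooled conversion rate under the null hypothesis that both variants share one rate
pooled = (conversions_a + conversions_b) / (visitors_a + visitors_b)
std_err = np.sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))

# Two-proportion z-statistic and two-sided p-value
z = (rate_b - rate_a) / std_err
p_value = 2 * stats.norm.sf(abs(z))

print("Z-score:", z)
print("P-value:", p_value)
```

The normal approximation used here is reasonable when each group has many observations; with small counts, an exact test may be a better choice.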

Example: Comparing Two Models

Let’s compare the accuracy of two models over multiple runs:


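import numpy as np
from scipy import stats

# Accuracy of each model over five independent runs
acc_model_a = [0.83, 0.85, 0.82, 0.84, 0.86]
acc_model_b = [0.79, 0.78, 0.80, 0.77, 0.81]

mean_a = np.mean(acc_model_a)
mean_b = np.mean(acc_model_b)
n = len(acc_model_a)  # both models have the same number of runs

# Pooled within-group standard deviation. Taking the std of the concatenated
# lists would mix the gap between the models into the noise estimate.
sd = np.sqrt((np.var(acc_model_a, ddof=1) + np.var(acc_model_b, ddof=1)) / 2)

# Two-sample z-statistic; with so few runs, treat the normal approximation as rough
z = (mean_a - mean_b) / (sd * np.sqrt(2 / n))
p = 2 * stats.norm.sf(abs(z))  # two-sided p-value

print("Z-Score:", z)
print("P-Value:", p)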
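
With only five runs per model, the normal approximation behind a z-test is questionable, and a t-test is a common alternative for small samples. As a cross-check, here is a minimal sketch using SciPy's `stats.ttest_ind` with `equal_var=False` (Welch's t-test) on the same accuracy lists:

```python
from scipy import stats

# Same accuracy values as in the example above
acc_model_a = [0.83, 0.85, 0.82, 0.84, 0.86]
acc_model_b = [0.79, 0.78, 0.80, 0.77, 0.81]

# Welch's t-test: does not assume the two models have equal variance
t_stat, p_val = stats.ttest_ind(acc_model_a, acc_model_b, equal_var=False)

print("T-statistic:", t_stat)
print("P-value:", p_val)
```

The t-distribution has heavier tails than the normal distribution, so for the same data it generally yields a larger, more conservative p-value than the z-test.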

Conclusion

The z-test and p-value are essential statistical tools for validating model improvements, experimental hypotheses, and performance evaluations. In ML/DL pipelines, applying these tests helps you check that an observed improvement is unlikely to be explained by random variation alone.

AI Practitioner
