Mathematical basis

This section explains the "why" with a simple approach. If you want recipes, go to Tutorials. If you want full formulas, keep reading.

Huber M-Estimator

Simple idea: the classic mean gives too much weight to extreme values. Huber keeps the center stable and applies a gradual brake to outliers.

Quick example:

[10, 12, 11, 15, 10, 1000]

The mean is about 176.3, a value that describes none of the observations, while a Huber estimate stays near the five typical values (10 to 15).

Definition, for a residual \(r\) and a tuning constant \(k > 0\):

\[ \rho_k(r) = \begin{cases} \frac{1}{2} r^2 & \text{if } |r| \le k \\ k\left(|r| - \frac{1}{2} k\right) & \text{if } |r| > k \end{cases} \]

Interpretation:

- Near the center, it behaves like the mean (quadratic).
- In the tails, it becomes linear to reduce impact.
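The loss above can be turned into a location estimate with a few lines of code. What follows is a minimal sketch, not StatGuard's implementation: the names `huber_rho` and `huber_location` are hypothetical, the tuning constant `k = 1.345` is a common textbook choice, and the scale is assumed to be fixed at 1.4826 times the MAD. The estimate is found by iteratively reweighted averaging, where points inside the threshold get weight 1 and points outside get weight `k * scale / |x - mu|`.

```python
from statistics import median


def huber_rho(r: float, k: float = 1.345) -> float:
    """Huber loss: quadratic for |r| <= k, linear beyond that."""
    a = abs(r)
    return 0.5 * r * r if a <= k else k * (a - 0.5 * k)


def huber_location(data, k: float = 1.345, tol: float = 1e-9,
                   max_iter: int = 100) -> float:
    """Huber location estimate via iteratively reweighted averaging.

    Assumption of this sketch: the scale is held fixed at 1.4826 * MAD.
    """
    center = median(data)
    scale = 1.4826 * median(abs(x - center) for x in data)
    if scale == 0:
        return center  # at least half the values coincide with the median
    threshold = k * scale
    mu = center
    for _ in range(max_iter):
        # Weight 1 inside the threshold, threshold / |residual| outside it.
        weights = [
            1.0 if abs(x - mu) <= threshold else threshold / abs(x - mu)
            for x in data
        ]
        new_mu = sum(w * x for w, x in zip(weights, data)) / sum(weights)
        if abs(new_mu - mu) < tol:
            return new_mu
        mu = new_mu
    return mu


data = [10, 12, 11, 15, 10, 1000]
print(sum(data) / len(data))  # plain mean, about 176.33
print(huber_location(data))   # Huber estimate, about 12.20
```

With these assumptions the outlier 1000 still contributes, but only with a weight of roughly 0.003, which is why the estimate barely moves away from the bulk of the data.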

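MAD (median absolute deviation) measures dispersion around the median: \(MAD = \operatorname{median}(|x_i - \tilde{x}|)\), where \(\tilde{x}\) is the sample median. To compare it with the standard deviation, scale it as follows: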

\[ \sigma_{robust} = MAD \times 1.4826 \]

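The constant \(1.4826 \approx 1 / \Phi^{-1}(0.75)\) makes \(\sigma_{robust}\) a consistent estimate of the standard deviation when the data are normally distributed.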
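As an illustration, the scaled MAD can be computed with the standard library alone. This is a sketch under stated assumptions: the helper names `mad` and `robust_sigma` are hypothetical and are not taken from StatGuard.

```python
from statistics import median, stdev


def mad(data) -> float:
    """Median of the absolute deviations from the median."""
    center = median(data)
    return median(abs(x - center) for x in data)


def robust_sigma(data) -> float:
    """MAD rescaled to be comparable with the standard deviation."""
    return 1.4826 * mad(data)


data = [10, 12, 11, 15, 10, 1000]
print(mad(data))           # 1.5
print(robust_sigma(data))  # 1.4826 * 1.5, about 2.22
print(stdev(data))         # classical standard deviation, dominated by 1000
```

On this sample the classical standard deviation is in the hundreds, while the scaled MAD reflects the spread of the five typical values.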

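For a robust coefficient of variation, replace the standard deviation with \(\sigma_{robust}\) and the mean with the median \(\tilde{x}\), so that outliers do not inflate the measured variability: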

\[ CV_r = \left(\frac{\sigma_{robust}}{|\tilde{x}|}\right) \times 100 \]
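The formula translates directly into code. The sketch below assumes a hypothetical helper named `robust_cv` (not a StatGuard function) and chooses to raise an error when the median is zero, since the ratio is undefined in that case.

```python
from statistics import median


def robust_cv(data) -> float:
    """Robust coefficient of variation in percent: 1.4826 * MAD / |median| * 100."""
    center = median(data)
    if center == 0:
        raise ValueError("robust CV is undefined when the median is zero")
    sigma_robust = 1.4826 * median(abs(x - center) for x in data)
    return sigma_robust / abs(center) * 100


print(robust_cv([10, 12, 11, 15, 10, 1000]))  # about 19.34
```

Here the median is 11.5 and the scaled MAD is about 2.22, so the result is close to 19 percent despite the extreme value.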

For an ordered sample \(x_{(1)} \le \dots \le x_{(n)}\), each quantile type assigns a probability \(p_k\) to the order statistic \(x_{(k)}\). For the continuous types (4 to 9), \(p_k\) is determined by two parameters \((a, b)\):

\[ p_k = \frac{k - a}{n - a - b + 1} \]

When \(p\) falls between \(p_j\) and \(p_{j+1}\), the quantile is linearly interpolated between \(x_{(j)}\) and \(x_{(j+1)}\). Types 1 to 3 are discontinuous step functions and are not defined through \((a, b)\). StatGuard implements the 9 types used by R, which follow Hyndman and Fan (1996).

Type	\(p_k\)	\(a\)	\(b\)	Note
1	\(k / n\)	n/a	n/a	Inverse of the empirical CDF (discontinuous).
2	\(k / n\)	n/a	n/a	Like type 1, but averages at discontinuities.
3	\((k - 0.5) / n\)	n/a	n/a	Nearest order statistic (discontinuous).
4	\(k / n\)	0	1	Linear interpolation of the empirical CDF.
5	\((k - 0.5) / n\)	0.5	0.5	Hazen (1914).
6	\(k / (n + 1)\)	0	0	Weibull (1939).
7	\((k - 1) / (n - 1)\)	1	1	R default; \(p_k\) is the mode of \(F(x_{(k)})\).
8	\((k - 1/3) / (n + 1/3)\)	1/3	1/3	Median-unbiased.
9	\((k - 3/8) / (n + 1/4)\)	3/8	3/8	Normal-unbiased.
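To make the interpolation rule concrete, here is a minimal sketch for the continuous types that works directly from the \(p_k\) column of the table. The function `quantile` and the helpers `type6` and `type7` are hypothetical names for illustration, not StatGuard's API. The sketch assumes that values of \(p\) below \(p_1\) or above \(p_n\) are clamped to the smallest or largest observation, which is how R behaves.

```python
from bisect import bisect_right
from typing import Callable, Sequence


def type6(k: int, n: int) -> float:
    """Plotting position for type 6: k / (n + 1)."""
    return k / (n + 1)


def type7(k: int, n: int) -> float:
    """Plotting position for type 7: (k - 1) / (n - 1)."""
    return (k - 1) / (n - 1)


def quantile(data: Sequence[float], p: float,
             position: Callable[[int, int], float]) -> float:
    """Interpolate linearly between order statistics at positions p_1..p_n."""
    if not 0 <= p <= 1:
        raise ValueError("p must lie in [0, 1]")
    xs = sorted(data)
    n = len(xs)
    if n == 0:
        raise ValueError("data must not be empty")
    if n == 1:
        return xs[0]
    pk = [position(k, n) for k in range(1, n + 1)]
    # Clamp outside the range covered by the plotting positions.
    if p <= pk[0]:
        return xs[0]
    if p >= pk[-1]:
        return xs[-1]
    j = bisect_right(pk, p) - 1  # pk[j] <= p < pk[j + 1]
    t = (p - pk[j]) / (pk[j + 1] - pk[j])
    return xs[j] + t * (xs[j + 1] - xs[j])


data = [10, 12, 11, 15, 10, 1000]
print(quantile(data, 0.5, type7))  # about 11.5
print(quantile(data, 0.9, type7))  # about 507.5
print(quantile(data, 0.9, type6))  # 1000
```

The last two lines show why the choice of type matters on small samples: the same 90th percentile lands halfway between 15 and 1000 under type 7, but on the largest observation under type 6.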

Success

If you are unsure which type to use, choose type 7: it is the default in R.