Box-Cox and Yeo-Johnson Transformations

Variables are transformed during data wrangling to make the data more suitable for analysis and modeling. Transformations often address specific challenges or requirements in the dataset, ensuring that it meets the assumptions of analytical methods or improving interpretability and usability.

In this post, we discuss two common transformations: Box-Cox and Yeo-Johnson. Both are power transformations used to make skewed data more normal (Gaussian-like). The main difference between the two is that Box-Cox is the simpler and more widely used transformation but requires strictly positive data, while Yeo-Johnson is more flexible and can also handle zero and negative values.

Transforming variables to central normality

Box–Cox Transformation

The Box–Cox transformation is a family of power transformations designed for strictly positive data.

It is defined as:

$y(\lambda) = \begin{cases} \dfrac{x^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\ \ln(x), & \lambda = 0 \end{cases}$

Key ideas:

The parameter λ (lambda) controls the strength of the transformation.
When λ = 1, the data are only shifted by one unit (y = x − 1), so the shape of the distribution is unchanged.
When λ = 0, the transformation becomes the logarithmic transformation.
Values of λ below 1 compress large observations, and the smaller λ is, the stronger the compression, which reduces right skewness.
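
To make the definition concrete, here is a minimal sketch of the Box–Cox formula in plain Python. The function name box_cox and the sample values are made up for illustration; in practice you would more likely reach for a statistics library, which can also estimate λ for you.

```python
import math


def box_cox(x, lam):
    """Box-Cox transform of a single strictly positive value x."""
    if x <= 0:
        raise ValueError("Box-Cox is defined only for strictly positive x")
    if lam == 0:
        return math.log(x)          # the lambda = 0 branch is the natural log
    return (x ** lam - 1) / lam     # the general power branch


# Made-up, right-skewed sample used only for illustration
values = [0.5, 1.0, 2.0, 10.0, 100.0]

shifted = [box_cox(v, 1) for v in values]       # lambda = 1: x - 1
logged = [box_cox(v, 0) for v in values]        # lambda = 0: ln(x)
near_zero = [box_cox(v, 1e-6) for v in values]  # tiny lambda: close to ln(x)

# Print the transformed sample for a few lambdas to see the compression grow
for lam in (1, 0.5, 0, -0.5):
    print(lam, [round(box_cox(v, lam), 3) for v in values])
```

Note how the power branch approaches the logarithm as λ gets close to 0, which is why the two cases of the definition fit together.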

Purpose:

Reduce skewness
Stabilize variance
Make data more compatible with statistical models that assume normality
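
The skewness-reduction effect can be checked numerically. The sketch below is an assumption-laden illustration, not a recipe: the sample is synthetic (exponentials of evenly spaced points), skewness is computed with the simple moment-based formula, and the helper names box_cox and skewness are hypothetical.

```python
import math


def box_cox(x, lam):
    """Box-Cox transform of a single strictly positive value x."""
    if x <= 0:
        raise ValueError("Box-Cox is defined only for strictly positive x")
    return math.log(x) if lam == 0 else (x ** lam - 1) / lam


def skewness(data):
    """Moment-based skewness: third central moment over variance ** 1.5."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((d - mean) ** 2 for d in data) / n
    m3 = sum((d - mean) ** 3 for d in data) / n
    return m3 / m2 ** 1.5


# Synthetic right-skewed sample: e^(-2), ..., e^(2) in steps of 0.25
sample = [math.exp(k / 4) for k in range(-8, 9)]

raw_skew = skewness(sample)
sqrt_skew = skewness([box_cox(v, 0.5) for v in sample])  # milder correction
log_skew = skewness([box_cox(v, 0) for v in sample])     # strongest of the three

print("raw:", raw_skew)
print("lambda = 0.5:", sqrt_skew)
print("lambda = 0:", log_skew)
```

Because this sample is exactly log-symmetric, λ = 0 removes the skew almost entirely. Real data rarely behave this neatly, so the best λ usually has to be estimated from the data.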

Limitation:

Box–Cox cannot be applied to zero or negative values, which restricts its use in some datasets.

Yeo–Johnson Transformation

The Yeo–Johnson transformation is an extension of the Box–Cox transformation that allows zero and negative values.

It is defined as:

$y(\lambda) = \begin{cases} \dfrac{(x+1)^{\lambda} - 1}{\lambda}, & x \ge 0,\ \lambda \neq 0 \\ \ln(x+1), & x \ge 0,\ \lambda = 0 \\ -\dfrac{(-x+1)^{2-\lambda} - 1}{2-\lambda}, & x < 0,\ \lambda \neq 2 \\ -\ln(-x+1), & x < 0,\ \lambda = 2 \end{cases}$

Key ideas:

Non-negative values are transformed with power λ applied to x + 1, and negative values with power 2 − λ applied to −x + 1, so the two branches mirror each other.
The transformation is continuous and smooth around zero.
The parameter λ again controls the degree of skewness correction; λ = 1 leaves the data unchanged.
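
The four branches of the definition translate directly into code. This is a minimal sketch in plain Python; the function name yeo_johnson and the sample values are illustrative assumptions, and a statistics library would normally handle both the transform and the choice of λ.

```python
import math


def yeo_johnson(x, lam):
    """Yeo-Johnson transform of a single real value x."""
    if x >= 0:
        if lam == 0:
            return math.log(x + 1)
        return ((x + 1) ** lam - 1) / lam
    # Negative branch: power 2 - lambda applied to (-x + 1), then negated
    if lam == 2:
        return -math.log(-x + 1)
    return -((-x + 1) ** (2 - lam) - 1) / (2 - lam)


# Made-up sample containing negative values and zero
values = [-3.0, -0.5, 0.0, 0.5, 3.0]

# lambda = 1 returns the data unchanged on both branches
identity = [yeo_johnson(v, 1) for v in values]

# Values just left and right of zero map to values just left and right of zero
left_of_zero = yeo_johnson(-1e-9, 0.5)
right_of_zero = yeo_johnson(1e-9, 0.5)

# The transform preserves the ordering of the data for any lambda
transformed = [yeo_johnson(v, 0.5) for v in values]

print(identity)
print(transformed)
```

Unlike Box–Cox, nothing here needs to reject an input: every real value falls into one of the four branches.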

Advantages:

Can be applied to any real-valued data
Keeps an interpretation similar to Box–Cox: for x ≥ 0 it equals the Box–Cox transformation of x + 1
Suitable for datasets containing negative values or zeros