Box-Cox and Yeo-Johnson Transformations

The main motivation for transforming variables in data wrangling is to make the data more suitable for analysis and modeling. Transformations often address specific challenges or requirements in the dataset, ensuring that it meets the assumptions of analytical methods or improving its interpretability and usability.

In this post, we will discuss two common transformations: Box-Cox and Yeo-Johnson. Both are used to transform skewed data so that it becomes more normal, or Gaussian-like. The main difference between the two is that the Box-Cox transformation requires strictly positive data, while the Yeo-Johnson transformation extends it to handle zero and negative values as well.

Transforming variables to central normality

Box–Cox Transformation

The Box–Cox transformation is a family of power transformations designed for strictly positive data.

It is defined as:

  y(λ) = (y^λ − 1) / λ,   if λ ≠ 0
  y(λ) = log(y),          if λ = 0

Key ideas:

  • The parameter λ (lambda) controls the strength of the transformation.
  • When λ = 1, the data are almost unchanged.
  • When λ = 0, the transformation becomes the logarithmic transformation.
  • Smaller values of λ compress large observations more strongly, reducing right skewness.

Purpose:

  • Reduce skewness
  • Stabilize variance
  • Make data more compatible with statistical models that assume normality

Limitation:

  • Box–Cox cannot be applied to zero or negative values, which restricts its use in some datasets.
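The ideas above can be tried out in a few lines. This is a minimal sketch, not from the original post: it uses synthetic log-normal data (an assumption for illustration) and `scipy.stats.boxcox`, which also estimates λ by maximum likelihood.

```python
# Box-Cox transform of strictly positive, right-skewed data.
# scipy.stats.boxcox estimates lambda by maximum likelihood when
# no lambda is supplied.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic example: log-normal data is strictly positive and right-skewed.
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=1000)

transformed, lam = stats.boxcox(skewed)

print(f"estimated lambda: {lam:.3f}")
print(f"skewness before:  {stats.skew(skewed):.3f}")
print(f"skewness after:   {stats.skew(transformed):.3f}")
```

Because log-normal data is exactly normal after a log transform, the estimated λ here should land near 0, matching the key idea that λ = 0 corresponds to the logarithm.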

Yeo-Johnson Transformation

The Yeo–Johnson transformation is an extension of the Box–Cox transformation that allows zero and negative values.

It is defined as:

  ψ(y, λ) = ((y + 1)^λ − 1) / λ,                if λ ≠ 0 and y ≥ 0
  ψ(y, λ) = log(y + 1),                         if λ = 0 and y ≥ 0
  ψ(y, λ) = −((1 − y)^(2 − λ) − 1) / (2 − λ),   if λ ≠ 2 and y < 0
  ψ(y, λ) = −log(1 − y),                        if λ = 2 and y < 0

Key ideas:

  • Positive and negative values are transformed using symmetrical power functions.
  • The transformation is continuous and smooth around zero.
  • The parameter λ again controls the degree of skewness correction.

Advantages:

  • Can be applied to any real-valued data
  • Maintains similar interpretability to Box–Cox
  • Suitable for datasets containing negative values or zeros
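A quick sketch of these advantages, using scikit-learn's `PowerTransformer` (not from the original post; the mixed positive/negative data is a synthetic assumption for illustration):

```python
# Yeo-Johnson via scikit-learn's PowerTransformer; unlike Box-Cox,
# it accepts zeros and negative values.
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(1)
# Synthetic data mixing positive values, negative values, and an exact zero.
x = np.concatenate([rng.exponential(2.0, 500),
                    -rng.exponential(0.5, 100),
                    [0.0]])

# standardize=True additionally rescales the output to zero mean, unit variance.
pt = PowerTransformer(method="yeo-johnson", standardize=True)
z = pt.fit_transform(x.reshape(-1, 1)).ravel()

print("fitted lambda:", pt.lambdas_[0])
print("mean, std after:", z.mean(), z.std())
```

The same estimator with `method="box-cox"` would raise an error on this data because of the zero and negative entries, which is exactly the limitation Yeo-Johnson removes.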

title: Feature Selection
date: 2025-11-14 09:22:12
type: page
tags:


Feature Selection

An Introduction to Variable and Feature Selection

This paper provides an introduction to the topic of variable and feature selection, which has become increasingly important in fields with high-dimensional datasets such as text processing, gene expression analysis, and combinatorial chemistry. The authors discuss the objectives of variable selection, which include improving prediction performance, reducing measurement and storage requirements, and gaining a better understanding of the underlying processes. The paper covers a range of aspects related to these problems, including defining objective functions, feature construction, feature ranking, multivariate feature selection, efficient search methods, and feature validity assessment. The authors also provide a checklist of steps that can be taken to solve a feature selection problem.

Filter Methods for Feature Selection – A Comparative Study

This paper presents a comparative study of several filter methods for feature selection, including ReliefF, Correlation-based Feature Selection (CFS), Fast Correlation-Based Filter (FCBF), and INTERACT. The authors applied these filter methods to synthetic datasets with varying numbers of relevant features, levels of noise in the output, feature interactions, and sample sizes. The goal was to determine the effectiveness of each filter method under different conditions and to identify the best filter method to use as part of a hybrid feature selection approach.
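The defining trait of a filter method is that features are scored independently of any downstream model. The specific filters from the paper (ReliefF, CFS, FCBF, INTERACT) are not in scikit-learn, so this sketch substitutes a simple univariate ANOVA F-score filter to show the general pattern:

```python
# Filter-style feature selection: score each feature against the target
# with a model-independent statistic (ANOVA F-score), then keep the top k.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data: 20 features, of which only 4 are informative.
X, y = make_classification(n_samples=300, n_features=20, n_informative=4,
                           n_redundant=0, shuffle=False, random_state=0)

selector = SelectKBest(score_func=f_classif, k=4)
X_sel = selector.fit_transform(X, y)

print("selected feature indices:", selector.get_support(indices=True))
print("reduced shape:", X_sel.shape)
```

Swapping in a different filter only means changing `score_func`; the rank-then-threshold structure stays the same.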

Penalized feature selection and classification in bioinformatics

This paper provides a review of several recently developed penalized feature selection and classification techniques for bioinformatics studies with high-dimensional input variables. The authors discuss classification objective functions, penalty functions, and computational algorithms for these embedded feature selection methods. The goal is to make researchers aware of these applicable techniques for high-dimensional bioinformatics data, which can help avoid overfitting, generate more reliable classifiers, and provide insights into underlying causal relationships.
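In contrast to filters, penalized (embedded) methods select features during model fitting. A minimal sketch of the idea, using L1-penalized logistic regression on synthetic high-dimensional data (an illustrative stand-in, not the paper's exact methods):

```python
# Embedded/penalized feature selection: an L1 penalty drives many
# coefficients exactly to zero, so selection happens during fitting.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic "bioinformatics-like" setup: many features, few informative.
X, y = make_classification(n_samples=200, n_features=50, n_informative=5,
                           n_redundant=0, random_state=0)

# Smaller C means a stronger penalty and a sparser coefficient vector.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
clf.fit(X, y)

kept = np.flatnonzero(clf.coef_[0])
print(f"nonzero coefficients: {len(kept)} of {X.shape[1]}")
```

The surviving nonzero coefficients are the selected features, which is how penalization helps avoid overfitting in high-dimensional settings.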

Feature Engineering

This paper provides an overview of feature engineering, which is an essential step in the machine learning process. Feature engineering aims to transform raw data into meaningful features that can improve the performance of machine learning models, in terms of both accuracy and interpretability. The paper discusses various univariate and multivariate feature engineering techniques, including transformations, dimensionality reduction methods, and representation learning approaches. It also covers feature engineering for structured, time series, and unstructured data. The success of machine learning often depends heavily on the success of feature engineering, and there is no single “gold standard” set of techniques, so it is important to experiment with different approaches.