Descriptive Statistics for Continuous Data

Histograms are based on binned data and hence provide us with snapshots of how much probability mass is allocated in different parts of the data domain.

import numpy as np
income = np.loadtxt("https://raw.githubusercontent.com/gagolews/" +
"teaching-data/master/marek/uk_income_simulated_2020.txt")
b = [0, 10000, 20000, 30000, 40000, 50000, 60000, 80000, np.inf] # bin bounds
c = np.histogram(income, bins=b)[0] # counts
for i in range(len(c)):
    print(f"{b[i]:5}-{b[i+1]:5}: {c[i]:4}")
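Counts can be turned into the share of "probability mass" in each bin by dividing them by the sample size. The following is only a minimal sketch: the sample is synthetic (a log-normal stand-in, not the real income data), and the names sample, bounds, and props are made up for this illustration.

```python
import numpy as np

# Assumption: a synthetic, right-skewed stand-in for an income sample.
rng = np.random.default_rng(123)
sample = rng.lognormal(mean=10.3, sigma=0.6, size=1000)

bounds = [0, 20000, 40000, 60000, np.inf]      # bin bounds
counts = np.histogram(sample, bins=bounds)[0]  # observations per bin
props = counts / len(sample)                   # fraction of the sample per bin

for lo, hi, p in zip(bounds[:-1], bounds[1:], props):
    print(f"{lo:6}-{hi:6}: {p:6.1%}")
```

As the bins cover the whole data domain, the proportions add up to 1.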
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme()  # seaborn's default look; the "seaborn" style name is gone from recent matplotlib
sns.histplot(income)
plt.show()

Mathematical Notation

Assume that the sample to be aggregated consists of $n$ observations. Denoting the whole sample with $x$, we shall write

$x = (x_1, x_2, \dots, x_n)$.

Using the programming syntax, $n$ corresponds to len(income) or, equivalently, income.shape[0], and $x_i$ is income[i-1] (because in Python the first element is at index 0).

income_sorted = np.sort(income)
income_sorted[0], income_sorted[-1] # the minimum and the maximum

heights = np.loadtxt("https://raw.githubusercontent.com/gagolews/" +
"teaching-data/master/marek/nhanes_adult_female_height_2020.txt")
heights_sorted = np.sort(heights)
heights_sorted[0], heights_sorted[-1]

sns.histplot(heights)
plt.show()

Measures of Location

Two main measures of central tendency are:

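  • the arithmetic mean (sometimes simply called the mean or the average), defined as the sum of all observations divided by the sample size:

    $\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{1}{n} \sum_{i=1}^n x_i$,

  • the median, being the middle value in a sorted version of the sample if its length is odd, or the arithmetic mean of the two middle values otherwise:

    $m = x_{((n+1)/2)}$ if $n$ is odd, and $m = \frac{x_{(n/2)} + x_{(n/2+1)}}{2}$ if $n$ is even, where $x_{(i)}$ denotes the $i$-th smallest observation.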
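The two definitions above can be sketched directly in code. This is merely an illustration: mean_by_hand and median_by_hand are hypothetical helper names, and the two tiny samples are made up. In practice, np.mean and np.median should be preferred.

```python
import numpy as np

def mean_by_hand(x):
    """Sum of all observations divided by the sample size."""
    x = np.asarray(x, dtype=float)
    return np.sum(x) / len(x)

def median_by_hand(x):
    """Middle value of the sorted sample (odd length),
    or the mean of the two middle values (even length)."""
    xs = np.sort(np.asarray(x, dtype=float))
    n = len(xs)
    mid = n // 2  # 0-based index of the (upper) middle element
    if n % 2 == 1:
        return xs[mid]
    return (xs[mid - 1] + xs[mid]) / 2

odd = [5, 1, 4, 2, 3]        # made-up sample of odd length
even = [5, 1, 4, 2, 3, 100]  # made-up sample of even length

print(mean_by_hand(odd), median_by_hand(odd))
print(mean_by_hand(even), median_by_hand(even))
```

Note how the single large value in the second sample pulls the mean upwards much more than the median.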

np.mean(heights), np.median(heights)
np.mean(income), np.median(income)

# The arithmetic mean is strongly influenced by very large or very small observations
income2 = np.append(income, [1_000_000_000])
np.mean(income2)

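For symmetric distributions with no outliers, the mean is the better choice, as it uses all the data (and its efficiency can be proven for certain statistical models).
For skewed distributions, the median is more robust and has a nice interpretation: it splits the sample into two halves, with 50% of the observations not greater than it and 50% not less than it.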

np.mean(heights), heights.mean(), np.sum(heights)/len(heights), heights.sum()/heights.shape[0]

Quantiles

Quantiles generalise the notion of the sample median. For any $p$ between 0 and 1, a $p$-quantile, denoted $q_p$, is a value dividing the sample in such a way that:

  • 100p% of observations are not greater than $q_p$,
  • the remaining 100(1-p)% are not less than $q_p$.

Quantiles appear under many different names, but they all refer to the same concept. In particular, we can speak about 100p-th percentiles, e.g., the 0.5-quantile is the same as the 50th percentile.

  • 0-quantile ($q_0$) = the minimum (also: numpy.min),
  • 0.25-quantile ($q_{0.25}$) = the 1st quartile (denoted $Q_1$),
  • 0.5-quantile ($q_{0.5}$) = the 2nd quartile, a.k.a. the median,
  • 0.75-quantile ($q_{0.75}$) = the 3rd quartile (denoted $Q_3$),
  • 1-quantile ($q_1$) = the maximum (also: numpy.max).
np.quantile(income, [0, 0.25, 0.5, 0.75, 1])
np.quantile(heights, [0, 0.25, 0.5, 0.75, 1])
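The defining property of a quantile can be checked empirically. Below is a small sketch on a synthetic sample (normally distributed numbers standing in for real measurements; the sample and all variable names are assumptions made for this illustration). It also shows that np.percentile is the same tool with $p$ expressed in percent.

```python
import numpy as np

# Assumption: a synthetic sample, not one of the datasets used above.
rng = np.random.default_rng(42)
x = rng.normal(loc=160, scale=7, size=1001)

p = 0.25
q = np.quantile(x, p)

frac_not_greater = np.mean(x <= q)  # should be close to p
frac_not_less = np.mean(x >= q)     # should be close to 1-p
print(frac_not_greater, frac_not_less)

# Percentiles are quantiles with p given in percent:
print(np.isclose(q, np.percentile(x, 100*p)))

# The 0-, 0.5-, and 1-quantiles are the minimum, median, and maximum:
print(np.allclose(np.quantile(x, [0, 0.5, 1]),
                  [np.min(x), np.median(x), np.max(x)]))
```

For finite samples, the two fractions are only approximately equal to $p$ and $1-p$, because np.quantile interpolates between neighbouring observations by default.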

Measures of Dispersion

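Measures of central tendency quantify the location of the most typical value. Measures of dispersion (spread), on the other hand, quantify how much variability there is in the data. The two most noteworthy are:

  1. the standard deviation, being the square root of the average squared distance to the arithmetic mean $\bar{x}$:

     $s = \sqrt{\frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})^2}$,

  2. the interquartile range (IQR), being the difference between the 3rd and the 1st quartile:

     $\mathrm{IQR} = Q_3 - Q_1$.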

np.std(income), np.quantile(income, 0.75)-np.quantile(income, 0.25)

np.std(heights), np.quantile(heights, 0.75)-np.quantile(heights, 0.25)
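To see what these two functions compute, both measures can be derived by hand. The sketch below uses a synthetic sample (an assumption made for illustration only, as are the variable names). With its default settings, np.std divides by $n$; passing ddof=1 would divide by $n-1$ instead.

```python
import numpy as np

# Assumption: a synthetic sample, not one of the datasets used above.
rng = np.random.default_rng(7)
x = rng.normal(loc=160, scale=7, size=500)

# Standard deviation: square root of the mean squared distance to the mean.
s = np.sqrt(np.mean((x - np.mean(x))**2))
print(s, np.std(x))  # the two values should agree

# Interquartile range: the 3rd quartile minus the 1st quartile.
q1, q3 = np.quantile(x, [0.25, 0.75])
iqr = q3 - q1
print(iqr)

# Roughly half of the observations fall between the two quartiles.
inside = np.mean((x >= q1) & (x <= q3))
print(inside)
```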

The IQR has an appealing interpretation: it is the length of the interval that contains the middle 50% of the observations, i.e., the most typical values.

The standard deviation measures the average degree of spread around the arithmetic mean. Thus, it makes the most sense for data distributions that are symmetric around the mean. This measure is useful overall for making comparisons across different samples. However, without further assumptions, it’s quite difficult to express the meaning of a particular value of $s$.

Box (and Whisker) Plots

The box and whisker plot (or the box plot for short) depicts some of the most noteworthy features of a data sample.

plt.subplot(211)  # 2 rows, 1 column, 1st subplot
sns.boxplot(data=income, orient="h")
plt.title("income")

plt.subplot(212) # 2 rows, 1 column, 2nd subplot
sns.boxplot(data=heights, orient="h")
plt.title("heights")

plt.show()

Each box plot consists of:

  • the box, which spans between the 1st and the 3rd quartile:
    • the median is clearly marked by a vertical bar inside the box;
    • note that the width of the box corresponds to the IQR;
  • the whiskers, which span between
    • the smallest observation (the minimum) or $Q_1 - 1.5\,\mathrm{IQR}$ (the left side of the box minus 3/2 of its width), whichever is larger,
    • the largest observation (the maximum) or $Q_3 + 1.5\,\mathrm{IQR}$ (the right side of the box plus 3/2 of its width), whichever is smaller.

Additionally, all observations that are less than $Q_1 - 1.5\,\mathrm{IQR}$ (if any) or greater than $Q_3 + 1.5\,\mathrm{IQR}$ (if any) are separately marked.
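The rules above can be reproduced numerically. The following is a sketch under stated assumptions: the sample is synthetic (right-skewed, log-normal) and names such as lower_fence and upper_fence are introduced only for this illustration. Plotting libraries may end each whisker at the most extreme observation lying within these bounds, so the drawn positions can differ slightly.

```python
import numpy as np

# Assumption: a synthetic, right-skewed sample (not the real income data).
rng = np.random.default_rng(2020)
x = rng.lognormal(mean=10.3, sigma=0.6, size=1000)

q1, q3 = np.quantile(x, [0.25, 0.75])  # the sides of the box
iqr = q3 - q1                          # the width of the box

lower_fence = q1 - 1.5*iqr
upper_fence = q3 + 1.5*iqr

# Whisker ends as described above: "whichever is larger/smaller".
whisker_left = max(x.min(), lower_fence)
whisker_right = min(x.max(), upper_fence)

# Observations beyond the fences are the ones marked separately.
outliers = x[(x < lower_fence) | (x > upper_fence)]

print(whisker_left, whisker_right, len(outliers))
```

For a right-skewed sample like this one, we would expect the marked observations to lie on the right side of the box.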

A violin plot augments the box plot with a kernel density estimator, i.e., a smoothed version of the histogram.

sns.violinplot(data=income, orient="h")
plt.show()

Measures of Shape (*)

import scipy.stats
scipy.stats.skew(heights)

scipy.stats.skew(income)