Descriptive Statistics for Continuous Data
Histograms are based on binned data and hence provide us with snapshots of how much probability mass is allocated in different parts of the data domain.
    import numpy as np
    import matplotlib.pyplot as plt
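To make the idea of binning concrete, here is a minimal sketch that counts how many observations fall into each of four equal-width bins. The sample `x` and the choice of bins are made up for illustration; they are not the data used in this chapter.

```python
import numpy as np

# Hypothetical toy sample (an assumption, not the chapter's dataset).
x = np.array([1.2, 1.9, 2.1, 2.4, 2.6, 3.3, 3.8, 4.5, 7.0, 9.5])

# Four equal-width bins spanning the interval [0, 10].
counts, edges = np.histogram(x, bins=4, range=(0, 10))

# Share of the observations that falls into each bin.
proportions = counts / len(x)

print(edges)        # bin boundaries
print(counts)       # number of observations per bin
print(proportions)  # the same, as fractions summing to 1
```

Most of the mass sits in the two leftmost bins, which is exactly the kind of information a histogram conveys at a glance.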
Mathematical Notation
Assume that the sample to be aggregated, denoted $x$, consists of $n$ observations;
we shall write $x = (x_1, x_2, \dots, x_n)$.
In programming syntax, $n$ corresponds to len(income) or, equivalently, income.shape[0], and $x_i$ is income[i-1] (because in Python the first element is at index 0).
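The correspondence between the mathematical and the Python indexing can be checked on a tiny vector. The array `x` below is a hypothetical stand-in for `income`.

```python
import numpy as np

# Hypothetical four-element sample standing in for `income`.
x = np.array([3.0, 1.0, 2.5, 4.0])

n = len(x)                # sample size
same_n = x.shape[0]       # equivalent way of obtaining n

i = 1
x_first = x[i - 1]        # the mathematical x_1 is x[0] in Python
x_last = x[n - 1]         # the mathematical x_n is x[n-1], i.e., x[-1]

print(n, same_n, x_first, x_last)
```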
    income_sorted = np.sort(income)
    heights = np.loadtxt("https://raw.githubusercontent.com/gagolews/" +
Measures of Location
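The two main measures of central tendency are:
- the arithmetic mean (sometimes called simply the mean or the average), defined as the sum of all observations divided by the sample size: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$,
- the median, being the middle value in a sorted version of the sample if its length is odd, or the arithmetic mean of the two middle values otherwise: $m = x_{((n+1)/2)}$ if $n$ is odd and $m = \frac{1}{2}\left(x_{(n/2)} + x_{(n/2+1)}\right)$ if $n$ is even, where $x_{(i)}$ denotes the $i$-th smallest observation.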
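Both definitions translate almost literally into code. The sketch below implements them by hand and compares the results with NumPy's built-ins; the helper names `mean_by_definition` and `median_by_definition` as well as the two toy samples are our own and serve only as an illustration.

```python
import numpy as np

def mean_by_definition(x):
    """Sum of all observations divided by the sample size."""
    x = np.asarray(x, dtype=float)
    return x.sum() / len(x)

def median_by_definition(x):
    """Middle value of the sorted sample (odd n) or the mean of the
    two middle values (even n)."""
    xs = np.sort(np.asarray(x, dtype=float))
    n = len(xs)
    mid = n // 2
    if n % 2 == 1:
        return xs[mid]                  # 0-based index of the middle value
    return (xs[mid - 1] + xs[mid]) / 2  # the two middle values

odd_sample = [5, 1, 4, 2, 3]   # hypothetical data, n = 5
even_sample = [5, 1, 4, 2]     # hypothetical data, n = 4

print(mean_by_definition(odd_sample), median_by_definition(odd_sample))
print(mean_by_definition(even_sample), median_by_definition(even_sample))
```

In practice we would simply call np.mean and np.median; the hand-written versions are only meant to mirror the formulas.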
    np.mean(heights), np.median(heights)
    np.mean(income), np.median(income)
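For symmetric distributions with no outliers, the mean is usually the better choice, as it uses all the data (and its efficiency can be proven for certain statistical models).
For skewed distributions, the median has a nice interpretation: about half of the observations are not greater than it and about half are not less than it. It is also barely affected by a few extreme values.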
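The difference between the two measures is easiest to see when a single extreme observation is added to an otherwise symmetric sample. The numbers below are invented for illustration (think of incomes in thousands).

```python
import numpy as np

# Hypothetical, roughly symmetric sample.
x = np.array([30.0, 32.0, 35.0, 38.0, 40.0])

# The same sample with one very large observation appended.
x_out = np.append(x, 1000.0)

mean_before, median_before = np.mean(x), np.median(x)
mean_after, median_after = np.mean(x_out), np.median(x_out)

print(mean_before, median_before)  # both describe the "centre" well
print(mean_after, median_after)    # the mean is dragged towards the outlier
```

The mean moves far away from where the bulk of the data lies, whereas the median changes only slightly.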
    np.mean(heights), heights.mean(), np.sum(heights)/len(heights), heights.sum()/heights.shape[0]
Quantiles
Quantiles generalise the notion of the sample median. For any $p$ between 0 and 1, a $p$-quantile, denoted $q_p$, is a value dividing the sample in such a way that:
- 100p% of observations are not greater than $q_p$,
- the remaining 100(1-p)% are not less than $q_p$.
Quantiles appear under many different names, but they all refer to the same concept. In particular, we can speak about 100p-th percentiles, e.g., the 0.5-quantile is the same as the 50th percentile.
- 0-quantile ($q_0$) = the minimum (also: numpy.min),
- 0.25-quantile ($q_{0.25}$) = the 1st quartile (denoted $Q_1$),
- 0.5-quantile ($q_{0.5}$) = the 2nd quartile a.k.a. median,
- 0.75-quantile ($q_{0.75}$) = the 3rd quartile (denoted $Q_3$),
- 1-quantile ($q_1$) = the maximum (also: numpy.max).
    np.quantile(income, [0, 0.25, 0.5, 0.75, 1])
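The defining property of quantiles can be verified numerically. The sketch below uses a made-up nine-element sample and np.quantile with its default settings (linear interpolation); for a finite sample, the proportions are at least $p$ and at least $1-p$ rather than exactly equal to them.

```python
import numpy as np

# Hypothetical sample (unsorted on purpose).
x = np.array([7.0, 1.0, 3.0, 9.0, 5.0, 11.0, 13.0, 15.0, 17.0])

ps = [0, 0.25, 0.5, 0.75, 1]
qs = np.quantile(x, ps)  # minimum, Q1, median, Q3, maximum

for p, q in zip(ps, qs):
    not_greater = np.mean(x <= q)  # fraction of observations <= q_p
    not_less = np.mean(x >= q)     # fraction of observations >= q_p
    print(p, q, not_greater, not_less)

# Percentiles are the same concept on a 0-100 scale.
median_as_percentile = np.percentile(x, 50)
```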
Measures of Dispersion
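Measures of central tendency quantify the location of the most typical value. Measures of dispersion, in turn, quantify how much the observations vary around it. Two common choices are:
- the standard deviation, being (roughly speaking) the average distance to the arithmetic mean, computed as the square root of the mean squared deviation: $s = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2}$, where $\bar{x}$ is the arithmetic mean (this is what np.std computes by default),
- the interquartile range (IQR), being the difference between the 3rd and the 1st quartile: $\mathrm{IQR} = Q_3 - Q_1$.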
    np.std(income), np.quantile(income, 0.75) - np.quantile(income, 0.25)
The IQR has an appealing interpretation, because we may say that this is the range comprised of the 50% most typical values.
The standard deviation measures the average degree of spread around the arithmetic mean. Thus, it makes the most sense for data distributions that are symmetric around the mean. This measure is useful overall for making comparisons across different samples. However, without further assumptions, it’s quite difficult to express the meaning of a particular value of $s$.
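Both measures of dispersion are easy to compute step by step. The following sketch does so for a small invented sample and compares the hand-computed standard deviation with np.std, which by default divides by $n$ (passing ddof=1 would divide by $n-1$ instead).

```python
import numpy as np

# Hypothetical sample chosen so that the numbers come out round.
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Standard deviation: square root of the mean squared deviation from the mean.
x_bar = x.mean()
s_manual = np.sqrt(np.mean((x - x_bar) ** 2))
s_numpy = np.std(x)  # default ddof=0, i.e., division by n

# Interquartile range: Q3 minus Q1.
q1, q3 = np.quantile(x, [0.25, 0.75])
iqr = q3 - q1

print(s_manual, s_numpy, iqr)
```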
Box (and Whisker) Plots
The box and whisker plot (or the box plot for short) depicts some of the most noteworthy features of a data sample.
    plt.subplot(211)  # 2 rows, 1 column, 1st subplot
Each box plot consists of:
- the box, which spans between the 1st and the 3rd quartile:
  - the median is clearly marked by a vertical bar inside the box;
  - note that the width of the box corresponds to the IQR;
- the whiskers, which span between:
  - the smallest observation (the minimum) or $Q_1 - 1.5\,\mathrm{IQR}$ (the left side of the box minus 3/2 of its width), whichever is larger,
  - the largest observation (the maximum) or $Q_3 + 1.5\,\mathrm{IQR}$ (the right side of the box plus 3/2 of its width), whichever is smaller.
Additionally, all observations that are less than $Q_1 - 1.5\,\mathrm{IQR}$ (if any) or greater than $Q_3 + 1.5\,\mathrm{IQR}$ (if any) are separately marked.
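The quantities that make up a box plot can be computed directly from the rules above. The sample below is hypothetical and contains one clearly outlying value. Note that plotting libraries such as matplotlib usually end each whisker at the farthest observation still lying within the $1.5\,\mathrm{IQR}$ fence, so the sketch reports that variant as well.

```python
import numpy as np

# Hypothetical sample with one large outlier.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 30.0])

q1, med, q3 = np.quantile(x, [0.25, 0.5, 0.75])
iqr = q3 - q1  # width of the box

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whisker ends according to the rule stated in the text.
whisker_low = max(x.min(), lower_fence)
whisker_high = min(x.max(), upper_fence)

# Variant: farthest observations that still lie within the fences.
whisker_low_obs = x[x >= lower_fence].min()
whisker_high_obs = x[x <= upper_fence].max()

# Observations marked separately (potential outliers).
outliers = x[(x < lower_fence) | (x > upper_fence)]

print(q1, med, q3, iqr)
print(whisker_low, whisker_high, whisker_low_obs, whisker_high_obs)
print(outliers)
```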
A violin plot additionally depicts a kernel density estimator of the sample:
    import seaborn as sns
    sns.violinplot(data=income, orient="h")
    plt.show()
Measures of Shape (*)
    import scipy.stats

