Summary Statistics Part II

Mohamed Abdelrazek
15 min read · May 24, 2021


We have already learned, in a previous article, the basic concept of descriptive statistics, which is the best way to get an overall picture of your dataset. It uses single values to give us a good understanding of our data, and these values can be classified into 4 main categories according to the information we gain from them. Now we will move a step ahead and look at the single values that help us identify the spread of our data.

The most commonly used summary statistics can be classified according to their purpose into the following categories:

1- LOCATION

2- SPREAD

3- SHAPE

4- DEPENDENCE

Simplified map for Summary statistics according to information they give

Spreadness

  1. Standard Deviation
  2. Variance
  3. Range
  4. Interquartile Range
  5. Absolute deviation
  6. Mean absolute difference
  7. Distance Standard Deviation

Before we dive deep into these metrics, let’s discuss a side topic that is closely related to our main one: population vs. sample, what’s the difference?

A population is the entire group that you want to draw conclusions about.

A sample is the specific group that you will collect data from. The size of the sample is always less than the total size of the population.

In research, a population doesn’t always refer to people. It can mean a group containing elements of anything you want to study, such as objects, events, organizations, countries, species, organisms, etc.

Examples for population and sample

Standard Deviation and Variance

Standard deviation and variance are both determined by using the mean of the group of numbers in question. The mean is the average of a group of numbers, and the variance measures the average degree to which each number differs from the mean. The size of the variance is related to the overall range of the numbers: the variance is greater when the numbers in the group are spread over a wider range, and smaller when they sit in a narrower range.

1- Standard Deviation

Standard deviation is a statistic that looks at how far from the mean a group of numbers is, by using the square root of the variance. The calculation of variance uses squares because it weighs outliers more heavily than data closer to the mean. This calculation also prevents differences above the mean from canceling out those below, which would result in a variance of zero.

Standard deviation is calculated as the square root of variance by figuring out the variation between each data point relative to the mean. If the points are further from the mean, there is a higher deviation within the data; if they are closer to the mean, there is a lower deviation. So the more spread out the group of numbers is, the higher the standard deviation.

Standard Deviation for population
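As a hedged reconstruction of the formula this caption refers to, here is the standard textbook definition of the population standard deviation. The symbols are assumptions rather than the original figure's notation: N is the population size, x_i is the i-th value, and μ is the population mean.

```latex
\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(x_i - \mu\right)^2}
```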

2- Variance

The variance is the average of the squared differences from the mean. To figure out the variance, first calculate the difference between each point and the mean; then, square and average the results.
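The steps above can be sketched in a few lines of Python. The data set and the function names are made up for illustration; the standard library's `statistics.pvariance` and `statistics.pstdev` do the same job.

```python
import math

def population_variance(values):
    """Average of the squared differences from the mean."""
    mean = sum(values) / len(values)
    squared_diffs = [(x - mean) ** 2 for x in values]
    return sum(squared_diffs) / len(values)

def population_std(values):
    """Square root of the population variance."""
    return math.sqrt(population_variance(values))

data = [2, 4, 4, 4, 5, 5, 7, 9]    # hypothetical example values, mean = 5
print(population_variance(data))   # 4.0
print(population_std(data))        # 2.0
```

The squared differences here are 9, 1, 1, 1, 0, 0, 4 and 16, which sum to 32; dividing by the 8 values gives a variance of 4 and a standard deviation of 2.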

A useful property of the standard deviation is that, unlike the variance, it is expressed in the same units as the data. For example, if the data is expressed in kg, the standard deviation is also in kg, whereas the variance is in kg².

Variance Equation For Population
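As a hedged reconstruction of the equation this caption refers to, the usual definition of the population variance is given below. The notation is an assumption: N is the number of items in the population, x_i is the i-th item, and μ is the population mean.

```latex
\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}\left(x_i - \mu\right)^2,
\qquad
\mu = \frac{1}{N}\sum_{i=1}^{N} x_i
```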

Population vs. Sample Variance and Standard Deviation

When calculating variance and standard deviation, it is important to know whether we are calculating them for the whole population using all the data, or using only a sample of the data. In the first case we call them population variance and population standard deviation. In the second case we call them sample variance and sample standard deviation.

The Difference in Calculation: Population vs. Sample Variance

There is only one little difference in the calculation of variance and it is at the very end of it. For both population and sample variance, I calculate the mean, then the deviations from the mean, and then I square all the deviations. I sum all the squared deviations up. So far it was the same for both population and sample variance.

When I calculate population variance, I then divide the sum of squared deviations from the mean by the number of items in the population.

When I calculate sample variance, I divide it by the number of items in the sample less one. As a result, the calculated sample variance (and therefore also the standard deviation) will be slightly higher than if we had used the population variance formula. The purpose of this little difference is to get a better, unbiased estimate of the population’s variance (by dividing by the sample size lowered by one, we compensate for the fact that we are working only with a sample rather than with the whole population).
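Here is a minimal sketch of that single difference in the divisor. The parameter name `ddof` is borrowed from NumPy's convention and is only an assumption here, as are the example values; the results are cross-checked against the standard library's `statistics` module.

```python
import math
import statistics

def variance(values, ddof=0):
    """Sum of squared deviations divided by (n - ddof).

    ddof=0 gives the population variance, ddof=1 the sample variance.
    """
    n = len(values)
    mean = sum(values) / n
    sum_sq = sum((x - mean) ** 2 for x in values)
    return sum_sq / (n - ddof)

data = [2, 4, 4, 4, 5, 5, 7, 9]       # hypothetical example values

pop_var = variance(data, ddof=0)      # divide by n
sample_var = variance(data, ddof=1)   # divide by n - 1

# Cross-check against the standard library implementations.
assert math.isclose(pop_var, statistics.pvariance(data))
assert math.isclose(sample_var, statistics.variance(data))

print(pop_var, sample_var)            # the sample variance is the larger one
```

Both calls share the same sum of squared deviations (32 here); only the divisor changes, from 8 to 7.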

But why do we divide by n-1 and not just n?

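Dividing by n-1 corrects the bias because we are using the sample mean, instead of the population mean, to calculate the variance. Since the sample mean is computed from the data itself, it is drawn toward the center of mass of that data. The squared deviations around the sample mean are therefore, on average, smaller than the squared deviations around the true population mean, so dividing by n would underestimate the variance. For more details, you may find this article useful.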
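A small simulation can make the bias visible. This is only a sketch: the sample size, the number of trials, the seed, and the choice of a standard normal population (true variance 1) are all assumptions made for illustration.

```python
import random
from statistics import fmean

def sum_squared_deviations(sample):
    """Sum of squared deviations from the sample mean."""
    mean = fmean(sample)
    return sum((x - mean) ** 2 for x in sample)

random.seed(0)                # fixed seed so the run is repeatable
n, trials = 5, 20000          # small samples make the bias easy to see

biased, unbiased = [], []
for _ in range(trials):
    sample = [random.gauss(0, 1) for _ in range(n)]   # true variance is 1
    ss = sum_squared_deviations(sample)
    biased.append(ss / n)            # divide by n
    unbiased.append(ss / (n - 1))    # divide by n - 1

mean_biased = fmean(biased)
mean_unbiased = fmean(unbiased)
print(mean_biased, mean_unbiased)
```

Averaged over many samples, the divide-by-n estimate should settle near (n-1)/n = 0.8 of the true variance, while the divide-by-(n-1) estimate should settle near 1.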

Standard Deviation and Variance in Investing

For traders and analysts, these two concepts are of paramount importance as they are used to measure security and market volatility, which in turn plays a large role in creating a profitable trading strategy.

Standard deviation is one of the key methods that analysts, portfolio managers, and advisors use to determine risk. When the group of numbers is closer to the mean, the investment is less risky; when the group of numbers is further from the mean, the investment is of greater risk to a potential purchaser.

Securities that are close to their means are seen as less risky, as they are more likely to continue behaving as such. Securities with large trading ranges that tend to spike or change direction are riskier. In investing, risk in itself is not a bad thing, as the riskier the security, the greater potential for a payout.

3- Range

When you buy things, they are always sold within a price range. Take the example of your favorite pair of jeans. The store from where you made the purchase probably had a range of colors, a range of fits, a range of sizes, and a range of prices.

This range usually enables us to make a more informed decision about exactly what we want to buy, in this case a pair of jeans.

Range is usually defined with an upper value and lower value and it refers to all the units between those values.

What is Meant by Range in Statistics?

The range is the difference between the highest value and the lowest value of the data. It helps in knowing the spread of the data.

Range Rule of Thumb

The range rule of thumb says that the range is approximately four times the standard deviation. Alternatively, the standard deviation is approximately one-fourth the range. The rule rests on the observation that most of the data lies within two standard deviations of the mean.

Procedure for finding Standard Deviation using Range

  1. Find the range
  2. Divide it by four

Formula:

Range and standard deviation relationship
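Written out, the relationship is: standard deviation ≈ (maximum - minimum) / 4. Below is a minimal sketch of the two-step procedure, compared with the sample standard deviation from Python's `statistics` module. The data values and the function name are hypothetical.

```python
import statistics

def range_rule_std(values):
    """Rough standard deviation estimate: (max - min) / 4."""
    return (max(values) - min(values)) / 4

data = [12, 15, 17, 20, 22, 25, 28]   # hypothetical example values

estimate = range_rule_std(data)       # (28 - 12) / 4 = 4.0
actual = statistics.stdev(data)       # sample standard deviation
print(estimate, actual)
```

The two numbers need not agree closely, especially for a handful of points; the rule is meant as a quick rough estimate, not a replacement for the full calculation.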

Why Does It Work?

It may seem like the range rule is a bit strange. Why does it work? Doesn’t it seem completely arbitrary to just divide the range by four? Why wouldn’t we divide by a different number? There is actually some mathematical justification going on behind the scenes.

Recall the properties of the bell curve and the probabilities from a standard normal distribution. One feature has to do with the amount of data that falls within a certain number of standard deviations:

  • Approximately 68% of the data is within one standard deviation (higher or lower) from the mean.
  • Approximately 95% of the data is within two standard deviations (higher or lower) from the mean.
  • Approximately 99.7% is within three standard deviations (higher or lower) from the mean.

The number that we will use has to do with 95%. From two standard deviations below the mean to two standard deviations above the mean, we have about 95% of our data. Thus nearly all of our normal distribution stretches out over a line segment that is a total of four standard deviations long.

Not all data is normally distributed and bell curve shaped. But most data is well-behaved enough that going two standard deviations away from the mean captures nearly all of the data. We estimate and say that four standard deviations are approximately the size of the range, and so the range divided by four is a rough approximation of the standard deviation.
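The bell-curve percentages quoted above can be checked with `statistics.NormalDist` from the Python standard library (available since Python 3.8). The helper name `coverage` is an assumption for this sketch.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def coverage(k):
    """Probability that a normal value lies within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    # Prints the percentage of data within k standard deviations.
    print(k, round(coverage(k) * 100, 2))
```

For k = 2 the result is about 95%, which is the figure the range rule leans on when it treats the range as roughly four standard deviations wide.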

Uses for the Range Rule

The range rule is helpful in a number of settings. First, it is a very quick estimate of the standard deviation. The standard deviation requires us to first find the mean, then subtract this mean from each data point, square the differences, add these, divide by one less than the number of data points, then (finally) take the square root. On the other hand, the range rule only requires one subtraction and one division.

Another place where the range rule is helpful is when we have incomplete information. Formulas such as the one that determines sample size require three pieces of information: the desired margin of error, the level of confidence, and the standard deviation of the population we are investigating. Many times it is impossible to know what the population standard deviation is. With the range rule, we can estimate this statistic, and then know how large we should make our sample.

What Are the Limitations of Range?

Range is the most convenient metric to find. But it has the following limitations.

  • The range does not tell us the number of data points.
  • The range cannot be used to find mean, median, or mode.
  • The range is affected by extreme values (outliers).
  • The range cannot be used for open-ended distribution.

4- Interquartile Range

What Is a Quartile?

A quartile is a statistical term that describes a division of observations into four defined intervals based on the values of the data and how they compare to the entire set of observations.

Understanding Quartiles

To understand the quartile, it is important to understand the median as a measure of central tendency. The median in statistics is the middle value of a set of numbers. It is the point at which half of the data lies below and half lies above.

How Quartiles Work

Just like the median divides the data into half so that 50% of the measurement lies below the median and 50% lies above it, the quartile breaks down the data into quarters so that 25% of the measurements are less than the lower quartile, 50% are less than the median, and 75% are less than the upper quartile.

Quartiles divide the data at three points (a lower quartile, the median, and an upper quartile) to form four groups of the dataset. The lower quartile, or first quartile, is denoted as Q1 and is the middle number that falls between the smallest value of the dataset and the median. The second quartile, Q2, is also the median. The upper or third quartile, denoted as Q3, is the central point that lies between the median and the highest number of the distribution.

Now, we can map out the four groups formed from the quartiles. The first group of values contains the smallest number up to Q1; the second group includes Q1 to the median; the third set is the median to Q3; the fourth category comprises Q3 to the highest data point of the entire set.

Each quartile contains 25% of the total observations. Generally, the data is arranged from smallest to largest:

  1. First quartile: the lowest 25% of numbers
  2. Second quartile: between 25.1% and 50% (up to the median)
  3. Third quartile: 50.1% to 75% (above the median)
  4. Fourth quartile: the highest 25% of numbers

Example of Quartile

Suppose the distribution of math scores in a class of 19 students in ascending order is:

  • 59, 60, 65, 65, 68, 69, 70, 72, 75, 75, 76, 77, 81, 82, 84, 87, 90, 95, 98

First, mark down the median, Q2, which in this case is the 10th value: 75.

Q1 is the central point between the smallest score and the median. In this case, Q1 is the middle of the nine scores below the median, that is, the fifth score: 68. (Note that the median can also be included when calculating Q1 or Q3 for an odd set of values. If we were to include the median on either side of the middle point, then Q1 would be the middle value between the first and 10th score, which is the average of the fifth and sixth score: (fifth + sixth)/2 = (68 + 69)/2 = 68.5.)

Q3 is the middle value between Q2 and the highest score, that is, the 15th score: 84. (Or, if you include the median, Q3 is the average of the 14th and 15th scores: (82 + 84)/2 = 83.)

Now that we have our quartiles, let’s interpret their numbers. A score of 68 (Q1) represents the first quartile and is the 25th percentile. 68 is the median of the lower half of the score set, that is, the median of the nine scores below the median, which run from 59 to 75.

Q1 tells us that 25% of the scores are less than 68 and 75% of the class scores are greater. Q2 (the median) is the 50th percentile and shows that 50% of the scores are less than 75, and 50% of the scores are above 75. Finally, Q3, the 75th percentile, reveals that 25% of the scores are greater and 75% are less than 84.
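The two conventions used in the example above (leaving the median out of each half, or including it in both) can be sketched as follows. The function name and the `include_median` flag are assumptions for illustration; libraries offer several other quartile conventions that can give slightly different values.

```python
from statistics import median

def quartiles(values, include_median=False):
    """Return (Q1, Q2, Q3) as medians of the lower half, all data, and upper half."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    if n % 2 == 0:
        lower, upper = data[:half], data[half:]
    elif include_median:
        # Odd count: the median belongs to both halves.
        lower, upper = data[:half + 1], data[half:]
    else:
        # Odd count: the median belongs to neither half.
        lower, upper = data[:half], data[half + 1:]
    return median(lower), median(data), median(upper)

scores = [59, 60, 65, 65, 68, 69, 70, 72, 75, 75,
          76, 77, 81, 82, 84, 87, 90, 95, 98]

print(quartiles(scores))                       # (68, 75, 84)
print(quartiles(scores, include_median=True))  # (68.5, 75, 83.0)
```

Both results match the hand calculation in the example, which is a useful sanity check before trusting the function on other data.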

IQR Formula

IQR = Q3 - Q1

The interquartile range shows how the data is spread about the median. It is less susceptible than the range to outliers and can, therefore, be more helpful.

The interquartile range rule is useful in detecting the presence of outliers. Outliers are individual values that fall outside of the overall pattern of a data set. This definition is somewhat vague and subjective, so it is helpful to have a rule to apply when determining whether a data point is truly an outlier. This is where the interquartile range rule comes in.

Outliers Detection using IQR

To detect outliers using this method, we define a new range, let’s call it the decision range, and any data point lying outside this range is considered an outlier and is dealt with accordingly. The range is given below:

Lower Bound: (Q1 - 1.5 * IQR)
Upper Bound: (Q3 + 1.5 * IQR)

Any data point less than the Lower Bound or more than the Upper Bound is considered as an outlier.
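As a minimal sketch of this decision rule, the snippet below computes the bounds from the quartiles of the math scores in the quartile example (Q1 = 68, Q3 = 84, median excluded). The function names and the extra test values 20 and 110 are hypothetical.

```python
def iqr_bounds(q1, q3, scale=1.5):
    """Return the (lower, upper) decision bounds of the IQR rule."""
    iqr = q3 - q1
    return q1 - scale * iqr, q3 + scale * iqr

def find_outliers(values, q1, q3, scale=1.5):
    """Keep only the values that fall outside the decision range."""
    lower, upper = iqr_bounds(q1, q3, scale)
    return [x for x in values if x < lower or x > upper]

q1, q3 = 68, 84                      # quartiles of the math scores
lower, upper = iqr_bounds(q1, q3)    # IQR = 16, so bounds are 44.0 and 108.0

scores = [59, 60, 65, 65, 68, 69, 70, 72, 75, 75,
          76, 77, 81, 82, 84, 87, 90, 95, 98]

print(find_outliers(scores, q1, q3))          # [] since every score is inside 44..108
print(find_outliers([20, 70, 110], q1, q3))   # [20, 110] for hypothetical new values
```

Passing the quartiles in explicitly keeps the rule itself separate from the choice of quartile convention, which varies between tools.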

But the question is:

Why only 1.5 times the IQR? Why not any other number?

Well, as you might have guessed, the number (here 1.5, hereinafter the scale) clearly controls the sensitivity of the range and hence of the decision rule. A bigger scale would cause some outliers to be treated as ordinary data points, while a smaller one would cause some ordinary data points to be perceived as outliers. And we’re quite sure neither of these cases is desirable.

But this is an abstract way of explaining the reason, it’s quite effective, but naïve nonetheless. So to what should we turn our heads for hope?

Mathematics! Of course!

For example, let’s say our data follows our beloved Gaussian distribution.

Gaussian Distribution

You must all have seen what a Gaussian distribution looks like, right? If not, here it is:

Gaussian Distribution

There are certain observations which could be inferred from this figure:

  • About 68.27% of the whole data lies within one standard deviation (1σ) of the mean (μ), taking both sides into account, the pink region in the figure.
  • About 95.45% of the whole data lies within two standard deviations (2σ) of the mean (μ), taking both sides into account, the pink+blue region in the figure.
  • About 99.73% of the whole data lies within three standard deviations (3σ) of the mean (μ), taking both sides into account, the pink+blue+green region in the figure.
  • And the remaining 0.27% of the whole data lies outside three standard deviations (>3σ) of the mean (μ), taking both sides into account, the little red region in the figure. And this part of the data is considered as outliers.
  • The first and the third quartiles, Q1 and Q3, lie at -0.675σ and +0.675σ from the mean, respectively.

Let’s calculate the IQR decision range in terms of σ

Taking scale = 1:

Lower Bound:
= Q1 - 1 * IQR
= Q1 - 1 * (Q3 - Q1)
= -0.675σ - 1 * (0.675 - [-0.675])σ
= -0.675σ - 1 * 1.35σ
= -2.025σ
Upper Bound:
= Q3 + 1 * IQR
= Q3 + 1 * (Q3 - Q1)
= 0.675σ + 1 * (0.675 - [-0.675])σ
= 0.675σ + 1 * 1.35σ
= 2.025σ

So, when the scale is taken as 1, then according to the IQR method any data point that lies beyond 2.025σ from the mean (μ), on either side, is considered an outlier. But as we know, the data is useful up to 3σ on either side of μ. So we cannot take scale = 1, because this makes the decision range too exclusive, meaning it produces too many outliers. In other words, the decision range gets so small (compared to 3σ) that it treats some genuine data points as outliers, which is not desirable.

Taking scale = 2:

Lower Bound:
= Q1 - 2 * IQR
= Q1 - 2 * (Q3 - Q1)
= -0.675σ - 2 * (0.675 - [-0.675])σ
= -0.675σ - 2 * 1.35σ
= -3.375σ
Upper Bound:
= Q3 + 2 * IQR
= Q3 + 2 * (Q3 - Q1)
= 0.675σ + 2 * (0.675 - [-0.675])σ
= 0.675σ + 2 * 1.35σ
= 3.375σ

So, when the scale is taken as 2, then according to the IQR method any data point that lies beyond 3.375σ from the mean (μ), on either side, is considered an outlier. But as we know, the data is useful up to 3σ on either side of μ. So we cannot take scale = 2, because this makes the decision range too inclusive, meaning it produces too few outliers. In other words, the decision range gets so big (compared to 3σ) that it treats some outliers as genuine data points, which is not desirable either.

Taking scale = 1.5:

Lower Bound:
= Q1 - 1.5 * IQR
= Q1 - 1.5 * (Q3 - Q1)
= -0.675σ - 1.5 * (0.675 - [-0.675])σ
= -0.675σ - 1.5 * 1.35σ
= -2.7σ
Upper Bound:
= Q3 + 1.5 * IQR
= Q3 + 1.5 * (Q3 - Q1)
= 0.675σ + 1.5 * (0.675 - [-0.675])σ
= 0.675σ + 1.5 * 1.35σ
= 2.7σ

When the scale is taken as 1.5, then according to the IQR method any data point that lies beyond 2.7σ from the mean (μ), on either side, is considered an outlier. And this decision range is the closest to what the Gaussian distribution tells us, i.e., 3σ. In other words, this makes the decision rule closest to what the Gaussian distribution considers for outlier detection, and this is exactly what we wanted.

To land exactly on 3σ, we would need a scale of about 1.72 (roughly 1.7), but 1.5 is a more “symmetrical” number than 1.7.
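The three derivations above can be reproduced numerically. This sketch uses `statistics.NormalDist` from the Python standard library and the exact normal quartile (about 0.6745σ) instead of the rounded 0.675σ, so the bounds differ from the hand calculation in the third decimal place. The function and variable names are assumptions.

```python
from statistics import NormalDist

z = NormalDist()             # standard normal: mu = 0, sigma = 1
q3 = z.inv_cdf(0.75)         # third quartile; Q1 is -q3 by symmetry
iqr = 2 * q3                 # Q3 - Q1

def upper_bound(scale):
    """Upper decision bound Q3 + scale * IQR, in units of sigma."""
    return q3 + scale * iqr

for scale in (1, 1.5, 2):
    bound = upper_bound(scale)
    flagged = 2 * (1 - z.cdf(bound))   # share of normal data outside +/- bound
    print(scale, round(bound, 3), round(flagged * 100, 2))

# Scale at which the bound lands exactly on 3 sigma.
scale_for_3_sigma = (3 / q3 - 1) / 2
print(round(scale_for_3_sigma, 2))
```

Each printed row shows the scale, the bound in σ units, and the percentage of perfectly normal data that the rule would flag at that scale.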

5- Absolute Deviation & Mean Absolute Deviation

The average deviation, or mean absolute deviation, is calculated similarly to standard deviation, but it uses absolute values instead of squares to circumvent the issue of negative differences between the data points and the mean. To calculate the average deviation:

  1. Calculate the mean of all data points.
  2. Calculate the difference between the mean and each data point.
  3. Calculate the average of the absolute values of those differences.
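The three steps above can be sketched as follows; the function name and the example values are assumptions for illustration.

```python
def mean_absolute_deviation(values):
    """Average absolute distance of the values from their mean."""
    mean = sum(values) / len(values)                        # step 1
    differences = [x - mean for x in values]                # step 2
    return sum(abs(d) for d in differences) / len(values)   # step 3

data = [2, 4, 4, 4, 5, 5, 7, 9]        # hypothetical example values, mean = 5
print(mean_absolute_deviation(data))   # 1.5
```

The absolute differences are 3, 1, 1, 1, 0, 0, 2 and 4, which sum to 12; dividing by the 8 values gives 1.5.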

So, what is the difference between standard deviation and mean absolute deviation?

Both measure the spread of your data by computing the distance of the data from its mean.

  1. The mean absolute deviation uses the L1 norm (also called Manhattan distance or rectilinear distance).
  2. The standard deviation uses the L2 norm (also called Euclidean distance).

The difference between the two norms is that the standard deviation squares each difference, whereas the mean absolute deviation only looks at the absolute difference. Hence large outliers create a higher dispersion when using the standard deviation than when using the mean absolute deviation. The Euclidean distance is nevertheless the one used more often. The main reason is that the standard deviation has nice properties when the data is normally distributed, so under this assumption it is recommended to use it. However, people often make this assumption for data that is not actually normally distributed, which creates issues. If your data is not normally distributed, you can still use the standard deviation, but you should be careful with the interpretation of the results.
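To see the outlier effect described above, this sketch compares how much each measure grows when a single extreme value is added. The data values and helper name are hypothetical, and the population standard deviation (`statistics.pstdev`) is used so that both measures divide by n.

```python
import statistics

def mean_absolute_deviation(values):
    """Average absolute distance from the mean (L1-based spread)."""
    mean = statistics.fmean(values)
    return statistics.fmean([abs(x - mean) for x in values])

clean = [10, 11, 12, 13, 14]       # hypothetical well-behaved values
with_outlier = clean + [50]        # same values plus one extreme point

# Growth factor of each measure once the outlier is added.
mad_ratio = mean_absolute_deviation(with_outlier) / mean_absolute_deviation(clean)
std_ratio = statistics.pstdev(with_outlier) / statistics.pstdev(clean)

print(round(mad_ratio, 2), round(std_ratio, 2))
```

On this data the standard deviation grows by a larger factor than the mean absolute deviation, because squaring gives the extreme point more weight.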


Mohamed Abdelrazek

Communications and Electronics Engineering graduate. A quick-learning, adaptable individual with strong analytical skills for driving successful business solutions.