Mean, Variance and Standard Deviation — Once and For All

14 min read · Mar 26, 2020

Objectives

By the end of this post you should:

understand what the mean is, and why it is so useful,
understand the importance of variance and what it tells us,
understand what the normal distribution is, and why we use it.

We are all a bit different, aren’t we? Mostly that is not a bad thing. If we were all the same color, we wouldn’t have a rainbow, right? Some people are taller than others; some have blue eyes, some brown; some people can eat anything they want and still not gain any weight, and for others, water is enough. My point is: we are all a bit different. There is diversity on earth, not only among humans but in most things. What do you think about this question: what is the height of humans? Isn’t it absurd? Indeed it is. So, can’t we know anything about anything? We are living in an era in which information is very important. Businesses grow or sink depending on it. The more you know about your target customers, the more likely it is that your business will succeed. In this large diversity of things, how do we collect information? How do we learn more about something that isn’t constant? This is what we are going to discuss in this post.

Imagine you are at a carnival in Japan, and there is a very exciting contest: everybody needs to guess how tall a particular Japanese man is, and the person who guesses it right (or the one who gets the closest) wins a katana. This is the katana of your dreams and you really want it, but you have no idea how tall this person is. What to do? His height could be anywhere between −∞ and ∞. Wait a second, you know that height can’t be negative or zero; those values don’t make any sense. So you have a piece of information: the height can’t be negative or zero. Good, now you have narrowed the possible values of this person’s height to a range: 0<height<∞. What else do you know? A quick check on the internet shows you that the tallest man who ever lived was 272cm tall [3]. You have seen some pictures of him next to some objects, and based on these you don’t think this Japanese man is taller than him. Okayyy, so you can reduce the range to 0<height<272. With the same logic, you look up the shortest man who ever lived and further reduce the range to 54.6<height<272 [6]. What more can you do from here? Let’s see…

Mean (Average)

Why do we even have something called mean? What information does it give us? When is it useful? Well, we have already talked about the diversity on earth. We can’t just say that humans are 160cm tall. But, we can say that, for example, the average height of Japanese men is 172cm (in 2020 [1]). It gives us information about the center of diversity.

First of all, how was this number found? There are 2 ways you can get it:

1) The first way is precise but very time-consuming. It involves measuring the height of every Japanese man (the “population” for this example), then summing all of the measurements and dividing the sum by the number of measurements. There are around 62 million men in Japan (in 2020 [2]), so:

$$\mu = \frac{h_1 + h_2 + \cdots + h_{62000000}}{62000000}$$

Where h_n (n=1,2,⋯,62000000) represents the height of the nth Japanese man, and μ represents the “population mean”. The equation given above is specific to this example, and the general equation of the mean is as follows:

$$\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$$

Where N is the number of items in the population and x_i is the ith measurement.

2) The second way is to estimate the mean value. What is meant by estimate is: instead of measuring the height of every Japanese man, we measure a randomly selected subset (a sample) of them (I won’t delve into randomness here; it is a subject for another post). As you can imagine, this way is much less time-consuming than the first one. It is not as precise as the first one; however, in most cases the error is negligible. This is how we calculate the sample mean:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

This equation is exactly like the one above it, with only one difference: this time 0 < n ≪ 62000000. Here, not every Japanese man is taken into account when calculating the mean, only a part of them. A bar over a letter (for example x̄) denotes a “sample” mean.

As can be seen from the example above, we couldn’t get the real average value when we used only a sample from our population; however, we got pretty close. It is up to you to decide how large a sample to use, but keep in mind that the larger the sample, the smaller the error tends to be.
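To make the population versus sample idea concrete, here is a minimal Python sketch. The "population" is simulated rather than measured: the 172cm center comes from this post, while the 6cm spread, the population size and the sample sizes are assumptions made purely for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical "population": 100,000 simulated heights in cm.
# The 172 cm centre comes from the post; the 6 cm spread is an assumption.
population = [random.gauss(172, 6) for _ in range(100_000)]
population_mean = statistics.fmean(population)

# Estimate the mean from random samples of increasing size.
sample_means = {}
for n in (10, 100, 1_000, 10_000):
    sample = random.sample(population, n)
    sample_means[n] = statistics.fmean(sample)

print(f"population mean: {population_mean:.2f}")
for n, m in sample_means.items():
    error = abs(m - population_mean)
    print(f"n = {n:>6}: sample mean = {m:.2f}, error = {error:.2f}")
```

The error does not have to shrink at every single step, since each sample is random, but larger samples tend to land closer to the population mean.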

Getting back to the contest, you now know another piece of information that can help you better guess the height of this Japanese man: the mean value. A quick check on the internet shows that it is 172cm. Now you have a better idea of how tall this person could be. With no further information, you could simply guess 172 and have a better chance of winning the contest than before. But you are still worried that this information might not be very reliable. Why? Look at the graph below:

Now look at this one:

(Why is it a good idea to guess with mean?)

As you can see from the figures above, although the mean values are the same in both cases, the diversity is very different. So, okay, we know that the center of our data is located at the mean value, but how widely is the rest of the data spread? If most of the data are close to the mean, you can have higher confidence in your guess, since a data point far away from the mean (e.g. a 200cm or 120cm tall Japanese man) is very unlikely. However, if the data are spread widely, you can’t be very confident that guessing the mean value will pay off, because of the high variance. Let’s continue…

Variance

Have you noticed something that keeps popping up here and there in this post? You probably have: diversity. The mean gives us information about the center of our data, and the variance tells us how the data are spread around it.

Looking at a histogram plot could be very informative when we want to see the variance in data:

(What is a histogram?)

You can see from the figure above that most of the data points are gathered around the mean. Because the variance is low, extreme cases (both very low and very high heights) are very rare. Because the variance is not in the same units as the values in this graph, we can’t show it directly on the graph (we haven’t seen its equation yet, but if you take a look at it below, there is a square in it). However, knowing the definition of variance (which is going to be explained in just a moment) lets us imagine what our data would look like without even needing to look at its graph. As a side note, we can show the standard deviation on the distribution graph, and we will when we talk about it.

Going back to the contest, now you know why knowing the variance boosts our confidence in guessing the mean value. If we know that the data have low variance, there is a much higher chance that our guess of the mean value will be close to the actual height of this Japanese gentleman.

However, if you look at the figure below, you can see that the variance is very high. Although the mean is still 172cm, the data look completely random: they are spread (almost) equally across the bins.
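The contrast between a low-variance and a high-variance dataset with the same mean can be sketched in a few lines of Python. Both datasets below are simulated, and every number in them (sizes, spreads, the 10cm bin width) is an assumption for illustration, not the data behind the figures in this post.

```python
import random
import statistics
from collections import Counter

random.seed(1)  # fixed seed so the run is repeatable

BIN_WIDTH = 10  # cm

# Two hypothetical datasets centred on 172 cm (every number here is an assumption).
narrow = [random.gauss(172, 3) for _ in range(2_000)]     # low variance
wide = [random.uniform(142, 202) for _ in range(2_000)]   # high variance, nearly flat

def bin_counts(data):
    """Return {bin_start: count} with bins BIN_WIDTH cm wide, sorted by bin."""
    counts = Counter(int(x // BIN_WIDTH) * BIN_WIDTH for x in data)
    return dict(sorted(counts.items()))

for name, data in (("narrow", narrow), ("wide", wide)):
    mean = statistics.fmean(data)
    variance = statistics.pvariance(data)
    print(f"{name}: mean = {mean:.1f}, variance = {variance:.1f}")
    for start, count in bin_counts(data).items():
        bar = "#" * (count // 25)  # one '#' per 25 data points
        print(f"  {start}-{start + BIN_WIDTH} cm: {bar} ({count})")
```

The printed bars are a crude text histogram: the narrow dataset piles up in the bins next to 172cm, while the wide one fills every bin to roughly the same level.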

So, now that we know what the variance is, we can talk about how to calculate it. The number we are trying to get is the average squared distance between a data point and the mean value. This makes sense, doesn’t it? The mean gives us the center of the data, and averaging the distance from each data point to that center tells us about the spread. Don’t worry if it is still not so clear; as we did before, we are going to work through an example to better understand how it works.

But why are we taking the squared distance? We said that we want an average distance between the data points and the mean. What happens when a distance is negative? Because we sum all the distances (for averaging), negative distances would cancel positive ones and shrink the total. We don’t want that. That’s why we square the distances: the point is to get an idea of the spread, regardless of whether a data point lies to the left or to the right of the mean. Why not use the absolute value then? I won’t go into the details here, but in short, the absolute value is not differentiable at zero, whereas the square is differentiable everywhere. And if you think about it, squaring a distance does not get in the way of investigating the spread of the data around the mean. So, as with the mean, you can calculate the variance on a population, or estimate it from a sample taken from the population.

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2$$

Where sigma squared (σ²) represents the population variance. The equation of the sample variance differs a little from that of the population variance:

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$$

(Why is the denominator n-1?)
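One way to build some intuition for the n-1 denominator is a small simulation. The sketch below assumes a normally distributed population with a known standard deviation of 6cm (an assumption, not a figure from this post), draws many small samples, and averages the two candidate estimates. The distances are measured from each sample's own mean, which sits closer to the sample than the true mean does, so dividing by n tends to come out too small.

```python
import random
import statistics

random.seed(2)  # fixed seed so the run is repeatable

TRUE_SD = 6.0                 # assumed population standard deviation (cm)
TRUE_VARIANCE = TRUE_SD ** 2  # 36.0
SAMPLE_SIZE = 5
TRIALS = 20_000

divide_by_n = []
divide_by_n_minus_1 = []
for _ in range(TRIALS):
    sample = [random.gauss(172, TRUE_SD) for _ in range(SAMPLE_SIZE)]
    sample_mean = statistics.fmean(sample)
    squared_sum = sum((x - sample_mean) ** 2 for x in sample)
    divide_by_n.append(squared_sum / SAMPLE_SIZE)
    divide_by_n_minus_1.append(squared_sum / (SAMPLE_SIZE - 1))

avg_n = statistics.fmean(divide_by_n)
avg_n_minus_1 = statistics.fmean(divide_by_n_minus_1)

print(f"true variance:            {TRUE_VARIANCE:.1f}")
print(f"average when dividing by n:   {avg_n:.1f}")
print(f"average when dividing by n-1: {avg_n_minus_1:.1f}")
```

With samples of size 5, dividing by n should average out near 4/5 of the true variance, while dividing by n-1 should average out near the true variance itself.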

To illustrate how to calculate the variance we have the following data:

There are 6 data points in the figure above, and the mean value for these 6 points is 172.
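The calculation can be followed step by step in Python. The six heights below are hypothetical stand-ins, chosen only so that their mean is 172; they are not the values from the figure.

```python
import math
import statistics

# Hypothetical heights in cm, chosen so that the mean is 172.
heights = [168, 170, 171, 173, 174, 176]

mean = sum(heights) / len(heights)
deviations = [h - mean for h in heights]   # signed distances from the mean
squared = [d ** 2 for d in deviations]     # squaring removes the sign

population_variance = sum(squared) / len(heights)      # divide by N
sample_variance = sum(squared) / (len(heights) - 1)    # divide by n - 1
population_sd = math.sqrt(population_variance)

print(f"mean: {mean}")
print(f"sum of signed distances: {sum(deviations)}")   # they cancel out
print(f"squared distances: {squared}")
print(f"population variance: {population_variance}")
print(f"sample variance: {sample_variance}")
print(f"population standard deviation: {population_sd:.2f}")
```

The signed distances add up to zero, which is exactly why they are squared before averaging. The same results can be cross-checked with `statistics.pvariance(heights)` and `statistics.variance(heights)` from the standard library.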

I didn’t think it was necessary to make a separate section about standard deviation, because once you know how to calculate the variance it is very easy to find. It is just the square root of the variance:

$$\sigma = \sqrt{\sigma^2}$$

Sometimes it is useful to think in terms of standard deviation, because it is proven [4] that, for all distributions for which the standard deviation is defined, the amount of data within a number of standard deviations of the mean is at least as much as given in the following table [5]:

You can look at the figure below for a visualization.
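The guarantee described above matches what is known as Chebyshev's inequality: for any k greater than 1, at least 1 − 1/k² of the data lies within k standard deviations of the mean. A minimal Python sketch of that bound follows; the function name and the chosen values of k are just for illustration.

```python
import math

def chebyshev_lower_bound(k: float) -> float:
    """Minimum fraction of data within k standard deviations of the mean.

    The bound is only informative for k > 1; for k <= 1 it is zero or negative.
    """
    return 1 - 1 / k ** 2

for k in (math.sqrt(2), 2, 3, 4, 5):
    bound = chebyshev_lower_bound(k)
    print(f"within {k:.2f} standard deviations: at least {bound:.1%}")
```

These are minimums that hold for any distribution with a defined standard deviation. A specific distribution, such as the normal distribution discussed next, usually keeps much more of its data close to the mean.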

Putting It All Together

Now that we know both the mean and the variance, we can start talking about the normal distribution, and about fitting a curve to our data to calculate some probabilities. There are many different probability distributions; here I just want to give a gentle introduction to the normal distribution, to give you an idea of how to use the mean and variance. A curve gives us the same information that a histogram does. However, it has some advantages over a histogram. Namely:

When we sample from a population, some bins in the histogram might end up empty because of the way the samples happened to be selected. So what happens when we need a probability for that bin from the histogram? Since there are no values in the bin, does that mean we cannot get a probability for the values that belong to it? With a curve, we can.
We know that we have to choose several ranges to count the number of elements in each bin and draw a histogram. Let’s say one of the ranges we have chosen for the height distribution is (160, 165). But what happens if we want the probability of a height between 163.24 and 164.95? How do we calculate this with a histogram? Well, we cannot, but we can if we have a curve fitted to that histogram.
If we don’t have enough time or money to collect a lot of measurements (samples), the histogram of the data might not be enough to draw conclusions. In that case, fitting a curve to the histogram using the mean and variance (or standard deviation) could save us a lot of time and money.

However, remember that both the histogram and the curve are distributions, and they show us how the probabilities of samples are distributed.
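The first two advantages can be illustrated with a short Python sketch. The twelve heights are made up for this example (they are not data from this post), and the curve is a normal distribution fitted from the sample mean and standard deviation using the standard library's `statistics.NormalDist`.

```python
from statistics import NormalDist

# Hypothetical small sample of heights in cm (not data from the post).
sample = [158, 166, 168, 169, 171, 172, 173, 175, 176, 178, 181, 187]

# Histogram view: count how many sampled heights fall in the 160-165 bin.
in_bin = sum(1 for h in sample if 160 <= h < 165)
print(f"sampled heights in the 160-165 bin: {in_bin}")

# Curve view: fit a normal distribution from the sample mean and standard deviation.
curve = NormalDist.from_samples(sample)

# Probability of a height between 163.24 and 164.95, a range narrower than any bin.
p_narrow = curve.cdf(164.95) - curve.cdf(163.24)
print(f"fitted mean = {curve.mean:.1f}, fitted sd = {curve.stdev:.1f}")
print(f"P(163.24 < height < 164.95) = {p_narrow:.3f}")
```

The bin is empty in this sample, yet the fitted curve still assigns a small, nonzero probability to a range inside it. How trustworthy that number is depends on how well a normal curve actually describes the data.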

To draw a normal distribution, knowing only the mean and the variance is sufficient. Important things to know about the normal distribution are:

The total area under its curve is equal to 1.
It is a continuous distribution, hence probabilities are calculated for a specific range (e.g. the probability of height being in the range 170–180), and the probability of a single point (e.g. the probability of height being exactly 170) is 0.

I don’t want to go into too much detail about the normal distribution, because this post is mainly about the mean and variance and how to use them. However, I am sure many of you will wonder how this curve is drawn. Continuous probability distributions (where probabilities are specified for ranges instead of single values) are described by probability density functions. The probability density function of the normal distribution is as follows:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$$

I don’t want you to try too hard to understand what this equation is, how it was found and why it looks the way it does. It is enough to notice that the only parameters in it are the mean and the variance (the standard deviation can be found from the variance). And if you plot this equation, you get the curve shown below.

Here we have an example of a normal distribution drawn over a histogram of randomly generated data representing the heights of Japanese men.
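For readers who would like to see the density function as code, here is a minimal Python sketch. The mean of 172 comes from this post; the variance of 100 is an assumption made for illustration. The hand-written function is compared against `statistics.NormalDist` from the standard library, and a simple numerical sum checks that the area under the curve is (very nearly) 1.

```python
import math
from statistics import NormalDist

def normal_pdf(x: float, mean: float, variance: float) -> float:
    """Normal probability density at x, using only the mean and the variance."""
    sd = math.sqrt(variance)
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

MEAN = 172.0
VARIANCE = 100.0  # assumption for illustration; the post does not give a value

# Cross-check one point against the standard library implementation.
reference = NormalDist(MEAN, math.sqrt(VARIANCE))
print(f"hand-written pdf at the mean: {normal_pdf(MEAN, MEAN, VARIANCE):.5f}")
print(f"NormalDist pdf at the mean:   {reference.pdf(MEAN):.5f}")

# Approximate the total area under the curve over mean +/- 8 standard deviations.
STEPS = 16_000
step = 160 / STEPS
xs = [MEAN - 80 + i * step for i in range(STEPS + 1)]
area = sum(normal_pdf(x, MEAN, VARIANCE) for x in xs) * step
print(f"approximate area under the curve: {area:.4f}")
```

Evaluating `normal_pdf` on a grid of x values and plotting the results would trace out the bell-shaped curve.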

Now we are ready to go back to the contest and make an educated guess. You are planning to guess 172cm (the mean value) as this person’s height, but you also want to be sure that you have a good enough probability of winning that katana. You decide that if the probability of height being in the range 168<height<176 is more than 25%, you are going to go with the mean value. So, the next thing you do is to calculate this probability.

As we said before, we calculate probability in a range by calculating the area under the curve that is covering that range:

(How to plot a normal distribution, and calculate AUC?)

Based on this distribution, the probability that the height of a Japanese man is between 168cm and 176cm is 0.31. Considering how many other height values are possible, you think this is a pretty good probability, and you make your final guess of 172cm.
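A calculation along these lines can be sketched in Python. The post does not state the standard deviation behind the curve, so the value of 10cm below is an assumption, picked because it reproduces the 0.31 quoted above. The area is computed twice: once from the cumulative distribution function, and once by summing thin trapezoids under the curve.

```python
from statistics import NormalDist

MEAN = 172.0
SD = 10.0  # assumption: not stated in the post, chosen because it reproduces 0.31

heights = NormalDist(MEAN, SD)

# Area under the curve between 168 and 176, from the cumulative distribution function.
p_exact = heights.cdf(176) - heights.cdf(168)

# The same area approximated by summing thin trapezoids under the density curve.
STEPS = 1_000
width = (176 - 168) / STEPS
edges = [168 + i * width for i in range(STEPS + 1)]
p_trapezoid = sum(
    (heights.pdf(a) + heights.pdf(b)) / 2 * width
    for a, b in zip(edges, edges[1:])
)

print(f"P(168 < height < 176) via CDF:        {p_exact:.2f}")
print(f"P(168 < height < 176) via trapezoids: {p_trapezoid:.2f}")
```

Shrinking the range toward a single point shrinks the area toward 0, which is why the probability of any exact height is 0 in a continuous distribution.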

Congratulations! You won the katana!

Recap

The mean gives us information about the center of a dataset. If measurements for the whole population are known (e.g. the height of every Japanese male), it can be calculated as follows:

$$\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$$

Where N is the number of items in the population.

If we have measurements for only a part of the population (because, for example, we didn’t have enough time or money to collect measurements for the whole population), we can still estimate the mean as follows:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

Where n is the number of items in the sample that is taken from the population.

Variance is a measure of how much the data points are spread around the mean. As with the mean, we can calculate the variance for either a population or a sample taken from a population.

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2$$

Where N is the number of items in the population.

The equation for estimating the variance is slightly different:

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$$

Where n is the number of items in the sample.

The standard deviation is equal to the square root of the variance. It is sometimes useful to think in terms of it, because it tells us the minimum amount of data that lies within a given number of standard deviations of the mean.

Both the histogram and the curve are distributions, and they show us how the probabilities of samples are distributed. However, the curve has some advantages over the histogram.
The normal distribution is a continuous probability distribution, hence it is represented with a probability density function:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$$

As can be seen from the equation above, knowing the mean and variance is enough to use the normal distribution. In continuous distributions, probabilities of events are calculated over specific ranges (instead of single values), and the area under the curve (AUC) gives these probabilities.

Conclusions

There is diversity on earth, and most things are non-deterministic. That’s why, to better understand the things around us, we need models that can explain them to us. Statistical models are widely used to describe different populations and natural phenomena. Whether you want a clue about who might win the next election in your country, or you are trying to learn how much variation to expect when measuring the current in a copper wire, or something completely different, I hope that the things you learned in this blog post will help you get there.
