Accuracy, precision, recall and F1 score

Objectives:

By the end of this post you should:

  1. understand why accuracy is not always the best metric of choice in classification tasks,
  2. understand the difference between accuracy, precision, recall and F1 score and be able to choose the right metric for your needs.

One of the most important decisions to make before starting a Machine Learning project is which metric to use. It is crucial because the wrong metric can trick you into believing that your model is good, or getting better, when in reality it is not.

I have a feeling that accuracy is sometimes considered the “one metric to rule them all”. However, this is not the case. Choosing a metric should be done after asking a number of questions: What are you trying to achieve? What is the output of the model? What are the consequences of its decisions? What about your data? What kind of data do you have? Is it balanced? These are some essential questions that one should consider before doing anything else.

This post introduces four metrics: accuracy, precision, recall, and F1 score. None of these metrics is better or worse than the others. The point is to choose the right one for the problem you are trying to solve. Alright, let’s get started then… :)

To make these metrics easier to understand, I am going to use a tool called a confusion matrix. It is simply a table that shows how our model’s predictions compare with the actual classes. For example, let’s say that our model classifies emotions based on a face image. Our confusion matrix could look something like this:

[Figure: Confusion Matrix]

Columns of this matrix show how many samples each class has (in this case: happy=33, angry=29, sad=38), and rows show our model’s predictions. For example, we have 33 samples of the happy class, but our model predicted happy for only 12 of them, sad for 15, and angry for 6. If all of the entries of this matrix were on its diagonal, we would have a perfect model.
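To make the layout concrete, here is a minimal sketch in plain Python that counts a confusion matrix with the same orientation as above (rows are predictions, columns are actual classes). The function name `confusion_matrix` and the six toy samples are hypothetical, made up for illustration; they are not the data behind the figure.

```python
from collections import Counter

def confusion_matrix(actual, predicted, labels):
    """Return matrix[predicted_label][actual_label] = number of samples.

    Rows are the model's predictions and columns are the actual classes,
    matching the orientation described in this post.
    """
    counts = Counter(zip(predicted, actual))
    return {p: {a: counts[(p, a)] for a in labels} for p in labels}

# Hypothetical toy data, not the numbers from the figure.
labels = ["happy", "angry", "sad"]
actual = ["happy", "happy", "angry", "sad", "sad", "angry"]
predicted = ["happy", "sad", "angry", "sad", "happy", "angry"]

matrix = confusion_matrix(actual, predicted, labels)
for p in labels:
    # One row per predicted class, one entry per actual class.
    print(p, [matrix[p][a] for a in labels])

# Correct predictions sit on the diagonal (predicted == actual).
correct = sum(matrix[label][label] for label in labels)
print("correct:", correct, "of", len(actual))
```

In this toy data, four of the six samples land on the diagonal; the other two are a happy face predicted as sad and a sad face predicted as happy.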

Now that we know how a confusion matrix works, let’s come back to the main topic of this post. Imagine that our model’s predictions are as shown in the confusion matrix below:

[Figure: Another example of a confusion matrix]

Well, you need to be very careful now… If you didn’t think carefully before choosing your metric, and directly picked accuracy, you might fool yourself into believing that your model is amazing. Look, it has 99% accuracy! But… there are a number of problems here. The first, and probably the most obvious, is that the data is imbalanced. Your model can easily learn to output a single label no matter what its input is and still have very high accuracy. There are different methods you can use to solve this particular problem, and one of them is to choose a different metric than accuracy (since this post is not specifically about this problem, I don’t cover the other methods here).
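The sketch below illustrates this trap. It assumes a hypothetical data set of 200 samples with a single positive, and a “model” that always outputs the negative label. The numbers are chosen for illustration and are not taken from the figure.

```python
def accuracy(actual, predicted):
    """Fraction of predictions that match the actual labels."""
    matches = sum(a == p for a, p in zip(actual, predicted))
    return matches / len(actual)

# Hypothetical imbalanced data: 199 negatives (0) and 1 positive (1).
actual = [0] * 199 + [1]

# A "model" that ignores its input and always predicts the majority class.
predicted = [0] * len(actual)

print(accuracy(actual, predicted))  # 0.995
```

The accuracy looks excellent even though the single positive sample is never found.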

Another problem is: have you actually thought about how important your model’s decisions are? What impact do they have in real life? For example, what if your model predicts whether a person has cancer? Or what if it predicts whether a person is a terrorist? In the confusion matrix above, the two wrong predictions could change somebody’s life drastically. In one of those cases, there is an individual who is not a terrorist, but your model says that s/he is. Furthermore, although your model’s accuracy is 99%, it wasn’t able to detect any terrorist, even though there is one. Or, if this were a model used to classify whether a patient has an infectious virus, what do you think the consequence would be when a model with 99% accuracy says that a person doesn’t have the virus when in reality s/he does?

To deal with these problems, let’s look at some metrics other than accuracy.

Precision

Let’s start with the mathematical formula of precision:

Precision = True Positives / (True Positives + False Positives)

Let’s think about this equation a little bit. What benefit do we get using precision as our metric?

As long as there are no False Positives, your model’s precision is 1 (assuming True Positives > 0). The moment you start having some False Positives, the precision starts decreasing. We had 99% accuracy in the example above; let’s check what precision we have:

Precision = 0 / (0 + 1) = 0

Now we are getting somewhere. So, based on precision, our model is not that good after all… But when should you use precision? If the cost of False Positives is high, then you should use precision. For example, if your model predicts whether an email is spam, you would be very concerned with the number of False Positives you have, because a False Positive would mean that a legitimate email is marked as spam, and a user might miss a very important email because of that decision.
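As a rough sketch, precision can be computed from the two counts it depends on. The helper name `precision` and the choice to return 0.0 when the model makes no positive predictions are assumptions for illustration (the ratio is undefined in that case). The first call uses the counts described for the terrorist example: no terrorist detected and one person wrongly flagged. The other counts are made up.

```python
def precision(tp, fp):
    """Precision = TP / (TP + FP).

    Returns 0.0 when there are no positive predictions; that is a
    convention chosen for this sketch, since the ratio is undefined.
    """
    predicted_positives = tp + fp
    if predicted_positives == 0:
        return 0.0
    return tp / predicted_positives

# Terrorist example as described in the text: TP = 0, FP = 1.
print(precision(tp=0, fp=1))  # 0.0

# No False Positives and at least one True Positive gives a perfect score.
print(precision(tp=5, fp=0))  # 1.0

# Each False Positive lowers the score.
print(precision(tp=5, fp=5))  # 0.5
```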

Recall

What about recall then? Again, let’s check the equation:

Recall = True Positives / (True Positives + False Negatives)

This time, as long as there are no False Negatives, we are home free. Whenever the cost of False Negatives is too high, you should consider using recall as your metric. As an example, think of a model that predicts whether a patient has an infectious virus. With a False Negative, we say, okay, this patient doesn’t have the virus, when in reality s/he does, and the cost of that False Negative is more people infected by the disease.
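Recall can be sketched the same way. The helper name `recall`, the 0.0 fallback when there are no actual positives, and the screening counts below are all hypothetical, chosen only to illustrate the formula.

```python
def recall(tp, fn):
    """Recall = TP / (TP + FN).

    Returns 0.0 when there are no actual positives; that is a
    convention chosen for this sketch, since the ratio is undefined.
    """
    actual_positives = tp + fn
    if actual_positives == 0:
        return 0.0
    return tp / actual_positives

# Hypothetical screening results with 10 infected patients in total.
print(recall(tp=10, fn=0))  # 1.0, every infected patient was found
print(recall(tp=7, fn=3))   # 0.7, three infected patients were missed
```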

F1 Score

Before we delve into the F1 score’s equation, I would like you to imagine a situation where things can go wrong. Let’s look at recall’s equation again:

Recall = True Positives / (True Positives + False Negatives)

What could possibly go wrong here? Well, consider the classification problem we discussed earlier: what happens if we classify everyone as a terrorist? Let’s see:

Recall = 1 / (1 + 0) = 1

As you can see, even though our model has very little value in reality (because we classify everybody as a terrorist!), it has a perfect recall score. Of course, there is a trade-off between precision and recall: when we increase recall, we usually decrease precision. But do we need to keep track of both of these metrics when both are important to us? The answer is no, because we have the F1 score to do that for us. Now let’s check its equation:
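To see the trade-off in numbers, here is a small sketch that flags everybody as positive. The population of 200 people with a single terrorist is a hypothetical assumption, and `precision_recall` is a helper written for this example only.

```python
def precision_recall(actual, predicted):
    """Return (precision, recall) for binary labels where 1 is the positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == 1 and p == 1 for a, p in pairs)
    fp = sum(a == 0 and p == 1 for a, p in pairs)
    fn = sum(a == 1 and p == 0 for a, p in pairs)
    # Fall back to 0.0 when a ratio is undefined (assumed convention).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical population: 199 innocent people (0) and 1 terrorist (1).
actual = [0] * 199 + [1]

# Flag everybody as a terrorist.
predicted = [1] * len(actual)

p, r = precision_recall(actual, predicted)
print(p, r)  # 0.005 1.0
```

Recall is perfect because the one terrorist is caught, but precision collapses because 199 of the 200 positive predictions are wrong.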

F1 = 2 × (Precision × Recall) / (Precision + Recall)

The F1 score combines precision and recall so that our metric considers both of them. The reason the F1 score uses the harmonic mean instead of a simple average ((precision + recall) / 2) is that the harmonic mean punishes extreme values. When we have precision=1.0 and recall=0.0, their average is 0.5; however, the harmonic mean is 0.0.
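The difference between the two kinds of mean is easy to check with a short sketch. The function names are made up for this example, and returning 0.0 when both inputs are zero is an assumed convention, since the formula would otherwise divide by zero.

```python
def arithmetic_mean(precision, recall):
    """Plain average of the two values."""
    return (precision + recall) / 2

def f1_score(precision, recall):
    """Harmonic mean of precision and recall.

    Returns 0.0 when both inputs are zero (assumed convention).
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# The first pair is the one from the text; the other two are illustrative.
for precision, recall in [(1.0, 0.0), (0.9, 0.1), (0.5, 0.5)]:
    print(
        precision,
        recall,
        arithmetic_mean(precision, recall),
        round(f1_score(precision, recall), 2),
    )
```

All three pairs have the same simple average of 0.5, but the F1 score only reaches 0.5 when precision and recall are balanced.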

Recap

  • Precision helps you answer the following question: Of all the positives my model predicted, what percentage was actually right? So when it is crucial for you to have a low number of False Positives (e.g. when a positive output means an email is classified as spam), and the negative cases (both false and true) are less important to you, this could be the metric of your choice.
  • Recall helps you answer the following question: Of all the actual positives I have, what percentage did my model predict correctly? So when it is crucial for you to have a low number of False Negatives (e.g. a patient has an infectious virus and a False Negative means that your model says s/he doesn’t), this could be the metric of your choice.
  • The F1 score is the metric of choice when both precision and recall are important to you, because it combines the two and punishes extreme values of either.
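Putting the recap together, the sketch below computes all four metrics from one set of confusion-matrix counts. The counts (TP=0, FP=1, FN=1, TN=198) are an assumption: they are one set of numbers consistent with the terrorist example as described in the text (99% accuracy, two wrong predictions, no terrorist detected), not values read from the original figure. The function name `summarize` is hypothetical too.

```python
def summarize(tp, fp, fn, tn):
    """Return accuracy, precision, recall and F1 for binary confusion counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Fall back to 0.0 when a ratio is undefined (assumed convention).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Assumed counts consistent with the example described in the text.
metrics = summarize(tp=0, fp=1, fn=1, tn=198)
for name, value in metrics.items():
    print(name, value)
```

Accuracy comes out at 0.99 while precision, recall and F1 are all 0.0, which is exactly the gap this post warns about.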

Conclusions

It is crucial to understand that the metric you choose matters a lot. A wrongly chosen metric can fool you into thinking that the model you trained is much better than it really is. This, in turn, can harm your business, or maybe do even worse.

You should keep in mind that accuracy is not the only metric that exists. While deciding which metric to use, you should ask yourself the following questions:

  1. What is my model actually doing?
  2. What is the consequence of my model’s decision?
  3. What is the most important thing I want from my model? Is it a low number of False Positives? A low number of False Negatives? Or are both important?
  4. What do my data look like? Are they balanced?

After asking these questions, hopefully, you can come up with a metric that best suits your needs.

An extended version of this post also covers the Receiver Operating Characteristic (ROC) curve and the Area Under the Curve (AUC).

Final Words

Don’t forget to 👏🏻 if you liked this post, and please leave a comment below if you have any feedback, criticism, or something that you would like to discuss. I can also be reached on social media.

Machine Learning Engineer/Researcher
