ML: ROC Curve, F1-Score, precision, choose the best metrics

90% accuracy misses the mark

Let me tell you the story of a data scientist intern, so happy to use machine learning algorithms on real-life data after a year spent studying the Iris dataset, comparing Sepal.length and Petal.width.

This intern was assigned his first task in his new company: using browsing data, predict the probability that a visitor will come back to the website within the day. The goal of this prediction was to offer a different experience to users who are likely to come back.

Click, click, click, shift+Enter: notebook cell launched, algorithm doing its magic, accuracy calculating…

After a few endless seconds in the silence of a cell launch, the result pops up:

Accuracy: 90 %


This young, innocent data scientist intern then goes to his mentor, hands in his pockets: “Hey bud, anything more interesting for me?”

Truth is… 90% is way off.

Among all the visits to the website, only 9% led to a second visit within the day.

If this intern had predicted that no one would return, he would have had 91% accuracy, which would have been better than his algorithm.
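The arithmetic behind that claim can be checked in a few lines. This is a minimal sketch on made-up data: the 1,000 visits are a hypothetical number chosen for illustration, only the 9% return rate comes from the story.

```python
# Hypothetical data: 1,000 visits, 9% of which lead to a return visit.
n_visits = 1000
n_returns = 90
y_true = [1] * n_returns + [0] * (n_visits - n_returns)

# Baseline "model": predict that nobody comes back.
y_pred = [0] * n_visits

correct = sum(t == p for t, p in zip(y_true, y_pred))
baseline_accuracy = correct / n_visits
print(f"Baseline accuracy: {baseline_accuracy:.0%}")  # prints "Baseline accuracy: 91%"
```

A model that never finds a single returning visitor still beats the intern's 90%.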

Therefore, it’s essential to know:

Which metrics should we use to evaluate an algorithm’s performance?

0. Quick answer

What do I expect from my model?

Answering this question is three quarters of the work.

  • Banks want to detect fraud. They expect a model that doesn’t miss any fraud. Zero, zip, nada. The model can make the opposite mistake: flag transactions as fraud when they are not. Once in a while, clients will be contacted to validate a transaction, not a big deal.
  • Mail services, on the contrary, can miss spam emails, which will go straight into the inbox. But they will do their utmost not to classify legitimate emails as spam (for UX reasons).
  • Cancer recognition algorithms working from mole images can flag a cancer that isn’t one. The mole will be removed, end of the story. This is way better than missing a real cancer.

What do I expect from my model?

Said differently :

What mistakes can my model make? And what mistakes are forbidden?

The answer may lie in the business objectives of the company. For my introduction example, this comes down to answering the question: do we prefer to send a notification to someone who is not at all interested, or do we prefer to let someone very motivated leave the website?

We must therefore know what we are trying to minimize. For this purpose, we can use several metrics:

1. Metrics and tools

We will have a look at the following metrics and tools:

  • Confusion matrix
  • Accuracy
  • Precision
  • Recall
  • F1-Score
  • Sensitivity, specificity and ROC Curve

For this article, I will use the results from this detection:

Results of a Bradley Cooper detection

The algorithm had to recognise Bradley Cooper, and it obviously recognised too many people.

We then got y = [0,0,0,0,0,0,0,0,0,0,1] (real classes) and y_pred = [0,0,0,0,0,0,0,0,0,1,1] (predicted classes).

I. Confusion matrix

Confusion matrix for the Bradley Cooper detection

This matrix is the basis of everything. The following calculations and the ROC curve are based on the tn, fn, fp and tp seen in the image above.

Any self-respecting data scientist has, at some point, hesitated between true negative / false positive / false negative / true positive… Here is the trick I found: the first part (true / false) indicates whether the prediction is correct (predicted class = true class?), and the second part (positive / negative) indicates what was predicted.

So, false positives are all the elements that were predicted positive when they weren’t. In our case, a person who was not Bradley Cooper was identified as Bradley Cooper.

If we go back to my three previous examples, a bank will want fn = 0 when detecting fraud, an email service will want fp = 0 when detecting spam and a doctor will want fn = 0 when detecting cancer.
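To make the four cells concrete, here is a minimal pure-Python sketch that counts them for the y and y_pred lists given earlier. The helper name `confusion_counts` is made up for this illustration.

```python
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]  # real classes
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1]  # predicted classes

def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels encoded as 0/1."""
    tn = fp = fn = tp = 0
    for t, p in zip(y_true, y_pred):
        if p == 1 and t == 1:
            tp += 1  # predicted positive, and correct
        elif p == 1 and t == 0:
            fp += 1  # predicted positive, but wrong
        elif p == 0 and t == 1:
            fn += 1  # predicted negative, but wrong
        else:
            tn += 1  # predicted negative, and correct
    return tn, fp, fn, tp

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
print(tn, fp, fn, tp)  # prints "9 1 0 1"
```

With scikit-learn installed, `confusion_matrix(y_true, y_pred).ravel()` from `sklearn.metrics` returns the same four counts in the same order for a binary problem.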

II. Accuracy

Accuracy tells you the proportion of correct predictions among all predictions. The calculation is simply: number of correct predictions / total number of predictions, that is, (tp + tn) / (tp + tn + fp + fn).

As I explained above, the problem with this metric is that it does not indicate the strengths and weaknesses of our model.

If our algorithm hadn’t detected any Bradley Cooper at all, we would have had about 90% accuracy too.
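A quick check of that statement, as a sketch using the counts from the Bradley Cooper example (tn = 9, fp = 1, fn = 0, tp = 1) and the counts we would get from a model that never predicts Bradley Cooper:

```python
def accuracy(tn, fp, fn, tp):
    # Share of correct predictions (true positives + true negatives) among all predictions.
    return (tp + tn) / (tn + fp + fn + tp)

acc_model = accuracy(tn=9, fp=1, fn=0, tp=1)     # our detection
acc_nothing = accuracy(tn=10, fp=0, fn=1, tp=0)  # a model that detects nobody
print(round(acc_model, 3), round(acc_nothing, 3))  # prints "0.909 0.909"
```

Both models get the same score, although one finds Bradley Cooper and the other never does.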

III. Precision

Precision refers to the number of documents correctly assigned to class i compared to the total number of documents predicted as belonging to class i (total predicted positive), that is, tp / (tp + fp).

In our example, this is the number of times Bradley Cooper was recognized (it really was him) versus the number of times he was predicted (even if it wasn’t really him).
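For the Bradley Cooper detection (tp = 1, fp = 1), the calculation can be sketched as follows. Returning 0.0 when nothing is predicted positive is a convention assumed here, not something the article defines.

```python
def precision(tp, fp):
    # Assumed convention: 0.0 when the model predicted no positive at all.
    return tp / (tp + fp) if (tp + fp) else 0.0

p = precision(tp=1, fp=1)
print(p)  # prints "0.5"
```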

Hmm, our algorithm ain’t that good…

Precision measures the cost of false positives, that is, the elements detected by mistake. If we are trying to limit false positives, this is the indicator we want to maximize. In the spam detection example shown here, we have reached a precision = 1. This is ideal for this type of problem.

Note: The informed reader will understand that the example of Bradley Cooper is here to illustrate the metrics. Obviously, we cannot evaluate our algorithm with a single image.

But… how do we know if we detected all the positive classes?

To do so, we will use the…

IV. Recall

The recall refers to the number of documents correctly assigned to class i compared to the total number of documents that really belong to class i (total actual positives, tp + fn).

Well, the Bradley Cooper example isn’t appropriate to illustrate this indicator. But let’s say we had several Bradley Coopers in the photo: recall is the number of times Bradley Cooper was recognised versus the number of times he should have been recognised (where he really is).

This metric allows us to estimate how many truly positive documents we managed to capture and how many fell through the cracks.
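As a sketch, here is recall for the actual example (tp = 1, fn = 0) and for the imaginary photo just described. The "3 Bradley Coopers, 2 found" figures are hypothetical, and returning 0.0 when there is no actual positive is an assumed convention.

```python
def recall(tp, fn):
    # Assumed convention: 0.0 when there is no actual positive in the data.
    return tp / (tp + fn) if (tp + fn) else 0.0

recall_example = recall(tp=1, fn=0)       # the only real Bradley Cooper was found
recall_hypothetical = recall(tp=2, fn=1)  # hypothetical photo: 3 Bradleys, 2 found
print(recall_example, round(recall_hypothetical, 2))  # prints "1.0 0.67"
```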

You saw it coming, in the case of bank fraud or cancer detection, we wish… we must have Recall = 1.

Recall is also called sensitivity.

Aoooright! We got metrics to estimate false negative and false positive impact. But how can we generally evaluate our model?

Dear reader, there is a magic combination between Precision and Recall called…

V. F1-Score

The F1-Score subtly combines precision and recall: it is their harmonic mean, 2 × precision × recall / (precision + recall). It is more informative than accuracy here because the number of true negatives (tn) is not taken into account in the calculation. In imbalanced class situations like the Bradley Cooper example, the majority of true negatives completely skews our perception of the algorithm’s performance. A large number of true negatives (tn) will leave the F1-Score unmoved.

The F1-Score should therefore be favored over accuracy in the case of an imbalanced class situation.
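The sketch below illustrates this on the Bradley Cooper counts, then on the same counts with many more true negatives (the 9,999 figure is hypothetical). It uses the equivalent count-based form of the formula, 2·tp / (2·tp + fp + fn); the function names are made up for the illustration.

```python
def f1_from_counts(tp, fp, fn):
    # Equivalent to 2 * precision * recall / (precision + recall); tn never appears.
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def accuracy(tn, fp, fn, tp):
    return (tp + tn) / (tn + fp + fn + tp)

f1_small = f1_from_counts(tp=1, fp=1, fn=0)
f1_large = f1_from_counts(tp=1, fp=1, fn=0)      # tn is not an input, so nothing changes
acc_small = accuracy(tn=9, fp=1, fn=0, tp=1)
acc_large = accuracy(tn=9999, fp=1, fn=0, tp=1)  # hypothetical: many more negatives

print(round(f1_small, 3), round(f1_large, 3))    # prints "0.667 0.667"
print(round(acc_small, 4), round(acc_large, 4))  # prints "0.9091 0.9999"
```

Accuracy climbs toward 1 as negatives pile up, while the F1-Score keeps reporting the same mediocre detection.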

VI. Sensitivity, Specificity, ROC Curve

A receiver operating characteristic (ROC) curve is a graph representing the performance of a classification model for all classification thresholds (Google says so).

What is this threshold?


Imagine a test for influenza. This test is based only on body temperature. To decide if someone is sick, we define a temperature threshold beyond which the person is considered sick.

If threshold = 40°C, there is a good chance that the test will make few or even no errors when it declares patients sick (fp = 0). On the other hand, it will miss many patients, who can be sick with a body temperature of 39.5°C (fn = very high).

This is an example with a single input variable. Machine learning algorithms take many variables into account. In reality, the threshold is not expressed in °C but as a value between 0 and 1: it is the probability above which we consider that the individual belongs to class X.

In Python, with the scikit-learn library, we call the predict_proba method and vary the threshold as follows:

y_pred = (clf.predict_proba(X_test)[:,1] >= THRESHOLD).astype(bool)
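The line above assumes a fitted classifier `clf` and a test set `X_test`. To see the effect of the threshold without training anything, here is a minimal sketch on hard-coded probabilities. The values and the helper name `predict_with_threshold` are made up for illustration.

```python
# Hypothetical predicted probabilities of belonging to class 1.
probas = [0.05, 0.20, 0.45, 0.55, 0.80, 0.95]

def predict_with_threshold(probas, threshold):
    # An element is classified positive when its probability reaches the threshold.
    return [int(p >= threshold) for p in probas]

low = predict_with_threshold(probas, 0.3)
middle = predict_with_threshold(probas, 0.5)
high = predict_with_threshold(probas, 0.9)

print(low)     # prints "[0, 0, 1, 1, 1, 1]"
print(middle)  # prints "[0, 0, 0, 1, 1, 1]"
print(high)    # prints "[0, 0, 0, 0, 0, 1]"
```

The higher the threshold, the fewer elements are declared positive.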

Then we have to count the tp, tn, fp and fn, and we can calculate the sensitivity, the specificity, and the ROC curve:

Sensitivity = tp / (tp + fn)

Sensitivity measures the proportion of actual positives that are correctly identified (positive elements that were indeed predicted positive). It is indeed the same formula as the recall.

Everything we saw about recall remains valid: sensitivity tells us how many elements of the positive class we caught, and therefore how many we missed.

Specificity = tn / (tn + fp)

Specificity, as opposed to sensitivity, measures the proportion of negative elements that are correctly identified.

The formulas for specificity and sensitivity speak for themselves: the two are complementary. Looking at one without the other tells us nothing. For example, if we classify sick / not sick with a threshold = 40°C, the specificity will be close to 1 but the sensitivity will be very low.

A good model should be sensitive and specific.
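Here is the influenza test as a sketch. The seven patients and their temperatures are invented for the illustration; only the idea of a temperature threshold comes from the example above.

```python
# Hypothetical patients: (body temperature in °C, 1 if really sick else 0).
patients = [
    (36.8, 0), (37.1, 0), (38.1, 0),
    (37.9, 1), (38.9, 1), (39.5, 1), (40.3, 1),
]

def sensitivity_specificity(patients, threshold):
    # A patient is predicted sick when the temperature reaches the threshold.
    tp = sum(1 for temp, sick in patients if temp >= threshold and sick)
    fn = sum(1 for temp, sick in patients if temp < threshold and sick)
    tn = sum(1 for temp, sick in patients if temp < threshold and not sick)
    fp = sum(1 for temp, sick in patients if temp >= threshold and not sick)
    return tp / (tp + fn), tn / (tn + fp)

sens_40, spec_40 = sensitivity_specificity(patients, 40.0)
sens_38, spec_38 = sensitivity_specificity(patients, 38.0)
print(sens_40, spec_40)                      # prints "0.25 1.0"
print(sens_38, round(spec_38, 2))            # prints "0.75 0.67"
```

At 40°C the test never accuses a healthy patient but misses three sick patients out of four. At 38°C it catches more of them, at the price of one false alarm.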

Let’s recap… We are going to vary the threshold, which will vary the values of tn, tp, fn, fp, which will vary the sensitivity and specificity. We will represent these on a curve called… ROC curve.

ROC curve

If this previous part is understood, there is not much more to say … except the interpretation.

To plot the points, we put the sensitivity on the y-axis and (1 - specificity) on the x-axis.
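A minimal sketch of how those points can be computed, without plotting them. The labels and scores are hypothetical, and `roc_points` is a name made up for this illustration: every distinct score is tried as a threshold, from the strictest to the most permissive.

```python
# Hypothetical true classes and predicted probabilities of class 1.
y_true = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.3, 0.4, 0.6, 0.7, 0.9]

def roc_points(y_true, scores):
    """Return the list of (1 - specificity, sensitivity) points of the ROC curve."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 0)
        # x = fp / negatives = 1 - specificity, y = tp / positives = sensitivity
        points.append((fp / negatives, tp / positives))
    return points

points = roc_points(y_true, scores)
for x, y in points:
    print(round(x, 2), round(y, 2))
```

With scikit-learn, `roc_curve(y_true, scores)` from `sklearn.metrics` returns these coordinates as arrays (false positive rate, true positive rate) together with the thresholds.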


The curve above is taken from the identification of heart failure as a function of the level of BNP (B-type natriuretic peptide, for the curious). The BNP level plays the role of the threshold: the higher the BNP cut-off, the lower the sensitivity.

This curve serves two purposes:

  • Compare models
  • Choose an optimal threshold

How to compare models?

We can compare models by displaying their ROC curves and then choosing the one that performs best. To make that choice, we rely on the area under the curve (AUC, Area Under the Curve). The larger the area under the curve, the better the model.

Source : Sensitivity, specificity, CNAM

In the previous example, we will choose the model that produced curve D.
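The area can be approximated with the trapezoidal rule once the ROC points are known. This is a sketch on two hypothetical curves (`model_a` and `model_b` are invented lists of points, not the curves from the figure).

```python
def auc(points):
    """Trapezoidal area under a curve given as (x, y) points sorted by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Hypothetical ROC points: (1 - specificity, sensitivity).
model_a = [(0.0, 0.0), (0.0, 2 / 3), (1 / 3, 2 / 3), (1 / 3, 1.0), (1.0, 1.0)]
model_b = [(0.0, 0.0), (1 / 3, 1 / 3), (2 / 3, 2 / 3), (1.0, 1.0)]  # the diagonal

auc_a = auc(model_a)
auc_b = auc(model_b)
print(round(auc_a, 3), round(auc_b, 3))  # prints "0.889 0.5"
```

A curve lying on the diagonal gives an area of 0.5, which is what random guessing achieves, so model A would be preferred here. With scikit-learn, `roc_auc_score` from `sklearn.metrics` computes the area directly from the true classes and the predicted scores.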

How to choose the ideal threshold?

The ideal threshold is the best compromise between sensitivity and specificity.

As explained here, the ideal threshold corresponds to the point that is both closest to the ideal corner (sensitivity = 1 and specificity = 1, the top-left of the plot) and farthest from the diagonal.

We will choose point B as the threshold! Source: Sensitivity, specificity, CNAM
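Both criteria can be written down explicitly. In the sketch below, "farthest from the diagonal" is measured with Youden's J statistic (sensitivity + specificity - 1), and "closest to the ideal" with the Euclidean distance to the corner. This is one possible formalization, not necessarily the one used in the source, and the candidate thresholds are hypothetical values.

```python
import math

# Hypothetical (threshold, sensitivity, specificity) triples read off a ROC curve.
candidates = [
    (0.2, 1.00, 0.40),
    (0.4, 0.90, 0.80),
    (0.6, 0.70, 0.90),
    (0.8, 0.40, 1.00),
]

def youden_j(sens, spec):
    # Vertical distance between the ROC point and the diagonal.
    return sens + spec - 1

def distance_to_corner(sens, spec):
    # Distance to the ideal point where sensitivity = 1 and specificity = 1.
    return math.hypot(1 - sens, 1 - spec)

best_by_j = max(candidates, key=lambda c: youden_j(c[1], c[2]))
best_by_distance = min(candidates, key=lambda c: distance_to_corner(c[1], c[2]))
print(best_by_j[0], best_by_distance[0])  # prints "0.4 0.4"
```

On this data both criteria agree. They can disagree on other curves, in which case the business question from the beginning of the article (which mistake is forbidden?) should decide.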

Once we have chosen a threshold, we can apply it with the predict_proba method I just talked about.

The construction of a ROC curve is presented step by step here.


After all these formulas, you have all the keys to evaluate the performance of your algorithms considering the objectives of your company / project. You also have the keys to choose the best model and define the ideal classification threshold for it.

Want to go deeper? The formulas in this article are taken from Wikipedia articles (there and there), which are very complete!

400 views in a month!!

A few months ago, I published this exact article in French. I got 400 views and 0 messages.

My core motivation for writing is to help… If I helped you, please leave a comment, contact me on Linkedin, or clap X times (where X is a multiple of 3, X modulo 3 = 0), and I will understand the message :)

Data scientist jr & Javascript + Python full-stack developer… ok I like tech and do all kind of stuff
