ML: ROC Curve, F1-Score, Precision: how to choose the best metrics

Accuracy: 90%


Which metrics should we use to evaluate an algorithm's performance?

0. Quick answer

What do I expect from my model?

  • Banks want to detect fraud. They expect a model that doesn't miss a single fraudulent transaction. Zero, zip, nada. The model is allowed to make the opposite mistake: flag legitimate transactions as fraud. Once in a while a client will be contacted to validate a transaction; not a big deal.
  • Mail services, on the contrary, can afford to miss some spam, which will go straight into the inbox. But they will do their utmost not to classify legitimate emails as spam (for UX reasons).
  • A cancer-recognition algorithm working on mole images can flag a mole that is actually benign. The mole gets removed, end of story. That is far better than missing a real cancer.

1. Metrics and tools

We will have a look at the following metrics and tools:

  • Confusion matrix
  • Accuracy
  • Precision
  • Recall
  • F1-Score
  • Sensitivity, specificity and ROC Curve
[Image: Results of a Bradley Cooper detection]

I. Confusion matrix

[Image: Confusion matrix for the Bradley Cooper detection]
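To make the four cells of the matrix concrete, here is a minimal pure-Python sketch (with hypothetical labels, not the article's actual data) that counts true/false positives and negatives for a binary "is it Bradley Cooper?" task:

```python
def confusion_counts(y_true, y_pred):
    """Count tp, fp, fn, tn for binary labels (1 = Bradley Cooper, 0 = not)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

y_true = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]   # 1 Bradley Cooper among 10 images
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]   # a model that never detects him
print(confusion_counts(y_true, y_pred))    # (0, 0, 1, 9)
```

Every metric below is computed from these four numbers.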

II. Accuracy

If our algorithm had never detected Bradley Cooper at all, it would still have scored 90% accuracy, since he appears in only 10% of the images. On imbalanced data, accuracy alone tells us very little.
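Accuracy is simply the share of correct predictions over all predictions. A small sketch from the confusion-matrix cells shows the trap described above (the counts are illustrative, not the article's data):

```python
def accuracy(tp, fp, fn, tn):
    """Correct predictions (tp + tn) over all predictions."""
    return (tp + tn) / (tp + fp + fn + tn)

# A model that never predicts "Bradley Cooper" on 100 images
# where he actually appears 10 times:
print(accuracy(tp=0, fp=0, fn=10, tn=90))  # 0.9
```

A do-nothing model reaches 90% accuracy purely because of class imbalance.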

III. Precision

Precision is the number of documents correctly assigned to class i divided by the total number of documents predicted as belonging to class i (all predicted positives).
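In confusion-matrix terms, that is tp / (tp + fp). A minimal sketch (illustrative counts):

```python
def precision(tp, fp):
    """Of everything predicted as class i, how much was actually class i?"""
    return tp / (tp + fp) if (tp + fp) else 0.0

# 10 "Bradley Cooper" predictions, 2 of them wrong:
print(precision(tp=8, fp=2))  # 0.8
```

High precision is what the mail service above is after: few legitimate emails wrongly sent to spam.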

IV. Recall

Recall is the number of documents correctly assigned to class i divided by the total number of documents actually belonging to class i (all actual positives).
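In confusion-matrix terms, that is tp / (tp + fn). A minimal sketch (illustrative counts):

```python
def recall(tp, fn):
    """Of all real members of class i, how many did we find?"""
    return tp / (tp + fn) if (tp + fn) else 0.0

# 8 Bradley Coopers found out of 10 present in the images:
print(recall(tp=8, fn=2))  # 0.8
```

High recall is what the bank above is after: no fraudulent transaction slips through.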

V. F1-Score

The F1-Score subtly combines precision and recall. It is often more informative than accuracy because the number of true negatives (tn) does not appear in the calculation. In imbalanced situations like our Bradley Cooper example, the overwhelming majority of true negatives completely skews our perception of the algorithm's performance. A huge number of true negatives (tn) will leave the F1-Score unmoved.
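The F1-Score is the harmonic mean of precision and recall, 2 · p · r / (p + r). The sketch below makes the point from the paragraph above explicit: tn appears nowhere, so piling on true negatives cannot inflate the score (counts are illustrative):

```python
def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall. Note: tn is not used at all."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return 2 * p * r / (p + r)

print(f1_score(tp=8, fp=2, fn=2))
# Same score whether the dataset contains 10 or 10,000 true negatives.
```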

VI. Sensitivity, Specificity, ROC Curve

A receiver operating characteristic (ROC) curve is a graph representing the performance of a classification model across all classification thresholds (Google says so). It plots the true positive rate (sensitivity) against the false positive rate (1 − specificity), one point per threshold.

# Apply a custom decision threshold instead of scikit-learn's default of 0.5
y_pred = (clf.predict_proba(X_test)[:, 1] >= THRESHOLD).astype(bool)

The ROC curve lets you:

  • Compare models
  • Choose an optimal threshold
[Image: Sensitivity, specificity. Source: CNAM]
We will choose point B as the threshold!
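To see where those points come from, here is a minimal pure-Python sketch (hypothetical scores, not a library implementation; in practice you would use scikit-learn's roc_curve) that computes one (FPR, TPR) point of the ROC curve per threshold:

```python
def roc_points(y_true, scores, thresholds):
    """Compute (FPR, TPR) pairs: one point of the ROC curve per threshold."""
    points = []
    for thr in thresholds:
        tp = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= thr)
        fn = sum(1 for t, s in zip(y_true, scores) if t == 1 and s < thr)
        fp = sum(1 for t, s in zip(y_true, scores) if t == 0 and s >= thr)
        tn = sum(1 for t, s in zip(y_true, scores) if t == 0 and s < thr)
        tpr = tp / (tp + fn)   # sensitivity (= recall)
        fpr = fp / (fp + tn)   # 1 - specificity
        points.append((fpr, tpr))
    return points

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_points(y_true, scores, [0.0, 0.5, 1.0]))
# [(1.0, 1.0), (0.0, 0.5), (0.0, 0.0)]
```

Sweeping the threshold from 0 to 1 traces the curve from the top-right corner (everything flagged positive) down to the origin (nothing flagged).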


After all these formulas, you have everything you need to evaluate your algorithms' performance in light of your company's or project's objectives. You also have the keys to choose the best model and to set its ideal classification threshold.

400 views in a month!!

A few months ago, I published this exact article in French. It got 400 views and 0 messages.



Beranger Natanelic

Future Unicorn Founder — Using tech for good