After we train a model, we need to know whether it actually works, that is, whether it generalizes to unseen data. For this we use evaluation metrics.

These performance metrics are grouped by the type of Machine Learning problem: Regression and Classification problems each have their own evaluation techniques.

In this blog we will go through these metrics along with their implications for our Machine Learning model. Let’s see how the metrics break down by category.

Regression Problem

  • Mean Absolute Error
  • Mean Squared Error
  • Root Mean Squared Error
  • R squared
  • Adjusted R squared

Classification Problem

  • Confusion matrix
  • Accuracy score
  • Classification Report
  • ROC Curve
  • AUC

Let us go through the metrics one by one, with examples in Python.


Regression problems are those where we model the relationship between a target variable and its predictors, and the target variable holds a continuous value. Regression is widely used for forecasting. Regression models include algorithms such as Linear regression, Decision Tree, Random forest, SVM, etc. We can evaluate a regression model’s performance using the following metrics.

  • Mean Absolute Error – Mean Absolute Error (MAE) is the average of the absolute differences between the actual values and the predicted values. It measures how far the predictions were from the actual output, i.e., the magnitude of the error. However, it gives no idea of the direction of the error, i.e., whether we are under-predicting or over-predicting.

In Python, we find Mean Absolute Error using the sklearn library as shown below, where y_test holds the actual values and pred holds the model’s predictions:

from sklearn.metrics import mean_absolute_error

mae = mean_absolute_error(y_test, pred)


An MAE of 0 indicates no error, that is, perfect predictions.
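To make the definition concrete, here is a minimal from-scratch sketch of what the MAE computes. The four actual and predicted values are made up for illustration, and the helper names `mae_from_scratch` and `mean_signed_error` are hypothetical, not part of sklearn. The signed version is included only to show the directional information that MAE discards.

```python
def mae_from_scratch(actual, predicted):
    """Average of the absolute differences between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mean_signed_error(actual, predicted):
    """Average of (predicted - actual); negative means under-prediction on average."""
    return sum(p - a for a, p in zip(actual, predicted)) / len(actual)


# Illustrative values only (not from a real model).
y_true = [6, 3, 9, 7]
y_hat = [5.5, 3.4, 7.4, 6.9]

mae = mae_from_scratch(y_true, y_hat)
bias = mean_signed_error(y_true, y_hat)
print(f"MAE = {mae:.2f}, mean signed error = {bias:.2f}")
```

With these numbers the MAE is 0.65 while the mean signed error is -0.45, so the predictions are on average too low, something the MAE alone cannot reveal.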

  • Mean Squared Error – Mean Squared Error (MSE) is much like Mean Absolute Error, except that it averages the squared differences between the predicted and actual values. Because the errors are squared, large errors are penalized more heavily than small ones, and the result is expressed in squared units of the target.

In Python, we find Mean Squared Error using the sklearn library as shown below:

from sklearn.metrics import mean_squared_error

mse = mean_squared_error(y_test, pred)


An MSE of zero means that the model predicts every observation perfectly. This is the ideal, but it is generally not achievable in practice.

The smaller the mean squared error, the closer the fit is to the data. MSE is always non-negative, and values closer to zero are better.
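The effect of squaring is easiest to see by comparing MSE with MAE on two sets of predictions that have the same total absolute error. The sketch below uses made-up numbers and hypothetical helper names written in plain Python; it is meant as an illustration, not as a replacement for the sklearn functions.

```python
def mse_from_scratch(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def mae_from_scratch(actual, predicted):
    """Average absolute difference, used here only for comparison."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Two hypothetical prediction sets with the same total absolute error (4 units).
y_true = [10, 10, 10, 10]
spread_out = [11, 11, 11, 11]    # four errors of size 1
one_big_miss = [10, 10, 10, 14]  # one error of size 4

for name, y_hat in [("spread out", spread_out), ("one big miss", one_big_miss)]:
    print(name,
          "MAE:", mae_from_scratch(y_true, y_hat),
          "MSE:", mse_from_scratch(y_true, y_hat))
```

Both prediction sets have an MAE of 1, but the single large miss produces an MSE of 4 instead of 1, which is why MSE is often preferred when large errors are especially costly.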

  • Root Mean Squared Error – Root Mean Squared Error (RMSE) measures the average magnitude of the error by taking the square root of the average of the squared differences between predictions and actual observations. It tells you how concentrated the data is around the line of best fit, and unlike MSE it is expressed in the same units as the target. When the residuals average to zero, the RMSE is the standard deviation of the residuals. Lower values of RMSE indicate a better fit.

In Python, we find Root Mean Squared Error by taking the square root of the sklearn MSE as shown below:

from sklearn.metrics import mean_squared_error

import numpy as np

rmse = np.sqrt(mean_squared_error(y_test, pred))


The RMSE is always greater than or equal to the MAE; the greater the difference between them, the greater the variance in the individual errors in the sample. If RMSE = MAE, then all the errors are of the same magnitude.
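The relationship between RMSE and MAE can be checked on two small sets of errors. The sketch below is plain Python with hypothetical helper names and made-up error values; it only illustrates the inequality described above.

```python
import math


def mae(errors):
    """Mean of the absolute errors."""
    return sum(abs(e) for e in errors) / len(errors)


def rmse(errors):
    """Square root of the mean of the squared errors."""
    return math.sqrt(sum(e ** 2 for e in errors) / len(errors))


equal_errors = [0.5, -0.5, 0.5, -0.5]  # every error has the same magnitude
mixed_errors = [0.5, -0.4, 1.6, 0.1]   # magnitudes vary

for name, errors in [("equal", equal_errors), ("mixed", mixed_errors)]:
    print(f"{name}: MAE = {mae(errors):.3f}, RMSE = {rmse(errors):.3f}")
```

For the equal-magnitude errors the two metrics coincide, while for the mixed errors the RMSE is pulled above the MAE by the single larger error.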

  • R Squared – R² (R-squared), computed in sklearn by r2_score, is the proportion of the variance in the dependent variable that is predictable from the independent variable(s). It is also known as the coefficient of determination. It is a statistical measure of how close the data are to the fitted regression line, that is, it indicates the goodness of fit of a set of predictions to the actual values. The value of R² usually lies between 0 and 1, where 0 means the model does no better than always predicting the mean and 1 means a perfect fit. On test data, r2_score can also be negative when the model does worse than predicting the mean.

In Python, we find r2_score using the sklearn library as shown below:

from sklearn.metrics import r2_score

r_squared = r2_score(y_test, pred)


The formula to find R² is as follows:

  • R² = 1 – SSE/SST

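Here SSE is the Sum of Squared Errors (residuals), where a residual is the difference between the actual value and the predicted value.

SST is the Total Sum of Squares, that is, the sum of the squared differences between each actual value and the mean of the actual values.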

For example, suppose the actual values are 6, 3, 9 and 7 and the model predicts 5.5, 3.4, 7.4 and 6.9.

Now SSE = (6 – 5.5)^2 + (3 – 3.4)^2 + (9 – 7.4)^2 + (7 – 6.9)^2 = 0.25 + 0.16 + 2.56 + 0.01 = 2.98

For SST we first need the mean: (6+3+9+7)/4 = 6.25. Then SST = (6 – 6.25)^2 + (3 – 6.25)^2 + (9 – 6.25)^2 + (7 – 6.25)^2 = 0.0625 + 10.5625 + 7.5625 + 0.5625 = 18.75

Putting SSE and SST into the formula gives R² = 1 – 2.98/18.75 ≈ 0.84.
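The arithmetic in this example can be double-checked with a few lines of plain Python. The function name `r_squared` is a hypothetical helper written for this sketch; it follows the R² = 1 – SSE/SST formula given above.

```python
def r_squared(actual, predicted):
    """R² = 1 - SSE/SST, following the formula in the text."""
    mean_actual = sum(actual) / len(actual)
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))  # residual sum of squares
    sst = sum((a - mean_actual) ** 2 for a in actual)           # total sum of squares
    return 1 - sse / sst


# The four data points and predictions from the example in the text.
actual = [6, 3, 9, 7]
predicted = [5.5, 3.4, 7.4, 6.9]

r2 = r_squared(actual, predicted)
print(f"R² = {r2:.3f}")
```

For this data the result is about 0.841, meaning the predictions account for roughly 84% of the variation in the actual values.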

  • Adjusted R squared – R-squared measures the degree to which our input variables explain the variation of our output variable. So, the higher the R-squared, the more variation is explained by our input variables and the better our model fits. The problem is that R-squared will either stay the same or increase when more variables are added, even if they have no relationship with the output variable. This is where Adjusted R-squared helps: it is similar to R-squared, but it accounts for the number of variables and penalizes adding variables that are not useful for predicting the target.

Hence, if you are building a Linear regression on multiple variables, it is always suggested to use Adjusted R-squared to judge the goodness of the model. With only one input variable the penalty is small, so R-squared and Adjusted R-squared are very close, especially on larger datasets.


  • If adding a variable increases R² by a significant amount, the adjusted R² increases.
  • If adding a variable brings no significant change in R², the adjusted R² decreases.
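The text does not state the formula, so as an assumption this sketch uses the standard definition, Adjusted R² = 1 – (1 – R²)(n – 1)/(n – p – 1), where n is the number of observations and p is the number of predictors. The function name and the numbers are hypothetical and chosen only to show the penalty at work.

```python
def adjusted_r_squared(r2, n_samples, n_predictors):
    """Adjusted R² = 1 - (1 - R²) * (n - 1) / (n - p - 1)."""
    denominator = n_samples - n_predictors - 1
    if denominator <= 0:
        raise ValueError("need more observations than predictors + 1")
    return 1 - (1 - r2) * (n_samples - 1) / denominator


# Hypothetical case: the same R² of 0.80 reached with 2 versus 10 predictors.
few = adjusted_r_squared(0.80, n_samples=50, n_predictors=2)
many = adjusted_r_squared(0.80, n_samples=50, n_predictors=10)
print(f"2 predictors: {few:.3f}, 10 predictors: {many:.3f}")
```

Both models have the same R², but the one that needed ten predictors to get there receives the lower adjusted score.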

In Python we can find the adjusted R² by fitting an OLS model with statsmodels, as shown below:

import statsmodels.formula.api as smf

model = smf.ols(formula='Y ~ x1+x2+x3+x4+x5', data=df)

fitted = model.fit()

adj_r_squared = fitted.rsquared_adj


Here x1 to x5 are the independent features and Y is the dependent feature, all of them columns of the DataFrame df.


A classification problem is one where the dependent variable is categorical. In the binary case it belongs to one of two classes, such as yes or no, true or false, etc. Classification is used in scenarios where we have to find whether an email is spam or not, whether a transaction is fraudulent or authorized, etc. Classification models include algorithms such as Logistic regression, Decision Tree, Random forest, Naive-Bayes, etc. We can evaluate a classification model’s performance using the following metrics.

  • Confusion Matrix – It is used for classification problems where the output can be of two or more classes. It is one of the most intuitive tools for judging the correctness of a model. As the name suggests, the confusion matrix gives us a matrix of outcomes that shows where our model has classified the data correctly and where it has not. The confusion matrix in itself is not a performance measure as such, but almost all of the performance metrics are based on the counts it contains.

Let’s see how we read data from a matrix.

                 Predicted Yes(1)       Predicted No(0)
Actual Yes(1)    True Positive(TP)      False Negative(FN)
Actual No(0)     False Positive(FP)     True Negative(TN)

Here the matrix represents the Actual and the Predicted outcome in the form of Yes(1) and No(0). There are four important terms:

  • True Positives : The cases in which we predicted YES and the actual output was also YES.
  • True Negatives : The cases in which we predicted NO and the actual output was also NO.
  • False Positives : The cases in which we predicted YES whereas the actual output was NO.
  • False Negatives : The cases in which we predicted NO whereas the actual output was YES.

In Python we compute the confusion matrix with the code below:

from sklearn.metrics import confusion_matrix

conf_Matrix = confusion_matrix(y_test, pred)
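To connect the four terms to actual data, the following sketch counts them by hand for a small made-up set of labels. The function `confusion_counts` is a hypothetical helper, and 1 is assumed to be the positive (Yes) class.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count TP, FN, FP and TN for a binary problem."""
    tp = fn = fp = tn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1  # predicted YES, actually YES
        elif a == positive:
            fn += 1  # predicted NO, actually YES
        elif p == positive:
            fp += 1  # predicted YES, actually NO
        else:
            tn += 1  # predicted NO, actually NO
    return tp, fn, fp, tn


# Illustrative labels only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_hat = [1, 0, 0, 1, 0, 1, 1, 0]

tp, fn, fp, tn = confusion_counts(y_true, y_hat)
print(f"TP={tp} FN={fn} FP={fp} TN={tn}")
```

One thing to keep in mind when reading sklearn output: confusion_matrix orders rows and columns by sorted label, so for 0/1 labels it returns [[TN, FP], [FN, TP]], which is the table above with both its rows and its columns reversed.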


  • Accuracy Score – Accuracy, or classification accuracy, tells us the fraction of predictions the model got right. It is the ratio of the number of correct predictions to the total number of input samples.

For binary classification the accuracy can be defined as:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

However, accuracy is a good measure only when the target variable classes are nearly balanced. Accuracy alone doesn’t tell the full story when we’re working with a class-imbalanced dataset. For such problems, metrics like precision and recall are better suited; we discuss them under the next heading.

In Python we calculate the accuracy score as follows:

from sklearn.metrics import accuracy_score

accuracy_score(y_test, pred)
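The weakness of accuracy on imbalanced data is easy to demonstrate. The sketch below builds a hypothetical dataset with 5 positives and 95 negatives and a "lazy" model that always predicts the majority class; all numbers are made up for illustration.

```python
# Hypothetical imbalanced dataset: 5 positives (1) and 95 negatives (0).
y_true = [1] * 5 + [0] * 95
# A "lazy" model that always predicts the majority class.
y_hat = [0] * 100

correct = sum(1 for a, p in zip(y_true, y_hat) if a == p)
accuracy = correct / len(y_true)

# Share of the actual positives that the model managed to find.
true_positives = sum(1 for a, p in zip(y_true, y_hat) if a == 1 and p == 1)
positives_found = true_positives / sum(y_true)

print(f"accuracy = {accuracy:.2f}, positives found = {positives_found:.2f}")
```

The model scores 95% accuracy without identifying a single positive case, which is exactly the situation where accuracy alone is misleading.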

  • Classification Report – A classification report generated through the sklearn library is used to measure the quality of predictions of a classification model. The report shows metrics such as Precision, Recall, F1 score and Support (the number of actual samples in each class). These metrics are defined in terms of true/false positives and true/false negatives.

In Python the classification report is generated using the code below:

from sklearn.metrics import classification_report

print(classification_report(y_test, pred))


Let us discuss each of these metrics in brief:

  • Precision – Precision is the fraction of predicted positive events that are actually positive. Said another way, “for all instances classified positive, what percent was correct?”

The formula for calculating precision is :

Precision = TP / (TP + FP)

  • Recall/Sensitivity – Recall (also known as sensitivity) is the fraction of positive events that you predicted correctly. Said another way, “for all instances that were actually positive, what percent was classified correctly?”

The formula for calculating recall is:

Recall = TP / (TP + FN)

  • F1 Score – The F1 score is the harmonic mean of recall and precision. The range for the F1 score is [0, 1], with a higher score indicating a better model. Because it combines precision and recall and ignores true negatives, the F1 score can be noticeably lower than accuracy when negatives dominate the data. It tells us how precise our classifier is (how many of its positive predictions are correct), as well as how robust it is (it does not miss a significant number of positive instances). The F1 score is calculated using the following formula:

F1 Score = 2 * (Precision * Recall) / (Precision + Recall)
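As a quick illustration, precision, recall and the F1 score can all be computed from the confusion-matrix counts. The helper below is a hypothetical sketch in plain Python: the counts are made up, F1 is taken as the harmonic mean 2 * P * R / (P + R), and returning 0.0 when a denominator is zero is a choice made for this sketch.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion-matrix counts."""
    # Returning 0.0 for undefined ratios is an assumption of this sketch.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


# Hypothetical counts: 30 true positives, 10 false positives, 20 false negatives.
precision, recall, f1 = precision_recall_f1(tp=30, fp=10, fn=20)
print(f"precision = {precision:.2f}, recall = {recall:.2f}, F1 = {f1:.3f}")
```

Here precision is 0.75 and recall is 0.60, and the F1 score of about 0.667 sits closer to the lower of the two, which is the characteristic behavior of a harmonic mean.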
  • ROC – The ROC (Receiver Operating Characteristic) curve is a useful tool for evaluating the performance of a binary classifier, that is, a classifier with two possible output classes. A ROC curve plots the True Positive rate (on the y-axis) against the False Positive rate (on the x-axis), with one point for each possible classification threshold.

The True positive rate is calculated as:

True Positive Rate = True Positives / (True Positives + False Negatives)

And the false positive rate is calculated as:

False Positive Rate = False Positives / (False Positives + True Negatives)
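To show how these two rates trace out a curve, the sketch below sweeps a threshold over a handful of made-up scores and records one (FPR, TPR) point per threshold. The function `roc_points` is a hypothetical helper; it assumes labels of 0 and 1, at least one example of each class, and that a sample is predicted positive when its score is at or above the threshold.

```python
def roc_points(actual, scores):
    """Return (fpr, tpr) pairs, one per distinct threshold, starting at (0, 0)."""
    n_pos = sum(1 for a in actual if a == 1)
    n_neg = len(actual) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for a, s in zip(actual, scores) if s >= threshold and a == 1)
        fp = sum(1 for a, s in zip(actual, scores) if s >= threshold and a == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


# Illustrative labels and predicted scores.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

points = roc_points(y_true, y_score)
for fpr, tpr in points:
    print(f"FPR = {fpr:.2f}, TPR = {tpr:.2f}")
```

Lowering the threshold can only add predicted positives, so both rates move from 0 towards 1 as the sweep proceeds; the shape of that path is the ROC curve.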

  • AUC – The AUC, or Area Under the Curve, is the total area under the ROC curve. It summarizes the curve in a single number, which makes it easy to compare one classifier with another. AUC ranges from 0 to 1, where 0.5 corresponds to random guessing and 1 to a perfect classifier. AUC remains a useful metric even when the classes are highly imbalanced.

The image below shows the ROC curve and the AUC:

The AUC is the whole part shown in the shaded region.

A good ROC curve is one that covers a lot of area under it, whereas a bad ROC curve is one that stays close to the diagonal line, which corresponds to random guessing. The greater the value of AUC, the better the performance of our model.
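The area itself can be computed in two equivalent ways, sketched below in plain Python with hypothetical helper names. The first applies the trapezoid rule to a list of ROC points; the second uses the ranking interpretation of AUC, namely the share of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half. The points and scores are made up for illustration.

```python
def auc_trapezoid(points):
    """Area under a piecewise-linear curve given (x, y) points sorted by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


def auc_by_ranking(actual, scores):
    """Share of positive/negative pairs ranked correctly (ties count as half)."""
    pos = [s for a, s in zip(actual, scores) if a == 1]
    neg = [s for a, s in zip(actual, scores) if a == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Illustrative ROC points for the labels and scores below.
roc = [(0.0, 0.0), (0.0, 0.5), (0.5, 0.5), (0.5, 1.0), (1.0, 1.0)]
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

print("trapezoid:", auc_trapezoid(roc))
print("ranking:  ", auc_by_ranking(y_true, y_score))
print("diagonal: ", auc_trapezoid([(0.0, 0.0), (1.0, 1.0)]))
```

Both methods give 0.75 for this example, and the diagonal line gives 0.5, the baseline that a useful classifier should beat.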

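In Python we find the ROC curve and the AUC as below. Here pred should hold the predicted probabilities or scores for the positive class (for example from predict_proba) rather than hard class labels, so that the curve can sweep over the thresholds:

from sklearn.metrics import roc_curve, roc_auc_score

fpr, tpr, thresholds = roc_curve(y_test, pred)

roc_auc = roc_auc_score(y_test, pred)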


The graph is plotted using the code below:

import matplotlib.pyplot as plt

# fpr and tpr are the arrays returned by sklearn.metrics.roc_curve

plt.title('Receiver Operating Characteristic')

plt.plot(fpr, tpr, 'b', label='AUC = %0.2f' % roc_auc)

plt.legend(loc='lower right')

plt.plot([0, 1], [0, 1], 'r--')

plt.xlim([0, 1])

plt.ylim([0, 1])

plt.ylabel('True Positive Rate')

plt.xlabel('False Positive Rate')

plt.show()


The output will look something like the plot shown below:

In the above output, we can see that the AUC is 0.91, which is pretty good.

So these were some of the metrics that help us evaluate the performance of our regression and classification models.