TensorFlow with Python: Evaluating our Handwritten Digit Classifier

Welcome Developers!

In my last post, I showed you how to create a simple Neural Network that learned to read handwritten digits using the MNIST data set. In this post, I’d like to show you some different ways to evaluate your Neural Network. In the next post, we will use them to improve its accuracy.

Model Evaluation

There are many different flavors of evaluation functions that we can use when it comes to evaluating our Neural Network, or any Machine Learning Model. Previously, I talked about measuring the accuracy of the model as a way of evaluating a particular Machine Learning Algorithm. While accuracy shows us how often our algorithm predicts correctly, it isn’t always the best way to judge how effective a model is.

One reason that measuring the accuracy of a Machine Learning Model isn’t always the best approach is the way the data is split between classes. Let’s take a binary classifier, one that tells us if the current data instance is a truck or not. Our data set will be composed of 25% cars, 25% SUVs, and 50% trucks. A model that learns nothing and simply predicts that every instance is not a truck will still be right 50% of the time, even though it never identifies a single truck. The accuracy score alone hides this, so it is not a good way to evaluate this kind of Machine Learning Model.
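To make the problem concrete, here is a minimal sketch of the truck example. The `always_not_truck` function is a hypothetical stand-in for a model that has learned nothing, and the data is generated to match the split described above.

```python
# Toy data set matching the split above: 25% cars, 25% SUVs, 50% trucks.
vehicles = ["car"] * 25 + ["suv"] * 25 + ["truck"] * 50

# True labels for the binary task: is this instance a truck?
is_truck = [v == "truck" for v in vehicles]


def always_not_truck(vehicle):
    """Hypothetical 'model' that ignores its input and always answers 'not a truck'."""
    return False


predictions = [always_not_truck(v) for v in vehicles]

# Accuracy = fraction of predictions that match the true labels.
correct = sum(p == t for p, t in zip(predictions, is_truck))
accuracy = correct / len(vehicles)

# How many of the 50 trucks did the model actually find?
trucks_found = sum(p and t for p, t in zip(predictions, is_truck))

print(accuracy)      # 0.5
print(trucks_found)  # 0
```

The accuracy looks like a coin flip, yet the model never finds a single truck. That gap is what the metrics in the rest of this post are meant to expose.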

The Confusion Matrix

Instead of evaluating our Machine Learning Model using only its accuracy, let’s use a Confusion Matrix. The Confusion Matrix is an interesting Metric. It splits the predictions of a model into 4 categories: True-Positives, True-Negatives, False-Positives, and False-Negatives. Below I have listed more explicit definitions of each of these categories using our example from above.

True-Positives – Occur when the model correctly predicts a data instance as a Truck.

True-Negatives – Occur when the model correctly predicts a data instance as not being a Truck.

False-Positives – Occur when the model incorrectly predicts a data instance as a Truck.

False-Negatives – Occur when the model incorrectly predicts a data instance as not being a Truck.
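The four categories above can be counted by hand from a list of true labels and a list of predictions. This is only a sketch: the labels and predictions below are made up for illustration.

```python
# Toy example: True means "truck", False means "not a truck".
# These labels and predictions are invented for illustration.
actual    = [True, True, True, True, False, False, False, False, False, False]
predicted = [True, True, True, False, True, False, False, False, False, False]

pairs = list(zip(actual, predicted))

true_positives  = sum(a and p for a, p in pairs)          # truck, predicted truck
true_negatives  = sum(not a and not p for a, p in pairs)  # not a truck, predicted not a truck
false_positives = sum(not a and p for a, p in pairs)      # not a truck, predicted truck
false_negatives = sum(a and not p for a, p in pairs)      # truck, predicted not a truck

print(true_positives, true_negatives, false_positives, false_negatives)  # 3 5 1 1
```

Every instance falls into exactly one of the four categories, so the four counts always add up to the size of the data set.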

Already, this presents much more information to use than just taking the accuracy of the Model. Later, we will see how we can implement these using our previous Machine Learning Model to classify Handwritten digits.

Next, we have two other statistical measures that are derived from the Confusion Matrix: Precision and Recall.

Precision

This statistical measure finds the accuracy of the positive predictions of the model. You can find the Precision from the Confusion Matrix as the ratio of the True-Positives to the sum of the True-Positives and the False-Positives.

$$\text{Precision} = \frac{\text{True-Positives}}{\text{True-Positives} + \text{False-Positives}}$$

Recall

The Recall is a way to measure how sensitive a model is. It is found by calculating the ratio of the True-Positives to the sum of the True-Positives and the False-Negatives.

$$\text{Recall} = \frac{\text{True-Positives}}{\text{True-Positives} + \text{False-Negatives}}$$
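As a quick worked example, both ratios can be written as small Python functions. The function names and the counts below are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
def precision(true_positives, false_positives):
    """Fraction of 'truck' predictions that really were trucks."""
    return true_positives / (true_positives + false_positives)


def recall(true_positives, false_negatives):
    """Fraction of the actual trucks that the model found."""
    return true_positives / (true_positives + false_negatives)


# Hypothetical counts for the truck classifier.
tp, fp, fn = 30, 10, 20

print(precision(tp, fp))  # 0.75
print(recall(tp, fn))     # 0.6
```

Here the model is right about 75% of the instances it calls a truck, but it only finds 60% of the trucks that exist in the data.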

F1 Score

Finally, we have one last metric known as the F1 Score. This nifty function combines the Precision and the Recall by taking their harmonic mean, leaving you with a single number that shows how well your model is doing. Next, I’ll show you how to implement these metrics in our handwritten digit classifier.
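The F1 Score is the harmonic mean of the Precision and the Recall. Here is a minimal sketch of that formula; the function name `f1` and the input values are mine, picked for illustration.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # avoid dividing by zero when both scores are 0
    return 2 * precision * recall / (precision + recall)


balanced = f1(0.75, 0.75)  # both scores equal
lopsided = f1(1.0, 0.1)    # perfect precision, very poor recall

print(balanced)            # 0.75
print(round(lopsided, 3))  # 0.182
```

A plain average of 1.0 and 0.1 would give 0.55, which looks respectable. The harmonic mean stays close to the weaker of the two scores, so a model cannot earn a high F1 Score by doing well on only one of them.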

Implementing the Confusion Matrix

For these functions, we will be using another popular machine learning library called Sci-Kit Learn (the package itself is named scikit-learn and is imported as sklearn). You might have already heard of it, but in case you haven’t, all you need to know is that it is a Python Library that includes many different Machine Learning functions and utilities.

A convenient way to get Sci-Kit Learn is to install a popular Python distribution known as Anaconda, which comes with its own package manager. Here is the link to download.

Anaconda Download

Anaconda has a custom Command Prompt wrapper that includes extra functions and a special Python Environment. From now on, we will use this command prompt to execute our Python Scripts.

To start this Shell, open the “Anaconda Navigator.” Then click on Environments, then hit the green play button beside the root Environment. Then click Open with Terminal.


This is the shell Environment that we will be using to run all of our Python Scripts. It’s very easy to install new packages: all you have to do is go back to the Anaconda Navigator, click Environments, click the root Environment, then select “Not Installed.” Now, using the Search Box over to the right, you should be able to type in any Python Package that you need and install it. Simple! Now that you know what to do, go ahead and install TensorFlow.

Importing the Confusion Matrix

In order to use Sci-Kit Learn’s Confusion Matrix, we are first going to need to import it.

from sklearn.metrics import confusion_matrix

The Confusion Matrix takes two parameters: the labels of the data set, and the output of the model. From this, it determines all of the True-Positives, True-Negatives, False-Positives, and False-Negatives.

Gathering Inputs

So, we must then gather our model’s output and the data set’s labels. We can obtain these by executing the code below.

# Predicted digit for each test image: the index of the highest output score
network_output = sess.run(tf.argmax(output, 1), feed_dict={input_layer: mnist_data.test.images})
# True digit for each test image, recovered from the one-hot labels
mnist_labels = sess.run(tf.argmax(mnist_data.test.labels, 1))

For each data instance, our Neural Network returns a list of ten scores, one per digit, telling us how likely it thinks the instance is to be that digit. I explained this concept in my last post.

The tf.argmax() function converts each list of scores into a single number: the index of the highest score, which is the digit the network thinks is most likely. Applying it to the labels turns each one-hot list (all zeros except a 1 at the correct digit) into that digit as well. By doing this operation to both the labels and the output, we can be sure that the two can be compared directly, digit against digit.
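Here is a small sketch of what an argmax does to a row of scores. It uses NumPy's np.argmax, which behaves like tf.argmax for this purpose, and the scores and labels are made up for two imaginary images.

```python
import numpy as np

# Made-up network outputs for two images: one score per digit 0-9.
scores = np.array([
    [0.01, 0.02, 0.05, 0.70, 0.02, 0.10, 0.02, 0.03, 0.03, 0.02],
    [0.05, 0.80, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.01],
])

# One-hot labels for the same two images: the digits 3 and 7.
one_hot_labels = np.array([
    [0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
])

# axis=1 takes the argmax across each row, like the second argument of tf.argmax.
predicted_digits = np.argmax(scores, axis=1)
true_digits = np.argmax(one_hot_labels, axis=1)

print(predicted_digits.tolist())  # [3, 1]
print(true_digits.tolist())       # [3, 7]
```

The first image is classified correctly, while the second one (a 7) is mistaken for a 1. Both results are plain digits, which is the form the Sci-Kit Learn metrics expect.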

Computing the Confusion Matrix

Now that we have gathered the inputs for our Confusion Matrix, let us implement it.

matrix = confusion_matrix(mnist_labels, network_output)

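You can play around with printing out the result, but you may not be able to understand it right away. The confusion_matrix() function returns a 10 × 10 array: each row corresponds to a true digit and each column to a predicted digit, so the entry in row i, column j counts how many times digit i was classified as digit j. The correct classifications lie on the diagonal, and everything off the diagonal is a mistake.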
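The layout is easier to read on a small example. This sketch runs confusion_matrix() on a made-up 3-class problem instead of the ten MNIST digits; the labels and predictions are invented.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels and predictions for a 3-class problem (classes 0, 1, 2).
true_labels = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
predictions = [0, 0, 1, 1, 1, 1, 2, 2, 0, 2]

matrix = confusion_matrix(true_labels, predictions)
print(matrix)
# [[2 1 0]
#  [0 3 0]
#  [1 0 3]]

# Row = true class, column = predicted class, so the diagonal holds the
# correct predictions and every other cell is a mistake.
correct = matrix.diagonal().sum()
print(correct, "of", matrix.sum(), "correct")  # 8 of 10 correct
```

Reading the first row: of the three instances that really were class 0, two were predicted as 0 and one was mistaken for class 1.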

What we really want to see is how well the Neural Network is doing, in a more concise format. So, let us continue.

The Precision Score

For this bit, we must import the precision_score() function from the Sci-Kit Learn library.

from sklearn.metrics import precision_score

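Then, just as we did with the Confusion Matrix, we simply pass in the data set’s labels and the model’s predicted outputs. Because our labels cover ten digits rather than two classes, we also pass average=None to get one score per label. Without it, precision_score() assumes a binary problem and will not accept our labels.

# average=None returns one Precision score per digit (0-9)
precision = precision_score(mnist_labels, network_output, average=None)

The result is a list of scores between 0 and 1, telling us how precise our Model is in terms of each label.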

The Recall Score

For this, we must import the recall_score() function.

from sklearn.metrics import recall_score

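Then, we can pass in the data. Setting average=None gives us one score per label instead of assuming a binary problem.

# average=None returns one Recall score per digit (0-9)
recall = recall_score(mnist_labels, network_output, average=None)

Once again, if we print out the result, what we’ll find is a list of scores between 0 and 1 telling us how sensitive our Neural Network is for each particular label.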
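The per-label Precision and Recall can also be read straight off a Confusion Matrix, which shows how the metrics relate to one another. This is a sketch with NumPy on a made-up 3-class matrix (rows are true classes, columns are predicted classes), not the MNIST results.

```python
import numpy as np

# Made-up 3-class confusion matrix: rows are true classes, columns are predictions.
matrix = np.array([
    [2, 1, 0],
    [0, 3, 0],
    [1, 0, 3],
])

# The diagonal holds the True-Positives of each class.
true_positives = np.diag(matrix)

# Column sums = how often each class was predicted (True-Positives + False-Positives).
precision_per_class = true_positives / matrix.sum(axis=0)
# Row sums = how often each class actually occurred (True-Positives + False-Negatives).
recall_per_class = true_positives / matrix.sum(axis=1)

print(precision_per_class.round(2).tolist())  # [0.67, 0.75, 1.0]
print(recall_per_class.round(2).tolist())     # [0.67, 1.0, 0.75]
```

Class 1 has perfect Recall because every true 1 was found, but its Precision is lower because one instance of class 0 was also labeled as a 1.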

The F1 Score

Finally, we have the F1 Score, which combines both the Precision and Recall.

from sklearn.metrics import f1_score

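# average=None returns one F1 Score per digit; the default only supports binary labels
f1 = f1_score(mnist_labels, network_output, average=None)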
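If you would rather have a single number than one score per label, the same average parameter can combine them for you. Here is a sketch on made-up 3-class data: average=None keeps the per-class scores, and average="macro" takes their unweighted mean.

```python
from sklearn.metrics import f1_score

# Made-up labels and predictions for a 3-class problem (classes 0, 1, 2).
true_labels = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
predictions = [0, 0, 1, 1, 1, 1, 2, 2, 0, 2]

# One F1 Score per class.
per_class = f1_score(true_labels, predictions, average=None)
# Unweighted mean of the per-class scores.
macro = f1_score(true_labels, predictions, average="macro")

print(per_class.round(2).tolist())  # [0.67, 0.86, 0.86]
print(round(macro, 3))              # 0.794
```

The macro average treats every class equally, which suits MNIST because the digits are roughly evenly represented. The precision_score() and recall_score() functions accept the same average parameter.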

Play around with each one of these scores. In the next post, we’ll focus on how to increase them by making improvements to our model.

You can get the code for this post at my GitHub page here.


Come back next week, when I do another post covering Neural Network Theory. Until then, have a good weekend!
