Deep Residual Networks for Image Classification with Python + NumPy
Update
I am proud to announce that you can now also read this post on KDnuggets!
Thanks, Matthew Mayo!
TL;DR
I wanted to implement “Deep Residual Learning for Image Recognition” from scratch in Python for my master’s thesis in computer engineering. I ended up implementing a simple (CPU-only) deep learning framework along with the residual model, and trained it on CIFAR10, MNIST and SFDDD. The results speak for themselves.
Convolutional Neural Networks for Computer Vision
On Monday, June 13th, I graduated with a master’s degree in computer engineering, presenting a thesis on deep convolutional neural networks for computer vision. For now it is available only in Italian. I am working on the English translation but don’t know if and when I’ll have the time to finish it, so I will briefly describe each chapter here.
The document is composed as follows:

Introduction
An introduction to the topic, a description of the thesis structure and a brief history of neural networks, from perceptrons to the Neocognitron.

Neural Networks fundamentals
A description of the fundamental mathematical concepts behind deep learning.

State of the Art
A description of the main concepts behind the progress of the last decade, an introduction to the image classification and object localization problems, ILSVRC, and the models that obtained the best results on both tasks from 2012 to 2015.

Implementing a Deep Learning Framework
This chapter explains how to implement the forward and backward steps for each of the layers used by the residual model, the implementation of the residual model itself, and some methods to test a network before training.

Experimental Results
After developing the model and a solver to train it, I conducted several experiments with the residual model on CIFAR10. In this chapter I show how I tested the model and how the behavior of the network changes when one removes the residual paths, applies data-augmenting functions to reduce overfitting, or increases the number of layers. Then I show how to fool a trained network using randomly generated images or images from the dataset.

Conclusions
Here I describe other results obtained by training the same model on MNIST and SFDDD (see below for more information), give an overview of the project and outline possible future work.
Thesis links:
 Italian
 English (WIP)
Presentation links:
Below I briefly describe how I got there: the sources I used, the structure of the residual model I trained and the results I obtained. Please keep in mind that my first objective was to develop and train the model, so I didn’t spend much time on the design of the framework, but I’m working on it (and pull requests are welcome)!
Sources
When I decided to implement “Deep Residual Learning for Image Recognition”, the only project available on GitHub was this one from gcr, based on Lua + Torch. That code really helped me a lot when I had to implement the residual model.
Neural Networks and Deep Learning by Michael Nielsen contains a well-organized, exhaustive introduction to the subject and a lot of code to help the reader understand what is going on in each part of the process.
colah.github.io by Christopher Olah has a lot of very well written posts about deep learning and NNs. For example, I found this post about convolution layers really illuminating.
Stanford’s CS231N by Andrej Karpathy et al. is a really interesting course about CNNs for visual recognition. I mainly used the course material and my assignment solutions to build PyFunt.
arXiv is an online repository of e-prints of scientific papers in the fields of mathematics, physics, astronomy, computer science, quantitative biology, statistics and quantitative finance. Check also Arxiv Sanity Preserver by Karpathy.
Many other awesome resources are listed here: awesome-deep-learning.
When I started studying deep learning I kept track of the best papers and collected titles, authors, years and links in this Google Sheet.
PyFunt, PyDatSet and Deep Residual Networks
PyFunt is a simple, Pythonic, imperative deep learning framework. It mainly provides implementations of the forward and backward steps for the most common neural network layers, some useful initialization functions, and a solver. The solver is essentially a class that you instantiate and to which you pass the model to be trained and the data loaded with PyDatSet, which contains functions to import some datasets and a set of functions to artificially augment the training set. Just to clarify, PyFunt and PyDatSet are the names of the repos, while pyfunt and pydatset are the names of the packages (so you import them with from pydatset import ...).
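To make "forward and backward steps" concrete, here is a minimal sketch of one layer in that style, together with a numerical gradient check of the kind that can be used to test a layer before training. The function names and signatures (affine_forward, affine_backward, numerical_gradient) are illustrative assumptions in the CS231N style, not necessarily the actual pyfunt API.

```python
import numpy as np

def affine_forward(x, w, b):
    """Forward step: flatten each sample and compute x @ w + b."""
    out = x.reshape(x.shape[0], -1) @ w + b
    cache = (x, w)  # values needed later by the backward step
    return out, cache

def affine_backward(dout, cache):
    """Backward step: gradients of the loss w.r.t. x, w and b."""
    x, w = cache
    dx = (dout @ w.T).reshape(x.shape)
    dw = x.reshape(x.shape[0], -1).T @ dout
    db = dout.sum(axis=0)
    return dx, dw, db

def numerical_gradient(f, x, dout, h=1e-5):
    """Centered finite differences of sum(f(x) * dout) w.r.t. x."""
    grad = np.zeros_like(x)
    it = np.nditer(x, flags=["multi_index"])
    while not it.finished:
        idx = it.multi_index
        old = x[idx]
        x[idx] = old + h
        pos = f(x)
        x[idx] = old - h
        neg = f(x)
        x[idx] = old  # restore the original value
        grad[idx] = np.sum((pos - neg) * dout) / (2 * h)
        it.iternext()
    return grad

# Hypothetical toy shapes: 4 samples of shape (3, 2, 2), 5 output units.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 3, 2, 2))
w = rng.standard_normal((12, 5))
b = rng.standard_normal(5)
dout = rng.standard_normal((4, 5))

out, cache = affine_forward(x, w, b)
dx, dw, db = affine_backward(dout, cache)
dx_num = numerical_gradient(lambda x_: affine_forward(x_, w, b)[0], x, dout)
max_err = np.max(np.abs(dx - dx_num))
print("max abs difference between analytic and numerical dx:", max_err)
```

If the backward step is correct, the analytic and numerical gradients should agree up to floating-point error. The same check can be repeated for w and b.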
The residual model implementation resides in deep-residual-networks-pyfunt, which also contains the train.py file.
The residual model proposed in the reference paper is derived from the VGG model: 3x3 convolution filters are applied with a stride of 1 when the number of channels stays constant, and with a stride of 2 when the number of feature maps is doubled (this preserves the computational complexity of each convolutional layer). The residual model is a cascade of many residual blocks (or residual layers), which are groups of convolutional layers in series where the output of the last layer is added to the original input of the block. The authors suggest that a couple of conv layers for each residual block should work well.
                   Input
                     |
           ,---------+---------.
           |                   |
     Downsampling       3x3 convolution + dimensionality reduction
           |                   |
           v                   v
     Zero-padding       3x3 convolution
           |                   |
           `------( Add )------'
                     |
                   Output
Each residual block is structured as above. If dimensionality reduction is applied (using a convolution stride of 2 instead of 1), downsampling and zero-padding must be applied to the input before the addition, so that the two ndarrays have the same shape and can be summed (skip_path + conv_out).
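The skip path of a dimensionality-reducing block can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions: downsampling is done here by plain strided subsampling (average pooling would be another option), and the extra channels are zero-padded evenly on both sides of the channel axis. The helper name shortcut is hypothetical, and the convolution path is replaced by a stand-in array of the right shape.

```python
import numpy as np

def shortcut(x, out_channels, stride):
    """Skip path: subsample spatially, then zero-pad the channel axis.

    x has shape (N, C, H, W). Both the subsampling method and the
    position of the padding are assumptions made for this sketch.
    """
    n, c, h, w = x.shape
    skip = x[:, :, ::stride, ::stride]  # downsampling by striding
    extra = out_channels - c
    if extra > 0:
        before = extra // 2
        after = extra - before
        skip = np.pad(
            skip,
            ((0, 0), (before, after), (0, 0), (0, 0)),
            mode="constant",
        )
    return skip

# Block that goes from 16 channels at 32x32 to 32 channels at 16x16.
x = np.ones((2, 16, 32, 32))
conv_out = np.zeros((2, 32, 16, 16))  # stand-in for the two 3x3 convolutions
skip_path = shortcut(x, out_channels=32, stride=2)
block_out = skip_path + conv_out      # shapes now match, so the sum is valid
print(skip_path.shape, block_out.shape)
```

Without the downsampling and padding, skip_path would keep the shape (2, 16, 32, 32) and the addition with conv_out would fail.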
A parametric residual network has (6*n)+2 layers in total, arranged as below (the values on the right are the output dimensions for a [3,32,32] sample such as a CIFAR image):
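(image_dim: 3, 32, 32; F=16)
(input_dim: N, *image_dim)

                INPUT
                  |
                  v
+-----------------------------------+
| conv[F, image_ch, 3, 3]           |  (out_shape: N, 16, 32, 32)
+-----------------------------------+
                  |
                  v
+-----------------------------------+
| n * res_block[F, F, 3, 3]         |  (out_shape: N, 16, 32, 32)
+-----------------------------------+
                  |
                  v
+-----------------------------------+
| res_block[2*F, F, 3, 3]           |  (out_shape: N, 32, 16, 16)
+-----------------------------------+
                  |
                  v
+-----------------------------------+
| (n-1) * res_block[2*F, 2*F, 3, 3] |  (out_shape: N, 32, 16, 16)
+-----------------------------------+
                  |
                  v
+-----------------------------------+
| res_block[4*F, 2*F, 3, 3]         |  (out_shape: N, 64, 8, 8)
+-----------------------------------+
                  |
                  v
+-----------------------------------+
| (n-1) * res_block[4*F, 4*F, 3, 3] |  (out_shape: N, 64, 8, 8)
+-----------------------------------+
                  |
                  v
+-----------------------------------+
| pool[8, 8, 8]                     |  (out_shape: N, 64, 1, 1)
+-----------------------------------+
                  |
                  v
+ - - - - - - - - - - - - - - - - - +
| (opt) m * affine                  |  (out_shape: N, 64, 1, 1)
+ - - - - - - - - - - - - - - - - - +
                  |
                  v
+-----------------------------------+
| softmax                           |  (out_shape: N, num_classes)
+-----------------------------------+
                  |
                  v
                OUTPUT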
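The (6*n)+2 count and the output shapes in the diagram can be checked with a few lines of Python. This is only a bookkeeping sketch: the helper names stages and depth are hypothetical, and the count assumes two conv layers per residual block plus the first convolution and the final classifier layer.

```python
def stages(n, image_dim=(3, 32, 32), f=16):
    """List (name, repetitions, output shape per sample) for each stage."""
    _, h, w = image_dim
    plan = [("conv", 1, (f, h, w)), ("res_block", n, (f, h, w))]
    for mult in (2, 4):
        h, w = h // 2, w // 2  # the stride-2 block halves height and width
        plan.append(("res_block (stride 2)", 1, (mult * f, h, w)))
        plan.append(("res_block", n - 1, (mult * f, h, w)))
    plan.append(("pool", 1, (4 * f, 1, 1)))
    return plan

def depth(n, convs_per_block=2):
    """First conv + convs inside residual blocks + final classifier layer."""
    blocks = sum(count for name, count, _ in stages(n)
                 if name.startswith("res_block"))
    return 1 + convs_per_block * blocks + 1

for name, count, shape in stages(3):
    print(f"{count} x {name:<22} -> {shape}")
print("depth for n = 3, 5, 7:", depth(3), depth(5), depth(7))
```

With n = 3, 5 and 7 this gives the 20-, 32- and 44-layer networks used in the experiments below.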
The package diagram below shows how train.py uses the other components to train the residual model.
Once I had every piece in place, I started experimenting with what happens when you remove the residual paths, when you apply or skip data-augmenting functions on the training set, and when you increase the number of layers or the number of filters in each layer. Below you can find some images of the results, but I suggest taking a look at the respective Jupyter notebooks (in addition to the thesis and presentation linked above) for a deeper understanding, as they contain a more exhaustive description of the results on all the datasets I show below.
Results
I trained the residual model on CIFAR10, MNIST and SFDDD, and the results are really exciting, at least for me. The networks learn well in nearly every test I’ve run; obviously my limit is the capacity of my desktop PC.
CIFAR10
One of the experiments on CIFAR10 involved training a simple 20-layer ResNet. Applying data-augmenting regularization functions, I obtained a result similar to the one shown in the reference paper, as you can see below.
Training this model took approximately 10 hours. More information is available in this Jupyter notebook from the repo’s docs folder.
MNIST
MNIST is a much simpler dataset than CIFAR10, so the training times are relatively short, and I also tried using half the number of filters in each conv layer.
More information on the experiments with residual networks on MNIST is available here.
In the image above you can see all the wrongly classified validation samples from the 32-layer network, trained for just 30 epochs(!). For each sample, the upper left shows the ground-truth class, the lower left the wrong classification from the net, and the lower right the second most confident classification.
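The three values shown for each sample (ground truth, wrong prediction, second most confident class) can be extracted from the network's class scores as in the sketch below. The helper name top2_errors and the toy scores are hypothetical. Real scores would come from a forward pass of the trained model on the validation set.

```python
import numpy as np

def top2_errors(scores, labels):
    """Return indices of misclassified samples with their top-2 classes.

    scores has shape (N, num_classes), labels has shape (N,).
    """
    order = np.argsort(scores, axis=1)[:, ::-1]  # classes by descending score
    first, second = order[:, 0], order[:, 1]
    wrong = np.flatnonzero(first != labels)
    return wrong, first[wrong], second[wrong]

# Toy scores for 3 samples and 3 classes (made-up numbers).
scores = np.array([[0.1, 0.7, 0.2],
                   [0.5, 0.3, 0.2],
                   [0.2, 0.3, 0.5]])
labels = np.array([1, 2, 2])

wrong, pred, runner_up = top2_errors(scores, labels)
for i, p, r in zip(wrong, pred, runner_up):
    print(f"sample {i}: truth={labels[i]} predicted={p} second={r}")
```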
SFDDD
State Farm Distracted Driver Detection is a dataset from State Farm on kaggle.com. It contains 640x480 images of drivers in 10 classes of distraction. For this dataset I decided to resize all the images to 64x48, use random 32x32 crops for training and the center 32x32 crop for testing. I also tried scaling all images directly to 32x32, but the results were worse (confirming that shrinking the images does not help conv nets much in learning more general features).
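The two cropping schemes can be sketched in NumPy as follows. This is an illustrative version, not necessarily the functions provided by pydatset: the names random_crop and center_crop are assumptions, and the images are assumed to be already resized and stored as an array of shape (N, channels, height, width), so a 64x48 image is 48 rows by 64 columns.

```python
import numpy as np

def random_crop(images, size, rng):
    """Take one random size x size crop from each image (training)."""
    n, c, h, w = images.shape
    tops = rng.integers(0, h - size + 1, size=n)
    lefts = rng.integers(0, w - size + 1, size=n)
    out = np.empty((n, c, size, size), dtype=images.dtype)
    for i, (t, l) in enumerate(zip(tops, lefts)):
        out[i] = images[i, :, t:t + size, l:l + size]
    return out

def center_crop(images, size):
    """Take the central size x size crop from each image (testing)."""
    _, _, h, w = images.shape
    t = (h - size) // 2
    l = (w - size) // 2
    return images[:, :, t:t + size, l:l + size]

rng = np.random.default_rng(0)
images = rng.random((5, 3, 48, 64))  # stand-in for resized SFDDD images

train_batch = random_crop(images, 32, rng)
test_batch = center_crop(images, 32)
print(train_batch.shape, test_batch.shape)
```

Random crops give the network a slightly different view of each image at every epoch, while the fixed center crop keeps the evaluation deterministic.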
Below you can see the learning curves for two models of 32 and 44 layers respectively. Both models appear to reach a low error after 80 epochs, but the problem is that for the validation set I used 2k images randomly extracted from the training set. My validation set is therefore more strongly correlated with the training set than the validation set proposed by State Farm is (on which I got an error of about 3%).
Below you can see the saliency maps for six images of the class “talking on phone with right hand”, where the lighter zones represent the portions of the images that contributed most to a correct classification by the network.
More information will be available here after the competition ends.
Final Words
I hope my projects can help you learn something new. If not, maybe you can teach me something new: comments and pull requests are welcome as always!