Java Image Cat and Dog Recognition With Deep Neural Networks
In this post, we are going to develop a cat and dog image recognition Java application using Deeplearning4j. If you would like to experiment with your own cat or dog, feel free to check out the source code or download the application (fairly short instructions at the end).
Computer Vision Nature
Even with the great progress of deep learning, computer vision problems tend to be hard to solve. One of the reasons is that neural networks (NN) are trying to learn a highly complex function, like image recognition or image object detection. We have a bunch of pixel values, and from there, we would like to figure out what is inside — so this really is a complex problem on its own.
Another reason computer vision still struggles is the amount of data we have. The amount of data we have now is definitely bigger than before, but it's still not enough for all computer vision problems. In particular, image object detection has even less data compared to image recognition (i.e. is it a cat, dog, or flower?) because it requires more intensive data labeling (going into each image and specifically marking each object).
Because computer vision is hard, complex architectures and techniques have been developed to achieve better results. We saw in a previous post how adding convolution (specialized image feature detectors) to neural networks greatly improved performance on a handwritten digit recognition problem (from 97% to 99.5%) but also introduced more complex parameters and greatly increased the training time (to more than two hours).
Usually, an NN that works for a particular image recognition problem can also work for other image-related problems. Fortunately, there are several ways we can approach computer vision problems and still be productive and get great results.
- We can re-use already-successful architectures, which reduces the time needed to choose hidden layers, convolution layers, and other configuration parameters (e.g. learning rate).
- We can also re-use already-trained neural networks (e.g. maybe someone already let the NN learn for a few weeks) and greatly reduce the training time.
- We can play with the training data by cropping, changing colors, rotating, etc. to obtain more data, which helps NNs learn more and be smarter.
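To make the last idea more concrete, here is a minimal sketch of two simple augmentations (a horizontal flip and a crop) written with plain java.awt.image classes. The class and method names are hypothetical and are not part of Deeplearning4j, which has its own image transform utilities; this only illustrates the idea of producing extra training images from existing ones.

```java
import java.awt.image.BufferedImage;

// Hypothetical helper for illustration only; not part of Deeplearning4j.
class SimpleAugmenter {

    // Mirror the image left-to-right; a flipped cat is still a cat.
    static BufferedImage flipHorizontal(BufferedImage src) {
        int w = src.getWidth();
        int h = src.getHeight();
        BufferedImage out = new BufferedImage(w, h, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                out.setRGB(w - 1 - x, y, src.getRGB(x, y));
            }
        }
        return out;
    }

    // Copy a rectangular region (x0, y0, w, h) into a new, smaller image.
    static BufferedImage crop(BufferedImage src, int x0, int y0, int w, int h) {
        BufferedImage out = new BufferedImage(w, h, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                out.setRGB(x, y, src.getRGB(x0 + x, y0 + y));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        BufferedImage img = new BufferedImage(4, 2, BufferedImage.TYPE_INT_RGB);
        img.setRGB(0, 0, 0xFF0000); // one red pixel in the top-left corner
        BufferedImage flipped = flipHorizontal(img);
        // After the flip, the red pixel sits in the top-right corner.
        System.out.println(Integer.toHexString(flipped.getRGB(3, 0) & 0xFFFFFF));
    }
}
```

Each original photo can yield several variants this way, and every variant keeps the label of the original image.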
Let's see how we can solve the problem of detecting a cat and dog!
Well-Known Architectures
LeNet-5 is a classic neural network architecture successfully used on the handwritten digit recognizer problem back in 1998. You can find more information for other versions of the LeNet architecture here. There is an already-existing implementation in the Deeplearning4j library on GitHub (although not exactly as it is in the paper).
The LeNet-5 architecture looks like this (if you're not familiar with convolution, please have a quick look here):
In principle, this architecture introduced the idea of applying several convolution layers and pooling layers before connecting to a neural network and then to the outputs.
It takes as input a 32x32x1 matrix (the third dimension is 1 for black and white; it will be 3 for RGB), then applies 6 convolution 5×5 matrices. Applying the formula described here, we get a 28x28x6 matrix. Notice that the third dimension is equal to the number of convolution matrices. Usually, convolution will reduce the first two dimensions (width x height) but increase the third dimension (channels).
After that, we apply 2×2 with stride 2 for the max pooling layer (in the paper, this was average pool), which gives a matrix of 14x14x6. Notice that the pooling layer left the third dimension unchanged but reduced the first two (width X height) by dividing by 2, so pooling layers are used to reduce only the first two dimensions.
Additionally, we apply 16 convolution 5×5 matrices, which gives 10x10x16. Then, by adding 2×2 max pooling, we end up with 5x5x16.
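The sizes above can be reproduced with the standard output-size formula for convolution and pooling layers, floor((n + 2p - f) / s) + 1, where n is the input width, f the filter size, s the stride, and p the padding. The sketch below is only an illustration; the class and method names are made up for this example.

```java
// Hypothetical helper for illustration; the names are not from any library.
class LayerShapes {

    // Output width (or height) of a convolution or pooling layer:
    // floor((n + 2 * padding - filter) / stride) + 1
    static int outputSize(int n, int filter, int stride, int padding) {
        return (n + 2 * padding - filter) / stride + 1;
    }

    public static void main(String[] args) {
        int size = 32;                     // 32x32x1 input
        size = outputSize(size, 5, 1, 0);  // 6 convolutions 5x5, stride 1
        System.out.println("after conv 1: " + size + "x" + size + "x6");
        size = outputSize(size, 2, 2, 0);  // pooling 2x2, stride 2
        System.out.println("after pool 1: " + size + "x" + size + "x6");
        size = outputSize(size, 5, 1, 0);  // 16 convolutions 5x5, stride 1
        System.out.println("after conv 2: " + size + "x" + size + "x16");
        size = outputSize(size, 2, 2, 0);  // pooling 2x2, stride 2
        System.out.println("after pool 2: " + size + "x" + size + "x16");
    }
}
```

The third dimension is not computed by the formula: for a convolution it is simply the number of filters, and for pooling it is copied from the input.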
We use the 5x5x16 output of these convolution and pooling layers to feed a neural network with a single hidden layer of 500 neurons and ten outputs (0-9 digits). The model has to learn approximately 60,000 parameters.
According to the paper, this model was able to achieve 99.05% accuracy!
AlexNet is a more modern architecture (2012) that works on RGB colored images and has way more convolutions and fully connected neurons. This architecture showed great results and convinced a lot of people that deep learning works pretty well for image problems. In a way, it is similar to LeNet-5, just bigger and deeper, because by that time the available processing power was also much greater (GPUs had been widely introduced).
There is also an already-existing implementation in the Deeplearning4j library on GitHub.
The architecture will look like below:
We start with more pixels and also colored images (224x224x3 RGB image). In principle, this is the same as LeNet-5 above, just with more convolutions and pooling layers. Convolutions are used to increase the third dimension, and the first two dimensions are usually left unchanged (except the first one with stride s=4). Pooling layers are used to decrease (usually by dividing by 2) the first two dimensions (width x height) and leave the third dimension untouched. "Conv same" simply means leaving the first two dimensions (width x height) unchanged. Using the formulas described in the previous post, it's fairly easy to get the same values as in the picture.
After adding several convolution and pooling layers, we end up with a 6x6x256 matrix, which is used to feed a large neural network with three hidden layers (9216, 4096, 4096).
AlexNet is trying to detect more categories: 1,000 of them, compared to LeNet-5, which had only ten (0-9 digits). At the same time, it has way more parameters to learn: approximately 60 million (1,000 times more than LeNet-5).
VGG-16 is an architecture from 2015 that, besides having even more parameters, is also more uniform and simple. Instead of having different-sized convolution and pooling layers, VGG-16 uses only one size for each of them and just applies them several times.
There is also an already-existing implementation in the Deeplearning4j library on GitHub.
It always uses "same" convolutions of size 3x3xN with stride s=1; only the third dimension differs from layer to layer. Also, it uses max pooling 2×2 with stride s=2. The pooling layer always has the same third dimension value as the input (it changes only width and height), so we do not show the third dimension. Let's see how this architecture will look:
Notice again how, step by step, height and width were decreased by adding pooling layers, and channels (third dimension) were increased by adding convolutions. Although the model is bigger, it's easier to read and understand thanks to the uniform way of using convolution and pooling layers.
This architecture has 138 million parameters, more than twice as many as AlexNet (60 million), and similarly, it tries to detect 1,000 image categories.
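The 138 million figure can be checked with a short calculation. The sketch below assumes the standard VGG-16 layout (13 "same" 3x3 convolutions with 64 to 512 filters, five 2×2 poolings that shrink 224 down to 7, and three dense layers of 4096, 4096, and 1,000 neurons) and counts one bias per filter or neuron. The class and method names are hypothetical.

```java
// Hypothetical helper; assumes the standard VGG-16 layer layout described in the lead-in.
class Vgg16ParamCount {

    // Weights of a 3x3 convolution plus one bias per output channel.
    static long convParams(int inChannels, int outChannels) {
        return 3L * 3L * inChannels * outChannels + outChannels;
    }

    // Weights of a fully connected layer plus one bias per output neuron.
    static long denseParams(long inputs, long outputs) {
        return inputs * outputs + outputs;
    }

    static long totalParams(int numClasses) {
        // Number of filters in each of the 13 convolution layers.
        int[] convOut = {64, 64, 128, 128, 256, 256, 256, 512, 512, 512, 512, 512, 512};
        long total = 0;
        int in = 3; // RGB input has three channels
        for (int out : convOut) {
            total += convParams(in, out);
            in = out;
        }
        // Five poolings halve 224 five times (224 -> 7), so the first dense layer sees 7*7*512 values.
        total += denseParams(7L * 7L * 512L, 4096);
        total += denseParams(4096, 4096);
        total += denseParams(4096, numClasses);
        return total;
    }

    public static void main(String[] args) {
        System.out.println("VGG-16 parameters for 1,000 classes: " + totalParams(1000));
    }
}
```

Most of the parameters come from the first dense layer, which is one reason the fully connected part of such networks is expensive to train.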
Other Great Architectures
There are more architectures that are even bigger and deeper than the three above. For their implementation and a list of the other architectures, please see the Deeplearning4j classes list on GitHub. But just to mention a few, there is also:
- VGG-19, a bigger version of VGG-16.
- ResNet-50, which uses residual neural networks.
- GoogLeNet, which uses inception networks; it was developed by Google and took its name in honor of the classic LeNet-5.
Transfer Learning
One great thing about machine learning applications is that they are highly portable between different frameworks and even programming languages. Once you train a neural network, you get a bunch of parameter values (decimal values): 60,000 for LeNet-5, 60 million for AlexNet, and 138 million for VGG-16. Then, we use those parameter values to classify incoming images into one of 1,000 classes for AlexNet and VGG-16 or one of 10 for LeNet-5.
The most valuable part of our application is the parameters. If we save parameters on disk and load them later, we get the same result as prior to saving (for previously predicted images). Even if we save with Python and load with Java (or the other way around), we get the same result (assuming the neural network implementation is correct on both of them).
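As a toy illustration of this point, the sketch below saves an array of weights to disk, loads it back, and checks that a very simple "model" (a weighted sum) gives the same prediction before and after. Everything here is hypothetical and unrelated to how Deeplearning4j stores models; note also that DataOutputStream writes big-endian values, so a reader in another language would have to use the same byte order.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical toy example; not the Deeplearning4j model format.
class WeightsIo {

    // Write the number of weights followed by each weight as a double.
    static void save(double[] weights, File file) {
        try (DataOutputStream out = new DataOutputStream(new FileOutputStream(file))) {
            out.writeInt(weights.length);
            for (double w : weights) {
                out.writeDouble(w);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Read the weights back in the same order they were written.
    static double[] load(File file) {
        try (DataInputStream in = new DataInputStream(new FileInputStream(file))) {
            double[] weights = new double[in.readInt()];
            for (int i = 0; i < weights.length; i++) {
                weights[i] = in.readDouble();
            }
            return weights;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Toy "model": a weighted sum of the inputs.
    static double predict(double[] weights, double[] input) {
        double sum = 0;
        for (int i = 0; i < weights.length; i++) {
            sum += weights[i] * input[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        File file = new File(System.getProperty("java.io.tmpdir"), "toy-weights.bin");
        double[] trained = {0.5, -1.25, 3.0};
        double[] input = {1, 2, 3};
        save(trained, file);
        double[] loaded = load(file);
        // Same weights in, same prediction out.
        System.out.println(predict(trained, input) == predict(loaded, input));
        file.delete();
    }
}
```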
Transfer learning, as the name suggests, transfers already-trained neural weights to others, like different machines, operating systems, frameworks, languages like Java or Python, etc. — as long as you can read and save the weights' values.
Maybe someone else already trained the network; with transfer learning, we can re-use that work in a few minutes and start from there. We benefit from the painful tuning of hyper-parameters without spending our own time on it, which is especially useful when we do not have a lot of processing power (someone else may have trained with thousands of GPUs). As we will see later, Deeplearning4j already has the ability to save and load pre-trained neural networks, even from frameworks like Keras.
There are several things we can do once we load a pretrained neural network:
- Directly use the network to classify or predict already-trained outputs.
- Modify only the output layer from, let's say, 1,000 outputs to 5, and freeze everything else. We train only the connections from the last layer to the output and re-use everything else, which speeds up training. Freezing means that we don't train those layers' parameters and don't spend any processing power on them but rather use them as they are.
- Freeze some of the layers and add or remove other layers. Then, we train the network only on the newly added layers and not on the frozen ones. We freeze some of the layers because most of them are already useful for our problem (i.e. the trained network solved similar problems) and training all the layers would take a long time without improving much.
- Use the loaded weights only as initial values and then train the whole network, including any new layers. Usually, this is a good choice when we have a lot of processing power and more data and can be almost positive that doing this will bring new findings and maybe better performance.
As we will see, Deeplearning4j supports freezing layers and adding layers to or removing layers from a pre-trained neural network.
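To show what freezing means mechanically, here is a deliberately tiny sketch of one gradient descent step in which frozen layers are skipped. It is a toy with made-up names and does not reflect how Deeplearning4j implements frozen layers internally; real frameworks also avoid computing gradients for frozen layers in the first place.

```java
import java.util.Arrays;

// Hypothetical toy example of "freezing"; not Deeplearning4j internals.
class FrozenLayersSketch {

    // One plain gradient descent step. Layers flagged as frozen keep their weights untouched.
    static void sgdStep(double[][] weights, double[][] gradients, boolean[] frozen, double learningRate) {
        for (int layer = 0; layer < weights.length; layer++) {
            if (frozen[layer]) {
                continue; // frozen: re-use the pre-trained values as they are
            }
            for (int i = 0; i < weights[layer].length; i++) {
                weights[layer][i] -= learningRate * gradients[layer][i];
            }
        }
    }

    public static void main(String[] args) {
        double[][] weights = {{1.0, 1.0}, {1.0, 1.0}};   // layer 0 is "pre-trained", layer 1 is new
        double[][] gradients = {{1.0, 1.0}, {1.0, 1.0}};
        boolean[] frozen = {true, false};
        sgdStep(weights, gradients, frozen, 0.5);
        System.out.println(Arrays.toString(weights[0])); // unchanged
        System.out.println(Arrays.toString(weights[1])); // updated
    }
}
```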
Cat and Dog Recognizer
As always, every machine learning problem starts with the data. The amount and quality of data are crucial for the performance of the system, and most of the time, gathering it requires a great deal of effort and resources. We need to rely on online public datasets as a start and then try to augment or transform existing images to create a larger variety.
For the cat and dog recognizer problem, we have a good dataset provided by Microsoft. The same dataset can also be found on Kaggle. It contains 12,500 cat photos and 12,500 dog photos, plus another 12,500 photos of dogs and cats.
Architecture
Since 2010, ImageNet has hosted an annual challenge where research teams present solutions for image classification and other tasks by training on the ImageNet dataset. ImageNet currently has millions of labeled images; it's one of the largest high-quality image datasets in the world. The Visual Geometry Group at the University of Oxford did really well in 2014 with VGG-16 and VGG-19. We will choose VGG-16 trained with ImageNet for our cat and dog problem because it is similar to what we want to predict. VGG-16 trained with ImageNet can already detect different breeds of cats and dogs; find the list here.
The size of all trained weights and the model is about 500MB, so if you are going to use the code to train, it may take a few moments to download. The code in Deeplearning4j for downloading VGG-16 trained with ImageNet looks like below:
ZooModel zooModel = new VGG16();
ComputationGraph pretrainedNet = (ComputationGraph)
        zooModel.initPretrained(PretrainedType.IMAGENET);
Architecture Adaption
VGG-16 predicts 1,000 classes of images, while we need only two: cat or dog. We need to slightly modify the model to output only two classes instead of 1,000. We leave everything else as-is since our problem is similar to what VGG-16 is already trained for (freezing all other layers). The modified VGG-16 will look like below:
The frozen part is shown in gray, so we do not use any processing power to train it but instead use the downloaded weight values as they are. The part that we train is shown in green, so we are going to train only 8,192 parameters (out of 138 million) from the last layer to the two outputs. The code will look like below:
FineTuneConfiguration fineTuneConf = new FineTuneConfiguration.Builder()
        .learningRate(5e-5)
        .optimizationAlgo(OptimizationAlgorithm.STOCHASTIC_GRADIENT_DESCENT)
        .updater(Updater.NESTEROVS)
        .seed(seed)
        .build();
// Freeze everything up to featurizeExtractionLayer and swap the 1,000-class output for a 2-class one
ComputationGraph vgg16Transfer = new TransferLearning.GraphBuilder(pretrainedNet)
        .fineTuneConfiguration(fineTuneConf)
        .setFeatureExtractor(featurizeExtractionLayer)
        .removeVertexKeepConnections("predictions")
        .addLayer("predictions",
                new OutputLayer.Builder(LossFunctions.LossFunction.NEGATIVELOGLIKELIHOOD)
                        .nIn(4096).nOut(NUM_POSSIBLE_LABELS) // 2 labels: cat and dog
                        .weightInit(WeightInit.XAVIER)
                        .activation(Activation.SOFTMAX).build(),
                featurizeExtractionLayer)
        .build();
The method that freezes the weights is setFeatureExtractor (from the Java doc of Deeplearning4j):
/**
 * Specify a layer vertex to set as a "feature extractor"
 * The specified layer vertex and the layers on the path from an input vertex to it it will be "frozen" with parameters staying constant
 * @param layerName
 * @return Builder
 */
public GraphBuilder setFeatureExtractor(String... layerName) {
    this.hasFrozen = true;
    this.frozenOutputAt = layerName;
    return this;
}
Everything from the input to the layer name that you defined will be frozen. If you are wondering what the layer name is and how to find it, you can first print the model architecture like below:
ZooModel zooModel = new VGG16();
ComputationGraph pretrainedNet = (ComputationGraph) zooModel.initPretrained(PretrainedType.IMAGENET);
log.info(pretrainedNet.summary());
After that, you will get something that looks like below:
Notice that the number of trainable parameters is equal to the total number of parameters (138 million). We are going to freeze from the input to the last dense layer, fc2, so the value of the variable featurizeExtractionLayer will be fc2. Please find below a view after the freeze:
Notice how names end with frozen now. Also, the trainable parameters changed from 138 million to 8,194 (8,192 + 2 bias parameters).
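The trainable parameter count is easy to verify: the only layer still being trained connects the 4,096 neurons of fc2 to the 2 outputs, with one bias per output. A minimal sketch, with a hypothetical class and method name:

```java
// Hypothetical helper for illustration only.
class TrainableParams {

    // Weights of a fully connected output layer, optionally including one bias per class.
    static long outputLayerParams(int inputs, int classes, boolean withBias) {
        long weights = (long) inputs * classes;
        return withBias ? weights + classes : weights;
    }

    public static void main(String[] args) {
        System.out.println("weights only: " + outputLayerParams(4096, 2, false));
        System.out.println("with biases:  " + outputLayerParams(4096, 2, true));
    }
}
```

The same calculation with 1,000 classes gives roughly 4.1 million parameters for the original output layer, so the replacement layer is about 500 times smaller.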
Train and Results
Now we are ready to train the model with a few lines of code:
DataSetIterator testIterator = getDataSetIterator(test.sample(PATH_FILTER, 1, 0)[0]);
int iEpoch = 0;
int i = 0; // number of batches (iterations) processed so far
while (iEpoch < EPOCH) {
    while (trainIterator.hasNext()) {
        DataSet trained = trainIterator.next();
        vgg16Transfer.fit(trained); // one forward and backward pass on one batch
        if (i % SAVED_INTERVAL == 0 && i != 0) {
            // Periodically save the model and check progress on the dev set
            ModelSerializer.writeModel(vgg16Transfer, new File(SAVING_PATH), false);
            evalOn(vgg16Transfer, devIterator, i);
        }
        i++;
    }
    trainIterator.reset();
    iEpoch++;
    evalOn(vgg16Transfer, testIterator, iEpoch); // evaluate on the test set after each epoch
}
We are using a batch size of 16 and 3 epochs. The outer while loop will be executed three times since epoch=3. The inner while loop will be executed 1,563 times per epoch (25,000 cats and dogs / 16, rounded up). One epoch is a full traversal through the data and one iteration is one forward and back propagation on one batch (16 images, in our case). Our model learns in small steps of 16 images and each time becomes smarter and smarter.
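The number of iterations per epoch is just a ceiling division of the number of images by the batch size. The sketch below (hypothetical names) reproduces the 1,563 figure for 25,000 images. As an additional calculation not stated in the text above: if only 85% of the images (21,250) end up in the training iterator, as in the training run described further down, the same formula gives 1,329.

```java
// Hypothetical helper for illustration only.
class BatchMath {

    // Number of batches needed to see every example once (the last batch may be smaller).
    static int iterationsPerEpoch(int examples, int batchSize) {
        return (examples + batchSize - 1) / batchSize; // ceiling division
    }

    public static void main(String[] args) {
        int batchSize = 16;
        int epochs = 3;
        int perEpoch = iterationsPerEpoch(25_000, batchSize);
        System.out.println("iterations per epoch: " + perEpoch);
        System.out.println("iterations in total:  " + perEpoch * epochs);
        System.out.println("with 85% of the data: " + iterationsPerEpoch(21_250, batchSize));
    }
}
```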
Before, it was common to not train neural networks with batches but rather to feed all the data at once and have epochs with bigger values like 100 or 200. In modern deep learning, due to big data, this method is not used anymore because it is really slow. If we feed the network all the data at once, we'll wait until the model iterates over all of the data (millions of images) before making any progress. With batches, the model learns and progresses faster with small steps. There is more to batch vs. no batch, but that is out of the scope of this post, so we will leave that for another post.
You can find the full code used for training on GitHub. The first time it runs, it has to download and unzip 600MB of image data to the resources folder, so the first run may take some time.
After training on 85% of the training set (25,000 images) for three hours, we were able to get the results below (code used for evaluating).
Dev set accuracy:
15% of Training Set Used as Dev Set
Examples labeled as cat classified by model as cat: 1833 times
Examples labeled as cat classified by model as dog: 42 times
Examples labeled as dog classified by model as cat: 31 times
Examples labeled as dog classified by model as dog: 1844 times
==========================Scores==========================
# of classes: 2
Accuracy: 0.9805
Precision: 0.9805
Recall: 0.9805
F1 Score: 0.9806
=========================================================
Test set accuracy:
1246 Cats and 1009 Dogs
Examples labeled as cat classified by model as cat: 934 times
Examples labeled as cat classified by model as dog: 12 times
Examples labeled as dog classified by model as cat: 46 times
Examples labeled as dog classified by model as dog: 900 times
==========================Scores=====================================
# of classes: 2
Accuracy: 0.9693
Precision: 0.9700
Recall: 0.9693
F1 Score: 0.9688
===================================================================
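The scores can be recomputed from the four counts in each confusion matrix. The sketch below uses the textbook definitions of accuracy, precision, recall, and F1 with hypothetical method names; how Deeplearning4j averages the per-class values in its printed summary is not covered here, so small differences in the last digit are possible.

```java
// Hypothetical helper using textbook metric definitions; not the Deeplearning4j Evaluation class.
class BinaryMetrics {

    // Share of all examples that were classified correctly.
    static double accuracy(int catAsCat, int catAsDog, int dogAsCat, int dogAsDog) {
        int correct = catAsCat + dogAsDog;
        int total = catAsCat + catAsDog + dogAsCat + dogAsDog;
        return (double) correct / total;
    }

    // Of everything predicted as a class, how much really was that class.
    static double precision(int truePositives, int falsePositives) {
        return (double) truePositives / (truePositives + falsePositives);
    }

    // Of everything that really was a class, how much was found.
    static double recall(int truePositives, int falseNegatives) {
        return (double) truePositives / (truePositives + falseNegatives);
    }

    // Harmonic mean of precision and recall.
    static double f1(double precision, double recall) {
        return 2 * precision * recall / (precision + recall);
    }

    public static void main(String[] args) {
        // Dev set counts from the output above.
        int catAsCat = 1833, catAsDog = 42, dogAsCat = 31, dogAsDog = 1844;
        double catPrecision = precision(catAsCat, dogAsCat);
        double catRecall = recall(catAsCat, catAsDog);
        System.out.printf("accuracy:      %.4f%n", accuracy(catAsCat, catAsDog, dogAsCat, dogAsDog));
        System.out.printf("cat precision: %.4f%n", catPrecision);
        System.out.printf("cat recall:    %.4f%n", catRecall);
        System.out.printf("cat F1:        %.4f%n", f1(catPrecision, catRecall));
    }
}
```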
Application
The application can be downloaded and executed without any knowledge of Java, although Java does have to be installed on your computer. Feel free to try it with your own cat or dog!
It is possible to run from the source by simply executing the RUN class, or, if you do not want to open it with an IDE, just run mvn clean install exec:java.
Some final notes:
- Please be aware that the first time you run the code, it will download 500MB of weights from Dropbox, so it may take some time depending on your network.
- To speed up training time, the model was trained only on cat and dog images and has not seen (been trained on) anything other than cats and dogs. To avoid predicting non-cat or non-dog images as cats or dogs, the threshold was increased to 0.95, meaning that we classify an image as a cat or dog only when the confidence of the model is very high (95%). In practice, the threshold is usually much lower, like 50% (0.5), so if you are not satisfied with the prediction, try lowering the threshold.
- The downloadable application has the threshold set to 95%.
- Training is not provided with the GUI since it takes a lot of time and memory. Feel free to train and experiment on your own by modifying this class. Expect some time for the code to download 500MB of VGG-16 ImageNet weights and 600MB of training data the first time.
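The thresholding described in the notes above can be sketched in a few lines. This is an illustration only: the class name, method name, and the "not sure" label are hypothetical and are not taken from the application's source code.

```java
// Hypothetical sketch of threshold-based classification; names and labels are illustrative.
class ThresholdClassifier {

    // Accept a label only when the model's confidence reaches the threshold.
    static String classify(double catProbability, double dogProbability, double threshold) {
        if (catProbability >= threshold) {
            return "cat";
        }
        if (dogProbability >= threshold) {
            return "dog";
        }
        return "not sure";
    }

    public static void main(String[] args) {
        // A confident prediction passes the strict 0.95 threshold.
        System.out.println(classify(0.97, 0.03, 0.95));
        // A hesitant prediction is rejected at 0.95 but accepted at 0.5.
        System.out.println(classify(0.60, 0.40, 0.95));
        System.out.println(classify(0.60, 0.40, 0.50));
    }
}
```

A high threshold trades missed cats and dogs for fewer wrong labels on unrelated images, which suits a model that has never seen anything else.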
After running the application, you should be able to see the view below:
Enjoy!