Implementation: Single Shot MultiBox Detector Using ResNet

Yannick Roberts, Robson Adem, and Yang Gao

Abstract

We implemented the Single Shot Detector (SSD), a method for detecting and localizing objects in images. SSD discretizes the output space of bounding boxes into a set of default boxes of varying scales and aspect ratios. A feed-forward convolutional network then scores each default box for the presence of every object category and adjusts the box to better match the object's shape. The base of the network is an architecture designed for high-quality image classification. An auxiliary structure added to it produces detections using three key features: multi-scale feature maps, convolutional predictors, and default boxes with multiple aspect ratios. SSD is straightforward to train and easy to integrate into applications because, unlike methods that rely on proposal generation followed by pixel or feature resampling, it encapsulates all computation in a single network. We implemented SSD on top of ResNet and obtained experimental results on the COCO2014 dataset (82,000 images; 82 classes). We chose ResNet18 because its residual connections carry information forward from earlier layers, allowing high-level layers to directly access low-level information. We hypothesized that this would lead to better results than VGG. After 26 epochs and close to 50 hours of training, we found that our SSD implementation localizes objects well and classifies them with reasonable accuracy.

I. INTRODUCTION

Object detection is the process of localizing and classifying objects within an image. Classical image classification entails classifying a single object within an image. While state-of-the-art classification accuracy can be achieved with simple networks like AlexNet, the use case for such networks remains limited to scenes with a single object. As the need to identify objects within arbitrary scenes grew, researchers began devising new ways of detecting and classifying objects. One of the landmark results of this research is RCNN, which builds on the region proposals introduced in Selective Search for Object Recognition. To summarize, the method uses classical image feature extraction techniques to determine specific regions of interest within a scene. These regions are then passed into a convolutional network, where they are either classified or discarded. Finally, a bounding region and label are determined and presented as the network's output. While this method generated state-of-the-art results in detecting and classifying objects within a scene, it was too slow to process scenes in real time because of the feature extraction phase. Given the slow performance, another method, Fast RCNN, was introduced to speed up RCNN by relying on the convolutional network to process image details. While performance improved, both RCNN and Fast RCNN relied on selective search to obtain potential regions of interest. Selective search is an extremely slow algorithm and significantly restricted the speed of both networks.

To remedy this situation, Wei Liu et al. [1] developed a groundbreaking network called SSD, or Single Shot MultiBox Detector, which folds region proposal into the convolutional network itself. As a result, the network can detect arbitrary objects in real time, at speeds significantly better than those of RCNN and other state-of-the-art networks of that period.

II. PROBLEM STATEMENT

Given the potential gains in speed and classification accuracy, we chose to implement the single shot multibox detector algorithm as described in the paper Single Shot MultiBox Detector [1] by Wei Liu et al.

III. PROPOSED APPROACH

A. Single Shot Multibox Detector

The paper proposes that a pre-defined network be used as the backbone for recognizing images; in the paper, VGG16 serves as the backbone network. The backbone is then modified by appending a set of 6 auxiliary convolution layers to the tail end of the final convolution layer of VGG16. Feature maps are extracted from the final 2 layers of the backbone network and from the auxiliary convolution layers, for a total of 6 feature maps. Each feature map is then divided into regions determined by the scale of the feature map, and a set of bounding boxes is defined for each region. These bounding boxes, referred to as priors within the feature space, form the foundation of the algorithm's performance (Fig. 1).

Fig. 1. SSD Priors defined in feature space.

A total of 8732 priors are established across the 6 extracted feature maps. Each prior is then passed through a predictive head, where its position is regressed with respect to the network's cost function. When the training criterion is evaluated, the Jaccard index (Fig. 2) is calculated between each prior and the ground-truth bounding boxes. Priors with the greatest overlap are considered potential matches and are passed to the localization and classification phase.
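As a sanity check on the figure of 8732, the count can be reproduced from the SSD300 configuration in [1]. The feature-map side lengths and the number of priors per location below are taken from that configuration; we assume, without confirming it here, that our ResNet18 variant uses the same layout. The names `FEATURE_MAP_SIZES`, `PRIORS_PER_LOCATION`, and `count_priors` are illustrative only.

```python
# Feature-map side lengths and priors per location, following the SSD300
# configuration in [1] (assumed here, not confirmed for the ResNet18 variant).
FEATURE_MAP_SIZES = [38, 19, 10, 5, 3, 1]
PRIORS_PER_LOCATION = [4, 6, 6, 6, 4, 4]


def count_priors(sizes, per_location):
    """Total priors = sum over feature maps of (side * side * priors per cell)."""
    return sum(side * side * k for side, k in zip(sizes, per_location))


total_priors = count_priors(FEATURE_MAP_SIZES, PRIORS_PER_LOCATION)
print(total_priors)  # 8732
```

The first feature map alone contributes 38 x 38 x 4 = 5776 priors, which is why most priors are small boxes suited to small objects.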

Fig. 2. Jaccard Index Used to calculate IoU measure.
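The matching step can be sketched in a few lines of plain Python. This is a minimal illustration rather than our training code: boxes are assumed to be `(x_min, y_min, x_max, y_max)` tuples, and the 0.5 threshold follows [1] but is an assumption with respect to our own runs. The helper names `jaccard_index` and `match_priors` are hypothetical.

```python
def jaccard_index(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    # Clamp at zero so disjoint boxes yield an empty intersection.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def match_priors(priors, gt_box, threshold=0.5):
    """Indices of priors whose overlap with gt_box exceeds the threshold (assumed 0.5)."""
    return [i for i, prior in enumerate(priors)
            if jaccard_index(prior, gt_box) > threshold]
```

For example, two 2 x 2 boxes offset by one unit in each direction share an area of 1 out of a union of 7, so their Jaccard index is 1/7 and they would not be matched at a 0.5 threshold.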

B. Localization Loss

The localization loss depends on the priors whose Jaccard index with the ground truth passed a specific threshold. A smooth L1 distance is then used as the localization loss function to regress toward the ground-truth locations and drive localization to convergence. As noted in Fig. 3, the localization loss is the averaged smooth L1 loss between the predicted offsets and the matching priors, filtered by their intersection over union with the ground truth.

Fig. 3. Localization loss relies on the smooth L1 difference between predicted boxes and associated priors.
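The averaged smooth L1 loss described above can be sketched as follows. This is a simplified illustration under stated assumptions: each matched prior contributes a 4-tuple of offsets (for example center x, center y, width, height), the targets are already encoded relative to the priors, and the sum is divided by the number of matched priors as in [1]. The function names are hypothetical.

```python
def smooth_l1(x):
    """Smooth L1: quadratic near zero, linear for |x| >= 1."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def localization_loss(pred_offsets, target_offsets):
    """Smooth L1 summed over the 4 offsets of each matched prior,
    then averaged over the number of matched priors."""
    num_matched = len(pred_offsets)
    if num_matched == 0:
        return 0.0  # no matched priors, so nothing to regress
    total = sum(
        smooth_l1(p - t)
        for pred, target in zip(pred_offsets, target_offsets)
        for p, t in zip(pred, target)
    )
    return total / num_matched
```

The quadratic region keeps gradients small when a prediction is already close to its target, while the linear region limits the influence of badly mismatched boxes.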

C. Confidence Loss

In addition to the localization loss, the network relies on the confidence loss to drive the classification of the detected objects. Because the majority of priors do not coincide with a positive prediction, the network focuses on the negative priors with the highest loss. This method, called hard negative mining, ensures that the most difficult examples among the selected priors are used to learn which classes are present in each prior. As such, the confidence loss follows Fig. 4, whereby the cross-entropy loss over the positive and hard negative elements is calculated and averaged.

Fig. 4. Confidence loss relies on training the network on the most difficult features.
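A minimal sketch of hard negative mining is given below. It assumes that label 0 denotes background, that the hardest negatives are kept at a ratio of 3 negatives per positive as in [1], and that the sum is normalized by the number of positives; none of these values is confirmed here for our own training runs. The function names are hypothetical.

```python
import math


def cross_entropy(scores, label):
    """Softmax cross entropy for one prior; scores are raw class logits."""
    m = max(scores)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum_exp - scores[label]


def confidence_loss(all_scores, labels, neg_pos_ratio=3):
    """Cross entropy over positives plus the hardest negatives,
    divided by the number of positives. Label 0 marks background."""
    losses = [cross_entropy(s, y) for s, y in zip(all_scores, labels)]
    positives = [loss for loss, y in zip(losses, labels) if y != 0]
    negatives = sorted(
        (loss for loss, y in zip(losses, labels) if y == 0), reverse=True
    )
    num_pos = len(positives)
    if num_pos == 0:
        return 0.0  # no positives, so no negatives are mined either
    hard_negatives = negatives[: neg_pos_ratio * num_pos]
    return (sum(positives) + sum(hard_negatives)) / num_pos
```

Sorting the background losses in descending order and truncating the list is what restricts training to the negatives the network currently gets most wrong.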

D. Total Loss

The total multibox loss is the sum of both the localization loss and classification loss. A coefficient λ is used to

weight the loss terms to optimize training for both classification and localization.

L_total = L_conf + λ L_loc    (1)
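As a small worked example of equation (1), the snippet below combines the final confidence and localization losses reported in Section IV (2.8 and 0.7). The weight λ = 1 is an assumption borrowed from [1]; the value used in our training is not stated here. The function name `total_loss` is illustrative.

```python
def total_loss(conf_loss, loc_loss, lam=1.0):
    """Equation (1): L_total = L_conf + lambda * L_loc."""
    return conf_loss + lam * loc_loss


# Final losses reported in Section IV; lam = 1.0 is an assumed weight.
final_total = total_loss(conf_loss=2.8, loc_loss=0.7, lam=1.0)
print(round(final_total, 2))
```

Raising λ above 1 would push training toward tighter boxes at the expense of classification, and lowering it would do the opposite.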

E. Network Modifications

In an attempt to improve the performance of the network as defined in the paper [1], we sought to integrate a ResNet18 network as the backbone instead of VGG16. Unlike VGG16, ResNet18 uses residual connections to improve information flow throughout the network. This has been shown to significantly improve network performance [2], and it should preserve small details, which would in turn improve classification performance. As such, a ResNet network pretrained on ImageNet was chosen as the backbone network for our experiment. Because its output layers did not match those of the classical VGG16 network, we modified the initial and final layer parameters to allow ResNet to interface with SSD's auxiliary network.

Fig. 5. ResNet18 was used in place of VGG net.
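To illustrate why the output layers must be adapted, the sketch below traces the spatial size of the feature maps through the ResNet18 stages. It assumes a 300 x 300 input and the commonly used ResNet18 layer parameters (a 7 x 7 stride-2 first convolution with padding 3, a 3 x 3 stride-2 max pool with padding 1, and stride-2 first convolutions in the last three stages); neither the input size nor the exact layers we tapped are stated above, so both are assumptions. The stage names `layer1` to `layer4` are illustrative labels.

```python
def conv_out(size, kernel, stride, padding):
    """Spatial output size of a convolution or pooling layer (floor mode)."""
    return (size + 2 * padding - kernel) // stride + 1


def resnet18_feature_sizes(input_size=300):
    """Side length of the feature map after each ResNet18 stage."""
    x = conv_out(input_size, kernel=7, stride=2, padding=3)  # first convolution
    x = conv_out(x, kernel=3, stride=2, padding=1)           # max pool
    sizes = {"layer1": x}                                    # stage 1 keeps the size
    for name in ("layer2", "layer3", "layer4"):
        x = conv_out(x, kernel=3, stride=2, padding=1)       # stride-2 first conv
        sizes[name] = x
    return sizes


print(resnet18_feature_sizes(300))
# {'layer1': 75, 'layer2': 38, 'layer3': 19, 'layer4': 10}
```

Under these assumptions, the last three stages yield 38 x 38, 19 x 19, and 10 x 10 maps, which line up with the three largest feature maps of the SSD300 configuration in [1]. The smaller maps would still have to come from the auxiliary layers.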

IV. EXPERIMENTS

After implementing the network as described in Single Shot MultiBox Detector [1], we trained it and compared its performance to the results reported in the paper. Table I presents the parameters chosen for training. The network was trained for 26 epochs, which took approximately 48 hours on a single Nvidia GTX 1070. The entire COCO2014 dataset was used for training and testing. This dataset consists of approximately 82,000 images spanning 82 classes. Each image contains multiple objects in a variety of scenes and poses.

Given the above training details, the training loss for the network is shown in Fig. 6. By the end of training, the classification loss had dropped from 20 to 2.8 and the localization loss from 11 to 0.7. While we had to stop at these results, there was significant room to further improve the network simply by increasing the training time.

TABLE I

HYPER PARAMETERS USED DURING NETWORK TRAINING.

Param            SSD
Epochs           26
Batch Size       -
Learning Rate    0.001

Fig. 6. Total Loss vs Training Iterations

A. Results

Given the aforementioned training parameters, the network was able to localize and classify many objects in a given image. Moreover, the network was able to identify obscured objects and provide a suitable bounding box. While the localization and classification results showed potential, the network failed to correctly classify various objects in many images. Figure 7 showcases these results. As shown, the dogs are correctly labeled, but the network incorrectly classified the cow as a horse. Similarly, the network was unable to localize the human standing among the sheep.

When compared with the original SSD network and Fast RCNN, our implementation was not competitive in mean average precision (Table II). It does, however, run faster than Fast RCNN: our network produces detections at a rate of 10 fps when run on an Intel Core i7, compared with 7 fps for Fast RCNN. We also expect more accurate localization results given more training time and a larger batch size.

TABLE II

COMPARISON RESULTS WITH SSD AND FAST RCNN.

Param    SSD(Ours)    SSD(official)    Fast RCNN
mAP      10.2         74.8             73.2
FPS      10           -                7

Fig. 7. Sample detection results from our network.

V. RELATED WORK

One major state-of-the-art method for object detection is RCNN. This method relies on a region proposal stage to define regions of interest, which are then passed into a CNN for classification. The region proposal stage usually relies on classical feature extraction algorithms to determine where regions of interest exist within an image. Determining these regions of interest is usually computationally costly. While there are faster approaches to implementing RCNN, it remains relatively slow for real-time use. Another method is YOLO (You Only Look Once). It works by first splitting the input image into a grid of cells, where each cell is responsible for predicting a bounding box if the center of an object's bounding box falls within it. Each grid cell predicts a bounding box consisting of the x and y coordinates, the width and height, and a confidence score. A class prediction is also made for each cell. Compared with the related work above, the single-shot detector for multiple categories:

• does not re-sample pixels or features for bounding box hypotheses;

• is faster than the previous state-of-the-art single-shot detector (YOLO), mainly because it eliminates bounding box proposals and the subsequent pixel or feature re-sampling stage;

• uses separate predictors (filters) for different aspect ratios and applies these filters to multiple feature maps from the later stages of the network to perform detection at multiple scales, which makes it significantly more accurate, in fact as accurate as slower techniques that perform explicit region proposals and pooling (including Faster RCNN);

• uses multiple layers for prediction at different scales, achieving high accuracy with relatively low-resolution input and further increasing detection speed.

Robust localization and classification can be achieved by using predefined bounding boxes to capture features within particular regions of an image. With enough training, classification and localization accuracy could be significantly improved. Increasing the batch size would likely also improve our network's accuracy. In contrast to VGG, ResNet implements skip connections between blocks. This allows the network to better discern features of differing categories by relying on direct information from earlier layers.

VI. CONCLUSIONS

In conclusion, we reviewed multiple papers on object detection and chose to implement the Single Shot MultiBox Detector as described in [1]. We then customized the network by replacing the VGG16 backbone described in the paper with ResNet18. To test our implementation, we trained the network on the COCO dataset for 26 epochs and tested it on a variety of images. The results, while not perfect, show relatively accurate localization and classification, and this performance could be improved with further training. As such, our network was able to perform as expected.

REFERENCES

[1] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, “SSD: Single shot multibox detector,” Lecture Notes in Computer Science, pp. 21-37, 2016.

[2] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770-778, 2016.