DeepForest and RetinaNet
In previous posts we have covered how DeepForest works, its use cases, and how to get the most out of it. As we know, DeepForest uses a deep learning object detection network to predict bounding boxes corresponding to individual trees in RGB imagery. It is built on the RetinaNet model and is designed to make training models for tree detection simpler.
So it is worth looking at RetinaNet for object detection, to understand how DeepForest actually predicts trees from image rasters.
The first question that arises is: why RetinaNet over any other model?
Many papers have addressed the problem that one-stage object detectors could not compete with two-stage detectors in terms of accuracy. RetinaNet, introduced in "Focal Loss for Dense Object Detection" (Lin et al., 2017), is a one-stage detector that overcomes this problem and outperforms the best two-stage detectors while still being fast.
One-Stage and Two-Stage Detectors
One-Stage Detectors
One-stage detectors make predictions about objects directly on a grid over the image; there is no intermediate task. They take an image as input, pass it through a number of convolutional layers, find the bounding boxes likely to contain an object, and then make the prediction in a single pass. These models use an already-trained image classifier as their backbone network. This results in a simpler and faster model, but one that lacks accuracy compared with two-stage detectors.
Two-Stage Detectors
In contrast with one-stage detectors, two-stage detectors identify the objects in the image in two stages.
The first stage is a region proposal network (RPN), which significantly reduces the number of locations likely to contain an object (sometimes called regions of interest, or ROIs). In the second stage, we therefore do not have to search every location in the image for objects, only the ones proposed by the RPN.
Two-stage detectors also use a pre-trained image classifier as the backbone network.
Sampling techniques such as Online Hard Example Mining (OHEM), or fixing the foreground-to-background ratio, are used to strike a balance between the classes.
In the second stage, classification is performed on the proposed locations and the objects are labeled based on the confidence of the model.
Two-stage detectors perform better than one-stage detectors, but they are much slower; see the sketch below.
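To make the contrast concrete, both families are available off the shelf in torchvision. This is just an illustrative sketch; the specific models chosen here are examples, not the only options.

import torchvision

# Two-stage: Faster R-CNN = region proposal network + per-ROI classification
two_stage = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)

# One-stage: RetinaNet = one dense prediction pass, no proposal stage
one_stage = torchvision.models.detection.retinanet_resnet50_fpn(pretrained=True)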
The Problem with One-Stage Detectors
In a two-stage detector, the first stage, the region proposal network, significantly reduces the number of candidate object locations, and sampling techniques then deal with the remaining class imbalance.
In a one-stage detector we end up with a very large number of candidate locations. Most of these samples are easily classified and usually contain no important information, while the hard examples, which do contain important information, are far fewer in number.
Suppose Cross-Entropy (CE) is used as the loss function, and there are 100,000 easy examples with an average loss of 0.1 each and 100 hard examples with a loss of 2.3 each. The easy examples then contribute 100,000 × 0.1 = 10,000 to the total loss, while the hard examples contribute only 100 × 2.3 = 230.
The total loss from easy examples is almost 43 times that from the hard examples. The easy class clearly dominates, so the model focuses on easy examples instead of hard ones and accuracy suffers; with this kind of imbalance, CE is not the right choice.
RetinaNet
RetinaNet is a one-stage detector that uses the focal loss, so that less loss is contributed by “easy” examples and the loss focuses on “hard” examples.
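A minimal PyTorch sketch of the focal loss shows the mechanism (this mirrors torchvision.ops.sigmoid_focal_loss, which torchvision's RetinaNet uses internally): the (1 - p_t)**gamma factor shrinks toward zero for well-classified examples, down-weighting their contribution.

import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss: cross-entropy scaled by (1 - p_t)**gamma."""
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)  # probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).sum()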
RetinaNet is a single, unified network composed of a backbone network and two task-specific subnetworks; it uses a ResNet with a Feature Pyramid Network (FPN) on top as the backbone.
The backbone, an off-the-shelf convolutional network, is responsible for computing a convolutional feature map over the entire input image.
The first subnet performs convolutional object classification on the backbone’s output; the second subnet performs convolutional bounding box regression.
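Both subnets are visible directly on torchvision's implementation; a quick inspection sketch:

import torchvision

model = torchvision.models.detection.retinanet_resnet50_fpn(pretrained=True)
print(model.head.classification_head)  # the convolutional classification subnet
print(model.head.regression_head)      # the convolutional box-regression subnet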
In the paper, the network is initialized so that the prior probability of an anchor containing an object is π = 0.01, which enabled stable learning. An earlier attempt to train the network with plain cross-entropy loss failed quickly, with the network diverging during training.
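Concretely, this is done by setting the bias of the classification subnet's final layer to b = -log((1 - π) / π); a small worked example:

import math

pi = 0.01  # prior probability that an anchor contains an object
bias = -math.log((1 - pi) / pi)
print(round(bias, 2))  # -4.6, so the predicted foreground probability starts near 0.01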
Results
The results of RetinaNet, a one-stage detector using focal loss, were significant even on the challenging COCO dataset: it beat the existing one-stage and two-stage detectors by a clear margin and delivered state-of-the-art performance.
How DeepForest Uses RetinaNet
First we load the backbone, a RetinaNet with a ResNet-50 FPN pretrained on COCO:

import torchvision

def load_backbone():
    """A torchvision RetinaNet model."""
    backbone = torchvision.models.detection.retinanet_resnet50_fpn(
        pretrained=True)
    return backbone
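Note that load_backbone returns a full RetinaNet model; create_model below reuses only its .backbone attribute (the ResNet-50 + FPN feature extractor) and builds a fresh detection head on top for our own classes.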
Then we can create our model with the following snippet:

from torchvision.models.detection.retinanet import RetinaNet

def create_model(num_classes, nms_thresh, score_thresh, backbone=None):
    """Create a RetinaNet model.

    Args:
        num_classes (int): number of classes in the model
        nms_thresh (float): non-max suppression threshold
            for intersection-over-union [0, 1]
        score_thresh (float): minimum prediction score to keep
            during prediction [0, 1]
        backbone: optional model to take the backbone from; if None,
            the pretrained backbone from load_backbone() is used

    Returns:
        model: a pytorch nn module
    """
    if not backbone:
        resnet = load_backbone()
        backbone = resnet.backbone
    model = RetinaNet(backbone=backbone, num_classes=num_classes)
    model.nms_thresh = nms_thresh
    model.score_thresh = score_thresh
    return model
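As a usage sketch, a single-class tree detector could then be created like this (the threshold values below are illustrative, not mandated):

model = create_model(num_classes=1, nms_thresh=0.05, score_thresh=0.1)
model.eval()  # ready to run prediction on RGB tiles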