
DeepForest and RetinaNet

In previous posts we have covered how DeepForest works, its use cases, and how to improve its results. As we know, DeepForest uses deep learning object detection networks to predict bounding boxes corresponding to individual trees in RGB imagery. It is built on the RetinaNet model and is designed to make training models for tree detection simpler.

So we should talk about RetinaNet for object detection, to understand how DeepForest actually predicts trees from image rasters.

The first question that arises is: why RetinaNet over any other model?

Many papers have addressed the problem that one-stage object detectors suffer from: they could not compete with two-stage detectors in terms of accuracy. RetinaNet, however, is a one-stage detector that overcomes this problem and outperforms the best two-stage detectors while still being fast.

One-Stage and Two-Stage detectors

One-Stage detectors

One-stage detectors make predictions about the objects in the image directly on a grid; there is no intermediary task. They take an image as input, pass it through a number of convolutional layers, find bounding boxes that are likely to contain objects, and then make the predictions, all in a single pass. These models use pre-trained image classifiers as their backbone network to identify the objects in the image. This results in a simpler and faster model, but one that lacks the accuracy of two-stage detectors.
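
As a minimal sketch of this single-pass behavior (assuming torchvision is installed, and using SSD purely as an illustrative one-stage model), a pretrained one-stage detector returns boxes, labels, and scores from one forward pass:

import torch
import torchvision

# Illustrative sketch: SSD is a classic one-stage detector. A single forward
# pass over the image yields boxes, labels, and scores -- no proposal stage.
model = torchvision.models.detection.ssd300_vgg16(pretrained=True)
model.eval()

image = torch.rand(3, 300, 300)  # dummy RGB image with values in [0, 1]
with torch.no_grad():
    predictions = model([image])  # list with one dict per input image
print(predictions[0]["boxes"].shape)  # (N, 4) predicted boxes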



Two-Stage detectors

In contrast with one-stage detectors, two-stage detectors use two stages to identify the objects in the image.

The first stage contains a region proposal network (RPN), which significantly reduces the number of locations that are likely to contain objects (sometimes also called Regions of Interest (ROIs)). In the second stage, we therefore don't have to search over all locations in the image to find the objects, only the ones proposed by the RPN.

Two-stage detectors also use a pre-trained image classifier as the backbone network.

Sampling techniques such as Online Hard Example Mining (OHEM), or setting a fixed foreground-to-background ratio, are also used to strike a balance between the classes.

In the second stage, classification is performed on the proposed locations and the objects are labeled based on the model's confidence.

Two-stage detectors perform better than one-stage detectors, but they are very slow in comparison.
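
For comparison, here is a minimal sketch (again assuming torchvision) of Faster R-CNN, a classic two-stage detector, where the two stages are visible as separate modules:

import torchvision

# Illustrative sketch: Faster R-CNN is a two-stage detector. Stage one is the
# region proposal network; stage two classifies and refines the proposals.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
print(model.rpn)        # stage one: region proposal network (RPN)
print(model.roi_heads)  # stage two: classification and box regression heads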

Problem with one-stage detectors

In a two-stage detector, the first stage, that is, the region proposal network, significantly reduces the number of candidate object locations, and sampling techniques then deal with the remaining class imbalance.

In one-stage detectors we end up with a large number of candidate locations. Most of them are easily classified and usually contain no important information, while the hard examples, which do contain important information, are far fewer in number.

Suppose Cross-Entropy (CE) is used as the loss function, and there are 100k easy examples with an average loss of 0.1 each and 100 hard examples with a loss of 2 each. The easy examples clearly dominate the total loss, so the model focuses on the easy examples instead of the hard ones and thus suffers in accuracy.

The total loss from the easy examples is 50 times that from the hard examples, so there is a huge class imbalance and plain CE is not the right choice.
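
A quick back-of-the-envelope check of those numbers:

# Total loss contributed by each group in the example above
easy_total = 100_000 * 0.1  # 10,000.0
hard_total = 100 * 2.0      # 200.0
print(easy_total / hard_total)  # 50.0 -- easy examples dominate the loss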

RetinaNet

RetinaNet is a one-stage detector that uses the focal loss, so that less of the loss is contributed by "easy" examples and the loss focuses on "hard" examples.
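
A minimal PyTorch sketch of the binary focal loss described in the paper (torchvision also ships an equivalent as torchvision.ops.sigmoid_focal_loss):

import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss: FL(p_t) = -alpha_t * (1 - p_t)**gamma * log(p_t).

    The (1 - p_t)**gamma factor shrinks the loss of well-classified (easy)
    examples, so training focuses on the hard ones.
    """
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)  # probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).sum()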

As shown in the figure, RetinaNet is a single, unified network composed of a backbone network and two task-specific subnetworks. RetinaNet uses a ResNet with a Feature Pyramid Network (FPN) on top as the backbone.



The backbone is responsible for computing a convolutional feature map over the entire input image and is an off-the-shelf convolutional network.

The first subnet performs convolutional object classification on the backbone’s output; the second subnet performs convolutional bounding box regression.
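
In torchvision's implementation these two subnets are visible directly on the model (a small sketch, assuming torchvision):

import torchvision

# The RetinaNet head bundles the two task-specific subnets.
model = torchvision.models.detection.retinanet_resnet50_fpn(pretrained=True)
print(model.head.classification_head)  # convolutional object classification subnet
print(model.head.regression_head)      # convolutional bounding box regression subnet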

The first attempt was to train the network using the standard cross-entropy loss, but this failed quickly, with the network diverging during training. Initializing the network so that the prior probability of predicting an object at the start of training is a small value π (0.01 in the paper) fixed this and enabled stable learning.
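
The paper implements this prior by initializing the bias of the classification subnet's final layer; a quick sketch of the arithmetic:

import math

# Bias init from the paper: b = -log((1 - pi) / pi), so that sigmoid(b) = pi.
pi = 0.01
bias = -math.log((1 - pi) / pi)
print(bias)                       # about -4.6
print(1 / (1 + math.exp(-bias)))  # about 0.01 -- the desired prior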

Results

The results of RetinaNet, a one-stage detector using the focal loss, were significant even on the challenging COCO dataset: it beat the existing one-stage and two-stage detectors by a significant margin and delivered state-of-the-art performance.

How DeepForest uses RetinaNet

First we have to load the ResNet-50 backbone:

import torchvision

def load_backbone():
    """Load a torchvision RetinaNet model with a ResNet-50 FPN backbone."""
    backbone = torchvision.models.detection.retinanet_resnet50_fpn(
        pretrained=True)
    return backbone


Then we can create our model with the following snippet:

from torchvision.models.detection.retinanet import RetinaNet

def create_model(num_classes, nms_thresh, score_thresh, backbone=None):
    """Create a RetinaNet model.

    Args:
        num_classes (int): number of classes in the model
        nms_thresh (float): non-max suppression threshold for
            intersection-over-union [0, 1]
        score_thresh (float): minimum prediction score to keep
            during prediction [0, 1]

    Returns:
        model: a pytorch nn module
    """
    if not backbone:
        resnet = load_backbone()
        backbone = resnet.backbone
    model = RetinaNet(backbone=backbone, num_classes=num_classes)
    model.nms_thresh = nms_thresh
    model.score_thresh = score_thresh

    return model
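
A hypothetical usage sketch (the thresholds here are illustrative, not necessarily DeepForest's defaults):

import torch

# Build a single-class (tree) detector and run it on a dummy image tile.
model = create_model(num_classes=1, nms_thresh=0.05, score_thresh=0.1)
model.eval()

tile = torch.rand(3, 400, 400)  # dummy 400px RGB crop with values in [0, 1]
with torch.no_grad():
    predictions = model([tile])
print(predictions[0]["boxes"])  # predicted tree bounding boxes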


