
DeepForest and RetinaNet

In previous posts we have covered how DeepForest works, its use cases, and how to improve its results. As we know, DeepForest uses deep learning object detection networks to predict bounding boxes corresponding to individual trees in RGB imagery. It is built on the RetinaNet model and is designed to make training models for tree detection simpler.

So we should talk about RetinaNet for object detection, to understand how DeepForest actually predicts trees from image rasters.

The first question that arises is: why RetinaNet over any other model?

Many papers have addressed the problem that one-stage object detectors suffer from: they could not compete with two-stage detectors in terms of accuracy. RetinaNet, however, is a one-stage detector that overcomes this problem and outperforms the best two-stage detectors while still being fast.

One-Stage and Two-Stage detectors

One-Stage detectors

One-stage detectors make predictions about the objects in the image directly on a grid; there is no intermediary task. They take an image as input, pass it through a number of convolutional layers, find bounding boxes that are likely to contain objects, and then make the predictions, all in a single pass. These models use pre-trained image classifiers as their backbone network to identify the objects in the image. This results in a simpler and faster model, but one that lacks the accuracy of two-stage detectors.
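
As a minimal sketch of this single-pass behavior (assuming torchvision is installed, and using SSD purely as an illustrative one-stage model), a pretrained one-stage detector returns boxes, labels, and scores from one forward pass:

import torch
import torchvision

# Illustrative sketch: SSD is a classic one-stage detector. A single forward
# pass over the image yields boxes, labels, and scores -- no proposal stage.
model = torchvision.models.detection.ssd300_vgg16(pretrained=True)
model.eval()

image = torch.rand(3, 300, 300)  # dummy RGB image with values in [0, 1]
with torch.no_grad():
    predictions = model([image])  # list with one dict per input image
print(predictions[0]["boxes"].shape)  # (N, 4) predicted boxes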



Two-Stage detectors

In contrast with one-stage detectors, two-stage detectors use two stages to identify the objects in the image.

The first stage contains a region proposal network (RPN), which significantly reduces the number of locations that are likely to contain objects (sometimes also called Regions of Interest (ROIs)). In the second stage, we therefore don't have to search over all locations in the image to find the objects, only the ones proposed by the RPN.

Two-stage detectors also use a pre-trained image classifier as the backbone network.

Sampling techniques such as Online Hard Example Mining (OHEM), or setting a fixed foreground-to-background ratio, are also used to strike a balance between the classes.

In the second stage, classification is performed on the proposed locations and the objects are labeled based on the model's confidence.

Two-stage detectors perform better than one-stage detectors, but they are very slow in comparison.
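
For comparison, here is a minimal sketch (again assuming torchvision) of Faster R-CNN, a classic two-stage detector, where the two stages are visible as separate modules:

import torchvision

# Illustrative sketch: Faster R-CNN is a two-stage detector. Stage one is the
# region proposal network; stage two classifies and refines the proposals.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
print(model.rpn)        # stage one: region proposal network (RPN)
print(model.roi_heads)  # stage two: classification and box regression heads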

Problem with one-stage detectors

In a two-stage detector, the first stage, that is, the region proposal network, significantly reduces the number of candidate object locations, and sampling techniques then deal with the remaining class imbalance.

In one-stage detectors we end up with a large number of candidate locations. Most of them are easily classified and usually contain no important information, while the hard examples, which do contain important information, are far fewer in number.

Suppose Cross-Entropy (CE) is used as the loss function, and there are 100k easy examples with an average loss of 0.1 each and 100 hard examples with a loss of 2 each. The easy examples clearly dominate the total loss, so the model focuses on the easy examples instead of the hard ones and thus suffers in accuracy.

The total loss from the easy examples is 50 times that from the hard examples, so there is a huge class imbalance and plain CE is not the right choice.
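
A quick back-of-the-envelope check of those numbers:

# Total loss contributed by each group in the example above
easy_total = 100_000 * 0.1  # 10,000.0
hard_total = 100 * 2.0      # 200.0
print(easy_total / hard_total)  # 50.0 -- easy examples dominate the loss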

RetinaNet

RetinaNet is a one-stage detector that uses the focal loss, so that less of the loss is contributed by "easy" examples and the loss focuses on "hard" examples.
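
A minimal PyTorch sketch of the binary focal loss described in the paper (torchvision also ships an equivalent as torchvision.ops.sigmoid_focal_loss):

import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss: FL(p_t) = -alpha_t * (1 - p_t)**gamma * log(p_t).

    The (1 - p_t)**gamma factor shrinks the loss of well-classified (easy)
    examples, so training focuses on the hard ones.
    """
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)  # probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).sum()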

As shown in the figure, RetinaNet is a single, unified network composed of a backbone network and two task-specific subnetworks. RetinaNet uses a ResNet with a Feature Pyramid Network (FPN) on top as the backbone.



The backbone is responsible for computing a convolutional feature map over the entire input image and is an off-the-shelf convolutional network.

The first subnet performs convolutional object classification on the backbone’s output; the second subnet performs convolutional bounding box regression.
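
In torchvision's implementation these two subnets are visible directly on the model (a small sketch, assuming torchvision):

import torchvision

# The RetinaNet head bundles the two task-specific subnets.
model = torchvision.models.detection.retinanet_resnet50_fpn(pretrained=True)
print(model.head.classification_head)  # convolutional object classification subnet
print(model.head.regression_head)      # convolutional bounding box regression subnet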

The first attempt was to train the network using the standard cross-entropy loss, but this failed quickly, with the network diverging during training. Initializing the network so that the prior probability of predicting an object at the start of training is a small value π (0.01 in the paper) fixed this and enabled stable learning.
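
The paper implements this prior by initializing the bias of the classification subnet's final layer; a quick sketch of the arithmetic:

import math

# Bias init from the paper: b = -log((1 - pi) / pi), so that sigmoid(b) = pi.
pi = 0.01
bias = -math.log((1 - pi) / pi)
print(bias)                       # about -4.6
print(1 / (1 + math.exp(-bias)))  # about 0.01 -- the desired prior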

Results

The results of RetinaNet, a one-stage detector using the focal loss, were significant even on the challenging COCO dataset: it beat the existing one-stage and two-stage detectors by a significant margin and delivered state-of-the-art performance.

How DeepForest uses RetinaNet

First we have to load the ResNet-50 backbone:

import torchvision

def load_backbone():
    """Load a torchvision RetinaNet model with a ResNet-50 FPN backbone."""
    backbone = torchvision.models.detection.retinanet_resnet50_fpn(
        pretrained=True)
    return backbone


Then we can create our model with the following snippet:

from torchvision.models.detection.retinanet import RetinaNet

def create_model(num_classes, nms_thresh, score_thresh, backbone=None):
    """Create a RetinaNet model.

    Args:
        num_classes (int): number of classes in the model
        nms_thresh (float): non-max suppression threshold for
            intersection-over-union [0, 1]
        score_thresh (float): minimum prediction score to keep
            during prediction [0, 1]

    Returns:
        model: a pytorch nn module
    """
    if not backbone:
        resnet = load_backbone()
        backbone = resnet.backbone
    model = RetinaNet(backbone=backbone, num_classes=num_classes)
    model.nms_thresh = nms_thresh
    model.score_thresh = score_thresh

    return model
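
A hypothetical usage sketch (the thresholds here are illustrative, not necessarily DeepForest's defaults):

import torch

# Build a single-class (tree) detector and run it on a dummy image tile.
model = create_model(num_classes=1, nms_thresh=0.05, score_thresh=0.1)
model.eval()

tile = torch.rand(3, 400, 400)  # dummy 400px RGB crop with values in [0, 1]
with torch.no_grad():
    predictions = model([tile])
print(predictions[0]["boxes"])  # predicted tree bounding boxes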


