fai-rtdetr-l-coco#

Overview#

This model is FocoosAI's reimplementation of RT-DETR, trained on the COCO dataset. It is an object detection model able to detect 80 thing classes (dog, cat, car, etc.).

Benchmark#

Note: FPS are computed on an NVIDIA T4 GPU using TensorRT with an image size of 640x640.

Model Details#

The model is based on RT-DETR, an object detection model that uses a transformer-based encoder-decoder architecture.

Neural Network Architecture#

This is FocoosAI's reimplementation of the RT-DETR model. The original model is fully described in this paper.

RT-DETR is a hybrid model that uses three main components: a backbone for extracting features, an encoder for upscaling the features, and a transformer-based decoder for generating the detection output.

(Figure: RT-DETR architecture overview)

In this implementation:

  • the backbone is a ResNet-50, which offers good accuracy while remaining efficient.
  • the encoder is the Hybrid Encoder proposed in the paper: a BiFPN (bidirectional feature pyramid network) that includes a transformer encoder on the lowest-resolution feature map to improve efficiency.
  • the query selection mechanism selects the features of the pixels (a.k.a. queries) with the highest probability of containing an object and passes them to a transformer decoder head that generates the final detection output. In this implementation, we select 300 queries and use 6 transformer decoder layers; see the sketch after this list.
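For intuition, here is a minimal PyTorch-style sketch of the top-k query selection step. All names (select_queries, encoder_features, class_head) are hypothetical illustrations and do not reflect the actual Focoos implementation:

import torch

def select_queries(encoder_features, class_head, num_queries=300):
    # encoder_features: (batch, num_pixels, hidden_dim) flattened feature map.
    # class_head: a torch.nn.Linear scoring every pixel for every class.

    # Score each pixel; its "objectness" is its best class score.
    scores = class_head(encoder_features).max(dim=-1).values  # (batch, num_pixels)

    # Keep the top-k scoring pixels as decoder queries.
    topk = scores.topk(num_queries, dim=1).indices  # (batch, num_queries)
    batch_idx = torch.arange(encoder_features.size(0)).unsqueeze(-1)
    return encoder_features[batch_idx, topk]  # (batch, num_queries, hidden_dim)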

Losses#

We use the same losses as the original paper:

  • loss_vfl: a variant of binary cross-entropy for classification (varifocal loss), where the classification target is weighted by the IoU between the predicted and ground-truth bounding boxes.
  • loss_bbox: an L1 loss computing the distance between the predicted bounding boxes and the ground truth bounding boxes.
  • loss_giou: a loss maximizing the generalized IoU between the predicted and the ground-truth bounding boxes. For more details, look at GIoU.

These losses are applied to each output of the transformer decoder, meaning we apply them both to the final output and to each auxiliary output of the intermediate transformer decoder layers. Please refer to the RT-DETR paper for more details.
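For illustration, here is a simplified sketch of how the three terms could be combined for already-matched prediction/target pairs. Hungarian matching and the exact varifocal weighting are omitted, and the loss weights shown are assumptions, not the paper's exact values:

import torch
import torch.nn.functional as F
from torchvision.ops import box_iou, generalized_box_iou

def detection_loss(pred_logits, pred_boxes, gt_classes, gt_boxes,
                   w_vfl=1.0, w_bbox=5.0, w_giou=2.0):
    # pred_logits: (N, num_classes); pred_boxes, gt_boxes: (N, 4) as [x1, y1, x2, y2];
    # gt_classes: (N,) integer labels. Pairs are assumed already matched.

    # IoU of each matched pair, used to soften the classification target.
    iou = torch.diag(box_iou(pred_boxes, gt_boxes))

    # loss_vfl (simplified): BCE against IoU-weighted one-hot targets.
    target = F.one_hot(gt_classes, pred_logits.shape[-1]).float() * iou.unsqueeze(-1)
    loss_vfl = F.binary_cross_entropy_with_logits(pred_logits, target)

    # loss_bbox: L1 distance between predicted and ground-truth boxes.
    loss_bbox = F.l1_loss(pred_boxes, gt_boxes)

    # loss_giou: 1 - GIoU, so better overlap gives a lower loss.
    giou = torch.diag(generalized_box_iou(pred_boxes, gt_boxes))
    loss_giou = (1.0 - giou).mean()

    return w_vfl * loss_vfl + w_bbox * loss_bbox + w_giou * loss_giou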

Output Format#

The raw output of the model, before post-processing, is a set of bounding boxes with associated class probabilities. In particular, the output is composed of three tensors:

  • class_ids: a tensor of 300 elements containing the class id associated with each bounding box (such as 1 for person, 2 for bicycle, etc.)
  • scores: a tensor of 300 elements containing the corresponding probability of the class_id
  • boxes: a tensor of shape (300, 4) where the values represent the coordinates of the bounding boxes in the format [x1, y1, x2, y2]

The model does not need NMS (non-maximum suppression): it directly predicts a fixed set of bounding boxes with associated class probabilities and has been trained to avoid duplicate, overlapping predictions.

After post-processing, the output is a Focoos Detections object containing the predicted bounding boxes with confidence greater than a specified threshold (0.5 by default).
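As a sketch, the confidence thresholding amounts to a simple mask over the three tensors above (NumPy; the function and variable names are illustrative, not the Focoos API):

import numpy as np

def filter_detections(class_ids, scores, boxes, threshold=0.5):
    # class_ids: (300,), scores: (300,), boxes: (300, 4) as [x1, y1, x2, y2].
    keep = scores > threshold  # boolean mask over the 300 candidate detections
    return class_ids[keep], scores[keep], boxes[keep]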

Classes#

The model is pretrained on the COCO dataset with 80 classes.

ID Class AP
1 person 63.2
2 bicycle 40.5
3 car 52.3
4 motorcycle 55.0
5 airplane 76.3
6 bus 74.9
7 train 75.0
8 truck 47.9
9 boat 36.6
10 traffic light 32.6
11 fire hydrant 75.5
12 stop sign 71.2
13 parking meter 54.6
14 bench 34.9
15 bird 46.6
16 cat 79.8
17 dog 75.4
18 horse 69.7
19 sheep 63.0
20 cow 68.8
21 elephant 74.1
22 bear 83.2
23 zebra 78.3
24 giraffe 76.9
25 backpack 25.1
26 umbrella 53.8
27 handbag 24.3
28 tie 44.8
29 suitcase 52.6
30 frisbee 75.3
31 skis 37.2
32 snowboard 50.8
33 sports ball 53.9
34 kite 54.8
35 baseball bat 53.2
36 baseball glove 45.3
37 skateboard 63.7
38 surfboard 50.3
39 tennis racket 61.1
40 bottle 48.8
41 wine glass 44.1
42 cup 53.4
43 fork 51.3
44 knife 34.1
45 spoon 33.5
46 bowl 52.1
47 banana 33.0
48 apple 27.1
49 sandwich 48.1
50 orange 37.9
51 broccoli 28.9
52 carrot 28.2
53 hot dog 50.3
54 pizza 62.5
55 donut 62.3
56 cake 47.5
57 chair 41.2
58 couch 57.3
59 potted plant 36.0
60 bed 58.5
61 dining table 39.4
62 toilet 72.6
63 tv 65.8
64 laptop 73.1
65 mouse 67.1
66 remote 48.2
67 keyboard 63.0
68 cell phone 46.0
69 microwave 64.6
70 oven 44.9
71 toaster 50.4
72 sink 45.7
73 refrigerator 69.4
74 book 22.3
75 clock 59.2
76 vase 45.7
77 scissors 42.4
78 teddy bear 59.5
79 hair drier 35.1
80 toothbrush 42.0

What are you waiting for? Try it!#

import os

from focoos import Focoos

# Initialize the Focoos client with your API key
focoos = Focoos(api_key=os.getenv("FOCOOS_API_KEY"))

# Get the remote model (fai-rtdetr-l-coco) from Focoos API
model = focoos.get_remote_model("fai-rtdetr-l-coco")

# Run inference on an image
predictions = model.infer("./image.jpg", threshold=0.5)

# Output the predictions
print(predictions)