fai-rtdetr-n-coco#

Overview#

The model is an RT-DETR model optimized by FocoosAI for the COCO dataset. It is an object detection model able to detect 80 thing classes (dog, cat, car, etc.).

Benchmark#

[Figure: Benchmark Comparison]

Note: FPS are computed on an NVIDIA T4 using TensorRT with an image size of 640x640.

Model Details#

The model is based on the RT-DETR architecture. It is an object detection model that uses a transformer-based encoder-decoder architecture.

Neural Network Architecture#

The FocoosAI RT-DETR implementation optimizes the original neural network architecture to improve the model's efficiency and performance. The original model is fully described in the RT-DETR paper.

RT-DETR is a hybrid model that uses three main components: a backbone for extracting features, an encoder for upscaling the features, and a transformer-based decoder for generating the detection output.

In this implementation:

  • the backbone is STDC-1, which is very fast while maintaining satisfactory accuracy.
  • the encoder is a bi-FPN (bidirectional feature pyramid network). With respect to the original paper, we removed the attention modules in the encoder and reduced the internal feature dimension, speeding up inference while only marginally affecting accuracy.
  • the transformer decoder is a lighter version of the original, with only 3 decoder layers instead of 6, and it uses 300 queries.
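The choices listed above can be summarized in a small configuration object. This is only an illustrative sketch: `DetectorConfig` and its field names are hypothetical and are not part of the Focoos library, and the raw head shapes assume the usual DETR-style layout of one class vector and one box per query.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DetectorConfig:
    """Hypothetical summary of the architecture described above."""

    backbone: str = "STDC-1"          # feature extractor
    encoder: str = "bi-FPN"           # feature pyramid encoder
    encoder_attention: bool = False   # attention modules removed from the encoder
    num_decoder_layers: int = 3       # the original model uses 6
    num_queries: int = 300            # one candidate detection per query
    num_classes: int = 80             # COCO thing classes

    def raw_output_shapes(self):
        # Assumption: one class vector and one box per query.
        return {
            "logits": (self.num_queries, self.num_classes),
            "boxes": (self.num_queries, 4),
        }

    def num_auxiliary_outputs(self):
        # Every decoder layer except the last yields an auxiliary output.
        return self.num_decoder_layers - 1


config = DetectorConfig()
print(config.raw_output_shapes())
print(config.num_auxiliary_outputs())
```

With 3 decoder layers there are 2 auxiliary outputs in addition to the final one, which matters for how the training losses are applied.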

Losses#

We use the same losses as the original paper:

  • loss_vfl: a variant of the binary cross-entropy classification loss, weighted by the IoU between the predicted and the ground truth bounding boxes.
  • loss_bbox: an L1 loss computing the distance between the predicted bounding boxes and the ground truth bounding boxes.
  • loss_giou: a loss that maximizes the generalized IoU (GIoU) between the predicted bounding boxes and the ground truth bounding boxes. For more details, see: GIoU.
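The three losses listed above can be sketched for a single prediction in plain Python. This is a minimal illustration, not the training code: the function names are hypothetical, boxes are assumed to be in [x1, y1, x2, y2] format, and the `alpha` and `gamma` values are the commonly used varifocal loss defaults, taken here as assumptions.

```python
import math


def box_area(box):
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def giou(pred, target):
    """Generalized IoU of two [x1, y1, x2, y2] boxes, in the range (-1, 1]."""
    inter_w = max(0.0, min(pred[2], target[2]) - max(pred[0], target[0]))
    inter_h = max(0.0, min(pred[3], target[3]) - max(pred[1], target[1]))
    inter = inter_w * inter_h
    union = box_area(pred) + box_area(target) - inter
    iou = inter / union if union > 0 else 0.0
    # Smallest box enclosing both inputs.
    enclose_w = max(pred[2], target[2]) - min(pred[0], target[0])
    enclose_h = max(pred[3], target[3]) - min(pred[1], target[1])
    enclose = enclose_w * enclose_h
    if enclose <= 0:
        return iou
    return iou - (enclose - union) / enclose


def giou_loss(pred, target):
    # Minimizing this loss maximizes the GIoU.
    return 1.0 - giou(pred, target)


def l1_loss(pred, target):
    return sum(abs(p - t) for p, t in zip(pred, target))


def varifocal_loss(logit, iou_target, alpha=0.75, gamma=2.0):
    """Binary cross entropy against an IoU-valued target (0 for negatives)."""
    p = 1.0 / (1.0 + math.exp(-logit))
    # Positives are weighted by their IoU, negatives by alpha * p ** gamma.
    weight = iou_target if iou_target > 0 else alpha * p ** gamma
    bce = -(iou_target * math.log(p) + (1.0 - iou_target) * math.log(1.0 - p))
    return weight * bce


pred_box, gt_box = [0.0, 0.0, 1.0, 1.0], [2.0, 0.0, 3.0, 1.0]
print(giou_loss(pred_box, gt_box), l1_loss(pred_box, gt_box))
```

Unlike the plain IoU, the GIoU stays informative for boxes that do not overlap at all, which is why the disjoint pair above still produces a meaningful loss value.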

These losses are applied to each output of the transformer decoder, meaning that we apply them to the final output and to each auxiliary output of the transformer decoder layers. Please refer to the RT-DETR paper for more details.
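The way the losses accumulate over the decoder outputs can be sketched as follows. The helper `total_loss`, the loss values, and the weights are all illustrative assumptions; the actual weighting used in training is not stated here.

```python
def total_loss(per_output_losses, weights):
    """Sum the weighted losses over every decoder output.

    per_output_losses: one dict per decoder output (auxiliary outputs first,
    final output last), mapping loss name to its value.
    weights: dict mapping loss name to a scalar weight (assumed values).
    """
    total = 0.0
    for losses in per_output_losses:
        for name, value in losses.items():
            total += weights[name] * value
    return total


# Illustrative weights, not taken from the model's training configuration.
weights = {"loss_vfl": 1.0, "loss_bbox": 5.0, "loss_giou": 2.0}

# 3 decoder layers: 2 auxiliary outputs plus the final output.
per_output_losses = [
    {"loss_vfl": 0.5, "loss_bbox": 0.1, "loss_giou": 0.25},
    {"loss_vfl": 0.5, "loss_bbox": 0.1, "loss_giou": 0.25},
    {"loss_vfl": 0.5, "loss_bbox": 0.1, "loss_giou": 0.25},
]

print(total_loss(per_output_losses, weights))
```

Supervising the intermediate outputs in this way gives each decoder layer a direct training signal, not only the last one.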

Output Format#

Before post-processing, the output of the model is a set of bounding boxes with associated class probabilities. In particular, the output is composed of three tensors:

  • class_ids: a tensor of 300 elements containing the class id associated with each bounding box (such as 1 for person, 2 for bicycle, etc.)
  • scores: a tensor of 300 elements containing the probability of the corresponding class_id
  • boxes: a tensor of shape (300, 4) where the values represent the coordinates of the bounding boxes in the format [x1, y1, x2, y2]

The model does not need NMS (non-maximum suppression): it directly predicts a set of bounding boxes with associated class probabilities and has been trained to avoid overlapping, duplicate detections.

After post-processing, the output is a Focoos Detections object containing the predicted bounding boxes with confidence greater than a specific threshold (0.5 by default).
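The thresholding step can be sketched in plain Python on the three raw outputs. The function `filter_detections` and the dictionary layout of its result are hypothetical and only illustrate the idea; they do not reproduce the Focoos Detections object, and plain lists stand in for tensors.

```python
def filter_detections(class_ids, scores, boxes, threshold=0.5):
    """Keep only the predictions whose score is greater than the threshold."""
    kept = []
    for class_id, score, box in zip(class_ids, scores, boxes):
        if score > threshold:
            kept.append({"class_id": class_id, "score": score, "box": box})
    return kept


# Three example predictions instead of the full 300, for brevity.
class_ids = [1, 17, 3]
scores = [0.92, 0.40, 0.51]
boxes = [
    [10.0, 20.0, 110.0, 220.0],
    [30.0, 40.0, 80.0, 90.0],
    [200.0, 50.0, 320.0, 140.0],
]

detections = filter_detections(class_ids, scores, boxes)
for det in detections:
    print(det)
```

No NMS step follows the filtering: the predictions that pass the threshold are used as they are.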

Classes#

The model is pretrained on the COCO dataset with 80 classes.

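| Class ID | Class Name | AP |
|---|---|---|
| 1 | person | 53.3 |
| 2 | bicycle | 28.0 |
| 3 | car | 40.5 |
| 4 | motorcycle | 42.6 |
| 5 | airplane | 67.8 |
| 6 | bus | 65.0 |
| 7 | train | 63.7 |
| 8 | truck | 34.7 |
| 9 | boat | 27.4 |
| 10 | traffic light | 25.0 |
| 11 | fire hydrant | 63.9 |
| 12 | stop sign | 62.1 |
| 13 | parking meter | 46.6 |
| 14 | bench | 23.1 |
| 15 | bird | 35.0 |
| 16 | cat | 70.6 |
| 17 | dog | 65.8 |
| 18 | horse | 54.2 |
| 19 | sheep | 52.7 |
| 20 | cow | 56.5 |
| 21 | elephant | 64.0 |
| 22 | bear | 72.9 |
| 23 | zebra | 69.7 |
| 24 | giraffe | 68.1 |
| 25 | backpack | 12.1 |
| 26 | umbrella | 37.1 |
| 27 | handbag | 11.9 |
| 28 | tie | 31.3 |
| 29 | suitcase | 40.2 |
| 30 | frisbee | 66.2 |
| 31 | skis | 22.4 |
| 32 | snowboard | 27.6 |
| 33 | sports ball | 42.7 |
| 34 | kite | 44.9 |
| 35 | baseball bat | 24.8 |
| 36 | baseball glove | 33.4 |
| 37 | skateboard | 49.1 |
| 38 | surfboard | 34.9 |
| 39 | tennis racket | 43.8 |
| 40 | bottle | 34.3 |
| 41 | wine glass | 30.7 |
| 42 | cup | 38.6 |
| 43 | fork | 32.2 |
| 44 | knife | 15.4 |
| 45 | spoon | 15.1 |
| 46 | bowl | 38.1 |
| 47 | banana | 26.0 |
| 48 | apple | 18.8 |
| 49 | sandwich | 36.6 |
| 50 | orange | 30.6 |
| 51 | broccoli | 23.6 |
| 52 | carrot | 22.2 |
| 53 | hot dog | 31.9 |
| 54 | pizza | 53.9 |
| 55 | donut | 45.7 |
| 56 | cake | 34.7 |
| 57 | chair | 26.0 |
| 58 | couch | 44.1 |
| 59 | potted plant | 24.5 |
| 60 | bed | 46.2 |
| 61 | dining table | 28.7 |
| 62 | toilet | 60.6 |
| 63 | tv | 56.0 |
| 64 | laptop | 58.3 |
| 65 | mouse | 58.4 |
| 66 | remote | 27.6 |
| 67 | keyboard | 51.6 |
| 68 | cell phone | 32.6 |
| 69 | microwave | 56.1 |
| 70 | oven | 34.4 |
| 71 | toaster | 45.6 |
| 72 | sink | 35.6 |
| 73 | refrigerator | 53.8 |
| 74 | book | 12.6 |
| 75 | clock | 48.9 |
| 76 | vase | 33.9 |
| 77 | scissors | 26.9 |
| 78 | teddy bear | 45.1 |
| 79 | hair drier | 10.0 |
| 80 | toothbrush | 26.3 |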

What are you waiting for? Try it!#

from focoos import Focoos
import os

# Initialize the Focoos client with your API key
focoos = Focoos(api_key=os.getenv("FOCOOS_API_KEY"))

# Get the remote model (fai-rtdetr-n-coco) from Focoos API
model = focoos.get_remote_model("fai-rtdetr-n-coco")

# Run inference on an image
predictions = model.infer("./image.jpg", threshold=0.5)

# Output the predictions
print(predictions)