fai-rtdetr-s-coco#

Overview#

This model is an RT-DETR model optimized by FocoosAI for the COCO dataset. It is an object detection model that detects the 80 COCO "thing" classes (dog, cat, car, etc.).

Benchmark#

Benchmark Comparison

Note: FPS is measured on an NVIDIA T4 GPU using TensorRT with an image size of 640x640.

Model Details#

The model is based on the RT-DETR architecture. It is an object detection model that uses a transformer-based encoder-decoder architecture.

Neural Network Architecture#

The FocoosAI implementation of RT-DETR optimizes the original neural network architecture to improve the model's efficiency and performance. The original model is fully described in the RT-DETR paper.

RT-DETR is a hybrid model that uses three main components: a backbone for extracting features, an encoder for upscaling the features, and a transformer-based decoder for generating the detection output.

In this implementation:

the backbone is STDC-2, which offers a strong trade-off between performance and efficiency.
the encoder is a bi-FPN (bilinear feature pyramid network). Compared with the original paper, we removed the attention modules in the encoder and reduced the internal feature dimension, which speeds up inference while only marginally affecting accuracy.
the transformer decoder is a lighter version of the original, with only 3 decoder layers instead of 6, and it uses 300 queries.

Losses#

We use the same losses as the original paper:

loss_vfl: a variant of the binary cross-entropy loss for classification, in which each term is weighted by the IoU between the predicted and ground-truth bounding boxes.
loss_bbox: an L1 loss computing the distance between the predicted bounding boxes and the ground truth bounding boxes.
loss_giou: a loss that maximizes the generalized IoU between the predicted bounding boxes and the ground truth bounding boxes (equivalently, it minimizes 1 - GIoU). For more details, see GIoU.
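As an illustration of the three terms listed above, here is a minimal pure-Python sketch for a single prediction. It is not the FocoosAI training code. The function names, the [x1, y1, x2, y2] box format, and the `alpha` and `gamma` defaults of the classification term are assumptions made for the example; the training pipeline may use a different box parameterization and normalization.

```python
import math


def l1_box_loss(pred, target):
    """Sum of absolute coordinate differences between two boxes."""
    return sum(abs(p - t) for p, t in zip(pred, target))


def box_area(box):
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def iou_and_giou(a, b):
    """Return (IoU, GIoU) for two boxes in [x1, y1, x2, y2] format."""
    # Intersection rectangle (zero area if the boxes do not overlap).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = box_area(a) + box_area(b) - inter
    iou = inter / union if union > 0 else 0.0
    # Smallest axis-aligned box enclosing both inputs.
    ex1, ey1 = min(a[0], b[0]), min(a[1], b[1])
    ex2, ey2 = max(a[2], b[2]), max(a[3], b[3])
    enclose = (ex2 - ex1) * (ey2 - ey1)
    giou = iou - (enclose - union) / enclose if enclose > 0 else iou
    return iou, giou


def giou_loss(pred, target):
    """1 - GIoU: zero for identical boxes, larger for worse overlap."""
    return 1.0 - iou_and_giou(pred, target)[1]


def varifocal_loss(p, q, alpha=0.75, gamma=2.0):
    """IoU-weighted binary cross-entropy for one class probability.

    p: predicted probability; q: target score (the IoU with the matched
    ground-truth box for a positive, 0 for a negative).
    alpha and gamma are assumed values, not taken from this page.
    """
    eps = 1e-7
    p = min(max(p, eps), 1.0 - eps)
    if q > 0:
        return -q * (q * math.log(p) + (1.0 - q) * math.log(1.0 - p))
    return -alpha * (p ** gamma) * math.log(1.0 - p)


pred_box, gt_box = [0.0, 0.0, 2.0, 2.0], [1.0, 1.0, 3.0, 3.0]
iou, _ = iou_and_giou(pred_box, gt_box)
print(l1_box_loss(pred_box, gt_box), giou_loss(pred_box, gt_box))
print(varifocal_loss(p=0.6, q=iou))
```

Unlike plain IoU, the GIoU term still provides a useful signal when the boxes do not overlap at all, because the enclosing-box penalty grows as the boxes move apart.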

These losses are applied to every output of the transformer decoder: the final output and the auxiliary output of each intermediate decoder layer. Please refer to the RT-DETR paper for more details.
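The aggregation over decoder outputs can be sketched as a weighted sum over layers. The snippet below is only an illustration. The function name, the loss weights, and the per-layer loss values are made-up placeholders, not values from this model's training configuration.

```python
def total_detection_loss(per_layer_losses, weights):
    """Sum the weighted losses over all decoder outputs.

    per_layer_losses: one dict per decoder layer output, auxiliary
    outputs first and the final output last.
    weights: hypothetical per-loss weights keyed by loss name.
    """
    total = 0.0
    for layer_losses in per_layer_losses:
        for name, value in layer_losses.items():
            total += weights[name] * value
    return total


# Placeholder weights and loss values, for illustration only.
weights = {"loss_vfl": 1.0, "loss_bbox": 5.0, "loss_giou": 2.0}
per_layer_losses = [
    {"loss_vfl": 0.9, "loss_bbox": 0.4, "loss_giou": 0.6},  # auxiliary output 1
    {"loss_vfl": 0.7, "loss_bbox": 0.3, "loss_giou": 0.5},  # auxiliary output 2
    {"loss_vfl": 0.5, "loss_bbox": 0.2, "loss_giou": 0.4},  # final output
]
total = total_detection_loss(per_layer_losses, weights)
print(total)
```

With 3 decoder layers there are three sets of loss terms per training step, so every layer receives a direct supervision signal.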

Output Format#

Before post-processing, the output of the model is a set of bounding boxes with associated class probabilities. Specifically, the output consists of three tensors:

class_ids: a tensor of 300 elements containing the class id associated with each bounding box (such as 1 for person, 2 for bicycle, etc.)
scores: a tensor of 300 elements containing the predicted probability of the corresponding class_id
boxes: a tensor of shape (300, 4) where the values represent the coordinates of the bounding boxes in the format [x1, y1, x2, y2]

The model does not need NMS (non-maximum suppression) because the output is already a set of bounding boxes with associated class probabilities and has been trained to avoid overlaps.

After post-processing, the output is a Focoos Detections object containing the predicted bounding boxes with confidence greater than a specific threshold (0.5 by default).
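The thresholding step can be sketched in plain Python as follows. This is not the Focoos post-processing code and it does not build a Focoos Detections object. `filter_detections`, the dictionary layout, and the three example queries (standing in for the 300 real ones) are assumptions made for illustration; the class names come from the Classes table.

```python
# Subset of the class table, used only to label the example output.
CLASS_NAMES = {1: "person", 3: "car", 17: "dog"}


def filter_detections(class_ids, scores, boxes, threshold=0.5):
    """Keep the queries whose score is strictly greater than the threshold."""
    kept = []
    for class_id, score, box in zip(class_ids, scores, boxes):
        if score > threshold:
            kept.append({
                "class_id": class_id,
                "label": CLASS_NAMES.get(class_id, "unknown"),
                "score": score,
                "box": box,  # [x1, y1, x2, y2]
            })
    return kept


# Three illustrative queries instead of the 300 produced by the model.
class_ids = [1, 17, 3]
scores = [0.92, 0.35, 0.61]
boxes = [
    [12.0, 30.0, 180.0, 400.0],
    [200.0, 220.0, 330.0, 390.0],
    [350.0, 100.0, 620.0, 300.0],
]

detections = filter_detections(class_ids, scores, boxes)
for det in detections:
    print(det["label"], det["score"], det["box"])
```

Because the model does not rely on NMS, this score filter is the only selection step shown here. Raising the threshold trades recall for precision.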

Classes#

The model is pretrained on the COCO dataset with 80 classes.

Class ID	Class Name	AP
1	person	54.7
2	bicycle	29.1
3	car	41.4
4	motorcycle	44.9
5	airplane	71.4
6	bus	67.8
7	train	68.9
8	truck	36.4
9	boat	26.8
10	traffic light	25.0
11	fire hydrant	66.0
12	stop sign	62.2
13	parking meter	46.1
14	bench	25.2
15	bird	36.5
16	cat	72.6
17	dog	68.5
18	horse	57.9
19	sheep	54.1
20	cow	56.6
21	elephant	66.2
22	bear	78.3
23	zebra	70.0
24	giraffe	70.0
25	backpack	14.9
26	umbrella	39.9
27	handbag	13.2
28	tie	32.6
29	suitcase	41.2
30	frisbee	66.3
31	skis	24.9
32	snowboard	31.6
33	sports ball	44.8
34	kite	45.1
35	baseball bat	29.7
36	baseball glove	35.2
37	skateboard	54.5
38	surfboard	39.9
39	tennis racket	46.1
40	bottle	35.8
41	wine glass	32.6
42	cup	41.1
43	fork	35.5
44	knife	18.9
45	spoon	18.0
46	bowl	42.2
47	banana	24.6
48	apple	18.6
49	sandwich	41.6
50	orange	33.1
51	broccoli	22.4
52	carrot	22.2
53	hot dog	37.6
54	pizza	55.2
55	donut	48.0
56	cake	36.7
57	chair	28.4
58	couch	47.8
59	potted plant	26.8
60	bed	49.0
61	dining table	30.5
62	toilet	60.1
63	tv	57.2
64	laptop	59.6
65	mouse	62.3
66	remote	27.7
67	keyboard	53.8
68	cell phone	33.2
69	microwave	60.7
70	oven	38.8
71	toaster	41.9
72	sink	37.0
73	refrigerator	57.6
74	book	13.8
75	clock	50.3
76	vase	35.5
77	scissors	31.8
78	teddy bear	44.7
79	hair drier	10.3
80	toothbrush	26.8

What are you waiting for? Try it!#

from focoos import Focoos
import os

# Initialize the Focoos client with your API key
focoos = Focoos(api_key=os.getenv("FOCOOS_API_KEY"))

# Get the remote model (fai-rtdetr-s-coco) from Focoos API
model = focoos.get_remote_model("fai-rtdetr-s-coco")

# Run inference on an image
predictions = model.infer("./image.jpg", threshold=0.5)

# Output the predictions
print(predictions)