fai-rtdetr-l-coco#

Overview#

This model is FocoosAI's reimplementation of RT-DETR, trained on the COCO dataset. It is an object detection model able to detect 80 thing classes (dog, cat, car, etc.).

Benchmark#

Note: FPS are computed on an NVIDIA T4 GPU using TensorRT with an image size of 640x640.

Model Details#

The model is based on RT-DETR, an object detection model that uses a transformer-based encoder-decoder architecture.

Neural Network Architecture#

This is FocoosAI's reimplementation of the RT-DETR model. The original model is fully described in this paper.

RT-DETR is a hybrid model that uses three main components: a backbone for extracting features, an encoder for upscaling the features, and a transformer-based decoder for generating the detection output.

(Figure: RT-DETR architecture overview)

In this implementation:

  • the backbone is a ResNet-50, which offers good accuracy while remaining efficient.
  • the encoder is the Hybrid Encoder proposed in the paper: a BiFPN (bidirectional feature pyramid network) that includes a transformer encoder on the lowest-resolution feature map to improve efficiency.
  • the query selection mechanism selects the features of the pixels (a.k.a. queries) with the highest probability of containing an object and passes them to a transformer decoder head that generates the final detection output. In this implementation, we select 300 queries and use 6 transformer decoder layers; see the sketch after this list.
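For intuition, here is a minimal PyTorch-style sketch of the top-k query selection step. All names (select_queries, encoder_features, class_head) are hypothetical illustrations and do not reflect the actual Focoos implementation:

import torch

def select_queries(encoder_features, class_head, num_queries=300):
    # encoder_features: (batch, num_pixels, hidden_dim) flattened feature map.
    # class_head: a torch.nn.Linear scoring every pixel for every class.

    # Score each pixel; its "objectness" is its best class score.
    scores = class_head(encoder_features).max(dim=-1).values  # (batch, num_pixels)

    # Keep the top-k scoring pixels as decoder queries.
    topk = scores.topk(num_queries, dim=1).indices  # (batch, num_queries)
    batch_idx = torch.arange(encoder_features.size(0)).unsqueeze(-1)
    return encoder_features[batch_idx, topk]  # (batch, num_queries, hidden_dim)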

Losses#

We use the same losses as the original paper:

  • loss_vfl: a variant of binary cross-entropy for classification (varifocal loss), where the classification target is weighted by the IoU between the predicted and ground-truth bounding boxes.
  • loss_bbox: an L1 loss computing the distance between the predicted bounding boxes and the ground truth bounding boxes.
  • loss_giou: a loss maximizing the generalized IoU between the predicted and the ground-truth bounding boxes. For more details, look at GIoU.

These losses are applied to each output of the transformer decoder, meaning we apply them both to the final output and to each auxiliary output of the intermediate transformer decoder layers. Please refer to the RT-DETR paper for more details.
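For illustration, here is a simplified sketch of how the three terms could be combined for already-matched prediction/target pairs. Hungarian matching and the exact varifocal weighting are omitted, and the loss weights shown are assumptions, not the paper's exact values:

import torch
import torch.nn.functional as F
from torchvision.ops import box_iou, generalized_box_iou

def detection_loss(pred_logits, pred_boxes, gt_classes, gt_boxes,
                   w_vfl=1.0, w_bbox=5.0, w_giou=2.0):
    # pred_logits: (N, num_classes); pred_boxes, gt_boxes: (N, 4) as [x1, y1, x2, y2];
    # gt_classes: (N,) integer labels. Pairs are assumed already matched.

    # IoU of each matched pair, used to soften the classification target.
    iou = torch.diag(box_iou(pred_boxes, gt_boxes))

    # loss_vfl (simplified): BCE against IoU-weighted one-hot targets.
    target = F.one_hot(gt_classes, pred_logits.shape[-1]).float() * iou.unsqueeze(-1)
    loss_vfl = F.binary_cross_entropy_with_logits(pred_logits, target)

    # loss_bbox: L1 distance between predicted and ground-truth boxes.
    loss_bbox = F.l1_loss(pred_boxes, gt_boxes)

    # loss_giou: 1 - GIoU, so better overlap gives a lower loss.
    giou = torch.diag(generalized_box_iou(pred_boxes, gt_boxes))
    loss_giou = (1.0 - giou).mean()

    return w_vfl * loss_vfl + w_bbox * loss_bbox + w_giou * loss_giou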

Output Format#

The raw output of the model, before post-processing, is a set of bounding boxes with associated class probabilities. In particular, the output is composed of three tensors:

  • class_ids: a tensor of 300 elements containing the class id associated with each bounding box (such as 1 for person, 2 for bicycle, etc.)
  • scores: a tensor of 300 elements containing the corresponding probability of the class_id
  • boxes: a tensor of shape (300, 4) where the values represent the coordinates of the bounding boxes in the format [x1, y1, x2, y2]

The model does not need NMS (non-maximum suppression): it directly predicts a fixed set of bounding boxes with associated class probabilities and has been trained to avoid duplicate, overlapping predictions.

After post-processing, the output is a Focoos Detections object containing the predicted bounding boxes with confidence greater than a specified threshold (0.5 by default).
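As a sketch, the confidence thresholding amounts to a simple mask over the three tensors above (NumPy; the function and variable names are illustrative, not the Focoos API):

import numpy as np

def filter_detections(class_ids, scores, boxes, threshold=0.5):
    # class_ids: (300,), scores: (300,), boxes: (300, 4) as [x1, y1, x2, y2].
    keep = scores > threshold  # boolean mask over the 300 candidate detections
    return class_ids[keep], scores[keep], boxes[keep]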

Classes#

The model is pretrained on the COCO dataset with 80 classes.

ID Class AP
1 person 63.2
2 bicycle 40.5
3 car 52.3
4 motorcycle 55.0
5 airplane 76.3
6 bus 74.9
7 train 75.0
8 truck 47.9
9 boat 36.6
10 traffic light 32.6
11 fire hydrant 75.5
12 stop sign 71.2
13 parking meter 54.6
14 bench 34.9
15 bird 46.6
16 cat 79.8
17 dog 75.4
18 horse 69.7
19 sheep 63.0
20 cow 68.8
21 elephant 74.1
22 bear 83.2
23 zebra 78.3
24 giraffe 76.9
25 backpack 25.1
26 umbrella 53.8
27 handbag 24.3
28 tie 44.8
29 suitcase 52.6
30 frisbee 75.3
31 skis 37.2
32 snowboard 50.8
33 sports ball 53.9
34 kite 54.8
35 baseball bat 53.2
36 baseball glove 45.3
37 skateboard 63.7
38 surfboard 50.3
39 tennis racket 61.1
40 bottle 48.8
41 wine glass 44.1
42 cup 53.4
43 fork 51.3
44 knife 34.1
45 spoon 33.5
46 bowl 52.1
47 banana 33.0
48 apple 27.1
49 sandwich 48.1
50 orange 37.9
51 broccoli 28.9
52 carrot 28.2
53 hot dog 50.3
54 pizza 62.5
55 donut 62.3
56 cake 47.5
57 chair 41.2
58 couch 57.3
59 potted plant 36.0
60 bed 58.5
61 dining table 39.4
62 toilet 72.6
63 tv 65.8
64 laptop 73.1
65 mouse 67.1
66 remote 48.2
67 keyboard 63.0
68 cell phone 46.0
69 microwave 64.6
70 oven 44.9
71 toaster 50.4
72 sink 45.7
73 refrigerator 69.4
74 book 22.3
75 clock 59.2
76 vase 45.7
77 scissors 42.4
78 teddy bear 59.5
79 hair drier 35.1
80 toothbrush 42.0

What are you waiting for? Try it!#

import os

from focoos import Focoos

# Initialize the Focoos client with your API key
focoos = Focoos(api_key=os.getenv("FOCOOS_API_KEY"))

# Get the remote model (fai-rtdetr-l-coco) from Focoos API
model = focoos.get_remote_model("fai-rtdetr-l-coco")

# Run inference on an image
predictions = model.infer("./image.jpg", threshold=0.5)

# Output the predictions
print(predictions)