fai-m2f-l-coco-ins#
Overview#
This model is a Mask2Former model optimized by FocoosAI for the COCO dataset. It is an instance segmentation model that segments 80 thing classes (dog, cat, car, etc.).
Model Details#
The model is based on the Mask2Former architecture. It is a segmentation model that uses a mask-classification approach and a transformer-based encoder-decoder architecture.
Neural Network Architecture#
The FocoosAI implementation of Mask2Former optimizes the original neural network architecture to improve the model's efficiency and performance. The original model is fully described in this paper.
Mask2Former is a hybrid model that uses three main components: a backbone for extracting features, a pixel decoder for upscaling the features, and a transformer-based decoder for generating the segmentation output.
In this implementation:
- the backbone is ResNet-50, which offers a good trade-off between performance and efficiency.
- the pixel decoder is a transformer-augmented FPN. It takes the features from stages 2 (1/4 resolution), 3 (1/8 resolution), 4 (1/16 resolution) and 5 (1/32 resolution) of the backbone. It first uses a transformer encoder to process the features at the lowest resolution (stage 5) and then uses a feature pyramid network to upsample the features. This part differs from the original implementation, which uses deformable attention modules.
- the transformer decoder is implemented as in the original paper, having 9 decoder layers and 100 learnable queries.
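To make the resolutions listed above concrete, here is a minimal sketch that computes the spatial size of the feature map each backbone stage hands to the pixel decoder. The function name `feature_map_sizes` and the 640 x 640 input are illustrative assumptions, and rounding up is assumed because that is how padded stride-2 layers in a standard ResNet typically behave; the actual FocoosAI implementation may differ.

```python
import math

# Backbone stages used by the pixel decoder and their downsampling strides,
# as listed above (stage 2 -> 1/4, ..., stage 5 -> 1/32).
STAGE_STRIDES = {2: 4, 3: 8, 4: 16, 5: 32}


def feature_map_sizes(height, width):
    """Return {stage: (h, w)} for an input image of the given size.

    Assumption: spatial sizes are rounded up at each stage, as padded
    stride-2 convolution and pooling layers typically do.
    """
    return {
        stage: (math.ceil(height / stride), math.ceil(width / stride))
        for stage, stride in STAGE_STRIDES.items()
    }


# Hypothetical input size, chosen only for illustration.
sizes = feature_map_sizes(640, 640)
for stage, (h, w) in sizes.items():
    print(f"stage {stage}: {h} x {w}")
```

The stage 5 map is the smallest one, which is why running the transformer encoder only at that resolution is cheaper than applying attention at every scale.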
Losses#
We use the same losses as the original paper:
- loss_ce: Cross-entropy loss for classifying each predicted mask
- loss_dice: Dice loss applied to the predicted segmentation masks
- loss_mask: A binary cross-entropy loss applied to the predicted segmentation masks
These losses are applied to each output of the transformer decoder, meaning that we apply them to the final output and to each auxiliary output of the intermediate transformer decoder layers. Please refer to the Mask2Former paper for more details.
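The three losses and their summation over decoder outputs can be sketched in NumPy as follows. This is a simplified illustration, not the FocoosAI training code: it handles a single query that is already matched to a ground-truth mask, omits the bipartite matching and point sampling described in the Mask2Former paper, and uses placeholder unit weights. All function names are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def class_ce_loss(class_logits, target_class):
    """Cross-entropy over class logits for one query (loss_ce)."""
    shifted = class_logits - class_logits.max()  # for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[target_class]


def dice_loss(mask_logits, target, eps=1.0):
    """Dice loss between mask logits and a binary target mask (loss_dice)."""
    probs = sigmoid(mask_logits).ravel()
    target = target.ravel()
    intersection = (probs * target).sum()
    return 1.0 - (2.0 * intersection + eps) / (probs.sum() + target.sum() + eps)


def mask_bce_loss(mask_logits, target):
    """Mean per-pixel binary cross-entropy on mask logits (loss_mask)."""
    x = mask_logits
    # Numerically stable form of BCE-with-logits.
    return np.mean(np.maximum(x, 0) - x * target + np.log1p(np.exp(-np.abs(x))))


def total_loss(layer_outputs, target_class, target_mask, weights=(1.0, 1.0, 1.0)):
    """Sum the three losses over the final and all auxiliary decoder outputs.

    layer_outputs: list of (class_logits, mask_logits), one per decoder output.
    weights: placeholder loss weights (ce, dice, mask), not the trained values.
    """
    w_ce, w_dice, w_mask = weights
    total = 0.0
    for class_logits, mask_logits in layer_outputs:
        total += w_ce * class_ce_loss(class_logits, target_class)
        total += w_dice * dice_loss(mask_logits, target_mask)
        total += w_mask * mask_bce_loss(mask_logits, target_mask)
    return total


# Toy example: a 2x2 square target and a near-perfect mask prediction.
target_mask = np.zeros((8, 8))
target_mask[2:4, 2:4] = 1.0
mask_logits = np.where(target_mask > 0, 20.0, -20.0)
class_logits = np.zeros(4)  # uninformative class prediction

single = total_loss([(class_logits, mask_logits)], 1, target_mask)
stacked = total_loss([(class_logits, mask_logits)] * 3, 1, target_mask)
print(single, stacked)
```

Because every decoder output contributes the same three terms, passing the same prediction for three outputs simply triples the total, which is the deep-supervision behavior described above.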
Output Format#
The raw output of the model, before post-processing, is a set of masks with associated class probabilities. In particular, the output is composed of three tensors:
- class_ids: a tensor of 100 elements containing the class id associated with each mask (such as 1 for person, 2 for bicycle, etc.)
- scores: a tensor of 100 elements containing the corresponding probability of the class_id
- masks: a tensor of shape (100, H, W) where H and W are the height and width of the input image and the values represent the index of the class_id associated with the pixel
The model does not need NMS (non-maximum suppression) because the output is already a set of masks with associated class probabilities and has been trained to avoid overlapping masks.
After post-processing, the output is a Focoos Detections object containing the predicted masks with confidence greater than a specific threshold (0.5 by default).
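The thresholding step can be illustrated with a short NumPy sketch. It does not use the Focoos SDK or its Detections object: `filter_by_score` is a hypothetical helper, the tensors are random dummy data with the shapes described above, and the masks are modeled as boolean arrays purely for illustration.

```python
import numpy as np


def filter_by_score(class_ids, scores, masks, threshold=0.5):
    """Keep only the queries whose score exceeds the threshold."""
    keep = scores > threshold
    return class_ids[keep], scores[keep], masks[keep]


# Dummy model output with 100 queries (random values, not real predictions).
rng = np.random.default_rng(0)
num_queries, height, width = 100, 32, 48
class_ids = rng.integers(1, 81, size=num_queries)  # ids 1..80, as in the class table
scores = rng.random(num_queries)
masks = rng.random((num_queries, height, width)) > 0.5  # placeholder boolean masks

kept_ids, kept_scores, kept_masks = filter_by_score(class_ids, scores, masks)
print(f"kept {len(kept_ids)} of {num_queries} masks")
```

Raising the threshold trades recall for precision: fewer masks survive, but the ones that do carry higher confidence.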
Classes#
The model is pretrained on the COCO dataset with 80 classes.
Class ID | Class Name | Segmentation AP |
---|---|---|
1 | person | 48.9 |
2 | bicycle | 22.2 |
3 | car | 41.3 |
4 | motorcycle | 40.0 |
5 | airplane | 55.6 |
6 | bus | 68.2 |
7 | train | 69.6 |
8 | truck | 40.5 |
9 | boat | 26.2 |
10 | traffic light | 27.4 |
11 | fire hydrant | 69.2 |
12 | stop sign | 65.0 |
13 | parking meter | 45.4 |
14 | bench | 23.4 |
15 | bird | 33.8 |
16 | cat | 77.7 |
17 | dog | 68.9 |
18 | horse | 50.1 |
19 | sheep | 54.0 |
20 | cow | 51.0 |
21 | elephant | 63.4 |
22 | bear | 81.1 |
23 | zebra | 66.0 |
24 | giraffe | 60.5 |
25 | backpack | 22.7 |
26 | umbrella | 52.6 |
27 | handbag | 23.3 |
28 | tie | 33.2 |
29 | suitcase | 45.3 |
30 | frisbee | 66.4 |
31 | skis | 7.4 |
32 | snowboard | 28.2 |
33 | sports ball | 42.8 |
34 | kite | 30.3 |
35 | baseball bat | 32.1 |
36 | baseball glove | 42.3 |
37 | skateboard | 36.8 |
38 | surfboard | 37.3 |
39 | tennis racket | 58.7 |
40 | bottle | 39.2 |
41 | wine glass | 36.9 |
42 | cup | 46.0 |
43 | fork | 22.2 |
44 | knife | 17.8 |
45 | spoon | 18.0 |
46 | bowl | 44.3 |
47 | banana | 26.5 |
48 | apple | 23.9 |
49 | sandwich | 43.0 |
50 | orange | 33.8 |
51 | broccoli | 24.4 |
52 | carrot | 22.7 |
53 | hot dog | 36.3 |
54 | pizza | 55.1 |
55 | donut | 51.1 |
56 | cake | 44.6 |
57 | chair | 25.0 |
58 | couch | 47.7 |
59 | potted plant | 25.0 |
60 | bed | 45.0 |
61 | dining table | 22.9 |
62 | toilet | 67.6 |
63 | tv | 64.3 |
64 | laptop | 67.2 |
65 | mouse | 60.1 |
66 | remote | 36.1 |
67 | keyboard | 52.6 |
68 | cell phone | 42.0 |
69 | microwave | 60.7 |
70 | oven | 33.8 |
71 | toaster | 35.9 |
72 | sink | 39.9 |
73 | refrigerator | 64.0 |
74 | book | 12.0 |
75 | clock | 52.5 |
76 | vase | 37.7 |
77 | scissors | 26.8 |
78 | teddy bear | 55.1 |
79 | hair drier | 16.8 |
80 | toothbrush | 22.4 |
What are you waiting for? Try it!#