fai-m2f-m-ade#

Overview#

This model is a Mask2Former model optimized by FocoosAI for the ADE20K dataset. It is a semantic segmentation model able to segment 150 classes, covering both stuff (sky, road, etc.) and thing (dog, cat, car, etc.) categories.

Benchmark#

Note: FPS is measured on an NVIDIA T4 GPU using TensorRT with a 640x640 input size.

Model Details#

The model is based on the Mask2Former architecture: a segmentation model with a transformer-based encoder-decoder design. Unlike traditional segmentation models (such as DeepLab), Mask2Former uses a mask-classification approach, where the prediction is a set of segmentation masks, each with associated class probabilities.
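As a minimal sketch of the mask-classification idea (the array names and shapes below are illustrative, not the actual Focoos tensors): each query predicts a class distribution and a soft mask, and a per-pixel semantic map can be obtained by weighting the masks by their class scores and taking the per-pixel argmax.

```python
import numpy as np

rng = np.random.default_rng(0)
Q, C, H, W = 100, 150, 8, 8          # queries, classes, toy image size
class_probs = rng.random((Q, C))     # per-query class probabilities
class_probs /= class_probs.sum(axis=1, keepdims=True)
mask_probs = rng.random((Q, H, W))   # per-query soft masks in [0, 1]

# Weight each query's mask by its class scores, sum over queries,
# then take the argmax over classes for every pixel.
semantic = np.einsum("qc,qhw->chw", class_probs, mask_probs)
seg_map = semantic.argmax(axis=0)    # (H, W) map of class ids
print(seg_map.shape)
```

This is only the inference-time combination step; the model itself learns the class probabilities and masks end to end.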

Neural Network Architecture#

The FocoosAI implementation of Mask2Former optimizes the original neural network architecture to improve the model's efficiency and performance. The original model is fully described in this paper.

Mask2Former is a hybrid model that uses three main components: a backbone for extracting features, a pixel decoder for upscaling the features, and a transformer-based decoder for generating the segmentation output.


In this implementation:

  • the backbone is STDC-2, which offers an excellent trade-off between performance and efficiency.
  • the pixel decoder is an FPN that takes the features from stages 2 (1/4 resolution), 3 (1/8 resolution), 4 (1/16 resolution), and 5 (1/32 resolution) of the backbone. Unlike the original paper, for the sake of portability we removed the deformable attention modules from the pixel decoder, speeding up inference while only marginally affecting accuracy.
  • the transformer decoder is a lighter version of the original, with only 3 decoder layers (instead of 9) and 100 learnable queries.
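The FPN-style pixel decoder described above can be sketched as a top-down pathway that repeatedly upsamples the coarsest feature map and adds the next finer one. This is a toy NumPy illustration with random features and nearest-neighbor upsampling; the real pixel decoder uses learned convolutions and projections.

```python
import numpy as np

def upsample2x(x):
    # Nearest-neighbor 2x upsampling of a (C, H, W) feature map.
    return x.repeat(2, axis=1).repeat(2, axis=2)

rng = np.random.default_rng(0)
C = 16
# Toy backbone features at 1/4, 1/8, 1/16 and 1/32 of a 64x64 input,
# mimicking stages 2-5 of the backbone.
feats = {s: rng.random((C, 64 // s, 64 // s)) for s in (4, 8, 16, 32)}

# Top-down pathway: start from the coarsest stage (1/32) and
# repeatedly upsample and add the next finer feature map.
x = feats[32]
for s in (16, 8, 4):
    x = upsample2x(x) + feats[s]
print(x.shape)  # (16, 16, 16): C channels at 1/4 resolution
```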

Losses#

We use the same losses as the original paper:

  • loss_ce: cross-entropy loss for classifying each predicted mask
  • loss_dice: Dice loss for the predicted segmentation masks
  • loss_mask: a binary cross-entropy loss applied to the predicted segmentation masks

These losses are applied to each output of the transformer decoder: they are computed on the final output and on each auxiliary output of the 3 transformer decoder layers. Please refer to the Mask2Former paper for more details.
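The three losses can be sketched in NumPy as follows. The inputs are toy arrays, and the bipartite matching between queries and ground-truth masks used by Mask2Former is omitted; this only illustrates how each term is computed for a single matched query.

```python
import numpy as np

def ce_loss(logits, label):
    # Cross-entropy on the class prediction of a single query.
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]

def dice_loss(pred, target, eps=1e-6):
    # Soft Dice loss between a predicted soft mask and a binary target.
    inter = (pred * target).sum()
    return 1.0 - (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

def bce_loss(pred, target, eps=1e-6):
    # Binary cross-entropy on per-pixel mask probabilities.
    p = np.clip(pred, eps, 1.0 - eps)
    return -(target * np.log(p) + (1 - target) * np.log(1 - p)).mean()

rng = np.random.default_rng(0)
pred_mask = rng.random((32, 32))                      # soft mask in [0, 1]
gt_mask = (rng.random((32, 32)) > 0.5).astype(float)  # binary target
logits = rng.standard_normal(150)                     # 150 ADE20K classes

total = ce_loss(logits, label=3) + dice_loss(pred_mask, gt_mask) + bce_loss(pred_mask, gt_mask)
print(total)
```

In training, each term is typically scaled by its own weight before summing; the weights used here (all 1) are placeholders.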

Output Format#

The pre-processed output of the model is a set of masks with associated class probabilities. In particular, the output consists of three tensors:

  • class_ids: a tensor of 100 elements containing the class id associated with each mask (such as 1 for wall, 2 for building, etc.)
  • scores: a tensor of 100 elements containing the corresponding confidence for each class_id
  • masks: a tensor of shape (100, H, W), where H and W are the height and width of the input image; each of the 100 entries is the segmentation mask associated with the corresponding class_id

The model does not need NMS (non-maximum suppression) because the output is already a set of masks with associated class probabilities and has been trained to avoid overlapping masks.

After post-processing, the output is a Focoos Detections object containing the predicted masks with confidence greater than a given threshold (0.5 by default).
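The confidence-threshold step can be sketched like this (illustrative arrays with random values, not the real Focoos Detections object): masks whose score falls below the threshold are simply dropped, with no NMS involved.

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.random(100)                  # per-mask confidence
class_ids = rng.integers(0, 150, 100)     # per-mask class id
masks = rng.random((100, 16, 16)) > 0.5   # per-mask binary segmentation

# Keep only the detections above the confidence threshold.
threshold = 0.5
keep = scores > threshold
kept_scores = scores[keep]
kept_ids = class_ids[keep]
kept_masks = masks[keep]
print(kept_scores.shape[0], kept_masks.shape)
```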

Classes#

The model is pretrained on the ADE20K dataset with 150 classes.

ID Class mIoU (%)
1 wall 75.369549
2 building 79.835995
3 sky 94.176995
4 floor 79.620841
5 tree 73.204506
6 ceiling 82.303035
7 road, route 80.822591
8 bed 87.573840
9 window 57.452584
10 grass 70.099493
11 cabinet 56.903790
12 sidewalk, pavement 62.247267
13 person 79.460606
14 earth, ground 38.537802
15 door 43.930878
16 table 56.753292
17 mountain, mount 61.160462
18 plant 48.995487
19 curtain 71.951930
20 chair 52.852125
21 car 80.725703
22 water 51.233498
23 painting, picture 66.989493
24 sofa 58.103663
25 shelf 34.979205
26 house 36.828611
27 sea 51.219096
28 mirror 58.572852
29 rug 54.897799
30 field 29.053876
31 armchair 39.565663
32 seat 53.113668
33 fence 41.113128
34 desk 37.930189
35 rock, stone 44.940982
36 wardrobe, closet, press 39.897858
37 lamp 60.921356
38 tub 78.041637
39 rail 31.893878
40 cushion 53.029316
41 base, pedestal, stand 20.233620
42 box 18.276924
43 column, pillar 42.655306
44 signboard, sign 35.959448
45 chest of drawers, chest, bureau, dresser 36.521600
46 counter 29.353667
47 sand 38.729599
48 sink 72.303141
49 skyscraper 44.122387
50 fireplace 66.614683
51 refrigerator, icebox 72.137179
52 grandstand, covered stand 29.061628
53 path 26.629478
54 stairs 31.833328
55 runway 76.017706
56 case, display case, showcase, vitrine 37.452627
57 pool table, billiard table, snooker table 93.246039
58 pillow 54.689591
59 screen door, screen 58.096890
60 stairway, staircase 29.962829
61 river 15.010211
62 bridge, span 66.617580
63 bookcase 31.383789
64 blind, screen 39.221180
65 coffee table 63.300795
66 toilet, can, commode, crapper, pot, potty, stool, throne 84.038177
67 flower 35.994798
68 book 43.252042
69 hill 6.240850
70 bench 35.007473
71 countertop 56.592858
72 stove 74.866261
73 palm, palm tree 49.092486
74 kitchen island 32.353614
75 computer 57.673329
76 swivel chair 43.202283
77 boat 48.170742
78 bar 24.034261
79 arcade machine 11.467819
80 hovel, hut, hutch, shack, shanty 10.258017
81 bus 81.375072
82 towel 54.954106
83 light 53.256340
84 truck 29.656645
85 tower 36.864496
86 chandelier 63.787459
87 awning, sunshade, sunblind 23.610311
88 street lamp 29.944617
89 booth 29.360433
90 tv 61.512572
91 plane 53.270513
92 dirt track 4.206758
93 clothes 35.342074
94 pole 20.678348
95 land, ground, soil 3.195710
96 bannister, banister, balustrade, balusters, handrail 17.522631
97 escalator, moving staircase, moving stairway 20.889345
98 ottoman, pouf, pouffe, puff, hassock 47.003450
99 bottle 15.504667
100 buffet, counter, sideboard 26.077572
101 poster, posting, placard, notice, bill, card 30.691103
102 stage 11.744151
103 van 40.161822
104 ship 79.300311
105 fountain 0.112958
106 conveyer belt, conveyor belt, conveyer, conveyor, transporter 60.552373
107 canopy 25.086350
108 washer, automatic washer, washing machine 63.550537
109 plaything, toy 18.290597
110 pool 32.873865
111 stool 39.256308
112 barrel, cask 6.358771
113 basket, handbasket 29.850719
114 falls 57.657161
115 tent 93.717152
116 bag 10.629695
117 minibike, motorbike 56.217901
118 cradle 69.441302
119 oven 38.940583
120 ball 45.543376
121 food, solid food 52.779065
122 step, stair 10.843115
123 tank, storage tank 30.871163
124 trade name 27.908376
125 microwave 32.381977
126 pot 41.040635
127 animal 55.882266
128 bicycle 50.185374
129 lake 0.007605
130 dishwasher 58.970317
131 screen 60.016197
132 blanket, cover 26.963189
133 sculpture 27.667732
134 hood, exhaust hood 58.025458
135 sconce 39.341998
136 vase 31.185747
137 traffic light 23.810429
138 tray 7.244281
139 trash can 30.072544
140 fan 52.113861
141 pier 56.678802
142 crt screen 9.133357
143 plate 38.900407
144 monitor 3.323130
145 bulletin board 52.337659
146 shower 4.692180
147 radiator 43.811464
148 glass, drinking glass 14.036491
149 clock 25.044316
150 flag 40.007933

What are you waiting for? Try it!#

import os

from focoos import Focoos

# Initialize the Focoos client with your API key
focoos = Focoos(api_key=os.getenv("FOCOOS_API_KEY"))

# Get the remote model (fai-m2f-m-ade) from Focoos API
model = focoos.get_remote_model("fai-m2f-m-ade")

# Run inference on an image
predictions = model.infer("./image.jpg", threshold=0.5)

# Output the predictions
print(predictions)