
fai-m2f-l-ade#

Overview#

This model is a Mask2Former model optimized by FocoosAI for the ADE20K dataset. It is a semantic segmentation model able to segment 150 classes, comprising both stuff categories (sky, road, etc.) and thing categories (dog, cat, car, etc.).

Benchmark#

Benchmark Comparison Note: FPS are computed on NVIDIA T4 using TensorRT and image size 640x640.

Model Details#

The model is based on the Mask2Former architecture. It is a segmentation model that uses a transformer-based encoder-decoder architecture. Differently from traditional segmentation models (such as DeepLab), Mask2Former uses a mask-classification approach, where the prediction is made as a set of segmentation masks with associated class probabilities.
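As a sketch of the mask-classification idea, a semantic map can be obtained by weighting each query's soft mask with its class probabilities and taking a per-pixel argmax. The shapes and random inputs below are illustrative, not the model's actual tensors:

```python
import numpy as np

# Hypothetical sizes: Q queries, C classes, HxW image (toy values).
Q, C, H, W = 100, 150, 4, 4
rng = np.random.default_rng(0)

class_probs = rng.random((Q, C))    # per-query class probabilities
mask_probs = rng.random((Q, H, W))  # per-query soft masks in [0, 1]

# Mask classification: aggregate the per-query masks weighted by their
# class scores, then take the per-pixel argmax over classes.
semantic_logits = np.einsum("qc,qhw->chw", class_probs, mask_probs)
semantic_map = semantic_logits.argmax(axis=0)  # (H, W) map of class ids
```

This is the standard way mask-classification outputs are collapsed into a dense semantic segmentation map.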

Neural Network Architecture#

The FocoosAI implementation of Mask2Former optimizes the original neural network architecture to improve the model's efficiency and performance. The original model is fully described in this paper.

Mask2Former is a hybrid model that uses three main components: a backbone for extracting features, a pixel decoder for upscaling the features, and a transformer-based decoder for generating the segmentation output.

*(Figure: the Mask2Former architecture, from the original paper.)*

In this implementation:

  • the backbone is STDC-1, a lightweight network whose accuracy/latency trade-off leans toward efficiency.
  • the pixel decoder is an FPN that takes the features from stages 2 (1/4 resolution), 3 (1/8 resolution), 4 (1/16 resolution) and 5 (1/32 resolution) of the backbone. Differently from the original paper, for the sake of portability, we removed the deformable attention modules in the pixel decoder, speeding up inference while only marginally affecting accuracy.
  • the transformer decoder is an extremely light version of the original, with only 1 decoder layer (instead of 9) and 100 learnable queries.
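As a rough illustration of the FPN-style top-down pathway in the pixel decoder, the sketch below fuses toy multi-scale features with NumPy. It is not the actual FocoosAI implementation, which uses learned lateral and output convolutions; channel counts and sizes are made up:

```python
import numpy as np

def upsample2x(x):
    # Nearest-neighbor 2x upsampling on a (C, H, W) feature map.
    return x.repeat(2, axis=1).repeat(2, axis=2)

rng = np.random.default_rng(0)
# Toy backbone features at strides 4, 8, 16, 32 for a 64x64 input,
# all projected to the same channel count (32 here, chosen arbitrarily).
feats = {s: rng.random((32, 64 // s, 64 // s)) for s in (4, 8, 16, 32)}

# Top-down pathway: start from the deepest stage (1/32) and repeatedly
# upsample, then add the feature map from the next-shallower stage.
fused = feats[32]
for stride in (16, 8, 4):
    fused = upsample2x(fused) + feats[stride]

print(fused.shape)  # (32, 16, 16): fused map at 1/4 resolution
```

The real pixel decoder also applies convolutions at each fusion step; this sketch only shows the upsample-and-add skeleton.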

Losses#

We use the same losses as the original paper:

  • loss_ce: Cross-entropy loss for the classification of each predicted mask
  • loss_dice: Dice loss for the predicted segmentation masks
  • loss_mask: A binary cross-entropy loss applied pixel-wise to the predicted segmentation masks

Please refer to the Mask2Former paper for more details.
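For intuition, here is a minimal NumPy sketch of the three loss terms for a single query. The unit weights and the random inputs are illustrative; the actual training loss also involves Hungarian matching and per-term weights described in the paper:

```python
import numpy as np

def dice_loss(pred, target, eps=1.0):
    # Soft Dice loss between a predicted soft mask and a binary target mask.
    inter = (pred * target).sum()
    return 1.0 - (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

def bce_loss(pred, target, eps=1e-6):
    # Binary cross-entropy averaged over pixels (loss_mask).
    pred = np.clip(pred, eps, 1.0 - eps)
    return float(-(target * np.log(pred) + (1 - target) * np.log(1 - pred)).mean())

def ce_loss(logits, class_id):
    # Cross-entropy on the class logits of one query (loss_ce).
    log_probs = logits - np.log(np.exp(logits).sum())
    return float(-log_probs[class_id])

rng = np.random.default_rng(0)
pred_mask = rng.random((8, 8))                     # soft mask in [0, 1]
gt_mask = (rng.random((8, 8)) > 0.5).astype(float)  # binary ground truth
logits = rng.standard_normal(150)                  # class logits (150 classes)

# Illustrative unweighted sum of the three terms.
total = ce_loss(logits, 3) + dice_loss(pred_mask, gt_mask) + bce_loss(pred_mask, gt_mask)
```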

Output Format#

The raw output of the model (before post-processing) is a set of masks with associated class probabilities. In particular, the output is composed of three tensors:

  • class_ids: a tensor of 100 elements containing the class id associated with each mask (such as 1 for wall, 2 for building, etc.)
  • scores: a tensor of 100 elements containing the corresponding probability of the class_id
  • masks: a tensor of shape (100, H, W), where H and W are the height and width of the input image; each of the 100 channels is the segmentation mask associated with the corresponding class_id and score

The model does not need NMS (non-maximum suppression): the output is already a set of masks with associated class probabilities, and the model has been trained to avoid overlapping masks.

After post-processing, the output is a Focoos Detections object containing the predicted masks with confidence greater than a specific threshold (0.5 by default).
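The thresholding step can be sketched as follows. This is illustrative NumPy code over random stand-in tensors; the actual Focoos Detections structure may differ:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-ins for the three raw output tensors (toy 32x32 masks).
class_ids = rng.integers(1, 151, size=100)   # per-query class ids
scores = rng.random(100)                     # per-query confidences
masks = rng.random((100, 32, 32)) > 0.5      # per-query binary masks

# Keep only the predictions above the confidence threshold.
threshold = 0.5
keep = scores > threshold

detections = [
    {"class_id": int(c), "score": float(s), "mask": m}
    for c, s, m in zip(class_ids[keep], scores[keep], masks[keep])
]
```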

Classes#

The model is pretrained on the ADE20K dataset with 150 classes.

| # | Class | mIoU |
|---:|---|---:|
| 1 | wall | 69.973850 |
| 2 | building | 78.431035 |
| 3 | sky | 91.401107 |
| 4 | floor | 73.162280 |
| 5 | tree | 70.535439 |
| 6 | ceiling | 77.258595 |
| 7 | road, route | 78.314172 |
| 8 | bed | 77.755793 |
| 9 | window | 53.012898 |
| 10 | grass | 64.432303 |
| 11 | cabinet | 51.032268 |
| 12 | sidewalk, pavement | 55.642697 |
| 13 | person | 70.461440 |
| 14 | earth, ground | 30.454824 |
| 15 | door | 36.431782 |
| 16 | table | 43.096636 |
| 17 | mountain, mount | 54.971609 |
| 18 | plant | 45.115711 |
| 19 | curtain | 64.930372 |
| 20 | chair | 40.465565 |
| 21 | car | 76.888113 |
| 22 | water | 41.666148 |
| 23 | painting, picture | 60.099652 |
| 24 | sofa | 49.840449 |
| 25 | shelf | 31.991519 |
| 26 | house | 45.338182 |
| 27 | sea | 51.614250 |
| 28 | mirror | 55.731406 |
| 29 | rug | 51.858072 |
| 30 | field | 23.065903 |
| 31 | armchair | 30.602317 |
| 32 | seat | 50.277596 |
| 33 | fence | 34.439293 |
| 34 | desk | 35.494495 |
| 35 | rock, stone | 39.573617 |
| 36 | wardrobe, closet, press | 51.343586 |
| 37 | lamp | 47.754304 |
| 38 | tub | 71.511291 |
| 39 | rail | 23.280869 |
| 40 | cushion | 39.251768 |
| 41 | base, pedestal, stand | 28.472143 |
| 42 | box | 16.070477 |
| 43 | column, pillar | 37.924454 |
| 44 | signboard, sign | 29.057276 |
| 45 | chest of drawers, chest, bureau, dresser | 36.343963 |
| 46 | counter | 19.595326 |
| 47 | sand | 31.296151 |
| 48 | sink | 54.413180 |
| 49 | skyscraper | 47.583224 |
| 50 | fireplace | 62.204434 |
| 51 | refrigerator, icebox | 54.270643 |
| 52 | grandstand, covered stand | 31.345801 |
| 53 | path | 22.330369 |
| 54 | stairs | 20.323718 |
| 55 | runway | 63.892811 |
| 56 | case, display case, showcase, vitrine | 34.649422 |
| 57 | pool table, billiard table, snooker table | 85.365581 |
| 58 | pillow | 46.426184 |
| 59 | screen door, screen | 57.292321 |
| 60 | stairway, staircase | 28.904954 |
| 61 | river | 16.681450 |
| 62 | bridge, span | 52.791513 |
| 63 | bookcase | 26.722881 |
| 64 | blind, screen | 36.787453 |
| 65 | coffee table | 41.603442 |
| 66 | toilet, can, commode, crapper, pot, potty, stool, throne | 75.753455 |
| 67 | flower | 30.200230 |
| 68 | book | 37.602484 |
| 69 | hill | 5.509057 |
| 70 | bench | 29.331054 |
| 71 | countertop | 46.661677 |
| 72 | stove | 58.972851 |
| 73 | palm, palm tree | 48.317300 |
| 74 | kitchen island | 25.279206 |
| 75 | computer | 49.335666 |
| 76 | swivel chair | 34.845392 |
| 77 | boat | 48.521646 |
| 78 | bar | 30.174155 |
| 79 | arcade machine | 24.721694 |
| 80 | hovel, hut, hutch, shack, shanty | 32.843717 |
| 81 | bus | 82.174778 |
| 82 | towel | 46.050430 |
| 83 | light | 30.983118 |
| 84 | truck | 23.456256 |
| 85 | tower | 32.147803 |
| 86 | chandelier | 54.045160 |
| 87 | awning, sunshade, sunblind | 18.526182 |
| 88 | street lamp | 13.641714 |
| 89 | booth | 60.471570 |
| 90 | tv | 55.530715 |
| 91 | plane | 42.894525 |
| 92 | dirt track | 0.001787 |
| 93 | clothes | 30.124455 |
| 94 | pole | 11.280532 |
| 95 | land, ground, soil | 4.243296 |
| 96 | bannister, banister, balustrade, balusters, handrail | 9.922319 |
| 97 | escalator, moving staircase, moving stairway | 19.186240 |
| 98 | ottoman, pouf, pouffe, puff, hassock | 30.352586 |
| 99 | bottle | 11.872842 |
| 100 | buffet, counter, sideboard | 34.547476 |
| 101 | poster, posting, placard, notice, bill, card | 15.081001 |
| 102 | stage | 17.466091 |
| 103 | van | 39.027877 |
| 104 | ship | 66.778301 |
| 105 | fountain | 18.879113 |
| 106 | conveyer belt, conveyor belt, conveyer, conveyor, transporter | 67.580228 |
| 107 | canopy | 25.654567 |
| 108 | washer, automatic washer, washing machine | 60.187881 |
| 109 | plaything, toy | 13.836259 |
| 110 | pool | 28.796494 |
| 111 | stool | 26.432746 |
| 112 | barrel, cask | 43.777156 |
| 113 | basket, handbasket | 19.144369 |
| 114 | falls | 47.131198 |
| 115 | tent | 88.431441 |
| 116 | bag | 7.634387 |
| 117 | minibike, motorbike | 40.625528 |
| 118 | cradle | 54.247514 |
| 119 | oven | 33.695444 |
| 120 | ball | 36.066130 |
| 121 | food, solid food | 50.837348 |
| 122 | step, stair | 13.071184 |
| 123 | tank, storage tank | 43.042742 |
| 124 | trade name | 21.579095 |
| 125 | microwave | 32.179626 |
| 126 | pot | 27.438416 |
| 127 | animal | 55.993825 |
| 128 | bicycle | 38.273475 |
| 129 | lake | 35.704904 |
| 130 | dishwasher | 37.616793 |
| 131 | screen | 57.100955 |
| 132 | blanket, cover | 15.560568 |
| 133 | sculpture | 31.317035 |
| 134 | hood, exhaust hood | 49.290385 |
| 135 | sconce | 29.971644 |
| 136 | vase | 24.983318 |
| 137 | traffic light | 17.806663 |
| 138 | tray | 5.720345 |
| 139 | trash can | 28.621136 |
| 140 | fan | 39.083851 |
| 141 | pier | 51.310956 |
| 142 | crt screen | 0.858346 |
| 143 | plate | 35.344330 |
| 144 | monitor | 1.994270 |
| 145 | bulletin board | 35.468027 |
| 146 | shower | 1.090403 |
| 147 | radiator | 42.574652 |
| 148 | glass, drinking glass | 8.510381 |
| 149 | clock | 14.128872 |
| 150 | flag | 24.098100 |

What are you waiting for? Try it!#

```python
import os

from focoos import Focoos

# Initialize the Focoos client with your API key
focoos = Focoos(api_key=os.getenv("FOCOOS_API_KEY"))

# Get the remote model (fai-m2f-l-ade) from the Focoos API
model = focoos.get_remote_model("fai-m2f-l-ade")

# Run inference on an image
predictions = model.infer("./image.jpg", threshold=0.5)

# Output the predictions
print(predictions)
```