
fai-m2f-l-ade#

Overview#

This model is a Mask2Former model optimized by FocoosAI for the ADE20K dataset. It is a semantic segmentation model able to segment 150 classes, comprising both stuff categories (sky, road, etc.) and thing categories (dog, cat, car, etc.).

Benchmark#

Benchmark Comparison Note: FPS are computed on NVIDIA T4 using TensorRT and image size 640x640.

Model Details#

The model is based on the Mask2Former architecture. It is a segmentation model that uses a transformer-based encoder-decoder architecture. Differently from traditional segmentation models (such as DeepLab), Mask2Former uses a mask-classification approach, where the prediction is made as a set of segmentation masks with associated class probabilities.
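As a sketch of the mask-classification idea, a semantic map can be obtained by weighting each query's soft mask with its class probabilities and taking a per-pixel argmax. The shapes and random inputs below are illustrative, not the model's actual tensors:

```python
import numpy as np

# Hypothetical sizes: Q queries, C classes, HxW image (toy values).
Q, C, H, W = 100, 150, 4, 4
rng = np.random.default_rng(0)

class_probs = rng.random((Q, C))    # per-query class probabilities
mask_probs = rng.random((Q, H, W))  # per-query soft masks in [0, 1]

# Mask classification: aggregate the per-query masks weighted by their
# class scores, then take the per-pixel argmax over classes.
semantic_logits = np.einsum("qc,qhw->chw", class_probs, mask_probs)
semantic_map = semantic_logits.argmax(axis=0)  # (H, W) map of class ids
```

This is the standard way mask-classification outputs are collapsed into a dense semantic segmentation map.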

Neural Network Architecture#

The FocoosAI implementation of Mask2Former optimizes the original neural network architecture to improve the model's efficiency and performance. The original model is fully described in this paper.

Mask2Former is a hybrid model that uses three main components: a backbone for extracting features, a pixel decoder for upscaling the features, and a transformer-based decoder for generating the segmentation output.

*(Figure: the Mask2Former architecture, from the original paper.)*

In this implementation:

  • the backbone is STDC-1, a lightweight network whose accuracy/latency trade-off leans toward efficiency.
  • the pixel decoder is an FPN that takes the features from stages 2 (1/4 resolution), 3 (1/8 resolution), 4 (1/16 resolution) and 5 (1/32 resolution) of the backbone. Differently from the original paper, for the sake of portability, we removed the deformable attention modules in the pixel decoder, speeding up inference while only marginally affecting accuracy.
  • the transformer decoder is an extremely light version of the original, with only 1 decoder layer (instead of 9) and 100 learnable queries.
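As a rough illustration of the FPN-style top-down pathway in the pixel decoder, the sketch below fuses toy multi-scale features with NumPy. It is not the actual FocoosAI implementation, which uses learned lateral and output convolutions; channel counts and sizes are made up:

```python
import numpy as np

def upsample2x(x):
    # Nearest-neighbor 2x upsampling on a (C, H, W) feature map.
    return x.repeat(2, axis=1).repeat(2, axis=2)

rng = np.random.default_rng(0)
# Toy backbone features at strides 4, 8, 16, 32 for a 64x64 input,
# all projected to the same channel count (32 here, chosen arbitrarily).
feats = {s: rng.random((32, 64 // s, 64 // s)) for s in (4, 8, 16, 32)}

# Top-down pathway: start from the deepest stage (1/32) and repeatedly
# upsample, then add the feature map from the next-shallower stage.
fused = feats[32]
for stride in (16, 8, 4):
    fused = upsample2x(fused) + feats[stride]

print(fused.shape)  # (32, 16, 16): fused map at 1/4 resolution
```

The real pixel decoder also applies convolutions at each fusion step; this sketch only shows the upsample-and-add skeleton.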

Losses#

We use the same losses as the original paper:

  • loss_ce: Cross-entropy loss for the classification of each predicted mask
  • loss_dice: Dice loss for the predicted segmentation masks
  • loss_mask: A binary cross-entropy loss applied pixel-wise to the predicted segmentation masks

Please refer to the Mask2Former paper for more details.
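For intuition, here is a minimal NumPy sketch of the three loss terms for a single query. The unit weights and the random inputs are illustrative; the actual training loss also involves Hungarian matching and per-term weights described in the paper:

```python
import numpy as np

def dice_loss(pred, target, eps=1.0):
    # Soft Dice loss between a predicted soft mask and a binary target mask.
    inter = (pred * target).sum()
    return 1.0 - (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

def bce_loss(pred, target, eps=1e-6):
    # Binary cross-entropy averaged over pixels (loss_mask).
    pred = np.clip(pred, eps, 1.0 - eps)
    return float(-(target * np.log(pred) + (1 - target) * np.log(1 - pred)).mean())

def ce_loss(logits, class_id):
    # Cross-entropy on the class logits of one query (loss_ce).
    log_probs = logits - np.log(np.exp(logits).sum())
    return float(-log_probs[class_id])

rng = np.random.default_rng(0)
pred_mask = rng.random((8, 8))                     # soft mask in [0, 1]
gt_mask = (rng.random((8, 8)) > 0.5).astype(float)  # binary ground truth
logits = rng.standard_normal(150)                  # class logits (150 classes)

# Illustrative unweighted sum of the three terms.
total = ce_loss(logits, 3) + dice_loss(pred_mask, gt_mask) + bce_loss(pred_mask, gt_mask)
```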

Output Format#

The raw output of the model (before post-processing) is a set of masks with associated class probabilities. In particular, the output is composed of three tensors:

  • class_ids: a tensor of 100 elements containing the class id associated with each mask (such as 1 for wall, 2 for building, etc.)
  • scores: a tensor of 100 elements containing the corresponding probability of the class_id
  • masks: a tensor of shape (100, H, W), where H and W are the height and width of the input image; each of the 100 channels is the segmentation mask associated with the corresponding class_id and score

The model does not need NMS (non-maximum suppression): the output is already a set of masks with associated class probabilities, and the model has been trained to avoid overlapping masks.

After post-processing, the output is a Focoos Detections object containing the predicted masks with confidence greater than a specific threshold (0.5 by default).
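The thresholding step can be sketched as follows. This is illustrative NumPy code over random stand-in tensors; the actual Focoos Detections structure may differ:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-ins for the three raw output tensors (toy 32x32 masks).
class_ids = rng.integers(1, 151, size=100)   # per-query class ids
scores = rng.random(100)                     # per-query confidences
masks = rng.random((100, 32, 32)) > 0.5      # per-query binary masks

# Keep only the predictions above the confidence threshold.
threshold = 0.5
keep = scores > threshold

detections = [
    {"class_id": int(c), "score": float(s), "mask": m}
    for c, s, m in zip(class_ids[keep], scores[keep], masks[keep])
]
```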

Classes#

The model is pretrained on the ADE20K dataset with 150 classes.

| # | Class | mIoU |
|---:|---|---:|
| 1 | wall | 69.973850 |
| 2 | building | 78.431035 |
| 3 | sky | 91.401107 |
| 4 | floor | 73.162280 |
| 5 | tree | 70.535439 |
| 6 | ceiling | 77.258595 |
| 7 | road, route | 78.314172 |
| 8 | bed | 77.755793 |
| 9 | window | 53.012898 |
| 10 | grass | 64.432303 |
| 11 | cabinet | 51.032268 |
| 12 | sidewalk, pavement | 55.642697 |
| 13 | person | 70.461440 |
| 14 | earth, ground | 30.454824 |
| 15 | door | 36.431782 |
| 16 | table | 43.096636 |
| 17 | mountain, mount | 54.971609 |
| 18 | plant | 45.115711 |
| 19 | curtain | 64.930372 |
| 20 | chair | 40.465565 |
| 21 | car | 76.888113 |
| 22 | water | 41.666148 |
| 23 | painting, picture | 60.099652 |
| 24 | sofa | 49.840449 |
| 25 | shelf | 31.991519 |
| 26 | house | 45.338182 |
| 27 | sea | 51.614250 |
| 28 | mirror | 55.731406 |
| 29 | rug | 51.858072 |
| 30 | field | 23.065903 |
| 31 | armchair | 30.602317 |
| 32 | seat | 50.277596 |
| 33 | fence | 34.439293 |
| 34 | desk | 35.494495 |
| 35 | rock, stone | 39.573617 |
| 36 | wardrobe, closet, press | 51.343586 |
| 37 | lamp | 47.754304 |
| 38 | tub | 71.511291 |
| 39 | rail | 23.280869 |
| 40 | cushion | 39.251768 |
| 41 | base, pedestal, stand | 28.472143 |
| 42 | box | 16.070477 |
| 43 | column, pillar | 37.924454 |
| 44 | signboard, sign | 29.057276 |
| 45 | chest of drawers, chest, bureau, dresser | 36.343963 |
| 46 | counter | 19.595326 |
| 47 | sand | 31.296151 |
| 48 | sink | 54.413180 |
| 49 | skyscraper | 47.583224 |
| 50 | fireplace | 62.204434 |
| 51 | refrigerator, icebox | 54.270643 |
| 52 | grandstand, covered stand | 31.345801 |
| 53 | path | 22.330369 |
| 54 | stairs | 20.323718 |
| 55 | runway | 63.892811 |
| 56 | case, display case, showcase, vitrine | 34.649422 |
| 57 | pool table, billiard table, snooker table | 85.365581 |
| 58 | pillow | 46.426184 |
| 59 | screen door, screen | 57.292321 |
| 60 | stairway, staircase | 28.904954 |
| 61 | river | 16.681450 |
| 62 | bridge, span | 52.791513 |
| 63 | bookcase | 26.722881 |
| 64 | blind, screen | 36.787453 |
| 65 | coffee table | 41.603442 |
| 66 | toilet, can, commode, crapper, pot, potty, stool, throne | 75.753455 |
| 67 | flower | 30.200230 |
| 68 | book | 37.602484 |
| 69 | hill | 5.509057 |
| 70 | bench | 29.331054 |
| 71 | countertop | 46.661677 |
| 72 | stove | 58.972851 |
| 73 | palm, palm tree | 48.317300 |
| 74 | kitchen island | 25.279206 |
| 75 | computer | 49.335666 |
| 76 | swivel chair | 34.845392 |
| 77 | boat | 48.521646 |
| 78 | bar | 30.174155 |
| 79 | arcade machine | 24.721694 |
| 80 | hovel, hut, hutch, shack, shanty | 32.843717 |
| 81 | bus | 82.174778 |
| 82 | towel | 46.050430 |
| 83 | light | 30.983118 |
| 84 | truck | 23.456256 |
| 85 | tower | 32.147803 |
| 86 | chandelier | 54.045160 |
| 87 | awning, sunshade, sunblind | 18.526182 |
| 88 | street lamp | 13.641714 |
| 89 | booth | 60.471570 |
| 90 | tv | 55.530715 |
| 91 | plane | 42.894525 |
| 92 | dirt track | 0.001787 |
| 93 | clothes | 30.124455 |
| 94 | pole | 11.280532 |
| 95 | land, ground, soil | 4.243296 |
| 96 | bannister, banister, balustrade, balusters, handrail | 9.922319 |
| 97 | escalator, moving staircase, moving stairway | 19.186240 |
| 98 | ottoman, pouf, pouffe, puff, hassock | 30.352586 |
| 99 | bottle | 11.872842 |
| 100 | buffet, counter, sideboard | 34.547476 |
| 101 | poster, posting, placard, notice, bill, card | 15.081001 |
| 102 | stage | 17.466091 |
| 103 | van | 39.027877 |
| 104 | ship | 66.778301 |
| 105 | fountain | 18.879113 |
| 106 | conveyer belt, conveyor belt, conveyer, conveyor, transporter | 67.580228 |
| 107 | canopy | 25.654567 |
| 108 | washer, automatic washer, washing machine | 60.187881 |
| 109 | plaything, toy | 13.836259 |
| 110 | pool | 28.796494 |
| 111 | stool | 26.432746 |
| 112 | barrel, cask | 43.777156 |
| 113 | basket, handbasket | 19.144369 |
| 114 | falls | 47.131198 |
| 115 | tent | 88.431441 |
| 116 | bag | 7.634387 |
| 117 | minibike, motorbike | 40.625528 |
| 118 | cradle | 54.247514 |
| 119 | oven | 33.695444 |
| 120 | ball | 36.066130 |
| 121 | food, solid food | 50.837348 |
| 122 | step, stair | 13.071184 |
| 123 | tank, storage tank | 43.042742 |
| 124 | trade name | 21.579095 |
| 125 | microwave | 32.179626 |
| 126 | pot | 27.438416 |
| 127 | animal | 55.993825 |
| 128 | bicycle | 38.273475 |
| 129 | lake | 35.704904 |
| 130 | dishwasher | 37.616793 |
| 131 | screen | 57.100955 |
| 132 | blanket, cover | 15.560568 |
| 133 | sculpture | 31.317035 |
| 134 | hood, exhaust hood | 49.290385 |
| 135 | sconce | 29.971644 |
| 136 | vase | 24.983318 |
| 137 | traffic light | 17.806663 |
| 138 | tray | 5.720345 |
| 139 | trash can | 28.621136 |
| 140 | fan | 39.083851 |
| 141 | pier | 51.310956 |
| 142 | crt screen | 0.858346 |
| 143 | plate | 35.344330 |
| 144 | monitor | 1.994270 |
| 145 | bulletin board | 35.468027 |
| 146 | shower | 1.090403 |
| 147 | radiator | 42.574652 |
| 148 | glass, drinking glass | 8.510381 |
| 149 | clock | 14.128872 |
| 150 | flag | 24.098100 |

What are you waiting for? Try it!#

```python
import os

from focoos import Focoos

# Initialize the Focoos client with your API key
focoos = Focoos(api_key=os.getenv("FOCOOS_API_KEY"))

# Get the remote model (fai-m2f-l-ade) from the Focoos API
model = focoos.get_remote_model("fai-m2f-l-ade")

# Run inference on an image
predictions = model.infer("./image.jpg", threshold=0.5)

# Output the predictions
print(predictions)
```