fai-m2f-m-ade#

Overview#

This model is a Mask2Former model optimized by FocoosAI for the ADE20K dataset. It is a semantic segmentation model able to segment 150 classes, covering both stuff (sky, road, etc.) and thing (dog, cat, car, etc.) categories.

Benchmark#

Note: FPS is measured on an NVIDIA T4 GPU using TensorRT with a 640x640 input size.

Model Details#

The model is based on the Mask2Former architecture: a segmentation model with a transformer-based encoder-decoder design. Unlike traditional segmentation models (such as DeepLab), Mask2Former uses a mask-classification approach, where the prediction is a set of segmentation masks, each with associated class probabilities.
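As a minimal sketch of the mask-classification idea (the array names and shapes below are illustrative, not the actual Focoos tensors): each query predicts a class distribution and a soft mask, and a per-pixel semantic map can be obtained by weighting the masks by their class scores and taking the per-pixel argmax.

```python
import numpy as np

rng = np.random.default_rng(0)
Q, C, H, W = 100, 150, 8, 8          # queries, classes, toy image size
class_probs = rng.random((Q, C))     # per-query class probabilities
class_probs /= class_probs.sum(axis=1, keepdims=True)
mask_probs = rng.random((Q, H, W))   # per-query soft masks in [0, 1]

# Weight each query's mask by its class scores, sum over queries,
# then take the argmax over classes for every pixel.
semantic = np.einsum("qc,qhw->chw", class_probs, mask_probs)
seg_map = semantic.argmax(axis=0)    # (H, W) map of class ids
print(seg_map.shape)
```

This is only the inference-time combination step; the model itself learns the class probabilities and masks end to end.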

Neural Network Architecture#

The FocoosAI implementation of Mask2Former optimizes the original neural network architecture to improve the model's efficiency and performance. The original model is fully described in this paper.

Mask2Former is a hybrid model that uses three main components: a backbone for extracting features, a pixel decoder for upscaling the features, and a transformer-based decoder for generating the segmentation output.


In this implementation:

  • the backbone is STDC-2, which offers an excellent trade-off between performance and efficiency.
  • the pixel decoder is an FPN that takes the features from stages 2 (1/4 resolution), 3 (1/8 resolution), 4 (1/16 resolution), and 5 (1/32 resolution) of the backbone. Unlike the original paper, for the sake of portability we removed the deformable attention modules from the pixel decoder, speeding up inference while only marginally affecting accuracy.
  • the transformer decoder is a lighter version of the original, with only 3 decoder layers (instead of 9) and 100 learnable queries.
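The FPN-style pixel decoder described above can be sketched as a top-down pathway that repeatedly upsamples the coarsest feature map and adds the next finer one. This is a toy NumPy illustration with random features and nearest-neighbor upsampling; the real pixel decoder uses learned convolutions and projections.

```python
import numpy as np

def upsample2x(x):
    # Nearest-neighbor 2x upsampling of a (C, H, W) feature map.
    return x.repeat(2, axis=1).repeat(2, axis=2)

rng = np.random.default_rng(0)
C = 16
# Toy backbone features at 1/4, 1/8, 1/16 and 1/32 of a 64x64 input,
# mimicking stages 2-5 of the backbone.
feats = {s: rng.random((C, 64 // s, 64 // s)) for s in (4, 8, 16, 32)}

# Top-down pathway: start from the coarsest stage (1/32) and
# repeatedly upsample and add the next finer feature map.
x = feats[32]
for s in (16, 8, 4):
    x = upsample2x(x) + feats[s]
print(x.shape)  # (16, 16, 16): C channels at 1/4 resolution
```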

Losses#

We use the same losses as the original paper:

  • loss_ce: cross-entropy loss for classifying each predicted mask
  • loss_dice: Dice loss for the predicted segmentation masks
  • loss_mask: a binary cross-entropy loss applied to the predicted segmentation masks

These losses are applied to each output of the transformer decoder: they are computed on the final output and on each auxiliary output of the 3 transformer decoder layers. Please refer to the Mask2Former paper for more details.
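The three losses can be sketched in NumPy as follows. The inputs are toy arrays, and the bipartite matching between queries and ground-truth masks used by Mask2Former is omitted; this only illustrates how each term is computed for a single matched query.

```python
import numpy as np

def ce_loss(logits, label):
    # Cross-entropy on the class prediction of a single query.
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]

def dice_loss(pred, target, eps=1e-6):
    # Soft Dice loss between a predicted soft mask and a binary target.
    inter = (pred * target).sum()
    return 1.0 - (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

def bce_loss(pred, target, eps=1e-6):
    # Binary cross-entropy on per-pixel mask probabilities.
    p = np.clip(pred, eps, 1.0 - eps)
    return -(target * np.log(p) + (1 - target) * np.log(1 - p)).mean()

rng = np.random.default_rng(0)
pred_mask = rng.random((32, 32))                      # soft mask in [0, 1]
gt_mask = (rng.random((32, 32)) > 0.5).astype(float)  # binary target
logits = rng.standard_normal(150)                     # 150 ADE20K classes

total = ce_loss(logits, label=3) + dice_loss(pred_mask, gt_mask) + bce_loss(pred_mask, gt_mask)
print(total)
```

In training, each term is typically scaled by its own weight before summing; the weights used here (all 1) are placeholders.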

Output Format#

The pre-processed output of the model is a set of masks with associated class probabilities. In particular, the output consists of three tensors:

  • class_ids: a tensor of 100 elements containing the class id associated with each mask (such as 1 for wall, 2 for building, etc.)
  • scores: a tensor of 100 elements containing the corresponding confidence for each class_id
  • masks: a tensor of shape (100, H, W), where H and W are the height and width of the input image; each of the 100 entries is the segmentation mask associated with the corresponding class_id

The model does not need NMS (non-maximum suppression) because the output is already a set of masks with associated class probabilities and has been trained to avoid overlapping masks.

After post-processing, the output is a Focoos Detections object containing the predicted masks with confidence greater than a given threshold (0.5 by default).
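The confidence-threshold step can be sketched like this (illustrative arrays with random values, not the real Focoos Detections object): masks whose score falls below the threshold are simply dropped, with no NMS involved.

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.random(100)                  # per-mask confidence
class_ids = rng.integers(0, 150, 100)     # per-mask class id
masks = rng.random((100, 16, 16)) > 0.5   # per-mask binary segmentation

# Keep only the detections above the confidence threshold.
threshold = 0.5
keep = scores > threshold
kept_scores = scores[keep]
kept_ids = class_ids[keep]
kept_masks = masks[keep]
print(kept_scores.shape[0], kept_masks.shape)
```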

Classes#

The model is pretrained on the ADE20K dataset with 150 classes.

ID Class mIoU (%)
1 wall 75.369549
2 building 79.835995
3 sky 94.176995
4 floor 79.620841
5 tree 73.204506
6 ceiling 82.303035
7 road, route 80.822591
8 bed 87.573840
9 window 57.452584
10 grass 70.099493
11 cabinet 56.903790
12 sidewalk, pavement 62.247267
13 person 79.460606
14 earth, ground 38.537802
15 door 43.930878
16 table 56.753292
17 mountain, mount 61.160462
18 plant 48.995487
19 curtain 71.951930
20 chair 52.852125
21 car 80.725703
22 water 51.233498
23 painting, picture 66.989493
24 sofa 58.103663
25 shelf 34.979205
26 house 36.828611
27 sea 51.219096
28 mirror 58.572852
29 rug 54.897799
30 field 29.053876
31 armchair 39.565663
32 seat 53.113668
33 fence 41.113128
34 desk 37.930189
35 rock, stone 44.940982
36 wardrobe, closet, press 39.897858
37 lamp 60.921356
38 tub 78.041637
39 rail 31.893878
40 cushion 53.029316
41 base, pedestal, stand 20.233620
42 box 18.276924
43 column, pillar 42.655306
44 signboard, sign 35.959448
45 chest of drawers, chest, bureau, dresser 36.521600
46 counter 29.353667
47 sand 38.729599
48 sink 72.303141
49 skyscraper 44.122387
50 fireplace 66.614683
51 refrigerator, icebox 72.137179
52 grandstand, covered stand 29.061628
53 path 26.629478
54 stairs 31.833328
55 runway 76.017706
56 case, display case, showcase, vitrine 37.452627
57 pool table, billiard table, snooker table 93.246039
58 pillow 54.689591
59 screen door, screen 58.096890
60 stairway, staircase 29.962829
61 river 15.010211
62 bridge, span 66.617580
63 bookcase 31.383789
64 blind, screen 39.221180
65 coffee table 63.300795
66 toilet, can, commode, crapper, pot, potty, stool, throne 84.038177
67 flower 35.994798
68 book 43.252042
69 hill 6.240850
70 bench 35.007473
71 countertop 56.592858
72 stove 74.866261
73 palm, palm tree 49.092486
74 kitchen island 32.353614
75 computer 57.673329
76 swivel chair 43.202283
77 boat 48.170742
78 bar 24.034261
79 arcade machine 11.467819
80 hovel, hut, hutch, shack, shanty 10.258017
81 bus 81.375072
82 towel 54.954106
83 light 53.256340
84 truck 29.656645
85 tower 36.864496
86 chandelier 63.787459
87 awning, sunshade, sunblind 23.610311
88 street lamp 29.944617
89 booth 29.360433
90 tv 61.512572
91 plane 53.270513
92 dirt track 4.206758
93 clothes 35.342074
94 pole 20.678348
95 land, ground, soil 3.195710
96 bannister, banister, balustrade, balusters, handrail 17.522631
97 escalator, moving staircase, moving stairway 20.889345
98 ottoman, pouf, pouffe, puff, hassock 47.003450
99 bottle 15.504667
100 buffet, counter, sideboard 26.077572
101 poster, posting, placard, notice, bill, card 30.691103
102 stage 11.744151
103 van 40.161822
104 ship 79.300311
105 fountain 0.112958
106 conveyer belt, conveyor belt, conveyer, conveyor, transporter 60.552373
107 canopy 25.086350
108 washer, automatic washer, washing machine 63.550537
109 plaything, toy 18.290597
110 pool 32.873865
111 stool 39.256308
112 barrel, cask 6.358771
113 basket, handbasket 29.850719
114 falls 57.657161
115 tent 93.717152
116 bag 10.629695
117 minibike, motorbike 56.217901
118 cradle 69.441302
119 oven 38.940583
120 ball 45.543376
121 food, solid food 52.779065
122 step, stair 10.843115
123 tank, storage tank 30.871163
124 trade name 27.908376
125 microwave 32.381977
126 pot 41.040635
127 animal 55.882266
128 bicycle 50.185374
129 lake 0.007605
130 dishwasher 58.970317
131 screen 60.016197
132 blanket, cover 26.963189
133 sculpture 27.667732
134 hood, exhaust hood 58.025458
135 sconce 39.341998
136 vase 31.185747
137 traffic light 23.810429
138 tray 7.244281
139 trash can 30.072544
140 fan 52.113861
141 pier 56.678802
142 crt screen 9.133357
143 plate 38.900407
144 monitor 3.323130
145 bulletin board 52.337659
146 shower 4.692180
147 radiator 43.811464
148 glass, drinking glass 14.036491
149 clock 25.044316
150 flag 40.007933

What are you waiting for? Try it!#

import os

from focoos import Focoos

# Initialize the Focoos client with your API key
focoos = Focoos(api_key=os.getenv("FOCOOS_API_KEY"))

# Get the remote model (fai-m2f-m-ade) from Focoos API
model = focoos.get_remote_model("fai-m2f-m-ade")

# Run inference on an image
predictions = model.infer("./image.jpg", threshold=0.5)

# Output the predictions
print(predictions)