
fai-m2f-l-ade#

Overview#

This model is a Mask2Former model optimized by FocoosAI for the ADE20K dataset. It is a semantic segmentation model able to segment 150 classes, comprising both stuff (sky, road, etc.) and thing (dog, cat, car, etc.) categories.

Benchmark#

Benchmark comparison note: FPS values are measured on an NVIDIA T4 GPU using TensorRT with a 640x640 input image.

Model Details#

The model is based on the Mask2Former architecture. It is a segmentation model that uses a transformer-based encoder-decoder architecture. Differently from traditional segmentation models (such as DeepLab), Mask2Former uses a mask-classification approach, where the prediction is a set of segmentation masks, each with associated class probabilities.
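To illustrate the mask-classification formulation, the toy NumPy sketch below (random numbers, not the actual model code) shows how a set of per-query masks and class probabilities can be combined into a per-pixel semantic map: each mask is weighted by its class probabilities, and every pixel takes the highest-scoring class.

```python
import numpy as np

# Toy example: 3 predicted masks over a 4x4 image, 5 semantic classes.
num_queries, num_classes, H, W = 3, 5, 4, 4
rng = np.random.default_rng(0)
mask_logits = rng.normal(size=(num_queries, H, W))          # per-query mask logits
class_logits = rng.normal(size=(num_queries, num_classes))  # per-query class logits

# Softmax over classes, sigmoid over mask logits.
class_probs = np.exp(class_logits) / np.exp(class_logits).sum(axis=1, keepdims=True)
mask_probs = 1.0 / (1.0 + np.exp(-mask_logits))

# Per-pixel class scores: weight each mask by its class probabilities,
# then take the most likely class at every pixel.
pixel_scores = np.einsum("qc,qhw->chw", class_probs, mask_probs)
semantic_map = pixel_scores.argmax(axis=0)  # (H, W) map of class indices

print(semantic_map.shape)  # (4, 4)
```

The same idea scales directly to the real model's 100 queries and 150 ADE20K classes.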

Neural Network Architecture#

The FocoosAI implementation of Mask2Former optimizes the original neural network architecture to improve the model's efficiency and performance. The original model is fully described in this paper.

Mask2Former is a hybrid model that uses three main components: a backbone for extracting features, a pixel decoder for upscaling the features, and a transformer-based decoder for generating the segmentation output.

Figure: Mask2Former architecture (backbone, pixel decoder, transformer decoder).

In this implementation:

  • the backbone is a ResNet-101, which offers a good trade-off between accuracy and efficiency.
  • the pixel decoder is an FPN that takes features from stages 2 (1/4 resolution), 3 (1/8 resolution), 4 (1/16 resolution), and 5 (1/32 resolution) of the backbone. Differently from the original paper, for the sake of portability we removed the deformable attention modules in the pixel decoder, speeding up inference while only marginally affecting accuracy.
  • the transformer decoder is a lighter version of the original, with only 6 decoder layers (instead of 9) and 100 learnable queries.
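To make the stage resolutions above concrete, simple stride arithmetic (not library code) gives the spatial size of each feature map the pixel decoder receives for a 640x640 input:

```python
# Spatial size of each backbone stage for a 640x640 input,
# given the strides listed above (stage 2 = 1/4, ..., stage 5 = 1/32).
input_size = 640
strides = {"stage2": 4, "stage3": 8, "stage4": 16, "stage5": 32}
feature_sizes = {name: input_size // s for name, s in strides.items()}

print(feature_sizes)  # {'stage2': 160, 'stage3': 80, 'stage4': 40, 'stage5': 20}
```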

Losses#

We use the same losses as the original paper:

  • loss_ce: Cross-entropy loss for the classification of the classes
  • loss_dice: Dice loss for the segmentation of the classes
  • loss_mask: A binary cross-entropy loss applied to the predicted segmentation masks

These losses are applied to each output of the transformer decoder, meaning that they are applied to the final output and to each auxiliary output of the 6 transformer decoder layers. Please refer to the Mask2Former paper for more details.
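For intuition, here is a minimal NumPy sketch of the dice and binary cross-entropy terms on a toy 2x2 mask. The smoothing constant and loss weights are illustrative; the trained model follows the exact formulation and weighting of the Mask2Former paper.

```python
import numpy as np

def dice_loss(pred, target, eps=1.0):
    # Soft Dice loss between a predicted mask probability map and a binary target.
    inter = (pred * target).sum()
    return 1.0 - (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

def bce_loss(pred, target, eps=1e-7):
    # Binary cross-entropy averaged over pixels.
    pred = np.clip(pred, eps, 1 - eps)
    return -(target * np.log(pred) + (1 - target) * np.log(1 - pred)).mean()

pred = np.array([[0.9, 0.1], [0.8, 0.2]])    # predicted mask probabilities
target = np.array([[1.0, 0.0], [1.0, 0.0]])  # ground-truth binary mask

print(round(dice_loss(pred, target), 3))  # ≈ 0.12
print(round(bce_loss(pred, target), 3))   # ≈ 0.164
```

In training, a weighted sum of these terms plus the cross-entropy classification loss is computed for the final output and for each auxiliary decoder output.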

Output Format#

The pre-processed output of the model is a set of masks with associated class probabilities. In particular, the output is composed of three tensors:

  • class_ids: a tensor of 100 elements containing the class id associated with each mask (such as 1 for wall, 2 for building, etc.)
  • scores: a tensor of 100 elements containing the corresponding probability of the class_id
  • masks: a tensor of shape (100, H, W), where H and W are the height and width of the input image and each mask indicates the pixels belonging to its associated class_id

The model does not need NMS (non-maximum suppression) because it predicts a fixed set of masks with associated class probabilities and is trained to avoid overlapping masks.

After the post-processing, the output is a Focoos Detections object containing the predicted masks with confidence greater than a specific threshold (0.5 by default).
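The thresholding step can be sketched as follows on random placeholder outputs (the actual Focoos detections object may differ; this only illustrates filtering the 100 raw predictions by confidence):

```python
import numpy as np

rng = np.random.default_rng(0)
# Placeholder model outputs: 100 queries over a 32x32 image.
class_ids = rng.integers(1, 151, size=100)  # one ADE20K class id per mask
scores = rng.random(100)                    # confidence per mask
masks = rng.random((100, 32, 32)) > 0.5     # toy binary masks

# Keep only predictions above the confidence threshold (0.5 by default).
threshold = 0.5
keep = scores > threshold
kept_class_ids = class_ids[keep]
kept_scores = scores[keep]
kept_masks = masks[keep]

print(kept_scores.shape[0], "masks kept out of", scores.shape[0])
```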

Classes#

The model is pretrained on the ADE20K dataset with 150 classes.

| Class ID | Class Name | mIoU |
|---|---|---|
| 1 | wall | 77.284 |
| 2 | building | 81.396 |
| 3 | sky | 94.337 |
| 4 | floor | 81.584 |
| 5 | tree | 74.103 |
| 6 | ceiling | 83.073 |
| 7 | road, route | 83.013 |
| 8 | bed | 88.120 |
| 9 | window | 61.048 |
| 10 | grass | 69.099 |
| 11 | cabinet | 56.303 |
| 12 | sidewalk, pavement | 62.300 |
| 13 | person | 82.073 |
| 14 | earth, ground | 35.094 |
| 15 | door | 45.140 |
| 16 | table | 59.436 |
| 17 | mountain, mount | 60.538 |
| 18 | plant | 51.829 |
| 19 | curtain | 71.510 |
| 20 | chair | 56.219 |
| 21 | car | 83.766 |
| 22 | water | 49.028 |
| 23 | painting, picture | 70.214 |
| 24 | sofa | 68.081 |
| 25 | shelf | 35.453 |
| 26 | house | 45.656 |
| 27 | sea | 51.205 |
| 28 | mirror | 61.611 |
| 29 | rug | 64.144 |
| 30 | field | 30.577 |
| 31 | armchair | 45.761 |
| 32 | seat | 61.850 |
| 33 | fence | 40.992 |
| 34 | desk | 41.814 |
| 35 | rock, stone | 47.600 |
| 36 | wardrobe, closet, press | 39.846 |
| 37 | lamp | 64.062 |
| 38 | tub | 74.760 |
| 39 | rail | 24.105 |
| 40 | cushion | 56.811 |
| 41 | base, pedestal, stand | 27.777 |
| 42 | box | 24.670 |
| 43 | column, pillar | 40.094 |
| 44 | signboard, sign | 33.495 |
| 45 | chest of drawers, chest, bureau, dresser | 41.847 |
| 46 | counter | 21.387 |
| 47 | sand | 29.763 |
| 48 | sink | 74.092 |
| 49 | skyscraper | 37.613 |
| 50 | fireplace | 65.037 |
| 51 | refrigerator, icebox | 57.648 |
| 52 | grandstand, covered stand | 46.626 |
| 53 | path | 24.543 |
| 54 | stairs | 28.681 |
| 55 | runway | 73.779 |
| 56 | case, display case, showcase, vitrine | 38.437 |
| 57 | pool table, billiard table, snooker table | 91.825 |
| 58 | pillow | 49.388 |
| 59 | screen door, screen | 59.058 |
| 60 | stairway, staircase | 32.832 |
| 61 | river | 18.597 |
| 62 | bridge, span | 56.011 |
| 63 | bookcase | 28.848 |
| 64 | blind, screen | 43.934 |
| 65 | coffee table | 59.869 |
| 66 | toilet, can, commode, crapper, pot, potty, stool, throne | 86.346 |
| 67 | flower | 38.141 |
| 68 | book | 42.528 |
| 69 | hill | 6.905 |
| 70 | bench | 45.494 |
| 71 | countertop | 49.007 |
| 72 | stove | 73.973 |
| 73 | palm, palm tree | 49.478 |
| 74 | kitchen island | 42.603 |
| 75 | computer | 72.142 |
| 76 | swivel chair | 44.262 |
| 77 | boat | 73.689 |
| 78 | bar | 37.749 |
| 79 | arcade machine | 78.733 |
| 80 | hovel, hut, hutch, shack, shanty | 30.537 |
| 81 | bus | 90.808 |
| 82 | towel | 58.158 |
| 83 | light | 57.444 |
| 84 | truck | 31.745 |
| 85 | tower | 32.058 |
| 86 | chandelier | 67.524 |
| 87 | awning, sunshade, sunblind | 28.566 |
| 88 | street lamp | 30.507 |
| 89 | booth | 39.696 |
| 90 | tv | 76.194 |
| 91 | plane | 50.005 |
| 92 | dirt track | 18.268 |
| 93 | clothes | 37.748 |
| 94 | pole | 23.343 |
| 95 | land, ground, soil | 0.001 |
| 96 | bannister, banister, balustrade, balusters, handrail | 16.222 |
| 97 | escalator, moving staircase, moving stairway | 54.888 |
| 98 | ottoman, pouf, pouffe, puff, hassock | 32.444 |
| 99 | bottle | 22.166 |
| 100 | buffet, counter, sideboard | 48.994 |
| 101 | poster, posting, placard, notice, bill, card | 31.773 |
| 102 | stage | 18.731 |
| 103 | van | 46.747 |
| 104 | ship | 79.937 |
| 105 | fountain | 21.205 |
| 106 | conveyer belt, conveyor belt, conveyer, conveyor, transporter | 62.591 |
| 107 | canopy | 23.719 |
| 108 | washer, automatic washer, washing machine | 66.458 |
| 109 | plaything, toy | 35.377 |
| 110 | pool | 34.297 |
| 111 | stool | 41.199 |
| 112 | barrel, cask | 61.803 |
| 113 | basket, handbasket | 34.313 |
| 114 | falls | 57.149 |
| 115 | tent | 94.077 |
| 116 | bag | 19.126 |
| 117 | minibike, motorbike | 71.207 |
| 118 | cradle | 85.775 |
| 119 | oven | 50.996 |
| 120 | ball | 32.601 |
| 121 | food, solid food | 58.662 |
| 122 | step, stair | 16.474 |
| 123 | tank, storage tank | 37.627 |
| 124 | trade name | 20.788 |
| 125 | microwave | 37.998 |
| 126 | pot | 53.411 |
| 127 | animal | 57.360 |
| 128 | bicycle | 58.772 |
| 129 | lake | 41.597 |
| 130 | dishwasher | 74.543 |
| 131 | screen | 79.757 |
| 132 | blanket, cover | 15.202 |
| 133 | sculpture | 53.537 |
| 134 | hood, exhaust hood | 52.684 |
| 135 | sconce | 48.160 |
| 136 | vase | 45.300 |
| 137 | traffic light | 35.375 |
| 138 | tray | 14.093 |
| 139 | trash can | 30.699 |
| 140 | fan | 56.574 |
| 141 | pier | 10.286 |
| 142 | crt screen | 0.936 |
| 143 | plate | 53.268 |
| 144 | monitor | 9.358 |
| 145 | bulletin board | 29.970 |
| 146 | shower | 8.978 |
| 147 | radiator | 59.763 |
| 148 | glass, drinking glass | 18.246 |
| 149 | clock | 29.088 |
| 150 | flag | 37.727 |
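When consuming predictions, the numeric class_ids can be mapped back to the names above. A minimal sketch with a handful of entries (the full 150-entry mapping follows the table; the fallback label is purely illustrative):

```python
# A small excerpt of the ADE20K id-to-name mapping from the table above.
ADE20K_CLASSES = {1: "wall", 2: "building", 3: "sky", 13: "person", 21: "car"}

def class_name(class_id: int) -> str:
    # Fall back to a generic label for ids not in this excerpt.
    return ADE20K_CLASSES.get(class_id, f"class_{class_id}")

print(class_name(13))   # person
print(class_name(150))  # class_150
```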

What are you waiting for? Try it!#

import os

from focoos import Focoos

# Initialize the Focoos client with your API key
focoos = Focoos(api_key=os.getenv("FOCOOS_API_KEY"))

# Get the remote model (fai-m2f-l-ade) from Focoos API
model = focoos.get_remote_model("fai-m2f-l-ade")

# Run inference on an image
predictions = model.infer("./image.jpg", threshold=0.5)

# Output the predictions
print(predictions)