fai-m2f-l-coco-ins#

Overview#

The model is a Mask2Former model optimized by FocoosAI for the COCO dataset. It is an instance segmentation model able to segment 80 "thing" classes (dog, cat, car, etc.).

Model Details#

The model is based on the Mask2Former architecture. It is a segmentation model that uses a mask-classification approach and a transformer-based encoder-decoder architecture.

Neural Network Architecture#

The FocoosAI implementation of Mask2Former optimizes the original neural network architecture to improve the model's efficiency and performance. The original model is fully described in this paper.

Mask2Former is a hybrid model that uses three main components: a backbone for extracting features, a pixel decoder for upscaling the features, and a transformer-based decoder for generating the segmentation output.

In this implementation:

  • the backbone is ResNet-50, which offers a strong trade-off between performance and efficiency.
  • the pixel decoder is a transformer-augmented FPN. It takes the features from stages 2 (1/4 resolution), 3 (1/8 resolution), 4 (1/16 resolution), and 5 (1/32 resolution) of the backbone. It first uses a transformer encoder to process the features at the lowest resolution (stage 5) and then uses a feature pyramid network to upsample the features. This differs from the original implementation, which uses deformable attention modules.
  • the transformer decoder is implemented as in the original paper, with 9 decoder layers and 100 learnable queries (see the shape sketch after this list).
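
To make the data flow concrete, here is a minimal shape walk-through of the three components, assuming PyTorch and a 640x640 input. The channel sizes and the dummy tensors are illustrative assumptions, not the actual FocoosAI implementation; only the 1/4-1/32 resolutions, the 100 queries, and the 80 classes come from the description above.

import torch

H, W = 640, 640  # assumed input size, for illustration only

# 1) Backbone (ResNet-50): multi-scale features from stages 2-5
stage2 = torch.randn(1, 256, H // 4, W // 4)     # 1/4 resolution
stage3 = torch.randn(1, 512, H // 8, W // 8)     # 1/8 resolution
stage4 = torch.randn(1, 1024, H // 16, W // 16)  # 1/16 resolution
stage5 = torch.randn(1, 2048, H // 32, W // 32)  # 1/32 resolution

# 2) Pixel decoder: a transformer encoder refines stage 5, then an
#    FPN progressively upsamples back to 1/4 resolution
pixel_features = torch.randn(1, 256, H // 4, W // 4)

# 3) Transformer decoder: 100 learnable queries, each predicting one
#    (class, mask) pair; 80 classes + 1 "no object" class
num_queries, num_classes = 100, 80
class_logits = torch.randn(1, num_queries, num_classes + 1)
mask_logits = torch.randn(1, num_queries, H // 4, W // 4)

print(class_logits.shape)  # torch.Size([1, 100, 81])
print(mask_logits.shape)   # torch.Size([1, 100, 160, 160])

As in the original paper, the mask logits are predicted at 1/4 resolution and upsampled to the input size at inference time.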

Losses#

We use the same losses as the original paper:

  • loss_ce: Cross-entropy loss for the classification of the classes
  • loss_dice: Dice loss for the segmentation of the classes
  • loss_mask: A binary cross-entropy loss applied to the predicted segmentation masks

These losses are applied to each output of the transformer decoder, meaning that they are computed on the final output and on the auxiliary outputs of the intermediate decoder layers. Please refer to the Mask2Former paper for more details.
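
As an illustration, a minimal sketch of how the three terms could be combined for one (already matched) decoder output is shown below. The loss weights and the dice_loss helper are assumptions in the spirit of the paper, not the exact training code; the real pipeline additionally performs Hungarian matching between queries and ground-truth instances before computing the losses.

import torch
import torch.nn.functional as F

def dice_loss(mask_logits, gt_masks, eps=1.0):
    # Dice loss on sigmoid-activated, flattened masks
    pred = mask_logits.sigmoid().flatten(1)
    gt = gt_masks.flatten(1)
    numerator = 2 * (pred * gt).sum(-1)
    denominator = pred.sum(-1) + gt.sum(-1)
    return (1 - (numerator + eps) / (denominator + eps)).mean()

def mask2former_loss(class_logits, mask_logits, gt_classes, gt_masks,
                     w_ce=2.0, w_dice=5.0, w_mask=5.0):
    # Assumes queries are already matched 1:1 to ground-truth instances;
    # the weights are illustrative defaults, not the tuned training values
    loss_ce = F.cross_entropy(class_logits, gt_classes)
    loss_dice = dice_loss(mask_logits, gt_masks)
    loss_mask = F.binary_cross_entropy_with_logits(mask_logits, gt_masks)
    return w_ce * loss_ce + w_dice * loss_dice + w_mask * loss_mask

# Dummy matched outputs: 100 queries, 80 classes + "no object", 160x160 masks
class_logits = torch.randn(100, 81)
mask_logits = torch.randn(100, 160, 160)
gt_classes = torch.randint(0, 81, (100,))
gt_masks = (torch.rand(100, 160, 160) > 0.5).float()
print(mask2former_loss(class_logits, mask_logits, gt_classes, gt_masks))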

Output Format#

Before post-processing, the output of the model is a set of masks with associated class probabilities. In particular, the output is composed of three tensors:

  • class_ids: a tensor of 100 elements containing the class id associated with each mask (such as 1 for person, 2 for bicycle, etc.)
  • scores: a tensor of 100 elements containing the corresponding probability of the class_id
  • masks: a tensor of shape (100, H, W), where H and W are the height and width of the input image and each slice is the binary mask of the instance associated with the corresponding class_id

The model does not need NMS (non-maximum suppression): the output is already a fixed set of masks with associated class probabilities, and the model has been trained to avoid overlapping masks.

After post-processing, the output is a Focoos Detections object containing the predicted masks with confidence greater than a specific threshold (0.5 by default).
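
For illustration, the confidence filtering could be reproduced on the raw tensors as in the sketch below; filter_predictions is a hypothetical helper, not part of the Focoos SDK.

import torch

def filter_predictions(class_ids, scores, masks, threshold=0.5):
    # Keep only the masks whose confidence exceeds the threshold
    keep = scores > threshold
    return class_ids[keep], scores[keep], masks[keep]

# Dummy tensors matching the documented output format
class_ids = torch.randint(1, 81, (100,))  # one class id per mask
scores = torch.rand(100)                  # one confidence per mask
masks = torch.rand(100, 480, 640)         # (100, H, W) masks

ids, confidences, kept_masks = filter_predictions(class_ids, scores, masks)
print(f"{len(ids)} detections above threshold")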

Classes#

The model is pretrained on the COCO dataset with 80 classes.

ID  Class  Segmentation AP
1 person 48.9
2 bicycle 22.2
3 car 41.3
4 motorcycle 40.0
5 airplane 55.6
6 bus 68.2
7 train 69.6
8 truck 40.5
9 boat 26.2
10 traffic light 27.4
11 fire hydrant 69.2
12 stop sign 65.0
13 parking meter 45.4
14 bench 23.4
15 bird 33.8
16 cat 77.7
17 dog 68.9
18 horse 50.1
19 sheep 54.0
20 cow 51.0
21 elephant 63.4
22 bear 81.1
23 zebra 66.0
24 giraffe 60.5
25 backpack 22.7
26 umbrella 52.6
27 handbag 23.3
28 tie 33.2
29 suitcase 45.3
30 frisbee 66.4
31 skis 7.4
32 snowboard 28.2
33 sports ball 42.8
34 kite 30.3
35 baseball bat 32.1
36 baseball glove 42.3
37 skateboard 36.8
38 surfboard 37.3
39 tennis racket 58.7
40 bottle 39.2
41 wine glass 36.9
42 cup 46.0
43 fork 22.2
44 knife 17.8
45 spoon 18.0
46 bowl 44.3
47 banana 26.5
48 apple 23.9
49 sandwich 43.0
50 orange 33.8
51 broccoli 24.4
52 carrot 22.7
53 hot dog 36.3
54 pizza 55.1
55 donut 51.1
56 cake 44.6
57 chair 25.0
58 couch 47.7
59 potted plant 25.0
60 bed 45.0
61 dining table 22.9
62 toilet 67.6
63 tv 64.3
64 laptop 67.2
65 mouse 60.1
66 remote 36.1
67 keyboard 52.6
68 cell phone 42.0
69 microwave 60.7
70 oven 33.8
71 toaster 35.9
72 sink 39.9
73 refrigerator 64.0
74 book 12.0
75 clock 52.5
76 vase 37.7
77 scissors 26.8
78 teddy bear 55.1
79 hair drier 16.8
80 toothbrush 22.4
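
If you need to map predicted class_ids back to readable names, the table above can be turned into a simple lookup. The dictionary below is a hand-built excerpt from the table, not an official mapping shipped with the SDK; extend it with the remaining rows as needed.

# Hypothetical helper: excerpt of the id -> name mapping from the table above
COCO_CLASSES = {
    1: "person", 2: "bicycle", 3: "car", 4: "motorcycle", 5: "airplane",
    16: "cat", 17: "dog", 57: "chair", 80: "toothbrush",
}

def class_name(class_id: int) -> str:
    return COCO_CLASSES.get(class_id, f"unknown ({class_id})")

print(class_name(16))  # cat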

What are you waiting for? Try it!#

import os

from focoos import Focoos

# Initialize the Focoos client with your API key
focoos = Focoos(api_key=os.getenv("FOCOOS_API_KEY"))

# Get the remote model (fai-m2f-l-coco-ins) from Focoos API
model = focoos.get_remote_model("fai-m2f-l-coco-ins")

# Run inference on an image
predictions = model.infer("./image.jpg", threshold=0.5)

# Output the predictions
print(predictions)