fai-m2f-l-coco-ins#

Overview#

The model is a Mask2Former model optimized by FocoosAI for the COCO dataset. It is an instance segmentation model able to segment 80 "thing" classes (dog, cat, car, etc.).

Model Details#

The model is based on the Mask2Former architecture. It is a segmentation model that uses a mask-classification approach and a transformer-based encoder-decoder architecture.

Neural Network Architecture#

The FocoosAI implementation of Mask2Former optimizes the original neural network architecture to improve the model's efficiency and performance. The original model is fully described in this paper.

Mask2Former is a hybrid model that uses three main components: a backbone for extracting features, a pixel decoder for upscaling the features, and a transformer-based decoder for generating the segmentation output.

In this implementation:

  • the backbone is ResNet-50, which offers a strong trade-off between performance and efficiency.
  • the pixel decoder is a transformer-augmented FPN. It takes the features from stages 2 (1/4 resolution), 3 (1/8 resolution), 4 (1/16 resolution), and 5 (1/32 resolution) of the backbone. It first uses a transformer encoder to process the features at the lowest resolution (stage 5) and then uses a feature pyramid network to upsample the features. This differs from the original implementation, which uses deformable attention modules.
  • the transformer decoder is implemented as in the original paper, with 9 decoder layers and 100 learnable queries (see the shape sketch after this list).
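
To make the data flow concrete, here is a minimal shape walk-through of the three components, assuming PyTorch and a 640x640 input. The channel sizes and the dummy tensors are illustrative assumptions, not the actual FocoosAI implementation; only the 1/4-1/32 resolutions, the 100 queries, and the 80 classes come from the description above.

import torch

H, W = 640, 640  # assumed input size, for illustration only

# 1) Backbone (ResNet-50): multi-scale features from stages 2-5
stage2 = torch.randn(1, 256, H // 4, W // 4)     # 1/4 resolution
stage3 = torch.randn(1, 512, H // 8, W // 8)     # 1/8 resolution
stage4 = torch.randn(1, 1024, H // 16, W // 16)  # 1/16 resolution
stage5 = torch.randn(1, 2048, H // 32, W // 32)  # 1/32 resolution

# 2) Pixel decoder: a transformer encoder refines stage 5, then an
#    FPN progressively upsamples back to 1/4 resolution
pixel_features = torch.randn(1, 256, H // 4, W // 4)

# 3) Transformer decoder: 100 learnable queries, each predicting one
#    (class, mask) pair; 80 classes + 1 "no object" class
num_queries, num_classes = 100, 80
class_logits = torch.randn(1, num_queries, num_classes + 1)
mask_logits = torch.randn(1, num_queries, H // 4, W // 4)

print(class_logits.shape)  # torch.Size([1, 100, 81])
print(mask_logits.shape)   # torch.Size([1, 100, 160, 160])

As in the original paper, the mask logits are predicted at 1/4 resolution and upsampled to the input size at inference time.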

Losses#

We use the same losses as the original paper:

  • loss_ce: Cross-entropy loss for the classification of the classes
  • loss_dice: Dice loss for the segmentation of the classes
  • loss_mask: A binary cross-entropy loss applied to the predicted segmentation masks

These losses are applied to each output of the transformer decoder, meaning that they are computed on the final output and on the auxiliary outputs of the intermediate decoder layers. Please refer to the Mask2Former paper for more details.
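
As an illustration, a minimal sketch of how the three terms could be combined for one (already matched) decoder output is shown below. The loss weights and the dice_loss helper are assumptions in the spirit of the paper, not the exact training code; the real pipeline additionally performs Hungarian matching between queries and ground-truth instances before computing the losses.

import torch
import torch.nn.functional as F

def dice_loss(mask_logits, gt_masks, eps=1.0):
    # Dice loss on sigmoid-activated, flattened masks
    pred = mask_logits.sigmoid().flatten(1)
    gt = gt_masks.flatten(1)
    numerator = 2 * (pred * gt).sum(-1)
    denominator = pred.sum(-1) + gt.sum(-1)
    return (1 - (numerator + eps) / (denominator + eps)).mean()

def mask2former_loss(class_logits, mask_logits, gt_classes, gt_masks,
                     w_ce=2.0, w_dice=5.0, w_mask=5.0):
    # Assumes queries are already matched 1:1 to ground-truth instances;
    # the weights are illustrative defaults, not the tuned training values
    loss_ce = F.cross_entropy(class_logits, gt_classes)
    loss_dice = dice_loss(mask_logits, gt_masks)
    loss_mask = F.binary_cross_entropy_with_logits(mask_logits, gt_masks)
    return w_ce * loss_ce + w_dice * loss_dice + w_mask * loss_mask

# Dummy matched outputs: 100 queries, 80 classes + "no object", 160x160 masks
class_logits = torch.randn(100, 81)
mask_logits = torch.randn(100, 160, 160)
gt_classes = torch.randint(0, 81, (100,))
gt_masks = (torch.rand(100, 160, 160) > 0.5).float()
print(mask2former_loss(class_logits, mask_logits, gt_classes, gt_masks))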

Output Format#

Before post-processing, the output of the model is a set of masks with associated class probabilities. In particular, the output is composed of three tensors:

  • class_ids: a tensor of 100 elements containing the class id associated with each mask (such as 1 for person, 2 for bicycle, etc.)
  • scores: a tensor of 100 elements containing the corresponding probability of the class_id
  • masks: a tensor of shape (100, H, W), where H and W are the height and width of the input image and each slice is the binary mask of the instance associated with the corresponding class_id

The model does not need NMS (non-maximum suppression): the output is already a fixed set of masks with associated class probabilities, and the model has been trained to avoid overlapping masks.

After post-processing, the output is a Focoos Detections object containing the predicted masks with confidence greater than a specific threshold (0.5 by default).
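
For illustration, the confidence filtering could be reproduced on the raw tensors as in the sketch below; filter_predictions is a hypothetical helper, not part of the Focoos SDK.

import torch

def filter_predictions(class_ids, scores, masks, threshold=0.5):
    # Keep only the masks whose confidence exceeds the threshold
    keep = scores > threshold
    return class_ids[keep], scores[keep], masks[keep]

# Dummy tensors matching the documented output format
class_ids = torch.randint(1, 81, (100,))  # one class id per mask
scores = torch.rand(100)                  # one confidence per mask
masks = torch.rand(100, 480, 640)         # (100, H, W) masks

ids, confidences, kept_masks = filter_predictions(class_ids, scores, masks)
print(f"{len(ids)} detections above threshold")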

Classes#

The model is pretrained on the COCO dataset with 80 classes.

ID  Class  Segmentation AP
1 person 48.9
2 bicycle 22.2
3 car 41.3
4 motorcycle 40.0
5 airplane 55.6
6 bus 68.2
7 train 69.6
8 truck 40.5
9 boat 26.2
10 traffic light 27.4
11 fire hydrant 69.2
12 stop sign 65.0
13 parking meter 45.4
14 bench 23.4
15 bird 33.8
16 cat 77.7
17 dog 68.9
18 horse 50.1
19 sheep 54.0
20 cow 51.0
21 elephant 63.4
22 bear 81.1
23 zebra 66.0
24 giraffe 60.5
25 backpack 22.7
26 umbrella 52.6
27 handbag 23.3
28 tie 33.2
29 suitcase 45.3
30 frisbee 66.4
31 skis 7.4
32 snowboard 28.2
33 sports ball 42.8
34 kite 30.3
35 baseball bat 32.1
36 baseball glove 42.3
37 skateboard 36.8
38 surfboard 37.3
39 tennis racket 58.7
40 bottle 39.2
41 wine glass 36.9
42 cup 46.0
43 fork 22.2
44 knife 17.8
45 spoon 18.0
46 bowl 44.3
47 banana 26.5
48 apple 23.9
49 sandwich 43.0
50 orange 33.8
51 broccoli 24.4
52 carrot 22.7
53 hot dog 36.3
54 pizza 55.1
55 donut 51.1
56 cake 44.6
57 chair 25.0
58 couch 47.7
59 potted plant 25.0
60 bed 45.0
61 dining table 22.9
62 toilet 67.6
63 tv 64.3
64 laptop 67.2
65 mouse 60.1
66 remote 36.1
67 keyboard 52.6
68 cell phone 42.0
69 microwave 60.7
70 oven 33.8
71 toaster 35.9
72 sink 39.9
73 refrigerator 64.0
74 book 12.0
75 clock 52.5
76 vase 37.7
77 scissors 26.8
78 teddy bear 55.1
79 hair drier 16.8
80 toothbrush 22.4
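
If you need to map predicted class_ids back to readable names, the table above can be turned into a simple lookup. The dictionary below is a hand-built excerpt from the table, not an official mapping shipped with the SDK; extend it with the remaining rows as needed.

# Hypothetical helper: excerpt of the id -> name mapping from the table above
COCO_CLASSES = {
    1: "person", 2: "bicycle", 3: "car", 4: "motorcycle", 5: "airplane",
    16: "cat", 17: "dog", 57: "chair", 80: "toothbrush",
}

def class_name(class_id: int) -> str:
    return COCO_CLASSES.get(class_id, f"unknown ({class_id})")

print(class_name(16))  # cat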

What are you waiting for? Try it!#

import os

from focoos import Focoos

# Initialize the Focoos client with your API key
focoos = Focoos(api_key=os.getenv("FOCOOS_API_KEY"))

# Get the remote model (fai-m2f-l-coco-ins) from Focoos API
model = focoos.get_remote_model("fai-m2f-l-coco-ins")

# Run inference on an image
predictions = model.infer("./image.jpg", threshold=0.5)

# Output the predictions
print(predictions)