# FAI-MF (FocoosAI MaskFormer)

## Overview
The FAI-MF model is a Mask2Former implementation optimized by FocoosAI for semantic and instance segmentation tasks. Unlike traditional per-pixel segmentation models such as DeepLab, Mask2Former employs a mask-classification approach: the model predicts a set of segmentation masks, each paired with class probabilities.
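To make the mask-classification formulation concrete, here is a shape-level sketch (tensor sizes are illustrative only, not taken from the focoos code):

```python
import torch

# A mask-classification model predicts Q (mask, class-probability) pairs
# rather than one label per pixel. With Q = 100 queries and K = 150 classes:
Q, K, H, W = 100, 150, 160, 160
masks = torch.randn(1, Q, H, W)    # one mask logit map per query
logits = torch.randn(1, Q, K + 1)  # per-query class scores (+1 "no object" class)
```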
## Neural Network Architecture
The FAI-MF model is built on the Mask2Former architecture, featuring a transformer-based encoder-decoder design with three main components:
### Backbone
- Network: Any backbone that can extract multi-scale features from an image
- Output: Multi-scale features from stages 2-5 at resolutions 1/4, 1/8, 1/16, and 1/32
### Pixel Decoder
- Architecture: Feature Pyramid Network (FPN)
- Input: Features from backbone stages 2-5
- Modifications: Deformable attention modules removed for improved portability and inference speed
- Output: Upscaled multi-scale features for mask generation
### Transformer Decoder
- Design: Lightweight version of the original Mask2Former decoder
- Layers: N decoder layers (N depends on the speed/accuracy trade-off)
- Queries: Q learnable object queries (usually 100)
- Components:
    - Self-attention layers
    - Masked cross-attention layers
    - Feed-forward networks (FFN)
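To see how the three components fit together, here is a runnable shape walkthrough. The channel sizes follow a typical ResNet backbone and are illustrative only; the actual modules live in the focoos code:

```python
import torch

B, H, W = 1, 640, 640

# Backbone: multi-scale features from stages 2-5 (strides 4, 8, 16, 32)
features = [torch.randn(B, c, H // s, W // s)
            for c, s in [(256, 4), (512, 8), (1024, 16), (2048, 32)]]

# Pixel decoder (FPN): fuses the pyramid into high-resolution mask features
mask_features = torch.randn(B, 256, H // 4, W // 4)

# Transformer decoder: Q learnable queries attend to the decoder features
Q, D = 100, 256
query_embeddings = torch.randn(B, Q, D)

# Masks come from the dot product between query embeddings and mask features
masks = torch.einsum("bqd,bdhw->bqhw", query_embeddings, mask_features)
print(masks.shape)  # torch.Size([1, 100, 160, 160])
```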
## Configuration Parameters

### Core Model Parameters

- `num_classes` (int): Number of segmentation classes
- `num_queries` (int, default=100): Number of learnable object queries
- `resolution` (int, default=640): Input image resolution
### Backbone Configuration

- `backbone_config` (BackboneConfig): Backbone network configuration
### Architecture Dimensions

- `pixel_decoder_out_dim` (int, default=256): Pixel decoder output channels
- `pixel_decoder_feat_dim` (int, default=256): Pixel decoder feature channels
- `transformer_predictor_hidden_dim` (int, default=256): Transformer hidden dimension
- `transformer_predictor_dec_layers` (int, default=6): Number of decoder layers
- `head_out_dim` (int, default=256): Prediction head output dimension
### Inference Configuration

- `postprocessing_type` (str): Either "semantic" or "instance" segmentation
- `mask_threshold` (float, default=0.5): Binary mask threshold
- `threshold` (float, default=0.5): Confidence threshold for the classification scores
- `predict_all_pixels` (bool, default=False): Predict a class for every pixel; usually better for semantic segmentation
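Taken together, the parameters above map onto a single config object. A minimal sketch follows; the `MaskFormerConfig` class name and import path are assumptions for illustration, so check the focoos source for the actual location:

```python
# Hypothetical import path - verify against the focoos package layout
from focoos.models.fai_mf.config import MaskFormerConfig

config = MaskFormerConfig(
    num_classes=150,                      # e.g. ADE20K
    num_queries=100,
    resolution=640,
    # backbone_config=...,                # BackboneConfig, see above
    pixel_decoder_out_dim=256,
    pixel_decoder_feat_dim=256,
    transformer_predictor_hidden_dim=256,
    transformer_predictor_dec_layers=6,
    head_out_dim=256,
    postprocessing_type="semantic",
    mask_threshold=0.5,
    threshold=0.5,
    predict_all_pixels=True,              # usually better for semantic segmentation
)
```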
## Supported Tasks

### Semantic Segmentation

- Output: Dense pixel-wise class predictions
- Use case: Scene understanding, medical imaging, autonomous driving
- Configuration: Set `postprocessing_type="semantic"`
### Instance Segmentation

- Output: Individual object instances with masks and bounding boxes
- Use case: Object detection and counting, robotics, surveillance
- Configuration: Set `postprocessing_type="instance"`
## Model Outputs

### Inner Model Output (`MaskFormerModelOutput`)

- `masks` (torch.Tensor): Shape [B, num_queries, H, W] - Query mask predictions
- `logits` (torch.Tensor): Shape [B, num_queries, num_classes] - Class predictions
- `loss` (Optional[dict]): Training losses, including:
    - `loss_ce`: Cross-entropy classification loss
    - `loss_mask`: Binary cross-entropy mask loss
    - `loss_dice`: Dice coefficient loss
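For semantic segmentation, the two prediction tensors are combined at inference time. This is a sketch of the standard Mask2Former reduction as published, not the exact focoos post-processing code (note that Mask2Former appends a "no object" class to the logits):

```python
import torch

B, Q, K, H, W = 1, 100, 150, 160, 160
masks = torch.randn(B, Q, H, W)          # query mask logits
logits = torch.randn(B, Q, K + 1)        # class logits (+1 "no object")

cls_prob = logits.softmax(-1)[..., :-1]  # drop the "no object" class
semseg = torch.einsum("bqk,bqhw->bkhw", cls_prob, masks.sigmoid())
labels = semseg.argmax(dim=1)            # [B, H, W] per-pixel class ids
```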
### Inference Output (`FocoosDetections`)

For each detected object:

- `bbox` (List[float]): Bounding box coordinates [x1, y1, x2, y2]
- `conf` (float): Confidence score
- `cls_id` (int): Class identifier
- `mask` (str): Base64-encoded binary mask
- `label` (Optional[str]): Human-readable class name
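Since `mask` arrives base64-encoded, it has to be decoded before use. A minimal sketch, assuming the mask is serialized as a PNG image (the actual payload format may differ; check the focoos docs):

```python
import base64
import io

import numpy as np
from PIL import Image

def decode_mask(mask_b64: str) -> np.ndarray:
    """Decode a base64-encoded binary mask into a boolean array."""
    raw = base64.b64decode(mask_b64)
    mask = np.array(Image.open(io.BytesIO(raw)))  # assumes a PNG payload
    return mask > 0
```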
## Losses

The model employs three complementary loss functions, as described in the original paper:

- Cross-entropy Loss (`loss_ce`): Classification of object classes
- Dice Loss (`loss_dice`): Shape-aware segmentation loss
- Mask Loss (`loss_mask`): Binary cross-entropy on predicted masks
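For reference, the dice loss follows the formulation from the Mask2Former paper; here is a self-contained sketch (the focoos implementation may differ in details such as loss weighting and point sampling):

```python
import torch

def dice_loss(pred_logits: torch.Tensor, targets: torch.Tensor,
              eps: float = 1.0) -> torch.Tensor:
    """Dice loss over flattened masks; pred_logits and targets are [N, H*W]."""
    probs = pred_logits.sigmoid()
    numerator = 2 * (probs * targets).sum(-1)
    denominator = probs.sum(-1) + targets.sum(-1)
    return (1 - (numerator + eps) / (denominator + eps)).mean()
```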
## Available Models

Currently there are five FAI-MF models on the Focoos Hub: two for semantic segmentation and three for instance segmentation.
### Semantic Segmentation Models

| Model Name | Architecture | Dataset | Metrics | FPS (Nvidia T4) |
|---|---|---|---|---|
| fai-mf-l-ade | Mask2Former (ResNet-101) | ADE20K | mIoU: 48.27, mAcc: 62.15 | 73 |
| fai-mf-m-ade | Mask2Former (STDC-2) | ADE20K | mIoU: 45.32, mAcc: 57.75 | 127 |
### Instance Segmentation Models

| Model Name | Architecture | Dataset | Metrics | FPS (Nvidia T4) |
|---|---|---|---|---|
| fai-mf-s-coco-ins | Mask2Former (ResNet-50) | COCO | segm/AP: 41.45, segm/AP50: 64.12 | 86 |
| fai-mf-m-coco-ins | Mask2Former (ResNet-101) | COCO | segm/AP: 43.09, segm/AP50: 65.87 | 70 |
| fai-mf-l-coco-ins | Mask2Former (ResNet-101) | COCO | segm/AP: 44.23, segm/AP50: 67.53 | 55 |
## Example Usage

### Quick Start with Pre-trained Model
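The original snippet did not survive extraction; below is a minimal sketch of loading a pre-trained FAI-MF model from the Focoos Hub, assuming the `ModelManager.get` entry point exposed by the focoos SDK (verify the exact names against the current SDK docs):

```python
from focoos import ModelManager

# Load a pre-trained FAI-MF model from the Focoos Hub by name
model = ModelManager.get("fai-mf-l-ade")

# Run inference; `threshold` filters out low-confidence predictions
detections = model.infer("image.jpg", threshold=0.5, annotate=True)

# FocoosDetections: one entry per detected object (see Model Outputs above);
# the `.detections` attribute name is an assumption
for det in detections.detections:
    print(det.label, det.conf, det.bbox)
```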
### Custom Model Configuration
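The custom-configuration snippet was likewise lost; this sketch assembles a model from a hand-built config using the parameters documented above. All import paths and the `MaskFormerModel` class name are assumptions for illustration; consult the focoos source for the real ones:

```python
# Hypothetical import paths - verify against the focoos package layout
from focoos.models.fai_mf.config import MaskFormerConfig
from focoos.models.fai_mf.modelling import MaskFormerModel

config = MaskFormerConfig(
    num_classes=10,                      # e.g. a custom dataset with 10 classes
    num_queries=100,
    resolution=640,
    # backbone_config=...,               # BackboneConfig, see "Backbone Configuration"
    transformer_predictor_dec_layers=4,  # fewer layers trade accuracy for speed
    postprocessing_type="instance",
    threshold=0.5,
    mask_threshold=0.5,
)

model = MaskFormerModel(config)          # hypothetical model class
```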