FAI-MF (FocoosAI MaskFormer)#

Overview#

The FAI-MF model is a Mask2Former implementation optimized by FocoosAI for semantic and instance segmentation tasks. Unlike traditional segmentation models such as DeepLab, Mask2Former employs a mask-classification approach where predictions consist of segmentation masks paired with class probabilities.

Neural Network Architecture#

The FAI-MF model is built on the Mask2Former architecture, featuring a transformer-based encoder-decoder design with three main components:

*Figure: the Mask2Former architecture.*

Backbone#

  • Network: Any backbone that can extract multi-scale features from an image
  • Output: Multi-scale features from stages 2-5 at resolutions 1/4, 1/8, 1/16, and 1/32
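
To make the stride convention concrete, here is a small sketch of the spatial sizes those stage 2-5 feature maps take for the default 640-pixel input; the stage names `res2`-`res5` are illustrative rather than taken from the FAI-MF code.

```python
# Illustrative only: spatial sizes of the multi-scale features for a
# 640x640 input. Channel counts depend on the chosen backbone.
resolution = 640
strides = {"res2": 4, "res3": 8, "res4": 16, "res5": 32}  # stages 2-5

for name, stride in strides.items():
    side = resolution // stride
    print(f"{name}: stride 1/{stride} -> {side}x{side} feature map")
# res2: stride 1/4 -> 160x160 feature map
# res5: stride 1/32 -> 20x20 feature map
```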

Pixel Decoder#

  • Architecture: Feature Pyramid Network (FPN)
  • Input: Features from backbone stages 2-5
  • Modifications: Deformable attention modules removed for improved portability and inference speed
  • Output: Upscaled multi-scale features for mask generation

Transformer Decoder#

  • Design: Lightweight version of the original Mask2Former decoder
  • Layers: N decoder layers, where N depends on the speed/accuracy trade-off
  • Queries: Q learnable object queries (usually 100)
  • Components (sketched below):
    • Self-attention layers
    • Masked cross-attention layers
    • Feed-forward networks (FFN)
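
The listing below is a minimal PyTorch illustration of one such decoder layer, not the FAI-MF source: it wires masked cross-attention, self-attention, and an FFN around the object queries, with the 256-dimensional hidden size matching the defaults documented later on this page.

```python
import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    """Minimal sketch of one Mask2Former-style decoder layer."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim)
        )
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(dim) for _ in range(3))

    def forward(self, queries, pixel_feats, attn_mask=None):
        # Masked cross-attention: queries attend only to pixel locations
        # allowed by the previous layer's mask prediction (attn_mask marks
        # positions to ignore, per nn.MultiheadAttention semantics).
        q = self.norm1(
            queries
            + self.cross_attn(queries, pixel_feats, pixel_feats, attn_mask=attn_mask)[0]
        )
        # Self-attention among the object queries.
        q = self.norm2(q + self.self_attn(q, q, q)[0])
        # Position-wise feed-forward network.
        return self.norm3(q + self.ffn(q))

queries = torch.randn(2, 100, 256)          # [B, Q, dim]
pixel_feats = torch.randn(2, 40 * 40, 256)  # flattened 1/16 feature map
out = DecoderLayer()(queries, pixel_feats)  # [2, 100, 256]
```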

Configuration Parameters#

Core Model Parameters#

  • num_classes (int): Number of segmentation classes
  • num_queries (int, default=100): Number of learnable object queries
  • resolution (int, default=640): Input image resolution

Backbone Configuration#

  • backbone_config (BackboneConfig): Backbone network configuration

Architecture Dimensions#

  • pixel_decoder_out_dim (int, default=256): Pixel decoder output channels
  • pixel_decoder_feat_dim (int, default=256): Pixel decoder feature channels
  • transformer_predictor_hidden_dim (int, default=256): Transformer hidden dimension
  • transformer_predictor_dec_layers (int, default=6): Number of decoder layers
  • head_out_dim (int, default=256): Prediction head output dimension

Inference Configuration#

  • postprocessing_type (str): Either "semantic" or "instance" segmentation
  • mask_threshold (float, default=0.5): Binary mask threshold
  • threshold (float, default=0.5): Confidence threshold for the classification scores
  • predict_all_pixels (bool, default=False): Predict a class for every pixel, which is usually preferable for semantic segmentation
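
Taken together, these fields configure inference behavior. The snippet below assembles illustrative values into a MaskFormerConfig (the config class used in the full example at the end of this page); treating every inference field as a constructor argument is an assumption of the sketch.

```python
from focoos.models.fai_mf.config import MaskFormerConfig
from focoos.nn.backbone.stdc import STDCConfig

# Illustrative values only: a semantic-segmentation setup that predicts
# a class for every pixel, using the parameters documented above.
config = MaskFormerConfig(
    backbone_config=STDCConfig(model_type="stdc", use_pretrained=True),
    num_classes=150,                  # e.g. ADE20K
    num_queries=100,
    resolution=640,
    postprocessing_type="semantic",
    mask_threshold=0.5,
    threshold=0.5,
    predict_all_pixels=True,          # dense per-pixel argmax
)
```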

Supported Tasks#

Semantic Segmentation#

  • Output: Dense pixel-wise class predictions
  • Use case: Scene understanding, medical imaging, autonomous driving
  • Configuration: Set postprocessing_type="semantic"

Instance Segmentation#

  • Output: Individual object instances with masks and bounding boxes
  • Use case: Object detection and counting, robotics, surveillance
  • Configuration: Set postprocessing_type="instance"

Model Outputs#

Inner Model Output (MaskFormerModelOutput)#

  • masks (torch.Tensor): Shape [B, num_queries, H, W] - Query mask predictions
  • logits (torch.Tensor): Shape [B, num_queries, num_classes] - Class predictions
  • loss (Optional[dict]): Training losses including:
    • loss_ce: Cross-entropy classification loss
    • loss_mask: Binary cross-entropy mask loss
    • loss_dice: Dice coefficient loss
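
These two tensors are enough to build a dense prediction. The sketch below shows the standard MaskFormer-style reduction of class scores and query masks into a per-pixel semantic map; FAI-MF's semantic postprocessing plausibly follows this pattern, though the exact implementation may differ.

```python
import torch

B, Q, C, H, W = 1, 100, 150, 160, 160
logits = torch.randn(B, Q, C)    # [B, num_queries, num_classes]
masks = torch.randn(B, Q, H, W)  # [B, num_queries, H, W]

# Weight each query's mask by its class probabilities, sum over queries,
# then take a per-pixel argmax. (In the original Mask2Former the logits
# carry an extra "no object" column that is dropped before this step.)
semantic = torch.einsum("bqc,bqhw->bchw", logits.softmax(-1), masks.sigmoid())
pred = semantic.argmax(dim=1)    # [B, H, W] predicted class ids
```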

Inference Output (FocoosDetections)#

For each detected object:

  • bbox (List[float]): Bounding box coordinates [x1, y1, x2, y2]
  • conf (float): Confidence score
  • cls_id (int): Class identifier
  • mask (str): Base64-encoded binary mask
  • label (Optional[str]): Human-readable class name
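
To work with a detection's mask locally, the Base64 string must be decoded. The helper below assumes the payload wraps a standard image container such as PNG; that wire format is an assumption, so verify it against the FocoosDetections documentation if decoding fails.

```python
import base64
import io

import numpy as np
from PIL import Image

def decode_mask(mask_b64: str) -> np.ndarray:
    """Decode a detection's Base64 mask into a boolean HxW array.

    Assumes the payload is an image container (e.g. PNG); adjust if the
    actual encoding differs.
    """
    raw = base64.b64decode(mask_b64)
    img = Image.open(io.BytesIO(raw))
    return np.array(img) > 0
```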

Losses#

The model employs three complementary loss functions as described in the original paper:

  1. Cross-entropy Loss (loss_ce): Classification of object classes
  2. Dice Loss (loss_dice): Shape-aware segmentation loss
  3. Mask Loss (loss_mask): Binary cross-entropy on predicted masks
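
For intuition, here is a compact sketch of the two mask terms on matched query masks (the classification term is an ordinary cross-entropy). It mirrors the formulation in the Mask2Former paper rather than quoting the FAI-MF training code, and it omits the Hungarian matching between queries and ground truth.

```python
import torch
import torch.nn.functional as F

def mask_losses(mask_logits: torch.Tensor, targets: torch.Tensor):
    """Dice + BCE losses for matched query masks.

    mask_logits: [N, H, W] raw mask predictions for N matched queries
    targets:     [N, H, W] float binary ground-truth masks (0. or 1.)
    """
    # Binary cross-entropy on the raw logits (loss_mask).
    loss_mask = F.binary_cross_entropy_with_logits(mask_logits, targets)

    # Dice loss (loss_dice): 1 - 2|P∩G| / (|P| + |G|), computed per mask,
    # with +1 smoothing in numerator and denominator.
    probs = mask_logits.sigmoid().flatten(1)
    gts = targets.flatten(1)
    inter = (probs * gts).sum(-1)
    loss_dice = 1 - (2 * inter + 1) / (probs.sum(-1) + gts.sum(-1) + 1)
    return loss_mask, loss_dice.mean()
```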

Available Models#

Currently, five FAI-MF models are available on the Focoos Hub: two for semantic segmentation and three for instance segmentation.

Semantic Segmentation Models#

| Model Name   | Architecture             | Dataset | Metrics                    | FPS (Nvidia T4) |
|--------------|--------------------------|---------|----------------------------|-----------------|
| fai-mf-l-ade | Mask2Former (ResNet-101) | ADE20K  | mIoU: 48.27<br>mAcc: 62.15 | 73              |
| fai-mf-m-ade | Mask2Former (STDC-2)     | ADE20K  | mIoU: 45.32<br>mAcc: 57.75 | 127             |

Instance Segmentation Models#

| Model Name        | Architecture             | Dataset | Metrics                               | FPS (Nvidia T4) |
|-------------------|--------------------------|---------|---------------------------------------|-----------------|
| fai-mf-s-coco-ins | Mask2Former (ResNet-50)  | COCO    | segm/AP: 41.45<br>segm/AP50: 64.12    | 86              |
| fai-mf-m-coco-ins | Mask2Former (ResNet-101) | COCO    | segm/AP: 43.09<br>segm/AP50: 65.87    | 70              |
| fai-mf-l-coco-ins | Mask2Former (ResNet-101) | COCO    | segm/AP: 44.23<br>segm/AP50: 67.53    | 55              |

Example Usage#

Quick Start with Pre-trained Model#

```python
from PIL import Image

from focoos.model_manager import ModelManager

# Load a pre-trained FAI-MF model
model = ModelManager.get("fai-mf-l-ade")

# Run inference on an image
image = Image.open("path/to/image.jpg")
result = model(image)

# Process results
for detection in result.detections:
    print(f"Class: {detection.label}, Confidence: {detection.conf:.3f}")
```

Custom Model Configuration#

```python
from focoos.models.fai_mf.config import MaskFormerConfig
from focoos.models.fai_mf.modelling import FAIMaskFormer
from focoos.nn.backbone.stdc import STDCConfig

# Configure the backbone
backbone_config = STDCConfig(
    model_type="stdc",
    use_pretrained=True,
)

# Configure the model
config = MaskFormerConfig(
    backbone_config=backbone_config,
    num_classes=80,
    num_queries=300,
    transformer_predictor_dec_layers=3,
    threshold=0.5,
)

# Create the model
model = FAIMaskFormer(config)
```