remote model

RemoteModel Module

This module provides a class for managing remote models in the Focoos ecosystem. It supports model training, deployment, inference, and monitoring.

Classes:

RemoteModel: A class for interacting with remote models, managing their lifecycle, and performing inference.

Modules:

HttpClient: Handles HTTP requests.
logger: Logging utility.
BoxAnnotator, LabelAnnotator, MaskAnnotator: Annotation tools for visualizing detections and segmentation tasks.
FocoosDet, FocoosDetections: Classes for representing and managing detections.
FocoosTask: Enum for defining supported tasks (e.g., DETECTION, SEMSEG).
Hyperparameters: Structure for training configuration parameters.
ModelMetadata: Contains metadata for the model.
ModelStatus: Enum for representing the current status of the model.
TrainInstance: Enum for defining available training instances.
image_loader: Utility function for loading images.
focoos_detections_to_supervision: Converter for Focoos detections to supervision format.

RemoteModel #

Represents a remote model in the Focoos platform.

Attributes:

model_ref (str): Reference ID for the model.
http_client (HttpClient): Client for making HTTP requests.
max_deploy_wait (int): Maximum wait time for model deployment.
metadata (ModelMetadata): Metadata of the model.
label_annotator (LabelAnnotator): Annotator for adding labels to images.
box_annotator (BoxAnnotator): Annotator for drawing bounding boxes.
mask_annotator (MaskAnnotator): Annotator for drawing masks on images.

Source code in focoos/remote_model.py
class RemoteModel:
    """
    Represents a remote model in the Focoos platform.

    Attributes:
        model_ref (str): Reference ID for the model.
        http_client (HttpClient): Client for making HTTP requests.
        max_deploy_wait (int): Maximum wait time for model deployment.
        metadata (ModelMetadata): Metadata of the model.
        label_annotator (LabelAnnotator): Annotator for adding labels to images.
        box_annotator (sv.BoxAnnotator): Annotator for drawing bounding boxes.
        mask_annotator (sv.MaskAnnotator): Annotator for drawing masks on images.
    """

    def __init__(self, model_ref: str, http_client: HttpClient):
        """
        Initialize the RemoteModel instance.

        Args:
            model_ref (str): Reference ID for the model.
            http_client (HttpClient): HTTP client instance for communication.

        Raises:
            ValueError: If model metadata retrieval fails.
        """
        self.model_ref = model_ref
        self.http_client = http_client
        self.max_deploy_wait = 10
        self.metadata: ModelMetadata = self.get_info()

        self.label_annotator = sv.LabelAnnotator(text_padding=10, border_radius=10)
        self.box_annotator = sv.BoxAnnotator()
        self.mask_annotator = sv.MaskAnnotator()
        logger.info(
            f"[RemoteModel]: ref: {self.model_ref} name: {self.metadata.name} description: {self.metadata.description} status: {self.metadata.status}"
        )

    def get_info(self) -> ModelMetadata:
        """
        Retrieve model metadata.

        Returns:
            ModelMetadata: Metadata of the model.

        Raises:
            ValueError: If the request fails.
        """
        res = self.http_client.get(f"models/{self.model_ref}")
        if res.status_code != 200:
            logger.error(f"Failed to get model info: {res.status_code} {res.text}")
            raise ValueError(f"Failed to get model info: {res.status_code} {res.text}")
        self.metadata = ModelMetadata(**res.json())
        return self.metadata

    def train(
        self,
        dataset_ref: str,
        hyperparameters: Hyperparameters,
        instance_type: TrainInstance = TrainInstance.ML_G4DN_XLARGE,
        volume_size: int = 50,
        max_runtime_in_seconds: int = 36000,
    ) -> dict | None:
        """
        Initiate the training of a remote model on the Focoos platform.

        This method sends a request to the Focoos platform to start the training process for the model
        referenced by `self.model_ref`. It requires a dataset reference and hyperparameters for training,
        as well as optional configuration options for the instance type, volume size, and runtime.

        Args:
            dataset_ref (str): The reference ID of the dataset to be used for training.
            hyperparameters (Hyperparameters): A structure containing the hyperparameters for the training process.
            instance_type (TrainInstance, optional): The type of training instance to use. Defaults to TrainInstance.ML_G4DN_XLARGE.
            volume_size (int, optional): The size of the disk volume (in GB) for the training instance. Defaults to 50.
            max_runtime_in_seconds (int, optional): The maximum runtime for training in seconds. Defaults to 36000.

        Returns:
            dict | None: The parsed JSON response from the training initiation request. The content depends on the Focoos platform's response.

        Raises:
            ValueError: If the request to start training fails (e.g., due to incorrect parameters or server issues).
        """
        res = self.http_client.post(
            f"models/{self.model_ref}/train",
            data={
                "dataset_ref": dataset_ref,
                "instance_type": instance_type,
                "volume_size": volume_size,
                "max_runtime_in_seconds": max_runtime_in_seconds,
                "hyperparameters": hyperparameters.model_dump(),
            },
        )
        if res.status_code != 200:
            logger.warning(f"Failed to train model: {res.status_code} {res.text}")
            raise ValueError(f"Failed to train model: {res.status_code} {res.text}")
        return res.json()

    def train_info(self) -> Optional[TrainingInfo]:
        """
        Retrieve the current status of the model training.

        Sends a request to check the training status of the model referenced by `self.model_ref`.

        Returns:
            TrainingInfo: An object containing the training status information.

        Raises:
            ValueError: If the request to get training status fails.
        """
        res = self.http_client.get(f"models/{self.model_ref}/train/status")
        if res.status_code != 200:
            logger.error(f"Failed to get train status: {res.status_code} {res.text}")
            raise ValueError(f"Failed to get train status: {res.status_code} {res.text}")
        return TrainingInfo(**res.json())

    def train_logs(self) -> list[str]:
        """
        Retrieve the training logs for the model.

        This method sends a request to fetch the logs of the model's training process. If the request
        is successful (status code 200), it returns the logs as a list of strings. If the request fails,
        it logs a warning and returns an empty list.

        Returns:
            list[str]: A list of training logs as strings.

        Raises:
            None: Returns an empty list if the request fails.
        """
        res = self.http_client.get(f"models/{self.model_ref}/train/logs")
        if res.status_code != 200:
            logger.warning(f"Failed to get train logs: {res.status_code} {res.text}")
            return []
        return res.json()

    def metrics(self) -> Metrics:  # noqa: F821
        """
        Retrieve the metrics of the model.

        This method sends a request to fetch the metrics of the model identified by `model_ref`.
        If the request is successful (status code 200), it returns the metrics as a `Metrics` object.
        If the request fails, it logs a warning and returns an empty `Metrics` object.

        Returns:
            Metrics: An object containing the metrics of the model.

        Raises:
            None: Returns an empty `Metrics` object if the request fails.
        """
        res = self.http_client.get(f"models/{self.model_ref}/metrics")
        if res.status_code != 200:
            logger.warning(f"Failed to get metrics: {res.status_code} {res.text}")
            return Metrics()  # noqa: F821
        return Metrics(**res.json())

    def _annotate(self, im: np.ndarray, detections: sv.Detections) -> np.ndarray:
        """
        Annotate an image with detection results.

        This method adds visual annotations to the provided image based on the model's detection results.
        It handles different tasks (e.g., object detection, semantic segmentation, instance segmentation)
        and uses the corresponding annotator (bounding box, label, or mask) to draw on the image.

        Args:
            im (np.ndarray): The image to be annotated, represented as a NumPy array.
            detections (sv.Detections): The detection results to be annotated, including class IDs and confidence scores.

        Returns:
            np.ndarray: The annotated image as a NumPy array.
        """
        classes = self.metadata.classes
        if classes is not None:
            labels = [
                f"{classes[int(class_id)]}: {confid * 100:.0f}%"
                for class_id, confid in zip(detections.class_id, detections.confidence)
            ]
        else:
            labels = [
                f"{str(class_id)}: {confid * 100:.0f}%"
                for class_id, confid in zip(detections.class_id, detections.confidence)
            ]
        # Start from a copy so the input is never modified and the return value
        # is always defined, even for a task that has no annotator branch.
        annotated_im = im.copy()
        if self.metadata.task == FocoosTask.DETECTION:
            annotated_im = self.box_annotator.annotate(scene=annotated_im, detections=detections)
            annotated_im = self.label_annotator.annotate(scene=annotated_im, detections=detections, labels=labels)
        elif self.metadata.task in [
            FocoosTask.SEMSEG,
            FocoosTask.INSTANCE_SEGMENTATION,
        ]:
            annotated_im = self.mask_annotator.annotate(scene=annotated_im, detections=detections)
        return annotated_im

    def infer(
        self,
        image: Union[str, Path, np.ndarray, bytes],
        threshold: float = 0.5,
        annotate: bool = False,
    ) -> Tuple[FocoosDetections, Optional[np.ndarray]]:
        """
        Perform inference on the provided image using the remote model.

        This method sends an image to the remote model for inference and retrieves the detection results.
        Optionally, it can annotate the image with the detection results.

        Args:
            image (Union[str, Path, np.ndarray, bytes]): The image to infer on: a file path (str or Path), an image array, or raw encoded bytes.
            threshold (float, optional): The confidence threshold for detections. Defaults to 0.5.
            annotate (bool, optional): Whether to annotate the image with the detection results. Defaults to False.

        Returns:
            Tuple[FocoosDetections, Optional[np.ndarray]]:
                - FocoosDetections: The detection results including class IDs, confidence scores, etc.
                - Optional[np.ndarray]: The annotated image if `annotate` is True, else None.

        Raises:
            FileNotFoundError: If the provided image file path is invalid.
            ValueError: If the inference request fails.
        """
        image_bytes = None
        if isinstance(image, (str, Path)):
            if not os.path.exists(image):
                logger.error(f"Image file not found: {image}")
                raise FileNotFoundError(f"Image file not found: {image}")
            # Use a context manager so the file handle is always closed.
            with open(image, "rb") as f:
                image_bytes = f.read()
        elif isinstance(image, np.ndarray):
            _, buffer = cv2.imencode(".jpg", image)
            image_bytes = buffer.tobytes()
        else:
            image_bytes = image
        files = {"file": image_bytes}
        t0 = time.time()
        res = self.http_client.post(
            f"models/{self.model_ref}/inference?confidence_threshold={threshold}",
            files=files,
        )
        t1 = time.time()
        if res.status_code == 200:
            logger.debug(f"Inference time: {t1 - t0:.3f} seconds")
            detections = FocoosDetections(
                detections=[FocoosDet.from_json(d) for d in res.json().get("detections", [])],
                latency=res.json().get("latency", None),
            )
            preview = None
            if annotate:
                im0 = image_loader(image)
                sv_detections = focoos_detections_to_supervision(detections)
                preview = self._annotate(im0, sv_detections)
            return detections, preview
        else:
            logger.error(f"Failed to infer: {res.status_code} {res.text}")
            raise ValueError(f"Failed to infer: {res.status_code} {res.text}")

    def notebook_monitor_train(self, interval: int = 30, plot_metrics: bool = False, max_runtime: int = 36000) -> None:
        """
        Monitor the training process in a Jupyter notebook and display metrics.

        Periodically checks the training status and displays metrics in a notebook cell.
        Clears previous output to maintain a clean view.

        Args:
            interval (int): Time between status checks in seconds. Must be 30-240. Default: 30
            plot_metrics (bool): Whether to plot metrics graphs. Default: False
            max_runtime (int): Maximum monitoring time in seconds. Default: 36000 (10 hours)

        Returns:
            None
        """
        from IPython.display import clear_output

        if not 30 <= interval <= 240:
            raise ValueError("Interval must be between 30 and 240 seconds")

        last_update = self.get_info().updated_at
        start_time = time.time()
        status_history = []

        while True:
            # Get current status
            model_info = self.get_info()
            status = model_info.status

            # Clear and display status
            clear_output(wait=True)
            status_msg = f"[Live Monitor {self.metadata.name}] {status.value}"
            status_history.append(status_msg)
            for msg in status_history:
                logger.info(msg)

            # Show metrics if training completed
            if status == ModelStatus.TRAINING_COMPLETED:
                metrics = self.metrics()
                if metrics.best_valid_metric:
                    logger.info(f"Best Checkpoint (iter: {metrics.best_valid_metric.get('iteration', 'N/A')}):")
                    for k, v in metrics.best_valid_metric.items():
                        logger.info(f"  {k}: {v}")
                    visualizer = MetricsVisualizer(metrics)
                    visualizer.log_metrics()
                    if plot_metrics:
                        visualizer.notebook_plot_training_metrics()

            # Update metrics during training
            if status == ModelStatus.TRAINING_RUNNING and model_info.updated_at > last_update:
                last_update = model_info.updated_at
                metrics = self.metrics()
                visualizer = MetricsVisualizer(metrics)
                visualizer.log_metrics()
                if plot_metrics:
                    visualizer.notebook_plot_training_metrics()

            # Check exit conditions
            if status not in [ModelStatus.CREATED, ModelStatus.TRAINING_RUNNING, ModelStatus.TRAINING_STARTING]:
                return

            if time.time() - start_time > max_runtime:
                logger.warning(f"Monitoring exceeded {max_runtime} seconds limit")
                return

            sleep(interval)

    def stop_training(self) -> None:
        """
        Stop the training process of the model.

        This method sends a request to stop the training of the model identified by `model_ref`.
        If the request fails, an error is logged and a `ValueError` is raised.

        Raises:
            ValueError: If the stop training request fails.

        Logs:
            - Error message if the request to stop training fails, including the status code and response text.

        Returns:
            None: This method does not return any value.
        """
        res = self.http_client.delete(f"models/{self.model_ref}/train")
        if res.status_code != 200:
            logger.error(f"Failed to stop training: {res.status_code} {res.text}")
            raise ValueError(f"Failed to stop training: {res.status_code} {res.text}")

    def delete_model(self) -> None:
        """
        Delete the model from the system.

        This method sends a request to delete the model identified by `model_ref`.
        If the request fails or the status code is not 204 (No Content), an error is logged
        and a `ValueError` is raised.

        Raises:
            ValueError: If the delete model request fails or does not return a 204 status code.

        Logs:
            - Error message if the request to delete the model fails, including the status code and response text.

        Returns:
            None: This method does not return any value.
        """
        res = self.http_client.delete(f"models/{self.model_ref}")
        if res.status_code != 204:
            logger.error(f"Failed to delete model: {res.status_code} {res.text}")
            raise ValueError(f"Failed to delete model: {res.status_code} {res.text}")
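
The label strings that `_annotate` builds in the class source above can be reproduced in isolation. The following is a minimal sketch, not the library's code: `build_labels` is a hypothetical helper, and the class names and confidence values are made up for illustration.

```python
from typing import Optional, Sequence


def build_labels(
    class_ids: Sequence[int],
    confidences: Sequence[float],
    classes: Optional[Sequence[str]] = None,
) -> list[str]:
    """Format one 'name: NN%' label per detection, mirroring the f-strings in _annotate."""
    labels = []
    for class_id, confidence in zip(class_ids, confidences):
        # Fall back to the raw class ID when no class names are available.
        name = classes[int(class_id)] if classes is not None else str(class_id)
        labels.append(f"{name}: {confidence * 100:.0f}%")
    return labels


# Illustrative inputs only; real values come from the model metadata and detections.
labels_named = build_labels([0, 2], [0.91, 0.5], classes=["cat", "dog", "bird"])
labels_raw = build_labels([0, 2], [0.91, 0.5])
print(labels_named)
print(labels_raw)
```

When `metadata.classes` is missing, the numeric class ID is shown instead of a name, which matches the `else` branch in `_annotate`.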

__init__(model_ref, http_client) #

Initialize the RemoteModel instance.

Parameters:

Name Type Description Default
model_ref str

Reference ID for the model.

required
http_client HttpClient

HTTP client instance for communication.

required

Raises:

Type Description
ValueError

If model metadata retrieval fails.

Source code in focoos/remote_model.py
def __init__(self, model_ref: str, http_client: HttpClient):
    """
    Initialize the RemoteModel instance.

    Args:
        model_ref (str): Reference ID for the model.
        http_client (HttpClient): HTTP client instance for communication.

    Raises:
        ValueError: If model metadata retrieval fails.
    """
    self.model_ref = model_ref
    self.http_client = http_client
    self.max_deploy_wait = 10
    self.metadata: ModelMetadata = self.get_info()

    self.label_annotator = sv.LabelAnnotator(text_padding=10, border_radius=10)
    self.box_annotator = sv.BoxAnnotator()
    self.mask_annotator = sv.MaskAnnotator()
    logger.info(
        f"[RemoteModel]: ref: {self.model_ref} name: {self.metadata.name} description: {self.metadata.description} status: {self.metadata.status}"
    )
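
Because the constructor calls `get_info()` immediately, a bad model reference fails at construction time rather than on first use. The sketch below illustrates that behavior with stand-ins: `FakeResponse`, `FakeHttpClient`, and `MiniRemoteModel` are hypothetical names, the payload fields are invented, and the real `HttpClient` interface may differ.

```python
from dataclasses import dataclass


@dataclass
class FakeResponse:
    """Hypothetical stand-in for the HTTP response object."""
    status_code: int
    payload: dict
    text: str = ""

    def json(self) -> dict:
        return self.payload


class FakeHttpClient:
    """Hypothetical stand-in for HttpClient; only get() is sketched."""

    def __init__(self, responses: dict):
        self.responses = responses

    def get(self, path: str) -> FakeResponse:
        # Unknown paths behave like a 404 from the server.
        return self.responses.get(path, FakeResponse(404, {}, "not found"))


class MiniRemoteModel:
    """Reduced copy of the constructor logic: metadata is fetched eagerly."""

    def __init__(self, model_ref: str, http_client: FakeHttpClient):
        self.model_ref = model_ref
        self.http_client = http_client
        self.metadata = self.get_info()

    def get_info(self) -> dict:
        res = self.http_client.get(f"models/{self.model_ref}")
        if res.status_code != 200:
            raise ValueError(f"Failed to get model info: {res.status_code} {res.text}")
        return res.json()


# The payload below is invented for illustration.
client = FakeHttpClient({"models/abc123": FakeResponse(200, {"name": "demo"})})
model = MiniRemoteModel("abc123", client)
print(model.metadata["name"])

try:
    MiniRemoteModel("missing", client)
    init_error = None
except ValueError as err:
    init_error = str(err)
print(init_error)
```

Callers that construct many models may want to wrap construction in a `try`/`except ValueError` block for this reason.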

delete_model() #

Delete the model from the system.

This method sends a request to delete the model identified by model_ref. If the request fails or the status code is not 204 (No Content), an error is logged and a ValueError is raised.

Raises:

Type Description
ValueError

If the delete model request fails or does not return a 204 status code.

Logs
  • Error message if the request to delete the model fails, including the status code and response text.

Returns:

Name Type Description
None None

This method does not return any value.

Source code in focoos/remote_model.py
def delete_model(self) -> None:
    """
    Delete the model from the system.

    This method sends a request to delete the model identified by `model_ref`.
    If the request fails or the status code is not 204 (No Content), an error is logged
    and a `ValueError` is raised.

    Raises:
        ValueError: If the delete model request fails or does not return a 204 status code.

    Logs:
        - Error message if the request to delete the model fails, including the status code and response text.

    Returns:
        None: This method does not return any value.
    """
    res = self.http_client.delete(f"models/{self.model_ref}")
    if res.status_code != 204:
        logger.error(f"Failed to delete model: {res.status_code} {res.text}")
        raise ValueError(f"Failed to delete model: {res.status_code} {res.text}")
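
Note that `delete_model` accepts only 204, while most other methods on this page compare against 200. The check, log, and raise sequence is the same everywhere, so it could be factored out. The helper below is a sketch of that idea under stated assumptions: `ensure_status` is a hypothetical name and is not part of the Focoos SDK, and `SimpleNamespace` stands in for a real response object.

```python
import logging
from types import SimpleNamespace

logger = logging.getLogger("remote_model_sketch")


def ensure_status(res, expected: int, action: str) -> None:
    """Raise ValueError unless the response carries the expected status code."""
    if res.status_code != expected:
        message = f"Failed to {action}: {res.status_code} {res.text}"
        logger.error(message)
        raise ValueError(message)


# A 204 (No Content) response passes silently.
ensure_status(SimpleNamespace(status_code=204, text=""), 204, "delete model")

# A 200 response is still treated as a failure when 204 is expected.
try:
    ensure_status(SimpleNamespace(status_code=200, text="OK"), 204, "delete model")
    error_message = None
except ValueError as err:
    error_message = str(err)

print(error_message)
```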

get_info() #

Retrieve model metadata.

Returns:

Name Type Description
ModelMetadata ModelMetadata

Metadata of the model.

Raises:

Type Description
ValueError

If the request fails.

Source code in focoos/remote_model.py
def get_info(self) -> ModelMetadata:
    """
    Retrieve model metadata.

    Returns:
        ModelMetadata: Metadata of the model.

    Raises:
        ValueError: If the request fails.
    """
    res = self.http_client.get(f"models/{self.model_ref}")
    if res.status_code != 200:
        logger.error(f"Failed to get model info: {res.status_code} {res.text}")
        raise ValueError(f"Failed to get model info: {res.status_code} {res.text}")
    self.metadata = ModelMetadata(**res.json())
    return self.metadata

infer(image, threshold=0.5, annotate=False) #

Perform inference on the provided image using the remote model.

This method sends an image to the remote model for inference and retrieves the detection results. Optionally, it can annotate the image with the detection results.

Parameters:

image (Union[str, Path, np.ndarray, bytes]): The image to infer on: a file path (str or Path), an image array, or raw encoded bytes. Required.
threshold (float): The confidence threshold for detections. Defaults to 0.5.
annotate (bool): Whether to annotate the image with the detection results. Defaults to False.

Returns:

Tuple[FocoosDetections, Optional[np.ndarray]]:
  • FocoosDetections: The detection results including class IDs, confidence scores, etc.
  • Optional[np.ndarray]: The annotated image if annotate is True, else None.

Raises:

Type Description
FileNotFoundError

If the provided image file path is invalid.

ValueError

If the inference request fails.

Source code in focoos/remote_model.py
def infer(
    self,
    image: Union[str, Path, np.ndarray, bytes],
    threshold: float = 0.5,
    annotate: bool = False,
) -> Tuple[FocoosDetections, Optional[np.ndarray]]:
    """
    Perform inference on the provided image using the remote model.

    This method sends an image to the remote model for inference and retrieves the detection results.
    Optionally, it can annotate the image with the detection results.

    Args:
        image (Union[str, Path, np.ndarray, bytes]): The image to infer on: a file path (str or Path), an image array, or raw encoded bytes.
        threshold (float, optional): The confidence threshold for detections. Defaults to 0.5.
        annotate (bool, optional): Whether to annotate the image with the detection results. Defaults to False.

    Returns:
        Tuple[FocoosDetections, Optional[np.ndarray]]:
            - FocoosDetections: The detection results including class IDs, confidence scores, etc.
            - Optional[np.ndarray]: The annotated image if `annotate` is True, else None.

    Raises:
        FileNotFoundError: If the provided image file path is invalid.
        ValueError: If the inference request fails.
    """
    image_bytes = None
    if isinstance(image, (str, Path)):
        if not os.path.exists(image):
            logger.error(f"Image file not found: {image}")
            raise FileNotFoundError(f"Image file not found: {image}")
        # Use a context manager so the file handle is always closed.
        with open(image, "rb") as f:
            image_bytes = f.read()
    elif isinstance(image, np.ndarray):
        _, buffer = cv2.imencode(".jpg", image)
        image_bytes = buffer.tobytes()
    else:
        image_bytes = image
    files = {"file": image_bytes}
    t0 = time.time()
    res = self.http_client.post(
        f"models/{self.model_ref}/inference?confidence_threshold={threshold}",
        files=files,
    )
    t1 = time.time()
    if res.status_code == 200:
        logger.debug(f"Inference time: {t1 - t0:.3f} seconds")
        detections = FocoosDetections(
            detections=[FocoosDet.from_json(d) for d in res.json().get("detections", [])],
            latency=res.json().get("latency", None),
        )
        preview = None
        if annotate:
            im0 = image_loader(image)
            sv_detections = focoos_detections_to_supervision(detections)
            preview = self._annotate(im0, sv_detections)
        return detections, preview
    else:
        logger.error(f"Failed to infer: {res.status_code} {res.text}")
        raise ValueError(f"Failed to infer: {res.status_code} {res.text}")
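
The first step of `infer` is to normalize the input into raw bytes before uploading it. The sketch below covers only the path and bytes branches with the standard library. `to_image_bytes` is a hypothetical helper, the NumPy branch (JPEG encoding via OpenCV) is deliberately left out, and the sample file contains placeholder bytes rather than a real image.

```python
import os
import tempfile
from pathlib import Path
from typing import Union


def to_image_bytes(image: Union[str, Path, bytes]) -> bytes:
    """Return raw bytes for a path-like or bytes input (ndarray branch omitted)."""
    if isinstance(image, (str, Path)):
        if not os.path.exists(image):
            raise FileNotFoundError(f"Image file not found: {image}")
        # A context manager closes the file handle even if reading fails.
        with open(image, "rb") as f:
            return f.read()
    # Anything else is assumed to be already-encoded bytes.
    return image


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.jpg"
    sample.write_bytes(b"\xff\xd8fake-jpeg")  # placeholder bytes, not a real image
    from_path = to_image_bytes(sample)
    from_str = to_image_bytes(str(sample))

from_bytes = to_image_bytes(b"\xff\xd8fake-jpeg")

try:
    to_image_bytes("does/not/exist.jpg")
    missing_raised = False
except FileNotFoundError:
    missing_raised = True
```

All three input forms end up as the same byte string, which is what gets posted as the `file` field of the inference request.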

metrics() #

Retrieve the metrics of the model.

This method sends a request to fetch the metrics of the model identified by model_ref. If the request is successful (status code 200), it returns the metrics as a Metrics object. If the request fails, it logs a warning and returns an empty Metrics object.

Returns:

Name Type Description
Metrics Metrics

An object containing the metrics of the model.

Raises:

Nothing. A failed request is logged as a warning and an empty Metrics object is returned.

Source code in focoos/remote_model.py
def metrics(self) -> Metrics:  # noqa: F821
    """
    Retrieve the metrics of the model.

    This method sends a request to fetch the metrics of the model identified by `model_ref`.
    If the request is successful (status code 200), it returns the metrics as a `Metrics` object.
    If the request fails, it logs a warning and returns an empty `Metrics` object.

    Returns:
        Metrics: An object containing the metrics of the model.

    Raises:
        None: Returns an empty `Metrics` object if the request fails.
    """
    res = self.http_client.get(f"models/{self.model_ref}/metrics")
    if res.status_code != 200:
        logger.warning(f"Failed to get metrics: {res.status_code} {res.text}")
        return Metrics()  # noqa: F821
    return Metrics(**res.json())
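
Since `metrics()` returns an empty object instead of raising, callers need to check for missing data before reading it, as `notebook_monitor_train` does with `best_valid_metric`. The sketch below shows that guard. `MetricsSketch` and `describe_best` are hypothetical names, only the `best_valid_metric` field is taken from the source, and the metric values are invented.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MetricsSketch:
    """Hypothetical stand-in for Metrics; only best_valid_metric is taken from the source."""
    best_valid_metric: Optional[dict] = None


def describe_best(metrics: MetricsSketch) -> list[str]:
    """Return printable lines for the best checkpoint, or nothing if metrics are empty."""
    if not metrics.best_valid_metric:
        return []
    best = metrics.best_valid_metric
    lines = [f"Best Checkpoint (iter: {best.get('iteration', 'N/A')}):"]
    lines += [f"  {k}: {v}" for k, v in best.items()]
    return lines


# Illustrative values only.
empty_lines = describe_best(MetricsSketch())
full_lines = describe_best(MetricsSketch(best_valid_metric={"iteration": 1200, "mAP": 0.42}))
print("\n".join(full_lines))
```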

notebook_monitor_train(interval=30, plot_metrics=False, max_runtime=36000) #

Monitor the training process in a Jupyter notebook and display metrics.

Periodically checks the training status and displays metrics in a notebook cell. Clears previous output to maintain a clean view.

Parameters:

Name Type Description Default
interval int

Time between status checks in seconds. Must be 30-240. Default: 30

30
plot_metrics bool

Whether to plot metrics graphs. Default: False

False
max_runtime int

Maximum monitoring time in seconds. Default: 36000 (10 hours)

36000

Returns:

Type Description
None

None

Source code in focoos/remote_model.py
def notebook_monitor_train(self, interval: int = 30, plot_metrics: bool = False, max_runtime: int = 36000) -> None:
    """
    Monitor the training process in a Jupyter notebook and display metrics.

    Periodically checks the training status and displays metrics in a notebook cell.
    Clears previous output to maintain a clean view.

    Args:
        interval (int): Time between status checks in seconds. Must be 30-240. Default: 30
        plot_metrics (bool): Whether to plot metrics graphs. Default: False
        max_runtime (int): Maximum monitoring time in seconds. Default: 36000 (10 hours)

    Returns:
        None
    """
    from IPython.display import clear_output

    if not 30 <= interval <= 240:
        raise ValueError("Interval must be between 30 and 240 seconds")

    last_update = self.get_info().updated_at
    start_time = time.time()
    status_history = []

    while True:
        # Get current status
        model_info = self.get_info()
        status = model_info.status

        # Clear and display status
        clear_output(wait=True)
        status_msg = f"[Live Monitor {self.metadata.name}] {status.value}"
        status_history.append(status_msg)
        for msg in status_history:
            logger.info(msg)

        # Show metrics if training completed
        if status == ModelStatus.TRAINING_COMPLETED:
            metrics = self.metrics()
            if metrics.best_valid_metric:
                logger.info(f"Best Checkpoint (iter: {metrics.best_valid_metric.get('iteration', 'N/A')}):")
                for k, v in metrics.best_valid_metric.items():
                    logger.info(f"  {k}: {v}")
                visualizer = MetricsVisualizer(metrics)
                visualizer.log_metrics()
                if plot_metrics:
                    visualizer.notebook_plot_training_metrics()

        # Update metrics during training
        if status == ModelStatus.TRAINING_RUNNING and model_info.updated_at > last_update:
            last_update = model_info.updated_at
            metrics = self.metrics()
            visualizer = MetricsVisualizer(metrics)
            visualizer.log_metrics()
            if plot_metrics:
                visualizer.notebook_plot_training_metrics()

        # Check exit conditions
        if status not in [ModelStatus.CREATED, ModelStatus.TRAINING_RUNNING, ModelStatus.TRAINING_STARTING]:
            return

        if time.time() - start_time > max_runtime:
            logger.warning(f"Monitoring exceeded {max_runtime} seconds limit")
            return

        sleep(interval)
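
Stripped of notebook output and plotting, the monitor is a polling loop with two exit conditions: a status outside the active set, or the elapsed time exceeding `max_runtime`. The sketch below isolates that control flow. `Status`, `ACTIVE`, and `monitor` are hypothetical names, the enum string values are assumptions, and the sleep function is injected so the example runs without waiting.

```python
import time
from enum import Enum
from typing import Callable, Iterator


class Status(Enum):
    """Subset of status names taken from the source; the string values are assumptions."""
    CREATED = "CREATED"
    TRAINING_STARTING = "TRAINING_STARTING"
    TRAINING_RUNNING = "TRAINING_RUNNING"
    TRAINING_COMPLETED = "TRAINING_COMPLETED"


ACTIVE = {Status.CREATED, Status.TRAINING_STARTING, Status.TRAINING_RUNNING}


def monitor(
    poll: Callable[[], Status],
    interval: float,
    max_runtime: float,
    sleep: Callable[[float], None] = time.sleep,
) -> list[Status]:
    """Poll until a non-active status appears or max_runtime elapses; return the history."""
    history = []
    start = time.time()
    while True:
        status = poll()
        history.append(status)
        if status not in ACTIVE:
            return history
        if time.time() - start > max_runtime:
            return history
        sleep(interval)


# Simulated status sequence; a real poll would call get_info().status.
statuses: Iterator[Status] = iter(
    [Status.TRAINING_STARTING, Status.TRAINING_RUNNING, Status.TRAINING_COMPLETED]
)
waits = []
history = monitor(lambda: next(statuses), interval=30, max_runtime=36000, sleep=waits.append)
print([s.name for s in history], waits)
```

The status is always checked once before the first sleep, so a model that has already finished training returns immediately.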

stop_training() #

Stop the training process of the model.

This method sends a request to stop the training of the model identified by model_ref. If the request fails, an error is logged and a ValueError is raised.

Raises:

Type Description
ValueError

If the stop training request fails.

Logs
  • Error message if the request to stop training fails, including the status code and response text.

Returns:

Name Type Description
None None

This method does not return any value.

Source code in focoos/remote_model.py
def stop_training(self) -> None:
    """
    Stop the training process of the model.

    This method sends a request to stop the training of the model identified by `model_ref`.
    If the request fails, an error is logged and a `ValueError` is raised.

    Raises:
        ValueError: If the stop training request fails.

    Logs:
        - Error message if the request to stop training fails, including the status code and response text.

    Returns:
        None: This method does not return any value.
    """
    res = self.http_client.delete(f"models/{self.model_ref}/train")
    if res.status_code != 200:
        logger.error(f"Failed to stop training: {res.status_code} {res.text}")
        raise ValueError(f"Failed to stop training: {res.status_code} {res.text}")
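
Because a failed stop request surfaces as a `ValueError`, callers that want to continue after a rejected request need to catch it. The sketch below reproduces the same control flow as a free function so it can run without the Focoos package; `FakeResponse`, `FakeHttpClient`, and `try_stop` are hypothetical stand-ins introduced for illustration only.

```python
from dataclasses import dataclass


@dataclass
class FakeResponse:
    """Stand-in for an HTTP response; only the two fields the method reads."""
    status_code: int
    text: str = ""


class FakeHttpClient:
    """Hypothetical test double for HttpClient that records DELETE paths."""

    def __init__(self, status_code: int):
        self.status_code = status_code
        self.deleted = []

    def delete(self, path: str) -> FakeResponse:
        self.deleted.append(path)
        return FakeResponse(self.status_code, "stubbed response")


def stop_training(http_client, model_ref: str) -> None:
    """Same control flow as the method above, written as a free function."""
    res = http_client.delete(f"models/{model_ref}/train")
    if res.status_code != 200:
        raise ValueError(f"Failed to stop training: {res.status_code} {res.text}")


def try_stop(http_client, model_ref: str) -> bool:
    """Return True if the stop request succeeded, False if it was rejected."""
    try:
        stop_training(http_client, model_ref)
        return True
    except ValueError:
        return False
```

With a real `RemoteModel` instance, the equivalent pattern would wrap `model.stop_training()` in the same `try`/`except ValueError` block.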

train(dataset_ref, hyperparameters, instance_type=TrainInstance.ML_G4DN_XLARGE, volume_size=50, max_runtime_in_seconds=36000) #

Initiate the training of a remote model on the Focoos platform.

This method sends a request to the Focoos platform to start the training process for the model referenced by self.model_ref. It requires a dataset reference and hyperparameters for training, as well as optional configuration options for the instance type, volume size, and runtime.

Parameters:

Name Type Description Default
dataset_ref str

The reference ID of the dataset to be used for training.

required
hyperparameters Hyperparameters

A structure containing the hyperparameters for the training process.

required
instance_type TrainInstance

The type of training instance to use. Defaults to TrainInstance.ML_G4DN_XLARGE.

ML_G4DN_XLARGE
volume_size int

The size of the disk volume (in GB) for the training instance. Defaults to 50.

50
max_runtime_in_seconds int

The maximum runtime for training in seconds. Defaults to 36000.

36000

Returns:

Name Type Description
dict dict | None

A dictionary containing the response from the training initiation request. The content depends on the Focoos platform's response.

Raises:

Type Description
ValueError

If the request to start training fails (e.g., due to incorrect parameters or server issues).

Source code in focoos/remote_model.py
def train(
    self,
    dataset_ref: str,
    hyperparameters: Hyperparameters,
    instance_type: TrainInstance = TrainInstance.ML_G4DN_XLARGE,
    volume_size: int = 50,
    max_runtime_in_seconds: int = 36000,
) -> dict | None:
    """
    Initiate the training of a remote model on the Focoos platform.

    This method sends a request to the Focoos platform to start the training process for the model
    referenced by `self.model_ref`. It requires a dataset reference and hyperparameters for training,
    as well as optional configuration options for the instance type, volume size, and runtime.

    Args:
        dataset_ref (str): The reference ID of the dataset to be used for training.
        hyperparameters (Hyperparameters): A structure containing the hyperparameters for the training process.
        instance_type (TrainInstance, optional): The type of training instance to use. Defaults to TrainInstance.ML_G4DN_XLARGE.
        volume_size (int, optional): The size of the disk volume (in GB) for the training instance. Defaults to 50.
        max_runtime_in_seconds (int, optional): The maximum runtime for training in seconds. Defaults to 36000.

    Returns:
        dict: A dictionary containing the response from the training initiation request. The content depends on the Focoos platform's response.

    Raises:
        ValueError: If the request to start training fails (e.g., due to incorrect parameters or server issues).
    """
    res = self.http_client.post(
        f"models/{self.model_ref}/train",
        data={
            "dataset_ref": dataset_ref,
            "instance_type": instance_type,
            "volume_size": volume_size,
            "max_runtime_in_seconds": max_runtime_in_seconds,
            "hyperparameters": hyperparameters.model_dump(),
        },
    )
    if res.status_code != 200:
        logger.warning(f"Failed to train model: {res.status_code} {res.text}")
        raise ValueError(f"Failed to train model: {res.status_code} {res.text}")
    return res.json()
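
To make the request body concrete, the following sketch builds the same dictionary that `train` posts and prints it as JSON. It is illustrative only: `DemoHyperparameters` and its fields are hypothetical stand-ins for the real `Hyperparameters` model, `dataclasses.asdict` stands in for `model_dump()`, the enum value `"ml.g4dn.xlarge"` is an assumption, and `build_train_payload` is a helper invented for this example.

```python
import json
from dataclasses import asdict, dataclass
from enum import Enum


class TrainInstance(str, Enum):
    # Assumed value; the real enum is defined in the focoos package.
    ML_G4DN_XLARGE = "ml.g4dn.xlarge"


@dataclass
class DemoHyperparameters:
    # Hypothetical fields standing in for the real Hyperparameters model.
    max_iters: int = 1000
    learning_rate: float = 5e-4


def build_train_payload(
    dataset_ref: str,
    hyperparameters: DemoHyperparameters,
    instance_type: TrainInstance = TrainInstance.ML_G4DN_XLARGE,
    volume_size: int = 50,
    max_runtime_in_seconds: int = 36000,
) -> dict:
    """Assemble the request body with the same keys used by train()."""
    return {
        "dataset_ref": dataset_ref,
        "instance_type": instance_type,
        "volume_size": volume_size,
        "max_runtime_in_seconds": max_runtime_in_seconds,
        # asdict() plays the role of hyperparameters.model_dump() here.
        "hyperparameters": asdict(hyperparameters),
    }


payload = build_train_payload("my-dataset-ref", DemoHyperparameters())
print(json.dumps(payload, indent=2))
```

Only `dataset_ref` and `hyperparameters` have to be supplied; the remaining keys fall back to the defaults listed in the parameter table.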

train_info() #

Retrieve the current status of the model training.

Sends a request to check the training status of the model referenced by self.model_ref.

Returns:

Name Type Description
TrainingInfo Optional[TrainingInfo]

A TrainingInfo object containing the training status information.

Raises:

Type Description
ValueError

If the request to get training status fails.

Source code in focoos/remote_model.py
def train_info(self) -> Optional[TrainingInfo]:
    """
    Retrieve the current status of the model training.

    Sends a request to check the training status of the model referenced by `self.model_ref`.

    Returns:
        TrainingInfo: An object containing the training status information.

    Raises:
        ValueError: If the request to get training status fails.
    """
    res = self.http_client.get(f"models/{self.model_ref}/train/status")
    if res.status_code != 200:
        logger.error(f"Failed to get train status: {res.status_code} {res.text}")
        raise ValueError(f"Failed to get train status: {res.status_code} {res.text}")
    return TrainingInfo(**res.json())
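
The last line of the method turns the JSON response into a typed object by keyword unpacking, so callers read attributes rather than dictionary keys. The sketch below shows that pattern in isolation. `DemoTrainingInfo`, its two fields, and `parse_training_info` are hypothetical; the real `TrainingInfo` schema is not shown on this page and may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DemoTrainingInfo:
    # Hypothetical fields; the real TrainingInfo schema may differ.
    status: Optional[str] = None
    instance_type: Optional[str] = None


def parse_training_info(payload: dict) -> DemoTrainingInfo:
    """Build the typed object by keyword unpacking, as the method above does."""
    return DemoTrainingInfo(**payload)


info = parse_training_info({"status": "Training"})
print(info.status)         # attribute access instead of info["status"]
print(info.instance_type)  # fields absent from the payload keep their default
```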

train_logs() #

Retrieve the training logs for the model.

This method sends a request to fetch the logs of the model's training process. If the request is successful (status code 200), it returns the logs as a list of strings. If the request fails, it logs a warning and returns an empty list.

Returns:

Type Description
list[str]

A list of training logs as strings, or an empty list if the request fails.

Raises:

Type Description
None

This method does not raise an exception; it returns an empty list if the request fails.

Source code in focoos/remote_model.py
def train_logs(self) -> list[str]:
    """
    Retrieve the training logs for the model.

    This method sends a request to fetch the logs of the model's training process. If the request
    is successful (status code 200), it returns the logs as a list of strings. If the request fails,
    it logs a warning and returns an empty list.

    Returns:
        list[str]: A list of training logs as strings.

    Raises:
        None: This method does not raise an exception; it returns an empty list if the request fails.
    """
    res = self.http_client.get(f"models/{self.model_ref}/train/logs")
    if res.status_code != 200:
        logger.warning(f"Failed to get train logs: {res.status_code} {res.text}")
        return []
    return res.json()
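
Since `train_logs` returns an empty list both when no logs exist and when the request fails, a caller that polls repeatedly may want to print only lines it has not seen yet and ignore empty results. The helper below is a minimal sketch under the assumption that each call returns the full, cumulative log list; `new_log_lines` is a hypothetical name and not part of the library.

```python
def new_log_lines(previous_count: int, logs: list[str]) -> tuple[list[str], int]:
    """Return the lines not seen yet, plus the updated line count.

    Assumes `logs` is cumulative. An empty or shorter list (for example the
    [] returned on a failed request) leaves the counter untouched, so a
    transient failure does not cause old lines to be printed again.
    """
    if len(logs) <= previous_count:
        return [], previous_count
    return logs[previous_count:], len(logs)


# Simulated successive results of train_logs(), including one failed request.
seen = 0
for snapshot in (["step 1", "step 2"], [], ["step 1", "step 2", "step 3"]):
    fresh, seen = new_log_lines(seen, snapshot)
    for line in fresh:
        print(line)
```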