Inference CLI
The following are the CLI commands for ScaleGenAI Inference.
| Function | Description |
|---|---|
| create | Launch an inference job. |
| list | List launched inference jobs. |
| start | Restart an inference job after it has been scaled to zero. |
| delete | Delete an inference job. |
create
Run this command to create an inference job.
scalegen infer create [args]
The command accepts the following arguments:
| Argument | Required | Type | Description |
|---|---|---|---|
| name | Yes | string | The name of the deployment job (e.g., "test_deploy"). |
| model | Yes | string | The Hugging Face model to use for inference (e.g., "meta-llama/Llama-3.1-70B-Instruct"). |
| inf_type | Yes | string ("llm" or "embedding") | The type of inference: "llm" for language model completions or "embedding" for embeddings. |
| cloud_regions | Yes | string | The cloud provider and region in the format PROVIDER:REGION (e.g., "SCALEGENAI:EU"). |
| autoscaling_strategy | No | string | Strategy for autoscaling, such as "rps_per_worker". |
| lower_allowed_latency_sec | No | float | Lower bound for allowed latency in seconds (e.g., 0.2). |
| lower_allowed_threshold | No | float | Lower threshold for autoscaling decisions (e.g., 0.2). |
| scale_down_time_window_sec | No | int | Time window in seconds for scaling down (e.g., 300). |
| scale_up_time_window_sec | No | int | Time window in seconds for scaling up (e.g., 300). |
| scaling_down_timeout_sec | No | int | Timeout in seconds for scaling down (e.g., 1200). |
| scaling_up_timeout_sec | No | int | Timeout in seconds for scaling up (e.g., 1200). |
| upper_allowed_latency_sec | No | float | Upper bound for allowed latency in seconds (e.g., 1.0). |
| upper_allowed_threshold | No | float | Upper threshold for autoscaling decisions (e.g., 1.0). |
| min_workers | No | int | Minimum number of workers to start with (e.g., 1). |
| max_price_per_hour | No | float | Maximum price per hour for the inference job. |
| allow_spot_instances | No | boolean | Whether to allow spot instances for the inference deployment. |
| hf_token | No | string | Hugging Face token; required when using a model from a private repository. |
Example
scalegen infer create \
--name "test_deploy" \
--model "meta-llama/Llama-3.1-70B-Instruct" \
--inf_type "llm" \
--cloud_regions "SCALEGENAI:EU" \
--autoscaling_strategy "rps_per_worker" \
--lower_allowed_latency_sec 0.2 \
--lower_allowed_threshold 0.2 \
--scale_down_time_window_sec 300 \
--scale_up_time_window_sec 300 \
--scaling_down_timeout_sec 1200 \
--scaling_up_timeout_sec 1200 \
--upper_allowed_latency_sec 1 \
--upper_allowed_threshold 1 \
--min_workers 1
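If you do not need to tune autoscaling, the four required arguments are enough; the optional flags can be omitted (presumably falling back to the service defaults). A minimal sketch, with --hf_token shown as a placeholder for private-repository models:
scalegen infer create \
--name "test_deploy" \
--model "meta-llama/Llama-3.1-70B-Instruct" \
--inf_type "llm" \
--cloud_regions "SCALEGENAI:EU" \
--hf_token "hf_xxxxxxxx"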
list
Run this command to list your running inference deployments.
scalegen infer list
To print full deployment details, use the -v or --verbose flag:
scalegen infer list -v
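For long listings, you can pipe the output through standard shell tools to find a specific deployment (this assumes the deployment name appears in the tabular output):
scalegen infer list | grep "test_deploy"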
start
Run this command to start an inference job once it has been scaled to zero.
scalegen infer start <INF_ID>
Example
scalegen infer start test_job_id
delete
Run this command to delete an inference deployment.
scalegen infer delete <INF_ID>
Example
scalegen infer delete test_job_id
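Putting the commands together, a typical deployment lifecycle looks like the sketch below. <INF_ID> stands for the inference ID of your deployment, which you can look up with scalegen infer list (the exact output format may vary):
# Create the deployment
scalegen infer create \
--name "test_deploy" \
--model "meta-llama/Llama-3.1-70B-Instruct" \
--inf_type "llm" \
--cloud_regions "SCALEGENAI:EU"

# Look up the inference ID
scalegen infer list -v

# Restart the deployment after it has scaled to zero
scalegen infer start <INF_ID>

# Delete it when no longer needed
scalegen infer delete <INF_ID>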