# Inference CLI
The following are the CLI commands for ScaleGenAI Inference.
| Function | Description |
|---|---|
| `create` | Launch an inference job. |
| `list` | List launched inference jobs. |
| `start` | Restart an inference job once it has been scaled to zero. |
| `delete` | Delete an inference job. |
## create
Run this command to create an inference job.

```
scalegen infer create [args]
```
The command accepts the following arguments:
- `name` [required = true] :: string : The name of the deployment job (e.g., `"test_deploy"`).
- `model` [required = true] :: string : The Hugging Face model to use for inference (e.g., `"meta-llama/Llama-3.1-70B-Instruct"`).
- `inf_type` [required = true] :: string [`"llm"` | `"embedding"`] : Specifies the type of inference, either `"llm"` for language model completions or `"embedding"` for embeddings.
- `cloud_regions` [required = true] :: string : Specifies the cloud region and provider in the format `PROVIDER:REGION` (e.g., `"SCALEGENAI:EU"`).
- `autoscaling_strategy` [required = false] :: string : Strategy for autoscaling, such as `"rps_per_worker"`.
- `lower_allowed_latency_sec` [required = false] :: float : The lower bound for allowed latency in seconds (e.g., `0.2`).
- `lower_allowed_threshold` [required = false] :: float : The lower threshold for autoscaling decisions (e.g., `0.2`).
- `scale_down_time_window_sec` [required = false] :: int : Time window in seconds for scaling down (e.g., `300`).
- `scale_up_time_window_sec` [required = false] :: int : Time window in seconds for scaling up (e.g., `300`).
- `scaling_down_timeout_sec` [required = false] :: int : Timeout in seconds for scaling down (e.g., `1200`).
- `scaling_up_timeout_sec` [required = false] :: int : Timeout in seconds for scaling up (e.g., `1200`).
- `upper_allowed_latency_sec` [required = false] :: float : The upper bound for allowed latency in seconds (e.g., `1.0`).
- `upper_allowed_threshold` [required = false] :: float : The upper threshold for autoscaling decisions (e.g., `1.0`).
- `min_workers` [required = false] :: int : Minimum number of workers to start with (e.g., `1`).
- `max_price_per_hour` [optional] :: float : Maximum price per hour for the inference job.
- `allow_spot_instances` [optional] :: boolean : Whether to allow spot instances for inference deployment.
- `hf_token` [optional] :: string : Hugging Face token, required if using a private repository model.
### Example
```
scalegen infer create \
  --name "test_deploy" \
  --model "meta-llama/Llama-3.1-70B-Instruct" \
  --inf_type "llm" \
  --cloud_regions "SCALEGENAI:EU" \
  --autoscaling_strategy "rps_per_worker" \
  --lower_allowed_latency_sec 0.2 \
  --lower_allowed_threshold 0.2 \
  --scale_down_time_window_sec 300 \
  --scale_up_time_window_sec 300 \
  --scaling_down_timeout_sec 1200 \
  --scaling_up_timeout_sec 1200 \
  --upper_allowed_latency_sec 1 \
  --upper_allowed_threshold 1 \
  --min_workers 1
```
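Only `name`, `model`, `inf_type`, and `cloud_regions` are required, so a minimal invocation can omit the autoscaling flags entirely. The sketch below assumes the service falls back to its default autoscaling behavior when those flags are absent (an assumption, not documented above):

```
# Minimal create: only the four required arguments.
scalegen infer create \
  --name "test_deploy" \
  --model "meta-llama/Llama-3.1-70B-Instruct" \
  --inf_type "llm" \
  --cloud_regions "SCALEGENAI:EU"
```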
## list
Run this command to list your running inference deployments.

```
scalegen infer list
```
To print deployment details, use the `-v` or `--verbose` flag:

```
scalegen infer list -v
```
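The inference ID shown in the listing is what `start` and `delete` expect. Below is a hypothetical shell sketch for pulling an ID out of the verbose listing; the exact output layout (deployment name and ID on one line, ID in the second column) is an assumption, so adjust the field index to the real output:

```
# Hypothetical: look up the ID of the "test_deploy" deployment.
# Assumes the verbose listing prints the name and ID on the same line,
# with the ID as the second whitespace-separated field -- both assumptions.
INF_ID=$(scalegen infer list -v | grep "test_deploy" | awk '{print $2}')
echo "$INF_ID"
```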
## start
Run this command to start an inference job once it has been scaled to zero.

```
scalegen infer start <INF_ID>
```
### Example
```
scalegen infer start test_job_id
```
## delete
Run this command to delete an inference deployment.

```
scalegen infer delete <INF_ID>
```
### Example
```
scalegen infer delete test_job_id
```
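For scripted teardown, the commands compose in the usual shell way. A minimal sketch, assuming `scalegen infer delete` exits non-zero on failure (an assumption) so that `set -e` aborts the script if the deletion is rejected:

```
#!/usr/bin/env bash
# Hypothetical teardown script: delete the deployment whose ID is passed as
# the first argument, then list the remaining deployments to confirm removal.
set -euo pipefail

INF_ID="$1"
scalegen infer delete "$INF_ID"
scalegen infer list
```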