Inference CLI


The following are the CLI commands for ScaleGenAI Inference.

Function   Description
create     Launch an inference job.
list       List launched inference jobs.
start      Restart an inference job once scaled to zero.
delete     Delete an inference job.

create

Run this command to create an inference job.

scalegen infer create [args]

The command accepts the following arguments:

  • name [required] :: string : The name of the deployment job (e.g., "test_deploy").
  • model [required] :: string : The Hugging Face model to use for inference (e.g., "meta-llama/Llama-3.1-70B-Instruct").
  • inf_type [required] :: string [ "llm" | "embedding" ] : The type of inference: "llm" for language-model completions or "embedding" for embeddings.
  • cloud_regions [required] :: string : The cloud provider and region, in the format PROVIDER:REGION (e.g., "SCALEGENAI:EU").
  • autoscaling_strategy [optional] :: string : The autoscaling strategy, such as "rps_per_worker".
  • lower_allowed_latency_sec [optional] :: float : The lower bound for allowed latency, in seconds (e.g., 0.2).
  • lower_allowed_threshold [optional] :: float : The lower threshold for autoscaling decisions (e.g., 0.2).
  • scale_down_time_window_sec [optional] :: int : The time window, in seconds, for scaling down (e.g., 300).
  • scale_up_time_window_sec [optional] :: int : The time window, in seconds, for scaling up (e.g., 300).
  • scaling_down_timeout_sec [optional] :: int : The timeout, in seconds, for scaling down (e.g., 1200).
  • scaling_up_timeout_sec [optional] :: int : The timeout, in seconds, for scaling up (e.g., 1200).
  • upper_allowed_latency_sec [optional] :: float : The upper bound for allowed latency, in seconds (e.g., 1.0).
  • upper_allowed_threshold [optional] :: float : The upper threshold for autoscaling decisions (e.g., 1.0).
  • min_workers [optional] :: int : The minimum number of workers to start with (e.g., 1).
  • max_price_per_hour [optional] :: float : The maximum price per hour for the inference job.
  • allow_spot_instances [optional] :: boolean : Whether to allow spot instances for the inference deployment.
  • hf_token [optional] :: string : A Hugging Face token; required when the model is in a private repository.

Example

scalegen infer create \
--name "test_deploy" \
--model "meta-llama/Llama-3.1-70B-Instruct" \
--inf_type "llm" \
--cloud_regions "SCALEGENAI:EU" \
--autoscaling_strategy "rps_per_worker" \
--lower_allowed_latency_sec 0.2 \
--lower_allowed_threshold 0.2 \
--scale_down_time_window_sec 300 \
--scale_up_time_window_sec 300 \
--scaling_down_timeout_sec 1200 \
--scaling_up_timeout_sec 1200 \
--upper_allowed_latency_sec 1.0 \
--upper_allowed_threshold 1.0 \
--min_workers 1
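
The optional flags documented above can be appended to the same command. The following is a minimal sketch; the deployment name, price cap, and token are placeholder values, and passing --allow_spot_instances as a bare switch is an assumption about the flag's syntax:

# Placeholder values; replace <YOUR_HF_TOKEN> with a real token for private models
scalegen infer create \
--name "private_deploy" \
--model "meta-llama/Llama-3.1-70B-Instruct" \
--inf_type "llm" \
--cloud_regions "SCALEGENAI:EU" \
--min_workers 1 \
--max_price_per_hour 5.0 \
--allow_spot_instances \
--hf_token "<YOUR_HF_TOKEN>"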

list

Run this command to list your running inference deployments.

scalegen infer list

To print deployment details, use the -v or --verbose flag:

scalegen infer list -v

start

Run this command to restart an inference job after it has been scaled to zero.

scalegen infer start <INF_ID>

Example

scalegen infer start test_job_id

delete

Run this command to delete an inference deployment.

scalegen infer delete <INF_ID>

Example

scalegen infer delete test_job_id
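
Taken together, a typical deployment lifecycle looks like the sketch below. The name and model mirror the create example above; <INF_ID> is a placeholder for the deployment ID reported by the CLI, whose exact output format is not shown here:

# Create a deployment and verify it is running
scalegen infer create \
--name "test_deploy" \
--model "meta-llama/Llama-3.1-70B-Instruct" \
--inf_type "llm" \
--cloud_regions "SCALEGENAI:EU"
scalegen infer list -v

# Later: restart after a scale-to-zero, or remove the deployment entirely
scalegen infer start <INF_ID>
scalegen infer delete <INF_ID>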