
Inference API

The ScaleGenAI inference API has the following methods.

  • create: Launch an inference job.
  • update: Update deployment config.
  • delete: Delete deployment config.
  • get: Get inference deployment info.


Endpoint: /sg_inf/create


This method is used to create a new inference deployment.


  • Method: POST
  • Headers:
    • Content-Type: application/json
  • Body:
    {
      "name": "string",
      "model": "string",
      "base_model": "string",
      "inf_type": "llm",
      "hf_token": "string",
      "allow_spot_instances": false,
      "logs_store": "string",
      "cloud_providers": [],
      "gateway_config": {
        "name": "DATACRUNCH",
        "region": "string"
      },
      "initial_worker_config": {
        "min_workers": 0,
        "initial_workers_gpu": "A100",
        "initial_workers_gpu_num": 0,
        "use_same_gpus_when_scaling": false,
        "instance_types": ["string"],
        "use_on_prem": false,
        "use_cloudburst": false,
        "on_prem_node_ids": ["string"]
      },
      "autoscaling_config": {
        "enable_speedup_shared": false,
        "lower_allowed_latency_sec": 1,
        "scale_to_zero_timeout_sec": 1800,
        "scaling_down_timeout_sec": 1200,
        "scaling_up_timeout_sec": 1200,
        "time_window_sec": 300,
        "upper_allowed_latency_sec": 4
      },
      "max_price_per_hour": 0,
      "max_throughput_rate": 0
    }

Parameter Description

  • name:: string : The name of the inference task.
  • model:: string : The name of the model to be used for inference.
  • base_model [optional]:: string : The base model on which the custom inference is built.
  • data_path:: string : The path to the HuggingFace dataset used for finetuning.
  • inf_type:: string["llm", "embedding"] : Whether the deployment is for a completions LLM or an embeddings model.
  • hf_token [optional]:: string : HuggingFace token.
  • allow_spot_instances:: boolean : A boolean indicating whether to use spot instances for inference deployments.
  • logs_store [optional]:: string : Name of the Artifacts Storage where logs are to be stored.
  • cloud_providers [optional]:: list[string] : An array of cloud providers.
  • gateway_config [optional]:: object : An object containing the API gateway configuration.
    • name:: string : Cloud provider where API gateway is to be configured.
    • region:: string : Region where API gateway is to be configured.
  • initial_worker_config [optional]:: object : An object containing the initial deployment nodes' configuration.
    • min_workers:: int : The minimum number of workers to start with.
    • initial_workers_gpu:: string : The type of GPU to be used by the initial workers.
    • initial_workers_gpu_num:: int : The number of GPUs to be used by the initial workers.
    • use_same_gpus_when_scaling:: boolean : A boolean indicating whether to use the same type of GPUs when scaling.
    • instance_types:: list[string] : Specify AWS instance type instead of initial_workers_gpu and initial_workers_gpu_num.
    • use_on_prem:: boolean : Whether to use on-premise resources.
    • use_cloudburst:: boolean : Whether to use cloudburst.
    • on_prem_node_ids:: list[string] : An array of on-premise node IDs to be used.
  • autoscaling_config [optional]:: object : An object containing the autoscaling logic configuration.
    • enable_speedup_shared:: boolean : A boolean indicating whether to enable fast autoscaling on ScaleGenAI shared infrastructure.
    • lower_allowed_latency_sec:: int : The lower limit of allowed latency in seconds.
    • scale_to_zero_timeout_sec:: int : The timeout in seconds for scaling to zero.
    • scaling_down_timeout_sec:: int : The timeout in seconds for scaling down.
    • scaling_up_timeout_sec:: int : The timeout in seconds for scaling up.
    • time_window_sec:: int : The time window in seconds for autoscaling.
    • upper_allowed_latency_sec:: int : The upper limit of allowed latency in seconds.
  • max_price_per_hour [optional]:: int : The maximum price per hour for the inference task.
  • max_throughput_rate [optional]:: int : The maximum throughput rate for the inference task.
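The parameters above can be assembled into a request body programmatically. The following is a minimal Python sketch, not an official client: `BASE_URL` is a hypothetical placeholder (this section does not state the API host), `build_create_request` is an illustrative helper name, and any authentication header the service may require is not covered here. The request is only built, not sent.

```python
import json
import urllib.request

# Hypothetical host: the real base URL is not given in this section.
BASE_URL = "https://api.example.com"


def build_create_request(name, model, inf_type="llm",
                         allow_spot_instances=False, **optional):
    """Build (but do not send) a POST request for /sg_inf/create."""
    if inf_type not in ("llm", "embedding"):
        raise ValueError("inf_type must be 'llm' or 'embedding'")
    payload = {
        "name": name,
        "model": model,
        "inf_type": inf_type,
        "allow_spot_instances": allow_spot_instances,
    }
    # Optional fields (base_model, hf_token, autoscaling_config, ...) are
    # included only when the caller supplies a value for them.
    payload.update({k: v for k, v in optional.items() if v is not None})
    return urllib.request.Request(
        BASE_URL + "/sg_inf/create",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_create_request(
    "test_inference",
    "meta-llama/Llama-2-70b",
    allow_spot_instances=True,
    autoscaling_config={
        "enable_speedup_shared": False,
        "lower_allowed_latency_sec": 2,
        "scale_to_zero_timeout_sec": 1800,
        "scaling_down_timeout_sec": 1200,
        "scaling_up_timeout_sec": 1200,
        "time_window_sec": 300,
        "upper_allowed_latency_sec": 5,
    },
)
print(req.get_method(), req.full_url)
print(json.dumps(json.loads(req.data), indent=2))
```

Passing the resulting object to `urllib.request.urlopen(req)` would send it, once `BASE_URL` points at a real deployment.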


{
  "success": true,
  "message": {
    "inf_id": "string"
  }
}
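
The `inf_id` returned by the create call is the value used as `{inference_id}` by the other methods. Below is a small Python sketch of reading it from the response. The helper name `extract_inf_id` and the sample ID are illustrative, and the handling of an unsuccessful response is an assumption, since this section only shows the success shape.

```python
import json


def extract_inf_id(response_text):
    """Return the inf_id from a create response, or raise if the call failed."""
    body = json.loads(response_text)
    if not body.get("success"):
        # Assumption: on failure, "message" carries some error description.
        raise RuntimeError(f"create failed: {body.get('message')}")
    return body["message"]["inf_id"]


# Made-up sample response in the documented success shape.
sample = '{"success": true, "message": {"inf_id": "abc123"}}'
inf_id = extract_inf_id(sample)
print(inf_id)
```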


curl -X POST \
-H "Content-Type: application/json" \
-d '{
  "name": "test_inference",
  "model": "meta-llama/Llama-2-70b",
  "inf_type": "llm",
  "allow_spot_instances": true,
  "autoscaling_config": {
    "enable_speedup_shared": false,
    "lower_allowed_latency_sec": 2,
    "scale_to_zero_timeout_sec": 1800,
    "scaling_down_timeout_sec": 1200,
    "scaling_up_timeout_sec": 1200,
    "time_window_sec": 300,
    "upper_allowed_latency_sec": 5
  }
}' \


Endpoint: /sg_inf/{inference_id}


This method is used to update an existing inference deployment.


  • Method: PUT

  • Headers:

    • Content-Type: application/json
  • Body:

    {
      "initial_worker_config": {
        "min_workers": 0,
        "initial_workers_gpu": "A100",
        "initial_workers_gpu_num": 0,
        "use_same_gpus_when_scaling": false,
        "instance_types": ["string"],
        "use_on_prem": false,
        "use_cloudburst": false,
        "on_prem_node_ids": ["string"]
      },
      "autoscaling_config": {
        "time_window_sec": 300,
        "upper_allowed_latency_sec": 4,
        "lower_allowed_latency_sec": 1,
        "scaling_up_timeout_sec": 1200,
        "scaling_down_timeout_sec": 1200,
        "scale_to_zero_timeout_sec": 1800,
        "enable_speedup_shared": false
      },
      "imidiate_scale_down": false
    }

Parameter Description

  • inference_id:: string : Inference deployment job ID.

  • initial_worker_config [optional]:: object : An object containing the initial deployment nodes' configuration to be edited.

    • min_workers:: int : The minimum number of workers.
    • initial_workers_gpu:: string : The type of GPU to be used by the initial workers.
    • initial_workers_gpu_num:: int : The number of GPUs to be used by the initial workers.
    • use_same_gpus_when_scaling:: boolean : A boolean indicating whether to use the same type of GPUs when scaling.
    • instance_types:: list[string] : Specify AWS instance type instead of initial_workers_gpu and initial_workers_gpu_num.
    • use_on_prem:: boolean : Whether to use on-premise resources.
    • use_cloudburst:: boolean : Whether to use cloudburst.
    • on_prem_node_ids:: list[string] : An array of on-premise node IDs to be used.
  • autoscaling_config [optional]:: object : An object containing the autoscaling logic configuration.

    • time_window_sec:: int : The time window in seconds for autoscaling.
    • upper_allowed_latency_sec:: int : The upper limit of allowed latency in seconds.
    • lower_allowed_latency_sec:: int : The lower limit of allowed latency in seconds.
    • scaling_up_timeout_sec:: int : The timeout in seconds for scaling up.
    • scaling_down_timeout_sec:: int : The timeout in seconds for scaling down.
    • scale_to_zero_timeout_sec:: int : The timeout in seconds for scaling to zero.
    • enable_speedup_shared:: boolean : A boolean indicating whether to enable fast autoscaling on ScaleGenAI shared infrastructure.
  • imidiate_scale_down [optional]:: boolean : A boolean indicating whether to immediately scale down.
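
As a rough illustration of the update call, the Python sketch below builds a PUT request containing only the sections being changed. `BASE_URL` and `build_update_request` are hypothetical names, and the check that the lower latency bound does not exceed the upper one is a client-side assumption rather than documented API behavior. The field name `imidiate_scale_down` is spelled as it appears in the request body above.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical host: the real base URL is not given in this section.
BASE_URL = "https://api.example.com"


def build_update_request(inference_id, initial_worker_config=None,
                         autoscaling_config=None, imidiate_scale_down=None):
    """Build (but do not send) a PUT request for /sg_inf/{inference_id}."""
    body = {}
    if initial_worker_config is not None:
        body["initial_worker_config"] = initial_worker_config
    if autoscaling_config is not None:
        lower = autoscaling_config.get("lower_allowed_latency_sec")
        upper = autoscaling_config.get("upper_allowed_latency_sec")
        # Client-side sanity check (an assumption, not a documented rule).
        if lower is not None and upper is not None and lower > upper:
            raise ValueError("lower_allowed_latency_sec exceeds upper_allowed_latency_sec")
        body["autoscaling_config"] = autoscaling_config
    if imidiate_scale_down is not None:
        body["imidiate_scale_down"] = imidiate_scale_down
    url = f"{BASE_URL}/sg_inf/{urllib.parse.quote(inference_id, safe='')}"
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


update_req = build_update_request(
    "my-inference-id",  # placeholder ID for illustration
    autoscaling_config={
        "time_window_sec": 300,
        "upper_allowed_latency_sec": 4,
        "lower_allowed_latency_sec": 1,
        "scaling_up_timeout_sec": 1200,
        "scaling_down_timeout_sec": 1200,
        "scale_to_zero_timeout_sec": 1800,
        "enable_speedup_shared": False,
    },
    imidiate_scale_down=False,
)
print(update_req.get_method(), update_req.full_url)
```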


{
  "success": true,
  "message": "string"
}


curl -X PUT \
-H "Content-Type: application/json" \
-d '{
  "initial_worker_config": {
    "min_workers": 3,
    "initial_workers_gpu": "A100",
    "initial_workers_gpu_num": 2,
    "use_same_gpus_when_scaling": false
  },
  "autoscaling_config": {
    "time_window_sec": 300,
    "upper_allowed_latency_sec": 4,
    "lower_allowed_latency_sec": 1,
    "scaling_up_timeout_sec": 1200,
    "scaling_down_timeout_sec": 1200,
    "scale_to_zero_timeout_sec": 1800,
    "enable_speedup_shared": false
  },
  "imidiate_scale_down": false
}' \


Endpoint: /sg_inf/{inference_id}


This method is used to delete an inference deployment.


  • Method: DELETE
  • Headers:
    • Content-Type: application/json

Parameter Description

  • inference_id:: string : Inference deployment job ID.


{
  "success": true,
  "message": "string"
}


curl -X DELETE \
-H "Content-Type: application/json" \


Endpoint: /sg_inf/{inference_id}


This method is used to get information about an inference deployment.


  • Method: GET

  • Headers:

    • Content-Type: application/json

Parameter Description

  • inference_id:: string : Inference deployment job ID.
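
The get and delete methods share the `/sg_inf/{inference_id}` path and differ only in the HTTP method, so one helper can cover both. The Python sketch below builds such a request without sending it. `BASE_URL` and `build_deployment_request` are hypothetical names, and any authentication the service may require is not shown.

```python
import urllib.parse
import urllib.request

# Hypothetical host: the real base URL is not given in this section.
BASE_URL = "https://api.example.com"


def build_deployment_request(inference_id, method="GET"):
    """Build (but do not send) a GET or DELETE request for /sg_inf/{inference_id}."""
    if method not in ("GET", "DELETE"):
        raise ValueError("method must be 'GET' or 'DELETE'")
    # Percent-encode the ID so unusual characters cannot alter the path.
    url = f"{BASE_URL}/sg_inf/{urllib.parse.quote(inference_id, safe='')}"
    return urllib.request.Request(
        url,
        headers={"Content-Type": "application/json"},
        method=method,
    )


get_req = build_deployment_request("my-inference-id")  # placeholder ID
del_req = build_deployment_request("my-inference-id", method="DELETE")
print(get_req.get_method(), get_req.full_url)
print(del_req.get_method(), del_req.full_url)
```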


{
  "success": true,
  "message": "string"
}


curl -X GET \
-H "Content-Type: application/json" \