Inference API


The ScaleGenAI inference API has the following methods.

Function   Description
create     Launch an inference deployment.
update     Update an inference deployment's configuration.
delete     Delete an inference deployment.
get        Get inference deployment info.

create

Endpoint: /sg_inf/create

Description

This method is used to create a new inference deployment.

Request

  • Method: POST
  • Headers:
    • Content-Type: application/json
  • Body:
    {
      "name": "string",
      "model": "string",
      "base_model": "string",
      "inf_type": "llm",
      "hf_token": "string",
      "allow_spot_instances": false,
      "logs_store": "string",
      "cloud_providers": [],
      "gateway_config": {
        "name": "DATACRUNCH",
        "region": "string"
      },
      "initial_worker_config": {
        "min_workers": 0,
        "initial_workers_gpu": "A100",
        "initial_workers_gpu_num": 0,
        "use_same_gpus_when_scaling": false,
        "instance_types": ["string"],
        "use_on_prem": false,
        "use_cloudburst": false,
        "on_prem_node_ids": ["string"]
      },
      "autoscaling_config": {
        "enable_speedup_shared": false,
        "lower_allowed_latency_sec": 1,
        "scale_to_zero_timeout_sec": 1800,
        "scaling_down_timeout_sec": 1200,
        "scaling_up_timeout_sec": 1200,
        "time_window_sec": 300,
        "upper_allowed_latency_sec": 4
      },
      "max_price_per_hour": 0,
      "max_throughput_rate": 0
    }

Parameter Description

  • name:: string : The name of the inference task.
  • model:: string : The name of the model to be used for inference.
  • base_model [optional]:: string : The base model on which the custom inference is built.
  • inf_type:: string["llm", "embedding"] : Whether the deployment is for a completions LLM or an embeddings model.
  • hf_token [optional]:: string : HuggingFace token.
  • allow_spot_instances:: boolean : A boolean indicating whether to use spot instances for inference deployments.
  • logs_store [optional]:: string : Name of the Artifacts Storage where logs are to be stored.
  • cloud_providers [optional]:: list[string] : An array of cloud providers.
  • gateway_config [optional]:: object : An object containing the API gateway configuration.
    • name:: string : Cloud provider where API gateway is to be configured.
    • region:: string : Region where API gateway is to be configured.
  • initial_worker_config [optional]:: object : An object containing the initial deployment nodes' configuration.
    • min_workers:: int : The minimum number of workers to start with.
    • initial_workers_gpu:: string : The type of GPU to be used by the initial workers.
    • initial_workers_gpu_num:: int : The number of GPUs to be used by the initial workers.
    • use_same_gpus_when_scaling:: boolean : A boolean indicating whether to use the same type of GPUs when scaling.
    • instance_types:: list[string] : A list of AWS instance types to use instead of initial_workers_gpu and initial_workers_gpu_num.
    • use_on_prem:: boolean : Whether to use on-premise resources.
    • use_cloudburst:: boolean : Whether to use cloudburst.
    • on_prem_node_ids:: list[string] : An array of on-premise node IDs to be used.
  • autoscaling_config [optional]:: object : An object containing the autoscaling logic configuration.
    • enable_speedup_shared:: boolean : A boolean indicating whether to enable fast autoscaling on ScaleGenAI shared infrastructure.
    • lower_allowed_latency_sec:: int : The lower limit of allowed latency in seconds.
    • scale_to_zero_timeout_sec:: int : The timeout in seconds for scaling to zero.
    • scaling_down_timeout_sec:: int : The timeout in seconds for scaling down.
    • scaling_up_timeout_sec:: int : The timeout in seconds for scaling up.
    • time_window_sec:: int : The time window in seconds for autoscaling.
    • upper_allowed_latency_sec:: int : The upper limit of allowed latency in seconds.
  • max_price_per_hour [optional]:: int : The maximum price per hour for the inference task.
  • max_throughput_rate [optional]:: int : The maximum throughput rate for the inference task.

Response

{
  "success": true,
  "message": {
    "inf_id": "string"
  }
}

Example

curl -X POST \
  -H "Content-Type: application/json" \
  -d '{
    "name": "test_inference",
    "model": "meta-llama/Llama-2-70b",
    "inf_type": "llm",
    "allow_spot_instances": true,
    "autoscaling_config": {
      "enable_speedup_shared": false,
      "lower_allowed_latency_sec": 2,
      "scale_to_zero_timeout_sec": 1800,
      "scaling_down_timeout_sec": 1200,
      "scaling_up_timeout_sec": 1200,
      "time_window_sec": 300,
      "upper_allowed_latency_sec": 5
    }
  }' \
  https://api.example.com/sg_inf/create
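
The same request can also be scripted. Below is a minimal Python sketch using the requests library; the base URL is the placeholder from the curl example above, and any authentication headers your account may require are omitted as an assumption.

import requests

# Placeholder host from the curl example above; replace with your API host.
BASE_URL = "https://api.example.com"

payload = {
    "name": "test_inference",
    "model": "meta-llama/Llama-2-70b",
    "inf_type": "llm",
    "allow_spot_instances": True,
    "autoscaling_config": {
        "enable_speedup_shared": False,
        "lower_allowed_latency_sec": 2,
        "scale_to_zero_timeout_sec": 1800,
        "scaling_down_timeout_sec": 1200,
        "scaling_up_timeout_sec": 1200,
        "time_window_sec": 300,
        "upper_allowed_latency_sec": 5,
    },
}

# POST /sg_inf/create returns {"success": true, "message": {"inf_id": "..."}}.
resp = requests.post(f"{BASE_URL}/sg_inf/create", json=payload)
resp.raise_for_status()
inf_id = resp.json()["message"]["inf_id"]
print("Created inference deployment:", inf_id)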

update

Endpoint: /sg_inf/{inference_id}

Description

This method is used to update an existing inference deployment.

Request

  • Method: PUT
  • Headers:
    • Content-Type: application/json
  • Body:
    {
      "initial_worker_config": {
        "min_workers": 0,
        "initial_workers_gpu": "A100",
        "initial_workers_gpu_num": 0,
        "use_same_gpus_when_scaling": false,
        "instance_types": ["string"],
        "use_on_prem": false,
        "use_cloudburst": false,
        "on_prem_node_ids": ["string"]
      },
      "autoscaling_config": {
        "time_window_sec": 300,
        "upper_allowed_latency_sec": 4,
        "lower_allowed_latency_sec": 1,
        "scaling_up_timeout_sec": 1200,
        "scaling_down_timeout_sec": 1200,
        "scale_to_zero_timeout_sec": 1800,
        "enable_speedup_shared": false
      },
      "imidiate_scale_down": false
    }

Parameter Description

  • inference_id:: string : Inference deployment job ID.

  • initial_worker_config [optional]:: object : An object containing the initial deployment nodes' configuration to be edited.

    • min_workers:: int : The minimum number of workers.
    • initial_workers_gpu:: string : The type of GPU to be used by the initial workers.
    • initial_workers_gpu_num:: int : The number of GPUs to be used by the initial workers.
    • use_same_gpus_when_scaling:: boolean : A boolean indicating whether to use the same type of GPUs when scaling.
    • instance_types:: list[string] : A list of AWS instance types to use instead of initial_workers_gpu and initial_workers_gpu_num.
    • use_on_prem:: boolean : Whether to use on-premise resources.
    • use_cloudburst:: boolean : Whether to use cloudburst.
    • on_prem_node_ids:: list[string] : An array of on-premise node IDs to be used.
  • autoscaling_config [optional]:: object : An object containing the autoscaling logic configuration.

    • time_window_sec:: int : The time window in seconds for autoscaling.
    • upper_allowed_latency_sec:: int : The upper limit of allowed latency in seconds.
    • lower_allowed_latency_sec:: int : The lower limit of allowed latency in seconds.
    • scaling_up_timeout_sec:: int : The timeout in seconds for scaling up.
    • scaling_down_timeout_sec:: int : The timeout in seconds for scaling down.
    • scale_to_zero_timeout_sec:: int : The timeout in seconds for scaling to zero.
    • enable_speedup_shared:: boolean : A boolean indicating whether to enable fast autoscaling on ScaleGenAI shared infrastructure.
  • imidiate_scale_down [optional]:: boolean : A boolean indicating whether to immediately scale down.

Response

{
  "success": true,
  "message": "string"
}

Example

curl -X PUT \
  -H "Content-Type: application/json" \
  -d '{
    "initial_worker_config": {
      "min_workers": 3,
      "initial_workers_gpu": "A100",
      "initial_workers_gpu_num": 2,
      "use_same_gpus_when_scaling": false
    },
    "autoscaling_config": {
      "time_window_sec": 300,
      "upper_allowed_latency_sec": 4,
      "lower_allowed_latency_sec": 1,
      "scaling_up_timeout_sec": 1200,
      "scaling_down_timeout_sec": 1200,
      "scale_to_zero_timeout_sec": 1800,
      "enable_speedup_shared": false
    },
    "imidiate_scale_down": false
  }' \
  https://api.example.com/sg_inf/test_job_id
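
As with create, the update call can be issued from Python. A minimal sketch using the requests library; the base URL and test_job_id are the placeholders from the curl example above, and authentication headers are omitted as an assumption.

import requests

BASE_URL = "https://api.example.com"   # placeholder host from the examples above
inference_id = "test_job_id"           # placeholder deployment ID

payload = {
    "initial_worker_config": {
        "min_workers": 3,
        "initial_workers_gpu": "A100",
        "initial_workers_gpu_num": 2,
        "use_same_gpus_when_scaling": False,
    },
    "imidiate_scale_down": False,
}

# PUT /sg_inf/{inference_id} updates an existing deployment's configuration.
resp = requests.put(f"{BASE_URL}/sg_inf/{inference_id}", json=payload)
resp.raise_for_status()
print(resp.json())  # {"success": true, "message": "..."}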

delete

Endpoint: /sg_inf/{inference_id}

Description

This method is used to delete an inference deployment.

Request

  • Method: DELETE
  • Headers:
    • Content-Type: application/json

Parameter Description

  • inference_id:: string : Inference deployment job ID.

Response

{
  "success": true,
  "message": "string"
}

Example

curl -X DELETE \
  -H "Content-Type: application/json" \
  https://api.example.com/sg_inf/test_job_id
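
The equivalent Python sketch with the requests library; the host and test_job_id are the placeholders from the curl example, and authentication headers are omitted as an assumption.

import requests

BASE_URL = "https://api.example.com"   # placeholder host
inference_id = "test_job_id"           # placeholder deployment ID

# DELETE /sg_inf/{inference_id} removes the inference deployment.
resp = requests.delete(f"{BASE_URL}/sg_inf/{inference_id}")
resp.raise_for_status()
print(resp.json())  # {"success": true, "message": "..."}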

get

Endpoint: /sg_inf/{inference_id}

Description

This method is used to get information about an inference deployment.

Request

  • Method: GET
  • Headers:
    • Content-Type: application/json

Parameter Description

  • inference_id:: string : Inference deployment job ID.

Response

{
  "success": true,
  "message": "string"
}

Example

curl -X GET \
  -H "Content-Type: application/json" \
  https://api.example.com/sg_inf/test_job_id
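
The equivalent Python sketch with the requests library; the host and test_job_id are the placeholders from the curl example, and authentication headers are omitted as an assumption.

import requests

BASE_URL = "https://api.example.com"   # placeholder host
inference_id = "test_job_id"           # placeholder deployment ID

# GET /sg_inf/{inference_id} returns the deployment's information.
resp = requests.get(f"{BASE_URL}/sg_inf/{inference_id}")
resp.raise_for_status()
print(resp.json())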