Inference API
The ScaleGenAI inference API has the following methods.
| Function | Description |
| --- | --- |
| create | Launch an inference deployment. |
| update | Update the configuration of an existing deployment. |
| delete | Delete an inference deployment. |
| get | Get inference deployment info. |
create
Endpoint: /sg_inf/create
Description
This method is used to create a new inference deployment.
Request
- Method: POST
- Headers:
  - Content-Type: application/json
- Body:
```json
{
  "name": "string",
  "model": "string",
  "base_model": "string",
  "inf_type": "llm",
  "hf_token": "string",
  "allow_spot_instances": false,
  "logs_store": "string",
  "cloud_providers": [],
  "gateway_config": {
    "name": "DATACRUNCH",
    "region": "string"
  },
  "initial_worker_config": {
    "min_workers": 0,
    "initial_workers_gpu": "A100",
    "initial_workers_gpu_num": 0,
    "use_same_gpus_when_scaling": false,
    "instance_types": ["string"],
    "use_on_prem": false,
    "use_cloudburst": false,
    "on_prem_node_ids": ["string"]
  },
  "autoscaling_config": {
    "enable_speedup_shared": false,
    "lower_allowed_latency_sec": 1,
    "scale_to_zero_timeout_sec": 1800,
    "scaling_down_timeout_sec": 1200,
    "scaling_up_timeout_sec": 1200,
    "time_window_sec": 300,
    "upper_allowed_latency_sec": 4
  },
  "max_price_per_hour": 0,
  "max_throughput_rate": 0
}
```
Parameter Description
- name :: string : The name of the inference task.
- model :: string : The name of the model to be used for inference.
- base_model [optional] :: string : The base model on which the custom inference is built.
- data_path :: string : The path to the HuggingFace dataset used for finetuning.
- inf_type :: string["llm", "embedding"] : Whether the deployment is for a completions LLM or an embeddings model.
- hf_token [optional] :: string : HuggingFace token.
- allow_spot_instances :: boolean : A boolean indicating whether to use spot instances for inference deployments.
- logs_store [optional] :: string : Name of the Artifacts Storage where logs are to be stored.
- cloud_providers [optional] :: list[string] : An array of cloud providers.
- gateway_config [optional] :: object : An object containing the API gateway configuration.
  - name :: string : Cloud provider where the API gateway is to be configured.
  - region :: string : Region where the API gateway is to be configured.
- initial_worker_config [optional] :: object : An object containing the initial deployment nodes' configuration.
  - min_workers :: int : The minimum number of workers to start with.
  - initial_workers_gpu :: string : The type of GPU to be used by the initial workers.
  - initial_workers_gpu_num :: int : The number of GPUs to be used by the initial workers.
  - use_same_gpus_when_scaling :: boolean : A boolean indicating whether to use the same type of GPUs when scaling.
  - instance_types :: list[string] : Specify AWS instance types instead of initial_workers_gpu and initial_workers_gpu_num.
  - use_on_prem :: boolean : Whether to use on-premise resources.
  - use_cloudburst :: boolean : Whether to use cloudburst.
  - on_prem_node_ids :: list[string] : An array of on-premise node IDs to be used.
- autoscaling_config [optional] :: object : An object containing the autoscaling logic configuration.
  - enable_speedup_shared :: boolean : A boolean indicating whether to enable fast autoscaling on ScaleGenAI shared infrastructure.
  - lower_allowed_latency_sec :: int : The lower limit of allowed latency in seconds.
  - scale_to_zero_timeout_sec :: int : The timeout in seconds for scaling to zero.
  - scaling_down_timeout_sec :: int : The timeout in seconds for scaling down.
  - scaling_up_timeout_sec :: int : The timeout in seconds for scaling up.
  - time_window_sec :: int : The time window in seconds for autoscaling.
  - upper_allowed_latency_sec :: int : The upper limit of allowed latency in seconds.
- max_price_per_hour [optional] :: int : The maximum price per hour for the inference task.
- max_throughput_rate [optional] :: int : The maximum throughput rate for the inference task.
Response
```json
{
  "success": true,
  "message": {
    "inf_id": "string"
  }
}
```
Example
```bash
curl -X POST \
  -H "Content-Type: application/json" \
  -d '{
    "name": "test_inference",
    "model": "meta-llama/Llama-2-70b",
    "inf_type": "llm",
    "allow_spot_instances": true,
    "autoscaling_config": {
      "enable_speedup_shared": false,
      "lower_allowed_latency_sec": 2,
      "scale_to_zero_timeout_sec": 1800,
      "scaling_down_timeout_sec": 1200,
      "scaling_up_timeout_sec": 1200,
      "time_window_sec": 300,
      "upper_allowed_latency_sec": 5
    }
  }' \
  https://api.example.com/sg_inf/create
```
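The same request can be issued from Python. The following is a minimal sketch using the `requests` library; the base URL is the same placeholder host as above, and any authentication headers your account may require are not shown in this section and therefore omitted here.

```python
import requests

BASE_URL = "https://api.example.com"  # placeholder host, substitute your API endpoint

payload = {
    "name": "test_inference",
    "model": "meta-llama/Llama-2-70b",
    "inf_type": "llm",
    "allow_spot_instances": True,
    "autoscaling_config": {
        "enable_speedup_shared": False,
        "lower_allowed_latency_sec": 2,
        "scale_to_zero_timeout_sec": 1800,
        "scaling_down_timeout_sec": 1200,
        "scaling_up_timeout_sec": 1200,
        "time_window_sec": 300,
        "upper_allowed_latency_sec": 5,
    },
}

# POST /sg_inf/create launches the deployment.
resp = requests.post(f"{BASE_URL}/sg_inf/create", json=payload)
resp.raise_for_status()
body = resp.json()

# On success the new deployment ID is returned as message.inf_id.
inf_id = body["message"]["inf_id"]
print(f"Created inference deployment {inf_id}")
```

The returned `inf_id` is the `inference_id` used by the update, delete, and get methods below.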
update
Endpoint: /sg_inf/{inference_id}
Description
This method is used to update an existing inference deployment.
Request
- Method: PUT
- Headers:
  - Content-Type: application/json
- Body:
```json
{
  "initial_worker_config": {
    "min_workers": 0,
    "initial_workers_gpu": "A100",
    "initial_workers_gpu_num": 0,
    "use_same_gpus_when_scaling": false,
    "instance_types": ["string"],
    "use_on_prem": false,
    "use_cloudburst": false,
    "on_prem_node_ids": ["string"]
  },
  "autoscaling_config": {
    "time_window_sec": 300,
    "upper_allowed_latency_sec": 4,
    "lower_allowed_latency_sec": 1,
    "scaling_up_timeout_sec": 1200,
    "scaling_down_timeout_sec": 1200,
    "scale_to_zero_timeout_sec": 1800,
    "enable_speedup_shared": false
  },
  "imidiate_scale_down": false
}
```
Parameter Description
- inference_id :: string : Inference deployment job ID.
- initial_worker_config [optional] :: object : An object containing the initial deployment nodes' configuration to be edited.
  - min_workers :: int : The minimum number of workers.
  - initial_workers_gpu :: string : The type of GPU to be used by the initial workers.
  - initial_workers_gpu_num :: int : The number of GPUs to be used by the initial workers.
  - use_same_gpus_when_scaling :: boolean : A boolean indicating whether to use the same type of GPUs when scaling.
  - instance_types :: list[string] : Specify AWS instance types instead of initial_workers_gpu and initial_workers_gpu_num.
  - use_on_prem :: boolean : Whether to use on-premise resources.
  - use_cloudburst :: boolean : Whether to use cloudburst.
  - on_prem_node_ids :: list[string] : An array of on-premise node IDs to be used.
- autoscaling_config [optional] :: object : An object containing the autoscaling logic configuration.
  - time_window_sec :: int : The time window in seconds for autoscaling.
  - upper_allowed_latency_sec :: int : The upper limit of allowed latency in seconds.
  - lower_allowed_latency_sec :: int : The lower limit of allowed latency in seconds.
  - scaling_up_timeout_sec :: int : The timeout in seconds for scaling up.
  - scaling_down_timeout_sec :: int : The timeout in seconds for scaling down.
  - scale_to_zero_timeout_sec :: int : The timeout in seconds for scaling to zero.
  - enable_speedup_shared :: boolean : A boolean indicating whether to enable fast autoscaling on ScaleGenAI shared infrastructure.
- imidiate_scale_down [optional] :: boolean : A boolean indicating whether to immediately scale down.
Response
```json
{
  "success": true,
  "message": "string"
}
```
Example
```bash
curl -X PUT \
  -H "Content-Type: application/json" \
  -d '{
    "initial_worker_config": {
      "min_workers": 3,
      "initial_workers_gpu": "A100",
      "initial_workers_gpu_num": 2,
      "use_same_gpus_when_scaling": false
    },
    "autoscaling_config": {
      "time_window_sec": 300,
      "upper_allowed_latency_sec": 4,
      "lower_allowed_latency_sec": 1,
      "scaling_up_timeout_sec": 1200,
      "scaling_down_timeout_sec": 1200,
      "scale_to_zero_timeout_sec": 1800,
      "enable_speedup_shared": false
    },
    "imidiate_scale_down": false
  }' \
  https://api.example.com/sg_inf/test_job_id
```
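The same update issued from Python, as a minimal sketch with `requests` (placeholder host and deployment ID; only a subset of the editable fields is sent, since both config objects are optional):

```python
import requests

BASE_URL = "https://api.example.com"  # placeholder host
inference_id = "test_job_id"          # ID returned by /sg_inf/create

payload = {
    "initial_worker_config": {
        "min_workers": 3,
        "initial_workers_gpu": "A100",
        "initial_workers_gpu_num": 2,
        "use_same_gpus_when_scaling": False,
    },
    "imidiate_scale_down": False,
}

# PUT /sg_inf/{inference_id} applies the new worker configuration.
resp = requests.put(f"{BASE_URL}/sg_inf/{inference_id}", json=payload)
resp.raise_for_status()
print(resp.json())  # expected shape: {"success": true, "message": "..."}
```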
delete
Endpoint: /sg_inf/{inference_id}
Description
This method is used to delete an inference deployment.
Request
- Method: DELETE
- Headers:
  - Content-Type: application/json
Parameter Description
- inference_id :: string : Inference deployment job ID.
Response
```json
{
  "success": true,
  "message": "string"
}
```
Example
```bash
curl -X DELETE \
  -H "Content-Type: application/json" \
  https://api.example.com/sg_inf/test_job_id
```
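A Python equivalent, sketched with `requests` under the same placeholder host and deployment ID:

```python
import requests

BASE_URL = "https://api.example.com"  # placeholder host
inference_id = "test_job_id"

# DELETE /sg_inf/{inference_id} removes the deployment.
resp = requests.delete(f"{BASE_URL}/sg_inf/{inference_id}")
resp.raise_for_status()
print(resp.json()["message"])
```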
get
Endpoint: /sg_inf/{inference_id}
Description
This method is used to get information about an inference deployment.
Request
- Method: GET
- Headers:
  - Content-Type: application/json
Parameter Description
- inference_id :: string : Inference deployment job ID.
Response
```json
{
  "success": true,
  "message": "string"
}
```
Example
```bash
curl -X GET \
  -H "Content-Type: application/json" \
  https://api.example.com/sg_inf/test_job_id
```
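A Python equivalent, sketched with `requests` (placeholder host and deployment ID as above):

```python
import requests

BASE_URL = "https://api.example.com"  # placeholder host
inference_id = "test_job_id"

# GET /sg_inf/{inference_id} returns the deployment information.
resp = requests.get(f"{BASE_URL}/sg_inf/{inference_id}")
resp.raise_for_status()
info = resp.json()

# The message field carries the deployment details for the given ID.
if info["success"]:
    print(info["message"])
```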