Inference API
The ScaleGenAI inference API has the following methods.
| Function | Description |
| --- | --- |
| create | Create a new inference deployment. |
| delete | Delete an inference deployment. |
| get | Get inference deployment info. |
create
Endpoint: /sg_inf/create
Description
This method is used to create a new inference deployment.
Request
- Method: POST
- Headers:
  - Content-Type: application/json
- Body:
```json
{
"name": "string",
"model": "string",
"base_model": "string",
"inf_type": "llm",
"hf_token": "string",
"allow_spot_instances": false,
"logs_store": "string",
"cloud_providers": [],
"gateway_config": {
"name": "DATACRUNCH",
"region": "string"
},
"initial_worker_config": {
"min_workers": 0,
"initial_workers_gpu": "A100",
"initial_workers_gpu_num": 0,
"use_same_gpus_when_scaling": false,
"instance_types": ["string"],
"use_on_prem": false,
"use_cloudburst": false,
"on_prem_node_ids": ["string"]
},
"autoscaling_config": {
"enable_speedup_shared": false,
"lower_allowed_latency_sec": 1,
"scale_to_zero_timeout_sec": 1800,
"scaling_down_timeout_sec": 1200,
"scaling_up_timeout_sec": 1200,
"time_window_sec": 300,
"upper_allowed_latency_sec": 4
},
"max_price_per_hour": 0,
"max_throughput_rate": 0
}
```
Parameter Description
- id :: string : Unique identifier for the inference deployment instance.
- name :: string : The name of the inference task (e.g., "llama-70b-template").
Config Parameters
- name :: string : The name of the inference configuration (e.g., "llama-70b-dep").
- model :: string : The model to be used for inference (e.g., "meta-llama/Llama-3.1-70B-Instruct").
- base_model :: string : The base model for custom inference (e.g., "meta-llama/Llama-3.1-70B-Instruct").
- inf_type :: string [ "llm" | "embedding" ] : Type of inference, either "llm" for completions or "embedding" for embeddings.
- hf_token [optional] :: string : Hugging Face authentication token.
- engine :: string : Inference engine to be used (e.g., "vllm").
- custom_chat_template [optional] :: string : Custom chat template to apply, if any.
- allow_spot_instances :: boolean : Whether to allow spot instances for the inference deployment (e.g., false).
- logs_store [optional] :: string : Storage location for logs.
Cloud Providers
- cloud_providers :: list[object] : A list of cloud providers for deployment.
  - name :: string : Name of the cloud provider (e.g., "SCALEGENAI").
  - regions :: list[string] : Regions for cloud deployment (e.g., ["US", "EU", "CANADA", "ASIA"]).
Initial Worker Configuration
- initial_worker_config :: object : Configuration for initial worker nodes in the deployment.
  - min_workers :: int : Minimum number of workers to start with (e.g., 0).
  - initial_workers_gpu :: string : Type of GPU for initial workers (e.g., "A100_80GB").
  - initial_workers_gpu_num :: int : Number of GPUs per initial worker (e.g., 4).
  - use_other_gpus :: boolean : Whether to allow other GPU types (e.g., true).
  - instance_types :: list[string] : Specifies custom instance types, if any.
  - use_on_prem :: boolean : Whether to use on-premise resources for deployment (e.g., false).
  - use_cloudburst :: boolean : Whether to enable cloudburst support for scaling (e.g., false).
  - on_prem_node_ids [optional] :: list[string] : List of on-premise node IDs for deployment.
  - expand_gpu_types :: boolean : Allows expansion to different GPU types as needed (e.g., true).
  - max_workers :: int : Maximum number of workers allowed (e.g., 4).
Autoscaling Configuration
- autoscaling_config :: object : Configuration for autoscaling parameters.
  - scale_up_time_window_sec :: int : Time window in seconds for scale-up decisions (e.g., 300).
  - scale_down_time_window_sec :: int : Time window in seconds for scale-down decisions (e.g., 300).
  - scaling_up_timeout_sec :: int : Timeout in seconds for scaling up (e.g., 1200).
  - scaling_down_timeout_sec :: int : Timeout in seconds for scaling down (e.g., 1200).
  - scale_to_zero_timeout_sec :: int : Timeout in seconds before scaling down to zero workers (e.g., 7200).
  - enable_speedup_shared :: boolean : Whether to enable speedup on shared infrastructure (e.g., false).
  - enable_fast_autoscaling :: boolean : Whether to enable fast autoscaling (e.g., false).
  - scale_to_zero :: boolean : Whether to allow scaling down to zero workers (e.g., true).
  - autoscaling_strategy :: string : Strategy for autoscaling (e.g., "ttft_latency_sec").
  - upper_allowed_threshold :: float : Upper threshold for autoscaling (e.g., 5.0).
  - lower_allowed_threshold :: float : Lower threshold for autoscaling (e.g., 0.2).
  - upper_allowed_latency_sec :: float : Upper allowed latency in seconds (e.g., 1.0).
  - lower_allowed_latency_sec :: float : Lower allowed latency in seconds (e.g., 0.2).
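To make the latency-band strategy concrete, the sketch below shows how a "ttft_latency_sec" policy could map a windowed average time-to-first-token onto scaling actions. This is only an illustration of the parameter semantics, not ScaleGenAI's actual controller logic; the function name and defaults are hypothetical.

```python
# Illustrative sketch only: not ScaleGenAI's actual controller implementation.
def scaling_decision(avg_ttft_sec: float,
                     upper_allowed_latency_sec: float = 1.0,
                     lower_allowed_latency_sec: float = 0.2) -> str:
    """Map a windowed average TTFT to a scaling action.

    avg_ttft_sec is assumed to be averaged over scale_up_time_window_sec
    (or scale_down_time_window_sec) before this check runs.
    """
    if avg_ttft_sec > upper_allowed_latency_sec:
        return "up"    # requests are too slow: add workers
    if avg_ttft_sec < lower_allowed_latency_sec:
        return "down"  # capacity is idle: remove workers (or scale to zero)
    return "hold"

print(scaling_decision(1.4))   # 'up'
print(scaling_decision(0.1))   # 'down'
print(scaling_decision(0.5))   # 'hold'
```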
Pricing and Throughput
- max_price_per_hour [optional] :: float : Maximum allowed price per hour for the inference task.
- min_throughput_rate [optional] :: float : Minimum required throughput rate for the task.
Controller Cloud Configuration
- controller_cloud_config :: object : Configuration for the cloud-based controller.
  - public_url :: boolean : Whether the controller is accessible via a public URL (e.g., true).
  - use_ssl :: boolean : Whether SSL is enabled for secure communication (e.g., true).
  - use_api_gateway :: boolean : Whether an API gateway is used (e.g., false).
  - vpc_id [optional] :: string : VPC ID for cloud networking.
  - cloud_provider :: string : The cloud provider for the controller (e.g., "SCALEGENAI").
  - region :: string : The region of the cloud provider (e.g., "US").
  - api_gateway_data [optional] :: object : Additional API gateway configuration data.
Controller On-Premise Configuration
- controller_on_prem_config [optional] :: object : Configuration details for on-premise controller setup.
Advanced Parameters
- llm_loras [optional] :: list : List of Low-Rank Adaptation (LoRA) configurations for the model.
- max_model_len :: int : Maximum allowable model length (e.g., 32000).
- throughput_optimized :: boolean : Whether to optimize for maximum throughput.
Response
```json
{
  "success": true,
  "message": {
    "inf_id": "string"
  }
}
```
Example
```bash
curl -X POST \
-H "Content-Type: application/json" \
-d '{
"name": "llama-70b-dep",
"model": "meta-llama/Llama-3.1-70B-Instruct",
"base_model": "meta-llama/Llama-3.1-70B-Instruct",
"inf_type": "llm",
"hf_token": null,
"engine": "vllm",
"custom_chat_template": null,
"allow_spot_instances": false,
"logs_store": null,
"cloud_providers": [
{
"name": "SCALEGENAI",
"regions": [
"US",
"EU",
"CANADA",
"ASIA"
]
}
],
"initial_worker_config": {
"min_workers": 0,
"initial_workers_gpu": "A100_80GB",
"initial_workers_gpu_num": 4,
"use_other_gpus": true,
"instance_types": [],
"use_on_prem": false,
"use_cloudburst": false,
"on_prem_node_ids": null,
"expand_gpu_types": true,
"max_workers": 4
},
"autoscaling_config": {
"scale_up_time_window_sec": 300,
"scale_down_time_window_sec": 300,
"scaling_up_timeout_sec": 1200,
"scaling_down_timeout_sec": 1200,
"scale_to_zero_timeout_sec": 7200,
"enable_speedup_shared": false,
"enable_fast_autoscaling": false,
"scale_to_zero": true,
"autoscaling_strategy": "ttft_latency_sec",
"upper_allowed_threshold": 5.0,
"lower_allowed_threshold": 0.2,
"upper_allowed_latency_sec": 1.0,
"lower_allowed_latency_sec": 0.2
},
"max_price_per_hour": null,
"min_throughput_rate": null,
"controller_cloud_config": {
"public_url": true,
"use_ssl": true,
"use_api_gateway": false,
"vpc_id": null,
"cloud_provider": "SCALEGENAI",
"region": "US",
"api_gateway_data": null
},
"controller_on_prem_config": null,
"llm_loras": [],
"max_model_len": 32000,
"throughput_optimized": false
}' \
  https://api.example.com/sg_inf/create
```
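The same request can be issued from Python. The sketch below is a minimal example using the requests library; the host https://api.example.com is the same placeholder as in the curl example, the payload is trimmed to a few fields from the body schema above, and any authentication headers your account requires are omitted.

```python
import requests

BASE_URL = "https://api.example.com"  # placeholder host, as in the curl example

# A trimmed payload; see the full body schema above for all fields.
payload = {
    "name": "llama-70b-dep",
    "model": "meta-llama/Llama-3.1-70B-Instruct",
    "base_model": "meta-llama/Llama-3.1-70B-Instruct",
    "inf_type": "llm",
    "engine": "vllm",
    "allow_spot_instances": False,
    "cloud_providers": [{"name": "SCALEGENAI", "regions": ["US", "EU"]}],
    "initial_worker_config": {
        "min_workers": 0,
        "initial_workers_gpu": "A100_80GB",
        "initial_workers_gpu_num": 4,
        "use_other_gpus": True,
        "max_workers": 4,
    },
}

resp = requests.post(f"{BASE_URL}/sg_inf/create", json=payload, timeout=30)
resp.raise_for_status()
inf_id = resp.json()["message"]["inf_id"]  # per the response schema above
print(f"Created inference deployment: {inf_id}")
```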
delete
Endpoint: /sg_inf/{inference_id}
Description
This method is used to delete an inference deployment.
Request
- Method: DELETE
- Headers:
  - Content-Type: application/json
Parameter Description
- inference_id :: string : Inference deployment job ID.
Response
```json
{
  "success": true,
  "message": "string"
}
```
Example
```bash
curl -X DELETE \
  -H "Content-Type: application/json" \
  https://api.example.com/sg_inf/test_job_id
```
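A minimal Python equivalent, under the same assumptions as the create sketch (placeholder host, no auth headers shown):

```python
import requests

BASE_URL = "https://api.example.com"   # placeholder host
inference_id = "test_job_id"           # ID returned by /sg_inf/create

resp = requests.delete(f"{BASE_URL}/sg_inf/{inference_id}", timeout=30)
resp.raise_for_status()
print(resp.json()["message"])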
get
Endpoint: /sg_inf/{inference_id}
Description
This method is used to get information about an inference deployment.
Request
- Method: GET
- Headers:
  - Content-Type: application/json
Parameter Description
- inference_id :: string : Inference deployment job ID.
Response
```json
{
  "success": true,
  "message": "string"
}
```
Example
```bash
curl -X GET \
  -H "Content-Type: application/json" \
  https://api.example.com/sg_inf/test_job_id
```
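And the Python equivalent, again assuming the placeholder host and omitting any auth headers:

```python
import requests

BASE_URL = "https://api.example.com"   # placeholder host
inference_id = "test_job_id"           # ID returned by /sg_inf/create

resp = requests.get(f"{BASE_URL}/sg_inf/{inference_id}", timeout=30)
resp.raise_for_status()
info = resp.json()
print(info["success"], info["message"])
```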