Inference API


The ScaleGenAI inference API provides the following methods:

  • create: Launch a new inference deployment.
  • delete: Delete an inference deployment.
  • get: Get information about an inference deployment.

create

Endpoint: /sg_inf/create

Description

This method is used to create a new inference deployment.

Request

  • Method: POST
  • Headers:
    • Content-Type: application/json
  • Body:
    {
      "name": "string",
      "model": "string",
      "base_model": "string",
      "inf_type": "llm",
      "hf_token": "string",
      "allow_spot_instances": false,
      "logs_store": "string",
      "cloud_providers": [],
      "gateway_config": {
        "name": "DATACRUNCH",
        "region": "string"
      },
      "initial_worker_config": {
        "min_workers": 0,
        "initial_workers_gpu": "A100",
        "initial_workers_gpu_num": 0,
        "use_same_gpus_when_scaling": false,
        "instance_types": ["string"],
        "use_on_prem": false,
        "use_cloudburst": false,
        "on_prem_node_ids": ["string"]
      },
      "autoscaling_config": {
        "enable_speedup_shared": false,
        "lower_allowed_latency_sec": 1,
        "scale_to_zero_timeout_sec": 1800,
        "scaling_down_timeout_sec": 1200,
        "scaling_up_timeout_sec": 1200,
        "time_window_sec": 300,
        "upper_allowed_latency_sec": 4
      },
      "max_price_per_hour": 0,
      "max_throughput_rate": 0
    }

Parameter Description

  • id:: string : Unique identifier for the inference deployment instance.
  • name:: string : The name of the inference task (e.g., "llama-70b-template").

Config Parameters

  • name:: string : The name of the inference configuration (e.g., "llama-70b-dep").
  • model:: string : The model to be used for inference (e.g., "meta-llama/Llama-3.1-70B-Instruct").
  • base_model:: string : The base model for custom inference (e.g., "meta-llama/Llama-3.1-70B-Instruct").
  • inf_type:: string [ "llm" | "embedding" ] : Type of inference, either "llm" for completions or "embedding" for embeddings.
  • hf_token [optional]:: string : Hugging Face authentication token.
  • engine:: string : Inference engine to be used (e.g., "vllm").
  • custom_chat_template [optional]:: string : Custom chat template to apply, if any.
  • allow_spot_instances:: boolean : Whether to allow spot instances for inference deployment (e.g., false).
  • logs_store [optional]:: string : Storage location for logs.

Cloud Providers

  • cloud_providers:: list[object] : A list of cloud providers for deployment.
    • name:: string : Name of the cloud provider (e.g., "SCALEGENAI").
    • regions:: list[string] : Regions for cloud deployment (e.g., ["US", "EU", "CANADA", "ASIA"]).

Initial Worker Configuration

  • initial_worker_config:: object : Configuration for initial worker nodes in the deployment.
    • min_workers:: int : Minimum number of workers to start with (e.g., 0).
    • initial_workers_gpu:: string : Type of GPU for initial workers (e.g., "A100_80GB").
    • initial_workers_gpu_num:: int : Number of GPUs per initial worker (e.g., 4).
    • use_other_gpus:: boolean : Whether to allow other GPU types (e.g., true).
    • instance_types:: list[string] : Specifies custom instance types, if any.
    • use_on_prem:: boolean : Whether to use on-premise resources for deployment (e.g., false).
    • use_cloudburst:: boolean : Whether to enable cloudburst support for scaling (e.g., false).
    • on_prem_node_ids [optional]:: list[string] : List of on-premise node IDs for deployment.
    • expand_gpu_types:: boolean : Allows expansion to different GPU types as needed (e.g., true).
    • max_workers:: int : Maximum number of workers allowed (e.g., 4).

Autoscaling Configuration

  • autoscaling_config:: object : Configuration for autoscaling parameters (an illustrative sketch of how these values interact follows this list).
    • scale_up_time_window_sec:: int : Time window in seconds to scale up workers (e.g., 300).
    • scale_down_time_window_sec:: int : Time window in seconds to scale down workers (e.g., 300).
    • scaling_up_timeout_sec:: int : Timeout in seconds for scaling up (e.g., 1200).
    • scaling_down_timeout_sec:: int : Timeout in seconds for scaling down (e.g., 1200).
    • scale_to_zero_timeout_sec:: int : Timeout in seconds to scale down to zero workers (e.g., 7200).
    • enable_speedup_shared:: boolean : Whether to enable speedup on shared infrastructure (e.g., false).
    • enable_fast_autoscaling:: boolean : Whether to enable fast autoscaling (e.g., false).
    • scale_to_zero:: boolean : Whether to allow scaling down to zero workers (e.g., true).
    • autoscaling_strategy:: string : Strategy for autoscaling, e.g., "ttft_latency_sec".
    • upper_allowed_threshold:: float : Upper threshold for autoscaling (e.g., 5.0).
    • lower_allowed_threshold:: float : Lower threshold for autoscaling (e.g., 0.2).
    • upper_allowed_latency_sec:: float : Upper allowed latency in seconds (e.g., 1.0).
    • lower_allowed_latency_sec:: float : Lower allowed latency in seconds (e.g., 0.2).
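
The interaction between the latency thresholds and the worker limits above is easiest to see in code. The following Python sketch is illustrative only: it assumes the "ttft_latency_sec" strategy compares the average time-to-first-token over the sampling window against the upper and lower latency bounds, which is not spelled out on this page, and it is not the actual controller implementation.

from statistics import mean

# Illustrative sketch (not the actual ScaleGenAI controller logic) of how the
# latency-based autoscaling parameters could drive scaling decisions.
def scaling_decision(ttft_samples_sec, current_workers, cfg):
    """Return +1 to scale up, -1 to scale down, 0 to hold."""
    if not ttft_samples_sec:
        return 0
    observed = mean(ttft_samples_sec)  # average TTFT over the time window
    if observed > cfg["upper_allowed_latency_sec"] and current_workers < cfg["max_workers"]:
        return 1   # latency too high: add a worker
    if observed < cfg["lower_allowed_latency_sec"] and current_workers > cfg["min_workers"]:
        return -1  # latency comfortably low: remove a worker
    return 0

cfg = {
    "upper_allowed_latency_sec": 1.0,
    "lower_allowed_latency_sec": 0.2,
    "min_workers": 0,
    "max_workers": 4,
}
print(scaling_decision([1.4, 1.1, 1.3], current_workers=1, cfg=cfg))  # prints 1 (scale up)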

Pricing and Throughput

  • max_price_per_hour [optional]:: float : Maximum allowed price per hour for the inference task.
  • min_throughput_rate [optional]:: float : Minimum required throughput rate for the task.

Controller Cloud Configuration

  • controller_cloud_config:: object : Configuration for the cloud-based controller.
    • public_url:: boolean : Whether the controller is accessible via a public URL (e.g., true).
    • use_ssl:: boolean : Whether SSL is enabled for secure communication (e.g., true).
    • use_api_gateway:: boolean : Whether an API gateway is used (e.g., false).
    • vpc_id [optional]:: string : VPC ID for cloud networking.
    • cloud_provider:: string : The cloud provider for the controller (e.g., "SCALEGENAI").
    • region:: string : The region of the cloud provider (e.g., "US").
    • api_gateway_data [optional]:: object : Additional API gateway configuration data.

Controller On-Premise Configuration

  • controller_on_prem_config [optional]:: object : Configuration details for on-premise controller setup.

Advanced Parameters

  • llm_loras [optional]:: list : List of Low-Rank Adaptation (LoRA) configurations for the model.
  • max_model_len:: int : Maximum model context length in tokens (e.g., 32000).
  • throughput_optimized:: boolean : Whether to optimize for maximum throughput.

Response

{
  "success": true,
  "message": {
    "inf_id": "string"
  }
}

Example

curl -X POST \
  -H "Content-Type: application/json" \
  -d '{
    "name": "llama-70b-dep",
    "model": "meta-llama/Llama-3.1-70B-Instruct",
    "base_model": "meta-llama/Llama-3.1-70B-Instruct",
    "inf_type": "llm",
    "hf_token": null,
    "engine": "vllm",
    "custom_chat_template": null,
    "allow_spot_instances": false,
    "logs_store": null,
    "cloud_providers": [
      {
        "name": "SCALEGENAI",
        "regions": ["US", "EU", "CANADA", "ASIA"]
      }
    ],
    "initial_worker_config": {
      "min_workers": 0,
      "initial_workers_gpu": "A100_80GB",
      "initial_workers_gpu_num": 4,
      "use_other_gpus": true,
      "instance_types": [],
      "use_on_prem": false,
      "use_cloudburst": false,
      "on_prem_node_ids": null,
      "expand_gpu_types": true,
      "max_workers": 4
    },
    "autoscaling_config": {
      "scale_up_time_window_sec": 300,
      "scale_down_time_window_sec": 300,
      "scaling_up_timeout_sec": 1200,
      "scaling_down_timeout_sec": 1200,
      "scale_to_zero_timeout_sec": 7200,
      "enable_speedup_shared": false,
      "enable_fast_autoscaling": false,
      "scale_to_zero": true,
      "autoscaling_strategy": "ttft_latency_sec",
      "upper_allowed_threshold": 5.0,
      "lower_allowed_threshold": 0.2,
      "upper_allowed_latency_sec": 1.0,
      "lower_allowed_latency_sec": 0.2
    },
    "max_price_per_hour": null,
    "min_throughput_rate": null,
    "controller_cloud_config": {
      "public_url": true,
      "use_ssl": true,
      "use_api_gateway": false,
      "vpc_id": null,
      "cloud_provider": "SCALEGENAI",
      "region": "US",
      "api_gateway_data": null
    },
    "controller_on_prem_config": null,
    "llm_loras": [],
    "max_model_len": 32000,
    "throughput_optimized": false
  }' \
  https://api.example.com/sg_inf/create
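
The same request can be issued from Python. The sketch below is a minimal, hedged translation of the curl example above using the requests library: the host is the same placeholder as in the curl command, authentication headers are omitted because this page does not document them, and only a subset of the optional fields is included. The returned inf_id is the inference_id used by the delete and get endpoints below.

import requests

BASE_URL = "https://api.example.com"  # placeholder host, as in the curl example

payload = {
    "name": "llama-70b-dep",
    "model": "meta-llama/Llama-3.1-70B-Instruct",
    "base_model": "meta-llama/Llama-3.1-70B-Instruct",
    "inf_type": "llm",
    "engine": "vllm",
    "allow_spot_instances": False,
    "cloud_providers": [{"name": "SCALEGENAI", "regions": ["US", "EU"]}],
    "initial_worker_config": {
        "min_workers": 0,
        "initial_workers_gpu": "A100_80GB",
        "initial_workers_gpu_num": 4,
        "use_on_prem": False,
        "use_cloudburst": False,
    },
    "autoscaling_config": {
        "autoscaling_strategy": "ttft_latency_sec",
        "upper_allowed_latency_sec": 1.0,
        "lower_allowed_latency_sec": 0.2,
        "scale_to_zero": True,
    },
    "max_model_len": 32000,
}

resp = requests.post(
    f"{BASE_URL}/sg_inf/create",
    headers={"Content-Type": "application/json"},
    json=payload,
    timeout=30,
)
resp.raise_for_status()
inf_id = resp.json()["message"]["inf_id"]  # keep this ID for the get/delete calls below
print("Created inference deployment:", inf_id)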

delete

Endpoint: /sg_inf/{inference_id}

Description

This method is used to delete an inference deployment.

Request

  • Method: DELETE
  • Headers:
    • Content-Type: application/json

Parameter Description

  • inference_id:: string : Inference deployment job ID.

Response

{
  "success": true,
  "message": "string"
}

Example

curl -X DELETE \
  -H "Content-Type: application/json" \
  https://api.example.com/sg_inf/test_job_id
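
A matching Python sketch for the delete call, with the same placeholder host and the same caveat that authentication is not documented on this page:

import requests

BASE_URL = "https://api.example.com"   # placeholder host
inference_id = "test_job_id"           # ID returned by /sg_inf/create

resp = requests.delete(
    f"{BASE_URL}/sg_inf/{inference_id}",
    headers={"Content-Type": "application/json"},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())  # {"success": true, "message": "..."}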

get

Endpoint: /sg_inf/{inference_id}

Description

This method is used to get information about an inference deployment.

Request

  • Method: GET
  • Headers:
    • Content-Type: application/json

Parameter Description

  • inference_id:: string : Inference deployment job ID.

Response

{
  "success": true,
  "message": "string"
}

Example

curl -X GET \
  -H "Content-Type: application/json" \
  https://api.example.com/sg_inf/test_job_id
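
And a matching Python sketch for the get call, under the same assumptions as the examples above:

import requests

BASE_URL = "https://api.example.com"   # placeholder host
inference_id = "test_job_id"           # ID returned by /sg_inf/create

resp = requests.get(
    f"{BASE_URL}/sg_inf/{inference_id}",
    headers={"Content-Type": "application/json"},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())  # {"success": true, "message": ...}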