Inference API
The ScaleGenAI inference API has the following methods.
| Function | Description |
| --- | --- |
| create | Create a new inference deployment. |
| delete | Delete an inference deployment. |
| get | Get inference deployment info. |
create
Endpoint: /sg_inf/create
Description
This method is used to create a new inference deployment.
Request
- Method: POST
- Headers:
  - Content-Type: application/json
- Body:
```json
{
"name": "string",
"model": "string",
"base_model": "string",
"inf_type": "llm",
"hf_token": "string",
"allow_spot_instances": false,
"logs_store": "string",
"cloud_providers": [],
"gateway_config": {
"name": "DATACRUNCH",
"region": "string"
},
"initial_worker_config": {
"min_workers": 0,
"initial_workers_gpu": "A100",
"initial_workers_gpu_num": 0,
"use_same_gpus_when_scaling": false,
"instance_types": ["string"],
"use_on_prem": false,
"use_cloudburst": false,
"on_prem_node_ids": ["string"]
},
"autoscaling_config": {
"enable_speedup_shared": false,
"lower_allowed_latency_sec": 1,
"scale_to_zero_timeout_sec": 1800,
"scaling_down_timeout_sec": 1200,
"scaling_up_timeout_sec": 1200,
"time_window_sec": 300,
"upper_allowed_latency_sec": 4
},
"max_price_per_hour": 0,
"max_throughput_rate": 0
}
```
Parameter Description
- id :: string : Unique identifier for the inference deployment instance.
- name :: string : The name of the inference task (e.g., "llama-70b-template").
Config Parameters
- name :: string : The name of the inference configuration (e.g., "llama-70b-dep").
- model :: string : The model to be used for inference (e.g., "meta-llama/Llama-3.1-70B-Instruct").
- base_model :: string : The base model for custom inference (e.g., "meta-llama/Llama-3.1-70B-Instruct").
- inf_type :: string [ "llm" | "embedding" ] : Type of inference, either "llm" for completions or "embedding" for embeddings.
- hf_token [optional] :: string : Hugging Face authentication token.
- engine :: string : Inference engine to be used (e.g., "vllm").
- custom_chat_template [optional] :: string : Custom chat template to apply, if any.
- allow_spot_instances :: boolean : Whether to allow spot instances for the inference deployment (e.g., false).
- logs_store [optional] :: string : Storage location for logs.
Cloud Providers
- cloud_providers :: list[object] : A list of cloud providers for deployment.
  - name :: string : Name of the cloud provider (e.g., "SCALEGENAI").
  - regions :: list[string] : Regions for cloud deployment (e.g., ["US", "EU", "CANADA", "ASIA"]).
Initial Worker Configuration
- initial_worker_config :: object : Configuration for initial worker nodes in the deployment.
  - min_workers :: int : Minimum number of workers to start with (e.g., 0).
  - initial_workers_gpu :: string : Type of GPU for initial workers (e.g., "A100_80GB").
  - initial_workers_gpu_num :: int : Number of GPUs per initial worker (e.g., 4).
  - use_other_gpus :: boolean : Whether to allow other GPU types (e.g., true).
  - instance_types :: list[string] : Specifies custom instance types, if any.
  - use_on_prem :: boolean : Whether to use on-premise resources for deployment (e.g., false).
  - use_cloudburst :: boolean : Whether to enable cloudburst support for scaling (e.g., false).
  - on_prem_node_ids [optional] :: list[string] : List of on-premise node IDs for deployment.
  - expand_gpu_types :: boolean : Allows expansion to different GPU types as needed (e.g., true).
  - max_workers :: int : Maximum number of workers allowed (e.g., 4).
Autoscaling Configuration
- autoscaling_config :: object : Configuration for autoscaling parameters.
  - scale_up_time_window_sec :: int : Time window in seconds for scale-up decisions (e.g., 300).
  - scale_down_time_window_sec :: int : Time window in seconds for scale-down decisions (e.g., 300).
  - scaling_up_timeout_sec :: int : Timeout in seconds for scaling up (e.g., 1200).
  - scaling_down_timeout_sec :: int : Timeout in seconds for scaling down (e.g., 1200).
  - scale_to_zero_timeout_sec :: int : Timeout in seconds before scaling down to zero workers (e.g., 7200).
  - enable_speedup_shared :: boolean : Whether to enable speedup on shared infrastructure (e.g., false).
  - enable_fast_autoscaling :: boolean : Whether to enable fast autoscaling (e.g., false).
  - scale_to_zero :: boolean : Whether to allow scaling down to zero workers (e.g., true).
  - autoscaling_strategy :: string : Strategy for autoscaling (e.g., "ttft_latency_sec").
  - upper_allowed_threshold :: float : Upper threshold for autoscaling (e.g., 5.0).
  - lower_allowed_threshold :: float : Lower threshold for autoscaling (e.g., 0.2).
  - upper_allowed_latency_sec :: float : Upper allowed latency in seconds (e.g., 1.0).
  - lower_allowed_latency_sec :: float : Lower allowed latency in seconds (e.g., 0.2).
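To make the latency-band strategy concrete, the sketch below shows how a "ttft_latency_sec" policy could map a windowed average time-to-first-token onto scaling actions. This is only an illustration of the parameter semantics, not ScaleGenAI's actual controller logic; the function name and defaults are hypothetical.

```python
# Illustrative sketch only: not ScaleGenAI's actual controller implementation.
def scaling_decision(avg_ttft_sec: float,
                     upper_allowed_latency_sec: float = 1.0,
                     lower_allowed_latency_sec: float = 0.2) -> str:
    """Map a windowed average TTFT to a scaling action.

    avg_ttft_sec is assumed to be averaged over scale_up_time_window_sec
    (or scale_down_time_window_sec) before this check runs.
    """
    if avg_ttft_sec > upper_allowed_latency_sec:
        return "up"    # requests are too slow: add workers
    if avg_ttft_sec < lower_allowed_latency_sec:
        return "down"  # capacity is idle: remove workers (or scale to zero)
    return "hold"

print(scaling_decision(1.4))   # 'up'
print(scaling_decision(0.1))   # 'down'
print(scaling_decision(0.5))   # 'hold'
```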
Pricing and Throughput
- max_price_per_hour [optional] :: float : Maximum allowed price per hour for the inference task.
- min_throughput_rate [optional] :: float : Minimum required throughput rate for the task.
Controller Cloud Configuration
- controller_cloud_config :: object : Configuration for the cloud-based controller.
  - public_url :: boolean : Whether the controller is accessible via a public URL (e.g., true).
  - use_ssl :: boolean : Whether SSL is enabled for secure communication (e.g., true).
  - use_api_gateway :: boolean : Whether an API gateway is used (e.g., false).
  - vpc_id [optional] :: string : VPC ID for cloud networking.
  - cloud_provider :: string : The cloud provider for the controller (e.g., "SCALEGENAI").
  - region :: string : The region of the cloud provider (e.g., "US").
  - api_gateway_data [optional] :: object : Additional API gateway configuration data.
Controller On-Premise Configuration
- controller_on_prem_config [optional] :: object : Configuration details for on-premise controller setup.
Advanced Parameters
- llm_loras [optional] :: list : List of Low-Rank Adaptation (LoRA) configurations for the model.
- max_model_len :: int : Maximum allowable model length (e.g., 32000).
- throughput_optimized :: boolean : Whether to optimize for maximum throughput.
Response
```json
{
  "success": true,
  "message": {
    "inf_id": "string"
  }
}
```
Example
```bash
curl -X POST \
-H "Content-Type: application/json" \
-d '{
"name": "llama-70b-dep",
"model": "meta-llama/Llama-3.1-70B-Instruct",
"base_model": "meta-llama/Llama-3.1-70B-Instruct",
"inf_type": "llm",
"hf_token": null,
"engine": "vllm",
"custom_chat_template": null,
"allow_spot_instances": false,
"logs_store": null,
"cloud_providers": [
{
"name": "SCALEGENAI",
"regions": [
"US",
"EU",
"CANADA",
"ASIA"
]
}
],
"initial_worker_config": {
"min_workers": 0,
"initial_workers_gpu": "A100_80GB",
"initial_workers_gpu_num": 4,
"use_other_gpus": true,
"instance_types": [],
"use_on_prem": false,
"use_cloudburst": false,
"on_prem_node_ids": null,
"expand_gpu_types": true,
"max_workers": 4
},
"autoscaling_config": {
"scale_up_time_window_sec": 300,
"scale_down_time_window_sec": 300,
"scaling_up_timeout_sec": 1200,
"scaling_down_timeout_sec": 1200,
"scale_to_zero_timeout_sec": 7200,
"enable_speedup_shared": false,
"enable_fast_autoscaling": false,
"scale_to_zero": true,
"autoscaling_strategy": "ttft_latency_sec",
"upper_allowed_threshold": 5.0,
"lower_allowed_threshold": 0.2,
"upper_allowed_latency_sec": 1.0,
"lower_allowed_latency_sec": 0.2
},
"max_price_per_hour": null,
"min_throughput_rate": null,
"controller_cloud_config": {
"public_url": true,
"use_ssl": true,
"use_api_gateway": false,
"vpc_id": null,
"cloud_provider": "SCALEGENAI",
"region": "US",
"api_gateway_data": null
},
"controller_on_prem_config": null,
"llm_loras": [],
"max_model_len": 32000,
"throughput_optimized": false
}' \
  https://api.example.com/sg_inf/create
```
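The same request can be issued from Python. The sketch below is a minimal example using the requests library; the host https://api.example.com is the same placeholder as in the curl example, the payload is trimmed to a few fields from the body schema above, and any authentication headers your account requires are omitted.

```python
import requests

BASE_URL = "https://api.example.com"  # placeholder host, as in the curl example

# A trimmed payload; see the full body schema above for all fields.
payload = {
    "name": "llama-70b-dep",
    "model": "meta-llama/Llama-3.1-70B-Instruct",
    "base_model": "meta-llama/Llama-3.1-70B-Instruct",
    "inf_type": "llm",
    "engine": "vllm",
    "allow_spot_instances": False,
    "cloud_providers": [{"name": "SCALEGENAI", "regions": ["US", "EU"]}],
    "initial_worker_config": {
        "min_workers": 0,
        "initial_workers_gpu": "A100_80GB",
        "initial_workers_gpu_num": 4,
        "use_other_gpus": True,
        "max_workers": 4,
    },
}

resp = requests.post(f"{BASE_URL}/sg_inf/create", json=payload, timeout=30)
resp.raise_for_status()
inf_id = resp.json()["message"]["inf_id"]  # per the response schema above
print(f"Created inference deployment: {inf_id}")
```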
delete
Endpoint: /sg_inf/{inference_id}
Description
This method is used to delete an inference deployment.
Request
- Method: DELETE
- Headers:
  - Content-Type: application/json
Parameter Description
- inference_id :: string : Inference deployment job ID.
Response
```json
{
  "success": true,
  "message": "string"
}
```
Example
```bash
curl -X DELETE \
  -H "Content-Type: application/json" \
  https://api.example.com/sg_inf/test_job_id
```
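A minimal Python equivalent, under the same assumptions as the create sketch (placeholder host, no auth headers shown):

```python
import requests

BASE_URL = "https://api.example.com"   # placeholder host
inference_id = "test_job_id"           # ID returned by /sg_inf/create

resp = requests.delete(f"{BASE_URL}/sg_inf/{inference_id}", timeout=30)
resp.raise_for_status()
print(resp.json()["message"])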
get
Endpoint: /sg_inf/{inference_id}
Description
This method is used to get information about an inference deployment.
Request
- Method: GET
- Headers:
  - Content-Type: application/json
Parameter Description
- inference_id :: string : Inference deployment job ID.
Response
```json
{
  "success": true,
  "message": "string"
}
```
Example
```bash
curl -X GET \
  -H "Content-Type: application/json" \
  https://api.example.com/sg_inf/test_job_id
```
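And the Python equivalent, again assuming the placeholder host and omitting any auth headers:

```python
import requests

BASE_URL = "https://api.example.com"   # placeholder host
inference_id = "test_job_id"           # ID returned by /sg_inf/create

resp = requests.get(f"{BASE_URL}/sg_inf/{inference_id}", timeout=30)
resp.raise_for_status()
info = resp.json()
print(info["success"], info["message"])
```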