Inference Guide
With ScaleGenAI, you can deploy base variants of Llama, Mistral, Qwen, and other models at scale. You can also deploy your own fine-tuned models.
The CLI and API guides are linked below. Refer to this section of the doc for information on the various deployment parameters.
Model Configuration
Choose a model: Specify the HuggingFace model repository for the model that you want to deploy.
HuggingFace Access Token: Your HF access token. Refer to this guide to get your HF token.
Results Store: Choose an artifacts store where you want the inference results and logs to be written. More information on how to configure an artifacts store/checkpoint store is available here.
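For example, before creating a deployment you may want to confirm that your HF access token can actually read the model repository you plan to deploy. The snippet below is an illustrative pre-flight check using the `huggingface_hub` Python client, not part of the ScaleGenAI API; the repository ID is a placeholder and the token is read from an environment variable you set yourself.

```python
# Illustrative pre-flight check (not part of the ScaleGenAI API):
# verify that your HF access token can read the model repo you plan to deploy.
import os
from huggingface_hub import HfApi

HF_TOKEN = os.environ["HF_TOKEN"]        # your HuggingFace access token
MODEL_REPO = "meta-llama/Llama-3.1-8B"   # placeholder: the HF repo you want to deploy

api = HfApi(token=HF_TOKEN)
info = api.model_info(MODEL_REPO)        # raises if the token cannot access the repo
print(f"OK: {info.id} is accessible with this token")
```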
GPU Configuration
Choose GPU Type: Select the preferred GPU type for the inference deployment.
No. of GPUs: Select the number of GPUs.
Allow other GPU types: When enabled, if the preferred GPU configuration is not available in the region of choice, an equivalent configuration will be selected (see the sketch below).
You'll get approximate price estimates for your chosen configuration. Click the Update button to apply a new configuration.
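The exact request schema is defined by the ScaleGenAI CLI/API linked above; the snippet below is only a hypothetical Python sketch of how the GPU settings described here fit together, including the fallback behaviour when "Allow other GPU types" is enabled. The field names, GPU type strings, and helper function are illustrative assumptions, not the actual schema.

```python
# Hypothetical GPU configuration sketch -- field names and GPU type strings
# are illustrative, not the actual ScaleGenAI request schema.
PREFERRED = {"gpu_type": "A100_80GB", "gpu_count": 2}

# Equivalent configurations to fall back to when "Allow other GPU types"
# is enabled and the preferred type is unavailable in the chosen region.
EQUIVALENTS = [
    {"gpu_type": "H100_80GB", "gpu_count": 2},
    {"gpu_type": "A100_40GB", "gpu_count": 4},
]

def resolve_gpu_config(available_types, allow_other_gpu_types=True):
    """Return the preferred config if available, otherwise an equivalent one."""
    if PREFERRED["gpu_type"] in available_types:
        return PREFERRED
    if allow_other_gpu_types:
        for cfg in EQUIVALENTS:
            if cfg["gpu_type"] in available_types:
                return cfg
    raise RuntimeError("No matching GPU configuration in the selected region")

print(resolve_gpu_config({"H100_80GB", "L40S"}))  # falls back to 2x H100_80GB
```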
Deployment Region
Choose the Deployment Region: Select the preferred region for model deployment, based on proximity to your users and data jurisdiction requirements.
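If proximity to users is the deciding factor, one quick way to compare candidate regions is to measure connection latency from where your clients run. The snippet below is purely illustrative: the region names and hostnames are placeholders rather than real ScaleGenAI endpoints, and regions ruled out by your data jurisdiction requirements should be removed from the candidate list before comparing.

```python
# Illustrative only: compare TCP connect latency to candidate regions.
# Hostnames are placeholders, not real ScaleGenAI endpoints.
import socket
import time

CANDIDATE_REGIONS = {                 # filter by jurisdiction requirements first
    "us-east-1": "us-east-1.inference.example.com",
    "eu-west-1": "eu-west-1.inference.example.com",
}

def connect_latency(host: str, port: int = 443, timeout: float = 3.0) -> float:
    """Return the time in seconds to open a TCP connection to host:port."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        return time.perf_counter() - start

latencies = {region: connect_latency(host) for region, host in CANDIDATE_REGIONS.items()}
print("Closest region:", min(latencies, key=latencies.get))
```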
Auto-Scaling
Replicas: Set the minimum and maximum number of replicas; auto-scaling is triggered between these bounds based on traffic.
For auto-scaling, you can choose between a throughput-based policy (requests per second) and a latency-based policy (TTFT latency); see the sketch after this list for how these settings interact.
Request Per Second:
- Concurrent Requests Per Worker: Specify the minimum number of concurrent requests per second for each worker.
TTFT Latency:
- Time To First Token Latency (TTFT): Enter the target time to receive the first token during inference.
Scale-Up Window: Set the time window to scale up based on increased inference demand.
Scale-Down Window: Set the time window to scale down based on reduced demand.
Scale-to-Zero Window: Set the time window to scale down to zero when there's no demand.
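The scaling behaviour itself is handled by ScaleGenAI; the sketch below is only an illustrative assumption of how the parameters above interact under a throughput-based policy. The replica count stays within the min/max bounds, moves up or down only after a breach or slack has persisted for the configured window, and drops to zero after the scale-to-zero window passes with no traffic. None of the names or logic come from the actual implementation.

```python
# Illustrative autoscaling logic -- an assumption for explanation only,
# not ScaleGenAI's actual implementation.
from dataclasses import dataclass

@dataclass
class AutoScalePolicy:
    min_replicas: int = 1
    max_replicas: int = 4
    target_rps_per_worker: float = 5.0   # throughput-based policy threshold
    scale_up_window_s: int = 60          # sustained breach before scaling up
    scale_down_window_s: int = 300       # sustained slack before scaling down
    scale_to_zero_window_s: int = 900    # idle time before scaling to zero

def desired_replicas(policy, current_replicas, rps, breach_s, slack_s, idle_s):
    """Return the replica count implied by current traffic and elapsed windows."""
    if idle_s >= policy.scale_to_zero_window_s:
        return 0
    per_worker = rps / max(current_replicas, 1)
    if per_worker > policy.target_rps_per_worker and breach_s >= policy.scale_up_window_s:
        return min(current_replicas + 1, policy.max_replicas)
    if per_worker < policy.target_rps_per_worker and slack_s >= policy.scale_down_window_s:
        return max(current_replicas - 1, policy.min_replicas)
    return current_replicas

policy = AutoScalePolicy()
# 14 RPS across 2 replicas = 7 RPS/worker, above target for 90s -> scale up to 3.
print(desired_replicas(policy, current_replicas=2, rps=14.0, breach_s=90, slack_s=0, idle_s=0))
```

A latency-based policy would follow the same pattern, comparing measured TTFT against the configured target instead of requests per second per worker.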