Inference using OpenAI SDK

ScaleGenAI-deployed models are compatible with the OpenAI API standard, allowing for easier integration into existing applications and toolkits.
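
Because the deployment follows the OpenAI API shape, any OpenAI-compatible client works, including a raw HTTP call. The sketch below uses the requests library and assumes the standard /chat/completions path and Bearer-token authentication used by OpenAI-compatible servers; adjust it if your deployment differs.

import os
import requests

# Assumption: the server exposes POST {base_url}/chat/completions with
# Bearer auth, as in the OpenAI API standard.
base_url = os.environ["SCALEGENAI_MODEL_BASE_URL"].rstrip("/")
api_key = os.environ["SCALEGENAI_MODEL_API_KEY"]

response = requests.post(
    f"{base_url}/chat/completions",
    headers={"Authorization": f"Bearer {api_key}"},
    json={
        "model": "meta-llama/Meta-Llama-3-8B",
        "messages": [{"role": "user", "content": "Hello!"}],
    },
    timeout=60,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])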


Python SDK

Simply swap out the OpenAI base_url and api_key with your ScaleGenAI-deployed model's credentials for a seamless switch from the OpenAI GPT backend to an open-source model of your choice.

import os
import openai

system_content = "You are a Science encyclopedia chatbot. Be helpful and informative."
user_content = "What is known as the 'powerhouse of the cell?'"

# Point the OpenAI client at the ScaleGenAI deployment instead of api.openai.com
client = openai.OpenAI(
    api_key=os.environ.get("SCALEGENAI_MODEL_API_KEY"),
    base_url=os.environ.get("SCALEGENAI_MODEL_BASE_URL"),
)

chat_completion = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B",
    messages=[
        {"role": "system", "content": system_content},
        {"role": "user", "content": user_content},
    ],
)

response = chat_completion.choices[0].message.content
print("Response:\n", response)

Streaming Response

To stream responses from the model using the Python SDK, set the stream parameter to True. The SDK then yields chunks as they become available, rather than waiting for the full completion.

import os
import openai

system_content = "You are a Science encyclopedia chatbot. Be helpful and informative."
user_content = "What is known as the 'powerhouse of the cell?'"

client = openai.OpenAI(
    api_key=os.environ.get("SCALEGENAI_MODEL_API_KEY"),
    base_url=os.environ.get("SCALEGENAI_MODEL_BASE_URL"),
)

# stream=True returns an iterator of chunks instead of a single completion
stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B",
    messages=[
        {"role": "system", "content": system_content},
        {"role": "user", "content": user_content},
    ],
    stream=True,
    max_tokens=1024,
)

# Each chunk carries an incremental delta; print pieces as they arrive
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
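
Note that a stream can be consumed only once. If you also need the complete text after streaming finishes, accumulate the deltas inside the loop instead, as in this variant of the loop above:

collected = []
for chunk in stream:
    delta = chunk.choices[0].delta.content or ""
    collected.append(delta)
    print(delta, end="", flush=True)

# Join the accumulated deltas into the full response text
full_response = "".join(collected)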