Inference using OpenAI SDK
ScaleGenAI-deployed models are compatible with the OpenAI API standard, allowing easy integration into existing applications and toolkits.
Python SDK
Simply swap out the OpenAI base_url and api_key with the ScaleGenAI-deployed model credentials for a seamless switch from the OpenAI GPT backend to an open-source model of your choice.
import os
import openai

# Prompts for the chat completion request
system_content = "You are a Science encyclopedia chatbot. Be helpful and informative."
user_content = "What is known as the 'powerhouse of the cell?'"

# Point the OpenAI client at the ScaleGenAI deployment via its credentials
client = openai.OpenAI(
    api_key=os.environ.get("SCALEGENAI_MODEL_API_KEY"),
    base_url=os.environ.get("SCALEGENAI_MODEL_BASE_URL"),
)

chat_completion = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B",
    messages=[
        {"role": "system", "content": system_content},
        {"role": "user", "content": user_content},
    ],
)

# Extract the generated text from the first choice
response = chat_completion.choices[0].message.content
print("Response:\n", response)
Streaming Response
To stream responses from the completions model using the Python SDK, set the stream parameter to True. This makes the SDK yield chunks of the response as they become available, rather than waiting for the full completion.
import os
import openai

system_content = "You are a Science encyclopedia chatbot. Be helpful and informative."
user_content = "What is known as the 'powerhouse of the cell?'"

client = openai.OpenAI(
    api_key=os.environ.get("SCALEGENAI_MODEL_API_KEY"),
    base_url=os.environ.get("SCALEGENAI_MODEL_BASE_URL"),
)

# stream=True returns an iterator of completion chunks instead of a single response
stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B",
    messages=[
        {"role": "system", "content": system_content},
        {"role": "user", "content": user_content},
    ],
    stream=True,
    max_tokens=1024,
)

# Print each delta as it arrives; delta.content can be None on some chunks
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
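Each chunk carries only an incremental delta, so if you also need the complete text once streaming finishes, collect the deltas as you print them. A minimal sketch, replacing the loop above (a stream can only be iterated once):

parts = []
for chunk in stream:
    token = chunk.choices[0].delta.content or ""
    parts.append(token)
    print(token, end="", flush=True)
print()

# The full response, reassembled from the streamed deltas
response = "".join(parts)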