Binary file modified docs/openshift-ai/data-science-project/images/min-replicas-0.png
@@ -79,26 +79,36 @@ In the pop-up window that appears, you can specify the following details:
KServe integrates with Red Hat OpenShift Serverless and Red Hat OpenShift
Service Mesh to deploy models on the single-model serving platform.

- **Number of model server replicas to deploy**: This defines the number
of instances of the **model server engine** you want to deploy.

Using the "**Advanced**" deployment mode, you can scale it up as needed
by specifying the **Minimum replicas** and **Maximum replicas**, depending
on the expected number of incoming requests.

!!! tips "Intelligent Auto-Scaling & Scale-to-Zero for Significant Cost Savings"

Once you deploy your model and obtain the inference endpoints, you can
edit the deployment and set the **Minimum replicas** to 0. This enables
intelligent auto-scaling of your model's compute resources (CPU, GPU,
RAM, etc.), allowing replicas to scale up during high traffic and scale
down when idle. With `scale-to-zero` enabled, the system reduces pods
to zero during inactivity, eliminating idle compute costs—especially
beneficial for GPU workloads. The model then scales back up instantly
as soon as a new request arrives.

![Minimal Replicas Zero](images/min-replicas-0.png)

ii. **Standard**: Alternatively, you can select the **Standard** deployment mode,
which is based on KServe Raw Deployment mode.

- **Number of model server replicas to deploy**: This defines the number of
instances of the **model server engine** you want to deploy. You can scale it
up as needed by specifying the **Minimum replicas** and **Maximum replicas**,
depending on the expected number of incoming requests.

!!! tips "Intelligent Auto-Scaling and Scale-to-Zero for Significant Cost Savings"
- **Number of model server replicas to deploy**: This defines the number
of instances of the **model server engine** you want to deploy.

Once you deploy your model and obtain the inference endpoints, you can
edit the deployment and set the **Minimum replicas** to 0. This enables
intelligent auto-scaling of your model's compute resources (CPU, GPU,
RAM, etc.), allowing replicas to scale up during high traffic and scale
down when idle. With `scale-to-zero` enabled, the system reduces pods to
zero during inactivity, eliminating idle compute costs—especially
beneficial for GPU workloads. The model then scales back up instantly as
soon as a new request arrives.
In "**Standard**" deployment mode, you cannot scale the number of replicas
up or down, unlike in "**Advanced**" deployment mode.

![Minimal Replicas Zero](images/min-replicas-0.png)
![Number of model server replicas to deploy](images/no-of-model-server-replicas.png)

- **Model server size**: This is the amount of resources, CPU, and RAM that will
be allocated to your server. Select the appropriate configuration for size and
2 changes: 2 additions & 0 deletions docs/openshift-ai/index.md
@@ -56,6 +56,8 @@ the list below.

- [Serving vLLM and Granite Models](other-projects/serving-vLLM-and-Granite-Models.md)

- [Deploying a Llama model with KServe](other-projects/deploying-a-llama-model-with-kserve.md)

- [Serving Text Generation Inference Service (TGIS) and FLAN-T5 Small Model](other-projects/serving-tgis-and-FLAN-T5-Model.md)

## LLM Clients
@@ -0,0 +1,337 @@
# Deploying a Llama model with KServe using Red Hat OpenShift AI

In this walkthrough, we will explore and demonstrate how to deploy a **Llama**
language model using the intuitive interface of Red Hat OpenShift AI (RHOAI) and
NERC's powerful infrastructure features, including GPU acceleration, automatic
resource scaling, and support for distributed computing.

**Prerequisites**:

- Before proceeding, confirm that you have an active GPU quota that has been approved
for your current NERC OpenShift Allocation through NERC ColdFront. Read
more about [How to Access GPU Resources](../../openshift/gpus/intro-to-gpus-on-nerc-ocp.md#accessing-gpu-resources)
on NERC OpenShift Allocation.

- `Llama-3.2-3B-Instruct-FP8` model: [Llama-3.2-3B-Instruct-FP8](https://huggingface.co/RedHatAI/Llama-3.2-3B-Instruct-FP8)
is obtained by quantizing the weights of the [Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct)
model to the FP8 data type. This optimization reduces the number of bits used to
represent weights and activations from 16 to 8, lowering GPU memory requirements
by approximately 50% and increasing matrix-multiply throughput by about 2×. Weight
quantization also reduces disk storage requirements by roughly 50%.

For our Llama model demonstration, we are using a publicly available container
image from the [Quay.io](https://quay.io/repository/jharmison/models?tab=tags&tag=redhatai--llama-3_2-3b-instruct-fp8-modelcar)
registry. Specifically, we will deploy the Llama 3.2 model with 3 billion parameters,
fine-tuned for instruction-following and optimized with 8-bit floating-point
precision to minimize memory usage.

- Set up the OpenShift CLI (`oc`) tools locally and configure them so you can run
`oc` commands. Refer to [this user guide](../../openshift/logging-in/setup-the-openshift-cli.md).

- [Helm](https://helm.sh/) installed locally. A quick way to verify both CLI tools
is sketched right after this list.
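
Before moving on, you can optionally verify that both command-line tools are
available and that you are logged in to the intended cluster and project. This is
a minimal sanity check; the exact versions and the project name will differ in your
environment:

```sh
# Confirm the OpenShift CLI is installed and you are logged in
oc version
oc whoami

# Show the project (namespace) you are currently working in
oc project

# Confirm Helm is installed locally
helm version
```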

## Establishing model connections

Create a **Connection** to a ModelCar container image, which is an **OCI-compliant**
container that packages a machine learning model along with its runtime environment
and dependencies for consistent deployment.

In your OpenShift AI project, go to the **Connections** tab, click
"**Create Connection**", and then choose the **URI** connection type as shown below:

![Select URI for Connection Type](images/select-uri-for-connection.png)

!!! note "Connection Types"

OpenShift AI supports three connection types for accessing model images:

- **OCI-compliant registry**: For proprietary images requiring authentication.

- **S3 compatible object storage**: For cloud storage solutions.

- **URI**: For publicly available resources, **which is the one used in this
demonstration**.

To create this connection in your project, enter the following URI and use
`Llama 3.2 3B Modelcar` as the connection name, as shown below:

![Create Connection](images/create-connection-using-uri.png)

```sh
oci://quay.io/jharmison/models:redhatai--llama-3_2-3b-instruct-fp8-modelcar
```
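
Optionally, if you have [skopeo](https://github.com/containers/skopeo) installed
locally, you can confirm that this ModelCar image tag exists in the registry before
creating the connection. This check is not required for the deployment:

```sh
# Inspect the ModelCar image metadata without pulling the image
skopeo inspect docker://quay.io/jharmison/models:redhatai--llama-3_2-3b-instruct-fp8-modelcar
```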

## Setting up a single-model server and deploying the model

1. In the left menu, click **Data science projects**.

The **Data science projects** page opens.

2. Click the name of the project that you want to deploy a model in.

A project details page opens.

3. Click the **Models** tab.

4. Perform one of the following actions:

- If you see a **Single-model serving platform** tile, click **Select single-model**
on the tile and then click the **Deploy model** button.

![Add A Single-model Server](images/add-a-single-model-server.png)

- If you do not see any tiles, i.e. the "Single-model serving platform" is already
selected, click the **Deploy model** button.

![Single-model serving platform](images/single-model-serving.png)

5. The **Deploy model** dialog opens.

Enter the following information for your new model:

- **Model deployment name**: Enter a unique name for the model that you are
deploying (e.g., "mini-llama-demo").

- **Serving runtime**: Select **vLLM NVIDIA GPU ServingRuntime for KServe** runtime.

- **Model framework (name - version)**: This is pre-selected as `vLLM`.

- **Deployment mode**: From the Deployment mode list, select the **Advanced**
option, which uses *Knative Serverless*.

- **Number of model server replicas to deploy**: Set **Minimum replicas** to `1`
and **Maximum replicas** to `1`.

- **Model server size**: This is the amount of resources (CPU and RAM) that will
be allocated to your server. Here, you can select the `Medium` size.

- **Accelerator**: Select `NVIDIA A100 GPU`.

- **Number of accelerators**: `1`.

- **Model route**: Select the checkbox for "Make deployed models available through
an external route". This enables you to send requests to the model endpoint from
outside the cluster.

- **Token authentication**: Select the checkbox for "Require token authentication"
if you want to restrict access to the model by requiring requests to provide an
authorization token, which is important for security. When you select it, you can
keep the pre-populated **Service account name**, i.e. `default-name`.

- **Source model location**:

i. Select the **Existing connection** option and, from the **Connection** dropdown
list, choose the connection you created
[as described here](#establishing-model-connections), i.e. `Llama 3.2 3B Modelcar`.

- **Configuration parameters**: You can customize the runtime parameters in this
section. You don't need to add any arguments here.

For our example, set the **Model deployment name** to `mini-llama-demo`, and select
`vLLM NVIDIA GPU ServingRuntime for KServe` as the **Serving runtime**. Also, ensure
that the **Deployment mode** is set to `Advanced`, which uses *Knative Serverless*.

![Deploy Model 1](images/deploy-model-1.png)

Please leave the other fields at their default settings. For example, the
**Number of model server replicas to deploy** has **Minimum replicas** set to `1`
and **Maximum replicas** set to `1`, and the **Model server size** is set to `Medium`.

Choose `NVIDIA A100 GPU` as the **Accelerator**, with the **Number of accelerators**
set to `1`.

At this point, ensure that both
**Make deployed models available through an external route** and
**Require token authentication** are *checked*. Please leave the pre-populated
**Service account name**, i.e. `default-name`, as it is. Select `Llama 3.2 3B Modelcar`
as the **Connection** from the *Existing connection* dropdown, as shown below:

![Deploy Model 2](images/deploy-model-2.png)

When you are ready to deploy your model, select the **Deploy** button.

Confirm that the deployed model appears on the **Models** tab for your project.
After some time, once the model has finished deploying, the model deployments
page of the dashboard will display a green checkmark in the **Status** column,
indicating that the deployment is complete.
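
If you prefer the command line, you can also check the deployment status with the
OpenShift CLI. The sketch below assumes the model deployment name `mini-llama-demo`
(the KServe InferenceService is normally named after the model deployment) and that
`<your-namespace>` is your data science project's namespace:

```sh
# The READY column reports True once the model has finished deploying
oc get inferenceservice mini-llama-demo -n <your-namespace>

# List the pods backing the deployment
oc get pods -n <your-namespace>
```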

To view details for the deployed model, click the dropdown *arrow icon* to the left
of your deployed model name (e.g., `mini-llama-demo`), as shown below:

![Llama Model Deployed Successfully](images/llama-model-deployed-successful.png)

You can also modify the configuration properties of your deployed model by clicking
the three dots on the right side and selecting **Edit**. This will bring back the
same configuration pop-up window we used earlier. This menu also gives you the
option to **Delete** the deployed model.

!!! tips "Intelligent Auto-Scaling and Scale-to-Zero for Significant Cost Savings"

Once you deploy your model and obtain the inference endpoints, you can edit
the deployment and set the **Minimum replicas** to 0. This enables intelligent
auto-scaling of your model's compute resources (CPU, GPU, RAM, etc.), allowing
replicas to scale up during high traffic and scale down when idle. With
`scale-to-zero` enabled, the system reduces pods to zero during inactivity,
eliminating idle compute costs—especially beneficial for GPU workloads. The
model then scales back up instantly as soon as a new request arrives.
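
If you enable scale-to-zero, you can watch this behavior from the CLI. The label
selector below is the one KServe normally applies to model pods; adjust the model
name and namespace to match your deployment:

```sh
# With Minimum replicas set to 0, the pods terminate after a period of
# inactivity, and a new pod is created as soon as the next request arrives
oc get pods -l serving.kserve.io/inferenceservice=mini-llama-demo \
    -n <your-namespace> -w
```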

## Testing your deployment

### Internal testing

Once deployed, navigate to **Workloads > Pods** in the left-hand menu, then locate
and click on the pod that corresponds to the model deployment name, as shown below:

![Locate your Llama model server in the Pods tab](images/llama-model-deployment-pod.png)
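
Alternatively, you can open a shell in the same pod from your workstation using the
OpenShift CLI instead of the web console terminal. This optional sketch assumes the
deployment name `mini-llama-demo` and your project namespace:

```sh
# Find the pod that serves the model
oc get pods -n <your-namespace> | grep mini-llama-demo

# Open a remote shell inside that pod (replace with the actual pod name);
# if the pod has several containers, add: -c kserve-container
oc rsh -n <your-namespace> <mini-llama-demo-pod-name>
```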

Access the pod’s terminal by clicking the **Terminal** tab, then run a *curl* command
to test internal communication.

The **vLLM runtime** uses **OpenAI's API** format, making integration straightforward.
You can learn more in the [OpenAI documentation](https://platform.openai.com/docs/api-reference/introduction).

The following is an example command you can use to test the connection:

```sh
curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages": [
{"role": "system", "content": "You are a helpful assistant"},
{"role": "user", "content": "Hello"},
{"role": "assistant", "content": "Hello! How can I help you?"},
{"role": "user", "content": "What is 2 plus 2?"}
]
}'
```
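
Because the vLLM runtime exposes the OpenAI-compatible API, you can also list the
model id it is serving, which is useful when a request requires the `"model"` field
(as in the `/v1/completions` example later in this guide). This is an optional check:

```sh
# List the model id(s) served by this vLLM instance
curl http://localhost:8080/v1/models
```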

If the request is successful, you should see output similar to this:

![Successful curl request output in the pod terminal](images/curl-response-successful.png)

### Testing external access

For external testing, use the token and external endpoint in your *curl* command.

The deployed model is now accessible through the API endpoint of the model server.
The information about the endpoint differs depending on how you configured the
model server.

Since, in this example, you have exposed the model externally through a route, click
the "Internal and external endpoint details" link in the **Inference endpoint**
section. A popup will display the internal *url* address and the *External (can be
accessed from inside or outside the cluster)* inference endpoint, as shown below:

![Successfully Deployed Llama Model Inference endpoints Info](images/deployed-llama-model-inference-endpoints.png)

**Notes**:

- The internal URL displayed is only the base address of the endpoint, in the format
`https://name-of-your-model.name-of-your-project-namespace.svc.cluster.local`, and
it is accessible only from within your cluster.

- The external inference endpoint displays the full URL, in the format
`https://name-of-your-model-name-of-your-project.apps.shift.nerc.mghpcc.org`,
which can be easily accessed from outside the cluster.

- Get the **Authorization Token** for your deployed model by clicking the dropdown
*arrow icon* to the left of your deployed model name, i.e. `mini-llama-demo`. Your
Authorization Token is located in the "Token authentication" section under
"Token secret"; you can copy the token, referred to below as `YOUR_BEARER_TOKEN`,
directly from the UI. A CLI alternative is sketched right after these notes.
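
If you prefer the CLI, the same token can be read from the service account token
secret shown under "Token secret". The secret name varies per deployment, so treat
`<token-secret-name>` below as a placeholder for the name displayed in the dashboard:

```sh
# List the secrets in your project and locate the one shown under "Token secret"
oc get secrets -n <your-namespace>

# Decode the token from that secret
oc get secret <token-secret-name> -n <your-namespace> \
    -o jsonpath='{.data.token}' | base64 -d
```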

The following are some example commands you can use to test the connection:

```sh
curl -X POST https://<url>/v1/chat/completions \
-H "Content-Type: application/json" -H "Authorization: Bearer YOUR_BEARER_TOKEN" \
-d '{
"messages": [
{"role": "system", "content": "You are a helpful assistant"},
{"role": "user", "content": "Hello"},
{"role": "assistant", "content": "Hello! How can I help you?"},
{"role": "user", "content": "What is 2 plus 2?"}
]
}'
```

**Output:**

![External Curl Testing Successful](images/external-curl-testing.png)

You can also call the `/v1/completions` endpoint. Here the request body includes a
`"model"` field, which should match the model id served by your deployment (see the
check after this example):

```sh
curl -k -X POST https://<url>/v1/completions \
-H "Content-Type: application/json" -H "Authorization: Bearer YOUR_BEARER_TOKEN" \
-d '{
"model": "name-of-your-model",
"prompt": "San Francisco is a",
"max_tokens": 7,
"temperature": 0.7
}'
```
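
To confirm the exact model id to use in the `"model"` field, you can query the
OpenAI-compatible `/v1/models` endpoint through the same external route. With the
RHOAI vLLM runtime, this id typically matches your model deployment name:

```sh
# List the model id(s) exposed by the deployment
curl -H "Authorization: Bearer YOUR_BEARER_TOKEN" https://<url>/v1/models
```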

## Web interface integration using [Open WebUI](https://docs.openwebui.com/)

For a more user-friendly experience, integrate with Open WebUI as follows:

1. **Clone** or navigate to [this repository](https://github.com/nerc-project/llm-on-nerc.git).

To get started, clone the repository using:

```sh
git clone https://github.com/nerc-project/llm-on-nerc.git
cd llm-on-nerc/llm-clients/openwebui/charts/openwebui
```

2. Prepare `values.yaml` to connect Open WebUI to the deployed vLLM model.

Edit the `values.yaml` file to point to your running vLLM model: set `vllmEndpoint`
to your model's external inference endpoint (including the `/v1` suffix),
`vllmModel` to the model id served by your deployment (typically your model
deployment name, e.g. `mini-llama-demo`), and `vllmToken` to your authorization
token:

```yaml
vllmEndpoint: http://vllm.example.svc:8000/v1
vllmModel: granite-3.3-2b-instruct
vllmToken: ""
```

3. Install the **Helm chart**.

Deploy Open WebUI using Helm with your configuration:

```sh
helm install openwebui ./ -f values.yaml
```

**Output:**

```sh
NAME: openwebui
LAST DEPLOYED: Tue Dec 2 22:52:06 2025
NAMESPACE: <your-namespace>
STATUS: deployed
REVISION: 1
DESCRIPTION: Install complete
TEST SUITE: None
NOTES:
1. Get the Open WebUI URL by running these commands:
route_hostname=$(kubectl get --namespace <your-namespace> route openwebui -o jsonpath='{.status.ingress[0].host}')
echo https://${route_hostname}
```

4. Access Open WebUI and test the vLLM integration.

Ensure the web UI is connected to your vLLM endpoint by sending a simple prompt
and verifying the response, as shown below:

![Open WebUI Integration](images/open-webui-integration.png)

!!! danger "To Remove the Helm Chart"

Run the following command to cleanly uninstall and delete a Helm release:

```sh
helm uninstall openwebui
```

---