Scaling and Securing ML Models with KServe

Raju Dawadi
5 min read · Feb 24, 2025


Let’s say we’ve got an awesome machine learning model, maybe even a massive Large Language Model (LLM), and we want to share it with the world or privately with a small set of users. Many things come into play: deployment, handling the traffic, securing the connection, and keeping the release process streamlined. If we want to leverage the power of Kubernetes here rather than hand-configuring VMs step by step (cattle vs. pets), this is where KServe (kserve.github.io) comes in, with good integration with Istio (a popular service mesh) inside the Kubernetes cluster.

(Image source: kserve.github.io)

How Does KServe Make It Easier?

KServe makes deploying and scaling ML models securely on Kubernetes much easier, acting like a simplified control panel on top of the cluster.

Let’s dive into the simplest installation approach: spinning up our own Minikube cluster, installing Istio in ambient mode, setting up KServe, launching an inference service, handling traffic, and adding monitoring. The same steps can be followed on any public cloud Kubernetes cluster as well.

Ready the Cluster with Installations

1. Spin up a Kubernetes cluster with minikube

minikube start

2. Install Istio in ambient mode and the Gateway API CRDs

curl -L https://istio.io/downloadIstio | sh -
cd istio-*
export PATH=$PWD/bin:$PATH

istioctl install --set profile=ambient --skip-confirmation

kubectl get crd gateways.gateway.networking.k8s.io &> /dev/null || \
{ kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.2.0/standard-install.yaml; }
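
Before moving on, a quick optional sanity check: the Istio control-plane pods (istiod, the ztunnel DaemonSet and the CNI node agent) should all be Running:

# optional sanity check for the ambient installation
kubectl get pods -n istio-system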

3. Install KServe and cert-manager

kubectl create ns kserve

kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.17.0/cert-manager.yaml
helm install -n kserve kserve-crd oci://ghcr.io/kserve/charts/kserve-crd --version v0.14.1

helm install -n kserve kserve oci://ghcr.io/kserve/charts/kserve --version v0.14.1 \
--set kserve.controller.deploymentMode=RawDeployment

kubectl create ns kserve-test
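
It’s worth confirming that cert-manager and the KServe controller came up healthy before continuing:

# cert-manager webhook and controller pods
kubectl get pods -n cert-manager

# KServe controller manager
kubectl get pods -n kserve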

That’s it for the installation step. Now let’s dig into running a simple ML model on our cluster.

Kick off an InferenceService and Gateway

Let’s run the small google-t5/t5-small (Text-To-Text Transfer Transformer, T5) model from Hugging Face by creating an InferenceService:

kubectl apply -n kserve-test -f - <<EOF
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: huggingface-t5-small
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface
      args:
      - --model_name=t5
      - --model_id=google-t5/t5-small
      - --backend=huggingface
EOF

And confirm the pod is ready and the InferenceService is live:

kubectl get po -n kserve-test

kubectl get InferenceService -n kserve-test

That will give us a URL (the Host header) through which we will access the model. It should look like this: http://huggingface-t5-small-kserve-test.example.com
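
If we’d rather script it, the same information can be pulled from the InferenceService status (assuming it exposes a Ready condition and status.url, which KServe populates once the predictor is up):

# block until the InferenceService reports Ready, then print its URL
kubectl wait --for=condition=Ready inferenceservice/huggingface-t5-small -n kserve-test --timeout=10m
kubectl get inferenceservice huggingface-t5-small -n kserve-test -o jsonpath='{.status.url}'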

Create Gateway and Route

Now we need to create a route using the Gateway API resources we installed earlier:

kubectl apply -n kserve-test -f - <<EOF
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: kserve-gateway
spec:
  gatewayClassName: istio
  listeners:
  - name: http
    port: 80
    protocol: HTTP
    allowedRoutes:
      namespaces:
        from: All
EOF
kubectl apply -n kserve-test -f - <<EOF
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: kserve-t5
spec:
  parentRefs:
  - name: kserve-gateway
  hostnames:
  - "huggingface-t5-kserve-test.example.com"
  rules:
  - backendRefs:
    - name: huggingface-t5-small-predictor
      port: 80
EOF
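
As a quick check, the Gateway and HTTPRoute should show up, and the Gateway should eventually report itself as programmed:

# the Gateway should report PROGRAMMED=True once Istio has picked it up
kubectl get gateway,httproute -n kserve-test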

Access the Model via Minikube Tunnel

If we check the existing services, the gateway’s LoadBalancer service will have its external IP stuck in Pending state:

kubectl get svc -n kserve-test

NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
huggingface-t5-small-predictor ClusterIP 10.105.231.203 <none> 80/TCP 10m
kserve-gateway-istio LoadBalancer 10.109.151.181 <pending> 15021:32167/TCP,80:32492/TCP 44s

Minikube has a nice feature for tunneling services. If we run minikube tunnel, the LoadBalancer-type service will get an EXTERNAL-IP (127.0.0.1).
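
A minimal sketch; the tunnel needs to stay running in its own terminal (and may ask for sudo to bind privileged ports):

# keep this running in a separate terminal while testing
minikube tunnel

# in another terminal, the gateway service should now show EXTERNAL-IP 127.0.0.1
kubectl get svc kserve-gateway-istio -n kserve-test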

With that, we can call our InferenceService with ease:

curl --location 'http://localhost/openai/v1/completions' \
--header 'content-type: application/json' \
--header 'Host: huggingface-t5-kserve-test.example.com' \
--data '{
  "model": "t5",
  "prompt": "translate this to German: I am living in beautiful world.",
  "stream": false,
  "max_tokens": 30
}'

# POSSIBLE RESPONSE {"id":"7cb9a637-b14c-4eac-8116-d0b8b463c1f1","choices":[{"finish_reason":"length","index":0,"logprobs":null,"text":"Das lebt in einer schönen Welt."}],"created":1740333939,"model":"t5","system_fingerprint":null,"object":"text_completion","usage":{"completion_tokens":11,"prompt_tokens":13,"total_tokens":24}}

As we are using the Gateway API, requests and responses can be routed or modified using many of its features, such as the following (a quick sketch comes after the list):

  • Header modification
  • Traffic splitting
  • Request mirroring
  • Redirects and retries
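
For instance, here is a hedged sketch of how traffic splitting and header modification could look on this route; the second backend, huggingface-t5-small-canary-predictor, is hypothetical and only there to illustrate the weights:

# illustrative only: split traffic 90/10 and stamp a header on the way in
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: kserve-t5
spec:
  parentRefs:
  - name: kserve-gateway
  hostnames:
  - "huggingface-t5-kserve-test.example.com"
  rules:
  - filters:
    - type: RequestHeaderModifier      # add a header before forwarding to the backend
      requestHeaderModifier:
        add:
        - name: x-route-version
          value: split-demo
    backendRefs:
    - name: huggingface-t5-small-predictor
      port: 80
      weight: 90                       # 90% of traffic to the existing predictor
    - name: huggingface-t5-small-canary-predictor   # hypothetical second backend
      port: 80
      weight: 10                       # 10% of traffic to the illustrative canary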

Monitoring and Securing Traffic with mTLS

Let’s leverage the Istio features and the Prometheus + Kiali combination to view the traffic flow.

kubectl apply -f https://raw.githubusercontent.com/istio/istio/release-1.24/samples/addons/prometheus.yaml
kubectl apply -f https://raw.githubusercontent.com/istio/istio/release-1.24/samples/addons/kiali.yaml

This will install Prometheus and Kiali in the istio-system namespace. And with this simple command, we land in the Kiali dashboard:

istioctl dashboard kiali

(Kiali dashboard before labeling the namespace)

Here, the traffic is not yet traced by the Istio mesh. Let’s label the namespace, which enrolls all pods in it into the ambient mesh:

kubectl label namespace kserve-test istio.io/dataplane-mode=ambient

If we repeat the same request to the endpoint a few more times, we will see the traffic being traced and secured with mTLS.
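
For example, a quick loop like this (the same request as before, just repeated) gives Kiali enough traffic to draw the graph:

# fire a handful of identical requests so Kiali has traffic to visualize
for i in $(seq 1 10); do
  curl -s --location 'http://localhost/openai/v1/completions' \
    --header 'content-type: application/json' \
    --header 'Host: huggingface-t5-kserve-test.example.com' \
    --data '{"model": "t5", "prompt": "translate this to German: I am living in beautiful world.", "stream": false, "max_tokens": 30}' > /dev/null
done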

(Kiali dashboard after labeling the namespace)

Autoscale InferenceService with Workload

Let’s set the autoscaling target to 1 by adding an annotation, which roughly means one pod per concurrent request:

kubectl apply -n kserve-test -f - <<EOF
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: huggingface-t5-small
  annotations:
    autoscaling.knative.dev/target: "1"
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface
      args:
      - --model_name=t5
      - --model_id=google-t5/t5-small
      - --backend=huggingface
EOF

If we start making a few more requests to the endpoint, we will see the pods scaling up:

kubectl get po -n kserve-test
NAME READY STATUS RESTARTS AGE
huggingface-t5-small-predictor-67bd7ddd7c-h49j2 0/1 Running 0 29s
huggingface-t5-small-predictor-db8fbdc77-l6lvq 1/1 Running 0 5h38m

Easing Rollouts with the Built-in Canary Strategy

KServe supports a configurable canary rollout strategy out of the box, with multiple steps, allowing a new version of an InferenceService to receive a percentage of the traffic. If a rollout step fails, the strategy can also roll back to the previous revision, which helps ensure stability during updates.

(Image source: kserve.github.io)

kubectl apply -n kserve-test -f - <<EOF
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: huggingface-t5-small
spec:
  predictor:
    canaryTrafficPercent: 10
    model:
      modelFormat:
        name: huggingface
      args:
      - --model_name=t5
      - --model_id=google-t5/t5-small-v1
      - --backend=huggingface
EOF

The above update will create one new pod that can take 10% of the traffic, but since the google-t5/t5-small-v1 model doesn’t exist, it won’t affect the existing traffic. The canary rollout process is much easier with KServe: we can gradually increase the rollout percentage and test the performance of the new version of the model.
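
As a hedged sketch of how that could look, the split can be widened with a simple patch, and, as described in the KServe docs, removing canaryTrafficPercent promotes the new revision to take all the traffic:

# widen the canary split to 50% once the new revision looks healthy
kubectl patch inferenceservice huggingface-t5-small -n kserve-test \
  --type merge -p '{"spec": {"predictor": {"canaryTrafficPercent": 50}}}'

# promote: dropping the field routes 100% of traffic to the new revision
kubectl patch inferenceservice huggingface-t5-small -n kserve-test \
  --type json -p '[{"op": "remove", "path": "/spec/predictor/canaryTrafficPercent"}]'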

That’s it for this post. If you’d like to stay in touch, feel free to connect on LinkedIn.
