Scaling and Securing ML Models with KServe
Let’s say we’ve got an awesome machine learning model, maybe even a massive Large Language Model (LLM), and we want to share it with the world or with a small set of users privately. Many things come into play for deployment: handling traffic, securing connections, and keeping the release process streamlined. If we want to leverage the power of Kubernetes here rather than manually configuring VMs step by step (cattle vs. pets), this is where KServe (kserve.github.io) shines, with its good integration with Istio (a popular service mesh) inside a Kubernetes cluster.
How Does KServe Make It Easier?
KServe makes deploying and scaling ML models securely on Kubernetes much easier, acting like a simplified control panel.
Let’s dive into the simplest installation approach: spinning up our own Minikube cluster, installing Istio in ambient mode, setting up KServe, creating an InferenceService, handling traffic, and monitoring it. The same steps work on any public cloud Kubernetes cluster as well.
Ready the Cluster with Installations
1. Spin up a Kubernetes cluster with Minikube
minikube start
2. Install Istio in ambient mode and the Gateway API CRDs
curl -L https://istio.io/downloadIstio | sh -
cd istio-*
export PATH=$PWD/bin:$PATH
istioctl install --set profile=ambient --skip-confirmation
kubectl get crd gateways.gateway.networking.k8s.io &> /dev/null || \
{ kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.2.0/standard-install.yaml; }
3. Install KServe and cert-manager
kubectl create ns kserve
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.17.0/cert-manager.yaml
helm install -n kserve kserve-crd oci://ghcr.io/kserve/charts/kserve-crd --version v0.14.1
helm install -n kserve kserve oci://ghcr.io/kserve/charts/kserve --version v0.14.1 \
--set kserve.controller.deploymentMode=RawDeployment
kubectl create ns kserve-test
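Optionally, a quick sanity check (assuming the default namespaces used by the manifests above) confirms everything came up cleanly:
kubectl get pods -n istio-system
kubectl get pods -n cert-manager
kubectl get pods -n kserve
kubectl get crd | grep gateway.networking.k8s.io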
That’s it for the installation steps. Now we can dig into running a simple ML model on our cluster with ease.
Kickoff InferenceService and Gateway
Let’s run a simple google-t5/t5-small Text-To-Text Transfer Transformer (T5) model from Hugging Face by creating an InferenceService:
kubectl apply -n kserve-test -f - <<EOF
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: huggingface-t5-small
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface
      args:
        - --model_name=t5
        - --model_id=google-t5/t5-small
        - --backend=huggingface
EOF
Then confirm the pod is ready and the InferenceService is live:
kubectl get po -n kserve-test
kubectl get inferenceservice -n kserve-test
That will give us a URL (used as the Host header) through which we will access the model. It should look like this: http://huggingface-t5-small-kserve-test.example.com
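If we only want the URL, we can also pull it straight out of the resource status (a small convenience, assuming the default status fields):
kubectl get inferenceservice huggingface-t5-small -n kserve-test -o jsonpath='{.status.url}'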
Create Gateway and Route
Now we need to create a Gateway and a route using the Gateway API CRDs we installed earlier:
kubectl apply -n kserve-test -f - <<EOF
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: kserve-gateway
spec:
  gatewayClassName: istio
  listeners:
    - name: http
      port: 80
      protocol: HTTP
      allowedRoutes:
        namespaces:
          from: All
EOF
kubectl apply -n kserve-test -f - <<EOF
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: kserve-t5
spec:
  parentRefs:
    - name: kserve-gateway
  hostnames:
    - "huggingface-t5-kserve-test.example.com"
  rules:
    - backendRefs:
        - name: huggingface-t5-small-predictor
          port: 80
EOF
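We can quickly confirm that the Gateway has been programmed and the route attached (a simple sanity check on the resources above):
kubectl get gateway,httproute -n kserve-test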
Access the Model via Minikube Tunnel
If we check the existing Services, the gateway’s LoadBalancer Service will have a pending EXTERNAL-IP:
kubectl get svc -n kserve-test
NAME                             TYPE           CLUSTER-IP       EXTERNAL-IP   PORT(S)                        AGE
huggingface-t5-small-predictor   ClusterIP      10.105.231.203   <none>        80/TCP                         10m
kserve-gateway-istio             LoadBalancer   10.109.151.181   <pending>     15021:32167/TCP,80:32492/TCP   44s
Minikube has a nice feature for tunneling services. If we run minikube tunnel in a separate terminal, the LoadBalancer Service will get an EXTERNAL-IP (127.0.0.1).
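For example (a small sketch; the tunnel keeps running in its own terminal and may prompt for sudo on some platforms):
minikube tunnel
# in another terminal, the EXTERNAL-IP should now show 127.0.0.1
kubectl get svc kserve-gateway-istio -n kserve-test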
With that, we can call our Inference Service with ease.
curl --location 'http://localhost/openai/v1/completions' \
--header 'content-type: application/json' \
--header 'Host: huggingface-t5-kserve-test.example.com' \
--data '{
"model": "t5",
"prompt": "translate this to German: I am living in beautiful world.",
"stream": false,
"max_tokens": 30
}'
# POSSIBLE RESPONSE {"id":"7cb9a637-b14c-4eac-8116-d0b8b463c1f1","choices":[{"finish_reason":"length","index":0,"logprobs":null,"text":"Das lebt in einer schönen Welt."}],"created":1740333939,"model":"t5","system_fingerprint":null,"object":"text_completion","usage":{"completion_tokens":11,"prompt_tokens":13,"total_tokens":24}}
Since we are using the Gateway API, requests and responses can be routed or modified by leveraging its many features, for example (see the sketch after this list):
- Header modification
- Traffic splitting
- Request mirroring
- Redirects and retries
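As a small illustration, here is a hedged sketch of the same HTTPRoute adding a response header; the x-served-by header name and the weight value are just example choices, not required by KServe:
kubectl apply -n kserve-test -f - <<EOF
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: kserve-t5
spec:
  parentRefs:
    - name: kserve-gateway
  hostnames:
    - "huggingface-t5-kserve-test.example.com"
  rules:
    - filters:
        # stamp every response so clients can tell which route served them
        - type: ResponseHeaderModifier
          responseHeaderModifier:
            add:
              - name: x-served-by
                value: kserve-t5
      backendRefs:
        # traffic splitting would simply add a second backendRef with its own weight
        - name: huggingface-t5-small-predictor
          port: 80
          weight: 100
EOF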
Monitoring and Securing Traffic with mTLS
Let’s leverage Istio’s features with a combination of Prometheus and Kiali to view the traffic flow.
kubectl apply -f https://raw.githubusercontent.com/istio/istio/release-1.24/samples/addons/prometheus.yaml
kubectl apply -f https://raw.githubusercontent.com/istio/istio/release-1.24/samples/addons/kiali.yaml
This will install Prometheus and Kiali in the istio-system namespace. With this simple command, we land in the Kiali dashboard:
istioctl dashboard kiali
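Before opening the dashboard, it can help to wait until the addons are actually ready (a quick check, assuming the app=kiali label from the sample manifest above):
kubectl wait --for=condition=ready pod -l app=kiali -n istio-system --timeout=120s
kubectl get pods -n istio-system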
At this point, the traffic is not yet captured by the Istio mesh. Let’s label the namespace, which enrolls all pods in that namespace into the ambient mesh:
kubectl label namespace kserve-test istio.io/dataplane-mode=ambient
If we repeat the same request to the endpoint a few more times, we will see the traffic being traced and secured with mTLS in Kiali.
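A quick loop makes the traffic show up faster in the Kiali graph (a small sketch reusing the earlier request):
for i in $(seq 1 20); do
  curl -s 'http://localhost/openai/v1/completions' \
    -H 'content-type: application/json' \
    -H 'Host: huggingface-t5-kserve-test.example.com' \
    -d '{"model": "t5", "prompt": "translate this to German: I am living in beautiful world.", "max_tokens": 30}' > /dev/null
done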
Autoscale the InferenceService with the Workload
Let’s set the autoscaling target to 1 by adding an annotation, which roughly means one pod per in-flight request:
kubectl apply -n kserve-test -f - <<EOF
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: huggingface-t5-small
  annotations:
    autoscaling.knative.dev/target: "1"
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface
      args:
        - --model_name=t5
        - --model_id=google-t5/t5-small
        - --backend=huggingface
EOF
If we start making a few more requests to the endpoint, we will see pods scaling up:
kubectl get po -n kserve-test
NAME                                              READY   STATUS    RESTARTS   AGE
huggingface-t5-small-predictor-67bd7ddd7c-h49j2   0/1     Running   0          29s
huggingface-t5-small-predictor-db8fbdc77-l6lvq    1/1     Running   0          5h38m
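Since KServe was installed in RawDeployment mode above, the scaling is typically backed by a HorizontalPodAutoscaler, which we can inspect (a hedged check; names follow the predictor deployment):
kubectl get hpa -n kserve-test
kubectl describe hpa -n kserve-test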
Easing Rollouts with the Built-in Canary Strategy
Out of the box, KServe supports a configurable canary rollout strategy with multiple steps, allowing a new version of an InferenceService to receive a percentage of traffic. If a rollout step fails, the strategy can also roll back to the previous revision, which helps ensure stability during updates.
kubectl apply -n kserve-test -f - <<EOF
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: huggingface-t5-small
spec:
  predictor:
    canaryTrafficPercent: 10
    model:
      modelFormat:
        name: huggingface
      args:
        - --model_name=t5
        - --model_id=google-t5/t5-small-v1
        - --backend=huggingface
EOF
The above update will create one new pod that can take 10% of the traffic, but since the google-t5/t5-small-v1 model doesn’t exist, it won’t affect the existing traffic. The canary rollout process is much easier with KServe: we can increase the rollout percentage step by step and gradually test the performance of the new model version.
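Once the new revision looks healthy, the percentage can be bumped, and eventually the canary promoted, with a simple patch (a sketch; 50 and 100 are just example values):
# send half of the traffic to the canary revision
kubectl patch inferenceservice huggingface-t5-small -n kserve-test \
  --type merge -p '{"spec": {"predictor": {"canaryTrafficPercent": 50}}}'
# promote it fully
kubectl patch inferenceservice huggingface-t5-small -n kserve-test \
  --type merge -p '{"spec": {"predictor": {"canaryTrafficPercent": 100}}}'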
That’s it for this post. If you’d like to stay in touch, feel free to connect on LinkedIn.