Running and Securing AI Applications on Cloud Run with Model Armor
AI adoption is accelerating across industries, but running these applications in production introduces new security challenges. From sensitive model data to untrusted extensions, the attack surface grows quickly.
Fortunately, Cloud Run offers an effective way to deploy with built-in scalability and managed infrastructure. By adding Model Armor as a Cloud Run Service Extension, workloads can gain an additional layer of security without heavy operational overhead.
Why Security Matters for AI Applications
AI application services interact with models, extensions, and client applications. A compromised environment can lead to unauthorized access to model files, data leakage, malicious prompt injection, or even compliance violations. Securing the environment ensures that both developers and end users can trust the responses coming from the server.
Deploying Ollama on Cloud Run
In this example, Ollama is chosen because it provides a lightweight, developer-friendly way to run open models with minimal setup. Its flexibility makes it ideal for containerized environments like Cloud Run, where quick deployment and scalability are essential. Cloud Run makes this straightforward:
gcloud run deploy gemma-ollama-baked-service \
--image=ollama/ollama:latest \
--region=us-central1 \
--platform=managed \
--cpu=4 \
--memory=16Gi \
--gpu=1 \
--gpu-type=nvidia-l4 \
--no-gpu-zonal-redundancy \
--port=11434 \
--timeout=600 \
--concurrency=4 \
--set-env-vars=OLLAMA_NUM_PARALLEL=4 \
--no-cpu-throttling \
--allow-unauthenticated \
--max-instances=1 \
--min-instances=0 \
--add-volume=name=ollama-model,type=cloud-storage,bucket=${MODEL_BUCKET} \
--add-volume-mount=volume=ollama-model,mount-path=/root/.ollama/models

This launches the Ollama instance in a fully managed, autoscaling environment. The model files live in a Cloud Storage bucket (referenced above as ${MODEL_BUCKET}), and Cloud Run mounts them directly into the container. By mounting model files from Cloud Storage instead of baking them into the container image, we keep deployments lightweight and easy to update.
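If the bucket does not exist yet, create it first. A minimal sketch, assuming ${MODEL_BUCKET} is a bucket name of your choosing in the same project:

gcloud storage buckets create gs://${MODEL_BUCKET} \
  --location=us-central1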
Call the Ollama API to Pull the Model and Interact
Once the deployment is ready, call the API to pull the gemma:2b model, which will be stored in the volume backed by the Cloud Storage bucket.
curl $CLOUD_RUN_URL/api/pull -d '{
"name": "gemma:2b"
}'

Now we are ready to interact with the model. An optional check that the pull completed is shown below, followed by the generate request.
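Ollama exposes a model-listing endpoint, so a quick way to confirm the pull succeeded is to list the models currently available to the service:

curl $CLOUD_RUN_URL/api/tags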
curl -X POST "$CLOUD_RUN_URL/api/generate" \
-H "Content-Type: application/json" \
-d '{
"model": "gemma:2b",
"prompt": "What are the most popular festivals in Nepal?",
"stream": false
}'

Adding Model Armor with Service Extensions
Here’s where Model Armor comes in. By enabling Cloud Run Service Extensions, the Ollama service gains inline request filtering and traffic inspection.
1. Hook It Up with a Serverless NEG (Network Endpoint Group)
Cloud Run doesn’t natively sit behind a Google-managed load balancer, so we create a NEG:
gcloud compute network-endpoint-groups create my-ai-neg \
--region us-central1 \
--network-endpoint-type serverless \
--cloud-run-service gemma-ollama-baked-service

2. Build the Backend Service for the Load Balancer
gcloud compute backend-services create ai-backend \
--load-balancing-scheme EXTERNAL_MANAGED \
--protocol HTTPS \
--region us-central1

gcloud compute backend-services add-backend ai-backend \
--network-endpoint-group my-ai-neg \
--network-endpoint-group-region us-central1

3. Generate TLS Certificates (Self-Signed for Dev)
For internal testing, I used a self-signed certificate:
openssl genrsa -out ai.key 2048
openssl req -new -x509 -key ai.key -out ai.crt -days 365 \
-subj "/C=US/ST=CA/L=Mountain View/O=tech/OU=AI/CN=internal.ai"

Upload it to Google Cloud:
gcloud compute ssl-certificates create ai-ssl-cert \
--certificate=ai.crt \
--private-key=ai.key \
--region us-central1

For production, you’d use a Google-managed cert tied to your domain.
4. Set Up URL Mapping and HTTPS Proxy
gcloud compute url-maps create ai-url-map \
--default-service ai-backend \
--region us-central1

gcloud compute url-maps add-path-matcher ai-url-map \
--default-service ai-backend \
--path-matcher-name default-matcher \
--path-rules '/*=ai-backend' \
--region us-central1

gcloud compute target-https-proxies create ai-https-proxy \
--url-map ai-url-map \
--ssl-certificates ai-ssl-cert \
--region us-central1
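As an optional sanity check before wiring up the frontend, describing the proxy confirms that it references the URL map and certificate created above:

gcloud compute target-https-proxies describe ai-https-proxy \
  --region us-central1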
5. Carve Out a Subnet and Reserve a Static IP
gcloud compute networks subnets create proxy-subnet \
--purpose REGIONAL_MANAGED_PROXY \
--role ACTIVE \
--region us-central1 \
--network default \
--range 192.168.0.0/26

gcloud compute addresses create ai-static-ip --region us-central1

gcloud compute forwarding-rules create ai-https-fr \
--address ai-static-ip \
--target-https-proxy ai-https-proxy \
--ports 443 \
--load-balancing-scheme EXTERNAL_MANAGED \
--region us-central1
Now, your AI is accessible at a consistent static IP — or ideally, your custom domain pointing to it.
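To try it out, look up the reserved address and send a request through the load balancer. A minimal sketch; the -k flag is only needed because the certificate from step 3 is self-signed:

LB_IP=$(gcloud compute addresses describe ai-static-ip \
  --region us-central1 --format='value(address)')

curl -k -X POST "https://$LB_IP/api/generate" \
  -H "Content-Type: application/json" \
  -d '{"model": "gemma:2b", "prompt": "What are the most popular festivals in Nepal?", "stream": false}'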
6. Layer In Model Armor for Safety and Compliance
You don’t want your AI responding to anything harmful — or worse, leaking sensitive data. Model Armor acts as a protective filter, inspecting both incoming prompts and outgoing responses before they reach your model or your users. First, point gcloud at the regional Model Armor endpoint, then create a template:
gcloud config set api_endpoint_overrides/modelarmor https://modelarmor.us-central1.rep.googleapis.com/

gcloud model-armor templates create --location us-central1 my-model-armor \
--rai-settings-filters='[{ "filterType": "HATE_SPEECH", "confidenceLevel": "MEDIUM_AND_ABOVE" },{ "filterType": "HARASSMENT", "confidenceLevel": "MEDIUM_AND_ABOVE" },{ "filterType": "SEXUALLY_EXPLICIT", "confidenceLevel": "MEDIUM_AND_ABOVE" }]' \
--basic-config-filter-enforcement=enabled \
--pi-and-jailbreak-filter-settings-enforcement=enabled \
--pi-and-jailbreak-filter-settings-confidence-level=LOW_AND_ABOVE \
--malicious-uri-filter-settings-enforcement=enabled \
--template-metadata-custom-llm-response-safety-error-code=798 \
--template-metadata-custom-llm-response-safety-error-message="Guardian, a critical flaw has been detected in the very incantation you are attempting to cast." \
--template-metadata-custom-prompt-safety-error-code=799 \
--template-metadata-custom-prompt-safety-error-message="Guardian, a critical flaw has been detected in the very incantation you are attempting to cast." \
--template-metadata-ignore-partial-invocation-failures \
--template-metadata-log-operations \
--template-metadata-log-sanitize-operations

Then import it as a traffic extension:
gcloud service-extensions lb-traffic-extensions import model-armor-chain \
--source=service_extension.yaml \
--location=us-central1(service_extension.yaml defines how Model Armor filters traffic.)
7. Lock Down Ingress
Ensure your AI only takes traffic via the LB:
gcloud beta run services update gemma-ollama-baked-service \
--ingress internal-and-cloud-load-balancing

Final Architecture
User → LB (HTTPS, static IP, TLS) → Model Armor → Cloud Run service
- Managed TLS termination at the load balancer with minimal extra ops
- Static IP (nice for firewall rules)
- Content safety on both incoming prompts and outgoing responses
- A scalable, serverless AI application endpoint
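As a final end-to-end check, a prompt that trips the configured filters should now be rejected before it reaches the model, returning the custom prompt-safety error code set in the template (799). A sketch, reusing the $LB_IP variable from the earlier test; the prompt is a placeholder for something your filters would flag:

curl -k -X POST "https://$LB_IP/api/generate" \
  -H "Content-Type: application/json" \
  -d '{"model": "gemma:2b", "prompt": "<a prompt that violates the configured filters>", "stream": false}'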
This setup combines the convenience of serverless with strong security controls. Cloud Run ensures scalability, while Model Armor enforces responsible AI practices at every request-response boundary.
