
Operations Guide — Agent Framework (Kind Cluster)

Quick reference for deploying updates, checking status, reading logs, and debugging the local Kind cluster.


Prerequisites

Tool      Purpose
docker    Build images
kind      Local k8s cluster
kubectl   Cluster management
uv        Python package manager
pnpm      Frontend package manager

Full Redeploy (from scratch)

Use this when you want to rebuild both images and re-apply everything from scratch.

# Works on Windows (PowerShell), Linux, macOS, and Git Bash
uv run python deploy.py

# Optional flags:
uv run python deploy.py --cluster-name dev --backend-tag agent-microservices-kind:local --frontend-tag chatbot-frontend-kind:local

The script automatically:

1. Reads secrets from .env
2. Builds backend Docker image (agent-microservices-kind:local)
3. Builds frontend Docker image (chatbot-frontend-kind:local) from ../ai-chatbot-ui/
4. Loads both images into the Kind cluster dev
5. Deploys namespaces and infra (Postgres, Redis)
6. Creates secrets in all namespaces
7. Applies all k8s manifests via kubectl apply -k deployment/k8s/overlays/kind
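The build, load, and apply steps above can be pictured as an ordered list of commands. The sketch below is not the contents of deploy.py. `plan_deploy` is a hypothetical helper that only assembles commands that appear elsewhere in this guide; reading .env, deploying infra, and creating secrets are left out because their exact commands are not shown here. The frontend build context and the empty `NEXT_PUBLIC_API_URL` build argument are assumptions borrowed from the frontend-only redeploy section.

```python
from typing import List


def plan_deploy(
    cluster: str = "dev",
    backend_tag: str = "agent-microservices-kind:local",
    frontend_tag: str = "chatbot-frontend-kind:local",
) -> List[List[str]]:
    """Return docker/kind/kubectl commands in the order the guide lists them."""
    return [
        # Step 2: backend image, built from the agent-framework/ root
        ["docker", "build", "-f", "deployment/docker/backend.Dockerfile",
         "-t", backend_tag, "."],
        # Step 3: frontend image (assumption: Dockerfile sits at the root of ../ai-chatbot-ui)
        ["docker", "build", "--build-arg", "NEXT_PUBLIC_API_URL=",
         "-t", frontend_tag, "../ai-chatbot-ui"],
        # Step 4: make both images visible to the Kind node
        ["kind", "load", "docker-image", backend_tag, "--name", cluster],
        ["kind", "load", "docker-image", frontend_tag, "--name", cluster],
        # Step 7: apply the Kind overlay
        ["kubectl", "apply", "-k", "deployment/k8s/overlays/kind"],
    ]


# Print the plan without executing anything
for cmd in plan_deploy():
    print(" ".join(cmd))
```

Returning argument lists instead of shell strings means each entry could be handed to `subprocess.run` unchanged on Windows, Linux, or macOS, with no shell quoting involved.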


Partial Redeploy — Backend Only

When you only change Python code in agent-framework/:

# 1. Rebuild backend image
docker build -f deployment/docker/backend.Dockerfile -t agent-microservices-kind:local .

# 2. Load into Kind cluster
kind load docker-image agent-microservices-kind:local --name dev

# 3. Restart all backend deployments
kubectl rollout restart deployment -n af-edge
kubectl rollout restart deployment -n af-platform
kubectl rollout restart deployment -n af-runtime

# 4. Watch rollout complete
kubectl rollout status deployment/gateway-bff -n af-edge --timeout=120s
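Step 4 waits only for gateway-bff. To wait for every restarted deployment, one option is to turn the output of `kubectl get deployments -n <namespace> -o name` into one `kubectl rollout status` command per deployment. This is a minimal sketch; `rollout_status_commands` is a hypothetical helper, and the sample names are the af-edge deployments mentioned in this guide.

```python
from typing import List


def rollout_status_commands(
    names_output: str, namespace: str, timeout: str = "120s"
) -> List[List[str]]:
    """Build one `kubectl rollout status` command per line of `-o name` output."""
    commands = []
    for line in names_output.splitlines():
        name = line.strip()
        if not name:
            continue  # skip blank lines
        commands.append(
            ["kubectl", "rollout", "status", name, "-n", namespace, f"--timeout={timeout}"]
        )
    return commands


# Sample of what `kubectl get deployments -n af-edge -o name` prints
sample = "deployment.apps/frontend\ndeployment.apps/gateway-bff\n"
for cmd in rollout_status_commands(sample, "af-edge"):
    print(" ".join(cmd))
```

Running the same helper for af-platform and af-runtime covers all three namespaces restarted in step 3.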

Partial Redeploy — Frontend Only

When you only change code in ai-chatbot-ui/:

# Switch to the ai-chatbot-ui/ directory (a sibling of agent-framework/)
cd ../ai-chatbot-ui

# 1. Rebuild frontend image (NEXT_PUBLIC_API_URL="" → uses relative paths via ingress)
docker build --build-arg NEXT_PUBLIC_API_URL="" -t localhost/ai-chatbot-ui:latest .

# 2. Load into Kind cluster
kind load docker-image localhost/ai-chatbot-ui:latest --name dev

# 3. Update the frontend deployment to use the new image
kubectl set image deployment/frontend frontend=localhost/ai-chatbot-ui:latest -n af-edge

# 4. Restart so the pods pick up the rebuilt image even if the image reference is unchanged
kubectl rollout restart deployment/frontend -n af-edge

# 5. Watch it come up
kubectl rollout status deployment/frontend -n af-edge --timeout=120s

Apply k8s Manifest Changes Only

When you edit YAML files in deployment/k8s/ but don't need to rebuild images:

kubectl apply -k deployment/k8s/overlays/kind

Status & Health

Quick overview — all pods

kubectl get pods -A

Per-namespace pods

kubectl get pods -n af-edge        # frontend, gateway-bff
kubectl get pods -n af-platform    # identity-auth, policy-authorization
kubectl get pods -n af-runtime     # agent-runtime, conversation, job-controller, etc.
kubectl get pods -n af-data        # postgres, redis
kubectl get pods -n af-observability  # grafana, loki, tempo, prometheus

Only show problem pods

# PowerShell (on bash/zsh, pipe to: grep -Ev "Completed|code-interpreter")
kubectl get pods -A --field-selector=status.phase!=Running | Where-Object { $_ -notmatch "Completed|code-interpreter" }
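A shell-independent version of the same filter can work on the JSON output of `kubectl get pods -A -o json`. The sketch below assumes that output as input; `problem_pods` is a hypothetical helper, not part of the repository. Like the one-liner, it filters on pod phase, so a pod stuck in CrashLoopBackOff (whose phase is still Running) will not appear.

```python
import json
from typing import List, Tuple


def problem_pods(
    pods_json: str, ignore_substring: str = "code-interpreter"
) -> List[Tuple[str, str, str]]:
    """Return (namespace, name, phase) for pods that are neither Running nor Succeeded."""
    result = []
    for item in json.loads(pods_json).get("items", []):
        meta = item.get("metadata", {})
        name = meta.get("name", "")
        phase = item.get("status", {}).get("phase", "Unknown")
        if phase in ("Running", "Succeeded"):
            continue  # Succeeded pods are the ones listed as "Completed"
        if ignore_substring and ignore_substring in name:
            continue  # same exclusion as the one-liner
        result.append((meta.get("namespace", ""), name, phase))
    return result


# Usage (requires kubectl on PATH):
#   import subprocess
#   out = subprocess.run(["kubectl", "get", "pods", "-A", "-o", "json"],
#                        capture_output=True, text=True, check=True).stdout
#   for ns, name, phase in problem_pods(out):
#       print(ns, name, phase)
```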

Deployment health

kubectl get deployments -A

HPA status (autoscaler)

kubectl get hpa -A

Ingress rules

kubectl describe ingress af-ingress -n af-edge

Endpoint connectivity

# Health check
curl http://localhost/health

# Should return threads list (empty array is fine)
curl http://localhost/threads

# Full smoke test
./deployment/k8s/overlays/kind/smoke-test.ps1
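The two curl checks can also be scripted so they return a pass/fail result on any platform. This is a minimal sketch, not a replacement for smoke-test.ps1. It assumes /health answers with HTTP 200 and /threads returns a JSON array, as described above; `http_get` and `check_endpoints` are hypothetical names.

```python
import json
import urllib.request
from typing import Callable, Dict, Tuple


def http_get(url: str, timeout: float = 5.0) -> Tuple[int, str]:
    """Fetch a URL and return (status_code, body_text)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.status, resp.read().decode("utf-8")


def check_endpoints(
    base_url: str = "http://localhost",
    fetch: Callable[[str], Tuple[int, str]] = http_get,
) -> Dict[str, bool]:
    """Return {path: passed} for the /health and /threads checks."""
    results = {"/health": False, "/threads": False}
    try:
        status, _ = fetch(f"{base_url}/health")
        results["/health"] = status == 200
    except (OSError, ValueError):
        pass  # connection refused, timeout, or HTTP error status
    try:
        status, body = fetch(f"{base_url}/threads")
        # An empty array is fine, so any JSON list counts as a pass
        results["/threads"] = status == 200 and isinstance(json.loads(body), list)
    except (OSError, ValueError):
        pass  # same failures as above, plus a body that is not valid JSON
    return results
```

Calling `check_endpoints()` with no arguments probes http://localhost through the ingress. The `fetch` parameter exists only so the logic can be exercised without a running cluster.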

Logs

Frontend (Next.js)

kubectl logs -n af-edge deployment/frontend --tail=100 -f

Gateway BFF

kubectl logs -n af-edge deployment/gateway-bff --tail=100 -f

Agent Runtime (where the ReAct loop runs)

kubectl logs -n af-runtime deployment/agent-runtime --tail=100 -f

Job Controller

kubectl logs -n af-runtime deployment/job-controller --tail=100 -f

Identity / Auth

kubectl logs -n af-platform deployment/identity-auth --tail=100 -f

All logs from a namespace (last 50 lines per pod)

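# kubectl logs needs a pod name or a non-empty selector; -l app matches every pod that has an app label (assumed here for all pods)
kubectl logs -n af-runtime -l app --tail=50 --all-containers --prefix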

Follow logs from multiple pods matching a label

kubectl logs -n af-runtime -l app=agent-runtime -f

Previous crashed container logs

kubectl logs -n af-edge deployment/frontend --previous

Debugging

Describe a failing pod

kubectl describe pod <pod-name> -n <namespace>
# e.g.
kubectl describe pod -n af-edge -l app=frontend

Exec into a running container

kubectl exec -it -n af-edge deployment/gateway-bff -- /bin/sh

Check events (shows scheduling failures, OOM kills, etc.)

# PowerShell (on bash/zsh, replace "Select-Object -Last N" with "tail -n N")
kubectl get events -n af-edge --sort-by='.lastTimestamp' | Select-Object -Last 20
kubectl get events -A --sort-by='.lastTimestamp' | Select-Object -Last 30

Memory pressure — kill stuck pods

# Delete all Pending pods (controller-managed pods are recreated and schedule once resources free up)
kubectl get pods -A --field-selector=status.phase=Pending -o json |
  kubectl delete -f -

Force delete a stuck pod

kubectl delete pod <pod-name> -n <namespace> --grace-period=0 --force

Secrets

Recreate secrets (after .env change)

# Re-run the deploy script. It is idempotent because it creates secrets with --dry-run=client piped into kubectl apply
uv run python deploy.py

View current secret keys (not values)

kubectl get secret shared-secrets -n af-edge -o jsonpath='{.data}' | ConvertFrom-Json | Get-Member -MemberType NoteProperty | Select-Object Name
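The pipeline above relies on PowerShell cmdlets. A shell-independent alternative is to parse the JSON form of the secret, as in the sketch below. It assumes the input is the output of `kubectl get secret shared-secrets -n af-edge -o json`; `secret_keys` is a hypothetical helper, and only key names are returned, never the base64-encoded values.

```python
import json
from typing import List


def secret_keys(secret_json: str) -> List[str]:
    """Return the sorted key names from a Secret's data map, without the values."""
    data = json.loads(secret_json).get("data") or {}  # data is absent or null for an empty Secret
    return sorted(data)


# Usage (requires kubectl on PATH):
#   import subprocess
#   out = subprocess.run(
#       ["kubectl", "get", "secret", "shared-secrets", "-n", "af-edge", "-o", "json"],
#       capture_output=True, text=True, check=True).stdout
#   print("\n".join(secret_keys(out)))
```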

Scaling

Scale a deployment to 1 replica (memory-constrained single-node Kind)

kubectl scale deployment <name> -n <namespace> --replicas=1
# e.g.
kubectl scale deployment gateway-bff -n af-edge --replicas=1

Scale an HPA minimum

# PowerShell: write the patch to a temp file, then pass it with --patch-file
$patch = '{"spec":{"minReplicas":1}}'
Set-Content "$env:TEMP\hpa.json" $patch
kubectl patch hpa <name> -n <namespace> --type=merge --patch-file "$env:TEMP\hpa.json"
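When the patch is issued from a script instead of an interactive shell, the JSON can be passed inline with kubectl's `-p` flag, because an argument list involves no shell quoting. The sketch below only builds the command; `hpa_min_replicas_command` is a hypothetical helper, and the HPA name in the example is a placeholder, since this guide does not list the actual HPA names.

```python
import json
from typing import List


def hpa_min_replicas_command(name: str, namespace: str, min_replicas: int = 1) -> List[str]:
    """Build a `kubectl patch hpa` merge-patch command that sets spec.minReplicas."""
    patch = json.dumps({"spec": {"minReplicas": min_replicas}})
    return ["kubectl", "patch", "hpa", name, "-n", namespace, "--type=merge", "-p", patch]


# "gateway-bff" is a placeholder HPA name used for illustration only
cmd = hpa_min_replicas_command("gateway-bff", "af-edge")
print(cmd)

# To apply it (requires kubectl on PATH):
#   import subprocess
#   subprocess.run(cmd, check=True)
```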

Observability

Open Grafana dashboards

http://localhost/grafana/
Login: admin / admin (anonymous read also enabled)

Pre-built dashboards:

- Service RED Metrics: requests, errors, duration per service
- Infrastructure: CPU, memory, pod restarts
- Log Analytics: error rates, log search
- Distributed Tracing: trace explorer (Tempo)
- Alerts Overview: firing alerts

Query logs directly (Loki)

In Grafana → Explore → Loki:

{namespace=~"af-.*"}                           # all agent-framework logs
{namespace="af-edge", app="gateway-bff"}       # gateway logs only
{namespace=~"af-.*"} |~ "(?i)error|exception" # errors across all services

Query traces (Tempo)

In Grafana → Explore → Tempo → Search


Local Dev (no cluster)

Start backend (monolith mode)

cd agent-framework
docker compose -f deployment/docker/docker-compose.yml up -d postgres redis
uv run uvicorn ravi.server.app:app --port 8000 --reload

Start frontend

cd ai-chatbot-ui
pnpm dev     # runs on http://localhost:3000

Run tests

cd agent-framework
uv run pytest

Lint & format

uv run ruff check .
uv run ruff format .

Port Reference

Service             Local Dev Port   k8s (Kind)
Frontend (Next.js)  3000             http://localhost/
Backend API         8000             http://localhost/chat, /threads, etc.
PostgreSQL          5432             internal cluster only
Redis               6379             internal cluster only
Grafana             -                http://localhost/grafana/
MCP demo server     9000             docker compose -f deployment/docker/docker-compose.yml --profile mcp