"Never trust, always verify" sounds great in a conference talk. Implementing it in a production environment with legacy services, tight deadlines, and engineers who just want to ship features is a different story. This guide covers how to roll out zero trust incrementally without breaking everything.
Why Perimeter Security Fails
The traditional network model — hard outer shell, soft interior — assumes that anything inside the network is trusted. This fails because:
- Lateral movement — an attacker who compromises one service can reach everything on the internal network
- Remote work — the "inside" and "outside" distinction no longer maps to physical locations
- Cloud services — your perimeter now extends to AWS, GCP, SaaS tools, and third-party APIs
- Supply chain attacks — a compromised dependency runs with full network access inside your perimeter
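To make the lateral-movement point concrete, here is a minimal sketch that computes which services an attacker can eventually reach from one compromised service. The service names and the two connectivity maps are hypothetical, and the model assumes the attacker can pivot through every service they reach.

```python
from collections import deque

def reachable(start, allowed_edges):
    """Return every service reachable from `start` by following allowed connections."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in allowed_edges.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    seen.discard(start)  # report only the *other* services
    return seen

# Hypothetical services, for illustration only
SERVICES = ["web", "api", "postgres", "billing"]

# Flat internal network: every service can open a connection to every other one
flat_network = {s: [t for t in SERVICES if t != s] for s in SERVICES}

# Segmented network: only explicitly allowed paths exist
segmented_network = {"web": ["api"], "api": ["postgres"]}

print(sorted(reachable("web", flat_network)))
print(sorted(reachable("web", segmented_network)))
```

On the flat network a compromised `web` service reaches everything, including `billing`, which it never needed to talk to. On the segmented network the reachable set shrinks to the paths that were explicitly allowed.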
Zero Trust Principles
Every request — whether from a user, service, or device — must be:
- Authenticated — prove who you are
- Authorized — prove you're allowed to do this specific thing
- Encrypted — all traffic encrypted, even internal
- Continuously verified — authentication isn't a one-time event
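The four checks above can be sketched as a single per-request decision function. This is an illustration under stated assumptions, not a real policy engine: the `Request` fields, the `PERMISSIONS` table, and the action names are all hypothetical.

```python
import time
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical permission table: caller identity -> actions it may perform
PERMISSIONS = {"payment-service": {"orders:read"}}

@dataclass
class Request:
    identity: Optional[str]        # verified caller identity, None if unauthenticated
    action: str                    # the specific thing being attempted
    encrypted: bool                # True if the request arrived over TLS
    credential_expires_at: float   # Unix timestamp at which the credential expires

def evaluate(request: Request, now: Optional[float] = None) -> Tuple[bool, str]:
    """Run all four checks on every request; return (allowed, reason)."""
    now = time.time() if now is None else now
    if request.identity is None:
        return False, "unauthenticated"
    if not request.encrypted:
        return False, "unencrypted transport"
    # Continuous verification: an identity that was valid earlier may have expired
    if now >= request.credential_expires_at:
        return False, "credential expired"
    if request.action not in PERMISSIONS.get(request.identity, set()):
        return False, "action not permitted"
    return True, "ok"
```

The point of the sketch is that no check is skipped based on where the request came from. There is no "internal network" branch.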
Starting Point: Service-to-Service Authentication
The highest-impact first step is ensuring services authenticate to each other. No more "if it's on the internal network, it's trusted."
Mutual TLS (mTLS)
Every service gets a certificate. Every connection requires both sides to present valid certificates:
```yaml
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: production
spec:
  mtls:
    mode: STRICT
```
With Istio's strict mTLS, workloads in the production namespace accept only mutual TLS traffic. Any client that connects in plaintext or without a valid certificate is rejected.
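In a mesh, the sidecar proxy handles the handshake for you. For a service that sits outside the mesh, the same "both sides present certificates" rule can be approximated with Python's standard `ssl` module. This is a minimal sketch: the function name and the file-path parameters are placeholders, not part of any existing setup.

```python
import ssl

def build_mtls_server_context(cert_file=None, key_file=None, ca_file=None):
    """Build a server-side TLS context that refuses clients without a valid certificate."""
    # Purpose.CLIENT_AUTH creates a context for the *server* side of a connection
    ctx = ssl.create_default_context(ssl.Purpose.CLIENT_AUTH)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    # The "mutual" part: the handshake fails unless the client presents a certificate
    ctx.verify_mode = ssl.CERT_REQUIRED
    if cert_file:
        # This service's own certificate and private key (placeholder paths)
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    if ca_file:
        # The CA that signs client certificates, typically your internal CA
        ctx.load_verify_locations(cafile=ca_file)
    return ctx

ctx = build_mtls_server_context()
print(ctx.verify_mode)
```

In real use you would pass all three file paths and wrap the listening socket with `ctx.wrap_socket(sock, server_side=True)`.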
Service Identity with SPIFFE
SPIFFE provides a standard for service identity that works across platforms:
```
spiffe://myorg.com/ns/production/sa/payment-service
```
Every service gets a SPIFFE ID. Authorization policies reference these IDs instead of IP addresses or hostnames, which change constantly in dynamic environments.
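As a rough illustration of authorizing by identity rather than by address, the sketch below parses a SPIFFE ID and checks it against an allowlist. It assumes the `/ns/<namespace>/sa/<service-account>` path layout used in the example ID above; SPIFFE itself does not mandate a path structure. The `ALLOWED_CALLERS` table and the `ledger-service` name are hypothetical.

```python
from typing import NamedTuple
from urllib.parse import urlparse

class SpiffeId(NamedTuple):
    trust_domain: str
    namespace: str
    service_account: str

def parse_spiffe_id(uri: str) -> SpiffeId:
    """Parse spiffe://<trust-domain>/ns/<namespace>/sa/<service-account>."""
    parsed = urlparse(uri)
    if parsed.scheme != "spiffe" or not parsed.netloc:
        raise ValueError(f"not a SPIFFE ID: {uri!r}")
    parts = parsed.path.strip("/").split("/")
    # Assumed layout: ns/<namespace>/sa/<service-account>
    if len(parts) != 4 or parts[0] != "ns" or parts[2] != "sa" or not parts[1] or not parts[3]:
        raise ValueError(f"unexpected SPIFFE path layout: {parsed.path!r}")
    return SpiffeId(parsed.netloc, parts[1], parts[3])

# Hypothetical allowlist: destination service -> SPIFFE IDs allowed to call it
ALLOWED_CALLERS = {
    "ledger-service": {"spiffe://myorg.com/ns/production/sa/payment-service"},
}

def is_allowed(caller_uri: str, destination: str) -> bool:
    """Allow the call only if the caller's identity is valid and explicitly listed."""
    try:
        parse_spiffe_id(caller_uri)
    except ValueError:
        return False
    return caller_uri in ALLOWED_CALLERS.get(destination, set())

print(parse_spiffe_id("spiffe://myorg.com/ns/production/sa/payment-service"))
```

Nothing in the decision depends on the caller's IP address, so the policy keeps working when pods are rescheduled.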
Network Policies: Default Deny
The foundation of zero trust networking in Kubernetes — deny all traffic by default, then explicitly allow only what's needed:
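```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
```
This policy also blocks DNS egress, so pods need an explicit DNS allow rule before name resolution works again. Then allow specific communication paths: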
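```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: api-to-database
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: postgres
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: api-service
      ports:
        - port: 5432
          protocol: TCP
---
# With egress denied by default, api-service also needs a matching egress rule
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: api-egress-to-database
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: api-service
  policyTypes:
    - Egress
  egress:
    - to:
        - podSelector:
            matchLabels:
              app: postgres
      ports:
        - port: 5432
          protocol: TCP
```
User Access: Beyond VPN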
VPNs give users full network access — the opposite of zero trust. Replace VPN-based access with identity-aware proxies:
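```nginx
server {
    listen 443 ssl;
    server_name internal-tool.example.com;
    ssl_certificate     /etc/nginx/tls/internal-tool.crt;  # placeholder path
    ssl_certificate_key /etc/nginx/tls/internal-tool.key;  # placeholder path
    # Auth endpoints, proxied to the auth service. The address is an
    # assumption (for example an oauth2-proxy instance); adjust to your setup.
    location /oauth2/ {
        proxy_pass http://oauth2-proxy:4180;
        proxy_set_header Host $host;
    }
    location = /oauth2/auth {
        proxy_pass http://oauth2-proxy:4180;
        proxy_set_header Host $host;
        proxy_pass_request_body off;        # the auth subrequest carries no body
        proxy_set_header Content-Length "";
    }
    location / {
        # Verify OAuth2 token on every request
        auth_request /oauth2/auth;
        error_page 401 = /oauth2/sign_in;
        auth_request_set $user   $upstream_http_x_auth_request_user;
        auth_request_set $email  $upstream_http_x_auth_request_email;
        auth_request_set $groups $upstream_http_x_auth_request_groups;
        proxy_pass http://internal-tool:8080;
        # Overwrite any identity headers the client sent
        proxy_set_header X-Authenticated-User   $user;
        proxy_set_header X-Authenticated-Email  $email;
        proxy_set_header X-Authenticated-Groups $groups;
    }
}
```
Each request is authenticated and authorized individually. No VPN. No "you're on the network, so you're trusted."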
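On the backend side, the internal tool still has to make an authorization decision from the identity headers the proxy sets. A minimal sketch follows, assuming the groups header is a comma-separated list and that `headers` is a plain dict with the exact header names used above. The function names and group names are hypothetical.

```python
def parse_groups(header_value: str) -> set:
    """Split a comma-separated groups header into a set of group names."""
    return {group.strip() for group in header_value.split(",") if group.strip()}

def authorize(headers: dict, required_group: str) -> bool:
    """Allow the request only if the proxy authenticated a user in the required group."""
    user = headers.get("X-Authenticated-User")
    if not user:
        # No identity header means the request did not pass through the proxy's auth check
        return False
    groups = parse_groups(headers.get("X-Authenticated-Groups", ""))
    return required_group in groups

request_headers = {
    "X-Authenticated-User": "alice",
    "X-Authenticated-Groups": "engineering, sre",
}
print(authorize(request_headers, "sre"))
```

These headers are only trustworthy if the backend is reachable exclusively through the proxy. If anything else can connect to it directly, a client can set the headers itself, which is one more reason to pair the proxy with default-deny network policies.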
Short-Lived Credentials
Long-lived API keys and service account tokens are the antithesis of zero trust. Every credential should expire:
```bash
# Instead of static AWS access keys, use STS for temporary credentials
aws sts assume-role \
  --role-arn arn:aws:iam::123456789012:role/deploy-role \
  --role-session-name ci-deploy \
  --duration-seconds 900  # 15 minutes, enough for one deployment
```
In Kubernetes, use projected service account tokens that expire and auto-rotate:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app  # placeholder name
spec:
  containers:
    - name: app
      image: registry.example.com/app:1.0  # placeholder image
      volumeMounts:
        - name: token
          mountPath: /var/run/secrets/tokens
          readOnly: true
  volumes:
    - name: token
      projected:
        sources:
          - serviceAccountToken:
              path: token
              expirationSeconds: 3600  # 1 hour
              audience: api.example.com
```
Monitoring Zero Trust
Zero trust generates a lot of authentication and authorization events. Monitor them:
```yaml
groups:
  - name: zero-trust-alerts
    rules:
      - alert: UnauthorizedServiceCommunication
        expr: |
          sum(rate(istio_requests_total{
            response_code="403",
            reporter="destination"
          }[5m])) by (source_workload, destination_workload) > 0
        for: 1m
        annotations:
          summary: "{{ $labels.source_workload }} denied access to {{ $labels.destination_workload }}"
      - alert: MtlsHandshakeFailures
        # The exact metric name depends on which Envoy stats your proxies export
        expr: |
          sum(rate(envoy_ssl_connection_error[5m])) by (pod) > 0.1
        for: 2m
        annotations:
          summary: "mTLS handshake failures on {{ $labels.pod }}"
```
The Incremental Rollout
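Don't try to implement everything at once. One rollout order that keeps each change small and reversible:
- Week 1-2: Enable mTLS in permissive mode (services accept both plaintext and mTLS, so you can log remaining plaintext traffic without blocking it)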
- Week 3-4: Deploy default-deny network policies in staging
- Week 5-6: Switch mTLS to strict mode in production
- Week 7-8: Deploy network policies to production
- Month 3: Replace VPN access with identity-aware proxy
- Month 4: Migrate to short-lived credentials
Each step has a rollback plan. Each step is validated before moving to the next.
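The "validate, then move on, otherwise roll back" discipline can be sketched as a small gate runner. This is an illustration only: the `Step` structure and the callables are hypothetical stand-ins for whatever your deployment tooling actually does.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

@dataclass
class Step:
    name: str
    apply: Callable[[], None]      # make the change
    validate: Callable[[], bool]   # check that nothing broke
    rollback: Callable[[], None]   # undo the change

def run_rollout(steps: List[Step]) -> Tuple[List[str], Optional[str]]:
    """Apply steps in order; stop and roll back at the first failed validation.

    Returns (names of completed steps, name of the failed step or None).
    """
    completed = []
    for step in steps:
        step.apply()
        if not step.validate():
            step.rollback()
            return completed, step.name
        completed.append(step.name)
    return completed, None
```

Only the failing step is rolled back. Earlier steps stay in place because each was validated before the next one started.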
Key Takeaways
- Start with service-to-service mTLS — it's the highest-impact, lowest-risk first step
- Default-deny network policies are non-negotiable — without them, compromised services have unlimited lateral movement
- Replace VPNs with identity-aware proxies — VPNs are the opposite of zero trust
- Short-lived credentials reduce blast radius — a leaked token that expires in 15 minutes is dramatically less dangerous
- Roll out incrementally — zero trust is a journey, not a migration weekend