Monitoring Guide¶
This guide explains how to implement comprehensive monitoring for ZeroTrustKerberosLink to ensure optimal performance, security, and reliability.
Overview¶
Effective monitoring of ZeroTrustKerberosLink helps you:
- Detect and respond to security incidents
- Identify performance bottlenecks
- Ensure high availability
- Track usage patterns
- Verify compliance requirements
Key Metrics¶
Authentication Metrics¶
Metric | Description | Normal Range |
---|---|---|
auth_requests_total | Total authentication requests | Varies by usage |
auth_success_total | Successful authentications | >95% of requests |
auth_failure_total | Failed authentications | <5% of requests |
auth_latency_seconds | Authentication request latency | <500ms |
AWS Integration Metrics¶
Metric | Description | Normal Range |
---|---|---|
aws_role_assumptions_total | Total AWS role assumptions | Varies by usage |
aws_role_assumption_failures_total | Failed role assumptions | <1% of attempts |
aws_role_assumption_latency_seconds | Role assumption latency | <1000ms |
aws_api_errors_total | AWS API errors | Near 0 |
System Metrics¶
Metric | Description | Normal Range |
---|---|---|
cpu_usage_percent | CPU utilization | <70% |
memory_usage_bytes | Memory usage | <80% of available |
disk_usage_percent | Disk usage | <80% |
open_file_descriptors | Open file descriptors | <80% of limit |
Redis Metrics¶
Metric | Description | Normal Range |
---|---|---|
redis_connections_total | Total Redis connections | <80% of connection pool |
redis_operation_latency_seconds | Redis operation latency | <50ms |
redis_errors_total | Redis errors | Near 0 |
redis_cache_hit_ratio | Cache hit ratio | >80% |
Prometheus Integration¶
Basic Configuration¶
Enable Prometheus metrics in config.yaml
:
monitoring:
prometheus:
enabled: true
endpoint: "/metrics"
port: 8080 # Same as main service port
Prometheus Server Configuration¶
Add ZeroTrustKerberosLink as a scrape target:
# prometheus.yml
scrape_configs:
- job_name: 'zerotrustkerberos'
scrape_interval: 15s
static_configs:
- targets: ['zerotrustkerberos:8080']
For multiple instances with service discovery:
# prometheus.yml
scrape_configs:
- job_name: 'zerotrustkerberos'
scrape_interval: 15s
consul_sd_configs:
- server: 'consul:8500'
services: ['zerotrustkerberos']
Grafana Dashboards¶
Dashboard Setup¶
- Import the ZeroTrustKerberosLink dashboard:
- Dashboard ID:
12345
(example) -
Or import from JSON file
-
Connect to your Prometheus data source
Key Dashboard Panels¶
- Authentication Overview
- Authentication success/failure rate
- Authentication latency
-
Active sessions
-
AWS Integration
- Role assumption success/failure rate
- Role assumption latency
-
Most used AWS roles
-
System Health
- CPU, memory, disk usage
- Request rate
-
Error rate
-
Security Monitoring
- Failed authentication attempts
- Policy violations
- Unusual access patterns
Alert Configuration¶
Critical Alerts¶
Alert | Condition | Severity | Response |
---|---|---|---|
High Authentication Failure Rate | rate(auth_failure_total[5m]) / rate(auth_requests_total[5m]) > 0.1 | Critical | Investigate potential attack |
Service Unavailable | up{job="zerotrustkerberos"} == 0 | Critical | Restart service, check logs |
High Latency | histogram_quantile(0.95, auth_latency_seconds) > 1 | Warning | Check system resources, scaling |
Redis Connection Failure | redis_up == 0 | Critical | Check Redis connection, restart if needed |
Prometheus Alert Rules¶
# alerts.yml
groups:
- name: zerotrustkerberos
rules:
- alert: HighAuthFailureRate
expr: rate(auth_failure_total[5m]) / rate(auth_requests_total[5m]) > 0.1
for: 5m
labels:
severity: critical
annotations:
summary: "High authentication failure rate"
description: "Authentication failure rate is above 10% for 5 minutes"
- alert: ServiceUnavailable
expr: up{job="zerotrustkerberos"} == 0
for: 1m
labels:
severity: critical
annotations:
summary: "ZeroTrustKerberosLink service down"
description: "Service has been down for more than 1 minute"
Health Checks¶
Health Check Endpoint¶
Configure the health check endpoint:
server:
health_check:
enabled: true
endpoint: "/health"
include_details: true
Health Check Components¶
The health check endpoint reports on:
- Service Status: Overall service health
- Kerberos: Kerberos authentication status
- AWS: AWS API connectivity
- Redis: Redis connection status
- System Resources: CPU, memory, disk usage
Example health check response:
{
"status": "healthy",
"version": "1.2.3",
"timestamp": "2025-05-01T12:34:56Z",
"uptime": "3d 12h 34m",
"components": {
"kerberos": {
"status": "healthy",
"details": "Kerberos authentication working"
},
"aws": {
"status": "healthy",
"details": "AWS API connection OK"
},
"redis": {
"status": "healthy",
"details": "Redis connection OK"
},
"system": {
"status": "healthy",
"cpu": "45%",
"memory": "60%",
"disk": "30%"
}
}
}
Log Monitoring¶
Log-Based Alerts¶
Configure log-based alerts for security events:
logging:
alerts:
- name: "multiple_auth_failures"
pattern: "authentication.*failed"
threshold: 5
window: "5m"
notification: "email"
- name: "admin_actions"
pattern: "event_type.*admin"
threshold: 1
window: "1h"
notification: "slack"
Log Visualization¶
Use Kibana or Grafana Loki for log visualization:
- Authentication Dashboard
- Authentication events over time
- Failed authentication sources
-
Authentication methods used
-
Security Dashboard
- Policy violations
- Administrative actions
- Access patterns
Notification Channels¶
Email Notifications¶
Configure email alerts:
notifications:
email:
enabled: true
smtp_server: "smtp.example.com"
smtp_port: 587
use_tls: true
from_address: "alerts@example.com"
to_addresses:
- "security@example.com"
- "ops@example.com"
Slack Notifications¶
Configure Slack alerts:
notifications:
slack:
enabled: true
webhook_url_file: "/etc/zerotrustkerberos/secrets/slack_webhook"
channel: "#zerotrustkerberos-alerts"
username: "ZeroTrustMonitor"
PagerDuty Integration¶
Configure PagerDuty for critical alerts:
notifications:
pagerduty:
enabled: true
integration_key_file: "/etc/zerotrustkerberos/secrets/pagerduty_key"
service_name: "ZeroTrustKerberosLink"
Monitoring in Kubernetes¶
Prometheus Operator¶
Use Prometheus Operator for Kubernetes monitoring:
- Create a ServiceMonitor:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: zerotrustkerberos
namespace: monitoring
spec:
selector:
matchLabels:
app: zerotrustkerberos
endpoints:
- port: http
path: /metrics
interval: 15s
- Create a PrometheusRule:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: zerotrustkerberos-alerts
namespace: monitoring
spec:
groups:
- name: zerotrustkerberos
rules:
- alert: HighAuthFailureRate
expr: rate(auth_failure_total[5m]) / rate(auth_requests_total[5m]) > 0.1
for: 5m
labels:
severity: critical
annotations:
summary: "High authentication failure rate"
Kubernetes Liveness Probe¶
Configure Kubernetes liveness probe:
livenessProbe:
httpGet:
path: /health
port: 8080
initialDelaySeconds: 30
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 3
Capacity Planning¶
Metrics for Scaling¶
Monitor these metrics for capacity planning:
- Request Rate: Requests per second
- CPU Usage: When consistently >70%, consider scaling
- Memory Usage: When consistently >80%, consider scaling
- Authentication Latency: When P95 >500ms, consider scaling
Auto-scaling Configuration¶
Configure auto-scaling based on metrics:
# Kubernetes HPA example
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: zerotrustkerberos
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: zerotrustkerberos
minReplicas: 2
maxReplicas: 10
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
Security Monitoring¶
Security-Focused Metrics¶
Monitor these security-focused metrics:
- Authentication Failures: By source IP, user, time
- Policy Violations: Attempts to access unauthorized resources
- Role Assumption Patterns: Unusual role assumption behavior
- Geographic Access: Access from unusual locations
- Time-based Access: Access outside normal hours
Security Dashboard¶
Create a security-focused dashboard with:
- Authentication Heatmap: By time and source
- Top Failed Users: Users with most failures
- Geographic Access Map: Where access is coming from
- Role Usage: Which AWS roles are being used
- Policy Violations: Attempts to violate policies
Best Practices¶
Monitoring Best Practices¶
- Comprehensive Coverage: Monitor all critical components
- Appropriate Thresholds: Set realistic alert thresholds
- Alert Fatigue: Avoid too many alerts by tuning thresholds
- Correlation: Correlate metrics with logs for context
- Historical Data: Retain historical data for trend analysis
- Regular Review: Review dashboards and alerts regularly
- Documentation: Document monitoring setup and procedures
Security Monitoring Best Practices¶
- Baseline Behavior: Establish normal behavior patterns
- Anomaly Detection: Alert on deviations from normal
- Context Awareness: Include context in security alerts
- Rapid Response: Define procedures for security alerts
- Regular Testing: Test monitoring and alerting regularly