Free Interactive Tool · MLOps

AI Model Performance & Health Dashboard

The production monitoring framework for deployed AI systems. Visualise drift detection, latency and uptime, and incident rollback tracking — with a plain-language explanation of why each signal matters and how to instrument it in your environment.

[Interactive dashboard. Controls: Time Range, Model Type, Drift Threshold (PSI), shown at 0.25. Summary tiles: Drift Score (PSI), P95 Latency, System Uptime, Open Incidents, Avg MTTR, Rollbacks. Charts: Drift Score Over Time (PSI score against threshold), Latency Distribution, Incident & Rollback Timeline.]

Metric Education

What each signal means, why it matters in production, and how to deploy it.

Drift Detection

Purpose

Drift detection tracks statistical deviation between the reference data distribution captured at deployment time and what the model is receiving in production. There are two types: data drift (the inputs have changed) and concept drift (the relationship between inputs and outputs has changed). Both can cause silent accuracy degradation: the model answers confidently with increasingly wrong results.

Why It Matters

A model trained on last year's data will degrade when market conditions, language patterns, or customer behaviour shift — drift detection catches this weeks before accuracy visibly collapses
Data drift is the leading indicator; concept drift is the dangerous one. Without monitoring, you discover failures through customer complaints, not instrumentation
The Population Stability Index (PSI) is a common way to quantify drift: below 0.10 is generally read as stable, 0.10–0.25 requires investigation, and above 0.25 indicates a major shift requiring retraining or rollback
LLMs drift for different reasons than classification models — prompt distribution shifts, topic drift, and language evolution are the primary vectors
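
For reference, PSI is usually computed over binned feature values. The symbols below are assumptions introduced for illustration: B is the number of bins, q_i is the share of baseline records in bin i, and p_i is the share of production records in the same bin.

```latex
\mathrm{PSI} = \sum_{i=1}^{B} \left( p_i - q_i \right) \ln\frac{p_i}{q_i}
```

As a worked example with two bins, a baseline split of (0.5, 0.5) against a production split of (0.7, 0.3) gives 0.2 ln(1.4) + (-0.2) ln(0.6) ≈ 0.17, which falls in the 0.10–0.25 investigation band.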

How to Deploy

Capture baseline statistics (mean, variance, feature distributions) on your training or validation set before deployment — this is your reference snapshot
Log every inference input to a monitoring store; compute PSI against baseline on a rolling 24-hour window for each feature
Set alert at PSI > 0.25; trigger human review at PSI > 0.10 with automated tagging of affected predictions
Separate data drift from concept drift by scoring a freshly labelled sample of recent production data weekly: if accuracy drops while PSI is low, concept drift is the likely cause
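
The steps above can be sketched for a single numeric feature as follows. This is a minimal illustration, not a reference implementation: the function names, the quantile binning, the `eps` floor for empty bins, and the `max_drop` accuracy tolerance are all assumptions, and the rolling 24-hour windowing and inference logging are left out.

```python
import math
from bisect import bisect_right

REVIEW_AT = 0.10  # human-review threshold from the text
ALERT_AT = 0.25   # alert threshold from the text


def quantile_edges(baseline, n_bins=10):
    """Interior bin edges taken from the baseline (reference) sample."""
    ordered = sorted(baseline)
    return [ordered[(len(ordered) * k) // n_bins] for k in range(1, n_bins)]


def bin_proportions(values, edges, eps=1e-6):
    """Share of values per bin; eps avoids log(0) for empty bins (assumed convention)."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[bisect_right(edges, v)] += 1
    return [max(c / len(values), eps) for c in counts]


def psi(baseline, production, n_bins=10):
    """Population Stability Index of production against baseline."""
    edges = quantile_edges(baseline, n_bins)
    q = bin_proportions(baseline, edges)
    p = bin_proportions(production, edges)
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))


def classify_psi(score):
    """Map a PSI score onto the review and alert bands."""
    if score > ALERT_AT:
        return "alert"
    if score > REVIEW_AT:
        return "review"
    return "stable"


def diagnose(psi_score, baseline_accuracy, current_accuracy, max_drop=0.05):
    """Accuracy falling while PSI stays low points to concept drift.

    max_drop is a hypothetical tolerance, not a value from the text.
    """
    accuracy_fell = (baseline_accuracy - current_accuracy) > max_drop
    if accuracy_fell and psi_score <= REVIEW_AT:
        return "concept drift suspected"
    if psi_score > REVIEW_AT:
        return "data drift"
    return "no drift signal"
```

In practice `psi` would be run per feature against the stored baseline snapshot, with `classify_psi` deciding whether to tag predictions for review or raise an alert.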

Latency & Uptime

Purpose

Latency measures how long the model takes to respond — tracked as P50 (median), P95 (95th percentile), and P99 (99th percentile). Uptime measures the percentage of time the system is operational and serving valid responses. For LLMs, two latency metrics matter separately: Time to First Token (TTFT) for perceived responsiveness, and total generation time for workflow completion.

Why It Matters

P95 latency, not the average, reflects what your slowest users experience: a 400ms average looks healthy, yet it is consistent with 5% of requests taking 4 seconds and timing out upstream workflows
High latency under load is usually the first signal of capacity issues or model degradation, and it often precedes accuracy drops and user-visible incidents
Uptime is a contract commitment for enterprise AI: a 99.9% SLA allows only about 8.76 hours of downtime per year, while 99% allows about 87.6 hours. Sub-99% in production makes AI systems unreliable as business infrastructure
Latency spikes at specific input lengths can reveal tokenisation or context-length bottlenecks, a critical optimisation vector for LLM deployments
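
Two of the points above are simple arithmetic and can be checked directly. The sketch below uses a nearest-rank percentile and a made-up latency sample (94 fast requests, 6 slow ones); the function names and the sample are assumptions for illustration only.

```python
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(samples)
    # Ceiling division done with integers to avoid float rounding.
    rank = max(1, -(-(pct * len(ordered)) // 100))
    return ordered[rank - 1]


def allowed_downtime_hours(sla_percent, hours_per_year=8760):
    """Downtime budget implied by an availability SLA over a 365-day year."""
    return (1 - sla_percent / 100) * hours_per_year


# Hypothetical sample: 94 requests at 200 ms and 6 requests at 4000 ms.
latencies_ms = [200] * 94 + [4000] * 6

mean_ms = statistics.mean(latencies_ms)  # 428
p50_ms = percentile(latencies_ms, 50)    # 200
p95_ms = percentile(latencies_ms, 95)    # 4000
```

The mean and the median both look acceptable here, while P95 exposes the 4-second tail. `allowed_downtime_hours(99.9)` works out to roughly 8.76 hours and `allowed_downtime_hours(99)` to roughly 87.6 hours.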

How to Deploy

Instrument every inference call with three timestamps: request received → first token generated → response complete
Emit P50, P95, P99 to your observability platform (Datadog, Prometheus, CloudWatch) in real time, not batched
Set SLOs: P95 < 2s for synchronous user-facing workflows; P99 < 5s. Alert when 80% of the SLO error budget has been consumed in any hour
Monitor latency by input token bucket (0–500, 500–2000, 2000+) — latency scaling by input length is non-linear and exposes capacity constraints
Run synthetic health checks every 60 seconds with a fixed test prompt; use these for uptime calculation, not real user traffic
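
The three-timestamp instrumentation and the token bucketing above might look like the sketch below. It assumes the model response arrives as a Python iterator of tokens; `InferenceTiming`, `timed_stream`, and `token_bucket` are hypothetical names, and exporting the measurements to an observability platform is not shown.

```python
import time
from dataclasses import dataclass


@dataclass
class InferenceTiming:
    received: float      # request received
    first_token: float   # first token generated
    completed: float     # response complete
    input_tokens: int

    @property
    def ttft(self):
        """Time to First Token, in seconds."""
        return self.first_token - self.received

    @property
    def total(self):
        """Total generation time, in seconds."""
        return self.completed - self.received


def timed_stream(token_iter, input_tokens, clock=time.monotonic):
    """Consume a token iterator and record the three timestamps."""
    received = clock()
    first_token = None
    tokens = []
    for tok in token_iter:
        if first_token is None:
            first_token = clock()
        tokens.append(tok)
    completed = clock()
    if first_token is None:  # empty response: no token was ever produced
        first_token = completed
    return tokens, InferenceTiming(received, first_token, completed, input_tokens)


def token_bucket(input_tokens):
    """Bucket labels from the text; lower bounds are treated as inclusive."""
    if input_tokens < 500:
        return "0-500"
    if input_tokens < 2000:
        return "500-2000"
    return "2000+"
```

The `clock` parameter defaults to a monotonic clock so that wall-clock adjustments do not distort durations, and it can be swapped for a fake clock in tests. Grouping `ttft` and `total` by `token_bucket(timing.input_tokens)` before computing percentiles gives the per-bucket view described above.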

Incident & Rollback Tracking

Purpose

Incident tracking logs every production failure or degradation event, classified by severity. Rollback tracking records when model versions were reverted, why, and how long the decision took. Together, they form your AI governance audit trail — the evidence that your oversight structures actually function rather than existing on paper.

Why It Matters

MTTR (Mean Time to Resolve) is your governance maturity metric — resolving AI incidents in 4 hours versus 4 days reflects the difference between a monitored and an unmonitored deployment
Rollback history reveals which model versions were stable versus unstable — essential intelligence for future deployment decisions and version management
Incident logs are typically required of AI systems in regulated sectors such as finance, healthcare, and infrastructure; auditors will ask for them, and not having them is a compliance finding, not just a technical gap
Without automated incident creation, incidents are discovered and reported inconsistently — you cannot measure what you do not systematically detect

How to Deploy

Define three severity tiers: P0 = complete outage or safety failure; P1 = degraded performance >20% or significant accuracy drop; P2 = anomaly requiring monitoring but not immediate action
Automate incident creation when drift or latency thresholds breach; do not rely on manual reporting, because human-reported incidents typically arrive late
Track MTTD (Mean Time to Detect, measured from incident start to detection) separately from MTTR, because detection speed reflects monitoring instrumentation maturity
Log every rollback with: version rolled from, version rolled to, trigger reason code, time from decision to deployment completion
Run a monthly incident retrospective — classify root causes into data quality, model version, infrastructure, or external dependency; trend these over time
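
A minimal sketch of the records and metrics described above follows. All class, field, and function names are hypothetical. Two choices are assumptions rather than statements from the text: MTTR is measured here from incident start to resolution, and a "significant accuracy drop" is passed in as a boolean decided elsewhere.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


def classify_severity(outage=False, safety_failure=False,
                      degradation_pct=0.0, accuracy_drop=False, anomaly=False):
    """Apply the three severity tiers; returns None if nothing qualifies."""
    if outage or safety_failure:
        return "P0"
    if degradation_pct > 20 or accuracy_drop:
        return "P1"
    if anomaly:
        return "P2"
    return None


@dataclass
class Incident:
    started: datetime
    detected: datetime
    resolved: datetime
    severity: str
    root_cause: str  # data quality, model version, infrastructure, external dependency


@dataclass
class Rollback:
    from_version: str
    to_version: str
    reason_code: str
    decided_at: datetime
    completed_at: datetime

    @property
    def decision_to_completion(self):
        return self.completed_at - self.decided_at


def _mean(deltas):
    return sum(deltas, timedelta()) / len(deltas)


def mttd(incidents):
    """Mean time from incident start to detection."""
    return _mean([i.detected - i.started for i in incidents])


def mttr(incidents):
    """Mean time from incident start to resolution (assumed definition)."""
    return _mean([i.resolved - i.started for i in incidents])
```

Keeping `root_cause` as one of a fixed set of categories makes the monthly retrospective a matter of counting incidents per category and comparing the counts month over month.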

Free Download

Get the AI Production Monitoring Checklist

The 47-point pre-deployment and ongoing monitoring checklist Terence uses across enterprise AI programmes — covering drift instrumentation, SLO configuration, incident response playbooks, and rollback decision criteria.

Pre-deployment baseline capture checklist (14 items)
Drift threshold configuration guide with PSI reference ranges
Latency SLO template with P50/P95/P99 targets by model type
Incident severity classification matrix and rollback decision tree
