Metrics to prove progress

Trust Architecture Playbook: Automation pillar

In certificate lifecycle automation programs, the metrics most organizations track first (certificate counts, automation coverage percentages, and renewal volumes) are the easiest to collect but the least useful for understanding whether the program is working. A program that has automated 10,000 certificates but cannot see whether those certificates are being deployed successfully, whether its highest-criticality services are covered, or whether faults are reliably detected is not a mature program.

Measuring program health

Measure the right things

The first step is measuring the right things. Coverage metrics, like automation coverage and readiness gate coverage, tell you how much of the estate automation reaches. Reliability metrics, like renewal success rate, deployment success rate, time to detect, and time to recover, tell you whether that automation is working. Both dimensions are required, because a program can score well on coverage while failing consistently on reliability, or run reliably across only a small fraction of the estate. Neither of those programs is in control of its certificate estate. Measuring only one dimension creates the illusion of progress while the other dimension compounds risk.

Segment every metric

The second step is segmentation. An aggregate renewal success rate of 97% looks acceptable until you segment it by criticality tier and discover that the failures are concentrated in Tier 0 and Tier 1 services. An automation coverage percentage of 80% looks strong until you segment by platform and discover that a specific appliance class, one that hosts a disproportionate share of customer-facing or revenue-generating services, hasn’t been automated because the pattern for that platform was never built. Aggregate metrics can hide the information that matters most, so every metric in the program should be segmented by criticality tier, platform, environment, and owning team as a baseline.
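The segmentation step can be illustrated with a minimal Python sketch. The record fields (`tier`, `platform`, `ok`) and the sample data are assumptions made up for illustration; a real program would read renewal outcomes from its own certificate lifecycle tooling.

```python
from collections import defaultdict


def success_rate(records):
    """Fraction of renewal records marked successful."""
    return sum(1 for r in records if r["ok"]) / len(records)


def success_rate_by(records, field):
    """Group records by `field` and return {segment: success rate}."""
    groups = defaultdict(list)
    for record in records:
        groups[record[field]].append(record)
    return {segment: success_rate(group) for segment, group in groups.items()}


# Made-up sample: 100 renewals, 3 failures, all of them in Tier 0.
sample = [
    {"tier": "Tier 0", "platform": "appliance", "ok": i >= 3} for i in range(10)
]
sample += [
    {"tier": "Tier 2", "platform": "linux", "ok": True} for _ in range(90)
]

overall = success_rate(sample)            # aggregate view
by_tier = success_rate_by(sample, "tier")  # segmented view
```

In this sample the aggregate rate is 97%, while the Tier 0 segment sits at 70%: the same data, with the failures no longer hidden by the larger Tier 2 population.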

Use metrics to gauge program maturity

The third step is using these metrics as a gauge of the program’s maturity. The metrics your program produces over time document where the program is in its development and where the remaining risk is. For example:

Renewal success rate trending upward across all tiers indicates that readiness gates are working and automation patterns are reliable.
Break-glass frequency that remains stable or increases after Phase 2 indicates that the standard automation model has gaps that teams are routing around rather than resolving.
Exception volume that grows without declining indicates that the exception process has become a parallel operating model rather than a governed safety valve.
Time to detect that remains high for Tier 0 and Tier 1 services after monitoring is nominally in place indicates that alert routing is broken or that the monitoring is measuring the wrong things.
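One way to turn signals like these into something a dashboard can flag is to fit a simple trend line to per-period counts. The sketch below is an illustration under assumptions, not a prescribed method: the least-squares slope, the `tolerance` value, and the function names are all hypothetical choices.

```python
def trend_slope(values):
    """Least-squares slope of values against their period index (0, 1, 2, ...)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two periods to compute a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    denominator = sum((x - mean_x) ** 2 for x in range(n))
    return numerator / denominator


def classify_trend(values, tolerance=0.1):
    """Label a series 'declining', 'increasing', or 'stable'.

    `tolerance` is an assumed dead band (events per period) around zero slope.
    """
    slope = trend_slope(values)
    if slope < -tolerance:
        return "declining"
    if slope > tolerance:
        return "increasing"
    return "stable"


def break_glass_needs_review(counts_since_phase_2):
    """Flag break-glass frequency that is stable or increasing after Phase 2."""
    return classify_trend(counts_since_phase_2) != "declining"
```

For example, `break_glass_needs_review([4, 4, 5, 4, 4])` flags the series for review because the counts are not declining, while `[9, 7, 6, 4, 2]` would pass. The same classifier could be applied per tier to exception volume or time to detect.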

Audit evidence collection

Alongside operational metrics, audit evidence collection is the mechanism used to prove to an auditor, a regulator, an executive, or another stakeholder that the automation program is operating as designed and that every certificate lifecycle action was authorized, executed correctly, and validated. The evidence requirements must be defined before automation begins. For each automated lifecycle event, the program should produce and retain a complete evidence record that ties the full chain of issuance to the governance decisions that drove it.

What a mature program can answer

A mature certificate automation program is one that can answer, for any certificate in the estate, at any point in time: who authorized it, what profile governed it, where it was deployed, how deployment was validated, when it was last renewed, whether that renewal resulted in verified deployment, and what the current operational status of the service presenting it is. Build the measurement layer with the same intentionality you bring to the automation patterns themselves, and it will tell you, continuously and accurately, whether the program is working.

Core metrics

Metric	Definition	Healthy threshold	Segment by
Renewal success rate	% of scheduled renewals that complete issuance successfully.	≥99% (Tier 0/1); ≥97% (Tier 2)	Tier, profile, business unit, environment
Deployment success rate	% of renewals that result in validated live deployment — not just issuance.	≥98% (Tier 0/1); ≥95% (Tier 2)	Tier, pattern, service group
Time to detect (TTD)	Elapsed time from failure event to first alert or operator awareness.	<15 min (Tier 0); <1 hr. (Tier 1); <4 hr. (Tier 2)	Tier, platform
Time to recover (TTR)	Elapsed time from failure detection to validated service restoration.	<1 hr. (Tier 0); <4 hr. (Tier 1); <24 hr. (Tier 2)	Tier, service archetype
Automation coverage	% of inventory in each tier governed by a standard profile and active automation.	Target: ≥90% of Tier 0/1 within 6 months of pilot completion.	Tier, environment
Exception volume and age	Count of open exceptions and their age.	Declining trend. No exception >90 days without governance review.	Business unit, pattern
Break-glass frequency	Number of break-glass events per period. High frequency indicates gaps in standard automation.	Declining trend post-Phase 2. Sustained frequency is a program health signal.	Tier, profile
Readiness gate coverage	% of inventory with all readiness gates met.	Target: 100% of automated services; leading indicator for expansion.	Tier, owning org
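The per-tier thresholds for the four reliability metrics in the table can be encoded as data and checked mechanically. The following is a minimal sketch: the threshold values come from the table, while the metric keys, the tier labels, and the `is_healthy` function are hypothetical names chosen for illustration.

```python
from datetime import timedelta

# Success rates must be at or above the floor for the tier.
RATE_FLOORS = {
    "renewal_success_rate": {"Tier 0": 0.99, "Tier 1": 0.99, "Tier 2": 0.97},
    "deployment_success_rate": {"Tier 0": 0.98, "Tier 1": 0.98, "Tier 2": 0.95},
}

# Elapsed times must be strictly below the ceiling for the tier.
TIME_CEILINGS = {
    "time_to_detect": {
        "Tier 0": timedelta(minutes=15),
        "Tier 1": timedelta(hours=1),
        "Tier 2": timedelta(hours=4),
    },
    "time_to_recover": {
        "Tier 0": timedelta(hours=1),
        "Tier 1": timedelta(hours=4),
        "Tier 2": timedelta(hours=24),
    },
}


def is_healthy(metric, tier, value):
    """Return True if `value` meets the table's healthy threshold for the tier."""
    if metric in RATE_FLOORS:
        return value >= RATE_FLOORS[metric][tier]
    if metric in TIME_CEILINGS:
        return value < TIME_CEILINGS[metric][tier]
    raise KeyError(f"no threshold defined for metric: {metric}")
```

With these definitions, a 97% renewal success rate is healthy for Tier 2 but not for Tier 0, which is why the check takes the tier as an argument rather than evaluating an aggregate. Trend-based rows such as exception volume and break-glass frequency are left out because they are not single-value thresholds.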

Audit evidence checklist

Retain the following evidence for each automated certificate lifecycle event:

Profile approval record: Who approved the profile, when, and under what authority.
Delegated scope approval: Business unit, owner, and authorization boundary.
Issuance event: Timestamp, requester identity, profile used, CA source, and certificate serial.
Deployment target: Service, host, environment, and install location.
Validation results: Handshake check output, chain validation, SAN verification, and health check result.
Exception or break-glass approval: Approver identity, justification, and expiry date.
Incident record: Timeline, actions taken, root cause.
Corrective actions: Remediation steps, gate updates, and monitoring changes.
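A completeness check over the checklist above might look like the following sketch. It rests on assumptions that are not stated in the checklist: the key names are hypothetical, and the exception, incident, and corrective-action items are treated as expected only when an exception or incident actually occurred.

```python
# Items assumed to apply to every automated lifecycle event.
REQUIRED_ITEMS = (
    "profile_approval",
    "delegated_scope_approval",
    "issuance_event",
    "deployment_target",
    "validation_results",
)


def missing_evidence(record, had_exception=False, had_incident=False):
    """Return the checklist items that are absent or empty in `record`.

    `record` is a dict mapping checklist item names to whatever evidence
    was retained for one lifecycle event.
    """
    expected = list(REQUIRED_ITEMS)
    if had_exception:
        expected.append("exception_or_break_glass_approval")
    if had_incident:
        expected.extend(["incident_record", "corrective_actions"])
    return [item for item in expected if not record.get(item)]
```

An empty return value means the record is complete for that event. Running a check like this at event time, rather than at audit time, keeps gaps from accumulating unnoticed.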

Crypto agility and post-quantum readiness exercises

The crypto agility built through an intentional, mature enterprise certificate lifecycle automation program (reliable renewal, deterministic deployments, a complete inventory, and tested mass-rotation runbooks) rests on the same foundation that post-quantum readiness requires: the ability to rotate keys, update algorithms, and respond to cryptographic requirements at scale without manual intervention.

NIST finalized post-quantum cryptographic standards in 2024 (ML-KEM, ML-DSA, SLH-DSA). Browser vendor and CA/Browser Forum timelines for algorithm deprecation are accelerating. Organizations should begin inventory analysis now to identify certificate populations that will require algorithm migration and to confirm that their automation patterns support the target algorithms.

Audit current algorithm inventory: identify any certificates with weak or non-standard keys and algorithms.
Confirm that approved profiles support strong classical algorithms (for example, ECDSA P-384/P-521, RSA-4096) and plan post-quantum algorithm support timelines aligned to CA and platform availability.
Test mass rotation capability in a non-production environment.
Include algorithm transition planning in Phase 4 of the rollout model.
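The first item in the list, the algorithm inventory audit, can be sketched as a simple tally. The inventory format and the `algorithm` labels below are hypothetical, and the approved set is only an example policy built from the classical algorithms named above (using the NIST curve names P-384 and P-521); each organization would substitute its own approved profiles.

```python
from collections import Counter

# Example policy only; replace with the algorithms your profiles approve.
APPROVED_ALGORITHMS = frozenset({"RSA-4096", "ECDSA-P384", "ECDSA-P521"})


def audit_algorithms(inventory, approved=APPROVED_ALGORITHMS):
    """Count certificates per algorithm and flag those outside the approved set.

    Returns (counts, flagged): `counts` covers every algorithm seen and
    `flagged` holds only the algorithms that are not in `approved`.
    """
    counts = Counter(cert["algorithm"] for cert in inventory)
    flagged = {alg: n for alg, n in counts.items() if alg not in approved}
    return counts, flagged


# Made-up inventory entries for illustration.
inventory = [
    {"serial": "01", "algorithm": "RSA-4096"},
    {"serial": "02", "algorithm": "RSA-1024"},
    {"serial": "03", "algorithm": "ECDSA-P384"},
    {"serial": "04", "algorithm": "RSA-1024"},
]
counts, flagged = audit_algorithms(inventory)
```

Here `flagged` contains only the RSA-1024 population, which identifies the certificates to migrate first. The same tally, segmented by tier or platform, would show where a mass rotation needs to be tested.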
