§ Trackr.Live

Key Management

Cryptographic primitives are mathematically rigorous, well-analyzed, and broadly secure when implemented correctly. The deployment of cryptography in real systems is dominated by key management — the operational discipline of where keys come from, where they live, who has access, how they rotate, and how they are eventually destroyed. Almost every cryptographic breach in production traces back to a key management failure, not a primitive failure. AES is fine. The key that decrypted the production database was in a developer’s Slack DM.

This page is the deep-dive companion to the Cryptography umbrella overview and the operational counterpart to the Symmetric Cryptography, Public-Key Cryptography, and Public Key Infrastructure pages. The scope here is what to do with keys once you have them, where the deployed systems live, and the failure patterns that keep recurring.

The key lifecycle

A cryptographic key goes through seven distinguishable phases over its operational life. Each phase has its own security requirements, and most real-world failures map cleanly onto one of them.

Generation. A new key is produced, ideally from a cryptographically secure source of randomness. Failures here are catastrophic and often invisible — a key generated with insufficient entropy looks identical to a properly-generated key until an attacker recovers it.

Distribution. The key (or a derived secret) gets to the systems that need it. For symmetric keys this is the bootstrap problem that public-key cryptography exists to solve; for public-key pairs the private key never moves but the public key does.

Storage. The key sits at rest somewhere — in memory, on disk, in a hardware module, in a cloud service. Storage protection is what distinguishes a key from publicly-known data.

Usage. The key gets invoked to perform cryptographic operations. The threat model here is leakage during use — through side channels, memory dumps, or careless handling of intermediate values.

Rotation. The key is replaced by a new key, either on a schedule, in response to suspected compromise, or because of compliance requirements. The orchestration of rotation across distributed systems is one of the harder operational problems in the discipline.

Revocation. The key is declared invalid before its scheduled expiration, typically because of suspected or confirmed compromise. Revocation has the same scaling problems as certificate revocation (covered in the PKI page) but is broader in scope.

Destruction. The key is removed from all storage locations such that it cannot be recovered. This is structurally hard — backups, snapshots, memory dumps, and offline archives all complicate the question of whether a key has actually been destroyed.

The rest of this page works through each phase in turn, identifies where the recurring failures live, and discusses the systems that have been built to make the discipline tractable.

Generation — entropy and why it matters

A cryptographic key is a random number, and the entire security argument depends on the key being unpredictable to an adversary. The source of unpredictability is randomness, which in computing systems is harder to produce reliably than the math implies.

Modern operating systems provide a cryptographically secure pseudorandom number generator (CSPRNG) that gathers entropy from hardware sources — interrupt timings, RDRAND or RDSEED on x86, ARM’s RNDR instruction, dedicated true RNG chips, network traffic timing, disk operations — and uses that entropy to seed a deterministic pseudorandom generator that produces high-quality output. The kernel CSPRNG is what applications should use directly, through:

  • Linux: getrandom(2) system call (since 3.17), preferred over /dev/urandom.
  • macOS / iOS / BSDs: arc4random(3), /dev/urandom.
  • Windows: BCryptGenRandom, CNG.
  • Modern cryptographic libraries: randombytes_buf (libsodium), RAND_bytes (OpenSSL), the Go crypto/rand package, the Rust OsRng.
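
As a minimal sketch of what "use the kernel CSPRNG" looks like from application code, the Python standard library exposes it through the secrets module, which draws from os.urandom (backed by getrandom(2) on Linux where available). The helper name and the 32-byte size below are illustrative assumptions, not something the platforms above prescribe.

```python
import secrets

KEY_BYTES = 32  # 256 bits; an illustrative size for AES-256 or HMAC-SHA-256 keys


def generate_symmetric_key(num_bytes: int = KEY_BYTES) -> bytes:
    """Return fresh key material drawn from the operating system CSPRNG."""
    # secrets.token_bytes reads from os.urandom. The seedable `random`
    # module is a simulation PRNG and must not be used for keys.
    return secrets.token_bytes(num_bytes)


key_a = generate_symmetric_key()
key_b = generate_symmetric_key()
print(len(key_a), key_a != key_b)
```

The point of the sketch is what is absent: no seeding, no entropy mixing, no homegrown generator. Application code asks the operating system and does nothing else.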

A long-running debate about /dev/random versus /dev/urandom on Linux has produced more confusion than necessary. The short version: on Linux 5.6 and later, /dev/random blocks only until the kernel CSPRNG has been initialized, so once the system has seeded, the two devices are equivalent for cryptographic purposes. The historical concern about /dev/urandom returning predictable output during early boot is addressed by getrandom(2), which by default blocks until the kernel pool is properly seeded and then never blocks again. New code should use getrandom(2); legacy code that must read /dev/urandom should accept that the early-boot window is the remaining risk.

The recurring catastrophic failures in key generation tell the story:

The Debian OpenSSL incident (2008) is the textbook example. A Debian packager removed two lines from OpenSSL’s RNG initialization code in an attempt to silence Valgrind warnings. The change left the process ID as essentially the only input to the generator, roughly a 15-bit seed, meaning that every key generated with the affected package between September 2006 and May 2008 came from a pool of 32,767 possible values. The keys looked normal; they were trivially enumerable. The cleanup took years across the open-source ecosystem and required forced key rotation on millions of systems.
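
To make "trivially enumerable" concrete, here is a toy sketch of the attack shape. The key generator is hypothetical (a hash of the seed); the real attack regenerated actual OpenSSL keys for every seed and compared public keys. The only thing the sketch borrows from the incident is the size of the seed space.

```python
import hashlib
from typing import Optional

SEED_SPACE = range(1, 32768)  # 32,767 candidate seeds, as in the incident


def toy_keygen(seed: int) -> bytes:
    """Hypothetical deterministic generator: the output depends only on the seed."""
    return hashlib.sha256(b"toy-keygen" + seed.to_bytes(2, "big")).digest()


def recover_seed(observed_key: bytes) -> Optional[int]:
    """Try every seed until one reproduces the observed key."""
    for candidate in SEED_SPACE:
        if toy_keygen(candidate) == observed_key:
            return candidate
    return None


victim_key = toy_keygen(20481)  # the victim's seed is unknown to the attacker
recovered_seed = recover_seed(victim_key)
```

The loop finishes in a fraction of a second. The 256-bit output looks as random as any other key, which is why nobody noticed for almost two years.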

The Sony PlayStation 3 ECDSA key recovery (2010) exploited a different flavor of the same problem. Sony’s firmware signing implementation used a fixed value (not even random) for the per-signature nonce in ECDSA. Reusing a nonce across signatures lets an attacker recover the signing key through trivial algebra. The PS3 firmware signing key was extracted from publicly-released firmware, allowing third parties to sign their own firmware. The mitigation industry-wide is deterministic ECDSA (RFC 6979), which derives the nonce from a hash of the message and private key rather than from a random number generator. Or, better, use Ed25519, which eliminates the failure mode by design.
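
The "trivial algebra" is short enough to show. An ECDSA signature satisfies s = k^-1 (z + r d) mod n, where d is the private key, k the nonce, z the message hash, and r a value derived from k. Two signatures with the same k share r, so subtracting the two equations yields k, and either equation then yields d. The sketch below is an illustration under stated assumptions: it uses the secp256k1 group order as the modulus (not the PS3 curve) and a random stand-in for r, since the recovery step never needs the curve itself.

```python
import hashlib
import secrets

# Order of the secp256k1 group (a prime), used only as the modulus n.
N = int("FFFFFFFF" "FFFFFFFF" "FFFFFFFF" "FFFFFFFE"
        "BAAEDCE6" "AF48A03B" "BFD25E8C" "D0364141", 16)


def digest_int(message: bytes) -> int:
    return int.from_bytes(hashlib.sha256(message).digest(), "big") % N


def sign_s(private_key: int, nonce: int, r: int, message: bytes) -> int:
    """The ECDSA s equation: s = k^-1 * (z + r*d) mod n."""
    return pow(nonce, -1, N) * (digest_int(message) + r * private_key) % N


def recover_private_key(r: int, s1: int, z1: int, s2: int, z2: int) -> int:
    """Recover d from two signatures that reused the same nonce."""
    k = (z1 - z2) * pow((s1 - s2) % N, -1, N) % N  # s1 - s2 = k^-1 (z1 - z2)
    return (s1 * k - z1) * pow(r, -1, N) % N       # d = (s*k - z) / r


private_key = secrets.randbelow(N - 1) + 1
fixed_nonce = secrets.randbelow(N - 1) + 1
r = secrets.randbelow(N - 1) + 1  # stand-in for the x-coordinate of k*G

m1, m2 = b"firmware image A", b"firmware image B"
s1 = sign_s(private_key, fixed_nonce, r, m1)
s2 = sign_s(private_key, fixed_nonce, r, m2)
recovered_key = recover_private_key(r, s1, digest_int(m1), s2, digest_int(m2))
```

Everything the recovery function takes as input is public: the signatures and the messages. That is the whole attack.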

The ROCA vulnerability (2017) was found in Infineon’s RSA key generation library, embedded in millions of TPMs, smart cards, and Estonian national ID cards. The library used a structured method for selecting primes that produced keys with a specific algebraic weakness, so the private key could be recovered from the public key alone. The published cost estimates were on the order of CPU-months for 1024-bit keys and roughly a hundred CPU-years for 2048-bit keys, work that parallelizes well enough to be practical for a funded attacker. The Infineon library had passed certification audits for years before the weakness was discovered by external researchers. The lesson is that “certified” is not the same as “correct.”

The Juniper incident (2015) revealed that Juniper’s ScreenOS firewalls had been using a constant in their Dual_EC_DRBG implementation that allowed traffic decryption by anyone who knew the corresponding scalar. Dual_EC_DRBG was a NIST-standardized random number generator that had been suspected of being backdoored since at least 2007; the Juniper incident confirmed both that the suspicion was correct and that real production hardware had been shipping with the backdoor active.

The pattern across these failures: the math of the cryptographic primitive was fine. The generation of secret values (keys, nonces, primes) was the failure point, and the failure was invisible until external researchers found it.

Storage — where keys actually live

Keys at rest live in one of several places, each with its own threat model and operational profile.

Filesystem storage

The simplest and most common storage location is a file on disk. SSH keys, GPG keys, TLS server private keys, application API keys — all typically start as files. The protections are filesystem permissions (the file is mode 0600, owned by the right user) and full-disk encryption (the file is unreadable when the system is powered off).
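
A minimal sketch of the permissions half of that protection, assuming a POSIX system. The function names and the file name are hypothetical. The detail worth copying is that the restrictive mode is applied when the file is created, so there is no window in which the key sits on disk with default permissions.

```python
import os
import secrets
import stat
import tempfile


def write_private_key(path: str, key_material: bytes) -> None:
    """Create the key file with mode 0600 from the first moment it exists."""
    # O_EXCL refuses to reuse an existing file; the mode is set at creation,
    # not by a later chmod that would leave a race window.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "wb") as handle:
        handle.write(key_material)


def permissions_ok(path: str) -> bool:
    """True when neither group nor other has any access bits set."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return mode & 0o077 == 0


with tempfile.TemporaryDirectory() as workdir:
    key_path = os.path.join(workdir, "service.key")  # hypothetical file name
    write_private_key(key_path, secrets.token_bytes(32))
    key_file_is_private = permissions_ok(key_path)
```

A check like permissions_ok is cheap to run at service startup; OpenSSH performs a comparable check and refuses private keys that other users can read.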

The failure modes are well-known and persistent:

  • Keys in source control. A private key committed to git history persists even after deletion. GitHub, GitLab, and Bitbucket all scan for accidentally-committed credentials and notify the affected repositories; the secret-scanning tooling has become standard but does not catch every variant. The mitigation is git-secrets, truffleHog, gitleaks, or detect-secrets in pre-commit hooks, plus regular history audits.
  • Keys in environment variables. Often necessary for application configuration, environment variables show up in process listings (ps auxe), in core dumps, in container introspection, and in error logs. The exposure is broader than it appears.
  • Keys in Docker images or CI artifacts. A key baked into a published image is permanent. A key written to a CI cache or artifact is broadly readable across the CI pipeline.
  • Keys in backups and snapshots. A backup of a system containing keys requires the same protection as the system itself. Most backup processes do not preserve this property.

OS keychains

Operating systems provide first-party key storage facilities with hardware backing where available:

  • Windows DPAPI (Data Protection API) and the Credential Manager encrypt secrets with keys derived from the user’s password and (where available) a TPM. The protection is reasonable against offline attacks; less reasonable against an attacker who has compromised the user account.
  • macOS Keychain Services and the iOS Keychain integrate with the Secure Enclave on modern Apple hardware, providing hardware-backed key storage. The Secure Enclave can hold keys that the main CPU never sees, performing cryptographic operations entirely within the enclave.
  • Linux has historically been less uniform. The kernel keyring (keyrings(7)) provides in-kernel key storage. GNOME Keyring and KWallet provide desktop-level credential managers, with varying quality. Server-side, the Linux Kernel Keyring is the right primitive for systems that need transient key storage with proper ownership controls.

OS keychains are the right answer for user-level credentials and application secrets on a single device. They are less appropriate for shared infrastructure secrets or for keys that need to be replicated across systems.

Hardware Security Modules

A Hardware Security Module (HSM) is a dedicated cryptographic appliance designed to generate, store, and use cryptographic keys without ever exposing the keys to the host system. The host sends data into the HSM, the HSM performs the cryptographic operation, the host gets the result back. The key never leaves the HSM.

HSMs are validated against FIPS 140-3 (formerly FIPS 140-2), which defines four security levels:

  • Level 1 — basic cryptographic correctness, no physical security requirements. Software libraries can achieve Level 1.
  • Level 2 — tamper-evident physical packaging, role-based authentication. Most consumer-grade hardware modules.
  • Level 3 — tamper-resistant packaging that zeroes keys on detection, identity-based authentication. The typical enterprise HSM.
  • Level 4 — robust tamper-detection that protects against environmental attack. Aerospace, military, and the most sensitive government applications.

Commercial HSM offerings include Thales (formerly Gemalto, formerly SafeNet) Luna, AWS CloudHSM (which uses Marvell/Cavium hardware), Entrust nShield, Utimaco, and several others. The cost is non-trivial — a Level 3 HSM is typically in the tens of thousands of dollars per unit — and operating HSMs in a cluster with proper key escrow and disaster recovery is its own operational discipline.

The standard programming interface for HSMs is PKCS#11, also known as Cryptoki. PKCS#11 is the API that most HSM clients use to perform operations against the device. The API is comprehensive but byzantine, and getting PKCS#11 integration correct is a recurring source of subtle bugs.

TPMs and secure elements

A Trusted Platform Module (TPM) is a hardware module that combines limited HSM-like capabilities with platform-attestation functions. TPMs are now standard on consumer hardware — Windows 11 requires TPM 2.0, and most servers ship with one. The TPM can:

  • Generate keys that never leave the device.
  • Seal keys to specific platform configurations (the key can only be unsealed when the platform is in a known-good state, providing measured-boot protections).
  • Provide attestation about the platform’s boot state to remote parties.

TPMs are not as capable as full HSMs — they are slower, have limited key storage, and are designed for single-system use rather than cluster operation — but they are universally available and provide hardware-backed key protection at no marginal cost.

Secure elements are similar in concept but more specialized. The Apple Secure Enclave, the Google Titan M chip, and various smart-card-class chips embedded in phones and laptops are all secure elements. They typically have a narrower API than TPMs but tighter integration with the platform.

Cloud KMS

The dominant key management pattern for cloud-deployed applications in 2026 is cloud KMS — a managed service operated by the cloud provider that holds keys in their HSM infrastructure and exposes a simple API for cryptographic operations.

The three major offerings:

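  • AWS KMS supports symmetric keys, asymmetric keys (RSA and ECC), and HMAC keys. It is backed by AWS-operated, FIPS 140-validated HSMs; AWS CloudHSM comes into play only for CloudHSM-backed custom key stores. Integrated with most AWS services for envelope encryption.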
  • Google Cloud KMS — similar feature set, with Cloud HSM for keys that need to stay in FIPS 140-2 Level 3 hardware. Integrated with GCP services through the same envelope encryption pattern.
  • Azure Key Vault — Microsoft’s offering. Two tiers: standard (software-backed) and premium (HSM-backed). The premium tier is the right choice for production cryptographic material.

Cloud KMS adds capabilities beyond raw HSM access:

  • Key rotation automation. Most cloud KMS services can rotate symmetric keys automatically on a configured schedule, with old key versions retained for decrypting historical data.
  • Granular IAM. Access to specific keys is controlled by the cloud provider’s identity system, with audit logging of every key usage.
  • Envelope encryption. Rather than encrypting data directly with the KMS key (which would route every byte of data through the KMS API), applications generate data encryption keys (DEKs) locally, encrypt the data with the DEK, and encrypt the DEK with the KMS-held key encryption key (KEK). The encrypted DEK travels with the data; only the small KEK operation requires the KMS round-trip.
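
The envelope pattern can be sketched in a few lines. This is an illustration, not any provider’s API: the LocalKek class is a hypothetical in-process stand-in for the KMS-held key (a real KEK never leaves the service), and the code assumes the third-party cryptography package for AES-GCM.

```python
import os

from cryptography.hazmat.primitives.ciphers.aead import AESGCM


class LocalKek:
    """Hypothetical stand-in for a KMS-held key encryption key."""

    def __init__(self) -> None:
        self._aead = AESGCM(AESGCM.generate_key(bit_length=256))

    def wrap(self, dek: bytes) -> bytes:
        nonce = os.urandom(12)
        return nonce + self._aead.encrypt(nonce, dek, b"dek-wrap")

    def unwrap(self, wrapped: bytes) -> bytes:
        return self._aead.decrypt(wrapped[:12], wrapped[12:], b"dek-wrap")


def envelope_encrypt(kek: LocalKek, plaintext: bytes) -> dict:
    dek = AESGCM.generate_key(bit_length=256)  # fresh DEK, generated locally
    nonce = os.urandom(12)
    ciphertext = AESGCM(dek).encrypt(nonce, plaintext, None)
    # Only the wrapped DEK is stored; the plaintext DEK is dropped after use.
    return {"wrapped_dek": kek.wrap(dek), "nonce": nonce, "ciphertext": ciphertext}


def envelope_decrypt(kek: LocalKek, record: dict) -> bytes:
    dek = kek.unwrap(record["wrapped_dek"])  # the one KMS round-trip
    return AESGCM(dek).decrypt(record["nonce"], record["ciphertext"], None)


kek = LocalKek()
plaintext = b"customer record 42"
record = envelope_encrypt(kek, plaintext)
roundtrip = envelope_decrypt(kek, record)
```

However large the plaintext is, the KEK only ever processes the 32-byte DEK. That asymmetry is what keeps the KMS API off the bulk data path.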

Cloud KMS also introduces a shared-responsibility tradeoff that deserves explicit attention. The cloud provider has access to the underlying HSM hardware. For many threat models this is acceptable — the alternative is operating your own HSM fleet, which has its own risks. For threat models that require strict separation, Bring Your Own Key (BYOK) lets the customer import key material that originated outside the cloud provider, and Hold Your Own Key (HYOK) / External Key Manager lets the cloud KMS reach out to a customer-controlled HSM for the actual cryptographic operations. The latter is operationally more complex but provides stronger isolation guarantees.

Secure enclaves

Trusted Execution Environments (TEEs) — Intel SGX, AMD SEV-SNP, Arm TrustZone, AWS Nitro Enclaves, Azure Confidential Computing, GCP Confidential VMs — provide hardware-isolated execution environments that can hold and use cryptographic keys without exposing them to the host operating system. The threat model is protection against an attacker who has compromised the OS or hypervisor.

The deployment pattern has been complicated. Intel SGX in particular has been subject to a long series of side-channel, fault-injection, and related attacks (Foreshadow, SgxPectre, Plundervolt, SmashEx, ÆPIC Leak) that have variably broken the security guarantees. Intel deprecated SGX on client CPUs in 2022 but continues to support it on server SKUs with revised threat-model claims. AMD SEV and the newer ARM Confidential Compute architectures have similar tradeoffs.

For practical purposes in 2026, secure enclaves are a useful additional layer of defense for keys that need to live close to the application but should not be readable from a compromised host. They are not a replacement for HSMs or for cloud KMS — they are a complement, useful in specific architectural patterns (confidential computing, attestation-based access control).

Key wrapping and key hierarchies

Real systems rarely use a single cryptographic key for all operations. The standard pattern is a hierarchy of keys with specific responsibilities.

A typical hierarchy:

  • Root key — held in an HSM or cloud KMS, never exported, used only to wrap (encrypt) the next layer of keys. Rotated infrequently if at all.
  • Key encryption keys (KEKs) — derived from or wrapped by the root key, used to encrypt data encryption keys. May be specialized by purpose (one KEK per application, one per tenant, one per region).
  • Data encryption keys (DEKs) — used to encrypt actual data. Often ephemeral or per-record. The encrypted form (wrapped by the KEK) travels with the encrypted data.

Envelope encryption is the term for the DEK-wrapped-by-KEK pattern. It is universal in modern cloud systems because it separates the high-volume operation (encrypting bulk data with DEKs) from the high-trust operation (managing the small number of KEKs that wrap the DEKs).

The hierarchy provides three operational benefits:

  • Performance. Bulk encryption uses local DEKs without round-tripping to the KMS for every byte.
  • Limited blast radius. A compromised DEK affects only the data it encrypted, not the entire keyspace.
  • Rotation simplicity. Rotating a KEK requires re-wrapping the DEKs (a small operation per DEK) rather than re-encrypting the underlying data.
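
The rotation benefit is easiest to see in code. The sketch below re-wraps a handful of DEKs under a new KEK using AES Key Wrap (RFC 3394) from the third-party cryptography package. Holding both KEKs as local byte strings is an assumption made for illustration; with a managed KMS the equivalent operation happens inside the service and the KEKs are never visible to the caller.

```python
import os

from cryptography.hazmat.primitives.keywrap import aes_key_unwrap, aes_key_wrap


def rewrap(wrapped_dek: bytes, old_kek: bytes, new_kek: bytes) -> bytes:
    """Move one DEK from the old KEK to the new one; the data is untouched."""
    dek = aes_key_unwrap(old_kek, wrapped_dek)
    return aes_key_wrap(new_kek, dek)


old_kek, new_kek = os.urandom(32), os.urandom(32)
deks = [os.urandom(32) for _ in range(3)]

wrapped_under_old = [aes_key_wrap(old_kek, dek) for dek in deks]
wrapped_under_new = [rewrap(w, old_kek, new_kek) for w in wrapped_under_old]

# The DEKs themselves are unchanged, so existing ciphertexts still decrypt.
recovered_deks = [aes_key_unwrap(new_kek, w) for w in wrapped_under_new]
```

Each re-wrap touches 40 bytes regardless of how much data the DEK protects, which is why KEK rotation can run as a background job rather than a migration project.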

The flip side is complexity. Every additional layer in the hierarchy is another opportunity for the orchestration to go wrong. The right number of layers is the smallest number that meets the threat model and operational requirements; for most applications, three layers (root → KEK → DEK) is the standard.

Key distribution

The distribution problem — getting a key from where it is to where it needs to be — has different answers depending on whether the key is symmetric or asymmetric.

For symmetric keys, the canonical answer is to use public-key cryptography to bootstrap a symmetric session key. Diffie-Hellman key agreement (or its elliptic-curve variants) lets two parties derive a shared secret over an insecure channel. TLS, Signal, SSH, and most other modern protocols use this pattern.
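
A toy sketch of the key agreement idea, using only modular exponentiation. The parameters are assumptions chosen for brevity (the prime 2^255 - 19 as a modulus and 2 as a base); they are not a vetted Diffie-Hellman group, and real protocols use X25519 or standardized groups with a proper key derivation function such as HKDF rather than a bare hash.

```python
import hashlib
import secrets

P = 2**255 - 19  # a prime, used here as a toy modulus only
G = 2            # toy base


def keypair() -> tuple:
    """Return (private exponent, public value)."""
    private = secrets.randbelow(P - 3) + 2  # uniform in [2, P - 2]
    return private, pow(G, private, P)


def session_key(own_private: int, peer_public: int) -> bytes:
    """Both sides compute G^(a*b) mod P, then hash it down to 32 bytes."""
    shared = pow(peer_public, own_private, P)
    return hashlib.sha256(shared.to_bytes(32, "big")).digest()


alice_private, alice_public = keypair()
bob_private, bob_public = keypair()

# Only the public values cross the insecure channel.
alice_key = session_key(alice_private, bob_public)
bob_key = session_key(bob_private, alice_public)
```

Unauthenticated agreement like this is open to an active man-in-the-middle, which is why TLS and SSH bind the exchange to certificates or host keys.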

For longer-term symmetric keys that need to be replicated across systems, the typical pattern is key wrapping: encrypt the symmetric key with a public key for transport, decrypt it at the destination, and store it locally (or hold it in an HSM). Cloud KMS provides this primitive directly through its wrap/unwrap API.

For public-key pairs, only the public key needs to be distributed; the private key stays at its origin. PKI is the standard distribution mechanism (certificates with embedded public keys, signed by a trusted authority). For internal systems, simpler mechanisms — direct certificate exchange, configuration management distribution, key servers — are also common.

Secret sharing is the cryptographic answer to the question of distributing a key across multiple parties such that no single party can recover it. Shamir’s Secret Sharing, published by Adi Shamir in 1979, splits a secret into n shares such that any k of them can reconstruct it, but fewer than k reveal nothing. The mechanism is widely used for high-value root keys (the root signing key of a CA, the root encryption key of an HSM) where no single operator should be able to recover the key alone. HashiCorp Vault’s default unseal process splits its unseal key into Shamir shares, and HSM key ceremonies commonly enforce a comparable M-of-N quorum.
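
A compact sketch of the scheme: the secret is the constant term of a random polynomial of degree k - 1 over a prime field, each share is one point on the polynomial, and any k points recover the constant term by Lagrange interpolation at zero. The field prime (2^521 - 1) and the function names are illustrative choices, and production use calls for an audited implementation.

```python
import secrets

PRIME = 2**521 - 1  # a Mersenne prime, comfortably larger than a 256-bit secret


def split(secret: int, k: int, n: int) -> list:
    """Split `secret` into n shares; any k of them reconstruct it."""
    if not (1 <= k <= n) or not (0 <= secret < PRIME):
        raise ValueError("need 1 <= k <= n and 0 <= secret < PRIME")
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(k - 1)]

    def evaluate(x: int) -> int:
        result = 0
        for coeff in reversed(coeffs):  # Horner's rule
            result = (result * x + coeff) % PRIME
        return result

    return [(x, evaluate(x)) for x in range(1, n + 1)]


def combine(shares: list) -> int:
    """Lagrange interpolation at x = 0 over the prime field."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = num * (-xj) % PRIME
                den = den * (xi - xj) % PRIME
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret


root_key = int.from_bytes(secrets.token_bytes(32), "big")
shares = split(root_key, k=3, n=5)
from_first_three = combine(shares[:3])
from_last_three = combine(shares[2:])
```

Any three of the five shares give back the same root key. Two shares interpolate a different polynomial and carry no information about the secret.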

Key rotation

Keys should be rotated. The reasons:

  • Forward secrecy. A future compromise of the current key should not retroactively expose data encrypted with previous keys.
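  • Cryptographic wear-out. Some primitives have data-volume limits. AES-GCM with randomly generated 96-bit nonces, for example, should not encrypt more than 2^32 messages under a single key (the NIST SP 800-38D limit), and widely shared finite-field DH parameters invite precomputation attacks that amortize across every connection using them.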
  • Compliance. PCI DSS, FedRAMP, and most regulated frameworks require regular key rotation regardless of cryptographic necessity.
  • Personnel changes. A key that a departed employee had access to should be replaced.
  • Suspected compromise. Any signal of possible compromise should trigger rotation.
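
The 2^32-message figure for AES-GCM in the list above is a birthday bound, assuming random 96-bit nonces: the chance that any two of q nonces collide is at most q(q - 1)/2 divided by 2^96, and NIST SP 800-38D asks that this stay at or below 2^-32. A short calculation shows where the limit lands.

```python
import math
from fractions import Fraction

NONCE_BITS = 96


def nonce_collision_bound(messages: int) -> Fraction:
    """Upper bound on P(any two random nonces collide) among `messages` nonces."""
    pairs = messages * (messages - 1) // 2
    return Fraction(pairs, 2**NONCE_BITS)


bound = nonce_collision_bound(2**32)
log2_bound = math.log2(bound)  # just under -33

within_nist_limit = bound <= Fraction(1, 2**32)
# Sixteen times more messages pushes the bound past 2^-32.
too_many = nonce_collision_bound(2**36) > Fraction(1, 2**32)
```

A nonce collision in GCM is not a graceful degradation: it exposes the XOR of the two plaintexts and lets an attacker recover the authentication key. Rotating before the bound is cheap insurance.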

The cadence varies by context. TLS session keys rotate every connection. Cloud KMS keys typically rotate annually with automated key version management, while the data encryption keys they wrap are often generated per object or per record. Long-term signing keys for code signing or CA roots may rotate every five to ten years, with substantial preparation work.

The operational hard part of rotation is orchestration across distributed systems. The new key has to be distributed to every system that needs it before the old key is deactivated; the old key has to be retained as long as data encrypted under it still exists. The standard pattern:

  1. Generate the new key.
  2. Distribute the new key to all systems that need to use it for encryption.
  3. Begin encrypting new data with the new key.
  4. Allow systems to decrypt with either the old or new key during the transition.
  5. Re-encrypt historical data with the new key (or accept that the old key must be retained for decryption).
  6. Decommission the old key once no encrypted data depends on it.
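
The steps above can be sketched with a versioned keyring. To stay within the standard library the sketch rotates HMAC keys rather than encryption keys, which is a substitution made for illustration; the orchestration (new operations use the newest version, old versions stay verifiable until retired) has the same shape. The class and method names are hypothetical.

```python
import hashlib
import hmac
import secrets


class MacKeyring:
    """Versioned keys: sign with the newest, verify with any retained version."""

    def __init__(self) -> None:
        self._keys = {}
        self._current = 0
        self.rotate()

    def rotate(self) -> int:
        """Steps 1-3: create a new version and make it the signing key."""
        self._current += 1
        self._keys[self._current] = secrets.token_bytes(32)
        return self._current

    def sign(self, message: bytes) -> tuple:
        tag = hmac.new(self._keys[self._current], message, hashlib.sha256).digest()
        return self._current, tag  # the version travels with the tag

    def verify(self, version: int, message: bytes, tag: bytes) -> bool:
        """Step 4: any retained version is accepted during the transition."""
        key = self._keys.get(version)
        if key is None:
            return False
        expected = hmac.new(key, message, hashlib.sha256).digest()
        return hmac.compare_digest(expected, tag)

    def retire(self, version: int) -> None:
        """Step 6: drop an old version once nothing depends on it."""
        if version == self._current:
            raise ValueError("cannot retire the active key version")
        self._keys.pop(version, None)


ring = MacKeyring()
old_version, old_tag = ring.sign(b"issued before rotation")
ring.rotate()
new_version, new_tag = ring.sign(b"issued after rotation")

old_valid_during_transition = ring.verify(old_version, b"issued before rotation", old_tag)
ring.retire(old_version)
old_valid_after_retirement = ring.verify(old_version, b"issued before rotation", old_tag)
```

Storing the key version next to every tag or ciphertext is the small design decision that makes steps 4 through 6 possible at all; without it, the system cannot tell which key a record depends on.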

The mistake pattern: rotating the key without re-encrypting old data, then losing the old key. The result is permanent inability to decrypt historical records. This happens.

Key destruction

Destroying a key sounds simple — delete the file, zero the memory — and is operationally one of the harder problems in the discipline. The reasons:

  • Backups and snapshots. A key has been backed up. Has the backup been destroyed too? Has the offline copy been destroyed? What about the disaster recovery site?
  • Memory residue. Keys held in process memory may persist in swap, hibernation files, core dumps, or VM snapshots even after the process exits.
  • Hardware caches. SSDs, drive controllers, and hardware caches retain data after logical deletion. Magnetic media retains data after deletion until overwritten.
  • Cloud provider replication. Data deleted from a cloud KMS may persist in the provider’s replication infrastructure for some retention window.

Cryptographic erasure is the standard mitigation. Rather than trying to physically destroy data, the data is encrypted with a key, and the key is destroyed. Destruction of the key (which is small) renders the data (which may be vast) cryptographically inaccessible. Modern self-encrypting drives use this pattern for fast secure erase: the drive holds a media encryption key, the drive contents are encrypted with it, and “secure erase” simply rotates the key, making the previous contents unreadable in milliseconds.

The same pattern applies at the cloud level. AWS S3, GCS, and Azure Blob Storage all support customer-managed encryption keys; destroying the key destroys access to the encrypted data without requiring the cloud provider to overwrite every block.

The remaining hard problem is ensuring that the key itself is actually destroyed everywhere. Cloud KMS services provide “scheduled key deletion” with a mandatory waiting period (7 days minimum on AWS, longer typical) before destruction is irreversible — a recognition that key destruction is easy to do by accident and hard to undo.

Threshold and multi-party schemes

The basic pattern of one key in one place has structural limits. Threshold cryptography and multi-party computation (MPC) address them.

Threshold signatures split a signing key into shares such that any k of n shareholders can collectively produce a valid signature, but fewer than k cannot. The signature appears identical to a single-key signature to a verifier; the verifier needs to do nothing different. Schemes exist for ECDSA threshold signing (used in some cryptocurrency custody systems), Ed25519 threshold signing, and several others. Threshold signing is particularly attractive for high-value signing keys (CA root keys, code signing keys) where no single operator should have unilateral signing capability.

Multi-Party Computation (MPC) more generally is a class of protocols that lets multiple parties compute a function over their private inputs without revealing the inputs to each other. Applied to key management, MPC lets a key be operationally held by a set of parties such that the key itself is never reconstructed in any single location — every cryptographic operation is performed cooperatively. Custody firms operating cryptocurrency at scale (Fireblocks, Coinbase Custody, Anchorage) use MPC-based key management for hot signing infrastructure where traditional HSMs would not provide enough operational flexibility.

The performance cost of MPC is real — operations that take microseconds on a single key can take seconds across an MPC cluster — but for high-value signing keys where the operational rate is low, the cost is acceptable.

Audit and observability

A key management program needs to answer four questions on demand:

  1. What keys do we have? (Key inventory)
  2. Where do they live? (Storage location for each)
  3. Who can use them? (Access control)
  4. Who has used them? (Audit log)
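
One way to picture an inventory that can answer those four questions is a record per key plus a check that flags the gaps. The field names, the 365-day threshold, and the sample entries below are all hypothetical; real inventories usually live in a database or are derived from KMS and HSM metadata.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import List


@dataclass
class KeyRecord:
    key_id: str                    # 1. what keys do we have?
    location: str                  # 2. where does it live?
    allowed_principals: List[str]  # 3. who can use it?
    last_rotated: date
    usage_log: List[str] = field(default_factory=list)  # 4. who has used it?


def audit_findings(record: KeyRecord, today: date, max_age_days: int = 365) -> List[str]:
    """Return the gaps that would stop this record from answering the questions."""
    findings = []
    if not record.allowed_principals:
        findings.append("no documented access list")
    if today - record.last_rotated > timedelta(days=max_age_days):
        findings.append("rotation overdue")
    if not record.usage_log:
        findings.append("no usage recorded")
    return findings


today = date(2026, 1, 15)
stale = KeyRecord("billing-hmac", "filesystem", [], date(2020, 3, 1))
healthy = KeyRecord(
    "orders-kek", "cloud KMS", ["orders-service"], date(2025, 11, 1),
    ["2026-01-14 decrypt by orders-service"],
)
stale_findings = audit_findings(stale, today)
healthy_findings = audit_findings(healthy, today)
```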

Answering these in a complex environment is operationally harder than it sounds. Keys accumulate over years in HSMs, cloud KMSes, application-specific stores, configuration management systems, password managers, and (regrettably) the heads of long-tenured engineers. The first time an organization tries to produce a complete key inventory, the typical finding is that nobody has the full picture.

Cloud KMS services help substantially because they centralize key usage through a single API with comprehensive audit logging. AWS CloudTrail, GCP Cloud Audit Logs, and Azure’s logging stack can record every key operation, although not always by default: data-plane logging (GCP Data Access audit logs, Azure Key Vault diagnostic logs) generally has to be enabled explicitly. The challenge is correlating those logs with the application-level identities that actually performed the operations, and with the data those operations protected.

For on-premises HSMs, the audit capability depends on the HSM model and configuration. Most enterprise HSMs log operations to an internal audit log that can be exported, but the integration with broader SIEM tooling is often custom.

The PCI DSS key management requirements (specifically Requirement 3 of PCI DSS v4.0) provide a useful framework for what a key management program needs to document, regardless of whether PCI compliance is a direct requirement. The framework covers key generation, distribution, storage, rotation, retirement, escrow, and the management roles involved.

Recurring incident patterns

A short catalog of how key management actually fails in production:

Hardcoded keys in source code. Persistent across decades of secure-coding training. Every major company has at least one historical incident of API keys, signing keys, or database credentials committed to a public repository. The detection tooling (gitleaks, truffleHog, GitHub secret scanning) catches most accidental commits at push time, but legacy history and private repositories remain a concern.

Keys leaked through CI/CD systems. A build process that has access to production keys, and that runs in an environment with broad logging, leaks those keys through build logs. The Codecov incident (2021) was a notable example — a compromised build script exposed environment variables across thousands of customer build pipelines.

Keys in screenshots and chat logs. An engineer screenshots their terminal during a debugging session and the screenshot includes an API key. The key gets shared on Slack, in a Jira comment, in an email. The exposure is bounded but real, and it appears in a non-trivial fraction of post-incident reviews.

Long-lived keys that should have been rotated. A key generated for a one-off integration five years ago that was never rotated, never inventoried, and is still active because no one knew to revoke it. The incident discovery is usually when the long-departed engineer’s GitHub account gets compromised and the attacker finds the key in their personal repository.

Keys in unencrypted backups. A database backup that contains an application’s master key (because the key is stored in the database for envelope encryption) is itself unencrypted in backup storage. The backup gets exposed; the master key gets exposed; everything the master key wrapped is now decryptable.

Disaster recovery keys nobody knows where to find. The encrypted backup is fine. The key to decrypt it is in a safe in an office that’s been closed for three years. Or it’s in an HSM that’s been decommissioned. Or it’s in a password manager whose master password was held by an employee who left.

Cryptocurrency wallet incidents are a continuous source of cautionary tales about key management at the consumer level. Seed phrase loss, exchange compromises, social engineering, hardware wallet phishing — the entire surface area of key management gets exercised against an asset class where keys directly represent millions of dollars.

The common thread across these patterns is that the cryptographic primitive was not involved in any of the failures. Every incident is about where the key was, who had access to it, and what happened to it operationally.

Compliance frameworks

Several compliance regimes specify key management requirements that often drive operational design:

  • FIPS 140-3 specifies the security requirements for cryptographic modules, including key management. The four security levels were covered in the HSM section above. FIPS 140 is mandatory for US federal systems and is broadly adopted as a benchmark elsewhere.
  • NIST SP 800-57 is the comprehensive NIST guidance on key management, in three parts (general, organizational best practices, application-specific). The document is dense but authoritative.
  • PCI DSS Requirement 3 specifies the key management requirements for systems handling payment card data, including key generation, distribution, storage, rotation, retirement, and split knowledge / dual control.
  • HIPAA specifies broad requirements for protecting health information, including key management requirements that are less prescriptive than PCI but cover similar ground.
  • FedRAMP layers additional key management requirements on top of FIPS 140 for cloud services serving the US federal government.
  • SOC 2 Type 2 audits assess key management practices as part of the security trust services criterion.

For most organizations, the compliance requirements are the floor for the key management program, not the ceiling. A program designed to meet only the compliance requirements typically has visible gaps when measured against the operational threat model.

Post-quantum key management

The post-quantum transition affects key management in three concrete ways:

Key sizes increase. ML-KEM-768 has a 1,184-byte public key (against 32 bytes for X25519) and a 2,400-byte secret key. ML-DSA-65 has a 1,952-byte public key, a 4,032-byte private key, and 3,309-byte signatures. SLH-DSA private keys are far smaller (64 to 128 bytes) but signatures run from roughly 8 KB to 50 KB. Existing HSMs, smart cards, and TPMs may have capacity limits that fit classical keys comfortably but require firmware updates or hardware replacement for post-quantum keys.

Hybrid storage patterns. During the migration window, systems need to hold both classical and post-quantum keys for the same identity. The storage primitives, the IAM policies, the rotation orchestration, and the audit logging all need to accommodate the dual-key reality.

Migration ordering. Key encryption keys (KEKs) need to be migrated to post-quantum before data encryption keys, because a quantum attacker who breaks a long-lived KEK retrospectively decrypts every DEK it has wrapped. Signing keys for long-lived signatures (firmware signing, root CA signing) need migration earlier than session-signing keys.

The cryptographic libraries (BoringSSL, OpenSSL 3.5+, the Go and Rust crypto ecosystems) are adding post-quantum support through 2025-2027. The cloud KMSes are adding ML-KEM and ML-DSA support on similar timelines. The Quantum Computing page on this site goes deeper into the algorithm choices and the engineering tradeoffs.

Standards and references

  • NIST SP 800-57 Part 1, Part 2, Part 3 — comprehensive key management guidance.
  • NIST SP 800-130 — Framework for Designing Cryptographic Key Management Systems.
  • NIST SP 800-131A — Transitions: Recommendation for Transitioning the Use of Cryptographic Algorithms and Key Lengths.
  • FIPS 140-3 — Security Requirements for Cryptographic Modules.
  • PCI DSS v4.0 Requirement 3, “Protect Stored Account Data.”
  • RFC 4949 — Internet Security Glossary.
  • PKCS#11 v3.1 — the HSM API standard.
  • CMVP (Cryptographic Module Validation Program), the program run by NIST and the Canadian Centre for Cyber Security that validates modules against FIPS 140-3.

What to actually use in 2026

For new systems, the practical recommendations:

  • For cloud-deployed applications: use the cloud provider’s KMS (AWS KMS, Google Cloud KMS, Azure Key Vault with the HSM tier). Use envelope encryption for bulk data. Use customer-managed keys with automated rotation. Use IAM to scope access narrowly. Enable comprehensive audit logging from day one.
  • For on-premises or hybrid systems: use a network-attached HSM (Thales Luna, Entrust nShield, or equivalent) for high-value keys, with proper cluster operation and key escrow. Use the OS keychain for user-level credentials and application secrets. Avoid filesystem-stored keys for anything that should outlive a single workload.
  • For service-to-service authentication: mTLS with per-service certificates managed by a service mesh or by cert-manager. Rotate certificates frequently (24 hours to 7 days is typical). Use SPIFFE/SPIRE for identity if the architecture supports it.
  • For developer workstations: hardware security keys (YubiKey, Google Titan, Nitrokey, SoloKeys) for SSH and signing operations. Keys generated on the hardware should never leave it. Backup keys for disaster recovery should be stored offline.
  • For high-value root signing keys: threshold signing schemes if the operational rate is low enough to tolerate the performance cost; HSM-held keys with quorum-controlled access otherwise. Document the recovery process and rehearse it.
  • For password-based key management: Argon2id with appropriate parameters (covered in the Hash Functions and MACs page). Never use raw hash functions for password-derived keys.
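
For the password-based bullet, a minimal sketch of what a memory-hard derivation looks like in code. Argon2id is not in the Python standard library, so this example swaps in scrypt (hashlib.scrypt), a different memory-hard function, purely to stay dependency-free; the cost parameters are illustrative assumptions, not tuned recommendations.

```python
import hashlib
import secrets

SALT_BYTES = 16


def derive_key(password: str, salt: bytes) -> bytes:
    """Derive a 32-byte key with scrypt (a stand-in here for Argon2id)."""
    return hashlib.scrypt(
        password.encode("utf-8"),
        salt=salt,
        n=2**14, r=8, p=1,        # illustrative cost parameters (about 16 MiB)
        maxmem=64 * 1024 * 1024,  # explicit headroom for the memory check
        dklen=32,
    )


salt = secrets.token_bytes(SALT_BYTES)  # stored alongside the ciphertext; not secret
key = derive_key("correct horse battery staple", salt)
same_key = derive_key("correct horse battery staple", salt)
other_key = derive_key("correct horse battery staple", secrets.token_bytes(SALT_BYTES))
```

The same password and salt always give the same key, and a fresh salt gives an unrelated one. The memory cost is what makes each guess expensive on attacker hardware, which a bare SHA-256 call does not do.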

Avoid: keys in source code, keys in environment variables for production systems that have better options available, keys held only in a single developer’s password manager, manually-rotated keys without orchestration tooling, keys without an inventory entry, keys without an audit log, and any key management system that depends on a specific person to function.

The cryptographic primitives have been good enough for decades. The deployments fail at the boundaries — at the moment a key crosses from a system that should protect it to a system that doesn’t. The work of key management is the work of keeping those boundaries explicit, auditable, and operationally maintainable over years. The discipline is unglamorous, the tooling is incomplete, the incidents keep happening, and there is no shortcut.