§ AU

ESXi Can Write Its Logs to a RAM Disk. Ransomware Counts on It

The first time you handle a hypervisor ransomware case, you learn something that isn’t in the runbook: you image the ESXi host, go looking in /var/run/log (the real log directory — /var/log is mostly symlinks pointing at it), and there’s almost nothing there. A day of shell history, if that. No auth.log going back to initial access. No hostd.log covering the reconnaissance week. The encryption event itself, if you’re lucky. You go looking for the timeline that reconstructs how root got compromised, and the timeline was in a ramdisk that got flushed when the host was rebooted — or powered off — to make the running VMs release their disks before encryption.

That’s not tampering, exactly. That’s the default configuration doing what it was designed to do. ESXi booting from a USB stick, an SD card, or a stateless Auto Deploy image has no persistent home for its scratch space, so it maps scratch to an in-memory ramdisk and /var/run/log rides along with it — and unless someone explicitly pointed Syslog.global.logDir at a persistent datastore or set Syslog.global.logHost to forward off-box, the logs live and die in RAM. (Boot-from-SAN and local-disk installs normally get a persistent scratch partition; what really decides it is where ScratchConfig.ConfiguredScratchLocation points, not the label on the boot device.) Broadcom’s own KB (article 317690, the “system logs are stored on non-persistent storage” warning everyone clicks past in the vSphere Client) says this in as many words. The crews hitting ESXi know it better than most of the people running the hosts.

This is the detection-and-tradecraft half of the story. Its companion piece — ESXi Forensics and the RAMdisk Logging Gap Nobody Configured Around — takes the forensic-acquisition side: log rotation as a second way the evidence disappears, the full /scratch/log symlink chain, and how to pull what’s left off a host that already rebooted.

Why the host gives you nothing

Walk the boot media question first, because it decides everything downstream. A host installed to a local SSD or M.2 with a real scratch partition — or booting from a properly provisioned SAN LUN — will usually persist /var/run/log across reboots. A host that boots from a USB stick or SD card — still extremely common in blade chassis and older HPE/Dell gear — falls back to a ramdisk scratch and keeps its logs in volatile memory. Same ESXi build, completely different forensic outcome, and it’s the scratch configuration rather than the boot media’s label that ultimately decides it — you often don’t know which one you’re dealing with until you’re already on the host with a dead-box acquisition in progress.

The logs that matter for a ransomware timeline are a short list. /var/run/log/auth.log for authentication, /var/run/log/shell.log for esxcli and shell command execution, /var/run/log/hostd.log for the host agent (VM power operations, config changes, API-driven activity), /var/run/log/vpxa.log for the vCenter management agent when the host is attached to vCenter (the host-side record of vpxd-driven operations), and /var/run/log/vobd.log for the observer daemon that records things like SSH state changes and firewall ruleset edits. Note that vpxa.log lives on the host and dies with the rest of them. On vCenter-managed estates you also get vpxd.log, and this is the one piece of good news: unlike vpxa, vpxd.log lives on the vCenter appliance, not the ESXi host, so it survives a host reboot — assuming the vCenter appliance itself wasn’t sitting in the blast radius. If the attacker went through vCenter — and a lot of them do, because that’s where the flat admin credentials are — you have a parallel record they didn’t necessarily reach.

One more default worth internalizing: ESXi audit records are off — where they exist at all. The esxcli system auditrecords subsystem, the thing that would give you a proper immutable-ish audit trail of privileged operations, has been around since vSphere 7.0 Update 1 (earlier hosts don’t have it), and everywhere it exists it’s deactivated out of the box and has to be turned on for both local and remote modes. Most shops never do. So the audit facility that would catch the anti-forensics is itself dark on the majority of hosts, which is its own small tragedy.

The anti-forensics is boring, which is the point

There’s a temptation to imagine hypervisor ransomware as some exotic kernel-level thing. It isn’t. Once an operator has root on an ESXi host — usually via a compromised vCenter, a stolen SSH key, or a domain account that maps to an admin role through AD integration — the tradecraft is a handful of native esxcli and shell operations, and several of them exist purely to blind the responder who shows up later.

The pattern that reporting on Akira and its Linux/ESXi variants keeps describing (Sygnia and Security Risk Advisors both have solid writeups): enable SSH if it’s off, disable the relevant firewall rulesets, redirect or kill syslog, turn off coredumps, lower the VIB acceptance level so unsigned tooling loads, and in some cases step the system clock. Redirecting Syslog.global.logDir to /tmp is the elegant one — it doesn’t trip a “logs were deleted” alarm because nothing gets deleted. The logs just start writing to volatile storage, and the next reboot takes them — and a power-off or reboot is usually part of the sequence anyway, because the guests have to release their VMDKs before encryption. (Plenty of crews force the VMs off with esxcli vm process kill rather than bounce the host — but the ones who reboot get the log flush for free.) Disabling esxcli system auditrecords does the same to the audit stream if it was ever on.

I want to be precise about the SSH-tunneling persistence people cite, because it gets over-dramatized. A remote port-forward back to a C2 over native ssh is real and it’s used, but it’s not a magic backdoor. It’s an outbound SSH session, and if your ESXi management network has any egress control at all, an ESXi host reaching out to an internet IP on port 22 is a screaming anomaly. The problem is that most management networks have no egress monitoring, because “it’s the management VLAN, it’s trusted.” That assumption is doing a lot of unearned work.

What the detection actually looks like

Here’s the part that matters: none of this is detectable on the host after the fact, so the detection has to be live and it has to be forwarded. If you run Splunk, the ESXi security content ships with an app that expects index=vmware-esxilog and sourcetype=vmw-syslog, fed by the esxi_syslog macro (the general Splunk Add-on for VMware ESXi parses the same syslog under a different sourcetype, vmware:esxilog — same underlying data, different TA, so confirm which one you’ve got before lifting field names). If you’re on Elastic, you’re building the equivalent ingest pipeline yourself and it’s messier, because ESXi’s syslog format isn’t clean key-value and the field extractions fight you.

The high-signal strings are few and specific. "SSH access has been enabled" shows up in hostd and vobd when SSH flips on. The syslog-tamper detection keys on the config-set event, roughly Set called with key '*logHost*' or *logDir* — that’s your attacker repointing the log destination, and it should be a near-zero-volume event in a steady-state estate. esxcli system auditrecords with local or remote disable is another. Firewall ruleset changes surface as Operation 'disable' for rule set succeeding. And there’s a genuinely clever one from the Splunk content: an NTPClock system clock stepped event with a delta over 172800 seconds — a two-day jump — catches the timeline-corruption trick, where an operator moves the clock to scramble correlation across your other sources.
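To make those strings concrete, here is a minimal Python sketch of a line classifier, the sort of thing you might run over forwarded syslog if you have no SIEM content yet. The regular expressions paraphrase the strings quoted above and will likely need adjusting to the exact wording your ESXi build emits. The rule names, the sample lines, and the choice to treat backward clock steps the same as forward ones are my assumptions.

```python
import re

# Patterns paraphrase the strings quoted in the text; exact log wording
# varies by build, so treat these as starting points, not signatures.
RULES = [
    ("ssh_enabled", re.compile(r"SSH access has been enabled")),
    ("syslog_retarget", re.compile(r"Set called with key '[^']*(logHost|logDir)")),
    ("audit_disabled", re.compile(r"system auditrecords (local|remote) disable")),
    ("firewall_disabled", re.compile(r"Operation 'disable' for rule set")),
]

# Two days in seconds, the clock-step threshold mentioned in the text.
CLOCK_STEP_LIMIT_S = 172800


def classify(line):
    """Return the names of every rule that matches one syslog line."""
    return [name for name, pattern in RULES if pattern.search(line)]


def clock_step_is_suspicious(delta_seconds):
    """Flag a clock step larger than the limit, in either direction (assumption)."""
    return abs(delta_seconds) > CLOCK_STEP_LIMIT_S


# Hypothetical sample lines, not copied from a real host.
samples = [
    "Hostd: SSH access has been enabled",
    "Hostd: Set called with key 'Syslog.global.logDir', value '/tmp'",
    "shell: esxcli system auditrecords local disable",
    "Hostd: Task completed normally",
]
for line in samples:
    print(classify(line), line)
```

The clock-step check takes an already-parsed delta rather than a raw line, because I am not certain of the exact event format and would rather not guess at it.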

Now the reality check, because a clean list of search strings is a lie of omission. "SSH access has been enabled" is noisy in exactly the shops that most need it: every backup product that touches ESXi, every automation runbook, and half the VMware admins doing legitimate break-fix all toggle SSH. In a mid-size estate expect that string to fire a handful of times a day, mostly benign. The first round of tuning is not on the string but on the source. Whitelist the jump hosts and the backup appliance IPs that legitimately enable SSH, alert only when the enable comes from an interactive session or an unexpected source address, and pair it with an “SSH access has been disabled” correlation, so that what actually pages someone is a toggle that turns on and never turns back off within your normal maintenance window. An enable with no matching disable is the shape you care about.
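The enable-without-disable shape can be sketched in a few lines of Python. This assumes the events have already been parsed into tuples and that each enable has a source address attached, which in practice usually means joining against authentication events. The allowlist addresses, the function name, and the reason labels are all hypothetical.

```python
from datetime import datetime, timedelta

# Hypothetical allowlist: jump hosts and backup appliances that may enable SSH.
ALLOWED_SOURCES = {"10.0.0.10", "10.0.0.11"}


def ssh_enable_alerts(events, window, allowed=ALLOWED_SOURCES):
    """Return (timestamp, host, reason) for SSH enables worth a human look.

    events: iterable of (timestamp, host, action, source_ip) tuples, where
    action is "enable" or "disable". The source_ip field is an assumption;
    it usually has to be joined in from authentication events.
    """
    ordered = sorted(events, key=lambda e: e[0])
    alerts = []
    for i, (ts, host, action, src) in enumerate(ordered):
        if action != "enable":
            continue
        if src not in allowed:
            alerts.append((ts, host, "unexpected_source"))
            continue
        # Look for a disable on the same host inside the maintenance window.
        closed = any(
            h == host and a == "disable" and ts <= t <= ts + window
            for t, h, a, _ in ordered[i + 1:]
        )
        if not closed:
            alerts.append((ts, host, "never_disabled"))
    return alerts


# Illustrative data only.
t0 = datetime(2024, 1, 1, 12, 0)
sample = [
    (t0, "esx1", "enable", "10.0.0.10"),
    (t0 + timedelta(minutes=30), "esx1", "disable", "10.0.0.10"),
    (t0, "esx2", "enable", "10.0.0.10"),
    (t0 + timedelta(hours=1), "esx3", "enable", "203.0.113.7"),
]
alerts = ssh_enable_alerts(sample, window=timedelta(hours=2))
print(alerts)
```

An unexpected source alerts immediately, without waiting to see whether a disable follows. An attacker who tidies up after themselves should not get a pass for it.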

The logDir/logHost change and the auditrecords disable are different animals — those should be so rare that you can afford to alert on every single one and eat the occasional false positive from a legitimate reconfiguration. If your SOC gets paged because a VMware engineer moved the log directory to a new datastore, that’s a thirty-second Slack confirmation, not alert fatigue. Tune those toward paranoia.

The unglamorous prerequisite under all of it: Syslog.global.logHost has to be set, on every host, pointing at a collector that isn’t itself a VM on the cluster you’re trying to protect. That last clause is the one people miss. If your syslog target is a Linux VM running on the same vSAN cluster the ransomware just encrypted, you have forwarded your evidence into the blast radius. Put the collector somewhere the hypervisor compromise can’t reach — a separate management cluster, a physical box, or your SIEM’s cloud ingest tier. And watch the forwarding path itself: vmsyslogd-dropped.log (and .vmsyslogd.err) record messages the host had to drop because a buffer filled or the remote collector went unreachable — exactly what you’d see if an attacker cut forwarding or the collector sat in the blast radius. A gap in received host syslog that lines up with drop entries is its own signal, not just an outage.
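The gap-plus-drops signal is simple enough to sketch. The Python below assumes you already have, per host, the arrival times of messages at the collector and the timestamps of any drop entries recovered from the host. Parsing those files is left out because I do not want to guess at their format. The function names and the ten-minute silence threshold are illustrative choices.

```python
from datetime import datetime, timedelta


def find_gaps(received, max_silence):
    """Return (last_seen, next_seen) pairs where a host went quiet too long."""
    ordered = sorted(received)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > max_silence]


def gaps_with_drops(received, drop_times, max_silence):
    """Keep only the gaps that contain at least one host-side drop entry."""
    return [
        (start, end)
        for start, end in find_gaps(received, max_silence)
        if any(start <= d <= end for d in drop_times)
    ]


# Illustrative data: one host's message arrival times and one drop entry.
t0 = datetime(2024, 1, 1, 12, 0)
received = [t0 + timedelta(minutes=m) for m in (0, 1, 2, 50, 51)]
drops = [t0 + timedelta(minutes=20)]
suspicious = gaps_with_drops(received, drops, max_silence=timedelta(minutes=10))
print(suspicious)
```

This only sees gaps between two received messages. A host that goes quiet and never comes back needs a separate "last seen" check against the current time.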

What changes the answer

Standalone ESXi versus vCenter-managed is the biggest fork. Standalone hosts have no vpxd.log safety net; if syslog forwarding wasn’t configured, a rebooted standalone host is forensically close to a brick. vCenter estates at least give you the appliance-side record of who did what through the API.

Boot media, as covered, decides whether local logs survive at all. FedRAMP and DoD estates running the DISA ESXi STIG are in materially better shape here, because the STIG mandates persistent logging and remote syslog — Syslog.global.logHost set and logDir on persistent storage are explicit findings. If you’re in a hardened enclave, check your own compliance scan; the control you need may already be enforced and you just have to point it at the right collector. Commercial estates with no such mandate are where the ramdisk default quietly wins.

Time skew deserves a specific callout. ESXi hosts drift, and a cluster where NTP was never configured properly will show hosts minutes or more apart. When you’re stitching vpxd.log from vCenter against forwarded host syslog against your EDR on the guest VMs, that skew turns into hours of wasted correlation effort and, worse, it gives a clock-stepping attacker natural cover. Fix NTP before you need the timeline, not during the incident.
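If you do end up correlating across skewed sources, the mechanical part looks roughly like this Python sketch. It assumes you have measured each source's clock offset against a reference you trust, which is the hard part and is not shown. The function names and the sample offsets are hypothetical.

```python
from datetime import datetime, timedelta


def normalize(events, offsets):
    """Shift each event onto the reference clock and merge into one timeline.

    events: iterable of (source, local_timestamp, message).
    offsets: source -> (source clock minus reference clock) as a timedelta.
    Sources with no measured offset are left unshifted.
    """
    fixed = [
        (ts - offsets.get(src, timedelta(0)), src, msg)
        for src, ts, msg in events
    ]
    return sorted(fixed, key=lambda e: e[0])


def skewed_sources(offsets, tolerance):
    """Return the sources whose offset exceeds the tolerance in either direction."""
    return sorted(s for s, off in offsets.items() if abs(off) > tolerance)


# Illustrative data: the host clock runs five minutes ahead of vCenter.
offsets = {"esx1": timedelta(minutes=5), "vcenter": timedelta(0)}
events = [
    ("vcenter", datetime(2024, 1, 1, 12, 2), "login via API"),
    ("esx1", datetime(2024, 1, 1, 12, 5), "SSH access has been enabled"),
]
timeline = normalize(events, offsets)
print(timeline)
print(skewed_sources(offsets, tolerance=timedelta(minutes=1)))
```

Note what the correction does to the ordering: by raw timestamps the vCenter login precedes the SSH enable, while on the reference clock the enable came first. Errors of that kind cost hours during an incident.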

Mapping and remediation

The controls here are almost entirely about pre-positioning, because everything reactive fails on a hypervisor with no agent.

| Control | What it buys you on ESXi |
| --- | --- |
| AU-4 / AU-11 | Persistent logDir on a VMFS datastore and defined retention, so a reboot doesn’t erase the record. Not vSAN, though: Broadcom doesn’t support scratch or syslog on a vSAN datastore (it can hang the host), so use a VMFS LUN |
| AU-9 | Off-host forwarding via logHost to a collector outside the cluster: log protection that survives host compromise |
| AC-6 / lockdown mode | Normal (or strict) lockdown so direct host access requires going through vCenter, shrinking the ungoverned root paths |
| CM-7 | SSH and the ESXi Shell disabled by default; the shell timeout set low so an enabled session doesn’t linger |
| CM-5 / SI-7 / SR | Secure Boot to enforce that loaded VIBs and boot components are signed, plus execInstalledOnly (built on top of Secure Boot, default-on in ESXi 8.0) to block execution of binaries not delivered through an installed VIB; complementary controls against the acceptance-level-lowering move, not the same one |
| IA-2 | MFA on vCenter and, where AD integration exists, killing the flat “ESX Admins”-style group mappings that hand out host root |
| CP-9 / CP-10 | Backups isolated from the cluster they protect, immutable where you can get it, because the encryption event is the one you can’t detect your way out of |

Enable audit records (esxcli system auditrecords local enable and remote enable) while you’re in there. It’s off by default, it’s cheap, and it’s the one facility purpose-built to catch the tampering.

The uncomfortable summary: on a default ESXi host, your incident response capability was decided months before the incident, by whether someone set two advanced parameters. Get logHost and logDir right across the fleet and verify the collector lives outside the blast radius. Everything else in the runbook assumes those two are already done, and on most of the hosts I’d expect to walk into cold, they aren’t.
