ESXi Forensics and the RAMdisk Logging Gap Nobody Configured Around
A single ESXi host can carry forty production VMs, and that is exactly why ransomware crews care about it. Get root on the hypervisor and you do not have to touch a single guest from the inside. Stop the VMs, then encrypt the flat VMDKs on the datastore directly, and the whole estate goes dark in the time it takes to script a loop. That mechanism is well documented now. The part that keeps coming up short in the after-action is quieter: by the time IR gets a session on the box, the logs that would explain how root happened are often already gone. Not wiped by some clever anti-forensics trick. Gone because ESXi wrote them to a ramdisk and nobody ever turned on persistence.
This is a logging-architecture problem wearing a ransomware costume, and it is worth understanding before the day you need the evidence.
Where ESXi actually keeps its logs
ESXi logs to memory first. The default log path is /scratch/log, which on a normal install is a symlink that chains through /var/lib/vmware/osdata/locker/log and on to /vmfs/volumes/{uuid}/log (Synacktiv’s Velociraptor writeup walks the full chain, and it is messier than the docs make it sound). If /scratch resolves to persistent storage, those logs survive a reboot. If it does not — on hosts that boot from SD or USB media, or that left scratch pointed at the ramdisk — /scratch/log resolves to /var/run/log, which lives in volatile memory, and a reboot takes the log set with it. Broadcom’s own KB on this is blunt about it: ESXi logs internally to a ramdisk, and persistent logging across reboots requires either a scratch location on disk or a remote syslog target.
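The persistence question reduces to where the symlink chain ends, which can be sketched in a few lines of Python. This is a minimal illustration, not a VMware API: the function names are made up, and the prefix lists are assumptions taken from the paths named above, so adjust them for your hosts.

```python
import os

# Assumed prefixes, based only on the paths named in the text above.
PERSISTENT_PREFIXES = ("/vmfs/volumes/",)
VOLATILE_PREFIXES = ("/var/run/", "/tmp/")


def classify_log_target(resolved_path):
    """Classify an already-resolved log directory as persistent, volatile, or unknown."""
    path = resolved_path.rstrip("/") + "/"
    if path.startswith(PERSISTENT_PREFIXES):
        return "persistent"
    if path.startswith(VOLATILE_PREFIXES):
        return "volatile"
    return "unknown"


def classify_scratch_log(link="/scratch/log"):
    """Resolve the symlink chain on the live host, then classify the final target."""
    return classify_log_target(os.path.realpath(link))
```

On an analyst workstation you would feed `classify_log_target` a path recorded during collection. `classify_scratch_log` only makes sense on the host itself. Treat "unknown" as a prompt to look by hand, not as a verdict.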
Then there is the file layout, which trips up anyone expecting a single Linux-style syslog. ESXi splits its telemetry across files by function. /var/log/syslog.log carries plenty of system-level and daemon messages, but it is not the aggregate that holds the events you care about for an intrusion. The ones that matter, per Sygnia’s breakdown, are:
- /var/log/hostd.log: host agent, where management actions and SSH service state changes land
- /var/log/auth.log: authentication, including interactive SSH sessions
- /var/log/shell.log: ESXi shell command history
- /var/log/vobd.log: the observer daemon, which emits the audit VOB events you actually want to alert on
Miss the fact that they are separate and you will build a detection against syslog.log, see nothing, and conclude the host is clean. It isn’t. You’re reading the wrong file.
Two ways the evidence is gone before you get there
The reboot is the obvious one. Ramdisk-resident, no persistent scratch, host bounces during or after the incident, and the local record is zeroed. Plenty of encryptors leave the host in a state where someone power-cycles it during recovery, and that recovery step is what destroys the timeline.
The subtler one is rotation. ESXi rotates those files by size, with a small number of rotations kept by default — the Syslog.global.defaultSize and Syslog.global.defaultRotate knobs, roughly 1 MB per file and around eight rotations out of the box. A host being hammered (VMs forced off, datastore I/O spiking, the management agents screaming) generates enough log to churn through that rotation window fast. You can have persistent scratch configured correctly and still find that the relevant hostd.log lines rolled off before you collected, buried under recovery noise. Persistence buys you survival across reboot. It does not buy you retention against a noisy host. Only off-box forwarding does that — and on a busy estate, raising defaultSize and defaultRotate on hosts with persistent scratch is a cheap hedge while you get forwarding stood up.
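The rotation window is simple arithmetic, and running it for your own hosts is sobering. A rough sketch follows, assuming the log volume per second is something you measure yourself (the rate in the example is hypothetical) and ignoring whether the active file counts toward the rotation total.

```python
def retention_window_seconds(size_kib, rotations, bytes_per_second):
    """Approximate seconds of history one log keeps before the oldest lines roll off."""
    if bytes_per_second <= 0:
        raise ValueError("bytes_per_second must be positive")
    # Capacity is per-file size times the number of retained files.
    capacity_bytes = size_kib * 1024 * rotations
    return capacity_bytes / bytes_per_second
```

With the out-of-the-box values from above (1024 KiB, eight rotations) and a host emitting a hypothetical 4 KiB of hostd.log per second, `retention_window_seconds(1024, 8, 4096)` comes to 2048 seconds, or about 34 minutes of history. Quadrupling either knob buys a little over two hours at the same rate, which is why forwarding is the real fix.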
And yes, an operator with root can stop vmsyslogd or clear /scratch/log if it’s persistent. But honestly, in most of these cases you don’t need to invoke deliberate anti-forensics to explain the missing logs. The ramdisk did it for free.
What to detect, and which file it lives in
Get the logs off the box. This is the whole ballgame for retention. If you have Splunk, the Splunk Add-on for VMware ESXi gives you a parsed vmware:esxilog sourcetype. If you aren’t using the add-on, raw syslog to port 514 with a sane props/transforms config works; it’s just messier, and you’ll be fighting field extraction later. Prefer TCP over plain UDP 514 so messages aren’t silently dropped, and prefer TLS if integrity and confidentiality of the audit stream matter to you, because plain TCP protects neither. Either way, set Syslog.global.logHost on every host. Without it your DFIR collection is reduced to whatever happened to still be sitting on scratch, which is a bad place to be making promises to an ISSO from.
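Auditing a fleet for this is a one-function job once you have exported each host's Syslog.global.logHost value. The sketch below assumes the value is a comma-separated list of targets using the udp://, tcp://, and ssl:// scheme prefixes. The function name and the three result labels are illustrative.

```python
def audit_loghost(value):
    """Return 'missing', 'plaintext', or 'tls' for one Syslog.global.logHost value."""
    targets = [t.strip().lower() for t in (value or "").split(",") if t.strip()]
    if not targets:
        return "missing"
    # Only report 'tls' when every configured target uses the ssl:// scheme.
    if all(t.startswith("ssl://") for t in targets):
        return "tls"
    return "plaintext"
```

A mixed list such as one ssl:// target plus one udp:// target is reported as "plaintext" on purpose, since the weakest target decides what an on-path observer can read.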
The signal worth alerting on first is SSH and shell enablement. In vobd.log you’re looking for the audit VOBs esx.audit.ssh.enabled and esx.audit.shell.enabled; hostd.log carries the corresponding service-state change; auth.log shows the actual sshd session opening, usually as root. On a healthy estate these events are near zero. SSH on ESXi is supposed to be off, enabled briefly under change control, and turned back off. So the detection is almost embarrassingly simple: count esx.audit.ssh.enabled per host, alert on anything above the baseline, which should be roughly nothing. Watch especially for the enable-log-in-then-disable pattern — a brief SSH enable, a root session in auth.log within a few minutes, then a quiet disable is a classic operator tradecraft sequence, and the disable is what makes it look tidy after the fact.
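Both detections can be sketched against already-normalized events. Everything about the input shape here is an assumption: the tuples are (epoch seconds, host, kind), and the kind labels ssh_enabled, root_login, and ssh_disabled are hypothetical names your own parser would assign from vobd.log and auth.log. The time windows are placeholders to tune.

```python
from collections import Counter, defaultdict


def count_ssh_enables(events):
    """Count ssh_enabled events per host."""
    return Counter(host for _, host, kind in events if kind == "ssh_enabled")


def find_enable_login_disable(events, login_window=300, disable_window=3600):
    """Return (host, enable_ts, login_ts, disable_ts) for each enable -> root login -> disable run."""
    by_host = defaultdict(list)
    for ts, host, kind in sorted(events):
        by_host[host].append((ts, kind))

    hits = []
    for host, seq in by_host.items():
        for i, (t_enable, kind) in enumerate(seq):
            if kind != "ssh_enabled":
                continue
            later = seq[i + 1:]
            # First root login that follows the enable within login_window seconds.
            login = next((t for t, k in later
                          if k == "root_login" and t - t_enable <= login_window), None)
            if login is None:
                continue
            # First disable that follows that login within disable_window seconds.
            disable = next((t for t, k in later
                            if k == "ssh_disabled" and login <= t <= login + disable_window), None)
            if disable is not None:
                hits.append((host, t_enable, login, disable))
    return hits
```

The count is the cheap alert; the sequence match is the one worth a page. Neither replaces a proper correlation search in your SIEM; the point is only to show how little logic the pattern needs.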
The catch is the shops that leave SSH on permanently “for convenience.” That’s the load-bearing bad habit. If that’s your environment, the enable event never fires because it was enabled in 2021, and your whole detection collapses to noise. You then have to pivot down to auth.log and alert on root SSH sessions from any source that isn’t your jump host, which means you need to actually know your jump host IP and keep that allowlist current. Nobody does. So fix the habit first: SSH off by default is a prerequisite for the cheap detection, not a nice-to-have.
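If you are stuck with the fallback for now, the allowlist check itself is small. A minimal sketch, assuming sessions have already been parsed out of auth.log into (host, user, source IP) tuples and that the allowlist is a list of addresses or CIDR ranges you maintain; both shapes are illustrative.

```python
import ipaddress


def is_allowed_source(source_ip, allowlist):
    """True if source_ip falls inside any allowlisted address or CIDR range."""
    addr = ipaddress.ip_address(source_ip)
    return any(addr in ipaddress.ip_network(entry, strict=False) for entry in allowlist)


def unexpected_root_sessions(sessions, allowlist):
    """Return the root sessions whose source is not on the allowlist."""
    return [s for s in sessions
            if s[1] == "root" and not is_allowed_source(s[2], allowlist)]
```

The code is the easy part. The detection is only as good as the allowlist, which is the maintenance burden described above.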
Beyond that, a short list of things worth a rule: local account creation through esxcli system account add on a host that should only ever authenticate against vCenter or AD; lockdown mode being disabled; and a burst of VM power-off operations clustered tighter than any human admin would issue them. That last one is your bridge between “someone is poking around” and “the encrypt is starting.”
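The power-off burst is a sliding-window count. A sketch under stated assumptions: the input is a list of epoch-second timestamps for power-off operations on one host, and the threshold of five within sixty seconds is a placeholder, not a recommended value.

```python
from collections import deque


def find_poweroff_bursts(timestamps, threshold=5, window_seconds=60):
    """Return the start time of each burst of at least `threshold` power-offs in the window."""
    window = deque()
    bursts = []
    in_burst = False
    for ts in sorted(timestamps):
        window.append(ts)
        # Drop events that have aged out of the sliding window.
        while ts - window[0] > window_seconds:
            window.popleft()
        if len(window) >= threshold:
            if not in_burst:  # report each burst once, at its first event
                bursts.append(window[0])
                in_burst = True
        else:
            in_burst = False
    return bursts
```

Baseline it against a real maintenance window before trusting the numbers: a scripted, legitimate host evacuation can look just as tight as an attacker's loop, so pair this rule with the source carve-outs rather than loosening the threshold.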
On false positives. The good news is that the legitimate heavy hitters mostly stay out of your way. Veeam, Nakivo, and the rest of the backup stack typically talk to the host through the vSphere API by way of vpxa/hostd, not by opening an interactive SSH session, so a root SSH login from a backup proxy’s IP is usually odd and worth waking someone for. The caveat: the standard VDDK transports do not use SSH (NBD moves data over NFC on TCP 902, and HotAdd attaches the disks to a proxy VM), but some legacy or scripted workflows (ghettoVCB-style scripts, certain guest-processing or quiescence helpers) do open SSH to the host. Confirm how your backup stack actually moves data before you treat backup-proxy SSH as automatically malicious, and build the allowlist from real logs, not assumptions. The bad news is everything an admin does by hand during a real maintenance window looks identical to early-stage intrusion. Expect the first week of any new ESXi rule set to light up around your patching and your storage migrations. The tuning is boring and necessary: enumerate the maintenance source IPs, the break-glass account, the monitoring poller, and carve them out by source rather than by event type. Carving by event type is how you end up suppressing the exact thing you built the rule for.
On time. Check NTP before you trust any of it. ESXi hosts drift, NTP gets misconfigured or points at a server that was decommissioned two reorgs ago, and a host clock that’s several minutes off will quietly destroy your correlation against vCenter’s event database and the AD authentication trail. Pull the configured source and the actual offset as part of collection, not as an afterthought, because a timeline you can’t defend is worse than no timeline: esxcli system ntp get for the configuration (on releases that include that namespace) and esxcli system time get for the current host time. And don’t stop at the system clock. The host seeds its system time from the hardware clock at boot, so a drifted hardware clock can skew log lines written after a reboot until NTP corrects it, even when esxcli system time get looks right by the time you collect. Pull esxcli hardware clock get as well.
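Recording the offset only helps if you apply it consistently afterwards. A minimal sketch of that bookkeeping, assuming you captured the host's clock and a trusted reference clock at the same moment during collection, and assuming the offset was constant across the incident window (real drift may not honor that). The function names are illustrative.

```python
from datetime import datetime, timedelta, timezone


def measure_offset(host_now, reference_now):
    """Positive result means the host clock runs ahead of the reference."""
    return host_now - reference_now


def to_reference_time(host_timestamp, offset):
    """Shift a host log timestamp onto the reference timeline."""
    return host_timestamp - offset
```

For example, if the host read 12:04:00 UTC when your reference read 12:00:00 UTC, the offset is four minutes, and a hostd.log line stamped 11:34:00 belongs at 11:30:00 on the vCenter and AD timeline. Keep the raw timestamps alongside the corrected ones so the adjustment stays auditable.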
vCenter changes the answer
If the host is vCenter-managed, you have a second copy of a lot of this. vpxd.log and the vCenter events database record management-plane actions, and vCenter normally runs on real disk with real retention, so it’s often the more reliable witness than the host itself. Host-side vpxa.log and fdm.log (vSphere HA) can corroborate management actions too, especially when the host stayed connected. That said, the second copy is only as good as the management plane’s integrity: an actor with ESXi root acting on a host that’s disconnected from vCenter, or one who has compromised vCenter directly, can leave little or nothing in vpxd — so treat it as corroboration, not gospel. Standalone hosts are where you’re fully exposed: the small ROBO site, the two-host cluster at a branch, the lab box that quietly went into production. Those depend entirely on their own ramdisk, and they are overwhelmingly the ones that never had logHost set.
Regulated estates tend to mandate syslog forwarding, so a FedRAMP Moderate or High boundary will usually have it; commercial SMB virtualization frequently runs hosts where nobody has ever opened the Advanced Settings. Know which one you’re defending, because it determines whether your IR plan starts with “pull from the SIEM” or “pray scratch was persistent.”
NIST 800-53 mapping
| Control | Why it bites here |
|---|---|
| AU-4 | The audit storage capacity is, literally, a ramdisk measured in megabytes. This is the control the architecture violates by default. |
| AU-9 | Off-box forwarding is the only thing that survives an attacker — or a reboot — killing local logging. Protect the audit record by getting it off the asset, ideally over TLS. |
| AU-11 | Retention against a noisy host is a forwarding/SIEM problem, not a host problem. Persistent scratch alone won’t satisfy it. |
| AU-2 / AU-6 | You can’t review what you never separated correctly. The multi-file layout is an event-selection trap. |
| SI-4 | The SSH-enable and mass-power-off detections live here. |
| AC-17 / MA-4 | SSH to ESXi is nonlocal maintenance / remote access. Off by default, logged when on. |
| CM-6 / CM-7 | logHost, scratch location, and SSH default state are configuration baseline items. Drift on these is the whole problem. |
| CP-9 / CP-10 | The VMs are the asset. Backups that a hypervisor-root actor can also reach are not backups. |
The AU-4 line is the one I’d put in front of an AO. Most audit-storage conversations are about index sizing and cold-tier budgets. On ESXi the storage capacity that matters is a few megabytes of volatile memory, and the entire forensic value of the host hangs on a config setting that ships effectively off.
When you arrive after the host already rebooted
If there was no forwarding, the order of operations matters. Don’t reboot it again. First, resolve where the logs actually live with readlink -f /scratch/log, since it may be a symlink into the ramdisk. Next, run esxcli system syslog config get to read the configured log directory and remote host (Syslog.global.logDir and Syslog.global.logHost) and learn whether anything was ever persisted or shipped. Then acquire the log directory live, pulling what’s resident in the ramdisk before anyone “helpfully” power-cycles for recovery. Velociraptor’s offline collector handles ESXi well enough for this; hash everything on the way out and record the host’s clock offset alongside the collection so the timeline holds up later.
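If your collector does not already produce a hash manifest, or you want an independent one, the standard library covers it. A sketch, assuming you run it against the copied-out collection directory rather than on the host; the function names and the relative-path manifest layout are illustrative choices.

```python
import hashlib
import os


def sha256_file(path, chunk_size=1024 * 1024):
    """Hash one file in chunks so large logs do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file under root (by relative path) to its SHA-256 hex digest."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            full = os.path.join(dirpath, name)
            manifest[os.path.relpath(full, root)] = sha256_file(full)
    return manifest
```

Write the manifest out next to the recorded clock offset and the collection time, so all three travel together into the case file.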
Set logHost on every ESXi host you own before you need it. It is a five-minute change that decides whether the next hypervisor incident has a forensic record or a shrug.
Sources
- ESXi Ransomware Attacks: Stealthy Persistence through SSH Tunneling (Sygnia)
- ESXi Ransomware Attacks: Evolution, Impact, and Defense Strategy (Sygnia)
- Detecting Suspicious ESXi Activity Before Ransomware Happens (Splunk)
- VMware ESXi forensic with Velociraptor (Synacktiv)
- Determining whether an ESXi host has persistent logging (Broadcom)
- Configuring syslog on ESXi (Broadcom)