Setting Up ESXi Core Dump on a VMFS Datastore

In the world of enterprise virtualization, stability is king. However, even the most robust systems encounter critical failures. When a VMware ESXi host faces an unrecoverable error, it triggers a Purple Screen of Death (PSOD). To the untrained eye, it’s a wall of cryptic text; to a System Administrator, it is the beginning of a forensic investigation.

The heart of this investigation is the Coredump.

1. What is an ESXi Coredump?

An ESXi coredump is a diagnostic file generated by the VMkernel when a host halts due to a critical exception. When the kernel determines it can no longer operate safely without risking data corruption, it freezes all virtual machine operations and "dumps" the contents of VMkernel memory into a persistent storage location.




The Anatomy of a Crash

A coredump isn't just a log file; it is a point-in-time bitstream of the host's memory. It contains:

  • The Stack Trace: The sequence of function calls that led to the crash.

  • CPU Register States: What the processors were doing at the millisecond of failure.

  • Loaded Modules: A list of every driver and VIB active at the time.


2. Why is the Coredump Important?

Without a coredump, troubleshooting a host crash is guesswork. Its importance can be categorized into three main pillars:

I. Root Cause Analysis (RCA)

Modern data centers demand "Five Nines" (99.999%) availability. If a host crashes, "turning it off and on again" is not an acceptable solution. You must prove why it happened. Was it a buggy driver provided by a hardware vendor? Was it a known VMware software bug? The coredump provides the evidence needed to open a Support Request (SR) with VMware/Broadcom and get a definitive answer.

II. Hardware vs. Software Validation

Often, PSODs are caused by physical hardware failure—specifically memory parity errors or CPU cache issues. The coredump will explicitly state if the machine check exception was hardware-initiated. This saves teams hours of troubleshooting by allowing them to immediately initiate an RMA for a motherboard or DIMM rather than digging through software logs.

III. Proactive Patching

By analyzing dumps, you might discover that a specific version of an NVMe driver is causing conflicts. This allows you to proactively update the rest of your cluster before other hosts encounter the same failure, preventing a cascading cluster failure.
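Tracking down which hosts still run an implicated driver version is itself a quick CLI check. A hedged sketch follows; "nvme" is an illustrative filter and the module name is an assumption for demonstration, not a fixed value:

```shell
# List installed VIBs whose names mention nvme (filter is illustrative)
esxcli software vib list | grep -i nvme

# Inspect one suspect module in detail (module name assumed for illustration)
esxcli system module get --module=nvme_pcie
```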


3. How to Configure Coredumps on a Local VMFS Datastore

While ESXi traditionally used a dedicated small partition for dumps, modern servers with massive RAM require larger files. Configuring a coredump to a file on a VMFS datastore is the most flexible method.

Step 1: Enable SSH and Connect

You must use the CLI for this configuration. Enable SSH on your host through the vSphere Client and connect with an SSH client, such as PuTTY or the ssh command built into most modern operating systems.
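If the vSphere Client is unavailable, SSH can also be enabled from the ESXi Shell or DCUI. A minimal sketch; the host name in the connection example is a placeholder:

```shell
# Enable and start the SSH service from the ESXi Shell (vim-cmd ships with ESXi)
vim-cmd hostsvc/enable_ssh
vim-cmd hostsvc/start_ssh

# Then connect from your workstation (placeholder host name):
# ssh root@esxi-host01.example.com
```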

Step 2: Create the Core Dump File

Note: Official guidance states VMFS on software iSCSI does not support coredump files. While this lab uses software iSCSI for demonstration, production environments should use supported Hardware iSCSI or Fibre Channel datastores.

You need to allocate space on your datastore. If you do not specify a size, the system creates a file sized appropriately for the memory installed in the host. It is recommended to name the file after the host.

Bash
# Replace "DatastoreName" with your actual local VMFS datastore name
esxcli system coredump file add --datastore="DatastoreName" --file="esxi-host01-dump"



Step 3: Identify the File Path

Once created, you need the absolute VMFS path (which includes the Datastore UUID) to activate it.

Bash
esxcli system coredump file list

Note: Copy the path that looks like /vmfs/volumes/5ef1.../vmkdump/esxi-host01-dump.
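Because the UUID path is long and easy to mistype, one hedged approach is to capture it into a shell variable directly from the list output (assuming "vmkdump" appears in the path, as in the example above):

```shell
# Grab the first column (the path) of the line containing "vmkdump"
DUMP_PATH=$(esxcli system coredump file list | awk '/vmkdump/ {print $1; exit}')
echo "Dump file path: ${DUMP_PATH}"
```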




Step 4: Set the File as Active and Configured

This tells the VMkernel, "If you crash, write the data exactly here."

Bash
# Use the path copied from the previous step
esxcli system coredump file set --path="/vmfs/volumes/UUID/vmkdump/filename" 

esxcli system coredump file set --enable=true



Step 5: Final Verification

Confirm the status to ensure both Active and Configured columns show as true.

Bash
esxcli system coredump file list
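Steps 2 through 5 can be collected into a single hedged script. The datastore and file names below are the placeholders used in this article, not fixed values:

```shell
#!/bin/sh
# Sketch only -- run from the ESXi shell; adjust names for your environment.
DS="DatastoreName"
FILE="esxi-host01-dump"

# Step 2: create the dump file (size defaults to match installed memory)
esxcli system coredump file add --datastore="${DS}" --file="${FILE}"

# Step 3: recover the absolute VMFS path (first column of the matching line)
DUMP_PATH=$(esxcli system coredump file list | awk -v f="${FILE}" '$0 ~ f {print $1; exit}')

# Step 4: activate the file
esxcli system coredump file set --path="${DUMP_PATH}"
esxcli system coredump file set --enable=true

# Step 5: verify that Active and Configured both read true
esxcli system coredump file list
```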

4. Best Practices and Recommendations

To ensure your diagnostic strategy is enterprise-grade, follow these recommendations:

One File Per Host

If you are using shared storage (SAN/NAS) to store dumps for a cluster, do not point multiple hosts to the same file. Each host requires its own unique file. If Host A and Host B share a file, a simultaneous crash would result in corrupted diagnostic data.
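One hedged way to enforce this is to derive the file name from the host itself, so two hosts on the same shared datastore can never collide (the datastore name below is a placeholder):

```shell
# Build a unique, host-derived dump file name (hostname -s assumed available
# in the ESXi shell, as it is in busybox)
DUMP_FILE="$(hostname -s)-dump"
echo "Per-host dump file name: ${DUMP_FILE}"

# Then create the file on the shared datastore (placeholder name):
# esxcli system coredump file add --datastore="SharedDumpDS" --file="${DUMP_FILE}"
```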

Storage Compatibility

  • Local Storage is Best: Local DAS (Direct Attached Storage) is the most reliable.

  • No Software iSCSI: You cannot save coredumps to a Software iSCSI or FCoE datastore. During a crash, the network stack is often the first thing to fail; the host cannot write a dump to a location that requires a working network driver.

Sizing the Dump File

Modern ESXi versions (7.x, 8.x, and 9.x) use a "partial dump" strategy by default to save space, but for a full analysis, the file should be large enough to accommodate the host's memory metadata. A safe rule of thumb is to allocate at least 2.5 GB to 4 GB for the dump file, depending on the host's physical RAM.
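If the automatic sizing does not fit your policy, the file can be created with an explicit size; the `--size` parameter of `esxcli system coredump file add` takes megabytes. A sketch using the upper rule-of-thumb above (datastore and file names are placeholders):

```shell
# Create a 4 GB (4096 MB) dump file explicitly instead of the automatic size
esxcli system coredump file add --datastore="DatastoreName" \
    --file="esxi-host01-dump" --size=4096
```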

Use the Network Dump Collector

For large-scale environments (10+ hosts), managing individual files on datastores becomes tedious. Consider setting up the vSphere ESXi Dump Collector. This service (included with vCenter) allows hosts to send their coredumps over the network to a central repository, even if they have no local disks.
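Pointing a host at the Dump Collector is also an esxcli operation. A hedged sketch; vmk0, the server address, and the port are assumptions for your environment (6500 is the collector's default port):

```shell
# Point coredumps at the network Dump Collector instead of a local file
esxcli system coredump network set --interface-name=vmk0 \
    --server-ipv4=192.0.2.10 --server-port=6500
esxcli system coredump network set --enable=true

# Confirms that the configured collector is reachable
esxcli system coredump network check
```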

For more detail on this topic please check the official documentation here - https://techdocs.broadcom.com/us/en/vmware-cis/vsphere/vsphere/8-0/vsphere-storage/working-with-datastores-in-vsphere-storage-environment/managing-vsphere-vmfs-datastores/setting-up-esxi-core-dump-on-a-vmfs-datastore.html


