File System Alignment in Virtual Environments


In speaking to my fellow Implementation Engineers and team leads, I’ve come to learn that file system misalignment is a known issue in virtual environments and can cause performance problems for virtual machines. A little research provided an overview of the storage layers in a virtualized environment, details on the proper alignment of guest file systems, and a description of the performance impact misalignment can have on the virtual infrastructure. NetApp has produced a white paper that addresses file system alignment in virtual environments, TR 3747, which I’ve reproduced below.

In any server virtualization environment using shared storage, several layers of storage sit between a VM and its data. Shared storage can be presented to the hypervisor in different ways, and each method involves a different set of storage layers.

VMware vSphere 4 has four ways of using shared storage for deploying virtual machines:

• VMFS (Virtual Machine File System) on a Fibre Channel (FC) or iSCSI logical unit number (LUN) attached to the ESX or  ESXi host
• NFS (Network File System) export mounted on an ESX or ESXi host
• RDM (Raw Device Mapping) is the primary method of presenting a VM direct access and ownership of a LUN; the guest formats the RDM LUN as it would for any disk
• LUNs directly mapped by the guest OS (operating system) by using an iSCSI software initiator where the guest OS supports it

For both the VMFS and NFS options, the files that make up a VM are stored in a directory on the LUN or NFS export and each VM has a separate directory. Each virtual disk of the VM is made up of two files:
• <vmname>-flat.vmdk. The file containing the actual disk image of the guest VM
• <vmname>.vmdk. A text descriptor file that contains information about the size of the virtual disk as well as cylinder, head, and sector information for the virtual BIOS to report to the guest OS
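For illustration, a minimal text descriptor file might look like the following. All values here are hypothetical (a roughly 8GB vmfs-type disk); note the sectors-per-track value of 63 in the virtual geometry, which is central to the alignment discussion that follows:

```
# Disk DescriptorFile
version=1
CID=fffffffe
parentCID=ffffffff
createType="vmfs"

# Extent description
RW 16777216 VMFS "myvm-flat.vmdk"

# The Disk Data Base
ddb.geometry.cylinders = "1044"
ddb.geometry.heads = "255"
ddb.geometry.sectors = "63"
ddb.adapterType = "lsilogic"
```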

Both VMFS and NFS mounted by ESX or ESXi are referred to as datastores. For the NFS option, there is no VMFS layer and the VM directories are directly stored on the NFS mount presented as a datastore.

The table below shows which layers must have proper file system alignment to maximize disk performance.

As we can see, there are multiple layers of storage involved. Each layer is organized into blocks, or chunks, to make accessing the storage more efficient. The size and the starting offset of each block can be different at each layer. Although a different block size across the storage layers doesn’t require any special attention, the starting offset does. For optimal performance, the starting offset of a file system should align with the start of a block in the next lower layer of storage. For example, an NTFS file system that resides on a LUN should have an offset that is divisible by the block size of the storage array presenting the LUN. Misalignment of block boundaries at any one of these storage layers can result in performance degradation. This issue is not unique to any one storage array and can occur for storage arrays from any vendor. VMware has also identified that this can be an issue in virtual environments. For more details, you can refer to the following VMware documentation:

http://www.vmware.com/pdf/esx3_partition_align.pdf
http://www.vmware.com/pdf/Perf_Best_Practices_vSphere4.0.pdf
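The divisibility rule above is easy to check numerically. A minimal sketch, assuming 512-byte sectors and a 4KB array block size (NetApp’s; other vendors differ), for a few common partition start sectors:

```shell
# A partition's starting offset in bytes must be divisible by the
# storage array's logical block size for the layers to line up.
block_size=4096                       # assumed 4 KB logical block size
for start_sector in 63 128 2048; do   # common partition start sectors
  offset=$(( start_sector * 512 ))    # sectors are 512 bytes
  if [ $(( offset % block_size )) -eq 0 ]; then
    echo "sector $start_sector (offset $offset): aligned"
  else
    echo "sector $start_sector (offset $offset): misaligned"
  fi
done
```

Sector 63 lands at byte 32,256, which is not a multiple of 4,096; sectors 128 and 2,048 both are.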

Disks use geometry to identify themselves and their characteristics to the upper-layer operating system. The upper-layer operating system uses the disk geometry information to calculate the size of the disk and partitions the disk into predetermined addressable blocks. Just as with physical disks, LUNs report disk geometry to the host (physical host, virtualization host, or the VM, depending on the mode of usage) so that it can calculate space and partition the LUN into addressable blocks.

The Cylinder-head-sector article on Wikipedia (http://en.wikipedia.org/wiki/Cylinder-head-sector) provides background information on cylinder-head concepts.

Historically, hard drives (and LUNs) presented the OS with a physical geometry that could be used to partition and format the disk efficiently. Disk geometry today is virtual, fabricated by the disk firmware. Operating system partitioning programs such as fdisk use this emulated geometry to determine where to begin a partition. Unfortunately, some partitioning programs invoked during OS setup create disk partitions that do not align with the underlying block boundaries of the disk. Notable tools such as GNU fdisk, found on many Linux® distributions, and Microsoft DiskPart, found on Windows 2000 and Windows 2003, also create misaligned partitions by default.

By default, many guest OSs, including most versions of Windows, attempt to align the first sector on a full track boundary. The installer/setup routine requests the cylinder/head/sector (CHS) information that describes the disk from the BIOS (PC firmware that manages disk I/O at a low level) or, in the case of many VMs, an emulated BIOS. The issue is that the CHS data hasn’t actually corresponded to anything physical, even in physical machines, since the late 1980s. At larger LUN sizes, usually 8GB or more, the sectors-per-track value (the S in CHS) is always reported as 63, so the partitioning routine sets a starting offset of 63 sectors in an attempt to start the partition on a track boundary. While this may be correct for a single physical disk drive, it does not line up with any storage vendor’s logical block size. A block is the smallest unit of data that can be used to store an object on a storage device. To ensure optimal storage capacity and performance, data should align with block boundaries. Physical disk blocks (sectors) always hold 512 usable/visible bytes, but for efficiency and scalability reasons, storage devices use a logical block size that is some number of physical blocks, usually a power of 2. For example, NetApp Unified Storage Architecture arrays have a logical block size of 4K, that is, 8 disk blocks, while EMC Symmetrix storage arrays have a logical block size of 64KB.

Write operations can consume no less than a single 4KB block and can consume many 4KB blocks, depending on the size of the write operation. Ideally, the guest/child OS should align its file system(s) such that writes are aligned to the storage device’s logical blocks. The problem of unaligned LUN I/O occurs when the partitioning scheme used by the host OS doesn’t match the block boundaries inside the LUN, as shown in the picture below. If the guest file system is not aligned, it might become necessary to read or write twice as many blocks of storage as the guest actually requested, because any guest file system block then occupies at least two partial storage blocks. As a simple example, assuming only one layer of file system and a guest allocation unit equal to the storage logical block size (4K, or 4,096 bytes), each guest block (technically an allocation unit) would occupy 512 bytes of one block and 3,584 bytes (4,096 – 512) of the next. This results in inefficient I/O, because the storage controller must perform additional work, such as reading extra data, to satisfy a read or write I/O from the host.
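The 512/3,584-byte split described above can be reproduced with a little arithmetic. A quick sketch, assuming a partition start at sector 63 and 4KB blocks at both layers:

```shell
offset=$(( 63 * 512 ))                    # partition start: 32256 bytes
in_first=$(( 4096 - offset % 4096 ))      # bytes of a guest block in the first storage block
in_next=$(( 4096 - in_first ))            # bytes spilling into the next storage block
echo "$in_first bytes + $in_next bytes"   # prints "512 bytes + 3584 bytes"
```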

By default, many guest operating systems, including Windows 2000, Windows 2003, and various Linux distributions, start the first primary partition at sector (logical block) 63. The reasons for this are historically tied to disk geometry. This behavior leads to misaligned file systems because the partition does not begin at a sector that is a multiple of 8. Note that Windows Server 2008 and Windows Vista® default at 1,048,576, which is divisible by 4,096, and do not require any adjustments. Also, RHEL 6 does not require any adjustments because it aligns its partitions properly by default.

The misalignment issue is more complex when the file system on the virtualization host contains the files (for example, vmdk or vhd) that represent the VM virtual disks. In this case, the partition scheme used by the guest OS for the virtual disks must match the partition scheme used by the LUNs on the hypervisor host and the storage array blocks.

VMs hosted on VMFS involve two layers of alignment, both of which should align with the storage blocks:
• VMFS
• The file system on the guest vmdk files inside the VMFS

The default starting offset of VMFS2 is 63 blocks, which results in misaligned I/O. The default offset of VMFS3, when created with the VMware vSphere Client, is 128 blocks, which does not result in misaligned I/O. Datastores migrated from VMFS2 to VMFS3 as part of an ESX/VI3 upgrade are not realigned; VM files need to be copied from the old datastore to a newly created datastore. In addition to properly aligning VMFS, each VM guest file system needs to be properly aligned as well.

RDM LUNs and LUNs directly mapped by the guest VM do not require special attention if the LUN type matches the guest operating system type.

While NFS datastores do not require alignment themselves, each VM guest file system needs to be properly aligned. For all of these storage options, misalignment at any layer can cause performance issues as the system scales.

Misalignment can cause an increase in per-operation latency. It requires the storage array to read from or write to more blocks than necessary to perform logical I/O. Below is an example that shows a LUN with and without file system alignment. In the first instance, the LUN with aligned file systems uses four 4KB blocks on the LUN to store four 4KB blocks of data generated by a host. In the second scenario, where there is misalignment, the storage controller must use five blocks to store the same 16KB of data. This is an inefficient use of space, and performance suffers when the storage array must process five blocks to read or write what should be only four blocks of data. This results in inefficient I/O, because the storage array is doing more work than is actually requested.
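The four-versus-five-block behavior can be sketched with simple arithmetic, assuming 4KB array blocks and a 16KB sequential write that starts either on a block boundary or at the classic 63-sector (byte 32,256) offset:

```shell
# Count the array blocks touched by an I/O, given its starting byte and length.
blocks_touched() {
  first=$(( $1 / 4096 ))                # first 4 KB block the I/O lands in
  last=$(( ($1 + $2 - 1) / 4096 ))      # last 4 KB block the I/O lands in
  echo $(( last - first + 1 ))
}
io=$(( 4 * 4096 ))                                                # a 16 KB write
echo "aligned:    $(blocks_touched 0 $io) blocks"                 # prints 4 blocks
echo "misaligned: $(blocks_touched $(( 63 * 512 )) $io) blocks"   # prints 5 blocks
```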

VMFS-based datastores (FC or iSCSI) should be set to the LUN type VMware and created using the Virtual Infrastructure Client or VMware vSphere Client. This results in the VMFS file system being aligned with the LUN on the storage controller. If you use vmkfstools instead, make sure that you first partition the LUN using fdisk; this allows you to set the correct offset.
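For the vmkfstools path, VMware’s esx3_partition_align guide (linked above) describes an fdisk session that creates a primary partition of type fb (VMFS volume) and uses expert mode to move its start to block 128, matching the vSphere Client default. The sketch below only prints the keystroke sequence rather than piping it in, since running it against the wrong device is destructive; /dev/sdX is a placeholder for the actual LUN:

```shell
# n, p, 1, <Enter>, <Enter>  -> new primary partition spanning the disk
# t, fb                      -> set the partition type to fb (VMFS volume)
# x, b, 1, 128               -> expert mode: move partition 1's start to block 128
# w                          -> write the partition table and exit
printf '%s\n' n p 1 '' '' t fb x b 1 128 w   # pipe into: fdisk /dev/sdX
```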

There is no VMFS layer involved with NFS, so only the alignment of the guest VM file system within the VMDK to the storage array is required.

There are several options to prevent misalignment when provisioning VMs.

USING DISKPART TO FORMAT WITH THE CORRECT STARTING PARTITION OFFSET (WINDOWS GUEST VMS)

This works for Windows VMs hosted on VMware ESX or ESXi (vmdk files hosted on VMFS or NFS datastores).

Aligning Boot Disk

Virtual disks to be used as the boot disk can be formatted with the correct offset at creation time by connecting the new virtual disk to a running VM before installing an operating system and manually setting the partition offset. For Windows guest operating systems, consider using an existing Windows Preinstallation Environment (WinPE) boot CD or an alternative tool such as Bart’s PE CD. To set the starting offset, follow these steps.

1. Boot the VM with the WinPE CD.
2. Select Start > Run and enter DiskPart.
3. Enter select disk 0.
4. Enter create partition primary align=32.
5. Reboot the VM with the WinPE CD.
6. Install the operating system as normal.
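The DiskPart commands in steps 3 and 4 can also be scripted: put them in a text file and run diskpart /s against it. The here-doc below just shows the script’s contents; the file name align.txt is arbitrary:

```shell
# DiskPart script equivalent to steps 3 and 4 above.
# Run on the Windows/WinPE side with: diskpart /s align.txt
cat > align.txt <<'EOF'
select disk 0
create partition primary align=32
EOF
cat align.txt
```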

Aligning Data Disk

To format virtual disks to be used as data disks with the correct offset at creation time, use DiskPart in the VM. Attach the data disk to the VM and check that there is no data on the disk.

1. Select Start > Run.
2. Enter diskpart.
3. Enter list disk to determine the disk number for the new data disk.
4. Enter select disk <disk_number> (for example, select disk 1).
5. Enter create partition primary align=32.
6. Enter exit to exit the DiskPart utility.
7. Format the data disk as you normally would.
