VMbus

VMbus is a software construct provided by Hyper-V to guest VMs. It consists of a control path and common facilities used by synthetic devices that Hyper-V presents to guest VMs. The control path is used to offer synthetic devices to the guest VM and, in some cases, to rescind those devices. The common facilities include software channels for communicating between the device driver in the guest VM and the synthetic device implementation that is part of Hyper-V, and signaling primitives to allow Hyper-V and the guest to interrupt each other.

VMbus is modeled in Linux as a bus, with the expected /sys/bus/vmbus entry in a running Linux guest. The VMbus driver (drivers/hv/vmbus_drv.c) establishes the VMbus control path with the Hyper-V host, then registers itself as a Linux bus driver. It implements the standard bus functions for adding and removing devices to/from the bus.

Most synthetic devices offered by Hyper-V have a corresponding Linux device driver. These devices include:

  • SCSI controller

  • NIC

  • Graphics frame buffer

  • Keyboard

  • Mouse

  • PCI device pass-thru

  • Heartbeat

  • Time Sync

  • Shutdown

  • Memory balloon

  • Key/Value Pair (KVP) exchange with Hyper-V

  • Hyper-V online backup (a.k.a. VSS)

Guest VMs may have multiple instances of the synthetic SCSI controller, synthetic NIC, and PCI pass-thru devices. Other synthetic devices are limited to a single instance per VM. Not listed above are a small number of synthetic devices offered by Hyper-V that are used only by Windows guests and for which Linux does not have a driver.

Hyper-V uses the terms "VSP" and "VSC" in describing synthetic devices. "VSP" refers to the Hyper-V code that implements a particular synthetic device, while "VSC" refers to the driver for the device in the guest VM. For example, the Linux driver for the synthetic NIC is referred to as "netvsc" and the Linux driver for the synthetic SCSI controller is "storvsc". These drivers contain functions with names like "storvsc_connect_to_vsp".

VMbus channels

An instance of a synthetic device uses VMbus channels to communicate between the VSP and the VSC. Channels are bi-directional and used for passing messages. Most synthetic devices use a single channel, but the synthetic SCSI controller and synthetic NIC may use multiple channels to achieve higher performance and greater parallelism.

Each channel consists of two ring buffers. These are classic ring buffers from a university data structures textbook. If the read and writes pointers are equal, the ring buffer is considered to be empty, so a full ring buffer always has at least one byte unused. The "in" ring buffer is for messages from the Hyper-V host to the guest, and the "out" ring buffer is for messages from the guest to the Hyper-V host. In Linux, the "in" and "out" designations are as viewed by the guest side. The ring buffers are memory that is shared between the guest and the host, and they follow the standard paradigm where the memory is allocated by the guest, with the list of GPAs that make up the ring buffer communicated to the host. Each ring buffer consists of a header page (4 Kbytes) with the read and write indices and some control flags, followed by the memory for the actual ring. The size of the ring is determined by the VSC in the guest and is specific to each synthetic device. The list of GPAs making up the ring is communicated to the Hyper-V host over the VMbus control path as a GPA Descriptor List (GPADL). See function vmbus_establish_gpadl().

Each ring buffer is mapped into contiguous Linux kernel virtual space in three parts: 1) the 4 Kbyte header page, 2) the memory that makes up the ring itself, and 3) a second mapping of the memory that makes up the ring itself. Because (2) and (3) are contiguous in kernel virtual space, the code that copies data to and from the ring buffer need not be concerned with ring buffer wrap-around. Once a copy operation has completed, the read or write index may need to be reset to point back into the first mapping, but the actual data copy does not need to be broken into two parts. This approach also allows complex data structures to be easily accessed directly in the ring without handling wrap-around.

On arm64 with page sizes > 4 Kbytes, the header page must still be passed to Hyper-V as a 4 Kbyte area. But the memory for the actual ring must be aligned to PAGE_SIZE and have a size that is a multiple of PAGE_SIZE so that the duplicate mapping trick can be done. Hence a portion of the header page is unused and not communicated to Hyper-V. This case is handled by vmbus_establish_gpadl().

Hyper-V enforces a limit on the aggregate amount of guest memory that can be shared with the host via GPADLs. This limit ensures that a rogue guest can't force the consumption of excessive host resources. For Windows Server 2019 and later, this limit is approximately 1280 Mbytes. For versions prior to Windows Server 2019, the limit is approximately 384 Mbytes.

VMbus messages

All VMbus messages have a standard header that includes the message length, the offset of the message payload, some flags, and a transactionID. The portion of the message after the header is unique to each VSP/VSC pair.

Messages follow one of two patterns:

  • Unidirectional: Either side sends a message and does not expect a response message

  • Request/response: One side (usually the guest) sends a message and expects a response

The transactionID (a.k.a. "requestID") is for matching requests & responses. Some synthetic devices allow multiple requests to be in- flight simultaneously, so the guest specifies a transactionID when sending a request. Hyper-V sends back the same transactionID in the matching response.

Messages passed between the VSP and VSC are control messages. For example, a message sent from the storvsc driver might be "execute this SCSI command". If a message also implies some data transfer between the guest and the Hyper-V host, the actual data to be transferred may be embedded with the control message, or it may be specified as a separate data buffer that the Hyper-V host will access as a DMA operation. The former case is used when the size of the data is small and the cost of copying the data to and from the ring buffer is minimal. For example, time sync messages from the Hyper-V host to the guest contain the actual time value. When the data is larger, a separate data buffer is used. In this case, the control message contains a list of GPAs that describe the data buffer. For example, the storvsc driver uses this approach to specify the data buffers to/from which disk I/O is done.

Three functions exist to send VMbus messages:

  1. vmbus_sendpacket(): Control-only messages and messages with embedded data -- no GPAs

  2. vmbus_sendpacket_pagebuffer(): Message with list of GPAs identifying data to transfer. An offset and length is associated with each GPA so that multiple discontinuous areas of guest memory can be targeted.

  3. vmbus_sendpacket_mpb_desc(): Message with list of GPAs identifying data to transfer. A single offset and length is associated with a list of GPAs. The GPAs must describe a single logical area of guest memory to be targeted.

Historically, Linux guests have trusted Hyper-V to send well-formed and valid messages, and Linux drivers for synthetic devices did not fully validate messages. With the introduction of processor technologies that fully encrypt guest memory and that allow the guest to not trust the hypervisor (AMD SNP-SEV, Intel TDX), trusting the Hyper-V host is no longer a valid assumption. The drivers for VMbus synthetic devices are being updated to fully validate any values read from memory that is shared with Hyper-V, which includes messages from VMbus devices. To facilitate such validation, messages read by the guest from the "in" ring buffer are copied to a temporary buffer that is not shared with Hyper-V. Validation is performed in this temporary buffer without the risk of Hyper-V maliciously modifying the message after it is validated but before it is used.

VMbus interrupts

VMbus provides a mechanism for the guest to interrupt the host when the guest has queued new messages in a ring buffer. The host expects that the guest will send an interrupt only when an "out" ring buffer transitions from empty to non-empty. If the guest sends interrupts at other times, the host deems such interrupts to be unnecessary. If a guest sends an excessive number of unnecessary interrupts, the host may throttle that guest by suspending its execution for a few seconds to prevent a denial-of-service attack.

Similarly, the host will interrupt the guest when it sends a new message on the VMbus control path, or when a VMbus channel "in" ring buffer transitions from empty to non-empty. Each CPU in the guest may receive VMbus interrupts, so they are best modeled as per-CPU interrupts in Linux. This model works well on arm64 where a single per-CPU IRQ is allocated for VMbus. Since x86/x64 lacks support for per-CPU IRQs, an x86 interrupt vector is statically allocated (see HYPERVISOR_CALLBACK_VECTOR) across all CPUs and explicitly coded to call the VMbus interrupt service routine. These interrupts are visible in /proc/interrupts on the "HYP" line.

The guest CPU that a VMbus channel will interrupt is selected by the guest when the channel is created, and the host is informed of that selection. VMbus devices are broadly grouped into two categories:

  1. "Slow" devices that need only one VMbus channel. The devices (such as keyboard, mouse, heartbeat, and timesync) generate relatively few interrupts. Their VMbus channels are all assigned to interrupt the VMBUS_CONNECT_CPU, which is always CPU 0.

  2. "High speed" devices that may use multiple VMbus channels for higher parallelism and performance. These devices include the synthetic SCSI controller and synthetic NIC. Their VMbus channels interrupts are assigned to CPUs that are spread out among the available CPUs in the VM so that interrupts on multiple channels can be processed in parallel.

The assignment of VMbus channel interrupts to CPUs is done in the function init_vp_index(). This assignment is done outside of the normal Linux interrupt affinity mechanism, so the interrupts are neither "unmanaged" nor "managed" interrupts.

The CPU that a VMbus channel will interrupt can be seen in /sys/bus/vmbus/devices/<deviceGUID>/ channels/<channelRelID>/cpu. When running on later versions of Hyper-V, the CPU can be changed by writing a new value to this sysfs entry. Because the interrupt assignment is done outside of the normal Linux affinity mechanism, there are no entries in /proc/irq corresponding to individual VMbus channel interrupts.

An online CPU in a Linux guest may not be taken offline if it has VMbus channel interrupts assigned to it. Any such channel interrupts must first be manually reassigned to another CPU as described above. When no channel interrupts are assigned to the CPU, it can be taken offline.

When a guest CPU receives a VMbus interrupt from the host, the function vmbus_isr() handles the interrupt. It first checks for channel interrupts by calling vmbus_chan_sched(), which looks at a bitmap setup by the host to determine which channels have pending interrupts on this CPU. If multiple channels have pending interrupts for this CPU, they are processed sequentially. When all channel interrupts have been processed, vmbus_isr() checks for and processes any message received on the VMbus control path.

The VMbus channel interrupt handling code is designed to work correctly even if an interrupt is received on a CPU other than the CPU assigned to the channel. Specifically, the code does not use CPU-based exclusion for correctness. In normal operation, Hyper-V will interrupt the assigned CPU. But when the CPU assigned to a channel is being changed via sysfs, the guest doesn't know exactly when Hyper-V will make the transition. The code must work correctly even if there is a time lag before Hyper-V starts interrupting the new CPU. See comments in target_cpu_store().

VMbus device creation/deletion

Hyper-V and the Linux guest have a separate message-passing path that is used for synthetic device creation and deletion. This path does not use a VMbus channel. See vmbus_post_msg() and vmbus_on_msg_dpc().

The first step is for the guest to connect to the generic Hyper-V VMbus mechanism. As part of establishing this connection, the guest and Hyper-V agree on a VMbus protocol version they will use. This negotiation allows newer Linux kernels to run on older Hyper-V versions, and vice versa.

The guest then tells Hyper-V to "send offers". Hyper-V sends an offer message to the guest for each synthetic device that the VM is configured to have. Each VMbus device type has a fixed GUID known as the "class ID", and each VMbus device instance is also identified by a GUID. The offer message from Hyper-V contains both GUIDs to uniquely (within the VM) identify the device. There is one offer message for each device instance, so a VM with two synthetic NICs will get two offers messages with the NIC class ID. The ordering of offer messages can vary from boot-to-boot and must not be assumed to be consistent in Linux code. Offer messages may also arrive long after Linux has initially booted because Hyper-V supports adding devices, such as synthetic NICs, to running VMs. A new offer message is processed by vmbus_process_offer(), which indirectly invokes vmbus_add_channel_work().

Upon receipt of an offer message, the guest identifies the device type based on the class ID, and invokes the correct driver to set up the device. Driver/device matching is performed using the standard Linux mechanism.

The device driver probe function opens the primary VMbus channel to the corresponding VSP. It allocates guest memory for the channel ring buffers and shares the ring buffer with the Hyper-V host by giving the host a list of GPAs for the ring buffer memory. See vmbus_establish_gpadl().

Once the ring buffer is set up, the device driver and VSP exchange setup messages via the primary channel. These messages may include negotiating the device protocol version to be used between the Linux VSC and the VSP on the Hyper-V host. The setup messages may also include creating additional VMbus channels, which are somewhat mis-named as "sub-channels" since they are functionally equivalent to the primary channel once they are created.

Finally, the device driver may create entries in /dev as with any device driver.

The Hyper-V host can send a "rescind" message to the guest to remove a device that was previously offered. Linux drivers must handle such a rescind message at any time. Rescinding a device invokes the device driver "remove" function to cleanly shut down the device and remove it. Once a synthetic device is rescinded, neither Hyper-V nor Linux retains any state about its previous existence. Such a device might be re-added later, in which case it is treated as an entirely new device. See vmbus_onoffer_rescind().