Add modifications to kernel module.

Additional modifications had to be included into the build of nvidia.ko in order for things to work properly.
fangli · Feb 22, 2021 · 6881c41
1 parent 5cfacdf
commit 6881c41

Showing 3 changed files with 1,273 additions and 15 deletions.
diff --git a/README.md b/README.md
@@ -5,42 +5,179 @@ Unlock vGPU functionality for consumer grade GPUs.
 
 ## Important!
 
-This tool is a work in progress. In the current state it does not work.
+This tool is largely untested; use at your own risk.
 
 
 ## Description
 
 This tool enables the use of Geforce and Quadro GPUs with the NVIDIA vGPU
 software. NVIDIA vGPU normally only supports a few Tesla GPUs but since some
 Geforce and Quadro GPUs share the same physical chip as the Tesla this is only
-a software limitation for those GPUs. This tool works by intercepting the ioctl
-syscalls between the userspace nvidia-vgpud and nvidia-vgpu-mgr services and
-the kernel driver. Doing this allows the script to alter the identification and
-capabilities that the user space services relies on to determine if the GPU is
-vGPU capable.
+a software limitation for those GPUs. This tool aims to remove this limitation.
 
 
 ## Dependencies:
 
 * This tool requires Python3, the latest version is recommended.
 * The python package "frida" is required. `pip3 install frida`.
-* The tool requires the NVIDIA GRID vGPU driver to be properly installed for it
-  to do its job. This special driver is only accessible to NVIDIA enterprise
-  customers. The script has only been tested with 11.3 for "KVM on Linux" and
-  may or may not work on other versions.
+* The tool requires the NVIDIA GRID vGPU driver.
+* "dkms" is highly recommended as it greatly simplifies the process of
+  rebuilding the driver.
 
 
 ## Installation:
 
-The NVIDIA vGPU drivers will create an nvidia-vgpud and nvidia-vgpu-mgr 
-systemd service. All we have to do is replace the path
-/usr/bin/<executable> in /lib/systemd/system/nvidia-vgpud.service and
-/lib/systemd/system/nvidia-vgpu-mgr.service with the path to the vgpu\_unlock
-script and pass the original executable path as the first argument.
+In the following instructions `<path_to_vgpu_unlock>` needs to be replaced with
+the path to this repository on the target system and `<version>` needs to be
+replaced with the version of the NVIDIA GRID vGPU driver.
+
+Install the NVIDIA GRID vGPU driver, making sure to install it as a dkms module.
+```
+./nvidia-installer
+```
+
+Modify the line beginning with `ExecStart=` in `/lib/systemd/system/nvidia-vgpud.service`
+and `/lib/systemd/system/nvidia-vgpu-mgr.service` to use `vgpu_unlock` as the
+executable and pass the original executable as the first argument. Example:
+```
+ExecStart=<path_to_vgpu_unlock>/vgpu_unlock /usr/bin/nvidia-vgpud
+```
+
+Reload the systemd daemons:
+```
+systemctl daemon-reload
+```
+
+Modify the file `/usr/src/nvidia-<version>/nvidia/os-interface.c` and add the
+following line after the lines beginning with `#include` at the start of the
+file.
+```
+#include "<path_to_vgpu_unlock>/vgpu_unlock_hooks.c"
+```
+
+Modify the file `/usr/src/nvidia-<version>/nvidia/nvidia.Kbuild` and add the
+following line.
+```
+ldflags-y += -T <path_to_vgpu_unlock>/kern.ld
+```
+
+Remove the nvidia kernel module using dkms:
+```
+dkms remove -m nvidia -v <version> --all
+```
+
+Rebuild and reinstall the nvidia kernel module using dkms:
+```
+dkms install -m nvidia -v <version>
+```
+
+Reboot.
 
 ---
 **NOTE**
 
 This script will only work if there exists a vGPU compatible Tesla GPU that
 uses the same physical chip as the actual GPU being used.
+
+
 ---
+
+## How it works
+
+### vGPU supported?
+
+In order to determine if a certain GPU supports the vGPU functionality the
+driver looks at the PCI device ID. This identifier together with the PCI vendor
+ID is unique for each type of PCI device. In order to enable vGPU support we
+need to tell the driver that the PCI device ID of the installed GPU is one of
+the device IDs used by a vGPU capable GPU.
+
+### Userspace script: vgpu\_unlock
+
+The userspace services nvidia-vgpud and nvidia-vgpu-mgr use the ioctl syscall
+to communicate with the kernel module. Specifically, they read the PCI device ID
+and determine if the installed GPU is vGPU capable.
+
+The python script vgpu\_unlock intercepts all ioctl syscalls between the
+executable specified as the first argument and the kernel. The script then
+modifies the kernel responses to indicate a PCI device ID with vGPU support
+and a vGPU capable GPU.
+
+### Kernel module hooks: vgpu\_unlock\_hooks.c
+
+In order to exchange data with the GPU the kernel module maps the physical
+address space of the PCI bus into its own virtual address space. This is done
+using the ioremap\* kernel functions. The kernel module then reads and writes
+data into that mapped address space. This is done using the memcpy kernel
+function.
+
+By including the vgpu\_unlock\_hooks.c file into the os-interface.c file we can
+use C preprocessor macros to replace and intercept calls to the ioremap and
+memcpy functions. Doing this allows us to maintain a view of what is mapped
+where and what data is being accessed.
+
+### Kernel module linker script: kern.ld
+
+This is a modified version of the default linker script provided by GNU ld. The
+script is modified to place the .rodata section of nv-kernel.o into the .data
+section instead of .rodata, making it writable. The script also provides the
+symbols `vgpu_unlock_nv_kern_rodata_beg` and `vgpu_unlock_nv_kern_rodata_end`
+to let us know where that section begins and ends.
+
+### How it all comes together
+
+After boot the nvidia-vgpud service queries the kernel for all installed GPUs
+and checks for vGPU capability. This call is intercepted by the vgpu\_unlock
+python script and the GPU is made vGPU capable. If a vGPU capable GPU is found
+then nvidia-vgpud creates an MDEV device and the /sys/class/mdev\_bus directory
+is created by the system.
+
+vGPU devices can now be created by echoing UUIDs into the `create` files in the
+mdev bus representation. This will create additional structures representing
+the new vGPU device on the MDEV bus. These devices can then be assigned to VMs,
+and when the VM starts it will open the MDEV device. This causes nvidia-vgpu-mgr
+to start communicating with the kernel using ioctl. Again these calls are
+intercepted by the vgpu\_unlock python script and when nvidia-vgpu-mgr asks if
+the GPU is vGPU capable the answer is changed to yes. After that check it
+attempts to initialize the vGPU device instance.
+
+Initialization of the vGPU device is handled by the kernel module and it
+performs its own check for vGPU capability; this one is a bit more complicated.
+
+The kernel module maps the physical PCI address range 0xf0000000-0xf1000000 into
+its virtual address space. It then performs some operations whose purpose we
+don't really understand. What we do know is that after these operations it
+accesses a 128 bit value at physical address 0xf0029624, which we call the
+magic value. The kernel module also accesses a 128 bit value at physical
+address 0xf0029634, which we call the key value.
+
+The kernel module then has a couple of lookup tables for the magic value, one
+for vGPU capable GPUs and one for the others. So the kernel module looks for the
+magic value in both of these lookup tables, and if it is found, that table entry
+also contains a set of AES-128 encrypted data blocks and an HMAC-SHA256
+signature.
+
+The signature is then validated by using the key value mentioned earlier to
+calculate the HMAC-SHA256 signature over the encrypted data blocks. If the
+signature is correct, then the blocks are decrypted using AES-128 and the same
+key.
+
+Inside of the decrypted data is once again the PCI device ID.
+
+So in order for the kernel module to accept the GPU as vGPU capable the magic
+value will have to be in the table of vGPU capable magic values, the key has
+to generate a valid HMAC-SHA256 signature and the AES-128 decrypted data blocks
+have to contain a vGPU capable PCI device ID. If any of these checks fail, then
+the error code 0x56 "Call not supported" is returned.
+
+In order to make these checks pass the hooks in vgpu\_unlock\_hooks.c look for
+an ioremap call that maps the physical address range that contains the magic
+and key values, and recalculate the addresses of those values into the virtual
+address space of the kernel module. They then monitor memcpy operations reading
+at those addresses and keep a copy of each value until both are known. Finally
+they locate the lookup tables in the .rodata section of nv-kernel.o, find the
+signature and data blocks, validate the signature, decrypt the blocks, edit the
+PCI device ID in the decrypted data, reencrypt the blocks, regenerate the
+signature and insert the magic, blocks and signature into the table of vGPU
+capable magic values.
+
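Two of the steps described in the README's "How it all comes together" section, recalculating a physical address into the ioremap'ed virtual address space and validating the HMAC-SHA256 signature over the encrypted blocks, can be sketched in Python, the language of the vgpu\_unlock script. This is only an illustration under stated assumptions, not the code in vgpu\_unlock\_hooks.c (which is C): the function names `phys_to_virt` and `signature_is_valid` are hypothetical, the mapping parameters are supplied by the caller, and the AES-128 step is left out because the Python standard library has no AES implementation.

```python
import hashlib
import hmac

# Physical addresses quoted in the README.
MAGIC_PHYS = 0xF0029624  # 128-bit magic value
KEY_PHYS = 0xF0029634    # 128-bit key value


def phys_to_virt(phys, map_phys, map_virt, map_size):
    """Translate a physical address into an ioremap'ed virtual address.

    map_phys/map_virt/map_size describe one observed mapping (hypothetical
    parameters). Returns None when the mapping does not cover the address.
    """
    if map_phys <= phys < map_phys + map_size:
        return map_virt + (phys - map_phys)
    return None


def signature_is_valid(key, blocks, signature):
    """Check an HMAC-SHA256 signature computed over the encrypted blocks."""
    expected = hmac.new(key, blocks, hashlib.sha256).digest()
    return hmac.compare_digest(expected, signature)
```

For the range 0xf0000000-0xf1000000 mentioned in the README, a call such as `phys_to_virt(MAGIC_PHYS, 0xF0000000, base, 0x1000000)` would yield `base + 0x29624`, where `base` is whatever virtual address the mapping call returned.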
diff --git a/kern.ld b/kern.ld
@@ -0,0 +1,162 @@
+/* Script for ld -r: link without relocation */
+/* Copyright (C) 2014-2018 Free Software Foundation, Inc.
+   Copying and distribution of this script, with or without modification,
+   are permitted in any medium without royalty provided the copyright
+   notice and this notice are preserved.  */
+OUTPUT_FORMAT("elf64-x86-64", "elf64-x86-64",
+	      "elf64-x86-64")
+OUTPUT_ARCH(i386:x86-64)
+ /* For some reason, the Solaris linker makes bad executables
+  if gld -r is used and the intermediate file has sections starting
+  at non-zero addresses.  Could be a Solaris ld bug, could be a GNU ld
+  bug.  But for now assigning the zero vmas works.  */
+SECTIONS
+{
+  /* Read-only sections, merged into text segment: */
+  .interp       0 : { *(.interp) }
+  .note.gnu.build-id : { *(.note.gnu.build-id) }
+  .hash         0 : { *(.hash) }
+  .gnu.hash     0 : { *(.gnu.hash) }
+  .dynsym       0 : { *(.dynsym) }
+  .dynstr       0 : { *(.dynstr) }
+  .gnu.version  0 : { *(.gnu.version) }
+  .gnu.version_d 0: { *(.gnu.version_d) }
+  .gnu.version_r 0: { *(.gnu.version_r) }
+  .rela.init    0 : { *(.rela.init) }
+  .rela.text    0 : { *(.rela.text) }
+  .rela.fini    0 : { *(.rela.fini) }
+  .rela.rodata  0 : { *(.rela.rodata) }
+  .rela.data.rel.ro 0 : { *(.rela.data.rel.ro) }
+  .rela.data    0 : { *(.rela.data) }
+  .rela.tdata	0 : { *(.rela.tdata) }
+  .rela.tbss	0 : { *(.rela.tbss) }
+  .rela.ctors   0 : { *(.rela.ctors) }
+  .rela.dtors   0 : { *(.rela.dtors) }
+  .rela.got     0 : { *(.rela.got) }
+  .rela.bss     0 : { *(.rela.bss) }
+  .rela.ldata   0 : { *(.rela.ldata) }
+  .rela.lbss    0 : { *(.rela.lbss) }
+  .rela.lrodata 0 : { *(.rela.lrodata) }
+  .rela.ifunc   0 : { *(.rela.ifunc) }
+  .rela.plt     0 :
+    {
+      *(.rela.plt)
+    }
+  .init         0 :
+  {
+    KEEP (*(SORT_NONE(.init)))
+  }
+  .plt          0 : { *(.plt) *(.iplt) }
+.plt.got      0 : { *(.plt.got) }
+.plt.sec      0 : { *(.plt.sec) }
+  .text         0 :
+  {
+    *(.text .stub)
+    /* .gnu.warning sections are handled specially by elf32.em.  */
+    *(.gnu.warning)
+  }
+  .fini         0 :
+  {
+    KEEP (*(SORT_NONE(.fini)))
+  }
+  .rodata       0 : { *(EXCLUDE_FILE (*nv-kernel.o) .rodata) }
+  .rodata1      0 : { *(.rodata1) }
+  .eh_frame_hdr : { *(.eh_frame_hdr)  }
+  .eh_frame     0 : ONLY_IF_RO { KEEP (*(.eh_frame))  }
+  .gcc_except_table 0 : ONLY_IF_RO { *(.gcc_except_table
+  .gcc_except_table.*) }
+  .gnu_extab 0 : ONLY_IF_RO { *(.gnu_extab*) }
+  /* These sections are generated by the Sun/Oracle C++ compiler.  */
+  .exception_ranges 0 : ONLY_IF_RO { *(.exception_ranges
+  .exception_ranges*) }
+  /* Adjust the address for the data segment.  We want to adjust up to
+     the same address within the page on the next page up.  */
+  /* Exception handling  */
+  .eh_frame     0 : ONLY_IF_RW { KEEP (*(.eh_frame))  }
+  .gnu_extab    0 : ONLY_IF_RW { *(.gnu_extab) }
+  .gcc_except_table 0 : ONLY_IF_RW { *(.gcc_except_table .gcc_except_table.*) }
+  .exception_ranges 0 : ONLY_IF_RW { *(.exception_ranges .exception_ranges*) }
+  /* Thread Local Storage sections  */
+  .tdata	0 :
+   {
+     *(.tdata)
+   }
+  .tbss		0 : { *(.tbss) }
+  .jcr          0 : { KEEP (*(.jcr)) }
+  .dynamic      0 : { *(.dynamic) }
+  .got          0 : { *(.got) *(.igot) }
+  .got.plt      0 : { *(.got.plt)  *(.igot.plt) }
+  .data         0 :
+  {
+    *(.data)
+    vgpu_unlock_nv_kern_rodata_beg = .;
+    *nv-kernel.o(.rodata)
+    vgpu_unlock_nv_kern_rodata_end = .;
+  }
+  .data1        0 : { *(.data1) }
+  .bss          0 :
+  {
+   *(.bss)
+   *(COMMON)
+   /* Align here to ensure that the .bss section occupies space up to
+      _end.  Align after .bss to ensure correct alignment even if the
+      .bss section disappears because there are no input sections.
+      FIXME: Why do we need it? When there is no .bss section, we don't
+      pad the .data section.  */
+  }
+  .lbss 0 :
+  {
+    *(.dynlbss)
+    *(.lbss)
+    *(LARGE_COMMON)
+  }
+  .lrodata 0  :
+  {
+    *(.lrodata)
+  }
+  .ldata 0  :
+  {
+    *(.ldata)
+  }
+  /* Stabs debugging sections.  */
+  .stab          0 : { *(.stab) }
+  .stabstr       0 : { *(.stabstr) }
+  .stab.excl     0 : { *(.stab.excl) }
+  .stab.exclstr  0 : { *(.stab.exclstr) }
+  .stab.index    0 : { *(.stab.index) }
+  .stab.indexstr 0 : { *(.stab.indexstr) }
+  .comment       0 : { *(.comment) }
+  /* DWARF debug sections.
+     Symbols in the DWARF debugging sections are relative to the beginning
+     of the section so we begin them at 0.  */
+  /* DWARF 1 */
+  .debug          0 : { *(.debug) }
+  .line           0 : { *(.line) }
+  /* GNU DWARF 1 extensions */
+  .debug_srcinfo  0 : { *(.debug_srcinfo) }
+  .debug_sfnames  0 : { *(.debug_sfnames) }
+  /* DWARF 1.1 and DWARF 2 */
+  .debug_aranges  0 : { *(.debug_aranges) }
+  .debug_pubnames 0 : { *(.debug_pubnames) }
+  /* DWARF 2 */
+  .debug_info     0 : { *(.debug_info) }
+  .debug_abbrev   0 : { *(.debug_abbrev) }
+  .debug_line     0 : { *(.debug_line .debug_line.* .debug_line_end ) }
+  .debug_frame    0 : { *(.debug_frame) }
+  .debug_str      0 : { *(.debug_str) }
+  .debug_loc      0 : { *(.debug_loc) }
+  .debug_macinfo  0 : { *(.debug_macinfo) }
+  /* SGI/MIPS DWARF 2 extensions */
+  .debug_weaknames 0 : { *(.debug_weaknames) }
+  .debug_funcnames 0 : { *(.debug_funcnames) }
+  .debug_typenames 0 : { *(.debug_typenames) }
+  .debug_varnames  0 : { *(.debug_varnames) }
+  /* DWARF 3 */
+  .debug_pubtypes 0 : { *(.debug_pubtypes) }
+  .debug_ranges   0 : { *(.debug_ranges) }
+  /* DWARF Extension.  */
+  .debug_macro    0 : { *(.debug_macro) }
+  .debug_addr     0 : { *(.debug_addr) }
+  .gnu.attributes 0 : { KEEP (*(.gnu.attributes)) }
+}
+
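The symbols `vgpu_unlock_nv_kern_rodata_beg` and `vgpu_unlock_nv_kern_rodata_end` defined in the `.data` output section above give the hooks a bounded region in which to look for the lookup tables. A minimal Python sketch of such a search follows, assuming the region is available as a byte string. The function name `find_magic_offsets` is hypothetical and nothing here reflects the actual table layout; the real hooks are C code operating on kernel memory.

```python
def find_magic_offsets(rodata: bytes, magic: bytes) -> list:
    """Return every offset in rodata at which the 128-bit magic value occurs."""
    if len(magic) != 16:
        raise ValueError("the magic value is 128 bits (16 bytes)")
    offsets = []
    pos = rodata.find(magic)
    while pos != -1:
        offsets.append(pos)
        # Continue one byte further so overlapping matches are not skipped.
        pos = rodata.find(magic, pos + 1)
    return offsets
```

Because the README describes two lookup tables (vGPU capable and not), a search like this may find the magic value more than once, so the sketch returns all offsets rather than only the first.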