New CVE! CVE-2026-31431, codenamed Copy Fail. The short version: an unprivileged local user opens a socket, asks the kernel to do some routine crypto, and walks away with root.

What makes Copy Fail special is that it’s a logic bug, not a memory corruption. Memory bugs are the usual story (find a buffer overflow or a use-after-free, then chain careful tricks until you have code execution); logic bugs skip all of that, because the kernel’s own code, executing exactly as written, just happens to do something it shouldn’t. That’s why the 732-byte Python script that dropped on April 29 works on every major Linux distribution as shipped.

The writeup is solid, and the exploit itself is 732 bytes of intentionally unreadable Python (very on-brand). Low Level did a very cool video about it; make sure to check it out. But first, let’s walk through it.

Ok, so what actually is a socket?

A socket in Linux is the standard kernel object that programs use to send streams of bytes somewhere. Network sockets (TCP/UDP) talk over the network. Unix domain sockets talk between processes on the same machine. AF_ALG sockets talk to the kernel’s own crypto routines. They’re all the same kind of object, just plugged into different backends.
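A quick way to feel the “same object, different backend” point: Python’s socketpair() hands you two connected sockets, and the send/recv interface is exactly what you’d use for TCP or, as we’ll see, AF_ALG. A minimal sketch:

```python
import socket

# Two connected sockets; the same API works for TCP, Unix domain, AF_ALG.
a, b = socket.socketpair()
a.sendall(b"hello kernel")
data = b.recv(64)
print(data)  # b'hello kernel'
a.close()
b.close()
```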

Quick detour first, because everything that follows depends on it: the kernel doesn’t track memory byte by byte. It tracks it in fixed-size chunks called pages (physical ones, not virtual), usually 4 KB each. Physical RAM is divided into pages, the kernel allocates and frees memory in page-sized units, and any logical buffer bigger than 4 KB ends up spread across several pages.

That’s where scatter lists come in. A scatter list is a linked list of (page, offset, length, next) entries that tells the kernel “this buffer starts at this offset inside this page, continues for this many bytes, and the next chunk lives at this entry”. Inside the kernel, every socket is a struct with two pointers: one to an input scatter list (the data going in) and one to an output scatter list (the data coming back out). Send 40,000 bytes into a socket and the kernel scatters them across about ten pages and remembers the path.
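Not kernel code, but a back-of-the-envelope Python sketch of the same idea, assuming 4 KB pages and ignoring the next pointers: describe a logical buffer as (page, offset, length) chunks.

```python
PAGE_SIZE = 4096

def build_scatter_list(buf_len, start_offset=0):
    """Describe a logical buffer as (page_index, offset, length) chunks,
    the way a kernel scatter list walks page-sized pieces of memory."""
    entries, pos, page, offset = [], 0, 0, start_offset
    while pos < buf_len:
        take = min(PAGE_SIZE - offset, buf_len - pos)
        entries.append((page, offset, take))
        pos += take
        page += 1
        offset = 0  # later chunks start at the top of their page
    return entries

sg = build_scatter_list(40_000)
print(len(sg))                   # 10 pages for a 40,000-byte buffer
print(sum(n for _, _, n in sg))  # 40000
```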

A socket in the kernel: input scatter list pointing at caller pages, output scatter list pointing at kernel pages

There’s a contract baked into this layout: the input list holds the caller’s pages, and the kernel only reads from them; the output list holds the kernel’s own pages, and the kernel writes them for you to read back later. The whole bug is going to be the kernel breaking that contract by writing into an input page.

AF_ALG: kernel crypto, in a socket

AF_ALG is a special socket family that doesn’t talk to the network or to other processes; it talks to the kernel’s own crypto code. The pitch is the standard one: don’t write your own crypto, use the kernel’s audited and hardware-accelerated implementations instead. The API is: bind a socket to an algorithm name, send your data through, read the result back out.

One of the algorithm families AF_ALG offers is AEAD, Authenticated Encryption with Associated Data, and it’s where Copy Fail lives. AEAD does two things in the same operation. It encrypts a payload, so an eavesdropper can’t read it. And it computes an authentication tag over a piece of metadata that travels alongside it, so the receiver can verify nothing was tampered with along the way.

The contract

When you sendmsg into an AEAD socket, the kernel expects your buffer laid out in this exact order:

input  →  AAD || ciphertext || tags
output →  AAD || plaintext

Three pieces in, two pieces out. Each of those pieces is a separate entry in the socket’s scatter list, pointing at its own page in memory; when I say “the tags page” later in the post, that’s literally what I mean: the page in the input scatter list that holds the tag bytes.

AAD (Additional Authenticated Data) is metadata that gets authenticated but not encrypted, things like packet headers and sequence numbers, where the receiver needs to read them directly while still being able to verify they weren’t tampered with. Ciphertext is the encrypted payload, which the kernel decrypts for you. Tags is the authentication tag itself, the receipt the kernel checks against the AAD and ciphertext to confirm nothing was altered along the way.
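The input layout is easy to see with plain slicing. A toy illustration with made-up lengths (not a real cipher, just the AAD || ciphertext || tags carve-up):

```python
def split_aead_input(buf: bytes, assoclen: int, taglen: int):
    """Split an AEAD input buffer laid out as AAD || ciphertext || tag
    into its three pieces, the way the kernel carves it up."""
    aad = buf[:assoclen]
    ciphertext = buf[assoclen:len(buf) - taglen]
    tag = buf[len(buf) - taglen:]
    return aad, ciphertext, tag

# 4-byte AAD, 32-byte ciphertext, 16-byte tag (lengths are arbitrary here)
buf = b"HDR!" + b"\xaa" * 32 + b"T" * 16
aad, ct, tag = split_aead_input(buf, assoclen=4, taglen=16)
print(len(aad), len(ct), len(tag))  # 4 32 16
```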

All three input pages live in the input scatter list, supplied by the caller, and the kernel only reads from them. The two output pages (AAD and plaintext) live in the output scatter list, allocated and written by the kernel itself, and that’s what you read back with read(). The contract is one-way: input pages go in, output pages come out, never the other way around.

The optimization that breaks everything

Here’s the part that took me a couple of reads to get.

The same AEAD code path also handles IPSec, the encryption layer used by most VPN tunnels. IPSec packets carry a 64-bit Extended Sequence Number (ESN) that gets folded into the HMAC so attackers can’t replay old packets after the regular 32-bit counter wraps around. Computing that HMAC with an ESN needs a few bytes of scratch memory.
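For the curious: RFC 4303 puts only the low 32 bits of the ESN on the wire and folds the high 32 bits into the integrity check. A minimal sketch of that split (layout simplified, not the kernel’s actual scratch format):

```python
import struct

def split_esn(esn: int):
    """Split a 64-bit Extended Sequence Number: the low 32 bits travel
    in the packet, the high 32 bits only enter the MAC computation."""
    high, low = esn >> 32, esn & 0xFFFFFFFF
    return struct.pack(">I", high), struct.pack(">I", low)

high, low = split_esn(0x0000000100000002)
print(high.hex(), low.hex())  # 00000001 00000002
```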

Instead of allocating a fresh page for that scratch memory, someone decided to reuse the tags page (the one that lives in the input scatter list, the one the kernel is supposed to only read from) as a kernel-side scratchpad. The reasoning was presumably “we’re done reading the tags by this point, the page is right there, why allocate another?” For IPSec specifically there’s no harm: same caller, same page, the operation completes correctly.

The problem is what this implicitly does. The kernel just promoted a page from “caller-provided, read-only” to “kernel writes to it”, breaking the one-way contract on the input scatter list, all to save one page allocation in a corner of the IPSec code path. The rest of the exploit is finding a way to make that newly-writable page be a page the attacker actually cares about.

AEAD scatter lists: the kernel’s ESN scratchpad write lands on the tags page in the input scatter list

splice() and the page-cache trick

splice() is one of those Linux syscalls that’s been around forever and most application engineers never touch. It moves data between a file descriptor and a pipe (or between two pipes) without ever copying it through your program; the kernel just rewires its own page references internally. It’s how high-throughput servers move bytes from disk to socket without bouncing them through user buffers.

We also need to talk about the page cache. Every file you open() and read() gets cached in memory by the kernel, addressed by (file, offset), so re-reading the same chunk of the same file is a memory copy instead of a disk hit (which is why your second cat of a big file is instant).

Now the trick: if you splice a file descriptor into a pipe and then splice that pipe into a socket, the page that ends up in the socket’s input scatter list points at the exact same physical memory as the page-cache page of the file. The socket and the page cache literally share that page. And splice() takes an offset, so you also get to pick which page lands there, and where inside that page.
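You can simulate “one physical page, two references” in userspace with a memoryview: two names, one backing buffer, and a write through either is visible through both. That’s exactly the situation splice() sets up between the page cache and the socket’s scatter list:

```python
page = bytearray(4096)          # stand-in for one physical page
cache_view = memoryview(page)   # the "page cache" reference
sg_view = memoryview(page)      # the "input scatter list" reference

# The kernel's scratchpad write through the scatter-list view...
sg_view[64:68] = b"\xde\xad\xbe\xef"

# ...shows up through the page-cache view: it's the same memory.
print(cache_view[64:68].tobytes().hex())  # deadbeef
```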

splice() turning a page-cache page of /usr/bin/su into a page in the AF_ALG socket’s input scatter list

So now you can take a page from any file you can open and slip it into the input scatter list of an AF_ALG socket, with byte-level control over where the kernel’s eventual scratchpad write will land.

The chain

Here’s the rough shape of it, in pseudocode (not the real exploit, just the structure):

# 1. Bind an AEAD socket via AF_ALG
sock = socket(AF_ALG, SOCK_SEQPACKET, 0)
bind(sock, type="aead", name="rfc4106(gcm(aes))")
op = accept(sock)

# 2. Send the AAD and ciphertext part of the input
#    (the kernel expects the buffer as: AAD || ciphertext || tags)
sendmsg(op, [aad, ciphertext])

# 3. Open /usr/bin/su and force its first page into the page cache
fd = open("/usr/bin/su", O_RDONLY)
read(fd, 1)

# 4. Splice that exact page-cache page into op's input scatter list,
#    so the kernel will treat it as the "tags" page
pr, pw = pipe()
splice(fd, TARGET_OFFSET, pw, None, PAGE_SIZE)
splice(pr, None, op, None, PAGE_SIZE)

# 5. Drive the AEAD code path with crafted ESN input so the
#    kernel's scratchpad write lands at the offset we picked
#    inside /usr/bin/su's cached page
trigger_aead(op)

# 6. exec /usr/bin/su; the kernel reuses the (now corrupted) cached page
execve("/usr/bin/su", [...])

The scratchpad write is only four bytes, which might sound tiny, but it’s plenty when the target is /usr/bin/su, a binary that already runs as root. Flip the right four bytes inside its cached copy (an instruction, a jump-table entry, a function pointer) and the in-memory version of the binary now does whatever you tell it to, with full root privileges.

The full exploit chain: AF_ALG socket, sendmsg, open /usr/bin/su, splice the cached page in as tags, trigger AEAD, exec, root

Why /usr/bin/su

setuid is a Unix file flag (set with chmod u+s) that says “when this binary runs, the resulting process gets the file owner’s UID, not the UID of the user who launched it”. /usr/bin/su is owned by root and has setuid set, so when a regular user execs it, the kernel runs the resulting process as UID 0 (root). It has to: su switches you into another account, and only root has the authority to do that switch.
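You can inspect the setuid bit from Python with the stat module. Since the exact path varies by distro, this sketch checks a typical mode value for su directly rather than the live file:

```python
import stat

# A typical mode for /usr/bin/su: -rwsr-xr-x (note the 's' in place of 'x').
mode = 0o104755  # regular file, setuid bit set, permissions rwxr-xr-x

print(bool(mode & stat.S_ISUID))  # True: runs with the file owner's UID
print(stat.filemode(mode))        # -rwsr-xr-x
```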

setuid is normally fine because the on-disk binary is signed off by your distro’s packagers, the filesystem permissions stop you modifying it, and the kernel re-checks the setuid bit at every exec. None of that protects you when the page-cache copy of the binary has been corrupted, because the kernel runs from the page cache, not from the disk. The on-disk file stays pristine; the in-memory cached copy is the one that’s been flipped, and the next exec runs from that cached copy with full root privileges.

Setuid binary on disk is pristine, but the page cache copy has 4 bytes flipped, and exec runs from cache

Putting it all together

If I had to pick one image to summarise the whole exploit, it would be this. The whole trick hinges on one physical page existing in two roles at the same time: as the tags slot in the AF_ALG socket’s input scatter list (a place the kernel feels free to scribble on, thanks to the IPSec optimization) and as the cached page of /usr/bin/su (the binary the kernel is about to execute as root). Once the same physical page is both, a normal AEAD operation corrupts a setuid binary, and the next exec hands you a root shell.

Copy Fail overview: attacker, AF_ALG socket, page cache, AEAD code path and the kernel exec path, with the same physical page sitting in the input scatter list and in the page cache

What you actually do about it

Patches landed across distros on April 30 and May 1. AlmaLinux, RHEL, Amazon Linux, Ubuntu, SUSE, the cloud kernels (CloudLinux, AKS, GKE, EKS) all have updates out. If you administer Linux for a living, you should probably run the update.

If you can’t patch immediately, block AF_ALG socket creation. You can do that with seccomp (a per-process syscall filter), with an LSM rule (Linux Security Module, the kernel’s pluggable security framework), or just by blacklisting the algif kernel modules (blacklist algif_aead in your modprobe config). The trade-off is that anything actually using AF_ALG breaks, including some VPN tooling and a few LUKS disk-encryption configs, so it’s a bandage and not a fix.
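A sketch of the modprobe route (the file name is illustrative). Note that blacklist only stops alias-based autoloading at boot; install lines are stronger because they also defeat on-demand loading by the kernel itself:

```shell
# /etc/modprobe.d/disable-afalg.conf  (file name is arbitrary)
# "install <module> /bin/false" makes any attempt to load the module
# run /bin/false instead, which fails, so the module never loads.
install algif_aead /bin/false
install algif_hash /bin/false
install algif_skcipher /bin/false
install algif_rng /bin/false
```

Remember this also breaks legitimate AF_ALG users (some VPN tooling, some LUKS setups), so treat it as a stopgap until you can take the kernel update.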

The wider point: this is a logic bug, not a memory corruption, and almost none of the usual kernel defences apply. A quick tour of why each one is irrelevant here:

  • Stack canaries are random values planted near function return addresses to catch buffer overflows. Nothing here is overflowing a buffer.
  • KASLR scrambles where kernel code and data live in memory so attackers can’t hardcode addresses. We never need to know an address.
  • SMAP, SMEP and CFI are the page-access and control-flow guards that stop the kernel from accidentally reading or executing userspace memory, or jumping to unexpected places. The kernel itself is doing the write here, on its own behalf, into a page that just happens to also be ours.

Those mitigations all exist to defeat memory-corruption bugs. They don’t help when the attacker isn’t fighting the memory model at all, just walking through a door the kernel itself opened to save one allocation.

It’s also a useful reminder that “performance optimization” inside the kernel is a different beast from the same words in regular code. Reusing a page across a trust boundary looks fine in code review when the reviewer is thinking about throughput, and only becomes obvious in hindsight, when someone finally asks who owns which page.


Anyway, looks like 732 bytes was enough this time :)