Process Creation in Io_uring

57 points · semiquaver · 1 day ago

21 comments

cperciva · 18 hours ago

The "clone the entire address space and then call exec" idiom is indeed wildly inefficient -- that's why the horror which is vfork was invented -- but I'm not convinced that putting everything which sits between fork and execve into io_uring (or, as a comment snarkily suggests, eBPF) is the solution. There are just too many things userland might want to do.

I wonder if the best solution lies somewhere in the vicinity of "fork but only copy a small part of the address space". fork copies the entire address space (only to use a tiny portion and throw away the rest), and vfork copies none of it (the page tables are shared between parent and child until exec). If we can identify what memory the child will need to access before calling _exit or exec (say, "the current function and its local variables"), then we could create an address space with just a few page table entries.

Kind of like the "zygote" forking model (early in the main process lifetime, a zygote process gets forked off, and when the main process wants another worker it asks the zygote to fork one off) except that the "zygote" is more like an induced pluripotent stem cell, having been reverted from an adult state.
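For readers who have not seen the vfork end of that spectrum, here is a minimal sketch in C. The helper name run_with_vfork is hypothetical and error handling is kept deliberately thin. It only shows the constraint described above: between vfork and exec the child runs on the parent's memory, so it does nothing except exec or _exit.

```c
#define _DEFAULT_SOURCE /* exposes vfork() in <unistd.h> on glibc */
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run argv[0] (looked up in PATH) via vfork + execvp.
 * Returns the child's exit status, or -1 if vfork/waitpid fails or the
 * child did not exit normally. */
int run_with_vfork(char *const argv[])
{
    pid_t pid = vfork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: borrowing the parent's address space, so only exec or
         * _exit is safe here. */
        execvp(argv[0], argv);
        _exit(127); /* exec failed; 127 is the usual shell convention */
    }

    /* Parent: suspended until the child has called exec or _exit. */
    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling it with an argv of {"true", NULL} should return 0 on a typical Unix system. Anything more elaborate between vfork and exec (closing descriptors, changing directory, and so on) is exactly the part this thread is arguing about.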


10000truths · 17 hours ago

I agree with Pavel that extending the clone syscall is a better idea than this patch set. The flexibility that Josh and Gabriel talk about seems wholly unnecessary. In every use of fork-(do stuff)-exec I've ever seen, the following two observations held:

1. Everything needed in the "do stuff" part was known prior to the call to fork

2. Any failures in the "do stuff" part would scrap the child process and report an error to the parent process
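The existing posix_spawn interface already has roughly this shape, so it may serve as an illustration of the two observations. This is a sketch, not a proposal for the clone extension itself: the helper name spawn_redirected is hypothetical, and how setup failures are reported is implementation-dependent (POSIX allows either an error return in the parent or a child that exits with status 127).

```c
#include <fcntl.h>
#include <spawn.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

extern char **environ;

/* Hypothetical helper: run argv[0] with stdout redirected to out_path.
 * Returns the child's exit status, or -1 if setup, spawn, or wait fails. */
int spawn_redirected(char *const argv[], const char *out_path)
{
    posix_spawn_file_actions_t actions;
    if (posix_spawn_file_actions_init(&actions) != 0)
        return -1;

    /* Observation 1: the "do stuff" step is fully described up front,
     * before any child exists. */
    int err = posix_spawn_file_actions_addopen(&actions, STDOUT_FILENO,
                                               out_path,
                                               O_WRONLY | O_CREAT | O_TRUNC,
                                               0644);
    pid_t pid = -1;
    if (err == 0)
        err = posix_spawnp(&pid, argv[0], &actions, NULL, argv, environ);
    posix_spawn_file_actions_destroy(&actions);

    /* Observation 2: a failed step scraps the child. Depending on the
     * implementation, the parent sees an error number here or an exit
     * status of 127 below. */
    if (err != 0)
        return -1;

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Pointing out_path at a directory that does not exist should produce a nonzero result without the target program ever running, which is the "scrap the child and report an error" behavior in the second observation.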


PaulDavisThe1st · 12 hours ago

For now, I'd settle for an RT-safe way to create a new process that then calls execve. AFAIK, this doesn't exist for Linux and may not exist for any *nix kernel at this time (not sure about this second part).
