Teleforking a Process onto a Different Computer

572 points | by trishume | about 5 years ago

39 comments

hawski · about 5 years ago
Seriously cool. That also reminds me of DragonFlyBSD's process checkpointing feature, which offers suspend-to-disk. In the Linux world there have been many attempts, but AFAIK nothing simple and complete enough. To be fair, I don't know if DragonFly's implementation is either.

https://www.dragonflybsd.org/cgi/web-man?command=sys_checkpoint&section=2

https://www.dragonflybsd.org/cgi/web-man?command=checkpoint&section=ANY

synack · about 5 years ago
This reminds me of OpenMOSIX, which implemented a good chunk of POSIX in a distributed fashion.

MPI also comes to mind, but it's more focused on the IPC mechanisms.

I always liked Plan 9's approach, where every CPU is just a file and you execute code by writing to that file, even if it's on a remote filesystem.

ISL · about 5 years ago
What's old is new again -- I'm pretty sure QNX could do this in the 1990s.

QNX had a really cool way of doing inter-process communication over the LAN that worked as if it were local. I used it in my first lab job in 2001. You might not find it on the web, though; the API references were all (thick!) dead trees.

Edit: Looks like QNX4 couldn't fork over the LAN. It had a separate "spawn()" call that could operate across nodes.

https://www.qnx.com/developers/docs/qnx_4.25_docs/qnx4/sysarch/proc.html

0xbadcafebee · about 5 years ago
It's nice to see people rediscover old-school tech. In cluster computing this was generally called "application checkpointing" [1] and it's still in use in many different systems today. If you want to build this into your app for parallel computing you'd typically use PVM [2] / MPI [3]. SSI [4] clusters tried to simplify all this by making any process "telefork" and run on any node (based on a load-balancing algorithm), but the most persistent and difficult challenge was getting shared memory and threading to work reliably.

It looks like CRIU support has been bundled in kernels since 3.11 [5], and works for me on Ubuntu 18.04, so you can basically do this now without custom apps.

[1] https://en.wikipedia.org/wiki/Application_checkpointing [2] https://en.wikipedia.org/wiki/Parallel_Virtual_Machine [3] https://en.wikipedia.org/wiki/Message_Passing_Interface [4] https://en.wikipedia.org/wiki/Single_system_image [5] https://en.wikipedia.org/wiki/CRIU#Use

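To make the "build this into your app" idea concrete, here is a minimal sketch of application-level checkpointing in Python -- this is not CRIU (which snapshots a process from outside, with kernel help) but the in-process variant the comment describes; the file name and state layout are illustrative assumptions:

```python
import os
import pickle
import tempfile

def checkpoint(state, path):
    """Serialize the in-memory state to disk, atomically via a temp file."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)  # atomic rename, so a crash never leaves a torn file

def restore(path, default):
    """Resume from the last checkpoint, or start fresh if none exists."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        return default

def run(path="job.ckpt", total=10):
    """A toy job: sum 0..total-1, checkpointing after every unit of work."""
    state = restore(path, {"i": 0, "acc": 0})
    while state["i"] < total:
        state["acc"] += state["i"]   # one unit of work
        state["i"] += 1
        checkpoint(state, path)      # a crash at any point resumes from here
    return state["acc"]
```

Re-running `run()` after a crash (or on another machine that shares the checkpoint file) picks up where it left off, which is the essence of what PVM/MPI-era checkpointing libraries automated.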
fitzn · about 5 years ago
Really cool idea! Thanks for providing so much detail in the post; I enjoyed it.

A somewhat related project is the PIOS operating system, written 10 years ago at Yale but still used today to teach their operating systems class. The OS has different goals than your project, but it does support forking processes to different machines and then deterministically merging their results back into the parent process. Your post reminded me of it. There's a handful of papers on the different things they did with the OS, as well as their best-paper award at OSDI 2010.

https://dedis.cs.yale.edu/2010/det/

dekhn · about 5 years ago
Condor, a distributed computing environment, has done I/O remoting (where all I/O calls on the target machine get sent back to the source) for several decades. The origin of Linux containers was process migration.

I believe people have found other ways to do this; personally I think the ECS model (like k8s, but the cloud provider hosts the k8s environment), where the user packages up all the dependencies and clearly specifies the I/O mechanisms through late binding, makes a lot more sense for distributed computing.

Animats · about 5 years ago
That goes back to the 1980s, with UCLA Locus. This was a distributed UNIX-like system. You could launch a process on another machine and keep I/O and pipes connected -- even on a machine with a different CPU architecture. They even shared file position between tasks across the network. Locus was eventually part of an IBM product.

A big part of the problem is "fork", a primitive designed to work on a PDP-11 with very limited memory. The way "fork" originally worked was to swap out the process and, instead of discarding the in-memory copy, duplicate the process table entry for it, making the swapped-out version and the in-memory version separate processes. This copied code, data, and the process header with the file info. It's a strange way to launch a new process, but it was really easy to implement in early Unix.

Most other systems had some variant of "run" -- launch and run the indicated image. That distributes much better.

userbinator · about 5 years ago
> This can let you stream in new pages of memory only as they are accessed by the program, allowing you to teleport processes with lower latency since they can start running basically right away.

That's what "live migration" does; it can be done with an entire VM: https://en.wikipedia.org/wiki/Live_migration

dreamcompiler · about 5 years ago
Telescript [0] is based on this idea, although at a higher level. I wish we could just build Actor-based operating systems, and then we wouldn't need to keep reinventing flexible distributed computation, but alas... [1]

[0] https://en.wikipedia.org/wiki/Telescript_(programming_language)

[1] Yes, I know Erlang exists. I wish more people would use it.

systemBuilder · about 5 years ago
I think the problem with a lot of these ideas is that the value of fork() is only marginally higher than the value of starting a fresh process with arguments on a remote machine. The complexity of moving a full process to another machine is ten times higher than just starting a new process on a remote machine where all the binaries are already present.

Quite frankly, vfork only exists and gets used because it's so damned cheap to copy the page-table entries and use copy-on-write to save RAM. Take away the cheapness by copying the whole address space over a network, adding slowness, and nobody will be interested any more.

And both techniques are inferior to having a standing service on the remote machine that can accept an RPC and begin doing useful work in under 10 microseconds.

RPC is how we launch mapshards at Google: the worker process is a long-running server that receives a job spec over the network and can execute against it right away.

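The standing-service pattern the comment prefers can be sketched in a few lines -- a long-running worker that accepts a small job spec and starts useful work immediately, shipping no code or memory at all. This is a generic illustration, not Google's actual mapshard machinery; the job table and wire format are invented for the example:

```python
import json
import socket

# Hypothetical job table: the spec names an operation already present on the
# worker, so only a tiny description crosses the network, never the process.
JOBS = {"sum": lambda args: sum(args), "max": lambda args: max(args)}

def worker(sock):
    """Long-running worker: accept a job spec per connection, reply with the result."""
    while True:
        conn, _ = sock.accept()
        with conn:
            spec = json.loads(conn.makefile("rb").readline())
            if spec.get("op") == "quit":
                return
            result = JOBS[spec["op"]](spec["args"])
            conn.sendall((json.dumps({"result": result}) + "\n").encode())

def submit(port, spec):
    """Client side: send a newline-delimited JSON spec, read back the result."""
    with socket.create_connection(("127.0.0.1", port)) as conn:
        conn.sendall((json.dumps(spec) + "\n").encode())
        if spec.get("op") == "quit":
            return None
        return json.loads(conn.makefile("rb").readline())["result"]
```

Because the worker is already resident, the per-job cost is one small message rather than an address-space copy, which is the comment's point.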
abotsis · about 5 years ago
Also of interest might be Sprite, a Berkeley research OS developed "back in the day" by Ken Shirriff and others. It boasted a lot of innovations, like a logging filesystem (not just metadata) and a distributed process model and filesystem allowing live migration between nodes. https://www2.eecs.berkeley.edu/Research/Projects/CS/sprite/sprite.html

TazeTSchnitzel · about 5 years ago
In essence this is manually implementing forking -- spawning a new process and copying the bytes over without getting the kernel to help you, except over a network too.

It reminds me a bit of when I wanted to parallelise the PHP test suite but didn't want to (couldn't?) use fork(), yet I also didn't want to substantially rewrite the code to cleanly re-initialise its state in the right way. Conveniently, this program used mostly global variables, and in PHP you can access global variables as one big magic associative array called $GLOBALS. So I moved most of the program's code into two functions (mostly just adding the enclosing function declaration syntax and indentation, plus `global` imports), made the program re-invoke itself NPROCS times mid-way, sending its children `serialize($GLOBALS)` over a loopback TCP connection, then had the spawned children detect an environment variable, receive the serialized array over TCP, unserialize() it, copy it into `$GLOBALS`, and call the second function... lo and behold, it worked perfectly. :D (Of course I needed to make some other changes to make it useful, but they were also small incisions that avoided refactoring the code as much as possible.)

PHP's test suite uses this horrible hack to this day. It's... easier than rewriting the legacy code...

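The same hand-off trick translates to any language with a serializer and a way to re-invoke the interpreter. Here is a hedged Python analog of the pattern: the parent pickles its "globals" (here just a state dict), re-invokes the interpreter N times, and each child deserializes the state and continues from the hand-off point. It uses a stdin pipe rather than the loopback TCP connection the comment describes, and all names are illustrative:

```python
import pickle
import subprocess
import sys

# The "second half" of the program, run by each re-invoked child.
CHILD = r"""
import pickle, sys
state = pickle.load(sys.stdin.buffer)     # receive the parent's serialized state
print(sum(state["data"][state["lo"]:state["hi"]]))  # continue from the hand-off
"""

def parallel_sum(data, nprocs=2):
    """Re-invoke the interpreter nprocs times, shipping serialized state to each."""
    step = (len(data) + nprocs - 1) // nprocs
    procs = []
    for i in range(nprocs):
        p = subprocess.Popen([sys.executable, "-c", CHILD],
                             stdin=subprocess.PIPE, stdout=subprocess.PIPE)
        state = {"data": data, "lo": i * step, "hi": min((i + 1) * step, len(data))}
        p.stdin.write(pickle.dumps(state))   # the serialize($GLOBALS) moment
        p.stdin.close()
        procs.append(p)
    total = sum(int(p.stdout.read()) for p in procs)
    for p in procs:
        p.wait()
    return total
```

As in the PHP hack, no code is refactored to be restartable; the entire "fork" is serialize, re-invoke, deserialize, continue.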
vladbb · about 5 years ago
I implemented something similar ten years ago for a class project: https://youtu.be/0am-5noTrWk

new_realist · about 5 years ago
See https://criu.org/Live_migration

jka · about 5 years ago
This reminds me a little of the idea of 'Single System Image' [1] computing.

The idea, in the abstract, is that you log in to an environment where you can list running processes, perform filesystem I/O, list and create network connections, etc. -- and any and all of these are in fact running across a cluster of distributed machines.

(In a trivial case that cluster might be a single machine, in which case it's essentially no different from logging in to a standalone server.)

The Wikipedia page referenced has a good description and a list of implementations; sadly, the set of {has-recent-release && is-open-source && supports-process-migration} seems empty.

[1] https://en.wikipedia.org/wiki/Single_system_image

londons_explore · about 5 years ago
Bonus points if you can effectively implement the "copy on write" ability of the Linux kernel, sending over only those pages that are changed in either the local or remote fork, or read in the remote fork.

An rsync-like diff algorithm might also substantially reduce the pages copied if the same or a similar process is teleforked multiple times.

Many processes have a lot of memory that is never read or written, and there's no reason it should be moved, or at least no reason it should be moved quickly.

Using that, you ought to be able to resume the remote fork in milliseconds rather than seconds.

userfaultfd(), or mapping everything to files on a FUSE filesystem, both look like promising implementation options.

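The rsync-like idea reduces to hashing the address space page by page and shipping only pages whose hash differs from the copy the receiver already has. A minimal sketch over byte buffers (real implementations would hash the live mappings and handle address-space growth more carefully; the 4 KiB page size matches x86 Linux):

```python
import hashlib

PAGE = 4096  # typical page size on x86 Linux

def pages(buf):
    """Split a buffer into fixed-size pages."""
    return [bytes(buf[i:i + PAGE]) for i in range(0, len(buf), PAGE)]

def delta(old, new):
    """rsync-style page diff: return only (index, page) pairs that changed."""
    old_hashes = [hashlib.sha256(p).digest() for p in pages(old)]
    out = []
    for idx, page in enumerate(pages(new)):
        h = hashlib.sha256(page).digest()
        if idx >= len(old_hashes) or h != old_hashes[idx]:
            out.append((idx, page))   # only this page crosses the network
    return out

def apply_delta(old, changes):
    """Receiver side: patch the stale image with the shipped pages."""
    ps = pages(old)
    for idx, page in changes:
        if idx < len(ps):
            ps[idx] = page
        else:
            ps.append(page)
    return b"".join(ps)
```

If a process is teleforked twice and only one page was dirtied in between, the second transfer ships one page instead of the whole image, which is where the milliseconds-instead-of-seconds claim comes from.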
carapace · about 5 years ago
"Somebody else has had this problem."

Don't get me wrong, this is great hacking and great fun. And this is a good point:

> I think this stuff is really cool because it's an instance of one of my favourite techniques, which is diving in to find a lesser-known layer of abstraction that makes something that seems nigh-impossible actually not that much work. Teleporting a computation may seem impossible, or like it would require techniques like serializing all your state, copying a binary executable to the remote machine, and running it there with special command line flags to reload the state.

rapjr9 · about 5 years ago
There was a lot of work on mobile agents 20 years ago: Java programs that could jump from machine to machine over the network and continue executing wherever they landed. The field stagnated because there were some really difficult security problems (how can you trust the code to execute on your machine? How can the code trust whatever machine it lands on and use its services?). I think later work resolved the security issues, but the field has not resurged. It might be a good place to start to see what the issues and risks of mobile task execution are.

saagarjha · about 5 years ago
It's touched on at the very end, but this kind of work is somewhat similar to what the kernel needs to do on a fork or context switch, so you can figure out what state you need to keep track of from there. Once you have that, scheduling one of these network processes isn't really all that different from scheduling a normal process -- except, of course, that syscalls on the remote machine will possibly go to a kernel that doesn't know what to do with them.

lachlan-sneff · about 5 years ago
Wow, this is really interesting. I bet there's a way of doing this robustly by streaming wasm modules instead of full executables to every server in the cluster.

p4bl0 · about 5 years ago
Very cool :). Apart from Plan 9, which many people have already talked about here, it also made me think of Emacs' `unexec` [0].

[0] http://git.savannah.gnu.org/cgit/emacs.git/tree/src/unexelf.c

peterkelly · about 5 years ago
There's been a bunch of interesting work done on this over the years. Here's a literature survey on the topic: https://dl.acm.org/doi/abs/10.1145/367701.367728

tpetry · about 5 years ago
It's a really nice idea. But reading it, I came to the conclusion that web workers are a genius idea that could work equally well for C-like software: everything that should run on a different server is an extra executable, so that executable gets shipped to the destination and started, and then the two processes talk by message passing. The concept is so generic that there could be dozens of "schedulers" to start a process in a remote location: an ssh connect, starting a cloud VM, ...

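A hedged sketch of that worker model: the "shipped executable" below is just a small program source, the parent launches it through a pluggable scheduler, and the two sides talk by newline-delimited JSON message passing over pipes. Everything here (the worker body, the launcher, the message format) is invented for illustration:

```python
import json
import subprocess
import sys

# The "extra executable" that would be shipped to the destination machine.
WORKER = r"""
import json, sys
for line in sys.stdin:                      # message loop: one JSON request per line
    msg = json.loads(line)
    print(json.dumps({"echo": msg["payload"].upper()}), flush=True)
"""

def launch_local(argv):
    """One possible 'scheduler': run the worker locally. An ssh-based scheduler
    would conceptually just prepend ["ssh", host] to argv (assuming the
    executable has already been shipped there)."""
    return subprocess.Popen(argv, stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE, text=True)

def call(proc, payload):
    """Message passing: send one request, block for one reply."""
    proc.stdin.write(json.dumps({"payload": payload}) + "\n")
    proc.stdin.flush()
    return json.loads(proc.stdout.readline())["echo"]
```

The design point is that the parent never shares an address space with the worker, so swapping the local scheduler for a remote one changes nothing about the protocol.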
zozbot234 · about 5 years ago
The page states that CRIU requires kernel patches, but other sources say that the kernel code for CRIU is already in the mainline kernel. What's up with that?

bamboozled · about 5 years ago
I could imagine a build system which produces a process as an artifact and then just forks it in the cloud, without distributing those pesky archives!

anthk · about 5 years ago
What I'd love is easily binding remote directories as local -- not NFS, but a braindead 9p. If I don't have a tool, I'd love to have a bind-mount of a directory from a stranger and run a binary from within it (or pipe to it) without them being able to trace the I/O.

If the remote filesystem is on a different architecture, I should be able to run the same binary remotely as a fallback option, seamlessly.

touisteur · about 5 years ago
I wonder whether the effort by the syzkaller people (@dvyukov) could help with the actual description of all the syscalls (which the author says people gave up on for now, because it's too complex), since they need those descriptions to fuzz efficiently...

YesThatTom2 · about 5 years ago
Condor did this in the early '90s.

cecilpl2 · about 5 years ago
This is similar to what Incredibuild does. It distributes compile and compute jobs across a network, effectively sandboxing the remote process and forwarding all filesystem calls back to the initiating agent.

crashdelta · about 5 years ago
This is one of the best side projects I've ever seen, hands down.

sharno · about 5 years ago
I think Unison [0] is going to make this trivial.

[0] https://www.unisonweb.org/

justicezyx · about 5 years ago
There is at least one company using CRIU to implement general-purpose process live migration (not specifically of VMs).

concernedctzn · about 5 years ago
Side note: take a look at this guy's other blog posts; they're all very good.

pcr910303 · about 5 years ago
This makes me think of Urbit [0] -- Urbit OS represents the entire OS state as a simple tree, so this would be very simple to implement there.

[0]: https://urbit.org/

rhabarba · about 5 years ago
I love how people reimplement Plan 9 (poorly).

cjbprime · about 5 years ago
Amazing work.

crashdelta · about 5 years ago
THIS IS REVOLUTIONARY!

totorovirus · about 5 years ago
teleforkbomb..

anticensor · about 5 years ago
hoard() would be a better name.