So you think you want to write a deterministic hypervisor?

200 points by wwilson about 1 year ago

14 comments

tlb about 1 year ago

Interesting project. I almost wish I had a concurrency bug to test it on.

> Guest software running in the Antithesis platform still experiences concurrency similar to a multi-core / multi-machine system, thanks to the process scheduling imposed by the guest OS

This might not exercise the full set of race conditions. When two threads are running simultaneously on separate cores (or hyper-threaded on the same core) they can interleave instructions at a much finer granularity than any OS time slicing would cause, even within instructions.

For example, could it find a race condition where two threads are executing INC [addr] on the same memory address, where context switching between instructions doesn't trigger it?
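
To make the INC [addr] question concrete, here is a minimal sketch that enumerates every interleaving of two threads incrementing one shared counter. It is a toy model, not a description of how Antithesis or real hardware works: the names `INSTRUCTION_LEVEL`, `MICRO_OP_LEVEL`, and `final_values` are made up for illustration, and splitting an unlocked INC into load / add / store steps is a simplifying assumption about how two cores can overlap.

```python
from itertools import combinations

# Hypothetical models of one thread executing "INC [addr]" once.
INSTRUCTION_LEVEL = ["inc"]               # switching only between whole instructions
MICRO_OP_LEVEL = ["load", "add", "store"]  # two cores overlapping inside the instruction


def interleavings(n_a, n_b):
    """Yield every order in which thread 0 (n_a steps) and thread 1 (n_b steps) can run."""
    total = n_a + n_b
    for a_slots in combinations(range(total), n_a):
        slots = set(a_slots)
        yield [0 if i in slots else 1 for i in range(total)]


def execute(schedule, program):
    """Run both threads through `program` in the given order; return the final counter."""
    memory = 0
    reg = [0, 0]  # one private register per thread
    pc = [0, 0]   # next step index per thread
    for tid in schedule:
        op = program[pc[tid]]
        pc[tid] += 1
        if op == "inc":
            memory += 1          # indivisible read-modify-write
        elif op == "load":
            reg[tid] = memory
        elif op == "add":
            reg[tid] += 1
        elif op == "store":
            memory = reg[tid]
    return memory


def final_values(program):
    """Set of all final counter values reachable over every interleaving."""
    n = len(program)
    return {execute(s, program) for s in interleavings(n, n)}


if __name__ == "__main__":
    print("instruction-level switching:", final_values(INSTRUCTION_LEVEL))
    print("micro-op-level overlap:     ", final_values(MICRO_OP_LEVEL))
```

In this model the lost update (a final value of 1 instead of 2) only appears when the threads can overlap inside the instruction, which is the class of bug the comment is asking about.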

mareko about 1 year ago

I love this. Many a moon ago, I worked on a system called Aikido at MIT, which combined a specially built hypervisor with a binary rewriting system (DynamoRIO) to enable efficient time travel debugging and race detection of parallel applications.

If anyone's interested, here's a publication that talks about it in more detail: https://dspace.mit.edu/handle/1721.1/72082

The use of performance counters here also reminds me of another project I worked on called Kendo, which was a POSIX-threads-like replacement that used performance counters to enforce a deterministic interleaving of synchronization operations (mutexes, etc.). The system could guarantee determinism for programs that didn't have race conditions. Back then, I found that counting instructions wasn't deterministic on the processors of the time, but counting store operations was.

If anyone's interested in that work, here's the publication: http://www.cag.csail.mit.edu/~mareko/asplos073-olszewski.pdf
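
The idea of ordering synchronization operations by a deterministic counter can be sketched as a small simulation. This is only loosely inspired by the description above and is not Kendo's actual algorithm or API: the function `lock_order`, the `work` layout, and the rule "a thread may take the lock only when its (logical clock, thread id) pair is the smallest among unfinished threads" are all assumptions made for illustration. The logical clock here counts simulated stores.

```python
import random


def lock_order(work, seed):
    """Simulate threads competing for one lock under arbitrary physical scheduling.

    work[t] lists how many stores thread t performs before each lock acquisition.
    `seed` drives a random scheduler that stands in for timing noise.
    Returns the order (thread ids) in which the lock was acquired.
    """
    rng = random.Random(seed)
    n = len(work)
    clock = [0] * n     # deterministic logical clock per thread (stores retired)
    progress = [0] * n  # stores done toward the current acquisition
    next_acq = [0] * n  # index of the next acquisition per thread
    order = []

    def done(t):
        return next_acq[t] >= len(work[t])

    while not all(done(t) for t in range(n)):
        # The physical scheduler picks any unfinished thread, in any order.
        t = rng.choice([u for u in range(n) if not done(u)])
        if progress[t] < work[t][next_acq[t]]:
            progress[t] += 1
            clock[t] += 1  # retiring a store advances logical time
        elif all((clock[t], t) <= (clock[u], u) for u in range(n) if not done(u)):
            # It is this thread's deterministic turn: take the lock.
            order.append(t)
            clock[t] += 1  # acquiring also advances logical time
            next_acq[t] += 1
            progress[t] = 0
        # Otherwise the thread spins and waits for slower threads to catch up.
    return order


if __name__ == "__main__":
    work = [[3, 1], [1, 4], [2, 2]]
    for seed in range(3):
        print(seed, lock_order(work, seed))
```

Every seed produces the same acquisition order, because each acquisition is gated on logical time rather than on which thread the scheduler happened to favor. The price is that a fast thread may have to wait for slower ones to catch up.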

comex about 1 year ago

What a tease! They describe in detail two problems they had to “invent workarounds” for, but say nothing about what the workarounds are. I’m very curious, since both of the problems sound quite hard to work around. I wonder if they’re being purposefully vague to make it harder for competitors to replicate their work…

fuzzybear3965 about 1 year ago

I don't understand how this works in the case of testing many applications running on many machines, where many services on many machines need to communicate with each other. We deploy a mix of systemd services and OCI containers (running on podman and Docker) to different machines; the exact mix on each machine depends on the machine's intended purpose.

We currently run CI tests using QEMU VMs. These VMs comprise a few systems representative of those that we deploy to production.

Does adopting Antithesis mean that all non-containerized applications would need to be OCI-ified and every interaction would need to be mocked? There's a sort of combinatorial explosion that I'm concerned about when I'm thinking about testing/adding a new service to a system: all services on which it depends need to be mocked, and all services which depend on it require creating a mocked version of it.

Seems like a lot of work. Can someone please help clarify things for me?

Also, how could we test the behavior of non-application code like drivers or the kernel itself?

wzdd about 1 year ago

I'm not familiar with the area, so am likely missing something, but how do they do deterministic thread-level context switching? Something like:

<pre><code>var_1 = 0
var_2 = 0

thread_a:
    while true:
        something_complex()
        var_1++

thread_b:
    while true:
        something_complex()
        var_2++
</code></pre>

Under the quoted definition of determinism, for every point in time, var_1 and var_2 should have the same values across all executions. But this would seem to amount to ensuring that exactly the same number of instructions are executed each time a thread is scheduled.
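
The "same number of instructions each time a thread is scheduled" idea can be illustrated with a toy scheduler. This is a sketch under stated assumptions, not a description of the actual hypervisor: each generator `yield` stands in for one retired instruction, and `worker`, `run`, and the `quantum` parameter are hypothetical names. The assumption is that preemption is driven by an instruction count rather than by wall-clock time.

```python
def worker(state, name, work_len):
    """A thread: `work_len` steps of something_complex(), then one increment."""
    while True:
        for _ in range(work_len):
            yield              # one "instruction" of something_complex()
        state[name] += 1
        yield                  # the increment itself counts as an instruction


def run(quantum, total_steps):
    """Round-robin two workers, switching after exactly `quantum` instructions.

    Returns the (var_1, var_2) pair observed after every instruction.
    """
    state = {"var_1": 0, "var_2": 0}
    threads = [worker(state, "var_1", 3), worker(state, "var_2", 5)]
    trace = []
    current = 0
    for step in range(total_steps):
        next(threads[current])                     # execute one instruction
        trace.append((state["var_1"], state["var_2"]))
        if (step + 1) % quantum == 0:              # deterministic preemption point
            current = (current + 1) % len(threads)
    return trace


if __name__ == "__main__":
    first = run(quantum=4, total_steps=16)
    second = run(quantum=4, total_steps=16)
    print(first == second)   # identical traces on every run
    print(first[-1])
```

Because the switch points depend only on the instruction count, the whole trace of var_1 and var_2 is reproducible. Changing `quantum` gives a different but equally reproducible interleaving.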

wyldfire about 1 year ago

Hermit [1] is another cool effort at determinism/reproducibility.

[1] https://github.com/facebookexperimental/hermit

aftbit about 1 year ago

How does this deal with non-determinism from the outside world? For example, let's say one of my tests is flaky because it asks an external service to give it some data, and that external service is flaky in what it returns?

Or what if my bug is caused by bitflips in failing memory that lead to impossible control flow paths being hit? Think something like:

<pre><code>if x != 0:
    return 1/x
</code></pre>

Failing with an error because x is 0.

Not hypothetical scenarios, both real bugs I've had to troubleshoot in my career.

fitzn about 1 year ago

Fun read. It has some similar ideas to https://dedis.cs.yale.edu/2010/det/ but that was actually focused on multicore processing and the communication across cores.

Congrats on the launch.

delta64 about 1 year ago

I've long thought about this kind of OS design, and what great features it can enable (such as time travel debugging). But the non-determinism introduced by inter-CPU interactions is a fundamental limitation, hence the need to run everything on a single isolated core.

One day(^TM) I'm really keen to design a multi-core CPU architecture that allows for deterministic message passing between cores in such a way that you could get this kind of software working with true parallelism.

metalcrow about 1 year ago

This is a very promising project. I've seen a lot of attempts at this in the past, but none got to the level of progress that you have! Very impressive work!

I am sad that you decided to give up on solving the multi-core parallelism issue, since each guest running on a single core is a dead giveaway to malware that it's not on a real machine, but it's understandable. I do wonder if that means that some class of bugs will be undetectable to this hypervisor, though.

debbiedowner about 1 year ago

We do something similar in house where I work. Is it hard to onboard new customers? Since they make a special container for you they basically adopt your build system, which may be hard for them. Does this go beyond mutation testing?

costco about 1 year ago

Did you decide on a hypervisor instead of an emulator for overhead reasons? I don't work there but I heard Microsoft has something called tkofuzz which similarly emphasizes determinism but uses Bochs internally.

immibis about 1 year ago

This is for a distributed database product that claims to bypass the CAP theorem, or have I misunderstood?

> Back then Spanner wasn’t public yet and a lot of people misinterpreted the CAP theorem to say that a strongly consistent database couldn’t also be highly available in the face of network faults.

- https://antithesis.com/blog/is_something_bugging_you/

deater about 1 year ago

strange they seem unaware of the RR deterministic debugger work

especially as that builds off of the extensive work done on x86 counter determinism here: https://web.eece.maine.edu/~vweaver/projects/deterministic/

it turns out that on x86/AMD chips many of the perf counter events are offset by the (unpredictable) interrupt count, because the interrupt return instruction uop gets counted as both a user and a kernel instruction. On many processors the retired store instruction event avoids this issue.
