I built something like this long ago. The easy part - I thought - would be to separate data and code by evaluating all the branches twice (once for the branches taken and once for those not taken) and build up a backlog of the branches not yet taken. Every step along the way you'd mark the instruction pointer as the opcode, the other bytes for that instruction as the operands and the remainder after a complete run would be the static data present in the binary.<p>If only.<p>The easiest way to throw such a scheme off is to have an instruction jump into the middle of an operand and have the code be dual meaning, one reading gives you one disassembly, another a completely different one.<p>If you then try to convert the result to a pseudo-C construct (a function with a stackframe and the corresponding unwinding of that stackframe at the end of the fuction) then it no longer works...<p>The short version of the above is there is a 1:1 correspondence between a C language source code and the un-optimized output of a compiler. That coupling is a lot less hard when you start optimizing and when you use hand-crafted assembly as the input to a reverse-compiler the output may simply no longer be functionally equivalent because the mapping might be non-existent and this can be a hard situation to detect.
I'd definitely just prefer using the dissassembled output as it's infinitely more readable once you've grokked it. Nice project though, in a field as monopolised as reversing, you can never have too many tools!