This is really neat. In the past I've used a similar technique to decode binary data from a third-party lidar system in parallel, in a way the manufacturer probably didn't intend or expect.

The system generated large data files which we wanted to process in parallel without any pre-indexing. It turned out that these streams contained sync markers which were "unlikely" to occur in the real data, but there was no precise framing like COBS. Even so, the markers, combined with certain patterns in the binary headers, were enough to synchronize with the stream very reliably.

So for parallel processing we'd seek into the middle of the file, synchronize with the stream, and process every lidar scanline that started within that chunk: exactly the algorithm described here. (Roughly the worker logic sketched below.)

Amusingly, this approach gave reasonable results even in the presence of significant corruption, where the manufacturer's own software would give up.
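For the curious, here's a minimal sketch of that resync-and-process loop. The marker value, record layout, and function names are made up for illustration; a real decoder would also sanity-check the header fields before trusting a marker hit.

```c
/* resync.c -- a minimal sketch of the resynchronize-then-process idea,
 * assuming a hypothetical 4-byte sync marker 0xFACADE42 before each
 * scanline. Real formats add header sanity checks on top of the marker. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define SYNC 0xFACADE42u

/* Return the first offset >= start where the marker occurs, or -1. */
static long find_sync(const uint8_t *buf, long start, long len)
{
    for (long i = start; i + 4 <= len; i++) {
        uint32_t w;
        memcpy(&w, buf + i, sizeof w);  /* unaligned-safe load */
        if (w == SYNC)
            return i;
    }
    return -1;
}

/* Each worker handles one chunk: seek in, sync up, and decode every
 * scanline whose marker falls inside [begin, end). A scanline may spill
 * past `end`; the next worker skips it because its marker lies before
 * that worker's chunk, so every scanline is decoded exactly once. */
static void process_chunk(const uint8_t *file, long file_len,
                          long begin, long end)
{
    long pos = find_sync(file, begin, file_len);
    while (pos >= 0 && pos < end) {
        long next = find_sync(file, pos + 4, file_len);
        long stop = (next >= 0) ? next : file_len;
        /* decode_scanline(file + pos, stop - pos); */
        printf("scanline at offset %ld, %ld bytes\n", pos, stop - pos);
        pos = next;
    }
}

int main(void)
{
    /* Three fake scanlines: a marker followed by zero-filled payload. */
    uint8_t file[64] = {0};
    uint32_t sync = SYNC;
    long offs[] = {0, 20, 44};
    for (int i = 0; i < 3; i++)
        memcpy(file + offs[i], &sync, sizeof sync);

    /* Two "parallel" chunks of 32 bytes each; each worker only owns
     * the scanlines that *start* in its chunk. */
    process_chunk(file, sizeof file, 0, 32);
    process_chunk(file, sizeof file, 32, 64);
    return 0;
}
```

The ownership rule (a scanline belongs to the chunk containing its marker) is what lets the workers run independently with no pre-indexing.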
Having skimmed the article (because $dayjob and all), I wonder how, or whether, their scheme copes with write(2) producing a short write, where not all of the data in the buffer is atomically committed to their POSIX-compliant backing store.

I don't see any mechanism described that ensures this never happens (say, by capping records at a length that can always be written atomically, and I'm not sure such a guarantee even exists for regular files), so I'm wondering how often that kind of thing happens on contemporary systems and, when it does, how many stored records get wrecked that way.
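For context, the usual userspace defense is a full-write loop like the sketch below (this is a standard idiom, not something the article describes). Note that it only retries short writes; it does nothing to make the commit atomic, which is exactly the gap the question above is poking at: a crash partway through the loop can still leave a torn record.

```c
/* write_all.c -- keep calling write(2) until the whole record has been
 * handed to the kernel. This guarantees the bytes are *submitted*, not
 * that they appear atomically in the file. */
#include <errno.h>
#include <unistd.h>

static int write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted before any byte moved */
            return -1;          /* genuine I/O error */
        }
        p += n;                 /* short write: advance and retry */
        len -= (size_t)n;
    }
    return 0;
}
```

(POSIX only promises atomicity for pipe writes up to PIPE_BUF; for regular files there's no portable size below which a write can't be torn.)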
Site appears to be down; archive link: https://archive.ph/https://pvk.ca/Blog/2021/01/11/stuff-your-logs/