Keeping duplication metadata around doesn't scale well, though it may be sufficient for their typical case. A strategy I've seen many times that scales better is to (1) only check for duplicates when some part of the system has a reason to believe duplicates could occur, on the assumption that duplicates are relatively infrequent, and (2) design the storage engine such that you can cheaply search previously ingested data for the record directly, without any extra deduplication metadata. Note that this is not as easy to achieve if your data infrastructure is a loosely coupled collection of arbitrary storage engines, databases, and processing pipelines -- which may be a practical limitation for the case in the article.

If the storage engine is well-designed for the data model, a duplicate check against existing data should only touch a handful of pages in the worst case; it is an inexpensive query that rarely or never touches the network, depending on the details. For ingestion of records where there is no risk of duplication, presumably the bulk of the time, this is a zero-overhead model since there is no deduplication state to be maintained or checked. For most scenarios that create potential duplicates, this model is also quite cache-friendly as a practical matter.

The pathological case for this design is when you need to check every single record for duplication (e.g. cleaning up a giant offline mess of agglomerated data that may contain an arbitrary number of duplicates), but those scenarios usually don't involve real-time stream ingestion.
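
To make the shape of the idea concrete, here is a minimal sketch. Everything in it is hypothetical -- the store, the record key, and the "possible duplicate" signal are illustrative stand-ins, not any particular system's API:

    # Illustrative sketch only: the storage engine, record key, and
    # duplicate-risk signal are hypothetical, not a real library's API.
    from dataclasses import dataclass

    @dataclass
    class Record:
        key: str       # natural key derived from the record itself
        payload: bytes

    class Store:
        """Toy stand-in for a storage engine with a key-ordered index,
        so a point lookup only touches a handful of pages."""
        def __init__(self):
            self._index = {}               # stand-in for a B-tree / LSM index

        def exists(self, key: str) -> bool:
            return key in self._index      # cheap point lookup against existing data

        def insert(self, record: Record) -> None:
            self._index[record.key] = record.payload

    def ingest(store: Store, record: Record, possible_duplicate: bool) -> None:
        """Only pay for a dedup lookup when upstream signals risk,
        e.g. a redelivery after an ack timeout or a resumed batch."""
        if possible_duplicate and store.exists(record.key):
            return                         # drop the duplicate
        store.insert(record)               # common path: no dedup state touched at all

The key design choice is that the duplicate check is driven by a signal from the delivery layer rather than by per-record metadata, so the common no-risk path does no extra work.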