In this article by the creator of Airflow (https://medium.com/@maximebeauchemin/functional-data-engineering-a-modern-paradigm-for-batch-data-processing-2327ec32c42a) it is mentioned that data should be partitioned by event processing time to land immutable blocks of data. How is this implemented?

For example, if I have a system where events enter some stream (e.g. Kafka or Kinesis), the data is periodically written to storage (e.g. S3 or other), and it is then batch processed on some schedule (e.g. Airflow), then there are multiple 'time' values to consider:

t1 -> time of the event occurring
t2 -> time of the event entering the stream
t3 -> time of persisting batch of events to storage
t4 -> time of the batch run (Airflow) for further processing

What is considered the "event processing" time in this case? How is a partition generated so that an immutable block of data can be landed predictably? Presumably there must be some deterministic pattern for generating batch runs so that the time partitions are immutable and so that backfill tasks can be generated.
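To make the question concrete, here is a rough sketch of how I currently imagine the "immutable block" part working (plain Python, with a dict standing in for S3; the daily window, the function names, and the partition path are my own assumptions, not taken from the article):

    # Sketch of my current understanding, not the article's implementation:
    # each scheduled run "owns" a fixed time window, and rerunning it overwrites
    # the same partition, so the landed block is effectively immutable and
    # backfills are deterministic (same logical date -> same window -> same block).
    from datetime import datetime, timedelta, timezone

    def partition_window(logical_date: datetime) -> tuple[datetime, datetime]:
        """A daily run owns events whose time falls in [logical_date, logical_date + 1 day)."""
        start = logical_date.replace(hour=0, minute=0, second=0, microsecond=0)
        return start, start + timedelta(days=1)

    def land_partition(logical_date: datetime, raw_events: list[dict], storage: dict) -> None:
        """Idempotent landing: filter by event time and overwrite the whole partition."""
        start, end = partition_window(logical_date)
        block = [e for e in raw_events if start <= e["event_time"] < end]
        storage[f"events/ds={start:%Y-%m-%d}"] = block  # full overwrite, never append

    if __name__ == "__main__":
        events = [
            {"id": 1, "event_time": datetime(2024, 1, 1, 23, 59, tzinfo=timezone.utc)},
            {"id": 2, "event_time": datetime(2024, 1, 2, 0, 1, tzinfo=timezone.utc)},
        ]
        storage: dict = {}
        # A backfill is just re-running land_partition for past logical dates;
        # each date always maps to the same window, so the same block is rebuilt.
        for day in (datetime(2024, 1, 1, tzinfo=timezone.utc),
                    datetime(2024, 1, 2, tzinfo=timezone.utc)):
            land_partition(day, events, storage)
        print(storage)  # each day's partition holds only that day's events

In this sketch I keyed the window to the event time (t1), but I'm not sure whether that is what the article means by "event processing time", or whether the partition should be keyed by one of t2-t4 instead.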