To point out the obvious: generally API providers don’t particularly want you to parallelize your requests (they even implement rate limiting to make it harder on purpose). If they wanted to make it easy to get all the results, they would let you access the data without pagination and just download everything in one go.
A few thoughts:<p>1) AWS DynamoDB has a parallel scan feature for this exact use case. <a href="https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Scan.html#Scan.ParallelScan" rel="nofollow">https://docs.aws.amazon.com/amazondynamodb/latest/developerg...</a><p>2) A typical database already maintains an approximately balanced b-tree for every index internally. It should therefore in principle be cheap for the database to return a list of keys that divide the key range into N roughly equal ranges, even if the key distribution is very uneven. Is anybody aware of a way to obtain this information from a query in e.g. Postgres?<p>3) The term 'cursor pagination' is used for different things, sometimes referring to the in-database concept of a cursor, sometimes to an opaque pagination token. For the concept described in the article, I have therefore come to prefer the term keyset pagination, as described in <a href="https://www.citusdata.com/blog/2016/03/30/five-ways-to-paginate/" rel="nofollow">https://www.citusdata.com/blog/2016/03/30/five-ways-to-pagin...</a>. The term keyset pagination makes it clear that we are paginating using conditions on a set of columns that form a unique key for the table.
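A minimal keyset-pagination sketch, in case it helps (Python with psycopg2; the `items` table and its `(created, id)` unique key are hypothetical, not from the article):

    # Keyset pagination sketch: page through `items` using a unique (created, id) key.
    # Table and column names are illustrative; assumes psycopg2 is installed.
    import psycopg2

    conn = psycopg2.connect("dbname=example")

    def fetch_page(last_created=None, last_id=None, page_size=100):
        with conn.cursor() as cur:
            if last_created is None:
                cur.execute(
                    "SELECT created, id, payload FROM items "
                    "ORDER BY created, id LIMIT %s",
                    (page_size,))
            else:
                # Row-value comparison expresses the condition over the whole unique key.
                cur.execute(
                    "SELECT created, id, payload FROM items "
                    "WHERE (created, id) > (%s, %s) "
                    "ORDER BY created, id LIMIT %s",
                    (last_created, last_id, page_size))
            return cur.fetchall()

Each page's last row supplies the `(last_created, last_id)` pair for the next call, so no offsets are involved and the index can seek directly to the start of each page.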
I believe data export and/or backup should be a separate API, one that runs at low priority and guarantees consistency.<p>Here we just see regular APIs being abused for data export. I'm rather surprised the author didn't run into rate limiting.
I think keeping temporal history and restricting paginated results to the data as it existed when the first page was retrieved would be a pretty decent way to fix offset-based <i>interfaces</i> (regardless of the complexity of making the query implementation efficient). Data with a lot of churn could churn on, but clients would see a consistent view until they return to the point of entry.<p>Obviously this has some potential caveats if that churn is also likely to quickly invalidate data or revoke sensitive information. Time limits on historical data retrieval can be imposed to help mitigate this, and individual records can still be revised (e.g. with bitemporal modeling) without altering the set of referenced records.
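A rough sketch of that idea, assuming a hypothetical temporal table `items_history` with `valid_from`/`valid_to` columns and a psycopg2-style cursor:

    # Pin a snapshot timestamp when the first page is requested, then filter every
    # subsequent page against it so the client keeps seeing a consistent view.
    # Table and column names are illustrative only.
    from datetime import datetime, timezone

    def first_page(cur, page_size=100):
        snapshot = datetime.now(timezone.utc)  # frozen for the whole traversal
        return snapshot, _page(cur, snapshot, last_id=0, page_size=page_size)

    def _page(cur, snapshot, last_id, page_size):
        cur.execute(
            "SELECT id, payload FROM items_history "
            "WHERE valid_from <= %s AND (valid_to IS NULL OR valid_to > %s) "
            "AND id > %s ORDER BY id LIMIT %s",
            (snapshot, snapshot, last_id, page_size))
        return cur.fetchall()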
Pagination of an immutable collection is one thing and can be parallelized. Pagination of a mutable collection (e.g. a database table), on the other hand, is risky, since two requests might return overlapping data if new rows were added between them.<p>Truly consistent result sets require relative page tokens, and a synchronization mechanism if the software demands it.
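For the immutable case, a sketch of splitting a numeric key range across workers (`fetch_range` is a hypothetical call returning all items with keys in [lo, hi)):

    # Parallel pagination over an immutable snapshot: divide the key range into
    # disjoint sub-ranges and fetch them concurrently.
    from concurrent.futures import ThreadPoolExecutor

    def fetch_all(fetch_range, key_min, key_max, workers=8):
        step = max(1, (key_max - key_min) // workers)
        bounds, lo = [], key_min
        for i in range(workers):
            hi = key_max if i == workers - 1 else min(key_max, lo + step)
            bounds.append((lo, hi))
            lo = hi
        with ThreadPoolExecutor(max_workers=workers) as pool:
            chunks = pool.map(lambda b: fetch_range(*b), bounds)
        return [item for chunk in chunks for item in chunk]

Because the sub-ranges are disjoint and the collection cannot change underneath the workers, no item is returned twice or missed.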
It's important here that "created" is an <i>immutable</i> attribute. Otherwise you could get issues where the same item appears on multiple lists (or doesn't appear at all) because its attributes changed during the scanning process.
I think you could accomplish something similar with token pagination by requesting enough items to fill multiple "pages" of your user interface, then requesting additional items as the user pages through. This isn't parallelizing, but it provides the same low-latency user experience.
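A sketch of that prefetching approach (everything here, including `api_list` and the sizes, is made up for illustration):

    # Fetch a large batch via token pagination and slice it into smaller UI pages,
    # requesting the next batch before the user reaches the end of what's loaded.
    UI_PAGE = 20
    BATCH = 100  # several UI pages per API request

    class Prefetcher:
        def __init__(self, api_list):
            self.api_list = api_list  # hypothetical: returns (items, next_token)
            self.items, self.token, self.exhausted = [], None, False

        def ui_page(self, page_index):
            needed = (page_index + 2) * UI_PAGE  # keep one page of lookahead
            while len(self.items) < needed and not self.exhausted:
                batch, self.token = self.api_list(limit=BATCH, page_token=self.token)
                self.items.extend(batch)
                self.exhausted = self.token is None
            start = page_index * UI_PAGE
            return self.items[start:start + UI_PAGE]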
> it uses offsets for pagination... understood to be bad practice by today’s standards. Although convenient to use, offsets are difficult to keep performant in the backend<p>This is funny. Using offsets is known to be bad practice because... it’s hard to do.<p>Look, I’m just a UI guy, so what do I know. But this argument gets old because, I’m sorry, people want a paginated list and to know how many pages are in it. Clicking “next page” 10 times instead of jumping to page 10 is bullshit, and users know it.