I have to download a dataset through an API (a WFS provided by GeoServer) that reports the total number of items, delivers at most 1,000 items per request, and lets me sort by one field and offset each request's start index. The layer has ~1 million items, and I can run at most 5 parallel requests before the API gets overloaded.

The problem is that items are added and removed in real time, so by the end of the copy process I already have stale data, and there are new items still to be copied over.
So what would you do, or what have you done, in this situation?
Would starting a never-ending loop to crawl the data all day long be something evil, or is this something to be fixed on the provider's side?

The API URL is https://geoserver.car.gov.br/geoserver/sicar/wfs

Source data website: https://consultapublica.car.gov.br/publico/imoveis/index
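For reference, a minimal sketch of the paged download I'm describing, in Python. The layer name and sort field are placeholders, not the actual SICAR identifiers, and the count is read from the numberMatched attribute that a WFS 2.0 hits response carries:

    import re
    import requests
    from concurrent.futures import ThreadPoolExecutor

    WFS_URL = "https://geoserver.car.gov.br/geoserver/sicar/wfs"
    TYPE_NAME = "sicar:some_layer"  # placeholder -- use the real layer name
    SORT_FIELD = "id"               # placeholder -- any stable, unique attribute
    PAGE_SIZE = 1000                # server-side cap per request
    MAX_PARALLEL = 5                # API tolerates at most 5 parallel requests

    def total_count():
        # resultType=hits returns just the match count, no features
        params = {"service": "WFS", "version": "2.0.0", "request": "GetFeature",
                  "typeNames": TYPE_NAME, "resultType": "hits"}
        r = requests.get(WFS_URL, params=params, timeout=60)
        r.raise_for_status()
        return int(re.search(r'numberMatched="(\d+)"', r.text).group(1))

    def fetch_page(start_index):
        params = {"service": "WFS", "version": "2.0.0", "request": "GetFeature",
                  "typeNames": TYPE_NAME, "count": PAGE_SIZE,
                  "startIndex": start_index,
                  "sortBy": SORT_FIELD,  # stable ordering so pages don't overlap
                  "outputFormat": "application/json"}
        r = requests.get(WFS_URL, params=params, timeout=120)
        r.raise_for_status()
        return r.json()["features"]

    def full_scrape():
        offsets = range(0, total_count(), PAGE_SIZE)
        features = []
        with ThreadPoolExecutor(max_workers=MAX_PARALLEL) as pool:
            for page in pool.map(fetch_page, offsets):
                features.extend(page)
        return features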
I currently only have my phone, so I can't inspect the API. From my point of view, a full scrape at regular intervals is not that bad. It's only ~1,000 requests. Depending on the data and the query methods, you can make fresh data appear sooner than you remove old data, as in the sketch below.
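For example, if the layer happens to expose a last-modified attribute (an assumption; I haven't checked the SICAR schema), GeoServer's CQL_FILTER vendor parameter can narrow each request to recently changed features, so fresh data shows up between full scrapes. A sketch reusing the names from the snippet above:

    def fetch_changed_since(last_sync_iso):
        # 'data_atualizacao' is a hypothetical timestamp attribute --
        # check the layer's DescribeFeatureType response for the real schema.
        params = {"service": "WFS", "version": "2.0.0", "request": "GetFeature",
                  "typeNames": TYPE_NAME, "count": PAGE_SIZE,
                  "sortBy": "data_atualizacao",
                  "outputFormat": "application/json",
                  # CQL_FILTER is a GeoServer vendor extension, not core WFS
                  "CQL_FILTER": f"data_atualizacao > '{last_sync_iso}'"}
        r = requests.get(WFS_URL, params=params, timeout=120)
        r.raise_for_status()
        return r.json()["features"]  # page with startIndex as above if > 1000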
The major question is: how fresh do you need your data?

Not every application needs real-time data; querying it only on occasion or every few hours can be good enough.
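If a few hours of staleness is acceptable, a periodic full re-sync also handles removals for free: diff the feature IDs between runs. A minimal sketch on top of full_scrape() above (GeoServer's GeoJSON output normally carries a top-level id per feature, but verify that against the actual layer):

    import time

    SYNC_INTERVAL = 6 * 3600  # seconds between re-syncs; tune to freshness needs

    def sync_forever(store):
        # store: dict mapping feature id -> feature, kept from the previous run
        while True:
            fresh = {f["id"]: f for f in full_scrape()}
            for stale_id in set(store) - set(fresh):
                del store[stale_id]   # item disappeared upstream
            store.update(fresh)       # adds new items, refreshes existing ones
            time.sleep(SYNC_INTERVAL)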