I've been promoting this idea for a few years now, and I've seen an increasing number of people put it into action.<p>A fun way to track how people are using this is with the git-scraping topic on GitHub:<p><a href="https://github.com/topics/git-scraping?o=desc&s=updated">https://github.com/topics/git-scraping?o=desc&s=updated</a><p>That page orders repos tagged git-scraping by most-recently-updated, which shows which scrapers have run most recently.<p>As I write this, just in the last minute repos that updated include:<p>queensland-traffic-conditions: <a href="https://github.com/drzax/queensland-traffic-conditions">https://github.com/drzax/queensland-traffic-conditions</a><p>bbcrss: <a href="https://github.com/jasoncartwright/bbcrss">https://github.com/jasoncartwright/bbcrss</a><p>metrobus-timetrack-history: <a href="https://github.com/jackharrhy/metrobus-timetrack-history">https://github.com/jackharrhy/metrobus-timetrack-history</a><p>bchydro-outages: <a href="https://github.com/outages/bchydro-outages">https://github.com/outages/bchydro-outages</a>
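The same listing is available over the GitHub search API if you'd rather watch it from a script; a quick sketch (unauthenticated calls are rate-limited, and the jq field names assume the standard repository-search response):<p><pre><code>curl -s "https://api.github.com/search/repositories?q=topic:git-scraping&sort=updated&order=desc" \
  | jq -r '.items[:10][] | .full_name + "  " + .updated_at'
</code></pre>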
I did this when I was a kid, decompiling a flash game client for an MMO (Tibia).<p>By itself a single decompile was hard to parse, but if you do it for each release, commit the decompiled sources, and diff them you can easily see code changes.<p>So you just run a script to poll for a new client version to drop and automatically download, decompile, commit, and tag.<p>I'd have a diff of the client changes immediately, allowing insight into the protocol changes to update the private game server code to support it.
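A rough sketch of that polling loop in shell (the endpoint and decompiler command are placeholders, not the real Tibia update mechanism):<p><pre><code>latest=$(curl -s https://example.com/client/version.txt)
if [ "$latest" != "$(cat last_version.txt 2>/dev/null)" ]; then
  curl -s -o client.swf "https://example.com/client/client-$latest.swf"
  some-flash-decompiler client.swf -o src/     # placeholder for whatever decompiler you use
  git add src/
  git commit -m "client $latest"
  git tag "client-$latest"
  echo "$latest" > last_version.txt
fi
</code></pre>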
This is cool but the name is confusing. First of all, git is not being scraped, nor is git being used to do any scraping; git is only used as the storage format for the snapshots. Second, there is no scraping happening at all. Scraping is when you parse a file intended for human display in order to extract the embedded unstructured data. The examples given are about periodically downloading an already-structured JSON file and uploading it to GitHub. No parsing is happening, unless you count when he manually searches for the JSON file in the browser dev tools.
This looks very cool!<p>Please consider adding a user agent string with a link to the repo, or some Google-able name, to your curl call; it can help site operators get in touch with you if it starts to misbehave somehow.
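Something like this, for example (a sketch; the scraper name, repo URL and target URL are placeholders):<p><pre><code>curl --silent \
  --user-agent "my-git-scraper (+https://github.com/example/my-git-scraper)" \
  "https://example.com/data.json" > data.json
</code></pre>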
I use this approach for monitoring open ports in our infrastructure -- running masscan and committing the results to a git repo. If there are changes, it opens a merge request for review. During the review, one investigates the affected server to find out why its open ports changed.<p><a href="https://github.com/bobek/masscan_as_a_service">https://github.com/bobek/masscan_as_a_service</a>
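The loop looks roughly like this (a sketch, not the linked repo's actual pipeline; the target range, rate and -oL column handling are assumptions):<p><pre><code>masscan -p1-65535 --rate 1000 -oL raw.txt 10.0.0.0/24
grep '^open' raw.txt | awk '{print $1, $2, $3, $4}' | sort > openports.txt  # drop per-run timestamps
git add openports.txt
if ! git diff --cached --quiet; then          # something changed: commit on a branch for review
  git checkout -b "portscan-$(date +%Y%m%d)"
  git commit -m "port scan $(date +%Y-%m-%d)"
  git push origin HEAD                        # then open the merge request from this branch
fi
</code></pre>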
> The implementation of the scraper is entirely contained in a single GitHub Actions workflow.<p>It's interesting that you can run a scraper at fixed intervals on a free, hosted CI like that. If the scraped content is larger, more than a single JSON file, will GitHub have a problem with it?
It’s probably not a coincidence that the other place I’ve seen this technique was also for archiving a feed of fires.<p>In that case the data was about 250 GB when fully uncompressed, and IIRC under a gig when stored as a git repo.<p>It’s a really neat idea, though it can make analysis of the data harder to do, in particular quality control (the aforementioned dataset had a lot of duplicates and inconsistency).<p>Like everything it’s a process of trading off between compute and storage, in this case optimising for storage.
In the past I had to hunt down when a particular product's public documentation web pages were updated by the product team to add disclaimers and limitations.<p>This would have helped so much. Bookmarking this tool. Maybe I will get around to setting this up for this docs site.<p>Maybe all larger documentation sites should have a public history like this -- if not volunteered by the maintainers themselves, then through git-scraping by the community.
Funnily enough I do something very close to this with the RFC database at rfc-editor.org, here's the script that I have put in my `cron`:<p><pre><code> pushd ~/data/rfc # this is a GIT repo
rsync -avzuh --delete --progress --exclude=.git ftp.rfc-editor.org::rfcs-text-only ~/data/rfc/text-only/
rsync -avzuh --delete --progress --exclude=.git ftp.rfc-editor.org::refs ~/data/rfc/refs
rsync -avzuh --delete --progress --exclude=.git ftp.rfc-editor.org::rfcs-pdf-only ~/data/rfc/pdf-only/
git add .
git commit -m "update $(date '+%Y-%m-%d')"
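# expire old reflog entries and repack aggressively to keep the local repo small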
git reflog expire --expire=now --all
git gc --prune=now --aggressive
git push origin master
popd
</code></pre>
Though I admit using GitHub's servers for this is more clever than me using one of my home servers. Still, I lean more to self-hosting.<p>@simonw Will take a look at `git-history`, looks intriguing!
> It runs on a schedule at 6, 26 and 46 minutes past the hour—I like to offset my cron times like this since I assume that the majority of crons run exactly on the hour, so running not-on-the-hour feels polite.
<p>Not sure how much of a difference it makes to the underlying service, but I will also do this with my scraping.<p>Thank you for pointing that out.
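In plain crontab syntax the same offset looks like this (the script path is a placeholder):<p><pre><code># run at 6, 26 and 46 minutes past each hour instead of on the hour
6,26,46 * * * * /home/me/scrapers/fetch-and-commit.sh
</code></pre>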
I use this to aggregate some RSS feeds. Also to generate a feed out of the HTML from sites that don't have a feed. Then i just publish the result as GitHub Pages and add this link to my reader.
Thanks for these instructions, they got me going on that idea.
One of my friends is doing this to track Hungarian law modifications:
<a href="https://github.com/badicsalex/torvenyek">https://github.com/badicsalex/torvenyek</a><p>He has tools for parsing them written in Rust: <a href="https://github.com/badicsalex/hun_law_rs">https://github.com/badicsalex/hun_law_rs</a><p>and Python: <a href="https://github.com/badicsalex/hun_law_py">https://github.com/badicsalex/hun_law_py</a><p>I'm doing it myself tracking my GitHub star changes: <a href="https://github.com/kissgyorgy/my-stars">https://github.com/kissgyorgy/my-stars</a>
I have a couple of similar scrapers as well. One is a private repo where I collect visa information off Wikipedia (for Visalogy.com), and another collects GeoIP information from the MaxMind database (used with their permission).<p><a href="https://github.com/Ayesh/Geo-IP-Database/">https://github.com/Ayesh/Geo-IP-Database/</a><p>It downloads the repo, dumps the data split by the first 8 bytes of the IP address, and saves it to individual JSON files. For every new scraper run, it creates a new tag and pushes it as a package, so dependents can simply update it with their dependency manager.
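The tag-and-push step is roughly this (a sketch; the date-based version scheme here is made up, not necessarily what that repo uses):<p><pre><code>git add data/
git commit -m "Data refresh $(date +%Y-%m-%d)"
git tag "v$(date +%Y.%m.%d)"     # dependents pin or update to these tags via their dependency manager
git push origin master --tags
</code></pre>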
So it's basically using Git as an "append-only" (no update-in-place) database to then do time queries? It's not the first time I've seen people use Git that way.<p>EDIT: hmmm I realize that in addition to that, it's also a way to avoid having to write specific queries over time: the diff takes care of finding everything that changed (<i>i.e.</i> you don't have to say <i>"I want to see how this and that values changed over time"</i>: the diff does it all). Nice.
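Concretely, since each snapshot is a commit touching the same file, the whole history is one command away (data.json is a placeholder name):<p><pre><code>git log -p --follow -- data.json                # every recorded change to the file, with diffs
git log --since="1 week ago" -p -- data.json    # just the changes from the last week
</code></pre>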
Love that the author provides a 5 minute video explaining the purpose and how he did it: <a href="https://www.youtube.com/watch?v=2CjA-03yK8I">https://www.youtube.com/watch?v=2CjA-03yK8I</a>
The core idea, I believe, is tracking incremental changes and keeping the past history of items. Git is good for text, though for large amounts of binary data I would recommend filesystem snapshots, like with btrfs.
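For the btrfs route, a read-only snapshot per scraper run would look roughly like this (paths are placeholders):<p><pre><code># take a read-only, timestamped snapshot of the subvolume holding the scraped data
btrfs subvolume snapshot -r /data/scrape "/data/snapshots/scrape-$(date +%Y%m%d-%H%M)"
</code></pre>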
I use the same technique to maintain a json file mapping Slack channel names to channel IDs, as Slack for some reason doesn't have an API endpoint for getting a channel ID from its name.
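A sketch of how such a mapping file can be rebuilt on each run, assuming a bot token with the right scopes (pagination is omitted here):<p><pre><code>curl -s -H "Authorization: Bearer $SLACK_TOKEN" \
  "https://slack.com/api/conversations.list?limit=1000" \
  | jq '[.channels[] | {(.name): .id}] | add' > channels.json
git add channels.json
git diff --cached --quiet || git commit -m "Update channel map $(date +%Y-%m-%d)"
</code></pre>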
Some COVID datasets were published as git repositories. The cool part was that this added a publish-date dimension to the historical data, so one could understand how long it took for historical counts to reach a steady state.