Git scraping: track changes over time by scraping to a Git repository (2020)

166 points by ekiauhce almost 2 years ago

21 comments

simonw almost 2 years ago
I've been promoting this idea for a few years now, and I've seen an increasing number of people put it into action.

A fun way to track how people are using this is with the git-scraping topic on GitHub:

https://github.com/topics/git-scraping?o=desc&s=updated

That page orders repos tagged git-scraping by most-recently-updated, which shows which scrapers have run most recently.

As I write this, just in the last minute repos that updated include:

queensland-traffic-conditions: https://github.com/drzax/queensland-traffic-conditions

bbcrss: https://github.com/jasoncartwright/bbcrss

metrobus-timetrack-history: https://github.com/jackharrhy/metrobus-timetrack-history

bchydro-outages: https://github.com/outages/bchydro-outages
theultdev almost 2 years ago
I did this when I was a kid, decompiling a flash game client for an MMO (Tibia).

By itself a single decompile was hard to parse, but if you do it for each release, commit the decompiled sources, and diff them, you can easily see code changes.

So you just run a script to poll for a new client version to drop and automatically download, decompile, commit, and tag.

I'd have a diff of the client changes immediately, allowing insight into the protocol changes to update the private game server code to support it.
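A rough shell sketch of that poll/decompile/commit loop (the version URL and the decompiler invocation are hypothetical stand-ins, not the actual tooling described above):

    #!/bin/sh
    # Poll for a new client version; do nothing if it hasn't changed.
    latest=$(curl -s https://example.com/client/version.txt)
    [ "$latest" = "$(cat last_version.txt 2>/dev/null)" ] && exit 0

    # Download the new client and decompile it into the tracked tree.
    curl -s -o client.swf "https://example.com/client/client-$latest.swf"
    ffdec -export script decompiled/ client.swf  # JPEXS decompiler CLI, as one option

    # Commit and tag so each release is a diffable point in history.
    echo "$latest" > last_version.txt
    git add decompiled/ last_version.txt
    git commit -m "client $latest"
    git tag "client-$latest"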
downWidOutaFite almost 2 years ago
This is cool, but the name is confusing. First of all, git is not being scraped, nor is git being used to do any scraping; git is only used as the storage format for the snapshots. Second, there is no scraping happening at all. Scraping is when you parse a file intended for human display in order to extract the embedded unstructured data. The examples given are about periodically downloading an already-structured JSON file and uploading it to GitHub. No parsing is happening, unless you count when he manually searches for the JSON file in the browser dev tools.
mr_ndrsn almost 2 years ago
This looks very cool!

Please consider adding a user agent string with a link to the repo or some Google-able name to your curl call; it can help site operators get in touch with you if it starts to misbehave somehow.
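For example (the repo URL is a placeholder):

    curl --silent \
         --user-agent "my-git-scraper (+https://github.com/youruser/yourrepo)" \
         https://example.com/data.json > data.json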
bobek almost 2 years ago
I use this approach for monitoring open ports in our infrastructure: running masscan and committing the results to a git repo. If there are changes, a merge request is opened for review. During the review, one would investigate the actual server and why there was a change in its open ports.

https://github.com/bobek/masscan_as_a_service
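A sketch of that flow, assuming a GitLab-style setup (the scan range and the `glab` call are stand-ins, not necessarily what the repo above uses):

    #!/bin/sh
    # Scan, record the results, and request review only when something changed.
    masscan -p1-65535 10.0.0.0/8 --rate 1000 -oL scan-results.txt

    git checkout -b "scan-$(date +%Y%m%d)"
    git add scan-results.txt
    git diff --cached --quiet && exit 0   # no change in open ports, nothing to review
    git commit -m "masscan results $(date +%Y-%m-%d)"
    git push -u origin HEAD
    glab mr create --fill                 # open the merge request for review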
powersnail almost 2 years ago
> The implementation of the scraper is entirely contained in a single GitHub Actions workflow.

It's interesting that you can run a scraper at fixed intervals on a free, hosted CI like that. If the scraped content is larger, more than a single JSON file, will GitHub have a problem with it?
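For reference, a minimal workflow in that spirit looks roughly like this (a sketch modeled on the pattern the article describes, not Simon's exact file; the URL is a placeholder):

    name: Scrape latest data
    on:
      workflow_dispatch:
      schedule:
        - cron: '6,26,46 * * * *'  # the article's offset-from-the-hour schedule
    jobs:
      scrape:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
          - name: Fetch latest data
            run: curl -s https://example.com/data.json -o data.json
          - name: Commit if changed
            run: |
              git config user.name "Automated"
              git config user.email "actions@users.noreply.github.com"
              git add data.json
              git commit -m "Latest data: $(date -u)" || exit 0
              git push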
ojkelly almost 2 years ago
It's probably not a coincidence that the other place I've seen this technique was also for archiving a feed of fires.

In that case the data was about 250 GB when fully uncompressed, and IIRC under a gig when stored as a git repo.

It's a really neat idea, though it can make analysis of the data harder to do, in particular quality control (the aforementioned dataset had a lot of duplicates and inconsistency).

Like everything, it's a process of trading off between compute and storage, in this case optimising storage.
albert_e almost 2 years ago
In the past I had to hunt down when a particular product's public documentation web pages were updated by the product team to add disclaimers and limitations.

This would have helped so much. Bookmarking this tool. Maybe I will get around to setting this up for that docs site.

Maybe all larger documentation sites should have a public history like this -- if not volunteered by the maintainers themselves, then through git-scraping by the community.
pdimitar almost 2 years ago
Funnily enough I do something very close to this with the RFC database at rfc-editor.org; here's the script that I have put in my `cron`:

    pushd ~/data/rfc  # this is a GIT repo
    rsync -avzuh --delete --progress --exclude=.git ftp.rfc-editor.org::rfcs-text-only ~/data/rfc/text-only/
    rsync -avzuh --delete --progress --exclude=.git ftp.rfc-editor.org::refs ~/data/rfc/refs
    rsync -avzuh --delete --progress --exclude=.git ftp.rfc-editor.org::rfcs-pdf-only ~/data/rfc/pdf-only/
    git add .
    git commit -m "update $(date '+%Y-%m-%d')"
    git reflog expire --expire=now --all
    git gc --prune=now --aggressive
    git push origin master
    popd

Though I admit using GitHub's servers for this is more clever than me using one of my home servers. Still, I lean more towards self-hosting.

@simonw Will take a look at `git-history`, looks intriguing!
great_psy almost 2 years ago
> It runs on a schedule at 6, 26 and 46 minutes past the hour—I like to offset my cron times like this since I assume that the majority of crons run exactly on the hour, so running not-on-the-hour feels polite.

Not sure how much of a difference it makes to the underlying service, but I will also do this with my scraping.

Thank you for pointing that out.
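In plain crontab syntax, that offset schedule is just:

    # three runs per hour, avoiding the crowded :00 boundary
    6,26,46 * * * *  /path/to/scrape.sh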
Zufriedenheit almost 2 years ago
I use this to aggregate some RSS feeds, and also to generate a feed out of the HTML from sites that don't have one. Then I just publish the result as GitHub Pages and add that link to my reader. Thanks for this instruction, it got me going on that idea.
kissgyorgy almost 2 years ago
One of my friends is doing this to track modifications to Hungarian law: https://github.com/badicsalex/torvenyek

He has tools for parsing them written in Rust: https://github.com/badicsalex/hun_law_rs

and Python: https://github.com/badicsalex/hun_law_py

I'm doing it myself to track changes to my GitHub stars: https://github.com/kissgyorgy/my-stars
Ayesh almost 2 years ago
I have a couple of similar scrapers as well. One is a private repo where I collect visa information from Wikipedia (for Visalogy.com), and another collects GeoIP information from the MaxMind database (used with their permission).

https://github.com/Ayesh/Geo-IP-Database/

It downloads the source data, dumps it split by the first 8 bytes of the IP address, and saves the pieces to individual JSON files. For every new scraper run, it creates a new tag and pushes it as a package, so dependents can simply update it with their dependency manager.
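The tag-per-run part might look something like this (the tag scheme here is a guess for illustration, not necessarily what Ayesh uses):

    git add data/
    git commit -m "Data refresh $(date -u +%Y-%m-%d)"
    git tag "v$(date -u +%Y.%m.%d)"
    git push origin master --tags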
TacticalCoder almost 2 years ago
So it's basically using Git as an "append-only" (no update-in-place) database to then do time queries? It's not the first time I've seen people use Git that way.

EDIT: hmmm, I realize that in addition to that it's also a way to not have to write specific queries over time: the diff takes care of finding everything that changed (i.e. you don't have to say "I want to see how this and that value changed over time": the diff does it all). Nice.
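Stock git already covers most of the time-query surface; for anything fancier, simonw's git-history tool (mentioned above) converts such a repo into a queryable SQLite database. For example:

    # Full change history of one scraped file, newest first
    git log -p -- data.json

    # What the file looked like 100 commits ago
    git show 'HEAD~100:data.json'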
nomilk almost 2 years ago
Love that the author provides a 5 minute video explaining the purpose and how he did it: https://www.youtube.com/watch?v=2CjA-03yK8I
Bu9818 almost 2 years ago
The core idea, I believe, is tracking incremental changes and keeping the past history of items. Git is good for text, though for large amounts of binary data I would recommend filesystem snapshots, such as with btrfs.
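For the btrfs route, a timestamped read-only snapshot per scrape would be something like (paths are placeholders):

    # Create a read-only, timestamped snapshot of the data subvolume
    btrfs subvolume snapshot -r /data "/snapshots/data-$(date +%Y%m%d%H%M)"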
modestygrime almost 2 years ago
I use the same technique to maintain a JSON file mapping Slack channel names to channel IDs, as Slack for some reason doesn't have an API endpoint for getting a channel ID from its name.
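One way such a map could be rebuilt on each run, using curl and jq against Slack's conversations.list endpoint (a sketch; cursor-based pagination is omitted for brevity):

    # Rebuild the channel-name -> channel-ID map and commit it if it changed
    curl -s -H "Authorization: Bearer $SLACK_TOKEN" \
         "https://slack.com/api/conversations.list?limit=1000" \
         | jq '[.channels[] | {(.name): .id}] | add' > slack-channels.json
    git add slack-channels.json
    git commit -m "Update Slack channel map" || true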
muunbo almost 2 years ago
My mind is so friggin blown: GitHub will run arbitrary cron jobs for you?! Can't believe other services make you pay for that.
a-dub almost 2 years ago
Some covid datasets were published as git repositories. The cool part was that it added a publish-date dimension for historical data, so that one could understand how long it took for historical counts to reach a steady state.
alphanumeric0 almost 2 years ago
What's the benefit of this versus a time series database?
fragmede almost 2 years ago
(2020)