Git scraping: track changes over time by scraping to a Git repository (2020)

166 points by ekiauhce almost 2 years ago

21 comments

simonw almost 2 years ago
I've been promoting this idea for a few years now, and I've seen an increasing number of people put it into action.

A fun way to track how people are using this is with the git-scraping topic on GitHub:

https://github.com/topics/git-scraping?o=desc&s=updated

That page orders repos tagged git-scraping by most-recently-updated, which shows which scrapers have run most recently.

As I write this, just in the last minute repos that updated include:

queensland-traffic-conditions: https://github.com/drzax/queensland-traffic-conditions

bbcrss: https://github.com/jasoncartwright/bbcrss

metrobus-timetrack-history: https://github.com/jackharrhy/metrobus-timetrack-history

bchydro-outages: https://github.com/outages/bchydro-outages
theultdev almost 2 years ago
I did this when I was a kid, decompiling a flash game client for an MMO (Tibia).

By itself a single decompile was hard to parse, but if you do it for each release, commit the decompiled sources, and diff them, you can easily see code changes.

So you just run a script to poll for a new client version to drop and automatically download, decompile, commit, and tag.

I'd have a diff of the client changes immediately, allowing insight into the protocol changes to update the private game server code to support it.
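A rough shell sketch of that poll/decompile/commit loop (the version URL and the decompiler invocation are hypothetical stand-ins, not the actual tooling described above):

    #!/bin/sh
    # Poll for a new client version; do nothing if it hasn't changed.
    latest=$(curl -s https://example.com/client/version.txt)
    [ "$latest" = "$(cat last_version.txt 2>/dev/null)" ] && exit 0

    # Download the new client and decompile it into the tracked tree.
    curl -s -o client.swf "https://example.com/client/client-$latest.swf"
    ffdec -export script decompiled/ client.swf  # JPEXS decompiler CLI, as one option

    # Commit and tag so each release is a diffable point in history.
    echo "$latest" > last_version.txt
    git add decompiled/ last_version.txt
    git commit -m "client $latest"
    git tag "client-$latest"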
downWidOutaFite almost 2 years ago
This is cool, but the name is confusing. First of all, git is not being scraped, nor is git being used to do any scraping; git is only used as the storage format for the snapshots. Second, there is no scraping happening at all. Scraping is when you parse a file intended for human display in order to extract the embedded unstructured data. The examples given are about periodically downloading an already-structured JSON file and uploading it to GitHub. No parsing is happening, unless you count when he manually searches for the JSON file in the browser dev tools.
mr_ndrsn almost 2 years ago
This looks very cool!

Please consider adding a user agent string with a link to the repo or some Google-able name to your curl call; it can help site operators get in touch with you if it starts to misbehave somehow.
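For example (the repo URL is a placeholder):

    curl --silent \
         --user-agent "my-git-scraper (+https://github.com/youruser/yourrepo)" \
         https://example.com/data.json > data.json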
bobek almost 2 years ago
I use this approach for monitoring open ports in our infrastructure: running masscan and committing the results to a git repo. If there are changes, a merge request is opened for review. During the review, one would investigate the actual server and why there was a change in its open ports.

https://github.com/bobek/masscan_as_a_service
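A sketch of that flow, assuming a GitLab-style setup (the scan range and the `glab` call are stand-ins, not necessarily what the repo above uses):

    #!/bin/sh
    # Scan, record the results, and request review only when something changed.
    masscan -p1-65535 10.0.0.0/8 --rate 1000 -oL scan-results.txt

    git checkout -b "scan-$(date +%Y%m%d)"
    git add scan-results.txt
    git diff --cached --quiet && exit 0   # no change in open ports, nothing to review
    git commit -m "masscan results $(date +%Y-%m-%d)"
    git push -u origin HEAD
    glab mr create --fill                 # open the merge request for review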
powersnail almost 2 years ago
> The implementation of the scraper is entirely contained in a single GitHub Actions workflow.

It's interesting that you can run a scraper at fixed intervals on a free, hosted CI like that. If the scraped content is larger, more than a single JSON file, will GitHub have a problem with it?
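For reference, a minimal workflow in that spirit looks roughly like this (a sketch modeled on the pattern the article describes, not Simon's exact file; the URL is a placeholder):

    name: Scrape latest data
    on:
      workflow_dispatch:
      schedule:
        - cron: '6,26,46 * * * *'  # the article's offset-from-the-hour schedule
    jobs:
      scrape:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
          - name: Fetch latest data
            run: curl -s https://example.com/data.json -o data.json
          - name: Commit if changed
            run: |
              git config user.name "Automated"
              git config user.email "actions@users.noreply.github.com"
              git add data.json
              git commit -m "Latest data: $(date -u)" || exit 0
              git push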
ojkelly almost 2 years ago
It's probably not a coincidence that the other place I've seen this technique was also for archiving a feed of fires.

In that case the data was about 250 GB when fully uncompressed, and IIRC under a gig when stored as a git repo.

It's a really neat idea, though it can make analysis of the data harder to do, in particular quality control (the aforementioned dataset had a lot of duplicates and inconsistency).

Like everything, it's a process of trading off between compute and storage, in this case optimising storage.
albert_e almost 2 years ago
In the past I had to hunt down when a particular product's public documentation web pages were updated by the product team to add disclaimers and limitations.

This would have helped so much. Bookmarking this tool. Maybe I will get around to setting this up for that docs site.

Maybe all larger documentation sites should have a public history like this -- if not volunteered by the maintainers themselves, then through git-scraping by the community.
pdimitar almost 2 years ago
Funnily enough I do something very close to this with the RFC database at rfc-editor.org; here's the script that I have put in my `cron`:

    pushd ~/data/rfc  # this is a GIT repo
    rsync -avzuh --delete --progress --exclude=.git ftp.rfc-editor.org::rfcs-text-only ~/data/rfc/text-only/
    rsync -avzuh --delete --progress --exclude=.git ftp.rfc-editor.org::refs ~/data/rfc/refs
    rsync -avzuh --delete --progress --exclude=.git ftp.rfc-editor.org::rfcs-pdf-only ~/data/rfc/pdf-only/
    git add .
    git commit -m "update $(date '+%Y-%m-%d')"
    git reflog expire --expire=now --all
    git gc --prune=now --aggressive
    git push origin master
    popd

Though I admit using GitHub's servers for this is more clever than me using one of my home servers. Still, I lean more towards self-hosting.

@simonw Will take a look at `git-history`, looks intriguing!
great_psy almost 2 years ago
> It runs on a schedule at 6, 26 and 46 minutes past the hour—I like to offset my cron times like this since I assume that the majority of crons run exactly on the hour, so running not-on-the-hour feels polite.

Not sure how much of a difference it makes to the underlying service, but I will also do this with my scraping.

Thank you for pointing that out.
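In plain crontab syntax, that offset schedule is just:

    # three runs per hour, avoiding the crowded :00 boundary
    6,26,46 * * * *  /path/to/scrape.sh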
Zufriedenheit almost 2 years ago
I use this to aggregate some RSS feeds, and also to generate a feed out of the HTML from sites that don't have one. Then I just publish the result as GitHub Pages and add that link to my reader. Thanks for this instruction, it got me going on that idea.
kissgyorgy almost 2 years ago
One of my friends is doing this to track modifications to Hungarian law: https://github.com/badicsalex/torvenyek

He has tools for parsing them written in Rust: https://github.com/badicsalex/hun_law_rs

and Python: https://github.com/badicsalex/hun_law_py

I'm doing it myself to track changes to my GitHub stars: https://github.com/kissgyorgy/my-stars
Ayesh almost 2 years ago
I have a couple of similar scrapers as well. One is a private repo where I collect visa information from Wikipedia (for Visalogy.com), and another collects GeoIP information from the MaxMind database (used with their permission).

https://github.com/Ayesh/Geo-IP-Database/

It downloads the source data, dumps it split by the first 8 bytes of the IP address, and saves the pieces to individual JSON files. For every new scraper run, it creates a new tag and pushes it as a package, so dependents can simply update it with their dependency manager.
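The tag-per-run part might look something like this (the tag scheme here is a guess for illustration, not necessarily what Ayesh uses):

    git add data/
    git commit -m "Data refresh $(date -u +%Y-%m-%d)"
    git tag "v$(date -u +%Y.%m.%d)"
    git push origin master --tags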
TacticalCoder almost 2 years ago
So it's basically using Git as an "append-only" (no update-in-place) database to then do time queries? It's not the first time I've seen people use Git that way.

EDIT: hmmm, I realize that in addition to that it's also a way to not have to write specific queries over time: the diff takes care of finding everything that changed (i.e. you don't have to say "I want to see how this and that value changed over time": the diff does it all). Nice.
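Stock git already covers most of the time-query surface; for anything fancier, simonw's git-history tool (mentioned above) converts such a repo into a queryable SQLite database. For example:

    # Full change history of one scraped file, newest first
    git log -p -- data.json

    # What the file looked like 100 commits ago
    git show 'HEAD~100:data.json'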
nomilk almost 2 years ago
Love that the author provides a 5 minute video explaining the purpose and how he did it: https://www.youtube.com/watch?v=2CjA-03yK8I
Bu9818 almost 2 years ago
The core idea, I believe, is tracking incremental changes and keeping the past history of items. Git is good for text, though for large amounts of binary data I would recommend filesystem snapshots, such as with btrfs.
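For the btrfs route, a timestamped read-only snapshot per scrape would be something like (paths are placeholders):

    # Create a read-only, timestamped snapshot of the data subvolume
    btrfs subvolume snapshot -r /data "/snapshots/data-$(date +%Y%m%d%H%M)"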
modestygrime almost 2 years ago
I use the same technique to maintain a JSON file mapping Slack channel names to channel IDs, as Slack for some reason doesn't have an API endpoint for getting a channel ID from its name.
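One way such a map could be rebuilt on each run, using curl and jq against Slack's conversations.list endpoint (a sketch; cursor-based pagination is omitted for brevity):

    # Rebuild the channel-name -> channel-ID map and commit it if it changed
    curl -s -H "Authorization: Bearer $SLACK_TOKEN" \
         "https://slack.com/api/conversations.list?limit=1000" \
         | jq '[.channels[] | {(.name): .id}] | add' > slack-channels.json
    git add slack-channels.json
    git commit -m "Update Slack channel map" || true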
muunbo almost 2 years ago
My mind is so friggin blown: GitHub will run arbitrary cron jobs for you?! Can't believe other services make you pay for that.
a-dub almost 2 years ago
Some covid datasets were published as git repositories. The cool part was that it added a publish-date dimension for historical data, so that one could understand how long it took for historical counts to reach a steady state.
alphanumeric0 almost 2 years ago
What's the benefit of this versus a time series database?
fragmede almost 2 years ago
(2020)