I always like to see how an API is used in real projects. Sadly, GitHub search is mostly useless for this because of the number of duplicates. Google Code Search was great; it even supported regexps. Then there was koders.com, and now there's also something from ohloh, which is better than GitHub AFAIR.<p>EDIT: ohloh became OpenHub, and the code search has since been discontinued. So there's the nonfunctional GitHub search and an open niche for other projects...
Wow, GitHub could save a lot of storage space if they dedup'd across projects/files explicitly, rather than storing raw Git repos, which is what I assume they do.<p>Even with a good deduplicating/compressing filesystem, the way Git history is stored means they're probably missing out on a ton of savings here. Then again, it's probably not worth the complexity of deviating from standard Git tooling.
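As a rough illustration of the kind of content-addressed dedup being described (a hypothetical toy sketch, not how GitHub actually stores data): identical file contents hash to the same key, so a file vendored into a thousand repos costs one physical copy.

```python
import hashlib

class DedupStore:
    """Toy content-addressed store: identical file contents are
    stored once, no matter how many repos/paths reference them."""

    def __init__(self):
        self.blobs = {}   # sha256 hex digest -> file bytes
        self.repos = {}   # repo name -> {path: digest}

    def add_file(self, repo: str, path: str, content: bytes) -> None:
        digest = hashlib.sha256(content).hexdigest()
        self.blobs.setdefault(digest, content)  # physical copy stored only once
        self.repos.setdefault(repo, {})[path] = digest

    def stored_bytes(self) -> int:
        return sum(len(b) for b in self.blobs.values())

store = DedupStore()
lib = b"function leftPad(s, n) { /* ... */ }"  # hypothetical shared dependency
store.add_file("project-a", "vendor/left-pad.js", lib)
store.add_file("project-b", "node_modules/left-pad/index.js", lib)

# Two logical files, one physical blob:
print(len(store.blobs))      # 1
print(store.stored_bytes())  # size of a single copy
```

Git's own object store already does this within a repo (blobs are addressed by content hash), but the hosting layer would have to share the object store across repos to get the cross-project savings the parent describes.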
This is very interesting. I would have liked to see the results for JavaScript with the node_modules folder excluded. If that counts as code duplication, then pip dependencies should be included as well.<p>This should definitely be taken as a lesson, though: JS needs a better deployment solution. That, or better education on the current solution(s).
Would love to see a follow-up showing how much duplication remains after controlling for common dependencies and autogenerated code, along with data on how many repositories are full clones (i.e., nearly all of their code is identical to another repository's).
Very interesting from a security perspective: so much potentially dangerous code copy-pasted, and most of it probably never updated, either. I've personally found C vulnerabilities in code that, just by Googling the vulnerable line, I could see was used in many projects... Usually there isn't much to be done about it, either.
Now predicting automated software that scans for duplicated code, flags it for violating license agreements, and sues for money.<p>Welcome to the future of copyright trolls.