TechEcho
DéjàVu: a map of code duplicates on GitHub

134 points by devilcius over 7 years ago

9 comments

hawski over 7 years ago
I always like to see how some API is used in real projects. Sadly, GitHub search is mostly useless for this because of the number of duplicates. Google Code Search was great. It even supported regexps. Then there was koders.com, and now there's also something from ohloh, and it's better than GitHub AFAIR.

EDIT: ohloh became openhub, and now the code search is discontinued. So there is the nonfunctional GitHub search and an open niche for other projects...
coding123 over 7 years ago
What really sucks is people committing node_modules, that's just plain wrong.
zbentley over 7 years ago
Wow, GitHub could save a lot of storage space if they dedup'd across projects/files explicitly, rather than storing Git repos, which is what I'm assuming they do.

Even with a good deduping/compressing filesystem, the way git history is stored means that they're probably missing out on a ton of savings here. Eh, it's probably not worth the complexity/deviation from standard Git tooling.
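A minimal sketch of the cross-project dedup idea from the comment above, assuming files are keyed by a hash of their content so that identical bytes are stored only once. The `BlobStore` class and the two sample repositories are hypothetical illustrations, not a description of how GitHub stores data. (Git already addresses blobs by content hash inside a single repository; the sketch only extends that idea across repositories.)

```python
import hashlib


class BlobStore:
    """Toy content-addressed store: identical bytes are kept once."""

    def __init__(self):
        self.blobs = {}         # sha256 hex digest -> file content
        self.logical_bytes = 0  # total size as if every copy were stored

    def add(self, content: bytes) -> str:
        """Store content under its hash and return the hash."""
        digest = hashlib.sha256(content).hexdigest()
        self.logical_bytes += len(content)
        self.blobs.setdefault(digest, content)  # no-op if already present
        return digest

    def stored_bytes(self) -> int:
        """Bytes actually kept after deduplication."""
        return sum(len(blob) for blob in self.blobs.values())


store = BlobStore()

# Two hypothetical repositories that vendor the same library file.
repo_a = {"src/app.js": b"console.log('a');\n", "vendor/lib.js": b"function lib() {}\n"}
repo_b = {"index.js": b"console.log('b');\n", "deps/lib.js": b"function lib() {}\n"}

for repo in (repo_a, repo_b):
    for path, content in repo.items():
        store.add(content)

print("logical bytes:", store.logical_bytes)
print("stored bytes: ", store.stored_bytes())
```

The gap between the two numbers is the saving from exact-duplicate files only; it says nothing about compression or about how history is packed.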
neurotrace over 7 years ago
This is very interesting. I would have liked to see the results for JavaScript when you ignore the node_modules folder. If that's going to count for code duplication then pip dependencies should be included as well.

This should definitely be taken as a lesson though: JS needs a better deployment solution. That, or better education on the current solution(s).
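One way to try the "ignore node_modules" measurement on a local checkout is sketched below. It is only an illustration under stated assumptions: the function name `duplicate_groups` and the `SKIP_DIRS` set are made up for this example, and exact byte-for-byte hashing is a simplification, not the method used in the DéjàVu study.

```python
import hashlib
import os
from collections import defaultdict

# Assumption: directory names to leave out of the duplicate count.
SKIP_DIRS = {"node_modules", ".git"}


def duplicate_groups(root):
    """Return {content hash: sorted paths} for files that occur more than once."""
    groups = defaultdict(list)
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune in place so os.walk never descends into skipped directories.
        dirnames[:] = [d for d in dirnames if d not in SKIP_DIRS]
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as fh:
                digest = hashlib.sha256(fh.read()).hexdigest()
            groups[digest].append(path)
    return {h: sorted(paths) for h, paths in groups.items() if len(paths) > 1}
```

Calling `duplicate_groups(".")` with and without `"node_modules"` in `SKIP_DIRS` gives a rough before/after comparison for a single directory tree.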
hultner over 7 years ago
Would love to see a follow-up showing how much duplication remains if we control for common dependencies and autogenerated code, together with data on how many repositories are fully cloned (i.e. all code is near-identical to another repository).
az0 over 7 years ago
Very interesting from a security perspective. So much potentially dangerous code is copy-pasted, and most of it is probably never updated either. I've personally found some C vulnerabilities in code that I easily found used in many projects by Googling the vulnerable line... Usually there's not much to do about it, either.
Tommakx over 7 years ago
Would be more interesting to see an analysis of almost equal files - to detect reimplementations of the same thing
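A rough sketch of what "almost equal" could mean in practice, assuming similarity is measured as the Jaccard overlap of k-token windows (shingles). The tokenizer regex, the window size `k=5`, and the function names are illustrative choices, not anything taken from the linked study, and this approach catches near-copies rather than true reimplementations written from scratch.

```python
import re


def shingles(source, k=5):
    """Return the set of k-token windows in source (whitespace is ignored)."""
    tokens = re.findall(r"[A-Za-z_]\w*|\d+|\S", source)
    if len(tokens) < k:
        # Too short for a full window: treat the whole file as one shingle.
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a, b, k=5):
    """Jaccard similarity of the shingle sets of two source strings."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0  # two empty files count as identical
    return len(sa & sb) / len(sa | sb)


original = "def add(a, b):\n    return a + b\n"
reformatted = "def add(a,b):\n  return a+b\n"
extended = "def add(a, b):\n    return a + b + 0\n"

print(jaccard(original, reformatted))
print(jaccard(original, extended))
```

Files that differ only in whitespace score as identical, while small edits lower the score gradually; a threshold for "near duplicate" would have to be chosen empirically.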
inetknght over 7 years ago
Now predicting automatic software that looks at duplicated code, flags it for violating license agreements, and sues for money.

Welcome to the future of copyright trolls.
nihonium over 7 years ago
In order to prevent code duplication on a global scale, we need more frameworks, like leftpad. :sarc: