TechEcho
Amazon has a way to scrape GitHub and feed its AI model

65 points by doubtfuluser 11 months ago

17 comments

Kye 11 months ago
Is it git pull?

>> "In response, Amazon proposed a workaround: encouraging its employees to create multiple GitHub accounts and share their access credentials."

Ah, no, it's git pool.
xmodem 11 months ago
Ethically, Microsoft has about as much claim to be able to use the data for Copilot as anyone else.

On the other hand, maybe a MSFT v. Amazon lawsuit over this could be the wake-up call the world needs that we should stop centralising critical infrastructure in the hands of a single company. Which is why I think they wouldn't do it. At most, I could see Microsoft tightening request limits on accounts associated with Amazon.
jsnell 11 months ago
I'm surprised Amazon's legal team signed off on this. It's clearly against the GitHub terms of service[0], and Amazon employees acting on instructions from Amazon had to approve those terms. It seems pretty much identical to the LinkedIn vs. hiQ scraping case, where, as I understand it, the fake account creation was the key point.

[0] E.g. no API key sharing for the purposes of evading rate limits, only a single free account per person or organization.
koolba 11 months ago
Is the cover image itself generated via some ML model? The old guy in the middle is missing substantial parts of his arm. The box right by him also has some artifacting in the corner.
lokimedes 11 months ago
This just rekindled my desire to self-host my git repos. The whole idea that a platform provider can use the IP I host there is obscene. That thieves steal by bounty from each other is not the story.
neilv 11 months ago
Separate from the courts, Microsoft could send a message to the AI gold rush field, about "abuse of Microsoft's resources", via ToS:

* All Amazon domain names could be banned from accounts on GitHub, or face annoying restrictions, implemented with trivial technical changes. And lawyers could send a letter to Amazon legal, about how Amazon may and may not use GitHub, including Amazon personnel having to disclose their affiliation (not hide it with GMail), and craft some language about how those employee accounts may and may not be used.

* More harshly, but fear-instilling to individuals throughout industry, the individuals who let their accounts be used for the scraping could be banned from GitHub, for ToS violation. Not only those particular accounts, but any accounts the individuals might use. (This would hurt, not only for genuine open source participation, but also given how open source is sometimes used for job-hunting appearances, and all the current employers that ask for candidate's "GitHub" specifically rather than open source in general.) If banning would have undesired effects of projects GitHub wants to host being pulled, or public reaction as too harsh and questioning why GitHub has so much power, there could instead be annoying restrictions.
foreigner 11 months ago
Microsoft could sabotage Amazon's AI model by returning poisoned code to accounts registered with @amazon.com email addresses.
raarts 11 months ago
Language in this article smells like it's written or rewritten by AI.
paradite 11 months ago
Microsoft is probably one of the few companies that can sue Amazon without worrying about retaliation from Amazon.

For example, GitLab would need to think twice before suing because they offer deployment on AWS.
threecheese 11 months ago
Can anyone share a Fermi estimation of the size of poison-pill training data required to impact code interpreter models? (of the size that AMZN might be building with this data)

I expect it would vary by language/platform popularity (size of available training code). Is it infeasible to create or generate enough code, pushed to enough repositories, to impact the correctness of a model that includes the code in its training data set?
lofaszvanitt 11 months ago
MS only provides the infra; everything else is others' hard work under the Trojan horse of open source or whatever. If they introduce limits, it's time to leave GitHub. This will evolve into an Elsevier vs. researchers kinda situation.
chumanak 11 months ago
This article doesn't make any sense. Why would Amazon make their employees do all this when they can easily pay for a service like Crawlbase or similar and scrape GitHub without having to create employee accounts?
rty32 11 months ago
If GitHub cared enough about this, they would have already sued Amazon. I don't think the author needs to worry about any of this.
hi-v-rocknroll 11 months ago
MSFT's LinkedIn scraping was also a thing about 10 years ago until the magic method was taken away. :'(
amadeuspagel 11 months ago
They should make this data available for everyone on AWS.
glimshe 11 months ago
I couldn't care less about these huge tech companies stealing from one another. Let them sue themselves to extinction.
htrp 11 months ago
Disappointing that a large megacorp does the exact same thing broke developers do to get around rate limits.