This, along with recent Reddit goings-on, has made me realize a major risk with the current structure of online communication. Take either Reddit or Stack Exchange as examples. They build a platform, and users contribute their time, thought, energy, and knowledge to build a community on that platform. Those companies can then gatekeep and restrict access to all that the community built, when all they did was provide the platform and store the data. We need to rethink this model.<p>The thought and knowledge of communities and users need to belong to those communities and users, and to people they intentionally and thoughtfully delegate to and trust. We need to decentralize our communications, like how the internet used to be before the arrival of social media and mega forums. We need to revert to small, focused forums, with less anonymous, more persistent communication, run by people we trust. Otherwise, we will continue to see mega companies harvest our data and use it (or not provide it) against our wishes. If we don’t work to mitigate that dynamic, we have nobody to blame for the poor outcomes but ourselves.
What irks me about this is that 100% of their data is provided for free by the community that they have fostered, the people like myself who have answered > 2500 questions[0], and now SO feels hard-done-by by LLMs using all <i>their</i> hard work to create tools like CodeGPT, GitHub Copilot, etc.<p>If it really were a site for helping developers to improve their skills and increase their productivity through the give-and-take model that SO was, at least once upon a time, SO should perhaps take a deep breath and realise that this might not change a thing apart from causing their contributors to feel like they were never part of it in the first place.<p>I'm not sure if I've correctly articulated that, but I do find SO's stance to be quite revealing. It feels to me like they're crying foul that ChatGPT and however many other systems out there are stealing their revenue. None of the contributors (apart from the employee ones, I suppose) ever got paid any currency other than high-fives in the form of rep, medals, the gamified stuff, moderation rights, and at certain rep levels some swag in the form of t-shirts and the usual.<p>I never wanted any money from SO, but the revelation of this attitude has left me feeling, well, a little sad to say the least.<p>[0]<a href="https://stackoverflow.com/users/70393/karim79" rel="nofollow">https://stackoverflow.com/users/70393/karim79</a>
It's unfortunate we are seeing all of these data platforms get locked off, because this is not going to affect AI development at big companies; it's only going to affect the ability of individuals to do AI development of any form at home.<p>I hope the data that has been collected so far is going to be big enough going forward, but it's incredibly unfortunate that this is happening.<p>I hope all the people making these decisions wake up with a bad headache and severe heartburn tomorrow.
As a reminder, all the SE sites have content under a Creative Commons Attribution-ShareAlike license, allowing for, among other things, commercial re-use [0] [1].<p>Yes, it sucks that the SE sites are getting more draconian about allowing access to their content, but the content is well insulated against completely disappearing precisely because it's under a libre/free license. Note that neither Reddit [2] nor HN, I might add [3], has any such licensing terms that allow for commercial reuse.<p>Decentralization might be a viable option in the future, but for right now, centralized sites are the norm and the way to protect against the content disappearing is to put it under libre/free licensing. Note that Wikipedia is centralized and it would certainly be a tragedy if they became more draconian about sharing their data, but the content itself is and will be available to the general public, effectively the "commons", because of the licensing terms.<p>To me, this is yet another reminder of why we need to future proof with libre/free/open licensing terms. Or reform copyright, but I don't see that happening within my lifetime.<p>[0] <a href="https://stackoverflow.com/legal/terms-of-service/public#licensing" rel="nofollow">https://stackoverflow.com/legal/terms-of-service/public#lice...</a><p>[1] <a href="https://creativecommons.org/licenses/by-sa/4.0/" rel="nofollow">https://creativecommons.org/licenses/by-sa/4.0/</a><p>[2] <a href="https://www.redditinc.com/policies/developer-terms#text-content4" rel="nofollow">https://www.redditinc.com/policies/developer-terms#text-cont...</a><p>[3] <a href="https://www.ycombinator.com/legal/#tou" rel="nofollow">https://www.ycombinator.com/legal/#tou</a>
Everyone wants to be "smart" by web scraping, harvesting data, building models. No one bothers to build and sustain platforms where quality content can be crowdsourced. This parasitic arrangement is slowly starting a new era of the internet. The question is how long until existing data dumps become outdated and fall into irrelevance.
Really strange comment.<p>> I was recently impacted by the Company's layoff.<p>> I'm offering what I can to uphold the Company's values of Transparency & being Community-centric.<p>I wouldn't offer transparency about a former employer's internal operations. Let them respond or at least ping a current employee to respond.
For anyone curious, the original announcement of the data dump - <a href="https://stackoverflow.blog/2009/06/04/stack-overflow-creative-commons-data-dump/" rel="nofollow">https://stackoverflow.blog/2009/06/04/stack-overflow-creativ...</a>
"Just sorta stating the obvious here, but the timing of this is unbelievably terrible; I actually can't fathom a worse time for this call to be made than in light of this week. –zcoop98"<p>Or, it's exactly the best time to do it. Doing it now allows your news to get blended in with the Reddit news. Doing it later after Reddit chatter settles down means all of the chatter is directed squarely at you.
I only hope this and the Reddit slowmo-trainwreck-in-progress sensitize more people to the value of the data they contribute and how it is appropriated by the platforms.
> I mention the timing, as this change long pre-dated the current moderator strike and related policy changes.<p>A mod strike? I hadn't heard about this.<p><a href="https://meta.stackexchange.com/questions/389811/moderation-strike-stack-overflow-inc-cannot-consistently-ignore-mistreat-an" rel="nofollow">https://meta.stackexchange.com/questions/389811/moderation-s...</a>
Sad, I had a lot of fun with it making StackRoboflow[1] (This Question Does Not Exist) a few years ago.<p>The models (AWD-LSTM and GPT-2) weren't good enough back then to usefully answer programming questions -- but it's super cool to see that vision realized with GPT-4 and other modern LLMs.<p>[1] <a href="https://stackroboflow.com" rel="nofollow">https://stackroboflow.com</a>
Yesterday's data dumps/APIs fostered community, new market/channel discoveries & low risk acquisitions.<p>Today's data dumps/APIs make it easier to train the ML/AI models that put these platforms on the path to irrelevance. They're pulling out all the stops like there's no tmw, and there might not be, if they're willing to shake things up like this.
Stack Overflow and Reddit want money for AIs to train on their data which is why they made these changes, so which companies are next? Could HN get crappier in order to milk AI money for its valuable comments? I guess Wikipedia at least can't do jack to get AI cash for its valuable data.
This data dump was part of the compact between users (who created the content) and the platform (which hosts it). The data dump was insurance against the company going the CDDB/Gracenote, Experts Exchange or Quora route and either paywalling or even just gating that content. We don't need a repeat of that.<p>If the data dump is gone, that compact is broken and honestly it's time to stop contributing to SO.
twitter, reddit, stack overflow... the digital version of burning the library of alexandria<p>it was always a broken system built on dodgy contracts, but it is still sad to see how unceremoniously everything implodes<p>will any lessons be learned? unlikely.
This is an internet ecosystem issue that is being reduced to thoughtless bashing of supposedly evil companies. Yes, these actions are clumsy and user-hostile, but consider the big picture.<p>We have companies like Reddit and Stackoverflow not being profitable, despite being wildly successful in usage and internet mind-share. Neither of these companies is particularly over-staffed.<p>We post our "valuable" contributions there. So valuable that nobody wants to pay for them (structurally). We block ads. AI does the daylight robbery. We expect free APIs and data dumps.<p>Perhaps this is our wake-up call about the limitations of the "free" model and companies running at a loss for 15 years straight. It was always an anomaly.
Not sure if this is relevant, but the Hacker News BigQuery dataset has also not been updated since Nov 2022: <a href="https://issuetracker.google.com/issues/261579123" rel="nofollow">https://issuetracker.google.com/issues/261579123</a>
I wonder if the execs at SO figure that OpenAI fed the CC data dump directly to ChatGPT and decided maybe they didn't want to make it quite so easy for them to do it again? Maybe they want to make OpenAI pay for it, or at least attach the license-required attribution.