ChatGPT scraped data from various sources on the internet.<p>> The model was trained using text databases from the internet. This included a whopping 570GB of data obtained from books, webtexts, Wikipedia, articles and other pieces of writing on the internet. To be even more exact, 300 billion words were fed into the system.<p>I believe it's unfair to these sources that ChatGPT drives away their clicks, and in turn the ad income that would come with them.<p>Scraping data seems fine in contexts where clicks aren't driven away from the very site the data was scraped from. But in ChatGPT's case, it seems really unfair to these sources and the work the authors put in, as people would no longer even attempt to visit them.<p>Can this start breaking the ad-based model of the internet, where a lot of sites rely upon the ad income to run servers?
Is it unfair for you to create content/products/etc after you have read and learned from various sources on the internet, potentially depriving them of clicks/income?<p>People get internet hostile at me for this question, but it really is that simple. They've automated you, and it's definitely going to be a problem, but if it's acceptable for your brain to do the same thing, you're going to have to find a different angle to attack it than "fairness".
I think this is a real concern, but imagine a couple other scenarios:<p>1. You have a widely read spouse named Joe who reads constantly. He's got a good memory, and typically if you have a question you just ask him instead of searching for it yourself. Are you depriving Joe's sources of your eyeballs?<p>2. Many books summarize and restate other books. If I read Cliff's Notes on a book, for example, I can learn a lot about the original book without buying it. Is this depriving the author?<p>3. I have a website that proxies requests to other websites and summarizes them while stripping out ads.<p>So which of these examples is the better metaphor for what an LLM does?<p>I don't know. The fact is, LLMs are a <i>new thing</i> in our tech and culture and they don't quite fit into any of our existing cultural intuitions or norms. Of course it's ambiguous! But it's also exciting.
It is not breaking the ad-based model—it’s breaking open information sharing culture as we know it.<p>Yesterday: 1) You do research, you publish a book, you write some posts. 2) People discover your work and <i>you</i> personally, they visit your posts and subscribe to you. 3) You have an opportunity to upsell your book and make money on ads to sustain your future work; more importantly, you get to see traffic stats and see what is in demand, you get thank-you emails and feel valued.<p>Tomorrow: 1) You do research, write posts, publish a book. 2) It is all consumed by a for-profit operated LLM. 3) People ask the LLM to get answers, and have no reason or even opportunity to buy your book or know you exist.<p>What exactly are the incentives to publish information openly in that world?<p>(Will they even believe you if you say you’re the one who did the niche research powering some specific ChatGPT answer, in a world where everyone knows you can just ask an LLM?)
Yes it absolutely is, but imo less so than what GitHub Copilot and various image generation companies are doing. My theory is that if AI turns out to be as disruptive as the current hype suggests, the conflict between those who feed the AI vs. those who profit from it might be the next big social rift.<p>Artists are already in full rebellion against this, as they should be, being nearly eclipsed by AI except when it comes to inventing new styles and hand-crafting samples for the models to train on. These, I assume, are either scraped off the web or signed away in the unfair ToS of various online publishing platforms.<p>Since the damage is individually small (they took some code from me without attribution, ok) but collectively enormous, in my opinion it is the role of government to step in and soften the blow if necessary.
Absolutely, yes. It's incredibly unfair. But techbros here and elsewhere don't care about you or me or people in general and they'll think up an infinite amount of ridiculous false equivalencies before admitting the risks and real harms.
All I have to say is, as technologists, anyone who is criticizing ChatGPT and has not been criticizing Google is a hypocrite. It's well known Google tries to keep you on Google by parsing more and more information from websites and summarizing it, e.g. Wikipedia summaries, IMDB scores, review stars, etc.<p>If you have a problem with ChatGPT's "scraped data", then you have more fundamental issues with how the internet is today.
Ah, our daily dose of a bunch of people with basically no understanding of copyright law, or even the basic concepts of tort or common law jurisprudence, making all sorts of silly anthropomorphic arguments about “how computers think”.<p>Please, people, learn how to focus your thoughts. Go read up on copyright law in the United States. If you go into learning about copyright law trying to justify your own preconceived notions, you will gain nothing.
The dirty secret of how so many social media giants got their initial traction is that in the early growth stage they scraped content. LinkedIn is one I have personal knowledge of. Facebook is another. How do you think they got a critical mass of users? Scraping and fake engagement. Back in the 00's when they were startups operating in little offices in the SF Bay, they had teams of people running Beautiful Soup and were building bots to build profiles and stuff.<p>I'm actually not really sure I have an opinion on the ethics of it. Same argument as Adblock: you don't get to control how people consume your content if you put it out in the world for free. That goes for profiles, articles, reddit posts, StackOverflow, etc. The only thing that's ironic is that large tech companies throw a fit whenever you want to turn the tables and scrape them.
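To give a concrete sense of the kind of scraping described above, here is a minimal sketch using only Python's stdlib `html.parser` (standing in for Beautiful Soup; the markup and the `profile-name` class are invented for illustration):

```python
# Toy profile scraper: pull display names out of hypothetical profile HTML.
# Uses the stdlib HTML parser instead of Beautiful Soup so it runs anywhere.
from html.parser import HTMLParser

class ProfileNameParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self._in_name = False
        self.names = []

    def handle_starttag(self, tag, attrs):
        # The class name here is a made-up example, not a real site's markup.
        if tag == "span" and ("class", "profile-name") in attrs:
            self._in_name = True

    def handle_data(self, data):
        if self._in_name:
            self.names.append(data.strip())
            self._in_name = False

page = ('<div><span class="profile-name">Jane Doe</span>'
        '<span class="profile-name">John Roe</span></div>')
parser = ProfileNameParser()
parser.feed(page)
print(parser.names)  # ['Jane Doe', 'John Roe']
```

Point bots at enough pages like this and you have the "critical mass of users" pipeline the comment describes.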
Well, you could say the same thing about the answers that Google displays on its pages instead of search results! If you don't want these crawlers to index your content, I am pretty sure you can disable it via robots.txt, just like with Google.
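For reference, a robots.txt opt-out is only a couple of lines. The crawler token below is a placeholder, since AI crawlers' actual user-agent strings vary and compliance is voluntary:

```
# robots.txt at the site root.
# "ExampleAIBot" is a hypothetical user-agent; check each crawler's docs
# for its real token. Honoring this file is voluntary on the crawler's part.
User-agent: ExampleAIBot
Disallow: /

# Everyone else may crawl normally.
User-agent: *
Disallow:
```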
All I know is that while this isn't a new issue, the likes of ChatGPT have brought it to a head and made it more urgent. I am seriously reconsidering whether or not I want my writings to be available on the internet at all. I object to many of the uses they can be put to, including this one, and not publishing them online appears to be the only control available.<p>For now, I have removed my existing works, both technical and creative, from the internet and won't be adding more while I try to work out what to do.
The importance of source citation in ChatGPT's responses is a topic of debate, particularly as the platform shifts towards a paid model. While ChatGPT is designed to deliver information in a conversational and user-friendly way, it is important to consider the potential legal implications of using unverified or uncited information. In sensitive or controversial cases, it is advisable to properly cite sources to ensure accuracy and avoid any potential issues of intellectual property infringement.<p>On the other hand, the focus on ChatGPT's natural language processing capabilities highlights the significance of learning about and using LLMs (large language models) in data handling. The utilization of LLMs could lead to a future where traditional databases become obsolete and are replaced by advanced language models. As such, the development and integration of LLMs into our daily lives and processes can bring about many benefits and possibilities.
No. It's not. Also, it's not unfair if I study someone's work and then learn from it. Also, it's not unfair if you see my internet presence and are inspired to do similar things.<p>At some point participating in the internet means your stuff is going to be seen. I wear glasses to read web content. I don't think the glasses company should pay royalties for what I read. ChatGPT is a tool that allows me to understand and use the information people put onto the internet better.<p>Far from a matter of fairness, this is simply another way that selfish people are trying to monetize the future, to make it more and more difficult and expensive for others to participate.<p>"I've always wished I could charge everyone on earth. ChatGPT looks like the future. If I can tap the money flow there I will get mo' money."<p>I'm against it.
You can also invert this and say that without a system like ChatGPT it is physically impossible for most people to find or use those 570GB of data. A search engine can only get you so far, and over time they are becoming less useful as the net floods with junk content. If you don't even know what terms to search for, then ChatGPT wins out, since you can start with a very simple question and then interrogate it further on details it produces. The best way to think about it is as a better search engine: a fully interactive one that also has some degree of its own agency when it comes to synthesizing data. It could be better, though; it would be nice to have the option to show sources for the output so that you can verify the facts or do your own research.
Google piggybacks on the same sorts of data to rank results and display ads without compensating site owners. They track billions of people without paying them. I don’t think it’s any more unfair if OpenAI built a better product.
I think ChatGPT just exacerbates a problem that was already pervading the free internet business model, which is that the ad revenue model is outdated and exhausted, with no clear alternative.<p>Maybe it was unfair to telephone operators when connection automation was implemented, as it made operators obsolete, but the older model couldn't scale, the same way reading text from the source doesn't scale for human productivity.
This argument came up a bunch a while back. I settled on the opinion that since it's already possible to buy summaries of books, I don't give a fart in a breeze where ChatGPT got its data.<p>E.g. "Summary of How to Win Friends and Influence People: Effective Steps to Better Interpersonal Relationships" by Book Lyte.<p>ChatGPT does more of a mashup with the learned data than human summarizers do; that'll do me.
> “<i>Can this start breaking the ad-based model of the internet, where a lot of sites rely upon the ad income to run servers?</i>”<p>We can only hope. It’s unfair to someone that my browser can ask your server for a page, I see an ad for random bullshit nobody would ever care about, money changes hands behind the scenes, and that counts as an economic transaction which boosts GDP. It’s unfair (in my favour) that I can piggyback off this to get things for free.<p>And when I say “someone” I suspect I mean “everyone”. Sadly, spending money advertising “Yorkshire woman finds guaranteed way to win on the horses” doesn’t seem to have caused anyone to run out of money and the whole thing to collapse yet. And it’s unfair on real small businesses with products, paying for adverts which people don’t see, or which are clicked by bots, or misreported, and all they can do is throw money at Google and Facebook and hope.
I kind of agree with you, but I think that's only because we've all been saturated with the idea of everlasting ownership of ideas.<p>Clearly, ownership of ideas runs out, because we all use linked lists or binary trees, or paper, or turbines or the list goes on. We don't pay money to the inventors of linked lists, or the heirs or successors-in-interest to the inventor of paper. Why not? When does ownership of an idea expire? Why do we unconsciously accept copyright or patent limits of today?<p>There's also an issue with simultaneous invention, but that's out of scope here. Clearly ChatGPT is just regurgitating or otherwise emitting previously-ingested material.
When this becomes properly entrenched I fear that it may create a disincentive to create original content. If that happens we will all be poorer for it in return for amazing access to what we already have. I don't think it is a good deal.
At the minimum, systems like chatgpt should be forced to link to their sources, so it gives something back and so that its assertions can be verified - right now, it’s just good at bullshitting through questions.
It's akin to saying that everyone who writes a book today must give credit to everyone who contributed to the creation of modern written language and printing tools.
I think the copyright ethicality of the current class of AIs is about like religion or guns.<p>Discussion is pointless because everyone already has an opinion and it's very firm.
Big Tech companies have been scraping massive amounts of data for about two decades. Many smaller companies have tried to imitate them (remember when Big Data was the hottest thing out there? How do you think most of those startups obtained their data?) but pretty much all of them failed, mainly by running out of cash. OpenAI just happened to win the scraping lottery.
> I believe it's unfair to these sources that ChatGPT drives away their clicks, and in turn the ad income that would come with them<p>Google has been doing this in search results for years, and so has Bing.
Apple also does this in their built-in dictionary.<p>Why rant about ChatGPT, which at least for now is a small company in comparison?
ChatGPT actually has some ideas about this.<p>Question: How could the people who generate the content used in an AI language model be paid for their work?<p>Answer: There are several ways in which the people who generate content for an AI language model could be paid for their work:<p><pre><code> Royalty-based payment: Content creators could receive a percentage of the revenue generated from the use of their content in the AI language model.
Token-based payment: If the AI language model is built on a blockchain, content creators could be paid in tokens that could be traded for cryptocurrency or fiat currency.
Partnership with content publishers: The developers of the AI language model could partner with content publishers to compensate the creators of the scraped content.</code></pre>
The data that was scraped was put into vector representations and used to create a model, which generates unique sentences from scratch summarizing what the model relates together. The text results coming out are neither copyright infringement nor plagiarism.
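A toy illustration of the "vectors" idea in that comment: texts get mapped to numeric vectors, and the model works with geometric relationships between them rather than storing and copying the original text. The vectors and words below are made up purely for illustration:

```python
# Made-up 3-dimensional "embeddings"; real models use hundreds or
# thousands of dimensions learned from data, not hand-picked numbers.
import math

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

king  = [0.9, 0.8, 0.1]
queen = [0.9, 0.7, 0.2]
pizza = [0.1, 0.2, 0.9]

# Related concepts end up pointing in similar directions.
print(cosine(king, queen) > cosine(king, pizza))  # True
```

Whether generating from such representations counts as infringement is exactly the open legal question the thread is debating; the sketch only shows why the output is not a verbatim copy of stored text.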
There needs to be a “raw source” option that puts links to everything the model spits out that the user can enable or disable. This can give credit to whatever the model cites, and also help us understand a little of how it’s linking things together.
Not any more than I don't need to keep paying my teacher once I learn what she knows. ChatGPT's value isn't in what it knows: it's in what it <i>understands</i> from your prompt in terms of that sea of information.
1 - It's on the internet. If Google can index it, it's treated as free. If you don't want that, use robots.txt (not that that ever stopped Google's spider from indexing your pages).<p>2 - Code was trained from GitHub. GitHub is Microsoft. OpenAI is Microsoft money. So Microsoft trained its AI on Microsoft code. You disagree? Then GTFO from GitHub and don't feed Microsoft your code anymore.<p>3 (the most important point) - Q: "Can this start breaking the ad-based model of the internet, where a lot of sites rely upon the ad income to run servers?"<p>Fuck YEAH!! Please do so. I hope the shit show that the ad model is crashes and burns to the ground. You can't use the internet without having solid armor on: uBlock Origin and/or NoScript (or Pi-hole if you want the same readable experience on the rest of your household's devices).
> Can this start breaking the ad-based model of the internet, where a lot of sites rely upon the ad income to run servers?<p>Hopefully. This would be the best outcome I can think of for the Internet.
Anybody have an idea of what kind of hardware (and cost) you would need to train the model and to run it?<p>Obviously storage is not a major factor here.
On that argument, I could see publishers trying to sue. If you ask GPT:<p>> What's the New York Times scrambled egg recipe?<p>GPT returns the exact recipe. If I were NYT I'd be frustrated. Their content is now showing without the ad views or paywall.
My feeling is that one of four things happens:<p>1) AI is open sourced and we adapt stably. Either everybody has the opportunity to be their own business, or there is UBI.<p>2) AI is open sourced but it is unfairly distributed. Only some people are suited to B2B, and/or UBI is shit.<p>3) AI is not open sourced, the wealthy edge out mankind and a planet-scale genocide occurs.<p>4) None of it matters because the looming war between the US & China explodes, or climate change wipes out any meaningful capacity we have to pursue AI.<p>Given the track record of our species, #1 feels like wishful thinking.
Maybe <i>unfair</i> is the wrong word. I think most agree that scraping, even at a massive scale -- is in itself fair. But is it sustainable?<p>Will LLMs drive interest/activity away from wikipedia.org? Will it put its own sources of high-quality ad-supported content -- wikihow.com, for example (though I can't be totally sure it scraped from there) -- out of business? Or is there an earth-shattering copyright suit against OpenAI in the works as we speak?<p>> Can this start breaking the ad-based model of the internet<p>Is the alternative that everything is behind some kind of paywall by default, to block scraping? Is that where we're heading?
Do you want companies to do this in private for private gain and not share it with you? Because making it illegal will just make it happen in greater secrecy.
Only if you learning the same things is cheating too.<p>"Copyright" "ingenuity of thought" etc are concepts that need to be overhauled since a lot more people now have access to higher education.
How could training an AI on the works of Shakespeare <i>possibly</i> be unfair to him? Or to any other long-dead person? - I don't see any issues.<p>How could training an AI on the works of someone who has already been paid for them be unfair? - Possibly because it affects their future marketability and income?<p>Current authors, artists, and internet commenters clearly have a stake when the results of their creative endeavors are used for gain they won't share in. This is very similar to the extractive monopolies of YouTube and the rest of social media: their profit at our expense.