The shady world of Brave selling copyrighted data for AI training

261 points by rand0mx1 almost 2 years ago

13 comments

hartator almost 2 years ago

> Simply observe the event in which a user does a query q in Brave and then, within one hour, does the same query on a different search engine. What we do is to move the script that detects bad-queries to the browser, run it against the queries that the user does in real-time and then, when all conditions are met, send the following data back to our servers.

Wait. Brave browser sends data about your browsing back to the Brave Search engine? Not just your usage of other search engines, but it also crawls pages on your computer to help build their search index?

Ref: https://github.com/brave/web-discovery-project/blob/main/modules/web-discovery-project/sources/README.md
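The detection rule in the quoted README passage (same query on another engine within one hour) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Brave's actual script: the `SearchEvent` type, the engine labels, and the lowercase/strip normalization of queries are all hypothetical.

```python
from dataclasses import dataclass

ONE_HOUR = 3600.0  # window from the quoted passage, in seconds


@dataclass(frozen=True)
class SearchEvent:
    # Hypothetical record of one search; not a structure taken from Brave's code.
    engine: str
    query: str
    timestamp: float  # seconds since some reference point


def repeated_elsewhere(events, home_engine="brave", window=ONE_HOUR):
    """Return queries issued on home_engine and repeated on another engine within window."""
    hits = []
    last_home = {}  # normalized query -> time of the most recent home-engine search
    for ev in sorted(events, key=lambda e: e.timestamp):
        q = ev.query.strip().lower()  # assumption: "same query" ignores case and padding
        if ev.engine == home_engine:
            last_home[q] = ev.timestamp
        elif q in last_home and ev.timestamp - last_home[q] <= window:
            hits.append(q)
    return hits


events = [
    SearchEvent("brave", "foo", 0.0),
    SearchEvent("brave", "bar", 0.0),
    SearchEvent("other", "Foo", 1800.0),  # repeated within the hour
    SearchEvent("other", "bar", 4000.0),  # repeated too late
]
flagged = repeated_elsewhere(events)
```

The point of the sketch is only that this check needs the browser to see queries sent to other engines, which is what the comment above is reacting to.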

6gvONxR4sf7o almost 2 years ago

> Fair use is a doctrine in the law of the United States that allows limited use of copyrighted material without requiring permission from the rights holders. It provides for the legal, non-licensed citation or incorporation of copyrighted material in another author's work under a four-factor balancing test:
> 1) The purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes
> 2) The nature of the copyrighted work
> 3) The amount and substantiality of the portion used in relation to the copyrighted work as a whole
> 4) The effect of the use upon the potential market for or value of the copyrighted work

[emphasis from TFA]

HN always talks about derivative work and transformativeness, but never about these. The fourth one especially seems clear in its implications for models.

Regardless, it makes it seem much less clear cut than people here often say.

xp84 almost 2 years ago

From article:

> without any worry for copyright infringement because Brave acts as a middleman.

This isn't how law works. Unless Brave is explicitly indemnifying all their customers (which their lawyers would have to be insane to let them do), any trouble you could get into is going to be 100% your problem. Pointing the finger at Brave could theoretically get them in trouble too, but would in no way let you off the hook.

isodev almost 2 years ago

I firmly believe that corps like these don't deserve the benefit of the doubt. Google, Brave and really anyone big enough to allow themselves to do bad things and get away with it must adhere to a standard where they proactively show their stuff doesn't have malicious intents.

k__ almost 2 years ago

The websites a Brave user browses are anonymously relayed to their servers for indexing/training. So, they crawl the web without a crawler and the website operators can't do anything about it.

That's genius!

throwaway72762 almost 2 years ago

I think this title is overstated. It seems like Brave is trying to do the right thing here vs other companies that don't even make the attempt. (Also, crawling as a service has been a thing for a while.)

lern_too_spel almost 2 years ago

Brave continues to be shady. They claim to respect robots.txt but don't identify their crawler if you want to block it.

> They don't mention their crawler anywhere in their docs, either. So, if you wanted to block Brave from crawling and indexing and ultimately selling your content to third parties, your only option for the time being would be to block all crawlers, which is how Brave would be able to "respect robots.txt".
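The robots.txt point in the quote can be illustrated with Python's standard `urllib.robotparser`. This is a minimal sketch, not a description of how Brave's crawler behaves: `ExampleBot`, the `Mozilla/5.0` user agent string, and the URL are hypothetical stand-ins for "a crawler with a known name" and "a crawler that does not announce one".

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check whether user_agent may fetch url under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A rule aimed at one named crawler (hypothetical name).
BLOCK_NAMED = """\
User-agent: ExampleBot
Disallow: /
"""

# A rule aimed at every crawler.
BLOCK_ALL = """\
User-agent: *
Disallow: /
"""

URL = "https://example.com/page"

# The named rule stops the crawler that announces that name...
named_rule_stops_named_bot = not allowed(BLOCK_NAMED, "ExampleBot", URL)
# ...but not a crawler presenting some other, generic user agent.
named_rule_stops_unnamed_bot = not allowed(BLOCK_NAMED, "Mozilla/5.0", URL)
# Only the wildcard rule covers a crawler whose name is unknown.
wildcard_stops_unnamed_bot = not allowed(BLOCK_ALL, "Mozilla/5.0", URL)
```

Under these assumptions, a site operator who does not know the crawler's name has nothing to put on the `User-agent:` line except `*`, which is the situation the quoted passage describes.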

kodah almost 2 years ago

Unpopular opinion: the next iteration of privacy laws needs to factor in AI. If AI is allowed to slurp up PII or derivative works, and the people defending it defend it with the zeal of cryptobros, then we're in for a decade of real pain in terms of copyright law, PII, and IP exposure.

lopatin almost 2 years ago

Why use Brave if my info is already being leaked by third parties, e.g. Experian? Is it worth the inconvenience and their repeated tricky attempts at monetizing their security-conscious niche? Not being facetious, just a real question from a non-security-conscious person.

ricardo81 almost 2 years ago

My entirely biased opinion is https://www.mojeek.com/ - a traditional search engine crawler (as in, follow links on the web) that identifies its user agent. Dead simple. The open web, you can search on it.

verisimi almost 2 years ago

How long until IP works its way into AI training data or AIs themselves? I.e., for some specific instance, the training is intentionally wrong, so as to check and prove that there has been a breach of IP.

niemandhier almost 2 years ago

These discussions of fair use are always quite anglocentric.

Articles 3 and 4 of the EU 'Copyright in the Digital Single Market' directive give data miners quite extensive rights.

Move operations to the EU, train a foundational model, then train a constitutional model based on that.

As much as I hate the upcoming AI regulation, the CDSM is solid.

https://academic.oup.com/grurint/article/71/8/685/6650009
https://eur-lex.europa.eu/eli/dir/2019/790/oj

Update: Fixed wrong link

411111111111111 almost 2 years ago

It's always surprising to me when I hear of people using the Brave browser... It's by a company that initially tried to replace the ads it blocked with its own "safe and non-intrusive" ads, as far as I remember, until they backpedaled because of the outrage.

It's also a for-profit company and you're not the customer, as you're not paying them money.

I'd be way more worried about how they're using the data they're collecting on you vs Google or MS.
