Plagiarized news sites are using Cyrillic characters to avoid detection

92 points by mschenk about 7 years ago

12 comments

iluxonchik about 7 years ago

I doubt websites hosted in Eastern Europe care about copyright legal threats. Even if someone contacts the hosting provider directly, I doubt any action will be taken. Eastern Europe has plenty of cheap, shady hosting providers where you can host pretty much anything you want. Unless the website is making a lot of money, nobody is going to spend significant resources to take those websites down.

Let me speculate about why they might be doing it. Google will de-rank pages whose content is identical to that of other pages (e.g. identical paragraphs of text). Maybe Facebook is doing something similar?

Let's say one of your friends shares an article from BuzzFeed.com, then another friend shares an exact clone of this article from FakeBuzzFeed.com. Now, Facebook might not want to show two articles with the same title from two different websites on your timeline. And considering that BuzzFeed.com is a website with a higher ranking than FakeBuzzFeed.com, it will probably choose to display only the first one. If you do the Cyrillic trick on the article at FakeBuzzFeed.com, Facebook will think it's something completely different and present it to you, thus getting you a higher reach.

The same applies to the advertising part: if you're constantly submitting page ads with exactly the same titles as the ones that real users are sharing, it might get you banned.

Cynddl about 7 years ago

A bit of extrapolation here. In short, a few websites dedicated to making easy money on Facebook by copying articles have started using Unicode to obfuscate their titles. They automatically replace Latin characters with similar-looking letters.

This makes their titles harder to detect for Facebook, fact-checking websites, or DMCA/copyright bots. Nothing related to Russia here.

AndrewNCarr about 7 years ago

Here is a project that maintains a list of homoglyphs and has some Java and JavaScript code for detecting them: https://github.com/codebox/homoglyph

The list itself, in sorted text format, with each line a list of similar glyphs: https://github.com/codebox/homoglyph/blob/master/raw_data/chars.txt
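
A list in that shape (one group of look-alike glyphs per line) can be turned into a lookup table in a few lines of Python. This is only a sketch: the sample groups are hand-written for illustration rather than copied from chars.txt, and both the "first character on the line is the canonical one" rule and the skipping of `#` lines are assumptions, not documented behaviour of the project.

```python
def build_canonical_map(lines):
    """Map every glyph in a group to the first glyph of that group."""
    canonical = {}
    for line in lines:
        group = line.rstrip("\n")
        if not group or group.startswith("#"):
            continue  # skip blank lines and comment lines (assumed convention)
        first = group[0]
        for ch in group:
            # setdefault keeps the first mapping if a glyph appears twice
            canonical.setdefault(ch, first)
    return canonical


def canonicalize(text, canonical):
    """Replace each character with its group's canonical glyph, if it has one."""
    return "".join(canonical.get(ch, ch) for ch in text)


# Hypothetical sample groups: Latin letter first, then a Cyrillic look-alike.
SAMPLE_GROUPS = ["a\u0430", "e\u0435", "o\u043e", "p\u0440", "c\u0441"]

table = build_canonical_map(SAMPLE_GROUPS)
spoofed = "n\u0435ws"  # Cyrillic 'е' in place of Latin 'e'
print(canonicalize(spoofed, table) == "news")
```

With the real file you would pass the open file object instead of `SAMPLE_GROUPS`, after checking how its lines are actually laid out.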

frits1993 about 7 years ago

This reminds me of a project back in 2014, where a schoolmate and I created an "uncopyable" font using the same idea.

I put the site back online at http://nopy.progresso-ict.nl/ (the $10 in PayPal money was already given away years ago).

haneefmubarak about 7 years ago

I think at some point a sort of visual normalization that converts similar-looking Unicode to a single unique string sequence (e.g. converting letters from Cyrillic and other scripts that also appear in Latin to just Latin) is going to be necessary as a security precaution.

Given the whole "fake news" thing over the past couple of years, I expect that the first step will be taken by one of Google/Twitter/Facebook/etc., but I hope that they (or someone else) release a library (or, worst case, an online API) that allows this sort of normalization for security verification. I get that having it open would make it easier for people to find loopholes by brute-force testing, but these sorts of loopholes could also be patched rather quickly as they came up, providing benefit to everyone (especially from a security perspective).

EDIT: Perhaps this could start out as a series of matches generated using ML classification? I don't know much about ML - does anyone who does think this is a realistic starting point?
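
To make the idea concrete, here is a rough sketch of such a normalization in Python. The mapping table is a small hand-picked set of Cyrillic letters, chosen for illustration and nowhere near complete, and the function name `skeleton` is my own; a real library would want a maintained confusables list rather than this table.

```python
import unicodedata

# Hand-picked Cyrillic -> Latin look-alikes; illustrative, not exhaustive.
CYRILLIC_TO_LATIN = str.maketrans({
    "\u0430": "a", "\u0435": "e", "\u043e": "o", "\u0440": "p",
    "\u0441": "c", "\u0445": "x", "\u0443": "y", "\u0456": "i",
    "\u0410": "A", "\u0412": "B", "\u0415": "E", "\u041a": "K",
    "\u041c": "M", "\u041d": "H", "\u041e": "O", "\u0420": "P",
    "\u0421": "C", "\u0422": "T", "\u0425": "X",
})


def skeleton(text: str) -> str:
    """Comparison key: NFKC-normalize, fold look-alikes to Latin, casefold."""
    normalized = unicodedata.normalize("NFKC", text)
    return normalized.translate(CYRILLIC_TO_LATIN).casefold()


original = "Breaking news about space"
# Same headline with Cyrillic а, е, о swapped in for the Latin letters.
spoofed = "Bre\u0430king n\u0435ws \u0430b\u043eut sp\u0430c\u0435"

print(original == spoofed)                      # False: raw strings differ
print(skeleton(original) == skeleton(spoofed))  # True: keys match
```

The key is only meant for comparing or deduplicating titles. Applying it to text you intend to display would mangle genuine Cyrillic writing.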

beager about 7 years ago

Should be easy enough for networks to detect and remove these by identifying content where the characters within words routinely fall outside the character set of the language.

That, or some sort of fuzzy computer-vision (CV) hashing, which is cool but more intensive. That would also mitigate zero-width and invisible modifiers.
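
The first approach can be sketched in Python roughly as follows. Python's standard library has no script property, so this guesses the script from the first word of each character's Unicode name, which is a crude heuristic; the `INVISIBLE` set is a small sample I picked, and the function names are hypothetical.

```python
import unicodedata

# A few zero-width / invisible code points (illustrative, not a full list).
INVISIBLE = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}


def script_of(ch: str) -> str:
    """Rough script guess: first word of the Unicode name, e.g. 'LATIN'."""
    try:
        return unicodedata.name(ch).split()[0]
    except ValueError:  # unnamed code point
        return "UNKNOWN"


def suspicious_words(text: str):
    """Yield words that mix scripts or contain invisible characters."""
    for word in text.split():
        scripts = {script_of(ch) for ch in word if ch.isalpha()}
        has_invisible = any(ch in INVISIBLE for ch in word)
        if len(scripts) > 1 or has_invisible:
            yield word


headline = "Real n\u0435ws from \u043d\u043e\u0432\u043e\u0441\u0442\u0438"
# Flags only the mixed Latin/Cyrillic word, not the all-Cyrillic one.
print(list(suspicious_words(headline)))
```

A word written entirely in one script passes, so ordinary Russian or Ukrainian text is not flagged. Legitimate mixed-script words do exist, though, so this is better treated as a signal than as a verdict.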

smsm42 about 7 years ago

This is an old trick, used successfully for a while in domain names (does gооglе.com look suspicious to you? What if it had a valid SSL certificate?), but hopefully all browsers and registrars have wised up by now.

Another version of this trick has been popular in Russia with corrupt government workers. By law, many government purchase and service contracts must go through public calls for bids, usually posted on a website that you can search. However, if you write up what you need while replacing some of the Cyrillic characters with Latin ones, an honest supplier looking for a government contract will never find your entry. A corrupt one that you have made arrangements with beforehand will, and will be the sole bidder on the contract, at a price you agreed on in advance (which of course includes a juicy cut for the corrupt government official). Nobody is the wiser, all requirements of the law are fulfilled, and who could be blamed that there's only one bidder?
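
For the domain-name case, the substitution becomes visible once the name is converted to the ASCII form that DNS actually carries. A minimal sketch using Python's built-in `idna` codec (which implements the older IDNA 2003 rules); the helper names are mine:

```python
def to_ascii(domain: str) -> str:
    """Encode a domain name with Python's built-in IDNA (2003) codec."""
    return domain.encode("idna").decode("ascii")


def has_non_ascii_label(domain: str) -> bool:
    """True if any label had to be punycode-encoded (starts with 'xn--')."""
    return any(label.startswith("xn--") for label in to_ascii(domain).split("."))


lookalike = "g\u043e\u043egl\u0435.com"  # Cyrillic о, о, е in place of Latin

print(to_ascii("google.com"))             # google.com
print(has_non_ascii_label("google.com"))  # False
print(has_non_ascii_label(lookalike))     # True
```

This only shows that a label contains non-ASCII characters. Deciding whether it is an impersonation attempt, as opposed to a legitimate internationalized name, needs a script-mixing or confusables check on top.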

filleokus about 7 years ago

In Sweden (and probably other places), a service called URKUND[0] ("deed" in Swedish) is used for automatic plagiarism detection in schoolwork.

I have always wondered to what extent they catch stuff like this, and other potential trickery with UTF-8 or with removing text layers from PDF files.

[0]: http://www.urkund.com/en/

goptimize about 7 years ago

Maarten Schenk (resident expert on fake news) creates click-bait titles

BanzaiTokyo about 7 years ago

Substituting the characters o/a/e (which look alike in the Latin and Cyrillic alphabets) has been used for years to get past automatic plagiarism detectors.

mfoy_ about 7 years ago

> The site is part of a growing list of fake Native American pages run out of places like Macedonia, Kosovo or Vietnam.

So the headline is a little misleading... It's just that there are a growing number of websites that simply plagiarize content to get views / ad revenue. Because their titles are obfuscated to prevent detection of the plagiarism, they have to target specific niche groups to drive views. So it's not some weird "fake Native American" scheme / scam / ploy... it's just that this site in particular seems to focus on "Native American topics".

So it's not "Fake Native Americans Are Using Russian Characters to Avoid Plagiarism Detectors", it's "Fake News Sites Plagiarize Articles by Using Cyrillic Character Replacement to Avoid Detection", subtitle: "One such site targets Native Americans!"

kozak about 7 years ago

Let me nitpick a bit: these characters are not Russian, they are Cyrillic. There are some Cyrillic characters that are distinctly Russian (i.e. used only in the Russian language), but these characters can't impersonate Latin letters because they are too different from them.

https://en.wikipedia.org/wiki/Cyrillic_script
