An Unexpected Character Replacement

59 points by eaguyhn over 5 years ago

4 comments

Tarq0n over 5 years ago
These angle brackets are how R handles printing some Unicode to stdout on Windows. In memory it should be regular Unicode though.

The mistake they made is a classic R footgun: the fileEncoding argument to write.table() controls the encoding of the filename, not its contents. You either have to control the encoding of the files by manually creating the connection through file() or, ideally, just use the readr or data.table libraries.

Base R makes a lot of Unixy assumptions when it comes to text, so it's not pleasant to work with on Windows. The package ecosystem has solved most of these problems though.
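The general point about stating the output encoding explicitly, instead of trusting a platform default, can be illustrated outside R. The following is a minimal Python sketch, not the R fix itself; the helper name `write_text` and the file name `names.txt` are made up for illustration.

```python
import os
import tempfile


def write_text(path: str, text: str, encoding: str) -> bytes:
    """Write text with an explicit encoding and return the raw bytes on disk."""
    # newline="" stops Python from translating "\n" on Windows.
    with open(path, "w", encoding=encoding, newline="") as fh:
        fh.write(text)
    with open(path, "rb") as fh:
        return fh.read()


# Hypothetical scratch file, used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "names.txt")

utf8_bytes = write_text(path, "Müller\n", "utf-8")
cp1252_bytes = write_text(path, "Müller\n", "cp1252")

# "ü" is two bytes (C3 BC) in UTF-8 but a single byte (FC) in Windows-1252.
print(utf8_bytes)
print(cp1252_bytes)
```

The single byte FC in the Windows-1252 output is the same value that shows up in the escaped form "M<fc>ller" quoted elsewhere in this thread.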
avian over 5 years ago
I used to have a job that involved parsing large textual datasets. It was fascinating to me how far you could reconstruct the history of a dataset just by looking at encoding errors - and practically no dataset I've seen came without them. Sometimes I could be certain of several specific import/export steps, each introducing a new layer of encoding errors on top of the previous one. Other times I could correlate timestamps and see when specific data entry bugs were introduced and when they were fixed.

Strictly speaking, once you lose the information about the encoding of a string you can't say anything about it. But given some heuristic, some contextual knowledge (like how the author of the post guesses that "M<fc>ller" means "Müller") and a large enough amount of data, you can pretty much always work back through and correct the errors. Well, as long as someone didn't replace all 8-bit characters with question marks, but that was very rare in my experience.
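A rough Python sketch of that kind of heuristic repair, for two common layers of damage: hex byte escapes like "<fc>", and UTF-8 bytes that were misread as Windows-1252. The function names are hypothetical, and treating the escaped bytes as Windows-1252 is an assumption that contextual knowledge would have to justify; the string alone cannot confirm it.

```python
import re

# Matches a two-digit hex byte escape such as "<fc>".
_ESCAPE = re.compile(r"<([0-9a-fA-F]{2})>")


def unescape_bytes(text: str, encoding: str = "cp1252") -> str:
    """Replace "<xx>" escapes with the character that byte has in `encoding`.

    The default encoding is an assumption, not something the text proves.
    """
    def repl(match: re.Match) -> str:
        byte = bytes([int(match.group(1), 16)])
        try:
            return byte.decode(encoding)
        except UnicodeDecodeError:
            return match.group(0)  # leave undecodable escapes untouched

    return _ESCAPE.sub(repl, text)


def fix_mojibake(text: str) -> str:
    """Undo one layer of UTF-8 bytes that were decoded as Windows-1252."""
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text  # not this kind of damage; return unchanged


print(unescape_bytes("M<fc>ller"))
print(fix_mojibake("MÃ¼ller"))
```

Each import/export step that introduced a layer of errors would need its own inverse step like these, applied in reverse order. Text that was never damaged passes through `fix_mojibake` unchanged, because its Windows-1252 bytes are not valid UTF-8.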
ChrisSD over 5 years ago
> The dataset started out as a Microsoft Excel file, presumably in Windows-1252 encoding. This was converted into a CSV, then loaded into the R environment for adding additional information required by GBIF, then exported from R as a text file with the command option fileEncoding = "UTF-8".

Quite a journey. The ultimate culprit was an R text cleaning function, but I wonder why the Excel sheet was in Windows-1252 encoding. And can't R import Excel files directly?
ngcc_hk over 5 years ago
Shouldn't "by R" be in the title?