R is a joy if you treat it like Awk

168 points by dwrodri over 5 years ago

30 comments

iron0013 over 5 years ago
I use R a lot, and nothing in this post rings even the slightest bit true to me. This guy's use case for R seems to be very unlike almost anyone I've ever heard of. Towards the end of the article, when he's talking about how all he really needs from R is the summary() and boxplot() functions, it really becomes clear that he's never done anything more than dip the very tip of his pinky toe into R.

There are lots of valid criticisms of R, but this article doesn't touch on them. It's so off-base that it's the proverbial "not even wrong".
halfeatenpie over 5 years ago
I use R on a daily basis. A significant portion of our climate change adaptation research code and decision planning systems (which are used in various utilities around the world to support decision makers) is built using R.

I understand what the author is stating, but I feel this comes from inexperience with R and ignores the vast number of packages available for it that are specifically targeted at data science. There are some valid issues and criticisms of R, but I think this article only focuses on the application of R in a single context. A significant portion of data science is about cleaning the data, and an entire suite of packages known as the tidyverse solves these problems (for me, anyway) while also being very simple and easy to understand. The tidyverse supports piping, which is exactly what this article says to use.

Obviously your mileage may vary, but this post rubs me the wrong way.
goodside over 5 years ago
For historical context, R was not initially intended to be a standalone scripting language runnable via POSIX conventions and didn’t gain these features until circa 2010. R is designed to be used interactively, like the S language it was based on, a style later made familiar to students via MATLAB and TI graphing calculators. Before Rscript it was common to hack together shell scripts that cope with the expectation of human TTY input to run R in production, which functioned well as a “you must be this tall to ride” sign for people who might not appreciate how unreliable R code can be compared to a traditional language.
dredmorbius over 5 years ago
The precursor to R, S, was designed as a Unix tool, and pretty much implicitly relied on awk as its data-cleaning preprocessor.

For those who come from the world of large enterprise statistical and data reporting tools such as SAS, awk bears an exceedingly strong resemblance to the SAS DATA step, whilst R effectively provides a host of analysis and graphics tools that correspond to numerous other SAS procedures and products.

The hacks for pipelining R are cool and useful. Thanks.
bllguo over 5 years ago
> In my experience, just about anything beats R when it comes to cleaning dirty data.

> I used to resent R, it was shoved upon me as a strange tool that promised to replace Python, but failed miserably.

Well, I'm glad the author found a way to make R work for their purposes, but this just reeks of inexperience with R.
tylermw over 5 years ago
"R is only great if your data is already great."

Disagree. R has an incredible ecosystem for parsing, cleaning, and manipulating data. Even ignoring the tidyverse, base R provides more than enough functionality to clean and analyze data; you just need to spend some time learning it. If you use the tidyverse, it's even easier. The only other ecosystem that comes close to R is Julia, which was designed taking many of the best parts of R into consideration.
rcthompson over 5 years ago
R has lots of packages available for data cleaning. They just aren't necessarily included in base R. Most of the ones I'm aware of are in the Tidyverse. Even if you're dealing with a hand-written Excel file with multiple tables in the same sheet and various information encoded in text/background colors and such, there's the tidyxl package to help you do that.
wodenokoto over 5 years ago
I haven't worked with logs, but I do find R a joy to work with in general. The tidyverse is especially pleasant to use, though it is slow and memory hungry on very large datasets.

I haven't found a good way around very inconsistently formatted CSV files in any language (a row only represents columns 3, 4 and 6 if it starts with a comma; all other rows have all columns but are space separated, and values may contain commas, etc.).
oarabbus_ over 5 years ago
R isn't a joy, but it is powerful. This is the best blog post I've read on the matter: https://www.talyarkoni.org/blog/2012/06/08/r-the-master-troll-of-statistical-languages/comment-page-1/
CreRecombinase over 5 years ago
If you want to treat R like awk, you should really check out littler, a super useful R package that provides an alternative to both Rscript and R CMD BATCH and is designed for writing one-liners: https://github.com/eddelbuettel/littler
larrydag over 5 years ago
To be honest, I don't find Python to be very useful for data cleansing. Full disclosure: I use R quite a bit and use Python sparingly.

If you want to use R for data analysis, I suggest the R package Tidyverse, or a product like OpenRefine.

If you just need to run summary statistics with an application, then using Python is just fine.
chubot over 5 years ago
I get what the author is saying -- I use R in shell scripts too. It's really useful and composes well with other shell tools.

I also get what the commenters are saying, because R is a useful interactive language too, and it's also pretty good for data cleaning. It's significantly slower than Python, though, which is why I do any cleaning that cuts down the data before loading it into R.

As an example, I generate some benchmarks with every release of Oil:

https://www.oilshell.org/release/0.7.pre5/benchmarks.wwz/osh-parser/

and the tables are manipulated with R, but running the benchmarks is done with shell, and creating the HTML is done with Python:

https://github.com/oilshell/oil/blob/master/benchmarks/report.R

(and yes this page is meant to be motivation to speed up my principled but slow shell parser)

R and tidyverse are the best tools for manipulating tables by far. I wrote an intro here:

"What Is a Data Frame? (In Python, R, and SQL)" http://www.oilshell.org/blog/2018/11/30.html
just_myles over 5 years ago
I used to use awk all the time whenever I wanted to evaluate specific portions of a columnar file (with a short one-liner). Python comes close, but awk is much faster. Combining it with sed is a little much, I think, and IMO exceeds what I was doing. If I need to do any replacing or 'cleaning', I would then use something like Python or SQL.
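
For readers who have not seen this style, here is a minimal sketch of the kind of short awk one-liner over a columnar file that the comment describes. The sample file, its three columns (name, region, amount), and the variable names are all hypothetical and only there for illustration.

```shell
# Hypothetical sample data: name, region, amount (whitespace-separated).
data=$(mktemp)
cat > "$data" <<'EOF'
alice east 10
bob west 20
carol east 30
EOF

# Print the third column for rows whose second column is "east".
east_amounts=$(awk '$2 == "east" { print $3 }' "$data")
echo "$east_amounts"

# Sum the third column across all rows.
total=$(awk '{ sum += $3 } END { print sum }' "$data")
echo "$total"
```

Each one-liner is a pattern plus an action, which is why awk suits quick column checks; anything involving heavier replacement or cleaning is where the comment suggests moving to Python or SQL.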
mikorym over 5 years ago
So, this is not about the article. I've recently started using rpy2 as a Python wrapper for R functions and libraries, and I'm finding that it is not that bad.

There are some performance issues, but I am OK with trading that off for the convenience of using "try:" rather than R's "tryCatch". Having tryCatch as a function rather than built into the syntax is unacceptable to me. But there are some libraries in R that don't have elegant alternatives in Python, or for which you have multiple options, or that you perhaps don't have the time to unlearn.
arminiusreturns over 5 years ago
Here is how I like to use R, which works with either the author's more introductory methods or full-fledged data crunching:

The same way I handle all my other data sources, as code blocks inside an Emacs org-mode notebook. If you are doing data science, you quickly find that it's the management and combination of the various particular projects that becomes the most daunting (IMHO), and your data science notebook becomes the most important part of that organization. In that arena, for me it's pretty much either Jupyter or Emacs org-mode.
crispyambulance over 5 years ago
The regex stuff in stringr, and read_csv from readr are adequate for almost anything if you're talking about ordinary delimited file import.

What does awk give you that these don't?
isostatic over 5 years ago
Something to bypass while you reach for your Perl interpreter?
dima55 over 5 years ago
R is really overkill here. Simpler and nicer tools exist if you want to munge data on the command line and/or plot it. For instance:

https://github.com/dkogan/feedgnuplot/
https://github.com/dkogan/vnlog/
xiaodai over 5 years ago
"R is a joy if you don't use it."

Joke. I am a heavy R user.
safgasCVS over 5 years ago
I’ve recently discovered the pipe command, which allows R to consume the output of a terminal command (link: https://youtu.be/RYhwZW6ofbI). It's quite useful for reading in compressed files, and it’s made me learn a bit of awk to fix text files on the fly.

Oh, and there’s also Rio if you want to explore injecting R into your command line workflow.
alexhutcheson over 5 years ago
If you're interested in learning how to do data cleaning and restructuring in R, I highly recommend the "Wrangle" chapters of Hadley Wickham's R for Data Science book, which you can read online here: https://r4ds.had.co.nz/wrangle-intro.html
jwilber over 5 years ago
I don’t want to post something inflammatory or condescending, but it’s very clear that the author does not have experience using R.
yiyus over 5 years ago
    awk '/Recovery time:/{print $2}' output.log | boxplot
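
The one-liner above assumes a `boxplot` command that reads numbers from standard input; the thread does not say how that command is defined. As a rough sketch of the same extract-then-summarize shape without R, the pipeline below pulls the values with awk and prints a min/median/max summary as a stand-in for the plot. The log contents, the `: ` field separator (so that `$2` is the number), and the summary format are all assumptions made for illustration.

```shell
# Hypothetical log file; only the "Recovery time:" lines carry data.
log=$(mktemp)
cat > "$log" <<'EOF'
Starting run
Recovery time: 4
Recovery time: 1
noise line
Recovery time: 3
Recovery time: 2
Recovery time: 5
EOF

# Extract the value after ": ", sort numerically, then summarize.
# With the separator set to ": ", field 2 is the number itself.
summary=$(awk -F': ' '/Recovery time:/ { print $2 }' "$log" |
  sort -n |
  awk '{ v[NR] = $1 }
       END {
         if (NR == 0) exit 1
         median = (NR % 2) ? v[(NR + 1) / 2] : (v[NR / 2] + v[NR / 2 + 1]) / 2
         printf "min=%s median=%s max=%s\n", v[1], median, v[NR]
       }')
echo "$summary"
```

Swapping the final awk stage for an R script that calls summary() or boxplot() gives the workflow the article describes, with awk still doing the extraction.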
Myrmornis over 5 years ago
dwrodri: Not sure if you're a macOS user, but iTerm2 distributes a script to display images inline in the terminal. I've forgotten how to get R to output PNG data directly to stdout, but it could be nice to do that and display the images inline in the terminal.

https://www.iterm2.com/documentation-images.html
joker3 over 5 years ago
Rscript can also be used to run R scripts from the command line. Just put #!/usr/bin/env Rscript on the first line.
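
A minimal sketch of that shebang approach follows. The script name `summarize.R`, the scratch directory, and the choice to read numbers from standard input are hypothetical, and the script is only executed if Rscript happens to be installed.

```shell
# Work in a scratch directory so nothing in the current directory is touched.
workdir=$(mktemp -d)
script="$workdir/summarize.R"

# Write a small R script whose first line is the Rscript shebang.
cat > "$script" <<'EOF'
#!/usr/bin/env Rscript
# Read numbers from standard input and print summary statistics.
x <- scan(file("stdin"), quiet = TRUE)
print(summary(x))
EOF

# Mark it executable so the shell honors the shebang line.
chmod +x "$script"

# Run it directly, but only when Rscript is actually installed.
if command -v Rscript >/dev/null 2>&1; then
  printf '1\n2\n3\n4\n' | "$script"
else
  echo "Rscript not found; skipping run" >&2
fi
```

Once the file is executable it can sit at the end of a pipeline like any other command-line filter.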
dwrodri over 5 years ago
OP here: I’m aware of some CSS issues on mobile and will work to fix them shortly.

UPDATE: fixed color palette issue, although it still looks quite bad.
dfgdghdf over 5 years ago
Lots of people don't realize this, but R is very similar to JavaScript. Shiny apps offer a nice functional web framework too.
gpvos over 5 years ago
> Half-assing something in Unix can produce great results as long you chose the right half of the ass.

That is a great quote!
enriquto over 5 years ago
For that simple use case, gnuplot is even better than R.
hkt over 5 years ago
Awk is one of the few tools that possesses the ability to make me beg for death.