
TechEcho

A tech news platform built with Next.js, providing global tech news and discussions.

© 2025 TechEcho. All rights reserved.

Ask HN: What data visualisation tools do data scientists and developers use?

65 points | by lasryaric | almost 10 years ago

As a data scientist or software engineer working with "medium / large" amounts of data, how do you visualize them?

Let's take a few examples:

- Working on an iPhone app dealing with signal processing, Fourier transforms, etc., how do you visualize your signal and frequencies?

- As a back-end engineer working with a directed graph data structure, how do you quickly visualize your graph? Are you interested in seeing your graph change, step by step?

- How do you quickly visualize a massive number of data points as time series?

15 comments

hglaser | almost 10 years ago

I'll just say upfront that I'm the cofounder of Periscope (https://www.periscope.io/), which is specifically marketed at data scientists, so I have a horse in this race. :)

There are two kinds of charts: charts designed to find information, and charts designed to sell information. The latter are often gorgeous and many-dimensional: heatmaps, animated bubble charts, charts with time sliders, etc. And by all means, if selling the data is required, then sell it with the best tool for the job.

As for actually investigating the data, it's usually a lot of tables, lines and bars. They're simple to understand, and there's no cleverness in the visualization that might hide critical information.

To answer your questions, at Periscope I've seen:

1. A line graph of amplitude over time. You should see the frequency emerge clear as day. If you want to calculate frequency explicitly, you could overlay a second line with its own axis. Again, super simple, but it gives you the answer directly.

2. I've seen a lot of fancy graph visualizations, but nothing that makes me happy. Depending on what you want to know about your graph, maybe a simple table with a structure like:

    [node name][node name][weight]

Or:

    [timestamp][node name][node name][weight]

A pivot table on top of this data, transforming the second node column into the table's horizontal axis, can also be useful.

3. OK, obviously I think Periscope is a great choice here. Loads of data analysts use it to visualize time series data on many tens/hundreds of billions of data points.

That said, other good choices are: Excel, R/Stata/Matlab, gnuplot, Apache Pig. And for the data storage itself, IMO Amazon Redshift is unparalleled.
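The amplitude-over-time suggestion in point 1 can be sketched in a few lines. This is an illustrative example (mine, not the commenter's), using numpy and matplotlib; a second panel with the spectrum makes the dominant frequency explicit:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt

fs = 1000                           # sample rate (Hz)
t = np.arange(0, 1, 1 / fs)         # one second of samples
signal = np.sin(2 * np.pi * 5 * t)  # a 5 Hz tone

# Top: amplitude over time, where the frequency "emerges clear as day".
# Bottom: the explicit spectrum, as a second view with its own axis.
fig, (ax_time, ax_freq) = plt.subplots(2, 1)
ax_time.plot(t, signal)
freqs = np.fft.rfftfreq(len(signal), 1 / fs)
spectrum = np.abs(np.fft.rfft(signal))
ax_freq.plot(freqs, spectrum)

dominant = freqs[np.argmax(spectrum)]  # 5.0 Hz for this input
fig.savefig("signal.png")
```

For a real signal-processing workload you would feed in measured samples instead of the synthetic sine, but the two-panel layout stays the same.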
Lofkin | almost 10 years ago

For interactive exploration of huge and out-of-core datasets, Python's Blaze/Dask offers data manipulation with out-of-core parallel dataframes, and Python's Bokeh offers interactive WebGL rendering and downsampling for visualizing said dataframes.

http://blaze.pydata.org/en/latest/ http://bokeh.pydata.org/en/latest/ http://dask.pydata.org/en/latest/
ThePhysicist | almost 10 years ago

For exploratory data visualization I can recommend Matplotlib:

http://matplotlib.org/

It's flexible, easy to use and can, with some tweaking, produce publication-quality output. It supports a wide range of backends such as PDF, SVG and PNG, and a range of GUIs like Qt, so it is easy to use for both static and interactive graphs. More importantly, it plays nicely with the IPython notebook (http://ipython.org/notebook.html) and with libraries such as numpy, pandas and scikit-learn. The last point is very important because when visualizing data, half of the problem is getting the data in shape, and Python offers some excellent tools for this (much better than what you would get in Javascript).

The choice of the right tool depends, of course, on where you want to show your plots and what kind of plots you want to generate. If you intend to show your visualizations in a web browser, then D3 or a similar JS library would probably be a good choice (also have a look at Bokeh, which is becoming a very good alternative to Matplotlib). For iPhone applications there are probably native toolkits for this, but unfortunately I have no experience there and cannot make any recommendations.

Concerning the scalability of your visualizations: to my knowledge there are not many tools that can handle more than a few tens of thousands of data points without becoming slow, so you will probably always have to reduce the number of data points before plotting them. In any case, showing all of your data points without prior aggregation will probably not be a good idea in most cases, performance aside. If you really need to work with massive amounts of data points, have a look at Processing, which to my understanding can handle pretty large data sets (but is not a data visualization toolkit per se). OpenGL-based approaches can also be interesting for visualizing large amounts of data points, especially but not only in 3D.

Concerning your examples:

- Gephi is a nice tool for visualizing graphs (http://gephi.github.io/), but a bit clunky to use at times.

- R has the excellent ggplot module, which can work with pretty large data sets and has specialized graphs for e.g. time series data.
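The advice above about aggregating before plotting can be made concrete with pandas (my example, not the commenter's): a million one-second samples collapse to a few hundred hourly means, which any plotting tool handles comfortably.

```python
import numpy as np
import pandas as pd

# One million raw events, one per second.
idx = pd.date_range("2015-07-01", periods=1_000_000, freq="s")
df = pd.DataFrame(
    {"value": np.random.default_rng(0).normal(size=len(idx))},
    index=idx,
)

# Aggregate to hourly means before plotting: ~278 points instead of 1e6.
hourly = df["value"].resample("1h").mean()
```

From here, `hourly.plot()` in a notebook (or any charting library) works without the slowdowns the commenter warns about.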
rgbrgb | almost 10 years ago

I used Google's SVG charting library [0] this week and I have to say it's really improved a ton in terms of customizability. Data in, chart out... I only had to do a bit of custom HTML to fix up the legend: https://www.openlistings.co/near/googleplex

That said, my feeling is that the hard part of dataviz is figuring out what to visualize. If you're working with big datasets, it's always going to be aggregate statistics that you're looking at (unless you're one of those people who can see blondes and brunettes in logs of the matrix).

[0]: https://developers.google.com/chart/interactive/docs/gallery/linechart
lmeyerov | almost 10 years ago

[Disclaimer: we're doing http://www.graphistry.com , which helps with graph views of big IT infrastructure and log data.]

While doing research at Berkeley, and now building IT sec/ops visibility tools, I've gone through several phases. Basically:

-- Start simple: Excel :)

-- Advance to Python: pivot tables and plots via IPython/Jupyter notebooks with Pandas. (I used to do R, but Python largely caught up, thankfully.) For bigger stuff, we started playing with PySpark.

-- For IT time series & logs: graphite/ganglia/ELK, and if you can afford it, significantly more polished tools like Splunk.

-- Network graphs are an evolving story. Small graphs are easy through something like Neo4j's built-in explorer. Alternatively, the Linkurious team is friendly, and Keylines is embedded all over the place. Big graphs are the problem: Gephi can do ~50K nodes and has a *lot* of features, but it's hard to use and showing its age. We're building something more scalable and analysis-focused, particularly aimed at sec/ops, so happy to share beta access. (We use GPUs and focus hard on smooth integrations, which lets us change the rules a bit.) Let me know.
chrisalbon | almost 10 years ago

At the risk of blatantly tooting my own horn here: for visual exploratory data analysis, I just released Popily (http://popily.com).

It is meant as a no-code solution to get the layout of a dataset and do some fast EDA. Drag data in, then click around your data. It isn't a replacement for in-depth analysis, but it is a heck of a lot faster for EDA than making dozens of charts in Matplotlib.

Here are a bunch of examples (http://popily.com/examples/) of Popily in action. Personal favorite: Battles in the Game of Thrones (http://popily.com/explore/notebook/battles-in-the-game-of-thrones/).
adolgert | almost 10 years ago

There's a sharp divide between what's prepackaged for typical uses and attempts to customize visualization for any of three-dimensionality, parallel processing, or interactivity with large data. The latter set of tools uses the Visualization Toolkit (VTK), or Paraview, MayaVi, and VisIt, all of which sit on top of VTK. Or there are commercial applications, such as EnVision and IDL. Keep a language-agnostic stable of prepackaged visualization techniques in Python, R, Matlab, Mathematica, Excel (yes, even), Julia, and Javascript. Then reach for the big hammers when absolutely necessary.
ironchef | almost 10 years ago

In general, most data scientists don't need to see the entirety of a data set to get an idea. A lot of the initial data exploration is about separating signal from noise: figuring out which part of the data is important.

After that (similar to your examples), it completely depends on what you're trying to see. Am I looking at per capita data? Then chances are I simply toss it at a choropleth. Time series tend to work pretty well in a simple linear graph structure (on which one will often overlay series), etc.
rparrish | almost 10 years ago

I'll say upfront that I'm a Product Manager at Treasure Data, and we market to data scientists, specifically to enable you to perform analysis on large datasets directly from your local machine. More generally, Treasure Data enables the collection, storage & analytics of large-scale event data.

For performing preliminary analytics, I'll agree with what the previous respondents have said: iPython Notebook is a GREAT tool. It's certainly my go-to. The libraries I think of using when working within this context are as follows.

The go-to packages:
> http://ggplot2.org/ (R)
> http://matplotlib.org/ (Python)

Graph visualizations:
> http://gephi.github.io/
> http://neo4j.com/

Online dashboards:
> https://github.com/stitchfix/pyxley (<- I'm particularly excited to try this out)
> http://bokeh.pydata.org/

Of course, the challenge is you don't have a static dataset! New data is continuously coming in. Your dataset is growing larger all the time. It may be too large to fit on your local machine.

That's why Treasure Data was founded: to enable the easy collection of, and analytics on, this type of data stream. Treasure enables complete removal of the engineering & devops work for these collection & storage steps.

For example:
> Want a continuously updated dashboard of your incoming data? = Treasure Data + Jupyter Notebooks + Pyxley
> Want to perform graph visualizations on event data? = Treasure Data + Jupyter Notebooks + Neo4j
> Want to create visualizations in R? = Treasure Data + R + ggplot

The above is enabled through Treasure Data's integration with Pandas & R (http://docs.treasuredata.com/articles/jupyter-pandas).

Good luck in your work!
minimaxir | almost 10 years ago

It depends on the context. Is this data being rendered as a PNG for use in a blog post? Or as an interactive application on a webpage?

For 2D time series, any 2D application is fine if the data is pre-processed, even Excel. For 2D plots in general, though, I'm more biased toward R and ggplot2 (see my tutorial: http://minimaxir.com/2015/02/ggplot-tutorial/ ).

Graph data structures are a bit harder. I know Gephi is used for creating PNGs, but I have less experience with it.

For interactive web charts, you have to use libraries like d3.js or Highcharts, but I am not a fan of using interactive charts for static data unless necessary. (Mostly because they never play well with mobile devices without significant QA.)
bluesmoon | almost 10 years ago

We use Julia (running in IJulia) with D3, using either iframe communication, window.postMessage, or hooking directly onto the websocket to communicate between the D3 viz and the Julia backend.

We decided to use D3 because of the interactivity. There are performance problems with the default SVG examples when the number of visualized nodes grows, so I wrote a hybrid canvas + SVG rendering layer. We use canvas for the bulk of the drawing, SVG for text nodes or the few nodes that need mouse interaction, and event delegation on the document to determine whether the mouse has interacted with anything else.
evandrix | almost 10 years ago

For large-scale EDA time-series viz, `gnuplot` has served me well, even down to millisecond resolution.
malux85 | almost 10 years ago

It depends how fluid I need to be.

If I have a very specific, quantifiable goal in mind, then I use test-driven development, with the added bonus of having a test suite at the end.

If I am working with large datasets on servers, then I simply subsample, and then scale up.
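Subsampling before plotting is nearly a one-liner with numpy; a minimal sketch (the sizes are illustrative, not from the comment):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=1_000_000)   # the full dataset living on the server

# Pull a uniform random subsample small enough to plot locally.
sample = rng.choice(data, size=10_000, replace=False)
```

A uniform sample preserves aggregate shape (mean, spread, modes) but can miss rare outliers, which is why the scale-up step the commenter mentions still matters.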
andegre | almost 10 years ago

Machete is a good one!

https://www.machete.io/beta
ahamino | almost 10 years ago

We use Python a lot, so matplotlib is our tool of choice, combined with NumPy or pandas!