
Ask HN: What data visualisation tools do data scientists and developers use?

65 points by lasryaric almost 10 years ago
As a data scientist or software engineer working with "medium / large" amounts of data, how do you visualize them? Let's take a few examples:

- Working on an iPhone app dealing with signal processing, Fourier transforms, etc., how do you visualize your signals and frequencies?

- As a back-end engineer working with a directed-graph data structure, how do you quickly visualize your graph? Are you interested in seeing your graph change, step by step?

- How do you quickly visualize massive amounts of data points in time series?
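For the first example, a minimal numpy/matplotlib sketch of the approach several answers below describe: plot the signal in the time domain and its spectrum in the frequency domain. The tones, noise level, and sample rate are made up purely for illustration.

    import numpy as np
    import matplotlib.pyplot as plt

    # Synthetic signal: 50 Hz and 120 Hz tones plus noise, sampled at 1 kHz.
    fs = 1000
    t = np.arange(0, 1.0, 1.0 / fs)
    signal = np.sin(2 * np.pi * 50 * t) + 0.5 * np.sin(2 * np.pi * 120 * t)
    signal += 0.2 * np.random.randn(t.size)

    # One-sided amplitude spectrum via the real FFT.
    freqs = np.fft.rfftfreq(signal.size, d=1.0 / fs)
    spectrum = np.abs(np.fft.rfft(signal)) / signal.size

    fig, (ax_time, ax_freq) = plt.subplots(2, 1, figsize=(8, 6))
    ax_time.plot(t, signal)
    ax_time.set(xlabel="time (s)", ylabel="amplitude", title="signal")
    ax_freq.plot(freqs, spectrum)
    ax_freq.set(xlabel="frequency (Hz)", ylabel="magnitude", title="spectrum")
    fig.tight_layout()
    plt.show()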

15 comments

hglaser almost 10 years ago
I'll just say upfront that I'm the cofounder of Periscope (https://www.periscope.io/), which is specifically marketed at data scientists, so I have a horse in this race. :)

There are two kinds of charts: charts designed to find information, and charts designed to sell information. The latter are often gorgeous and many-dimensional: heatmaps, animated bubble charts, charts with time sliders, etc. And by all means, if selling the data is required, then sell it with the best tool for the job.

As for actually investigating the data, it's usually a lot of tables, lines, and bars. They're simple to understand, and there's no cleverness in the visualization that might hide critical information.

To answer your questions, at Periscope I've seen:

1. A line graph of amplitude over time. You should see the frequency emerge clear as day. If you want to calculate frequency explicitly, you could overlay a second line with its own axis. Again, super simple, but it gives you the answer directly.

2. I've seen a lot of fancy graph visualizations, but nothing that makes me happy. Depending on what you want to know about your graph, maybe a simple table with a structure like:

    [node name][node name][weight]

Or:

    [timestamp][node name][node name][weight]

A pivot table on top of this data, transforming the second node column into the table's horizontal axis, can also be useful.

3. OK, obviously I think Periscope is a great choice here. Loads of data analysts use it to visualize time-series data of many tens or hundreds of billions of data points.

That said, other good choices are: Excel, R/Stata/Matlab, gnuplot, Apache Pig. And for the data storage itself, IMO Amazon Redshift is unparalleled.
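The pivot described in point 2 is a one-liner in pandas. A minimal sketch; the edge list and column names are hypothetical:

    import pandas as pd

    # Hypothetical edge list in the [timestamp][node name][node name][weight]
    # layout described above.
    edges = pd.DataFrame({
        "timestamp": pd.to_datetime(["2015-07-22", "2015-07-22", "2015-07-23"]),
        "src": ["a", "a", "b"],
        "dst": ["b", "c", "c"],
        "weight": [3, 1, 7],
    })

    # Pivot the second node column onto the horizontal axis, summing
    # duplicate edges; missing pairs become 0.
    adjacency = edges.pivot_table(index="src", columns="dst",
                                  values="weight", aggfunc="sum", fill_value=0)
    print(adjacency)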
Lofkin almost 10 years ago
For interactive exploration of huge and out-of-core datasets, Python's Blaze/Dask offers data manipulation with out-of-core parallel dataframes, and Python's Bokeh offers interactive WebGL rendering and downsampling for visualizing those dataframes.

http://blaze.pydata.org/en/latest/
http://bokeh.pydata.org/en/latest/
http://dask.pydata.org/en/latest/
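A minimal sketch of the out-of-core workflow this enables; the file pattern and column names are hypothetical:

    import dask.dataframe as dd

    # Lazily point at a directory of CSVs that may not fit in RAM;
    # nothing is read into memory yet.
    df = dd.read_csv("events-*.csv")

    # Out-of-core aggregation: Dask streams the chunks through the
    # groupby, and only the small result is materialized.
    per_user = df.groupby("user_id")["duration"].mean().compute()

    # The result is an ordinary pandas Series, small enough to hand
    # to Bokeh or matplotlib for plotting.
    print(per_user.head())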
ThePhysicist almost 10 years ago
For exploratory data visualization I can recommend Matplotlib:

http://matplotlib.org/

It's flexible, easy to use, and can, with some tweaking, produce publication-quality output. It supports a wide range of backends such as PDF, SVG, and PNG, and a range of GUIs like Qt, so it is easy to use for both static and interactive graphs. More importantly, it plays nicely with the IPython notebook (http://ipython.org/notebook.html) and with libraries such as numpy, pandas, and scikit-learn. The last point is very important, because when visualizing data half of the problem is getting the data in shape, and Python offers some excellent tools for this (much better than what you would get in JavaScript).

The choice of the right tool depends, of course, on where you want to show your plots and what kind of plots you want to generate. If you intend to show your visualizations in a web browser, then D3 or a similar JS library would probably be a good choice (also have a look at Bokeh, which is becoming a very good alternative to Matplotlib). For iPhone applications there are probably native toolkits for this, but unfortunately I have no experience there and cannot make any recommendations.

Concerning the scalability of your visualizations: to my knowledge there are not many tools that can handle more than a few tens of thousands of data points without becoming slow, so you will probably always have to reduce the number of data points before plotting them. In any case, showing all of your data points without prior aggregation will probably not be a good idea in most cases, performance aside. If you really need to work with massive numbers of data points, have a look at Processing, which to my understanding can handle pretty large data sets (but is not a data visualization toolkit per se). OpenGL-based approaches can also be interesting for visualizing large numbers of data points, especially but not only in 3D.

Concerning your examples:

- Gephi is a nice tool for visualizing graphs (http://gephi.github.io/), but a bit clunky to use at times.

- R has the excellent ggplot module, which can work with pretty large data sets and has specialized graphs for e.g. time-series data.
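One common way to do the point reduction recommended here is min/max decimation into a fixed number of buckets. A minimal matplotlib sketch on synthetic data:

    import numpy as np
    import matplotlib.pyplot as plt

    # Ten million synthetic points: far too many to draw one by one.
    n = 10_000_000
    t = np.linspace(0, 100, n)
    y = np.sin(t) + 0.1 * np.random.randn(n)

    # Min/max decimation into ~2000 buckets keeps the visual envelope
    # that a naive "every k-th point" subsample can miss.
    buckets = 2000
    trimmed = n - n % buckets
    yb = y[:trimmed].reshape(buckets, -1)
    tb = t[:trimmed].reshape(buckets, -1).mean(axis=1)

    plt.fill_between(tb, yb.min(axis=1), yb.max(axis=1), alpha=0.4)
    plt.plot(tb, yb.mean(axis=1))
    plt.xlabel("t")
    plt.ylabel("y")
    plt.show()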
rgbrgb almost 10 years ago
I used Google's SVG charting library [0] this week, and I have to say it's really improved a ton in terms of customizability. Data in, chart out... I only had to do a bit of custom HTML to fix up the legend: https://www.openlistings.co/near/googleplex

That said, my feeling is that the hard part of dataviz is figuring out what to visualize. If you're working with big datasets, it's always going to be aggregate statistics that you're looking at (unless you're one of those people that can see blondes and brunettes in logs of the matrix).

[0]: https://developers.google.com/chart/interactive/docs/gallery/linechart
lmeyerov almost 10 years ago
[Disclaimer: we're doing http://www.graphistry.com, which helps with graph views of big IT infrastructure and log data.]

While doing research at Berkeley, and now building IT sec/ops visibility tools, I've gone through several phases. Basically:

- Start simple: Excel :)

- Advance to Python: pivot tables and plots via IPython/Jupyter notebooks with Pandas. (I used to do R, but Python largely caught up, thankfully.) For bigger stuff, we started playing with PySpark.

- For IT time series & logs: graphite/ganglia/ELK, and, if you can afford it, significantly more polished tools like Splunk.

- Network graphs are an evolving story. Small graphs are easy through something like Neo4j's built-in explorer. Alternatively, the Linkurious team is friendly, and Keylines is embedded all over the place. Big graphs are the problem: Gephi can do ~50K nodes and has a *lot* of features, but it's hard to use and showing its age. We're building something more scalable and analysis-focused, particularly for sec/ops, so happy to share beta access. (We use GPUs and focus hard on smooth integrations, which lets us change the rules a bit.) Let me know.
chrisalbon almost 10 years ago
To blatantly toot my own horn here: for visual exploratory data analysis, I just released Popily (http://popily.com).

It is meant as a no-code solution to get the layout of a dataset and do some fast EDA. Drag data in, then click around your data. It isn't a replacement for in-depth analysis, but it is a heck of a lot faster for EDA than making dozens of charts in Matplotlib.

Here are a bunch of examples (http://popily.com/examples/) of Popily in action. Personal favorite: Battles in the Game of Thrones (http://popily.com/explore/notebook/battles-in-the-game-of-thrones/).
adolgert almost 10 years ago
There's a sharp divide between what's prepackaged for typical uses and attempts to customize visualization for three-dimensionality, parallel processing, or interactivity with large data. The latter set of tools use the Visualization Toolkit (VTK) directly, or Paraview, MayaVI, and VisIt, all of which sit on top of VTK. Or there are commercial applications, such as EnVision and IDL. Keep a language-agnostic stable of prepackaged visualization techniques in Python, R, Matlab, Mathematica, Excel (yes, even), Julia, and JavaScript. Then reach for the big hammers when absolutely necessary.
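For a sense of what sitting on top of VTK looks like, here is a minimal pipeline through VTK's Python bindings: the classic sphere example, not tied to any particular dataset:

    import vtk

    # Classic minimal VTK pipeline: source -> mapper -> actor -> renderer.
    sphere = vtk.vtkSphereSource()
    sphere.SetThetaResolution(32)
    sphere.SetPhiResolution(32)

    mapper = vtk.vtkPolyDataMapper()
    mapper.SetInputConnection(sphere.GetOutputPort())

    actor = vtk.vtkActor()
    actor.SetMapper(mapper)

    renderer = vtk.vtkRenderer()
    renderer.AddActor(actor)

    window = vtk.vtkRenderWindow()
    window.AddRenderer(renderer)

    interactor = vtk.vtkRenderWindowInteractor()
    interactor.SetRenderWindow(window)

    window.Render()
    interactor.Start()  # interactive window: rotate/zoom with the mouse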
ironchef almost 10 years ago
In general, most data scientists don't need to see the entirety of a data set to get an idea. A lot of the initial data exploration is about separating signal from noise: figuring out which parts of the data are important.

After that (similar to your examples), it completely depends on what you're trying to see. Am I looking at per-capita data? Then chances are I simply toss it at a choropleth. Time series tends to work pretty well as a simple linear graph (onto which one will often overlay other series), etc.
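The overlay pattern mentioned here, sketched minimally in pandas; the event log and column names are hypothetical:

    import pandas as pd
    import matplotlib.pyplot as plt

    # Hypothetical event log: one row per event with a timestamp column.
    events = pd.read_csv("events.csv", parse_dates=["timestamp"])

    # Aggregate to an hourly count, then overlay a 24-hour rolling mean.
    hourly = events.set_index("timestamp").resample("1H").size()
    ax = hourly.plot(label="hourly count")
    hourly.rolling(24).mean().plot(ax=ax, label="24h rolling mean")
    ax.legend()
    plt.show()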
rparrish almost 10 years ago
I'll say upfront that I'm a Product Manager at Treasure Data, and we market to data scientists, specifically to enable you to perform analysis on large datasets directly from your local machine. More generally, Treasure Data enables the collection, storage & analytics of large-scale event data.

For performing preliminary analytics, I'll agree with what the previous respondents have said: the iPython Notebook is a GREAT tool. It's certainly my go-to. The libraries I think of using in this context are as follows.

The go-to packages:
> http://ggplot2.org/ (R)
> http://matplotlib.org/ (Python)

Graph visualizations:
> http://gephi.github.io/
> http://neo4j.com/

Online dashboards:
> https://github.com/stitchfix/pyxley (<- I'm particularly excited to try this out)
> http://bokeh.pydata.org/

Of course, the challenge is that you don't have a static dataset! New data is continuously coming in. Your dataset is growing larger all the time. It may be too large to fit on your local machine.

That's why Treasure Data was founded: to enable the easy collection of, and analytics on, this type of data stream. Treasure Data completely removes the engineering & devops work for these collection & storage steps.

For example:
> Want a continuously updated dashboard of your incoming data? Treasure Data + Jupyter notebooks + Pyxley
> Want to perform graph visualizations on event data? Treasure Data + Jupyter notebooks + Neo4j
> Want to create visualizations in R? Treasure Data + R + ggplot

The above is enabled through Treasure Data's integration with pandas & R (http://docs.treasuredata.com/articles/jupyter-pandas).

Good luck in your work!
minimaxir almost 10 years ago
It depends on the context. Is this data being rendered as a PNG for use in a blog post? Or as an interactive application on a webpage?

For 2D time series, any 2D application is fine if the data is pre-processed, even Excel. For 2D plots in general, though, I'm more biased toward R and ggplot2 (see my tutorial: http://minimaxir.com/2015/02/ggplot-tutorial/).

Graph data structures are a bit harder. I know Gephi is used for creating PNGs, but I have less experience with it.

For interactive web charts, you have to use libraries like d3.js or Highcharts, but I am not a fan of using interactive charts for static data unless necessary, mostly because they never play well with mobile devices without significant QA.
bluesmoon almost 10 years ago
We use Julia (running in IJulia) with D3, using either iframe communication, window.postMessage, or hooking directly onto the WebSocket to communicate between the D3 viz and the Julia backend.

We decided to use D3 because of the interactivity. There are performance problems with the default SVG examples when the number of visualized nodes grows, so I wrote a hybrid canvas + SVG rendering layer. We use canvas for the bulk of the drawing, SVG for text nodes or the few nodes that need mouse interaction, and event delegation on the document to determine whether the mouse has interacted with anything else.
evandrix almost 10 years ago
For large-scale EDA time-series viz, gnuplot has served me well; it even understands millisecond timestamps.
malux85 almost 10 years ago
It depends on how fluid I need to be.

If I have a very specific, quantifiable goal in mind, then I use test-driven development, with the added bonus of having a test suite at the end.

If I am working with large datasets on servers, then I simply subsample, and then scale up.
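A minimal sketch of that subsample-first workflow in pandas; the file name and fraction are placeholders:

    import pandas as pd

    df = pd.read_csv("big_table.csv")
    sample = df.sample(frac=0.01, random_state=42)  # work on 1% first

    # Prototype the analysis and plots on `sample`; once they look right,
    # rerun the same code against the full dataset on the server.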
andegre almost 10 years ago
Machete is a good one!

https://www.machete.io/beta
ahamino almost 10 years ago
We use Python a lot, so matplotlib is our tool of choice, combined with NumPy or pandas!