TechEcho

Ask HN: All of you working with Big Data, what is your Data?

117 points by sysk over 10 years ago
Big data is a trending topic these days and I'd like to get my hands dirty, both out of curiosity and to make myself more relevant on the marketplace. That being said, I'm not sure which data sets are both interesting to play with and easily accessible. My question is:

For those of you already working with big data, what kind of data do you work with?

39 comments

alanctgardner3 over 10 years ago
If you want to work in a "big data"-type role as a developer, I wouldn't worry about finding huge data sets. There's a dearth of candidates, especially ones who actually have hands-on experience, and having deep knowledge of (and a little experience with) a broad range of tools will make you a pretty good candidate:

Fire up a VM with a single-node install on it [1] and just grab any old CSVs. Load them into HDFS, query them with Hive, query them with Impala (Drill, SparkQL, etc.). Rinse and repeat for any size of syslog data, then JSON data. Write a MapReduce job to transform the files in some way. Move on to some Spark exercises [2]. Read up on Kafka, understand how it works, and think about ways to get exactly-once message delivery. Hook Kafka up to HDFS, or HBase, or a complex event processing pipeline. You'll probably need to know about serialization formats too, so study up on Avro, protobuf and Parquet (or ORCfile, as long as you understand columnar storage).

If you can talk intelligently about the whole grab bag of stuff these teams use, that'll get you in the door. Understanding RDBMSes, data warehousing concepts, and ETL is a big plus for people doing infrastructure work. If you're focused on analytics you can get away with less of the above, but knowing some of it, plus stats and BI tools (or D3 if you want to roll your own visualization), is a plus.

[1] http://www.cloudera.com/content/cloudera/en/downloads/quickstart_vms/cdh-5-3-x.html
[2] http://ampcamp.berkeley.edu/5/
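The "write a MapReduce job" step above can be sketched in plain Python, mimicking the Hadoop Streaming contract (a mapper emits key/value pairs, the framework sorts by key, a reducer folds each group). The word-count transform below is a hypothetical illustration, not anything from the comment itself:

```python
from itertools import groupby
from operator import itemgetter

def mapper(lines):
    # Emit (word, 1) pairs, as a Hadoop Streaming mapper writes to stdout
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def reducer(pairs):
    # Hadoop delivers pairs grouped/sorted by key; we simulate that with sorted()
    for key, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield key, sum(count for _, count in group)

if __name__ == "__main__":
    sample = ["big data big tools", "data pipelines"]
    print(dict(reducer(mapper(sample))))  # {'big': 2, 'data': 2, 'pipelines': 1, 'tools': 1}
```

On a real cluster the mapper and reducer would be separate scripts reading stdin and writing tab-separated lines; the shuffle/sort between them is what the framework provides.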
me_bx over 10 years ago
The Twitter social graph (follow connections between people) is my data source; I extract it from the API and cache it in a database.

The MariaDB table storing this information currently takes a bit more than 500GB. It has about 4 billion rows (based on the table statistics; I don't run SELECT count(*) on it anymore).

I usually don't use the term "big data" because the buzzword is so popular that it doesn't mean anything anymore.
jawns over 10 years ago
Day job: Web site user sessions and offline retail sales data.

Side project: Poll responses on http://www.correlated.org
malux85 over 10 years ago
Here's some stuff I have done in the past year. I work for a small company, but run a personal computing cluster of 167 servers that I pay for out of my own pocket. I really enjoy loading "big" datasets into them and working on improving algorithms or gaining insight into the data.

I (try to) network around London and offer my services for free to people who have interesting problems.

- Very high resolution fMRI data. A single scan can be 10-20GB
- Infringing URLs for a piracy company, 4 billion rows
- DNA sequences and protein data, with lots of variation in sizes, from a few hundred MBs of string data to hundreds of GBs
- Raw radio data for a military skunkworks project (10s of GB/min)

I would really like to find an investor who could take me off my full-time job. I have 3 quite large projects I would like to build, one of which I have almost finished.
ScottBurson over 10 years ago
If you're looking for data sets to play with, check out Kaggle [0]. Companies post data sets there along with questions they want answered, and people compete to find the best way to answer them.

[0] www.kaggle.com
Maro over 10 years ago
I work at Prezi. We have about a petabyte of data. It's usage data coming from the product and the website: clicks in the editor and such. Then we have a data warehouse with cleaned and accurate datasets; that's much less. We are on AWS: we use S3, EMR for Hadoop, Pig, Redshift for SQL, Chartio, etc. We have our own hourly ETL written in Go, which we will open-source this year.

I recently talked at Strata; here's the Prezi:

https://prezi.com/d1889jmlziks/strata-2014/
nevinera over 10 years ago
Retail transaction/loyalty, network traffic, financial, and health data.

To be clear, "big data" is poorly defined, and I mostly do not work with terabyte+ data sets, but rather with highly dimensional data in moderate volume. Data is only "big" relative to the algorithms you try to use on it.
serhanbaker over 10 years ago
You can actually think up an interesting application and generate your own data. For example, we were developing a product for processing network events in real time. There were 6-10K events per second, and we were creating alerts for several different scenarios. For testing purposes, we wrote a program to simulate those events at 20K events per second. It generated fake (but realistic) data in the right format.

Application idea off the top of my head: generate turnstile data for different subway stations (enter/exit, time) and write an application to show the density of those stations over time. You can create a scenario where a certain station is denser than others, and this could be your test. And this application could be your proof of concept.
crzrcn over 10 years ago
Treasure Data, 400k records per second. For us it's less about the data we manage and more about how easy we make it for customers to store and query it.

Data consists of IoT devices ranging from wearables to cars to frickin' windmills, plus analytics from various websites and mobile games.
laughfactory over 10 years ago
I am the data modeler for an organization which lends to small businesses. In my experience, "big data" is all in the eye of the beholder, and it's not all about how many gigabytes of data you work with, or how wide or how long it is. The challenges are the same: how to use the data in relevant ways to forward organizational goals. In my case the data isn't particularly long in terms of number of rows, but it is exceptionally wide in terms of potential variables. It's enough data that I have to spend a reasonable amount of time thinking about the most efficient way to model (statistically) and data mine. The issues are similar to other data-oriented jobs I've had: how to determine which variables are relevant, clean and transform the data... and ultimately how to turn a big pile of data into a model which effectively predicts likelihood of charge-off if the loan were to be approved. Scintillating stuff, but obscenely difficult. Of course, it's harder too because I'm the only modeler and am fairly inexperienced. My last experience building predictive models was a couple classes in college... which was also my last experience using R (which I prefer to SAS).

To answer your implied question, I'd recommend picking up ANY size of real-world data and playing with it. Build statistical models (predictive or otherwise), apply supervised and unsupervised machine learning methods to it, but above all develop a foundation of experience working with real-world data. In class in college we used "canned" data sets which were already cleaned, validated, organized, and so forth. This made it unrealistically easy to model. In the real world, just working with the data effectively is a hard-won skill.

So from the get-go you need to learn how to explore data, visualize it, interpret plots and statistics, clean/transform/normalize it, formulate a question your data can answer, and apply the relevant methods in pursuit of the answers you seek. Once you have the fundamentals down, the size of the data is immaterial, only requiring you to put additional thought into what you can computationally achieve (for instance, how to determine which of 150 candidate variables are statistically relevant).
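The clean/transform step described above, the part canned classroom data sets hide, can be sketched with nothing but the standard library. The field name and the messy sample rows are hypothetical; the pattern of coercing a dirty column and skipping unparseable entries rather than crashing is the real-world skill being illustrated:

```python
import statistics

def summarize(rows, field):
    """Basic exploration of one messy column: coerce to float, report simple stats."""
    values = []
    for row in rows:
        raw = str(row.get(field, "")).strip().replace(",", "")  # "12,500" -> "12500"
        try:
            values.append(float(raw))
        except ValueError:
            continue  # real-world data: tolerate "N/A", blanks, etc.
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values) if len(values) > 1 else 0.0,
    }

if __name__ == "__main__":
    # Invented loan records with the kind of inconsistencies real extracts have
    loans = [{"amount": "12,500"}, {"amount": "8000"},
             {"amount": "N/A"}, {"amount": " 9500 "}]
    print(summarize(loans, "amount"))  # n=3; "N/A" is dropped, not fatal
```

Running a pass like this over every candidate column is often the first honest look at what fraction of a "wide" data set is actually usable.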
mgkimsal over 10 years ago
OT, but I've never liked the term "big data" precisely because it's so ill-defined. Most people I speak with on this think they have "big data". Anything they can't comprehend is "big data". Anything that makes their Excel 97 crash is "big data". The term is pervasive enough that people have heard it, and use it wrongly.

A colleague of mine is at a company that's advertising for someone with "big data" experience. Collectively, over more than 10 years in business, they have maybe 100GB of data. They just do not know how to organize the data sanely in a relational database, and actively refuse to consider normal data structures.
dmichulke over 10 years ago
Financial data (tick to EOD), network traffic data (TCP packet-level sends/receives) and farm data (sensor + farm ERP data).

All of them are basically time series with some master data; none of them is more than a few dozen GB.

So in any case, I think time series data is worth a look.
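As a minimal illustration of the tick-to-EOD kind of work mentioned above, here is a sketch that collapses intraday ticks into daily OHLC (open/high/low/close) bars; the tick values are invented:

```python
from collections import defaultdict
from datetime import datetime

def resample_daily(ticks):
    """Collapse (timestamp, price) ticks into daily OHLC bars."""
    days = defaultdict(list)
    for ts, price in sorted(ticks):  # sort so first/last tick give open/close
        days[ts.date()].append(price)
    return {
        day: {"open": p[0], "high": max(p), "low": min(p), "close": p[-1]}
        for day, p in days.items()
    }

if __name__ == "__main__":
    ticks = [
        (datetime(2015, 1, 2, 9, 30), 100.0),
        (datetime(2015, 1, 2, 12, 0), 103.5),
        (datetime(2015, 1, 2, 16, 0), 101.2),
    ]
    print(resample_daily(ticks))  # one bar: open 100.0, high 103.5, close 101.2
```

The same bucket-then-aggregate shape applies to the network and sensor series too; only the timestamp granularity and the aggregate functions change.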
kfor over 10 years ago
Governmental health records and survey data. A lot of the really big stuff we use requires academic licenses, but there's still a lot of publicly accessible data.

For the U.S., try:

- CDC's National Center for Health Statistics: http://www.cdc.gov/nchs/
- CDC WONDER: http://wonder.cdc.gov/
- NIH's Unified Medical Language System: http://www.nlm.nih.gov/research/umls/

And for global data, try the Global Health Data Exchange: http://ghdx.healthdata.org
kenrick95 over 10 years ago
There are organizations that collect big data at various locations and share it among themselves. Have a look: http://webscience.org/web-observatory/
jayshahtx over 10 years ago
I think the most interesting datasets are within reach but require curation yourself. For example, there are extremely powerful scraping libraries in just about every popular language today, not to mention APIs such as Twitter's.

If you're looking for a cool dataset to play with, I think it is more productive to ask yourself what questions you want to answer and then find/curate the data, versus finding a dataset and then asking "what questions can I answer?". The former approach will also keep motivation high if you're driven by curiosity.
alexatkeplar over 10 years ago
Human- and machine-generated structured event streams, via Snowplow (https://github.com/snowplow/snowplow).

The largest open-access event stream archive I know about is from GitHub; I think it's about 100GB: https://www.githubarchive.org/
matt_s over 10 years ago
Data collected from devices, and it is large, but not big: around 40-60TB of very repetitive data. Find some open set of data that interests you and just do something to get familiar with the tools.

I think most data sets could be handled via an RDBMS, and Big Data is just another choice. The more interesting thing to me is what you accomplish, and whether a new tech can get you there faster or cheaper, etc.
jongos over 10 years ago
Job: Data Science Consultant, Governments and NGOs

For me it's primarily population data. It's not exactly "big" data in its raw form, but what makes it bigger are the variations produced by analysis, applications of predictive models, and new metadata values extracted from it.

The data grows exponentially, faster than we're collecting it, because of all the analysis.
bsmartt over 10 years ago
Working with information about attacks all the way down the kill chain: everything from IDS sigs, English descriptions, and attribution to IP/host reputation.

AlienVault is hiring security researchers.

Edit: we have some limited data sets that we make public, in case you're interested, hence the name "Open Threat Exchange".
danmaz74 over 10 years ago
I work on the relationships between hashtags, and between hashtags and influencers: http://hashtagify.me

For this analysis, we collect the data from Twitter's public API.
alexvay over 10 years ago
Working on reducing Big Data to help network security engineers investigate threats faster and respond more accurately.

The data, currently, is mainly from various Network Security Monitoring appliances & SIEMs.
calinet6 over 10 years ago
I work at Localytics. We have analytics data from billions of mobile and web users, including specific user actions, usage in general, and user profiles. It really is a fascinating dataset.
Arkid over 10 years ago
Have worked on a few Big Data projects:

- Sensor data from haul trucks, to predict their failures and optimize their routes in the mines
- Telematics data for insurance companies
valevk over 10 years ago
Mostly logfiles and other machine-generated data (of which 99% can be thrown away, but that's what "big data" does for me: filter out what's important).
gesman over 10 years ago
Banking and brokerage portal access data.

Utilizing Splunk as an analytics and alerting platform to correlate real-time financial activity events with multiple threat intelligence feeds.
lolwhat over 10 years ago
IMDb and Box Office Mojo data. Thinking of moving to MongoDB.
robinho364 over 10 years ago
In search engine companies, we sort out cookies every day.
hijinks over 10 years ago
I'm in devops but support the Hadoop cluster. We are an adtech company that has close to a 2-petabyte cluster that is around 76% full.
bobosha over 10 years ago
Perhaps the biggest of big data problems: imagery (photos & video). We build algorithms to extract value from imagery.
byoung2 over 10 years ago
Currently sentiment analysis on business reviews (Yelp, Google, Citysearch, Facebook, OpenTable, TripAdvisor).
iskander over 10 years ago
Genetic sequence data, mostly from cancer.

(Tools are terrible; data sizes up to hundreds of gigabytes per patient.)
quentindemetz over 10 years ago
Hotel reservations, prices, and numerous market indicators, for thousands of hotels.

PriceMatch is hiring in Paris!
sjwhitworth over 10 years ago
GPS and transport data.
daemonk over 10 years ago
I work with genomics data. The data is more complex than big.
ronreiter over 10 years ago
Log data of browsing history. 500k requests per sec.
Demiurge over 10 years ago
Working with ~1PB of remote sensing data and station-derived data. Never used the phrase "Big Data" in any work context.
vishalzone2002 over 10 years ago
Clickstream logs to build recommendation systems.
bkruse over 10 years ago
Genomics!
gaius over 10 years ago
I think I would consider anything 100TB and up to be big data. There is no big data that is "easily accessible"; that's why it's "big": it requires extremely powerful hardware and advanced techniques to work on. Otherwise it's just "data".

NOTE: there are people in the world who would laugh at my definition and say that big data starts at 1PB.