TechEcho

11 comments

fhoffa about 6 years ago

Hi, Felipe Hoffa at Google here.

We're aware the dataset hasn't been updated for about a month, and we are working to fix it. You can track the issue here: <a href="https://issuetracker.google.com/issues/127132286" rel="nofollow">https://issuetracker.google.com/issues/127132286</a>

In the meantime you can still play with the dataset and dig into the full history of Hacker News, minus this last month. I left some interesting queries to get you started here: <a href="https://medium.com/@hoffa/hacker-news-on-bigquery-now-with-daily-updates-so-what-are-the-top-domains-963d3c68b2e2" rel="nofollow">https://medium.com/@hoffa/hacker-news-on-bigquery-now-with-d...</a>


danielecook about 6 years ago

I've been using the HN API to maintain a BigQuery table of all posts, comments, and URLs on HN for a while now. I use it to put this site together: <a href="https://hntrending.com/" rel="nofollow">https://hntrending.com/</a>. BQ is awesome. It's a side project, so it may have some issues!

minimaxir about 6 years ago

Looks like it stopped updating as of February 2nd, but otherwise it's pretty reliable and, as noted in the description, it's free (you probably won't hit the 1 TB limit working with this dataset). Here are a few queries I've run recently to get exact answers to ad-hoc questions:

Top posts about bootstrapping (<a href="https://news.ycombinator.com/item?id=19258249" rel="nofollow">https://news.ycombinator.com/item?id=19258249</a>):
<pre><code>#standardSQL
SELECT *
FROM `bigquery-public-data.hacker_news.full`
WHERE REGEXP_CONTAINS(title, '[Bb]ootstrap')
ORDER BY score DESC
LIMIT 100
</code></pre>

Count of YC startup posts over time by month (<a href="https://news.ycombinator.com/item?id=19185946" rel="nofollow">https://news.ycombinator.com/item?id=19185946</a>):
<pre><code>#standardSQL
SELECT TIMESTAMP_TRUNC(timestamp, MONTH) AS month_posted,
  COUNT(*) AS num_posts_gte_5
FROM `bigquery-public-data.hacker_news.full`
WHERE REGEXP_CONTAINS(title, 'YC [SW][0-9]{2}')
  AND score >= 5
  AND timestamp >= '2015-01-01'
GROUP BY 1
ORDER BY 1
</code></pre>
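The second query filters titles with a regular expression for YC batch tags. One detail worth knowing: inside a character class, `|` is a literal character rather than an alternation, so a class written as `[S|W]` also accepts a pipe, while `[SW]` accepts only the two batch letters. The sketch below illustrates this with Python's `re` module, which treats these simple patterns the same way as far as I know; the sample titles and helper name are made up for illustration and are not taken from the dataset.

```python
import re

# Tighter form: "YC ", one batch letter (S or W), then two digits.
YC_BATCH = re.compile(r"YC [SW][0-9]{2}")
# Variant with a pipe inside the class; the pipe is a literal character there.
YC_BATCH_PIPE = re.compile(r"YC [S|W][0-9]{2}")

# Hypothetical titles, invented for this example only.
titles = [
    "Example Co (YC W19) is hiring engineers",
    "Launch HN: Sample Startup (YC S18)",
    "Notes on YC |19 formatting",
    "Ask HN: Is YC worth it?",
]


def matching(pattern, items):
    """Return the items in which the pattern occurs anywhere."""
    return [t for t in items if pattern.search(t)]


strict = matching(YC_BATCH, titles)
loose = matching(YC_BATCH_PIPE, titles)

print(len(strict), "titles match [SW]")
print(len(loose), "titles match [S|W]")
```

On real titles the two forms will almost always agree, since a pipe followed by two digits after "YC " is unlikely; the tighter class just says what is meant.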


cobookman about 6 years ago

Top commenters of all time: tptacek is in first place with 33839 comments. Hacker News is 12 years old, so that's an average of more than 7 comments per day since inception. Wow.
<pre><code>#standardSQL
SELECT author, COUNT(DISTINCT id) AS `num_comments`
FROM `bigquery-public-data.hacker_news.comments`
WHERE id IS NOT NULL
GROUP BY author
ORDER BY num_comments DESC
LIMIT 100;
</code></pre>
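The per-day figure can be checked with a quick calculation. This is a minimal sketch using only the two numbers quoted in the comment (33839 comments, 12 years) and assuming an average year of 365.25 days; the variable names are illustrative.

```python
# Figures quoted in the comment above.
total_comments = 33839
site_age_years = 12

# Assumption: an average year length of 365.25 days.
days = site_age_years * 365.25
comments_per_day = total_comments / days

print(f"{comments_per_day:.1f} comments per day")
```

Under these assumptions the result is about 7.7 comments per day.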


sbr464 about 6 years ago

I added a simple API endpoint to access favorites on HN, since they weren’t available on the normal API. <a href="https://github.com/reactual/hacker-news-favorites-api" rel="nofollow">https://github.com/reactual/hacker-news-favorites-api</a>

vinnyglennon about 6 years ago

<a href="https://hnify.com/leaderboard.html" rel="nofollow">https://hnify.com/leaderboard.html</a> uses the dataset too. Amazing to have so much data freely available to play with.

refrigerator about 6 years ago

Last year I built a domain leaderboard based on this dataset: <a href="https://hnleaderboard.com" rel="nofollow">https://hnleaderboard.com</a> — planning to update for 2019 soon!

lettergram about 6 years ago

I’m actually fairly excited to learn about this. I painstakingly scraped HN to build: <a href="https://hnprofile.com/" rel="nofollow">https://hnprofile.com/</a> I’m excited about this alternative.

fsiefken about 6 years ago

Is there a way to download the dataset and query it locally from, for example, PostgreSQL or SQLite? How big is the database, 4 GB compressed?
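The thread does not say how to export the dataset or how large it is, so the following is only a sketch of the querying half of the question: once rows are in a local SQLite table, a query of the same shape as the "top commenters" one runs largely unchanged. The table layout, the helper name `top_commenters`, and the sample rows are all hypothetical and chosen for illustration; the real BigQuery schema may differ.

```python
import sqlite3


def top_commenters(conn, limit=100):
    """Count distinct comment ids per author, most prolific first."""
    return conn.execute(
        "SELECT author, COUNT(DISTINCT id) AS num_comments "
        "FROM comments "
        "WHERE id IS NOT NULL "
        "GROUP BY author "
        "ORDER BY num_comments DESC, author "
        "LIMIT ?",
        (limit,),
    ).fetchall()


# In-memory database standing in for a local copy of the data.
conn = sqlite3.connect(":memory:")
# Hypothetical, simplified table; not the actual dataset schema.
conn.execute("CREATE TABLE comments (id INTEGER, author TEXT, text TEXT)")
conn.executemany(
    "INSERT INTO comments VALUES (?, ?, ?)",
    [
        (1, "alice", "first"),
        (2, "bob", "second"),
        (3, "alice", "third"),
        (3, "alice", "third"),  # duplicate row; DISTINCT counts it once
    ],
)

result = top_commenters(conn, limit=2)
print(result)
```

Note that SQLite has no built-in equivalent of `REGEXP_CONTAINS`, so the regex-based queries in this thread would need a user-defined function or a `LIKE`/`GLOB` rewrite.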

tobr about 6 years ago

A dataset like this is going to have a bunch of personal information in it. When it’s distributed like this, how does that jibe with regulations like GDPR? If an HN user would like to delete all their comments, how would that request be forwarded to every user of this dataset?


cannabisfarmer about 6 years ago

BigQuery keeps adding useless data. What we truly need is Common Crawl data, so we can check specific sites on our own. Or wait, maybe BigQuery simply can't handle a Common Crawl-sized dataset in its public service! Otherwise there is no reason not to add it; maybe it puts their search engine/ad business in jeopardy. Is there any other platform like BigQuery's public datasets, where Google's direct search engine/ad platform interests don't get in the way of searching/indexing Common Crawl-like data?

Hacker News BigQuery Dataset
