Hi, Felipe Hoffa at Google here.<p>We're aware the dataset hasn't been updated for about a month, and we are working to fix it. You can track the issue here:<p>- <a href="https://issuetracker.google.com/issues/127132286" rel="nofollow">https://issuetracker.google.com/issues/127132286</a><p>In the meantime you can still play with the dataset and dig into the full history of Hacker News, minus this last month. I left some interesting queries to get you started here:<p>- <a href="https://medium.com/@hoffa/hacker-news-on-bigquery-now-with-daily-updates-so-what-are-the-top-domains-963d3c68b2e2" rel="nofollow">https://medium.com/@hoffa/hacker-news-on-bigquery-now-with-d...</a>
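<p>As a starting point, here is a minimal sketch of a top-domains query in the spirit of that post (it is not copied from it). It assumes the full table exposes url and type columns, and it relies on BigQuery's NET.REG_DOMAIN function to reduce each URL to its registered domain:<p><pre><code> #standardSQL
-- Sketch only: most-submitted domains among stories.
-- Assumption: the table has `url` and `type` columns.
SELECT NET.REG_DOMAIN(url) AS domain,
COUNT(*) AS num_stories
FROM `bigquery-public-data.hacker_news.full`
WHERE type = 'story'
AND NET.REG_DOMAIN(url) IS NOT NULL
GROUP BY 1
ORDER BY 2 DESC
LIMIT 20</code></pre>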
I've been using the HN API to maintain a BigQuery table of all posts, comments, and URLs on HN for a while now. I use it to put this site together: <a href="https://hntrending.com/" rel="nofollow">https://hntrending.com/</a>. BQ is awesome.<p>It's a side project, so it may have some issues!
Looks like it stopped updating as of February 2nd, but otherwise it's pretty reliable, and as noted in the description, it's free (you probably won't hit the 1TB monthly free-tier limit working with this dataset). Here are a few queries I've run recently to get exact answers to ad-hoc questions:<p>Top posts about bootstrapping (<a href="https://news.ycombinator.com/item?id=19258249" rel="nofollow">https://news.ycombinator.com/item?id=19258249</a>):<p><pre><code> #standardSQL
SELECT *
FROM `bigquery-public-data.hacker_news.full`
WHERE REGEXP_CONTAINS(title, '[Bb]ootstrap')
ORDER BY score DESC
LIMIT 100
</code></pre>
Count of YC startup posts over time by month (<a href="https://news.ycombinator.com/item?id=19185946" rel="nofollow">https://news.ycombinator.com/item?id=19185946</a>):<p><pre><code> #standardSQL
SELECT TIMESTAMP_TRUNC(timestamp, MONTH) as month_posted,
COUNT(*) as num_posts_gte_5
FROM `bigquery-public-data.hacker_news.full`
WHERE REGEXP_CONTAINS(title, 'YC [SW][0-9]{2}')
AND score >= 5
AND timestamp >= '2015-01-01'
GROUP BY 1
ORDER BY 1</code></pre>
Top commenters of all time.
tptacek is in 1st place with 33839 comments.<p>Hacker News is 12 years old. That's an average of roughly 7.7 comments per day since inception. Wow<p><pre><code> #standardSQL
SELECT
author,
count(DISTINCT id) as `num_comments`
FROM `bigquery-public-data.hacker_news.comments`
WHERE id IS NOT NULL
GROUP BY author
ORDER BY num_comments DESC
LIMIT 100;</code></pre>
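<p>The per-day figure can be double-checked in BigQuery itself. This is only a back-of-the-envelope sketch: the 12-year span comes from the comment above and is approximated as 12 * 365 days, ignoring leap days:<p><pre><code> #standardSQL
-- 33839 comments spread over roughly 12 years (assumed 12 * 365 days)
SELECT ROUND(33839 / (12 * 365), 1) AS comments_per_day
-- returns 7.7</code></pre>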
I added a simple API endpoint to access favorites on HN, since they weren’t available through the normal API.<p><a href="https://github.com/reactual/hacker-news-favorites-api" rel="nofollow">https://github.com/reactual/hacker-news-favorites-api</a>
<a href="https://hnify.com/leaderboard.html" rel="nofollow">https://hnify.com/leaderboard.html</a> is built using the dataset too. It's amazing to have so much data freely available to play with.
Last year I built a domain leaderboard based on this dataset: <a href="https://hnleaderboard.com" rel="nofollow">https://hnleaderboard.com</a> — planning to update for 2019 soon!
I’m actually fairly excited to learn about this. I painstakingly scraped HN to build:<p><a href="https://hnprofile.com/" rel="nofollow">https://hnprofile.com/</a><p>I’m excited about this alternative.
A dataset like this is going to have a bunch of personal information in it. When it’s distributed like this, how does that jibe with regulations like GDPR? If an HN user would like to delete all their comments, how would that request be forwarded to every user of this dataset?
BigQuery keeps adding useless data.<p>What we truly need is Common Crawl data, so we can check specific sites on our own.<p>Or wait, maybe BigQuery simply can't handle a Common Crawl-sized dataset in its public service!<p>Otherwise there is no reason not to add it; maybe it would put their search engine/ad business in jeopardy.<p>Is there any other public dataset platform like BigQuery, one where direct search engine/ad platform interests don't get in the way of searching and indexing Common Crawl-like data?