Hacker News BigQuery Dataset

158 points by svdr about 6 years ago

11 comments

fhoffa about 6 years ago
Hi, Felipe Hoffa at Google here.<p>We're aware the dataset hasn't been updated for the past month, and we are working to fix it. You can track the issue here:<p>- <a href="https://issuetracker.google.com/issues/127132286" rel="nofollow">https://issuetracker.google.com/issues/127132286</a><p>In the meantime you can still play with the dataset, and dig into the full history of Hacker News, minus this last month. I left some interesting queries to get you started here:<p>- <a href="https://medium.com/@hoffa/hacker-news-on-bigquery-now-with-daily-updates-so-what-are-the-top-domains-963d3c68b2e2" rel="nofollow">https://medium.com/@hoffa/hacker-news-on-bigquery-now-with-d...</a>
danielecook about 6 years ago
I've been using the HN API to maintain a BigQuery table of all posts, comments, and URLs on HN for a while now. I use it to put this site together: <a href="https://hntrending.com/" rel="nofollow">https://hntrending.com/</a>. BQ is awesome.<p>It's a side project, so it may have some issues!
minimaxir about 6 years ago
Looks like it stopped updating as of February 2nd, but otherwise it's pretty reliable, and as noted in the description, it's free (you probably won't hit the 1TB limit working with this dataset). Here are a few queries I've run recently to get exact answers to ad-hoc questions:<p>Top posts about bootstrapping (<a href="https://news.ycombinator.com/item?id=19258249" rel="nofollow">https://news.ycombinator.com/item?id=19258249</a>):<p><pre><code>#standardSQL
SELECT *
FROM `bigquery-public-data.hacker_news.full`
WHERE REGEXP_CONTAINS(title, '[Bb]ootstrap')
ORDER BY score DESC
LIMIT 100
</code></pre>
Count of YC startup posts over time by month (<a href="https://news.ycombinator.com/item?id=19185946" rel="nofollow">https://news.ycombinator.com/item?id=19185946</a>):<p><pre><code>#standardSQL
SELECT
  TIMESTAMP_TRUNC(timestamp, MONTH) AS month_posted,
  COUNT(*) AS num_posts_gte_5
FROM `bigquery-public-data.hacker_news.full`
WHERE REGEXP_CONTAINS(title, 'YC [SW][0-9]{2}')
  AND score >= 5
  AND timestamp >= '2015-01-01'
GROUP BY 1
ORDER BY 1
</code></pre>
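The title filter in the YC query can be sanity-checked locally before spending query quota. The sketch below uses Python's `re` module as a stand-in for BigQuery's `REGEXP_CONTAINS`; for a pattern this simple the two engines should agree, but that is an assumption rather than something tested against BigQuery. The sample titles and the `mentions_yc_batch` helper are made up for illustration.

```python
import re

# Pattern for batch tags such as "YC S18" or "YC W19". Inside square
# brackets a pipe is a literal character, so [SW] is the form that
# matches exactly one of "S" or "W"; [S|W] would also accept "|".
BATCH_PATTERN = re.compile(r"YC [SW][0-9]{2}")


def mentions_yc_batch(title: str) -> bool:
    """Return True if the title contains a batch tag such as 'YC S18'."""
    return BATCH_PATTERN.search(title) is not None


# Made-up titles, for illustration only.
sample_titles = [
    "Acme (YC W19) is hiring engineers",
    "Launch HN: Example (YC S18)",
    "YC |18 is not a real batch",
    "Ask HN: Is YC worth it?",
]

matches = [t for t in sample_titles if mentions_yc_batch(t)]
```

Here `matches` keeps only the first two titles. `search` is used rather than `match` because `REGEXP_CONTAINS` looks for the pattern anywhere in the string, not only at the start.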
cobookman about 6 years ago
Top commenters of all time. tptacek is in 1st place with 33839 comments.<p>Hacker News is 12 years old. That's an average of more than 7 comments per day since inception. Wow<p><pre><code>#standardSQL
SELECT author, COUNT(DISTINCT id) AS `num_comments`
FROM `bigquery-public-data.hacker_news.comments`
WHERE id IS NOT NULL
GROUP BY author
ORDER BY num_comments DESC
LIMIT 100;</code></pre>
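The per-day figure is simple to reproduce. This sketch takes the two numbers quoted in the comment and assumes 365 days per year (leap days ignored), which is my simplification, not something stated in the thread.

```python
# Figures quoted in the comment above.
TOTAL_COMMENTS = 33839
YEARS_ACTIVE = 12

# Assumption: 365 days per year, ignoring leap days.
DAYS_PER_YEAR = 365


def comments_per_day(total: int, years: int, days_per_year: int = DAYS_PER_YEAR) -> float:
    """Average number of comments per day over the given number of years."""
    return total / (years * days_per_year)


average = comments_per_day(TOTAL_COMMENTS, YEARS_ACTIVE)
print(f"{average:.1f} comments per day")  # prints "7.7 comments per day"
```

Under that assumption the average is a little over 7, close to 7.7 comments per day.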
sbr464 about 6 years ago
I added a simple API endpoint to access favorites on HN, since they weren’t available on the normal API.<p><a href="https://github.com/reactual/hacker-news-favorites-api" rel="nofollow">https://github.com/reactual/hacker-news-favorites-api</a>
vinnyglennon about 6 years ago
<a href="https://hnify.com/leaderboard.html" rel="nofollow">https://hnify.com/leaderboard.html</a> is using the dataset too. Amazing to have so much data freely available to play with.
refrigerator about 6 years ago
Last year I built a domain leaderboard based on this dataset: <a href="https://hnleaderboard.com" rel="nofollow">https://hnleaderboard.com</a>. I'm planning to update it for 2019 soon!
lettergram about 6 years ago
I’m actually fairly excited to learn about this. I painstakingly scraped HN to build:<p><a href="https://hnprofile.com/" rel="nofollow">https://hnprofile.com/</a><p>I’m excited about this alternative.
fsiefken about 6 years ago
Is there a way to download the dataset and query it locally from, for example, PostgreSQL or SQLite? How big is the database, 4 GB compressed?
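The replies captured here do not answer this question, so what follows is only a rough sketch of what local querying could look like once rows have been exported somehow. The export step is not shown, and the three-column `comments` schema and the sample rows are hypothetical, not the dataset's actual schema. It runs a SQLite version of the top-commenters query from earlier in the thread against an in-memory database.

```python
import sqlite3

# Hypothetical sample rows standing in for an exported copy of the
# comments table: (id, author, text). Real rows would come from an export.
SAMPLE_COMMENTS = [
    (1, "alice", "First!"),
    (2, "bob", "Nice dataset."),
    (3, "alice", "Agreed."),
    (4, "carol", "How big is it?"),
    (5, "alice", "Not sure."),
    (6, "bob", "Thanks."),
]


def top_commenters(rows, limit=100):
    """Load rows into an in-memory SQLite table and rank authors by comment count."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE comments (id INTEGER PRIMARY KEY, author TEXT, text TEXT)"
        )
        conn.executemany("INSERT INTO comments VALUES (?, ?, ?)", rows)
        cursor = conn.execute(
            """
            SELECT author, COUNT(DISTINCT id) AS num_comments
            FROM comments
            WHERE id IS NOT NULL
            GROUP BY author
            ORDER BY num_comments DESC, author ASC
            LIMIT ?
            """,
            (limit,),
        )
        return cursor.fetchall()
    finally:
        conn.close()


result = top_commenters(SAMPLE_COMMENTS)
```

With the sample rows, `result` ranks alice (3), then bob (2), then carol (1). For a real export you would point `sqlite3.connect` at a file instead of `":memory:"`; the extra `author ASC` tiebreaker is added here only to make the ordering deterministic.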
tobr about 6 years ago
A dataset like this is going to have a bunch of personal information in it. When it’s distributed like this, how does that jibe with regulations like GDPR? If an HN user would like to delete all their comments, how would that request be forwarded to every user of this dataset?
cannabisfarmer about 6 years ago
BigQuery keeps adding useless data.<p>What we truly need is Common Crawl data, so we can check specific sites on our own.<p>Or wait, BigQuery simply can't handle a Common Crawl-sized dataset in their public service!<p>Otherwise there is no reason not to add it; maybe it puts their search engine/ad business in jeopardy.<p>Is there any other platform like Google's BigQuery public datasets? One where their direct search engine/ad platform interests don't get in the way of searching/indexing Common Crawl-like data?