TechEcho

Ask PG: HN Ngram Viewer?

8 points by zissou over 11 years ago
Since writing a scraper to discover and parse all historical comments/submissions on HN would obviously get me in trouble, would the HN admins be willing to provide a dump of the historical text/metadata from all comments and [local] submissions so I can make an HN Ngram Viewer for the HN public?

I work in an academic lab where I'm one of the developers of a system that generates ngram viewers from large corpuses of text, which we call "Bookworms". Here are a few Bookworms we've created:

arXiv scientific publications: http://bookworm.culturomics.org/arxiv/

US Congress legislation: http://bookworm.culturomics.org/congress/

Open Library books: http://bookworm.culturomics.org/OL/

Chronicling America historical newspapers: http://bookworm.culturomics.org/ChronAm/

Social Science Research Network research paper abstracts: http://bookworm.culturomics.org/ssrn/

We have more Bookworms in the pipeline, including historical legislation in the UK and a massive corpus of texts (70MM+ documents) from the National Library of Australia (Trove) spanning multiple centuries. A new GUI for all our Bookworms will also be rolling out shortly. (Preview: http://bookworm.culturomics.org/new_gui_teaser.png).

In my opinion, HN would be an awesome candidate for an ngram viewer because there are so many subsets of topics that come/go/stay here, such as the frequency of discussions about web technologies, programming languages, companies/services, the NSA, etc.

If this is something the HN admins would be interested in, I'd be happy to put it together. If a privacy agreement is desired before passing off any bulk data, that is not a problem as we've gone this route before, albeit only for private ngram viewers we've created for companies, like the NYT, to use internally.
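For readers unfamiliar with what an ngram viewer computes, here is a minimal sketch of the core counting step, assuming the corpus is available as (year, text) pairs. This is not the Bookworm implementation; the function names, the crude tokenizer, and the per-million normalization are all illustrative assumptions.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and keep runs of letters, digits, and apostrophes.

    A deliberately crude tokenizer, assumed here for illustration only.
    """
    return re.findall(r"[a-z0-9']+", text.lower())


def ngrams(tokens, n):
    """Yield every run of n consecutive tokens as a space-joined string."""
    for i in range(len(tokens) - n + 1):
        yield " ".join(tokens[i:i + n])


def ngram_frequency_by_year(documents, phrase):
    """Return {year: occurrences of `phrase` per million n-grams of the same length}.

    `documents` is an iterable of (year, text) pairs (a hypothetical input shape).
    Years whose texts contain no n-grams of the required length are omitted.
    """
    target_tokens = tokenize(phrase)
    n = len(target_tokens)
    if n == 0:
        raise ValueError("phrase must contain at least one token")
    target = " ".join(target_tokens)

    hits = defaultdict(int)    # matches of the phrase per year
    totals = defaultdict(int)  # all n-grams of length n per year
    for year, text in documents:
        counts = Counter(ngrams(tokenize(text), n))
        hits[year] += counts[target]
        totals[year] += sum(counts.values())

    return {
        year: 1e6 * hits[year] / totals[year]
        for year in sorted(totals)
        if totals[year]
    }
```

Calling `ngram_frequency_by_year(comments, "the NSA")` on a list of (year, comment text) pairs would give one normalized frequency per year, which is the series an ngram viewer plots. Normalizing by the yearly total keeps years with more comments from dominating the chart.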

2 comments

kogir over 11 years ago
The Octopart team has done a great job with HNSearch, and we really appreciate the huge favor they've done us by providing it. That said, due to limitations on our end around how they integrate with us, they're not able to offer real-time updates or full-fidelity ranking snapshots.

I'm working on a more comprehensive first-party API for HN, and plan to implement the following, in this order:

  1) Near-real-time profiles, comments, and stories as JSON.
  2) Real-time streaming of profile and item changes.
  3) Near-real-time ranking of comments and stories.
  4) Real-time streaming of ranking changes.
  5) History of ranking changes.

Sadly, I can't commit to any firm timeline for future progress right now, but know that I'm working on it :)

-- Edit: Remove link to broken data file. Fixing it up tomorrow.
kristianp over 11 years ago
You could look into https://www.hnsearch.com/api. They provide the search bar functionality on this site.