Show HN: An open distributed search engine for science

98 points by juretriglav, almost 11 years ago

6 comments

Blahah, almost 11 years ago
Jure, your projects never cease to impress me. Really looking forward to talking in depth at OKfest. This idea is so close to what we've been doing that it's a real shame we didn't talk earlier, but the parts of what you're doing that are unique are also truly awesome.

At ContentMine we're doing something totally complementary to this. Some of the tools will overlap and we should be sharing what we're doing. For example, I've been working on a standardised declarative JSON-XPath scraper definition format and a subset of it for academic journal scraping. I've been building a library of ScraperJSON definitions for academic publisher sites, and I've converged on some formats that work for a majority of publishers with no modification (because they silently follow undocumented standards like the HighWire metadata). We've got a growing community of volunteers who will keep the definitions up to date for hundreds or thousands of journals. If you also use our scraper definitions for your metadata you'll get all the publishers for free.

Our goal initially is to scrape the entire literature (we have TOCs for 23,000 journals) as it is published every day. We then use natural language and image processing tools to extract uncopyrightable facts from the full texts, and republish those facts in open streams. For example we can capture all phylogenetic trees, reverse engineer the Newick format from images, and submit them to the Tree of Life. Or we can find all new mentions of endangered species and submit updates to the IUCN Red List. There's a ton of other interesting stuff downstream (e.g. automatic fraud detection, data streams for any conceivable subject of interest in the scientific literature).

I have a question. Why are you saying you'll never do full texts? You could index all CC-BY and better full texts completely legally, and this would greatly expand the literature search power.
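To make the declarative scraping idea above concrete, here is a minimal Python sketch in the spirit of what Blahah describes: a ScraperJSON-style mapping of field names to XPath selectors, aimed at HighWire-style <meta> tags that many journal platforms emit. The field names, selectors, and URL are illustrative assumptions, not the actual ScraperJSON format or ContentMine tooling.

    # Rough sketch of a declarative scraper definition plus a tiny engine
    # that applies it to a publisher page. Not the real ScraperJSON library;
    # the definition layout and example URL are assumptions.
    import requests
    from lxml import html

    # Hypothetical ScraperJSON-like definition for one publisher.
    DEFINITION = {
        "elements": {
            "title":   "//meta[@name='citation_title']/@content",
            "authors": "//meta[@name='citation_author']/@content",
            "doi":     "//meta[@name='citation_doi']/@content",
            "journal": "//meta[@name='citation_journal_title']/@content",
        }
    }

    def scrape(url, definition):
        """Fetch the page and evaluate each XPath in the definition against it."""
        tree = html.fromstring(requests.get(url, timeout=30).content)
        record = {}
        for field, xpath in definition["elements"].items():
            hits = tree.xpath(xpath)
            record[field] = hits if len(hits) > 1 else (hits[0] if hits else None)
        return record

    if __name__ == "__main__":
        # Placeholder URL; any article page exposing HighWire meta tags would work.
        print(scrape("https://example.org/article/123", DEFINITION))

Because the publisher-specific knowledge lives entirely in the data (the definition), a community can maintain hundreds of such definitions without touching the engine, which is the point of the approach described above.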
higherpurpose, almost 11 years ago
I hope you carry on with this project. If there's any search engine that can beat Google (long into the future) it's a P2P one.

Speaking of the devil, are you aware you can't install extensions from 3rd party sources anymore at all? You can thank Google for this idiotic and completely self-interested move.
petermurrayrust, almost 11 years ago
This is really great and is fully complementary to our ContentMine (contentmine.org).

It's very similar to what I proposed as "the World Wide Molecular Matrix" (WWMM) about 10 years ago (http://en.wikipedia.org/wiki/World_Wide_Molecular_Matrix). P2P was an exciting development then and there was talk about browser/servers. Then the technology was Napster-like.

WWMM was ahead of both the technology and the culture. It should work now and I think Ninja will fly (if that's the right verb). I think we have to pick a field where there is a lot of interest (currently I am fixated on dinosaurs), where there is a lot of Open material, and where the people are likely to have excited minds.

We need a project that will start to be useful within a month, because the main advocacy will be showing that it's valuable. The competition is with centralised for-profit services such as Mendeley. The huge advantage of Ninja is that it's distributed, which absolutely guarantees non-centrality. The challenges - not sure in what order - are apathy and legal challenges (e.g. can it be represented as spyware - I know it's absurd but the world is becoming absurd).

Love to talk at Berlin.
yid, almost 11 years ago
It seems like nothing like this currently exists in a centralized, non-distributed way. Why add the complexity of a p2p network into an unproven concept? Is it purely to save on the cost of indexing and serving queries?

> Scraping Google is a bad idea, which is quite funny as Google itself is the mother of all scrapers, but I digress.

It's not really "funny"/ironic/etc -- Google put capital into scraping websites to build an *index*, and you're free to do the same, but you shouldn't expect Google to allow you to scrape their *index* for free.

EDIT: just saw this:

> Right now, PLOS, eLife, PeerJ and ScienceDirect are supported, so any paper you read from these publishers, while using the extension, will get indexed and added to the network automatically.

Yeah, they're not going to like that. You might want to consult a lawyer.
nl, almost 11 years ago
Why not index preprints, which are generally available via OAI harvesting?

I'm not following the field closely at the moment, but I'm pretty sure PLOS at least has an OAI interface too.
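The OAI-PMH harvesting nl mentions is a plain HTTP-plus-XML protocol, so pulling preprint metadata needs very little code. A rough Python sketch follows; arXiv's public OAI-PMH endpoint is used as the example repository, and the start date is arbitrary (a PLOS interface, if used, would just be a different base URL).

    # Sketch of OAI-PMH harvesting with ListRecords and resumption tokens.
    # The endpoint and date below are illustrative.
    import requests
    import xml.etree.ElementTree as ET

    OAI_ENDPOINT = "http://export.arxiv.org/oai2"  # example repository
    NS = {
        "oai": "http://www.openarchives.org/OAI/2.0/",
        "dc": "http://purl.org/dc/elements/1.1/",
    }

    def harvest(endpoint, from_date):
        """Yield (identifier, title) pairs, following resumption tokens until exhausted."""
        params = {"verb": "ListRecords", "metadataPrefix": "oai_dc", "from": from_date}
        while True:
            root = ET.fromstring(requests.get(endpoint, params=params, timeout=60).content)
            for rec in root.iter("{http://www.openarchives.org/OAI/2.0/}record"):
                header = rec.find("oai:header", NS)
                identifier = header.findtext("oai:identifier", default="", namespaces=NS)
                title = rec.find(".//dc:title", NS)
                yield identifier, title.text if title is not None else ""
            token = root.find(".//oai:resumptionToken", NS)
            if token is None or not (token.text or "").strip():
                break
            # Subsequent pages are requested with the token alone.
            params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}

    if __name__ == "__main__":
        for identifier, title in harvest(OAI_ENDPOINT, "2014-06-01"):
            print(identifier, "-", title)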
hershel, almost 11 years ago
What about http://commoncrawl.org/? Why not use it?
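For context on this suggestion, Common Crawl exposes a CDX index API that can be queried before downloading any raw crawl data. A hedged Python sketch is below; the crawl collection name and the URL pattern are placeholders (current collections are listed at https://index.commoncrawl.org/).

    # Sketch of looking up publisher pages in the Common Crawl CDX index.
    # The collection name and URL pattern are example values, not recommendations.
    import json
    import requests

    CDX_API = "https://index.commoncrawl.org/CC-MAIN-2024-30-index"  # example crawl

    def lookup(url_pattern, limit=20):
        """Query the Common Crawl index for captures matching a URL pattern."""
        resp = requests.get(
            CDX_API,
            params={"url": url_pattern, "output": "json", "limit": str(limit)},
            timeout=60,
        )
        resp.raise_for_status()
        # The index returns one JSON object per line.
        return [json.loads(line) for line in resp.text.splitlines() if line.strip()]

    if __name__ == "__main__":
        for capture in lookup("journals.plos.org/*"):
            # Each capture record points into a WARC file (filename/offset/length)
            # that can then be range-fetched for the full archived page.
            print(capture.get("timestamp"), capture.get("url"))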