TechEcho

6 comments

Blahah almost 11 years ago

Jure, your projects never cease to impress me. Really looking forward to talking in depth at OKfest. This idea is so close to what we've been doing that it's a real shame we didn't talk earlier, but the parts of what you're doing that are unique are also truly awesome.

At ContentMine we're doing something totally complementary to this. Some of the tools will overlap and we should be sharing what we're doing. For example, I've been working on a standardised declarative JSON-XPath scraper definition format and a subset of it for academic journal scraping. I've been building a library of ScraperJSON definitions for academic publisher sites, and I've converged on some formats that work for a majority of publishers with no modification (because they silently follow undocumented standards like the HighWire metadata). We've got a growing community of volunteers who will keep the definitions up to date for hundreds or thousands of journals. If you also use our scraper definitions for your metadata you'll get all the publishers for free.

Our goal initially is to scrape the entire literature (we have TOCs for 23,000 journals) as it is published every day. We then use natural language and image processing tools to extract uncopyrightable facts from the full texts, and republish those facts in open streams. For example we can capture all phylogenetic trees, reverse engineer the Newick format from images, and submit them to the Tree Of Life. Or we can find all new mentions of endangered species and submit updates to the IUCN Red List. There's a ton of other interesting stuff downstream (e.g. automatic fraud detection, data streams for any conceivable subject of interest in the scientific literature).

I have a question. Why are you saying you'll never do full texts? You could index all CC-BY and better full texts completely legally, and this would greatly expand the literature search power.
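To illustrate the idea of a declarative scraper definition over HighWire-style metadata, here is a minimal sketch. It is not the ScraperJSON format itself: the `DEFINITION` mapping, the `scrape` helper, and the sample page are all hypothetical, and plain `<meta name=...>` matching stands in for XPath so that the example needs only the Python standard library. The `citation_title`, `citation_author`, and `citation_doi` tag names are the ones commonly used by publishers that follow the HighWire convention.

```python
from html.parser import HTMLParser

# Hypothetical declarative definition: output field -> <meta name="..."> to read.
DEFINITION = {
    "title": "citation_title",
    "authors": "citation_author",
    "doi": "citation_doi",
}


class MetaCollector(HTMLParser):
    """Collect the content of every <meta name=... content=...> tag."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        name, content = attrs.get("name"), attrs.get("content")
        if name and content:
            # A name can repeat (one tag per author), so keep a list.
            self.meta.setdefault(name, []).append(content)


def scrape(html, definition=DEFINITION):
    """Apply a definition to a page and return {field: [values]}."""
    collector = MetaCollector()
    collector.feed(html)
    return {field: collector.meta.get(tag, []) for field, tag in definition.items()}


# Made-up sample page, for illustration only.
SAMPLE = """<html><head>
<meta name="citation_title" content="An example paper">
<meta name="citation_author" content="A. Author">
<meta name="citation_author" content="B. Author">
<meta name="citation_doi" content="10.1234/example">
</head></html>"""

result = scrape(SAMPLE)
```

Because the definition is plain data, the same `scrape` function would work unchanged for any publisher that emits these tags, which is the point made above about getting many publishers "for free".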

higherpurpose almost 11 years ago

I hope you carry on with this project. If there's any search engine that can beat Google (long into the future) it's a P2P one.

Speaking of the devil, are you aware you can't install extensions from 3rd party sources anymore at all? You can thank Google for this idiotic and completely self-interested move.

petermurrayrust almost 11 years ago

This is really great and is fully complementary to our Content Mine (contentmine.org).

It's very similar to what I proposed as "the World Wide Molecular Matrix" (WWMM) about 10 years ago (http://en.wikipedia.org/wiki/World_Wide_Molecular_Matrix). P2P was an exciting development then and there was talk about browser/servers. Then the technology was Napster-like.

WWMM was ahead of both the technology and the culture. It should work now and I think Ninja will fly (if that's the right verb). I think we have to pick a field where there is a lot of interest (currently I am fixated on dinosaurs), where there is a lot of Open material, and where the people are likely to have excited minds.

We need a project that will start to be useful within a month, because the main advocacy will be showing that it's valuable. The competition is with centralised for-profit services such as Mendeley. The huge advantage of Ninja is that it's distributed, which absolutely guarantees non-centrality. The challenges - not sure in what order - are apathy and legal challenges (e.g. can it be represented as spyware - I know it's absurd but the world is becoming absurd).

Love to talk in Berlin.

yid almost 11 years ago

It seems like nothing like this currently exists in a centralized, non-distributed way. Why add the complexity of a p2p network into an unproven concept? Is it purely to save on the cost of indexing and serving queries?

> Scraping Google is a bad idea, which is quite funny as Google itself is the mother of all scrapers, but I digress.

It's not really "funny"/ironic/etc -- Google put capital into scraping websites to build an index, and you're free to do the same, but you shouldn't expect Google to allow you to scrape their index for free.

EDIT: just saw this:

> Right now, PLOS, eLife, PeerJ and ScienceDirect are supported, so any paper you read from these publishers, while using the extension, will get indexed and added to the network automatically.

Yeah, they're not going to like that. You might want to consult a lawyer.

nl almost 11 years ago

Why not index preprints, which are generally available via OAI harvesting?

I'm not following the field closely at the moment, but I'm pretty sure PLOS at least has an OAI interface too.
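For context, OAI harvesting (OAI-PMH) works through plain HTTP requests with a `verb` parameter. The sketch below only builds the request URLs for paging through `ListRecords`; the endpoint `repository.example.org` and the helper name `list_records_url` are hypothetical, and no network call is made. Check the actual base URL and supported metadata formats for any real repository.

```python
from urllib.parse import urlencode

# Hypothetical endpoint; substitute the repository's real OAI-PMH base URL.
BASE_URL = "https://repository.example.org/oai"


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords request URL for the first or a follow-up page."""
    if resumption_token:
        # In OAI-PMH, resumptionToken is an exclusive argument: it replaces
        # metadataPrefix and any other selection parameters.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return f"{base_url}?{urlencode(params)}"


first_page = list_records_url(BASE_URL)
# "abc123" is a placeholder; real tokens come from the previous response.
next_page = list_records_url(BASE_URL, resumption_token="abc123")
```

A harvester would fetch `first_page`, parse the XML response, and keep requesting follow-up pages until the response no longer carries a resumption token.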

hershel almost 11 years ago

What about http://commoncrawl.org/? Why not use it?

Show HN: An open distributed search engine for science
