Former Freebase, current Google engineer here:<p>First of all, let me say that I'm glad more people are thinking/working in the space of triples. Even unstructured ones like this.<p>But when there's no semi-strict schema, it gets really, really tricky. Free text is hard, and actual meaning is hard to extract from it. (I say semi-strict, as Freebase is schema-last -- feel free to create your own! -- but has some level of enforcement)<p>For specific domains you may be okay with tags. And for some limited applications it probably works great. Triples are cool!<p>But when you start talking about larger, broader datasets, ones that no one person or small group can curate, you're going to start running into collisions.<p>There's certainly an argument to be made for metaschema -- <a href="https://developers.google.com/freebase/v1/search-metaschema" rel="nofollow">https://developers.google.com/freebase/v1/search-metaschema</a> -- and crowdsourcing these sorts of things could be interesting.<p>I think there's a lot of interesting work to be done. But I doubt that this is "better" per se; at the very least, right now it's little more than a toy.<p>And hey, I built such a toy graph engine once upon a time (be gentle -- it was really a demo hack) <a href="https://github.com/barakmich/jgd" rel="nofollow">https://github.com/barakmich/jgd</a> -- you can even query it with Freebase's old MQL. (Which I have mixed feelings about, but which is cool in its own way)<p>I guess my argument is, don't throw the baby out with the bathwater. And feel free to ping me for more!
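<p>For anyone who hasn't seen MQL: you hand the engine a JSON template and it fills in the blanks. Roughly like this (from memory, so details may be off), written here as a Python literal:<p><pre><code># A query template: fixed values constrain the match, empty values get filled in.
mql_query = [{
    "type": "/music/artist",
    "name": "The Police",
    "album": [],   # "give me every album by this artist"
}]
</code></pre>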
Comparison with Freebase:<p>> <i>Simpler structure: There are no datatypes, namespaces, lists, domains. Just ordered nodes. Having a dead simple structure like that allows developers to quickly and intuitively know how to access the info they want.</i><p>I don't see how this makes it simpler or more intuitive at ALL. If there's no convention as to whether I should search for "born on" or "born_date" or "year_born", or whether the date will be "1900-08-01" or "08-01-1900" or "1900/08"... then how is this supposed to be useful?<p>The central problem is that there are lots of textual ways of describing the same thing. Without standardized datatypes and standardized tags, it quickly becomes a messy, useless free-for-all.<p>I don't see how TheBigDB gets around this. The FAQ explains how it's <i>different</i> from Freebase/Wikidata, but I don't at all understand how it's supposed to be <i>better</i>, or even as good.
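<p>To make the collision concrete, here's what three well-meaning contributors might enter for the same fact (made-up entries, not actual TheBigDB data), written as Python lists:<p><pre><code># One fact, three incompatible spellings; a consumer has to guess every variant.
entries = [
    ["Marie Curie", "born on",   "1867-11-07"],
    ["Marie Curie", "born_date", "11/07/1867"],
    ["Marie Curie", "year_born", "1867"],
]
</code></pre>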
Sounds like a much simplified version of Douglas Lenat's Cyc project [1], which has been going since the mid-eighties and is attempting to build a structured knowledge base/ontology of everyday knowledge. They have a freely downloadable subset called OpenCyc [2]. It seemed pretty impressive last time I looked at it.<p>[1] <a href="http://en.wikipedia.org/wiki/Cyc" rel="nofollow">http://en.wikipedia.org/wiki/Cyc</a><p>[2] <a href="http://www.cyc.com/platform/opencyc" rel="nofollow">http://www.cyc.com/platform/opencyc</a>
I wonder if you could do machine learning on schemata. Basically, start learning about dates (as an example) and, as the system learns, update the information with what it has learned. Something where one person puts in { name "foo", born "10/1/92" } and someone else puts in { name "bar", born "september 30th, 1966" }, and the system goes back and replaces the dates with an ISO standard date type -- but with a change history, so you could look backwards in time at the data and see how the database had "improved" it (or not). Then, by voting on the improvements, you teach the system to clean up its data representations. Crazy? Insightful? Stupid? I don't know, but it was the question that popped into my head.
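<p>Something like this rough Python sketch, say -- the format list and the shape of the history record are just guesses at how it could work, not anyone's actual implementation:<p><pre><code>import re
from datetime import datetime

# Formats contributors might plausibly use; "%m/%d/%y" vs "%d/%m/%y" is exactly
# the kind of ambiguity a visible change history (and voting) would surface.
CANDIDATE_FORMATS = ["%m/%d/%y", "%m/%d/%Y", "%B %d, %Y", "%Y-%m-%d"]

def normalize_date(raw):
    """Return (iso_date, history_entry); fall back to the raw string if nothing matches."""
    cleaned = re.sub(r"(\d)(st|nd|rd|th)", r"\1", raw.strip())  # "30th" -> "30"
    for fmt in CANDIDATE_FORMATS:
        try:
            iso = datetime.strptime(cleaned, fmt).date().isoformat()
            return iso, {"was": raw, "now": iso, "rule": fmt, "votes": 0}
        except ValueError:
            continue
    return raw, None

print(normalize_date("10/1/92"))               # ('1992-10-01', {...}) -- assumes US ordering
print(normalize_date("september 30th, 1966"))  # ('1966-09-30', {...})
</code></pre>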
One nice property of the Wikidata database is that it is a "secondary database. Wikidata will record not just statements, but their sources, thus reflecting the diversity of knowledge available and supporting the notion of verifiability." [1]<p>I think that's far better than voting. Voting for facts amounts to relying on a logical fallacy: appeal to the majority. [2] (Voting is fine for popularity contests, or things that can only be matters of opinion, but facts?)<p>[1] <a href="http://www.wikidata.org/wiki/Wikidata:Introduction" rel="nofollow">http://www.wikidata.org/wiki/Wikidata:Introduction</a><p>[2] <a href="https://en.wikipedia.org/wiki/Argumentum_ad_populum" rel="nofollow">https://en.wikipedia.org/wiki/Argumentum_ad_populum</a>
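<p>Concretely, a Wikidata-style statement bundles the claim together with its references -- something roughly like this (a loose sketch, not Wikidata's actual JSON schema):<p><pre><code># A statement you can verify, rather than a bare fact you can only vote on.
statement = {
    "subject": "Douglas Adams",
    "property": "date of birth",
    "value": "1952-03-11",
    "references": [
        {"stated_in": "some citable source", "url": "http://example.org/source"},
    ],
}
</code></pre>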
Is it possible to download all the data and use it under some open license (like CC-BY)? I can't find the data license terms.<p>If not, then sorry, Freebase is vastly superior IMHO - from a user's point of view I don't see the point of a crowdsourced proprietary database (even if the API is currently free).
Have you seeded, or do you plan to seed, your database with the already-structured data from Freebase? It should be relatively straightforward, right? Well, I mean, minus the time to properly map the Freebase schema into your format. But that's probably less time than it takes to wait for people to fill in enough facts.
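<p>A rough sketch of what that mapping might look like in Python -- the property IDs are real Freebase ones as far as I remember, but the target wording and the whole mapping table are hypothetical:<p><pre><code># Hypothetical mapping from Freebase property paths to flat "ordered node" lists.
FREEBASE_TO_NODES = {
    "/people/person/date_of_birth": "date of birth",
    "/people/person/place_of_birth": "place of birth",
}

def to_nodes(topic_name, prop, value):
    label = FREEBASE_TO_NODES.get(prop)
    if label is None:
        return None  # no mapping yet; a human would have to add one
    return [topic_name, label, value]

print(to_nodes("Barack Obama", "/people/person/date_of_birth", "1961-08-04"))
# ['Barack Obama', 'date of birth', '1961-08-04']
</code></pre>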
Excellent! I've been working on something similar. Trying to come up with a schema that is data-centric is hard enough, let alone also focusing on ease of use for developers. Good luck!<p><i>Can I send how many requests I want?</i> -- I think you might mean <i>Can I send as many requests as I want?</i>
Any chance of releasing this as open source? For example, people would like to have it installed on their own servers and use it for their own things. I think it would be useful for fandom -- for example, a Star Wars DB or a Lord of the Rings DB :-)
Don't be deterred by the negative comments about the unstructured data. It's a tough problem but not an impossible one. I know because I'm battling the same question building a free-form, NLP-based self-tracking app to help track daily data ( <a href="http://thyself.io" rel="nofollow">http://thyself.io</a> ). The problem for me is that it's hard to perform analytics when one datapoint is in "miles walked" and the other is in "laps ran".<p>As you said, conventions help mitigate the problem a little bit, but the end user can hardly be expected to stick to best practices.<p>I have hope though. This is a problem worth solving.
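<p>To give a flavour of it, a first pass can just map free-form metric names onto a base unit -- something like this sketch (the metric names and the 400 m lap length are assumptions for illustration, not what thyself.io actually does):<p><pre><code># Rough normalization pass: pin every known metric to meters, keep unknowns as-is.
TO_METERS = {
    "miles walked": 1609.34,
    "laps ran": 400.0,   # assumed standard track; a real app would ask the user
    "km run": 1000.0,
}

def normalize(metric, amount):
    factor = TO_METERS.get(metric)
    if factor is None:
        return None  # unknown metric; leave the free-form datapoint untouched
    return {"metric": "distance_m", "amount": amount * factor, "source_metric": metric}

print(normalize("miles walked", 2))  # {'metric': 'distance_m', 'amount': 3218.68, ...}
print(normalize("laps ran", 5))      # {'metric': 'distance_m', 'amount': 2000.0, ...}
</code></pre>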
Reminds me of Freebase. They built a huge dataset themselves, as well as tools and an API to access it. Have you talked to anyone on the team? (They are now at Google.) How would you say that you are different from them?
I like this idea in the sense of an experiment. I'm not sure where it will end up, but it could be interesting.<p>As others have pointed out, some kind of convention must be established around the semantics, and something must be done to avoid redundancy (which leads to inconsistency) and ambiguity.<p>I agree with those criticisms, but if the community also helps develop the schema, it will be interesting to see. What collisions will happen? What will be the result of queries that reach far across disciplines?
I appreciate any new service that attempts to organize data / information. With that in mind, I hope this succeeds.<p>A suggestion: it needs a demo query box on the site. It shouldn't be too hard to let a rate-limited IP address throw a few keywords at it and spit back results. I'd like to see what the db contains before I invest too much time (how many topics, how many facts, etc.).
Based on observations and prior experience (esp. Bitzi), I believe the wiki approach of "correct-in-place" leads to better convergence and community than "downvote the errors, add a corrected entry, upvote the better entry".<p>(Voting democracy may help prevent people from being oppressed in certain ways, but it isn't much of a truth-discovery mechanism.)
Interesting concept. It's like RDF for human beings. It's easier for humans to read unstructured data, but at the same time it makes it extremely hard to do interesting stuff programmatically. You just can't do reliable inferencing.
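<p>A toy illustration of the inferencing point (made-up facts, plain Python rather than a real RDF stack): transitive queries work as long as the relation is one fixed identifier, and silently fail the moment someone spells it differently.<p><pre><code>facts = [
    ("Paris", "located_in", "France"),
    ("France", "located_in", "Europe"),
    ("Lyon", "is in", "France"),   # same meaning, different spelling
]

def located_in(place):
    """Follow 'located_in' edges transitively."""
    results, stack = set(), [place]
    while stack:
        current = stack.pop()
        for s, p, o in facts:
            if s == current and p == "located_in" and o not in results:
                results.add(o)
                stack.append(o)
    return results

print(located_in("Paris"))  # {'France', 'Europe'}
print(located_in("Lyon"))   # set() -- "is in" isn't recognized, so no inference happens
</code></pre>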