Ask HN: Implementing a graph database using Postgres tables for nodes and edges?

87 点作者 arunnanda超过 9 年前

After hearing/reading about the negative experiences many people have had with upcoming graph databases, I am wary of jumping into a graph db. But almost any modern application has things that will be suited to a graph representation.So I am considering a workaround to use a graph structure, but represent the underlying data in a reliable rdbms, in particular postgres.The approach is: 1. Make two parent tables: node and edge. 2. Make separate tables for all "objects" (like people, places, things, etc.) which would be inherited from node, so they would all have a unique node ID. 3. Make separate tables inherited from edge which would be used to represent the many many relations. So each relation has an edge ID, and each table inherited from edge can be thought of as representing a specific kind of relation (like person lives in place, or person is friends with person).One thing I have observed so far, is a large number of tables with few columns, but I think that lends itself to the advantage of easy indexing. There can be a large number of individual queries from the front end, but I believe I can use views to make represent tables with more comprehensive info to reduce the number of queries.What do you guys think of an approach like this? What am I missing, what is wrong with it? I haven't come across this previously, and so am a bit nervous about the ramifications. Is someone else also doing this?

30 条评论

buro9超过 9 年前

Storing the data isn't an issue. It's fairly trivial to come up with a number of very good solutions to describing a graph.The issue is querying the data, specifically when that involves walking the graph.The real question then, is how do you want to query the data? If you are willing to make an assumption now that adds a tight constraint on your future work... i.e. "We will only ever perform queries that are at most 1 or 2 hops from the nodes/edges being queried" then perhaps PostgreSQL will work for you.

评论 #10318102 未加载

评论 #10327451 未加载

duncanawoods超过 9 年前

The boring answer is that it depends on your data and what you need to do with it. Insights like "all data structures are graphs!" are dangerous - it can actually add vagueness that might prevent more useful understanding of your problem. Excessively generic handling of data often ends in pain when it comes to the real world (performance and complexity) compared to domain specific approaches.If you have multi GB graphs and need to run intensive classic graph analysis algorithms then a lot of that can come optimised out of the box with a dedicated graph db. They will still leave you with plenty of hard problems.However, if you have small graphs or only need conventional crud operations then using a new and unfamiliar graph db would probably be insane.If your case is the latter and postgres is your comfort zone and you can get a working solution quickly, I would be tempted to prototype along the lines you suggest. When you start getting performance problems or need graph analysis, you can then make a better decision whether a graphdb or restructuring your relational db will solve it the best.

评论 #10327465 未加载

pella超过 9 年前

The PostgreSQL Conference 2011 :"Trees and Graphs in the database" <a href="https://www.pgcon.org/2011/schedule/events/357.en.html" rel="nofollow">https://www.pgcon.org/2011/schedule/events/357.en.html</a>* Trees in the database - Slides: <a href="http://www.slideshare.net/quipo/trees-in-the-database-advanced-data-structures" rel="nofollow">http://www.slideshare.net/quipo/trees-in-the-database-advanc...</a>* Graphs in the database - Slides : <a href="http://www.slideshare.net/quipo/rdbms-in-the-social-networks-age" rel="nofollow">http://www.slideshare.net/quipo/rdbms-in-the-social-networks...</a>

评论 #10327493 未加载

emehrkay超过 9 年前

I am a huge fan of the Tinkerpop Stack (<a href="http://tinkerpop.incubator.apache.org" rel="nofollow">http://tinkerpop.incubator.apache.org</a>) -- they have written a common set of interfaces for graph databases. There is an implementation that can use Postgres as its storage engine (<a href="https://github.com/pietermartin/sqlg" rel="nofollow">https://github.com/pietermartin/sqlg</a>), there is one that can use ElasticSearch, I even seen one that uses MongoDB. Check it out

评论 #10327488 未加载

twic超过 9 年前

This sounds broadly sensible. It's basically a classic normalised relational design, which is as perfect as alligators or sewing needles, but with extracted supertables for nodes and edges.I would challenge the idea that you need these supertables, though. What those allow you to do is to write graph queries which are generic (polymorphic?) over multiple kinds of node and edge. So as well as "find me all the people who are friends with my friends", you can write "find me all the people who are friends with my friends, or who have been to a place i've been, or who own a thing i own" without using a union. Do you actually need to write queries like that?There are two downsides to the supertables. Firstly, more complexity, although it's a minor, or at least constant-factor, amount. Secondly, a loss of type safety. If your edges are defined in a supertable, then the columns which point to the ends of the edge have to be foreign keys to the node supertable. That means they can be any type of node; there's no way to constrain particular kinds of edge to connecting particular kinds of node. That seems like a considerable drawback to me.

评论 #10327353 未加载

t-rav超过 9 年前

Why re-invent the wheel ! All you need to do is properly understand your tool of choice. I have used Neo4j extensively for years and have had great results. Our biggest hurdle was understanding how to model the data for a graph properly. Get this right and everything falls into place. The one thing you are going to need is a good visualisation tool, so be prepared to roll one of these too :)

berdario超过 9 年前

I was also wondering about the same thing:Implementing a graph on top of a RDBMS is trivial, and if the semantics are correct (the same as the ones exposed by a graph db?), then I'm not sure why would people want to use a proper graph db.I thought that probably it'd be an issue of performance: the "right tool for the job" that "does one thing and does it well" probably is leaner and more efficient. After all, unlike a trivial implementation, getting a graph on a RDBMS to perform well might not be that simple after all (still, your idea of inheriting tables might make things more flexible and maybe more efficient)But then, when looking up some Neo4j benchmarks, the numbers seems to not be good at all:<a href="http://baach.de/Members/jhb/neo4j-performance-compared-to-mysql" rel="nofollow">http://baach.de/Members/jhb/neo4j-performance-compared-to-my...</a>I'd like to hear from someone that used Neo4j (also, other graph databases are interesting) in production, and benchmarked it against a RDBMS prototype, finding the former as the better solution of the two.

评论 #10327515 未加载

评论 #10318465 未加载

joshberkus超过 9 年前

I've done this a few times. First, see the list of PostgreSQL-based extensions and tools listed by other folks below; they're all good things to look at before you implement something from scratch. Regardless, this is probably a good approach if you are using Postgres anyway and just need a graph for part of your application.For short graph traversals, recursive queries and joins perform surprisingly well, even when compared against dedicated graph databases -- at least, the open source ones I've been able to try, including neo4j. The ones I've tried perform better than Postgres mostly when you get more than two hops away -- but then at least Neo and the MySQL thing tend to suffer if the data set doesn't fit in memory.I had at least one project which reverted from a dedicated graph database to Postgres specifically because it needed decent on-disk performance (2TB per shard).Anyway, to your question: you could build a PostgreSQL database like that, but why would you? If you have distinct types of entities (person vs. furniture vs. town), why would you denormalize them into generic node interitance? Having a completely abstracted database with one big table called "things" and another big table called "relationships" seems really attractive before you actually do it. Then it starts to suck.Relational databases are designed to perform well, and to be easy to query, with "normalized" databases in which different types of entities are represented by different tables and relationships are defined by foreign keys. The further you get away from that model, the harder the database gets to use.This is one of the problems with using graph concepts for questions which are not, fundamentally, graphs. Yes, you can define people's nationality as a graph from location to nation, but that makes it pretty hard to do voter registration stats. Graphs are a great tool for answering many important questions, but they are not a great tool for answering all questions.Now, if you are answering an actual graphing question (i.e. show me everyone who shares something in common with me via at least two links within 3 hops), then proceed. But figure out what kinds of questions you specifically want to answer before you get too far.

评论 #10327486 未加载

pella超过 9 年前

some alternatives to check :* RDFLib Store for PostgreSQL ( <a href="https://github.com/RDFLib/rdflib-sqlalchemy" rel="nofollow">https://github.com/RDFLib/rdflib-sqlalchemy</a> )* Sparqlify is a SPARQL-SQL rewriter that enables one to define RDF views on relational databases and query them with SPARQL. ( PostgreSQL supported <a href="http://aksw.org/Projects/Sparqlify.html" rel="nofollow">http://aksw.org/Projects/Sparqlify.html</a> )* PostgreSQL-Neo4j Foreign Data Wrappers <a href="https://github.com/nuko-yokohama/neo4j_fdw" rel="nofollow">https://github.com/nuko-yokohama/neo4j_fdw</a>* PgRouting (network ) <a href="http://docs.pgrouting.org/dev/doc/src/tutorial/analytics.html#graph-analytics" rel="nofollow">http://docs.pgrouting.org/dev/doc/src/tutorial/analytics.htm...</a>

评论 #10327525 未加载

kitd超过 9 年前

I think you want to google "triple store" because that is effectively what you are trying to do AIUI. Triple stores are the persistence strategy of choice for RDF and other Semantic-Web-type applications.Each row in a triple store stores a relationship between 2 objects, with unique IDs for the subject, the relationship type, and the object. The IDs could refer to other tables that store entity and relationship details.There are specialised triple store DBMSs, but implementing one in PostgreSQL should be pretty easy.

ladon86超过 9 年前

No-one seems to have yet mentioned Facebook's TAO paper, which describes scaling basically this approach to 1 billion reads per second. <a href="https://cs.uwaterloo.ca/~brecht/courses/854-Emerging-2014/readings/data-store/tao-facebook-distributed-datastore-atc-2013.pdf" rel="nofollow">https://cs.uwaterloo.ca/~brecht/courses/854-Emerging-2014/re...</a>The nice thing about your approach is that you can shard your objects by their id and your edges by their "from" id, and have all lookups go to one box when you shard. Throw in some caching and it scales to multiple boxes really well.

评论 #10326844 未加载

rikkus超过 9 年前

If directed acyclic is sufficient for your use case, you can use this approach, which has worked perfectly for me in the past: <a href="http://www.codeproject.com/Articles/22824/A-Model-to-Represent-Directed-Acyclic-Graphs-DAG-o" rel="nofollow">http://www.codeproject.com/Articles/22824/A-Model-to-Represe...</a>

评论 #10319125 未加载

mikkergp超过 9 年前

I use Neo4J and I'm curious what negative experiences you are referring to, and what benefit this might have over that. The biggest problem I foresee is querying and possibly data integrity. Querying since doing any kind of graph traversal would probably require writing hacks in software, and data integrity for much of the same reason. It may be easy to ensure a basic relationship but beyond that I would think it would be more difficult.

评论 #10317569 未加载

评论 #10317464 未加载

pbnjay超过 9 年前

I've done this a few times. Depending on the size/density of your data, it can be fine. If you have billions of entities and/or highly dense relations, it's not very efficient though.I didn't use table inheritance, and I don't distinguish relationships at the query stage, so if that's core to your method you should think about it. I assume with that setup it'd be easier to take advantage of partial indexes though.Recursive queries work but if the intermediate tables gets large it explodes fast. You can't use aggregates in the recursive function either (e.g. to count number of leaf nodes in a tree), you have to apply them at the end, in which case the intermediate table has to be large...I've had decent experience so far with GIN indexes and @> operators on an ARRAY[] column adjacency list. In my case I stored "ancestors" so I could select by object id and get all ancestors, or use `ancestors @> ARRAY[parent_id]` to select all object ids as descendants.Of course, if you don't need any "self" relationships, then none of the really complicated stuff matters...

评论 #10326937 未加载

bobosha超过 9 年前

I had a similar challenge and ended up using ElasticSearch in a thoroughly denormalized data format Allowed me the flexibility of stateful (contextual) queries etc. Bit of a hack, but served my purpose. Hope that helps.

评论 #10327605 未加载

评论 #10327608 未加载

jerven超过 9 年前

A lot of the early academic RDF graph databases used this approach. However, performance was no where near to what is required. This let to for the SPARQL/RDF graph databases to now a whole set of independently developed stores. Some on top of existing solutions e.g. Oracle semnet, Virtuoso and DB2 sparql. More on their own solid foundations.You could test out 3 or 4 different SPARQL solutions in the time that it would take you to develop something graph like on your own.On the other hand, cutting edge approaches, actually take a graph representation of data and lay it out in a relational manner. <a href="http://ercim-news.ercim.eu/en96/special/monetdb-rdf-discovering-and-exploiting-the-emergent-schema-of-rdf-data" rel="nofollow">http://ercim-news.ercim.eu/en96/special/monetdb-rdf-discover...</a> giving the best of both worlds.In short you can build something yourself. But don't expect that it will be better than something build and supported by someone else.So investigate the competition: BlazeGraph, Virtuoso, StarDog, Oracle 12c EE/semnet, DB9, Dydra before deciding to build your own. Building your own because its fun to do is great, but unless it pays your bills not a good idea for production environment.PS. The edge table (EAV) is the major problem, it leads to a lot of self joins and difficult exercises for the query planner.You can improve a lot on this if you can put "different" edges into different tables or partitions.Triple stores with support for JSON-LD framing such as Dydra can also make it easier to have a front-end on top of your DB without extensive middle layer code.A store like StarDog and BlazeGraph on the other hand gives you a lot of flexibility by both supporting SPARQL and TinkerPop. (both cluster and scale out, although BlazeGraph has GPL option. StarDog is only Commercial)

评论 #10318360 未加载

newobj超过 9 年前

<a href="https://www.facebook.com/notes/facebook-engineering/tao-the-power-of-the-graph/10151525983993920" rel="nofollow">https://www.facebook.com/notes/facebook-engineering/tao-the-...</a>

评论 #10319598 未加载

russtrotter超过 9 年前

One technique you might look into is "Materialized Path":* usually one single query to fetch an entire tree or subtree * more work on reconstructing tree from query result * requires prefix pattern matching operator, e.g. "like" from RDBMS

评论 #10326761 未加载

rawnlq超过 9 年前

Dropbox does something similar called Edgestore which is built on top of MySQL: <a href="https://www.youtube.com/watch?v=VZ-zJEWi-Vo" rel="nofollow">https://www.youtube.com/watch?v=VZ-zJEWi-Vo</a>

评论 #10317553 未加载

amirouche超过 9 年前

Building a graph database is time consuming. Why are you putting edge's children (person, place, friend) in a separate table instead of using a json field? Do you plan to query the person, place, friend table directly? You need to benchmark the different queries.My understanding is that big5 and others do not use graphdb for all their stuff since they are not as good as rdbms to do queries with a single hop or JOIN. They embrace microservices and maintain a graphdb (in memory, persistent or distributed) to answer domain specific queries. That approach is similar to your schema except that graph queries run on a single node without superfluous network roundtrips.It would be nice to be able to use a single database for all data related stuff to have atomic writes and simpler architecture. That's what multi-model databases are tackling have a look at OrientDB and ArangoDB [1].Also, Tinkerpop, already mentioned in the thread, is a ready-to-scale graphdb with much love I recommend you have a look at Tales of the Tinkerpop [2].[1] <a href="https://news.ycombinator.com/item?id=10180185" rel="nofollow">https://news.ycombinator.com/item?id=10180185</a>[2] <a href="https://news.ycombinator.com/item?id=10316140" rel="nofollow">https://news.ycombinator.com/item?id=10316140</a>

评论 #10326771 未加载

评论 #10327919 未加载

3pt14159超过 9 年前

I've done this before with MySQL in 2011 after throwing my hands in the air with two different graph DBs.By far the most productive thing I did. People will say that you need recursive queries, etc, but I found most of the time just keeping the query set in the application layer was easy enough and super fast. Scaling it was easier too because there are lots of articles on scaling the MySQL / Postgres.

评论 #10326784 未加载

swah超过 9 年前

I'd do the same. I think this book is still relevant: <a href="http://www.amazon.com/SQL-Antipatterns-Programming-Pragmatic-Programmers/dp/1934356557" rel="nofollow">http://www.amazon.com/SQL-Antipatterns-Programming-Pragmatic...</a>

barakm超过 9 年前

If you want to help hack on this, I've written a Postgres backend for the graph database Cayley that's fairly fast and about to merge (for use on a prod feature): <a href="https://github.com/google/cayley/pull/289" rel="nofollow">https://github.com/google/cayley/pull/289</a>For traversals, it tries its level best to let Postgres create the appropriate JOINs, which Postgres does well... most of the time anyway. In doing this, I learned really fast how many weird semantic corners of bog-standard SQL are, and the optimization choices Postgres makes based on the way you formulate the query.Ping me on that PR! And/or join #cayley on Freenode

评论 #10327986 未加载

JoachimSchipper超过 9 年前

What are you actually trying to do? If you just need some queries on a small amount of read-only data, objects in memory (in a dedicated process, perhaps) are a perfectly sane approach, and much faster than ping-ponging queries with any database.

neurohax超过 9 年前

I believe Redis and Hyperdex have constructs such as bitarrays and sets that can be easily used for representing and querying graphs/hypergraphs. For Redis there are some success stories and even ready to use extensions.

calpaterson超过 9 年前

I don't know much about this but Joe Celko has a whole book on representing trees in SQL: <a href="http://www.amazon.co.uk/Hierarchies-Smarties-Kaufmann-Management-Systems/dp/0123877334/" rel="nofollow">http://www.amazon.co.uk/Hierarchies-Smarties-Kaufmann-Manage...</a>The obvious first question would be: why are you representing your data as a graph? Can you represent it better as sets of predicates (ie: relations)?Using views should not reduce the number of queries: a view is just a query with a name. If you can do it by combining views you can do it by combining queries.

nrjames超过 9 年前

I would love to see a SQLite extension to support graphs...

rottyguy超过 9 年前

This is fairly timely as I'm investigating possibilities for our data model and graphdb just popped up on my radar. Are gdb good for modeling multiple types of products? Let's say v1 of product has attributes a, b, c but v2 now adds x, y, z and removes a. Would that be 2 separate nodes?I too am interested in understanding the downsides of a gdb (seeing some refs to performance but nothing concrete).

评论 #10326800 未加载

madawan超过 9 年前

Do you have a gist that implements this approach?

marknadal超过 9 年前

I met with Kyle Kingsbury (Aphyr of Call Me Maybe the Jepsen database testing series) out in Berlin and I recall him mentioning at one point that Postgres was a good generic database especially with the JSON blob storage. So I am wondering if making the extensions you propose might not be worth it cause it will come with all that extra overhead.I do however recommend exploring and even building your own graph database, it is a great learning exercise. But don't fool yourself by doing it on top of Postgres, actually jump in and become immersed. I built <a href="http://gunDB.io/" rel="nofollow">http://gunDB.io/</a> to scratch a similar itch, I wanted an easy graph database with realtime updates like Firebase. I would love to hear your feedback, or even have you contribute.

30 条评论

buro9超过 9 年前

评论 #10318102 未加载

评论 #10327451 未加载

duncanawoods超过 9 年前

评论 #10327465 未加载

pella超过 9 年前

评论 #10327493 未加载

emehrkay超过 9 年前

评论 #10327488 未加载

twic超过 9 年前

评论 #10327353 未加载

t-rav超过 9 年前

berdario超过 9 年前

评论 #10327515 未加载

评论 #10318465 未加载

joshberkus超过 9 年前

评论 #10327486 未加载

pella超过 9 年前

评论 #10327525 未加载

kitd超过 9 年前

ladon86超过 9 年前

评论 #10326844 未加载

rikkus超过 9 年前

评论 #10319125 未加载

mikkergp超过 9 年前

评论 #10317569 未加载

评论 #10317464 未加载

pbnjay超过 9 年前

评论 #10326937 未加载

bobosha超过 9 年前

评论 #10327605 未加载

评论 #10327608 未加载

jerven超过 9 年前

评论 #10318360 未加载

newobj超过 9 年前

<a href="https://www.facebook.com/notes/facebook-engineering/tao-the-power-of-the-graph/10151525983993920" rel="nofollow">https://www.facebook.com/notes/facebook-engineering/tao-the-...</a>

评论 #10319598 未加载

russtrotter超过 9 年前

评论 #10326761 未加载

rawnlq超过 9 年前

Dropbox does something similar called Edgestore which is built on top of MySQL: <a href="https://www.youtube.com/watch?v=VZ-zJEWi-Vo" rel="nofollow">https://www.youtube.com/watch?v=VZ-zJEWi-Vo</a>

评论 #10317553 未加载

amirouche超过 9 年前

评论 #10326771 未加载

评论 #10327919 未加载

3pt14159超过 9 年前

评论 #10326784 未加载

swah超过 9 年前

barakm超过 9 年前

评论 #10327986 未加载

JoachimSchipper超过 9 年前

neurohax超过 9 年前

calpaterson超过 9 年前

nrjames超过 9 年前

I would love to see a SQLite extension to support graphs...

rottyguy超过 9 年前

评论 #10326800 未加载

madawan超过 9 年前

Do you have a gist that implements this approach?

marknadal超过 9 年前