Every now and then you come across an open source project that just amazes you. OrientDB is one of these projects.
I’ve always assumed that I’d have to use a polyglot persistence model in complex applications. I’d use a graph database if I wanted to traverse the information, I’d use a document database when I wanted schema-less complex structures, and the list goes on.
OrientDB seems to have it all though. It is kind of the Swiss army knife of databases, but unlike a Swiss army knife, each of the tools is best of breed.
I’ve had a few experiences with applications built on OrientDB and have also been spending some time testing and evaluating the database. I keep thinking back to projects that I’ve implemented in the past and wishing I had had OrientDB at my disposal. I find myself asking questions such as:
- Would it be a viable candidate to replace the database we used?
- How would I have changed the architecture if I did use OrientDB?
- What would the impact of OrientDB be on factors such as:
  - Elegance of implementation
  - Cost of development
  - and so on…
In this article I’ll explain what OrientDB is (from my perspective), why it may be hard to classify and some scenarios of how it could be used.
What is OrientDB?
OrientDB is a tool capable of defining, persisting, retrieving and traversing information. I want to start there, rather than saying it is a XXX type database. This is because OrientDB can be used in multiple ways. It can play the role of a document database (making it a competitor to MongoDB, CouchDB, etc.), it can be a graph database (making it a competitor to Neo4J, Titan, etc.) and it can be an object-oriented database. And it can play all those roles at the same time.
OrientDB as a Document Database
Let’s look at OrientDB from the perspective of a document database. OrientDB can store documents (documents here being a nested set of name-value pairs). Perhaps you’re familiar with MongoDB or CouchDB? If so, OrientDB can take an arbitrary document (e.g., a JSON document) and store it. After it has been stored you can query it using path expressions, as you would expect from any document database.
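To make "query it using path expressions" concrete, here is a minimal sketch in plain Python. It is not OrientDB's API; the sample document and the helper names `get_path` and `find` are made up for illustration. It only shows the idea of resolving a dotted path against a nested set of name-value pairs.

```python
# A nested document, as a document database would store it (sample data).
person = {
    "name": "Ada",
    "address": {"city": "Oslo", "geo": {"lat": 59.91, "lon": 10.75}},
}


def get_path(document, path, default=None):
    """Resolve a dotted path such as 'address.geo.lat' against nested dicts."""
    current = document
    for key in path.split("."):
        # Stop as soon as the path leaves the nested structure.
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


def find(documents, path, value):
    """Return the documents whose value at `path` equals `value`."""
    return [d for d in documents if get_path(d, path) == value]


print(get_path(person, "address.city"))
print(len(find([person], "address.geo.lat", 59.91)))
```

A real document database does the same kind of lookup inside its query language, with indexes behind it rather than a linear scan.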
If you have ever worked with document databases, you may sometimes have come across the need to store links. I see this all the time. Say we used a document database somewhere and some of the team members have experience with relational databases. When they discover the primitive support for links, we’ll have long discussions about normalization, how document databases are different, and so on. I can usually convince the members that the document database is a better solution, but the truth is… I kind of miss my relationships.
OrientDB as a Graph Database
Talking about relationships, the ultimate in handling relationships is, as you probably know, the graph database. Graph databases typically implement relationships as first-class citizens called edges (first-class citizens as opposed to relational databases, which use keys and foreign keys). Edges connect vertices. A vertex, in most graph databases, is a simple cluster of name-value pairs.
Now, imagine each document in the document database as a vertex. Is that possible? OrientDB has done exactly that. Instead of each node being a flat set of properties, it can be a complete document (with nested properties).
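The idea of a vertex that carries a whole document can be sketched in a few lines of plain Python. This is an illustration only, not how OrientDB is implemented or accessed; the `Vertex` and `Edge` classes, the `WorksFor` label and the sample data are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Vertex:
    """A vertex that carries a whole nested document, not just flat properties."""
    document: dict
    out_edges: list = field(default_factory=list)


@dataclass
class Edge:
    """A first-class relationship: it has its own label and a target vertex."""
    label: str
    target: Vertex


def link(source, label, target):
    """Create an edge from `source` to `target`."""
    source.out_edges.append(Edge(label, target))


def neighbours(vertex, label):
    """Follow all outgoing edges with the given label."""
    return [e.target for e in vertex.out_edges if e.label == label]


acme = Vertex({"name": "Acme", "address": {"city": "Oslo", "country": "NO"}})
john = Vertex({"name": "John", "contact": {"email": "john@example.com"}})
link(john, "WorksFor", acme)

employers = neighbours(john, "WorksFor")
# A nested property of the document, reached by traversing an edge.
print(employers[0].document["address"]["city"])
```

The point of the sketch is that traversal and nested documents do not compete: the edge gets you to the vertex, and the vertex is a full document once you arrive.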
OrientDB as an Object-Oriented Database
Why do we create documents? What do they represent?
In most cases I would think that each document represents some conceptual object. Think about it. What does each of your documents in a document database represent? Perhaps it represents a company, a person or a transaction? I would say, more generically, it probably represents an object. Also, in document databases, we most often type these documents. That is, there is a class of documents that follow the same set of rules.
How about graphs? I would suggest that here, too, each vertex typically maps to some conceptual object.
In OrientDB the vertex and the document are superimposed, so here too it is interesting to think of the document/vertex as an object, and of the rules that similar objects follow as classes. So, let’s assume we want to impose rules on the data structure of each related object, as in an object-oriented system. What advantages could we obtain?
- We would have a guarantee that the objects conformed to some rules we defined
- It would be easier to query the objects because they at least named the properties the same
- Perhaps we could use relationships between rules for structures as in object-oriented systems (often called inheritance) to organize the rules.
OrientDB allows you to define classes that the objects (vertices or documents) must conform to. It is probably necessary for me to point out that OrientDB does not force you to do so. You can run in strict schema mode (all objects are typed and must conform to the class definitions), in a hybrid mode (all objects must AT LEAST conform to the rules of the classes but may add any other properties not specified in the classes) or in schema-less mode.
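The difference between the three modes can be sketched in plain Python. This is not OrientDB code; the mode names, the `validate` function and the sample schema are assumptions made for this illustration, and the sketch simplifies by treating every declared property as mandatory.

```python
def validate(document, schema, mode):
    """Check `document` against `schema` (property name -> required Python type).

    `mode` is "strict", "hybrid" or "schemaless" (names chosen for this sketch).
    Returns a list of problems; an empty list means the document is accepted.
    """
    if mode == "schemaless":
        return []  # anything goes
    problems = []
    # Both strict and hybrid require the declared properties to be correct.
    for name, expected in schema.items():
        if name not in document:
            problems.append(f"missing property: {name}")
        elif not isinstance(document[name], expected):
            problems.append(f"wrong type for property: {name}")
    # Only strict mode rejects properties the class does not declare.
    if mode == "strict":
        for name in document:
            if name not in schema:
                problems.append(f"undeclared property: {name}")
    return problems


person_schema = {"firstName": str, "age": int}
doc = {"firstName": "Petter", "age": 40, "nickname": "P"}

for mode in ("strict", "hybrid", "schemaless"):
    print(mode, validate(doc, person_schema, mode))
```

The extra `nickname` property is what separates the modes: the hybrid and schema-less checks accept it, the strict check does not.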
I can hear the skeptics here… Sure, we’ve seen this with some of the relational database vendors also. When they got scared of objects, they introduced something that looked like classes, but when we studied it closer it was missing important things like polymorphism (e.g., you could define a hierarchy with Pet as a superclass and Cat and Dog as subclasses, but the database would not understand queries like “give me all pets”; you would have to ask “give me all dogs and cats”). However, in OrientDB this works too!
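What "give me all pets" means for a query engine can be shown with a small plain-Python sketch. It is an illustration of polymorphic selection, not OrientDB's implementation; the `class_parents` table, the `is_a` and `select` helpers and the records are invented for the example.

```python
# Class hierarchy: class name -> superclass name (None for a root class).
class_parents = {"Pet": None, "Cat": "Pet", "Dog": "Pet", "Person": None}


def is_a(class_name, ancestor):
    """True if `class_name` is `ancestor` or one of its subclasses."""
    while class_name is not None:
        if class_name == ancestor:
            return True
        class_name = class_parents[class_name]
    return False


def select(records, class_name):
    """Polymorphic query: match records of the class and of all its subclasses."""
    return [r for r in records if is_a(r["class"], class_name)]


records = [
    {"class": "Cat", "name": "Tom"},
    {"class": "Dog", "name": "Rex"},
    {"class": "Person", "name": "John"},
]

print([r["name"] for r in select(records, "Pet")])
print([r["name"] for r in select(records, "Dog")])
```

Without the walk up the hierarchy in `is_a`, the caller would have to list every subclass by hand, which is exactly the “give me all dogs and cats” problem.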
Come on! There has to be a Catch?
Perhaps it doesn’t scale? Maybe it doesn’t perform? This is too good to be true!
I’ll keep looking and if I find something I’ll post it. The two questions above were where I thought I’d find the issues.
OrientDB scales. Really, it truly scales. It seems to have a much better strategy than its competitors. It’s hard to know exactly who to compare it to… do I compare it to the document databases or the graph databases? I decided to look at both categories.
I’ve yet to test this out in a large project. However, at least on paper, the master-master replication, the multi-cluster support, etc. make me very optimistic with respect to scaling.
It is very hard to find performance numbers that compare databases. I did actually see a test from a university in Japan where they compared performance numbers of the various graph databases. OrientDB in this test was outperforming the competitor by a factor. But since the numbers are from older versions of both OrientDB and the competitor tools, I’m not sure how much weight to give the test. The first time we use it on a client application where the client allows me to publish numbers, I promise to share them. One client did allow me to at least say that, from their numbers, OrientDB still outperformed its competitors and, what was more interesting (to me), it also outperformed one of the leading relational databases in some non-traversal scenarios (we know graph databases are fast when we look up a vertex/document and start navigating from there; however, I would have thought it would not be able to compete for queries such as “select * from Person where firstName like ‘%Petter%’”).
Use Cases for OrientDB
I would think almost anywhere you build a canonical information model to store the state of the system, OrientDB would be a good choice. I’m not sure it is the best choice for time-series data (perhaps a database such as Cassandra would have an edge here); however, for most traditional domain models, it should work well.
Traditional Domain Model Implementations
For most systems, we build out a domain model (or logical information model) that describes what information the system must maintain. Because RAM is more expensive than many other storage forms and because RAM has a tendency to lose its state when the power goes off, we want to ensure that this information is stored on some disk somewhere so that it can be put back into RAM when needed.
The state of the art for building such models is to build an object-oriented class diagram, typically in UML (I would say this could be argued. There are some better alternatives here. For instance Express/Express-G and Clafer are perhaps better languages, however, UML is more readily adopted, so… UML it is…). This model defines classes with their properties and associations between classes. With most databases we’ll experience some impedance mismatch when mapping the canonical model:
- Relational Databases
  - No support for inheritance. Need different strategies such as single table inheritance, table per class, etc.
  - Complex properties typically require their own table. It is now unclear if the table represents an object with or without individuality (or at least the distinction is lost when looking at the tables).
  - No support for polymorphism.
  - Relationships have to be mapped into key/foreign-keys.
- Graph Databases
  - No support for inheritance (although, with some clever engineering, one can define the meta-data hierarchy as vertices and edges and get pretty close).
  - Complex properties introduce new vertices even though we don’t really need to link to them (no individuality).
  - No support for polymorphism.
- Document Databases
  - No support for inheritance (although, easy to simulate)
  - No (or limited) support for relationships
In OrientDB the mapping pretty much eliminates all impedance mismatch:
- An object becomes a vertex
- Complex properties can easily be handled as documents
- Explicit support for relationships
- Understands typing (in schema mode), which means
  - Strict enforcement of constraints
Off the top of my head, I can think of at least 20 projects I’ve worked on that had the need for some degree of run-time configuration where users could configure the rules of the data structures stored. Those of you who have never worked on these kinds of systems may not fully appreciate the complexity such demands introduce in the implementation.
Perhaps you have developed a web-form before that collected some data from your users. You knew what kind of data structure to obtain and you simply defined a form that was capable of collecting such information.
Now I want you to imagine a system where you are selling customers the capability to define their own forms. It is your task to build a system that allows the customer to define the form, and that collects and stores the information from these forms. What would such a system look like?
If you have already built such a system, you probably know that you’ll end up with two distinct kinds of data:
- Meta-data
  - Defines the rules for the data structures
  - Form A has to include a string called Social Security Number and it must conform to some reg-ex pattern
- Instance data
  - Defines the actual data collected and links to the meta-data that establishes semantics
  - User John’s instance of a form where there is a string property with the value 123-45-6789 that was collected as a social security number (the link to the meta-data)
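The split between the two kinds of data can be sketched in plain Python. This is an illustration under assumptions, not a design from any of the projects mentioned: the field name `socialSecurityNumber`, the regular expression and the helper functions are all hypothetical.

```python
import re

# Meta-data: rules a customer defined at run time for "Form A".
# The pattern below is an assumed example of "some reg-ex pattern".
form_a = {
    "name": "Form A",
    "fields": {
        "socialSecurityNumber": {"type": str, "pattern": r"\d{3}-\d{2}-\d{4}"},
    },
}


def make_instance(form, values):
    """Instance data: the collected values plus a link back to the meta-data."""
    return {"form": form, "values": values}


def check(instance):
    """Validate an instance against the meta-data it links to."""
    errors = []
    for name, rule in instance["form"]["fields"].items():
        value = instance["values"].get(name)
        if not isinstance(value, rule["type"]):
            errors.append(f"{name}: missing or wrong type")
        elif not re.fullmatch(rule["pattern"], value):
            errors.append(f"{name}: does not match pattern")
    return errors


johns_form = make_instance(form_a, {"socialSecurityNumber": "123-45-6789"})
print(check(johns_form))
```

Note that the value 123-45-6789 only means "social security number" because the instance links back to the form definition; on its own it is just a string.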
Now, I want you to imagine the relational database schema behind such an application and the complexity in the joins that retrieve this data. Not much fun!
A schema-free document database could at least simplify this problem (you would still have to do some nasty coding to match up the meta-data with the instance data, but the data model would be elegant and quite explicit). Similarly, in a graph database, such a problem can be easily accommodated.
Here again, OrientDB shines, by:
- Being a document database (you can store any document you want on a vertex)
- Being a graph database (you can simply introduce new edges and properties, remember properties are simple name-value pairs)
- Allowing for schemas to be introduced at runtime and hence enforcing many of the rules for you
  - In many cases, this could be your metadata!
This has been a shameless tribute to the brains and muscles behind OrientDB, the most versatile database I’ve run across. May the future bring you fame, fortune and happiness!