eigenclass logo
MAIN  Index  Search  Changes  PageRank  Login

Addressing the ORM problem, typed relational algebra

I have been working on a typed relational algebra that allows to abstract and compose queries that can be verified statically. This means that the type system ensures all queries are well-formed and you never try to use a non-existent column or table or an existent one with the wrong types.

There's no way a change to the DB schema can break your code silently.

Query composition is sorely missing in most ORMs; you normally do some query and get in return a number of objects in some sort of container, which can be as simple as an array. You cannot compose queries, all you get is rows. In other words, they abstract at the wrong level, focusing on rows instead of relations.

By applying ideas learned from the functional paradigm, in the code I have been writing queries can be abstracted and composed freely. The system ensures they are sound and will help me fix them if they get out of sync with the DB schema.

The basic idea is to work with relations instead of rows. There are only a few operations in relational algebra and they can be easily represented by functions. Selection is a function of type

 selection :  relation_type_1 -> relation_type_1

where the result contains a subset of the elements in the initial set. The cartesian product takes two relations and returns a third one

 product   :  relation_type_1 -> relation_type_2 -> relation_type_12

where the third relation contains all the attributes in the two relations given as arguments. Projection is then of the form

 project   :  relation_type_1 -> columns_2 -> relation_type_2

Here the output contains a subset of the attributes in the initial relation. The remaining operations are defined in similar ways.

You have probably noted that I have been using different subscript indices in the above type declarations. That's because the relational operators can be seen not only as an operation on relations but also on their types. If you have a USER table with two columns, NAME and PASSWORD, one projection could be

 (relation with NAME and PASSWORD) -> (keep NAME) -> (relation with NAME)

That is, the type of a relation carries information about its attributes and their types (name is a string, age is an integer...).

Since the relational operators operate on whole relations, not on rows, the SQL query can be done lazily, and several optimizations are possible. The system "understands" the queries since they're not hidden inside opaque strings, and can be much smarter than a run-of-the-mill ORM.

Implementation

I make full usage of the OCaml type system to encode schema information in the types. I'm using phantom types, existential types (encoded using polymorphic records, rank-1 polymorphism), polymorphic variant types, structural polymorphism, (obviously) parametric polymorphism... But all the magic is encapsulated and none of it is seen by the developer; all you see is queries expressed in a simple language that are checked by the compiler. Fortunately, OCaml includes a tool that allows you to extend the language itself, camlp4. I have written a camlp4 extension that supports relational queries expressed like this:

 (* this function takes a relation with a user_age attribute and returns
    another with the rows satisfying the predicate *) 
 let minors users = SELECT [Age < 18] users

The "minors" function only makes sense when the relation given in the user argument has got an "age" attribute. This is encoded in the inferred type as follows:

 val minors :
   ([> `Age of int ] as 'a, [> `Age ] as 'b) Relational.relation ->
   ('a, 'b) Relational.relation

Relational.relation is a polymorphic type, and "minors" only operates on relations with at least an "age" column of type int. If you try to apply it to a relation that doesn't have that column, you'll get a very expressive compile-time error like this:

   File "foo.ml", line 14, characters 17-22:
   This expression has type
     ([ `Name of string | `Password of string], [ `Name | `Password]) Relational.relation
   but is here used with type
     ([> `Age of int ], [> `Age ]) Relational.relation
   The first variant type does not allow tag(s) `Age

The compiler has detected an invalid query; you're doing

 SELECT * FROM users WHERE age < 18

on a table that doesn't have an "age" column! All broken queries will be found statically before that code is executed. This is stronger than a unit/integration test because the whereas type system actually proves that there are no meaningless queries, testing can only verify that they aren't observed along some code path.

In the above examples, "minors" is a function, a first-class value, and can thus be manipulated like any other function. All such operations will be typed and their soundness checked by the compiler. You can use parametric polymorphism to create reusable queries of increasing complexity. Here's a simple "higher-order query" that takes an age and two predicates parameterized by an age, and returns a relation corresponding to the rows that satisfy them both:

 let select (age : int) query1 query2 users = query1 age (query2 age users)
  (*            ^^^^^^ *)
  (* we restrict polymorphism here because we only want integer ages *)

This is the inferred type:

 val select : int -> (int -> 'a -> 'b) -> (int -> 'c -> 'a) -> 'c -> 'b

Yes, it's a plain old higher-order function! The relational operators are plain functions, so they can be composed with functions like the above.

I have only shown a very simple SELECT query above, but I have implemented type-safe projection, renaming and product, as well as some other SQL operators (order, limit...). They look like this

 let name_and_age rel = LIMIT 100 (ORDER [Age] (PROJECT [Name Age] rel))
 
 (* get all name pairs *)
 let name_cartesian_product users =
   PRODUCT [User_name] [User2_name] 
     users (RENAME [User_name AS User2_name] users)

I might add further syntactic sugar to make them even more convenient, maybe approaching something like HQL.

Reassessing the ORM problem

The ORM problem has several aspects and nobody so far can claim rightfully to have solved them all, but the typed relational algebra I have been working on addresses the following issues :

  1. having to model your data twice, in the DB schema and in your object models. Describing the mapping between relations and objects manually (say in XML configuration files) also counts as unneeded redundancy.
  2. coupling between the schema and the object model: all your queries are relative to a given schema and will break if the latter changes. Changing the schema becomes increasingly hard.
  3. the N+1 query problem

(1) Can be addressed in at least two ways. The first one is to complete object models dynamically based on the tables found in the DB at runtime (as done in ActiveRecord). The second approach consists in describing the schema in the code and have both the object models and the DB schema generated at once (this is what Og does), possibly mutating an existent schema and moving/renaming tables and columns as needed.

The second technique allows to use the type system to enforce the soundness of all queries statically. In my code, if a query isn't meaningful, it just doesn't compile.

(2) Coupling is a nasty problem mainly because there's nothing to help you to maintain consistency between the DB and the object model apart from the tests (which will have to run against a DB, kiss mocking bye). This problem haunts nearly all ORMs out there, no matter the language or type system. The problem is obvious in Rails, from the moment you have User.find_by_name you're depending on a "name" column in the "users" table and your code will bomb at runtime if either of them goes away. In any other system where queries are represented by opaque strings, there is no way to validate queries before the code is actually executed.

(3) The N+1 problem seems to me more fundamental than the others. It is caused by deep differences in the way relations are represented. For instance, in 1:N relationships the representations are inverted. Whereas in the object model a user object knows about all the documents it owns, (has_many), in the DB it is rather each document that knows who it belongs to!

Eager loading is but an unsatisfactory solution to this problem. IMO it is best to stop trying to shoehorn relations into containers in the object model. The ability to abstract and compose queries allows to recover most of what we lost in terms of convenience.

Related work

Here are some pointers if you are interested in this approach:

update.png [Thanks to Robert Brewer]



Wow - chris2 (2007-10-31 (Mit) 11:21:16)

I have to bow to you. :-)

mfp 2007-10-31 (Wed) 18:45:36

I was going to write "... yet you put it in the WYW (what you want) category", but I just realized I've been misreading WJW all this time :-) Now I wonder what it means.

Color me impressed - brian (2007-10-31 (Wed) 11:46:20)

I'd love to hear more about what you call the N+1 query problem, I've never found a satisfactory solution to this issue that seems to plague all ORMs.

mfp 2007-10-31 (Wed) 18:13:02

I don't claim I have a great solution; I just believe it's better to give up and not expect the associations to be reflected directly by the object model via containers in the "parent" object. Instead, you can have a query that generates a relation and use it both to materialize some models and in a join that yields what would have taken N queries normally. I haven't shown how relations are constructed or consumed yet but I'll write an example to illustrate what I mean.

No Title - vruz (2007-10-31 (Wed) 14:48:02)

in awe

:-)


type checking - mike bayer (2007-10-31 (Wed) 15:45:12)

whats wrong with having the database itself report typing errors ? every database has different typing behavior, coercion rules are different, etc. how does having a query compile time error improve upon letting the database, which may actually be able to work with the given query, make the call ?

mfp 2007-10-31 (Wed) 17:11:50

It's just the difference between a compile time error and a runtime one. If you rely on runtime errors, you'll have to test all code paths and you can never be sure there isn't some buggy query somewhere, especially if you change the database or refactor your code. If you piggyback on the type system, it is *proving* for you that there are no faulty queries. So it's not just about checking the type of some variables in a prepared statement, it's rather proving that all queries will be correct (you're referencing existent tables and columns, the joins are meaningful, etc.).

There's another more general advantage in practice to type checking vs. unit/functional testing in a dynamic language. Take Ruby, and consider some nicely duck-typed code. If something goes wrong, you'll probably get a NoMethodError/NameError where you shouldn't at runtime, but the tests won't always tell you where the object that failed to respond to a method adequately came from. You have to examine the stack frames and try to infer how you got that spurious object. This can be hard. OTOH, if there's a type error the compiler can probably point at the very token that causes the problem.

mfp 2007-10-31 (Wed) 18:04:40

correction: the "more general advantage" applies to a type system vs. functional tests. In unit tests, determining the origin of the problem should always be easy. It gets difficult when other subsystems are involved.


No Title - Nick Murphy (2007-10-31 (Wed) 16:28:24)

I've been working on a similar problem, but I've gone about it a slightly different way. I'm working on a DSL that implements the relational algebra and runs in an embedded interpreter. The interpreter is written in the same language as the host language so they can share the same type system. Check it out if you're interested, at:

http://relationale.sourceforge.net

mfp 2007-10-31 (Wed) 18:15:58

Thanks for the pointer. I'll look at your code; I wonder what sort of type checking you're doing in Python.

Nick Murphy 2007-10-31 (Wed) 21:57:23

The embedded language is statically typed, even though Python isn't. Types are assigned to values upon registration by inspecting the objects being registered -- the interpreter maintains a map of Python classes to scalar types, along with a means of converting back and forth. Any query issued to the interpreter is first compiled and then checked for type correctness before execution.


No Title - mike bayer (2007-10-31 (Wed) 17:36:28)

Well I work in Python, we don't really have "compile time" errors except language syntax errors. Errors in expression construction/DSL errors etc. have to be checked via unit tests anyway (and aren't we writing unit tests regardless of compiled or not ?), so having your expression language detect errors based on their constructed form vs. sending them off to a local sqlite in-memory database to do exactly the same thing is the just the difference between reinventing the wheel and not.

mfp 2007-10-31 (Wed) 18:26:28

I know how things work with a dynamically typed language (in case you haven't noticed, you'll find lots of Ruby on this site :)... This is a qualitatively different wheel, one that is guaranteed not to break in some ways once it starts turning. See my above reply.

My code also allows me to abstract and compose queries in a way most ORMs cannot. It is possible to follow the same strategy in a dynamically typed language, with similar benefits relative to traditional ORMs; look at Roe, SchemeQL or this. What you don't get in that case is proofs at compile time.


How about joins? - Justin Bailey (2007-10-31 (Wed) 17:44:38)

Can your ORM generate joins with group by/having clauses? Is it able to produce nested "case" statements in the columns clause, where the cases depend on outer columns in the enclosing query (Useful when rolling up up detail data into a summary). How is the database schema discovered?

Looks really cool!

mfp 2007-10-31 (Wed) 18:39:57

I think my code can in principle express any kind of joins. I haven't implemented group by yet; I have to think about how to enforce the usage of aggregate functions for columns not in the "group by" expression using the type system. I have an idea but I haven't tried it yet.

The schema isn't discovered dynamically. It is instead described once in the code and the DB is generated from that description. I have some plans to modify existing databases to fit the new schema automatically (think Rails migrations) and to verify that the schema of an existent DB corresponds to the definition (this would be done once when the program starts, so you still get "fail early" semantics).


Another "related work" link for you - Robert Brewer (2007-10-31 (Wed) 19:11:06)

Dejavu and Geniusql are python ORM's that implement named, typed, relational algebra. All expressions are entered in Python syntax (lambdas); those AST's are then used to construct the equivalent SQL, with each subexpression being fully type-aware.

I'd also assume GLORP does something similar in Smalltalk.

mfp 2007-11-01 (Thr) 16:42:43

Thanks, I'm adding them to the above list.


How does Jones & Wadler's New Work Fit In - Tom Davis (2007-11-01 (Thr) 07:39:05)

Great post!

Curious how you see these concepts relating to Jones and Wadler's new work on Comprehensive Comprehensions in Haskell. They're borrowing work on group by and order by from LINQ and backfitting it to Haskell.

See http://research.microsoft.com/~simonpj/papers/list-comp/index.htm

mfp 2007-11-01 (Thr) 16:38:26

Only gave it a superficial read, but it's interesting! It differs in spirit in that they're bringing concepts taken from SQL to the language (generalizing them in the process), and I'm bringing functional ideas to SQL, in a way (typing, higher-order functions). The extension to arbitrary container types, LINQ-style, they hint at seems especially attractive. I'll read carefully their translations of comprehensions; their generalization could be applicable to what I'm doing.


Some other interesting pointers - Dan S. (2007-11-02 (Fri) 10:52:14)

If you start thinking about optimization, there's a great thesis out there: Comprehending Queries... covers a nice tight low-level representation for queries across a number of different types of containers, and then derives a whole framework of optimizations from that.

http://www.google.com/url?sa=t&ct=res&cd=1&url=http%3A%2F%2Fwww.ub.uni-konstanz.de%2Fkops%2Fvolltexte%2F1999%2F312%2F312_1.pdf&ei=Q1UrR_fRHJPCiAGA8vD3Dw&usg=AFQjCNEXQpSHjoNxl6KfPXOuw8TVOlokUQ&sig2=IajOSQnX6yWj_GQROM6wfA

Also, the Kleisli Query System is worth a google.

mfp 2007-11-07 (Wed) 15:14:02

thank you, I'll take a look


comparision - Guest (2007-11-07 (Wed) 05:45:06)

How does it compare to http://merjis.com/developers/pgocaml

mfp 2007-11-07 (Wed) 15:05:22

The main advantage of PGOCaml is that it allows you to do arbitrarily complex queries; you can use everything PostgreSQL allows. My typed relational algebra is more restrictive. In PGOcaml, queries are type-checked in a very different way; unlike my code, it doesn't employ OCaml's type system. Instead, it checks against the database at compile time. There are a few downsides to this:

  • you need access to the DB when compiling, and things will break if the schema is different when you deploy
  • it is impossible to compose queries as done above. Even something easy like changing the ORDER BY of a relation based on some condition is impossible (so you have to write N nearly identical queries where only ORDER BY differs).
  • PGOcaml only runs on PostgreSQL (it uses the non-standard DESCRIBE command to obtain the types of parameters and return values); the above can in principle work on any DB

PGOcaml is mature and available, my code is work in progress and secret ;)


comparision - Guest (2007-11-08 (Thr) 06:48:49)

Hi, - How do you catch type errors at compiler time if you don't have access to the db? Also how do you deal with varchar(20) vs varchar(100)? Are they different types? - Composable queries are a big plus for your approarch - It's not hard to adapt PGOcaml to other databases with similar capabilities - Maybe not related: How do you deal with transactions? Thanks

mfp 2007-11-08 (Thr) 16:32:49

There's a construct to specify the DB schema (which looks syntactically similar to a SQL schema) that is expanded by the syntax extension into the SQL schema, a relation with the appropriate types and a few support functions and classes. The idea is to either create the DB or validate it against the specified schema once when the program starts. This is actually safer than only checking at compile time against the DB because the code will normally be deployed somewhere else and you certainly don't want to develop against the production DB.

As for varchar(20) vs varchar(100), by default they are both of type string, but it is possible to specify custom types in the above-mentioned construct. I added support for custom types mainly for foreign keys (so you cannot pass any int by accident), but they could be used in that case too. I don't know if it would be much better though, unless you have operations that are guaranteed not to break the length invariant (if you have to check at runtime with some varchar20_of_string, you can as well use plain strings and let the DB signal the problem).

I guess I read too much into the statement from the author of PGOcaml:

It doesn't work with other databases, nor will it ever work with other databases.

I read it as both "it's probably impossible to make it work..." and "I don't have interest in trying". You're very right, in principle it could work on any DB with similar capabilities. I don't know if any other does, though. In MySQL, DESCRIBE is limited to tables; maybe it's got some other command.

I haven't given much thought to transactions yet. I didn't anticipate any problems and thought good old HOFs would handle them easily, but I have to ponder about it. Maybe it'd be better off with something fancy like monads.



release? - Tsuyoshi (2007-12-14 (Fri) 23:04:39)

Do you have any plans of making this public? In the past couple weeks I've been working on an extension to the OCaml compiler which has a few differences but the basic idea is the same - operating on tables rather than queries or rows. If yours is already working then I'd like to try it out (and maybe drop my own project).

mfp 2007-12-18 (Tue) 12:52:38

I've gone a bit further than was it shown above. I have type- and "schema-safe" insertion, update , removal and materialization (either tuple arrays or more classical streams of row objects) working, but I haven't used this extensively yet. The code is a bit embarrassing; in particular, the encoding of the relational constraints is quite heavy: things like "b is a subtype of b1 and c is a subtype of c1 and a is a subtype of b1 and c1" require a fair amount of code to be generated by camlp4. I plan to release this but I must overcome the shame first.

Are you modifying the compiler directly (forking it the way MetaOCaml or JoCaml do) or also using camlp4? If you're doing the former, your work will surely render my code irrelevant (I do my best to define a reasonable syntax & encode as many invariants as possible with the type system, but some things are bulky when you lack ad-hoc polymorphism; the compiler can do so much more).

Tsuyoshi 2007-12-23 (Sun) 00:56:19

I'm modifying the compiler directly. It's actually mostly not too difficult, but I was trying to get it to compile regular OCaml code to SQL so that I could basically use regular OCaml functions in SQL where clauses, and I kind of got bogged down in that. It was sort of like adding an SQL "processor" to the compiler (which gets pretty involved since SQL doesn't have first-class functions or even local variables) and I started wondering if it was really worth it. It would be nice in some ways to eliminate the wall between code that runs on the database server and code that runs on the local machine, but it's too much work for me at this point.

So I've gone in a different direction and now I'm trying to decide on an appropriate miniature language that can be used for things like column defaults and where clauses. I think a primitive kind of OCaml is OK - I just need to restrict it to what is easily convertable to SQL.

BTW aren't tuple arrays or streams overkill? I would think all you need is a Relational.iter function.

mfp 2007-12-23 (Sun) 15:52:09

I see. So the problem is that targeting SQL is difficult when you have more complex OCaml expressions, but that is precisely the situation where moving away from SQL is very advantageous, right? Even if you restrict the accepted language, working with the compiler as opposed to camlp4 will retain the other advantages (better typing, syntax, etc.), though.

As for Relational.iter, I'm not sure I can easily write a polymorphic one given the way I've encoded relations in the type system, but a relation-specific iter HOF generated by camlp4 sounds like a lighter alternative to streams.


Last modified:2007/10/31 09:58:42
Keyword(s):[database] [ORM] [relation] [ocaml] [blog] [frontpage]
References:[Typed relational algebra: schemas, CRUD, source code]