Typed relational algebra: schemas, CRUD, source code
The typed relational algebra I introduced some time ago is more mature and I can now give more examples (this is not all I'm showing today: you can find a link to the source code below) that illustrate how this algebra addresses the problems I identified in my previous post:
- data model duplication (DB schema and object models)
- overall brittleness as the DB schema is modified: tracking the schema changes in the application becomes increasingly hard as the codebase grows
- the 1+N query issue
Declaring relations
As strange as it may seem, the following code is valid OCaml... with a twist: I have extended the grammar using the camlp4 tool, included in the Objective Caml system. camlp4 adds tremendous power to OCaml by allowing you to adapt the grammar in a way a bit reminiscent of Lisp's macros.
This snippet defines the relations consumed by the relational operators I will show later, generates code to create and verify the SQL schema (so the application can verify that its assumptions about the DB are correct as it starts), as well as a number of utility functions.
TABLE user users
COLUMN id SERIAL AUTO PRIMARY KEY
COLUMN name VARCHAR(64) UNIQUE
COLUMN age INT NULLABLE INDEXED
COLUMN password VARCHAR(64)
END
TABLE comment comments
COLUMN id SERIAL AUTO PRIMARY KEY
COLUMN title TEXT
COLUMN text TEXT
COLUMN created_at TIMESTAMPZ
COLUMN author SERIAL FOREIGN(users, id)
END
(As you can see, I've made no attempt to automate the pluralization of the row names...)
Among the values generated by the above code, four are especially important: those that represent the relations (named "users" and "comments") and those that stand for the tables ("users_schema" and "comments_schema"). Why establish this distinction? To put it simply, because we will be manipulating relations that don't correspond directly to a table (e.g. a subset of the rows from a table). So relations are consumed (and generated) by the relational operators, and tables are used when we update or delete rows (insertion is performed a bit differently, more on that later).
The important thing to keep in mind is that both relations and tables have information about the columns (and also about the primary keys, for the latter) encoded in their types, in such a way that we can check statically that all queries are sound. The good news is that, in this case, "we" means "the compiler", and "check statically" just means "compile", so this doesn't require any additional work.
Here's the type of the "users" relation. You need not pay much attention to this, what matters is that it includes all we need to type-check the queries:
val users : ([ `User_age of int option | `User_id of int | `User_name of string | `User_password of string ], [ `User_age | `User_id | `User_name | `User_password]) Relational.Relations.relation
Selection
Here's how you select rows from a relation (incidentally, I'm typing this in the "ocaml" REPL; the output has been abbreviated for clarity):
# let u_18_to_40 = SELECT [User_age < (Some 40) AND User_age > (Some 18)] users;;
val u_18_to_40 :
([ `User_age of int option
| `User_id of int
| `User_name of string
| `User_password of string ],
[ `User_age | `User_id | `User_name | `User_password ])
Relational.Relations.relation =
[...]
In fact, we can abstract that, and create a function that selects the desired rows from any relation with the appropriate columns:
# let u_18_to_40 x = SELECT [User_age < (Some 40) AND User_age > (Some 18)] x;;
val u_18_to_40 :
([> `User_age of int option ] as 'a, [> `User_age ] as 'b)
Relational.Relations.relation -> ('a, 'b) Relational.Relations.relation =
<fun>
This is a function and can thus be composed easily:
# let targets x = SELECT [User_name LIKE "%paul%"] (u_18_to_40 x);;
val targets :
([> `User_age of int option | `User_name of string ] as 'a,
[> `User_age | `User_name ] as 'b)
Relational.Relations.relation -> ('a, 'b) Relational.Relations.relation =
<fun>
Here starts the new stuff I hadn't shown before:
# print_endline (Relations.to_sql (targets users));;
SELECT "id", "name", "age", "password" FROM users
WHERE (("age" < 40) AND ("age" > 18)) AND ("name" LIKE '%paul%')
You don't want to manipulate the SQL directly, though (in fact, I might hide the to_sql function at some point), because what you really need is a way to access the values from the relation.
Materialization
There are two ways to materialize a relation: you can either iterate over row objects with all the columns from a given table, or get an array with only the columns you want. The former is done with a function generated from the above table declarations ("user_iter", "comment_iter"), and the latter with the MATERIALIZE operator.
The materialization operator is used like this:
let rows = MATERIALIZE [User_id, User_name] db users in
Array.iter
(fun u -> printf "User id %d, name %s\n" u#id u#name)
rows
As with other operators, the type system will ensure that we're behaving properly.
The iterator functions take a DB handle, a relation, and a function that requires an object representing a row. It is equivalent to iterating with Array.iter over a MATERIALIZE expression with all the columns from the corresponding table.
Row creation and update
The code generated from a table declaration includes a couple helper classes used to represent existent rows and new rows. These classes differ in that the latter doesn't need to be given the primary key if it is a SERIAL, and their #save methods perform insertion and update, respectively.
The constructors uses labelled parameters of the right types, so we have
# new new_user;; - : name:string -> ?age:int -> password:string -> unit -> new_user = <fun>
The #save method of user/new_user objects takes a function that will be given the appropriate SQL code:
# let u = new new_user ~name:"foo" ~age:10 ~password:"badpass" ();;
val u : new_user = <obj>
# u#save print_endline;;
INSERT INTO "users"("name","age","password") VALUES('foo',10,'badpass')
- : unit = ()
Update
There's an UPDATE operator that generates a function which will return the SQL code to update a row. It is parameterized over a list of columns and a table. The type system verifies that the first column is actually the primary key of the table:
# UPDATE [User_id, User_name] users_schema;; - : id:int -> name:string -> string # (UPDATE [User_id, User_name] users_schema) 10 "foo";; - : string = "UPDATE \"users\" SET \"name\" = 'foo' WHERE \"id\" = 10" # print_endline ((UPDATE [User_id, User_name] users_schema) 10 "foo");; UPDATE "users" SET "name" = 'foo' WHERE "id" = 10
The SQL code is straightforward; what matters in this case is the type-checking, which will ensure that the UPDATE remains valid.
Tighter checking
users.id and comments.id are encoded in OCaml as simple ints, so it is possible to get runtime errors when updating a row with a non-existent primary key, for instance, since any int can serve as an id.
In order to prevent such errors, it is possible to specify a custom type that will be used in the OCaml side, by declaring a column as SERIAL(<sometype>) (where <sometype> is a valid lowercase identifier). The generated code will use the
encode_sometype : sometype -> string
and
decode_sometype : string -> sometype
functions that must be provided by the user.
Status
This is still work in progress, but it's usable (and I'm starting to use it for real). It only works with PostgreSQL at this point, but it can be easily adapted to other DBs. As for the relational algebra, the main feature missing is aggregate functions. I believe they can also be typed, but it might involve substantial reworking in the encoding of relations. Very complex SQL expressions will probably be hard to type, so I'm planning to provide a means to write type-checked "annotations" that can be placed right before the raw SQL statements (giving up on composability but retaining to some extent static safety).
You can get the code with git:
git clone http://eigenclass.org/repos/git/relational/.git/
You can also browse the repository here (please be kind to the server; I will have to disable this if it results in excessive load).
More Processor Usage - RyanTM (2008-02-05 (Tue) 11:30:57)
It seems like this method of doing things will sacrifice processor time in order to get more efficient SQL statements. I think this is a big plus because a lot of applications are limited by database access in this day and age.
mfp 2008-02-05 (Tue) 17:35:37
There's some overhead in the decoding of the values, but it's fairly small (maybe a few dozen instructions for typical expressions with ocamlopt). Currently, only trivial simplifications are performed on the SQL; writing a plan optimizer is hard and only the RDBMS has got enough information to find the best plan anyway. The relational algebra allows, however, to avoid pitfalls like the infamous 1+N query problem, which would kill performance dead. Essentially, it makes it possible to materialize (= retrieve) just the right amount of data from the DB. I'll elaborate on this soon.
lexapro 2008-04-09 (Wed) 22:06:03
Article Opinion <a
xenical 2008-04-09 (Wed) 22:07:01
Article Opinion <a
ambien 2008-04-09 (Wed) 22:07:08
Article Opinion <a
ephedrine 2008-04-09 (Wed) 22:07:31
Article Opinion <a
viagra 2008-04-09 (Wed) 22:07:59
Article Opinion <a
xenical 2008-06-17 (Tue) 12:39:29
Article Opinion <a
viagra 2008-06-17 (Tue) 12:39:39
Article Opinion <a
ambien 2008-06-17 (Tue) 12:39:46
Article Opinion <a
ephedrine 2008-06-17 (Tue) 12:39:48
Article Opinion <a
lexapro 2008-06-17 (Tue) 12:40:30
Article Opinion <a
ActiveRelation - Nick Kallen (2008-02-05 (Tue) 20:17:04)
I'm also working on a Relational Algebra (for Ruby), called ActiveRelation. For the moment, it leverages the ActiveRecord connection adapters so it is database neutral. It's not yet ready for real use, but here are some interesting features:
foo = Table.new(:foo)
bar = Table.new(:bar)
foo.join(bar).on(foo[:id].equals(bar[:id])).to_sql.should be_like("""
SELECT `foo`.`name`, `foo`.`id`, `bar`.`name`, `bar`.`foo_id`, `bar`.`id`
FROM `foo`
INNER JOIN `bar` ON `foo`.`id` = `bar`.`id`
""")
It's compositional as you would expect of a relational algebra. I have support for aggregate relations in a neat way: supporting closure under join even in the presence of aggregations!:
users = Table.new(:users)
photos = Table.new(:photos)
photo_count = photos \
.aggregate(photos[:user_id], photos[:id].count) \
.group(photos[:user_id]) \
.rename(photos[:id].count, :cnt) \
.as(:photo_count)
users.join(photo_count).on(photo_count[:user_id].equals(users[:id])).to_sql.should be_like("""
SELECT `users`.`name`, `users`.`id`, `photo_count`.`user_id`, `photo_count`.`cnt`
FROM `users`
INNER JOIN (SELECT `photos`.`user_id`, COUNT(`photos`.`id`) AS `cnt` FROM `photos` GROUP BY `photos`.`user_id`) AS `photo_count`
ON `photo_count`.`user_id` = `users`.`id`
""")
Note the derived table (subselect); you need derived tables when joining with an aggregation. Note also how attributes have different names inside and outside of the subselect (`photos` vs. `photo_count`).
You can also correlate columns, as in sql_alchemy:
photo_count[photos[:id]] == photo_count[:id].
This is useful for various things.
The way I'm approaching the problem, it's not clear that my Relational Algebra solves the N+1 problem out-of-the-box. I can, however, once expressing the relational operations that join a user to a photo (User.has_many :photos), use that to get the photos of one user or many users, since the same "joining" operation can be used on users.select(user[:id].equals(10)) as users. In other words, "eager includes" become trivial. But the best approach I've seen to n+1 in the context of ORM is the approach datamapper takes. It notices when you call user.photos; and if that user is from a users collection, it produces the query to find photos for the collection; it then caches them.
awesome stuff - Toby Ho (2008-02-06 (Wed) 00:17:03)
I, for one are very excited about the approach that you are taking. Thanks for opening up the source code so I can have a peek.
Don't take away to_sql! - Justin (2008-02-07 (Thr) 17:59:17)
I have been working with a Haskell library, haskelldb, which has many of these same ideas. I am not using it to actually access the database, though. Instead, I'm using it to generate SQL that is then used in a PHP application. Haskell acts as a specification language, and won't allow me to write stupid queries (e.g. with non-existent columns or bad value comparisons). I suggest leaving in to_sql as someone may want to use your library the same way.
Justin
Very good! - Ryan (2008-05-04 (Sun) 07:51:42)
Very interesting work -- you might want to check out my library, JRel http://thimbleware.com/projects/jrel, which does something similar to your Ruby project, but for Java.
- 252 http://reddit.com/r/programming
- 36 http://anarchaia.org
- 24 http://reddit.com/r/ocaml
- 20 http://www.dzone.com/links/ocaml_typed_relational_algebra_schemas_crud_sourc.html
- 17 http://reddit.com/r/programming/info/6biqw/comments
- 14 http://reddit.com/r/programming/new
- 12 http://www.anarchaia.org
- 12 http://news.ycombinator.com/item?id=112129
- 11 http://www.artima.com/forums/flat.jsp?forum=123&thread=224263
- 10 http://planetruby.0x42.net
Keyword(s):[database] [ORM] [relation] [relational] [algebra] [ocaml] [blog] [frontpage]
References: