Monday, October 30, 2006

On relations (part 1)

Introduction
Codd's relational algebra was a beginning of a significant break-through in computer science. The genious part was to restrict operands to finite relations only and define the term relational completeness. A loose example would be a definition that 'NOT something' means not all the infinite possibilities in the universe, but a finite set of tuples from current relation that have no 'something'; e.g. all rows from the table that are 'not something' count.

Back then, in 1970, it was clearly thinking ahead of time, with relational database systems golden time coming some 20 years later. However, the beginning of the 21st century was marked by extensive development and use of object-relational mappings and numerous tries to provide relational algebra for object-oriented domains. Essentially, people are looking to keep integrity and flexibility of relational query languages, while allowing advanced OO features like polymorphism, encapsulation and inheritance. And there is, of course, object-relational impendance mismatch.

Hence, a number of OQLs emerged. OQL is some kind Object Query Language, with queries executed against class domains, not relations (tables). It allows to query and get access to all objects of certain class with provided (declared) properties. OQLs are declarative languages for imperative world. However breathtaking the idea is, implementation and usage hits many obstacles, some of which are not solved up to this time. Following provides an overview of problems and solutions for relational algebra usage in object-oriented domains.

Object Identification
Relational algebra is based on a notion of a candidate key, which is almost the same as primary key in the database. Essentially, any candidate key has to be unique in its domain and no part of this key can be unique by itself. The second statement is only relevant for compound keys. It is not a candidate key, if one of its parts is already unique. So, there is no need to include more than necessary values into the key.

Traditionally in OO languages an object is identified by its address in memory, essentially prohibiting its use in distributed, storage-agnistic environment. This is a huge limitation that renders relational algebra irrelevant to OO. The only solution is to introduce proper environment agnostic candidate keys for objects in a class. This is usually done by adding ID property to the most basic interface (or a pair of setId()/getId() methods). But what is a good ID/candidate/PK?

Having a global oject enumeration system allows introduction of a single base class

Many people start with reusing data fields, such as first and last name of a person, to compound a primary key. It is wrong from two points of view: it is not reliable (people get married, change names, even SSN number is changed sometimes), and it creates immense complexity to handle it. Essentialy a system has to provide as many linking mechanisms between objects, as there are candidate key combinations. So, good solution is to have data agnostic, uniform object identification. It also fits nicely to the OO concept of address in memory, that never depends on the object's members. Many systems fall back to database-provided sequetial ID's. First decision is, whether to create a sequence per class, or have one global sequence for all classes in the domain. The second is slower, and there is a danger to run out of IDs faster, but it provides additional data safety and integrity, because there is much smaller chance of mistake while resolving a reference. But there is more to it. Having a global oject enumeration system allows introduction of a single base class, similar to Object in Java, object in C# or TObject in Object Pascal. Without it, you would not have a chance to introduce a foreign key to (or any relationship with) an unknown object in the system (like void* in C++, or Object reference in Java). So, most enterprise data repositories have some concept of unified ID sequence. But only few of them go further.

What is the one most important method in a single base class of any OO language? Not the toString(), obviously, and not the equals(), which is a problem in itself, when not overriden. It is the getClass() ot getType(). The central method that allows to learn stuff about current object at hand. If you analyze, how often you use instanceof and type conversions in your code, it will become clear to you, that ability to resolve the class of the object fast will certainly have impact on overall system performance.

One solution is to have R(object,class) relationship table. This table has a tendancy to grow huge, which is usually not a problem with modern RDBMs. What becomes a problem is the frequency of access and associated locks. Since your system needs to consult this table rather often, it quckly becomes a bottleneck. Smart way of solving it is to include class information into ID. "But it will be a compound key that is bad!", says attentive reader. Not necessarily compound. What we need is a domain-unique key that contains class reference. We would also like it to be short to preserve memory space. Following solutions have been tested by the author:


  • a string (varchar-based) ID in form (<classid>.<objectid>) Note that you can fall back to an ID sequence per class, since the ID as a whole will still be unique accross domain.

  • M-bit integer with first N fixed bits dedicated to class and the rest to object ID, obviously M>N and N<(M-N).


Next part will provide comparison and characteristics of both ID methods.