Friday, November 24, 2006

On relations (part 2) - IDs

The first part of this article identified a need to have system-wide object identifier in semantic form like (<classid>.<objectid>).

There are many different ways to implement it in practice. Obviously, the overall length of the ID is important for performance and to preserve memory.
However, the real data type ID mapping in the database is of less importance. Performance tests done by author have shown that VARCHAR, CHAR, BINARY or NUMERIC types of the comparable length perform similarly well in the typical server environment. The tests included bulk object creation, fetching by ID and joining multiple tables by ID with different execution plans and for different scenarios. This often comes as a surprise to the people, who believe that auto-incremented IDs will create measurable performance gains. Detailed review of this claim is not in the scope of this article and will probably be a separate post.

Now, the length of the ID is trickier. The "object" part of the ID has following logical constrain:
- ID never repeats within the same class in the whole, often distributed, object domain

You can either achieve that by having a single synchronized centralized ID authority with active ID sequences, or by generating world-unique IDs similar to Microsoft's GUID. It is up to you to take either way, however you need to think about this: object domain is not necessarily a single database. And in distributed environment, part of it that absolutely has to be functioning at certain time might not have connectivity to the ID authority. On the other hand, world unique IDs usually take much more space (at the beginning, at least) and are never guaranteed to be unique, even though the probability of collision is ridiculously small (probably, Whales and Petunias will fall from the sky much sooner).

To summarize, GUIDs are easy to generate and fit well into distributed architecture. What is not good about GUIDs is that their content is fairly useless. Ordering by GUIDs doesn't give any clue on the historical order in which objects were created. Neither it contains any semantic information, for example class. In complex environment, retrieval of the class of object is one of the most often performed operations. Many systems use discriminator field to classify data. Doing SQL SELECT every time one has to know the class is not efficient, considering that class of object never ever changes during its lifetime. What makes sense is to embed class information into ID. All of the above calls for an ID string in following form:

[classID].[creation_time].[random_component]

classID Naturally, there are less classes then objects in life system. Also, their creation is relatively rare. To minimize ID size it might be acceptable to have classID as a short integer or string registered in some central authority. It can be even static pre-defined value.

Creation time is useful not only to order entities, but to minimize the size of random component. There are only so many objects that can can be created every 50ms (and on modern systems timer resolution is actually 100 nanoseconds). You would need to do some custom math to figure out the comfortable length of random component, but something between 32 to 64 bit will satisfy even very large and very distributed installations. The key here is to determine probability of collision within the timer resolution on a given maximum number of ID generating nodes, possible time shift accounted for. To read more on GUID creation, see IETF draft.

The author successfuly used 4-char class code for the first part, number of 100 ns intervals since Jan 1 2000 for the second part (44 bit is enough for over 100 years), and 32 bit XOR of guid parts for the random suffix. VARCHAR is better for debugging while BINARY IDs will be smaller - the choice is yours.

Of course, you can devise your own ID format taking in consideration the facts above.

Monday, October 30, 2006

On relations (part 1)

Introduction
Codd's relational algebra was a beginning of a significant break-through in computer science. The genious part was to restrict operands to finite relations only and define the term relational completeness. A loose example would be a definition that 'NOT something' means not all the infinite possibilities in the universe, but a finite set of tuples from current relation that have no 'something'; e.g. all rows from the table that are 'not something' count.

Back then, in 1970, it was clearly thinking ahead of time, with relational database systems golden time coming some 20 years later. However, the beginning of the 21st century was marked by extensive development and use of object-relational mappings and numerous tries to provide relational algebra for object-oriented domains. Essentially, people are looking to keep integrity and flexibility of relational query languages, while allowing advanced OO features like polymorphism, encapsulation and inheritance. And there is, of course, object-relational impendance mismatch.

Hence, a number of OQLs emerged. OQL is some kind Object Query Language, with queries executed against class domains, not relations (tables). It allows to query and get access to all objects of certain class with provided (declared) properties. OQLs are declarative languages for imperative world. However breathtaking the idea is, implementation and usage hits many obstacles, some of which are not solved up to this time. Following provides an overview of problems and solutions for relational algebra usage in object-oriented domains.

Object Identification
Relational algebra is based on a notion of a candidate key, which is almost the same as primary key in the database. Essentially, any candidate key has to be unique in its domain and no part of this key can be unique by itself. The second statement is only relevant for compound keys. It is not a candidate key, if one of its parts is already unique. So, there is no need to include more than necessary values into the key.

Traditionally in OO languages an object is identified by its address in memory, essentially prohibiting its use in distributed, storage-agnistic environment. This is a huge limitation that renders relational algebra irrelevant to OO. The only solution is to introduce proper environment agnostic candidate keys for objects in a class. This is usually done by adding ID property to the most basic interface (or a pair of setId()/getId() methods). But what is a good ID/candidate/PK?

Having a global oject enumeration system allows introduction of a single base class

Many people start with reusing data fields, such as first and last name of a person, to compound a primary key. It is wrong from two points of view: it is not reliable (people get married, change names, even SSN number is changed sometimes), and it creates immense complexity to handle it. Essentialy a system has to provide as many linking mechanisms between objects, as there are candidate key combinations. So, good solution is to have data agnostic, uniform object identification. It also fits nicely to the OO concept of address in memory, that never depends on the object's members. Many systems fall back to database-provided sequetial ID's. First decision is, whether to create a sequence per class, or have one global sequence for all classes in the domain. The second is slower, and there is a danger to run out of IDs faster, but it provides additional data safety and integrity, because there is much smaller chance of mistake while resolving a reference. But there is more to it. Having a global oject enumeration system allows introduction of a single base class, similar to Object in Java, object in C# or TObject in Object Pascal. Without it, you would not have a chance to introduce a foreign key to (or any relationship with) an unknown object in the system (like void* in C++, or Object reference in Java). So, most enterprise data repositories have some concept of unified ID sequence. But only few of them go further.

What is the one most important method in a single base class of any OO language? Not the toString(), obviously, and not the equals(), which is a problem in itself, when not overriden. It is the getClass() ot getType(). The central method that allows to learn stuff about current object at hand. If you analyze, how often you use instanceof and type conversions in your code, it will become clear to you, that ability to resolve the class of the object fast will certainly have impact on overall system performance.

One solution is to have R(object,class) relationship table. This table has a tendancy to grow huge, which is usually not a problem with modern RDBMs. What becomes a problem is the frequency of access and associated locks. Since your system needs to consult this table rather often, it quckly becomes a bottleneck. Smart way of solving it is to include class information into ID. "But it will be a compound key that is bad!", says attentive reader. Not necessarily compound. What we need is a domain-unique key that contains class reference. We would also like it to be short to preserve memory space. Following solutions have been tested by the author:


  • a string (varchar-based) ID in form (<classid>.<objectid>) Note that you can fall back to an ID sequence per class, since the ID as a whole will still be unique accross domain.

  • M-bit integer with first N fixed bits dedicated to class and the rest to object ID, obviously M>N and N<(M-N).


Next part will provide comparison and characteristics of both ID methods.