Friday, November 24, 2006

On relations (part 2) - IDs

The first part of this article identified a need to have system-wide object identifier in semantic form like (<classid>.<objectid>).

There are many different ways to implement it in practice. Obviously, the overall length of the ID is important for performance and to preserve memory.
However, the real data type ID mapping in the database is of less importance. Performance tests done by author have shown that VARCHAR, CHAR, BINARY or NUMERIC types of the comparable length perform similarly well in the typical server environment. The tests included bulk object creation, fetching by ID and joining multiple tables by ID with different execution plans and for different scenarios. This often comes as a surprise to the people, who believe that auto-incremented IDs will create measurable performance gains. Detailed review of this claim is not in the scope of this article and will probably be a separate post.

Now, the length of the ID is trickier. The "object" part of the ID has following logical constrain:
- ID never repeats within the same class in the whole, often distributed, object domain

You can either achieve that by having a single synchronized centralized ID authority with active ID sequences, or by generating world-unique IDs similar to Microsoft's GUID. It is up to you to take either way, however you need to think about this: object domain is not necessarily a single database. And in distributed environment, part of it that absolutely has to be functioning at certain time might not have connectivity to the ID authority. On the other hand, world unique IDs usually take much more space (at the beginning, at least) and are never guaranteed to be unique, even though the probability of collision is ridiculously small (probably, Whales and Petunias will fall from the sky much sooner).

To summarize, GUIDs are easy to generate and fit well into distributed architecture. What is not good about GUIDs is that their content is fairly useless. Ordering by GUIDs doesn't give any clue on the historical order in which objects were created. Neither it contains any semantic information, for example class. In complex environment, retrieval of the class of object is one of the most often performed operations. Many systems use discriminator field to classify data. Doing SQL SELECT every time one has to know the class is not efficient, considering that class of object never ever changes during its lifetime. What makes sense is to embed class information into ID. All of the above calls for an ID string in following form:

[classID].[creation_time].[random_component]

classID Naturally, there are less classes then objects in life system. Also, their creation is relatively rare. To minimize ID size it might be acceptable to have classID as a short integer or string registered in some central authority. It can be even static pre-defined value.

Creation time is useful not only to order entities, but to minimize the size of random component. There are only so many objects that can can be created every 50ms (and on modern systems timer resolution is actually 100 nanoseconds). You would need to do some custom math to figure out the comfortable length of random component, but something between 32 to 64 bit will satisfy even very large and very distributed installations. The key here is to determine probability of collision within the timer resolution on a given maximum number of ID generating nodes, possible time shift accounted for. To read more on GUID creation, see IETF draft.

The author successfuly used 4-char class code for the first part, number of 100 ns intervals since Jan 1 2000 for the second part (44 bit is enough for over 100 years), and 32 bit XOR of guid parts for the random suffix. VARCHAR is better for debugging while BINARY IDs will be smaller - the choice is yours.

Of course, you can devise your own ID format taking in consideration the facts above.