What My Backpack Can Teach Us About Location-Independent Identifiers
I'm going to introduce the concept of location independence with an example from the real world: my backpack. When I was (much) younger and going off to school, I bought a backpack to carry my books. I carried it everywhere, and once took it to a museum in Dallas where I wasn't permitted to take it inside. Instead, someone at the museum took it from me, stored it in a numbered bin (let's say bin 427) and gave me a tag with the bin number so I could retrieve it later.
On a subsequent trip I took the backpack to a London museum where it was stored in a bin with a different number. Let's say it was stored in bin 218 during that visit. The bin numbers at each museum were used for item retrieval, but they really had nothing to do with properties of my backpack that identify it as its own thing. My backpack existed and could be uniquely identified via other attributes before I ever stored it in that first bin in Dallas. If asked to describe my backpack and distinguish it from all others, I would never think to say it was once stored in bin 427 in a Dallas museum. The bin number served a purpose in the context of the Dallas museum, but storing my bag in that bin is not what gave it an identity. After the museum visit, I certainly did not start referring to my backpack as 427.
Backpacks are synonymous with mobility. It makes no sense to give one a permanent identifier based on the first location where it happened to be stored, especially when the identifier will be shared with others who have no knowledge of that first location. I wouldn't consider telling a person in the London museum about the bin number in Dallas. It could only introduce confusion, especially since the backpack will be stored in a different bin in the London museum.
Consider what would happen if I lost the tag with my bin number and needed to identify the backpack some other way. If my backpack were marked with its own identifier, like a luggage tag, museum employees could identify it using that. Yes, a bin number is the fastest way to retrieve an item at a specific museum, but a luggage tag provides a way to uniquely identify something regardless of where it happens to be stored at a particular time.
The problem with location-based IDs
We are in the data business. Why did I tell you a silly story about a backpack when that has nothing to do with data? Backpacks are designed to be transported, but they are physical items and can only be in one place at a time. Data, on the other hand, can exist in multiple places at the same time, yet our traditional engineering practices treat it as if it will always remain in the place where it was originally stored. We frequently assign identities to data based on where it is stored in a database table. We would never think of permanently referring to a backpack by the number of the bin where we first stored it, yet we have historically taken exactly this approach with our data when storing it in a relational database.
Relational databases can generate an auto-incrementing ID for each row inserted into a table. Going back to our analogy, you can think of the auto-generated ID as the equivalent of a numbered museum bin. When we store data in a row, we are effectively given a RowID that provides an easy way to retrieve the data from that table in the future. The ID should be thought of as a convenient way to access the data in that storage mechanism's context, but it should NOT be considered the primary identifier of the data itself.
If the idea of using a location-based ID for a backpack seems ludicrous, but the idea of using one for data seems fine, it's because we've used table-generated IDs for so long that it just seems "normal". We've conflated the concepts of a storage location and a unique ID because the data has primarily existed in a single system and a single storage mechanism: a database table. In an isolated, monolithic system, the location-based ID was sufficient, but we need to start using location-independent strategies to identify data points when breaking down monoliths into different service components. As we begin passing data between completely different systems and storing data in new storage layers, such as S3 and ElastiCache, we need a storage-independent and location-independent way of identifying data points.
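To make the bin-number problem concrete, here is a minimal sketch using Python's built-in sqlite3 module. Everything in it is an assumption made for illustration: the two in-memory databases named dallas and london stand in for two independent systems, and the trades table, its columns and the sample trade are hypothetical. The row_id column plays the role of the bin number and the trade_key column plays the role of the luggage tag.

```python
import sqlite3
import uuid


def make_store() -> sqlite3.Connection:
    """Create an independent in-memory database standing in for one system."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE trades ("
        " row_id INTEGER PRIMARY KEY AUTOINCREMENT,"  # the "bin number"
        " trade_key TEXT NOT NULL UNIQUE,"            # the "luggage tag"
        " description TEXT NOT NULL)"
    )
    return conn


def store_trade(conn: sqlite3.Connection, trade_key: str, description: str) -> int:
    """Insert a trade and return the row ID that this particular table assigned."""
    cur = conn.execute(
        "INSERT INTO trades (trade_key, description) VALUES (?, ?)",
        (trade_key, description),
    )
    conn.commit()
    return cur.lastrowid


dallas = make_store()
london = make_store()

# London already holds an unrelated trade, so its counter is ahead of Dallas's.
store_trade(london, str(uuid.uuid4()), "unrelated trade")

# The same piece of data, stored in both systems under the same key.
backpack_key = str(uuid.uuid4())
dallas_row = store_trade(dallas, backpack_key, "100 shares of XYZ")
london_row = store_trade(london, backpack_key, "100 shares of XYZ")

print(dallas_row, london_row)  # the row IDs differ: 1 in Dallas, 2 in London
```

The same trade ends up with a different row ID in each database, while trade_key finds it in both. Row ID 1 in the London database refers to a completely different trade.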
Types of location-independent identifiers
The "Fundamentals of Distributed Systems" course listed at the end of this article describes several types of keys that can be used to avoid location-specific identifiers. I have listed examples of them below.
- Natural keys
Natural keys such as a driver's license number, a license plate number or a social security number may be used when available. However, you should evaluate whether and how you store personally identifiable information to ensure compliance with data protection rules.
- Public keys
Public keys, such as CUSIP identifiers or ISO codes, may be applicable for certain data types.
- Alternate keys (usually client-provided)
Most of the time our data points will not have natural or public keys available. Alternate keys, such as GUIDs, provide a more universal way to introduce location-independent IDs. They work well with CQRS because the client can generate an ID and send a message to a queue or message broker, and the message can be written to the persistence layer later, once everything is online. The key will be known to the client at creation time, will be unique, and will not depend upon the database being available for it to be generated. Alternate keys can be retrofitted into an existing system by adding columns to existing tables and generating keys (usually GUIDs) for the rows that lack them. For new rows, the value can be generated in the insert statements themselves, even if the upstream clients have not yet been updated to supply it. The new keys will then be available to use as identifiers when sharing data across systems or storing it in different persistence mechanisms, such as S3 or ElastiCache.
- Content-based IDs (hashes)
Yet another way to uniquely identify a data point is by hashing its contents. We could identify a trade, for example, by making a hash of its component data points. The hash is location-independent and, for practical purposes, unique to that data point. This may seem overly complicated, but you probably use this approach every day even if you don't know it, because that is exactly how Git works: it identifies every object it stores by a hash of that object's content. If you use Git, you are already using content-based keys in your day-to-day workflow.
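The last two key types are easy to sketch in code. The example below is illustrative only, and the trades table, the trade_key column and the function names are all hypothetical. It uses Python's sqlite3 and uuid modules to retrofit GUID alternate keys onto an existing table, and hashlib to build a content-based ID from a canonical JSON form of a record. The last function reproduces how Git's default SHA-1 object format names a blob: it hashes the header "blob", a space, the content length and a NUL byte, followed by the content.

```python
import hashlib
import json
import sqlite3
import uuid


def add_alternate_keys(conn: sqlite3.Connection) -> int:
    """Retrofit a GUID column onto an existing table; return the rows backfilled."""
    conn.execute("ALTER TABLE trades ADD COLUMN trade_key TEXT")
    rows = conn.execute("SELECT row_id FROM trades WHERE trade_key IS NULL").fetchall()
    for (row_id,) in rows:
        conn.execute(
            "UPDATE trades SET trade_key = ? WHERE row_id = ?",
            (str(uuid.uuid4()), row_id),
        )
    # SQLite cannot add a UNIQUE column directly, so enforce uniqueness with an index.
    conn.execute("CREATE UNIQUE INDEX idx_trades_trade_key ON trades (trade_key)")
    conn.commit()
    return len(rows)


def content_id(record: dict) -> str:
    """Hash a canonical JSON form so that equal content always gives the same ID."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def git_blob_id(data: bytes) -> str:
    """Compute the object ID that Git's SHA-1 object format assigns to a blob."""
    header = b"blob %d\0" % len(data)
    return hashlib.sha1(header + data).hexdigest()


# An existing table that only has location-based row IDs.
legacy = sqlite3.connect(":memory:")
legacy.execute("CREATE TABLE trades (row_id INTEGER PRIMARY KEY AUTOINCREMENT, symbol TEXT)")
legacy.executemany("INSERT INTO trades (symbol) VALUES (?)", [("XYZ",), ("ABC",)])
backfilled = add_alternate_keys(legacy)

# Field order does not matter, because the JSON is canonicalized before hashing.
trade = {"symbol": "XYZ", "quantity": 100, "price": "12.50"}
same_trade = {"price": "12.50", "quantity": 100, "symbol": "XYZ"}
assert content_id(trade) == content_id(same_trade)
```

One design consequence is worth noting. A GUID stays the same when the data it labels is edited, whereas a content-based ID changes whenever the content does. That suits immutable records, which is how Git treats its objects.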
Server-generated IDs don't play well with distributed systems
A downside to relying on database-generated IDs is that the caller cannot know the ID until AFTER the database stores the data. This means that the database and any APIs accessing it must be up and running for the server to insert data and return an ID. This coupling of services and layers reduces overall system availability, because every one of them must be available for the system as a whole to function.
A different approach is to introduce CQRS and not depend on the receiving system to generate an ID for you. In this scenario, the client application, rather than the server, generates the ID when creating new items. It's a storage-independent way of identifying data, and can be thought of as the luggage tag in the backpack example. It's still possible, and likely, that a relational database will auto-generate a RowID when an insert happens, but the client application should be completely unaware of it. Location-specific IDs can still be useful within the system that owns the storage location, but those IDs should not leak out to clients or other systems. As a matter of fact, responses to commands (messages that cause state changes) in CQRS should not return any data at all, so the only way for the client to know the ID is to provide it up front.
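The flow described above might look something like the following sketch. It is not a CQRS framework, just the shape of the idea under some assumptions: an in-process queue.Queue stands in for a real message broker, a plain dict stands in for the persistence layer, and the CreateTrade command and both function names are hypothetical.

```python
import queue
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class CreateTrade:
    """A command: a message asking the system to change state."""
    trade_id: str  # supplied by the client, not by the storage layer
    symbol: str
    quantity: int


def submit_create_trade(commands: queue.Queue, symbol: str, quantity: int) -> str:
    """Client side: mint the ID locally, enqueue the command, keep the ID."""
    trade_id = str(uuid.uuid4())
    commands.put(CreateTrade(trade_id, symbol, quantity))
    return trade_id


def handle_pending(commands: queue.Queue, store: dict) -> None:
    """Server side: drain the queue whenever storage is available.

    Returns nothing, mirroring the convention that commands return no data.
    """
    while True:
        try:
            command = commands.get_nowait()
        except queue.Empty:
            return
        # Any row ID the real storage layer assigns here would stay internal.
        store[command.trade_id] = {"symbol": command.symbol, "quantity": command.quantity}


commands = queue.Queue()
store = {}

# The client knows the ID immediately, before anything has been persisted.
trade_id = submit_create_trade(commands, "XYZ", 100)

# Later, when the persistence layer is online, the command is applied.
handle_pending(commands, store)
```

Between the two calls at the bottom, the storage side could be offline without the client noticing. The client already holds the only identifier it will ever need.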
An interesting aside is that server-generated integer IDs almost imply that a relational database is being used as the storage mechanism. What if we decide to use a different type of persistence, like S3, that doesn't generate integer IDs? To avoid breaking existing clients, we would have to come up with some manner to generate integers and associate them with newly-created data. It's better to let the clients provide the IDs and stop relying on the server to do it.
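As a rough illustration of why client-provided IDs make the persistence layer swappable, the sketch below puts two stores behind the same small interface. All names are hypothetical. SqlTradeStore uses an in-memory SQLite table whose auto-generated row_id never leaves the class, and ObjectTradeStore uses a dict as a stand-in for an object store such as S3 (no real S3 API is called). Both are addressed purely by the trade ID the client supplies.

```python
import json
import sqlite3
import uuid
from typing import Optional, Protocol


class TradeStore(Protocol):
    """What callers depend on: storage addressed only by the client's ID."""

    def put(self, trade_id: str, trade: dict) -> None: ...

    def get(self, trade_id: str) -> Optional[dict]: ...


class SqlTradeStore:
    """Relational storage: the row ID exists but stays inside this class."""

    def __init__(self) -> None:
        self._conn = sqlite3.connect(":memory:")
        self._conn.execute(
            "CREATE TABLE trades ("
            " row_id INTEGER PRIMARY KEY AUTOINCREMENT,"
            " trade_id TEXT NOT NULL UNIQUE,"
            " body TEXT NOT NULL)"
        )

    def put(self, trade_id: str, trade: dict) -> None:
        self._conn.execute(
            "INSERT INTO trades (trade_id, body) VALUES (?, ?)",
            (trade_id, json.dumps(trade)),
        )
        self._conn.commit()

    def get(self, trade_id: str) -> Optional[dict]:
        row = self._conn.execute(
            "SELECT body FROM trades WHERE trade_id = ?", (trade_id,)
        ).fetchone()
        return json.loads(row[0]) if row else None


class ObjectTradeStore:
    """Key-value storage: a dict standing in for an object store such as S3."""

    def __init__(self) -> None:
        self._objects = {}

    def put(self, trade_id: str, trade: dict) -> None:
        # The object key is derived from the client's ID; no integer is needed.
        self._objects[f"trades/{trade_id}.json"] = json.dumps(trade).encode("utf-8")

    def get(self, trade_id: str) -> Optional[dict]:
        data = self._objects.get(f"trades/{trade_id}.json")
        return json.loads(data) if data is not None else None


trade_id = str(uuid.uuid4())
trade = {"symbol": "XYZ", "quantity": 100}
for store in (SqlTradeStore(), ObjectTradeStore()):
    store.put(trade_id, trade)
    print(type(store).__name__, store.get(trade_id) == trade)
```

Had clients been handed the integer row_id instead, the second store would need to invent and track integers just to keep them working.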
When crossing system boundaries, the use of integers as identifiers should be viewed as a code smell. If you see this happening in your code, you should question if you are leaking implementation details outside of your system boundaries (because you probably are).
You may notice that the bin number example I used works precisely BY giving integers to clients, and that system works just fine. Does that mean it's violating the concept we just described by giving location details to clients? The answer is no, because in that case the only significance the bin number has is the storage location itself. At no time does the act of placing an item in a bin confer a new identity on the item inside it. We should not conflate the ideas of identity and location in our systems either.
Why is this important now?
As we break apart monolithic software and share data between services and across the enterprise, we need the ability to uniquely identify data points regardless of the persistence layers in which they are stored. We will be communicating via queues and asynchronous messages and cannot rely on server-generated IDs, especially things like integer RowIDs. Not only are they implementation details that shouldn't leak out, they are also not unique across services. 427 in one system almost certainly will not refer to the same piece of data as 427 in a different system. We need to introduce location-independent keys to avoid confusion and the leakage of implementation details.
The key (pun intended) is to use identifiers that are not dependent upon storage locations or specific database technologies, and ensure that details that ARE location-dependent do not get exposed to clients or other systems.
Referenced resources:
Fundamentals of Distributed Systems (Pluralsight)
CQRS in Practice (Pluralsight)
This article was published by S&P Global Market Intelligence and not by S&P Global Ratings, which is a separately managed division of S&P Global.