Standardized naming & data typing for integrated & loosely coupled systems

When working on a project with dozens of engineers (some of whom come and go) over the span of a decade, it becomes apparent that not everyone thinks the same way, names the variables that cross system boundaries the same way, or uses data types consistently.

Yet converging on a standard is important so that “downstream” consumers can interpret the data consistently and accurately, with less special-case code.

That may sound abstract, so I will give some context and concrete examples. The system may look something like the following, at a high level:

  • Hubs, Cameras & IoT sensors in millions of homes or businesses, reporting in to a centralized set of servers. The devices have software written in Python, C, C++, Rust and bash.
  • Servers that store the data & provide APIs for smartphone apps & other business stakeholders (service tools, inventory control, account management, etc.). The database could be a document store without a formal schema or common data types.
  • The server farm consists of dozens or even hundreds of microservices, many of which need to understand the data.
  • A data lake that replicates the data and allows data analysts to do their thing.

Let’s say that we want to report device status, which could include whether the device is online or offline. What do we name it? Is it a boolean, or a multi-state value with states of Unknown, Initializing, Offline and Online? Do we allow booleans to be represented as integers? With a variety of different engineers, we could end up all over the place with naming and types: each device could arrive with a different set of choices.

If all of the code is “owned” by one organization, consistency pays off: the data is easier to understand, interpret and maintain, and consumers need less special casing.

Imagine an Android app that needs to present a list view of the online status of all devices, and has to cycle through logic to handle a different name for each device type:

Loop over a list of all devices...
If device is camera_type_1, retrieve the "online" property as a boolean
else if the device is camera_device_2, retrieve the "offline" tri-state integer
else if the device is a thermostat, retrieve the "node_online" property as an integer that is essentially a boolean

Now multiply that many times over, across iOS, web apps, many microservices and data analysts.
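To make the cost concrete, here is roughly what that pseudocode could turn into in Python. This is only a sketch: the device records are made up, and the encoding of the tri-state integer (0 = online, 1 = offline, anything else = unknown) is my assumption, since each team could pick its own.

```python
from typing import Optional

# Hypothetical device records, as three different teams might report them.
devices = [
    {"type": "camera_type_1", "online": True},
    {"type": "camera_device_2", "offline": 1},   # assumed tri-state: 0 online, 1 offline, 2 unknown
    {"type": "thermostat", "node_online": 1},    # an integer that is essentially a boolean
]

def is_online(device: dict) -> Optional[bool]:
    """Return True or False, or None when the status is unknown."""
    kind = device["type"]
    if kind == "camera_type_1":
        return device["online"]
    if kind == "camera_device_2":
        # Negative naming plus a third state: map it explicitly.
        return {0: True, 1: False}.get(device["offline"])
    if kind == "thermostat":
        return bool(device["node_online"])
    return None  # a device type this consumer has never heard of

statuses = [is_online(d) for d in devices]
```

Every consumer of the data ends up carrying its own copy of a function like `is_online`, and every new device type means another branch in each copy.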

I hope that it’s clear that engineering standards can ease the cost for downstream consumers of the data.

But what if we have engineering standards, and no one reads them? There are already enough other mandates, requirements and complexity for new hires and even seasoned software developers.

That’s where tools can help. Just as linters and build servers run checks automatically, we can implement our own tools that check data against the standards, so that we don’t have to know or remember every one.
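As a minimal sketch of such a tool, assume a hypothetical standard with just two rules: property names are snake_case, and properties declared as booleans must be real booleans rather than integers. The rule set and the names `BOOLEAN_PROPERTIES` and `check_payload` are mine, for illustration only.

```python
import re

# Hypothetical rules; a real standard would have many more.
SNAKE_CASE = re.compile(r"[a-z][a-z0-9]*(_[a-z0-9]+)*")
BOOLEAN_PROPERTIES = {"online"}

def check_payload(payload: dict) -> list:
    """Return a list of human-readable problems found in a device payload."""
    problems = []
    for name, value in payload.items():
        if not SNAKE_CASE.fullmatch(name):
            problems.append(f"{name}: not snake_case")
        # isinstance(1, bool) is False, so integers standing in for booleans are caught.
        if name in BOOLEAN_PROPERTIES and not isinstance(value, bool):
            problems.append(f"{name}: expected a boolean, got {type(value).__name__}")
    return problems
```

A check like this could run in a unit test, on a build server, or at the server's ingress point, wherever a violation is cheapest to catch.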

What if not all of the code is “owned” by one organization, and devices must be integrated that use different standards? It can be more of a challenge.

It may be possible for the ingress/egress interaction points on the servers to translate to and from consistent naming standards and data representations. It can still be difficult when the data types and representations vary wildly. Sometimes, the desire to standardize results in data loss.

Consider what happens when the servers expect temperature in Celsius as an integer and a device reports temperature in Fahrenheit as an integer: there will be rounding errors going back and forth between the two. Other choices could include the server representing temperature as an integer in centi-Celsius or milli-Celsius, or as a float.
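A small worked example shows the loss. The function names here are hypothetical; the arithmetic is the standard Celsius/Fahrenheit conversion, rounded to the nearest integer at each step.

```python
def f_to_c(fahrenheit: int) -> int:
    """Fahrenheit to whole degrees Celsius."""
    return round((fahrenheit - 32) * 5 / 9)

def c_to_f(celsius: int) -> int:
    """Whole degrees Celsius back to Fahrenheit."""
    return round(celsius * 9 / 5 + 32)

def f_to_centi_c(fahrenheit: int) -> int:
    """Fahrenheit to hundredths of a degree Celsius."""
    return round((fahrenheit - 32) * 500 / 9)

def centi_c_to_f(centi_celsius: int) -> int:
    """Hundredths of a degree Celsius back to Fahrenheit."""
    return round(centi_celsius * 9 / 500 + 32)

# The device reports 71 F, the server stores whole Celsius, the app shows Fahrenheit.
stored = f_to_c(71)        # 21.67 rounds to 22
shown = c_to_f(stored)     # 71.6 rounds to 72: the user sees a different number

# With centi-Celsius on the server, the same round trip is exact.
shown_centi = centi_c_to_f(f_to_centi_c(71))   # 2167 centi-C, back to 71
```

A finer-grained integer unit on the server keeps the round trip stable for whole-degree Fahrenheit devices, at the price of everyone having to know that the stored number is in hundredths.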

What if the server stores temperature for different devices in Celsius, Centi-Celsius, Fahrenheit and Kelvin? We don’t want people incorrectly interpreting the data. It’s helpful to have a standard so that people know the units involved. Choices could include a suffix indicator, or an additional data point that tells the units.
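Here is one way the suffix option might look, as a sketch. The suffixes (`_c`, `_centi_c`, `_f`, `_k`) and the choice of centi-Celsius as the common unit are assumptions for illustration, not an established convention.

```python
# Hypothetical suffix convention: the property name ends with its unit.
# "_centi_c" is listed before "_c" because it also ends with "_c".
TO_CENTI_CELSIUS = {
    "_centi_c": lambda v: v,
    "_c": lambda v: v * 100,
    "_f": lambda v: (v - 32) * 500 / 9,
    "_k": lambda v: (v - 273.15) * 100,
}

def temperature_centi_c(name: str, value: float) -> int:
    """Convert a suffixed temperature property to integer centi-Celsius."""
    for suffix, convert in TO_CENTI_CELSIUS.items():
        if name.endswith(suffix):
            return round(convert(value))
    raise ValueError(f"{name}: no recognized unit suffix")
```

With the unit in the name, a consumer can no longer mistake `temperature_f` for Celsius, and a property called just `temperature` is rejected instead of being guessed at.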

Working with 64-bit integers can be a challenge for web front-ends. The device, the server and the database may all handle them adeptly, but what happens when we send one to a JavaScript front-end? Large values silently lose precision, because a JavaScript Number is a double-precision float that represents integers exactly only up to 2 ** 53 - 1 (Number.MAX_SAFE_INTEGER). It may be that losing precision is acceptable in some cases and not in others, or that we want people to store such values as strings. An engineering standard can help address this kind of challenge.
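If the standard says "large integers travel as strings", a server-side helper could apply that rule before serializing. The sketch below assumes a Python service; `js_safe` and the payload fields are hypothetical names.

```python
import json

MAX_SAFE_INTEGER = 2**53 - 1  # same value as JavaScript's Number.MAX_SAFE_INTEGER

def js_safe(value):
    """Recursively replace integers that a JavaScript Number cannot hold exactly with strings."""
    if isinstance(value, bool):   # bool is a subclass of int in Python; leave it alone
        return value
    if isinstance(value, int) and abs(value) > MAX_SAFE_INTEGER:
        return str(value)
    if isinstance(value, dict):
        return {key: js_safe(item) for key, item in value.items()}
    if isinstance(value, list):
        return [js_safe(item) for item in value]
    return value

payload = {"device_id": 2**63 - 1, "uptime_s": 86400}
encoded = json.dumps(js_safe(payload))
# encoded is '{"device_id": "9223372036854775807", "uptime_s": 86400}'

# What the front-end would see without the rule: a double cannot tell these apart.
precision_lost = float(2**53 + 1) == float(2**53)   # True
```

The trade-off is that the same property can now arrive as a number or a string depending on its size, so a standard may prefer to declare specific properties as always-strings.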

What about MAC addresses? Could it be helpful for a standard to specify upper or lower case, and whether to use separators? It makes the job of downstream consumers easier.

What about normalizing data and losing information? If data loss is a problem, could the server store the data in “raw” form, as sent by the device, and also support a normalized/translated version? When translation fails, what value does the normalized/translated data contain? E.g. if a device sends an Ethernet MAC address of “uninitialized” or “driver error”, what do we do?
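One possible answer, sketched below: keep the raw value exactly as sent, and store a normalized value that is empty (None) when translation fails. The lowercase, colon-separated target format and the property names `mac_raw` and `mac` are assumptions for illustration; a standard could equally pick uppercase or no separators.

```python
import re
from typing import Optional

_HEX12 = re.compile(r"[0-9a-f]{12}")

def normalize_mac(raw: str) -> Optional[str]:
    """Return the lowercase, colon-separated form, or None if raw is not a MAC address."""
    # Accept common separators (colon, dash, dot) and either case.
    digits = re.sub(r"[:.\-]", "", raw.strip().lower())
    if not _HEX12.fullmatch(digits):
        return None
    return ":".join(digits[i:i + 2] for i in range(0, 12, 2))

def mac_record(raw: str) -> dict:
    """Store both forms, so nothing the device sent is lost."""
    return {"mac_raw": raw, "mac": normalize_mac(raw)}
```

Downstream consumers read `mac` and never see “driver error” where an address should be, while support engineers can still look at `mac_raw` to find out what the device actually said.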

Engineering standards can also define…

  1. default values. E.g. an uninitialized dBm signal strength… should it be 0 or negative 100 or negative 128? Many languages initialize variables to 0. A dBm of 0 is an impossibly powerful signal strength for in-home radios like WiFi devices.
  2. allowable ranges for temperature, signal strength, etc.
  3. preferred unit representations. E.g. dBm instead of a percentage.
  4. maximum length of strings, arrays, and maps.
  5. positive naming logic preferred over negative, where applicable. e.g. “online” instead of “offline”.
  6. and so much more.
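Several of these items can be checked mechanically. The sketch below covers ranges and string length only; the limits and property names are hypothetical, and a real standard would choose its own.

```python
# Hypothetical limits, for illustration only.
RANGES = {
    "signal_dbm": (-128, -1),               # 0 is rejected: most likely a variable that was never set
    "temperature_centi_c": (-4000, 12500),
}
MAX_STRING_LENGTH = 64

def check_value(name, value):
    """Return a description of the problem, or None if the value is acceptable."""
    if isinstance(value, str):
        if len(value) > MAX_STRING_LENGTH:
            return f"{name}: string longer than {MAX_STRING_LENGTH} characters"
        return None
    if name in RANGES:
        low, high = RANGES[name]
        if not low <= value <= high:
            return f"{name}: {value} is outside {low}..{high}"
    return None
```

With a range like this, the zero-initialized dBm value from item 1 is flagged instead of being reported as an impossibly strong signal.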

Thank you for reading. May your code wrangling and engineering standards go well.