XML for documents, not for large data streams

I like XML, and I hate XML. XML is great because robust parsers already exist for nearly every programming language, thus saving work for programmers and reducing bugs. XML stinks because it’s not always the right tool for the job — it’s ugly, and it’s bulky. So when I read Michael E. Driscoll’s [comparison of documents (including XML) to trees and data to streams](http://dataspora.com/blog/the-rise-of-the-data-web/), it struck a chord with me:

> Trees are rooted and finite: you can’t chop up a tree and easily put it back together again. Streams can be split, sampled, and filtered. The divisibility of data streams lends itself to parallelism in a way that document trees do not. The stream paradigm conceives of data as extending infinitely forward in time. The Twitter data stream has no end: it ought have no end tag. Conceiving of data as streams moves us out of the realm of static objects and into the realm of signal processing.
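That last point is easy to see with newline-delimited JSON, where each line is a complete record, so the stream can be cut at any newline, sampled, or filtered piece by piece; an XML document, by contrast, isn't even well formed until its closing root tag arrives. Here's a minimal Python sketch (the records and field names are invented for illustration):

```python
import io
import json

# A minimal sketch of the stream paradigm: newline-delimited JSON
# (one self-contained record per line) stands in for an endless feed.
feed = io.StringIO(
    '{"user": "alice", "msg": "hello"}\n'
    '{"user": "bob", "msg": "hi there"}\n'
    '{"user": "alice", "msg": "bye"}\n'
)

# Each line parses on its own, so the stream can be split, sampled,
# or filtered without a root element, an end tag, or a whole
# document tree held in memory.
for line in feed:
    record = json.loads(line)
    if record["user"] == "alice":
        print(record["msg"])
```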

He also [explains why XML shouldn’t be used for large data streams](http://dataspora.com/blog/xml-and-big-data/):

> XML is a poor language for data because it solves the wrong problems — those of documents — while leaving many of data’s unique issues unaddressed. But many promising alternatives exist — microformats like JSON, Thrift, and even SQLite’s file format.

I wouldn’t have thought of using SQLite’s file format, but it has become somewhat ubiquitous. I admire Google Protocol Buffers and Apache Thrift for offering open-source, multi-language binary encoding for data. Now programmers are less likely to reinvent the wheel, and they can rely on robust libraries instead.
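To see why compact encodings matter, here's a toy comparison: the same made-up record serialized as XML, as JSON, and as a hand-rolled fixed binary layout. The binary packing is rolled by hand only to show the size difference; Thrift and Protocol Buffers generate encodings in this spirit for you, along with schemas, versioning, and bindings for many languages.

```python
import json
import struct
import xml.etree.ElementTree as ET

# One invented record serialized three ways.
record = {"id": 42, "lat": 37.7749, "lon": -122.4194}

# XML: tags are repeated for every field of every record.
root = ET.Element("point")
for key, value in record.items():
    ET.SubElement(root, key).text = str(value)
xml_bytes = ET.tostring(root)

# JSON: still self-describing text, but much lighter.
json_bytes = json.dumps(record).encode("utf-8")

# Binary: one unsigned 32-bit int and two doubles, big-endian.
binary_bytes = struct.pack(">Idd", record["id"], record["lat"], record["lon"])

print(len(xml_bytes), len(json_bytes), len(binary_bytes))
# Roughly 64, 44, and 20 bytes, respectively.
```

Multiply that threefold difference by millions of records and the bulk of XML stops being a cosmetic complaint.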