XML-based File Formats: Good or Bad?
This post asks a simple question: are XML-based file formats good or bad? The short answer is "it depends."
Let's go back in history a while. Doug McIlroy, the inventor of Unix pipes, is said to have advised, “Write programs to handle text streams, because that is a universal interface.” Using text files makes it easier for another application to understand and process your data. Text files can be easier to interpret than binary files because a text format can include a descriptor for each element of data that tells its meaning or purpose. (Not every text format is defined that way, of course, so not every text file will contain such descriptors.)
Text files buy this clarity at the expense of size. If, for example, a field contained the numeric value 32767, it would take up five bytes of storage in an ASCII text file, but only two bytes in a binary file. With larger numeric values, and a largely numeric set of data, this difference only grows.
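The size difference is easy to demonstrate. Here is a short Python sketch using the standard `struct` module; the value 32767 is the example from above:

```python
import struct

value = 32767

# Text representation: one byte per ASCII digit.
text_form = str(value).encode("ascii")

# Binary representation: a 16-bit signed integer ("<h" = little-endian short).
binary_form = struct.pack("<h", value)

print(len(text_form))    # 5 bytes
print(len(binary_form))  # 2 bytes
```

The text copy is two and a half times the size of the binary one, and the gap widens for larger integer types.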
Converting data to a text representation, and back from text to the binary form the application uses internally, also takes time. A text file's larger size means it takes longer to read or write, both because of the extra bytes on disk and because of the conversion. Years ago this was more of an issue, when computers had slower processors, smaller disks, and less memory.
Text files, especially if they contain data descriptors, can be easier for humans to read and understand. Text files are also easier for humans to write or edit.
On the other hand, a binary file generally will not contain an easily read descriptor for each data field. Such a file is not meant for humans to read or edit directly, but for a software application's use. It will generally be smaller on disk and faster for the application to read or write. It is especially fast if the format mimics the way the application lays out the data in memory, at the expense of making any other application that uses the file (but does not share the same internal data structures) slower.
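To make the fixed-layout point concrete, here is a hedged Python sketch (the record fields are invented for illustration) that packs a record into a rigid binary layout with the standard `struct` module. Reading it back is a direct copy into fields, with no digit parsing and no descriptors:

```python
import struct

# A hypothetical fixed-layout record: a 32-bit id, a 64-bit float, a 1-byte flag.
# With "<" (little-endian, no padding) each record is exactly 4 + 8 + 1 = 13 bytes.
record = struct.Struct("<IdB")

packed = record.pack(1001, 98.6, 1)
print(record.size)  # 13 bytes per record, regardless of the values stored

# Unpacking copies the bytes straight back into typed fields.
rec_id, temperature, flag = record.unpack(packed)
```

Note that nothing in those 13 bytes says what "id" or "flag" means; that knowledge lives only in the application (or its documentation), which is exactly the trade-off described above.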
Because of their compact representation, binary files are also more fragile: a small amount of corruption can make the whole file unusable.
Whether text or binary, file formats are better when they are strictly defined and specified, and when all people and applications using them stick to the specification. Each specification, and each application implementing one, can decide whether to accept "hand grenade" files (files which almost conform to the spec) on a best-effort basis.
XML files have all of the advantages and disadvantages of text files. XML adds a hierarchical structure along with the field descriptors: XML files are designed to be self-describing, in both data and structure. An XML document can also carry, near the top of the file, a reference to the name and location of its specification, which helps in determining which elements and structures are appropriate and what meanings should be assigned to them. All of this comes at the price of verbosity. In reaction to the sloppy ways in which people have written HTML, XML (and applications implementing it) is designed to require strict compliance. This makes it much easier to implement XML data files in a software application, because the application never has to guess at meaning or intent. If a file violates the spec, the application can ignore the offending content or reject the file, but it should inform the user of the violation.
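That strictness is visible in any conforming parser. A sketch with Python's standard `xml.etree.ElementTree` (the document content here is made up): a well-formed file parses cleanly, while a file with a mismatched tag is rejected outright rather than guessed at:

```python
import xml.etree.ElementTree as ET

well_formed = "<doc><value unit='C'>21</value></doc>"
parsed = ET.fromstring(well_formed)
print(parsed.find("value").text)  # 21

# An unclosed element is a fatal error; the parser does not guess at intent.
broken = "<doc><value>21</doc>"
try:
    ET.fromstring(broken)
    rejected = False
except ET.ParseError:
    rejected = True
print("rejected:", rejected)  # rejected: True
```

Contrast this with an HTML browser, which would quietly render the broken input anyway.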
XML files are readable by humans. They can also be written or edited by humans, but because XML expects strict adherence to its specs, such editing should be done with the relevant spec in hand until the user is thoroughly familiar with it. XML files, being text files (and repetitive text files at that), compress well with zip, gzip, 7-zip, bzip2, or similar methods. This removes a substantial amount of the file-size concern.
Because of these advantages of XML for data storage, next-generation office suites are moving to XML-based file formats. OpenOffice.org, for example, uses the ISO-approved OASIS OpenDocument Format (ODF). Microsoft Office 2007 will use formats based on the Ecma OOXML formats. Office suites are among the most visible applications of XML. It is hoped that the spread of ODF (a zipped XML format) will make users' data more permanent, since it is an open standard that anyone can implement. This means no more losing access to your own data because you no longer own the software that created it, or because you change to a different operating system (such as GNU+Linux, OpenBSD, FreeBSD, Syllable, or Haiku). It also means that any decent programmer can write an application that uses the data stored in ODF files.
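"Zipped XML" means exactly what it sounds like. Here is a hedged sketch that builds a toy ODF-like package in memory with Python's standard `zipfile` module and reads it back; a real .odt file also carries a mimetype entry, styles.xml, meta.xml, and more, so the content below is a deliberate simplification:

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Build a toy "ODF-like" package in memory: a zip whose payload is XML.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as z:
    z.writestr("content.xml", "<document><p>Hello, ODF</p></document>")

# Anyone with a zip library and an XML parser can get the data back out.
with zipfile.ZipFile(buf) as z:
    root = ET.fromstring(z.read("content.xml"))
print(root.find("p").text)  # Hello, ODF
```

This is why an open, zipped-XML format is such good insurance: the two tools needed to unpack it are available in nearly every programming language.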
So, in conclusion, XML is a good format for data storage, especially for data that will be accessed by multiple applications and possibly by people directly. XML excels, I think, as a data interchange format (one party sending data to another party) and as a basis for transforming data from one format to another.
Further reading:
- The W3C, the inventors of XML
- A Bite Of XML, a blog about XML
- W3 Schools
- XML.com, an O'Reilly Media site
- xml.apache.org, Apache's XML software project
- ODF 1.1 Draft Specification