XML-based File Formats: Good or Bad?

Saturday, 2006-December-23 at 00:41 4 comments

This post asks a question: are XML-based file formats good or are they bad?  The short answer is "it depends."

Let's go back in history a while.  Doug McIlroy, the inventor of Unix pipes, is said to have said, “Write programs to handle text streams, because that is a universal interface.”  Using text files makes it easier for another application to understand and process your data.  Text files can be easier to interpret than binary files, because a text format can be defined so that each element of data carries a descriptor telling its meaning or purpose.  Not all text formats are defined this way, so not all text files contain such descriptors.

Text files gain this at the expense of being larger.  If, for example, a field contained the numeric value 32767, it would take up 5 bytes of storage in an ASCII text file, but only two bytes in a binary file.  With larger numeric values and a largely numeric data set, this difference only grows.
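
A quick sketch in Python makes the gap concrete (the numbers are just the example above):

    import struct

    value = 32767  # the largest value that fits in a signed 16-bit integer

    text_form = str(value).encode("ascii")  # b'32767' -> 5 bytes
    binary_form = struct.pack("<h", value)  # little-endian 16-bit int -> 2 bytes

    print(len(text_form), len(binary_form))  # 5 2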

It also takes time to convert data to its text representation, or back from text to the binary form the application uses internally.  A larger text file takes longer to read or write, both because of its size on disk and because of this conversion.  Years ago, this was more of an issue, because computers had slower processors, smaller disks, and less memory.

Text files, especially if they contain data descriptors, can be easier for humans to read and understand.  Text files are also easier for humans to write or edit.

On the other hand, binary files generally do not contain an easily read descriptor for each data field.  Such a file is not made for humans to read or edit directly, but for a software application's use.  A binary file is generally smaller on disk and faster for the application to read or write.  It is especially fast if the format is designed to mimic the way the application stores data in memory, at the expense of making the file slower to use for any other application that does not share the same internal data structures.
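
As an illustration, here is a minimal Python sketch of such a fixed-layout binary record; the field names and layout are hypothetical, not taken from any real format:

    import struct

    # Hypothetical record layout: a 4-byte integer id and an 8-byte float score,
    # packed exactly the way a C struct might sit in memory.
    RECORD = struct.Struct("<id")

    packed = RECORD.pack(42, 98.6)      # 12 bytes, no field names, no descriptors
    record_id, score = RECORD.unpack(packed)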

Because of their compact representation, binary files are also more likely to be fragile.  That is, a small amount of corruption could make the file unusable.

Whether text or binary, file formats are better when they are strictly defined and specified, and when every person and application using them sticks to the specification.  Each specification, and each application implementing one, can decide whether to accept "hand grenade" files (files which almost conform to the spec) and process them on a best-effort basis.

I discussed file formats at my tech blog.

XML files have all of the advantages and disadvantages of text files.  XML adds a hierarchical structure along with the field descriptors.  XML files are designed to be self-describing, in both their data and their structure.  Near the top of the file, an XML document can declare the name and location of its specification (a DTD or schema), which helps in determining which elements and structures are appropriate, and the meanings that should be assigned to them.  All of this comes at the price of verbosity.  Because of the sloppy ways in which people have written HTML, XML (and the applications implementing it) is supposed to require strict compliance.  This makes it much easier to implement XML data files in a software application, because the application does not have to guess at a file's meaning or intent.  If a file violates the spec, the application can ignore or reject it, but should inform the user of the violation.
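
As a small illustration of that strictness, Python's standard XML parser refuses malformed input outright rather than guessing (the sample documents are made up for this sketch):

    import xml.etree.ElementTree as ET

    well_formed = "<inventory><item qty='3'>widget</item></inventory>"
    malformed = "<inventory><item qty=3>widget</item></inventory>"  # unquoted attribute

    ET.fromstring(well_formed)  # parses without complaint

    try:
        ET.fromstring(malformed)
    except ET.ParseError as err:
        print("rejected:", err)  # the parser reports the violation instead of guessing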

XML files are readable by humans.  They can also be written or edited by hand, but because XML expects strict adherence to the specs, such editing should be done with the relevant specs in hand until the user is completely familiar with them.  XML files, being text files (and repetitive text files at that), compress well with zip, gzip, 7-zip, bzip2, or similar methods.  This removes much of the file-size disadvantage.
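
A rough sketch of how well repetitive XML compresses (the snippet is invented, and real documents will vary):

    import gzip

    # Simulate a large, repetitive XML document.
    xml_data = ("<row><name>example</name><value>123</value></row>" * 1000).encode()

    compressed = gzip.compress(xml_data)
    print(len(xml_data), "->", len(compressed))  # the repeated tag names nearly vanish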

Because of these advantages for data storage, next-generation office suites are moving to XML-based file formats.  OpenOffice.org, for example, uses the OASIS OpenDocument Format (ODF), an ISO-approved international standard.  Microsoft Office 2007 will use formats based on Ecma's Office Open XML (OOXML).  Office suites are among the most visible applications of XML.  It is hoped that the spread of ODF (a zipped XML format) will make users' data more permanent, since it is an open standard that anyone can implement.  This means no more losing access to your own data because you no longer own the software that created it, or because you changed to a different operating system (such as GNU+Linux, OpenBSD, FreeBSD, Syllable, or Haiku).  It also means that any decent programmer can write an application that uses the data stored in ODF files.
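
Because an ODF file is just a zip archive of XML documents, any programmer can peek inside with ordinary tools.  A minimal sketch in Python ("example.odt" is a placeholder for any OpenDocument text file):

    import zipfile

    with zipfile.ZipFile("example.odt") as odf:
        print(odf.namelist())  # content.xml, styles.xml, meta.xml, ...
        content = odf.read("content.xml").decode("utf-8")
        print(content[:200])   # ordinary, standard XML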

So in conclusion, XML is a good format for data storage, especially for data that will be accessed by multiple applications and possibly by people directly.  XML excels, I think, as a data interchange format (one party sending data to another party) and as a basis for transforming data from one format to another.


Entry filed under: Computers, Linux, ODF, Software, XML.


4 Comments

  • 1. dorai  |  Thursday, 2007-March-08 at 06:24

    While text streams were good a while ago, I think as the web progresses, we need text streams with metadata in them. XML is just one such format. JSON could be another (almost as simple as plain text streams) with some structure.

  • 2. lnxwalt  |  Thursday, 2007-March-08 at 11:41

    The self-describing, or semantic, nature of XML is in fact the metadata you were speaking of.  In XML (and now also JSON), we have a text-based, self-describing way of representing data. It means that you can send me data, and I can understand not only what it says but what it means.  Among other things, this gives me the ability to make use of the data in completely unanticipated ways.
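
    For instance, the same made-up record carries its own field names in either notation:

        <person><name>Ada</name><role>engineer</role></person>

        {"name": "Ada", "role": "engineer"}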

  • 3. Broken_Bazooka  |  Thursday, 2008-July-10 at 03:20

    Quote:
    “Because of their compact representation, binary files are also more likely to be fragile. That is, a small amount of corruption could make the file unusable.”

    Because of the severely increased size, XML files are more likely to become corrupted. Corruption is also impossible to detect in an XML file and is likely to cause unforeseen damage when the data is used.

  • 4. lnxwalt  |  Thursday, 2008-July-10 at 19:21

    Thanks for the comment.

    Corruption in any structured, text-based format is more likely to be detectable and repairable than corruption in any binary format, mostly because one can quickly see where random garbage appears in the text file, while the binary file may show no signs until you try to load it into an application.

    My current thinking is to prefer formats similar to INI files when it makes sense to do so, because they retain most of the self-describing nature that characterizes XML without the verbosity. Also, XML user agents are supposed to be strict about following the DTD or schema in use, while INI file readers tend to be a little looser in their requirements.
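
    As a minimal sketch (the file contents here are hypothetical), Python's standard library reads such a format in a few lines:

        import configparser
        import textwrap

        ini_text = textwrap.dedent("""\
            [database]
            host = localhost
            port = 5432
            """)

        config = configparser.ConfigParser()
        config.read_string(ini_text)
        print(config["database"]["port"])  # '5432' -- every value is a plain string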

    Once again, thanks for your input.

