Simeon Simeonov

Related Topics: Apache Web Server Journal, XML Magazine

XML in Transit: Encoding Data

I just came back from the first face-to-face meeting of the W3C working group on XML Protocol (is it just me, or is the name somewhat odd-sounding?), and I'm wondering what topics to exclude from this column. Yes, that's right - exclude. Encoding data in XML is a difficult topic for many reasons. First, it's one of those technical subjects in which you need to look at lots of XML instance/schema/DTD snippets. Second, the devil is very much in the details and there are lots of them. Last but not least, there are as many ways to encode data in XML as there are data encoding needs. With this caveat, let's dive in. Keeping with the spirit of the column we'll touch on issues that are most relevant to XML protocols.

Have Protocol, Will Move Data
Imagine that some good people have developed a flexible and extensible XML protocol that can work with arbitrary data encoding styles. For example, SOAP defines an attribute in the envelope namespace - SOAP-ENV:encodingStyle - whose value is a URI identifying a particular encoding style. The encoding style applies to the element associated with the attribute as well as its content, excluding any child elements decorated with an encoding style specifier. For a quick refresher, peek at the following code:

<x:UpdateStock xmlns:x="Some URI">

There are many data transport scenarios and many possible data encoding styles that can be used with them. To put some structure to the discussion, think of the decision space as a choice tree. A choice tree has yes/no questions at its nodes and outcomes at its leaves (see Figure 1).

XML Data
Probably the most common choice involves whether the data is already in (or can easily be put into) an XML format. If we can represent the data as XML, we need only decide how to include it in the XML instance document that will represent a message in the protocol. Ideally, we could just mix it in amid the protocol-specific XML, but under a different namespace (as shown in the previous code snippet). There are several benefits to this approach:

  1. The message is easy to construct and process using standard XML tools.
  2. Its contents can be queried using XQuery.
  3. If need be, it can be transformed using XSLT.

There's a catch.... The problem has to do with a seldom-considered but important aspect of XML - the uniqueness rule for ID attributes. The values of attributes of type ID must be unique in an XML instance so that the elements with these attributes can be conveniently referred to using attributes of type IDREF (following code snippet). (For more information on the uses of ID/IDREF read "Eliminating Redundancy in XML Using ID/IDREF" [XML-J, Vol. 1, issue 4].)

<Target id="mainTarget"/>
<Reference href="#mainTarget"/>

If your data doesn't use ID attributes you can include it inline (textually) in the XML protocol message under a separate namespace. However, if you do use ID attributes you'll run the risk of violating the uniqueness rule. For example, in the following code both message elements have the same id. This makes the document invalid XML. And no, namespaces do not address the issue. In fact, the problems are so serious that nothing short of a change in the core XML specification and in most XML processing tools can change the status quo. Don't wait for this to happen.

<message id="msg-1">
  A message with an attached <a href="#msg-1">message</a>.
  <attachment id="attachment-1">
    <!-- ID conflict right here -->
    <message id="msg-1">
      This is a textually included message.
    </message>
  </attachment>
</message>
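
To make the conflict concrete, here is a minimal Python sketch that scans a message for repeated ID values. It rests on an assumption: XML only treats an attribute as type ID when a DTD or schema says so, and the sketch simply treats every attribute named id that way. The function name duplicate_ids is made up for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def duplicate_ids(xml_text, id_attr="id"):
    """Return the sorted id values that occur more than once in the document."""
    root = ET.fromstring(xml_text)
    # Assumption: every attribute named `id_attr` is of type ID.
    counts = Counter(
        el.get(id_attr) for el in root.iter() if el.get(id_attr) is not None
    )
    return sorted(value for value, n in counts.items() if n > 1)

doc = """
<message id="msg-1">
  A message with an attached <a href="#msg-1">message</a>.
  <attachment id="attachment-1">
    <message id="msg-1">This is a textually included message.</message>
  </attachment>
</message>
"""

print(duplicate_ids(doc))  # ['msg-1']
```

A protocol toolset could run a check like this before deciding whether a piece of XML can be included inline as is.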

There are two ways to work around the problem. If no one ever externally references specific IDs within the protocol message data, your XML protocol toolset can automatically rewrite the IDs and references to them as you include the XML inside the message (see code below). This will give you the benefits described above at the cost of some extra processing and a slight deterioration in readability due to the machine-generated IDs.

<message id="msg-1">
  A message with an attached <a href="#id-9137">message</a>.
  <attachment id="attachment-1">
    <!-- ID has been changed -->
    <message id="id-9137">
      This is a textually included message.
    </message>
  </attachment>
</message>
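
The rewriting step might look roughly like the following Python sketch. It assumes that IDs live in attributes named id and that references are href attributes of the form #id, as in the snippets here; rewrite_ids and the id- prefix are hypothetical names, not part of any toolset.

```python
import itertools
import xml.etree.ElementTree as ET

def rewrite_ids(fragment, taken, prefix="id-"):
    """Rename ids in `fragment` that clash with `taken`; return {old: new}."""
    fragment_ids = {el.get("id") for el in fragment.iter() if el.get("id") is not None}
    reserved = set(taken) | fragment_ids      # names a generated id must avoid
    counter = itertools.count(1)
    renamed = {}
    for old in sorted(fragment_ids & set(taken)):
        new = f"{prefix}{next(counter)}"
        while new in reserved:
            new = f"{prefix}{next(counter)}"
        reserved.add(new)
        renamed[old] = new
    for el in fragment.iter():
        if el.get("id") in renamed:
            el.set("id", renamed[el.get("id")])
        # Keep references inside the fragment pointing at the renamed elements.
        href = el.get("href", "")
        if href.startswith("#") and href[1:] in renamed:
            el.set("href", "#" + renamed[href[1:]])
    return renamed

outer = ET.fromstring(
    '<message id="msg-1">A message with an attached '
    '<a href="#msg-1">message</a>.<attachment id="attachment-1"/></message>'
)
inner = ET.fromstring(
    '<message id="msg-1">This is a textually included message.</message>'
)

taken = {el.get("id") for el in outer.iter() if el.get("id")}
mapping = rewrite_ids(inner, taken)
outer.find("attachment").append(inner)
print(mapping)  # {'msg-1': 'id-1'}
```

The returned mapping is what the enclosing message would use if one of its own links is meant to point at the included content under its new ID.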

However, if you can't do this, you'll have to include the XML as an opaque chunk of text inside your protocol message (see the following code). In this case we've escaped all pointy brackets, but we could have included the whole message in a CDATA section. The benefit of this approach is that it's easy and works for any XML content. But you don't get any of the benefits of XML either. You can't validate, query, or transform the data directly and you can't reference pieces of it from other parts of the message.

<message id="msg-1">
  A message with an attached message that we can no longer refer to directly.
  <attachment id="attachment-1">
    <!-- Message included as text -->
    &lt;message id="id-9137"&gt;
      This is a textually included message.
    &lt;/message&gt;
  </attachment>
</message>
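
A short Python sketch of the opaque-text approach, using only the standard library. It shows that the escaped XML survives the trip but arrives as a plain string with no element structure; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

inner_xml = '<message id="msg-1">This is a textually included message.</message>'

# Escaping by hand turns markup characters into entity references.
escaped = escape(inner_xml)

# ElementTree applies the same escaping when a string is stored as text.
attachment = ET.Element("attachment", {"id": "attachment-1"})
attachment.text = inner_xml
serialized = ET.tostring(attachment, encoding="unicode")

# On the receiving side the content is one string, not a subtree.
received = ET.fromstring(serialized)
print(len(received))               # 0 child elements
print(received.text == inner_xml)  # True

# Getting at the structure requires a second, explicit parse.
reparsed = ET.fromstring(received.text)
print(reparsed.get("id"))          # msg-1
```

The second parse is the price of opacity: nothing in the outer document can validate, query, or link into the attachment until the receiver unpacks it.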

Binary Data
So far we've covered encoding options for preexisting XML data. But what if you're not dealing with XML data? What if you want to transport binary data as part of your message instead? The commonly used solution is good old base-64 encoding (see Listing 1). On the positive side, base-64 data is easy to encode and decode, and the character set of base-64 encoded data is valid XML element content. On the negative side, base-64 encoding takes up about a third more space than the pure binary representation, because every 3 bytes of input become 4 output characters. If you need to move a lot of binary data and space/time efficiency is a concern, you might have to look for alternatives. More on this in a bit.
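
The size cost is easy to check with Python's standard base64 module. The payload below is random filler standing in for real binary data, and the element name data is an arbitrary choice.

```python
import base64
import os

payload = os.urandom(3000)               # stand-in for a binary object
encoded = base64.b64encode(payload)      # bytes drawn from A-Z, a-z, 0-9, +, /, =

overhead = len(encoded) / len(payload) - 1
print(len(payload), len(encoded))        # 3000 4000
print(f"{overhead:.1%}")                 # 33.3%

# The encoded form can be dropped straight into element content.
element = "<data>" + encoded.decode("ascii") + "</data>"

# Decoding restores the original bytes exactly.
assert base64.b64decode(encoded) == payload
```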

You may want to consider using base-64 encoding even when you want to move some plain text as part of a message because XML's document-centric SGML origin led to several awkward restrictions on the textual content of XML instances. For example, an XML document can't include any control characters (ASCII codes 0-31) except tabs, carriage returns, and line feeds. This covers both the straight occurrences of the characters and their encoded form as character references (e.g., &#x04;). (This caused me a lot of pain when I was creating WDDX; I still haven't gotten over it.) Further, literal carriage returns, whether alone or as part of a carriage return/line feed pair, are always converted to line feeds by XML processors. It's important to keep in mind that not all characters you can put in a string variable in a programming language can be represented in XML documents.
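
One way a serializer might act on this is sketched below in Python: check a string against the XML 1.0 character rules and fall back to base-64 when it can't be carried as plain text. The function name encode_text and the plain/base64 labels are invented for the example, and treating every carriage return as a reason to fall back is a deliberately conservative assumption. Strings containing lone surrogates are not handled.

```python
import base64
import re

# Anything outside the XML 1.0 Char production:
# #x9 | #xA | #xD | #x20-#xD7FF | #xE000-#xFFFD | #x10000-#x10FFFF
_XML10_ILLEGAL = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)

def encode_text(value):
    """Return (encoding, text), where encoding is 'plain' or 'base64'."""
    # Carriage returns are legal but would be normalized to line feeds,
    # so this sketch conservatively base-64 encodes them as well.
    if _XML10_ILLEGAL.search(value) or "\r" in value:
        return "base64", base64.b64encode(value.encode("utf-8")).decode("ascii")
    return "plain", value

print(encode_text("plain text")[0])      # plain
print(encode_text("bell\x07")[0])        # base64
print(encode_text("line\r\nbreak")[0])   # base64
```

The receiver needs to know which form was used, so in practice the label would travel with the data, for example as an attribute on the enclosing element.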

Abstract Data Models
If you're not dealing with plain text, XML, or binary data, you probably have some form of structured data represented via an abstract data model. (Both the SOAP specification and the XML Protocol materials use the term nonsyntactic to mean abstract; don't let this nondescript use of language scare you.) Usually abstract data models are ultimately instantiated as programming language data structures. A commonly used abstract data model is the directed labeled graph (DLG). A DLG consists of named nodes and directed named edges that connect source nodes with destination nodes. A node may have more than one edge with the same name. Nodes can have any number of useful properties - such as type - that don't fundamentally change the data model as they themselves can be expressed via nodes and edges.

All programming language and database data structures can be expressed as DLGs. Therefore, if we have a good way to represent DLGs in XML, we have a generic mechanism for handling abstract data models. We need three things:

  1. Given metadata about an abstract data model, we should have a way to map the model to a DLG model and construct an XML schema from it.
  2. Given an instance graph of the data model, we can generate XML that conforms to the schema. This is the serialization operation.
  3. Given XML that conforms to the schema, we can create an instance graph that conforms to the abstract data model's schema. This is the deserialization operation. Further, if we follow serialization by deserialization, we should obtain an identical instance graph to the one we started with.
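
The second and third requirements can be illustrated with a toy round trip in Python. Everything about the representation here is an assumption made for the sketch: the graph is a dictionary of node IDs, and the XML vocabulary (graph, node, edge, href) is invented rather than taken from SOAP or XMI.

```python
import xml.etree.ElementTree as ET

def serialize(graph):
    """Write each node as <node id=...> with one <edge> per outgoing edge."""
    root = ET.Element("graph")
    for node_id, node in graph.items():
        el = ET.SubElement(root, "node", {"id": node_id})
        if node["value"] is not None:
            el.set("value", node["value"])
        for label, target in node["edges"]:
            ET.SubElement(el, "edge", {"name": label, "href": "#" + target})
    return ET.tostring(root, encoding="unicode")

def deserialize(xml_text):
    """Rebuild the dictionary form of the graph from the XML."""
    graph = {}
    for el in ET.fromstring(xml_text).findall("node"):
        edges = [(e.get("name"), e.get("href")[1:]) for e in el.findall("edge")]
        graph[el.get("id")] = {"value": el.get("value"), "edges": edges}
    return graph

graph = {
    "person-1": {"value": None, "edges": [
        ("name", "name-1"),
        ("phone", "phone-1"),
        ("phone", "phone-2"),       # two edges with the same name
        ("comment", "comment-1"),
    ]},
    "name-1": {"value": "XML Guru", "edges": []},
    "phone-1": {"value": "617.555.1212", "edges": []},
    "phone-2": {"value": "415.555.1212", "edges": []},
    "comment-1": {"value": "The one true XML guru.",
                  "edges": [("about", "person-1")]},   # a cycle
}

xml_text = serialize(graph)
assert deserialize(xml_text) == graph   # the round trip preserves the graph
```

Because edges are written as references rather than nested elements, shared nodes and cycles cost nothing extra, which is the main reason graph encodings lean on ID-style links.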

As with many things in the XML industry, several specifications address this space. XMI, described in the XML-J article "UML, MOF, and XMI" (Vol. 1, issue 3), offers one mechanism. SOAP defines its own set of encoding rules that are fairly detailed and rather complex. In fact, they take up about 50% of the volume of the specification. The other 50% covers the envelope framework, header/body structure, extensibility mechanisms, intermediaries, error handling, RPC conventions, and HTTP bindings. We won't go into the details; there are too many of them. Suffice it to say that in many cases you'll never have to worry about the mechanics of the serialization/deserialization processes. The following code gives you a taste of how the instance data looks, while Listing 2 shows you a possible schema for the data. The instance data markup can appear inside both the headers and the body of a SOAP message.

<name>XML Guru</name>
<comment href="#comment-1"/>
<contactNumbers SOAP-ENC:arrayType="x:phoneNumber[2]">
  <phoneNumber>617.555.1212</phoneNumber>
  <phoneNumber>415.555.1212</phoneNumber>
</contactNumbers>
<x:comment id="comment-1" xsi:type="SOAP-ENC:string"> The one true XML guru. </x:comment>

As you can see, a lot is going on here. First, it's clear that the SOAP encoding model depends heavily on XML Schema. ID/IDREF attributes are used to handle multiple references to the same piece of data. The xsi:type attribute can be used to provide type information to the XML processor in the absence of a schema. For some types, notably sequences/arrays, you need to subclass predefined data types. In addition, array content information (SOAP-ENC:arrayType) must be stored in the instance data; pity the array structure syntax is not XML.
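
For a feel of what a deserializer does with the href and array markup, here is a small Python sketch that follows the reference to the shared comment and collects the array items. The wrapper elements data and x:person, and the namespace URI urn:example:x, are assumptions added so the fragment parses on its own; the SOAP-ENC URI is the SOAP 1.1 encoding namespace.

```python
import xml.etree.ElementTree as ET

doc = """
<data xmlns:x="urn:example:x"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xmlns:SOAP-ENC="http://schemas.xmlsoap.org/soap/encoding/">
  <x:person>
    <name>XML Guru</name>
    <comment href="#comment-1"/>
    <contactNumbers SOAP-ENC:arrayType="x:phoneNumber[2]">
      <phoneNumber>617.555.1212</phoneNumber>
      <phoneNumber>415.555.1212</phoneNumber>
    </contactNumbers>
  </x:person>
  <x:comment id="comment-1" xsi:type="SOAP-ENC:string">The one true XML guru.</x:comment>
</data>
"""

def resolve(element, by_id):
    """Follow a local href="#id" reference; return the element itself otherwise."""
    href = element.get("href")
    if href and href.startswith("#"):
        return by_id[href[1:]]
    return element

root = ET.fromstring(doc)
by_id = {el.get("id"): el for el in root.iter() if el.get("id")}

person = root.find("{urn:example:x}person")
comment = resolve(person.find("comment"), by_id)
phones = [item.text for item in person.find("contactNumbers")]

print(comment.text)   # The one true XML guru.
print(phones)         # ['617.555.1212', '415.555.1212']
```

A real SOAP toolkit would also use xsi:type and the arrayType value to pick target data types; the sketch ignores both and keeps everything as strings.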

Pretty much any data can be encoded; there are no limits on the types of objects that can be represented. The schema fragment could have been autogenerated by introspecting some Java classes, for example. There are also ways to encode data without having to worry about the schema at all, using self-describing element names.

Linking Data
So far we've only considered scenarios in which the encoded data is part of the XML document describing a protocol message. This may create some problems for including preexisting XML content and waste space in the case of base-64 encoded binary objects. The alternative would be keeping the data outside the message and somehow bringing it in at the right time.

There are two general mechanisms for doing this. The first one comes straight out of XML 1.0. It involves external entity references that allow content external to an XML document to be brought in during processing. The second relies on explicit linking markup: many people in the industry prefer pure markup approaches and therefore favor using explicit link elements that comply with the XLink specification. Both methods could work. Both require extensions to the existing XML protocol toolsets.

Of course, there are purely application-based methods for linking. You could pass a URI known to mean "get the actual content here." However, this approach doesn't scale to generic data-encoding mechanisms because it requires application-level knowledge.

External content can be kept on a separate server to be delivered on demand. It can also be packaged together with the protocol message in a MIME envelope. In this case the links to it should probably use the MIME unique-content IDs (CIDs) for identification purposes. Traditionally, SOAP has steered clear of anything having to do with MIME. On the other hand, the ebXML Transport/Routing and Packaging working group is looking very seriously at multipart MIME messages. This historic difference is understandable when we consider that SOAP grew out of RPC work and the ebXML folks are focused on business messaging where, for example, an auto insurance claim might carry along several accident pictures. MIME offers a mechanism to combine the XML protocol message with the external content in a single package.
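
Packaging a message and an attachment together can be sketched with Python's standard email package. The claim/photo vocabulary, the Content-ID value, and the fake image bytes are all made up for the example. Note that this library base-64 encodes the binary part by default; a transport that allows raw binary parts could avoid that cost.

```python
from email import message_from_bytes
from email.mime.application import MIMEApplication
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

image_bytes = b"\xff\xd8\xff\xe0 not a real JPEG"   # placeholder picture data
cid = "photo-1@example.com"                          # hypothetical Content-ID

# The XML message refers to the picture by its Content-ID.
message_xml = f'<claim><photo href="cid:{cid}"/></claim>'

package = MIMEMultipart("related")
package.attach(MIMEText(message_xml, "xml", "utf-8"))
photo = MIMEApplication(image_bytes, "octet-stream")
photo.add_header("Content-ID", f"<{cid}>")
package.attach(photo)

wire = package.as_bytes()   # what would go over the transport

# Receiver: index the parts by Content-ID and look the reference up.
received = message_from_bytes(wire)
parts = {p["Content-ID"]: p for p in received.walk() if p["Content-ID"]}
restored = parts[f"<{cid}>"].get_payload(decode=True)
print(restored == image_bytes)   # True
```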

Choose Wisely
There are many ways to encode data in XML, and well-designed XML protocols will let you plug in any encoding style you choose. How should you make this important decision? First, of course, keep it simple. If possible, choose standards-based and well-deployed technology. Then consider your needs and match them against some of the important facets of XML data encoding:

  • Time efficiency: how fast can you serialize/deserialize data? This becomes particularly important in transaction-intensive systems. In some cases, if you know certain things about your data, you can use much higher performance encoding/decoding modules. For example, WDDX doesn't support directed labeled graphs; it only supports tree structures. However, because of this simplification, serialization and deserialization can fly.
  • Memory efficiency: how much memory do you need during serialization/deserialization? You may not care about this on an application server with 2 GB of RAM, but do you expect handheld devices to be able to make requests to your server? In general this is a bigger problem during deserialization. DOM-based deserializers are the biggest offenders because they need to instantiate so many objects in memory. SAX-based deserializers can do a much better job. High-performance XML protocol frameworks, such as the Apache SOAP Project, are developing innovative approaches to combine the speed of SAX with the ease of access that the DOM provides.
  • Transport efficiency: how do the sizes of the generated XML compare between encoding styles? Packing multimegabyte JPEGs as base-64 strings inside XML documents may not be the best way to use bits on the wire. Explore external linking mechanisms when bandwidth is of concern. Also, consider protocol bindings that allow for compression.
  • Flexibility: Can you encode abstract data models? Is there a limit on the types of data that can be represented? Can you link external content? Is the encoding format introspectable, that is, can someone do something meaningful with the data without having previously looked at its schema? This is important for service-discovery-type applications.
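
To illustrate the memory-efficiency point about SAX, here is a minimal Python sketch of a streaming deserializer. It keeps only the values it is asked for and never builds a tree of the whole document. The handler class name is invented, and the element names are borrowed from the earlier contact-numbers snippet.

```python
import xml.sax

class PhoneNumberHandler(xml.sax.ContentHandler):
    """Collect the text of every phoneNumber element as events stream by."""

    def __init__(self):
        super().__init__()
        self.numbers = []
        self._buffer = None          # None means "not inside a phoneNumber"

    def startElement(self, name, attrs):
        if name == "phoneNumber":
            self._buffer = []

    def characters(self, content):
        # Text may arrive in several chunks, so accumulate it.
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "phoneNumber":
            self.numbers.append("".join(self._buffer))
            self._buffer = None

doc = (
    b"<contactNumbers>"
    b"<phoneNumber>617.555.1212</phoneNumber>"
    b"<phoneNumber>415.555.1212</phoneNumber>"
    b"</contactNumbers>"
)

handler = PhoneNumberHandler()
xml.sax.parseString(doc, handler)
print(handler.numbers)   # ['617.555.1212', '415.555.1212']
```

The trade-off is visible even at this size: the handler has to track its own state, which is the convenience a DOM gives you for free at the cost of holding every node in memory.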

This space is evolving quite rapidly and the pending release of the XML Schema specification will add fuel to the fires of innovation. Fasten your seatbelts - there's little standardization in this space right now and there will be some turmoil before we emerge with sensible ways to approach the common data-encoding scenarios described here.

Although there's lots more ground to cover on this subject, I think I should move on quickly to try to stay on top of innovation in the XML protocol space. In the next XML in Transit column I'll take a look at the Web Services Description Language (WSDL), another hallmark joint effort by Microsoft and IBM. Keep it coming, guys.

Simeon Simeonov is CEO of FastIgnite, where he invests in and advises startups. He was chief architect or CTO at companies such as Allaire, Macromedia, Better Advertising and Thing Labs. He blogs at blog.simeonov.com, tweets as @simeons and lives in the Greater Boston area with his wife, son and an adopted dog named Tye.
