XML DTD and schema

xml dtd program examples and what a dtd or an xml schema does and xml dtd data types
GregDeamons,New Zealand,Professional
Published Date:03-08-2017
CHAPTER 15
XML

In this chapter:
• Languages and Metalanguages
• Documents and DTDs
• Understanding XML DTDs
• Element Grammar
• Element Attributes
• Conditional Sections
• Building an XML DTD
• Using XML

HTML is a maverick. It only loosely follows the rules of formal electronic document-markup design and implementation. The language was born out of the need to assemble text, graphics, and other digital content and send them over the global Internet. In the early days of the Web’s boom, the demand for better browsers and document servers (driven by hordes of new users with insatiable appetites for more and cooler web pages) left little time for worrying about things like standards and practices.

Of course, without guiding standards, HTML would eventually have devolved into Babel. That almost happened during the browser wars in the mid- to late 1990s. Chaos is not an acceptable foundation for an industry whose value is measured in the trillions of dollars. Although the standards people at the World Wide Web Consortium (W3C) managed to rein in the maverick HTML with standard version 4, it is still too wild for the royal herd of markup languages.

The HTML 4.01 standard is defined using the Standard Generalized Markup Language (SGML). While more than adequate for formalizing HTML, SGML is far too complex to use as a general tool for extending and enhancing HTML. Instead, the W3C has devised a standard known as the Extensible Markup Language, or XML. Based on the simpler features of SGML, XML is kinder, gentler, and more flexible, well suited to guiding the birth and orderly development of new markup languages. With XML, HTML is being reborn as XHTML.

In this chapter, we cover the basics of XML, including how to read it, how to create simple XML Document Type Definitions (DTDs), and the ways you might use XML to enhance your use of the Internet. In the next chapter, we explore the depths of XHTML.
You don’t have to understand everything there is to know about XML to write XHTML. We think it’s helpful, but if you want to cut to the chase, feel free to skip to the next chapter. Before you do, however, you may want to take a look at some of the uses of XML covered at the end of this chapter, starting with section 15.8.

This chapter provides only an overview of XML. Our goal is to whet your appetite and make you conversant in XML. For full fluency, consult Learning XML by Erik T. Ray or XML in a Nutshell by W. Scott Means and Elliotte Rusty Harold, both from O’Reilly.

15.1 Languages and Metalanguages

A language is composed of commonly accepted symbols that we assemble in a meaningful way in order to express ourselves and to pass along information that is intelligible to others. For example, English is a language with rules (grammar) that define how to put its symbols (words) together to form sentences, paragraphs, and, ultimately, books like the one you are holding. If you know the words and understand the grammar, you can read the book, even if you don’t necessarily understand its contents.

An important difference between human and computer-based languages is that human languages are self-describing. We use English sentences and paragraphs to define how to create correct English sentences and paragraphs. Our brains are marvelous machines that have no problem understanding that you can use a language to describe itself. However, computer languages are not so rich and computers are not so bright that you could easily define a computer language with itself. Instead, we define one language, a metalanguage, that defines the rules and symbols for other computer languages.

Software developers create the metalanguage rules and then define one or more languages based on those rules. The metalanguage also guides developers who create the automated agents that display or otherwise process the contents of documents that use its language(s).
XML is the metalanguage the W3C created and that developers use to define markup languages such as XHTML. Browser developers rely on XML’s metalanguage rules to create automated processes that read the language definition of XHTML and implement the processes that ultimately display or otherwise process XHTML documents.

The use of metalanguages has long been popular in the world of computer programming. The C programming language, for instance, has a set of rules and symbols defined by one of several metalanguages, including yacc. Developers use yacc to create compilers, which in turn process language source files into computer-intelligible programs (hence, its name: Yet Another Compiler Compiler). yacc’s only purpose is to help developers create new programming languages.

Why bother with a markup metalanguage? Because, as the familiar proverb goes, the W3C wants to teach us how to fish so that we can feed ourselves for a lifetime. With XML, there is a standardized way to define markup languages for different needs, instead of having to rely upon HTML extensions. Mathematicians need a way to express mathematical notations, for instance; composers need a way to present musical scores; businesses want their web sites to take sales orders from customers; physicians look to exchange medical records; plant managers want to run their factories from web-based documents. All of these groups need an acceptable, resilient way to express these different kinds of information so that the software industry can develop the programs that process and display these diverse documents.

XML provides the answer. Each content sector (the business group, the factory-automation consortium, a trade association) may define a markup language that suits its particular need for information exchange and processing over the Web.
Computer programmers then create XML-compliant processes (parsers) that read the new language definitions and allow the server to process the documents of those languages.

15.1.1 Creation Versus Display

While there is no limit to the kinds of markup languages that you can create with XML, displaying your documents may be more complicated. For instance, when you write HTML, a browser understands what to do with the <h1> tag because it is defined in the HTML DTD. With XML, you create the DTD. For example, wouldn’t a recipe DTD be a great way to capture and standardize all those kumquat recipes you’ve been collecting in your kitchen drawers? With special <ingredient> and <portion> tags, the recipes are easy to define and understand. However, browsers won’t know what to do with these new tags unless you attach a stylesheet that defines their handling. Without a stylesheet, XML-compliant browsers render these tags in a very generic way, certainly not the flourishing presentation your kumquat recipes deserve.

Even with stylesheets, there are limitations to presenting XML-based information. Let’s say you want to create something more challenging, such as a DTD for musical notation or silicon chip design. While describing these data types in a DTD is possible, displaying this information graphically is certainly beyond the capabilities of any stylesheets we’ve seen yet; properly displaying this type of graphically rich information would require a specialized rendering tool.

Nonetheless, your recipe DTD is a great tool for capturing and sharing recipes. As we’ll see later in this chapter, XML isn’t simply about creating markup languages for displaying content in browsers. It has great promise for sharing and managing information so that those precious kumquat dishes will be preserved for many generations to come.
Just bear in mind that, in addition to writing a DTD* to describe your new XML-based markup language, in most cases you will want to supplement the DTD with a stylesheet.†

* An alternative to DTDs is XML Schemas. Schemas offer features related to data typing and are more programmatically oriented than document-oriented. For more information, check out XML Schema by Eric van der Vlist (O’Reilly).

† In fact, it is possible to write XML documents using only a stylesheet. DTDs are highly recommended but optional. See http://www.w3c.org/TR/xml-stylesheet for details.

15.1.2 A Little History

To complete your education into the whys and wherefores of markup languages, it helps to know how all these markup languages came to be.

In the beginning, there was SGML. SGML was intended to be the only metalanguage from which all markup languages would derive. With SGML, you can define everything from hieroglyphics to HTML, negating the need for any other metalanguage.

The problem with SGML is that it is so broad and all-encompassing that mere mortals cannot use it. Using SGML effectively requires very expensive and complex tools that are completely beyond the scope of regular people who just want to bang out an HTML document in their spare time. As a result, developers created other markup languages that are greatly reduced in scope and are much easier to use. The HTML standards themselves were initially defined using a subset of SGML that eliminated many of its more esoteric features. The DTD in Appendix D uses this subset of SGML to define the HTML 4.01 standard.

Recognizing that SGML was too unwieldy to describe HTML in a useful way and that there was a growing need to define other HTML-like markup languages, the W3C defined XML. XML is a formal markup metalanguage that uses select features of SGML to define markup languages in a style similar to that of HTML.
It eliminates many SGML elements that aren’t applicable to languages such as HTML, and simplifies other elements to make them easier to use and understand.

XML is a middle ground between SGML and HTML, a useful tool for defining a wide variety of markup languages. XML is becoming increasingly important as the Web extends beyond browsers and moves into the realm of direct data interchange among people, computers, and disparate systems. A small number of people wind up creating new markup languages with XML, and many more people want to be able to understand XML DTDs in order to use all of these new markup languages.

15.2 Documents and DTDs

To be perfectly correct, we must explain that “XML” has come to mean many subtly different things. An XML document is a document containing content that conforms to a markup language defined from the XML standard. An XML Document Type Definition (XML DTD) is a set of rules, more formally known as entity and element declarations, that define an XML markup language; i.e., how the tags are arranged in a correct (valid) XML document.

To make things even more confusing, entity and element declarations may appear in an XML document itself, as well as within an XML DTD. An XML document contains character data, which consists of plain content and markup in the form of tags and XML declarations. Thus:

    <blah>harrumph</blah>

is a line in a well-formed XML document. Well-formed XML documents follow certain rules, such as the requirement for every tag to have a closing tag. These rules are presented in the context of XHTML in Chapter 16.

A valid XML document is one that conforms to a DTD. To be considered valid, an XML document must have a corresponding set of XML declarations that define how the tags and content should be arranged within it. These declarations may be included directly in the XML document, or they may be stored separately in an XML DTD.
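As a minimal sketch of the first option, the declarations can travel inside the document itself, in what the XML specification calls the internal subset. The one-line content rule for blah is our own assumption for illustration (#PCDATA, covered in section 15.4.4, simply means text); it is not taken from any published DTD:

```xml
<?xml version="1.0"?>
<!-- The declarations live between the square brackets of the DOCTYPE. -->
<!DOCTYPE blah [
  <!ELEMENT blah (#PCDATA)>
]>
<blah>harrumph</blah>
```

Because the rule for blah is carried by the document, a validating processor needs no separate DTD file to check it.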
If an XML DTD exists that defines the <blah> tag, our well-formed XML document is valid, provided you preface it with a <!DOCTYPE> tag that explains where to find the appropriate DTD:

    <?xml version="1.0"?>
    <!DOCTYPE blah SYSTEM "blah.dtd">
    <blah>harrumph</blah>

The example document begins with the optional <?xml> directive declaring the version of XML it uses. It then uses the <!DOCTYPE> directive to identify the DTD that some automated system, such as a browser,* uses to process and perhaps display the contents of the document. In this case, a DTD named blah.dtd should be accessible to the browser so that the browser can determine whether the <blah> tag is valid within the document.

XML DTDs contain only XML entity and element declarations. XML documents, on the other hand, may contain both XML element declarations and conventional content that uses those elements to create a document. This intermingling of content and declarations is perfectly acceptable to a computer processing an XML document, but it can get confusing for humans trying to learn about XML. For this reason, we focus our attention in this chapter on the XML entity and element declaration features that you can use to define new tags and document types. In other words, we are addressing only the DTD features of XML; the content features mirror the rules and requirements you already know and use in order to create HTML documents.

15.3 Understanding XML DTDs

To use a markup language defined with XML, you should be able to read and understand the elements and entities found in its XML DTD. But don’t be put off: while XML DTDs are verbose, filled with obscure punctuation, and designed primarily for computer consumption, they are actually easy to understand once you get past all the syntactic sugar. Remember, your brain is better at languages than any computer.

* We use the word browser here because that’s what most people will use to process and view XML documents.
The XML specification uses the more generic phrase “processing application” because, in some cases, the XML document will be processed not by a traditional browser, but by some other tool that knows how to interpret XML documents.

As we said previously, an XML DTD is a collection of XML entity and element declarations and comments. Entities are name/value pairs that make the DTD easier to read and understand, and elements are the actual markup tags defined by the DTD, such as HTML’s <p> and <h1> tags. The DTD also describes the content and grammar for each tag in the language. Along with the element declarations, you’ll also find attribute declarations that define the attributes authors may use with the tags defined by the element declarations.

There is no required order, although the careful DTD author arranges declarations in such a way that humans can easily find and understand them, computers notwithstanding. The beloved DTD author includes lots of comments, too, that explain the declarations and how they can be used to create a document. Throughout this chapter, we use examples taken from the XHTML 1.0 DTD, which you can find in its entirety at the W3C web site. Although it is lengthy, you’ll find this DTD to be well written, complete, and, with a little practice, easy to understand.

XML also provides for conditional sections within a DTD, allowing groups of declarations to be optionally included or excluded by the DTD parser. This is useful when a DTD actually defines several versions of a markup language; the desired version can be derived by including or excluding appropriate sections. The XHTML 1.0 DTD, for example, defines both the “regular” version of HTML and a version that supports frames. By allowing the parser to include only the appropriate sections of the DTD, the rules for the <html> tag can change to support either a <body> tag or a <frameset> tag, as needed.
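The idea can be sketched with two switches and two conditional sections. This is our own illustration rather than an excerpt from the XHTML DTD, and the parameter entity names with.frames and without.frames are hypothetical. Conditional sections are written in an external DTD file, not inside a document:

```xml
<!-- Hypothetical switches: swap the two values to select the frames version. -->
<!ENTITY % with.frames    "IGNORE">
<!ENTITY % without.frames "INCLUDE">

<!-- Skipped by the parser while %with.frames; expands to IGNORE. -->
<![%with.frames;[
  <!ELEMENT html (head, frameset)>
]]>

<!-- Processed by the parser while %without.frames; expands to INCLUDE. -->
<![%without.frames;[
  <!ELEMENT html (head, body)>
]]>
```

Only one of the two html declarations survives, so the parser never sees the element declared twice.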

15.3.1 Comments

The syntax for comments within an XML DTD is exactly like that for HTML comments: comments begin with <!-- and end with -->. The XML processor ignores everything between these two delimiters. Comments may not be nested.

15.3.2 Entities

An entity is a fancy term for a constant. Entities are crucial to creating modular, easily understood DTDs. Although they may differ in many ways, all entities associate a name with a string of characters. When you use the entity name elsewhere within a DTD, or in an XML document, language parsers replace the name with the corresponding characters. Drawing an example from HTML, the &lt; entity is replaced by the < character wherever it appears in an HTML document.

Entities come in two flavors: parsed and unparsed. An XML processor will handle parsed entities and ignore unparsed ones. The vast majority of entities are parsed. An unparsed entity is reserved for use within attribute lists of certain tags; it is nothing more than a replacement string used as a value for a tag attribute.

You can further divide the group of parsed entities into general entities and parameter entities. General entities are used in the XML document, and parameter entities are used in the XML DTD.

You may not realize that you’ve been using general entities within your HTML documents all along. They’re the ones that have an ampersand (&) character preceding their name. For example, the entity for the copyright symbol (©), written &copy;, is a general entity defined in the HTML DTD. Appendix F lists all of the other general entities you know and love.

To make life easier, XML predefines the five most common general entities, which you can use in any XML document. While it is still preferred that they be explicitly defined in any DTD that uses them, these five entities are always available to any XML author:

    &amp;     &
    &apos;    '
    &gt;      >
    &lt;      <
    &quot;    "

You’ll find parameter entities littered throughout any well-written DTD, including the HTML DTD.
Parameter entities have a percent sign (%) preceding their names. The percent sign tells the XML processor to look up the entity name in the DTD’s list of parameter entities, insert the value of the entity into the DTD in place of the entity reference, and process the value of the entity as part of the DTD.

That last bit is important. By processing the contents of the parameter entity as part of the DTD, the XML processor allows you to place any valid XML content in a parameter entity. Many parameter entities contain lengthy XML definitions and may even contain other entity definitions. Parameter entities are the workhorses of the XML DTD; creating DTDs without them would be extremely difficult.

C and C++ programmers may recognize that the entity mechanism in XML is similar to the #define macro mechanism in C and C++. The XML entities provide only simple character-string substitution and do not employ C’s more elaborate macro parameter mechanism.

15.3.3 Entity Declarations

Let’s define an entity with the <!ENTITY> tag in an XML DTD. Inside the tag, first supply the entity name and value, and then indicate whether it is a general or a parameter entity:

    <!ENTITY name value>
    <!ENTITY % name value>

The first version creates a general entity; the second, because of the percent sign, creates a parameter entity.

For both entity types, the name is simply a sequence of characters beginning with a letter, colon, or underscore and followed by any combination of letters, numbers, periods, hyphens, underscores, or colons. The only restriction is that names may not begin with a symbol other than the colon or underscore, or the sequence “xml” (either upper- or lowercase).

The entity value is either a character string within quotes (unlike HTML markup, you must use quotes even if it is a string of contiguous letters) or a reference to another document containing the value of the entity.
For these external entity values, you’ll find either the keyword SYSTEM, followed by the URL of the document containing the entity value, or the keyword PUBLIC, followed by the formal name of the document and its URL.

A few examples will make this clear. Here is a simple general entity declaration:

    <!ENTITY fruit "kumquat or other similar citrus fruit">

In this declaration, the entity reference &fruit; within the document is replaced with the phrase “kumquat or other similar citrus fruit” wherever it appears.

Similarly, here is a parameter entity declaration:

    <!ENTITY % ContentType "CDATA">

Anywhere the reference %ContentType; appears in your DTD, it is replaced with the word CDATA. This is the typical way to use parameter entities: to create a more descriptive term for a generic parameter that will be used many times in a DTD.

Here is an external general entity declaration:

    <!ENTITY boilerplate SYSTEM "http://server.com/boilerplate.txt">

It tells the XML processor to retrieve the contents of the file boilerplate.txt from server.com and use it as the value of the boilerplate entity. Anywhere you use &boilerplate; in your document, the contents of the file are inserted as part of your document content.

Here is an external parameter entity declaration, lifted from the HTML DTD, which references a public external document:

    <!ENTITY % HTMLlat1 PUBLIC "-//W3C//ENTITIES Latin 1 for XHTML//EN" "xhtml-lat1.ent">

It defines an entity named HTMLlat1 whose contents are to be taken from the public document identified as -//W3C//ENTITIES Latin 1 for XHTML//EN. If the processor does not have a copy of this document available, it can use the URL xhtml-lat1.ent to find it. This particular public document is actually quite lengthy, containing all of the general entity declarations for the Latin 1 character encodings for HTML. Accordingly, simply writing this in the HTML DTD:

    %HTMLlat1;

causes all of those general entities to be defined as part of the language.
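To see both kinds of entity working together, here is a small sketch split across two files. The file name recipe.dtd, the recipe element, and the Text parameter entity are our own assumptions for illustration. First, the DTD, where the parameter entity is declared and then referenced inside an element declaration:

```xml
<!-- Parameter entity: may be referenced only within the DTD. -->
<!ENTITY % Text "#PCDATA">

<!-- General entity: may be referenced in documents that use this DTD. -->
<!ENTITY fruit "kumquat or other similar citrus fruit">

<!-- After substitution this reads: <!ELEMENT recipe (#PCDATA)> -->
<!ELEMENT recipe (%Text;)>
```

Then a document that points at that DTD and uses the general entity:

```xml
<?xml version="1.0"?>
<!DOCTYPE recipe SYSTEM "recipe.dtd">
<recipe>Slice one &fruit; thinly.</recipe>
```

A processor that reads the external DTD replaces &fruit; with the full phrase. Some non-validating processors skip external DTDs, in which case the reference may be left unresolved, so treat this as an illustration rather than a guarantee for every tool.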
You can read the Latin 1 entity file for yourself at http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent.

A DTD author can use the PUBLIC and SYSTEM external values with general and parameter entity declarations. You should structure your external definitions to make your DTDs and documents easy to read and understand.

You’ll recall that we began the section on entities with a mention of unparsed entities whose only purpose is to be used as values to certain attributes. You declare an unparsed entity by appending the keyword NDATA to an external general entity declaration, followed by the name of a notation that identifies the entity’s format. If we wanted to convert our general boilerplate entity to an unparsed general entity for use as an attribute value, we could say:

    <!ENTITY boilerplate SYSTEM "http://server.com/boilerplate.txt" NDATA text>

With this declaration, attributes defined as type ENTITY (as described in section 15.5.1) could use boilerplate as one of their values.

15.3.4 Elements

Elements are definitions of the tags that you can use in documents based on your XML markup language. In some ways, element declarations are easier than entity declarations because all you need to do is specify the name of the tag and what sort of content that tag may contain:

    <!ELEMENT name contents>

The name follows the same rules as names for entity definitions. The contents section may be one of four types described here:
• The keyword EMPTY defines a tag with no content, such as <hr> and <br> in HTML. Empty elements in XML get a bit of special handling, as described in section 15.4.5.
• The keyword ANY indicates that the tag can have any content, without restriction or further processing by the XML processor.
• The content may be a set of grammar rules that defines the order and nesting of tags within the defined element. You use this content type when the tag being defined contains only other tags, without conventional content allowed directly within the tag.
In HTML, the <ul> tag is such a tag, as it can contain only <li> tags.
• Mixed content, denoted by the keyword #PCDATA followed by a pipe-separated list of element names, is enclosed in parentheses and followed by an asterisk. This content type allows tags to have user-defined content, along with other markup elements. The <li> tag, for example, may contain user-defined content as well as other tags.

The last two content types form the meat of most DTD element declarations. This is where the fun begins.

15.4 Element Grammar

The grammar of human language is rich with a variety of sentence structures, verb tenses, and all sorts of irregular constructs and exceptions to the rules. Nonetheless, you mastered most of it by the age of three. Computer language grammars typically are simple and regular, and have few exceptions. In fact, computer grammars use only four rules to define how elements of a language may be arranged: sequence, choice, grouping, and repetition.

15.4.1 Sequence, Choice, Grouping, and Repetition

Sequence rules define the exact order in which elements appear in a language. For instance, if a sequence grammar rule states that element A is followed by B and then by C, your document must provide elements A, B, and C in that exact order. A missing element (A and C, but no B, for example), an extra element (A, B, E, then C), or an element out of place (C, A, then B) violates the rule and does not match the grammar.

In many grammars, XML included, sequences are defined by simply listing the appropriate elements, in order and separated by commas. Accordingly, our example sequence in the DTD would appear simply as A, B, C.

Choice grammar rules provide flexibility by letting the document author choose one element from among a group of valid elements. For example, a choice rule might state that you may choose elements D, E, or F; any one of these three elements would satisfy the grammar.
Like many other grammars, XML denotes choice rules by listing the appropriate choices separated by a pipe character (|). Thus, we could write our simple choice in the DTD as D | E | F. If you read the vertical bar as the word or, choice rules become easy to understand.

Grouping rules collect two or more rules into a single rule, building richer, more usable languages. For example, a grouping rule might allow a sequence of elements, followed by a choice, followed by a sequence. You can indicate groups within a rule by enclosing them in parentheses in the DTD. For example:

    Document ::= A, B, C, (D | E | F), G

requires that a document begin with elements A, B, and C, followed by a choice of one element out of D, E, or F, followed by element G.

Repetition rules let you repeat one or more elements some number of times. With XML, as with many other languages, you denote repetition by appending a special character suffix to an element or group within a rule. Without the special character, that element or group must appear exactly once in the rule. Special characters include the plus sign (+), meaning that the element may appear one or more times in the document; the asterisk (*), meaning that the element may appear zero or more times; and the question mark (?), meaning that the element may appear either zero or one time.

For example, the rule:

    Document ::= A, B?, C*, (D | E | F)+, G*

creates an unlimited number of correct documents with the elements A through G. According to the rule, each document must begin with A, optionally followed by B, followed by zero or more occurrences of C, followed by at least one, but perhaps more, of either D, E, or F, followed by zero or more Gs. All of the following examples (and many others) match this rule:

    ABCDG
    ACCCFFGGG
    ACDFDFGG

You might want to work through these examples to prove to yourself that they are, in fact, correct with respect to the repetition rule.
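The same rule can be written as a real element declaration, which a validating processor can then enforce. In this sketch we assume, purely for illustration, that A through G are empty elements (written with the shorthand described in section 15.4.5); the document body spells out the third example, ACDFDFGG:

```xml
<?xml version="1.0"?>
<!DOCTYPE Document [
  <!-- A once, B optional, any number of C, one or more of D, E, or F, any number of G. -->
  <!ELEMENT Document (A, B?, C*, (D | E | F)+, G*)>
  <!-- Assumption for the sketch: every letter is an empty element. -->
  <!ELEMENT A EMPTY>
  <!ELEMENT B EMPTY>
  <!ELEMENT C EMPTY>
  <!ELEMENT D EMPTY>
  <!ELEMENT E EMPTY>
  <!ELEMENT F EMPTY>
  <!ELEMENT G EMPTY>
]>
<Document><A/><C/><D/><F/><D/><F/><G/><G/></Document>
```

Removing the A element, or moving a G ahead of the D/E/F group, would leave the document well formed but no longer valid against this declaration.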

15.4.2 Multiple Grammar Rules

By now, you can probably imagine that specifying an entire language grammar in a single rule is difficult, although possible. Unfortunately, the result would be an almost unreadable sequence of nearly unintelligible rules. To remedy this situation, the items in a rule may themselves be rules containing other elements and rules. In these cases, the items in a grammar that are themselves rules are known as nonterminals, and the items that are elements in the language are known as terminals. Eventually, all the nonterminals must reference rules that create sequences of terminals, or the grammar would never produce a valid document.

For example, we can express our sample grammar in two rules:

    Document ::= A, B?, C*, Choices+, G*
    Choices ::= D | E | F

In this example, Document and Choices are nonterminals, and A, B, C, D, E, F, and G are terminals.

There is no requirement in XML (or most other grammars) that dictates or limits the number of nonterminals in your grammar. Most grammars use nonterminals wherever it makes sense for clarity and ease of use.

15.4.3 XML Element Grammar

The rules for defining the contents of an element match the grammar rules we just discussed. You may use sequences, choices, groups, and repetition to define the allowable contents of an element. The nonterminals in rules must be names of other elements defined in your DTD.

A few examples show how this works. Consider the declaration of the <html> tag, taken from the HTML DTD:

    <!ELEMENT html (head, body)>

This defines the element named html whose content is a head element followed by a body element. Notice you do not enclose the element names in angle brackets within the DTD; you use that notation only when the elements are actually used in a document.

Within the HTML DTD, you can find the declaration of the <head> tag:

    <!ELEMENT head (%head.misc;,
        ((title, %head.misc;, (base, %head.misc;)?) |
         (base, %head.misc;, (title, %head.misc;))))>

Gulp.
What on Earth does this mean? First, notice that a parameter entity named head.misc appears several times in this declaration. Let’s go get it:

    <!ENTITY % head.misc "(script|style|meta|link|object)*">

Now things are starting to make sense: head.misc defines a group of elements, from which you may choose one. However, the trailing asterisk indicates that you may include zero or more of these elements. The net result is that anywhere %head.misc; appears, you can include zero or more script, style, meta, link, or object elements, in any order. Sound familiar?

Returning to the head declaration, we see that we are allowed to begin with any number of the miscellaneous elements. We must then make a choice: either a group consisting of a title element, optional miscellaneous items, and an optional base element followed by miscellaneous items; or a group consisting of a base element, miscellaneous items, a title element, and some more miscellaneous items.

Why such a convoluted rule for the <head> tag? Why not just write:

    <!ELEMENT head (script|style|meta|link|object|base|title)*>

which allows any number of head elements to appear, or none at all? The HTML standard requires that every <head> tag contain exactly one <title> tag. It also allows for only one <base> tag, if any. Otherwise, the standard does allow any number of the other head elements, in any order.

Put simply, the head element declaration, while initially confusing, forces the XML processor to ensure that exactly one title element appears in the head element and that, if specified, just one base element appears as well. It then allows for any of the other head elements, in any order.

This one example demonstrates a lot of the power of XML: the ability to define commonly used elements using parameter entities and the use of grammar rules to dictate document syntax. If you can work through the head element declaration and understand it, you are well on your way to reading any XML DTD.
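To make the rule concrete, here is a head fragment that we believe satisfies the first branch of the declaration. The title text, file name, and URL are invented for illustration:

```xml
<!-- Order: misc (meta), title, misc (link), then the optional base. -->
<head>
  <meta name="author" content="A. Writer"/>
  <title>Kumquat recipes</title>
  <link rel="stylesheet" href="recipes.css"/>
  <base href="http://www.example.com/"/>
</head>

<!-- These variations would not match the declaration:
       * a head with no title element at all;
       * a head with two title elements;
       * a head with two base elements. -->
```

Walking the fragment through the declaration one element at a time is a good way to check your reading of the grammar.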

15.4.4 Mixed Element Content

Mixed element content extends the element grammar rules to include the special #PCDATA keyword. PCDATA stands for “parsed character data” and signifies that the content of the element will be parsed by the XML processor for general entity references. After the entities are replaced, the character data is passed to the XML application for further processing.

What this boils down to is that parsed character data is the actual content of your XML document. Elements that accept parsed character data may contain plain old text, plus whatever other tags you allow, as defined in the DTD.

For instance:

    <!ELEMENT title (#PCDATA)>

means that the title element may contain only text with entities. No other tags are allowed, just as in the HTML standard.

A more complex example is the <p> tag, whose element declaration is:

    <!ELEMENT p %Inline;>

Another parameter entity, %Inline;, is defined in the HTML DTD as:

    <!ENTITY % Inline "(#PCDATA | %inline; | %misc;)*">

which expands to these entities when you replace the parameters:

    <!ENTITY % special "br | span | bdo | object | img | map">
    <!ENTITY % fontstyle "tt | i | b | big | small">
    <!ENTITY % phrase "em | strong | dfn | code | q | sub | sup | samp | kbd | var | cite | abbr | acronym">
    <!ENTITY % inline.forms "input | select | textarea | label | button">
    <!ENTITY % misc "ins | del | script | noscript">
    <!ENTITY % inline "a | %special; | %fontstyle; | %phrase; | %inline.forms;">

What do we make of all this? The %Inline; entity defines the contents of the p element as parsed character data, plus any of the elements defined by %inline; and any defined by %misc;. Note that case does matter: %Inline; is different from %inline;.

The %inline; entity includes lots of stuff: special elements, font-style elements, phrase elements, and inline form elements. %misc; includes the ins, del, script, and noscript elements. You can read the HTML DTD for the other entity declarations to see which elements are also allowed as the contents of a p element.
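A scaled-down version of the same pattern may be easier to digest. In this sketch the note element and its short list of permitted children are our own invention, not part of the HTML DTD:

```xml
<?xml version="1.0"?>
<!DOCTYPE note [
  <!-- Mixed content: text interleaved with em and strong, in any order and number. -->
  <!ELEMENT note (#PCDATA | em | strong)*>
  <!ELEMENT em (#PCDATA)>
  <!ELEMENT strong (#PCDATA)>
]>
<note>Wash the <em>ripe</em> kumquats, but <strong>never</strong> the green ones.</note>
```

Notice that #PCDATA comes first in the group and that the group carries a trailing asterisk, just as in the %Inline; definition.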
Why did the HTML DTD authors break up all these elements into separate groups? If they were simply defining elements to be included in the p element, they could have built a single long list. However, HTML has rules that govern where inline elements may appear in a document. The authors grouped elements that are treated similarly into separate entities that could be referenced several times in the DTD. This makes the DTD easier to read and understand, as well as easier to maintain when a change is needed.

15.4.5 Empty Elements

Elements whose content is defined to be empty deserve a special mention. XML introduced notational rules for empty elements, different from the traditional HTML rules that govern them.

HTML authors are used to specifying an empty element as a single tag, such as <br> or <img>. XML requires that every element have an opening and a closing tag, so an image tag would be written as <img></img>, with no embedded content. Other empty elements would be written in a similar manner.

Because this format works well for nonempty tags but is a bit of overkill for empty ones, you can use a special shorthand notation for empty tags. To write an empty tag in XML, just place a slash (/) immediately before the closing angle bracket of the tag. Thus, you can write a line break as <br/> and an image tag as <img src="myimage.gif"/>. Notice that the attributes of the empty element, if any, appear before the closing slash and bracket.

15.5 Element Attributes

The final piece of the DTD puzzle involves attributes. You know attributes: they are the name/value pairs included with tags in your documents that control the behavior and appearance of those tags. To define attributes and their allowed values within an XML DTD, use the <!ATTLIST> directive:

    <!ATTLIST element attributes>

The element is the name of the element to which the attributes apply. The attributes are a list of attribute declarations for the element.
Each attribute declaration in this list consists of an attribute name, its type, and its default value, if any.

15.5.1 Attribute Values

Attribute values can be of several types, each denoted in an attribute definition with one of the following keywords:

CDATA
    Indicates that the attribute value is a character or string of characters. This is the attribute type you would use to specify URLs or other arbitrary user data. For example, the src attribute of the img tag in HTML is of type CDATA.

ID
    Indicates that the attribute value is a unique identifier within the scope of the document. This attribute type is used with an attribute, such as the HTML id attribute, whose value defines an ID within the document, as discussed in “Core Attributes” in Appendix B.

IDREF or IDREFS
    Indicate that the attribute accepts an ID defined elsewhere in the document via an attribute of type ID. You use the ID type when defining IDs; you use IDREF and IDREFS when referencing a single ID and a list of IDs, respectively.

ENTITY or ENTITIES
    Indicate that the attribute accepts the name or list of names of unparsed general entities defined elsewhere in the DTD. The definition and use of unparsed general entities is covered in section 15.3.2.

NMTOKEN or NMTOKENS
    Indicate that the attribute accepts a valid XML name token or a list of name tokens. These tokens are given to the processing application as the value of the attribute. The application determines how they are used.

In addition to these keyword-based types, you can create an enumerated type by listing the specific values allowed with this attribute. To create an enumerated type, list the allowed values, separated by pipe characters and enclosed in parentheses, as the type of the attribute.
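The types above can be seen side by side in a small sketch. The tasks vocabulary here is entirely hypothetical and is chosen only so that each attribute type has something plausible to hold; the #REQUIRED and #IMPLIED keywords that end most of the declarations are explained in section 15.5.2.

```xml
<?xml version="1.0"?>
<!DOCTYPE tasks [
  <!ELEMENT tasks (task+)>
  <!ELEMENT task (#PCDATA)>
  <!ATTLIST task
      id        ID                 #REQUIRED
      after     IDREFS             #IMPLIED
      owner     NMTOKEN            #IMPLIED
      notes     CDATA              #IMPLIED
      priority  (low|normal|high)  "normal">
]>
<tasks>
  <task id="t1" owner="chuck">Pour foundation</task>
  <task id="t2" after="t1" priority="high">Frame walls</task>
  <task id="t3" after="t1 t2" notes="Weather permitting">Shingle roof</task>
</tasks>
```

A validating parser would reject a second task with id="t1" (IDs must be unique) or an after value naming an ID that appears nowhere in the document.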
For example, here is how the method attribute for the form tag is defined in the HTML DTD:

    method (get|post) "get"

The method attribute accepts one of two values, either get or post; get is the default value if nothing is specified in the document tag.

15.5.2 Required and Default Attributes

After you define the name and type of an attribute, you must specify how the XML processor should handle default or required values for the attribute. You do this by supplying one of four values after the attribute type.

If you use the #REQUIRED keyword, the associated attribute must always be provided when the element is used in a document. Within the XHTML DTD, the src attribute of the img tag is required because an image tag makes no sense without an image to display.

The #IMPLIED keyword means that the attribute may be used but is not required and that no default value is associated with the attribute. If it is not supplied by the document author, the attribute has no value when the XML processor handles the element. For the img tag, the width and height attributes are implied because the browser derives sizing information from the image itself if these attributes are not specified.

If you specify a value, it then becomes the default value for that attribute. If the document author does not specify a value for the attribute, the XML processor inserts the default value (the value specified in the DTD).

If you precede the default value with the keyword #FIXED, the value is not only the default value for the attribute, it is the only value that can be used with that attribute if it is specified.

For example, examine the attribute list for the form element, taken (and abridged) from the HTML DTD:

    <!ATTLIST form
        action          CDATA       #REQUIRED
        method          (get|post)  "get"
        enctype         CDATA       "application/x-www-form-urlencoded"
        onsubmit        CDATA       #IMPLIED
        onreset         CDATA       #IMPLIED
        accept          CDATA       #IMPLIED
        accept-charset  CDATA       #IMPLIED
    >

This example associates seven attributes with the form element.
The action attribute is required and accepts a character string value. The method attribute has one of two values, either get or post. get is the default, so if the document author doesn’t include the method attribute in the form tag, the XML parser assumes method="get" automatically. The enctype attribute for the form element accepts a character string value and, if not specified, defaults to a value of application/x-www-form-urlencoded. The remaining attributes all accept character strings, are not required, and have no default values if they are not specified.

If you look at the attribute list for the form element in the HTML DTD, you’ll see that it does not exactly match our example. That’s because we’ve modified our example to show the types of the attributes after any parameter entities have been expanded. In the actual HTML DTD, the attribute types are provided as parameter entities whose names give a hint of the kinds of values the attribute expects. For example, the type of the action attribute appears as %URI;, not CDATA, but elsewhere in the DTD %URI; is defined to be CDATA. By using this style, the DTD author lets you know that the string value for this attribute should be a URL, not just any old string. Similarly, the type of the onsubmit and onreset attributes is given as %Script;. This is a hint that the character string value should name a script to be executed when the form is submitted or reset.

15.6 Conditional Sections

As we mentioned earlier in this chapter, XML lets you include or ignore whole sections of your DTD, so you can tailor the language for alternative uses. The HTML DTD, for instance, defines transitional, strict, and frame-based versions of the language. DTD authors can select the portions of the DTD they plan to include or ignore by using XML conditional directives:

    <![INCLUDE[ ...any XML content... ]]>

or:

    <![IGNORE[ ...any XML content... ]]>

The XML processor either includes or ignores the contents, respectively.
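As a concrete instance of the two directives, the sketch below wraps real declarations in each kind of section. The note and draft-note elements are hypothetical names used only for illustration.

```xml
<!-- External DTD fragment. Conditional sections are not allowed in a
     document's internal subset. -->
<![INCLUDE[
  <!ELEMENT note (#PCDATA)>
  <!ATTLIST note lang CDATA #IMPLIED>
]]>

<![IGNORE[
  <!-- Skipped entirely: draft-note is never declared -->
  <!ELEMENT draft-note (#PCDATA)>
]]>
```

With this fragment in effect, a document may use note but a validating parser reports draft-note as undeclared.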
Conditional sections may be nested, with the caveat that all sections contained within an ignored section are ignored, even if they are set to be included.

You rarely see a DTD with the INCLUDE and IGNORE keywords spelled out. Instead, you see parameter entities that document why the section is being included or ignored. Suppose you are creating a DTD to exchange construction plans among builders. Because you have an international customer base, you build a DTD that can handle both English and metric units. You might define two parameter entities:

    <!ENTITY % English "INCLUDE">
    <!ENTITY % Metric "IGNORE">

You would then place all the English-specific declarations in a conditional section and isolate the metric declarations similarly:

    <![%English;[ ...English stuff here... ]]>
    <![%Metric;[ ...Metric stuff here... ]]>

To use the DTD for English construction jobs, define %English; as INCLUDE and %Metric; as IGNORE, which causes your DTD to use the English declarations. For metric construction, reverse the two settings, ignoring the English section and including the metric section.

15.7 Building an XML DTD

Now that we’ve emerged from the gory details of XML DTDs, let’s see how they work by creating a simple example. You can create a DTD with any text editor and a clear idea of how you want to mark up your XML documents. You’ll need an XML parser and processing application to actually interpret and use your DTD, as well as a stylesheet to permit XML-capable browsers to display your document.

15.7.1 An XML Address DTD

Let’s create a simple XML DTD that defines a markup language for specifying documents containing names and addresses. We start with an address element, which contains other elements that tag the address contents. Our address element has a single attribute indicating whether it is a business or a home address:

    <!ELEMENT address (name, street+, city, state, zip?)>
    <!ATTLIST address type (home|business) #REQUIRED>

Voilà! The first declaration creates an element named address that contains a name element, one or more street elements, a city and state element, and an optional zip element. The address element has a single attribute, type, which must be specified and can have a value of either home or business.

Let’s define the name elements first:

    <!ELEMENT name (first, middle?, last)>
    <!ELEMENT first (#PCDATA)>
    <!ELEMENT middle (#PCDATA)>
    <!ELEMENT last (#PCDATA)>

The name element also contains other elements (a first name, an optional middle name, and a last name), each defined in the subsequent DTD lines. These three elements have no nested tags and contain only parsed character data; i.e., the actual name of the person.

The remaining address elements are easy, too:

    <!ELEMENT street (#PCDATA)>
    <!ELEMENT city (#PCDATA)>
    <!ELEMENT state (#PCDATA)>
    <!ELEMENT zip (#PCDATA)>
    <!ATTLIST zip length CDATA "5">

All these elements contain parsed character data. The zip element has an attribute named length that indicates the length of the zip code. If the length attribute is not specified, it is set to 5.

15.7.2 Using the Address DTD

Once we have defined our address DTD, we can use it to mark up address documents. For example:

    <address type="home">
      <name>
        <first>Chuck</first>
        <last>Musciano</last>
      </name>
      <street>123 Kumquat Way</street>
      <city>Cary</city>
      <state>NC</state>
      <zip length="10">27513-1234</zip>
    </address>

With an appropriate XML parser and an application to use this data, we can parse and store addresses, create addresses to share with other people and applications, and create display tools that would publish addresses in a wide range of styles and media. Although our DTD is simple, it has defined a standard way to capture address data that is easy to use and understand.

15.8 Using XML

Our address example is trivial. It hardly scratches the surface of the wide range of applications that XML is suited for.
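Before leaving the address example, it may help to see the pieces from section 15.7 assembled into one file. The sketch below places the declarations in the document's internal subset, which is only one way to attach a DTD and is an assumption here, since the text does not say how the DTD is delivered. The person and street values are invented.

```xml
<?xml version="1.0"?>
<!DOCTYPE address [
  <!ELEMENT address (name, street+, city, state, zip?)>
  <!ATTLIST address type (home|business) #REQUIRED>
  <!ELEMENT name (first, middle?, last)>
  <!ELEMENT first (#PCDATA)>
  <!ELEMENT middle (#PCDATA)>
  <!ELEMENT last (#PCDATA)>
  <!ELEMENT street (#PCDATA)>
  <!ELEMENT city (#PCDATA)>
  <!ELEMENT state (#PCDATA)>
  <!ELEMENT zip (#PCDATA)>
  <!ATTLIST zip length CDATA "5">
]>
<address type="business">
  <name>
    <first>Pat</first>
    <last>Example</last>
  </name>
  <street>1 Sample Plaza</street>
  <street>Suite 200</street>
  <city>Cary</city>
  <state>NC</state>
  <zip>27513</zip>
</address>
```

This instance exercises three rules from the DTD: the optional middle element is omitted, street+ permits the second street line, and the zip element picks up the default length of 5 because none is given. Any validating parser can check such a file; with libxml2 installed, for example, `xmllint --valid --noout address.xml` is one option, assuming the file is saved under that name.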
To whet your appetite, here are some common uses for XML that you will certainly be seeing now and in the future.

15.8.1 Creating Your Own Markup Language

We touched on this earlier when we mentioned that the latest versions of HTML are being reformulated as compliant XML DTDs. We cover the impact XML has on HTML in the next chapter.

But even more significantly, XML enables communities of users to create languages that best capture their unique data and ideas. Mathematicians, chemists, musicians, and professionals from hundreds of other disciplines can create special tags that represent unique concepts in a standardized way. Even if no browser exists that can accurately render these tags in a displayable form, the ability to capture and standardize information is tremendously important for future extraction and interpretation of these ideas.

For more mainstream XML applications with established audiences, it is easy to envision custom browsers being created to appropriately display the information. Smaller applications or markets may have more of a challenge creating markup languages that enjoy such wide acceptance. Creating the custom display tool for a markup language is difficult; delivering that tool for multiple platforms is expensive. As we’ve noted, you can mitigate some of these display concerns through appropriate use of stylesheets. Luckily, XML’s capabilities extend beyond document display.

15.8.2 Document Exchange

Because XML grew out of the tremendous success of HTML, many people think of XML as yet another document-display tool. In fact, the real power of XML lies not in the document-display arena, but in the world of data capture and exchange.

Despite the billions of computers deployed worldwide, sharing data is as tedious and error-prone as ever. Competing applications do not operate from common document-storage formats, so sending a single document to a number of recipients is fraught with peril.
Even when vendors attempt to create an interchange format, it still tends to be proprietary and often is viewed as a competitive advantage for participating vendors. There is little incentive for vendors to release application code for the purpose of creating easy document-exchange tools.

XML avoids these problems. It is platform neutral, is generic, and can perform almost any data-capture task. It is equally available to all vendors and can easily be integrated into most applications. The stabilization of the XML standard and the increasing availability of XML authoring and parsing tools are making it easier to create XML markup languages for document capture and exchange.

Most importantly, document exchange rarely requires document presentation, thus eliminating “display difficulties” from the equation. Often, an existing application uses XML to include data from another source and then uses its own internal display capabilities to present the data to the end user. The cost of adding XML-based data exchange to existing applications is relatively small.

15.8.3 Connecting Systems

A level below applications, there is also a need for systems to exchange data. As business-to-business communication increases, this need grows even faster. In the past, this meant that someone had to design a protocol to encode and exchange the data. With XML, exchanging data is as easy as defining a DTD and integrating the parser into your existing applications.

The data sets exchanged can be quite small. Imagine shopping for a new PC on the Web. If you could capture your system requirements as a small document using an XML DTD, you could send that specification to a hundred different vendors to quote you a system. If you extend that model to include almost anything you can shop for, from cars to hot tubs, XML provides an elegant base layer of communication among cooperating vendors on the Internet.
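To give a feel for how small such a specification could be, here is a sketch of a PC quote request. Everything in it (the quote-request vocabulary, the attribute names, the currency and unit lists) is hypothetical; no such DTD is defined in this chapter or, as far as this example assumes, anywhere else.

```xml
<?xml version="1.0"?>
<!DOCTYPE quote-request [
  <!ELEMENT quote-request (cpu, memory, storage, display?)>
  <!ATTLIST quote-request currency (USD|EUR|NZD) "USD">
  <!ELEMENT cpu (#PCDATA)>
  <!ATTLIST cpu min-cores CDATA #IMPLIED>
  <!ELEMENT memory (#PCDATA)>
  <!ATTLIST memory unit (MB|GB) "GB">
  <!ELEMENT storage (#PCDATA)>
  <!ATTLIST storage unit (GB|TB) "GB">
  <!ELEMENT display (#PCDATA)>
]>
<quote-request>
  <cpu min-cores="4">any 64-bit processor</cpu>
  <memory>16</memory>
  <storage unit="TB">1</storage>
</quote-request>
```

Because the defaults live in the DTD, every vendor's validating parser reads this request the same way: quoted in USD, with 16 GB of memory and 1 TB of storage.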
Almost any data that is captured and stored can more easily be shared using XML.

For many systems, the XML DTDs may define a data-transfer protocol and nothing more. The data may never actually be stored using the XML-defined markup; it may exist in an XML-compatible form only long enough to pass on the wire between two systems.

One increasingly popular use of XML is web services, which make it possible for diverse applications to discover each other and exchange data seamlessly over the Internet, regardless of their programming language or architecture. For more information on web services, consult Web Services Essentials by Ethan Cerami (O’Reilly).

In conjunction with XML-based data exchange, the Extensible Stylesheet Language, or XSL, is increasingly being used to describe the appearance and definition of the data represented by these XML DTDs. Much like Cascading Style Sheets (CSS) and its ability to transform HTML documents, XSL supports the creation of stylesheets for any XML DTD. You can use CSS with XML documents as well, but it is not as programmatically rich as XSL. While CSS stops with stylesheets, XSL is a style language. XSL certainly addresses the need for data display, and it provides rich tools that allow data represented with one DTD to be transformed into another DTD in a controlled and deterministic fashion. A complete discussion of XSL is beyond the scope of this book; consult XSLT by Doug Tidwell (O’Reilly) for complete details.
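Although XSL is out of scope here, a very small taste may show what “transforming one DTD into another” looks like in practice. The sketch below is an XSLT 1.0 stylesheet that turns a document written with the address DTD from section 15.7 into an HTML paragraph. The output layout and the use of the type attribute as a class name are choices made for this example only.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html"/>

  <!-- Turn each address element into one HTML paragraph -->
  <xsl:template match="address">
    <p class="{@type}">
      <xsl:value-of select="name/first"/>
      <xsl:text> </xsl:text>
      <xsl:value-of select="name/last"/>
      <br/>
      <!-- street+ in the DTD means there may be several lines -->
      <xsl:for-each select="street">
        <xsl:value-of select="."/>
        <br/>
      </xsl:for-each>
      <xsl:value-of select="city"/>
      <xsl:text>, </xsl:text>
      <xsl:value-of select="state"/>
      <xsl:text> </xsl:text>
      <xsl:value-of select="zip"/>
    </p>
  </xsl:template>
</xsl:stylesheet>
```

Any XSLT 1.0 processor should be able to apply it; with libxslt installed, `xsltproc address.xsl address.xml` is one way, assuming those file names.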