Data dictionary vs Metadata

data dictionary for preservation metadata and data dictionary and metadata
NathanBenett,Germany,Researcher
Published Date:11-07-2017
PREMIS version 2.0, March 2008

Contents: Acknowledgments, Introduction, Background, The PREMIS Data Model, General Topics, Implementation Considerations, The PREMIS Data Dictionary Version 2.0, Special Topics, Methodology, Glossary

INTRODUCTION

Background

In June 2003, OCLC and RLG jointly sponsored the formation of the PREMIS (Preservation Metadata: Implementation Strategies) working group, composed of international experts in the use of metadata to support digital preservation activities. The working group’s membership included more than 30 participants, representing five different countries and a variety of domains, including libraries, museums, archives, government agencies, and the private sector. Part of the working group’s charge was to develop a core set of implementable preservation metadata, broadly applicable across a wide range of digital preservation contexts and supported by guidelines and recommendations for creation, management, and use. This portion of the working group’s charge was fulfilled in May 2005 with the release of Data Dictionary for Preservation Metadata: Final Report of the PREMIS Working Group.

That 237-page Report provides a wealth of resources on preservation metadata. First and foremost is the Data Dictionary itself, a comprehensive, practical resource for implementing preservation metadata in digital archiving systems. The Data Dictionary defines preservation metadata that:
• Supports the viability, renderability, understandability, authenticity, and identity of digital objects in a preservation context;
• Represents the information most preservation repositories need to know to preserve digital materials over the long term;
• Emphasizes “implementable metadata”: rigorously defined, supported by guidelines for creation, management, and use, and oriented toward automated workflows; and
• Embodies technical neutrality: no assumptions made about preservation technologies, strategies, metadata storage and management, etc.
In addition to the Data Dictionary, the working group also published a set of XML schema to support implementation of the Data Dictionary in digital archiving systems. The PREMIS Data Dictionary was awarded the 2005 Digital Preservation Award, given under the auspices of the British Conservation Awards, as well as the 2006 Society of American Archivists Preservation Publication Award. Following the release of the Data Dictionary in 2005, the PREMIS working group retired and the PREMIS Maintenance Activity, sponsored by the Library of Congress, was initiated to maintain the Data Dictionary and coordinate other work to advance understanding of preservation metadata and related topics. In addition to providing a permanent Web home for the Data Dictionary, XML schema, and related materials, the Maintenance Activity also operates the PREMIS Implementers Group (PIG) discussion list and wiki, conducts tutorials on the Data Dictionary and its use, and commissions focused studies on preservation metadata topics. The Maintenance Activity also established an Editorial Committee responsible for further development of the Data Dictionary and the XML schema and promoting their use. Data Dictionary for Preservation Metadata: PREMIS version 2.0 1 INTRODUCTION The membership of the Editorial Committee reflects a variety of countries and institutional backgrounds. At the time of the Data Dictionary’s release, the decision was made to “freeze” its content for at least 18 months, giving the digital preservation community time to read and digest it, experiment with its implementation, identify errors, and most importantly, provide feedback on ways that the Data Dictionary could be improved to increase its value and ease of application. Feedback was collected through a variety of mechanisms, and in 2007, the Editorial Committee determined that a sufficient level of commentary had accumulated to warrant undertaking the first revision of the Data Dictionary. 
The members of the Editorial Committee revised the Data Dictionary, making every effort to engage stakeholders in the process of revision. The Committee kept the preservation community informed of issues being discussed, solicited comment on proposed revisions, and consulted outside experts where appropriate. The result of this process is the PREMIS Data Dictionary for Preservation Metadata version 2.0.

Development of the original PREMIS Data Dictionary

The PREMIS working group was established to build on the earlier work of another initiative sponsored by OCLC and RLG: the Preservation Metadata Framework (PMF) working group. In 2001–2002 the PMF working group outlined the types of information that should be associated with an archived digital object. Their report, A Metadata Framework to Support the Preservation of Digital Objects (the Framework), proposed a list of prototype metadata elements. However, additional work was needed to make these prototype elements implementable. The PREMIS working group was asked to take the PMF group’s work a step further and develop a data dictionary of core metadata for archived digital objects, as well as give guidance and suggest best practice for creating, managing, and using the metadata in preservation systems.

Since the PREMIS working group had a practical rather than theoretical focus, members were sought from institutions known to be operating or developing preservation repository systems within the cultural heritage and information industry sectors. Diverse perspectives were also sought. The working group consisted of representatives from academic and national libraries, museums, archives, government, and commercial enterprises in five different countries. In addition, PREMIS called upon an international advisory committee of experts to review progress.
To understand how preservation repositories were actually implementing preservation metadata, in November 2003 the working group undertook a survey of about 70 organizations thought to be active in or interested in digital preservation. The survey provided an opportunity to explore the state of the art in digital preservation generally, and questions were drafted to elicit information about policies, governance and funding, system architecture, and preservation strategies, as well as metadata practices. The subgroup contacted 16 of 48 respondents by telephone for more in-depth interviews. In December 2004 the PREMIS working group published its report based on the survey of digital repositories, Implementing Preservation Repositories for Digital Materials: Current Practice and Emerging Trends in the Cultural Heritage Community (the Implementation Survey Report). The findings of this survey were extremely helpful in informing the working group’s discussions as it developed the Data Dictionary.

Both the earlier Framework and the PREMIS Data Dictionary build on the Open Archival Information System (OAIS) reference model (ISO 14721). The OAIS information model provides a conceptual foundation in the form of a taxonomy of information objects and packages for archived objects, and the structure of their associated metadata. The Framework can be viewed as an elaboration of the OAIS information model, explicated through the mapping of preservation metadata to that conceptual structure. The PREMIS Data Dictionary can be viewed as a translation of the Framework into a set of implementable semantic units. However, it should be noted that the Data Dictionary and OAIS occasionally differ in terminology usage; these differences are noted in the Glossary that accompanies this report.
Differences usually reflect the fact that PREMIS semantic units require more specificity than the OAIS definitions provide, which is to be expected when moving from a conceptual framework to an implementation.

Implementable, core preservation metadata

The PREMIS Data Dictionary defines “preservation metadata” as the information a repository uses to support the digital preservation process. Specifically, the group looked at metadata supporting the functions of maintaining viability, renderability, understandability, authenticity, and identity in a preservation context. Preservation metadata thus spans a number of the categories typically used to differentiate types of metadata: administrative (including rights and permissions), technical, and structural. Particular attention was paid to the documentation of digital provenance (the history of an object) and to the documentation of relationships, especially relationships among different objects within the preservation repository.

The group considered a number of definitions of “core.” In one view, core describes any metadata absolutely required under any circumstances. In another, core means that metadata is applicable to any type of repository implementing any type of preservation strategy. PREMIS uses this practical definition: things that most working preservation repositories are likely to need to know in order to support digital preservation. The words “most” and “likely” were chosen deliberately. Core does not necessarily mean mandatory, and some semantic units were designated as optional when exceptional cases were apparent.

The concept of “implementability” also required definition. Most preservation repositories deal with large quantities of data. Therefore, a key factor in the implementability of preservation metadata is whether the values can be automatically supplied and automatically processed by the repository.
Whenever possible the group defined semantic units that do not require human intervention to supply or analyze. For example, coded values from an authority list are preferred over textual descriptions.

The working group decided that the Data Dictionary should be wholly implementation independent. That is, the core metadata define information that a repository needs to know, regardless of how, or even whether, that information is stored. For instance, for a given identifier to be usable, it is necessary to know the identifier scheme and the namespace in which it is unique. If a particular repository uses only one type of identifier, the repository would not need to record the scheme in association with each object. The repository would, however, need to know this information and to be able to supply it when exchanging metadata with other repositories. Because of the emphasis on the need to know rather than the need to record or

The PREMIS Data Model

The working group developed a simple data model to organize the semantic units defined in the Data Dictionary. The data model defines five entities the working group felt were particularly important in regard to digital preservation activities: Intellectual Entities, Objects, Events, Rights, and Agents. Each semantic unit defined in the Data Dictionary is a property of one of the entities in the data model. Figure 1 provides a graphical illustration of the PREMIS Data Model.

Figure 1: The PREMIS Data Model

In Figure 1, entities are represented by boxes; relationships between entities are represented by arrows. The direction of the arrow indicates the direction of the relationship linkage as it is recorded in the preservation metadata.
For example, the arrow pointing from the Rights entity to the Agents entity means that the metadata associated with the Rights entity includes a semantic unit recording information about the relationship with an Agent. The arrow pointing from the Objects entity back to itself indicates that the semantic units defined in the Data Dictionary support the recording of relationships between Objects. No other entity in the data model supports relationships of this type; in other words, while Objects can be related to other Objects, Events cannot be related to other Events, Agents cannot be related to other Agents, and so on.

The entities in the PREMIS data model are defined as follows:

Intellectual Entity: a set of content that is considered a single intellectual unit for purposes of management and description: for example, a particular book, map, photograph, or database. An Intellectual Entity can include other Intellectual Entities; for example, a Web site can include a Web page; a Web page can include an image. An Intellectual Entity may have one or more digital representations.
Object (or Digital Object): a discrete unit of information in digital form.
Event: an action that involves or impacts at least one Object or Agent associated with or known by the preservation repository.
Agent: person, organization, or software program/system associated with Events in the life of an Object, or with Rights attached to an Object.
Rights: assertions of one or more rights or permissions pertaining to an Object and/or Agent.

The PREMIS Data Dictionary defines semantic units. Each semantic unit defined in the Data Dictionary is mapped to one of the entities in the data model. In this sense, a semantic unit may be viewed as a property of an entity. For example, the semantic unit size is a property of an Object entity.
Semantic units have values: for a particular Object the value of size might be “843200004.” In most cases, a particular semantic unit is unambiguously a property of only one type of entity. The size of an Object is clearly a property of the Object entity. In some cases, however, a semantic unit applies equally to two or more types of entity. For example, Events have outcomes. If a migration event creates a file that has lost some important feature, the loss of that feature might be considered an outcome of the Event, and therefore a property of the Event entity. Alternatively, it might be considered an attribute of the new file, and therefore a property of the Object entity. When a semantic unit applies equally to multiple entity types, the semantic unit is associated with only one type of entity in the Data Dictionary. The data model relies upon links between the different entities to make these relationships clear. In the example above, the loss of the feature is treated as a detailed outcome of the Event, where the Event contains the identifier of the Object involved. What is important is that this association is arbitrary and is not meant to imply that a particular implementation is required. In some cases a semantic unit takes the form of a container that groups a set of related semantic units. For example, a semantic unit identifier groups the two semantic units identifierType and identifierValue. The grouped subunits are called semantic components of the container. Some containers are defined as extension containers, to allow the use of metadata encoded according to an external schema. This enables PREMIS to be extended with metadata elements that are more granular, non-core, or otherwise out of scope for the Data Dictionary. A relationship is a statement of association between instances of entities. “Relationship” can be interpreted broadly or narrowly, and expressed in many different ways. 
For example, the statement “Object A is of format B” could be considered a relationship between A and B. The PREMIS model, however, treats format B as a property of Object A. PREMIS reserves “relationship” for associations between two or more Object entities or between entities of different types, such as an Object and an Agent.

More on Objects

The Object entity has three subtypes: file, bitstream, and representation.

A file is a named and ordered sequence of bytes that is known by an operating system. A file can be zero or more bytes and has a file format, access permissions, and file system characteristics such as size and last modification date.

A bitstream is contiguous or non-contiguous data within a file that has meaningful common properties for preservation purposes. A bitstream cannot be transformed into a standalone file without the addition of file structure (headers, etc.) and/or reformatting the bitstream to comply with some particular file format.

A representation is the set of files, including structural metadata, needed for a complete and reasonable rendition of an Intellectual Entity. For example, a journal article may be complete in one PDF file; this single file constitutes the representation. Another journal article may consist of one SGML file and two image files; these three files constitute the representation. A third article may be represented by one TIFF image for each of 12 pages plus an XML file of structural metadata showing the order of the pages; these 13 files constitute the representation.

Files, bitstreams, and filestreams

A file in the PREMIS data model is similar to the idea of a computer file in ordinary usage: a set of zero or more bytes known to an operating system. Files can be read, written, and copied. Files have names and formats. A bitstream as defined in the PREMIS data model is a set of bits embedded within a file.
This differs from common usage, where a bitstream could in theory span more than one file. A good example of a file with embedded bitstreams is a TIFF file containing two images. According to the TIFF file format specification a TIFF file must contain a header containing some information about the file. It may then contain one or more images. In the PREMIS data model each of these images is a bitstream and can have properties such as identifiers, location, inhibitors, and detailed technical metadata (e.g., color space).

Some bitstreams have the same properties as files and some do not. The image embedded within the TIFF file clearly has properties different from the file itself. However, in another example, three TIFF files could be aggregated within a larger tar file. In this case the three TIFF files are also embedded bitstreams, but they have all the properties of TIFF files.

The PREMIS data model refines the definition of bitstream to include only an embedded bitstream that cannot be transformed into a standalone file without the addition of file structure (e.g., headers) or other reformatting to comply with some particular file format specification. Examples of these bitstreams include an image within a TIFF 6.0 file, audio data within a WAVE file, or graphics within a Microsoft Word file.

Some embedded bitstreams can be transformed into standalone files without adding any additional information, although a transformation process such as decompression, decryption, or decoding may have to be performed on the bitstream in the extraction process. Examples of these bitstreams include a TIFF within a tar file, or an encoded EPS within an XML file. In the PREMIS data model these bitstreams are defined as “filestreams,” that is, true files embedded within larger files. Filestreams have all of the properties of files, while bitstreams do not.
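The distinction drawn above between files, filestreams, and bitstreams can be illustrated with a minimal Python sketch. The class and field names (`DataUnit`, `embedded_in_file`, `needs_added_structure`) are hypothetical and are not part of the Data Dictionary; the sketch only encodes the two questions the text asks: is the data embedded in a larger file, and could it stand alone without added file structure or reformatting?

```python
from dataclasses import dataclass
from enum import Enum


class ObjectKind(Enum):
    FILE = "file"
    FILESTREAM = "filestream"
    BITSTREAM = "bitstream"


@dataclass(frozen=True)
class DataUnit:
    """Hypothetical description of a unit of data held by a repository."""
    description: str
    embedded_in_file: bool        # does it live inside a larger file?
    needs_added_structure: bool   # headers or reformatting needed to stand alone?


def classify(unit: DataUnit) -> ObjectKind:
    """Apply the definitions from the text to one unit of data."""
    if not unit.embedded_in_file:
        return ObjectKind.FILE
    if unit.needs_added_structure:
        # Cannot become a standalone file as-is: the stricter PREMIS bitstream.
        return ObjectKind.BITSTREAM
    # Embedded, but already a true file (extraction may still require
    # decompression, decryption, or decoding).
    return ObjectKind.FILESTREAM


# Examples taken from the text above.
EXAMPLES = [
    DataUnit("image within a TIFF 6.0 file", True, True),
    DataUnit("audio data within a WAVE file", True, True),
    DataUnit("TIFF within a tar file", True, False),
    DataUnit("standalone PDF file", False, False),
]

for example in EXAMPLES:
    print(f"{example.description}: {classify(example).value}")
```

Whether a given embedded format really needs added structure is a judgment the repository has to make per format; the boolean flag here simply records that judgment.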
In the Data Dictionary, the column for “File” applies to both files and filestreams. The column for “Bitstream” applies to the subset of bitstreams that are not filestreams and that adhere to the stricter PREMIS definition of bitstream. The location (contentLocation in the Data Dictionary) of a file would normally be a location in storage, while the location of a filestream or bitstream would normally be the starting offset within the embedding file.

Representations

The goal of many preservation repositories is to maintain usable versions of intellectual entities over time. For an intellectual entity to be displayed, played, or otherwise made usable to a human, all of the files making up at least one version of that intellectual entity must be identified, stored, and maintained so that they can be assembled and rendered to a user at any given point. A representation is the set of files required to do this.

PREMIS chose the term “representation” to avoid the term “manifestation” as it is used in the Functional Requirements for Bibliographic Records (FRBR). In FRBR a manifestation entity is “all the physical objects that bear the same characteristics in respect to both intellectual content and physical form.” In the PREMIS model a representation is a single digital instance of an intellectual entity held in a preservation repository.

A preservation repository might hold more than one representation for the same intellectual entity. For example, the repository might acquire a single image (say, “Statue of a horse”) as a TIFF file. At some point the repository creates a derivative JPEG2000 file from the TIFF and keeps both files. Each of these files would constitute a representation of “Statue of a horse.” In a more complicated example, “Statue of a horse” might be a part of an article consisting of that TIFF image and a file of SGML-encoded text.
If the repository created a JPEG2000 version of the TIFF, it would hold two representations of the article: the TIFF and the SGML files would make up one representation, while the JPEG2000 and the SGML files would make up another representation.

How those representations are stored is implementation specific. A repository might choose to store a single copy of the SGML file, which would then be shared between representations. Alternatively, the repository could choose to duplicate the SGML file and store two identical copies of it. The two representations would then consist of the TIFF and SGML copy 1, and the JPEG2000 and SGML copy 2.

Not all preservation repositories will be concerned with representations. A repository might, for example, preserve file objects only and rely on external agents to assemble these objects into usable representations. If the repository does not manage representations, it does not need to record metadata about them.

Intellectual Entities and Objects

The relationship between Intellectual Entities and Objects can be illustrated by a couple of examples:

Example 1, Animal Antics: The book Animal Antics was published in 1902. A library digitized Animal Antics, creating one TIFF file for each of 189 pages. As structural metadata, it created an XML file showing how the images are assembled into a complete book. The library then performed OCR on the TIFF images, ultimately creating a single large text file that was marked up by hand in SGML. The library submitted 189 TIFF files, one XML file, and one SGML file to a preservation repository. To the repository Animal Antics is an Intellectual Entity: it is a reasonable unit that can be described as a whole, with properties such as an author, a title, and a publication date. The repository has two representations, one consisting of 189 TIFF files and an XML file, and the other consisting of one SGML file.
Each representation could render a complete version of Animal Antics, albeit with different functionalities. The repository will record metadata about two representation objects and 191 file objects.

[Figure 2: Animal Antics Intellectual Entity Example. Diagram labels: Animal Antics (an intellectual entity); Representation 1; Representation 2; XML; SGML; TIFF 1 through TIFF 189.]

Example 2, Welcome to U: Welcome to U, submitted to a preservation repository as an AVI (Audio Video Interleaved) file, is a 10-minute movie introducing new students to a university campus. Welcome to U is an Intellectual Entity. The repository has one representation, which consists of a single AVI file. The repository’s preservation strategy requires that it manage the audio bits of the AVI file separately from the video bits. The repository will record metadata about one representation object, one file object, and two bitstream objects.

More on Events

The Event entity aggregates metadata about actions. A preservation repository will record events for many reasons. Documentation of actions that modify (that is, create a new version of) a digital object is critical to maintaining digital provenance, a key element of authenticity. Actions that create new relationships or alter existing relationships are important in explaining those relationships. Even actions that alter nothing, such as validity and integrity checks on objects, can be important to record for management purposes. For billing or reporting purposes some repositories may track actions such as requests for dissemination or reports. It is up to the repository which actions to record as Events.
Some actions may be considered too trivial to record, or may be recorded in other systems (as, for example, routine file backups may be recorded in storage management systems). It is also an implementation decision whether to record events that occur before an object is ingested into the preservation repository, for example, derivation from an earlier object, or changes of custody. In theory, events following the deaccessioning of an Intellectual Entity could also be recorded. For example, a repository might first deaccession an Intellectual Entity, then delete all file Objects associated with that entity, and record each deletion as an Event.

In the data model Objects are associated with Events in two ways. If an Object is related to a second Object through (because of) an Event, the Event identifier is recorded in the relationship container as the semantic component relatedEventIdentification. If the Object simply has an associated Event with no relationship to a second Object, the Event identifier is recorded in the container linkingEventIdentifier. (For more information on relationships, see “Relationships between Objects.”)

For example, assume a preservation repository ingests an XML file (object A) and creates a normalized version of it (object B) by running a program (event 1). In the metadata for object B, this could be recorded in relationship as follows:

relationshipType = “derivation”
relationshipSubType = “derived from”
relatedObjectIdentification
    relatedObjectIdentifierType = “local”
    relatedObjectIdentifierValue = “A”
    relatedObjectSequence = “not applicable”
relatedEventIdentification
    relatedEventIdentifierType = “local”
    relatedEventIdentifierValue = “1”
    relatedEventSequence = “not applicable”

Continuing with this example, assume that after object B is created it is validated by running another program (event 2). In this case event 2 pertains only to object B, not to the relationship between B and A.
The link to event 2 would be recorded as linkingEventIdentifier:

linkingEventIdentifierType = “local”
linkingEventIdentifierValue = “2”

A given Object can be associated in these two ways with any number of Events.

All events have outcomes (success, failure, etc.). Some events also have outputs; for example, the execution of a program creates a new file object. The semantic units eventOutcome and eventOutcomeDetail are intended for documenting qualitative outcomes. For example, if the event is an act of format validation, the value of eventOutcome might be a code indicating the object is fully valid. Alternatively, it might be a code indicating the object is not fully valid, and eventOutcomeDetail could be used to describe all anomalies found. If the program performing the validation writes a log of warnings and error messages, a second instance of eventOutcomeDetail could be used to store or point to that log.

If an event creates objects that are stored in the repository, those objects should be described as entities with a complete set of applicable metadata and associated with the event by links.

More on Agents

Agents are clearly important but are not the focus of the Data Dictionary, which defines only a means to identify the agent and a classification of agent type (person, organization, or software). While more metadata is likely to be necessary, this is left to other initiatives to define.

The data model diagram shows an arrow from the Agent entity to the Event entity, but no arrow from Agent to the Object entity. Agents influence Objects only indirectly through Events. Each Event can have one or more related Objects and one or more related Agents. Because a single Agent can perform different roles in different Events, the role of the Agent is a property of the Event entity, not of the Agent entity.
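The object A / object B example under “More on Events,” together with the rule that an Agent’s role belongs to the Event, can be sketched in Python as follows. This is a minimal illustration, not the PREMIS XML schema: the class and field names are hypothetical stand-ins for the semantic units named in the text, and the agent identifier “staff-07” and both role strings are invented for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Identifier:
    id_type: str
    value: str


@dataclass
class Relationship:
    relationship_type: str
    relationship_sub_type: str
    related_object: Identifier
    # Stand-in for relatedEventIdentification: the Event behind the relationship.
    related_event: Optional[Identifier] = None


@dataclass
class PreservedObject:
    identifier: Identifier
    relationships: List[Relationship] = field(default_factory=list)
    # Stand-in for linkingEventIdentifier: Events that concern this Object alone.
    linking_events: List[Identifier] = field(default_factory=list)


@dataclass
class LinkedAgent:
    agent: Identifier
    role: str  # the role is recorded on the Event, not on the Agent


@dataclass
class Event:
    identifier: Identifier
    outcome: str
    outcome_details: List[str] = field(default_factory=list)
    linked_agents: List[LinkedAgent] = field(default_factory=list)


def event_ids(obj: PreservedObject) -> List[str]:
    """Collect Event identifiers reachable from an Object by either route."""
    via_relationship = [r.related_event.value
                        for r in obj.relationships if r.related_event is not None]
    direct = [e.value for e in obj.linking_events]
    return via_relationship + direct


def roles_of(agent: Identifier, events: List[Event]) -> Dict[str, str]:
    """Map Event identifier value to the role the given Agent played in it."""
    return {ev.identifier.value: la.role
            for ev in events for la in ev.linked_agents if la.agent == agent}


# Object B was derived from object A by event 1 and then validated by event 2.
object_b = PreservedObject(
    identifier=Identifier("local", "B"),
    relationships=[Relationship("derivation", "derived from",
                                related_object=Identifier("local", "A"),
                                related_event=Identifier("local", "1"))],
    linking_events=[Identifier("local", "2")],
)

staff = Identifier("local", "staff-07")  # hypothetical agent
events = [
    Event(Identifier("local", "1"), outcome="success",
          linked_agents=[LinkedAgent(staff, "authorizer")]),
    Event(Identifier("local", "2"), outcome="not fully valid",
          outcome_details=["anomalies found", "pointer to validator log"],
          linked_agents=[LinkedAgent(staff, "operator")]),
]

print(event_ids(object_b))
print(roles_of(staff, events))
```

Keeping the role on `LinkedAgent` rather than on the agent record lets the same agent appear in both events with different roles, which is the design choice the text describes.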
More on Rights

Many efforts are concerned with metadata related to intellectual property rights and permissions, from rights expression languages to the indecs framework. However, only a small body of work addresses rights and permissions specifically related to digital preservation.

After the publication of the first edition of the PREMIS Data Dictionary, the Library of Congress in its capacity as PREMIS Maintenance Agency commissioned a paper, “Rights in the PREMIS Data Model,” by Karen Coyle. This paper discussed copyright, licenses, and statute as three bases for establishing intellectual property rights, and recommended an expansion of the rights information in the Data Dictionary to include information on these bases. Consequently, the permissionStatement in the original Data Dictionary was replaced with the rightsStatement in this version.

In this revision the Editorial Committee relied heavily upon the Coyle paper, background materials such as Peter Hirtle's excellent “Digital Preservation and Copyright,” and the California Digital Library's draft copyrightMD schema. It should be noted that the proposed uses of copyrightMD and PREMIS rights are rather different. The copyrightMD schema is intended to document factual information to allow a human being to make an informed copyright assessment of a given work. The PREMIS rightsStatement is intended to allow a preservation repository to determine whether it has the right to perform a certain action in an automated fashion, with some documentation of the basis for the assertion.

General Topics on the Structure and Use of the Data Dictionary

The semantic units defined in the PREMIS Data Dictionary are bound together by a few structural conventions that help organize the Data Dictionary and support its implementation.
These conventions include the use of identifiers; the manner in which relationships are handled in the Data Dictionary; and the “1:1 Principle” relating metadata to Objects.

Identifiers

Instances of Objects, Events, Agents, and Rights statements are uniquely identified by a set of semantic units collected under “Identifier” containers. These semantic units follow an identical syntax and structure, regardless of entity type:

[entity type]Identifier
    [entity type]IdentifierType: domain in which the identifier is unique
    [entity type]IdentifierValue: identifier string

The following examples illustrate the use of this syntax to identify an Object residing in Harvard’s Digital Repository Service (DRS), and an event that occurs under the auspices of the NRS (Name Resolution Service):

Example 1: Identifying an Object

ObjectIdentifier
    ObjectIdentifierType: NRS
    ObjectIdentifierValue: http://nrs.harvard.edu/urn-3:FHCL.Loeb:sa1

Example 2: Identifying an Event

EventIdentifier
    EventIdentifierType: NRS
    EventIdentifierValue: 716593

In both examples, the identifier type is “NRS”, which indicates that the identifier is unique within the domain of the Name Resolution Service that assigns identifiers for the Digital Repository Service. Identifier type should be defined as specifically as possible, and provide sufficient information to indicate the relevant naming authority, as well as how to build the identifier value. For example, it would have been permissible to use “URL” for ObjectIdentifierType in the first example, since the identifier value is unique in that domain, but “NRS” conveys more information about the domain in which the identifier is created and used.

If all identifiers are local to the repository system, it is unlikely that identifier type would need to be explicitly recorded for each identifier in the system.
This is an example of a semantic unit whose information is known implicitly by context or policy, and is therefore not implemented as a metadata element in the preservation system. However, if the repository exchanges digital objects and their associated metadata with other repositories, identifier type should be explicitly supplied.

Identifiers can be created internally or externally to the repository. The PREMIS Data Dictionary does not require or even recommend a specific identifier scheme; this is an implementation-specific issue and is therefore outside the scope of the Data Dictionary. The Data Dictionary simply provides a general syntax that can be used to express identifier type and value, regardless of the specific scheme chosen. It is recommended, however, that repositories choose persistent identification schemes wherever possible.

Identifiers are repeatable for Objects and Agents; they are not repeatable for Rights and Events. Objects and Agents often have multiple identities in a global environment, and across systems, and therefore are likely to have multiple identifiers. Rights and Events are considered to have a context limited to a particular preservation repository, and therefore do not require multiple identifiers.

Identifiers are used as references to establish relationships between entities in the PREMIS data model. Relationships are discussed in the next section.

Relationships between Objects

As noted earlier, an Object in a repository can be related to one or more other Objects in the repository. The PREMIS Data Dictionary supplies semantic units to support documentation of relationships between Objects. The working group began its exploration of this topic by collecting examples from existing preservation metadata projects.
It found a wide range of metadata facts expressed as relationships—for example, “is migrated from,” “is keyed text of,” “is thumbnail of.” In some cases these relationship statements combine more than one fact (e.g., “is keyed text of” combines “is a keyed text” and “is derived from”). The group also reviewed the element refinements for the Dublin Core Relation element (IsPartOf, IsFormatOf, IsVersionOf, etc.) and concluded that most relationships among objects appear to be variants of these three basic types: structural, derivation, and dependency. Structural relationships show relationships between parts of objects. The structural relationships between the files that constitute a representation of an Intellectual Entity are clearly essential preservation metadata. If a preservation repository can’t put the pieces of a digital object back together, it hasn’t preserved the object. For a simple digital object (e.g., a photograph) structural information is minimal: the file constitutes the representation. Other digital objects such as e-books and Web sites can have quite complex structural relationships. Derivation relationships result from the replication or transformation of an Object. The intellectual content of the resulting Object is the same, but the Object’s instantiation, and possibly its format, are different. When file A of format X is migrated to create file B of format Y, a derivation relationship exists between A and B. Many digital objects are complex, and both structural and derivation information can change over time as a result of preservation activities. For example, a digitized book represented by 400 TIFF page images might after migration become four PDF files each containing 100 pages. A structural relationship among objects can be established by an act of derivation before the objects were ingested by the repository. 
For example, a word-processing document could have been used to create derivative files in PDF and XML formats. If only the PDF and XML files are submitted to the preservation repository, these objects are different representations of the same Intellectual Entity with parent-child relationships to the source word-processing file. They do not have derivation relationships with each other, but do have a structural relationship as siblings (children of a common parent).

There is no one way to model all possible structural or derivation information. Rather than specify a particular approach, the group identified essential information that must be captured. The PREMIS Data Dictionary describes this in the semantic components of the semantic unit relationship. Structural and derivative relationships link Objects; the Objects must be identified. The type of relationship must be identified in some way (e.g., “is child of”) and the relationship may be associated with an Event that created that relationship. Implementers will likely choose approaches that best suit the content to be preserved by using, for example, the METS structMap or descriptive metadata schemes that define relationship types (e.g., Dublin Core).

A dependency relationship exists when one object requires another to support its function, delivery, or coherence of content. An object may require a font, style sheet, DTD, schema, or other file that is not formally part of the object itself but is necessary to render it. The Data Dictionary handles dependency relationships as part of the environment information, in the semantic units dependency and swDependency. In this way requirements for hardware and software are brought together with requirements for dependent files to form a complete picture of the information or assets required for the rendering and/or understanding of the object.
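Returning to structural and derivation relationships, the essential information named above (the linked Objects, a relationship type, and an optional Event that created the relationship) can be sketched as a small record type. This is only an illustration in Python, not a PREMIS binding: the class and field names (Identifier, Relationship, kind, sub_type, related_event), the helper related_to, and the "local" identifier type are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Identifier:
    """Type/value pair; 'type' names the domain in which 'value' is unique."""
    type: str
    value: str


@dataclass(frozen=True)
class Relationship:
    """One relationship from a subject Object to a related Object (hypothetical layout)."""
    kind: str                                   # "structural" or "derivation"
    sub_type: str                               # e.g. "is source of", "is sibling of"
    subject: Identifier                         # the Object being described
    related: Identifier                         # the Object it is related to
    related_event: Optional[Identifier] = None  # Event that created the relationship


def related_to(obj: Identifier, relationships: List[Relationship], kind: str) -> List[Identifier]:
    """Return identifiers of Objects linked to obj by relationships of the given kind."""
    return [r.related for r in relationships if r.subject == obj and r.kind == kind]


# File A (format X) is migrated to file B (format Y) by a migration Event.
file_a = Identifier("local", "file-A")
file_b = Identifier("local", "file-B")
migration = Identifier("local", "event-1")

# PDF and XML files derived from a word-processing file that was never ingested.
pdf = Identifier("local", "report.pdf")
xml = Identifier("local", "report.xml")

relationships = [
    Relationship("derivation", "is source of", file_a, file_b, migration),
    Relationship("structural", "is sibling of", pdf, xml),
]
```

Under these assumptions, related_to(file_a, relationships, "derivation") returns the identifier of file B, while the PDF and XML files are linked only structurally, as siblings, which matches the word-processing example in the text.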
Relationships between entities of different types

The data model diagram uses arrows to show relationships between entities of different types. Objects are related to Intellectual Entities, Objects are related to Events, Agents are related to Events, etc. The Data Dictionary expresses relationships as linking information by including in the information for entity A a pointer to the related entity B. Every entity in the data model has a unique identifier for use as a pointer. So, for example, the Object entity has arrows pointing to Intellectual Entities and Events. These are implemented in the Data Dictionary by the semantic units linkingIntellectualEntityIdentifier and linkingEventIdentifier.

The 1:1 principle

In digital preservation it is common practice to create new copies or versions of stored objects. For example, in forward migration file A in format X may be input to a program which outputs file B in format Y. There are two ways to think about files A and B. One might think of them as a single Object, the history of which includes the transformation from X to Y, or one could think of them as two distinct Objects with a relationship created by the transformation Event.

The 1:1 principle in metadata asserts that each description describes one and only one resource. As applied to PREMIS metadata, every Object held within the preservation repository (file, bitstream, representation) is described as a static set of bits. It is not possible to change a file (or bitstream or representation); one can only create a new file (or bitstream or representation) that is related to the source Object. In the example above, therefore, files A and B are distinct Objects with a derivative relationship between them.
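The migration example can be sketched in code to show what the 1:1 principle implies for an implementation: descriptions are immutable, and a transformation yields a new Object plus an Event rather than an update. This is a minimal sketch in Python under stated assumptions; the names FileObject, Event, migrate, and all field names are hypothetical and are not PREMIS semantic units.

```python
from dataclasses import dataclass, FrozenInstanceError
from datetime import datetime, timezone
from typing import Callable, Tuple


@dataclass(frozen=True)
class FileObject:
    """A file described as a static set of bits (frozen: it cannot be modified)."""
    identifier: str
    format_name: str
    content: bytes


@dataclass(frozen=True)
class Event:
    """Record of the action that produced a new Object from a source Object."""
    identifier: str
    event_type: str
    date_time: datetime
    source_identifier: str
    outcome_identifier: str


def migrate(source: FileObject, new_id: str, new_format: str,
            convert: Callable[[bytes], bytes], event_id: str) -> Tuple[FileObject, Event]:
    """Create a new Object from source; the source itself is left untouched."""
    target = FileObject(new_id, new_format, convert(source.content))
    event = Event(event_id, "migration", datetime.now(timezone.utc),
                  source.identifier, target.identifier)
    return target, event


# File A in format X is input to a (stand-in) converter that outputs file B in format Y.
file_a = FileObject("A", "X", b"hello")
file_b, event = migrate(file_a, "B", "Y", bytes.upper, "E1")
```

Here bytes.upper merely stands in for a real format converter. After the call there are two distinct Objects, A and B, and one Event linking them; an attempt to assign to file_a.format_name raises FrozenInstanceError, which mirrors the statement that an Object cannot be changed.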
The Data Dictionary has a semantic unit for the creation date of an Object (dateCreatedByApplication) but not for the modification date of an Object, because an Object, by definition, cannot be modified.

When new objects are derived from existing objects the event that created the new object should be recorded as an Event, which will have a date/time stamp. The relationship(s) among the objects should be recorded using the relationship semantic unit associated with the Object entity. The semantic component relatedEventIdentification should be used to make the association with the Event.

Implementation Considerations

PREMIS conformance

PREMIS conformance requires a preservation repository to follow the specifications outlined in the Data Dictionary. For example, if the repository claiming to be PREMIS-conformant implements a metadata element sharing the name of a semantic unit in the Data Dictionary, it is expected that the repository’s metadata element will also share the definition of the semantic unit. Metadata not defined in the Data Dictionary may certainly be used, but non-PREMIS elements should not conflict with or overlap with PREMIS semantic units if they use the same names. Data constraints and applicability guidelines in the Data Dictionary must also be adhered to. For repeatability and obligation, PREMIS conformance permits more stringent but not more liberal application. That is, a semantic unit defined in the Data Dictionary as repeatable can be treated as not repeatable within a repository, but not vice versa.

The PREMIS Data Dictionary designates some semantic units as mandatory when describing representations, files, and/or bitstreams.
The mandatory semantic units represent the minimum amount of information 1) necessary to support the long-term preservation of digital objects, and 2) that must accompany a digital object as it is transferred from the custody of one preservation repository to another.

There is no prescribed strategy for collecting, storing, or managing the mandatory semantic units within the repository’s internal systems. Nor is there a minimum level of information that must be explicitly recorded and maintained locally by the repository. In general, the mandatory semantic units of the Data Dictionary represent the information that a preservation repository must be able to associate with any archived digital object in its possession. The specific means of association (e.g., local metadata storage, shared registries, etc.) are implementation issues and outside the scope of the Data Dictionary.

When a digital object is exchanged between two preservation repositories, the repository sending the object must be able to extract from its systems or from other sources the information needed to populate the semantic units marked mandatory in the Data Dictionary. This information must conform to the specifications in the Data Dictionary and must be packaged with the digital object before its transfer to the second repository. The PREMIS working group believes that this information represents the minimum amount for the second repository to accept custody of the digital object and assume responsibility for its long-term preservation.

Some PREMIS semantic units are equivalent to metadata elements in other metadata schemas. If metadata is taken from other schemas to populate PREMIS semantic units, care must be taken to ensure that this information conforms to the requirements and constraints associated with the corresponding semantic unit in the PREMIS Data Dictionary.
Harmonizing the PREMIS Data Dictionary with other metadata schemas in cases where they overlap would help minimize conformance issues. For example, the Z39.87 metadata standard (Technical Metadata for Digital Still Images) revised some of its elements to harmonize them with equivalent semantic units in the PREMIS Data Dictionary.

Sometimes a preservation repository exchanges digital objects with parties that are not themselves preservation repositories. When a party submits an object to a preservation repository for archival retention, it is unlikely that the submitter will be in a position to supply the full range of information needed to populate the mandatory semantic units. Instead, it will supply a subset of this information whose extent, ideally, is determined by prior arrangement between the submitter and the repository. Whatever the extent of this subset, any information supplied by the submitter should conform to the Data Dictionary. The repository’s ingest process would then supply the rest of the information for the mandatory semantic units.

When a repository disseminates an archived digital object to a user, it is unlikely that the user will be interested in the full range of mandatory semantic units associated with the archived object. Instead, the user would be provided with a subset of these semantic units. As in the case of submission, whatever the extent of this subset, any information supplied by the repository should conform to the Data Dictionary.

Achieving interoperability across a network of preservation repositories and other stakeholders requires a shared view of the metadata needed to support long-term preservation, formalized as an implementable schema. PREMIS conformance and the mandatory semantic units are intended to fill this need.

Implementation of the data model

The PREMIS data model is meant to clarify the meaning and use of the semantic units in the Data Dictionary.
It is not intended to prescribe an architecture for implementation. The working group believed that most preservation repositories will need to deal in some way with the conceptual entities, Objects, Agents, Events, and Rights, and found it useful to distinguish between the properties of subclasses of objects, such as files and filestreams, bitstreams, and representations. A particular repository implementation, however, may need to be more or less granular or define different categories of entity altogether. PREMIS recommends that any data model used be clearly defined and documented, and that metadata decisions be consistent with the data model.

Sets of semantic units may be grouped and related indirectly to particular entities. For example, environment is a property of Objects. Logically, each file has one or more associated environments. However, in many cases the environment is determined by the file format; that is, all files of a particular format will have the same environment information. This could be handled in many different ways by different implementations. For example:

• Repository 1 uses a relational database system. It has a “file” table with a row for each file object, and an “environment” table with a row for each unique set of environment information. The “file” table can be joined with the “environment” table to get the appropriate environment information for each file.

• Repository 2 uses an externally-maintained registry to obtain environment information. It maintains an internal inventory of file formats and their access keys for the external registry. Environment information is accessed via a Web services interface to the external registry and obtained dynamically when needed.

• Repository 3 uses a system that models representations as containers and files as objects within those containers. Each object consists of a set of property/typed value pairs.
Properties define roles for values. Property and type descriptions are themselves objects whose identifiers are drawn from the same namespace as other object identifiers. A file object may include a format property. Because format description is also an object, it could include an environment property, which in turn would point to an environment description object. Alternatively, a file object could include an environment property directly.

Storing metadata

The survey by the Implementation Strategies Subgroup showed that repositories have implemented several different architectures for storing metadata. Most commonly, metadata is stored in relational database tables. It is also common to store metadata as XML documents in an XML database, or as XML documents stored with the content data files. Other methods include proprietary flat file formats and object-oriented databases. Most respondents were using two or more of these methods. (For more information, see the Implementation Survey Report.)

Storing metadata elements in a database system has the advantages of fast access, easy update, and ease of use for query and reporting. Storing metadata records as digital objects in repository storage along with the digital objects the metadata describes also has advantages: it is harder to separate the metadata from the content, and the same preservation strategies that are applied to the content can be applied to the metadata. Recommended practice is to store critical metadata in both ways.

Compound objects require structural metadata to describe the internal structure of the objects and the relationships between their parts. In the PREMIS Data Dictionary, semantic units that begin “related” and “linking” can be used to express certain simple structural information. In some cases this will be adequate for the use of the object, and in other cases it will not be.
Often the presentation, navigation and/or processing of an object will require rich structural metadata recorded according to some other standard, such as METS, MPEG-21, or SMIL. In this case the file containing the structural metadata would be a file object to be preserved in its own right. Regardless of whether a file of independent structural metadata exists as part of the representation, when an archived representation is exported to another repository, the metadata linking files and representations should be provided.

Supplying metadata values

Most preservation repositories will deal with large quantities of materials, so it is desirable to automate the creation and use of metadata as much as possible. The values of many PREMIS semantic units can be obtained by parsing files programmatically, or can be supplied as constants by repository ingest programs. In cases where human intervention might be unavoidable, the group tended to pair a semantic unit requiring a coded value with a second semantic unit allowing a textual explanation.

When information is supplied by the individual or organization submitting the objects to the repository, recommended practice is for the repository to attempt to verify this information by program whenever possible. For example, if a filename includes a file type extension, the repository should not assume the file extension necessarily indicates the format and should attempt to verify the format of the file before recording this as metadata.

To facilitate automatic processing, the use of controlled vocabularies is recommended for a number of PREMIS semantic units. PREMIS assumes that repositories will adopt or define controlled vocabularies useful to them. The Data Dictionary indicates where best practice would require use of a controlled vocabulary.
It does not require specific controlled vocabularies, although it does in some cases indicate suggested values. The PREMIS Editorial Committee concluded that implementers should be able to choose the vocabulary used and specify which vocabulary is used. Whether and how to validate that the appropriate values have been used is an implementation consideration.

With version 2.0 of the PREMIS Data Dictionary, the PREMIS Maintenance Activity at the Library of Congress is establishing a mechanism to register controlled vocabularies in use with PREMIS semantic units and expose them in a way that the PREMIS schemas can include them. Repositories may use these or define their own, but it should be clear what the source of each controlled vocabulary is when exporting metadata for exchange. Interoperability is enhanced if common vocabularies are used and declared.

An implementer may choose to document controlled vocabularies used in its repository so that exchange partners will know what to expect as values in the metadata. For instance, METS users may specify controlled vocabularies used in metadata in a METS profile, or PREMIS profiles may be established to document the same. A mechanism to record the source is provided in the PREMIS XML schemas. Other XML implementations may develop mechanisms to declare controlled vocabularies used or to validate values against specified vocabularies.

In Resource Description Framework (RDF), use of resource URIs as property values is encouraged, and many XML Schemas require attribute values to be URIs. For example, in the XML-Signature Syntax and Processing (XMLDsig), the value of the signature method algorithm must be a URI, such as “http://www.w3.org/2000/09/xmldsig#dsa-sha1”. In general, resource URIs are allowable as values for semantic units in the PREMIS Data Dictionary, unless some noted constraint would disallow this. However, the working group was wary of recommending this practice for preservation.
Resolution of URIs depends on a protocol that, while currently ubiquitous, is outside the control of the preservation repository. Also, the group felt strongly that any information needed for long-term preservation should be stored within the repository itself. If this information is stored as a preservation object, it is best referenced by the repository’s objectIdentifier. Information stored otherwise should still be under the direct control of the repository. Therefore, most examples in the Data Dictionary are names of values rather than resource URIs. The equivalent of the example above might be simply “DSA-SHA1,” which should be assumed to be a constant whose meaning is known to the repository through some table or other documentation under the control of the repository organization.

Extensibility

For several semantic units the Data Dictionary notes the potential for extensibility, to allow implementations to include additional local metadata or to provide additional structure or granularity of metadata, if required. The inclusion of such additional metadata is relatively simple for implementations using relational databases; however, a mechanism for including such metadata when using the PREMIS schemas was not available in the first release of the Data Dictionary and schemas.

Version 2.0 of the Data Dictionary introduces a formal mechanism for extensibility within the schemas for a small number of semantic units which were deemed prime candidates for extension. Later revisions of the Data Dictionary may add to this initial set of extensible semantic units if warranted.
The initial set of semantic units for which extensibility will be supported in the schemas is:

• significantProperties (Object entity)
• objectCharacteristics (Object entity)
• creatingApplication (within objectCharacteristics, Object entity)
• environment (within objectCharacteristics, Object entity)
• signatureInformation (Object entity)
• eventOutcomeDetail (within eventOutcomeInformation, Event entity)
• rights (Rights entity)

These semantic units may be extended by use of an extension container within the Data Dictionary and schemas. Within the Data Dictionary, a corresponding semantic unit is indicated within the defined semantic components for each of the semantic units listed above as an extensible container, with “Extension” added to the name of the container that it extends. An extension may contain metadata encoded according to an external schema. A new container semantic unit, objectCharacteristicsExtension, has also been created within the Object entity to allow inclusion of format-specific technical metadata within PREMIS.

In devising the mechanism for extensibility, the PREMIS Editorial Committee adopted the principle that only semantic units which are containers may be extended. This would enable the use of a PREMIS-defined semantic unit and/or a container for semantic units defined outside of PREMIS. This required some structural change (i.e., the addition of a container) to enable extension of eventOutcomeDetail.

In utilizing the extensibility mechanism with the listed extensible semantic units, the following principles should be observed:

• An extension container may be used to either supplement or replace PREMIS semantic units within the parent container (that is, the container which includes the extension container). The one exception is objectCharacteristicsExtension, which may only supplement objectCharacteristics.
• An extension container may be used with existing PREMIS semantic units, supplementing the PREMIS semantic units with additional metadata.

• An extension container may be used without existing PREMIS semantic units, effectively replacing the PREMIS semantic units with other applicable metadata (except for objectCharacteristicsExtension).

• Where there is a one-to-one mapping between the contents of an extension container and an existing PREMIS semantic unit, recommended best practice would be to use the PREMIS semantic unit rather than its equivalent in the extension; however, implementers may choose to use the extension alone, if circumstances warrant.

• If any semantic unit is not used, it should be omitted rather than included as an empty schema element.

• If the information in an extension container needs to be associated explicitly with a PREMIS unit, the parent container is repeated with the appropriate subunit. If extensions from different external schemas are needed, the parent container should also be repeated. In this case the repeated parent container may include the extension container with or without any other existing PREMIS semantic units for that parent container.

• When an extension container is used, the external schema being used within that extension container must be declared.

Date and time formats in PREMIS

All semantic units that specify the use of a date or date and time suggest the use of a structured form to aid machine processing. In keeping with its being implementation-independent, the Data Dictionary does not specify a particular standard to be used. In some cases, conventions are needed to express other aspects of a time period, such as an open-ended or questionable date. Version 2.0 of the PREMIS XML schema specifies date and time formats and establishes such conventions; it is recommended that these be used when needed.
The following are semantic units that may include a date or date and time:

• preservationLevelDateAssigned (under preservationLevel)
• dateCreatedByApplication (under creatingApplication)
• eventDateTime (under Event)
• copyrightStatusDeterminationDate (under copyrightInformation)
• statuteInformationDeterminationDate (under statuteInformation)
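As an illustration of what a structured form for a value such as eventDateTime might look like, the sketch below serializes and parses timestamps in Python. It assumes the repository has chosen ISO 8601 in UTC as its local convention; the Data Dictionary itself does not mandate this, and the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def format_event_datetime(moment: datetime) -> str:
    """Serialize a timezone-aware datetime as an ISO 8601 string in UTC."""
    if moment.tzinfo is None:
        # A naive datetime is ambiguous, so refuse it rather than guess the zone.
        raise ValueError("expected a timezone-aware datetime")
    return moment.astimezone(timezone.utc).isoformat(timespec="seconds")


def parse_event_datetime(text: str) -> datetime:
    """Parse a string produced by format_event_datetime back into a datetime."""
    return datetime.fromisoformat(text)


# The same instant expressed in UTC and in a zone one hour ahead of UTC.
utc_moment = datetime(2008, 3, 1, 12, 30, tzinfo=timezone.utc)
local_moment = datetime(2008, 3, 1, 13, 30, tzinfo=timezone(timedelta(hours=1)))

stamp = format_event_datetime(utc_moment)  # "2008-03-01T12:30:00+00:00"
```

Normalizing to a single zone before storage means two records of the same instant compare equal as strings, which simplifies automated processing. Open-ended or questionable dates would need an additional convention that this sketch does not attempt to cover.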