What is Data?

Data Definitions and Terminology

Scholarly literature, policy pronouncements, and the popular press are rife with discussions of data that make little attempt to define the term. Even histories of science and epistemology rarely do so. Other foundational works on the making of meaning in science discuss facts, representations, inscriptions, and publications, with little attention to data.

 

In the humanities, the term is rarely mentioned, despite scholars’ use of facts, numbers, letters, symbols, and other entities that would be categorized as data in the sciences and social sciences.

 

As the humanities come to rely more on digital collections, borrow more tools from other fields, and develop more of their own analytic methods for digital objects, their notions of data are becoming more explicit.

 

The concept of data is itself worthy of book-length explication. For the purposes of analyzing data in the context of scholarly communication, a narrower approach will suffice.

 

This overview is constrained to useful definitions, theories, and concepts for exploring commonalities and differences in how data are created, used, and understood in scholarly communities.

 

Definitions by Example

Data are most often defined by examples, such as facts, numbers, letters, and symbols. Lists of examples are not truly definitions because they do not establish clear boundaries between what is and is not included in a concept.

 

The definition offered by Peter Fox and Ray Harris is typical: “‘Data’ includes, at a minimum, digital observation, scientific monitoring, data from sensors, metadata, model output and scenarios, qualitative or observed behavioral data, visualizations, and statistical data collected for administrative or commercial purposes. Data are generally viewed as input to the research process.”

 

Uhlir and Cohen cast a wider net: “The term ‘data’ as used in this document is meant to be broadly inclusive. In addition to digital manifestations of literature (including text, sound, still images, moving images, models, games, or simulations), it refers as well to forms of data and databases that generally require the assistance of computational machinery and software in order to be useful, such as various types of laboratory data including spectrographic, genomic sequencing, and electron microscopy data;

 

observational data, such as remote sensing, geospatial, and socioeconomic data; and other forms of data either generated or compiled by humans or machines.”

 

The Uhlir and Cohen definition recognizes that data can be created by people or by machines and acknowledges relationships between data, computers, models, and software. However, any such list is at best a starting point for what could be data to someone, for some purpose, at some point in time.

 

Operational Definitions

The most concrete definitions of data are found in operational contexts. Institutions responsible for managing large data collections should be explicit about what entities they handle and how, but few of these definitions draw clear boundaries between what are and are not data.

 

Among the best-known principles for data archiving are those in the Reference Model for an Open Archival Information System (OAIS). This consensus document on recommended practice originated in the space sciences community and is widely adopted in the sciences and social sciences as guidelines for data archiving.

 

The OAIS Reference Model uses data as a modifier (dataset, data unit, data format, database, data object, data entity, and so on) while defining data in general terms with examples:

 

Data: A reinterpretable representation of information in a formalized manner suitable for communication, interpretation, or processing. Examples of data include a sequence of bits, a table of numbers, the characters on a page, the recording of sounds made by a person speaking, or a moon rock specimen.

 

The OAIS model distinguishes data from information as follows:

Information: Any type of knowledge that can be exchanged. In exchange, it is represented by data. An example is a string of bits (the data) accompanied by a description of how to interpret the string of bits as numbers representing temperature observations measured in degrees Celsius (the Representation Information). 
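The OAIS distinction can be made concrete with a small sketch. This is an illustrative example, not part of the OAIS specification: the byte string, the encoding (big-endian 16-bit signed integers in tenths of a degree Celsius), and the `interpret` helper are all hypothetical.

```python
import struct

# The "data": an opaque sequence of bits, meaningless on its own.
raw = b'\x00\xd2\xff\xce\x01\x18'

# The "representation information": how to interpret those bits --
# big-endian 16-bit signed integers, each a temperature in tenths
# of a degree Celsius.
representation = {"format": ">h", "unit": "degrees Celsius", "scale": 0.1}

def interpret(data: bytes, rep: dict) -> list[float]:
    """Apply representation information to turn data into information."""
    size = struct.calcsize(rep["format"])
    values = [struct.unpack_from(rep["format"], data, i)[0]
              for i in range(0, len(data), size)]
    return [round(v * rep["scale"], 1) for v in values]

print(interpret(raw, representation))  # [21.0, -5.0, 28.0]
```

Without the representation information, the same six bytes could equally well be three integers, six characters, or anything else: in OAIS terms, data become information only when paired with a description of how to read them.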

 

The Data Documentation Initiative (DDI) is a set of metadata standards for managing data throughout their life cycle (Data Documentation Initiative 2012). The DDI is widely used in the social sciences and elsewhere for data description but does not define data per se.

 

The DDI metadata specifications, which are expressed in XML, can be applied to whatever digital objects the DDI user considers to be data.
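A minimal sketch of what such a description might look like in DDI Codebook style follows. The study title and variable are invented for illustration, and the element set is heavily abbreviated; consult the DDI specification for the actual schema.

```xml
<codeBook xmlns="ddi:codebook:2_5">
  <stdyDscr>
    <citation>
      <titlStmt>
        <titl>Household Survey on Media Use, 2012</titl>
      </titlStmt>
    </citation>
  </stdyDscr>
  <dataDscr>
    <!-- Variable-level description of whatever the depositor considers data -->
    <var name="AGE">
      <labl>Respondent age in years</labl>
    </var>
  </dataDscr>
</codeBook>
```

Note that nothing in the markup constrains what the described object is; the DDI documents data without defining them.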

 

Among the partners in developing the DDI is the Inter-University Consortium for Political and Social Research (ICPSR). ICPSR is a leading international center that has been archiving social science research data since the early 1960s. ICPSR allows contributors to determine what they consider to be their data. Their instructions to prospective depositors offer the following guidance:

 

In addition to quantitative data, ICPSR accepts qualitative research data (including transcripts and audiovisual media) for preservation and dissemination.

 

ICPSR is committed to digital preservation and encourages researchers to consider depositing their data in emerging formats, such as Web sites, geospatial data, biomedical data, and digital video.

 

Thus, even institutions that collect and curate large volumes of data may not impose precise definitions of what they do and do not accept. Data remain an ambiguous concept, allowing archives to adapt to new forms of data as they appear.

 

Categorical Definitions

In operational and general research contexts, types of data may be distinguished by grouping them in useful ways. Data archives may group data by the degree of processing, for example. Science policy analysts may group data by their origin, value, or other factors.

 

Data Scholarship

Data scholarship is a term coined here to frame the complex set of relationships between data and scholarship. Data have taken on a life of their own, or so it would seem from the popular press, independent of the scholarly context in which they are used as evidence of some phenomena.

 

Scholars, students, and business analysts alike now recognize that having enough data and the right techniques to exploit them enables new questions to be asked and new forms of evidence to be obtained.

 

Some of the things that can be done with data are extremely valuable. However, it can be very difficult to determine just how valuable any set of data may be, or the ways in which they may be valuable, until much later.

 

The notion of data scholarship was first framed as “data-intensive research” in policy initiatives that began early in the 2000s, including eScience, eSocial Science, eHumanities, eInfrastructure, and cyberinfrastructure. The first three terms eventually coalesced into the collective eResearch.

 

The UK program on Digital Social Research consolidated earlier investments in eSocial Science, which encompassed data-intensive research in the social sciences and the study of eResearch.

 

eScience often refers collectively to data scholarship in all fields, as exemplified by “eScience for the humanities.” Cyberinfrastructure remains a distinctly American concept that spans data scholarship and the technical framework that supports these activities.

 

Historically, scholarship has meant the attainments of a scholar in learning and erudition. Scholarship is an interior activity that encompasses how one learns, thinks about intellectual problems, and interprets evidence.

 

Research—“systematic investigation or inquiry aimed at contributing to knowledge of theory, topic, etc.” per the Oxford English Dictionary—is a form or act of scholarship.

 

In the sciences and social sciences, research and scholarship often are used interchangeably, whereas scholarship is often preferred in the humanities. Scholarship is the broader concept and the one more closely aligned with scholarly communication.

 

The latter term encompasses formal communication among scholars, such as research publishing and associated activities like peer review, and informal communication, including collaboration, personal exchanges, talks and presentations, and so on.

 

Individuals and individual scholarly communities may know how to exploit their data for their own purposes but know little about what data or methods from adjacent communities might be of value to them, or vice versa.

 

Scaling up to much larger volumes of data leads to qualitative differences in methods and questions. Old approaches are no longer viable, and yet old data often must be combined with the new. Expertise may not transfer easily across domains or methods.

 

The reactions to these challenges are many. Some would release data widely, in whatever forms they exist, to let others exploit at will. Some would hoard data indefinitely, rather than let others extract the value that they had failed to anticipate.

 

Many are paralyzed by risks such as misuse, misinterpretation, liability, lack of expertise, lack of tools and resources, lack of credit, loss of control, pollution of common pools of data, and the daunting challenges of sustainability.

 

Only now is the range of problems in dealing with data becoming apparent. Some of these problems are beginning to be understood sufficiently to make progress toward solutions. Others appear intractable.

 

While scholars may view dealing with data as one more individual responsibility in a time of declining resources for the research and education enterprise, data scholarship is embedded deeply in knowledge infrastructures. Tensions around data arise from concerns for ownership, control, and access to data;

 

the difficulty of transferring data across contexts and over time; differences among forms and genres of scholarly communication; larger shifts in technology, practice, and policy; and needs for long-term sustainability of data and other scholarly content.

 

Much is at stake for scholars, students, and the larger societies in which scholarship is conducted. Knowledge infrastructures provide a framework to assess social and technical interactions, the implications of open scholarship, and converging forms of scholarly communication.

 

Knowledge Infrastructures

The term knowledge infrastructure builds on earlier developments in information, infrastructure, and the Internet. Scholarship on infrastructure has blossomed, the Internet has become wholly embedded in academic life, and information remains a constant flow.

 

The phrase information infrastructure has narrowed in use from suggesting a comprehensive view of the information world to connoting technical communication architectures.

 

Similarly, national information infrastructure and global information infrastructure now refer to policy frameworks of the 1990s.

 

Infrastructures are not engineered or fully coherent processes. Rather, they are best understood as ecologies or complex adaptive systems. They consist of many parts that interact through social and technical processes, with varying degrees of success.

 

Paul Edwards defined knowledge infrastructures as “robust networks of people, artifacts, and institutions that generate, share, and maintain specific knowledge about the human and natural worlds.”

 

These networks include technology, intellectual activities, learning, collaboration, and distributed access to human expertise and documented information.

 

A later community exploration of these ideas addressed three themes: How are knowledge infrastructures changing? How do knowledge infrastructures reinforce or redistribute authority, influence, and power? How can we best study, know, and imagine today’s knowledge infrastructures?

 

Data are a form of information that seems always to be in motion, difficult to fix in a static form. Borders between tacit knowledge and common ground also are shifting, as multiple parties negotiate how data are understood across disciplines, domains, and over time.

 

The norms for knowledge, while never stable, are even more difficult to establish in an age of big data. What does it mean to “know” something if the computational result cannot be fully explained?

 

How much can or should be known about the origins of data to transfer them across contexts? The “trust fabric” implicit in the exchange of information between collaborators is difficult to replicate in exchanges with unknown others, especially across communities and over long periods of time.

 

Some transfers can be mediated by technology, but many will depend upon the expertise of human mediators, whether they are data scientists, librarians, archivists, or other emerging players in the research workforce. Commercial interests also are entering this space.

 

A commons is also a complex ecosystem, defined simply as a “resource shared by a group of people that is subject to social dilemmas”. Examples of how these complex ecologies reinforce or redistribute authority, influence, and power are threaded throughout this blog.

 

Individuals with skills in big data analytics are more highly valued. Scholars with the resources to exploit new forms of data also benefit. New methods of knowledge production, such as data mining and crowdsourcing, help to remap and to reshape the intellectual territory.

 

Massive investments in infrastructure are good for everyone to the extent that a rising tide lifts all boats. The rising tide metaphor is countered by other social and economic trends, such as the knowledge gap theory, media literacies, and the Matthew effect. Generally speaking, the rich get richer.

 

Those with greater abilities to take advantage of new technologies and new information gain differential benefits. The Matthew effect applies to scholarship and was originally formulated in studies of Nobel Prize winners.

 

Those individuals and academic centers held in highest prestige tend to acquire recognition and resources disproportionately. New discoveries by established centers of excellence will receive more attention than those from less-known centers.

 

Conversely, scholars in lower-tier institutions and in less-developed countries typically have fewer skills and fewer local resources to use technological innovations effectively.

 

Invisibility is a concern in the design and maintenance of infrastructures in at least two respects. One is the defining characteristic of infrastructures: they may become visible only upon breakdown.

 

People often are unaware of how much they depend on infrastructure, whether the electrical grid or the interoperability between two instruments, until that infrastructure ceases to function.

 

Second is the amount of invisible work necessary to sustain infrastructures, whether electrical grids, networks of scientific instruments, or distributed repositories of research data.

 

Those who benefit from using these infrastructures are often unaware of the background effort involved in keeping all the parts working smoothly together.

 

Invisible work is a salient characteristic of knowledge infrastructures because labor to document, organize, and manage scholarly information is essential for others to discover and exploit it.

 

Invisible work is both glue and friction in collaborations, in the development of tools, in the sharing and reuse of data, and many other infrastructure components.

 

Knowledge infrastructure may be a new term, but the idea is not. Since the earliest days of intellectual inquiry, scholars have learned how to swim—or how to design better boats—to avoid drowning in the deluge, flood, or tsunami of data headed their way.

 

Scholars have complained about being overwhelmed with information since at least the first century CE. Concerns for the abundance of books arose well before the early modern period.

 

By the mid-thirteenth century, solutions to the information problem were being formulated. These include title pages, concordances, and florilegia. Florilegia, literally “selected flowers,” were compilations of the best sections of books on a topic. Indexes were popular by the sixteenth century.

 

Blair explores how early scholars coped individually with their reading and interpretation through sophisticated note-taking and organization mechanisms. Relationships among authors, publishers, indexers, libraries, booksellers, and other stakeholders have been in flux ever since.

 

The Social and the Technical

Data scholarship is rife with tensions between the social and the technical. Rarely can these factors be separated, because they are reflexive and mutually influencing. The tool makes data creation possible, but the ability to imagine what data might be gathered makes the tool possible.

 

Rather than attempt to resolve those long-standing debates, the premise of the provocations is that the social and the technical aspects of scholarship are inseparable. Neither the data nor the tool can be understood without the other.

 

As a philosopher, Latour tends to use science in ways that encompass most forms of scholarly inquiry. In North American parlance, the sciences frequently are distinguished from the social sciences and the humanities. Further distinctions can be made, such as separating disciplinary topics from professional pursuits in engineering, medicine, law, and education.

 

Although useful to identify institutional boundaries such as academic departments, such distinctions are arbitrary with respect to knowledge and scholarship.

 

Science is sometimes used herein in the general sense of scholarly knowledge and practice. When distinctions by discipline are useful, as in case studies, the sciences, social sciences, and humanities are invoked.

 

Larger questions of the history and philosophy of science are at play in the rising attention to data. Science is an expensive public investment.

 

Since World War II, and particularly since the end of the Cold War, the public has demanded more accountability, more voice in research directions, and more access to the results of research.

 

As relationships between the scientific enterprise and the public shifted, social scientists became more eager to study scholarly work. Scientists and other scholars also became more willing to be studied, both to have their voices heard and to benefit from external observations of their work.

 

Communities and Collaboration

Policies, practices, standards, and infrastructure for data usually refer to the communities associated with those data. Data management plans are a prime example: “What constitutes such data will be determined by the community of interest through the process of peer review and program management” (National Science Foundation 2010a).

 

Similarly, policies for digital archiving are framed in terms of the “designated community.” Data often are “boundary objects” that exist tenuously on the borders between domains. By examining the roles that data play in collaborations, the boundaries, scope, agreements, and disagreements of communities come into view.

 

Collecting, creating, analyzing, interpreting, and managing data requires expertise in the research domain. Many kinds of expertise are required, some theoretical and practical, some social and technical. Parts of this expertise can be readily taught or learned from books, journals, or documentation, but much of it is deeply embedded knowledge that is difficult to articulate.

 

Best known as “tacit knowledge,” which is itself a complex construct, the expertise most critical to exploiting data is often the least transferable between communities and contexts. Community is a theoretical construct well known in the social sciences. In social studies of science and of scholarship, communities of practice and epistemic cultures are core ideas.

 

Communities of practice is a concept originated by Lave and Wenger to describe how knowledge is learned and shared in groups, a concept subsequently much studied and extended. Epistemic cultures, by contrast, are neither disciplines nor communities.

 

They are more a set of “arrangements and mechanisms” associated with the processes of constructing knowledge and include individuals, groups, artifacts, and technologies.

 

Common to communities of practice and epistemic cultures is the idea that knowledge is situated and local. Nancy Van House summarizes this perspective succinctly:

 

“There is no ‘view from nowhere’—knowledge is always situated in a place, time, conditions, practices, and understandings. There is no single knowledge, but multiple pieces of knowledge.”

 

Knowledge and Representation

Despite efforts to commodify them, data are “bright shiny objects” only in the sense of being a popular topic that attracts and distracts attention. Signals, recordings, notes, observations, specimens, and other entities came to be viewed as data through a long cultural process within research fields, disciplines, and specialties.

 

Documented traces of scientific practice are known as “inscriptions.” Each field develops its own inscriptions to document, describe, and represent what it considers to be data.

 

Common methods of representing data—metadata, markup languages, formats, labeling, namespaces, thesauri, ontologies, and so on—facilitate the exchange of data within a field. A common form of representation can define the boundaries of a community.

 

Similarly, those boundaries can become barriers for those who wish to move data between fields that employ competing forms of representation. Various diseases, drugs, flora, fauna, and phenomena go by many different names. The ability to combine data from multiple sources depends on these inscriptions.

 

Data, standards of evidence, forms of representation, and research practices are deeply intertwined. Differences between communities often become apparent only when attempts are made to use or combine external data sources, to collaborate, or to impose practices from one community onto another.

 

Transferring knowledge across contexts and over time is difficult, as framed in the second provocation. Data are no easier to move than are other forms of knowledge.

 

Often they are among the most difficult because their meaning depends on the apparatus that surrounds them—the software, hardware, methods, documentation, publications, and so on.

 

Journal articles, conference papers, books, and other genres of publication are packages of information that are intended to be interpretable, at least by the sophisticated reader, as an independent unit.

 

They are representations of scholarly knowledge and often include representations of data in forms suitable for dissemination, discovery, and exchange. Forms in which scholarly publications are represented evolved over the course of centuries.

 

Title pages, statements of authorship, tables of contents, indexes, and other features that are now accepted components of scholarly books developed incrementally.

 

Some of these features, such as statements of responsibility, were transferred from books to articles with the first scholarly journals in 1665: the Journal des Sçavans in Paris and the Philosophical Transactions of the Royal Society in London.

 

In the time since, expansive knowledge infrastructures have arisen around scholarly publishing. Publishers, peer review, bibliographic citation, indexing and abstracting services, information-retrieval systems, and evaluation metrics such as journal impact factors are all part of this knowledge infrastructure.

 

Theory, Practice, and Policy

Data scholarship is a concept that transcends theory, practice, and policy. Data policy writ small is the set of choices made by researchers about matters such as what they consider data, what they save, what they curate; what they share, when, with whom; and what they deposit, when, and for how long.

 

Data policy writ large is the set of choices made by governments and funding agencies about matters such as what they consider to be data; what they require researchers to save;

 

what they require researchers to release, when, how, and to whom; what data they require to be curated, by whom, and for how long; and how these requirements are implemented in grant proposals, in awards, and in the provision of data repositories.

 

Data policy in the middle range is the choices made by research institutions, universities, publishers, libraries, repositories, and other stakeholders about what they consider to be data and their role in curating and disseminating those data.

 

In turn, these rest on larger policies about research funding, intellectual property, innovation, economics, governance, and privacy.

 

Policies of governments, funding agencies, journals, and institutions that are intended to improve the flow of scholarly communication often make simplifying assumptions about the ability to commodify and exchange information.

 

While usually intended to promote equity across communities and disciplines, policies that fail to respect the substantial differences in theory, practice, and culture between fields are likely to be implemented poorly, be counterproductive, or be ignored by their constituents.

 

Individual communities may have their own moral economies that govern how data are collected, managed, and shared. Current policies for data management plans and data sharing tend to focus on releasing data rather than on the means to reuse them or to sustain access, which are complex and expensive aspects of knowledge infrastructures.

 

Open Scholarship

Open access, open source, open data, open standards, open repositories, open networks, open bibliography, open annotations … the list continues indefinitely. 

 

Today’s knowledge infrastructures were shaped by, and continue to shape, developments in open access to research; mechanisms to improve interoperability among systems, tools, and services; advances in distributed computing networks and technologies; and nearly ubiquitous access to the Internet.

 

Open scholarship is no more easily defined than is data scholarship. It is most closely equated to open science.

 

For the purposes of discussion here, open scholarship encompasses the policies and practices associated with open access publishing, open data, data release, and data sharing.

 

Among the expectations for open scholarship are to speed the pace of research, to encourage new questions and forms of investigation, to minimize fraud and misconduct, to facilitate the growth of a technically and scientifically literate workforce, and to leverage public investments in research and education.

 

The use of a single term such as open scholarship, however, risks masking substantial differences in these many forms of open access. Publications and data each play distinct roles in scholarship, as framed by the third provocation and explicated below.

 

What open access to publications and open data have in common is the intent to improve the flow of information, minimize restrictions on the use of intellectual resources, and increase the transparency of research practice.

 

Where they differ is in their kinds of value for scholarship, the array of stakeholders involved, and their mobility across contexts and over time.

 

Open Access to Research Findings

Scholarship moved from the realm of private letters and meetings to open dissemination with the advent of the first journals in 1665. Readers had access to books, journals, and other publications through libraries, booksellers, and personal subscriptions. Private exchange of letters, drafts, manuscripts, and preprints continued in parallel.

 

Open access to research findings made a giant leap forward in 1991 with the launch of arXiv, originally known by its address xxx.lanl.gov, as it predated the World Wide Web.

 

In the twenty-plus years since, arXiv has expanded to other scientific fields, moved from Los Alamos National Laboratories to Cornell University, and gained a broad base of support from member institutions.

 

Its usage continues to rise exponentially. More than eight thousand papers per month are contributed to arXiv, and more than sixty million papers were downloaded in 2012 alone.

 

Several lessons from the launch of arXiv are important for today’s considerations of open access to data. One is that the system was an outgrowth of the active preprint exchange culture in high energy physics.

 

It built on an existing knowledge infrastructure that supported the flow of information within networks of close colleagues, known as invisible colleges.

 

The second lesson is that arXiv disrupted that knowledge infrastructure by changing relationships among authors, publishers, libraries, and readers as stakeholders in the scholarly communication of physics.

 

Researchers and students alike, in countries rich and poor, gained access to papers much sooner than the official date of publication.

 

Journal editors and publishers in physics had little choice but to accept the existence of arXiv, given its rapid adoption. Previously, many journals refused to consider papers posted online on the grounds that such posting constituted “prior publication.” Similar policies remain in place in many fields today.

 

A third lesson is that the success of arXiv did not transfer quickly or well to other fields. Although preprint servers in other disciplines are gaining in size and popularity, none has become as deeply embedded in scholarly practice as arXiv.

 

However, even arXiv is not embedded in all of physics, mathematics, astronomy, or the other fields it covers. In some research specialties, usage is ubiquitous. In other specialties, it is only lightly used.

 

Open access to publications built on those early lessons. Open access is a simple concept that is commonly misunderstood, in no small part due to the competing interests of many stakeholders.

 

Peter Suber’s definition is the most succinct: “Open access (OA) literature is digital, online, free of charge, and free of most copyright and licensing restrictions.”

 

As Suber hastens to point out, open access to the literature of research and scholarship operates in a much different realm than does open access to other forms of content.

 

The first principle on which OA literature rests is that authors are the copyright holders of their work, unless and until they transfer ownership to another party such as a publisher. The second is that scholars rarely are paid for writing research articles.

 

They can distribute their work widely without losing revenue, which is not the case for most other authors, artists, or creators. Scholarly authors write research articles for impact or influence, rather than for revenue.

 

It is in a scholar’s best interests to reach as wide an audience as possible. The primary sources of funding for scholarly research are academic salaries and research grants. It is also in the best interests of the institutions that employ and fund scholars that their publications achieve the highest possible impact.

 

Open access to literature can be accomplished by any means, under multiple governance models, and goes by many names (e.g., green, gold, gratis, libre, etc.). What these models have in common is that they rest on the two principles described above.

 

Authors generally retain copyright or license their work for open access distribution. Authors also retain the right to be attributed as the creators of the work. Different considerations apply for open access to scholarly books, textbooks, and other works for which authors generally do receive direct revenue.

 

Since the mid-2000s, a growing number of research universities around the world have adopted open access policies for the journal publications of their faculty.

 

In the United States, these include Harvard, Massachusetts Institute of Technology, California Institute of Technology, and the University of California. Generally, these policies grant the universities a non-exclusive license to disseminate the work, usually through a public repository.

 

Seismic shifts toward open access to publications occurred in 2012 and 2013. In 2012, the Research Councils of the United Kingdom (RCUK) announced that all peer-reviewed journal articles and conference papers resulting in whole or in part from their funding are to be submitted to open access journals, effective April 2013.

 

The definition of “open access journal” in the policy has been particularly contentious and has been modified and interpreted several times since. It involves embargo periods, an array of business models, and some interim subsidies.

 

In 2013, the executive branch of the US government announced a similar policy for open access to publications resulting from federal funding, generally following the embargo periods and policies established by the National Institutes of Health and PubMed Central. The European Union, Australia, and other countries are debating similar policies.

 

These various policies, business models, and new genres of publishing have resulted in broader public access to the scholarly journal literature. Accounting for embargo periods, about half of journal articles are freely available online within a year of publication, and that proportion is expected to grow.

 

While the devil is in the details, open access to scholarly journal articles is becoming the norm. Tensions between stakeholders have not disappeared, however.

 

Authors continue to post articles, papers, and other works online for which these open access policies do not apply, and some publishers are becoming more aggressive about policing access to works for which they are the exclusive copyright holder.

 

Open Access to Data

Many funding agency policies for open access to data are linked to policies for open access to publications. The UK policy is most explicit about the relationship: “The Government, in line with its overarching commitment to transparency and open data, is committed to ensuring that published research findings should be freely accessible.”

 

The RCUK policy on publishing in open access journals requires authors to specify how data associated with the publication can be accessed but acknowledges the complexities of doing so. The requirement is intended “to ensure that researchers think about data access issues. However, the policy does not require that the data must be made open.

 

If there are considered to be compelling reasons to protect access to the data, for example, commercial confidentiality or legitimate sensitivities around data derived from potentially identifiable human participants, these should be included in the statement.”

 

The National Institutes of Health in the United States requires that publications resulting from their grant funds be deposited into PubMed Central. They also require data management plans with grant proposals. The National Science Foundation has requirements for data management plans but not for open access publishing.

 

However, the subsequent US federal policy on open access publishing will apply to NSF, NIH, and other federal agencies that spend more than $100 million annually in research and development. The policy directs each agency to develop an open-access plan for scientific publications and digital scientific data.

 

Open access to journal articles and open data differ with respect to both of Suber’s principles, however. Whereas authors are the copyright holders on journal articles, at least initially, the same is rarely true for data. Determining who qualifies for authorship is a much-debated topic within and between fields.

 

Once settled, certain rights and responsibilities accrue to the authors of a work. Determining who qualifies as an “author” of data is a barely explored topic within most collaborations.

 

Even when individuals and groups assign authority for data, the rights and responsibilities may remain unclear. Many forms of data are created and controlled by scholars, but ownership is a different matter. Some forms of data are deemed to be facts that cannot be copyrighted. Researchers use data that are owned by other parties or acquired from common pools of resources.

 

Some types of data, such as confidential records on human subjects, are controlled by scholars but cannot be released. Rights policies may vary by institution, funding agency, contract, jurisdiction, and other factors.

 

Suber’s second principle is that scholarly authors write journal articles and many other forms of publication for impact, not for revenue. Scholars, their employers, and their funders are motivated to disseminate their publications as widely as possible. Neither situation is true for most data.

 

Journal articles are packaged for dissemination to an audience, whereas data are difficult to extricate from the processes of scholarly work. Releasing data often requires substantial investment over and above the conduct of the research and the writing of publications. Data can be valuable assets that are accumulated over the course of a career, to be released carefully, if at all.

 

Governance models for open access to data are nascent at best. “Freely accessible,” as in the RCUK policy quoted above, appears to mean free as in free speech rather than free as in beer.

 

The essential questions to address in any commons concern equity, efficiency, and sustainability. Only a few fields have found ways to address equity and efficiency through the use of repositories to ingest, curate, and provide access to data.

 

Private exchange suffices in some fields. Others are turning to research libraries for assistance. In all cases, sustainability is a problem. Some repositories have long-term funding, others only short term. Some provide data to anyone without a fee, others serve data only to members of the consortia that fund them. 

 

Open data is thus substantially distinct from open access to scholarly literature. Little consensus exists on what it means for data to be “open.” An early and specific formulation came from Peter Murray-Rust and Henry Rzepa. Both chemists, they are concerned with free access and with the ability to mine structured data.

 

When entities such as molecules are represented in ways that an algorithm can identify their structure, they become useful as data for mining, extraction, and manipulation.

 

When those same molecules are represented only as images in a text file, human eyes are necessary to identify their structure. Open data, in their formulation, are data that are structured for machine readability and that are freely accessible.
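The contrast can be sketched with a toy example. A line notation such as SMILES (one common machine-readable representation of molecular structure) yields countable structure that an algorithm can extract, whereas an image of the same molecule would not. This is a deliberately simplified sketch, not a full SMILES parser; the strings are standard textbook examples.

```python
import re
from collections import Counter

def atom_counts(smiles: str) -> Counter:
    """Count element symbols in a SMILES string.

    Simplified for illustration: handles the common organic subset
    and ignores brackets, charges, and aromatic lowercase atoms.
    """
    # Match two-letter symbols (Cl, Br) before single-letter ones.
    tokens = re.findall(r"Cl|Br|[BCNOPSFI]", smiles)
    return Counter(tokens)

# "CCO" is SMILES for ethanol: two carbons and one oxygen.
print(atom_counts("CCO"))      # Counter({'C': 2, 'O': 1})
print(atom_counts("CC(=O)O"))  # acetic acid: two carbons, two oxygens
```

Because the representation is structured text, the same routine works unchanged across millions of records, which is precisely what makes such data minable.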

 

Murray-Rust and others, under the aegis of the Open Knowledge Foundation, later developed a succinct legal definition of open data: “A piece of data or content is open if anyone is free to use, reuse, and redistribute it — subject only, at most, to the requirement to attribute and/or share-alike”.

 

In business contexts, definitions are more ambiguous: “Open data—machine-readable information, particularly government data, that’s made available to others”. The OECD Principles and Guidelines for Access to Research Data from Public Funding frames open access to data in terms of thirteen principles.

 

The UK Royal Society report called Science as an Open Enterprise defines open data as “data that meets the criteria of intelligent openness. Data must be accessible, usable, assessable and intelligible.”

 

Implications for biomedical data include cost-benefit trade-offs, triggering and timing of data release, means to ensure data quality, the scope of data to include, confidentiality, privacy, security, intellectual property, and jurisdiction.

 

Openness can facilitate the creation of data. Open access to texts, for example, enables the entities in those texts to be treated as data.

 

Using text-mining techniques, it becomes possible to locate all the articles or books describing a particular molecule, celestial object, person, place, event, or other entity. Publication databases, data archives, and collections of digitized books begin to look alike from a data mining perspective. 
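A minimal sketch of this idea: once texts are openly accessible, locating every document that mentions an entity is a simple matching operation over the corpus. The corpus, identifiers, and entity names below are invented for illustration; real systems use far more sophisticated entity recognition.

```python
import re

# Toy corpus mixing article abstracts and a book record: from a
# mining perspective, all are simply text (contents are invented).
corpus = {
    "article-1": "We measured the spectrum of caffeine in solution.",
    "article-2": "Survey of star-forming regions in Orion.",
    "book-1":    "A history of coffee, caffeine, and trade routes.",
}

def mentions(entity: str, docs: dict) -> list:
    """Return identifiers of documents that mention the entity."""
    pattern = re.compile(rf"\b{re.escape(entity)}\b", re.IGNORECASE)
    return [doc_id for doc_id, text in docs.items() if pattern.search(text)]

print(mentions("caffeine", corpus))  # ['article-1', 'book-1']
```

Note that the query cuts across genres: the article and the book are retrieved alike, which is why publication databases and digitized book collections begin to look the same to a mining algorithm.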

 

Open data also can include the ability to treat representations of research objects as data, whether or not the objects being described are open.

 

An example is the creation of open tags or annotations for publications, datasets, and other content. Annotations and bibliographies add value to the items described, making them more discoverable.

 

Interest in sharing annotations arose early in digital library research, leading to competing methods and to standards efforts for the interoperability of annotation systems. Early tools to manage personal bibliographies, such as ProCite, BiblioLink, RefWorks, and EndNote, were predicated on sole-author writing and on files held locally.

 

By the early 2000s, individuals began to tag and annotate websites, images, publications, and data and to share those tags on social networks such as Delicious and Flickr.

 

By the latter 2000s, personal bibliographies and open annotation were converging. Zotero, Mendeley, LibraryThing, and other tools enable bibliographies, tags, and notes to be shared.

 

The open bibliography movement got a large boost when national libraries began to release cataloging records for open use. As more bibliographic records become openly available, they can be treated as data to be mined. Annotation tools support a growing array of data types.

 

Open Technologies

Open scholarship is part of a forty-plus-year transition from closed to open networks and technologies. Stories of the origin and trajectory of the Internet are tales of this transition.

 

It is generally agreed that computer networks were developed with government funding to serve research and military purposes.

 

From the first international interconnections of networks in the late 1960s until the policy changes of the early 1990s, the Internet was available only to the research, academic, and military communities via government contracts.

 

These became known as National Research and Education Networks (NRENs). Developed initially to share expensive computer cycles over the network, Internet capabilities expanded to include e-mail, file transfer, and similar features.

 

Parallel commercial packet-switched networks, such as Telenet and Tymnet, provided commodity communications to private enterprise in support of business activities and new information services such as bibliographic databases.

 

Policy changes in 1993–1994, under the rubrics of national information infrastructure and global information infrastructure, allowed interconnection between government and commercial networks.

 

This was the launch of the commodity Internet and the conversion of communication networks from government-owned or protected systems to commercial operations. The Internet was declared “open” to interconnection and to the provision of services by public and private entities alike.

 

Opening the network coincided with the first demonstration of the World Wide Web and with the first browser interfaces. In the two decades since, Internet technologies, capacity, and user communities have scaled beyond even the wildest imaginations of the original designers.

 

However, new business models, shifts in the balance of stakeholders, and unforeseen challenges of security and privacy are contributing to the redesign of the infrastructure.

 

Moving data over open networks is a different matter from being able to use those data once acquired. Data in digital form and representations in digital form are readable only with appropriate technologies.

 

To interpret a digital dataset, much must be known about the hardware used to generate the data, whether sensor networks or laboratory machines; the software used to encode or analyze them, whether image-processing or statistical tools; and the protocols and expertise necessary to bring them together.

 

Technologies evolve very quickly, especially in research environments. Many instruments produce data that can be read only with proprietary software. To use or reuse those data requires access to the right version of the software and perhaps to the instrument.

 

Many analytical tools are proprietary, thus analysis of data may yield datasets in proprietary formats, independent of how open the data were at the time of ingest. Scholars often build their own tools and write their own code to solve problems at hand.

 

Although an efficient practice in the short term, local code and devices can be very hard to maintain in the long term. Rarely are they constructed to industrial standards for software engineering. Local tools are flexible and adaptable, often at the expense of portability across sites and situations.

 

The degree of openness of data, standards, and technologies influences the ability to exchange data between tools, labs, partners, and over time. Standards can improve the flow of information within communities but also can create boundaries between them. They may be premature or inappropriate, creating barriers and preventing innovation.

 

Technical interoperability of systems and services has long been the holy grail of digital libraries and software engineering. Interoperability allows some data and stakeholders in and keeps others out.

 

Policy, practice, standards, business models, and vested interests often are greater determinants of interoperability than is technology.

 

Converging Communication

Formal and informal communications are converging in business, government, and scholarship alike. To stay in business, it is no longer sufficient for a company to be visible on Main Street and in the daily newspapers.

 

Now businesses also need a presence on the World Wide Web, social networks, blogs and microblogs, and video channels. Governments must be available to citizens in capital cities and in individual neighborhoods. With the growth of digital government, they also must be available online, providing public services 24/7.

 

Similarly, scholars are exerting their influence not only through the literature of their fields but also via web pages, preprint servers, data archives, institutional repositories, archives of slides and figures, blogs and microblogs, social networks, and other media as they are invented.

 

New technologies facilitate new means of communication, but they also destabilize existing models. As old models are mapped onto the new, metaphors may be stretched to their breaking points.

 

Data Metaphors

The roles that publications and data play in scholarly communication are conflated in metaphors such as “data publication” and “publishing data.” The simplifying assumptions in these metaphors pose risks to new models of scholarly communication, as stated in the third provocation.

 

Strictly speaking, publishing means “to make public,” thus many acts could be construed as publishing. In the context of scholarship, however, publishing serves three general purposes: (1) legitimization, (2) dissemination, and (3) access, preservation, and curation. The first function is usually accomplished by peer review.

 

The fixed point of publication, which typically is the document of record, manifests the legitimization process, conferring a stamp of quality and trust awarded by the community.

 

Citations accrue to published units as the legitimized records of research. The dissemination function is essential because research exists only if it is communicated to others.

 

Publishers disseminate research as journals, books, conference proceedings, and other genres. Authors disseminate their own work by distributing publications to colleagues, posting them online, and mentioning them in talks, blogs, social networks, and beyond.

 

The third function is to make the work available and discoverable, to ensure that copies are preserved, and usually also to ensure that a copy is curated for long-term use.

 

The latter function tends to be a joint responsibility of authors, publishers, and libraries. Scholars are motivated to publish their work since publishing is the primary form of recognition for hiring, promotion, and other rewards.

 

The data publishing metaphor is apt only in the narrow use of the term, analogous to journal and book publishing.

 

For example, the Organisation for Economic Co-operation and Development (OECD) publishes a wide array of national and international statistics on gross domestic product, employment, income, population, labor, education, trade, finance, prices, and such.

 

Various governmental bodies publish census data and similar statistics. Outside the scholarly arena, data publishing may refer to the distribution of documents consisting of lists, facts, or advertisements rather than narrative. A company by that name has been producing directories of local telephone numbers and similar information since 1986.

 

Beyond that narrow usage, the data publishing metaphor breaks down. Most often, it refers to the release of datasets associated with an individual journal article. Data may be attached to articles but they rarely receive independent peer review because they are difficult to assess.

 

In this sense of data publication, the data often are deposited in an archive and linked to the article, rather than published as a unit. The dataset may become discoverable and be curated, but it is not distributed as an entity that stands alone, nor is it self-describing in the way that journal articles are.

 

Data publication also can refer to posting data on the author’s website, in which case none of the three functions of publishing is well served.

 

In a few cases, the term refers to an archive that collects data and makes them accessible to others. Discovery and curation may be accomplished, but peer review and dissemination are not core activities of most data archives.

 

The argument in favor of the metaphor is familiarity—scholars understand how to publish and cite articles. Implicit in this argument is that familiarity will encourage data release.

 

Although often stated as fact, little evidence exists that data citation is an incentive to release data. The data publication metaphor also promotes the interests of publishers who would package data as units to disseminate, extending current business models.

 

The arguments against the data publication metaphor are many, but it persists. Calls to “publish your data” with each journal article are risky because they reify a binary link between articles and datasets.

 

In fields where a binary relationship exists and where papers are expected to be reproducible based on those datasets, the mapping may serve the interests of the community.

 

Those fields are few and far between, however. For such a one-to-one mapping to be effective, it must be supported by a larger knowledge infrastructure that includes peer review of datasets, the availability of repositories, journal policies and technologies to facilitate linking, and access to the necessary hardware, software, and other apparatus for reproducibility.

 

A one-to-one mapping between a journal article and a dataset is but one of many possible relationships between publications and data. Relationships frequently are many-to-many. The full scope of data and information resources associated with a given publication can be impossible to specify.

 

Direct links are useful to the extent that they support discovery and reproducibility, but an architecture that requires one-to-one links constrains the ability to discover or reuse datasets for other purposes.

 

The open data movement is predicated on the ability to assemble and compare data from many sources, which requires open technologies.

 

Data publication is but one of the five metaphors for data stewardship identified by Parsons and Fox (2013), all of which they find to be problematic and incomplete.

 

Their second metaphor is “big data” or “big iron,” referring to the industrial production and engineering culture associated with data in astronomy, climate science, high energy physics, and similar areas.

 

Big iron is concerned with quality assurance, data reduction, versioning, data and metadata standards, and high throughput. “Science support” is their third metaphor, referring to areas such as field ecology in which it can be difficult to separate the science from the data or the data collectors from the curators.

 

“Mapmaking,” their fourth metaphor, refers to the geospatial data essential for climate models, land use, surveys, and many other purposes. These data are integrated into layers, published as a map rather than as an article or paper.

 

“Linked data” is their last metaphor. While a means of linking datasets and publications, it is part of a larger movement to aggregate related units of data, publications, and documentation.

 

Notions of linked data are fundamental to the semantic web. To be effective, they rely on graph models of organization and require agreements on ontologies and standards. Open data is central to this worldview, but preservation, curation, and quality assurance are not high priorities.
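The graph model behind linked data can be sketched in a few lines: every statement is a subject-predicate-object triple, and following links is a matter of querying the triple store. The identifiers and predicates below are invented for illustration; real systems use RDF with agreed ontologies and URIs.

```python
# Statements as subject-predicate-object triples (all names invented).
triples = [
    ("paper:42",  "cites",      "paper:7"),
    ("paper:42",  "hasDataset", "dataset:A"),
    ("dataset:A", "curatedBy",  "archive:X"),
    ("paper:7",   "hasDataset", "dataset:B"),
]

def objects(subject: str, predicate: str) -> list:
    """All objects linked from a subject by a given predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]

# Traverse the graph: which archive curates the dataset behind paper:42?
for dataset in objects("paper:42", "hasDataset"):
    print(dataset, "->", objects(dataset, "curatedBy"))
```

Aggregating a publication with its data and documentation is then a matter of adding triples, which is why the approach scales to many-to-many relationships rather than binary article-dataset links.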

 

Units of Data

The metaphors associated with data make simplifying assumptions about the appropriate unit to disseminate, cite, use, or curate. Data can be represented in units of any size, whether pixels, photons, characters, strokes, letters, words, cells in a spreadsheet, datasets, or data archives.

 

Even the term dataset is problematic, referring to at least four common themes—grouping, content, relatedness, and purpose—each of which has multiple categories. 

 

Datasets ranging in size from a few bits to many terabytes can be treated as independent objects. The appropriate unit depends on the intended use. Sometimes it is useful to aggregate many units of data for comparison and mining; at other times it is useful to extract portions of larger resources.

 

Books and journal articles, once existing only on paper as convenient units of communication, now can be broken down into much smaller units. Search engines retrieve articles as independent entities, not part of a journal issue carefully assembled by the editorial staff.

 

Within those articles, each table, figure, and dataset may have its own identifier so that it can be retrieved independently—separated from the context of the research method, theory, and conclusions. Books, journal articles, and other forms of text can be treated as units or aggregates of data, searchable by words, phrases, and character strings.

 

The boundary between formal communication such as publications and informal communication such as presentations and conversations continues to blur as scholarly content becomes more atomized and treated as data.

 

Journal articles, preprints, drafts, blog posts, slides, tables, figures, video presentations of talks, tweets, Facebook and LinkedIn posts, and countless other entities can be distributed independently.

 

Public, albeit commercial, repositories for slides and figures are popular for their ease of deposit and access and because they accept useful entities not readily published elsewhere. The flexibility of unitizing and linking digital objects promotes new forms of communication.

 

For example, when journals require substantial page charges to publish color figures—even in digital form—authors may publish only the grayscale version with the journal article, depositing or posting the full-color figures elsewhere.

 

Because the color image is necessary to interpret the findings, these authors have chosen an affordable means to reach their audience, at least in the short term. Some of this content will be sustained, but much will not. Links break early and often.

 

The usual response to the problem of disaggregation is to reaggregate content both to reconstruct original relationships among parts and to create new aggregations.

 

Linked data approaches can be used to reconstruct the value chain of scholarship, connecting articles, data, documentation, protocols, preprints, presentations, and other units.

 

This approach is suitable for units that are readily networked but is not a generic solution for linking resources across systems and services.

 

Similarly, data mining of the open literature may identify data in texts, tables, and figures, but not in supplemental materials or archives. Multiple approaches are needed to address the problems of disaggregation, reaggregation, citation, and units of publication.
