What is Astronomy?
Astronomy has a long, illustrious, and well-documented history. In modern usage, per the Oxford English Dictionary, astronomy “deals with the universe beyond the Earth’s atmosphere, comprising the study of celestial objects and extraterrestrial phenomena, and of the nature and history of the universe.”
As with any research area, the boundaries are porous. Astronomers commonly collaborate with physicists, engineers, and computer scientists. Some have partners in statistics, medicine, biology, art and design, and elsewhere. This blog post explores several astronomy terms and data practices.
Astronomy has many characteristics that help to elucidate larger themes. It is a big data field that has several decades of experience with digital data and computation.
In that time period, astronomers have encountered radical changes in data practices, shifting from sole-investigator work to large team collaborations and from local control over instruments to shared international resources.
Astronomy has become a field as diverse as any other, consisting of pockets of big data, little data, and occasionally no data. Some astronomers are pioneers in data sharing and reuse; others are known to hoard their data resources.
They have constructed a sophisticated knowledge infrastructure to coordinate information resources on a global scale, yet rely on human expertise to identify relationships between individual objects. Substantial gaps in their ability to discover and exploit information resources remain.
The case study focuses on astronomy as an observational science, recognizing that these examples do not encompass the entire field. The scope is sufficiently broad to explore the sources, uses, and disposition of data, yet narrow enough to offer coherent explanations about the choices of data in a research area.
The material is drawn from published literature and from current research on data practices in the field. The example of conducting research in astronomy is developed through interviews, observations, and other analyses of the work of a team based at the Harvard-Smithsonian Center for Astrophysics (CfA).
Astronomy Size Matters
Astronomy varies widely in the size of collaborations and the volumes of data handled. Until the latter part of the twentieth century, astronomy remained a sole-investigator science. Galileo owned his own telescopes and had complete control over his own data.
Modern telescopes, including the 200-inch Hale telescope at Palomar Mountain that remained the world’s largest for many years, were privately owned with access by only a select few astronomers.
Between the time the Hale telescope was dedicated in 1948 and the first of the Gemini eight-meter telescopes was dedicated in 1999, the telescope became a very different research tool.
At the same time, more public funding was invested in astronomy, resulting in cultural changes in the field. Telescope time was available to many more astronomers, as was access to the data.
The astronomy enthusiast community is immense, swelling the ranks of those with astronomy expertise. Teams both large and small conduct astronomy research. Modern telescopes require one to two decades to design, build, and deploy; thus, very long-term planning is required.
Time constraints are partly social and political—the work to form and coordinate teams and to secure multiple cycles of funding;
partly technical—instruments and software must be designed to anticipate the state of technology at the time of deployment; and partly physical—the construction of instruments, including casting, cooling, and polishing the glass of telescope mirrors, is a multiyear process.
The design of instruments and missions, in turn, influences decisions about the resulting data, how those data will be captured and curated, and the conditions under which astronomers have access to the instruments and the data.
As each new telescope is launched into space or the ones on the ground see “first light,” the design of the next generation of instruments is well underway.
As astronomy came to depend more on public funding and on community collaboration, astronomers began to build consensus on research priorities. Since the 1960s, the US astronomy community has conducted a “decadal survey” to identify the highest priority projects. The scientific categories of projects change from decade to decade, reflecting shifts in the field.
Among the nine panels in 2010 were Cosmology and Fundamental Physics, Planetary Systems and Star Formation, Electromagnetic Observations from Space, and Optical and Infrared Astronomy from the Ground.
The decadal survey is a set of recommendations to the national and international funding agencies for areas the community requests support. It is not a guarantee of funding or of priorities of allocation, however. Actual funding is subject to negotiation with the agencies, with parent entities such as the US Congress, and with international partners.
Long Tail Astronomy
Astronomy also has become a big data field in terms of the observational data available, putting astronomers at the head of the long tail curve. The absolute volume of astronomy data continues to grow by orders of magnitude with each new generation of telescopes.
Astronomy data are big in terms of volume and velocity. Their scaling problem is continual as new instruments capture more data at faster rates. Nearly all astronomy data are measurements of the intensity of electromagnetic radiation (e.g., X-rays, visible light) as a function of position on the sky, wavelength, and time.
Astronomy’s shift to digital technologies occurred over a period of several decades, resulting in qualitative changes in the forms of data captured. For generations of astronomers, data capture was analog, consisting of long and continuous exposures on glass plates or photographic film, whether as direct images or as spectrograms.
Astronomers spent entire nights at a telescope, carefully moving the instrument to obtain a continuous exposure of a celestial object or region over periods of minutes or hours.
Only a few records might result from each night on the mountain. In contrast, digital capture with charge-coupled devices (CCDs), the basis of digital photography, produces discrete rather than continuous images. Digital records are much more readily copied intact and transferred to other storage devices. They are more manipulable and more easily disseminated than analog data.
By the late twentieth century, telescopes were producing data at rates well beyond human consumption. Many parts of the process can be automated. To the extent that data collection can be specified precisely, robots can operate instruments during the night, sending datasets to be read the morning after.
Frequently, however, instruments are under the direct control of the astronomer. Taking one’s own data may still require staying up all night, even if the conditions are more comfortable than in times past.
Computer-based data analysis and visualization are the norms for the current generation of astronomers. However, many working astronomers learned their trade in analog days, bringing that analytic expertise to today’s data analysis. Some manual data collection continues, such as the observation of sunspots.
Analog data collected over the last several centuries remains valuable as permanent records of the skies in earlier eras. Some of those data have been converted to digital form and made available in public repositories. Others remain private, in the control of the astronomers and institutions that collected them.
Sky surveys are research projects to capture large amounts of data about a region of the sky over a long period of time. The Sloan Digital Sky Survey (SDSS), named for its primary funder, the Alfred P. Sloan Foundation, was the first sky survey designed for immediate public use.
Data collection began in 2000, mapping about one-quarter of the night sky with an optical telescope at Apache Point, New Mexico.
In a series of data releases, the SDSS captured data at higher rates and better resolution because of new instruments added to the telescope, advances in charge-coupled devices (CCDs) for the cameras, and improvements in computer speed and capacity.
Pan-STARRS, the next-generation sky survey, is mapping a larger area of the sky at greater levels of detail and has the additional capability of identifying moving objects. The telescope, based in Hawaii, is being deployed in stages.
Pan-STARRS’ gigapixel cameras are the largest and most sensitive cameras built to date. The Large Synoptic Survey Telescope (LSST), a ground-based telescope in Chile planned as the next major sky survey after Pan-STARRS, claims to be the “fastest, widest, and deepest eye of the new digital age.”
It will obtain thirty terabytes of data nightly. The Square Kilometre Array (SKA), planned to be the world’s largest radio telescope, is expected to capture fourteen exabytes of data per day.
Data from sky surveys and from the main missions of public telescopes tend to be curated in archives, available for public use after periods of embargoes and processing. Data management is among the principal challenges facing missions such as Pan-STARRS, LSST, and SKA.
Not every astronomer is a big data scientist, however, nor are all astronomy data released for public use. Some astronomers still spend nights on the mountain, collecting small amounts of highly specialized observations.
Others obtain “parasitic” data from secondary instruments on telescopes in space. Yet others build their own instruments to capture precise kinds of data needed to address their own research questions.
Data from smaller projects such as these may be stored indefinitely on the local servers of the investigators. Such data may be large in volume, highly specialized, and difficult to interpret. Astronomers of the analog era tended to trust only data they had collected themselves.
Those who collect data know them best, whether analog or digital. Artifacts such as computer malfunctions or changes in weather that occlude an image are difficult to document in automated pipelines but can be essential knowledge for interpreting results.
Theorists may have big data or no data—depending upon what they consider to be data. Analytic theories in astronomy consist of equations that are solvable with pencil and paper or nowadays with supercomputers.
Analytic theories are used to model observed phenomena or to make predictions about phenomena. Computational or numerical theories are those in which astrophysical phenomena and objects are simulated to model, predict, and explain phenomena.
Computational simulations are a middle ground between theory and observation because they apply principles of pure analytic theory to produce information that looks much like observational data. Modelers may synthesize what a particular telescope would see if a source, event, or region with the properties they have simulated were to be observed.
Most models depend upon real observations gathered by other astronomers as inputs. Those who build computational models of astronomical phenomena sometimes consider the output of their models to be their data. Others will say they use no data.
Simulations typically produce a time series showing the evolution of a phenomenon. Models are run with multiple combinations of input parameters to simulate a range of potentially realistic conditions. Each parameter combination can be run for many time steps (each of which is called a “snapshot”).
As a result, each run (and even just one snapshot) may produce several terabytes of output—far more than can be kept indefinitely. Modelers may say their data consist of the few kilobytes of empirical observations they need to initiate a model run.
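A back-of-the-envelope calculation shows how quickly simulation output accumulates. All parameters below are hypothetical, chosen only for illustration and not drawn from any particular simulation code:

```python
# Back-of-the-envelope estimate of simulation output volume.
# All parameters are hypothetical, chosen only for illustration.
grid = 1024          # cells per side of a cubic simulation grid
fields = 5           # e.g., density, temperature, three velocity components
bytes_per_value = 8  # double-precision floating point
snapshots = 500      # time steps saved over one run

snapshot_bytes = grid**3 * fields * bytes_per_value
run_bytes = snapshot_bytes * snapshots

print(f"one snapshot: {snapshot_bytes / 1e9:.1f} GB")  # one snapshot: 42.9 GB
print(f"full run: {run_bytes / 1e12:.1f} TB")          # full run: 21.5 TB
```

Even at this modest resolution, a single run exceeds twenty terabytes, which is why modelers may retain only the input parameters and a few representative snapshots.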
Some make fine distinctions between simulations, the code associated with the output of simulations, parameters, and data. Synthetic data can be created in the same format as observational data, enabling statistical comparisons with the same sets of analytical tools.
What Are Astronomy Data?
Astronomy data are more difficult to characterize than might be imagined by the nonspecialist. Many people, with many different talents, are involved in the design, development, and deployment of astronomy missions.
Some astronomers are involved in these earlier stages, but observational research with public telescopes may rely on data for which the instruments were conceived decades earlier.
Astronomers may devote large parts of their careers to long-term collaboration, move between projects, draw data from multiple missions, or focus on specialized topics with their own instrumentation.
Some write observing proposals to collect their own data; some use extant data from archives; some build their own instruments, and some use combinations of all of these sources and resources. The point at which some entity becomes useful astronomical data depends upon choices such as these.
People involved at each stage may know only enough about the prior stages to do their own job.
By the time an instrument sees first light, no single person has a comprehensive view of all the decisions that led to the resulting stream of observations. To the physicist who led the instrument design, the voltage on the CCDs may be data.
To the theorist studying the origins of the universe, data may be the output of simulations that model how stars, galaxies, and other celestial objects form, evolve, and die. To the empirical astronomer, data may be “image cubes” consisting of coordinates on the sky and spectra.
To the software engineer, data may be the output of the pipeline from the CCDs to the cleaned, calibrated, and structured files that are ingested by repositories.
“Constructing an astronomy paper is like building a house,” to quote one of the astronomers interviewed. Rarely is the beginning point clear. Families renovate and extend a house over periods of years and decades.
Even if one family starts with an empty plot, a structure may have stood before. Yet earlier, someone decided how the land was to be divided, which determines the possibilities for the size and orientation of the house, and so on.
Sources and Resources
The COMPLETE (COordinated Molecular Probe Line Extinction Thermal Emission) Survey of Star Forming Regions discussed in the case study can be understood only in the larger context of how data are collected, selected, used, and managed in observational astronomy. The sources and resources for data in astronomy are many and varied.
They are difficult to extricate from the instruments, domain expertise, scientific principles, and forms of representation specific to the field.
While their data are “observations of the sky,” these observations rely on instruments with specialized sensing capabilities. Signals captured by these instruments are cleaned and calibrated to community standards.
Metadata are used to represent these signals in ways that they can be reconciled with data from other instruments. Standard data structures enable astronomers to apply a common set of tools for analysis, visualization, and reporting.
Overlaid on these technologies, tools, and standards are institutions and professional practices to link data and publications from the international community of astronomers.
Telescopes are the most basic, and yet the most complex, technologies in astronomy. Optical technologies are much improved since Galileo’s day, resulting in telescopes that capture digital images.
Modern telescopes have multiple instruments, each with its own capabilities. These can be swapped out over time, extending the life of the telescope.
Telescopes are of several physical types and may be located on land, above part of the atmosphere on planes and balloons, or in space, high above the Earth’s atmosphere. Ground-based telescopes are typically located at higher altitudes or at remote locations away from city lights.
Optical telescopes are those with mirrors or lenses to focus light, such as those at Palomar in California or the La Silla Observatory in the Atacama Desert of Chile.
Radio telescopes use dish antennas, rather than mirrors or lenses, to focus signals. The Square Kilometre Array, which is being constructed as an array of many dishes in Australia and South Africa, is so named because it will have an effective collecting area of one square kilometer.
An international project, it is located in the Southern hemisphere to get the best view of the Milky Way Galaxy; the region also has less radio interference than other suitable sites.
Telescopes launched into orbit around the Earth can peer much deeper into space, with views well beyond the Earth’s atmosphere. These instruments can be decades in planning and can generate data for decades. Hubble is among the best-known space telescopes, currently orbiting 353 miles above the Earth.
Developed through decades of design projects with international collaboration and funding, the Hubble Space Telescope (HST) is in its third decade of delivering scientific data from space.
With five sets of instruments, each capable of gathering light in different ways at different wavelengths, the HST is an orbiting laboratory for astronomy.
Additional instruments power the satellite, monitor its health, and make adjustments. Although most space instruments are launched into orbit or into deep space without further physical intervention, Hubble has been visited several times to add, repair, and swap instruments. More than ten thousand publications have resulted from Hubble data.
The actual collection of data from telescopic instruments is often an industrial process managed by the project mission, such as Hubble or Chandra.
The HST, for example, sends its signals to a tracking and relay satellite, which sends them to a ground station in New Mexico, which sends them to the Goddard Space Flight Center in Greenbelt, Maryland, which sends them to the Space Telescope Science Institute (STScI) in nearby Baltimore.
Validation and error checks are made at each stage in the process. Astronomers may use the observational data from STScI, once fully calibrated and released.
They also can submit observing proposals to use particular instruments, for particular periods of time, to study their own research questions. About two hundred proposals are approved each year, which represents about a 20 percent success rate for the applicants.
Electromagnetic Spectrum
Each instrument, whether in space or on the ground, is designed to obtain signals over a particular range (or ranges) of wavelengths.
Radio telescopes take signals at the low-frequency end of the spectrum, which is also known as low energy or long wavelength. Gamma-ray telescopes take signals at the highest frequency end of the spectrum, also known as high energy or short wavelength.
Gamma rays and X-rays are largely blocked by the Earth’s atmosphere, so gamma- and X-ray telescopes operate outside the Earth’s atmosphere, and thus are located on rockets or satellites.
The electromagnetic spectrum is continuous. Divisions commonly made in astronomy include, in order of increasing energy, radio, microwave, infrared, optical, ultraviolet, X-ray, and gamma ray.
Finer divisions may include far-infrared, medium-infrared, near-infrared, soft X-ray, hard X-ray, and so on. Wavelengths sometimes are named for their size in metric units, such as millimeter or submillimeter. Visible light, the colors seen by the human eye and all that was available to Galileo, is only a very narrow band of the spectrum.
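The link between wavelength and energy is the photon energy relation E = hc/λ. A small sketch makes the "high energy" and "low energy" labels concrete; the example wavelengths are approximate, conventional band values chosen for illustration:

```python
# Photon energy E = h*c / wavelength: shorter wavelengths carry more
# energy, which is why gamma rays and X-rays are called "high energy."
H = 6.626e-34   # Planck constant, J*s
C = 2.998e8     # speed of light, m/s

def photon_energy_ev(wavelength_m):
    """Energy of a photon of the given wavelength, in electron volts."""
    return (H * C / wavelength_m) / 1.602e-19  # convert joules to eV

# Example wavelengths are approximate, conventional band values:
print(photon_energy_ev(0.1))      # radio, 10 cm: about 1.2e-5 eV
print(photon_energy_ev(550e-9))   # visible, 550 nm: about 2.25 eV
print(photon_energy_ev(1e-10))    # X-ray, 0.1 nm: about 12,400 eV
```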
Celestial Objects
Celestial objects, also known as astronomical objects, are observable entities in the universe that occur naturally. These include stars, planets, galaxies, and comets, as well as other less familiar objects like nebulae, supernova remnants, and black holes.
Most astronomy research involves the study of celestial objects, whether individually or in combination with other phenomena. Objects may occur in certain regions of the sky and be visible at certain wavelengths.
To study a given celestial object, an astronomer needs observations of that region, taken with instruments capable of capturing the phenomena of interest. In some cases, an astronomer would apply for observing time on a specific telescopic instrument, on some number of days at the right time of year, to be pointed at the region.
In many other cases, the astronomer would search for data from instruments that already have gathered observations of the region at the desired wavelengths. Stars, planets, and asteroids move around in the sky, so astronomers need 3D models of trajectories to derive the (2D-projected) sky coordinates of these kinds of celestial objects at any given time.
Astronomy Data Products
Observations from telescopic instruments go through many steps of cleaning, calibration, and data reduction before becoming available in astronomy data repositories.
The processing steps are known as pipelines. Pipeline processing may require some months; thus, observations often are made available as a “data release,” accompanied by the “data paper” that documents them.
The ninth data release of the SDSS, for example, includes all of the data from a new instrument added to the telescope after the eighth data release, plus corrected astrometry (sky positions) for the eighth release.
Thus, the same observations of the sky may be released more than once, as improvements are made to the pipeline. Astronomers are careful to identify which data release was used for any study. Interpretation of findings varies accordingly.
When astronomers gather their own data via observing proposals, they may do their own pipeline processing. A manual of recommended tools and steps may be provided for guidance. Those using well-calibrated instruments can compare their data to those of prior data products from the same instrument.
Those using a new instrument or their own instrument have advantages for new discoveries but also limitations in the lack of comparative data products for validating their results. They may turn to other data resources to calibrate some of their measurements, whether or not acknowledged in their publications.
Those who prefer less-processed observations may be seeking new phenomena or other patterns that are obscured by standard data reduction methods.
Many other kinds of data products result from astronomical observations. These include star catalogs and surveys of regions or celestial objects. Star catalogs date to ancient observations of the sky, when positions and brightness were charted over the course of the year. They also provided essential information for navigation at sea and on land.
Modern catalogs draw upon data repositories for precise descriptions of what is known about each star. Known stars can be referenced by catalog number. Catalogs and other data products can be searched to determine whether an object is known or unknown.
Real-time sky surveys rely on these types of data products for almost instant identification of new celestial objects. Transient event-detection methods can send “sky alerts” to telescopes, smartphones, and other devices.
Astronomy Knowledge Infrastructures
Astronomy has the most extensive knowledge infrastructure of all the fields covered in the case studies. Agreements on standards for data structures, metadata, and ontologies, combined with international coordination and large community investments in repositories, tools, and human resources, have resulted in a complex network of information resources.
Despite the highly automated data collection in astronomy, substantial portions of the infrastructure require human expertise to assign metadata and to identify links between related objects.
The manual labor required to interpret observations and to document relationships between digital objects is an example of the “invisible work” that often is required to make infrastructures function effectively. The work is visible to those who do it, of course, but those who rely on the infrastructure may not be aware of those investments unless the system breaks down.
Metadata
Astronomy observations typically are acquired either as spectra (intensity as a function of wavelength), images (the distribution of intensity on the sky at a particular wavelength), or cubes (a 3D dataset giving intensity as a function of position and wavelength, from which images and spectra can be extracted).
In some cases, an instrument known as a photometer is used to measure intensity at a single position over a very narrow range of wavelengths.
Increasingly, astronomy observations are acquired as time series, meaning a series of samples, over time, of any of the types of data listed above. Telescopic instruments can generate metadata automatically for sky coordinates, wavelength, and time of observation.
Object names are entered by hand at the time of observation since human judgment is required. Other information useful to interpret observations, such as weather conditions and instrument errors, also may be recorded manually in observing logs.
To present images taken at wavelengths far beyond the visible part of the spectrum, astronomers assign colors to wavelength bands; for example, red = radio; green = optical; blue = X-ray.
However, few metadata standards exist for false color in images. Techniques for assigning colors vary widely, resulting in published images that are largely irreproducible.
Although colorful “pretty-picture” composites are popular with the public, many astronomers are reluctant to present these images in their research publications. Artistry must be carefully balanced with scientific validity.
Essential information about the instrument, conditions of observation, wavelength, time, and sky coordinates are represented in a standard data format known as the Flexible Image Transport System (FITS). FITS was developed in the 1970s and widely adopted by the latter 1980s as part of the transition from analog to digital astronomy.
Analog observations could be calibrated to the positions and conditions of each telescope. Digital capture offered the opportunity to combine observations from multiple instruments, but to do so, agreements on data structures and coordinate systems were necessary.
Most astronomy data repositories now provide data resources in FITS formats; thus, astronomers can use the metadata in FITS files to locate observational data by sky coordinates, spectra, time of observation, and other characteristics associated with the instruments.
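The FITS header layout itself is simple: a sequence of fixed-width, 80-character "cards," each holding a keyword, a value, and an optional comment. A minimal sketch, using standard FITS/WCS keywords but made-up illustrative values (roughly the coordinates of the Crab Nebula):

```python
# Sketch of the FITS header "card" layout: each card is exactly 80 ASCII
# characters -- an 8-character keyword, "= ", a right-justified value,
# and an optional "/ comment". Keywords are standard FITS/WCS keywords;
# the values here are illustrative only.
def make_card(keyword, value, comment=""):
    card = f"{keyword:<8}= {value:>20}"
    if comment:
        card += f" / {comment}"
    return f"{card:<80}"[:80]  # pad (or trim) to the fixed 80-column width

header = [
    make_card("NAXIS", 2, "number of data axes"),
    make_card("CRVAL1", 83.633, "RA at reference pixel, degrees"),
    make_card("CRVAL2", 22.014, "Dec at reference pixel, degrees"),
]
for card in header:
    assert len(card) == 80
```

In practice astronomers rarely hand-roll cards like this; libraries such as astropy read and write FITS headers. But the fixed-width layout is what makes the format simple to parse in any language, which helped its adoption across the field.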
Coordinate Systems
Astronomy is based on the simple organizing principle that there exists only one sky. However, establishing a coordinate system that could reconcile the positions of objects in the sky required several centuries of scientific and engineering innovation.
Latitude (north-south) could be computed on land or sea via the stars. Longitude (east-west) required exact calculations of time as the Earth moves in its orbit. A precise clock that functioned aboard a ship at sea, accomplished in the later eighteenth century, transformed both navigation and astronomy (Sobel 2007).
Coordinate systems in astronomy depend upon precise temporal measurements because the motion of the Earth in its orbit causes the position and wavelength of an object’s emission to change subtly with time.
Astronomers agreed on a standard mapping system, known as the World Coordinate System (WCS), as part of the FITS standards for describing observations of the sky.
Each pixel in an image of the sky is assigned X and Y coordinates for its location. These coordinates, usually expressed as right ascension and declination, are the equivalent of longitude and latitude for the positions on the Earth.
The electromagnetic spectrum is the third dimension used in astronomical observations. This dimension may be expressed as frequency or wavelength, and it can be translated to a velocity in many cases due to the Doppler effect. For objects far outside the Milky Way, Hubble’s law is incorporated into the calculation of an object’s distance.
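For small redshifts, these relations reduce to simple formulas: a recession velocity v ≈ cz from the Doppler shift, and a distance d = v/H0 from Hubble's law. A minimal sketch, assuming a conventional value of about 70 km/s/Mpc for the Hubble constant:

```python
# Minimal sketch of the two relations mentioned above, valid only for
# small redshifts: Doppler velocity v = c*z, and Hubble's law d = v/H0.
# The Hubble constant value is an assumption (~70 km/s/Mpc is conventional).
C_KM_S = 299792.458  # speed of light, km/s
H0 = 70.0            # Hubble constant, km/s per megaparsec (assumed)

def recession_velocity(z):
    """Recession velocity in km/s for a small observed redshift z."""
    return C_KM_S * z

def hubble_distance_mpc(z):
    """Distance in megaparsecs via Hubble's law, d = v / H0."""
    return recession_velocity(z) / H0

z = 0.01  # hypothetical observed redshift
print(f"{recession_velocity(z):.0f} km/s")   # 2998 km/s
print(f"{hubble_distance_mpc(z):.0f} Mpc")   # 43 Mpc
```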
Observations taken by different instruments, at different times, can be reconciled via these coordinate systems. Sky images that were captured on glass plates more than a century ago can be matched to images taken with today’s telescopes.
Similarly, photographs of the sky taken today may be matched to their sky locations using the WCS, star catalogs, and other astronomical data products. The process is imperfect, as reconciliation sometimes requires knowledge of why an image was taken.
Celestial Objects
Celestial objects and other astronomical phenomena have their own sets of metadata. These are cataloged manually, after papers are published, through coordinated multinational efforts.
Celestial objects in our galaxy are cataloged in SIMBAD (the Set of Identifications, Measurements, and Bibliography for Astronomical Data), which is based at the Centre de Données Astronomiques de Strasbourg (CDS) in France.
Catalogers read new astronomy publications as they appear, creating metadata records for each mentioned celestial object that can be identified.
SIMBAD grows rapidly with the pace of publications and new discoveries, updating its statistics daily. As of this writing, SIMBAD contains about 18.2 million identifiers for 7.3 million unique objects mentioned in 285,000 papers, for a total of about 10 million citations of objects.
Another way of representing these numbers is that each of these 7.3 million objects is known, on average, by about 2.5 different names—the 18.2 million identifiers.
Each paper mentions an average of 35 celestial objects—the 10 million citations of objects in 285,000 papers. These objects are not evenly distributed in the astronomy literature. Most papers describe a few objects, and a few papers list large numbers of objects.
Similarly, most objects have just one name (e.g., Jupiter) and some have many, such as their identifications in surveys and catalogs created over a period of centuries. Each publication is thus richly tagged with metadata, adding value that can be used in discovering, combining, and distinguishing data about celestial objects.
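The per-object and per-paper averages follow directly from the counts quoted above, keeping in mind that both distributions are heavily skewed:

```python
# The averages follow directly from the SIMBAD counts quoted above.
identifiers = 18_200_000  # names (identifiers) recorded in SIMBAD
objects = 7_300_000       # unique celestial objects
citations = 10_000_000    # mentions of objects across papers
papers = 285_000

names_per_object = identifiers / objects   # about 2.5 names per object
objects_per_paper = citations / papers     # about 35 objects per paper
```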
Objects outside our galaxy are cataloged in the NASA Extragalactic Database (National Aeronautics and Space Administration, Infrared Processing and Analysis Center 2014a).
The solar system and planetary data are cataloged in yet another service (National Aeronautics and Space Administration, Jet Propulsion Laboratory 2014). CDS is the coordination point for many of the metadata repositories for astronomy and provides searching and mapping tools such as Aladin and VizieR.
Data Archiving
Massive amounts of astronomical observations are available in data archives, also known as repositories, databases, or information systems. While extensive, they are not comprehensive.
Observations from government-funded astronomy missions, especially those collected by telescopic instruments launched into space, are most often made available as public resources. Most repositories are organized by mission, such as observations from the Spitzer Space Telescope, Chandra, and Hubble missions.
Data also are organized by wavelength, such as the set of archives hosted by the Infrared Processing and Analysis Center (IPAC). IPAC organizes data by mission and also by types of celestial objects, such as the NASA Exoplanet Archive (National Aeronautics and Space Administration, Infrared Processing and Analysis Center 2014b). Each major sky survey, such as the SDSS, Pan-STARRS, and LSST, offers its own data repository.
Well-curated older data, such as the Two Micron All Sky Survey (2MASS), for which data collection was completed in 2001, remain valuable indefinitely (National Aeronautics and Space Administration, Infrared Processing and Analysis Center 2014c).
Although astronomy data repositories are valuable resources, each archive is independent and each has its own user interface, search capabilities, and underlying data model. Because archived astronomy data tend to be partitioned by observational wavelength and by the observatory that collected them, integration approaches are needed.
Some repositories such as MAST curate data from multiple missions and spectra, and also accept contributions of data and models (National Aeronautics and Space Administration, Mikulski Archive for Space Telescopes 2013).
The Data Discovery Tool, Skyview, and WorldWide Telescope are among the growing number of tools available for searching across data archives and integrating data sources.
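The integration problem these tools address can be sketched in miniature: each archive exposes its own native query interface, so federated search wraps every archive behind a uniform adapter. All classes, methods, and record fields below are hypothetical illustrations, not real archive APIs.

```python
# Hypothetical archives with incompatible native interfaces.

class MissionArchive:
    """Archive partitioned by mission, queried by mission name."""
    def __init__(self, records):
        self.records = records
    def find(self, mission=None):
        return [r for r in self.records
                if mission is None or r["mission"] == mission]

class BandArchive:
    """Archive partitioned by wavelength band, queried band by band."""
    def __init__(self, records):
        self.records = records
    def bands(self):
        return {r["band"] for r in self.records}
    def lookup(self, band):
        return [r for r in self.records if r["band"] == band]

def all_records(archive):
    """Adapter layer: normalize each native interface to 'list everything'."""
    if isinstance(archive, MissionArchive):
        return archive.find()
    if isinstance(archive, BandArchive):
        return [r for band in archive.bands() for r in archive.lookup(band)]
    raise TypeError("no adapter registered for this archive type")

def federated_search(target, archives):
    """Search every archive for one target through the adapter layer."""
    return [r for a in archives for r in all_records(a) if r["target"] == target]
```

Real integration tools face the much harder version of this problem: reconciling heterogeneous data models and metadata standards, not just method names.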
Astronomers have a growing number of options for sharing data that they collected themselves or derived from public sources: archives that accept contributions, university repositories, project and personal websites, and personal exchange. A small study of astronomers indicates that their sharing practices are much like those in other fields.
The most common form of data sharing is to e-mail data to colleagues on request. Only a small proportion (about 20 out of 175) had put data into an institutional archive. Data handling remains largely in the hands of the people who collected or analyzed them.
One respondent compared his team's data practices to those of the SDSS, saying that instead of SDSS Data Release 1.0, 2.0, and so on, his team's successive versions amounted to graduate student 1, 2, 3.
Publications
Bibliographic control is much more comprehensive in astronomy than in most fields.
The Harvard-Smithsonian Astrophysical Observatory–NASA Astrophysics Data System (ADS) is largely a bibliographic system, despite its name. ADS, operational since 1993, contains records on core astronomy publications back to the nineteenth century, as well as extensive coverage of gray literature in the field.
ADS plays a central role in the knowledge infrastructure of astronomy by curating not only bibliographic records but also the links between publications, records of celestial objects, and data archives.
Provenance
Astronomers rely on this extensive array of knowledge infrastructure components to determine the provenance of data. Researchers must be able to trust the data, knowing that many people, many instruments, and many software tools have touched the observational bitstream.
Those bits are calibrated, cleaned, transformed, and reduced at many stages in the pipeline processing. Provenance concerns vary by research question and circumstance.
Data taken for one purpose, in one region of the sky, at particular times and wavelengths, may or may not be useful for a particular purpose later. For example, provenance may be much harder to determine on older data taken by individual astronomers than on data from sky surveys. Legacy data that are converted to digital form may not be documented adequately for some kinds of future uses.
The provenance of sky surveys and other large datasets is documented in the aforementioned data papers, such as those for individual data releases of the SDSS and COMPLETE. Instrument papers also may be published to give credit to instrument developers and to provide more detailed documentation of the decisions made.
These papers document the instrumentation, calibration, and processing decisions. Data papers are among the most highly cited articles in astronomy because they aggregate references to the data sources.
Provenance in astronomy also is maintained by the use of common analytical tools and services. The International Virtual Observatory Alliance (IVOA) is a coordinating entity to develop and share data and tools (Hanisch and Quinn 2002; International Virtual Observatory Alliance 2013a).
Partners meet regularly to address interoperability issues and to coordinate national efforts on scientific infrastructure for astronomy research.
While far from complete, astronomy has made more progress in establishing relationships between publications and data than have most fields.
SIMBAD provides links between celestial objects and the papers reporting research on them, which are in ADS. Less well curated are metadata to link publications to observations and to link celestial objects to observations.
Efforts are underway to coordinate multiple systems and activities, with the goal of better semantic interlinking of these complementary resources. Coordination involves IVOA, ADS, CDS, astronomy libraries, data archives, and systems such as the WorldWide Telescope that can integrate disparate data sources.
External Influences
Astronomy is no less influenced by external matters of economics and value, property rights, and ethics than are other domains of the sciences.
The creation and use of data in astronomy depend upon international agreements and an array of governance models. These underpin the knowledge infrastructures of astronomy in ways that are both subtle and profound.
Economics and Value
One reason that astronomy data are attractive for research in computer science is that they have no apparent monetary value. A second reason is that their great volume and consistent structure make them useful for database research. Third, no human subjects are involved, so ethical constraints on reuse are minimized.
Although it is true that no market exists to buy and sell astronomical observations or the outputs of numerical models, telescopes, instruments, and large data archives such as those of the SDSS, Hubble, and Chandra are better understood as common-pool resources.
Those that constitute infrastructure investments of public and private agencies have governance models in place to ensure the quality of the resources and equity of access. Sustainability and free riders are continual threats to these common-pool resources.
However, many astronomical instruments and data resources are not part of these common pools. Data in the local control of individual astronomers and teams may be considered raw materials or private goods, depending upon the circumstance.
Some instruments and archives are club goods, available only to partners. Software necessary to analyze astronomy data and to interpret files may be open source or commercial products.
Property Rights
Property rights in data vary by project, funding agency, and other factors. Whether research funding is public or private, investigators usually have exclusive rights to data for some period of time.
Proprietary periods, also known as embargoes, tend to range from three to eighteen months or so, counted either from the time the observations are taken by the telescope or from the time that “scientifically usable data” are available from the processing pipeline.
Investigators may have the discretion to make data available sooner, but special permission is required to lengthen the proprietary period.
Distinctions about whether the proprietary period begins with the time of observation or the time of scientifically usable data can result in months or years of difference in the time that investigators have to control their data while writing publications.
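The arithmetic behind this distinction is simple but consequential, as a short sketch shows. The dates and the 30-day-month approximation below are hypothetical; real policies count calendar months from dates defined in each observatory's data rights policy.

```python
from datetime import date, timedelta

def release_date(start, months):
    """Approximate the end of a proprietary period as start + months,
    using 30-day months for simplicity (real policies count calendar months)."""
    return start + timedelta(days=30 * months)

# Hypothetical timeline: observations taken in mid-January, but
# pipeline-processed, "scientifically usable" data available five months later.
observed = date(2014, 1, 15)
usable = date(2014, 6, 15)

# The same 12-month proprietary period ends at very different times
# depending on which starting point the policy specifies.
from_observation = release_date(observed, 12)
from_usable = release_date(usable, 12)
```

In this invented example, counting from pipeline-ready data gives the investigators five extra months of exclusive control; slow pipelines can stretch that difference to years.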
Data obtained from privately operated telescopes may never be released, although legacy data from important telescopes have since become available.
For those missions whose data are destined for archiving, which is the case with most major space telescopes, data are made available via repositories once pipeline processing is complete and proprietary periods have ended.
When astronomers wish to collect their own observations from these or other instruments, their rights to data are coupled to the governance of the instruments.
Telescopes are owned, under various arrangements, by the universities, consortia, governments, and other entities that fund them. Responsibility for policies governing the use of instruments such as the Hubble Space Telescope may be delegated to the scientific mission. Most public telescopes are available to qualified astronomers who must submit proposals for their use.
Other instruments are dedicated to synoptic surveys, assembling large data collections that will be open to all. A declining number of telescopes remain available only to the elite few associated with the private institutions that own them.
Data rights policies of these governing bodies make fine distinctions in the rights to use or control certain kinds of data.
For example, the National Optical Astronomy Observatory (NOAO) data rights policies distinguish between scientific data subject to proprietary periods and other kinds of data that are available to the community “immediately upon ingest of the exposure in the NOAO Archive,” such as metadata on individual exposures, including the time, duration, location, and instrument configuration.
Internal calibration data similarly are considered public. NOAO staff have access to all data from the instruments for purposes of monitoring the health, safety, calibration, and performance of the instrument.
NOAO is located at Kitt Peak in Arizona and operated by the Association of Universities for Research in Astronomy (AURA), Inc. under a cooperative agreement with the National Science Foundation (National Optical Astronomy Observatory 2013a).
Ethics issues arise in astronomy around access to data and to instruments, which are scarce and expensive resources. Who has access to which telescope, when, for how long, and with what resources for collecting and analyzing data is determined by the moral economy of astronomy.
Access became more equitable and merit-based in recent years with the growth in public funding for the field, but ethics and politics always will play a role. Telescopes are funded by a complex array of partners, including universities, to ensure that their members have access to necessary facilities.
Access to astronomy data can be delayed due to issues involving pipeline processing, proprietary periods, governance, and the like. Data from the Planck mission to study the cosmic microwave background, for example, were made available much later than initially promised.
The investigators released about thirty papers all at once, along with the data, on the grounds that the data were not valid for use by others until fully calibrated.
Astronomy observations can be sensitive, owing to their value for navigation and defense. For example, Pan-STARRS is partly funded by the US Air Force to monitor near-Earth objects; thus, data deemed sensitive are not available for astronomy research.
Pan-STARRS distinguishes between its primary science mission and its role in defense. “Working in the open” usually implies data release at the time of article publication, rather than subjecting one’s day-to-day activities to public scrutiny.
Conducting Research in Astronomy
Most astronomers live in a data-rich world, with a wealth of tools and services to select and analyze those data. They also live in a world of constraints, with a very long time frame to plan missions and dependence on international coordination for funding, infrastructure, and access to instruments and data.
Access to telescopes and to observing time is more equitable than in generations past, but those at wealthier institutions who are members of major instrument consortia still have more resources than those at lesser universities and those in poorer countries.
Decisions about what entities become data; what data are shared, reused, or curated; and how, are influenced by access to these resources and by constraints on time, technology, and infrastructure.
This case study follows one team, based at the Harvard-Smithsonian Center for Astrophysics (CfA), through a multiyear project known as the COMPLETE Survey as they developed research questions, collected and analyzed their data, and published their findings in more than forty papers.
It explores the knowledge infrastructures on which they depend, how they represent data, when and how they share them, the array of stakeholders involved, and their publication practices.
The COMPLETE Survey
The COMPLETE (COordinated Molecular Probe Line Extinction Thermal Emission Survey of Star Forming Regions) Survey, based at the CfA, is a large dataset created from public repositories of astronomical observations and from new observations in the same regions of the sky. The Survey mapped three very large star-forming regions in our galaxy in their entirety.
Observations covered the electromagnetic spectrum from X-ray to radio. The team then mined these data to address an array of research questions.
The Survey is valued for its comprehensiveness, its diversity of data sources, and its size—estimated to be about a thousand times larger than what was available as a coordinated resource a decade earlier.
Over the course of seven years or so, many people with many kinds of expertise were involved in conducting the survey. The team has ranged in size from about a dozen to twenty-five members, including faculty, senior researchers, postdocs, graduate students, and undergraduates.
Research using the Survey dataset continues, largely focused on observational and statistical work to understand the physics of star-forming regions.
Research Questions
Research questions of the COMPLETE Survey team concern how interstellar gas arranges itself into new stars. The overarching question in star-formation research is what distributions of stars form as a function of time, for interstellar gas with given conditions.
They broke this question into smaller units, some parts of which had to be answered before the next parts of the puzzle could be addressed.
Among the major findings to date are the discovery of “cloudshine” in the near infrared; the development and implementation of a structure-finding algorithm to describe real and simulated star-forming regions; a reinterpretation of the meaning of temperature in maps of interstellar gas; and an assessment of the role of self-gravity in star formation.
Collecting Data
To identify data in existing archives, the team used coordinate-based and object-name–based searches in repositories to extract data for the three star-forming regions being studied (Perseus, Ophiuchus, and Serpens).
They also used metadata in SIMBAD and ADS to identify prior papers that studied celestial objects and phenomena in these regions.
However, metadata in archives are known to be incomplete, so the team relied on their professional knowledge of the field and on sources mentioned in papers to determine where to search for available data in archives.
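A coordinate-based search of the kind the team ran can be reduced to a toy filter over cataloged positions. Everything below is invented for illustration; real searches go through services such as VizieR and use proper spherical geometry rather than a flat rectangular cut.

```python
# Toy coordinate-based archive search: select observations whose sky
# position (in degrees) falls inside a rectangular region of interest.

def in_region(record, ra_range, dec_range):
    """True if the record's RA/Dec position lies within the given box."""
    ra_lo, ra_hi = ra_range
    dec_lo, dec_hi = dec_range
    return ra_lo <= record["ra"] <= ra_hi and dec_lo <= record["dec"] <= dec_hi

# Invented observation records.
observations = [
    {"id": "obs-1", "ra": 52.3, "dec": 31.1},   # inside the box below
    {"id": "obs-2", "ra": 120.0, "dec": -5.0},  # outside
]

# Hypothetical bounding box loosely placed around the Perseus region.
perseus = [r for r in observations if in_region(r, (50.0, 58.0), (29.0, 34.0))]
```

A rectangular cut in RA/Dec is only an approximation; near the poles and across the RA wrap-around, real tools must treat the sky as a sphere.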
More than half of the COMPLETE Survey is new data resulting from multiple proposals for observational time on telescopic instruments.
These new data were processed through the pipelines associated with each telescopic instrument. Although this is a complex process to accomplish, it is noteworthy that the knowledge infrastructure for this area of astronomy supports the ability to reconcile old and new observations from multiple instruments.
The first steps in data analysis were to get all the datasets into a common FITS format, whether acquired from archives or via observing proposals. Reconciling these files requires an intimate knowledge of the FITS standard. Decisions had to be made about how to merge files.
The data sets do not have identical footprints on the sky, thus more data are available for some areas of the star-forming regions than others.
While the available metadata on sky positions and spectra are essential to merge data-sets, considerable expertise is required to reconcile differences in calibration, instrument characteristics, data models, and other factors (Goodman, Pineda, and Schnee 2009).
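One of the merging issues described above, differing sky footprints, can be sketched with a toy grid merge. The grid cells and values are invented; real merging also reconciles calibration, instrument response, and data models, which this sketch deliberately omits.

```python
# Toy merge of sky maps with different footprints: cells covered by
# several datasets are averaged, and per-cell coverage is recorded so
# that areas with more data can be distinguished from single-coverage areas.

def merge_maps(maps):
    """Average overlapping measurements cell by cell.
    Each map is a dict {(x, y): value} covering part of a shared grid.
    Returns (merged values, per-cell coverage counts)."""
    totals, coverage = {}, {}
    for sky_map in maps:
        for cell, value in sky_map.items():
            totals[cell] = totals.get(cell, 0.0) + value
            coverage[cell] = coverage.get(cell, 0) + 1
    merged = {cell: totals[cell] / coverage[cell] for cell in totals}
    return merged, coverage
```

The coverage map makes the footprint mismatch explicit: downstream analyses can weight or exclude cells observed by only one instrument.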
When the datasets were merged into a common file, the team could employ a suite of tools, both open source and commercial, that take FITS files and other common formats as input.
To gain a competitive edge, astronomers sometimes write new software tools or new scripts in existing tools. Creating new tools and validating new methods can themselves be scientific contributions.
For example, in their Nature article about the role of gravity at multiple scales, Goodman and colleagues explain how they implemented dendrograms as a new technique to measure structures over a range of spatial scales.
Their 3D visualizations separate self-gravitating from non-self-gravitating regions in the dendrogram to show the superiority of the dendrogram algorithm over a previous algorithm called CLUMPFIND.
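The intuition behind dendrograms can be conveyed with a deliberately simplified one-dimensional sketch: as an intensity threshold is lowered, separate peaks merge into larger structures. This toy only counts connected segments above each threshold; it is not the team's actual algorithm, which builds a full merge hierarchy over 2D and 3D maps.

```python
# One-dimensional caricature of dendrogram structure-finding: count
# maximal runs of consecutive intensity values above a threshold.

def structures_above(intensity, threshold):
    """Number of distinct contiguous structures above the threshold."""
    count, inside = 0, False
    for value in intensity:
        above = value > threshold
        if above and not inside:
            count += 1  # entering a new structure
        inside = above
    return count

# Invented intensity profile with three peaks.
profile = [1, 5, 2, 6, 1, 0, 4, 1]

# At a high threshold the peaks are separate structures; lowering the
# threshold lets neighboring peaks merge into one larger structure.
levels = {t: structures_above(profile, t) for t in (3, 0)}
```

Tracking which structures merge as the threshold drops, rather than just counting them, is what turns this idea into the tree that a dendrogram visualizes.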
Their paper was the first to be published as a 3D PDF enabling three-dimensional views to be manipulated and rotated within the article using a particular version of the Adobe PDF reader.
COMPLETE yielded a cumulative body of research for the team, with each study producing findings that informed the next. Individual papers are based on subsets of the Survey assembled for a specific purpose, such as the exploration of the role of self-gravity in star formation mentioned above.
Because multiple papers draw on the same large dataset, they document the research protocols only to the extent necessary for the specifics of each paper.
The necessary provenance information for building the Survey is provided in the data paper, which has seventeen authors from the partner institutions.
The Nature article is only four pages in length because that is the maximum allowed by the journal; an additional twelve pages of supplemental material are published online. Yet further documentation is on the project’s website and referenced in papers.
Papers based on the COMPLETE Survey were published in astronomy journals; thus, the papers are cataloged in SIMBAD and ADS, making them discoverable by object, region, and bibliographic characteristics.
Curating, Sharing, and Reusing Data
The COMPLETE Survey team is distributed across multiple institutions and countries, each of which has its own practices for sharing, curating, and reusing data. The core team, based at the Harvard-Smithsonian Center for Astrophysics, maintains the Survey datasets and website.
They make the survey data available for download in multiple parts and formats, each with extensive documentation (COordinated Molecular Probe Line Extinction Thermal Emission Survey of Star Forming Regions [COMPLETE] 2011). A suggested citation to the datasets is provided, but the team does not attempt to track usage or citations.
Derived datasets are being released via Dataverse, newly adapted for astronomical data. The dataset remains in active use; thus, the team can add new releases, new documentation, and corrections as needed. They have not contributed the survey data to MAST or other repositories, which would relieve them of long-term curation responsibility.
The Harvard CfA team that conducted the COMPLETE Survey and exploited it for a series of research projects is far more concerned with data curation than are most scholars in most fields. They are actively involved in data sharing and infrastructure development in astronomy.
Team members developed a Dataverse site for depositing and sharing astronomy data, are principals in the ADS All Sky Survey to integrate data and publications in the field, are principals in the WorldWide Telescope, and have studied data sharing and reuse in astronomy.
Despite their extensive documentation of the COMPLETE Survey, the team acknowledges they still have the “graduate student 1, 2, 3” problem that is endemic to long-term collaborative research.
When questions arise that involve fine-grained decisions about calibration, transformations, or similar analytical processes that occurred years earlier, they sometimes have to locate the departed student or postdoctoral fellow most closely involved.
These interpretations, in turn, can depend on decisions made by other parties earlier in the development of instruments, pipelines, and data products. This is the “building the housing” problem: the provenance of data can be traced only as far back as the beginning, and data may have many beginnings.
As with most digital data, astronomy data are inseparable from the software code used to clean, reduce, and analyze them. Data in the form of FITS files, which are already reduced through pipeline processing, can be analyzed with standard suites of tools, whether commercial or open source.
Many kinds of software code associated with astronomy data may not be subject to release, however. Astronomers may write their own pipelines for data they have collected via their own observing proposals or instruments. They may write specialized tools or scripts to analyze public data.
In other cases, the code in computational simulations is closely protected, but code associated with the analysis and interpretation of output is released. In yet others, outputs of simulations may not be released if they are not considered to be data. These are but a few of the many forms of data scholarship found in astronomy.