Scientific Data Mining and Knowledge Discovery

Scientific Data Mining

Chapter 1
Introduction

The purpose of computing is insight, not numbers.
—Richard Hamming [247, p. v]

Advances in sensor technology have enabled us to collect large amounts of data in fields such as remote sensing, astronomy, medical imaging, and high-energy physics. Complementing these data sets from experiments and observations are data sets from numerical simulations of complex physical phenomena such as the flow around an airplane, global climate modeling, and the evolution of galaxies in astrophysics. With computers having faster processing units, more memory, and many thousands of processors, it is now possible to simulate phenomena of a size and a complexity which were unthinkable a decade or two ago. The data generated by these simulations are very large, being measured in terabytes, and are also very complex, often with time-varying structures at different scales. Complementing these advances in sensor technology and high-performance computing are advances in storage technology, which have allowed more information to be stored on disks, as well as in file systems which allow rapid access to the data. To realize the full potential of these advanced data collection and storage technologies, we need fast and accurate data analysis techniques to find the useful information that had originally motivated the collection of the data.

1.1 Defining "data mining"

The term "data mining" has had a varied history [186, 555]. As with many emerging fields that evolve over time, borrowing and enhancing ideas from other fields, data mining too has had its share of terminology confusion. Part of the confusion arises from the multidisciplinary nature of data mining. Many who work in areas that comprise data mining, such as machine learning, exploratory data analysis, or pattern recognition, view data mining as an extension of what they have been doing for many years in their chosen field.
In some disciplines such as statistics, the terms "data mining" and "data dredging" have a negative connotation. This dates back to the time when these terms were used to describe extensive searches through data, where if you searched long and hard enough, you could find patterns where none existed (see [252, Section 1.7]). Such negative connotations have had the unfortunate consequence that statisticians have tended to ignore the developments in data mining, to the detriment of both communities.

Data miners too have contributed to the confusion surrounding the term "data mining." The term originally referred to a single step in the multistep process of Knowledge Discovery in Databases (KDD) [185]. KDD is defined as the "nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data." In this viewpoint, the data mining component of KDD is the means by which patterns are extracted and enumerated from the data. The other steps in KDD include data preparation, data selection, and interpretation of the results. As the acronym KDD suggests, this viewpoint assumes that the data reside in a database, which is rarely true for scientific data sets.

Other authors [548] refer to data mining as the "process of extracting valid, previously unknown, comprehensible and actionable information from large databases and using it to make crucial business decisions." The data mining process is applied to a data warehouse and is composed of the four basic steps of "data selection, data transformation, data mining, and result interpretation." This somewhat circular definition of the term "data mining" considers the data mining process, not KDD, to be the overall process of extracting useful information from data. Further, by focusing on data warehouses, it restricts data mining to commercial and business applications, which typically use databases and data warehouses.
In what may be a wise choice, some data miners have distanced themselves from the terminology debate, choosing instead to focus on how the ideas from different fields can be combined and enhanced to solve the problems of interest in data analysis. For example, the authors of a 1999 report of a NASA Workshop on Issues in the Application of Data Mining to Scientific Data [30] observed that "it was difficult to arrive at a consensus for the definition of data mining, apart from the clear importance of scalability as an underlying theme." They, like many other practitioners and proponents of data mining, agreed that data mining is a multidisciplinary field, borrowing ideas from diverse technologies including machine learning, statistics, visualization, image processing, high-performance computing, databases, and so on. It is difficult to clearly distinguish data mining from any single component discipline. What is new is the confluence of the mature offshoots of these technologies at a time when we can exploit them for the analysis of massive complex data sets. In addition, as data mining techniques are applied to new forms of data, such as text and electronic commerce, newer disciplines such as natural language processing and issues such as the handling of privacy are added to the fields that comprise data mining.

In this book, I consider data mining to be the process concerned with uncovering patterns, associations, anomalies, and statistically significant structures in data. I do not differentiate among terms such as "knowledge discovery," "KDD," and "data mining," nor do I attempt to address the subtle differences between data mining and any of its component disciplines. Instead, I focus on how these disciplines can contribute toward finding useful information in data. As the data analysis problems in science and engineering are often rather difficult, we need to borrow and build upon ideas from several different disciplines to address these challenging problems.
1.2 Mining science and engineering data

The mining of data from science and engineering applications is driven by the end goals of the scientists, in particular their reasons for analyzing the data. Often, the scientists and engineers who collect or generate the data start the process with the goal of addressing a particular question or a set of questions. Sometimes, the question is well formulated, with a clearly identified solution approach. In other cases, the question may be well formulated, but it may not be clear at all how one would proceed to answer the question. Usually, the process of answering one question gives rise to others, and the data may often be analyzed to answer questions completely unrelated to the one which motivated the collection of the data in the first place. In other situations, the scientists may not have a specific question, but may be interested in knowing what is in the data. This is often the case when the science of what is being observed or simulated is not completely understood.

A poorly formulated question, or data which may not be the most appropriate for answering the question at hand, are not the only challenges in mining science and engineering data sets. The characteristics of the data themselves can be a major impediment, whether they are data from experiments, simulations, or observations. A key issue is that the data are usually available in "raw" form. This can be as images, or mesh data from simulations, where several variables are associated with each point in a computational grid. There is often both a spatial and a temporal component to the data. Sometimes, data collected from more than one sensor may need to be mined together to exploit the complementary information obtained by the sensors. As a result, a substantial amount of preprocessing is required to identify the objects in the data, and to find appropriate representations for them, before we can identify patterns in the data.
Another characteristic of the data is that they may be very large in size, being measured in terabytes or petabytes. Such data are usually stored on tapes, possibly as part of a high-performance storage system, making access to them nontrivial. The data may also be distributed, either in separate files or even in different geographic locations. Mining these data accurately and efficiently may need the development of new algorithms or the implementation of parallel or distributed versions of existing algorithms.

As a result of their size and complexity, much of the data from science and engineering applications are never analyzed, or even looked at, once they have been collected. This is, of course, disconcerting to scientists who wonder about the science still left undiscovered in the data. More often than not, scientists are very interested in the analysis of their data and realize that the current techniques being used are inadequate for the information they want to extract from the data. But they may not know what expertise is necessary to accomplish this analysis, or where they can find this expertise. This provides a unique opportunity for data miners (and I use the term in a very generic sense) to collaborate with scientists and engineers not only to address their analysis problems and aid in scientific discovery but also, in the process, to identify challenging problems whose solution will advance the state of the art in analysis techniques.

1.3 Summary

Advances in sensor technology, high-performance computing, and high-density storage have all resulted in massive complex data sets in various domains in science and engineering. The multidisciplinary field of data mining holds the promise of allowing scientists and engineers to analyze these complex data sets and to find useful information in them.
Scientific data mining borrows ideas from a multitude of disciplines including image and video processing, machine learning, pattern recognition, information retrieval, and statistics. By combining techniques from these diverse disciplines, our goal is to help scientists and engineers fully exploit the benefits of the increased computational and data gathering capabilities now available.

In the next chapter, I describe the many ways in which data mining techniques are being used in various science and engineering applications. I then use these examples to motivate the end-to-end process of scientific data mining. Regarding terminology, for ease of description I use the term "scientific data" to mean data from both science and engineering applications.

Chapter 2
Data Mining in Science and Engineering

Science was originally empirical, like Leonardo making wonderful drawings of nature. Next came the theorists who tried to write down the equations that explained the observed behaviors, like Kepler or Einstein. Then, when we got to complex enough systems like the clustering of a million galaxies, there came the computer simulations, the computational branch of science. Now we are getting into the data exploration part of science, which is kind of a little bit of them all.
—Alex Szalay [573]

The analysis of data has formed a cornerstone of scientific discovery in many domains. While these domains, and the problems of analysis therein, may appear very different at first glance, there are several common themes among them. This chapter is an overview of the many ways in which data mining techniques are being used in science and engineering applications. The intent of this survey is severalfold. First, it illustrates the diversity of problems and the richness of the challenges encountered in scientific data mining.
Second, it highlights the characteristics of the data in various scientific domains, characteristics which are discussed further in Chapter 3. Third, it serves as a prelude to identifying the common themes and similarities in scientific data analysis, and thus motivates the tasks described in the data mining process in Chapter 4. Finally, by describing how scientists use data analysis techniques in several very different domains, this survey shows that techniques developed in the context of one domain can be applied or extended to other domains easily. I discuss these techniques further in later chapters, with the goal of enabling a crosscutting exchange of ideas across several disciplines.

The survey in this chapter is by no means an exhaustive one. It is not intended to cover all possible applications of data mining in scientific domains or all scientific domains with large, complex data sets. It is, however, intended to show how the more recent data mining techniques complement existing techniques from statistics, exploratory data analysis, and domain-specific approaches to find useful information in data.

The sections in this chapter are organized as follows. Sections 2.1 through 2.7 use illustrative examples to describe the various ways in which data mining techniques are being used in applications ranging from astronomy to remote sensing, biology, and physics. These examples use techniques which are described later in the book. For those unfamiliar with these techniques, it may be helpful to read this chapter again to better understand the application of the solution approaches. Each section also contains a subsection summarizing the characteristics of the data in that application domain. Section 2.8 briefly mentions some of the newer application areas where data mining techniques are beginning to make inroads. Finally, Section 2.10 includes several pointers for additional information.
2.1 Astronomy and astrophysics

Ever since humans first looked at the skies and observed obvious patterns such as the waxing and waning of the moon and the equinoxes, there has been an interest in understanding the universe through observations. The tools used for these observations have ranged from the simple, but surprisingly accurate, instruments of the ancient cultures to sophisticated telescopes, both earth-based, such as the Very Large Array (VLA) [606], and space-based, such as Hubble [281] and Chandra [96], to name a few. As the tools for collecting the data have become more complex, so have the techniques for analyzing the data.

Astronomers have been using techniques from the component disciplines of data mining long before the term itself became popular. In the 1970s, they realized that computers could be used for the automated construction of catalogs, which are systematic collections of astronomical object descriptions satisfying specified selection criteria. The benefits of automated catalog construction are many, including the ability to process large numbers of objects, the consistent application of selection criteria, and the ability to make exhaustive searches for all objects satisfying these selection criteria. In particular, the completeness of such catalogs formed an important factor in the analysis of the data using statistical techniques.

In addition to catalog generation, astronomers also realized that automated techniques could be used for the detection and analysis of objects in images on astronomical plates. As Jarvis and Tyson observed in their seminal paper [304], the use of an automated detection and classification system can avoid many potentially devastating and subtle systematic errors that occur in statistical astronomical studies which investigate properties of images over many magnitudes. These errors arise from the effects of different sample selections as well as variations in reduction and analysis techniques.
It is interesting to note that in the 1970s and early 1980s, even though the raw data were in hardcopy in the form of astronomical image plates, the use of automated techniques for catalog construction and image analysis was gaining acceptance in the astronomy community. The paper by Jarvis and Tyson [304] on the Faint Object Classification and Analysis System (FOCAS), though written in 1981, provides an interesting insight into the early beginnings of data mining in astronomy. The goals of FOCAS were many, including reliable classification of all objects as stars, galaxies, or noise; an accurate position for each object; and a characterization of object shape.

First, the raw data in the form of image plates were digitized into 6000 × 6000 pixel images which were then processed for object detection, classification, and catalog manipulation. The detection of objects was done using image segmentation. A low-pass filter (see Section 7.2.1) was first applied to the image. Then, any pixel whose value exceeded the local sky density value at that pixel by a specified threshold was kept as an object pixel. The size of the low-pass filter was chosen to meet the constraints on computational time and memory requirements of the systems being used (initially a Digital Equipment Corporation PDP 11/45, followed by a VAX 11/780). Once a pixel was identified as belonging to an object, groups of neighboring pixels were combined to form an object whose description was passed to an object evaluation module. This module extracted various parameters or features that characterized each of the objects. The average sky density at the object location was first calculated using domain-specific techniques and converted into an image intensity value. The densities of the object pixels were also converted into intensities, which were then used in the calculation of the features.
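The detection step just described — keep any pixel exceeding the local sky estimate by a threshold, then combine neighboring object pixels into objects — can be sketched in a few lines. This is an illustrative reconstruction, not the FOCAS code: the low-pass filtering stage is omitted, and the image, sky map, and threshold below are toy values.

```python
import numpy as np

def detect_objects(image, sky, threshold):
    """Flag pixels exceeding the local sky estimate by `threshold`,
    then group 4-connected flagged pixels into labeled objects."""
    mask = image > (sky + threshold)
    labels = np.zeros(image.shape, dtype=int)
    current = 0
    for seed in zip(*np.nonzero(mask)):
        if labels[seed]:
            continue                      # pixel already assigned to an object
        current += 1
        stack = [seed]                    # flood fill from this seed pixel
        while stack:
            r, c = stack.pop()
            if not (0 <= r < mask.shape[0] and 0 <= c < mask.shape[1]):
                continue
            if not mask[r, c] or labels[r, c]:
                continue
            labels[r, c] = current
            stack += [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
    return labels, current

# toy image: flat sky of 10 with two bright blobs
img = np.full((8, 8), 10.0)
img[1:3, 1:3] = 50.0
img[5:7, 5:6] = 40.0
labels, n = detect_objects(img, sky=np.full_like(img, 10.0), threshold=5.0)
print(n)  # 2 distinct objects
```

In practice, a library routine such as scipy.ndimage.label would replace the hand-rolled flood fill; the logic is the same.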
The main features used for each object were the Hu moments [279], which are discussed in more detail in Section 9.2. FOCAS used only three of the seven Hu moments, as they were scale-, rotation-, and translation-invariant. Additional features included the peak intensity of the object, which was defined as the intensity difference between the local sky intensity and the average intensity in a 3 × 3 pixel window centered on the object centroid; an effective radius of the object; a measure of how closely the object matched an ideal star image; and the object magnitude.

During the evaluation of these shape features, several consistency checks were made to ensure that an object was a valid image of a star or galaxy. While these checks did not remove all noise effects, they did reduce the size of the data that were sent to the classifier. Additional checks were also made to separate closely spaced objects that might have been treated as a single object in the object detection phase. It is interesting to note that FOCAS maintained separate files to store the output of the image segmentation step. It was a computationally expensive step and the astronomers did not want to repeat it if any changes were required in subsequent steps of the analysis. Also, in addition to the shape features, each object in the object list contained information on its position in the image plate, area in pixels, magnitude, and flags indicating exceptional conditions such as multiple objects in an area. These data formed the basic catalog generated from the image data.

Once the object list was generated for each object on an image plate, it was input to a nonparametric statistical classifier [25]. A two-stage classification was used. The noise objects were first identified using a single feature—the effective radius for bright objects and the peak intensity for dim objects.
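The translation invariance that makes the Hu moments attractive as features can be checked numerically. The sketch below implements only the first Hu moment (phi1 = eta20 + eta02, following the standard definition via normalized central moments); the function name and the toy image are illustrative, not from FOCAS.

```python
import numpy as np

def hu_phi1(image):
    """First Hu moment phi1 = eta20 + eta02, computed from
    normalized central moments of the pixel intensities."""
    ys, xs = np.mgrid[:image.shape[0], :image.shape[1]]
    m00 = image.sum()                          # zeroth-order moment (total mass)
    xbar = (xs * image).sum() / m00            # intensity-weighted centroid
    ybar = (ys * image).sum() / m00
    mu20 = ((xs - xbar) ** 2 * image).sum()    # second-order central moments
    mu02 = ((ys - ybar) ** 2 * image).sum()
    # normalization: eta_pq = mu_pq / m00 ** ((p + q) / 2 + 1) = mu / m00**2 here
    return mu20 / m00 ** 2 + mu02 / m00 ** 2

blob = np.zeros((32, 32))
blob[4:8, 4:8] = 1.0                                  # small square near a corner
shifted = np.roll(blob, (15, 15), axis=(0, 1))        # same square, translated
print(abs(hu_phi1(blob) - hu_phi1(shifted)) < 1e-12)  # True: translation-invariant
```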
However, as this set of decision surfaces could not adequately separate the stars and galaxies, a second level of classification was used that empirically combined classification and clustering techniques (see Chapter 11). First, a training set of approximately 800 stars and galaxies was created. The features for the objects in this set were scaled and normalized to reduce the variation in the features. A clustering algorithm, in this case the ISODATA algorithm [17, 18], was repeatedly applied until uniformly homogeneous clusters (to within a threshold) were obtained. Next, decision surfaces in the form of hyperellipsoids were generated to separate the star regions from the galaxy regions in the seven-dimensional feature space. The parameters of the hyperellipsoids were obtained using an iterative method based on minimizing the misclassification on the training set. Once the hyperellipsoids were generated, they were used to classify all the objects. Hyperellipsoids were chosen as they performed better than simpler surfaces, using the metric that the relative proportion of the two classes did not change significantly if another iteration of the classifier was made.

The approach used by the astronomers in FOCAS illustrates several key ideas and issues in scientific data mining. First, there is the focus on converting the low-level image data to higher-level information in the form of features, through the process of segmenting the images to identify the objects, followed by the extraction of features which are scale-, rotation-, and translation-invariant. Second, the astronomers took care to ensure that the input to the classification algorithm was of high quality by scaling and normalizing the features prior to classification, separating closely spaced objects, using flags to indicate exceptional conditions in the catalog, incorporating consistency checks to reduce the data input to the classifier, and carefully choosing the examples in the training set. Third, they paid attention to the results from the classifier and selected their options to obtain scientifically meaningful results. As we shall see in later chapters, these issues play an important role in the practical implementation of scientific data mining techniques.

Following the early work of Jarvis and Tyson on FOCAS, several astronomers used data mining techniques for the analysis of their data. For example, Odewahn et al. [456, 455] used neural networks (see Section 11.2.4) to discriminate between stars and galaxies in the Palomar Sky Survey. They observed that the most time-consuming part of constructing a neural network classifier was the manual classification of the training set; this should be as free of misclassifications as possible and span the full range of possibilities. Other potential factors that were identified as contributing to lower accuracies of classification were inadequate image parameters and problems in neural network construction. Either the image parameters used in classification did not represent enough pertinent image features, or the original image did not contain enough information to make an accurate classification. Further, the neural network may have had too few or too many nodes and may have been under- or overtrained. In addition, the astronomers observed that neural networks did not provide direct information on why an object was assigned a particular classification. This indicates another characteristic of scientific data mining, namely, the need of the scientist to understand how a decision was made.
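Hyperellipsoidal decision surfaces of the kind FOCAS used can be approximated, in a much simplified form, by scaling the features and then assigning each object to the class whose training cluster is nearest in Mahalanobis distance: the loci of constant distance are exactly hyperellipsoids in feature space. The two synthetic features and the "star"/"galaxy" clusters below are invented for illustration; this is not the iterative FOCAS fitting procedure.

```python
import numpy as np

def zscore(X):
    """Scale each feature to zero mean and unit variance."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

class EllipsoidClassifier:
    """Assign each object to the class whose training cluster is nearest
    in Mahalanobis distance; the equal-distance surfaces are hyperellipsoids."""
    def fit(self, X, y):
        self.stats = {}
        for label in np.unique(y):
            pts = X[y == label]
            # per-class mean and inverse covariance define the ellipsoid metric
            self.stats[label] = (pts.mean(axis=0), np.linalg.inv(np.cov(pts.T)))
        return self
    def predict(self, X):
        def dist(x, label):
            mu, icov = self.stats[label]
            d = x - mu
            return d @ icov @ d           # squared Mahalanobis distance
        return np.array([min(self.stats, key=lambda l: dist(x, l)) for x in X])

rng = np.random.default_rng(0)
stars = rng.normal([0, 0], [1.0, 0.2], size=(200, 2))     # tight synthetic cluster
galaxies = rng.normal([4, 4], [1.5, 1.5], size=(200, 2))  # diffuse synthetic cluster
X = zscore(np.vstack([stars, galaxies]))
y = np.array([0] * 200 + [1] * 200)
clf = EllipsoidClassifier().fit(X, y)
print((clf.predict(X) == y).mean())  # training accuracy, close to 1.0
```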
The need to understand how a decision was made is referred to as the interpretability of the models built by the classifier. Other examples of the use of neural nets in classification problems in astronomy include the morphological classification of galaxies [571, 2, 444], as well as spectral classification [570, 556], where principal components analysis is used to reduce the number of features used as input to the neural networks.

More recently, data miners have found astronomy to be a rich source of analysis problems. The Sky Imaging Cataloging and Analysis Tool (SKICAT) used classification techniques, specifically decision trees, to automate the reduction and analysis of the Digital Palomar Observatory Sky Survey (DPOSS-II) data. Using a modified version of FOCAS, they identified objects in the images using image segmentation techniques, followed by the extraction of 40 base-level attributes or features representing each object. However, these base-level features by themselves did not exhibit the required invariance between different regions of an image plate or across image plates. As Fayyad, Djorgovski, and Weir [184] observed, in classification learning, the choice of attributes used to characterize the objects was by far the single most determining factor of the success or failure of the learning algorithm. So, in addition, they used four new normalized attributes that were derived from the base-level attributes and possessed the necessary invariance between and across image plates. They also calculated two additional attributes representing a point-spread-function (PSF) template. This allowed the astronomers to compensate for the blurred appearance of point sources (stars) due to the turbulence of the earth's atmosphere. These derived attributes, along with the base-level attributes, were used in classification. The training set was generated manually, using higher-resolution images from a separate telescope with a higher signal-to-noise ratio for the fainter objects.
A decision tree classifier, trained on this higher-resolution data, was able to correctly classify objects obtained from the lower-resolution DPOSS-II images.

Figure 2.1. Subset from img3 from the "Volcanoes on Venus" data set in the UCI KDD archive [272]. The data set is the one used in the JARtool project [71], which developed an automatic system for cataloging small volcanoes in the large set of images of the surface of Venus returned by the Magellan spacecraft.

Several important conclusions can be drawn from the experiences of the SKICAT scientists. First, it was possible to create an automated approach that was accurate enough for use by the astronomers. Second, through the use of robust features and a training set from higher-resolution imagery, it was possible to obtain classifiers whose accuracy exceeded that of humans for faint objects. Third, the conversion of the data from the high-dimensional pixel space of images to the lower-dimensional feature space transformed the problem into one solvable by learning algorithms. The use of derived features in addition to the base-level features resulted in high accuracy within and across plates. Finally, SKICAT showed the potential of data mining techniques in the semiautomated analysis of very large data sets containing millions of objects.

Other recent efforts in mining astronomy data include the JPL Adaptive Recognition Tool (JARtool) [71] for cataloging volcanoes on Venus (see Figure 2.1) and its follow-on, the second-generation tool Diamond Eye [73], which provided an application-independent infrastructure for mining image collections. A similar application-independent approach was also used in the more recent Sapphire project, which built a set of tools for various tasks in scientific data mining. These tools were then used in different applications, such as finding bent-double galaxies in astronomical surveys [315].
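The dimensionality reduction mentioned earlier — principal components analysis ahead of a neural network or other classifier — can be sketched with a singular value decomposition of the centered data matrix. The 40 correlated synthetic "spectral" features below are invented for illustration; a real pipeline would choose the number of components from the explained variance.

```python
import numpy as np

def pca_reduce(X, k):
    """Project centered data onto its top-k principal components."""
    Xc = X - X.mean(axis=0)
    # right singular vectors of the centered data are the principal axes
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

rng = np.random.default_rng(1)
base = rng.normal(size=(100, 3))
# 40 correlated features built from 3 latent factors plus a little noise
X = base @ rng.normal(size=(3, 40)) + 0.01 * rng.normal(size=(100, 40))
Z = pca_reduce(X, 3)
print(Z.shape)  # (100, 3)
```

Because the synthetic data are essentially rank three, the three retained components capture nearly all of the variance.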
The system architectures for some of these efforts are discussed in further detail in Chapter 13.

Much of the focus in astronomy is on observation data from surveys, though it is also possible to mine data from simulations of astrophysical phenomena such as the evolution of a star [579]. This topic will be covered in Section 2.5, which discusses the analysis of data from computer simulations.

2.1.1 Characteristics of astronomy data

There are several characteristics which are common across data sets from observational astronomy, and the analyses conducted on these data sets, regardless of the instruments used to collect the data or the frequency at which the sky is being observed. These characteristics include:

• Massive size of the data: Astronomical surveys, whether conducted through ground-based telescopes such as the VLA [606], or space-based telescopes such as the Hubble Space Telescope (HST) [281], all result in very large amounts of data. For example, the Massive Compact Halo Objects (MACHO) survey [405], which concluded in 2000, resulted in 8 terabytes of data, and the Sloan Digital Sky Survey (SDSS) [528] will have nearly 15 terabytes of data when complete. Telescopes currently being planned, such as the Large Synoptic Survey Telescope (LSST) [396], are expected to collect more than 8 terabytes of data per night, producing several petabytes (a petabyte is 10^15 bytes) of data per year, once it is operational around 2010.

• Data collected at different frequencies: Astronomers survey the sky at different wavelengths including optical, infrared, X-ray, and radio frequencies. Since celestial objects radiate energy over an extremely wide range of wavelengths, important information about the nature of the objects and the physical processes inside them can be obtained by combining observations at several different wavelengths.
For example, the Faint Images of the Radio Sky at Twenty centimeters (FIRST) [174] surveyed the sky at radio frequency, while LSST is an optical telescope, and the Chandra telescope [96] surveys the sky at X-ray frequency.

• Real-time analysis: While much of the astronomical data are collected once and analyzed many times, usually off-line, there is an increasing need for real-time analysis in telescopes under development. For example, it is the goal of the LSST to make quality assurance data available to the telescope control system in real time to ensure correct operation of the telescope. In addition, as the image data are collected, they will be moved through data analysis pipelines that will compare the new image with previous images of the same region, generate prioritized lists of transient phenomena, and make the information available to the public in near real time for further observation.

• Noisy data: Astronomical images are often noisy due to factors such as noise from the sensors and distortions due to atmospheric turbulence. For example, Figure 2.2 shows the "noise" patterns in the FIRST data that arise due to the Y-shaped arrangement of the telescopes in the VLA [606]. In addition, astronomical images may have missing data due to a sensor malfunction, or invalid pixel values caused when an extremely bright object makes the pixels around it artificially bright.

• Lack of ground truth: In astronomy, a key challenge to validating the results of analysis can be the lack of ground truth. In other words, how do we verify that a region in an image of the surface of Venus is really a volcano, or that a galaxy is really a bent-double galaxy? This issue can be even more challenging for the negative examples; for example, is a galaxy non-bent-double because it is truly one, or is it just labeled as one because in the image, which is a two-dimensional projection of the three-dimensional universe, it appears to be one?

• Temporal data analysis: In many astronomical surveys, the goal is to study transient phenomena. For example, the goal of the MACHO survey [405] was to detect microlensing, which was indicated by changes in the intensity of an object over time. The LSST survey, on the other hand, is interested in detecting near-earth objects that move fast and must be detected in multiple exposures. Such temporal analyses require that images of a region taken at different times must first be aligned before any comparison with previous images can be made.

Figure 2.2. Image from the FIRST survey [174], extracted from the FIRST cutout server at RA = 00 56 2.084 and DEC = −01 20 47.41, using an image size of 10 arc minutes and a maximum intensity scaling of 10 mJy. The image is displayed using a log scale for clarity. Note the "Y" pattern of the noise in the image. The brighter wavy object in the center is a bent-double galaxy.

Several other characteristics of astronomy data are also worth mentioning. These data are often available publicly, usually as soon as the initial processing of the raw data from the telescopes is completed. Often, data once collected are analyzed many times, sometimes for reasons other than the one that motivated the initial collection of the data. Careful attention is paid to metadata, which is information describing the data. Items such as the time when the data were collected, the telescope settings, and the techniques used to process the data are meticulously described and associated with any derived data products. Further, as more advanced algorithms for processing the data become available, the data are reanalyzed and new versions of derived data products created.
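The alignment requirement noted above — registering exposures of the same region before differencing them — can be sketched with FFT-based cross-correlation, under the strong simplifying assumption of a pure integer translation (real survey pipelines must also handle rotation, scale, and subpixel shifts).

```python
import numpy as np

def estimate_shift(ref, img):
    """Estimate the integer (row, col) shift to apply to `img` so that it
    aligns with `ref`, via circular cross-correlation computed with the FFT."""
    corr = np.fft.ifft2(np.fft.fft2(ref) * np.conj(np.fft.fft2(img)))
    r, c = np.unravel_index(np.argmax(np.abs(corr)), corr.shape)
    # interpret peaks past the midpoint as negative (wrap-around) shifts
    if r > ref.shape[0] // 2:
        r -= ref.shape[0]
    if c > ref.shape[1] // 2:
        c -= ref.shape[1]
    return r, c

rng = np.random.default_rng(2)
ref = rng.normal(size=(64, 64))               # first exposure of a region
moved = np.roll(ref, (3, -5), axis=(0, 1))    # second exposure, shifted
dy, dx = estimate_shift(ref, moved)
aligned = np.roll(moved, (dy, dx), axis=(0, 1))
print(dy, dx, np.allclose(aligned, ref))      # -3 5 True
```

Once aligned, a pixelwise difference of the two exposures highlights intensity changes such as the microlensing events MACHO searched for.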
Since many astronomical surveys cover the same region of the sky using different instruments and wavelengths, the astronomical community is developing "Virtual Observatories" 173, 597. The goal of these virtual observatories is to enable astronomers to mine data which are distributed across different surveys, where the data are stored in geographically distributed locations. Both the European and the American efforts are currently addressing the major challenge of having the data in all these different archives correspond to the same format. Of greater interest will be the mining of the data across surveys, which would include statistical analyses such as correlations across data taken at different resolutions, different wavelengths, and different noise characteristics.

2.2 Remote sensing

In contrast to astronomy, where the data are "look-up" data, remote sensing data can be considered as "look-down" data. Some of the earliest images of the earth were taken by cameras strapped on pigeons. More sophisticated images taken from balloons and aircraft soon followed, and current technology now provides us access to high-resolution satellite imagery, taken in several spectral bands, at several resolutions 527, 381. Remote sensing systems are invaluable in monitoring the earth. Their uses include global climate change detection through the identification of deforestation and global warming; yield prediction in agriculture; land use and mapping for urban growth; resource exploration for minerals and natural gas; as well as military surveillance and reconnaissance for the purpose of tactical assessment. The era of earth remote sensing began with the Landsat multispectral scanning system in 1972, which provided the first consistent high-resolution images of the earth. The data were collected in four spectral bands, at a spatial resolution of 80 meters, with coverage being repeated every 18 days.
Systems deployed since then have improved the coverage, both in terms of the number of spectral bands as well as the spatial resolution. For example, data from the IKONOS satellite 212, which was launched in September 1999, are available in the red, green, blue, and near-infrared bands at 4 meters resolution and in grey-scale or panchromatic at 1 meter resolution. Data from the Quickbird satellite 146 are available at an even higher resolution of 61 centimeters in panchromatic and 2.4 meters in four bands (red, green, blue, and near-infrared). Note that, at such resolution, it is possible to clearly identify everyday objects such as cars, highways, roads, and houses in the satellite imagery. This opens up potential new applications such as the detection of human settlements. It also presents an important challenge in storing, managing, retrieving, and analyzing these large data sets. A key task in the mining of remote sensing imagery is the identification of man-made structures such as buildings, roads, bridges, airports, etc. Some of the early work in this area using aerial imagery focused on exploiting user-specified information about the properties of these structures and their relationships with other more easily extracted objects. For example, in their 1982 paper 450, Nevatia and Price consider the task of finding airports in an aerial image of the San Francisco Bay area in California. They exploit the knowledge that some of the airports have runways near or projecting into the water. They first use a combination of two simple image segmentation techniques (see Chapter 8)—a line detector to find roads and a region segmentor to identify land-water boundaries. Each region or line is considered as an identifiable object, and features, such as average color and intensity values of a region, the height to width ratio of a region, and the location of the center, are extracted.
User-specified information in the form of a rough sketch of the region identifying the major structures and their location relative to each other is used to map the regions in the aerial imagery with the corresponding objects in the sketch. The sketch is represented as a graph, with the nodes identifying the objects (for example, San Francisco Bay North and Angel Island) and the edges indicating the relationship between the objects (for example, is-a-neighbor-of). The sketch is general and can be applied to any view of the area. Finally, Nevatia and Price used graph-matching techniques to create an annotated aerial image of the San Francisco Bay area. With the availability of high-resolution satellite imagery from different sensors, the analysis of such imagery for man-made structures has also become more sophisticated. For example, Mandal, Murthy, and Pal 413 analyze data from the Indian Remote Sensing (IRS) satellite, with a resolution of 36.25 meters. They focus on the green and infrared bands as they are more sensitive in discriminating various land-cover types. First, they classify the image pixels into 6 classes (such as concrete structures, open space, etc.), followed by further processing using heuristics to identify man-made structures. For example, a heuristic may be that an airport should have a runway, which is a discrete narrow linear concrete structure region at least 30 pixels in length. In 270, Henderson and Xia provide a status report on the use of Synthetic Aperture Radar (SAR) imagery in human settlement detection and land-use applications. They observe that radar imaging can provide information which is complementary to the information provided by systems operating in the visible and near-infrared regions. In Lee et al. 367, the authors consider the problem of extracting roads from the 1-meter IKONOS satellite imagery 212.
They first segment the imagery using a variant of the watershed algorithm (see Section 8.2), identify road candidates using information about the regions such as elongatedness, and then expand the list of identified candidates by connecting close-by roads. IKONOS imagery has also been used in a multilevel process for the identification of human settlements using decision trees and statistical techniques 323. A recent work describing the practical issues in applying machine learning techniques to remote sensing imagery focuses on the detection of rooftops in aerial imagery 412. The authors discuss issues such as the importance of choosing features relevant to the task, the problems with labeling the rooftop candidates, including consistency in labeling and disagreements among experts, and the effect of having a severely skewed training set with many more negative examples than positive ones, accompanied by a higher cost of errors made in misclassifying the minority class. In addition to land-use applications, remote sensing imagery has also been used for meteorological data mining. For example, Kitamoto 342 describes the use of techniques from the informatics communities, such as pattern recognition, computer vision, and information retrieval, to analyze and predict typhoons using satellite images that capture the cloud patterns of the typhoon. In contrast to mathematical models used in numerical weather prediction, where the model is deduced from partial differential equations describing the dynamics of fluids in the gravity field, Kitamoto considers pattern recognition and the past experience of experts as an indispensable part of the approach. The data used are from the Japanese GMS-5 geostationary satellite in one visible and three infrared bands, the latter enabling observation of clouds even during night-time.
Cloud patterns are identified by dividing a 512 × 512 pixel image into 8 × 8 pixel blocks and assigning a "cloud amount" value to the data in each block. This gives a 64 × 64 or 4096-dimensional cloud-amount vector which is then used in subsequent analysis using techniques such as principal component analysis, clustering with the k-means algorithm, and self-organizing maps. While early results with the use of data mining techniques appear promising, Kitamoto cautions that one must not forget the fundamental difficulty in predicting atmospheric events, namely the chaotic nature of the atmosphere. A task related to classifying and predicting typhoons is that of tracking them over time. For example, Zhou et al. 648 use motion analysis to track hurricanes in multispectral data from the Geostationary Operational Environmental Satellite (GOES). Since the data are collected in a super-rapid scan sequence at 1 minute intervals, it is possible to observe hurricane dynamics in the imagery. Using a nonrigid motion model and the Levenberg–Marquardt nonlinear least-squares method, Zhou et al. fit the model to each small cloud region in the image and then estimate the structure and motion correspondences between the frames of the data. Another related work using data from GOES is the detection of cumulus cloud fields by Nair et al. 445, who compare the accuracy of two techniques. They found that a structural thresholding method based on the morphology of cloud fields outperformed an approach based on classification algorithms applied to edge and spectral features extracted from the imagery. Classifiers such as neural networks have also been used in cloud classification using texture features extracted from Landsat imagery 368. A different example of mining remote sensing data is the Quakefinder system described by Stolorz and Dean 569.
This system analyzed panchromatic data collected by the French SPOT satellite at 10 meter resolution to detect and measure tectonic activity in the earth's crust. Using the 1992 Landers earthquake in Southern California, Stolorz and Dean found that the ground displacements varied in magnitude up to 7 meters, which was smaller than the pixel resolution, making naïve pixel-to-pixel change detection methods inapplicable. They had to use sophisticated methods to infer the motion of a single pixel, a process that had to be repeated for each pixel in the image. As even a modest-sized image contained 2050 × 2050 pixels, this required the use of a massively parallel processor. In addition, the authors found that they had to incorporate corrections for slight differences in spacecraft trajectories caused by differences in height above the target region as well as the yaw, pitch, and roll of the spacecraft. Spurious differences between scenes due to differences in the sun angle and the view angle were found to be negligible as the satellite was in a sun-synchronous orbit. Other interesting applications of mining remote sensing imagery include the detection of fire smoke using neural networks 380, the automatic detection of rice fields 589, and the monitoring of rice crop growth 355. Kubat, Holte, and Matwin 352 illustrate several aspects of practical data mining in their work on the detection of oil spills. In particular, they discuss issues related to the scarcity of the data, as only a few images contained oil spills, which, in turn, resulted in an unbalanced training set. They also observed that their examples showed a natural grouping in batches, where the examples in a batch were from a single image, leading to greater similarity among them. In addition, they found that they had to carefully consider which images to present to the human who can control the trade-off between false alarms and missed positives.
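The unbalanced training set that Kubat, Holte, and Matwin faced is common whenever the phenomenon of interest is rare. As a hedged illustration of one standard baseline remedy (not the method used in their paper), the sketch below balances a binary training set by randomly undersampling the majority class; the function name and interface are hypothetical.

```python
import numpy as np

def undersample_majority(X, y, seed=0):
    """Randomly undersample the majority class so both classes appear
    equally often. X is an (n, d) feature array; y is a binary (n,)
    label array. Returns a balanced (X, y)."""
    rng = np.random.default_rng(seed)
    idx_pos = np.flatnonzero(y == 1)
    idx_neg = np.flatnonzero(y == 0)
    # Identify minority/majority classes and downsample the larger one.
    if len(idx_pos) < len(idx_neg):
        minority, majority = idx_pos, idx_neg
    else:
        minority, majority = idx_neg, idx_pos
    keep = rng.choice(majority, size=len(minority), replace=False)
    idx = np.concatenate([minority, keep])
    rng.shuffle(idx)                 # avoid class-ordered batches
    return X[idx], y[idx]
```

Undersampling discards data, so alternatives such as reweighting the classes or adjusting the decision threshold are often preferred; the point is simply that raw accuracy on a severely skewed set is a misleading measure of performance.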
2.2.1 Characteristics of remote sensing data

Remotely sensed images, whether obtained from satellites or aerial photography, are a very rich source of data for analysis. The characteristics of these data include:

• Massive size: The increasing sophistication of remotely sensed systems has resulted in an explosive growth in the amount of data being collected. NASA's Earth Observing System (EOS) Project 171 represents an extreme of this trend. Starting in 1997, several satellites have been documenting the earth's processes and global change, acquiring 1.6 terabytes of images and other data per day. This will result in over 11,000 terabytes of data over the 15-year lifetime of the project.

• Multiresolution data: Over the years, as the earth has been observed by various systems ranging from Landsat to IKONOS and Quickbird, we have acquired imagery at resolutions ranging from a few hundred meters to under a meter. As a result, we have several images of an area taken over time at different resolutions. When such data, collected over time, must be analyzed to understand changes that have occurred over long periods, we will need to use analysis techniques which can exploit the different resolutions of the data.

• Multispectral data: While some of the very early remotely sensed data were collected in only one band (i.e., panchromatic), data collected more recently are often in several spectral bands. These range from the 4-band multispectral data from IKONOS and Quickbird to the 224-band hyperspectral data from the Advanced Visible/Infrared Imaging Spectrometer (AVIRIS) 11. This acquisition of imagery at various spectral bands simultaneously provides multiple "snapshots" of spectral properties which can contain far more information than a single band alone. This allows a scientist to assign a spectral signature to an object when the resolution is too low for identification using shape or spatial detail.
Note that the increase in the number of spectral bands also increases the size of the data collected. • Multisensor data: In addition to the resolution or the number of spectral bands, remote sensing systems are also characterized by the sensors used, which in turn in- fluences the way in which the data are processed. Several factors must be considered, including the orbit of the platform which affects the angle at which the image is taken and the number of days between imaging of nearby swathes; the platform attitude, that is, the roll, pitch, and yaw of the platform which may distort the image; the scan geometry which determines the order in which the scene is scanned and the number of detector elements used in scanning; topographic distortions; and so on 527. • Spatiotemporal data: Since remote sensing platforms, especially satellites, cover the same region of the earth several times, there is an implicit temporal aspect to the data collected. As a result, change detection is an important application in remote sensing, not just using data collected over time by a single sensor, but also data collected by different sensors. Thus image registration, that is, aligning the satellite images, becomes an important task in mining remotely sensed data.16 Chapter 2. Data Mining in Science and Engineering Figure2.3. An image from the Landsat Gallery illustrating how missing data due to sensor malfunction may be handled in remote sensing imagery. The left image is path 39 row 37, acquired over Salton Sea in southern California on 9/17/2003, and it shows the scan gaps caused by the failed Scan Line Corrector. The right image is the same data, but with the gaps filled by using data acquired on 9/14/2002. Image: Grey-scale version of Landsat_Gallery_394_1_450.jpg from • Missing data: As with any other observational science, remote sensing data can suffer from missing data problems due to sensor malfunction (see Figure 2.3). 
Filling in these gaps must be done with care, not only to avoid the introduction of artifacts, but also to ensure the integrity of the image.

• Noisy data: Data in remotely sensed imagery can be noisy due to sensor noise or extraneous objects, such as clouds in optical imagery that may obscure the scene on the earth (see Figure 2.4). However, what is considered as noise may depend on the problem being solved; clouds in a scene could be considered as the signal in an application involving the classification of clouds.

Several other characteristics of remotely sensed data are worth mentioning. Since the raw data sets are very large, they may be available after various levels of processing. For example, EOS data products are available at levels ranging from Level 0 to Level 4 167. Level 0 data products are raw EOS instrument data at full instrument resolution. At higher levels, the raw instrument data are converted into more usable parameters and formats of interest to the scientific community. For example, Level 3 data products are variables mapped on uniform space-time grids, usually with some completeness and consistency. At Level 4, the parameters are further refined through the use of models. Many of the data sets from remote sensing are publicly available. Some, such as the data from EOS, are free, while others are available for a fee.

Figure 2.4. Atmospheric vortices over Guadalupe Island showing the cloud cover obscuring the land mass underneath. If the task is to obtain more information about the land mass, then this image would be inappropriate. If, however, the task is to identify the type of clouds, this image would be an ideal one. Image courtesy of NASA's Earth Observatory.

In many remote sensing applications, ground truth can be difficult or expensive to obtain. For example, soil samples may need to be collected to evaluate analysis techniques that determine mineral content from remotely sensed imagery.
The areas with these samples may be inaccessible, or there may be too many of them widely distributed geographically to make the collection of samples feasible. While much of the discussion in this section has focused on remote sensing of the earth, it is worth mentioning that the same techniques can be applied to remotely sensed data from other planets such as Mars, where such analysis can be an invaluable precursor in planning for "in situ," that is, in-place, sensors. Remote sensing systems are also a key part of surveillance and reconnaissance 365, where the purpose is to acquire military, economic, and political intelligence information. Reconnaissance refers to preliminary inspection, while surveillance maintains close observation of a group or location. The latter involves frequent or continuous coverage and involves the collection of video data, further aggravating the challenges of handling and analyzing massive data sets. This topic is further discussed in Section 2.4. No discussion of remote sensing would be complete without a mention of Geographic Information Systems (GIS) 350. In addition to being a cartographic tool to produce maps, a GIS stores geographic data, retrieves and combines this data to create new representations of geographic space, and provides tools for spatial analysis. Examples of data in a GIS include information on streets, rivers, countries, states, power lines, school districts, office buildings, and factories. The results from the analysis of remotely sensed imagery, when used together with the complementary information in a GIS, can form an invaluable tool.

2.3 Biological sciences

A very rich source of challenging data mining problems are the fields of biological sciences, including medical image analysis, clinical trials, and bioinformatics. While it is possible to treat each topic as a separate section (and each is broad enough to cover several books), I have chosen to include them in one section, as the lines between them are beginning to blur—advances in genomics and proteomics are leading to an improved understanding of systems-level cellular behavior with potential benefits to clinical research.

2.3.1 Bioinformatics

Bioinformatics is the science of managing, mining, and interpreting information from biological sequences and structures. It came into prominence through the Human Genome Project 603, 358. Fueled by advances in DNA sequencing and genome mapping techniques, the Human Genome Project resulted in large databases of genetic sequences, providing a rich source of challenges for data miners. More recently, the field of proteomics promises an even more radical transformation of biological and medical research. The proteome defines the entire protein complement in a given cell, tissue, or organism. In a broad sense, proteomics can be considered to be everything "postgenomic" and includes protein activities, the interactions between proteins, and the structural description of proteins and their higher-order complexes. By studying global patterns of protein content and activity, we can identify how proteins change during development or in response to a disease. This not only can boost our understanding of systems-level cellular behavior, but also benefit clinical research through the identification of new drug targets and the development of new diagnostic markers. An excellent introduction to various aspects of this emerging field of proteomics is given in a series of articles in the special issue of Nature 448, as well as in the more recent survey of the computational techniques used in protein structure prediction 468. Data mining techniques have found an important role in genomics, where they are extensively used in the analysis of data in the genetic sequence and protein structure databases 609.
For example, the text by Stekel 566 is a good introduction to microarray bioinformatics, providing an overview of microarray technology; the use of image processing techniques to extract features from microarrays; the role of exploratory data analysis techniques in cleaning the data; and the use of statistics and machine learning techniques, such as principal component analysis, clustering, and classification, to study the relationship between genes or to identify genes or samples that behave in a similar or coordinated manner. Baldi and Brunak 16 provide a more machine learning focus to bioinformatics and discuss how techniques such as neural networks, hidden Markov models, and stochastic grammars can be used in the analysis of DNA and protein sequence data. Their text also includes a chapter on Internet resources and public databases. Other introductory texts in this area include the ones by Baxevanis and Ouellette 27 and Lesk 371. The challenges in proteomics far surpass those in genomics and offer many opportunities for the use of data mining techniques. For example, Zaki 643 discusses how data mining techniques, such as association rules and hidden Markov models, can be used in protein structure prediction. Using a database of protein sequences and their three-dimensional structure in the form of contact maps, he shows how one can build a model to predict whether pairs of amino acids are likely to be in contact and to discover common nonlocal contact patterns. Proteomics is also having an effect on drug discovery—as most drug targets are proteins, it is inevitable that proteomics will enable drug discovery 251. For example, Deshpande, Kuramochi, and Karypis 138 use the discovery of frequent subgraphs to classify chemical compound data sets. They observe that graph-based mining is more suitable for handling the relational nature of chemical structures. By combining methods for finding frequent subgraphs in chemical compounds with traditional machine learning techniques, such as rules and support vector machines, they found that they could improve the classification accuracy on problems such as the discovery of carcinogenic compounds and compounds that inhibit the HIV virus. The area of computational biology uses computational systems to simulate complex structures and processes inherent in living systems. Instead of focusing on just one level, say genomics or proteomics, there is an increasing interest in mining data across different levels, ranging from various genome projects, proteomics, and protein structure determination to digitized versions of patient medical records. Just as in physics, computers are now being used to simulate biological processes, from the atomic and molecular level to cells and parts of organs. For example, the Blue Brain project 420, 46 is using massively parallel computers to create a biologically accurate functional model of the brain. The goal is to increase our understanding of brain function and dysfunction and help explore solutions to various neurological diseases. The use of computer simulations in understanding biological phenomena comes with all the issues associated with the use of simulations in any science, namely, the analysis of massive amounts of complex data, the understanding of the sensitivity of the results to various input parameters, the verification of the computer models, and the validation of the simulations by comparisons with experiments or observations. Another interesting application of data mining in bioinformatics is in automating the collection, organization, summarization, and analysis of data present in the literature 351.
This requires the use of text mining and optical character recognition to understand the text in documents, image mining to extract information from images and plots in documents, and the use of machine learning and statistical techniques to find the connections between various documents.

2.3.2 Medicine

Statistical techniques have had a long history in the field of clinical trials, where they were used for drawing conclusions about the efficacy of a drug used in the treatment of a disease or for finding patterns in the occurrences of diseases 232. With the availability of other forms of data, such as images from X-rays and magnetic resonance imaging (MRI), the field of medical imaging is also providing data that can be used in the diagnosis of diseases 558, 37, 20. There are several examples where data mining techniques are being used in medical imagery. A compelling example is the use of neural network technology in the detection of breast cancer in mammograms 219. Breast cancer is a serious problem, with early detection having a dramatic effect on raising the survival rate. While mass radiological screening is the only reliable means of early detection, the examination of the mammograms is very tedious and inefficient. When a single radiologist has to examine a large volume of mammograms in a short time, it can lead to missed cancers. A computer-assisted approach can be invaluable in such circumstances, leading not only to reduced radiologist time per mammogram, but also to improved early detection, as the computer system could draw attention to lesions that have been overlooked in the mammogram. A practical implementation of such technology is the ImageChecker 288, which uses neural networks to provide a potential improvement of over 20% in the detection of breast cancer. A more detailed look at the issues encountered in computer-aided mammographic screening is provided by Kegelmeyer et al. 333. This work built upon earlier work in the
