Lecture notes on Data Warehousing

lecture notes for data warehousing and mining, lecture notes in data mining and data warehousing,what is data warehousing architecture and applications pdf free download
Dr.JakeFinlay Profile Pic
Published Date:22-07-2017
Your Website URL(Optional)
? Data WarehousingDATA WAREHOUSING AND DATA MINING CHAPTER 1 LESSON 1 DATA WAREHOUSING INTRODUCTION TO DATA WAREHOUSING Structure communication requirements it is possible to incorporate additional or expert information it is. • Objective The logical link between what the managers see in their decision � Introduction Support EIS application and the company’s operational � Meaning of Data warehousing activities Johan McIntyre of SAS institute Inc. � History of Data warehousing In other words the data warehouse provides warehouse � Traditional Approaches To Historical Data provides data that is already transformed and summarized, � Data from legacy systems therefore making it an appropriate environment for the more efficient DSS and EIS applications. � Extracted information on the Desktop A data warehouse is a collection of corporate information, � Factors, which Lead To Data Warehousing derived directly from operational system and some Objective external data sources. The main objective of this lesson is to introduce you with the Its specific purpose is to support business decisions, not basic concept and terminology relating to Data Warehousing. business ask “What if?” questions. The answer to these By the end of this lesson you will be able to understand: questions will ensure your business is proactive, instead of � Meaning of a Data warehouse reactive, a necessity in today’s information ago. � Evolution of Data warehouse The industry trend today is moving towards more powerful Introduction hardware and software configuration, we now have the ability to Traditionally, business organizations create billions of bytes of process vast volumes of information analytically, which would data about all aspects of business everyday, which contain have been unheard of tenor even five years ago. A business millions of individual facts about their customers, products, today must we able to use this emerging technology or run the operations, and people. However, this data is locked up and is risk if being information under loaded. As you read that extremely difficult to get at. Only a small fraction of the data correctly - under loaded - the opposite of over loaded. Over- that is captured, processed, and stored in the enterprise is loaded means you are so determine what is important. If you actually available to executives and decision makers. are under loaded, you are information deficient. You cannot cope with decision – making expectation because you do not Recently, new concepts and tools have evolved into a new know where you stand. You are missing critical pieces of technology that make it possible to provide all the key people information required to make informed decisions. within the enterprise with access to whatever level of informa- tion needed for the enterprise to survive and prosper in an To illustrate the danger of being information under loaded, increasingly competitive world. The term that is used for this consider the children’s story of the country mouse is unable to new technology is “data warehousing”. In this unit I will be cope with and environment its does not understand. discussing about the basic concept and terminology relating to What is a cat? Is it friend or foe? Data Warehousing. Why is the chess in the middle of the floor on the top of a The Lotus was your first test of “What if “processing on the platform with a spring mechanism? Desktop. This is what a data warehouse is all about using Sensory deprivation and information overload set in. The information your business has gathered to help it react better, picture set country mouse cowering in the corner. If is stays smarter, quicker and more efficiently. there, it will shrivel up and die. The same fate awaits the Meaning of Data Warehousing business that does not respond to or understand the environ- Data warehouse potential can be magnify if the appropriate data ment around it. The competition will moves in like cultures and has been collected and stored in a data warehouse. A data exploit all like weaknesses. warehouse is a relational database management system In today’s world, you do not want to be the country mouse. In (RDBMS) designed specifically to meet the needs of transaction today’s world, full of vast amounts of unfiltered information, a processing system. It can be loosely defined as any centralized business that does not effectively use technology to shift data repository, which can be queried for business benefit, but through that information will not survive the information age. this will be more clearly defined letter. Data warehouse is a new Access to, and the understating of, information is power. This powerful technique making. It possible to extract archived power equate to a competitive advantage and survival. This unit operational data and over come inconsistencies between will discuss building own data warehouse-a repository for different legacy data formats, as well as integrating data through- storing information your business needs to use if it hopes to out an enterprise, regardless of location, format, or survive and thrive in the information age. We will help you 1OUSING AND DA NING understand what a data warehouse is and what it is not. You the disadvantages faced it led to the development of the new will learn what human resources are required, as well as the roles application called Data Warehousing and responsibilities of each player. You will be given an Factors, which Lead To Data Warehousing overview of good project management techniques to help Many factors have influenced the quick evolution of the data ensure the data warehouse initiative dose not fail due the poor warehousing discipline. The most important factor has been the project management. You will learn how to physically imple- advancement in the hardware and software technologies. ments a data warehouse with some new tools currently available Hardware and Software prices: Software and hardware prices to help you mine those vast amounts of information stored have fallen to a great extent. Higher capacity memory chips are with in the warehouse. Without fine running this ability to available at very low prices. mine the warehouse, even the most complete warehouse, � Powerful Preprocessors: Today’s preprocessor are many would be useless. times powerful than yesterday’s mainframes: e.g. Pentium III History of Data Warehousing and Alpha processors Let us first review the historical management schemes of the � Inexpensive disks: The hard disks of today can store analysis data and the factors that have led to the evolution of hundreds of gigabytes with their prices falling. The amount the data warehousing application class. of information that can be stored on just a single one-inch Traditional Approaches to Historical Data high disk drive would have required a roomful of disk drives Throughout the history of systems development, the primary in 1970’s and early eighties. emphasis had been given to the operational systems and the � Desktop powerful for analysis tools: Easy to use GUI data they process. It was not practical to keep data in the interfaces, client/server architecture or multi-tier computing operational systems indefinitely; and only as an afterthought can be done on the desktops as opposed to the mainframe was a structure designed for archiving the data that the opera- computers of yesterday. tional system has processed. The fundamental requirements of � Server software: Server software is inexpensive, powerful, the operational and analysis systems are different: the opera- and easy to maintain as compared to that of the past. tional systems need performance, whereas the analysis systems Example of this is Windows NT that have made setup of need flexibility and broad scope. powerful systems very easy as well as reduced the cost. Data from Legacy Systems The skyrocketing power of hardware and software, along with Different platforms have been developed with the development the availability of affordable and easy-to-use reporting and of the computer systems over past three decades. In the 1970’s, analysis tools have played the most important role in evolution business system development was done on the IBM mainframe of data warehouses. computers using tools such as Cobol, CICS, IMS, DB2, etc. With the advent of 1980’s computer platforms such as AS/400 Emergence of Standard Business and VAX/VMS were developed. In late eighties and early Applications nineties UNIX had become a popular server platform introduc- New vendors provide to end-users with popular business ing the client/server architecture which remains popular till date. application suites. German software vendor SAP AG, Baan, PeopleSoft, and Oracle have come out with suites of software Despite all the changes in the platforms, architectures, tools, and that provide different strengths but have comparable function- technologies, a large number of business applications continue ality. These application suites provide standard applications that to run in the mainframe environment of the 1970’s. The most can replace the existing custom developed legacy applications. important reason is that over the years these systems have This has led to the increase in popularity of such applications. captured the business knowledge and rules that are incredibly Also, the data acquisition from these applications is much difficult to carry to a new platform or application. These systems simpler than the mainframes. are, generically called legacy systems. The data stored in such systems ultimately becomes remote and becomes difficult to get End-user more Technology Oriented at. One of the most important results of the massive investment in technology and movement towards the powerful personal Extracted Information on the Desktop computer has been the evolution of a technology-oriented During the past decade the personal computer has become very business analyst. Even though the technology-oriented end popular for business analysis. Business Analysts now have users are not always beneficial to all projects, this trend certainly many of the tools required to use spreadsheets for analysis and has produced a crop of technology-leading business analysts graphic representation. Advanced users will frequently use that are becoming essential to today’s business. These technol- desktop database programs to store and work with the ogy-oriented end users have frequently played an important role information extracted from the legacy sources. in the development and deployment of data warehouses. They The disadvantage of the above is that it leaves the data frag- have become the core users that are first to demonstrate the mented and oriented towards very specific needs. Each initial benefits of data warehouses. These end users are also individual user has obtained only the information that she/he critical to the development of the data warehouse model: as requires. The extracts are unable to address the requirements of they become experts with the data warehousing system, they multiple users and uses. The time and cost involved in train other users. addressing the requirements of only one user are large. Due to 2DATA WAREHOUSING AND DATA MINING Discussions � Write short notes on: � Legacy systems � Data warehouse � Standard Business Applications � What is a Data warehouse? How does it differ from a database? � Discuss various factors, which lead to Data Warehousing. � Briefly discuss the history behind Data warehouse. References 1. Adriaans, Pieter, Data mining, Delhi: Pearson Education Asia, 1996. 2. Anahory, Sam, Data warehousing in the real world: a practical guide for building decision support systems, Delhi: Pearson Education Asia, 1997. 3. Berry, Michael J.A. ; Linoff, Gordon, Mastering data mining : the art and science of customer relationship management, New York : John Wiley & Sons, 2000 4. Corey, Michael, Oracle8 data warehousing, New Delhi: Tata McGraw- Hill Publishing, 1998. 5. Elmasri, Ramez, Fundamentals of database systems, 3rd ed. Delhi: Pearson Education Asia, 2000. 3CHAPTER 1: DATA WAREHOUSING LESSON 2 MEANING AND CHARACTERISTICS OF DATA WAREHOUSING Structure Data warehousing evolved with the integration of a number of different technologies and experiences over the last two decades, � Objective which have led to the identification of key problems. � Introduction Data Warehousing � Data warehousing Because data warehouses have been developed in numerous � Operational vs. Informational Systems organizations to meet partic-ular needs, there is no single, � Characteristics of Data warehousing 1 canonical definition of the term data warehouse. Profes-sional � Subject oriented magazine articles and books in the popular press have elabo- rated on the meaning in a variety of ways. Vendors have � Integrated capitalized on the popularity of the term to help mar-ket a � Time variant variety of related products, and consultants have provided a � Non-volatiles large variety of services, all under the data-warehousing banner. However, data warehouses are quite distinct from traditional Objective databases in their structure, functioning, performance, and The objective of this lesson is to explain you the significance purpose. and difference between Operational systems and Informational systems. This lesson also includes various characteristics of a Operational vs. Informational Systems Data warehouse. Perhaps the most important concepts that has come out of the Data Warehouse movement is the recognition that there are two Introduction fundamentally different types of information systems in all In the previous section, we have discussed about the need of organizations: operational systems and informational systems. data warehousing and the factors that lead to it. In this section I “Operational systems” are just what their name implies, they are will explore the technical concepts relating to data warehousing the systems that help us run the enterprise operate day-to-day. to you. These are the backbone systems of any enterprise, our “order A company can have data items that are unrelated to each other. entry’, “inventory”, “manufacturing”, “payroll” and “account- Data warehousing is the process of collecting together such data ing” systems. Because of their importance to the organization, items within a company and placing it in an integrated data operational systems were almost always the first parts of the store. This integration is over time, geographies, and application enterprise to be computerized. Over the years, these operational platforms. By adding access methods (on-line querying, systems have been extended and rewritten, enhanced and reporting), this converts a ‘dead’ data store into a ‘dynamic’ maintained to the point that they are completely integrated into source of information. In other words, turning a liability into the organization. Indeed, most large organizations around the an asset. Some of the definitions of data warehousing are: world today couldn’t operate without their operational systems “A data warehouse is a single, complete, and consistent store of and that data that these systems maintain. data obtained from a variety of sources and made available to On the other hand, there are other functions that go on within end users in a way they can understand and use in a business the enterprise that have to do with planning, forecasting and context.” (Devlin 1997) managing the organization. These functions are also critical to “Data warehousing is a process, not a product, for assembling the survival of the organization, especially in our current fast and managing data from various sources for the purpose of paced world. Functions like “marketing planning”, “engineering gaining a single, detailed view of part or all of the business.” planning” and “financial analysis” also require information (Gardner 1998) systems to support them. But these functions are different A Data Warehouse is a capability that provides comprehensive from operational ones, and the types of systems and informa- and high integrity data in forms suitable for decision support to tion required are also different. The knowledge-based functions end users and decision makers throughout the organization. A are informational systems. data warehouse is managed data situated after and outside the “Informational systems” have to do with analyzing data and operational systems. A complete definition requires discussion making decisions, often major decisions about how the of many key attributes of a data warehouse system Data enterprise will operate, now and in the future. And not only do Warehousing has been the result of the repeated attempts of informational systems have a different focus from operational various researchers and organizations to provide their organiza- ones, they often have a different scope. Where operational data tions flexible, effective and efficient means of getting at the needs are normally focused upon a single area, informational valuable sets of data. data needs often span a number of different areas and need large amounts of related operational data. 5OUSING AND DA NING In the last few years, Data Warehousing has grown rapidly from subject contain only the information necessary for decision a set of related ideas into architecture for data delivery for support processing. enterprise end user computing. Integrated They support high-performance demands on an organization’s When data resides in money separate applications in the data and information. Several types of applications-OLAP, DSS, operational environment, encoding of data is often inconsis- and data mining applications-are supported. OLAP (on-line tent. For instance in one application, gender might be coded as analytical processing) is a term used to describe the analysis of “m” and “f” in another by o and l. When data are moved from complex data from the data warehouse. In the hands of skilled the operational environment in to the data warehouse, when knowledge workers. OLAP tools use distributed computing data are moved from the operational environment in to the capabilities for analyses that require more storage and processing data warehouse, they assume a consistent coding convention power than can be economically and efficiently located on an e.g. gender data is transformed to “m” and “f”. individual desktop. DSS (Decision-Support Systems) also Time variant known as EIS (Executive Information Systems, not to be The data warehouse contains a place for storing data that are five confused with enterprise integration systems) support an to ten years old, or older, to be used for comparisons, trends, organization’s leading deci-sion makers with higher-level data and forecasting. These data are not up dated. for complex and important decisions. Data mining is used for knowledge discovery, the pro-cess of searching data for Non-volatile Data are not update or changed in any way once they enter the unanticipated new knowledge. data warehouse, but are only loaded and accessed. Traditional databases support On-Line Transaction Processing (OLTP), which includes insertions, updates, and deletions, Data warehouses have the following distinctive characteristics. while also supporting information query requirements. � Multidimensional conceptual view. Traditional relational databases are optimized to process queries � Generic dimensionality. that may touch a small part of the database and transactions � Unlimited dimensions and aggregation levels. that deal with insertions or updates of a few tuples per relation � Unrestricted cross-dimensional operations. to process. Thus, they cannot be optimized for OLAP, DSS, or data mining. By contrast, data warehouses are designed precisely � Dynamic sparse matrix handling. to support efficient extraction, process-ing, and presentation for � Client-server architecture. analytic and decision-making purposes. In comparison to tradi- � Multi-user support. tional databases, data warehouses generally contain very large � Accessibility. amounts of data from multiple sources that may include databases from different data models and sometimes lies � Transparency. acquired from independent systems and platforms. � Intuitive data manipulation. A database is a collection of related data and a database system � Consistent reporting performance. is a database and database software together. A data warehouse � Flexible reporting is also a collection of information as well as supporting system. Because they encompass large volumes of data, data ware- However, a clear distinction exists, Traditional databases are houses are generally an order of magnitude (sometimes two transactional: relational, object-oriented, network, or hierarchical. orders of magnitude) larger than the source databases. The Data warehouses have the distinguishing characteristic that they sheer volume of data (likely to be in terabytes) is an issue that are mainly intended for decision-support applications. They are has been dealt with through enterprise-wide data warehouses, optimized for data retrieval, not routine transaction processing. virtual data warehouses, and data marts: Characteristics of Data Warehousing � Enterprise-wide data warehouses are huge projects requiring As per W. H. Inmon, author of building the data warehouse massive investment of time and resources. and the guru who is ready widely considered to be the origina- � Virtual data warehouses provide views of operational tor of the data warehousing concept, there are generally four databases that are materialized for efficient access. character that describe a data warehouse: � Data marts generally are targeted to a subset of the W. H. Inmon characterized a data warehouse as “a subject- organization, such as a dependent, and are more tightly oriented, integrated, nonvola-tile, time-variant collection of data focused. in support of management’s decisions.” Data ware-houses provide access to data for complex analysis, knowledge discov- To summarize the above, here are some important points to ery, and decision-making. remember about various characteristics of a Data warehouse: � Subject-oriented Subject Oriented Data are organized according to subject instead of application � Organized around major subjects, such as customer, e.g. an insurance company using a data warehouse would product, sales. organize their data by costumer, premium, and claim, instead of by different products (auto. Life etc.). The data organized by 6DATA WAREHOUSING AND DATA MINING � Focusing on the modeling and analysis of data for decision making, not on daily operations or transaction processing. � Provide a simple and concise view around particular subject by excluding data that are not useful in the decision support process. � Integrated � Constructed by integrating multiple, heterogeneous data sources as relational databases, flat files, on-line transaction records. � Providing data cleaning and data integration techniques. � Time variant � The time horizon for the data warehouse is significantly longer than that of operational systems. � Every key structure in the data warehouse contains an element of time (explicitly or implicitly). � Non-volatile � A physically separate store of data transformed from the operational environment. � Does not require transaction processing, recovery, and concurrency control mechanisms. � Requires only two operations in data accessing: initial loading of data and access of data (no data updates). Discussions � Write short notes on: � Metadata � Operational systems � OLAP � DSS � Informational Systems � What is the need of a Data warehouse in any organization? � Discuss various characteristics of a Data warehouse. � Explain the difference between non-volatile and Subject-oriented data warehouse. References 1. Adriaans, Pieter, Data mining, Delhi: Pearson Education Asia, 1996. 2. Anahory, Sam, Data warehousing in the real world: a practical guide for building decision support systems, Delhi: Pearson Education Asia, 1997. 3. Berry, Michael J.A. ; Linoff, Gordon, Mastering data mining : the art and science of customer relationship management, New York : John Wiley & Sons, 2000 4. Corey, Michael, Oracle8 data warehousing, New Delhi: Tata McGraw- Hill Publishing, 1998. 5. Elmasri, Ramez, Fundamentals of database systems, 3rd ed. Delhi: Pearson Education Asia, 2000. 7OUSING AND DA NING 8LESSON 3 ONLINE TRANSACTION PROCESSING Similarities and Differences in OLTP and Data Structure Warehousing � Objective � Introduction � Data warehousing and OLTP systems OLTP Data Warehouse � Similarities and Differences in OLTP and Data Warehousing Run day-to-day Information retrieval Processes in Data Warehousing OLTP Purpose operation and analysis � What is OLAP? Structure RDBMS RDBMS � Who uses OLAP and WHY? � Multi-Dimensional Views Data Model Normalized Multi-dimensional � Benefits of OLAP SQL plus data analysis Access SQL extensions Objective Type of Data that run the Data that analyses the The main objective of this lesson is to introduce you with Data business business Online Transaction Processing. You will learn about the Condition Changing Historical descriptive importance and advantages of an OLTP system. of Data incomplete Introduction Relational databases are used in the areas of operations and The data warehouse server a different purpose from that of control with emphasis on transaction processing. Recently OLTP systems by allowing business analysis queries to be relational databases are used for building data warehouses, answered as opposed to “simple aggregation” such as ‘what is which stores tactical information (1year into the future) that the current account balance for this customer?’ Typical data answers who and what questions. In contrast OLAP uses MD warehouse queries include such things as ‘which product line views of aggregate data to provide access strategic information. sells best in middle America and how dose this correlate to OLAP enables users to gain insight to a wide variety of possible demographic data? views of information and transforms raw data to reflect the enterprise as understood by the user e.g., Analysts, managers Processes in Data Warehousing OLTP and executives. The first step in data warehousing is to “insulate” your current operational information, i.e. to preserve the security and Data Warehousing and OLTP Systems integrity of mission- critical OLTP applications, while giving A data base which in built for on line transaction processing, you access to the broadest possible base of data. The resulting OLTP, is generally regarded as inappropriate for warehousing as database or data warehouse may consume hundred of they have been designed with a different set of need in mind gigabytes-or even terabytes of disk space, what is required than i.e., maximizing transaction capacity and typically having are capable efficient techniques for storing and retrieving massive hundreds of table in order not to look out user etc. Data amounts of information. Increasingly, large organizations have warehouse are interested in query processing as opposed to found that only parallel processing systems offer sufficient transaction processing. bandwidth. OLTP systems cannot be receptacle stored of repositories of The data warehouse thus retrieves data from a varsity of facts and historical data for business analysis. They cannot be heterogeneous operational database. The data is than trans- quickly answer adhoc queries is rapid retrieval is almost impos- formed and delivered to the data warehouse/ store based in a sible. The data is inconsistent and changing, duplicate entries selected modal (or mapping definition). The data transforma- exist, entries can be missing and there is an absence of historical tion and movement processes are completed whenever an data, which is necessary to analyses trends. Basically OLTP offers update to the warehouse data is required so there should some large amounts of raw data, which is not easily understood. The from of automation to manage and carry out these functions. data warehouse offers the potential to retrieve and analysis The information that describes the modal metadata is the information quickly and easily. Data warehouse do have means by which the end user finds and understands the data in similarities with OLTP as shown in the table below. the warehouse and is an important part of the warehouse. The metadata should at the very least contain: � Structure of the data; � Algorithm used for summarization; 9OUSING AND DA NING � Mapping from the operational environment to the data � In contrast OLAP uses Multi-Dimensional (MD) views of warehouse. aggregate data to provide access strategic information. Data cleansing is an important viewpoint of creating an efficient � OLAP enables users to gain insight to a wide variety of data warehouse of creating an efficient data warehouse in that is possible views of information and transforms raw data to the removal of creation aspects Operational data such as low reflect the enterprise as understood by the user e.g. Analysts, level transaction information which sloe down the query times. managers and executives. The cleansing stage has to be as dynamic as possible to accom- � In addition to answering who and what questions OLAPs modate all types of queries even those, which may require can answer “what if “ and “why”. low-level information. Data should be extracted from produc- � Thus OLAP enables strategic decision-making. tion sources at regular interval and pooled centrally but the � OLAP calculations are more complex than simply summing cleansing process has to remove duplication and reconcile data. differences between various styles of data collection. � However, OLAP and Data Warehouses are complementary Once the data has been cleaned it is than transfer to the data � The data warehouse stores and manages data while the warehouse, which typically is a large database on a high perfor- OLAP transforms this data into strategic information. mance box, either SMP Symmetric Multi- Processing or MPP, Massively parallel Processing Number crunching power is Who uses OLAP and WHY? another importance aspect of data warehousing because of the � OLAP applications are used by a variety of the functions of complexity involved in processing adhoc queries and because of an organisation. the vast quantities of data that the organization want to use in � Finance and accounting: the warehouse. A data warehouse can be used in different ways, for example it can be a central store against which the queries are Budgeting run of it can be used like a data mart, data mart which are small Activity-based costing warehouses can be established to provide subsets of the main Financial performance analysis store and summarized information depending on the require- And financial modelling ments of a specific group/ department. The central stores approach generally uses every simple data structures with very � Sales and Marketing little assumptions about the relationships between data where Sales analysis and forecasting as marts often uses multidimensional data base which can Market research analysis speed up query processing as they can have data structures which Promotion analysis reflect the most likely questions. Customer analysis Many vendors have products that provide on the more of the above data warehouse functions. However, it can take a Market and customer segmentation significant amount of work and specialized programming to � Production provide the interoperability needed between products form. Production planning Multiple vendors to enable them to perform the required data Defect analysis warehouse processes a typical implementation usually involves a mixture of procedure forma verity of suppliers. Thus, OLAP must provide managers with the information they need for effective decision-making. The KPI (key performance Another approach to data warehousing is the Parsaye Sandwich indicator) of an OLAP application is to provide just-in-time paradigm put forward by Dr. Kamran Parsaye , CED of (JIT) information for effective decision-making. JIT informa- information discovery, Hermosa beach. This paradigm or tion reflects complex data relationships and is calculated on the philosophy encourages acceptance of the probability that the fly. Such an approach is only practical if the response times are first iteration of data warehousing effort will require consider- always short The data model must be flexible and respond to able revision. The Sandwich paradigm advocates the following changing business requirements as needed for effective decision approach. making. � Pre-mine the data to determine what formats and data are In order to achieve this in widely divergent functional areas needed to support a data- mining application; OLAP applications all require: � Build a prototype mini- data warehouse i.e. the, the meat of MD views of data sandwich most of features envisaged for the and product; Complex calculation capabilities � Revise the strategies as necessary; Time intelligence � Build the final warehouse. Multi-Dimensional Views What is OLAP? � MD views inherently represent actual business models, � Relational databases are used in the areas of operations and which normally have more than three dimensions e.g., Sales control with emphasis on transaction processing. data is looked at by product, geography, channel and time. � Recently relational databases are used for building data warehouses, which stores tactical information ( 1 year into the future) that answers who and what questions. 10DATA WAREHOUSING AND DATA MINING � MD views provide the foundation for analytical processing � Identify various benefits of OLTP. through flexible access to information. � “OLAP enables organisations as a whole to respond more � MD views must be able to analyse data across any dimension quickly to market demands, which often results in increased at any level of aggregation with equal functionality and ease revenue and profitability”. Comment. and insulate users from the complex query syntax � Who are the primary users of Online Transaction Processing � What ever the query is they must have consistent response System? times. � “The KPI (key performance indicator) of an OLAP � Users queries should not be inhibited by the complex to application is to provide just-in-time (JIT) information for form a query or receive an answer to a query. effective decision-making”. Explain. � The benchmark for OLAP performance investigates a server’s References ability to provide views based on queries of varying 1. Anahory, Sam, Data warehousing in the real world: a practical complexity and scope. guide for building decision support systems, Delhi: Pearson Basic aggregation on some dimensions Education Asia, 1997. More complex calculations are performed on other 2. Adriaans, Pieter, Data mining, Delhi: Pearson Education dimensions Asia, 1996. � Ratios and averages 3. Corey, Michael, Oracle8 data warehousing, New Delhi: Tata McGraw- Hill Publishing, 1998. � Variances on sceneries 4. Elmasri, Ramez, Fundamentals of database systems, 3rd ed. � A complex model to compute forecasts Delhi: Pearson Education Asia, 2000. � Consistently quick response times to these queries are imperative to establish a server’s ability to provide MD views of information. Benefits of OLAP � Increase the productivity of manager’s developers and whole organisations. � Users of OLAP systems become more self-sufficient eg managers no longer depend on IT to make schema changes. � It allows managers to model problems that would be impossible with less flexible systems � Users have more control and timely access to relevant strategic information which results in better decision making.(timeliness, accuracy and relevance) � IT developers also benefit form using OLAP specific software as they can deliver applications to users faster. � Thus, reducing the application backlog and ensure a better service. � OLAP further reduces the backlog by making its users self- sufficient to build their own models but yet not relinquishing control over the integrity of the data � OLAP software reduces the query load and network traffic on OLTP systems and data warehouses. � Thus, OLAP enables organisations as a whole to respond more quickly to market demands, which often results in increased revenue and profitability. The goal of every organisation. Discussions � Write short notes on: � Multi-Dimensional Views � Operational Systems � What is the significance of an OLTP System? � Discuss OLTP related processes used in a Data warehouse. � Explain MD views with an example. 11OUSING AND DA NING 12LESSON 4 DATA WAREHOUSING MODELS Structure � Introduction � Objective � The Date warehouse Model � Data Modeling for Data Warehouses � Multidimensional models � Roll-up display • A drill-down display � Multidimensional Schemas � Star Schema � Snowflake Schema Objective The main objective of this lesson is to make you understand a data warehouse model. It also explains various types of multidimensional models and Schemas. Figure 2: The structure of data inside the data warehouse Introduction The current detail data is central in importance as it: Data warehousing is the process of extracting and transforming � Reflects the most recent happenings, which are usually the operational data into informational data and loading it into a most interesting; central data store or warehouse. Once the data is loaded it is accessible via desktop query and analysis tools by the decision � It is voluminous as it is stored at the lowest level of makers. granularity; � it is always (almost) stored on disk storage which is fast to The Data Warehouse Model access but expensive and complex to manage. The data warehouse model is illustrated in the following diagram. Older detail data is stored on some form of mass storage, it is infrequently accessed and stored at a level detail consistent with current detailed data. Lightly summarized data is data distilled from the low level of detail found at the current detailed level and generally is stored on disk storage. When building the data warehouse have to consider what unit of time is summarization done over and also the contents or what attributes the summarized data will contain. Highly summarized data is compact and easily accessible and can even be found outside the warehouse. Metadata is the final component of the data warehouse and is Figure 1: A data warehouse model really of a different dimension in that it is not the same as data The data within the actual warehouse itself has a distinct drawn from the operational environment but is used as: structure with the emphasis on different levels of summariza- � a directory to help the DSS analyst locate the contents of the tion as shown in the figure below. data warehouse, � a guide to the mapping of data as the data is transformed from the operational environment to the data warehouse environment, � a guide to the algorithms used for summarization between the current detailed data and the lightly summarized data and the lightly summarized data and the highly summarized data, etc. 13OUSING AND DA NING The basic structure has been described but Bill Inmon fills in In the figure, there is a three-dimensional data cube that the details to make the example come alive as shown in the organizes product sales data by fiscal quarters and sales regions. following diagram. Each cell could contain data for a specific prod-uct, specific fiscal quarter, and specific region. By including additional dimensions, a data hypercube could be produced, although more than three dimensions cannot be easily visualized at all or presented graphically. The data can be queried directly in any combination of dimensions, by passing complex database queries. Tools exist for viewing data Data Modeling for Data Warehouses Figure 3: An example of levels of summarization of data inside the data warehouse The diagram assumes the year is 1993 hence the current detail data is 1992-93. Generally sales data doesn’t reach the current level of detail for 24 hours as it waits until it is no longer available to the operational system i.e. it takes 24 hours for it to get to the data warehouse. Sales details are summarized weekly by subproduct and region to produce the lightly summarized detail. Weekly sales are then summarized again to produce the highly summarized data. According to the user’s choice of dimensions. Changing from Data Modeling for Data Warehouses one dimensional hierarchy -(orientation) to another is easily Multidimensional models take advantage of inherent accomplished in a data cube by a technique called pivoting (also relationships in data to populate data in multidimensional called rotation). In this technique, the data cube can be thought matrices called data cubes. (These may be called hypercube if of as rotating to show a different orientation of the axes. For they have more than three dimensions.) For data that lend example, you might pivot the data cube to show regional sales themselves to dimensional Formatting, query performance in revenues as rows, the fiscal quarter revenue totals as columns, multidimensional matrices can be much better than in the and company’s products in the third dimension. Hence, this relational data model. Three examples of dimensions in a technique is equivalent to having a regional sales table for each corporate data warehouse would be the corporation’s fiscal product separately, where each table shows quarterly sales for periods, products, and regions. that product region by region. Multidimensional models lend themselves readily to hierarchical A standard spreadsheet is a two-dimensional matrix. One views in what is known as roll-up display and drill-down example would be a spreadsheet of regional sales by product display. for a particular time period. Products could be shown as rows, with sales revenues for each region comprising the columns. � Roll-up display moves up the hierar-chy, grouping into Adding a time dimension, such as an organiza-tion’s fiscal larger units along a dimension (e.g., summing weekly data by quarters, would produce a three-dimensional matrix, which quar-ter, or by year). One of the above figures shows a roll- could be repre-sented using a data cube. up display that moves from individual products to a coarser grain of product categories. � A drill-down display pro-vides the opposite capability, furnishing a finer-grained view, perhaps disaggregating country sales by region and then regional sales by sub region and also breaking up prod-ucts by styles. The multidimensional storage model involves two types of tables: dimension tables and fact tables. A dimension table consists of tuples of attributes of the dimension. A fact table can be thought of as having tuples, one per a recorded fact. This fact contains some measured or observed variable(s) and identifies it (them) with pointers to dimension tables. The fact 14DATA WAREHOUSING AND DATA MINING table contains the data and the dimensions identify each tuple in that data. An example of a fact table that can be viewed from the perspec-tive of multiple dimensional tables. Two common multidimensional schemas are the star schema and the snowflake schema. The star schema consists of a fact table with a single table for each dimension. The snowflake schema is a variation on the star schema in which the dimensional tables from a star schema are organized into a hierarchy by normalizing them. Some installations are normalizing data warehouses up to the third normal form so that they can access the data warehouse to the finest level of detail. A fact con-stellation is a set of fact tables that share some dimension tables. Following figure shows a fact constellation with two fact tables, business results arid business forecast. These share the dimension table called product. Fact constella- tions limit the possible queries for the ware-house. Data warehouse storage also utilizes indexing techniques to support high perfor-mance access. A technique called bitmap indexing constructs a bit vector for each value in a domain (column) being indexed. 15OUSING AND DA NING a. Give three dimension data elements and two fact data It works very well for domains of low-cardinality. There is a 1 elements that could be in the database for this data bit placed in the jth position in the vector if the jth row warehouse. Draw a data cube, for this database. contains the value being indexed. For example, imagine an inventory of 100,000 cars with a bitmap index on car size. If b. State two ways in which each of the two fact data elements there are four-car sizeseconomy, compact, midsize, and full could be of low quality in some respect. size-there will be four bit vectors, each containing 100,000 bits 2. You have decided to prepare a budget for the next 12 (12.5 K) for a total index size of 50K. Bitmap indexing can months based on your actual expenses for the past 12. You provide consider-able input/output and storage space advan- need to get your expense information into what is in effect a tages in low-cardinality domains. With bit vec-tors a bitmap data warehouse, which you plan to put into a spreadsheet for index can provide dramatic improvements in comparison, easy sorting and analysis. aggregation, and join performance. In a star schema, dimen- a. What are your information sources for this data warehouse? sional data can be indexed to tuples in the fact table by join indexing. Join indexes are traditional indexes to maintain b. Describe how you would carry out each of the five steps of relationships between primary key and foreign key values. They data preparation for a data warehouse database, from relate the values of a dimension of a star schema to rows in the extraction through summarization. If a particular step does fact table. For example, consider a sales fact table that has city not apply, say so and justify your statement. and fiscal quarter as dimensions. If there is a join index on city, References for each city the join index maintains the tuple IDs of tuples 1. Adriaans, Pieter, Data mining, Delhi: Pearson Education containing that city. Join indexes may involve multiple dimen- Asia, 1996. sions. 2. Anahory, Sam, Data warehousing in the real world: a practical Data warehouse storage can facilitate access to summary data by guide for building decision support systems, Delhi: Pearson taking further advantage of the nonvolatility of data ware- Education Asia, 1997. houses and a degree of predictability of the analyses that will be performed using them. Two approaches have been used.(1) 3. Berry, Michael J.A. ; Linoff, Gordon, Mastering data mining smaller tables including summary data such as quarterly sales or : the art and science of customer relationship management, New revenue by product line, and (2) encoding of level (e.g., weekly, York : John Wiley & Sons, 2000 quarterly, annual) into existing tables. By comparison, the 4. Corey, Michael, Oracle8 data warehousing, New Delhi: Tata overhead of creating and maintaining such aggregations would McGraw- Hill Publishing, 1998. likely be excessive in a volatile, transaction-oriented database. 5. Elmasri, Ramez, Fundamentals of database systems, 3rd ed. Discussions Delhi: Pearson Education Asia, 2000. � What are the various kinds of models used in Data warehousing? � Discuss the following: � Roll-up display � Drill down operation � Star schema � Snowflake schema � Why is the star schema called by that name? � State an advantage of the multidimensional database structure over the relational database structure for data warehousing applications. � What is one reason you might choose a relational structure over a multidimensional structure for a data warehouse database? . � Clearly contrast the difference between a fact table and a dimension table. Exercises 1. Your college or university is designing a data warehouse to enable deans, department chairs, and the registrar’s office to optimize course offerings, in terms of which courses are offered, in how many sections, and at what times. The data warehouse planners hope they will be able to do this better after examining historical demand for courses and extrapolating any trends that emerge. 16DATA WAREHOUSING AND DATA MINING 17OUSING AND DA NING Notes 18LESSON 5 ARCHITECTURE AND PRINCIPLES OF DATA WAREHOUSING them a better understanding of their data and a better under- Structure standing of their business in relation to their competitors, and � Objective it lets them provide better customer service. � Introduction So, what exactly is a data warehouse? Should your company � Structure of a Data warehouse have one, and if so, what should it look like? � Data Warehouse Physical Architectures Structure of a Data Warehouse � Generic Two-Level Essentially, a data warehouse provides historical data for � Expanded Three-Level decision-support applications. Such applications include reporting, online analytical processing (OLAP), executive � Enterprise data warehouse (EDW) information systems (EIS), and data mining. � Data marts According to W. H. Inmon, the man who originally came up � Principles of a Data warehousing with the term, a data warehouse is a centralized, integrated Objective repository of information. Here, integrated means cleaned up, The objective of this lesson is to let you know the basic merged, and redesigned. This may be more or less complicated structure of a Data warehouse. You will also learn about Data depending on how many systems feed into a warehouse and warehouse physical architecture and various principles of a Data how widely they differ in handling similar information. warehousing. But most companies already have repositories of information Introduction in their production systems and many of them are centralized. Let me start the lesson with an example, which illustrates the Aren’t these data warehouses? Not really. importance and need of a data warehouse. Until several years Data warehouses differ from production databases, or online ago Coca Cola had no idea how many bottles of Coke it transaction-processing (OLTP) systems, in their purpose and produced each day because production data were stored on 24 design. An OLTP system is designed and optimized for data different computer systems. Then, it began a technique called entry and updates, whereas a data warehouse is optimized for Data warehousing. One airline spent and wasted over 100 data retrieval and reporting, and it is usually a read-only system. million each year on inefficient mass media advertising cam- An OLTP system contains data needed for running the day-to- paigns to reach frequent flyers…then it began data warehousing. day operations of a business but a data warehouse contains Several years ago, the rail industry needed 8 working days to data used for analyzing the business. The data in an OLPT deliver a freight quote to a customer. The trucking industry, by system is current and highly volatile, which data elements that contrast, could deliver a freight quote to a customer on the may be incomplete or unknown at the time of entry. A phone instantly, because unlike the rail industry, truckers were warehouse contains historical, nonvolatile data that has been using…data warehousing. adjusted for transactions errors. Finally, since their purposes are A data warehouse is a data base that collects current informa- so different, OLPT systems and data warehouses use different tion, transforms it to ways it can be used by the warehouse data-modeling strategies. Redundancy is almost nonexistent in owner, transforms that information for clients, and offers OLTP systems, since redundant data complicates updates. So portals of access to members of your firm to help them make OLPT systems are highly normalized and are usually based on a decisions and future plans. relational model. But redundancy is desirable in a data ware- house, since it simplifies user access and enhances performance Data warehousing is the technology trend most often associated by minimizing the number of tables that have to be joined. with enterprise computing today. The term conjures up images Some data warehouses don’t use a relational model at all, of vast data banks fed from systems all over the globe, with preferring a multidimensional design instead. legions of corporate analysts mining them for golden nuggets of information that will make their companies more profitable. To discuss data warehouses and distinguish them from transactional databases calls for an appropriate data model. The All of the developments in database technology over the past multidimensional data model is a good fit for OLAP and 20 years have culminated in the data warehouse. Entity- decision-support technologies. In contrast to multi-databases, relationship modeling, heuristic searches, mass data storage, which provide access to disjoint and usually heterogeneous neural networks, multiprocessing, and natural-language databases, a data warehouse is frequently a store of integrated interfaces have all found their niches in the data warehouse. But data from multiple sources, processed for storage in a multidi- aside from being a database engineer’s dream, what practical mensional model. Unlike most transactional databases, data benefits does a data warehouse offer the enterprise? warehouses typically support time-series and trend analysis, When asked, corporate executives often say that having a data both of which requires more historical data than are generally warehouse gives them a competitive advantage, because it gives 19OUSING AND DA NING maintained in transactional databases. Compared with transac- Data Warehousing, OnLine Analytical Processing (OLAP) and tional databases, data warehouses are nonvolatile. That means Decision Support Systems - apart from being buzz words of that information in the data warehouse changes far less often today IT arena - are the expected result of IT systems and and may be regarded as non-real-time with periodic updating. current needs. For decades, Information Management Systems In transactional systems, transactions are the unit and are the have focused exclusively in the gathering and recording into agent of change a database; by contrast, data warehouse Database Management Systems data that corresponded to information is much more coarse grained and is refreshed everyday simple transactions, from which the name OnLine according to a careful choice of refresh policy, usually incremen- Transaction Processing (OLTP) comes from. tal. Warehouse updates are handled by the warehouse’s Managers and analysts now need to go steps further from the acquisition component that provides all required preprocessing. simple data storing phase and exploit IT systems by posing We can also describe data warehousing more generally as “a complex queries and requesting analysis results and decisions collection of decision support technologies, aimed at enabling that are based on the stored data. Here is where OLAP and Data the knowledge worker (executive, manager, ana-lyst) to make, Warehousing is introduced, bringing into business the necessary better and faster decisions.” The following Figure gives an system architecture, principles, methodological approach and - overview of the conceptual structure of a data warehouse. It finally - tools to assist in the presentation of functional shows the entire data warehousing process. This process Decision Support Systems. includes possible cleaning and reformatting of data before it’s I.M.F. has been working closely with the academic community - warehousing. At the end of the process, OLAP, data mining, which only recently followed up the progress of the commercial and DSS may generate new relevant information such as rules; arena that was boosting and pioneering in the area for the past this information is shown in the figure going back into the decade - and adopted the architecture and methodology warehouse figure also shows that data sources may include files. presented in the following picture. This is the result of the ESPRIT funded Basic Research project, “Foundations of Data Warehouse Quality - DWQ”. 20

Advise: Why You Wasting Money in Costly SEO Tools, Use World's Best Free SEO Tool Ubersuggest.