Automated Building of Domain Ontologies from Lecture Notes in Courseware

Automated Construction Chinese Domain Ontology from Wikipedia An Automated Domain-Specific Answer Ontology Construction
EdenKelly Profile Pic
EdenKelly,United States,Professional
Published Date:12-07-2017
Your Website URL(Optional)
Comment
Automated Construction Of Domain Ontologies From Lecture Notes M.Tech Project Dissertation Submitted in partial fulfillment of the requirements for the degree of Master of Technology by Neelamadhav Gantayat Roll No : 09305045 under the guidance of Prof. Sridhar Iyer Department of Computer Science and Engineering Indian Institute of Technology, Bombay June, 2011Declaration I declare that this written submission represents my ideas in my own words and where others ideas or words have been included, I have adequately cited and referenced the original sources. I also declare that I have adhered to all principles of academic honesty and integrity and have not misrepresented or fabricated or falsified any idea/data/fact/source in my submission. I un- derstand that any violation of the above will be cause for disciplinary action by the Institute and can also evoke penal action from the sources which have thus not been properly cited or from whom proper permission has not been taken when needed. Neelamadhav Gantayat (09305045) th Date: 28 June, 2011 iiiAcknowledgement I would like to express my deep gratitude for my guide Prof. Sridhar Iyer, who has always been making things simple to understand. Without his deep insight into this domain and his valuable time for this project, it would not have been possible for me to move ahead properly. He has been remarkable in his attempt to keep me motivated in this project and has always tried to improve me with proper feedback. I would like to thank my friend Sagar Kale for his help in checking some sections for grammatical errors. I would like to thankRamkumarRajendran andSoumanMandal, for their constant feed- back and motivation. I would like to thank each and every one who helped me throughout my work. Neelamadhav Gantayat (09305045) vAbstract Nowadays e-learning has become popular, especially with the availability of large course-ware repositories such as MIT’s OCW, NPTEL and CDEEP. A variety of searching techniques, e- learning tools and systems are also available. Courseware repositories contain large amounts of lecture videos and text. When searching for lecture material on a given topic, it would be useful if the repository also indicates the topics that are pre-requisites. However, suppose a user wants to learn about a particular topic of a subject, the search tools typically return a large number of links to the user in response to his/her query (topic). Many of these are not directly related to the topic. Some of them are more advanced topics and some other links contains some irrelevant data which is nothing to do with the desired topic, so the user does not know which links to follow in order to enhance his knowledge. In this paper we present a technique that automatically constructs the ontology (dependency graph) from the given lecture notes. We show how this ontology can be used to identify the pre-requisites and follow-up modules for a given query (lecture topic). We also provide the user with a dependency graph which gives a conceptual view of the domain. Our system extracts the concepts using “term frequency inverse document frequency (tf-idf) weighting scheme” and then determines the associations among concepts using “apriori algorithm”. We have evalu- ated our system by comparing its results with the dependencies determined by an expert in the subject area. viiContents 1 Introduction 1 1.1 Abbreviations and acronyms . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Motivation for MTP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.3 Goal of MTP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.4 Solution Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.5 Organization of the report . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 2 Background 5 2.1 Ontology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2.1.1 Domain Ontology . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2.1.2 Applications of Ontology . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.2 Dependency graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.3 Repositories surveyed . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.4 Searching Tools surveyed . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 3 Literature Survey 11 3.1 Mining based Automatic Ontology ConstructionIvan07 . . . . . . . . . . . . 11 3.1.1 TERMINAETerm99: . . . . . . . . . . . . . . . . . . . . . . . . . . 11 3.1.2 Ontology Development using SALTSALT02 . . . . . . . . . . . . . 13 3.1.3 Learning OWL ontology from free text LIU04 . . . . . . . . . . . . . 14 3.1.4 Ontology Construction for Information Selection Khan02 . . . . . . . 15 3.1.5 Comparison of Ontology construction methods . . . . . . . . . . . . . 15 3.2 Various Methods of Developing Ontology . . . . . . . . . . . . . . . . . . . . 16 3.2.1 Skeletal methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 3.2.2 Practical Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 3.2.3 Knowledge Engineering Approach . . . . . . . . . . . . . . . . . . . . 19 3.2.4 Seven-Step Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 3.3 Ontology languages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 3.3.1 History of Ontology Languages Fern03 . . . . . . . . . . . . . . . . 24 3.3.2 XML (Extended Markup Language) . . . . . . . . . . . . . . . . . . . 25 3.3.3 RDF (Resource Description Framework) . . . . . . . . . . . . . . . . 29 ix3.3.4 OIL (Ontology Interchange Language) . . . . . . . . . . . . . . . . . 30 3.3.5 OWL (Web Ontology Language) . . . . . . . . . . . . . . . . . . . . . 31 3.4 Ontology Editors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 3.4.1 Ontolingua . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 3.4.2 Proteg ´ e ´ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 3.4.3 WebODE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 3.4.4 OntoStudio . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 4 System Overview 39 4.1 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 4.2 Proposed Solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 4.3 Solution Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 5 Implementation Details 43 5.1 system-1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 5.1.1 Parsing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 5.1.2 Indexing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 5.1.3 Keyword Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 5.1.4 Ontology Construction . . . . . . . . . . . . . . . . . . . . . . . . . . 45 5.1.5 Generating the Dependency Graph & ontology . . . . . . . . . . . . . 46 5.2 System - 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 5.2.1 Stemming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 5.2.2 Name Entity Recognizer . . . . . . . . . . . . . . . . . . . . . . . . . 50 5.2.3 Ontology Construction . . . . . . . . . . . . . . . . . . . . . . . . . . 52 6 Evaluation 53 6.1 Precision and Recall . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 6.2 Performance Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 6.3 Results of System -1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 6.4 Results of System - 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 6.5 System-1 Vs. System-2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 6.6 Observations and Interpretations . . . . . . . . . . . . . . . . . . . . . . . . . 64 7 Conclusion & Future Work 67 Appendices 67 A Stop Words 69 B Other Results 71List of Figures 2.1 Dependency Graph for “Operating System” . . . . . . . . . . . . . . . . . . . 7 2.2 MIT’s OCW search for “Operating system Threads” . . . . . . . . . . . . . . . 8 3.1 Ontology generation process . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 3.2 Skeletal Ontology Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 3.3 Practical Ontology Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 3.4 Knowledge Engineering Approach, Taken from YUN09 . . . . . . . . . . . . . . . . . 19 3.5 Seven-Step Ontology Approach . . . . . . . . . . . . . . . . . . . . . . . . . 20 3.6 Defining classes of “Operating System” . . . . . . . . . . . . . . . . . . . . . 22 3.7 Types of “Threads” . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 3.8 Stack of Ontology Markup Languages taken from Fern03 . . . . . . . . . . . . . . . . 24 3.9 graphical representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 3.10 Predicate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 4.1 Desired solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 4.2 Dependency Graph for operating system . . . . . . . . . . . . . . . . . . . . . 40 4.3 Ontology Development from Text, taken fromGrub95 . . . . . . . . . . . . . . . . . . 40 4.4 System overview of system-1 . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 4.5 System overview of system-2 . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 5.1 System Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 5.2 System Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 6.1 Confusion matrix for the example . . . . . . . . . . . . . . . . . . . . . . . . 55 6.2 Classification Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 6.3 Computer Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 6.4 Operating Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 6.5 DAG for Computer Networks by System-1 . . . . . . . . . . . . . . . . . . . . 59 6.6 Computer Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 6.7 Operating Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 6.8 DAG for Computer Networks by System-2 . . . . . . . . . . . . . . . . . . . . 63 B.1 Computer Networks ontology developed by our system using proteg ´ e ´ . . . . . 71 xiB.2 DAG for Operating System . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 B.3 DAG for Software Engg . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 B.4 DAG for Cryptography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 B.5 DAG for Numerical Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 B.6 DAG for Embedded System . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 B.7 DAG for System Analysis and Design . . . . . . . . . . . . . . . . . . . . . . 77List of Tables 2.1 Comparison of different course-ware repositories . . . . . . . . . . . . . . . . 9 3.1 Comparison of Ontology construction methods, taken fromIvan07 . . . . . . 15 3.2 Book Store . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 3.3 Message . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 3.4 Student Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 3.5 OWL Example. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 3.6 Comparison of Ontology development tools, taken from ? . . . . . . . . . . . . . . 37 6.1 Confusion Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 6.2 Results for Computer Networks . . . . . . . . . . . . . . . . . . . . . . . . . 56 6.3 Results for Operating Systems . . . . . . . . . . . . . . . . . . . . . . . . . . 57 6.4 Results for Software Engineering . . . . . . . . . . . . . . . . . . . . . . . . . 57 6.5 Results for Cryptography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 6.6 Results for Embedded Systems . . . . . . . . . . . . . . . . . . . . . . . . . . 58 6.7 Results for Numerical . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 6.8 Results for System Analysis And Design . . . . . . . . . . . . . . . . . . . . . 58 6.9 Results for Computer Networks . . . . . . . . . . . . . . . . . . . . . . . . . 60 6.10 Results for Operating Systems . . . . . . . . . . . . . . . . . . . . . . . . . . 61 6.11 Results for Software Engineering . . . . . . . . . . . . . . . . . . . . . . . . . 61 6.12 Results for Cryptography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 6.13 Results for Embedded Systems . . . . . . . . . . . . . . . . . . . . . . . . . . 62 6.14 Results for Numerical Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 62 6.15 Results for System Analysis And Design . . . . . . . . . . . . . . . . . . . . . 62 6.16 Confusion Matrix for System - 1 . . . . . . . . . . . . . . . . . . . . . . . . . 64 6.17 Confusion Matrix for System - 2 . . . . . . . . . . . . . . . . . . . . . . . . . 64 xiiiChapter 1 Introduction 1 2 Courseware repositories, such as OCW and NPTEL , contain large amounts of data in the form of videos and text. A fine-grain (topic-level) search facility and automatic identification of pre- requisites and follow-ups for a given topic is desirable and would be useful to students. Such a feature (identification of pre-requisites of a given topic) is not available in these repositories. This feature could be built by manual tagging of the contents, but it is cumbersome to do so. In this paper we present a technique that automatically constructs the ontology (dependency graph) from given lecture notes. We show how this ontology can be used to identify the pre- requisites and follow-up modules for a given query (lecture topic). In domain ontology, relation- ships between different concepts of a domain are identified. In our case, a concept corresponds to a lecture module and a relationship corresponds to whether it is a prerequisite or a follow-up of the topic. We also provide the user with a dependency graph which corresponds to a con- cept map and gives a conceptual view of the domain. People can often grasp ideas much more quickly by looking into the graphical representation than by reading them in a bookCmap08. To the best of our knowledge, there is no such system to automatically determine dependencies of topics from a repository of lecture notes. Our system extracts the concepts using “term frequency inverse document frequency (tf- idf) weighting scheme” and then determines the associations among concepts using “apriori algorithm”. We have evaluated our system by comparing its results with the dependencies determined by an expert in the subject area. 1.1 Abbreviations and acronyms  Ontology: Large number of ideas and concepts to gather in a hierarchical order.  Query, Learning-module or concept: Any topic related to a particular subject.  Most Relevant: The PDF file which contains the topic that we are searching. 1 http://ocw.mit.edu 2 http://nptel.iitm.ac.in Prerequisite: Prior information which is needed before proceeding with the topic.  Follow-up: Information that can be read after finishing the topic.  Stemming: Finding the root (base) form of a word.  Name entity identification: Finding the proper nouns, naming specific things.  OWL: Web Ontology Language.  ngram: Groups of n written letters, n syllables, or n words. 1.2 Motivation for MTP The course-ware repositories like NPTEL, CDEEP and MIT’s OCW provides lecture notes in the form of PDF’s for a wide range of courses. Some repositories provide searching only for courses, but not for topics. If we search for any topic though it is present in a course, searching provided by these repositories cannot give the result. Most of the current search engines and search techniques available in courseware repositories use only keyword based searching, which will produce some PDF’s which contains the keyword but not related to the topic. More over current search techniques do not provide us with the prerequisites and follow-ups for a given topic. Suppose user want to learn about a particular topic, the search tool returns a large number of links to the user in response to his/her query. Instead the search tools should provide the PDF file which contains learning module for his query, and the prerequisites and follow up modules that can be learned. To achieve this objective we use Domain Ontology to create a dependency graph which will have the relation between concepts in a particular domain. As a user we tried to search topics like Threads, TCP/IP, Ethernet etc., in some reposito- ries. Although these topics were covered in those repositories, it showed that the topic was not available there. And in some other repositories, most of the times the search results consisted more advanced topics before the topic we searched. A detailed survey of repositories is given in the next section. Hence, an effective search facility has to be provided so that the user can get desired topics along with some related topics. Related topics can either be pre-requisites or follow-ups. Detailed repository survey is described in the next chapter. 1.3 Goal of MTP Given a set of lecture files (PDF or Text) for a particular subject from a course-ware repository, or a Text book (soft copy), Our aim is to come up with a system which will provide the user with correct reference (link for the PDF file in case of lecture files or chapter in case of the book.) for the desired topic of the given subject. We also show a dependency graph so that the user can refer to the previous and advanced topics as required. Dependency graph will provide conceptual view of the subject to the user. We do not assume any ordering of the files or the concepts. 2The system will also provide the pre-requisites and follow-ups for the topic. User can re- view the pre-requisite before starting the module or can refer the pre-requisites in case of any difficulties in understanding the topic. At the same time user can also refer to the follow-up to enhance his/her knowledge about the topic. We have divided our system into three modules:  Providing user with the link to the PDF file which consists the learning module for the query (Keyword).  Creating a dependency graph for the entire course so that the user can have a conceptual view of the course.  Suggestion of previous and advanced topics as required. 1.4 Solution Approach Our technique is to extract the topics (keywords) from the given PDF files using “term frequency inverse document frequency (tf-idf) weighting scheme”. Then we determine the associations among different concepts (topics) using “apriori algorithm”. Then we arrange the relations in a hierarchical order. For any user query, our system provides the link for the topic, and two topics above it as pre-requisites and two topics below it as follow-ups, from the hierarchy of the ontology. Given the contents of a course from a repository (NPTEL in our case), we do the following:  We indexed the given text using Lucene, the index is used for searching, and also for finding the dependencies between different concepts.  We used Tf-idf weighting to find out the important concepts and apriori algorithm to find out the relation between the different concepts.  We implemented and tested our system on several courses taken from NPTEL. For effec- tive evaluation we tested the results against the dependencies determined by an expert in the subject area. 1.5 Organization of the report In Chapter 2, we explained the background required by a reader in order to proceed with the report and also different course-ware repositories like CDEEP, NPTEL, and MIT’s OCW, and their searching strategies. Chapter 3 contains different methodologies, techniques, tools and languages for developing ontologies. Chapter 4 describes different approaches to our system. The implementation of our systems is described in Chapter 5. Our experiments to evaluate the performance of the system are shown in Chapter 6 and conclusions in Chapter 7. 34Chapter 2 Background This chapter describes courseware repositories, gives a brief overview of domain ontology and defines dependency graph. 2.1 Ontology 1 Ontology Borrowed from philosophy - the study of “The nature of being” . Ontology in in- formation system is a large number of ideas and concepts to gather in a hierarchical order. It provides a mechanism to capture information about the objects, Classes and the relationships that hold between them in some domain. The aim of ontology is to develop knowledge repre- sentations that can be shared and reused. GuberGrub95 defined an ontology as “A formal explicit specification of a shared conceptualization.” In ontology classes describe concepts in the domain. A class can have subclasses that rep- resent concepts that are more specific than the superclass. Slots describe properties of classes and instances. 2.1.1 Domain Ontology Domain Ontology is an ontology Model which provides definitions and relationships of the concepts, major theories, principles and activities in the domain. Domain ontologies provide shared and common understanding of a specific domain. Domain ontology provides particular meaning of term as they apply to that domain. For example the word thread has many different meaning. An ontology about the domain of operating system would model “process threads”, while an ontology about the domain of ”textiles” would model thread with different meaning. 1 Taken from: http://en.wikipedia.org/wiki/Ontology 52.1.2 Applications of Ontology Main application areas of ontology are knowledge management, Web commerce, electronicOIL business and e-learning. Knowledge Management is concerned with acquiring, maintaining, and accessing an or- ganization’s data. Nowadays organizations are distributed around the world. Ontology will help in these organizations in searching, extracting and maintaining the large number of on-line documents. Ontology will give efficient searching techniques other than keyword matching. Web commerce is extending the exiting business models with reduced costs. Some exam- ples where web commerce can be used is online market places and auction houses. Ontology will help the customers in finding the shops that sells the desired product with quality, quantity and reduced cost. Ontology can describe the various products and help navigate and search automatically for the required information. Electronic Business is nothing but automation of business transactions. Ontology included eBusiness will help in automation of data exchange. E-learning To find out the dependencies between the keywords of a topic in the repositories, and to facilitate the user with more recent and relevant data. The key difficulties in developing ontology are: (i) extensive knowledge about a subject is required and (ii) it is time-consuming. We have automated this process, in the context of lecture notes. We use domain ontology to represent relations between topics for a given course. Here we consider only one relation, which is “follows”. Topic-2 follows Topic-1 means that Topic-1 is a pre-requisite for Topic-2 and Topic-2 is a follow-up of Topic-1. In our system we first develop the domain ontology from the given set of notes. Then we refer the node which represents the user’s desired topic and also provide two of its ancestor nodes as pre-requisites and two descendants as follow-ups. The domain ontology developed by our system is also presented to the user by a graphical representation called dependency graph. 2.2 Dependency graph A dependency graph is a directed graph which represents dependencies of several objects to- wards each other. “Given a set of objects S and a transitive relation R = SS with (a;b)2 R modeling a dependency ‘a needs b evaluated first’, the dependency graph is a graph G = (S;T) with TR and R being the transitive closure of T.”Wikidep Dependency graphs are represented in hierarchical order, i.e., most general concepts are at the top of the graph and the more specific and less general concepts in lower orders. Using dependency graphs we can represent the dependencies between different concepts as shown in Figure 2.1; concepts are shown by ellipses and dependencies by arrows. 6