Data exploration and analysis

Big data exploration through visual analytics and data visualisation
Data Exploration
Heli Helskyaho, 21.4.2016, Seminar on Big Data Management

References
• [1] Marcello Buoncristiano, Giansalvatore Mecca, Elisa Quintarelli, Manuel Roveri, Donatello Santoro, Letizia Tanca: Database Challenges for Exploratory Computing. SIGMOD Record 44(2): 17-22 (2015)
• [2] Stratos Idreos, Olga Papaemmanouil, Surajit Chaudhuri: Overview of Data Exploration Techniques. SIGMOD Conference 2015: 277-281

Introduction to Data Exploration
• Data exploration means efficiently extracting knowledge from data, even though you do not know what you are looking for.
• Exploratory computing: a conversation between a user and a computer.
– The system provides active support: a “step-by-step ‘conversation’ of a user and a system that ‘help each other’ to refine the data exploration process, ultimately gathering new knowledge that concretely fulfils the user needs”
– The exploration process involves investigation, exploration-seeking, comparison-making, and learning
– A wide range of different techniques is needed (statistics and data analysis, query suggestion, advanced visualization tools, etc.)

Not a new thing…
• J. W. Tukey. Exploratory data analysis. Addison-Wesley, Reading, MA, 1977: “with exploratory data analysis the researcher explores the data in many possible ways, including the use of graphical tools like boxplots or histograms, gaining knowledge from the way data are displayed”

Problems and new problems
• We do not know what we are looking for
• Non-technical users (journalists, investors, politicians, …) must also be supported
• More and more data
• Different data models and formats
• Data may still be loading while exploration is going on
• All of this must be done efficiently and fast

The Conversation
• To start the conversation, the system suggests some initial relevant features.
• A feature is the set of values taken by one or more attributes in the tuple-sets of the view lattice, while relevance is defined starting from the statistical properties of these values.

The AcmeBand example
• AcmeBand, a fitness tracker – a wrist-worn smartband that continuously records user steps and can be used to track sleep at night

Example conversation
• (S1) “It might be interesting to explore the types of activities. In fact: running is the most frequent activity (over 50%), cycling the least frequent one (less than 20%)”
• (S2) “It might be interesting to explore the sex of users with running activities. In fact: more than 65% of the runners are male”
• (S3) “It might be interesting to explore differences in the distribution of the length of the running activities between male and female. In fact: male users generally have longer running activities”

The user likes S1
• (S1) “It might be interesting to explore the types of activities. In fact: running is the most frequent activity (over 50%), cycling the least frequent one (less than 20%)”
• σ_{Type=running}(Activity)
• The user likes it and picks Running

The Computer suggests S1.1
• (S1.1) “It might be interesting to explore the region of users with running activities. In fact: while users are evenly distributed across regions, 65% of users with running activities are in the west and only 15% in the south”
• σ_{Type=running}(Location ⋈ AcmeUser ⋈ Activity)
• The user likes it and picks South
• σ_{Type=running, region=south}(Location ⋈ AcmeUser ⋈ Activity) (see the code sketch below)
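The refinement steps of this conversation can be mimicked with ordinary data-frame operations. The following is only a minimal sketch, assuming three hypothetical CSV extracts whose file and column names (type, region, user_id, location_id) merely mirror the example schema; it is not the system described in [1].

```python
import pandas as pd

# Hypothetical extracts of the AcmeBand tables (file and column names are assumptions).
activity = pd.read_csv("activity.csv")     # user_id, type, length, ...
acme_user = pd.read_csv("acme_user.csv")   # user_id, sex, location_id, ...
location = pd.read_csv("location.csv")     # location_id, region, ...

# σ_{Type=running}(Activity)
running = activity[activity["type"] == "running"]

# σ_{Type=running}(Location ⋈ AcmeUser ⋈ Activity)
joined = (location.merge(acme_user, on="location_id")
                  .merge(activity, on="user_id"))
running_users = joined[joined["type"] == "running"]

# The relevance hint behind S1.1: how running users are spread across regions.
print(running_users["region"].value_counts(normalize=True))

# σ_{Type=running, region=south}(Location ⋈ AcmeUser ⋈ Activity)
running_south = running_users[running_users["region"] == "south"]
```

Each step of the conversation simply adds one more filter to the same joined view, which is what keeps the drill-down cheap to express.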
The Computer suggests S1.1.1
• (S1.1.1) “It might be interesting to explore the sex of users with running activities in the south region. In fact: while sex values are usually evenly distributed among users, only 10% of users with running activities in the south are women”
• The user likes it
• σ_{Type=running, region=south, sex=female}(Location ⋈ AcmeUser ⋈ Activity)

How to proceed?
• Now the user is exploring the set of tuples corresponding to female runners in the south
• He/she may
– ask the system for further advice
– browse the actual database records
– use an externally computed distribution of values to look for relevant features

Adding an external source
• Perhaps the user is interested in studying the quality of sleep and downloads from a Web site an Excel sheet stating that over 60% of users complain about the quality of their sleep
• That data is imported into the system
• The system suggests that, unexpectedly, 85% of the sleep periods of southern women who run are of good quality
• The user has learned something new from the data

Challenges in short
• Usability
• Performance

The Process
• The system populates the lattice with a set of initial views based on tables and foreign keys
• Each view node is a concept of the conversation
– The table Activity is the concept “activity”
– The join of AcmeUser and Location is the concept “user location”
• The system builds histograms for the attributes of these views and looks for those that might be of interest to the user
– A feature may be relevant if it is different from, or perhaps close to, the user’s expectations
– The users’ expectations may derive from their previous background, common knowledge, previous exploration of other portions of the database, etc.
• Different tests are used to define the interest (test(d, d’))
• Depending on the position in the lattice, the test is different
• The notion of relevance is based on the frequency distribution of attributes in the view lattice

These tests could be, for instance (see the sketch after this list):
• the two-sided t-test (two-tailed t-test), assessing variations in the mean value of two Gaussian-distributed subsets
• the two-sided Wilcoxon rank sum test, assessing variations in the median value of two distribution-free subsets
• the one/two-sample Chi-square test, for categorical data comparison, assessing the distribution of a subset with discrete values w.r.t. a reference distribution, or assessing variations in proportions between two subsets with discrete values
• the one/two-sample Kolmogorov-Smirnov test, to assess whether a subset comes from a reference continuous probability density function, or whether two subsets have been generated by two different continuous probability density functions
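All four tests have standard counterparts in scipy.stats. The snippet below is a minimal sketch on synthetic data; the sample arrays, category counts and distribution parameters are illustrative assumptions, not values from the paper.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(30, 5, 200)   # e.g. running lengths of one subset (synthetic)
b = rng.normal(33, 5, 180)   # e.g. running lengths of another subset (synthetic)

# Two-sided t-test: variation in the means of two (roughly Gaussian) subsets.
t_stat, t_p = stats.ttest_ind(a, b)

# Two-sided Wilcoxon rank-sum test: variation in location, distribution-free.
w_stat, w_p = stats.ranksums(a, b)

# One-sample Chi-square test: observed discrete counts vs. a reference distribution.
observed = np.array([55, 25, 20])          # e.g. running / walking / cycling counts
expected = np.array([40, 35, 25])
chi_stat, chi_p = stats.chisquare(observed, f_exp=expected)

# Two-sample Chi-square test on a contingency table: variation in proportions.
table = np.array([[65, 35],                # e.g. male/female counts in two subsets
                  [45, 55]])
chi2_stat, chi2_p, dof, _ = stats.chi2_contingency(table)

# One-sample Kolmogorov-Smirnov test: subset vs. a reference continuous distribution.
ks1_stat, ks1_p = stats.kstest(a, "norm", args=(30, 5))

# Two-sample Kolmogorov-Smirnov test: were the two subsets drawn from the same distribution?
ks2_stat, ks2_p = stats.ks_2samp(a, b)
```

In each case the returned p-value is the quantity an interest test such as test(d, d’) could compare against a significance threshold.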
Faceted search
• One of the techniques we just used is called faceted search (faceted navigation, faceted browsing)
• Wikipedia: “a technique for accessing information organized according to a faceted classification system, allowing users to explore a collection of information by applying multiple filters”
• In our example, Activity is a facet and Running is a constraint, or facet value
• The user can browse the set by making use of the facets and their values, and inspect the set of items that satisfy the selection criteria

We need Fast Statistical Operators
• We also need fast operators to compute statistical properties of the data, and the ability to compare them
• For example, subgroup discovery: finding the maximal subgroups of a set of objects that are statistically “most interesting” with respect to a property of interest
• A critical step is a statistical algorithm to measure the difference between two tuple-sets with a common target feature, in order to compute their relative relevance

The difference between two tuple-sets
• Four steps that are conducted iteratively (a minimal sketch of how they might fit together follows below):
– Sampling
– Comparison
– Iteration
– Query ranking
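The slides do not spell these four steps out as an algorithm, so the following is only one possible way to wire them together: draw samples from the two tuple-sets, compare the common target feature with one of the statistical operators listed earlier, grow the sample while the verdict is still unstable, and finally rank the candidate explorations by the resulting relevance score. Every name here (relevance_of, rank_queries, the thresholds) is an illustrative assumption, not the method of [1].

```python
import pandas as pd
from scipy import stats

def relevance_of(tuples_a: pd.DataFrame, tuples_b: pd.DataFrame,
                 feature: str, start: int = 500, max_size: int = 8000) -> float:
    """Score how differently `feature` is distributed in two tuple-sets (illustrative)."""
    size, prev_p, p = start, None, 1.0
    while size <= max_size:
        # Sampling: work on samples instead of the full tuple-sets.
        sample_a = tuples_a[feature].sample(min(size, len(tuples_a)), random_state=0)
        sample_b = tuples_b[feature].sample(min(size, len(tuples_b)), random_state=0)
        # Comparison: here a two-sample Kolmogorov-Smirnov test on the target feature.
        _, p = stats.ks_2samp(sample_a, sample_b)
        # Iteration: enlarge the sample until the p-value stops moving.
        if prev_p is not None and abs(p - prev_p) < 0.01:
            break
        prev_p, size = p, size * 2
    return 1.0 - p  # crude relevance score: small p-value = large difference

def rank_queries(candidates, feature):
    """Query ranking: order (label, tuple-set A, tuple-set B) candidates by relevance."""
    scored = [(label, relevance_of(a, b, feature)) for label, a, b in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

For the AcmeBand example, the candidates could be labelled tuple-set pairs such as male vs. female runners, compared on the length of their running activities.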