What is Data Mining

Data mining is widely used by banking firms in soliciting credit card customers, by insurance and telecommunication companies in detecting fraud, by telephone companies and credit card issuers in identifying those potential customers most likely to churn, by manufacturing firms in quality control, and many other applications.

 

Data mining is being applied to improve food and drug product safety and to detect terrorists or criminals. Data mining involves statistical and/or artificial intelligence (AI) analysis, usually applied to large-scale data sets. Masses of data generated from cash registers, from scanning, and from topic-specific databases throughout the company are explored, analyzed, reduced, and reused.

 

Searches are performed across different models proposed for predicting sales, marketing response, and profit. Though automated AI methods are also used, classical statistical approaches are fundamental to data mining.

 

Data mining tools need to be versatile, scalable, capable of accurately predicting responses between actions and results, and capable of automatic implementation. Versatile refers to the ability of the tool to apply a wide variety of models. Scalability means that if a tool works on a small data set, it should also work on larger data sets.

 

Automation is useful, but its application is relative. Some analytic functions are often automated, but human setup prior to implementing procedures is required. In fact, analyst judgment is critical to the successful implementation of data mining. Proper selection of data to include in searches is critical: Too many variables produce too much output, while too few can overlook key relationships in the data. Data transformation also is often required.

 

Data Mining

Traditional statistical analysis is usually directed, in that a specific set of expected outcomes exists; this approach is referred to as supervised (hypothesis development and testing). Data mining, however, also involves a spirit of knowledge discovery (learning new and useful things), which is referred to as unsupervised. Knowledge discovery by humans can be enhanced by graphical tools and by the identification of unexpected patterns through a combination of human and computer interaction.

 

Much of this can also be accomplished through automatic means. A variety of analytic computer models have been used in data mining. The standard models employed in data mining include regression (e.g., normal regression for prediction and logistic regression for classification) and neural networks.

 

This blog discusses techniques like association rules for initial data exploration, fuzzy data mining approaches, rough set models, and genetic algorithms. Data mining requires identification of a problem, along with a collection of data that can lead to better understanding, and computer models to provide statistical or other means of analysis. This may be supported by visualization tools that display data, or through fundamental statistical analysis, such as correlation analysis.

 

Data mining aims to extract knowledge and insight through the analysis of large amounts of data using sophisticated modeling techniques; it converts data into knowledge and actionable information. Data mining models consist of a set of rules, equations, or complex functions that can be used to identify useful data patterns and to understand and predict behaviors.

 

Data mining is a process that uses a variety of data analysis methods to discover the unknown, unexpected, interesting, and relevant patterns and relationships in data that may be used to make valid and accurate predictions. In general, there are two methods of data analysis: supervised and unsupervised. In both cases, a sample of observed data is required. This data may be termed the training sample. The training sample is used by the data mining activities to learn the patterns in the data.

 

Data mining models are of two kinds:

1. Directed or supervised models: In these models, there are input fields or attributes and an output or target field. Input fields are also called predictors because they are used by the model to identify a prediction function for the output or target field. The model generates an input-output mapping function, which associates predictors with the output so that, given the values of input fields, it predicts the output values.

 

Predictive models themselves are of two types, namely, classification or propensity models and estimation models. Classification models are predictive models with a predefined target field, classes, or groups, so that the objective is to predict a specific occurrence or event. The model also assigns a propensity score to each of these events that indicates the likelihood of the occurrence of that event. In contrast, estimation models are used to predict a continuum of target values based on the corresponding input values.

 

For instance, the supervised model is used to estimate an unknown dependency from known input-output data.

 

a. Input variables might include the following:

  • Quantities of different articles bought by a particular customer
  • Date of purchase
  • Location
  • Price

 

b. Output variables might include an indication of whether the customer responds to a sales campaign or not. Output variables are also known as targets in data mining.

 

Sample input variables are passed through a learning system, and the subsequent output from the learning system is compared with the output from the sample.  In other words, we try to predict who will respond to a sales campaign. The difference between the learning system output and the sample output can be thought of as an error signal. Error signals are used to adjust the learning system. This process is done many times with the data from the sample, and the learning system is adjusted until the output meets a minimal error threshold.
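The error-driven adjustment described above can be illustrated with a minimal sketch. The example below trains a logistic-regression-style model by gradient descent on a tiny, invented campaign-response sample; the field names, numbers, learning rate, and iteration count are all hypothetical and chosen only to show the compare-adjust-repeat loop.

```python
import numpy as np

# Hypothetical training sample: [quantity_bought, price_paid]; target = responded (1) or not (0)
X = np.array([[5, 120.0], [1, 15.0], [3, 60.0], [8, 200.0], [2, 25.0], [7, 180.0]])
y = np.array([1, 0, 0, 1, 0, 1])

# Standardize inputs so the gradient steps are well behaved
X = (X - X.mean(axis=0)) / X.std(axis=0)

weights = np.zeros(X.shape[1])
bias = 0.0
learning_rate = 0.1

for epoch in range(500):                              # repeat until the error is acceptably small
    scores = X @ weights + bias
    predictions = 1.0 / (1.0 + np.exp(-scores))       # learning system output
    error = predictions - y                           # error signal: system output vs. sample output
    weights -= learning_rate * X.T @ error / len(y)   # adjust the learning system
    bias -= learning_rate * error.mean()

print("Predicted response probabilities:", np.round(predictions, 2))
```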

 

2. Undirected or unsupervised models: In these models, there are input fields or attributes, but no output or target field. The goal of such models is to uncover data patterns in the set of input fields. Undirected models are also of two types, namely, cluster models and association and sequence models.

 

Cluster models do not have a predefined target field, classes, or groups; instead, the algorithms analyze the input data patterns and identify the natural groupings of cases. In contrast, association or sequence models do not involve or deal with the prediction of a single field. Association models detect associations between discrete events, products, or attributes; sequence models detect associations over time.

 

Unsupervised data analysis does not involve any fine-tuning. Data mining algorithms search through the data to discover patterns, and there is no target or aim variable. Only input values are presented to the learning system without the need for validation against any output. The goal of unsupervised data analysis is to discover “natural” structures in the input data. In biological systems, perception is a task learned via an unsupervised technique.

 

Benefits

Data mining can provide customer insight, which is vital for establishing an effective Customer Relationship Management strategy. It can lead to personalized interactions with customers and hence increased satisfaction and profitable customer relationships through data analysis. It can support individualized and optimized customer management throughout all the phases of the customer life cycle, from the acquisition and establishment of a strong relationship to the prevention of attrition and the winning back of lost customers.

 

1. Segmentation: This is the process of dividing the customer base into distinct and internally homogeneous groups in order to develop differentiated marketing strategies according to their characteristics. There are many different segmentation types based on the specific criteria or attributes used for segmentation.

 

In behavioral segmentation, customers are grouped by behavioral and usage characteristics. Data mining can uncover groups with distinct profiles and characteristics and lead to rich segmentation schemes with business meaning and value. Clustering algorithms can analyze behavioral data, identify the natural groupings of customers, and suggest a solution founded on observed data patterns.

 

Data mining can also be used for the development of segmentation schemes based on the current or expected/estimated value of the customers. These segments are necessary in order to prioritize customer handling and marketing interventions according to the importance of each customer.

 

2. Direct marketing campaigns: Marketers use direct marketing campaigns to communicate a message to their customers through mail, the Internet, e-mail, telemarketing (phone), and other direct channels in order to prevent churn (attrition) and to drive customer acquisition and purchase of add-on products.

 

More specifically, acquisition campaigns aim at drawing new and potentially valuable customers away from the competition. Cross-/deep-/up-selling campaigns are implemented to sell additional products, more of the same product, or alternative but more profitable products to existing customers. Finally, retention campaigns aim at preventing valuable customers from terminating their relationship with the organization.

 

Although potentially effective, such campaigns can also lead to a huge waste of resources and to bombarding and annoying customers with unsolicited communications. Data mining and classification (propensity) models, in particular, can support the development of targeted marketing campaigns. They analyze customer characteristics and recognize the profiles or extended profiles of the target customers.

 

3. Market basket analysis: Data mining and association models, in particular, can be used to identify related products typically purchased together. These models can be used for market basket analysis and for revealing bundles of products or services that can be sold together.

 

However, to succeed with CRM, organizations need to gain insight into customers and their needs and wants through data analysis. This is where analytical CRM comes in. Analytical CRM is about analyzing customer information to better address the CRM objectives and deliver the right message to the right customer. It involves the use of data mining models in order to assess the value of the customers, understand, and predict their behavior. It is about analyzing data patterns to extract knowledge for optimizing the customer relationships. For example,

 

a. Data mining can help in customer retention as it enables the timely identification of valuable customers with increased likelihood to leave, allowing time for targeted retention campaigns.

 

b. Data mining can support customer development by matching products with customers and better targeting of product promotion campaigns.

 

c. Data mining can also help to reveal distinct customer segments, facilitating the development of customized new products and product offerings, which better address the specific preferences and priorities of the customers.

 

The results of the analytical CRM procedures should be loaded and integrated into the operational CRM front-line systems so that all customer interactions can be more effectively handled on a more informed and personalized base.

 

Data Mining Applications

Data mining can be used by businesses in many ways. Two of the most profitable application areas have been customer segmentation, used by marketing organizations to identify those customers with marginally greater probabilities of responding to different forms of marketing media, and the use of data mining by banks to more accurately predict the likelihood of customers responding to offers of different services.

 

Many companies use data mining to identify their “valuable” customers so that they can provide the service needed to retain them. Typical applications include the following:

1. Customer Profiling—identifying those subsets of customers most profitable to the business

2. Targeting—determining the characteristics of profitable customers who have been captured by competitors

3.  Market Basket Analysis—determining product purchases by the consumer,  which can be used for product positioning and for cross-selling

 

The key is to find actionable information or information that can be utilized in a concrete way to improve profitability. Some of the earliest applications were in retailing, especially in the form of market basket analysis.

 

Data mining methodologies can be applied to a variety of domains, from marketing and manufacturing process control to the study of risk factors in medical diagnosis, from the evaluation of the effectiveness of new drugs to fraud detection.

 

1. Relational marketing: It is useful for numerous tasks like identification of customer segments that are most likely to respond to targeted marketing campaigns, such as cross-selling and up-selling; identification of target customer segments for retention campaigns; prediction of the rate of positive responses to marketing campaigns; and interpretation and understanding of the buying behavior of the customers.

 

2. Text mining: Data mining can be applied to different kinds of texts, which represent unstructured data, in order to classify articles, books, documents, e-mails, and web pages. Examples are web search engines or the automatic classification of press releases for storing purposes. Other text mining applications include the generation of filters for e-mail messages and newsgroups.

 

3. Web mining: It is useful for the analysis of so-called clickstreams—the sequences of pages visited and the choices made by a web surfer. Such analyses may prove useful for e-commerce sites, in offering flexible and customized pages to surfers, in caching the most popular pages, or in evaluating the effectiveness of an e-learning training course.

 

4. Image recognition: The treatment and classification of digital images, both static and dynamic, are useful to recognize written characters, compare and identify human faces, apply correction filters to photographic equipment, and detect suspicious behaviors through surveillance video cameras.

 

5. Medical diagnosis: Learning models are an invaluable tool within the medical field for the early detection of diseases using clinical test results.

 

6. Image analysis: Image analysis for diagnostic purposes is another field of investigation that is currently burgeoning.

 

7.  Fraud detection: Fraud detection is relevant for different industries such as telephony, insurance (false claims), and banking (illegal use of credit cards and bank checks; illegal monetary transactions).

 

8. Risk evaluation: The purpose of risk analysis is to estimate the risk connected with future decisions. For example, using the past observations available, a bank may develop a predictive model to establish if it is appropriate to grant a monetary loan or a home loan, based on the characteristics of the applicant.

 

Data Mining Analysis

Exploratory Analysis

This data mining task is primarily conducted by means of exploratory data analysis and therefore, it is based on queries and counts that do not require the development of specific learning models. The information so acquired is usually presented to users in the form of histograms and other types of charts.

 

Before starting to develop a classification model, it is often useful to carry out an exploratory analysis whose purposes are as follows:

  • To achieve a characterization by comparing the distribution of the values of the attributes for the records belonging to the same class

 

  • To detect a difference, through a comparison between the distribution of the values of the attributes for the records of a given class and the records of a different class (or between the records of a given class and all remaining records)

 

The primary purpose of exploratory data analysis is to highlight the relevant features of each attribute contained in a dataset, using graphical methods and calculating summary statistics, and to identify the intensity of the underlying relationships among the attributes. Exploratory data analysis includes three main phases:

 

1.  Univariate analysis, in which the properties of every single attribute of a dataset are investigated

 

2. Bivariate analysis, in which pairs of attributes are considered, to measure the intensity of the relationship existing between them (for supervised learning models, it is of particular interest to analyze the relationships between the explanatory attributes and the target variable)

 

3. Multivariate analysis, in which the relationships holding within a subset of attributes are investigated
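A minimal pandas sketch of these three phases might look like the following; the column names (age, seniority, churned) and values are hypothetical and stand in for a real customer dataset.

```python
import pandas as pd

# Hypothetical customer dataset
df = pd.DataFrame({
    "age":       [25, 34, 45, 52, 29, 41, 38, 60],
    "seniority": [1, 3, 10, 12, 2, 7, 5, 15],
    "churned":   [1, 1, 0, 0, 1, 0, 0, 0],
})

# 1. Univariate analysis: summary statistics for each single attribute
print(df.describe())

# 2. Bivariate analysis: relationship between each explanatory attribute and the target
print(df.groupby("churned")[["age", "seniority"]].mean())

# 3. Multivariate analysis: correlation structure of a subset of attributes
print(df.corr())
```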

 

Classification

In a classification problem, a set of observations is available, usually represented by the records of a dataset, whose target class is known. Observations may correspond, for instance, to mobile phone customers and the binary class may indicate whether a given customer is still active or has churned.

 

Each observation is described by a given number of attributes whose values are known; in the previous example, the attributes may correspond to age, customer seniority, and outgoing telephone traffic distinguished by destination. A classification algorithm can, therefore, use the available observations relative to the past in order to identify a model that can predict the target class of future observations whose attribute values are known.

 

Classification analysis has many applications: the selection of target customers for a marketing campaign, fraud detection, image recognition, early diagnosis of diseases, text cataloging, and spam e-mail recognition are just a few examples of real problems that can be framed within the classification paradigm.
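As a rough illustration of the churn example above, the sketch below fits a decision tree on a tiny, invented data set; the attributes (age, seniority, outgoing minutes), values, and tree depth are hypothetical choices, not a prescribed setup.

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical past observations: [age, seniority_years, outgoing_minutes]
X = [[25, 1, 300], [34, 3, 120], [45, 10, 450], [52, 12, 600],
     [29, 2, 90],  [41, 7, 520], [38, 5, 200],  [60, 15, 700]]
y = [1, 1, 0, 0, 1, 0, 1, 0]   # 1 = churned, 0 = still active

model = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Predict the target class of a future observation whose attribute values are known
print(model.predict([[33, 2, 150]]))
```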

 

Regression

If one wishes to predict the sales of a product based on the promotional campaigns mounted and the sale price, the target variable may take on a very high number of discrete values and can be treated as a continuous variable; this would become a case of regression analysis. Based on the available explanatory attributes, the goal is to predict the value of the target variable for each observation.

 

A classification problem may effectively be turned into a regression problem, and vice versa; for instance, a mobile phone company interested in the classification of customers based on their loyalty may come up with a regression problem by predicting the probability of each customer remaining loyal.

 

The purpose of regression models is to identify a functional relationship between the target variable and a subset of the remaining attributes contained in the dataset. Regression models:

 

  • Serve to interpret the dependency of the target variable on the other variables.

 

  • Are used to predict the future value of the target attribute, based upon the functional relationship identified and the future value of the explanatory attributes.

 

The development of a regression model allows knowledge workers to acquire a deeper understanding of the phenomenon analyzed and to evaluate the effects determined on the target variable by different combinations of values assigned to the remaining attributes.

 

This is of great interest particularly for analyzing those attributes that are control levers available to decision makers.

 

Thus, a regression model may be aimed at interpreting the sales of a product based on investments made in advertising in different media, such as daily newspapers, magazines, TV, and radio. Decision makers may use the model to assess the relative importance of  the various communication channels, therefore directing future investments toward those media that appear to be more effective.

 

Moreover, they can also use the model to predict the effects on the sales determined by different marketing policies, so as to design a combination of promotional initiatives that appear to be the most advantageous.
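A minimal sketch of such a regression model follows. The spending figures and sales values are invented, and the point is only to show how fitted coefficients can be read as the relative effect of each advertising channel and how the model can then score a candidate promotional mix.

```python
from sklearn.linear_model import LinearRegression

# Hypothetical monthly advertising spend (newspapers, magazines, TV, radio) and resulting sales
X = [[10, 5, 50, 8], [12, 6, 55, 9], [8, 4, 40, 7], [15, 7, 60, 12],
     [9, 5, 45, 6],  [14, 8, 65, 11], [11, 6, 52, 9], [13, 7, 58, 10]]
y = [120, 135, 100, 160, 110, 170, 130, 150]

model = LinearRegression().fit(X, y)

# Coefficients indicate the estimated effect of each channel on sales
for channel, coef in zip(["newspapers", "magazines", "TV", "radio"], model.coef_):
    print(f"{channel}: {coef:.2f}")

# Predict sales for a candidate promotional mix
print("Predicted sales:", model.predict([[12, 6, 55, 10]]))
```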

 

Time Series

Sometimes the target attribute evolves over time and is therefore associated with adjacent periods on the time axis. In this case, the sequence of values of the target variable is said to represent a time series. For instance, the weekly sales of a given product observed over  2 years represent a time series containing 104 observations. Models for time-series analysis investigate data characterized by a temporal dynamics and are aimed at predicting the value of the target variable for one or more future periods.

 

The aim of models for time-series analysis is to identify any regular pattern of observations relative to the past, with the purpose of making predictions for future periods. Time-series analysis has many applications in business, financial, socioeconomic, environmental, and industrial domains—predictions may refer to future sales of products and services, trends in economic and financial indicators, or sequences of measurements relative to ecosystems, for example.
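A very simple sketch of this idea uses a moving average as the forecast for the next period; the weekly sales series and the 4-week window are invented, and real time-series models would of course be more elaborate.

```python
import pandas as pd

# Hypothetical weekly sales observed over the past weeks
sales = pd.Series([102, 98, 110, 105, 115, 120, 118, 125, 130, 128])

# Smooth the series with a 4-week moving average to expose the regular pattern
trend = sales.rolling(window=4).mean()

# Naive forecast: reuse the last moving-average value for the next period
forecast_next_week = trend.iloc[-1]
print("Forecast for next week:", round(forecast_next_week, 1))
```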

 

Unsupervised Analysis

Association Rules

Association rules, also known as affinity groupings, are used to identify interesting and recurring associations between groups of records of a dataset. For example, it is possible to determine which products are purchased together in a single transaction, and how frequently. Companies in the retail industry resort to association rules to design the arrangement of products on shelves or in catalogs. Groupings by related elements are also used to promote cross-selling or to devise and promote combinations of products and services.
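The core of association rule mining is counting how often items co-occur across transactions. The toy sketch below computes support and confidence for item pairs without any specialized library; the baskets and the 0.4 support threshold are invented purely for illustration.

```python
from itertools import combinations
from collections import Counter

# Hypothetical transactions (market baskets)
baskets = [
    {"bread", "milk"}, {"bread", "butter"}, {"milk", "butter", "bread"},
    {"milk", "cereal"}, {"bread", "milk", "cereal"},
]
n = len(baskets)

item_counts = Counter(item for b in baskets for item in b)
pair_counts = Counter(pair for b in baskets for pair in combinations(sorted(b), 2))

# Rule A -> B: support = P(A and B), confidence = P(B | A)
for (a, b), count in pair_counts.items():
    support = count / n
    confidence = count / item_counts[a]
    if support >= 0.4:
        print(f"{a} -> {b}: support={support:.2f}, confidence={confidence:.2f}")
```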

 

Clustering

The term “cluster” refers to a homogeneous subgroup existing within a population. Clustering techniques are therefore aimed at segmenting a heterogeneous population into a given number of subgroups composed of observations that share similar characteristics; observations included in different clusters have distinctive features. Unlike classification, in clustering, there are no predefined classes or reference examples indicating the target class, so that the objects are grouped together based on their mutual homogeneity.

 

Sometimes, the identification of clusters represents a preliminary stage in the data mining process, within exploratory data analysis. It may allow homogeneous data to be processed with the most appropriate rules and techniques and the size of the original dataset to be reduced since the subsequent data mining activities can be developed autonomously on each cluster identified.
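A minimal k-means sketch follows, clustering invented customer records on two behavioral attributes; the data, the choice of two attributes, and the number of clusters are all hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical behavioral data: [monthly_spend, monthly_visits]
X = np.array([[20, 2], [25, 3], [22, 2], [200, 15], [210, 14],
              [190, 16], [80, 7], [85, 6], [90, 8]])

# Ask for three natural groupings of cases
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print("Cluster labels:", kmeans.labels_)
print("Cluster centers:", kmeans.cluster_centers_.round(1))
```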

 

Description and Visualization

The purpose of a data mining process is sometimes to provide a simple and concise representation of the information stored in a large dataset. Although, in contrast to clustering and association rules, descriptive analysis does not pursue any particular grouping or partition of the records in the dataset, an effective and concise description of information is very helpful, since it may suggest possible explanations of hidden patterns in the data and lead to a better understanding of the phenomena to which the data refer.

 

Notice that it is not always easy to obtain a meaningful visualization of the data. However, the effort of representation is justified by the remarkable conciseness of the information achieved through a well-designed chart.

 

CRISP-DM Methodology

The cross-industry standard process for data mining (CRISP-DM) methodology was initiated in 1996 and represents a generalized pattern applicable to any data mining project. The methodology maps the general CRISP-DM process onto a process tailored to a specific application.

 

In essence, the process model describes the life cycle of the data mining process comprising six basic steps; the model shows phases, the tasks within individual phases, and relations between them. Data mining projects are iterative; once a goal is reached, or new knowledge and insight are discovered that can be useful in one of the previous phases, it is desirable to revisit the earlier phases.

The CRISP-DM process model is constituted of the following six phases:

 

  • Business understanding
  • Data understanding
  • Data preparation
  • Modeling
  • Model evaluation
  • Model deployment

 

The CRISP-DM process can be viewed through four hierarchical levels that describe the model at increasing levels of detail, from general to specific. Each specific project passes through the phases at the first level; the first level is, at the same time, the most abstract. At the subsequent level, each phase is broken down into generalized, generic tasks. They are generalized, as they cover all the possible scenarios in the data mining process, depending on the phase the project is in.

 

The first level defines the basic phases of the process, that is, the data mining project. The third level defines particular, specialized tasks. They describe how an individual generalized task from the second level is executed in the specific case.

 

For instance, if the second level defines a generic task of data filtering, then the third level describes how this task is executed depending on whether it is a categorical or continuous variable. Finally, the fourth level contains the specific instance of the data mining process, with a range of actions, decisions, and outcomes of the actual knowledge discovery process.

 

Business Understanding

This phase of the data mining project deals with defining goals and demands from the business point of view. This phase comprises tasks such as the following:

  • Determining business objectives
  • Situation assessment
  • Defining the goals of data mining
  •  Producing the project plan

 

This phase determines the problem domain (marketing, user support, or something similar) and identifies the organization’s business units involved with the project. It also identifies the resources required for the project, including the hardware and tools for implementation, as well as the human resources, especially the domain-specific experts required for the project.

 

At the end of the first phase, a project plan is developed with a list of phases and their constituent tasks and activities, as well as time and effort estimations; resources for tasks, their interdependencies, inputs, and outputs are also defined. The project plan highlights strategies for issues like risk assessment and quality management.

 

Data Understanding

This phase of the data mining project deals with becoming familiar with the organization’s data through exploratory data analysis, which includes computing simple statistical characteristics as well as more complex analyses, that is, forming certain hypotheses about the business problem.

 

This phase comprises tasks such as the following:

  • Collecting initial data
  • Describing data
  • Exploring data
  • Verifying data quality

 

The data are obtained from identified sources, and the selection criteria are chosen in light of the specific business problem under consideration. Tables are defined, and if the data source is a relational database or a data warehouse, variations of the tables to be used are also specified.

 

This is followed by analyzing the basic characteristics of the data, such as quantity and types (e.g., categorical or continuous), the analysis of the correlations between variables, distribution, and intervals of values, as well as other simple statistical functions coupled with specialized statistical analysis tools if necessary. It is important to establish the meaning for every variable, especially from the business aspect, and relevance to the specific data mining problem.

 

The more complex analysis of the dataset entails using OLAP or similar visualization tools. This analysis helps shape the relevant hypotheses and transform them into the corresponding mining problem space. In addition, the project goals are fine-tuned more precisely.

 

At the end of this phase, the quality of the data set is ascertained in terms of the completeness and accuracy of the data, the frequency of discrepancies, and the occurrence of null values.

 

Data Preparation

This phase of the data mining project deals with data preparation for the mining process. It includes choosing the initial data set on which modeling is to begin, that is, the model set.

This phase comprises tasks such as the following:

 

  • Data selection
  • Data cleaning
  • Data construction
  • Data integration
  • Data formatting

 

When defining the set for the subsequent modeling step, one considers, among other things, the elimination of individual variables based on the results of statistical tests of correlation and significance of the values of individual variables. Taking these into account, the number of variables for the subsequent modeling iteration is reduced, with the aim of obtaining an optimum model. Besides this, this is the phase in which the sampling technique (i.e., reducing the size of the initial data set) is decided on.

 

At the end of this phase, the issue of data quality is addressed, as well as the manner in which missing values will be managed and the strategy for handling particular values. New variables are derived, the values of the existing ones are transformed, and values from different tables are combined in order to obtain new variable values. Finally, individual variables are syntactically adjusted in sync with the modeling tools, without changing their meaning.
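A condensed pandas sketch of these steps (handling missing values, deriving a new variable, and reformatting a field for the modeling tool) is shown below; the column names, fill strategies, and reference date are hypothetical examples rather than recommendations.

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "income":    [45000, np.nan, 52000, 61000, np.nan],
    "spend":     [1200, 800, np.nan, 2100, 950],
    "join_date": ["2021-03-01", "2020-11-15", "2022-01-10", "2019-06-30", "2021-09-05"],
})

# Data cleaning: fill missing values with a simple, explicit strategy
df["income"] = df["income"].fillna(df["income"].median())
df["spend"] = df["spend"].fillna(0)

# Data construction: derive a new variable from existing ones
df["spend_ratio"] = df["spend"] / df["income"]

# Data formatting: syntactic adjustment for the modeling tool, without changing meaning
df["join_date"] = pd.to_datetime(df["join_date"])
df["tenure_days"] = (pd.Timestamp("2023-01-01") - df["join_date"]).dt.days

print(df)
```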

 

Modeling

This phase of the data mining project deals with choosing the data mining technique itself. The choice of the tool, that is, a technique to be applied depends on the nature of the problem. Actually, various techniques can always be applied to the same type of problem, but there is always a technique or tool yielding the best results for a specific problem.

 

It is sometimes necessary to model several techniques and algorithms, and then opt for the one yielding the best results. In other words, several models are built in a single iteration of the phase, and the best one is selected.

 

This phase comprises tasks such as the following:

  • Generating test design
  • Building the model
  • Assessing the model

 

Before modeling starts, the data (model) set from the previous phase must be divided into subsets for training, testing, and evaluation. The evaluation subset is used for assessing the model’s efficiency on unfamiliar data, whereas the test subset is used for achieving model generality, that is, avoiding the overfitting on the training subset.
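A minimal sketch of dividing a model set into training, test, and evaluation subsets with two successive splits; the 60/20/20 proportions and the random data are just placeholders.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical model set: 100 records, 5 attributes, binary target
X = np.random.rand(100, 5)
y = np.random.randint(0, 2, size=100)

# First split off the evaluation subset (20%), then split the rest into training and test
X_rest, X_eval, y_rest, y_eval = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train), len(X_test), len(X_eval))   # 60, 20, 20
```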

 

Division of the data subset is followed by model building. The effectiveness of the obtained model is assessed on the evaluation subset. In the case of predictive models, applying the obtained model on the evaluation subset produces data for the cumulative gains chart showing how well the model predicts on an unfamiliar set.

 

Parameters for the subsequent modeling step are determined based on the obtained chart, the model quality ratio (the area under the curve), and other ratios, such as significance and value factors for individual variables and the correlations between them. If necessary, the developer returns to the previous phase to eliminate noise variables from the model set.

 

If several models were built in this phase (even those built using different techniques), then the models are compared, and the best ones are selected for the next iteration of the modeling phase. Each obtained model is interpreted from the business point of view, as much as is possible in the current phase iteration.

 

At the end of this phase, the developers assess the possibility of model deployment, result reliability, and whether the set goals are met from the business and analytic point-of-view. The modeling phase is repeated until the best, that is, the satisfactory model is obtained.

 

Model Evaluation

This phase of the data mining project deals with the assessment of the final model, that is, the extent to which it meets the goals set in the first phase of the data mining project. The evaluation of the model in the previous phase is more related to the model’s technical characteristics (efficiency and generality).

 

This phase comprises tasks such as the following:

  • Evaluating results
  • Reviewing the process
  • Determining the next steps

 

If the information gained at this point affects the quality of the entire project, this would indicate returning to the first phase and reinitiating the whole process with the newer information. However, if the model meets all the business goals and is considered satisfactory for deployment, a detailed review of the entire data mining process is conducted in order to ascertain the quality of the entire process.

 

At the end of this phase, the project manager decides on moving to the phase of model deployment or repeating the prior process for improvement.

 

Model Deployment

This phase of the data mining project deals with model deployment in business, taking into account the way of measuring the model’s benefits and its fine-tuning on an ongoing basis.

This phase comprises tasks such as the following:

  • Preparing the deployment plan
  • Monitoring plan
  • Maintenance
  • Producing the final report
  • Project review

 

Because of the changing market conditions and competition, it is necessary to repeat the modeling process periodically to fine-tune or alter the model for sustaining the effectiveness of the insights drawn from data mining.

 

The application of the model in the strategic decision making of a business organization can be used for direct measurement of the benefits of the obtained model, and gather new knowledge for the subsequent iterations for model improvement.

 

At the end of this phase, the project is concluded by the overall review, that is, analysis of its strengths and weaknesses. Final reports and presentations are made. Documentation with experiences usable in possible future projects is also compiled.

 

Machine Learning

An intelligent system learns from experience or relevant information about past happenings. The same is true for machines as well; machines learn in two different ways:

 

1. They are exposed to past happenings to adaptively learn from whatever they “experience.”

2. They are exposed to massive collective data relevant to the past happenings; the machine ingests this data and attempts to learn from it.

 

Since anticipating future events is unfeasible, it is not possible to prepare all machines to experience happenings as they occur, so only the second approach is feasible. In that case, the larger the data, the better the scope for the machine to learn comprehensively.

 

The machine, when presented with this data carrying hidden facts, rules, and inferences, is supposed to discern them so that the next time the same data occur, it can identify and compute the correct answer (or solution). The machine summarizes the entire input data into a smaller data set that can be consulted to find outputs to future inputs in a manageable amount of time.

 

The general procedure of working with machine-learning systems consists of the following:

  • Establish the historical database
  • Perform input data acquisition
  • Perform pattern matching
  • Produce output
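As a crude illustration of this procedure, the sketch below keeps a small historical database and answers a new input by matching it against the most similar past case; the data, the outcome labels, and the use of Euclidean distance are arbitrary choices for the example.

```python
import numpy as np

# 1. Historical database: past inputs and the outcomes observed for them
history_X = np.array([[1.0, 2.0], [3.0, 3.5], [6.0, 1.0], [7.5, 8.0]])
history_y = ["low", "low", "medium", "high"]

# 2. Input data acquisition: a new, previously unseen case
new_case = np.array([6.5, 1.5])

# 3. Pattern matching: find the most similar past case (Euclidean distance)
distances = np.linalg.norm(history_X - new_case, axis=1)
best_match = int(np.argmin(distances))

# 4. Produce output: reuse the outcome of the matched case
print("Predicted outcome:", history_y[best_match])
```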

 

Learning is the process of building a scientific model after discovering knowledge from a sample data set or data sets. Generally, machine learning is considered to be the process of applying a computing-based resource to implement learning algorithms. Formally, machine learning is defined as the complex computation process of automatic pattern recognition and intelligent decision making based on training sample data.

 

Machine-learning methods can be categorized into four groups of learning activities:

 

Symbol-based machine learning has a hypothesis that all knowledge can be represented in symbols and that machine learning can create new symbols and new knowledge, based on the known symbols. In symbol-based machine learning, decisions are deducted using logical inference procedures.

 

Connectionist-based machine learning is constructed by imitating neuron net connection systems in the brain. In connectionist machine learning, decisions are made after the systems are trained and patterns are recognized. Behavior-based learning assumes that there are solutions to behavior identification and is designed to find the best solution to the problem.

 

The immune-system-based approach learns from its encounters with foreign objects and develops the ability to identify patterns in data.

 

It is not necessary to select machine-learning methods based on these fundamental distinctions; within the machine-learning process, mathematical models are built to describe the data randomly sampled from an unseen probability distribution.

 

None of these machine-learning methods has any noticeable advantages over the others. Machine learning has to be evaluated empirically because its performance heavily depends on the type of prior training experience the learning machine has undergone,  the performance evaluation metrics, and the strength of the problem definition.

 

Machine-learning methods are evaluated by comparing the learning results of different methods applied to the same data set, or by quantifying the learning results of the same method applied to different sample data sets. Generally, the feasibility of a machine-learning method is acceptable when its computation time is polynomial.

 

Machine-learning methods use training patterns to learn or estimate the form of a classifier model. The models can be parametric or nonparametric. The goal of using machine-learning algorithms is to reduce the classification error on the given training sample data. Because the training data are finite, learning theory requires probability bounds on the performance of learning algorithms.

 

Depending on the availability of training data and the desired outcome of the learning algorithms, machine-learning algorithms are categorized into the following:

 

1. In supervised learning, pairs of input and target output are given to train a function, and a learning model is trained such that the output of the function can be predicted at a minimum cost. The supervised learning methods are categorized based on the structures and objective functions of learning algorithms. Popular categories include artificial neural networks (ANNs), support vector machines (SVMs), and decision trees.

 

2. In unsupervised learning, no target or label is given in the sample data. Unsupervised learning methods are designed to summarize the key features of the data and to form the natural clusters of input patterns given a particular cost function. The most famous unsupervised learning methods include k-means clustering, hierarchical clustering, and self-organizing maps. Unsupervised learning is difficult to evaluate, because it does not have an explicit teacher and, thus, does not have labeled data for testing.

 

Cybersecurity Systems

Cybersecurity systems address various cybersecurity threats, including viruses, Trojans, worms, spam, and botnets. These cybersecurity systems combat cybersecurity threats at two levels:

 

Host-based defense systems control incoming data on a workstation by firewall, antivirus, and intrusion detection techniques installed in hosts.

 

Network-based defense systems control network flow by network firewall, spam filter, antivirus, and network intrusion detection techniques.

 

Conventional approaches to cyber defense create a protective shield for cyberinfrastructure; they are mechanisms designed in firewalls, authentication tools, and network servers that monitor, track, and block viruses and other malicious cyber attacks. For example, the Microsoft Windows operating system has a built-in Kerberos cryptography system that protects user information. Antivirus software is designed and installed in personal computers and cyber infrastructures to ensure customer information is not used maliciously.

 

Cybersecurity systems aim to maintain the confidentiality, integrity, and availability of information and information management systems through various cyber defense systems that protect computers and networks from hackers who may want to intrude on a system or steal financial, medical, or other identity-based information.

 

Cyber systems and infrastructure are always vulnerable because of the inherently transient nature of the design and implementation of software and networks. Due to unavoidable design and programming errors, vulnerabilities in common security components, such as firewalls, are inevitable; it is not possible to build a system that has no security vulnerabilities.

 

Patches are developed continuously to protect the cyber systems, but attackers also continuously exploit newly discovered flaws. Because of the constantly evolving nature of cyber threats, merely building defensive systems for identified attacks is not adequate to protect users; higher-level methodologies are also required to discover overt and covert intrusions and intrusion techniques so that a more reliable security cyberinfrastructure can be ensured.

 

A high-level defense system consists of the following steps:

 

1. Information sources:

A host-based event originates with log-files; a host-based event includes a sequence of commands executed by a user and a sequence of system calls launched by an application, for example, send mail. A network-based event originates with network traffic; a network-based event includes network traffic data, for example, a sequence of Internet protocol (IP) or transmission control protocol (TCP) network packets.

 

2. Data capturing tools: Data capturing tools such as Libpcap for Linux or Winpcap for Windows capture events from the audit trails of resource information sources like a host or network.

 

3.  Data preprocessing: The data preprocessing module filters out the attacks for which good signatures have been learned.

 

4. Feature extraction: A feature extractor derives basic features that are useful in event analysis engines, including a sequence of system calls, start time, duration of a network flow, source IP and source port, destination IP and destination port, protocol, number of bytes, and number of packets.
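A toy sketch of deriving such flow-level features from captured packet records follows; the packet dictionaries are invented stand-ins for what a capture tool would emit, and the flow key and fields are simplified for illustration.

```python
from collections import defaultdict

# Hypothetical packet records as a capture tool might emit them
packets = [
    {"src": "10.0.0.5", "dst": "93.184.216.34", "proto": "TCP", "bytes": 1500, "ts": 0.00},
    {"src": "10.0.0.5", "dst": "93.184.216.34", "proto": "TCP", "bytes": 900,  "ts": 0.12},
    {"src": "10.0.0.7", "dst": "8.8.8.8",       "proto": "UDP", "bytes": 80,   "ts": 0.20},
]

flows = defaultdict(lambda: {"packets": 0, "bytes": 0, "start": None, "end": None})

# Aggregate packets into flows keyed by (source IP, destination IP, protocol)
for p in packets:
    f = flows[(p["src"], p["dst"], p["proto"])]
    f["packets"] += 1
    f["bytes"] += p["bytes"]
    f["start"] = p["ts"] if f["start"] is None else min(f["start"], p["ts"])
    f["end"] = p["ts"] if f["end"] is None else max(f["end"], p["ts"])

for key, f in flows.items():
    duration = f["end"] - f["start"]
    print(key, f["packets"], "packets,", f["bytes"], "bytes, duration", duration)
```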

 

5. Analysis engines: In an analysis engine, various intrusion detection methods are implemented to investigate the behavior of the cyberinfrastructure, which may or may not have appeared before in the record, for example, to detect anomalous traffic.

 

6. The decision of responses: The decision of responses is generated once a cyber attack has been identified.

The resulting solutions can be either reactive or proactive. Reactive security solutions, termed intrusion detection systems (IDSs), detect intrusions based on the information from log files and network flow so that the extent of damage can be determined, hackers can be tracked down, and similar attacks can be prevented in the future.

 

However, proactive approaches anticipate and eliminate identified vulnerabilities in the cyber system, while remaining prepared to defend effectively and rapidly against actual attacks. To function correctly, proactive security solutions require user authentication (e.g., user password and biometrics), a system capable of avoiding programming errors, and information protection.

 

Data Mining for Cybersecurity

Data mining techniques use statistics, artificial intelligence, and pattern recognition of data in order to group or extract behaviors or entities. Data mining uses analysis tools from statistical models, mathematical algorithms, and machine-learning methods to discover previously unknown, valid patterns and relationships in large data sets, which are useful for finding hackers and preserving privacy in cybersecurity.

 

Learning these behaviors is important, as they can identify and describe structural patterns in the data automatically and, as a consequence, theoretically explain data and predict patterns. Automatic and theoretic learning require complex computation that demands stringent machine-learning algorithms.

 

There are two categories of data mining methods: supervised and unsupervised. Supervised data mining techniques predict a hidden function using training data. The training data have pairs of input variables and output labels or classes. The output of the method can predict a class label of the input variables.

 

Examples of supervised mining are classification and prediction. Unsupervised data mining is an attempt to identify hidden patterns from given data without resorting to training data (i.e., pairs of input and class labels). Typical examples of unsupervised mining are clustering and association rule mining.

 

Soft Computing

Usually, the primary considerations of traditional hard computing are precision, certainty, and rigor. In contrast, the principal notion in soft computing is that precision and certainty carry a cost and that computation, reasoning, and decision making should exploit (wherever possible) the tolerance for imprecision, uncertainty, approximate reasoning, and partial truth for obtaining low-cost solutions.

 

The corresponding facility in humans leads to the remarkable human ability to understand distorted speech, decipher sloppy handwriting, comprehend the nuances of natural language, summarize text, recognize and classify images, drive a vehicle in dense traffic, and, more generally, make rational decisions in an environment of uncertainty and imprecision. The challenge, then, is to exploit the tolerance for imprecision by devising methods of computation that lead to an acceptable solution at low cost.

 

Soft computing is a consortium of methodologies that work synergistically and provide, in one form or another, flexible information processing capability for handling real-life ambiguous situations. Its aim is to exploit the tolerance for imprecision, uncertainty, approximate reasoning, and partial truth in order to achieve tractability, robustness, and low-cost solutions. The guiding principle is to devise methods of computation that lead to an acceptable solution at a low cost by seeking an approximate solution to an imprecisely or precisely formulated problem.

 

Unlike soft computing, the traditional hard computing deals with precise computation. The rules of hard computing are strict and binding; as inputs, outputs, and procedures are all clearly defined, it generates the same precise answers without any degree of uncertainty—every time that the procedure is applied. Unless the rules or procedures are changed, the output result would never change.

 

The characteristics of soft computing can be contrasted with those of traditional hard computing. The main constituents of soft computing include the following:

  • Artificial neural networks (ANNs)
  • Fuzzy logic and fuzzy inference systems
  • Evolutionary and genetic algorithms
  • Rough sets
  • Signal processing tools such as wavelets

 

Though each of them contributes a distinct methodology for addressing problems in its domain, they are complementary to each other and can be blended effectively. The result is a more intelligent and robust system providing a human-interpretable, low-cost, approximate solution, as compared to traditional techniques.

 

There is no universally best soft computing method; choosing particular soft computing tool(s) or some combination with traditional methods is entirely dependent on the particular application, and it requires human interaction to decide on the suitability of a blended approach.

 

Fuzzy sets provide a natural framework for dealing with uncertain or imprecise data. Generally, they are suitable for handling issues related to the understandability of patterns, incomplete and noisy data, mixed-media information, and human interaction, and they can provide approximate solutions faster.

 

ANNs are nonparametric and robust and exhibit good learning and generalization capabilities in data-rich environments. Genetic algorithms (GAs) provide efficient search algorithms to optimally select a model, from mixed media data, based on some preference criterion or objective function.

 

Rough sets are suitable for handling different types of uncertainty in data. Neural networks and rough sets are widely used for classification and rule generation. Application of wavelet-based signal processing techniques is new in the area of soft computing. Wavelet transformation of a signal results in decomposition of the original signal in different multiresolution subbands.

 

This is useful in dealing with compression and retrieval of data, particularly images. Other approaches like case-based reasoning and decision trees are also widely used to solve data mining problems.

 

Artificial Neural Networks

The human brain is composed of an ensemble of millions of small cells or processing units called neurons that work in parallel. Neurons are connected to each other via neuron connections called synapses. A particular neuron takes its input from a set of neurons; it then processes these inputs and passes on the output to another set of neurons. The brain as a whole is a complex network of such neurons in which connections are established and broken continuously. ANNs have resulted from efforts to imitate the functioning of the human brain.

 

Like the human brain, ANNs are able to learn from history or sample data; they learn by repeating the learning process for a number of iterations, and performance improves with every completed iteration. Once trained, ANNs can reproduce the same output whenever the same input is applied.

 

The precision and correctness of the answer depend on the learning and the nature of the data given: Sometimes, an ANN may be able to learn even complex data very quickly, while at other times, it may refuse to learn from another set of data. The precision depends on how well the ANN was able to learn from the presented data.

 

Once ANNs have learned from historical or sample data, they have another extraordinary capability that enables them to predict the outputs of unknown inputs with quite high precision. This capability, known as generalization, results from the fact that ANNs can emulate almost any type of simple or complex function. This gives ANNs the power to model almost any problem that is encountered in the real world.

 

Fuzzy Systems

In fuzzy sets, a member need not have full membership of the set but rather has a degree of belongingness to it. This degree is termed the membership degree, and the function that determines this degree of belongingness is called the membership function. This function associates each member of the set with a degree of membership: The higher this degree, the more strongly the member is a part of the set.

 

Fuzzy systems are implemented by a set of rules, called fuzzy rules, which may be defined on the set. A nonfuzzy system has a very discrete way of dealing with rules: Either a rule fires fully or it does not fire at all, depending on the truth of the expression in the specified condition.

 

However, in the case of fuzzy rules, since the rule is true or false only to a degree, the rule fires to this degree of trueness or falseness. The output of all the rules is aggregated to get the system’s final output.

 

The general procedure of working with fuzzy inference systems consists of the following:

 

  1. Model the crisp inputs
  2. Apply membership functions to obtain the fuzzified inputs
  3. Apply the rules over these inputs to generate the fuzzy outputs
  4. Aggregate the various outputs
  5. Defuzzify the aggregated output to produce the crisp output
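A compact sketch of these five steps for a single-input, single-output system follows; the temperature example, the membership functions, the two rules, and the weighted-average defuzzification are invented purely for illustration of the mechanics.

```python
def tri(x, a, b, c):
    """Triangular membership function peaking at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

# 1. Crisp input: room temperature in degrees Celsius
temperature = 27.0

# 2. Fuzzification: degrees of membership in "warm" and "hot"
warm = tri(temperature, 15, 22, 30)
hot = tri(temperature, 25, 35, 45)

# 3. Rules fire to the degree their condition is true:
#    IF warm THEN fan speed is medium (centered at 50)
#    IF hot  THEN fan speed is high   (centered at 90)
rules = [(warm, 50.0), (hot, 90.0)]

# 4./5. Aggregate the rule outputs and defuzzify with a weighted average (centroid-style)
total_strength = sum(strength for strength, _ in rules)
fan_speed = sum(strength * value for strength, value in rules) / total_strength

print(f"warm={warm:.2f}, hot={hot:.2f}, fan speed={fan_speed:.1f}")
```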

 

 

Evolutionary Algorithms

An evolutionary algorithm (EA) is inspired by the success of Darwin’s theory of natural selection, summarized by “survival of the fittest.” Natural evolution results from the fusion of male and female chromosomes to generate one or more offspring that have a blend of characteristics from both the male and female counterpart. The offspring may be weaker or fitter than the participating parents and survives to the degree of its fitness to the surrounding conditions and environment. This process of evolution leads to improvement from one generation to the next.

 

EAs work in a similar way. The process is initiated with the generation of a random set of solutions to a given problem resulting in a population of such solutions. These “individual” solutions constituting the population are made to participate in an evolutionary process: From these solutions, a few individuals with high fitness are chosen based on a predetermined method termed as “selection.”

 

These pairs of individuals are then made to generate offspring, guided by a process termed “crossover.” The system then randomly shortlists some of the newly generated solutions and adds new characteristics to them through another predefined process termed “mutation”; similar operations are performed on the shortlisted solutions. The fitness of each of these mutated solutions is judged by another function termed the “fitness function.”

 

As the process cycles through newer and newer generations, the quality of the selected solutions continues to improve. This improvement is very rapid over the first few generations but slows down with later generations.
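A compact sketch of selection, crossover, and mutation on a toy fitness function (maximizing the number of 1-bits in a string) is given below; the population size, generation count, and mutation rate are arbitrary values chosen only to make the loop visible.

```python
import random

random.seed(0)
GENES, POP_SIZE, GENERATIONS = 20, 30, 40

def fitness(individual):
    # Toy objective: maximize the number of 1-bits
    return sum(individual)

population = [[random.randint(0, 1) for _ in range(GENES)] for _ in range(POP_SIZE)]

for _ in range(GENERATIONS):
    # Selection: keep the fitter half of the population
    population.sort(key=fitness, reverse=True)
    parents = population[: POP_SIZE // 2]

    # Crossover: combine pairs of parents to produce offspring
    offspring = []
    while len(offspring) < POP_SIZE - len(parents):
        mom, dad = random.sample(parents, 2)
        cut = random.randint(1, GENES - 1)
        offspring.append(mom[:cut] + dad[cut:])

    # Mutation: randomly flip an occasional gene in the offspring
    for child in offspring:
        if random.random() < 0.2:
            pos = random.randrange(GENES)
            child[pos] = 1 - child[pos]

    population = parents + offspring

print("Best fitness after evolution:", fitness(max(population, key=fitness)))
```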

 

Rough Sets

The purpose of rough sets is to discover knowledge in the form of business rules from imprecise and uncertain data sources. The rough set theory is based on the notion of indiscernibility and the inability to distinguish between objects and provides an approximation of sets or concepts by means of binary relations, typically constructed from empirical data.

 

As an approach to handling imperfect data, rough set analysis complements other more traditional theories such as probability theory, evidence theory, and fuzzy set theory. The intuition behind the rough set approach is the fact that in real life when dealing with sets, we often have no means of precisely distinguishing individual set elements from each other due to limited resolution (lack of complete and detailed knowledge) and uncertainty associated with their measurable characteristics.

 

The rough set philosophy is founded on the assumption that we associate some information (data and knowledge) with every object of the universe of discourse. Objects, which are characterized by the same information, are indiscernible in view of the available information about them.

 

The indiscernibility relation generated in this way is the mathematical basis for rough set theory. Any set of all indiscernible objects is called an elementary set and forms a basic granule of knowledge about the universe. Any set of objects that is a union of some elementary sets is referred to as a crisp (precise) set; otherwise, the set is rough (imprecise or vague). Consequently, each rough set has boundary-line cases (i.e., objects) that cannot be classified with complete certainty as members of the set.

 

The general procedure for conducting rough set analysis consists of the following:

  • Data preprocessing
  • Data partitioning
  • Discretization
  • Reduct generation
  • Rule generation and rule filtering
  • Applying the discretization cuts to the test dataset
  • Scoring the test dataset on the generated rule set (and measuring the prediction accuracy)
  • Deploying the rules in a production system
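The indiscernibility idea described above translates directly into code: objects with identical values on the chosen condition attributes fall into the same elementary set, and a target concept is then approximated from below and above by those sets. The toy decision table below is invented, and the attribute names are hypothetical.

```python
from collections import defaultdict

# Hypothetical decision table: condition attributes and a decision (target) class
objects = [
    {"id": 1, "age": "young", "income": "low",  "buys": "no"},
    {"id": 2, "age": "young", "income": "low",  "buys": "yes"},
    {"id": 3, "age": "old",   "income": "high", "buys": "yes"},
    {"id": 4, "age": "old",   "income": "high", "buys": "yes"},
    {"id": 5, "age": "old",   "income": "low",  "buys": "no"},
]

# Elementary sets: objects indiscernible on the condition attributes
attributes = ("age", "income")
elementary = defaultdict(set)
for obj in objects:
    elementary[tuple(obj[a] for a in attributes)].add(obj["id"])

concept = {o["id"] for o in objects if o["buys"] == "yes"}   # the set to approximate

# Lower approximation: elementary sets entirely inside the concept (certain members)
lower = set().union(*[e for e in elementary.values() if e <= concept])
# Upper approximation: elementary sets that overlap the concept (possible members)
upper = set().union(*[e for e in elementary.values() if e & concept])

print("Elementary sets:", dict(elementary))
print("Lower approximation:", lower, "Upper approximation:", upper)
```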

 

Introduction to Data

In the 21st century, human activity is more and more associated with data. Being so widespread globally, the Internet provides access to vast amounts of information. Concurrently, advances in electronics and computer systems allow recording, storing, and sharing the traces of many forms of activity. The decreasing cost per unit of data for recording, processing, and storage enables easy access to powerful, well-connected equipment.

 

As a result, nowadays, there is an enormous and rapidly growing amount of data available in a wide variety of forms and formats. Apart from data concerning new activity, more and more previously inaccessible resources in electronic format, such as publications, music, and graphic arts, are being digitized.

 

The availability of vast amounts of data aggregated from diverse sources caused an evident increase of interest in methods for making sense of large data quantities and extracting useful conclusions. The common bottleneck among all big data analysis methods is structuring. Apart from the actual data content, analysis typically requires extra information about the data of various levels and complexities, also known as metadata. For example, consider a newspaper article.

 

The data content consists of the title, the text, and any images or tables associated with it. Simple metadata would contain the name of the author, the date that the article was published, the newspaper page in which the article was printed, and the name of the column that hosted the article.

 

Other metadata could concern statistics, such as the number of pages that this article covers and the number of paragraphs and words in it; indexing information, such as the unique identification number of this article in the newspaper’s storage database; or semantics, such as the names of countries, people, and organizations that are discussed in this article or are relevant.

 

In addition, semantic metadata might contain more complex types of information, such as the subtopics that are discussed in the article, the paragraphs in which each subtopic is addressed, and the sentences that report facts vs. sentences that express the opinion of the author.

 

In the above example, it is evident that the metadata corresponding to this kind of data only, i.e., newspaper articles, can be of many diverse types, some of which are not known a priori but are specific to the type of further processing and applications that use this data. Considering the diversity of available data types, one can imagine the diversity of metadata types that could be associated with it.

 

Most of the data available are not ready for direct processing, because they are associated with limited or no metadata. Usually, extra preprocessing for adding structure to unstructured data is required before it is usable for any further purpose.

 

The remainder of this blog discusses a number of data sources, focusing on textual ones. Then, an overview of the methods for structuring unstructured data by means of extracting information from the content is presented. In addition, moving toward a more practical view of data structuring, we discuss a multitude of examples of applications where data structuring is useful.

 

SOURCES OF UNSTRUCTURED TEXTUAL DATA

TEXTUAL DATA

One of the major means of human communication is text. Consequently, text is one of the major formats of available data, among others such as recorded speech, sound, images, and video. Textual data sources can be classified along a variety of dimensions, such as domain, language, and style.

 

The domain of a text represents the degree to which specialized and technical vocabulary is used in it. This is a very important attribute of text, because the senses of some words, especially technical terms, depend on it. For example, the meaning of the word lemma is different in the domain of mathematics and the domain of linguistics.

 

In mathematics, lemma is a proven statement used as a prerequisite toward the proof of another statement, while in linguistics lemma is the canonical form of a word. A piece of text can belong to a specialized technical or scientific domain, or to the general domain, in the absence of a specialized one.

 

Arguably, the general domain is not entirely unified. Even when discussing everyday concepts, a certain level of technical vocabulary is used. For example, consider an everyday discussion about means of transport or about cooking. The former might contain terms such as train, platform, ticket, bus, and sedan, while the latter might contain terms such as pot, pan, stove, whip, mix, bake, and boil. The union of general words plus widely known and used technical terms comprises the general domain.

 

Language is an important feature of text for a variety of reasons. Mainly, it is one of the very few general hard-classification features of text, and it radically affects the methods of text analysis and information mining. Hard-classification features are properties of text that can be used to partition a collection of items into nonoverlapping item sets.

 

Language is a hard-classification feature in a collection of monolingual documents; i.e., each document can be assigned to a single language only. Moreover, the language of a text or textual collection dictates the methods that can be applied to analyze it, since for some languages adequate resources exist for many domains, while other languages are much less exploited or not exploited at all.

 

It should also be noted that a document collection can possibly be multilingual; i.e., some of its parts may be in a different language than others. Multilingual collections usually consist of parallel documents; i.e., each document is accompanied by one or more translations of its contents in other languages. Another kind of multilingual collections is comparable documents, where each document corresponds to one or more documents in other languages that are not necessarily precise translations of it, but just similar.

 

Another equally important property of text is style, ranging from formal and scientific to colloquial and abbreviated. Text in different styles often uses different vocabularies and follows syntax and grammar rules more or less strictly. For example, the style of text in a scientific article is usually formal and syntactically and grammatically complete.

 

In contrast, transcribed speech might be elliptical, probably missing some subjects or objects. In recent years, due to the proliferation of social networking websites, a new, very condensed and elliptical text style has emerged.

 

Text domain, language, and style are properties orthogonal to each other. In other words, there exists text characterized by any combination of values for these properties. This large space of possible combinations is indicative of the variety of unstructured text available. Below, a number of textual sources are introduced and briefly discussed.

 

Patents

Patents are agreements between a government and the creator of an invention granting him or her exclusive rights to reproduce, use, and sell the invention for a set time period. The documents associated with these agreements, also called patents, describe how the invention works, what it is useful for, what it is made of, and how it is made. From a text analysis point of view, patents are challenging documents.

 

First, although there are several patent collections and search applications available, most of the documents are available as raw, unstructured text. Identifying the parts of the documents that refer to different aspects of a patent, such as its purpose, impact, potential uses, and construction details, would constitute a basic structuring step, which would, in turn, be essential for further processing toward building applications and extracting meaningful conclusions.

 

Second, since a patent document addresses various aspects of an invention, the entire text is not of a single domain. Third, patent documents usually contain tables and figures, which should be recognized and separated from the textual body before any automatic processing.

 

Publications

Publications

Books, journal articles, and conference proceedings contributions comprise the source of text that has been exploited most via automatic text analysis methods. There are several reasons for this. Due to the diversity of scientific and technical publications available, this type of textual data is easy to match with any domain of application. Moreover, publications offer a natural level of universal structuring: title, abstract, and sections, which most of the time include method, results, and conclusion sections.

 

Some of the available publications offer some extra structuring within the abstract section into further subsections. In addition, many publications come with author-specified keywords, which can be used for indexing and search. In scientific publications, new research outcomes are introduced and discussed. As a result, publications are an excellent source for mining neologisms, i.e., new terms and entities.

 

Corporate Web pages

Corporate Web pages are a much less exploited source of textual data, due to the difficulties in accessing and analyzing text. Web pages currently online may combine different development technologies and also may follow any structuring format.

 

This variation restricts the ability to develop a universal mechanism for extracting clean text from web pages and dictates building a customized reader for each corporate website or each group of similar websites. Similarly, tables and figures can be represented in a variety of formats; thus, separate readers are necessary to extract their exact content.

 

Despite these difficulties, companies are very interested in tracking the activity of other companies active in the same or similar business areas. The results of analyzing the web pages of competitors can be very useful in planning the future strategy of a company in all aspects, such as products, research and development, and management of resources.

 

Blogs

Blogs

Blogs are online, publicly accessible notebooks, analogous to traditional notice boards or diaries. Users, also known as bloggers, can publish their opinions, thoughts, and emotions, expressed in any form: text, images, and video. Other users can comment on and discuss the published documents. Due to the freedom of expression associated with blogs, there is considerable and growing interest in analyzing blog articles to extract condensed public opinion about a topic.

 

Text in blogs is much easier to access than text in the corporate pages discussed previously, because the vast majority of blogs are hosted on a small number of blog sites, such as Wordpress, Blogger, and Tumblr. Blog articles vary widely in domain, language, and style.

 

Social Media

Social Media

Social media, such as Twitter, Facebook, and Google+, allow users to briefly express and publish their thoughts, news, and emotions to all other users or to groups of users that they participate in. Different media take different approaches to the groups of friends or interest groups associated with each user.

 

Apart from publishing, each user is able to read the updates of users in the groups he or she participates in, comment on them, or simply express an emotion about them, e.g., the Facebook “like.” Moreover, users can republish an update of another user so that it is visible to more people (Facebook “share,” Twitter “retweet”). The content of user updates has different characteristics in the various media. For example, on Facebook, users can post text, links, images, and videos, while on Twitter each post is restricted to 140 characters, including posted links.

 

Mining social media text is a relatively new trend in the field of text processing. Strong interest has emerged due to the increasing popularity of social media. Companies consider the habits, preferences, and views of users very important for improving the products and services they offer and for designing new products. Text in social media is written in many languages; however, languages spoken in countries where the Internet is not broadly used are less common.

 

There are several challenges relevant to social media text analysis. First, due to the multitude of domains that can be observed in social media, it is challenging to determine which pieces of text are relevant to a domain. In contrast to publications and blogs, text in social media is often relevant to a number of different domains.

 

Second, the style of language in social media is significantly different from the style of any other type of text. Due to the length restrictions of posts, and also to increase typing speed, a very elliptical text style has evolved, embracing all sorts of shortenings: emoticons, combinations of symbols with a special meaning (e.g., “xxx” and “<3”), words shortened to homophone letters (e.g., “cu” standing for “see you”), a new set of abbreviations (e.g., “lol,” “rofl,” and “omg”), and others. Moreover, spelling mistakes and other typos are more frequent in social media text than in other, more formal types of text.

 

Third, and most importantly, the entire body of text published in social media is very impractical to process due to its immense size. Instead, text relevant to some topic is usually obtained by filtering either stored posts or the online stream of posts as they are being published. Filtering can take advantage of the text itself, by using keywords relevant to the topic of interest, or of any accompanying metadata, such as the name of the author.
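As a simple illustration of keyword-based filtering, the Python sketch below keeps only the posts that mention at least one keyword of interest; the keyword list and example posts are invented for illustration.

TOPIC_KEYWORDS = {"pipeline", "energy", "oil"}

def is_relevant(post_text, keywords=TOPIC_KEYWORDS):
    # Keep a post if any of its tokens matches a topic keyword.
    tokens = {token.strip(".,!?#@").lower() for token in post_text.split()}
    return bool(tokens & keywords)

stream = [
    "New #pipeline project announced today",
    "Lovely weather this weekend",
    "Energy prices keep climbing again",
]
relevant_posts = [post for post in stream if is_relevant(post)]
print(relevant_posts)  # keeps the first and third posts

In practice the same predicate could be applied to a live stream of posts, or combined with metadata filters such as author name or location.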

 

News

News

Newswire and newspaper articles comprise a source of text that is easy to process. Usually, text can be downloaded from news websites or feeds, i.e., streams of newswire posts structured in a standard manner. In addition, the text is carefully written in a formal or colloquial style with very few or no typos, grammar or syntax mistakes, or elliptical constructions.

 

The domain of articles is often clearly specified. If not, the title and keywords, if available, can be helpful in resolving it. Stored newswire and newspaper articles can be very useful to historians, since they are the trails of events, as they take place. In addition, advances and trends in economy and politics can be invaluable to business administration, management, and planning.

 

Online Forums and Discussion Groups

Forums

Online forums and discussion groups provide a good source for strictly domain-specific text. Most forums and discussion groups offer programmatic access facilities, so text can be downloaded easily. However, forums and groups are available only for a limited set of languages and topics and are usually addressed to a specialized scientific or technical audience. The style of the text can vary widely and depends on the topic discussed.

 

Technical Specification Documents

Technical specifications are lengthy, domain-specific documents that describe the properties and function of special equipment or machinery or discuss plans for industrial facilities. Technical specification documents are written in a formal style and, due to their nature, are excellent sources of technical terminology. However, only a very limited number of older technical specification documents have been digitized.

 

Newsgroups, Mailing Lists, Emails

Emails

Newsgroups are electronic thematic notice boards that were very popular in the 1990s and 2000s. Users interested in the topic of a newsgroup are able to read notices posted by any member of the group via specialized applications called newsgroup readers. Newsgroups are predominantly supported by a specialized communication protocol called the Network News Transfer Protocol (NNTP).

 

This protocol is used by newsgroup readers and also provides an interface that allows reading newsgroup contents programmatically. Newsgroups mainly contain domain-specific textual articles of formal or colloquial style.

 

Similarly to newsgroups, mailing lists are collections of email addresses of people interested in a topic. Users can publish emails related to the topic of the mailing list, informing other users about news, asking questions, or replying to the questions of other users. Access to the content of mailing lists can be implemented easily since many mailing lists are archived and available online.

 

Apart from emails sent to specific mailing lists, emails in general can also be a valuable source of text. However, general emails are much more challenging to analyze than emails sent to mailing lists. Since general emails are not restricted to a specific topic, the domain of the text is not known beforehand and should be captured during processing. Similarly to social media text, emails relevant to a specific topic can be retrieved from an email collection by filtering.

 

Moreover, the text style in emails can range from formal and professional to really personal, condensed, and elliptical. Last but not least, access to general email can only be granted by approval of the administrator of the server where the corresponding email account is hosted or by the actual email account user.

 

Lately, several collections of anonymized email text have been made available. Anonymization, apart from removing names, usually refers to the removal of other sensitive information, such as company names, account numbers, identity numbers, insurance numbers, etc.

 

Legal Documentation

Legal Documentation

Lately, most countries and states make laws and other legal documentation available. Minutes of many governments, committees, and unions are also indexed online, usually after a standard period of time, 5 or 10 years. For example, the Europarl Corpus is a parallel corpus extracted from the proceedings of the European Parliament that includes versions in 21 European languages.

 

Legal documentation is important for text processing as a large, domain-specific textual source for tasks such as topic recognition, extracting legal terminology, and events and others. In addition, legal documentation that comes with parallel translation is important for machine translation, multilingual term extraction, and other tasks that draw statistics on aligned text in more than one language.

 

Wikipedia

Wikipedia, as a large, general-purpose online encyclopedia, is a valuable source of text, offering several handy properties for text processing. Wikipedia covers many, if not all, technical and scientific domains and also contains lemmas of the general domain. Moreover, a subset of Wikipedia articles relevant to a specific domain can be easily retrieved by choosing entries whose titles contain terms of that particular domain.

 

Moreover, Wikipedia comes with an interface that allows accessing its clean textual content programmatically. It covers a multitude of languages; however, it is not equally complete for all of them. The style of language is formal or colloquial. A unique disadvantage of the quality of Wikipedia text stems from the fact that any user, without prior certification, can submit articles or amendments to existing articles. An inherent feature of Wikipedia, called “Featured articles,” can be used to select high-quality articles, at the cost of a massive loss of coverage.

 

STRUCTURING TEXTUAL DATA

STRUCTURING TEXTUAL DATA

In the previous section, a number of diverse textual resources have been described with an emphasis on their fundamental linguistic properties: domain, language, and style. Some hints about the possible uses of the resources have been provided. To a further extent, some common examples of usage scenarios will be discussed in the “Applications” section. In this section, a few typical textual processing stages are introduced. The output of these processing stages can be applied to many domains and for various purposes.

 

Term Recognition

Terms are words or sequences of words that verbally represent concepts of some specific domain of knowledge, usually scientific or technical. In other words, terms are lexical items closely related to a subject area, and their frequency in this area is significantly higher than in other subject areas.

 

For example, some terms of the finance and economy domain are inflation, interest rate, bonds, and derivatives, while some terms of the biology domain are molecule, protein, and genetic code. Recognizing terms in text can be useful for many processing tasks. For example, neologisms in a domain, i.e., newly emerging terms, can designate advances in it. In addition, indexing using terms instead of just words can improve search performance.

 

Term recognition is the task of locating terms in domain-specific text collections. Approaches to term recognition can be classified as linguistic, dictionary-based, statistical, and hybrid, depending on the different types of information that they consider. Linguistic approaches use morphological, grammatical, and syntactical knowledge to identify term candidates. Dictionary-based approaches employ various readily available repositories of known term representations, such as ontologies.

 

Statistical approaches refer to the applications of various statistical tools, which receive as input frequency counts of words and sequences of words, co-occurrences of words, and features that capture the context of words or sequences of words, i.e., words that occur frequently before or after the target ones.

 

To provide a practical example of a simple term recognizer, we can consider a combination of part-of-speech filtering and frequency filtering. To capture term candidates that consist of adjectives and nouns, we apply a regular expression pattern to the output of a part-of-speech tagger. Then, we compute the frequencies of the term candidates and accept those that survive a given frequency threshold. This process is also described in the pseudocode below.

 

Algorithm: A Simple Term Recognizer

Input: A textual document t.

Output: A list of terms, ordered according to the frequency of occurrence.

1. Pass text t to a part of speech tagger, and store its output, PoS(t).

2. Apply the regular expression (adj | noun)* noun+ to PoS(t), to identify TC, a set of n term candidates.

3. Filter out terms in TC whose frequency is lower than 2 (or any other prespecified threshold value).

4. Store the terms in TC in a list, LTC, in decreasing order of frequency.

5. Return LTC.

 

Let’s apply this simple term recognizer to a snippet taken from an answer of spokesman Josh Earnest to the press:

 

There is a range of estimates out there about the economic impact of the pipeline, about how this pipeline would have an impact on our energy security. There are also estimates about how this pipeline may or may not contribute to some environmental factors. So there are a range of analyses and studies that have been generated by both sides of this debate.

 

Applying the regular expression of step 2 to the above quote retrieves the term candidates below; the corresponding frequencies are shown in parentheses: range (2), estimate (2), economic impact (1), pipeline (3), impact (1), energy security (1), environmental factor (1), analysis (1), study (1), side (1), debate (1). Since the frequency threshold specified in step 3 is 2, the algorithm will output the following list of terms: [pipeline (3), estimate (2), range (2)].
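A rough Python implementation of this recognizer could look like the sketch below, using NLTK as one possible part-of-speech tagger (this assumes the nltk package and its tokenizer and tagger models are installed). Unlike the worked example above, the sketch does not lemmatize, so plural forms such as estimates are counted separately.

from collections import Counter
import nltk

def recognize_terms(text, min_freq=2):
    # Step 1: tokenize and part-of-speech tag the text.
    tagged = nltk.pos_tag(nltk.word_tokenize(text.lower()))
    # Step 2: (adj | noun)* noun+ -- adjectives/nouns followed by at least one noun.
    grammar = nltk.RegexpParser("TC: {<JJ.*|NN.*>*<NN.*>+}")
    tree = grammar.parse(tagged)
    candidates = Counter(
        " ".join(word for word, tag in subtree.leaves())
        for subtree in tree.subtrees(filter=lambda t: t.label() == "TC")
    )
    # Steps 3-5: keep candidates meeting the threshold, most frequent first.
    return [(term, freq) for term, freq in candidates.most_common() if freq >= min_freq]

snippet = ("There is a range of estimates out there about the economic impact "
           "of the pipeline, about how this pipeline would have an impact on "
           "our energy security.")
print(recognize_terms(snippet, min_freq=1))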

 

Named Entity Recognition

Named Entity Recognition

Named entities are terms associated with a specific class of concepts, i.e., a category, a type or kind of objects or things. For example, entity classes in the news domain are usually people, locations, and organizations. Some examples of biomedical entity classes are genes, proteins, organisms, and malignancies. The notion of named entities is very similar to the notion of terms. However, named entities cannot be defined independently of the corresponding named entity classes, while terms do not need to be classified.

 

Mostly, named entity recognizers use ontologies as their background knowledge. Ontologies are domain-specific classifications of concepts in classes. Each concept represents the notion of an object or thing and can usually be expressed in a variety of different verbal sequences. For example, in the domain of computers, the concept “hard disk drive” can be expressed as Winchester drive, hard drive, or just disk. Some ontologies have a tree-like structure, such that some classes are nested into other broader classes.

 

Named entity recognition methods attempt to recognize named entities of prespecified types in text and decide the type that they correspond to. Baseline approaches do a direct matching of named entity realizations in the ontology to the text. Sophisticated approaches attempt to address several issues that hinder error-free recognition, such as

 

  • The variability of named entities that is not covered by the ontology
  • Ambiguity: Some named entities could be assigned to more than one class. For example, Jordan can be the name of a country or of a famous basketball player, and Kennedy can refer to the former U.S. president or to the airport named after him. To disambiguate a named entity, methods typically take into account the context in which it occurs.

 

  • Co-reference: the phenomenon of multiple expressions in a sentence or document referring to the same concept. For example, suppose we have the following text: John walks slowly because he has a bad knee. His son is standing by him. The words he, his, and him all refer to the same named entity, John. Ideally, a named entity recognizer should be able to recognize these pronouns and map them to the named entity, John.

 

To provide an example of a very simple named entity recognition approach, we can use a dictionary of named entities and then perform direct matching on the input text.

 

Algorithm: A Simple Named Entity Recognizer

Input: A textual document t, a dictionary of named entities (NEs).

Output: Text t with named entity annotations.

1. For ne ∈ NEs do {

2.   If ne occurs in t then

3.     Add an annotation in t for named entity ne

}

4. Return text t with named entity annotations
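A runnable sketch of this dictionary-based recognizer is shown below; the dictionary entries and their classes are invented for illustration.

import re

NE_DICTIONARY = {
    "New York": "LOCATION",
    "Josh Earnest": "PERSON",
    "Queens College": "ORGANIZATION",
}

def annotate_entities(text, dictionary=NE_DICTIONARY):
    annotations = []
    for entity, entity_class in dictionary.items():
        # Direct matching of every dictionary entry against the input text.
        for match in re.finditer(re.escape(entity), text):
            annotations.append((match.start(), match.end(), entity, entity_class))
    return sorted(annotations)

text = "Mr. Westney visited New York to present his program at Queens College"
print(annotate_entities(text))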

 

Relation Extraction

Relation Extraction

In the previous section, named entities and a method to recognize them were introduced. The current section is about relations between previously recognized named entities and methods for recognizing them. Semantic relations between named entities can be of various types, depending on the application in which they will subsequently be used. Some examples are

 

Is-a relations: General-domain relations between a named entity of a semantic class and a named entity of a more general semantic class. For instance, from the phrases “car is a vehicle” and “collagen is a protein,” the is-a relation pairs (car, vehicle) and (collagen, protein) can be extracted, respectively.

 

Interacts-with relation: In the biology domain, these relations can be used to spot gene-disease and protein-protein interactions, useful for structuring biomedical documents semantically.

 

A straightforward approach is to match a set of lexical patterns, such as “NE1 is a NE2,” against the text. This method is able to extract a limited number of accurate tuples; in other words, it achieves high precision but low recall. The reason is that patterns match instances strictly, allowing no variation in words other than the named entity slots. Using parts of speech and lemmas instead of surface words can allow some minimal variation.

 

Other approaches, which generalize more flexibly, take into account the parse tree of sentences and check whether specific words or categories of words lie in certain positions. Moreover, machine learners can be applied to this task, based on various features that capture the context of a candidate relation.

 

To illustrate how the pseudocode of a very simple relation extraction component would look, we provide the algorithm below. The algorithm takes as input a document accompanied by named entity annotations and applies a set of patterns to identify is-a relations.

 

Algorithm: A Simple Relation Extractor

Input: A textual document t with named entity annotations, a set of patterns:

P = {“NE1 is a NE2,” “NE1 is a type of NE2,” “NE1, a NE2”}

Output: Text t with named entity and relation annotations.

1. For p ∈ P do {

2.   If p matches in t then

3.     Add an annotation in t for an is-a relation between NE1 and NE2

}

4. Return text t with named entity and relation annotations
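The sketch below approximates this extractor with plain regular expressions; in a real system the patterns would be anchored on already-recognized entity mentions, whereas here single words stand in for the NE slots, an assumption made purely for illustration.

import re

# Patterns corresponding to "NE1 is a NE2", "NE1 is a type of NE2", "NE1, a NE2".
PATTERNS = [
    r"(?P<ne1>\w+) is a (?P<ne2>\w+)",
    r"(?P<ne1>\w+) is a type of (?P<ne2>\w+)",
    r"(?P<ne1>\w+), a (?P<ne2>\w+)",
]

def extract_is_a(text):
    relations = set()
    for pattern in PATTERNS:
        # Every match of a pattern yields one candidate is-a pair.
        for match in re.finditer(pattern, text):
            relations.add((match.group("ne1"), match.group("ne2")))
    return relations

print(extract_is_a("A car is a vehicle and collagen is a protein."))
# {('car', 'vehicle'), ('collagen', 'protein')} (set order is arbitrary)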

 

Event Extraction

Event Extraction

The notion of an event in text mining is very similar to the common-sense notion of an event. Events are complex interactions of named entities, and they have a distinct, atomic meaning, separate from other events. Of course, events might be related, but each of them is complete and independent. Events are of different types and natures in different textual domains, for example:

 

• In the domain of news, events are incidents or happenings that took place or will take place. An event consists of complex relations between named entities that correspond to its various aspects, such as time, place, people involved, etc.

 

For instance, the sentence “Mr. Westney visited New York to present his program at Queens College” expresses a transportation event in the past, triggered by the verb visit. New York is a geopolitical entity (GPE) that plays the role of destination in this event, while the person Mr. Westney holds the subject position.

 

• In the domain of biology, events are structured descriptions of biological processes that involve complex relationships, such as angiogenesis, metabolism, and reaction, between biomedical entities. Events are usually initiated verbally by trigger words, which can be verbs, such as inhibit, or verb nominalizations, such as inhibition. The arguments of events are biomedical entities of specific types, such as genes and proteins, or other events, such as regulation. Events depend highly on the textual context in which they are expressed.

 

Similarly to relation extraction, event extraction can follow a number of simple or more sophisticated methods. Simpler methods recognize a set of trigger words and match specific patterns or apply standard rules. More sophisticated methods attempt to lift the constraints and overcome the shortcomings of the simple approaches.

 

Bootstrapping approaches introduce iterations. They start with a standard set of trigger words and event extraction rules and iteratively expand these sets to recognize more instances. The procedure of enrichment is critical for the overall performance of this type of method.

 

Machine learning approaches encode information about trigger words, named entity components, context, and probably other ad hoc observations as features, and then attempt to learn the ways that these features interact and correlate with the actual events.

 

A trained learner can then be applied to raw text to extract events similar to the ones it was trained on. Machine learning methods usually perform better than the pattern and rule-based methods. However, extra performance comes at the cost of the expensive and laborious task of annotating training data manually.

 

As a simple example of an event extractor, we provide the pseudocode below. The code recognizes transportation events similar to the one presented in the news bullet above.

 

Algorithm: A Simple Event Extractor

Input: A textual document t with named entity annotations,

a set of trigger words: TW = {visit} and

a set of event patterns: P = {“NEperson trigger word NEGPE”}

Output: Text t with event annotations.

1. For p ∈ P do {

2.   For trigger_word ∈ TW do {

3.     If p(trigger_word) applies to t then

4.       Add an annotation in t for a transportation event expressed by NEperson, NEGPE, and trigger_word.

}

}

5. Return text t with named entity and event annotations
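A rough Python rendering of this extractor is given below; it assumes the person and geopolitical entity mentions have already been recognized and are passed in as lists, which is an assumption made for illustration.

import re

TRIGGER_WORDS = {"visited", "visit", "visits"}

def extract_transportation_events(text, person_entities, gpe_entities):
    events = []
    for person in person_entities:
        for trigger in TRIGGER_WORDS:
            for gpe in gpe_entities:
                # Pattern "NE_person trigger_word NE_GPE", separated by whitespace.
                pattern = rf"{re.escape(person)}\s+{trigger}\s+{re.escape(gpe)}"
                if re.search(pattern, text):
                    events.append({"type": "transportation", "person": person,
                                   "trigger": trigger, "destination": gpe})
    return events

sentence = "Mr. Westney visited New York to present his program at Queens College"
print(extract_transportation_events(sentence, ["Mr. Westney"], ["New York"]))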

 

Sentiment Analysis

Sentiment Analysis

Sentiment analysis is the task of assigning scores to textual units, such as sentences, paragraphs, or documents, that represent the attitude of the author with respect to some topic or the overall polarity of the document. Considering a single, coarse-grained sentiment dimension per textual unit, each unit is assigned a single score that represents positive, negative, or neutral sentiment.

 

A more complex model would consider more than one sentiment dimension, such as agreement, satisfaction, and happiness, assigning more than one score per text. Moreover, the context of each text can be considered in more detail, so that a sentiment score is computed for each selected named entity. For example, a user review about a computer might be positive overall, but negative about some components, such as the speakers and the keyboard.

 

Similarly to named entity recognition, relation extraction, and event extraction, sentiment analysis is addressed in a domain-specific manner. In general, positive and negative sentiments are extracted by looking into domain-specific linguistic cues depicting agreement, disagreement, praise, negative slang, etc.

 

Following a simple approach, a dictionary of preselected words associated with scores is used to aggregate the score of longer textual units. More sophisticated approaches can take into account lexical patterns, part of speech patterns, and shallow parsing results. Machine learning is also applicable to this task.

 

The algorithm below presents the pseudocode for a simple sentiment analyzer that considers a small lexicon of four words and aggregates an overall score for an input text. As an application example, feeding in the text “I like apples” would output the score +1, while feeding in “I hate bananas” would output the score –2. Evidently, this simplistic approach would perform badly in many cases. For example, “I don’t like bananas” would output +1, while “I like love and I don’t like hate” would output +2.

 

Algorithm: A Simple Sentiment Analyzer

Input: A textual document t and a lexicon of words associated with polarity scores:

D = {love (+2), like (+1), dislike (–1), hate (–2)}

Output: Text t scored for sentiment analysis.

1. Score = 0

2. For word ∈ D do {

3.   If word occurs in t then

4.     Score = Score + score(word)

}

5. Return t, Score
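The same analyzer can be written as a few lines of Python; following the worked examples above, each occurrence of a lexicon word contributes its score.

LEXICON = {"love": 2, "like": 1, "dislike": -1, "hate": -2}

def sentiment_score(text, lexicon=LEXICON):
    score = 0
    for token in text.lower().split():
        # Aggregate the polarity of every lexicon word occurring in the text.
        score += lexicon.get(token.strip(".,!?"), 0)
    return score

print(sentiment_score("I like apples"))         # +1
print(sentiment_score("I hate bananas"))        # -2
print(sentiment_score("I don't like bananas"))  # +1, showing the negation weakness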

 

APPLICATIONS

APPLICATIONS

This section describes several applications of structuring text in various domains, for diverse potential users and further usage.

 

Web Analytics via Text Analysis in Blogs and Social Media

Web analytics focuses on collecting, measuring, analyzing, and interpreting web data in order to improve web usage and effectiveness for the interests of a person or an organization. Web analytics comprises an important tool for market research, business development, and measuring the effect of advertisement and promotional campaigns.

 

Companies show increasing interest in monitoring the opinion of consumers about the products and services they offer. They are interested in knowing the effect of their actions both in general and on particular consumer groups.

 

Methods of web analytics can be classified as on-site and off-site. On-site methods consider statistics that can be collected from the target company website itself. Such statistics refer to traffic per time unit, the number of visitors, and page views. Off-site web analytics methods concern measuring the impact of the activities of a company or an organization from web resources other than the website of that company or organization. Usually, web resources suitable for this purpose are online forums, blogs, electronic commerce websites, and social media.

 

Sentiment analysis is the most important tool in the process of assessing the attitude of posts toward products and services. As discussed above, sentiment analysis can be detailed enough to fit the analysis needs of the user. However, the more detailed a sentiment analysis system is, the more training data are needed to achieve adequate performance. Apart from sentiment analysis, considering the terms that occur in posts related to the target organization, product, or service can give evidence about the concepts that users consider related.

 

Recognizing the types of these terms, i.e., identifying them as named entities, would indicate how users think of the target products or services in comparison to other, probably competitive ones. Other metadata associated with social media posts, such as the author profiles, are useful to compile an analysis of the characteristics of users interested in the target organization, product, or service.

 

Often, companies are mainly interested in the ages and lifestyles of their customers. For the latter, the overall activity of users that posted about the target organization, product, or service should be analyzed.

 

Sentiment analysis and term recognition in conjunction with an analysis of user profiles can produce valuable decision-making results. Managers will be able to observe the levels of user satisfaction per profile attributes, such as age, gender, and location. Moreover, decisions can be based on the terms that occur in discussions of users per location or age group and the corresponding term importance scores.

 

Linking Diverse Resources

For many domains, it is meaningful to merge together information coming from different resources. Examples from the domains of news and medicine are discussed below as indicative.

 

For a variety of interest groups and organizations the ability to observe the effect of advances in politics and economy as well as other decisions, events, and incidents is of invaluable importance. Having access to public opinion dynamics can be considered a form of immediate feedback, helpful in politics to form actions, measures, and policies. This aggregated knowledge can be compiled by linking together information for the same topic, coming from different resources: news articles, blogs, and social media. The process can be organized in four steps:

 

  • 1. Clustering together news articles about the same topic
  • 2. Retrieving blog and social media posts related to the topic of each cluster
  • 3. Analyzing the opinion and sentiment in these posts
  • 4. Aggregating sentiment and opinion mining outcomes per topic to be presented to the user

 

The entire process requires preprocessing all text to recognize terms, entities, relations, and events. These metadata should be considered as features for clustering (first step), and also for constructing queries to retrieve relevant blog and social media posts in the second step. This analysis is valuable for decision-makers as a feedback mechanism. For each decision made, they can observe the impact on the public or specific interest group in terms of sentiment, satisfaction, and opinions.

 

In the medical domain, while running a clinical trial, it is particularly laborious to locate patients that can participate by taking into account the eligibility criteria specified. This task can be automated up to some level to help doctors select matching candidates. Structured eligibility criteria of a clinical trial can be cross-checked with structured medical information from electronic health records of patients.

 

The structuring necessary for this task refers to named entity recognition in both resources and identification of corresponding numerical indications and levels. Structuring will enable doctors to select patients that fulfill certain eligibility criteria automatically.

 

Search via Semantic Metadata

Searching for documents with specific characteristics in large collections can be very tedious and costly. Document metadata can be used to locate such documents much more efficiently. Semantic metadata, such as named entities, relations, and events, can contribute toward this purpose in addition to the standard metadata accompanying the document.

 

For example, in the newswire and newspaper domain, an article might be associated with inherent metadata such as the name of the author, the name of the column it was published in, and the date of publication, but named entities, relations among them, and events can also be computed from the textual content of the article, as discussed in the “Structuring Textual Data” section.

 

As an example, suppose we have the New York Times article titled “New York May Use Money to Aid Children.”* It was written by Raymond Hernandez, and published on Sunday, June 22, 1997, on the fourth column of page 14. The article belongs to the National Desk and is a part of the Tobacco Industry collection. All this information consists of the inherent metadata of the article.

 

(Figure: a pipeline from unannotated documents through term extraction, named entity recognition, and event identification to semantically annotated documents.)

 

Term extraction applied on the content of this article can identify terms such as children, tobacco industry, industry, the federal government, attorney general, Richard Blumenthal, Connecticut, Dennis C. Vacco, Christine Todd Whitman, Friday night, George E. Pataki, health-care, and healthcare expert.

 

Metadata values for a specific kind of metadata, e.g., author name, can be presented collectively in the search environment to be used as search facets. This allows users to search the document collection in a different way than standard textual queries. They will be able to select a single value or a set of values for one or more metadata types and obtain as a result the documents that are associated with these specific data values only. Semantic metadata can be used in the very same way.

 

For example, users can retrieve all news articles that were published on pages 10–20 of some newspaper and contain the geopolitical named entity Washington, a demonstration event, and a movement event whose subject is U.S. President Obama. Using annotations as search tools improves search because it allows users to easily locate documents with specific characteristics in large document collections.
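A minimal sketch of such faceted filtering is shown below; the document records, facet names, and values are invented for illustration.

documents = [
    {"id": 1, "page": 14, "entities": {"Washington", "Obama"}, "events": {"movement"}},
    {"id": 2, "page": 3,  "entities": {"New York"},            "events": {"demonstration"}},
    {"id": 3, "page": 12, "entities": {"Washington"},          "events": {"demonstration", "movement"}},
]

def facet_search(docs, page_range=None, entities=None, events=None):
    results = docs
    if page_range:
        # Keep only documents printed within the requested page range.
        results = [d for d in results if page_range[0] <= d["page"] <= page_range[1]]
    if entities:
        # Require all selected entity facet values to be present.
        results = [d for d in results if entities <= d["entities"]]
    if events:
        # Require all selected event facet values to be present.
        results = [d for d in results if events <= d["events"]]
    return [d["id"] for d in results]

# Articles on pages 10-20 that mention Washington and contain a movement event.
print(facet_search(documents, page_range=(10, 20),
                   entities={"Washington"}, events={"movement"}))  # [1, 3]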

 

Using metadata values as search facets can also be very valuable in many domains other than news. In the biological domain, it is crucial for researchers to be able to retrieve all publications that refer to specific relations and events. In the medical domain, within the procedure of running a clinical trial, doctors are obliged to retrieve and take into account all related previous clinical trials.

 

Search using metadata as facets can be very helpful in locating these trials quickly and precisely within collections containing a plethora of documents. As is visible close to the top of the screenshot, the “selected categories,” i.e., the constraints applied to the search space, are phase: phase 1; intervention type: drug; study type: observational; and term: blood pressure. The first constraint considers clinical trials of phase 1 only. The second constraint restricts the search to clinical trial protocols that study the effects of drugs, while the third restricts it to observational studies.

 

The last constraint requires the term blood pressure to occur in the textual content of the protocols retrieved. These four constraints in combination are fulfilled by only five clinical trial protocols, which are shown on the right-hand side of the screenshot. Other constraints appearing on the left-hand side can also be selected.

 

This example shows how metadata can provide an intuitive and user-friendly manner to search in vast document spaces. ASCOT also employs other methods for locating clinical trial protocols of interest, such as keyword search and clustering.

 

Dictionary and Ontology Enrichment

In many domains, such as biology, chemistry, and technology, new terms are introduced constantly. Ontologies and other term dictionaries should be updated to keep up with advances, and this task is very expensive if performed manually. Automatic term extraction applied to textual resources that contain neologisms, such as publications and forum discussions, can produce new term candidates. Checking these candidates only instead of reading the entire new publications is a much easier and less costly manual task.

 

Some ontologies hold translations of concepts in more than one language. It is common that multilingual ontologies are not equally developed in all the languages they offer. Bilingual term extraction can aid in enriching these ontologies, since it is able to extract pairs of terms for different languages. In other words, bilingual term extraction can map the verbal sequences of existing ontology concepts to the sequences of the same concepts in other languages. This method of enriching multilingual ontologies requires a minimal cost for manual verification.

 

Automatic Translation

Automatic word-by-word translation is a very difficult task because of the differences in syntax and sentence structure among languages. For this reason, many machine translation methods use an alignment between semantic-bearing units of text, such as terms, named entities, and relations among them. The intuition behind this idea is that relations and events, which join entities together, retain their structure across languages.

 

In addition, translations of terms tend to exhibit increased similarities across languages, especially in some domains, such as chemistry, biology, and medicine. Thus, identifying named entities, relations, and events can be helpful during translation. It is indicative that typical statistical machine translation systems operate by learning from the training data the translation of words or word sequences associated with probabilities, and then use these translation tables to translate unknown sentences.

 

Forensics and Profiling

Forensics is the investigation of criminal activity by applying various sciences and technologies. Since blogging and social media nowadays comprise a significant part of communication for most people, the online activity of criminals is worth investigating because it might reveal important aspects of their interests, plans, personality, and way of living and thinking. In other words, analyzing the activity of criminals in blogs and social media might provide more evidence about their profile in general.

 

Analyzing blog posts of somebody under investigation or comments that they might have submitted to other posts can reveal evidence about their ideas, background principles, and way of thinking. Toward processing this textual evidence, extracting terms and named entities as well as analyzing sentiment might be useful.

 

Apart from blog posts and comments, social media profiles and posts are useful for suspect profiling. The textual content of posts can be analyzed similarly to blog posts and comments. However, social media can provide extra information about the habits and acquaintances of a user.

 

Specifically, the profile of each user in some social medium contains a set of friends. Analyzing the activity of each friend of a target user together with the relation between them might also be informative. Finally, some social media record information about the places that a user has logged in from. The time and place of these check-ins might be important for investigating criminal activity.

 

Automatic Text Summarization

Automatic Text Summarization

Text summarization is the process of producing a short version of a document that contains the most important points made in it. A summary produced by a human usually does not consist of the same sentences as the main document. Humans would synthesize the important parts of a document into new, more condensed sentences.

 

However, synthesizing a summary from scratch is a complex task to address automatically, because it requires a language generation step. Instead, typical automatic summarization methods concatenate the most important sentences or clauses extracted from the original document.

 

Specifically, automatic summarization methods typically assign an importance score to the sentences or clauses in the document to be summarized. Then, the sentences are sorted in order of decreasing importance, and only the top N sentences of this list are presented as the document summary.
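As a minimal sketch of this pipeline, the snippet below scores sentences by simple word frequency, which is only one of many possible importance measures, and returns the top N sentences in their original order.

from collections import Counter
import re

def summarize(text, top_n=2):
    # Split into sentences and compute document-wide word frequencies.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"\w+", text.lower()))
    # Score each sentence by the total frequency of the words it contains.
    scored = [(sum(freq[w] for w in re.findall(r"\w+", s.lower())), i, s)
              for i, s in enumerate(sentences)]
    top = sorted(scored, reverse=True)[:top_n]
    # Present the selected sentences in their original document order.
    return " ".join(s for _, i, s in sorted(top, key=lambda item: item[1]))

article = ("The pipeline was approved yesterday. Critics objected loudly. "
           "The pipeline will cross two states next year.")
print(summarize(article, top_n=2))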

 

Toward the process of scoring document sentences or clauses, semantic metadata computed for the target document prior to summarization are of significant importance. Named entities and relations among them hold increased semantic meaning. Named entities, relations, and events in a document can be linked together to draw a skeleton of its meaning. Then, a summary should definitely contain the sentences that describe these interrelated events.

 

Automatic summarization of textual documents can be handy to present long text in a compact manner. Providing users with a condensed summary can be useful in a variety of applications, such as news, research publications in a specific domain, and advertisements of a product family.