What is Big Data Analytics
The big data phenomenon has rapidly become pervasive across industries and sectors. The term typically describes the extremely large volumes of data that organizations collect, store, and manage, and that are analyzed to gain insight and make informed decisions.
This blog explores what big data analytics is, with examples, and explains the big data architectures, platforms, algorithms, and methodologies in use in 2019.
Big data analytics is emerging as a subdiscipline of the field of business analytics involving the application of unique architectures and platforms, technologies, unique programming languages, and open-source tools.
The key underlying principle is the use of distributed processing to address the large volume, complexity, and real-time nature of the analytics.
Very large datasets have existed for decades—the key difference is the emergence of the collection and storage of unstructured data primarily from social media, etc.
Data gathered from unconventional and nontraditional sources such as blogs, online chats, email, tweets, social media posts, sensors, pictures, and audio and video multimedia, captured via web forms, mobile devices, scanners, etc., holds the potential of offering different types of analytics: descriptive, predictive, and prescriptive.
From a comparative perspective, big data did exist in the 1960s, 1970s, 1980s, and 1990s, but it was mostly structured data (e.g., numerical/quantitative) in flat files and relational databases.
With the emergence of the Internet and the rapid proliferation of web applications and technologies, there has been an exponential increase in the accumulation of unstructured data as well. This has led to an escalating and pressing opportunity to analyze this data for decision-making purposes.
For example, it is well known that Amazon, the online retailer, utilizes big data analytics, applying predictive and prescriptive analytics to forecast what products a customer is likely to purchase.
All of the visits, searches, personal data, orders, etc., are analyzed using complex analytics algorithms.
Likewise, from a social media perspective, Facebook executes analytics on the data collected via users’ accounts. Google is another long-standing example of a company that analyzes the whole breadth and depth of data collected by tracking searches and results.
Examples can be found not only in Internet-based companies, but also in industries such as banking, insurance, healthcare, and others, and in science and engineering. Recognizing that big data analytics is here to stay, we next discuss the primary characteristics.
BIG DATA ANALYTICS
Like big data, the analytics associated with big data is also described by three primary characteristics: volume, velocity, and variety. There is no doubt data will continue to be created and collected, continually leading to an incredible volume of data. Second, this data is being accumulated at a rapid pace, and in real time.
This is indicative of velocity. Third, gone are the days of data being collected in standard quantitative formats and stored in spreadsheets or relational databases. Increasingly, the data is in multimedia format and unstructured. This is the variety characteristic.
Considering volume, velocity, and variety, the analytics techniques have also evolved to accommodate these characteristics to scale up to the complex and sophisticated analytics needed.
Some practitioners and researchers have introduced a fourth characteristic: veracity. This implies data assurance, that is, that the data, the analytics, and the resulting outcomes are error-free and credible.
Simultaneously, the architectures and platforms, algorithms, methodologies, and tools have also scaled up in granularity and performance to match the demands of big data.
For example, big data analytics is executed in distributed processing across several servers (nodes) to utilize the paradigm of parallel computing and a divide and process approach.
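As a rough illustration of this divide-and-process paradigm, the sketch below partitions a dataset, processes the parts in parallel, and combines the partial results. Worker threads stand in for cluster nodes here, and the sample data and node count are invented for the example:

```python
from concurrent.futures import ThreadPoolExecutor

def count_words(chunk):
    """Process one partition: count the words in a slice of the data."""
    return sum(len(line.split()) for line in chunk)

def distributed_word_total(lines, nodes=4):
    """Divide the dataset into partitions, process them in parallel,
    and combine the partial results (divide and process)."""
    size = max(1, len(lines) // nodes)
    chunks = [lines[i:i + size] for i in range(0, len(lines), size)]
    with ThreadPoolExecutor(max_workers=nodes) as pool:
        partials = list(pool.map(count_words, chunks))
    return sum(partials)  # aggregate the per-"node" results

data = ["big data analytics", "volume velocity variety", "distributed processing"]
print(distributed_word_total(data))  # -> 8
```

A real cluster would ship each partition to a different machine; the pattern of splitting, independent processing, and final aggregation is the same.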
It is evident that the analytics tools for structured and unstructured big data are very different from traditional business intelligence (BI) tools.
The architectures and tools for big data analytics have to necessarily be of industrial strength. Likewise, the models and techniques such as data mining and statistical approaches, algorithms, visualization techniques, etc., have to be mindful of the characteristics of big data analytics.
For example, the National Oceanic and Atmospheric Administration (NOAA) uses big data analytics to assist with climate, ecosystem, environment, and weather forecasting and pattern analysis, as well as commercial translational applications.
NASA engages big data analytics for aeronautical and other types of research. Pharmaceutical companies are using big data analytics for drug discovery, analysis of clinical trial data, side effects and reactions, etc.
Banking companies are utilizing big data analytics for investments, loans, customer demographics, etc. Insurance, healthcare provider, and media companies are among the other industries applying big data analytics.
The 4Vs are a starting point for the discussion about big data analytics. Other issues include the number of architectures and platforms, the dominance of the open-source paradigm in the availability of tools, the challenge of developing methodologies, and the need for user-friendly interfaces.
While the overall cost of the hardware and software is declining, these issues have to be addressed to harness and maximize the potential of big data analytics. We next delve into the architectures, platforms, and tools.
ARCHITECTURES, FRAMEWORKS, AND TOOLS
The conceptual framework for a big data analytics project is similar to that for a traditional business intelligence or analytics project. The key difference lies in how the processing is executed.
In a regular analytics project, the analysis can be performed with a business intelligence tool installed on a stand-alone system such as a desktop or laptop. Since big data is, by definition, large, the processing is broken down and executed across multiple nodes.
While the concepts of distributed processing are not new and have existed for decades, their use in analyzing very large datasets is relatively new as companies start to tap into their data repositories to gain insight to make informed decisions.
Additionally, the availability of open-source platforms such as Hadoop/MapReduce on the cloud has further encouraged the application of big data analytics in various domains.
Third, while the algorithms and models are similar, the user interfaces are entirely different at this time. Classical business analytics tools have become very user-friendly and transparent.
On the other hand, big data analytics tools are extremely complex, programming intensive, and need the application of a variety of skills. The data can be from internal and external sources, often in multiple formats, residing at multiple locations in numerous legacy and other applications.
All this data has to be pooled together for analytics purposes. The data is still in a raw state and needs to be transformed. Here, several options are available.
A service-oriented architectural approach combined with web services (middleware) is one possibility. The data continues to be in the same state, and services are used to call, retrieve, and process the data.
On the other hand, data warehousing is another approach wherein all the data from the different sources are aggregated and made ready for processing. However, the data is unavailable in real time.
Via the steps of extract, transform, and load (ETL), the data from diverse sources is cleansed and made ready. Depending on whether the data is structured or unstructured, several data formats can be input to the Hadoop/MapReduce platform.
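As a minimal illustration of the transform step of ETL, the sketch below extracts hypothetical raw rows, cleanses and casts them, and loads them into a stand-in warehouse. The field names and the validation rule are assumptions made up for the example:

```python
# Hypothetical raw rows pulled from a source system (the "extract" step).
raw = [
    {"id": "1", "amount": " 10.5 ", "date": "2013-01-05"},
    {"id": "2", "amount": "n/a",    "date": "2013-01-06"},
    {"id": "3", "amount": "7.25",   "date": "2013-01-07"},
]

def transform(rows):
    """Cleanse and convert: drop unparsable amounts, cast types."""
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"].strip())
        except ValueError:
            continue  # discard records that fail validation
        clean.append({"id": int(row["id"]), "amount": amount, "date": row["date"]})
    return clean

warehouse = []                    # stand-in for the load target
warehouse.extend(transform(raw))  # the "load" step
print(len(warehouse), sum(r["amount"] for r in warehouse))  # -> 2 17.75
```

A production ETL pipeline would add logging, schema enforcement, and incremental loading, but the extract–cleanse–load shape is the same.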
In this next stage in the conceptual framework, several decisions are made regarding the data input approach, distributed design, tool selection, and analytic models.
Finally, to the far right, the four typical applications of big data analytics are shown. These include queries, reports, online analytic processing (OLAP), and data mining.
Visualization is an overarching theme across the four applications. A wide variety of techniques and technologies have been developed and adapted to aggregate, manipulate, analyze, and visualize big data.
These techniques and technologies draw from several fields, including statistics, computer science, applied mathematics, and economics.
The most significant platform for big data analytics is the open-source distributed data processing platform Hadoop (Apache platform), initially developed for routine functions such as aggregating web search indexes.
It belongs to the class of NoSQL technologies (others include CouchDB and MongoDB) that have evolved to aggregate data in unique ways.
Hadoop has the potential to process extremely large amounts of data by mainly allocating partitioned data sets to numerous servers (nodes), which individually solve different parts of the larger problem and then integrate them back for the final result.
It can serve in the twin roles of either as a data organizer or as an analytics tool. Hadoop offers a great deal of potential in enabling enterprises to harness the data that was, until now, difficult to manage and analyze.
Specifically, Hadoop makes it possible to process extremely large volumes of data with varying structures (or no structure at all).
However, Hadoop can be complex to install, configure, and administer, and individuals with Hadoop skills are not yet readily available. Furthermore, many organizations are not ready to embrace Hadoop completely.
It is generally accepted that there are two important modules in Hadoop:
1. The Hadoop Distributed File System (HDFS). This facilitates the underlying storage for the Hadoop cluster. When data for the analytics arrives in the cluster, HDFS breaks it into smaller parts and redistributes the parts among the different servers (nodes) engaged in the cluster.
Only a small chunk of the entire data set resides on each server/node, and it is conceivable each chunk is duplicated on other servers/nodes.
Since the Hadoop platform stores the complete data set in small pieces across a connected set of servers/nodes in a distributed fashion, the analytics tasks can be distributed across the servers/nodes too.
Results from the individual pieces of processing are aggregated or pooled together for an integrated solution. MapReduce provides the interface for the distribution of the subtasks and then the gathering of the outputs. MapReduce is discussed further below.
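The splitting and replication behavior described above can be mimicked in a few lines of Python. This is only a toy model of HDFS placement: the block size, node names, and a simple round-robin policy are invented here and stand in for HDFS's actual placement logic:

```python
def place_blocks(data, block_size, nodes, replication=3):
    """Split a dataset into fixed-size blocks and assign each block
    to `replication` distinct nodes, round-robin (HDFS-style sketch)."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    placement = {}
    for b, block in enumerate(blocks):
        owners = [nodes[(b + r) % len(nodes)] for r in range(replication)]
        placement[b] = {"block": block, "nodes": owners}
    return placement

layout = place_blocks("abcdefgh", block_size=3, nodes=["n1", "n2", "n3", "n4"])
for b, info in layout.items():
    print(b, info["block"], info["nodes"])
```

Each block ends up on several nodes, so losing any single server never loses data, and analytics tasks can be sent to whichever node already holds a copy.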
A major advantage of parallel/distributed processing is graceful degradation, or the capability to cope with possible failures. Accordingly, HDFS and MapReduce are configured to continue executing in the event of a failure.
HDFS, for example, monitors the servers/nodes and storage devices continually. If a problem is detected, it automatically reroutes and restores data onto an alternative server/node. In other words, it is configured and designed to continue processing in the face of failure.
In addition, replication adds a level of redundancy and backup. Similarly, when tasks are executed, MapReduce tracks the processing of each server/node. If it detects any anomalies such as reduced speed, going into a hiatus, or reaching a dead end, the task is transferred to another server/node that holds the duplicate data.
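That failover behavior can be sketched as follows. The `flaky_count` task and the node names are hypothetical, and the simple retry loop is a simplification of MapReduce's actual monitoring and speculative re-execution:

```python
def run_with_failover(task, replicas):
    """Try the task on each node holding a replica until one succeeds,
    as MapReduce does when a node stalls or fails (sketch)."""
    errors = {}
    for node in replicas:
        try:
            return node, task(node)
        except RuntimeError as exc:
            errors[node] = str(exc)   # record the failure and move on
    raise RuntimeError(f"all replicas failed: {errors}")

def flaky_count(node):
    # Hypothetical task: node "n1" is "down" in this simulation.
    if node == "n1":
        raise RuntimeError("node unreachable")
    return 42

print(run_with_failover(flaky_count, ["n1", "n2", "n3"]))  # -> ('n2', 42)
```

Because the data chunk is replicated, the task simply moves to another node that holds a duplicate, and the overall job proceeds.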
Overall, the synergy between HDFS and MapReduce in the cloud environment facilitates industrial strength, scalable, reliable, and fault-tolerant support for both the storage and analytics.
Yahoo! is reported to be an early user of Hadoop. Its key objective was to gain insight from the large amounts of data stored across its numerous and disparate servers. The integration of the data and the application of big data analytics was mission critical, and Hadoop appeared to be the perfect platform for such an endeavor.
Presently, Yahoo! is apparently one of the largest users of Hadoop and has deployed it on thousands of servers/nodes. The Yahoo! Hadoop cluster apparently holds huge “log files” of user-clicked data, advertisements, and lists of all Yahoo! published content.
From a big data analytics perspective, Hadoop is used for a number of tasks, including correlation and cluster analysis to find patterns in the unstructured data sets.
Some of the more notable Hadoop-related application development initiatives include Apache Avro (data serialization), Cassandra and HBase (databases), Chukwa (a data collection system for monitoring large distributed systems), Hive (ad hoc Structured Query Language (SQL)-like queries for data aggregation and summarization), Mahout (a machine learning library), Pig (a high-level Hadoop programming language that provides a data-flow language and execution framework for parallel computation), Zookeeper (coordination services for distributed applications), and others. The key ones are described below.
MapReduce, as discussed above, is a programming framework developed by Google that supports the underlying Hadoop platform to process the big data sets residing on distributed servers (nodes) in order to produce the aggregated results.
The primary component of an algorithm would map the broken up tasks (e.g., calculations) to the various locations in the distributed file system and consolidate the individual results (the reduce step) that are computed at the individual nodes of the file system.
In summary, the data mining algorithm performs computations at the individual server/node level while, across the overall distributed system, the individual outputs are aggregated into the final result.
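A classic way to make the map and reduce steps concrete is the word-count example. The sketch below runs the three phases (map, shuffle, reduce) sequentially in plain Python, whereas a real Hadoop job would distribute the same phases across nodes:

```python
from collections import defaultdict

def map_phase(documents):
    """Map step: emit (word, 1) pairs from each document."""
    for doc in documents:
        for word in doc.lower().split():
            yield word, 1

def shuffle(pairs):
    """Shuffle step: group all values by key, as the framework would
    between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce step: aggregate the grouped values per key."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data", "big data analytics"]
print(reduce_phase(shuffle(map_phase(docs))))
# -> {'big': 2, 'data': 2, 'analytics': 1}
```

On a cluster, many mappers would each process one HDFS block, the framework would shuffle the intermediate pairs over the network, and many reducers would sum them in parallel.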
It is important to note that the primary Hadoop MapReduce application programming interfaces (APIs) are mainly called from Java, which requires skilled programmers. In addition, advanced skills are needed for development and maintenance.
In order to abstract some of the complexity of the Hadoop programming framework, several application development languages have emerged that run on top of Hadoop. Three popular ones are Pig, Hive, and Jaql. These are briefly described below.
Pig and PigLatin
Pig was originally developed at Yahoo!. The Pig programming language is designed to handle all types of data (structured, unstructured, etc.). It comprises two key modules: the language itself, called PigLatin, and the runtime environment in which PigLatin code is executed.
According to Zikopoulos et al., the initial step in a Pig program is to load the data to be subject to analytics in HDFS.
This is followed by a series of manipulations wherein the data is converted into a series of mapper and reducer tasks in the background. Last, the program dumps the data to the screen or stores the outputs at another location.
The key advantage of Pig is that it enables the programmers utilizing Hadoop to focus more on the big data analytics and less on developing the mapper and reducer code.
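The flavor of a Pig dataflow (load, filter, group, aggregate, dump) can be approximated in Python. The records, field names, and pipeline below are invented for illustration and stand in for data that a real Pig script would LOAD from HDFS:

```python
from itertools import groupby

# Hypothetical records standing in for data LOADed from HDFS.
records = [
    {"site": "blogA", "topic": "cancer", "words": 120},
    {"site": "blogB", "topic": "diet",   "words": 80},
    {"site": "blogC", "topic": "cancer", "words": 200},
]

# FILTER: keep only records about one topic.
filtered = [r for r in records if r["topic"] == "cancer"]

# GROUP ... BY plus an aggregate, as PigLatin's GROUP/FOREACH would do.
filtered.sort(key=lambda r: r["topic"])  # groupby needs sorted input
totals = {
    topic: sum(r["words"] for r in rows)
    for topic, rows in groupby(filtered, key=lambda r: r["topic"])
}

print(totals)  # -> {'cancer': 320}
```

In Pig, each of these steps is one line of PigLatin, and the runtime turns the pipeline into mapper and reducer tasks behind the scenes, which is exactly the complexity the programmer is spared.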
While Pig is robust and relatively easy to use, it still has a learning curve: the programmer needs to become proficient in PigLatin.
To address this issue, Facebook has developed a runtime Hadoop support architecture that leverages SQL with the Hadoop platform.
This architecture is called Hive; it permits SQL programmers to develop Hive Query Language (HQL) statements akin to typical SQL statements. However, HQL is limited in the commands it recognizes.
Ultimately, HQL statements are decomposed by the Hive Service into MapReduce tasks and executed across a Hadoop cluster of servers/nodes. Also, since Hive is dependent on Hadoop and MapReduce executions, queries may have a lag time in processing up to several minutes.
This implies Hive may not be suitable for big data analytics applications that need rapid response times, typical of relational databases. Lastly, Hive is a read-based programming artifact; it is therefore not appropriate for transactions that engage in a large volume of write instructions.
Jaql is a functional, declarative query language designed to process large data sets. It enables the select, join, group, and filter operations on data residing in HDFS; in this regard, it is analogous to a hybrid of Pig and Hive. To facilitate parallel processing, Jaql converts high-level queries into low-level queries consisting of MapReduce tasks.
Zookeeper is yet another open-source Apache project; it provides a centralized infrastructure and various services that enable synchronization across a cluster of servers. Zookeeper maintains the common objects required in large cluster situations (like a library).
Examples of these typical objects include configuration information, hierarchical naming space, and others. Big data analytics applications can utilize these services to coordinate parallel processing across big clusters.
This necessitates centralized management of the entire cluster in the context of such things as name services, group services, synchronization services, configuration management, and others.
Furthermore, several other open-source projects that utilize Hadoop clusters require these types of cross-cluster services.
The availability of these services in a Zookeeper infrastructure means that projects can embed Zookeeper without duplicating or reconstructing them all over again. A final note: interfacing with Zookeeper currently happens via Java or C.
HBase is a column-oriented database management system that sits on top of HDFS. In contrast to traditional relational database systems, HBase does not support a structured query language such as SQL. Applications in HBase are developed in Java, much like other MapReduce applications.
In addition, HBase supports application development in Avro, REST, or Thrift. HBase is built on similar master/slave concepts: just as HDFS has a NameNode (master) and slave nodes, and MapReduce comprises a JobTracker and TaskTracker slave nodes, in HBase a master node manages the cluster while region servers store parts of the tables and execute the tasks on the big data.
Cassandra, an Apache project, is also a distributed database system. It is designated a top-level project, modeled to handle big data distributed across many commodity servers. It provides reliable service with no single point of failure and is also a NoSQL system. Facebook originally developed it to support its inbox search. The Cassandra database can store 2 million columns in a single row.
Similar to Yahoo!’s needs, Facebook wanted to use the Google BigTable architecture that could provide a column-and-row database structure; this could be distributed across a number of nodes.
But BigTable faced a major limitation—its use of a master node approach made the entire application depend on one node for all read-write coordination—the antithesis of parallel processing.
Cassandra was built on a distributed architecture named Dynamo, designed by Amazon engineers.
Amazon used it to track what its millions of online customers were entering into their shopping carts. Dynamo gave Cassandra an advantage over BigTable because Dynamo is not dependent on any one master node.
Any node can accept data for the whole system, as well as answer queries. Data is replicated on multiple hosts, creating stability and eliminating the single point of failure.
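A Dynamo-style placement rule, where a key is hashed onto a ring of nodes and replicated to its successors, can be sketched as below. The node names and replication factor are made up, and real Cassandra uses a far more elaborate partitioner; this only shows why no single node is special:

```python
import hashlib

NODES = ["n1", "n2", "n3", "n4"]

def replica_nodes(key, replication=3):
    """Pick the nodes holding a key: hash the key to a position on the
    ring, then take the next `replication` nodes (Dynamo-style sketch)."""
    start = int(hashlib.md5(key.encode()).hexdigest(), 16) % len(NODES)
    return [NODES[(start + i) % len(NODES)] for i in range(replication)]

print(replica_nodes("user:42"))
```

Any node can compute this placement, so any node can accept a write or answer a read; the replicas provide the stability and eliminate the single point of failure.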
Many tasks may be chained together to meet the requirements of a complex analytics application in MapReduce. The open-source project Oozie streamlines, to an extent, the workflow and coordination among these tasks.
Its functionality permits programmers to define their own jobs and the relationships between those jobs. It will then automatically schedule the execution of the various jobs once the relationship criteria have been complied with.
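The job-ordering idea can be illustrated with a small topological sort. The workflow below is hypothetical, and real Oozie workflows are defined in XML rather than Python; this only shows the "run a job once its prerequisites have completed" logic:

```python
def schedule(jobs):
    """Return an execution order that respects job dependencies
    (a simple topological sort, in the spirit of what Oozie does)."""
    order, done = [], set()
    pending = dict(jobs)  # job -> set of prerequisite jobs
    while pending:
        ready = sorted(j for j, deps in pending.items() if deps <= done)
        if not ready:
            raise ValueError("cyclic dependency among jobs")
        for job in ready:
            order.append(job)
            done.add(job)
            del pending[job]
    return order

# Hypothetical workflow: ingest, then two parallel analyses, then a report.
workflow = {
    "ingest": set(),
    "cluster": {"ingest"},
    "classify": {"ingest"},
    "report": {"cluster", "classify"},
}
print(schedule(workflow))  # -> ['ingest', 'classify', 'cluster', 'report']
```

Note that "cluster" and "classify" become ready at the same time; on a real cluster they would run in parallel, and "report" starts only after both finish.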
Lucene is yet another widely used open-source Apache project predominantly used for text analytics/searches; it is incorporated into several open-source projects.
Lucene precedes Hadoop and has been a top-level Apache project since 2005. Its scope includes full-text indexing and library search for use within a Java application.
Avro, also an Apache project, facilitates data serialization services. The data definition schema is included in the data file itself, which makes it possible for an analytics application to access the data in the future, since the schema is stored along with it.
Versioning and version control are also added features of use in Avro. Schemas for prior data are available, making schema modifications possible.
Mahout is yet another Apache project whose goal is to generate free applications of distributed and scalable machine learning algorithms that support big data analytics on the Hadoop platform.
Mahout is still an ongoing project, evolving to include additional algorithms. The core widely used algorithms for classification, clustering, and collaborative filtering are implemented using the map/reduce paradigm.
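To give a feel for one of those core algorithms, here is a tiny one-dimensional k-means. Mahout's value is that it implements this kind of assignment/update loop as scalable map/reduce jobs across a cluster, which this single-machine sketch (with invented data) does not attempt:

```python
def kmeans_1d(points, centers, iterations=10):
    """A tiny 1-D k-means, the kind of clustering algorithm Mahout
    provides as a scalable map/reduce implementation."""
    for _ in range(iterations):
        # Assignment step (the "map"): nearest center per point.
        clusters = {c: [] for c in centers}
        for p in points:
            nearest = min(centers, key=lambda c: abs(p - c))
            clusters[nearest].append(p)
        # Update step (the "reduce"): recompute each center as a mean.
        centers = sorted(
            sum(members) / len(members) if members else c
            for c, members in clusters.items()
        )
    return centers

data = [1.0, 1.5, 2.0, 9.0, 10.0, 11.0]
print(kmeans_1d(data, centers=[0.0, 5.0]))  # -> [1.5, 10.0]
```

In Mahout, the assignment step runs as mappers over HDFS blocks of points and the update step as reducers, so the same loop scales to billions of points.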
Streams, like BigInsights an IBM offering, delivers a robust analytics platform for analyzing data in real time. In contrast to BigInsights, Streams applies its analytics techniques to data in motion.
But like BigInsights, Streams is appropriate not only for structured data but also for nearly all other types of data—the nontraditional semistructured or unstructured data coming from sensors, voice, text, video, financial, and many other high-volume sources.
Overall, in summary, there are numerous vendors, including AWS, Cloudera, Hortonworks, and MapR Technologies, among others, who distribute open-source Hadoop platforms. Numerous proprietary options are also available, such as IBM’s BigInsights.
Further, many of these are cloud versions that make it more widely available. Cassandra, HBase, and MongoDB, as described above, are widely used for the database component. In the next section, we offer an applied big data analytics methodology to develop and implement a big data project in a company.
BIG DATA ANALYTICS METHODOLOGY
While several different methodologies are being developed in this rapidly emerging discipline, a practical hands-on methodology is outlined here. The table below shows the main stages of such a methodology.
Outline of Big Data Analytics Methodology

Stage 1: Concept design
- Establish the need for a big data analytics project
- Define the problem statement
- Explain why the project is important and significant

Stage 2: Proposal
- Abstract: summarize the proposal
- What is the problem being addressed?
- Why is it important and interesting?
- Why a big data analytics approach?
- Problem domain discussion
- Prior projects and research

Stage 3: Methodology
- Data sources and collection
- Variable selection (independent and dependent variables)
- ETL and data transformation
- Expected results and conclusions
- Scope and limitations
- Develop the conceptual architecture
  - Show and describe the components
  - Show and describe the big data analytics platform/tools
- Execute the steps in the methodology
- Perform various big data analytics using various techniques and algorithms (e.g., word count, association, classification, clustering)
- Gain insight from the outputs
- Derive policy implications
- Make informed decisions

Stage 4: Presentation, walkthrough, and evaluation
In stage 1, an interdisciplinary big data analytics team develops a concept design. This is a first cut at briefly establishing the need for such a project, since there are trade-offs in terms of cheaper options, risk, problem-solution alignment, etc.
Additionally, a problem statement is followed by a description of project importance and significance. Once the concept design is approved in principle, one proceeds to stage 2, which is the proposal development stage.
Here, more details are filled in. Taking the concept design as an input, an abstract highlighting the overall methodology and implementation process is outlined.
This is followed by an introduction to the big data analytics domain: What is the problem being addressed? Why is it important and interesting to the organization?
It is also necessary to make the case for a big data analytics approach. Since the complexity and cost are much higher than those of traditional analytics approaches, it is important to justify its use.
Also, the project team should provide background information on the problem domain and prior projects and research done in this domain.
Both the concept design and the proposal are evaluated in terms of the 4Cs:
Completeness: Is the concept design complete?
Correctness: Is the design technically sound? Is correct terminology used?
Consistency: Is the proposal cohesive, or does it appear choppy? Is there flow and continuity?
Communicability: Is the proposal formatted professionally? Does the report communicate the design in easily understood language?
Next, in stage 3, the steps in the methodology are fleshed out and implemented. The problem statement is broken down into a series of hypotheses. Please note these are not rigorous, as in the case of statistical approaches. Rather, they are developed to help guide the big data analytics process.
Simultaneously, the independent and dependent variables are identified. In terms of analytics itself, it does not make a major difference to classify the variables. However, it helps identify causal relationships or correlations. The data is collected (longitudinal data, if necessary), described, and transformed to make it ready for analytics.
A very important step at this point is platform/tool evaluation and selection. For example, several options, as indicated previously, such as AWS Hadoop, Cloudera, IBM BigInsights, etc., are available.
A major criterion is whether the platform is available on a desktop or on the cloud. The next step is to apply the various big data analytics techniques to the data. These are not different from routine analytics. They’re only scaled up to large datasets.
Through a series of iterations and what if analysis, insight is gained from the big data analytics. From the insight, informed decisions can be made and policy shaped. In the final steps, conclusions are offered, scope and limitations are identified, and the policy implications discussed.
In stage 4, the project and its findings are presented to the stakeholders for action. Additionally, the big data analytics project is validated using the following criteria:
Robustness of analyses, queries, reports, and visualization
Variety of insight
Substantiveness of the research question
Demonstration of big data analytics application
Some degree of integration among components
Sophistication and complexity of analysis
Implementation is a staged approach, with feedback loops built in at each stage to minimize the risk of failure. The users should be involved in the implementation.
It is also an iterative process, especially in the analytics step, wherein the analyst performs what-if analysis. The next section briefly discusses some of the key challenges in big data analytics.
For one, a big data analytics platform must support, at a minimum, the key functions necessary for processing the data.
The criteria for platform evaluation may include availability, continuity, ease of use, scalability, ability to manipulate at different levels of granularity, privacy and security enablement, and quality assurance.
Additionally, while most currently available platforms are open source, the typical advantages and limitations of open-source platforms apply.
They have to be shrink-wrapped, made user-friendly, and transparent for big data analytics to take off. Real-time big data analytics is a key requirement in many industries, such as retail, banking, healthcare, and others.
The lag between when data is collected and when it is processed has to be addressed. The dynamic availability of the numerous analytics algorithms, models, and methods in a pull-down type of menu is also necessary for large-scale adoption.
The in-memory processing, such as in SAP’s Hana, can be extended to the Hadoop/MapReduce framework.
The various options of local processing (e.g., a network, desktop/laptop), cloud computing, software as a service (SaaS), and service-oriented architecture (SOA) web services delivery mechanisms have to be explored further.
The key managerial issues of ownership, governance, and standards have to be addressed as well.
Interleaved into these are the issues of continuous data acquisition and data cleansing. In the future, ontology and other design issues have to be discussed. Furthermore, an appliance-driven approach (e.g., access via mobile computing and wireless devices) has to be investigated.
We next discuss big data analytics in a particular industry, namely, healthcare and the practice of medicine.
BIG DATA ANALYTICS IN HEALTHCARE
The healthcare industry has great potential in the application of big data analytics. From evidence-based to personalized medicine, from outcomes to a reduction in medical errors, the pervasive impact of big data analytics in healthcare can be felt across the spectrum of healthcare delivery.
Two broad categories of applications are envisaged: big data analytics in the business and delivery side (e.g., improved quality at lower costs) and in the practice of medicine (aid in diagnosis and treatment).
The healthcare industry has all the necessary ingredients and qualities for the application of big data analytics—data intensive, critical decision support, outcomes-based, improved delivery of quality health care at reduced costs (in this regard, the transformational role of health information technology such as big data analytics applications is recognized), and so on.
However, one must keep in mind the historical challenges of the lack of user acceptance, lack of interoperability, and the need for compliance regarding privacy and security. Nevertheless, the promise and potential of big data analytics in healthcare cannot be overstated.
In terms of examples of big data applications, it is reported that the Department of Veterans Affairs (VA) in the United States has successfully demonstrated several healthcare information technologies (HIT) and remote patient monitoring programs.
The VA health system generally outperforms the private sector in following recommended processes for patient care, adhering to clinical guidelines, and achieving greater rates of evidence-based drug therapy.
These achievements are largely possible because of the VA’s performance-based accountability framework and disease management practices enabled by electronic medical records (EMRs) and HIT.
Another example is the National Institute for Health and Clinical Excellence, part of the UK’s National Health Service (NHS), which has pioneered the use of large clinical data sets to investigate the clinical and cost-effectiveness of new drugs and expensive existing treatments.
The agency issues appropriate guidelines on such costs for the NHS and often negotiates prices and market access conditions with pharmaceutical and medical products (PMP) industries.
Further, the Italian Medicines Agency collects and analyzes clinical data on the experience of expensive new drugs as part of a national cost-effectiveness program.
The agency can impose conditional reimbursement status on new drugs and can then reevaluate prices and market access conditions in light of the results of its clinical data studies.
BIG DATA ANALYTICS OF CANCER BLOGS
In this section, we describe our ongoing prototype research project in the use of the Hadoop/MapReduce framework on the AWS for the analysis of unstructured cancer blog data.
Health organizations and individuals such as patients are using blog content for several purposes. Health and medical blogs are rich in unstructured data for insight and informed decision making.
While current applications such as web crawlers and blog-analysis tools are good at generating statistics about the number of blogs, top 10 sites, etc., they are not computationally advanced or scalable enough to help with the analysis and extraction of insight.
First, blog data is growing exponentially (volume); second, postings appear in real time, so any analysis can become outdated very quickly (velocity); third, blogs contain a wide variety of content (variety); and fourth, the blogs themselves are distributed and scattered all over the Internet. Therefore, blogs in particular, and social media in general, are excellent candidates for the application of big data analytics.
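The map/reduce style of processing behind Hadoop can be illustrated with a minimal, self-contained sketch in plain Python: a map step emits (keyword, 1) pairs for medical terms found in each posting, and a reduce step sums the counts per keyword. The sample postings and keyword list below are hypothetical, purely for illustration; a real Hadoop job would apply the same logic in parallel across cluster nodes.

```python
from collections import defaultdict

# Hypothetical sample postings and keyword list, illustrative only,
# not actual project data.
POSTS = [
    "Started chemotherapy last week; fatigue is worse than expected.",
    "My oncologist suggested radiation after the chemotherapy cycle.",
    "Looking for a support group for breast cancer survivors.",
]
KEYWORDS = {"chemotherapy", "radiation", "fatigue", "support"}

def map_post(post):
    """Map step: emit a (keyword, 1) pair for each medical keyword found."""
    for word in post.lower().replace(";", " ").replace(".", " ").split():
        if word in KEYWORDS:
            yield word, 1

def reduce_counts(pairs):
    """Reduce step: sum the counts emitted for each keyword."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)

pairs = [pair for post in POSTS for pair in map_post(post)]
counts = reduce_counts(pairs)
print(counts)  # "chemotherapy" appears in two postings
```

In Hadoop proper, the framework handles the shuffle between the map and reduce stages and distributes both stages across nodes, which is what makes the approach scale to the volume and velocity described above.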
To reiterate, there has been an exponential increase in the number of blogs in the healthcare area, as patients find them useful in disease management and developing support groups.
Alternatively, healthcare providers such as physicians have started to use blogs to communicate and discuss medical information.
Examples of useful information include alternative medicine and treatment, health condition management, diagnosis–treatment information, and support group resources.
This rapid proliferation in health- and medical-related blogs has resulted in huge amounts of unstructured yet potentially valuable information becoming available for analysis and use.
Statistics indicate that health-related bloggers post very consistently.
The analysis and interpretation of health-related blogs are not trivial tasks. Unlike many of the blogs in various corporate domains, health blogs are far more complex and unstructured.
The postings reflect two important facets of the bloggers and visitors, ranging from individual patient care and disease management (fine granularity) to generalized medicine (e.g., public health).
Hadoop/MapReduce defines a framework for implementing systems for the analysis of unstructured data. In contrast to structured information, whose meaning is expressed by the structure or the format of the data, the meaning of unstructured information cannot be so inferred.
Examples of data that carry unstructured information include natural language text and data from audio or video sources. More specifically, an audio stream has a well-defined syntax and semantics for rendering the stream on an audio device, but its music score is not directly represented.
Hadoop/MapReduce is sufficiently advanced and sophisticated computationally to aid in the analysis and understanding of the content of health-related blogs.
At the individual level (document-level analysis), one can perform analysis and gain insight into a patient through longitudinal studies. At the group level (collection-level analysis), one can gain insight into the patterns of groups (network behavior, e.g., assessing influence within the social group): for example, in a particular disease group, among the community of participants in an HMO or hospital setting, or even in the global community of patients (ethnic stratification).
The results of these analyses can be generalized. While the blogs enable the formation of social networks of patients and providers, the uniqueness of the health/medical terminology comingled with the subjective vocabulary of the patient compounds the challenge of interpretation.
At a more general level, while blogs have emerged as a contemporary mode of communication within a social network context, hardly any research or insight exists on the content analysis of blogs. The blog world is characterized by a lack of rules on format, how to post, and the structure of the content itself.
Questions arise: How do we make sense of the aggregate content? How does one interpret and generalize?
In health blogs in particular, what patterns of diagnosis, treatment, management, and support might emerge from a meta-analysis of a large pool of blog postings? How can the content be classified? What natural clusters can be formed around the topics?
What associations and correlations exist between key topics? The overall goal, then, is to enhance the quality of health by reducing errors and assisting in clinical decision making.
Additionally, one can reduce the cost of healthcare delivery through the use of these types of advanced health information technology. Therefore, the objectives of our project include the following:
1. To use Hadoop/MapReduce to perform analytics on a set of cancer blog postings from Yahoo!
2. To develop a parsing algorithm and association, classification, and clustering technique for the analysis of cancer blogs
3. To develop a vocabulary and taxonomy of keywords (based on existing medical nomenclature)
4. To build a prototype interface
5. To contribute to social media analysis in the semantic web by generalizing the models from cancer blogs
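One of the techniques named in objective 2, association analysis, can be sketched by counting how often pairs of medical keywords co-occur within the same posting; frequent pairs suggest candidate associations between topics. The keyword sets below are hypothetical examples, not project data.

```python
from itertools import combinations
from collections import Counter

# Each posting reduced to its set of extracted keywords; the sets
# below are assumed for illustration only.
POST_KEYWORDS = [
    {"chemotherapy", "fatigue", "nausea"},
    {"chemotherapy", "nausea"},
    {"radiation", "fatigue"},
]

# Count every unordered keyword pair that appears in the same posting.
co_occurrence = Counter()
for keywords in POST_KEYWORDS:
    for pair in combinations(sorted(keywords), 2):
        co_occurrence[pair] += 1

print(co_occurrence.most_common(3))  # ("chemotherapy", "nausea") co-occurs twice
```

A production system would normalize these raw counts (e.g., into support and confidence scores) before treating a pair as a meaningful association.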
The following levels of development are envisaged: at the first level, patterns of symptoms and management (diagnosis/treatment); at the second level, insight into disease management at the individual/group levels; and at the third level, clinical decision support (e.g., generalization of patterns, from syntactic to semantic) for informed decision making. Typically, the unstructured information in blogs comprises the following:
Blog topic (posting): What issue or question does the blogger (and comments) discuss?
Disease and treatment (not limited to): What cancer type and treatment (and other issues) are identified and discussed?
Other information: What other related topics are discussed? What links are provided?
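The facets listed above suggest a simple record structure for a parsed posting. The sketch below uses hypothetical field names chosen for illustration; they are not the project's actual schema.

```python
from dataclasses import dataclass, field

# A hypothetical record for one parsed cancer-blog posting; the field
# names are assumptions for illustration, not the project's schema.
@dataclass
class BlogPosting:
    topic: str                  # issue or question the blogger discusses
    cancer_type: str            # disease identified in the posting
    treatments: list = field(default_factory=list)  # therapies mentioned
    links: list = field(default_factory=list)       # support/resource links

post = BlogPosting(
    topic="Managing side effects during treatment",
    cancer_type="breast cancer",
    treatments=["chemotherapy", "anti-nausea medication"],
    links=["https://example.org/support-group"],
)
print(post.cancer_type)
```

Extracting even this coarse structure from free-form postings is the hard part; the record itself simply gives downstream classification and clustering something consistent to operate on.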
What Can We Learn from Blog Postings?
Unstructured information related to blog postings (bloggers), including responses/comments, can provide insight into diseases (cancer), treatment (e.g., alternative medicine, therapy), support links, etc.
1. What are the most common issues patients have (bloggers/responses)?
2. What are the cancer types (conditions) most discussed? Why?
3. What therapies and treatments are being discussed? What medical and nonmedical information is provided?
4. Which blogs and bloggers are doing a good job of providing relevant and correct information?
5. What are the major motivations for the postings (comments)? Can they be classified by role, such as provider (physician) or patient?
6. What are the emerging trends in disease (symptoms), treatment, and therapy (e.g., alternative medicine), support systems, and information sources (links, clinical trials)?
What Are the Phases and Milestones?
This project envisions the use of Hadoop/MapReduce on the AWS to facilitate distributed processing and partitioning of the problem-solving process across the nodes for manageability.
Additionally, supporting plug-ins are used to develop an application tool to analyze health-related blogs. The project is scoped to content analysis of the domain of cancer blogs at Yahoo!.
Phase 1 involved the collection of blog postings from Yahoo! into a Derby application.
Phase 2 consisted of the development and configuration of the architecture—keywords, associations, correlations, clusters, and taxonomy.
Phase 3 entailed the analysis and integration of extracted information in the cancer blogs—preliminary results of initial analysis (e.g., patterns that are identified).
Phase 4 involved the development of taxonomy.
Phase 5 proposes to test the mining model and develop the user interface for deployment. We propose to develop a comprehensive text mining system that integrates several mining techniques, including association and clustering, to effectively organize the blog information and provide decision support in terms of search by keywords.
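The search-by-keyword decision support proposed for Phase 5 can be sketched with a minimal inverted index, where each keyword maps to the set of postings containing it. The postings below are hypothetical placeholders.

```python
from collections import defaultdict

# Hypothetical posting IDs and (already keyword-extracted) text.
POSTS = {
    1: "chemotherapy fatigue support group",
    2: "radiation therapy side effects",
    3: "chemotherapy nausea remedies",
}

# Build the inverted index: keyword -> set of posting IDs.
index = defaultdict(set)
for post_id, text in POSTS.items():
    for word in text.split():
        index[word].add(post_id)

def search(keyword):
    """Return the sorted IDs of postings containing the keyword."""
    return sorted(index.get(keyword.lower(), set()))

print(search("chemotherapy"))  # [1, 3]
```

In the envisioned system, the index terms would come from the taxonomy developed in Phase 4 rather than from raw whitespace-split text.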
Big data analytics is transforming the way companies are using sophisticated information technologies to gain insight from their data repositories to make informed decisions.
This data-driven approach is unprecedented, as the data collected via the web and social media is escalating by the second. In the future, we’ll see the rapid, widespread implementation and use of big data analytics across the organization and the industry. In the process, the several challenges highlighted above need to be addressed.
As it becomes more mainstream, issues such as guaranteeing privacy, safeguarding security, establishing standards and governance, and continually improving the tools and technologies will garner attention.
Big data analytics and applications are at a nascent stage of development, but the rapid advances in platforms and tools can accelerate their maturing process.