New Big Data Trends
The future of Big Data may follow the same quantitative trajectory, only bigger and better, or disruptive forces may emerge that change the whole computing outlook. Big Data can not only assist the expansion of existing products and services but also enable the creation of new ones. This blog explains new trends in Big Data in 2019.
The term Big Data as we know it today may be inappropriate in the future, since its meaning shifts with the available technologies and computing capabilities; the trend is that the scale of Big Data we know today will look small in ten years.
In the era of data, data becomes an organization's most valuable asset and may become the biggest exchange commodity in the future. Data is being called “the new oil”, meaning that, through analytical capabilities, it is refined into high-value products.
Organizations need to invest in constructing this infrastructure now so they are prepared when supply and value chains are transformed.
In February 2015, Douglas Laney published an online article on the Forbes website presenting three big trends for business intelligence in the coming years, boosted by the use of massive data volumes.
The first of them states that, by 2020, information will be used to reinvent, digitalize, or eliminate 80% of the business processes and products of the previous decade.
With the growth of the IoT, connected devices, sensors, and intelligent machines, the ability of things to generate new types of information in real-time also grows and will actively participate in the industry’s value chain. Laney states that things will become self-agents, for people and for business.
The second trend is that, by 2017, more than 30% of companies' access to Big Data will occur through data-service brokers acting as intermediaries, offering a basis for businesses to make better decisions.
Laney projects the arrival of a new category of cloud-centered business that will deliver data to be used in the business context, with or without human intervention.
Finally, the last trend is that “…by 2017, more than 20% of customer-facing analytic deployments will provide product tracking information leveraging the IoT”.
The rapid dissemination of the IoT will create a new style of customer analysis and product tracking, utilizing the ever-cheapening electronic sensors, which will be incorporated into all sorts of products.
3D Data Capturing
The relevance of individual data streams will be considered, and additional data that may be useful during further analyses will be proposed.
The typical training (rehabilitation) session using the Homebalance system and the Scope device used for capturing biofeedback data as described in the previous section consists of three consecutive tasks that the probands are to perform: a simple chessboard scene and two scenes in 3D virtual environments.
Each of the considered training sessions has several levels of complexity, given by the parameters of the user task, e.g. the maximal time allowed for accomplishing the task, the expected scope of movement, and the type of cognitive task to be performed together with the movement. (In different medical fields, the term proband is often used to denote a particular subject, e.g., a person or animal, being studied or reported on.)
The complexity of the task assigned to the patient is changed during the exercise in accordance with the biofeedback data, which indicate how demanding the current setting is for him or her.
If the patient manages the task well, the current setting level can be slightly increased by a session supervisor. In the opposite case, it can be decreased. Scaling of the task complexity is described separately for each of the included training sessions below.
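As an illustration, the supervisor-driven scaling described above could be supported by a simple rule. The following sketch is purely illustrative: the heart-rate criterion, the threshold values, and the level bounds are invented for the example and are not parameters of the Homebalance system.

```python
# Hypothetical biofeedback-driven difficulty scaling (illustrative only).
# Thresholds and level bounds are assumptions, not Homebalance parameters.

def adjust_difficulty(level: int, heart_rate: float, baseline_hr: float,
                      task_completed: bool, min_level: int = 1,
                      max_level: int = 10) -> int:
    """Return the suggested difficulty level for the next exercise round."""
    stress_ratio = heart_rate / baseline_hr
    if task_completed and stress_ratio < 1.15:
        # Patient copes well and stays close to the resting heart rate.
        level += 1
    elif not task_completed or stress_ratio > 1.35:
        # Task failed, or the physiological strain is high.
        level -= 1
    return max(min_level, min(max_level, level))
```

A session supervisor would still confirm each step; the rule only suggests the next level from the biofeedback baseline recorded before the session.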
3D Data Processing and Visualization
Given the raw data that can be obtained from the various sensors described previously, we will now focus on the description of the data structures and data flows integral to our approach. This will be demonstrated on the use cases mentioned above: the chessboard scene and the virtual-reality scenes.
One minor terminology remark: since many of the scenes considered here for training and rehabilitation have a game-like character, and using 2D/3D game engines for scene visualization is quite common, we will refer to the testing applications with the virtual environment as games.
Indeed, many of the scenarios used in training and rehabilitation applications resemble computer games: there are given goals that the proband should accomplish, the proband’s progress in the training session is measured by some kind of score.
The scenes can have an adjustable difficulty level (by setting time limits, making scenes more complicated to navigate, etc.). Therefore, using serious game approach to the training and rehabilitation applications seems very natural.
Conversely, many training and rehabilitation approaches that employ software applications can be gamified by turning them into repeatable scenarios with difficulty levels, awards, and scores.
Since the raw data acquired from sensors are quite low-level and use arbitrary device-related units, it is necessary to perform some preprocessing, followed by peak detection/feature extraction steps.
This knowledge is sufficient for reconstructing and evaluating the exact progress of the gameplay later, perhaps for purposes of finding the moments where the player experienced major struggles in the game and adjusting the game difficulty for later plays accordingly in order to optimize the training/learning process that the game is designed to facilitate.
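A minimal sketch of the preprocessing and peak-detection chain mentioned above might look as follows. The raw readings, the normalization scheme, and the local-maximum detector with its threshold are invented for the example; real devices would need device-specific calibration and smoothing.

```python
# Illustrative preprocessing: raw device units are scaled to [0, 1],
# then simple local-maximum peak detection is applied.
# The sample values and the threshold are assumptions for this sketch.

def normalize(samples):
    """Scale raw device readings into the [0, 1] range."""
    lo, hi = min(samples), max(samples)
    span = (hi - lo) or 1.0  # avoid division by zero on flat signals
    return [(s - lo) / span for s in samples]

def detect_peaks(samples, threshold=0.5):
    """Return indices of local maxima above the threshold."""
    return [i for i in range(1, len(samples) - 1)
            if samples[i] > threshold
            and samples[i] > samples[i - 1]
            and samples[i] >= samples[i + 1]]

raw = [512, 520, 610, 700, 640, 530, 525, 690, 720, 600]
peaks = detect_peaks(normalize(raw))  # indices of candidate events
```

The detected peak indices, together with timestamps, are the kind of feature-level record from which the gameplay progress can later be reconstructed.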
From the low-level data mentioned above, various high-level indicators can be derived. GSR, for instance, is a good resource for estimating stress levels. It is also a prospective physiological indicator of cognitive load, defined as the amount of information that the short-term (or working) memory can hold at one time.
Cognitive Load Theory, explained by Sweller and his colleagues, suggests that learners can absorb and retain information effectively only if it is provided in such a way that it does not overload their mental capacity.
Monitoring of physiological functions is a useful tool for early detection of increasing stress level of the patient during the therapeutic process. Monitoring of the patient leads to detection of a change of the patient’s state even if it is not visible for the therapist. The normal level of heart rate and other parameters is individual for each patient.
It is useful to collect data before the start of the therapeutic intervention, during the relaxation phase. The patient should be in optimal stress level during the whole therapeutic process. The current stress level should be compared to the normal level of the individual patient and to the optimal stress level of similar patients from the database.
Sensor data are managed by computers (stored, processed, analyzed, etc.) as time-series data. The PMML (Predictive Model Markup Language) format is a suitable form for representing these signal data and the other statistical and data mining models computed and stored in our data space.
Cloud-Dew Computing Paradigm
Our distributed software system architecture is based on the Cloud-Dew computing paradigm. The rationales for our design decisions are briefly explained below.
Centralized approach. All treatment, data collection, and processing are physically placed at a clinic (rehabilitation center).
Distributed approach. The treatment is provided both at a clinic and at home. Several implementation patterns are possible that meet the appropriate requirements.
In our case, we need support for effective and productive rehabilitation and a system that guarantees high availability, reliability, safety, and security, besides its advanced functionality.
We have focused on Cloud-Dew computing to meet the above objectives. Our long-term goal is to have a network of cooperating nodes, where a node denotes a cloud server associated with a clinic (rehabilitation center) steering a set of home infrastructures equipped with balance force platforms and computing, storage, and sensor devices.
Each home system works autonomously, even when the Internet connection is temporarily unavailable, and at specific intervals it exchanges collaborative information (analysis results, data mining models, etc.) with the center involved in the cloud; here, appropriate security rules are preserved. On-premise processing is also responsible for annotating the collected data with metadata before sending it to the cloud associated with the center.
The Cloud Server coordinates the activities of several home infrastructures, called simply Homes. The Client Program participates in the traditional client-server software pattern; moreover, it includes software functionality associated with on-premise data processing.
Sensor Network Management steers the data flow from sensors and other data stream sources (e.g. video cameras) to the Home Dataspace managed by the Data Space Management System (DSMS). The Dew Server acts as a specific proxy of the Cloud Server; among others, it is responsible for data synchronization on the Home side.
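The Dew Server's synchronization duty can be sketched as follows. The class, the local queue, and the callables are hypothetical illustrations of the pattern, not the actual Homebalance software.

```python
# Hypothetical sketch of a Dew Server's synchronization role: results are
# queued locally while the Home works autonomously, and flushed to the
# Cloud Server when connectivity is available. Names are illustrative.

class DewServer:
    def __init__(self, cloud_send, is_online):
        self.cloud_send = cloud_send  # callable pushing one record to the cloud
        self.is_online = is_online    # callable probing connectivity
        self.pending = []             # local queue, survives offline periods

    def record(self, analysis_result):
        """Store a result locally; the Home keeps working autonomously."""
        self.pending.append(analysis_result)

    def synchronize(self):
        """At specific intervals, flush queued results if the link is up."""
        if not self.is_online():
            return 0
        sent = 0
        while self.pending:
            self.cloud_send(self.pending.pop(0))
            sent += 1
        return sent
```

The key design property shown here is that an offline period never blocks local recording; synchronization simply resumes once the link to the clinic's cloud returns.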
The availability of this voluminous data provides an excellent opportunity for facility management to improve their service operations by reinforcing what earns positive compliments and by identifying and addressing underlying concerns.
This data, however, lacks structure, is voluminous and is not easily amenable to manual analysis necessitating the use of Big Data approaches.
We designed, developed, and implemented software systems that download and organize the text of these online reviews, analyze it using Natural Language Processing algorithms to perform sentiment analysis and topic modeling, and provide facility managers with actionable insights to improve the visitor experience.
It is rather astonishing to see a large number of people sharing vast amounts of data (which no longer necessarily implies numbers, but could be in various formats such as text, images, weblinks, hashtags) on these social networks.
Park management could and should utilize this data for strengthening their programs, formulating marketing and outreach strategies and improving services.
Big Data presents serious challenges regarding data collection, storage, and integration. Most of the data is noisy and unstructured. In order to fully harness the power of Big Social Data, park management has to build proper mechanisms and infrastructure to deal with the data.
Utilizing Public Online Reviews
The NYS park system could develop a deeper understanding of diverse public opinions about its parks by harnessing public online reviews. However, the data on these sites lacks structure, is voluminous, and is not easily amenable to manual analysis.
In order to tap into this rich source of visitor information, facility managers need new ways to get feedback online and new software tools for analysis.
A research group from Cornell’s Samuel Curtis Johnson Graduate School of Management and the Water Resources Institute designed, developed, and implemented software systems to tap online review content for the benefit of state agencies and the general public.
Among the many online social platforms that host visitor reviews, Yelp, TripAdvisor, and Google are the most popular and thus were used to develop a pilot decision support system for NYS park managers.
Preprocessing and Text Representation
Arguably the first critical step in analyzing unstructured and voluminous text is the transformation of the free form (qualitative) text into a structured (quantitative) form that is easier to analyze. The easiest and most common transformation is with the “bag of words” representation of text.
The moniker “bag of words” is used to signify that the distribution of words within each document is sufficient, i.e., linguistic features like word order, grammar, and other attributes within the written text can be ignored (without losing much information) for statistical analysis.
This approach converts the corpus into a document-term matrix. This matrix contains a column for each word and a row for every document and a matrix entry is a count of how often the word appears in each document.
The resulting structured and quantitative document-term matrix can then, in principle, be analyzed using any of the available mathematical techniques. The size of the matrix, however, can create computational and algorithmic challenges.
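A minimal, dependency-free sketch of building such a document-term matrix follows; the two-review corpus is invented for illustration.

```python
# Minimal bag-of-words sketch: each review becomes a row of word counts
# in a document-term matrix. The tiny corpus below is invented.
from collections import Counter

docs = ["great park great views", "crowded park dirty trails"]

counts = [Counter(d.split()) for d in docs]
vocab = sorted(set(w for c in counts for w in c))  # one column per word
matrix = [[c[w] for w in vocab] for c in counts]   # one row per document
```

Even for this toy corpus the matrix is mostly zeros; at the scale of thousands of reviews, that sparsity is exactly the computational challenge the preprocessing steps below address.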
Natural language processing overcomes this hurdle by emphasizing meaningful words by removing uninformative ones and by keeping the number of unique terms that appear in the corpus from becoming extremely large.
There are preprocessing steps that are standard, including
(i) transforming all text into lowercase
(ii) removing words of fewer than three characters and very common words called stop words (e.g., the, and, of)
(iii) stemming words, which refers to the process of removing suffixes, so that words like values, valued and valuing are all replaced with value, and finally
(iv) removing words that occur either too frequently or very rarely.
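The four steps above can be sketched as a small pipeline. The stop-word list, the suffix rules, and the frequency cut-offs here are crude illustrative stand-ins for the validated resources (full stop lists, Porter-style stemmers) used in practice.

```python
# Illustrative preprocessing pipeline for steps (i)-(iv); the stop list
# and suffix rules are tiny stand-ins for real linguistic resources.
import re
from collections import Counter

STOP_WORDS = {"the", "and", "of"}          # tiny illustrative stop list
SUFFIXES = ("ing", "ed", "es", "s")        # crude stemming, sketch only

def stem(word):
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)]
    return word

def preprocess(corpus, min_df=1, max_df_ratio=1.0):
    tokenized = []
    for doc in corpus:
        words = re.findall(r"[a-z]+", doc.lower())          # (i) lowercase
        words = [w for w in words
                 if len(w) >= 3 and w not in STOP_WORDS]    # (ii) short/stop words
        tokenized.append([stem(w) for w in words])          # (iii) stemming
    df = Counter(w for doc in tokenized for w in set(doc))
    keep = {w for w, n in df.items()
            if n >= min_df and n / len(corpus) <= max_df_ratio}  # (iv) frequency cut-offs
    return [[w for w in doc if w in keep] for doc in tokenized]
```

Note that, like a real Porter stemmer, the crude suffix stripping maps "valued" and "valuing" to a common stem rather than to a dictionary word.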
Word counts and sentiment represent the most basic statistics for summarizing a corpus, and research has shown that they are associated with customer decision making and product sales.
To accommodate potential non-linearities in the impact of sentiment on customer behavior, it is recommended to separately estimate measures of positive sentiment and negative sentiment of each item of customer feedback.
The positive sentiment score is calculated by counting the number of unique words in the review that matched a list of “positive” words in validated databases (called dictionaries) in the existing literature. The negative sentiment score is calculated analogously.
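The counting scheme can be sketched in a few lines. The two word sets below are tiny invented stand-ins for the validated dictionaries from the literature (roughly 10,000 labeled words in the combined resource).

```python
# Sketch of dictionary-based sentiment scoring. The word lists are
# illustrative stand-ins, not the validated dictionaries themselves.

POSITIVE = {"great", "beautiful", "clean", "friendly"}
NEGATIVE = {"dirty", "crowded", "rude", "broken"}

def sentiment_scores(review: str):
    """Count unique review words matching each dictionary."""
    words = set(review.lower().split())
    return len(words & POSITIVE), len(words & NEGATIVE)

pos, neg = sentiment_scores("Beautiful views but dirty restrooms and rude staff")
```

Keeping the two scores separate, rather than taking their difference, is what allows the downstream model to capture asymmetric effects of praise and complaints.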
The choice of the dictionary is an important methodological consideration when measuring sentiment since certain words can have the sentiment that changes with the underlying context.
For instance, prior research shows that dictionaries created using financial 10-K disclosures are more appropriate for financial sentiment analysis than dictionaries created from other domains.
We chose our dictionaries since they were created, respectively, to summarize the opinions within online customer reviews and to perform tone analysis of social media blogs. In total, the combined dictionaries consist of approximately 10,000 labeled words.
These analytical tools are continuously improving, becoming algorithmically faster and more feature-rich, and thus are applicable not only to government-operated facilities but also to resources managed by non-profit groups such as land trusts.
Depending on the needs of state or local agencies and other managers of public spaces, these tools could be enhanced in the following ways:
Automatic Downloading of Reviews:
Online review platforms continue to increase in popularity and new reviews are submitted on a regular basis.
Manual downloading is time-consuming, especially for managers who want “real-time” reports. It is necessary to develop and implement a process that automatically downloads and organizes online reviews into a database.
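Such a process could be structured as below. The fetch_new_reviews function is a hypothetical stub standing in for the platform-specific API calls (Yelp, TripAdvisor, and Google each expose their own authenticated endpoints); only the incremental-download-and-store pattern is the point of the sketch.

```python
# Hypothetical sketch of an incremental review downloader. The fetcher
# is stubbed; a real system would call each platform's API on a schedule.
import sqlite3

def fetch_new_reviews(since_id: int):
    """Stub for a platform API call; yields (id, text) pairs newer than since_id."""
    sample = [(1, "Great trails"), (2, "Parking was tight")]  # invented data
    return [r for r in sample if r[0] > since_id]

def sync_reviews(conn: sqlite3.Connection) -> int:
    """Download reviews newer than the latest stored one; return how many."""
    conn.execute("CREATE TABLE IF NOT EXISTS reviews (id INTEGER PRIMARY KEY, text TEXT)")
    last = conn.execute("SELECT COALESCE(MAX(id), 0) FROM reviews").fetchone()[0]
    new = fetch_new_reviews(last)
    conn.executemany("INSERT INTO reviews VALUES (?, ?)", new)
    conn.commit()
    return len(new)
```

Keying the download on the highest stored identifier makes repeated runs idempotent, which is what allows the job to be scheduled frequently enough for near-real-time reports.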
Topic Modeling on Negative/Positive Review Segments: The current approach extracts themes from whole reviews that have been labeled as positive or negative. But reviews are rarely, if ever, completely positive or negative.
Each review typically contains segments that are positive alongside segments that are negative. In order to get a more accurate collection of themes, the analysis should perform topic modeling on collections of review segments as opposed to whole reviews.
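Splitting a review into sentence-level segments and routing each to a positive or negative collection could be sketched as follows; the sentence splitter and the word lists are illustrative assumptions, not the production dictionaries.

```python
# Sketch of segment-level routing before topic modeling. Word lists and
# the naive sentence splitter are assumptions made for this example.
import re

POSITIVE = {"great", "beautiful", "clean"}
NEGATIVE = {"dirty", "crowded", "broken"}

def split_segments(review: str):
    """Route each sentence to positive and/or negative segment collections."""
    pos_segments, neg_segments = [], []
    for sentence in re.split(r"[.!?]+", review):
        words = set(sentence.lower().split())
        if words & POSITIVE:
            pos_segments.append(sentence.strip())
        if words & NEGATIVE:
            neg_segments.append(sentence.strip())
    return pos_segments, neg_segments

pos, neg = split_segments("The views were beautiful. The restrooms were dirty!")
```

Topic modeling would then run separately on the two segment collections, so a single mixed review contributes its praise and its complaint to the right theme pools.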
Topic Modeling Incorporating Expert Feedback:
A topic is simply a collection of words. When the topics are chosen by computer software, some of the words in the topic may not fit according to the needs of park managers.
In such cases, the managers can identify words that should be dropped from a topic and the model can be re-run. Such a recursive approach will lead to a more accurate extraction of themes and improved managerial insights.
Creating Verified Reviews: Reviews from third-party online platforms are unsolicited and often cannot be verified for veracity. With the advancement and proliferation of technologies like mobile phones and microchip wristbands, the use of devices that track key personal information is increasingly common.
These devices carry important information which visitors could voluntarily share with the facility management to create verified or more detailed reviews.
Identifying and Accommodating Temporal Changes: Instances exist in which the underlying data of reviewer characteristics, the length, and content of reviews, the topics discussed and even the language used can undergo a seismic shift.
When and if that happens, the existing analysis, applied without any changes, can lead to wrong insights and conclusions. It is necessary to have an approach for identifying such temporal shifts in data and determine ways in which the analysis should be appropriately adjusted.
The necessity and usefulness of natural scenic areas such as local, state, and national parks has never been greater for our society. They provide much-needed serenity and tranquility, and the resulting mental-health benefits and stress relief, in lives that are increasingly lonely, technology-centric, and hectic.
Given that these benefits must often be realized within short durations of available free time, it is important that appropriate information and feedback are provided to visitors.
The traditional methods with which the park managers connect with their visitors have had many strengths, but also numerous weaknesses, and the availability of social networks and the information available within them can further strengthen the communication between the park visitors and park management.
Given the new developments in data storage, computational capacity, and efficiency of natural language processing algorithms, a new opportunity is presented to the park managers to better define visitor experience, measure their capacity to meet these expectations, and make the necessary changes to improve their service offering.
Using the New York State park system as the backdrop, we demonstrate the system we built to apply natural language processing algorithms to customer feedback collected from the various social networks.
This is only a small first step and we describe ways in which such systems can be further improved and implemented by the teams managing these local, state, and national parks.
The Practical Use of Big Data
Big Data has practical applications in multiple areas: commerce, education, business, financial institutions, and engineering, among many others.
The Data Scientist, using programming skills, technology, and data analysis techniques presented in the last section, will support the decision-making process, provide insights and generate value for businesses and organizations.
The use of data supports decision makers when responding to challenges. Moreover, understanding and using Big Data improves the traditional way of the decision-making process.
The use of Big Data is not limited to the private sector; it shows great potential in public administration. In this section, some examples of use in both spheres for the greater good are presented.
Some businesses, due to the volume of generated data, might find Big Data to be of more use to improve processes, monitor tasks and gain competitive advantages.
Call centers, for instance, have the opportunity of analyzing the audio of calls which will help to both control business processes—by monitoring the agent behavior and liability—and to improve the business, having the knowledge to make the customer experience better and identifying issues referring to products and services.
Although predictive analytics can be used in nearly all disciplines, retailers and online companies are big beneficiaries of this technique.
Due to the large number of transactions occurring every day, many opportunities for insight arise: understanding customer behavior and consumption patterns for better sales planning and replenishment, and analyzing promotions, are just a few examples of what Big Data can add.
The private sector is not the only one to experience the benefits of Big Data; opportunities in public administration are akin to those in private organizations. Governments use Big Data to stimulate the public good in the public sphere by digitizing administrative data and collecting and storing more data from multiple devices.
Big Data can be used in the different functions of the public administration: to detect irregularities, for general observation of regulated areas, to understand the social impact through social feedback on actions taken and also to improve public services.
Other examples of good use of Big Data in public administration are to identify and address basic needs in a faster way, to reduce the unemployment rate, to avoid delays in pension payments, to control traffic using live streaming data and also to monitor the potential need of emergency facilities.
Companies and public organizations have been using their data not only to understand the past but also to understand what is happening now and what will happen in the future.
Some decisions are now automated by algorithms and findings that would have taken months or even years before are being discovered at a glance now. This great power enables a much faster reaction to situations and is leading us into a new evolution of data pacing, velocity, and understanding.