Big Data Analytics and Ethnography
Big Data, perceived as one of the breakthrough technological developments of our times, has the potential to revolutionize essentially any area of knowledge and impact on any aspect of our life. Using advanced analytics techniques such as text analytics, machine learning, predictive analytics, data mining, statistics, and natural language processing, analysts, researchers, and business users can analyze previously inaccessible or unusable data to gain new insights resulting in better and faster decisions, and producing both economic and social value; it can have an impact on employment growth, productivity, the development of new products and services, traffic management, the spread of viral outbreaks, and so on.
But great opportunities also bring great challenges, such as the loss of individual privacy. In this blog, we aim to provide an introduction into what Big Data is and an overview of the social value that can be extracted from it; to this aim, we explore some of the key literature on the subject. We also call attention to the potential ‘dark’ side of Big Data but argue that more studies are needed to fully understand the downside of it. We conclude this blog with some final reflections.
The Hype Around Big Data and Big Data Analytics
• Is it possible to predict whether a person will get some disease 24 hours before any symptoms are visible?
• Is it possible to predict future virus hotspots?
• Is it possible to predict when or where a fraud or crime will take place before it actually happens?
• Is it possible to predict traffic congestion up to three hours in advance?
• Is it possible to predict terrorists’ future moves?
• Can we support in a better way the wellbeing of people?
These are some of the questions that the Big Data age promises to have an answer for. We should note, at the outset, that some of these questions have already been answered, fostering new waves of creativity, innovation, and social change. But the true potential of Big Data in any of the areas mentioned is yet to be fully unlocked.
Big Data is a big phenomenon that for the past years has been fundamentally changing not only what we know, but also what we do, how we communicate and work, and how we cooperate and compete. It has an impact at the individual, organizational, and societal level, being perceived as a breakthrough technological development.
Today, we are witnessing an exponential increase in ‘raw data’, both human and machine-generated; human, born from the continuous social interactions and doings among individuals, which led McAfee and Brynjolfsson to refer to people as “walking data generators”; machine-generated, born from the continuous interaction among objects (generally coined the ‘Internet-of-Things’), data which is generally collected via sensors and IP addresses. Big Data comes from five major sources:
• Large-scale enterprise systems, such as enterprise resource planning, customer relationship management, supply chain management, and so on.
• Online social graphs, resulting from the interactions on social networks, such as Facebook, Twitter, Instagram, WeChat, and so on.
• Mobile devices, comprising handsets, mobile networks, and internet connection.
• Internet-of-Things, involving the connection between physical objects via sensors.
• Open data/public data, such as weather data, traffic data, environment and housing data, financial data, geodata, and so on.
Advanced analytics techniques such as text analytics, machine learning, predictive analytics, data mining, statistics, and natural language processing are just as important as the Big Data itself. This may sound somewhat obvious and trivial, but it is important to be clear about the weight that each holds in the discussion.
On the one hand, were it not for the possibility to collect large amounts of data being generated, the development of advanced analytics techniques would be irrelevant. On the other hand, the availability of the huge data would mean nothing without advanced analytics techniques to analyze it. Advanced analytics techniques and Big Data are, thus, intertwined. And thanks to Big Data Analytics, we now have the possibility to transform all of the data into meaningful information that can be explored at various levels of the organization or society.
It is useful to distinguish among three different types of analytics: descriptive, predictive, and prescriptive. It should be noted that today, nonetheless, most of the efforts are directed towards predictive analytics.
• Descriptive analytics helps answer the question: What happened? It uses data aggregation and data mining (dashboards/scorecards and data visualization, among others). It helps to summarize and describe the past and is useful as it can shed light onto behaviors that can be further analyzed to understand how they might influence future results. For example, descriptive statistics can be used to show the average pounds spent per household or the total number of vehicles in inventory.
• Predictive analytics helps answer the question: What will or could happen? It uses statistical models and forecasting techniques (regression analysis, machine learning, neural networks, golden path analysis, and so on). It helps to understand the future by means of providing estimates about the likelihood of a future outcome. For example, predictive analytics can be used to forecast customer purchasing patterns or customer behavior.
• Prescriptive analytics helps answer the questions: How can we make it happen? What should we do? It uses optimization and simulation algorithms (algorithms, machine learning, and computational modeling procedures, among others). It helps to advise on possible outcomes before decisions are made by means of quantifying the effect of future decisions. For example, prescriptive analytics can be used to optimize production or customer experience.
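To make the distinction between the three types of analytics more concrete, here is a minimal Python sketch that applies each of them to the same tiny dataset. Everything in it is an assumption made for illustration: the weekly spend figures, the `fit_line` helper, and the discount scenarios with their lift and margin values are hypothetical and do not come from any real retailer.

```python
from statistics import mean

# Hypothetical weekly spend (in pounds) for one household; illustrative values only.
weekly_spend = [52.0, 55.0, 58.0, 61.0, 64.0, 67.0]
weeks = list(range(1, len(weekly_spend) + 1))

# Descriptive: what happened? Summarize the past.
average_spend = mean(weekly_spend)


# Predictive: what could happen? Fit a least-squares line and extrapolate.
def fit_line(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


slope, intercept = fit_line(weeks, weekly_spend)
forecast_week_7 = slope * 7 + intercept

# Prescriptive: what should we do? Compare candidate actions on predicted profit.
# Each hypothetical discount is assumed to lift spend by a fixed share while
# reducing the margin; these numbers are made up for the example.
actions = {
    "no_discount": {"lift": 0.00, "margin": 0.30},
    "small_discount": {"lift": 0.25, "margin": 0.27},
    "large_discount": {"lift": 0.40, "margin": 0.18},
}


def expected_profit(action):
    """Predicted profit for week 7 under the given (hypothetical) action."""
    params = actions[action]
    return forecast_week_7 * (1 + params["lift"]) * params["margin"]


best_action = max(actions, key=expected_profit)

print(f"Average weekly spend: {average_spend:.2f}")
print(f"Forecast for week 7: {forecast_week_7:.2f}")
print(f"Recommended action: {best_action}")
```

The point of the sketch is the progression rather than the numbers: the descriptive step only summarizes, the predictive step extrapolates, and the prescriptive step uses the prediction to rank possible decisions. In practice each step would rely on far richer models and data.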
Despite the increased interest in exploring the benefits of Big Data emerging from performing descriptive, predictive, and/or prescriptive analytics, researchers and practitioners alike have not yet agreed on a unique definition of the concept of Big Data. We have performed a Google search of the most common terms associated with Big Data.
The fact that there is no unique definition of Big Data is not necessarily bad, since this allows the possibility to explore facets of Big Data that may otherwise be constrained by a definition; but, at the same time, it is surprising if we consider that what was once considered to be a problem, that is, the collection, storage, and processing of a large amount of data, today is not an issue anymore; as a matter of fact, advanced analytics technologies are constantly being developed, updated, and used.
In other words, we have the IT technology to support the finding of innovative insights from Big Data, while we lack a unified definition of Big Data. The truth is that indeed, this is not a problem. We may not all agree on what Big Data is, but we do all agree that Big Data exists. And grows exponentially. And this is sufficient because what this means is that we now have more information than ever before and we have the technology to perform analyses that could not be done before when we had smaller amounts of data.
Let us consider, for example, the case of WalMart, which has been using predictive technology since 2004. Generally, big retailers (WalMart included) would collect information about their transactions for the purpose of knowing how much they are selling. But WalMart took advantage of the benefits posed by Big Data and took that extra step when it started analyzing the trillions of bytes’ worth of sales data, looking for patterns and correlations.
One of the things they were able to determine was what customers purchase the most ahead of a storm. And the answer was Pop-Tarts, a sugary pastry that requires no heating, lasts for an incredibly long period of time, and can be eaten at any meal. This insight allowed WalMart to optimize its supply of Pop-Tarts in the months or weeks leading up to a possible storm. The benefit for both the company and the customers is, thus, obvious.
Considering the above example, one of the conclusions we can immediately draw is that the main challenge in the Big Data age remains how to use the newly-generated data to produce the greatest value for organizations, and ultimately, for the society.
Einav and Levin captured this view when they elegantly stated that Big Data’s potential comes from the “identification of novel patterns in behavior or activity, and the development of predictive models, that would have been hard or impossible with smaller samples, fewer variables, or more aggregation”.
We thus also agree with the position taken by Baesens et al., who stated that “analytics goes beyond business intelligence, in that it is not simply more advanced reporting or visualization of existing data to gain better insights. Instead, analytics encompasses the notion of going beyond the surface of the data to link a set of explanatory variables to a business response or outcome”.
It is becoming increasingly clear that Big Data is creating the potential for significant innovation in many sectors of the economy, such as science, education, healthcare, public safety and security, retailing and manufacturing, e-commerce, and government services, just to mention a few—we will discuss some of this potential later in the blog.
For example, according to a 2013 Report published by McKinsey Global Institute, Big Data analytics is expected to generate up to $190 billion annually in healthcare cost savings alone by 2020. Concomitantly, it is also true that despite the growth of the field, Big Data Analytics is still in its incipient stage and comprehensive predictive models that tie together knowledge, human judgment and interpretation, commitment, common sense, and ethical values, are yet to be developed. And this is one of the main challenges and opportunities of our times. The true potential of Big Data is yet to be discovered.
We conclude this section with the words of Watson, who stated that “The keys to success with big data analytics include a clear business need, strongly committed sponsorship, alignment between the business and IT strategies, a fact-based decision-making culture, a strong data infrastructure, the right analytical tools, and people skilled in the use of analytics”.
What Is Big Data?
The term ‘Big Data’ was initially used in 1997 by Michael Cox and David Ellsworth to explain both the data visualization and the challenges it posed for computer systems. To say that Big Data is a new thing is to some extent erroneous. Data have always been with us; it is true, however, that during the 1990s and the beginning of the 2000s, we experienced an increase in IT-related infrastructure, which allowed us to store the data that was being produced. But most of the time, these data were simply that: stored, and most probably forgotten. Little value was actually being extracted from the data.
Today, besides the required IT technology, our ability to generate data has increased dramatically—as mentioned previously, we have more information than ever before, but what has really changed is that we can now analyze and interpret the data in ways that could not be done before when we had smaller amounts of data. And this means that Big Data has the potential to revolutionize essentially any area of knowledge and any aspect of our life.
Big Data has received many definitions and interpretations over time and a unique definition has not yet been reached, as indicated previously. Today, it is customary to define Big Data in terms of data characteristics or dimensions, often with names starting with the letter ‘V’. The following four dimensions are among the most often encountered:
Volume: It refers to the large amount of data created every day globally, which includes both simple and complex analytics and which poses the challenge of not just storing it, but also analyzing it. It has been reported that 90% of the existent data has been generated in the past two years alone. It is also advanced that by 2020, the volume of data will be 40 ZB, 300 times bigger than the volume of data in 2005.
Velocity: It refers to the speed at which new data is generated as compared to the time window needed to translate it into intelligent decisions. It is without a doubt that in some cases, the speed of data creation is more important than the volume of the data; IBM considered this aspect when they stated that “for time-sensitive processes such as catching fraud, big data must be used as it streams into your enterprise in order to maximize its value”. Real-time processing is also essential for businesses looking to obtain a competitive advantage over their competitors, for example, the possibility to estimate the retailers’ sales on a critical day of the year, such as Christmas.
Variety: It encapsulates the increasingly different types of data, structured, semi-structured, and unstructured, from diverse data sources (e.g., web, video and audio data, sensor data, financial data, and transactional applications, log files and clickstreams, GPS signals from cell phones, social media feeds, and so on), and in different sizes from terabytes to zettabytes.
One of the biggest challenges is posed by unstructured data. Unstructured data is a fundamental concept in Big Data and it refers to data that has no rules attached to it, such as a picture or a voice recording. The challenge is how to use advanced analytics to make sense of it.
Veracity: It refers to the trustworthiness of the data. In its 2012 Report, IBM showed that 1 in 3 business leaders don’t trust the information they use to make decisions. One of the reasons for this phenomenon is that there are inherent discrepancies in the data, most of which emerge from the existence of unstructured data.
This is even more interesting if we consider that, today, most of the data is unstructured. Another reason is the presence of inaccuracies. Inaccuracies can be due to the data being intrinsically inaccurate or from the data becoming inaccurate through processing errors.
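As a quick sanity check on the Volume figures quoted above (40 ZB by 2020, 300 times the 2005 volume), the following short Python sketch works out what those two numbers imply. It reads “300 times bigger” as a factor of 300 and assumes smooth exponential growth between 2005 and 2020; both are simplifying assumptions made here, not claims from the sources cited.

```python
# Figures quoted in the Volume paragraph above.
volume_2020_zb = 40.0   # projected volume in zettabytes
growth_factor = 300.0   # "300 times bigger" than 2005, read as a factor of 300
years = 2020 - 2005

# Implied 2005 volume, in zettabytes and in exabytes (1 ZB = 1000 EB).
volume_2005_zb = volume_2020_zb / growth_factor
volume_2005_eb = volume_2005_zb * 1000

# Implied compound annual growth rate, assuming smooth exponential growth
# (an assumption made for illustration only).
annual_growth = growth_factor ** (1 / years) - 1

print(f"Implied 2005 volume: about {volume_2005_eb:.0f} EB")
print(f"Implied annual growth: about {annual_growth:.0%}")
```

Under these assumptions, the 2005 volume comes out at a little over 130 EB, and the data volume would have had to grow by somewhere around 46% every year, which gives a feel for the pace behind the word “exponential”.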
Building upon the 4 Vs:
The 4 V definition is but a starting point that outlines the perimeters. The definition does not help us to determine what to do inside the perimeters, how to innovatively investigate and analyze big data to enhance decision making quality, how to anticipate and leverage the transformational impacts of big data, or how best to consider scope as well as scale impacts of big data.
As such, they argued for the necessity to include a 5th V, namely Value, to complement the 4 V framework. It should be noted that, by some accounts, there are as many as 10 Vs.
Charles and Gherman argued that the term Big Data is a misnomer, stating that while the term in itself refers to the large volume of data, Big Data is essentially about the phenomenon that we are trying to record and the hidden patterns and complexities of the data that we attempt to unpack. With this view, the authors advanced an expanded model of Big Data, wherein they included three additional dimensions, namely the 3 Cs: Context, Connectedness, and Complexity.
The authors stated that understanding the Context is essential when dealing with Big Data, because “raw data could mean anything without a thorough understanding of the context that explains it”; Connectedness was defined as the ability to understand Big Data in its wider Context and within its ethical implications; and Complexity was defined from the perspective of having the skills to survive and thrive in the face of complex data, by means of being able to identify the key data and differentiate the information that truly has an impact on the organization.
Presenting them all goes beyond the scope of the present blog, but we hope to have provided a flavor of the various dimensions of Big Data. Having, thus, highlighted these, along with the existing debate surrounding the very definition of Big Data, we will now move towards presenting an overview of the social value that Big Data can offer.
The Social Value of Big Data
Value creation in a Big Data perspective includes both the traditional economic dimension of value and the social dimension of value. Today, we are yet to fully understand how organizations actually translate the potential of Big Data into the said value. Generally, however, stories of Big Data’s successes have tended to come from the private sector, and less is known about its impact on social organizations. Big Data can, nonetheless, drive big social change, in fields such as education, healthcare, and public safety and security, just to mention a few.
Furthermore, social value can be materialized as employment growth, increased productivity, increased consumer surplus, new products, and services, new markets and better marketing, and so on. Governments, for instance, can use big data to, “enhance transparency, increase citizen engagement in public affairs, prevent fraud and crime, improve national security, and support the well-being of people through better education and healthcare”.
An in-depth systematic review of the Information Systems literature on the topic identified two socio-technical features of Big Data that influence value realization: portability and interconnectivity. The authors further argue that, in practice, “organizations need to continuously realign work practices, organizational models, and stakeholder interests in order to reap the benefits from big data”.
As previously mentioned, the value that Big Data Analytics can unleash is great, but we are yet to fully understand the extent of the benefits. Empirical studies that consider the creation of value from Big Data Analytics are nowadays growing in number, but are still rather scarce. We, thus, join the call for further studies in the area.
In the following lines, we will proceed to explore how Big Data can inform social change and to this aim, we present some of the advancements made in three different sectors. It should be noted that the information provided is not exhaustive, as our main intention is to provide a flavor of the opportunities brought about by the Big Data age.
Is it possible to predict whether a person will get some disease 24 hours before any symptoms appear? It is generally considered that the healthcare system is one of the sectors that will benefit the most from the existence of Big Data Analytics. Let us explore this in the following lines.
There is a certain consensus that some of the challenges faced by the healthcare sector include the inadequate integration of healthcare systems and poor healthcare information management. The healthcare sector, in general, amasses a large amount of information, which, nonetheless, results today in unnecessary increases in medical costs and time for both healthcare service providers and patients.
Researchers and hospital managers alike are thus interested in how this information could be used instead to deliver a high-quality patient experience, while also improving organizational and financial performance and meeting future market needs.
Big Data Analytics can support evidence-based decision-making and action taking in healthcare. In this sense, one study found that only 42% of the healthcare organizations surveyed supported their decision-making process with Big Data Analytics and only 16% actually had the necessary skills and experience to use Big Data Analytics. The value that can be generated in general goes, thus, far beyond the one that is created today.
But beyond improving profits and cutting down on operating costs, Big Data can help in other ways, such as curing disease or detecting and predicting epidemics. Big Data Analytics can help to collect and analyze the health data that is constantly being generated for faster responses to individual health problems; ultimately, for the betterment of the patient.
It is now a well-known fact that, with the help of Big Data Analytics, real-time datasets have been collected, modeled, and analyzed, and that this has helped to speed up the development of new flu vaccines and to identify and contain the spread of viral outbreaks such as dengue fever or even Ebola.
Furthermore, we can only imagine for now what we would be able to do if, for example, we would collect all the data that is being created during every single doctor appointment. There are a variety of activities that happen during the routine medical examinations, which are not necessarily recorded, especially if the results turn out to be within some set parameters (in other words, if the patient turns out to be healthy):
The doctor will take the body temperature and blood pressure, look into the eyes to see the retina lining, use an otoscope to look into the ears, listen to heartbeat for regularity, listen to breathing in the lungs, and so on—all these data could help understand as much about a patient as possible, and as early in his or her life as possible.
This, in turn, could help identify warning signs of illness well in advance, preventing further advancement of the disease, increasing the odds of success of the treatment, and ultimately, reducing the associated expenses. Now, to some extent, collecting this kind of granular data about an individual is possible due to smartphones, dedicated wearable devices, and specialized apps, which can collect data, for example, on how many steps a day a person walks and on the number of daily calories consumed, among others. But the higher value that these datasets hold is yet to be explored.
Psychiatry is a particular branch of medicine that could further benefit from the use of Big Data Analytics and research studies to address such matter have just recently started to emerge. It is well known that in psychiatric treatments, there are treatments that are proven to be successful, but what cannot be predicted generally is who they are going to work for; we cannot yet predict a patient’s response to a specific treatment.
What this means, in practical terms, is that most of the time a patient would have to go through various trials with various medicines before identifying the one that works best for the patient in question. Recent research has highlighted the importance of Big Data and robust statistical methodologies in treatment prediction research and, in so doing, has advocated for the use of machine-learning approaches beyond exploratory studies and toward model validation. The practical implications of such endeavors are rather obvious.
The healthcare industry, in general, has not yet fully understood the potential benefits that could be gained from Big Data Analytics. The authors further stated that most of the potential value creation is still in its infancy, as predictive modeling and simulation techniques for analyzing healthcare data as a whole have not yet been developed. Today, one of the biggest challenges for healthcare organizations is represented by the missing support infrastructure needed for translating analytics-derived knowledge into action plans, a fact that is particularly true in the case of developing countries.
It is to some extent gratuitous to say that, in the end, nothing is more important than our food supply. Considering that we still live in a world in which there are people dying of starvation, it comes as quite a surprise to note that about a third of the food produced for human consumption is lost or wasted every year.
The agriculture sector is, thus, in desperate need of solutions to tackle problems such as inefficiencies in planting, harvesting, water use, and trucking, among others. The Big Data age promises to help. For example, Big Data Analytics can help farmers simulate the impact of water, fertilizer, and pesticide, and engineer plants that will grow in harsh climatic conditions; it can help to reduce waste, increase and optimize production, speed up plant growth, and minimize the use of scarce resources, such as water.
Generally speaking, Big Data Analytics has not yet been widely applied in agriculture. Nonetheless, there is increasing evidence of the use of digital technologies and bio-technologies to support agricultural practices. This is termed smart farming, a concept that is closely related to sustainable agriculture.
Farmers have now started using high-technology devices to generate, record, and analyze data about soil and water conditions and weather forecasts in order to extract insights that would assist them in refining their decision-making process. Some examples of tools being used in this regard include: agricultural drones (for fertilizing crops), satellites (for detecting changes in the field), and sensors on the field (for collecting information about weather conditions, soil moisture and humidity, and so on).
As of now, Big Data Analytics in agriculture has resulted in a number of research studies in several areas—we herewith mention some of the most recent ones: crops, land, remote sensing, weather and climate change, animals’ research, and food availability and security.
The applicability of Big Data in agriculture faces a series of challenges, among which: data ownership and security and privacy issues, data quality, intelligent processing and analytics, sustainable integration of Big Data sources, and openness of platforms to speed up innovation and solution development. These challenges would need, thus, to be addressed in order to expand the scope of Big Data applications in agriculture and smart farming.
Traffic congestion and parking unavailability are a few examples of major sources of traffic inefficiency worldwide. But how about if we could change all that? How about if we could predict traffic jams hours before they actually take place and use such information to reach our destinations in less time?
How about if we were able to immediately find an available parking space and avoid frustration considerably? Transportation is another sector that can greatly benefit from Big Data. There is a huge amount of data that is being created, for example, from the sat nav installed in vehicles, as well as the embedded sensors in infrastructure.
But what has been achieved so far with Big Data Analytics in transportation? One example is the development of the ParkNet system (“ParkNet at Rutgers”, n/a), a wireless sensing network developed in 2010 which detects and provides information regarding open parking spaces. The way it works is that a small sensor is attached to the car and an onboard computer collects the data which is uploaded to a central server and then processed to obtain the parking availability.
Another example is VTrack, a system for travel time estimation using sensor data collected by mobile phones that addresses two key challenges: reducing energy consumption and obtaining accurate travel time estimates. In the words of the authors themselves:
Real-time traffic information, either in the form of travel times or vehicle flow densities, can be used to alleviate congestion in a variety of ways: for example, by informing drivers of roads or intersections with large travel times (“hotspots”); by using travel time estimates in traffic-aware routing algorithms to find better paths with smaller expected time or smaller variance; by combining historical and real-time information to predict travel times in specific areas at particular times of day; by observing times on segments to improve operations (e.g., traffic light cycle control), plan infrastructure improvements, assess congestion pricing and tolling schemes, and so on.
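To illustrate two of the ideas in the passage above, namely combining historical and real-time information and flagging “hotspots”, here is a minimal Python sketch. It is not the VTrack method, which is considerably more sophisticated; it is a simple weighted average under stated assumptions. The segment names, the travel times, the `weight_realtime` parameter, and the hotspot threshold are all hypothetical.

```python
from statistics import mean

# Hypothetical travel times in seconds per road segment; all values are made up.
historical = {  # past observations for the same time of day
    "segment_a": [60, 62, 58, 61],
    "segment_b": [120, 118, 125, 121],
}
realtime = {  # latest observations from phones currently on the segment
    "segment_a": [95, 100],
    "segment_b": [119],
}


def predict_travel_time(segment, weight_realtime=0.7):
    """Blend historical and real-time means; the weight is an assumed tuning knob."""
    hist = mean(historical[segment])
    live = realtime.get(segment)
    if not live:
        return hist  # no live data: fall back on history alone
    return weight_realtime * mean(live) + (1 - weight_realtime) * hist


def hotspots(threshold=1.3):
    """Segments whose predicted time exceeds the historical mean by the given ratio."""
    return [
        segment
        for segment in historical
        if predict_travel_time(segment) > threshold * mean(historical[segment])
    ]


print(hotspots())
```

With these made-up numbers, only the first segment is flagged, because its live readings are far above what history would suggest. A real system would also have to deal with noisy positions, sparse samples, and map matching, none of which is attempted here.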
A third example is VibN, a mobile sensing application capable of exploiting multiple sensor feeds to explore live points of interest of the drivers. Not only that, but it can also automatically determine a driver’s personal points of interest.
Lastly, another example is the use of sensors embedded in the car that could be able to predict when the car would break down. A change in the sound being emitted by the engine or a change in the heat generated by certain parts of the car—all these data and much more could be used to predict the increased possibility of a car to break down and allow the driver to take the car to a mechanic prior to the car actually breaking down. And this is something that is possible with Big Data and Big Data Analytics and associated technologies.
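The breakdown-prediction idea can be sketched, in its very simplest form, as anomaly detection on a sensor stream: learn what “normal” looks like and flag readings that stray too far from it. The Python example below does this with a basic z-score rule. The temperature values, the `anomalies` helper, and the threshold of three standard deviations are assumptions chosen for illustration, not a description of any production system.

```python
from statistics import mean, stdev


def anomalies(baseline, recent, z_threshold=3.0):
    """Return recent readings more than z_threshold standard deviations from the baseline mean."""
    mu = mean(baseline)
    sigma = stdev(baseline)  # sample standard deviation of the healthy baseline
    return [r for r in recent if abs(r - mu) > z_threshold * sigma]


# Hypothetical engine temperatures in degrees Celsius; values are illustrative.
baseline_temps = [90.0, 91.0, 89.5, 90.5, 90.0, 89.0, 91.5, 90.5]
recent_temps = [90.2, 91.0, 97.5, 98.1]

flagged = anomalies(baseline_temps, recent_temps)
print(flagged)
```

Here the last two readings are flagged, which could prompt a visit to the mechanic before an actual failure. Real predictive maintenance would typically combine many signals (sound, vibration, heat) and learn from past failures rather than rely on a single fixed threshold.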
To sum up, the Big Data age presents opportunities to use traffic data not only to solve a variety of existent problems, such as traffic congestion and equipment faults, but also to predict traffic congestion and equipment faults before they actually happen. Big Data Analytics can, thus, be used for better route planning, traffic monitoring and management, and logistics, among others.
Good… but What About the Bad?
Any given technology is argued to have a dual nature, bringing both positive and negative effects that we should be aware of. Below we briefly present two of the latter effects.
In the context of Big Data and advanced analytics, a negative aspect, which also represents one of the most sensitive and worrisome issues, is the privacy of personal information. When security is breached, privacy may be compromised and loss of privacy can, in turn, result in other harms, such as identity theft and cyberbullying or cyberstalking.
“[…] There is a great public fear about the inappropriate use of personal data, particularly through the linking of data from multiple sources. Managing privacy is effectively both a technical and a sociological problem, and it must be addressed jointly from both perspectives to realize the promise of big data”.
In the age of Big Data, there is a necessity to create new principles and regulations to cover the area of privacy of information, although who exactly should create these new principles and regulations is a rather sensitive question.
On the one hand, behavioral economists would argue that humans are biased decision-makers, which would support the idea of automation. But on the other hand, what happens when individuals gain the skills necessary to use automation but know very little about the underlying assumptions and knowledge domain that make automation possible?
It is without much doubt that The Bad or dark side of the Big Data age cannot be ignored and should not be treated with less importance than it merits, but more in-depth research is needed to explore and gain a full understanding of its negative implications and how these could be prevented, diminished, or corrected.
In this introductory blog, we have aimed to provide an overview of the various dimensions and aspects of Big Data, while also exploring some of the key research studies that have been written and related applications that have been developed, with a special interest in the societal value generated.
It has been advocated that advanced analytics techniques should support, but not replace, human decision-making, common sense, and judgment. We align with such assessment that indeed, without the qualities mentioned above, Big Data is, most likely, meaningless. But we also need to be pragmatic and accept evidence that points to the contrary. Markus, for example, pointed out that today there is evidence of automated decision-making, with minimal human intervention:
By the early 2000s, nearly 100% of all home mortgage underwriting decisions in the United States were automatically made. Today, over 60% of all securities trading is automated, and experts predict that within ten years even the writing of trading algorithms will be automated because humans will not be fast enough to anticipate and react to arbitrage opportunities.
IBM is aggressively developing Watson, its Jeopardy game-show-winning software, to diagnose and suggest treatment options for various diseases. In these and other developments, algorithms and Big Data […] are tightly intertwined.
It is, thus, not too bold to say that the potential applications and social implications brought by the Big Data age are far from being entirely understood and continuously change, even as we speak. Without becoming too philosophical about it, we would simply like to conclude the above by saying that a Big Data Age seems to require a Big Data Mind; and this is one of the greatest skills that a Big Data Scientist could profess.
Today, there are still many debates surrounding Big Data, and one of the most prolific involves questioning the very existence of Big Data, with arguments for and against it. But this is more than counter-productive. Big Data is here to stay. In a way, the Big Data age can be compared to the transition from the Stone Age to the Iron Age: it is simply the next step in the evolution of human civilization and it is, quite frankly, irreversible.
And just because we argue over its presence does not mean it will disappear. The best we can do is accept its existence as the natural course of affairs and instead concentrate all our efforts to chisel its path forward in such a way so as to serve a Greater Good.
We conclude by re-stating that although much has been achieved until the present time, the challenge remains the insightful interpretation of the data and the usage of the knowledge obtained for the purposes of generating the most economic and social value. We join the observation made by other researchers according to which many more research studies are needed to fully understand and fully unlock the societal value of Big Data.
Until then, we will most likely continue to live in a world wherein individuals and organizations alike collect massive amounts of data with a ‘just in case we need it’ approach, trusting that one day, not too far away, we will come to crack the Big Data Code.
Big Data Analytics and Ethnography
Ethnography is generally positioned as an approach that provides deep insights into human behavior, producing ‘thick data’ from small datasets, whereas big data analytics is considered to be an approach that offers ‘broad accounts’ based on large datasets. Although perceived as antagonistic, ethnography and big data analytics have, in many ways, a shared purpose.
In this sense, this blog explores the intersection of the two approaches to analyzing data, with the aim of highlighting both their similarities and complementary nature. Ultimately, this blog advances that ethnography and big data analytics can work together to provide a more comprehensive picture of big data, and can, thus, generate more societal value together than each approach on its own.
For thousands of years and across many civilizations, people have craved knowledge of the future. From asking the Oracle to consulting the crystal ball to reading the tarot cards, people have always sought any help that could tell them what the future held, information that would help them make better decisions in the present.
Today, the craving for such knowledge is still alive, and the means to meet it are big data and big data analytics. From traffic congestion to natural disasters, from disease outbreaks to terrorist attacks, from game results to human behavior, the general view is that there is nothing that big data analytics cannot predict. Indeed, the analysis of huge datasets has proven to have invaluable applications.
Big data analytics is one of today’s most famous technological breakthroughs that can enable organizations to analyze fast-growing immense volumes of varied datasets across a wide range of settings, in order to support evidence-based decision-making. Over the past few years, the number of studies that have been dedicated to assessing the potential value of big data and big data analytics has been steadily increasing, which also reflects the increasing interest in the field.
Organizations worldwide have come to realize that in order to remain competitive or gain a competitive advantage over their counterparts, they need to be actively mining their datasets for newer and more powerful insights.
Big data can, thus, mean big money. But there seems to be at least one problem. A 2013 survey by the big data firm Infochimps, which looked at the responses of over 300 IT department staffers, indicated that 55% of big data projects do not get completed, with many others falling short of their objectives.
According to another study published by Capgemini & Informatica, which surveyed 210 executives from five developed countries (France, Germany, Italy, the Netherlands, and the UK) about the business value and benefits that enterprises are realizing from big data, only 27% of big data projects were reported as profitable, whereas 45% broke even, and 12% actually lost money.
Further, in 2015, Gartner predicted that through 2017, 60% of big data projects will fail to go beyond piloting and experimentation. In 2016, Gartner actually conducted an online survey of 199 Gartner Research Circle members and the results indicated that only 15% of businesses deployed their big data projects from pilot to production.
These statistics show that although big investments are taking place in big data projects, the generation of value does not match the expectations. The obvious question is, of course, why? Why are investments in big data failing or, in other words, why is having the data not sufficient to yield the expected results?
In 2014, Watson advanced that “the keys to success with big data analytics include a clear business need, strongly committed sponsorship, alignment between the business and IT strategies, a fact-based decision-making culture, a strong data infrastructure, the right analytical tools, and people skilled in the use of analytics”.
Although informative and without doubt useful, these tips nonetheless seem insufficient today; otherwise stated, if we know what we need in order to succeed with big data analytics, then why don’t we succeed in creating full value? The truth is that we have yet to fully understand how big data can be translated into economic and societal value, and the sooner we recognize this shortcoming, the sooner we can find solutions to correct it.
In this blog, we advance that ethnography can support big data analytics in the generation of greater societal value. Although perceived to be in opposition, ethnography and big data analytics have much in common and in many ways, they have a shared purpose. In the following sections, we explore the intersection of the two approaches. Ultimately, we advance that researchers can blend big data analytics and ethnography within a research setting; hence, that big data analytics and ethnography together can inform the greater good to a larger extent than each approach on its own.
What Is Big Data?
‘Big data’: a concept, a trend, a mindset, an era. No unique definition, but a great potential to impact on essentially any area of our lives. The term big data is generally understood in terms of the four Vs advanced by Gartner: volume, velocity, variety, and veracity. In time, the number of Vs has increased, reaching up to ten Vs.
Other authors have further expanded the scope of the definition, broadening the original framework: Charles and Gherman, for example, advocated for the inclusion of three Cs: context, connectedness, and complexity. One of the most elegant and comprehensive definitions of big data can be found in none other than the Oxford English Dictionary, which defines it as: “extremely large datasets that may be analyzed computationally to reveal patterns, trends, and associations, especially relating to human behavior and interactions.”
Big data comes from various structured and unstructured sources, such as archives, media, business apps, public web, social media, machine log data, sensor data, and so on. Today, almost anything we can think of produces data and almost every data point can be captured and stored. Some would say: also, analyzed.
This may be true, but in view of the statistics presented in the introduction above, we are reluctant to state so. Undoubtedly, data is being continuously analyzed for better and deeper insights, even as we speak. But current analyses are incomplete: if we were able to fully extract the knowledge and insights that the datasets hold, we would most probably be able to fully capitalize on their potential, and not so many big data projects would fail in the first place.
The big data era has brought with it many challenges that have rendered traditional data processing application software unfit to deal with them. These challenges include networking, capturing data, data storage and data analysis, search, sharing, transfer, visualization, querying, updating and, more recently, information privacy.
But the list is not exhaustive and challenges are not static; in fact, they are dynamic, constantly mutating and diversifying. One of the aspects that we seem to generally exclude from this list of challenges is human behavior. Maybe it is not too bold to say that one of the biggest challenges in the big data age is the extraction of insightful information not from the existing data, but from the data originating from emergent human dynamics that either haven’t happened yet or that would hardly be traceable through big data.
One famous example in this regard is Nokia, a company that in the 1990s and part of the 2000s was one of the largest mobile phone companies in the world, holding by 2007 a market share of 80% in the smartphone market. Nevertheless, Nokia’s over-dependence on quantitative data led the company to lose its dominance of the mobile handset market.
In a post published in 2016, technology ethnographer Tricia Wang describes how she conducted ethnographic research for Nokia in China in 2009, which revealed that low-income consumers were willing to pay for more expensive smartphones; this was a great insight at the time, and it led her to conclude that Nokia should change its then strategy from making smartphones only for elite users to making smartphones for low-income users as well.
But Nokia considered Wang’s sample size of 100 too small to be reliable and noted, moreover, that her conclusion was not supported by the large datasets that Nokia possessed; the company, thus, did not act on the insight. Nokia was bought by Microsoft in 2013 and Wang concluded that:
There are many reasons for Nokia’s downfall, but one of the biggest reasons that I witnessed in person was that the company over-relied on numbers. They put a higher value on quantitative data, they didn’t know how to handle data that wasn’t easily measurable, and that didn’t show up in existing reports. What could’ve been their competitive intelligence ended up being their eventual downfall.
Netflix is at the other end of the game, illustrating how ethnographic insights can be used to strengthen a company’s position on the market. Without a doubt, Netflix is a data-driven company, just like Nokia. In fact, Netflix pays quite a lot of attention to analytics to gain insight into their customers. In 2006, Netflix launched the Netflix Prize competition, which would reward with $1 million the creation of an algorithm that would “substantially improve the accuracy of predictions about how much someone is going to enjoy a movie based on their movie preferences”.
But at the same time, Netflix was open to learning from more qualitative and contextual data about what users really wanted. In 2013, cultural anthropologist Grant McCracken conducted ethnographic research for Netflix and what he found was that users really enjoyed watching episode after episode of the same series, engaging in a new form of consumption, now famously called binge watching.
A survey conducted in the same year among 1500 TV streamers (online U.S. adults who stream TV shows at least once per week) confirmed that people did not feel guilty about binge-watching, with 73% of respondents actually feeling good about it. This new insight was used by Netflix to re-design its strategy and release whole seasons at once, instead of releasing one episode per week. This, in turn, changed the way users consume media and specifically Netflix’s products and how they perceived the Netflix brand while improving Netflix’s business.
What Is Ethnography?
Let us now consider the concept of ethnography and discuss it further. Ethnography, from the Greek words ethnos, meaning ‘folk, people, the nation’, and grapho, meaning ‘I write’ or ‘writing’, is the systematic study of people and cultures, aimed at understanding and making sense of social meanings, customs, rituals, and everyday practices.
For example, ethnography has been defined as:
The study of people in naturally occurring settings or ‘fields’ by methods of data collection which capture their social meanings and ordinary activities, involving the researchers participating directly in the setting, if not also the activities, in order to collect data in a systematic manner but without meaning being imposed on them externally.
Participant observation, ethnography, and fieldwork are all used interchangeably… they can all mean spending long periods watching people, coupled with talking to them about what they are doing, thinking and saying, designed to see how they understand their world.
Ethnography is about becoming part of the settings under study. “Ethnographies are based on observational work in particular settings”, allowing researchers to “see things as those involved see things”; “to grasp the native’s point of view, his relation to life, to realize his vision of his world”. Consequently, the writing of ethnographies is viewed as an endeavor to describe ‘reality’, as this is being experienced by the people who live it.
Ethnography depends greatly on fieldwork. Generally, data is collected through participant or nonparticipant observation. The primary data collection technique used by ethnographers is, nonetheless, participant observation, wherein the researchers assume an insider role, living as much as possible with the people they investigate.
Participant observers interact with the people they study, they listen to what they say and watch what they do; otherwise stated, they focus on people’s doings in their natural setting, in a journey of discovery of everyday life. Nonparticipant observation, on the other hand, requires the researchers to adopt a more ‘detached’ position. The two techniques differ, thus, from one another based on the weight assigned to the activities of ‘participating’ and ‘observing’.
Finally, ethnography aims to be a holistic approach to the study of cultural systems, providing ‘the big picture’ and depicting the intertwining between relationships and processes; hence, it usually requires a long-term commitment and dedication. In today’s fast-paced environment, however, mini-ethnographies are also possible. A mini-ethnography focuses on a specific phenomenon of interest and as such, it occurs in a much shorter period of time than that required by a full-scale ethnography.
Big Data Analytics and Ethnography: Points of Intersection
Ford advanced that “data scientists and ethnographers have much in common, that their skills are complementary, and that discovering the data together rather than compartmentalizing research activities was key to their success”. In a more recent study, it was postulated that “ethnographic observations can be used to contextualize the computational analysis of large datasets, while computational analysis can be applied to validate and generalize the findings made through ethnography”.
The latter further proposed a new approach to studying social interaction in an online setting, called big-data-augmented-ethnography, wherein they integrated ethnography with computational data collection.
To the best of our knowledge, the literature exploring the commonalities between big data analytics and ethnography is quite limited. In what follows, we attempt, thus, to contribute to the general discussion on the topic, aiming to highlight additional points of intersection.
Big data analytics comprises the skills and technologies for continuous, iterative exploration and investigation of past events to gain insight into what has happened and what is likely to happen in the future. In this sense, data scientists develop and work with models. Models, nonetheless, are simplified versions of reality. They aim to represent reality and are, thus, continuously revised, checked, and improved upon and, furthermore, tested to determine the extent to which they actually do so.
On the other hand, ethnographies are conducted in a naturalistic setting in which real people live, with the writing of ethnographies being viewed as an endeavor to describe reality. Furthermore, just as big data analytics-informed models are continuously being revised, “ethnography entails continual observations, asking questions, making inferences, and continuing these processes until those questions have been answered with the greatest emic validity possible”. In other words, both big data and ethnography are more concerned with the processes through which ‘reality’ is depicted rather than with judging the ‘content’ of such reality.
Changing the Definition of Knowledge
Both big data analytics and ethnography change the definition of knowledge and this is because both look for a more accurate representation of reality. On the one hand, big data has created a fundamental shift in how we think about research and how we define knowledge, reframing questions about the nature and the categorization of reality and bringing about a profound change at the levels of epistemology and ethics.
On the other hand, ethnographies aim to provide a detailed description of the phenomena under study, and as such, they may reveal that people’s reported behavior does not necessarily match their observed behavior. As a quote widely attributed to the famous anthropologist Margaret Mead states: “What people say, what people do, and what they say they do are entirely different things”. Ethnographies can be, and generally are, performed exactly because they can provide insights that could lead to new hypotheses or revisions of existing theory or understanding of social life.
Searching for Patterns
Both data scientists and ethnographers collect and work with a great deal of data and their job is, fundamentally, to identify patterns in that data. On the one hand, some say that the actual value of big data rests in helping organizations find patterns in data, which can further be converted into smart business insights.
Big data analytics or machine learning techniques help find hidden patterns and trends in big datasets, with a concern more towards the revelation of solid statistical relationships. Generally, this means finding out whether two or more variables are related or associated.
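To make the idea of two variables being ‘related or associated’ more concrete, the following is a minimal sketch that computes the Pearson correlation coefficient, one common measure of linear association. The function name and the toy figures are our own illustrative assumptions and are not taken from any of the studies discussed in this blog.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy data, invented purely for illustration.
hours_streamed = [1, 2, 3, 4, 5, 6]
episodes_watched = [1, 3, 4, 6, 7, 9]

r = pearson_r(hours_streamed, episodes_watched)
print(f"Pearson r = {r:.3f}")
```

A coefficient close to 1 or -1 indicates a strong linear association, while a value near 0 indicates a weak one; on its own, the coefficient says nothing about why the two variables move together.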
Ethnography, on the other hand, literally means to ‘write about a culture’ and in the course of so doing, it provides thick descriptions of the phenomena under study, trying to make sense of what is going on and reveal understandings and meanings. By carefully observing and/or participating in the lives of those under study, ethnographers thus look for shared and predictable patterns in the lived human experiences: patterns of behavior, beliefs, and customs, practices, and language.
Aiming to Predict
A common application of big data analytics includes the study of data with the aim to predict and improve. The purpose of predictive analytics is to measure precisely the impact that a specific phenomenon has on people and to predict the chances of being able to duplicate that impact on future activities. In other words, identifying patterns in the data is generally used to build predictive models that will aid in the optimization of a certain outcome.
On the other hand, it is generally thought that ethnography is, at its core, descriptive. But this is somewhat of a misunderstanding. Today, there is a shift in ethnographers’ aims, and ethnographic analyses can take the shape of predictions. Evidence of this is Wang’s ethnographic research for Nokia and McCracken’s ethnographic research for Netflix.
In the end, in practical terms, the reason why we study a phenomenon, irrespective of the method of data collection or data analysis used, is not just because we want to understand it better, but because we also want to predict it better. The identification of patterns enables predictions, and as we have already implied before, both big data analytics and ethnography can help in this regard.
Sensitive to Context
Big data analytics and ethnography are both context-sensitive; in other words, taken out of context, the insights obtained from both approaches will lose their meaning. On the one hand, big data analytics is not just about finding patterns in big data. It is not sufficient to discover that one phenomenon correlates with another; or otherwise stated, there is a big difference between identifying correlations and actually discovering that one causes the other (cause and effect relationship). Context, meaning, and interpretation become a necessity and not a luxury.
Analytics often happens in a black box, offering up a response without context or clear transparency around the algorithmic rules that computed a judgment, answer, or decision. Analytics software and hardware are being sold as a single-source, easy solution to make sense of today’s digital complexity.
The promise of these solutions is that seemingly anyone can be an analytics guru. There is a real danger in this do-it-yourself approach to analytics, however. As with all scientific instruments and approaches, whether it be statistics, a microscope, or even a thermometer, without proper knowledge of the tool, expertise in the approach, and knowledge of the rules that govern the process, the results will be questionable.
On the other hand, inferences made by ethnographers are tilted towards the explanation of phenomena and relationships observed within the study group. Ethnography supports the research endeavor of understanding the multiple realities of life in context; hence, by definition, ethnography provides detailed snapshots of contextualized social realities. Generalization outside the group is limited and, taken out of context, meanings will also be lost.
‘Learning’ from Smaller Data
Bigger data are not always better data. Generally, big data is understood as originating from multiple sources (the ‘variety’ dimension) or having to be integrated from multiple sources to obtain better insights. Nevertheless, this creates additional challenges. “Every one of those sources is error-prone… […] we are just magnifying that problem [when we combine multiple datasets]”, as cited by Boyd and Crawford. In this sense, smaller data may be more appropriate for intensive, in-depth examination to identify patterns and phenomena, an area in which ethnography holds the crown.
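A simple back-of-the-envelope model can illustrate why combining sources may magnify errors. Assuming, purely for illustration, that each source contributes one field to a merged record and that errors occur independently of one another, the chance that a merged record contains at least one error is one minus the product of the per-source accuracy rates. The function name and the 5% error rate below are hypothetical.

```python
def combined_error_rate(error_rates):
    """Probability that a merged record has at least one erroneous field.

    Assumes one field per source and independent errors; both are
    simplifying assumptions made only for this sketch.
    """
    clean = 1.0
    for rate in error_rates:
        if not 0.0 <= rate <= 1.0:
            raise ValueError("error rates must be between 0 and 1")
        clean *= 1.0 - rate  # probability that every field so far is correct
    return 1.0 - clean


# Hypothetical scenario: every source is wrong 5% of the time.
for n_sources in (1, 3, 10):
    rate = combined_error_rate([0.05] * n_sources)
    print(f"{n_sources} sources -> {rate:.1%} of merged records affected")
```

Under these assumptions, a 5% error rate per source grows to roughly 14% of merged records for three sources and to about 40% for ten, which is one way of reading the remark that integration magnifies the problem.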
Furthermore, data scientists have always been searching for new or improved ways to analyze large datasets to identify patterns, but one of the challenges encountered has been that they need to know what they are looking for in order to find it, something that is particularly difficult when the purpose is to study emergent human dynamics that haven’t happened yet or that will not show up that easily in the datasets. Both big data analytics and ethnography can, thus, learn from smaller datasets (or even single case analyses).
On the other hand, we must acknowledge that even in fields in which researchers generally rely on big data (such as clinical research), they sometimes have to rely on surprisingly small datasets; this is the case, for example, in clinical drug research that analyzes the data obtained after drugs are released on the market.
Presence of Behavioral Features
Le Compte once said that: “Those who study humans are themselves humans and bring to their investigations all the complexity of meaning and symbolism that complicates too precise an application of natural science procedures to examining human life”. This has generally been presented as a difficulty that ethnographers, in particular, must fight to overcome. But it does not have to be so in the big data age. The truth is we need as many perspectives and as many insights as possible.
Fear that we might go wrong in our interpretations will only stop progress. A solution to this is posed by the collaboration between data scientists and ethnographers. In this sense, ethnographers should be allowed to investigate the complexity of the data and come out with propositions and hypotheses, even when they are conflicting; and then data scientists could use big data analytics to test those propositions and hypotheses in light of statistical analyses and see if they hold across the larger datasets.
Studying human behavior is not easy, but the truth is that both big data analytics and ethnography have a behavioral feature attached to them, in the sense that they are both interested in analyzing the content and meaning of human behavior; the ‘proximity’ between the two approaches is even more evident if we consider the change that the big data age has brought with it.
While in the not so distant past, analytics would generally be performed by relying upon written artifacts which recorded past human behavior, today, big data technology enables the recording of current human behavior, as this happens (consider live feed data, for example). And as technology keeps evolving, the necessity of ‘collaboration’ between big data analytics and ethnography will become more obvious.
Unpacking Unstructured Data
Big data includes both structured (e.g., databases, CRM systems, sales data, sensor data, and so on) and unstructured data (e.g., emails, videos, audio files, phone records, social media messages, weblogs, and so on). According to a report by Cisco, an estimated 90% of the existing data is either semi-structured or unstructured.
Furthermore, a growing proportion of unstructured data is video. And video constituted approx. 70% of all Internet traffic in 2013. One of the main challenges of big data analytics is just how to analyze all these unstructured data. Ethnography (and its newer addition, online ethnography) may have the answer.
Ethnography is a great tool to ‘unpack’ unstructured data. Ethnography involves an inductive and iterative research process, wherein data collection and analysis can happen simultaneously, without the need to have gathered all the data or even look at the entire data. Ethnography does not follow a linear trajectory, and this is actually an advantage that big data analytics can capitalize on. Ethnography is par excellence a very good approach to look into unstructured data and generate hypotheses that can be further tested against the entire datasets.
Today, most companies seem to be still collecting massive amounts of data with a ‘just in case we need it’ approach. But in the wise words of William Bruce Cameron, not everything that can be counted counts, and not everything that counts can be counted. What should probably happen is that: before going ahead and collecting huge amounts of data, companies could use initial ethnographic observations to identify emergent patterns and phenomena of interest, advancing various, even conflicting, hypotheses.
This could then inform and guide the overall strategy of massive data collection. Big data analytics could then analyze the data collected, testing the hypotheses proposed against these larger datasets. In this sense, ethnography can help shed light on the complexities of big data, with ethnographic insights serving as input for big data analytics and big data analytics can be used to generalize the findings.
Employing an ethnographic approach is generally understood in a traditional sense, which is that of having to undertake long observations from within the organization, with the researcher actually having to become an insider, a part of the organization or context that he decides to study.
The good news is that today, new methods of ethnography are emerging, such as virtual ethnography (also known as online ethnography, netnography, or webnography), which may turn out to be of great help in saving time and tackling the usual problem of having to gain access to the organization.
The virtual world is now in its exponential growth phase, and doing virtual ethnography may just be one of the best, and most convenient, ways to explore and benefit from understanding these new online contexts. Web-based ethnographic techniques involve conducting virtual participant observation via interactions in online platforms such as social networks (such as Facebook or Twitter), blogs, discussion forums, and chat rooms. Conducting ethnographies in today’s world may, thus, be easier than it seems.
In this blog, we have aimed to discuss the points of intersection between big data analytics and ethnography, highlighting both their similarities and complementary nature. Although the list is far from being exhaustive, we hope to have contributed to the discussions that focus on how the two approaches can work together to provide a more comprehensive picture of big data.
“Ethnographers bring considerable skills to the table to contextualize and make greater meaning of analytics, while analytics and algorithms are presenting a new field site and complementary datasets for ethnographers”.
One of the most important advantages of combining big data analytics and ethnography is that this ‘intersection’ can provide a better sense of the realities of the contexts researched, instead of treating them as abstract, reified entities. And this better sense can translate into better understandings and better predictions, which can further assist in the creation of better practical solutions, with greater societal added value.
There are indeed many points of intersection between big data analytics and ethnography, which have in many ways a shared purpose. They are also complementary, as data scientists working with quantitative methods could supplement their own ‘hard’ methodological techniques with findings and insights obtained from ethnographies. As Goodall stated:
Ethnography is not the result of a noetic experience in your backyard, nor is it a magic gift that some people have and others don’t. It is the result of a lot of reading, a disciplined imagination, hard work in the field and in front of a computer, and solid research skills…
Today, we continue to live in a world that is influenced by a quantification bias, the unconscious tendency to value the measurable over the immeasurable. We believe it is important to understand that big data analytics has never been one size fits all. We mentioned in this blog that many big data projects fail, despite the enormous investments that they absorb.
This is in part because many people still fail to comprehend that a deep understanding of the context in which a pattern emerges is not an option, but a must. Just because two variables are correlated does not necessarily mean that there is a cause and effect relationship between them.
Too often, Big Data enables the practice of apophenia: seeing patterns where none actually exist, simply because enormous quantities of data can offer connections that radiate in all directions. In one notable example, Leinweber demonstrated that data mining techniques could show a strong but spurious correlation between the changes in the S&P 500 stock index and butter production in Bangladesh.
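The mechanism behind such spurious findings can be sketched with a small simulation. This is not Leinweber’s analysis and uses no real data: it simply draws one short series of random numbers and then searches through many other, equally random series for the one that correlates with it most strongly. The series length, the number of candidate series, and the function names are our own illustrative assumptions.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def strongest_chance_correlation(n_points=12, n_candidates=1000, seed=0):
    """Largest |r| between one random series and many unrelated random series."""
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_points)]
    best = 0.0
    for _ in range(n_candidates):
        # Each candidate is generated independently of the target,
        # so any correlation found here is due to chance alone.
        candidate = [rng.gauss(0, 1) for _ in range(n_points)]
        best = max(best, abs(pearson_r(target, candidate)))
    return best


print(f"strongest correlation found by chance: "
      f"{strongest_chance_correlation():.2f}")
```

Because the search covers a thousand candidates, a seemingly strong correlation turns up even though none of the series is related to the target by construction; the more variables a dataset offers, the easier it becomes to find such patterns.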
Ethnography can provide that so very necessary deep understanding. In this sense, big data analytics and ethnography can work together, complementing each other and helping in the successful handcrafting and implementation of bigger projects for a bigger, greater good.
Big Data: A Global Overview
More and more, society is learning how to live in a digital world that is becoming engulfed in data. Companies and organizations need to manage their data growth in a way that keeps pace with data getting bigger, faster and exponentially more voluminous. They must also learn to deal with data in new and different unstructured forms.
This phenomenon is called Big Data. This blog aims to present other definitions for Big Data, as well as technologies, analysis techniques, issues, challenges and trends related to Big Data. It also looks at the role and profile of the Data Scientist, in reference to functionality, academic background, and required skills. The result is a global overview of what Big Data is, and how this new form is leading the world towards a new way of social construction, consumption, and processes.
The Origins of Big Data and How It Is Defined
From an evolutionary perspective, Big Data is not new. The advance towards Big Data is a continuation of ancient humanity’s search for measuring, recording and analyzing the world. A number of companies have been using their data and analytics for decades.
The most common and widespread definition for Big Data refers to the 3 Vs: volume, velocity, and variety. The 3Vs were originally pointed out by Doug Laney in 2001, in a Meta Group report. In this report, Laney identified the 3Vs as future challenges in data management; the framing is nowadays widely used to define Big Data.
Although the 3Vs are the most solid definition for Big Data, they are definitely not the only one. Many authors have attempted to define and explain Big Data under a number of perspectives, going through more detailed definitions—including technologies and data analysis techniques, the use and goals of Big Data and also the transformations it is imposing within industries, services, and lives.
The expression Big Data and Analytics (BD&A) has become synonymous with Business Intelligence (BI) among some suppliers and for others, BD&A was an incorporation of the traditional BI but with the addition of new elements such as predictive analyses, data mining, operation tools/approaches and also research and science. Reyes defines Big Data as the process of assembling, analyzing and reporting data and information.
Big Data and BD&A are often described as data sets and analytical techniques in voluminous and complex applications that require storage, management, analysis, and unique and specific visualization technologies. They also include autonomous data sources with distributed and decentralized controls.
Big Data has also been used to describe a large availability of digital data and financial transactions, social networks, and data generated by smartphones. It includes non-structured data with the need for real-time analysis. Although one of Big Data’s main characteristics is the data volume, the size of data must be relative, depending on the available resources as well as the type of data that is being processed.
Mayer-Schonberger and Cukier believe that Big Data refers to the extraction of new ideas and new ways to generate value in order to change markets, organizations, the relationship between citizens and government, and so on. It also refers to the ability of an organization to obtain information in new ways, aiming to generate useful ideas and significant services.
Although the 3 Vs’ characteristics are intensely present in Big Data definitions throughout the literature, the concept has gained a wider meaning. A number of characteristics are related to Big Data, in terms of data source, technologies and analysis techniques, goals and generation of value.
In summary, Big Data consists of enormous datasets composed of both structured and non-structured data, often with the need for real-time analysis and the use of complex technologies and applications to store, process, analyze and visualize information from multiple sources. It plays a paramount role in the decision-making process in the value chain within organizations.
Big Data promises to fulfill the research principle of information systems, which is to provide the right information for the right use, in the precise volume and quality at the right time. The goal of BI&A is to generate new knowledge (insights) that can be significant, often in real-time, complementing traditional statistical research and data source files that remain permanently static.
Big Data can make organizations more efficient through improvements in their operations, facilitating innovation and adaptability and optimizing resource allocation. The ability to cross and relate private data about products and consumer preferences with information from tweets, blogs, product analysis, and social network data opens various possibilities for companies to analyze and understand the preferences and needs of their customers, predict demand and optimize resources.
The key to extracting value from Big Data is the use of Analytics, since collection and storage by themselves add little value. Data needs to be analyzed and its results used by decision makers and organizational processes.
The emergence of Big Data is creating a new generation of data for decision support and management and is launching a new area of practice and study called Data Science. It encompasses techniques, tools, technologies, and processes to make sense of Big Data. Data Science refers to qualitative and quantitative applications to solve relevant problems and predict outputs.
There are a number of areas that can be impacted by the use of Big Data. Some of them include business, sciences, engineering, education, health, and society. Within education, some examples of Big Data application are tertiary education management and institutional applications (including recruitment and admission processes), financial planning, donor tracking, and monitoring student performance.
What Is Big Data Transforming in the Data Analysis Sector?
Big Data brings three big changes to data analysis. The first is that the need for samples belonged to a time when information was limited. The second is that the obsession with correct data and the concern for data quality were due to the limited availability of data. The last is the abandonment of the search for causality in favor of the discovery of the fact itself.
For the first big change, the argument is based on the Big Data definition itself, which is relative rather than absolute. It used to be unviable and expensive to study a whole universe; the change is reinforced by the fact that nowadays some companies collect as much data as possible.
The second big change refers to the obsession with correct data and follows from the first change, data availability. When data was limited, it was very important to ensure its total quality. The increase in data availability opened the door to inaccuracy, and Big Data transforms the numbers into something more probabilistic than precise. That is, the larger the scale, the more accuracy is lost.
Finally, the third big change in the Big Data era is that predictions based on correlations lie at the heart of Big Data. That means that Big Data relies on noncausal analyses in a way that transforms how the world is understood. The mentality about how data can be used has changed.
The three changes described above turn some traditional perspectives of data analysis upside down, concerning not only the need for sampling or data quality but also integrity. It goes further when a new way to look at data and what information to extract from it is brought to the table.
The role of IT (Information Technology) in the Big Data area is fundamental, and the advances that occurred in this context made the arrival of this new data-driven era possible. New Big Data technologies are enabling large-scale analysis of varied data, at unprecedented velocity and scale.
Typical sources of Big Data can be classified from the perspective of how they were generated, as follows:
• User-generated content (UGCs) e.g. blogs, tweets and forum content;
• Transactional data generated by large-scale systems e.g. weblogs, business transactions, and sensors;
• Scientific data from data-intensive experiments e.g. celestial data or genomes;
• Internet data that is collected and processed to support applications;
• Graph data composed of an enormous number of nodes of information and the relationships between them.
In the Information Technology industry as a whole, the speed at which Big Data appeared generated new issues and challenges in reference to data and analytical management. Big Data technology aims to minimize the need for hardware and reduce processing costs. Conventional data technologies, such as databases and data warehouses, are becoming inadequate for the amount of data to analyze.
Big Data is creating a paradigm change in data architecture: organizations are moving from the traditional practice of bringing data to servers toward pushing computation out to distributed data. The necessity for Big Data analysis boosted the development of new technologies. To permit the processing of so much data, new technologies emerged, like MapReduce from Google and its open source equivalent Hadoop, launched by Yahoo.
The MapReduce technology allows the development of approaches that handle a large volume of data using a large number of processors, thereby addressing some of the problems caused by volume and velocity.
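To make the idea concrete, here is a minimal single-process sketch of the map, shuffle and reduce steps in Python, using the classic word-count task. It is an illustration only and does not use Hadoop or any real cluster API; the function names are assumptions chosen for readability.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # On a real cluster each document (or data block) would be mapped on a
    # different node; here the map calls simply run one after another.
    mapped = []
    for document in documents:
        mapped.extend(map_phase(document))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "Big analytics"]))
```

The point of the split is that each map call and each reduce call depends only on its own input, so the work can be spread over many processors without the functions having to know about each other.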
Apache Hadoop is one of the software platforms that support distributed, data-intensive applications and implement Map/Reduce. Hadoop is an open source project hosted by the Apache Software Foundation; it consists of smaller subprojects and belongs to the infrastructure category of distributed computing.
The role of IT in the information flow’s availability to create competitive advantages was identified and pointed out as six components:
• Add volume and growth, through improvement or development of products and services, channels or clients;
• Distinguish or increase the willingness to pay;
• Reduce costs;
• Optimize risks and operations;
• Improve industry structure, innovate with products or services and generate and make knowledge and other resources and competencies available;
• Transform models and business processes to stay relevant as the scenario changes.
Cloud computing is a key component for Big Data, not only because it provides infrastructure and tools, but also because it is a business model that BD&A can follow, as it is offered as a service (Big Data as a Service, BDaaS). However, it brings a lot of challenges. An intensive research project using academic papers about Big Data showed the following technologies as the most cited by the authors, in order of relevance: Hadoop/MapReduce, NoSQL, In-Memory, Stream Mining, and Complex Event Processing.
Normally, Big Data refers to large amounts of complex data, and the data is often generated in a continuous way, implying that the data analysis occurs in real-time. Classical analysis techniques are not enough and end up being replaced by machine learning techniques.
Big Data’s analysis techniques encompass various disciplines, which include statistics, data mining, machine learning, neural networks, social network analysis, signal processing, pattern recognition, optimization methods, and visualization approaches. In addition to new processing and data storage technologies, programming languages like Python and R gained importance.
Decision modeling methods also include discrete simulation, finite element analysis, stochastic techniques, and genetic algorithms, among others. Real-time modeling is not only concerned with time and algorithm output; it is also the type of work that requires additional research.
The opportunities of emerging analytical research can be classified into five critical technical areas: BD&A, text analytics, web analytics, social network analytics, and mobile analytics. Some sets of techniques receive special names, based on the way the data was obtained and the type of data to be analyzed, as follows:
• Text Mining: techniques to extract information from textual data, which involves statistical analysis, machine learning, and linguistics;
• Audio Analytics: non-structured audio data analyses, also known as speech analytics;
• Video Analytics: encompasses a variety of techniques to monitor, analyze and extract significant information out of video transmissions;
• Social Network Analytics: analysis of both structured and non-structured data from social networks;
• Predictive Analytics: embraces a number of techniques to predict future results based on historical and current data and can be applied to most disciplines.
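As a small illustration of the last item, predictive analytics in its simplest form fits a model to historical data and extrapolates. The sketch below fits a straight line by ordinary least squares in plain Python; the weekly sales figures are invented for the example, and real projects would typically use richer models and libraries.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply a fitted (slope, intercept) model to a new x value."""
    slope, intercept = model
    return slope * x + intercept


if __name__ == "__main__":
    weeks = [1, 2, 3, 4, 5]        # hypothetical historical periods
    sales = [10, 12, 14, 16, 18]   # hypothetical units sold per week
    model = fit_line(weeks, sales)
    print("forecast for week 6:", predict(model, 6))
```

The fitted line only describes the pattern in the historical data; whether the pattern will continue is a question about the context, not about the arithmetic.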
Besides the data analysis techniques, visualization techniques are also fundamental in this discipline. Visualization here is the study of transforming data, information, and knowledge into interactive visual representations. Under the influence of Big Data’s technologies and techniques (large-scale data mining, time series analysis, and pattern mining), data like occurrences and logs can be captured in a low granularity with a long history and analyzed in multiple projections.
The analysis and exploration of datasets made data-driven analysis possible and has the potential to complement or even replace ad hoc analysis with other types of analysis: consumer behavior tracking, simulations and scientific experiments, and hypothesis validation.
The Practical Use of Big Data
Big Data has practical applications in multiple areas: commerce, education, business, financial institutions, and engineering, among many others. The Data Scientist, using programming skills, technology, and data analysis techniques presented in the last section, will support the decision-making process, provide insights and generate value for businesses and organizations.
The use of data supports the decision makers when responding to challenges. Moreover, understanding and using Big Data improves the traditional way of the decision-making process. Big data can assist not only the expansion of products and services but also enables the creation of new ones. The use of Big Data is not limited to the private sector; it shows great potential in public administration. In this section, some examples of use in both spheres for the greater good are presented.
Some businesses, due to the volume of generated data, might find Big Data to be of more use to improve processes, monitor tasks and gain competitive advantages. Call centers, for instance, have the opportunity of analyzing the audio of calls which will help to both control business processes—by monitoring the agent behavior and liability—and to improve the business, having the knowledge to make the customer experience better and identifying issues referring to products and services.
Although predictive analytics can be used in nearly all disciplines, retailers and online companies are big beneficiaries of this technique. Due to the large number of transactions happening every day, many different opportunities for insights are possible. Understanding the behavior of customers and their consumption patterns, knowing what customers like to buy and when, predicting sales for better sales planning and replenishment, and analyzing promotions are just a few examples of what Big Data can add.
The private sector is not alone in experiencing the benefits of Big Data. Opportunities in public administration are akin to those in private organizations. Governments use Big Data to stimulate the public good in the public sphere, by digitizing administrative data and by collecting and storing more data from multiple devices.
Big Data can be used in the different functions of the public administration: to detect irregularities, for general observation of regulated areas, to understand the social impact through social feedback on actions taken and also to improve public services. Other examples of good use of Big Data in public administration are to identify and address basic needs in a faster way, to reduce the unemployment rate, to avoid delays in pension payments, to control traffic using live streaming data and also to monitor the potential need of emergency facilities.
Companies and public organizations have been using their data not only to understand the past but also to understand what is happening now and what will happen in the future. Some decisions are now automated by algorithms and findings that would have taken months or even years before are being discovered at a glance now. This great power enables a much faster reaction to situations and is leading us into a new evolution of data pacing, velocity, and understanding.
The Role and Profile of the Data Scientist
With the onset of the Big Data phenomenon, there emerges the need for skilled professionals to perform the various roles that the new approach requires: the Data Scientist. In order to understand the profile of this increasingly important professional, it is paramount to understand the role that they perform in Big Data.
Working with Big Data requires a set of abilities different from the ones organizations are used to. Because of that, it is necessary to pay attention to a key to success for this kind of project: people. Data Scientists are necessary for Big Data to make sense. The managers of organizations need to learn what to do with new data sources. Some of them are willing to hire Data Scientists at high salaries and expect them to work magic. First, they need to understand the Data Scientist’s purpose and why it is necessary to have someone playing this role.
The role of the Data Scientist is to discover patterns and relationships that have never been thought of or seen before. These findings must be transformed into information that can be used to take actions and generate value for the organization. Data Scientists are people who understand how to fish out answers to important business questions from enormous amounts of non-structured information.
Data Scientists are highly trained and curious professionals with a taste for solving hard problems and a high level of education (often Ph.D.) in analytical areas such as statistics, operational research, computer science, and mathematics. Statistics and computing are together the main technologies of Data Science.
Data Science encompasses much more than algorithms and data mining. Successful Data Scientists must be able to view business problems from a data perspective. There is a thinking structure of data analysis and basic principles that must be understood. The most basic universal ability of the Data Scientist is to write code, although the most dominant characteristic of the Data Scientist is an intense curiosity. Data Scientists are a hybrid of hacker, analyst, communicator, and trusted adviser.
Overall, it is necessary to think of Big Data not only in analytical terms but also in terms of developing high-level skills that enable the use of the new generation of IT tools and data collection architectures. Data must be collected from several sources, stored, organized, extracted, and analyzed in order to generate valuable findings. These discoveries must be shared with the main actors of the organization who are looking to generate competitive advantage.
Analytics is a complex process that demands people with a very specific educational specialization, and this is why tools are fundamental in helping people to execute tasks. The relevant tools and skills include computer programming in languages such as Python and R; knowledge of MapReduce and Hadoop to process large datasets; machine learning; and a number of visualization tools: Google Fusion Tables, Infogram, Many Eyes, Statwing, Tableau Public and DataHero.
Big Data intensifies the need for sophisticated statistics and analytical skills. With all their technical and analytical skills, Data Scientists are also required to have solid domain knowledge. In both contexts, a consistent time investment is required. In summary, Data Scientists need to gather a rich set of abilities, as follows:
• Understand the different types of data and how they can be stored;
• Computer programming;
• Data access;
• Data analysis;
• Communicate the findings through business reports.
For Data Science to work, a team of professionals with different abilities is necessary, and Data Science projects should not be restricted to data experiments. Besides that, it is necessary to connect the Data Scientist to the world of the business expert. It is very common for Data Scientists to work closely with people from the organization who have domain knowledge of the business.
Because of this, it is useful to consider analytical users on one side and data scientists and analysts on the other side. Each group needs to have different capabilities, including a mixture of business, data and analytical expertise. Analytical talents can be divided into three different types:
• Specialists: run analytical models and algorithms, generate results and present the information in a way that organizational leaders can interpret and act on;
• Experts: are in charge of developing sophisticated models and applying them to solve business questions;
• Scientists: lead the teams of experts and specialists and are in charge of constructing a story, creating innovative approaches to analyze data and producing solutions. Such solutions will be transformed into actions to support organizational strategies.
Having a very wide and yet specific profile, a mixture of technical and business knowledge, the Data Scientist is a rare professional. The difficulty in finding people with the technical abilities to use Big Data tools has not gone unnoticed by the media. Because of all those requirements, Data Scientists are not only scarce but also expensive.
Issues and Challenges
The different forms of data, ubiquity and dynamic nature of resources are a big challenge. In addition, the long reach of data, findings, access, processing, integration and physical world interpretation through data are also challenging tasks. Big Data characteristics are intimately connected to privacy, security and consumer well-being and have attracted the attention of schools, businesses, and the political sphere.
Several challenges and issues involving Big Data have arisen, not only in the context of technology or management but also in legal matters. The following issues will be discussed in this section: user privacy and security, risk of discrimination, data access and information sharing, data storage and processing capacity, analytical issues, skilled professionals, process change, marketing, the Internet of Things (IoT) and finally, technical challenges, which seem to be among the issues of most concern in the literature.
The first issue refers to privacy and security. Personal information combined with other data sources can be used to infer other facts about a person that may be secret or that the user does not want to share. Users’ information is collected and used to add more value to a business or organization, often without the users being aware that their personal data is being analyzed.
The privacy issue is particularly relevant since there is data sharing between industries and for investigative purposes. That goes against the principle of privacy, which refers to avoiding data utilization. The advances in BD&A provided tools to extract and correlate data, making privacy violations easier. Restricting data access is also important for security, both to guard against cyberattacks and to keep criminals from learning more about their targets.
Besides privacy matters, Big Data applications may raise ethical concerns such as social injustice or even discriminatory procedures, for example denying certain people job opportunities or health access, or even changing the social and economic level of a particular group.
On one hand, a person can obtain advantages from predictive analysis, yet someone else may be disadvantaged. Big Data used in law enforcement applications increases the chances that a person suffers consequences without having the right to object or, even further, without knowing that they are being discriminated against.
Issues about data access and information sharing refer to the fact that data is used for precise decision making at the right time. For that, data needs to be available at the perfect time and in a complete manner. These demands make the process of management and governance very complex, with the additional need to make this data available for government agencies in a specific pattern.
Another issue about Big Data refers to storage and processing. The storage capacity is not enough for the amount of data being produced: social media websites are the major contributors, as well as sensors. Due to the big demand, outsourcing data to the cloud can be an option but loading all this data does not resolve the problem since Big Data needs to relate data and extract information. Besides the time of data uploading, data changes very rapidly, making it even harder to upload data in real-time.
Analytical challenges are also posed in the Big Data context. A few questions need an answer: What if the volume is so big that it is not known how to deal with it? Does all data need to be stored? Does all data need to be analyzed? How can the most relevant points be identified? How can data bring more advantages?
As seen in the last section, the Data Scientist profile is not easy to fill. The required skills are still at an early stage. With emerging technologies, Data Science will have to be appealing to organizations and to young people with a number of abilities. These skills must not be limited to technical abilities but must also extend to research, analytics, data interpretation, and creativity. These skills require training programs and the attention of universities to include Big Data in their courses.
The shortage of Data Scientists is becoming a serious limitation in some sectors. Universities and educational institutions must offer courses capable of providing all this knowledge for a new generation of Data Scientists. University students should have enough technical ability to conduct predictive analysis, knowledge of statistical techniques, and skill in handling the available tools.
The Master of Science curriculum should place more emphasis on Business Intelligence and Business Analytics techniques and on application development, using high-level technology tools to solve important business problems. Another challenge here is how fast universities can update their courses with so many new technologies appearing every day.
Other challenges relate to change and the implementation of new processes and even business models, especially in reference to all the data made available through the internet and also in the marketing context. The digital revolution in society and marketing has created huge challenges for companies, which encompass discussions on the effects on sales and business models and the consequences of new digital channels and media with the prevailing data growth.
The four main challenges for marketing are:
1. The use of customer insights and data to compete in an efficient way
2. The power of social media for brands and customer relationships
3. New digital metrics and effective evaluation of digital marketing activities
4. The growing gap of talents with analytical capabilities in the companies
In the context of the IoT or Internet of Everything (IoE), the following challenges are posed:
• Assess the maturity of the organization in terms of technologies and IT;
• Understand the different types of IoT functionality that can be incorporated and how they will affect value for the client;
• Comprehend the role of machine learning and predictive analytical models;
• Rethink business models and the value chain, based on the velocity of market change and relative responsiveness of the competition.
Finally, the technical challenges refer to error tolerance, scalability, data quality, and the need for new platforms and tools. With the arrival of the new technology, an error must be acceptable or the task must be restarted. Some of the methods of Big Data computing tend to increase the error tolerance and reduce the efforts to restart a certain task.
The scalability issue already took computing to the cloud, which aggregates loads of work with varied performance in large groups, requiring a high level of resource sharing. These factors combine to bring a new concern of how to program, even in complex tasks of machine learning.
Collecting and storing a massive amount of data comes at a price. The expectation is that, as more data drives decision making or predictive analysis in the business, results will improve. That generates some issues regarding relevance, quantity, data precision and the conclusions obtained. The issue of data origin is another challenge, because Big Data allows data to be collected from many different sources, which makes data validation hard.
New tools and analytical platforms are required to solve complex optimization problems, to support the visualization of large sets of data and how they relate to each other and to explore and automate multifaceted decisions in real-time.
Some modern data protection laws make it possible for a person to find out which information is being stored. Beyond that, everyone should know when an organization is collecting data, for which purposes, whether it is going to be made available to third parties, and the consequences of not supplying the information.
Big Data has brought challenges in so many senses and also in so many unexpected ways. Amid this commotion, another important issue for Big Data is whether companies or organizations are measuring and perceiving the return on investment (ROI) of its implementation.
BD&A applications for institutional purposes are still in their early stage and will take years to become mature, although their presence is already perceived and should be considered. The future of Big Data might be quantitatively the same, only bigger and better, or disruptive forces may emerge which will change the whole computing outlook.
The term Big Data as we know it today may be inappropriate in the future, as it changes with available technologies and computing capabilities. The trend is that the scale of Big Data we know today will be small in ten years.
In the era of data, data becomes the most valuable and important asset for organizations, and may become the biggest exchange commodity in the future. Data is being called “the new oil”, which means that it is being refined into high-value products through analytical capabilities. Organizations need to invest in constructing this infrastructure now so they are prepared when supply and value chains are transformed.
An article signed by Douglas Laney was published on the Forbes website in February 2015, presenting three big trends for business intelligence for the coming years, boosted by the use of massive data volumes. The first of them says that by 2020, information will be used to reinvent, digitalize or eliminate 80% of business processes and products from the previous decade.
With the growth of the IoT, connected devices, sensors, and intelligent machines, the ability of things to generate new types of information in real-time also grows and will actively participate in the industry’s value chain. Laney states that things will become self-agents, for people and for business.
The second trend is that by 2017, more than 30% of companies’ access to Big Data will occur through data service brokers acting as intermediaries, offering a basis for businesses to make better decisions. Laney projects the arrival of a new category of cloud-centered business which will deliver data to be used in the business context, with or without human intervention.
Finally, the last trend is that “…by 2017, more than 20% of customer-facing analytic deployments will provide product tracking information leveraging the IoT”. The rapid dissemination of the IoT will create a new style of customer analysis and product tracking, utilizing the ever-cheapening electronic sensors, which will be incorporated into all sorts of products.
A Novel Big Data-Enabled Approach
Brain disorders occur when our brain is damaged or negatively influenced by injury, surgery, or health conditions. This blog shows how the combination of novel biofeedback-based treatments producing large data sets with Big Data and Cloud-Dew Computing paradigms can contribute to the greater good of patients in the context of rehabilitation of balance disorders, a significant category of brain damage impairments.
The underlying hypothesis of the presented original research approach is that detailed monitoring and continuous analysis of the patient's physiological data, integrated with data captured from other sources, helps to optimize the therapy w.r.t. the current needs of the patient, improve the efficiency of the therapeutic process, and prevent patient overstressing during the therapy.
In the proposed application model, training built upon two systems, Homebalance system enabling balance training and Scope system collecting physiological data, is provided both in collaborating rehabilitation centers and at patient homes. The preliminary results are documented using a case study confirming that the approach offers a viable way towards the greater good of a patient.
Brain disorders can negatively influence the function of most parts of the human body. Necessary medical care and rehabilitation are often impossible without considering state-of-the-art neuroscience research achievements and without close cooperation of several diverse medical, care-providing, IT and other specialists, who must work jointly to choose methods that can improve and support healing processes as well as to suggest new treatment processes based on a better understanding of the underlying principles.
Key to their treatment decisions are data resulting from careful observation or examination of the patient during his/her exercise, carried out as part of the long-term therapy (which can last months or even years), and the information/knowledge extracted from these data. Because of the impressive advances in scientific domains, namely neuroscience, as well as in technology that allows, e.g., on-line monitoring of the patient’s activity during exercise through streaming data from a body area network of sensors, the volume and complexity of the captured data have been continuously increasing.
The resulting data volumes now pose the challenges typical of Big Data technology with respect to the four Vs used to characterize this concept: volume (scale or quantity of data), velocity (speed and analysis of real-time data), variety (different forms of data, often from different data sources), and veracity (quality assurance of the data).
We believe that the achievements of Big Data in combination with the novel Cloud-Dew Computing paradigm could introduce the development of a new generation of non-invasive brain disorder rehabilitation processes and, in such a way, contribute to the greater good of patients thanks to the higher efficiency of the rehabilitation process.
These issues are addressed by our project, the results of which are presented here. In the rest of this section, we first provide the background information as a motivation for our work, discuss the relevant state of the art, introduce our long-term research objectives and, finally, outline the structure of the blog.
Background Information and the State of the Art
The use of modern technologies for the rehabilitation of patients with brain damage has made remarkable progress in the past decade. Therapists are trying to incorporate into their therapy sessions various tools originally designed for entertainment, e.g., combining therapy with computer games or with virtual reality, to improve the attractiveness of the traditional beneficial exercise and the adherence of patients to the therapy plan.
Such an enrichment represents a significant addition to conventional procedures, whose experience and best practices have steered the development of new therapeutic technologies incorporating the latest achievements of computational intelligence, sensing technique, virtual reality, video games, and telemedicine.
While the needed therapy (some form of training) is currently provided primarily in rehabilitation centers, our vision is to make it available even in patients’ homes. Home therapy, without direct participation of a therapist, can save the patient the discomfort of traveling to centers, makes the rehabilitation more attractive, provides motivation to exercise, and so can be more effective.
Further, home therapy combined with telehealth solutions could target much bigger groups of well-selected patients without compromising the quality of care. All this contributes to the greater good of the considered individual patients and of efficient utilization of resources in the healthcare system.
The resulting big data, collected locally in patients’ homes and transmitted to the rehabilitation center, where they are first integrated with (and significantly enlarged by) the data collected in the center and then processed and analyzed, can further provide a precious source that can help us in understanding the function of the human brain.
Balance disorder of the central and peripheral system is a significant symptom that causes functional deficit in common activities. Patients after brain damage often face situations (e.g. using escalators, orientation in city traffic) that can be possibly dangerous or even impossible to manage. Consequently, balance training is one of the main focuses of brain disorder rehabilitation.
Most modern balance training environments are interactive and built upon measuring instruments called force platforms (or force plates) that measure the ground reaction forces (their summary vector is often called a center of pressure) generated by a body standing on or moving across them, to quantify balance, gait and other parameters of biomechanics. Methods based on virtual reality technology and biological feedback are appropriately supplementing conventional rehabilitation procedures.
In this context, in our lab, we have developed a stabilometric system, based on the WiFi balance board, with a set of 2D or 3D training scenes so that patients can experience situations appearing in these scenes without any risk. Commonly used 2D gaming systems are not suitable for this purpose. The therapy includes active repetitive game-like training. The patient standing on the force platform is set the task to control his/her center of pressure (COP) movements to achieve the goals of the games.
The COP movement is visualized in the virtual environment to provide visual feedback. The difficulty of training and sensitivity of sensors can be adjusted with respect to the current patient’s state. Patient’s state is objectively evaluated in the beginning and during the entire rehabilitation process. The efficiency of patient’s COP movement is measured and presented as a score value. Data is analyzed in both the time and frequency domain. Graphs and raw values can be exported.
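As a rough illustration of how a center of pressure and a simple movement measure might be derived from a balance board with four corner load sensors, consider the sketch below. The sensor layout, the board dimensions, and the function names are assumptions made for this example; they do not describe the actual implementation of the system.

```python
from math import hypot

# Assumed distances between the corner sensors, in millimetres (illustrative values).
BOARD_WIDTH_MM = 433.0
BOARD_LENGTH_MM = 238.0


def center_of_pressure(tl, tr, bl, br,
                       width=BOARD_WIDTH_MM, length=BOARD_LENGTH_MM):
    """Return the COP (x, y) in mm relative to the board center.

    tl, tr, bl, br are the loads on the top-left, top-right,
    bottom-left and bottom-right sensors (any consistent unit).
    """
    total = tl + tr + bl + br
    if total <= 0:
        raise ValueError("no load on the platform")
    # Weighted difference between right/left and top/bottom sensor pairs.
    x = (width / 2.0) * ((tr + br) - (tl + bl)) / total
    y = (length / 2.0) * ((tl + tr) - (bl + br)) / total
    return x, y


def path_length(cop_points):
    """Total distance travelled by the COP over a sequence of (x, y) samples."""
    return sum(hypot(x2 - x1, y2 - y1)
               for (x1, y1), (x2, y2) in zip(cop_points, cop_points[1:]))
```

A shorter COP path for the same task is one possible ingredient of a score value; how the real score is computed is not specified here.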
Dew computing is one option for implementing the above application processes in a distributed way. It is a very new computing paradigm that builds on top of Cloud computing and overcomes some of its restrictions, such as the dependency on the Internet. There are different definitions of Dew computing.
Dew computing is an on-premises computer software-hardware organization paradigm in the cloud computing environment where the on-premises computer provides functionality that is independent of cloud services and is also collaborative with cloud services. The goal of dew computing is to fully realize the potentials of on-premises computers and cloud services.
This guarantees that the offered services are independent of the availability of a functioning Internet connection. The underlying principle for this is a tight collaboration between on-premise and off-premise services based on an automatic data exchange among the involved compute resources.
Research Objectives and Organization
Investigations in the above-discussed research showed that regular training has a beneficial effect especially on the balance, motor skills, spatial orientation, reaction time, memory, attention, confidence and mental well-being of the user. However, the personalized application of the available systems for therapeutic processes requires careful choice of appropriate settings for a large number of system parameters, values of which cannot be estimated purely from few simple measurements (e.g. weight and height of the patient).
The system parameters must be modified w.r.t. the current patient's state as well as to the set of multiple diseases the patient is suffering from. This is the first reason why a large search space should be considered when planning the type of therapeutic processes to be recommended for system parameter settings.
The second reason is related to the fact that an individual combination of therapeutic processes is recommended for each patient, and this combination influences the parameter values for the individual processes. All this leads to a combinatorial explosion that must be faced. Our objective is to contribute to the solution of these challenges.
The system parameter setting currently relies mainly on subjective observation of the status and behavior of the patient by the persons steering the therapy. This type of human expert supervision is time-consuming, sub-optimal, and unproductive. The challenge and an ambitious research goal is to provide technical support for the design of an optimal therapeutic process for each individual patient.
For that purpose, we suggest applying methods of machine learning or case-based reasoning. A prerequisite for this approach is to collect enough training data that fully characterize the detailed progress of the patient during the rehabilitation processes and his/her therapies.
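To make the case-based reasoning idea more concrete, here is a minimal sketch of its retrieval step: finding the stored patient case most similar to a new patient and reusing its settings as a starting point. The feature names, the scaling to the range 0..1, and the case base itself are hypothetical and serve only as an illustration.

```python
from math import sqrt


def distance(a, b):
    """Euclidean distance between two feature dicts that share the same keys."""
    return sqrt(sum((a[key] - b[key]) ** 2 for key in a))


def retrieve_similar(case_base, query, k=1):
    """Return the k stored cases whose features are closest to the query."""
    return sorted(case_base,
                  key=lambda case: distance(case["features"], query))[:k]


# Hypothetical case base: features scaled to [0, 1] plus settings that worked.
CASE_BASE = [
    {"features": {"age": 0.3, "balance_score": 0.8},
     "settings": {"level": 5, "time_limit_s": 60}},
    {"features": {"age": 0.7, "balance_score": 0.4},
     "settings": {"level": 2, "time_limit_s": 120}},
    {"features": {"age": 0.6, "balance_score": 0.5},
     "settings": {"level": 3, "time_limit_s": 90}},
]

new_patient = {"age": 0.62, "balance_score": 0.5}
suggested = retrieve_similar(CASE_BASE, new_patient, k=1)[0]["settings"]
```

In a full case-based reasoning cycle the retrieved settings would still be adapted by the therapist and the outcome stored as a new case; only retrieval is shown here.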
An interesting issue is also the choice of the optimal set of sensors for a given patient. In theory, it seems that the more data we can collect from the considered patient, the better. On the other hand, one should not forget that placement of each BAN (body area network) sensor makes the preparation for the exercise more demanding and causes additional discomfort for the patient who tries to proceed with therapy at home.
That is why special attention should be given to selecting sensors that provide the necessary information and that the patient has no problem putting on. Obviously, the set of sensors for home therapy cannot be the same for all patients under these conditions. The appropriate choice can be significantly supported by careful analysis of the patient's data collected during his/her therapy sessions in the hospital.
On the other hand, the situation is gradually improving thanks to progress in wireless sensor technology and in non-contact and non-invasive body sensing methods, e.g. monitoring temperature by infrared thermography.
The second long-term aim of our research is to develop an adaptive optimization of the involved therapeutic processes, including precise monitoring, assessment, learning, and response cycles, based on patient-specific modeling. The current patient status achieved by goal-oriented therapeutic actions is continuously monitored through a set of sensors.
The received synchronized signal series can be interpreted taking into account the current knowledge base status to help in selecting the next optimal actions from the rich set provided by the rehabilitation environment to be applied.
Treatment Workflow and Associated Data Space Development
The rehabilitation phase we focus on is part of a patient development trajectory (pre-rehabilitation → rehabilitation → post-rehabilitation) associated with data collection that incrementally creates a big Data Space. For example, in the brain damage prehistory phase, an Electronic Health Record of the patient has been created and gradually developed.
The concept of Scientific Data Space was originally introduced by Elsayed. This blog focuses on the rehabilitation processes driving the development of the Rehabilitation Data Sub-Space (RDF). For specific data analysis purposes, RDF can be flexibly enlarged by integration with other data space elements. RDF includes three types of data:
• Primary data captured during therapies (e.g., data produced by sensors and balance force platform, game event data, videos, etc.)
• Derived data including statistical and data mining models, results of signal processing, images resulted from the visualization processes, etc.
• Background data providing a description of applied workflows, reports on scientific studies, relevant publications, etc.
The initial session and patient status estimation is the first step in the Rehabilitation phase of the treatment trajectory. Here, the Berg Balance Scale is typically applied.
In this section, the real data gathered during the sample experiments (Homebalance training sessions) will be explored in more detail. The relevance of individual data streams will be considered, and additional data that may be useful during further analyses will be proposed.
The typical training (rehabilitation) session using the Homebalance system and the Scope device used for capturing biofeedback data as described in the previous section consists of three consecutive tasks that the probands are to perform: a simple chessboard scene and two scenes in 3D virtual environments.
Each of the considered training sessions has several levels of complexity given by the parameters of the user task, e.g. maximal time allowed for accomplishing the task, expected scope of movement, type of cognitive task to be ensured together with the movement. (In different medical fields, the term proband is often used to denote a particular subject, e.g., person or animal, being studied or reported on.)
The complexity of the task assigned to the patient is changed during the exercise in accordance with the biofeedback data indicating how demanding the current setting is for him/her. If the patient manages the task well, the current setting level can be slightly increased by a session supervisor (clinician, home therapist, an intelligent software agent, etc.). In the opposite case, it can be decreased. Scaling of the task complexity is described separately for each of the included training sessions below.
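The adjustment rule described above could be expressed, in a very simplified form, as follows. This is only a sketch: the use of heart rate as the single biofeedback signal, the stress threshold of 1.25 times the resting value, and the level range of 1 to 10 are assumptions chosen for the example.

```python
def adjust_level(level, heart_rate, baseline_hr, task_succeeded,
                 min_level=1, max_level=10, stress_ratio=1.25):
    """Return the next difficulty level for a training scene.

    The level goes down by one step when the patient failed the task or the
    heart rate exceeds the (assumed) stress threshold, and up by one step
    otherwise. The result always stays within [min_level, max_level].
    """
    stressed = heart_rate > baseline_hr * stress_ratio
    if stressed or not task_succeeded:
        return max(min_level, level - 1)
    return min(max_level, level + 1)
```

For instance, a patient at level 3 who completes the task with a heart rate of 80 bpm against a resting value of 70 bpm would move to level 4 under these assumptions, whereas the same patient at 100 bpm would drop to level 2.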
Data Processing and Visualization
Given the raw data that can be obtained from the various sensors described previously, we will now focus on the description of data structures and data flows integral to our approach. This will be demonstrated on the use cases mentioned above: the chessboard scene and the virtual reality scenes.
One minor terminology remark: since many of the scenes considered here for training and rehabilitation have a game-like character, and using 2D/3D game engines for scene visualization is quite common, we will refer to the testing applications with the virtual environment as games (as in serious games) and use game-related terminology further on.
Indeed, many of the scenarios used in training and rehabilitation applications resemble computer games: there are given goals that the proband should accomplish, the proband’s progress in the training session is measured by some kind of score (based on completion time, number of errors the proband made, etc.).
The scenes can have an adjustable difficulty level (by setting time limits, making scenes more complicated to navigate, etc.). Therefore, using a serious game approach to the training and rehabilitation applications seems very natural. Conversely, many training and rehabilitation approaches that employ software applications can be gamified by turning them into repeatable scenarios with difficulty levels, awards, and scores.
Since the raw data acquired from sensors are quite low-level and use arbitrary device-related units, it is necessary to perform some preprocessing, followed by peak detection/feature extraction steps. In this section, the preprocessing of the sensory data originated from Homebalance platform and the structure of the game-related data for games used in the scenarios mentioned before as well as higher-level data that may be derived from the previous two sources will be discussed.
An electrocardiogram (ECG) represents the electrical activity of the human heart, typically measured by electrodes placed on the body. The ECG is a composite of five waves: P, Q, R, S, and T. After the necessary denoising, the ECG signal can be used to extract many useful health indicators, most notably the heart rate. Many algorithms for heart rate detection are based on QRS complex detection, and the heart rate is estimated from the distance between QRS complexes.
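The idea of estimating heart rate from the distance between QRS complexes can be sketched as below. This is a deliberately naive illustration, assuming an already denoised signal and a fixed amplitude threshold; practical QRS detectors are considerably more elaborate, and the function names and default refractory period are assumptions of this example.

```python
def detect_r_peaks(signal, fs, threshold, refractory_s=0.25):
    """Return sample indices of R peaks in a denoised ECG signal.

    A sample counts as a peak when it is a local maximum at or above
    `threshold` and lies more than `refractory_s` seconds after the
    previously accepted peak. `fs` is the sampling rate in Hz.
    """
    peaks = []
    refractory = int(refractory_s * fs)
    for i in range(1, len(signal) - 1):
        is_local_max = signal[i - 1] < signal[i] >= signal[i + 1]
        if signal[i] >= threshold and is_local_max:
            if not peaks or i - peaks[-1] > refractory:
                peaks.append(i)
    return peaks


def heart_rate_bpm(peaks, fs):
    """Estimate heart rate from the mean interval between consecutive R peaks."""
    if len(peaks) < 2:
        raise ValueError("at least two R peaks are needed")
    rr_intervals = [(b - a) / fs for a, b in zip(peaks, peaks[1:])]
    return 60.0 / (sum(rr_intervals) / len(rr_intervals))
```

With peaks spaced exactly one second apart, the mean interval is 1 s and the estimate is 60 beats per minute.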
Measurement of the electrical impedance (EIMP) is a method in which the electrical conductivity, permittivity, and impedance of a part of the body are inferred from surface electrode measurements. Lung resistivity increases and decreases several-fold between inspiration and expiration.
The investigated parameter is the breathing curve derived from changes in the bioimpedance of the chest. The benefit of this method is a non-invasive measurement of basic physiological function. The increase in breathing frequency is a sign of the high difficulty of the exercise.
The galvanic skin response (GSR), also known as an electrodermal activity or skin conductance response, is defined as a change in the electrical properties of the skin. The signal can be used for capturing the autonomic nerve responses as a parameter of the sweat gland function. GSR can be useful for estimating stress levels and cognitive load. GSR has been closely linked to autonomic emotional and cognitive processing and can be used as a sensitive index of emotional processing and sympathetic activity.
GSR signal processing is by no means a trivial task. External factors such as temperature and humidity affect GSR measurements, which can lead to inconsistent results. Internal factors such as medications can also change GSR measurements, demonstrating inconsistency with the same stimulus level.
There is a tremendous variability across individuals in terms of galvanic response; many individuals exhibit an extremely low galvanic response to cues, while others respond extremely strongly. This suggests that GSR would be useful mainly to compare stress levels in various stages of training/rehabilitation of a single individual.
Aside from the data from sensors, it is important to pay attention to what exactly is going on in the virtual scene, in the given moment. Therefore, it is useful to record:
• Game event data, to be able to reconstruct the game state later and to identify the moments where the player has to solve various problems (find a way over the street in the virtual reality scene, stabilize a sphere over the target position for a specified duration in the chessboard scene) so they can be correlated with physiological measures (Is the moment particularly stressful? Did the player’s heart rate rise?).
• Low-level input data, which can be used for detecting anomalies from the normal device usage (e.g. by keeping track of mouse position when using it as an input device it is possible to detect abnormalities that may be evidence of tremors, etc.).
The exact structure of game event data (or alternatively, the game model) depends heavily on the scenario used; however, the game event data, together with the associated data flows and evaluation logic, share a common structure for all scenarios, with three main components: the task assignment, the gameplay events and the game logic itself.
The three components together comprise the game model and sufficiently describe the inner workings of the game, allowing to formalize the format of input (task assignment) data, output (game events, game outcomes) data, and the implicit or explicit relationships between the two (game logic).
This knowledge is sufficient for reconstructing and evaluating the exact progress of the gameplay later, perhaps for the purpose of finding the moments where the player experienced major struggles in the game and adjusting the game difficulty for later plays accordingly, in order to optimize the training/learning process that the game is designed to facilitate.
From the low-level data mentioned above, various high-level indicators can be derived. GSR, for instance, is a good resource for estimating stress levels. It is also a prospective physiological indicator of cognitive load, defined as the amount of information that the short-term (or working) memory can hold at one time. Cognitive Load Theory, as explained by Sweller and his colleagues, suggests that learners can absorb and retain information effectively only if it is provided in such a way that it does not overload their mental capacity.
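One way to picture the three components of the game model is the small data structure sketched here. The class names, the field names, and the event kinds "error" and "goal_reached" are hypothetical; they merely show how a task assignment, recorded gameplay events, and a piece of game logic could fit together.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TaskAssignment:
    """Input of a session: which scene to play and under which constraints."""
    scene: str
    time_limit_s: float
    difficulty: int


@dataclass
class GameEvent:
    """A single time-stamped gameplay event."""
    t: float    # seconds since the start of the session
    kind: str   # hypothetical kinds: "error", "goal_reached"


@dataclass
class GameSession:
    task: TaskAssignment
    events: List[GameEvent] = field(default_factory=list)

    def record(self, t, kind):
        self.events.append(GameEvent(t, kind))

    def outcome(self):
        """Game logic: the task counts as completed if the goal was reached in time."""
        errors = sum(1 for e in self.events if e.kind == "error")
        goal_times = [e.t for e in self.events if e.kind == "goal_reached"]
        completed = bool(goal_times) and min(goal_times) <= self.task.time_limit_s
        return {"completed": completed, "errors": errors}

    def events_between(self, start, end):
        """Events inside a time window, e.g. around a heart-rate peak."""
        return [e for e in self.events if start <= e.t <= end]
```

Because every event carries a time stamp, a window around a physiological peak can be matched against the events that occurred at that moment.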
Monitoring of physiological functions is a useful tool for early detection of increasing stress level of the patient during the therapeutic process. Monitoring of the patient leads to detection of a change of the patient’s state even if it is not visible for the therapist. The normal level of heart rate and other parameters is individual for each patient.
It is useful to collect data before the start of the therapeutic intervention, during the relaxation phase. The patient should be at an optimal stress level during the whole therapeutic process. The current stress level should be compared to the normal level of the individual patient and to the optimal stress level of similar patients from the database.
Sensor data are managed by computers (stored, processed, analyzed, etc.) as time-series data. The PMML (Predictive Model Markup Language) format is a suitable form for representing these signal data and other statistical and data mining models computed and stored in our data space.
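Comparing a current reading with the patient's own relaxation-phase baseline can be illustrated with a simple standard score. This is a sketch under the assumption that a single physiological signal (for example heart rate or GSR) is summarized by its mean and standard deviation during relaxation; the function names are illustrative.

```python
from statistics import mean, stdev


def baseline_stats(relaxation_samples):
    """Mean and sample standard deviation of readings from the relaxation phase."""
    return mean(relaxation_samples), stdev(relaxation_samples)


def relative_level(value, baseline_mean, baseline_sd):
    """How many baseline standard deviations the current value lies above
    (positive) or below (negative) the patient's own resting level."""
    if baseline_sd == 0:
        raise ValueError("baseline has no variability")
    return (value - baseline_mean) / baseline_sd
```

Expressing readings relative to the individual baseline keeps the comparison within one person, which fits the observation that absolute GSR values vary widely between individuals.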
Overall System Architecture
After presenting the functionality that has been already experimentally used and tested by a set of selected patients, we now show how the realized functional modules are included in the overall architecture, which in this section is presented at two levels that also represent two logical design phases:
• Application model architecture gives an overview of participating human actors, data sources and data flows;
• Distributed software system architecture, based on the Cloud-Dew Computing paradigm.
The rationales for our design decisions are briefly explained below.
• Centralized approach. All treatment, data collection, and processing are physically placed at a clinic (rehabilitation center).
• Distributed approach. The treatment is provided both at a clinic and at home. There are several implementation patterns possible meeting appropriate requirements. In our case, we need support for effective and productive rehabilitation and a system that guarantees a high availability, reliability, safety, and security, besides its advanced functionality.
We have focused on Cloud-Dew computing to meet the above objectives. Our long-term goal is to have a network of cooperating nodes, where a node denotes a cloud server associated with a clinic (rehabilitation center) steering a set of home infrastructures equipped with balance force platforms and computing, storage, and sensor devices.
Each home system works autonomously, even when the Internet connection is temporarily unavailable, and at specific intervals exchanges collaborative information (analysis results, data mining models, etc.) with the center involved in a cloud; here, appropriate security rules are preserved. The on-premise processing is also responsible for annotating the collected data with metadata before sending it to the cloud associated with the center.
The Cloud Server coordinates activities of several home infrastructures, simply Homes. Client Program participates in the traditional software pattern Client-Server. Moreover, it includes the software functionality associated with on-premise data processing. Sensor Network Management steers the data flow from sensors and other data stream sources (e.g. video cameras) to the Home Dataspace managed by the Data Space Management System (DSMS). The Dew Server acts as a specific proxy of the Cloud Server; among others, it is responsible for data synchronization on the Home side.
The Cloud Server may be connected to other cloud servers within an SKY infrastructure.
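The autonomous operation of a home system with deferred synchronization might look roughly like the sketch below. The class and method names are hypothetical, the `upload` callable stands in for whatever transport the real system uses, and a failed connection is assumed to surface as a `ConnectionError`.

```python
import time


class DewNode:
    """Home-side store that keeps working offline and syncs when it can."""

    def __init__(self, home_id, upload):
        self.home_id = home_id
        self._upload = upload      # callable that sends one item to the cloud
        self.local_store = []      # everything recorded at home
        self._pending = []         # items not yet confirmed by the cloud

    def record(self, payload, sensor):
        """Store a reading locally, annotated with metadata, and queue it."""
        item = {
            "home": self.home_id,
            "sensor": sensor,
            "recorded_at": time.time(),
            "payload": payload,
        }
        self.local_store.append(item)
        self._pending.append(item)
        return item

    def pending_count(self):
        return len(self._pending)

    def sync(self):
        """Try to upload queued items in order; stop at the first failure."""
        sent = 0
        while self._pending:
            try:
                self._upload(self._pending[0])
            except ConnectionError:
                break              # still offline, keep the queue for later
            self._pending.pop(0)
            sent += 1
        return sent
```

Recording never depends on the connection, so training at home can continue offline, and calling `sync()` at intervals delivers the backlog once the link is available again.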
Rehabilitation of balance disorders, a significant category of brain damage impairments, is an issue of high social importance involving many research and development challenges. Patient rehabilitation and exercise supervised by an experienced therapist is an ultimate remedy to most of these balance disorders.
This process often requires a lot of time, during which a patient’s improvement is very slow. Under such conditions, it is not easy to motivate the patient to adhere to the recommended exercise. For that purpose, we have developed a game-like environment in which the patient uses the WiFi-based balance force platform to control computer games carefully designed to tempt the patient to repeat movements recommended by the therapist.
An appropriately selected computer game offers the patient continuous visual feedback that indicates how far he/she is from the target location. Some briefly mentioned case studies show the advantages of this approach and indicate that the success of this type of therapy relies on a highly individualized approach to the patient, which can currently be ensured only by an experienced therapist.
This cannot be provided on a mass scale, of course. In this blog, we have introduced a promising Big-data centric therapy approach that can take over some duties of the experienced therapist without compromising the individual needs of the patient.
Our key objectives included: (a) Providing the patient with advanced training means: this was achieved by the WiFi-based balance force platform with a wireless connection to a computing platform equipped with intelligent software; (b) Enabling unlimited access to this training system: a home version of the system is provided to patients in the hospital;
(c) Increasing the effect and productivity of the therapy: this is achieved by capturing a set of physiological parameters with a network of sensors, processing and analyzing these data, and consequently optimizing the therapy parameter settings; and (d) Final technical realization of the proposed functionality by means that guarantee high reliability, safety and security: this is provided by our focus on the Cloud-Dew technology. A small core of early adopters is currently successfully conducting balance disorder rehabilitation according to a methodology relying on the proposed approach.
In future research, we plan to develop an automated adaptive optimization of the involved therapeutic processes, including extended monitoring, assessment, learning, and response cycles based on patient-specific modeling. Further research plans include extending the described framework with data provenance functionality. This is associated with the generation of additional big data resources that increase trust in the collected therapy data and make it possible to reproduce successful treatments, improve the clinical pathways for the brain restoration domain, etc.
This is paving the way to future precision rehabilitation contributing to the greater good.
Big Data Improves Visitor Experience at Local, State, and National Parks—Natural Language Processing Applied to Customer Feedback
Local, State, and National parks are a major source of natural beauty, fresh air, and calming environs that are being used more and more by visitors to achieve mental and physical well-being. Given the popularity of social networks and the availability of smartphones with user-friendly apps, these patrons are recording their visit experiences in the form of online reviews and blogs.
The availability of this voluminous data provides an excellent opportunity for facility management to improve their service operation by cherishing the positive compliments and identifying and addressing the inherent concerns. This data, however, lacks structure, is voluminous, and is not easily amenable to manual analysis, necessitating the use of Big Data approaches.
We designed, developed, and implemented software systems that can download and organize the text from these online reviews, analyze it using Natural Language Processing algorithms to perform sentiment analysis and topic modeling, and provide facility managers with actionable insights to improve visitor experience.
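To give a feel for what sentiment analysis of a review involves, the following sketch scores a text with a tiny word list. It is a simplified stand-in and not the system described here: the word lists are invented for the example, and real sentiment models handle negation, context, and far larger vocabularies.

```python
import re

# Hypothetical mini-lexicons, for illustration only.
POSITIVE = {"beautiful", "clean", "friendly", "great", "peaceful"}
NEGATIVE = {"crowded", "dirty", "rude", "broken", "expensive"}


def sentiment_score(text):
    """Return a score in [-1, 1]: +1 all positive words, -1 all negative,
    0.0 when the text is balanced or contains no lexicon words."""
    words = re.findall(r"[a-z']+", text.lower())
    pos = sum(1 for w in words if w in POSITIVE)
    neg = sum(1 for w in words if w in NEGATIVE)
    if pos + neg == 0:
        return 0.0
    return (pos - neg) / (pos + neg)
```

A review such as "Beautiful trails and friendly staff" contains two positive and no negative lexicon words, so it receives the maximum score under this scheme.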
There is a pleasure in the pathless woods
As the pace of day-to-day life moves faster, the living and work environment gets more mechanized and electronic, the food and the clothing become artificial, the desire to re-connect with nature becomes stronger. This has manifested into a multi-billion dollar global nature-based tourism that continues to grow.
While this industry (a key ingredient of many countries’ economic policy) is mostly centered around national, state and local parks and other protected areas that abound in natural beauty, calming surroundings, and a tranquil atmosphere, for it to continue to grow and thrive there is a strong need for scientific and data-driven service operations management techniques.
All around the world, there are more than thirty thousand parks and protected areas covering more than thirteen million square kilometers. This represents about nine percent of the total land area of the planet. These establishments can be found in over one hundred countries and their dependent territories. Of course, this growth did not occur overnight and has in fact been building (somewhat exponentially) over the past hundred or so years.
When managed well, ecotourism (another name for this industry) has the potential to generate jobs, revenues (some in the form of foreign exchange), philanthropic donations, and better health outcomes. While the designation, development, promotion, and maintenance of national parks and protected areas is a multi-faceted challenge that should involve experts from the travel industry, ecological sciences, environmental policy and safety and security, the quantification, tracking, and improvement of the customer visits is often a second thought.
However, this is changing and there is increasing inclination towards customer-centric strategies. Parks and Preserves that are known exclusively for their scenic beauty, biodiversity and recreational activities are now developing reputations as places that people want to visit because of the way they are received, the tremendous experiences awaiting them, the memories they expect to build, the facilities and processes that are hassle-free, and staff that welcomes them with thoughtful actions.
The on-going challenge that such parks face is to be a choice destination to a wide variety of people, while simultaneously upholding the highest standards of service and stewardship. In order to achieve this balance, parks have to think about the needs and interests of the people that frequent them and those they want to serve.
Most locations and their management often have very rudimentary (e.g., comment cards) approaches to collect customer feedback and even for the small number of responses they receive, do not have the adequate resources to analyze them. Given the recent technological developments in communication, data collection and storage, and text analytics, we propose, implement, and demonstrate the effectiveness of a novel customer feedback analysis system.
Park visitors are just like visitors to any other service (hotel, restaurant, hospital, theater, etc.) in the sense that they come with certain initial expectations regarding facilities, duration, and quality. In order to ensure that these expectations can be met with a high level of guarantee, it is important for the service provider (in this case the park management) not only to know these expectations but also to actively manage them and set up an environment that consistently beats them.
The degree to which these expectations are met and hopefully exceeded determines whether the customers re-visit, tell their relatives, friends, and acquaintances to visit, and volunteer their time and resources.
Receiving customer feedback, measuring customer satisfaction, and identifying and fixing inherent issues have been salient concerns in a wide array of industries, such as hotels and movies, for the past few decades, and it is time for the same management philosophy to be applied to park management.
There are many dimensions along which the customer experience can be measured. The most important are (i) visual attractiveness and appeal of the scenic areas; (ii) condition of the physical facilities (such as bathrooms, fences, trails); (iii) congestion levels; (iv) staff appearance and demeanor; (v) promptness and accuracy of information; (vi) safety and security; (vii) availability of amenities (restaurant, café, drinks, etc.); (viii) pricing and value for money; (ix) entertainment and recreation; and (x) friendliness to families and children.
Effective management of parks and protected areas necessitates gathering the information on where visitors are going, what they are doing, who they are, where they are coming from, how satisfied they were with the visit, what they liked most, what they liked the least, and what suggestions they may have for improvement.
Comprehensive data collection efforts followed by rigorous analysis can ensure an accurate assessment of the values of the park, its resources, and its activities. The techniques used for data collection must clearly support the underlying objectives, and traditionally many organizations have resorted to physical questionnaires, telephone and online surveys, structured interviews, and paid and unpaid focus groups. Visitor surveys are often seen as a lower-cost, more effective approach to collecting customer feedback.
They are so popular that park agencies worldwide now use them, which has made it possible for the agencies to benchmark themselves against one another. This has also enabled the park agencies to communicate their accomplishments to stakeholders (e.g., taxpayers, policymakers) accurately and comprehensively.
It is well known that many such surveys conducted by park management have had low response rates (often in single-digit percentages) and that the responses received showed significant bias. Usually, only those visitors who had a strongly negative experience tended to respond.
There is clearly a need to develop a systematic approach to collecting and analyzing large volumes of accurate customer feedback. One interesting and notable trend is that visitors are increasingly using social networking sites to register their experience in a variety of forms such as online reviews, posts, tweets, images, videos, and hashtags. This digital content, which is growing at a striking pace, not only influences other potential visitors but also contains useful insights for park management to tap into.
Having evolved from early systems such as newsgroups, listservs, chat rooms, and messenger services, social networking has undoubtedly changed the way we communicate, and reduced the cost and inhibition associated with sharing feelings, experiences, and activities. The advent of the smartphone, the meteoric rise in its adoption, and the availability of low/no cost apps that enable easy sharing of comments, photos, and hashtags have further encouraged and enabled information sharing.
It is rather astonishing to see the large number of people sharing vast amounts of data (which no longer necessarily implies numbers, but could be in various formats such as text, images, weblinks, hashtags) on these social networks. Park management could and should utilize this data for strengthening their programs, formulating marketing and outreach strategies and improving services.
Big Data presents serious challenges regarding data collection, storage, and integration. Most of the data is noisy and unstructured. In order to fully harness the power of Big Social Data, park management has to build proper mechanisms and infrastructure to deal with the data.
In this blog, we demonstrate how machine learning based natural language processing methods can be used to analyze social media content of park visitor comments. We do so by using the New York State Park System in the United States of America as an example. We describe the software methodologies and infrastructure used to extract useful and actionable insights for park management, economic development, and service operations.
New York State Park System
The New York State (NYS) park system consists of 214 parks and historic sites, over 2000 miles of trails, 67 beaches, and 8355 campsites. It attracts approximately 60 million visitors every year. The State Office of Parks, Recreation, and Historic Preservation is responsible for operating and maintaining the state park system, and one of its strategic priorities is to “Increase, Deepen, and Improve the Visitor Experience”.
Visitor feedback is integral to achieving this objective, but traditional feedback methods—public meetings, web-based surveys, and comment cards—are often tedious, expensive, and limited by low participation. Public online review platforms such as TripAdvisor offer a large volume of visitor feedback that could vastly improve how NYS park managers and other community leaders concerned with tourism or business development currently understand and improve visitor experiences.
The Challenges of Utilizing Public Online Reviews
The NYS park system could develop a deeper understanding of diverse public opinions about its parks by harnessing public online reviews. However, the data on these sites is unstructured, voluminous, and not easily amenable to manual analysis. In order to tap into this rich source of visitor information, facility managers need new ways to get feedback online and new software tools for analysis.
A research group from Cornell’s Samuel Curtis Johnson Graduate School of Management and the Water Resources Institute designed, developed, and implemented software systems to tap online review content for the benefit of state agencies and the general public. Among the many online social platforms that host visitor reviews, Yelp, TripAdvisor, and Google are the most popular and thus were used to develop a pilot decision support system for NYS park managers.
Basics of Natural Language Processing
We briefly describe some of the popular analytical techniques associated with natural language processing and list the various available software systems that can be utilized to perform the proposed analysis.
Preprocessing and Text Representation
Arguably the first critical step in analyzing unstructured and voluminous text is the transformation of the free form (qualitative) text into a structured (quantitative) form that is easier to analyze. The easiest and most common transformation is with the “bag of words” representation of text.
The moniker “bag of words” signifies that the distribution of words within each document is sufficient, i.e., linguistic features like word order, grammar, and other attributes of the written text can be ignored (without losing much information) for statistical analysis. This approach converts the corpus into a document-term matrix, which contains a row for every document and a column for each word; each entry is a count of how often that word appears in that document.
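To make the document-term matrix concrete, here is a minimal sketch in plain Python. It assumes whitespace tokenization and two made-up reviews; the function name `document_term_matrix` is ours for illustration and is not part of the system described in this blog.

```python
from collections import Counter


def document_term_matrix(documents):
    """Return (vocabulary, matrix); matrix[i][j] counts term j in document i."""
    # Assumption: lowercase + whitespace split is a good enough tokenizer here.
    tokenized = [doc.lower().split() for doc in documents]
    # One column per unique word, in alphabetical order.
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # One row per document; a Counter returns 0 for absent words.
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix


# Made-up reviews, for illustration only.
reviews = [
    "great trails great views",
    "crowded trails and dirty bathrooms",
]
vocabulary, matrix = document_term_matrix(reviews)
for row in matrix:
    print(row)
```

Word order is discarded: only the per-document counts survive, which is exactly the simplification the “bag of words” name refers to.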
The resulting structured and quantitative document-term matrix can then, in principle, be analyzed using any of the available mathematical techniques. The size of the matrix, however, can create computational and algorithmic challenges. Natural language processing overcomes this hurdle by removing uninformative words, which emphasizes the meaningful ones and keeps the number of unique terms in the corpus from becoming extremely large.
There are preprocessing steps that are standard, including (i) transforming all text into lowercase, (ii) removing words composed of fewer than 3 characters and very common words called stop words (e.g., the, and, of), (iii) stemming words, which refers to the process of removing suffixes, so that words like values, valued and valuing are all replaced with value, and finally (iv) removing words that occur either too frequently or very rarely.
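The four steps above can be sketched as follows. This is only an illustration under stated assumptions: the stop word list is a tiny placeholder, `toy_stem` is a deliberately crude suffix stripper standing in for a real stemming algorithm, and the frequency thresholds (`min_docs`, `max_doc_fraction`) are hypothetical values rather than the ones used in our analysis.

```python
from collections import Counter

STOP_WORDS = {"the", "and", "of", "was", "were", "for", "with"}  # placeholder list
SUFFIXES = ("ing", "ed", "es", "s", "e")  # toy rules, not a real stemmer


def toy_stem(word):
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(documents, min_docs=2, max_doc_fraction=0.9):
    cleaned = []
    for doc in documents:
        # (i) lowercase, then strip surrounding punctuation from each token
        tokens = [w.strip(".,!?;:\"'()") for w in doc.lower().split()]
        # (ii) drop words shorter than 3 characters and stop words
        tokens = [w for w in tokens if len(w) >= 3 and w not in STOP_WORDS]
        # (iii) stem what is left
        cleaned.append([toy_stem(w) for w in tokens])
    # (iv) drop terms found in too few or too many documents
    doc_freq = Counter(term for tokens in cleaned for term in set(tokens))
    max_docs = max_doc_fraction * len(documents)
    keep = {t for t, n in doc_freq.items() if min_docs <= n <= max_docs}
    return [[t for t in tokens if t in keep] for tokens in cleaned]


# Made-up reviews, for illustration only.
docs = [
    "The park trails were valued by hikers.",
    "We hiked the park and the trail is of great value!",
    "Park staff keep valuing the trails.",
    "The park was crowded.",
]
processed = preprocess(docs)
print(processed)
```

In this toy corpus the word "park" appears in every review and is therefore dropped as too frequent, while words that appear in a single review are dropped as too rare. Note that the toy stemmer returns a truncated stem ("valu") rather than a dictionary word; what matters for the analysis is that all variants map to the same token.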
Word counts and sentiment represent the most basic statistics for summarizing a corpus, and research has shown that they are associated with customer decision making and product sales.
To accommodate potential non-linearities in the impact of sentiment on customer behavior, it is recommended to separately estimate measures of positive sentiment and negative sentiment for each item of customer feedback. The positive sentiment score is calculated by counting the number of unique words in the review that match a list of “positive” words in validated databases (called dictionaries) from the existing literature. The negative sentiment score is calculated analogously.
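The scoring rule just described can be sketched in a few lines. The two word sets below are tiny placeholders that we made up for illustration; they are not the validated dictionaries used in the analysis, and the function name `sentiment_scores` is likewise hypothetical.

```python
# Placeholder lexicons; a real analysis would load validated dictionaries.
POSITIVE_WORDS = {"beautiful", "clean", "friendly", "great"}
NEGATIVE_WORDS = {"crowded", "dirty", "rude", "broken"}


def sentiment_scores(review, positive=POSITIVE_WORDS, negative=NEGATIVE_WORDS):
    """Return (positive_score, negative_score) for one review.

    Each score is the number of unique review words found in the
    corresponding dictionary, so repeated words are counted once.
    """
    words = {w.strip(".,!?;:\"'()") for w in review.lower().split()}
    return len(words & positive), len(words & negative)


review = "Great views and great, friendly staff, but the bathrooms were dirty."
pos, neg = sentiment_scores(review)
print(pos, neg)  # prints: 2 1
```

Keeping the two scores separate, instead of netting them into a single number, preserves the information that a review is both enthusiastic and critical, which a single net score would hide.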
The choice of the dictionary is an important methodological consideration when measuring sentiment, since the sentiment of certain words can change with the underlying context. For instance, it has been shown that dictionaries created using financial 10-K disclosures are more appropriate for financial sentiment analysis than dictionaries created from other domains.
The dictionaries we chose were created, respectively, to summarize the opinions within online customer reviews and to perform tone analysis of social media blogs. In total, the combined dictionaries consist of approximately 10,000 labeled words.
Review Analysis—Further Enhancements
These analytical tools are continuously improving, becoming algorithmically faster and more sophisticated in their features, and are thus applicable not only to government-operated facilities but also to resources managed by non-profit groups such as land trusts. Depending on the needs of state or local agencies and other managers of public spaces, these tools could be enhanced in the following ways:
Automatic Downloading of Reviews: Online review platforms continue to increase in popularity and new reviews are submitted on a regular basis. Manual downloading is time-consuming, especially for managers who want “real-time” reports. It is necessary to develop and implement a process that automatically downloads and organizes online reviews into a database.
Topic Modeling on Negative/Positive Review Segments: The current approach extracts themes from whole reviews that have been labeled as positive or negative. But reviews are rarely, if ever, completely positive or negative. Each review typically contains segments that are positive alongside segments that are negative. In order to get a more accurate collection of themes, the analysis should perform topic modeling on collections of review segments as opposed to whole reviews.
Topic Modeling Incorporating Expert Feedback: A topic is simply a collection of words. When the topics are chosen by computer software, some of the words in the topic may not fit according to the needs of park managers. In such cases, the managers can identify words that should be dropped from a topic and the model can be re-run. Such a recursive approach will lead to a more accurate extraction of themes and improved managerial insights.
Verify Reviews: Reviews from third-party online platforms are unsolicited and often cannot be checked for veracity. With the advancement and proliferation of technologies like mobile phones and microchip wristbands, the use of devices that track key personal information is increasingly common. These devices carry important information which visitors could voluntarily share with the facility management to create verified or more detailed reviews.
Identifying and Accommodating Temporal Changes: Instances exist in which reviewer characteristics, the length and content of reviews, the topics discussed, and even the language used undergo a seismic shift. When and if that happens, the existing analysis, applied without any changes, can lead to wrong insights and conclusions. It is necessary to have an approach for identifying such temporal shifts in the data and determining how the analysis should be adjusted.
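As one illustration of the last item, a simple way to flag a shift in the language of reviews is to compare the word distributions of two time windows. The sketch below uses the Jensen-Shannon divergence for that comparison; this choice of measure, the function names, and the example reviews are our assumptions for illustration, not a description of an implemented component.

```python
import math
from collections import Counter


def word_distribution(reviews):
    """Relative frequency of each word across a list of reviews."""
    counts = Counter(w for r in reviews for w in r.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 = identical, 1 = no shared words."""
    vocab = set(p) | set(q)
    # Mixture distribution: the average of the two inputs.
    m = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in vocab}

    def kl(a):
        # Kullback-Leibler divergence from the mixture m to distribution a.
        return sum(pa * math.log2(pa / m[w]) for w, pa in a.items() if pa > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)


# Made-up reviews from two hypothetical time windows.
earlier = ["great trails great views", "clean beach"]
later = ["parking full", "long lines parking"]
shift = js_divergence(word_distribution(earlier), word_distribution(later))
print(round(shift, 3))
```

A value near 0 suggests the vocabulary is stable across the two windows, while a value near 1 suggests the reviews are talking about very different things. Any alert threshold would have to be calibrated on real data before being relied on.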
The necessity and usefulness of natural scenic areas such as local, state, and national parks have never been greater for our society. They provide much-needed serenity and tranquility, with the resulting mental health benefits and stress relief, in lives that are becoming increasingly lonely, technology-centric, and hectic. Given that these benefits must be realized in short durations of available free time, it is important that appropriate information and feedback is provided to the visitors.
The traditional methods with which the park managers connect with their visitors have had many strengths, but also numerous weaknesses, and the availability of social networks and the information available within them can further strengthen the communication between the park visitors and park management.
Given the new developments in data storage, computational capacity, and efficiency of natural language processing algorithms, park managers now have a new opportunity to better understand visitor expectations, measure their capacity to meet these expectations, and make the necessary changes to improve their service offering.
Using the New York State park system as the backdrop, we demonstrate the system we built to apply natural language processing algorithms to customer feedback collected from the various social networks. This is only a small first step and we describe ways in which such systems can be further improved and implemented by the teams managing these local, state, and national parks.