What Are the Main Big Data Applications? (2019)
Big Data refers both to the phenomenon we are trying to record and to the hidden patterns and complexities of the data we attempt to unpack. It has an impact at the individual, organizational, and societal levels, and is perceived as a breakthrough technological development. In this blog, we explore 20+ Big Data applications.
Today, we are witnessing an exponential increase in ‘raw data’, both human- and machine-generated. Human-generated data is born of the continuous social interactions among individuals, which led McAfee to refer to people as “walking data generators”; machine-generated data is born of the continuous interaction among objects (generally termed the ‘Internet of Things’) and is typically collected via sensors and IP addresses. Big Data comes from five major sources:
Large-scale enterprise systems, such as enterprise resource planning, customer relationship management, supply chain management, and so on.
Online social graphs, resulting from the interactions on social networks, such as Facebook, Twitter, Instagram, WeChat, and so on.
Mobile devices, comprising handsets, mobile networks, and internet connection.
Internet-of-Things, involving the connection between physical objects via sensors.
Open data/public data, such as weather data, traffic data, environment and housing data, financial data, geodata, and so on.
Advanced analytics techniques such as text analytics, machine learning, predictive analytics, data mining, statistics, and natural language processing are just as important as the Big Data itself. This may sound somewhat obvious and trivial, but it is important to be clear about the weight that each holds in the discussion.
Impact on the Farming and Agricultural Sector
It is perhaps stating the obvious to say that, in the end, nothing is more important than our food supply. Considering that we still live in a world in which people are dying of starvation, it comes as quite a surprise that about a third of the food produced for human consumption is lost or wasted every year.
The agriculture sector is thus in desperate need of solutions to problems such as inefficiencies in planting, harvesting, water use, and trucking, among others. The Big Data age promises to help.
For example, Big Data Analytics can help farmers simulate the impact of water, fertilizer, and pesticide, and engineer plants that will grow in harsh climatic conditions; it can help to reduce waste, increase and optimize production, speed up plant-growth, and minimize the use of scarce resources, such as water.
Generally speaking, Big Data Analytics has not yet been widely applied in agriculture. Nonetheless, there is increasing evidence of the use of digital technologies and bio-technologies to support agricultural practices. This is termed smart farming, a concept closely related to sustainable agriculture.
Farmers have now started using high-technology devices to generate, record, and analyze data about soil and water conditions and weather forecasts in order to extract insights that help them refine their decision-making.
Some examples of tools used in this regard include agricultural drones (for fertilizing crops), satellites (for detecting changes in the field), and sensors in the field (for collecting information about weather conditions, soil moisture and humidity, and so on).
As of now, Big Data Analytics in agriculture has resulted in a number of research studies in several areas—we mention some of the most recent ones here: crops, land, remote sensing, weather and climate change, animal research, and food availability and security.
The applicability of Big Data in agriculture faces a series of challenges, among them: data ownership, security, and privacy issues; data quality; intelligent processing and analytics; sustainable integration of Big Data sources; and openness of platforms to speed up innovation and solution development.
These challenges will need to be addressed in order to expand the scope of Big Data applications in agriculture and smart farming.
Some of the most encouraging and inspiring examples of mobile-based cloud computing deployments can be found in sub-Saharan Africa (SSA), where they have led to economic and social transformations.
For instance, the use of mobile clouds in agricultural and farming activities has markedly enhanced productivity. The Apps4Africa Award-winning app iCow helps small-scale dairy farmers track and manage their cows’ fertility cycles.
The app informs farmers about the important days of cow gestation period; collects and stores milk and breeding records, and sends farmers best practices. It also helps farmers find the nearest vets and other service providers. Its developer, Green Dreams, has also formed a system involving Google Docs.
Another innovative Big Data- and cloud-based mobile computing solution is Farmforce, a US$2 million platform developed by the Swiss-based Syngenta Foundation for Sustainable Agriculture, which was backed by the Swiss government.
Farmforce operates on a subscription-based SaaS model and tracks pesticide residues in produce. Farmers can access the software free online via a mobile phone and no longer need to keep manual records of farm activities and operations.
The cloud-based platform, AgriLife, which is accessible via mobile phone, is used for collecting data and analyzing farmers’ production capability and history.
In order to ensure fast, easy, and efficient availability of resources and services to distant, rural farmers, the platform also acts as an integration point for financial institutions, mobile network operators, produce buyers, and their agents.
The data analysis provides a better understanding of small farmers’ needs and production capability. Service providers can tailor their offerings such as crop insurance, input payments, and savings accounts based on the data.
Big Data- and Cloud-Based Mobile Computing Solutions in the Development of Index Insurance
Several researchers and practitioners have advocated the development and use of index insurance contracts to manage the risks faced by farmers and agricultural producers.
Note that whereas conventional insurance compensates the insured based on verifiable losses, under an index insurance scheme, payment to an insured farmer depends on the observed value of a specified “index.”
Prior researchers have suggested that the benefits of index insurance contracts are likely to be greater to lending institutions such as agricultural and industrial development banks and MFIs than to an individual borrower.
An index that accurately measures systemic agricultural production shocks within a lending institution’s geographic boundaries can help the institution track its cash flows effectively.
This means that because a large proportion of borrower-specific idiosyncratic risks are diversified away across the portfolio, a lending institution is likely to face lower basis risk than its borrowers face individually.
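To make the mechanics concrete, here is a minimal Python sketch of how a payout might be computed under a hypothetical rainfall index contract. The trigger, exit, and sum-insured figures are invented for illustration and not taken from any real scheme.

```python
def index_payout(index_value, trigger, exit_level, sum_insured):
    """Payout for a simple rainfall index insurance contract (illustrative).

    Pays nothing while the index (e.g. seasonal rainfall in mm) stays at or
    above `trigger`, the full `sum_insured` at or below `exit_level`, and
    scales linearly in between. All parameter names are hypothetical.
    """
    if index_value >= trigger:
        return 0.0
    if index_value <= exit_level:
        return float(sum_insured)
    shortfall = (trigger - index_value) / (trigger - exit_level)
    return round(shortfall * sum_insured, 2)

# A season with 180 mm of rain, against a 250 mm trigger and 100 mm exit:
print(index_payout(180, trigger=250, exit_level=100, sum_insured=10000))  # 4666.67
```

Because the payout depends only on the observed index, not on farm-by-farm loss assessment, claims can be settled quickly and cheaply, which is precisely what makes index contracts attractive to lenders.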
The loan portfolios of most MFIs in developing countries are typically concentrated in urban areas. Systemic risks associated with droughts, floods, cyclones, and other extreme weather-related events tend to make agricultural loans less attractive and hinder the ability and enthusiasm of MFIs to expand their services to rural farmers.
Impact on Educational Outcomes
Big Data- and cloud-based mobile computing solutions can also have a positive impact on education and literacy-related activities. The Connect to Learn (CtL) program provides schools with laptops or netbooks and free wireless to access news, information, and educational content in the cloud.
Philanthropic and charitable causes have also been a factor in stimulating the deployment of Big Data- and cloud-based mobile computing solutions in such activities.
For instance, Worldreader, which describes its mission as to “make digital books available to children and their families in the developing world, so millions of people can improve their lives,” uses Amazon’s AWS to deliver books.
Worldreader has made thousands of free books available in the cloud, which can be accessed by low-end mobile devices on older 2G mobile networks. The books can be accessed through biNu, a free mobile software platform.
Most of the processing is performed in the cloud’s servers instead of on the phone, which allows biNu to work ten times faster than regular mobile web browsers.
Impact on e-Commerce
As customers, we enjoy the convenience of shopping at home and avoiding time-consuming queues. The disadvantages to the customer are few but, depending on the type of transaction, the lack of contact with a store employee may inhibit the use of online purchasing.
Increasingly, this barrier is being overcome by online customer advice facilities such as ‘instant chat’, online reviews, and star rankings, along with a huge choice of goods and services and generous return policies.
As well as buying and paying for goods, we can now pay our bills, do our banking, buy airline tickets, and access a host of other services all online.
eBay works rather differently and is worth mentioning because of the huge amounts of data it generates.
With transactions being made through sales and auction bids, eBay generates approximately 50 TB of data a day, collected from every search, sale, and bid made on its website by a claimed 160 million active users in 190 countries.
Using this data and the appropriate analytics they have now implemented recommender systems similar to those of Netflix, discussed later in this blog.
Social networking sites provide businesses with instant feedback on everything from hotels and vacations to clothes, computers, and yogurt. By using this information, businesses can see what works, how well it works, and what gives rise to complaints, while fixing problems before they get out of control.
Even more valuable is the ability to predict what customers want to buy based on previous sales or website activity. Social networking sites such as Facebook and Twitter collect massive amounts of unstructured data that businesses can benefit from commercially given the appropriate analytics. Travel websites, such as TripAdvisor, also share information with third parties.
Professionals are now increasingly acknowledging that appropriate use of big data can provide useful information and generate new customers through improved merchandising and use of better-targeted advertising.
Whenever we use the Web we are almost inevitably aware of online advertising and we may even post free advertisements ourselves on various bidding sites such as eBay.
One of the most popular kinds of advertising follows the pay-per-click model, which is a system by which relevant advertisements pop up when you are doing an online search.
If a business wants their advertisement to be displayed in connection with a particular search term, they place a bid with the service provider on a keyword associated with that search term.
They also declare a daily maximum budget. The adverts are displayed in order according to a system based in part on which advertiser has bid the highest for that term.
If you click on their advertisement, the advertiser then must pay the service provider what they bid. Businesses only pay when an interested party clicks on their advertisement, so these adverts must be a good match for the search term to make it more likely that a Web surfer will click on them.
Sophisticated algorithms ensure that revenue is maximized for the service provider, for example Google or Yahoo. The best-known implementation of pay-per-click advertising is Google’s AdWords. When we search on Google, the advertisements that automatically appear on the side of the screen are generated by AdWords.
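The bookkeeping behind this model can be sketched in a few lines of Python. The advertiser names, bids, and budgets below are invented, and real systems such as Google’s also factor in ad quality and predicted click-through rates, not just the bid.

```python
# A minimal sketch of pay-per-click keyword auctions: advertisers bid on a
# keyword, ads are shown highest bid first, and each click is charged
# against a declared daily budget. All figures are illustrative.

def rank_ads(bids):
    """Order advertisers for display: highest bid on the keyword first."""
    return sorted(bids, key=lambda ad: ad["bid"], reverse=True)

def charge_click(ad, spent):
    """Charge the advertiser their bid for one click, unless doing so
    would exceed their declared daily budget."""
    if spent.get(ad["name"], 0.0) + ad["bid"] > ad["daily_budget"]:
        return False  # budget exhausted; the ad stops being shown today
    spent[ad["name"]] = spent.get(ad["name"], 0.0) + ad["bid"]
    return True

bids = [
    {"name": "A", "bid": 0.50, "daily_budget": 1.00},
    {"name": "B", "bid": 0.80, "daily_budget": 0.80},
]
spent = {}
print([ad["name"] for ad in rank_ads(bids)])  # B outbids A, so B is shown first
charge_click(bids[1], spent)                  # one click on B's ad costs B its bid
print(spent)
```

Note how a second click on B’s ad would be refused: its budget of 0.80 is already spent, which is exactly why click fraud against a rival’s budget (discussed below) is damaging.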
The downside is that clicks can be expensive, and there is also a limit on the number of characters you are allowed to use so that your advertisement will not take up too much space.
Click fraud is also a problem. For example, a rival company may click on your advertisement repeatedly in order to use up your daily budget. Or a malicious computer program, called a clickbot, may be used to generate clicks.
The victim of this kind of fraud is the advertiser since the service provider gets paid and no customers are involved.
Probably the simplest method is to keep track of how many clicks are needed on average to generate a purchase. If this suddenly increases or if there are a large number of clicks and virtually no purchases then fraudulent clicking seems likely.
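That clicks-per-purchase heuristic is easy to sketch. The traffic figures and the 3x threshold below are arbitrary illustrative choices, not values any real fraud-detection system is known to use.

```python
def clicks_per_purchase(clicks, purchases):
    """Average number of clicks needed to generate one purchase."""
    return clicks / purchases if purchases else float("inf")

def looks_fraudulent(today_clicks, today_purchases,
                     baseline_ratio, threshold=3.0):
    """Flag a day whose clicks-per-purchase ratio jumps well above the
    historical baseline -- e.g. many clicks but almost no purchases.
    The 3x threshold is an arbitrary illustrative choice."""
    return clicks_per_purchase(today_clicks, today_purchases) > threshold * baseline_ratio

baseline = clicks_per_purchase(1000, 50)   # historically 20 clicks per purchase
print(looks_fraudulent(900, 2, baseline))  # 450 clicks per purchase: suspicious
print(looks_fraudulent(400, 18, baseline)) # ~22 clicks per purchase: normal
```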
In contrast to pay-per-click arrangements, targeted advertising is based explicitly on each person’s online activity record.
Cookies come in several forms, all of which originate externally and are used to keep a record of some activity on a website and/or computer.
When you visit a website, a message consisting of a small file that is stored on your computer is sent by a Web server to your browser. This message is one example of a cookie, but there are many other kinds, such as those used for user-authentication purposes and those used for third-party tracking.
Almost every click you make on the Internet is collected and used for targeted advertising.
A tracking identifier is stored on your computer as a third-party cookie, and your browsing data is sent to third-party advertising networks. When you visit other sites supported by the same network, advertisements for products you looked at previously will be displayed on your screen.
Using Lightbeam, a free add-on to Mozilla Firefox, you can keep track of which companies are collecting your Internet activity data.
Recommender systems provide a filtering mechanism by which information is provided to users based on their interests. Other types of recommender systems, not based on the users’ interests, show what other customers are looking at in real-time and often these will appear as ‘trending’. Netflix, Amazon, and Facebook are examples of businesses that use these systems.
A popular method for deciding what products to recommend to a customer is collaborative filtering.
Generally speaking, the algorithm uses data collected on individual customers from their previous purchases and searches and compares this to a large database of what other customers liked and disliked in order to make suitable recommendations for further purchasing.
However, a simple comparison does not generally produce good results. Consider the following example.
Suppose an online book store sells a cookery book to a customer. It would be easy subsequently to recommend all cookery books, but this is unlikely to be successful in securing further purchases.
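Collaborative filtering improves on this by weighting recommendations by customer similarity. Here is a minimal sketch with invented ratings: the customer whose tastes match the target’s contributes the suggestion, rather than the whole cookery shelf.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two customers' rating dicts,
    computed over the items they have both rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    nu = sqrt(sum(u[i] ** 2 for i in common))
    nv = sqrt(sum(v[i] ** 2 for i in common))
    return dot / (nu * nv)

def recommend(target, others):
    """Suggest items the most similar customer rated
    that the target has not bought yet."""
    best = max(others, key=lambda o: cosine(target, o))
    return sorted(i for i in best if i not in target)

# Invented ratings (1-5). The customer who also loved the cookery title
# gets matched, so a wine guide is suggested rather than every cookery
# book in the catalogue.
alice = {"french_cooking": 5, "thrillers": 1}
others = [
    {"french_cooking": 5, "thrillers": 1, "wine_guide": 4},
    {"french_cooking": 1, "thrillers": 5, "car_manual": 3},
]
print(recommend(alice, others))  # ['wine_guide']
```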
Amazon produces and sells a diverse range of goods, from electronic devices to books and even fresh food items such as yogurt, milk, and eggs through Amazon Fresh. It is also a leading big data company, with Amazon Web Services providing Cloud-based big data solutions for business, using developments based on Hadoop.
Amazon collected data on what books were bought, what books a customer looked at but did not buy, how long they spent searching, how long they spent looking at a particular book, and whether or not the books they saved were translated into purchases.
From this, they could determine how much a customer spent on books monthly or annually, and determine whether they were regular customers.
In the early days, the data Amazon collected was analyzed using standard statistical techniques. Customers were compared with samples of similar people and, based on the similarities found, Amazon would offer them more of the same.
Taking this a step further, in 2001 researchers at Amazon applied for and were granted a patent on a technique called item-to-item collaborative filtering. This method finds similar items, not similar customers.
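A simplified stand-in for the item-to-item idea, using invented purchase histories: two items count as similar when largely the same customers buy both, measured here by the Jaccard overlap of their buyer sets. Amazon’s actual patented method is more sophisticated, but the shape is the same: similarity attaches to items, not customers.

```python
def item_similarity(purchases, item_a, item_b):
    """Similarity of two items measured by co-purchase overlap
    (Jaccard index over the sets of customers who bought each).
    A simplified stand-in for Amazon's patented method."""
    buyers_a = {c for c, items in purchases.items() if item_a in items}
    buyers_b = {c for c, items in purchases.items() if item_b in items}
    if not buyers_a or not buyers_b:
        return 0.0
    return len(buyers_a & buyers_b) / len(buyers_a | buyers_b)

# Invented purchase histories:
purchases = {
    "c1": {"camera", "tripod"},
    "c2": {"camera", "tripod", "memory_card"},
    "c3": {"camera", "novel"},
}
print(item_similarity(purchases, "camera", "tripod"))  # high: usually bought together
print(item_similarity(purchases, "camera", "novel"))   # lower: one co-purchase
```

Because item-item similarities change far more slowly than customer behaviour, they can be precomputed offline, which is what makes this approach practical at Amazon’s scale.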
Amazon collects vast amounts of data including addresses, payment information, and details of everything an individual has ever looked at or bought from them. Amazon uses its data in order to encourage the customer to spend more money with them by trying to do as much of the customer’s market research as possible.
In the case of books, for example, Amazon needs to provide not only a huge selection but to focus recommendations on the individual customer. If you subscribe to Amazon Prime, they also track your movie watching and reading habits.
Many customers use smartphones with GPS capability, allowing Amazon to collect data showing time and location. This substantial amount of data is used to construct customer profiles allowing similar individuals and their recommendations to be matched.
Since 2013, Amazon has been selling customer metadata to advertisers in order to promote their Web services operation, resulting in huge growth.
For Amazon Web Services, their Cloud computing platform, security is paramount and multi-faceted. Passwords, key-pairs, and digital signatures are just a few of the security techniques in place to ensure that clients’ accounts are available only to those with the correct authorization.
Amazon’s own data is similarly multi-protected and encrypted using the AES (Advanced Encryption Standard) algorithm for storage in dedicated data centers around the world, and Secure Sockets Layer (SSL), the industry standard, is used for establishing a secure connection between two machines, such as the link between your home computer and Amazon.
Amazon is pioneering anticipatory shipping based on big data analytics. The idea is to use big data to anticipate what a customer will order. Initially, the plan is to ship a product to a nearby delivery hub before an order actually materializes. As a simple extension, a product could even be shipped before any order is placed, with a delighted customer receiving a surprise package.
Given Amazon’s returns policy, this is not a bad idea. It is anticipated that most customers will keep such items, since they are based on personal preferences uncovered by big data analytics.
Amazon’s 2014 patent on anticipatory shipping also states that goodwill can be bought by sending a promotional gift. Goodwill, increased sales through targeted marketing, and reduced delivery times all make this what Amazon believes to be a worthwhile venture.
Amazon also filed for a patent on autonomous flying drone delivery, called Prime Air. In September 2016, the US Federal Aviation Administration relaxed the rules for flying drones by commercial organizations, allowing them, in certain highly controlled situations, to fly beyond the line of sight of the operator.
This could be the first stepping stone in Amazon’s quest to deliver packages within thirty minutes of an order being placed, perhaps leading to a drone delivery of milk after your smart refrigerator sensor has indicated that you are running out.
Amazon Go, located in Seattle, is a convenience food store and the first of its kind with no checkout required. As of December 2016, it was only open to Amazon employees, and plans to open it to the general public in January 2017 had been postponed.
At present, the only technical details available are from the patent submitted two years earlier, which describes a system eliminating the need to go through an item-by-item checkout.
Instead, the details of a customer’s actual cart are automatically added to their virtual cart as they shop. Payment is made electronically as they leave the store through a transition area, as long as they have an Amazon account and a smartphone with the Amazon Go app.
The Go system is based on a series of sensors, a great many of them used to identify when an item is taken from or returned to a shelf.
This will generate a huge amount of commercially useful data for Amazon. Clearly, since every shopping action made between entering and leaving the store is logged, Amazon will be able to use this data to make recommendations for your next visit in a way similar to their online recommendation system.
However, there may well be issues about how much we value our privacy, especially given aspects such as the possibility mentioned in the patent application of using facial recognition systems to identify customers.
Another Silicon Valley company, Netflix, started in 1997 as a postal DVD rental company. You took out a DVD and added another to your queue, and they would then be sent out in turn.
Rather usefully, you had the ability to prioritize your queue. This service is still available and still lucrative, though it appears to be gradually winding down.
Now an international, Internet streaming media provider with approximately seventy-five million subscribers across 190 different countries, Netflix successfully expanded into providing its own original programmes in 2015.
Netflix collects and uses huge amounts of data to improve customer service, such as offering recommendations to individual customers while endeavoring to provide reliable streaming of its movies. The recommendation is at the heart of the Netflix business model and most of its business is driven by the data-based recommendations it is able to offer customers.
Netflix now tracks what you watch, what you browse, what you search for, and the day and time you do all these things. It also records whether you are using an iPad, TV, or something else.
In 2006, Netflix announced a crowdsourcing competition aimed at improving their recommender systems. They were offering a $1 million prize for a collaborative filtering algorithm that would improve by 10 percent the prediction accuracy of user movie ratings.
Netflix provided the training data, over 100 million items, for this machine learning and data mining competition —and no other sources could be used.
Netflix offered an interim prize (the Progress Prize) worth $50,000, which was won by the Korbell team in 2007 for solving a related but somewhat easier problem. ‘Easier’ is a relative term here, since their solution combined 107 different algorithms to come up with two final algorithms, which, with ongoing development, are still being used by Netflix.
These algorithms were gauged to cope with 100 million ratings as opposed to the five billion that the full prize algorithm would have had to be able to manage. The full prize was eventually awarded in 2009 to the BellKor’s Pragmatic Chaos team whose algorithm represented a 10.06 percent improvement over the existing one.
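The competition’s accuracy measure was root-mean-square error (RMSE) on predicted star ratings. A small sketch, using invented ratings plus the commonly quoted baseline (Cinematch at 0.9525) and winning-entry (0.8567) figures:

```python
from math import sqrt

def rmse(predicted, actual):
    """Root-mean-square error between predicted and actual star ratings --
    the accuracy measure used to judge the Netflix Prize."""
    return sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))

def improvement(new_rmse, baseline_rmse):
    """Percentage improvement over the baseline recommender."""
    return 100 * (baseline_rmse - new_rmse) / baseline_rmse

# Invented ratings for illustration:
actual    = [4, 3, 5, 2, 4]
predicted = [3.8, 3.4, 4.5, 2.2, 4.1]
print(round(rmse(predicted, actual), 3))

# The commonly quoted prize figures: 0.8567 vs Cinematch's 0.9525.
print(round(improvement(0.8567, 0.9525), 2))  # ~10.06 percent, the winning margin
```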
Netflix never fully implemented the winning algorithm, primarily because by this time their business model had changed to the now-familiar one of media streaming.
Once Netflix expanded their business model from postal service to providing movies by streaming, they were able to gather a lot more information on their customers’ preferences and viewing habits, which in turn enabled them to provide improved recommendations.
However, in a departure from purely algorithmic methods, Netflix employs part-time taggers, a total of about forty people worldwide who watch movies and tag the content, labeling them as, for example, ‘science fiction’ or ‘comedy’. This is how films get categorized—using human judgment initially, not a computer algorithm; that comes later.
Netflix uses a wide range of recommender algorithms that together make up a recommender system. All these algorithms act on the aggregated big data collected by the company.
Content-based filtering, for example, analyses the data reported by the ‘taggers’ and finds similar movies and TV programmes according to criteria such as genre and actor. Collaborative filtering monitors such things as your viewing and search habits.
Recommendations are based on what viewers with similar profiles watched. This is less successful when a user account has more than one user, typically several members of a family with inevitably different tastes and viewing habits. To overcome this problem, Netflix created the option of multiple profiles within each account.
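Content-based filtering of the kind the taggers enable can be sketched very simply; the titles and tags below are invented. Each unwatched title is ranked by how much its tag set overlaps with what the profile has already watched.

```python
def tag_overlap(a, b):
    """Jaccard overlap between two tag sets (e.g. human-assigned genres)."""
    return len(a & b) / len(a | b)

def content_based(watched_tags, catalogue):
    """Rank catalogue titles by how closely their tags match what this
    profile has already watched. Titles and tags are invented."""
    return sorted(catalogue,
                  key=lambda t: tag_overlap(watched_tags, catalogue[t]),
                  reverse=True)

catalogue = {
    "Dark Star":    {"science fiction", "comedy"},
    "Alien":        {"science fiction", "horror"},
    "Notting Hill": {"comedy", "romance"},
}
profile = {"science fiction", "comedy"}
print(content_based(profile, catalogue))  # the sci-fi comedy ranks first
```

In practice Netflix blends signals like this with collaborative filtering, viewing times, and many other variables in one composite recommender system.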
On-demand Internet TV is another area of growth for Netflix and the use of big data analytics will become increasingly important as they continue to develop their activities.
As well as collecting search data and star ratings, Netflix can now keep records on how often users pause or fast forward, and whether or not they finish watching each programme they start.
They also monitor how, when, and where they watched the programme and a host of other variables too numerous to mention. Using big data analytics we are told that they are now even able to predict quite accurately whether a customer will cancel their subscription.
A particularly useful method for mining big data is the Bloom filter, a technique based on probability theory which was developed in the 1970s. As we will see, Bloom filters are particularly suited to applications where storage is an issue and where the data can be thought of as a list.
The basic idea behind Bloom filters is that we want to build a system, based on a list of data elements, to answer the question ‘Is X in the list?’
With big datasets, searching through the entire set may be too slow to be useful, so we use a Bloom filter which, being a probabilistic method, is not 100 percent accurate—the algorithm may decide that an element belongs to the list when actually it does not; but it is a fast, reliable, and storage efficient method of extracting useful knowledge from data.
Bloom filters have many applications. For example, they can be used to check whether a particular Web address leads to a malicious website.
In this case, the Bloom filter would act as a blacklist of known malicious URLs against which it is possible to check, quickly and accurately, whether it is likely that the one you have just clicked on is safe or not.
Web addresses newly found to be malicious can be added to the blacklist. Since there are now over a billion websites, and more being added daily, keeping track of malicious sites is a big data problem.
A related example is that of malicious email messages, which may be spam or may contain phishing attempts. A Bloom filter provides us with a quick way of checking each email address and hence we would be able to issue a timely warning if appropriate.
Each address occupies approximately 20 bytes, so storing millions of them and checking each one individually becomes prohibitively slow, given that we need to do the check very quickly—by using a Bloom filter we are able to reduce the amount of stored data dramatically. We can see how this works by following the process of building a small Bloom filter and showing how it would function.
(Figure: a Bloom filter for malicious email addresses.)
So, how do we use such an array as a Bloom filter? The array is built by passing each known malicious address through several hash functions and setting the resulting array positions to 1. Suppose, now, that we receive an email and wish to check whether its address appears on the malicious email address list. Suppose it maps to positions 2 and 7, both of which have value 1. Because all the values returned are equal to 1, it probably belongs to the list and so is probably malicious.
We cannot say for certain that it belongs to the list because positions 2 and 7 have been the result of mapping other addresses and indexes may be used more than once. So the result of testing an element for list membership also includes the probability of returning a false positive.
However, if an array index with value 0 is returned by any hash function (and, remember, there would generally be seventeen or eighteen functions) we would then definitely know that the address was not on the list.
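A toy version of such a filter can be written in a few lines of Python. This sketch uses salted SHA-256 digests as its hash functions and a deliberately tiny bit array; real deployments use millions of bits and tune the number of hash functions to the expected list size.

```python
import hashlib

class BloomFilter:
    """A small Bloom filter: an array of m bits and k hash functions.
    Sizes here are kept tiny for clarity."""

    def __init__(self, m=64, k=4):
        self.m, self.k = m, k
        self.bits = [0] * m

    def _positions(self, item):
        # Derive k array positions from k salted SHA-256 digests.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # True may be a false positive; False is always definite.
        return all(self.bits[pos] for pos in self._positions(item))

blacklist = BloomFilter()
blacklist.add("spam@badactor.example")
print(blacklist.might_contain("spam@badactor.example"))  # True
print(blacklist.might_contain("friend@example.org"))     # almost certainly False
```

Note the asymmetry the text describes: an address that was added always tests positive, while a “no” answer (any position still 0) is guaranteed correct.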
The mathematics involved is complex but we can see that the bigger the array the more unoccupied spaces there will be and the less chance of a false positive result or incorrect matching.
Obviously, the size of the array will be determined by the number of keys and hash functions used, but it must be big enough to allow a sufficient number of unoccupied spaces for the filter to function effectively and minimize the number of false positives.
Bloom filters are fast and they can provide a very useful way of detecting fraudulent credit card transactions. The filter checks to see whether or not a particular item belongs to a given list or set, so an unusual transaction would be flagged as not belonging to the list of your usual transactions.
For example, if you have never purchased mountaineering equipment on your credit card, a Bloom filter will flag the purchase of a climbing rope as suspicious.
On the other hand, if you do buy mountaineering equipment, the Bloom filter will identify this purchase as probably acceptable but there will be a probability that the result is actually false.
Bloom filters can also be used for filtering email for spam. Spam filters provide a good example since we do not know exactly what we are looking for—often we are looking for patterns, so if we want email messages containing the word ‘mouse’ to be treated as spam we also want variations like ‘m0use’ and ‘mou$e’ to be treated as spam.
In fact, we want all possible, identifiable variations of the word to be identified as spam. It is much easier to filter everything that does not match with a given word, so we would only allow ‘mouse’ to pass through the filter.
Bloom filters are also used to speed up the algorithms used for Web query rankings, a topic of considerable interest to those who have websites to promote.
Lossless data compression
In 2017, the widely respected International Data Corporation (IDC) estimated that the digital universe totaled a massive 16 zettabytes (ZB), which amounts to an unfathomable 16 × 10²¹ bytes.
Ultimately, as the digital universe continues to grow, questions concerning what data we should actually save, how many copies should be kept, and for how long will have to be addressed.
However, with the huge amounts of data being stored, data compression has become necessary in order to maximize storage space.
There is considerable variability in the quality of the data collected electronically and so before it can be usefully analyzed it must be pre-processed to check for and remedy problems with consistency, repetition, and reliability. Consistency is clearly important if we are to rely on the information extracted from the data.
Removing unwanted repetitions is good housekeeping for any dataset, but with big datasets, there is the additional concern that there may not be sufficient storage space available to keep all the data.
Data is compressed to reduce redundancy in videos and images and so reduce storage requirements and, in the case of videos, to improve streaming rates.
There are two main types of compression—lossless and lossy. In lossless compression, all the data is preserved and so this is particularly useful for text.
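A quick demonstration of the lossless property using Python’s built-in zlib module (the DEFLATE algorithm behind gzip and PNG): repetitive text shrinks dramatically, and decompression restores every byte exactly.

```python
import zlib

# Lossless compression in action: highly repetitive text compresses well,
# and the original is recovered byte for byte.
text = b"big data " * 200          # 1,800 bytes of very repetitive text
packed = zlib.compress(text)

print(len(text), "->", len(packed), "bytes")
assert zlib.decompress(packed) == text   # every byte recovered: lossless
```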
Lossy data compression
In comparison, sound and image files are usually much larger than text files and so another technique called lossy compression is used. This is because, when we are dealing with sound and images, lossless compression methods may simply not result in a sufficiently high compression ratio for data storage to be viable.
Equally, some data loss is tolerable for sound and images. Lossy compression exploits this latter feature by permanently removing some data in the original file so reducing the amount of storage space needed. The basic idea is to remove some of the detail without overly affecting our perception of the image or sound.
For example, consider a black and white photograph, more correctly described as a greyscale image, of a child eating an ice-cream at the seaside. Lossy compression removes an equal amount of data from the image of the child and that of the sea.
The percentage of data removed is calculated such that it will not have a significant impact on the viewer’s perception of the resulting (compressed) image—too much compression will lead to a fuzzy photo. There’s a trade-off between the level of compression and quality of the picture.
If we want to compress a greyscale image, we first divide it into blocks of 8 pixels by 8 pixels. Since this is a very small area, all the pixels are generally similar in tone.
This observation, together with knowledge about how we perceive images, is fundamental to lossy compression.
Each pixel has a corresponding numeric value between 0 for pure black and 255 for pure white, with the numbers between representing shades of grey.
After some further processing using a method called the Discrete Cosine Transform, an average intensity value for each block is found, and the result is compared with each of the actual values in that block.
Since the pixels within a block are similar in tone, most of these differences from the average will be 0, or 0 when rounded. Our lossy algorithm collects all these 0s together; they represent the information from the pixels that matters least to the image.
These values, corresponding to high frequencies in our image, are all grouped together and the redundant information is removed, using a technique called quantization, resulting in compression.
For example, if out of sixty-four values, each requiring 1 byte of storage, twenty are 0s, then after compression we need only 45 bytes: 44 bytes for the remaining values plus 1 byte recording the run of twenty 0s. This process is repeated for all the blocks that make up the image, and so redundant information is removed throughout.
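The block arithmetic just described can be sketched numerically. The 8 by 8 block values and the rounding threshold below are invented for illustration, and the real Discrete Cosine Transform step is deliberately skipped; the point is only to show how rounding small differences to zero creates redundancy that can be stored compactly.

```python
# A simplified numeric sketch of the block-based idea described above.
# This is NOT the real JPEG pipeline; it only shows how rounding small
# differences to zero lets a run of zeros be replaced by a single count byte.

# A hypothetical 8x8 block of greyscale pixel values (0 = black, 255 = white).
block = [200, 201, 199, 200, 202, 198, 200, 201] * 8  # 64 similar pixels

average = round(sum(block) / len(block))
differences = [p - average for p in block]

# Quantization: differences too small to matter perceptually become 0.
threshold = 1
quantized = [d if abs(d) > threshold else 0 for d in differences]

zeros = quantized.count(0)
# Store the non-zero values plus one byte recording the number of dropped zeros.
compressed_size = (len(block) - zeros) + 1
print(f"{zeros} zeros -> {compressed_size} bytes instead of {len(block)}")
```

Raising the threshold discards more detail and shrinks the file further, which is exactly the compression-versus-quality trade-off described above.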
For color images, the JPEG (Joint Photographic Experts Group) algorithm, for example, recognizes red, blue, and green, and assigns each a different weight based on the known properties of human visual perception. Green is weighted greatest since the human eye is more sensitive to green than to red or blue.
Each pixel in a color image is assigned a red, blue, and green weighting, represented as a triple <R, G, B>. For technical reasons, <R, G, B> triples are usually converted into another triple, <Y, Cb, Cr>, where Y represents the brightness (luminance) and Cb and Cr are chrominance values, which describe the actual color.
A mathematical algorithm can then reduce the data stored for each pixel, ultimately achieving lossy compression by reducing the amount of pixel information saved.
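As a concrete illustration, the standard JFIF formulas (ITU-R BT.601 full-range coefficients, as used in JPEG files) convert an <R, G, B> triple to <Y, Cb, Cr>. Note how green receives the largest weight in the luminance Y:

```python
# The <R, G, B> to <Y, Cb, Cr> conversion used in JPEG files
# (JFIF / ITU-R BT.601 full-range coefficients). Y carries the brightness
# the eye is most sensitive to; Cb and Cr carry the colour information,
# which can be stored more coarsely with little perceived quality loss.

def rgb_to_ycbcr(r, g, b):
    y  =  0.299 * r + 0.587 * g + 0.114 * b      # green weighted greatest
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return round(y), round(cb), round(cr)

# Pure green contributes the most to Y, reflecting its greater weight.
print(rgb_to_ycbcr(0, 255, 0))
print(rgb_to_ycbcr(255, 255, 255))  # white: maximum Y, neutral chrominance
```

Separating luminance from chrominance in this way is what allows JPEG to compress the colour channels more aggressively than the brightness channel.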
Multimedia files in general, because of their size, are compressed using lossy methods. The more compressed the file, the poorer the reproduction quality, but because some of the data is sacrificed, greater compression ratios are achievable, making the file smaller.
Following an international standard for image compression first published in 1992 by the JPEG, the JPEG file format provides the most popular method for compressing both color and greyscale photographs. The group is still very active and meets several times a year.
Consider again the example of a black and white photograph of a child eating an ice-cream at the seaside. Ideally, when we compress this image we want the part featuring the child to remain sharp, so in order to achieve this, we would be willing to sacrifice some clarity in the background details.
A newer method, called data warping compression, developed by researchers at the Henry Samueli School of Engineering and Applied Science, UCLA, makes this possible.
Big data analytics
Having discussed how big data is collected and stored, we can now look at some of the techniques used to discover useful information from that data such as customer preferences or how fast an epidemic is spreading.
Big data analytics, the catch-all term for these techniques, is changing rapidly as the size of the datasets increases and classical statistics makes room for this new paradigm.
Hadoop provides a means for storing big data through its distributed file system. As an example of big data analytics, we’ll look at MapReduce, which is a distributed data processing system and forms part of the core functionality of the Hadoop Ecosystem. Amazon, Google, Facebook, and many other organizations use Hadoop to store and process their data.
A popular way of dealing with big data is to divide it up into small chunks and then process each of these individually, which is basically what MapReduce does by spreading the required calculations or queries over many, many computers.
It is well worth working through a much-simplified example of how MapReduce works; since we are doing it by hand, it really will need to be a considerably reduced example, but it will still demonstrate the process that would be used for big data.
Typically, many thousands of processors would be used to process a huge amount of data in parallel, but the process is scalable, and the idea is as ingenious as it is simple to follow.
There are three parts to this analytics model: the map component, the shuffle step, and the reduce component. The map component is written by the user and sorts the data we are interested in, emitting it as key-value pairs.
The shuffle step, which is part of the main Hadoop MapReduce code, then groups the data by key. Finally, the reduce component, again provided by the user, aggregates these groups and produces the result, which is then sent to the HDFS (Hadoop Distributed File System) for storage.
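The steps above can be worked through by hand in a few lines of Python. This toy word count only illustrates the map, shuffle, and reduce pattern; real Hadoop distributes each of these same steps across many machines:

```python
# A hand-worked sketch of the map -> shuffle -> reduce pattern, counting
# words across several "documents". Each step is a plain Python function
# so the flow is easy to follow.

from collections import defaultdict

documents = ["big data big ideas", "big ideas small data"]

# Map: user-written; emit a (key, value) pair for each word.
def map_step(doc):
    return [(word, 1) for word in doc.split()]

# Shuffle: framework-provided; group all emitted values by key.
def shuffle_step(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce: user-written; aggregate each group into a final result.
def reduce_step(groups):
    return {key: sum(values) for key, values in groups.items()}

mapped = [pair for doc in documents for pair in map_step(doc)]
counts = reduce_step(shuffle_step(mapped))
print(counts)  # {'big': 3, 'data': 2, 'ideas': 2, 'small': 1}
```

Because each document is mapped independently and each key is reduced independently, both steps can run on different machines at once, which is precisely what makes the approach scale to big data.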
The role of IT (Information Technology) in the Big Data area is fundamental, and the advances that have occurred in this context made the arrival of this new data-driven era possible. New Big Data technologies are enabling large-scale analysis of varied data at unprecedented velocity and scale.
Typical sources of Big Data can be classified from the perspective of how they were generated, as follows:
User-generated content (UGC) e.g. blogs, tweets and forum content;
Transactional data generated by large-scale systems e.g. weblogs, business transactions, and sensors;
Scientific data from data-intensive experiments e.g. celestial data or genomes;
Internet data that is collected and processed to support applications;
Graph data, composed of an enormous number of nodes of information and the relationships between them.
In the Information Technology industry as a whole, the speed at which Big Data appeared generated new issues and challenges in data and analytics management.
Big Data technology aims to minimize hardware requirements and processing costs. Conventional data technologies, such as databases and data warehouses, are becoming inadequate for the amount of data to be analyzed.
Big Data is driving a paradigm change in data architecture: rather than bringing data to centralized servers in the traditional way, organizations are pushing computation out to where the distributed data resides.
The need for Big Data analysis boosted the development of new technologies. To permit the processing of so much data, new technologies emerged, such as MapReduce from Google and its open-source equivalent Hadoop, launched by Yahoo.
MapReduce technology enables approaches that handle large volumes of data using a large number of processors, thereby addressing some of the problems caused by volume and velocity.
Apache Hadoop is one of the software platforms that support distributed, data-intensive applications and implement MapReduce. Hadoop is an open-source project hosted by the Apache Software Foundation; it consists of several subprojects and belongs to the infrastructure category of distributed computing.
The role of IT in making information flows available to create competitive advantage has been identified and summarized in six components:
Add volume and growth through the improvement or development of products and services, channels, or clients;
Differentiate, or increase willingness to pay;
Optimize risks and operations;
Improve industry structure, innovate with products or services, and generate and make available knowledge and other resources and competencies;
Transform business models and processes to stay relevant as the scenario changes.
Cloud computing is a key component of Big Data, not only because it provides infrastructure and tools, but also because it offers a business model that Big Data Analytics can follow, being delivered as a service (Big Data as a Service, or BDaaS).
However, it also brings many challenges. An extensive review of academic papers about Big Data found the following technologies to be the most cited by authors, in order of relevance: Hadoop/MapReduce, NoSQL, In-Memory, Stream Mining, and Complex Event Processing.
Is it possible to predict whether a person will get some disease 24 hours before any symptoms appear? It is generally considered that the healthcare system is one of the sectors that will benefit the most from the existence of Big Data Analytics. Let us explore this in the following lines.
There is a certain consensus that the challenges faced by the healthcare sector include the inadequate integration of healthcare systems and poor healthcare information management.
The healthcare sector, in general, amasses a large amount of information, which nonetheless results today in unnecessary increases in medical costs and time for both healthcare service providers and patients.
Researchers and hospital managers alike are thus interested in how this information could be used instead to deliver a high-quality patient experience, while also improving organizational and financial performance and meeting future market needs.
Big Data Analytics can support evidence-based decision-making and action taking in healthcare.
In this sense, one study found that only 42% of the healthcare organizations surveyed supported their decision-making process with Big Data Analytics, and only 16% actually had the necessary skills and experience to use it. The value that can be generated thus goes far beyond what is created today.
But beyond improving profits and cutting down on operating costs, Big Data can help in other ways, such as curing disease or detecting and predicting epidemics. Big Data Analytics can help to collect and analyze the health data that is constantly being generated for faster responses to individual health problems; ultimately, for the betterment of the patient.
It is now a well-known fact that, with the help of Big Data Analytics, real-time datasets have been collected, modeled, and analyzed, which has helped speed up the development of new flu vaccines and identify and contain viral outbreaks such as Dengue fever and even Ebola.
Furthermore, we can only imagine for now what we would be able to do if, for example, we were to collect all the data created during every single doctor's appointment.
A variety of activities happen during routine medical examinations that are not necessarily recorded, especially if the results turn out to be within the expected parameters (in other words, if the patient turns out to be healthy):
The doctor will take the patient's body temperature and blood pressure, look into the eyes to see the retina lining, use an otoscope to look into the ears, listen to the heartbeat for regularity, listen to the breathing in the lungs, and so on. All these data could help us understand as much about a patient as possible, as early in his or her life as possible.
This, in turn, could help identify warning signs of illness well in advance, preventing further advancement of the disease, increasing the odds of successful treatment, and ultimately reducing the associated expenses.
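As a toy sketch of how such routinely recorded measurements could be screened automatically, consider the following. The reference ranges and readings below are invented for illustration only and are not clinical guidance:

```python
# A hypothetical sketch of screening routine check-up measurements.
# The reference ranges are illustrative assumptions, NOT clinical guidance.

NORMAL_RANGES = {
    "temperature_c": (36.1, 37.2),  # body temperature in Celsius
    "systolic_bp": (90, 120),       # systolic blood pressure in mmHg
    "heart_rate": (60, 100),        # resting heart rate in beats per minute
}

def flag_warnings(readings):
    """Return the measurements that fall outside their reference range."""
    warnings = {}
    for name, value in readings.items():
        low, high = NORMAL_RANGES[name]
        if not (low <= value <= high):
            warnings[name] = value
    return warnings

# A made-up visit record: only the raised temperature should be flagged.
visit = {"temperature_c": 38.4, "systolic_bp": 118, "heart_rate": 72}
print(flag_warnings(visit))
```

Real systems would of course compare each patient against his or her own history rather than fixed thresholds, which is exactly where the large, longitudinal datasets discussed above become valuable.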
Now, to some extent, collecting this kind of granular data about an individual is possible due to smartphones, dedicated wearable devices, and specialized apps, which can collect data, for example, on how many steps a day a person walks and on the number of daily calories consumed, among others. But the higher value that these datasets hold is yet to be explored.
Psychiatry is a particular branch of medicine that could further benefit from the use of Big Data Analytics and research studies to address such matter have just recently started to emerge.
It is well known that in psychiatry there are treatments proven to be successful, but what generally cannot be predicted is who they are going to work for; we cannot yet predict a given patient's response to a specific treatment.
What this means, in practical terms, is that most of the time a patient has to go through trials with various medicines before finding the one that works best for him or her.
Researchers have stressed the importance of Big Data and robust statistical methodologies in treatment prediction research, and in so doing have advocated for the use of machine-learning approaches beyond exploratory studies and toward model validation. The practical implications of such endeavors are rather obvious.
The healthcare industry, in general, has not yet fully understood the potential benefits to be gained from Big Data Analytics. It has further been noted that most of the potential value creation is still in its infancy, as predictive modeling and simulation techniques for analyzing healthcare data as a whole have not yet been developed.
Today, one of the biggest challenges for healthcare organizations is represented by the missing support infrastructure needed for translating analytics-derived knowledge into action plans, a fact that is particularly true in the case of developing countries.
Traffic congestion and parking unavailability are two major sources of traffic inefficiency worldwide. But what if we could change all that? What if we could predict traffic jams hours before they actually take place and use that information to reach our destinations in less time?
What if we could immediately find an available parking space and avoid considerable frustration? Transportation is another sector that can greatly benefit from Big Data. A huge amount of data is being created, for example, by the sat navs installed in vehicles, as well as by sensors embedded in infrastructure.
But what has been achieved so far with Big Data Analytics in transportation? One example is the ParkNet system (“ParkNet at Rutgers”, n/a), a wireless sensing network developed in 2010 that detects and provides information about open parking spaces.
The way it works is that a small sensor is attached to the car, and an onboard computer collects the data, which is uploaded to a central server and then processed to determine parking availability.
Another example is VTrack, a system for travel time estimation using sensor data collected by mobile phones, which addresses two key challenges: reducing energy consumption and obtaining accurate travel time estimates. In the words of the authors themselves:
Real-time traffic information, either in the form of travel times or vehicle flow densities, can be used to alleviate congestion in a variety of ways: for example, by informing drivers of roads or intersections with large travel times (“hotspots”);
by using travel time estimates in traffic-aware routing algorithms to find better paths with smaller expected time or smaller variance; by combining historical and real-time information to predict travel times in specific areas at particular times of day;
by observing times on segments to improve operations (e.g., traffic light cycle control), plan infrastructure improvements, assess congestion pricing and tolling schemes, and so on.
A third example is VibN, a mobile sensing application capable of exploiting multiple sensor feeds to explore the live points of interest of drivers. Not only that, but it can also automatically determine a driver’s personal points of interest.
Lastly, another example is the use of sensors embedded in the car to predict when the car is likely to break down.
A change in the sound emitted by the engine, or in the heat generated by certain parts of the car: all these data and much more could be used to predict an increased likelihood of a breakdown and allow the driver to take the car to a mechanic before it actually fails. This is something that Big Data, Big Data Analytics, and the associated technologies make possible.
To sum up, the Big Data age presents opportunities to use traffic data not only to solve a variety of existing problems, such as traffic congestion and equipment faults, but also to predict them before they actually happen. Big Data Analytics can thus be used for better route planning, traffic monitoring and management, and logistics, among others.