Defining data: what is data?
Everything, every process, every sensor, will soon be driven by data. This will dramatically change the way in which business is carried out.
Ten years from now, I predict, every employee of every organization in the world will be expected to have a level of data literacy – to be able to work with data and derive insights that add value to the business.
Think about the last film you saw at the cinema. How did you first hear about it? You might have clicked on the trailer when YouTube recommended it to you, or it may have appeared as an advertisement before YouTube showed you the video you actually wanted to see.
You may have seen a friend sing its praises on your social network, or had an engaging clip from the film interrupt your newsfeed.
If you’re a keen moviegoer, it could have been picked out for you on an aggregate movie website as a film you might enjoy.
Even outside the comfort of the internet, you may have found an advertisement for the film in your favorite magazine, or you could have taken an idle interest in the poster on your way to that coffeehouse with the best Wi-Fi.
None of these touchpoints was coincidental. The stars didn’t just happen to align for you and the film at the right moment. Let’s leave the idealistic serendipity to the onscreen encounters.
What got you into the cinema was less a desire to see the film and more of a potent concoction of data-driven evidence that had marked you out as a likely audience member before you even realized you wanted to see the film.
When you interacted with each of these touchpoints, you left a little bit of data about yourself behind. We call this ‘data exhaust’. It isn’t confined to your online presence, nor is it only for the social media generation. Whether or not you use social media platforms, whether you like it or not, you’re contributing data.
It has always been this way; we’ve just become better at recording and collecting it. Any number of your day-to-day interactions stands to contribute to this exhaust.
On your way to the London Underground, CCTV cameras are recording you. Hop onto the Tube, and you’re adding to Transport for London’s statistical data about peak times and usage.
When you bookmark or highlight the pages of a novel on your Kindle, you are helping distributors to understand what readers particularly enjoyed about it, what they could put in future marketing material and how far their readers tend to get into the novel before they stop.
When you finally decide to forgo the trials and punishments of public transport and instead drive your car to the supermarket, the speed you’re going is helping GPS services to show their users in real time how much traffic there is in an area.
And it also helps your car gauge how much more time you have left before you’ll need to find a petrol station.
And today, when you emerge from these touchpoints, the data you leave behind is swept up and added to a blueprint about you that details your interests, actions, and desires.
But this is only the beginning of the data store. This blog will teach you about how to define data and what data is. You will learn the essential concepts you need to be on your way to mastering data science, as well as the key definitions, tools, and techniques that will enable you to apply data skills to your own work.
This blog will broaden your horizons by showing you how data science can be applied to areas in ways that you may previously have never thought possible.
I’ll describe how data skills can give a boost to your career and transform the way you do business – whether that’s through impressing top executives with your ideas or even starting up on your own.
Data is everywhere
Before we move any further, we should clarify what we mean by data. When people think of data, they think of it being actively collected, stashed away in databases on inscrutable corporate servers and funneled into research. But this is an outdated view. Today, data is much more ubiquitous.
Quite simply, data is any unit of information. It is the by-product of any and every action, pervading every part of our lives, not just within the sphere of the internet, but also in history, place, and culture. A cave painting is data. A chord of music is data. The speed of a car is data. A ticket to a football match is data.
A response to a survey question is data. A blog is data, as is a post within that blog, as is a word in that post, as is a letter within that word. It doesn’t have to be collected for it to be considered data.
It doesn’t have to be stored in a vault of an organization for it to be considered data. Much of the world’s data probably doesn’t (yet) belong to any database at all.
Given this definition of data as a unit of information, we can also say that data is the tangible past. This is quite profound when you think about it. Data is the past, and the past is data.
A record of such units of data is called a database, and data scientists can use it to better understand our present and future operations.
They’re applying the very same principle that historians have been telling us about for ages: we can learn from history. We can learn from our successes – and our mistakes – in order to improve the present and future.
The only aspect of data that has dramatically changed in recent years is our ability to collect, organize, analyze and visualize it in contexts that are only limited by our imagination.
Wherever we go, whatever we buy, whatever interests we have, this data is all being collected and remodeled into trends that help advertisers and marketers push their products to the right people, that show the government people’s political leanings according to their borough or age, and that help scientists create AI technologies that respond to complex emotions, ethics, and ideologies, rather than simple queries.
All things considered, you might start to ask what the limits to the definition of data are. Does factual evidence about a plant’s flowering cycle (quantitative data) count as data as much as the scientist’s recording of the cultural stigma associated with giving a bunch to a dying relative in the native country (qualitative data)?
The answer is yes. Data doesn’t discriminate. It doesn’t matter whether the unit of information collected is quantitative or qualitative.
Qualitative data may have been less usable in the past when the technology wasn’t sophisticated enough to process it, but thanks to advancements in the algorithms capable of dealing with such data, this is quickly becoming a thing of the past.
To define data by its limits, consider again that data is the past. You cannot get data from the future unless you have managed to build a time machine. But while data can never be the future, it can generate insights and predictions about it. And it is precisely data’s ability to fill in the gaps in our knowledge that makes it so fascinating.
Big (data) is beautiful
Now that we have a handle on what data is, we need to shake up our understanding of where and how it actually gets stored.
We have already shown our wide-reaching potential for emitting data (that’s our data exhaust) and have explained that, in its being a unit of information, there is a very broad concept of what we understand as being data. So once it is out there, where does it all go?
By now, you’re likely to have heard the term ‘big data’. Put very simply, big data is the name given to datasets with columns and rows so considerable in number that they cannot be captured and processed by conventional hardware and software within a reasonable length of time.
For that reason, the term is dynamic – what was considered big data in 2015 may no longer be thought of as such in 2020, because by then technology will have been developed to tackle its magnitude with ease.
The 3 Vs
To give a dataset the label of big data, at least one of three requirements must be fulfilled:
Its volume – which refers to the size of the dataset (eg the number of its rows) – must be in the billions;
Its velocity – which considers how quickly the data is being gathered (such as online video streaming) – means that the data is generated too rapidly to be adequately processed with conventional methods; and
Its variety – this refers to either the diversity of the type of information contained in a dataset such as text, video, audio or image files (known as unstructured data) or a table that contains a significant number of columns which represent different data attributes.
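As a toy illustration, the three checks above can be sketched in code. Every threshold below is an invented assumption for the sake of the sketch – there is no industry-standard cutoff for any of the Vs:

```python
def big_data_labels(n_rows, rows_per_second, n_columns, is_unstructured,
                    max_rows=1_000_000_000,       # volume threshold (assumed)
                    max_rows_per_second=100_000,  # velocity threshold (assumed)
                    max_columns=1_000):           # variety threshold (assumed)
    """Return which of the 3 Vs a dataset satisfies under the assumed thresholds."""
    labels = []
    if n_rows >= max_rows:
        labels.append("volume")
    if rows_per_second >= max_rows_per_second:
        labels.append("velocity")
    if is_unstructured or n_columns >= max_columns:
        labels.append("variety")
    return labels

# A two-billion-row structured table qualifies on volume alone:
print(big_data_labels(2_000_000_000, 10, 50, False))  # → ['volume']
```

Under this sketch, any non-empty result is enough to earn the dataset the ‘big data’ label, matching the rule that only one of the three requirements needs to be fulfilled.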
Big data has been around for much longer than we’d care to imagine – it’s just that the term for it didn’t appear until the 1990s. We’ve been sitting on big data for years, in all manner of disciplines. So I have to break it to you: big data is not big news. It’s certainly not a new concept.
Many if not all of the world’s largest corporations have mammoth stores of data that have been collected over a long period of time on their customers, products, and services.
Governments store human data, from censuses to surveillance and habitation. Museums store cultural data, from artifacts and collector profiles to exhibition archives.
In short, if you just can’t work with it using standard tools, you can call it big data. When data scientists use the term, they don’t use it loosely: it draws attention to the fact that standard methods of analyzing the dataset in question are not sufficient.
Why all the fuss about big data?
You might think it strange that we have only just started to realize how powerful data can be. But while we have been collecting data for centuries, what stopped us in the past from lassoing it all into something beneficial to us was the lack of technology. After all, it’s not how big the data is that matters; it’s what you do with it.
Any data, ‘big’ or otherwise, is only useful to us if it can be mined for information, and before the technology was developed to help us analyze and scale this data, its usefulness could only be measured by the intellectual capability of the person wrangling the data.
But big data requires a faster and more powerful processor than the human brain to sort it. Before the technological developments of the 20th century, data was stored on paper, in archives, in libraries, and in vaults.
Now, almost all new data we capture is stored in a digital format (and even old data is being actively converted to digital, as evidenced by the sheer amount of resources being funneled into such digital collection projects as the Europeana Collections and the Google Books project).
Storing and processing data
With the advent of the computer came the possibility of automating the process of data storage and processing. But large datasets bogged down the early machines; scientists working with electronic datasets in the 1950s would have had to wait for hours before a simple task was finally churned out.
These scientists soon came to the conclusion that, in order to process large sets of data properly – to make connections between elements and to use those connections to make accurate and meaningful predictions – they would need to build information carriers that could both manage the data and handle its storage.
Sure enough, as the technology behind computing improved, so did computers’ storage and processing capacities. And in the last 70 years, not only have we been able to store information in significantly more efficient ways, we have also been able to make that information portable.
The same information that would in the 1970s have only fitted on 177,778 floppy disks could all fit on a single flash drive by the 2000s. Today, you can store all that and more in the cloud (a storage facility with a virtualized infrastructure that enables you to view your personal files from anywhere in the world).
Just bear in mind that the next time you access personal documents from your local library or place of work – or simply on your mobile device – you’re effectively doing what would have required carrying over 100,000 floppy disks in the 1970s.
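The floppy-to-flash comparison is simple arithmetic. Assuming the familiar 1.44 MB capacity of a 3.5-inch disk (an assumption – the text does not specify the disk format), the quoted figure works out to roughly a 256 GB flash drive:

```python
FLOPPY_MB = 1.44        # capacity of one 3.5-inch floppy disk (assumed format)
N_FLOPPIES = 177_778    # figure quoted in the text

total_gb = N_FLOPPIES * FLOPPY_MB / 1000  # decimal gigabytes
print(f"{total_gb:.0f} GB")  # → 256 GB
```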
When these new technologies made the storing of data easier, researchers started to turn their attention to how this stored data could actually be used.
How did we start to create order from chaos? Let’s return to our earlier example of the last film you watched at the cinema.
You were probably cherry-picked to see that film not by an insightful marketer poring over the evidence but instead by a clever machine that looked at your data exhaust and matched it against the demographics that it found were the most likely to watch – and enjoy – that film.
That might sound novel, but as we have already established, data and its (manual) processing have been around for a long time.
Some of Hollywood’s production houses were gathering data as early as the 1950s about what their audience wanted to see, from actor to director to genre, before slicing and dicing that information into respondent demographics that corresponded to age, location, and gender. Even at that time, people were making potentially game-changing decisions through what the data told them.
Data can generate content
So, what if, after all the clever data-driven evidence, you ended up hating the film you last saw at the cinema? Well, data might not be able to predict everything, but it certainly got you in the seat. Data might sometimes get a C for achievement, but it always gets an A for effort.
And the former is being worked on. Rather than attaching the right audience demographic to a new film or television series, production companies are now finding ways to respond by using audience data to make informed decisions about their entertainment output.
Effecting this movement requires more data. For that reason, data collection does not stop once you have watched the film that was picked for you to see.
Any follow-up responses that you make on social media, through e-mail or through changing your viewing habits online will generate a fresh set of data about you, ‘the moviegoer’, that will sharpen and tailor any future recommendations before finally subdividing the demographics of which you are a part.
So, as you transition from that emo teen only interested in dystopian zombie flicks into the sophisticated surrealism buff who everyone avoids at cocktail parties, your data will move along with you and adapt to those fluctuating preferences.
As a nota bene: the even better news is that data will not deny you your interests. If you’re only playing at being the connoisseur but still enjoy a trashy zombie movie once the curtains are drawn, your data will continue to keep that secret enthusiasm of yours fed.
Of course, the flip side of the coin here is that your data can spill the beans on those preferences. Be aware that data is a record of your actions – it will not lie on your behalf.
Some people will even go to great lengths to hide their ‘actual’ data footprint on digital music service sites by vanity playing, that is, starting up an album of music that they consider to have social cachet and then leaving their desk while the album plays through, so that their historical data will show other users a skewed version of what they enjoy.
In my view, these people have far too much time on their hands, but manipulating data is nevertheless an important topic, and one that we shall return to in due course.
Entertainment company Netflix’s House of Cards first proved to the industry just how powerful data can be not only in reaching out to the right audience for specific types of content but also in driving the actual production of content.
The 2013 political drama series was an early experiment in how data can be applied to produce hit shows.
In the lead-up to House of Cards’ production, Netflix had been gathering data from its users. This data included users’ viewing habits, and those insights allowed Netflix to group its video content into diverse and surprising categories in accordance with the data.
These categories were hidden from public view within its interface but were nevertheless exploited by the company to direct the right kind of film to the right kind of audience.
When the details of their subcategories were revealed online some years ago, the internet was abuzz.
To give you a feel for just how specific Netflix became, among the subcategories were ‘Exciting Horror Movies from the 1980s’, ‘Feel-good Education & Guidance starring Muppets’, ‘Showbiz Dramas’, ‘Goofy Independent Satires’, ‘Irreverent Movies Based on Real Life’, ‘Cerebral Foreign War Movies’, ‘Steamy Thrillers’ and ‘Critically Acclaimed Dark Movies Based on Books’. That’s some specialist viewing habits.
But Netflix had found a significant audience for each one of these categories, and for many more besides.
Eventually, Netflix’s data scientists started to see overlap in their audience’s viewing patterns. It appeared that there was a significant number of Netflix subscribers who enjoyed both Kevin Spacey’s body of work and gritty political dramas.
The rest – updating the original 1990s House of Cards and putting Kevin Spacey in the lead role – is history (or is it data?).
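The kind of overlap Netflix’s analysts spotted can be illustrated with a basic set intersection. The subscriber IDs and group sizes below are invented purely for the sake of the sketch:

```python
# Hypothetical sets of subscriber IDs (invented data)
kevin_spacey_fans = {"u1", "u2", "u3", "u4", "u5"}
political_drama_fans = {"u3", "u4", "u5", "u6"}

# Subscribers who appear in both audiences
overlap = kevin_spacey_fans & political_drama_fans

# Share of all viewers who sit in both groups (Jaccard index)
share = len(overlap) / len(kevin_spacey_fans | political_drama_fans)

print(sorted(overlap))  # → ['u3', 'u4', 'u5']
print(round(share, 2))  # → 0.5
```

A large intersection relative to the union is exactly the kind of signal that would suggest one show could serve both audiences at once.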
Riding the wave of success
Netflix was absolutely right to value their data: House of Cards became an award-winning and critically acclaimed series. So it came as no surprise when many of Netflix’s competitors later sought to copy this winning model.
Hadelin de Ponteves, a data science entrepreneur and my business partner, worked for a competitor of Netflix to create a similar system that could work for them:
We knew that Netflix already had a powerful recommendation system in place, and so the pressure on us as data scientists and systems developers was not to emulate the same system for our company but rather to find where we could bring about a difference.
We realized that to develop a truly interesting system, we would need to do more than develop a tool to recommend movies that fit a known demographic.
We also wanted to find an algorithm that could suggest movies that might initially take users out of their comfort zones but that they would nevertheless find enjoyable. We really wanted to get that element of surprise in there.
Some might complain that this approach to using data to drive creative content is actually killing off creativity. To that, I would answer that the data only follows what people want.
This is desirable for any industry: to show the right audience at the right time and in the right place the relevant content to entice them to buy into their service.
Data has in this way made industries more democratic. Because while the machines might start to drive our purchases, we still hold the most valuable information: human desire. The machines are not telling us what we want; they are making connections for us that we could not possibly have made ourselves.
Data is not telling people to go out and watch superhero movies and not to watch French surrealist films; it is listening to what people want and enjoy. If you believe that there is a problem with creativity being stifled, it is not the fault of data – it is a fault in our society. I cannot emphasize enough that data is the past. It is merely a record of information.
If you do want more French surrealist films, then make sure you go and see them – and make sure you are vocal about them afterward. It might seem as though you’re just adding to the noise of the internet.
But this noise is swiftly being rounded up and mined for use elsewhere. Thanks to data, this is an age where our voices can actually get heard and have real power – so why not make good use of it?
Besides which, the models for using data have not yet been perfected. In the case of the media industry, other corporations have since taken on the Netflix concept – and some may point out that they have had varying levels of success. But again, it’s not the data that is at fault, it is the human, creative input.
After all, that is where the current limit of our ability to use data in order to produce content lies. We might be able to assess the likelihood of the number of people interested in a concept, but there is also a great deal more at stake, as the ultimate success of any form of entertainment will rest on the talent involved in its creation.
Let that be a warning to writers and directors hoping to get an easy ride by relying solely on the data: databases that show the varying success of film genres might be a useful guide to follow, but they can only remain a guide for as long as the work rests on human talent.
Why data matters now
Many are already aware of how technology is set to shake up jobs in the future. If you are feeling particularly brave, a quick Google search for ‘technological impact on jobs’ will show you there are myriad articles that speak to the likelihood of your job becoming automated.
While this information has been backed up by data, I would argue that there may have been a degree of subjectivity from the researchers when taking into account the tasks required of certain jobs.
Nevertheless, I would certainly not recommend people train to be sports umpires, for the very reason that their job rests on the data of a game – and machines will inevitably supply more accurate data to corroborate or refute any challenges made by competitors.
The umpire might be a tradition that makes the experience more personable or entertaining right now, but in my opinion, the nostalgia associated with the position doesn’t mean that it will last forever.
Even after clarifying how all-consuming data is, some may still think that data science will not affect their business for some time to come. Things take time, after all, to develop. But to think this way would be a big mistake, because it would be to deny the principle of Moore’s law.
Moore’s law is a law of predictions. Initially conceived by Intel co-founder Gordon Moore in 1965, Moore’s law first referred to the expected increase in the number of transistors (devices used to control electrical power) per square inch on integrated circuits (eg computer chips, microprocessors, logic boards) over time.
It was observed that the number of these transistors roughly doubled every two years, and the law stated that this phenomenon would only continue. To date, it has held true.
In layman’s terms, this means that if you go to your local computer store today and buy a computer for £1,000, and after two years you purchase another for £1,000 at the same store, the second machine will be twice as powerful, even though it cost the same amount.
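That doubling compounds quickly, as a back-of-the-envelope calculation of Moore’s law shows:

```python
def relative_power(years, doubling_period=2):
    """Relative computing power after `years`, assuming a doubling every `doubling_period` years."""
    return 2 ** (years / doubling_period)

for years in (2, 10, 20):
    print(f"{years} years → {relative_power(years):,.0f}x")
# 2 years → 2x, 10 years → 32x, 20 years → 1,024x
```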
Many have applied this law to the mushrooming advancements made in the field of data science.
Data science is one of the fastest-growing academic disciplines, and its practitioners are working on increasingly sophisticated ways to find novel means to capture the data, to construct cost-effective systems to store the data and to develop algorithms that turn all those chunks of big data into valuable insights.
Ever feel as though technology is moving at so fast a pace you can’t keep up? Then spare a thought for data scientists. They are playing a game of catch-up with technology that hasn’t even been invented.
Take the developments made in voice recognition as a case example. The co-founders of Siri – Dag Kittlaus, Adam Cheyer and Tom Gruber – created the intelligent personal assistant well before the technology had advanced to the point where it could actually realize their concepts and bring them to market.
Siri’s creators built algorithms and frameworks for the data they had available in order to support voice recognition technology that had not yet been invented.
What they did know, however, was that while it was not possible to operate the software with the technology available at the time, it would ultimately be possible to run Siri once the technology had been given enough time to catch up.
They were, in short, intercepting technological trends. The concept that Siri’s makers were applying to their prediction was Moore’s law.
And it’s incredibly important for data science. The law has been applied to many technological processes and is a necessary rule to consider when making business ventures and decisions.
Worrying achieves nothing
Hollywood and the entertainment industry in general have long held a dystopian idea of data and the dangers that future data abuse and exploitation can pose to humans. We only need to think of that ominous scene from 2001: A Space Odyssey – ‘Open the pod bay doors, HAL’ – where HAL, the spaceship’s AI, has become so sophisticated that it decides for itself to disobey human command in favor of its own (superior) reasoning.
Ex Machina, Her, Blade Runner, Ghost in the Shell – all of these films imagine the problems that humans may face when that technology starts to develop consciousness and second-guess us.
But there is another area that is, to me, a far more likely, far more pressing and far more insidious way in which data could be exploited, and it has much more to do with humans abusing it than the robots themselves: privacy.
Issues of privacy pervade many of our interactions online. People can choose to remain anonymous, but their information will always be collected – and used – somewhere.
Even if that data is stripped of its characteristic indicators that can be traced back to an individual, some may ask: is it right that this data is being collected at all?
Your online footprint
Readers who were users of the internet in the 1990s will be familiar with the word ‘avatar’ as referring to the rather innocent images we chose to represent ourselves on online forums. The word avatar is today used to describe something much broader.
It now means our intangible doppelgänger in the online world; a collection of data on us from the searches, choices and purchases we make online, and everything we post online, from text to images.
This kind of data is a potential goldmine of information for credit agencies and companies that aggregate it. Companies can then use these insights to sell to others.
As data science has become a more prolific discipline, questions of ethics and security surrounding data’s permeability, distortion and capture are being asked. We have very valid reasons to be concerned about the pathways that data science is opening up, and the fact that it does not discriminate in terms of who – or what – accesses this information.
While moving from paper to digital has improved many practical methods in companies, data can still go missing or deteriorate, and there is a significant human influence on data (misplacing information, losing databases, and espionage) that can have devastating consequences.
CASE STUDY The Heartbleed Bug
To my mind, the Heartbleed Bug represents the most radical violation of privacy in the world to date. This bug enabled hackers to exploit a vulnerability in source code widely used on the internet, allowing otherwise protected information sent through Secure Sockets Layer (SSL) connections to be stolen.
This loophole exposed sensitive information on purchasing sites for years before we were made fully aware of its magnitude.
In 2014, Google’s security team found this problem in the OpenSSL source code during a regular review of their services. It was discovered that some 800,000 websites globally carried this error in their source code, giving access to information thieves and hackers who knew of the vulnerability.
But during the two years leading up to its discovery, the bug went unnoticed, allowing potentially countless amounts of data to be stolen. Ironically, SSL-enabled websites (those whose addresses start with ‘https’) were supposed to be more secure than those with plain ‘http’ URLs.
Setting aside widespread speculation at the time about whether the bug had been deliberately left in place by governmental or other shadowy organizations, the fact remains that the Heartbleed Bug represented a monumental violation of privacy.
Don’t censor – educate
The inconvenient truth of data science, and indeed of any discipline where money is directly involved, is that as interest in the discipline grows, so will interest in the more nefarious means to tamper with its processes. Some might consider that to be enough of a reason to put a halt on data gathering and use.
But I see it differently, and I would wager many other data scientists feel the same: rather than censoring and restricting, we need to educate people. We must take care to tell our children that their activities carried out online will form an avatar that may be used in their favor – or against them.
We must ensure that people are generally better versed in how their data is being used, and why.
This is the world in which we live now. It will be much easier to remove yourself from this emotional attachment than to resist.
After all, the youngest generation today has already let go, and these are the new consumers that companies will approach through advertising. This is evidenced in the way that many businesses operate online, from Amazon to Outfittery.
Nowadays, consumers are willing to give their personal information in return for products and services that are better tailored to them. A quick glance at Instagram or Twitter will show you that relinquishing personal information online – to a variety of domains – may even feel like second nature to Millennials.
Unless you are planning on living off-grid in the wilderness and speaking only to the birds, cybercrime is simply another risk of living today. Fighting it will be futile; the Luddites may have violently protested against the use of machinery in the 19th century, but that changed little in the long run.
It is far less likely that we will shut down the services we all take for granted and have already integrated into our own lives, primarily because we now need these services.
Where it was once a luxury, technology has swiftly become a basic need in the way we live and work. And in order for us to continue developing this technology, we need to exploit the data.
From social media’s insistent barrage of information at all hours of the day to news sites constantly updating their pages as new information comes to them, the pace at which the world is moving, and the option for us to now watch it all happening in real time online, can feel overwhelming.
This overload of data is coming at us from all sides, and there is no way of stopping it. You cannot put a cork in a volcano and expect it not to blow.
What we can do, however, is manage and analyze it. You may have heard of ‘content curators’ and ‘aggregate websites’ such as Feedly, through which you can collect and categorize news stories from blogs and websites that are of interest to you.
These individuals and companies are working to organize the data relevant to them and to their followers or subscribers.
These attempts to manage information should give us comfort, and they are among the many options that we have to process data.
As the technology improves to help us manage and analyze our data, so will our acceptance of it as an integral part of our existence in the Computer Age. So set aside your doubts and let’s focus instead on the possibilities of data, and how it can serve to improve your life.
How data fulfills our needs

(Figure: Maslow’s hierarchy of needs)
CASE STUDY Environmental data and Green Horizon
Green Horizon, launched by IBM in 2014, is responding to the severe state of China’s air quality by ‘transforming its national energy systems and support[ing] its needs for sustainable urbanization’ (IBM, 2017a).
Green Horizon assimilates data from 12 global research labs and applies cognitive models to the gathered data in order to inform the project’s central initiative to reduce pollution.
Here, data is essential for monitoring the fluctuations in air pollution in selected areas, and for scientists to analyze the various factors that directly and indirectly affect the air’s quality, temperature and state in order to begin improving China’s physical environment.
The great benefit of these projects is that environmental data is, more often than not, publicly available and on a global scale, meaning the technological developments to help combat the issue of air pollution can move swiftly.
Having access to important datasets that bear on our most basic needs is essential for understanding how our technologies can perform better.
That is why we now have special glass panes that can be installed in buildings to allow windows to ‘breathe’, cleaning the air inside the premises and thus protecting its inhabitants. That is also why we have filters that can be put into factories to reduce their emissions and protect local residents from poisoning.
Food is another case example of how data can respond to the most basic of human needs (physiological factors on Maslow’s hierarchy).
It might be the stuff of science fiction for some, but food has been grown in laboratories for many years, and the phenomenon of 'cultured meat' is becoming increasingly sophisticated.
Silicon Valley start-up Memphis Meats is just one company that has, since its establishment, developed a variety of cultured meats, from beef to poultry.
As it is still such a grey area for regulatory authorities, religions and science, cultured meat has drawn both praise and ire from communities around the world. But whether we like it or not, cultured meat could soon be the future of what we eat.
It will become the environmentally friendly solution to the severe strains that agriculture puts on the natural world, dramatically reducing water usage and carbon emissions.
And the data we collect to produce such meat will eventually go beyond DNA capture. As food technology becomes more commonplace, additional consumer data will be exploited to address other factors such as how cultured meat might be best presented and flavored for it to be the most palatable and – crucially for the companies producing it – saleable.
Data science and safety
Once physiological needs have been met, Maslow’s hierarchy states that the need for safety (physical, financial, personal) will take precedence. Safety is the level, then, that largely includes personal health and wellbeing, and medicine is one of the most prominent disciplines on which data science is making its mark.
In the medical industry, data science is used to revolutionize the medicine we need to diagnose and cure illnesses. All medical trials are analyzed through participant data, and this gathered data can be used to inform diagnosis, recommend different practical approaches, and build new products.
The diagnosis of complex and rare illnesses puts pressure on medical practitioners to stay informed about the many different manifestations and symptoms of each disease, leaving considerable room for human error in finding the root of a problem and dealing with it efficiently.
And as more complex problems come to require specialist doctors, illnesses could go unchecked for the weeks and months that it takes patients to get an appointment with the relevant specialist.
For data scientists, the pressure is on to develop advanced machine learning algorithms to get the most accurate data. This data can be built upon, enhanced and used to predict unusual situations. What’s more, the data that is gathered is not reliant on the welfare of the data scientist (sorry).
Once medical specialists retire, their specialist knowledge goes with them. Once data scientists do the same, the algorithms they have left behind or the data they have gathered can be used to build upon existing knowledge. Data science always builds upon what has been left behind, the information of our past.
It is this ability to crowdsource data that makes the application of data science in the discipline of medicine so powerful – for as long as the data remains, the gathered knowledge will not be dependent on individuals.
CASE STUDY Diagnosing with SkinVision
There are a number of digital applications on the market that crowdsource data on a variety of things, from stars in the night sky to sunspots on your skin.
SkinVision is an application for mobile devices that helps check users’ moles for skin cancer. Using aggregated user data, SkinVision’s algorithm can determine the likelihood of a user’s mole showing malignant symptoms.
It really is as simple as taking a photo of your skin with the app, which SkinVision will then record and analyze, before recommending the next steps for you to take with your doctor.
You might think that using technology like this for diagnosis on a mobile device is flippant, but that is entirely the wrong way to think about it. As more and more data is gathered on an illness, the databases on its causes and effects will grow, making the algorithm’s ability to diagnose patients much more effective than even an experienced surgeon’s.
The more people who use a recognized digital application such as SkinVision to diagnose their condition, the more the technology will be able to distinguish the benign from the malignant – because it will have a large pool of data with which to cross-examine the user-submitted image.
Think about it: would you rather be diagnosed by a human who might have looked at 1,000 individual cases, or by a machine that has an accumulated knowledge of 1,000,000 individual cases and counting?
It is not only digital applications that are paving the way to data-driven medicine. IBM's Watson is, in their words, 'a cognitive technology that can think like a human' (IBM, 2017b). Watson entered the news when it became the first artificial intelligence to beat human champions on the quiz show Jeopardy!
But that really just makes for good headlines. What makes Watson so fascinating for us is how its technology can apply data to healthcare, because its most significant asset is that it can be used as a support for doctors diagnosing patients.
Watson uses much the same principle as the SkinVision app – applying gathered data to inform practice – only it naturally requires more sophisticated algorithms for it to function. In one fascinating case, Watson was able to diagnose a rare type of leukemia in just 10 minutes, in a woman whose condition had stumped human doctors for weeks.
Still feeling hesitant about the prospect of using AI in medicine? To be clear, Watson isn't the answer to all our problems; its AI can still make mistakes. But the difference between machine doctors and human doctors is data, and as the technology to process growing quantities of it improves, so does the difference in ability between human and machine.
After all, humans can absorb information from conferences, medical journals, and articles, but we all have a finite capacity for storing knowledge. What’s more, the knowledge that human doctors possess will largely be limited to their life experience. A machine doctor, on the other hand, can only get better the more data it is given.
With instant access to data from other machines via the cloud, shared data can inform more accurate diagnoses and surgeries across the world. Thanks to exponential growth, these machines will have access to all manner of variations in the human body, leaving human knowledge flagging far behind.
Data science and belonging
After fulfillment of the second stage of Maslow’s hierarchy (safety), the need for belonging to a social environment (family, friends, relationships) will follow. It states that humans need to be part of a community of people who share their interests and outlook on life.
The perceived disconnect between technology and society has been a topic of much discussion in recent years. The internet is often criticized as contributing to an increasingly isolated existence where our every whim and need is catered for.
As an outdoorsy person, I won't make any case for socializing in the digital world over the physical one. However, the relatively democratic access that the internet affords to people all over the world, at all hours of the day, is to my mind a great asset to human existence and experience.
What’s more, what makes social networks such as Facebook, Instagram and LinkedIn successful is not the usability of the platform – it’s their data. A badly subscribed social network is unlikely to offer the same breadth of services as a well-subscribed network because social communication ultimately relies on relationships.
If the data isn’t there to connect us to the right information, whether that means human connections, images that appeal to us, or news stories on subjects in which we are interested, the social network will not be useful to us.
Data is helping to make our world much more interconnected, and it is not only aiding us in personal ventures like finding old school friends; it is also helping scholars and practitioners who are carrying out similar projects to find each other and partner up.
CASE STUDY Forging connections through LinkedIn
I love using LinkedIn – and I think that they have really applied their data to benefit both themselves and their users. A quick visit to the business network’s ‘People You May Know’ tab will show you an inexhaustible list of recommendations for connections with LinkedIn’s other users.
Some of these might be people at your current workplace, but you may also notice people from your university, and even school friends, cropping up on the system as recommended connections.
To do this, LinkedIn uses the data you post to your profile – background, experience, education, existing colleagues – and matches it with the profiles of others.
LinkedIn’s technology has enabled thousands of people to rebuild connections with their past. And as these connections grow, so does the network’s data, thereby generating yet more connections.
Whenever you connect with another user, not only do you gain what they call a ‘first-degree connection’ but their linked colleagues become ‘second-degree connections’, thereby expanding your circle much further than may be apparent.
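The first- and second-degree logic described above can be sketched with a small graph structure. The network below is invented for illustration; it is not LinkedIn's actual data model.

```python
# Toy network: each user maps to the set of their direct connections.
network = {
    "you":   {"alice", "bob"},
    "alice": {"you", "carol"},
    "bob":   {"you", "carol", "dave"},
    "carol": {"alice", "bob"},
    "dave":  {"bob"},
}

def second_degree(graph, user):
    """People connected to your connections, excluding yourself
    and anyone you are already directly connected to."""
    first = graph[user]
    second = set()
    for friend in first:
        second |= graph[friend]
    return second - first - {user}

print(second_degree(network, "you"))  # carol and dave
```

Connecting with one new person folds their whole circle into your second degree, which is why the visible list of recommendations grows so much faster than your direct connections do.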
For LinkedIn, as with any other social media channels, all that is essential is input from its users. I have found numerous friends and ex-classmates on the site, many of whom have since gone into the same field as me and, thanks to data’s ability to match us, this has opened up a new dialogue between old acquaintances.
Knowing that I have access to friends and colleagues online builds a sense of community and maintains it long after we have moved on, whether from a city or a place of work, and I find this interconnectedness comforting.
By connecting with others who share our interests, courses of study and location, LinkedIn can also give us a good insight into jobs that are relevant to us.
When I was in the market for a new job, I started posting status updates to LinkedIn – the platform’s data algorithms identified my needs according to keywords that I had used, and this is how recruiters started to find me.
What was even better was that since I was writing about subjects that interested me, LinkedIn’s algorithms matched me to jobs that specifically required those branches of knowledge. It was even how this book’s commissioning editor found me. How’s that for social media channels’ abilities to improve happiness?
As I have mentioned, while the benefits of having an online presence can be profound and can significantly contribute to our happiness and need for belonging in both the personal and professional spheres, we must also be aware of the problems it can cause us. One of the biggest concerns is how we can protect our data from being stolen.
Cybersecurity has been a hot topic ever since the growth of online banking and since e-commerce became the modus operandi for the retail industry to reach new customers.
In the past, we were told to frequently update our passwords, to only make purchases from reputable websites and, if our bank details were compromised, to contact our bank’s fraud department as soon as possible.
Considering that we are increasingly carrying out transactions online, it is reasonable for us to be concerned about how companies protect our information.
CASE STUDY Data breaches and ransomware
Your data exhaust will inevitably increase the more you use the internet, and the more connected you are to other users. The more data you produce, the more valuable you become to companies that sell user data. Data has superseded oil as the world’s most valuable resource (The Economist, 2017).
But when things become valuable, they also become threatened by theft and abuse. And considering how well connected we are, concerns for our personal information today go far beyond our credit card numbers.
A wealth of personal information is being put online, and whenever our personal computer is connected to the internet or an external server, we are at risk of having that information stolen.
We only need to look back to the global WannaCry cyber attack of May 2017, a computer worm that infected Microsoft Windows computers with ransomware in 150 countries, to see the potential magnitude of this risk.
From FedEx in the United States to the Ministry of Foreign Affairs in Romania, the WannaCry worm locked user data from individuals and organizations on a global scale, with the worm’s developers demanding payment in exchange for its restoration.
Many of those affected felt they had no choice but to pay the team holding their data to ransom in order to prevent it from being destroyed.
This is the power of data – its theft can bring an organization to its knees in seconds.
Another recent example of a major cybersecurity breach is the Equifax data breach.
Aggregating data on over 800 million consumers and more than 88 million businesses worldwide, Equifax is considered one of the big three credit-reporting agencies. On 7 September 2017, Equifax announced that it had fallen victim to a cybercrime that potentially exposed the personal data of 143 million US consumers.
The stolen information included first and last names, birth dates, social security numbers, addresses and more (Haselton, 2017). Considering that the population of the US at the time was 324 million, almost every second person in the country was affected.
The rise of cybersecurity
Cyber attacks on consumers and institutions alike are growing in both number and scale. At the same time, cybercriminals are getting better at covering their tracks, which makes even locating them difficult.
The rise of Bitcoin, a digital payment system that enables anonymized transfers, adds a further layer to the already complicated issue of finding and bringing these hackers to justice. As information can be breached from anywhere in the world, this makes it difficult for law enforcement to deter criminals.
Today, it is no surprise that cybersecurity specialists are in real demand. Cybersecurity works to block fraudsters and hackers in real time, and to carry out forensic analyses once attacks have occurred.
Our interactions online have changed, and as the digital systems develop and change, so does the way people commit fraud online and, by proxy, the means at our disposal to combat it. Cybersecurity experts must constantly play a game of cat and mouse if they want to remain ahead of the threats.
My tip if you want to get involved with cybersecurity? Get to know how to work with unstructured data, that is, non-numerical information. Typically, 80 percent of a company’s data is unstructured. We will look in further detail at the developments in working with unstructured data in the following blog.
How can we protect ourselves from cyber attacks?
If we use computers that are connected to the internet or to external servers, and especially if we use social channels for sharing information, we cannot completely protect ourselves from data theft.
However, we can become more careful about storing and managing our data, to ensure that any issues can be efficiently dealt with. Here are a few guidelines that I use for protecting my data:
1 Keep copies of all the files that you cannot afford to lose on an external hard drive or data stick.
2 Clone your hard drive to a reliable external hard drive on a regular basis.
3 Keep tabs on your online accounts, and close any accounts that you no longer use.
4 Archive data that you no longer use, and disconnect it from the internet. Make sure that these files are only stored locally, and keep your archives in a cool, secure environment.
5 Keep all sensitive information away from sharing servers like the cloud.
6 Run regular checks on your hardware to detect intrusions before the damage is done. Ransomware bugs and worms can spend months within a user's system before being activated, probing and infecting every last corner of the database, even corrupting backup files, before they finally encrypt the data.
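Guideline 1 above is easy to automate. The sketch below copies irreplaceable files to a second location in a few lines of Python; the directory paths you pass in are placeholders that you would point at your own folders and external drive.

```python
# Sketch of guideline 1: copy every file you cannot afford to lose
# into a backup directory, preserving the folder layout.
import shutil
from pathlib import Path

def back_up(source_dir: str, backup_dir: str) -> int:
    """Copy every file under source_dir into backup_dir and
    return the number of files copied."""
    src, dst = Path(source_dir), Path(backup_dir)
    count = 0
    for path in src.rglob("*"):       # walk the whole tree
        if path.is_file():
            target = dst / path.relative_to(src)
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(path, target)  # copy2 preserves timestamps
            count += 1
    return count
```

On a real system you would point `backup_dir` at an external drive or data stick and run the script on a schedule, which also goes a long way towards guideline 2.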
Data science and esteem
Esteem is the fourth level of need in Maslow's hierarchy, and there is a very clear link between data and creating esteem.
Due to the rise in people working online, many digital work platforms are helping clients, agencies and freelancers to find the best person for the task by using recommendation and starring systems.
Once a project is complete, online freelancing platforms give those involved the opportunity to publicly rate each other, based on factors that range from availability to quality of work.
Each platform’s rating system is slightly different, but what this rating data ultimately does is both help match clients with the best freelancer for them and contribute to the freelancer’s overall work score, intended to encourage those who receive a good score to continue working in that way, and to motivate those who receive a negative review to improve their performance.
Some may be resistant to the idea of being subject to such scrutiny, but consistent performance data allows people to identify where they excel, and where they may need further training.
Keeping data in high esteem
What is to be done about managing data relative to esteem? The next step for companies will be to encourage users to include relevant demographic data about themselves (such as age and location), to develop a more comprehensive system beyond the simple starring method, and to carry out unstructured analytics on written reviews, which should give a more valuable and accurate picture of how a user feels.
This data might then be visualized in word clouds (popular visual representations of text, which we will discover more about in the following chapter) or accessed through filters applicable to user demographics.
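A word cloud is ultimately driven by nothing more than word frequencies. The sketch below computes the counts that a plotting library would then render as differently sized words; the review snippets and the tiny stopword list are invented for illustration.

```python
# Word clouds size each word by how often it appears; this sketch
# computes those frequencies from a couple of made-up reviews.
from collections import Counter
import re

reviews = [
    "great communication and great results",
    "fast delivery, great communication",
]

STOPWORDS = {"and", "the", "a"}  # tiny illustrative stopword list

words = []
for review in reviews:
    words += [w for w in re.findall(r"[a-z]+", review.lower())
              if w not in STOPWORDS]

print(Counter(words).most_common(3))  # 'great' leads with three occurrences
```

Filtering stopwords before counting is what keeps filler like 'and' from dominating the cloud, so the biggest words reflect what reviewers actually emphasize.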
Data science and self-actualization
Here is where the fun (literally) starts. With ‘self-actualization’, Maslow refers to a person’s need to fulfill his or her potential in life.
Unlike the earlier levels in the hierarchy that largely reflected needs innate to all people, need here can manifest itself in very different ways – tangibly or intangibly – depending on a person’s interests.
One person’s need for self-actualization may be alleviated through mastering watercolor painting, while another’s might be to become a compelling public speaker, and so on.
CASE STUDY Experience gaming
Ultimately, self-actualization represents the human need for joy, and we have already seen something of the importance this has in the entertainment industries.
The billion-dollar video game industry has obvious connections with data science through its dependency on technology. Virtual reality (VR) is one of the most exciting areas in which data has been used to further develop and improve the gaming experience.
Where VR had once been considered a fad, it is now a major strand in the industry – and that is largely thanks to technology’s improved capabilities in processing data, for instance in the frame rates and detail necessary for creating a realistic VR world.
Before the developments made in the 1990s, computer-aided design (CAD) data had been limited by the lack of technology capable of processing it.
Now, data can be used to build a life-size, 3D virtual environment, and algorithms are used to dynamically track where you ‘are’ in this environment, enabling the players’ screens to match their gaze through active shutter glasses and multi-surface projection units.
That is how data improves the engineering behind the video game. But data can also be used to enhance a gamer’s experience through its capture of how a game is being played. And data can be gathered from users in many more ways than is possible for other entertainment industries such as film.
The data exhaust that users leave spans their interactions, playing time, expenditure on additional game components and their gaming chats, among other things. This not only improves recommendation systems and advertising, but also the game's mechanics, making it more enjoyable; developers can even use the big data produced by software distribution platforms to predict peak times and attend to their servers accordingly.
Some final thoughts
It is clear that the developments in data science have directly benefited a vast number of areas in our lives. And data is continuing to drive a permeable layer between the physical and digital landscapes, redefining the way we engage with both environments.
This might bring with it some conflicting thoughts, but as we can see from how readily data's benefits map onto Maslow's hierarchy of needs, data-driven developments will fundamentally facilitate human existence.
Naturally, a great deal of these developments and how we adapt to them depend on the data scientist, which is why I will use the next blog to describe how we can think like one and ensure that our first foray into the discipline is well guided and utilizes the experiences we already have at hand.
Pick the right place to start
Data scientists do not need to know the ins and outs of every piece of software and every algorithm to make a difference in the field. The available programs are innumerable, and the algorithms range from the most basic options that classify our data to the most complex that help drive artificial intelligence.
When you are first starting out, the trick is to take the time to acknowledge where your interests lie, whether that is visualization or machine learning, before pursuing a specific area.
Hold off from taking a knee-jerk response to this – that will not only limit you in your early exploration of data science but may also leave you uninspired if you choose to invest in the wrong area.
To many, visualization may sound more interesting than analysis at a superficial level, but you should take the time to understand what is required of each before making a rash judgment. The good news is that by the time you have finished reading this blog, you will be much clearer about which area interests you the most.
Let’s also be explicit here about what we mean by targeting a specific area; there is a big difference between choosing a niche from which you can spring-board your career and specializing in it.
The latter is a dangerous move and is one that I would never advise. After all, data science is a dynamic subject that requires its practitioners to be equally dynamic in their exploration of how to tackle new problems in the field.
Algorithms change, software changes, and specializing in something that may become defunct is not a constructive way to practice the discipline. This is especially true considering how the rate of technological development directly affects practitioners' work, as identified by our old friend Moore's law.
Moore’s law 2.0
Moore’s law is a projection of exponential growth and is based on the initial observation that the number of transistors in an integrated circuit will double every two years.
It has since been used to account for the rate of development (and inversely proportionate costs) in technology, and to forecast how soon future advancements might be completed.
The fact that every year we have a new iPhone with a processor approximately 50 percent faster than the previous model’s is one such example of Moore’s law in action.
In contrast to 30 years ago – when the only people with access to data-processing facilities were from the intelligence and security branches of government – even pre-school children can now access a wide variety of data from handheld devices that fit in their back pocket.
Moore’s law enables us to access, explore and exploit the potential of data through this explosion of technological advancement.
One of my favorite examples of Moore’s law in practice is the Human Genome Project, which was launched in 1990.
The project’s researchers set out to determine the sequence of the nucleotide base pairs that comprise human DNA. The slow pace at which the project moved in its initial years was a cause for concern for those watching its development from the outside.
Once the first seven years had passed, forecasters took stock of how much of the genome had been sequenced so far, and they predicted that the rest of it would take another 300 years to complete.
In these predictions, however, they failed to account for Moore’s law. Sure enough, the next seven years of the project saw the full and successful sequencing of the genome – some 294 years ahead of schedule if we were to take linear progression into account.
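The gap between the two forecasts can be checked in a few lines. The 2 percent starting figure below is an illustrative assumption chosen so that the linear forecast comes out near the 300-year prediction; it is not the project's actual number.

```python
# Contrast a linear forecast with exponential (Moore's-law-style) progress.
# Assume, for illustration, that 2% of the genome was sequenced
# in the first 7 years.
done_after_7_years = 0.02

# Linear forecast: the same pace forever.
linear_years_remaining = 7 * (1 - done_after_7_years) / done_after_7_years

# Exponential forecast: each year's increment doubles.
progress, years = done_after_7_years, 0
rate = done_after_7_years  # last year's increment, doubling annually
while progress < 1.0:
    rate *= 2
    progress += rate
    years += 1

print(round(linear_years_remaining))  # 343: over three centuries
print(years)                          # 5: a handful of years under doubling
```

The point of the sketch is not the exact numbers but the shape of the curves: a linear extrapolation of early progress misses exponential improvement by two orders of magnitude, which is exactly the mistake the genome forecasters made.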
Flex your creative muscles
As we have learned, the dataset is only going to be as useful as the data scientist. Thus, for any project, a good degree of creativity is required to get the most out of the data.
Data scientists must get into the mindset of asking the right questions of their data, and I want to emphasize here that you should absolutely embrace blue-sky thinking: considering the wider implications of a project through its sets of data.
After all, data can give us surprising results to the questions asked of it, and it can highlight problems, issues and gaps that we may not have known about had we not explored the data thoroughly.
That can be said for all disciplines and industries that use data to drive practice: the creativity that its data scientists bring to the table on how to best solve a given problem will significantly affect the task.
There is, of course, a spectrum in the required level of creativity: some challenges may only need a rudimentary approach, while others might warrant some extreme out-of-the-box thinking.
And if you ask me what lies at the far end of this spectrum, what is at the cutting edge of data science and technology, without a shadow of a doubt I will answer artificial intelligence.
Whoever it is I talk to, mention of artificial intelligence (AI) will always get them sitting forward in their seats. It is a fascinating area of development and one that is guaranteed to make headlines. AI is, however, entirely dependent on the availability of data, and the computer’s ability to process it.
The first thing that many will think of when discussing AI is Hollywood's treatment of it in movies forewarning that improvements in this area will eventually lead to our undoing. In Blade Runner, adapted from Philip K. Dick's science fiction novel Do Androids Dream of Electric Sheep?, robots ('replicants') are so lifelike in their design and responses that they eventually become a threat to human existence. For that reason, they are banished to off-world colonies, where they can be kept separate from Earth's citizens.
Some of these replicants, however, find their way back to Earth and become hostile towards our species. As it is impossible to distinguish these robots from humans by simply looking at them, the Voigt-Kampff machine is developed.
This polygraph-like machine poses a series of questions to its test subjects, specifically designed to scrutinize emotional response. It is reasoned that these questions will perplex the robot subjects – where emotion is deemed to be absent – and therefore reveal their true ‘identity’.
This test has its roots in reality, in the Turing test. Proposed by the codebreaker Alan Turing in the 1950s, the test assesses whether humans can distinguish a machine from another human being by evaluating the responses given during an interrogation.
Unlike the Voigt-Kampff, there would be two subjects in the Turing test – one a robot, the other a human being – and both are hidden from the examiner’s view.
The examiner, whose job is to discern which of the subjects is a robot, will ask both subjects a series of text-only questions and will evaluate how closely their answers resemble the responses a human being might give.
To date, no machine has convincingly passed the Turing test.
We may still be a little while away from the highly sentient robots of Blade Runner, but there have been many examples of situations where robots have quite literally beaten humans at their own game.
CASE STUDY Deep Blue and AlphaGo
In a 2016 competition of Go – an abstract, two-player strategy board game that is hugely popular in East Asia – a machine known as AlphaGo, created by Google's subsidiary DeepMind, managed to beat 18-time world champion Lee Sedol in four out of five games.
You may not consider this any great feat, remembering the famous game of chess played between Russian grandmaster Garry Kasparov and Deep Blue, a computer developed for the purpose by IBM.
Deep Blue won, and that was back in 1997. But even though the robot’s success came almost 20 years before that of AlphaGo, the accomplishment of the latter machine should be of special interest to us.
The game of chess is entirely based on logic. The goal for Deep Blue, then, was to flawlessly observe this logic and wait until its opponent made an error. Humans make errors, machines do not.
Unlike chess, the game of Go is based on intuition. Intuition is a much more complicated concept than logic for a computer to handle, as it requires the machine to develop intrinsic knowledge about the game that cannot simply be pre-programmed into it.
In Go, players use black and white stones on a 19×19 grid board. The object of Go is to cordon off more areas of the board than your opponent. AlphaGo was initially given a large database of around 30 million (human player) moves, which were analyzed through a combination of machine algorithms and tree search techniques before it was used as a Go player.
Once a significant number of games had been played against human contestants and enough knowledge of its opponents’ behaviors had been gathered, AlphaGo was made to play against itself millions of times to further improve its performance.
Only after this training period had passed did the machine’s creators pit it against the world’s top players. From chess to Go, artificial intelligence has come a considerable way, learning through doing and observing, rather than applying mathematical logic.
At this point, you might be thinking: ‘AI winning in chess and Go is exciting, but how is this all relevant to adding value to businesses?’
The application of artificial intelligence isn't limited to beating humans at Go. The same company, DeepMind, developed an AI to help Google better manage cooling in its vast data centers. The system was able to consistently achieve an astonishing 40 percent reduction in the amount of energy used for cooling.
Not only does this create a huge potential for savings for the company, but it also means improved energy efficiency, reduced emissions and, ultimately, a contribution towards addressing climate change (DeepMind, 2016). Now if that is not a creative approach to a business problem, then I don’t know what is.
Make use of your background
As I have said under Point 1, the real beauty of data science is that unlike many other disciplines it will not take years of practice to master. Readers who may be just getting started in data science, then, should not feel at a disadvantage when comparing themselves to their peers who may have worked with and studied data all their lives.
Again, all you need is a slight shift in mindset – to focus on what you do know, rather than what you don’t.
Leverage your background, both in terms of your in-depth knowledge of another subject, and any of the soft skills that you will probably have picked up from your professional and/or educational experience.
Not only is data science simple to pick up, it is also beneficial to have come to work in the discipline after having a grounding in another. Here’s where the creative hinges of data science can be highlighted once again. Let us take writing professionals as an example.
If a writer has spent all of his or her education and professional development in only studying and producing writing, and no time at all learning about other disciplines, understanding the way people work, reading extensively or experiencing the world at large, then that writer is going to have a limited spectrum of knowledge and experience about which to write.
It’s a similar concept with data science: those who have studied data science all their lives and have limited professional or personal experience elsewhere will approach a project somewhat underdeveloped.
So, let’s say that somebody with a background in linguistics has decided to transition to data science. They will have a significant advantage over other data scientists for projects within that subject area.
The example isn’t facetious; name a profession and I’ll tell you how data science can be applied to it. Someone with such a background, for example, might be better able to access material from the International Dialects of English Archive, which records the voices of thousands of participants from across the globe and uses those sound files to populate a map of the world.
A ‘raw’ data scientist may be able to play with the material, but a data scientist with the right background will be able to ask the right questions of the project in order to produce truly interesting results.
A geographical pocket of the West Indies, for example, known by the linguist for its unusual use of slang, could be taken as an initial case study from which we might glean further insights, such as the development of expressions between generations, between ethnic backgrounds and between genders.
Undertaking a career in data science, then, does not mean performing a U-turn on everything you have learned from previous disciplines. Quite the opposite is true.
Sometimes, the most interesting projects for you will naturally be close to home. Consider the problems that you face in your workplace: could there be a way to resolve them through data?
While undoubtedly helpful, you do not necessarily have to be an expert in a field to get a head start in data science. Even soft, transferable skills such as teamwork and public speaking can afford you significant leverage in the discipline.
This may be more useful than in-depth knowledge for those who have just left school or university, for example, or those who may not have the same level of life experience or education as others.
Consider your skills: are you a good communicator? Can you adapt established solutions? Do you have an eye for aesthetic appeal? Are you an out-of-the-box thinker?
I entered the discipline of data science with considerable knowledge in finance, but while that was undeniably helpful in my time at multinational consulting firm Deloitte, I think what ultimately helped me were the soft skills that I had picked up much earlier on, even during my formative school years.
I also came to data science with a good understanding of creating graphics to visualize the results of a project in an aesthetically appealing way. During my childhood, I lived in Zimbabwe where I studied art twice a week.
I came away from those classes with only a basic ability to paint and mold funny-looking pottery, but while the course may not have set me up for success as the next Joan Miró, it did train me to think more constructively about colors and aesthetics and the positive psychological effect they can have on the reader of a final report.
Once I returned to Russia some years later, my schooling took a very different turn, largely comprising the hard sciences at three different senior schools simultaneously.
That type of schooling drilled into me the academic rigor that I needed for my coming years at university, but it also left me somewhat lacking in social skills.
As an almost incurable introvert, I took to teaching myself some of the confidence and communication skills that I knew I would need. I found a self-help blog that told me all I needed to know about how to get out of my shell.
Its exercises were a little unorthodox (lying down in the middle of a busy coffee shop, drumming up a conversation with people on public transport) but it worked for me.
This effort may have been initially motivated by juvenile thoughts of joining university socials and sports teams, but it later helped me establish myself as an approachable and communicative figure – an attractive quality to my place of work, which needed its data scientists to deliver reports to a wide range of stakeholders across the company.
This is another crucial factor for data scientists: if you want to be able to run a data science project, you will need to be able to speak to the right people. That will often mean asking around, outside your team and potential comfort zone.
The data won’t tell you anything unless you ask the right questions, so it is your job to get out there and find answers from the people who have contributed towards your data.
In both of the points discussed here – whether you are leveraging your in-depth knowledge to locate information or using your soft skills to gain answers from people – you will probably come across data that is non-numerical and that therefore depends on context and a degree of subjective judgment to analyze correctly.
This kind of information – which we call unstructured data – can be a written response, a recorded (video/audio) interview or an image. Because it cannot be directly quantified, companies often favor subject matter specialists to analyze it.
Unstructured analytics works with – you guessed it – unstructured data, which comprises the majority of information in the world. In defining unstructured data, it may be easier to say that it is everything that structured data (numerical information) is not. It can be text, audio, video or an image.
The reason for the name is that this kind of data cannot simply be forced into a dataset – it must first be prepared, and because unstructured data is often not automatically quantifiable, a certain degree of subjectivity or bias becomes unavoidable in analyzing it. This makes unstructured analytics an essential area for any data scientist.
A classic example of unstructured analytics is working with qualitative surveys, which give data in a textual or another non-numerical format. In the past, this data had to be converted into numerical form before it could be understood by an analytics tool.
This meant any survey questions that were not multiple choice or single answer – and thus could not easily be transposed to a numerical format – required a further manual effort from the data scientist to numerically categorize each answer.
For example, a question about what a visitor to Yellowstone National Park enjoyed about their stay could result in a range of responses including ‘the wildflowers’, ‘picnicking’, ‘painting’, ‘bird watching’, ‘kayaking’, ‘great bed and breakfast’ and so on.
The data scientist would have to read all these results and then manually group them into categories that they felt were significant, such as ‘nature’, ‘activities’, ‘sightseeing’ and ‘relaxation’. It is not always so easy to group a response into a category, leaving them open to human subjectivity.
You can imagine that transposing those responses into numbers left the resulting dataset a little skewed, at best.
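The manual grouping step described above can be sketched as a simple keyword lookup. This is a toy illustration – the keyword-to-category mapping is an assumption based on the Yellowstone example, and real responses would need far more careful handling:

```python
# Hypothetical categories and keywords for the Yellowstone survey example.
CATEGORIES = {
    "nature": ["wildflowers", "bird", "wildlife"],
    "activities": ["picnicking", "painting", "kayaking"],
    "relaxation": ["bed and breakfast", "spa"],
}

def categorize(response):
    """Assign a free-text response to the first matching category."""
    text = response.lower()
    for category, keywords in CATEGORIES.items():
        if any(keyword in text for keyword in keywords):
            return category
    return "other"  # responses that fit no keyword need human review

responses = ["the wildflowers", "kayaking", "great bed and breakfast"]
print([categorize(r) for r in responses])  # ['nature', 'activities', 'relaxation']
```

Even in this tiny sketch the subjectivity is visible: whoever writes the keyword lists decides what counts as ‘nature’ versus ‘activities’.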
Today, methods of ordering results by context have dramatically changed the way we carry out research, and new algorithms in this area are helping us to accurately work with images as well.
Data scientists saw the problematic methodology behind organizing qualitative data and made a concerted effort to deal with values that could not easily be converted into numbers.
The resulting algorithms tackle media to make far more accurate predictions than had previously been possible. Now we can treat words in a similar way to numerical data – for example, by teaching analytics tools to identify support verbs and idiomatic phrases that are only of peripheral interest relative to the actual keywords.
This enables a machine to explore textual data in a far more qualitative way. The trend in the digital humanities to analyze literary works may spring to mind here, but that’s only scratching the surface of what the machine algorithms in this area can do.
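One simple technique of the kind described above is filtering out ‘stop words’ – support verbs, articles and pronouns – so that only the content-bearing keywords remain. A minimal sketch, with an illustrative (not exhaustive) stop-word list:

```python
# An illustrative stop-word list; real analytics tools use much larger ones.
STOP_WORDS = {"the", "a", "an", "i", "we", "was", "were", "is", "are",
              "and", "or", "to", "of", "it", "this", "that"}

def keywords(text):
    """Return lowercase tokens with punctuation stripped and stop words removed."""
    tokens = [t.strip(".,!?") for t in text.lower().split()]
    return [t for t in tokens if t and t not in STOP_WORDS]

print(keywords("The wildflowers and the birds were beautiful!"))
# ['wildflowers', 'birds', 'beautiful']
```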
Unstructured analytics has applications that extend well beyond the academic realm and into the commercial world. Even in forensics, machines can now trawl through suspects’ written communication to make behavioral connections that a detective might not have been able to see.
You may think that humans will always be better than machines at trawling through media, because most of us still feel that we understand this wider contextual environment best.
How can a computer recognize a period of art, or a flock of gulls, or emotion, better than a human? In reality, machines have long been able to make stunningly accurate predictions about non-numerical data.
As early as 2011, a study carried out between the Institute for Neuroinformatics at the Ruhr-Universität, Bochum and the Department of Computer Science at the University of Copenhagen found that machines could outperform humans, even in complex tasks like identifying traffic signs.
For this study, the team presented their machine and human test subjects with a photograph that had been divided into squares. The task was to assert which (if any) of the squares contained all or part of a traffic sign.
You may have seen these tests online – they are currently used as an additional security check before a user logs on to a site, and they are specifically designed to prevent bots from accessing secure data.
The results of this study suggest that such safeguards would already be insufficient if there were an AI takeover.
I see word clouds used a lot in public presentations, and I suspect that is because they artfully and meaningfully combine image with text.
Word clouds (or tag clouds) are popular ways of visualizing textual information, and if you aren’t yet using them in your presentations, you may want to once you’ve learned how they work.
A word cloud creator will take a set of the most commonly used words from a targeted piece of text and group them in a single image, identifying their order of importance by font size and sometimes also by color.
Word clouds can naturally be used to highlight those terms that appear most frequently in a text, whether this text is a press release or a work of literature. They can also be run on survey data, which makes them a very simple but effective way of showing users the key concepts or feelings associated with a given question.
Their effectiveness is therefore demonstrated in their versatility and in identifying key or significant words from anything that contains the text: metadata, novels, reports, questionnaires, essays or historical accounts.
There are plenty of simple word cloud generators available to use online, where you can play around with fonts, layouts and color schemes. They are much more appealing to the eye than ordered lists. Try one out for your next presentation; you might be surprised by the discussion it generates.
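Under the hood, a word cloud generator does little more than count word frequencies and then scale each word’s font size by its count. The counting step can be sketched with Python’s standard library (the sample text here is made up):

```python
from collections import Counter

# Made-up sample text standing in for a survey response or document.
text = "data science data analysis science data insight analysis data"

counts = Counter(text.split())

# The most frequent words would be drawn largest in the cloud.
for word, count in counts.most_common(3):
    print(word, count)
```

A real generator adds layout and color on top, but the ranking that drives the visual emphasis is exactly this frequency count.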
Data science has significantly improved the techniques companies use to access and analyze media. Most business owners and marketers will be familiar with SurveyMonkey, an online, free-to-use questionnaire provider that runs your surveys through its data analytics tools.
Users acquire access to their consumer data in real time, and responses from their questionnaire’s participants are visualized in simple graphics and a front-end dashboard.
At the time of writing, the provider’s data analytics offer includes real-time results, custom reporting through charts and graphs, filtering your data to uncover trends by demographic – and text analysis, where users will receive the most relevant text data from their survey in a word cloud.
Practice makes perfect
One great aspect of data science is that there are so many free and open-source materials that make it easy to keep practicing.
When people are new to a discipline, there is a tendency to spend month after month learning its theories – instead of doing this, get into the mindset of always applying what you learn in a practical environment.
As an exercise, just type ‘free datasets’ into a search engine and you will find a huge number of sites that allow you to download their .csv files (files for storing tabular data) directly to your computer, ready for analysis.
Considering the sheer amount and range of data, from NASA space exploration to Reddit comments or even sports data (basketball, football, baseball – take your pick), I am positive that you will find something of value and interest.
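Once you have downloaded one of these .csv files, the first step is simply reading it into rows you can analyze. Here is a minimal sketch using Python’s standard library – the inline sports-style data is a made-up stand-in for a real downloaded file:

```python
import csv
import io

# Stand-in for an opened .csv file downloaded from a free dataset site.
raw = io.StringIO(
    "player,points,rebounds\n"
    "A. Example,24,10\n"
    "B. Sample,18,7\n"
)

# DictReader maps each row to a dict keyed by the header line.
rows = list(csv.DictReader(raw))
print(rows[0]["player"], rows[0]["points"])

# A first trivial analysis: the average points per player.
average_points = sum(int(r["points"]) for r in rows) / len(rows)
print(average_points)  # 21.0
```

With a real file you would replace the `io.StringIO` object with `open("your_dataset.csv")`; everything else stays the same.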
While the best analysis tools out there are not currently free to use, an increasing amount of software is either open source or freely available online. If you were a painter, this would be like having an endless supply of easels, paints, and canvases to work with, with no worry about using your materials up.
I urge you to make use of these publicly available datasets to test your skills and run your own analyses. There is no shortcut to the practice. Much of what you do, especially in the initial stages, will be trial and error.
The best way to train yourself to think laterally about solving problems through data is to increase your exposure to different scenarios – in other words, different datasets.
Where to begin? The best place may be right under your nose. I expect that many readers will be business owners or individuals working in a company that expects to use data in the near future.
Those of you who have worked with a company in any shape or form will at some point have come across business intelligence.
Business intelligence vs data science
If you have already used business intelligence (BI) in your place of work, you will have a degree of practice under your belt. With business intelligence, you are required to identify the business question, find the relevant data and both visualize and present it in a compelling way to investors and stakeholders.
The main difference is that BI does not carry out detailed, investigative analyses on the data. It simply describes what has happened, in a process that we call ‘descriptive analytics’.
Data science, then, gives us the edge needed to answer further questions that may arise from a company’s datasets, as well as make predictions and reliable suggestions for improvement.
Technology research firm Gartner has a model that divides business analytics into four types – descriptive, diagnostic, predictive and prescriptive – and while business intelligence responds to the first type of analysis, data science can help us tick the boxes for the final three.
Analytic value escalator
That’s the good news, but practicing business intelligence outside the principles of data science can end up as harmful to your progress. It is true that you will be using data the first time that you generate a business report.
But business owners will often need reports on a regular basis. When this happens it is common for attention to the data to fall to the wayside in favor of the end results.
This is one of the problems with BI: data often comes second to an updated report. But data needs to be at the heart of any results and insights we generate in a business. For every report we make, we must carry out analyses of our data beforehand; otherwise, we will only be scrutinizing our data within the limitations of our previous investigation.
It may be tempting to cling to BI because you or your company has operated in this way for years, but data science simply offers a far more impressive toolset – figuratively and literally – for analytics.
Data science opens the door to a variety of analytical software and a thriving online community of data scientists working on open-source codes to improve and share their processes.
An ability to use these tools eliminates the human burden of looking for insights manually, enabling you to focus on isolating bottlenecks, uncovering sales opportunities and evaluating the health of a business division. Unfortunately, BI’s traditional dependence on Excel can teach you bad habits.
Everything you think you know is wrong
All readers will have worked with Excel. It has become one of the most important programs for corporations, and most spreadsheets will be shared in the .xlsx format.
However, Excel can have the effect of oversimplifying things, and so people may come away with a skewed perception of data. If Excel is the only data tool you know, you have to be open to changing your perception of analytics.
Be prepared to use a program that is not Excel. In my view, some of the best programs for analyzing datasets are R and Python.
Keep ethics in mind
On a crisp February morning, long before any reasonable person would be awake, I got a phone call from the Queensland police. Still bleary of eye and furry of tongue, I could barely get my words out: yes, I was Kirill Eremenko; yes, I was at my home in Brisbane; yes, I owned the number plate that they read out to me. So what was the problem? They asked me whether anyone apart from me used the motorbike I owned and whether I knew where the vehicle was. The question launched me into consciousness and had me leaping down the stairs to my garage.
With relief, I saw that my pride and joy was still there. But the question remained: if everything they were asking me about (including me) was safely locked away, what were the police doing with all of my details?
They told me that they had spotted a motorcycle with my number plate evading the police in the Gold Coast, a beach city not far from Brisbane. They said that considering my motorbike was at my home, my number plate must have been forged – and they later found it was.
Imagine for a moment that my bike had indeed been stolen. How could I have proved that it wasn’t me who had been evading the law enforcement officers? That night I had been alone, and I had no alibi to speak of.
As far as the police were concerned, it could certainly have been me, particularly considering how difficult it is to forge a number plate in so heavily regulated a country as Australia.
Even though at the start of the conversation I didn’t know if my motorbike had been stolen, I realized that I hadn’t been at all worried about an alibi during this phone interrogation, not even for a second, because I knew that I had done nothing wrong.
I knew that technology would act as my witness. I carry my phone with me much of the time, I charge it near my bed and any actions I perform with it are registered. This brought to mind my time at Deloitte when I worked with the forensics division.
We worked on countless situations where people professed that they were doing something or that they were in one place, but the tracking recorded on their phones told quite a different story. These records were used as evidence because from mobile devices to CCTV cameras, recorded data doesn’t lie.
The point here is that data can heal. It can act as your alibi. It can act as proof in criminal cases. Many people have a mindset that data can only harm – you won’t get very far in the discipline if you only think of yourself as the villain.
A little change in how you consider data science and its functions will encourage you to look for new ways that practices can be improved and enhanced through data, rather than feeling that you need to justify your work to colleagues.
The ethical cost of data
We know that data can do harm, as evidenced by the boom in conferences and institutions that serve to investigate the implications of technological development on human ethics and codes of practice.
Who has access to our data? Should they have access to it at all? As we have seen, data opens up new ways of working, of living, of investigating, of waging war – and it is doing so at an incredible rate.
Take 3D printing as one such example. As the cost to develop these printers decreases, the number of people with access to this technology will increase. While commercial 3D printers might at the moment be producing toys and games, they also have the potential to print any number of items, items that could prove dangerous – all they need is a data blueprint.
This alone is surely enough to provoke concern, considering the disproportionate rate of technological development against our ability to legislate and protect. Can we ever hope to keep up with this rapid rate of change?
Data abuse and misuse
For data scientists, one of the most pressing matters in the debate surrounding technology and ethics is the amount of access that machines have to information. As the processing capabilities of machines increase, they will soon be capable of handling information in a way that wildly surpasses human limits.
Information of all kinds is becoming digitized. Keeping information in a digital rather than physical format is becoming the norm. Historical artifacts are digitally curated, books and journals are made available online and personal photographs are uploaded to social clouds. After all, information is much safer when it’s kept electronically:
it is less subject to wear and tear, multiple copies can be made, content can be shared and connections between related items can be established. Of course, digital data is not completely protected from damage. It can deteriorate or get lost – but it is ultimately less susceptible to spoiling than data that is only kept in physical form.
The fact that there is so much information on the web – both in breadth and depth – increases the possibilities for machines that have access to this data and widens the divide between human and computer capabilities.
Computers have not reached their processing limits – but we have. The machines are only waiting for three things: access to data, access to faster hardware and access to more advanced algorithms.
When these three needs are met, the use and misuse of machines that can handle the amount of available data will only be a matter of time. And that is already making for a powerful weapon, whether through analyzing online behavior or masquerading as a human on social media sites for the purposes of propaganda.
If we are to believe futurist Raymond Kurzweil’s prediction that, by 2029, a computer will pass the Turing test, then giving machines unfettered access to the internet could make data access a most powerful tool for manipulation.
Why don’t we just stop time?
Returning home after a night out in Brisbane’s city center, I found myself unwittingly pulled into a heated conversation with my taxi driver. Learning that I was a data scientist seemed to have an adverse effect on him.
He began to blame me for the apparently negative consequences that my work would have on the future. Fearing the worst, the taxi driver gestured to the night sky and asked either me or the heavens above, ‘Why not just – stop where we are, right now?’
That simply isn’t possible. It is in our nature to explore the world, and to continue expanding our horizons. It was natural for my agitated taxi driver to have hesitations about how data – and the algorithms to process them – will be used in the future.
But anxiety about what may or may not happen will only hold us back; a disastrous move considering that while we’re busy panicking, technology will continue to develop.
We must also understand that the concerns of one generation will not necessarily be the concerns of another. What bothers us today about how information about us is gathered, stored and used will probably not bother younger generations who have grown up with technology.
This evolution of what we consider ‘the norm’ is reflected in the way we approach the gathering and processing of data. Consider the case of internet cookie storage. Many websites choose to capture data from their users. This data is stored in a small file, called a cookie, on the user’s computer, to be accessed the next time they visit the site. It can range from usernames to web pages accessed, even advertising for third-party sites, and it helps a website tailor its services to its visitors.
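At the HTTP level, a cookie is just a small name–value pair carried in a header. A minimal sketch using Python’s standard library – the cookie name and value here are invented for illustration:

```python
from http.cookies import SimpleCookie

# Build a hypothetical cookie a site might set for a visitor.
cookie = SimpleCookie()
cookie["session_id"] = "visitor-42"
cookie["session_id"]["max-age"] = 3600  # expire after one hour

# Roughly the header value a website sends to store the cookie in your
# browser; the browser sends it back on subsequent visits to the site.
header = cookie["session_id"].OutputString()
print(header)
```

That round trip – the site writing a small record to your device and reading it back later – is all a cookie is; everything else is what the site chooses to store in it.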
CASE STUDY Internet cookies
You may find the following statement familiar: ‘To make this site work properly, we sometimes place small data files called cookies on your device. Most big websites do this too.’
This is a notice from the European Commission (EC), which has passed legislation demanding that every EUROPA website using cookies must state, in the form of a popup or otherwise, that it records viewer data.
Those who wish to continue to use the website can either agree to the conditions or find out more before agreeing to comply. This legislation was passed at a time when people were concerned that their privacy was being invaded by companies who were using cookies to track pages viewed, interactions made and more.
Since then, those ethical concerns about cookies have slowly but surely been laid to rest. No one really cares about cookies anymore, certainly not the Millennials, and that’s largely because we have become accustomed to them as a part of our online lives.
In other words, the concern surrounding internet cookies has eased off, and so the requirement for a company’s website to clearly state that it is collecting data on its users will be phased out at the beginning of 2018.
Cookies are one example of how data collection is becoming an accepted part of our society. The way that most Millennials use social media – for example, freely expressing their opinions, chatting publicly, uploading their personal photos, tagging friends – must feel worlds apart from how Baby Boomers (generally) behave online.
I do not consider the ethical implications that factor into the discussion to be merely awkward obstructions that a data scientist might prefer to ignore. But I put the question to the reader: should we really suppress the development of technology based on our present concerns?
Or should we rather seek to strike a balance between the rate of technological growth and the rate that we can develop suitable ethical guidelines for them?