Big data security
It is becoming increasingly clear that Big Data is creating the potential for significant innovation in many sectors of the economy, such as science, education, healthcare, public safety and security, retailing and manufacturing, e-commerce, and government services.
Big Data characteristics are intimately connected to privacy, security, and consumer well-being. In this blog, we examine several big data security cases and the lessons they offer.
Amazon Kindle case
In July 2009, Amazon Kindle readers found life imitating art when their copy of Orwell’s novel 1984 completely disappeared from their devices. In 1984, the ‘memory hole’ is used to incinerate documents that are considered subversive or no longer wanted. Documents permanently disappear and history is rewritten.
Customers were angry: having paid for the e-book, they assumed that it was therefore their property. A lawsuit filed by a high school student and one other person was settled out of court.
In the settlement, Amazon stated that they would no longer erase books from people’s Kindles, except in certain circumstances, including that ‘a judicial or regulatory order requires such deletion or modification’.
Amazon offered customers a refund, a gift certificate, or restoration of the deleted books. It seems that, in addition to being unable to sell or lend our Kindle books, we do not actually own them at all.
Although the Kindle incident was in response to a legal problem and was not intended maliciously, it serves to illustrate how straightforward it is to delete e-documents, and without hard copies, how simple it would be to completely eradicate any text viewed as undesirable or subversive.
If you pick up a printed book tomorrow and read it, you know with absolute certainty it will be the same as it was today; but if you read anything on the Web today, you cannot be certain that it will be the same when you read it tomorrow. There is no absolute certainty on the Web.
Since e-documents can be modified and updated without the author’s knowledge, they can easily be manipulated. This situation could be extremely damaging in many different situations, such as the possibility of someone tampering with electronic medical records. Even digital signatures, designed to authenticate electronic documents, can be hacked.
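The idea behind authenticating an e-document can be illustrated with a message authentication code: any change to the document, however small, invalidates its tag. The sketch below uses only Python's standard library and a hypothetical shared key; real digital signatures use public-key cryptography rather than a shared secret, so treat this as a minimal illustration of tamper detection, not a production scheme.

```python
import hashlib
import hmac

def sign(document: bytes, key: bytes) -> str:
    """Compute an HMAC-SHA256 tag over the document."""
    return hmac.new(key, document, hashlib.sha256).hexdigest()

def verify(document: bytes, key: bytes, tag: str) -> bool:
    """Recompute the tag and compare in constant time."""
    return hmac.compare_digest(sign(document, key), tag)

key = b"shared-secret-key"   # hypothetical key, for illustration only
record = b"Patient: Jane Doe; allergy: penicillin"
tag = sign(record, key)

assert verify(record, key, tag)            # the untouched record passes
tampered = b"Patient: Jane Doe; allergy: none"
assert not verify(tampered, key, tag)      # any edit invalidates the tag
```

A tampered medical record would fail verification immediately, which is exactly the guarantee an unauthenticated e-document lacks.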
This highlights some of the problems facing big data systems, such as ensuring they actually work as intended, can be fixed when they break down and are tamper-proof and accessible only to those with the correct authorization.
Securing a network and the data it holds is the key issue here. A basic measure taken to safeguard networks against unauthorized access is to install a firewall, which isolates a network from unauthorized outside access through the Internet.
Even if a network is secure from direct attack, for example from viruses and trojans, the data stored on it, particularly if it is unencrypted, can still be compromised.
For instance, one common technique, phishing, attempts to introduce malicious code or to harvest credentials, usually via an email carrying an executable attachment or requesting personal or security data such as passwords. But the main problem facing big data is hacking.
The retail store Target was hacked in 2013, leading to the theft of an estimated 110 million customer records, including the credit card details of forty million people.
It is reported that by the end of November the intruders had successfully pushed their malware to most of Target’s point-of-sale machines and were able to collect customer card records from real-time transactions.
At that time, Target’s security system was being monitored twenty-four hours a day by a team of specialists working in Bangalore.
Suspicious activity was flagged and the team contacted the primary security team located in Minneapolis, who unfortunately failed to act on the information. The Home Depot hack, which we will look at next, was even bigger but used similar techniques, leading to massive data theft.
The Home Depot hack
On 8 September 2014, Home Depot, which describes itself as the largest home improvement retailer in the world, announced in a press release that its payment data systems had been hacked. In an update on 18 September 2014, Home Depot reported that the attack had affected approximately fifty-six million debit/credit cards.
In other words, the details of fifty-six million debit/credit cards were stolen. In addition, fifty-three million email addresses were also stolen. In this case, the hackers began by stealing a vendor's log-in credentials through a successful phishing attempt, giving them easy access to the system—but only to that individual vendor's part of it.
The next step required the hackers to access the extended system. At that time, Home Depot was using the Microsoft Windows XP operating system, which contained an inherent flaw that the hackers exploited. The self-checkout system was then targeted, since this sub-system was clearly identifiable within the larger network.
Finally, the hackers infected the 7,500 self-checkout terminals with malware to gain customer information. They used BlackPOS, also known as Kaptoxa, a specific malware for scraping credit/debit card information from infected terminals.
For security, payment card information should be encrypted as soon as the card is swiped at a point-of-sale terminal, but apparently this feature, point-to-point encryption, had not been implemented, so the details were left open for the hackers to take.
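The point of point-to-point encryption is that the card number is encrypted inside the card reader itself, so the POS software, and any memory-scraping malware running on it, only ever sees ciphertext. The toy sketch below illustrates this using Python's standard library; the SHA-256-derived keystream stands in for a real cipher such as AES-GCM, and the key and card number are invented for illustration.

```python
import hashlib
import secrets

def keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Toy keystream: SHA-256 in counter mode. Illustration only, not a real cipher."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(4, "big")).digest()
        counter += 1
    return out[:length]

def encrypt_at_swipe(pan: str, key: bytes) -> tuple[bytes, bytes]:
    """Encrypt the card number inside the reader, before it reaches POS memory."""
    nonce = secrets.token_bytes(12)
    data = pan.encode()
    cipher = bytes(a ^ b for a, b in zip(data, keystream(key, nonce, len(data))))
    return nonce, cipher

# The key lives in the tamper-resistant reader, never on the POS computer.
terminal_key = secrets.token_bytes(32)
nonce, cipher = encrypt_at_swipe("4111111111111111", terminal_key)

# A memory scraper on the POS sees only `cipher`; only the payment processor,
# holding the key, can recover the card number (XOR is its own inverse here).
plain = bytes(a ^ b for a, b in zip(cipher, keystream(terminal_key, nonce, len(cipher))))
assert plain == b"4111111111111111"
```

Had a scheme like this been in place, malware such as BlackPOS scraping terminal memory would have harvested only useless ciphertext.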
The theft was uncovered when banks started to detect fraudulent activity on accounts that had made other recent purchases at Home Depot—the card details had been sold through Rescator, a cybercrime outlet found on the dark Web. It is interesting that people using cash registers, which also take cards, were not affected by this attack.
The reason for this appears to be that in the mainframe computer, cash registers were identified only by numbering and so were not readily identifiable as checkout points by the criminals. If Home Depot had also used simple numbering for its self-checkout terminals, this hacking attempt might have been foiled.
Having said that, at the time Kaptoxa was deemed state-of-the-art malware and was virtually undetectable, so given the open access to the system the hackers had obtained, it almost certainly would eventually have been introduced successfully.
The biggest data hack yet
In December 2016, Yahoo! announced that a data breach involving over one billion user accounts had occurred in August 2013. Dubbed the biggest ever cyber theft of personal data, or at least the biggest ever divulged by any company, thieves apparently used forged cookies, which allowed them access to accounts without the need for passwords.
This followed the disclosure of an attack on Yahoo! in 2014 when 500 million accounts were compromised. Chillingly, Yahoo! alleged the 2014 hack was perpetrated by an unnamed ‘state-sponsored actor’.
The list of big data security breaches increases almost daily. Data theft, data ransom, and data sabotage are major concerns in a data-centric world. There have been many scares regarding the security and ownership of personal digital data. Before the digital age, we used to keep photos in albums and negatives were our backup.
After that, we stored our photos electronically on a hard-drive on our computer. This could possibly fail and we were wise to have back-ups but at least the files were not publicly accessible.
Many of us now store data in the Cloud. Photos, videos, home movies all require a lot of storage space and so the Cloud makes sense from that perspective.
When you store your files in the Cloud, you are uploading them to a data centre—more likely, they will be distributed across several centres—and more than one copy will be kept.
If you store all your photos in the Cloud, it’s highly unlikely with today’s sophisticated systems that you would lose them. On the other hand, if you want to delete something, maybe a photo or video, it becomes difficult to ensure all copies have been deleted.
Essentially you have to rely on your provider to do this. Another important issue is controlling who has access to the photos and other data you have uploaded to the Cloud. If we want to make big data secure, encryption is vital.
It has been estimated that in 2015 over 200 billion emails were sent every day, fewer than 10 percent of which were authentic; the rest were spam or sent with malicious intent. Most emails are not encrypted, making their contents vulnerable to interception by hackers.
When I send an unencrypted email, let’s say from California to the UK for example, it is divided into data ‘packets’ and transmitted through a mail server, which is connected to the Internet.
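This division into packets can be pictured as chunking the message and numbering each piece so that it can be reassembled at the other end, whatever order the packets arrive in. The sketch below is a deliberate simplification (real IP packets carry much richer headers, checksums, and addressing):

```python
import random

def to_packets(message: str, size: int = 8):
    """Split a message into numbered chunks, like packets with sequence numbers."""
    data = message.encode()
    return [(seq, data[i:i + size])
            for seq, i in enumerate(range(0, len(data), size))]

def reassemble(packets):
    """Rebuild the message from its packets, regardless of arrival order."""
    return b"".join(chunk for _, chunk in sorted(packets)).decode()

packets = to_packets("An unencrypted email crossing the Atlantic")
random.shuffle(packets)   # packets may take different routes and arrive out of order
assert reassemble(packets) == "An unencrypted email crossing the Atlantic"
```

Note that nothing in this process hides the content: anyone who can read the packets in transit can read the message, which is why interception at cable landing points is so effective against unencrypted mail.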
The Internet is essentially made up of a big worldwide network of wires, above ground, below ground, and below oceans, plus cell phone towers and satellites. The only continent unconnected by transoceanic cables is Antarctica.
So although the Internet and Cloud-based computing are generally thought of as wireless, they are anything but; data is transmitted through fiber-optic cables laid under the oceans.
Nearly all digital communication between continents is transmitted in this way. My email will be sent via transatlantic fiber-optic cables, even if I am using a Cloud computing service.
The Cloud, an attractive buzzword, conjures up images of satellites sending data across the world, but in reality, Cloud services are firmly rooted in a distributed network of data centers providing Internet access, largely through cables.
Fiber-optic cables provide the fastest means of data transmission and so are generally preferable to satellites. The current extensive research into fiber-optic technology is resulting in ever faster data transmission rates.
Transatlantic cables have been the target of some curious and unexpected attacks, including those from sharks intent on biting through the cables.
According to the International Cable Protection Committee, shark attacks account for fewer than 1 percent of the faults logged; even so, cables in vulnerable areas are now often protected with Kevlar.
Assuming there are no problems with transatlantic cables due to inquisitive sharks, hostile governments, or careless fishermen, and my email makes landfall in the UK and continues on its way, it may be at this point that, as with other Internet data, it is intercepted.
In June 2013, Edward Snowden leaked documents revealing that the Government Communications Headquarters (GCHQ) in the UK were tapping into a vast amount of data, received through approximately 200 transatlantic cables, using a system called Tempora.
The Snowden case
Edward Snowden is an American computer professional who was charged with espionage in 2013 after leaking classified information from the US National Security Agency (NSA).
This high-profile case brought government mass surveillance capabilities to the attention of the general public, and widespread concerns were expressed regarding individual privacy.
Since taking this action, Snowden has received many awards and honors, including election as rector of the University of Glasgow, being named the Guardian's Person of the Year 2013, and Nobel Peace Prize nominations in 2014, 2015, and 2016.
In June 2013, the Guardian newspaper in the UK reported that the NSA was collecting metadata from some of the major US phone networks. This report was swiftly followed by the revelation that a program called PRISM was being used to collect and store Internet data on foreign nationals communicating with the US.
A whole slew of electronic leaks followed, incriminating both the US and the UK. A Booz Allen Hamilton employee and NSA contractor working at the Hawaii Cryptologic Center, Edward Snowden, was the source of these leaks, which he sent to members of the media he felt could be trusted not to publish without careful consideration.
Snowden’s motivations and the legal issues involved are beyond the scope of this blog, but it is apparent that he believed that what had started out as legitimate spying on other countries had turned inward, and that the NSA was now spying, illegally, on all US citizens.
The free Web scraping tools DownThemAll, an extension for Mozilla Firefox, and the command-line program wget make it possible to quickly download the entire contents of a website or other Web-related data. These applications, available to authorized users on NSA classified networks, were used by Snowden to download and copy massive amounts of information.
He also transferred large amounts of highly sensitive data from one computer system to another. In order to do this, he needed usernames and passwords, which a systems administrator would routinely have. He thus had easy access to many of the classified documents he stole, but not all.
To get access to higher than top-secret documents, he had to use the authentication details of higher level user accounts, which security protocols should have prevented.
However, since he had created these accounts and had system administrator privileges, he knew the account details. Snowden also managed to persuade at least one NSA employee with security clearance higher than his to tell him their password.
Ultimately, Snowden copied an estimated 1.5 million highly classified documents. Understanding that not all of them should be made public, he handed about 200,000 over to trusted reporters, although relatively few even of these were eventually published.
While the details have never been fully revealed by Snowden, it seems he was able to copy the data onto flash drives, which he apparently had no difficulty in taking with him when he left work for the day. Security measures to prevent Snowden from being able to remove these documents were clearly inadequate.
Even a simple body scan on exiting the facility would have detected any portable devices, and video surveillance in the offices could also have flagged suspicious activity.
In December 2016, the US House of Representatives declassified a document dated September 2016, which remains heavily redacted, reviewing Snowden the man as well as the nature and impact of the leaked documents.
From this document, it is clear that the NSA had not applied sufficient security measures; as a result, the ‘Secure the Net’ initiative has since been put into operation, although it is yet to be fully implemented.
Snowden had extensive system administrator privileges, but given the extremely sensitive nature of the data, allowing one person to have full access with no safeguards was not acceptable.
For example, requiring validation credentials of two people when data was accessed or transferred might have been sufficient to prevent Snowden from illicitly copying files.
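A two-person rule of this kind is simple to express in code: a sensitive operation proceeds only when two distinct, independently authorized people approve it. The sketch below is a hypothetical illustration; a real system would tie each approval to an authenticated session and write every attempt to an audit log.

```python
# Hypothetical list of staff cleared to approve bulk data transfers.
AUTHORIZED = {"admin_a", "admin_b", "auditor_c"}

def transfer_allowed(approvers: set) -> bool:
    """Allow a bulk copy or transfer only with two distinct authorized approvers."""
    valid = approvers & AUTHORIZED
    return len(valid) >= 2

assert not transfer_allowed({"admin_a"})              # one person acting alone is blocked
assert not transfer_allowed({"admin_a", "outsider"})  # an unauthorized co-signer is rejected
assert transfer_allowed({"admin_a", "auditor_c"})     # the two-person rule is satisfied
```

Under such a rule, a lone systems administrator, however privileged, could not have copied classified files without a second cleared person knowingly signing off.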
It is also curious that Snowden could apparently plug in a USB drive and copy anything he wanted. A very simple security measure is to disable DVD drives and USB ports, or not to install them in the first place.
Add further authentication, such as a retina scan, to the password requirement, and it would have been very difficult for Snowden even to access those higher-level documents. Modern security techniques are sophisticated and difficult to penetrate if used correctly.
In late 2016, entering ‘Edward Snowden’ in Google search gave over twenty-seven million results in just over one second; and the search term ‘Snowden’ gave forty-five million results.
Since many of these sites give access to or even display the leaked documents labeled ‘Top Secret’, they are now firmly in the global public domain and will no doubt remain so. Edward Snowden is currently living in Russia.
In contrast with Edward Snowden’s case, WikiLeaks presents a very different story.
WikiLeaks is a huge online whistleblowing organization whose aim is to disseminate secret documents. It is funded by donations and staffed largely by volunteers, though it does appear to employ a few people too.
As of December 2015, WikiLeaks claims to have published (or leaked) more than ten million documents. WikiLeaks maintains its high public profile through its own website, Twitter, and Facebook.
Highly controversial, WikiLeaks and its leader Julian Assange hit the headlines on 22 October 2010 when a vast amount of classified data—391,832 documents—dubbed the ‘Iraq War Logs’ was made public. This followed the approximately 75,000 documents constituting ‘The Afghan War Diary’ that had already been leaked on 25 July 2010.
An American army soldier, Bradley Manning, was responsible for both leaks. Working as an intelligence analyst in Iraq, he took a compact disc to work with him and copied secret documents from a supposedly secure personal computer.
For this, Bradley Manning, now known as Chelsea Manning, was sentenced in 2013 to thirty-five years in prison following conviction, by court-martial, for violations of the Espionage Act and other related offenses.
Former US president Barack Obama commuted Chelsea Manning’s sentence in January 2017, prior to his leaving office. Ms. Manning, who received treatment for gender dysphoria while in prison, was released on 17 May 2017.
Heavily criticized by politicians and governments, WikiLeaks has nonetheless been applauded by and received awards from the likes of Amnesty International (2009) and the UK’s The Economist (2008), among a long list of others. According to their website, Julian Assange has been nominated for the Nobel Peace Prize in six consecutive years, 2010–15.
The Nobel Committee does not release the names of nominees until fifty years have passed but nominators, who have to meet the strict criteria of the Peace Prize committee, often do publicly announce the names of their nominees.
For example, in 2011, Julian Assange was nominated by Snorre Valen, a Norwegian parliamentarian, in support of WikiLeaks exposing alleged human rights violations.
In 2015, Assange had the support of former UK member of parliament George Galloway, and in early 2016 a supportive group of academics also called for Assange to be awarded the prize.
Yet by the end of 2016, the tide was turning against Assange and WikiLeaks, at least in part because of alleged bias in their reporting. Weighed against WikiLeaks are ethical concerns regarding the safety and privacy of individuals; corporate privacy; government secrecy; the protection of local sources in areas of conflict; and the public interest in general.
The waters are becoming increasingly muddied for Julian Assange and WikiLeaks. For example, in 2016, emails were leaked at a time best suited to damage Hillary Clinton’s presidential candidacy, raising questions about WikiLeaks’ objectivity, and prompting considerable criticism from a number of well-respected sources.
Regardless of whether you support or condemn the activities of Julian Assange and WikiLeaks, and almost inevitably people will do both, varying with the issue at stake, one of the big technical questions is whether it is possible to shut down WikiLeaks.
Since it maintains its data on many servers across the world, some of it in sympathetic countries, it is unlikely that it could be completely shut down, even assuming that this was desirable.
However, for increased protection against retaliation following each disclosure, WikiLeaks has issued an insurance file. The unspoken suggestion is that if anything happens to Assange or if WikiLeaks is shut down, the insurance file key will be publicly broadcast. The most recent WikiLeaks insurance file uses AES with a 256-bit key, so it is highly unlikely to be broken.
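The claim that a 256-bit AES key is highly unlikely to be broken is easy to check with back-of-the-envelope arithmetic: even a hypothetical attacker testing a quintillion (10^18) keys per second would need a number of years that vastly exceeds the age of the universe to exhaust the key space.

```python
keys = 2 ** 256                     # size of the AES-256 key space
guesses_per_second = 10 ** 18       # an extremely generous hypothetical attacker
seconds_per_year = 60 * 60 * 24 * 365

years_to_exhaust = keys / (guesses_per_second * seconds_per_year)
print(f"{years_to_exhaust:.2e} years")   # on the order of 10^51 years

assert years_to_exhaust > 1e50
```

For comparison, the universe is roughly 1.4 × 10^10 years old, which is why brute force against AES-256 is considered a non-starter and attacks focus instead on key management and implementation flaws.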
As of 2016, Edward Snowden is at odds with WikiLeaks. The disagreement comes down to how each of them managed their data leaks. Snowden handed his files over to trusted journalists, who carefully chose which documents to leak. US government officials were informed in advance, and, following their advice, further documents were withheld because of national security concerns.
To this day, many have never been disclosed. WikiLeaks appears simply to publish its data with little effort to protect personal information. It still seeks to gather information from whistleblowers, but it is not clear how reliable recent data leaks have been, or indeed whether its selection of the information it presents allows it to be completely disinterested.
On its website, WikiLeaks gives instructions on how to use a facility called TOR (The Onion Router) to send data anonymously and ensure privacy, but you do not have to be a whistleblower to use TOR.
TOR and the dark Web
Janet Vertesi, an assistant professor in the Sociology Department at Princeton University, decided to conduct a personal experiment to see if she could keep her pregnancy a secret from online marketers and so prevent her personal information becoming part of the big data pool. In an article published in TIME magazine in May 2014, Dr. Vertesi gives an account of her experience.
She took exceptional privacy measures, including avoiding social media; she downloaded TOR and used it to order many baby-related items, and in-store purchases were paid for in cash.
Everything she did was perfectly legal but ultimately she concluded that opting out was costly and time-consuming and made her look, in her own words, like a ‘bad citizen’. However, TOR is worth looking at, not least because it made Dr. Vertesi feel safe and maintained her privacy from trackers.
TOR is an encrypted network of servers that was originally developed by the US Navy to provide a way of using the Internet anonymously, and so prevent tracking and the collection of personal data. TOR is an ongoing project, aimed at developing and improving open-source online anonymity environments that anyone concerned about privacy can use.
TOR works by encrypting your data, including the sending address, and then anonymizes it by removing part of the header, crucially including the IP address, since an individual can easily be found by back-tracking given that information. The resulting data package is routed through a system of servers or relays, hosted by volunteers, before arriving at its final destination.
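This routing can be pictured as wrapping the message in one layer of encryption per relay: each relay peels off its own layer, learning only where to send the package next, never the full route or the content. The sketch below is a toy; the XOR 'cipher' built from SHA-256 exists purely to show the layering, whereas real TOR negotiates keys with public-key cryptography and uses proper symmetric ciphers.

```python
import hashlib

def xor_layer(data: bytes, key: bytes) -> bytes:
    """Toy symmetric layer: XOR with a SHA-256-derived keystream (illustration only).
    Applying it twice with the same key removes the layer."""
    blocks = (hashlib.sha256(key + i.to_bytes(4, "big")).digest()
              for i in range(len(data) // 32 + 1))
    ks = b"".join(blocks)[:len(data)]
    return bytes(a ^ b for a, b in zip(data, ks))

relay_keys = [b"relay-1", b"relay-2", b"relay-3"]   # hypothetical relay keys

# The sender wraps the message once per relay, like layers of an onion.
message = b"hello from an anonymous sender"
onion = message
for key in reversed(relay_keys):
    onion = xor_layer(onion, key)

# Each relay on the route peels exactly one layer before forwarding.
for key in relay_keys:
    onion = xor_layer(onion, key)

assert onion == message   # only after the last relay is the plaintext exposed
```

The security property this illustrates is separation of knowledge: no single relay can connect the sender to both the destination and the content.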
On the positive side, users include the military who originally designed it; investigative journalists wishing to protect their sources and information; and everyday citizens wishing to protect their privacy.
Businesses use TOR to keep secrets from other businesses, and governments use it to protect sources of sensitive information as well as the information itself. A TOR Project press release gives a list of some of the news items involving TOR between 1999 and 2016.
On the negative side, the TOR anonymity network has been widely used by cybercriminals. Websites are accessible through TOR-hidden services and have the suffix ‘.onion’. Many of these are extremely unpleasant, including illegal dark websites used for drug dealing, pornography, and money laundering.
For example, the highly publicized website Silk Road, part of the dark Web and a supplier of illegal drugs, was accessed through TOR, making it difficult for law enforcement to track it.
A major court case followed the arrest of Ross William Ulbricht, who was subsequently convicted of creating and running Silk Road, using the pseudonym Dread Pirate Roberts. The website was closed down but later sprang back up again, and in 2016 was in its third reincarnation under the name Silk Road 3.0.
The deep Web refers to all those websites that cannot be indexed by the usual search engines, such as Google, Bing, and Yahoo! It comprises legitimate sites as well as those that make up the dark Web.
It is popularly estimated to be vastly bigger than the familiar surface Web, though even with special deep Web search engines it is difficult to estimate the size of this hidden world of big data.
Big data and society
The eminent economist John Maynard Keynes, writing during the British economic depression in 1930, speculated on what working life would be like a century later. The industrial revolution had created new city-based jobs in factories and transformed what had been a largely agrarian society.
It was thought that labor-intensive work would eventually be performed by machines, leading to unemployment for some and a much-reduced working week for others.
Keynes was particularly concerned with how people would use their increased leisure time, freed from the exigencies of gainful employment by technological advances.
Perhaps more pressing was the question of financial support, leading to the suggestion that a universal basic income would provide a way of coping with the decline in available jobs.
Gradually over the 20th century, we have seen jobs in industry eroded by ever-more sophisticated machines, and although, for example, many production lines were automated decades ago, the Keynesian fifteen-hour working week has yet to materialize and seems unlikely to do so in the near future.
The digital revolution will inevitably change employment, just as the industrial revolution did, but in ways we are unlikely to be able to predict accurately.
As the technology of the ‘Internet of Things’ advances, our world continues to become more data-driven. Using the results of real-time big data analysis to inform decisions and actions will play an increasingly important role in our society.
There are suggestions that people will be needed to build and code machines, but this is speculative and, in any case, is just one area of specialized work where we can realistically expect to see robots increasingly taking the place of people.
For example, sophisticated robotic medical diagnosis could reduce the medical workforce. Robotic surgeons, with extended Watson-like capabilities, are likely. Natural language processing, another big data area, will develop to the point where we cannot tell whether we are talking to a robotic device or a doctor—at least when we are not face-to-face.
However, predicting what jobs humans will be doing once robots have taken over many of the existing roles is difficult. Creativity is supposedly the realm of humans, but computer scientists, working in collaboration at the Universities of Cambridge and Aberystwyth, have developed Adam, a robot scientist.
Adam has successfully formulated and tested new hypotheses in the field of genomics, leading to new scientific discoveries. The research has progressed with a team at the University of Manchester successfully developing Eve, a robot that works on drug design for tropical diseases. Both of these projects implemented artificial intelligence techniques.
The craft of the novelist appears to be uniquely human, relying on experience, emotion, and imagination, but even this area of creativity is being challenged by robots.
The Nikkei Hoshi Shinichi Literary Award accepts novels written or co-written by non-human authors. In 2016, four novels written jointly by people and computers passed the first stage of the competition, without the judges knowing the details regarding authorship.
Although scientists and novelists may eventually work collaboratively with robots, for most of us the impact of our big data-driven environment will be more apparent in our daily activities, through smart devices.