Critical Infrastructure Protection
Protection efforts share some common themes that form the basis underpinning the entire field. This tutorial explores the details of each aspect of critical infrastructure protection, the common themes that hold the process together, and the design and utility of infrastructures.
Design and Utility of Infrastructures
Protection is something that is done to components and composites of components, which we will more often call systems. Infrastructures are almost always systems of systems, with the subsystems controlled by different individuals and groups and with predefined interfaces.
For example, the US highway system is composed of state highway systems and the interstate highway system. These highways are connected to the local road and street systems.
Each locality controls the local streets, states control state highways, and the country is in charge of the interstate system as a whole. The interfaces are the points where these streets and highways contact each other and where other supporting components of the infrastructure contact each other.
For example, most highways have electric lighting at night, and these contact the power infrastructures; most have emergency call booths that contact some communications system; many have rest stops with fresh and wastewater facilities; and so forth.
Each component has a physical makeup based on the physics of devices, and engineering is done to create components with properties, combine them to composites with properties, and combine those into larger and larger systems, each with its own properties.
The infrastructure as a whole has some basic properties as well, and the engineering designs of the components and the way they fit together create those properties.
For example, water systems have incoming water supplies, purification systems, piping of various sorts, pumps and holding stations, pressure controllers, and so forth.
Each of these has properties, such as the strength of the pipe and the resulting water pressure it can hold, the maximum flow rate of a pump, the maximum slew rate of a valve, and so forth.
The overall water system has properties that emerge from these components, such as the water pressure under normal loads, the total amount of water that it can purify per unit time, the maximum holding tank capacities, and so forth.
Engineering takes the properties of the materials and the construction capabilities of the society, along with cost, time, and other constraints, and from them designs and ultimately builds the overall system.
Infrastructures are operated by operators of different sorts. For example, in California, the Independent System Operator (ISO) operates the power grid as a whole, while each of the power providers and consumers operates their facilities.
The price for power is controlled by the local power companies, who are, in turn, controlled by the public utility commission, and they have to buy from the ISO based on the California energy market, an exchange somewhat like the New York Stock Exchange but with very different rules on bidding, buying, and selling.
The different parties have various obligations for their operations; however, each makes its own trade-offs associated with costs and quality of service subject to the regulatory and competitive environments they operate within.
Operators literally turn things on and off, repair things that break, charge customers for services, and do the day-to-day operations of components and overall infrastructures.
Many aspects of operations today in advanced infrastructure systems are controlled by automated systems. These automated control systems are called Supervisory Control and Data Acquisition (SCADA) systems.
They do things like detecting changes in measurable phenomena and altering actuators to adjust the systems to produce proper measured results. Oil pipelines, as an example, run under pressure so that the oil, which is rather thick compared with water, flows at an adequate rate to meet the need.
Too much pressure and the pipes break; too little pressure and the oil stops flowing. As demand changes, the amount of oil flowing out the end changes, so the pumping and valve stations along the way need to adapt to keep the pressure within range.
While a person sitting at a control valve 24 hours a day can do some of this sort of work, automated control valves are far less expensive and more reliable at making small adjustments in a timely fashion than people are.
The SCADA systems communicate information about pressures and flow so that valves can be systematically controlled to keep the overall system properly balanced and so that it can adapt to changing conditions, like a breakdown or a pressure surge.
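The feedback loop just described can be sketched in a few lines of Python. Everything here is invented for illustration: the setpoint, gain, and toy plant model do not reflect any real SCADA product or pipeline, but the structure, measure, compare to setpoint, and adjust the actuator, is the essence of such a control loop.

```python
# Minimal sketch of a SCADA-style control loop: a proportional controller
# nudges a valve toward a pressure setpoint, the way a pipeline pump/valve
# station holds pressure within range. Setpoint, gain, and plant model are
# invented for illustration; real loops are tuned per installation.

SETPOINT = 50.0   # target pressure, arbitrary units
GAIN = 0.01       # proportional gain: how hard to correct each error

def control_step(pressure: float, valve: float) -> float:
    """Return an adjusted valve position for the measured pressure."""
    error = SETPOINT - pressure
    valve += GAIN * error              # open when low, close when high
    return max(0.0, min(1.0, valve))   # actuators have physical limits

def simulate(steps: int = 50) -> float:
    """Run the loop against a toy plant and return the final pressure."""
    pressure, valve = 70.0, 0.5        # start with a pressure surge
    for _ in range(steps):
        valve = control_step(pressure, valve)
        # Toy plant: pressure decays on its own but rises with valve opening.
        pressure = 0.5 * pressure + 40.0 * valve
    return pressure
```

Even in this toy version, the loop absorbs the initial surge and settles the pressure back to the setpoint, which is exactly the small, timely adjustment an automated valve makes better than a person can.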
Management of the operations is done by operators using their management structure and people, while management of operators takes place through a combination of governmental and privately generated external requirements, including those of shareholders, boards of directors, owners, and a wide range of legal and governmental frameworks.
When infrastructures interface at, or cross, borders, they are referred to as international. These exist in the social framework of the societies involved and of the world as a whole.
For example, the Internet is a rapidly expanding global infrastructure that is composed of a wide range of highly compatible technology at the level of network packets.
Impact of Infrastructures on Society
Finally, infrastructures change the worlds they operate within and do so at every level. At the level of the individual who uses specific content, infrastructures like the Internet both provide content and communication and change the way people do what they do as well as the things that they do. Infrastructures become ends in and of themselves, driving whole industries and individual innovation.
Hundreds of millions of people communicate daily using electronic mail, something few of them ever did before the Internet. The time frames of these communications change many things about how they work, what they say, and the language and expressions they use every day.
But this is only the beginning. In the latter part of the 20th century, automated teller machines revolutionized the way people dealt with cash needs. Whereas people previously had to deal with getting cash only on weekdays between 9 a.m. and 5 p.m. at a bank, today, people can get cash almost anywhere almost any time in many cities and towns in much of the world.
This revolutionized the carrying of cash, eliminated many robberies and thefts, and created tracking capabilities for governments over individuals. It meant that instead of being tied to the local bank, people could get the amount of money they needed wherever they were, whenever they needed it. It changed the way people thought about money and the way they spent it.
The highway system changed the nature of travel and work in that people no longer had to live right next to where they worked and goods could be transported point to point rather than running through the rail system, which itself revolutionized transportation before the emergence of trucks and cars.
This enabled different models of commerce, people who lived their lives moving from place to place became far more common, and communities changed forever.
All of these things also changed the consumption patterns of whole societies and altered the environments in which they lived. Moving from place to place also changed the nature of food and how it was delivered.
With the advent of refrigeration and the electrical power grid, food could be preserved over time, allowing far wider distribution of the food and its packaging.
Smaller groups eating more quickly led to fast food and snack food and altered eating habits, while producing far more waste from food and its packaging, consuming more power and more resources, and reshaping family farming into the huge corporate farms that currently dominate.
Water systems changed the face of irrigation but also decimated much of the wildlife and habitat in regions that used to have a lot of available water.
Waste management did wonders for the people living near the oceans, but for quite a long time, much of the waste was dumped into the oceans, causing major changes in the oceanic environment.
Mining produced the materials needed for energy and manufacturing, but strip mining destroyed large areas of land and destroyed much of the capacity of that land to be used for other purposes. Oil production resulted in oil spills that killed off wildlife and poisoned portions of the oceans.
The list goes on and on. These so-called unanticipated consequences of modern society are intimately tied to the infrastructures created by people to support their lifestyles. The complexity of the overall feedback system is beyond the human capacity to model today, but not beyond the capacity of humanity if we decide to model it.
These complex feedback systems that drive extinctions and destruction must be managed if human infrastructures are to thrive while humans survive.
For most of the people living in advanced societies, there is no choice but to find ways to understand and engineer critical infrastructures so that they provide sustainable continuity in the face of these realities.
From the way the power grids get their power to the way societies treat their resources, these critical infrastructures will largely determine the future of those societies and humanity.
Random Nature of Faults, Failures, and Engineering
Engineering would be simple in an ideal world, and the mathematics associated with much of engineering is based on idealizations because of the need to simplify calculations.
Rules of thumb are often used to shortcut complex analysis, engineered systems once analyzed are reproduced in large numbers to avoid re-engineering, and many assumptions are made in the use of components when forming composites from them.
History and extensive analysis create these rules of thumb, and where the assumptions are violated, recalculation is commonly undertaken.
A good example is in digital circuit design, where fan-in and fan-out simplify the analysis of how many outputs can be connected to how many inputs within a given technology.
If the same technology is used between inputs, and outputs and other factors such as temperature, humidity, and the electromagnetic environment remain within specified ranges, no additional calculation is needed. One output can connect to a certain number of inputs and everything will continue to work properly.
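The fan-out rule of thumb amounts to a one-line current budget, sketched below. The drive and load currents are made-up placeholder values in microamps, not figures from any real logic family's datasheet.

```python
# Back-of-the-envelope fan-out check along the lines described above. The
# currents are illustrative placeholders, not real datasheet values.

def max_fan_out(output_drive_ua: int, input_load_ua: int) -> int:
    """How many inputs one output can drive within its rated drive current."""
    return output_drive_ua // input_load_ua

def connection_ok(num_inputs: int, output_drive_ua: int, input_load_ua: int) -> bool:
    """True if wiring num_inputs inputs to a single output stays within spec."""
    return num_inputs <= max_fan_out(output_drive_ua, input_load_ua)

# An output that can drive 16,000 uA feeding inputs that each draw 1,600 uA:
limit = max_fan_out(16_000, 1_600)   # fan-out of 10
```

The simplification is the point: once the rule is established for a technology, designers apply the division instead of re-analyzing each connection, which is exactly the kind of assumption that holds only while the environment stays within specified ranges.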
However, these assumptions can cease to hold, either as a result of natural changes in the operating environment or of malicious attacks by outside actors. Most engineered solutions are designed for specific environments, and design changes in the field can be very expensive; when the environment changes and these assumptions no longer hold, infrastructures that depend on them fail.
Similar examples arise in all areas of infrastructure. Infrastructures fail here and there as components or composites fail, unless adequate redundancy is in place to ensure continuity in the presence of faults in components.
The glory of infrastructures that are properly designed and operated is that when one component or composite fails, the infrastructure as a whole continues to operate, making it resilient to failures in components and composites. Or at least that is true if they are properly designed and operated.
When they are not designed and operated with adequate redundancy and designed to be resilient to failures, we see cascade failures such as those that have brought down major portions of the U.S. and European Union power grids over the past ten years. A typical failure of the infrastructure may occur as follows:
1. The power grid is operating at or near maximum load during hot days in the summer because of the heavy use of air conditioning.
2. The heat produced by the high power usage added to the high outside temperature causes wires in the power grid to expand, lowering them until they come near to trees or other natural or artificial phenomena.
3. As one power line shorts out from the contact, it has to go offline, and the power it was supplying is replaced by power from other sources.
4. The increased loads on those other sources cause them to heat up and some of them hit trees, causing them to shut down.
5. Repeat step 4 until there is not enough power supply to meet demand, or until all of the redundant power lines into an area fail and major outages result.
6. Pretty soon, the shifting loads create power fluctuations that begin to damage equipment, and vast parts of the power grid collapse.
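The cascade in these steps can be captured in a toy model: each line carries load up to a rated capacity, and when a line trips, its load is spread across the survivors, possibly overloading them in turn. All capacities and loads below are invented round numbers, not data from any real grid.

```python
# Toy model of a cascade failure: trip one line, redistribute its load over
# the surviving lines, and let any newly overloaded line trip as well.

def cascade(capacities, loads, tripped_line):
    """Trip one line, redistribute load, and return the surviving line set."""
    alive = {i for i in range(len(capacities)) if i != tripped_line}
    shed = loads[tripped_line]
    while shed > 0 and alive:
        share = shed / len(alive)          # survivors absorb the lost load
        shed = 0.0
        for i in list(alive):
            loads[i] += share
            if loads[i] > capacities[i]:   # overloaded line trips as well
                alive.discard(i)
                shed += loads[i]
                loads[i] = 0.0
    return alive

# A grid running near its limit collapses entirely when one line trips...
hot_day = cascade([100, 100, 100, 100], [90, 95, 85, 80], tripped_line=1)
# ...while the same grid under light load shrugs off the same fault.
mild_day = cascade([100, 100, 100, 100], [40, 50, 30, 20], tripped_line=1)
```

The two cases illustrate why operating margin matters: the same single fault is harmless at light load and catastrophic near maximum load.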
This is not just a fantasy scenario. It has happened several times; in one instance, it resulted in the collapse of power across the Western states of the United States.
There are many other similar scenarios that are related to running the power grid at too close to its maximum capacity and suffering from a failure somewhere that cascades throughout the rest of the system, and every few years, we see a major outage that spreads over a wide area.
Recovery times may last from a few hours to a few days, and there are often broken components that take days or weeks to be repaired.
It has to be noted that the reason for these large-scale outages is that power is shared across vast areas to increase efficiency. Energy is sent from Canada to the United States in the summer and from the United States to Canada in winter.
This saves building more power plants in both countries, each of which would otherwise run at only a portion of its capacity during different parts of the year.
Sharing means that more resources can be brought to bear to meet demands at heavy usage times or during emergency periods, but it also means that interconnections have to be managed and that local effects can spread to far wider areas.
Similar effects in all infrastructures exist, and each is more or less resilient to faults and interdependencies depending on how they are designed, implemented, and operated. By the nature of an infrastructure, it will eventually have faults in components, have components replaced, and be modified for one reason or another.
Whether the city grows and needs more water or there is massive inflation and we need to handle more digits in our financial computers, or a new technology comes along and we need to add electric trains to the existing tracks, changes and faults will produce failures within small portions of infrastructures.
The challenge of critical infrastructure design is to ensure that these happen rarely, for short times, and that their effects are reasonably limited. The way we do this is by making them fail less often, fail less severely or in safer ways, recover more quickly, and tolerate many faults that don’t need to cause failures.
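Several of these levers show up directly in the standard availability figure of merit, availability = MTBF / (MTBF + MTTR): failing less often raises the mean time between failures, and recovering more quickly lowers the mean time to repair. The hour figures in the sketch below are illustrative, not drawn from any real system.

```python
# Availability as a function of how often a system fails (MTBF) and how
# quickly it recovers (MTTR). The figures are made-up round numbers.

def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is up, given failure and repair rates."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

baseline = availability(1_000, 10)          # ~99.0% uptime
fail_less_often = availability(5_000, 10)   # ~99.8% uptime
recover_faster = availability(1_000, 1)     # ~99.9% uptime
```

Either lever, fewer failures or faster recovery, improves the figure; well-designed infrastructures pull on both, and on fault tolerance, which removes some failures from the count entirely.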
Fault Intolerance and Fault Tolerance
Failures are caused by faults that are exercised and not covered by redundancy. For faults in components that are used all of the time and not covered by any redundancy, failures occur as soon as the faults appear. For example, computers typically have clocks that cause the components to operate in synchronization.
If there is a single clock in the computer and it stops working, the computer will stop working. For faults that are not exercised all of the time but do not have redundancy, the fault may occur long before a failure results and the failure may never occur if the fault is never exercised.
A good example of this is a bad emergency brake cable in a manual transmission car that is never used in hilly areas. Even though the cable would not work, the car may never roll down a slope because the emergency brake is never exercised.
The other example of a fault without a failure is the case where there are redundant components covering the situations so that even though faults are exercised, the failures that they could produce are never seen because of redundancy in the system. A good example is a baseball bat with a minor crack in it.
There is natural redundancy in the structure of the wood so that a crack that goes only part way into the bat will not cause the bat to split.
Even though every hit exercises the fault, the bat does not fail. But if there is a fault that is exercised and the redundancy fails, a failure will occur, just as a solid hit in the wrong way will split a partially cracked bat.
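The value of redundancy can be made concrete with a one-line probability calculation: a composite that works as long as at least one of n independent redundant components works fails only when all n fail at once. The 1-in-100 component failure probability below is a made-up round number.

```python
# One-line illustration of how redundancy masks faults, assuming the
# redundant components fail independently of one another.

def composite_failure(p_component: float, n_redundant: int) -> float:
    """P(all n independent redundant components fail at the same time)."""
    return p_component ** n_redundant

single = composite_failure(0.01, 1)   # no redundancy: 1 in 100
dual = composite_failure(0.01, 2)     # one backup: 1 in 10,000
triple = composite_failure(0.01, 3)   # two backups: 1 in 1,000,000
```

The arithmetic leans entirely on the independence assumption; a common-mode fault, like a crack that runs all the way through the bat, defeats exactly that assumption.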
A different notion underlying the design of composites that fail less spectacularly is the notion of fail-safe.
The idea of fail-safe is to design composites so that they tend to fail in a safe mode when enough components fail and cause the composite to fail. Fail-safe modes apply to almost any sort of system, but they are far more important in cases where the consequences of failure are higher.
For example, in nuclear power plants, safe failure modes are a key driver, while in most water systems, fail-safes are only relatively limited parts of the design.
In the Presence of Attackers
The discussion up to here has been about the design principles for composites made up of components under “natural” failure modes, that is, under modes where failures come at random because of the nature of the world we live in, the manufacturing processes we use, and so forth.
There is an implicit assumption that all of these failures are unintended, and that is the assumption we are now going to abandon.
Intentional, Intelligent, and Malicious Attackers
In the presence of intentional, intelligent, malicious attackers, some, but not all, of the assumptions underlying these design principles fall apart.
For example, even the most malicious intentional and intelligent attacks cannot realistically change the laws of physics to the point where water ceases to flow downhill for a substantial period of time.
On the other hand, because designers build circuits, girders, pipes, and other components to operate over particular operating ranges, an intentional, intelligent, malicious attacker could realistically alter some of these operating conditions at some places at some times so as to cause changes in the failure rates or modes of components, thus causing failures in the composites that they form. As a simple example, to cause a valve to fail, one might pour glue into it.
This will most certainly change the conditions for most valves, whose designers did not intend them to operate in an environment in which a high-viscosity fluid binds to the internal parts of the mechanism.
While one could order all of the valves to be sealed to prevent this sort of thing, it would increase the price substantially and prevent only a limited subset of the things that an attacker might do to cause a failure.
Further, other approaches to designing defenses might be less expensive and more effective for a wider range of attacks. For example, putting all valves in physically secured areas might accomplish this, as well as preventing a wide range of other attacks, but that may not be practical everywhere either.
Capabilities and Intents
Having started down this road, it would be a disservice if we failed to mention that real attackers are not infinite in their capacity to attack.
They have real limitations associated with their capabilities, and for the most part, they are motivated in some way toward specific intents. This combination of capabilities and intents can be used to characterize attackers so as to understand what they can realistically do.
Without this sort of threat basis, defenses would have to be perfect and work against unlimited numbers of colluding attackers in order to be effective. With a threat basis for designing defenses, however, the limitations of attackers can be taken into account in the preparation of protection.
Threat capabilities are often considered in terms of finances, weaponry, skill level, number of people, knowledge levels, initial access, and the like. Intents are often characterized in terms of motivating factors, group rewards and punishments, strategies, and tactics.
For example, a high-quality confidence artist typically has little money, weaponry, or initial access but has a lot of skill, a small group of people, and perhaps substantial knowledge and is motivated to get money, stay covert, and rarely use violence.
However, typical terrorist groups have substantial finances, weaponry similar to a paramilitary group, training, and skills in many areas, multiple small teams of people with some specialized knowledge, and no initial access.
They are usually motivated by an unshakable belief system, often with a charismatic leader, use military tactics, and are willing to commit suicide in the process of disrupting a target, killing a lot of people, and making a big splash in the media.
The protection against one is pretty clearly very different from the protection against the other, even though there may be some common approaches. Without doing the research to determine who the threats are and their capabilities and intents, it is infeasible to design a sensible protection system against them.
Different methods are available to assess attacker capabilities and intents. The simplest method is to simply guess based on experience. The problem is that few of the people running or working in infrastructures have much experience in this arena and the experience they have tends to be highly localized to the specific jobs they have had.
More sophisticated defenders may undertake searches of the Internet and publications to develop a library of incidents, characterize them, and understand historical threats across their industry.
Some companies get a vendor who has experience in this area to do a study of similar companies and industries and to develop a report or provide a copy of a report they have previously developed in this area.
In the presence of specific threats, a high-quality, highly directed threat assessment done by an investigative professional may be called for, but that is rarely done in design because the design has to address a spectrum of threats that apply over time.
The most reasonable approach used by most infrastructure providers who want good results is a high-quality general threat assessment done by threat assessment professionals, looking at categories of threats studied over time.
Finally, intelligence agencies do threat assessments for countries, and portions of these assessments may be made available to select infrastructure providers.
Redundancy Design for System Tolerance
Given that a set of threats exist with reasonably well-understood capabilities and intents, a likely set of faults and failure modes for the infrastructure can be described.
For example, if a group that seeks to poison populations is a threat of import and you run a food distribution system, faults might take the form of poison introduced into foodstuffs, and failures might be the delivery of substantial quantities of poisoned food to a population, resulting in some deaths and a general disruption of some part of the food chain for a period of time.
To achieve protection, in the language of fault-tolerant computing, the goal would be to reduce the number of faults and put redundancy in place to tolerate more faults than you would if there was no threat to the food supply.
To do this, a variety of approaches might be undertaken, ranging from sterilization of food in the supply chain, to elimination of sequences in which biological contaminants can be introduced before the sterilization point, to multiple layers of sealed packaging, so that creating a fake repackaged version requires more sophisticated capabilities than are available to the threat.
The general notion is that, just as assumptions about failure modes are used to design systems that tolerate naturally occurring faults in the absence of intentional, malicious, intelligent threats, different fault models are used to design systems that tolerate faults in the presence of those threats.
It turns out that the fault models for higher-grade threats are more complex and the protective measures are more varied than they are for naturally occurring phenomena, but the basic approach is similar.
Some set of potentially redundant protective measures is combined with designs that are less susceptible to faults, producing composites that are relatively less susceptible to failures out of components that are individually more susceptible to faults. Of course, perfection is unattainable, but that is not the goal. The goal is, ultimately, to reduce cost plus loss to a minimum.
This notion of reducing cost plus loss is the goal of risk management. In essence, the risks are formed from the combination of threats, vulnerabilities to the capabilities and intents of those threats inducing failures, and consequences of those failures.
Risk management is a process by which those risks are managed by combining risk avoidance, transfer, reduction, and acceptance with the goal of minimizing cost plus loss.
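The "minimize cost plus loss" framing can be sketched as a small calculation: for each way of handling a risk, add the cost of the measure to the expected loss that remains, and prefer the cheapest total. Every dollar figure and probability below is invented for illustration; real risk assessments are far less tidy, as the text goes on to note.

```python
# Picking among risk-handling options by minimizing cost plus expected loss.
# All figures are invented; they stand in for estimates a real assessment
# would have to produce.

def expected_loss(annual_probability: float, consequence: float) -> float:
    """Expected annual loss from an event with this probability and impact."""
    return annual_probability * consequence

def best_option(options):
    """options: list of (name, measure_cost, residual_probability, consequence)."""
    return min(options, key=lambda o: o[1] + expected_loss(o[2], o[3]))

options = [
    ("accept",         0, 0.10, 1_000_000),  # keep the full risk
    ("reduce",    50_000, 0.01, 1_000_000),  # mitigation lowers the odds
    ("transfer", 120_000, 0.10,   100_000),  # insurance caps the consequence
]
choice = best_option(options)[0]
```

Here acceptance carries an expected loss of 100,000 per year, reduction totals 60,000 including the measure's cost, and transfer totals 130,000, so reduction wins; change the invented numbers and a different strategy does.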
For example, the risk of a nuclear nation launching an intercontinental ballistic missile at your city water plant, causing massive faults that the fences and guards at the gate to the reservoir cannot defend against and the total loss of use of the water system for quite a long time, is a risk that is typically transferred to the national government in its role of providing for the common defense.
The risk of the attack described earlier where someone walks into the back door and plants a wireless access device is likely one that should be reduced (a.k.a. mitigated) until it is hard to accomplish and unlikely to succeed, at which point the residual risk should be accepted.
The risk of having someone walk up and pour poisonous gas into the air intake of your air conditioning system at street level should probably be avoided by not placing air intakes at street level.
Of course, this is only the beginning of a very long list with a lot of alternatives for the different circumstances, and in reality, things do not fit quite so neatly into an optimization formula. Decisions have to be made with imperfect knowledge.
The complexity of risk management gets more extreme when interdependencies are considered. For example, suppose you implemented a defense based on detecting intruders and alarming a guard force to respond to detected intrusions.
While this seems like a reasonable approach at first, the analysis becomes complex when the target is highly valued and the threats are high quality. What if the attacker decides to cut electrical power to the entire location as a prelude to their attack?
Then the sensor system may not function properly, and your response force may not know where to respond. To deal with this, the sensor system and the guard force have to be able to operate in the presence of an outage of external electrical power.
Suppose you do that by putting an uninterruptible power supply (UPS) in place for operation over a 30-minute period and including a motor generator for supplementary power after the initial few minutes of outage, against the event of a long-term external outage.
This sort of analysis is necessary for everything you do to defend your capabilities, and the dependency chain may not be that simple. For example, suppose that the mechanism that turns on the UPS is controlled by a computer.
High-quality attackers may figure this out through their intelligence process and seek to defeat that computer system as a prelude to the power outage part of their attack.
Suppose that the alarm system depends on a computer to prioritize alarms and facilitate initial assessments before devoting a response force and the attackers can gain access to that computer system.
Then in the alarm assessment phase, the actual attack might be seen only as a false alarm, thus suppressing the response for long enough to do the damage. This means that physical security depends on computer security, which depends on the power system, which depends on another computer system.
The chain goes on and on, but not without end, if the designers understand these issues and design to reduce or eliminate interdependencies, at the cost of slightly different designs than designers without this understanding tend to produce.
This is why the security design has to be done along with risk management starting early in the process rather than after the rest of the system is in place.
Imagine all the interdependencies that might be present if no attempt was made to reduce them and you will start to see the difference between a well-designed secure operations environment and an ad hoc response to a changing need for and appreciation of security.
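One way to see such interdependencies is to write the dependency chain down as a small graph and compute everything an asset transitively depends on; an attacker who defeats anything in that set undermines the asset itself. All of the component names below are purely illustrative, loosely following the alarm, power, and UPS example above.

```python
# Dependency chains made explicit: a tiny graph and a reachability walk over
# it. The components and edges are invented for illustration.

DEPENDS_ON = {
    "physical_security": ["alarm_computer", "guard_force"],
    "alarm_computer": ["power"],
    "power": ["ups_control_computer"],
    "guard_force": ["alarm_computer"],
}

def transitive_dependencies(asset, graph=DEPENDS_ON):
    """Everything the asset depends on, directly or indirectly."""
    seen, stack = set(), list(graph.get(asset, []))
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            stack.extend(graph.get(dep, []))
    return seen

deps = transitive_dependencies("physical_security")
```

A design review that runs this kind of closure for each critical asset surfaces surprises like the UPS control computer early, while edges can still be removed cheaply.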
Random Stochastic Models
Relating this back to the notions of faults and failures, the presence of threats creates a situation in which there are far more faults than nature would normally create, and those faults are of different sorts than the random stochastic models of the bathtub curve produce.
They are otherwise highly improbable combinations of faults that occur in specific sequences.
Randomness and nature could never produce most of the sequences seen in attacks, except through the indirect results of nature that produces animals that think, learn, and direct themselves toward goals.
At the same time, every naturally occurring event is observed by the attackers just as it is observed by those protecting infrastructures. When a bridge fails, attackers notice how it happened and may decide to target bridges that have similar conditions to reduce the effort in attack.
Imagine an attacker that decided to attack all of the bridges known to be in poor condition. There are steam, water, and sewage pipes under almost all major cities, and many of them are old and poorly maintained, inadequately alarmed, and unlikely to be well protected. Attackers know this and, if they have a mind to, may target many of them rather than targeting only a few more well-guarded targets.
To provide protection against intentional, intelligent, malicious threats, systems need to tolerate far more numerous and more complex sorts of faults and be hardened against far more vicious, localized, and directed events than nature could throw at them, and defenders must also understand that the death of 1,000 pinpricks may be the mode of attack chosen by some threats.
That is not to say that nature is not a force to be reckoned with. It remains a threat to critical infrastructures as it always has been, but simply dealing with nature is not enough to mitigate the threats of human nature.
To succeed against realistic threats, more and more correlated faults must be considered. Common-mode failures must be largely eliminated to be effective against human attackers, and faults are certain to be exercised in spurts instead of in random distributions.
Step functions in the exposure of faults will occur as attacks subject systems to harsh environments, and any one system will most surely be defeated or destroyed quickly and without notice unless it is covered by another.
In the presence of attackers, engineering takes on whole new dimensions, and assumptions are the things that are exploited rather than the things we can depend upon.
At the infrastructure level, it may be necessary to allow some targets to suffer harm to protect the infrastructure as a whole against greater harm, particularly when the defenders are resource constrained.
There are many approaches, of course. Alarm systems are often characterized in terms of nuisance alarm rates and the likelihood of detection, while medical measurements speak of false positives and false negatives, as do many computer security calculation approaches.
These metrics are used to try to balance alarms against response capabilities, which have very direct costs.
But approaches to risk management that go beyond the simplistic always end up dealing with two critical things. One of them is the nature of the conflict between attackers and defenders in terms of their skill levels and resources. The other is the notion of time and its effects.
Issues of Time and Sequence
In the power grid, time problems are particularly extreme because response times are particularly short. Many people have suggested that we use computers and the Internet to detect outages at one place in the power grid so that we can then notify other parts of the grid before the resulting power surges hit them.
It sounds like a great idea, but it cannot work because the energy disruptions in power grids travel down the power infrastructure at the speed of light in the wires carrying them. While the wires have dips in them as they go from pole to pole, this increases the total distance by only a small percentage.
Power tends to run long distances over fairly straight paths. So if the speed of the signal in the wire is about 2 × 10^8 meters per second (roughly two-thirds of the speed of light in a vacuum) and the distance from California to Washington State is 954 miles, that converts to about 1,535,000 meters, or roughly 7.7 milliseconds of travel time.
Getting an Internet packet from outside of San Francisco, California (about half of the way from Los Angeles to Seattle), to Seattle, Washington, takes something like 35 milliseconds on an Internet connection.
That means that if a computer in San Francisco instantly sent notice to a computer in Seattle the moment there was a failure, it would get to the computer in Seattle roughly 27 milliseconds too late to do anything about it.
Even if the power grid wires went twice as far out of the way as they would in the best of cases, we would still be nearly 20 milliseconds too late, and that assumes that we do no processing whatsoever on either side of the computer connection.
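The race between the surge and the warning packet can be checked in a few lines. The in-wire propagation speed of about two-thirds the vacuum speed of light is an assumption (a typical figure for signals in conductors); the 35 millisecond Internet latency is the figure quoted above.

```python
# Back-of-the-envelope timing for the grid-notification example.
MILES_TO_METERS = 1609.344
distance_m = 954 * MILES_TO_METERS       # California to Washington State
v_wire = 2.0e8                           # m/s, assumed propagation speed
internet_latency_s = 0.035               # one-way packet time (from text)

surge_s = distance_m / v_wire            # when the disturbance arrives
deficit_s = internet_latency_s - surge_s # how late the warning packet is

print(f"surge arrives in {surge_s * 1e3:.1f} ms")
print(f"warning packet is {deficit_s * 1e3:.1f} ms too late")
# Even with a wire path twice as long, the packet still loses the race:
print(f"with a doubled path: {(internet_latency_s - 2 * surge_s) * 1e3:.1f} ms too late")
```

Whatever exact numbers one assumes, the conclusion is robust: the disturbance and the warning travel at comparable speeds, so the warning cannot arrive usefully early.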
Now some may argue that the Internet connection is slow or that our numbers are off by a bit, and they are probably right on both accounts, but that does not change the nature of the speed of light.
While it may be possible to get a signal to Seattle via radio or a laser before the power fluctuation in San Francisco makes its way through the power grid, there will not be enough time to do much, and certainly not enough time to alter the large physical machines that generate the power in a significant way.
The only thing you could hope to do would be to disconnect a portion of the power grid from the rest of the grid, but then you would lose the power supplied by each part to the other and would ensure an outage. This is why safety cutoffs are used on power generation systems, and why power systems are reconstituted slowly, over periods of days or sometimes weeks and months, after large-scale cascade failures.
However, not every infrastructure works like the power grid. Water flows far more slowly than communications signals do, oil in pipelines flows more slowly still, and government decision making often proceeds through multiple legal processes, which are commonly timed in months to years.
The issue of time is as fundamental to protection as the notion of threats. It is embedded in every aspect of protection design, as it is in everyday life. Everything takes time, and with the right timing, very small amounts of force and action can defeat any system or any attack.
The descriptions of sequences of attacks undertaken by malicious actors can be more generally codified in terms of graphs, which are sets of “nodes” connected together by weighted “links.”
These graphs typically describe the states of an attack or the paths from place to place. For example, a graph could be made to describe the vandalism threat against the utility shed. That graph might start with a vandal deciding to do some damage.
The node links to each of the typical intelligence processes used by vandals, which in turn link to the shed as a target. The time taken in these activities is largely unimportant to the defenders, since it precedes any start of the attack that they can detect; however, in other cases where more extensive intelligence efforts are undertaken, this phase can raise important timing and defense issues.
Once the shed has been identified as a target, the vandal might show up with spray paint, or pick up a rock from the ground near the shed, or bring a crowbar. Again, each of these may take time in advance of the attack, but unless the vandal visits the site first and gets detected, it does not matter to the defender.
The spray paint might be applied to the outside of the shed and the vandalism then ends—a success for the attacker, with identifiable consequence to the defender.
Unless the defender can detect the attempted spray painting in time to get response forces to the shed before the consequences are all realized, the defender has failed to mitigate those consequences.
Of course, the consequences may be accrued over long time frames, perhaps months, if the defender does not notice the paint and others view it. The damage accrues over time as people see the vandalism, and that costs a small amount of the reputation of the defender.
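A minimal sketch of such a graph for the vandalism example might look like the following. The nodes, links, and timing weights are all invented for illustration; the defender mitigates an attack path only when the response time fits inside the window between detection and the realization of consequences.

```python
# A toy attack graph: nodes are attack states, weighted links are the
# time (in minutes) each step takes. Structure and numbers are assumed.
graph = {
    "decide":   [("scout", 30), ("arrive", 10)],   # the vandal may skip scouting
    "scout":    [("arrive", 10)],
    "arrive":   [("spray", 2), ("pry_door", 15)],
    "spray":    [("done", 0)],
    "pry_door": [("done", 5)],
}

def paths(node, trail=()):
    """Enumerate every path from `node` to the terminal 'done' state."""
    trail = trail + (node,)
    if node == "done":
        yield trail
        return
    for nxt, _minutes in graph[node]:
        yield from paths(nxt, trail)

def minutes_after(trail, start):
    """Minutes elapsed from `start` to the end of the trail."""
    total, counting = 0, False
    for a, b in zip(trail, trail[1:]):
        counting = counting or a == start
        if counting:
            total += dict(graph[a])[b]
    return total

# Detection is assumed to happen when the vandal arrives at the shed.
response_minutes = 10
for trail in paths("decide"):
    window = minutes_after(trail, "arrive")
    verdict = "mitigated" if response_minutes < window else "consequences realized"
    print(" -> ".join(trail), f"(window: {window} min) ->", verdict)
```

Note how the quick spray-paint paths defeat a 10-minute response while the slower break-in paths do not; the pre-attack scouting time changes nothing for the defender, just as described above.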
Game Theory Modeling
If this is starting to seem like a game in which there are multiple actors making moves for individual advantage and defenders working in concert to protect themselves, you have understood the issues very well.
In fact, the area of game theory is designed to deal with just such strategic situations in which conflicts between actors with different objectives are interacting.
Consider, for example, that the purpose of the shed is to protect the valve from being turned, and as such, its use by a vandal is not particularly harmful.
In some sense, the vandal wins and so does the infrastructure because they are not in strict competition with each other. An even better example would be a vagrant who decided to take up residence in the shed and acted as an unofficial guard.
While this is not particularly desirable for the utility because of liability issues and the inability to detect a real threat, the sides in this conflict, in fact, have different but not conflicting objectives; perhaps a more descriptive word would be noncommon.
Game theory is generally used to model complex situations in which “players” make “moves” and to evaluate “strategies” for how to make those moves.
A game like chess is a two-player, zero-sum game. It is zero-sum because a win for one side is a loss for the other. It uses alternating moves in which each player takes a turn and then awaits the other player; therefore, it is also synchronous.
However, attack and defense games such as those played out in the competition between infrastructure attackers and defenders are not this way.
They are multiplayer, non-zero-sum, and asynchronous, with noncommon objectives in most cases. The defenders have some set of assets they are trying to protect to retain their business utility, while attackers range from people trying to find a place to sleep to nation-states trying to engage in military actions.
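The noncommon-objectives point can be made concrete with a toy payoff table for the vagrant-in-the-shed situation. All payoff numbers here are invented for illustration; the structure is what matters.

```python
# Rows are defender strategies, columns are vagrant strategies; each cell
# holds (defender payoff, vagrant payoff). All numbers are invented.
payoffs = {
    ("ignore", "squat"): (-1,  5),   # liability risk vs. free shelter
    ("ignore", "leave"): ( 0,  0),
    ("evict",  "squat"): (-2, -3),   # eviction cost vs. lost shelter
    ("evict",  "leave"): (-1,  0),   # patrol cost, nothing gained
}

# In a zero-sum game such as chess, every cell would satisfy d + v == 0.
zero_sum = all(d + v == 0 for d, v in payoffs.values())
print("zero-sum?", zero_sum)

# Defender's best response if the vagrant squats: with these payoffs,
# tolerating the squatter beats paying the eviction cost.
best = max(("ignore", "evict"), key=lambda s: payoffs[(s, "squat")][0])
print("defender best response to squatting:", best)
```

Because the payoffs in a cell need not cancel, both sides can come out ahead of their alternatives at once, which is precisely what distinguishes noncommon objectives from strictly opposed ones.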
Model-Based Constraint and Simulations
Simulation is the only technology currently available to generate the sorts of design metrics necessary to understand the operation of a protection system as a whole and with reasonable clarity.
While design principles and analysis provide a lot of useful information, taking the results of that effort and putting it into an event-driven simulation system provides the opportunity to examine hundreds of thousands of different scenarios in a relatively short time frame, generating a wide range of results, and yielding a sense of how the overall system will perform under threat.
While simulation cannot replace real-world experience, generate creative approaches, or tell you how well your people and mechanisms will perform in the face of the enemy, it can allow you to test out different assumptions about their performance and see how deviations in performance produce different outcomes.
By examining statistical results, a sense of how much training and what response times are required can be generated.
Many fault scenarios can be played out to see how the system deviates with time. Depending on the simulation environment, workers can be trained and tested on different situations at higher rates than they would normally encounter to help improve their performance by providing far more experience than they could gain from actual incidents that occur on a day-to-day basis.
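As a sketch of the kind of statistical output such a simulation produces, the following Monte Carlo model estimates how mitigation rates vary with mean response time. Every distribution and rate in it is an invented assumption, not data from any real system; the point is the shape of the question, not the numbers.

```python
import random

# Each scenario draws an attack window (minutes until consequences are
# fully realized), whether the attack is detected, and a responder travel
# time; mitigation requires detection plus arrival inside the window.
def run_scenario(rng, response_mean):
    attack_window = rng.uniform(2, 30)       # assumed spread of windows
    detected = rng.random() < 0.8            # assumed likelihood of detection
    response = rng.gauss(response_mean, 3)   # assumed responder travel time
    return detected and response < attack_window

def mitigation_rate(response_mean, trials=100_000, seed=1):
    rng = random.Random(seed)
    wins = sum(run_scenario(rng, response_mean) for _ in range(trials))
    return wins / trials

for mean in (5, 10, 20):
    print(f"mean response {mean:>2} min -> mitigation rate {mitigation_rate(mean):.1%}")
```

Running many thousands of scenarios like this gives exactly the sense described above of how much a given improvement in response time is actually worth.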
Even long-time experts can learn from simulations, but there are limitations to how well simulations can perform, and simulations can be expensive to build, operate, and use, depending on how accurate and fine-grained you want them to be.
Simulations are also somewhat limited in their ability to deal with the complexity of total situations. One good example of this is intelligence processes, in which elicitation might be used to get an insider to reveal information about the security system and processes.
This might be combined with externally available data, like advertising from providers claiming that they provide some components, and perhaps with testing of the system.
For example, sending someone who appears to be a vagrant to wander into the area of a shed with a valve to see what detection, assessment, and response capabilities are in place and to plant capabilities for future use.
Over a period of time, such a complex attack might involve many seemingly unrelated activities that get fused together in the end to produce a highly effective distributed coordinated attack against the infrastructure element.
If this seems too far out to consider, it might be worthwhile examining what the United States and its coalition did in the first Gulf War to defeat Iraqi infrastructure.
They gathered intelligence on Iraqi infrastructure ranging from getting building plans from those who built portions of the facilities to using satellite and unmanned aerial vehicles to get general and detailed imagery.
They modeled the entire set of infrastructures that were critical to the Iraqi war capability, did analysis, and determined what to hit, where, and in what order to defeat what was a highly resilient million soldier army within a whole country designed for resilience in war.
Optimization and Risk Management Methods and Standards
From the standpoint of the critical infrastructure designer and operator, the protection-related design goal is to provide appropriate protection for the elements of the infrastructure they control to optimize the cost plus loss of their part of the infrastructure.
This is, of course, at odds with the overall goal of the infrastructure of optimizing its overall cost plus loss and at odds with the national or regional goal of optimizing cost plus loss across all infrastructures.
For example, a local utility would be better off from an individual standpoint by doing little to protect itself if it could depend on the national government to protect it, especially since most utilities are in monopoly positions.
However, large central governments tend to be more brittle and to create systems with common mode failures out of a desire for global efficiency and reduced effort, while local decision-makers tend to come to different decisions about similar questions because of local optimizations.
Infrastructures are, of course, different from other systems in that they may cross most if not all geographical boundaries, they must adapt over time to changes in technology and application, and they tend to evolve over time rather than go through step changes.
The notion of rebuilding the Internet from scratch to be more secure is an example of something that is no more likely to happen than rebuilding the entire road system of the world to meet a new standard.
So by their nature, infrastructures have and should have a wide range of different technologies and designs and, with those technologies and designs, different fault models and failure modes.
This has the pleasant side effect of reducing common mode failures and, as such, is a benefit of infrastructures over fully designed systems with highly structured and unified controls. It also makes management of infrastructures as whole entities rather complex and limited.
To mitigate these issues, the normal operating mode of most infrastructures is defined by interfaces with other infrastructure elements and ignores the internals of how those elements operate.
Economic Impact on Regulation and Duties to Protect
As discussed earlier, it is generally impossible to design and operate an infrastructure to handle all threats that can ever exist for all time without interruption of normal services and it is very expensive to try to do so.
In addition, if design and operations were done in this manner, costs would skyrocket and operators who did a poorer job of protection would be more successful, make more money, be able to charge lower prices, and put competitors out of business.
For these reasons, the invisible hand of the market, if left unchecked, will produce weak infrastructures that can handle everyday events but that will collapse in more hostile circumstances.
The question arises of how much the invisible hand of the market should be forced through regulation to meet national and global goals and the needs of the citizenry. There is no predefined answer, but it must be said that the decision is one of public policy.
The challenge for those who protect these infrastructures is how to best meet the needs of the market in the presence of regulations. These requirements form the duty to protect that must be met by the designers and operators.
Duties to protect generally come from the laws and regulations, the owners of the infrastructure and their representatives, outside audit and review mandates, and top management.
Some of these duties are mandatory because they are externally forced, while others are internally generated based on operating philosophy or community standards.
Each individual infrastructure type has different legal and regulatory constraints in each jurisdiction, and as such, each infrastructure provider must pursue its own course of analysis to determine what is and is not mandated and permitted. Nevertheless, we will help to get things rolling by covering the basics.
The Market and the Magnitude of Consequences
The market essentially never favors the presence of security controls over their absence unless the rate of incidents and magnitude of consequences are so high that it becomes hard to survive without strong protective measures in place.
The reason that the invisible hand of the market does not directly address such things is that luck can lead to success.
For example, suppose that there is a 50% chance of a catastrophic attack on some infrastructure element once a year, but that there are 32 companies in direct competition for that market, that security costs increase operating costs by 5%, that margins are 10%, and that four companies pay the price for security.
This simplistic analysis ignores items like the time value of money. After the first year, 14 companies fail, 2 companies that would otherwise have failed continue because they had adequate security, and those not attacked continue to operate. Now we have 18 companies in the market, 4 of them with half the profit of the other 14.
In the second year, seven more fail, with the two that happened to have security surviving from the nine attacked.
Now we have 11 companies left, 7 of which have no security and are more profitable than the 4 that have security by a margin of 10% to 5%, or two to one. In the next year, three more fail, leaving four without security and four with security.
The four without security have now made enough money to buy the four with security, and they abandon the security controls, having demonstrated that they are more efficient and generate higher profits.
The uncontrolled market will do this again and again in situations where markets are moving rapidly, there is a lot of competition and little regulation, and serious incident rates are low enough that it is possible to last a few years without being put out of business. For those who doubt this, look at the software business and the Internet service provider business.
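The three-year arithmetic above can be replayed in a few lines. The 50% attack chance is taken in expectation rather than randomly, and attacks are assumed to fall proportionally on secured and unsecured firms, which is the apportionment the example's numbers imply.

```python
# The 32-company example replayed deterministically: each year half the
# surviving firms are attacked, attacks fall proportionally on secured
# and unsecured firms, and only attacked firms without security fail.
secured, unsecured = 4, 28
history = []
for year in (1, 2, 3):
    total = secured + unsecured
    attacked = total // 2                          # 50% chance, in expectation
    attacked_secured = round(attacked * secured / total)
    failures = attacked - attacked_secured         # unsecured firms that fail
    unsecured -= failures
    history.append(failures)
    print(f"year {year}: {failures} fail; "
          f"{unsecured} unsecured + {secured} secured firms remain")
```

The yearly failure counts, 14 then 7 then 3, reproduce the text's tally, ending with four surviving firms on each side.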
Most physical infrastructures are not this way because they are so critical and because there is rarely a lot of competition. Most cities have few options for getting natural gas, local telephone, electrical, garbage, or sewage services.
However, in the banking arena, Internet services, automobile gas, long-distance services, and other areas, there is substantial competition and therefore substantial market pressure in any number of areas.
An example where protection becomes a market issue is in the release of credit card data on individuals. Laws, which we will discuss in more detail, have forced disclosures of many such releases, which have started to have real impacts on companies.
The replacement of credit cards, for example, costs something on the order of tens of dollars per individual, including all of the time and effort associated with disabling previous versions, sending agreements as to potential frauds, checking credit reports, and so forth.
The losses resulting from these thefts of content are also substantial as the information is exploited on a global basis. The effect on the companies from a standpoint of market presence and reputation can be substantial, and attention of regulators and those who determine rights of usage to public facilities is sometimes affected as well.
Legal Requirements and Regulations
Legal and regulatory requirements for public companies and companies that deal with the public are substantially different from those for private companies that deal only with other companies—so-called business-to-business businesses.
Critical infrastructure providers come in both varieties. While the customer-facing components of critical infrastructures are the apparent parts, much of the back-end of some infrastructures deals only with other organizations and not with the public at large.
Examples of back-end organizations include nuclear power producers who produce power and sell to distribution companies, network service providers that provide high-bandwidth backbone services for the telecommunications industry and financial services, companies that extract natural resources like gas and oil for sale to refineries, and companies that provide large water pipes to send large volumes of water between water districts.
Most of the commercial ventures performing these services are themselves public companies and are therefore subject to regulations associated with public stocks, and of course, many critical infrastructures are government-owned and/or operated or government-sanctioned monopolies.
The regulatory environment is extremely complex. It includes, but is not limited to, regulations on people, things, ownership, reporting, decision making, profit margins, sales and marketing practices, employment, civil arrangements, pricing, and just about anything else you can think of, but most of these are the same as they are for other companies that are not in the critical infrastructure business.
As a result, the number of special laws tends to be limited to issues like eminent domain; competitive practices; standards for safety, reliability, and security; information exchanges with governments and other businesses; and pricing and competition regulations.
Each industry has its own set of regulations within each jurisdiction, and with more than 200 countries in the world and many smaller jurisdictions contained therein, there is an enormous number of these legal mandates that may apply to any given situation.
For example, in California, a law titled SB1386 requires that an unauthorized release of personally identifiable information about a California citizen must produce a notice to that individual of the release or notice to the press of the overall release.
If you are a water company in California and allow credit cards for payment of water bills, you have to be prepared to deal with this. Similar laws exist in many other states within the United States.
If you are a telecommunications provider, you have similar requirements for multiple states.
If you are a global telecommunications provider, you have additional legal requirements about the personal information that may bar you from retaining or transmitting it across national boundaries in the European Union while being required to retain it in other countries, and this is just one very narrow branch of one legal requirement.
Internet sites, typically provided by universities or industry organizations, provide lists of laws related to businesses; they commonly catalog hundreds of different legal mandates surrounding security issues, but these are only a start.
Everything from building codes, which may be localized to the level of the neighborhood, to limits on levels of toxic substances, to fertilizer composition, to temperature controls on storage, is subject to regulation.
Regardless of the business reasons for making protection decisions, these regulatory mandates represent a major portion of the overall protection workload and include many duties that must be identified and resources that must be allocated to carry out these duties.
Contractual obligations are also legal mandates; however, the requirements they produce have different duties and different rewards and punishments for failures to carry them out. Contracts can be, more or less, arbitrary in what they require regarding rewards and punishments. As such, contracts have the potential to vary enormously.
However, in practice, they do not stray very far from a few basics. For critical infrastructures, they typically involve the delivery of a service and/or product meeting time and rate schedules, quality levels, and locations, within cost constraints and with payment terms and conditions.
For example, food is purchased in bulk from growers with government-inspected grades of quality, at costs set by the market, within expiration dates associated with freshness mandates, at quantities and prices set by contracts.
Wholesalers purchase most of the food, which is then either processed into finished goods or sold directly to retailers, with more or less the same sets of constraints.
Retailers sell to the general public and are subject to inspections. While the details may vary somewhat, critical infrastructures most commonly have fairly limited ranges of rates for what they provide, and most rates are published and controlled in some way or another by governments.
Payment processes use payment systems compatible with the financial infrastructure, and information requirements involve limited confidentiality.
All of these legal constraints are subject to force majeure, in which war, insurrection, nationalization, military or government takeover, or other changes out of the control of the provider or their customers change the rules without much in the way of recourse.
Other Duties to Protect
Other duties to protect exist because of management and ownership decisions and the oft-missed obligation to the public. Management and ownership decisions are directly tied to decision making at the highest levels of the enterprise, and the obligation to the public is a far more complex issue.
Ownership and management decisions create what are essentially contractual obligations to employees, customers, and suppliers.
For example, there are legal definitions of the term “organic” in many jurisdictions, and owners who decide to sell organic food create obligations to the buying public to meet those local requirements.
The farmers who sell an organic product must follow rules that are specific to the organic label or be subject to legal recourse. Internet providers who assert that they maintain the privacy of customer information must do so or incur civil liability.
Duty to the public stems from the obligation implied by infrastructure providers to the people they serve. In many cases, critical infrastructure providers have exclusive control over markets in which they operate as monopolies with government sanction.
In exchange for exclusivity, they have to meet added government regulations. They could give up the exclusivity in exchange for reductions in regulations, but they choose not to.
Many companies in the telecommunications field choose to act as “common carriers,” which means that they will carry any communications that customers want to exchange and pay no attention to the content exchanged.
In exchange for not limiting or controlling content, they gain the advantage of not being responsible for it or having legal liability for it.
Common carrier laws have not yet been applied to the Internet in most places, leaving it subject to enormous numbers of unnecessary outages and other problems that disrupt its operation, while telephone lines continue to operate without these problems, largely because of common carrier laws and fee structures.
Employee and public safety and health are other areas of duty that are implied and often mandated after providers fail to meet their obligations on a large scale. For emerging infrastructures, this takes some time to evolve, but for all established infrastructures, these duties are defined by laws and regulations.
Warnings of catastrophic incidents, evacuations, and similar events typically call for interactions between critical infrastructure providers and local or federal governments.
In most cases, reporting chains are defined by regulation or other means, but not in all cases. For example, if a nuclear power plant has a failure that has potential public health and safety issues, it always has a national-level contact it makes within a predefined time frame.
If a fire causes an outage in a power substation, regulatory notifications may be required, but affected parties find out well before it makes the media because their lights go out.
If a gas pipeline is going to be repaired during a scheduled maintenance process, previous notice must be given to affected customers in most cases, and typically the maintenance is scheduled during minimal-usage periods to minimize effects.
Special needs, like power for patients on life support systems, or manufacturing facilities with very high costs associated with certain sorts of outages, or "red tag" lines in some telecommunications systems that have changes locked out for one reason or another, also create obligations that require special attention and induce special duties to protect.
For providers serving special government needs, such as secure communications associated with the U.S. “Emergency Broadcast System” or the “Amber Alert” system, or public safety closed circuit television systems, additional duties to protect are present. The list goes on and on.
Finally, natural resources and their uses include duties to protect in many jurisdictions and, in the case of global treaties, throughout the world.
For example, many critical infrastructure providers produce significant waste byproducts that have to be safely disposed of or reprocessed for a return to nature or other uses.
In these cases, duties may range from simple separation into types for differentiated recycling or disposal to the requirements associated with nuclear waste for processing and storage over hundreds or even thousands of years.
Life cycle issues often involve things like dealing with the contamination caused by chemicals put into the ground near power lines to prevent plant growth: as rain falls and the earth moves, contaminants spread through groundwater into nearby areas.
While today, only a few critical infrastructure providers have to deal with these protection issues, over time, these life cycle issues will be recognized and become a core part of the critical infrastructure protection programs that all providers must deal with.
Strategies and Operations
The protection space, as you may have guessed by now, is potentially very complex. It involves a lot of different subspecialties, and each is a complex field, most with thousands of years of history behind them.
Rather than summarize the last 10,000 years of history in each of the subspecialties here, an introduction to each will be provided to give a sense of what they are and how they are used in critical infrastructure protection.
Needless to say, there is a great deal more to know than will be presented here, and the reader is referred to the many other fine books on these subjects for additional details.
Protect is defined herein as “keep from harm.” Others identify specific types of harm, such as “damage” or “attack” or “theft” or “injury.” There are all sorts of harm. Keeping critical infrastructures from being harmed has an underlying motivation in keeping people from being harmed, both over the short run and over the long run.
At the end of the day, if we have to disable a power or water system to save people's lives, we should do so. As a result, somewhere along the line, the focus has to point to the people served by the critical infrastructure.
Now, this is a very people-focused view of the world, and as many will likely note, protection of the environment is very important. The focus on people is not one of selfishness; rather, it is one of expediency.
Since harm to the environment will ultimately harm people in the long run, environmental protection is linked to people protection. While it would be a fine idea to focus on the greater good of the world or, perhaps, by implication, the universe, protecting the world might be best served by eliminating all humans from it.
This will not likely get past the reviewers, so we will focus on the assumption that keeping the critical infrastructures from being harmed serves the goal of keeping people from being harmed, even though we know it is not always so.
At the same time, we will keep in mind that, at the strategic level, things are heavily intertwined, and the interdependencies drive an overall need to protect people (because they are the ones the infrastructures were built to serve) and the implied strategic need to protect the world that those people depend upon.
Without physical security, no assurances can be provided that anything will be as it is desired. All critical infrastructures have physicality. Protecting that physicality is a necessary but not sufficient condition for providing services and goods and for protection of all sorts.
At the same time, perfect physical security is impossible to attain because there is always something with more force than can be defended against in the physical space.
Nuclear weapons and forces of nature such as earthquakes and volcanoes cannot be stopped by current defenses in any useful way, but each is limited in physical scope.
Asteroids of large enough size and massive nuclear strikes will likely put an end to human life if they come to pass in the foreseeable future, and protection against them is certainly beyond the scope of the critical infrastructure provider.
Similarly, the fall of governments, insurrections, and the chaos that inevitably follows make continuity of critical infrastructures very difficult, and delivery of products and services will be disrupted to one extent or another.
However, many other forces of nature and malicious human acts may be successfully protected against and must be protected to a reasonable extent to afford stability to critical infrastructures.
Physical security for critical infrastructures generally involves facility security for central and distributed offices and other structures containing personnel and equipment, distribution system security for the means of attaining natural resources and delivering finished goods or services, and a range of other physical security measures associated with the other business operations necessary to sustain the infrastructure and its workers.
As an example, an oil pipeline typically involves a supply that comes from an oil pumping station of some sort that connects to underground storage and supply, a long pipe that may go under and over ground, a set of pressure control valves along the way, and a set of delivery locations where the oil is delivered to the demand.
The valves and pumps have to be controlled and are typically controlled remotely through SCADA systems with local overrides. Supply has to be purchased and demand paid for, resulting in a financial system interface.
The pipeline has to be maintained so there are people who access it and machines that might work it from the inside during maintenance periods.
For anything that does not move, physical security involves understanding and analyzing the physicality of the location it resides in. The analysis typically starts from a “safe” distance and moves in toward the protected items, covering everything from the center of the Earth up to outer space.
In the case of sabotage, for example, the analysis need only start at the attacker's starting location and extend to the distance from which the damage can be done, given the attacker's capabilities and intents.
This series of envelopes around the protected items also has to be analyzed in reverse when the goal is to prevent physical acts once the protected items are reached, such as theft, which involves getting the protected items out of the location to somewhere else.
Each enveloped area may contain natural and/or artificial mechanisms to deter, prevent, or detect and react to attacks, and the overall system is adapted over time.
The risks increase because of the higher consequences, both to end demand and from damage to the infrastructure itself. A major gas pipe explosion can kill many people, start many fires, and take a long time to fix. At the head end, it can cripple portions of a city, and in winter it can cost many lives.
The problem with protecting long-distance fixed transportation components of infrastructures is that electrical power lines, other energy pipelines, and ground-based telecommunications transport media must traverse large distances. The cost of effective preventive protection along the entire route is too high to bear for the consequences experienced in most societies today.
Even in war zones, where sabotage is common, every inch of these distribution systems cannot be protected directly.
Rather, there are typically zones of protection into which limited human access is permitted; sensors and distance are put in place to delay the attack; and detection of attack, along with rapid response, makes the price of a successful attack high in human terms.
There are only so many suicide bombers out there at any given time, and the total number is not so large that it is worth putting perimeter fencing with sensors and rapid response guards next to every inch of infrastructure.
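The protection-zone logic above can be sketched as a simple timing model: an attack is interrupted only if detection time plus response time beats the attacker's total task time. This is an illustrative sketch with hypothetical numbers, not any provider's actual model.

```python
def attack_interrupted(task_time_s, detect_at_s, response_time_s):
    """An attack is interrupted if responders arrive (moment of detection
    plus response time) before the attacker completes the work."""
    return detect_at_s + response_time_s < task_time_s

# Hypothetical numbers: cutting into a pipeline takes 900 s of work;
# sensors detect the intrusion 120 s in; guards arrive 600 s after alarm.
print(attack_interrupted(900, 120, 600))  # 720 s < 900 s -> True
```

The design levers follow directly from the inequality: barriers and distance increase task time, better sensors shrink the detection delay, and nearby response forces shrink the response time.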
When we talk about personnel security, we are generally talking about things we do to protect the critical infrastructures from malicious human actors who are authorized to act, rather than about protecting those humans from harm or protecting against personnel who are not authorized to act.
This includes methods intended to gain and retain people who are reliable, trustworthy, and honest, making provisions to limit the potential negative effects of individuals and groups of individuals on the infrastructures, deterrence, surveillance, and combinations of rewards and punishments associated with proper and improper behaviors.
The personnel life cycle typically starts for the critical infrastructure provider with a job application.
At that point, applicants undergo a background investigation to check on what they said about themselves and its veracity, to verify their identity, and to gain an understanding of their history as an indicator of future performance and behavior.
Not all providers do these activities, and those that do not are far more likely to have problem employees and problem infrastructures.
Depending on the type of check undertaken, the background can reveal previous criminal acts, lies on the application (which are very common today), foreign intelligence ties, a false identity, high debt levels or another financial difficulty, or any number of other things that might affect employment. For many positions of high trust, clearances by the government may be required.
Obviously, police would be hesitant to hire people with criminal records, firefighters would hesitate to hire arsonists, and financial industries would hesitate to hire financial criminals or those with high debt, but any of these are potentially problematic for all of these positions and many more.
After the background check and clearance process is undertaken, protection continues by limiting the assignment of personnel to tasks.
For example, foreign nationals might be barred from certain sensitive jobs involving infrastructures that provide services to government agencies and military installations, people with inadequate experience or expertise might not be assigned to jobs requiring high skill levels in specialty areas, and people without government clearances would be barred from work at certain facilities.
In the work environment, all workers must be authenticated to a level appropriate to the clearance they have, and the authentication and authorization levels for certain facilities may be higher than others.
Over the period of work, some workers may not be granted access to some systems or capabilities unless and until they have worked at the provider for a certain length of time, under the notion that trust grows with time.
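The combined check described here, clearance level plus time-based trust, can be sketched as a single gating function. The clearance ordering, level names, and numbers below are hypothetical illustrations, not any real provider's policy.

```python
from datetime import date

def access_allowed(clearance, required_clearance, hire_date,
                   min_tenure_days, today=None):
    """Grant access only when the worker's clearance meets the facility's
    requirement AND enough tenure has accrued (trust grows with time)."""
    levels = ["none", "confidential", "secret", "top_secret"]  # assumed ordering
    today = today or date.today()
    tenure_ok = (today - hire_date).days >= min_tenure_days
    clearance_ok = levels.index(clearance) >= levels.index(required_clearance)
    return clearance_ok and tenure_ok

# A worker with "secret" clearance hired over a year ago, at a facility
# requiring "secret" and 365 days of tenure:
print(access_allowed("secret", "secret", date(2023, 1, 1), 365,
                     today=date(2024, 3, 1)))  # True
```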
Periodic reinvestigation of workers might be undertaken to see if they are somehow getting rich when they are not highly paid or have large debts that are growing, making them susceptible to blackmail.
For highly sensitive positions, workers may have to notify their employer of arrests, travel to certain locations, marriages and divorces, and even relationships outside of marriage.
Obviously, this information can be quite sensitive and should be carefully protected as well, but that is covered under information protection below.
Human reliability studies have been performed on a wide array of people and for many factors, and in particularly sensitive jobs, these sorts of efforts can be used as a differentiator.
Behaviors are often identified after the fact for workers who have violated trust, but on a predictive basis, such indicators are poor at identifying people who will betray trust. People who appear very loyal may in fact just be good at deception or trained to gain insider access.
Insiders who might normally be worthy of trust may be put under duress and, at the risk of having family members killed or infidelity exposed, may violate trusts. Again, the personnel security issues are very complex.
Operations security has to do with specific processes (operations) undertaken. It tends to be in effect for a finite period of time and be defined in terms of specific objectives.
Threats are identified relative to the operation, vulnerabilities are associated with the capabilities and intents of the specific threats to the operation, and defensive measures are undertaken to defeat those threats for the duration of the operation. These defenses tend to be temporary, one-time, unstructured, and individualized.
Operations consist of special-purpose efforts, typically mounted to meet a crisis or an unusual one-off situation. The creation of the trans-Alaska pipeline was an operation requiring operations security, but its normal use requires operational security.
The bridge that collapsed in Oakland, California, due to a fuel truck fire was repaired in a matter of a few weeks, and this is clearly an exceptional case, an operation requiring operations security. However, the normal process of building and repairing roads is an operational security issue.
For operations, security is a one-off affair; thus, it is typically less systematic and thoughtful in its design, it tends to seek a workable one-off solution rather than an optimized one, and costs are not controlled in the same way because long-term life-cycle costs are typically not considered.
Decisions to accept risks are far more common, largely because they are being taken once instead of many times, so people can be far more attuned and diligent in their efforts than they will be day after day when the same things are repeated.
In some cases, operations security is more intensive than operational security because it is a one-off affair, so more expensive and specialized people and things can be applied.
Also, there is little, if any, history to base decisions on because each instance is unique, even if a broader historical perspective may be present for experienced operations workers.
Operational security is the term we use for the security needed around normal and exceptional business processes. This type of security tends to continue indefinitely, to be repeated regularly, and not to be focused on a specific time frame or target.
In other words, these business processes are the day-to-day things done to make critical infrastructures work. Protection of normal operations tends to be highly structured and routine, revisited periodically, externally reviewed, and evolutionary.
Information protection addresses ensuring the utility of content. Content can be in many forms, as can its utility.
For example, names and addresses of customers and their current amounts due are useful for billing and service provisioning, but if that is the sole purpose of their presence, they lose utility when applied to other uses, such as being stolen for use in frauds or sold for advertising purposes.
Since utility depends on the context of the infrastructure, there is no predefined utility; information systems must therefore be designed around each infrastructure provider's specific needs, or they will not optimize the utility of content.
The cost of custom systems is high, so most information systems in most critical infrastructures are general purpose and thus leave a high potential for abuse.
In addition to the common uses of content such as billing, advertising, and so forth, critical infrastructures and their protective mechanisms depend on information for controlling their operational behaviors.
For example, SCADA systems are used to control the purification of water, the voltage and frequency of power distribution, the flow rates of pipelines, the amount of storage in use in storage facilities, the alarm and response systems of facilities, and many similar mechanisms; without the proper operation of these controls, the infrastructures will not continue to operate.
These controls are critical to the operation and if not properly operating can result in loss of service; temporary or long-term loss of utility for the infrastructure; the inability to properly secure the infrastructure;
damage to other devices, systems, and capabilities attached to the infrastructure; or, in some cases, inter-infrastructure collapse through the interdependency of one infrastructure on another.
For example, an improperly working SCADA system controlling stored water levels in a water tower could empty all of the storage tanks, thus leaving inadequate supply for a period of time. As the potential for negative consequences of lost information utility increases, so should the certainty with which that utility is ensured.
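The water tower example suggests the kind of guard a properly designed control system would apply: outflow commands are clamped so the storage can never be drained below a reserve. This is a minimal sketch; the function name, units, and limits are assumptions for illustration.

```python
def safe_valve_command(level_m, requested_outflow, max_outflow,
                       min_level_m=2.0):
    """Clamp a SCADA outflow command so the tank is never drained below a
    minimum reserve level; an unchecked command could empty the storage."""
    if level_m <= min_level_m:
        return 0.0  # reserve reached: refuse any further draw-down
    return min(requested_outflow, max_outflow)

print(safe_valve_command(10.0, 5.0, 3.0))  # clamped to the flow limit: 3.0
print(safe_valve_command(1.5, 5.0, 3.0))   # below reserve level: 0.0
```

The key design point is that the limit is enforced locally, at the actuator, so a compromised or malfunctioning remote command cannot produce the catastrophic state.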
Information is subject to regulatory requirements, contractual obligations, owner- and management-defined controls, and decisions made by executives. Many aspects of information and its protection are subject to audit and other sorts of reviews.
As such, a set of duties to protect are defined and there is typically a governance structure in place to ensure that controls are properly defined, documented, implemented, and verified to fulfill those duties.
Duties are codified in the documentation that is subject to audit, review, and approval and that defines a legal contract for carrying out protective measures and meeting operational needs.
Typically, we see policy, control standards, and procedures as the documentation elements defining what is to be done, by whom, how, when, and where. As tasks are performed, these tasks are documented and their performance reviewed with sign-offs in logbooks or other similar mechanisms.
These operational logs are then used to verify from a management perspective that the processes as defined were performed and to detect and correct deviations from policy.
The definition of controls is typically required to be done through an approved risk management process intended to match surety to risk to keep costs controlled while providing adequate protection to ensure the utility of the content in the context of its uses.
This typically involves identifying consequences based on a business model defining the context of its use within the architecture of the infrastructure, the threats and their capabilities and intents for harming the infrastructure, and the architecture and its protective features and lack thereof.
Threats, vulnerabilities, and consequences must be analyzed in light of the set of potentially complex interdependencies associated with both direct and indirect linkages. Risks can then be accepted, transferred, avoided, or mitigated to levels appropriate to the situation.
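A crude sketch of the final treatment decision follows: compare the cost of each option and pick the cheapest that is permissible. All figures are hypothetical, and avoidance (not undertaking the activity at all) is deliberately left out for brevity.

```python
def choose_treatment(annualized_loss, mitigation_cost, transfer_premium,
                     acceptable_loss):
    """Pick the cheapest of accept / transfer / mitigate for a single risk.
    Acceptance is only permissible when the expected loss is tolerable."""
    options = {
        "accept": annualized_loss if annualized_loss <= acceptable_loss
                  else float("inf"),  # intolerable losses cannot be accepted
        "transfer": transfer_premium,   # e.g., insurance premium
        "mitigate": mitigation_cost,    # e.g., cost of added controls
    }
    return min(options, key=options.get)

# Expected loss of $50k/yr exceeds the $10k tolerance, so acceptance is out;
# mitigation at $20k beats a $30k transfer premium:
print(choose_treatment(50000, 20000, 30000, 10000))  # mitigate
```

Real risk management processes weigh far more factors (interdependencies, surety levels, indirect consequences), but the structure of the choice is the same.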
In large organizations, information protection is controlled by a chief information security officer or some similarly titled position.
However, most critical infrastructure providers are small local utilities that have only a few tens of workers in total and almost certainly do not have a full-time information technology (IT) staff. If information protection is controlled at all, it is controlled by the local IT worker.
As in physical protection, deterrence, prevention, detection and response, and adaptation are used for protection. However, in smaller infrastructure providers, design for prevention is predominantly used as the means of control because detection and response are too complex and expensive for small organizations to handle and adaptation is too expensive in its redesigns.
While small organizations try to deter attacks, they are typically less of a target because of the more limited effects attainable by attacking them.
As is the case for physical security, information protection tends to be thought of in terms of layers of protection encircling the content and its utility; however, most information in use today gains much of its utility through its mobility. Just as in transportation, this limits uses of protective measures based on situational specifics.
To be of use, information must be processed in some manner, taking information as input and producing finished goods in the form of information useful for other purposes at the other end of each step of its production. Information must be protected at rest, in motion, and in use to ensure its utility.
Control of the information protection system is typically more complex than that of other systems because information systems tend to be interconnected and remotely addressable to a greater degree than other systems.
While a pipeline has to be physically reached to do harm, a SCADA system controlling that pipeline can potentially be reached from around the world by using the interconnectedness of systems whose transitive closure reaches the SCADA system.
While physically partitioning SCADA and related control systems from the rest of the world is highly desirable, it is not the trend today. Indeed, regulatory bodies have forced the interconnection of SCADA systems to the Internet in an attempt to make more information available in real time.
Further, for larger and interconnected infrastructures such as power and communications systems, there is little choice but to have long-distance connectivity to allow shared sourcing and distribution over long distances.
Increasingly complex and hard-to-understand and manage security barriers are being put in place to allow the mandated communication while limiting the potential for exploitation.
In addition, some efficiency can be gained through collaboration between SCADA systems, and this efficiency translates into a lot of money, exchanged for an unquantified amount of reduction in security.
SCADA systems are only part of the overall control system that functions within an infrastructure for protection. Less time-critical control systems exist at every level, from the financial system within a nonfinancial enterprise to the governance system in which people are controlled by other people. All of these, including the paper system, are information systems.
All control systems are based on a set of sensors, a control function, and a set of actuators. These must operate as a system within limits or the system will fail. Limits are highly dependent on the specifics of the situation, and as a result, engineering design and analysis are typically required to define the limits of control systems.
These limits are then coded into systems with surety levels and mechanisms appropriate to the risks. Most severe failures come about when limits are improperly set or the surety of the settings or controls limiting those settings being applied is inadequate to the situation at hand. For example, the slew rate of a water valve might have to be controlled to prevent pipe damage.
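The valve slew-rate example can be sketched as a standard rate limiter placed between the commanded setting and the actuator. The rates and tick interval below are illustrative assumptions.

```python
def slew_limited(current, target, max_rate, dt):
    """Move a valve setting toward its target no faster than max_rate
    (units per second), preventing pressure-surge damage to the pipe."""
    step = max(-max_rate * dt, min(max_rate * dt, target - current))
    return current + step

# Valve at 0% commanded to 100%, limited to 5%/s, with 1 s control ticks:
pos = 0.0
for _ in range(3):
    pos = slew_limited(pos, 100.0, 5.0, 1.0)
print(pos)  # 15.0 after three ticks, regardless of the commanded jump
```

As the text notes, the failure mode to guard against is not only a wrong limit value but inadequate surety around how the limit itself can be changed.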
In other infrastructures, such as financial systems, control systems may be far more complex, and it may not be possible to completely separate them from the Internet and the rest of the world.
For example, electronic payment systems today operate largely over the Internet, and individuals, as well as infrastructure providers, can directly access banking and other financial information and make transfers or payments from anywhere.
In such an infrastructure, a far more complex control system with many more actuators and sensors is required, and a far greater management structure is going to be needed.
In voting systems, to do a good job of ensuring that all legitimate votes are properly cast and counted, a paper trail or similar unforgeable, obvious, and hard-to-dispute record has to be apparent to the voters and the counters.
The recent debacles associated with electronic voting have clearly demonstrated the folly of such trust in information systems when the risks are so high, the systems so dispersed, and the operators so untrained, untrusted, and inexperienced. These systems have largely been out of control and therefore untrustworthy for this use.
Intelligence and Counterintelligence Exploitation
Understanding threats and the current situation involves an effort to gain intelligence. Defeating attempts to characterize the infrastructure for exploitation is called counterintelligence because it is intended to counter the adversary's intelligence process.
In the simplest case, a threat against an infrastructure might have a declared intent to cause failures for whatever reason.
This threat is characterized in terms of capabilities and intents to identify whether there are any weaknesses in the infrastructure that leave the threat improperly addressed.
If there are, then temporary and/or permanent changes may be made to the infrastructure or its protective systems to address the new threat. Depending on the urgency and severity, immediate action may be required, and, of course, threats may be sought out and arrested through law enforcement action, destroyed or disabled by military action through government, and so forth.
Based on the set of identified and anticipated threats and threat types, those threats are likely to undertake efforts to gain information about the infrastructure to attack it.
The counterintelligence effort focuses on denying these threats the information they need for a successful attack and exploiting their attempts to gain intelligence to defeat their attempts to attack.
It is essentially impossible to discuss intelligence and counterintelligence separately because they are two sides of the same coin. To do either well, you need to understand the other and understand that they are directly competitive.
A simple defeat approach might be to refuse them the information they need by classifying it as confidential, but this will not likely stop any serious threat from trying other means to gain that information.
For example, if one wants to attack a power infrastructure, an attacker can simply start anywhere that has power and trace back the physical power lines to get to bigger and bigger power transmission facilities, control and switching centers, and eventually power sources.
Trying to stop an attacker by not publishing the power line maps will not be very effective, and in the Internet and satellite imagery era, attackers can literally follow the wires using overhead imagery to create their own map.
Clearly, little can be done about this particular intelligence effort, but perhaps a defender can conceal the details of how these infrastructures are controlled or other such things to make some of the attacker’s jobs harder. To get a sense of the nature of this game, a table sometimes works for looking at intelligence and counterintelligence measures.
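Such a table can be sketched programmatically. The pairings below draw on the examples in the text; the counterintelligence entries are candidate measures for illustration, not a complete or authoritative mapping.

```python
# Illustrative pairing of attacker intelligence techniques with candidate
# counterintelligence measures (examples only, not exhaustive).
measures = [
    ("trace power lines via overhead imagery",
     "conceal control details; little else is practical"),
    ("review widely available literature",
     "classify or limit sensitive publications"),
    ("call in to map who works where",
     "train staff to recognize elicitation"),
    ("tour facilities taking photographs",
     "restrict photography; escort visitors"),
]
for intel, counter in measures:
    print(f"{intel:<42}| {counter}")
```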
Life Cycle Protection Issues
As the previous discussion shows, protection issues in critical infrastructures apply across life cycles. Life cycle issues are commonly missed, and yet they are obvious once identified.
From the previously cited CISO Tool Kit—Governance Guidebook, life cycles for people, systems, data, and businesses have to be considered, and for the more general case, life cycles for all resources consumed and outputs and waste generated have to be considered. That means modeling the entire process of all infrastructure elements “from the womb to the tomb”—and beyond.
Consider the ecological infrastructure of a locality, even ignoring regional and global issues. As natural resources are pulled from the Earth to supply the infrastructure, they are no longer available for future use; the location they were taken from may be altered and permanently scarred; other life forms living there may be unable to survive; the relationship of those resources to their surroundings may alter the larger-scale ecology; and, over longer time frames, these effects may outweigh the benefits of the resources themselves.
Suppose the resource is coal burned to supply power. The extraction may produce sinkholes that disrupt other infrastructures like gas lines or the water table, or it may create problems for roads or future uses.
The coal, once removed, typically has to be transported, using a different infrastructure, to the power plant. If this is at a distance, more energy is used in transportation, there is interdependency on that infrastructure for the power infrastructure, and the transportation is part of the life cycle that can be attacked and may have to be defended.
Since the coal depends on the transportation infrastructure, the security of that infrastructure is necessary for the coal to go where it is going and the security systems may have to interact, requiring coordination.
For example, if the fuel was nuclear rather than coal, different transportation security needs would be present, and if the power plant is running low and previous attacks have caused transportation to be more expensive, these attacks may have to be protected against as well.
The steps go on and on throughout the interacting life cycles of different things, people, systems, and businesses, and all of these lifecycle steps and interactions have to be accounted for to understand the protective needs for the individual and overall infrastructures.
Regardless of the details involved in each infrastructure element, the nature of life cycles is that there are many complex interacting elements involved in them and they are best managed by the creation of models that allow them to be dealt with systematically and analyzed in conjunction with other models of other life cycles.
From a protection standpoint, these models allow the analyst to cover things more thoroughly and with more certainty than they otherwise would likely be able to do, and as events in the world show weaknesses in the models, the models can be updated as part of their life cycles to improve with age and experience.
This not only allows the protection analyst to improve with time but also provides the basis for the creation of policies, control standards, and procedures designed to meet all of the modeled elements of the life cycles of all infrastructure components.
Thus, the models form the basis for understanding the protective needs and the life cycles help form the basis for the models.
These lifecycle models can also be thought of as process models; however, they are described and discussed this way to ensure that the processes cover all aspects of all interacting components from before they are created until after they are consumed. The model of life cycles is one that itself helps ensure that protection coverage is complete.
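One minimal way to make such a model systematic is an explicit state machine over life-cycle stages, so that no stage (such as post-destruction clean-up) can be silently skipped. The stage names and allowed transitions below are illustrative assumptions, not a standard taxonomy.

```python
# Sketch of a component life-cycle model: each stage carries its own
# protection concerns, and transitions are checked explicitly.
ALLOWED = {
    "conception": {"creation"},
    "creation": {"operation_and_maintenance"},
    "operation_and_maintenance": {"upgrade", "decommissioning"},
    "upgrade": {"operation_and_maintenance"},
    "decommissioning": {"cleanup_and_restoration"},
    "cleanup_and_restoration": set(),  # terminal stage
}

def advance(stage, nxt):
    """Move a component to its next life-cycle stage, refusing skips."""
    if nxt not in ALLOWED[stage]:
        raise ValueError(f"cannot skip from {stage} to {nxt}")
    return nxt

print(advance("operation_and_maintenance", "decommissioning"))
```

Attaching the protection analysis to each stage of such a model is what lets the analyst claim coverage from before a component is created until after it is consumed.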
Finally, it is important to note that while all infrastructure components have finite life cycles, the totality of infrastructures is intended to have an infinite life cycle. Life cycles that are usually noticed are at the inception of the idea of having the infrastructure, the creation of the components and composites, and their ultimate destruction.
However, the absolutely critical elements of maintenance and operation, upgrades, and post-destruction clean-up and restoration of the surrounding environment are widely ignored by the public at large, even though they are the hard part that has to be done day to day.
Change happens whether we like it and plan for it or not. If we fail to manage it, it will cause critical infrastructures to fail, while if we manage it reasonably well, the infrastructures will change with the rest of the world and continue to operate and facilitate our lifestyles and the advancement of humanity and society.
While those who wish to tear down societies may wish to induce changes that are disruptive, change management is part of the protective process that is intended to help ensure that this does not happen.
Changes can be malicious, accidental, or intended to be beneficial, but when changes occur and protection is not considered, they will almost certainly produce weaknesses that are exploitable or result in accidental failures.
In a sense, change management belongs under the heading of life cycles, and yet it is typically handled separately because it is considered within each part of a life cycle, most commonly within the normal operating portion of the overall life cycle of any component of interest.
Strategic Critical Infrastructure Protection
Strategic critical infrastructure protection is about the overall long-term protection of the totality of critical infrastructure and humanity as a whole.
As such, it is less about the protection of any given infrastructure element and more about the protection of the evolving overall community of support systems that support human life and society on Earth and, eventually, as humanity expands through space, to other places.
As a starter, it must be recognized that all resources are finite and that the notion that “the solution to pollution is dilution” cannot stand. The notion of sustainability, however, must be balanced for some time with the notion of progress, because if humanity is to survive the ultimate end of the sun or even the next major asteroid hit, we will need to improve our technology.
This then means that we need to expend our limited nonrenewable (at least for a long time to come they are not renewable) resources wisely to advance ourselves to the level where we can exist on renewable resources alone.
Coal will run out eventually, but oil will run out sooner, at least here on Earth, given current consumption patterns. In the time frame of infrastructures, coal is not yet a serious problem, but oil is, because production is at or near its all-time peak and will start to decline, never again to return to its previous levels.
Going to coal means more pollution and has many other implications, and that means that protection of the power and energy infrastructure implies research and development in that arena with plans for transition and change management starting now rather than at the last minute.
Technology and Process Options
There are a lot of different technologies and processes that are used to implement protection. A comprehensive list would be infeasible to present without an encyclopedic volume, and the list changes all the time, but we would be remiss if all of the details were left out.
The lists of such things, being so extensive, are far more amenable to computerization than printing in books.
Rather than add a few hundred pages of lists of different things at different places, we have chosen to provide the information within a software package that provides what amounts to checklists of the different sorts of technologies that go in different places. To give a sense of the sorts of things typically included in such lists, here are some extracts.
In the general physical arena, we include perimeters; access controls; concealments; response forces; property location and geology; property topology and natural barriers; property perimeter artificial barriers; signs, alarms, and responses; facility features and paths; facility detection, response, and supply; facility time and distance issues; facility location and attack graph issues; observe, orient, decide, and act (OODA) loops; perception controls; and locking mechanisms.
Within locking mechanisms, for example, we include a selection of lock types; electrical, mechanical, fluid, and gas lock-out controls; time-based, location-based, event sequence-based, and situation-based access controls; lock fail-safe features; lock default settings; and lock tamper-evidence.
Similar sorts of lists exist in other areas. For example, in technical information security, under network firewalls, we list outer router, routing controls, and limitations on ports, gateway machines, demilitarized zones (DMZs), proxies, virtual private networks (VPNs), identity-based access controls, hardware acceleration, appliance or hardware devices, inbound filtering, and outbound filtering.
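The kind of checklist the text describes can be sketched as a small data structure with per-category coverage scoring. The entries are abbreviated extracts from the lists above; the coverage metric is an illustrative assumption, not part of any published tool.

```python
# Abbreviated checklist: categories map to candidate mechanisms.
checklist = {
    "locking mechanisms": ["lock types", "electrical lock-out controls",
                           "time-based access controls", "lock tamper-evidence"],
    "network firewalls":  ["outer router", "DMZ", "VPN",
                           "inbound filtering", "outbound filtering"],
}

def coverage(selected):
    """Fraction of checklist items a design actually includes, per category."""
    return {cat: sum(item in selected for item in items) / len(items)
            for cat, items in checklist.items()}

print(coverage({"lock types", "DMZ", "inbound filtering"}))
```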
Each of these has variations as well. Under operations security, which is essentially a process methodology supported by technologies from all areas of security, we list the time frame of the operation, the scope of the operation, threats to the operation, secrets that must be protected, indicators of those secrets, capabilities of the threats, intents of the threats, observable indicators present, vulnerabilities, the seriousness of the risk, and the countermeasures identified and applied.
In the analysis of intelligence indicators, we typically carry out or estimate the effects of these activities that are common to many threats:
Review widely available literature;
Send intelligence operatives into adversary countries, businesses, or facilities;
Plant surveillance devices (bugs) in computers, buildings, cars, offices, and elsewhere;
Take inside and outside pictures on building tours;
Send e-mails in to ask questions;
Call telephone numbers to determine who works where, and to get other related information;
Look for or build up a telephone directory;
Build an organizational chart;
Cull through thousands of Internet postings;
Do Google and other similar searches;
Target individuals for elicitation;
Track the movement of people and things;
Track customers, suppliers, consultants, vendors, service contracts, and other business relationships;
Do credit checks on individual targets of interest;
Use commercial databases to get background information;
Access individuals' travel histories, including airline reservations and records of when they go where;
Research businesses people have worked for and people they know;
Find out where they went to school and chat with friends they knew from way back;
Talk to neighbors, former employers, and bartenders;
Read the annual report; and
Send people in for job interviews, some of whom get jobs.
It rapidly becomes apparent that (1) the number of alternatives is enormous for both malicious attacks and accidental events, (2) the number of options for protection is enormous and many options often have to be applied in combination,
and (3) no individual can attain all of the skills and knowledge required to perform all of the tasks in all of the necessary areas to define and design the protective system of an infrastructure.
Even if an individual had all of the requisite knowledge, they could not possibly have the time to carry out the necessary activities for a critical infrastructure of substantial size. Critical infrastructure protection is a team effort requiring a team of experts.
Protection Design Goals and Duties to Protect
In a sense, the goal of protection may be stated as a reduction in negative consequences, but in real systems, more specific goals have to be clarified. There is a need to define the duties to protect if those duties are going to be fulfilled by an organization.
The obvious duty that should be identified by people working on critical infrastructure protection is the duty to prevent serious negative consequences from occurring,
but as obvious as this is, it is often forgotten in favor of some other duty, like making money for the shareholders regardless of the implications for society as a whole.
A structured approach to defining duties to protect uses a hierarchical process starting with the top-level definition of duties associated with laws, owners, directors, auditors, and top management.
Laws and regulations are typically researched by a legal team and defined for internal use. Owners and directors define their requirements through the set of policies and explicit directives.
Auditors are responsible for identifying applicable standards against which verification will be performed and the enterprise measured. Top executives identify day-to-day duties and manage the process.
Duties should be identified through processes put in place by those responsible; however, if this is not done, the protection program should seek out this guidance as one of its duties to be diligent in its efforts.
Identified duties should be codified in writing and be made explicit, but if this is not done by those responsible, it is again incumbent on the protection program to codify them in documentation and properly manage that documentation.
There is often resistance to any process in which those who operate the protection program seek to clarify or formalize enterprise-level decisions.
As an alternative to creating formal documents or forcing the issue unduly, the protection executive might take the tactic of identifying the duties that are clarified in writing and noting, in the documentation provided for the design of the protection program, that no other duties have been stipulated.
The operating environment has to be characterized to gain clarity in the context of protection. Just as a bridge designer has to know the expected loads, the length of the span, the likely range of weather conditions,
and other similar factors to design the bridge properly, the protection designer has to know enough about the operating environment to design the protection system to operate in the anticipated conditions.
The specific parameters depend heavily on the infrastructure type and protection area. For example, physical security of long-distance telecommunications lines has different operating environment parameters than does personnel security in a mining facility.
Security-related operating environment issues tend to augment normal engineering issues because they include the potential actions of malicious actors in the context of the engineering environment.
While engineers design bridges to handle natural hazards, the protection specialist must find ways to protect those same bridges when they are attacked in an attempt to intentionally push them beyond design specifications.
The protection designer has to understand what the assumptions are and how these can be violated by intentional attackers, and this forms the operating environment of the protection designer.
A systematic approach to design is vital to success in devising protection approaches. Without some sort of method to the madness, the complexity of all of the possible protection designs is instantly overwhelming. There are a variety of design methodologies.
There are many complaints in the literature about the waterfall process, in which specifications are developed, designs are undertaken, evaluations of alternatives are completed, and selections are made, with a feedback loop into the previous elements of the process.
However, despite the complaints, this process is still commonly embraced by those who are serious about arriving at viable solutions to security design challenges. In fact, this process has been well studied and leads to many positive results, but there are many alternative approaches to protection design.
As an overall approach, one of the more meaningful alternative approaches is to identify the surety level of the desired outcome for the overall system and its component parts. Surety levels can be thought of in fairly simple terms: low, medium, and high, for example.
For low surety, a different process is undertaken because the consequences are too low to justify serious design effort. For medium consequences, a systematic approach is taken, but not pushed to the limit of human capability for design and analysis.
For high consequences, the most certain techniques available are used and the price is paid regardless of the costs.
Of course, realistic designers know that there is no unlimited cost project, that there are tradeoffs at all levels, and that such selection is only preliminary, and this sort of iterative approach to reducing the space of possibilities helps to focus the design process.
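The surety-triage idea above can be sketched in a few lines of code. The thresholds, cost units, and process names here are purely illustrative assumptions, not values from any standard; a real program would calibrate them to the enterprise.

```python
def surety_level(worst_case_consequence):
    """Map a worst-case consequence estimate (arbitrary cost units)
    to a surety level used to select a design process.
    Thresholds are hypothetical."""
    if worst_case_consequence < 1e4:
        return "low"     # too small to justify serious design effort
    if worst_case_consequence < 1e7:
        return "medium"  # systematic, but not pushed to the limit
    return "high"        # most certain techniques, cost is secondary


def design_process(level):
    """Select a (hypothetical) design process for a surety level."""
    return {
        "low": "reuse standard designs with a simple checklist",
        "medium": "systematic design and analysis",
        "high": "highest-assurance techniques with full analysis",
    }[level]
```

As the text notes, such a triage is only a preliminary cut used to focus the iterative design process, not a final allocation of effort.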
Process, Policy, Management, and Organizational Approaches
This is very similar to other engineering disciplines, and rightly so. Protection system design is an engineering exercise, but it is also a process definition exercise: along with all of the artifacts that are created,
there are operational procedures and process requirements that allow the components to operate properly together to form the composite. Protection is a process, not a product.
The protection system and the infrastructure as a whole have to function and evolve over time frames, and in the case of the protection system, it has to be able to react in very short time frames as well as adapt in far longer time frames.
As a result, the process definitions and the roles and actions of the parties have to be defined as part of the design process, in much the same way as the control processes of a power station or water system require that people and processes be defined while the plant is designed.
The difference is that in infrastructures like power plants and water systems, the people in these specialty fields and their management typically already know what to expect. In protection, they do not.
The problem of inadequate knowledge at the management and operational level relating to protection will solve itself over time, but today, it is rather serious.
The technology has changed in recent years, and the changes in the threat environment have produced serious management challenges to Western societies, but in places like the former Soviet Union and in oppressive societies with internal distrust, these systems are well understood and have been in place for a long time.
The challenge is getting a proper mix of serious attention to protection and reasonable levels of trust based on reasonable assumptions.
A management process must be put in place in order to ensure that whatever duties are identified and policies mandated, they are managed so that they get executed, the execution is measured and verified, and failures in execution are mitigated in a timely fashion.
The protection designer must be able to integrate the technical aspects of the protection system into the management aspects of the infrastructure provider to create a viable system that allows the active components of the protection system to operate within specifications; otherwise, the overall protective system will fail.
This has to take into account the failures in the components of the active system, which include not only technology but also people, business process, management failures, and active attempts to induce failures.
For example, an inadequate training program for incident evaluation will yield responses that cause inadequate resources to be available where and when needed, leading to reflexive control attack weaknesses in the protection system.
These sorts of processes have to be deeply embedded in the management structure of the enterprise to be effective. Otherwise, management decisions about seemingly irrelevant matters will result in successful attacks.
A typical example is a common decision to put content about the infrastructure on the Internet for external use with business partners.
Once the information is on the Internet, it is available on a more or less permanent basis to attackers, many of whom constantly seek out and collect permanent records of all information on potential future targets.
It is common for job descriptions to include details of operating environments in place, which leads attackers to the in-depth internal knowledge of the systems in use.
Because there are a limited number of systems used within many infrastructure industries, a few hints rapidly yield a great deal of knowledge that is exploitable in attacks.
In one case, a listing of vendors was used to identify lock types, and a vulnerability testing group was then able to get copies of the specific lock types in use, practice picking those locks, and bring special pick equipment to the site for attacks.
This reduced the time to penetrate barriers significantly. When combined with a floor plan that was gleaned from public records associated with a recent renovation, the entry and exit plan for covert access to control systems were devised, practiced, and executed.
If management at all levels does not understand these issues and make day-to-day operational decisions with this in mind, the result will be the defeat of protective systems.
The recognition that mistakes will be made is also fundamental to the development of processes. It is not only necessary to devise processes associated with the proper operation of the protective system and all of the related information and systems.
In addition, the processes in place have to deal with compensation for failures in the normal operating modes of these systems so that small failures do not become large failures.
In a mature infrastructure process, there will not be heroic individual efforts necessary for the protective system to work under stress. It will degrade gracefully to the extent feasible given the circumstance, according to the plan in place.
The policy is typically missing or wrong when infrastructure protection work is started, and it is not always fixed when the work is done. It is hard to get top management to make policy changes and all the harder in larger providers. Policies have to be followed and have legal standing within companies, while other sorts of internal decisions do not have the same standing.
As a result, management is often hesitant to create policy. In addition, the policy gives leverage to the protection function, which is another reason that the management in place may not want to make such changes.
Since security is usually not treated as a function that operates at top management levels, there is typically nobody at that level to champion the cause of security, and it gets short shrift.
Nevertheless, it is incumbent on the protection architects and designers to find ways to get policies in place that allow leverage to be used to gain and retain an appropriate level of assurance associated with their function.
Given that there are specified business and operational needs, specified duties to protect, and a reasonably well-defined operating environment, the proposed architectures and designs, along with all of the processes, management, and other things that form the protection program
and plan, need to be evaluated to determine whether protection is inadequate, adequate, or excessive; whether it is reasonably priced and performing well for what is being gained; and to allow alternatives to be compared.
Unlike engineering, finance, and many other fields of expertise that exist in the world, the protection arena does not have well-defined and universally applied analysis frameworks.
Any electrical engineer should be able to compute the necessary voltages, currents, component values, and other things required to design and implement a circuit to perform a function in a defined environment.
Any accountant can determine a reasonable placement of entries within the double-entry bookkeeping system. However, if the same security engineering problem is given to a range of protection specialists, the answers are likely to be highly divergent.
One of the many reasons for the lack of general agreement in the security space is that a vast array of knowledge is necessary to understand the entire space, and those who work in it span an equally wide range of expertise.
Another challenge is that many government studies on the details of things like fence height, distances between things, and so forth, are sensitive because if the details are known, they may be more systematically defeated, but on the whole, the deeper problem seems to stem from a lack of a coherent profession.
There are many protection-related standards, and to the extent that these standards are embraced and followed, they lead to more uniform solutions with a baseline of protection.
For example, health and safety standards mandate a wide range of controls over materials; building codes ensure that protective fences do not fall over in the wind or accidentally electrocute passersby;
standards for fire safety ensure that specific temperatures are not reached within the protected area for a defined period of time under defined external conditions; standards for electromagnetic emanations limit the readability of signals at a distance; and shredding standards make it very hard to reassemble most shredded documents when the standards are met.
While there are a small number of specialized experts who know how to analyze these specific items in detail, protection designers normally just follow the standards to stay out of trouble—or at least they are supposed to.
Unfortunately, most of the people who work designing and implementing protective systems are unaware of most of these standards, and if they are unaware, they most certainly do not know whether they are following these standards and cannot specify them as requirements or meet them in implementation.
From a pure analysis standpoint, there is a wide range of scientific and engineering elements involved in protection, and all of them come to bear in the overall design of protective systems for infrastructures.
However, the holy grail of protection comes in the form of risk management: the systematic approach to measuring risk and making sound decisions about risk based on those measurements.
The problem with this starts with the inability to define risk in a really meaningful way, followed by the inability to measure the components in most definitions, the high cost of accurate measurements, the difficulty in analyzing the effect of protective measures on risk reduction, and the step functions in results associated with minor changes in parameters.
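The difficulty can be illustrated with the most common (and much-criticized) formulation: risk as the product of threat likelihood, vulnerability, and consequence. All inputs below are hypothetical estimates; the point is how strongly the output swings with a modest change in a parameter that is hard to measure in the first place.

```python
def risk(threat, vulnerability, consequence):
    """Naive annualized risk estimate: likelihood terms in [0, 1],
    consequence in cost units. A common textbook formulation, not a
    definitive one; the text explains why it is hard to apply well."""
    return threat * vulnerability * consequence


# Hypothetical inputs: shifting one poorly measured likelihood
# from 0.10 to 0.20 doubles the computed risk.
baseline = risk(0.10, 0.50, 1_000_000)  # about 50,000 cost units
shifted = risk(0.20, 0.50, 1_000_000)   # about 100,000 cost units
```

Because protective measures often change these inputs in step-function rather than smooth ways, small errors in estimation can flip which design alternative appears cheapest.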
Standard Design Approaches
The standard design approaches are based on the notion that in-depth protection science and/or engineering can be applied to define a design that meets the essential criteria that work for a wide range of situations.
By defining the situations for which each design applies, an organization can reduce or eliminate design and analysis time by simply replicating a known design where the situation meets the design specification criteria.
Thus, a standard fence for protecting highways from people throwing objects off of overpasses can be applied to every overpass that meets the standard design criteria, and “the paralysis of analysis” can be avoided.
The caveats that have to be watched carefully in these situations are that
(1) the implementations do indeed meet the design criteria, (2) the design actually does what it was intended to do, and (3) the criteria are static enough to allow a common design to be reproduced in place after place.
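The replication rule can be sketched as a simple criteria check: a standard design is reused only where the site situation meets every design criterion. The criterion names, thresholds, and the overpass-fence example below are invented for illustration.

```python
# Hypothetical criteria under which a standard overpass-fence
# design is considered applicable.
STANDARD_OVERPASS_FENCE = {
    "max_span_m": 60,
    "max_wind_kph": 150,
    "pedestrian_access": True,
}


def design_applies(site):
    """True if the site situation meets every criterion of the
    standard design; otherwise a custom design effort is needed."""
    c = STANDARD_OVERPASS_FENCE
    return (site["span_m"] <= c["max_span_m"]
            and site["wind_kph"] <= c["max_wind_kph"]
            and site["pedestrian_access"] == c["pedestrian_access"])
```

A check like this addresses the first and third caveats above; the second (that the design actually works) still requires in-depth analysis or testing of the standard design itself.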
It turns out that, to a close approximation, this works well at several levels. It works for individual design components, for certain types of composites, and for architectural level approaches.
By using such approaches, analysis, approval processes, and many other aspects of protection design and implementation are reduced in complexity and cost, and if done on a large scale, the cost of components can go down because of mass production and competition. However, mass production has its drawbacks.
For example, the commonly used mass production lock and key systems used on most doors are almost uniformly susceptible to the bump-key attack.
As the sunk cost of a defense technology increases and it becomes so standard that it is almost universal, attackers will start to define and create attack methods that are also readily reproducible and lower the cost and time of the attack. Standardization leads to common mode failures.
The cure to this comes in the combinations of protective measures put in place. The so-called defense-in-depth is intended to mitigate individual failures, and if applied systematically with variations of combinations forming the overall defense, then each facility will have a different sequence of skill requirements for attack and the cost to the attackers will increase while their uncertainty increases as well.
They have to bring more and more expensive things to increase their chances of success unless they can gather intelligence adequate to give away the specific sequences required, and they have to have more skills, train longer, and learn more to be effective against a larger set of targets.
This reduces the threats that are effective to those with more capabilities and largely eliminates most of the low-level attackers (the so-called ankle biters) that consume much of the resources in less well-designed approaches.
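The effect of varying control combinations on attacker cost can be shown with a toy model: an attacker who wants to cover many facilities must acquire the union of the skills their defenses demand. The control names and skill costs are hypothetical.

```python
# Hypothetical cost for an attacker to acquire each skill.
ATTACK_SKILL_COST = {"pick_lock": 2, "bypass_alarm": 5,
                     "defeat_camera": 3, "clone_badge": 8}


def attacker_cost(facilities):
    """Cost to cover all facilities: the union of skills required
    across every facility's defensive controls."""
    needed = set()
    for controls in facilities:
        needed.update(controls)
    return sum(ATTACK_SKILL_COST[skill] for skill in needed)


# Identical defenses everywhere: one cheap skill covers every site.
uniform = attacker_cost([["pick_lock"], ["pick_lock"], ["pick_lock"]])
# Varied defense-in-depth: each site demands a different combination.
varied = attacker_cost([["pick_lock", "defeat_camera"],
                        ["bypass_alarm", "defeat_camera"],
                        ["clone_badge", "pick_lock"]])
```

The model also reflects the uncertainty point: without intelligence on which sequence a given site uses, the attacker must prepare for all of them.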
As it turns out, there is also a negative side effect to effective protection against low-level attacks. As fewer and fewer attackers show up, management will find less and less justification for defenses. As a result, budgets will be cut and defenses will start to decay until they fail altogether in a rather spectacular way.
This is why bridges fall down and power systems collapse and water pipes burst in most cases. They become so inexpensive to operate and work so well that maintenance is reduced to the point where it is inadequate. It works for a while and then fails spectacularly.
Consequently, where businesses run infrastructures and short-term profits are rewarded over long-term surety, management is highly motivated and rewarded for shirking maintenance and protection, leaving success in these areas to luck.
So we seem to have come full circle. Standard designs are good for being more effective with less money, but as you squeeze out the redundancy and the costs, you soon get to common mode failures and brittleness that cause collapses at some future point in time.
So along with standard designs, you need standard maintenance and operational processes that have most of the same problems, unless rewards are aligned with reliability and long-term effectiveness. Proper feedback, then, has to become part of the metrics program for the protection program.
Design Automation and Optimization
For protection fields, there is only sporadic design automation and optimization, and the tools that exist are largely proprietary and not sold widely on the open market.
Unlike circuit design, building design, and other similar fields, there has not been a long-term academic investigation of most areas of protection involving intentional threats that have moved to mature the field.
While there are many engineering tools for the disciplines involved in protection, most of these tools do not address malicious actions. The user can attempt to use these to model such acts, but these tools are not designed to do so and there are no widely available common libraries to support the process.
In the risk management area, as a general field, there are tools for evaluating certain classes of risks and producing aggregated risk figures, but these are rudimentary in nature, require a great deal of input that is hard to quantify properly, and produce relatively little output that has a material effect on design or implementation.
There are reliability-related tools associated with carrying out the formulas involved in fault tolerant computing and redundancy, and these can be quite helpful in determining maintenance periods and other similar things, but again, they tend to ignore malicious threats and their capacity to intentionally induce faults.
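The core arithmetic such reliability tools carry out can be illustrated with the standard parallel-redundancy formula: n independent components, each with availability a, give a combined availability of 1 - (1 - a)^n. The figures are illustrative, and note the caveat from the text: the formula assumes independent, non-malicious failures, an assumption an attacker can deliberately violate.

```python
def parallel_availability(a, n):
    """Availability of n independent redundant components, each with
    availability a. Assumes failures are independent and accidental,
    which malicious threats can intentionally defeat."""
    return 1.0 - (1.0 - a) ** n


single = parallel_availability(0.99, 1)  # 0.99
dual = parallel_availability(0.99, 2)    # about 0.9999
```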
For each of the engineering fields associated with critical infrastructures, there are also design automation tools, and these are widely used, but again, these tools typically deal with the design issue, ignoring the protective issues associated with anything other than nature.
Control systems represent a different sort of IT than most designers and auditors are used to. Unlike the more common general-purpose computer systems in widespread use, these control systems are critical for the moment-to-moment functioning of mechanisms that, in many cases, can cause serious negative physical consequences.
Generally, these systems can be broken down into sensors, actuators, and programmable logic controllers (PLCs), themselves controlled by supervisory control and data acquisition (SCADA) systems.
They control the moment-to-moment operations of motors, valves, generators, flow limiters, transformers, chemical and power plants, switching systems, floor systems at manufacturing facilities, and any number of other real-time mechanisms that are part of the interface between information technologies and the physical world.
When they fail or fail to operate properly, regardless of the cause, the consequences can range from a reduction in product quality to the deaths of tens of thousands of people, and beyond, and this is not just theory.
It is the reality of incidents like the chemical plant release in Bhopal, India, that killed thousands of people in a matter of hours, and the Bellingham, Washington, SCADA failure of the Olympic Pipeline Company that, combined with other problems in the pipeline infrastructure at the time, resulted in three deaths and put the pipeline company out of business.
Control Systems Variations and Differences
Control systems are quite a bit different from general-purpose computer systems in several ways. These differences, in turn, make a big difference in how they must be properly controlled and audited and, in many cases, make it impossible to do a proper audit on the live system. Some of the key differences to consider include, without limit, the following:
They are usually real-time systems. Denial of services or communications for periods of thousandths of a second or less can sometimes cause catastrophic failure of physical systems, which in turn can sometimes cause other systems to fail in a cascading manner.
This means that real-time performance of all necessary functions within the operating environment must be designed and verified to ensure that such failures will not happen.
It also means that they must not be disrupted or interfered with except in well-controlled ways during testing or audits. It also means that they should be as independent as possible of external systems and influences.
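The real-time constraint can be sketched as a control cycle that must finish within a hard deadline. The 1 ms budget and the trivial work functions are assumptions for illustration; real controllers enforce deadlines in hardware or a real-time executive, not application code.

```python
import time

DEADLINE_S = 0.001  # hypothetical 1 ms control cycle budget


def run_cycle(read_sensor, compute, actuate):
    """Run one control cycle; return False if the deadline was missed.
    A real system would also have to act safely on a miss."""
    start = time.monotonic()
    actuate(compute(read_sensor()))
    return (time.monotonic() - start) <= DEADLINE_S


# Trivial work fits easily in the budget; a 5 ms stall does not.
fast = run_cycle(lambda: 1.0, lambda x: x * 0.5, lambda y: None)
slow = run_cycle(lambda: 1.0,
                 lambda x: (time.sleep(0.005), x)[1],
                 lambda y: None)
```

This is also why audit or test activity that adds even milliseconds of delay to a live control loop can itself cause the failure it is meant to prevent.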
They tend to operate at a very low level of interaction, exchanging data like register settings and histories of data values that reflect the state or rate of change of physical devices such as actuators or sensors.
That means that any of the valid values for settings might be reasonable depending on the overall situation of the plant they operate within and that it is hard to tell whether a data value is valid without a model of the plant in operation to compare the value to.
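The plant-model point can be sketched as a plausibility check that combines static limits with a maximum physical rate of change. The limits, slew rate, and sampling interval below are hypothetical.

```python
def plausible(value, previous, lo, hi, max_rate, dt):
    """Check a reading or setting against static limits and the
    maximum physically possible rate of change (a crude stand-in
    for a full plant model)."""
    if not (lo <= value <= hi):
        return False
    return abs(value - previous) <= max_rate * dt


# Hypothetical tank level: 0-100 units, slew limited to 100 units/s,
# sampled every 0.1 s. A reading of 80.0 is "valid" in format and
# range, but jumping there from 20.0 in one sample is implausible.
small_step = plausible(22.0, 20.0, 0.0, 100.0, 100.0, 0.1)
big_jump = plausible(80.0, 20.0, 0.0, 100.0, 100.0, 0.1)
```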
They tend to operate in place for tens of years before being replaced and they tend to exist as they were originally implemented. They do not get updated very often, do not run antivirus scanners, and, in many cases, do not even have general-purpose operating systems.
This means that the technology of 30 years ago has to be integrated into new technologies and that designers have to consider the implications over that time frame to be prudent. Initial cost is far less important than life-cycle costs and consequences of failure tend to far outweigh any of the system costs.
Most of these systems are designed to operate in a closed environment with no connection outside of the control environment. However, they are increasingly being connected to the Internet, wireless access mechanisms, and other remote and distant mechanisms running over intervening infrastructure.
Such connections are extremely dangerous, and commonly used protective mechanisms like firewalls and proxy servers are rarely effective in protecting control systems to the level of surety appropriate to the consequences of failure.
Current intrusion and anomaly detection systems largely fail to understand the protocols that control systems use and, even if they did, do not have plant models that allow them to differentiate between legitimate and illegitimate commands in context.
Even if they could do this, the response times for control systems are often too short to allow any such intervention, and stopping the flow of control signals is sometimes more dangerous than allowing potentially wrong signals to flow.
Control systems typically have no audit trails of commands executed or sent to them; have no identification, authentication, or authorization mechanisms; and execute whatever command is sent to them immediately unless it has a bad format.
They have only limited error detection capabilities, and in most cases, erroneous values are reflected in physical events in the mechanisms under control rather than error returns.
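A compensating front end of the kind often proposed for this gap can be sketched as a wrapper that authorizes and logs commands before forwarding them to a controller that would otherwise execute anything well-formed. The users, commands, and authorization table are hypothetical, and such a wrapper must itself meet the real-time constraints discussed above.

```python
import time

# Hypothetical authorization table: (user, command) pairs allowed.
AUTHORIZED = {("operator1", "set_valve"), ("operator1", "read_status")}
audit_log = []  # entries: (timestamp, user, command, allowed)


def send_command(user, command, raw_send):
    """Authorize and log a command before forwarding it to the raw
    controller interface; return whether it was sent."""
    allowed = (user, command) in AUTHORIZED
    audit_log.append((time.time(), user, command, allowed))
    if allowed:
        raw_send(command)
    return allowed


sent = send_command("operator1", "set_valve", lambda cmd: None)
blocked = send_command("mallory", "set_valve", lambda cmd: None)
```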
When penetration testing is undertaken, it very often demonstrates that these systems are highly susceptible to attack.
However, this is quite dangerous because as soon as a wrong command is sent to such a system or the system slows down during such a test, the risk is run of doing catastrophic damage to the plant. For that reason, actual systems in operation are virtually never tested and should not be tested in this manner.
In control systems, integrity, availability, and use control are the most important operational objectives; accountability is vital to forensic analysis; and confidentiality is rarely of import from an operational standpoint at the level of individual control mechanisms. The design and review process should be clear about this prioritization.
This is not to say that confidentiality is not important. In fact, there are examples such as reflexive control attacks and gaming attacks against the financial system in which control system data have been exploited;
but given the option of having the system operate safely or leaking information about its state, the safe operation should be given precedence.
Questions to Probe
Finally, while each specific control system has to be individually considered in context, there are some basic questions that should be asked with regard to any control system and a set of issues to be considered relative to those questions.
Question 1: What Is the Consequence of Failure and Who Accepts the Risk?
The first question that should always be asked with regard to control systems concerns the consequences associated with control system failures, followed by the surety level applied to implement and protect those control systems. The higher the consequences, the higher the surety of the implementation should be.
The consequence levels associated with the worst-case failure, ignoring protective measures in place, indicate the level at which risks have to be reviewed and accepted.
If lives are at stake, the chief executive officer (CEO) likely has to accept residual risks. If significant impacts on the valuation of the enterprise are possible, the CEO and chief financial officer (CFO) have to sign off.
In most manufacturing, chemical processing, energy, environment, and other similar operations, the consequences of a control system failure are high enough to require top management involvement and sign-off.
Executives must read the audit summaries and the chief scientist of the enterprise should understand the risks and describe these to the CEO and CFO before sign-off.
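The sign-off rule above can be captured as a simple mapping from worst-case consequence category to the parties who must accept residual risk. The categories follow the text; the table structure itself is an illustrative assumption, not a prescribed governance model.

```python
# Consequence categories drawn from the text; the "minor" row and
# the mapping structure are illustrative assumptions.
SIGN_OFF = {
    "lives_at_stake": {"CEO"},
    "enterprise_valuation_at_stake": {"CEO", "CFO"},
    "minor": {"line management"},
}


def required_sign_off(category):
    """Who must accept residual risk for a worst-case consequence
    category, ignoring protective measures in place."""
    return SIGN_OFF[category]
```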
If this is not done, it should be determined who is making these decisions, and an audit team should report this result to the board as a high-priority item to be mitigated.
Question 2: What Are the Duties to Protect?
Along with the responsibility for control systems comes civil and possibly criminal liability for failure to do the job well enough and for the decision to accept a risk rather than mitigate it.
In most cases, such systems end up being safety systems, having potential environmental impacts, and possibly endangering surrounding populations.
Duties to protect include, without limit, legal and regulatory mandates, industry-specific standards, contractual obligations, company policies, and possibly other duties.
All of these duties must be identified and met for control systems, and for most high-value control systems, there are additional mandates and special requirements.
For example, in the automotive industry, safety mechanisms in cars that are not properly operating because of a control system failure in the manufacturing process might produce massive recalls, and there may be a duty to have records of inspections associated with the requirements for recalls that are unmet within some control systems.
Of course, designers, like auditors, should know the industry they operate in; without such knowledge, items such as these may be missed.
Question 3: What Controls Are Needed, and Are They in Place?
Control systems in use today were largely created at a time when the Internet was not widely connected. As a result, they were designed to operate in an environment where connectivity was very limited.
To the extent that they have remote control mechanisms, those mechanisms are usually direct command interfaces to control settings.
At the time they were designed, the systems were protected by limiting physical access to equipment and limiting remote access to dedicated telephone lines or wires that run with the infrastructure elements under control.
When this is changed to a nondedicated circuit, when the telephone switching system no longer uses physical controls over dedicated lines, when the telephone link is connected via a modem to a computer network connected to the Internet;
or when a direct IP connection to the device is added, the design assumptions of isolation that made the system relatively safe are no longer valid.
When these systems are connected to the Internet, the connections are typically made without the necessary knowledge to do so safely. Given the lack of clarity in this area, such connections should probably not be made without having the best available experts consider the safety of those changes.
This sort of technology change is one of the key things that make control systems susceptible to attack, and most of the technology fixes put in place with the idea of compensating for those changes do not make those systems safe. Here are some examples of things we have consistently seen in reviews of such systems:
The claim of an “air gap” or “direct line” or “dedicated line” between a communications network used to control distant systems and the rest of the telephone network is almost never true, no matter how many people may claim it.
The only way to verify this is to walk from place to place and follow the actual wires, and every time we have done it, we have found these claims to be untrue.
The claim that “nobody could ever figure that out” seems to be a universal form of denial. Unfortunately, people figure these things out and exploit them all the time, and of course, our teams have figured them out and presented them to the people who operate the control systems, demonstrating that they can indeed be figured out.
Remote control mechanisms are almost always vulnerable. They are less vulnerable between the SCADA system and the things it controls when the connections are fairly direct, but they are almost always vulnerable for mobile control devices, any mechanism using wireless, any system with unprotected wiring, any system with a way to check on or manage it from afar, and anything connected either directly or indirectly to the Internet.
Encryption, VPN mechanisms, firewalls, intrusion detection sensors, and other similar security mechanisms designed to protect normal networks from standard attacks are rarely effective in protecting control systems connected to or through these devices from attacks that they face.
And many of these techniques are too slow, cause delays, or are otherwise problematic for control systems. Failures may not appear during testing or for years, but when they do appear, they can be catastrophic.
Insider threats are almost always ignored, and typical control systems are powerless against them. However, many of the attack mechanisms depend on a multistep process that starts with changing a limiter setting and is followed by exceeding the normal limits of operation.
If detection of these limit-setting changes were done in a timely fashion, many of the resulting failures could be avoided.
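Timely detection of limit-setting changes can be sketched as a simple comparison of polled limiter values against a known-good baseline. This is a minimal illustration, not a production monitor; the limiter names and a polling source are assumptions.

```python
# Hypothetical sketch: compare polled limiter settings against a
# known-good baseline and raise an alert on any difference.
from typing import Dict, List

def detect_limiter_changes(baseline: Dict[str, float],
                           current: Dict[str, float]) -> List[str]:
    """Return alert messages for any limiter that differs from baseline."""
    alerts = []
    for name, good in baseline.items():
        observed = current.get(name)
        if observed is None:
            alerts.append(f"ALERT: limiter {name} missing from readout")
        elif observed != good:
            alerts.append(f"ALERT: limiter {name} changed {good} -> {observed}")
    return alerts

# Illustrative values: a boiler pressure limiter raised far beyond baseline.
baseline = {"boiler_max_psi": 300.0, "valve_slew_limit": 5.0}
tampered = {"boiler_max_psi": 900.0, "valve_slew_limit": 5.0}
print(detect_limiter_changes(baseline, tampered))
```

In practice, such a check would run continuously and feed an alarm channel so that an operator can intervene before the normal limits of operation are exceeded.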
Change management in control systems is often not able to differentiate between safety interlocks and operational control settings. Higher standards of care should be applied to changes of interlocks than changes in data values because the interlocks are the things that force the data values to within reasonable ranges.
As an example, interlocks are often bypassed by maintenance processes and sometimes not verified after the maintenance is completed.
The standard operating procedure should mandate safety checks, including verification of all interlocks and limiters against known good values, and external reviewers should keep old copies and verify changes against them.
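The post-maintenance verification described above can be sketched as a fingerprint comparison of the interlock configuration against a stored known-good copy. The setting names here are illustrative assumptions.

```python
# Hypothetical post-maintenance check: verify interlock settings against
# a stored known-good fingerprint before returning the system to service.
import hashlib
import json

def fingerprint(settings: dict) -> str:
    """Stable hash of an interlock configuration (order-independent)."""
    canonical = json.dumps(settings, sort_keys=True).encode()
    return hashlib.sha256(canonical).hexdigest()

def verify_after_maintenance(current: dict, good_hash: str) -> bool:
    """True only if the live configuration matches the known-good copy."""
    return fingerprint(current) == good_hash

# Illustrative configuration: one interlock bypassed during maintenance
# and not restored afterward should fail the check.
known_good = {"pressure_interlock": True, "overtemp_trip_c": 450}
good_hash = fingerprint(known_good)
bypassed = dict(known_good, pressure_interlock=False)
print(verify_after_maintenance(known_good, good_hash))  # True
print(verify_after_maintenance(bypassed, good_hash))    # False
```

The known-good fingerprint would be held by the external reviewer, not on the control system itself, so that a compromised system cannot alter both the settings and the reference.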
If accountability is to be attained, it must be done by an additional audit device that receives signals through a diode or similar mechanism that prevents the audit mechanism from affecting the system.
This device must itself be well protected to keep forensically sound information required for investigation. However, since there is usually poor or no identification, authentication, or authorization mechanism within the control system itself, attribution is problematic unless explicitly designed into the overall control system.
Alarms should be in place to detect loss of accountability information, and such loss should be immediately investigated. A proper audit system should be able to collect all of the control signals in a complex control environment for periods of many years without running out of space or becoming overwhelmed.
If information from the control system is needed for some other purpose, it should run through a digital diode for use.
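The append-only, forensically sound audit record described above can be sketched as a hash-chained log on the receiving side of the data diode: each entry's hash incorporates the previous one, so altering any past record breaks the chain. This is a minimal sketch; the record format is an assumption.

```python
# Sketch of an append-only, hash-chained audit log sitting behind a data
# diode: entries can be verified later, but tampering with any past
# record is detectable because it breaks the hash chain.
import hashlib

class AuditLog:
    def __init__(self):
        self.entries = []          # list of (record, chained_hash) pairs
        self.last_hash = "0" * 64  # genesis value before any entries

    def append(self, record: str) -> None:
        """Chain each record to the hash of everything before it."""
        h = hashlib.sha256((self.last_hash + record).encode()).hexdigest()
        self.entries.append((record, h))
        self.last_hash = h

    def verify(self) -> bool:
        """Recompute the chain; any altered record fails verification."""
        prev = "0" * 64
        for record, h in self.entries:
            if hashlib.sha256((prev + record).encode()).hexdigest() != h:
                return False
            prev = h
        return True

log = AuditLog()
log.append("valve_7 open")
log.append("limiter boiler_max_psi=300")
print(log.verify())  # True
log.entries[0] = ("valve_7 CLOSED", log.entries[0][1])  # tamper attempt
print(log.verify())  # False
```

Because control signals are small and storage is cheap, a device like this can plausibly retain years of traffic; the diode guarantees it can observe the control system without ever affecting it.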
If remote control is really needed, that control should be severely limited and implemented only through a custom interface using a finite state machine mechanism with syntax checks in context, strict accountability, strong auditing, and safeguards specially designed for the specific controls on the specific systems.
It should fail into a safe mode and be carefully reviewed and should not allow any safety interlocks or other similar changes to be made from afar.
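A finite state machine gate of the kind described can be sketched as a table of commands that are legal in each state: anything out of context is rejected, interlock changes are never accepted from afar, and unknown input fails safe. The states and command names are assumptions for illustration.

```python
# Illustrative finite-state-machine gate for remote commands: only
# commands legal in the current state pass, and interlock changes are
# never permitted remotely. States/commands are hypothetical.
ALLOWED = {
    "idle":    {"start": "running", "status": "idle"},
    "running": {"status": "running", "stop": "idle", "setpoint": "running"},
}

class RemoteGate:
    def __init__(self):
        self.state = "idle"

    def submit(self, command: str) -> bool:
        """Accept a command only if legal in context; fail safe otherwise."""
        if command.startswith("interlock"):
            return False  # safety interlocks are never changeable from afar
        next_state = ALLOWED.get(self.state, {}).get(command)
        if next_state is None:
            return False  # syntactically or contextually invalid: reject
        self.state = next_state
        return True

gate = RemoteGate()
print(gate.submit("start"))                # True: legal in "idle"
print(gate.submit("interlock_bypass on"))  # False: blocked categorically
print(gate.submit("start"))                # False: already running
```

Every accepted and rejected command would also be recorded by the audit mechanism, so that rejected attempts themselves become evidence of probing.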
To the extent that distant communication is used, it should be encrypted at the line level where feasible; however, because of timing constraints, this may be of only limited value.
To the extent that remote control is used at the level of human controls, all traffic should be encrypted and the remote control devices should be protected to the same level of surety as local control devices.
That means, for example, that if you are using a laptop to remotely control such a mechanism, it should not be used for other purposes, such as e-mail, Web browsing, or any other nonessential function of the control system.
Nothing should ever be run on a control system other than the control system itself. It needs to have dedicated hardware, infrastructure, connectivity, bandwidth, controls, and so forth. The corporate LAN should not be shared with the control system, no matter how much there are supposed to be guarantees of quality of service.
If voice over IP replaces plain old telephone service (POTS) throughout the enterprise, make sure it is not replaced in the control systems. Fight the temptation to share an Ethernet between more than two devices, to go through a switch or other similar device, or to use wireless, unless there is no other way.
Just remember that a failure anywhere in the entire chain of control for these infrastructure elements may cause the control system to fail and induce the worst-case consequences.
Finally, experience shows that people believe a lot of things that are not true. This is more so in the security arena than in most other fields and more critical in control systems than in most other enterprise systems. When in doubt, do not believe them. Trust, but verify.
Perhaps more dangerous than older systems that we know have no built-in controls are modern systems that run complex operating systems and are regularly updated. Modern operating platforms that run control systems often slow down when updates are underway or at different times of the day or during different processes.
These slowdowns sometimes cause control systems to slow unnecessarily. If an antivirus update falsely flags a critical piece of software as malicious, the control system could crash; and if a virus can enter the control system at all, that system is not secure enough to handle medium- or high-consequence control functions.
Many modern systems have built-in security mechanisms that are supposed to protect them, but the protection is usually designed to protect confidentiality rather than to ensure availability, integrity, and use control. As such, these mechanisms aim at the wrong target, and even if they hit what they aim at, they will not meet the need.