What is Agile and DevOps Methodology?
There are large organizations in the industry using leading-edge techniques such as Agile and DevOps to develop software faster and more efficiently than anyone ever thought possible. This tutorial explains how the Agile and DevOps methodologies are used, with practical examples.
The DevOps approach
The DevOps approach of integrating working code across the organization in an operation-like environment is one of the biggest challenges for large, traditional organizations, but it provides the most significant improvements in aligning the work across teams.
It also provides the real-time feedback engineers need to become better developers. For this to work well, the continuous deployment pipeline needs to be designed to quickly and efficiently localize issues in large, complex systems and organizations.
It requires a large amount of test automation, so it is important that the test automation framework is designed to quickly localize issues and can easily evolve with the application as it changes over time.
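As a hypothetical sketch of that design goal (the stage names and checks below are invented, not from the source), a pipeline can be structured to stop at the first failing stage, so a red build points at a specific component rather than at the whole system:

```python
# Hypothetical sketch: a staged deployment pipeline that localizes failures.
# Stage names and pass/fail checks are illustrative, not from the source.

def run_pipeline(stages):
    """Run stages in order; return (passed, name_of_first_failing_stage)."""
    for name, check in stages:
        if not check():
            return False, name  # fail fast: the failing stage localizes the issue
    return True, None

# Each stage gates on progressively broader, slower test suites.
stages = [
    ("unit-tests", lambda: True),
    ("component-integration", lambda: True),
    ("simulator-system-tests", lambda: False),  # e.g. a regression caught here
    ("production-like-deploy", lambda: True),
]

ok, failed_at = run_pipeline(stages)
```

In practice each stage would invoke a real test suite; the point of the shape is that broader suites only run once the narrower ones pass, so a failure is attributed to the stage that caught it rather than to the system as a whole.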
This is a big challenge for most organizations, so the executives need to make sure to start with achievable goals and then improve stability over time as the organization’s capabilities improve.
Teams can’t and won’t drive this level of change, so the executives need to understand these concepts in enough detail to lead the transformation and ensure their teams are on the right track.
Transforming development and delivery processes in a large, traditional organization requires a lot of technical changes that will require some work, but by far the biggest challenges are with changing the culture and how people work on a day-to-day basis.
What do these cultural shifts look like? Developers create a stable trunk in a production-like environment as job #1. Development and Operation teams use common tools and environments to align them on a common objective.
The entire organization agrees that the definition of done at the release branch means that the feature is signed off, defect-free, and the test automation is ready in terms of test coverage and passing rates.
The organization embraces the unique characteristics of software and designs a planning process that takes advantage of the software’s flexibility. These are big changes that will take time, but without the executives driving these cultural shifts, the technology investments will be of limited value.
From business objectives and continuous improvement to planning and DevOps, Leading the Transformation takes you through the step-by-step process of how to apply Agile and DevOps principles at scale.
It is an innovative approach that promises markedly better business results with less up-front investment and organizational turmoil.
Change Management Capacity
Transitioning to Agile is a very big effort for a large organization. There are technical and process changes required. Frequently, organizations focus on technical solutions. While required, they represent a smaller portion of the effort. This means that organizational change management capacity is the most precious resource, and it should be actively managed and used sparingly.
The approach to rolling out an enterprise-level Agile transition should focus on the business breakthroughs the Agile principles are intended to achieve while taking into consideration the capacity of an organization to change. This is where executives can add knowledge and expertise.
The Limitations of Traditional Agile Implementation: An Executive Perspective
What follows is an example of a large (1,000-developer) organization that tries to enable small Agile teams in the organization. The example is not real, but it is a snapshot of what is being done in the industry today.
The first step is to select a few pilot teams with eight to ten members who will start being “Agile.” These teams will gain valuable experience and create some best practices. Once these teams have demonstrated the advantages of Agile, they create a plan for how it should be done for this organization.
The plan will need to scale across the organization, so in the end, there are going to be ~100 Agile teams. Agile coaches will be hired to get a few teams going, and then within a year, all the teams should be up and running.
Throughout the year, each coach can probably ramp up about five teams; thus, this rollout could require in the range of 20 coaches. With a cost of $150/hour, this adds up to over $2M/year.
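The arithmetic behind that estimate can be made explicit; the billable-hours figure below is an assumption introduced for illustration, since the text gives only the hourly rate, the team count, and the ramp-up rate:

```python
# Rough cost model for the coaching rollout.
# Billable hours per coach is an assumption, not a figure from the source.
teams = 100
teams_per_coach_per_year = 5
rate_per_hour = 150

coaches = teams // teams_per_coach_per_year   # 100 teams / 5 per coach = 20 coaches
billable_hours = 700                          # assumed part-year engagement per coach
annual_cost = coaches * rate_per_hour * billable_hours

# Even under this conservative assumption the bill tops $2M/year;
# full-time, full-year engagements would push it far higher.
```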
When forming Agile teams, it is important to understand that you want teams that can own the whole feature end to end, so you need to pick members from the different component teams to form the prototype teams.
This works fine with just a few teams, but when we start rolling this out organization-wide, we are going to need a complete reorganization. Once everyone is in teams creating effective workspaces for collaboration, moving everyone around will probably take another $1–2M.
The next step is making sure the teams have the right tool for documenting and tracking their Agile stories, which will probably run $1M or more. All of a sudden, we are up to over $5M for the transition.
We better make sure we talk to the executives who can commit that level of resources. This will require an ROI justification for the CFO.
So now we have committed to a big investment and promised a big return to the top-level executives. At this point, we have an implementation that engages and includes the engineers and top-level executives. The management team in the middle, however, does not have a clear role except to provide their support and stay out of the way.
There are a number of problems with this whole approach. You are now $4–5M into a transition, and you still don’t have a plan for having always-releasable code for the enterprise or an enterprise backlog.
Teams may have a clear local definition of “done” in their dedicated environments and team backlogs, but at the enterprise level, you have not changed the process for releasing code.
Also, this approach has driven lots of organizational change that may be met with some resistance. We started by taking managers out of the process because they are a big part of the problem and don’t understand how to coach Agile teams. Can you see why and how they might undermine this transformation?
Next, we have a complete reorganization, which always tends to be a cause for concern and creates resistance to change. Add to that moves designed to get teams together in collaborative workspaces.
You can see how this approach is going to create a lot of turmoil and change while not fundamentally changing the frequency of providing the code to the customers for feedback.
The other big challenge is getting the teams to buy in and support how they are going to approach the details of their day-to-day work. The first prototype teams are going to be successful because they have had a lot of input in defining the team-level processes and have ownership of their success.
The problem most organizations have is that once they have determined the “right” way for their teams to do Agile, they feel the next step is teaching everyone else to do it that way.
This approach of telling the teams how to do their day-to-day work feels like it is contrary to the good management practices that we learned early in our careers.
In most cases, for any problem, there are at least three different approaches that will all achieve the same result. On the one hand, if we were really smart managers, we could pick the best approach and tell the team how to do it.
If, on the other hand, we let the team pick how to meet the objective, they are more likely to make their idea more successful than they would have made our idea.
If we told them how to do it and it failed, it was clear that we didn’t have a very good idea. If they got to pick how, they were much more likely to do whatever it took to make their idea successful. Just being managers or being part of the prototype team did not mean we were any more likely to pick the best idea.
Therefore, as leaders we feel it is important, wherever possible, to provide a framework with the objectives and give the team as much flexibility as possible in defining how the work will get done. It provides them with more interesting work, and they take more ownership of the results.
In addition, when the situation changes, those doing the work are likely to sense it and adapt more quickly than an executive would.
BUSINESS OBJECTIVES AND CRUCIAL FIRST STEPS
The reason these two Agile authors say “don’t do Agile” is that we don’t think you can ever be successful or get all the possible business improvements if your objective is simply to do Agile and be done. Agile is such a broad and evolving methodology that it can’t ever be implemented completely.
Someone in your organization can at any time Google “What is Agile development” and then argue for pair programming or Extreme Programming or less planning, and you begin a never-ending journey to try all the latest ideas without any clear reason why.
Additionally, Agile is about continuous improvement, so by definition, you will never be done.
At HP we never set out to do Agile. Our focus was simply on improving productivity. The firmware organization had been the bottleneck for the LaserJet business for a couple of decades.
In the few years before this transformation started, HP tried to spend its way out of the problem by hiring developers around the world, to no avail. Since throwing money at the problem didn’t work, we needed to engineer a solution.
We set off on a multiyear journey to transform the way we did development with the business objective of freeing up capacity for innovation and ensuring that, after the transformation, the firmware would not be the bottleneck for shipping new products. This clear objective really helped guide our journey and prioritize the work along the way.
Based on this experience and others like it, we think the most important first step in any transformation is to develop a clear set of business objectives tuned to your specific organization to ensure you are well positioned to maximize the impact of the transformation on business results.
When we started, the firmware had been the bottleneck in the business for a couple of decades and we had no capacity for innovation.
At the end of a three-plus-year journey, adding a new product to our plans was not a major cost driver. We had dramatically reduced costs from $100M to $55M per year and increased our capacity for innovation by eight times.
To be clear, achieving these results was a huge team effort. For example, it required us to move to a common hardware platform so that a single trunk of code could be applied to the entire product lineup. Without collaboration with our partners throughout the business, we could not have achieved these results.
Having a set of high-level business objectives that the entire organization is focused on is the only way to get this type of cross-organizational cooperation and partnership.
These types of results will not happen when you “do Agile.” It takes a laser-like focus on business objectives, a process for identifying inefficiencies in the current approach, and an ongoing, continuous improvement process.
Activity-Based Accounting and Cycle-Time Approach
At HP our clear business objectives of freeing up capacity for innovation and no longer being the bottleneck for the business meant we needed to dramatically improve productivity.
Therefore, we started by understanding our cost and cycle-time drivers to identify waste in our development processes. We determined these by mapping our processes, thinking about our staffing, digging through our finances, and looking back at our projects under development.
This is an important first step. Most people understand how they are spending money from a cost-accounting perspective: they have detailed processes for allocating costs out to products or projects. What they typically lack is an activity-based accounting view of what is driving those costs.
This step requires either a deep understanding of how people spend their time or a survey of the organization. Also keep in mind that while it does not have to be accurate to three significant digits, you do need to have a reasonably good feel for the cost and cycle-time drivers to prioritize your improvement ideas.
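As an illustration of what that activity-based view might look like (the activity categories, survey numbers, and budget below are all invented, not HP's data), a time survey can be rolled up into ranked cost drivers:

```python
# Hypothetical survey data: fraction of each engineer's time spent per activity.
# Activities and numbers are illustrative only.
survey = [
    {"engineer": "a", "porting": 0.5, "manual_test": 0.3, "new_features": 0.2},
    {"engineer": "b", "porting": 0.4, "manual_test": 0.4, "new_features": 0.2},
    {"engineer": "c", "porting": 0.6, "manual_test": 0.2, "new_features": 0.2},
]
annual_budget = 100_000_000  # e.g. a $100M engineering budget (assumed)

# Average the fractions across the organization, then convert to dollars.
activities = [k for k in survey[0] if k != "engineer"]
totals = {a: sum(row[a] for row in survey) / len(survey) for a in activities}
cost_drivers = {a: round(frac * annual_budget) for a, frac in totals.items()}

# Sorting by cost surfaces the biggest improvement opportunities first.
ranked = sorted(cost_drivers.items(), key=lambda kv: -kv[1])
```

The precision does not matter much, as the text notes; what matters is that the ranking is roughly right, so improvement work targets the biggest drivers first.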
Once we were clear about our business objectives, cycle-times, and cost drivers, we were ready to start our improvement process. We focused on waste in the system, which we defined as anything driving time or cost that was not key to our business objectives.
It was only at this point that we considered changing our development approach to align with some of the DevOps and Agile methods.
We also wanted to make sure we were starting where we could show quick progress. There are lots of different starting points and an organization can only change so many things at once, so we wanted to make sure we began where we would get the biggest return on our investment.
Applying DevOps and Agile Principles at Scale
Starting with activity-based accounting and cycle-time drivers is probably the most accurate approach, and it will quickly point you to the processes for maintaining always-releasable code and to your planning processes, just as it did for us at HP.
Writing code is similar in large and small organizations, but the processes for planning it, integrating it with everyone else, qualifying it, and getting it to the customer are dramatically different.
In almost every large, traditional organization, this is by far the biggest opportunity for improvement. Far too often these parts of the process take a lot of resources and time because traditional organizations tend to manage them with rigid, Waterfall-style processes.
It is difficult for a blog to address the unique, business-specific findings that come out of the activity-based accounting and cycle-time driver approach. Therefore, the rest of this blog will focus on applying Agile and DevOps principles at scale to provide ideas and recommendations that address the traditional challenges of transforming these parts of your development process.
Whether you do the detailed activity-based accounting and cycle-time view of your processes or start with applying DevOps and Agile principles at scale, it is important that you begin with clear business objectives.
This transformation is going to take a lot of effort, and if you don’t have clear business objectives driving the journey, you can’t expect the transformation to provide the expected business results.
At HP we were able to get two to three times the improvement in business results because that was the focus of all our changes.
If we couldn’t see clearly how a change would help meet those goals, we did not waste any time with it, even though it might have been popular in the Agile community. It is this type of focus on the business objectives that enables the expected business results.
These objectives also help in the organizational change management process, where you can constantly remind the team of where you are, where you are going, and the expected benefits.
ENTERPRISE-LEVEL CONTINUOUS IMPROVEMENT
At HP we made sure we had a set of objectives that we used to drive the organization during each iteration. There were typically four to seven high-level objectives with measurable sub-bullets we felt were most important to achieve, things like shipping the next set of printers or taking the next step forward in our test automation framework.
It also included things like improving our test passing rates, since stability had dropped, or a focus on feature throughput, since development was getting behind.
In one release window, for example, the top priority was shipping the next set of printers; our second highest priority was improving the stability of the codebase on Windows CE and getting the simulator automated testing in place.
The third priority was addressing all the product-specific requirements of the new products in this release window. We also had the first prototypes of the next generation products showing up that were based on Windows CE and an ARM processor.
This led to our fourth priority of ensuring we could port the common codebase to the ARM processor. Our fifth priority was getting the first XPe on MIPS processor products ready for system qualification on the final products.
While this list includes some team-level stories, it focuses, more importantly, on the enterprise-level deliverables. Executives work with the organization to set these kinds of objectives so that everyone feels they are important and achievable.
They also make sure the objectives are based on what the teams are actually doing and achieving. This kind of collaboration helps build a culture of trust.
These are very high-level strategic objectives that include business objectives and process improvements. Can you see how it is much more than just an aggregate of the team-level stories?
Can you also see why everyone across the organization would have to make these their top priorities instead of team-level stories if we were really going to make progress?
These objectives provided an organization-wide set of priorities that drove work throughout the organization. If you were working on one of these top priorities, the organization understood people needed to help if you asked. Each team was expected to prioritize the team’s stories that were required to meet these objectives.
Once the team had met the strategic objectives, they were empowered to use the rest of their capacity to make the best use of their time and prioritize the team-level stories.
This approach allowed for defining aligned objectives for the enterprise while also empowering the teams. It was a nice balance of top-down and bottom-up implementation.
Tracking Progress and Understanding Challenges
These objectives guided all of our work. There was a website that aggregated all of our metrics and enabled tracking every objective down through the organization. The metrics would start at the overall organization, then cascade down to each section manager, and then down to each team.
As executives, we would each start the morning at our desks spending 30–45 minutes reviewing these metrics, so we knew exactly how we were doing in meeting these objectives and could identify where we were struggling.
We did not have many status meetings or Scrum of Scrum meetings because the data was always at everyone’s fingertips.
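One way to picture that cascading view (the structure and numbers here are assumed for illustration; the source only describes a website that aggregated metrics) is a rollup from teams to sections to the organization, with a drill-down to the team that is struggling:

```python
# Hypothetical metrics rollup: team -> section -> organization.
# Section/team names and pass rates are invented for illustration.
teams = {
    ("section-1", "team-a"): 0.98,
    ("section-1", "team-b"): 0.91,
    ("section-2", "team-c"): 0.72,  # the struggling team
    ("section-2", "team-d"): 0.95,
}

def rollup(metrics):
    """Aggregate team pass rates into section averages and an org average."""
    sections = {}
    for (section, _team), rate in metrics.items():
        sections.setdefault(section, []).append(rate)
    section_avg = {s: sum(v) / len(v) for s, v in sections.items()}
    org_avg = sum(metrics.values()) / len(metrics)
    return org_avg, section_avg

org, by_section = rollup(teams)

# Drill down: the lowest section, then its lowest team, shows where to go help.
worst_section = min(by_section, key=by_section.get)
worst_team = min((k for k in teams if k[0] == worst_section), key=teams.get)
```

The design choice this reflects is that status is pulled from data rather than pushed through meetings: anyone, at any level, can follow the same numbers down to the source of a problem.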
The leadership team would then spend most of their days walking the floor trying to understand where we were struggling and why. This is a new role for most executives and one we encourage executives to embrace if this process is going to be successful. We became investigative reporters trying to understand what was working and what needed to be improved.
People wanted to meet the objectives we set and felt were reasonable at the beginning of the iteration, so something must have happened if they were struggling.
These discussions around what got in the way were one of the most valuable parts of the process both in terms of learning what needed improving and in terms of changing the culture.
When high-level executives first started showing up at the desks of engineers that were struggling, it was intimidating for the engineers. After a while, though, the engineers realized we were just there to help, and the culture started to change.
Adjusting Based on Feedback and Aligning for the Next Iteration
During the last week of our monthly iteration, we would start evaluating what we could and could not get done in that iteration. We would also integrate what we learned about where the organization was struggling in setting the objectives for the next iteration.
Objectives usually started out as a discussion between the program manager and the director, notepad in hand, standing in front of a poster of the previous iteration objectives. In about 20 minutes, we would create a rough draft of objectives for the next iteration.
We say “rough draft” because we had not yet gotten the perspectives and support of the broader organization. Over the next few days, there were reviews in lab staff meetings, the section managers’ meetings, and a project managers’ meeting where we scrubbed and finalized the objectives into a list everyone felt was achievable and critical to our success.
We know that aligning the organization on enterprise iteration goals takes longer than most small Agile teams take to set objectives for an iteration, but for a large organization, it felt like a good balance of driving strategy at the top and getting the buy-in and support from the troops.
AGILE ENTERPRISE PLANNING
Convincing large, traditional organizations to embrace Agile principles for planning is difficult because most executives are unfamiliar with the unique characteristics of software development or the advantages of Agile.
As we have discussed, they expect to successfully manage software development using a Waterfall planning and delivery process, just like they successfully manage other parts of their business.
This approach, though, does not utilize the unique advantages of software or address its inherent difficulties. Software is infinitely flexible.
It can be changed right up to the time the product is introduced. Sometimes it can be changed even later than that with things like software or firmware upgrades, websites, and software as a service (SaaS).
Software does have its disadvantages, too. Accurately scheduling long-term deliveries is difficult, and more than 50% of all software developed is either not used or does not meet its business intent.
If executives managing software do not take these differences into account in their planning processes, they are likely to make the classic mistake of creating detailed, inaccurate plans for developing unused features.
At the same time, they are eliminating flexibility, which is the biggest advantage of software, by locking in commitments to these long-range plans.
Unique Characteristics of Software
Agile software development was created because traditional planning and project management techniques were not working. Unlike traditional projects, software is almost infinitely flexible.
If you’ve ever lived through a full metal mask roll for a custom ASIC on a laser printer to resolve a design problem, you are fully aware of just how long, painful, and expensive this can be. A simple human error made during the ASIC design can result in a schedule slip of several months and cost overruns of $1M.
This same principle can be applied in most non-software projects. Once the two-by-fours are cut, they’re not going to get any longer. Once the concrete is poured and dried, it’s not going to change shape.
The point is that if you’re managing a construction project or a hardware design effort, you must do extensive requirements gathering and design work up front;
otherwise, mistakes are very time-consuming and costly to fix. The project management discipline has evolved to match this lack of flexibility and minimize the likelihood of making these kinds of mistakes.
Software, on the other hand, if managed and developed correctly, is less expensive and easier to change. The problem is that most software is managed using the same techniques that are applied to these other inflexible disciplines.
In the software world, we call it the Waterfall development life cycle. By applying these front-end-heavy management techniques to software development, we limit ourselves in two ways.
First, we rob ourselves of software’s greatest benefit: flexibility. Instead of doing extensive planning and design work that locks us into a rigidly defined feature set, we should take advantage of the fact that software can be quickly and easily modified to incorporate new functionality that we didn’t even know about when we did the planning and scheduling several months prior.
The software process, if done correctly, can not only accept these late design changes but can actually embrace and encourage them to ensure that the customers’ most urgent needs are met, regardless of when they are discovered.
Second, we drive software costs up far higher than they should be. When software is developed using the Waterfall model, many organizations will spend 20–30% or more of their entire capacity to do the requirement gathering and scheduling steps.
During these steps, the software teams make many assumptions and estimates because every new software project you do is unique.
It is unlike any other piece of software you already have. Even if you are rearchitecting or refactoring existing functionality, the implementation of the code is unique and unlike what you had before.
Compare this to remodelling your kitchen, where Waterfall planning works well. Your contractor has installed a thousand sinks, dishwashers, and cabinets. He will use the same nails, screws, and fittings that he has used before.
His estimates of cost and schedule are based on years of experience doing the same activities over and over, finding similar issues.
There is some discovery during the actual implementation, but it is much more limited. With software, your teams have limited experience doing exactly what you’re asking them to do.
As a result, the actual implementation is filled with many discoveries and assumptions made up front that are later found to be incorrect when actual development begins.
Additionally, integration and qualification tend to uncover major issues late in the process.
If the customer sees the first implementation of a feature months after development began and you then discover that you misunderstood what was wanted, or if the market has shifted and other features are now more important, you have just wasted a significant amount of time and money.
It is far better to show the customer new functionality right as it becomes available, which is simply not possible using the Waterfall methodology.
The challenge is to determine what requires a long-term commitment and ensure these commitments don’t take up the majority of your capacity.
Ideally, if these long-term commitments require less than 50% of your development capacity, a small planning investment can provide the information required to make a commitment.
On the other hand, if these firm long-range commitments require 90–110% of the capacity, this creates two significant problems. First, when the inevitable discovery does occur during development, you don’t have any built-in capabilities to handle the adjustments.
Second, you don’t have any capacity left to respond to new discoveries or shifts in the market. In either case, if you do need to make a change, it is going to take an expensive, detailed replanning effort to determine what won’t get done so you can respond to the changes.
The LaserJet Planning Example
An example from the HP LaserJet business can help clarify how this planning process can work. HP had to make significant investments in manufacturing capacity 12-plus months before the software or firmware development was complete, so being able to commit to shipping the product on time was critical.
Before we completed the rearchitecting and re-engineering of the development processes, porting the code to a new product and the testing required to release it was taking ~95% of the resources.
This resulted in large investments in planning to ensure enough accuracy for committing to a product launch. To make matters worse, the organization wanted firm long-term commitments to every feature that Marketing considered a “MUST.”
Additionally, Marketing had learned that if a feature was not considered a “MUST” then it was never going to happen, so almost everything became a “MUST” feature 12-plus months before the product was introduced, adding on another 50–55% of demand on capacity.
This led to an organization that was trying to make long-term commitments with 150% of its capacity, requiring a large investment in planning to clarify what could and couldn’t be done.
When the planning was done and it was clear that not all the “MUST” features could be completed, we needed another planning cycle to prove the “MUST” features really couldn’t be delivered—because after all, they were “MUST” features.
This vicious cycle was taking precious capacity away from a team that could have been delivering new capabilities and products.
Breaking this logjam required a significant investment in the code architecture and development processes. When just porting the code to a new product and the release process was taking ~95% of the capacity, it was not realistic to create an effective long-range planning process.
Therefore, we rearchitected the code so that porting to a new product did not require a major investment and we automated our testing so it was easier to release the code on new products.
These changes meant that the long-range commitments required from the business were taking less than 50% of our capacity.
Additionally, we separated out the long-term manufacturing commitment requirements from the feature commitment decisions that could wait until later. Only then was it possible to develop a nice, lightweight, long-range planning process.
This new planning process consisted of three approaches for different time horizons to support different business objectives and decisions.
The goal was to avoid locking in the capacity as long as possible but still support the decisions required to run the business. We worked to keep the long-range commitments to less than 50% of the capacity.
For shorter time frames when there was less uncertainty, we would commit another 30%. Then for the last part of the capacity, we didn’t really plan but instead focused on delivering to a prioritized backlog. This approach built in flexibility, with reserved capacity to respond to the inevitable changes in the market and discoveries during development.
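The capacity split described above can be expressed as a simple allocation helper (the roughly 50%/30%/remainder split is from the text; the function itself is just an illustration):

```python
# Capacity allocation across planning horizons, per the ~50/30/remainder split
# described in the text. Integer percentages keep the arithmetic exact.
def allocate(total_capacity, long_range_pct=50, mid_range_pct=30):
    """Return capacity reserved per horizon, in the same units as the total."""
    long_range = total_capacity * long_range_pct // 100   # firm 12-plus-month commitments
    mid_range = total_capacity * mid_range_pct // 100     # ~6-month marketing initiatives
    flexible = total_capacity - long_range - mid_range    # prioritized backlog, unplanned
    return long_range, mid_range, flexible

# e.g. 100 engineer-months of capacity: 50 committed long-range,
# 30 committed mid-range, and 20 kept flexible to absorb discovery.
lr, mr, flex = allocate(100)
```

The design point is the remainder: because a fifth of the capacity is never committed, the inevitable discoveries during development can be absorbed without an expensive replanning cycle.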
The first and longest range phase focused on the commitment to ship the product on a specific date. In this phase, the unique capabilities of the new printer were characterized and high-level estimations were put in place to ensure there wasn’t anything that would preclude committing to an introduction date.
Because this was taking less than 50% of the capacity, it did not require a lot of detailed rigour in the planning for introductions 12-plus months in the future.
The next phase focused on major marketing initiatives that the organization leaders wanted to target over the next six months.
This involved the system engineers working with the Marketing leadership team to clarify at a high-level what capabilities they would like to message across the product line in the next marketing window.
The system engineers would then roughly estimate the initiatives’ demands on the different teams in the organization against the capacity left after the commitments to shipping new products.
The other change in the planning processes at HP was to minimize the requirements inventory. When we were not making long-term commitments to all the features, we did not have to invest in breaking down the requirements into details until we were much closer to starting development and knew the requirements were going to be used and much less likely to change.
For the long-range commitments, the only requirements created in detail were the unique characteristics of the new printer. Then, in the initiative phase, the new features were only defined in enough detail to support the high-level estimates.
Once these initiatives were getting closer to development, the system engineers would break the initiatives into more detailed user stories so that everyone better understood what was expected.
Then right before development started, these user stories were reviewed in feature kickoff meetings with all the engineers involved in the development along with the Marketing person and system engineers.
At this time the engineers had a chance to ask any clarifying questions or recommend potentially better approaches to the design. Then after everyone was aligned, the Development engineers would break down the high-level user stories into more detailed developer requirements, including their schedule estimates.
This just-in-time approach to managing our requirements provided a couple of key advantages. First, the investment in breaking the requirements down into more detail was delayed until we knew they would get prioritized for development.
Second, since there were not large piles of requirements inventory in the system when the understanding of the market changed, there were not a lot of requirements that needed to be reworked.
The net result of all these changes is that planning went from taking ~20% of the organization’s capacity down to less than 5%, freeing up an extra 15% of the capacity to focus on delivering business value. At the same time, the management team had the information required to make business decisions for the different planning horizons.
By delaying locking in all the capacity as long as possible, the organization was able to use the inherent flexibility of software to increase the likelihood that the new features would be used and would deliver the intended business results.
Creating an enterprise planning process that leverages the principles of Agile starts with embracing the characteristics that are unique to software. It also requires planning for different time horizons and designing a planning process that takes advantage of the flexibility software provides.
These changes in the planning processes can also help to eliminate waste in the requirements process by reducing the amount of inventory that isn’t ever prioritized or requires rework as the understanding of the market changes.
The HP LaserJet example shows how embracing the principles of Agile can provide significant business advantages. This example is not a prescription for how to do it, but it does highlight some important concepts every organization should consider.
First, the planning process should be broken down into different planning horizons to support the business decisions required for different time frames.
Second, if the items that must have long-term commitments require more than 50% of your capacity, you should look for architectural and process improvements so that those commitments are not a major capacity driver.
Third, since the biggest inherent advantage of software is its flexibility, you should not eliminate that benefit by allowing your planning process to lock in all of your capacity to long-range commitments—especially since these long-range features are the ones most likely to end up not meeting the business objectives.
Lastly, you should consider moving to the just-in-time creation of requirements detail to minimize the risk of rework and investment in requirements that will never get prioritized for development.
The specifics of how the planning process is designed need to be tuned to meet the needs of the business.
It is the executives’ role in the organization to appreciate how software development is different from the rest of their business and be willing to drive these new approaches across the organization.
If they use the traditional approach of locking all their capacity into long-range commitments, then whenever discovery occurs during development or in the market, it will require expensive replanning cycles.
Or even worse, the organization will resist changing their plans and deliver features that don’t provide any value.
Instead, executives need to get the organization to appreciate that this discovery will occur and delay locking in the capacity as long as possible so they can avoid these expensive replanning cycles and take advantage of the flexibility software provides.
Executives need to understand that the Agile enterprise planning process can provide significant competitive advantages for those companies willing to make the change and learn how to effectively manage software.
ENSURING A SOLID FOUNDATION
Executives need to understand the characteristics of their current architecture before starting to apply DevOps principles at scale. Having software based on a clean, well-defined architecture provides a lot of advantages.
Almost all of the organizations presenting leading-edge delivery capabilities at conferences have architectures that enable them to quickly develop, test, and deploy components of large systems independently.
These smaller components with clean interfaces enable them to run automated unit or subsystem tests against any changes and to independently deploy changes for different components. In situations like this, applying DevOps principles simply involves enabling better collaboration at the team level.
On the other hand, large, traditional organizations frequently have tightly coupled legacy applications that can’t be developed and deployed independently.
Ideally, traditional organizations would clean up the architecture first so that they could have the same benefits of working with smaller, faster-moving independent teams. The reality is that most organizations can’t hold off process improvements waiting for these architectural changes.
Therefore, executives are going to have to find a pragmatic balance between improving the development processes in a large, complex system and fixing the architecture so the systems are less complex over time.
We encourage you to clean up the architecture when and where you can, and we also appreciate that this is not very realistic in the short term for most traditional organizations. As a result, we will focus on how to apply DevOps principles at scale assuming you still have a tightly coupled legacy architecture.
In these situations where you are coordinating the work across hundreds to thousands of people, the collaboration across Development and Operations requires much more structured approaches like Continuous Delivery.
Embedded software and firmware have the unique architectural challenge of leveraging common, stable code across the range of products they need to support.
If the product differences are allowed to propagate throughout the code base, the Development team will be overwhelmed porting the code from one product to another.
In these cases, it is going to be important to either minimize the product-to-product hardware differences and/or isolate the code differences to smaller components that support the product variation. The architectural challenge is to isolate the product variation so as much of the code as possible can be leveraged unchanged across the product line.
The Build Process
The next step in creating a solid foundation is to validate that the build process will enable you to manage different parts of your architecture independently. Some organizations do not have this fundamental in place. There is a simple test that can evaluate how ready your build process is: can each component in your architecture be built and deployed into a test environment independently of the others?
This idea may seem simple, but it is a very basic building block because keeping the system stable requires building up and testing large software systems in a structured manner with different stages of testing and artefact promotion.
If we are thinking of these artefacts as independent entities but in reality can’t manage them that way, then we are not set up for success.
If your software does not pass this simple test you need to start with fixing your build process and modifying your architecture to ensure that all components can be built and deployed into testing environments independent of the others.
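As a sketch of how such a readiness check might be automated, the Python outline below iterates over components and verifies that each builds and deploys on its own. The component names and `make` targets are purely hypothetical, and the build runner is injected so the check can be exercised without a real build system; this is an illustration of the test, not a prescribed tool.

```python
import subprocess

# Hypothetical component list; in practice this would come from your
# build system's manifest of independently buildable modules.
COMPONENTS = ["checkout-service", "cart-service", "search-service"]

def builds_independently(component, runner=subprocess.run):
    """Return True if the component builds and deploys on its own.

    `runner` is injected so the check can run without a real build
    system; by default it shells out to a (hypothetical) make target.
    """
    build = runner(["make", "build", component])
    if build.returncode != 0:
        return False
    deploy = runner(["make", "deploy-test-env", component])
    return deploy.returncode == 0

def audit(components, runner=subprocess.run):
    """Map each component to whether it passes the independence test."""
    return {c: builds_independently(c, runner) for c in components}
```

If `audit` reports any component that cannot be built and deployed alone, that is the place to start fixing the build process.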
A large amount of test automation is necessary when changing the development processes for large, traditional organizations. Without a solid foundation here, your feedback loops are going to be broken and there won’t be an effective method for determining when to promote code forward in your pipeline.
Writing good test automation is even more difficult than writing good code because it requires strong coding skills plus a devious mind to think about how to break the code. It is frequently done poorly because organizations don’t give it the time and attention that it requires.
Because we know it is important we always try to focus a lot of attention on test automation. Still, in almost every instance we look back on, we wish we had invested more because it is so critical.
You just can’t keep large software systems stable without a very large number of automated tests running on a daily basis. This testing should start with the unit- and services-level testing which is fairly straightforward. Ideally, you would be able to use these tests to find all of your defects.
This works well for software with clean architectures but tends not to work as well in more tightly coupled systems with business logic in the user interface (UI), where you need to depend more on system-level UI testing.
In this case, dealing with thousands of automated tests can turn into a maintenance nightmare if they are not architected correctly. Additionally, if the tests are not designed correctly, then you can end up spending lots of time triaging failures to localize the offending code.
Therefore, it is very important that you start with a test automation approach that will make it efficient to deal with thousands of tests on a daily basis. A good plan includes the right people, the right architecture, and executives who can maintain the right focus.
Running a large number of automated tests on an ongoing basis is going to require creating environments where it is economically feasible to run all these tests. These test environments also need to be as much like production as possible so you are quickly finding any issues that would impact delivery to the customer.
For websites, software as a service, or packaged software this is fairly straightforward with racks of servers. For embedded software or firmware this is a different story.
There the biggest challenge is running a large number of automated tests cost-effectively in an environment that is as close as possible to the real operational environment.
Since the code is being developed in unison with the product it is typically cost prohibitive, if not impossible, to create a large production-like test farm.
Therefore, the challenge is to create simulators and emulators that can be used for real-time feedback and a deployment pipeline that builds up to testing on the product.
A simulator is code that runs on a blade server or virtual machine and mimics how the product interacts with the code being developed.
The advantage here is that you can set up a server farm that can quickly run thousands of hours of testing a day in a cost-effective way. The disadvantage is that it is not fully representative of your product, so you are likely to continue finding different defects as you move to the next stages of your testing.
The objective here is to increase test coverage and speed up feedback, enabling developers to find and fix as many defects as possible as early, and thus as cheaply, as possible. Simulator testing is fairly effective for finding defects in the more software-like parts of your embedded solution.
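To make the simulator idea concrete, here is a minimal Python sketch assuming a hypothetical printer interface (`load_paper`, `print_page`): the simulator is plain code that mimics the device, so firmware-level logic can be exercised cheaply on a server farm rather than on scarce hardware.

```python
# A minimal sketch of a product simulator: pure software standing in for
# the device so firmware logic can run on inexpensive servers. The
# interface and error codes here are hypothetical.

class PrinterSimulator:
    """Mimics a printer's paper path well enough for logic-level tests."""

    def __init__(self, tray_capacity=250):
        self.tray = 0
        self.tray_capacity = tray_capacity
        self.pages_printed = 0

    def load_paper(self, sheets):
        self.tray = min(self.tray + sheets, self.tray_capacity)

    def print_page(self):
        if self.tray == 0:
            return "PAPER_OUT"  # same error code the real device would report
        self.tray -= 1
        self.pages_printed += 1
        return "OK"

# Firmware-level logic under test can now run against the simulator
# exactly as it would against the product:
def print_job(device, pages):
    """Attempt to print `pages`; return how many actually printed."""
    done = 0
    for _ in range(pages):
        if device.print_page() != "OK":
            break
        done += 1
    return done
```

The same `print_job` logic could later run unchanged against an emulator or the real product, which is what makes the staged deployment pipeline possible.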
The challenge is that lots of products include embedded firmware running on custom ASICs.
In this case, it is almost impossible to find defects in the interactions between the hardware and firmware unless you are running the code on the custom ASICs. This is the role of emulators.
An emulator is like a simulator in that it mimics the product, but in this case it includes the actual electronics boards from the product, with the custom ASICs, without being the entire product.
This is incrementally better than simulator testing because it is more like production. The challenge here is that these are more expensive to build and maintain than the simulator-based generic blade servers.
Additionally, early in the development cycle the new custom ASICs are very expensive and in limited supply. Therefore, the test environments for the deployment pipeline are going to require a balance of simulators and emulators.
Finally, since these emulator environments are still not fully production-like, there will still need to be testing on the product. Creating processes for enabling small-batch releases and quick feedback for developers is going to require large amounts of automated testing running on the code every day.
The challenge for embedded software and firmware is that this is not really practical on the product.
Therefore, robust simulators and emulators that can be trusted to quickly find the majority of the defects are a must.
Executives will need to prioritize and drive this investment in emulators and simulators because too often embedded organizations try to transform development processes with unreliable testing environments, which will not work.
Designing Automated Tests for Maintainability
A big problem with most organizations is that they delegate the test automation task to the quality assurance (QA) organization and ask their manual testers to learn to automate what they have been doing manually for years.
Some organizations buy tools to automatically record what the manual testers are doing and just play it back as the automated testing, which is even worse.
The problem with record and playback is that as soon as something on the UI or display changes, tests begin to fail and you have to determine if it is a code defect or a test defect. All you really know is that something changed.
Since new behaviour frequently causes the change, organizations get in the habit of looking to update the test instead of assuming it found a defect.
This is the worst-case scenario for automated testing: where developers start ignoring the results of the tests because they assume it is a test issue instead of a code issue.
The other approach of having manual testers write automated tests is a little better, but it has a tendency to produce brittle tests: very long scripts that simply replicate the manual testing process.
This works fine for a reasonable number of tests when the software is not changing much. The problem, as we will demonstrate in the example below, comes when the software starts to change and you have thousands of tests. Then the upkeep of these tests turns into a maintenance nightmare.
Whenever the code changes, it requires going through all the thousands of tests that reference that part of the software to make sure they get the right updates.
In this case, organizations start to ignore the test results because lots of tests are already failing due to the QA team not being able to keep all their tests running and up to date.
The best approach for automated testing is to pair a really good development architect who knows the fundamentals of object-oriented programming with a QA engineer who knows how the code is manually tested in the current environment. They can then work together to write a framework for automated testing.
A good example can be found in the book Cucumber & Cheese: A Tester’s Workshop by Jeff Morgan. He shows how to create an object-oriented approach to a test framework using a puppy adoption website as an example.
Instead of writing long monolithic tests for different test cases that navigate the website in similar ways, he takes an object-oriented approach.
Each page on the website is represented by a page object model. Each time the test lands on that page, the data_magic gem automatically populates any data required for that page. The test then simply defines how to navigate through the website and what to validate.
With this approach, when a page on the website changes, that one page object model is the only thing that needs to change to update all the tests that reference that page. This results in tests that are much less brittle and easier to maintain.
Using Page Object Models and other similar techniques will go a long way towards reducing your test development and maintenance costs.
However, if your approach is to move fully to automated testing, the use of Page Object Models will be insufficient to drive your maintenance costs to an acceptable level. There are some additional techniques and processes that you will need to put in place.
Designing Automated Tests to Quickly Localize Defects
We mentioned that one of the most common mistakes that organizations make when moving from manual to automated testing is to take their existing QA teams and simply have them start automating their existing manual scripts.
This approach will cause some serious issues and will prevent you from having a cost-effective testing process. Since we’ve fallen into this pit a few times ourselves, we want to make sure that you are aware of the issues with this approach so that you can avoid them.
An example that will help illustrate the issues with automating manual test scripts is a large e-commerce website with new functionality that allows it to accept a new brand of credit card for payment.
We want some tests that demonstrate that you can actually check out and purchase a product using this new type of card.
To test this, our manual tester will go to the homepage of our website. From there they might search for a product using the keyword search, and after finding and selecting a product, they will add it to the cart. From there they might go to the cart and see if the product they selected is the one that actually ended up in the cart.
Next, they might sign in as a registered user so that they can determine if the loyalty points functionality will work with this new brand of credit card. To sign in as a registered user, they will have to enter their username and password.
Now they are signed in and can add this new brand of credit card to their account and see if the website accepts it as a valid form of payment. Assuming so, they’re now ready to check out.
They click on the checkout button and land on the checkout page. Since they’re logged in as a registered user, their name, address, and phone number should be prepopulated in the correct fields on the page.
After validating this information, the manual tester is now ready to do the part of the test that we have been waiting for: Can I select this new brand of credit card as my form of payment and actually purchase the product I selected?
The problem with automating this script is that the new automated test can fail for any number of reasons that have nothing to do with the new credit card or the checkout functionality.
For example, if keyword search is not working today, this checkout test will fail because it can’t find a product to put in the cart.
If the sign in functionality is broken, the test will fail because it can’t get past this stage of the test.
Because this new test was written to validate our new functionality and it failed, the responsibility for triaging the failure falls on the team developing the new credit card capability.
In reality, there will be many of these tests, some with valid and invalid credit card numbers, some that attempt to purchase a gift card, and some that attempt to purchase a product, but ship it to an alternate address.
The list goes on. The result will be that on any given day our team could walk in to discover that most of their test suite that was working yesterday is now failing.
As they begin to triage each failing test, they quickly learn that most are not failing because of the new functionality that was supposed to be tested, but because something else in the system is broken. Therefore, they can’t trust the test failure to quickly localize the cause of the failure.
It was designed and labelled as a test to check out with a new credit card, and it is not until someone manually debugs the failure that it can be determined it is due to a code failure somewhere else in the system.
Tests that are designed this way require a lot of manual intervention to localize the offending code. Once you start having thousands of tests you will realize this is just not feasible.
The tests need to be written so that they are easy to update when the code changes, but they also need to be written so that they quickly localize the offending code to the right team.
One of the best and most important tools at your disposal to enable efficient triage is component-based testing. Think back to the architectural drawing.
What you want is a set of automated tests that fully exercise each component without relying on the functionality of the other components.
In the traditional Waterfall method, we called this subsystem testing, and we expected each component or subsystem team to mock out the system around them and then test their component.
Once all components were tested in this way, we moved on to the integration phase and put them all together and began system testing.
However, it is now possible to develop an automated testing framework that allows you to isolate the various components of the system, even when the system is fully integrated and deployed.
Using our new credit card for a checkout example again, what we really want to do is start the test on the URL associated with the checkout page and to already have the user signed in and products loaded into the cart.
The team will now quickly learn to pay attention to the test results because a pass or fail is much more likely to be an accurate representation of the state of their functionality.
The second big change is in your ability to statistically analyze your test metrics. When your automated test suite consists of a large number of tests that are just an automated version of your manual tests, it is very difficult to look at a large number of test results each day and determine which functionality is working and which functionality is broken.
Even if you organize these tests into components, you will not be able to determine the status of each component. Remember from our example: when your checkout tests are all failing, you don’t know if the problem is in the checkout functionality or not. You have no idea what is broken.
If instead, you take a component-based approach of designing and grouping tests, then you can quickly localize a drop in pass rates to the offending code without a lot of manual triage intervention. This makes it feasible to efficiently respond to the thousands of automated tests that will be running every day.
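As a sketch of what this component-level analysis might look like, the following Python groups each day’s results by the component a test was designed to exercise and flags any component whose pass rate drops below a threshold. The data shape and the threshold value are assumptions for illustration.

```python
from collections import defaultdict

# Component-level triage sketch: aggregate pass/fail results per
# component, then flag components whose pass rate has dropped.

def pass_rates(results):
    """results: iterable of (component, passed) pairs -> {component: rate}."""
    totals = defaultdict(lambda: [0, 0])  # component -> [passed, total]
    for component, passed in results:
        totals[component][1] += 1
        if passed:
            totals[component][0] += 1
    return {c: p / t for c, (p, t) in totals.items()}

def localize_failures(results, threshold=0.9):
    """Return components whose pass rate fell below the threshold."""
    return sorted(c for c, rate in pass_rates(results).items()
                  if rate < threshold)
```

A drop in one component’s pass rate now points directly at the offending code, without anyone manually triaging individual test failures.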
Transforming the software development practices of large, traditional organizations is a very big task, which has the potential for dramatic improvements in productivity and efficiency.
This transformation requires a lot of changes to how the software is developed and deployed. Before going too far down this path, it is important to make sure there is a solid foundation in place to support the changes.
Executives need to understand the basic challenges of their current architecture and work to improve it over time. The build process needs to support managing different artefacts in the system as independent entities.
Additionally, a solid, maintainable test automation framework needs to be in place so developers can trust the ability to quickly localize defects in their code when it fails. Until these fundamentals are in place, you will have limited success effectively transforming your processes.