Applying DevOps principles at scale for enterprise solutions delivered across servers is going to require implementing Continuous Delivery. This is a big effort for most large organizations.
Continuous Delivery is an approach to development and operations that automates configuration, deployment, and testing processes. In this blog, we explain how we use Continuous Delivery for better results.
Because it requires cultural changes, you want to make this transition as efficient as possible; when it takes too long, resistance to the changes builds. Architecting the approach to efficiently leverage common code and knowing where to start are essential.
The pipeline should be designed with an orchestrator that enables the reuse of common components and avoids the creation of monolithic scripts that are hard to maintain.
Designing a scripted environment and scripted deployment with common scripts that have an approach for separately describing the differences between environments will help get Development and Operations working toward common goals with common tools.
Once there is good architecture design, the next most important step is determining which changes to take on in which order. These priorities should be based on the business objectives and where you will get the biggest return on your investment.
Executives should also prioritize changes that start the cultural shifts that are required early before you invest too much time and energy in technical solutions.
This prioritization should also consider which components are changing most frequently and what is causing the most pain in the deployment process.
Implementing Continuous Delivery in a large, traditional organization is going to take some time and effort, but it is worth the effort because it will require fixing issues that have been undermining the effectiveness of your organization for years.
Continuous Delivery Definitions
Continuous Delivery is a fundamentally different approach to development and operations that automates as much of the configuration, deployment, and testing processes as possible and puts it all under revision control.
This is done to ensure the repeatability of the process across multiple servers and deployment environments.
The automation helps to eliminate human errors, improves consistency, and supports increased frequency of deployments. As pointed out in Continuous Delivery, initially these processes are very painful. Nonetheless, you should increase their frequency to force you to fix all the issues.
When the frequency of the build and deployment process is fairly low, your organization is able to use brute force to work through these issues. When you increase the frequency, this is no longer possible.
These issues that have been plaguing your organization for years will become much more visible and painful, requiring permanent automated solutions.
Putting this automation under revision control enables tracking changes and ensuring consistency. The core pieces of Continuous Delivery you need to know are continuous integration, scripted environments, scripted deployments, evolutionary database design, test automation, deployment pipeline, and orchestrator.
Continuous integration is a process that monitors the source code management tool for any changes and then automatically triggers a build and the associated automated build acceptance testing. This is required to ensure the code maintains a certain base level of stability and all changes are being integrated often to find conflicts early.
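The monitor-build-test cycle described above can be sketched in a few lines. This is a minimal, illustrative Python sketch, not a real CI server; the `build` and `acceptance_tests` callables are hypothetical stand-ins for your actual build and test tooling.

```python
# Illustrative sketch of one continuous integration cycle: build and test
# only when a new commit has appeared since the last successful build.

def latest_commit(repo_log):
    """Return the newest commit id from a simple log list (newest last)."""
    return repo_log[-1] if repo_log else None

def run_ci_cycle(repo_log, last_built, build, acceptance_tests):
    """One CI poll: if there is a new commit, build it and run the
    automated build acceptance tests; otherwise do nothing."""
    head = latest_commit(repo_log)
    if head is None or head == last_built:
        return last_built, None          # nothing new to integrate
    artifact = build(head)               # compile/package the new revision
    passed = acceptance_tests(artifact)  # automated build acceptance testing
    return head, passed
```

A real CI server (Jenkins, GitLab CI, and the like) replaces this loop with webhooks and agents, but the control flow it automates is the same.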
A scripted environment is an approach that creates a script that can be run against a server or virtual machine to configure all the settings from the operating system to the container.
This is very different from what happens in most traditional organizations, where people are able to log onto servers and make changes required to get the application working.
When people manually make changes, it is easy to get different servers across production or the deployment pipeline in different states of configuration.
This results in seeing different issues as the code is moved through the deployment pipeline and having flaky issues based on different servers with the same application having different configurations.
In this situation, a system test can become intermittent and begin to flicker. Even with exactly the same application code, a test will pass in one environment but fail in another because one environment is misconfigured in some subtle way.
This tends to make the triage process very difficult because it is hard to reproduce the problem when it only occurs on one of several servers hosting a common component.
Scripted environments fix this by having every server configured off of a common script that is under revision control. This ensures every server across the system has the exact same configuration.
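As a rough illustration of the idea, a scripted environment converges every server to one desired state that is held under revision control. The setting names below are hypothetical, and real configuration tools (Puppet, Chef, Ansible) implement the same principle declaratively.

```python
# Illustrative sketch of a scripted environment: one declarative spec,
# applied identically to every server, so no server drifts.

DESIRED = {                      # kept under revision control
    "os.max_open_files": "65536",
    "container.heap_mb": "2048",
}

def apply_environment(current_settings, desired=DESIRED):
    """Converge a server's settings to the desired state; idempotent,
    so running it twice makes no further changes."""
    changes = {}
    for key, value in desired.items():
        if current_settings.get(key) != value:
            current_settings[key] = value  # in reality: write config, restart service
            changes[key] = value
    return changes                         # empty dict means already converged
```

Because the same script runs against every server, two servers can only differ if someone bypasses the script, which is exactly the behavior this practice is meant to eliminate.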
Scripted deployment is a process for automating the deployment across all of the environments.
This process deploys and configures the applications. Ideally, it also tests the deployments to ensure they were successful. This automation is also put under revision control so that it is repeatable day-to-day and server-to-server.
Evolutionary database design (EDD) is an approach for managing database changes to ensure that schema changes won’t ever break your application.
It also includes tools that put the database changes under revision control and can automatically update the database in any environment to the desired version of the configuration.
The application then ensures it is calling the version of the schema it was expecting when the code was written. Once the code in the application is updated to call the new version of the database schema, the old rows or columns are deleted.
This approach allows you to keep your application in an always-releasable state and enables you to decouple application code changes from your database schema changes.
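The expand/contract pattern behind EDD can be sketched with the schema modeled as a plain dict. This is illustrative only; real tools such as Liquibase or Flyway apply the same versioned-migration idea to actual SQL schemas, and the column names here are hypothetical.

```python
# Hedged sketch of evolutionary database design: migrations are versioned,
# additive changes land first ("expand"), and old columns are removed only
# after all application code has moved on ("contract").

MIGRATIONS = [
    # v1 -> v2: expand -- add the new column; old columns keep working
    lambda s: {**s, "full_name": "TEXT"},
    # v2 -> v3: contract -- drop old columns once nothing reads them
    lambda s: {k: t for k, t in s.items() if k not in ("first", "last")},
]

def migrate(schema, current_version, target_version):
    """Bring a schema from its current version to the desired version."""
    for step in MIGRATIONS[current_version - 1:target_version - 1]:
        schema = step(schema)
    return schema
```

Any environment in the pipeline can be brought to the version the deployed code expects, which is what decouples application releases from schema releases.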
Deployment pipeline defines how new code is integrated into the system, deployed into different environments, and promoted through various stages of testing.
This can start with static code analysis and unit testing at the component level. If these tests pass, the code can progress to a more integrated application-level testing with a basic build acceptance-level testing.
Once these build acceptance tests are passing, the code can progress to the next stage, typically full regression and various forms of non-functional testing like performance and security testing.
If full regression and non-functional testing results reach an acceptable pass rate, the code is ready to deploy into production. The deployment pipeline in Continuous Delivery defines the stages and progression model for your software changes.
An orchestrator is a tool that coordinates all of this automation. It enables you to call scripts to create new environments on a set of virtual machines with scripted environments, deploy code to that environment using scripted deployment, update the database schema with EDD, and then kick off automated testing.
At the end of the automated testing, if everything passes, it can move this version of the applications and scripts forward to the next stage of the deployment pipeline.
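A minimal sketch of an orchestrator driving those steps in order might look like the following; the stage names and stub functions are hypothetical placeholders for your real scripts.

```python
# Illustrative orchestrator sketch: the orchestrator owns the order of the
# pipeline stages and calls small, reusable scripts; it stops and reports
# on the first failure instead of promoting a broken build.

def run_pipeline(version, stages):
    """Run each (name, stage) pair in order; halt on the first failure."""
    for name, stage in stages:
        if not stage(version):
            return f"{name} failed for {version}"
    return f"{version} promoted"

stages = [
    ("create environment", lambda v: True),  # scripted environment
    ("deploy",             lambda v: True),  # scripted deployment
    ("migrate database",   lambda v: True),  # evolutionary database design
    ("automated tests",    lambda v: True),  # build acceptance tests
]
```

Real orchestrators (Jenkins pipelines, GoCD, and similar tools) add parallelism, retries, and audit trails, but the essential job is this sequencing of otherwise independent scripts.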
Continuous Delivery Architecture
These definitions show the different pieces that need to come together to create a Continuous Delivery pipeline. These different pieces can be in different tools or, at times, in the same tool.
That said, it makes sense to think of all the core pieces of Continuous Delivery as different and distinct activities so you can ensure that you are architecting Continuous Delivery correctly.
Additionally, you want to make sure you are using an object-oriented approach to creating these scripts so that you don’t end up with long, monolithic scripts that are brittle and hard to maintain.
A Continuous Delivery pipeline can be created to work with monolithic scripts, and this might even be the best approach for small initiatives. That said, as you work to scale these processes across large organizations, it is important to take a more object-oriented approach.
Think of the orchestrator as a different component that abstracts away the business logic of the deployment pipeline.
It defines what scripts to call and when to call them to avoid creating too much complexity in the basic scripts.
It also helps to group the scripts into common objects that the orchestrator can call in the right order to support several different stages and roles in the deployment pipeline. Proper use of the orchestrator should enable the other components to have common scripts that can be leveraged for a variety of purposes.
Once you have thought through what tools to use for each process, it is important to think through how to architect scripted environments and scripted deployments.
The objective is to leverage, as much as is possible, a well-architected set of common scripts across different stages in the deployment pipeline from Development to Production.
These scripts need to be treated just like application code that is being developed and promoted through the deployment pipeline. Ideally, you would want the exact same script used in each stage of the deployment pipeline.
While ideal, using the same script is frequently not possible because the environments tend to change as you progress up the deployment pipeline on the way to the full production environment.
Therefore, your architectural approach needs to address how to handle the differences across environments while creating one common script that can be qualified and promoted through the deployment pipeline.
This typically is accomplished by having one common script where variables are passed in to define the differences between the environments.
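That pattern, one common script with per-environment variables, can be sketched as follows; the environment names, hosts, and settings are hypothetical.

```python
# Sketch of one common deployment script parameterized per environment:
# the promotion logic is identical everywhere, and only the data differs.

ENVIRONMENTS = {  # the per-environment differences, also under revision control
    "dev":  {"hosts": ["dev-app-01"],                 "replicas": 1},
    "prod": {"hosts": ["prod-app-01", "prod-app-02"], "replicas": 2},
}

def deploy(artifact, env_name):
    """Run the same deployment steps for any stage of the pipeline;
    the environment differences come in as variables, not as code."""
    env = ENVIRONMENTS[env_name]
    return [f"deploy {artifact} to {host}" for host in env["hosts"]]
```

Because the script body never changes between stages, qualifying it in an early environment genuinely qualifies what will run in production.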
If this approach is not used, you are at risk of having large monolithic scripts that are different for each environment. When this occurs, even if these scripts are being managed within an SCM, they tend to hold very similar code in different branches that are difficult to keep consistent and tend to break.
This violates the basic principle of having one common script that is defined and qualified in the early stages of the deployment pipeline and then progressed forward to production, and it lacks the consistency that scripted environments are intended to provide.
Therefore, it is important to make sure your scripts for environments are architected correctly. Once they are, you have a process for defining an environment and having it replicated consistently across the deployment pipeline.
Executives need to ensure these technical changes happen because they start the cultural change of getting Development and Operations working with the same tools on common objectives.
If the technical approaches are allowed to deviate, then you won’t get the cultural alignment that is required for a successful transformation.
These same architectural concepts should apply to scripted deployments. Depending on the tool you choose, the process for defining environment differences for a common deployment script may already be built in.
If not, it is important to think through how to apply this architectural principle so you don’t end up with long, monolithic scripts that are hard to maintain and not consistent across the deployment pipeline.
The other key principle to include as part of the scripted deployment process is creating post-deployment validation tests to ensure the deployment was successful on a server-by-server basis. The traditional approach is to create environments, deploy the code, and then run system tests to ensure it is working correctly.
The problem with this approach is that once a system test fails, it leads to a long and painful triage process. The triage process needs to determine if the test is failing due to bad code, a failed deployment, or other environmental differences.
The scripted environment process ensures environmental consistency. This still leaves differentiating between code and deployment issues, which can be challenging.
The scripted deployment process ensures the same automated approach is used for every server; however, it does not naturally ensure the deployments were successful for every server or that the various servers can actually interface with one another as expected.
This becomes especially problematic when the deployment only fails for one of several common servers, because this leads to the system tests failing intermittently, depending on if that run of the tests happens to hit the server where the deployment failed.
When this happens, it results in a very long and frustrating triage process to find the bad server and/or routing configurations amongst tens to hundreds of servers.
The good news is that there is a better approach. There are techniques to isolate code issues from deployment issues during the deployment process.
You can create post-deployment validations to ensure the deployment was successful and localize deployment issues down to the fewest potentially offending servers or routing devices as quickly as possible.
Each step in the process is validated before moving on to the next step. Once the servers and routers are configured, there should be tests to ensure that the configuration has occurred correctly. If not, fix the issues before deploying the code.
The next step is to validate that the deployment was successful for every server individually. This can be done by automatically checking log files for warnings or errors and writing specific tests to ensure the deployment was successful.
These checks should be done for every server and routing configuration before starting any system testing.
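A hedged sketch of that per-server validation step: each server's deploy log is scanned and a health check is run before any system testing starts, so a bad server is identified up front rather than during triage. The log format and health check here are hypothetical.

```python
# Illustrative post-deployment validation: check every server individually
# before system tests run, localizing failures to the offending servers.

def validate_server(name, log_lines, health_check):
    """Return None if the server looks good, else a reason string."""
    for line in log_lines:
        if "ERROR" in line or "WARN" in line:
            return f"{name}: suspicious log line: {line!r}"
    if not health_check(name):
        return f"{name}: health check failed"
    return None

def failed_servers(servers, health_check):
    """Run the validation for every server; only a clean sweep should
    allow the pipeline to proceed to system testing."""
    results = (validate_server(n, log, health_check) for n, log in servers.items())
    return [r for r in results if r is not None]
```

If this list is non-empty, the pipeline stops and reports the exact servers at fault instead of letting a flickering system test hint at the problem hours later.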
Once each of these steps is complete, the system tests can start. The idea here is to define and automate the deployment pipeline such that it is isolating issues so that you can save time and energy in the triage process.
If this is done correctly, the system tests will be much more likely to find code issues and not fail because of issues within the environment.
If you are still seeing system tests fail because of non-code issues, you probably have holes in your post-deployment checks that should be plugged to help isolate environment/deployment issues and improve the triage process.
Continuous Delivery Implementation
The principles of Continuous Delivery are well designed to address the challenges faced by most traditional enterprise organizations deploying code across multiple servers.
Determining where to start, though, can be a bit overwhelming for most large organizations, because this magnitude of change can take several months to years. Therefore, you need to start where the changes will have the most benefit.
It may be tempting to start “doing Continuous Delivery” by just taking one application and implementing all the Continuous Delivery approaches to learn everything about Continuous Delivery before rolling it out broadly.
This will work well if that component can be developed, qualified, and deployed independently. If instead, you are working with a tightly coupled architecture, this approach is not going to work.
It will take a while to implement all those steps for one application, and since it still can’t be deployed independently, it won’t provide any benefits to the business.
When this happens, your transformation is at risk of losing momentum because everyone knows you are investing in Continuous Delivery but they are not seeing or feeling any improvements in their day-to-day work.
Providing feedback to developers in a production-like environment is a key first step that starts helping the organization right away. It forces the different teams to resolve the integration and production issues on a daily basis, when they are cheaper and easier to fix.
It also encourages the cultural change of having the organization prioritize coordinating the work across teams to deliver value to the customer instead of local optimizations in dedicated environments.
It gets Development and Operations teams focused on the common objective of proving the code will work in a production-like environment, which starts to lower the cultural barrier between the two groups.
You start ensuring the developers prioritize making sure their changes will work in production with all the other code instead of just writing new features.
Determining if the organization will embrace these cultural changes up front is important because if they won’t, there is no sense in making big investments in technical solutions that on their own won’t help.
Once it is clear the organization will commit to the cultural changes, deploying and testing all the code in this production-like environment as many times a day as possible provides the additional advantage of helping to prioritize the rest of the technical improvements.
By increasing the frequency, you are making visible the issues that have been plaguing your organization for years.
This process helps clarify what is most important to standardize and speed up with automation. For example, to speed up the frequency it is important to start automating the things that change most, like code, so it usually makes sense to start with automating the continuous integration and deployment process.
The next steps depend on where you are seeing the most change or feeling the most pain. If your environments are changing a lot and causing problems with consistency, then you know you need to implement scripted environments.
If database changes are giving you problems, then you can prioritize the evolutionary database work.
If keeping the test automation passing every day requires constantly updating or changing tests, then your framework needs to be rearchitected for maintainability.
If your team is spending lots of time triaging between deployment and code issues, work on creating post-deployment tests.
The important point is to let the pain of increasing the frequency on this production-like environment drive the priority of your technical changes. This will force you to fix the issues in priority order and provide the fastest time to value for the transformation.
DESIGNING THE DEPLOYMENT PIPELINE
Providing quick feedback to developers and maintaining a more stable code base are essential to improving the productivity of software development in large, traditional organizations.
Developers want to do a good job, and they assume they have until they get feedback to the contrary. If this feedback is delayed by weeks or months, then it can be seen as beating up developers for defects they don’t even remember creating.
If feedback comes within a few hours of the developer commit and the tools and tests can accurately identify which commits introduced the problem, the feedback gets to engineers while they are still thinking about and working on that part of the code.
This type of feedback actually helps the developers become better coders instead of just beating them up for creating defects.
Even better, if it is very good feedback with an automated test that they can replicate on their desktop, then they can quickly identify the problem and verify the fix before re-committing the code.
For large, traditional organizations with tightly coupled architectures, it is going to take time and effort to design the test stages of the deployment pipeline to appropriately build up a stable system.
Principles Driving the Design
Working to quickly localize the offending code down to the fewest number of developers possible is the basic principle that should drive design. This can be accomplished with two complementary approaches.
First, more frequent build and test cycles mean that fewer developers have committed code since the last build.
Second, build up stable components or applications that get qualified and progressed forward to larger integrations of the enterprise system. Ideally, the entire system would be fully deployed and tested with every new commit to quickly localize any system issues immediately down to an individual.
This, though, is just not practical for most large, traditional organizations. Therefore, you need to design a deployment pipeline that moves as close to that ideal state as possible but accommodates the current realities of your traditional business.
The first step in any testing process is unit testing and static code analysis to catch as many defects as quickly as possible.
In fact, we can’t think of any good reason why a developer would ever be allowed to check in code with broken unit tests. The advantage of unit tests is that if properly designed, they run very fast and can quickly localize any problems.
The challenges with unit tests for lots of large, traditional organizations are twofold: First, due to their tightly coupled architectures, traditional organizations are not able to effectively find most of the defects with unit tests.
Second, going back through lots of legacy code to add unit tests is seldom realistic until that part of the code is being updated.
The second step in the testing process to qualify code before final integration is component- or service layer-testing at clean interfaces. This should be used where possible because these tests run faster, and the interfaces should be more stable than the true user interface-based system testing.
The final step that can be very effective for finding the important integration issues in large, traditional organizations is user interface-based system testing.
The challenge with these tests is that they typically run slowly and can require running the entire enterprise software system. Therefore you have to think carefully about when and where they are used.
Designing Test Stages
Ideally, you would run every test on every check-in with every component that needs to work together.
This is not practical for most large, traditional organizations that have tightly coupled architectures requiring long-running system tests and hundreds or thousands of developers working on software that must work together in production.
Therefore, you need to design a test stage strategy that localizes the feedback to the smallest number of developers possible while finding as many of the integration and operational issues as possible.
This strategy is accomplished in three ways. First, build and test as much of the system as possible, as often as possible as one unit.
Second, break down and simplify the problem by using service virtualization to isolate different parts of the system. Third, pick a subset of the testing for fast feedback before promoting the code forward to more stages for more extensive testing.
This subset of tests defines the minimal level of stability that you will allow in the system at each stage of code progression. They go by different names in different organizations, but for the purpose of this example, we will refer to them as build acceptance tests.
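One common way to carve out that build acceptance subset is to tag tests and let each stage select by tag, as in this illustrative sketch; the test and tag names are hypothetical, and pytest markers implement the same idea in practice.

```python
# Sketch of stage-based test selection: every test carries tags, and each
# pipeline stage runs only the subset appropriate to its feedback budget.

TESTS = [
    ("login_smoke",       {"bat", "regression"}),
    ("checkout_smoke",    {"bat", "regression"}),
    ("report_edge_cases", {"regression"}),
    ("perf_baseline",     {"nonfunctional"}),
]

def select(tag):
    """Pick the tests a stage should run by tag; the 'bat' subset defines
    the minimum stability bar for code progression."""
    return [name for name, tags in TESTS if tag in tags]
```

Early stages run the fast `bat` subset for quick feedback; later stages run the full `regression` and `nonfunctional` sets.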
Where possible, the best solution is to build up and test the system as often as possible in an environment that is as close to production as it can be. While not always practical, it should be the first choice where possible.
Localizing the issues to the smallest number of commits and developers by reducing the time window between test executions provides the advantage of finding as many of the system integration and operational issues as possible as early and as cheaply as possible.
It can be challenging, though, to get everyone working together to keep the build green if it becomes too large or complex with different organizations providing code at different levels of stability. Additionally, if there are frequent build failures, it might be difficult to get someone to own the resolution of the issue.
Therefore, in large organizations, it frequently makes sense to break the problem down into smaller, manageable pieces that can be built up into a larger, stable system.
Service virtualization is a very effective tool for helping to break up a very large enterprise system into manageable pieces.
Isolating components from other parts of the system has the advantage of allowing you to localize the feedback to developers working on just one part of the enterprise system while simultaneously running system tests.
Different teams can make progress independently, creating new capabilities against a known interface with the service virtualization, then start end-to-end testing once both sides are ready.
The disadvantages are creating virtualization that will have to be maintained and the potential for the virtualization to be different from the actual interface with the other component.
Therefore, virtualization should be used as little as possible, but it is an important tool for building up testing stages for enterprise software systems.
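The core mechanic of service virtualization can be sketched as a stub object that honors the agreed interface contract while returning canned data; the service name, method, and payloads below are hypothetical stand-ins for a real dependency.

```python
# Minimal sketch of service virtualization: the code under test talks to
# the agreed interface and cannot tell it is a stub, so one part of the
# enterprise system can be tested without the rest being deployed.

class VirtualInventoryService:
    """Answers the real service's interface with canned responses."""
    def __init__(self, canned):
        self.canned = canned          # e.g. {"widget": 5} units in stock

    def get_stock(self, sku):
        return self.canned.get(sku, 0)

def can_fulfill(order, inventory):
    """Application logic under test, unaware the dependency is virtual."""
    return all(inventory.get_stock(sku) >= qty for sku, qty in order.items())
```

The maintenance risk called out above is visible here: if the real service's `get_stock` behavior drifts from the stub's, tests against the stub keep passing while the integration quietly breaks, which is why the virtual service must be removed for periodic true end-to-end runs.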
For very large organizations, virtualization should be considered for interfaces across significant organizational boundaries to help localize the ownership of issues.
It can also be a very helpful approach for working across teams using different development methodologies like Agile and Waterfall because it allows independence of development models while forcing cross-organizational alignment with working code.
The different groups can work independently and make progress with the virtual service.
This works well as long as teams are removing the virtual service when the code is ready for true end-to-end testing, ideally daily, to ensure the actual enterprise system stays stable even though the teams are using different management approaches.
Instead of forcing the convergence of all the development processes, start with integrating stable code on a more frequent basis.
This will help align the Waterfall organization with the business value of having a more stable code base without forcing them to embrace Agile as a completely new development methodology.
The beauty of doing it this way is that the Waterfall organizations will start evolving toward always-releasable code, which is one of the biggest advantages of Agile.
The next big step for them in applying Agile and DevOps principles at scale is just creating a prioritized backlog. Then, over time, if they choose to, they can start implementing Agile process changes at the team level.
This approach gets the entire organization moving toward the business benefits of Agile without requiring a “big bang” Agile transformation across the enterprise.
The next step is to build and system test the Agile components on the left of the graphic with the rest of the Waterfall components running against the virtual service. You want to do this as many times a day as possible using your build acceptance tests.
This enables finding the integration issues between the Agile components and most of the integration issues between the Agile and Waterfall components. We say most of the issues because the virtual service will not be a perfect substitute for the Waterfall components.
An additional benefit is being able to isolate the Agile code progression from the periods of instability that originally exist in most Waterfall development processes. The Agile components are not held up for weeks or months while the Waterfall components are undergoing these periods of instability.
It also optimizes the speed of feedback and localizes the feedback to a smaller number of developers by increasing the frequency of the build for the Agile components.
The most extensive set of automated testing should be run at this stage. The objective here is to be able to fully test the Agile components as a system so the issues can be isolated and fixed as a smaller component of the larger system.
Then take a subset of the automated tests that will fully test the interface and have them run against the full enterprise system with the virtual service removed.
Running a subset of tests that fully exercise this interface in the entire system, ideally on a daily basis, has the advantage of ensuring there are no disconnects between the virtual service and the actual code.
The other advantage of this approach is that the majority of the full regression testing can be run on smaller, less expensive environments that only represent the Agile components.
The more expensive and complex enterprise environment is only used for the subset of tests that are required to validate the interface.
For example, there may be hundreds of automated tests required to validate the Agile components that have the same use case against the interface.
In this case, run the hundreds against the virtual service and then pick one of them to be part of the enterprise system test. This saves money and reduces the complexity of managing end-to-end testing.
This is a simple example, but it shows how you can use build frequency and service virtualization in combination to help break down and build up complex software systems to localize feedback while moving the organization to the more stable system.
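The deduplication step in that example, keeping one representative test per interface use case for the expensive enterprise environment, can be sketched as follows; the test and use-case names are hypothetical.

```python
# Sketch of picking the interface subset: from many tests that exercise the
# same use case against the interface, run one in the full enterprise
# environment and run the rest against the virtual service.

def interface_subset(tests):
    """Keep the first test seen for each interface use case."""
    chosen = {}
    for name, use_case in tests:
        chosen.setdefault(use_case, name)  # first representative wins
    return list(chosen.values())
```

Everything not in this subset still runs, just in the smaller, cheaper environment behind the virtual service.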
Defining the Build Acceptance Test
The third approach to designing the test stages is to break the long-running system tests into groups with different objectives.
System testing has huge advantages for traditional organizations in the quality of feedback to developers. The tradeoff is the timeliness of this feedback and how it is incorporated into code progression.
Therefore, these system tests need to be broken into stages to optimize coverage while at the same time minimizing the delay in the feedback. While unit and component testing are the first priority, there also needs to be an effective strategy for system tests.
There should be a clear subset of build acceptance system tests that are designed to get the best coverage as quickly as possible.
These build acceptance tests are probably the most valuable tool available because they are used to drive up the stability and quality of the code base.
In traditional continuous integration systems, this is where you would turn the build red. The build acceptance tests define the absolute minimum level of stability that will ever be allowed in the code base.
If these tests fail, then getting them fixed is the top priority for the entire Development organization and nobody is allowed to check in code until they are fixed.
This is a big cultural change for most organizations. It requires management reinforcing expected behaviors through extensive communication. It also requires the rest of the organization to hold off any check-ins until things get fixed, which slows down the entire organization.
A more effective approach, as discussed before, is to use your systems to enforce the behaviors by automatically blocking code that does not pass the build acceptance tests from ever being committed to trunk in the source code management (SCM) tool.
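A gated commit can be sketched as a function that only lets a change land on trunk when the build acceptance tests pass against the candidate build; in practice this logic lives in a pre-merge CI gate or an SCM pre-receive hook, and the structures below are illustrative.

```python
# Illustrative sketch of a gated commit: trunk stays at or above the
# stability bar defined by the build acceptance tests, because failing
# changes never land.

def gated_commit(trunk, change, build_acceptance_tests):
    """Build a candidate with the change applied; merge only if the
    build acceptance tests pass, otherwise reject with trunk untouched."""
    candidate = trunk + [change]
    if build_acceptance_tests(candidate):
        return candidate, True    # change lands on trunk
    return trunk, False           # trunk stays stable; developer gets feedback
```

The cultural payoff is that the system, not a manager, enforces the stability bar, so nobody has to stop the whole organization while a red build is fixed.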
Building Up the Enterprise System
The next step is to combine gated commits and build acceptance tests with the process of building up and integrating a large enterprise software system in order to balance progressing code through the system quickly to find integration issues with keeping the code base stable.
As discussed earlier, from a change management perspective it is best if you can have your system gate commits at the SCM level in order to keep the code base stable and to provide feedback to a small group of developers.
You don’t want the offending code to ever make it into the repository. Ideally, you would want all your tests run against the entire enterprise system before letting code into the SCM to optimize stability.
The problem with this approach is the length of time required for feedback to the developer and the time it takes to get code onto trunk and progress through the system.
The most pragmatic approach is to gate commits into the SCM at the point in the deployment pipeline where you optimized for the frequency of builds on subcomponents of the enterprise system. This automates keeping the trunk stable for components that have fast build and test cycles.
IMPROVING STABILITY OVER TIME
Once the fundamentals are in place and you have started applying DevOps principles at scale, you are ready to start improving stability.
This section covers how to increase the stability of the enterprise system using build acceptance tests and the deployment pipeline. This next step is important because the closer the main code base gets to production-level quality, the more economical it is to do more frequent, smaller releases and get feedback from customers.
Additionally, having a development process that integrates stable code across the enterprise is one of the most effective ways of aligning the work across the teams, which, as we have discussed before, is the first-order effect for improving productivity.
Depending on the business, your customers may not require, or even allow, very frequent releases, but keeping the code base more stable will help find integration issues early, before there are large investments in code that won't work well together.
It will also make your Development teams much more productive because they will be making changes to a stable and functioning code base.
Even if you are not going to release more frequently than you currently do, you will find keeping the code base more stable will help improve productivity.
If your organization does want to move to continuous deployment, trunk (not branches) should be driven to production levels of stability, such that any new development can be easily released into production.
Regardless of your need for Continuous Delivery, the process for improving stability on the trunk is going to be a big transformation that requires managing the automated tests and the deployment pipeline.
Understanding the Work between Release Branch and Production
The first step in improving stability and enabling smaller, more frequent releases requires understanding the work that occurs between the release branch and production.
At this stage, development is ideally complete and you are going through the final stages of making sure the quality is ready for customers.
For leading-edge companies doing Continuous Delivery, the trunk is always at production levels of stability and this only takes minutes. For most large, traditional organizations, though, this is a long and painful process, and getting better is going to take some time.
To get the code ready for release, all the stories need sign-off, the defects need to be fixed or deferred, and the test passing rates need to meet release criteria. Additionally, to accomplish all these things in a timely manner, the deployment pipeline needs to be delivering green builds with the latest code on a regular basis.
You can see how, when there is a red build on days 2, 4, and 7, the metrics go flat: when you can't process code through the deployment pipeline, you can't validate any of the code changes.
You can also see that the time delay between release branch and production is driven by the time it takes to complete all the stories, fix all the defects, and get the tests passing. This one graphic allows you to quickly visualize all the work left in the system.
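The work left in the system can be summarized with a simple readiness check like the sketch below. The specific release criteria used here (a 90% pass rate and zero open stories and defects) are illustrative stand-ins for whatever your organization actually defines:

```python
def work_remaining(open_stories, open_defects, pass_rate,
                   required_pass_rate=0.90):
    """Summarize the work left between release branch and production.

    Thresholds are illustrative; substitute your own release criteria.
    pass_rate and required_pass_rate are fractions between 0 and 1.
    """
    return {
        "stories_to_sign_off": open_stories,
        "defects_to_fix_or_defer": open_defects,
        "pass_rate_gap": max(0.0, required_pass_rate - pass_rate),
        "release_ready": (open_stories == 0 and open_defects == 0
                          and pass_rate >= required_pass_rate),
    }
```

Tracking these three numbers daily between branch and release gives you the same quick visualization of remaining work that the graphic provides, and watching them trend toward zero release over release is a direct measure of improving trunk stability.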
The closer trunk is to release quality, the less work required for getting the release into production. Therefore, the goal is to improve the stability of the trunk so there is less work overall.
Farming Your Build Acceptance Tests to Increase Trunk Stability
This example illustrates another important capability. When you see a component test suite take a dramatic pass-rate drop like this, you need to ask yourself, “How did code that broke that many tests make its way into my repository?”
A developer who has rebased their environment since this code came in is now working on a system that is fundamentally broken, and this will impact their productivity until the issue is resolved. This is especially true of the team that owns this functionality.
Other developers on this team will continue to work on other tasks and their work will be tested by these same tests. Because these tests are already failing at a high rate, the developers have no way to know if their code has introduced a new set of issues.
Essentially, you’ve allowed the train wreck to happen directly on your development trunk because your gating did not protect you.
The good news is that this problem is easy to solve. You have some set of build acceptance tests that are gating the code check-ins, but they allowed this problem to make its way into the code base. You’ve got a large number of tests in your component test suite that readily detected the problem.
All you need to do is identify the best candidate tests and promote them into the build acceptance test suite. The next time a similar problem occurs, the build acceptance tests will likely detect it and cause the offending code to be rejected and sent back to the developer.
While finding the issue within 24 hours of when it was introduced and getting that information to a small group of developers who committed code is good, getting the same information to an even smaller group within a few hours of the commit is even better.
This is why we strongly suggest that you actively manage which tests are in the build acceptance test suite to ensure you have the most effective tests in place to gate the check-in. Harvesting tests from your regression suite is a great way to help keep the build acceptance tests fresh and relevant.
This same approach can also be used to drive passing rates up over time. If your overall component tests are passing at 85%, can you find a defect fix that will drive that up to 90%?
If so, get that defect fixed and then add a test to the build acceptance test suite to ensure those tests keep passing. You can continue this process of farming your build acceptance tests over time to drive up the stability of trunk.
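A simple way to picture the farming process: from the component suite's latest results, select failing tests that are not already gating check-ins and promote a small number of them. The selection heuristic below (alphabetical, capped at three) is deliberately naive; in practice you would pick the tests with the best signal relative to their runtime, since the build acceptance suite must stay fast:

```python
def promote_candidates(test_results, already_in_bat, limit=3):
    """Pick regression tests to promote into the build acceptance suite.

    test_results: mapping of test name -> bool (True = passing).
    already_in_bat: set of tests already gating check-ins.
    limit keeps the gate fast; the sort is a placeholder for a real
    ranking by defect-detection value per minute of runtime.
    """
    failing = [name for name, passed in test_results.items()
               if not passed and name not in already_in_bat]
    return sorted(failing)[:limit]
```

Run after any dramatic pass-rate drop, this gives you a candidate list for the promotion step described above, so the next similar break is rejected at check-in instead of landing on trunk.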
If you are running a website or SaaS business driving toward Continuous Delivery, this requires driving trunk to production levels of stability. If your business model does not support that frequency of release, there are still real benefits to driving up the stability of trunk, but you may see diminishing returns from requiring a very large number of tests to pass at over 90% every day.
At HP we had ~15,000 computer hours of tests running every day on the trunk. We found that driving the passing rates to around 90% was very valuable, even though we were just doing quarterly releases.
On the other hand, we found that driving the passing rates up to 98% required a lot of extra effort that slowed down our throughput and did not provide much additional benefit.
Therefore, if your business requires less frequent releases it is going to be important to find the right balance between stability and throughput.
Embedded Firmware Example
The concepts for designing the test stages and building up the enterprise system work equally well for embedded, packaged, and web/cloud solutions. There are, however, some unique considerations for embedded software and firmware.
This is because, in addition to building up the components and progressing through the stages of testing, you also have to move through different test environments to build up to more operational, product-like environments.
Depending on the availability of the final product and emulators, you might have to rely extensively on simulators for gating all the code into the SCM.
If your embedded solution is broken into components that are more software-like versus embedded firmware interacting with custom ASICs, then you should work to gate the software components with simulators and save the emulator gates for the lower-level firmware.
Additionally, give consideration to the build acceptance tests at the simulator, emulator, and product level, which may be different given the objectives of each stage.
Either way, you should not wait too long before ensuring the build stays at an acceptable level of stability for every test environment leading up to and including the final product.
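The stage design for an embedded pipeline might be sketched as data plus a routing rule, as below. The environment names, gate purposes, and the software-like versus firmware split are assumptions for illustration, not a standard:

```python
# Illustrative stage design for embedded firmware; adjust environments
# and objectives to your own product and hardware availability.
TEST_STAGES = [
    {"environment": "simulator", "gates": "SCM check-in",
     "focus": "software-like components, fast feedback"},
    {"environment": "emulator", "gates": "component integration",
     "focus": "low-level firmware against emulated ASICs"},
    {"environment": "product", "gates": "release readiness",
     "focus": "full system on real hardware"},
]


def gate_environment(component_kind):
    """Route software-like components to plentiful, cheap simulator
    gates and reserve scarce emulator capacity for the low-level
    firmware that actually needs it."""
    return "simulator" if component_kind == "software-like" else "emulator"
```

The point of separating the stages this way is that each environment can carry its own build acceptance tests matched to its objectives, while every environment leading up to the final product is still held to an acceptable level of stability.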
Improving the Processes
Executives need to make sure they don't let the magnitude of the overall change keep them from starting on improving the processes. We understand that taking on this big a change in a large organization can be a bit overwhelming, which might cause some executives to hesitate to lead the effort.
They need to understand that while leading the transformation is a big challenge, trying to compete in the market with traditional software development processes is going to be even more challenging as companies around them improve.
Instead of worrying about the size of the challenge or how long it is going to take, executives just need to make sure they start the enterprise-level continuous improvement process. There is really no such thing as a bad software development process; there is only how you are doing it today and how you can make it better.
The key is starting the process of continually improving. The first step is making sure you have a clear set of business objectives that you can use to prioritize your improvements and show progress along the journey. Next is forming the team that will lead the continuous improvement process.
The team should include the right leaders across Development, QA, Operations, and the business who will need to support the priorities and lead the transformation. You need to help them understand the importance of this transformation and get them to engage in the improvement process.
As much as possible, if you can get these key leaders to help define the plan so they take ownership of its success, it is going to help avoid resistance from these groups over time. Nobody is going to know all the right answers up front, so the key is going to be moving through the learning and adjusting processes as a group.
The First Iteration
It is going to take a few months to get good at setting achievable monthly objectives and to ensure you are working on the right stuff. Don’t let that slow you down. It is more important to get the process of continuous improvement and learning started than it is to get it perfectly right.
After all, the worst that could happen is that you are wrong for a month before you have a chance to learn and adjust.
These objectives should include all the major objectives the organization has for the month because if you have a separate set of priorities for the business deliverables and the improvement process, people are going to be confused about overall priorities.
Therefore, make sure you have a one-page set of priorities that everyone agrees are critical to sustaining the business and improving the development processes.
The key deliverables for the business are going to be easy to define. Where to start the process changes may be a bit more unfamiliar for the team.
We suggest you consider starting the process of improving the day-to-day stability of the trunk. Set up a continuous integration process with some automated tests and start the process of learning to keep the builds green.
If you are already doing this at the component level with unit tests, consider moving this to larger application builds in an operational-like environment with automated system tests.
This starts the cultural change of getting all the different development groups to make all the code work together and get it out to the customer, instead of stopping at some intermediate step.
It also is a good opportunity to stress your test automation to ensure it is stable and maintainable. If it isn’t, your developers will let you know right away. It is important to figure this out before you end up with a lot of automation you may need to throw away.
Make sure when you set the goals for the first month that they are quantifiable. Can you measure clearly whether the objective was met?
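For the keep-the-builds-green objective described above, one quantifiable measure might be the fraction of days the application build stayed green over the month, as in this small sketch:

```python
def green_build_rate(daily_status):
    """Fraction of days the build stayed green across a month.

    daily_status is a list of per-day outcomes, e.g. "green" or "red".
    A simple, unambiguous number to answer "was the objective met?"
    """
    if not daily_status:
        return 0.0
    return sum(1 for status in daily_status if status == "green") / len(daily_status)
```

A goal such as "the trunk build is green at least 80% of days this month" is both measurable and easy for the whole organization to understand, which is exactly the kind of objective the first iteration needs.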
Also, make sure the organization feels the goals are important and achievable. This may take a little time to get broader support, but it is important that people feel the goal is doable up front.
That way, when you later review why a goal wasn't accomplished, you can get people to clarify what is getting in the way of their ability to get things done. If it was a goal they never thought was possible, then when it is not done they will simply say so, and there is nothing to learn.
If instead they really felt it was possible and important, you are more likely to get an insightful response to what is impeding progress in your organization.
This is where the key learning and improvements really start to occur. The leadership team needs to take on the challenge of becoming investigative reporters spending time out in the organization learning what is working and what needs to be improved.
They then need to use this new understanding to prioritize removing impediments for the organization. It is this process of engaging with the organization to continually improve that will become your long-term competitive advantage.
Leading the Key Cultural Changes
Executives need to understand that other than engaging with the organization in the continuous improvement process, their most important role is leading the cultural shifts for the organization.
They will need to be involved in helping to prioritize the technical improvements, but they need to understand that these investments will be a waste of time if they can't get the organization to embrace the following cultural shifts:
Getting the entire organization to agree that the definition of done at the release branch is that the feature is signed off, defect-free, and the test automation is ready in terms of test coverage and passing rates
Getting the organization to embrace the unique characteristics of software and design a planning process that takes advantage of its flexibility
These are big changes that will take time, but without the executives driving these cultural shifts the technology investments will be of limited value.