The DevOps Handbook, by Gene Kim, Jez Humble, Patrick Debois, and John Willis
Why did I read this book, and do I think that it met my expectations? I was motivated to read this book because I have recognized that our NPSS development team is using some outdated practices, and in order to stay competitive in the technical work environment, we need to pursue more modern technologies and management methods. For example, our team currently uses Bugzilla for tracking customer issues and requests, and we use Microsoft Word for tracking and reporting our progress. While these tools work, they are not automated, they are clunky, and they are prone to errors. On the positive side, all of our source code is saved to a Git repository, and we have a suite of automated tests to verify our builds and bug fixes. However, none of these tools are integrated with one another, and I think there is an opportunity for us to improve our development practices and reduce the amount of time we spend on manual builds, manual tracking, manual testing, and manual reporting. By using better tools and more strategic practices, I think that we can provide more value to our customers. By reducing the time spent on all of these manual tasks, we enable our developers to spend more time actually completing fixes. What can be automated should be automated. In general, the aerospace industry is slow to adopt new technologies, and rightfully so: these are critical systems, catastrophic failures are unacceptable, and there are strict requirements for adopting new technologies. However, once technologies are well understood and have demonstrated their usefulness and reliability, there comes a time when they should be adopted. My goal with this book was to learn the what and how of improving our development team so that we can ultimately provide more value to the customer. The customer is the most important person, and whatever we do needs to give them value.
Overall, I think this was a great book, and I look forward to exploring some of these methods and tools in more depth.
This book was primarily dedicated to the integration of the development and operations teams. In my situation, I lead a development team; however, we rely heavily on our operations team for support, and many of the practices recommended in this book are applicable to any technology team. For starters, how do you implement change in an organization that has massive momentum and is going to rebel against change? In particular, how do you implement a massive change in an anti-change industry? The answer is to start small and to start behind the scenes. Break the change into small pieces, and then scale up. Our software project is divided into three distinct projects, each independent of the others. Therefore, our strategy will be to start with one project, implement a few changes internally and work through the bugs, then roll out the changes for that single project.
The authors talk about “The Three Ways” and how they are applied in IT operations. “The Three Ways” embrace ideas from the lean movement, the Agile Manifesto, and the Toyota Kata, and they are given as:
Flow
Feedback
Continual Learning and Experimentation
A great recommendation is for teams to dedicate 20% of their time to non-customer-facing projects. For software projects, this enables the team to reduce technical debt and fix whatever items they find personally valuable and whatever items improve their personal workflow. This is a great way to reduce burnout, reduce technical debt, and improve employee satisfaction. I think that our company, Southwest Research Institute (SwRI), does a good job of this in general. We allow our employees to spend time on “promotional” work, attending conferences, and working on internally funded research and development projects. I think this is a big benefit, because it allows employees to spend time learning about topics that they enjoy and working on projects that interest them. It is a large part of what keeps people at our organization. Continual learning and continual experimentation were a big theme in “The DevOps Handbook,” and I think that SwRI does a good job of encouraging these.
Something that I really liked about this book was the number of examples and case studies that the authors discussed. For example, they used LinkedIn to illustrate the topic of reducing technical debt. LinkedIn was growing exponentially, and the company needed to spend two months working solely on internal architecture. During that two-month period, they spent zero time working on customer-facing solutions, and they focused 100% of their time on fixing their internal architecture to support the increased demand for their service. This book frequently used examples from LinkedIn, Facebook, Etsy, Target, Google, HP LaserJet, and Bazaarvoice. These companies made huge improvements in their DevOps operations, and as a result became leaders in their respective marketplaces. They implemented key ideas such as smaller teams, limiting work-in-progress (WIP), version control, trunk-based development, working in small batch sizes, continuous integration, telemetry, and continuous feedback. I thought the case studies were excellent!
I learned about the difference between greenfield and brownfield services. A greenfield project or service is something that is started from scratch, with no existing architecture or infrastructure. A brownfield project or service, on the other hand, involves work done on something that already exists, such as construction on an existing building or development work on an existing software project.
I also learned about the Andon Cord, which is a method that Toyota implemented in their facility to immediately stop production whenever something goes wrong. Anybody can pull the Andon Cord, and when the Andon Cord is pulled, everybody works together to solve the problem. The idea is that the entire team is committed to quality and every person has the responsibility to stop the service and point out potential mistakes.
One potential solution to enhance communication between the Dev team and the Ops team is to invite Ops team members to the regular Dev meetings. Typically, there is a 10:1 ratio of Dev members to Ops members, and it is beneficial for the Dev team to include the Ops team in their planning. Another recommendation that I thought was good for improved efficiency is to hire generalists, that is, people who have both Dev skills and Ops skills. When everything runs through a single person, this creates a bottleneck and slows down production time. By having a team of generalists who are skilled in all areas, we can remove bottlenecks and increase speed.
One of the big themes was telemetry. Telemetry enables us to track metrics and identify problems before they happen. Telemetry should involve lots of graphs and visuals to show us how our code is performing, where items are failing, what services are being used, and where failures might occur. Telemetry also helps us diagnose and fix problems that have already occurred. We can also use statistical filters, such as a Kolmogorov-Smirnov filter, to sift out anomalies and send alerts. Telemetry is important! How can I implement telemetry in our product? I’m sure that we can do better here.
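To make the Kolmogorov-Smirnov idea a little more concrete for myself, here is a minimal Python sketch of a two-sample check that compares a recent window of a metric against a baseline window. This is my own illustration, not code from the book: the function names, the sample response times, and the 0.05 significance level are all assumptions, and the threshold uses the standard large-sample approximation.

```python
import bisect
import math


def ks_statistic(baseline, recent):
    """Two-sample Kolmogorov-Smirnov statistic.

    Returns the largest vertical gap between the empirical CDFs
    of the two samples (a value between 0 and 1).
    """
    a = sorted(baseline)
    b = sorted(recent)
    max_gap = 0.0
    # The largest gap between two step functions occurs at a sample point.
    for x in a + b:
        cdf_a = bisect.bisect_right(a, x) / len(a)
        cdf_b = bisect.bisect_right(b, x) / len(b)
        max_gap = max(max_gap, abs(cdf_a - cdf_b))
    return max_gap


def ks_threshold(n, m, alpha=0.05):
    """Large-sample critical value for the two-sample KS statistic."""
    c = math.sqrt(-0.5 * math.log(alpha / 2))
    return c * math.sqrt((n + m) / (n * m))


def is_anomalous(baseline, recent, alpha=0.05):
    """Flag the recent window if its distribution differs from the baseline."""
    d = ks_statistic(baseline, recent)
    return d > ks_threshold(len(baseline), len(recent), alpha)


if __name__ == "__main__":
    # Hypothetical response times in milliseconds (made up for illustration).
    normal_week = list(range(100, 150))
    slow_day = [t + 100 for t in normal_week]
    print("alert:", is_anomalous(normal_week, slow_day))
```

In practice, the baseline and recent windows would come from whatever metrics store the team already uses, and an alert would be sent instead of printing a line.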
To instill a sense of ownership in the developers, nothing works better than having them follow their work downstream. Have developers watch the customer use the product. This is a really good idea and also very important!
I really enjoyed the chapter about A/B testing. A/B testing can be used to determine if the features you are implementing have the desired benefit. It is expensive to implement a feature that will not be used or even makes the end-user’s life more difficult. It is worth the additional money to conduct a survey, or use other means, to understand what feature or fix is going to best help the customer. For example, moving the button in an application, adding a new button to automate a set of clicks, changing the color of a sign, or changing the wording in a question – all of these things may or may not be beneficial, and one way to determine if a change will have the desired outcome is to conduct A/B testing.
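One way to judge whether the difference between version A and version B is real, and not just noise, is a two-proportion z-test on the conversion counts. The sketch below is my own illustration and is not the book's method: the function name and the example counts are hypothetical, and it assumes each user is counted once and that samples are large enough for the normal approximation.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Compare conversion rates of variants A and B.

    conv_a, conv_b: number of users who took the desired action.
    n_a, n_b: number of users shown each variant.
    Returns (z, p_value) for a two-sided test of "the rates are equal".
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the assumption that A and B perform the same.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # No variation at all (everyone or no one converted).
        return 0.0, 1.0
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


if __name__ == "__main__":
    # Hypothetical counts: the old button placement (A) vs. the new one (B).
    z, p = two_proportion_z_test(conv_a=100, n_a=1000, conv_b=150, n_b=1000)
    print(f"z = {z:.2f}, p = {p:.4f}")
    if p < 0.05:
        print("The difference is unlikely to be chance alone.")
    else:
        print("Not enough evidence that the change helped.")
```

A small p-value only says the two variants behave differently; whether the change is worth shipping still depends on how large the improvement is and what it costs to maintain.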
In conclusion, I learned that change is hard, and it takes guts to implement large changes. But that’s what project leaders are responsible for. Project leaders are responsible for recognizing areas that need improvement, crafting a plan to get the desired improvement, selling that plan, and implementing that plan in a way that yields the desired outcome. It is a lot of responsibility to have this role, but without somebody willing to accept the challenge and face the potential failure, companies will never improve, and if companies never improve, then they will inevitably fail. Innovation is impossible without risk-taking. If you are not making at least a few people angry, then you are not trying hard enough.