
My engineering teams recently switched over to Kanban, and we’ve been having great success ever since.  One of the things I never fully understood until today, though, was how you tune your WIP limit.  My team had been using a WIP limit factor of 1.5, meaning that for every developer there were 1.5 stories in progress.

We felt this was a reasonable approach: keeping a high WIP limit would ensure that there was always something for everyone to do and nobody ended up idle.  We were using three service classes to organize the work – production support, product development, and maintenance.  Things seemed to be going pretty well, and they largely were, but this approach carried significant risk that we ended up paying for today.

I track cycle time as well as total completed story points for every release, and I noticed that with today’s release we had a significantly higher cycle time and finished about half as many points as we usually do.  Shocked, I took a look at our work in progress on the board and noted that we were at our standard WIP limit of eight.  What was different this time, though, was that almost every story on the board was large: 13 points or more.

For six developers, we had 66 points in-flight.  Digging deeper, our cycle time on a per-story basis had gone through the roof.  Because the stories were so large, and there were so many of them, we ended up leaving a lot of work on the floor when the release, which occurs on a regular two-week cycle, arrived to pick up whatever we had finished by then.

If you’re a student of Lean, you’ve probably seen the classic diagram illustrating that by completing work serially rather than in parallel you can deliver value to your customers faster.  Well, today really brought that point home, and in doing so I learned a lot about WIP limits in Kanban.

Part of the benefit of keeping WIP limits as low as possible is that your developers swarm around a small number of large in-flight items, ensuring you:

  • Get more eyes on a single feature to ensure quality and shared understanding.
  • Reduce cycle time on individual stories by maximizing developer participation, which both lowers the average time spent per story and ensures that as many items as possible are deliverable to customers on any given day.

Because we transitioned from Scrum, our product team was used to seeing a large body of work in progress, committed to at an earlier date, that would eventually get delivered as a release; to them that’s what a team being successful at delivery looks like.  In an effort to ease the transition into Kanban, and also because I didn’t see the risk, I went along with higher work in progress limits so that things generally looked similar to what they already saw as efficient delivery.  Now I realize this is a dangerous approach that should be avoided.

Instead of using the classes I mentioned above (production support, product development, and maintenance) we are now organizing the WIP rules around the size of the stories (we’re using Fibonacci numbers, and this is roughly where large and small break down for us); there’s a quick sketch of the arithmetic after the list below.

  • Large Story – Any story that is 8 points or larger can be worked on with a WIP limit factor of 1/3, rounded down.
  • Small Story – Any story that is 5 points or smaller can be worked on with a WIP limit factor of 1/3, rounded up.
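
To make the arithmetic concrete, here’s a minimal sketch (in Python; the function name and the team size of six are just for illustration) of how those factors translate into per-class limits and the overall factor:

    import math

    def wip_limits(developers: int) -> dict:
        """Per-class WIP limits from a 1/3 factor: rounded down for large
        stories (8+ points), rounded up for small stories (5 points or less)."""
        large = math.floor(developers / 3)
        small = math.ceil(developers / 3)
        return {"large": large, "small": small, "total": large + small}

    limits = wip_limits(6)        # the six-developer team from this post
    print(limits)                 # {'large': 2, 'small': 2, 'total': 4}
    print(limits["total"] / 6)    # ~0.66 stories in progress per developer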

Our WIP limit factor is now significantly smaller – instead of 1.5 stories in progress per developer, we’re now at roughly 0.66.  Any excess capacity on a day-to-day basis is soaked up by test automation, but under this new approach I expect to see cycle time return to normal levels and a much reduced risk of incomplete work at code freeze.  Additionally, we’re balancing our work around the way we actually do work – we want more small stories in progress since typically a single developer can work on them, but at the same time we want to see developers swarming around the larger, more involved stories.

I learned two things here.  First, Kanban really does require either that stories be a consistent size or, if you do want to mix sizes on a single board, that you account for this in your rules.  Second, when you keep artificially high WIP limits, even with good intentions, you are not likely doing anyone any favors.

First of all, I wanted to announce that the scope of this blog might be growing a bit; I was recently promoted from Release Manager at Citysearch to Engineering Manager for several teams working on our advertiser platform. A lot of the reason I was promoted was my experience in process development, so there will still be plenty of overlap between the original content and where we go in the future, but I’ll definitely be growing and talking about new subjects relating to the broader organization.

In that light, I wanted to talk about my experience moving my software teams from Scrum to an iteration-less form of Kanban. My CTO, Christophe Louvion, provided a nice high level summary at http://runningagile.com/2010/05/31/a-few-weeks-of-kanba/ based on feedback in a presentation one of my teams gave, and I’m going to expand on the list:

Stand ups differ – going story by story, not based on individuals
The story-based focus, where we start at the right of the board and focus on how to move stories into the “complete” column, has encouraged the team to be more collaborative in analyzing, executing, and testing their story implementations. Previously, the teams were going down the list on the first day of the sprint and assigning items out one at a time, essentially maximizing their work in progress. Unfortunately, this was pathological: it ensured that only one person knew the implementation for a given story (and that you were stuck if that person got sick), and it discouraged the team from swarming around one or two items. Now, everyone on the team is aware of what the entire team is working on and can share the load as needed.

Fewer Rules: No Prescribed Roles, More Dynamic
This speaks to the core Kanban idea that the process itself is something being engineered and tailored to the team; the team feels they don’t need roles (such as the Scrum Master) that they have in many ways outgrown. If they are mature enough to self-organize, they are more than welcome to do so – and they are.

Backlog more Flexible
If you’ve ever worked in advertising you know there are a lot of stakeholders out there and when deals get done they need to be implemented irrespective of any two or four week sprint cycle. The flexibility of the “to-do” list in Kanban allows our product team to reprioritize any work that the team has not already started development on, which in turn gives the product team the ability to ensure we’re always working on the latest requirements from the business. Everyone knows they are working on the latest and most important features and that means real value when we deliver.

Pros: Better Team Focus, Fewer Meetings and More Time to Deliver, More Flexible Work Process
Again, the latter two points here speak to the ability of Kanban teams to tailor the process to themselves. The first point, however, is really worth digging into. In my opinion, the increased team focus comes from the team’s participation in engineering its own work process. Prescriptive methodologies such as Scrum, although good general rules for many teams, do not necessarily require the team to understand why the process is the way it is. In fact, they are designed so that even teams who do not understand the whys can be effective. In contrast, because Kanban drives change from within the team, the team is encouraged to experiment with the process (even making mistakes, as long as there are lessons learned), which results in a team that understands at a very deep level what works for them, and why. On my team, Kaizen events have jumped significantly now that the team understands the process and knows they are empowered to make changes in a safe environment.

Cons: Less Structure Could Cause Issues with Some People
This is one item that I’m still learning a lot about. Kanban, much more than Scrum, seems to require leaders at all levels of the organization to drive positive process change. I’ve been on a couple of different teams that were learning Scrum as they went – I’m not sure that Kanban is capable of working for teams at that level. It simply provides so much flexibility, and much of the guidance asks you to analyze your specific situation rather than read out of a playbook. That said, I haven’t actually tried Kanban with a less mature team, so my data is incomplete here.

How It’s Working Out: Feels Good, Increased Our Collaboration, Communication Has Increased, Forces Follow Up to Eliminate Blockers
“Feels good” is another important component. If you saw the last video post I made, you know that two of the major elements that motivate people are autonomy and mastery. Kanban provides these by trusting the team to engineer their process in a way that works for them – the methodology is not handed down from “on high” for them to execute; instead they are an integral part of shaping it. We hire smart people who are good at engineering software systems, and it turns out those kinds of people have an intuitive sense for engineering process as well. The Kanban board and metrics give the team a visual way to see that things are getting better and to know that they are leading themselves to an increased level of mastery.

All in all, moving to Kanban has so far been great for the team. We’ve only been experimenting with the new process for a few weeks now, but already it seems like things are on an exciting new trajectory. I’ll be updating this blog with more details on how things are going as we continue to learn and grow.

One easy way to get more value out of your build system is to improve one of its fundamental outputs: the build number.  Many of us have been using the basic major, minor, revision, increment format for years (I know I did!), but most build systems these days, especially ones you’ve built yourself, can use any format you like.  A good one that I’ve used in the past is the following:

Template: (Major).(Minor)_(Branch Name)_(Datestamp).(DailyIncrement)

Build Number: 3.4_Prod_20090529.3

Translation: Production Branch, 3.4 release, built on 05/29/09, was the third build of the day.

This provides far more information than the basic format.  Just from seeing the build number you know when it was built, what major/minor release it is associated with, and what branch it was built out of.  The only piece of information you’re losing versus the old method is the running total number of builds you’ve created since starting the project, which is of questionable value anyway.
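
If your build system lets you script the version string, assembling this format is trivial.  Here’s a minimal sketch in Python, assuming you already track the major/minor version, branch name, and a per-day build counter (the function name and parameters are just for illustration):

    from datetime import date
    from typing import Optional

    def build_number(major: int, minor: int, branch: str, daily_increment: int,
                     build_date: Optional[date] = None) -> str:
        """Assemble (Major).(Minor)_(Branch Name)_(Datestamp).(DailyIncrement)."""
        build_date = build_date or date.today()
        return f"{major}.{minor}_{branch}_{build_date:%Y%m%d}.{daily_increment}"

    print(build_number(3, 4, "Prod", 3, date(2009, 5, 29)))  # 3.4_Prod_20090529.3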

One issue you may run into if you are using Windows is that it expects the “File Version” value for assemblies to be four numeric parts (the 0.0.0.0 format).   A good way to get around this is to put the major, minor, date (with the year removed), and increment in that field, and then use the “Product Version” field for your expanded build number.  So, with the example above, you’d end up with a file version of 3.4.529.3, which works great in the file version field.
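
Continuing the sketch above, deriving that Windows-friendly file version is just a matter of dropping the year from the datestamp; again, the helper below is illustrative rather than tied to any particular build tool:

    from datetime import date

    def file_version(major: int, minor: int, daily_increment: int,
                     build_date: date) -> str:
        """Four numeric parts for the Windows File Version field:
        major.minor.<month+day>.<increment>, with the year dropped so each
        part stays a small integer."""
        month_day = int(f"{build_date.month}{build_date.day:02d}")  # May 29 -> 529
        return f"{major}.{minor}.{month_day}.{daily_increment}"

    print(file_version(3, 4, 3, date(2009, 5, 29)))  # 3.4.529.3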

In this article we’re going to dig into using value stream mapping to find inefficiencies in our build and release pipelines.   A value stream map is a Lean method for visualizing and understanding your processes and is a great starting point for finding and weeding out inefficiencies.

How to create a value stream map

The simplified form of value stream map which I’ll be showing here is actually pretty straightforward to create:

  • Start with the entry point into your process and generate a workflow that ends with value created for your customers.   Be sure that each of your workflow steps is real work – someone should be actively doing something for each step (waiting for a build, or really any kind of waiting, is not a work step in a value stream because nothing is being produced.)  
  • Next, notate the average time spent doing each step as well as the average time spent in between each step.  Estimates are OK, but if you actually do the experiment and collect some real world data you will probably find some pretty surprising numbers.  If you’re using a full-fledged workflow tool you may be able to get these timings out of your reporting system, but if you’re not tracking this in software then the best way I’ve found to get this data is to use Excel to track all the timings for one or two instances of the process per day for a week or two.  After you’re finished you can average your timings to come up with reasonably accurate estimates.   Be sure to also count the number of times you rework issues and what rework steps are associated with doing so.
  • Finally, go back and notate any rework loops and any other ways to exit the process aside from creating value (e.g., cancelling a feature would be an early exit to a feature development process that does not create value.) Possible flows where no value gets created are huge potential time sinks, especially in cases where you rework an issue or feature many, many times only to leave it on the cutting room floor.

A real world example

Below is an example value stream map I’ve created for a typical build pipeline.  You’ll notice that I’ve mapped an entire defect resolution process from discovery through resolution, and that’s because it’s very important to always look at your systems within their broader context; the CM pipeline doesn’t exist within a vacuum and looking at your systems holistically is really the best way to approach solving systems level problems.

  1. After 3 hours of build testing a defect is found by QA.
  2. An hour goes by waiting as the QA team discusses the issue internally and verifies it is reproducible.
  3. QA spends an hour writing up the official bug report which includes all information needed to fix the issue.
  4. Sixteen hours go by (on average) as we wait for the next change control board meeting where the bug is assigned to a developer.
  5. The developer spends two hours making the fix and checking it in (this includes any unit testing or other pre check-in requirements.)
  6. An hour goes by as the continuous integration system monitors for and pools check-ins.
  7. The build runs for an hour.
  8. QA only tests one build per day, so we wait one day for the fix to be ready for deployment.
  9. The deployment runs for an hour.
  10. After about a half day QA gets to the test in question.
  11. QA spends three hours testing.
  12. QA spends an hour updating the (hopefully) fixed issue with updated information and adding it to their regression suite.  This is where the value is created in this example – the end product has one less critical defect.

In this example, rework happens on average once for four out of five bugs submitted.

The value stream map

Mapped out using Visio, the value stream map looks like this:

[Figure: CM value stream]

Metrics

Once you’ve got your value stream mapped out you can find some interesting overall metrics. These are great to track over time so that you can see how you’re doing at the macro level, especially since within configuration management it’s easy to get lost in the details.

You can find an overall ideal efficiency by dividing the number of hours spent doing real work by the total number of hours in the entire process, not including reworks.  In this value stream map there are 13 hours of real work happening (in the ideal situation where the defect is fixed on the first attempt) and 43 total hours of cycle time.   This gives us an ideal efficiency of roughly 30%.  A good target I’ve found to shoot for with build systems is around 20-25% – you never want to get too close to 100% because at that point any new work added to the system will cause thrashing, and work coming into the build pipeline is spiky even on the best of days.  

Another interesting value you can pull out of this is rework cost, which can be found by adding up the total number of hours spent in each rework cycle.  In this example we add 21 total hours each time we reach a “fix failed” state.  Since we average one rework cycle on four out of five bugs, our rework average is 80%.  Rework metrics are a great resource for getting a feel for the quality coming into the system and the amount of waste being produced by quality issues.  
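
As a quick sanity check, here are a few lines that recompute these figures from the numbers above (pure arithmetic on the article’s own figures; the expected end-to-end time at the end is a derived number, not one quoted above):

    work_hours = 13        # hours of real work on the happy path
    cycle_hours = 43       # total elapsed hours when the fix works the first time
    rework_hours = 21      # hours added each time we hit the "fix failed" state
    rework_rate = 0.8      # four out of five bugs need one rework cycle

    ideal_efficiency = work_hours / cycle_hours
    rework_overhead = rework_rate * rework_hours      # expected extra hours per bug
    expected_cycle = cycle_hours + rework_overhead    # derived figure

    print(f"ideal efficiency: {ideal_efficiency:.0%}")                   # ~30%
    print(f"expected rework overhead: {rework_overhead:.1f} h per bug")  # 16.8 h
    print(f"expected end-to-end time: {expected_cycle:.1f} h")           # ~59.8 h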

Next steps

One of the great wastes that value stream maps can highlight is time spent in rework cycles.   Because in our example we lose 21 hours per rework cycle (and we’re sitting at an 80% rework average), anything we can do to bring that iteration time or rework average down will be a huge win.  If we can increase development time, for example, in order to reduce our rework average, then we’ve made a significant overall improvement through a change that in isolation would have appeared counterproductive.  Keep in mind that any improvement made during a step in a rework cycle has the potential to be gained every time that rework cycle happens, so if you’ve got a high rework average then spending time reducing rework cost is going to give you a lot of bang for your buck; if your rework average is very low, you may instead be able to accept a higher rework cost in exchange for gains elsewhere in the main flow.

A great place to start to look for optimizations is by analyzing your “waiting steps.”  These are pure waste (they are idle inventory in Lean terms) and are prime targets for efficiency gains from simply reducing the time spent there.  In our example the most egregious waiting periods are the ones where we are spending two days for the original defect to be scheduled for work, a day for QA to pick up the fix in the next deployment, and four hours for a QA engineer to get a chance to take a look at the bug in the test environment.  Because there’s no real work happening here it’s often easy and cheap to find ways to reduce these waiting periods.
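
If you’ve been collecting per-step timings (in a spreadsheet or otherwise), surfacing the worst waits is a one-liner; the figures below are just the waiting periods called out above:

    # Waiting periods from the example map, in hours (names are shorthand)
    waits = {
        "change control board scheduling": 16,
        "daily QA build pickup": 8,
        "QA engineer gets to the test": 4,
        "CI monitors and pools check-ins": 1,
        "QA discusses and verifies the report": 1,
    }

    for step, hours in sorted(waits.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{hours:>3} h  {step}")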

Process steps are also good potential candidates for making improvements, but here things are more complicated because someone is actually doing some kind of productive work and you’ll need to do some investigation into the details to find out what’s happening there.  In this example there aren’t any process steps that jump out as particularly slow, but if our build and deployment times started creeping up it would definitely be worth spending some engineering resources to bring that back down to a level that makes more sense.

Going further, the next steps would be to try to think about your map overall:  

  • What other kinds of timing improvements can you find?  
  • What other teams could you bring on board to improve iteration speed?  
  • Do you really need to do every step?  
  • Does this workflow even make sense now that you’ve written it out?  
  • Is there a way to reduce efficiency in one area that increases overall efficiency?  
  • Are there any more hidden rework cycles in here that can be looked at?  
  • Are there any possible flows where value is never created, and if so, how can these be avoided?  

Summary

Taking the time to analyze your processes using value stream maps can be a great way to look at your existing problems in a new way.  I hope that I’ve given you some ideas on how to get started looking at your systems in terms of value stream maps and that they can be as useful for you as they have been for me.

If you find that creating your value stream map creates more questions than answers then you’re absolutely doing it right.  Value stream maps are a great method for highlighting waste and inefficiencies, but they aren’t a method for determining how to fix the problems.  To fix the problems you uncover, you’re going to need to do some root cause analysis, which is a subject I hope to get into in more detail in a future article.

These days I think most of us are sold on continuous integration as a great way to find both code and process defects earlier so we can resolve them before they grow out of control, but did you know that by applying the Lean concept of Stop the Line manufacturing to the build pipeline there are even further gains we can take advantage of? 

First, some history

Stop the Line manufacturing is a technique introduced by Taiichi Ohno (of Toyota Production System fame) in which every employee on the assembly line has a responsibility to push a big red button that stops everything whenever they notice a defect on the line.  When this was first introduced people couldn’t wrap their heads around it; it was manufacturing dogma that the best thing you could do as a plant manager was keep your assembly lines running full steam for as many hours of the day as possible, maximizing throughput.  His idea, however, was that by fixing inefficiencies and problems as they occur, you aren’t merely maximizing your existing process – you’re proactively building a better one.

When he put this system into practice he found that some of his managers took his advice and some didn’t.   The managers who implemented Stop the Line had their productivity drop by a shocking amount; they were spending much of their time fixing defects on the line rather than actually producing any goods.  The managers who hadn’t listened thought this was a great victory for them, and I can just imagine them feeling sorry for poor Taiichi Ohno who would be ruined for having come up with such a horrible and wasteful idea. 

Before long, however, something strange started to happen.  Slowly but surely the managers who had spent so much time fixing defects instead of producing goods started producing their goods faster, cheaper, and more reliably than their counterparts, to the point where they caught up with and then exceeded the lines that hadn’t made improvements.  The initial investment in improved process and tools had paid off, and Toyota went on to be quite successful using this method.  Even today their engineers and managers share a cultural belief that their job is not actually to manufacture cars but instead to learn to manufacture cars better than anyone else.

How does this relate to Continuous Integration?

Continuous integration is a technique that lets us run the build process as if it were a continuously running assembly line; fresh code goes in one end and (after a series of assembly steps) a build that’s ready for a human to test comes out the other.   On its own CI is a big win for a software team over an old-style “daily build”, but by adding the concept of Stop the Line manufacturing to our continuous integration process we can really take things to the next level.

Some real world examples

A typical reason a release engineer might stop the line is that build or deployment times are slowing down.  The difficulty for the build engineer trying to speed things back up is that quite a few elements are outside of his or her strict control, but with a Stop the Line approach you can say: OK, we’ve passed a critical threshold here, and as an organization we need to stop what we’re doing to figure out how to fix this new constraint.  Let’s get our SDETs looking at improving unit test speed, let’s get our developers with the build engineer to look at reducing the number of configurations we’re building and to find some general compile-time optimizations, let’s get our systems engineers looking at our build and deployment machines, and let’s get our DBAs looking at the database deployment and servers to see what they can do.  This is a much more proactive and holistic approach to managing the problem than simply expecting your release engineer to handle it alone, or worse yet not dealing with it until you really do need fast iterations.
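
One cheap way to make that threshold concrete is to have the pipeline itself flag the moment it is crossed, so the conversation happens immediately rather than at crunch time.  Here’s a minimal sketch, assuming you can pull recent build durations out of your CI server somehow; the threshold, the function name, and the sample data are entirely illustrative:

    import statistics
    import sys

    BUILD_TIME_LIMIT_MIN = 30   # the critical threshold the team has agreed on

    def check_build_times(recent_durations_min):
        """Signal a 'stop the line' condition if recent builds are too slow."""
        average = statistics.mean(recent_durations_min)
        if average > BUILD_TIME_LIMIT_MIN:
            print(f"STOP THE LINE: average build time {average:.1f} min "
                  f"exceeds the {BUILD_TIME_LIMIT_MIN} min limit")
            return False
        print(f"Build times OK: average {average:.1f} min")
        return True

    if __name__ == "__main__":
        # In practice these durations would come from your CI server's history.
        if not check_build_times([24, 28, 33, 35, 36]):
            sys.exit(1)   # fail the job so the slowdown gets attention now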

Aside from build times, how many times have you come in in the morning to a broken build that has been firing off failures all night thanks to a late-night, last-minute check-in?  One approach a lot of teams take to handling this is to embarrass the person who broke the build; he gets to wear a goofy hat, gets called out in a daily meeting, or something along those lines.  In actuality, however, these types of defects are almost always systemic to the way people are working.  With a Stop the Line approach you’re not just accepting that Joe Schmoe is a slacker who broke the build; instead you’re getting a focused strike team together to look at what really caused the problem.  Maybe there’s no easy way for a developer to run a quick set of tests as a sanity check before check-in.  Maybe Joe is being asked to work on a component he’s not familiar with and needs some guidance.  Maybe the component itself is overly complicated and could use some refactoring.  Whatever the case may be, it is unlikely that the long term solution is to just blame Joe and call it a day.  If you stop the line to look at the situation in depth you can really take steps to understand what is happening at the systems level.

Summary

The key with Stop the Line is that when you find a defect, you stop the build pipeline and gather the important stakeholders together to look for the root cause.    In too many companies build system defects pile up for far too long as people work around them, only coming to a head when build time and reliability become so outrageously bad that they’re affecting productivity across the organization, or when you’re late in the cycle and trying to push as many builds as you can through your now highly dysfunctional pipeline.  What adopting Stop the Line allows you to say as a manager is that you’re not going to tolerate that slow, silent accumulation of issues that will come back to bite you later; that you’re building an organization that isn’t just building software but is committed from the beginning to actively learning how to build software better.

In closing I want to share a couple of caveats with you.  It is important to adopt the Stop the Line mentality early so that you can reap the rewards of your work later when you really need it.  It’s much easier to stop everything and fix the pipeline when it isn’t already completely broken and you’re in danger of slipping.   Also, you can’t be afraid to dedicate some of your most talented people to the systemic root cause analysis; they’re going to identify and fix the problem faster and more effectively since they have the skills and experience and are empowered to do so.

In my next article I’ll be discussing how to use value streams to analyze your build pipeline.  Value streams are a great way to figure out where you’re spending your time when you’ve hit that critical iteration time constraint, stopped the line, and need to figure out how to get back on track.