The Most Agile Way to Manage Technical Debt
As an engineering team we strive to find the right balance between aggressively building out product capabilities and managing technical debt. At AppNeta we have hardened our process to define and manage this balance. Our current process involves virtual teams who break away from the feature development at hand to focus on technical debt in a specific product area for one week. The key strategy here is simply to ensure that every product area receives an appropriate allocation of maintenance effort.
The Yin and Yang of Feature Velocity and Continuous Improvement
A SaaS business has two key drivers: bringing new customers on board with new features, and retaining existing subscriptions by continuing to offer a solid solution. The effort allocated to these two pillars must remain balanced across all parts of the organization, including the technical planning of product development.
In an agile startup that is aggressively expanding the capabilities of its SaaS offering, both to meet the demands of the current market and to reach into new markets, there is a strong tendency to treat the maintenance of existing capabilities as a second-class citizen. The result is a slow and subtle increase in technical debt, which ultimately manifests itself in the end-user experience and in the foundation upon which new features are built. Managing the P1 and P2 issues isn't difficult: interrupt the work at hand and get a fix into the next scheduled release, or get an unscheduled patch out. However, if an issue is called out through internal testing or by a single customer in a secondary use case, it is tougher to get into the pipeline. How do you decide, issue by issue, whether it is more important to address than getting a strategic new initiative out the door?
Let's also consider this from the other side. As developers we all love to button up loose ends. At AppNeta we encourage pride in our work, and within the team many have a strong desire for perfection. Leaving blemishes, user facing or not, keeps us up at night. This is the right culture to have! Without a business influence, however, the balance will flip the other way and not give enough attention to new features, ultimately reducing ROI.
Don't Throw Tomatoes over the Fence
I have seen organizations build a separate maintenance team that is, for example, half the size of the new-feature team. In my opinion, this is the wrong approach (at least for the size of teams we work with). This feels like one team throwing tomatoes over the fence for the other to look after!
The follow through which comes from the pride of ownership is lost because someone else is dealing with the bugs created by a different person, in fact a different team. Without solid communication, the backstory of why a certain initial approach was taken is lost. The domain knowledge is not present and thus the efficiency in fixing the issue is reduced.
Even worse, I've seen maintenance teams of less experienced developers have trouble identifying the root cause of issues, resulting in band-aid fixes where rework would be preferred.
To avoid the issues with maintenance teams, we have employed catch-up sprints at AppNeta to step back from major new feature development and catch up on a bit of debt. Occasionally they are necessary and it’s great to get all hands on deck for particular themes. In our case, a catch-up session embraced one or two themes (like performance, UX or operations) and the team focused on those themes. We recently did one that expressly dealt with performance issues that had been creeping up on us over the course of several months. The downside to specific themes is that we continued to shelve issues that didn't make it into that theme. I believe these catch-up sprints are required once or twice a year to break up major new feature initiatives; sometimes debt isn't a quick fix. In my experience, though, catch-up sprints are not a complete solution to achieve consistent and timely servicing of general technical debt.
Teamwork and Time-slicing
At AppNeta, we have built virtual teams, each focused on a specific area of the product or infrastructure for one week at a time. Each team is just big enough to ensure full-stack coverage, so that any issue in its product area can be serviced. The number of teams in rotation determines how often each product area comes up for triage. We have four teams (blue, yellow, orange, red), so each product area is up for servicing every four weeks. We have divided our SaaS offering across the four teams, consciously ensuring that high-profile areas are not spread too thin.
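The rotation described above can be sketched in a few lines of Python. This is only an illustration, not our actual tooling: the team names come from this post, but the rotation order, the start date, and the function names are assumptions made for the example.

```python
from datetime import date, timedelta

# Team names are from the post; the order of rotation is an assumption.
TEAMS = ["blue", "yellow", "orange", "red"]

# Hypothetical Monday on which the first team ("blue") starts its triage week.
ROTATION_START = date(2024, 1, 1)


def team_on_triage(day: date) -> str:
    """Return the team whose one-week triage slot contains `day`."""
    weeks_elapsed = (day - ROTATION_START).days // 7
    return TEAMS[weeks_elapsed % len(TEAMS)]


def next_triage_week(team: str, today: date) -> date:
    """Return the Monday of `team`'s next triage week, counting the current week."""
    weeks_elapsed = (today - ROTATION_START).days // 7
    # Number of weeks until this team's slot comes around again (0 if it is now).
    offset = (TEAMS.index(team) - weeks_elapsed) % len(TEAMS)
    return ROTATION_START + timedelta(weeks=weeks_elapsed + offset)
```

With four teams, `team_on_triage` returns the same team every fourth week, which is where the four-week servicing interval comes from. Adding or removing a team from `TEAMS` changes that interval accordingly.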
Each team includes both development and non-development roles.
- Technical support: Represents and interfaces with the customer, sales team and sales engineers. The voice of the customer. This role gives weight to issues. How impactful is an issue, and to which customers or potential customers?
- Tester: Ensures that quality assurance is improved through the triage process. How did this issue get out the door? Let's improve our automation to keep this and similar issues from escaping in the future. In addition, the tester helps support and developers reproduce the issue.
- Developers: In our case, three developers cover the full stack: front end, back end, and agent/appliance.
- Prime: One of the developers is designated prime. Their role is to monitor and assess incoming issues and tag them for the appropriate team. In addition, they lead day-to-day activities as necessary.
What Can You Get Done in a Week?
A week doesn't seem like a lot, and it's not. But that's okay: the rotation will come around again in just a few weeks. You would be amazed at what the team can get done in a week.
Before the week begins, the Technical Support Rep reviews the backlog tagged for their team and takes a first pass at prioritization. On Monday morning the Developers and Tester jump in. They review the top of the backlog (kanban style) and scope it, dragging in the items likely to fit in the week. The team as a whole then gets together to discuss the top of the list. Everyone brings their own perspective to the conversation. There might be some clarification of details or requests for more information, such as logs, steps to reproduce, or impact/severity. Then the team is on it. Developers fix issues, and testers enhance automation and test fixes as they become available.
At the end of the week, pencils down. What is complete is merged to master and ready to be pushed out in the next release, possibly along with feature work done in parallel by the rest of engineering, who are not on their triage week.
We also conclude the week with a retrospective so that we can continue to learn and tune this process.
We need to make sure that the scope of every team is balanced, so that one team isn't spending time on low-priority issues while another is overloaded with more than it can handle. We also need to ensure that the high-business-value areas of the product are allocated the right amount of time. These considerations will slowly reshape the areas of responsibility and perhaps even the number of teams.
The benefits we see from this approach:
- Customers see consistent attention given to technical debt
- Impact of product maintenance is predetermined so that development effort and timing can be scheduled around triage
- Developers are given a regular session to fix the nagging issues that bug them
- Fixed time windows and multi-role accountability force the ROI to remain in check
- The balance of effort given to debt vs new features is communicated to the rest of the organization
As we proceed, we are continuing to measure the debt incurred vs the pay down with each iteration. We will tune the scope of responsibility for each team to ensure they continue to have a deep backlog and a high ROI. Just like our feature work, the name of the game isn't perfection; it's effectiveness. This schedule works for us, keeps everybody happy, and lets us keep building awesome software.