A Bug Story
March 8, 2016

At AppNeta we make tools to help developers optimize application performance, so when one of our products gets janky, we naturally take it personally. This is the story of a particularly slippery bug that put up a fight.

The Bug: One (1) of our biggest customers reports site-breaking frontend jank in TraceView, all the time, for all their users.

The First Clue: It’s specific to this single customer.
The Second Clue: An even larger customer has zero (0) UI jank.
The Third Clue: It’s only reproducible on Chrome.

Our parallelized process in serial form:

  1. Since this was a Chrome-only bug, the first thing we tried was disabling all extensions. We’ve been bitten by extensions in the past, so this was an obvious and easy experiment: Sadly, still jank. (Bad for us – Good for this story.)
  2. Compared Chrome versions – All newest/same.
  3. Reverted to a stable commit a month before the first symptom was reported – No dice.
  4. Widely blamed one of our React components, called <AppSelector>. Poor performance of this primary navigation component was one of the first reported symptoms. Users should be able to type-to-filter & navigate their org’s list of apps. (The affected customer has ~175 apps.) Using Chrome’s amazing DevTools Timeline we visualized the UI jank for this component and saw that DOM manipulation was taking 2 seconds per keypress!
     [Figure: Chrome DevTools Timeline - before]
  5. We updated to the latest version of React.
  6. Double-checked <AppSelector>’s render() method & the key attribute of our <li> loop. (Unique keys are prerequisite for fast & reliable loops/maps in React render methods.) Our key was solid, and the render method didn’t seem to be thrashing.
  7. Native Chrome DOM manipulation seemed to be the bottleneck, so we next tried using CSS to show/hide apps in the filterable list. This is often faster than adding/removing elements from the DOM. In Chrome, toggling classes via setAttribute("class", ...) took almost as long!
  8. Tried commenting out <AppSelector> entirely – Other React components remain janky.
  9. Disabled React altogether, and still saw varying degrees of jank in legacy JS frameworks/templates.
  10. …At this point we finally accepted that the slow DOM manipulation was in fact framework-agnostic, site-wide, & still single-customer-specific… <AppSelector> was just a red herring, one of the usual suspects.
  11. Armed with this experience & knowledge we set out to find any and all differentiators between these two large clients’ web-apps and/or data-sets.
  12. We scrutinized JSON endpoints, queried SQL, profiled memory, & stumbled upon the smoking gun: The affected customer had ~37,000 custom stack layers associated with their apps – Most customers have fewer than 100. From here it was a short leap to the next discovery: Many of these custom layers were assigned a custom color. Custom colors are handled by some of the oldest code in our codebase, which, to our delight and horror, we discovered was instantiating a Protovis (the predecessor to d3) Color instance for EVERY custom color on page load, thus chewing up memory.
      [Figure: Chrome DevTools Heap Snapshot]

Victory is Ours!

Half an hour after we discovered the root cause we had a patch deployed to production, and a story at the top of our backlog to refactor legacy custom color logic to maximize scalability. It was extremely satisfying to be able to report to the customer that we had crushed this bug the same day we became aware of it. 
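
The patch itself isn't shown here, so the following is only a rough sketch of one common way to avoid this class of problem: keep the raw color strings around, build the expensive object on first use, and share one instance per distinct color value. Everything in it is hypothetical. ParsedColor is a stand-in for a heavyweight color object (it is not the Protovis API), and LazyColorRegistry, colorFor, and the layer names are made up for illustration.

```typescript
// Hypothetical stand-in for an expensive color object; NOT the Protovis API.
class ParsedColor {
  // Counts constructions so the effect of lazy creation is visible.
  static instances = 0;
  readonly r: number;
  readonly g: number;
  readonly b: number;

  constructor(hex: string) {
    ParsedColor.instances += 1;
    const n = parseInt(hex.replace("#", ""), 16);
    this.r = (n >> 16) & 0xff;
    this.g = (n >> 8) & 0xff;
    this.b = n & 0xff;
  }
}

// Hypothetical registry: stores only strings up front, builds color
// objects on first lookup, and shares one object per distinct hex value.
class LazyColorRegistry {
  private readonly hexByLayer = new Map<string, string>();
  private readonly cache = new Map<string, ParsedColor>();

  constructor(layerColors: ReadonlyArray<[string, string]>) {
    for (const [layer, hex] of layerColors) {
      this.hexByLayer.set(layer, hex.toLowerCase());
    }
  }

  colorFor(layer: string): ParsedColor | undefined {
    const hex = this.hexByLayer.get(layer);
    if (hex === undefined) return undefined; // layer has no custom color
    let color = this.cache.get(hex);
    if (color === undefined) {
      color = new ParsedColor(hex); // constructed only when first needed
      this.cache.set(hex, color);
    }
    return color;
  }
}

// 37,000 made-up layers sharing two made-up colors.
const layers: Array<[string, string]> = [];
for (let i = 0; i < 37000; i++) {
  layers.push([`layer-${i}`, i % 2 === 0 ? "#ff8800" : "#0088FF"]);
}

const registry = new LazyColorRegistry(layers);
registry.colorFor("layer-0");
registry.colorFor("layer-1");
registry.colorFor("layer-2");
console.log(ParsedColor.instances); // 2
```

With this shape, page load costs one string per layer, and the number of color objects is bounded by the number of distinct colors actually looked up rather than by the number of layers.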

[Figure: Chrome DevTools Timeline - after]

Here’s the DevTools timeline of the same keypress post-patch.
We’ve gone from 2 seconds down to 50ms!

Lessons learned:

  1. If you build it, they will curl it. In other words, never underestimate the maximum potential usage of any facet of an API. This customer changed the way they were using TraceView’s user-facing API and created these ~37,000 custom layer colors over the course of a couple of weeks.
  2. No browser is perfect. In the past we’ve seen Firefox be noticeably slower for certain types of thrashy DOM bugs. This was the first time we’ve seen a Chrome-specific DOM-performance bug that wasn’t extension-induced. (Looking at you, ng-inspector.)
  3. Don’t make unfounded assumptions! We’ve recently been focused on squeezing every drop of performance out of our React components, which led us down a dead end in this investigation. That said, complex React components seem to be excellent canaries in the JS coal mine – they’ll be the first to complain if the environment is unhealthy!
  4. Chrome DevTools are awesome! ’Nuff said.