TruPath: A New Approach to Network Monitoring
The internet and the rise of cloud computing have changed the way applications are built and deployed. Companies are moving more and more of their infrastructure from internal data centers to the cloud, and using more SaaS services than ever before.
Let’s look at how these changes affect a company’s satellite office. In the past, the office would be connected via an MPLS circuit back to either the central office or their data center. The applications they would run would be largely thick client (non-web browser), and the application infrastructure would be supported by servers that lived in the company’s data center. To support all of this, the company would have invested in network devices that connected the remote office back to the company, essentially creating a single large private network. The infrastructure was complex, with multiple routers and switches, servers and storage.
End-user application experience monitoring has become more challenging as the world has become more distributed and more dependent on network infrastructure.
In contrast, let’s explore how this office would be set up today. Today, the office would likely be connected via a public ISP such as Comcast or Verizon. All the applications that the people in the office use are delivered using a web browser. Some of these are SaaS applications such as Salesforce.com or Office365. The phone system would be cloud-based as well, utilizing VoIP technology. The remote users store files and backups not on servers, but in the cloud. And if there is a connection back to a data center, it is a VPN that provides the secure access.
That is a great deal of change. Most striking is what has disappeared: data center servers and network hardware have been replaced by the cloud and the open internet. In our new remote office, we likely have a router with a firewall, a switch and then wired and wireless connectivity to the individual users. That’s it. Everything else lives in the cloud.
Now let’s look at how the cloud affects traditional data center operations. More and more companies are moving portions or all of their applications to public and private cloud environments. And with good reason: companies can increase their IT agility by being able to rapidly scale up (and down) infrastructure. Many companies are employing hybrid approaches, maintaining their traditional data centers and using the cloud to support traffic or computing bursts.
What’s missing in the new cloud or hybrid infrastructure? As we saw with our remote office example, the missing element is infrastructure. Gone are the servers and disk arrays, and, perhaps more importantly, all the data center network infrastructure. Office network hardware is still required, but servers both in offices and in the data center as well as data center network hardware have all gone away. And gone is visibility into the network infrastructure once it goes beyond the cloud provider’s firewall.
In making these changes, we’ve created a much more agile and flexible company. But we have also increased the importance of network performance. It is important to remember, however, that users do not care about network performance. They care about application performance. And yet, by moving to the cloud, we have increased our dependency on networks as good application performance requires good network performance.
This combination of reduced infrastructure and increased dependency on the network creates a real challenge for IT. The tools that IT teams have traditionally used to monitor systems and networks are almost all device-based. They rely on passive data from network infrastructure. In the new cloud world, IT does not own those devices and thus cannot gain access to the data from them. The public routers that your company is now dependent on for application performance are simply not accessible. And the traditional monitoring tools they have can’t help.
The new question for IT is this: How do I gain visibility into application and network performance when I no longer control the path?
AppNeta developed TruPath to solve this problem. TruPath is a unique and patented technology that provides active, continuous, low-overhead, real-time measurement of network performance through every hop in the network path. TruPath is the core technology leveraged across AppNeta tools.
Passive, device-centric monitoring tools depend on a heavyweight process: they capture, backhaul, store, concatenate, analyze and report large amounts of information obtained from every element in the network (typically via SNMP or RMON) to piece together a reasonable facsimile of activity averaged over a five- to 20-minute sample. TruPath is built upon a radically different approach based on the actual network path.
Simply put, TruPath provides you an application’s view of your network performance by monitoring the actual network path.
TruPath benefits include:
- Complete visibility: unique methodology gives you an “application’s view” of any network.
- No device dependency: see performance across all hops, whether private or public.
- Low overhead: low network load means TruPath can be used in production environments without slowing down applications.
- Scalable: with low network loads and a SaaS model, TruPath can scale to support any size network without requiring any additional hardware or retooling.
- Flexible: single or dual-ended deployment, providing support for measuring asymmetric performance
- Ubiquity: broadly available on most IP devices
- Real-time: detects and reports results within minutes
Understanding network paths
An application delivery path, or network path, is the logical route through network devices to reach a TCP/IP target (whether “real” or virtualized), regardless of device type (server, workstation, IP phone, video conference system, router, switch, firewall, wireless access point, load-balancer, etc.) or media type (copper, fiber, or wireless). A single network path can be as short as a laptop connected to a local file server over the office Ethernet or wireless LAN, or as long as a 35-hop, satellite-enabled WAN connection around the globe, or anything in between.
Internet Protocol (the IP part of TCP/IP) networks are serial mechanisms, that is, only one bit of data can really be “on the network” at a given small slice of time based on the clocking speed of the network itself. To deal with this reality, all modern implementations of IP leverage multiple network queues that are designed to store and forward data frames as they are sent and received by the elements communicating on the network. Queuing theory is very complex and a detailed treatment is beyond the scope of this document, but at the highest level, the performance of a given network queue determines the ultimate performance of the network data that travels to, from or through that queue.
If there is more room in the queue, then more data frames can be sent in a given time period and since bits over time equal throughput, the rate at which a given queue can fill and drain and repeat without data loss effectively becomes the maximum speed of the network. If too much data is pushed into a given queue before it can be effectively drained out the other side, then the queue fills and the data frames begin to “bump into each other,” which results in an immediate tipping point of steadily lower performance and data frame loss, resulting in unpredictable application performance or even total application failure.
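The fill-and-drain behavior described above can be illustrated with a toy model. The following is a minimal sketch, not a description of how any real IP stack or TruPath itself works: it assumes a single fixed-size queue, a constant arrival rate and a constant drain rate, and every name and number in it is hypothetical.

```python
def simulate_queue(arrivals_per_tick, drain_per_tick, queue_capacity, ticks):
    """Simulate a fixed-size FIFO queue; return (forwarded, dropped) frame counts."""
    queued = 0
    forwarded = 0
    dropped = 0
    for _ in range(ticks):
        # Frames arrive; anything that does not fit in the queue is lost.
        space = queue_capacity - queued
        accepted = min(arrivals_per_tick, space)
        dropped += arrivals_per_tick - accepted
        queued += accepted
        # The queue drains toward the next hop at a fixed rate.
        sent = min(queued, drain_per_tick)
        queued -= sent
        forwarded += sent
    return forwarded, dropped


# Offered load below the drain rate: the queue never fills and nothing is lost.
print(simulate_queue(arrivals_per_tick=8, drain_per_tick=10, queue_capacity=50, ticks=100))
# Offered load above the drain rate: the queue fills, then frames are dropped.
print(simulate_queue(arrivals_per_tick=12, drain_per_tick=10, queue_capacity=50, ticks=100))
```

In the second run, throughput is capped at the drain rate and the excess shows up as loss once the queue is full, which is the tipping point the paragraph above describes.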
TruPath allows the accurate and efficient reverse engineering of the performance of a given IP stack’s queue by varying the data frame (packet) sizes, the distribution of sizes amongst a multi-series of packets, the quantity of the packets in a given series and finally the precise space/timing between the packets (down to the microsecond level). The end result is that TruPath is able to quickly exercise any given network path to its maximum possible level, doing so with an absolute minimum of data inserted into the path and then dynamically learn how a given network path will perform from the application’s perspective.
Get on the packet train
TruPath is based on the principle of sending and receiving many varied short sequences of packets (or packet trains) that are transmitted using the commonly available IP network mechanisms ICMP or UDP to defined end-hosts (or targets). A target is any IP-stack that can respond to an ICMP-based ping or can send back a UDP or TCP packet.
AppNeta’s appliances mimic the activity of an end user by transmitting precise packet trains using standard ICMP and UDP protocols over the network.
One core advantage of this patented approach is that it delivers very high accuracy without placing an intrusively high instrumentation load on the network path being measured (unlike “packet flooders”). TruPath’s commonly used packet sequences are 1, 5, 10, 20, 30 and 50 packets in length. In the case of the Continuous Path Analysis™ (CPA) mode, which by default runs once every 60 seconds, there are roughly 20 to 50 total packets per minute placed on the network. As network dysfunction is detected and higher granularity is needed to identify the location and the cause of the impairment, automated escalated analysis into Deep Path Analysis™ (DPA) may send as many as 400 to 2000 packets in a series of packet trains to delve into the cause of a particular performance bottleneck.
Since the packet sequences themselves are very short, the overall load on the network is kept very low, typically averaging 2 Kbps for CPA and only 10-200 Kbps during a single deep path analysis (DPA) test. For very slow links or networks with other restrictions such as a small MTU, TruPath automatically adjusts its timing, size and distribution curves during its optimization startup phase.
This very low-impact methodology also permits TruPath to scale very well when monitoring large amounts of infrastructure and only requires a total of about 2 Mbps per 1000 unique end-points (targets) during continuous monitoring which permits TruPath to be run on existing production networks (even ones that experience load, loss or other performance-related issues).
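The load figures quoted above can be cross-checked with simple arithmetic. The sketch below uses only the numbers stated in the text (2 Kbps average per path for CPA, 20 to 50 packets per minute); the function names are illustrative, and the implied packet sizes are a derived back-of-envelope value, not a published TruPath parameter.

```python
CPA_RATE_KBPS = 2  # average CPA load per monitored path, as stated in the text


def aggregate_cpa_load_mbps(targets, per_path_kbps=CPA_RATE_KBPS):
    """Total continuous-monitoring load in Mbps (1 Mbps = 1000 Kbps)."""
    return targets * per_path_kbps / 1000


def implied_packet_bytes(rate_kbps, packets_per_minute):
    """Average packet size implied by a bit rate and a per-minute packet count."""
    bits_per_minute = rate_kbps * 1000 * 60
    return bits_per_minute / packets_per_minute / 8


print(aggregate_cpa_load_mbps(1000))            # load for 1000 targets, in Mbps
print(implied_packet_bytes(CPA_RATE_KBPS, 50))  # average bytes per packet at 50 packets/min
print(implied_packet_bytes(CPA_RATE_KBPS, 20))  # average bytes per packet at 20 packets/min
```

At 2 Kbps per path, 1000 targets add up to the 2 Mbps quoted above, and the implied average packet size falls between 300 and 750 bytes, which is at least consistent with ordinary IP packet sizes.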
By sending multiple sets of distinct sequences, TruPath can measure and analyze a wide range of different traffic conditions that a network path might experience due to application use. By probing the path repeatedly with the set of packet sequences, a statistically significant collection of responses for each type is collected. If the period of the sampling is relatively short compared to the rate at which the traffic conditions are changing, then the sampled response represents a snapshot of the conditions at the time of testing. TruPath will automatically detect when samples are captured during times of rapidly changing conditions and will adjust its measurement patterns accordingly.
Most traffic conditions are known to change over time, sometimes as fast as minute-by-minute or hour-by-hour. For example, routes may change, capacities may be altered by resetting interfaces, or traffic levels may significantly rise or fall. This would be typical for LAN, WAN, and Internet paths. In some cases, such as mobile or wireless usage, the circumstances may be changing more rapidly, on the order of just a few seconds to minutes or faster. TruPath’s self-feedback loop automatically adjusts for these kinds of conditions and permits the accuracy of the analysis to remain very high even in difficult, fast changing conditions.
TruPath can build up a complete set of statistics very quickly, in many cases in just tens of seconds. However, with lots of cross-talk traffic and other performance impairments on the path, it is possible for TruPath’s transmitted sequences to begin to interfere with each other, which could distort the results. This has been the Achilles’ Heel of other packet train dispersion implementations in the past. TruPath automatically avoids this issue by first using special patterns designed to detect if instrumentation packets are interfering with each other and, if this condition happens, it begins to take more varied samples over a much longer timescale to ensure that the resulting statistics are clean. This balanced approach to continuous monitoring and escalated testing leads to typical analysis times of a few seconds with CPA up to several minutes with DPA (both of which are covered in more detail later in this document).
Support for single and dual-ended operation
TruPath can operate in either single-ended or dual-ended modes. Each mode is designed to address specific performance management needs and individual or groups of network paths can be instrumented in single-ended mode, dual-ended mode or both modes at the same time. Single-ended mode allows for monitoring in cases such as SaaS application providers, where you are unlikely to have the ability to install a monitoring node with the SaaS provider.
The default single-ended mode requires that TruPath technology be present at only one end of the network path. It relies on the Internet Control Message Protocol (ICMP), specifically the ICMP echo request (type 8) that every modern TCP/IP stack supports. Because ICMP is a core OSI layer 3 protocol that routers need in order to operate properly, the vast majority of IP addresses respond to an ICMP “echo request” with an ICMP “echo reply.”
Other commonly used network tools like ping and traceroute also leverage ICMP, but in far simpler ways than TruPath. Leveraging ICMP provides a widespread, predictable and highly accurate mechanism for soliciting responses from any IP-based network host, and it means that the TruPath sending device (also known as a sequencer) only has to live at one end of a given network path in order to measure the complete round-trip performance. Since the sampling and associated analysis take place underneath the TCP and UDP protocols (which live at layer 4 in the Internet Protocol Suite), TruPath is able to determine what the base IP network (or the layer 3 network) can actually deliver without the overhead/impact of layer 4 protocols (whose performance can be measured using alternative methods also available from AppNeta).
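For readers unfamiliar with the mechanism, an echo request is an ICMP message of type 8, and the matching echo reply is type 0. The sketch below builds such a message with Python's standard library, assuming the usual header layout (type, code, checksum, identifier, sequence number) and the Internet checksum of RFC 1071. It only constructs the bytes; actually sending them requires a raw socket and elevated privileges, and nothing here reflects how TruPath sizes or times its own packets.

```python
import struct

ICMP_ECHO_REQUEST = 8  # ICMP type for "echo request"
ICMP_ECHO_REPLY = 0    # ICMP type for "echo reply"


def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement checksum used by ICMP (RFC 1071)."""
    if len(data) % 2:
        data += b"\x00"  # pad to an even number of bytes
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF


def build_echo_request(identifier: int, sequence: int, payload: bytes) -> bytes:
    """Build an ICMP echo request: 8-byte header followed by the payload."""
    # First pass with a zero checksum field, then fill in the real checksum.
    header = struct.pack("!BBHHH", ICMP_ECHO_REQUEST, 0, 0, identifier, sequence)
    checksum = internet_checksum(header + payload)
    header = struct.pack("!BBHHH", ICMP_ECHO_REQUEST, 0, checksum, identifier, sequence)
    return header + payload


# The identifier, sequence number and payload are arbitrary example values.
packet = build_echo_request(identifier=0x1234, sequence=1, payload=b"probe")
print(packet.hex())
```

A receiver can validate the message by running the same checksum over the whole packet; a correct packet yields zero.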
In some situations, ICMP Rate Limiting may be enabled. However, TruPath’s testing rates are well below the thresholds for rate limiting rules of all major firewall vendors. ICMP may occasionally be disabled at the target address, or blocked/shaped by some mid-path network element (although even when packet-shaping and/or other controls are deployed, TruPath’s CPA packets typically are not affected due to a combination of their short duration and small quantity as they tend to squeeze into the gaps of most packet shaping mechanisms and thus still measure the raw network capacity accurately). Moreover, if any of TruPath’s sampling packets are affected by traffic shapers in the path, these effects are very obvious to the analytics engine due to their mechanical signatures (very unlike “normal” traffic signatures, which are non-mechanical in nature) and a flag is raised that the results are being affected. Under the rare circumstances where ICMP is still not a viable option, the target side of a path can be instrumented with a second sequencer that can leverage the same TruPath methodology, but it can use UDP instead of ICMP.
The second mode, dual-ended, requires placing TruPath software (also called a sequencer) at both ends of a path. This allows TruPath to measure the asymmetric path performance to understand the differences in performance in each direction. In dual-ended mode, more paths are measured using UDP packets in order to measure upstream and downstream performance separately.
It’s important to note that the TruPath methodology can take advantage of nearly any network transport mechanism. ICMP and UDP are used because both are present in every modern IP-enabled device and because measuring some key path metrics, especially bandwidth, at the TCP level often leads to erroneous results that are heavily affected by TCP window size and overall path latency and RTT.
TruPath automatically avoids all of that and yields a more accurate result with far less network overhead.
As such, the actual payload of the packets themselves and the protocol used to put the packets on the network path is completely irrelevant to TruPath’s overall accuracy – the protocol only becomes a consideration for production paths that employ protocol shapers that affect one protocol vs. another. Even in these cases, TruPath’s unique ability to measure “under Layer 4” allows it to understand what the base network is capable of delivering before any optimization effects come into play. The critical requirement for TruPath analytics is to extract packet timings from the end-to-end network – therefore almost any packet will do.
Measurement modes and accuracy
TruPath actively probes the specified network path and generates one or more packet timing distributions for that path. A number of different groupings of packets are sent, ranging from single packets to small bursts to short streams. Various sizes and, in some cases, various protocols are used.
This process ranges from just a few seconds in monitoring mode to a few minutes under escalated troubleshooting mode. By default, packet sequences are sent at an average of 2 Kbps when monitoring and 30 Kbps when troubleshooting. These non-intrusive levels are designed for network paths that operate at 512 Kbps or higher. If network paths with less than 512 Kbps of total capacity need to be measured, TruPath is capable of automatically controlling its own packet rate to ensure proper sampling without overwhelming the network itself. From the distributions of packet sequence timings that TruPath captures, including loss and various forms of network error, it extracts critical performance data through sophisticated analysis. The numbers produced exactly reflect the response of the end-to-end path and accurately reflect how the network will be seen by an application.
Based on precise network models that have been refined over the past 11+ years and after 20 million+ samples on real-world customer networks, the accuracy of TruPath’s measurements has been validated both in AppNeta’s lab testing (using dedicated hardware with timing resolution down to the nanosecond range) and by customers’ existing testing methodologies. The accuracy of these measurements proves to be within a few percentage points of results obtained from far more intrusive and less scalable methods, including packet sniffing, pipe loading/packet flooding and other similar methods.
In general, TruPath’s lightweight continuous monitoring instrumentation will be within +/- 5% of results measured “on the wire,” and the deeper troubleshooting instrumentation will tighten the results to +/- 2%. TruPath’s accuracy results can be affected only by the quality of the timing distributions generated. For example, under moderate to heavy packet loss conditions, additional iterations may be needed to produce statistically accurate results—which TruPath automatically determines and adjusts accordingly.
The near real-time performance metrics produced by TruPath’s measurements include:
- Maximum available bandwidth (both for symmetric and asymmetric paths)
- Utilized bandwidth (both for symmetric and asymmetric paths)
- Available bandwidth (both for symmetric and asymmetric paths)
- Data and voice jitter
- Data and voice loss
- RTT (for the total path and for all mid-hops along the path)
- Route maps and associated route history, with complete RTT measures per route
- MTU size (and mismatches along the path)
- QoS markings and any mismatches along the path
- Voice Mean Opinion Score (MOS)
One of the key measurements is maximum bandwidth, which is the upper limit on the data transfer capacity of the end-to-end network path. Like looking through a series of keyholes of varying sizes, this path’s bandwidth is constrained by the smallest bandwidth on all the intervening links. This limiting value also constrains the performance of all applications using this path, giving “the application’s view of the network.”
Amongst other techniques, TruPath uses a form of analysis referred to as packet train dispersion. It notes how certain packet sequences are affected by the presence of a bottleneck. In particular, the bottleneck causes the distance between packets in a packet train to be increased. That separation exactly reflects the size of the bottleneck and can also be used to determine overall path utilization values. The packet train dispersion algorithms have been thoroughly measured and proven to scale very effectively and accurately from very low bandwidth links (<64 Kbps) all the way to 10 Gbps and beyond.
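The general idea behind packet train dispersion can be shown with an idealized model. The sketch below is the textbook form of the technique, not AppNeta's patented algorithm: it assumes store-and-forward links, no cross traffic, no propagation delay and equal-sized packets, and the link speeds are hypothetical. Under those assumptions, the gap between consecutive packets at the receiver equals the packet size divided by the slowest link's capacity.

```python
from statistics import median


def send_train(link_capacities_bps, packet_bits, train_length):
    """Return departure times of a back-to-back train after crossing every link."""
    arrivals = [0.0] * train_length  # all packets are ready to send at t = 0
    for capacity in link_capacities_bps:
        service_time = packet_bits / capacity
        departures = []
        link_free_at = 0.0
        for arrival in arrivals:
            start = max(arrival, link_free_at)  # wait if the link is still busy
            link_free_at = start + service_time
            departures.append(link_free_at)
        arrivals = departures
    return arrivals


def estimate_bottleneck_bps(receive_times, packet_bits):
    """Estimate path capacity from the median gap between consecutive packets."""
    gaps = [later - earlier for earlier, later in zip(receive_times, receive_times[1:])]
    return packet_bits / median(gaps)


links = [100e6, 10e6, 1e9]  # hypothetical path: 100 Mbps, 10 Mbps, 1 Gbps
times = send_train(links, packet_bits=1500 * 8, train_length=10)
estimate = estimate_bottleneck_bps(times, packet_bits=1500 * 8)
print(f"estimated bottleneck: {estimate / 1e6:.1f} Mbps")
```

The estimate recovers the 10 Mbps link, the "smallest keyhole" on the path, even though the faster links sit on either side of it. Real paths add cross traffic and timing noise, which is why a production tool needs many varied trains and statistical filtering rather than a single measurement.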
Continuous Path Analysis (CPA) and Deep Path Analysis (DPA)
TruPath provides two distinct methods of path performance instrumentation. The first, Continuous Path Analysis (CPA), is designed to monitor a very large quantity of paths with as little overhead as possible in order to get a general sense of the path quality and performance. For higher resolution, TruPath also includes Deep Path Analysis (DPA), which can instrument a path to a higher resolution and associated accuracy and also provide the additional leading indicators needed to feed into APEX for diagnostics and troubleshooting. This combination of CPA and DPA permits TruPath to maintain a very light touch in the network, scale automatically to tens of thousands of paths and still offer extremely high levels of accuracy.
CPA takes advantage of the variability of “statistical resolution” to provide an automated mechanism for continuously monitoring tens of thousands of network paths simultaneously. The goal of CPA is to understand the general performance and quality characteristics of a network path to within +/- 5% of what could be measured “on the wire”, but be able to do this quickly, repeatedly and with the absolute lowest possible impact on the production network path.
This approach is distinctly different from RMON/SNMP techniques, which only monitor the state of individual elements on the path, and from Cisco’s IP SLA, which has no mechanism for measuring any of the three network capacities (total, used, or available) and only works to/from Cisco network elements (typically routers). Cisco’s IP SLA testing rates are also extremely heavy at 1000 packets per minute (at least 20x TruPath’s CPA rate), and still offer no diagnostics. In contrast, CPA uses approximately 20-50 packets per minute to produce its measures, which are treated as “critical indicators” and are precursors to the more fine-grained results available in DPA.
When critical indicators vary from an expected/accepted value, these values (indicators) are picked up during the continuous monitoring phase, and CPA automatically responds by increasing statistical resolution to improve the accuracy of its measures in order to confirm the variation as an actual undesirable change in network conditions. This “escalated mode” (also called CPA2) in CPA prevents TruPath from auto-escalating too quickly into the more accurate (and slightly more intrusive) DPA mode unless a network path dysfunction is truly present.
Furthermore, the CPA2 mode makes use of intelligent algorithms in order to increase the resolution of the measures just far enough to confirm the defect, and keep the measurement packets as low as possible on the path. If the leading indicator that led to escalated mode is proven to be a false positive, the escalated mode automatically backs down to a normal monitoring level. TruPath’s ability to automatically vary the measurement resolution means that the monitoring system can operate without human intervention, scaling from very light-touch probing for most paths to comprehensive measurement and diagnostics where and when it is needed.
This auto-escalation along with variable resolution permits TruPath to scale easily to monitor tens of thousands of paths. Thanks to this methodology, TruPath can spread its attention very widely, focusing down as needed on the few paths that indicate deviation from performance norms. Once degradation has been confirmed by TruPath’s DPA, an alarm is generated to inform the appropriate individual so that timely remediation efforts can begin. Any operator responding to an alarm is presented with a fully detailed report including the precise measurements and a completed diagnosis of the degradation.
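The escalation flow described in this section can be pictured as a small state machine. The sketch below is a simplified reading of the text, not TruPath's actual control logic: the 5% threshold, the single "deviation" input and the rule that entering DPA raises one alarm are all assumptions made for illustration.

```python
from enum import Enum


class Mode(Enum):
    CPA = "continuous"
    CPA2 = "escalated"
    DPA = "deep"


def next_mode(mode, deviation, threshold=0.05):
    """Pick the next measurement mode from the latest indicator deviation.

    `deviation` is the fractional difference between a measured indicator and
    its expected value; `threshold` is a hypothetical tolerance.
    """
    violated = abs(deviation) > threshold
    if mode is Mode.CPA:
        return Mode.CPA2 if violated else Mode.CPA
    if mode is Mode.CPA2:
        # Higher-resolution re-check: confirm the problem or back down.
        return Mode.DPA if violated else Mode.CPA
    # After a deep analysis, return to light-touch monitoring.
    return Mode.CPA


def run(deviations):
    """Walk a sequence of deviations; return the modes visited and the alarm count."""
    mode, history, alarms = Mode.CPA, [], 0
    for deviation in deviations:
        mode = next_mode(mode, deviation)
        history.append(mode)
        if mode is Mode.DPA:
            alarms += 1  # simplification: assume DPA confirms what CPA2 found
    return history, alarms


history, alarms = run([0.01, 0.2, 0.01, 0.2, 0.3, 0.0])
print([m.name for m in history], alarms)
```

In this run, the first excursion is dismissed as a false positive after the escalated re-check, while the second persists and triggers a deep analysis and a single alarm.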
TruPath is an effective solution for proactive performance management. It delivers significant improvements over traditional SLA monitoring. Since TruPath works across third party networks and segments a network path to show the boundaries, it provides a vastly more thorough and accurate view of a network provider’s quality of service thanks to CPA’s ability to generate a continuous representation of a range of network behaviors over long periods of time such as bandwidth, loss, jitter and latency.
Application Path Expert System (APEX)
TruPath analyzes network paths in two ways: a functional network model and a dysfunctional network model. “Functional” implies that the path is performing according to normal network design—in that case, the measurements made represent its capacities and usage.
Being “dysfunctional” implies behaviors that are outside design norms. The simplest example of this is packet loss. A perfect functional IP network should never lose packets. Once traffic levels have exceeded capacity, it is possible to have congestion loss. However, that means that the network is then operating outside of design specification. Besides congestion, there are many other dysfunctional conditions that can cause loss or other behaviors that degrade performance.
When TruPath detects degradation symptoms, it automatically performs diagnostic analyses against models of network dysfunction. These models isolate and identify characteristics that are specific to a particular source. Each type of degradation affects the packet trains differently and thus creates a unique “signature” that distinguishes one type of degradation from any other. TruPath’s patented analytics engine, the Application Path Expert System (APEX), performs a form of pattern recognition on the packet timings, loss and other network errors to assess which known type of degradation may be present.
Information is extracted from the packet timings to construct a test signature that is unique to that path at the time of testing. The test signature is compared to all the known signatures to determine which one is the most likely match. APEX uses probabilistic analysis to indicate what problem the current behavior looks most like. This can analogously be compared to face recognition—a clear photograph of a face can be uniquely compared to sample photographs to generate a match, even if not identical.
The only obstacle to precise diagnostics is the quality of the information being analyzed. If the photograph is blurred, taken at a great distance, or otherwise indistinct, making a solid match is difficult. Similarly, with TruPath’s path analysis, insufficient iterations or high traffic noise may hamper successful diagnosis. Further, new sources of dysfunction occasionally appear and may not be recognized—or may be confused with a known cause incorrectly. Today, APEX contains approximately 88 unique signatures and observations. AppNeta routinely works with customers to identify the cause of any unresolved diagnostics, and then adds this information to APEX.
The end result is that TruPath can identify the common sources of significant degradation like duplex conflicts and distinguish them from others like congestion or media errors. APEX produces the various flags and statements that appear in AppNeta applications. It also produces the certainty measures that reflect how closely a particular signature has been matched. These measures show how similar the observed behavior is to the ideal case of a given degradation type and provide a built-in confidence scoring mechanism for the end user.
Since it is very difficult for end users to interpret ambiguous or conflicting information, TruPath presents a conservative analysis of the output of APEX. Although APEX always evaluates dysfunctional behaviors, TruPath will default away from showing unclear or misleading matches and instead recommend steps to improve the testing.
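The matching step can be illustrated with a generic nearest-signature classifier. The sketch below is not APEX: the three features, the signature values and the 0.95 cutoff are invented for illustration, and cosine similarity stands in for whatever probabilistic analysis APEX actually uses. It shows only the shape of the idea, which is to score a test signature against known ones and withhold a diagnosis when the best match is not convincing.

```python
import math

# Hypothetical signatures: (loss rate, normalized jitter, normalized gap spread).
KNOWN_SIGNATURES = {
    "congestion": (0.2, 0.8, 0.7),
    "duplex conflict": (0.9, 0.3, 0.2),
    "media errors": (0.6, 0.1, 0.1),
}


def cosine_similarity(a, b):
    """Similarity of two feature vectors, ranging up to 1.0 for identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def diagnose(test_signature, min_confidence=0.95):
    """Return (label, score) for the best match, or (None, score) if it is unclear."""
    label, score = max(
        ((name, cosine_similarity(test_signature, signature))
         for name, signature in KNOWN_SIGNATURES.items()),
        key=lambda item: item[1],
    )
    return (label, score) if score >= min_confidence else (None, score)


print(diagnose((0.2, 0.8, 0.7)))  # matches a known signature closely
print(diagnose((0.5, 0.5, 0.5)))  # ambiguous input: no diagnosis is reported
```

Returning no label for the ambiguous case mirrors the conservative presentation described above: the score is still computed, but an unclear match is not shown as a diagnosis.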
See your networks clearly
TruPath represents an entirely new way to assess, monitor, troubleshoot and report on network performance from the perspective of the applications that run on the network. By providing real-time and historical application-aware network performance knowledge from the locations where applications and services are actually consumed, TruPath delivers on the promise of remote site network performance management. Its integrated suite of modules quickly and easily removes the haze of confusion and stops the blame storming, replacing them with the operational knowledge and resulting confidence needed for the successful delivery of any performance-critical application on any IP-based network.