A video conferencing partner of ours recently asked for help with a high-profile technology customer. The vendor’s gear was being blamed for frequent video conferencing session drops. This was causing friction between the vendor and the customer’s network team, and pressure was mounting on the IT manager to fix the issue at all costs.
The symptom: Otherwise healthy video conference sessions end without warning, often during executive meetings.
The video conferencing vendor deployed PathView microAppliances near the video endpoints at each of three high profile locations, and PathView’s advanced path monitoring was configured between sites using the best practice of single and dual-ended path monitoring. This ensured a multi-protocol active performance analysis that would exclude the vendor’s equipment from possible root cause.
With per-minute performance monitoring enabled against strict thresholds for latency, loss, jitter, QoS changes, route changes, and available capacity bottlenecks, we saw only low-level, intermittent loss. The WAN circuits were 100 Mbps Internet services with IPsec tunnels between sites carrying corporate and video conferencing traffic.
Although low levels of packet loss were noted in the customer’s service provider network (WAN), the loss was intermittent and was not observed while the session drops were occurring. Here’s PathView showing a totally clean end-to-end network during the time of video conferencing session drops. Having ruled out the network, we moved on to the application.
Next, FlowView was deployed to monitor traffic ‘on the wire’ during the project, and we noticed normal video conferencing behavior, shown in the example below. The media stream is RTP and makes up the majority of the traffic (blue), with some TCP signaling (yellow). The customer reported a similar pattern: video sessions ran fine for between 1:15 and 1:19, then terminated suddenly. FlowView shows a stable media stream for exactly an hour, then a sudden burst with increased TCP signaling traffic for another few minutes.
For thoroughness, we also ran multiple AppView Video sessions between the PathView appliances near each endpoint to validate how the network responds to our own H.264 video traffic streams. Because this takes the vendor’s gear out of the picture, it let us further rule the infrastructure in or out.
With a strong suspicion we’re looking at an application (video endpoint) issue, we employed FlowView’s remote full packet capture facility. While many network elements and video endpoints have the ability to run packet capture, the small buffers make them unusable for the type of long running captures you need to catch transient issues. With FlowView’s scheduler, we triggered simultaneous 3-hour captures at each location to coincide with scheduled video sessions.
FlowView uses a highly efficient capture mechanism that buffers, compresses, and encrypts traffic designated for capture. Think of it as a GUI-driven tcpdump that’s been battle tested for remote troubleshooting. The microAppliance is rated for 1Gbps wire-speed capture, so we’re clear for our testing.
The scheduled captures were synchronized across sites and used capture filters matching the video endpoints.
The resulting .pcap files showed something very interesting. During the video session, the RTP streams are accompanied by periodic TCP control and signaling messages. At just over 1 hour in, there’s a 9-minute gap in those control messages.
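A gap like this can also be found programmatically once packet timestamps have been exported from a capture. The following is a minimal sketch in Python, assuming the TCP signaling timestamps have already been extracted as seconds since the start of the capture. The function name, the 120-second threshold, and the sample data are hypothetical and are not taken from the actual trace.

```python
from typing import List, Tuple


def find_signaling_gaps(
    timestamps: List[float], min_gap_s: float
) -> List[Tuple[float, float, float]]:
    """Return (start, end, duration) for every pause between
    consecutive packets that is longer than min_gap_s seconds."""
    ordered = sorted(timestamps)
    gaps = []
    for prev, cur in zip(ordered, ordered[1:]):
        delta = cur - prev
        if delta > min_gap_s:
            gaps.append((prev, cur, delta))
    return gaps


# Hypothetical signaling pattern: one control message every 30 s for
# 61 minutes, then silence for 9 minutes, then a short burst.
signaling = [float(t) for t in range(0, 3661, 30)] + [4200.0, 4201.0, 4202.0]

for start, end, duration in find_signaling_gaps(signaling, min_gap_s=120):
    print(f"gap from {start:.0f}s to {end:.0f}s ({duration / 60:.0f} min)")
```

Running the same check on the signaling flow from each site's capture makes it easier to see whether both endpoints went quiet at the same moment.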
A few seconds after the gap we see a flood of traffic between endpoints, and within a few minutes the session drops. The video endpoint logs showed the disconnects were being caused by normal timeout behavior. Something in the network between the devices was altering normal communication, most likely a firewall. It appears that the mid-path device timed out the session in the absence of persistent TCP control messages, then attempted to recover the communication when follow-on signaling was received. This in turn caused the endpoints to time out and drop the session. When the customer reviewed the settings on the site’s Check Point firewall, they found the stateful TCP session timeout had been set to 1 hour. Bingo.
Blame: Network & Application
The customer moved to a longer timeout value and the sessions since have lasted the desired duration.
A few valuable lessons:
- It’s not always the network OR the application. It can be both, so be ready with tools for both.
- Not all network clients have identical signaling and keepalive behavior.
- Check your voice/video conferencing vendor’s best practices for optimal firewall configuration.
- Some client signaling and keepalive behavior is configurable. For Linux hosts, check out http://tldp.org/HOWTO/TCP-Keepalive-HOWTO/usingkeepalive.html. This could replace altering firewall behavior in some deployments if your client equipment supports this or similar configuration.
- Sometimes you just need to look at what’s on the wire. Don’t count on traditional network tools like SNMP managers to help you solve application or actual network issues.
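To illustrate the keepalive lesson above, here is a minimal sketch of enabling TCP keepalives on a socket from Python on a Linux host. The helper name and the timing values are assumptions chosen for illustration. In practice the idle time would need to be shorter than the firewall’s session timeout, and a given video endpoint may not expose these settings at all.

```python
import socket


def enable_keepalive(sock: socket.socket, idle_s: int = 600,
                     interval_s: int = 60, probes: int = 5) -> None:
    """Turn on TCP keepalives so an idle connection still sends
    periodic probes that a stateful mid-path device can see."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # The per-socket tuning options below are platform specific
    # (available on Linux), so set them only where they exist.
    if hasattr(socket, "TCP_KEEPIDLE"):
        # Seconds of idle time before the first probe is sent.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_s)
    if hasattr(socket, "TCP_KEEPINTVL"):
        # Seconds between unanswered probes.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_s)
    if hasattr(socket, "TCP_KEEPCNT"):
        # Unanswered probes allowed before the connection is dropped.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)


# Example: configure a socket before connecting (values are illustrative).
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_keepalive(sock)
```

Without the per-socket options, Linux falls back to the system-wide defaults described in the HOWTO linked above.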