Real-Time Operational Insight into AppNeta using ElasticSearch
AppNeta’s customers deploy thousands of appliances at locations across the globe. These appliances continuously collect data about target applications and networks and send it to a cloud-based management system, where it is stored and analyzed.
Managing these appliances is not an easy task. They are often installed in remote locations where local IT staff are not readily available or not fully aware of the nature of the deployment. It has become increasingly important for the AppNeta Customer Success team to have operational insight into the appliances for effective support and troubleshooting. The team needs to know the last good state of an appliance and any state changes made to it. With this goal in mind, we recently upgraded our monitoring tools to collect configuration, operational and usage data from the appliances in real time. This information is useful not only to the Customer Success team but also to AppNeta’s product teams, who use it to understand how customers are using our solution.
The monitoring system that collects the data is powered by ElasticSearch, a popular open-source search and analytics engine. We chose ElasticSearch mainly for its easy-to-use RESTful APIs, the availability of an off-the-shelf data visualization tool, ease of deployment, support for clustering and its flexible schema.
Appliances periodically gather data and send it to our data collectors running in Amazon. The data collectors are lightweight worker processes whose sole job is to collect data from the appliances and insert it into the ElasticSearch cluster. Because no processing is done at collection time, the overhead incurred is minimal and there is no impact on the performance monitoring and diagnostic tests running on the appliance. It also means that additional processing may be required on the ElasticSearch data to generate new views that help answer specific queries. This trade-off is acceptable because it allows us to scale to a large number of appliances and support ad-hoc queries.
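As a rough illustration of how little a collector needs to do, here is a minimal sketch of turning a batch of appliance reports into a request body for ElasticSearch's bulk API, which takes newline-delimited JSON with one action line followed by one document line. The index name, field names and sample values are hypothetical and not taken from our deployment; depending on the ElasticSearch version, the action line may also need a `_type`.

```python
import json


def build_bulk_body(index, reports):
    """Serialize appliance reports into an ElasticSearch _bulk request body."""
    lines = []
    for report in reports:
        # Action line: tells ElasticSearch to index the document that follows.
        lines.append(json.dumps({"index": {"_index": index}}))
        # Source line: the report exactly as the appliance sent it.
        lines.append(json.dumps(report, sort_keys=True))
    # The bulk API expects newline-delimited JSON ending with a newline.
    return "\n".join(lines) + "\n"


# Hypothetical reports; real field names will differ.
reports = [
    {"appliance_id": "a-001", "category": "usage", "cpu_percent": 12.5},
    {"appliance_id": "a-002", "category": "usage", "cpu_percent": 40.0},
]
body = build_bulk_body("appliance-monitoring", reports)
print(body)
```

The resulting string would be sent as the body of a POST to the cluster's `_bulk` endpoint. The collector does not inspect or transform the reports, which keeps it cheap to run.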
The data gathered falls into three categories:
- Configuration data - this includes information such as appliance version, model number, hardware address, timezone, OS version, interfaces and routes. Each interface returns additional information such as IP address, static or DHCP addressing, VLAN, wireless mode, FlowView packet capture settings and bridge or mirror mode. Configuration data is sent periodically and whenever a user makes a change.
- Operational data - this keeps track of connection states between an appliance and remote hosts. The information includes last connect/disconnect time, protocol, remote address and port, cipher used and whether the connection is proxied or direct. Operational data is sent whenever a connection state changes.
- Usage data - this captures CPU, memory, hard disk and RAM disk usage and is reported periodically.
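To make the three categories concrete, the sketch below shows what one report of each kind might look like, along with a small helper that checks a report for the fields its category is expected to carry. Every field name and value here is a hypothetical stand-in chosen for illustration; the actual documents contain more fields and may be shaped differently.

```python
# Fields assumed to be mandatory for each category (illustrative only).
REQUIRED_FIELDS = {
    "configuration": {"appliance_id", "timestamp", "version", "model", "interfaces"},
    "operational": {"appliance_id", "timestamp", "remote_address", "protocol", "state"},
    "usage": {"appliance_id", "timestamp", "cpu_percent", "memory_percent", "disk_percent"},
}


def missing_fields(doc):
    """Return the required fields absent from a report, sorted by name."""
    required = REQUIRED_FIELDS[doc["category"]]
    return sorted(required - set(doc))


config_doc = {
    "category": "configuration",
    "appliance_id": "a-001",
    "timestamp": "2014-01-01T00:00:00Z",
    "version": "x.y.z",
    "model": "m22",
    "interfaces": [{"name": "eth0", "ip": "192.0.2.10", "dhcp": True, "vlan": None}],
}
operational_doc = {
    "category": "operational",
    "appliance_id": "a-001",
    "timestamp": "2014-01-01T00:05:00Z",
    "remote_address": "192.0.2.50",
    "protocol": "tcp",
    "state": "connected",
}
usage_doc = {
    "category": "usage",
    "appliance_id": "a-001",
    "timestamp": "2014-01-01T00:10:00Z",
    "cpu_percent": 12.5,
    "memory_percent": 48.0,
    "disk_percent": 31.0,
}

for doc in (config_doc, operational_doc, usage_doc):
    print(doc["category"], missing_fields(doc))
```

A check like this would live on the query side rather than in the collector, since the collectors are meant to store whatever the appliance sends.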
This is a work in progress. More data will be collected in the future, either to support new features or to improve the troubleshooting capabilities of existing ones. Either way, it was important to build a framework that is extensible, meaning that new data can be requested on the fly without upgrading the appliance. We achieve this with a framework that can execute arbitrary shell scripts to collect additional custom data. Whenever new data is required, we write a shell script to collect it and push the script to the appliances from the cloud-based management system. At a pre-defined reporting interval, the script is executed and the new data element is reported back to a data collector. Since ElasticSearch does not require a schema to be defined up front, it creates an implicit mapping for any new data inserted on the fly. The new data is therefore inserted into ElasticSearch automatically and becomes available for query and analysis.
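The script-driven collection described above can be sketched roughly as follows. This is not the appliance's actual code: the function name, the report fields and the use of `sh -c` are assumptions made for illustration. The idea is simply to run a pushed script, capture what it prints, and wrap the result in a document that a collector can insert as-is.

```python
import json
import subprocess
import time


def run_custom_collector(name, script, timeout_seconds=30):
    """Run a shell script and wrap its output as a report document."""
    result = subprocess.run(
        ["sh", "-c", script],
        capture_output=True,
        text=True,
        timeout=timeout_seconds,
    )
    return {
        "category": "custom",
        "name": name,
        "timestamp": int(time.time()),
        # A non-zero exit code lets the query side filter out failed runs.
        "exit_code": result.returncode,
        # Output is kept as a string so every script maps to the same field type.
        "value": result.stdout.strip(),
    }


# Hypothetical script pushed from the management system.
doc = run_custom_collector("kernel_name", "uname -s")
print(json.dumps(doc))
```

Keeping `value` as a string is a deliberate choice in this sketch: ElasticSearch's implicit mapping infers a field's type from the first value it sees, so mixing numbers and text under one field name can cause later inserts to be rejected.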
Data Visualization and Analysis
One key advantage of the ElasticSearch platform is its support for Kibana, a data visualization tool that lets you view your ElasticSearch data through custom dashboards. The dashboards are fully data-driven and easy to create. We have created a few simple panels in Kibana to view data snapshots in table and histogram formats. Below is an example of CPU and memory usage from one appliance over a 30-day period:
Configuration data in tabular format for the same appliance looks like this:
In addition to these dashboards, Kibana is also a great tool for quickly searching and filtering through massive amounts of data in ElasticSearch. We have also developed scripts to answer specific queries like “How many m22 appliances have a wireless interface configured?”, “How many appliances currently have FlowView configured?” or “What is the appliance count per base image?”.
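A question such as the appliance count per base image maps naturally onto an ElasticSearch terms aggregation. The sketch below is one possible way to express it and is not our actual script: the field names `base_image` and `appliance_id` are assumptions, and the sample response is made up to show the shape of the result. Because configuration is reported repeatedly, the query counts distinct appliance IDs per bucket with a cardinality sub-aggregation (which is approximate) instead of relying on the raw document count.

```python
def count_appliances_by(field, max_buckets=50):
    """Build a query body that counts distinct appliances per value of `field`."""
    return {
        "size": 0,  # only the aggregation is needed, not the matching documents
        "aggs": {
            "groups": {
                "terms": {"field": field, "size": max_buckets},
                "aggs": {
                    "appliances": {"cardinality": {"field": "appliance_id"}}
                },
            }
        },
    }


def parse_counts(response):
    """Map each bucket key to its distinct appliance count."""
    buckets = response["aggregations"]["groups"]["buckets"]
    return {bucket["key"]: bucket["appliances"]["value"] for bucket in buckets}


query = count_appliances_by("base_image")

# Hypothetical response in the shape ElasticSearch returns for this query.
sample_response = {
    "aggregations": {
        "groups": {
            "buckets": [
                {"key": "image-a", "doc_count": 1200, "appliances": {"value": 40}},
                {"key": "image-b", "doc_count": 300, "appliances": {"value": 10}},
            ]
        }
    }
}
print(parse_counts(sample_response))
```

This still counts every appliance that has ever reported a given image. Restricting the search to a recent time window would be one way to approximate the current state.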
We have been using a 2-node ElasticSearch cluster for over 4 months now. The total number of documents inserted to date is around 25M, with 200K new documents inserted every day. The total data size is 10GB. Although the current footprint is relatively small, we expect it to grow quickly as new appliances are provisioned every day.
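These figures allow a quick back-of-the-envelope estimate of the per-document footprint and the daily growth at the current rate. The calculation treats 10GB as decimal gigabytes and assumes the size on disk scales linearly with document count, which is only an approximation since index overhead is included in the total.

```python
total_docs = 25_000_000
total_bytes = 10 * 10**9  # 10GB, treated as decimal gigabytes
docs_per_day = 200_000

avg_doc_bytes = total_bytes / total_docs
daily_growth_mb = docs_per_day * avg_doc_bytes / 10**6
days_at_current_rate = total_docs / docs_per_day

print(avg_doc_bytes)         # 400.0 bytes per document
print(daily_growth_mb)       # 80.0 MB per day
print(days_at_current_rate)  # 125.0 days, consistent with "over 4 months"
```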
We have put a basic framework and simple tools in place to query the raw data. The next step is to build a layer on top of it to perform complex analysis using ElasticSearch’s powerful query DSL.