New Techniques Detect Anomalies in Big Data
Anomaly detection algorithms use machine learning, statistical analysis, and human insight to classify and solve problems hidden within terabytes of data. The challenge: to react and respond to critical events in real time.
By Bruno Kurtic, Founding Vice President of Product and Strategy, Sumo Logic
Log management is often seen as the starting point for analyzing data that is generated by all IT and business systems with the intent to detect events in the data that are important to the operation of applications, IT infrastructure, and business systems. In short, it's one of the most important and broadest pillars of business intelligence (BI), helping analysts to improve system or application availability, prevent downtime, detect and avert fraud, and identify interesting changes in customer and application behavior. However, traditional operational analytics and log management tools fail to help users proactively discover events they don't anticipate, events that have not occurred before, and events that may have occurred before but are not understood.
One challenge is that traditional log management methods rely on written rules or queries to detect events. In reality, the explosion in machine data makes it impossible for humans to write every rule. Most of these events are not known, new ones occur all the time, or they can't be described by rules or queries, which causes them to go undetected. Administrators need a mechanism that allows the data to automatically "tell" users what is happening. New techniques for anomaly detection are being launched to address anomalies at the speed with which they occur in business, before they lead to application or system availability and performance issues, breaches, or outages.
Deciphering the Known and the Unknown
These techniques are now embodied in solutions for anomaly detection, which derive their power by combining a predictive and an investigative component. The main objectives of these anomaly-detection engines are threefold:
- uncover unknown events
- enable humans to enrich them
- share that knowledge with others
These engines are designed to detect events across an enterprise's application and operational infrastructure. The events are conveniently classified as either known (previously encountered, classified, and well understood) or unknown (not previously detected or identified, regardless of how many times they have occurred).
If a business analyst observes an event that she is familiar with, she can remediate it and set up an alert to find such incidents in the future. The challenge is that the bulk of the events that occur are unknown and, as such, unfamiliar. Therefore, the goal of anomaly detection is to discover previously unknown events, surface them for investigation, and convert them into known events -- events that we know how to handle.
Over time, someone in product development, operations, or security facing these anomalies will discover more events she didn't know existed. A typical methodology is to detect the anomaly, classify and document the event based on relevance and severity, and embed human knowledge into the data stream to specify what to do if the event occurs again.
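The detect, classify, document, and annotate cycle described above can be pictured with a minimal Python sketch. It is an illustration only, not the design of any particular product: the class names, the fields of the annotation record, and the example event signature are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    """Human knowledge attached to an event signature (hypothetical schema)."""
    label: str
    severity: str
    action: str


@dataclass
class KnowledgeBase:
    """Maps event signatures to annotations supplied by analysts."""
    known: dict = field(default_factory=dict)

    def triage(self, signature):
        """Return the stored annotation, or None if the event is still unknown."""
        return self.known.get(signature)

    def annotate(self, signature, label, severity, action):
        """Convert an unknown event into a known one by recording what to do."""
        self.known[signature] = Annotation(label, severity, action)


kb = KnowledgeBase()
signature = "disk.smart.reallocated_sectors"  # hypothetical signature

first = kb.triage(signature)   # unknown: surfaced for investigation
kb.annotate(signature, "impending disk failure", "high", "replace disk")
second = kb.triage(signature)  # known: the stored guidance is returned
```

The first lookup finds nothing, so the event would be surfaced to an analyst. Once she records what it means and what to do, the next occurrence is handled as a known event, and the same record can be shared with other users.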
In anomaly detection, as in some other leading-edge technologies, the jargon of the trade is new (or at least in flux). When we talk about an anomalous event today, we generally mean one that builds from a crescendo of individual events occurring, often suddenly, at approximately the same time. We can expect the terminology to evolve as the art and science of anomaly detection go mainstream.
In any case, it's clear that anomalies can play a big role, for better or worse, in system availability and performance: when a process, application, or infrastructure component fails or slows down, the failure is typically presaged by multiple types of events happening simultaneously or in quick succession. Anomaly detection deciphers how this series of events and their patterns vary from the norm and enables experts to quickly determine what it means to the business.
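One simple way to picture "multiple types of events in quick succession" is to count how many distinct event types fall inside a short trailing time window. The Python sketch below does only that. The function name, the ten-second window, and the sample events are assumptions made for illustration, and a real engine would look at far richer patterns than a distinct-type count.

```python
from collections import deque


def distinct_types_in_window(events, window_seconds):
    """For each event, count the distinct event types in the trailing window.

    `events` is an iterable of (timestamp_seconds, event_type) pairs,
    sorted by timestamp.
    """
    window = deque()
    counts = []
    for ts, etype in events:
        window.append((ts, etype))
        # Drop events that have fallen out of the trailing window.
        while window[0][0] < ts - window_seconds:
            window.popleft()
        counts.append(len({t for _, t in window}))
    return counts


# Hypothetical event stream: routine logins, then a cluster of warnings.
events = [(0, "login"), (50, "login"), (100, "disk_warn"),
          (102, "io_timeout"), (103, "db_retry"), (200, "login")]
counts = distinct_types_in_window(events, window_seconds=10)
# counts == [1, 1, 1, 2, 3, 1]
```

The count rises to three around the 100-second mark, where three different event types arrive within a few seconds of each other. That kind of cluster is what would be compared against normal behavior.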
The power behind anomaly detection is neither a single technology nor a single technique. It's typically a set of algorithms that work synergistically, leveraging machine learning techniques as well as mathematical and statistical analysis. The algorithms are developed specifically for anomaly detection -- for example, the Sumo Logic implementation of anomaly detection uses no "off the shelf" algorithms.
Learning Right from Wrong
One advantage of an effective anomaly detection engine is that it's "fuzzy." It doesn't rely on the common threshold-based techniques. In fact, the engine does not need to know anything about the nature of the data, or the expected types or numbers of events. It records baseline behavior, automatically determining what is "normal." The engine subsequently looks at a dataset that contains many events and signatures, compares it to its baseline data, and determines whether variances would signal a true anomaly. If not, it will even update its baseline to allow for the variance.
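The baseline idea can be sketched in a few lines of Python. This is a deliberately simplified stand-in, not the Sumo Logic algorithm (which the article says uses no off-the-shelf techniques). It learns a running mean and variance for a single metric, flags values that deviate sharply from what it has learned, and folds benign variation back into the baseline. The class name, the parameters `alpha`, `tolerance`, and `warmup`, and the sample data are all assumptions. The sketch still needs a tolerance setting, but what counts as "normal" is learned from the data rather than written down as a fixed rule.

```python
import math


class AdaptiveBaseline:
    """Learns a running mean and variance of a metric and flags large deviations."""

    def __init__(self, alpha=0.1, tolerance=5.0, warmup=5):
        self.alpha = alpha          # how quickly the baseline adapts
        self.tolerance = tolerance  # deviation, in standard deviations, treated as anomalous
        self.warmup = warmup        # accepted observations required before flagging
        self.mean = 0.0
        self.var = 0.0
        self.seen = 0

    def observe(self, value):
        """Return True if `value` is anomalous; otherwise fold it into the baseline."""
        if self.seen == 0:
            self.mean = value
            self.seen = 1
            return False
        deviation = value - self.mean
        std = math.sqrt(self.var)
        if self.seen >= self.warmup and abs(deviation) > self.tolerance * max(std, 1e-9):
            return True  # anomaly: leave the baseline unchanged
        # Benign variance: update the exponentially weighted mean and variance.
        self.mean += self.alpha * deviation
        self.var = (1 - self.alpha) * (self.var + self.alpha * deviation ** 2)
        self.seen += 1
        return False


baseline = AdaptiveBaseline()
# Hypothetical per-minute event counts with one sudden spike.
per_minute = [100, 102, 98, 101, 99, 103, 97, 100, 500, 101]
flags = [baseline.observe(c) for c in per_minute]
```

With this sample data only the spike to 500 is flagged. The ordinary fluctuations around 100 are absorbed into the baseline, which mirrors the behavior described above: variance that does not signal a true anomaly updates the engine's notion of normal.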
The entire procedure happens in real time -- or at the speed of the data flow. It's imperative that an anomaly-detection engine detect every new event in the data stream, even benign events. The engine may detect an impending disk failure, which is an anomalous and problematic event, or it may detect signs of a benign production push, which occurs when a programmer "pushes" new code into production. In the latter case, a number of formerly unknown events may occur, and a business analyst may worry about the sudden upheaval in the everyday event stream -- when, in fact, nothing is wrong. A production push is not something to be vanquished; it's something to be documented.
The nature of benign events versus malicious events points to the critical need for real-time detection, analysis, and remediation of events in a data flow. In the case of a malicious event such as a security breach, rapid resolution is paramount. While it's prudent to take certain types of anomalies offline for resolution sometime after they were witnessed, a breach or other such severe event must be dealt with while the event is, quite literally, in progress.
The Best of Human Plus Machine
An anomaly detection engine benefits greatly from a combination of an intuitive visual interface and the expertise of the analyst interacting with the data and the interface.
The visual interface enables analysts to review anomalies in real time, investigate details of contributing events, make informed decisions with all data at their fingertips, and provide feedback to the anomaly detection engine. In short, the visual interface is the key to productivity, success, and capturing feedback for continuous improvement of the anomaly detection engine.
Anomaly detection algorithms alone can only scratch the surface. It is the combination of those algorithms and human expertise in a specific domain that is much more powerful. This is especially the case if the anomaly detection engine can capture and encode experts' feedback and use it to better detect future events and equip other users with the captured knowledge. This is particularly powerful if the expertise can be collected and applied across different domains, use cases, and individual organizations in order to reuse context and knowledge the enterprise has acquired.
Upping the Ante for Big Data Management
Just how big is the challenge for anomaly detection engines? Take a look:
- Machine logs are the output of every application, website, server, and supporting IT infrastructure component in the enterprise. The sheer volume of machine data in the enterprise is expected to grow 15X between 2013 and 2020. [See Note below]
- The anomaly detection engine must scale to handle hundreds of gigabytes or terabytes of data, and to handle sudden or unpredictable increases in data.
- The anomaly detection engine must work continuously and in real time, to handle the aggregate data output of the enterprise, heading off the risks of both benign and malicious events, either as or before they occur.
- The anomaly detection engine must be designed to discover anomalies without the benefit of having rules to guide it -- and, ideally, without the need for training users.
It's a tall order, geared towards seizing control of data that has always been there but is only now being exploited to solve some of the thorniest problems in the complex, expanding enterprise.
Note: IDC Digital Universe study, "Big Data, Bigger Digital Shadows, and Biggest Growth in the Far East," IDC and EMC Corporation, December 2012
Bruno Kurtic is the founding vice president of product and strategy at Sumo Logic. You can contact the author at firstname.lastname@example.org.