Q&A: Splunk Helps Users Mine Insights from Machine Data
Unstructured data generated by machines is one of the fastest-growing and most pervasive areas of big data, and one of the most valuable, explains Raanan Dagan of Splunk
- By Linda L. Briggs
- March 19, 2013
Earlier this year, Fortune magazine cited Splunk for "pulling off one of 2012's strongest IPOs," pointing out that customers include Facebook, Staples, and the U.S. Defense Department. In this interview, TDWI posed questions about machine data and operational intelligence to Raanan Dagan, who is responsible for big data product marketing at Splunk, helping customers harness machine-generated big data to gain operational intelligence. Prior to Splunk, Dagan worked at Cloudera as a Hadoop sales engineer, and at Oracle. He is a certified Hadoop developer and administrator.
BI This Week: How big a challenge is posed by unstructured data generated by machines?
Raanan Dagan: Unstructured data generated by machines, or "machine data," is one of the fastest-growing and most pervasive areas of big data. It's also one of the most valuable. Machine data can contain a definitive record of user behavior, service levels, cyber-security risks, fraudulent activities, and more.
Collecting, processing, analyzing, and extracting value from machine data presents real challenges. Existing data analysis, management, and monitoring solutions are simply not engineered for this high-volume, high-velocity, and highly diverse data. Emerging open source technologies can provide inexpensive batch storage but require extensive and time-consuming integration with other open source projects and highly specialized skill sets to deploy.
What issues does machine data present beyond the oft-cited big data challenges of velocity, variety, and volume?
First, note that machine data can be generated by all technology infrastructures -- physical, virtual, and cloud. This includes websites, applications, servers, network devices, hypervisors, sensors, and mobile devices.
Issues associated with harnessing machine data include:
- Machine data is generated by a multitude of disparate sources; correlating meaningful events across these is complex
- The data is unstructured and difficult to fit into a pre-defined schema
- Machine data is high-volume and time-series based, requiring new approaches for management and analysis
- The most valuable insights from this data are often needed in real time
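The second challenge above, fitting unstructured data into a schema, is often handled with a "schema-on-read" approach: instead of forcing events into a rigid schema at load time, a pattern is applied when the data is queried. A minimal Python sketch of the idea, using a hypothetical web server log line (the format and field names are illustrative, not any particular product's):

```python
import re

# Hypothetical web server log line; real machine data formats vary widely.
line = '203.0.113.7 - alice [19/Mar/2013:10:15:32 +0000] "GET /cart HTTP/1.1" 200 512'

# Schema-on-read: apply a pattern at query time instead of forcing
# the data into a rigid schema at load time.
pattern = re.compile(
    r'(?P<ip>\S+) \S+ (?P<user>\S+) \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d+) (?P<bytes>\d+)'
)

fields = pattern.match(line).groupdict()
print(fields["user"], fields["path"], fields["status"])
```

The same raw event can be read with different patterns for different questions, which is what makes this approach a fit for diverse, fast-changing machine data.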
What sorts of insights can machine data generate? What kinds of information do companies want to pull from their machine data, especially in real time?
Here are some of the most important machine data sources and what they can tell us:
- Application logs are critical for day-to-day debugging of production applications by developers and application support. They're also often the best way to report on business and user activity and to detect fraud scenarios because they have all the details of transactions.
- Business process logs will generally include definitive records of customer activity across multiple channels such as the Web, contact center, or retail. They likely include records of customer purchases, account changes, and trouble reports.
- Call detail records contain useful details of the call or service that passed through the switch, such as the number making the call, the number receiving the call, call time, call duration, and the type of call. The data they contain is critical for billing, revenue assurance, customer assurance, partner settlements, marketing intelligence, and more.
- Clickstream data provides insight into a user's website and web page activity. This information is valuable for usability analysis, marketing, and general research. Formats for this data are non-standard and actions can be logged in multiple places, such as Web servers, routers, and proxy servers.
- Supervisory control and data acquisition (SCADA) refers to a type of industrial control system (ICS) that gathers and analyzes real-time data from equipment in industries such as energy, transport, oil and gas, water, and waste control. These systems produce significant quantities of data about the status, operation, use, and communication of components. This data can be used to identify trends, patterns, and anomalies in the SCADA infrastructure and is used to drive customer value. For example, smart-grid meter data can be captured so customers can become better informed about their electricity use through tools, programs, and services, thus helping them save energy and money and reduce their environmental footprint.
- Sensors generate data based on monitoring environmental conditions, such as temperature, sound, pressure, power, and water levels. This data can have a wide range of practical applications, including water-level monitoring, machine health monitoring, and smart home monitoring.
- System logs from routers, switches, and network devices record the state of network connections, failures of critical network components, performance, and security threats. Tapping into this data means tapping into a wide variety of devices for troubleshooting, analysis, and security auditing.
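To make the clickstream case concrete, a common analysis is sessionization: grouping a user's clicks into visits separated by an inactivity timeout. This is a minimal sketch under assumed inputs (timestamped user/URL events and a 30-minute timeout), not a description of any vendor's implementation:

```python
from datetime import datetime, timedelta

# Hypothetical clickstream events: (user, timestamp, url).
events = [
    ("alice", datetime(2013, 3, 19, 10, 0), "/home"),
    ("alice", datetime(2013, 3, 19, 10, 5), "/cart"),
    ("alice", datetime(2013, 3, 19, 11, 30), "/home"),  # gap > timeout: new session
    ("bob",   datetime(2013, 3, 19, 10, 1), "/search"),
]

TIMEOUT = timedelta(minutes=30)

def sessionize(events):
    """Group each user's clicks into sessions; a gap over TIMEOUT starts a new one."""
    sessions = {}
    last_seen = {}
    for user, ts, url in sorted(events, key=lambda e: (e[0], e[1])):
        if user not in last_seen or ts - last_seen[user] > TIMEOUT:
            sessions.setdefault(user, []).append([])
        sessions[user][-1].append(url)
        last_seen[user] = ts
    return sessions

print(sessionize(events))
```

Session counts and paths like these feed the usability, marketing, and research questions mentioned above.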
What are some of the limitations of traditional BI, data warehousing, and Web analytics tools when it comes to machine data?
Consider traditional information management systems such as business intelligence and data warehouse tools. These systems are batch-oriented and designed for structured data with rigid schemas. IT management and security information and event management tools, on the other hand, provide a very narrow view of the underlying data and are hard-wired for specific data types and sources.
The definition of "real time" seems to be evolving. Is that true, and if so, what is Splunk's definition of real time?
The notion of real time is a foundational concept in the Splunk platform. It spans all core functions of the product: collecting, indexing, searching, correlating, analyzing, and visualizing a live stream of data. This means users can gain a deeper understanding from a live stream of machine data, reveal important and timely patterns by correlating events from many sources, and reduce the time to detect important events. They can also combine live feeds with historical data to find trends and anomalies, provide ad hoc reports, answer questions, and add new data sources.
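One building block behind real-time views like these is a sliding time window over a live event stream. The sketch below is a generic illustration of that idea (synthetic timestamps stand in for a live feed; this is not Splunk's internal mechanism):

```python
from collections import deque
import time

class SlidingWindowCounter:
    """Count events seen in the last `window` seconds of a live stream."""

    def __init__(self, window=60.0):
        self.window = window
        self.times = deque()

    def record(self, ts=None):
        ts = time.time() if ts is None else ts
        self.times.append(ts)
        self._expire(ts)

    def count(self, now=None):
        self._expire(time.time() if now is None else now)
        return len(self.times)

    def _expire(self, now):
        # Drop timestamps that have fallen out of the window.
        while self.times and now - self.times[0] > self.window:
            self.times.popleft()

# Feed synthetic timestamps instead of a live feed for the sketch.
c = SlidingWindowCounter(window=60)
for t in (0, 10, 30, 65, 70):
    c.record(t)
print(c.count(now=70))  # events at t=10, 30, 65, 70 fall within the last 60s
```

A dashboard that refreshes such counters per data source is one simple form of the live correlation described above.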
What are some use cases for operational intelligence?
Operational intelligence has many critical uses across IT and the business:
- Application management: Provide end-to-end visibility across distributed infrastructures; troubleshoot across application environments; monitor for performance degradation; trace transactions across distributed systems and infrastructure
- Security and compliance: Provide rapid incident response, real-time correlation, and in-depth monitoring across data sources; conduct statistical analysis for advanced pattern detection and threat defense
- Infrastructure and operations management: Proactively monitor across IT silos to ensure uptime; rapidly pinpoint and resolve problems; report on SLAs/track SLAs of service providers
- Web and business analytics: Gain visibility and intelligence on customers, services, and transactions; identify trends and patterns in real time; fully understand the impact of new product features on back-end services
- Development: Accelerate development and test cycles; support advanced development methodologies (such as agile and continuous); integrate enterprise applications with APIs; build enterprise-class applications that leverage Splunk software
Fortune magazine cited Splunk in February for "pulling off one of 2012's strongest IPOs," and pointed out that customers include Facebook, Staples, and the U.S. Defense Department. What does Splunk's technology bring to this discussion?
Splunk focuses on the challenges and opportunities of effectively managing machine data and unlocking its largely untapped value. The Splunk Enterprise platform collects, analyzes, and visualizes machine data, whether it's generated by IT systems or infrastructure, sensors in a manufacturing facility, RFID tags on sensitive assets, or events from mechanical or security systems.
Integrated, end-to-end, and real-time, our software provides a unified way to organize and extract actionable insights from massive amounts of machine data. There are no databases to limit scalability, and no rigid database schemas to limit flexibility. You can index the unstructured machine data and scale Splunk across low-cost commodity servers, or index terabytes per day and search months or even years of data in seconds.
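The core data structure behind fast search over indexed events is an inverted index, which maps each token to the events that contain it. A toy sketch of that idea only (a real indexer also handles time ranges, compression, and distribution across servers, none of which is shown here):

```python
from collections import defaultdict

# Toy event set; real machine data would be far larger and messier.
events = [
    "ERROR payment gateway timeout",
    "INFO user login succeeded",
    "ERROR disk quota exceeded",
]

# Inverted index: token -> set of event positions containing it.
index = defaultdict(set)
for i, event in enumerate(events):
    for token in event.lower().split():
        index[token].add(i)

def search(term):
    """Return the events containing `term`, case-insensitively."""
    return [events[i] for i in sorted(index.get(term.lower(), []))]

print(search("ERROR"))
```

Because lookups go through the index rather than scanning raw data, query time stays small even as the volume of indexed events grows.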