LESSON - Introducing a Data Warehouse for Event Data
How a columnar-based data warehouse offers superior price and performance optionsfor event data
By Ed Chopskie, Vice President of Marketing, SenSage, Inc.
Event data, once commonly referred to as “audit trails,” is data or a set of records sequenced chronologically. Event data contains evidence directly pertaining to and resulting from the execution of a business process or system function. Common event data captured and retained by organizations includes, for example, records resulting from activities such as business transactions or communications by individuals, systems, accounts, or other entities. Business transaction examples include event records created from banking transactions, updates to shipping status, historical prices, and radio frequency identification records. Examples of communication records include call detail records of telephony and internet traffic/transaction data by governments and commercial organizations.
Long-term retention of these types of records is often required to detect fraud or to analyze performance trends. In the case of communication records, the retention is mandated by government regulations for access by law enforcement agencies. One such communication regulation is Directive 2006/24/EC, which was passed by the member countries of the European Union in March 2006 and requires the member states to ensure that communications providers must retain, for a period of time between six months and two years, necessary data as specified in the directive.
Regardless of the source of event data, the data shares common characteristics. Event data is voluminous, and current implementations have broken through hundreds of terabytes and are approaching a petabyte. Event data is written once and never updated, as audit trails must never be modified; as a result, the use of data warehouse solutions built on relational database technologies originally designed for supporting OLTP is extremely efficient. Additionally, event data is always inserted and later searched on the basis of time, introducing storage and querying challenges that most relational databases do not easily support. Finally, event data is typically “flat” with many distinct columns and not subject to normalization.
Because of the volume and nature of the data being stored, traditional database warehouses quickly become impractical to use for event data for a number of reasons, including performance and cost.
The purpose of this lesson is to introduce a new type of data warehouse, an event data warehouse (EDW), built on a patented columnar database, to provide superior performance for processing event data—at a cost that is an order of magnitude less than traditional solutions.
Introduction to Columnar Databases
Before describing how the SenSage EDW provides a complete solution for collecting, storing, and analyzing event data, a brief description of columnar databases is required. This introduction will help differentiate the architectural differences between columnar databases and relational databases for the purpose of illustrating how a columnar organization is superior in performance and price for event data; it is not meant to be exhaustive.
A columnar database organizes data by columnar, rather than row, format used by relational database management systems. While the difference may sound trivial, a columnar architecture provides distinct advantages for certain classes of data, including event data. Data for each column is stored together, and this provides performance gains by allowing queries to reference only the data selected. Indexes are unnecessary in columnar databases, as each column is actually an index, thus significantly reducing the storage and maintenance requirements of relational databases. Additionally, data in columns provides a massive opportunity for data compression as the data in columns is similar, unlike the compression of an entire row with different data types.
Columnar databases are not new; several have existed on the market for some time in relative obscurity compared to relational databases.
Figure 1: Components included in the SenSage EDW architecture
The SenSage EDW Solution
While SenSage is built on a patented1 columnar database, the entire solution provides all components required for event data warehousing, including the ETL and analytics layer. Additionally, the SenSage EDW supports open access from a number of methods including SQL, Perl DBI, and JDBC.
While the SenSage EDW is practically a turnkey solution for event data requiring very little knowledge of columnar database administration, the underlying architecture is extremely advanced, utilizing a distributed, clustered, share-nothing architecture built from the ground up for event data.
The SenSage EDW is a software solution for deployment on commodity server “nodes.” This deployment model allows the SenSage ETL data loader to spread data across nodes allowing for record insertion rates of hundreds of thousands of records per second. As data is loaded, it is compressed at a typical rate of 10:1. Data queries are also distributed across the data warehouse nodes, providing increased performance. As columnar database, the SenSage EDW does not require indices but provides an advanced querying technique, known as Bloom Filters, to determine if the data required to satisfy a query exists in trees on the nodes. Using this combination of advanced features, the solution has been deployed and proven to be capable of querying over 30 million records a second, yielding query results in minutes. This performance is typically an order of magnitude faster than relational database management system capabilities.
Nodes can easily be added to the EDW cluster as data storage and query requirements increase. Additionally, since the SenSage EDW is delivered as software and not as a proprietary appliance, organizations can take advantage of performance increases in hardware and storage as they become available. The SenSage EDW supports storage of event data locally on the nodes themselves or via SAN/NAS solutions.
While traditional data warehouse solutions based on relational databases are very often the most practical solution for most business requirements, they are ill equipped to support the unique data characteristics of event data. Utilizing traditional solutions for event data is possible but at a price and performance differential that cannot be justified. Columnar databases are well suited to processing event data, and there are many columnar database management systems on the market, including open source solutions such as C-Store.
The SenSage EDW takes advantage of a columnar architecture but expands the solution by providing an architecture that provides a turnkey approach to managing event data:
- It includes the ETL applications within the product. Administrators can quickly map their event data sources without knowledge of the underlying table format or database administration skills. There is no requirement to obtain third-party ETL tools or expertise.
- SenSage provides analytics and querybuilding wizards that do not require users to have any knowledge of SQL or any other programming languages. Other access methods to the data are supported, but most users do not find the need to implement third-party query tools.
- It is deployed as software and takes advantage of implementations on commodity clustered server solutions. Customers are not locked into proprietary appliance-based solutions and therefore can easily increase performance by utilizing new technologies as they become available.
SenSage has been sucessfully deployed at more than 300 sites. With well over a petabyte under management, the SenSage EDW commonly supports implementation of more than 100 terabytes, with some scheduled to reach a petabyte at current growth rates. Unlike emerging or open source solutions, the SenSage EDW is a proven and supported solution used worldwide by Global 2000 organizations.
The SenSage EDW is a powerful and economic solution for event data when compared to traditional data warehouses.