Making Real-Time Analytics a Reality
In-memory analytics processing and access to live data are finally achievable, making active live analytics a real possibility and opening up a new area of information processing and exchange.
By MC Brown, Senior Information Architect, Continuent
Analytics is evolving at an alarming rate. For many years, analytics has largely been a one-way transfer of information: data is loaded into an existing, often transactional, database, and ETL is then used to move that information into an analytics environment where it can be processed and analyzed. When data volumes were low and the time required (and expected) for data processing was measured in weeks or even days, these methods were more than adequate.
Times are changing.
Traditional ETL offers some specific advantages for the transfer of information, particularly the "T" in the acronym: transformation. ETL tools extract full datasets from the database and then offer basic transformation, which can run the gamut from simple sums and totals to complete translation into different formats or compound documents. However, as the power of data warehouse engines increases, particularly with clustered solutions such as Vertica or Hadoop, it is becoming easier and more practical simply to copy all of the information and perform the transformation within the data warehouse.
There are two drivers behind this shift. First, SQL-based tools and environments such as Hive and Impala have improved considerably, providing an SQL interface layer on top of MapReduce and the replicated storage of Hadoop. Second, live transactional databases have grown to the point where they often outgrow a single database server. Instead, data is now loaded into the analytics engine from multiple database servers based on sharding (or other splitting) mechanisms and then combined within the analytics engine.
This use of sharded information often makes loading through ETL environments more difficult, because the data may be sharded at the application level rather than the database level. That makes it harder to translate the configuration into the form the ETL application needs in order to export, transform, and load the data into Hadoop. Furthermore, the complex sharding and separation of information may be costly to import and merge during the ETL phase.
The result is that it is now easier to transfer raw transactional table data directly into your data warehouse, including Hadoop, and perform whatever transformations are required within the target analytics environment. That process can take into account the merging of data from multiple sources and, through the SQL interface, perform the joins required to combine the data into a format suitable for analytics. With suitable workflow management, the entire process can be automated and simplified.
To put it simply, Hadoop can now take raw transactional data, execute SQL statements to build a suitable view of the data, and provide analytics output. The use of raw transactional tables within the data warehouse also opens up the possibility of further queries and analytics based on the raw data without having to repeat an explicit load. For example, when looking at sales information, if you have the raw tables for prices, sales, and receipts, you can perform analytics across all three within the more powerful Hadoop environment from a single load.
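As an illustration only, a query along the following lines could combine those three raw tables directly inside the warehouse. This is a minimal sketch that assumes carbon-copy tables named sales, prices, and receipts already exist in Hive and that the open source PyHive client is available; the host, schema, and column names are hypothetical.

# Minimal sketch: joining raw transactional tables inside Hive.
# Table, column, and host names are hypothetical placeholders.
from pyhive import hive  # assumes the PyHive client is installed

conn = hive.Connection(host="hadoop-gateway", port=10000, database="analytics")
cursor = conn.cursor()

# Perform the join in the warehouse rather than transforming before loading.
cursor.execute("""
    SELECT s.sale_id,
           s.quantity * p.unit_price AS sale_value,
           r.receipt_total
    FROM   sales    s
    JOIN   prices   p ON p.product_id = s.product_id
    JOIN   receipts r ON r.sale_id    = s.sale_id
""")

for row in cursor.fetchall():
    print(row)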
Returning to the original issue, achieving this loading of transactional data into Hadoop requires a different approach from the intermittent ETL process. With the turnaround times demanded of modern analytics environments, data needs to be loaded into Hadoop continuously so that analytics can be performed daily, or even hourly, on the incoming transactional data.
One solution is to use technology that extracts information from the transaction logs of modern databases and provides a list of changes that can be replicated into Hadoop. The log information can then be translated from the transactional format into carbon-copy tables that match the original tables in the transactional database. Within a sharded environment, data can be replicated into Hadoop from each shard and then merged into a single table for use within the analytics environment.
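As a sketch of what that final merge step might look like, assume each shard has already been replicated into its own carbon-copy table in Hive; the shards can then be presented as a single table with a simple UNION ALL view. The shard, view, and host names below are hypothetical.

# Minimal sketch: merging replicated shard tables into one analytics view.
# Shard, view, and host names are hypothetical placeholders.
from pyhive import hive

SHARD_TABLES = ["orders_shard_01", "orders_shard_02", "orders_shard_03"]

union_sql = "\nUNION ALL\n".join(
    "SELECT * FROM {0}".format(shard) for shard in SHARD_TABLES
)

conn = hive.Connection(host="hadoop-gateway", port=10000, database="analytics")
cursor = conn.cursor()

# Downstream analytics can query orders_all as if it were a single table.
cursor.execute("CREATE OR REPLACE VIEW orders_all AS " + union_sql)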
There are a number of benefits to the log extraction method. One is that it is an external, lightweight process; it does not require an active connection into the database to capture the changes. Because the database engine generates the logs automatically, the information is accessible without going through the query engine of the source database. More importantly, particularly for a transactional database, extracting the data this way does not populate the RDBMS query and data caches, which would be detrimental to the performance of the source database.
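To make the idea concrete, the rough sketch below reads row changes from a MySQL binary log using the open source python-mysql-replication package; this particular illustration taps the log stream over MySQL's replication protocol rather than reading log files from disk, the connection details are placeholders, and the step that writes each change into Hadoop is omitted. A production replicator would layer translation, batching, and loading on top of a stream like this.

# Rough sketch: reading row changes from a MySQL binary log.
# Connection settings and downstream handling are hypothetical.
from pymysqlreplication import BinLogStreamReader
from pymysqlreplication.row_event import (
    DeleteRowsEvent,
    UpdateRowsEvent,
    WriteRowsEvent,
)

MYSQL_SETTINGS = {"host": "db-primary", "port": 3306,
                  "user": "repl", "passwd": "secret"}

# The reader consumes the binary log as a replication client: no queries run
# through the source's query engine, and its caches are left alone.
stream = BinLogStreamReader(
    connection_settings=MYSQL_SETTINGS,
    server_id=100,  # must be unique among replication clients
    only_events=[WriteRowsEvent, UpdateRowsEvent, DeleteRowsEvent],
    blocking=False,      # read what is currently available, then stop
    resume_stream=True,
)

for event in stream:
    for row in event.rows:
        # Each change would be appended to the matching carbon-copy
        # table in Hadoop (loading step omitted here).
        print(event.schema, event.table, row)

stream.close()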
There is another advantage to loading the transactional data this way: we also gain the history of changes to that information. Transactional databases hold only the "current" values, but by processing the log of changes, the individual data points that led to each "current" value can be tracked and identified. This provides a very powerful mechanism within the analytical process, allowing both current and historical values to be analyzed. Furthermore, we get this information for free from the replicated data, rather than having to regularly extract and load the information into the data warehouse, which is both expensive and liable to miss changes to frequently updated tables.
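As a brief sketch of what this looks like in practice, assume each replicated change lands as a row in a change-log table; the table and column names below (customer_changes with customer_id, balance, and change_time) are hypothetical. The current value is simply the latest change per key, while the full history is already sitting in the same table.

# Minimal sketch: current values vs. full history from a change-log table.
# Table, column, and host names are hypothetical placeholders.
from pyhive import hive

conn = hive.Connection(host="hadoop-gateway", port=10000, database="analytics")
cursor = conn.cursor()

# "Current" value: the most recent change for each customer.
cursor.execute("""
    SELECT customer_id, balance
    FROM (
        SELECT customer_id, balance,
               ROW_NUMBER() OVER (PARTITION BY customer_id
                                  ORDER BY change_time DESC) AS rn
        FROM customer_changes
    ) latest
    WHERE rn = 1
""")
current_values = cursor.fetchall()

# Historical values: every change is already in the replicated table.
cursor.execute(
    "SELECT customer_id, balance, change_time FROM customer_changes"
)
history = cursor.fetchall()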
The resulting process of loading information into your data warehouse also enables more frequent analytics, shrinking turnaround times from weeks or days to something approaching a live view of the data. That, in turn, makes the next leap in data analytics easier to achieve. With the advent of in-memory analytics processing, such as Storm and Spark, access to live data, rather than information loaded days or even weeks beforehand, is finally achievable. This makes active live analytics a real possibility, and that opens up a new area of information processing and exchange.
MC Brown is a senior information architect at Continuent, a provider of open source database clustering and replication solutions. To learn more, contact Continuent at [email protected]
or visit http://www.continuent.com.