LESSON - Improve Your Analysis Capabilities
By Craig Abramson, Senior Technical Analyst, Syncsort Incorporated
Today’s businesses are accumulating ever-increasing volumes of data from a variety of different sources at record paces. For example, a successful Web site alone can generate over a billion clickstream records a day. Now add to this the data generated daily from high-volume transaction processing, coupled with detailed historical transactions already stored in a data warehouse. Then there’s the inventory and billing data, detailed financial information, customer data, and much more. While the growth of the data results in longer processing time, increased hardware costs, and more administrative requirements, demands on the data have also been increasing. This lesson describes two ways in which you can prepare your data in order to meet these demands: by performing data transformations and utilizing metadata.
Transform Your Data for Faster Queries
Transforming your data makes it easier and more efficient for you to directly access just the information that you need, minimizing the elapsed time of your queries. There are three different levels in which the data transformation processes can take place in order to reduce the elapsed times of your applications and speed the analysis of information:
Source level operations. At this level, you can convert database tables to flat files and vice versa.
Record level operations. This would encompass such processes as joins, sorts, merges, or just copying records to the appropriate target(s). Before outputting the records, they can then be filtered or reformatted for faster access. You can also perform aggregations at this level. Aggregates are one of the best ways to speed warehouse queries. A query answered from base-level data can take hours and involve millions of data records and millions of calculations. With precalculated aggregates, the same query can be answered in seconds with just a few records and calculations.
Field level operations. This includes data type and format conversions, arithmetic operations, string operations, date-time operations, pattern matching, and conditional operations.
These kinds of transformations allow you to cleanse data, create business rules for data quality, and concentrate on only the information you need. The data can then be loaded into a data warehouse or other data-intensive application for analysis. Additional processing may have to be completed in order to move specific data to the appropriate data mart.
Utilize Metadata to Quickly Find the Information You Need
Another way to minimize the elapsed times of your queries is to use metadata. Metadata describes what information is contained in your data. It details the context, structure, and location of the data, making it much more manageable. This may not sound like an important feature when you don’t have a large amount of data… but when the gigabytes start adding up, metadata helps you quickly find and utilize the data you need for your specific applications. To highlight the benefit that metadata provides in this type of environment, imagine that you’re in a large warehouse that’s filled with wooden crates. None of the crates have labels. Now imagine that your boss asks you to find and collect all the crates that contain parts relating to the production of widgets. This would be a daunting and time-consuming task for anyone. But what if you approached the same task with crates that are all clearly labeled? You can imagine how much easier the labels would make your job and how much faster you could get it done. That’s one of the advantages that metadata provides.
Metadata is also critical for data integration, which is the process of accumulating and combining data sets from disparate sources in various locations. It helps you identify the data sets so that, once again, you are only collecting the information you need. Data integration then incorporates the ETL process—the sequence of applications that extract data from various sources, bring them to a data staging area, and apply a sequence of processes to prepare the data for migration into the data warehouse and the actual loading process. After the data is extracted, a number of transformations may be applied in preparation for data consolidation and subsequently loaded into data marts, data warehouses, or dimensional data structures used for decision support or business intelligence systems. Metadata then provides the contextual information you need to effectively analyze the data.
Metadata can be located either within the data warehouse or scattered throughout a network. Source metadata is located within the data warehouse and includes all information about the source data. It can provide such details as:
- Record layouts. Each record from the source is made up of a series of fields that contain information on a particular item. The pattern of fields, together with the data types and lengths of the individual fields, is termed the record layout.
- Business rules. The metadata layer of the data warehouse enforces information consistency by allowing data to be defined in business terms, as opposed to using the database jargon. Rules that specify how business terms are determined or calculated, i.e., the business rules, are also defined within the metadata layer.
- Derived data. The transformation definitions derived from the contents of a source field in the records and/or from constants are referred to as derived data or derived values. Derived values provide a mechanism to simplify and reuse transformations defined on the source data.
- Customized collating sequences. To order character values according to a sequence other than one of the standard collating sequences, you have to define the rules that detail how character values are to be ordered. These rules describe a special collating sequence that defines the order in which single characters, and possibly double characters, collate.
Metadata makes data more manageable.
Speed Your Backups with Metadata
The use of metadata is also emerging in the area of backup and recovery strategies. For instance, on NetWare versions 6.5 and higher, you can back up data using a pool snapshot, which is a point-in-time metadata copy of a data pool. In addition to increasing the speed of backup jobs, pool snapshots can eliminate job disruptions caused by failed attempts to back up open files. Once a snapshot is taken, the original data pool is immediately available to users, instead of being unavailable for the length of the backup job. Scripts can be run before and after the snapshot to change the state of a database, gather information at the time of a snapshot or job, or perform other manual tasks.
In order to minimize the elapsed times of your applications and speed queries, you’ll need the right software to perform the necessary data transformations, process metadata, and quickly back up your critical information using snapshot technology. It’s important to test the various tools in your environment and take advantage of any proof-of-concept offers that are available. This will ensure that you make the right choice and reach the full potential that your data has to offer. The performance advantages that you’ll gain with the right tools will also give you more time to run additional applications and leverage your existing hardware investment.