Why Data Agility is a Key Driver of Big Data Technology Development
Hadoop and Apache Drill can help you guide your organization's agility towards real-time business impact.
By Jim Scott, Director of Enterprise Strategy and Architecture, MapR Technologies
As technology advances at breakneck speed, our lives are becoming increasingly digitized. From Twitter feeds to sensor data to medical devices, companies are drowning in big data yet starving for actionable information. Most likely, you've heard a lot of talk about the volume, variety, and velocity of big data and how challenging it is to keep up with that explosion of data.
For many enterprises, the ability to collect data has surpassed the ability to organize it quickly enough for analysis and action. Executives, IS staff, and analysts alike have been frustrated with rigid traditional processes that require a series of steps before data is ready for analysis. Relational databases and data warehouses have served businesses well for collecting and normalizing relational data from point of sale (POS), ERP, CRM, and other sources where the data format and structure are known and don't change frequently. However, the relational model and the process of defining schema in advance cannot keep pace with the rapidly evolving variety and format of data.
Sometimes an analyst just wants to start playing with data to understand what's in it and what new insights it can reveal before the data is modeled and added to the data warehouse schema. Sometimes you're not even sure what questions to ask. Requiring up-front modeling drives up the cost of using traditional relational databases and data warehouses because DBA resources are required to flatten, summarize, and fully structure the data, and waiting on that DBA work can delay access to new data sources. Legacy databases are simply not agile enough to meet the growing needs of most organizations today.
What is Data Agility and Why is it Important?
Hadoop has become a mainstream technology for storing and processing huge amounts of data at a low cost, but now the conversation has pivoted. These days, it's not about how much data you can store and process. Instead, it's about data agility, meaning how fast can you extract value from your mountains of data and how quickly can you translate that information into action? After all, you still need someone to apply structure or schema to the data before it can be analyzed. Just because you can get data into Hadoop easily doesn't mean an analyst can easily get it out.
Executives want their teams to focus on business impact, not on how they should store, process, and analyze their data. How does the ability to process and analyze data impact their operations? How quickly can they adjust and respond to changes in customer preferences, market conditions, competitive actions, and operations? These questions will direct the investment and scope of big data projects in 2015 as enterprises shift their focus from simply capturing and managing data to actively using it.
This concept can be applied not just to your big data infrastructure; it can be applied across all business activities, from risk management to marketing campaigns to supply chain optimization.
When the concept of data agility was first talked about, the discussion centered on an organization's ability to quickly gather business intelligence. However, the concept of data agility can also apply to data warehouse architecture. With traditional data warehouse architectures based on relational database systems, the data schema has to be carefully designed and maintained. If the schema must be changed, it can sometimes take up to a year to make the change to an RDBMS. Even the process of extracting data from a data store and loading it into a data warehouse can take an entire day before it's available to be analyzed.
With Hadoop, storing a new kind of data doesn't mean having to redesign the schema. It's as simple as creating a new folder and moving the new type of files to that folder. By using Hadoop for storing and processing data, teams can develop products in a much shorter timeframe.
The Real Hindrance to Data Agility
Traditional databases require a schema before data can be written. Couple that with the time needed to get the data into the database and the process can no longer be considered agile. Worse yet, there are times when DBAs must perform complicated procedures that require dropping foreign keys, exporting data, altering table designs, and even reloading data in a specific order to satisfy the table design. Some big data technologies such as Apache Hive get around schema-on-write but still require a schema to be defined before users can ask the very first question.
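To make the schema-on-write bottleneck concrete, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for any relational database. The `events` table, its columns, and the `load_events` helper are hypothetical names chosen for illustration. The point is only that a record carrying a new field is rejected until someone alters the table design and reloads the data.

```python
import sqlite3


def load_events(conn, events):
    """Insert events into the fixed-schema table; return the ones rejected."""
    rejected = []
    for event in events:
        columns = ", ".join(event)
        placeholders = ", ".join(f":{name}" for name in event)
        try:
            # Every key in the event must already exist as a table column.
            conn.execute(
                f"INSERT INTO events ({columns}) VALUES ({placeholders})", event
            )
        except sqlite3.OperationalError:
            # Raised when the event names a column the table does not have.
            rejected.append(event)
    return rejected


conn = sqlite3.connect(":memory:")
# Schema-on-write: the table design must exist before any data arrives.
conn.execute("CREATE TABLE events (user_id INTEGER, action TEXT)")

events = [
    {"user_id": 1, "action": "login"},
    {"user_id": 2, "action": "purchase", "device": "mobile"},  # new field
]

rejected = load_events(conn, events)

# The fix is a schema change followed by a reload of the rejected data.
conn.execute("ALTER TABLE events ADD COLUMN device TEXT")
rejected_after_alter = load_events(conn, rejected)
row_count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
```

In a production warehouse the `ALTER TABLE` step is where change requests, DBA time, and reload ordering come into play.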
You Will Know Data Agility When You See It
New data discovery and data exploration technologies are being developed to provide greater flexibility. Apache Drill is a great example of "the" business enabler for data agility. Inspired predominantly by Google's Dremel, Apache Drill is an open source, low-latency SQL query engine for Hadoop and NoSQL that can query across data sources. It can handle flat fixed schemas and is purpose-built for semi-structured/nested data.
What does it mean to be "the" business-enabling technology? Think real-time business intelligence. Drill is opening the door to this inevitable future of shortened cycle times for data processing to support faster responses to opportunities and threats. Ultimately, the faster you can ask a question and get the right answer, the better for your business.
Drill implements schema-on-the-fly. This means that when a new data format arrives, nothing has to be done to be able to process the data with Drill. No DBAs are required to build and maintain schema designs. Commercial off-the-shelf business intelligence tools can communicate with Drill because Drill implements standards. It is ANSI SQL:2003-compliant and ships with JDBC and ODBC drivers. The business doesn't have to adopt new tools to work with all the data from all data sources.
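The idea behind schema-on-the-fly can be illustrated with a small Python sketch. This is not Drill's implementation or API; `scan`, `select`, and the sample records are hypothetical, and the code only shows the general principle that column names can be discovered while the data is being read, so records with new fields can be queried without any prior definition.

```python
import json

# Three records whose structure changes over time; no schema is declared.
RAW_LINES = [
    '{"user_id": 1, "action": "login"}',
    '{"user_id": 2, "action": "purchase", "amount": 19.99}',
    '{"user_id": 3, "action": "purchase", "amount": 5.0, "device": "mobile"}',
]


def scan(lines):
    """Parse each record and collect column names in order of first appearance."""
    records = [json.loads(line) for line in lines]
    columns = []
    for record in records:
        for key in record:
            if key not in columns:
                columns.append(key)
    return columns, records


def select(records, columns, where=lambda record: True):
    """Project the requested columns; a field missing from a record becomes None."""
    return [
        tuple(record.get(column) for column in columns)
        for record in records
        if where(record)
    ]


columns, records = scan(RAW_LINES)
purchases = select(
    records,
    ["user_id", "amount", "device"],
    where=lambda record: record["action"] == "purchase",
)
```

Here `columns` ends up as `user_id`, `action`, `amount`, `device`, and the older purchase record simply reports no value for `device`.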
Of course, for any new technology, an opposing view can always be considered. The question that may arise is: What innovations are fueling the need for these new technologies? The dominant change in the industry is the growing use of data interchange formats such as JSON. Data from applications that publish in JSON does not require a DBA to structure it on arrival because it shows up already structured, thus eliminating the personnel and process bottleneck.
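As a rough illustration of what "already structured" means, the sketch below parses a nested JSON document and flattens it into dotted column names using only the Python standard library. The `flatten` helper and the sample order payload are hypothetical; the field names and nesting come entirely from the document itself rather than from a predefined table design.

```python
import json


def flatten(value, prefix=""):
    """Flatten nested objects into dotted column names; lists are kept as values."""
    flat = {}
    for key, item in value.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(item, dict):
            flat.update(flatten(item, name))
        else:
            flat[name] = item
    return flat


payload = (
    '{"order_id": 42,'
    ' "customer": {"name": "Ada", "address": {"city": "Chicago"}},'
    ' "items": ["pen", "ink"]}'
)

row = flatten(json.loads(payload))
```

After this runs, `row` has the keys `order_id`, `customer.name`, `customer.address.city`, and `items`, all taken from the incoming data.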
Drill fuels data agility by allowing users to perform self-service data ingestion and data source management, whether due to adding a new data source or adapting for a change in the incoming data structure.
Agility in Your Enterprise
Data agility should be an important aspect of all your big data initiatives in the future. Individuals can analyze and explore data directly. Self-service data exploration eliminates the dependency on IT to set up data definitions and structures, and frees up IT staff to perform more valuable and leveraged activities.
By implementing agile technologies such as Hadoop and Apache Drill into your enterprise and existing data management and analytics capabilities, you'll be able to guide your organization's agility towards real-time business impact.
Jim Scott is the director of enterprise strategy and architecture at MapR Technologies. Jim has held positions running operations, engineering, architecture, and QA teams. He is the cofounder of the Chicago Hadoop Users Group (CHUG), where he has coordinated the Chicago Hadoop community for the past five years. Jim has worked in the consumer packaged goods, digital advertising, digital mapping, chemical, and pharmaceutical industries and has built systems that handle more than 50 billion transactions per day. Jim's work with high-throughput computing at Dow Chemical was a precursor to more standardized big data concepts such as Hadoop. You can contact the author at firstname.lastname@example.org.