RESEARCH & RESOURCES

Preparing Data for Analytics: Making it Easier and Faster

Advances in data preparation and integration will have a major impact on BI, visual analytics, and data discovery. Here are three points to consider as you evaluate and deploy these recent software entries.

As users seek to go beyond canned reports, dashboards, and spreadsheets to employ sophisticated visual analytics to drive decisions and actions, traditional processes for preparing data are under pressure. These include steps for data quality and profiling, data transformation, and other forms of enrichment.

To perform business-driven analytics, users want flexibility in data preparation; they don't want to wait for long cycles of extraction, transformation, and loading (ETL) only to gain access to a limited selection in the data warehouse. In today's brave new big data world full of Hadoop clusters and nontraditional data types, users want to explore it all with less restraint and more self-service.

The established world of ETL and data integration is thus in the midst of a shakeup. Innovators are coming out of Google, Facebook, and leading universities to launch new companies with the backing of top-flight venture capitalists (VCs). Traditional vendors have had to adjust quickly and introduce solutions that are geared more to ad hoc, on-the-fly data integration, transformation, quality improvement, and other preparation for analytical activities such as blending internal and external data to gain insights into competitive pricing. Newer solutions employ machine learning and other advanced analytics to enable users to learn about the data faster, with algorithms for finding relevant data relationships and anomalies.

As always happens in our industry, new technologies bring new terminology with them to draw distinctions from the old. Rather than ETL and data integration, the latest data preparation and integration technologies apply terms such as "data blending," "data munging," and "data wrangling." Although the vendors use them somewhat differently, the terms generally stand for easier and faster data preparation and integration of a wider range of sources, usually through automated processes driven by advanced analytics.
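
As a rough illustration of what "munging" or "wrangling" typically involves, the following Python sketch uses pandas to clean a small, messy source before analysis. The column names and values are hypothetical.

import pandas as pd

# A hypothetical, messy source: missing keys, duplicates, and
# inconsistently formatted values.
raw = pd.DataFrame({
    "customer_id": ["001", "002", "002", None],
    "revenue": ["$1,200", "950", "950", "3,400"],
    "region": ["NE ", "ne", "ne", "West"],
})

clean = (
    raw.dropna(subset=["customer_id"])   # drop rows missing a key
       .drop_duplicates()                # remove exact duplicates
       .assign(
           revenue=lambda df: df["revenue"]
               .str.replace(r"[$,]", "", regex=True)
               .astype(float),           # "$1,200" -> 1200.0
           region=lambda df: df["region"].str.strip().str.upper(),
       )
)
print(clean)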

Hiding Complexity

Being able to integrate and prepare a wider variety of data types is a major distinction between the newer solutions and the old. Both inexperienced and expert analysts today increasingly want to blend views of disparate data types, including geospatial, text, and demographic data, with their more traditionally structured transactional data. These nonstandard data types are often voluminous, varied, and messy; to gain business value from them sooner, manual work must be replaced by automated methods.
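
For example, blending demographic attributes into transactional records can be as simple as a join on a shared key. The pandas sketch below is illustrative only; the sources and columns are hypothetical.

import pandas as pd

# Hypothetical transactional and demographic sources.
transactions = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "zip_code": ["02139", "02139", "94103", "73301"],
    "amount": [120.0, 75.5, 310.0, 42.0],
})
demographics = pd.DataFrame({
    "zip_code": ["02139", "94103"],
    "median_income": [89000, 112000],
})

# A left join keeps every transaction and enriches it with the
# demographic attributes for its ZIP code where available.
blended = transactions.merge(demographics, on="zip_code", how="left")
print(blended)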

Although experienced data scientists and analysts may still prefer to get their hands dirty and write code to analyze the data based on intimate knowledge of the sources, most users need automation to run queries and models against what could be petabytes of highly varied data.

To hide the complexity of selecting, blending, and accessing data sources, many of the newer tools provide graphical user interfaces of their own or the ability to embed icons in leading business intelligence and visual analytics solutions. Users can work with icons rather than code to perform data mashups, set filters, or create custom data blends for their immediate analytic needs. The tools are thus fueling the trend toward self-service data integration, taking tasks out of the hands of IT to enable business analysts and other nontechnical users to work on their own to develop variables, build models, or query sources to find data patterns and correlations. (For an in-depth discussion of data blending, see Fern Halper's TDWI Checklist report, Seven Keys to Data Blending.)

Of course, much of this innovation is aimed at enabling organizations to gain more value from the growing "lake" of data stored in Hadoop clusters. Organizations need tools geared to the "schema-on-read" style of data analysis prevalent with Hadoop, where schema, transformation, and other steps are applied to data when it is accessed rather than when it enters the system, as is typical with traditional BI and data warehousing. Because no one vendor is entrenched as the market leader for data preparation on Hadoop, the new firms and their VC backers see a major opportunity.
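
To make the distinction concrete, here is a minimal schema-on-read sketch in PySpark. It assumes a local Spark installation; the file path, field names, and values are hypothetical. The raw JSON lands untyped, and a schema is declared only when the data is read for analysis.

import json
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

# Raw events land as-is; no schema is enforced when the data arrives.
with open("/tmp/events.json", "w") as f:
    f.write(json.dumps({"user": "a123", "amount": "19.99"}) + "\n")
    f.write(json.dumps({"user": "b456", "amount": "5.00"}) + "\n")

# The schema is declared at read time, when the analytic question
# determines which fields matter and how they should be typed.
schema = StructType([
    StructField("user", StringType()),
    StructField("amount", StringType()),
])
events = spark.read.schema(schema).json("/tmp/events.json")
events.selectExpr("user", "CAST(amount AS DOUBLE) AS amount").show()
spark.stop()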

Many of the new data preparation software providers are led by technologists with deep experience in using Hadoop, MapReduce, Spark, and related Apache open source technologies. Offloading of ETL jobs to cheaper Hadoop systems has already been growing and will likely accelerate as Spark and commercial SQL-on-Hadoop options mature. Over time, these trends will make it easier for organizations to view Hadoop as an appropriate platform for a greater share of their data preparation, enrichment, and integration tasks. (For more on ETL and Hadoop, see Philip Russom's recent article, Can Hadoop Replace My ETL Tool?.)

With that landscape in mind, here are three points to consider as you evaluate and deploy these new software entries.

#1. Ensure good governance. One potential danger of breaking away from IT control and increasing users' self-service in data preparation is that proper data governance can become more difficult. Data preparation and integration tools increasingly provide data lineage tracking capabilities, which can help here. Users and IT should work together to set rules and ensure that they are followed.
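
Commercial tools capture lineage automatically, but the underlying idea is simple: record each derived dataset's sources and the transformation steps applied to it. The Python sketch below is purely illustrative; the dataset and source names are hypothetical.

from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class LineageRecord:
    dataset: str
    sources: list
    steps: list = field(default_factory=list)

    def log_step(self, description: str) -> None:
        # Timestamp each transformation so the history can be audited.
        stamp = datetime.now(timezone.utc).isoformat()
        self.steps.append(f"{stamp}  {description}")

lineage = LineageRecord(
    dataset="blended_sales",
    sources=["warehouse.sales", "vendor.demographics"],
)
lineage.log_step("left join on zip_code")
lineage.log_step("filtered to fiscal year 2015")
print(lineage)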

#2. Manage performance carefully. Whether they are using traditional ETL or newer data preparation and integration software, users always put a high priority on performance. Look carefully at how vendors currently employ, or plan to employ, in-database and in-memory processing for data preparation analytics to improve performance.
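
As a simple illustration of the in-database approach, the sketch below pushes an aggregation down to the database so that only the small summary result crosses the wire. It uses SQLite purely for convenience; the table and column names are hypothetical.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("NE", 120.0), ("NE", 75.5), ("WEST", 310.0)],
)

# In-database processing: the aggregation runs where the data lives,
# and only the summarized rows are returned to the client.
summary = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region"
).fetchall()
print(summary)  # [('NE', 195.5), ('WEST', 310.0)]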

#3. Make it easier for users, not harder. Many newer technologies are attempting a transition from Hadoop's developer-oriented culture to the world of nontechnical users who generally do not want to code and are more focused on solving business problems. Graphical interfaces help, but they can also mask confusion. Ensure that users are properly trained and guided as they move toward self-service data preparation and integration.

New Opportunities, New Responsibilities

New technologies entering the market mean that these are exciting times for users who have been frustrated with traditional ETL and data integration and seek more flexibility and control. However, as Uncle Ben so famously said in Spider-Man, "With great power comes great responsibility." Users and IT must adjust rules, practices, and their relationship to make effective use of new data preparation technologies and avoid potential pitfalls.
