
3 Things You Need to Know about Big Data

Before you jump into a big data project, be sure to think about these three big data issues related to data integrity, data quality, and the nuances of analysis.

Big data analytics is exciting to many organizations, and for good reason. The ability to analyze big data opens up kinds of analysis that weren't practical before. For example, instead of being limited to sampling large data sets, organizations can analyze much more granular and complete data, which can surface patterns they hadn't seen before. Additionally, the ability to analyze data in real time and act on the resulting insights can be a competitive differentiator.

However, analyzing big data does have its own set of issues. Here are three, related to data integrity, data quality, and the nuances of analysis, that are worth thinking about.

#1: Big data can come from untrusted sources

Big data analysis often involves aggregating data from multiple sources, both internal and external. For instance, you may want to combine data extracted from unstructured sources, such as e-mails and call center notes, with structured data about your customers from your data warehouse, or you may be interested in analyzing social data.

The question is: how trustworthy are those external sources? For example, how trustworthy is social media data such as a tweet? The information may come from an unverified account, or the content itself may be false or misleading. The integrity of this data therefore needs to be considered in the analysis. Before you start using it, understand your data sources and decide what you are comfortable including in your analysis.
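What that screening step looks like will depend on your pipeline, but as a minimal sketch, the Python below filters incoming posts by a simple trust rule. The field names (source, author_verified, follower_count), the trusted-source labels, and the follower threshold are all assumptions for illustration; substitute whatever metadata your collection pipeline actually provides.

```python
# A deliberately simple trust screen for incoming social posts. All field
# names and thresholds here are assumptions for illustration only.

TRUSTED_SOURCES = {"newswire", "partner_api"}  # hypothetical feed labels

def is_trusted(post, min_followers=100):
    """Return True if a post meets a simple trust threshold."""
    if post.get("source") in TRUSTED_SOURCES:
        return True
    # Unverified accounts with tiny audiences are excluded from this analysis.
    return bool(post.get("author_verified")) and post.get("follower_count", 0) >= min_followers

posts = [
    {"text": "Product X works great", "author_verified": True, "follower_count": 5200, "source": "twitter"},
    {"text": "Product X exploded!!", "author_verified": False, "follower_count": 3, "source": "twitter"},
]

trusted = [p for p in posts if is_trusted(p)]
print(f"kept {len(trusted)} of {len(posts)} posts")  # kept 1 of 2 posts
```

The right rule is a business decision, not a technical one; the point is to make the inclusion criteria explicit rather than letting every source into the analysis by default.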

#2: Big data is dirty

Dirty data refers to data that is inaccurate, incomplete, duplicated, or otherwise erroneous. This may include misspelled words or readings from a piece of equipment that is broken or miscalibrated. It can even include duplicate data such as retweets or company press releases that appear numerous times in social media.

Here's a simple example of how this can impact your analysis. Suppose you are performing a competitive analysis using social media data. You want to see how often your competitor's product appears in the external sources you are monitoring and the sentiment associated with those posts. You look at the data and see twice as many positive posts about the competitor's product as about your own. Aside from the fact that sentiment scoring needs to be checked carefully, this could simply be a case of the competitor pushing its press releases out to multiple sources (in essence, tooting its own horn) or getting lots of people to retweet an announcement.

In this case, you need to decide how you want to handle duplicate social media data; there are cases where you might want to keep the duplicates because the repetition itself is valuable information.
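As a minimal sketch of one way to separate duplicates from originals while still counting them, the Python below normalizes post text (stripping retweet markers, links, and punctuation) and keys on the result. The normalization rules are illustrative assumptions; production pipelines often use fuzzier matching such as shingling or MinHash.

```python
# Minimal sketch: flagging near-duplicate posts (retweets, syndicated press
# releases) before counting sentiment. The normalization rules are illustrative.
import re

def normalize(text):
    """Strip retweet markers, URLs, and punctuation so copies collapse to one key."""
    text = re.sub(r"^rt\s+@\w+:\s*", "", text.strip().lower())  # leading "RT @user:"
    text = re.sub(r"https?://\S+", "", text)                    # links vary per syndication
    return re.sub(r"[^a-z0-9 ]+", "", text).strip()

posts = [
    "Competitor's new widget wins award! http://a.co/x",
    "RT @competitor: Competitor's new widget wins award! http://b.co/y",
    "I actually tried the widget and it broke.",
]

seen, unique, dupes = set(), [], 0
for p in posts:
    key = normalize(p)
    if key in seen:
        dupes += 1  # keep a count: duplication itself can be a signal of reach
    else:
        seen.add(key)
        unique.append(p)

print(f"{len(unique)} unique posts, {dupes} duplicates")  # 2 unique posts, 1 duplicates
```

Note that the duplicate count is retained rather than discarded; whether you analyze the unique posts, the amplification, or both depends on the question you are asking.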

In general, the cleansing strategy you employ will depend on the source and type of data and the goal of your analysis. Big data cleansing strategies are evolving (perhaps not rapidly enough), and although some people debate whether big data needs to be cleaned, the reality is that it does -- except, perhaps, if you're developing a filter where your goal is to detect bad elements in the data.
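To make the source-dependence concrete, here is a rough sketch of a cleansing step in which each data source is routed to its own rule and records that fail are dropped. The source labels, field names, and plausibility range are all assumptions for illustration.

```python
# A rough sketch of source-dependent cleansing: each source gets its own rule,
# and records that fail are dropped. Field names, the plausible-range check,
# and the source labels are all illustrative assumptions.

def clean_meter_reading(record):
    """Drop physically implausible readings (e.g., a broken meter's sentinel values)."""
    kwh = record.get("kwh")
    if kwh is None or not 0 <= kwh < 1000:  # assumed plausible hourly range
        return None
    return record

def clean_text_record(record):
    """Keep free-text records but collapse stray whitespace."""
    text = " ".join((record.get("text") or "").split())
    return {**record, "text": text} if text else None

CLEANERS = {"smart_meter": clean_meter_reading, "call_center": clean_text_record}

def cleanse(records):
    for record in records:
        cleaner = CLEANERS.get(record.get("source"), lambda r: r)  # unknown sources pass through
        cleaned = cleaner(record)
        if cleaned is not None:
            yield cleaned

rows = [
    {"source": "smart_meter", "kwh": 42.0},
    {"source": "smart_meter", "kwh": -9999.0},  # broken-equipment sentinel
    {"source": "call_center", "text": "  billing   issue  "},
]
print(list(cleanse(rows)))  # the -9999.0 reading is dropped
```

If your goal were instead to detect bad elements, you would invert the logic and keep what these rules drop; the structure is the same either way.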

#3: Big data changes, especially in data streams

If you're going to analyze big data streams, be aware that the quality of the data can change mid-analysis, or the data itself can change, because the conditions under which you're capturing it may change.

For instance, imagine a utility company collecting weather data as part of a big data analysis that also uses smart-meter data and other geospatial data to predict events or actions. What happens when, while analyzing this data in real time, you discover a gap in the smart-meter data? What if an environmental variable monitored by the smart meters changes?

How do you deal with this in your own analysis? Change detection algorithms for stream mining are an active area of research in data quality and analysis. (See, for example, work by Tamraparni Dasu et al. at http://www.research.att.com/export/sites/att_labs/techdocs/TD_100014.pdf.) At a minimum, be aware of these issues before you start trying to analyze a big data stream.
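To give a feel for the kind of logic involved, the sketch below implements a generic windowed mean-shift check with gap detection. It is not the method from the paper cited above; the window size, threshold, and arrival-interval values are illustrative assumptions.

```python
# A generic windowed mean-shift check plus gap detection; a sketch only, not
# the method from the paper cited above. It assumes (timestamp, value) pairs
# arriving roughly every `step` seconds.
from collections import deque
from statistics import mean, stdev

def detect(stream, window=20, step=3600, threshold=3.0):
    """Yield (timestamp, kind) alerts for gaps and mean shifts in a stream."""
    ref = deque(maxlen=window)     # baseline window (older readings)
    recent = deque(maxlen=window)  # most recent readings
    last_ts = None
    for ts, value in stream:
        if last_ts is not None and ts - last_ts > 1.5 * step:
            yield ts, "gap"        # missing readings in the feed
        last_ts = ts
        if len(recent) == window:
            ref.append(recent[0])  # the reading about to be evicted joins the baseline
        recent.append(value)
        if len(ref) == window:
            mu, sigma = mean(ref), stdev(ref) or 1e-9
            if abs(mean(recent) - mu) > threshold * sigma:
                yield ts, "mean-shift"
                ref.clear()        # reset the baseline after a detected change

# Hourly readings with a level shift at hour 60 and a gap over hours 80-85.
# (The detector may re-flag the shift once while the baseline resettles.)
readings = ([(h * 3600, 10.0) for h in range(60)]
            + [(h * 3600, 25.0) for h in range(60, 80)]
            + [(h * 3600, 25.0) for h in range(86, 100)])
for ts, kind in detect(readings):
    print(kind, "at hour", ts // 3600)
```

Real stream-mining systems use more robust statistics than a windowed mean, but the shape is the same: maintain a baseline, compare incoming data against it, and decide what an alert should trigger in your analysis.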

A Final Word

Of course, there are many more issues than the three listed here. Some are unique to big data analytics; some matter in any kind of data analysis. Organizations using big data analytics must make sure they have the skills to handle the nuances of analyzing big data and the governance in place to manage the data itself.
