Top 4 Myths about Big Data
Don't let these myths lead you astray in considering a big data initiative.
By Raghu Sowmyanarayanan
Big data analytics is one of the major trends every organization tends to jump on for competitive advantage -- even survival. As a result, there are many myths around big data. Those myths can lead you to waste resources or put you on dead-end paths. They can also cause you to miss opportunities where budget is really needed.
Here are the four biggest myths about big data that you should not believe.
Myth #1: Big data technology is mature and cool
Big data technology is still maturing, and adoption is increasing only gradually. It also isn't a single technology at all. It's an ecosystem of software and hardware products -- at different levels of maturity -- that enable users to deal with:
- The rapid growth of data
- New data types (such as sensor data)
- Complex data types (such as video)
- The increased need to exploit data in real time to aid decision making and gain new levels of insight
As the vendor ecosystem surrounding big data technology matures, users are moving from experiments and pilots to more strategic use cases. However, the number of real deployments is still small. When we talk about unstructured content, we refer to documents such as spreadsheets, Word files, video, or CAD drawings. Usually we manage these files in folders or shared drives in Windows. Structured data is organized in tables, but unstructured data -- which isn't so neatly organized -- contains a huge amount of data that is necessary to run a business.
This subjective, folder-based approach to organizing data can make it difficult to locate that data once it moves to production. The challenge is to capture the data in such a way that it helps users find what they need when they need it. It's about how the data fits into the workflow. We need to organize the data based on use.
One of the challenges in managing data is the flood of different types of information that needs to be managed. It's not just office documents. Now it's social media -- not the popular social media but the industrial social media. All of these trends increase information dramatically. Every business has a history of revision, and you may want to maintain historical data as well as new data. All of that unstructured content, either new or historical, has become a big data problem.
Myth #2: Hadoop will replace enterprise data warehouses
Replacing the data warehouse with Hadoop technology is risky. Enterprise data warehouses (EDW) are the most favored technology that organizations are using (or plan to use) for big data initiatives. However, the EDW won't remain the single repository, as new data types and a greater number of sources make it hard to consolidate all data in the EDW.
Also, the different data types mature at different speeds. This means you need diverse technologies; you need to keep mature technologies and add maturing approaches when appropriate. The logical data warehouse addresses these new requirements, and new big-data-driven use cases will push EDWs to evolve towards logical data warehouses. The EDW, with its longer existence and greater maturity, is a smart choice when data needs to be stored for pervasive and persistent use in a single data model. Potential use cases for big data management and analytics need to be piloted separately, without impacting the analytics that existing business users run on EDWs.
Myth #3: With huge volumes of data, small data quality issues are acceptable
There is a shift from EDWs focused on a "single version of truth" to "trust" in big data initiatives, primarily because big data needs data from a variety of sources (structured and unstructured).
That does not mean data quality is not important for big data initiatives. It is. The problem is that although each single data quality issue has a much smaller impact on the whole dataset than it did when there was less data, there are more flaws than before because there is more data. This makes it appear that the overall impact of poor-quality data on the whole dataset remains the same.
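The arithmetic behind this point can be sketched in a few lines of Python. This is only an illustration under a stated assumption: the 1 percent error rate and the two dataset sizes are made-up numbers, and the function name `flaw_impact` is hypothetical. It shows that a single flaw matters less in a larger dataset, while the number of flaws grows with the data, so the flawed share of the whole stays where it was.

```python
def flaw_impact(total_records: int, error_rate: float) -> dict:
    """Summarize how flawed records weigh on a dataset.

    Assumes (for illustration only) that a fixed fraction of
    records, error_rate, is flawed regardless of dataset size.
    """
    flawed = round(total_records * error_rate)
    return {
        # How many records are flawed in total.
        "flawed_records": flawed,
        # Weight of one flawed record relative to the whole dataset.
        "share_per_flaw": 1 / total_records,
        # Fraction of the whole dataset that is flawed.
        "overall_flawed_share": flawed / total_records,
    }


# Hypothetical dataset sizes with the same assumed 1% error rate.
small = flaw_impact(1_000, 0.01)
large = flaw_impact(1_000_000, 0.01)

print("small:", small)
print("large:", large)
```

Under that assumption, each flaw in the larger dataset carries a thousandth of the weight it has in the smaller one, but there are a thousand times as many flaws, so the overall flawed share is identical in both cases.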
In fact, data quality could be worse than before. Much of the data enterprises use in a big data environment comes from outside the enterprise or is of unknown structure and origin. This could degrade the quality of your data. Core principles of data quality assurance need to be followed. Data is no longer sitting static (big data has a high velocity) and its structure is no longer fixed (the variety of data types and sources keeps increasing).
Myth #4: Every problem is a big data problem
If you are matching a couple of fields with a couple of conditions across a couple of terabytes, it isn't really a big data problem. Don't treat every analytics need as a big data project.
However, there is no predefined threshold of data volume, variety, and/or velocity ("3Vs") to indicate when an enterprise has actually reached "big data" status. The threshold is relative for each company and is based on two factors, one tactical and one strategic:
- Tactical: If your existing IT infrastructure cannot cope with the growing dimensions of one or more of the "3Vs" cost-effectively, you have a potential big data problem. You might also face a scaling issue.
- Strategic: Your business cannot achieve its objectives without analyzing a broader range of data, and one of these new information assets complicates one of the existing Vs you are already managing.
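As a rough way to make the two factors concrete, the checklist above could be written as a small Python sketch. Everything here is an assumption made for illustration: the `Assessment` fields and the `big_data_signals` function are hypothetical names, and the yes/no answers would come from your own judgment of your infrastructure and business goals, not from any measured threshold.

```python
from dataclasses import dataclass


@dataclass
class Assessment:
    """Hypothetical yes/no answers an organization gives about itself."""
    # Can existing IT infrastructure handle growth in the 3Vs cost-effectively?
    infra_copes_cost_effectively: bool
    # Do business objectives require analyzing a broader range of data?
    needs_broader_data: bool
    # Does a new information asset complicate a V you already manage?
    new_data_complicates_existing_v: bool


def big_data_signals(a: Assessment) -> list[str]:
    """Return which of the two factors (tactical, strategic) apply."""
    signals = []
    if not a.infra_copes_cost_effectively:
        signals.append("tactical")
    if a.needs_broader_data and a.new_data_complicates_existing_v:
        signals.append("strategic")
    return signals


# Example: infrastructure copes, but the business needs broader data
# that complicates an existing V.
print(big_data_signals(Assessment(True, True, True)))
```

An empty result suggests the need is an ordinary analytics problem rather than a big data project, which is the point of this myth.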
Raghuveeran Sowmyanarayanan is a vice president at Accenture and is responsible for designing solution architecture for RFPs and opportunities. You can reach him at firstname.lastname@example.org.