Getting Started with Big Data
How should a manufacturer get started with big data? TDWI's Philip Russom offers suggestions for getting off on the right foot.
- By Philip Russom, Ph.D.
- August 26, 2014
[Editor's note: In every issue of the Business Intelligence Journal, our "BI Experts' Columns" asks BI professionals to answer a series of questions about a scenario we pose. Space limitations prevented us from publishing the response of Philip Russom, research director for data management at TDWI, in our next issue, so we've adapted his recommendations for this article.
In the scenario, Nicole Mercer, BI director for Everything Rugs, is wondering about what steps to take to get the rug manufacturer on the road to big data success.]
Nicole is right to avoid a risky "big bang project" when getting started with big data. A small project for big data sounds like an oxymoron, but a number of well-contained, low-risk starting points have proved successful for early adopters.
One of the more common big data starting points is sentiment analysis. In fact, sentiment analysis is also one of the first applications a BI team will attempt with so-called unstructured data, in this case human language text. The point of sentiment analysis is to deduce consumers' opinions and satisfaction with a given manufacturer or product. That information then guides the manufacturer in improving its products, customer service, and public relations.
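To make the idea concrete, here is a minimal sketch of lexicon-based sentiment scoring in Python. It is only an illustration: the word lists, the function names `sentiment_score` and `summarize`, and the sample posts are all hypothetical, and a production project would more likely rely on a trained model or a commercial text analytics tool than on hand-picked keywords.

```python
import re

# Hypothetical, deliberately tiny word lists (assumptions for illustration only).
POSITIVE = {"love", "great", "soft", "durable", "beautiful"}
NEGATIVE = {"hate", "frayed", "faded", "cheap", "poor"}

def sentiment_score(text: str) -> int:
    """Return the count of positive words minus the count of negative words."""
    words = re.findall(r"[a-z']+", text.lower())
    positives = sum(word in POSITIVE for word in words)
    negatives = sum(word in NEGATIVE for word in words)
    return positives - negatives

def summarize(posts):
    """Label each post by the sign of its score and count posts per label."""
    counts = {"positive": 0, "negative": 0, "neutral": 0}
    for post in posts:
        score = sentiment_score(post)
        if score > 0:
            counts["positive"] += 1
        elif score < 0:
            counts["negative"] += 1
        else:
            counts["neutral"] += 1
    return counts

if __name__ == "__main__":
    sample_posts = [
        "I love this rug, so soft and durable!",
        "The edges frayed and the color faded.",
        "Delivered on Tuesday.",
    ]
    print(summarize(sample_posts))
```

Even a crude tally like this, run over posts that mention a product line, gives a first indication of whether opinion is trending positive or negative before the team invests in heavier tooling.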
At the risk of stating the obvious, you need to have big data before you can work with it. Likewise, you need some kind of social media data before you can execute sentiment analysis with it. Because Everything Rugs does not sell directly to consumers, it's unlikely the company has much of its own social media data. (Firms with direct consumer contact would have the equivalent of social media data drawn from call center applications and self-service support areas on the corporate website.) This should not be a show stopper if Everything Rugs can acquire social media data from its retail partners and third parties that specialize in aggregating social media data.
Another safe starting point would be to analyze supply chain data. Manufacturers with a modern, digital supply chain exchange large amounts of data with suppliers, retailers, and wholesalers. This form of big data accumulates as large collections of files and documents in XML, EDI, JSON, and proprietary formats. Supply chain big data can be analyzed to understand which partners are most profitable, reliable, and quality-driven.
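As a rough illustration of the reliability question, the Python sketch below computes an on-time delivery rate per supplier from JSON records. The record layout (the `supplier`, `promised`, and `delivered` fields, with ISO-format dates) and the function name `on_time_rate_by_supplier` are assumptions made for the example; real supply chain documents in XML, EDI, or proprietary formats would first need to be parsed into a comparable shape.

```python
import json
from collections import defaultdict

def on_time_rate_by_supplier(json_lines):
    """Return {supplier: fraction of deliveries made on or before the promised date}.

    Each input line is assumed to be a JSON object with the hypothetical fields
    "supplier", "promised", and "delivered", where dates are YYYY-MM-DD strings.
    """
    totals = defaultdict(int)
    on_time = defaultdict(int)
    for line in json_lines:
        record = json.loads(line)
        supplier = record["supplier"]
        totals[supplier] += 1
        # ISO dates in YYYY-MM-DD form sort correctly as plain strings.
        if record["delivered"] <= record["promised"]:
            on_time[supplier] += 1
    return {supplier: on_time[supplier] / totals[supplier] for supplier in totals}

if __name__ == "__main__":
    sample = [
        '{"supplier": "A", "promised": "2014-08-01", "delivered": "2014-08-01"}',
        '{"supplier": "A", "promised": "2014-08-05", "delivered": "2014-08-09"}',
        '{"supplier": "B", "promised": "2014-08-03", "delivered": "2014-08-02"}',
    ]
    print(on_time_rate_by_supplier(sample))
```

The same pattern of parsing each document, grouping by partner, and aggregating extends to the profitability and quality measures mentioned above, provided the relevant fields exist in the exchanged data.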
Before such analytic applications can be implemented, the data warehouse must be extended to accommodate the extra storage and processing. There are two broad considerations here: securing capacity for storing large data volumes and selecting a data platform that's designed to manage the type of big data to be analyzed. Concerning the former, licenses for database management systems (DBMSs) and other data warehouse platform components vary, but most require a revised license when users want to increase capacity. Concerning the latter, with more "exotic" data types (such as the social media data and file-based supply chain data mentioned earlier), the relational DBMS may not be an appropriate choice. In that case, an additional data platform may be in order. Either way, determining the sponsor and getting funds from that sponsor are common prerequisites to big data analytic applications.
Note that more organizations are opting to integrate greater numbers and diversity of data platforms into their data warehouse "environments" to accommodate exotic types of big data or departmental funding (which is increasingly common with analytic applications). A recent TDWI survey showed that almost two-thirds of user respondents are already doing this. The result is a multi-platform data warehouse environment, where a traditional warehouse is just one of those platforms.
Nicole's team at Everything Rugs is firmly committed to a more monolithic environment that's almost exclusively the warehouse proper, and they have a stated policy against standalone data marts. Distributing big data types and their processing workloads to additional platforms would be a dramatic cultural shift they would need to discuss and debate.
Hadoop is one of the data platforms being added to some data warehouse environments. It has advantages for Everything Rugs in that it is designed for file-based big data, such as the social media and supply chain data discussed earlier. Also, once startup costs are recouped, Hadoop is amazingly inexpensive compared to a relational warehouse platform that supports multi-terabyte capacity. Hadoop is how many organizations avoid the high cost of big data capacity.
The catch with Hadoop is that it's very different from the relational technologies that BI and DW professionals are used to. Early adopters have shown that BI/DW personnel can learn Hadoop (and related technologies such as Hive, HBase, MapReduce, Java programming, etc.), but it takes significant time and training. Assuming the DW team at Everything Rugs has the time, the training budget, and the willingness to do something new, Hadoop could be a good choice.
Early adopters have shown that implementing Hadoop as a bigger and more diverse data staging platform fits well within most existing data warehouse architectures. Hadoop handles exotic data types that are problematic for some staging areas. Hadoop is an inexpensive platform for massively scalable row stores, which can handle much of the relational data that costs dearly on a relational warehouse platform. This explains why data staging is a safe and desirable starting point for Hadoop integration into a data warehouse environment.