TDWI | Training & Research | Business Intelligence, Analytics, Big Data, Data Warehousing

TDWI Blog

TDWI Blog: Data 360

An Introduction to Data Warehouse Modernization

By Philip Russom, Senior Research Director for Data Management, TDWI

As any data warehouse professional can tell you, the average data warehouse (DW) is today evolving, extending, and modernizing, to support new technology and business requirements, as well as to prove its continued relevance in the age of big data and analytics. This process has become known as data warehouse modernization; synonyms include DW augmentation, automation, and optimization. Every user organization and its DW is a unique scenario, so every modernization program is, too. Even so, a few common situations, drivers, and outcomes have arisen.

DW modernization takes many forms.

For example, common scenarios range from software and hardware server upgrades to the periodic addition of new data subjects, sources, tables, and dimensions. However, data types and data velocities are diversifying aggressively, so data modernization increasingly involves users diversifying their software portfolios to include tools and data platforms built for big data from new sources. As portfolios swell, most data warehouses (DWs) are evolving – or modernizing – into complex and hybrid multi-platform data warehouse environments (DWEs). Though surrounded by complementary systems and tools, the traditional data warehouse is still the core of the modern DWE. Even so, a few organizations are decommissioning current data warehouse platforms to replace them with modern ones optimized for today’s requirements in big data, analytics, real-time operation, high performance, and cost control. No matter what modernization strategy is in play, all require significant adjustments to the logical layers and systems architectures of the extended DWE.

Looking inside the average data warehouse, we see many opportunities for DW professionals to initiate or expand the use of recent technology advancements, such as in-memory processing, in-database analytics, massively parallel processing (MPP), multi-platform federated queries, and Hadoop. Furthermore, there are many new database management systems purpose-built for analytics, based on columns, appliances, graph, MapReduce, NoSQL, and other innovations. Best practices can likewise be modernized by adapting agile, lean, logical, and virtual methods, or by moving to modern team structures, such as the competency center or center of excellence.

Systems outside the DW need modernization, too.

Looking outside the warehouse, multiple disciplines have their own modern innovations that need support from a more modern DW. For example, new business practices need bigger, newer, and fresher data, so the business can compete on analytics, get actionable business value from new big data, and monitor the business in real time. As another example, business intelligence (BI) is experiencing its own modernization right now, and BI needs the DW to provision data for modern BI practices, such as visualization, data exploration, and self service. Likewise, many organizations are complementing their mature investments in online analytic processing (OLAP) with an exploding array of techniques for advanced analytics.

ANNOUNCEMENT

To learn more about modernizing data warehouses and related IT systems, attend my TDWI webinar Data Warehouse Modernization in the age of Big Data Analytics, coming up on April 14, 2016. Register online for the webinar: http://bit.ly/DWMod16

This webinar will quantify trends in data warehouse modernization and catalog technologies that are relevant. It will also document strategies and user best practices for organizing modernization projects. The goal is to help DW professionals and their business counterparts plan the next generation of their data warehouse, in alignment with business goals.

Big Themes under the Big Tent

By David Stodder, Senior Research Director for Business Intelligence, TDWI

Hard to believe, but the New Year is over a month old now and moving by fast. TDWI just finished its first Conference of the year in Las Vegas, which included the co-located Executive Summit chaired by me and my TDWI colleague, Research Director Fern Halper. The Summit was fantastic; many thanks to our great speakers, sponsors, and attendees. Other industry events focused on TDWI’s core topics are coming up, including the TDWI Solution Summit in Savannah, Strata and Hadoop World, and Gartner Business Intelligence & Analytics Summit. So, it’s time to check the condition of my shoes, luggage, and lumbar vertebrae (have to stop carrying that heavy computer bag) because they are all about to get a workout.

These events and others later in the year will no doubt highlight some of the major themes that TDWI Research is seeing as top concerns among leadership at user organizations. Here are three themes that we expect to be top of mind at conferences the rest of this year:

Theme #1: “Governed” self-service analytics and data discovery. At the Summit, several attendees and speakers observed that the pendulum in organizations could be swinging toward stronger data governance. As organizations supply users with self-service visual analytics and data discovery tools and ease constraints on data access and data blending, they are becoming increasingly concerned about data quality, management, and security. TDWI Research advises that the best approach to expanding self-service analytics and data discovery is a balanced one that includes data governance. Our research finds that this is largely IT’s responsibility, but governance is better tailored to users’ needs if the business side is closely involved, such as through establishment of a committee that includes stakeholders from business and IT. Governance and other steps organizations can take to improve their "analytics culture" will be a key topic at TDWI and other events.

Theme #2: Self-service data preparation. One of the hot trends in the industry is the technology evolution in data preparation toward self-service data blending, data wrangling, and data munging. I heard a great deal about this at Strata in 2015 and expect to again this year. Not only business users but data scientists working with Hadoop data lakes need technologies that can support easier, faster, and more standardized processes for data access, cataloging, integration, and transformation. I will be researching and writing a TDWI Best Practices Report on this topic in the first half of this year; look for the research survey to be launched at the end of February. I expect that this will be a major topic at the aforementioned events as organizations try to improve the productivity and satisfaction of business users and data scientists.

Theme #3: The maturing Hadoop ecosystem. Within the past few years, developers across the Hadoop landscape have made progress in taking what has been a disparate collection of open source projects and technologies and shaping it into a more coherent ecosystem. To be sure, most organizations still need to work with vendors’ platforms to achieve the level of integration and management they need. What will be interesting to see at TDWI's Savannah Solution Summit and at Strata and Hadoop World is how the pendulum is swinging in the Hadoop environment between the tradition of freewheeling development focused on innovation and the use of more tightly integrated systems based on frameworks, governance, and management processes.

As we move forward in 2016, I hope to see members of the TDWI and greater business intelligence and analytics community at these events. I also look forward to hearing your thoughts about how these major themes will play out during the course of this year.

New big data sources and data types – and the need to get business value from new data – are forcing organizations to evolve their data management practices.

By Philip Russom, TDWI Research Director for Data Management

I recently participated as a core speaker in the Informatica Big Data Ready Virtual Summit, sharing a session with Amit Walia, the Chief Product Officer at Informatica Corporation. Amit and I had an interactive conversation where we discussed one of the most pressing questions in data management today, namely: How should an organization get ready to capture and leverage big data? This is an important question, because many organizations in many industries are facing big data, with its new data sources, data types, large volumes, and fast generation rates. Organizations need to modernize their data integration (DI) infrastructure, so they can capture and leverage the new data for new business insights and analytics.

Amit Walia and I boiled down this complex issue to seven recommendations, which I will now summarize:

Achieve agility and autonomy, as required of big data and analytics. The creation of data management solutions must keep up with the pace of business by adopting agile and lean development methods. New tool functions that assist with agility and autonomy include those for data exploration and profiling, self-service data access, and rapid dataset prototyping (or “data prep”).

Govern big data, as you would any enterprise data asset. Big data has a bit of a “hall pass” today, because it’s new and exotic. But eventually, it will be assimilated as yet another category of enterprise data. Prepare for that day, by assuming that new data demands governance, stewardship, privacy, security, quality, and standards.

Include Hadoop in your data integration infrastructure. Hadoop can replace some of the database management systems and file systems you’re using today, while scaling at a reasonable cost and handling new data types. Modern users’ DI architectures already include Hadoop for landing, staging, push-down processing, archiving, hubs, and lakes.

Integrate fit-for-purpose data to enable data exploration and profiling. The trend is to integrate big data in its raw, original state, into a big data platform, such as Hadoop or a large relational MPP implementation. That way, users can explore and profile new big data to determine its business value. Later, users can repurpose discovered data many ways, sometimes at runtime, as new requirements arise for analytics or operations.

Embrace real-time data ingestion, as required by some forms of big data and analytics. A modern DI infrastructure supports many speeds and frequencies of data ingestion, because diverse data sources and business processes have diverse requirements relative to time. A new challenge for DI is to capture and process streaming data in real time, to enable near-real-time analytics and business operations.

Prepare to integrate big data by upgrading skills and team structures. TDWI surveys say that a lack of skill is the biggest barrier to success with new big data. Data management professionals need training for Hadoop, NoSQL, natural language processing, and new data types (e.g., JSON, social media, streams). These competencies should be added to those of existing DI competency centers.

Modernize data management solution development by combining agile, stewardship, and collaborative methods. Both agile and stewardship methods recommend the use of a pair of specialists, working together closely: a data specialist and a business representative (or steward). This “dynamic duo” accelerates requirements gathering, ensures data-to-business alignment, and delivers solutions faster than ever.
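
To make the "explore and profile" recommendation above more concrete, here is a minimal sketch in plain Python (standard library only). It is an illustration, not a method presented at the summit: the sample records, the field names, and the profile_records helper are all hypothetical. The sketch profiles raw, schema-less records by reporting how often each field is present, how often it is null, and which value types it holds.

```python
from collections import Counter, defaultdict


def profile_records(records):
    """Summarize field presence, null counts, and value types for raw dict records."""
    total = len(records)
    present = Counter()           # how many records contain each field
    nulls = Counter()             # how many records hold None for each field
    types = defaultdict(Counter)  # value-type histogram per field

    for record in records:
        for field, value in record.items():
            present[field] += 1
            if value is None:
                nulls[field] += 1
            else:
                types[field][type(value).__name__] += 1

    profile = {}
    for field in sorted(present):
        profile[field] = {
            "presence_rate": present[field] / total if total else 0.0,
            "null_count": nulls[field],
            "types": dict(types[field]),
        }
    return profile


# Hypothetical raw events, kept in their original (inconsistent) shape.
raw_events = [
    {"device_id": "a1", "temp": 21.5, "ts": 1455000000},
    {"device_id": "a2", "temp": None, "ts": 1455000060},
    {"device_id": "a3", "temp": "22", "ts": 1455000120, "site": "plant-7"},
]

if __name__ == "__main__":
    for field, stats in profile_records(raw_events).items():
        print(field, stats)
```

A profile like this would show, for instance, that the hypothetical temp field mixes numbers and strings, which is the kind of finding that helps users judge a new data source before repurposing it.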

If you’d like to hear more of my discussion with Informatica’s Amit Walia (and hear other expert speakers in the Informatica Big Data Ready Virtual Summit, too), please replay the Informatica Webinar by clicking here.

Faster Analytics Processing with Open Source

By David Stodder, TDWI Director of Research for Business Intelligence

A tsunami of big data is hitting many organizations and the demand for faster, more frequent, and more varied analytics is riding the crest of that wave. Organizations want to apply predictive analytics, stream analytics, machine learning, and other forms of advanced analytics to their key decisions and operations. They are also experiencing the rise of self-service visual analytics, which is whetting the appetite of nontechnical users throughout organizations who want to do more with data than they can using standard business intelligence (BI) reports and spreadsheets.

Fortunately, technology trends are moving in a positive direction for organizations seeking to expand the business impact of analytics and send data exploration in new directions. Many of the most important innovations are occurring in the open source realm. In the decade since Hadoop and MapReduce were first developed, we have seen a flurry of initiatives, the best of which have become ongoing Apache Software Foundation projects. Today, with the Hadoop 2.0 ecosystem and YARN, it is more feasible for organizations to plug their choice of interactive SQL programs, advanced analytics, open source-based processing and execution engines, and other best-of-breed tools into something resembling a unified architecture.

TDWI has just published my new Checklist Report, “Seven Steps to Faster Analytics Processing with Open Source.” We also did a Webinar on this topic that featured discussion with representatives of the four sponsors of the checklist: Cloudera, DataTorrent, Platfora, and Talend. I invite you to check out these resources.

One of the key areas that I wrote about in the checklist—and that was also discussed in the Webinar—was open source stream processing and stream analytics. With interest growing in Internet of Things (IoT) data streams from sensors and other machines, many organizations need to develop a technology strategy for stream processing and stream analytics. The Apache Spark Streaming module, Apache Storm, and Apache Apex are aimed at processing streams of real-time data for analytics. These technologies can be integrated with Apache Kafka, a popular publish-and-subscribe messaging system that can serve data to streaming systems. In the coming year, I am sure we will see rapid evolution of open source technologies for gaining value from real-time data streams.
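
As a rough illustration of what stream analytics does with sensor data, the following sketch groups events into fixed ("tumbling") time windows and averages readings per sensor. It is written in plain Python and does not use the APIs of Spark Streaming, Storm, Apex, or Kafka; the tumbling_window_averages function, the event layout, and the sensor names are hypothetical, and the sketch assumes events arrive ordered by timestamp.

```python
from collections import defaultdict


def tumbling_window_averages(events, window_seconds):
    """Average values per sensor over fixed, non-overlapping time windows.

    `events` is an iterable of (timestamp, sensor_id, value) tuples,
    assumed to arrive in timestamp order. Yields (window_start, averages).
    """
    current_window = None
    sums = defaultdict(float)
    counts = defaultdict(int)

    for timestamp, sensor_id, value in events:
        window_start = timestamp - (timestamp % window_seconds)
        # A new window has started: emit the finished one and reset state.
        if current_window is not None and window_start != current_window:
            yield current_window, {s: sums[s] / counts[s] for s in sums}
            sums.clear()
            counts.clear()
        current_window = window_start
        sums[sensor_id] += value
        counts[sensor_id] += 1

    # Emit the last, possibly partial, window.
    if current_window is not None:
        yield current_window, {s: sums[s] / counts[s] for s in sums}


# Hypothetical sensor readings: (seconds since start, sensor, value).
sensor_events = [
    (0, "pump-1", 10.0),
    (3, "pump-1", 14.0),
    (4, "pump-2", 7.0),
    (10, "pump-1", 20.0),
    (12, "pump-2", 9.0),
]

if __name__ == "__main__":
    for window_start, averages in tumbling_window_averages(sensor_events, 10):
        print(window_start, averages)
```

Because the function is a generator, it processes one event at a time and holds only the current window in memory, which mirrors, in a very simplified way, how a streaming engine consumes messages from a queue instead of loading a full dataset.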

Other important topics that we discussed in the Webinar and I covered in the report are interactive SQL-based querying of Hadoop systems, and data integration and preparation. Good interactivity with Hadoop data, which includes the ability to send ad hoc SQL queries and receive responses in a reasonable time, is critical to analytics. However, until recently interactivity with Hadoop data was slow and difficult. New options involving SQL-on-Hadoop, Hive/Spark integration, and packaged MapReduce-based big data discovery are improving performance and making interactivity easier for users and developers. Data integration is also getting a push from Spark. Programs for data integration and preparation can use its in-memory data processing and generally better performance to quicken the pace of what are often the most time-consuming steps in BI and analytics.

I expect an active year ahead in open source-based technologies for BI and analytics and will be observing them closely in my 2016 research and analysis.

Hyperlinks embedded in this blog:

Cloudera: http://www.cloudera.com

DataTorrent: https://www.datatorrent.com/

Platfora: http://www.platfora.com/

Talend: http://www.talend.com/

Apache Spark Streaming: https://spark.apache.org/streaming/

Apache Storm: https://storm.apache.org/

Apache Kafka: http://kafka.apache.org/

Apache Apex: http://apex.incubator.apache.org/

Posted by David Stodder

Igniting the Analytic Spark

An Introduction to Apache Spark and its uses in Business Intelligence (BI), Data Warehousing (DW), and Advanced Analytics

Blog by Philip Russom
Research Director for Data Management, TDWI

At TDWI, we’re hearing a lot of interest in Apache Spark, although it’s still new and most users are unfamiliar with it. So, please allow me to define Spark for you, explain its potential benefits, and describe actual use cases.

Apache Spark is a parallel processing engine. It specializes in big data, and works well with Hadoop environments. However, Spark is not just for Hadoop; it provides parallel processing for other environments, too. Spark is known for high speed and low latency, which it achieves by leveraging in-memory computing and cyclic data flows.

Spark is fast. Very fast. Benchmarks show Spark to be up to one hundred times faster than Hadoop MapReduce with in-memory operations, and up to ten times faster than MapReduce with disk-bound operations. The point is that Spark has the low latency required of new data-driven practices, like data exploration, discovery, streaming analytics, and SQL-based analytics.

Spark functions apply directly to applications in BI, DW, DI, & analytics. Spark today includes four libraries of functionality, and each is of interest to professionals in BI, DW, and analytics. The libraries support ANSI-standard SQL, streaming data, machine learning, and graph analytics.

A Spark library provides native support for ANSI and ISO standard SQL. In a recent TDWI survey, 69% of users surveyed said that ANSI- and ISO-standard SQL on Hadoop is required for broad enterprise use. That’s because a modern enterprise wants to leverage pre-existing SQL skills and SQL-based tools. Furthermore, users want fast queries on Hadoop, to enable data exploration, analytics, and other interactive, data-driven practices. Spark and its SQL support promise to enable these – in both batch or interactive sessions, for Hadoop and other environments – which in turn will spark big data analytics for users in BI, DW, and analytics.
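
To show the kind of set-based SQL that existing skills and tools carry over, here is a small customer segmentation query. The sketch runs on Python's built-in sqlite3 module as a stand-in engine; it does not use Spark SQL, and the orders table, the spend thresholds, and the segment_customers helper are hypothetical. The query relies only on widely supported constructs (CASE, SUM, GROUP BY, ORDER BY), so a similar statement would be expected to work on a standards-oriented SQL-on-Hadoop engine, though that is an assumption not tested here.

```python
import sqlite3

# Segment customers by total spend using common SQL constructs.
SEGMENT_QUERY = """
SELECT
    customer_id,
    SUM(amount) AS total_spend,
    CASE
        WHEN SUM(amount) >= 1000 THEN 'high'
        WHEN SUM(amount) >= 100 THEN 'medium'
        ELSE 'low'
    END AS segment
FROM orders
GROUP BY customer_id
ORDER BY total_spend DESC
"""


def segment_customers(rows):
    """Load (customer_id, amount) rows into an in-memory table and segment them."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE orders (customer_id TEXT, amount REAL)")
        conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
        return conn.execute(SEGMENT_QUERY).fetchall()
    finally:
        conn.close()


# Hypothetical order data.
sample_orders = [
    ("c1", 700.0),
    ("c1", 450.0),
    ("c2", 120.0),
    ("c3", 15.0),
    ("c3", 20.0),
]

if __name__ == "__main__":
    for row in segment_customers(sample_orders):
        print(row)
```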

Spark offers broad compatibility. Spark SQL reuses the Hive front-end and metastore, to provide compatibility with existing Hive data, queries, and UDFs. Spark SQL’s server mode extends interoperability via industry-standard ODBC/JDBC. Spark can process data in S3, HDFS, HBase, Hive, Cassandra, and any Hadoop InputFormat.

Spark can be deployed many ways. Spark requires some kind of shared file system (NFS compliant), so its deployment options are diverse. Spark runs on its standalone cluster, Hadoop YARN, Apache Mesos, and Amazon EC2; on premises or cloud. A single job, query, or stream processing can be deployed in either batch or interactive mode via Scala, Python, and R shells.

Spark has one console for the seamless development of diverse functionality. Apache Spark includes libraries for four high-level applications: SQL, streaming data, machine learning, and graph analytics. These are integrated tightly, so users can create applications that mix SQL queries and stream processing alongside complex analytic algorithms.

Spark and its libraries enable several application types for BI, DW, and analytics:

SQL analytics and related set-based applications – e.g., data exploration and discovery, customer-base segmentation, financial analyses, dimensional modeling and analysis, reporting, ETL pushdown that requires SQL
Stream capture and analysis – monitoring facilities (utilities, factories), tracking social sentiment, predictive machine maintenance, rerouting vehicle traffic, managing mobile assets, any time-sensitive process
Graph analytics – anomaly detection for fraud or risk, behavioral analysis, entity clustering, patient outcome optimization
Mixtures of the above – a trend among users is to mix multiple analytic methods in a single application, because each reveals different insights
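
For readers unfamiliar with graph analytics, entity clustering (mentioned in the list above) can be sketched as finding connected components: records linked by a shared attribute end up in the same cluster. The example below is plain Python using a union-find structure; it is not Spark's graph library or its API, and the connected_components function and the account identifiers are hypothetical.

```python
def connected_components(edges):
    """Cluster nodes that are linked, directly or indirectly, by any edge."""
    parent = {}

    def find(node):
        # Register unseen nodes as their own root, then walk up to the root.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # shorten the path as we go
            node = parent[node]
        return node

    # Merge the clusters of the two endpoints of every edge.
    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Collect nodes under their final root.
    clusters = {}
    for node in list(parent):
        clusters.setdefault(find(node), set()).add(node)
    return sorted((sorted(members) for members in clusters.values()),
                  key=lambda members: members[0])


# Hypothetical links, e.g. pairs of accounts that share a phone number.
shared_attribute_links = [
    ("acct-1", "acct-2"),
    ("acct-2", "acct-3"),
    ("acct-4", "acct-5"),
]

if __name__ == "__main__":
    print(connected_components(shared_attribute_links))
```

In a fraud or risk setting, an unusually large cluster of accounts tied together by shared attributes might be a candidate for review; a distributed graph engine applies the same idea at a scale a single machine cannot handle.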

Want to learn more about Spark? Click here to replay my recent TDWI Webinar, where I go into more detail about Spark and its uses in BI, DW, and analytics.

Posted by Philip Russom, Ph.D.

Emerging Technologies and Methods: An Overview in 25 Tweets

Blog by Philip Russom
Research Director for Data Management, TDWI

To help you better understand what today’s emerging technologies and methods (ETMs) are – especially those related to business intelligence, analytics, and data warehousing – I’d like to share with you the series of 25 tweets I recently issued on the topic. I think you’ll find the tweets interesting, because they provide an overview of ETMs in a form that’s compact, yet amazingly comprehensive.

Each tweet below is a short sound bite or stat bite drawn from the recent TDWI report “Emerging Technologies for Business Intelligence, Analytics, and Data Warehousing,” which I researched and wrote with my colleagues David Stodder and Fern Halper. Many of the tweets focus on a statistic cited in the report, while other tweets are definitions stated in the report.

I left in the arcane acronyms, abbreviations, and incomplete sentences typical of tweets, because I think that all of you already know them or can figure them out. Even so, I deleted a few tiny URLs, hashtags, and repetitive phrases. I issued the tweets in groups, on related topics; so I’ve added some headings to this blog to show that organization. Otherwise, these are raw tweets.

Examples of Emerging Technologies and Methods (ETMs)

1. Most Emerging Techs & Methods (#ETMs) fall into 3 layers of BI, #analytics & #EDW tech stack.

2. #ETMs for #BI include #DataViz, #DataExploration, #DataPrep, #Dashboards, #MashUps, #MobileBI.

3. #ETMs for #Analytics operate on data from #SocialMedia, #IoT, streams, #MachineData.

4. #ETMs for #DataMgt include #Hadoop, #ApacheSpark, #NoSQL, in-DBMS #analytics, in-mem DBMS, columnar.

Examples of Emerging Methods and Platforms

5. #EmergingMethods include agile & lean dev methods applied to whole BI/DW/#analytics tech stack.

6. Other #EmergingMethods include #CompetencyCenters, #CollaborativeBI, #StoryTelling, #DataGovernance.

7. Emerging platforms include many types of clouds, #SaaS, #OpenSource, appliances...

The Importance of ETMs

8. #TDWI SURVEY SEZ: Emerging Techs & Methods (#ETMs) are very important (53%) or somewhat (39%).

9. #TDWI SURVEY SEZ: Emerging Techs & Methods (#ETMs) are opportunity to compete, evolve, perform (79%).

10. #TDWI SURVEY SEZ: Two-thirds of respondents (64%) already have #ETMs in production.

Benefits and Barriers for ETMs

11. #TDWI SURVEY SEZ: Top benefits of #ETMs = competitiveness, decision support, biz performance, innovation.

12. #TDWI SURVEY SEZ: Top barriers to #ETMs = LACK of skills, budgets, biz value, innovation.

13. #TDWI SURVEY SEZ: Other barriers to #ETMs = poor state of IT infrastructure & poor #DataGovernance.

User Satisfaction with Current State of ETMs

14. #TDWI SURVEY SEZ: 41% dissatisfied with their enterprise adoption of Emerging Techs & Methods (#ETMs).

15. Adoption of agile development methods is one of strongest trends in BI, #analytics, #EDW today.

16. #TDWI SURVEY SEZ: 55% dissatisfied with time required of development for BI, #analytics, #DataMgt.

User Success with Current State of ETMs

17. #TDWI SURVEY SEZ: Users successful with #ETMs for #SelfServiceBI (54%) & #DataPrep (50%).

18. #Hadoop & #NoSQL #ETMs are challenging for tools & apps built for relational data.

Emerging Data Types for Analytics

19. #TDWI SURVEY SEZ: 84% analyze structured data today. Surprising that 16% are not; maybe text analytics?

20. #TDWI SURVEY SEZ: #IoT data used by <20% of respondents today, but 40% more will use within 3 years.

21. Other data sources poised for growth = Machine data (sensors, devices) & #RealTime #EventStreaming.

22. #TDWI SURVEY SEZ: In clouds, users already do #EDW (35%), #Analytics (31%), sandbox (29%), DataInt (24%).

23. #TDWI SURVEY SEZ: 49% have production #PredictiveAnalytics today; another 39% will in 3 yrs.

ETMs for Data Warehousing & Data Management

24. #TDWI SURVEY SEZ: 3-yr hi growth in #DataMgt #ETMs = #RealTime, streams, #DataPrep, #Hadoop, #CloudDW.

25. Top security #ETMs for #DataMgt = #DataProtection (encrypt, mask, token), not just user name/pswd.
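
As a brief illustration of two of the data protection techniques named in tweet 25, the sketch below masks a value and derives a token from it, using only Python's standard library. It is a simplified, hypothetical example rather than a production design: the mask_value and tokenize_value helpers and the sample key are assumptions, and the keyed hash (HMAC-SHA256) stands in for tokenization, which in commercial products is often implemented with a secure token vault instead.

```python
import hashlib
import hmac


def mask_value(value, visible=4, mask_char="*"):
    """Hide all but the last `visible` characters of a string."""
    visible = max(visible, 0)
    hidden = max(len(value) - visible, 0)
    return mask_char * hidden + value[hidden:]


def tokenize_value(value, secret_key):
    """Return a deterministic keyed token (HMAC-SHA256 hex digest) for a value.

    The same value and key always produce the same token, so tokens can
    still be joined or counted without exposing the original value.
    """
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()


if __name__ == "__main__":
    card_number = "4111111111111111"   # well-known test number, not real data
    demo_key = b"demo-key-not-secret"  # hypothetical key for illustration only
    print(mask_value(card_number))
    print(tokenize_value(card_number, demo_key))
```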

Want to learn more about Emerging Technologies and Methods (ETMs)?

For a more detailed discussion – in a traditional publication! – get the TDWI Best Practices Report, titled “Emerging Technologies for Business Intelligence, Analytics, and Data Warehousing,” which is available in a PDF file via a free download.

You can also register for and replay the TDWI Webinar, where David Stodder, Fern Halper, and I discuss the findings of the TDWI report.

Posted by Philip Russom, Ph.D.