By David Stodder, Senior Research Director for Business Intelligence, TDWI
Hard to believe, but the new year is already more than a month old and moving fast. TDWI just finished its first Conference of the year in Las Vegas, which included the co-located Executive Summit chaired by me and my TDWI colleague, Research Director Fern Halper. The Summit was fantastic; many thanks to our great speakers, sponsors, and attendees. Other industry events focused on TDWI’s core topics are coming up, including the TDWI Solution Summit in Savannah, Strata and Hadoop World, and the Gartner Business Intelligence & Analytics Summit. So, it’s time to check the condition of my shoes, luggage, and lumbar vertebrae (I have to stop carrying that heavy computer bag) because they are all about to get a workout.
These events and others later in the year will no doubt highlight some of the major themes that TDWI Research is seeing as top concerns among leadership at user organizations. Here are three themes that we expect to be top of mind at conferences the rest of this year:
Theme #1: “Governed” self-service analytics and data discovery. At the Summit, several attendees and speakers observed that the pendulum in organizations could be swinging toward stronger data governance. As organizations supply users with self-service visual analytics and data discovery tools and ease constraints on data access and data blending, they are becoming increasingly concerned about data quality, management, and security. TDWI Research advises that the best approach to expanding self-service analytics and data discovery is a balanced one that includes data governance. Our research finds that governance is largely IT’s responsibility, but it is better tailored to users’ needs when the business side is closely involved, such as through a committee that includes stakeholders from both business and IT. Governance and other steps organizations can take to improve their “analytics culture” will be a key topic at TDWI and other events.
Theme #2: Self-service data preparation. One of the hot trends in the industry is the technology evolution in data preparation toward self-service data blending, data wrangling, and data munging. I heard a great deal about this at Strata in 2015 and expect to again this year. Not only business users but also data scientists working with Hadoop data lakes need technologies that can support easier, faster, and more standardized processes for data access, cataloging, integration, and transformation. I will be researching and writing a TDWI Best Practices Report on this topic in the first half of this year; look for the research survey to launch at the end of February. I expect that this will be a major topic at the aforementioned events as organizations try to improve the productivity and satisfaction of business users and data scientists.
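To make “data wrangling” a bit more concrete, here is a minimal sketch in Python using pandas of the kinds of steps self-service preparation tools automate behind a visual interface. The file names, columns, and fixes are all hypothetical; the point is the shape of the work: standardize, parse, blend, and filter.

```python
import pandas as pd

# Load a hypothetical raw extract; real files often arrive with
# inconsistent headers, mixed date formats, and stray whitespace.
raw = pd.read_csv("customer_extract.csv")

# Standardize column names: lowercase, underscores instead of spaces.
raw.columns = [c.strip().lower().replace(" ", "_") for c in raw.columns]

# Parse dates and normalize a categorical field.
raw["signup_date"] = pd.to_datetime(raw["signup_date"], errors="coerce")
raw["region"] = raw["region"].str.strip().str.title()

# Blend with a reference table to enrich each record.
regions = pd.read_csv("region_reference.csv")
blended = raw.merge(regions, on="region", how="left")

# Drop records whose dates failed to parse before analysis begins.
clean = blended.dropna(subset=["signup_date"])
```

Each step is scriptable, but multiplied across dozens of sources this work becomes the bottleneck that self-service preparation tools aim to remove.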
Theme #3: The maturing Hadoop ecosystem. Within the past few years, developers across the Hadoop landscape have made progress in taking what had been a disparate collection of open source projects and technologies and moving it toward a more coherent ecosystem. To be sure, most organizations still need to work with vendors’ platforms to achieve the level of integration and management they need. What will be interesting to see at TDWI's Savannah Solution Summit and at Strata and Hadoop World is how the pendulum is swinging in the Hadoop environment between the tradition of freewheeling development focused on innovation and the use of more tightly integrated systems based on frameworks, governance, and management processes.
As we move forward in 2016, I hope to see members of the TDWI and greater business intelligence and analytics community at these events. I also look forward to hearing your thoughts about how these major themes will play out during the course of this year.
Posted on February 10, 2016
By David Stodder, TDWI Director of Research for Business Intelligence
A tsunami of big data is hitting many organizations, and the demand for faster, more frequent, and more varied analytics is riding the crest of that wave. Organizations want to apply predictive analytics, stream analytics, machine learning, and other forms of advanced analytics to their key decisions and operations. They are also experiencing the rise of self-service visual analytics, which is whetting the appetite of nontechnical users throughout organizations who want to do more with data than they can using standard business intelligence (BI) reports and spreadsheets.
Fortunately, technology trends are moving in a positive direction for organizations seeking to expand the business impact of analytics and send data exploration in new directions. Many of the most important innovations are occurring in the open source realm. In the decade since Hadoop and MapReduce were first developed, we have seen a flurry of initiatives, the best of which have become ongoing Apache Software Foundation projects. Today, with the Hadoop 2.0 ecosystem and YARN, it is increasingly possible for organizations to plug their choice of interactive SQL programs, advanced analytics, open source-based processing and execution engines, and other best-of-breed tools into something resembling a unified architecture.
TDWI has just published my new Checklist Report, “Seven Steps to Faster Analytics Processing with Open Source.” We also did a Webinar on this topic that featured discussion with representatives of the four sponsors of the checklist: Cloudera, DataTorrent, Platfora, and Talend. I invite you to check out these resources.
One of the key areas that I wrote about in the checklist—and that was also discussed in the Webinar—was open source stream processing and stream analytics. With interest growing in Internet of Things (IoT) data streams from sensors and other machines, many organizations need to develop a technology strategy for stream processing and stream analytics. The Apache Spark Streaming module, Apache Storm, and Apache Apex are aimed at processing streams of real-time data for analytics. These technologies can be integrated with Apache Kafka, a popular publish-and-subscribe messaging system that can serve data to streaming systems. In the coming year, I am sure we will see rapid evolution of open source technologies for gaining value from real-time data streams.
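As a rough illustration of how these pieces fit together, here is a minimal PySpark sketch, assuming the Spark 1.x-era Spark Streaming Kafka integration; the topic name, broker address, and comma-delimited message format are hypothetical.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils  # Spark 1.x integration module

sc = SparkContext(appName="StreamAnalyticsSketch")
ssc = StreamingContext(sc, batchDuration=5)  # 5-second micro-batches

# Subscribe to a hypothetical Kafka topic; each record arrives as a
# (key, value) pair served by Kafka's publish-and-subscribe log.
stream = KafkaUtils.createDirectStream(
    ssc, ["sensor-events"], {"metadata.broker.list": "broker:9092"})

# A simple stream analytic: count events per sensor in each micro-batch,
# assuming messages of the form "sensor_id,reading,timestamp".
counts = (stream
          .map(lambda kv: (kv[1].split(",")[0], 1))
          .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()
```

Kafka decouples the producers of the sensor data from whichever engine consumes it, which is why it pairs so naturally with Spark Streaming, Storm, or Apex.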
Other important topics that we discussed in the Webinar and that I covered in the report are interactive SQL-based querying of Hadoop systems and data integration and preparation. Good interactivity with Hadoop data, which includes the ability to send ad hoc SQL queries and receive responses in a reasonable time, is critical to analytics. Until recently, however, interactivity with Hadoop data was slow and difficult. New options involving SQL-on-Hadoop, Hive/Spark integration, and packaged MapReduce-based big data discovery are improving performance and making interactivity easier for users and developers. Data integration is also getting a push from Spark: programs for data integration and preparation can use its in-memory processing and generally better performance to quicken what are often the most time-consuming steps in BI and analytics.
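To ground the interactive SQL piece, here is a minimal sketch using the Spark 1.x HiveContext to run an ad hoc query directly against files in HDFS; the path, schema, and query are hypothetical.

```python
from pyspark import SparkContext
from pyspark.sql import HiveContext  # Spark 1.x entry point for SQL on Hadoop data

sc = SparkContext(appName="AdHocSQLSketch")
sqlContext = HiveContext(sc)

# Read Parquet files already sitting in HDFS -- no prior load into a warehouse.
clicks = sqlContext.read.parquet("hdfs:///data/clickstream")
clicks.registerTempTable("clicks")

# The kind of ad hoc query a BI tool or analyst might issue interactively.
top_pages = sqlContext.sql("""
    SELECT page, COUNT(*) AS visits
    FROM clicks
    GROUP BY page
    ORDER BY visits DESC
    LIMIT 10
""")
top_pages.show()
```

Because planning and execution happen in memory where possible, response times can drop from minutes to seconds, which is the difference between batch reporting and genuine exploration.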
I expect an active year ahead in open source-based technologies for BI and analytics and will be observing them closely in my 2016 research and analysis.
Hyperlinks embedded in this blog:
Cloudera: http://www.cloudera.com
DataTorrent: https://www.datatorrent.com/
Platfora: http://www.platfora.com/
Talend: http://www.talend.com/
Apache Spark Streaming: https://spark.apache.org/streaming/
Apache Storm: https://storm.apache.org/
Apache Kafka: http://kafka.apache.org/
Apache Apex: http://apex.incubator.apache.org/
Posted by David Stodder on December 21, 2015
By David Stodder, TDWI Director of Research for Business Intelligence
We are past the halfway point of 2015. Major League Baseball is celebrating its all-stars in Cincinnati as teams contemplate trades that they hope will make them stronger for the second-half run. Meanwhile, fall sports are starting to stir; National Football League teams open their training camps around the end of the month. Even pumpkin farmers are aware of time passing; to have fully grown pumpkins for Halloween, they need to have their seeds planted by now. While the air is warm and the sun is still high in the sky, it’s a good time to contemplate significant trends in our industry this year.
The top trend on my list would be the flourishing of Apache Spark, the open source parallel processing framework (or “engine”) for developing analytic applications and systems working with big data. If Spark “went supernova in 2014,” as Stephen Swoyer put it in a fine article earlier this year, the energy from its explosion is forcefully generating a lot of industry activity in 2015. And not just among the small, newer vendors: IBM, Intel, Microsoft, and other mainstream vendors have issued major Spark announcements and product releases already this year, with more to come. Describing Spark’s potential impact, IBM experts have called Spark “the next Linux.”
As I learned at Strata in February and even more at the Spark Summit in June, Spark is shaking up the big data realm, which has been dominated by Hadoop, MapReduce, Hive, and Storm technologies. While compatible with them, Spark offers performance and scalability advantages over these technologies, including support for multi-step pipelines that reduce the wait between steps and for in-memory data sharing.
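A small PySpark sketch makes the in-memory data sharing concrete; the file path and record layout here are hypothetical. In a MapReduce-style pipeline, each of the two analyses below would typically re-read and re-parse the data from disk.

```python
from pyspark import SparkContext

sc = SparkContext(appName="MultiStepPipelineSketch")

# Load a hypothetical tab-delimited log from HDFS and parse it once.
events = (sc.textFile("hdfs:///data/events.log")
            .map(lambda line: line.split("\t")))

# cache() keeps the parsed records in memory so the steps
# below share one dataset instead of re-reading it from disk.
events.cache()

# Step 1: count events by type (assume the type is the first field).
by_type = events.map(lambda e: (e[0], 1)).reduceByKey(lambda a, b: a + b)

# Step 2: count distinct users (assume the user ID is the second field),
# reusing the same cached dataset.
users = events.map(lambda e: e[1]).distinct()

print(by_type.collect())
print(users.count())
```

Multi-step flows like this are the norm in analytics work, which is why caching intermediate data pays off so broadly.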
One of Spark’s most important attributes is a unified approach to managing and interacting with a greater diversity of data. The Spark framework can support not only batch processing a la Hadoop but also interactive SQL, real-time processing, machine learning, and stream analytics. At Strata, I met with Matei Zaharia, CTO of Databricks, which was founded by Zaharia and other members of the University of California, Berkeley’s AMPLab team that created Spark and launched it as an Apache project. He did not envision organizations being satisfied with putting all their data into massive Hadoop data lakes; he saw instead increasing diversity in data sources that users seek to access, which requires the unified framework and processing layer that Spark provides.
Spark has changed the parameters of the debate about how SQL-based business intelligence and visual analytics tools and application users might access big data. With Spark SQL, one of the four primary AMPLab-developed libraries that fit into the Spark framework, organizations could bypass some of the steps that have been necessary to move and transform Hadoop files into data warehouses before they can fully analyze the data. Application programming interfaces, such as SparkR for R language programming, are broadening the toolkit available for analytics.
Spark is not as mature as Hadoop or the SQL-on-Hadoop offerings in the market. Spark is also not the only “star” in the open source interactive analytic SQL query galaxy; Presto, which is now strongly backed by Teradata, is another interesting distributed SQL query engine to watch. All of these technologies are enabling organizations to do broader and deeper analytics with data and are becoming important parts of emerging diverse, “hybrid” data architectures (pardon a shameless plug: this topic will be covered at our Solution Summit in Scottsdale later this year).
Spark is a major trend in 2015. What are other trends you are seeing? I would be interested to hear your thoughts.
Hyperlinks embedded in this blog:
Apache Spark: https://spark.apache.org/
Swoyer article: http://tdwi.org/articles/2015/01/06/apache-spark-next-big-thing.aspx
IBM announcement: https://www-03.ibm.com/press/us/en/pressrelease/47107.wss
Intel: https://software.intel.com/sites/campaigns/sparks/IgnitingSparks.php
Microsoft: http://azure.microsoft.com/blog/2015/07/10/interactive-analytics-on-big-data-with-the-release-of-spark-for-azure-hdinsight/
“the next Linux”: https://youtu.be/CrGB_2GJ-fA
Strata: http://strataconf.com/
Spark Summit: https://spark-summit.org/
Databricks: http://www.databricks.com/
AMPLab: https://amplab.cs.berkeley.edu/
Presto: https://prestodb.io/
Teradata Presto announcement: http://www.teradata.com/News-Releases/2015/Teradata-Launches-First-Enterprise-Support-for-Presto/?LangType=1033&LangSelect=true
Posted by David Stodder on July 13, 2015
As my flight west from Orlando began its descent into San Francisco, I thought about how touching ground was a good metaphor for the just-completed TDWI World Conference. The theme of the conference was “Emerging Technologies 2014,” but one of my strongest impressions from the keynotes and sessions was the deflation of the hype surrounding those emerging technologies. Speakers addressed what’s new and exciting in business intelligence, big data, analytics, the “Internet of things,” data warehousing, and enterprise data management. However, they were careful to point out potential weaknesses in claims made by proponents of the new technologies and where spending on the new stuff just because it’s new could be an expensive mistake.
Setting the tone on Monday morning in their “Shiny Objects Show” keynote presentation, Marc Demarest and Mark Madsen debated pros and cons of new technologies, including cloud (the pursuit of “instant gratification”), in-memory computing, visualization, and Hadoop. Overall, they advised attendees to be wary of hype. “Strike out every adjective on the marketing collateral piece and see what’s left,” Demarest advised. The speakers were able to drill down to what are truly significant emerging trends, helping attendees focus on those instead of being distracted by the noise.
Evan Levy’s “Tipping the Sacred Cows of Data Warehousing” session was similarly educational. While deflating hype about various emerging technologies, Levy at the same time advised his audience to always question the value proposition of existing systems and practices to see if there might be a better way. He took particular aim at operational data stores (ODSs), noting that database and data integration technologies have matured to the point where maintaining an ODS is unnecessary.
I caught part of Cindi Howson’s session, “Cool BI: The Latest Innovations.” With guest appearances by some leading vendors to demo aspects of their products, the session covered promises and challenges inherent in several key emerging BI trends, including mobile BI, cloud BI, and visual data discovery. Cindi has just published the second edition of her book, Successful Business Intelligence, which offers a combination of interesting case studies and best practices advice to help organizations get BI projects off on the right foot and keep them going strong.
The Thursday keynote by Krish Krishnan and Fern Halper introduced TDWI’s Big Data Maturity Model Assessment Tool. Krish and Fern have been working on this project throughout 2013. It is a tool designed to help organizations assess their level of maturity across five dimensions important to realizing value from big data analytics: organization, infrastructure, data management, analytics, and governance. It is the first assessment tool of its kind. Taking such an assessment can help organizations look past the industry hype to gain a “grounded” view of where they are and what areas they need to address with better technologies and methods. Check it out!
Grounded: that’s where my plane is now, at SFO. Time to head home.
Posted by David Stodder on December 13, 2013
We just concluded the TDWI Big Data Analytics Solution Summit in Austin, Texas (September 15–17). It was a great success; many thanks go to our speakers, sponsors, TDWI colleagues who managed the event, and to everyone who attended. A special thanks to Krish Krishnan, who co-chaired the conference. We are already planning the 2014 Big Data Analytics Solution Summits to be held in the spring and fall, so keep an eye out for details on these events if you are interested in attending.
In Austin, I had the chance to talk with a broad range of attendees. Some were in the early stages of planning and technology acquisition for big data analytics, while others were in the middle of ongoing, funded projects involving enterprise data warehouses, analytic platforms, Hadoop, Hive, MapReduce, and related technologies. We had data scientists and BI and data warehouse architects in attendance as well as business and IT leadership.
I heard exciting tales of initiatives driven by C-level executives who were pushing hard to gain competitive advantages by infusing new business ventures with richer data insights about customer behavior, product and service affinity, and process optimization. It was clear that in the often confusing world of big data, where organizations are on a voyage of discovery, it is a major plus to have high-level leadership that can define objectives and desired outcomes.
Briefly, here are three takeaways from the Summit:
- Finding professionals with big data skills remains a huge challenge. In my introductory remarks at the Summit, I reported on results of our latest TDWI Technology Survey, which asked attendees at the August 2013 World Conference in San Diego to rank their big data challenges. The survey found that dealing with data variety and complexity is the biggest challenge right now, followed by data volume and data distribution. However, when I wrote the survey, I neglected to include finding skilled professionals among the challenges that attendees could rank. In conversations with Summit attendees, this was most often cited as their biggest challenge.
- Big data analytics is about speed. In both presentations and sponsor panel discussions, “speed” was cited numerous times as the chief benefit sought from big data analytic discovery. Organizations want faster speed to insight than they are getting from traditional BI and data warehousing systems; they know that if they can apply insights about customer behavior, marketing campaign performance, projected margins, and other concerns faster, they will save their organizations money and create business advantages. David Mariani, CEO of @Scale, Inc., and former VP of engineering at the social analytics data services provider Klout, gave a great presentation that brought into focus why Hadoop has been so valuable. Mariani discussed why emerging interactive query engines like Cloudera’s Impala and Shark (the Spark-based SQL engine from UC Berkeley’s AMPLab) will change the game by adding significant speed-to-insight capabilities to the Hadoop environment.
- Integrating data views is essential to realizing big data value. Some of the most compelling case studies at the conference were about how organizations can build profitable ventures based on a foundation of integrated data analysis. Dr. Tao Wu, lead data scientist at Nokia’s Data and Analytics organization, offered a powerful case study presentation about Nokia’s HERE business. With a centralized analytics platform rather than disconnected silos, Nokia has been able to improve products by analyzing the combination of mobile and location data.
Posted by David Stodder on September 24, 2013
Good information and analytics are vital to enabling organizations of all stripes to survive tumultuous changes in the healthcare landscape. The latest issue of TDWI’s What Works in Healthcare focuses on data-driven transformations in healthcare. I wrote an article for the issue that looks at some of the business intelligence and analytics issues surrounding the transition from a traditional, fee-for-service system to a value-based, “continuum of care” approach. One thing is clear: The importance of data and information integration as the fabric of this approach cannot be overstated.
A continuum (or “continuity”) of care is one in which a patient’s care experiences are connected across multiple providers: doctors, therapists, clinics, hospitals, pharmacies, and so on, including social programs. The traditional, fee-for-service approach has encouraged a disconnected experience for patients; visits to providers are isolated events, and the patient data from each lives in a disparate silo. This disconnect increases the risk of patients getting the wrong treatments, taking medications improperly due to poor follow-up, or falling through the cracks entirely until there is an emergency. When patients engage with healthcare only when there is an emergency, costs go up. If there is poor follow-up after a hospital or emergency care visit, there is a greater likelihood that patients will have to be readmitted soon for the same problem.
Information integration plays a key role in the business model convergence that many experts envision as essential to improving care. “We see new partnerships or communities of care forming to improve collaboration across boundaries,” said Karen Parrish, IBM VP of Industry Solutions for the Public Sector, during a recent conversation about IBM’s Smarter Care. IBM’s ambitious program, announced in May, “enables new business and financial models that encourage interaction among government and social programs, healthcare practitioners and facilities, insurers, employers, life sciences companies and citizens themselves,” according to the company. Improving the continuum of a particular patient’s care among these participants will require good quality data and fewer barriers to the flow of information so that the right caregivers are involved, depending on the circumstances.
At the center of this information flow must be the patient. “Access to the unprecedented amount of data available today creates an opportunity for deeper insight and earlier intervention and engagement with the patient,” said Parrish. This includes unstructured data, such as doctors’ notes. In an insightful interview with TDWI’s Linda Briggs, Ted Corbett, founder of Vizual Outcomes (and a speaker at the upcoming TDWI BI Executive Summit in San Diego), points out that while unstructured data “houses some of the richest data in the hospital system…there is little consistency across providers in note format, which makes it difficult to access this rich store of information.”
To improve the speed and quality of unstructured data analysis, IBM puts forth its cognitive computing engine Watson, which understands natural language. While Watson and cognitive computing are topics for another day, it’s clear that when we talk about information integration in healthcare, we have to remember that the vast majority of this information is unstructured. There will be increasing demand to apply machine learning and other computing power to draw intelligence from an integrated view of multiple sources of this information to improve patient care and treatment.
Posted by David Stodder on July 17, 2013