TDWI Blog

Growing Use Cases for Learning R and Python

By Fern Halper, VP Research, Advanced Analytics

There was a time when choosing a programming language for data analysis involved essentially no choice at all. The tools were few, and they were usually developed and maintained by individual corporations that, though they ensured a reliable level of quality, could be difficult to work with and slow to fix bugs or introduce new features. The landscape has changed, though.

Thanks to the Web, the open source software development model has shown that it can produce robust, stable, mature products that enterprises can rely upon. Two such products are of special interest to data analysts: Python and R. Python is an interpreted, interactive, object-oriented scripting language created in 1991 and now stewarded by the Python Software Foundation. R, which first appeared at roughly the same time, is a language and software environment for statistical computing and graphics that is supported by the R Foundation for Statistical Computing.

Each comes with a large and active community of innovative developers and has enormous resources readily available through libraries for analytics and processing—libraries such as NumPy, SciPy, and scikit-learn for data manipulation and analytics in Python, and purrr, ggplot2, and rga in R. These libraries can go a long way toward speeding time to value for today’s businesses.
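As a rough illustration of what that library support looks like in practice, here is a minimal Python sketch of the load-fit-evaluate loop an analyst runs every day. It uses NumPy-backed arrays and a scikit-learn regression model on one of scikit-learn's small built-in datasets; the dataset and model choice are illustrative, not a recommendation.

```python
# Minimal sketch: the Python analytics stack mentioned above.
# scikit-learn supplies the data and model; the arrays are NumPy-backed.
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# A small built-in dataset stands in for real business data.
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit a model, then score it on held-out data (R^2).
model = LinearRegression().fit(X_train, y_train)
print("R^2 on held-out data:", round(model.score(X_test, y_test), 3))
```

The same few lines scale from a quick exploratory check to the starting point of a production model, which is much of why these libraries have become the default toolkit.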

Python pro Paul Boal and long-time TDWI instructor Deanne Larson (who will both be bringing their Quick Camps to Accelerate Seattle) see the value in learning these tools.

“R is very popular with data scientists,” Larson says. “They’re using it for data analysis, extracting and transforming data, fitting models, drawing inferences, even plotting and reporting the results.”

Scaling to the Enterprise Level

Debraj GuhaThakurta, senior data and applied scientist at Microsoft, offers a caution. “Although R has a large number of packages and functions for statistics and machine learning,” he says, “many data scientists and developers today do not have the familiarity or expertise to scale their R-based analytics or create predictive solutions in R within databases.” Work is needed, he says, to create and deploy predictive analytics solutions in distributed computing and database environments.

And although there are adjustments to be made at the enterprise level to integrate and scale the use of these technologies, the use cases continue to grow. According to Natasha Balac of Data Insight Discovery, Inc., the number of tools available for advanced analytics techniques such as machine learning has exploded to the point where help is needed to navigate the field. She chalks it up to increasingly affordable and scalable hardware. “It’s not limited to any one vertical either,” Balac says. “Users need help integrating their experience, goals, timelines, and budgets to sort out the optimal use cases for these tools—both for individuals and teams.”

There are significant resources available to companies and data professionals to close the gap between skilled users and implementation. TDWI has integrated new courses on machine learning, AI, and advanced analytics into events like TDWI Accelerate. The next event will be held in Seattle on October 16-18; check out the full agenda here.

Build your expertise. Drive your organization’s success. Advance your career. Join us at TDWI Accelerate, October 16-18 in Seattle, WA.

Posted on July 26, 2017


Leading Mission-Critical Analytics Teams and Programs

By Meighan Berberich, President, TDWI

Analytics and data science have moved to the forefront of business decision making. The size and scope of the organizations involved, and the complexity of the tools and technologies that support these mission-critical initiatives, only continue to grow. It is essential for analytics leaders to maintain focus on the key factors that drive the success of their analytics teams and deployments.

TDWI Accelerate will provide analytics leaders with insight not only on what’s new (and what’s next) in advanced analytics but also on the factors beyond technology that are instrumental to driving business value with data.

Check out our leading sessions during this interactive and engaging three-day event:

Building a High-Performance Analytics & Data Science Team
Bill Franks, Chief Analytics Officer, International Institute for Analytics

This talk will discuss how a successful modern-day analytics and data science organization can grow, mature, and succeed. Topics will include guidance on organizing, recruiting and retaining talent, and instilling a winning culture within an analytics and data science organization.

Pay No Attention to the Man behind the Curtain
Mark Madsen, President, Third Nature, Inc.

Using data science to solve problems depends on many factors beyond technology: people, skills, processes, and data. If your goal is to build a repeatable capability, then you need to address parts that are rarely mentioned. This talk will explain some of the less-discussed aspects of building and deploying machine learning so that you can better understand the work and what you can do to manage it.

From BI to AI - Jumpstart your ML/AI projects with Data Science Super Powers
Wee Hyong Tok, Principal Data Science Manager, Microsoft Corporation

Why do some companies drown in volumes of data, while others thrive on distilling the data in their data warehouses and databases into golden strategic advantages? How do business stakeholders and data scientists work together to evolve from BI to ML/AI, and leverage data science in co-creating new value for the company? Join Wee Hyong in this session as he shares five super powers that will help you jumpstart your ML/AI projects.

Designing for Decisions
Donald Farmer, Principal, TreeHive Strategy

Over the years, Business Intelligence has grown in many directions. Today, the practice can include data mining, visualization, Big Data and self-service experiences. But we’re still, fundamentally, in the business of decision support.

As analysts and developers, we can support better business decisions more effectively if we understand the cognitive aspects of data discovery, decision making, and collaboration. Luckily, we have a wealth of fascinating research to draw on, ranging from library science to artificial intelligence.

We’ll explore the implications of some of this research, in particular how we can design an analytic experience that leads to more engaging, more insightful and measurably better decision-making.

View the complete agenda of more than 30 sessions across 3 days.

The time is now. The demand for skills in analytics and data science has reached an all-time high. Everywhere, organizations are building advanced analytics practices to drive business value, improve operations, enrich customer experiences, and so much more.

Build your expertise. Drive your organization’s success. Advance your career. Join us at TDWI Accelerate, October 16-18 in Seattle, WA.

Posted on July 21, 2017


Some People Call Me the Data Wrangler, Some Call Me the Gangster of Prep

By Meighan Berberich, President, TDWI

Data prep. Wonderful, terrible data prep. According to John Akred of Silicon Valley Data Science, “it’s a law of nature that 80% of data science” is data prep. Although our surveys average closer to 60%, even that’s an awful lot of time to spend not analyzing data, interpreting results, and delivering business value—the real purpose of data science.

Unfortunately, real-world data doesn’t come neatly prepackaged and ready to use. It’s raw, messy, sparse, and scattered across a million disparate sources. It can be dirty, poorly formatted, unclear, undocumented, or just plain wrong. One can easily see what makes Exaptive data scientist Frank Evans ask, “Are we data scientists or data janitors?”

The news isn’t all bleak, though. If there’s one thing we know, it’s that the data scientist’s mindset is perfectly suited to grappling with a seemingly intractable problem and coming up with answers. For example, even Evans’ cynical-seeming question isn’t offered without some solutions.

“Most projects are won or lost at the wrangling and feature engineering stage,” Evans says. “The right tools can make all the difference.” We have a collection of best practices and methods for wrangling data, he offers, such as reformatting it to make it more flexible and easier to work with. There are also methods for feature engineering to derive from raw data the exact elements and structures you want to test.
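To make that concrete, here is a minimal, hypothetical sketch of the two steps Evans describes: wrangling raw records into a workable shape, then engineering features from them. It uses pandas purely as an illustration; Evans does not name a specific tool, and the data and column names are invented.

```python
import pandas as pd

# Hypothetical raw order data: inconsistent types, bad values, missing fields.
raw = pd.DataFrame({
    "customer_id": [101, 101, 102, 103, 103, 103],
    "order_date": ["2017-01-03", "2017-02-14", "bad-date", "2017-01-20", None, "2017-03-02"],
    "amount": ["19.99", "5.00", "7.50", "N/A", "12.00", "3.25"],
})

# Wrangling: coerce types and drop rows that cannot be repaired.
clean = raw.assign(
    order_date=pd.to_datetime(raw["order_date"], errors="coerce"),
    amount=pd.to_numeric(raw["amount"], errors="coerce"),
).dropna()

# Feature engineering: derive the per-customer elements a model will test.
features = clean.groupby("customer_id").agg(
    order_count=("amount", "size"),
    total_spend=("amount", "sum"),
    last_order=("order_date", "max"),
)
print(features)
```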

Akred is similarly solutions-oriented. His many years of experience applying data science in industry have allowed him to develop a framework for evaluating data.

“You have data in your organization. So you need to locate it, determine if it’s fit for purpose, and decide how to fill any gaps,” he says. He is equally pragmatic about the need to navigate the political and technical aspects of sourcing your data—something that is often neglected.
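One simple way to act on the "fit for purpose" step is a quick report on coverage and completeness before any modeling begins. The sketch below is a minimal, hypothetical version of such a check; the thresholds, columns, and sample data are invented and are not part of Akred's framework.

```python
import pandas as pd

def fit_for_purpose(df, required_columns, min_rows=100):
    """Rough checks: are the needed columns present, is there enough data,
    and how large are the gaps that would have to be filled?"""
    return {
        "missing_columns": [c for c in required_columns if c not in df.columns],
        "too_few_rows": len(df) < min_rows,
        "null_share": df.isna().mean().round(2).to_dict(),
    }

# Hypothetical usage against a small extract.
sample = pd.DataFrame({"customer_id": [1, 2, None], "churned": [0, 1, 1]})
print(fit_for_purpose(sample, ["customer_id", "churned", "signup_date"], min_rows=3))
```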

Data Exploring—Literally

Wes Bernegger of Periscopic takes a somewhat more playful tack.

“The road to uncovering insight and narratives in your data begins with exploration,” he says. “But though there are all kinds of tools to help you analyze and visualize your data, it’s still mostly an undefined process.” Bernegger suggests coming to the task with the attitude of an old-fashioned explorer.

“If you plan your voyage and are prepared to improvise with relentless curiosity,” he says, “you can often come away with unexpected discoveries and have more fun along the way.” Bernegger advises laying out a system for data exploration, from wrangling and tidying to visualization, through many rounds of iteration, and stocking up on some tools (such as continuing education) to help you find your way in unfamiliar terrain.
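In code, that explorer's loop can be as simple as the sketch below: tidy, summarize, visualize, then circle back with a sharper question. The synthetic data stands in for whatever source you are exploring; it illustrates the habit, not a prescribed workflow.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for a dataset pulled from a source system.
rng = np.random.default_rng(7)
df = pd.DataFrame({
    "Age ": rng.integers(18, 70, 200),                 # note the messy column name
    "Satisfaction": rng.normal(6, 2, 200).clip(0, 10),
})

# Round 1: tidy -- normalize column names.
df.columns = [c.strip().lower() for c in df.columns]

# Round 2: summarize to decide where to look next.
print(df.describe())

# Round 3: visualize a candidate relationship, then refine and repeat.
df.plot.scatter(x="age", y="satisfaction")
plt.show()
```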

Build your expertise. Drive your organization’s success. Advance your career. Join us at TDWI Accelerate, October 16-18 in Seattle, WA.

Posted on July 19, 2017


Learn the Most Valuable Visualization Skills From Industry’s Best

By Meighan Berberich, President, TDWI

Communication—the process by which information is exchanged between individuals. In the analytics field, we like to call it “data visualization,” but it’s really just a particular form of communication. There’s nothing special about that. Even bacteria can communicate with each other. So why can it be so difficult for data professionals to get their meaning across?

There are few other areas where art and science collide so directly. Effective data visualization requires its practitioners to constantly thread the needle between the art of the visual (How many colors is too many? Will viewers tune out at another bar chart?) and the science of the numbers. In addition, there can be a lot riding on a visualization’s effectiveness—business opportunities lost, warning signs missed, promising applications abandoned.

As Nick Kelly, vice president of BluLink Solutions, says, “many analytics projects start well intentioned” but are never fully adopted by frontline business users. Though there are a number of potential reasons, Kelly identifies poor user experience as a common one.

Designing With Data

Dave McColgin, executive creative director of Seattle-based design firm Artefact, takes a broader view. “We’re still exploring and experimenting how we use, share, and communicate huge amounts of information,” he says. Artefact’s own website speaks as much about its data and research efforts as about its jazzy portfolio pieces—a rarity among design firms.

“The task is to transform complex information into designs that engage people and empower them to act,” McColgin says.

Not to be outdone, Datarella’s Joerg Blumtritt sees even greater potential. “In addition to data storytelling, data journalism, and even simple dashboards, data has grown into a medium for creative output.”

Blumtritt is a co-author of “The Slow Media Manifesto,” which emphasizes choosing one’s media ingredients slowly and with care. “Contemporary artists are still struggling to find the language for their output in the post-internet age. Parametric design, algorithmic architecture, and rapid prototyping technologies have redefined the relationship between creator and artist tools.”

You can catch more from each of these visionaries at TDWI Accelerate Seattle, October 16-18.

The Data Scientist’s Guide to User Experience
Nick Kelly, Vice President, BluLink Solutions

This session takes a practical approach to addressing user experience problems—from strategies such as conducting user interviews at project inception through to the completion of the project and addressing user adoption and sharing of insights.

  • Core UX principles to apply to analytics requirements gathering
  • How a workshop format can address stakeholder challenges
  • Using wire-framing to design the end state
  • Key steps to operationalize insights
  • Reasons why sharing and gamification matter

Beyond Visualization: Designing Data For Insights And Action
Dave McColgin, Executive Creative Director, Artefact

Attendees of this session will take away tangible methods for designing data for people. Dave will share how leading design and innovation consultancy Artefact approaches the design of data visualization and analytics tools, based on a range of desired outcomes and audiences and using examples from award-winning projects like the SIMBA breast cancer decision tool, USAFacts, and more.

Data Art: Beyond Infographics
Joerg Blumtritt, CEO, Datarella

When it comes to data visualizations, we usually think of infographics. However, contemporary artists have been using data for critical examinations of technology and its impact on society, including surveillance and self-determination. This talk takes you to the very edges of what is being done with data in mediums ranging from video, software, and websites to hardware, kinetic machines, and robotics.

View the complete agenda of more than 30 sessions across 3 days.

The time is now. The demand for skills in analytics and data science has reached an all-time high. Everywhere, organizations are building advanced analytics practices to drive business value, improve operations, enrich customer experiences, and so much more.

Build your expertise. Drive your organization’s success. Advance your career. Join us at TDWI Accelerate, October 16-18 in Seattle, WA.

Posted on July 17, 2017


Dimensional Models in the Big Data Era

Are they still relevant?

By Chris Adamson, Founder and BI Specialist, Oakton Software LLC

Technological advances have enabled a breathtaking expansion in the breadth of our BI and analytic solutions. On the surface, many of these technologies appear to threaten the relevance of models in general, and of the dimensional model in particular. But a deeper look reveals that the value of the dimensional model rises with the adoption of big data technologies.

The Dimensional Model of Yesterday

The dimensional model rose to prominence in the 1990s as data warehouse architectures evolved to include the concept of the data mart. During this period, competing architectural paradigms emerged, but all leveraged the dimensional model as the standard for data mart design. The now-familiar “stars” and “cubes” that comprise a data mart became synonymous with the concept of the dimensional model.

In fact, schema design is only one of several functions of the dimensional model. A dimensional model represents how a business measures something important, such as an activity. For each process described, the model captures the metrics (if any) that describe the process, along with the associated reference data. These models serve several functions, including:

  • Capture business requirements (information needs by business function)
  • Manage scope (define and prioritize data management projects)
  • Design data marts (structure data for query and analysis)
  • Present information (a business view of managed data assets)

Because the dimensional model is so often instantiated in schema design, its other functions are easily overlooked. As technologies and methods evolve, some of these functions are beginning to outweigh schema design in terms of importance to data management programs.

New Technology and Data Management Programs

Since the 1990s, business uses for data assets have multiplied dramatically. Data management programs have expanded beyond data warehousing to include performance management, business analytics, data governance, master data management, and data quality management.

These new functions have been enabled, in part, by advances in technology. Relational and multidimensional databases can sustain larger data sets with increased performance. NoSQL technology has unlocked new paradigms for organizing managed data sets. Statistical analysis and data mining software have evolved to support more sophisticated analysis and discovery. Virtualization provides new paradigms for data integration. Visualization tools promote communication. Governance and quality tools support management of an expanding set of information assets.

As the scope of data management programs has grown, so too has the set of skills required to sustain them. The field of data management encompasses a broader range of specialties than ever before. Teams struggle to keep pace with the expanding demands, and data generalists are being stretched even thinner. These pressures suggest that something must give.

Amidst the buzz and hype surrounding big data, it’s easy to infer that dimensional modeling skills might be among the first to go. It is now possible to manage data in a non-relational format such as a key-value store, document collection, or graph. New processing paradigms support diverse data formats, ranging from highly normalized structures to wide, single-table designs. Schema-less technologies do not require a model to ingest new data. Virtualization promises to bring together disparate data sets regardless of format, and visualization promises to enable self-service discovery.

Coupled with the notion that the dimensional model is nothing more than a form of schema design, these developments imply it is no longer relevant. But the reality is precisely the opposite.

Dimensional Models in the Age of Big Data

In the wake of new and diverse ways to manage data, the dimensional model has become more important, not less. As a form of schema design, it persists: the news of its death has been greatly exaggerated. At the same time, the prominence of its other functions has increased.

Schema Design
The dimensional model’s best-known role, the basis for schema design, is alive and well in the age of big data. Data marts continue to reside on relational or multi-dimensional platforms, even as some organizations choose to migrate away from traditional vendors and into the cloud.  

While NoSQL technologies are contributing to the evolution of data management platforms, they are not rendering relational storage extinct. It is still necessary to track key business metrics over time, and on this front relational storage reigns. In part, this explains why several big data initiatives seek to support relational processing on top of platforms like Hadoop. Non-relational technology is evolving to support relational; the future still contains stars.
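For readers who have not worked with stars directly, here is a minimal, hypothetical sketch of the shape being described: process metrics in a fact table, reference data in dimension tables, and analysis as a join plus an aggregate. pandas is used only to keep the example self-contained; in practice the same star lives in a relational or multidimensional platform, and the tables and values here are invented.

```python
import pandas as pd

# Dimension table: reference data that describes customers.
dim_customer = pd.DataFrame({
    "customer_key": [1, 2],
    "customer_name": ["Acme Corp", "Globex"],
    "region": ["West", "East"],
})

# Fact table: one row per order, holding the metrics that measure the process.
fact_orders = pd.DataFrame({
    "customer_key": [1, 1, 2],
    "order_amount": [250.0, 100.0, 75.0],
})

# A typical star query: join facts to a dimension, then aggregate by attribute.
report = (
    fact_orders.merge(dim_customer, on="customer_key")
    .groupby("region", as_index=False)["order_amount"].sum()
)
print(report)
```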

The Business View
That said, there are numerous data management technologies that do not require the physical organization of data in a dimensional format, and virtualization promises to bring disparate data together from heterogeneous data stores at the time of query. These forces lead to environments where data assets are spread across the enterprise, and organized in dramatically different formats.

Here, the dimensional model becomes essential as the business view through which information assets are presented and accessed. Like the semantic layers of old, the business view serves as a catalog of information resources expressed in non-technical terms, shielding information consumers from the increasing complexity of the underlying data structures and protecting them from the increasing sophistication needed to formulate a distributed query.

This unifying business view grows in importance as the underlying storage of data assets grows in complexity. The dimensional model is the business’s entry point into the sprawling repositories of available data, and the focal point that makes sense of it all.
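At its simplest, such a business view can be pictured as a mapping from business-friendly terms to physical locations, so consumers never have to know where or how the data is stored. The sketch below is hypothetical; the source and column names are invented for illustration.

```python
# Business-friendly terms mapped to physical storage (all names hypothetical).
BUSINESS_VIEW = {
    "Revenue":         {"source": "warehouse.fact_orders",    "column": "order_amount"},
    "Customer Region": {"source": "crm.dim_customer",         "column": "region"},
    "Web Sessions":    {"source": "hadoop.clickstream_daily", "column": "session_count"},
}

def resolve(business_term):
    """Translate a business term into the physical column a query engine needs."""
    entry = BUSINESS_VIEW[business_term]
    return f"{entry['source']}.{entry['column']}"

print(resolve("Revenue"))  # -> warehouse.fact_orders.order_amount
```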

Information Requirements and Project Scope
As data management programs have expanded to include performance management, analytics, and data governance, information requirements have taken on a new prominence. In addition to supporting these new service areas, they become the glue that links them together. The process-oriented measurement perspective of the dimensional model is the core of this interconnected data management environment.

The dimensional model of a business process provides a representation of information needs that simultaneously drives the traditional facts and dimensions of a data mart, the key performance indicators of performance dashboards, the variables of analytic models, and the reference data managed by governance and MDM. 

In this light, the dimensional model becomes the nexus of a holistic approach to managing BI, analytics, and governance programs. In addition to supporting a unified roadmap across these functions, a single set of dimensional requirements enables their integration. Used at a program level to define the scope of projects, the dimensional model makes possible data marts and dashboards that reflect analytic insights, analytics that link directly to business objectives, performance dashboards that can drill to OLAP data, and master data that is consistent across these functions.

As businesses move to treat information as an enterprise asset, a dimensional model of business information needs has become a critical success factor. It enables the coordination of multiple programs, provides for the integration of their information products, and puts a unifying face on the information resources available to business decision makers.

Chris Adamson is an independent BI and Analytics specialist with a passion for using information to improve business performance. He works with clients worldwide to establish BI programs, identify and prioritize projects, and develop solutions. A recognized expert in the field of BI, he is the author of numerous publications including the books Star Schema: The Complete Reference and Data Warehouse Design Solutions.

Posted on July 11, 2017


The Role of Centralization and Self-Service in a Successful Data Hub

A hub should centralize governance, standards, and other data controls, plus provide self-service data access and data prep for a wide range of user types.

By Philip Russom, Senior Research Director for Data Management, TDWI

I recently spoke in a webinar run by Informatica Corporation, sharing the stage with Informatica’s Scott Hedrick and Ron van Bruchem, a business architect at Rabobank. We three had an interactive conversation where we discussed the technology and business requirements of data hubs, as faced today by data management professionals and the organizations they serve. There’s a lot to say about data hubs, but we focused on the roles played by centralization and self-service, because these are two of the most pressing requirements. Please allow me to summarize my portion of the webinar.

A data hub is a data platform that serves as a distribution hub.

Data comes into a central hub, where it is collected and repurposed. Data is then distributed out to users, applications, business units, and so on.

The feature sets of data hubs vary. Home-grown hubs tend to be feature-poor, because there are limits to what the average user organization can build on its own. By comparison, vendor-built data hubs are more feature-rich, scalable, and modern.

A true data hub provides many useful functions. Two of the highest priority functions are:

  • Centralized control of data access for compliance, governance, security
  • Self-service access to data for user autonomy and productivity

A comprehensive data hub integrates with tools that provide many data management functions, especially those for data integration, data quality, technical and business metadata, and so on. The hallmark of a high-end hub is the publish-and-subscribe workflow, which certifies incoming data and automates broad but controlled outbound data use.

A data hub provides architecture for data and its management.

A quality data hub will assume a hub-and-spoke architecture, but be flexible so users can customize the architecture to match their current data realities and future plans. Hub-and-spoke is the preferred architecture for integration technologies (for both data management and applications) because it falls into obvious, predictable patterns that are easy to learn, design, optimize, and maintain. Furthermore, a hub-and-spoke architecture greatly reduces the number of interfaces deployed compared to a point-to-point approach, which in turn reduces complexity for greater ease of use and maintainability.
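The arithmetic behind that interface reduction is simple: with n systems, point-to-point integration needs one interface per pair of systems, while hub-and-spoke needs only one per system. The toy calculation below (counting each pair once; directional interfaces would roughly double the point-to-point figure) shows how quickly the gap widens.

```python
# Interfaces required to connect n systems under each approach.
def point_to_point(n):
    return n * (n - 1) // 2   # every pair of systems wired directly

def hub_and_spoke(n):
    return n                  # every system connects only to the hub

for n in (5, 10, 20):
    print(f"{n} systems: {point_to_point(n)} point-to-point vs {hub_and_spoke(n)} hub-and-spoke")
```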

A data hub centralizes control functions for data management.

When a data hub follows a hub-and-spoke architecture, it provides a single point of integration that fosters technical standards for data structures, data architecture, data management solutions, and multi-department data sharing. That single point also simplifies important business control functions, such as governance, compliance, and collaboration around data. Hence, a true data hub centralizes and facilitates multiple forms of control, for both the data itself and its usage.

A data hub enables self-service for controlled data access.

Self-service is very important, because it’s what your “internal customers” want most from a data hub. (Some technical users benefit from self-service, too.) Self-service has many manifestations and benefits:

  • Self-service access to data makes users autonomous, because they needn’t wait for IT or the data management team to prepare data for them.
  • Self-service creation of datasets makes users productive.
  • Self-service data exploration enables a wide range of user types to study data from new sources and discover new facts about the business.

These kinds of self-service are enabled by an emerging piece of functionality called data prep, which is short for data preparation and is sometimes called data wrangling or data munging. Instead of overwhelming mildly technical or non-technical users with the richness of data integration functionality, data prep boils it down to a key subset of functions. Data prep’s simplicity and ease of use yield speed and agility. It empowers data analysts, data scientists, DM developers, and some business users to construct a dataset with spontaneity and speed. With data prep, users can quickly create a prototype dataset, improve it iteratively, and publish it or push it into production.

Hence, data prep and self-service work together to make modern use cases possible, such as data exploration, discovery, visualization, and analytics. Data prep and self-service are also inherently agile and lean, thus promoting productive development and nimble business.

A quality hub supports publish-and-subscribe methods.

Centralization and self-service come together in one of the most important functions found in a true data hub, namely publish-and-subscribe (or simply pub/sub). This type of function is sometimes called a data workflow or data orchestration.

Here’s how pub/sub works: data entering the hub is certified and cataloged on the way in, so that it arrives in canonical form, at high quality, and audited, ready for repurposing and reuse. The catalog and its user-friendly business metadata then make it easy for users and applications to subscribe to specific datasets and generic categories of data. That way, users get quality data they can trust, but within the governance parameters of centralized control.
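As a minimal, hypothetical sketch of that flow (not Informatica's implementation or any particular product's API), the snippet below certifies incoming records against required fields, catalogs what passes, and notifies subscribers of the dataset they asked for.

```python
catalog = {}       # dataset name -> certified records
subscribers = {}   # dataset name -> list of callback functions

def subscribe(dataset, callback):
    subscribers.setdefault(dataset, []).append(callback)

def publish(dataset, records, required_fields):
    """Certify incoming records, catalog them, and notify subscribers."""
    certified = [r for r in records
                 if all(r.get(f) is not None for f in required_fields)]
    catalog[dataset] = certified
    for callback in subscribers.get(dataset, []):
        callback(certified)

# A consumer subscribes to customer data; a new load is then published.
subscribe("customers", lambda rows: print("received", len(rows), "certified rows"))
publish("customers",
        [{"id": 1, "region": "West"}, {"id": 2, "region": None}],
        required_fields=["id", "region"])
```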

Summary and Recommendations.

  • Establish a data architecture and stick with it. Rely on a data hub based around a hub-and-spoke architecture, not point-to-point hairballs.
  • Adopt a data hub for the business benefits. At the top of the list would be self-service for data access, data exploration, and diverse analytics, followed by centralized functions for data governance and stewardship.
  • Deploy a data hub for technical advancement. A hub can organize and modernize your infrastructure for data integration and data management, as well as centralize technical standards for data and development.
  • Consider a vendor-built data hub. Home-grown hubs tend to be feature-poor compared to vendor-built ones. When it comes to data hubs, buy it, don’t build it.
  • Demand the important, differentiating functions, especially those you can’t build yourself. This includes pub/sub, self-service data access, data prep, business metadata, and data certification.
  • A modern data hub potentially has many features and functions. Choose and use the ones that fit your requirements today, then grow into others over time.

If you’d like to hear more of my discussion with Informatica’s Scott Hedrick and Rabobank’s Ron van Bruchem, please click here to replay the Informatica Webinar.

Posted on July 12, 2016