By Fern Halper, VP Research, Advanced Analytics
 
There was a time when choosing a programming language for data analysis involved essentially no choice at all. The tools were few, and they were usually developed and maintained by individual corporations that, although they ensured a reliable level of quality, could be difficult to work with and slow to fix bugs or add new features. The landscape has changed, though.
Thanks to the Web, the open source software development model has shown that it can produce robust, stable, mature products that enterprises can rely upon. Two such products are of special interest to data analysts: Python and R. Python is an interpreted, interactive, object-oriented scripting language created in 1991 and now stewarded by the Python Software Foundation. R, which first appeared at roughly the same time, is a language and software environment for statistical computing and graphics supported by the R Foundation for Statistical Computing.
Each comes with a large and active community of innovative developers, and each has enormous resources readily available through libraries for analytics and processing—libraries such as NumPy, SciPy, and scikit-learn for data manipulation and analytics in Python, and purrr, ggplot2, and rga in R. These resources can go a long way toward speeding time to value for today’s businesses.
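To give a feel for the Python side of that stack, here is a minimal, illustrative sketch of the load-split-fit-score workflow those libraries support. It uses scikit-learn’s bundled diabetes dataset purely as stand-in data; the point is the shape of the workflow, not the model.

# A minimal sketch: NumPy arrays in, a scikit-learn model out.
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)             # data arrives as NumPy arrays
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LinearRegression().fit(X_train, y_train)  # fit a simple baseline model
print("R^2 on held-out data:", round(model.score(X_test, y_test), 3))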
Python pro Paul Boal and long-time TDWI instructor Deanne Larson, who will both be bringing their Quick Camps to Accelerate Seattle, see the value in learning these tools.
“R is very popular with data scientists,” Larson says. “They’re using it for data analysis, extracting and transforming data, fitting models, drawing inferences, even plotting and reporting the results.” 
Scaling to the Enterprise Level 
Debraj GuhaThakurta, senior data and applied scientist at Microsoft, offers a caution. “Although R has a large number of packages and functions for statistics and machine learning,” he says, “many data scientists and developers today do not have the familiarity or expertise to scale their R-based analytics or create predictive solutions in R within databases.” Work is needed, he says, to create and deploy predictive analytics solutions in distributed computing and database environments.
And although there are adjustments to be made at the enterprise level to integrate and scale the use of these technologies, the use cases continue to grow. According to Natasha Balac of Data Insight Discovery, Inc., the number of tools available for advanced analytics techniques such as machine learning has exploded to the point where help is needed to navigate the field. She chalks it up to increasingly affordable and scalable hardware. “It’s not limited to any one vertical either,” Balac says. “Users need help integrating their experience, goals, timelines, and budgets to sort out the optimal use cases for these tools—both for individuals and teams.”
There are significant resources available to companies and data professionals to close the gap between skilled users and implementation. TDWI has integrated new courses on machine learning, AI, and advanced analytics into events like TDWI Accelerate. The next event will be held in Seattle on October 16-18; check out the full agenda here.
Build your expertise. Drive your organization’s success. Advance your career. Join us at TDWI Accelerate, October 16-18 in Seattle, WA.
 
    By Meighan Berberich, President, TDWI
 
Analytics and data science have moved to the forefront of business decision making. The size and scope of the organizations involved and the complexity of the tools and technologies that support these mission-critical initiatives only continue to grow. It is critical for analytics leaders to maintain focus on the key factors that drive the success of their analytics teams and deployments.
TDWI Accelerate will provide analytics leaders with insight not only on what’s new (and what’s next) in advanced analytics, but also on the factors beyond technology that are instrumental to driving business value with data.
Check out our leading sessions during this interactive and engaging three-day event:

Building a High-Performance Analytics & Data Science Team
Bill Franks, Chief Analytics Officer, International Institute for Analytics
This talk will discuss how a successful modern-day analytics and data science organization can grow, mature, and succeed, with guidance on organizing, recruiting, and retaining talent and on instilling a winning culture.
Pay No Attention to the Man behind the Curtain
Mark Madsen, President, Third Nature, Inc.
Using data science to solve problems depends on many factors beyond technology: people, skills, processes, and data. If your goal is to build a repeatable capability, then you need to address parts of the work that are rarely mentioned. This talk will explain some of the less-discussed aspects of building and deploying machine learning so that you can better understand the work and what you can do to manage it.
From BI to AI - Jumpstart your ML/AI projects with Data Science Super Powers
Wee Hyong Tok, Principal Data Science Manager, Microsoft Corporation
Why do some companies drown in volumes of data while others thrive by distilling the data in their data warehouses and databases into golden strategic advantages? How do business stakeholders and data scientists work together to evolve from BI to ML/AI and leverage data science to co-create new value for the company? Join Wee Hyong in this session as he shares five super powers that will help you jumpstart your ML/AI projects.
Designing for Decisions
Donald Farmer, Principal, TreeHive Strategy
Over the years, Business Intelligence has grown in many directions. Today, the practice can include data mining, visualization, Big Data, and self-service experiences. But we’re still, fundamentally, in the business of decision support.
As analysts and developers, we can support better business decisions more effectively if we understand the cognitive aspects of data discovery, decision making, and collaboration. Luckily, we have a wealth of fascinating research to draw on, ranging from library science to artificial intelligence.
We’ll explore the implications of some of this research, in particular how we can design an analytic experience that leads to more engaging, more insightful, and measurably better decision making.
View the complete agenda of more than 30 sessions across 3 days.
The time is now. The demand for skills in analytics and data science has reached an all-time high. Everywhere, organizations are building advanced analytics practices to drive business value, improve operations, enrich customer experiences, and so much more.
Build your expertise. Drive your organization’s success. Advance your career. Join us at TDWI Accelerate, October 16-18 in Seattle, WA.
 
    By Meighan Berberich, President, TDWI
 
Data prep. Wonderful, terrible data prep. According to John Akred of Silicon Valley Data Science, “it’s a law of nature that 80% of data science” is data prep. Although our surveys average closer to 60%, even that’s an awful lot of time to spend not analyzing data, interpreting results, and delivering business value—the real purpose of data science.
Unfortunately, real-world data doesn’t come neatly prepackaged and ready to use. It’s raw, messy, sparse, and spread across a million disparate sources. It can be dirty, poorly formatted, unclear, undocumented, or just plain wrong. One can easily see what makes Exaptive data scientist Frank Evans ask, “Are we data scientists or data janitors?”
The news isn’t all bleak, though. If there’s one thing we know, it’s that the data scientist’s mindset is perfectly suited to grappling with a seemingly intractable problem and coming up with answers. For example, even Evans’ cynical-seeming question isn’t offered without some solutions. 
“Most projects are won or lost at the wrangling and feature engineering stage,” Evans says. “The right tools can make all the difference.” We have a collection of best practices and methods for wrangling data, he offers, such as reformatting it to make it more flexible and easier to work with. There are also methods for feature engineering to derive the exact elements and structures that you want to test from raw data.
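To make that concrete, here is a hedged sketch, in Python with pandas, of the kind of reformatting and feature derivation Evans describes. The dataset and every column name are invented for illustration; they are not drawn from any speaker’s material.

# A hypothetical wrangling and feature-engineering pass; all names are invented.
import pandas as pd

raw = pd.DataFrame({
    "order_date": ["2017-01-03", "2017-01-04", None],
    "amount": ["19.99", "5.00", "12.50"],
    "region": ["WEST", "west ", "East"],
})

clean = (
    raw.assign(
        order_date=pd.to_datetime(raw["order_date"]),  # fix types
        amount=pd.to_numeric(raw["amount"]),
        region=raw["region"].str.strip().str.title(),  # normalize messy labels
    )
    .dropna(subset=["order_date"])                     # drop rows we cannot use
)

# Feature engineering: derive the exact elements we want to test
clean["day_of_week"] = clean["order_date"].dt.day_name()
clean["high_value"] = clean["amount"] > 10
print(clean)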
Akred is similarly solutions-oriented. His many years of experience applying data science in industry have allowed him to develop a framework for evaluating data.
“You have data in your organization. So you need to locate it, determine if it’s fit for purpose, and decide how to fill any gaps,” he says. His experience has allowed him to be equally pragmatic about the necessity to navigate the political and technical aspects of sourcing your data—something that can often be neglected.
Data Exploring—Literally 
Wes Bernegger of Periscopic takes a somewhat more playful tack. 
“The road to uncovering insight and narratives in your data begins with exploration,” he says. “But though there are all kinds of tools to help you analyze and visualize your data, it’s still mostly an undefined process.” Bernegger suggests coming to the task with the attitude of an old-fashioned explorer. 
“If you plan your voyage and are prepared to improvise with relentless curiosity,” he says, “you can often come away with unexpected discoveries and have more fun along the way.” Bernegger advises laying out a system for the data exploration practice, from wrangling and tidying to visualization, through many rounds of iteration, and stocking up on some tools (such as continuing education) to help you find your way in unfamiliar terrain.
Build your expertise. Drive your organization’s success. Advance your career. Join us at TDWI Accelerate, October 16-18 in Seattle, WA.
 
    By Meighan Berberich, President, TDWI
 
Communication—the process by which information is exchanged between individuals. In the analytics field, we like to call it “data visualization,” but it’s really just a particular form of communication. There’s nothing special about that. Even bacteria can communicate with each other. So why can it be so difficult for data professionals to get their meaning across?
There are few other areas where Art and Science collide in such a head-on way. Effective data visualization requires its practitioners to be constantly threading the needle between the art of the visual (How many colors is too many? Will viewers tune out at another bar chart?) and the science of the numbers. In addition, there can be a lot riding on a visualization’s effectiveness—business opportunities lost, warning signs missed, promising applications abandoned.
As Nick Kelly, vice president of BluLink Solutions, says, “many analytics projects start well intentioned” but are never fully adopted by frontline business users. Though there are a number of potential reasons, Kelly identifies poor user experience as a common one.
Designing With Data
Dave McColgin, executive creative director of Seattle-based design firm Artefact, takes a broader view. “We’re still exploring and experimenting with how we use, share, and communicate huge amounts of information,” he says. Artefact’s own website speaks as much about its data and research efforts as about its jazzy portfolio pieces—a rarity among design firms.
“The task is to transform complex information into designs that engage people and empower them to act,” McColgin says.
Not to be outdone, Datarella’s Joerg Blumtritt sees even greater potential. “In addition to data storytelling, data journalism, and even simple dashboards, data has grown into a medium for creative output.”
Blumtritt is a co-author of “The Slow Media Manifesto,” which emphasizes choosing one’s media ingredients slowly and with care. “Contemporary artists are still struggling to find the language for their output in the post-internet age. Parametric design, algorithmic architecture, and rapid prototyping technologies have redefined the relationship between creator and artist tools.”
You can catch more from each of these visionaries at TDWI Accelerate Seattle, October 16-18.

The Data Scientist’s Guide to User Experience
Nick Kelly, Vice President, BluLink Solutions
This session takes a practical approach to addressing user experience problems—from strategies such as conducting user interviews at project inception through to the completion of the project and addressing user adoption and sharing of insights.
  - Core UX principles to apply to analytics requirements gathering
- How a workshop format can address stakeholder challenges
- Using wire-framing to design the end state
- Key steps to operationalize insights
- Reasons why sharing and gamification matter
Beyond Visualization: Designing Data For Insights And Action
Dave McColgin, Executive Creative Director, Artefact
Attendees of this session will take away tangible methods for approaching the design of data for people. Dave will share how leading design and innovation consultancy Artefact approaches the design of data visualization and analytics tools, based on a range of desired outcomes and audiences and using examples from award-winning projects like the SIMBA breast cancer decision tool, USAFacts, and more.
Data Art: Beyond Infographics
Joerg Blumtritt, CEO, Datarella
When it comes to data visualizations, we usually think of infographics. However, contemporary artists have been enacting critical examination of technology and its impact on society, such as surveillance and self-determination. This talk takes you to the very edges of what is being done with data in mediums ranging from video, software, and websites to hardware, kinetic machines, and robotics.
View the complete agenda of more than 30 sessions across 3 days.
The time is now. The demand for skills in analytics and data science has reached an all-time high. Everywhere, organizations are building advanced analytics practices to drive business value, improve operations, enrich customer experiences, and so much more.
Build your expertise. Drive your organization’s success. Advance your career. Join us at TDWI Accelerate, October 16-18 in Seattle, WA.
 
    Are they still relevant?
By Chris Adamson, Founder and BI Specialist, Oakton Software LLC
 
Technological advances have enabled a breathtaking expansion in the breadth of our BI and analytic solutions. On the surface, many of these technologies appear to threaten the relevance of models in general, and of the dimensional model in particular. But a deeper look reveals that the value of the dimensional model rises with the adoption of big data technologies.
The Dimensional Model of Yesterday
The dimensional model rose to prominence in the 1990s as data warehouse architectures evolved to include the concept of the data mart. During this period, competing architectural paradigms emerged, but all leveraged the dimensional model as the standard for data mart design. The now-familiar “stars” and “cubes” that comprise a data mart became synonymous with the concept of the dimensional model.
In fact, schema design is only one of several functions of the dimensional model. A dimensional model represents how a business measures something important, such as an activity. For each process described, the model captures the metrics that describe the process (if any) and the associated reference data. These models serve several functions, including:
    - Capture business requirements (information needs by business function)
- Manage scope (define and prioritize data management projects)
- Design data marts (structure data for query and analysis)
- Present information (a business view of managed data assets)
Because the dimensional model is so often instantiated in schema design, its other functions are easily overlooked. As technologies and methods evolve, some of these functions are beginning to outweigh schema design in terms of importance to data management programs.
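To ground the schema-design function, here is a toy star expressed in Python with pandas rather than SQL: one fact table joined to two dimensions to answer a business question. Every table and column name below is invented for illustration.

# A miniature star: a sales fact table and two dimension tables.
import pandas as pd

date_dim = pd.DataFrame({"date_key": [1, 2], "month": ["Jan", "Feb"]})
product_dim = pd.DataFrame({"product_key": [10, 11], "category": ["Bikes", "Helmets"]})
sales_fact = pd.DataFrame({
    "date_key": [1, 1, 2],
    "product_key": [10, 11, 10],
    "sales_amount": [500.0, 45.0, 700.0],
})

# The business question the model was built to answer:
# how did each product category sell by month?
report = (
    sales_fact
    .merge(date_dim, on="date_key")
    .merge(product_dim, on="product_key")
    .groupby(["month", "category"])["sales_amount"]
    .sum()
)
print(report)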
New Technology and Data Management Programs
Since the 1990s, business uses for data assets have multiplied dramatically. Data management programs have expanded beyond data warehousing to include performance management, business analytics, data governance, master data management, and data quality management.
These new functions have been enabled, in part, by advances in technology. Relational and multidimensional databases can sustain larger data sets with increased performance. NoSQL technology has unlocked new paradigms for organizing managed data sets. Statistical analysis and data mining software have evolved to support more sophisticated analysis and discovery. Virtualization provides new paradigms for data integration. Visualization tools promote communication. Governance and quality tools support management of an expanding set of information assets.
As the scope of data management programs has grown, so too has the set of commensurate skills required to sustain them. The field of data management encompasses a broader range of specialties than ever before. Teams struggle to keep pace with the expanding demands, and data generalists are being stretched even thinner. These pressures suggest that something must give.
Amidst the buzz and hype surrounding big data, it’s easy to infer that dimensional modeling skills might be among the first to go. It is now possible to manage data in a non-relational format such as a key-value store, document collection, or graph. New processing paradigms support diverse data formats, ranging from highly normalized structures to wide, single-table designs. Schema-less technologies do not require a model to ingest new data. Virtualization promises to bring together disparate data sets regardless of format, and visualization promises to enable self-service discovery.
Coupled with the notion that the dimensional model is nothing more than a form of schema design, these developments imply that it is no longer relevant. But the reality is precisely the opposite.
Dimensional Models in the Age of Big Data
In the wake of new and diverse ways to manage data, the dimensional model has become more important, not less. Reports of its death as a form of schema design have been greatly exaggerated, and at the same time the prominence of its other functions has increased.
Schema Design
The dimensional model’s best-known role, the basis for schema design, is alive and well in the age of big data. Data marts continue to reside on relational or multidimensional platforms, even as some organizations choose to migrate away from traditional vendors and into the cloud.
While NoSQL technologies are contributing to the evolution of data management platforms, they are not rendering relational storage extinct. It is still necessary to track key business metrics over time, and on this front relational storage reigns. In part, this explains why several big data initiatives seek to support relational processing on top of platforms like Hadoop. Non-relational technology is evolving to support relational; the future still contains stars.
The Business View
That said, there are numerous data management technologies that do not require the physical organization of data in a dimensional format, and virtualization promises to bring disparate data together from heterogeneous data stores at the time of query. These forces lead to environments where data assets are spread across the enterprise and organized in dramatically different formats.
Here, the dimensional model becomes essential as the business view through which information assets are presented and accessed. Like the semantic layers of old, the business view serves as a catalog of information resources expressed in non-technical terms, shielding information consumers from the increasing complexity of the underlying data structures and from the increasing sophistication needed to formulate a distributed query.
This unifying business view grows in importance as the underlying storage of data assets grows in complexity. The dimensional model is the business’s entry point into the sprawling repositories of available data, and the focal point that makes sense of it all.
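As a purely illustrative sketch of that catalog idea, a business view can be as simple as a mapping from business-friendly names to the physical assets and definitions behind them. Every measure, source, and definition below is hypothetical.

# A toy "business view": friendly measure names mapped onto physical sources
# the information consumer never has to see. All entries are invented.
BUSINESS_VIEW = {
    "Monthly Revenue": {
        "source": "warehouse.sales_fact",
        "grain": "month x product",
        "definition": "sum of sales_amount by month",
    },
    "Customer Churn Rate": {
        "source": "hadoop.customer_events",
        "grain": "month",
        "definition": "churned customers / active customers",
    },
}

def describe(measure):
    """Let a business user look up a measure without knowing where it lives."""
    entry = BUSINESS_VIEW[measure]
    return f"{measure}: {entry['definition']} (grain: {entry['grain']})"

print(describe("Monthly Revenue"))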
Information Requirements and Project Scope
As data management programs have expanded to include performance management, analytics, and data governance, information requirements have taken on new prominence. In addition to supporting these new service areas, they become the glue that links them together. The process-oriented measurement perspective of the dimensional model is the core of this interconnected data management environment.
The dimensional model of a business process provides a representation of information needs that simultaneously drives the traditional facts and dimensions of a data mart, the key performance indicators of performance dashboards, the variables of analytic models, and the reference data managed by governance and MDM.
In this light, the dimensional model becomes the nexus of a holistic approach to managing BI, analytics, and governance programs. In addition to supporting a unified roadmap across these functions, a single set of dimensional requirements enables their integration. Used at a program level to define the scope of projects, the dimensional model makes possible data marts and dashboards that reflect analytic insights, analytics that link directly to business objectives, performance dashboards that can drill to OLAP data, and master data that is consistent across these functions.
As businesses move to treat information as an enterprise asset, a dimensional model of business information needs has become a critical success factor. It enables the coordination of multiple programs, provides for the integration of their information products, and puts a unifying face on the information resources available to business decision makers.
Chris Adamson is an independent BI and Analytics specialist with a passion for using information to improve business performance. He works with clients worldwide to establish BI programs, identify and prioritize projects, and develop solutions. A recognized expert in the field of BI, he is the author of numerous publications, including the books Star Schema: The Complete Reference and Data Warehouse Design Solutions.
 
    A hub should centralize governance, standards, and other data controls, plus provide self-service data access and data prep for a wide range of user types.
By Philip Russom, Senior Research Director for Data Management, TDWI
I recently spoke in a webinar run by Informatica Corporation, sharing the stage with Informatica’s Scott Hedrick and Ron van Bruchem, a business architect at Rabobank. We three had an interactive conversation where we discussed the technology and business requirements of data hubs, as faced today by data management professionals and the organizations they serve. There’s a lot to say about data hubs, but we focused on the roles played by centralization and self-service, because these are two of the most pressing requirements. Please allow me to summarize my portion of the webinar.
A data hub is a data platform that serves as a distribution hub.
Data comes into a central hub, where it is collected and repurposed. Data is then distributed out to users, applications, business units, and so on.
The feature sets of data hubs vary. Home-grown hubs tend to be feature-poor because there are limits to what the average user organization can build on its own. By comparison, vendor-built data hubs are more feature-rich, scalable, and modern.
A true data hub provides many useful functions. Two of the highest priority functions are:
    - Centralized control of data access for compliance, governance, security
- Self-service access to data for user autonomy and productivity
A comprehensive data hub integrates with tools that provide many data management functions, especially those for data integration, data quality, technical and business metadata, and so on. The hallmark of a high-end hub is the publish-and-subscribe workflow, which certifies incoming data and automates broad but controlled outbound data use.
A data hub provides architecture for data and its management.
A quality data hub will assume a hub-and-spoke architecture but be flexible enough that users can customize the architecture to match their current data realities and future plans. Hub-and-spoke is the preferred architecture for integration technologies (for both data management and applications) because it falls into obvious, predictable patterns that are easy to learn, design, optimize, and maintain. Furthermore, a hub-and-spoke architecture greatly reduces the number of interfaces deployed compared to a point-to-point approach, which in turn reduces complexity for greater ease of use and maintainability.
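The interface arithmetic behind that last claim is worth spelling out: with n systems, a point-to-point approach needs an interface for every pair of systems, while a hub-and-spoke design needs only one interface per system. A quick illustrative sketch of the numbers (not of any product's behavior):

# Interface counts for n systems: point-to-point vs. hub-and-spoke.
def point_to_point(n: int) -> int:
    return n * (n - 1) // 2   # one interface per pair of systems

def hub_and_spoke(n: int) -> int:
    return n                  # each system connects only to the hub

for n in (5, 10, 20):
    print(f"{n} systems: point-to-point={point_to_point(n)}, hub-and-spoke={hub_and_spoke(n)}")
# 5 systems: 10 vs. 5; 10 systems: 45 vs. 10; 20 systems: 190 vs. 20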
A data hub centralizes control functions for data management.
When a data hub follows a hub-and-spoke architecture, it provides a single point of integration that fosters technical standards for data structures, data architecture, data management solutions, and multi-department data sharing. That single point also simplifies important business control functions, such as governance, compliance, and collaboration around data. Hence, a true data hub centralizes and facilitates multiple forms of control, for both the data itself and its usage.
A data hub enables self-service for controlled data access.
Self-service is very important, because it’s what your “internal customers” want most from a data hub. (Even so, some technical users benefit from self-service, too.) Self-service has many manifestations and benefits:
    - Self-service access to data makes users autonomous, because they needn’t wait for IT or the data management team to prepare data for them.
- Self-service creation of datasets makes users productive.
- Self-service data exploration enables a wide range of user types to study data from new sources and discover new facts about the business.
These kinds of self-service are enabled by an emerging piece of functionality called data prep, which is short for data preparation and is sometimes called data wrangling or data munging. Instead of overwhelming mildly technical or non-technical users with the full richness of data integration functionality, data prep boils it down to a key subset of functions. Data prep’s simplicity and ease of use yield speed and agility. It empowers data analysts, data scientists, DM developers, and some business users to construct a dataset with spontaneity and speed. With data prep, users can quickly create a prototype dataset, improve it iteratively, and publish it or push it into production.
Hence, data prep and self-service work together to make modern use cases possible, such as data exploration, discovery, visualization, and analytics. Data prep and self-service are also inherently agile and lean, thus promoting productive development and nimble business.
Centralization and self-service come together in one of the most important functions found in a true data hub, namely publish-and-subscribe (or simply pub/sub). This type of function is sometimes called a data workflow or data orchestration.
Here’s how pub/sub works: data entering the hub is certified and cataloged on the way in, so that the data is in canonical form, high quality, and audited—ready for repurposing and reuse. The catalog and its user-friendly business metadata then make it easy for users and applications to subscribe to specific datasets or to generic categories of data. That way, users get quality data they can trust, but within the governance parameters of centralized control.
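As a rough sketch of the pattern only (not any vendor's implementation; every name and check below is hypothetical), a minimal publish-and-subscribe flow for a data hub might look like this:

# A toy pub/sub data hub: datasets are certified and cataloged on the way in,
# and subscribers to a category are notified when new data arrives.
from collections import defaultdict

catalog = {}                        # dataset name -> business-friendly metadata
subscribers = defaultdict(list)     # category -> list of callbacks

def subscribe(category, callback):
    subscribers[category].append(callback)

def certify(rows):
    # Stand-in for real data quality checks and auditing
    return all(row.get("customer_id") is not None for row in rows)

def publish(name, category, rows):
    if not certify(rows):
        raise ValueError(f"{name} failed certification")
    catalog[name] = {"category": category, "row_count": len(rows)}
    for notify in subscribers[category]:
        notify(name, rows)

subscribe("customer", lambda name, rows: print(f"received {name}: {len(rows)} rows"))
publish("crm_extract_oct", "customer", [{"customer_id": 1}, {"customer_id": 2}])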
Summary and Recommendations.
    - Establish a data architecture and stick with it. Rely on a data hub based around a hub-and-spoke architecture, not point-to-point hairballs. 
- Adopt a data hub for the business benefits. At the top of the list would be self-service for data access, data exploration, and diverse analytics, followed by centralized functions for data governance and stewardship. 
- Deploy a data hub for technical advancement. A hub can organize and modernize your infrastructure for data integration and data management, as well as centralize technical standards for data and development.
- Consider a vendor-built data hub. Home-grown hubs tend to be feature-poor compared to vendor-built ones. When it comes to data hubs, buy it, don’t build it.
- Demand the important, differentiating functions, especially those you can’t build yourself. This includes pub/sub, self-service data access, data prep, business metadata, and data certification.
- A modern data hub potentially has many features and functions. Choose and use the ones that fit your requirements today, then grow into others over time.
If you’d like to hear more of my discussion with Informatica’s Scott Hedrick and Rabobank’s Ron van Bruchem, please click here to replay the Informatica Webinar.
 