TDWI Upside - Where Data Means Business

4 Data Mistakes Newbie Analysts Make

Don’t overcomplicate your analytics. Sometimes the simplest analysis is the best.

Have you ever encountered newbie analysts who are well-trained but still make fundamental mistakes in their analysis? Data analytics programs focus the bulk of their training on technical proficiencies -- are they neglecting the more basic skills that prevent analytics errors?

To help you and your team understand what kind of mistakes new analysts make, I spoke with two leading data analytics and statistics experts, Dr. Wayne Thompson, chief data scientist at SAS, and Alex Reinhart, author of “Statistics Done Wrong.”

Here are four common mistakes they noted:

Mistake #1: Focusing on building models over monitoring them

Data analyst and statistics training programs emphasize how to develop algorithms and models but are short on teaching what goes wrong when those models are applied to real-world problems.

According to Dr. Thompson, “There’s not enough focus on putting models into production for decision making. Also, the models degrade over time so it is important to monitor the model and retrain it at frequent intervals.”

Focus on developing models that are simple enough to do a good job. Once you have a model that’s “good enough,” implement it, monitor its performance, and update it based on how well it predicts new data.
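As a rough illustration of the monitoring step, the sketch below compares a model's accuracy on recent predictions against the accuracy it had when it was deployed and flags it for retraining once the drop exceeds a tolerance. The function names, the 0.05 tolerance, and the sample outcomes are all assumptions made for this example, not something the experts prescribed.

```python
def accuracy(pairs):
    """Share of (predicted, actual) pairs where the prediction was right."""
    if not pairs:
        raise ValueError("need at least one (predicted, actual) pair")
    return sum(1 for predicted, actual in pairs if predicted == actual) / len(pairs)


def needs_retraining(baseline_accuracy, recent_pairs, tolerance=0.05):
    """Flag the model when recent accuracy falls more than `tolerance`
    below the accuracy measured at deployment time (hypothetical rule)."""
    return accuracy(recent_pairs) < baseline_accuracy - tolerance


# Made-up recent outcomes: 3 of 5 predictions were correct.
recent = [(1, 1), (0, 1), (1, 0), (0, 0), (1, 1)]

print(accuracy(recent))                # 0.6
print(needs_retraining(0.80, recent))  # True: 0.6 is well below 0.80
```

In practice the check would run on a schedule against freshly labeled data, and the tolerance would depend on how costly a wrong prediction is for the business.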

Mistake #2: Fixating on the algorithm and ignoring the problem you’re attempting to solve

The goal of data science is to collect past data from customers or another real-world population and use it to develop models that predict future behaviors. Data analysts often make the mistake of developing algorithms or models for their data while neglecting the overall purpose of their models.

“Machine learning models are so powerful now that data scientists will overfit their historical data until their models don’t generalize very well to new data,” Thompson explains. “In other words, they continue to tune the algorithm to fit the existing learning data too tightly. One tip is to measure model fit on lots of holdout data.”

Newbie analysts also neglect Occam’s Razor. Don’t overfit the model to the data. Instead, focus on building a model that fits the historical data reasonably well and also generalizes to new data.
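Thompson's holdout tip can be shown with a toy example. The sketch below uses a small made-up data set in which the true pattern is "label 1 when x is above 5" and two training labels are deliberately noisy. It compares a model that memorizes the training data (a one-nearest-neighbor lookup) with a one-line rule. The data and both models are invented purely for illustration.

```python
def accuracy(model, xs, ys):
    """Share of examples the model labels correctly."""
    return sum(model(x) == y for x, y in zip(xs, ys)) / len(xs)


# Toy training data; the labels at x=3 and x=8 are noise.
train_x = [1, 2, 3, 4, 6, 7, 8, 9]
train_y = [0, 0, 1, 0, 1, 1, 0, 1]

# Holdout data the models never saw, labeled by the true pattern.
holdout_x = [1.4, 2.8, 3.2, 4.4, 6.4, 7.9, 8.2, 9.5]
holdout_y = [0, 0, 0, 0, 1, 1, 1, 1]


def memorizer(x):
    """Overfit model: copy the label of the closest training point."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]


def simple_rule(x):
    """Simple model: a single threshold."""
    return 1 if x > 5 else 0


for name, model in [("memorizer", memorizer), ("simple rule", simple_rule)]:
    print(name,
          "train:", accuracy(model, train_x, train_y),
          "holdout:", accuracy(model, holdout_x, holdout_y))
```

The memorizer scores 1.0 on the training data but only 0.5 on the holdout set, because it has learned the noise. The simple rule scores 0.75 on training and 1.0 on holdout. Judged on training fit alone, the wrong model would win.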

Mistake #3: Emphasizing prediction accuracy over applicability

For many analysts, developing accurate models becomes a competition. Yet the model with the best prediction accuracy is sometimes neither practical nor applicable to the real-world problems your organization faces.

An extreme version of this is data leakage. Thompson gives an example of a prostate cancer model that includes, as a variable, whether a patient has been treated for prostate cancer. This variable would be highly predictive but not at all useful for the end goal of detecting or predicting prostate cancer in patients.
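One simple guard against this kind of leakage is to record, for every candidate variable, whether its value is actually known at the moment the prediction has to be made, and to drop the ones that are not. The sketch below assumes a hypothetical feature catalog; apart from the treatment variable from Thompson's example, the variable names are invented for illustration.

```python
# Hypothetical catalog: True means the value is known at prediction time.
CANDIDATE_FEATURES = {
    "age": True,
    "family_history": True,
    "treated_for_prostate_cancer": False,  # only known after a diagnosis
}


def usable_features(catalog):
    """Variables that are safe to feed to the model."""
    return sorted(name for name, known in catalog.items() if known)


def leaked_features(catalog):
    """Variables that would leak the outcome into the model."""
    return sorted(name for name, known in catalog.items() if not known)


print("keep:", usable_features(CANDIDATE_FEATURES))
print("drop:", leaked_features(CANDIDATE_FEATURES))
```

Deciding which variables belong in each group is a judgment call that requires knowing how and when the data is collected, which no code can do for you.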

Another extreme example: “an instance where an analyst attacked a reasonable problem but tried to apply four or five statistical or machine learning methods,” says Reinhart. This resulted in “a cross-validated prediction from an elastic net logistic regression model that’s been reduced in dimension by stepwise regression using variables coming from kernel density estimates.”

To avoid going overboard trying to fit your variables perfectly, keep as few variables as you can while still producing a model you can test and apply in a reasonable fashion.

Mistake #4: Jumping into analysis before ensuring data quality

A model depends on the underlying data as its foundation -- without quality data, you have nothing. Take an active role in data procurement and cleaning.

Get to know the people who collect or create your data. “A data analyst needs to understand the business, its operations, and where, exactly, the numbers come from. Once you do, you can spot problems,” advises Reinhart.

There are also many new sources of data, such as image data, audio data, and textual data, on top of the more commonly seen transactional data. Think of data source selection as a buffet: fill your plate from lots of different sources -- what customers say, type, or do -- to get a better model.

Also, don’t forget the basic measures of data: mean, standard deviation, skewness. Examine these before delving into more complicated analytics. “You want to print out and look at your data, almost to where you can touch it, just to do a quick check to see whether or not the structure of the data represents how you put it together and aggregated it,” Thompson suggests.
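A quick look at these basic measures takes only a few lines. The sketch below uses Python's standard `statistics` module and computes skewness as the average cubed z-score (the population form). The `describe` helper and the sample values are assumptions made for this example.

```python
import statistics


def describe(values):
    """Return the mean, population standard deviation, and skewness."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        skew = 0.0  # every value is identical, so there is no asymmetry
    else:
        skew = statistics.fmean([((v - mean) / std) ** 3 for v in values])
    return {"mean": mean, "std": std, "skewness": skew}


symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 1, 1, 10]  # one large value drags the tail right

print(describe(symmetric))
print(describe(right_skewed))
```

A skewness near zero suggests a roughly symmetric distribution, while a clearly positive value, as in the second list, points to a long right tail that is worth inspecting before modeling.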

“I often run into data that’s been miscoded, mistyped, or just mixed up -- like two columns of a spreadsheet having their labels swapped,” Reinhart adds, “and it’s easy to spot when you know what a reasonable value looks like.”
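Knowing what a reasonable value looks like can be turned into an automated check. The sketch below counts, for each column, how many values fall outside a plausible range. The column names, the bounds, and the sample rows are all hypothetical, chosen to mimic the swapped-labels problem Reinhart describes.

```python
def count_out_of_range(rows, bounds):
    """Count values per column that fall outside their plausible range."""
    counts = {column: 0 for column in bounds}
    for row in rows:
        for column, (low, high) in bounds.items():
            if not (low <= row[column] <= high):
                counts[column] += 1
    return counts


# Assumed plausible ranges for two hypothetical columns.
BOUNDS = {"age": (0, 120), "annual_income": (0, 10_000_000)}

# Made-up rows in which the two column labels have been swapped.
rows = [
    {"age": 52000, "annual_income": 34},
    {"age": 87000, "annual_income": 51},
    {"age": 61000, "annual_income": 29},
]

print(count_out_of_range(rows, BOUNDS))  # {'age': 3, 'annual_income': 0}
```

Every "age" failing the check at once is a strong hint that the problem lies with the column label rather than with individual records. Note that the income column passes here, so range checks alone will not catch every mix-up.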

A Final Word

The common thread running through these four mistakes is overcomplicating the analysis and neglecting more basic considerations. Remember the popular paraphrase of Occam’s razor: “With all things being equal, the simplest explanation tends to be the right one.”
