How Do Data Scientists Really Feel about Their Work?
Being a data scientist isn't everything it's cracked up to be. Data science has its share of boring, repetitive tasks. On the whole, however, data scientists really love their work.
Being a data scientist isn't everything it's cracked up to be. It has its share of boring, repetitive tasks.
According to a new survey, on average data scientists spend more than half their time (53 percent) doing stuff they don't dig -- such as cleaning and organizing data for analysis.
The "2017 Data Scientist Report," published last month, finds that nearly half (45 percent) of data scientists spend huge chunks of their time cleaning and organizing data for analysis -- a task, not coincidentally, that a clear majority (60 percent) say they enjoy least.
Tasks such as collecting and labeling data are only slightly more popular: 48 percent of data scientists say they dislike collecting data; 51 percent say the same thing about labeling it. Unfortunately, both tasks occupy a disproportionate share of the data scientist's time.
The "2017 Data Scientist Report" is published by CrowdFlower, a data mining and crowdsourcing company that markets software-as-a-service (SaaS) offerings for data cleansing and enrichment. It's based on a survey of 179 data scientists who work with companies large (greater than 10,000 employees) and small (fewer than 100). Irrespective of where they work, data scientists seem to share certain defining qualities. "Data scientists are happiest building and modeling data, mining data for patterns, and refining algorithms," the report indicates.
The problem is that few data scientists get to do this stuff as much as they'd like.
"These three more cerebral tasks rank nearly eight times higher in popularity amongst data scientists than more 'janitorial tasks,' yet a mere 19 percent of data scientists report spending most of their time on the top ranked activity -- 'Building and Modeling Data,'" the report says.
Image Is Everything
The challenge of collecting, cleansing, and organizing data is compounded by the increasing importance of poly-structured data: files (binary objects, multimedia and/or image files, etc.), text, and other forms of data that, unlike relational data, are not strictly structured.
More than half of data scientists surveyed (51 percent) say a "significant" amount of their work involves poly-structured data. Ninety-one percent of respondents say they're working with text, 33 percent with image data, 15 percent with video, and 11 percent with audio.
The "2017 Data Scientist Report" cites research from Gartner concerning the growth of video and image data. Thanks to the ubiquity of cameras and sensors, Gartner projects that video and image data will account for 80 percent of all Internet traffic by the end of this decade.
The vast majority of this data (95 percent) will be analyzed automatically, by machines.
"This significant uptick in visual data was reflected in our survey responses as well. [Although] it's no surprise that almost all respondents are working with text data, a good portion of data scientists are utilizing images ... and video ... as well," the report says.
It's All about the Data
Not surprisingly, most of the data that data scientists work with (71 percent) is sourced from internal systems. However, many data scientists also work with public (41 percent), self-collected (43 percent), and internally collected (68 percent) data sets.
One last thing: data scientists really love the cleansed, qualified data they use to train their models. One of the questions in the CrowdFlower survey gave data scientists a choice among three options: accidental deletion of their machine code and algorithms, accidental deletion of their training data, and ... accidental shattering of one of their femurs. Most respondents (52 percent) said they'd rather suffer loss of their algorithms than loss of data. And 28 percent said they'd rather break a leg than lose their training data.