It's time to put the science back in "data science" - ZWMiller

Feb. 27, 2018 - Opinion, Theory

I've seen a lot of think-pieces on the internet that conclude a data scientist's job is to "tell the story in the data." These think-pieces are aimed at CEOs and hiring managers, trying to help them gain insight into "what data scientists do." To be blunt, it's been very frustrating to me as a practitioner of data science to hear the field condensed down to "story time with Uncle Data Guy." I'm aware that "telling the data's story" is an idiom, but I'd like to submit that this is a slippery slope for data science to be walking down.

By allowing this narrative, we're instilling in young data scientists that the most important skill is to be able to find a story in the data. However, that's not what data science is supposed to be about. It's AN important skill, but the whole package is supposed to be about finding statistically significant stories in the data. There's a world of difference between a story teller and a statistician. Story tellers leave things out that aren't directly related to the main theme, bend the truth for the interest of the story, and add tension in places where none exists. These are TERRIBLE practices for data scientists, and I'm seeing these practices pop up in Meetups and talks all around my area. I'm seeing folks presenting projects that don't address the outliers in the data, showing results that are consistent with zero effect as if they're breakthroughs, and touting projects that don't even have a well-defined goal. These are the folks that are presenting to, and inspiring, the next generation of aspiring data scientists, and that terrifies me.

Data science, as a field, works because the results we present are practical. We look through the data, suss out trends in the data, make sure those trends are real, and make suggestions on how we can use those trends. The success of the field is dependent on "less technical decision makers" trusting the results of the "more technical analysts." We must foster that trust and build upon it. If data scientists become nothing more than story tellers for the data, that trust is going to erode as the stories leave behind the practicality of "just tell me what's in the data and how we can use it for our business." To keep the field progressing, we need to be better about not skipping the "make sure those trends are real" step in the process. How do we do that?

It might be time to bring error bars into the fold of data science. It might be time to start making sure that blind studies reproduce the results of original studies. It might be time to bring hypothesis testing to the forefront of evaluating a model's predictive power as a requirement for model use. These practices are already happening in some places, but not everywhere, and that hurts the brand of data science as a whole. Managers and CEOs are noticing that data science is fallible, and that's both fine and expected as the field matures, but we aren't helping our cause as data scientists by making bold claims and not explaining the difference between a sure thing and a "minor effect."
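To make "error bars" concrete, here's a minimal sketch of one way to put an interval on a difference between two groups, using a percentile bootstrap and nothing but the Python standard library. The function names (`bootstrap_diff_ci`, `describe`), the resample count, and the 95% level are all assumptions chosen for illustration, and the data is synthetic. The bootstrap is just one reasonable choice here; a t-test or permutation test would serve the same purpose.

```python
import random
import statistics


def bootstrap_diff_ci(control, treatment, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap interval for mean(treatment) - mean(control)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        # Resample each group with replacement, keeping its original size.
        c = rng.choices(control, k=len(control))
        t = rng.choices(treatment, k=len(treatment))
        diffs.append(statistics.fmean(t) - statistics.fmean(c))
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


def describe(lower, upper):
    """Turn the interval into the one-line caveat a manager needs to hear."""
    if lower <= 0.0 <= upper:
        return "consistent with zero effect"
    return "distinguishable from zero"


# Synthetic, made-up data: both groups are drawn from the same distribution.
data_rng = random.Random(42)
control = [data_rng.gauss(100, 15) for _ in range(200)]
treatment = [data_rng.gauss(100, 15) for _ in range(200)]

lower, upper = bootstrap_diff_ci(control, treatment)
print(f"difference in means: [{lower:.2f}, {upper:.2f}] -> {describe(lower, upper)}")
```

The point isn't the specific method. It's that the result arrives with an interval attached, so "we saw a difference" and "we saw a difference we can't tell apart from zero" stop looking like the same claim.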

We aren't story tellers for the data, we're interpreters. The job of an interpreter is to take concepts that exist in one language and make them understandable in another. For data science, part of that job is finding the story in the data and explaining it. Another equally important part of the job is making sure that all the idioms and parlance that exist in statistical analysis are translated to a language your manager speaks. A good interpreter doesn't leave out the parts that are hard to explain; they spend the time and effort to make sure the point isn't lost in translation. You aren't going to be able to show up and talk to most executives about the z-scores of your result. However, you can explain that you're seeing an effect that's consistent with XYZ occurring in our customer base. You can explain that, "yes I'm seeing this effect, but it's only 2% larger than our normal average and that could be due to which subset of people came to our website in the last month."
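As a rough sketch of that translation step, the snippet below keeps the z-score for the analyst and hands the manager a plain sentence instead. Everything in it is hypothetical: the function name `plain_language_summary`, the monthly numbers, and the two-standard-deviation cutoff are assumptions for illustration, and a real analysis would want more history and a check that month-to-month variation is actually the right yardstick.

```python
import statistics


def plain_language_summary(history, current, threshold=2.0):
    """Compare this month to past months; return (z_score, sentence).

    `history` needs at least two past values so a spread can be computed.
    `threshold` is an assumed cutoff, in standard deviations.
    """
    baseline = statistics.fmean(history)
    spread = statistics.stdev(history)
    pct_change = 100.0 * (current - baseline) / baseline
    z = (current - baseline) / spread
    if abs(z) < threshold:
        verdict = "within the range we normally see month to month"
    else:
        verdict = "larger than our usual month-to-month swings"
    sentence = (
        f"This month is {pct_change:+.1f}% versus our normal average, "
        f"which is {verdict}."
    )
    return z, sentence


# Hypothetical monthly averages, made up for this example.
past_months = [100, 97, 103, 99, 101, 104, 96]
z, sentence = plain_language_summary(past_months, current=102)
print(sentence)
```

The z-score never has to reach the executive, but the caveat it carries does.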

Data scientists aren't just interpreters though, they're also investigative journalists. An investigative journalist dives into a topic head-first and immerses themselves in the content on behalf of the general public. The journalist learns all the slang, tries to fit in with the surroundings they're exploring, and brings back the facts about whatever they're investigating. When they're done, the journalist breaks down all the positives and negatives. They don't bring an agenda to the story or let their opinions color the outcome. A good journalist simply does the work others don't want to do, then brings back the facts about the topic. Once they have the facts, they write them all up in an easily digestible format. In good journalism, the facts always speak. This is what we should strive for in data science. Every time we engage with a new dataset, we're trying to bring back the unbiased facts and find a way to use those facts to our advantage. The fate of investigative journalism should also be a cautionary tale. In the modern world, too many "journalists" have gotten caught up in the story telling, and let the story drive the research instead of the other way around. With that trend has come a rise in distrust of journalism. The bad behavior of the loud few has clouded the results of those that are doing good work. That's my fear for data science as we continue to see the field rise into the public eye, especially if we continue to pursue the story of the data over the facts of the data. If businesses start to see data scientists as expensive story tellers, the field will cease to exist, as will the progress the field is bringing with it.

Let me be clear, I'm 100% guilty of not pulling my weight on this topic. I've shortened explanations for the purpose of wowing people. I've only shown the good results because I felt like showing the bad results would be demonstrating that my project had failed. We (myself included) must do better on that front. That's why we need to act as scientists of data. In science there is no failure, only hypothesis confirmation or rejection. If we want data scientists to continue being valued and trusted, we have to stop teaching young data scientists that their job is to "tell the data's story" and start teaching them that their job is to "understand and interpret the data." We need to emphasize that a model isn't the goal, but rather a method for understanding the data and then using that understanding to our advantage. We must instill in the next generation of data scientists that statistical rigor is mandatory, not something that only the weird guy on your team with the PhD does. Otherwise, I fear that trust in data science will die off as we make bold claim after bold claim about what the "story" in the data is, without the ability to back up our findings with rigorous results.

Zachariah W Miller, PhD

Data Science · Physics · Software

zachariah.w.miller@gmail.com
