Scientist Or Analyst?

I’ve been analyzing data for 30 years.

I’ve studied science and engineering.

I read about science.

I taught myself to code and created a data engineering and analytics web application.

I have thought a lot about my work as I have developed my domain expertise.

One of the things I have come to realize is that if you are designing and running experiments to prove or disprove a hypothesis, then you are performing science. You may even call yourself a scientist. Testing a hypothesis requires evidence, usually in the form of objective data. If you are a scientist, you use data, no matter what the discipline. You can’t be a scientist without data. The term data in Data Science is redundant, like calling yourself a Religious Priest or an Oral Dentist.

In contrast, if you use data to look for a story or a correlation (causal or otherwise) and you aren’t testing a hypothesis, then you aren’t a scientist; you are an analyst. In this case a qualifying noun is useful (data, systems, etc.).

I suspect most people who call themselves data scientists are actually analysts who have taught themselves to write code. This isn’t science. Science is the method by which we create persistent state knowledge; code is just a tool we should use to process data to test a hypothesis.

Should Linear Analysis Be The Only Tool We Use?

99% of analysis carried out by analysts involves a cross tab – analyzing one piece of data through the lens of another.
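To make the idea concrete, here is a minimal sketch of a cross tab in plain Python. The records, the age groups and the brand labels are all made up for illustration, and `cross_tab` is a hypothetical helper, not part of any particular tool.

```python
from collections import Counter

# Hypothetical survey records: (age_group, preferred_brand).
# These values are invented purely to illustrate the technique.
records = [
    ("18-34", "A"), ("18-34", "B"), ("18-34", "A"),
    ("35-54", "B"), ("35-54", "B"), ("35-54", "A"),
    ("55+", "B"), ("55+", "B"),
]

def cross_tab(rows):
    """Count how often each (row_value, column_value) pair occurs."""
    counts = Counter(rows)
    row_labels = sorted({r for r, _ in rows})
    col_labels = sorted({c for _, c in rows})
    # Missing pairs come back as 0 because Counter defaults to zero.
    return {r: {c: counts[(r, c)] for c in col_labels} for r in row_labels}

table = cross_tab(records)
for row_label, cells in table.items():
    print(row_label, cells)
```

Each row of the output reads one variable (brand) through the lens of another (age group), which is why the result is so easy for a person to read.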

The cross tab is the de facto standard tool and, while it has limitations from an analytical perspective, it produces human-readable outputs. The challenge lies in the fact that the cross tab produces linear results but not definitive results. Cross tabs tell a story, but often not a satisfactory one. For instance, if we look at how people voted in the 2016 Presidential Election in the USA using this data, we can see a weak story appear. While many commentators wanted to label Trump supporters as white, poor and uneducated, these labels are only partially true. They are not definitive. Were we to use just these simple descriptors to predict who voted for Trump (or Clinton) and provide a definitive story, then the story would be much more convoluted to relay, since it would rely on non-linear transformations of these descriptors.

The challenge for analytics is to find the right blend of Linear Analytics and Non-Linear Analytics, one that combines predictive power with human-readability.
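A toy sketch of why simple descriptors can fail on their own: the data below is entirely made up (it is not election data), and the outcome is constructed to be 1 only when exactly one of two yes/no descriptors is 1. Looking at each descriptor separately shows nothing, while the combination predicts the outcome perfectly. The helper `outcome_rate` is a hypothetical name used only for this illustration.

```python
from collections import defaultdict

# Made-up toy data: two yes/no descriptors (a, b) and an outcome.
# By construction the outcome is a XOR b, repeated 25 times per combination.
rows = [(a, b, a ^ b) for a in (0, 1) for b in (0, 1) for _ in range(25)]

def outcome_rate(rows, key):
    """Share of rows with outcome == 1, grouped by key(row)."""
    totals, hits = defaultdict(int), defaultdict(int)
    for row in rows:
        k = key(row)
        totals[k] += 1
        hits[k] += row[2]
    return {k: hits[k] / totals[k] for k in sorted(totals)}

by_a = outcome_rate(rows, lambda r: r[0])             # one descriptor at a time
by_b = outcome_rate(rows, lambda r: r[1])
by_both = outcome_rate(rows, lambda r: (r[0], r[1]))  # descriptors combined

print("by a:   ", by_a)
print("by b:   ", by_b)
print("by both:", by_both)
```

In this constructed case each single-descriptor view shows a 50% rate in every group, so it tells no story at all, whereas the combined view is definitive but takes more words to explain.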

Stories Vs. Predictions

Having worked within an industry (market research) for some time, I am intrigued by how other occupations use data, particularly data scientists. After a conversation with a new friend, a data scientist, last week I had a revelation.

Data scientists have created new words to talk about data analysis. The ones that stand out are features and feature sets. Quantitative market researchers talk about questions and surveys but never features. Essentially, they are the same thing; features are traits, attributes and behaviors of people that can be used to describe or predict specific outcomes. The big difference is that data scientists don’t care so much whether the features are human-readable (i.e. whether they can be read and understood like a book), as long as they help make a prediction. For example, Random Forests make good predictors but aren’t easily understandable. The same is true of Support Vector Machines: they are excellent predictors, but in higher dimensions they are hard to explain.

In contrast, market researchers are fixated on the predictive features being human-readable. As data science has shown, a market researcher’s predictions, their stories, will always be weaker than those of a data scientist. This in part explains the continued trend of story-telling in market research circles. Stories are popular and contain some ambiguity, and this ambiguity allows people to take from them what they wish. This is an expedient quality in the short term but damaging to the industry in the long term.

I think market researchers need to change. My aim with Knowledge Leaps is to bridge the gap between highly predictive features and human-readable stories.