Arrowheads Vs. Cave Paintings

Cave of Hands (13000 - 9000 BCE), Argentina.

Why Human Data Is More Powerful than Tools or Platforms.

At KL we realize the value of data is far greater than either analytic tools or platforms.  As a team, we spend a lot of our time discussing the topics of data and analytics, especially analytics tools. We used to devote more time to this latter topic in terms of selection of existing tools and development of new ones. We spent less time talking about platforms and data.  Overt time we have come to understand that all three of Data, Platform, Analytics are vital ingredients to what we do.  This is visualized in our logo, we are about the triangulation of all three.

On this journey, I have come to realize that some things take a long time to learn. In my case , when you study engineering, you realize that the desire to make tools (in the broadest sense) is in your DNA. Not just your own, in everyone's.

Building tools is what humans do, whether it's a flint arrowhead, the first machine screw or a self-driving car. It's what we have been doing for millennia and what we will continue to do.

As a species I think we are blind to tools because they are so abundant and seemingly easy to produce - because as a species we make so many of them.  In that sense they are not very interesting and those that are interesting are soon copied and made ubiquitous.

What is true of axes, arrowheads and pottery is also true of analytics businesses. The reason it is hard-to-build a tool-based business is that the competition is intense. As a species, this won't stop us trying.

In stark contrast to analytics tools, is the importance of data and platforms. If a flint arrowhead is a tool then the cave painting is data. When I look at images of cave paintings, such as the cave of hands shown, I am in awe.  A cave painting represents a data point of human history, the cave wall the platform that allows us to view it.

This is very relevant to building a data-driven business, those firms that have access to data and provide a platform to engage with it will always find more traction than those that build tools to work on top of platforms and data.

Human data points are hard to substitute and, as a result, are more interesting and have a greater commercial value than tools.

Machine Screws and AI

 

The attraction of AI is that it is learns by experience. All learning requires feedback, whether its an animal, a human or a computer doing the learning, The learning-entity needs to explore its environment in order to try different behaviors, create experiences and then learn from them.

For computers to learn about computer-based environments is relatively easy. Based on a set of instructions, a computer can be trained to learn about code that it is executing or is being executed by another computer. The backbone of the internet uses something to similar to ensure data gets from point A to point B.

For humans to learn about human-environments, this is also easier. It is what we have been doing for tens of thousands of years.

For humans to learn about computer-based environments is hard. We need a system to translate from one domain into another. Then we need a separate system to interpret what we have translated. We call this computer programming, and because we designed the computer, we have a bounded-system. There is still a lot to understand, but it is finite and we know the edges of the system, since we created it.

It is much harder for computers to learn about human environments. The computer must translate real-world (human) environment data into its own environment, and then the computer needs to decode this information and interpret what it means. Because the computer didn't design our world it doesn't have the advantage that humans do when we learning about computers. It also doesn't know if it is bounded-system or not. For all the computer knows, the human-world is infinite and unbounded - which it could well be.

In the short term, to make this learning feasible we use human input. Human's help train computers to learn about the real-world environments. I think of the reasons that driver-less car technology is being focused on, is that the road system is a finite system (essentially its 2D) that is governed by a set of rules.

•Don't drive into anything, except a parking spot.

•Be considerate to other drivers, e.g. take turns at 4-way stop signs.

•Be considerate to pedestrians and cyclists.

•etc

This combination of elements and rules makes it a perfect environment to train computers to learn to drive, not so much Artificial Intelligence but Human-Assisted Intelligence. Once we have trained a computer to decode the signals from this real-world environment and make sensible decisions with good outcomes, we can then apply this learning to different domains that have more variability in them, such as delivering mail and parcels.

This is very similar to the role of the machine screw in the industrial revolution. Once we had produced the first screw, we could then make a machine that could produce more accurate screws. The more accurate the screw, the more precise the machine, the smaller tolerance of components it could produce, the better the end-machine. Without the machine screw, there would have been no machine age.

This could open the doors to more advanced AI, it is some way off though because time required to train computers to learn about different domains.

Should Linear Analysis Be The Only Tool We Use

99% of analysis carried out by analysts involves a cross tab - analyzing one piece of data through the lens of another.

The cross tab is the de facto standard tool and while it has limitations from an analytical perspective, the cross tab is produces human readable outputs. The challenge lies in the fact that the cross tab produces linear results but not definitive results. They tell a story but often not a satisfactory one. For instance, if we look at how people voted in the 2016 Presidential Election in the USA using this data we can see a weak story appear. While many commentators wanted to label Trump supporters as white, poor and uneducated, these labels are only partially true. They are not definitive.  Were we to use just these simple descriptors to predict who voted for Trump (or Clinton) and provide a definitive story then the story would be much more convoluted to relay, since it would rely on non-linear transformation of these descriptors.

The challenge for analytics is to find the right blend of Linear Analytics and Non-Linear Analytics that combines predictive power and retains human-readability.

Human Learning: A Necessary Output From Machine Learning

To test the functionality of the application we have been using some real life data either from people we know who work with data in various companies or from Kaggle (the data science community recently acquired by Google).

Our favorite test data set from Kaggle is the Titanic survivor data.   We like it because there are a small number of records in it (c. 900)  and it does not contain many variables (the notable ones are gender, age, point of embarkation, cabin number, cabin level and whether they survived or not).

Kaggle runs competitions to see which data scientist can produce the most accurate prediction of survival.  While we are interested in accuracy (the model produced on KL  has an accuracy of 80% vs a guessing accuracy of 51% based on the incidence of survivors in the data we have), we are more interested in both accuracy and human readability of the model. This graph shows the outputs of the model drivers, this shows, for example, that a passenger's gender contributes 22.7% of our knowledge about whether they survived or not.

While accuracy is important, being able to relate the model to other people is just as important as it means that we humans can learn, not just machines.

 

Contribution of Factors On Survival Outcome of Titanic Passengers

Unbiased Analytics

When we excitedly tell people that the new version of Knowledge Leaps incorporates k-fold validation, their eyes glaze over. When we tell people about the benefits of this feature, we usually get the opposite response.

In simple terms, k-fold validation is like having a team of 10 pHDs working on your data, independently and simultaneously. The application doesn't produce just one prediction, it makes 10 which are all independent of one another. This approach outputs more general models, these are closer to a rule of thumb and are consequently useful in more contexts. Another step toward human-centered analytics without the human bias.