I have spent a lot of time thinking about data and data structures. What I have learnt is that there are two types of data structure: data with only one row per user (e.g. survey data), and data with one row for each unique user event (e.g. click-stream data from an app or website) and therefore multiple rows for any given user.
Many web-based analytics platforms, like Amazon’s own ML platform, only let their users upload data that has a simple structure (one row per user, such as survey data and customer profile data). Very few platforms allow users to upload event-type data and engineer it into a simple form that can be used in predictive analytics.
Transforming event data requires data engineering, and this process can be daunting. To develop Knowledge Leaps further, we have spent a lot of time looking at a wide range of event-type data use cases. Our aim has been to create a systematic, easy-to-use (given the task) approach to simplifying the data engineering workflow. As with our models, we want our user interface and processes to be human-readable too.
In our latest release we are launching the Data Processor module. The design of this module has drawn heavily on working with real-world event data. This new feature allows the platform to take in any data type and apply simple processing rules to create analytics-ready data sets in minutes.
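To make the idea of turning event data into one row per user concrete, here is a minimal sketch in plain Python. It is not the Data Processor's actual implementation; the field names (user_id, event, value), the sample rows and the summary columns are all assumptions invented for illustration.

```python
from collections import defaultdict

# Hypothetical click-stream rows: one row per user event.
events = [
    {"user_id": "u1", "event": "view", "value": 0.0},
    {"user_id": "u1", "event": "purchase", "value": 25.0},
    {"user_id": "u2", "event": "view", "value": 0.0},
    {"user_id": "u1", "event": "view", "value": 0.0},
]


def flatten_events(rows):
    """Collapse event rows into one summary row per user."""
    per_user = defaultdict(
        lambda: {"n_events": 0, "n_purchases": 0, "total_value": 0.0}
    )
    for row in rows:
        summary = per_user[row["user_id"]]
        summary["n_events"] += 1
        if row["event"] == "purchase":
            summary["n_purchases"] += 1
        summary["total_value"] += row["value"]
    return dict(per_user)


# One row per user, ready for the kind of analysis survey data allows.
user_table = flatten_events(events)
```

Each summary column is a simple, nameable rule (a count or a sum), which keeps the resulting features easy to explain to another person.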
99% of analysis carried out by analysts involves a cross tab – analyzing one piece of data through the lens of another.
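As a rough illustration of what a cross tab computes, the sketch below counts pairs of answers and converts them to row percentages. The data, the column meanings and the cross_tab helper are hypothetical, made up for this example.

```python
from collections import Counter

# Hypothetical survey rows: (education, choice) pairs invented for illustration.
responses = [
    ("college", "A"), ("college", "B"), ("college", "B"),
    ("no_college", "A"), ("no_college", "A"), ("no_college", "B"),
]


def cross_tab(pairs):
    """Return row percentages: the share of each column value within each row value."""
    cell_counts = Counter(pairs)
    row_totals = Counter(row for row, _ in pairs)
    return {
        (row, col): count / row_totals[row]
        for (row, col), count in cell_counts.items()
    }


table = cross_tab(responses)
for (row, col), share in sorted(table.items()):
    print(f"{row:<12}{col:<3}{share:.0%}")
```

Every cell answers one plain question ("of the people in this row, what share gave this answer?"), which is why the output is so easy to read.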
The cross tab is the de facto standard tool and, while it has limitations from an analytical perspective, it produces human-readable outputs. The challenge lies in the fact that the cross tab produces linear results but not definitive results. Cross tabs tell a story, but often not a satisfactory one. For instance, if we look at how people voted in the 2016 Presidential Election in the USA using this data, we can see a weak story appear. While many commentators wanted to label Trump supporters as white, poor and uneducated, these labels are only partially true. They are not definitive. Were we to use just these simple descriptors to predict who voted for Trump (or Clinton) and provide a definitive story, then the story would be much more convoluted to relay, since it would rely on non-linear transformations of these descriptors.
The challenge for analytics is to find the right blend of Linear Analytics and Non-Linear Analytics that combines predictive power and retains human-readability.
I think this article sums up the challenges facing the data science community and, by extension, all data analysts. While much of what we are doing isn’t in the realms of AI, a lot of the algorithms that are being used are equally opaque and hard for the human brain to comprehend. There is an allure in the power of these techniques, but without easy comprehension I fear we are moving into an era of data distrust.
We have been thinking a lot about the relationship between the incidence of the feature we are trying to predict and the usefulness of analytics algorithms. In previous posts (here and here) we looked at guessing the feature rather than using an analytics model. When the incidence of the feature you are trying to predict is low, it is sometimes better to guess than to run an analytics algorithm, since the accuracy of a guess is higher for low-incidence features.
If you then consider how Random Forests work (create a family of decision trees at random -> use the modal value predicted by the family as the classification), it becomes clear that they are just a mechanism for creating lots of guesses, and when the incidence is low, a guess is better than an analytical prediction. Obviously, this isn’t meant to undermine Random Forests; it is more an observation about why they perhaps work so well.
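The "modal value" step can be sketched in a few lines. This is a simplified illustration, not a real Random Forest: the three hand-written rules below stand in for randomly built decision trees, and the field names (age, visits) are hypothetical.

```python
from collections import Counter


def modal_vote(predictions):
    """Return the most common prediction among the family of trees."""
    return Counter(predictions).most_common(1)[0][0]


# Hypothetical stand-ins for randomly built decision trees: each is one simple rule.
trees = [
    lambda row: row["age"] < 30,
    lambda row: row["visits"] > 5,
    lambda row: row["age"] < 40 and row["visits"] > 2,
]


def forest_predict(row):
    """Ask every tree for its guess, then return the modal answer."""
    return modal_vote([tree(row) for tree in trees])


prediction = forest_predict({"age": 35, "visits": 7})
```

For this row, two of the three rules say True, so the family's answer is True even though one tree disagrees.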
We have never really looked at the efficiency of the KL algorithm vs a straight guess as we work further down into a decision tree. However, what we have incorporated is a means of deploying resources (servers and processors) more efficiently. The latest release of the product allows users to set a stopping criterion based on the incidence of the predicted feature for a particular branch in the learning tree. As we have seen (here), incidence levels affect the point at which the user is better off making a guess than relying on an analytics algorithm. The stopping criterion prevents the application going past the point at which a guess would be better.
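A stopping criterion of this kind might look something like the sketch below. The exact rule used in the product is not described here, so the rule (stop once the branch incidence drops below a user-set floor), the function names and the default threshold are all assumptions.

```python
def branch_incidence(labels):
    """Share of rows in a branch where the predicted feature is present (1) vs absent (0)."""
    return sum(labels) / len(labels) if labels else 0.0


def should_stop(labels, min_incidence=0.05):
    """Hypothetical rule: stop splitting once incidence falls below a user-set floor."""
    return branch_incidence(labels) < min_incidence


# A branch with 1 positive case in 100 rows is below the 5% floor, so stop here.
low_branch = [1] + [0] * 99
# A branch with 2 positive cases in 4 rows is well above it, so keep splitting.
mixed_branch = [1, 0, 0, 1]
```

Checking the incidence before each split means no server time is spent growing branches where a guess would do at least as well.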
The secret to successful analytics lies in data engineering, as much as algorithm selection. Sure, there are exceptions to this. No doubt there are times when only one specific algorithm will work for a particular set of data. However, we believe there is no substitute for sound data engineering.
Data engineering is the process of feature creation. Features in the data are what an analytics algorithm will use to make predictions or estimates. How features are created by a data engineering process ultimately determines how human-readable the final models will be. It is easy to go from data engineering to data over-engineering.
An example of the pitfalls of data over-engineering is the use of Support Vector Machines. The SVM classification algorithm is very powerful. It achieves this by a) focusing only on the handful of data points that defy a simple black-and-white separation of the data and b) performing data engineering that exposes powerful data features, but ones that might not make sense to the ordinary person. For some use cases this is acceptable, but SVM classifications could easily enter the territory of “snake oil”. SVMs are an expert-user tool, and the end user has to trust the person performing the analytics because the outputs become too complex to explain in simple human terms.
Human-readable models are a current focus of KL. We are in the middle of building out our data engineering functionality to allow users to create human-readable features from many different data-structure types. These new features will improve the power of KL’s analytics algorithms without rendering them exclusively machine-readable.
I used the accuracy calculation to build this simple form, which works out how well a prediction must perform to be better than a weighted guess. For example, if the incidence of what we are trying to predict is 40% (gender = female, say), then the model must have an accuracy greater than 52% to be better than randomly assigning 40% of cases to female and the other 60% to not female. This is because the weighted guess will have an accuracy of 52% over a large sample (0.4 × 0.4 + 0.6 × 0.6 = 0.52).
We have been running trials on a 16-question survey, predicting the responses to a particular question using other data in the survey. What we discovered is that the more rules we allowed the KL analytics engine to produce, the lower the accuracy and the harder it became to explain the model to another person.
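The arithmetic behind the weighted guess can be written as a short function. This is a sketch of the calculation described above, not the code behind the form itself; the function names are my own.

```python
def weighted_guess_accuracy(incidence):
    """Expected accuracy of labelling cases positive at random with probability = incidence.

    A guess is right when it says "positive" for a positive case (incidence * incidence)
    or "negative" for a negative case ((1 - incidence) * (1 - incidence)).
    """
    return incidence ** 2 + (1 - incidence) ** 2


def beats_weighted_guess(model_accuracy, incidence):
    """True when the model is more accurate than the weighted guess."""
    return model_accuracy > weighted_guess_accuracy(incidence)


# Incidence of 40% gives a break-even accuracy of about 0.52.
break_even = weighted_guess_accuracy(0.40)
```

The same function shows why low incidence makes guessing hard to beat: at an incidence of 10% the weighted guess is already right about 82% of the time (0.1 × 0.1 + 0.9 × 0.9).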
To test the functionality of the application we have been using real-life data, either from people we know who work with data in various companies or from Kaggle (the data science community recently acquired by Google).
Our favorite test data set from Kaggle is the Titanic survivor data. We like it because there are a small number of records in it (c. 900) and it does not contain many variables (the notable ones are gender, age, point of embarkation, cabin number, cabin level and whether they survived or not).
Kaggle runs competitions to see which data scientist can produce the most accurate prediction of survival. While we are interested in accuracy (the model produced on KL has an accuracy of 80% vs a guessing accuracy of 51% based on the incidence of survivors in the data we have), we are more interested in the combination of accuracy and human readability of the model. This graph shows the model drivers. It shows, for example, that a passenger’s gender contributes 22.7% of our knowledge about whether they survived or not.
While accuracy is important, being able to relate the model to other people is just as important as it means that we humans can learn, not just machines.
My patent attorney sent me through some interesting data about the USPTO patent agent reviewing my application. The agent takes an average of 886 days to first respond to an application and if we follow the average path then we will expect to receive our patent approval on August 3, 2020, a mere 2041 days after filing it.
When we excitedly tell people that the new version of Knowledge Leaps incorporates k-fold validation, their eyes glaze over. When we tell people about the benefits of this feature, we usually get the opposite response.
In simple terms, k-fold validation is like having a team of 10 PhDs working on your data, independently and simultaneously. The application doesn’t produce just one prediction; it makes 10, all independent of one another. This approach produces more general models, which are closer to a rule of thumb and consequently useful in more contexts. It is another step toward human-centered analytics without the human bias.
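For readers who want to see the mechanics, here is a minimal sketch of how k-fold validation divides the data. It is a generic illustration, not Knowledge Leaps code: the k_fold_splits helper is hypothetical, and it leaves out the shuffling and model fitting a real run would include.

```python
def k_fold_splits(n_rows, k=10):
    """Yield (train_indices, test_indices) once for each of the k folds."""
    indices = list(range(n_rows))
    # Deal the row indices into k folds, like dealing cards to k players.
    folds = [indices[i::k] for i in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        train = [i for j, fold in enumerate(folds) if j != held_out for i in fold]
        yield train, test


# With 20 rows and k=10, each model trains on 18 rows and is tested on the other 2.
splits = list(k_fold_splits(20, k=10))
```

Each of the 10 models is tested on rows it never saw during training, and every row is held out exactly once.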