Big Data, Laws of Physics and Sampling

One of the issues with large data files, is that very quickly you come up against the physical laws of the universe; hash function collision rates have meaningful impact on how exhaustive your calculations are and unbounded memory structures create significant performance issues.

With our KL app, we are building  technology to get round that.  As our Maximum Viable File Size has grown from thousands of rows, to millions of rows and now to billions of rows we realized that the laws of physics are a real nuisance when analyzing data.

To that end, we have rolled out a data sampling feature that allows users to run analysis on a randomized subset of a data file.  When speed of analysis is important then this feature allows users to get round the laws of physics and produce representative results.

Parallelization Begins

Having built a bullet-proof k-fold analytics engine, we have begun the process of migrating it to a parallel computing framework. As the size of the datasets that Knowledge Leaps is processing has increased in terms of volume and quantity, switching to a parallel framework will add scalable improvements in speed and performance. While we had limited the number of cross validations (the k value) to a maximum of 10, we will be able to increase it further with a minimal increase in compute time and much improved accuracy calculations.

Adding parellel-ization to the batch data engineering functionality will also increase the data throughput of the application. Our aim is to deliver a 10X – 20X improvements data throughput on larger datasets.

When It Comes To Rules Based Classification, Less Is More

We have been running trials on a 16 question survey, predicting the responses to a particular question using other data in the survey. What we discovered is that the more rules we allowed KL analytics engine to produce, the lower the accuracy and the harder it becomes to explain the model to another person.

Predictive Accuracy of Knowledge Leaps Analytics Engine Using 10-Fold Validation – Best Guess Accuracy = 50.2%