Big Data, Laws of Physics and Sampling

One of the issues with large data files is that you very quickly come up against the physical laws of the universe: hash function collision rates have a meaningful impact on how exhaustive your calculations can be, and unbounded memory structures create significant performance issues.
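To put a rough number on the hash collision point, here is a back-of-envelope sketch using the standard birthday approximation. It assumes an ideal hash that spreads values uniformly; the bit widths and row counts are illustrative only and are not taken from the KL app.

```python
import math


def collision_probability(n_items: int, hash_bits: int) -> float:
    """Approximate chance that at least two of n_items share a hash value.

    Birthday approximation: p ~= 1 - exp(-n(n-1) / (2 * 2**bits)).
    Assumes an ideal, uniformly distributed hash (an assumption, not a
    description of any particular hash function).
    """
    buckets = 2.0 ** hash_bits
    # -expm1(-x) computes 1 - exp(-x) without losing precision for tiny x.
    return -math.expm1(-n_items * (n_items - 1) / (2.0 * buckets))


# Illustrative file sizes: thousands, millions and billions of rows.
for rows in (10**3, 10**6, 10**9):
    p32 = collision_probability(rows, 32)
    p64 = collision_probability(rows, 64)
    print(f"{rows:>13,} rows: 32-bit p={p32:.3g}, 64-bit p={p64:.3g}")
```

Under these assumptions a 32-bit hash is all but certain to collide somewhere in a million rows, while a 64-bit hash only reaches a few percent risk at around a billion rows.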

With our KL app, we are building technology to get around that. As our Maximum Viable File Size has grown from thousands of rows to millions of rows, and now to billions of rows, we have realized that the laws of physics are a real nuisance when analyzing data.

To that end, we have rolled out a data sampling feature that allows users to run analysis on a randomized subset of a data file. When speed of analysis is important, this feature allows users to get around the laws of physics and still produce representative results.
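One common way to draw a random subset from a file too large to hold in memory is reservoir sampling. The sketch below is a minimal illustration of that technique (Algorithm R), not a description of how the KL sampling feature is built; the function name, the seed parameter and the generated rows are all made up for the example.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(rows: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Return up to k rows chosen uniformly at random from a stream.

    Algorithm R: keep the first k rows, then let row i (0-based) replace a
    random slot with probability k / (i + 1). Only k rows are ever held in
    memory, however long the stream is.
    """
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, row in enumerate(rows):
        if i < k:
            reservoir.append(row)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = row
    return reservoir


# A generator stands in for reading a large file one row at a time.
sample = reservoir_sample((f"row-{n}" for n in range(100_000)), k=5, seed=42)
print(sample)
```

The single pass and fixed memory footprint are the attraction here: the file never has to be loaded or counted before sampling starts.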

Redesign Rationale and New Features


[Image: Knowledge Leaps landing page]

The objective behind the redesign is to make better use of screen real estate, ease navigation and simplify workflows. Since we began development, the product has become more complex, by necessity. Making it simple and easy to use is central to the brief.

The rolling brief of “simplify” will continue to apply as the capabilities of the platform become more advanced, and the UI will continue to evolve as more features are launched. In this release we have added the following features:

Data formats – users can now import zipped files and comma-, semicolon- and pipe-delimited data files. Delimiters are now detected automatically during parsing.

Column Reduction – users can use this feature to delete fields in the data and save a new, reduced version of the data. This is useful for stripping out PII fields or fields that contain “bloat”, which improves performance and enhances security.

Data Extraction – users can extract unique lists of values from fields in a data set. The primary use case for this feature is to allow users to create audiences based on behaviors. These audiences can then be appended to new data sets to identify cross-over behavior.

Data Sampling – users can randomly sample rows from a data file. For very large data sets, performing exhaustive calculations is time- and resource-intensive. Sampling a data set and analyzing the subset is based on sound statistical principles and greatly increases productivity when working with large data sets.

Transform Filters – users can transform a filter into a mapping file. Data reduction is an important step in data analysis, and converting filters into data reduction maps will make this effortless.

Dynamic Mapping – users can access API endpoints, pass values to an endpoint and take the returned value as the “mapped value”. Initially this will be limited to an internal API that maps product code to brand and owner. New API connections will be added over time.

Multiple AWS Accounts – users can now specify access keys for multiple AWS accounts to connect to. This supports the launch of KL data products: KL now offers a range of data products that firms can subscribe to. The multiple-account capability allows customers to bring many different data streams into their account environment on the platform.

As well as building solutions that can be accessed through a simple form- and button-led UI, these features are the building blocks of future analytics solutions. They are platform-wide, universal tools, untethered from a specific context or environment. This will give our product development team greater flexibility to design and implement new functions and features.
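To make three of the features above concrete (delimiter detection, column reduction and data extraction), here is a minimal sketch using only Python's standard library. It is an illustration under stated assumptions rather than the platform's own code: the sample data, column names and function names are all hypothetical.

```python
import csv
import io
from typing import List, Set

# Hypothetical pipe-delimited data with one PII column (email).
SAMPLE = (
    "user_id|email|product_code|spend\n"
    "u1|ann@example.com|P100|12.50\n"
    "u2|bob@example.com|P200|7.00\n"
    "u3|cy@example.com|P100|3.25\n"
)


def detect_delimiter(text: str, candidates: str = ",;|") -> str:
    """Guess the delimiter from a text sample using csv.Sniffer."""
    return csv.Sniffer().sniff(text, delimiters=candidates).delimiter


def drop_columns(text: str, delimiter: str, drop: Set[str]) -> str:
    """Return the data with the named columns removed (column reduction)."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    keep = [name for name in (reader.fieldnames or []) if name not in drop]
    out = io.StringIO()
    writer = csv.DictWriter(
        out,
        fieldnames=keep,
        delimiter=delimiter,
        extrasaction="ignore",  # silently skip the dropped columns
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(reader)
    return out.getvalue()


def unique_values(text: str, delimiter: str, column: str) -> List[str]:
    """Return the sorted distinct values of one column (data extraction)."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return sorted({row[column] for row in reader})


delimiter = detect_delimiter(SAMPLE)
reduced = drop_columns(SAMPLE, delimiter, drop={"email"})
codes = unique_values(reduced, delimiter, "product_code")
print(reduced)
print(codes)
```

Each function takes plain text and returns plain text or a list, so the steps can be chained in any order, which is the sense in which such tools are building blocks.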

Innovation: A Thought And A Lesson or Two

Spending time with people in different professions and trying to do their jobs is an effective way of innovating new ideas. This is very different to talking to people about what they do in their job, unless of course you are talking to someone who is very self-aware about the pain points in their current job.

Over the past year, this is what we have been doing with Knowledge Leaps. Rather than invest money in building “cool” features, we have been taking the following approach:

  • Design it
  • Build it
  • Use it
  • Improve it

The Use it phase goes beyond testing functionality: we are testing the application’s performance envelope as well as its usability and seamlessness. Using the product to do actual work on actual data – i.e. doing the job of an analyst – is central to developing and innovating a useful and usable product.

Along the way I have learnt some lessons about myself too:

  • I hate typing into text boxes and implement complex features to avoid it.
  • I feel resentful when I have to tell an application something it already “knows”.
  • I am impatient. I don’t like waiting for calculations to be finished.
  • I like multitasking in real life, and I like my applications to multitask.

We plan to keep this process up as we roll out new features – advanced reporting, audience identification, and beyond.

New Feature: Handling Event-Type Data

I have spent a lot of time thinking about data and data structures. What I have learnt is that there are two types of data structure: data that has only one row per user (e.g. survey data) and data that has one row for each unique user event (e.g. clickstream data from an app or website), and therefore multiple rows per user.

Many web-based analytics platforms, like Amazon’s own ML platform, only let their users upload data that has a simple structure (one row per user, such as survey data and customer profile data). Very few platforms allow users to upload event-type data and engineer it into a simple form that can be used in predictive analytics.

Transforming event data requires data engineering, and this process can be daunting. To develop Knowledge Leaps further, we have spent a lot of time looking at a wide range of event-type data use cases. Our aim has been to create a systematic, easy-to-use (given the task) approach to simplifying the data engineering workflow. As with our models, we also want our user interface and processes to be human-readable.

In our latest release we are launching the Data Processor module. The design of this module has drawn heavily on working with real-world event data. This new feature allows the platform to take in any data type and apply simple processing rules to create analytics-ready data sets in minutes.
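As an illustration of the kind of processing rule involved, the sketch below collapses event-type data (many rows per user) into the simple one-row-per-user form. It is a hypothetical example, not the Data Processor module itself: the event fields, the sample events and the three summary features are assumptions chosen for the demonstration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical event rows: (user_id, event_type, amount).
Event = Tuple[str, str, float]

EVENTS: List[Event] = [
    ("u1", "view", 0.0),
    ("u1", "purchase", 12.5),
    ("u2", "view", 0.0),
    ("u1", "view", 0.0),
    ("u2", "purchase", 7.0),
    ("u2", "purchase", 3.0),
]


def flatten_events(events: Iterable[Event]) -> List[Dict[str, object]]:
    """Aggregate event rows into one summary row per user."""
    per_user: Dict[str, Dict[str, float]] = defaultdict(
        lambda: {"events": 0, "purchases": 0, "spend": 0.0}
    )
    for user_id, event_type, amount in events:
        row = per_user[user_id]
        row["events"] += 1  # rule 1: count every event
        if event_type == "purchase":
            row["purchases"] += 1  # rule 2: count purchases
            row["spend"] += amount  # rule 3: total purchase value
    # Sort by user id so the output order is stable.
    return [{"user_id": user, **features} for user, features in sorted(per_user.items())]


rows = flatten_events(EVENTS)
for row in rows:
    print(row)
```

Each resulting row has the same shape as a survey or customer-profile record, which is the form most predictive tools expect.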

Stories Vs. Predictions

Having worked within one industry (market research) for some time, I am intrigued by how other occupations, particularly data scientists, use data. After a conversation with a new friend – a data scientist – last week, I had a revelation.

Data scientists have created new words to talk about data analysis. The ones that stand out are features and feature sets. Quantitative market researchers talk about questions and surveys but never features. Essentially, they are the same thing: features are traits, attributes and behaviors of people that can be used to describe or predict specific outcomes. The big difference is that data scientists don’t care so much whether the features are human-readable (i.e. whether they can be read and understood like a book), as long as they help make a prediction. For example, Random Forests make good predictors but aren’t easily understandable. The same is true of Support Vector Machines: they are excellent predictors, but in higher dimensions they are hard to explain.

In contrast, market researchers are fixated on the predictive features being human-readable. As data science has shown, a market researcher’s predictions, their stories, will always be weaker than those of a data scientist. This in part explains the continued trend of story-telling in market research circles. Stories are popular and contain some ambiguity, and this ambiguity allows people to take from them what they wish. This is an expedient quality in the short term, but it is damaging to the industry in the long term.

I think market researchers need to change. My aim with Knowledge Leaps is to bridge the gap between highly predictive features and human-readable stories.