Redesign Rationale and New Features


Knowledge Leaps Landing Page Image

The objective behind the redesign is to make better use of screen real estate, to ease navigation, and to simplify workflows. Since we began development, the product has, by necessity, become more complex. Making it simple and easy to use is central to the brief.

The rolling brief of "simplify" will continue to apply as the capabilities of the platform become more advanced. The UI will continue to evolve as more features are launched. In this release we have added the following features:

Data formats - users can now import zipped files as well as comma-, semicolon-, and pipe-delimited data files. Delimiters are now detected automatically during parsing.
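As an illustration, here is a minimal sketch of how zipped and delimited files could be read with automatic delimiter detection, using Python's standard csv and zipfile modules. The function names and file handling are illustrative assumptions, not the platform's actual implementation.

```python
# A minimal sketch of automatic delimiter detection for plain or zipped files.
import csv
import io
import zipfile

def sniff_delimiter(sample_text: str) -> str:
    """Guess the delimiter from a sample of the file's first few kilobytes."""
    dialect = csv.Sniffer().sniff(sample_text, delimiters=",;|")
    return dialect.delimiter

def read_rows(path: str):
    """Read rows from a plain or zipped delimited file, detecting the delimiter."""
    if path.endswith(".zip"):
        with zipfile.ZipFile(path) as zf:
            inner = zf.namelist()[0]          # assumes one data file per archive
            text = zf.read(inner).decode("utf-8")
    else:
        with open(path, encoding="utf-8") as fh:
            text = fh.read()
    delimiter = sniff_delimiter(text[:4096])
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))
```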

Column Reduction - users can use this feature to delete fields in the data and save a new, reduced version of the data. This is useful for stripping out PII fields or fields that contain "bloat", improving performance and enhancing security.
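The equivalent operation in code is straightforward; the sketch below assumes pandas and uses illustrative file and column names, with the PII fields chosen by the user.

```python
# A minimal sketch of column reduction: drop chosen fields and save a new file.
import pandas as pd

def reduce_columns(in_path: str, out_path: str, drop_fields: list) -> None:
    df = pd.read_csv(in_path)
    reduced = df.drop(columns=drop_fields, errors="ignore")  # skip fields not present
    reduced.to_csv(out_path, index=False)

# e.g. strip PII before sharing the file
reduce_columns("transactions.csv", "transactions_reduced.csv",
               drop_fields=["email", "phone_number", "full_name"])
```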

Data Extraction - users can extract unique lists of values from fields in a data set. The primary use case for this feature is to allow users to create audiences based on behaviors. These audiences can then be appended to new data sets to identify cross-over behavior.
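A sketch of that workflow is shown below: pull the unique values of a field (an "audience") from one data set and flag that audience in a second one. The column names and files are illustrative assumptions.

```python
# Extract an audience of unique user IDs and flag cross-over behavior elsewhere.
import pandas as pd

purchases = pd.read_csv("purchases.csv")        # hypothetical behavioral data
audience = set(
    purchases.loc[purchases["category"] == "organic", "user_id"].unique()
)

web_visits = pd.read_csv("web_visits.csv")      # hypothetical second data set
web_visits["organic_buyer"] = web_visits["user_id"].isin(audience)  # cross-over flag
```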

Data Sampling - users can randomly sample rows from a data file. For very large data sets, performing exhaustive calculations is time- and resource-intensive. Sampling a data set and analyzing a subset rests on sound statistical principles and greatly improves productivity on large data sets.
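For illustration, random row sampling with pandas looks like the sketch below; the 5% fraction and file names are illustrative.

```python
# A minimal sketch of random row sampling from a large data file.
import pandas as pd

df = pd.read_csv("large_dataset.csv")
sample = df.sample(frac=0.05, random_state=42)   # 5% random sample, reproducible
sample.to_csv("large_dataset_sample.csv", index=False)
```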

Transform Filters - users can transform a filter into a mapping file. Data reduction is an important step in data analysis, and converting filters into data-reduction maps will make this effortless.

Dynamic Mapping - users can access API endpoints, pass values to an endpoint, and use the returned value as the "mapped value". Initially this will be limited to an internal API that maps product codes to brand and owner. New API connections will be added over time.
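Conceptually, a dynamic mapping call looks like the sketch below. The endpoint URL, parameters, and response fields are hypothetical placeholders, not the real internal API.

```python
# A sketch of dynamic mapping: look up brand and owner for a product code.
import requests

def map_product_code(code: str) -> dict:
    resp = requests.get("https://api.example.internal/products", params={"code": code})
    resp.raise_for_status()
    payload = resp.json()
    # the returned values become the "mapped values" for this code
    return {"brand": payload.get("brand"), "owner": payload.get("owner")}
```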

Multiple AWS Accounts - users can now specify access keys for multiple AWS accounts. This supports the launch of KL data products: KL now offers a range of data products that firms can subscribe to. Multiple-account support allows customers to bring many different data streams into their account environment on the platform.
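As a sketch of what connecting to several accounts involves, the snippet below creates one boto3 session per set of credentials; the account labels, keys, and bucket names are placeholders.

```python
# A sketch of listing data available across multiple AWS accounts.
import boto3

accounts = {
    "client_data":  {"key": "AKIA...CLIENT", "secret": "...", "bucket": "client-bucket"},
    "kl_products":  {"key": "AKIA...KL",     "secret": "...", "bucket": "kl-data-products"},
}

for label, cfg in accounts.items():
    session = boto3.Session(
        aws_access_key_id=cfg["key"],
        aws_secret_access_key=cfg["secret"],
    )
    s3 = session.client("s3")
    objects = s3.list_objects_v2(Bucket=cfg["bucket"])
    print(label, objects.get("KeyCount", 0), "objects available")
```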

As well as being solutions that can be accessed through a simple form- and button-led UI, these features are the building blocks of future analytics solutions. They are platform-wide, universal tools, untethered from a specific context or environment. This will give our product development team greater flexibility to design and implement new functions and features.

Patented Technology

The patent that has just been awarded to Knowledge Leaps is for our continuous learning technology. Whether it is survey data, purchase data, or website traffic and usage data, the technology we have developed will automatically search these complex data spaces. These data spaces cover the price-demand space for packaged goods, the attitudinal space of market research surveys, and other data where there could be complex interactions. In each case, as more data is gathered - more people shopping, more people completing a survey, more people using an app or website - the application updates its predictions and builds a better understanding of the space.

In the price-demand use case for packaged goods, the updated predictions alter the recommendations that are made about price changes. This feedback loop allows the application to update its beliefs about how shoppers are reacting to prices and make improved recommendations based on this knowledge.

In the survey data use case, the technology will create an alert when the data set becomes self-predicting. At this point, capturing further data adds expense without improving understanding of the data set.

The majority of statistical tools enable analysts to identify the relationships in data. In the hands of a human, this is a brute-force approach and is prone to human biases and time constraints. The Knowledge Leaps technology allows for a more systematic and parallelized approach - avoiding human bias and reducing human effort.

Parallelization Begins

Having built a bullet-proof k-fold analytics engine, we have begun the process of migrating it to a parallel computing framework. As the datasets that Knowledge Leaps processes have grown in both size and number, switching to a parallel framework will deliver scalable improvements in speed and performance. We had previously limited the number of cross-validation folds (the k value) to a maximum of 10; we will now be able to increase it further with a minimal increase in compute time and much-improved accuracy calculations.
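The idea can be illustrated with scikit-learn, where folds can be evaluated across cores; the model, data, and fold count below are placeholders rather than our engine.

```python
# A sketch of parallelized k-fold cross-validation: folds run across CPU cores.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=10_000, n_features=20, random_state=0)
model = RandomForestClassifier(random_state=0)

# raising k beyond 10 costs little wall-clock time when folds run in parallel
scores = cross_val_score(model, X, y, cv=20, n_jobs=-1)
print("mean accuracy:", scores.mean(), "std:", scores.std())
```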

Adding parallelization to the batch data engineering functionality will also increase the data throughput of the application. Our aim is to deliver a 10x-20x improvement in data throughput on larger datasets.

Building the Future of Machine Learning and Analytics. Right Here, Right Now.


TechCrunch recently published an article that describes what I am building with the Knowledge Leaps platform (check out the website here).

Knowledge Leaps is a soup-to-nuts data management and analytics platform. With a focus on data engineering, the platform is aimed at helping people prepare data in readiness for predictive modeling.

The first step to incorporating AI into an analytics process is to build an application that automates the grunt work. The effort is in cleaning data, mapping it, and converting it to the right structure for further manipulation. It's time-consuming but can be systematized. The Knowledge Leaps application does this, right now. It seamlessly converts any data structure into user-level data using a simple interface, perfect for those who aren't data scientists.

Any data can then be used in classification models built with an unbiased algorithm combined with k-fold cross-validation for rigorous, objective testing. This is just the tip of the iceberg of its current, and future, functionality.

Onward, to the future of analytics.

Automation: A Bright Future

From reading many articles and posts about the threat of AI to the job market, I am coming to the view that any automation, whether or not it is a result of AI, is good for long-term economic prospects. Like most economists, I have painted a simplistic view of the economic cycle; nonetheless, I have faith that automation is a force for good.

Automation will help decouple the relationship between falling unemployment and rising inflation, a relationship that can quickly turn an economic boom into a recession.

The accepted view is that rising demand not only increases companies' profits, it also raises inflation as prices rise in response to demand. Rising demand for a company's products and services will lead to more hiring to increase output. As economies approach full employment, the cost of labor for companies faces two inflationary pressures: the first in response to increased demand for labor, and the second because rising prices lead to higher wage demands. This leads to a familiar cycle: boom -> increasing inflation -> correction in the economy -> increased unemployment and reduced inflation/prices -> boom -> etc.

Inserting automation into this cycle will allow companies to increase productivity without increasing labor costs, which would erode profits and break the growth cycle. Increasing company profits will lead to increased share prices for public companies. Since many people's retirement savings are invested in the stock market in one form or another, as company profits grow, so will the value of people's retirement savings. This will make it easier for people to make the decision to retire. In short, the right amount of automation could a) reduce an economy's overall demand for labor, and b) provide sufficient long-term stock market gains to support a growing retired section of the population.

The latter point is interesting because automation could reduce the overall demand for labor. If the pool of workers chasing fewer jobs is too large, then wages would fall, leading to deflation and a stagnated economy. The ideal outcome is that people remove themselves from the labor market, because they can afford to retire sooner, leaving the right balance between jobs and workers. The right balance of labor supply and demand will allow for moderate inflation, GDP growth, and a stock market that can support a growing number of liberated workers.

From an employment point of view, automation may also create the need for new jobs that do not currently exist. For example, prior to 2007 a marketing department did not need a Social Media Manager; similarly, there were no gas station attendants prior to the invention of the car. In other words, while automation will reduce the need for labor in current roles as companies look to increase productivity without baking in more labor costs, it will also create new roles as the labor force is liberated from repetitive tasks.

One area where this is happening is data analysis and data engineering. My web app Knowledge Leaps is designed to automate the grunt and grind of data engineering. I am building it because I want people working in similar industries to be liberated from the chore of data management, so that they can focus on the interpretation and application of findings.

New Feature: Handling Event-Type Data

I have spent a lot of time thinking about data and data structures. What I have learnt is that there are two types of data structures: data which has only one row per user (e.g. survey data) and data which has one row for each unique user event (i.e. clickstream data from an app or website) and therefore multiple rows per user.

Many web-based analytics platforms, like Amazon's own ML platform, only let their users upload data that has a simple data structure (one row per user, such as survey data and customer profile data). Very few platforms allow users to upload event-type data and engineer it into a simple form that can be used in predictive analytics.

Transforming event data requires data engineering, and this process can be daunting. To develop Knowledge Leaps further, we have spent a lot of time looking at a wide range of event-type data use cases. Our aim has been to create a systematic, easy-to-use (given the task) approach to simplifying the data engineering workflow. As with our models, we want our user interface and processes to be human-readable too.

In our latest release we are launching the Data Processor module. The design of this module has drawn heavily on our work with real-world event data. This new feature allows the platform to take in any data type and apply simple processing rules to create analytics-ready data sets in minutes.
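To make the kind of transformation concrete, the sketch below turns event-type data (one row per event) into a one-row-per-user table ready for modeling, using pandas; the file and column names are illustrative, not the module's internals.

```python
# A minimal sketch of pivoting event data into user-level features.
import pandas as pd

events = pd.read_csv("click_stream.csv")   # columns: user_id, event, timestamp

user_level = (
    events.pivot_table(index="user_id",
                       columns="event",
                       values="timestamp",
                       aggfunc="count",     # count of each event type per user
                       fill_value=0)
    .add_prefix("n_")                       # e.g. n_page_view, n_add_to_cart
    .reset_index()
)
user_level.to_csv("user_level.csv", index=False)
```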

The Power of Data (Engineering)

The Custom Knowledge Leaps Data Wrench

The secret to successful analytics lies in data engineering as much as in algorithm selection. Sure, there are exceptions to this. No doubt there are times when only one specific algorithm will work for a particular set of data. However, we believe there is no substitute for sound data engineering.

Data engineering is the process of feature creation. Features in the data are what an analytics algorithm uses to make predictions or estimates. How features are created by a data engineering process will ultimately determine how human-readable the final models are. It is easy to go from data engineering to data over-engineering.

An example of the pitfalls of data over-engineering is the use of Support Vector Machines. The SVM classification algorithm is very powerful; it achieves this by a) focusing only on the handful of data points that defy a simple black-and-white separation of the data, and b) performing data engineering that exposes powerful data features which might not make sense to the ordinary person. For some use cases this is acceptable, but SVM classifications could easily enter the territory of "snake oil". SVMs are an expert-user tool, and the end user has to trust the person performing the analytics because the outputs become too complex to explain in simple human terms.
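The contrast can be sketched in a few lines of scikit-learn: a shallow decision tree over named features prints back as plain rules, while an RBF-kernel SVM builds an implicit feature space whose boundary cannot be read back in simple terms. The data, feature names, and thresholds below are illustrative assumptions.

```python
# A sketch contrasting a human-readable model with an SVM's implicit features.
import numpy as np
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))                      # e.g. visit frequency, basket size
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)      # hypothetical purchase label

# Human-readable route: a shallow tree over engineered features prints as rules.
tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["visit_freq", "basket_size"]))

# SVM route: an RBF kernel constructs its own feature space; accurate, but the
# decision boundary cannot be expressed as simple rules.
svm = SVC(kernel="rbf").fit(X, y)
print("SVM training accuracy:", svm.score(X, y))
```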

Human-readable models are a current focus of KL. We are in the middle of building out our data engineering functionality to allow users to create human-readable features from many different data-structure types. These new features will improve the power of KL's analytics algorithms without rendering them exclusively machine-readable.