Data Engineering

October 18, 2019

No Code Data Engineering #2

We are adding to our no-code data engineering use cases. Our new Collection Manager feature plugs data pipelines into databases without code, using a simple drag-and-drop interface.

This feature allows users with zero knowledge of databases and query languages to import data into a database. An additional UI then lets them create queries, aggregations and extracts, again without writing code.

The UI can be set up to update the database as new data arrives from external sources. It will also automate extract creation as new data is added.
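To make the idea concrete, here is a minimal sketch of what such a UI might do behind the scenes: append each incoming batch to a table, then rerun an aggregation to refresh the extract. It uses Python's built-in sqlite3 module as a stand-in database. The table name, column names and helper functions are hypothetical and are not the Collection Manager's actual implementation.

```python
import sqlite3

def load_rows(conn, table, columns, rows):
    """Create the table if it does not exist, then append the incoming rows."""
    col_defs = ", ".join(f'"{c}"' for c in columns)
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({col_defs})')
    placeholders = ", ".join("?" for _ in columns)
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', rows)
    conn.commit()

def sales_by_store(conn):
    """An aggregation a UI might generate: total sales per store."""
    cur = conn.execute(
        "SELECT store, SUM(amount) FROM sales GROUP BY store ORDER BY store"
    )
    return cur.fetchall()

conn = sqlite3.connect(":memory:")  # in-memory database, for illustration only
load_rows(conn, "sales", ["store", "amount"], [("north", 10.0), ("south", 4.5)])
# A later batch arrives from an external source; the extract is simply rerun.
load_rows(conn, "sales", ["store", "amount"], [("north", 2.5)])
extract = sales_by_store(conn)
```

Rerunning the same query after every load is the simplest way to keep an extract current. A production system would more likely update aggregates incrementally.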

Example use cases for this feature include creating data feeds for dashboards that auto-populate, or creating custom data products delivered on a timer with a guaranteed delivery delay. This feature will also drive our retail experimentation business: we can design and set up a data framework that captures and tags the results from test-and-learn activity.

August 9, 2019

Code-free Data Science

There will always be a plentiful supply of data scientists on hand to perform hand-cut, custom data science. For most businesses' requirements, the typical data scientist is over-skilled. Only other data scientists can understand their work and, importantly, only other data scientists can check it.

What businesses require for most tasks are people with the data engineering skills of data scientists, and not necessarily their statistical skills or their understanding of the scientific method of analysis.

Data engineering on a big scale is fraught with challenges. While Excel and Google Sheets can handle relatively large (~1 million row) datasets, there is no comparable software that allows easy visualization and manipulation of larger data sets. SQL and NoSQL databases are required for super-scale data engineering, but they demand the skills of a super-user. As the 'data-is-the-new-oil' mantra makes its way into businesses, people will be exposed to a growing number of datasets that are beyond the reach of the software available to them and, potentially, their skill sets.

At Knowledge Leaps we are building a platform solution for this future audience and these future use cases. At the core of the platform are two important features: Visual Data Engineering pipelines and Code-Free Data Science.

The applications of these features are endless: from building a customer data lake, to building a custom data pipeline for report generation, to creating simple-to-evaluate predictive models.

February 26, 2019 / March 17, 2019

New Feature: Productization of the Production of Data Products

As we work more closely with our partner company DecaData, we are building tools and features that help bring data products to market and then deliver them to customers. A lot of this is repetitive process work, making it ideal for automation. Furthermore, if data is the new oil, we need an oil rig, a refinery and a pipeline to manage this new commodity.

Our new feature implements these operations. Users can now create automated, time-triggered pipelines that import new data files and then perform a set of customizable operations before delivering them to customers via SFTP or to an AWS S3 bucket.
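The shape of such a pipeline can be sketched in a few lines. In this illustrative version, a scheduler (cron, for example) is assumed to call `run_pipeline` on a timer. Each run picks up files it has not seen before, applies the configured operations in order, and hands the result to a delivery callable. The function names, the `seen` set and the in-memory `outbox` standing in for an SFTP server or S3 bucket are all assumptions made for the example, not the platform's API.

```python
import tempfile
from pathlib import Path

def run_pipeline(inbox, seen, operations, deliver):
    """Process files not handled on a previous run; return the names delivered."""
    delivered = []
    for path in sorted(Path(inbox).glob("*.csv")):
        if path.name in seen:
            continue  # already processed on an earlier trigger
        lines = path.read_text().splitlines()
        for op in operations:  # customizable operations, applied in order
            lines = op(lines)
        deliver(path.name, "\n".join(lines))
        seen.add(path.name)
        delivered.append(path.name)
    return delivered

def drop_blank(lines):
    """Example operation: remove empty lines."""
    return [ln for ln in lines if ln.strip()]

outbox = {}   # stand-in for an SFTP server or S3 bucket
seen = set()  # names of files handled so far
with tempfile.TemporaryDirectory() as inbox:
    Path(inbox, "day1.csv").write_text("sku,qty\nA1,3\n\nB2,5\n")
    first = run_pipeline(inbox, seen, [drop_blank], outbox.__setitem__)
    second = run_pipeline(inbox, seen, [drop_blank], outbox.__setitem__)
```

Passing the delivery step in as a callable keeps the pipeline logic independent of where the files end up.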

November 28, 2018

A Programming Language For Data Engineering

Noodling on the internet, I read this paper (Integrating UNIX Shell In A Web Browser). Although it was written 18 years ago, it comes to a conclusion that is hard to argue with: graphical user interfaces slow work processes.

The authors claim that GUIs slow us down because they require a human to interact with them. In building a GUI-led data analytics application, I am inclined to agree: the time and cost associated with developing GUIs increase with simplification.

To that end, we are creating a programming language for data engineering on our platform. Our working title for the language is wrangle (WRANgling Data Language). It will support ~20 data engineering functions (e.g., filtering, mapping, transforming) and the ability to string commands together to perform more complex data engineering.
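The syntax of wrangle is not shown here, so the following is only a sketch, in Python, of the underlying idea: small single-purpose functions that can be strung together into a larger operation. The function names (`filter_rows`, `map_field`, `transform`, `pipeline`) are hypothetical and are not wrangle commands.

```python
from functools import reduce

def filter_rows(predicate):
    """Keep only the rows for which predicate(row) is true."""
    return lambda rows: [r for r in rows if predicate(r)]

def map_field(field, mapping):
    """Replace values of `field` via a lookup table; unmapped values pass through."""
    return lambda rows: [{**r, field: mapping.get(r[field], r[field])} for r in rows]

def transform(field, fn):
    """Apply fn to the value of `field` in every row."""
    return lambda rows: [{**r, field: fn(r[field])} for r in rows]

def pipeline(*steps):
    """String steps together left to right, much like a shell pipeline."""
    return lambda rows: reduce(lambda acc, step: step(acc), steps, rows)

rows = [
    {"sku": "A1", "price": "2.50"},
    {"sku": "B2", "price": "0.00"},
    {"sku": "C3", "price": "4.00"},
]
run = pipeline(
    transform("price", float),              # parse prices
    filter_rows(lambda r: r["price"] > 0),  # drop zero-priced rows
    map_field("sku", {"A1": "Brand A"}),    # map product code to brand
)
result = run(rows)
```

Because every step takes rows and returns rows, any step can follow any other. That is the same property that makes shell pipelines composable.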

Excerpt from paper: "The transition from command-line interfaces to graphical interfaces carries with it a significant cost. In the Unix shell, for example, programs accept plain text as input and generate plain text as output. This makes it easy to write scripts that automate user interaction. An expert Unix user can create sophisticated programs on the spur of the moment, by hooking together simpler programs with pipelines and command substitution. For example:

kill `ps ax | grep xterm | awk '{print $1;}'`

This command uses ps to list information about running processes, grep to find just the xterm processes, awk to select just the process identifiers, and finally kill to kill those processes.

These capabilities are lost in the transition to a graphical user interface (GUI). GUI programs accept mouse clicks and keystrokes as input and generate raster graphics as output. Automating graphical interfaces is hard, unfortunately, because mouse clicks and pixels are too low-level for effective automation and interprocess communication."

November 27, 2018 / December 19, 2018

Point-In-Time M&A Data Base and API

We just launched the alpha version of our point-in-time historical brand M&A database and API. Submit a product code and a date, and the API will return the ISIN code (if public) of the owner on that date.
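A point-in-time lookup of this kind can be sketched as a search over a dated ownership history: find the latest record whose effective date is on or before the query date. The data structure, function name, product code and ISIN-like strings below are made up for illustration. They are not the API's real schema or real identifiers.

```python
from bisect import bisect_right
from datetime import date

# Hypothetical history: product code -> [(effective_date, owner ISIN or None)],
# sorted by date. None marks a period under a non-public owner.
OWNERSHIP = {
    "0001234": [
        (date(2010, 1, 1), "XX0000000001"),
        (date(2016, 6, 15), None),
        (date(2018, 3, 1), "XX0000000002"),
    ],
}

def owner_isin(product_code, on_date):
    """Return the owner's ISIN as of on_date, or None if unknown or not public."""
    history = OWNERSHIP.get(product_code, [])
    dates = [effective for effective, _ in history]
    i = bisect_right(dates, on_date)  # number of records effective on or before on_date
    if i == 0:
        return None  # no ownership record that early
    return history[i - 1][1]
```

For example, `owner_isin("0001234", date(2015, 1, 1))` returns the first owner's placeholder ISIN, while a query dated within the second period returns `None`.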

November 14, 2018

Big Data, Laws of Physics and Sampling

One of the issues with large data files is that you very quickly come up against the physical laws of the universe: hash function collision rates have a meaningful impact on how exhaustive your calculations are, and unbounded memory structures create significant performance issues.
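To give a feel for the collision problem, the standard birthday-bound approximation estimates the chance that at least one pair of items shares a hash value. The sketch below assumes an idealized, uniformly distributed hash. The hash widths are illustrative and do not describe what the KL app uses.

```python
import math

def collision_probability(n_items, hash_bits):
    """Birthday-bound approximation of P(at least one collision)
    when n_items are hashed uniformly into 2**hash_bits buckets."""
    space = 2 ** hash_bits
    expected_pairs = n_items * (n_items - 1) / (2 * space)
    return -math.expm1(-expected_pairs)  # 1 - exp(-x), accurate for small x

p_million_32bit = collision_probability(1_000_000, 32)
p_billion_64bit = collision_probability(10**9, 64)
```

Under these assumptions, a million rows in a 32-bit hash space are all but certain to collide, and even a 64-bit space shows a collision probability of a few percent at a billion rows.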

With our KL app, we are building technology to get around that. As our Maximum Viable File Size has grown from thousands of rows to millions of rows, and now to billions of rows, we have realized that the laws of physics are a real nuisance when analyzing data.

To that end, we have rolled out a data sampling feature that allows users to run analysis on a randomized subset of a data file. When speed of analysis is important, this feature allows users to get around the laws of physics and still produce representative results.
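One common way to draw a uniform random subset from a file too large to hold in memory is reservoir sampling, which needs only a single pass and a fixed amount of memory. The sketch below shows that technique as an illustration. The post does not say which sampling method the KL app uses, and the function name is an assumption.

```python
import random

def reservoir_sample(rows, k, seed=None):
    """Return k rows chosen uniformly at random from an iterable of unknown
    length, reading it once (Algorithm R). Returns all rows if there are fewer than k."""
    rng = random.Random(seed)
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            sample.append(row)      # fill the reservoir first
        else:
            j = rng.randint(0, i)   # inclusive on both ends
            if j < k:
                sample[j] = row     # replace with probability k / (i + 1)
    return sample

subset = reservoir_sample(range(1_000_000), k=100, seed=42)
```

Because `rows` can be any iterable, the same function works on an open file handle, so the full file never needs to be loaded.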

November 6, 2018

Redesign Rationale and New Features

The objective behind the redesign is to make better use of screen real estate, to ease navigation and to simplify workflows. Since we began development, the product has become more complex, by necessity. Making it simple and easy to use is central to the brief.

The rolling brief of "simplify" will continue to be used as the capabilities of the platform become more advanced. The UI will continue to evolve as more features are launched. In this release we have added the following features:

Data formats - users can now import zipped files and comma-, semicolon-, and pipe-delimited data files. For parsing, delimiters are now detected automatically.

Column Reduction - users can use this feature to delete fields in the data and save a new, reduced version of the data. This is a useful feature for stripping out PII fields or fields that contain "bloat", improving performance and enhancing security.

Data Extraction - users can extract unique lists of values from fields in a data set. The primary use case for this feature is to allow users to create audiences based on behaviors. These audiences can then be appended to new data sets to identify cross-over behavior.

Data Sampling - users can randomly sample rows from a data file. For very large data sets, performing exhaustive calculations is time- and resource-intensive. Analyzing a random sample instead rests on sound statistical principles and greatly increases productivity on large data sets.

Transform Filters - users can transform a filter into a mapping file. Data reduction is an important step in data analysis, and converting filters into data reduction maps will make it effortless.

Dynamic Mapping - users can access API endpoints, pass values to an endpoint and take the returned value as the "mapped value". Initially this will be limited to an internal API that maps product code to brand and owner. New API connections will be added over time.

Multiple AWS Accounts - users can now specify access keys for multiple AWS accounts to connect to. This supports the launch of KL data products: KL now offers a range of data products that firms can subscribe to. Multiple AWS account capability allows customers to bring many different data streams into their account environment on the platform.
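Three of the features above (automatic delimiter detection, Column Reduction and Data Extraction) can be sketched with Python's standard csv module. This is an illustration of the ideas only, not the platform's implementation. The helper names and the sample data are assumptions.

```python
import csv
import io

def detect_delimiter(sample, candidates=",;|"):
    """Guess the delimiter from a text sample using csv.Sniffer."""
    return csv.Sniffer().sniff(sample, delimiters=candidates).delimiter

def read_rows(text):
    """Parse delimited text into dicts, detecting the delimiter automatically."""
    delimiter = detect_delimiter(text[:1024])  # sniff only the start of the file
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))

def drop_columns(rows, to_drop):
    """Column reduction: remove unwanted fields, e.g. those holding PII."""
    return [{k: v for k, v in row.items() if k not in to_drop} for row in rows]

def unique_values(rows, field):
    """Data extraction: sorted list of the distinct values in one field."""
    return sorted({row[field] for row in rows})

raw = (
    "id;email;brand\n"
    "1;a@example.com;Acme\n"
    "2;b@example.com;Bolt\n"
    "3;c@example.com;Acme\n"
)
rows = read_rows(raw)
reduced = drop_columns(rows, {"email"})
brands = unique_values(reduced, "brand")
```

`csv.Sniffer` is a heuristic and can fail on short or irregular samples, so a real importer would want a fallback, such as asking the user to confirm the delimiter.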

As well as supporting solutions that can be accessed through a simple form/button-led UI, these features are the building blocks of future analytics solutions. They are platform-wide universal tools, untethered from a specific context or environment. This will give our product development team greater flexibility to design and implement new functions and features.

September 8, 2018

Patented Technology

The patent that has just been awarded to Knowledge Leaps is for our continuous learning technology. Whether it is survey data, purchase data or website traffic / usage data, the technology we have developed will automatically search these complex data spaces. The data spaces cover the price-demand space for packaged goods, the attitudinal space of market research surveys, and other data where there could be complex interactions. In each case, as more data is gathered (more people shopping, more people completing a survey, more people using an app or website), the application updates its predictions and builds a better understanding of the space.

In the use-case for the price-demand for packaged goods, the updated predictions then alter the recommendations about price changes that are made. This feedback loop allows the application to update its beliefs about how shoppers are reacting to prices and make improved recommendations based on this knowledge.

In the survey data use-case, the technology will create an alert when the data set becomes self-predicting. At this point capturing further data is unnecessary to understand the data set and carries an additional expense.

The majority of statistical tools enable analysts to identify the relationships in data. In the hands of a human, this is a brute-force approach that is prone to human biases and time constraints. The Knowledge Leaps technology allows for a more systematic and parallelized approach, avoiding human bias and reducing human effort.

August 14, 2017

Building the Future of Machine Learning and Analytics. Right Here, Right Now.

TechCrunch recently published an article which describes what I am building with the Knowledge Leaps platform (check out website here).

Knowledge Leaps is a soup-to-nuts data management and analytics platform. With a focus on data engineering, the platform is aimed at helping people prepare data in readiness for predictive modeling.

The first step to incorporating AI into an analytics process is to build an application that automates grunt work. The effort is in cleaning data, mapping it and converting it to the right structure for further manipulation. It's time-consuming but can be systematized. The Knowledge Leaps application does this, right now. It seamlessly converts any data structure into user-level data using a simple interface, perfect for those who aren't data scientists.

Any data can then be used in classification models using an unbiased algorithm combined with k-fold cross validation for rigorous, objective testing. This is just the tip of the iceberg of its current, and future, functionality.
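For readers unfamiliar with k-fold cross validation, here is a minimal sketch of the procedure: the rows are split into k folds, and each fold in turn is held out for testing while a model is fitted on the rest. The classifier here is a deliberately trivial majority-class placeholder, not the algorithm the platform uses, and all function names are assumptions.

```python
import random
from collections import Counter

def k_fold_indices(n_rows, k, seed=0):
    """Yield (train_idx, test_idx) pairs; every row is tested exactly once."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test

def majority_class(train_labels):
    """Placeholder model: always predicts the most common training label."""
    label = Counter(train_labels).most_common(1)[0][0]
    return lambda _features: label

def cross_validate(features, labels, k=5):
    """Mean held-out accuracy across the k folds."""
    scores = []
    for train, test in k_fold_indices(len(labels), k):
        predict = majority_class([labels[j] for j in train])
        hits = sum(predict(features[j]) == labels[j] for j in test)
        scores.append(hits / len(test))
    return sum(scores) / k

features = [[x] for x in range(10)]
labels = ["a"] * 8 + ["b"] * 2
accuracy = cross_validate(features, labels, k=5)
```

Because every score comes from rows the model never saw during fitting, the average is a fairer estimate of real-world performance than accuracy measured on the training data.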

Onward, to the future of analytics.

August 5, 2017

Automation: A Bright Future

From reading many articles and posts about the threat of AI to the job market, I am coming to the view that any automation, whether or not it is a result of AI, is good for long-term economic prospects. Like most economists, I have painted a simplistic view of the economic cycle; nonetheless, I have faith that automation is a force for good.

Automation will help decouple the relationship between falling unemployment and rising inflation, a relationship that can quickly turn an economic boom into a recession.

The accepted view is that rising demand not only increases companies' profits, it also raises inflation as prices rise in response to demand. Rising demand for a company's products and services will lead to more hiring to increase output. As economies approach full employment, the cost of labor for companies faces two inflationary pressures: the first is in response to increased demand for labor, and the second is in response to increased prices, which lead to higher wage demands. This leads to a familiar cycle: boom -> increasing inflation -> correction in the economy -> increased unemployment and reduced inflation/prices -> boom -> etc.

Inserting automation into this cycle will allow companies to increase productivity without increasing labor costs, which erode profits and break the growth cycle. Increasing company profits will lead to increased share prices for public companies. Since many people's retirement savings are invested in the stock market in one form or another, as companies' profits grow, so will the value of people's retirement savings. This will make it easier for people to make the decision to retire. In short, the right amount of automation could a) reduce an economy's overall demand for labor, and b) provide sufficient long-term stock market gains to support a growing retired section of the population. This latter point is interesting since automation could reduce the overall demand for labor. If the pool of workers chasing fewer jobs is too large, then wages would fall, leading to deflation and a stagnated economy. The ideal outcome is that people remove themselves from the labor market because they can afford to retire sooner, leaving the right balance between jobs and workers. The right balance of labor supply and demand will allow for moderate inflation, GDP growth, and a stock market that can support a growing number of liberated workers.

From an employment point of view, automation may also create the need for new jobs that do not currently exist. For example, prior to 2007 a marketing department in a company did not need a Social Media Manager; similarly, there were no Gas Station Attendants prior to the invention of the car. In other words, automation will reduce the need for labor in current roles as companies look to increase productivity without baking in more labor costs, but it will also create new roles as the labor force becomes liberated from repetitive tasks.

One area where this is happening is data analysis and data engineering. My web app Knowledge Leaps is designed to automate the grunt and grind of data engineering. I am building it because I want people working in similar industries to be liberated from the chore of data management, so that they can focus on interpretation and application of the findings.