A Programming Language For Data Engineering

Noodling around on the internet, I came across this paper (Integrating UNIX Shell In A Web Browser). While it was written 18 years ago, it comes to a conclusion that is hard to argue with: graphical user interfaces slow work processes.

The authors claim that GUIs slow us down because they require a human to interact with them. Having built a GUI-led data analytics application, I am inclined to agree: the simpler a GUI is to use, the more time and money it costs to develop.

To that end, we are creating a programming language for data engineering on our platform. Our working title for the language is wrangle (WRANgling Data Language). It will support around 20 data engineering functions (e.g., filtering, mapping, transforming) and the ability to string commands together to perform more complex data engineering tasks.
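wrangle itself is still at the working-title stage, but the kind of chained filter/map/transform pipeline it is meant to express can be sketched in plain Python (the data and field names below are invented for illustration):

```python
# A sketch of the sort of chained data-engineering pipeline a
# language like wrangle might express; all names are hypothetical.
rows = [
    {"product": "tea", "units": 3, "price": 2.5},
    {"product": "coffee", "units": 0, "price": 3.0},
    {"product": "milk", "units": 7, "price": 1.2},
]

# filter: keep only rows with sales
sold = [r for r in rows if r["units"] > 0]

# transform: derive a revenue field from units and price
enriched = [{**r, "revenue": r["units"] * r["price"]} for r in sold]

print([r["product"] for r in enriched])  # ['tea', 'milk']
```

The point of stringing small steps together like this, rather than clicking through a GUI, is exactly the scriptability the paper below describes.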

Excerpt from paper: "The transition from command-line interfaces to graphical interfaces carries with it a significant cost. In the Unix shell, for example, programs accept plain text as input and generate plain text as output. This makes it easy to write scripts that automate user interaction. An expert Unix user can create sophisticated programs on the spur of the moment, by hooking together simpler programs with pipelines and command substitution. For example:

kill `ps ax | grep xterm | awk '{print $1;}'`

This command uses ps to list information about running processes, grep to find just the xterm processes, awk to select just the process identifiers, and finally kill to kill those processes.

These capabilities are lost in the transition to a graphical user interface (GUI). GUI programs accept mouse clicks and keystrokes as input and generate raster graphics as output. Automating graphical interfaces is hard, unfortunately, because mouse clicks and pixels are too low-level for effective automation and interprocess communication."

Spam, Bots, and Turing Tests

Since I started my blog I have had 350+ spam comments. They tend to come in waves of similar types of comments. One week they might all be in Russian; the next week they all refer to Houdini and seatbelts. Broadly speaking, they fall into two categories: they are either flattering and want me to click on a malicious link, or they make no sense whatsoever.

This latter class of comments is interesting because of its seeming pointlessness: there is no link for me to click on in the comment. This got me thinking. When I mark a comment as "spam", "trash" or "publish", the author of that comment gets a message back saying: "Your comment has been published/deleted". This could be useful feedback if the comments were generated by a computer and someone was trying to write a bot that could perform natural language processing, and maybe even pass the Turing Test. To train the bot you would need lots of examples of text that can be easily parsed from a web page, and where better than a blog to get that sort of information? Each time I set the status of a comment to spam, I am helping train a bot and have become the unwitting servant of a malicious hacker.


Big Data, Laws of Physics and Sampling

One of the issues with large data files is that you very quickly run up against the physical laws of the universe: hash function collision rates have a meaningful impact on how exhaustive your calculations can be, and unbounded memory structures create significant performance issues.
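The collision point follows from the birthday bound: the chance of at least one hash collision grows with the square of the number of items hashed. A rough back-of-the-envelope calculation, assuming a uniform 64-bit hash:

```python
import math

def collision_probability(n_items: int, hash_bits: int) -> float:
    """Birthday-bound approximation of the chance of at least one
    collision when hashing n_items into a space of 2**hash_bits."""
    space = 2.0 ** hash_bits
    return 1.0 - math.exp(-n_items * (n_items - 1) / (2.0 * space))

# A billion rows into a 64-bit hash already carries a ~2.7% chance
# of at least one collision.
print(round(collision_probability(10**9, 64), 3))  # 0.027
```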

With our KL app, we are building technology to get round that. As our Maximum Viable File Size has grown from thousands of rows to millions, and now to billions, we have realized that the laws of physics are a real nuisance when analyzing data.

To that end, we have rolled out a data sampling feature that allows users to run analyses on a randomized subset of a data file. When speed of analysis is important, this feature lets users get round the laws of physics and still produce representative results.
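One standard way to draw such a randomized subset in a single pass over a file too large for memory is reservoir sampling (Algorithm R). The sketch below is illustrative, not a description of KL's actual implementation:

```python
import random

def reservoir_sample(rows, k, seed=None):
    """Uniformly sample k rows from an iterable of unknown length
    in a single pass, using O(k) memory (Algorithm R)."""
    rng = random.Random(seed)
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            sample.append(row)          # fill the reservoir first
        else:
            j = rng.randint(0, i)       # replace with decaying probability
            if j < k:
                sample[j] = row
    return sample

subset = reservoir_sample(range(1_000_000), 1000, seed=42)
print(len(subset))  # 1000
```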

Redesign Rationale and New Features


Knowledge Leaps Landing Page Image

The objective behind the redesign is to make better use of screen real estate, ease navigation, and simplify workflows. Since we began development, the product has, by necessity, become more complex. Making it simple and easy to use is central to the brief.

The rolling brief of "simplify" will continue to apply as the capabilities of the platform become more advanced. The UI will continue to evolve as more features are launched. In this release we have added the following features:

Data Formats - users can now import zipped files, as well as comma-, semicolon-, and pipe-delimited data files. Parsing now includes automatic delimiter detection.
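Automatic delimiter detection of this kind can be sketched with Python's standard csv.Sniffer, which inspects a sample of the text and guesses the dialect (the sample data here is invented):

```python
import csv
import io

raw = "id;product;units\n1;tea;3\n2;coffee;0\n"

# Sniffer guesses the delimiter from a sample, restricted here to
# the candidates mentioned above: comma, semicolon, pipe, tab.
dialect = csv.Sniffer().sniff(raw, delimiters=",;|\t")
reader = csv.reader(io.StringIO(raw), dialect)
print(next(reader))  # ['id', 'product', 'units']
```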

Column Reduction - users can use this feature to delete fields in the data and save a new, reduced version of the file. This is useful for stripping out PII fields or fields that contain "bloat", improving performance and enhancing security.
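A minimal sketch of column reduction with Python's standard csv module; the PII field names here are hypothetical:

```python
import csv
import io

PII_FIELDS = {"email", "phone"}  # hypothetical PII column names

src = io.StringIO("name,email,spend\nana,ana@example.com,12\n")
out = io.StringIO()

reader = csv.DictReader(src)
kept = [f for f in reader.fieldnames if f not in PII_FIELDS]

# extrasaction="ignore" silently drops the removed columns per row.
writer = csv.DictWriter(out, fieldnames=kept, extrasaction="ignore")
writer.writeheader()
for row in reader:
    writer.writerow(row)

print(out.getvalue().splitlines()[0])  # name,spend
```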

Data Extraction - users can extract unique lists of values from fields in a data set. The primary use case for this feature is to allow users to create audiences based on behaviors. These audiences can then be appended to new data sets to identify cross-over behavior.
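Extracting a unique list of values that defines a behavioral audience can be pictured as a set comprehension over event rows (the event data below is invented):

```python
events = [
    {"user": "u1", "action": "view"},
    {"user": "u2", "action": "buy"},
    {"user": "u1", "action": "buy"},
]

# Unique list of users who exhibited a given behavior; this set can
# then be joined against a new data set to flag cross-over behavior.
buyers = {e["user"] for e in events if e["action"] == "buy"}
print(sorted(buyers))  # ['u1', 'u2']
```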

Data Sampling - users can randomly sample rows from a data file. For very large data sets, performing exhaustive calculations is time- and resource-intensive. Sampling a data set and analyzing a subset is based on sound statistical principles and rapidly increases productivity for large data sets.
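The statistical principle at work is the usual sampling margin of error, which shrinks with the square root of the sample size. A quick worst-case calculation, assuming a simple random sample and an estimated proportion:

```python
import math

def worst_case_margin(n: int, z: float = 1.96) -> float:
    """Worst-case 95% margin of error for a proportion estimated
    from a simple random sample of n rows (p = 0.5 maximizes it)."""
    return z * math.sqrt(0.25 / n)

# Even 10,000 sampled rows pin a proportion to within about +/-1%,
# regardless of how many billions of rows the full file holds.
print(round(worst_case_margin(10_000), 4))  # 0.0098
```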

Transform Filters - users can transform a filter into a mapping file. Data reduction is an important step in data analysis, and converting filters into data-reduction maps makes this effortless.
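The filter-to-mapping conversion can be pictured as turning a predicate into an explicit lookup table; the values and keep/drop labels below are invented for illustration:

```python
# A filter, as originally defined: which values pass?
values = ["tea", "coffee", "milk", "cocoa"]
keep = {"tea", "milk"}

# The same filter, materialized as a reusable mapping file
# (value -> label) that can be applied to other data sets.
mapping = {v: ("keep" if v in keep else "drop") for v in values}
print(mapping["coffee"])  # drop
```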

Dynamic Mapping - users can access API endpoints, pass values to an endpoint, and take the returned value as the "mapped value". Initially this will be limited to an internal API that maps product codes to brand and owner. New API connections will be added over time.
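A dynamic mapping call of this kind might look like the sketch below. The endpoint URL, query parameter, and response shape are all assumptions for illustration, and an offline stub stands in for the internal product-code API:

```python
import json
from urllib.request import urlopen

def dynamic_map(value, endpoint, fetch=None):
    """Map a value by calling an API endpoint and taking the returned
    value as the mapped value. The URL format and the {"mapped": ...}
    response shape are hypothetical."""
    fetch = fetch or (lambda url: urlopen(url).read())
    payload = json.loads(fetch(f"{endpoint}?value={value}"))
    return payload["mapped"]

# Offline stand-in for a product-code -> brand endpoint:
fake = lambda url: json.dumps({"mapped": "AcmeBrand"}).encode()
print(dynamic_map("SKU123", "https://api.example.com/map", fetch=fake))
```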

Multiple AWS Accounts - users can now specify multiple AWS account access keys to connect to. This supports the launch of KL data products: KL now offers a range of data products that firms can subscribe to. Multiple AWS account capability allows customers to bring many different data streams into their account environment on the platform.

As well as being accessible through a simple form- and button-led UI, these features are the building blocks of future analytics solutions. They are platform-wide universal tools, untethered from any specific context or environment. This gives our product development team greater flexibility to design and implement new functions and features.