We Eat Our Own Dog Food

I first heard the phrase "do the dogs eat the dog food" on a start-up podcast I had listened to. The idea is: if your firm is building a product for customers, do the customers actually use it?

I then read this adaptation of the phrase and thought it applies to us. We ship features and code that help our customers and that help us do our jobs better. We make "dog food" and we eat it. So, if the UI for a new feature is clunky or an implementation doesn't quite hit the mark, we know about it because our team will tell us.

Feature Release: July 13

Today, we released a new set of features. The primary feature is a new auditing tool that helps data engineers quickly profile a data set in terms of column cardinality, row count and constituent file count. This simple feature gives a quick snapshot of a dataset and identifies potential data issues. In a production pipeline this prevents corrupted data from being dispatched.

Data Audit Icon

Clicking the icon performs the audit. Once the audit completes, the results can be viewed on each data set's information page.
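
For the curious, here is a minimal sketch of the kind of profile the audit produces. It is illustrative only, not the platform's implementation; the audit_dataset helper and the file names are hypothetical, and it assumes a data set stored as CSV files readable with pandas.

```python
import pandas as pd

def audit_dataset(files):
    """Illustrative profile: file count, row count and per-column cardinality."""
    df = pd.concat([pd.read_csv(f) for f in files], ignore_index=True)
    return {
        "file_count": len(files),
        "row_count": len(df),
        "column_cardinality": {col: df[col].nunique() for col in df.columns},
    }

# e.g. audit_dataset(["part-0001.csv", "part-0002.csv"])
```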

Forecasting Using Prior Distributions

We have been building some product forecasting models using Monte Carlo methods. Sales distributions are often skewed right. Using normal approximations tends to inflate forecast estimates, since a right-skewed distribution is not centered on its mean. Furthermore, the standard deviation of a skewed distribution produces, by definition, estimates with very wide variances.

To overcome this, we use a Monte Carlo simulator that draws from the sales distribution at random. Creating a sample of many estimates not only gives a more accurate forecast, it also helps us calculate more realistic margins of error.
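
As a rough illustration of the approach (not our production code), here is a minimal simulator in Python. It assumes the sales history sits in a NumPy array, resamples from the historical distribution, and reports percentile-based margins of error rather than a mean plus or minus a standard deviation.

```python
import numpy as np

rng = np.random.default_rng(42)

def mc_forecast(sales, horizon=12, n_sims=10_000):
    """Sum `horizon` random draws from the historical sales distribution,
    repeated n_sims times, then summarise the simulated totals."""
    sims = rng.choice(sales, size=(n_sims, horizon), replace=True).sum(axis=1)
    return {
        "median": float(np.median(sims)),
        "p5": float(np.percentile(sims, 5)),
        "p95": float(np.percentile(sims, 95)),
    }

# Right-skewed toy history (lognormal), as sales data often are.
history = rng.lognormal(mean=3.0, sigma=1.0, size=500)
print(mc_forecast(history))
```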

Feature Release: July 3

New features rolled out this week:

  • Apply filters and mapping files to other filters and mapping files. This feature helps create randomized lists and sub-filters based on new criteria. For example, extract a list of userIDs from a data file and apply gender from a lookup table. Then filter this list by gender to create a specific list of users. This new file can then be sampled randomly to create a new list of random userIDs that meet specific criteria (see the sketch below).
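
To make the chaining concrete, here is a minimal pandas sketch of that example. The file names, column names and the use of pandas are hypothetical, for illustration only; they are not the platform's internals.

```python
import pandas as pd

# Hypothetical file and column names, for illustration only.
events = pd.read_csv("events.csv")          # contains a userID column
genders = pd.read_csv("gender_lookup.csv")  # columns: userID, gender

# Extract userIDs, apply gender from the lookup table, filter by gender,
# then sample at random to produce a new list of userIDs.
users = events[["userID"]].drop_duplicates()
users = users.merge(genders, on="userID", how="left")
female_users = users[users["gender"] == "F"]
sampled = female_users["userID"].sample(n=100, random_state=1)
sampled.to_csv("sampled_userIDs.csv", index=False)
```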

Feature Release: June 21

List of features/fixes in latest app release:

  • Downloaded files are compressed by default. When users download data to their local computers, the files are now compressed automatically.
  • Data merges now run in the background: some users were struggling to combine multi-GB data files. We now merge large data sets in the background to avoid memory issues (see the sketch after this list).
  • Serverless charting: all charting has been pushed to a serverless environment.
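
As an illustration of how a large merge can sidestep memory issues, here is one way to combine big CSV files by streaming them in chunks. It is a sketch under assumptions, not our implementation: it assumes the inputs share a schema, so the merge is effectively an append, and the function and file names are hypothetical.

```python
import pandas as pd

def merge_large_files(paths, out_path, chunksize=1_000_000):
    """Stream each input file into a single output in chunks, so the
    multi-GB result never has to sit in memory all at once."""
    first = True
    for path in paths:
        for chunk in pd.read_csv(path, chunksize=chunksize):
            chunk.to_csv(out_path, mode="w" if first else "a",
                         header=first, index=False)
            first = False

# merge_large_files(["sales_2020.csv", "sales_2021.csv"], "merged.csv")
```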

New Feature: Parallelized Chart Production at Scale

We launched a new feature today. The Knowledge Leaps platform allows users to specify hundreds of charts with a few clicks. For example, a user can plot sales by date split by store ID using a simple flow. This can lead to thousands of charts being produced, each one derived from millions of lines of data.
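
To show the general pattern (a sketch, not the platform's code), the example below fans chart production out across worker processes, one chart per store. It assumes a hypothetical sales.csv with date, store_id and sales columns.

```python
from multiprocessing import Pool

import matplotlib
matplotlib.use("Agg")  # headless rendering, as in a serverless/batch worker
import matplotlib.pyplot as plt
import pandas as pd

def chart_for_store(group):
    """Render one sales-by-date chart for a single store."""
    store_id, store_df = group
    fig, ax = plt.subplots()
    store_df.groupby("date")["sales"].sum().plot(ax=ax, title=f"Store {store_id}")
    fig.savefig(f"sales_store_{store_id}.png")
    plt.close(fig)

if __name__ == "__main__":
    df = pd.read_csv("sales.csv")  # hypothetical columns: date, store_id, sales
    with Pool() as pool:
        pool.map(chart_for_store, df.groupby("store_id"))
```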

Data Knowledge Graph

When you are building data products and filtering data files, it is important to keep track of what you have combined to make a new data set and what you have removed. This feature has saved us countless hours.

From an audit perspective we can build a complete history of a dataset: when it was added to the platform, how it was processed, and when, where, and by whom it was delivered or downloaded. This removes a time-draining communication burden from our teams.

We can also add commentary and narratives to a data set. This helps us build transparency and persistent-state knowledge about data.
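
As a sketch of the kind of record this history implies (illustrative only; the DatasetRecord class and its fields are hypothetical, not our schema), each data set can carry its inputs, the filters applied, its deliveries, and any commentary.

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class DatasetRecord:
    """Illustrative lineage entry for a single data set."""
    name: str
    added_at: datetime
    derived_from: list = field(default_factory=list)     # inputs combined
    filters_applied: list = field(default_factory=list)  # what was removed
    deliveries: list = field(default_factory=list)       # when / who / where
    notes: list = field(default_factory=list)            # commentary and narrative

record = DatasetRecord(name="q3_sales_by_store", added_at=datetime(2020, 7, 13),
                       derived_from=["raw_sales.csv", "store_lookup.csv"],
                       filters_applied=["dropped rows with a null store_id"])
record.deliveries.append({"when": "2020-07-14", "who": "analyst@example.com",
                          "where": "client portal"})
record.notes.append("Store 104 excluded pending a data-quality review.")
```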

AI: A Working Assumption

Building a system that is 100% autonomous and makes its own decisions is both hard and high risk. Given that Amazon, with all its resources and smarts, uses human input for the low/no consequence AI built into Alexa, it is fairly safe to assume that *all* other firms making AI claims have a human involved in at least one critical step.

On Maxima: Search, Life and Software

Until recently, I wrestled with why people I knew growing up in a small village in the UK stayed in the village when there was a whole world of opportunities awaiting discovery. I have come to realize that life is a search process: a search for purpose, contentment and security. As with most search algorithms, some are better than others. Some people's search algorithms stop when they discover a local maximum, such as village life in the UK. Other algorithms drive people to travel much further.

Software development follows similar principles to a search algorithm. While we might think that we are heading towards a peak when we start out building an application, we soon discover that the landscape we are searching is evolving. If we rush too quickly to a peak we might find that we settle on a local rather than a global maximum. Facebook is a good example of the impact of search speed. The reason Facebook prevailed is that the many social networking sites that came before it provided the company with a long list of technical and business mistakes to avoid. A major lesson was controlled growth, in other words a slow search algorithm: avoiding the strong temptation, especially where a social network is concerned, to grow very rapidly.

This is an example of a good search process and how it has to be a slow one for long-term success. A slow search allows a process to find a stable solution. The simulated annealing algorithm is a good example of this: the random perturbations applied to the search result diminish over time as the solution gets closer to the optimum, while the occasional acceptance of a worse move ensures the search doesn't get stuck at a local optimum.
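
For readers unfamiliar with the algorithm, here is a toy simulated annealing loop in Python. It minimises rather than maximises (the mirror image of climbing to a peak), and the objective function and parameters are illustrative only.

```python
import math
import random

def simulated_annealing(f, x0, steps=10_000, t0=1.0, cooling=0.999):
    """Minimise f: perturb the current solution at random; both the size of the
    perturbation and the chance of accepting a worse move shrink as the
    temperature cools."""
    x, best = x0, x0
    t = t0
    for _ in range(steps):
        candidate = x + random.gauss(0, t)          # perturbation shrinks with t
        delta = f(candidate) - f(x)
        if delta < 0 or random.random() < math.exp(-delta / t):
            x = candidate                           # occasionally accept a worse move
        if f(x) < f(best):
            best = x
        t *= cooling                                # cool the search over time
    return best

# A bumpy 1-D landscape with several local minima (global minimum near x = -0.5).
def bumpy(x):
    return x ** 2 + 10 * math.sin(3 * x)

print(simulated_annealing(bumpy, x0=8.0))
```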

We have also been running our own slow search algorithm as we build Knowledge Leaps. We have been at this for a long time: app development began five years ago, but we started its design at least eight years ago. While we wanted to go faster, we have been resource constrained. The advantage of this is that we have built a resilient and fault-tolerant application. The slow development process has also helped foster our design philosophy. When we began, we wanted to be super-technical and build a new scripting language for data engineering. Over time our beliefs have mellowed as we have seen the benefits of a No Code / Low Code solution. Incorporating this philosophy into Knowledge Leaps has made the application that much more user friendly and stable.

ALEXA: State-of-the-art AI

If you want to see how good Alexa is at answering people's questions, you should sign on to Alexa Answers and look at the questions Alexa cannot answer. The site has gamified helping Alexa answer these questions. I spent a week doing this and figured out a pretty good workflow to stay in the top 10 of the leaderboard.

The winning strategy is to use Google. You copy the question into Google and paste the answer Google returns into the Alexa Answers website, where it is played back to the person who asked it. The clever thing is that since it is impossible to legally web-scrape Google.com at a commercially viable rate, Amazon has found a way of harnessing the power of Google without a) having to pay, b) violating Google.com's TOS, and c) getting caught stealing Google's IP.

After doing this for a week, the interesting thing to note is why Alexa could not answer these questions. Most of them are interpretation errors: Alexa misheard the question (e.g. "connor virus" or "coronda virus" instead of "coronavirus"). The remainder of the errors occur because the question assumes context Alexa doesn't have (e.g. "Is fgtv dead?", referring to a YouTube star); without the subject of the question being a known entity in Alexa's knowledge graph, the results are ambiguous. Rather than be wrong, Alexa declines to answer.

Obviously this is where the amazing pattern-matching abilities of the human brain come in. We can look at the subject of the question and the search results and choose the most probable correct answer. Amazon can then augment Alexa's knowledge graph using these results. This would probably violate Google's IP if Amazon intentionally set out to do it.

Having a human being perform the hard task in a learning loop is something we have also employed in building our platform. Knowledge Leaps can take behavioral data and tease out price-sensitivity signals from purchase data, as well as semantic signals from survey data.