A RoboCoworker for Analytics

You can make a pretty good guess at someone's age based purely on the number of web domains they have purchased and keep up to date. I have 46, and the other day I bought another one: RoboCoworker.com. I had in mind an automated coworker that could offer a sense of companionship to freelancers and solo start-up founders during their working day. It was semi-serious, and I put these thoughts to one side as I got back to some real work.

Today, I had a call with a prospect for Knowledge Leaps. I gave them a demo and described the use cases for their industry and role. It dawned on me that I was describing an automated coworker, a RoboCoworker if you will.

This wouldn't be someone you can share a joke with or discuss work issues with, but it would be another member of your analytics team - one that does the work while you are in meetings, fielding calls from stakeholders, or selling in the findings from the latest analysis. What I call real work that requires real people.

Silos: Bad For Business, People and Data

While keeping people in silos makes them easier to manage and direct, it tends to be bad for business in the long run, especially for businesses that rely on innovation for growth.

In the book The Medici Effect, the author describes how the wealthy 14th-century house of Medici created the conditions that led to the Renaissance - a period when there was an explosion of ideas across the arts and sciences. This was only possible because the family's wealth was able to support artists from different disciplines who shared ideas, a lesson for companies that want to innovate.

What's true of people is also true of data. Not all data is created equal. As a result, it tends to be put in silos determined by source (transactions, surveys, CRM, etc.). Different data has different degrees of meaningfulness: transaction data tends to be narrow but very deep (telling you a lot about a very narrow field), whereas survey data tends to be broad but less deep. Combining data with different strengths can uncover new insights. Linking transaction data with survey data, for example, can identify broader drivers of behavior, which in turn can drive sales and increase customer engagement.

In our view, silos are bad for data too. They prevent data owners from making the new discoveries that arise from merging a customer's data.

Knowledge Leaps de-silos your data, creating a single-customer view. This allows companies to look at the drivers, interactions and relationships across different types of data, whether it's transactions, surveys or CRM data.
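
To make that concrete, here is a minimal sketch of what a single-customer view can look like in code, using pandas. The file names, column names and aggregations are hypothetical, and this is not the Knowledge Leaps implementation - just the general idea of joining siloed sources on a shared customer ID.

```python
# A sketch of a single-customer view built by joining siloed data sources on a
# shared customer ID. File names, column names and aggregations are hypothetical.
import pandas as pd

# Transaction data: narrow but deep (many rows per customer).
transactions = pd.read_csv("transactions.csv")   # columns: customer_id, date, amount
spend = (
    transactions.groupby("customer_id")["amount"]
    .agg(total_spend="sum", trips="count")
    .reset_index()
)

# Survey data: broad but shallow (one row per customer).
survey = pd.read_csv("survey.csv")               # columns: customer_id, satisfaction, segment

# One row per customer, combining behavioral and attitudinal data.
single_customer_view = survey.merge(spend, on="customer_id", how="left")
print(single_customer_view.head())
```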


Innovation: A Thought And A Lesson or Two

Spending time with people in different professions and trying to do their jobs is an effective way of generating new ideas. This is very different from talking to people about what they do in their job, unless of course you are talking to someone who is very self-aware about the pain points in their current role.

Over the past year, this is what we have been doing with Knowledge Leaps. Rather than invest money in building "cool" features, we have been taking the following approach:

  • Design it
  • Build it
  • Use it
  • Improve it

The Use it phase goes beyond testing functionality: we are testing the application's performance envelope as well as its usability and seamlessness. Using the product to do actual work on actual data - i.e. doing the job of an analyst - is central to developing a useful and usable product.

Along the way I have learnt some lessons about myself too:

  • I hate typing into text boxes and implement complex features to avoid it.
  • I feel resentful when I have to tell an application something it already "knows".
  • I am impatient. I don't like waiting for calculations to be finished.
  • I like multitasking in real life, and I like my applications to multitask.

We plan to keep this process up as we roll out new features - advanced reporting, audience identification, and beyond.


A Moment To Rave About Serverless Computing

Knowledge Leaps now uses AWS Lambda, a serverless compute technology, to parallelize some of the more time-costly functions.

In layman's terms, servers are great but they have finite capacity for calculations, much like your own computer can get stuck when you have too many applications open at once, or that spreadsheet is just too large.

Serverless computing gives you the benefit of computing power without the capacity issues that a single server brings to the party. On AWS you can run up to 1024 serverless compute functions to speed up calculations. There are some limitations, which I won't go into, but needless to say this technology has reduced Knowledge Leaps' compute times by a factor of 50. Thank you, Jeff!
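
For the curious, the fan-out pattern looks roughly like the sketch below, which uses boto3 to invoke a hypothetical Lambda worker from a coordinating process. The function name and payload shape are made up for illustration; this is not the Knowledge Leaps code itself.

```python
# A sketch of fanning work out to AWS Lambda from a coordinating process.
# The function name ("kl-worker") and payload shape are hypothetical.
import json
from concurrent.futures import ThreadPoolExecutor

import boto3

lambda_client = boto3.client("lambda")

def run_chunk(chunk_id):
    """Invoke one Lambda worker synchronously and return its decoded result."""
    response = lambda_client.invoke(
        FunctionName="kl-worker",              # hypothetical worker function
        InvocationType="RequestResponse",      # wait for the result
        Payload=json.dumps({"chunk_id": chunk_id}),
    )
    return json.loads(response["Payload"].read())

# Fan out: many short-lived functions instead of one long-running server job.
with ThreadPoolExecutor(max_workers=64) as pool:
    results = list(pool.map(run_chunk, range(256)))
```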

Building An Asset, Being Strategic, Learning Important Lessons

Since shifting from a pure-play services company to building a product-led company, I am now seeing what it means to be strategic.

In building a product, you are investing in an asset. Investing in an asset forces you to make strategic decisions, since the product features define the course and goals for a company. When resources are limited, decision-making needs to be better, since the direction these decisions impose on your company is costly to undo.

Bootstrapping the development of Knowledge Leaps for the past three years has been eye-opening and a great learning opportunity. The top lessons learnt so far are:

  1. Don't invest money in features that don't make it easier to use the product, today.
  2. Use the product, experience the pain points, then write the scope for the next build.
  3. Get the basics done right before moving on to build more advanced features.
  4. Work with the right team.

Fundamentally, I have learnt that if I am allocating finite resources in a way that has a compounding effect on my company, then I am making the right strategic decisions.


Patent Attorneys and Non-Euclidean Geometry

Now I know why patent attorneys are trained lawyers. A patent isn't so much about invention; it's about owning a territory and arguing that the territory should be as large as possible.

Since invention-space is non-Euclidean, there is always more space to occupy with your invention than is first apparent, although this largely depends on your attorney. Finding this own-able invention-space has been an interesting journey these past few years.

Through working with my attorney, I also learnt that the USPTO has deadlines and targets, making it amenable to negotiation. It's in the USPTO's interest for patent applications to be abandoned or approved; the middle ground of arguing back and forth is like purgatory for you and for the agent handling your application. Since the USPTO can't force you to abandon an application, they have to negotiate.

On this note, we've been negotiating with the USPTO and are looking to have some good news soon!

Parallelization Begins

Having built a bullet-proof k-fold analytics engine, we have begun the process of migrating it to a parallel computing framework. As the datasets Knowledge Leaps processes have grown in both size and number, switching to a parallel framework will deliver scalable improvements in speed and performance. While we had limited the number of cross-validation folds (the k value) to a maximum of 10, we will now be able to increase it further with a minimal increase in compute time and much improved accuracy estimates.
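
To illustrate why parallelizing the folds pays off, here is a minimal sketch using scikit-learn and joblib. Each fold is independent, so folds can be trained and scored concurrently; the algorithm and data below are stand-ins, not the Knowledge Leaps engine.

```python
# A sketch of k-fold cross-validation with the folds run in parallel.
# scikit-learn and joblib stand in for the Knowledge Leaps engine here.
import numpy as np
from joblib import Parallel, delayed
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=10_000, n_features=20, random_state=0)

def fit_and_score(train_idx, test_idx):
    """Train on one fold's training split, score on its held-out split."""
    model = DecisionTreeClassifier(max_depth=5, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    return model.score(X[test_idx], y[test_idx])

kfold = KFold(n_splits=10, shuffle=True, random_state=0)

# Each fold is independent, so folds can run on separate cores (or workers),
# which is why a larger k need not mean proportionally longer compute times.
scores = Parallel(n_jobs=-1)(
    delayed(fit_and_score)(train_idx, test_idx)
    for train_idx, test_idx in kfold.split(X)
)
print(f"mean accuracy across {len(scores)} folds: {np.mean(scores):.3f}")
```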

Adding parallelization to the batch data engineering functionality will also increase the data throughput of the application. Our aim is to deliver a 10x-20x improvement in data throughput on larger datasets.

Building the Future of Machine Learning and Analytics. Right Here, Right Now.


TechCrunch recently published an article which describes what I am building with the Knowledge Leaps platform (check out the website here).

Knowledge Leaps is a soup-to-nuts data management and analytics platform. With a focus on data engineering, the platform is aimed at helping people prepare data in readiness for predictive modeling.

The first step to incorporating AI into an analytics process is to build an application that automates the grunt work. The effort is in cleaning data, mapping it and converting it to the right structure for further manipulation. It's time-consuming but can be systematized. The Knowledge Leaps application does this, right now. It seamlessly converts any data structure into user-level data using a simple interface, perfect for those who aren't data scientists.

Any data can then be used in classification models built with an unbiased algorithm combined with k-fold cross-validation for rigorous, objective testing. This is just the tip of the iceberg of its current, and future, functionality.
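
For readers unfamiliar with the approach, the evaluation pattern looks like the sketch below, written with scikit-learn purely for illustration; the classifier and data are stand-ins, not the platform's own algorithm.

```python
# A sketch of the evaluation pattern: a classifier scored with k-fold
# cross-validation. The classifier and data are illustrative stand-ins.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Stand-in for prepared, user-level data (one row per user).
X, y = make_classification(n_samples=5_000, n_features=15, random_state=0)

# With k-fold cross-validation every case is held out exactly once, so the
# model is always scored on data it never saw during training.
scores = cross_val_score(DecisionTreeClassifier(max_depth=4), X, y, cv=10)
print("accuracy per fold:", scores.round(3))
print(f"mean accuracy: {scores.mean():.3f}")
```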

Onward, to the future of analytics.

Fast Tracking Data Analysis

Not all variables in a data set or questions in a survey are equal when it comes to data analysis and analytics. Some variables (questions, if it's a survey) will be inherently better at classifying outcomes than others. For example, if you are using a data set to build a narrative around a particular binary behavior (i.e. people who do X vs. people who don't do X), then there are some considerations about which variables will give you a shortcut to the story.

The first rule of thumb is to start with binary predictors, i.e. variables with only two different responses/values. Variables with a greater number of possible responses/values are more likely to have spurious relationships with the variable that you are trying to predict. Predictors with two levels are less likely to suffer from this phenomenon.

The second rule of thumb is to select those binary variables that have a similar distribution to the variable that you are trying to predict. For example, if you are trying to predict a behavior that has 20% incidence among a certain population, then the best predictors to use should also have roughly a 20%/80% split across their two values.

The reason this condition is optimal is easily explained. The best predictor is one that identifies all cases correctly. Imagine that a predictor has two possible values, with 40% of cases at a value of 1 and 60% of cases at a value of 2. With this distribution, if 1s are predictive of the behavior we are modelling and the behavior has a 20% incidence, then at most half of the 1s can be correct predictions (20% out of 40%); the other half must be incorrect. However, if the predictor had 1s in 15% of cases and 2s in 85%, then all of the 1s could be correct predictions. This would be a much better predictor to use, in part because the incidence of 1s (or 2s, for that matter) is close to the incidence of the behavior we are predicting, meaning the 1s have a better chance of being correct.
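
Here is the same arithmetic as a small worked example in Python, using the illustrative percentages from above.

```python
# The arithmetic behind the rule of thumb: the closer a binary predictor's
# incidence is to the outcome's incidence, the larger the share of its 1s
# that can possibly be correct. The percentages are the examples from above.
def max_share_of_ones_correct(predictor_incidence, outcome_incidence):
    """Upper bound on the fraction of a predictor's 1s that can coincide
    with the behavior, assuming the best possible overlap."""
    return min(outcome_incidence / predictor_incidence, 1.0)

outcome_incidence = 0.20  # 20% of people do the behavior

for predictor_incidence in (0.40, 0.20, 0.15):
    share = max_share_of_ones_correct(predictor_incidence, outcome_incidence)
    print(f"predictor with {predictor_incidence:.0%} 1s: "
          f"at most {share:.0%} of its 1s can be correct")

# predictor with 40% 1s: at most 50% of its 1s can be correct
# predictor with 20% 1s: at most 100% of its 1s can be correct
# predictor with 15% 1s: at most 100% of its 1s can be correct
```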

I have a nice graph to show this too. Watch this space!

New Feature: Handling Event-Type Data

I have spent a lot of time thinking about data and data structures. What I have learnt is that there are two types of data structure: data with only one row per user (e.g. survey data), and data with one row for each unique user event (e.g. clickstream data from an app or website) and therefore multiple rows per user.

Many web-based analytics platforms, like Amazon's own ML platform, only let their users upload data that has a simple structure (one row per user, such as survey data and customer profile data). Very few platforms allow users to upload event-type data and engineer it into a simple form that can be used in predictive analytics.

Transforming event data requires data engineering, and this process can be daunting. To develop Knowledge Leaps further, we have spent a lot of time looking at a wide range of event-type data use cases. Our aim has been to create a systematic, easy-to-use (given the task) approach to simplifying the data engineering workflow. As with our models, we want our user interface and processes to be human-readable too.

In our latest release we are launching the Data Processor module. The design of this module has drawn heavily on working with real-world event data. This new feature allows the platform to take in any data type and apply simple processing rules to create analytics-ready data sets in minutes.
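
To give a flavour of what such processing rules do, here is a minimal pandas sketch of the event-to-user-level transformation. The column names and aggregation rules are hypothetical; this shows the shape of the transformation, not the Data Processor's implementation.

```python
# A sketch of the event-to-user-level transformation: one row per event in,
# one row per user out. Column names and rules here are hypothetical.
import pandas as pd

# Event-type data: one row per user event, many rows per user.
events = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 3],
    "event":   ["view", "click", "purchase", "view", "view", "click"],
    "value":   [0.0, 0.0, 25.0, 0.0, 0.0, 0.0],
})

# Simple processing rules: count each event type and sum purchase value.
event_counts = pd.crosstab(events["user_id"], events["event"]).add_prefix("n_")
total_value = events.groupby("user_id")["value"].sum().rename("total_value")

# Analytics-ready, user-level data: one row per user, ready for modeling.
user_level = event_counts.join(total_value).reset_index()
print(user_level)
```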