One of the issues with large data files is that you very quickly come up against the physical laws of the universe: hash-function collision rates have a meaningful impact on how exhaustive your calculations can be, and unbounded in-memory data structures create significant performance problems.
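To give a feel for the collision side of this, the standard birthday-bound approximation says that hashing n rows into 2^bits buckets produces roughly n(n-1)/(2·2^bits) colliding pairs. The sketch below is illustrative only (it is not KL code); it just shows how quickly a 32-bit hash falls apart at a billion rows while a 64-bit hash still mostly holds up.

```python
# Birthday-bound estimate of expected hash collisions: with n rows hashed
# uniformly into 2**bits buckets, the expected number of colliding pairs
# is approximately n * (n - 1) / (2 * 2**bits).

def expected_collisions(n_rows: int, hash_bits: int) -> float:
    """Approximate expected number of colliding pairs for a uniform hash."""
    buckets = 2 ** hash_bits
    return n_rows * (n_rows - 1) / (2 * buckets)

if __name__ == "__main__":
    for bits in (32, 64):
        for n in (1_000, 1_000_000, 1_000_000_000):
            print(f"{n:>13,} rows, {bits}-bit hash: "
                  f"~{expected_collisions(n, bits):,.3g} expected collisions")
```

At a billion rows, a 32-bit hash produces on the order of a hundred million expected collisions, while a 64-bit hash expects well under one.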
With our KL app, we are building technology to get round that. As our Maximum Viable File Size has grown from thousands of rows to millions, and now to billions, we have realized that the laws of physics are a real nuisance when analyzing data.
To that end, we have rolled out a data sampling feature that lets users run analysis on a randomized subset of a data file. When speed of analysis matters, this feature gets users round the laws of physics while still producing representative results.
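The announcement doesn't describe how the sampler works internally, but one common way to draw a uniform random subset from a file too large to hold in memory is single-pass reservoir sampling (Algorithm R). The sketch below is a minimal illustration under that assumption; the file path, the `k` parameter, and the CSV format are hypothetical, not KL internals.

```python
import csv
import random

def reservoir_sample(path: str, k: int, seed: int | None = None) -> list[list[str]]:
    """Return k uniformly sampled rows from a CSV in one pass (Algorithm R).

    Memory stays bounded at k rows no matter how large the file is, which is
    the property that matters when the input runs to billions of rows.
    """
    rng = random.Random(seed)
    sample: list[list[str]] = []
    with open(path, newline="") as f:
        reader = csv.reader(f)
        for i, row in enumerate(reader):
            if i < k:
                sample.append(row)
            else:
                # Row i replaces an existing entry with probability k / (i + 1),
                # keeping every row seen so far equally likely to be in the sample.
                j = rng.randint(0, i)
                if j < k:
                    sample[j] = row
    return sample

# Example: analyze a 100,000-row sample instead of the full file.
# rows = reservoir_sample("huge_file.csv", k=100_000, seed=42)
```

Because the sample size is fixed up front, analysis time no longer scales with the size of the source file, only with the size of the sample.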