Distance Measures for Similarity Matching with Large Datasets
Today I had an interesting question from a client who was using a distance metric for similarity matching.
A distance measure quantifies how similar one observation is to each member of a set of other observations. It is most commonly used in nearest-neighbour models, but it also appears as a measure of similarity across a range of clustering and classification algorithms.
The difficulty is that if you want to guarantee the correct closest matches, you need to calculate the distance to every other observation. Each query therefore costs O(N) distance computations, so if you have a large number of observations in your dataset (10^7 in my client's case) these operations can obviously take a long time.
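To make the cost concrete, here is a minimal brute-force version of the search in numpy. It's a sketch with made-up data sizes, not my client's code, but it shows the problem: every query has to touch all N rows.

```python
import numpy as np

def nearest_neighbour_brute_force(X, query):
    """Return the index of the row of X closest to `query` (Euclidean).

    Every query touches all N rows, so each lookup costs O(N * d).
    """
    # (N, d) - (d,) broadcasts to (N, d); summing over features gives N squared distances.
    squared_distances = ((X - query) ** 2).sum(axis=1)
    return int(np.argmin(squared_distances))

# Toy example: 1,000 observations with 8 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(1_000, 8))
query = rng.normal(size=8)
print(nearest_neighbour_brute_force(X, query))
```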
Here is a selection of ideas to try, from easiest to hardest:
Reduce the amount of data: This is always the easiest thing to do, so if you can, do it. You could perform some clustering and retain only "exemplars" of each cluster, or you could pick a random subset of the data. With a random subset you would get slightly different results each time, but you could guarantee performance. (Both options are sketched after this list.)
Reduce the dimensionality: Fewer features mean fewer computations per distance. A principal-component projection is the usual starting point (sketched below).
Horizontal scaling: This problem is easily scaled. Pick subsets of X, find the nearest match in each, then merge the results into a new set. This is a standard "map-reduce" type problem and how you implement it depends on your infrastructure. Do you have a cluster? An orchestrator? Spark? Hadoop? There's also quite a lot of literature if you search for "scalable Euclidean distances". (The map-reduce structure is sketched below.)
Matrix optimisation: I'm pretty sure it should be possible to load a batch of X's into a matrix and perform the subtraction matrix-wise. Then you can take advantage of numpy's use of a BLAS library and compute a whole block of distances in one fell swoop. Just make sure you test the performance: numpy's vectorised routines and the CPU's SIMD instructions may already be doing much of this for you. (See the matrix sketch below.)
Add layers to your algorithm: First perform clustering to find distinct groups, e.g. male/female, then only apply your distance measure within the query's group. Keep adding layers of clustering for as long as it makes sense. (A two-layer version is sketched below.)
Separate real-time and accurate algorithms: Offer a real-time rough estimate alongside a slower best match, e.g. "yes, there is probably a good match, but we need to keep searching to find the best one". You could do this in a number of ways using any of the above ideas, e.g. search a random subset for the quick answer, then keep adding more data over time until you've used it all. (A progressive version is sketched below.)
Use a tree-based method: Figure out which features are most important, use the first one or two components to shard your data, and keep splitting until each shard is small enough to finish off with a standard nearest-neighbour search. In spirit this is what k-d trees and ball trees do for you (see the tree sketch below).
Finally, GPUs. GPUs are used for problems that can be easily parallelised. Essentially you'd need to combine the data-partitioning and matrix-optimisation ideas above to fully extract the parallelisation performance of the GPU. But be warned that this restricts the hardware it will run on, the hardware is more expensive, and it adds code complexity. I'd recommend getting it working well enough on a CPU first and moving to a GPU only if you have to. (There's a brief GPU sketch at the end of the sketches below.)
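To make these ideas more concrete, here are some rough sketches. They all use made-up dataset sizes and parameters, assume numpy (plus scikit-learn or PyTorch where noted), and are starting points rather than production code.

Reducing the amount of data: a random subset gives predictable runtime at the cost of run-to-run variation, while cluster exemplars (here, mini-batch k-means centroids) try to preserve the shape of the data. The subset size and number of clusters are numbers you would tune.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 8))

# Option 1: keep a random subset. Predictable runtime, slightly different results per sample.
sample_idx = rng.choice(len(X), size=10_000, replace=False)
X_sample = X[sample_idx]

# Option 2: keep only cluster "exemplars". The centroids are synthetic points;
# if you need real observations, map each centroid back to its closest row.
kmeans = MiniBatchKMeans(n_clusters=1_000, batch_size=10_000, random_state=0).fit(X)
exemplars = kmeans.cluster_centers_

# Either X_sample or exemplars can now stand in for X in the distance search.
```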
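Reducing the dimensionality: a PCA projection before the distance search. The number of components (20 down to 5 here) is a made-up choice you would validate against matching quality.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 20))
query = rng.normal(size=20)

# Project onto the leading principal components; fewer features means
# fewer operations per distance calculation.
pca = PCA(n_components=5).fit(X)
X_reduced = pca.transform(X)
query_reduced = pca.transform(query.reshape(1, -1))[0]
```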
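Horizontal scaling: the structure is a classic map-reduce. This sketch runs the "map" step serially for clarity, but each chunk could just as well be a Spark partition or a separate machine; only the (index, distance) pairs need to come back for the "reduce" step.

```python
import numpy as np

def nearest_in_chunk(chunk, offset, query):
    """'Map' step: best match within one chunk. Could run on a separate worker."""
    d2 = ((chunk - query) ** 2).sum(axis=1)
    i = int(np.argmin(d2))
    return offset + i, float(d2[i])

def nearest_map_reduce(X, query, n_chunks=8):
    """'Reduce' step: merge the per-chunk winners into a global winner."""
    chunks = np.array_split(X, n_chunks)
    offsets = np.cumsum([0] + [len(c) for c in chunks[:-1]])
    candidates = [nearest_in_chunk(c, o, query) for c, o in zip(chunks, offsets)]
    return min(candidates, key=lambda pair: pair[1])  # (global index, squared distance)

rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 8))
query = rng.normal(size=8)
print(nearest_map_reduce(X, query))
```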
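Matrix optimisation: the trick is the identity ||a - b||^2 = ||a||^2 - 2a.b + ||b||^2, which turns the bulk of the work into a single matrix multiply that numpy hands off to BLAS. Note that scipy.spatial.distance.cdist and sklearn.metrics.pairwise_distances already implement this kind of thing, so measure before rolling your own.

```python
import numpy as np

def pairwise_sq_distances(A, B):
    """Squared Euclidean distances between every row of A and every row of B.

    Uses ||a - b||^2 = ||a||^2 - 2 a.b + ||b||^2 so the heavy lifting is one
    matrix multiply. Floating-point error can make tiny distances slightly
    negative; clip at zero if that matters to you.
    """
    a2 = (A ** 2).sum(axis=1)[:, None]   # (n_a, 1)
    b2 = (B ** 2).sum(axis=1)[None, :]   # (1, n_b)
    return a2 - 2.0 * A @ B.T + b2       # (n_a, n_b)

rng = np.random.default_rng(0)
X = rng.normal(size=(50_000, 8))
queries = rng.normal(size=(100, 8))

D2 = pairwise_sq_distances(queries, X)   # one matrix operation for all query/observation pairs
best = D2.argmin(axis=1)                 # index of the closest observation for each query
```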
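Adding layers: cluster first, then only search within the query's cluster. Bear in mind this makes the search approximate, because the true best match can sit just over a cluster boundary. The number of clusters is a made-up choice.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 8))
query = rng.normal(size=8)

# Layer 1: a coarse clustering of the whole dataset.
coarse = MiniBatchKMeans(n_clusters=100, batch_size=10_000, random_state=0).fit(X)
labels = coarse.labels_

# Layer 2: brute-force search, but only within the query's cluster.
query_cluster = coarse.predict(query.reshape(1, -1))[0]
members = np.where(labels == query_cluster)[0]
d2 = ((X[members] - query) ** 2).sum(axis=1)
best = members[int(np.argmin(d2))]  # approximate: the exact best may be in a neighbouring cluster
```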
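Separating real-time and accurate answers: one way is a generator that yields the best match found so far after each extra batch of data, so the first yield is the quick rough answer and the last is the exact one.

```python
import numpy as np

def progressive_nearest(X, query, batch_size=10_000):
    """Yield (best index so far, squared distance) after each extra batch.

    The first yield is a quick rough answer; later yields refine it until
    the whole dataset has been searched.
    """
    best_idx, best_d2 = -1, np.inf
    for start in range(0, len(X), batch_size):
        chunk = X[start:start + batch_size]
        d2 = ((chunk - query) ** 2).sum(axis=1)
        i = int(np.argmin(d2))
        if d2[i] < best_d2:
            best_idx, best_d2 = start + i, float(d2[i])
        yield best_idx, best_d2

rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 8))
query = rng.normal(size=8)

for idx, d2 in progressive_nearest(X, query):
    print(f"best so far: index {idx}, squared distance {d2:.3f}")
```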
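Tree-based methods: rather than hand-rolling the sharding, libraries like scikit-learn already provide k-d trees and ball trees that recursively split the data for you.

```python
import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 8))
query = rng.normal(size=8)

# Build once (the expensive part); queries are then roughly O(log N) in low
# dimensions. In high dimensions trees degrade back towards brute force.
tree = KDTree(X)
dist, idx = tree.query(query.reshape(1, -1), k=5)  # five closest matches
print(idx[0], dist[0])
```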
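And if you do end up on a GPU, a library like PyTorch gives you the batched matrix formulation almost for free. This assumes PyTorch is installed and falls back to the CPU if no CUDA device is present.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

X = torch.randn(1_000_000, 8, device=device)
queries = torch.randn(100, 8, device=device)

# torch.cdist computes all pairwise distances as one batched matrix operation,
# which is exactly the shape of work a GPU is good at.
distances = torch.cdist(queries, X)               # (100, 1_000_000)
best = torch.topk(distances, k=5, largest=False)  # five closest matches per query
print(best.indices.shape)
```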