Working with Big Data can be a pain at times. This pain however is nothing compared to the hassle of learning a non-parametric model like the k-Nearest Neighbors algorithm in Big Data.
While static data might be easily dealt with, for businesses that collect streaming data and hence require the online approach, speed is of essence. Hence, static solutions may not work and
Problem Description:
1. Given 2 k-NN graphs G and H each representing two datasets S1 and S2 respectively, create a new graph U = G⋃ H, such that U represents S = S1⋃ S2.
2. Given a k-NN graph G representing a dataset S1, how would you update G to include new dataset S2?
The most obvious solution would be to run the k-NN algorithm from scratch and build a new graph upon update to the dataset, but as we know, obvious doesn’t always cut it. This article is centered around the merge of k-NN graphs without building from scratch.
If you need a beautiful description of the k-NN algorithm, check out this article here, and here is a short video.
Peng-Cheng Lin and Wan-Lei Zhao describe 3 major merge tasks:
- P-Merge: Solves Problem 1
- J-Merge: Solves Problem 2
- Hierarchical k-NN Graph Construction
This article is based on the research work by Peng-Cheng Lin and Wan-Lei Zhao. You can find the original paper here.
First of all, let’s conceptualize what a graph k-NN might look like programmatically. Say you have a dataset with 6,000,000 records representing your customers, one might decide to create a 4-NN graph to perform offering recommendation. A quick representation of this would be a collection of lists where G[x] is a list of the 4 closest customers to x.
In this article, I’ll focus on J-Merge, which is a solution to problem 2.
J-Merge Algorithm
The J-Merge algorithm is quite simple. Here are the steps:
- Split graph G into G+(first k/2 stuff in each list) and G-(second half of each list)
- Take k/2 random samples from the new dataset (S2) and append to each list in G+.
- Create a new graph H where every list is has k random samples from S.
- Merge G+ and H into a new graph U.
- Until convergence, do the following:
- Perform NN-Descent (article coming soon, find paper here)
Convergence of the J-Merge
Watch out for my implementation on this repo.
Thank you.