An Algorithm for Finding Common Elements Across Data Sets
Wednesday 12 July 2006 @ 8:08 am
Filed under:

I was recently working on a search algorithm for a Flash piece I am working on, and the way the algorithm worked involved doing separate searches for each keyword from an index that I had created, recording the number of occurrences for each keyword, then finding out which elements were common between each of the keywords. I came up with an algorithm to do this operation linearly, that is it only has to do one pass through the data which means that it scales well for large data sets. The gist is that if you have three sets of data, then any data element needs to appear three times in the search for it to be common across the data sets. As such, all that is needed is a tally kept of how many times each data element appears and anything that has a tally of three is part of the intersection of the data sets.

Note: this only works if there are no duplicated elements in each data set, which fortunately is not too hard a condition to impose.

Here is some code representing the algorithm:

var dataSet:Array = new Array();
dataSet.push(["1", "2", "5", "10", "11", "12", "15"]);
dataSet.push(["2", "5", "11", "14", "18", "21"]);
dataSet.push(["0", "1", "2", "3", "5", "7", "11"]);

var elementTallies:Object = new Object();
var currentElement:String;

for (var i:Number = 0; i < dataSet.length; i++)
{
	for (var j:Number = 0; j < dataSet[i].length; j++)
	{
		// Go through each element for each dataset. For each element,
		// add one to the tally for that element.
		currentElement = dataSet[i][j];
		if (elementTallies[currentElement] == undefined)
		{
			elementTallies[currentElement] = 1;
		}
		else
		{
			elementTallies[currentElement]++;
		}
	}
}

// Go through the tallies looking for elements with a tally of 3.
// Those elements will make up the intersection of the three data sets
for (var element:String in elementTallies)
{
	if (elementTallies[element] == dataSet.length)
	{
		trace("element: " + element);
	}
}
// Outputs: 11, 5, 2
— By Nathan Derksen   Comments Off   PermaLink

Comments are closed.