Calculating Edge List Thresholds

November 3rd, 2011

I wanted a standard way to pick the basic threshold parameter for each dataset, and it struck me that all we are really doing is finding the elbow in the distribution of per-edge connected times.

So, after a little bit of experimentation, I found that the 80/20 rule (Pareto principle) works well.

// $values: all edge weights (connected times), sorted ascending
$pareto = $values[(int) floor(count($values) * 80 / 100)];

The code above takes an array of all edge weights (connected times), sorted lowest to highest, and picks out the value at the 80th percentile. For example, out of 200 values, it would pick the value at index 160 as the threshold.
I compared this against using the mean and the median as the threshold, and found that the 80th percentile worked well for the Studivz and Hypertext2009 datasets.
So I propose that we use this mechanism to pick the threshold values for all datasets.
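
To make the proposal concrete, here is a minimal sketch of how the threshold could be computed for any dataset. The function name percentileThreshold and the sample connected times are made up for illustration; they are not taken from the real datasets or from the existing code base.

```php
<?php
// Hypothetical helper: returns the edge weight found at the given
// percentile of the sorted weights, or null if there are no edges.
function percentileThreshold(array $weights, $percentile = 80)
{
    $count = count($weights);
    if ($count === 0) {
        return null; // nothing to threshold
    }

    sort($weights); // ascending: shortest connected time first

    // Multiply before dividing to avoid floating-point drift,
    // then clamp so that a percentile of 100 stays inside the array.
    $index = (int) floor($count * $percentile / 100);
    $index = min($index, $count - 1);

    return $weights[$index];
}

// Illustrative connected times (in seconds) for ten edges.
$connectedTimes = array(5, 10, 10, 20, 30, 45, 60, 120, 600, 3600);

$pareto = percentileThreshold($connectedTimes, 80);
$mean   = array_sum($connectedTimes) / count($connectedTimes);

echo "80th percentile threshold: $pareto\n";
echo "Mean connected time: $mean\n";
```

The helper sorts its own copy of the array, so the caller does not need to pre-sort, and the percentile is a parameter in case a different cut-off suits a particular dataset better.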
