
Archive for the ‘experiments’ Category

Updated Hierarchy Stats

February 16th, 2012

It turns out that the two clumps for Studivz-three-B corresponded to K=3 and K=4. So we ran more runs on Enron and MIT to see what happens; the following results are for K=3 only.

In the case of the MIT datasets, I included the 4 thresholds we have been using (20th percentile, 80th percentile, mean and median) and a range of 20 values from the lowest edge weight to the highest. I will also run this on the Enron subsets.

 

MIT-OCT

MIT-NOV

MIT-ALL

Enron (all)

The effect of Hierarchy and Overlap on routing

February 13th, 2012

Following on from the previous post, I implemented the algorithms that evaluate the structure of the hierarchy, and plotted this for all datasets against the main routing metrics.

However, this showed a confusing picture, so Pádraig suggested I run multiple parameters to HGCE over one dataset, and see if it made a difference. We chose the Studivz-three-B dataset, as it is large enough to give a nice general picture.

The parameters were as follows:

K=3,4

M (threshold) = 0, 1e-05, 2e-05, 3e-05, 4e-05, 5e-05, 6e-05, 7e-05, 8e-05, 9e-05, 6.83085891329e-05, 2.04848750968e-05, 8.6207182699e-05, 5.97475523656e-06

the other parameters were kept at their default (see here)

Again, on first look this seemed confusing, but after removing the two very low delivery ratio results (leaving 24 out of 26), the visualizations seemed to show a trend: more hierarchical structure means higher delivery ratio.

It is quite interesting to explore the data by filtering out different runs; e.g. removing all values where K=3 (or K=4) makes a difference: Hierarchy Data Studivz-B

I suspect the two clumps in delivery ratio form on either side of an important threshold parameter value.

 

Exhaustive Simulations

January 18th, 2012


I re-ran all of the simulations, as I found an odd quirk in the MIT-NOV dataset: the Hui centrality was not calculated correctly, and NaN was recorded for every node in every local community, which gave very odd results. This was down to a dataset error, which I have rectified; in future, when NaN is returned it is replaced with a rank of 0.0. I think it was caused by a divide-by-zero error in the Hui ranking code.
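The NaN guard can be sketched as follows (a minimal illustration with hypothetical names; the real fix lives in the simulator's Hui ranking code):

```python
import math

def safe_rank(value):
    """Replace a NaN centrality value (e.g. from a divide-by-zero
    in the ranking computation) with a rank of 0.0."""
    if value is None or math.isnan(value):
        return 0.0
    return value

# A node with no contacts would otherwise produce 0/0 = NaN.
ranks = [0.7, float("nan"), 0.2]
print([safe_rank(r) for r in ranks])  # [0.7, 0.0, 0.2]
```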

This results in less pronounced results for MIT-NOV than we previously looked at. Other results were not affected.

The only other notable quirk was that Moses did not identify communities for IncoCom-2005, IncoCom-2005 and MIT-ALL. The plots below show results for each dataset.

Still working on getting a decent run out of Salathe-School, and Studivz.

Categories: experiments

Selected Nodes – with fixed HGCE output

November 30th, 2011

We identified a problem with the way HGCE was being used: for all other CFAs the community structure used was identical (apart from the global parent used by BubbleH), but HGCE outputs a different structure on every run. I adapted my code to make sure that BubbleRAP and BubbleH use the SAME community structure. In the results below, selected pairs of nodes are chosen based on their contact in the training period. Also, the overall results are reported for the best value of a given metric (i.e. the best run for delivery ratio is used to show results for cost, latency, etc.). In this case, all use delivery ratio as the primary metric.

Studivz

Social Sensing

MIT

Hypertext2009

Enron

Enron still shows a loss for BubbleH with HGCE. Detailed plot compared to previous results:

Delivery Ratio for Enron, BubbleH vs Bubble - old results vs new results - no great change


Still working on coding up an alternative to unlimited flood which allows us to accurately measure delivery ratio for selected pairs.

Selected Pairs evaluation

November 23rd, 2011

Pádraig suggested that we use only those nodes that have communicated in the training phase as the message pairs for the testing phase. I implemented this as follows:

Given the entire edge list for the training period, take each edge and create a pair from the source and destination node. Do not create a pair if the pairing already exists (in either direction). At simulation startup for the test period, send messages between only those pairs, in both directions, so that messages are sent A->B and B->A. This reduces the number of messages sent, and it can be argued that only nodes that already know each other will want to communicate in future. This will give us a much better picture.
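The pair-creation step can be sketched like this (a minimal illustration; `make_pairs` is a hypothetical name, not the simulator's actual code):

```python
def make_pairs(edges):
    """Build unique message pairs from a training-period edge list.
    A pair is added once regardless of direction; at simulation time
    messages are then sent both A->B and B->A."""
    pairs = set()
    for src, dst in edges:
        # Skip self-loops and pairs already present in either direction.
        if src != dst and (dst, src) not in pairs:
            pairs.add((src, dst))
    return pairs

edges = [("A", "B"), ("B", "A"), ("B", "C")]
print(sorted(make_pairs(edges)))  # [('A', 'B'), ('B', 'C')]
```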

Enron-Split-A

Enron Split - A results


Prophet and Hold&Wait

Hypertext2009

Hypertext2009-split results


Prophet and Hold&Wait

Studivz-split

Studivz-split results


Social Sensing Split

Social Sensing Split results


The undelivered hops results are very strange, but these charts will change when they use ONLY the best run for a given metric, not the best of any.

Enron Evaluation

November 17th, 2011

We took a limited period in the Enron dataset, and split it into two parts, a training phase, and a testing phase:

Activity in the Enron dataset, hourly, daily, weekly, showing possible periods that could be used for training/testing


In this case, the first period  shown in the above plot was used.

For HGCE the community structure is shown below:

Community Structure created by HGCE on Enron-split-a


Results of Routing are not good for HGCE:

Results for Enron-split-a dataset


Variation in parameters (threshold) for delivery ratio, on HGCE are shown below:

Benchmark results are shown below:

In which unlimited flood achieves nearly 50% delivery ratio.

Categories: Datasets, experiments

Studivz Evaluation

November 14th, 2011

We explored a small portion of the Studivz dataset; around 4 months between Oct 2007 and Feb 2008 (illustrated below).

Studivz Weekly and Daily Activity, showing limited periods used, and message peaks (new year)


Because the dataset is so large, it was not practical to use all 28,000 nodes for testing, so we took a small number of nodes (10) and travelled a number of hops (2) through their connected peers in a connected-time graph, and used only this set of nodes to run our simulation. In this instance, we used 385 nodes. We again used the MEAN, MEDIAN, 80th and 20th percentiles of connected time ratios (described here) as thresholds on edge weights for community finding using HGCE, KCLIQUE and LinkClustering (Moses, Random and InfoMap are not thresholded).

Values used:

  • Mean: 0.0001171252,
  • Median: 5.66346e-05,
  • 80th Percentile: 0.0001579604,
  • 20th Percentile: 1.1558e-06

For InfoMap, this resulted in the graph below:


InfoMap Clustering of the reduced node set in Studivz. All edges shown.

KCLIQUE performed quite poorly on this for 80Pc, as there are some clear structures that it did not find.

KCLIQUE 80PC Threshold on Studivz Limited period/nodes


It performed slightly better with 20th percentile thresholding, but again a number of clear structures were missed.

KCLIQUE 20PC Threshold on Studivz Limited period/nodes


LinkClustering for 80pc is shown below:

LinkClustering studivz dataset - 80pc


HGCE finds some structure to the network, and successfully finds hierarchy:

HGCE Community Structure for Studivz. Each node represents a community.


Result of Routing:

Studivz Dataset results of routing


For Prophet UF and H&W the results are below:


Studivz Delivery Ratio for unlimitedflood, hold and wait, and Prophet


It seems that even unlimited flood can only reach a 20% delivery ratio.

Categories: Datasets, experiments

Hypertext2009 Evaluation

November 14th, 2011

Finally finished running the experiments on the Hypertext2009 dataset. This used all available nodes, but was split into two parts. The dataset is around 3 days long, so I used day 1 to “train” the community finding algorithms, i.e. used the edge list from day 1 to create the communities. The testing phase then took place on days 2 and 3 (i.e. the communities from day 1 were used for routing on days 2 and 3).

In the training phase, I used 4 threshold values for HGCE, KCLIQUE and LinkClustering (Ahn et al.). InfoMap, Random and Moses do not have thresholding applied. Those values relate to the MEAN, MEDIAN, 20th percentile and 80th percentile of the connected time ratio: i.e. ordering all edges in ascending order, pick the edge at the 20th (or 80th) percentile and use its value as the threshold (see http://mattstabeler.co.uk/phdblog/datasets/hypertext2009 for a plot).

Values used were: MEAN: 0.0035760146, MEDIAN: 0.0007309433, 80th PC: g, 20th PC: 0.0003654716

The visualization below shows the clustering based on InfoMap, which is non-overlapping. Edges relate to connected time ratio, and are removed where this is lower than the 80th percentile. Edge thickness indicates connected time, and node size indicates betweenness centrality.

 Without Numbers: InfoMap Community Assignment of Hypertext2009 dataset, using 80th Percentile thresholding of edges


This visualisation shows clustering using the KCLIQUE algorithm; this is overlapping, so some nodes have multiple community assignments.

KCLIQUE Community Assignment of Hypertext2009 dataset, using 80th Percentile thresholding of edges


When we run the routing algorithms BubbleRAP, BubbleH, Prophet (best in class), Unlimited Flood (best possible) and Hold and Wait (baseline), we get the following results:

combined_hypertext2009-split-current_split

Best results for each metric, for each community finding algorithm. BubbleRAP vs BubbleH

and for the rest:

ht2009-split-uf-haw-pht-line

Delivery Ratio for Unlimited Flood, Prophet and Hold and Wait on Hypertext 2009 - split

The community structure found by HGCE was very flat: it found no hierarchical structure at any threshold value apart from the 80th percentile threshold, where there is one sub-community.

HGCE Community hierarchy produced by HGCE for BubbleH


Categories: Datasets, experiments

Calculating Edge List Thresholds

November 3rd, 2011

I had an idea to use some standard way to pick the basic threshold parameters for datasets, and it struck me that all we are doing is finding the elbow in the distribution of connected times for each edge.

So, after a little bit of experimentation, I found that the 80/20 rule (Pareto principle) works well.

// $values holds all edge weights (connected time), sorted ascending
$pareto = $values[(int) floor((count($values) / 100) * 80)];

The code above takes an array of all edge weights (connected time), sorted lowest to highest, and picks out the value at the 80th percentile; i.e. out of 200 values, it would pick the value at position 160 as the threshold value.
I compared the results against the MEAN and the MEDIAN, and found that the 80th percentile worked well for Studivz and Hypertext2009.
So I propose that we use this mechanism to pick the threshold values for all datasets.
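For reference, an equivalent sketch in Python (`percentile_threshold` is a hypothetical helper mirroring the PHP one-liner, under the same assumption that one weight has been pre-computed per edge):

```python
def percentile_threshold(weights, pct=80):
    """Pick the edge-weight value at the given percentile:
    sort ascending and index at pct% of the array length."""
    ordered = sorted(weights)
    idx = int(len(ordered) * pct // 100)
    # Clamp so pct=100 does not run off the end of the array.
    return ordered[min(idx, len(ordered) - 1)]

weights = list(range(1, 201))  # 200 edge weights: 1..200
print(percentile_threshold(weights))  # 161 (the value at index 160)
```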

Studivz

October 6th, 2011

I took some time to explore the Studivz wall posts dataset, to see whether it would be useful to use.

The first step was to extract a subset of the data, as the entire dataset is a little too large to run in LocationSim (some of the CFAs can’t quite handle 28k nodes); scaling up to this many nodes is a work in progress (it might involve writing something more lightweight to do some of the raw processing).

The approach I have taken so far is to pick N nodes at random and include their immediate neighbours. I do this L more times to get the nodes up to a depth of L hops away from the source. Using 10 random nodes with a depth of 2 yields a network of around 3049 nodes (~10% of all nodes). When reduced to 5 seed nodes, we get ~1000 nodes (~4%). Going the other way, 100 seed nodes with a depth of 1 gives 14571 nodes, covering ~50% of the network. These figures change depending on which nodes are selected at random initially. Two other parameters affect the results: the first is a threshold, where nodes with a connected time less than this are not included; the second is the value used to seed the random number generator (if 0, a seed is chosen automatically).
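The seed-and-expand selection can be sketched roughly as follows (illustrative names only; the real extraction also applies the connected-time threshold and random-seed parameters, omitted here):

```python
def snowball(adjacency, seeds, depth):
    """Expand a seed set by `depth` hops of neighbours.
    adjacency: dict mapping node -> set of neighbouring nodes."""
    selected = set(seeds)
    frontier = set(seeds)
    for _ in range(depth):
        nxt = set()
        for node in frontier:
            nxt |= adjacency.get(node, set())
        # Only newly discovered nodes are expanded in the next round.
        frontier = nxt - selected
        selected |= nxt
    return selected

adj = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
print(sorted(snowball(adj, ["a"], 2)))  # ['a', 'b', 'c']
```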

In the end I settled on the three parameter sets in the table below. Note that the number of nodes in the final set is highly sensitive to the initially chosen nodes, so it is very much down to chance.

Studivz Random Node Choices

N L # Nodes
3 2 213
4 2 914
10 2 3049


Interestingly, despite the source (seed) nodes being picked at random, the entire graph is connected in all configurations. The graphic below shows the connected_time graph and InfoMap clusterings for N=3, L=2.

InfoMap clustering of Studivz dataset, where N=3 and L=2


This is a promising start: there are distinct clusters of nodes, which we expected, as this is the concatenation of three egocentric networks; but there are also connections between the egocentric networks, meaning there is a route to every other node. However, we can’t tell from this graph how often these contacts occur.

Looking at the whole dataset, we can get an idea of how active it is over time by measuring the number of connections in a given time period; below is the number of weekly connections for the entire dataset.

Weekly number of connections in the Studivz dataset


It shows that this social network seems to have become increasingly popular  over time, with a peak of just over 10,000 wall posts made in Jan 2007. If we were to pick a period to concentrate on, it should probably be from October 2006 onwards.

Studivz N=3, L=2

Initial results for each metric are shown below:

Delivery Ratio for BubbleH vs BubbleRAP for Studivz 3 2 0 0


Cost for BubbleH vs BubbleRAP for Studivz 3 2 0 0


Latency for BubbleH vs BubbleRAP for Studivz 3 2 0 0


Delivery ratio is very poor for all runs. To see what the maximum possible delivery ratio is, we can look at the results for flooding the network below:

Delivery Ratio plot of Unlimited Flood on Studivz 3 2 0 0


This achieves a delivery ratio of roughly 65 percent, so we have a bit of work to do to be able to match this!

Studivz 4 2 0 0

When we add another node to the initial seed set, we get a step up in the total number of nodes, 914 to be exact; this is currently running through the simulator.

Studivz 4 2 0 0


UPDATE:

Below is the weekly activity during the set using 914 nodes (4,2,0,0)

Weekly activity in STUDIVZ 4,2,0,0


The results on the larger dataset are shown below. These runs were taking considerably longer, and highlighted a couple of minor bugs in the simulator (not closing files properly, which meant that “file not found” and “too many open files” messages kept occurring).

Delivery Ratio, Cost, Latency, Average Delivered Hops and Average Undelivered Hops for STUDIVZ with 4 seed nodes and a depth of 2.


We see here that BubbleH is doing well in terms of delivery ratio compared to BubbleRAP. LinkClustering, which created a huge number of communities, does particularly well (at ~30% for both BubbleRAP and BubbleH); this adds weight to the idea that a large number of communities does well, and in fact (in this case only, where there is just one set of parameters) we see that the average cost is roughly the same as with the other CFAs. BubbleH also performs well in terms of cost. Latency is very high for all CFAs because the dataset is very long.

Unlimited Flood and Prophet on STUDIVZ 4 2 0 0


However, we see from the unlimited flood run that we have a way to go to match the best possible delivery ratio: at around 90%, it beats BubbleH hands down. Some consolation, though: the advanced Prophet algorithm also only gets around 52% delivery ratio.

Categories: Datasets, experiments