
Archive for the ‘Datasets’ Category

Updated Hierarchy Stats

February 16th, 2012

It turns out that the two clumps for Studivz-three-B corresponded to K=3 and K=4. So we ran more runs on Enron and MIT to see what happens; the following are for K=3 only.

In the case of the MIT datasets, I included the four thresholds we have been using (20th percentile, 80th percentile, mean and median) and a range of 20 values from the lowest edge weight to the highest. I will also run this on the Enron subsets.

 

Plots: MIT-OCT, MIT-NOV, MIT-ALL, Enron (all)

The effect of Hierarchy and Overlap on routing

February 13th, 2012

Following on from the previous post, I implemented the algorithms that evaluate the structure of the hierarchy, and plotted this for all datasets against the main routing metrics.

However, this showed a confusing picture, so Pádraig suggested I run HGCE with multiple parameters over one dataset and see if it made a difference. We chose the Studivz-three-B dataset, as it is large enough to give a nice general picture.

The parameters were as follows:

K=3,4

M (threshold) = 0, 1e-05, 2e-05, 3e-05, 4e-05, 6e-05, 6e-05, 7e-05, 8e-05, 9e-05, 6.83085891329e-05, 2.04848750968e-05, 8.6207182699e-05, 5.97475523656e-06

The other parameters were kept at their defaults (see here).
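As a rough sketch of how this sweep is organised (illustrative PHP only; the command line and flag names are placeholders, not the real HGCE invocation):

<?php
// Illustrative parameter sweep: every combination of K and the edge-weight
// threshold M is run through HGCE, with all other options left at their defaults.
$ks = [3, 4];
$ms = [0, 1e-05, 2e-05, 3e-05, 4e-05]; // ...plus the remaining M values listed above

foreach ($ks as $k) {
    foreach ($ms as $m) {
        // Placeholder command; substitute the actual HGCE invocation here.
        printf("hgce --k=%d --threshold=%g studivz-three-b.edges\n", $k, $m);
    }
}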

Again, on first look this seemed confusing, but by removing the two very low delivery ratio results (leaving 24 out of 26), the visualizations seemed to show a trend of increased hierarchical structure corresponding to increased delivery ratio.

It is quite interesting to explore the data by filtering out different runs, e.g. removing all values where K=3 or K=4 makes a difference: Hierarchy Data Studivz-B

I suspect the two clumps in delivery ratio form on either side of an important threshold parameter value.

 

Selected Pairs evaluation

November 23rd, 2011

Pádraig suggested that we use only those nodes that have communicated in the training phase as the message pairs for the testing phase. I implemented this as follows:

Given the entire edge list for the training period, take each edge and create a pair from the source and destination node. Do not create a pair if the pairing already exists (in either direction). At simulation startup for the test period, send messages between only those pairs, in both directions, so that messages are sent A->B and B->A. This reduces the number of messages sent, and it can be argued that only nodes that already know each other will want to communicate in future. This should give us a much better picture.
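As a sketch of the pair generation (illustrative PHP, not the simulator code):

<?php
// Build unique, undirected node pairs from the training-period edge list.
// Each pair later generates two test messages: A->B and B->A.
function buildMessagePairs(array $edges): array {
    $pairs = [];
    foreach ($edges as [$src, $dst]) {
        if ($src === $dst) {
            continue; // ignore self-loops
        }
        // Canonical key so that A-B and B-A count as the same pairing.
        $key = min($src, $dst) . ':' . max($src, $dst);
        $pairs[$key] = [min($src, $dst), max($src, $dst)];
    }
    return array_values($pairs);
}

// Example: a reverse duplicate collapses into a single pair.
print_r(buildMessagePairs([[1, 2], [2, 1], [2, 3]])); // [[1, 2], [2, 3]]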

Enron-Split-A

Enron Split - A results

Prophet and Hold&Wait

Hypertext2009

Hypertext2009-split results

Prophet and Hold&Wait

Studivz-split

Studivz-split results

Social Sensing Split

Social Sensing Split results

The undelivered hops results look very strange, but these charts will change when they use only the best run for a given metric, not the best of any.

Enron Evaluation

November 17th, 2011

We took a limited period of the Enron dataset and split it into two parts: a training phase and a testing phase:

Activity in the Enron dataset, hourly, daily, weekly, showing possible periods that could be used for training/testing

In this case, the first period  shown in the above plot was used.

For HGCE the community structure is shown below:

Community Structure created by HGCE on Enron-split-a

Results of Routing are not good for HGCE:

Results for Enron-split-a dataset

Variation in delivery ratio across threshold parameters for HGCE is shown below:

Benchmark results are shown below:

Unlimited Flood achieves nearly a 50% delivery ratio.

Categories: Datasets, experiments

Studivz Evaluation

November 14th, 2011

We explored a small portion of the Studivz dataset; around 4 months between Oct 2007 and Feb 2008 (illustrated below).

Studivz Weekly and Daily Activity, showing limited periods used, and message peaks (new year)

Because the dataset is so large, it was not practical to use all of the 28,000 nodes for testing, so we took a small number of nodes (10) and travelled a number of hops (2) into their connected peers in a connected time graph, and used only this set of nodes to run our simulation. In this instance, we used 385 nodes. We again used the MEAN, MEDIAN, 80th and 20th percentiles of connected time ratios (described here) as thresholds on edge weights for community finding using HGCE, KCLIQUE and LinkClustering. (Moses, Random and InfoMap are not thresholded.)

Values used:

  • Mean: 0.0001171252,
  • Median: 5.66346e-05,
  • 80th Percentile: 0.0001579604,
  • 20th Percentile: 1.1558e-06

For InfoMap, this resulted in the graph below:

InfoMap Clustering of the reduced node set in Studivz. All edges shown.

KCLIQUE performed quite poorly on this for the 80th percentile threshold, as there are some clear structures that it did not find.

KCLIQUE 80PC Threshold on Studivz Limited period/nodes

It performed slightly better with 20th percentile thresholding, but again a number of clear structures were missed.

KCLIQUE 20PC Threshold on Studivz Limited period/nodes

LinkClustering for 80pc is shown below:

LinkClustering studivz dataset - 80pc

HGCE finds some structure to the network, and successfully finds hierarchy:

HGCE Community Structure for Studivz. Each node represents a community.

Result of Routing:

Studivz Dataset results of routing

For Prophet UF and H&W the results are below:

Studivz Delivery Ratio for Unlimited Flood, Hold and Wait, and Prophet

It seems that even Unlimited Flood can only reach a 20% delivery ratio.

Categories: Datasets, experiments

Hypertext2009 Evaluation

November 14th, 2011

Finally finished running the experiments on the Hypertext2009 dataset. This used all available nodes, but was split into two parts. The dataset is around 3 days long, so I used day 1 to “train” the community finding algorithms, i.e. used the edge list from day 1 to create the communities. Then, the testing phase took place on days 2 and 3 (i.e. the communities from day 1 were used for routing in days 2 and 3).

In the training phase, I used 4 threshold values for HGCE, KCLIQUE and LinkClustering (Ahn et al.). InfoMap, Random and Moses do not have thresholding applied. Those values relate to the MEAN, MEDIAN, 20th Percentile and 80th Percentile of the connected time ratio, i.e. ordering all edges in ascending order of weight, pick the edge at the 20th (or 80th) percentile and use its value as the threshold value (see http://mattstabeler.co.uk/phdblog/datasets/hypertext2009 for plot).

Values used were: MEAN: 0.0035760146, MEDIAN: 0.0007309433, 80th PC: g, 20th PC: 0.0003654716

The visualization below shows the clustering based on InfoMap, which is non-overlapping. Edges are related to connected time ratio, and are removed where this is lower than the 80th Percentile. Edge thickness indicates connected time, and node size indicates betweenness centrality.

Without Numbers: InfoMap Community Assignment of Hypertext2009 dataset, using 80th Percentile thresholding of edges

This visualisation shows clustering using the KCLIQUE algorithm, which is overlapping, so some nodes will have multiple community assignments.

KCLIQUE Community Assignment of Hypertext2009 dataset, using 80th Percentile thresholding of edges

When we run the routing algorithms BubbleRAP, BubbleH, Prophet (best in class), Unlimited Flood (best possible) and Hold and Wait (baseline), we get the following results:

Best results for each metric, for each community finding algorithm. BubbleRAP vs BubbleH

and for the rest:

Delivery Ratio for Unlimited Flood, Prophet and Hold and Wait on Hypertext 2009 - split

The community structure found by HGCE was very flat, and in fact contained no hierarchical structure at any threshold value apart from the 80th percentile threshold, where there is one sub-community.

Community hierarchy produced by HGCE for BubbleH

Categories: Datasets, experiments

Calculating Edge List Thresholds

November 3rd, 2011

I had an idea to use some standard way to pick the basic threshold parameters for datasets, and it struck me that all we are doing is finding the elbow in the distribution of connected times for each edge.

So, after a little bit of experimentation, I found that the 80/20 rule (Pareto principle) works well.

$pareto = $values[floor((count($values) / 100) * 80)];

The code above takes an array of all edge weights (connected time), sorted lowest to highest, and picks out the value at the 80th percentile, i.e. out of 200 values, it would pick the value at position 160 as the threshold value.
I compared the results against the MEAN and the MEDIAN, and found that the 80th percentile worked well for Studivz and Hypertext2009.
So I propose that we use this mechanism to pick the threshold values for all datasets.
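For reference, a fuller sketch of how all four candidate thresholds can be computed from a list of edge weights (illustrative PHP; assumes a plain array of floats):

<?php
// Compute the mean, median, 20th and 80th percentile thresholds from an
// array of edge weights (connected times).
function percentile(array $values, float $pc): float {
    sort($values); // ascending
    $index = (int) floor((count($values) / 100) * $pc);
    return $values[min($index, count($values) - 1)]; // guard the top end
}

function thresholds(array $values): array {
    sort($values);
    $n = count($values);
    return [
        'mean'   => array_sum($values) / $n,
        'median' => $values[(int) floor($n / 2)],
        '20pc'   => percentile($values, 20),
        '80pc'   => percentile($values, 80),
    ];
}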

Studivz

October 6th, 2011

I took some time to explore the Studivz wall posts dataset, to see whether it would be useful.

The first step was to extract a sub-set of the data, as the entire dataset is a little large to run in LocationSim (some of the CFAs can’t quite handle 28k nodes); scaling up to this many nodes is a work in progress (it might involve writing something more lightweight to do some of the raw processing).

The approach I have taken so far is to pick N nodes randomly and include their immediate neighbours. I repeat this to get the nodes a depth of L hops away from the source. Using 10 random nodes with a depth of 2 yields a network of around 3049 nodes (~10% of all nodes). When reduced to 5 seed nodes, we get ~1000 nodes (~4%). Going the other way, 100 seed nodes with a depth of 1 gives 14571 nodes, covering ~50% of the network. These figures change depending on which nodes are selected at random initially. Two other parameters affect the results: the first is a threshold, where nodes with a connected time less than this are not included; the second is the value used to seed the random number generator (if 0, a seed is chosen automatically).
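A rough sketch of this sampling (illustrative PHP; the names and data structure are mine, not LocationSim's):

<?php
// Pick $n seed nodes at random, then expand $l hops through the contact graph,
// skipping edges whose connected time is below $threshold. $adjacency maps
// node => [neighbour => connected time].
function sampleNodes(array $adjacency, int $n, int $l, float $threshold = 0.0, int $seed = 0): array {
    if ($seed !== 0) {
        mt_srand($seed); // fixed seed for repeatable samples; 0 = automatic
    }
    $seeds = (array) array_rand($adjacency, $n);
    $selected = array_fill_keys($seeds, true);
    $frontier = $seeds;
    for ($hop = 0; $hop < $l; $hop++) {
        $next = [];
        foreach ($frontier as $node) {
            foreach ($adjacency[$node] as $neighbour => $connectedTime) {
                if ($connectedTime < $threshold || isset($selected[$neighbour])) {
                    continue;
                }
                $selected[$neighbour] = true;
                $next[] = $neighbour;
            }
        }
        $frontier = $next;
    }
    return array_keys($selected);
}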

In the end I settled on the three parameter sets in the table below – note that the number of nodes in the final set is highly dependent on the initially chosen nodes, so this varies considerably.

Studivz Random Node Choices

N L # Nodes
3 2 213
4 2 914
10 2 3049


Interestingly, despite the source or seed nodes being picked at random, the entire graph is connected in all configurations; the graphic below shows the connected_time graph and InfoMap clusterings for N=3, L=2.

InfoMap clustering of Studivz dataset, where N=3 and L=2

This is a promising start: there are distinct clusters of nodes, which we expected, as this is the concatenation of three egocentric networks, but there are also connections between the egocentric networks, meaning there is a route to every other node. However, we can’t tell from this graph how often these contacts occur.

Looking at the whole dataset, we can get an idea about how active it is over time by measuring the number of connections in a given time period; the plot below shows the number of weekly connections for the entire dataset.

Weekly number of connections in the Studivz dataset

It shows that this social network seems to have become increasingly popular  over time, with a peak of just over 10,000 wall posts made in Jan 2007. If we were to pick a period to concentrate on, it should probably be from October 2006 onwards.

Studivz N=3, L=2

Initial results for each metric are shown below:

Delivery Ratio for BubbleH vs BubbleRAP for Studivz 3 2 0 0

Cost for BubbleH vs BubbleRAP for Studivz 3 2 0 0

Latency for BubbleH vs BubbleRAP for Studivz 3 2 0 0

Delivery ratio is very poor for all runs. To see what the maximum possible delivery ratio is, we can look at the results for flooding the network below:

Delivery Ratio plot of Unlimited Flood on Studivz 3 2 0 0

This achieves a delivery ratio of roughly 65 percent, so we have a bit of work to do to be able to match this!

Studivz 4 2 0 0

When we add another node to the initial seed set, we get a step up in the total number of nodes (914 to be exact); this is currently running through the simulator.

Studivz 4 2 0 0

UPDATE:

Below is the weekly activity during the period covered by the 914-node set (4,2,0,0):

Weekly activity in STUDIVZ 4,2,0,0

The results on the larger dataset are shown below. These runs took considerably longer, and highlighted a couple of minor bugs in the simulator (not closing files properly, which meant that “file not found” and “too many open files” errors kept occurring).

Delivery Ratio, Cost, Latency, Average Delivered Hops and Average Undelivered Hops for STUDIVZ with 4 seed nodes and a depth of 2.

We see here that BubbleH is doing well in terms of delivery ratio compared to BubbleRAP. LinkClustering, which created a huge number of communities, does particularly well (at ~30% for both BubbleRAP and BubbleH); this adds weight to the idea that a large number of communities does well, and in fact (in this case only, where there is just one set of parameters) we see that the average cost is roughly the same as with the other CFAs. BubbleH also performs well in terms of cost. Latency is very high for all CFAs as the dataset is very long.

Unlimited Flood and Prophet on STUDIVZ 4 2 0 0

However, we see from the Unlimited Flood run that we have a way to go to match the best possible delivery ratio: at around 90%, it beats BubbleH hands down. Some consolation, though: the more advanced Prophet algorithm also only gets around a 52% delivery ratio.

Categories: Datasets, experiments

MIT-NOV Hierarchy Analysis

September 1st, 2011

I re-ran the simulations for a simplified set of parameters for MIT-NOV and Cambridge. To make better sense of the data, I ranked each parameter to HGCE (each of which corresponds to a different hierarchical structure) for each metric – see table below:

MIT-NOV, HGCE, BubbleH Rankings - (Exp. Group: DATASET_EXPLORE)

HGCE M Parameter Del.Ratio Cost Hops Latency Depth (not ranked) Width (not ranked) Score?
0.001 8 5 7 5 3 15  
0.002 2 7 6 14 5 11  
0.003 13 3 4 4 3 13  
0.004 11 4 3 3 4 20  
0.005 1 2 2 12 2 17  
0.006 3 9 8 10 5 10  
0.007 6 12 12 9 5 14  
0.008 9 13 13 5 4 11  
0.009 4 11 10 8 5 14  
0.01 7 14 14 13 4 15  
0.02 14 7 9 1 3 16  
0.0 10 9 11 2 3 16  
0.1 5 1 1 11 5 16  
0.2 12 5 5 7 3 11  

This table shows the relative rankings for Delivery Ratio, Cost, Hops and Latency for each parameter of M to HGCE (values were derived from the optimum parameters to KCLIQUE based on the structure of its output communities; the same parameters were used for each CFA).
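A small sketch of how such a ranking can be produced from raw per-run metric values (illustrative PHP with made-up numbers, not the analysis code used here):

<?php
// Rank runs (keyed by the M parameter) on a single metric. Rank 1 is best:
// for delivery ratio higher is better, for cost, hops and latency lower is better.
function rankRuns(array $metricByM, bool $higherIsBetter): array {
    if ($higherIsBetter) {
        arsort($metricByM); // largest value first, keys preserved
    } else {
        asort($metricByM);  // smallest value first, keys preserved
    }
    $ranks = [];
    $rank = 1;
    foreach (array_keys($metricByM) as $m) {
        $ranks[$m] = $rank++;
    }
    return $ranks;
}

// Hypothetical delivery ratios for three M values.
print_r(rankRuns(['0.001' => 0.41, '0.005' => 0.47, '0.1' => 0.44], true));
// ['0.005' => 1, '0.1' => 2, '0.001' => 3]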

Below is an image of the associated Community Hierarchies

MIT-NOV, HGCE Communities for different threshold values (parameter M to HGCE)

Shown below are the four metrics in bar chart form, for comparison of actual values:

Delivery Ratio vs Cost, and Hops vs Latency for MIT-NOV, HGCE, BubbleH

From the above we can see that, generally, delivery ratio improves with more depth to the hierarchy; however, in this instance a shallow, broad structure does best overall. When ranking by latency, it appears that broad, shallow structures perform best.

Thought: should we run HGCE multiple times on the same data to see what range of different structures it comes up with? Also, we should get another hierarchical CFA (e.g. the other version of link clustering). Did this, and will post results soon.

Thought: is there a way of scoring hierarchy based on depth, width and population of communities?

Odd results – an update

July 13th, 2011

Just before Pádraig went away, we realised that the results we were getting for Bubble and BubbleH were exactly the same in some cases. We also wanted to see whether hierarchy was making any difference to results, and so we set these goals for the period whilst Pádraig was away:

  • work out why bubbleH and bubble are the same in some cases
  • get enron and wall posts dataset
  • pick the most hierarchical looking clustering  in these
  • see whether BubbleH works better than BubbleRap
  • (concentrate on HGCE and KCLIQUE)

The first question: does Bubble (no global parent) always give the same result as BubbleH (global parent)?

Dataset InfoMap KCLIQUE HGCE LinkClustering
MIT-NOV Yes No No No
CAMB Yes Yes Yes No
IC05 Yes Yes No No
IC06 Yes Yes No No
SS Yes Yes No No *
Enron Yes No No No
Studivz ? ? ? ?

* In the case of the Social Sensing Study with LinkClustering, all results are the same for Delivery Ratio, apart from where the threshold is 0.0. (With Social Sensing we used multiple tuning values, hence multiple results; the others used only one set of parameters.)

Answer: Not always.

Is the answer overlap?
I think that these results are down to the structure of the communities. InfoMap ALWAYS produces the same results for BubbleH and BubbleRAP. I wonder if this is down to the fact that InfoMap partitions the network, and therefore there are no overlaps? Could it be that, for the most part, KCLIQUE creates non-overlapping communities, and hence the results? HGCE creates a large number of highly overlapping communities, which are also hierarchical. LinkClustering also creates a large number of communities, and whilst edges cannot overlap, nodes can belong to multiple communities. Or is it really that the inherent hierarchy in community structure is causing results to differ?

The question is also: why do BubbleH and BubbleRAP get EXACTLY the same results if there is no overlap? This is because, in the end, there is no difference between them when there is no complicated structure to the network. When there is no overlap, each node belongs to exactly one community and has a local rank in both BubbleH and BubbleRAP, and even though BubbleH uses the global parent community, that is exactly the same as using the global ranking in BubbleRAP. In fact, we could re-write BubbleH to incorporate this explicitly and do away with a forced global parent, but this is just implementation detail; the end results would be the same.
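A minimal sketch of the forwarding test (my own simplification for illustration, not the simulator code) makes the equivalence concrete:

<?php
// $communities maps node => array of community ids; $localRank and $globalRank
// map node => rank value (larger is better). This is a simplified version of
// the Bubble forwarding decision, for illustration only.
function shouldForward(
    array $communities, array $localRank, array $globalRank,
    $carrier, $candidate, $destination
): bool {
    $candidateShares = array_intersect($communities[$candidate], $communities[$destination]);
    $carrierShares   = array_intersect($communities[$carrier], $communities[$destination]);

    if (!empty($candidateShares)) {
        // Candidate is in a community with the destination: bubble down by local
        // rank (or hand over immediately if the carrier is outside that community).
        return empty($carrierShares) || $localRank[$candidate] > $localRank[$carrier];
    }
    // Otherwise bubble up. BubbleRAP compares global ranks here; BubbleH compares
    // ranks within the global parent community. With non-overlapping communities
    // every node has exactly one community plus the global parent, so the two
    // comparisons are identical and the algorithms give the same result.
    return empty($carrierShares) && $globalRank[$candidate] > $globalRank[$carrier];
}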

Second Part: get enron and wall posts dataset

I used Conrad’s version of the Enron Dataset, as he had taken the time to remove irregularities, and in the case of future papers, he would have an in-depth knowledge of how he processed the data, saving me a lot of time!

The connected time graph is below, showing a decent number of clusters, hopefully with some nice hierarchy!

Connected Time Graph of the Enron dataset.

I explored this dataset in the same way as the previous ones, by experimenting with settings for the different algorithms. InfoMap is easy to visualise, so below is the InfoMap clustering of the dataset:

Connected Time Graph for Enron Dataset, coloured with InfoMap clustering

I also used Conrad’s Studivz wall post dataset (see here). This dataset is huge, and so I haven’t worked out how to run the full set of simulations. I was able to create a connected time edgelist (connected time is based on wall post length, at 1000ms per character). Below is the graph of all connections, with node size and colour related to degree, and edges removed for clarity.

Studivz connected time graph close-up with edges
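The wall-post conversion mentioned above can be sketched as follows (illustrative PHP; the field names are my own, not the dataset's):

<?php
// Each wall post becomes a contact event between the poster and the wall owner,
// lasting 1000ms per character of the post.
function wallPostToEvent(array $post): array {
    return [
        'from'     => $post['poster'],
        'to'       => $post['wall_owner'],
        'start'    => $post['timestamp'],
        'duration' => $post['length'] * 1.0, // seconds: 1000ms per character
    ];
}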

Enron Dataset

In order to get any clusters at all out of KCLIQUE and LinkClustering, I had to plug in some very small threshold values. This is probably due to the way in which I created the events (an event occurs when an email is sent, and the duration of every event is set to 2 seconds), so for the most part nodes are not connected, and therefore the overall connected times are small. (Conrad’s dataset did not include the content of the messages, so I was not able to give contact events a length based on message size.) To simplify things, I used the interesting parameters from KCLIQUE and LinkClustering to inform the parameters for HGCE, specifically the –minCliqueEdgeWeight parameter (more info about HGCE parameters), which excludes edges based on weight, effectively thresholding the graph edges as with KCLIQUE and LinkClustering.

To recap, the threshold means that (in the case of KCLIQUE and LinkClustering and now HGCE) edges are removed where the connected time between individuals is lower than the threshold.
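A sketch of how these tiny edge weights arise (illustrative PHP; the field names and the normalisation by the observation period are my assumptions, not the actual processing):

<?php
// Each email becomes a fixed 2-second contact between sender and recipient;
// the edge weight is the total connected time as a fraction of the whole
// observation period, which is why the useful thresholds are so small.
function connectedTimeRatios(array $emails, float $periodSeconds): array {
    $totals = [];
    foreach ($emails as $email) {
        $a = min($email['from'], $email['to']);
        $b = max($email['from'], $email['to']);
        $key = "$a:$b";
        $totals[$key] = ($totals[$key] ?? 0.0) + 2.0; // fixed 2s contact per email
    }
    $ratios = [];
    foreach ($totals as $key => $seconds) {
        $ratios[$key] = $seconds / $periodSeconds; // tiny ratios, hence tiny thresholds
    }
    return $ratios;
}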

Threshold parameters used for the Enron dataset:

0.0, 0.0000001, 0.0000002, 0.0000003, 0.0000004, 0.0000005, 0.0000006, 0.0000007, 0.0000008, 0.0000009, 0.000001, 0.000002, 0.000003, 0.000004, 0.000005, 0.00001

The plot below shows the results for Delivery Ratio for BubbleRAP using no global parent (NGP), and BubbleH using a global parent (GP) (in future, we can safely ignore the global parent part, as BubbleRAP should always be run without a global parent and BubbleH with a global parent).

BubbleRAP vs BubbleH on the Enron Dataset, showing results for multiple parameters (of M). BubbleH beats BubbleRAP in all cases

This is a good result: it shows that in this dataset, for these parameters, BubbleH beats BubbleRAP, but now we need to consider why. I had done an earlier exploration of the Enron dataset with default values for M (0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.9), and so I looked back at the results for that, and was surprised to see that BubbleRAP does much better in some cases. Below is a plot with the new data included (the solid red BubbleRAP line and blue dot-dash BubbleH line show this data).

BubbleRAP vs BubbleH for Enron Dataset, with extra (default) values for HGCE parameters (for M). Showing stronger results for BubbleRAP

So, BubbleRAP does better in some situations. (PS: NEED TO CHECK THIS AGAIN TO MAKE SURE BUBBLEH IS PROPERLY PLOTTED.)

I started to look at the next step, hoping it will give some answers: pick the most hierarchical looking clustering in [datasets].

I picked the best results for BubbleH, and mapped the communities to a hierarchical tree structure, shown below:

Community structure for best run of BubbleH on Enron dataset where M = 0.000001

So far so good: it seems to have a broad structure with some specific hierarchical clusters. I also mapped the worst run:

Community structure for worst run of BubbleH on Enron dataset where M = 0.0000007

This too has some good structure to it; note, however, that this is a plot of the worst performing run in a set of best parameters (we didn’t run a whole range of values – so this is the worst of the best).

The next to show is the best BubbleRAP run, as below:

Community structure for best run of BubbleRAP on Enron dataset where M = 0.0000005

Interestingly, this has a broad set of high-level communities (as with the previous plots, but those had a global parent), but less breadth lower in the hierarchy (i.e. fewer lower communities).

TODO: plot the worst run of BubbleRAP

UPDATE: I found a better way to show the dendrograms of the community hierarchy, and have automated the process. This page shows all of the plots for Enron without a global parent and this page WITH a global parent. To get a better idea of the relationship between hierarchical structure and results, I need to combine the results from runs with the structure in the same place – so we can compare side by side what happens when there are different types of community.