Archive

Archive for October, 2011

Experiment Logs

October 26th, 2011 No comments

I have been filling my brain up with experiments recently, and it occurred to me that I very often lose track of what I have been doing and why I have been doing it. So I have started to log what I am doing on the Experiment Logs page.

The idea is that for every chunk of experiment, I write down what the goals were, what the setup was, and what the result was. The structure is fairly loose, so pertinent everyday stuff will probably appear there too.

Categories: Uncategorized

Weird Simulator Bug… :(

October 26th, 2011 No comments

Anyway, the problem is a strange one. I have specified a dataset (based on the Studivz German university social network) and selected a time period in the dataset config. I have called this dataset studivz-3month-5_2, which refers to the period of the dataset I’m interested in (a 3 month chunk) and the group of nodes taken out of the whole lot. The full dataset is huge (28k nodes) and the simulator doesn’t cope well with it, so I picked 5 seed nodes, discovered the network 2 hops deep, and used all of those nodes as the set of nodes to use. In the config I have done this by specifying the nodes explicitly in the node-list property:
<property name="node-list" value="5967,2117,780,1828,2274, …. " />

The problem happens with community ranking: when the simulator runs the HuiRanking over the set of nodes in a community, for this dataset only, it seems to include all of the nodes in the list (as with global.dat), rather than just the nodes for that community.

However, the community file it uses lists the correct nodes (i.e. different (numbers of) nodes in each community).

The weirdest thing, though, is that it ONLY does it for this particular dataset, not for any others (e.g. MIT, Enron, Social-Sensing, InfoCom etc.). None of the source code has changed, just the config files, but I can’t work out what has gone wrong.
I added some debugging code to /LocationSim/src/ie/ucd/argfrot/simulate/bubblerap/LocalHuiResultTask.java which prints out the size of the Simulation context variable (as passed to LocalHuiResultTask), and for this particular dataset, it confirms that it contains all of the nodes (specified in the config). It appears that instead of the Simulation object having only the nodes for that community, it is including all of the nodes by mistake.

What I don’t understand is why it works for other datasets (with a smaller number of nodes) but for this particular dataset it does not… I can only think that it’s something to do with my configuration. The config files I have tested with are:
xml/UNIFIED_EXTENDED/InfoMap-make-communities.xml (which uses xml/UNIFIED_EXTENDED/datasets/All-Datasets.xml)
xml/UNIFIED_EXTENDED/InfoMap-centrality.xml (which uses xml/UNIFIED_EXTENDED/datasets/All-Datasets.xml and xml/UNIFIED_EXTENDED/datasets/All-Communities.xml)

The commands I was running:

java -jar dtnsim.jar xml/UNIFIED_EXTENDED/InfoMap-make-communities.xml 1 DATASETS=studivz-3month-5_2 PARAM_SET=default EXPERIMENT_GROUP=BUGFINDING
followed by
java -jar dtnsim.jar xml/UNIFIED_EXTENDED/InfoMap-centrality.xml 1 DATASETS=studivz-3month-5_2 PARAM_SET=default EXPERIMENT_GROUP=BUGFINDING

This generates a community file in
datasets/communities/BUGFINDING/InfoMap/studivz-3month-5_2/global-parent
and
datasets/communities/BUGFINDING/InfoMap/studivz-3month-5_2/no-global-parent

(e.g. file is named: edge_list.dat.communities.dat)

My next thought is to try to see if there is an issue with carriage returns in the communities.dat files since I moved them to the new SVN – but this seems unlikely…
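A quick way to rule the carriage-return theory in or out is to count line endings in the community file directly. This is only a sketch: the function name is made up for illustration, and it looks at raw bytes rather than assuming anything about the community file format.

```python
from pathlib import Path


def count_line_endings(path):
    """Return (crlf, bare_cr, lf) counts for the file at `path`.

    Hypothetical helper, not part of the simulator. It reads raw bytes so
    that no newline translation hides Windows or old-Mac line endings.
    """
    data = Path(path).read_bytes()
    crlf = data.count(b"\r\n")            # Windows-style endings
    bare_cr = data.count(b"\r") - crlf    # carriage returns not followed by LF
    lf = data.count(b"\n") - crlf         # plain Unix endings
    return crlf, bare_cr, lf
```

If the first two counts are zero for every edge_list.dat.communities.dat file, carriage returns are not the culprit.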

Update

Fixed… a very simple omission in the config files for community dataset loading:

Categories: Uncategorized

Next

October 13th, 2011 No comments

Next plan is to:

  • do a thorough search for paper targets
  • use a section of the dataset (1 to 3 months)
  • incorporate moses and see what happens
  • train the CFA on the first month
  • run algorithms on the last part
  • run multiple random sub-graphs and take the averages etc.

Future

  • aim to have this done by November
  • then get paper sorted ready for December
  • then work on next section involving Vector Clocks for estimating network properties used for routing
  • then finish the frikking thesis

Steps

Also – explore studivz dataset with KCLIQUE

  • use Mean, Median and 80th Percentile – finish KCLIQUE Studivz 4 2 0 0
  1. incorporate Moses algorithm
    • visualise?
    • check bubbleH moses studivz 4 2 0 0
  2. Pick a period of activity in the dataset
    • test runs
    • 1st Oct 2006 to 1st Feb 2007?
    • pick a new set of sub graphs based on this period?
  3. generate a graph based on the first 1/3rd
  4. run the algorithm from start to finish (but start the flooding 1/3rd in)
  5. run multiple times with different random node configs (e.g. 4,2,0,0 x 10) and get the average results of all
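
Steps 3 to 5 can be sketched roughly as below. This is an illustration only, not simulator code: contacts are assumed to be (node_a, node_b, start_time) tuples, and `run_fn` is a hypothetical callable that performs one simulation for a given random seed and returns a dict of numeric metrics.

```python
from statistics import mean


def training_contacts(contacts, start, end, fraction=1 / 3):
    """Keep only contacts that begin in the first `fraction` of the trace.

    `contacts` is assumed to be an iterable of (node_a, node_b, start_time).
    """
    cut = start + (end - start) * fraction
    return [c for c in contacts if c[2] < cut]


def average_runs(run_fn, rng_seeds):
    """Run `run_fn(seed)` once per seed and average each metric.

    `run_fn` is a stand-in for a single simulation run; it must return a
    dict of numeric metrics with the same keys on every run.
    """
    results = [run_fn(seed) for seed in rng_seeds]
    return {key: mean(r[key] for r in results) for key in results[0]}
```

For the "4,2,0,0 x 10" case, `rng_seeds` would be ten different non-zero seeds, each producing its own random sub-graph.
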
Categories: To Do

Studivz

October 6th, 2011 No comments

I took some time to explore the Studivz wall posts dataset, to see whether it would be useful to use.

The first step was to extract a sub-set of the data, as the entire dataset is a little large to run in LocationSim (some of the CFAs can’t quite handle 28k nodes). Scaling it up to this many nodes is a work in progress (it might involve writing something more lightweight to do some of the raw processing).

The approach I have taken so far is to pick N nodes randomly and include their immediate neighbours. I repeat this expansion until I have the nodes up to a depth of L hops away from the seeds. Using 10 random nodes with a depth of 2 yields a network of around 3049 nodes (~10% of all nodes). When reduced to 5 seed nodes, we get ~1000 nodes (~4%). Going the other way, 100 seed nodes with a depth of 1 gives 14571 nodes, covering ~50% of the network. These figures change depending on which nodes are selected at random initially. Two other parameters affect the results: the first is a threshold, where nodes with a connected time less than this are not included; the second is the value used to seed the random number generator (if 0, then a seed is chosen automatically).
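The seed-and-expand sampling described above could look something like the sketch below. It is not the script I actually used: the function name and the (u, v, connected_time) edge format are assumptions, and I have read the threshold as dropping edges whose connected time is below it, which is one possible interpretation.

```python
import random
from collections import defaultdict


def sample_subgraph(edges, n_seeds, depth, threshold=0.0, rng_seed=0):
    """Pick `n_seeds` random nodes and expand `depth` hops outwards.

    `edges` is assumed to be an iterable of (u, v, connected_time).
    Returns (seeds, selected_nodes).
    """
    # Build an undirected adjacency map, skipping edges under the threshold
    # (assumption: the threshold applies to each edge's connected time).
    adj = defaultdict(set)
    for u, v, connected_time in edges:
        if connected_time >= threshold:
            adj[u].add(v)
            adj[v].add(u)

    # A seed value of 0 means "choose automatically", as in the description.
    rng = random.Random(rng_seed if rng_seed != 0 else None)
    # Seeds are drawn only from nodes that still have at least one edge.
    seeds = rng.sample(sorted(adj), n_seeds)

    selected = set(seeds)
    frontier = set(seeds)
    for _ in range(depth):
        # Neighbours of the current frontier that have not been seen yet.
        frontier = {nb for node in frontier for nb in adj[node]} - selected
        selected |= frontier
    return seeds, selected
```

Calling it with the same non-zero `rng_seed` reproduces the same node set, which is what makes a named configuration such as "4 2 0 0" repeatable only when the last value is not 0.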

In the end I settled on the three parameter sets in the table below. Note that the number of nodes in the final set depends heavily on the initially chosen nodes, so it is very random.

Studivz Random Node Choices

N L # Nodes
3 2 213
4 2 914
10 2 3049


Interestingly, despite the source or seed nodes being picked at random, the entire graph is connected in all configurations. The graphic below shows the connected_time graph and InfoMap clusterings for N=3, L=2.

InfoMap clustering of Studivz dataset, where N=3 and L=2

This is a promising start, since there are distinct clusters of nodes, which we expected, as this is the concatenation of three egocentric networks, but also there are connections between each egocentric network, meaning there is a route to every other node. However, we can’t tell from this graph how often these contacts occur.

Looking at the whole dataset, we can get an idea of how active it is over time by measuring the number of connections in a given time period. The plot below shows the number of weekly connections for the entire dataset.

Weekly number of connections in the Studivz dataset

It shows that this social network seems to have become increasingly popular over time, with a peak of just over 10,000 wall posts made in Jan 2007. If we were to pick a period to concentrate on, it should probably be from October 2006 onwards.
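
Counting connections per week is simple to sketch. The version below is an illustration, not the code behind the plot: it assumes each wall post comes with a Unix timestamp in seconds, and it buckets by ISO calendar week in UTC.

```python
from collections import Counter
from datetime import datetime, timezone


def weekly_counts(timestamps):
    """Count events per (ISO year, ISO week).

    `timestamps` is assumed to be an iterable of Unix times in seconds.
    Returns a dict ordered by week.
    """
    counts = Counter()
    for ts in timestamps:
        iso = datetime.fromtimestamp(ts, tz=timezone.utc).isocalendar()
        counts[(iso[0], iso[1])] += 1   # key on (ISO year, ISO week number)
    return dict(sorted(counts.items()))
```

Plotting the values in key order gives a weekly activity curve, and the busiest period can be read off directly.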

Studivz N=3, L=2

Initial results for each metric are shown below:

Delivery Ratio for BubbleH vs BubbleRAP for Studivz 3 2 0 0

Cost for BubbleH vs BubbleRAP for Studivz 3 2 0 0

Latency for BubbleH vs BubbleRAP for Studivz 3 2 0 0

Delivery ratio is very poor for all runs. To see what the maximum possible delivery ratio is, we can look at the results for flooding the network below:

Delivery Ratio plot of Unlimited Flood on Studivz 3 2 0 0

This achieves a delivery ratio of roughly 65 percent, so we have a bit of work to do to be able to match this!
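
For reference, the three metrics can be computed from per-message records along the lines below. This is a sketch under assumptions, not how LocationSim does it: each message is taken to be a (created_time, delivered_time_or_None, copies) tuple, and cost is taken to be the mean number of copies per created message, which may differ from the simulator's exact definition.

```python
def summarise(messages):
    """Compute delivery ratio, mean latency and cost.

    `messages` is assumed to be a list of
    (created_time, delivered_time_or_None, copies) tuples.
    """
    total = len(messages)
    delivered = [m for m in messages if m[1] is not None]

    # Fraction of created messages that reached their destination.
    ratio = len(delivered) / total if total else 0.0
    # Mean time between creation and delivery, over delivered messages only.
    latency = (
        sum(d - c for c, d, _ in delivered) / len(delivered) if delivered else None
    )
    # Assumed cost definition: mean number of copies per created message.
    cost = sum(m[2] for m in messages) / total if total else 0.0
    return {"delivery_ratio": ratio, "mean_latency": latency, "cost": cost}
```

Flooding maximises the first number at the expense of the third, which is why it serves as the upper bound on delivery ratio here.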

Studivz 4 2 0 0

When we add another node to the initial seed set, we get a step up in the total number of nodes (914 to be exact). This is currently running through the simulator.

Studivz 4 2 0 0

UPDATE:

Below is the weekly activity during the set using 914 nodes (4,2,0,0)

Weekly activity in STUDIVZ 4,2,0,0

The results on the larger dataset are shown below. These runs took considerably longer, and highlighted a couple of minor bugs in the simulator (not closing files properly, which meant that "file not found" and "too many open files" messages kept occurring).

Delivery Ratio, Cost, Latency, Average Delivered Hops and Average Undelivered Hops for STUDIVZ with 4 seed nodes and a depth of 2.

We see here that BubbleH is doing well in terms of delivery ratio compared to BubbleRAP. Link clustering, which created a huge number of communities, does particularly well (at ~30% for BubbleRAP and BubbleH). This adds weight to the idea that a large number of communities does well, and in fact (in this case only, where there is only one set of parameters) we see that the average cost is roughly the same as with the other CFAs. BubbleH also performs well in terms of cost. Latency is very high for all CFAs, as the dataset is very long.

Unlimited Flood and Prophet on STUDIVZ 4 2 0 0

However, we see from the Unlimited Flood run that we have a way to go to match the best possible delivery ratio: at around 90% delivery ratio, it beats BubbleH hands down. Some consolation, though: the advanced Prophet algorithm also only gets around 52% delivery ratio.

Categories: Datasets, experiments