
Final Chapter – Next Steps

Met with Pádraig to discuss next steps.

We decided to go along the route of trying to classify CFAs based on synthetic datasets. We realised that there are no synthetic datasets out there that model both contacts and clustering.

We decided to be clever with the Studivz dataset. We will run the Moses CFA over it, pick a random community, and then build out the network from the communities connected to it.

The idea being that we can decide how many nodes we want, and keep going until we have enough.


I implemented this. The script takes an edge list and the community allocation produced by Moses (see file in SVN).

The script picks a random community C0. Then, while the total number of nodes is less than the maximum number of nodes, it picks the community (C1) connected to C0 that has the largest number of links to C0. It then picks one of the communities chosen so far at random, and repeats the picking process from there until the total number of nodes has been reached or exceeded.
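A minimal sketch of that sampling loop, in Python rather than whatever the script in SVN actually looks like. The names `link_counts`, `sample_communities` and the `start` parameter are made up for illustration. It assumes edges are pairs of node ids, communities are a mapping from community id to a set of nodes, a "link" between two communities is an edge with one endpoint in each, and that when the current community has no unused neighbours the walk restarts from another chosen community (the real script may handle that case differently).

```python
import random
from collections import defaultdict


def link_counts(edges, communities):
    """Count the edges running between each pair of distinct communities."""
    member_of = defaultdict(set)
    for cid, members in communities.items():
        for node in members:
            member_of[node].add(cid)
    counts = defaultdict(lambda: defaultdict(int))
    for u, v in edges:
        # Unordered pairs in a set, so a node that sits in several
        # communities does not make one edge count twice for a pair.
        pairs = {frozenset((cu, cv))
                 for cu in member_of.get(u, ())
                 for cv in member_of.get(v, ())
                 if cu != cv}
        for pair in pairs:
            a, b = tuple(pair)
            counts[a][b] += 1
            counts[b][a] += 1
    return counts


def sample_communities(edges, communities, max_nodes, rng=None, start=None):
    """Grow a set of communities until it covers at least max_nodes nodes."""
    rng = rng or random.Random()
    counts = link_counts(edges, communities)
    # C0: a random community, unless a (hypothetical) start is given.
    current = start if start is not None else rng.choice(list(communities))
    selected = [current]
    nodes = set(communities[current])
    while len(nodes) < max_nodes:
        candidates = {c: k for c, k in counts[current].items()
                      if c not in selected}
        if not candidates:
            # Assumption: restart from a chosen community that still has
            # unused neighbours; give up if there is none.
            frontier = [s for s in selected
                        if any(c not in selected for c in counts[s])]
            if not frontier:
                break
            current = rng.choice(frontier)
            continue
        # C1: the unused neighbour with the most links to the current community.
        best = max(candidates, key=candidates.get)
        selected.append(best)
        nodes |= set(communities[best])
        # Continue from one of the chosen communities, picked at random.
        current = rng.choice(selected)
    return selected, nodes
```

Swapping `max` for `min` in the line that picks `best` would give a variant that prefers the most weakly linked neighbour instead.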

Initially, I did this four times with a limit of 200 nodes, and produced the following graphs, coloured by modularity class and sized by betweenness centrality (sized linearly between 5 and 100 in Gephi).





But then I got to thinking: what are we actually trying to test here? It’s a bit vague to just see what happens in these datasets. We could be testing a number of things:

  • Which CFA works best for well-connected communities?
  • Which CFA works best for poorly connected communities?
  • Are we testing a particular aspect of a network?
    • Density?
    • Average community size?
    • Average network size?

So I also decided to generate a set of poorly connected datasets, and adapted the algorithm to pick the communities with the lowest number of links, resulting in the four below:





Also, perhaps we should be considering weighted links, for example picking the communities that spend the most time connected vs. the least time connected?

Each of these subsets should be compared in terms of other network metrics to see if there is an effect on CFA performance.
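As a rough idea of what such a comparison could compute for each subset, here is a small Python sketch. The function name `subset_metrics` and the choice of metrics (density, average community size, and the number of links inside versus between communities) are my own assumptions for illustration, not the output of any existing script. An edge counts as internal when its two endpoints share at least one community, which allows for a node belonging to more than one community.

```python
from collections import defaultdict


def subset_metrics(edges, communities):
    """Summarise one sampled subset: density, community size, link mix."""
    # Undirected, de-duplicated edges with self-loops dropped.
    edge_set = {frozenset(e) for e in edges if e[0] != e[1]}
    # Only nodes that appear in at least one edge are counted.
    nodes = {n for e in edge_set for n in e}
    n, m = len(nodes), len(edge_set)
    density = 2 * m / (n * (n - 1)) if n > 1 else 0.0

    sizes = [len(members) for members in communities.values()]
    avg_size = sum(sizes) / len(sizes) if sizes else 0.0

    member_of = defaultdict(set)
    for cid, members in communities.items():
        for node in members:
            member_of[node].add(cid)

    internal = external = 0
    for e in edge_set:
        u, v = tuple(e)
        # Internal if the endpoints share any community, external otherwise.
        if member_of.get(u, set()) & member_of.get(v, set()):
            internal += 1
        else:
            external += 1

    return {
        "nodes": n,
        "edges": m,
        "density": density,
        "avg_community_size": avg_size,
        "internal_links": internal,
        "external_links": external,
    }
```

Running this over each subset gives one row per dataset, which could then be set against the CFA performance figures.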

Awesome command to generate the community link count stats for each dataset:

EXPERIMENT_GROUP=FINAL_EVALUATION
DATASETS=studivz-three-200-A,studivz-three-200-B,studivz-three-200-C,studivz-three-200-D,studivz-three-200-A-MIN,studivz-three-200-B-MIN,studivz-three-200-C-MIN,studivz-three-200-D-MIN,mit-nov,mit-oct,cambridge,social-sensing,hypertext2009,infocom-2005,infocom-2006
for DATASET in $(echo ${DATASETS} | tr "," "\n"); do
  php -f scripts/stats/DatasetCommunityLinkCountStats.php \
    OUTPUT/${EXPERIMENT_GROUP}/edgelist-graphs/${DATASET}/edge_list.dat \
    OUTPUT/${EXPERIMENT_GROUP}/communities/${DATASET}/Moses/no-global-parent/edge_list.dat.communities.dat \
    CSV ${DATASET} \
    > OUTPUT/${EXPERIMENT_GROUP}/data/${DATASET}-DatasetCommLinkStats.txt
done
  1. Pádraig
    June 23rd, 2012 at 12:05 | #1

    So hopefully now we have two categories of network, one with densely connected communities and one loosely connected. One way to check this would be to look at the distribution of external contacts per time period per node compared with the internal distribution. I suppose it would be enough to just look at the ratio of the means of these distributions, i.e. average number of internal to external contacts.

    It may be, though, that your two alternative selection policies (min versus max) are very influenced by the starting point. From looking at the network diagrams, it looks like studivz-three-200-B is sampled from a very dense part of the network, and it looks denser than the other ‘max’ samplings.

    Anyway, it looks like we have a strategy for producing shed loads of ‘realistic’ contact network data.

