In my previous post I outlined how I’m doing a large and complex social network analysis (SNA) of a knowledge exchange organisation that has a worldwide presence, a membership of about 2,500, and activity spread across multiple platforms. In this post I’ll talk about data cleansing, and about Gloor’s Contribution Index as a way of attributing data.
The task is challenging, not least because I have 10 complete years of data and two incomplete years to analyse. Daunting as that is, I was able to quickly reduce the dataset to 10,576 rows and 7 columns, which then had to be cleaned and manipulated. My tool of choice for a dataset of this size, at least for the initial cleaning and manipulation stage, is Microsoft Excel. Excel has some very good capabilities, including the =CLEAN function, which removes non-printable hidden characters that cause problems in analysis tools.
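For anyone doing the same step outside Excel, here is a rough Python equivalent of CLEAN. It is only a sketch: it relies on CLEAN's documented behaviour of stripping the first 32 non-printing 7-bit ASCII characters (codes 0 to 31), and the function name `clean_cell` and the sample string are made up for illustration.

```python
def clean_cell(text: str) -> str:
    """Drop control characters with code points 0-31, mirroring Excel's CLEAN."""
    return "".join(ch for ch in text if ord(ch) >= 32)


# Hypothetical cell value with a stray NUL byte and a trailing line break.
sample = "Jane\x00 Doe\r\n"
print(repr(clean_cell(sample)))
```

Like CLEAN itself, this leaves characters such as the non-breaking space (code 160) untouched, so those still need separate handling.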
The dataset contained 10,354 posts, of which 7,238 were replies. Of these, "Anonymous" posted 1,999 replies to 1,374 posts. This represents about 18% of all posts. However, it was necessary to remove "Anonymous" from the dataset, because "Anonymous" is almost certainly not a single person, and leaving them in would distort the results. Similarly, identified pseudonyms, aliases, and duplicate names were consolidated or removed, along with “self-replies” and posts with no answers. Ultimately this process left 703 identified individuals in the network. These people make up the node set for the public, bounded (contained) network, to which activity and various network measures can be applied.
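The filtering rules above can be sketched in a few lines of Python. This is an illustration, not the workflow I actually used (that was done in Excel): the column names `author` and `replied_to`, the alias table, and the sample rows are all hypothetical, and "no answers" is read here as rows with no identifiable recipient.

```python
# Hypothetical alias -> canonical name table, built by hand during cleaning.
ALIASES = {"J. Smith": "John Smith"}


def build_edges(rows, aliases=ALIASES, excluded=("Anonymous",)):
    """Return (sender, receiver) pairs after applying the cleaning rules."""
    edges = []
    for row in rows:
        target = row.get("replied_to")
        if not target:
            continue  # no recipient: nothing to link (assumed reading of "no answers")
        sender = aliases.get(row["author"], row["author"])
        receiver = aliases.get(target, target)
        if sender in excluded or receiver in excluded:
            continue  # "Anonymous" is not a single person
        if sender == receiver:
            continue  # self-reply, including replies to one's own alias
        edges.append((sender, receiver))
    return edges


# Made-up rows, one per post.
rows = [
    {"author": "Anonymous", "replied_to": "John Smith"},
    {"author": "J. Smith", "replied_to": "John Smith"},
    {"author": "Mary Jones", "replied_to": "J. Smith"},
    {"author": "John Smith", "replied_to": None},
    {"author": "John Smith", "replied_to": "Mary Jones"},
]

edges = build_edges(rows)
nodes = sorted({name for edge in edges for name in edge})
print(edges)
print(nodes)
```

Resolving aliases before checking for self-replies matters: otherwise a person replying to their own alternate name would slip through as a genuine exchange.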
One of the first measures applied was Gloor’s Contribution Index: (messages sent − messages received) / (messages sent + messages received). It is interpreted as follows: