Zero Intelligence Agents
Chart of Congressional activity on Twitter related to SOPA/PIPA
As many of you know, this week thousands of people mobilized to protest two bills being considered in Congress: the Stop Online Piracy Act (SOPA) and its Senate counterpart, the PROTECT IP Act (PIPA). Several Internet mainstays, such as Wikipedia, Reddit, and O’Reilly, blacked out their sites to protest the bills. For some information on why this legislation is so dangerous, check out this excellent video by The Guardian.
The mobilization against SOPA/PIPA also included many grassroots efforts to contact Congress and demand the bills be stopped. Given the attention the bills were getting, I was curious if there was any surge in discussion of them by members of Congress on Twitter.
So, I created a visualization that is a cumulative timeline of tweets by members of the U.S. Congress mentioning “SOPA” or “PIPA.” To see if there was any surge, check out the visualization for yourself.
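For readers curious about what sits behind a cumulative timeline, here is a minimal Python sketch of the counting step. It is not the code behind the visualization: the function name and the sample dates are hypothetical, and it assumes the matching tweets have already been reduced to one date per tweet.

```python
from collections import Counter
from datetime import date
from itertools import accumulate


def cumulative_timeline(tweet_dates):
    """Return (day, running_total) pairs, sorted by day."""
    per_day = Counter(tweet_dates)  # tweets per calendar day
    days = sorted(per_day)
    totals = accumulate(per_day[day] for day in days)  # running sum
    return list(zip(days, totals))


# Hypothetical dates of tweets mentioning "SOPA" or "PIPA".
sample_dates = [
    date(2012, 1, 16),
    date(2012, 1, 18), date(2012, 1, 18), date(2012, 1, 18),
    date(2012, 1, 19),
]
timeline = cumulative_timeline(sample_dates)
```

A surge shows up as a steep step in the running total, which is why a cumulative view suits this question.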
First steps in data visualisation using d3.js, by Mike Dewar
Last night Mike Dewar presented a wonderful talk to the New York Open Statistical Programming Meetup titled, “First steps in data visualisation using d3.js.” Mike took the audience through an excellent review of d3.js fundamentals, and also showed off some of the features of working with Chrome Web Developer Tools. This is one of the best talks we have ever had, and if you have had any interest in exploring d3.js, but were intimidated by the design concepts or syntax, this is exactly the talk for you.
Also, Mike’s slides were all designed using d3.js and are available for download on his GitHub account: https://github.com/mikedewar/d3talk.
Monthly Twitter activity for all members of the U.S. Congress
Many months ago I blogged about the research that John Myles White and I are conducting on using Twitter data to estimate an individual’s political ideology. As I mentioned then, we are using the Twitter activity of members of the U.S. Congress to build a training data set for our model. A large part of the effort for this project has gone into designing a system to systematically collect the Twitter data on the members of the U.S. Congress.
Today I am pleased to announce that we have worked out most of the bugs, and now have a reliable data set upon which to build. Better still, we are ready to share. Unlike our old system, the data now lives on a live CouchDB database, and can be queried for specific research tasks. We have combined all of the data available from Twitter’s search API with the information on each member from Sunlight Foundation’s Congressional API.
To show the power of the database, I decided to use my newly acquired d3.js (sick of it yet?) skills to put together a tool that allows you to compare the monthly Twitter activity of all members of the U.S. Congress on Twitter for 2011.
Simply choose a politician from the drop-down menu (alphabetical by surname) and the graph will update with their activity data. If you want to reset the graph, just click the “Clear selections” button.
Feel free to add as many members as you like, but the dimensions of the visualization max out around 9. I have been playing around with it for a while, and it’s fun! Oh, and if you choose a member and nothing happens, it is most likely because that person didn’t tweet anything in 2011. I could have built in error-catching or some warning. Also, to clear things you need to re-load the page. I’ll leave real UX to the professional web designers.
Back to the data. Unfortunately, the database is sitting on a server that cannot process many requests (read, web-scale) at a time. In fact, this blog post may bring it down! As such, if you are interested in getting access to the database please contact me directly. But be forewarned, working with this system and CouchDB requires a mature understanding of several tools and languages, including but not restricted to curl, map/reduce, JavaScript, and JSON. And that’s before you have even done any analysis.
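If map/reduce is new to you, the idea behind a monthly-activity count can be sketched in plain Python. This is only an illustration of the pattern: CouchDB views themselves are written in JavaScript, and the document fields used here (`member`, `created_at`) and the sample documents are assumptions, not the actual schema of the database.

```python
from collections import defaultdict


def map_tweet(doc):
    """Map step: emit one ((member, 'YYYY-MM'), 1) pair per tweet document."""
    month = doc["created_at"][:7]  # assumes ISO dates such as "2011-03-14"
    yield (doc["member"], month), 1


def reduce_counts(pairs):
    """Reduce step: sum the emitted values for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


# Hypothetical tweet documents, one JSON-like dict per tweet.
tweets = [
    {"member": "member_a", "created_at": "2011-03-02"},
    {"member": "member_a", "created_at": "2011-03-19"},
    {"member": "member_a", "created_at": "2011-04-01"},
    {"member": "member_b", "created_at": "2011-03-07"},
]
emitted = (pair for doc in tweets for pair in map_tweet(doc))
monthly_activity = reduce_counts(emitted)
```

The result maps each (member, month) key to a tweet count, which is the shape of data a monthly activity chart needs.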
Many people have asked me about working with Congressional Twitter data, so I hope this data can be useful. Please feel free to reach out if you have any questions.
Who are the most central members of China’s leadership as we enter 2012?
As the United States gears up for what appears to be a long and grueling 2012 presidential campaign, China will also undergo its decennial turnover in presidential power in 2012. Unlike the United States, however, this shift will not involve any campaigning or voting, at least not with the people of China. Instead, this shift is one that is formalized within the Chinese Communist Party; but that doesn’t mean that there won’t be interesting shifts and reallocations of power.
This leads naturally to many questions; perhaps most importantly that of this post’s title: Who are the most central members of China’s leadership as we enter 2012?
Recently, I had the opportunity to work with Recorded Future, a startup out of Boston that specializes in longitudinal entity extraction from the massive amount of open-source data generated daily. For example, they have used their data to predict future patent issues for Apple based on issues raised by their competitors. This analysis includes many entities: Apple, HTC, Samsung, etc.; as well as the patents and lawsuits.
For our analysis we focused on China’s leadership, as defined by the CIA World Factbook, and extracted all of the named entities in their data for 2011 (over 4 billion events) in which any of the 33 official Chinese leaders appear. The result is a dataset with over 150,000 entities, including people, organizations, and places. To answer our questions, however, I used the co-occurrence of these entities in sentence fragments to build a large network of these entities.
Here I define an edge between two entities as the co-occurrence of two entities in a sentence fragment, which is provided by Recorded Future. Then, by extracting only the entities that are defined as people in the data, I generated a graph with 5,435 nodes and 34,413 edges. Big, but not unreasonable for analysis. Next, I computed some basic network statistics on that graph. As I have mentioned many times before, these measures are often most interesting if compared together. To highlight key actors, I generated a scatter plot of two metrics: Eigenvector centrality and betweenness centrality.
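The edge definition above can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the analysis: it assumes each sentence fragment has already been reduced to a list of person names, and the function name and placeholder names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_cooccurrence_graph(fragments):
    """Build an undirected graph as {person: set(neighbors)}.

    Each fragment is the collection of person names found in one
    sentence fragment; every pair in a fragment becomes an edge.
    """
    graph = defaultdict(set)
    for people in fragments:
        for a, b in combinations(sorted(set(people)), 2):
            graph[a].add(b)
            graph[b].add(a)
    return dict(graph)


# Hypothetical fragments with placeholder names.
fragments = [
    ["Person A", "Person B"],
    ["Person A", "Person C", "Person B"],
    ["Person C"],  # a lone mention creates no edge
]
graph = build_cooccurrence_graph(fragments)
edge_count = sum(len(neighbors) for neighbors in graph.values()) // 2
```

Storing neighbors in sets means repeated co-occurrences collapse into a single unweighted edge; keeping counts instead would give a weighted graph.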
Eigenvector centrality measures the overall centrality of a person in the network. It accounts for not only the number of connections a person has, but also how well connected those connections are themselves. People with high Eigenvector centrality will be the most prominent and well-connected actors in a network. Alternatively, betweenness centrality measures the number of shortest paths that go through an actor as a fraction of the total number of shortest paths in the entire network. People with high betweenness will be those that act as critical bridges or cut-points between two densely connected parts of a network.
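To make the two measures concrete, here is a self-contained Python sketch of both, assuming an unweighted, undirected graph stored as {node: set(neighbors)}. It uses power iteration for Eigenvector centrality and Brandes' algorithm for betweenness (left unnormalized). It is an illustration on a toy graph, not the implementation used for the plot.

```python
from collections import deque
from math import sqrt


def eigenvector_centrality(graph, iterations=200, tol=1e-9):
    """Power iteration; scores are normalized to unit Euclidean length."""
    scores = {node: 1.0 for node in graph}
    for _ in range(iterations):
        # Adding a node's own score (iterating on A + I) avoids oscillation
        # on bipartite graphs and leaves the leading eigenvector unchanged.
        updated = {n: scores[n] + sum(scores[m] for m in graph[n]) for n in graph}
        norm = sqrt(sum(v * v for v in updated.values())) or 1.0
        updated = {n: v / norm for n, v in updated.items()}
        if max(abs(updated[n] - scores[n]) for n in graph) < tol:
            return updated
        scores = updated
    return scores


def betweenness_centrality(graph):
    """Brandes' algorithm: shortest paths through each node (unnormalized)."""
    betweenness = {node: 0.0 for node in graph}
    for source in graph:
        # Breadth-first search that counts shortest paths from source.
        order = []
        predecessors = {node: [] for node in graph}
        path_count = {node: 0 for node in graph}
        distance = {node: -1 for node in graph}
        path_count[source], distance[source] = 1, 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if distance[w] < 0:  # first visit
                    distance[w] = distance[v] + 1
                    queue.append(w)
                if distance[w] == distance[v] + 1:  # v lies on a shortest path to w
                    path_count[w] += path_count[v]
                    predecessors[w].append(v)
        # Accumulate dependencies from the farthest nodes back to source.
        dependency = {node: 0.0 for node in graph}
        for w in reversed(order):
            for v in predecessors[w]:
                dependency[v] += path_count[v] / path_count[w] * (1 + dependency[w])
            if w != source:
                betweenness[w] += dependency[w]
    # Each undirected pair was counted once from each endpoint.
    return {node: value / 2 for node, value in betweenness.items()}


# Toy graph: two triangles joined through a single bridge node "x".
toy = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "x"},
    "x": {"c", "d"},
    "d": {"x", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"},
}
eigen = eigenvector_centrality(toy)
between = betweenness_centrality(toy)
```

In the toy graph, "x" lies on every shortest path between the two triangles, so it has the highest betweenness, while "c" and "d" have higher Eigenvector scores because they have more, and better connected, neighbors.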
When we compare these metrics, as I have above, we can easily identify key actors as those that do not follow the relatively linear relationship between the two measures. Those with high betweenness but relatively low Eigenvector are central bridges within the network. What makes this comparison important, however, is that these bridges do it with few connections, hence the lower Eigenvector. Likewise, those with relatively high Eigenvector but low betweenness are network insiders. They sit inside some central region of the network, but have very few connections outside that region. To further highlight these key actors, I have shaded the data points based on how much they diverge from this linear trend. The network bridges are dark red, while insiders are dark blue.
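One simple way to quantify divergence from a linear trend is to fit a least-squares line and look at the residuals. The Python sketch below takes that approach as an assumption; the exact shading rule behind the plot is not spelled out here, and the scores, threshold, and function names are hypothetical.

```python
def linear_residuals(x_values, y_values):
    """Residuals of y from the ordinary least-squares line fit on x."""
    n = len(x_values)
    mean_x = sum(x_values) / n
    mean_y = sum(y_values) / n
    sxx = sum((x - mean_x) ** 2 for x in x_values)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(x_values, y_values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [y - (intercept + slope * x) for x, y in zip(x_values, y_values)]


def classify(residual, threshold):
    """Label a point by how far it sits above or below the trend line."""
    if residual > threshold:
        return "bridge"   # more betweenness than the trend predicts
    if residual < -threshold:
        return "insider"  # less betweenness than the trend predicts
    return "typical"


# Hypothetical scores: Eigenvector on x, betweenness on y.
eigen_scores = [0.1, 0.2, 0.3, 0.4, 0.5]
between_scores = [0.1, 0.2, 0.6, 0.4, 0.2]
residuals = linear_residuals(eigen_scores, between_scores)
labels = [classify(r, threshold=0.15) for r in residuals]
```

The sign of the residual separates bridges from insiders, and its magnitude could drive the depth of the red or blue shading.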
The above plot was designed using d3.js and is interactive. When you roll over the data points the “Leader Information” section is populated and identifies who in the Chinese leadership the point represents. If you click on the point, or the photo of the leader, you are brought to their full biography page provided by China Vitae.
What I love most about visualizing data in this way is it leads to many questions. What I love more about creating interactive graphics is it allows for that first layer of questions to be immediately answered, which in turn leads to an even richer investigation.
Many things jump out of the above plot right away:
- We see the obvious placement of Hu Jintao and Wen Jiabao in the upper-right of the graph. This is useful because it confirms that nothing odd is happening in our data: the most powerful men in China on paper are also the most central in our graph.
- Popular press is reporting that current Vice-President Xi Jinping will lead the transitional government, so it is interesting to see him clustered closely with Xie Xuren and Zhou Xiaochuan. What will their role be in the new government?
- Why is Yin Weimin, a man with an ostensibly minor role in government, such an important bridge in the network?
- Liang Guanglie, the Chinese Minister of Defense, is a key insider. This seems to make sense given the prominence of the military in the Chinese government, but why is he isolated from the rest of the network?
What’s more important than these questions, however, are the non-obvious ones this plot raises. What I need is help from those with a better understanding of Chinese politics. Does the placement of some of China’s leaders seem way off, or does this plot essentially reinforce well-held beliefs about the balance of power? I am very interested in getting other people’s perspectives.
Finally, I want to give a special thanks to Christopher Ahlberg, CEO of Recorded Future, for working with me on this data and allowing me to publish these findings.
Federal Reserve borrowing during the 2007-2009 financial crisis
First, from looking at the date of my last substantive post I owe everyone an apology. I have essentially let Zero Intelligence Agents wither on the vine, and that is terrible. Not so much because I think people are desperate to read it, but because I am desperate to get feedback from people on my projects and ideas.
One such project I have been working on recently is looking at the newly released data on Federal Reserve borrowing of 407 banks and companies during the 2007-2009 financial crisis. I have been looking for data sets to tell stories with because one of the tools I am eager to learn in 2012 is Michael Bostock’s d3.js, a JavaScript library for data-driven documents (d3, get it?). It is an incredibly powerful tool, albeit very verbose and cumbersome for a total JavaScript newbie such as myself.
I decided to teach myself some d3 through this Federal Reserve data, and came up with this visualization in the labs section of drewconway.com. The image below is just a snapshot of the visualization; please click through to see the full interactive chart.
Because the data contained so many companies, I decided to focus on only those that were the most aggressive borrowers during the crisis. I defined this as an institution that borrowed more than 500% of its market capitalization in a single day, i.e., 5x its value. This left me with 16 companies. I then also excluded Lehman Brothers, because on a single day it borrowed over 40x its value, which was too extreme an outlier for this visualization.
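The selection rule above is easy to express in code. Here is a minimal Python sketch, assuming each record holds a company name, one day's borrowing, and the company's market capitalization in the same units. The function names, record layout, and sample figures are hypothetical and are not taken from the Federal Reserve data.

```python
def peak_borrowing_ratios(records):
    """Return {company: largest single-day borrowing / market cap}."""
    peaks = {}
    for company, borrowed, market_cap in records:
        ratio = borrowed / market_cap
        peaks[company] = max(peaks.get(company, 0.0), ratio)
    return peaks


def aggressive_borrowers(records, threshold=5.0, exclude=()):
    """Keep companies whose peak ratio exceeds threshold, minus exclusions."""
    peaks = peak_borrowing_ratios(records)
    return {
        company: ratio
        for company, ratio in peaks.items()
        if ratio > threshold and company not in exclude
    }


# Hypothetical daily records: (company, amount borrowed, market cap).
records = [
    ("Bank A", 60.0, 10.0),   # 6x on its peak day
    ("Bank A", 20.0, 10.0),
    ("Bank B", 30.0, 10.0),   # never above 5x
    ("Bank C", 450.0, 10.0),  # 45x, an extreme outlier
]
selected = aggressive_borrowers(records, exclude={"Bank C"})
```

Excluding an outlier by name, rather than with a second threshold, mirrors the choice of dropping a single company that would otherwise flatten every other trend line.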
What’s left are 15 companies that tell a fascinating tale of the turmoil in the financial markets from 2007-2009. What struck me the most about the visualization was how many foreign banks were among the most aggressive. From the snapshot above, you can see that Dexia SA has the largest spike in its trend line. I must admit, I had never heard of this bank, but as it turns out it is a Belgian-French bank, which also happens to be currently under investigation by the EU.
I would love to get feedback, both in terms of the data as well as the design of the visualization. If anyone has some insight as to what was going on with these banks during this time that might explain their trends, please let me know. Also, as I am very new to d3, if you have ideas on how to make the visualization better I welcome those as well.
UPDATE II: Updated the visualization with a legend and company filter, as suggested by many. I think it is better.
UPDATE: Thanks to @deepfoo for pointing me to this backgrounder from Bloomberg.
