Graphics known as tag clouds, which summarize chunks of text by weighting the font size of words based on frequency of use, have long been used to navigate the Web. Now they’re gaining traction as tools for armchair analysis of political rhetoric, thanks to Wordle, a tool that generates elegant word clouds from websites and blog feeds.
The Washington Post took word-cloud geekery mainstream this weekend, printing a pair of word clouds generated from the campaign blogs of John McCain and Barack Obama. The results:

Washington Post word clouds
As the mystery bloggers at Democracy In America note, it’s evidence that the presidential race is squarely focused on Barack Obama — “from afar both blogs look like they could belong to Obama girl.”
That’s a safe conclusion, considering the two biggest, bluest data points in the graphic. But are there other insights further down the size scale? One might be tempted to conclude that McCain’s big purple “Pentagon” is evidence of his emphasis on national security (or, on closer reading, that it’s more likely a reference to the campaign’s latest attack ads). A look at Obama’s cloud shows a curiously big “president” after the standard trope-trio of “hope,” “change,” “can.” Proof that he really is the “presumptuous nominee”?
Not quite. As with any visualization of a complicated data set, attention to detail is critical. Are the inputs being compared completely equal? Is the visual representation unbiased? Does the graphic quickly convey useful information?
The above clouds aren’t too bad, but they could be better. Each shows the top 150 words, Wordle’s default setting, and that number could stand to be a bit bigger. Half the words are horizontal and half vertical, arranged in no particular order. Any bias created by this arrangement is probably random, but it does mean that finding words of interest is a bit of a scavenger hunt. (Quick, where’s “Iraq”?)
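For readers curious about the mechanics, here is a minimal Python sketch of the weighting a tag cloud relies on: count the words, keep the most frequent ones, and scale font size by frequency. It is not Wordle's actual code. The stop-word list, the linear scaling, the point sizes, and the function names are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical stop-word list; the list Wordle really uses is not given here.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "for", "on", "that"}

def top_words(text, limit=150):
    """Return the `limit` most frequent non-stop words as (word, count) pairs."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return counts.most_common(limit)

def font_sizes(word_counts, min_pt=10, max_pt=72):
    """Scale each count linearly into a font size between min_pt and max_pt."""
    if not word_counts:
        return {}
    highest = max(count for _, count in word_counts)
    lowest = min(count for _, count in word_counts)
    spread = (highest - lowest) or 1  # avoid dividing by zero when all counts match
    return {
        word: min_pt + (count - lowest) * (max_pt - min_pt) / spread
        for word, count in word_counts
    }

sample = "oil oil oil energy energy drilling the the the the"
sizes = font_sizes(top_words(sample, limit=3))
print(sizes)
```

Raising `limit` is the equivalent of asking for more than the default 150 words. Other scalings (logarithmic, for instance) would compress the gap between the largest and smallest words, which is one more way two clouds built from the same text can look different.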
Most important, although the Post article links to the two blogs used as data sources, it gives no indication of the date range used to generate the clouds. This is a problem, at least if one wants to use the clouds to analyze the rhetorical current of the presidential debate. A quick look at the two blogs shows that McCain’s is infrequently updated by a few staffers, whereas Obama’s is frequently refreshed by a variety of contributors (perhaps not a surprise in a race between a BlackBerry addict and an analog candidate). As a result, the two feeds cover different spans: Obama’s reaches back only to July 26, while McCain’s extends a full month earlier. Comparing them directly is an exercise in apples and oranges (although to be fair, it’s not clear whether the Post considered this, and it’s difficult to tell now that the campaign blogs have moved on to the week’s latest inanity).
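The mismatch is easy to check for before generating anything. The sketch below assumes each feed has already been parsed into a list of (publication date, text) pairs; that structure, the function names, and the sample posts are hypothetical, and the year is an assumption since only month and day are given above.

```python
from datetime import date

def coverage(posts):
    """Return (earliest, latest) publication dates for a list of (date, text) posts."""
    dates = [published for published, _ in posts]
    return min(dates), max(dates)

def common_start(feed_a, feed_b):
    """Return the earliest date from which both feeds have posts."""
    return max(coverage(feed_a)[0], coverage(feed_b)[0])

# Made-up sample posts, for illustration only; not the campaigns' real entries.
busy_feed = [(date(2008, 7, 26), "post one"), (date(2008, 8, 2), "post two")]
quiet_feed = [(date(2008, 6, 26), "post one"), (date(2008, 8, 1), "post two")]

print(coverage(busy_feed))
print(coverage(quiet_feed))
print(common_start(busy_feed, quiet_feed))
```

If the two starting dates differ by weeks, as they do in this made-up sample, the clouds are summarizing different stretches of the campaign.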
Below, I’ve regenerated the clouds with a few improvements. Each contains 250 words, displayed horizontally and arranged alphabetically from left to right. The color schemes are identical, and I’ve tried to ensure that the graphics are roughly congruent. Finally, I used Yahoo Pipes to ensure that they both cover the same timespan — posts after July 27. That’s not a lot of time, but short of scraping both blogs with unscrupulous Internet tools, I’m not sure there’s a better way to get an equal data set. The results:

Obama

McCain
(visualizations via Wordle)
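The equalizing steps described above (same timespan, same word limit, alphabetical order) can be approximated in a few lines of Python. This is a rough sketch, not the Yahoo Pipes setup actually used; the (date, text) post structure, the function names, the sample posts, and the year in the cutoff date are assumptions.

```python
import re
from collections import Counter
from datetime import date

def posts_after(posts, cutoff):
    """Keep the text of posts published strictly after the cutoff date."""
    return [text for published, text in posts if published > cutoff]

def alphabetical_cloud(texts, limit=250):
    """Return the `limit` most frequent words as (word, count) pairs, sorted A to Z."""
    words = re.findall(r"[a-z']+", " ".join(texts).lower())
    top = Counter(words).most_common(limit)
    return sorted(top)  # each word appears once, so this sorts by word

# Made-up posts for illustration; the cutoff year is an assumption.
posts = [
    (date(2008, 7, 20), "old old old"),
    (date(2008, 7, 28), "oil drilling oil"),
    (date(2008, 7, 29), "energy oil"),
]
recent = posts_after(posts, date(2008, 7, 27))
print(alphabetical_cloud(recent, limit=3))
```

Running both feeds through the same cutoff and the same limit is what makes the resulting clouds comparable; the alphabetical sort only makes a given word easier to find.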
Even equalized, the argument that it’s all about Obama still appears sound, although this time McCain does show up as a tiny red squiggle at the top of Obama’s cloud. It’s clear that this week’s debate is all about “drilling” for “oil” versus how to “make” “new” “energy.” And it looks like Obama’s blog covers more topics with more words than McCain’s. Beyond that, readers will have to draw their own conclusions.
Tag clouds are great tools, but in order to convey useful information, their parameters must be correctly aligned. Consider them critically, lest they mislead.