Connor Mendenhall

Entries categorized as ‘Statistics’

Citius, Altius, Dorkius

August 23, 2008 · 3 Comments

I always feel ambivalent about the Olympic Games. Though once an exaltation of individual achievement, today’s games are a spectacle of mindless patriotism. Nations march in the opening ceremonies. National flags are plastered on every athlete. National anthems grace every medal ceremony. But individuals compete. Sure, it’s fun to root for the USA, but it’s hard to see how something like Michael Phelps’ streak of record-breaking races is any sort of national accomplishment. It’s a victory for Phelps, for his team and his trainers, but I sure didn’t sacrifice much on behalf of my country to help him win eight gold medals.

Yet, at every Olympics, the boosterism continues, with the total medal count the favored indicator of national greatness. This year, of course, it is a portent of our geopolitical future: today the Chinese win the Olympics, tomorrow they win the world economy!

On the other hand, incredible human achievements still lie beneath the shlocky crust of national pride. Plus, this year’s games brought our increasingly irrelevant and oddly endearing President out in force, which more than makes up for all the spectators wearing goofy Uncle Sam hats.

Most important, although the medal count ought to be an irrelevant indicator, it’s not. The modern Olympics have always been political, and so has the medal count. That makes it a fun statistic to slice and dice. The New York Times did so recently with a neat medal-count cartogram, and the BBC recalculated the rankings based on things like population and GDP.

Interesting, but not quite geeky enough for me. So, with the help of Google Docs, I cooked up a Gapminder-style animated chart comparing medal counts to some of the World Bank’s development indicators. Check it out below (you’ll have to click through).

The dataset is available here, and a lighter, embedded chart is available here (no thanks to WordPress!). Medal count data was imported from an Excel spreadsheet prepared by Chandoo, who graciously copy-pasted every medal count since 1896 from the IOC website and posted the data on his blog. Other indicators were imported from a World Bank DDP query. The 2008 medal count came from the official website around 10pm EST last night, so it’s already changed a bit.

Bear a few things in mind when using the chart. First, World Bank data is unavailable before 1960, so development indicators can only be compared to medal counts for Olympic Games held in 1960 or later. The medal data itself (the total number of medals awarded at each Olympics, the number of medals awarded to each country, and the “medal share,” or percentage of all medals won by each country) goes back all the way to 1896, so those figures can still be compared with one another for the earlier Games.
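For the curious, the “medal share” figure is simple arithmetic. Here is a minimal Python sketch of the calculation; the function name and the sample counts are made up for illustration and are not taken from the dataset.

```python
def medal_share(medals_by_country):
    """Return each country's percentage of all medals awarded at one Games."""
    total = sum(medals_by_country.values())
    if total == 0:
        # No medals awarded: avoid dividing by zero.
        return {country: 0.0 for country in medals_by_country}
    return {
        country: 100.0 * count / total
        for country, count in medals_by_country.items()
    }


# Hypothetical counts for illustration only, not real Olympic results.
example = {"A": 50, "B": 30, "C": 20}
shares = medal_share(example)
print(shares)
```

Because the shares are percentages of the same total, they always sum to 100 for a given Games, which makes them easier to compare across years than raw counts.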

Second, the Olympics are held every four years, but development indicators are included for every year. The chart interpolates medal counts in non-Olympic years, so if you care about accuracy, check out the dataset for exact values, or make sure you’ve scrolled the time slider to an Olympic year (for an example of this, check out the U.S. medal count between 1903 and 1905, graphed against the year).
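To see why the in-between years shouldn’t be trusted, here is a rough Python sketch of how a chart might fill them in. It assumes straight-line interpolation between neighboring Games; I don’t know exactly what the charting tool does internally, and the function name and numbers are invented for illustration.

```python
def interpolate_medals(year, known):
    """Estimate a medal count for `year` from counts in known Olympic years.

    Assumes a straight line between the two nearest Olympic years.
    """
    if year in known:
        return float(known[year])
    years = sorted(known)
    if year < years[0] or year > years[-1]:
        raise ValueError("year is outside the range of known data")
    for lo, hi in zip(years, years[1:]):
        if lo < year < hi:
            frac = (year - lo) / (hi - lo)
            return known[lo] + frac * (known[hi] - known[lo])


# Hypothetical counts for two consecutive Games, not real data.
known_counts = {2000: 10, 2004: 30}
midpoint = interpolate_medals(2002, known_counts)
print(midpoint)
```

The value for 2002 is a drawing artifact, not a real medal count: no Games took place that year, which is why only the Olympic years on the slider are meaningful.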

Finally, any errors in the data are my own, and likely the result of furious copy-pasting between various sources. I’ve checked it as best I can, but if you see something crazy, let me know, and I’ll do my best to fix it. Enjoy!

Categories: China · Economics · Nationalism · Olympics · Statistics

Who’s the boss?

August 6, 2008 · Leave a Comment

Josh Keating at FP Passport points to a recent opinion poll regarding Russia’s political leadership with a post headlined “Fewer than one in 10 Russians think Medvedev’s in charge.”

That’s certainly one conclusion to draw from the results, which indicate that only 9 percent of Russians believe President Dmitri Medvedev is more powerful than his predecessor, Vladimir Putin. But take a look at the rest of the data:

When asked “Who holds the real power in the country?” more than a third – 36% – said Prime Minister Vladimir Putin did, while only 9% saw Medvedev as the main figure. Almost half – 47% – answered that power was shared by “both equally.” Eight percent gave no answer.

It seems that “One in two Russians believe Medvedev, Putin share power equally” would also have been an accurate headline, albeit a less dramatic one.

Whether this indicates emerging checks and balances is unclear. It’s disconcerting that since the question was last asked in March, Putin’s share of the power poll grew by 15 points while Medvedev’s dropped by 11 points. But then again, the survey is a barometer of public sentiment rather than political reality. Either way, it appears that most Russians’ perspective on just who controls the Kremlin is more complicated than popularly portrayed.

Categories: Russia · Statistics

Clearing up campaign word clouds

August 4, 2008 · Leave a Comment

Graphics known as tag clouds, which summarize chunks of text by weighting the font size of words based on frequency of use, have long been used to navigate the Web. Now they’re gaining traction as tools for armchair analysis of political rhetoric, thanks to Wordle, a tool that generates elegant word clouds from websites and blog feeds.
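The basic idea is easy to sketch in Python. The snippet below counts words and maps each count onto a font size with a simple linear scale. The function name, the point sizes, and the linear scaling are my own assumptions; Wordle’s actual algorithm may weight and filter words differently (this sketch doesn’t even drop common words like “the”).

```python
import re
from collections import Counter


def font_sizes(text, min_pt=10, max_pt=48, top_n=150):
    """Map the top_n most frequent words in `text` to font sizes in points."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words).most_common(top_n)
    if not counts:
        return {}
    hi = counts[0][1]   # count of the most frequent word
    lo = counts[-1][1]  # count of the least frequent word kept
    span = hi - lo
    sizes = {}
    for word, n in counts:
        # Linear scale: the rarest kept word gets min_pt, the most common max_pt.
        scale = (n - lo) / span if span else 1.0
        sizes[word] = min_pt + scale * (max_pt - min_pt)
    return sizes


sample = "change hope change energy hope change"
sizes = font_sizes(sample)
print(sizes)
```

Note that the sizes are relative to the other words in the same cloud, so two clouds are only comparable if they were built from comparable inputs with the same settings.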

The Washington Post took word-cloud geekery mainstream this weekend, printing a pair of word clouds generated from the campaign blogs of John McCain and Barack Obama. The results:

Washington Post word clouds

As the mystery bloggers at Democracy In America note, it’s evidence that the presidential race is squarely focused on Barack Obama — “from afar both blogs look like they could belong to Obama girl.”

That’s a safe conclusion, considering the two biggest, bluest data points in the graphic. But are there other insights further down the size scale? One might be tempted to conclude that McCain’s big purple “Pentagon” is evidence of his emphasis on national security (or, on closer reading, that it’s more likely a reference to the campaign’s latest attack ads). A look at Obama’s cloud shows a curiously big “president” after the standard trope-trio of “hope,” “change,” “can.” Proof that he really is the “presumptuous nominee”?

Not quite. As with any visualization of a complicated data set, attention to detail is critical. Are the inputs being compared completely equal? Is the visual representation unbiased? Does the graphic quickly convey useful information?

The above clouds aren’t too bad, but they could be better. Each shows the top 150 words, which is Wordle’s default setting. But the number of words represented could be a bit bigger. Half the words are horizontal and half vertical, arranged in no particular order. Any bias created by this arrangement is probably random, but it does mean finding words of interest is a bit of a scavenger hunt. (Quick, where’s “Iraq?”)

Most important, although the author of the article in the Post does link to the two blogs used as data sources, there’s no indication of the range of dates used to generate the clouds. This is a problem, at least if one wants to use the clouds to analyze the rhetorical current of the presidential debate. A quick look at the two blogs shows that McCain’s is infrequently updated by a few staffers, whereas Obama’s is frequently refreshed by a variety of contributors (perhaps not a surprise in a race between a Blackberry addict and an analog candidate). As a result, the two feeds cover different periods: Obama’s ends July 26, and McCain’s concludes a full month earlier. Comparing them directly is an exercise in apples and oranges (although to be fair, it’s not clear whether the Post considered this, and it’s difficult to tell now that the campaign blogs have moved on to the week’s latest inanity).

Below, I’ve regenerated the clouds with a few improvements. Each contains 250 words, displayed horizontally and arranged alphabetically from left to right. The color schemes are identical, and I’ve tried to ensure that the graphics are roughly congruent. Finally, I used Yahoo Pipes to ensure that they both cover the same timespan — posts after July 27. That’s not a lot of time, but short of scraping both blogs with unscrupulous Internet tools, I’m not sure there’s a better way to get an equal data set. The results:
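For anyone who would rather script this than wire it up in a feed tool, here is a rough Python equivalent of the equalizing step. It is a stand-in sketch, not the Yahoo Pipes setup I actually used: the function name, the (date, text) post format, and the sample posts are all hypothetical.

```python
import re
from collections import Counter
from datetime import date


def top_words_after(posts, cutoff, top_n=250):
    """Count words in posts dated after `cutoff`.

    `posts` is a list of (date, text) pairs. Returns the top_n
    (word, count) pairs, sorted alphabetically by word.
    """
    counts = Counter()
    for posted, text in posts:
        if posted > cutoff:
            counts.update(re.findall(r"[a-z']+", text.lower()))
    return sorted(counts.most_common(top_n))


# Hypothetical posts for illustration only.
posts = [
    (date(2008, 7, 20), "old news old"),
    (date(2008, 7, 28), "oil drilling oil"),
    (date(2008, 8, 1), "new energy oil"),
]
words = top_words_after(posts, date(2008, 7, 27))
print(words)
```

Running both blogs through the same function with the same cutoff and the same word limit is what keeps the comparison fair; the July 20 post above is dropped because it falls outside the shared window.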

Obama

McCain

(visualizations via Wordle)

Even equalized, it appears that the argument that it’s all about Obama is still sound, although this time McCain does show up as a tiny red squiggle at the top of Obama’s cloud. It’s clear that this week’s debate is all about “drilling” for “oil” versus how to “make” “new” “energy.” And it looks like Obama’s blog covers more topics with more words than McCain’s. Beyond that, readers will have to draw their own conclusions.

Tag clouds are great tools, but in order to convey useful information, their parameters must be correctly aligned. Consider them critically, lest they mislead.

Categories: Design · Election 2008 · Internet · Politics · Statistics

Tax facts

July 22, 2008 · 1 Comment

The always-excellent 3quarksdaily points to “The Measure of America,” a new study by the American Human Development Project that concludes, among other things, that “the top 1 percent of U.S. households possesses a full third of America’s wealth,” and that “households in the top 10 percent of the income distribution hold more than 71 percent of the country’s wealth, while those in the lowest 60 percent possess just 4 percent.” A frightening indicator of increasing inequality, right?

Wrong. The latest research by the Tax Foundation shows that the top 1 percent of U.S. households pay 39 percent of all income taxes, and households in the top 10 percent of the income distribution pay about 71 percent of the total tax burden — a remarkably symmetric result. Income inequality may indeed be increasing, but it appears that we all pay our part when the taxman cometh.

Categories: Economics · Statistics · Taxes