Representing Data Visually

I recently started reading two programmer blogs. Well, actually they are just blogs written by programmers; the topics they cover vary quite a lot. The first is Skorks. It’s written by Alan Skorkin, an Aussie (I think). He covers a lot of Ruby programming, most of which I don’t quite understand yet but appreciate all the same. It’s the general topics he writes about that really catch my eye.

The second blog is Coding Horror, which is frequently quoted on Skorks. Again, I tend to skim the hardcore programming posts and pay more attention to the other topics, which I can visualize or relate to.

So, why am I mentioning them? Thanks to these blogs, I came across an amazing author called Edward Tufte (and a lot of other substantial things). Tufte writes about visualizing large amounts of data in the best possible way. It’s a simple ideal that is very difficult to implement, considering how most of us (mea culpa) try too hard to make presentations “look good” rather than focus on the data we want to show.

I checked the availability of his books around here, and also the price. They are forbiddingly expensive for a person on a meagre salary such as myself, so I resorted to downloading them from a torrent. The two books I could find were:

The Visual Display of Quantitative Information (around 400MB of high-resolution PDF)

Envisioning Information (around 30MB)

Just to give you a taste of what the books are about, I’ll pull an example from one of them. It’s a set of 4 datasets which are statistically very similar, but which show distinct properties when plotted on an x-y graph.

Anscombe’s Quartet

     I              II            III             IV
    x      y       x      y       x      y       x      y
 10.0   8.04    10.0   9.14    10.0   7.46     8.0   6.58
  8.0   6.95     8.0   8.14     8.0   6.77     8.0   5.76
 13.0   7.58    13.0   8.74    13.0  12.74     8.0   7.71
  9.0   8.81     9.0   8.77     9.0   7.11     8.0   8.84
 11.0   8.33    11.0   9.26    11.0   7.81     8.0   8.47
 14.0   9.96    14.0   8.10    14.0   8.84     8.0   7.04
  6.0   7.24     6.0   6.13     6.0   6.08     8.0   5.25
  4.0   4.26     4.0   3.10     4.0   5.39    19.0  12.50
 12.0  10.84    12.0   9.13    12.0   8.15     8.0   5.56
  7.0   4.82     7.0   7.26     7.0   6.42     8.0   7.91
  5.0   5.68     5.0   4.74     5.0   5.73     8.0   6.89

The mean, variance and correlation of the x-y values are nearly identical across all 4 datasets (they agree to about two decimal places), yet the datasets are very different from each other. The only sure way of showing that is visually:

[Figure: Anscombe's Quartet plotted as four x-y graphs]
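
If you would rather check the "statistically very similar" claim than take my word for it, here is a minimal Python sketch that computes the summary statistics for each dataset. It is not from Tufte's book; the names (`QUARTET`, `pearson`, `summarize`, `summaries`) are my own, the values are transcribed from the table above, and I am assuming sample variance (n - 1 in the denominator) and the Pearson correlation coefficient.

```python
from statistics import mean, variance

# Anscombe's quartet, transcribed from the table above.
# Datasets I-III share the same x values; dataset IV has its own.
X_SHARED = [10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0]
X_FOUR = [8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 19.0, 8.0, 8.0, 8.0]

QUARTET = {
    "I": (X_SHARED, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X_SHARED, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X_SHARED, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (X_FOUR, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson(xs, ys):
    """Pearson correlation coefficient, computed by hand with no third-party libraries."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    ss_x = sum((x - mx) ** 2 for x in xs)
    ss_y = sum((y - my) ** 2 for y in ys)
    return cov / (ss_x * ss_y) ** 0.5


def summarize(xs, ys):
    """Return the summary statistics that the text says barely differ between datasets."""
    return {
        "mean_x": mean(xs),
        "mean_y": mean(ys),
        "var_x": variance(xs),  # sample variance (n - 1 denominator)
        "var_y": variance(ys),
        "corr": pearson(xs, ys),
    }


summaries = {name: summarize(xs, ys) for name, (xs, ys) in QUARTET.items()}

for name, s in summaries.items():
    print(
        f"{name:>3}: mean_x={s['mean_x']:.2f} mean_y={s['mean_y']:.2f} "
        f"var_x={s['var_x']:.2f} var_y={s['var_y']:.2f} corr={s['corr']:.3f}"
    )
```

Every dataset should come out with a mean x of 9, a variance in x of 11, a mean y of about 7.5, a variance in y of about 4.12 and a correlation of about 0.816. The numbers alone give no hint that the four plots look nothing alike.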

There is a lot to be learnt from these books by anyone who will need to represent data in any form. And that would be almost all of us at some point or the other.
