
Posts tagged with data visualisation

Conversation topics on Facebook by age.








afrographique:

An infographic celebrating African Nobel Prize winners from across the continent.





The classic red/green colouring scheme for trading screens seems too alarmist.

http://media.dailyfx.com/illustrations/2012/04/30/AUDUSD_Trading_the_Reserve_Bank_of_Australia_Interest_Rate_Decision_body_ScreenShot100.png

http://graphics.moneyshow.com/traders/TipsCharts/March2012/daytraders07_1_med.gif

http://i.istockimg.com/file_thumbview_approve/7204532/2/stock-photo-7204532-stock-market-financial-trading-screen-in-green-and-red.jpg

http://accuratestocktrading.com/wp-content/uploads/2010/01/screenshot-when-email-alert5.jpg
http://4xlounge.com/wp-content/uploads/2011/07/tbconsolelive.png

Conceptually, the red/green distinction makes sense as corresponding to stop/go in traffic signals. But traffic signals need to be neon and striking in a hectic 3-D environment where it’s paramount for everyone to definitely not-miss the stop command.

But in a sheltered 2-D environment where the goals commonly include mastering emotion, controlling passive reactivity, keeping a long-term head in the middle of short-term volatility, and calmly digesting massive amounts of information at once, neon red/green seems too grating.

yellow and blue trading screen (GVZ)

I made the above picture with R of course, like this:

    require(quantmod)
    getSymbols("^GVZ")
    chartSeries(GVZ)
    reChart(up.col="light blue", dn.col="yellow")

(GVZ is the gold volatility index.)

It’s not a perfect colour scheme—I would use Lab to do better—but it already improves on #FF0000 versus #00FF00.

 

One theory of the evolution of trichromacy in primates says that

  • red/green dichotomy tells us whether meat or fruit is rotten or ripe (especially in dappled light)
  • blue/yellow dichotomy tells us how cool/warm something is
  • light/dark (value) is the most basic kind of vision.

If we take that as a starting point, a less alarmist colour scheme for trading software could use the blue/yellow dichotomy to indicate whether a security price went up or down. Use a neutral chroma for “small” moves (this depends upon one’s time-frame, but properly the definition of “big move” should be calibrated to an exponential moving average with some width depending on one’s market telescope). Intensity of the move could be signalled with lightness, so that most figures on a screen are a readable lightness of a neutral colour, but “big moves” are tinged with convexly more chroma and very-convexly more lightness.
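Here's a rough sketch of that idea in R. The colour choices and constants are my own, not anything standard: the HCL hues for blue and yellow, the convexity exponents, and the 20-day EMA "telescope" width are all assumptions for illustration.

    # colour daily moves: blue = up, yellow = down, neutral grey for small moves;
    # chroma grows convexly and lightness very-convexly with the size of the move,
    # measured against an exponential moving average of recent move sizes
    require(quantmod)                            # also loads TTR, which provides EMA()
    getSymbols("^GVZ")
    ret <- as.numeric( dailyReturn(GVZ) )        # daily moves
    typ <- as.numeric( EMA(abs(ret), n=20) )     # "typical" recent move size (20-day window is arbitrary)
    z   <- ret / typ                             # standardised move
    z[ !is.finite(z) ] <- 0                      # the first few days have no EMA yet
    hue    <- ifelse( z >= 0, 260, 85 )          # ~260 = blue, ~85 = yellow in HCL space
    chroma <- pmin( 100, 10 + 25*z^2 )           # convex: quiet days stay near-neutral
    light  <- pmin(  95, 70 + 15*z^4 )           # very convex: only big moves get tinged brighter
    cols <- hcl( h=hue, c=chroma, l=light )
    plot( ret, type="h", col=cols, lwd=2,
          main="GVZ daily moves: blue/yellow, emphasis only on the big ones" )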

XSTRATA

The definition of “up/down” might be refigured as whether the trader is short/long the security in question, or perhaps redness/greenness could be used in conjunction with the “market view” of cold/hot, to indicate whether a security is moving for/against one’s strategy. That too could be seen as overly alarming, but a (pseudo)convex coding of red-ness might solve the problem again, only invoking the “panic mode” when there’s really something to worry about.

(Source: twitter.com)





This is a beautiful and terrible data graphic—what Edward Tufte calls “chartjunk” or “design over communication”.

I’ll note some of the flaws for later reference in a longer piece I’m working on where I try to hit the highlights of numeracy / practical data literacy for non-statisticians.

  • The total spending size is irrelevant. The major determinant should be spending per pupil; country size is a confounding factor. Bubbles should be sized to the second set of numbers. (Theme: sensibly transformed data are better than raw data.)
  • The reordering effect and coloured strings look great and are helpful in tracking how orderings change across variables.
  • But the scales are chosen so you can’t tell anything from the length of the ribbon! Look at the literacy rates. If it’s so undifferentiated, why even include it? If you want to show some differences it’s acceptable to use an odds-ratio scale or a flog scale; see the sketch after this list. (Theme: transformations are good.)
  • Literacy rates and schooling attainment are, I guess, “ok” measures of how educated your population is. But it’s circular reasoning. Spend more on education per pupil because more of them stayed in school for longer. Duh. But the question is, did they learn more per [dollar|ruble|euro|peso]?
  • It could be interesting to look at years in school versus test scores and spending but the literacy numbers get in the way. I would put literacy last as no one wants to compare it with any of the other neighbours. You could even put it before the bubbles (main graphic) as if to say “Look, things aren’t all that bad in education. Let’s start with something nice” whilst comparing literacy to spending but not to science scores or years in school or what-all else.
  • The bubbles overlap like a Venn diagram. Not a huge problem since it’s fairly apparent that this is just to make things pretty but it could be confusing to someone who expects visual overlap to indicate conceptual overlap.
  • It would be really nice to bring out the number of crossings between spending and per-cap spending. I can’t think of one obviously best way to do this, but for one thing you could put science and math in a horizontal comparison rather than a vertical one. Then maybe change the lightness/value of the crossing points to draw the eye to them? The crossing points (differences between per-cap spending and measured outcomes) are what’s really interesting in this inquiry.
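Here’s a toy illustration of the odds-ratio point from above. The literacy numbers are made up purely for illustration; the point is just that a log-odds scale spreads out values which all sit near 100%:

    lit <- c( A=0.99, B=0.98, C=0.96, D=0.93, E=0.90 )   # hypothetical literacy rates
    logit <- function(p) log( p / (1 - p) )              # log-odds transform
    round( lit, 2 )          # raw scale: the ribbon has almost no length
    round( logit(lit), 2 )   # log-odds scale: A = 4.60 ... E = 2.20, now they differentiate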

(Source: mat.usc.edu)






the Good People and the misguided

HT @jaredwoodard (supervenes)








Components of internet traffic 1995-2005

via proofmathisbeautiful, un, infoneer-pulse, by Wired:

"the center of gravity of … media … is moving to a post-HTML environment,” we promised nearly a decade and half ago. The examples of the time were a bit silly — a “3-D furry-muckers VR space” and “headlines sent to a pager”…

  • Look how much DNS requests used to take up!
  • Also surprised email isn’t more of the traffic now. I guess this is measured in terms of bytes rather than in ℓ₀ terms (number of messages sent).

I remember how in the late 90’s people would speculate that everyone would become a co-creator (in fact big-money books were written to this theme). But maybe the lesson is that there are a relatively small number of passionate artists and artisans trying to get the word out about their stuff, and well-organised corps are very good at getting us to pay attention to certain art and not other art—although “viral” is a fairly chaotic Wild West, certainly more so than three-channel broadcast. The “peer-to-peer” category I interpret as people trading albums and movies by the top artists. The picture doesn’t go up to 2010, but I think big corps have made inroads into the fuchsia video band by now.

One more mathematical observation about this chart: the total amount of traffic obviously exploded during 1995-2005 but we see a constant height on the graph. So that’s like “modulo size changes”, aka the familiar [image].









An unabashedly narcissistic data analysis of my own tweets.

The unequivocally lovely Jeff Gentry (@geoffjentry) has contributed an R package with easy-to-read documentation that works, which I’ll walk through here so that you, too, can gaze at your own face mirrored in the beauty of a woodland pond—er, sea of electrons.

Here’s the basic flow for grabbing stuff. You can do more with ROAuth but that’s a bit of a pain.

require(twitteR)
RT.of.me <- searchTwitter("RT @isomorphisms", n=100)   # who has been retweeting me
news     <- getTrends(n=50)                            # trending topics
firehose <- publicTimeline(n=999)                      # a slice of the public timeline


my.tweets <- userTimeline('isomorphisms', n=3500)
head( my.tweets$text )
# Consider: Donkey Kong is neither a donkey, nor a kong.
# William Thurston, geometrizer of manifolds http://t.co/UPwuAnbP
# When I invent a single-letter language, it's going to be called Ж.
# @theLoneFuturist True. If $GOOG were only an ad network, with no search facility, how much would it be worth?
# Do your arms hang down by your side in zero gravity? Because then I bet astronauts have less smelly armpits.
# Can't log into Hacker News with #w3m! Unexpected.
# Salt and sugar are opposites. Therefore if i eat too much salty food I must balance it with candy. #logic
# @leighblue Do you know any behavioural econ studies on utility vs bite-size / package-size?

Those are some of the ways you can grab data — twitteR hooks into RCurl and then, like, the info is just there. Run twListToDF( tweets ) to split the raw info into 10 subfields—text-of-tweet, to-whom-was-the-reply-threaded, timestamp, and more.
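For instance (the column names here are from memory of the twitteR docs of the era; the exact set may differ by version):

    tweets.df <- twListToDF( my.tweets )
    names( tweets.df )    # e.g. "text", "created", "statusSource", "replyToSN", "screenName", "id", ...
    dim( tweets.df )      # one row per tweet, one column per subfield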

To pull out just one of those fields (“source of tweet”, for example), use sapply:

my.tweets <- userTimeline('isomorphisms', n=3000)
whence.i.tweet <- sapply( my.tweets, function(x) x$statusSource )

You can see from plots 1, 2, 3, and 4 that I use @floodgap's TTYtter client (tweeting from the command line; no installation). In fact this is why I’ve started tweeting so much the last few months: I run TTYtter in a virtual terminal, mutt (command-line gmail) in another virtual terminal, and therefore it becomes quite easy to flick my virtual newsfeed/conversation stream on for a minute or two here and there whenever I’m at the computer. It feels like The Matrix or Neuromancer or something.

Here’s how I created the ggplot radial chart #4 — this was the longest command I had to use to generate any of them. For some reason qplot didn’t like scale_y_log10() so I did:

ggplot( data = data.frame(whence.i.tweet), aes( x=factor(whence.i.tweet), fill=factor(whence.i.tweet) ) ) +
  scale_y_log10() +
  geom_bar() +
  coord_polar() +
  opts( title="whence @isomorphisms tweets", axis.title.x=theme_blank(), legend.title=theme_blank() )

In the words of @jeffreybreen, twitteR almost makes this too easy. A few months ago — before I knew about this package — I was analysing tweets for a client who wanted to gauge the effectiveness of “customer service tweets”. I wrote an ugly, hacky perl script that told me whether the tweet had an @ in it, whether the @ was part of an RT, whom the tweet was @’d at, and so on. Dealing with people using @ in another sense besides “Hey @cmastication, what’s up?”, different numbers of spaces between RT/MT, multiple RTs in the same message, and so on, was an icky mess. I probably spent half a week changing my regexes around to deal with more cases I hadn't thought of. Like most statisticians, I hate data munging—swimming around in the data is the fun part, not patching up the kiddie pool. Besides that, my client wanted the results in an Excel file — and Excel can't handle multidimensional arrays (whereas a tweet mentioning @a @b @c should have just one “mentions” slot with three things in it).
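For comparison, here is roughly the mention-extraction piece of that script redone with stringr. It is a deliberately naive sketch: the "@\\w+" pattern is my own approximation, and it stumbles on exactly the edge cases described above (e-mail addresses, odd spacing around RT/MT, and so on):

    require(stringr)
    tw <- "RT @cmastication: MT @a @b what's up?"
    str_extract_all( tw, "@\\w+" )[[1]]
    # "@cmastication" "@a" "@b"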


That twitteR package is so hot right now.

  

But as much fun as it was to display my love of TTYtter in four different plots, that’s not the only R-based egotainment you can compute on a Friday night.

How wordy am I?

I know I am wordy. I often adopt a telegraphic SMS-like typing style (“Sntrm wd b gr8 prez, like Ahmedinejad”) rather than hold back my trenchant remarks about astronauts’ armpits. Tumblr’s auto-tweets don’t help my average, either—the default is long, and I’m usually too lazy to change it.

With the magic of kernel density estimates—which are definitely not overkill for the analysis of my appropriately-florid and highly-important charstreams—and my usual base::plot params, the length of my tweets is made art in the form of chart #5.

I got a vector of tweet-lengths using @hadleywickham's stringr package:

my.tweets <- userTimeline('isomorphisms', n=3500)
my.tweets <- twListToDF( my.tweets )
iso <- my.tweets$text
require(stringr)
iso.len <- str_length(iso)    # vectorised! No for loops necessary
hist( iso.len, col="cyan" )   # hist() wants col=, not fill=
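A kernel-density version of the same picture (a quick sketch; bandwidth left at R’s default) would be:

    plot( density(iso.len), lwd=3, col="#333333", yaxt="n",
          main="lengths of @isomorphisms tweets" )
    abline( v=140, col="red" )   # the 140-character limit, for reference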

Proving once again that all real-world distributions fit a bell curv—…um.

You can of course use subset( my.tweets ) to plot tweets that were made under certain conditions—I might look only at my tumblr auto-posts using subset( my.tweets, statusSource=="tumblr"). Or only at short tweets using subset( my.tweets, str_length(my.tweets$text)<100 ). And so on.

 

Lastly, I wanted to plot my tweeple—the people I talk to on twitter (most of whom I don’t actually know in real life … I like to keep friends and mathematical geekery separate). As you can see from the final chart, it was largely a sh_tshow. Or so I thought, until I considered attacking the problem with ggplot.

One of ggplot's strengths—in my opinion its greatest strength—is the facet_grid( attribute.1 ~ attribute.2 ) function. In combination with base::cut — which assigns discrete “levels” to the data — facetting is especially powerful. I cut my data into four subsets, based on how many times I’ve tweeted @ someone:
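For reference, here is base::cut doing its thing on a toy vector, with the same breakpoints I use below:

    cut( c(1, 2, 3, 7, 40), breaks=c(0, 1, 2, 5, 100) )
    # (0,1] (1,2] (2,5] (5,100] (5,100]
    # Levels: (0,1] (1,2] (2,5] (5,100]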

my.tweets <- userTimeline( 'isomorphisms', n=3000 )

my.tweets <- twListToDF( my.tweets )   # data frame, so we can subset by field
# only tweets that are @ someone: replyToSN would be NA iff I tweeted
# into the vast nothingness, apropos of no-one
talkback <- subset( my.tweets, is.na(replyToSN) == FALSE )
# just the names, not the rest of the tweet's text or meta-information
tweeps <- talkback$replyToSN
# make a new data frame for ggplot to facet_wrap
tweep.count  <- table(tweeps)
tweep.levels <- cbind( tweep.count,
                       cut( tweep.count, c(0,1,2,5,100) ),
                       rownames(tweep.count) )
tweeps <- data.frame(tweep.levels)
names(tweeps) <- c("number", "category", "name")
class(tweeps$number) <- "numeric"
# all the above only came clear after a few attempts, and likewise the plot
# didn't work out perfectly at first either. But here's a decent plot that works:
ggplot( data = tweeps, aes(x=number) ) +
  facet_wrap(~ category, scale="free_x") +
  geom_text( aes( label=name, y=30-order(name), size=sqrt(log(number)),
                  col=number+(as.numeric(category))^2 ),
             position="jitter" ) +
  opts( legend.title = theme_blank(), legend.text = theme_blank() )

This made for a much more readable image. Not perfect, but definitely displaying info now.

 

OK, I do love talking about my twistory a little too much — but I’d like to see your histograms as well! If you run some stats on your own account, please post some pics below. I believe images can be directly embedded in the Disqus comments with <img src="http://i.minus.com/staggering_analysis_of_my_fantastic_words.jpg">.

(To save your R plots to a file rather than to the screen, do png("a plot named Sue.png"); plot( laa dee daa ); dev.off() where ; could be replaced by a newline.)










If you’re using base::plot in R for the first time you may have looked at ?plot (2 page help file) or ?par (12 page help file) to figure out what’s going on. It’s overwhelming.

This document explains the parameters I always bother to set. That way you can get decent plots without reading every parameter’s description.

(If you are just using R for the very first time and need some data, type data(faithful) or data(pima) to load some interesting pre-cleaned data sets. Then do plot(pima) or plot(faithful) to see how base::plot behaves. Type ??pima if you can’t find the dataset.)

> plot(faithful, pch=20, col=rgb(.1,.1,.1,.5), cex=.6)


Firstly: what is par? When you type par( lwd=3, col="#333333", yaxt="n" ), it will open an empty box that will hold your next plot( dnorm, -3, 3). You can run different plots in the box and as long as you don’t close it, the line-width will be 3 times bigger than default, the y-axis won’t have labels, and the colour will be dark-grey.
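For example, a throwaway sketch of that pattern:

    par( lwd=3, col="#333333", yaxt="n" )   # set the defaults on the open device
    plot( dnorm, -3, 3 )                    # drawn thick, dark grey, no y-axis labels
    plot( dcauchy, -3, 3 )                  # a different plot; the same settings still apply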

There are a lot of plotting options. Here are the ones I use regularly:

  • cex = .8. Decreases the size of type or plotted points by 20%.
  • par(new=TRUE). Use this to plot two things on top of each other. Beware, the labels will overprint over each other too (but this doesn’t matter for quick, casual plots).
    distribution of likes on tumblr 
  • col = "red", col = "#333333". I think #333333 is the best default colour and I use red if a point or line needs to stand out.
     one time I spent an evening statistically simulating a made-up society in R, and this was the distribution of people's qualities I generated
  • col=rgb(.1,.1,.1,.5). This is another decent grey for overplotting. I used this in the Old Faithful plot at the top. The first three numbers are Red, Green, Blue and the fourth is Transparency.
  • lwd = 3. This is a good line width, I think, especially with the dark grey col="#333333".
  • pch = 20. Plots points with a small circle. pch=19 is a slightly larger dot and pch=15 is a square. Read after the second group of bullets for more info.
  • png("name of the plot.png"). Then do plot(x), par(new=TRUE), plot(y), par(new=TRUE), plot(z), and remember to finish it off with dev.off(). [dev.off() means device off; the par() window and the png() file are considered “graphic devices”.]

a bimodal probability distribution

Here are the ones I use less regularly, but still more than weird stuff like oma, mex, mai, etc.

  • lend="butt". Line endings are squared off rather than rounded. I use this before I make a histogram.
  • ylog=TRUE, xlog=TRUE. “Hubble made this significantly worse chart before it was discovered that all data look like straight lines on log-log plots.” —Lawrence Krauss
  • las=1. If you want all of your axis labels to be printed horizontally.
  • mfrow=c(2,2). If you want to juxtapose four plots next to each other. With mfrow the plots fill in like a typewriter: left-to-right, starting a new row after 2 spots have been filled. With mfcol=c(3,3) they fill in column-by-column instead. (Try it if what I said doesn’t make sense.)
  • yaxt="n". This suppresses printing the vertical axis labels. I do this when plotting a distribution because those vertical numbers aren’t meaningful.
  • main="It's a plot about nothing. Don't you get it? People _love_ nothing!". This is the title of the plot.
  • legend( "topright", legend=c("control", "placebo", "test group"), fill = c("black", "#333333", "red"), border="white", bty="n"). This is how I find legends look good. You should only need to change the placement, legend text, and fill to make it work for your plot.
  • To plot multiple figures in the same picture do mfrow=c(3,2) or mfcol=c(3,2). Then the next six = three × two plots you run will go in left-to-right or up-to-down order, filling in six spots.
    mfrow=c(3,2)
    Don’t forget to do par(mfrow=c(1,1)) after you’re done, to go back to one plot per diagram.
  • If you want to save your plot to a file rather than “print” it to the screen, type png("a plot about nothing.png"); plot( stuff ); dev.off(). The dev.off() tells the system to go back to normal (printing to the screen—PNG device off).
  • One more awesome tip from the StackOverflow R community: how to get some sweet, sweet log-axis tickmarks. Read all about it.

Most of these can be done inside of plot( dpois, 0, 15, lwd = 3) or beforehand in a par(lwd=3); plot( dpois, 0, 15). With par(new=TRUE) and par(mfrow=c(2,2)), though, you need to do them in a par() beforehand.
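Putting a few of those together (a sketch using the Old Faithful data from earlier; the file name is arbitrary):

    data(faithful)
    png("faithful panels.png")     # send the next plots to a file...
    par( mfrow=c(1,2), las=1 )     # ...two panels side by side, horizontal axis labels
    plot( faithful, pch=20, col=rgb(.1,.1,.1,.5), cex=.8 )
    plot( density(faithful$eruptions), lwd=3, col="#333333", yaxt="n",
          main="eruption times" )
    dev.off()                      # PNG device off: back to the screen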

If you forget what the colours or the pch shapes are, do this: plot( 1:25, pch=1:25, col=1:25 ). You’ll get this:

plot( 1:20, pch=1:20, col=1:20)

So basically, you only want pch=20 and sometimes pch=19 or pch=15, like I said.

One more thing you might like to learn is how to colour important data points red and normal ones grey. I’ll explain that another time.