Posts tagged with twitter

A question I’ve googled before without success. Hopefully this answer will show up for someone who needs it. I’ll also go over the better-known uses of ? just in case.

  • To get help in R about a function like subset, type ?subset. That’s like man subset from the command line.
  • If you only know roughly what you’re looking for use double question marks: so ??nonlinear will lead to the package nlme. That’s like apropos on the command line.
  • To get a package overview, type ?xts::xts. There is no ?xts help. For packages that don’t have an overview page of this kind (there is no ?twitteR::twitteR), use ??twitteR to find help pages such as ?twitteR::status-class, ?twitteR::dmGet, etc.
  • Finally, the question of the title. To get R help on punctuation such as (, {, [, `, ::, ..., +, and yes, even on ? itself, use single quotes to ‘escape’ the meaningful symbol. Examples follow:
    • ?'`'
    • ?'('
    • ?'['
    • ?'...'
    • ?'+'
    • ?'%*%'
    • ?'%x%'
    • ?'%o%'
    • ?'%%'
    • ?'%/%'
    • ?'$'
    • ?'^'
    • ?'~'
    • ?'<-'
    • ?'='
    • ?'<<-'

All of the quotation marks `, ', " use the same help file so ?'"' or ?'`' will give you the help file for ?'''.
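
For scripts, the same lookups can be written with help() instead of ?. This is a small sketch, assuming a stock R installation with its help files present. The variable names are mine, and no output is shown because it depends on the installation.

```r
# help() takes the topic as a string, so punctuation needs no extra escaping
plus.page    <- help("+")    # same lookup as ?'+'
bracket.page <- help("[")    # same lookup as ?'['
quote.page   <- help("`")    # same lookup as ?'`'

# each result records which help file documents the topic
basename(as.character(plus.page))
basename(as.character(bracket.page))
basename(as.character(quote.page))
```

Printing any of these objects at the prompt opens the page, just as ? does.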

Albert Wenger, one of the owners of tumblr

At minute 31:

  • Google did not invent keyword advertising
  • GoTo, later renamed Overture, out of IdeaLab, invented it
  • and were acquired by Yahoo
  • Google improved upon the keyword search idea, turning keyword search into a viable business model
  • They realised there needs to be such a thing as a quality score—i.e., you don’t myopically give the ad space to the highest bidder. Long-term revenue maximisation required asking what the users want, and not p***ing them off.


Over a year ago, I wrote a letter to the editor of the Journal of Computational Sciences, urging the retraction of Bollen, Mao, and Zeng’s paper, “Twitter Mood Predicts the Stock Market.” Since JoCS is an Elsevier journal, one does not simply email the editor.
Rather, one has to register with the Elsevier author system, … submit LaTeX source code of a letter, along with supporting documents, author bio, … I distilled the main arguments into two:

  1. first, that the Granger causality tests presented in BMZ’s paper are … datamining, and present no evidence for a connection between Twitter and the Dow Jones Index;
  2. and that the quoted predictive accuracy of the forecast model is so high, it would … [contradict] the experiences of … [traders] … and so this forecast accuracy is likely to be erroneously reported.
I included references to BMZ’s failed attempts to commercialize their patented techniques with Derwent.

Following the strictest protocol, the editor of JoCS duly sent this letter to reviewers. After roughly seven months, …

The reviewers’ comments were more than fair. If my arguments were unclear, I was more than happy to reword them and provide additional evidence to get my point across. So I edited my letter to the editor, and re-sent it. …

…within two months or so (the equivalent of overnight in journal-time), the editor sent me a rejection notice with … review, quoted below. This review—this review is sensational. As one afflicted with Hamlet Syndrome, I admire Reviewer #4’s conviction. As someone too often in search of the right phrase to dismiss a crap idea, I take delight in Reviewer #4’s acid pen: I have never seen a reviewer so viciously shit-can a paper before. Reviewer #4 tore my letter to pieces, then burned the pieces. Then poured lye on the ashes. Then salted the earth where the lye sizzled. Then burnt down the surrounding forest, etc.



In May the Financial Times reported that Derwent Capital, the hedge fund that partnered with Johan Bollen and Huina Mao to trade the “Twitter Predictor” Strategy, “shut down”. The official story is that Derwent’s Capital Markets’ Absolute Return fund opened for investments in July 2011, and shuttered after a single month, with reported returns of 1.86%.

There are a few oddities here:

  1. Why is the FT reporting in May 2012 that a hedge fund closed in August 2011? It would seem this is no longer news. To confirm this is not an error on the part of the Financial Times, I quote a ‘weekly sentiment email’ sent by Derwent Capital on June 6, 2012: “Some of you may have read about our Hedge Fund closing last year in press articles this week.” What? I just caught up on the news of this ‘moon landing’, and now you’re telling me there are more events happening in the world?
  2. As late as the end of March 2012, Derwent was posting performance numbers for managed accounts on their webpage. The reported performance was generally positive, but not consistent with the spectacular performance promised by Johan Bollen. This period of Derwent’s existence has gone down the memory hole.

You can follow @shabbychef on twitter as well.

To tweet or not to tweet: that is the question.

  • @isomorphisms: Whether 'tis nobler in the mind to suffer
  • The @'s and RT's of outrageous spammers,
  • Or to hit /block against a sea of affiliate-marketers,
  • And by Reporting, end them? To quit; to disable one's account; no more
  • And by quitting we say to shut the laptop and end
  • The predictable op-eds and the thousand engineer'd linkbaits
  • That social media is heir to, 'tis a consummation
  • Devoutly to be wish'd. To quit; to go outside;
  • To go outside, perchance to walk, or lift a heavy object, or swing on a swing: ay, there's the rub,
  • For in the outdoors what emails may come to mind that we realise we need to write just this second but don't have a pen or paper or a blasted smartphone,
  • When we have left our b0xen in our flats,
  • Must give us pause: there's the habituation
  • That makes calamity of this 140-character media stream;
  • For who would bear the whips and scorns of fatuous trolls on internet chatboards,
  • (who are totally wrong), the successful entrepreneur's contumely,
  • The pangs of unrequited backlinks, the ISP's delay;
  • The insolence of @sacca and the spurns
  • Of blog readers who reliably bounce at an 87% rate,
  • When @isomorphisms himself might his own quietus make
  • With a bare bodkin? Who would arXiv browse,
  • To hunt and peck under a weary life,
  • But that the dread of something after logging off,
  • The undiscover'd country from whose boredom
  • No surfer ever returns, puzzles the will
  • And makes us rather bear those ills we have
  • Than fly to others we know not of?
  • Thus iPads do make addicts of us all,
  • Its native 2048-by-1536-pixel resolution
  • Is sicklied o'er with the pale cast of the thought that I'm never not looking at the damn thing,
  • And that enterprises of great pith and moment
  • With an enticing reddit link their currents turn awry,
  • And lose the name of action!

An unabashedly narcissistic data analysis of my own tweets.

The unequivocally lovely Jeff Gentry (@geoffjentry) has contributed an R package with easy-to-read documentation that works, which I’ll walk through here so that you, too, can gaze at your own face mirrored in the beauty of a woodland pond—er, sea of electrons.

Here’s the basic flow for grabbing stuff. You can do more with ROAuth but that’s a bit of a pain.

RT.of.me <- searchTwitter("RT @isomorphisms", n=100)
news <- getTrends(n=50)
firehose <- publicTimeline(n=999)

my.tweets <- userTimeline('isomorphisms', n=3500)
head(my.tweets$text)
Consider: Donkey Kong is neither a donkey, nor a kong.
William Thurston, geometrizer of manifolds http://t.co/UPwuAnbP
When I invent a single-letter language, it's going to be called Ж.
@theLoneFuturist True. If $GOOG were only an ad network, with no search facility, how much would it be worth?
Do your arms hang down by your side in zero gravity? Because then I bet astronauts have less smelly armpits.
Can't log into Hacker News with #w3m! Unexpected.
Salt and sugar are opposites. Therefore if i eat too much salty food I must balance it with candy. #logic
@leighblue Do you know any behavioural econ studies on utility vs bite-size / package-size?

Those are some of the ways you can grab data — twitteR hooks into RCurl and then, like, the info is just there. Run twListToDF( tweets ) to split the raw info into 10 subfields—text-of-tweet, to-whom-was-the-reply-threaded, timestamp, and more.

To pull out just one of those fields—like “source of tweet”, for example, use sapply:

my.tweets <- userTimeline('isomorphisms', n=3000)
whence.i.tweet <- sapply( my.tweets, function(x) x$statusSource )
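
To see what that sapply call does without touching the Twitter API, here is a toy version. The list below is a hypothetical stand-in for what userTimeline returns (real status objects carry more fields), and the table() step is my own addition for counting the sources.

```r
# Hypothetical stand-ins for twitteR status objects: plain lists with two fields
toy.tweets <- list(
  list(text = "first",  statusSource = "TTYtter"),
  list(text = "second", statusSource = "tumblr"),
  list(text = "third",  statusSource = "TTYtter")
)

# pull one field out of every element, giving a character vector
whence <- sapply(toy.tweets, function(x) x$statusSource)

# tally how many tweets came from each source
source.counts <- table(whence)
```

A vector like whence is the kind of thing the plots of tweet sources are built from.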

You can see from plots 1, 2, 3, and 4 that I use @floodgap's TTYtter client (tweeting from the command line; no installation). In fact this is why I’ve started tweeting so much the last few months: I run TTYtter in a virtual terminal, mutt (command-line gmail) in another virtual terminal, and therefore it becomes quite easy to flick my virtual newsfeed/conversation stream on for a minute or two here and there whenever I’m at the computer. It feels like The Matrix or Neuromancer or something.

Here’s how I created the ggplot radial chart #4 — this was the longest command I had to use to generate any of them. For some reason qplot didn’t like scale_y_log10() so I did:

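# opts() and theme_blank() belong to older ggplot2 releases; newer ones use theme() and element_blank()
# each + has to end its line, otherwise R stops reading after the first line
ggplot( data = data.frame(whence.i.tweet),  aes( x=factor(whence.i.tweet),  fill=factor(whence.i.tweet) )   ) +
 scale_y_log10() +
 geom_bar() +
 coord_polar() +
 opts(  title="whence @isomorphisms tweets",   axis.title.x=theme_blank(),   legend.title=theme_blank()   )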

In the words of @jeffreybreen, twitteR almost makes this too easy. A few months ago, before I knew about this package, I was analysing tweets for a client who wanted to gauge the effectiveness of “customer service tweets”. I wrote an ugly, hacky perl script that told me whether the tweet had an @ in it, whether the @ was an RT @, whom the tweet was @, and so on. Dealing with people using @ in another sense besides “Hey @cmastication, what’s up?”, different numbers of spaces between RT/MT, multiple RT's in the same message, and so on was an icky mess. I probably spent half a week changing my regexes around to deal with more cases I hadn't thought of. Like most statisticians, I hate data munging: swimming around in the data is the fun part, not patching up the kiddie pool. Besides that, my client wanted the results in an Excel file, and Excel can't handle multidimensional arrays (whereas a tweet mentioning @a @b @c should have just one “mentions” slot with three things in it).
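
For a flavour of the munging involved, here is a rough base-R sketch. It is not the perl script and not anything from twitteR. The tweets are made up, the regexes are deliberately naive, and they will not cope with the "@ in another sense" cases.

```r
# Made-up tweets for illustration
tweets <- c("RT @a: hello @b @c",
            "no mentions here",
            "Hey @cmastication, what's up?")

# one "mentions" slot per tweet, holding however many handles were found
mentions <- regmatches(tweets, gregexpr("@[A-Za-z0-9_]+", tweets))

# crude retweet flag: RT or MT, any amount of whitespace, then a handle
is.retweet <- grepl("^(RT|MT)[[:space:]]*@", tweets)
```

Because mentions is a list, a tweet with three handles keeps all three in a single slot, which is exactly the shape a flat spreadsheet struggles with.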

That twitteR package is so hot right now.


But as much fun as it was to display my love of TTYtter in four different plots, that’s not the only R-based egotainment you can compute on a Friday night.

How wordy am I?

I know I am wordy. I often adopt a telegraphic SMS-like typing style (“Sntrm wd b gr8 prez, like Ahmedinejad”) rather than hold back my trenchant remarks about astronauts’ armpits. Tumblr’s auto-tweets don’t help my average, either—the default is long, and I’m usually too lazy to change it.

With the magic of kernel density estimates—which are definitely not overkill for the analysis of my appropriately-florid and highly-important charstreams—and my usual base::plot params, the length of my tweets is made art in the form of chart #5.

I got a vector of tweet-lengths using @hadleywickham's stringr package:

my.tweets <- userTimeline('isomorphisms', n=3500)
my.tweets <- twListToDF( my.tweets )
iso <- my.tweets$text
require(stringr)
iso.len <- str_length(iso)  #vectorised! No for loops necessary
hist( iso.len, col="cyan" )  #base hist() colours bars with col, not fill

Proving once again that all real-world distributions fit a bell curv—…um.

You can of course use subset( my.tweets ) to plot tweets that were made under certain conditions—I might look only at my tumblr auto-posts using subset( my.tweets, statusSource=="tumblr"). Or only at short tweets using subset( my.tweets, str_length(my.tweets$text)<100 ). And so on.
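
The length and subset steps can be tried on toy data first. In this sketch the data frame is a hypothetical stand-in for twListToDF output with only two of its columns, the source strings are simplified, and nchar is the base-R counterpart of str_length.

```r
# Hypothetical stand-in for twListToDF() output, with just two columns
toy.tweets <- data.frame(
  text = c("short one",
           "a somewhat longer tweet about astronauts and armpits",
           "Unexpected.",
           "an auto-posted link with a long default description attached to it"),
  statusSource = c("TTYtter", "TTYtter", "TTYtter", "tumblr"),
  stringsAsFactors = FALSE
)

# vectorised character count, one number per tweet
toy.tweets$len <- nchar(toy.tweets$text)

# keep only rows that satisfy a condition
short.tweets  <- subset(toy.tweets, len < 30)
tumblr.tweets <- subset(toy.tweets, statusSource == "tumblr")

# kernel density estimate of the lengths; plot(len.density) would draw it
len.density <- density(toy.tweets$len)
```

Inside subset() the column names can be used bare, so len < 30 needs no toy.tweets$ prefix.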


Lastly, I wanted to plot my tweeple—the people I talk to on twitter (most of whom I don’t actually know in real life … I like to keep friends and mathematical geekery separate). As you can see from the final chart, it was largely a sh_tshow. Or so I thought, until I considered attacking the problem with ggplot.

One of ggplot's strengths (in my opinion its greatest strength) is the facet_grid( attribute.1 ~ attribute.2 ) function. In combination with base::cut, which assigns discrete “levels” to the data, faceting is especially powerful. I cut my data into four subsets, based on how many times I’ve tweeted @ someone:

my.tweets <- userTimeline( 'isomorphisms', n=3000 )
# subset() needs a data frame, not a list of status objects
my.tweets <- twListToDF( my.tweets )

# only tweets that are @ someone
talkback <- subset( my.tweets,  is.na(replyToSN) == FALSE )
#the value would be NA iff I tweeted into the vast nothingness, apropos of no-one

# just the names, not the rest of the tweet's text or meta-information
tweeps <- talkback$replyToSN

#make a new data frame for ggplot to facet_wrap.
tweep.count <- table(tweeps)
tweep.levels <- cbind( tweep.count,
    cut( tweep.count, c(0,1,2,5,100) ),
    names(tweep.count)  # assumed: the third column is the "name" used below
)
tweeps <- data.frame(tweep.levels)
names(tweeps) <- c("number", "category", "name")
class(tweeps$number) <- "numeric"
#all the above stuff only came clear after a few attempts
#and likewise the plot didn't work out perfect at first, either!
#but here's a decent plot that works:
ggplot( data = tweeps, aes(x=number) ) +
    facet_wrap(~ category, scales="free_x") +
    geom_text( aes(label=name, y=30-order(name), size=sqrt(log(number)),    col=number+(as.numeric(category))^2 ), position="jitter" ) +
    opts( legend.title = theme_blank(), legend.text = theme_blank() )

This made for a much more readable image. Not perfect, but definitely displaying info now.
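
The counting-and-cutting step is easier to follow on a toy vector. A minimal sketch, assuming made-up names in place of real reply targets. It builds the data frame directly rather than going through cbind, which is my own simplification.

```r
# Hypothetical reply targets; each entry is one tweet sent @ that person
tweeps <- c("ann", "bob", "bob", "cat", "cat", "cat", rep("dee", 6))

# how many times each person was tweeted at
tweep.count <- table(tweeps)

# cut() turns each count into one of four interval-labelled levels
tweep.df <- data.frame(
  name     = names(tweep.count),
  number   = as.vector(tweep.count),
  category = cut(as.vector(tweep.count), breaks = c(0, 1, 2, 5, 100))
)
```

Each row of tweep.df now carries a category that facet_wrap can split on, with intervals closed on the right, so a count of exactly 2 falls in the second group.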


OK, I do love talking about my twistory a little too much — but I’d like to see your histograms as well! If you run some stats on your own account, please post some pics below. I believe images can be directly embedded in the Disqus comments with <img src="http://i.minus.com/staggering_analysis_of_my_fantastic_words.jpg">.

(To save your R plots to a file rather than to the screen, do png("a plot named Sue.png"); plot( laa dee daa ); dev.off() where ; could be replaced by a newline.)

Geotagged photos (e.g. flickr) and text (e.g. twitter) associate data to a particular point on the globe × time. In other words, a fibre bundle over S²×T.

Imagine the position future historians will be in — if they can synthesise the petabytes of digital data we generate these days. I would love to have crowd-sourced pictures or live-tweeting bystanders’ microblogs of the Yan tie lun 鹽鐵論 (81 B.C.) — Rashomon effect be damned.

The most-followed accounts are:

  1. Lady Gaga
  2. Justin Bieber
  3. Britney Spears
  4. Barack Obama
  5. Kim Kardashian
  6. Katy Perry
  7. Ashton Kutcher
  8. Ellen DeGeneres
  9. Taylor Swift
  10. Oprah Winfrey

Other than the bolded names, this reminds me of the TED talk about why anyone will eat at Applebee’s, just not anyone you know.