
Posts tagged with R

A question I’ve googled before without success. Hopefully this answer will show up for someone who needs it. I’ll also go over the better-known uses of ? just in case.

  • To get help in R about a function like subset you type ?subset . That’s like man subset from the command line.
  • If you only know roughly what you’re looking for, use double question marks: ??nonlinear, for instance, will lead you to the package nlme. That’s like apropos on the command line.
  • To get a package overview, type ?xts::xts. (There is no plain ?xts help page.) For packages that don’t provide such an overview page (twitteR, say, has no ?twitteR::twitteR), use ??twitteR to find the individual help pages such as ?twitteR::status-class, ?twitteR::dmGet, etc.
  • Finally, the question of the title. To get R help on punctuation such as (, {, [, `, ::, ..., +, and yes, even on ? itself, use single quotes to ‘escape’ the meaningful symbol. Examples follow:
    • ?'`'
    • ?'('
    • ?'['
    • ?'...'
    • ?'+'
    • ?'%*%'
    • ?'%x%'
    • ?'%o%'
    • ?'%%'
    • ?'%/%'
    • ?'$'
    • ?'^'
    • ?'~'
    • ?'<-'
    • ?'='
    • ?'<<-'

All of the quotation marks `, ', and " share the same help file, so ?'"', ?'`', and ?"'" all bring up the same page.
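If you prefer the function-call form, these are equivalent ways to pull up the same pages (a small aside, not from the original list):

help("[")       # same page as ?'['
?`[`            # backticks work too
help("subset")  # same page as ?subset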




tl;dr: If you want to be contacted for freelance R work, edit this list https://github.com/isomorphisms/hire-an-r-programmer/blob/gh-pages/README.md.

 

Background/Problem: I was looking for a list of R-programming freelancers and realised there is no such list.

Other than famous people and people I already talk to, I don’t know even a small fraction of the R community—let alone people who do R among other things and don’t participate in the mailing lists or chatrooms I do.

This is actually a more general problem, since anyone looking to hire an R programmer will find a wall of tutorials if they google “hire an R programmer” (http://google.com/search?q=hire+an+r+programmer).

 

Solution: I thought about making a publicly-editable website where freelancers can put their contact info, specialty areas, links to projects, preferred kind of work, rates, and so on.

Of course, I’d have to make the login system. And validate users. And fight spam. And think up some database models, change the fields if someone suggests something better…. And it would be nice to link to StackOverflow, Github, CRAN, and …

The more I thought about it the more I favoured a solution where someone else does all the work. GitHub already has a validation system, usernames, logins, and a publicly editable “wiki”. MVP. No maintenance, no vetting, no development. GitHub already shows up in google so whoever searches for “hire an R programmer” will find you if you put your details there.

It’s actually unbelievable that we’ve had R-Bloggers as a gathering place for so long, but nowhere central to list who’s looking for work.

So I committed https://github.com/isomorphisms/hire-an-r-programmer/blob/gh-pages/README.md which is a markdown file you can add your information to, if you want to be found by recruiters who are looking for R programmers. Forking is a good design pattern for this purpose as well. Add whatever information you want, and if you think I’m missing some fields you can add those as well. Suggestions/comments also welcome below.




Two interesting ideas here:

  • "trading time"
  • price impact of a trade proportional to exp( √size )

Code follows:


(Source: finmath.stanford.edu)










playing along with Elias Wegert in R:

X <- matrix(1:100, 100, 100)                   # grid
X <- X * complex(imaginary=.05) + t(X)/20      # twist & shout
X <- X - complex(real=2.5, imaginary=2.5)      # recentre
plot(X, col=hcl(h=55*Arg(sin(X)), c=Mod(sin(X))*40), pch=46, cex=6)

Found it was useful to define these few functions:

arg  <- function(z) (Arg(z) + pi)/2/pi*360                 # rescale phase for hcl's hue input
ring <- function(C) C[.8 < Mod(C) & Mod(C) < 1.2]          # focus on the unit circle
lev  <- function(x) ceiling(log(x)) - log(x)
m    <- function(z) lev(Mod(z))
plat <- function(domain, FUN) plot(domain,
            col=hcl(h=arg(FUN(domain)), l=70 + m(domain)),
            pch=46, cex=1.5, main=substitute(FUN))         # say it directly

NB: hcl()’s hue lives in [0, 360], so the phase (Arg, which lives in (−π, π]) needs to be rescaled to match; that’s what arg() above does.
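A minimal usage sketch (the grid Z and the choice of functions are mine, not from the original post):

Z <- outer(seq(-2, 2, length=400), seq(-2, 2, length=400)*1i, "+")   # a square complex grid
plat(Z, sin)    # phase portrait of sin
plat(Z, exp)    # phase portrait of exp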










I like this concept of “low volatility, interrupted by occasional periods of high volatility”. I think I will call it “volatility”.

Daniel Davies

via nonergodic

 

(PS: If you didn’t see it before: try plotting this in R:

vol.of.vol <- function(x) {
    dpois(x, lambda=dpois(x, 5))
}

… and so on, to your heart’s content.
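For instance, a minimal way to plot it (using the function above):

x <- 0:20
plot(x, vol.of.vol(x), type="h")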

image

Fun, right?)




Whilst reading John Hempton’s post on shorting $HLF I decided to follow along in quantmod.

Long story short, Herbalife sells weight-loss supplements through a multi-level marketing business model which Bill Ackman contends is an unsustainable, illegal pyramid scheme. Ackman calls it “the only billion-dollar brand no-one’s ever heard of” and Hempton posts some very unflattering Craigslist marketing adverts:

image

image

image

thus undermining its credibility.

I should mention that I got some internet ads from Pershing Square Capital Management when I googled for Herbalife information. In other words the shorts are spreading the word to the common man: jump on this short! Destroy this pyramid scheme! You could read this as similar to a penny-stock email, but I view it simply as credible self-interest: I already put my shorts on for what I believe are rational reasons. It’s obviously in my self-interest to convince you to do the same, but I do in fact believe that $HLF will and should go down, and you know I believe it because I put my money where my mouth is. Whether that’s an ideal confluence of honesty, moral high ground, and selfishness—capitalism at its best—or some overpowerful hedgies using their marketing dollars to bring down a solid company, I’ll leave up to you.

 

Anyway, on to the quantmod stuff.

Here’s how to generate the 2007–present view:

image

require(quantmod)
getSymbols('HLF')
setDefaults(chartSeries, up.col="gold", dn.col="#2255aa", color.vol=FALSE)
chartSeries(HLF)

Now here’s the interesting part.


(…”Ackman” should read “Einhorn” in red there…)

You can see in red that trades per day (volume) have risen to 10–20 times normal levels during 2013—which maybe we can attribute to the “buzz” generated by Pershing Square, @KidDynamite, Bronte Capital, and whoever else is calling $HLF a pyramid scheme.

median(Vo(HLF)) tells me the halfway split between “high” and “low” trading volume for this stock. It’s roughly 2 million trades per day. Then with quantmod I can plot those hi-lo subsets with chartSeries(subset(HLF, Vo(HLF)<2e6)); chartSeries(subset(HLF, Vo(HLF)>2e6)) to get a visual on “calm days” versus “heavy days”. That’s something you can’t do with Google Charts.
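Pulling those inline snippets together (a sketch; the 2e6 and 1e7 cut-offs come from the text above and below):

require(quantmod)
getSymbols("HLF")
median(Vo(HLF))                           # roughly 2 million at the time of writing
chartSeries(subset(HLF, Vo(HLF) < 2e6))   # calm days
chartSeries(subset(HLF, Vo(HLF) > 2e6))   # heavy days
chartSeries(subset(HLF, Vo(HLF) > 1e7))   # "pirate" days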

Here’s calm (under 2 million trades/day)

image

upper half of heavy trading days (over 2 million/day)

image

and what I’ll call “pirate days” (over 10 million trades/day)—with plunderers swarming the stock, battling with swords between their teeth

image

wherein it’s visually clear that very heavy trading skewed blue over gold—i.e. $HLF closed lower than it opened on that day: the heavy trading volume was taking the price downward.

But more precisely what happened on those days? This is a good job for the stem-and-leaf plot. Notice, by the way, that reality here looks nothing like a bell curve. Sorry, pet peeve. Anyway here is the stem plot of heavy trading days:

> hi.volume <- subset(HLF, Vo(HLF)>1e7)
> stem(Cl(hi.volume)-Op(hi.volume))

  The decimal point is at the |

  -14 | 1
  -12 | 
  -10 | 
   -8 | 2
   -6 | 1554
   -4 | 542
   -2 | 430
   -0 | 988761851
    0 | 345667780388
    2 | 058699
    4 | 1
    6 | 5

I love stem plots because they give you precision and the general picture at once. From the way the ink lies you get the same pic as the kernel density plot:

plot( density( Cl(hi.volume) - Op(hi.volume) ), col="#333333", ylab="", yaxt="n",
      main="Volume at least 10 million $HLF",
      xlab="Price Movement over the Trading Day" )
polygon( density( Cl(hi.volume) - Op(hi.volume) ), col="#333333", border="#333333" )
image
but you can also see actual numbers in the stem plot. For example the ones to the right of +0 are pretty interesting. Slight gains on many of those pirate days, but not enough to bash back a 14-point loss on a single day.










I was re-reading Michael Murray’s explanation of cointegration:

and marvelling at the calculus.

Of course it’s not just any subtraction. It’s subtracting a function from a shifted version of itself. Still doesn’t sound like a universal revolution.

(But of course the observation that the lagged first-difference will be zero around an extremum (turning point), along with symbolic formulæ for (infinitesimal) first-differences of a function, made a decent splash.)

definition of derivative

Jeff Ryan wrote some R functions that make it easy to first-difference financial time series.

image

Here’s how to do the first differences of Goldman Sachs’ share price:

require(quantmod)
getSymbols("GS")
gs <- Ad(GS)
plot(  gs - lag(gs)  )

image

Look how much more structured the result is! Now all of the numbers are within a fairly narrow band. With length(gs) I found 1570 observations. Here are 1570 random normals (plot(rnorm(1570, sd=10), type="l")) for comparison:

image

Not perfectly similar, but very close!

Looking at the first differences compared to a Gaussian brings out what’s different between public equity markets and a random walk. What sticks out to me is the vol leaping up aperiodically in the $GS time series.

I think I got even a little closer by drawing the standard deviations from a Poisson distribution: plot(rnorm(1570, sd=rpois(1570, lambda=5)), type="l")

image
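If you want the three pictures in one window, something like this consolidates the snippets above (the layout is my own choice):

require(quantmod)
getSymbols("GS")
gs  <- Ad(GS)
dgs <- as.numeric(gs - lag(gs))    # first differences, as a plain vector
n   <- length(dgs)
par(mfrow=c(3, 1))
plot(dgs, type="l", main="first differences of $GS")
plot(rnorm(n, sd=10), type="l", main="i.i.d. Gaussian, sd = 10")
plot(rnorm(n, sd=rpois(n, lambda=5)), type="l", main="Gaussian with Poisson-drawn sd")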

but I’ll end there with the graphical futzing.

What’s really amazing to me is how much difference a subtraction makes.




The Cauchy distribution (?dcauchy in R) nails a flashlight over the number line

http://upload.wikimedia.org/wikipedia/commons/thumb/9/93/Number-line.svg/1000px-Number-line.svg.png

and swings it at a constant speed from 9 o’clock down to 6 o’clock over to 3 o’clock. (Or the other direction, from 3→6→9.) Then counts how much light shone on each number.

*

In other words we want to map evenly from the circle (minus the top point) onto the line. Two of the most basic, yet topologically distinct shapes related together.

image

You’ve probably heard of a mapping that does something close enough to this: it’s called tan.

http://www.calculatorsoup.com/images/trig_plots/graph_tan_pi.gif
Since tan is so familiar it’s implemented in Excel, which means you can simulate draws from a Cauchy distribution in a spreadsheet. Make a column of =RAND()'s (say column A) and then pipe them through tan. For example B1=TAN(A1). You could even do =TAN(RAND()) as your only column. That’s not quite it, though; you need to stretch and shift the [0,1] domain of =RAND() so it matches (−π/2, +π/2), the angles the flashlight sweeps through. So really the long formula (if you didn’t break it into separate columns) would be =TAN( PI() * (RAND()−.5) ). A stretch and a shift and you’ve matched the domains up. There’s your Cauchy draw.

In R one could draw three Cauchys with rcauchy(3) or with tan(pi*(runif(3) - .5)).
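A quick sanity check that the two recipes agree (a sketch; the sample size and axis limits are arbitrary):

set.seed(1)
by.hand  <- tan(pi * (runif(1e5) - .5))
built.in <- rcauchy(1e5)
qqplot(by.hand, built.in, xlim=c(-50, 50), ylim=c(-50, 50))   # quantiles should hug the diagonal
abline(0, 1)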

*

http://www.calculatorsoup.com/images/trig_plots/graph_tan_pi.gif

What’s happening at tan(−3π/2) and tan(π/2)? The tan function is putting out to ±∞.

I saw this in school and didn’t know what to make of it—I don’t think I had any further interest than finishing my problem set.

File:Hyperbola one over x.svg

I saw as well the ±∞ in the output of flip[x]= 1/x.

  • 1/−.0000...001 → −∞ whereas 1/.0000...0001 → +∞.

It’s not immediately clear in the flip[x] example but in tan[x/2] what’s definitely going on is that the angle is circling around the top of the circle (the hole in the top) and the flashlight of the Cauchy distribution could be pointing to the right or to the left at a parallel above the line.

Why not just call this ±∞ the same thing? “Projective infinity”, or, the hole in the top of the circle.

http://upload.wikimedia.org/wikipedia/commons/8/85/Stereographic_projection_in_3D.png




SETUP (CAN BE SKIPPED)

We start with data (how was it collected?) and the hope that we can compare them. We also start with a question which is of the form:

  • how much tax increase is associated with how much tax avoidance/tax evasion/country fleeing by the top 1%?
  • how much traffic does our website lose (gain) if we slow down (speed up) the load time?
  • how many of their soldiers do we kill for every soldier we lose?
  • how much do gun deaths [suicide | gang violence | rampaging multihomicide] decrease with 10,000 guns taken out of the population?
  • how much more fuel do you need to fly your commercial jet 1,000 metres higher in the sky?
  • how much famine [to whom] results when the price of low-protein wheat rises by $1?
  • how much vegetarian eating results when the price of beef rises by $5? (and again distributionally, does it change preferentially by people with a certain culture or personal history, such as they’ve learned vegetarian meals before or they grew up not affording meat?) How much does the price of beef rise when the price of feed-corn rises by $1?
  • how much extra effort at work will result in how much higher bonus?
  • how many more hours of training will result in how much faster marathon time (or in how much better heart health)?
  • how much does society lose when a scientist moves to the financial sector?
  • how much does having a modern financial system raise GDP growth? (here ∵ the X ~ branchy and multidimensional, we won’t be able to interpolate in Tufte’s preferred sense)
  • how many petatonnes of carbon per year does it take to raise the global temperature by how much?
  • how much does $1000 million spent funding basic science research yield us in 30 years?
  • how much will this MBA raise my annual income?
  • how much more money does a comparable White make than a comparable Black? (or a comparable Man than a comparable Woman?)
  • how much does a reduction in child mortality decrease fecundity? (if it actually does)

  • how much can I influence your behaviour by priming you prior to this psychological experiment?
  • how much higher/lower do Boys score than Girls on some assessment? (the answer is usually "low |β|, with low p" — in other words "not very different, but due to the high volume of data whatever we find is known with high statistical strength")

bearing in mind that this response-magnitude may differ under varying circumstances. (Raising morning-beauty-prep time from 1 minute to 10 minutes will do more than raising 110 minutes to 120 minutes of prep. Also there may be interaction terms like you need both a petroleum engineering degree and to live in one of {Naija, Indonesia, Alaska, Kazakhstan, Saudi Arabia, Oman, Qatar} in order to see the income bump. Also many of these questions have a time-factor, like the MBA and the climate ones.)

building up a nonlinear function from linear parts

As Trygve Haavelmo put it: using reason alone we can probably figure out which direction each of these responses will go. But knowing just that raising the tax rate will drive away some number of rich doesn’t push the debate very far—if all you lose is a handful of symbolic Eduardo Saverins who were already on the cusp of fleeing the country, then bringing up the Laffer curve is chaff. But if the number turns out to be large then it’s really worth discussing.

In less polite terms: until we quantify what we’re debating about, you can spit bollocks all day long. Once the debate is quantified then the discussion should become way more intelligent, less derailing to irrelevant theoretically-possible-issues-which-are-not-really-worth-wasting-time-on.

So we change one variable over which we have control and measure how the interesting thing responds. Once we measure both we come to the regression stage where we try to make a statement of the form “A 30% increase in effort will result in a 10% increase in wage” or “5 extra minutes getting ready in the morning will make me look 5% better”. (You should agree from those examples that the same number won’t necessarily hold throughout the whole range. Like if I spend three hours getting ready the returns will have diminished from the returns on the first five minutes.)

Correlation

Avoiding causal language, we say that a 10% increase in (your salary) is associated with a 30% increase in (your effort).

 
MAIN PART (SKIP TO HERE IF SKIMMING)

The two numbers that jump out of any regression table output (e.g., lm in R) are p and β.

  • β is the estimated size of the linear effect
  • p is how sure we are that the estimated size is exactly β. (As in golf, a low p is better: more confident, more sure. Low p can also be stated as a high t.)
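Concretely, in R these two numbers sit in the coefficient table of summary(lm(...)). A minimal sketch using a built-in dataset:

fit <- lm(mpg ~ wt, data=mtcars)     # miles-per-gallon regressed on car weight
summary(fit)                         # the full coefficient table
coef(summary(fit))[, "Estimate"]     # the betas
coef(summary(fit))[, "Pr(>|t|)"]     # the p-values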

Mindful that regression tables spit out many, many numbers (the Durbin–Watson statistic, the F statistic, the Akaike information criterion, and more) specifically to measure potential problems with interpreting β and p naïvely, here are pictures of the textbook situations where p and β can be interpreted in the straightforward way:

First, the standard cases, where the regression analysis works as it should and how to read it is fairly obvious:
(NB: these are continuous variables rather than on/off switches or ordered categories. So instead of “followed the weight-loss regimen” versus “didn’t follow the weight-loss regimen”, someone quantified how much it was followed. Again, actual measurements (and how they were coded) getting in the way of our gleeful playing with numbers.)

image
image
image

Second, the case I want to draw attention to: weak statistical significance (a high p) doesn’t necessarily mean nothing’s going on there.

image
image

The code I used to generate these fake-data and plots.

If the regression measures a high β but low confidence (high p), that is still worth taking a look at. If regression picks up wide dispersion in male-versus-female wages—let’s say double—but we’re not so confident (high p) that it’s exactly double because it’s sometimes 95%, sometimes 180%, sometimes 310%, we’ve still picked up a significant effect.

The exact value of β would not be statistically significant or confidently precise, due to the high p, but this would actually be a very significant finding. (Try the same with any of my other examples, or another quantitative-comparison scenario you think up. It’s either a serious opportunity, or a serious problem, that you’ve uncovered. It just needs further looking to see where the variation around “double” comes from.)
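A minimal fake-data sketch of that situation (my own numbers, not the data behind the plots above):

set.seed(42)
n <- 30
x <- rnorm(n)
y <- 2*x + rnorm(n, sd=10)   # a real effect (true beta = 2) buried under a lot of noise
summary(lm(y ~ x))           # the beta estimate wobbles and p is typically unimpressive,
                             # even though something real is going on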

You can read elsewhere about how awful it is that p<.05 is the password for publishable science, for many reasons that require some statistical vocabulary. But I think the most intuitive problem is the one I just stated. If your geiger counter flips out to ten times the deadly level of radiation, it doesn’t matter if it sometimes reads 8, sometimes 0, and sometimes 15—the point is, you need to be worried and get the h*** out of there. (Unless the machine is wacked—but you’d still be spooked, wouldn’t you?)

 
FOLLOW-UP (CAN BE SKIPPED)

The scale of β is the all-important thing that we are after. Small differences in βs of variables that are important to your life can make a huge difference.

  • Think about getting a 3% raise (1.03) versus a 1% wage cut (.99).
  • Think about twelve in every 1,000 births killing the mother versus four in every 1,000.
  • Think about being 5 minutes late for the meeting versus 5 minutes early.

image
linear maps as multiplication
linear mappings -- notice they're ALL straight lines through the origin!


Order-of-magnitude differences (like 20 versus 2) are the difference between fly and dog; between life in the USA and near-famine; between oil tanker and gas pump; between Tibet’s altitude and Illinois’; between driving and walking. Even the Black Death was only a tenth of an order of magnitude of reduction in human population.




Keeping in mind that calculus tells us nonlinear functions can be approximated in a local region by linear functions (unless the nonlinear function jumps), β is an acceptable measure of how the interesting thing responds around the current level of webspeed, or around the current level of taxation.



Linear response magnitudes can also be used to estimate global responses in a nonlinear function, but you will be quantifying something other than the local linear approximation.

Anscombe’s quartet: the four data sets are different, yet they have the same “line of best fit” as computed by ordinary least squares regression.
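Anscombe’s quartet ships with base R, so you can check this yourself (a quick sketch):

# the four regressions land on essentially the same line (intercept ≈ 3, slope ≈ 0.5)
sapply(1:4, function(i) coef(lm(anscombe[[paste0("y", i)]] ~ anscombe[[paste0("x", i)]])))

par(mfrow=c(2, 2))
for (i in 1:4) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  plot(x, y, main=paste("data set", i))
  abline(lm(y ~ x))
}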




Tibshirani’s original paper on the lasso.

  • Breiman’s Garotte — 1993
  • Tibshirani lasso paper submitted — 1994
  • Tibshirani lasso paper revised — 1995
  • Tibshirani lasso paper accepted — 1996

This is one of those papers that I’m so excited about, I feel like “You should just read the whole thing! It’s all good!” But I realise that’s less than reasonable.

Here is a bit of summary, feel free to request other information and I’ll do my best to adapt it.

The basic question is: I have some data and I want the computer to generate (regress) a linear model of it for me. What procedure should I tell the computer to do to get a good | better | best model?

The first technique, by Abraham de Moivre, applied fruitfully by Gauss in the early 1800s (so, ok, no computers then — but nowadays we just run lm in R), was to minimise the sum of squared errors (Euclidean distance)
image
of a given affine model of the data. (Affine meaning linear plus one more parameter for a variable origin, to account for the average level of the data apart from the observed parameters. For example, to model incomes in the USA when the only observed parameters are age, race, and zip code, you would want to include the average baseline US income level; mathematically that is accomplished by shifting the origin, a.k.a. the α (or “autonomous” or “vector of ones”) regression-model parameter, a.k.a. the affine addition to an otherwise linear model.)

It was noticed by several someones at various points in time that whilst de Moivre’s least-squares (OLS) method is provably (calculus of variations) the optimal linear model given well-behaved data, real data does not always behave.

http://bookcoverarchive.com/images/books/wellbehaved_women_seldom_make_history.large.jpg

In the presence of correlation, missing data, wrong data, and other problems, the “optimal” OLS solution is overfit, meaning that the model it makes for you picks up on too many of the problems. Is there a way to pick up on more signal and less noise? More gold and less dross? More of the real stuff and fewer of the impurities?

I can think of two ways people have tried to scrape off the corrosion without also flaying too much of the underlying good material:

  1. Assume simpler models are better. This is the approach taken by ridge regression (a.k.a. Tikhonov regularisation a.k.a. penalisation), the lasso, and the garotte
    ridge regression with tuning parameter highlighted
  2. Compare ensembles of models, then choose one in the “middle”. Robust methods, for example, use statistical functions that, in theory, vary less from flawed situation to flawed situation than other statistical functions do. Subset selection, hierarchical methods, and … generate a lot of models on the real data and …
 

That’s the backstory. Now on to what Tibshirani actually says. His original lasso paper contrasts 3 ways of penalising complicated models, plus regression on subsets.

The three formulae:

  • penalisation & restriction to subsets
    Ridge regression, with λ shown as the tuning parameter on the penalty
    Tibshirani 3
    definition of β̂_j^o, the OLS coefficient magnitudes
  • garotte
    Leo Breiman garotte
  • lasso
    Tibshirani lasso def

look superficially quite similar. Tibshirani discusses the pros, cons, whens, and wherefores of the different approaches.
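For reference, here is roughly what the three criteria look like, written out from their standard statements (a reconstruction, not a transcription of the images above):

ridge:    \hat\beta = \arg\min_\beta \sum_i \big(y_i - \sum_j x_{ij}\beta_j\big)^2 + \lambda \sum_j \beta_j^2

garotte:  \min_c \sum_i \big(y_i - \sum_j c_j \hat\beta_j^{o} x_{ij}\big)^2  \quad\text{subject to}\quad c_j \ge 0,\ \sum_j c_j \le t

lasso:    \min_\beta \sum_i \big(y_i - \sum_j x_{ij}\beta_j\big)^2  \quad\text{subject to}\quad \sum_j |\beta_j| \le t

where \hat\beta_j^{o} are the ordinary-least-squares coefficients and λ, t are the knobs that tune how hard complexity is penalised.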

Tibshirani lasso 1

 

(In reading this paper I learned a new symbol: the positive-part operator ƒ(x) = (x)⁺. It means (x)⁺ = max(x, 0): the value itself when positive, zero otherwise. In R code, ifelse(x < 0, 0, x), or equivalently pmax(x, 0). Like absolute value, but not exactly.)

 

Back to the lasso. How does such a small change to the penalty function change the estimated linear model we get as output?
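One way to see the answer for yourself nowadays (a sketch; glmnet is my choice of tool, not the paper’s, and the set-up only loosely echoes the paper’s simulated examples):

# install.packages("glmnet")
library(glmnet)
set.seed(1)
x <- matrix(rnorm(100 * 8), 100, 8)
y <- drop(x %*% c(3, 1.5, 0, 0, 2, 0, 0, 0)) + rnorm(100)   # a few real effects, several zero ones
lasso.fit <- glmnet(x, y, alpha=1)    # alpha = 1: lasso penalty
ridge.fit <- glmnet(x, y, alpha=0)    # alpha = 0: ridge penalty
plot(lasso.fit, xvar="lambda")        # lasso paths: coefficients get pinned to exactly zero
plot(ridge.fit, xvar="lambda")        # ridge paths: coefficients shrink but stay nonzero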

http://i.imgur.com/VJiAh.png
http://i.imgur.com/zeBrm.png

http://i.imgur.com/tqOim.png
http://i.imgur.com/zeBrm.png