What follows is like a kata to strengthen your R fundamentals.

The lovely stats in the wild recently posted some hott data analysis of Olympians’ ages and sexes. Because I’m annoyingly picky about graphics, I asked for his code so I could tweak the graphics according to my own perfidious norms. Stats in the wild posted his scrapeR of sports-reference.com — I’m sure you can find some more interesting uses for it — and asked for (polite) suggestions for improvement.

One spot in stats in the wild's code that could be improved also answers two questions R learners run into more generally, so I’m sharing the code block.

alphabet <- c("a","b","c","d","e","f","g","h","i","j","k","l","m","n","o","p","q","r","s","t","u","v","w","x","y","z")

for (i.one in 1:26) {
  for (i.two in 1:26) {
    # one two-letter pair per iteration; note this assignment masks R's built-in `letters`
    letters <- paste(alphabet[i.one], alphabet[i.two], sep="")
  }
}

The goal is to get every pair of letters, like

  [1] aa ba ca da ea fa ga ha ia ja ka la ma na oa
 [16] pa qa ra sa ta ua va wa xa ya za ab bb cb db
 [31] eb fb gb hb ib jb kb lb mb nb ob pb qb rb sb
 [46] tb ub vb wb xb yb zb ac bc cc dc ec fc gc hc
 [61] ic jc kc lc mc nc oc pc qc rc sc tc uc vc wc
 [76] xc yc zc ad bd cd dd ed fd gd hd id jd kd ld
 [91] md nd od pd qd rd sd td ud vd wd xd yd zd ae
[106] be ce de ee fe ge he ie je ke le me ne oe pe
[121] qe re se te ue ve we xe ye ze af bf cf df ef
[136] ff gf hf if jf kf lf mf nf of pf qf rf sf tf
[151] uf vf wf xf yf zf ag bg cg dg eg fg gg hg ig
[166] jg kg lg mg ng og pg qg rg sg tg ug vg wg xg
[181] yg zg ah bh ch dh eh fh gh hh ih jh kh lh mh
[196] nh oh ph qh rh sh th uh vh wh xh yh zh ai bi
[211] ci di ei fi gi hi ii ji ki li mi ni oi pi qi
[226] ri si ti ui vi wi xi yi zi aj bj cj dj ej fj
[241] gj hj ij jj kj lj mj nj oj pj qj rj sj tj uj
[256] vj wj xj yj zj ak bk ck dk ek fk gk hk ik jk
[271] kk lk mk nk ok pk qk rk sk tk uk vk wk xk yk
[286] zk al bl cl dl el fl gl hl il jl kl ll ml nl
[301] ol pl ql rl sl tl ul vl wl xl yl zl am bm cm
[316] dm em fm gm hm im jm km lm mm nm om pm qm rm
[331] sm tm um vm wm xm ym zm an bn cn dn en fn gn
[346] hn in jn kn ln mn nn on pn qn rn sn tn un vn
[361] wn xn yn zn ao bo co do eo fo go ho io jo ko
[376] lo mo no oo po qo ro so to uo vo wo xo yo zo
[391] ap bp cp dp ep fp gp hp ip jp kp lp mp np op
[406] pp qp rp sp tp up vp wp xp yp zp aq bq cq dq
[421] eq fq gq hq iq jq kq lq mq nq oq pq qq rq sq
[436] tq uq vq wq xq yq zq ar br cr dr er fr gr hr
[451] ir jr kr lr mr nr or pr qr rr sr tr ur vr wr
[466] xr yr zr as bs cs ds es fs gs hs is js ks ls
[481] ms ns os ps qs rs ss ts us vs ws xs ys zs at
[496] bt ct dt et ft gt ht it jt kt lt mt nt ot pt
[511] qt rt st tt ut vt wt xt yt zt au bu cu du eu
[526] fu gu hu iu ju ku lu mu nu ou pu qu ru su tu
[541] uu vu wu xu yu zu av bv cv dv ev fv gv hv iv
[556] jv kv lv mv nv ov pv qv rv sv tv uv vv wv xv
[571] yv zv aw bw cw dw ew fw gw hw iw jw kw lw mw
[586] nw ow pw qw rw sw tw uw vw ww xw yw zw ax bx
[601] cx dx ex fx gx hx ix jx kx lx mx nx ox px qx
[616] rx sx tx ux vx wx xx yx zx ay by cy dy ey fy
[631] gy hy iy jy ky ly my ny oy py qy ry sy ty uy
[646] vy wy xy yy zy az bz cz dz ez fz gz hz iz jz
[661] kz lz mz nz oz pz qz rz sz tz uz vz wz xz yz

which seems like a simple request. But how to do this idiomatically in R?

Quite often you want to do something like for (i in 1:222) { for (j in 1:333) { for (k in 1:444) { stuff }}}, so a general idiom for it is worth knowing.

Also nice to know that R has already provided access to “the 13th letter in the alphabet” with letters[13], so it’s unnecessary to redefine alphabet every time. (yay!)
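A quick check of those built-in constants, as a minimal sketch. The variable names here (m, abc, n, builtin) are just mine for illustration. The base::letters form is worth knowing in case an earlier script has assigned something else to letters.

```r
# Built-in constants in base R; no need to define your own alphabet.
m   <- letters[13]      # the 13th lowercase letter, "m"
abc <- LETTERS[1:3]     # the uppercase counterpart: "A" "B" "C"
n   <- length(letters)  # 26

# If some earlier code assigned to `letters`, the built-in is still
# reachable through its namespace.
builtin <- base::letters
```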

As used in maths, the inner product of two [tensors | matrices | vectors] shrinks the output, and the outer product enlarges the output. In this case, “outer product” cycles through for (i in 1:26) { for (j in 1:26) { fill up the matrix with each entry [i,j] } } and does so idiomatically, that is, with vectorised loops. (Which is the goal in R, J, and other vectorised languages.)
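To make “shrinks” versus “enlarges” concrete, here is a small numeric sketch. The vectors x and y are toy inputs I made up. The nested loop at the end is only there to show what outer is doing for you.

```r
x <- 1:3
y <- 1:4

# Outer product: every x[i] combined with every y[j], so the result
# grows into a 3 x 4 matrix.
o <- outer(x, y)
dim(o)

# Inner product: two length-3 vectors collapse to one number
# (R returns it as a 1 x 1 matrix).
ip <- x %*% x
dim(ip)

# The same matrix as `o`, filled in with explicit nested loops.
m <- matrix(0, nrow = length(x), ncol = length(y))
for (i in seq_along(x)) {
  for (j in seq_along(y)) {
    m[i, j] <- x[i] * y[j]
  }
}
all(o == m)
```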

Here’s my answer, and I’d like to hear your comments or better/also-good solutions.

c( outer( letters, letters,FUN=paste ,sep=""))

Broken down:

  • letters = the built-in vector of the 26 lowercase letters. letters[1:26] is the same thing spelled out.
  • outer = outer product of two arrays. Try outer( 2:7, 3:5 ) at the R prompt and then try outer( 1:26, 1:26, FUN=paste ). (In maths, outer contrasts with the elementwise product, which is 2:7 * 3:8 in R, and with inner-producting, which is the dot product: the sum of the pairwise products of terms, a special case of matrix multiplication and closely related to projection, = 2:7 %*% 3:8 in R.)
  • FUN=paste, sep="" The grand theory behind this is much more complicated than what it does. paste concatenates two strings, with a default separator of space sep=" ".

    The gnarly theory reason: FUN is an argument to outer, which defaults to multiplication (you see this in outer( 1:26, 1:26 ) ) but can be set to concatenation since we’re working with characters rather than numbers. Then how do you pass sep="" to paste? Writing FUN=paste( sep="") doesn't work, because that calls paste on the spot and hands outer its result rather than a function. You could do an ugly workaround with FUN=function(x, y) paste(x, y, sep="") … but the makers of R foresaw that you would often want to do things like this, so after FUN, outer accepts ... (literally three dots): any extra arguments you list there, separated by commas, are passed along to FUN. So you can write sep="" within outer, without having to make a function(x, y) specifically to pass to FUN.

    Wow, that was not fun
  • c = the natural output is 2-dimensional (a 26×26 matrix) and c flattens that into one single vector.
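The breakdown above can be walked through step by step at the prompt. This is a sketch of the same one-liner pulled apart. The intermediate names (m, m2, letter_pairs) are mine, not part of the original answer.

```r
# Step 1: outer() with FUN = paste gives a 26 x 26 character matrix.
# sep = "" is not an argument of outer() itself; it travels through
# `...` to paste.
m <- outer(letters, letters, FUN = paste, sep = "")
dim(m)

# The long-hand equivalent: an explicit two-argument function.
m2 <- outer(letters, letters, FUN = function(x, y) paste(x, y, sep = ""))
identical(m, m2)

# Step 2: c() drops the dimensions, leaving one vector read off
# column by column.
letter_pairs <- c(m)
length(letter_pairs)    # 26 * 26 = 676
head(letter_pairs, 3)   # "aa" "ba" "ca"
```

The column-by-column flattening is why the output starts aa ba ca rather than aa ab ac.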

Another way to do it is:

sapply( letters, FUN=function(x) paste(x, letters, sep="") )

which I think is uglier … perhaps because it uses letters twice or perhaps because I think outer-producting is what I’m really doing. (It also returns a 26×26 matrix rather than a vector, so it still needs flattening.)
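For anyone comparing the two at the prompt, here is a sketch of how the sapply result relates to the outer one. The names s, o and v are mine. As far as I can tell the two matrices hold the same pairs but are transposes of each other, so the flattened order differs.

```r
s <- sapply(letters, FUN = function(x) paste(x, letters, sep = ""))
dim(s)          # a 26 x 26 matrix, with columns named after the letters
s[1:3, "a"]     # first letter is fixed within a column: "aa" "ab" "ac"

o <- outer(letters, letters, FUN = paste, sep = "")

# Same set of pairs; sapply's matrix is the transpose of outer's.
all(unname(s) == t(o))

# Flatten to a plain vector (the column names are not carried along).
v <- as.vector(s)
setequal(v, c(o))
```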

Thoughts? Can it be done even more idiomatically or naturally?

UPDATE: gappy3000 says expand.grid() scales better than outer().
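Here is a rough sketch of what the expand.grid() route might look like. I have not benchmarked the scaling claim myself. The column names first, second, i, j and k are labels I picked for illustration.

```r
# One row per combination; the first column varies fastest.
g <- expand.grid(first = letters, second = letters, stringsAsFactors = FALSE)
nrow(g)   # 26 * 26 = 676

pairs_eg <- paste(g$first, g$second, sep = "")

# Same vector, in the same order, as the outer() answer.
identical(pairs_eg, c(outer(letters, letters, FUN = paste, sep = "")))

# It also extends to three or more "nested loops" at once,
# which outer() does not do directly.
g3 <- expand.grid(i = 1:2, j = 1:3, k = 1:4)
nrow(g3)  # 2 * 3 * 4 = 24
```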