james_reinders
Software Programmer

Raise Your Hand and Ask: What’s R?

opinion
Jul 7, 2017 · 8 mins


Note: Most people don’t want to be the uncool one to raise their hand and ask a question, but in many cases we really should. These occasional “Raise Your Hand and Ask” posts highlight cool buzzwords you may have heard. My aim isn’t just to explain what they mean (that you can look up), but also why they matter.

You’ve probably heard of R, but have you ever seen R code or tried it out? If not, here’s a quick read to solve that! At a minimum, you’ll get to see some R code. If you follow along, you’ll have R installed on your system and will have run a few R commands.

Wikipedia describes R as a programming language and software environment for statistical computing and graphics. I think it’s popular because it combines a “conversational” (interactive) type-and-do-it interface with powerful built-in capabilities to do complex statistics (often in a single line) and complex graphical plotting (likewise).

Example of R plotting data

One thing you learn quickly about R is that many people use it to create quick plots of data, both inputs and results. Many of the graphing features are very easy to learn, and we’ll look at “barplot” to illustrate. To do this, we simply start R and type a single barplot command, as shown here:

 % R

 R version 3.3.2 (2016-10-31) — “Sincere Pumpkin Patch”

Copyright (C) 2016 The R Foundation for Statistical Computing

Platform: x86_64-apple-darwin11.4.2 (64-bit)

(and a bunch more lines telling us that R has started)

 > barplot(table(sample(1:3, size=10000, replace=TRUE, prob=c(.150,.650,.200))))

One thing you’ll see with examples in R is that many simple examples avoid using real data by simply randomly creating input data.  The sample function does exactly that – it is used to create 10,000 randomly selected numbers (1, 2, or 3) with probabilities of 15%, 65% and 20%, respectively. The table function builds a summary table containing the counts for the occurrences.
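As a side note (not part of the original session above): sample draws different numbers on every run, so the bar heights change each time. If you want a repeatable plot, you can fix R’s random seed first; “counts” is just a name chosen for this sketch:

```r
# Fix the seed so the same "random" draws come out every run.
set.seed(42)
counts <- table(sample(1:3, size = 10000, replace = TRUE, prob = c(.150, .650, .200)))
counts          # shows how many 1s, 2s, and 3s were drawn
barplot(counts) # same bar plot, now reproducible
```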

[Pop-up window from barplot (first example)]

For fun – try these variations on your own:

> foo=table(sample(1:3, size=10000, replace=TRUE, prob=c(.150,.650,.200)))

> barplot(foo)

> barplot(horiz=TRUE,foo)

> barplot(col=c("blue"),foo)

> barplot(names.arg=c("A","B","EEE"),foo)

> barplot(horiz=TRUE,col=c("lightblue"),names.arg=c("A","B","EIEIO"),foo)

 

[Pop-up window from barplot (final “try these variations”)]

R offers many types of graphing functions: histograms, dot plots, bar plots, line charts, pie charts, boxplots, scatterplots (you can learn more in the online “Quick-R” resources).
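To give a taste of a few of those other plot types, here is a sketch with made-up data (the names and values are arbitrary, chosen only for illustration):

```r
vals <- rnorm(1000)      # 1,000 draws from a standard normal distribution
hist(vals)               # histogram of the draws
boxplot(vals)            # box plot of the same data
plot(vals, rnorm(1000))  # scatterplot of two random variables
```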

For instance, try this in the same R session (pie3D comes from the plotrix package, which we install in the setup section later in this article):

> library(plotrix)

> pie3D(labels=c("A","B","EIEIO"),foo)

[Pop-up window from pie3D]

Example of R doing a statistical problem

R has a rich collection of functions to help with statistical processing needs. Many functions are supported for machine learning, and using them is generally very simple.

Clustering is a key capability used in machine learning. Here is a simple example that creates a two-dimensional matrix of random numbers (purposely constructed as two natural clusters by merging two random distributions with different means) and then uses ‘kmeans’ to compute two clusters:

 mymat <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2), matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))

mycl <- kmeans(mymat, 2, 20)

Now try each of these commands one at a time. The first plots the points, the next command adds the center points for both clusters, and the last two commands add line segments from each point to the center of its cluster.  This is a cute example with an easy split. You can try changing the “sd” (standard deviation) parameters, above, from 0.3 to a much larger value. The quality of the categorization will obviously degrade.  NOTE: I stopped showing the “>” at the front of each R command to make it easy to cut-and-paste these sequences from this article to try out on your own system.  You will see the “>” prompts on your screen.

plot(mymat, col = mycl$cluster, pch=3, lwd=6)

points(mycl$centers, col = 1:2, pch = 7, lwd=6)

segments(mymat[mycl$cluster==1,][,1], mymat[mycl$cluster==1,][,2], mycl$centers[1,1], mycl$centers[1,2])

segments(mymat[mycl$cluster==2,][,1], mymat[mycl$cluster==2,][,2], mycl$centers[2,1], mycl$centers[2,2], col=2)
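The kmeans result used above holds more than just the cluster assignments; it also carries some built-in quality measures you can inspect. These fields are part of the standard return value of R’s kmeans function (this is a side note, not part of the original session):

```r
# Quality measures carried in the kmeans result object:
mycl$size          # how many points landed in each cluster
mycl$withinss      # within-cluster sum of squares, one value per cluster
mycl$tot.withinss  # total within-cluster sum of squares (lower = tighter clusters)
```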

[Graph of clustering with kmeans]

For classifications that also report on their own quality, we can use “clusplot,” which requires the library “cluster.” The code below defines how many points to create (fewer are easier to read on the plot) and the standard deviation. I show graphs for standard deviation values of 35, 40, and 50 to show how classification becomes less distinct when datasets overlap, as they do in the real world. Your graphs will differ from what I show because your random numbers will differ; in fact, the graphs change each time we run this sequence of commands.

library(cluster)

many <- 20

stddiv <- 35

or <- matrix(c(rnorm(many+many, sd = stddiv, mean=100), rep(1859,many)), ncol = 3)

rownames(or) <- rownames(or, do.NULL = FALSE, prefix = "OR" )

ca <- matrix(c(rnorm(many+many, sd = stddiv, mean = 200), rep(1850,many)), ncol = 3)

rownames(ca) <- rownames(ca, do.NULL = FALSE, prefix = "CA" )

st <- rbind(or,ca)

clusplot(st,pam(st,2,cluster.only=TRUE),label=2)

 

[Graph of clustering with clusplot]

Functions? Yes!

If we define a function, we can type “myfunc(20,35)” to do the same thing:

 myfunc <- function(many,stddiv) {

   # same code as before…

   or <- matrix(c(rnorm(many+many, sd = stddiv, mean=100), rep(1859,many)), ncol = 3) 

   rownames(or) <- rownames(or, do.NULL = FALSE, prefix = "OR" )

   ca <- matrix(c(rnorm(many+many, sd = stddiv, mean = 200), rep(1850,many)), ncol = 3)

   rownames(ca) <- rownames(ca, do.NULL = FALSE, prefix = "CA" )

   st <- rbind(or,ca)

   clusplot(st,pam(st,2,cluster.only=TRUE),label=2)

}
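With the function defined, reproducing the graphs for the standard deviation values mentioned earlier (35, 40, and 50) takes one call each:

```r
myfunc(20, 35)  # fairly distinct clusters
myfunc(20, 40)
myfunc(20, 50)  # more overlap, so the classification is less distinct
```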

 

How to install R and try this yourself

My earlier post (“How Does a 20X Speed-Up in Python Grab You?”) included step-by-step instructions for using Anaconda to install Python, with optional acceleration from Intel (the latter is not needed for this blog). With Python installed via conda as described there, you can install the R essentials as follows:

% conda install -c r r-essentials

% conda install -c bioconda r-plotrix

The second command adds the 3D pie chart capability used in my graphing examples. The install itself gives a quick impression of how much is built into R: the R Essentials bundle includes more than 80 of the most popular R packages for data science.
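A quick sanity check that the install worked (a minimal sketch; the numbers and labels here are arbitrary):

```r
library(plotrix)                              # the package added by the second command
pie3D(c(1, 2, 3), labels = c("A", "B", "C"))  # should pop up a small 3D pie chart
```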

Isn’t R slow? No, but here’s how to make it fast

R is so popular with data scientists, including in the super-hot field of machine learning, that making it run fast has not escaped attention. The net result is that R can be used effectively with reasonable performance. Some even claim it is ready for HPC usage (see the pbdR project). It remains an interesting balance between ease of use and performance: you can get higher performance by shifting to other programming methods such as C++ or Fortran, but neither offers the high-level power and ease of use of R. (A good overview is HPC with R: The Basics by Drew Schmidt.)

Summary

At a minimum, I’ve shared some insights into R that include actual R code. I hope you’ll install it on your own machine now – and include it in your “bag of tricks” when you have statistical problems to tackle and graphs to plot.




James Reinders is a software programmer with a passion for Parallel Programming and Parallel Computer Architecture. He has contributed to the development of some of the world’s fastest computers, and the software tools that make that performance accessible for programmers. James has shared this passion in classes, webinars, articles and has authored eight books for software developers. James enjoyed 10,001 days working at Intel, and now continues to share his passion to help others “Think Parallel.”
