james_reinders
Software Programmer

Raise Your Hand and Ask: What’s R?

opinion
Jul 7, 2017 · 8 mins


Note: Most people don’t want to be the uncool one to raise their hand and ask a question, but in many cases we really should. These occasional “Raise Your Hand and Ask” posts highlight cool buzzwords you may have heard. My aim isn’t just to explain what they mean (that you can look up), but also why they matter.

You’ve probably heard of R, but have you ever seen R code or tried it out? If not, here’s a quick read to solve that! At a minimum, you’ll get to see some R code. If you follow along, you’ll have R installed on your system and will have run a few R commands.

Wikipedia describes R as a programming language and software environment for statistical computing and graphics. I think it’s popular because it combines a “conversational” (interactive) type-and-do-it interface with powerful built-in capabilities to do complex statistics (often in a single line) and complex graphical plotting (likewise).

Example of R plotting data

One thing you learn quickly about R is that many people use it to create quick plots of data, both inputs and results. Many of the graphing features are very easy to learn, and we’ll look at “barplot” to illustrate. To do this, we simply start R and type a single barplot command, as shown here:

 % R

 R version 3.3.2 (2016-10-31) — “Sincere Pumpkin Patch”

Copyright (C) 2016 The R Foundation for Statistical Computing

Platform: x86_64-apple-darwin11.4.2 (64-bit)

(and a bunch more lines telling us that R has started)

 > barplot(table(sample(1:3, size=10000, replace=TRUE, prob=c(.150,.650,.200))))

One thing you’ll see with examples in R is that many simple examples avoid using real data by simply randomly creating input data.  The sample function does exactly that – it is used to create 10,000 randomly selected numbers (1, 2, or 3) with probabilities of 15%, 65% and 20%, respectively. The table function builds a summary table containing the counts for the occurrences.
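As a side note (not part of the original session above): sample draws different numbers on every run, so the bar heights change each time. If you want a repeatable plot, you can fix R’s random seed first; “counts” is just a name chosen for this sketch:

```r
# Fix the seed so the same "random" draws come out every run.
set.seed(42)
counts <- table(sample(1:3, size = 10000, replace = TRUE, prob = c(.150, .650, .200)))
counts          # shows how many 1s, 2s, and 3s were drawn
barplot(counts) # same bar plot, now reproducible
```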

[Pop-up window from barplot (first example)]

For fun – try these variations on your own:

> foo=table(sample(1:3, size=10000, replace=TRUE, prob=c(.150,.650,.200)))

> barplot(foo)

> barplot(horiz=TRUE,foo)

> barplot(col=c("blue"),foo)

> barplot(names.arg=c("A","B","EEE"),foo)

> barplot(horiz=TRUE,col=c("lightblue"),names.arg=c("A","B","EIEIO"),foo)

 

[Pop-up window from barplot (final “try these variations”)]

R offers many types of graphing functions: histograms, dot plots, bar plots, line charts, pie charts, boxplots, scatterplots (you can learn more in the online “Quick-R” resources).
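To give a taste of a few of those other plot types, here is a sketch with made-up data (the names and values are arbitrary, chosen only for illustration):

```r
vals <- rnorm(1000)      # 1,000 draws from a standard normal distribution
hist(vals)               # histogram of the draws
boxplot(vals)            # box plot of the same data
plot(vals, rnorm(1000))  # scatterplot of two random variables
```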

For instance, try this in the same R session (pie3D comes from the plotrix package, which we install in the setup section later in this article):

> library(plotrix)

> pie3D(labels=c("A","B","EIEIO"),foo)

[Pop-up window from pie3D]

Example of R doing a statistical problem

R has a rich collection of functions to help with statistical processing needs. Many functions are supported for machine learning, and using them is generally very simple.

Clustering is a key capability used in machine learning. Here is a simple example that creates a two-dimensional matrix of random numbers (purposely constructed as two natural clusters by merging two random distributions with different means) and then uses ‘kmeans’ to compute two clusters:

 mymat <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2), matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))

mycl <- kmeans(mymat, 2, 20)

Now try each of these commands one at a time. The first plots the points, the next command adds the center points for both clusters, and the last two commands add line segments from each point to the center of its cluster.  This is a cute example with an easy split. You can try changing the “sd” (standard deviation) parameters, above, from 0.3 to a much larger value. The quality of the categorization will obviously degrade.  NOTE: I stopped showing the “>” at the front of each R command to make it easy to cut-and-paste these sequences from this article to try out on your own system.  You will see the “>” prompts on your screen.

plot(mymat, col = mycl$cluster, pch=3, lwd=6)

points(mycl$centers, col = 1:2, pch = 7, lwd=6)

segments(mymat[mycl$cluster==1,][,1], mymat[mycl$cluster==1,][,2], mycl$centers[1,1], mycl$centers[1,2])

segments(mymat[mycl$cluster==2,][,1], mymat[mycl$cluster==2,][,2], mycl$centers[2,1], mycl$centers[2,2], col=2)
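The kmeans result used above holds more than just the cluster assignments; it also carries some built-in quality measures you can inspect. These fields are part of the standard return value of R’s kmeans function (this is a side note, not part of the original session):

```r
# Quality measures carried in the kmeans result object:
mycl$size          # how many points landed in each cluster
mycl$withinss      # within-cluster sum of squares, one value per cluster
mycl$tot.withinss  # total within-cluster sum of squares (lower = tighter clusters)
```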

[Graph of clustering with kmeans]

For classifications that also report on their own quality, we can use “clusplot,” which requires the library “cluster.” The code below defines how many points to create (fewer are easier to read on the plot) and the standard deviation. I show graphs for standard deviation values of 35, 40, and 50 to show how classification becomes less distinct when datasets overlap, as they do in the real world. Your graphs will differ from what I show because your random numbers will differ; in fact, the graphs change each time we run this sequence of commands.

library(cluster)

many <- 20

stddiv <- 35

or <- matrix(c(rnorm(many+many, sd = stddiv, mean=100), rep(1859,many)), ncol = 3)

rownames(or) <- rownames(or, do.NULL = FALSE, prefix = "OR" )

ca <- matrix(c(rnorm(many+many, sd = stddiv, mean = 200), rep(1850,many)), ncol = 3)

rownames(ca) <- rownames(ca, do.NULL = FALSE, prefix = "CA" )

st <- rbind(or,ca)

clusplot(st,pam(st,2,cluster.only=TRUE),label=2)

 

[Graph of clustering with clusplot]

Functions? Yes!

If we define a function, we can type “myfunc(20,35)” to do the same thing:

 myfunc <- function(many,stddiv) {

   # same code as before…

   or <- matrix(c(rnorm(many+many, sd = stddiv, mean=100), rep(1859,many)), ncol = 3) 

   rownames(or) <- rownames(or, do.NULL = FALSE, prefix = "OR" )

   ca <- matrix(c(rnorm(many+many, sd = stddiv, mean = 200), rep(1850,many)), ncol = 3)

   rownames(ca) <- rownames(ca, do.NULL = FALSE, prefix = "CA" )

   st <- rbind(or,ca)

   clusplot(st,pam(st,2,cluster.only=TRUE),label=2)

}
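With the function defined, reproducing the graphs for the standard deviation values mentioned earlier (35, 40, and 50) takes one call each:

```r
myfunc(20, 35)  # fairly distinct clusters
myfunc(20, 40)
myfunc(20, 50)  # more overlap, so the classification is less distinct
```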

 

How to install R and try this yourself

My earlier post (“How Does a 20X Speed-Up in Python Grab You?”) included step-by-step instructions for using Anaconda to install Python, with optional acceleration from Intel (the latter is not needed for this blog). With Python installed via conda as described there, you can install the R essentials as follows:

% conda install -c r r-essentials

% conda install -c bioconda r-plotrix

The second command adds the 3D pie chart capability used in my graphing examples. The install itself gives a quick impression of how much is built into R: the R Essentials bundle includes more than 80 of the most popular R packages for data science.
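A quick sanity check that the install worked (a minimal sketch; the numbers and labels here are arbitrary):

```r
library(plotrix)                              # the package added by the second command
pie3D(c(1, 2, 3), labels = c("A", "B", "C"))  # should pop up a small 3D pie chart
```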

Isn’t R slow? No, but here’s how to make it fast

R is so popular with data scientists, including in the super-hot field of machine learning, that making it run fast has not escaped attention. The net result is that R can be used effectively with reasonable performance. Some even claim it is ready for HPC usage (see the pbdR project). It remains an interesting balance between ease of use and performance: you can get higher performance by shifting to other programming methods such as C++ or Fortran, but neither offers the high-level power and ease of use of R. (A good overview is HPC with R: The Basics by Drew Schmidt.)

Summary

At a minimum, I’ve shared some insights into R that include actual R code. I hope you’ll install it on your own machine now – and include it in your “bag of tricks” when you have statistical problems to tackle and graphs to plot.




James Reinders is a software programmer with a passion for Parallel Programming and Parallel Computer Architecture. He has contributed to the development of some of the world’s fastest computers, and the software tools that make that performance accessible for programmers. James has shared this passion in classes, webinars, articles and has authored eight books for software developers. James enjoyed 10,001 days working at Intel, and now continues to share his passion to help others “Think Parallel.”
