Introduction to Data Science with R, Jupyter and ggplot2

Tongji University, College of Design and Innovation, April 2017

David Ing | Aalto University and the International Society for the Systems Sciences | Toronto, Canada

Disclaimer: David Ing is not (yet) an expert in R, Jupyter and ggplot2. He was fluent as an econometrician (1985-1987) for IBM, applying a similar package, GRAFSTAT, see "An APL system for interactive scientific-engineering graphics and data analysis" | G. J. Burkland, P. Heidelberger, P. D. Welch, L. S.Y. Wu, Martin Schatzoff | June 1984 | APL '84: Proceedings of the International Conference on APL at http://dx.doi.org/10.1145/384283.801082

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Copies of this notebook can be found at http://coevolving.com/tongji/, in a variety of formats.

Workshop Scope

  • NOT to teach you everything about data science!
  • You should know about ways to represent data to support your research findings.
  • You will see some good tools for qualitative methods (and maybe even quantitative methods).
  • We can work together to get you started on tools that will suit your needs.

One way: (i) ggplot2, through (ii) Jupyter, on (iii) R.

ggplot2, based on the Grammar of Graphics (originally Leland Wilkinson, 1995-2005)

Jupyter notebook for reproducible research, originally for Julia, Python and R

R Project for Statistical Computing https://www.r-project.org/

Advantages of this three-part package:

  • Scalable to large datasets.
  • Open source software (free as in liberty, free as in beer).
  • Can be hosted on personal workstation (Windows, MacOS, Linux) or via a browser to a server (e.g. Big Data University)
  • Notebook in .ipynb format is exportable to .html and .md (and .pdf if a LaTeX plugin is installed).
  • Popular amongst data scientists across disciplines (e.g. The Institute for Quantitative Social Science, Harvard University, at http://tutorials.iq.harvard.edu/R/Rgraphics/Rgraphics.html).

Disadvantages:

  • Architected as browser + server, so installation requires some technical expertise.

An Orientation Demonstration

R operates inside an active workspace; see "R Programming/Manage your workspace" at https://en.wikibooks.org/wiki/R_Programming/Manage_your_workspace.

Within a Jupyter notebook:

  • To run a cell: Ctrl + Enter
  • To run a cell and go to the next cell: Shift + Enter

Let's start by clearing everything in the workspace.

  • If we invoke help on the command, we see that remove and rm can be used to remove objects.
In [ ]:
help(rm)
In [ ]:
rm(list = ls())

Let's check if the workspace is really empty.

In [ ]:
ls()

To ensure that there's something in the workspace, let's enter the date.

In [ ]:
myDate <- Sys.Date()
myDate
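
The value returned by Sys.Date() is a Date object rather than plain text, which is why it can be reformatted and used in arithmetic. A minimal sketch, where nextWeek is a name introduced only for this illustration:

```r
# Sys.Date() returns an object of class "Date"
myDate <- Sys.Date()
class(myDate)

# The same date can be displayed in a different format
format(myDate, "%d %B %Y")

# Adding a number to a Date adds that many days
# (nextWeek is a hypothetical name used only in this example)
nextWeek <- myDate + 7
nextWeek
```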

What's in the workspace?

In [ ]:
ls()
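
Clearing the whole workspace is not always what we want. A minimal sketch, assuming two throwaway objects named scratchValue and keepValue (hypothetical names, not part of the workshop data), suggests how rm can remove a single object while leaving the rest in place:

```r
# scratchValue and keepValue are hypothetical objects created only for this illustration
scratchValue <- 42
keepValue <- "still here"

# exists() reports whether a name is currently defined
exists("scratchValue")

# Remove only scratchValue; keepValue is untouched
rm(scratchValue)
exists("scratchValue")
exists("keepValue")
```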

There are sample data frames described online at https://vincentarelbundock.github.io/Rdatasets/datasets.html.

To import data from the web, do we have the XML and RCurl packages? We can show the packages that are installed.

In [ ]:
installed.packages()
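
The full listing from installed.packages() can be long. As a rough sketch, the check can be narrowed to just the packages of interest; the names needed, isInstalled and missingPackages are introduced here for illustration only:

```r
# Packages this notebook uses for the web import
needed <- c("XML", "RCurl")

# installed.packages() returns a matrix whose row names are package names
isInstalled <- needed %in% rownames(installed.packages())
names(isInstalled) <- needed
isInstalled

# Names of any packages that would still need install.packages()
missingPackages <- needed[!isInstalled]
missingPackages
```

If missingPackages comes back empty, both packages are already available and the install step can be skipped.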

If they're not there, we'll need to install two packages to import from the web.

In [ ]:
# The following commands are commented out because the packages were already installed in this workspace.  If they are not installed, remove the #.
# install.packages("XML")
# install.packages("RCurl")

Load the XML and RCurl packages into the current session.

In [ ]:
library(XML)
library(RCurl)

Let's try to import:

In [ ]:
urlTarget <- "http://vincentarelbundock.github.io/Rdatasets/csv/Ecdat/Housing.csv"

urlData <- getURL(urlTarget)

ls()

What does the imported data look like? Use the head function.

In [ ]:
head(urlData)
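
A downloaded page arrives as one long character string, so head returns the entire string rather than a few rows. A minimal sketch of two ways to preview only the beginning, assuming a small made-up string csvText as a stand-in for the real download (the values are invented, not taken from the Housing file):

```r
# csvText is a made-up stand-in for the downloaded string (not the real Housing file)
csvText <- "price,lotsize,bedrooms\n42000,5850,3\n38500,4000,2\n49500,3060,3\n"

# The whole download is a single element, so length() is 1
length(csvText)

# Preview the first 30 characters
substr(csvText, 1, 30)

# Or split on newlines and look at the first two lines
firstLines <- head(strsplit(csvText, "\n")[[1]], 2)
firstLines
```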

Ah, it's CSV (Comma Separated Values). We can create a data frame with the read.csv function, and then check the number of columns and rows.

In [ ]:
Housing <- read.csv(urlTarget,header = TRUE)
ncol(Housing)
nrow(Housing)
dim(Housing)
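
read.csv can also parse text that is already in memory, through its text argument, which avoids fetching the same file twice. A minimal sketch, using a small made-up string csvText and a data frame name sampleHousing that are assumptions for illustration (the values are invented):

```r
# csvText is a made-up stand-in for text already downloaded into the workspace
csvText <- "price,lotsize,bedrooms\n42000,5850,3\n38500,4000,2\n49500,3060,3\n"

# Parse the in-memory text instead of reading from a file or URL
sampleHousing <- read.csv(text = csvText, header = TRUE)

# Columns, rows, and both together
ncol(sampleHousing)
nrow(sampleHousing)
dim(sampleHousing)
```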

Let's look at the first few rows of the table.

In [ ]:
head(Housing)

We can get some quick descriptive statistics with summary (i.e. minimum, first quartile, median, mean, third quartile, maximum).

In [ ]:
summary(Housing)
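
The same statistics can be requested for a single variable, and quantile gives direct control over which percentiles are reported. A minimal sketch on a made-up vector samplePrices (invented values, not the Housing prices):

```r
# samplePrices is a made-up vector used only for illustration
samplePrices <- c(10, 20, 30, 40, 50)

# summary() on a numeric vector reports min, quartiles, median, mean and max
summary(samplePrices)

# quantile() returns the requested percentiles; here the first and third quartiles
priceQuartiles <- quantile(samplePrices, probs = c(0.25, 0.75))
priceQuartiles
```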

The internal structure of the data frame can show integers, numerics, and factors.

In [ ]:
str(Housing)
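
Whether text columns appear as factors depends on the R version: from R 4.0.0 onward, read.csv leaves them as character strings unless stringsAsFactors = TRUE is given. A minimal sketch, assuming a made-up two-column extract (the column names follow the Housing data, the values are invented):

```r
# csvText is a made-up extract; the values are invented for illustration
csvText <- "price,driveway\n42000,yes\n38500,no\n49500,yes\n"

# Ask explicitly for text columns to be converted to factors
asFactor <- read.csv(text = csvText, stringsAsFactors = TRUE)
str(asFactor)

# A factor stores a fixed set of levels, which table() can count
levels(asFactor$driveway)
table(asFactor$driveway)
```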

To plot using ggplot2, we'll first need to install the package.

In [ ]:
# The following command is commented out because ggplot2 was already installed in this workspace.  If it is not installed, remove the #.
# install.packages("ggplot2")

Load ggplot2 into the current session with library.

In [ ]:
library(ggplot2)

Generate a simple scatterplot of price vs. lotsize (lot size on the x-axis, price on the y-axis), and display it.

In [ ]:
# Generate the plot
HousingPriceLotsizeScatter <- ggplot(data=Housing, 
                                     aes(x = lotsize, y = price)) + 
                                     geom_point()
# Display the plot
HousingPriceLotsizeScatter

Extend the scatterplot to include number of bedrooms.

In [ ]:
# Generate the plot
HousingPriceLotsizeBedrooms <- ggplot(data=Housing, 
                                      aes(x = lotsize, y = price)) + 
                                      geom_point(aes(color = bedrooms))
# Display the plot
HousingPriceLotsizeBedrooms
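
Because bedrooms is stored as a number, ggplot2 maps it to a continuous colour gradient. If distinct colours per bedroom count are preferred, one option is to wrap the variable in factor. A minimal sketch on a made-up data frame sampleHousing (invented rows, not the Housing dataset), with discretePlot as a hypothetical name:

```r
library(ggplot2)

# Made-up rows for illustration only; not taken from the Housing dataset
sampleHousing <- data.frame(
  lotsize  = c(3000, 4000, 5000, 6000, 7000, 8000),
  price    = c(35000, 42000, 50000, 61000, 66000, 80000),
  bedrooms = c(2, 3, 2, 4, 3, 4)
)

# factor(bedrooms) makes the colour scale discrete instead of a gradient
discretePlot <- ggplot(data = sampleHousing,
                       aes(x = lotsize, y = price)) +
  geom_point(aes(color = factor(bedrooms))) +
  labs(color = "bedrooms")

# Display the plot
discretePlot
```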

Facet the grid by the number of bedrooms, and add in linear models.

In [ ]:
# Generate the plot
HousingPriceLotsizeBedroomsLM <- ggplot(data=Housing, 
                                      aes(x = lotsize, y = price)) + 
                                      geom_smooth(method = "lm") +
                                      geom_point() +
                                      facet_grid(bedrooms ~ .)
# Display the plot
HousingPriceLotsizeBedroomsLM
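
With facets, geom_smooth(method = "lm") fits a separate straight line in each panel. To see the numbers behind such lines, the same kind of model can be fitted with lm for each group. A minimal sketch on made-up data (sampleHousing and fits are names introduced for illustration; the values are invented so that each group lies exactly on a line):

```r
# Made-up data: two bedroom groups, each lying exactly on a straight line
sampleHousing <- data.frame(
  bedrooms = c(2, 2, 2, 3, 3, 3),
  lotsize  = c(1, 2, 3, 1, 2, 3),
  price    = c(12, 14, 16, 25, 30, 35)
)

# Split the rows by bedroom count and fit price ~ lotsize within each group
fits <- sapply(split(sampleHousing, sampleHousing$bedrooms),
               function(d) coef(lm(price ~ lotsize, data = d)))

# One column per bedroom count; rows hold the intercept and the slope
fits
```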

Getting a working platform

There are three choices:

(1) In English, sign up at https://bigdatauniversity.com/, with https://datascientistworkbench.com/.

(2) In Chinese, sign up at https://bigdatauniversity.com.cn/, with https://datascientistworkbench.cn/.

(3) Install your own local workstation with three parts:

  • R;
  • Jupyter (with the R kernel); and
  • Packages as required (e.g. ggplot2, XML, RCurl).
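
For the local option, one common route to the R kernel is the IRkernel package, run from an R session after R and Jupyter are installed. This is a sketch of that route rather than the only way; the install lines are commented out in the same spirit as the cells above, and hasIRkernel is a name introduced for illustration:

```r
# Check whether the IRkernel package is already available
hasIRkernel <- requireNamespace("IRkernel", quietly = TRUE)
hasIRkernel

# If it is not, remove the # from the following lines and run them once.
# install.packages("IRkernel")
# IRkernel::installspec()   # registers the R kernel with Jupyter for the current user
```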

Learning more about R

At Big Data University, there is an online R 101 course that steps through R, with videos and exercises.

In February 2017, Polong Lin (from IBM) led a Data Science with R meetup at Ryerson University in Toronto.
