David Ing | Aalto University and the International Society for the System Sciences | Toronto, Canada
Disclaimer: David Ing is not (yet) an expert in R, Jupyter and ggplot2. As an econometrician at IBM (1985-1987), he was fluent in a similar package, GRAFSTAT; see "An APL system for interactive scientific-engineering graphics and data analysis" | G. J. Burkland, P. Heidelberger, P. D. Welch, L. S.Y. Wu, Martin Schatzoff | June 1984 | APL '84: Proceedings of the International Conference on APL at http://dx.doi.org/10.1145/384283.801082
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Copies of this notebook can be found at http://coevolving.com/tongji/, in a variety of formats.
One way: (i) ggplot2, through (ii) Jupyter, on (iii) R.
ggplot2, based on the Grammar of Graphics (originally Leland Wilkinson, 1995-2005)
Jupyter notebook for reproducible research, originally for Julia, Python and R
R Project for Statistical Computing https://www.r-project.org/
Advantages of this three-part package:
Disadvantages:
R programming operates inside an active workspace, see "R Programming/Manage your workspace" at https://en.wikibooks.org/wiki/R_Programming/Manage_your_workspace.
Within a Jupyter notebook:
Let's start by clearing everything in the workspace.
help(rm)
rm(list = ls())
Let's check if the workspace is really empty.
ls()
To ensure that there's something in the workspace, let's enter the date.
myDate <- Sys.Date()
myDate
What's in the workspace?
ls()
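Clearing the whole workspace is not the only option. As a minimal sketch (the object names scratchValue and scratchLabel are hypothetical, invented for this illustration), a single object can be checked for and removed while leaving the rest alone:

```r
# Hypothetical objects, created only for this illustration
scratchValue <- 42
scratchLabel <- "temporary"

# exists() reports whether a name is currently bound
exists("scratchValue")

# Remove just one object, leaving the rest of the workspace untouched
rm(scratchValue)

# Check again: scratchValue is gone, scratchLabel remains
exists("scratchValue")
exists("scratchLabel")
```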
There are sample data frames described online at https://vincentarelbundock.github.io/Rdatasets/datasets.html.
To import data from the web, do we have the XML and RCurl packages? We can show the packages that are installed.
installed.packages()
If they're not there, we'll need to install two packages to import from the web.
# The following commands are commented out because the packages have already been installed. If they have not, remove the #.
# install.packages("XML")
# install.packages("RCurl")
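Scanning the full output of installed.packages() by eye can be tedious. The following is a rough sketch of checking for just the two packages by name; the variable names (neededPackages, installedNames, missingPackages) are assumptions made for this example, and the install line is left commented out because it needs network access.

```r
# Packages this notebook relies on for importing from the web
neededPackages <- c("XML", "RCurl")

# installed.packages() returns a matrix whose row names are package names
installedNames <- rownames(installed.packages())

# Which of the needed packages are not yet installed?
missingPackages <- setdiff(neededPackages, installedNames)
missingPackages

# Install only what is missing (remove the # to run)
# if (length(missingPackages) > 0) install.packages(missingPackages)
```

An empty result (character(0)) would mean that both packages are already available.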
Load the XML and RCurl packages into the current session.
library(XML)
library(RCurl)
Let's try to import:
urlTarget <- "http://vincentarelbundock.github.io/Rdatasets/csv/Ecdat/Housing.csv"
urlData <- getURL(urlTarget)
ls()
What does the imported data look like? Use the head function.
head(urlData)
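Because getURL returns the downloaded text as a single character string, head shows the whole string rather than the first few lines. A small sketch of peeking at only the beginning follows; rawText is a short stand-in string with illustrative values, used here so that the example runs without a network connection.

```r
# A short stand-in for the text returned by getURL() (illustrative values)
rawText <- "id,price,lotsize\n1,42000,5850\n2,38500,4000\n3,49500,3060\n"

# The download is one string, so head() has only one element to show
length(rawText)

# Number of characters in the string
nchar(rawText)

# Show only the first 30 characters
substr(rawText, 1, 30)

# Or split on newlines and show the first two lines
rawLines <- strsplit(rawText, "\n")[[1]]
head(rawLines, 2)
```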
Ah, it's CSV (Comma Separated Values). We can create a data frame with the read.csv function, and then check the number of columns and rows.
Housing <- read.csv(urlTarget, header = TRUE)
ncol(Housing)
nrow(Housing)
dim(Housing)
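The read.csv call above fetches the file from the URL again, so the text already held in urlData is not reused. As a sketch of the alternative, read.csv can parse a string that is already in memory through its text argument. The names csvText and toyHousing, and the values in the string, are stand-ins invented for this example.

```r
# Stand-in CSV text (illustrative values), as if already downloaded
csvText <- "price,lotsize,bedrooms\n42000,5850,3\n38500,4000,2\n49500,3060,3\n"

# Parse the in-memory string instead of reading from a file or URL
toyHousing <- read.csv(text = csvText, header = TRUE)

# Check the shape: rows, then columns
dim(toyHousing)
toyHousing
```

With the real download, the equivalent call would be read.csv(text = urlData, header = TRUE).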
Let's look at the first few rows of the table.
head(Housing)
We can get some quick descriptive statistics with summary (i.e. min, 1st quartile, median, mean, 3rd quartile, max).
summary(Housing)
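To see where the quartiles in summary come from, here is a small sketch on a numeric vector. The vector toyPrices holds made-up values for illustration; quantile is the base R function that returns the same 25%, 50% and 75% points on request.

```r
# Made-up prices, for illustration only
toyPrices <- c(25000, 38500, 42000, 49500, 60500, 61000, 66000)

# Min, 1st quartile, median, mean, 3rd quartile, max
summary(toyPrices)

# The same quartiles, requested explicitly as probabilities
quantile(toyPrices, probs = c(0.25, 0.5, 0.75))
```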
The str function shows the internal structure of the data frame, including whether each column is stored as an integer, a numeric, or a factor.
str(Housing)
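Since R 4.0.0, read.csv leaves text columns as character by default rather than converting them to factors, so yes/no columns may show up as chr in the str output. A minimal sketch of converting such a column by hand follows; toyHousing and its values are made up for this illustration.

```r
# Made-up data frame with a yes/no column stored as character
toyHousing <- data.frame(
  price = c(42000, 38500, 49500),
  airco = c("no", "yes", "no"),
  stringsAsFactors = FALSE
)
str(toyHousing)

# Convert the text column to a factor with an explicit level order
toyHousing$airco <- factor(toyHousing$airco, levels = c("no", "yes"))
str(toyHousing)
levels(toyHousing$airco)
```

Passing stringsAsFactors = TRUE to read.csv is another way to get factors at import time.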
To plot using ggplot2, we'll first need to install the package.
# The following command is commented out because the package has already been installed. If it has not, remove the #.
# install.packages("ggplot2")
Load ggplot2 into the current session with library.
library(ggplot2)
Generate a simple scatterplot of price against lotsize, and display it.
# Generate the plot
HousingPriceLotsizeScatter <- ggplot(data = Housing,
                                     aes(x = lotsize, y = price)) +
  geom_point()
# Display the plot
HousingPriceLotsizeScatter
Extend the scatterplot to include number of bedrooms.
# Generate the plot
HousingPriceLotsizeBedrooms <- ggplot(data = Housing,
                                      aes(x = lotsize, y = price)) +
  geom_point(aes(color = bedrooms))
# Display the plot
HousingPriceLotsizeBedrooms
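When bedrooms is stored as a number, ggplot2 maps it to a continuous colour gradient. If one distinct colour per bedroom count is preferred, one option is to wrap the column in factor() inside aes. This is a sketch on a made-up data frame (toyHousing and its values are assumptions for illustration, not the real data set):

```r
library(ggplot2)

# Made-up data, for illustration only
toyHousing <- data.frame(
  lotsize  = c(3000, 4000, 5000, 3500, 4500, 6000),
  price    = c(30000, 38000, 46000, 40000, 52000, 70000),
  bedrooms = c(2, 2, 2, 3, 3, 3)
)

# factor(bedrooms) makes the colour scale discrete; labs() tidies the legend title
toyPlot <- ggplot(data = toyHousing,
                  aes(x = lotsize, y = price)) +
  geom_point(aes(color = factor(bedrooms))) +
  labs(color = "bedrooms")

# Display the plot
toyPlot
```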
Facet the grid by the number of bedrooms, and add in linear models.
# Generate the plot
HousingPriceLotsizeBedroomsLM <- ggplot(data = Housing,
                                        aes(x = lotsize, y = price)) +
  geom_smooth(method = "lm") +
  geom_point() +
  facet_grid(bedrooms ~ .)
# Display the plot
HousingPriceLotsizeBedroomsLM
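In the faceted plot, geom_smooth(method = "lm") fits a separate straight line within each panel. To see the corresponding numbers, a rough sketch is to split the data by bedrooms and fit price ~ lotsize in each group with base R. The data frame toyHousing and its values are made up for illustration, as is the name fitsByBedrooms.

```r
# Made-up data, for illustration only
toyHousing <- data.frame(
  lotsize  = c(3000, 4000, 5000, 3500, 4500, 6000),
  price    = c(30000, 38000, 46000, 40000, 52000, 70000),
  bedrooms = c(2, 2, 2, 3, 3, 3)
)

# Split the rows by number of bedrooms, then fit one linear model per group
fitsByBedrooms <- lapply(
  split(toyHousing, toyHousing$bedrooms),
  function(d) lm(price ~ lotsize, data = d)
)

# One column per bedroom count: intercept and slope of each fitted line
sapply(fitsByBedrooms, coef)
```

Applied to the Housing data frame, the same pattern would give one intercept and slope for each panel of the faceted plot.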
There are three choices:
(1) In English, sign up at https://bigdatauniversity.com/, with https://datascientistworkbench.com/.
(2) In Chinese, sign up at https://bigdatauniversity.com.cn/, with https://datascientistworkbench.cn/.
(3) Install your own local workstation with three parts:
At Big Data University, there is an online R 101 course that steps through R, with videos and exercises.
In February 2017, Polong Lin (from IBM) led a Data Science with R meetup at Ryerson University in Toronto.
The hands-on material is at http://bit.ly/feb22ryersonlab, pointing to https://my.datascientistworkbench.com/share/jupyter/v1/10.999.11.103/ML0101EN0000000/DataSciencewithR.ipynb, recently redirected to https://my.datascientistworkbench.com/tools/jupyter-notebook/api/v1/resources/DataSciencewithR2.ipynb.
The printable materials are at http://bit.ly/feb22ryerson, pointing to https://gist.github.com/polong-lin/a166cedef1399724fd5f06087660f396