Data Science, Automation & Related Technology
Section 1: Notes on R
R is a highly versatile statistical and mathematical computing platform, and it is free and open source. Downloads and documentation are available at http://www.r-project.org.
The notes in this section are recommendations for students in my statistics courses at Syracuse University and may not apply universally.
Working Directory
Keep files organized in a folder/subfolder structure. At the start of an R session, point the working directory to the relevant folder so files can be referenced locally. For example:
setwd("H:/Folder/SubFolder/SubSubFolder")
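Note the forward slashes. R treats the backslash as an escape character, so a Windows path must be written with forward slashes or with doubled backslashes (e.g., "H:\\Folder\\SubFolder").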
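The same idea can be tried on any machine with the sketch below. It uses a folder under R's temporary directory as a stand-in for the course folder; the folder and file names are hypothetical. Building paths with file.path() lets R handle the separators.

```r
# Hypothetical project folder created under R's temporary directory
project_dir <- file.path(tempdir(), "Folder", "SubFolder")
dir.create(project_dir, recursive = TRUE, showWarnings = FALSE)

old_wd <- setwd(project_dir)   # setwd() invisibly returns the previous directory
getwd()                        # confirm where R is now pointed

# Files can now be referenced by name alone
write.csv(data.frame(x = 1:3), "example.csv", row.names = FALSE)
example <- read.csv("example.csv")

setwd(old_wd)                  # restore the original working directory
```

Saving the value returned by setwd() makes it easy to return to the previous folder at the end of a script.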
Script Editor and Source
Typing commands one at a time at the console quickly becomes tedious. Most work is better done in a script editor (in base R: File > New Script). Select one or more lines and press Ctrl+R (in the Windows R GUI) to run them. Scripts can be saved and later run in full from the console:
source("foo.R")
This assumes foo.R is saved in the current working directory. For a more comfortable editing environment with syntax highlighting, RStudio is the current standard recommendation.
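A minimal sketch of the save-then-source workflow, assuming a throwaway script written to R's temporary directory (the file name foo_demo.R and the variable names are hypothetical):

```r
# Write a two-line script to a temporary file, then run it with source()
script_path <- file.path(tempdir(), "foo_demo.R")
writeLines(c("x <- c(2, 4, 6)",
             "x_mean <- mean(x)"),
           script_path)

source(script_path)               # runs every line of the script
x_mean                            # objects created by the script are now available

source(script_path, echo = TRUE)  # echo = TRUE also prints each command as it runs
```

By default source() does not print the commands or the values of expressions in the script, so echo = TRUE is useful when the console output should show what was run.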
Menu-Driven GUI: R Commander (Rcmdr)
The Rcmdr package provides a point-and-click interface for many standard statistical tasks. Every menu selection generates the corresponding R code in the output window, where it can be copied, edited, and rerun. This makes Rcmdr a useful bridge for students transitioning to scripting, and it is particularly well suited to introductory-level work. John Fox (the package author) has written a textbook built around it: Fox, J. (2016). Using the R Commander: A Point-and-Click Interface for R. Chapman & Hall. https://doi.org/10.1201/9781315380537. See also http://www.rcommander.com.
Installing Packages Without Administrative Privileges
On a public or shared computer where you lack admin rights but have a read/write-accessible folder, redirect the R library path to that folder. Create a subfolder (e.g., R_Packages_Library) inside your accessible drive, then run:
.libPaths(c("H:\\R_Packages_Library", .libPaths()))
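After this, install.packages() installs to that folder, and library() looks there first, for the rest of the session. The setting does not persist, so rerun the line at the start of each new session.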
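The sketch below shows the effect using a folder under R's temporary directory as a stand-in for the writable drive (the path is hypothetical). The folder has to exist before the call, because .libPaths() silently drops directories that do not exist.

```r
# Stand-in for a writable folder such as H:/R_Packages_Library (hypothetical path)
user_lib <- file.path(tempdir(), "R_Packages_Library")
dir.create(user_lib, showWarnings = FALSE)   # must exist before .libPaths() will accept it

.libPaths(c(user_lib, .libPaths()))          # put the new folder first in the search path
.libPaths()[1]                               # first entry = default install location
```

Checking .libPaths()[1] before installing confirms that packages will land in the intended folder.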
Additional Help and Reference Links
The links below have been useful starting points for R-related questions. Some may have moved or gone inactive over time.
- http://www.ats.ucla.edu/stat/r/
- http://ww2.coastal.edu/kingw/statistics/R-tutorials
- http://msenux.redwoods.edu/math/R/
- http://www.statmethods.net
- http://www.r-bloggers.com/
- http://www.sr.bham.ac.uk/~ajrs/R/r-function_list.html
- http://stackoverflow.com/tags/r/info
- http://www.stat.pitt.edu/stoffer/tsa2/R_time_series_quick_fix_ed2.htm
Section 2: R in Google Colab — Quick Reference
The code blocks below are a working reference for R in Google Colab. They assume a dataframe df_illustration is available with variables: BinaryV1, CategV1, CategV2, QuantV1, QuantV2, QuantV3, QuantV4.
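For trying the analyses without real data, a synthetic stand-in for df_illustration can be generated as sketched here. Everything in it is made up for illustration: the level labels, sample size, and distributions are assumptions, and only the variable names come from the description above.

```r
set.seed(1)                    # reproducible made-up data
n <- 120

df_illustration <- data.frame(
  BinaryV1 = factor(sample(c("No", "Yes"), n, replace = TRUE)),
  CategV1  = factor(sample(c("A", "B", "C"), n, replace = TRUE)),
  CategV2  = factor(sample(c("Low", "Mid", "High"), n, replace = TRUE)),
  QuantV2  = rnorm(n, mean = 50, sd = 10),
  QuantV3  = rnorm(n, mean = 20, sd = 5),
  QuantV4  = runif(n, min = 0, max = 100)
)

# QuantV1 depends loosely on QuantV2 so the regression examples have a signal to find
df_illustration$QuantV1 <- 10 + 0.5 * df_illustration$QuantV2 + rnorm(n, sd = 5)

str(df_illustration)           # check variable names and types
```

Storing the grouping variables as factors means boxplot(), aov(), and table() treat them as categories rather than text.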
Setup: Mount Google Drive (Python cell — run before switching to R runtime)
from google.colab import drive
drive.mount('/content/drive')
# Google will prompt for authorization.
# Your entire Google Drive is then accessible under /content/drive/MyDrive/
Analysis 1: Summary and Boxplot of a Single Variable
summary(df_illustration$QuantV1)
boxplot(df_illustration$QuantV1, main = "QuantV1 (All)")
Analysis 2–3: Two-Group Comparison (Moore & McCabe, Ch. 7)
# Summaries and boxplot by group
tapply(df_illustration$QuantV1, df_illustration$BinaryV1, summary)
boxplot(QuantV1 ~ BinaryV1, data = df_illustration, main = "QuantV1 by BinaryV1")
# Two-sample t-test
t.test(QuantV1 ~ BinaryV1, data = df_illustration)
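One detail worth knowing about t.test(): by default it runs the Welch (unequal-variance) version, and var.equal = TRUE requests the pooled-variance version. The sketch below compares the two on made-up data; the data frame demo and its columns are hypothetical.

```r
set.seed(7)                    # reproducible made-up data
demo <- data.frame(
  group = rep(c("A", "B"), each = 30),
  value = c(rnorm(30, mean = 50, sd = 8),
            rnorm(30, mean = 55, sd = 12))
)

welch  <- t.test(value ~ group, data = demo)                    # default: Welch test
pooled <- t.test(value ~ group, data = demo, var.equal = TRUE)  # pooled-variance test

welch$method                   # name of the test that was run
welch$parameter                # Welch degrees of freedom (usually fractional)
pooled$parameter               # pooled degrees of freedom: n1 + n2 - 2
```

The result of t.test() is a list, so individual pieces such as $statistic, $p.value, and $conf.int can be extracted for reporting.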
Analysis 4–5: One-Way ANOVA (Moore & McCabe, Ch. 12)
# Summaries and boxplot by group
tapply(df_illustration$QuantV1, df_illustration$CategV1, summary)
boxplot(QuantV1 ~ CategV1, data = df_illustration, main = "QuantV1 by CategV1")
# One-way ANOVA
anova_fit <- aov(QuantV1 ~ CategV1, data = df_illustration)
summary(anova_fit)
Analysis 6: Crosstabulation and Chi-Square Test (Moore & McCabe, Ch. 9)
tab <- table(df_illustration$CategV1, df_illustration$CategV2)
tab                       # counts
prop.table(tab, 2)        # column proportions (multiply by 100 for percentages)
prop.table(tab, 1)        # row proportions (multiply by 100 for percentages)
chisq.test(tab)$expected  # expected counts
chisq.test(tab)           # chi-square test
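To see where the expected counts and the test statistic come from, the sketch below recomputes them by hand for a small made-up table and compares the result with chisq.test(). The table tab_demo and its labels are hypothetical.

```r
# Made-up 2 x 3 table of counts
tab_demo <- matrix(c(20, 30, 25,
                     30, 20, 25),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(Group = c("g1", "g2"),
                                   Level = c("a", "b", "c")))

# Expected count in each cell = (row total * column total) / grand total
expected_by_hand <- outer(rowSums(tab_demo), colSums(tab_demo)) / sum(tab_demo)

# Chi-square statistic = sum over cells of (observed - expected)^2 / expected
chisq_by_hand <- sum((tab_demo - expected_by_hand)^2 / expected_by_hand)

test_demo <- chisq.test(tab_demo)
all.equal(unname(test_demo$statistic), chisq_by_hand)  # hand calculation vs. chisq.test()
```

A table larger than 2 x 2 is used here because chisq.test() applies a continuity correction to 2 x 2 tables by default, which would make the hand calculation differ.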
Analysis 7–8: Correlation and Scatterplot Matrix (Moore & McCabe, Ch. 10)
cor(df_illustration[, c("QuantV1","QuantV2","QuantV3","QuantV4")], use = "complete.obs")
pairs(df_illustration[, c("QuantV1","QuantV2","QuantV3","QuantV4")])
Analysis 9: Simple Linear Regression (Moore & McCabe, Ch. 10–11)
slr_fit <- lm(QuantV1 ~ QuantV2, data = df_illustration)
summary(slr_fit)
par(mfrow = c(2, 2)); plot(slr_fit)
Analysis 10: Multiple Linear Regression with VIFs (Moore & McCabe, Ch. 11)
mlr_fit <- lm(QuantV1 ~ QuantV2 + QuantV3 + QuantV4, data = df_illustration)
summary(mlr_fit)
par(mfrow = c(2, 2)); plot(mlr_fit)
# Variance Inflation Factors — install 'car' package first if needed:
# install.packages("car")
library(car)
vif(mlr_fit)
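If the car package cannot be installed, the VIFs can be computed in base R. The sketch below assumes a model with numeric predictors only and uses made-up data; the data frame demo and its columns are hypothetical. Each VIF is 1 / (1 - R^2), where R^2 comes from regressing that predictor on the other predictors.

```r
set.seed(42)                   # reproducible made-up data
n  <- 100
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.5)   # deliberately correlated with x1
x3 <- rnorm(n)                        # unrelated to x1 and x2
y  <- 1 + x1 + x2 + x3 + rnorm(n)
demo <- data.frame(y, x1, x2, x3)

# R^2 from regressing each predictor on the remaining predictors
r2_x1 <- summary(lm(x1 ~ x2 + x3, data = demo))$r.squared
r2_x2 <- summary(lm(x2 ~ x1 + x3, data = demo))$r.squared
r2_x3 <- summary(lm(x3 ~ x1 + x2, data = demo))$r.squared

vif_by_hand <- c(x1 = 1 / (1 - r2_x1),
                 x2 = 1 / (1 - r2_x2),
                 x3 = 1 / (1 - r2_x3))
vif_by_hand
```

The correlated predictors x1 and x2 should show clearly larger values than x3, whose VIF should be close to 1.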
Reading Data from a ZIP File
# Option 1: Read a CSV from inside a ZIP without extracting
zip_path <- "/content/drive/MyDrive/path/to/data.zip"
utils::unzip(zip_path, list = TRUE)  # inspect contents first
con <- unz(zip_path, "inner/path/file.csv")
data_streamed <- read.csv(con)
head(data_streamed)

# Option 2: Unzip to a local folder, then read normally
exdir <- "/content/extracted_data"  # fast; ephemeral in Colab
utils::unzip(zip_path, exdir = exdir)
csv_path <- file.path(exdir, "inner/path/file.csv")
data_unzipped <- read.csv(csv_path)
head(data_unzipped)
Reading Data Directly from a URL
data_from_url <- read.table(
  "https://raw.githubusercontent.com/your/repo/file.csv",
  header = TRUE,
  sep = ","
)
head(data_from_url)