Data Science, Automation & Related Technology

Section 1: Notes on R

R is a highly versatile statistical and mathematical computing platform, and it is free and open source. Downloads and documentation are available at http://www.r-project.org.

The notes in this section are recommendations for students in my statistics courses at Syracuse University and may not apply universally.

Working Directory

Keep files organized in a folder/subfolder structure. At the start of an R session, point the working directory to the relevant folder so files can be referenced locally. For example:

setwd("H:/Folder/SubFolder/SubSubFolder")

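Note the forward slashes. R treats a backslash inside a string as an escape character, so a Windows path must be written either with forward slashes or with doubled backslashes (\\).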
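As a minimal sketch of this workflow, the lines below check that a folder exists before switching to it and restore the previous directory afterwards. The folder and file names are hypothetical stand-ins (a temporary directory is used so the example runs anywhere); they are not paths from these notes.

```r
# Hypothetical project folder: a temporary directory stands in for
# something like "H:/Folder/SubFolder/SubSubFolder".
project_dir <- file.path(tempdir(), "stats_project")
dir.create(project_dir, showWarnings = FALSE, recursive = TRUE)

old_wd <- getwd()                 # remember where the session started
if (dir.exists(project_dir)) {
  setwd(project_dir)              # relative file names now resolve here
}

# Write and list a small file using a relative name only.
write.csv(data.frame(x = 1:3), "example.csv", row.names = FALSE)
files_here <- list.files()
print(files_here)

setwd(old_wd)                     # restore the original working directory
```

Restoring the old directory at the end is optional in an interactive session, but it keeps a script from leaving the session somewhere unexpected.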

Script Editor and Source

Command-line entry grows old quickly. Most work is better done in a script editor (in base R: File > New Script). Select one or more lines and use Ctrl+R to run them. Scripts can be saved and later run in full from the console:

source("foo.R")

This assumes foo.R is saved in the current working directory. For a more comfortable editing environment with syntax highlighting, RStudio is the current standard recommendation.
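The following sketch shows one way source() can be used. It writes a tiny script to a temporary folder (a hypothetical stand-in for a saved foo.R) and runs it in a separate environment so that the script's objects do not overwrite anything in the workspace. Running source("foo.R") with no extra arguments, as above, is the simpler everyday form.

```r
# Hypothetical script file, written to a temporary folder for illustration.
script_path <- file.path(tempdir(), "foo.R")
writeLines(c(
  "x <- c(2, 4, 6, 8)",
  "x_mean <- mean(x)"
), script_path)

# Evaluate the script in its own environment instead of the global workspace.
script_env <- new.env()
source(script_path, local = script_env)

# Objects created by the script are available inside that environment.
print(script_env$x_mean)
```

Adding echo = TRUE to the source() call prints each command as it runs, which can help when tracing an error in a longer script.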

Menu-Driven GUI: R Commander (Rcmdr)

Package Rcmdr provides a point-and-click interface for many standard statistical tasks. Every menu selection generates the corresponding R code in the output window, where it can be copied, edited, and rerun, which makes it a useful bridge for students transitioning to scripting. It is particularly well suited to introductory-level work. John Fox, the package author, has written a textbook built around it: Fox, J. (2016). Using the R Commander: A Point-and-Click Interface for R. Chapman & Hall/CRC. https://doi.org/10.1201/9781315380537. See also http://www.rcommander.com.

Installing Packages Without Administrative Privileges

On a public or shared computer where you lack admin rights but have a read/write-accessible folder, redirect the R library path to that folder. Create a subfolder (e.g., R_Packages_Library) inside your accessible drive, then run:

.libPaths(c("H:\\R_Packages_Library", .libPaths()))

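After this, install.packages() installs to that folder and library() looks there first. The setting lasts only for the current session, so rerun the line at the start of each new session.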
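A short sketch of the same idea with a check at the end is given below. The folder name is an assumption for illustration (a temporary directory stands in for a writable drive such as H:); the point is only that the first entry of .libPaths() is where install.packages() installs by default.

```r
# Hypothetical library folder; replace with a folder you can write to,
# for example "H:/R_Packages_Library".
user_lib <- file.path(tempdir(), "R_Packages_Library")
dir.create(user_lib, showWarnings = FALSE, recursive = TRUE)

# Put the personal folder first so it becomes the default install location.
.libPaths(c(user_lib, .libPaths()))

# Confirm that the first library path is the new folder.
first_lib <- .libPaths()[1]
print(first_lib)
```

Note that .libPaths() silently drops folders that do not exist, which is why the folder is created before the call.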

Additional Help and Reference Links

The links below have been useful starting points for R-related questions. Some may have moved or gone inactive over time.

Section 2: R in Google Colab — Quick Reference

The code blocks below are a working reference for R in Google Colab. They assume a dataframe df_illustration is available with variables: BinaryV1, CategV1, CategV2, QuantV1, QuantV2, QuantV3, QuantV4.
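If no such dataframe is at hand, a simulated stand-in can be built so that the blocks below run. This is only a sketch: the sample size, level names, and distributions are assumptions chosen for illustration and carry no real meaning.

```r
set.seed(1)   # reproducible simulated data
n <- 120      # hypothetical sample size

df_illustration <- data.frame(
  BinaryV1 = factor(sample(c("A", "B"), n, replace = TRUE)),
  CategV1  = factor(sample(c("Low", "Mid", "High"), n, replace = TRUE)),
  CategV2  = factor(sample(c("X", "Y", "Z"), n, replace = TRUE)),
  QuantV2  = rnorm(n, mean = 50, sd = 10),
  QuantV3  = rnorm(n, mean = 20, sd = 5),
  QuantV4  = runif(n, min = 0, max = 100)
)

# QuantV1 depends loosely on QuantV2 so the regression examples have signal.
df_illustration$QuantV1 <- 10 + 0.5 * df_illustration$QuantV2 + rnorm(n, sd = 4)

str(df_illustration)
```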

Setup: Mount Google Drive (Python cell — run before switching to R runtime)

from google.colab import drive
drive.mount('/content/drive')
# Google will prompt for authorization.
# Your entire Google Drive is then accessible under /content/drive/MyDrive/

Analysis 1: Summary and Boxplot of a Single Variable

summary(df_illustration$QuantV1)
boxplot(df_illustration$QuantV1, main = "QuantV1 (All)")

Analysis 2–3: Two-Group Comparison (Moore & McCabe, Ch. 7)

# Summaries and boxplot by group
tapply(df_illustration$QuantV1, df_illustration$BinaryV1, summary)
boxplot(QuantV1 ~ BinaryV1, data = df_illustration, main = "QuantV1 by BinaryV1")

# Two-sample t-test (Welch, unequal variances, by default;
# add var.equal = TRUE for the pooled-variance version)
t.test(QuantV1 ~ BinaryV1, data = df_illustration)
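The object returned by t.test() can be stored and its pieces pulled out individually, which is handy when reporting results. The sketch below uses simulated data (an assumption; the group means and sizes are arbitrary) and compares the default Welch test with the pooled-variance version.

```r
set.seed(2)   # simulated stand-in for df_illustration
toy <- data.frame(
  BinaryV1 = rep(c("A", "B"), each = 30),
  QuantV1  = c(rnorm(30, mean = 50, sd = 8), rnorm(30, mean = 55, sd = 8))
)

welch  <- t.test(QuantV1 ~ BinaryV1, data = toy)                    # default
pooled <- t.test(QuantV1 ~ BinaryV1, data = toy, var.equal = TRUE)  # pooled

welch$p.value      # p-value of the Welch test
welch$conf.int     # confidence interval for the difference in means
welch$parameter    # Welch degrees of freedom (usually not a whole number)
pooled$parameter   # pooled degrees of freedom: n1 + n2 - 2
```

The Welch degrees of freedom never exceed the pooled value, so the Welch interval is typically a little wider when group variances differ.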

Analysis 4–5: One-Way ANOVA (Moore & McCabe, Ch. 12)

# Summaries and boxplot by group
tapply(df_illustration$QuantV1, df_illustration$CategV1, summary)
boxplot(QuantV1 ~ CategV1, data = df_illustration, main = "QuantV1 by CategV1")

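# One-way ANOVA (CategV1 should be a factor or character variable;
# wrap it in factor() if the groups are coded as numbers)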
anova_fit <- aov(QuantV1 ~ CategV1, data = df_illustration)
summary(anova_fit)
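To make the ANOVA table less of a black box, the sketch below pulls the table out as a data frame and recomputes the F statistic as the ratio of the two mean squares. The data are simulated (an assumption for illustration; three groups of 20).

```r
set.seed(3)   # simulated stand-in for df_illustration
toy <- data.frame(
  CategV1 = factor(rep(c("Low", "Mid", "High"), each = 20)),
  QuantV1 = c(rnorm(20, 50, 5), rnorm(20, 53, 5), rnorm(20, 58, 5))
)

anova_fit <- aov(QuantV1 ~ CategV1, data = toy)
anova_tab <- summary(anova_fit)[[1]]   # ANOVA table as a data frame

anova_tab[["Df"]]                      # groups - 1, then n - groups
f_reported <- anova_tab[["F value"]][1]
p_reported <- anova_tab[["Pr(>F)"]][1]

# F is the between-group mean square divided by the residual mean square.
f_manual <- anova_tab[["Mean Sq"]][1] / anova_tab[["Mean Sq"]][2]
print(c(reported = f_reported, manual = f_manual))
```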

Analysis 6: Crosstabulation and Chi-Square Test (Moore & McCabe, Ch. 9)

tab <- table(df_illustration$CategV1, df_illustration$CategV2)
tab                        # counts
prop.table(tab, 2)         # column proportions (multiply by 100 for percentages)
prop.table(tab, 1)         # row proportions (multiply by 100 for percentages)
chisq.test(tab)$expected   # expected counts
chisq.test(tab)            # chi-square test
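The expected counts come from a simple formula: row total times column total, divided by the grand total. The sketch below stores the test once, reproduces the expected counts by hand, and checks the common rule of thumb that every expected count is at least 5. The table itself is made up for illustration.

```r
# Hypothetical 2 x 3 table of counts.
toy_tab <- matrix(c(20, 15, 10,
                    10, 15, 30),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(CategV1 = c("a", "b"),
                                  CategV2 = c("x", "y", "z")))

chi <- chisq.test(toy_tab)   # run the test once and reuse the result

chi$expected                 # expected counts under independence
chi$parameter                # degrees of freedom: (rows - 1) * (cols - 1)

# Expected count = (row total * column total) / grand total.
manual_expected <- outer(rowSums(toy_tab), colSums(toy_tab)) / sum(toy_tab)

# Rule-of-thumb check on the chi-square approximation.
all_at_least_5 <- all(chi$expected >= 5)
print(all_at_least_5)
```

For 2 x 2 tables, chisq.test() applies a continuity correction by default; correct = FALSE turns it off.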

Analysis 7–8: Correlation and Scatterplot Matrix (Moore & McCabe, Ch. 10)

cor(df_illustration[, c("QuantV1","QuantV2","QuantV3","QuantV4")], use = "complete.obs")
pairs(df_illustration[, c("QuantV1","QuantV2","QuantV3","QuantV4")])

Analysis 9: Simple Linear Regression (Moore & McCabe, Ch. 10–11)

slr_fit <- lm(QuantV1 ~ QuantV2, data = df_illustration)
summary(slr_fit)
par(mfrow = c(2, 2)); plot(slr_fit)   # four diagnostic plots in a 2 x 2 grid
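Beyond summary(), a fitted lm object can be queried for coefficient confidence intervals and for predictions at new values. The sketch below uses simulated data (an assumption; the true intercept 3 and slope 2 are arbitrary) and a hypothetical new value QuantV2 = 5.

```r
set.seed(5)   # simulated stand-in for df_illustration
toy <- data.frame(QuantV2 = runif(50, min = 0, max = 10))
toy$QuantV1 <- 3 + 2 * toy$QuantV2 + rnorm(50)

slr_fit <- lm(QuantV1 ~ QuantV2, data = toy)

coef(slr_fit)             # estimated intercept and slope
ci <- confint(slr_fit)    # 95% confidence intervals for both coefficients
print(ci)

# Prediction interval for a single new observation at QuantV2 = 5.
pred <- predict(slr_fit,
                newdata  = data.frame(QuantV2 = 5),
                interval = "prediction")
print(pred)               # columns: fit, lwr, upr
```

Using interval = "confidence" instead gives the narrower interval for the mean response at that value.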

Analysis 10: Multiple Linear Regression with VIFs (Moore & McCabe, Ch. 11)

mlr_fit <- lm(QuantV1 ~ QuantV2 + QuantV3 + QuantV4, data = df_illustration)
summary(mlr_fit)
par(mfrow = c(2, 2)); plot(mlr_fit)   # four diagnostic plots in a 2 x 2 grid

# Variance Inflation Factors — install 'car' package first if needed:
# install.packages("car")
library(car)
vif(mlr_fit)
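If the car package cannot be installed, the same quantity can be computed in base R. For a model with numeric predictors, the VIF of predictor j is 1 / (1 - R²_j), where R²_j comes from regressing predictor j on the other predictors. The sketch below uses simulated data in which QuantV3 is deliberately correlated with QuantV2; all values are assumptions for illustration.

```r
set.seed(4)   # simulated stand-in for df_illustration
n <- 100
QuantV2 <- rnorm(n)
QuantV3 <- 0.8 * QuantV2 + rnorm(n, sd = 0.6)   # correlated with QuantV2
QuantV4 <- rnorm(n)
toy <- data.frame(
  QuantV1 = 1 + QuantV2 + QuantV3 + QuantV4 + rnorm(n),
  QuantV2, QuantV3, QuantV4
)

predictors <- c("QuantV2", "QuantV3", "QuantV4")

# Regress each predictor on the others and convert R-squared to a VIF.
manual_vif <- sapply(predictors, function(p) {
  others  <- setdiff(predictors, p)
  aux_fit <- lm(reformulate(others, response = p), data = toy)
  1 / (1 - summary(aux_fit)$r.squared)
})
print(manual_vif)

# Equivalent shortcut: diagonal of the inverse predictor correlation matrix.
vif_from_cor <- diag(solve(cor(toy[, predictors])))
```

A VIF of 1 means a predictor is uncorrelated with the others; larger values indicate more collinearity.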

Reading Data from a ZIP File

# Option 1: Read a CSV from inside a ZIP without extracting
zip_path <- "/content/drive/MyDrive/path/to/data.zip"
utils::unzip(zip_path, list = TRUE)   # inspect contents first
con <- unz(zip_path, "inner/path/file.csv")
data_streamed <- read.csv(con)
head(data_streamed)

# Option 2: Unzip to a local folder, then read normally
exdir <- "/content/extracted_data"   # fast; ephemeral in Colab
utils::unzip(zip_path, exdir = exdir)
csv_path <- file.path(exdir, "inner/path/file.csv")
data_unzipped <- read.csv(csv_path)
head(data_unzipped)

Reading Data Directly from a URL

# read.csv() is read.table() with CSV-friendly defaults (header = TRUE, sep = ",")
data_from_url <- read.csv(
  "https://raw.githubusercontent.com/your/repo/file.csv"
)
head(data_from_url)
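Reads from a URL can fail when the address is wrong or the network is unavailable. One possible guard is a small wrapper that returns NULL instead of stopping the notebook. The helper name read_csv_or_null is hypothetical, and the example uses local temporary files rather than a real URL so that it runs offline; read.csv() accepts a URL in the same position.

```r
# Hypothetical helper: returns a data frame, or NULL if the read fails.
# Note that it also gives up on warnings, which is stricter than read.csv().
read_csv_or_null <- function(path_or_url) {
  tryCatch(
    read.csv(path_or_url),
    error = function(e) {
      message("Could not read data: ", conditionMessage(e))
      NULL
    },
    warning = function(w) {
      message("Problem while reading data: ", conditionMessage(w))
      NULL
    }
  )
}

# A small CSV written locally stands in for a remote file.
local_csv <- tempfile(fileext = ".csv")
write.csv(data.frame(a = 1:2, b = c("x", "y")), local_csv, row.names = FALSE)

ok_data      <- read_csv_or_null(local_csv)
missing_data <- read_csv_or_null(file.path(tempdir(), "no_such_file.csv"))

is.null(missing_data)   # the failed read is reported, not fatal
```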

Thomas T. John, Ph.D.