LES with R – get start – LES | lets enjoy statistics

LES with R – get start

R and RStudio

R is free software for statistics and data science that provides a handy way of doing day-to-day data handling such as cleaning, editing, analysing and visualising data and communicating the outputs. You can download R from CRAN, the Comprehensive R Archive Network, at https://cloud.r-project.org. It works on Windows, Mac and Linux operating systems; you just need to download the right version for your machine. A new version of R is released every year, with further updates over the year. This module was prepared using R version 4.3.0.

RStudio is an integrated development environment (IDE) for R programming, which is also available free of cost at https://posit.co/download/rstudio-desktop/. When we talk about using R, we basically work in RStudio while R runs in the background.

There are many advantages of using R. Some of them are below.

Reproducibility: In R, once you write the code for a data analysis, a fancy graph or a function and save it in a safe directory, you can reproduce the same outputs even years later. That is the beauty of coding in R.
Help & support: Comprehensive online help is available. You can find example code for almost anything you want to do in R, and often you just need to copy, paste and adapt the right code. Some example websites where you can find help are below.
- LES - https://letsenjoystatistics.com/.
- Quick R - https://www.statmethods.net/.
- STHDA - http://www.sthda.com/english/.
- Stack Overflow - https://stackoverflow.com/.
- … many more.
Packages ready on demand: Thousands of R packages are available to help you get your task done.
Functionality: Besides statistics and data analysis, R can be used for many other purposes such as writing reports, articles, theses and books, preparing PowerPoint slides, creating web applications, checking Facebook status, sending email etc. So, it is almost all in one package. The frequently used interfaces are R Markdown and Sweave. Other programming languages such as Python, C++ and SQL can be called and used from R, which makes it very versatile.
Example datasets: R has a huge number of example datasets that are built in with its packages. You can access any of them and practise. To get the list of built-in datasets, just type data() in the console and hit the Enter key. Of course, more datasets will come along with the packages you install. Some commands are very helpful for exploring the built-in datasets. For example, help(package="datasets") provides the documentation of the datasets, data(package="ggplot2") gives the list of the datasets built in with the package ggplot2, data(package = .packages(all.available = TRUE)) gives the list of all datasets from all installed packages, and dplyr::storms gives access to the dataset storms built in with the package dplyr.
Handling multiple datasets: R lets you handle multiple datasets at the same time, so you can load many datasets and work on them together.
Zero money: Most importantly, all of this can be done at ZERO cost.

Therefore, it is worth investing some time in learning a program like R to become an efficient data and web handler.
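The commands mentioned under Example datasets can be tried directly in the console. The following is a minimal sketch that uses only the datasets package shipped with R; the choice of the women dataset is just for illustration.

```r
# data() returns an object whose $results matrix lists the available datasets
ds <- data(package = "datasets")$results

# Show the name (Item) and description (Title) of the first few datasets
head(ds[, c("Item", "Title")])

# A built-in dataset can be used directly by its name
head(women)

# The package::name form works too, as with dplyr::storms
first_rows <- head(datasets::women, 3)
first_rows
```

Replacing "datasets" with the name of any installed package lists the datasets of that package instead.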

Get started

To start using R, simply download the right version for your machine and install it. In most cases, you need to install R first and then install RStudio; however, in some cases, for example when downloading from a university software centre, you may need to install RStudio only. Once installed, just double-click the RStudio desktop icon to open it. You will see a pane layout (various windows) as below.

In the above figure, the two most important windows are marked with rectangles. The top one (source) is the input window, and the one below (console) is both an input and an output window. The red-circled tabs Plots and Help are also frequently used. The pane layout can be rearranged by following Tools -> Global Options -> Pane Layout. I prefer showing two panes only, source on the left and console on the right. You may not see the source pane when you open RStudio for the first time. To get it, simply click the little green plus sign in the top left corner and select R Script.

Typing in `R`

Typing in R is easy and simple. You can type either in the console or in the source pane; however, I recommend typing most of your code in the source pane, where you can save it and use it anytime later. The code written in the source pane is called an R script. You can save the console as well; however, it is not so efficient because every saved console takes up a lot of space on your computer.

Coding in R is object oriented. That means you can treat data, graphs and the output of an analysis as objects and assign a name to each of them. This is really handy because you can call an object by its name at any time. You assign a name to an object using an equals sign (=) or the assignment arrow (<-); both appear in this document. Let’s see a simple code example.

# Creating a vector of values 1 to 5 and assign a name "x"
x = c(1:5)

# Displaying the x
print(x)

## [1] 1 2 3 4 5

# Creating a vector of values from 1 to 1.8 with an increment of 0.2 with a name "y"
# (five values, so that y has the same length as x)
y = seq(1, 1.8, by=0.2)

# Displaying "y"
print(y)

## [1] 1.0 1.2 1.4 1.6 1.8

# Combining x and y to create a matrix with 2 columns
dt = matrix(cbind(x,y), ncol=2)

# Naming the columns x and y
colnames(dt) = c('x', 'y')

# Display the matrix
print(dt)

##      x   y
## [1,] 1 1.0
## [2,] 2 1.2
## [3,] 3 1.4
## [4,] 4 1.6
## [5,] 5 1.8

To add notes or comments to an R script, use the # sign. R ignores everything after # on that line.
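The output of an analysis can be given a name in the same way as data. The sketch below uses made-up height values (an assumption for illustration only) and the assignment arrow <-.

```r
# Hypothetical heights in cm, invented for this example
heights <- c(150, 160, 165, 172, 180)

# Store the output of summary() as an object called height_summary
height_summary <- summary(heights)

# Call the whole object by its name
height_summary

# Pick a single element of the stored output by its label
mean_height <- height_summary[["Mean"]]
mean_height
```

Because the result is stored, it can be reused later in the script without repeating the calculation.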

Setting the working directory

The working directory is a folder on your computer where your datasets are stored. Any results you save, including new datasets and graphs, go into the current working directory unless you give a different path.

setwd("C:/Users/mmoinuddin/OneDrive - UCLan/LES")

The folder LES is my working directory, where I have stored my datasets so that I can call them anytime I like. You can copy and paste the code and change the path to your own. Once you have set your working directory, you can check it using the following command.

getwd()

## [1] "C:/Users/mmoinuddin/OneDrive - UCLan/LES"

You can get the list of files stored in the current working directory by simply typing the dir() command in the console and hitting the enter button on your keyboard.

dir()

##  [1] "data spread _AE.docx"         "data_types.png"              
##  [3] "diet article _AE.docx"        "evidence_pyramid.png"        
##  [5] "FHS_1.png"                    "FHS_2.png"                   
##  [7] "google_cough.png"             "google_flu1.jpeg"            
##  [9] "google_flu2.jpeg"             "google_flu21.jpeg.png"       
## [11] "Hypothesis T1 _AE.rtf"        "john_map2.jpg"               
## [13] "john_snow_map_tubeW.png"      "John_Snow_Pub_1.jpg"         
## [15] "john_snow_waterpomp.jpg"      "LES-with-R.html"             
## [17] "LES-with-R.Rmd"               "LES home.png"                
## [19] "LES home2.png"                "LES home3.png"               
## [21] "LES logo with Text.png"       "LES logo.png"                
## [23] "LES with R.Rmd"               "LES_home_HD.png"             
## [25] "LES_logo.pptx"                "les_logo_modified.png"       
## [27] "Ox_evidence_level.png"        "Pump_Handle_-_John_Snow_.jpg"
## [29] "research_process.png"         "rstudio_consoles.png"        
## [31] "Snow-cholera-map-1.jpg"       "SPSS data for teaching"      
## [33] "study_designs.png"
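If you want to try these commands without touching your own folders, the following sketch uses a temporary folder. The folder name LES_demo and the file name mtcars_head.csv are made up for illustration.

```r
# Remember the current working directory so it can be restored afterwards
old_wd <- getwd()

# Create a demo folder inside R's temporary directory and move into it
demo_dir <- file.path(tempdir(), "LES_demo")
dir.create(demo_dir, showWarnings = FALSE)
setwd(demo_dir)

# A file saved without a path goes into the working directory
write.csv(head(mtcars), "mtcars_head.csv")

# List the files in the working directory
files <- dir()
files

# Go back to where we started
setwd(old_wd)
```

Restoring the original working directory at the end keeps the rest of your session unaffected.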

Install packages

An R package is a bundle of functions for a specific task. For example, the functions for calculating an average (mean) and for getting the number of rows and columns (dim) are in the package base, while those for calculating a variance (var) and conducting a t-test (t.test) or a Chi-square test (chisq.test) are in the package stats. Packages for basic data handling and analysis, called the R core packages, are already installed. However, for more specific tasks you need to install additional packages. Installing a package is very simple: use the code install.packages("package name"). For example, the command to install the package ggplot2 for fancy plots is below.

# install.packages("ggplot2", repos = "https://cloud.r-project.org")

Note that I have added the extra argument repos = "https://cloud.r-project.org" inside the brackets. This is because I wrote this document using R Markdown. Remove the leading # to run the command. You can install multiple packages using a single command.
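As a rough sketch of installing several packages at once, you can pass a vector of names to install.packages(). The example below first checks which packages are missing; the three package names are only examples, and the install lines are left commented out so that nothing is downloaded by accident.

```r
# Packages we would like to have (names chosen only as an example)
wanted <- c("ggplot2", "readxl", "psych")

# requireNamespace() returns TRUE if a package is installed, without attaching it
is_installed <- vapply(wanted, requireNamespace, logical(1), quietly = TRUE)

# Keep only the names of the packages that are not installed yet
to_install <- wanted[!is_installed]
to_install

# Install all missing packages with a single command (remove the # to run it)
# install.packages(to_install, repos = "https://cloud.r-project.org")

# Several packages can also be named directly in one call:
# install.packages(c("ggplot2", "readxl", "psych"), repos = "https://cloud.r-project.org")
```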

Loading packages

Once a package is installed on your machine, you do not need to install it again. You just need to load the package once each time you close and reopen RStudio. A package is loaded using the function library(). I have loaded most of the packages I need below.

library(ggplot2)
library(tidyverse)
library(foreign)
library(psych)
library(readxl)
library(MASS)

Any function of a package can be used even without loading the package; however, you need to write a bit of extra code. For example, ggplot(), the main function for any plot in the ggplot2 package, can be used without loading the package by writing ggplot2::ggplot().
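The same package::function() pattern works for any installed package. As a small sketch that runs without installing anything, the example below uses the tools package, which ships with R but is not loaded by default; the file name is made up.

```r
# file_ext() lives in the tools package; we call it without library(tools)
ext <- tools::file_ext("Birthweight_example.csv")
ext

# The pattern is the same for functions in packages that are already loaded
avg <- base::mean(c(2, 4, 6))
avg
```

Writing the package name explicitly is also helpful when two loaded packages contain functions with the same name.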

The following packages would be useful to install after you install R and RStudio for the first time.

Packages	Packages	Packages	Packages
MASS	epiDisplay	maps	ggpmisc
readxl	foreign	mice	margins
tidyverse	summarytools	gtools	ggeffects
psych	ggplot2	dplyr	lubridate

Loading the data into `R`

To start exploring, you first need to load a dataset into R. The command for loading a dataset depends on its format. Frequently used commands are load() and readRDS() for R data files, read.csv() for a .csv file, read_xlsx() and read_excel() for an Excel file, read.spss() for an SPSS file and read.dta() for a Stata file. Here read_xlsx() and read_excel() come from the package readxl, and read.spss() and read.dta() from the package foreign. These are the common file types. Other, less common file types exist as well, and R has a way to read almost any of them.

Throughout this course we will be using two datasets, Birthweight_reduced and Diet. In my working directory these data files are stored in different formats such as SPSS, CSV and Excel. Let’s start with the Birthweight_reduced dataset. I will load all three versions.

setwd("C:/Users/mmoinuddin/OneDrive - UCLan/LES")
df_csv <- read.csv("SPSS data for teaching/Birthweight_reduced_kg_R.csv")
df_spss <- read.spss("SPSS data for teaching/Birthweight_reduced_kg_SPSS.sav", to.data.frame = TRUE)
df_excel <- read_xlsx("SPSS data for teaching/Birthweight_reduced_kg_R.xlsx")
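The code above needs the course data files. To practise loading data without them, here is a self-contained sketch that first writes a tiny made-up dataset to a temporary folder and then reads it back; the file names and values are invented for illustration.

```r
# A tiny made-up dataset
example_data <- data.frame(id = 1:3, weight_kg = c(3.2, 2.9, 4.1))

# File paths inside R's temporary directory
tmp_csv <- file.path(tempdir(), "example_data.csv")
tmp_rds <- file.path(tempdir(), "example_data.rds")

# Save the dataset in two formats
write.csv(example_data, tmp_csv, row.names = FALSE)
saveRDS(example_data, tmp_rds)

# Load each file with the matching command
from_csv <- read.csv(tmp_csv)
from_rds <- readRDS(tmp_rds)

# Both copies have 3 rows and 2 columns
dim(from_csv)
dim(from_rds)
```

Setting row.names = FALSE in write.csv() stops R from writing the row numbers as an extra column.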

When reading data from an Excel file, you can specify the sheet name. You can also specify the number of rows you want to load.
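A minimal sketch of both options is given below. It assumes the package readxl is installed and uses the example workbook datasets.xlsx that comes with readxl, not the course data; the sheet name mtcars belongs to that example workbook.

```r
library(readxl)

# Path to an example Excel workbook shipped with the readxl package
path <- readxl_example("datasets.xlsx")

# See which sheets the workbook contains
excel_sheets(path)

# Read only the first 5 data rows of the sheet called "mtcars"
first_rows <- read_excel(path, sheet = "mtcars", n_max = 5)
first_rows
```

The sheet can also be given by its position, for example sheet = 2.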

Understanding the dataset

To get to know a dataset a little, three useful commands are dim(), which gives the number of rows and variables (columns); names(), which gives the variable names; and head(), which shows the first few rows of the dataset. Let’s see what these commands give us.

dim(df_csv)

## [1] 42 16

names(df_csv)

##  [1] "ID"          "Length"      "Birthweight" "Headcirc"    "Gestation"  
##  [6] "smoker"      "mage"        "mnocig"      "mheight"     "mppwt"      
## [11] "fage"        "fedyrs"      "fnocig"      "fheight"     "lowbwt"     
## [16] "mage35"

head(df_csv)

##     ID Length Birthweight Headcirc Gestation smoker mage mnocig mheight mppwt
## 1 1360     56        4.55       34        44      0   20      0     162    57
## 2 1016     53        4.32       36        40      0   19      0     171    62
## 3  462     58        4.10       39        41      0   35      0     172    58
## 4 1187     53        4.07       38        44      0   20      0     174    68
## 5  553     54        3.94       37        42      0   24      0     175    66
## 6 1636     51        3.93       38        38      0   29      0     165    61
##   fage fedyrs fnocig fheight lowbwt mage35
## 1   23     10     35     179      0      0
## 2   19     12      0     183      0      0
## 3   31     16     25     185      0      0
## 4   26     14     25     189      0      0
## 5   30     12      0     184      0      0
## 6   31     16      0     180      0      0
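You can practise the same commands on a built-in dataset. The sketch below uses iris, which ships with R; str() is an extra command, not used above, that shows the type of each variable.

```r
# Number of rows, then number of columns
size <- dim(iris)
size

# Variable names
names(iris)

# head() shows the first six rows by default; a second argument changes that
head(iris, 3)

# str() gives a compact overview of the variables and their types
str(iris)
```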

Data formats in `R`

There are multiple types of data structures that we usually use in R. In the R language each of these is called an object. An object can be a list, matrix, tibble, data.frame etc. A list can contain several other types of objects. There are different ways of accessing a specific element of each. As you go along using them, you will become an expert.
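As a short sketch of what accessing a specific element looks like, the example below builds a vector, a matrix, a data.frame and a list from made-up values and picks elements out of each. All object names are invented for illustration.

```r
# Vector: pick an element by its position
v <- c(10, 20, 30)
v[2]

# Matrix: 2 rows and 3 columns, filled column by column; pick by [row, column]
m <- matrix(1:6, nrow = 2)
m[2, 3]

# Data frame: pick a column with $, or a cell with [row, column name]
df <- data.frame(id = 1:3, weight = c(3.2, 2.9, 4.1))
df$weight
df[2, "weight"]

# Keep only the rows that meet a condition
heavy <- df[df$weight > 3, ]
heavy

# List: can hold objects of different types; pick with $ or [[ ]]
lst <- list(numbers = v, table = df)
lst$numbers
lst[["table"]]$id
```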

Frequently used commands

I believe it will be very useful to have a list of commands usually needed for data manipulation, management and analysis in a biostatistics setting. The table below lists these commands and their uses.

Command	Use
Data management
`help()`	Obtain documentation for a given R command
`example()`	View some examples on the use of a command
`c()`	Enter data manually to a vector in R
`seq()`	Make arithmetic progression vector
`rep()`	Make vector of repeated values
`data()`	Load a built-in dataset (as a data.frame)
`View()`	View dataset in a spreadsheet-type format
`load()`	Load an existing .Rdata file
`readRDS()`	Load an existing .RDS file
`read.csv()`	Load an existing CSV file
`read.spss()`	Load an existing SPSS file
`read_xls()`	Load an existing Excel (.xls) file
`read_xlsx()`	Load an existing Excel (.xlsx) file
`read_excel()`	Load an existing Excel file
`write.csv()`	Save a working dataset as a CSV file
`install.packages()`	Install new packages
`library()`	Load an R package already installed
`require()`	Load an R package already installed
`dim()`	See number of rows/cols of data.frame
`names()`	See column/variable names of data.frame
`length()`	Give length of a vector
`ls()`	Lists memory contents
`rm()`	Removes an item from memory
`as.numeric()`	Convert string to numeric
`as.data.frame()`	Convert a matrix into a data frame
`factor()`	Create/replace/label a factor variable
`ordered()`	Create/replace/label an ordered variable
`mutate()`	Create new variable from existing one
`if_else()`	Create new variable using condition
Statistics
`table()`	Get a frequency table for a variable
`addmargins()`	Add marginal sums to an existing table
`prop.table()`	Compute proportions from a contingency table
`summary()`	Get summary statistics for a variable
`describe()`	Get specific summary statistics
`describeBy()`	Get specific summary statistics by groups
`xtabs()`	Cross-tabulation tables using formulas
`mean()`	Calculate mean for a variable
`median()`	Calculate median for a variable
`var()`	Calculate variance of values in vector
`sd()`	Calculate sd of values in vector
`sum()`	Add up all values in a vector
`sample()`	Take a sample from a vector of data
`cor()`	Calculate correlation between two variables
`prop.test()`	Inference for 1 proportion using normal approx
`t.test()`	Carries out a student t-test
`chisq.test()`	Carries out a chi-square test
`fisher.test()`	Carries out a Fisher exact test
`aov()`	Perform ANOVA using formula
`wilcox.test()`	Mann-Whitney U test for independent samples
`wilcox.test(paired = TRUE)`	Wilcoxon Signed Rank test for paired samples
`kruskal.test()`	Kruskal Wallis test
`lm()`	Linear regression model
`glm()`	Generalized linear model (linear/logistic/Poisson reg)
Visualise
`hist()`	Create a histogram
`boxplot()`	Create a boxplot
`plot()`	Create a scatter plot (many more . . . )
`abline(a,b)`	Add a line on a scatter plot
`pairs()`	Scatter plot matrix
`pdf()`	Save graph as pdf file
`jpeg()`, `png()`	Save graph as jpeg or png file
`dev.off()`	Necessary after pdf(), jpeg() and png() commands
`ggplot()`	Generate fancy graphs of many types

This list is not exhaustive; rather, it is only the beginning. To get help for any command, use help() or ?. For example, to get help for the command dim, type help(dim) or ?dim and hit the Enter key. Go to the bottom of the help page to see multiple examples. Alternatively, Google it. Let us see some practical examples in the classroom.
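To give a flavour of how a few of these commands fit together, here is a minimal sketch using the built-in mtcars dataset rather than the course data. The choice of variables and the output file name are assumptions made only for illustration.

```r
# Work on a copy of the built-in mtcars dataset
cars <- mtcars

# Label the transmission variable (0 = automatic, 1 = manual) as a factor
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("automatic", "manual"))

# Frequency table, with a total and with proportions
tab <- table(cars$am)
addmargins(tab)
prop.table(tab)

# Summary statistics for miles per gallon
summary(cars$mpg)
sd(cars$mpg)

# t-test comparing mean mpg between the two transmission groups
tt <- t.test(mpg ~ am, data = cars)
tt

# Linear regression of mpg on car weight
fit <- lm(mpg ~ wt, data = cars)
coef(fit)

# Save a histogram as a PDF file in the temporary directory
out_file <- file.path(tempdir(), "mpg_hist.pdf")
pdf(out_file)
hist(cars$mpg, main = "Miles per gallon", xlab = "mpg")
dev.off()
```

Without the final dev.off() the PDF file would stay open and incomplete.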
