R and RStudio
R is free software for statistics and data science that
provides a handy way of doing day-to-day data handling such as cleaning,
editing, analysing and visualising data and communicating the outputs. You can
download R from CRAN, the Comprehensive
R Archive Network, https://cloud.r-project.org. It works on Windows, macOS
and Linux operating systems. You just need to download the right version.
A new version of R is released every year, with some smaller
updates over the year. This module is prepared based on the version
R 4.3.0.
RStudio is an integrated development environment (IDE)
for R programming which is also available free of cost at
https://posit.co/download/rstudio-desktop/.
When we talk about using R, we basically work in RStudio while
R is working in the background.
There are many advantages of using R. Some of them are listed
below.
- Reproducibility: In R, once you write the code for a data analysis, for producing a fancy graph or for a function, and save it in a safe directory, you can reproduce the same outputs even years later. That is the beauty of coding in R.
- Help & support: Comprehensive online help is available. You can find example code for almost anything you want to do using R, and often you only need to copy the right code and adapt it. Some example websites where you can find help are below.
  - LES - https://letsenjoystatistics.com/.
  - Quick R - https://www.statmethods.net/.
  - STHDA - http://www.sthda.com/english/.
  - https://stackoverflow.com/.
  - … many more.
- Package ready on demand: Thousands of R packages are available to help you get your task done.
- Functionality: Besides doing statistics and data analysis, R can be used for many other purposes such as writing a report, an article, a thesis or a book, preparing PowerPoint slides, creating web applications, checking Facebook status, sending email etc. So, it is almost an all-in-one package. The frequently used interfaces are R Markdown and R Sweave. Other programming languages such as Python, C++ and SQL can be called and used in R. So, it provides versatile opportunities.
- Example datasets: R has a huge number of example datasets that are built in with the packages. You can access any one of them and practise. To get the list of built-in datasets just type data() on the console and hit the enter button. Of course, more datasets will come along with the packages you install. Some commands are very helpful for exploring the built-in datasets. For example, help(package="datasets") provides the documentation of the datasets, data(package="ggplot2") gives the list of the datasets built in with the package ggplot2, data(package = .packages(all.available = TRUE)) gives the list of all datasets from all installed packages, and dplyr::storms gives access to the dataset storms built in with the package dplyr.
- Handling multiple datasets: R provides an opportunity to handle multiple datasets at the same time. So, you can load many datasets at once and work on them together.
- Zero money: Most importantly, all of this can be done at ZERO cost.
Therefore, it is worth investing some time to learn a program like
R to handle data and web content efficiently.
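The dataset commands mentioned in the list above can be tried straight away. The following is a minimal sketch that only uses the `datasets` package shipped with R; the object names `available` and `cars_data` are made up for illustration.

```r
# List the datasets shipped with the 'datasets' package.
# data() returns an object whose $results matrix has one row per dataset.
available <- data(package = "datasets")$results
head(available[, c("Item", "Title")])

# A built-in dataset can be used directly by name, or via package::name
cars_data <- datasets::mtcars
dim(cars_data)       # 32 rows and 11 columns
head(cars_data, 3)   # first three rows
```

The same pattern should work for other installed packages by changing the `package` argument.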
Get started
To start using R simply download the right version for
your machine and install it. In most cases, you need to install
R first and then install RStudio; however, in
some cases, for example when downloading from the university software centre, you may
need to install RStudio only. Once installed, just double
click the RStudio desktop icon to open it. You will see
a pane layout (various windows) as below.
In the above figure, the two most important windows are marked (with
rectangles). The top one (source) is the input window and
the one below (console) is both an input and an output window. The
red circled tabs Plots and Help are also
frequently used. The pane layout can be rearranged by simply following
Tools -> Global Options -> Pane Layout. I prefer
showing two panes only, source on the left and
console on the right. You may not see the
source when you open RStudio for the first
time. To get it, simply click the little green plus sign in
the top left corner and select R Script.
Typing in R
Typing in R is easy and simple. You can type either in the
console or in the source; however, I recommend
typing most of the code in the source, which you can save
and use anytime later. The code written in the source is called the
R script. You can save the console as well; however,
it is not so efficient because every time you save a
console it occupies lots of space on your computer.
Coding in R is object oriented. That means you can
treat the data, graphs or the output of an analysis as an object and assign
a name to it. This is really handy because you can call the object by
its name anytime. Assign a name to an object using an equal sign (the <-
operator does the same job). Let’s see a simple code example.
# Create a vector of the values 1 to 5 and assign it the name "x"
x = c(1:5)
# Display x
print(x)
## [1] 1 2 3 4 5
# Create a vector of values from 1 to 1.8 in steps of 0.2, named "y"
# (five values, so that x and y have the same length)
y = seq(1, 1.8, by = 0.2)
# Display y
print(y)
## [1] 1.0 1.2 1.4 1.6 1.8
# Combine x and y to create a matrix with 2 columns
dt = matrix(cbind(x, y), ncol = 2)
# Name the columns x and y
colnames(dt) = c('x', 'y')
# Display the matrix
print(dt)
##      x   y
## [1,] 1 1.0
## [2,] 2 1.2
## [3,] 3 1.4
## [4,] 4 1.6
## [5,] 5 1.8
To add notes or comments to the R script use the
# sign; R ignores everything after # on that line.
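The idea that the output of an analysis is itself an object can be illustrated with a short sketch. The vector `heights` and its values are invented for this example.

```r
# Both `=` and `<-` assign a name to an object
heights <- c(150, 160, 170, 180)

# The result of a calculation can itself be stored as an object ...
height_summary <- summary(heights)

# ... and reused later by name
print(height_summary)
height_summary["Mean"]   # extract a single element by its name

# ls() lists the objects currently held in the workspace
ls()
```

Storing results under a name like this means they can be printed, subset or passed to other functions later without repeating the calculation.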
Setting the working directory
The working directory is a folder on your computer where your datasets are stored. Any results, including new datasets and graphs, are saved into the current working directory unless you give a different path.
setwd("C:/Users/mmoinuddin/OneDrive - UCLan/LES")
The folder LES is my working directory where I store my
datasets, and I can call them anytime I like. You can copy and paste the
code and change the path to your own. Once you set your working
directory you can check it using the following command.
getwd()
## [1] "C:/Users/mmoinuddin/OneDrive - UCLan/LES"
You can get the list of files stored in the current working directory
by simply typing the dir() command in the
console and hitting the enter button on your
keyboard.
dir()
## [1] "data spread _AE.docx" "data_types.png"
## [3] "diet article _AE.docx" "evidence_pyramid.png"
## [5] "FHS_1.png" "FHS_2.png"
## [7] "google_cough.png" "google_flu1.jpeg"
## [9] "google_flu2.jpeg" "google_flu21.jpeg.png"
## [11] "Hypothesis T1 _AE.rtf" "john_map2.jpg"
## [13] "john_snow_map_tubeW.png" "John_Snow_Pub_1.jpg"
## [15] "john_snow_waterpomp.jpg" "LES-with-R.html"
## [17] "LES-with-R.Rmd" "LES home.png"
## [19] "LES home2.png" "LES home3.png"
## [21] "LES logo with Text.png" "LES logo.png"
## [23] "LES with R.Rmd" "LES_home_HD.png"
## [25] "LES_logo.pptx" "les_logo_modified.png"
## [27] "Ox_evidence_level.png" "Pump_Handle_-_John_Snow_.jpg"
## [29] "research_process.png" "rstudio_consoles.png"
## [31] "Snow-cholera-map-1.jpg" "SPSS data for teaching"
## [33] "study_designs.png"
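When the working directory holds many files, it can help to list only those of one type. Below is a small sketch, assuming a temporary folder as a stand-in for your own project folder; the file names are made up.

```r
# Remember the current working directory so it can be restored afterwards
old_wd <- getwd()

# tempdir() is used here only as a stand-in for your own project folder
demo_dir <- file.path(tempdir(), "wd_demo")
dir.create(demo_dir, showWarnings = FALSE)
setwd(demo_dir)

# Create two small files so there is something to list
writeLines("a,b\n1,2", "example.csv")
writeLines("some notes", "notes.txt")

dir()                                          # all files in the working directory
csv_files <- list.files(pattern = "\\.csv$")   # only files ending in .csv
print(csv_files)

# Go back to where we started
setwd(old_wd)
```

`dir()` and `list.files()` are the same function under two names, so the `pattern` argument works with either.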
Install packages
An R package consists of a bundle of functions to be
used for a specific task. For example, the functions for calculating the average
(mean) and for getting the number of rows and columns
(dim) are in the package base, while those for calculating the
variance (var), conducting a t-test
(t.test) and a Chi-square test (chisq.test) are
in the package stats. Packages for basic data handling
and analysis are mostly already installed; these are called
R core packages. However, for a specific task you may need to install some
additional packages. Installing a package is very simple. To install a
package use the code install.packages("package name"). For
example, to install the package ggplot2 for fancy plots the
command is below.
# install.packages("ggplot2", repos = "https://cloud.r-project.org")
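Note that I have added some extra information,
repos = "https://cloud.r-project.org", inside the brackets.
This is because I have written this document using
R Markdown, where a CRAN mirror cannot be chosen interactively; in an
interactive session you can usually leave it out. You can install multiple
packages using a single command by passing a vector of package names.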
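One possible way to install several packages in one go is sketched below. The package names in `wanted` are taken from this document; the install line is left commented out, as above, so that running the code does not download anything.

```r
# Packages we would like to have
wanted <- c("ggplot2", "readxl", "psych")

# installed.packages() returns a matrix whose row names are package names
missing_pkgs <- setdiff(wanted, rownames(installed.packages()))
print(missing_pkgs)

# Install everything that is missing with a single call
# install.packages(missing_pkgs, repos = "https://cloud.r-project.org")
```

Checking what is already installed first avoids reinstalling packages unnecessarily.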
Loading packages
Once a package is installed on your machine you do not need to
install it again. You just need to load the package once each time you
close and reopen RStudio. A package can be loaded
using the code library(). I have loaded most of my required
packages below.
library(ggplot2)
library(tidyverse)
library(foreign)
library(psych)
library(readxl)
library(MASS)
Any function of a package can be used even without loading the package; however,
you need to write a bit of extra code. For example,
ggplot(), the main function for any plot in the
ggplot2 package, can be used without loading the package
using the code ggplot2::ggplot().
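The `package::function()` form can be tried with a package that is always available. This is a minimal sketch; the vector `values` is invented, and the ggplot2 part only runs if that package happens to be installed.

```r
values <- c(2, 4, 4, 5, 7)

# 'stats' is loaded by default, so both calls give the same result;
# the package::function() form just makes the source explicit
sd_plain     <- sd(values)
sd_qualified <- stats::sd(values)
identical(sd_plain, sd_qualified)

# requireNamespace() checks whether a package is installed without attaching it
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(datasets::mtcars, ggplot2::aes(x = wt, y = mpg)) +
    ggplot2::geom_point()
  print(p)
}
```

Writing the package name explicitly is also a handy way to avoid confusion when two loaded packages contain functions with the same name.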
The following packages would be useful to install after you
install R and RStudio for the first time.
| Packages | Packages | Packages | Packages |
|---|---|---|---|
| MASS | epiDisplay | maps | ggpmisc |
| readxl | foreign | mice | margins |
| tidyverse | summarytools | gtools | ggeffects |
| psych | ggplot2 | dplyr | lubridate |
Loading the data into R
To start exploring, you first need to load a dataset into R.
The command for loading a dataset depends on the format of the
dataset. Frequently used commands for loading a dataset are
load() and readRDS() for an R
data file, read.csv() for a .csv file,
read_xlsx() and read_excel() (package readxl) for an
Excel file, read.spss() (package foreign) for an SPSS
file and read.dta() (package foreign) for a STATA file etc. These are the
common file types. There are some other file types as well which are less
common, and R has packages for reading most of them.
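To see how saving and loading fit together, here is a small round-trip sketch using only base R. The data frame `toy` and its values are made up, and temporary files stand in for files in your working directory.

```r
# A tiny data frame to save and reload
toy <- data.frame(id = 1:3, weight = c(3.2, 2.9, 3.6))

# Temporary file paths stand in for files in your working directory
csv_path <- tempfile(fileext = ".csv")
rds_path <- tempfile(fileext = ".rds")

# CSV: plain text, readable by almost any software
write.csv(toy, csv_path, row.names = FALSE)
toy_csv <- read.csv(csv_path)

# RDS: R's own format for a single object
saveRDS(toy, rds_path)
toy_rds <- readRDS(rds_path)

str(toy_csv)
identical(toy, toy_rds)
```

CSV is the safer choice for sharing data with other software, while RDS is convenient when the file will only be read back into R.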
Throughout this course we will be using two datasets,
Birthweight_reduced and Diet. In my
working directory these data files are stored in different formats such
as SPSS, CSV and Excel. Let’s
start with the Birthweight_reduced dataset. I will load all
three versions.
setwd("C:/Users/mmoinuddin/OneDrive - UCLan/LES")
df_csv <- read.csv("SPSS data for teaching/Birthweight_reduced_kg_R.csv")
df_spss <- read.spss("SPSS data for teaching/Birthweight_reduced_kg_SPSS.sav", to.data.frame = TRUE)
df_excel <- read_xlsx("SPSS data for teaching/Birthweight_reduced_kg_R.xlsx")
While reading data from an Excel file you can specify the
sheet name. You can also specify the number of rows you
want to load.
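A sketch of both options is given below. It assumes the readxl package is installed and uses one of the example workbooks that ship with it as a stand-in for your own file; the sheet name "mtcars" belongs to that example workbook, not to the course data.

```r
if (requireNamespace("readxl", quietly = TRUE)) {
  # readxl ships a few example workbooks; this one stands in for your own file
  xlsx_path <- readxl::readxl_example("datasets.xlsx")

  # See which sheets the workbook contains
  print(readxl::excel_sheets(xlsx_path))

  # Read only the first 5 data rows of the sheet named "mtcars"
  first_rows <- readxl::read_excel(xlsx_path, sheet = "mtcars", n_max = 5)
  print(dim(first_rows))
}
```

The `sheet` argument also accepts a sheet number, which can be handy when the sheet names are long.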
Understanding the dataset
To get to know a little about the dataset, three useful commands are
dim() - to know how many rows and variables there are,
names() - to know the variable names, and head() -
to see the first few rows of the dataset (six by default). Let’s see what these commands
give us.
dim(df_csv)
## [1] 42 16
names(df_csv)
## [1] "ID" "Length" "Birthweight" "Headcirc" "Gestation"
## [6] "smoker" "mage" "mnocig" "mheight" "mppwt"
## [11] "fage" "fedyrs" "fnocig" "fheight" "lowbwt"
## [16] "mage35"
head(df_csv)
## ID Length Birthweight Headcirc Gestation smoker mage mnocig mheight mppwt
## 1 1360 56 4.55 34 44 0 20 0 162 57
## 2 1016 53 4.32 36 40 0 19 0 171 62
## 3 462 58 4.10 39 41 0 35 0 172 58
## 4 1187 53 4.07 38 44 0 20 0 174 68
## 5 553 54 3.94 37 42 0 24 0 175 66
## 6 1636 51 3.93 38 38 0 29 0 165 61
## fage fedyrs fnocig fheight lowbwt mage35
## 1 23 10 35 179 0 0
## 2 19 12 0 183 0 0
## 3 31 16 25 185 0 1
## 4 26 14 25 189 0 0
## 5 30 12 0 184 0 0
## 6 31 16 0 180 0 0
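A few more commands are often useful for a first look at a dataset. The sketch below uses the built-in mtcars data as a stand-in for df_csv, since the Birthweight file is not available here.

```r
# mtcars is a built-in dataset used here in place of df_csv
df <- datasets::mtcars

str(df)              # type and first few values of every variable
tail(df, 3)          # last three rows
summary(df$mpg)      # minimum, quartiles, mean and maximum of one variable
colSums(is.na(df))   # number of missing values in each column
nrow(df); ncol(df)   # the two parts of dim() separately
```

Checking for missing values early is worthwhile because many functions handle them differently by default.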
Data formats in R
There are multiple types of datasets that we usually use in
R. In the R language each of these is usually called an
object. An object can be a list,
matrix, tibble, data.frame etc. A
list can contain several other types of objects. There are different
ways of accessing a specific element of a dataset. As you go along
using them you will become an expert.
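The different ways of accessing elements can be sketched with a few small objects in base R. All object names and values here are invented for illustration.

```r
# Vector: position in square brackets
v <- c(10, 20, 30)
v[2]                 # second element

# Matrix: [row, column]
m <- matrix(1:6, nrow = 2)
m[2, 3]              # row 2, column 3

# Data frame: columns by name with $, cells with [row, column]
d <- data.frame(id = 1:3, score = c(5.5, 6.1, 7.0))
d$score              # whole column
d[2, "score"]        # one cell

# List: can hold objects of different types; [[ ]] or $ extracts one of them
l <- list(numbers = v, table = d)
l$numbers
l[["table"]]$id

class(m); class(d); class(l)
```

A tibble (from the tidyverse) can be indexed in much the same way as a data frame.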
Frequently used commands
I believe it will be very useful to have a list of the commands usually needed for data manipulation, management and analysis in a Biostatistics setting. The table below lists these commands with a short description of their use.
| Command | Use |
|---|---|
| Data management | |
| help() | Obtain documentation for a given R command |
| example() | View some examples on the use of a command |
| c() | Enter data manually to a vector in R |
| seq() | Make arithmetic progression vector |
| rep() | Make vector of repeated values |
| data() | Load a built-in dataset (as a data.frame) |
| View() | View dataset in a spreadsheet-type format |
| load() | Load an existing .Rdata file |
| readRDS() | Load an existing .RDS file |
| read.csv() | Load an existing CSV file |
| read.spss() | Load an existing SPSS file |
| read_xls() | Load an existing Excel (.xls) file |
| read_xlsx() | Load an existing Excel (.xlsx) file |
| read_excel() | Load an existing Excel file |
| write.csv() | Save a working data file as a CSV file |
| install.packages() | Install new packages |
| library() | Load an R package already installed |
| require() | Load an R package already installed |
| dim() | See number of rows/cols of data.frame |
| names() | See column/variable names of data.frame |
| length() | Give length of a vector |
| ls() | List the objects held in memory |
| rm() | Remove an item from memory |
| as.numeric() | Convert string to numeric |
| as.data.frame() | Convert a matrix into a data frame |
| factor() | Create/replace/label a factor variable |
| ordered() | Create/replace/label an ordered variable |
| mutate() | Create new variable from existing one |
| if_else() | Create new variable using condition |
| Statistics | |
| table() | Get a frequency table for a variable |
| addmargins() | Add marginal sums to an existing table |
| prop.table() | Compute proportions from a contingency table |
| summary() | Get summary statistics for a variable |
| describe() | Get specific summary statistics |
| describeBy() | Get specific summary statistics by groups |
| xtabs() | Cross-tabulation tables using formulas |
| mean() | Calculate mean for a variable |
| median() | Calculate median for a variable |
| var() | Calculate variance of values in vector |
| sd() | Calculate standard deviation of values in vector |
| sum() | Add up all values in a vector |
| sample() | Take a sample from a vector of data |
| cor() | Calculate correlation between two variables |
| prop.test() | Inference for 1 proportion using normal approx |
| t.test() | Carries out a Student t-test |
| chisq.test() | Carries out a chi-square test |
| fisher.test() | Carries out a Fisher exact test |
| aov() | Perform ANOVA using formula |
| wilcox.test() | Mann-Whitney U test for independent samples |
| wilcox.test() | Wilcoxon Signed Rank test for paired samples |
| kruskal.test() | Kruskal Wallis test |
| lm() | Linear regression model |
| glm() | Generalized linear model (linear/logistic/Poisson reg) |
| Visualise | |
| hist() | Create a histogram |
| boxplot() | Create a boxplot |
| plot() | Create a scatter plot (many more . . . ) |
| abline(a,b) | Add a line on a scatter plot |
| pairs() | Scatter plot matrix |
| pdf() | Save graph as a PDF file |
| jpeg() | Save graph as a JPEG file (use png() for PNG) |
| dev.off() | Necessary after pdf() and jpeg() commands |
| ggplot() | Generate fancy graphs of many types |
This list is not exhaustive; rather, it is just the beginning. To get
help for any command, use help() or ?. For
example, to get help for the command dim, type
help(dim) or ?dim and hit the enter
button. Go to the bottom of the help page to see multiple examples.
Alternatively, Google it. Let us see some practical examples in the
classroom.
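A few of the commands from the table can be tried on a built-in dataset before the classroom session. This is only a sketch using mtcars (not the course data); in that dataset the variable am records the transmission type.

```r
df <- datasets::mtcars

# Frequency table of transmission type (am: 0 = automatic, 1 = manual)
counts <- table(df$am)
addmargins(counts)    # add the total
prop.table(counts)    # proportions instead of counts

# Summary statistics for fuel consumption (miles per gallon)
mean(df$mpg); median(df$mpg); sd(df$mpg)

# Two-sample t-test comparing mpg between transmission types
t_res <- t.test(mpg ~ am, data = df)
t_res$p.value

# Simple linear regression of mpg on weight
fit <- lm(mpg ~ wt, data = df)
coef(fit)
```

Each result is stored as an object, so the individual parts (such as the p-value or the coefficients) can be extracted by name.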


