R and RStudio
R is free software for statistics and data science that provides a handy way of doing day-to-day data handling such as cleaning, editing, analysing, visualising and communicating outputs. You can download R from CRAN, the Comprehensive R Archive Network, at https://cloud.r-project.org. It works on Windows, macOS and Linux operating systems; you just need to download the right version for your machine. A new version of R is released every year, with some further updates during the year. This module was prepared using version R 4.3.0.
RStudio is an integrated development environment (IDE) for R programming, which is also available free of cost at https://posit.co/download/rstudio-desktop/. When we talk about using R, we are usually working in RStudio while R runs in the background.
There are many advantages of using R. Some of them are listed below.
- Reproducibility: Once you have written the R code for a data analysis, a fancy graph or a function, and saved it in a safe directory, you can reproduce the same outputs even years later. That is the beauty of coding in R.
- Help & support: Comprehensive online help is available. You can find example code for almost anything you want to do in R, and often you only need to copy the right code and adapt it to your data. Some example websites where you can find help are below.
  - LES - https://letsenjoystatistics.com/
  - Quick R - https://www.statmethods.net/
  - STHDA - http://www.sthda.com/english/
  - https://stackoverflow.com/
  - … many more.
- Packages ready on demand: Thousands of R packages are available to help you get your task done.
- Functionality: Besides statistics and data analysis, R can be used for many other purposes such as writing reports, articles, theses and books, preparing PowerPoint slides, creating web applications, checking Facebook status, sending email etc. So, it is almost an all-in-one package. The frequently used interfaces are R Markdown and R Sweave. Other programming languages such as Python, C++ and SQL can be called and used in R. So, it is very versatile.
- Example datasets: R has a large number of example datasets that are built in with its packages. You can access any one of them to practise. To get the list of built-in datasets just type data() on the console and hit the enter button. Of course, more datasets will come along with the packages you install. Some commands are very helpful for exploring the built-in datasets. For example, help(package="datasets") will provide the documentation of the datasets, data(package="ggplot2") gives the list of the datasets built in with the package ggplot2, data(package = .packages(all.available = TRUE)) gives the list of all datasets from all installed packages, and dplyr::storms gives access to the dataset storms built in with the package dplyr.
- Handling multiple datasets: R lets you load many datasets at the same time and work on them together.
- Zero money: Most importantly, all of this can be done at a cost of ZERO.
Therefore, it is worth investing some time in learning a program like R to become an efficient data and web handler.
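The dataset commands mentioned in the list above can be tried out directly in the console. The short sketch below uses only the datasets package, which comes with R; the choice of mtcars as the example dataset and the object name available are only illustrative.

```r
# List the datasets that ship with the 'datasets' package
available <- data(package = "datasets")$results
head(available[, c("Item", "Title")])

# A built-in dataset can be used directly by its name
dim(mtcars)       # number of rows and columns
head(mtcars, 3)   # first three rows
```

The same pattern works for any installed package: replace "datasets" with the package name.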
Get started
To start using R, simply download the right version for your machine and install it. In most cases, you need to install R first and then install RStudio; however, in some cases (for example, when downloading from the university software centre) you may need to install RStudio only. Once installed, just double-click the RStudio desktop icon to open it. You will see a pane layout (various windows) as below.
In the above figure, the two most important windows are marked (in rectangles). The top one (source) is the input window and the one below it (console) is both an input and an output window. The red-circled tabs Plots and Help are also frequently used. The pane layout can be rearranged by following Tools -> Global Options -> Pane Layout. I prefer showing two panes only, source on the left and console on the right. You may not see the source pane when you open RStudio for the first time. To get it, simply click the little green plus sign in the top left corner and select R Script.
Typing in R
Typing in R is easy and simple. You can type either in the console or in the source; however, I recommend typing most of your code in the source, which you can save and use anytime later. The code written in the source is called the R script. You can save the console as well; however, it is not efficient because every saved console takes up a lot of space on your computer.
Coding in R is object oriented. That means you can treat data, graphs and the output of an analysis as objects and assign a name to each. This is really handy because you can call an object by its name anytime. Assign a name to an object using an equals sign (the arrow <- does the same job and is used later in this module). Let’s see a simple code example.
# Creating a vector of values 1 to 5 and assign a name "x"
x = c(1:5)
# Displaying the x
print(x)
## [1] 1 2 3 4 5
# Creating a vector of values from 1 to 2 with an increment of 0.2 with a name "y"
y = seq(1,2, by=0.2)
# Displaying "y"
print(y)
## [1] 1.0 1.2 1.4 1.6 1.8 2.0
# Combining x and y to create a matrix with 2 columns
# Note: x has 5 values but y has 6, so R recycles x (with a warning) and row 6 repeats x = 1
dt = matrix(cbind(x,y), ncol=2)
# Naming the column names x and y
colnames(dt) = c('x', 'y')
# Display the matrix
print(dt)
## x y
## [1,] 1 1.0
## [2,] 2 1.2
## [3,] 3 1.4
## [4,] 4 1.6
## [5,] 5 1.8
## [6,] 1 2.0
To add notes or comments to the R script, use the # sign.
Setting the working directory
The working directory is a folder on your computer where your datasets are stored. Any results, including new datasets and graphs, are saved into the current working directory unless you give a different path.
setwd("C:/Users/mmoinuddin/OneDrive - UCLan/LES")
The folder LES is my working directory, where I store my datasets so that I can call them anytime I like. You can copy and paste the code and change the path to your own. Once you have set your working directory, you can check it using the following command.
getwd()
## [1] "C:/Users/mmoinuddin/OneDrive - UCLan/LES"
You can get the list of files stored in the current working directory
by simply typing the dir()
command in the
console
and hitting the enter
button on your
keyboard.
dir()
## [1] "data spread _AE.docx" "data_types.png"
## [3] "diet article _AE.docx" "evidence_pyramid.png"
## [5] "FHS_1.png" "FHS_2.png"
## [7] "google_cough.png" "google_flu1.jpeg"
## [9] "google_flu2.jpeg" "google_flu21.jpeg.png"
## [11] "Hypothesis T1 _AE.rtf" "john_map2.jpg"
## [13] "john_snow_map_tubeW.png" "John_Snow_Pub_1.jpg"
## [15] "john_snow_waterpomp.jpg" "LES-with-R.html"
## [17] "LES-with-R.Rmd" "LES home.png"
## [19] "LES home2.png" "LES home3.png"
## [21] "LES logo with Text.png" "LES logo.png"
## [23] "LES with R.Rmd" "LES_home_HD.png"
## [25] "LES_logo.pptx" "les_logo_modified.png"
## [27] "Ox_evidence_level.png" "Pump_Handle_-_John_Snow_.jpg"
## [29] "research_process.png" "rstudio_consoles.png"
## [31] "Snow-cholera-map-1.jpg" "SPSS data for teaching"
## [33] "study_designs.png"
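To see how the working directory affects where files are written and found, here is a small sketch that uses a temporary folder so that nothing in your own folders is changed. The file name demo_mtcars.csv and the object names are made up for illustration.

```r
old_wd <- getwd()           # remember the current working directory
setwd(tempdir())            # switch to a temporary folder

# A file written with a bare file name lands in the working directory
write.csv(head(mtcars), "demo_mtcars.csv", row.names = FALSE)

# dir() accepts a pattern, here only files ending in .csv
csv_files <- dir(pattern = "\\.csv$")
print(csv_files)

demo_path <- file.path(getwd(), "demo_mtcars.csv")  # full path of the new file
setwd(old_wd)               # go back to where we started
```

Filtering with a pattern is handy when the folder holds many files, as in the listing above.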
Install packages
An R package consists of a bundle of functions to be used for a specific task. For example, the functions for calculating the average (mean) and for getting the number of rows and columns (dim) are in the package base, while those for calculating the variance (var), conducting a t-test (t.test) and a Chi-square test (chisq.test) are in the package stats. Packages for basic data handling and analysis mostly come already installed; these are called R core packages. However, for a specific task you may need to install additional packages. Installing a package is very simple: use the code install.packages("package name"). For example, to install the package ggplot2 for fancy plots the command is below.
# install.packages("ggplot2", repos = "https://cloud.r-project.org")
Note that I have added some extra information, repos = "https://cloud.r-project.org", inside the brackets. This is because I have written this document using the R Markdown feature; in an interactive session you can usually leave it out. Remove the leading # to run the command. You can install multiple packages using a single command.
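Installing several packages with a single command can be sketched as follows. The package names are taken from the list of suggested packages in this module; the object names pkgs and to_install are only illustrative, and the install line is left commented out so that nothing is downloaded until you choose to run it.

```r
# Packages we would like to have
pkgs <- c("ggplot2", "readxl", "psych")

# Keep only those that are not installed yet
to_install <- setdiff(pkgs, rownames(installed.packages()))
print(to_install)

# One call installs everything in the vector (remove the # to run it)
# install.packages(to_install, repos = "https://cloud.r-project.org")
```

Checking first avoids reinstalling packages that are already on your machine.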
Loading packages
Once a package is installed on your machine you do not need to install it again. You just need to load it once each time you shut down and reopen RStudio. A package can be loaded using the code library(). I have loaded most of my required packages below.
library(ggplot2)
library(tidyverse)
library(foreign)
library(psych)
library(readxl)
library(MASS)
Any function of a package can be used even without loading the package; however, you need to write a bit of extra code. For example, ggplot(), the main function for any plot in the ggplot2 package, can be used without loading the package by writing ggplot2::ggplot().
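As a small illustration of the package::function() form, the sketch below calls file_ext() from the tools package, which comes with R but is not loaded by default. The file name is just a text string here; no file is opened.

```r
# Call a function from a package that has not been loaded with library()
ext <- tools::file_ext("Birthweight_reduced_kg_R.csv")
print(ext)

# The same form works for loaded packages too and makes the source explicit
avg <- base::mean(c(2, 4, 6))
print(avg)
```

Writing the package name in front is also useful when two loaded packages contain functions with the same name.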
The following packages would be useful to install after you install R and RStudio for the first time.
Packages | Packages | Packages | Packages |
---|---|---|---|
MASS | epiDisplay | maps | ggpmisc |
readxl | foreign | mice | margins |
tidyverse | summarytools | gtools | ggeffects |
psych | ggplot2 | dplyr | lubridate |
Loading the data into R
To start exploring, you first need to load a dataset into R. The command for loading a dataset depends on the format of the dataset. Frequently used commands are load() and readRDS() for an R data file, read.csv() for a .csv file, read_xlsx() and read_excel() (from the package readxl) for an Excel file, and read.spss() and read.dta() (from the package foreign) for SPSS and Stata files respectively. These are the common file types. Other, less common file types exist as well, and R has options for reading almost any kind of data.
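To try some of these reading commands without needing any data file of your own, the sketch below first writes a built-in dataset to temporary files and then reads them back. The object names and the use of the built-in dataset women are only illustrative.

```r
# Temporary file names, so nothing is written into your own folders
csv_file <- tempfile(fileext = ".csv")
rds_file <- tempfile(fileext = ".rds")

# Save the built-in 'women' dataset in two formats
write.csv(women, csv_file, row.names = FALSE)
saveRDS(women, rds_file)

# Read both files back into R
women_csv <- read.csv(csv_file)
women_rds <- readRDS(rds_file)

dim(women_csv)       # rows and columns of the CSV version
head(women_rds, 3)   # first three rows of the RDS version
```

An .rds file keeps the R object exactly as it was saved, whereas a CSV file is plain text that other programs can open too.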
Throughout this course we will be using two datasets, Birthweight_reduced and Diet. In my working directory these data files are stored in different formats such as SPSS, CSV and Excel. Let’s start with the Birthweight_reduced dataset. I will load all three versions.
setwd("C:/Users/mmoinuddin/OneDrive - UCLan/LES")
df_csv <- read.csv("SPSS data for teaching/Birthweight_reduced_kg_R.csv")
df_spss <- read.spss("SPSS data for teaching/Birthweight_reduced_kg_SPSS.sav", to.data.frame = TRUE)
df_excel <- read_xlsx("SPSS data for teaching/Birthweight_reduced_kg_R.xlsx")
While reading data from an Excel file you can specify the sheet name. You can also specify the number of rows you want to load.
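A minimal sketch of both options, assuming the package readxl is installed. It uses datasets.xlsx, an example workbook that comes with readxl, rather than the course data; sheet and n_max are arguments of read_excel().

```r
library(readxl)

# Path to an example workbook shipped with readxl
xlsx_path <- readxl_example("datasets.xlsx")

# See which sheets the workbook contains
excel_sheets(xlsx_path)

# Read only the first 5 data rows of the sheet called "mtcars"
cars_top <- read_excel(xlsx_path, sheet = "mtcars", n_max = 5)
dim(cars_top)
```

For your own file, replace xlsx_path with the path to the file and "mtcars" with the name of the sheet you need.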
Understanding the dataset
To get to know a dataset a little, three useful commands are dim() - to see how many rows and variables there are (rows first, then columns), names() - to see the variable names, and head() - to see the first few rows of the dataset. Let’s see what these commands give us.
dim(df_csv)
## [1] 42 16
names(df_csv)
## [1] "ID" "Length" "Birthweight" "Headcirc" "Gestation"
## [6] "smoker" "mage" "mnocig" "mheight" "mppwt"
## [11] "fage" "fedyrs" "fnocig" "fheight" "lowbwt"
## [16] "mage35"
head(df_csv)
## ID Length Birthweight Headcirc Gestation smoker mage mnocig mheight mppwt
## 1 1360 56 4.55 34 44 0 20 0 162 57
## 2 1016 53 4.32 36 40 0 19 0 171 62
## 3 462 58 4.10 39 41 0 35 0 172 58
## 4 1187 53 4.07 38 44 0 20 0 174 68
## 5 553 54 3.94 37 42 0 24 0 175 66
## 6 1636 51 3.93 38 38 0 29 0 165 61
## fage fedyrs fnocig fheight lowbwt mage35
## 1 23 10 35 179 0 0
## 2 19 12 0 183 0 0
## 3 31 16 25 185 0 1
## 4 26 14 25 189 0 0
## 5 30 12 0 184 0 0
## 6 31 16 0 180 0 0
Data formats in R
There are multiple types of datasets that we usually use in R. In the R language such a thing is usually called an object. An object can be a list, matrix, tibble, data.frame etc. A list can contain several other types of objects. There are different ways of accessing a specific element of a dataset. As you go along using them you will become an expert.
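The sketch below shows three of these object types and a few common ways of accessing their elements. All object names and values are made up for illustration.

```r
# A matrix: all values have the same type; elements are accessed by [row, column]
m <- matrix(1:6, nrow = 2)
m[2, 3]             # value in row 2, column 3

# A data.frame: columns can have different types
dat <- data.frame(id = 1:3, group = c("a", "b", "a"))
dat$group           # one column by name
dat[2, "id"]        # row 2 of the column 'id'

# A list: can hold objects of different kinds, including the two above
lst <- list(numbers = m, table = dat, note = "hello")
lst$note            # element by name
lst[["table"]]$id   # a column of the data.frame stored inside the list

# class() tells you what kind of object you have
class(m)
class(dat)
class(lst)
```

Square brackets with a row and a column work for both matrices and data frames, while $ and [[ ]] pick out named elements.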
Frequently used commands
I believe it will be very useful to have a list of the commands usually needed for data manipulation, management and analysis in a Biostatistics setting. The table below lists these commands and what they are used for.
Command | Use |
---|---|
Data management | |
help() | Obtain documentation for a given R command |
example() | View some examples of the use of a command |
c() | Enter data manually into a vector in R |
seq() | Make an arithmetic progression vector |
rep() | Make a vector of repeated values |
data() | Load a built-in dataset (as a data.frame) |
View() | View a dataset in a spreadsheet-type format |
load() | Load an existing .Rdata file |
readRDS() | Load an existing .RDS file |
read.csv() | Load an existing CSV file |
read.spss() | Load an existing SPSS file |
read_xls() | Load an existing Excel (.xls) file |
read_xlsx() | Load an existing Excel (.xlsx) file |
read_excel() | Load an existing Excel file |
write.csv() | Save a working data file as a CSV file |
install.packages() | Install new packages |
library() | Load an R package already installed |
require() | Load an R package already installed |
dim() | See number of rows/cols of a data.frame |
names() | See column/variable names of a data.frame |
length() | Give the length of a vector |
ls() | List the objects in memory |
rm() | Remove an object from memory |
as.numeric() | Convert string to numeric |
as.data.frame() | Convert a matrix into a data frame |
factor() | Create/replace/label a factor variable |
ordered() | Create/replace/label an ordered variable |
mutate() | Create a new variable from an existing one |
if_else() | Create a new variable using a condition |
Statistics | |
table() | Get a frequency table for a variable |
addmargins() | Add marginal sums to an existing table |
prop.table() | Compute proportions from a contingency table |
summary() | Get summary statistics for a variable |
describe() | Get specific summary statistics |
describeBy() | Get specific summary statistics by groups |
xtabs() | Cross-tabulation tables using formulas |
mean() | Calculate the mean of a variable |
median() | Calculate the median of a variable |
var() | Calculate the variance of values in a vector |
sd() | Calculate the standard deviation of values in a vector |
sum() | Add up all values in a vector |
sample() | Take a sample from a vector of data |
cor() | Calculate the correlation between two variables |
prop.test() | Inference for 1 proportion using normal approx |
t.test() | Carry out a Student's t-test |
chisq.test() | Carry out a chi-square test |
fisher.test() | Carry out a Fisher exact test |
aov() | Perform ANOVA using a formula |
wilcox.test() | Mann-Whitney U test for independent samples |
wilcox.test() | Wilcoxon Signed Rank test for paired samples |
kruskal.test() | Kruskal Wallis test |
lm() | Linear regression model |
glm() | Generalized linear model (linear/logistic/Poisson reg) |
Visualise | |
hist() | Create a histogram |
boxplot() | Create a boxplot |
plot() | Create a scatter plot (many more . . . ) |
abline(a,b) | Add a line to a scatter plot |
pairs() | Scatter plot matrix |
pdf() | Save a graph as a PDF file |
jpeg() | Save a graph as a JPEG file (use png() for a PNG file) |
dev.off() | Necessary after pdf() and jpeg() commands |
ggplot() | Generate fancy graphs of many types |
This list is not exhaustive; rather, it is just the beginning. To get help for any command, use help() or ?. For example, to get help for the command dim, type help(dim) or ?dim and hit the enter button. Go to the bottom of the help page to see multiple examples. Alternatively, Google it. Let us see some practical examples in the classroom.
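As a warm-up before the classroom practical, the sketch below strings together a few of the frequently used commands using the built-in dataset mtcars. The choice of dataset, variables and file name is only illustrative.

```r
# Frequency table and proportions for the number of cylinders
tab <- table(mtcars$cyl)
print(tab)
prop.table(tab)

# Summary statistics for fuel consumption (miles per gallon)
summary(mtcars$mpg)
sd(mtcars$mpg)

# t-test comparing mpg between automatic (am = 0) and manual (am = 1) cars
tt <- t.test(mpg ~ am, data = mtcars)
print(tt)

# Simple linear regression of mpg on car weight
fit <- lm(mpg ~ wt, data = mtcars)
coef(fit)

# Save a histogram as a PDF file; dev.off() closes the file
pdf_file <- file.path(tempdir(), "mpg_hist.pdf")
pdf(pdf_file)
hist(mtcars$mpg, main = "Miles per gallon", xlab = "mpg")
dev.off()
```

The PDF is written to a temporary folder here; give a file name of your own to save it in your working directory instead.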