# LES with R - summarise data

#### Dr M Moinuddin

#### October, 2023

## Birthweight dataset

Birthweight dataset contains information on newborn babies and their parents. There are 16 variables measured for 42 newborn babies and their parents. The description of the variables are in table below.Name | Variable label |
---|---|

ID | Baby number |

length | Length of baby (cm) |

Birthweight | Weight of baby (kg) |

headcirumference | Head Circumference |

Gestation | Gestation (weeks) |

smoker | Mother smokes 1 = smoker 0 = non-smoker |

motherage | Maternal age |

mnocig | Number of cigarettes smoked per day by mother |

mheight | Mothers height (cm) |

mppwt | Mothers pre-pregnancy weight (kg) |

fage | Father’s age |

fedyrs | Father’s years in education |

fnocig | Number of cigarettes smoked per day by father |

fheight | Father’s height (kg) |

lowbwt | Low birth weight, 0 = No and 1 = yes |

mage35 | Mother over 35, 0 = No and 1 = yes |

```
df_bw <- read.csv("SPSS data for teaching/Birthweight_reduced_kg_R.csv")
dim(df_bw)
```

`## [1] 42 16`

`names(df_bw)`

```
## [1] "ID" "Length" "Birthweight" "Headcirc" "Gestation"
## [6] "smoker" "mage" "mnocig" "mheight" "mppwt"
## [11] "fage" "fedyrs" "fnocig" "fheight" "lowbwt"
## [16] "mage35"
```

```
df_bw <- df_bw %>%
mutate(smoker = factor(smoker, levels=c(0,1), labels=c('Non-Smoker', 'Smoker')),
mage35 = factor(mage35, levels=c(0,1), labels=c('No', 'Yes')),
LBW = factor(lowbwt, levels=c(0,1), labels=c('Normal-BW', 'L-BW')))
```

## Summarise data

The basic steps of data summarisation are,

- Counting or frequency distribution
- Averaging - mean and median
- Measuring spread - scatteredness of the data
- Measuring association & dependence

However visualization is getting more popular in the age of data science.

### Counting or frequency distribution

The first principle of statistics and data mining is counting. Whether descriptive or inferential statistics (statistical decision making) counting is always there. A ‘Count’ is also known as frequency and count is a very good friend. The count is transformed into proportion or percentage to make it more readable and comparable.

We count the attributes or categories to describe the
**categorical variable**. For example:

- Among all the mothers, how many were smoker?
- How many of them got an university level degree?
- How many fathers were smoker?

In `R`

it is very easy to produce a table of counts. Let’s
count the number of smoker.

```
tab <- table(df_bw$smoker)
tab <- prop.table(tab)
print(tab)
```

```
##
## Non-Smoker Smoker
## 0.4761905 0.5238095
```

```
tab <- table(df_bw$mage35)
tab <- prop.table(tab)
print(tab)
```

```
##
## No Yes
## 0.9047619 0.0952381
```

```
# Count of two variables
tab <- table(df_bw$mage35, df_bw$LBW)
tab1 <- addmargins(tab)
print(tab1)
```

```
##
## Normal-BW L-BW Sum
## No 33 5 38
## Yes 3 1 4
## Sum 36 6 42
```

```
tab <- prop.table(tab, 1)
# Percentage
print(tab)*100
```

```
##
## Normal-BW L-BW
## No 0.8684211 0.1315789
## Yes 0.7500000 0.2500000
```

```
##
## Normal-BW L-BW
## No 86.84211 13.15789
## Yes 75.00000 25.00000
```

Looking at percentage makes more sense and easy to compare. This can be even better by visualising the counts in graph.

```
p_age_lbw <- ggplot(df_bw, aes(x=mage35, fill=LBW))+
geom_bar(position = 'stack', color='grey30', width = 0.75)+
theme_minimal()+
theme(axis.text = element_text(size=12),
axis.title = element_text(size=14),
legend.position = 'top',
legend.title = element_blank())+
labs(y="Count", x="Mom's age > 35y")
p_age_lbw
```

From the graph we see that the proportion of LBW among mothers is smaller irrespective of their age. For comparison purpose it’s maybe better to present in a different way.

```
p_age_lbw <- ggplot(df_bw, aes(x=mage35, fill=LBW))+
geom_bar(position = 'fill', color='grey30', width = 0.75)+
theme_minimal()+
theme(axis.text = element_text(size=12),
axis.title = element_text(size=14),
legend.position = 'top',
legend.title = element_blank())+
labs(y="Proportion", x="Mom's age > 35y")
p_age_lbw
```

Now, it is clearly visible that the proportion of LBW baby is higher among the mothers age > 35 years.

### Standard error of proportion

The proportion of LBW baby we have estimated in the last section is
based on the sample data (statistic). Therefore there is variation in
sample proportion. The variation of sample proportion is measured using
`standard error (SE)`

. `SE`

not only exists for
proportion it also available for other sample measure such as mean,
correlation, odds ration, regression coefficient etc. The formula for
`SE`

for proportion \(\hat{p}\) is,

\[SE(\hat{p}) =
\sqrt{(\hat{p}*(1-\hat{p})/n)},\] where \(n\) is the number of total people. So, we
can easily calculate the \(SE(\hat{p})\) using `R`

using
the following command. Let us say \(\hat{p}\) is the proportion of LBW
baby.

```
# SE (p) among Mom's <=35
SE_young = sqrt(0.1315789*(1-0.1315789)/38)
SE_young1 = sqrt(tab[1,2]*(1-tab[1,2])/tab1[1,3])
print(c(SE_young, SE_young1))
```

`## [1] 0.05483609 0.05483610`

```
# SE (p) among Mom's >35
SE_older = sqrt(0.25*(1-0.25)/4)
SE_older1 = sqrt(tab[2,2]*(1-tab[2,2])/tab1[2,3])
print(c(SE_older, SE_older1))
```

`## [1] 0.2165064 0.2165064`

\(SE(\hat{p})\) for the older (>35y) Mom’s is much higher than their counterparts.

### 95% confidence interval (CI) of proportion

The \(95\%\) CI for proportion is a very useful measure. It gives us more information about the uncertainty in the sample proportions. It has two values, lower and upper limit. The difference between these values is known as the length of the interval. Wider CI means more uncertainty in the sample estimate. So, the tighter is better and more reliable.

The general formula for a \(95\%\) is,

\[CI (low) = p - Z_\alpha*SE(p)\] and \[CI (up) = p + Z_\alpha*SE(p),\]

where \(Z_\alpha\) is the normal reference value. For a \(95\%\) the value of \(Z_\alpha\) is always \(1.96\). So, we can easily get the \(95\%\) CI for proportion of LBW baby among younger and older Moms.

```
# 95% CI among Mom's <=35
CI_Y_low = tab[1,2]-1.96*SE_young
CI_Y_up = tab[1,2]+1.96*SE_young
print(c(CI_Y_low, CI_Y_up))
```

`## [1] 0.0241002 0.2390577`

```
# 95% CI among Mom's >35
CI_O_low = tab[2,2]-1.96*SE_older
CI_O_up = tab[2,2]+1.96*SE_older
print(c(CI_O_low, CI_O_up))
```

`## [1] -0.1743524 0.6743524`

As we seen before the \(SE(p)\) was larger for the Moms age >35 so the \(95\%\) CI is wider for the same. Therefore, the proportion of LBW among the older Moms is less reliable. That means if you repeat the study with another set of mothers you may find a very different results.

Also, as the second CI include 0 (zero) inside therefore the proportion of LBW baby among older mothers is not significantly different from 0 while the proportion of LBW baby for younger mothers is significantly different from 0.

Moreover, the two CIs overlaps each other. Therefore, we can say the proportions of LBW baby do not differ significantly between the younger and older mothers. That means being a mother of age >35 years is not statistically significantly associated with LBW. It is recommended to report the SE or 95% CI while reporting the proportion or percentage.

**Exercise**: Estimate the proportion of LBW baby along
with their SE and 95% CI among smoker and non-smoker mothers. Comment on
the reliability and statistical significance of your findings.

### Averaging - mean and median

#### Mean

Mean (arithmetic mean) is the summation of all the data points divided by the number of observation. It is based on all the observations. Very useful summary statistics for everyday data dealing, planning, and budgeting. Mean is suitable for continuous or discrete data.

Examples:

- average number of cigarettes smoked by mother per day,
- average birthweight of newborn in kg,

Let say We have 9 rabbit litters with sizes \(7, 6, 4, 7, 6, 4, 5, 7, 10\). What is the mean litter size? (6.2)

While reporting mean it is recommended to report either standard deviation (SD), standard error (SE) or \(95\%\) CI for mean. For a variable \(X\) the sample mean, population mean, sample variance, standard deviation and the SE is denoted by \(\bar{x}\), \(\mu\), \(s^2(x)\), \(s(x)\) and \(SE(\bar{x})\). The population standard deviation is denoted by \(\sigma\) and population variance is \(\sigma^2\). The relationship between \(s(x)\) and \(SE(\bar{x})\) is simply, \(SE(\bar{x}) = s(x)/\sqrt{n}\), where \(n\) is the sample size.

To get an idea about the spread of the data, let’s see the plots below.

```
p_headc_point <- ggplot(df_bw, aes(x = ID, y = Birthweight))+
geom_point(color='firebrick')+
theme_minimal()+
theme(axis.text = element_text(size=12),
axis.title = element_text(size=14))+
labs(x='Baby ID', y="Birthweight")
p_length_point <- ggplot(df_bw, aes(x = ID, y = Length))+
geom_point(color='darkgreen')+
theme_minimal()+
theme(axis.text = element_text(size=12),
axis.title = element_text(size=14))+
labs(x='Baby ID', y="Length")
gridExtra::grid.arrange(p_headc_point, p_length_point, ncol=2)
```

Do you see any difference in spread (variance) in birthweight and length of the newborns? Maybe no. We can get some idea at least when we see the vertical scale in the left side however nothing certain. There are several measure we can use be sure about the spread in the data.

**Range**is the distance between maximum and the minimum values.**Variance \((s^2, \sigma^2)\)**is a measure of spread of the individual data points from the mean. Variance is the average squared distance of individual data values from the mean.**Standard deviation \((s, \sigma)\)**is the square root of the variance. Widely used measure of variation.- It is recommended reporting SD or variance with mean.
**Empirical rule (\(68 - 95 - 99.7\))**- For a normal distribution, \(68, 95, and~ 99.7\) percent of the data points should be withing the \(mean \pm \sigma,mean \pm 2\sigma, ~and~ mean \pm 3\sigma\) limits respectively. Let’s calculate the mean and SD for head circumference and length of the baby.

```
#Mean and SD for birthweight
mu_bw <- mean(df_bw$Birthweight)
sd_bw <- sd(df_bw$Birthweight)
print(c(mu_bw, sd_bw))
```

`## [1] 3.312857 0.603895`

```
# Mean and SD for length
mu_ln <- mean(df_bw$Length)
sd_ln <- sd(df_bw$Length)
print(c(mu_ln, sd_ln))
```

`## [1] 51.333333 2.935624`

Again, we can construct the \(95\%\) CI and interpret in similar way.

```
n = nrow(df_bw)
# 95% CI for mean BW
CI95_low_bw = mu_bw - 1.96*sd_bw/sqrt(n)
CI95_up_bw = mu_bw + 1.96*sd_bw/sqrt(n)
print(c(CI95_low_bw, CI95_up_bw))
```

`## [1] 3.130218 3.495496`

```
# 95% CI for mean length
CI95_low_ln = mu_ln - 1.96*sd_ln/sqrt(n)
CI95_up_ln = mu_ln + 1.96*sd_ln/sqrt(n)
# 95% CI for mean BW
print(c(CI95_low_ln, CI95_up_ln))
```

`## [1] 50.44550 52.22117`

The \(95\%\) CI for mean birthweight c(3.13, 3.50) means that we are 95% confident that this interval will contain the true (population) mean birthweight.

We interprete the SD value in terms of the empirical rule. So, for birthweight data we can say,

`print(c(mu_bw - sd_bw, mu_bw + sd_bw)) `

`## [1] 2.708962 3.916752`

68% of the birthweight will be inside this limit which is \(42*68\% = 28.56\) baby.

`print(c(mu_bw - 2*sd_bw, mu_bw + 2*sd_bw)) `

`## [1] 2.105067 4.520647`

`print(c(mu_bw - 2*sd_bw, mu_bw + 3*sd_bw)) `

`## [1] 2.105067 5.124542`

95% of the birthweight will be inside this limit which is \(42*95\% = 39.9\) baby.

**Exercise**: Between which values almost all (41.9
baby’s) birthweight will be in?

#### Median

Median is the middle value when the data are ranked in order from smallest to largest. It’s the positional value which divides the data set into two equal parts so 50% of the data are in the left side or the median. The strong point of median is that it does not affected by extreme values or outliers.

For \(n\) is odd, \(\{(𝑛+1)/2\}^{th}\) positional value is the median. For \(n\) is even, average of \((n/2)^{th}\) and \((n/2 + 1)^{th}\) positional values is the median. What is the median size for the \(9\) litters with sizes \(7, 6, 4, 7, 6, 4, 5, 7, 10\)? (6)

There another measure of spread which commonly used with median. This spread is measured based on other positional values. Several measures available.

**Quartile**– 3 values that divides the data into 4 equal parts;**Quintile**- 4 values that divides the data into 5 equal parts;**Decile**- 9 values that divides the data into 10 equal parts, and**Percentile**- 99 values that divides the data into 100 equal parts.Formula for quartiles are:

- 1st quartile (Q1) is \(\{(𝑛+1)×1/4\}^{th}\) positional value.
- 2nd quartile (Q2) is \(\{(𝑛+1)×2/4\}^{th}\) positional value.
- 3rd quartile (Q3) is \(\{(𝑛+1)×3/4\}^{th}\) positional value.

**IQR – Interquartile range**is the difference between Q3 and Q1. IQR contains middle 50% of all data points. IQR is recommended reporting with median.

`summary(df_bw$Birthweight)`

```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.920 2.940 3.295 3.313 3.647 4.570
```

So, for how many babies their birthweight will be inside the values \((2.940, 3.647)\)?

There is smartere command which can many jobs at with a single command. For example we want to calculate the summary statistics by smoking status.

`describe(Birthweight + Length ~ smoker, data=df_bw, quant=c(.25,.75))`

```
##
## Descriptive statistics by group
## smoker: Non-Smoker
## vars n mean sd median trimmed mad min max range skew
## Birthweight 1 20 3.51 0.52 3.38 3.49 0.55 2.65 4.55 1.9 0.31
## Length 2 20 51.80 3.25 53.00 51.94 1.48 43.00 58.00 15.0 -0.74
## kurtosis se Q0.25 Q0.75
## Birthweight -1.05 0.12 3.14 3.93
## Length 0.78 0.73 50.75 53.00
## ------------------------------------------------------------
## smoker: Smoker
## vars n mean sd median trimmed mad min max range skew
## Birthweight 1 22 3.13 0.63 3.18 3.14 0.62 1.92 4.57 2.65 0.02
## Length 2 22 50.91 2.62 51.00 50.89 2.97 46.00 58.00 12.00 0.41
## kurtosis se Q0.25 Q0.75
## Birthweight -0.38 0.13 2.74 3.54
## Length 0.50 0.56 49.25 52.75
```

From this outcome you can easily calculate the CI and compare mean Birthweight between the smoker and non-smoker mothers.

### Mean or Median - what to use?

There is **NO** direct answer however depends on:

- What you are interested in learning from the data set?
- Extreme values are present or not?
- And the shape of the distribution matters.

There is a misconception that if the distribution is not symmetric or normal we can not use mean.

**Median is preferable** to mean if,

- There are extreme values or outliers in the data set.
- The distribution is significantly skewed.
- The data is ordinal.

### Visualise averages with spread

Visualisation is more insightful for any statistics and data. We can also visualise mean, SE, median and IQR using graphs. Let us consider the birthweight of babies by smoking status.

```
tab <- psych::describeBy(Birthweight~smoker, data = df_bw, mat=T, digits = 1)
tab <- tab[,c("group1", "n", "mean", "sd", "median", "se")]
tab$low <- tab$mean-1.96*tab$se
tab$high <- tab$mean+1.96*tab$se
# Mean with SD
p_mean_sd_bw <- ggplot(tab, aes(y = mean, x = group1)) +
geom_pointrange(aes(ymax = (mean+sd), ymin = (mean-sd)),shape=19, size=1,
color='darkgreen')+
theme_minimal()+
theme(axis.title = element_text(size = 12),
axis.text = element_text(size = 14))+
coord_flip()+
labs(title = "Mean BW with SD, ", x=' ', y=' ')
p_mean_sd_bw
```

```
# Mean with 95% CI
p_mean_ci_bw <- ggplot(tab, aes(y = mean, x = group1)) +
geom_pointrange(aes(ymax = high, ymin = low),shape=19, size=1,
color='darkgreen')+
theme_minimal()+
theme(axis.title = element_text(size = 12),
axis.text = element_text(size = 14))+
coord_flip()+
labs(title = "Mean BW with 95% CI, ", x=' ', y=' ')
p_mean_ci_bw
```

It is clearer to see the difference in birthweight between smoker and non-smoker mom’s. What does it mean?

From the second graph (with \(95\%\) CI) we can get even more information. The CIs are non overlapping that means the mean birthweight between smoker vs non-smoker are statistically significantly different.

We can also, visulaise the median and IQR using a boxplot.

```
p_bw_box <- ggplot(df_bw, aes(y=Birthweight, x=smoker))+
geom_boxplot(color='darkgreen', fill='grey80', width=0.5)+
coord_flip()+labs(x='', y='', title = "Boxplot for BW")+
theme_minimal()+
theme(axis.title = element_text(size=12),
axis.text = element_text(size = 14))
p_bw_box
```

The first, second and third vertical lines are the Q1, median and Q2 respectively.

## Leave A Comment