library(ggplot2)
library(patchwork)
Normal Distribution
Required Packages
theme_set(theme_bw())
theme_replace(panel.grid.minor = element_blank(),
panel.grid.major = element_blank())
Normal Distribution
One of the most important and widely used continuous distribution is the Normal distribution, or Gaussian distribution.
Let \(X \sim N(\mu,\sigma)\) be a random variable following a normal distribution with mean \(\mu\) and standard deviation \(\sigma\). In R, the following functions described in the table below, allows us to summarize the function relating to the normal distribution
Function | Description |
---|---|
dnorm | Normal density (Probability Density Function) |
pnorm | Normal distribution (Cumulative Distribution Function) |
qnorm | Quantile function of the Normal distribution |
rnorm | Normal random number generation |
By default, all of the functions above consider the standard Normal distribution, which has a mean of zero and a standard deviation of one, \(X \sim N(0,1)\)
dnorm
The density function for a normal distribution with mean \(\mu\) and standard deviation \(\sigma\) is
\[ f(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp{\left( -\frac{1}{2\sigma^2 }(x-\mu)^2 \right)} \] for \(-\infty < x < \infty\)
We can use dnorm()
function to calculate the density function, i.e \(f(x)\), for a grid of \(x\) values from any normal distribution with mean \(\mu\) and standard deviation \(\sigma\)
For example, we can calculate \(f(0)\) from a standard normal distribution
dnorm(x = 0,mean = 0,sd = 1)
#> [1] 0.3989423
Consider evaluating \(f(x)\) for \(x \in [1,10]\) with mean 1 and standard deviation of 3
dnorm(x=1:10,mean = 1,sd = 3)
#> [1] 0.132980760 0.125794409 0.106482669 0.080656908 0.054670025 0.033159046
#> [7] 0.017996989 0.008740630 0.003798662 0.001477283
pnorm
The pnorm()
function gives the Cumulative Distribution Function (CDF) of the Normal distribution, which is the probability that the variable \(X\) takes a value less than or equal to \(x\). Mathematically, \(F_X(x) = P(X \leq x)\).
For any continuous distribution \(P(X = x)=0\), so equivalently the CDF is \(P(X \leq x) = P(X < x)\).
Consider the standard normal distribution, since this distribution is symmetrical centered around \(\mu=0\) then \(P(X \leq 0) = 0.5\). We can verify this result using pnorm
as follows
pnorm(0,mean = 0, sd = 1)
#> [1] 0.5
Example: Suppose \(X\) is the SAT-M score which has a normal distribution with a mean of 507 and standard deviation of 111. What is the probability of scoring less than 700 on the SAT-M?
Show Code
<- round(pnorm(700, mean=507, sd=111) * 100,2)
prob1 <- ggplot(data.frame(x = c(150,900)), aes(x)) +
plt1 stat_function(fun = dnorm,
geom = "line",
xlim = c(150,900),
args = list(mean = 507,sd = 111)) +
stat_function(fun = dnorm,
geom = "area",
fill = 'steelblue',
alpha =0.3,
xlim = c(150, 700),
args = list(mean = 507,sd = 111))+
annotate("text", x = 510, y = 0.0015,
label = paste0(prob1,'%'),
size = 8)+
geom_vline(xintercept = 700,linetype =2)+
labs(x = '',y= 'Density')
That is \(P(X < 700)\),
pnorm(700, mean=507, sd=111)
#> [1] 0.9589596
What about the probability of scoring greater than 700?
Show Code
<- round(pnorm(700, mean=507, sd=111,
prob2 lower.tail = FALSE)*100,2)
<- ggplot(data.frame(x = c(150,900)), aes(x)) +
p2 stat_function(fun = dnorm,
geom = "line",
xlim = c(150,900),
args = list(mean = 507,sd = 111)) +
stat_function(fun = dnorm,
geom = "area",
fill = 'steelblue',
alpha =0.3,
xlim = c(700, 900),
args = list(mean = 507,sd = 111))+
annotate("text", x = 738, y = 0.00017,
label = paste0(prob2,'%'),
size=4.7)+
geom_vline(xintercept = 700,linetype =2)+
labs(x = '',y= 'Density')
We are interested \(P(X > 700)\), which can be obtained through \(P(X > 700) = 1- P(X \leq 700)\)
1-pnorm(700, mean = 507, sd = 111)
#> [1] 0.04104036
Alternative, pnorm()
has an argument lower.tail=TRUE
(by default). If lower.tail=TRUE
, the probabilities \(P(X \leq x)\) are returned. Otherwise, if lower.tail=FALSE
, \(P(X > x)\) are returned
pnorm(700, mean = 507, sd = 111, lower.tail = FALSE)
#> [1] 0.04104036
The Empirical rule (also known as the 68-95-99.7 rule) is a statistical rule stating that for a normal distribution, where most of the data will fall within three standard deviations of the mean. The empirical rule can be broken down into three parts: 68% of data falls within the first standard deviation from the mean (blue shaded region). 95% fall within the 2nd standard deviations (up to the green shaded region). 99.7% fall within third standard deviation (up to the red shaded region)
Show Code
<- c(-4,4) x_limits
<- function(shade_limits,percent_display,
plot_shaded_normal
annotate_x_label,annotate_y_label,x_limits = c(-4,4),
fill = 'steelblue',
alpha = 0.3,annotate_label_size=4.5){
<- ggplot(data.frame(x = x_limits), aes(x)) +
plt stat_function(fun = dnorm,geom = "line",xlim = x_limits) +
stat_function(fun = dnorm,geom = "area",fill = fill,alpha =alpha,
xlim = shade_limits)+
labs(x = '',y= 'Density')+
annotate("text", x = annotate_x_label, y = annotate_y_label,
label = paste0(percent_display,'%'),
size=annotate_label_size)+
scale_x_continuous(name = '',limits = x_limits,
breaks = x_limits[1]:x_limits[2])
return(plt)
}
<- plot_shaded_normal(shade_limits = c(-1,1),
plots percent_display = 68,
annotate_x_label = 0.1,annotate_y_label = 0.15)+
plot_shaded_normal(shade_limits = c(-2, 2),fill='green',
percent_display = 95,
annotate_x_label = 0,annotate_y_label = 0.15)+
plot_shaded_normal(shade_limits = c(-3, 3),fill='red',
percent_display = 99.7,
annotate_x_label = 0.1,annotate_y_label = 0.15)+
plot_annotation(title = '68–95–99.7 Rule')
We can easily verify these results using pnorm
. Assuming a standard Normal distribution if we were one standard deviation away from the mean then
pnorm(1)-pnorm(-1)
#> [1] 0.6826895
If we were two standard deviations away from the mean
pnorm(2) - pnorm(-2)
#> [1] 0.9544997
and lastly, three standard deviations away from the mean
pnorm(3) - pnorm(-3)
#> [1] 0.9973002
qnorm
The function qnorm()
returns the value of the inverse cumulative density function (CDF) of the normal distribution with specified mean \(\mu\) and standard deviation \(\sigma\).
Let \(F_X(x) = P(X \leq x)\) be the CDF of the normal distribution, and suppose it returns the probability \(p\), i.e, \(F_X(x) = p\). The inverse of the CDF or (quantile function) tells you what \(x\) would make \(F_X(x)\) return some probability \(p\);
\[F_X^{-1}(p) = x\] For example, for the standard normal distribution \(F_X(0)=P(X \leq 0) = 0.5\). That is, the value of \(x\) or (quantile) which gives us a cumulative probability of 0.5 is \(x=0\).
Therefore, we can use the qnorm()
function to find out what value of \(x\) or (quantile) gives us a a cumulative probability of \(p\). Hence, the qnorm
function is the inverse of the pnorm
function
Show Code
<- c(-4,4)
x_limits
<- ggplot(data.frame(x = x_limits), aes(x)) +
p1 stat_function(fun = dnorm,
geom = "line",
xlim = x_limits) +
stat_function(fun = dnorm,
geom = "area",
fill = 'steelblue',
alpha =0.3,
xlim = c(-4, 0))+
geom_vline(xintercept = 0,linetype=2,color = '#cf5d55')+
labs(x = '',y= 'Density')+
annotate('text', x = -0.65, y = 0.17,
parse =TRUE,
label = paste0(
expression('F'[x]),'(0)==0.5') )+
annotate('text',x = 0.77, y = 0.41,
parse =TRUE,
label = paste0(
expression('F'[x]^{-1}),' *(0.5)==0'),
color = '#cf5d55')+
scale_x_continuous(name = '',limits = x_limits,
breaks = -4:4)
qnorm(p=0.5)
#> [1] 0
Going back to our example of the SAT-M scores. Suppose \(X\) is the SAT-M score which has a normal distribution with a mean of 507 and standard deviation of 111. Recall the probability of obtaining a score less than 700 on the SAT-M was \(P(X < 700) = 0.9589596\). Therefore, if we were interesting in finding the score or (quantile) which gives us a cumulative probability of roughly 96% we can use qnorm
as follows:
qnorm(0.9589596, mean=507, sd=111)
#> [1] 700
We should see that the output value is exactly 700.
rnorm
The rnorm
function generates \(n\) observations from a Normal distribution with mean \(\mu\) and standard deviation \(\sigma\).
We specify a seed for reproducibility,
set.seed(10)
Let’s start by generate 10 random observations from a standard normal distribution
rnorm(10, mean = 0, sd = 1)
#> [1] 0.01874617 -0.18425254 -1.37133055 -0.59916772 0.29454513 0.38979430
#> [7] -1.20807618 -0.36367602 -1.62667268 -0.25647839
or equivalently,
rnorm(10)
#> [1] 1.10177950 0.75578151 -0.23823356 0.98744470 0.74139013 0.08934727
#> [7] -0.95494386 -0.19515038 0.92552126 0.48297852
We can specify a different mean and standard deviation
rnorm(10, mean = 10, sd = 2)
#> [1] 8.807379 5.629426 8.650268 5.761878 7.469604 9.252677 8.624889 8.255682
#> [9] 9.796478 9.492439
In the following plot we generate \(n=100,1000,10000\) random observations from a standard normal distribution. If we increase the number of observations, we see the data will approach the true Normal density function
Show Code
<- function(n_samples, mean = 0, sd = 1,
plot_random_normal fill = 'steelblue', alpha = 0.3,
bins = 30, density_linewidth = 1.2,
density_linetype = 2){
<- ggplot(data = data.frame(x=rnorm(n_samples,mean = mean, sd = sd),
plt aes(x))+
n_samples),geom_histogram(aes(y = after_stat(density)),
colour = 1, fill = fill,
alpha = alpha,
bins=bins) +
geom_density(linewidth = density_linewidth,
linetype = density_linetype,
colour = 2)+
labs(x= '',y = 'density')+
facet_grid(~n_samples)
return(plt)
}
<- plot_random_normal(n_samples = 100)+
plots plot_random_normal(n_samples = 1000)+
plot_random_normal(n_samples = 10000)+
plot_layout(ncol=3)
Standardizing Values
The normal model \(N(0,1)\) is called the standard normal distribution, and it is the default model used when working with (.)norm
functions
A \(z\)-score, also known as a standard score, quantifies the number of standard deviations a data point is from the mean of a distribution. This can be useful for comparing and analyzing data across different scales or distributions, providing a standardized way to assess the relative position of a particular value within a dataset
For example, consider a scenario where you are comparing the performance of students in two different subjects, one with scores ranging from 0 to 100 and another with scores ranging from 0 to 50. Without standardization, it can be challenging to assess which student has performed relatively better across both subjects.
The \(z\)-score of a particular value is defined as \[ z = \frac{\text{value} - \text{mean}}{\text{standard deviation}} \] If the values in their original units of measurement follow a normal distribution, then the \(z\)-scores will follow a standard normal distribution
Recall the SAT-M scores example: Suppose \(X\) is the SAT-M score which has a normal distribution with a mean of 507 and standard deviation of 111. What is the probability of scoring less than 700 on the SAT-M?
pnorm(700, mean = 507, sd = 111)
#> [1] 0.9589596
We can also compute this quantity with the \(z\)-score with respect to the standard normal distribution \[ \begin{align*} z &= \frac{\text{value} - \text{mean}}{\text{standard deviation}} \\[10pt] z &= \frac{700 - 507}{111} \\[10pt] z &= 1.738739 \end{align*} \]
<- (700-507)/111
z pnorm(z)
#> [1] 0.9589596
Show Code
<- round(z * 100,2)
prob2 <- ggplot(data.frame(x = c(-4,4)), aes(x)) +
p2 stat_function(fun = dnorm,
geom = "line",
xlim = c(-4,4),
args = list(mean = 0,sd = 1)) +
stat_function(fun = dnorm,
geom = "area",
fill = 'red',
alpha =0.3,
xlim = c(-4, z),
args = list(mean = 0,sd = 1))+
annotate("text", x = (510-507)/111, y = 0.15,
label = paste0(prob1,'%'),
size = 8)+
geom_vline(xintercept = z,linetype =2)+
labs(x = '',y= 'Density')
By standardizing the scores, we shift from calculating \(P(X < 700)\) where the units are in SAT-M scores to equivalently computing \(P(Z < 1.738739)\). This represents the number of standard deviations a data point is from the mean of the distribution. By standardizing we can now compare these results with say scores obtained from an english SAT exam on the same scale