Taking too long? Close loading screen.

SOA ASA Exam: Predictive Analysis (PA) – 1. Basics of R

[mathjax]

Data Types

Create an integer

append “L” to an integer: x <- 1L

Data Structures

Vectors

Create a vector

c(...)

a <- c(1:5)
b <- c(5:1)
c <- c("A", "B", "C")
d <- c(TRUE, FALSE, FALSE, TRUE, TRUE)
print(a)
[1] 1 2 3 4 5
print(b)
[1] 5 4 3 2 1
print(c)
[1] "A" "B" "C"
print(c)
[1] "A" "B" "C"
print(d)
[1]  TRUE FALSE FALSE  TRUE  TRUE

 

Create a sequence of numbers

seq(from, to, by)

x <- seq(0, 5, 1)
[1] 1 2 3 4 5

 

Extract subsets of vectors

[]

# Using positive integers
a[2]
[1] 2
a[c(2, 4)]
[1] 2 4

# Using negative integers
b[-1]
[1] 4 3 2 1
b[-(2:4)]
[1] 5 1

# Using logical vectors
a[d]
[1] 1 4 5

Remark:

  • Unequal Length: For two vectors of unequal length, the shorter vector is recycled by repeating the elements in shorter vector to match the longer vector.
> print(a + 1:3)
[1] 2 4 6 5 7

 

Factors

  • Create a factor: factor(...)
# define x as a vector
x <- c("M", "F", "M", "O", "F")
# factorize x and assign to x.factor
x.factor <- factor(x)
x.factor

[1] M F M O F
Levels: F M O

# Show possible levels: levels()
levels(x.factor)
[1] "F" "M" "O"

# Integer representation
as.numeric(x.factor)
[1] 2 1 2 3 1

# Sort levels in order
z <- c("Good", "Excellent", "Bad", "Good", "Bad", "Good")
z.factor <- factor(z, order = TRUE)
z.factor
[1] Good      Excellent Bad       Good      Bad       Good     
Levels: Bad < Excellent < Good

z.factor <- factor(z, levels = c("Bad", "Good", "Excellent"))
z.factor
[1] Good      Excellent Bad       Good      Bad       Good     
Levels: Bad Good Excellent

 

Matrices

Create a matrix

matrix(data = <vector>, nrow = <nrow>, , ncol = <ncol>, byrow = FALSE)

A <- matrix(1:8, ncol = 2)
A
     [,1] [,2]
[1,]    1    5
[2,]    2    6
[3,]    3    7
[4,]    4    8

B <- matrix(9:16, nrow = 4, byrow = TRUE)
B
     [,1] [,2]
[1,]    9   10
[2,]   11   12
[3,]   13   14
[4,]   15   16
C <- matrix(c(2, 1, 1, 4), ncol = 2)
C
     [,1] [,2]
[1,]    2    1
[2,]    1    4
D <- matrix(LETTERS[1:6], nrow  = 3)
D
     [,1] [,2]
[1,] "A"  "D" 
[2,] "B"  "E" 
[3,] "C"  "F"

Attributes

    • nrow()
    • ncol()
    • dim()

 

Matrix Multiplication

%*%

A %*% C
     [,1] [,2]
[1,]    7   21
[2,]   10   26
[3,]   13   31
[4,]   16   36

 

Matrix Transposition

t()

t(A)
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    5    6    7    8

 

Matrix Inversion

solve()

solve(C)
           [,1]       [,2]
[1,]  0.5714286 -0.1428571
[2,] -0.1428571  0.2857143

 

Data Frames

Create a data frame

data.frame(data)

x <- 6:10
y <- 11:15
z <- c("one", "two", "three", "four", "five")
df <- data.frame(x, y, z)
df
   x  y     z
1  6 11   one
2  7 12   two
3  8 13 three
4  9 14  four
5 10 15  five

Print records

head(df, n = 3)
  x  y     z
1 6 11   one
2 7 12   two
3 8 13 three

tail(df, n = 3)
   x  y     z
3  8 13 three
4  9 14  four
5 10 15  five

Attributes

    • nrow(df)
    • ncol(df)
    • dim(df)

Column names

    • names(df) <- c()
    • columns(df) <- c()

 

Subsetting a dataframe

df$y
[1] 11 12 13 14 15

df[, c("x", "z")]
   x     z
1  6   one
2  7   two
3  8 three
4  9  four
5 10  five

 

Lists

Create a list

list(data)

l <- list(p = "This is a list",
+           q = 4:6,
+           r = matrix(1:9, nrow = 3),
+           s = data.frame(x = c(1, 4, 9)))

 

Data Management

Background

actuary
  x            name gender age exams Q1 Q2 Q3 salary
1 1      Embryo Luo      M  -1    10 10  9  9  3e+05
2 2                      F  25     3 NA  9  7     NA
3 3     Peter Smith      M  22     0  4  5  5  8e+04
4 4            <NA>      F  50     4  7  7  8     NA
5 5 Angela Peterson      F  30     6  8  8 10     NA
6 6  Emily Johnston      F  42     7  9 10 10     NA
7 7   Barbara Scott      7  29     5  8  9  7     NA
8 8    Benjamin Eng      M  36     9  7  8  8     NA

 

Removing Unimportant Variables

use the $ notation to access the variable and assign it to the special symbol NULL.

actuary$x <- NULL
actuary
             name gender age exams Q1 Q2 Q3 salary
1      Embryo Luo      M  -1    10 10  9  9  3e+05
2                      F  25     3 NA  9  7     NA
3     Peter Smith      M  22     0  4  5  5  8e+04
4            <NA>      F  50     4  7  7  8     NA
5 Angela Peterson      F  30     6  8  8 10     NA
6  Emily Johnston      F  42     7  9 10 10     NA
7   Barbara Scott      7  29     5  8  9  7     NA
8    Benjamin Eng      M  36     9  7  8  8     NA

actuary$salary <- NULL
actuary
             name gender age exams Q1 Q2 Q3
1      Embryo Luo      M  -1    10 10  9  9
2                      F  25     3 NA  9  7
3     Peter Smith      M  22     0  4  5  5
4            <NA>      F  50     4  7  7  8
5 Angela Peterson      F  30     6  8  8 10
6  Emily Johnston      F  42     7  9 10 10
7   Barbara Scott      7  29     5  8  9  7
8    Benjamin Eng      M  36     9  7  8  8

 

Missing/Abnormal Values

Method Pros Cons
Removing observations / variables with missing values The simplest method Results in loss of information
Replacing missing values with numeric variables: Mean / median Simple to use Needs the assumption that missing values are missing at random, i.e., missing values have the same distribution as non-missing values.
Replacing missing values with categorical variables:
Mode or a new factor level
Destroys relationships between variables, which are typically correlated
Replacing missing values with an assigned category /value based on prior knowledge Preserves meaning of missing values Not applicable when no prior knowledge is available
Advanced imputation: Using a model based on the combinations of other variables to fill in the missing values Tends to be more accurate than other methods
  • Computationally intensive
  • Depends on order of imputing the variables

missing values are denoted by the special symbol NA (without the double quotes “”), which stands for “not available.”

 

Removing observations

is.na() function

Logical subsetting-using appropriate logical tests to identify elements that satisfy certain conditions.

# remove the second observation, which has a missing value for Q1.
is.na(actuary$Q1)
[1] FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
actuary.1 <- actuary[!is.na(actuary$Q1), ]
actuary.1
             name gender age exams Q1 Q2 Q3
1      Embryo Luo      M  -1    10 10  9  9
3     Peter Smith      M  22     0  4  5  5
4            <NA>      F  50     4  7  7  8
5 Angela Peterson      F  30     6  8  8 10
6  Emily Johnston      F  42     7  9 10 10
7   Barbara Scott      7  29     5  8  9  7
8    Benjamin Eng      M  36     9  7  8  8

 

Removing observations

complete.cases() or na.omit() functions

These two functions are more global than is.na() and remove all rows containing missing values.

  • When the complete.cases() function is applied to a data frame, it returns a logical vector of the same length as the number of rows of the data frame with elements equal to TRUE if the corresponding rows contain “complete cases” (i.e., no missing values) and equal to FALSE otherwise. When this logical vector is used as the row index vector, only rows without missing values are retained.
  • When the function na.omit() is applied to a data frame, any rows with missing data are “omitted.”
# use the complete.cases() function to wipe out the second and fourth rows, which contain the "NA" values in either name or Q1.
cc <- complete.cases(actuary)
actuary.2 <- actuary[cc, ]
actuary.2
             name gender age exams Q1 Q2 Q3
1      Embryo Luo      M  -1    10 10  9  9
3     Peter Smith      M  22     0  4  5  5
5 Angela Peterson      F  30     6  8  8 10
6  Emily Johnston      F  42     7  9 10 10
7   Barbara Scott      7  29     5  8  9  7
8    Benjamin Eng      M  36     9  7  8  8
> actuary.3 <- na.omit(actuary)
> actuary.3
             name gender age exams Q1 Q2 Q3
1      Embryo Luo      M  -1    10 10  9  9
3     Peter Smith      M  22     0  4  5  5
5 Angela Peterson      F  30     6  8  8 10
6  Emily Johnston      F  42     7  9 10 10
7   Barbara Scott      7  29     5  8  9  7
8    Benjamin Eng      M  36     9  7  8  8

 

New Background

actuary.n <- actuary.3
actuary.n
             name gender age exams Q1 Q2 Q3
1      Embryo Luo      M  -1    10 10  9  9
3     Peter Smith      M  22     0  4  5  5
5 Angela Peterson      F  30     6  8  8 10
6  Emily Johnston      F  42     7  9 10 10
7   Barbara Scott      7  29     5  8  9  7
8    Benjamin Eng      M  36     9  7  8  8

# Check the number of rows
nrow(actuary) - nrow(actuary.n)
[1] 2

 

Adding New Variables

To add a new variable to a data frame, we can use the dollar sign $ notation to name and define the new variable.

<data_frame>$<new_var> <- <definition>

# Create a variable "S" that measures the average of the three evaluation ratings for each actuary
actuary.n$S <- (actuary.n$Q1 + actuary.n$Q2 + actuary.n$Q3) / 3
actuary.n
             name gender age exams Q1 Q2 Q3        S
1      Embryo Luo      M  -1    10 10  9  9 9.333333
3     Peter Smith      M  22     0  4  5  5 4.666667
5 Angela Peterson      F  30     6  8  8 10 8.666667
6  Emily Johnston      F  42     7  9 10 10 9.666667
7   Barbara Scott      7  29     5  8  9  7 8.000000
8    Benjamin Eng      M  36     9  7  8  8 7.666667

 

Using Function

apply()

This function allows us to “apply” an arbitrary function to any dimension of a matrix and a data frame, provided that its columns are of the same data structure.

apply(<data_frame>, MARGIN, <function>)

  • where MARGIN = 1 indicates that the function will act on the rows of the data frame
  • while MARGIN = 2 indicates that the function will act on its columns.
# Create a variable "S" that measures the average of the three evaluation ratings for each actuary
actuary.n$S <- apply(actuary.n[, c("Q1", "Q2", "Q3")], 1, mean)
actuary.n
             name gender age exams Q1 Q2 Q3        S
1      Embryo Luo      M  -1    10 10  9  9 9.333333
3     Peter Smith      M  22     0  4  5  5 4.666667
5 Angela Peterson      F  30     6  8  8 10 8.666667
6  Emily Johnston      F  42     7  9 10 10 9.666667
7   Barbara Scott      7  29     5  8  9  7 8.000000
8    Benjamin Eng      M  36     9  7  8  8 7.666667

 

Sorting

  • Use which.max() and which.min() to find the maximum
# Find the index at which the average rating is "maximized"
actuary.n[which.max(actuary.n$S), "name"]
[1] Emily Johnston
7 Levels:  Angela Peterson Barbara Scott Benjamin Eng ... Peter Smith

# Find the index at which the average rating is "minimized"
actuary.n[which.min(actuary.n$S), "name"]
[1] Peter Smith
7 Levels:  Angela Peterson Barbara Scott Benjamin Eng ... Peter Smith
  • Use order() to get the permutation of indices when the elements of the vector supplied as an argument is sorted from smallest to largest.

 

# Sort the data frame with respect to variable S from the smallest to the largest (ascending order)
order(actuary.n$S)
[1] 2 6 5 3 1 4
actuary.n1 <- actuary.n[order(actuary.n$S), ]
actuary.n1
             name gender age exams Q1 Q2 Q3        S
3     Peter Smith      M  22     0  4  5  5 4.666667
8    Benjamin Eng      M  36     9  7  8  8 7.666667
7   Barbara Scott      7  29     5  8  9  7 8.000000
5 Angela Peterson      F  30     6  8  8 10 8.666667
1      Embryo Luo      M  -1    10 10  9  9 9.333333
6  Emily Johnston      F  42     7  9 10 10 9.666667

# Sort the data frame with respect to variable S from the largest to the smallest (descending order)
order(actuary.n$S, decreasing= TRUE)
[1] 4 1 3 5 6 2
actuary.n2 <- actuary.n[order(actuary.n$S, decreasing= TRUE), ]
actuary.n2
             name gender age exams Q1 Q2 Q3        S
6  Emily Johnston      F  42     7  9 10 10 9.666667
1      Embryo Luo      M  -1    10 10  9  9 9.333333
5 Angela Peterson      F  30     6  8  8 10 8.666667
7   Barbara Scott      7  29     5  8  9  7 8.000000
8    Benjamin Eng      M  36     9  7  8  8 7.666667
3     Peter Smith      M  22     0  4  5  5 4.666667

# Sorted by gender, then by S, both in descending order
order(actuary.n$gender, actuary.n$S, decreasing = TRUE)
[1] 1 6 2 4 3 5
actuary.n3 <- actuary.n[order(actuary.n$gender, actuary.n$S, decreasing = TRUE), ]
actuary.n3
             name gender age exams Q1 Q2 Q3        S
1      Embryo Luo      M  -1    10 10  9  9 9.333333
8    Benjamin Eng      M  36     9  7  8  8 7.666667
3     Peter Smith      M  22     0  4  5  5 4.666667
6  Emily Johnston      F  42     7  9 10 10 9.666667
5 Angela Peterson      F  30     6  8  8 10 8.666667
7   Barbara Scott      7  29     5  8  9  7 8.000000

 

Identify observations with certain characteristics

# Identify Fellow from the actuaries who should have passed 10 exams.
actuary.FSA <- actuary.n[actuary.n$exams == 10, ]
actuary.FSA
        name gender age exams Q1 Q2 Q3        S
1 Embryo Luo      M  -1    10 10  9  9 9.333333

# Identify ASA's from the actuaries who should have passed 7 to 9 exams.
actuary.ASA<- actuary.n[actuary.n$exams >= 7 & actuary.n$exams <= 9, ]
actuary.ASA
            name gender age exams Q1 Q2 Q3        S
6 Emily Johnston      F  42     7  9 10 10 9.666667
8   Benjamin Eng      M  36     9  7  8  8 7.666667

 

Create a new categorical variable

lifelse(<cond>, <result_1>, <result_2>)

where:

  • <cond> is the logical test to be conducted on a vector.
  • <resul t_1> is the return value if the corresponding element of the logical vector is TRUE.
  • <result_2> is the return value if the corresponding element of the logical vector is FALSE.
# Use ifelse() to create a categorical variable based on the number of exams passed
actuary.n$title <- ifelse(actuary.n$exams == 10, "FSA", ifelse(actuary.n$exams >= 7 & actuary.n$exams <= 9, "ASA", "student"))
actuary.n
             name gender age exams Q1 Q2 Q3        S   title
1      Embryo Luo      M  -1    10 10  9  9 9.333333     FSA
3     Peter Smith      M  22     0  4  5  5 4.666667 student
5 Angela Peterson      F  30     6  8  8 10 8.666667 student
6  Emily Johnston      F  42     7  9 10 10 9.666667     ASA
7   Barbara Scott      7  29     5  8  9  7 8.000000 student
8    Benjamin Eng      M  36     9  7  8  8 7.666667     ASA

 

Horizontal and vertical concatenations

  • cbind(): If the external dataset carries new variables for the same observations as the current dataset, then we can do a horizontal concatenation and combine the new columns with the old ones
  • rbind(): If the external dataset has the same variables as the current dataset, then we can do a vertical concatenation and add new observations.
# Use cbind() to append variables "smoke" and "weight" to actuary.n
new <- data.frame(smoke = c("N", "N", "Y", "Y", "N", "N", "N", "Y"), weight = c(70, 65, 60, 90 , 60, 55, 50 , 75))
actuary.n <- cbind(actuary.n, new[cc, ])
actuary.n
             name gender age exams Q1 Q2 Q3        S   title smoke weight
1      Embryo Luo      M  -1    10 10  9  9 9.333333     FSA     N     70
3     Peter Smith      M  22     0  4  5  5 4.666667 student     Y     60
5 Angela Peterson      F  30     6  8  8 10 8.666667 student     N     60
6  Emily Johnston      F  42     7  9 10 10 9.666667     ASA     N     55
7   Barbara Scott      7  29     5  8  9  7 8.000000 student     N     50
8    Benjamin Eng      M  36     9  7  8  8 7.666667     ASA     Y     75

 

Create a compound variable

“paste” its arguments as character strings.

paste(<vector_1>, ..., <vector_n>, sep = <string>, collapse = <string>)

  • sep: supplies a delimiter to separate the individual vectors. The default of the sep argument is one space (" ").
  • collapse: The multiple character strings will then be concatenate using the delimiter indicated by the collapse argument
paste(c("a", "b", "c"), c("x", "y", "z"), sep = "")
[1] "ax" "by" "cz"

# Recycling rule
paste("The answer is", c("YES", "N0"))
[1] "The answer is YES" "The answer is N0"
paste("Yr", 1:10)
 [1] "Yr 1"  "Yr 2"  "Yr 3"  "Yr 4"  "Yr 5"  "Yr 6"  "Yr 7"  "Yr 8"  "Yr 9"  "Yr 10"

paste("The answer is", c("YES", "N0"), sep = "...", collapse = " OR ")
[1] "The answer is...YES OR The answer is...N0"

paste0(<vector_1>, ..., <vector_n>)

paste0 function is a wrapper for paste() with the sep argument set a priori to “zero” space (" ").

# Set up the compound variable named genderSmoke.
actuary.n$genderSmoke <- paste0(actuary.n$gender, actuary.n$smoke)
actuary.n
             name gender age exams Q1 Q2 Q3        S   title smoke weight genderSmoke
1      Embryo Luo      M  -1    10 10  9  9 9.333333     FSA     N     70          MN
3     Peter Smith      M  22     0  4  5  5 4.666667 student     Y     60          MY
5 Angela Peterson      F  30     6  8  8 10 8.666667 student     N     60          FN
6  Emily Johnston      F  42     7  9 10 10 9.666667     ASA     N     55          FN
7   Barbara Scott      7  29     5  8  9  7 8.000000 student     N     50          7N
8    Benjamin Eng      M  36     9  7  8  8 7.666667     ASA     Y     75          MY

Another example would be:

paste0(1:10, c("st", "nd", "rd", rep("th", 7)))
 [1] "1st"  "2nd"  "3rd"  "4th"  "5th"  "6th"  "7th"  "8th"  "9th"  "10th"

 

For-Loop Statement

Construct a confidence interval

# Initialization step
n <- 100
n_sim <- 1000 # number of replications
mu <- 5 # true mean vaLue
sigma <- 2
count <- rep(NA, n_sim)

# Repetition step
set.seed(0) # to make results reproducibLe
for (i in 1:n_sim) {
     # Draw a random sampLe of size n from a normal distribution
     # with mean mu and standard deviation sigma
+    x <- rnorm(n, mean = mu, sd = sigma)
+    count[i] <- (abs(mean(x) - mu) <= qnorm(0.975) * sigma / sqrt(n))
+ }

# Final result
mean(count) 
[1] 0.952

 

Reproducible Results

Set a Seed

Set a seed in R is used for:

  • Reproducing the same output of simulation studies.
  • Help to debug the code when dealing with pseudorandom numbers.

set.seed(): allow you to set a seed and a generator (with the kind argument) in R.

  • The state of the random number generator is stored in .Random.seed (in the global environment). It is a vector of integers which length depends on the generator.
  • If the seed is not specified, R uses the clock of the system to establish one.
set.seed(1)
rnorm(5)
[1] -0.6264538 0.1836433 -0.8356286 1.5952808 0.3295078
rnorm(5)
[1] -0.8204684 0.4874291 0.7383247 0.5757814 -0.3053884

======================================= Same as Below =======================================

set.seed(1)
rnorm(10)
[1] -0.6264538 0.1836433 -0.8356286 1.5952808 0.3295078 -0.8204684 0.4874291
[8] 0.7383247 0.5757814 -0.3053884> set.seed(1)

 

Unset a Seed

To unset or reset a seed in R, you have two options:

  • As R uses the system clock to set a seed when not specified, you can use the Sys.time function as follows to come back to the default behavior

set.seed(Sys.time())

  • pass a NULL to the function, to re-initialize the generator “as if no seed had yet been set”.

set.seed(NULL)

 

Generating Randomized Results

Uniform

  • runif(n): simulates \(n\) uniform random variables on interval \([0,1]\)

Normal

  • pnorm(q, mean, sd, lower.tail = TRUE): The \(p\) is for “probability” and is a cumulative distribution function.
  • qnorm(x, mean, sd): The \(q\) is for “quantile“, and it is an inverse.
  • dnorm(x, mean = x, sd = y): The \(d\) is for “density“, which is a density function.
  • rnorm(n, mean = x, sd = y): The \(r\) is for “random“, a random variable with the specified distribution, simulates \(n\) normal distribution with mean \(x\) and standard deviation \(y\)

 

Fitting Least Squares Estimates

Question

You are fitting a multiple linear regression model:

\(Y=\beta_0+\beta_1X_1+\beta_2X_2+\epsilon\)

to the following dataset:

\(i\) \(Y_i\) \(X_{i1}\) \(X_{i2}\)
1 14 2 4
2 11 3 2
3 14 0 6
4 18 8 4
5 12 4 3
6 9 1

2

Perform the following two tasks in R.

    1. Set up the response vector \(Y\) and design matrix \(X\) corresponding to the dataset above.
    2. Use the matrix formula \(\hat{\beta}=(X^TX)^{-1}X^TY\) to calculate the least squares estimates of \(\beta_0,\beta_1,\beta_2\)

Solution

> Y <- c(14, 11, 14, 18, 12, 9)
> X1<- c(2, 3, 0, 8, 4, 1)
> X2 <- c(4, 2, 6, 4, 3, 2)
> X <- matrix(c(XO, Xi, X2), nrow = 6)
> X
     [,1] [,2] [,3]
[1,]    1    2    4
[2,]    1    3    2
[3,]    1    0    6
[4,]    1    8    4
[5,]    1    4    3
[6,]    1    1    2

> B <- solve(t(X) %*% X) %*% t(X) %*% Y
> B
          [,1]
[1,] 5.2505543
[2,] 0.8137472
[3,] 1.5166297

Therefore, \(\hat{\beta_0}=5.2505543\), \(\hat{\beta_1}=0.8137472\), \(\hat{\beta_2}=1.5166297\)