[mathjax]
Data Types
Create an integer
append “L
” to an integer: x <- 1L
Data Structures
Vectors
Create a vector
c(...)
a <- c(1:5) b <- c(5:1) c <- c("A", "B", "C") d <- c(TRUE, FALSE, FALSE, TRUE, TRUE) print(a) [1] 1 2 3 4 5 print(b) [1] 5 4 3 2 1 print(c) [1] "A" "B" "C" print(c) [1] "A" "B" "C" print(d) [1] TRUE FALSE FALSE TRUE TRUE
Create a sequence of numbers
seq(from, to, by)
x <- seq(0, 5, 1) [1] 1 2 3 4 5
Extract subsets of vectors
[]
# Using positive integers a[2] [1] 2 a[c(2, 4)] [1] 2 4 # Using negative integers b[-1] [1] 4 3 2 1 b[-(2:4)] [1] 5 1 # Using logical vectors a[d] [1] 1 4 5
Remark:
- Unequal Length: For two vectors of unequal length, the shorter vector is recycled by repeating the elements in shorter vector to match the longer vector.
> print(a + 1:3) [1] 2 4 6 5 7
Factors
- Create a factor:
factor(...)
# define x as a vector x <- c("M", "F", "M", "O", "F") # factorize x and assign to x.factor x.factor <- factor(x) x.factor [1] M F M O F Levels: F M O # Show possible levels: levels() levels(x.factor) [1] "F" "M" "O" # Integer representation as.numeric(x.factor) [1] 2 1 2 3 1 # Sort levels in order z <- c("Good", "Excellent", "Bad", "Good", "Bad", "Good") z.factor <- factor(z, order = TRUE) z.factor [1] Good Excellent Bad Good Bad Good Levels: Bad < Excellent < Good z.factor <- factor(z, levels = c("Bad", "Good", "Excellent")) z.factor [1] Good Excellent Bad Good Bad Good Levels: Bad Good Excellent
Matrices
Create a matrix
matrix(data = <vector>, nrow = <nrow>, , ncol = <ncol>, byrow = FALSE)
A <- matrix(1:8, ncol = 2) A [,1] [,2] [1,] 1 5 [2,] 2 6 [3,] 3 7 [4,] 4 8 B <- matrix(9:16, nrow = 4, byrow = TRUE) B [,1] [,2] [1,] 9 10 [2,] 11 12 [3,] 13 14 [4,] 15 16 C <- matrix(c(2, 1, 1, 4), ncol = 2) C [,1] [,2] [1,] 2 1 [2,] 1 4 D <- matrix(LETTERS[1:6], nrow = 3) D [,1] [,2] [1,] "A" "D" [2,] "B" "E" [3,] "C" "F"
Attributes
-
nrow()
ncol()
dim()
Matrix Multiplication
%*%
A %*% C [,1] [,2] [1,] 7 21 [2,] 10 26 [3,] 13 31 [4,] 16 36
Matrix Transposition
t()
t(A) [,1] [,2] [,3] [,4] [1,] 1 2 3 4 [2,] 5 6 7 8
Matrix Inversion
solve()
solve(C) [,1] [,2] [1,] 0.5714286 -0.1428571 [2,] -0.1428571 0.2857143
Data Frames
Create a data frame
data.frame(data)
x <- 6:10 y <- 11:15 z <- c("one", "two", "three", "four", "five") df <- data.frame(x, y, z) df x y z 1 6 11 one 2 7 12 two 3 8 13 three 4 9 14 four 5 10 15 five
Print records
head(df, n = 3) x y z 1 6 11 one 2 7 12 two 3 8 13 three tail(df, n = 3) x y z 3 8 13 three 4 9 14 four 5 10 15 five
Attributes
-
nrow(df)
ncol(df)
dim(df)
Column names
-
names(df) <- c()
columns(df) <- c()
Subsetting a dataframe
df$y [1] 11 12 13 14 15 df[, c("x", "z")] x z 1 6 one 2 7 two 3 8 three 4 9 four 5 10 five
Lists
Create a list
list(data)
l <- list(p = "This is a list", + q = 4:6, + r = matrix(1:9, nrow = 3), + s = data.frame(x = c(1, 4, 9)))
Data Management
Background
actuary x name gender age exams Q1 Q2 Q3 salary 1 1 Embryo Luo M -1 10 10 9 9 3e+05 2 2 F 25 3 NA 9 7 NA 3 3 Peter Smith M 22 0 4 5 5 8e+04 4 4 <NA> F 50 4 7 7 8 NA 5 5 Angela Peterson F 30 6 8 8 10 NA 6 6 Emily Johnston F 42 7 9 10 10 NA 7 7 Barbara Scott 7 29 5 8 9 7 NA 8 8 Benjamin Eng M 36 9 7 8 8 NA
Removing Unimportant Variables
use the $
notation to access the variable and assign it to the special symbol NULL
.
actuary$x <- NULL actuary name gender age exams Q1 Q2 Q3 salary 1 Embryo Luo M -1 10 10 9 9 3e+05 2 F 25 3 NA 9 7 NA 3 Peter Smith M 22 0 4 5 5 8e+04 4 <NA> F 50 4 7 7 8 NA 5 Angela Peterson F 30 6 8 8 10 NA 6 Emily Johnston F 42 7 9 10 10 NA 7 Barbara Scott 7 29 5 8 9 7 NA 8 Benjamin Eng M 36 9 7 8 8 NA actuary$salary <- NULL actuary name gender age exams Q1 Q2 Q3 1 Embryo Luo M -1 10 10 9 9 2 F 25 3 NA 9 7 3 Peter Smith M 22 0 4 5 5 4 <NA> F 50 4 7 7 8 5 Angela Peterson F 30 6 8 8 10 6 Emily Johnston F 42 7 9 10 10 7 Barbara Scott 7 29 5 8 9 7 8 Benjamin Eng M 36 9 7 8 8
Missing/Abnormal Values
Method | Pros | Cons |
Removing observations / variables with missing values | The simplest method | Results in loss of information |
Replacing missing values with numeric variables: Mean / median | Simple to use | Needs the assumption that missing values are missing at random, i.e., missing values have the same distribution as non-missing values. |
Replacing missing values with categorical variables: Mode or a new factor level |
Destroys relationships between variables, which are typically correlated | |
Replacing missing values with an assigned category /value based on prior knowledge | Preserves meaning of missing values | Not applicable when no prior knowledge is available |
Advanced imputation: Using a model based on the combinations of other variables to fill in the missing values | Tends to be more accurate than other methods |
|
missing values are denoted by the special symbol NA
(without the double quotes “”), which stands for “not available.”
Removing observations
is.na()
function
Logical subsetting-using appropriate logical tests to identify elements that satisfy certain conditions.
# remove the second observation, which has a missing value for Q1. is.na(actuary$Q1) [1] FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE actuary.1 <- actuary[!is.na(actuary$Q1), ] actuary.1 name gender age exams Q1 Q2 Q3 1 Embryo Luo M -1 10 10 9 9 3 Peter Smith M 22 0 4 5 5 4 <NA> F 50 4 7 7 8 5 Angela Peterson F 30 6 8 8 10 6 Emily Johnston F 42 7 9 10 10 7 Barbara Scott 7 29 5 8 9 7 8 Benjamin Eng M 36 9 7 8 8
Removing observations
complete.cases()
or na.omit()
functions
These two functions are more global than is.na()
and remove all rows containing missing values.
- When the
complete.cases()
function is applied to a data frame, it returns a logical vector of the same length as the number of rows of the data frame with elements equal toTRUE
if the corresponding rows contain “complete cases” (i.e., no missing values) and equal toFALSE
otherwise. When this logical vector is used as the row index vector, only rows without missing values are retained. - When the function
na.omit()
is applied to a data frame, any rows with missing data are “omitted.”
# use the complete.cases() function to wipe out the second and fourth rows, which contain the "NA" values in either name or Q1. cc <- complete.cases(actuary) actuary.2 <- actuary[cc, ] actuary.2 name gender age exams Q1 Q2 Q3 1 Embryo Luo M -1 10 10 9 9 3 Peter Smith M 22 0 4 5 5 5 Angela Peterson F 30 6 8 8 10 6 Emily Johnston F 42 7 9 10 10 7 Barbara Scott 7 29 5 8 9 7 8 Benjamin Eng M 36 9 7 8 8
> actuary.3 <- na.omit(actuary) > actuary.3 name gender age exams Q1 Q2 Q3 1 Embryo Luo M -1 10 10 9 9 3 Peter Smith M 22 0 4 5 5 5 Angela Peterson F 30 6 8 8 10 6 Emily Johnston F 42 7 9 10 10 7 Barbara Scott 7 29 5 8 9 7 8 Benjamin Eng M 36 9 7 8 8
New Background
actuary.n <- actuary.3 actuary.n name gender age exams Q1 Q2 Q3 1 Embryo Luo M -1 10 10 9 9 3 Peter Smith M 22 0 4 5 5 5 Angela Peterson F 30 6 8 8 10 6 Emily Johnston F 42 7 9 10 10 7 Barbara Scott 7 29 5 8 9 7 8 Benjamin Eng M 36 9 7 8 8 # Check the number of rows nrow(actuary) - nrow(actuary.n) [1] 2
Adding New Variables
To add a new variable to a data frame, we can use the dollar sign $
notation to name and define the new variable.
<data_frame>$<new_var> <- <definition>
# Create a variable "S" that measures the average of the three evaluation ratings for each actuary actuary.n$S <- (actuary.n$Q1 + actuary.n$Q2 + actuary.n$Q3) / 3 actuary.n name gender age exams Q1 Q2 Q3 S 1 Embryo Luo M -1 10 10 9 9 9.333333 3 Peter Smith M 22 0 4 5 5 4.666667 5 Angela Peterson F 30 6 8 8 10 8.666667 6 Emily Johnston F 42 7 9 10 10 9.666667 7 Barbara Scott 7 29 5 8 9 7 8.000000 8 Benjamin Eng M 36 9 7 8 8 7.666667
Using Function
apply()
This function allows us to “apply” an arbitrary function to any dimension of a matrix and a data frame, provided that its columns are of the same data structure.
apply(<data_frame>, MARGIN, <function>)
- where
MARGIN = 1
indicates that the function will act on the rows of the data frame - while
MARGIN = 2
indicates that the function will act on its columns.
# Create a variable "S" that measures the average of the three evaluation ratings for each actuary actuary.n$S <- apply(actuary.n[, c("Q1", "Q2", "Q3")], 1, mean) actuary.n name gender age exams Q1 Q2 Q3 S 1 Embryo Luo M -1 10 10 9 9 9.333333 3 Peter Smith M 22 0 4 5 5 4.666667 5 Angela Peterson F 30 6 8 8 10 8.666667 6 Emily Johnston F 42 7 9 10 10 9.666667 7 Barbara Scott 7 29 5 8 9 7 8.000000 8 Benjamin Eng M 36 9 7 8 8 7.666667
Sorting
- Use
which.max()
andwhich.min()
to find the maximum
# Find the index at which the average rating is "maximized" actuary.n[which.max(actuary.n$S), "name"] [1] Emily Johnston 7 Levels: Angela Peterson Barbara Scott Benjamin Eng ... Peter Smith # Find the index at which the average rating is "minimized" actuary.n[which.min(actuary.n$S), "name"] [1] Peter Smith 7 Levels: Angela Peterson Barbara Scott Benjamin Eng ... Peter Smith
- Use
order()
to get the permutation of indices when the elements of the vector supplied as an argument is sorted from smallest to largest.
# Sort the data frame with respect to variable S from the smallest to the largest (ascending order) order(actuary.n$S) [1] 2 6 5 3 1 4 actuary.n1 <- actuary.n[order(actuary.n$S), ] actuary.n1 name gender age exams Q1 Q2 Q3 S 3 Peter Smith M 22 0 4 5 5 4.666667 8 Benjamin Eng M 36 9 7 8 8 7.666667 7 Barbara Scott 7 29 5 8 9 7 8.000000 5 Angela Peterson F 30 6 8 8 10 8.666667 1 Embryo Luo M -1 10 10 9 9 9.333333 6 Emily Johnston F 42 7 9 10 10 9.666667 # Sort the data frame with respect to variable S from the largest to the smallest (descending order) order(actuary.n$S, decreasing= TRUE) [1] 4 1 3 5 6 2 actuary.n2 <- actuary.n[order(actuary.n$S, decreasing= TRUE), ] actuary.n2 name gender age exams Q1 Q2 Q3 S 6 Emily Johnston F 42 7 9 10 10 9.666667 1 Embryo Luo M -1 10 10 9 9 9.333333 5 Angela Peterson F 30 6 8 8 10 8.666667 7 Barbara Scott 7 29 5 8 9 7 8.000000 8 Benjamin Eng M 36 9 7 8 8 7.666667 3 Peter Smith M 22 0 4 5 5 4.666667 # Sorted by gender, then by S, both in descending order order(actuary.n$gender, actuary.n$S, decreasing = TRUE) [1] 1 6 2 4 3 5 actuary.n3 <- actuary.n[order(actuary.n$gender, actuary.n$S, decreasing = TRUE), ] actuary.n3 name gender age exams Q1 Q2 Q3 S 1 Embryo Luo M -1 10 10 9 9 9.333333 8 Benjamin Eng M 36 9 7 8 8 7.666667 3 Peter Smith M 22 0 4 5 5 4.666667 6 Emily Johnston F 42 7 9 10 10 9.666667 5 Angela Peterson F 30 6 8 8 10 8.666667 7 Barbara Scott 7 29 5 8 9 7 8.000000
Identify observations with certain characteristics
# Identify Fellow from the actuaries who should have passed 10 exams. actuary.FSA <- actuary.n[actuary.n$exams == 10, ] actuary.FSA name gender age exams Q1 Q2 Q3 S 1 Embryo Luo M -1 10 10 9 9 9.333333 # Identify ASA's from the actuaries who should have passed 7 to 9 exams. actuary.ASA<- actuary.n[actuary.n$exams >= 7 & actuary.n$exams <= 9, ] actuary.ASA name gender age exams Q1 Q2 Q3 S 6 Emily Johnston F 42 7 9 10 10 9.666667 8 Benjamin Eng M 36 9 7 8 8 7.666667
Create a new categorical variable
lifelse(<cond>, <result_1>, <result_2>)
where:
<cond>
is the logical test to be conducted on a vector.<resul t_1>
is the return value if the corresponding element of the logical vector isTRUE
.<result_2>
is the return value if the corresponding element of the logical vector isFALSE
.
# Use ifelse() to create a categorical variable based on the number of exams passed actuary.n$title <- ifelse(actuary.n$exams == 10, "FSA", ifelse(actuary.n$exams >= 7 & actuary.n$exams <= 9, "ASA", "student")) actuary.n name gender age exams Q1 Q2 Q3 S title 1 Embryo Luo M -1 10 10 9 9 9.333333 FSA 3 Peter Smith M 22 0 4 5 5 4.666667 student 5 Angela Peterson F 30 6 8 8 10 8.666667 student 6 Emily Johnston F 42 7 9 10 10 9.666667 ASA 7 Barbara Scott 7 29 5 8 9 7 8.000000 student 8 Benjamin Eng M 36 9 7 8 8 7.666667 ASA
Horizontal and vertical concatenations
cbind()
: If the external dataset carries new variables for the same observations as the current dataset, then we can do a horizontal concatenation and combine the new columns with the old onesrbind()
: If the external dataset has the same variables as the current dataset, then we can do a vertical concatenation and add new observations.
# Use cbind() to append variables "smoke" and "weight" to actuary.n new <- data.frame(smoke = c("N", "N", "Y", "Y", "N", "N", "N", "Y"), weight = c(70, 65, 60, 90 , 60, 55, 50 , 75)) actuary.n <- cbind(actuary.n, new[cc, ]) actuary.n name gender age exams Q1 Q2 Q3 S title smoke weight 1 Embryo Luo M -1 10 10 9 9 9.333333 FSA N 70 3 Peter Smith M 22 0 4 5 5 4.666667 student Y 60 5 Angela Peterson F 30 6 8 8 10 8.666667 student N 60 6 Emily Johnston F 42 7 9 10 10 9.666667 ASA N 55 7 Barbara Scott 7 29 5 8 9 7 8.000000 student N 50 8 Benjamin Eng M 36 9 7 8 8 7.666667 ASA Y 75
Create a compound variable
“paste” its arguments as character strings.
paste(<vector_1>, ..., <vector_n>, sep = <string>, collapse = <string>)
sep
: supplies a delimiter to separate the individual vectors. The default of thesep
argument is one space (" "
).collapse
: The multiple character strings will then be concatenate using the delimiter indicated by thecollapse
argument
paste(c("a", "b", "c"), c("x", "y", "z"), sep = "") [1] "ax" "by" "cz" # Recycling rule paste("The answer is", c("YES", "N0")) [1] "The answer is YES" "The answer is N0" paste("Yr", 1:10) [1] "Yr 1" "Yr 2" "Yr 3" "Yr 4" "Yr 5" "Yr 6" "Yr 7" "Yr 8" "Yr 9" "Yr 10" paste("The answer is", c("YES", "N0"), sep = "...", collapse = " OR ") [1] "The answer is...YES OR The answer is...N0"
paste0(<vector_1>, ..., <vector_n>)
paste0
function is a wrapper for paste()
with the sep
argument set a priori to “zero” space (" "
).
# Set up the compound variable named genderSmoke. actuary.n$genderSmoke <- paste0(actuary.n$gender, actuary.n$smoke) actuary.n name gender age exams Q1 Q2 Q3 S title smoke weight genderSmoke 1 Embryo Luo M -1 10 10 9 9 9.333333 FSA N 70 MN 3 Peter Smith M 22 0 4 5 5 4.666667 student Y 60 MY 5 Angela Peterson F 30 6 8 8 10 8.666667 student N 60 FN 6 Emily Johnston F 42 7 9 10 10 9.666667 ASA N 55 FN 7 Barbara Scott 7 29 5 8 9 7 8.000000 student N 50 7N 8 Benjamin Eng M 36 9 7 8 8 7.666667 ASA Y 75 MY
Another example would be:
paste0(1:10, c("st", "nd", "rd", rep("th", 7))) [1] "1st" "2nd" "3rd" "4th" "5th" "6th" "7th" "8th" "9th" "10th"
For-Loop Statement
Construct a confidence interval
# Initialization step n <- 100 n_sim <- 1000 # number of replications mu <- 5 # true mean vaLue sigma <- 2 count <- rep(NA, n_sim) # Repetition step set.seed(0) # to make results reproducibLe for (i in 1:n_sim) { # Draw a random sampLe of size n from a normal distribution # with mean mu and standard deviation sigma + x <- rnorm(n, mean = mu, sd = sigma) + count[i] <- (abs(mean(x) - mu) <= qnorm(0.975) * sigma / sqrt(n)) + } # Final result mean(count) [1] 0.952
Reproducible Results
Set a Seed
Set a seed in R is used for:
- Reproducing the same output of simulation studies.
- Help to debug the code when dealing with pseudorandom numbers.
set.seed()
: allow you to set a seed and a generator (with the kind
argument) in R.
- The state of the random number generator is stored in
.Random.seed
(in the global environment). It is a vector of integers which length depends on the generator. - If the seed is not specified, R uses the clock of the system to establish one.
set.seed(1) rnorm(5) [1] -0.6264538 0.1836433 -0.8356286 1.5952808 0.3295078 rnorm(5) [1] -0.8204684 0.4874291 0.7383247 0.5757814 -0.3053884 ======================================= Same as Below ======================================= set.seed(1) rnorm(10) [1] -0.6264538 0.1836433 -0.8356286 1.5952808 0.3295078 -0.8204684 0.4874291 [8] 0.7383247 0.5757814 -0.3053884> set.seed(1)
Unset a Seed
To unset or reset a seed in R, you have two options:
- As R uses the system clock to set a seed when not specified, you can use the
Sys.time
function as follows to come back to the default behavior
set.seed(Sys.time())
- pass a
NULL
to the function, to re-initialize the generator “as if no seed had yet been set”.
set.seed(NULL)
Generating Randomized Results
Uniform
runif(n)
: simulates \(n\) uniform random variables on interval \([0,1]\)
Normal
pnorm(q, mean, sd, lower.tail = TRUE)
: The \(p\) is for “probability” and is a cumulative distribution function.qnorm(x, mean, sd)
: The \(q\) is for “quantile“, and it is an inverse.dnorm(x, mean = x, sd = y)
: The \(d\) is for “density“, which is a density function.rnorm(n, mean = x, sd = y)
: The \(r\) is for “random“, a random variable with the specified distribution, simulates \(n\) normal distribution with mean \(x\) and standard deviation \(y\)
Fitting Least Squares Estimates
Question
You are fitting a multiple linear regression model:
\(Y=\beta_0+\beta_1X_1+\beta_2X_2+\epsilon\)
to the following dataset:
\(i\) | \(Y_i\) | \(X_{i1}\) | \(X_{i2}\) |
1 | 14 | 2 | 4 |
2 | 11 | 3 | 2 |
3 | 14 | 0 | 6 |
4 | 18 | 8 | 4 |
5 | 12 | 4 | 3 |
6 | 9 | 1 |
2 |
Perform the following two tasks in R.
-
- Set up the response vector \(Y\) and design matrix \(X\) corresponding to the dataset above.
- Use the matrix formula \(\hat{\beta}=(X^TX)^{-1}X^TY\) to calculate the least squares estimates of \(\beta_0,\beta_1,\beta_2\)
Solution
> Y <- c(14, 11, 14, 18, 12, 9) > X1<- c(2, 3, 0, 8, 4, 1) > X2 <- c(4, 2, 6, 4, 3, 2) > X <- matrix(c(XO, Xi, X2), nrow = 6) > X [,1] [,2] [,3] [1,] 1 2 4 [2,] 1 3 2 [3,] 1 0 6 [4,] 1 8 4 [5,] 1 4 3 [6,] 1 1 2 > B <- solve(t(X) %*% X) %*% t(X) %*% Y > B [,1] [1,] 5.2505543 [2,] 0.8137472 [3,] 1.5166297
Therefore, \(\hat{\beta_0}=5.2505543\), \(\hat{\beta_1}=0.8137472\), \(\hat{\beta_2}=1.5166297\)