
SOA ASA Exam: Predictive Analytics (PA) – 2. ggplot


Making ggplots

Basic Features

Load library

library(ggplot2)

 

ggplot Function

ggplot(data = <DATA>, mapping = aes(<AESTHETIC_1> = <VARIABLE_1>,
                                    <AESTHETIC_2> = <VARIABLE_2>,
                                    ...)) +
    geom_<TYPE>(<...>) +
    geom_<TYPE>(<...>) +
    <OTHER_FUNCTIONS> +
    ...

  • The ggplot() function initializes the plot, defines the source of data using the data argument (almost always a data frame), and specifies what variables in the data are mapped to visual elements in the plot by the mapping argument.
    • Mappings in a ggplot are specified using the aes() function, with aes standing for "aesthetics".
  • The geom functions: Subsequent to the ggplot() function, we put in geometric objects, or geoms for short, which include points, lines, bars, histograms, boxplots, and many other possibilities, by means of one or more geom functions. Placed layer by layer, these geoms determine what kind of plot is to be drawn and modify its visual characteristics, taking the data and aesthetic mappings specified in the ggplot() function as inputs.
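
As a minimal sketch of this template in action (using R's built-in mtcars data purely for illustration; the dataset and variables are not part of the exam material):

# A concrete instance of the template: scatterplot of mpg against wt with a fitted line
ggplot(data = mtcars, mapping = aes(x = wt, y = mpg)) +
    geom_point() +
    geom_smooth(method = "lm")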

 

Important geoms

geom                Type of Object Produced    Frequently Used Arguments
geom_bar()          Bar chart                  fill, alpha
geom_boxplot()      Boxplot                    fill, alpha
geom_histogram()    Histogram                  fill, alpha, bins
geom_point()        Scatterplot                color, alpha, shape, size
geom_smooth()       Smooth curve               color, fill, method, se

 

Example: Personal Injury Insurance Data

Dataset: persinj.csv

1. Import Dataset

persinj <- read.csv("persinj.csv")
# Take out a subset of 50 observations from the full data
persinj50 <- persinj[seq(1, nrow(persinj), length.out = 50), ]

 

2. Color all observations in blue

ggplot(data = persinj50, mapping = aes(x = op_time, y = amt)) +
    geom_point(colour = "blue")

 

3. Color all observations distinguished by legal representation

# 1.
ggplot(data = persinj50, mapping = aes(x = op_time, y = amt, colour = legrep)) +
    geom_point()

# 2. Convert "legrep" variable to a factor
ggplot(data = persinj50, mapping = aes(x = op_time, y = amt, colour = factor(legrep))) +
    geom_point()

 

3.1.  The use of geom_smooth()

  • alpha
    controls the transparency of the plotted objects on a scale from 0 (fully transparent) to 1 (opaque); the default value is 1.
  • fill
    controls filled areas of bars and polygons
# 1. 
ggplot(data = persinj50, mapping = aes(x = op_time, y = amt, colour = factor(legrep))) +
    geom_point(size = 2, alpha = 0.5) +
    geom_smooth()

# 2. Fill the interior of the standard error bands according to the "legrep" variable treated as a factor
ggplot(data = persinj50, mapping = aes(x = op_time, y = amt, colour = factor(legrep), fill = factor(legrep))) + 
    geom_point(size = 2, alpha = 0.5) +
    geom_smooth()
1. The consistent coloring is achieved by mapping both the color aesthetic and fill aesthetic (which controls filled areas of bars, polygons and, in this case, the interior of standard error bands) to the legrep variable treated as a factor variable.
2. If you omit fill = factor(legrep), then the two smoothed curves will still be colored according to the presence of legal representation due to the color aesthetic, but without the fill aesthetic, the geom_smooth() function will shade the two standard error ribbons by its default color, which is gray.

We can see that the claim amounts of the two groups of injuries behave quite differently with respect to op_time, with the claims involving legal representation being much more sensitive to changes in op_time.

 

To fit just one smoothed curve to all 50 observations in the persinj50 dataset, while still having the points colored according to the presence of legal representation, we can specify different aesthetic mappings for different geoms:

# 1. The aes() function in the ggplot() call only has the x and y aesthetics; the color aesthetic is moved to the geom_point() function.
#    As a result, the 50 points will have their color distinguished by the legrep variable.
ggplot(data = persinj50, mapping = aes(x = op_time, y = amt)) +
    geom_point(aes(colour = factor(legrep)), size = 2, alpha = 0.5) +
    geom_smooth()

# 2. 
ggplot(data = persinj50, mapping = aes(x = op_time, y = amt, fill = factor(legrep))) +
    geom_point(aes(colour = factor(legrep)), size = 2, alpha = 0.5) +
    geom_smooth(se = FALSE)

# 3.
ggplot(data = persinj50, mapping = aes(x = op_time, y = amt)) +
    geom_point(aes(colour = factor(legrep)), size = 2, alpha = 0.5) +
    geom_smooth(aes(fill = factor(legrep)))
  • In all three chunks, the 50 points have their color distinguished by the legrep variable because the color aesthetic is specified inside geom_point().
  • In # 2, the fill aesthetic is added to the initial ggplot() call and the option se = FALSE is added to the geom_smooth() function. The fill aesthetic has no effect on the geom_point() function, but it does affect the geom_smooth() function, which by default produces standard error bands filled in color according to the fill aesthetic. Even though these bands are switched off by se = FALSE, separate smoothed curves are still fitted to the two groups of injuries. Without the color aesthetic, however, the two curves share the same color, which is blue.
  • In # 3, the fill aesthetic is instead added to the geom_smooth() function. As a result, separate smoothed curves are fitted to the two groups of injuries with the standard error bands filled in color according to legrep. As the color aesthetic is absent in geom_smooth(), the two smoothed curves still share the same color (blue).

 

3.2.  The use of facet_wrap()

In ggplot2, grouping is achieved by mapping one or more categorical variables in the data to visual elements like color, shape, fill, size, and linetype.

Faceting is another useful way to categorize data into distinct groups. While grouping showcases two or more groups of observations in a single plot, faceting displays the observations in separate plots (known as a "small multiple" plot), one for each value of the faceting variable, placed side by side and usually on the same scale to facilitate comparison.

facet_wrap(~ <FACET_VAR>, ncol = <n>)

where ncol = <n> is optional and determines the number of columns used to display the facets.

# 1. Produce faceted scatterplot according to the two values of legrep
ggplot(data = persinj50, mapping = aes(x = op_time, y = amt)) +
    geom_point(size = 2, alpha = 0.5) +
    geom_smooth() +
    facet_wrap(~ legrep)

# 2. Add scales = "free" to facet_wrap() to relax the common-scale constraint and "free" the axis scales
ggplot(data = persinj50, mapping = aes(x = op_time, y = amt)) +
    geom_point(size = 2, alpha = 0.5) +
    geom_smooth() +
    facet_wrap(~ legrep, scales = "free")

3.3.  The use of facet_grid()

When there are two faceting variables, we can cross-classify them using the facet_grid() function, which produces a two-dimensional "grid" of plots. Its syntax is similar to that of facet_wrap():

facet_grid(<FACET_VAR_1> ~ <FACET_VAR_2>)

where the first faceting variable defines the rows of the grid and the second defines the columns (so ncol is not needed).
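
For instance, a sketch cross-classifying the personal injury scatterplot by legal representation (rows) and injury code (columns); the choice of faceting variables here is just illustrative:

# Facet by legrep (rows) and inj (columns)
ggplot(data = persinj50, mapping = aes(x = op_time, y = amt)) +
    geom_point(alpha = 0.5) +
    facet_grid(legrep ~ inj)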

 

Customizing Your Plots

Axes

  • coord_cartesian(): Use the xlim and ylim arguments, each a two-element numeric vector, to set the desired lower and upper limits of the x-axis and y-axis. Unlike setting scale limits (e.g., via xlim() or ylim()), coord_cartesian() merely zooms in on the plot; data points outside the limits are not thrown away.
  • scale_x_log10(): converts the scale of the x-axis of a ggplot to a log base 10 axis and re-positions the data points accordingly.
    When this function is applied, the points 10, 100, 1000, and 10000 will be equally spaced on the x-axis because their log base 10 counterparts, log10(10) = 1, log10(100) = 2, log10(1000) = 3, and log10(10000) = 4, are consecutive integers.
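
A brief sketch of both adjustments on the personal injury data (the axis limits and bin count below are illustrative choices, not prescribed values):

# Zoom in on claim amounts between 0 and 100,000 without discarding any data points
ggplot(persinj50, aes(x = op_time, y = amt)) +
    geom_point() +
    coord_cartesian(ylim = c(0, 100000))

# Display claim amount on a log-10 x-axis
ggplot(persinj50, aes(x = amt)) +
    geom_histogram(bins = 20) +
    scale_x_log10()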

Titles, subtitles, and captions

  • labs(): allows you to set the label for the x-axis, the label for the y-axis, and the text for the title, subtitle, and caption of a ggplot using the x, y, title, subtitle, and caption arguments, respectively, with the desired label or text supplied as a character string.
  • xlab(): set the label for the x-axis
  • ylab(): set the label for the y-axis
  • ggtitle(): set the title (and optionally the subtitle) of a ggplot
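
Equivalently, the axis labels and title can be set with the standalone helpers; a quick sketch using the earlier scatterplot:

ggplot(persinj50, aes(x = op_time, y = amt)) +
    geom_point() +
    xlab("Operational Time") +
    ylab("Claim Amount") +
    ggtitle("Personal Injury Dataset")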

Displaying multiple graphs on one page

  • grid.arrange(): in the gridExtra package, places several ggplots in a single figure for ease of comparison; a short sketch of its use follows the example below.
    This is not the same as faceting, where each panel shows the sub-sample of the data corresponding to certain values of the faceting variable(s).
ggplot(data = persinj50, mapping = aes(x = op_time,
                                       y = amt,
                                       color = factor(legrep),
                                       fill = factor(legrep))) +
  geom_point(size = 2, alpha = 0.5) +
  geom_smooth() +
  labs(title = "Personal Injury Dataset",
       x = "Operational Time",
       y = "Claim Amount") +
  coord_cartesian(ylim = c(-200000, 300000))
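
As a brief sketch of grid.arrange() itself (assuming the gridExtra package is installed), the plot above could be displayed next to a version with the claim amount on a log-10 scale:

library(gridExtra)
p1 <- ggplot(persinj50, aes(x = op_time, y = amt)) + geom_point()
p2 <- ggplot(persinj50, aes(x = op_time, y = amt)) + geom_point() + scale_y_log10()
grid.arrange(p1, p2, ncol = 2)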

 

Data Exploration

Univariate Data Exploration

Numeric Variables

  • summary()

Example: Summary statistics

persinj <- read.csv("persinj.csv")

summary(persinj$amt)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
     10    6297   13854   38367   35123 4485797 
    • The mean of claim amount (38,367) is much higher than its median (13,854), indicating that its distribution is highly skewed to the right.

 

Problems with skewed data and possible solutions

In predictive modeling, it is often undesirable to have a right-skewed target variable for two reasons:

    • Our objective in predictive analytics is to study the association between the target variable and the predictors over a wide range of variable values. If most of the observations of the target variable cluster narrowly in the small-value range, it becomes difficult to investigate the effect of the predictors on the target variable globally; we simply don't know enough about the target variable in the right tail. The same idea applies to a right-skewed predictor: if a predictor exhibits a heavy right skew, we are unable to differentiate the observations of the target variable effectively on the basis of the values of the predictor, most of which are concentrated in the small-value range.
    • A number of predictive models (e.g., linear models, decision trees) are fitted by minimizing the sum of the squared discrepancies between the observed values and predicted values of the target variable. If the target variable is right-skewed, then the outliers, or extreme values, will contribute substantially to the sum and have a disproportionate effect on the model.
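
A tiny numeric illustration of the second point (the residuals below are made up):

# Made-up residuals: a single extreme value dominates the sum of squares
errors <- c(1, 2, 3, 1000)
sum(errors^2)     # 1000014 -- almost entirely driven by the outlier
sum(abs(errors))  # 1006 -- absolute errors give the outlier far less relative weight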

Correcting for the skewness

    • Apply a monotone concave function to shrink the outliers and symmetrize the overall distribution while preserving the ranks of the observed values of the variable. This will reduce the effects of the extreme values on the model and tends to improve the goodness of fit compared with using the original right-skewed variable.
      Two commonly used transformations for dealing with right-skewed variables are:

      • Log transformation
        This transformation (typically with respect to base e) can be applied as long as the variable of interest is strictly positive. None of the variable values should be zero or negative.
      • Square root transformation:
        Although not discussed in the PA e-learning modules, the square root transformation is in a similar spirit to the log transformation, but is applicable even to non-negative variables, some of whose values are zero.

 

Example: Calculating the summary statistics for two groups of observations

Task: Calculate the summary statistics for claim amount separately for injuries with legal representation and those without legal representation. Comment on the central tendency and dispersion of claim amount for these two groups of injuries.

Solution:

1. Split the dataset into two subsets

persinj.0 <- persinj[persinj$legrep == 0,]
persinj.1 <- persinj[persinj$legrep == 1,]

summary(persinj.0$amt)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
     10    4061   11164   32398   29641 2798362

summary(persinj.1$amt)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
     20    7305   15309   41775   38761 4485797

sd(persinj.0$amt)
[1] 77820.33

sd(persinj.1$amt)
[1] 97541.38

 

2. Visualize data with histograms: geom_histogram()

p1 <- ggplot(persinj, aes(x = amt)) +
  geom_histogram(bins = 10, fill = "blue") + xlim(0, 100000) + ggtitle("Bins = 10")

p2 <- ggplot(persinj, aes(x = amt)) +
  geom_histogram(fill = "blue") + xlim(0, 100000) + ggtitle("Default Value")

p3 <- ggplot(persinj, aes(x = amt)) +
  geom_histogram(bins = 40, fill = "blue") + xlim(0, 100000) + ggtitle("Bins = 40")

p4 <- ggplot(persinj, aes(x = amt)) +
  geom_histogram(bins = 80, fill = "blue") + xlim(0, 100000) + ggtitle("Bins = 80")

library(gridExtra)
grid.arrange(p1, p2, p3, p4, ncol = 2)
  • The distribution of claim amount is heavily skewed to the right with a pronounced tail.

 

3. Correcting the skewness

p1<- ggplot(persinj, aes(x = log(amt))) +
  geom_histogram()

p2<- ggplot(persinj, aes(x = sqrt(amt))) +
  geom_histogram()

library(gridExtra)
grid.arrange(p1, p2, ncol = 2)
  • It is evident that the log transformation has effectively removed the skewness of claim amount and made the resulting distribution much more symmetric.

 

4. Identify outliers

    • Errors: If an outlier is erroneous, then it is reasonable to correct the error or remove it entirely.
    • Natural: Outliers can also arise naturally. Not necessarily errors, they are simply observations whose values are far away from the rest of the data, but are possible in theory.
      • Remove: If we somehow know that a natural outlier is not likely to have a material effect on the final model, then it is fine to remove it.
      • Ignore: If the outliers make up only an insignificant proportion of the data and unlikely to create bias, then it is sensible to leave them in the data.
      • Modify: We can also modify the outliers to make them more reasonable, like censoring the policyholder age of 140 at 100 (see the sketch after this list).
      • Using robust model forms: Instead of minimizing the squared error between the predicted values and the observed values, we could replace the squared error by the absolute error, which places much less relative weight on the large errors and reduces the impact of outliers on the fitted model.
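
A minimal sketch of the "Modify" approach above, assuming we cap claim amounts at their 99th percentile (the 0.99 cutoff is an arbitrary illustrative choice):

# Cap (winsorize) extreme claim amounts at the 99th percentile; the cutoff is illustrative
cap <- quantile(persinj$amt, 0.99)
amt_capped <- pmin(persinj$amt, cap)
summary(amt_capped)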

 

5. A variation of histograms

Density plots: geom_density()

p1<- ggplot(persinj, aes(x = log(amt))) +
  geom_density()

p2<- ggplot(persinj, aes(x = sqrt(amt))) +
  geom_density()

library(gridExtra)
grid.arrange(p1, p2, ncol = 2)

 

Box plots: geom_boxplot()

p1<- ggplot(persinj, aes(y = amt)) +
  geom_boxplot()

p2<- ggplot(persinj, aes(y = log(amt))) +
  geom_boxplot()

library(gridExtra)
grid.arrange(p1, p2, ncol = 2)

    • While most of the raw claim amounts are so concentrated that the 25th percentile, median, and 75th percentile all appear to collapse onto the same line, the log transformation corrects for the skewness and re-positions the data points for much easier visual inspection. Still, there are quite a number of "outliers," which are shown as individual points beyond the whiskers.

Categorical Variables

Descriptive Statistics

    • Frequency tables: Categorical variables, even when coded as numbers, do not always have a natural order, so statistical summaries like the mean and median may not make sense. To understand the distribution of a categorical variable, we can look at the relative frequency of each of its levels through a frequency table, constructed by the table() function in R.
# the counts of the seven levels of inj
table(persinj$inj)
    1     2     3     4     5     6     9 
15638  3376  1133   189   188   256  1256

# the percentage counts of the seven levels of inj
table(persinj$inj)/nrow(persinj)
          1           2           3           4           5           6           9 
0.709656925 0.153203848 0.051415865 0.008576874 0.008531494 0.011617353 0.056997640

Graphical Displays

    • Bar charts: When the number of levels of a categorical variable increases, a frequency table becomes more and more difficult to read. In most cases, the frequencies per se are not that important; what truly matters is their relative magnitude.
      In this regard, bar charts extract the information in a frequency table and present the numeric counts visually, highlighting the relative frequency of each level in the variable. Looking at a bar chart, we can easily tell which levels are the most popular and which ones have minimal observations.
# Produce two bar charts for injury code corresponding to the two frequency tables
# Convert integers to factors
persinj$inj <- as.factor(persinj$inj)
persinj$legrep <- as.factor(persinj$legrep)

p1 <- ggplot(persinj, aes(x = inj)) +
  geom_bar(fill = "blue")
p2 <- ggplot(persinj, aes(x = inj)) +
  geom_bar(fill = "blue", aes(y = ..prop.., group = 1))

grid.arrange(p1, p2, ncol = 2)

 

Bivariate Data Exploration

Numeric vs. Numeric

Descriptive Statistics

Correlation: An easy way to summarize the linear relationship between two numeric variables is through the correlation coefficient, or simply correlation. cor()

The larger the correlation in magnitude (i.e., the closer it is to +1 or -1), the stronger the degree of linear association between the two variables.

cor(persinj$amt, persinj$op_time)
[1] 0.3466114

cor(log(persinj$amt), persinj$op_time)
[1] 0.6070667

A zero correlation only means that two variables are not linearly related, but they may be related in more subtle ways. More complex relationships (e.g., quadratic) can be revealed more effectively by graphical displays.
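
A quick sketch of this point using simulated data (unrelated to persinj): a perfect quadratic relationship can still yield a correlation of essentially zero.

x <- seq(-3, 3, by = 0.1)
y <- x^2            # y is completely determined by x, but not linearly
cor(x, y)           # essentially zero, up to floating-point error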

 

Graphical Displays

Scatterplot: The relationship between two numeric variables (one of which is usually the target variable) is typically visualized by a scatterplot. geom_point()
Such a plot often gives us a good sense of the nature of the relationship (e.g., increasing, decreasing, polynomial, periodic) between the two numeric variables and sometimes yields insights that correlations alone cannot provide.

p1 <- ggplot(persinj, aes(x = op_time, y = amt)) +
  geom_point(alpha = 0.05) +
  geom_smooth(method = "lm", se = FALSE)

p2 <- ggplot(persinj, aes(x = op_time, y = log(amt))) +
  geom_point(alpha = 0.05) +
  geom_smooth(method = "lm", se = FALSE)

grid.arrange(p1, p2, ncol = 2)

Both plots exhibit an increasing relationship, but the scatterplot for the log of claim amount displays a much more conspicuous upward sloping trend, indicating that the log of claim amount is approximately positively linear in operational time.

This is a further manifestation of the merits of the log transformation in uncovering relationships that would otherwise be obscure.

Although a scatterplot itself is confined to depicting the relationship between only two numeric variables, the effect of a third, categorical variable can be incorporated and investigated by decorating the observations by color, shape, or other visual elements according to the levels assumed by this third variable.

This way, we can visually inspect whether the relationship between the two numeric variables varies with the levels of the third, categorical variable. In statistical language, this phenomenon is known as interaction and is an important modeling issue to keep in mind when constructing an effective predictive model.

To make a scatterplot for the log of claim amount against operational time, with the observations color-distinguished by legal representation:

ggplot(persinj, aes(x = op_time, y = log(amt), color = legrep)) +
  geom_point(alpha = 0.25) + geom_smooth(method = "lm", se = FALSE)

The scatterplot shows that the two smoothed lines corresponding to the two levels of legrep have markedly different slopes and intercepts.

In other words, the linear relationship between the log of claim amount and operational time depends materially on whether legal representation is present or not. We can also roughly tell the effect of legal representation on the (log of) claim amount: Injuries with legal representation (i.e., those with legrep = 1) tend to produce higher claim amounts, except when operational time is extraordinarily large (90 or higher), in which case legal representation does not seem to have a noticeable impact on claim amount.

Numeric vs. Categorical

Descriptive Statistics

To summarize the association between a numeric variable and a categorical variable, we can partition the data into subsets, one for each level of the categorical variable, and compute the mean of the numeric variable within each subset. If these conditional means vary substantially across the levels, this suggests a strong relationship between the two variables.

Below we produce a table of the mean of the log-transformed claim amount (it is also fine to use the untransformed claim amount variable) split by different levels of inj and legrep, which are categorical.

Note on tidyverse package

    • Installation:
      • install.packages("tidyverse")
      • May need to install dependencies on Linux: apt install libharfbuzz-dev libfribidi-dev
library(tidyverse)
persinj %>%
    group_by(inj) %>% # make sure inj is factorized: persinj$inj <- as.factor(persinj$inj)
    summarise( 
        mean = mean(log(amt)), 
        median = median(log(amt)), 
        n = n()
    )


# A tibble: 7 × 4
  inj    mean median     n
  <fct> <dbl>  <dbl> <int>
1 1      9.37   9.36 15638
2 2     10.3   10.3   3376
3 3     10.7   10.9   1133
4 4     11.0   11.2    189
5 5     10.8   10.6    188
6 6      9.68   9.07   256
7 9      8.35   8.57  1256

Insights:

    • The claim amount on average increases from injury code 1 to injury code 4, then decreases down to injury code 9.
library(tidyverse)
persinj %>%
    group_by(legrep) %>% # make sure legrep is factorized: persinj$legrep <- as.factor(persinj$legrep)
    summarise(
        mean = mean(log(amt)),
        median = median(log(amt)),
        n = n()
    )


# A tibble: 2 × 4
  legrep  mean median     n
  <fct>  <dbl>  <dbl> <int>
1 0       9.18   9.32  8008
2 1       9.77   9.64 14028

Insights:

    • Larger claim amounts are associated with the use of legal representation.

Graphical Displays

Split Boxplots: The conditional distribution of a numeric variable given a second, categorical variable is best visualized by split boxplots, where a series of boxplots of the numeric variable split by the categorical variable are made.

Below we construct two split boxplots for the log of claim amount, one split by injury code and one split by legal representation. The categorical variable that is used to split the numeric variable simply enters the x aesthetic and a collection of boxplots of the numeric variable for each level of the categorical variable will be shown.

p1 <- ggplot(persinj, aes(x = inj, y = log(amt))) + geom_boxplot()
p2 <- ggplot(persinj, aes(x = legrep, y = log(amt))) + geom_boxplot()
library(gridExtra)
grid.arrange(p1, p2, ncol = 2)

 

The above plots can be condensed as below:

ggplot(persinj, aes(x = inj, y = log(amt), fill = legrep)) +
    geom_boxplot()

We split the log of claim amount by injury code (the x aesthetic), followed by legal representation (the fill aesthetic) within each injury code, to view a three-way relationship.

Highlights:

    • Regardless of the injury code, larger claim sizes tend to be associated with injuries with legal representation, with the effect being most prominent for injuries of code 5 and code 9.

Histograms: histograms can also be adapted to visualize the distribution of a numeric variable split by a categorical variable. They can either be histograms stacked on top of one another (using the fill aesthetic) to highlight the contribution of each categorical level to the overall distribution of the numeric variable, or dodged histograms with each bin placed side by side for comparison (note the argument position = "dodge").

For the dodged histogram, it is necessary to specify y = ..density.. so that the histogram shows the density rather than raw counts; counts are misleading because there are a lot more injuries with legal representation than those without.

 

# Stacked (top) and dodged (bottom) histograms of the log of claim amount in the persinj data.
p1 <- ggplot(persinj, aes(x = log(amt), fill = legrep)) +
    geom_histogram()
p2 <- ggplot(persinj, aes(x = log(amt), y = ..density.., fill = legrep)) +
    geom_histogram(position = "dodge")
library(gridExtra)
grid.arrange(p1, p2)

Both histograms suggest that larger claims (e.g., those with log(amt) >= 9) tend to be those with legal representation, as expected.

Categorical vs. Categorical

Descriptive Statistics

When examining a pair of categorical variables, it is often useful to construct a two-way frequency table showing the number of observations for every combination of the levels of the two variables using the table() function. When two arguments are supplied to the table() function, the first argument will correspond to the rows of the two-way frequency table while the second argument will correspond to its columns.

# Make a two-way frequency table for legal representation crossed with injury code
table(persinj$legrep, persinj$inj)

        1     2     3     4     5     6     9
  0  5571  1152   374    56    85   121   649
  1 10067  2224   759   133   103   135   607

Graphical Displays

Bar Charts: To visualize the distribution of a categorical variable split by another categorical variable effectively, split bar charts can be of use.

# Produce three bar charts for injury code split by legal representation
p1 <- ggplot(persinj, aes(x = inj, fill = legrep)) + 
    geom_bar()
p2 <- ggplot(persinj, aes(x = inj, fill = legrep)) + 
    geom_bar(position = "dodge") 
p3 <- ggplot(persinj, aes(x = inj, fill = legrep)) + 
    geom_bar(position = "fill") + 
    ylab("Proportion")
library(gridExtra)
grid.arrange(p1, p2, p3, ncol = 2)

Highlights:

    • p1 (Stacked): The first bar chart has counts within each injury code colored by legal representation because the fill aesthetic is set to legrep. In other words, each bar is broken down proportionally into injuries with legal representation (colored in green) and those without legal representation (colored in red).
    • p2 (Dodged): The second bar chart has counts within each injury code separated according to legal representation and placed side by side for comparison due to the option position = "dodge" in the geom_bar() function.
    • p3 (Filled): In the third bar chart, the relative proportions (not counts) of injuries with and without legal representation within each injury code are shown due to the option position = "fill" (not to be confused with the fill aesthetic). This makes it easy to compare proportions across different injury codes, although we lose the ability to see the number of injuries in each code. In predictive modeling, factor levels with more observations are generally considered more reliable than sparse levels.

Filled bar charts are usually the most useful for depicting the interplay between two categorical variables. If a categorical variable is an important predictor, then the proportions of 0 (in red) and 1 (in green) should vary significantly across different levels of the categorical variable.

Exercise 1

Preparation

persinj <- read.csv("persinj.csv")
persinj$inj <- as.factor(persinj$inj)
persinj$legrep <- as.factor(persinj$legrep)
library(ggplot2)
# CHUNK 1
ggplot(persinj, aes(x = inj, color = legrep)) + geom_bar()

# CHUNK 2
ggplot(persinj, aes(x = inj, fill = legrep)) + geom_bar()

# CHUNK 3
ggplot(persinj, aes(x = inj)) + geom_bar(fill = legrep)
Error in list2(just = just, width = width, na.rm = na.rm, orientation = orientation, :
object 'legrep' not found

# CHUNK 4
ggplot(persinj, aes(x = inj)) + geom_bar(aes(fill = legrep))

Solution:

  • Chunk 1: a bar chart for injury code with the boundary (not the interior) of the vertical bars colored according to legal representation.
  • Chunk 2: This is similar to CHUNK 1, except that this time it is the interior of the vertical bars that is color-coded according to legal representation.
  • Chunk 3: This chunk of code does not work. The reason is that the fill argument (not fill aesthetic) of the geom_bar() function is mapped to a variable (legrep here) instead of a constant.
  • Chunk 4: generates the same output as CHUNK 2. Instead of putting the fill aesthetic in the ggplot() call, it is placed inside the geom_bar() function. 

Exercise 2

Preparation:

library(ggplot2)
data(diamonds)

(a) Determine the number of observations and variables in the diamonds dataset.

# The number of observations
nrow(diamonds)
[1] 53940

# The number of variables
length(colnames(diamonds))
[1] 10

# Or use dim(<data set>)
dim(diamonds)
[1] 53940    10

With the aid of appropriate graphical displays and/or summary statistics, complete the following tasks.

 

(b) Perform univariate exploration of the price of diamonds (price) and the quality of the cut (cut). Determine if any of these two variables should be transformed and, if so, what transformation should be made. Do your recommended transformation(s), if any, and delete the original variable(s).

# 
summary(diamonds$price)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    326     950    2401    3933    5324   18823

The variable ranges from 326 to 18,823 and its mean is much higher than its median, an indication of pronounced right skewness. We use a histogram to confirm this:

ggplot(diamonds, aes(x = price)) + geom_histogram()

To deal with the right skewness of price, we can use a log transformation. The following commands create the log-transformed price and delete the original price variable.

diamonds$Lprice <- log(diamonds$price)
diamonds$price <- NULL

ggplot(diamonds, aes(x = Lprice)) + geom_histogram()

The resulting histogram shows that the distribution of the log-transformed price is closer to symmetric, although it is not bell-shaped.

The cut variable is a 5-level categorical variable; its levels are "Fair", "Good", "Very Good", "Premium", and "Ideal".

levels(diamonds$cut)

[1] "Fair"      "Good"      "Very Good" "Premium"   "Ideal" 

# The following table shows the counts for each level:
table(diamonds$cut)

     Fair      Good Very Good   Premium     Ideal 
     1610      4906     12082     13791     21551

# The following table shows the percentage for each level:
table(diamonds$cut)/nrow(diamonds)

      Fair       Good  Very Good    Premium      Ideal 
0.02984798 0.09095291 0.22398962 0.25567297 0.39953652

Since cut is a categorical variable, numeric transformations such as the log transformation are not applicable and so we would prefer to leave it as it is.

 

(c) Explore the relationship between the price of diamonds and the weight of diamonds (carat).

Because Lprice and carat are both numeric variables, a scatterplot for the two variables is appropriate for exploring their relationship.

ggplot(diamonds, aes(x = carat, y = Lprice)) + geom_point(alpha = 0.5)

Highlights:

  • The scatterplot shows that the two variables are strongly positively related; the heavier the diamond, the more expensive it is, conforming to our intuition.

 

(d) Explore the relationship between the price of diamonds and the quality of the cut.

Since cut is a categorical variable and Lprice is numeric, we use a split boxplot to explore their relationship.

ggplot(diamonds, aes(x = cut, y = Lprice)) + geom_boxplot(fill = "red")

Highlights:

  • The split boxplot, however, suggests the counter-intuitive idea that diamonds of a higher quality tend to be cheaper (though only slightly). How can this be the case?

 

(e) Reconcile the apparent contradiction between what you get in parts (c) and (d).

To reconcile the contradiction between parts (c) and (d), one can look at the relationship between carat and cut. Again, as carat is numeric and cut is categorical, a split boxplot for carat split by cut will serve our purpose.

ggplot(diamonds, aes(x = cut, y = carat)) + geom_boxplot(fill = "red")

Highlights:

  • The split boxplot shows that the weight of a diamond tends to drop as the quality of the cut becomes higher ("Premium" is an exception). According to part (c), carat is an important predictor of Lprice, so the findings in part (d) may be a result of the negative relationship between carat and cut: higher quality diamonds may be less pricey because they weigh less.

 

Summary of Plot Use

              Categorical        Numerical
Categorical   Filled Bar Charts  Split Boxplots, Dodged Histograms
Numerical                        Scatterplot