[mathjax]
Making ggplots
Basic Features
Load library
library(ggplot2)
ggplot Function
ggplot(data = <DATA>, mapping= aes(<AESTHETIC_1> = <VARIABLE_1>,
<AESTHETIC 2> = <VARIABLE_2>,
...)) +
geom_<TYPE>(< ... >) +
geom_<TYPE>(< ... >) +
<OTHER_FUNCTIDNS> +
...
- The
ggplot()
function initializes the plot, defines the source of data using thedata
argument (almost always a data frame), and specifies what variables in the data are mapped to visual elements in the plot by themapping
argument.- Mappings in a ggplot are specified using the
aes()
function, with aes standing for “aesthetics“.
- Mappings in a ggplot are specified using the
- The
geom
functions: Subsequent to theggplot()
function, we put in geometric objects, or geoms for short, which include points, lines, bars, histograms, boxplots, and many other possibilities, by means of one or more geom functions. Placed layer by layer, these geoms determine what kind of plot is to be drawn and modify its visual characteristics, taking the data and aesthetic mappings specified in theggplot()
function as inputs.
Important geoms
geom | Type of Object Produced | Frequently Used Arguments |
geom_bar() |
Bar chart | fill , alpha |
geom_boxplot() |
Boxplot | fill , alpha |
geom_histogram() |
Histogram | fill , alpha , bins |
geom_point() |
Scatterplot | color , alpha , shape , size |
geom_smooth() |
Smooth curve | color , fill ,method , se |
Example: Personal Injury Insurance Data
Dataset: persinj.csv
1. Import Dataset
persinj <- read.csv("persinj.csv") #Takeout a subset of 50 observations from the full data persinj50 <- persinj[seq(1, nrow(persinj), length = 50), ]
2. Color all observations in blue
ggplot(data = persinj50, mapping = aes(x = op_time, y = amt)) + geom_point(colour = "blue")
3. Color all observations distinguished by legal representation
# 1. ggplot(data = persinj50, mapping = aes(x = op_time, y = amt, colour = legrep)) + geom_point() # 2. Convert "legrep" variable to a factor ggplot(data = persinj50, mapping = aes(x = op_time, y = amt, colour = factor(legrep))) + geom_point()
3.1. The use ofgeoom_smooth()
alpha
controls the transparency of the plotted objects on a scale from 0 (fully transparent) to 1 (opaque); the default value is 1.fill
controls filled areas of bars and polygons
# 1. ggplot(data = persinj50, mapping = aes(x = op_time, y = amt, colour = factor(legrep))) + geom_point(size = 2, alpha = 0.5) + geom_smooth() # 2. Fill the interior of standard error bands to the "legrep" variable treated as a factor variable ggplot(data = persinj50, mapping = aes(x = op_time, y = amt, colour = factor(legrep), fill = factor(legrep))) + geom_point(size = 2, alpha = 0.5) + geom_smooth()
We can see that the claim amount of the two types of injuries behaves quite differently with respect to op_time
, with those for legal representation being much more sensitive to changes in op_time
.
To make just one smoothed curve applied to all 50 observations in the persinj 50 dataset while still having them colored according to the presence of legal representation by specifying different aesthetic mappings for different geoms
:
# 1. The aes() function in the ggplot() call only has the x and y aesthetics; the color aesthetic is moved to the geom_point() function. # As a result, the 50 points will have their color distinguished by the legrep variable. ggplot(data = persinj50, mapping = aes(x = op_time, y = amt)) + geom_point(aes(colour = factor(legrep)), size = 2, alpha = 0.5) + geom_smooth() # 2. ggplot(data = persinj50, mapping = aes(x = op_time, y = amt, fill = factor(legrep))) + geom_point(aes(colour = factor(legrep)), size = 2, alpha = 0.5) + geom_smooth(se = FALSE) # 3. ggplot(data = persinj50, mapping = aes(x = op_time, y = amt)) + geom_point(aes(colour = factor(legrep)), size = 2, alpha = 0.5) + geom_smooth(aes(fill = factor(legrep)))
|
|
|
|
|
3.2. The use of facet_wrap()
In ggplot2, grouping is achieved by mapping one or more categorical variables in the data to visual elements like color, shape, fill, size, and linetype.
Faceting, which is another useful way to categorize our data into distinct groups, while grouping showcases two or more groups of observations in a single plot, faceting displays the observations in separate plots (known as a “small multiple” plot) produced for each value of the faceting variable placed side-by-side, usually on the same scale, to facilitate comparison.
facet_wrap(- <FACET_VAR>, ncol = <n>)
where ncol = <n>
: optional, determines the number of columns used to display the facets.
# 1. Produce faceted scatterplot according to the two values of legrep ggplot(data = persinj50, mapping = aes(x = op_time, y = amt)) + geom_point(size = 2, alpha = 0.5) + geom_smooth() + facet_wrap(~ legrep) # 2. Add scales="free" to facet_wrap() to relax this constraint and "free" the scales ggplot(data = persinj50, mapping = aes(x = op_time, y = amt)) + geom_point(size = 2, alpha = 0.5) + geom_smooth() + facet_wrap(~ legrep, scales="free")
3.3. The use of facet_grid()
To do faceting even when there are two faceting variables, we can do a cross-classification using the facet_grid()
function, which produces a two-dimensional “grid” of plots. Its syntax is similar to that of facet_wrap()
:
facet_grid(<FACET_VAR_1> ~ <FACET_VAR_2>, ncol = <n>)
Customizing Your Plots
Axes
coord_cartesian()
: Usexlim
andylim
arguments to set a two-element numeric vector indicating the desired lower and upper limits of the x-axis and y-axis. Data points outside the limits are thrown away.scale_x_log10()
: converts the scale of the x-axis of a ggplot to a log 10 asis an re-positions the data points accordingly.
When this function is applied, the points 10, 100, 1000, 10000 will be shown as consecutive numbers on the x-axis because their log 10 counterparts, log1010 = 1, log10100 = 2, log101000 = 3, and log1010000 = 4 are consecutive.
Titles, subtitles, and captions
labs()
: allows to set the label for the x-axis, y-axis, and the text for the title, subtitle, and caption of a ggplot using the x, y, title, subtitle, and caption arguments, respectively, with the desired label or text supplied as a character string.xlab()
: set the label for the x-axisylab()
: set the label for the y-axisggtitle()
: set the caption of a ggplot
Displaying multiple graphs on one page
grid.arrange()
: in thegridExtra
package, can place several ggplots in a single figure for ease of comparison.
not the same as faceting, where each figure is for a sub-sample of the data corresponding to certain values of the faceting variable(s).
ggplot(data = persinj50, mapping = aes(x = op_time, y = amt, color= factor(legrep), fill= factor(legrep))) + geom_point(size = 2, alpha = 0.5) + geom_smooth() + labs(title = "Personal Injury Datase", x = "Operational Time", y = "Claim Amount" ) + coord_cartesian(ylim = c(-200000, 300000))
Data Exploration
Univariate Data Exploration
Numeric Variables
summary()
Example: Summary statistics
persinj <- read.csv("persinj.csv") summary(persinj$amt) Min. 1st Qu. Median Mean 3rd Qu. Max. 10 6297 13854 38367 35123 4485797
-
- the mean of claim amount (38,367) is way higher than its median (13,854), indicating that its distribution is highly skewed to the right.
Problems with skewed data and possible solutions
In predictive modeling, it is often undesirable to have a right-skewed target variable for two reasons:
-
- After all, our objective in predictive analytics is to study the association between the target variable and the predictors in the data over a wide range of variable values. If most of the observations of the target variable cluster narrowly in the small-value range, this will make it difficult to investigate the effect of the predictors on the target variable globally-we simply don’t know enough about the target variable in the right tail. The same idea applies to a right-skewed predictor. If a predictor exhibits a heavy right skew, we are unable to differentiate the observations of the target variable effectively on the basis of the values of the predictor, most of which are concentrated in the small-value range.
- A number of predictive models (e.g., linear models, decision trees) are fitted by minimizing the sum of the squared discrepancies between the observed values and predicted values of the target variable. If the target variable is right-skewed, then the outliers, or extreme values, will contribute substantially to the sum and have a disproportionate effect on the model.
Correcting for the skewness
-
- Apply a monotone concave function to shrink the outliers and symmetrize the overall distribution while preserving the ranks of the observed values of the variable. This will reduce the effects of the extreme values on the model and tend to improve the goodness of fit than using the original right-skewed variable.
Two commonly used transformations for dealing with right-skewed variables are:- Log transformation:
This transformation (typically with respect to base e) can be applied as long as the variable of interest is strictly positive. None of the variable values should be zero or negative. - Square root transformation:
Although not discussed in the PA e-learning modules, the square root transformation is in a similar spirit to the log transformation, but is applicable even to non-negative variables, some of whose values are zero.
- Log transformation:
- Apply a monotone concave function to shrink the outliers and symmetrize the overall distribution while preserving the ranks of the observed values of the variable. This will reduce the effects of the extreme values on the model and tend to improve the goodness of fit than using the original right-skewed variable.
Example: Calculating the summary statistics for two groups of observations
Task: Calculate the summary statistics for claim amount separately for injuries with legal representation and those without legal representation. Comment on the central tendency and dispersion of claim amount for these two groups of injuries.
Solution:
1. Split the dataset into two subsets
persinj.0 <- persinj[persinj$legrep == 0,] persinj.1 <- persinj[persinj$legrep == 1,] summary(persinj.0$amt) Min. 1st Qu. Median Mean 3rd Qu. Max. 10 4061 11164 32398 29641 2798362 summary(persinj.1$amt) Min. 1st Qu. Median Mean 3rd Qu. Max. 20 7305 15309 41775 38761 4485797 sd(persinj.0$amt) [1] 77820.33 sd(persinj.1$amt) [1] 97541.38
2. Visualize data with histograms: geom_histogram()
p1 <- ggplot(persinj, aes(x = amt)) + geom_histogram(bins = 10, fill= "blue") + xlim(0, 100000) + ggtitle("Bins = 10") p2 <- ggplot(persinj, aes(x = amt)) + geom_histogram(fill= "blue") + xlim(0, 100000) + ggtitle("Default Value") p3 <- ggplot(persinj, aes(x = amt)) + geom_histogram(bins = 40, fill= "blue") + xlim(0, 100000) + ggtitle("Bins = 40") p4<- ggplot(persinj, aes(x = amt)) + geom_histogram(bins = 80, fill= "blue") + xlim(0, 100000) + ggtitle("Bins = 80") library(gridExtra) grid.arrange(p1, p2, p3, p4, ncol = 2)
- the distribution of claim amount is heavily lopsided to the right with a pronounced tail.
3. Correcting the skewness
p1<- ggplot(persinj, aes(x = log(amt))) + geom_histogram() p2<- ggplot(persinj, aes(x = sqrt(amt))) + geom_histogram() library(gridExtra) grid.arrange(p1, p2, ncol = 2)
- It is evident that the log transformation has effectively removed the skewness of claim amount and made the resulting distribution much more symmetric.
4. Identify outliers
-
- Errors: If an outlier is erroneous, then it is reasonable to correct the error or remove it entirely.
- Natural: Outliers can also arise naturally. Not necessarily errors, they are simply observations whose values are far away from the rest of the data, but are possible in theory.
- Remove: If we somehow know that a natural outlier is not likely to have a material effect on the final model, then it is fine to remove it.
- Ignore: If the outliers make up only an insignificant proportion of the data and unlikely to create bias, then it is sensible to leave them in the data.
- Modify: We can also modify the outliers to make them more reasonable, like censoring the policyholder age of 140 at 100.
- Using robust model forms: Instead of minimizing the squared error between the predicted values and the observed values, we could replace the squared error by the absolute error, which places much less relative weight on the large errors and reduces the impact of outliers on the fitted model.
5. A variation of histograms
Density plots: geom_density()
p1<- ggplot(persinj, aes(x = log(amt))) + geom_density() p2<- ggplot(persinj, aes(x = sqrt(amt))) + geom_density() library(gridExtra) grid.arrange(p1, p2, ncol = 2)
Box plots: geom_boxplot()
p1<- ggplot(persinj, aes(y = amt)) + geom_boxplot() p2<- ggplot(persinj, aes(y = log(amt))) + geom_boxplot() library(gridExtra) grid.arrange(p1, p2, ncol = 2)
-
- While most of the raw claim amounts are so close that the 25% percentile, median, and 75% percentile all degenerate to the same line, the log transformation corrects for the skewness and re-positions the data points for much easier visual inspection. Still there are quite a number of “outliers,” which are shown as large dotted points.
Categorical Variables
Descriptive Statistics
-
- Frequency tables: Categorical variables, even when coded as numbers, do not always have a natural order, so statistical summaries like the mean and median may not make sense. To understand the distribution of a categorical variable, we can look at the relative frequency of each of its levels through a frequency table, constructed by the
table()
function in R.
- Frequency tables: Categorical variables, even when coded as numbers, do not always have a natural order, so statistical summaries like the mean and median may not make sense. To understand the distribution of a categorical variable, we can look at the relative frequency of each of its levels through a frequency table, constructed by the
# the counts of the seven levels of inj table(persinj$inj) 1 2 3 4 5 6 9 15638 3376 1133 189 188 256 1256 # the percentage counts of the seven levels of inj table(persinj$inj)/nrow(persinj) 1 2 3 4 5 6 9 0.709656925 0.153203848 0.051415865 0.008576874 0.008531494 0.011617353 0.056997640
Graphical Displays
-
- Bar charts: When the number of levels of a categorical variable increases, a frequency table becomes more and more difficult to read. In most cases, the frequencies per se are not that important; what truly matters is their relative magnitude.
In this regard, bar charts extract the information in a frequency table and present the numeric counts visually, highlighting the relative frequency of each level in the variable. Looking at a bar chart, we can easily tell which levels are the most popular and which ones have minimal observations.
- Bar charts: When the number of levels of a categorical variable increases, a frequency table becomes more and more difficult to read. In most cases, the frequencies per se are not that important; what truly matters is their relative magnitude.
# Produce two bar charts for injury code corresponding to the two frequency tables # Convert integers to factors persinj$inj <- as.factor(persinj$inj) persinj$legrep <- as.factor(persinj$legrep) p1 <- ggplot(persinj, aes(x = inj)) + geom_bar (fill = "blue") p2 <- ggplot(persinj, aes(x = inj)) + geom_bar(fill = "blue", aes(y = ..prop.., group= 1)) grid.arrange(p1, p2, ncol = 2)
Bivariate Data Exploration
Numeric vs. Numeric
Descriptive Statistics
Correlation: An easy way to summarize the linear relationship between two numeric variables is through the correlation coefficient, or simply correlation. cor()
The larger the correlation in magnitude (i.e., the closer they are to + 1 or -1), the stronger the degree of linear association between the two variables.
cor(persinj$amt, persinj$op_time) [1] 0.3466114 cor(log(persinj$amt), persinj$op_time) [1] 0.6070667
A zero correlation only means that two variables are not linearly related, but they may be related in more subtle ways. More complex relationships (e.g., quadratic) can be revealed more effectively by graphical displays.
Graphical Displays
Scatterplot: The relationship between two numeric variables (one of which is usually the target variable) is typically visualized by a scatterplot. geom_point()
Such a plot often gives us a good sense of the nature of the relationship (e.g., increasing, decreasing, polynomial, periodic) between the two numeric variables and sometimes yields insights that correlations alone cannot provide.
p1 <- ggplot(persinj, aes(x = op_time, y = amt)) + geom_point(alpha = 0.05) + geom_smooth(method = "lm", se = FALSE) p2 <- ggplot(persinj, aes(x = op_time, y = log(amt))) + geom_point(alpha = 0.05) + geom_smooth(method = "lm", se = FALSE) grid.arrange(p1, p2, ncol = 2)
Both plots exhibit an increasing relationship, but the scatterplot for the log of claim amount displays a much more conspicuous upward sloping trend, indicating that the log of claim amount is approximately positively linear in operational time.
This is a further manifestation of the merits of the log transformation in uncovering relationships that would otherwise be obscure.
Although a scatterplot itself is confined to depicting the relationship between only two numeric variables, the effect of a third, categorical variable can be incorporated and investigated by decorating the observations by color, shape, or other visual elements according to the levels assumed by this third variable.
This way, we can visually inspect whether the relationship between the two numeric variables varies with the levels of the third, categorical variable. In statistical language, this phenomenon is known as interaction and is an important modeling issue to keep in mind when constructing an effective predictive model.
To make a scatterplot for the log of claim amount against operational time, with the observations color-distinguished by legal representation:
ggplot(persinj, aes(x = op_time, y = log(amt), color = legrep)) + geom_point(alpha = 0.25) + geom_smooth(method = "lm", se = FALSE)
The scatterplot shows that the two smoothed lines corresponding to the two levels of legrep
have markedly different slopes and intercepts.
In other words, the linear relationship between the log of claim amount and operational time depends materially on whether legal representation is present or not. We can also roughly tell the effect of legal representation on the (log of) claim amount: Injuries with legal representation (i.e., those with legrep = 1
) tend to produce higher claim amounts, except when operational time is extraordinarily large (90 or higher), in which case legal representation does not seem to have a noticeable impact on claim amount.
Numeric vs. Categorical
Descriptive Statistics
To summarize the association between a numeric variable and a categorical variable, we can partition the data into different subsets, one subset for each level of the categorical variable, and compute the mean of the numeric variable there. These conditional means varying substantially may suggest a strong relationship between the two variables.
Below we produce a table of the mean of the log-transformed claim amount (it is also fine to use the untransformed claim amount variable) split by different levels of inj
and legrep
, which are categorical.
Note on tidyverse
package:
-
- Installation:
install.packages(tidyverse)
- May need to install dependencies on Linux:
apt install libharfbuzz-dev libfribidi-dev
- Installation:
library(tidyverse) persinj %>% group_by(inj) %>% # make sure inj is factorized: persinj$inj <- as.factor(persinj$inj) summarise( mean = mean(log(amt)), median = median(log(amt)), n = n() ) # A tibble: 7 × 4 inj mean median n <fct> <dbl> <dbl> <int> 1 1 9.37 9.36 15638 2 2 10.3 10.3 3376 3 3 10.7 10.9 1133 4 4 11.0 11.2 189 5 5 10.8 10.6 188 6 6 9.68 9.07 256 7 9 8.35 8.57 1256
Insights:
-
- The claim amount on average increases from injury code 1 to injury code 4, then decreases down to injury code 9.
library(tidyverse) persinj %>% group_by(legrep) %>% # make sure legrep is factorized: persinj$legrep <- as.factor(persinj$legrep) summarise( mean = mean(log(amt)), median = median(log(amt)), n = n() ) # A tibble: 2 × 4 legrep mean median n <fct> <dbl> <dbl> <int> 1 0 9.18 9.32 8008 2 1 9.77 9.64 14028
Insights:
-
- Larger claim amounts are associated with the use of legal representation.
Graphical Displays
Split Boxplots: The conditional distribution of a numeric variable given a second, categorical variable is best visualized by split boxplots, where a series of boxplots of the numeric variable split by the categorical variable are made.
Below we construct two split boxplots for the log of claim amount, one split by injury code and one split by legal representation. The categorical variable that is used to split the numeric variable simply enters the x
aesthetic and a collection of boxplots of the numeric variable for each level of the categorical variable will be shown.
pl<- ggplot(persinj, aes(x = inj, y = log(amt))) + geom_boxplot() p2 <- ggplot(persinj, aes(x =legrep, y = log(amt))) + geom_boxplot() library(gridExtra) grid.arrange(p1, p2, ncol = 2)
The above plots can be condensed as below:
ggplot(persinj, aes(x = inj , y = log(amt), fill = legrep)) + geom_boxplot()
We split the log of claim amount by injury code (the x aesthetic), followed by legal representation (the fill aesthetic) within each injury code, to view a three-way relationship.
Highlights:
-
- Regardless of the injury code, larger claim sizes tend to be injuries with legal representation, with the effect being the most prominent for injuries of code 5 and code 9.
Histograms: histograms can also be adapted to visualize the distribution of a numeric variable split by a categorical variable. They can either be histograms stacked on top of one another (using the fill aesthetic) to highlight the contribution of each categorical level to the overall distribution of the numeric variable, or dodged histograms with each bin placed side by side for comparison (note the argument position = "dodge"
).
For the dodged histogram, it is necessary to specify y = ..density..
so that the histogram shows the density rather than raw counts; counts are misleading because there are a lot more injuries with legal representation than those without.
# Stacked (top) and dodged (bottom) histograms of the log of claim amount in the persinj data. p1 <- ggplot(persinj, aes(x = log(amt), fill= legrep)) + geom_histogram() p2 <- ggplot(persinj, aes(x = log(amt), y = ..density.., fill= legrep)) + geom_histogram(position = "dodge") library(gridExtra) grid.arrange(p1, p2)
Both histograms suggest that larger claims (e.g., those with log(amt)>=9) tend to be those with legal representation, as expected.
Categorical vs. Categorical
Descriptive Statistics
When examining a pair of categorical variables, it is often useful to construct a two-way frequency table showing the number of observations for every combination of the levels of the two variables using the table()
function. When two arguments are supplied to the table()
function, the first argument will correspond to the rows of the two-way frequency table while the second argument will correspond to its columns.
# Make a two-way frequency table for legal representation crossed with injury code table(persinj$legrep, persinj$inj) 1 2 3 4 5 6 9 0 5571 1152 374 56 85 121 649 1 10067 2224 759 133 103 135 607
Graphical Displays
Bar Charts: To visualize the distribution of a categorical variable split by another categorical variable effectively, split bar charts can be of use.
# Produce three bar charts for injury code split by legal representation p1 <- ggplot(persinj, aes(x = inj, fill = legrep)) + geom_bar() p2 <- ggplot(persinj, aes(x = inj, fill = legrep)) + geom_bar(position = "dodge") p3 <- ggplot(persinj, aes(x = inj, fill = legrep)) + geom_bar(position = "fill") + ylab("Proportion") library(gridExtra) grid.arrange(p1, p2, p3, ncol = 2)
Highlights:
-
p1
(Stacked): The first bar chart has counts within each injury code colored by legal representation because the fill aesthetic is set tolegrep
. In other words, each bar is broken down proportionally into injuries with legal representation (colored in green) and those without legal representation (colored in red).p2
(Dodged): The second bar chart has counts within each injury code separated according to legal representation and placed side by side for comparison due to the optionposition = "dodge"
in thegeom_bar()
function.p3
(Filled): In the third bar chart, the relative proportions (not counts) of injuries with and without legal representation within each injury code are shown due to the optionposition = "fill"
(not to be confused with the fill aesthetic). This makes it easy to compare proportions across different injury codes, although we lose the ability to see the number of injuries in each code. In predictive modeling, factor levels with more observations are generally considered more reliable than sparse levels.
Filled bar charts are usually the most useful for depicting the interplay between two categorical variables. If a categorical variable is an important predictor, then the proportions of 0 (in red) and 1 (in green) should vary significantly across different levels of the categorical variable.
Exercise 1
Preparation:
persinj <- read.csv("persinj.csv") persinj$inj <- as.factor(persinj$inj) persinj$legrep <- as.factor(persinj$legrep) library(ggplot2)
# CHUNK 1 ggplot(persinj, aes(x = inj, color = legrep)) + geom_bar() # CHUNK 2 ggplot(persinj, aes(x = inj, fill = legrep)) + geom_bar() # CHUNK 3 ggplot(persinj, aes(x = inj)) + geom_bar(fill = legrep) # CHUNK 4 ggplot(persinj, aes(x = inj)) + geom_bar(aes(fill = legrep))
Error in list2(just = just, width = width, na.rm = na.rm, orientation = orientation, : object 'legrep' not found |
Solution:
- Chunk 1: a bar chart for injury code with the boundary (not the interior) of the vertical bars colored according to legal representation.
- Chunk 2: This is similar to CHUNK 1, except that this time it is the interior of the vertical bars that is color-coded according to legal representation
- Chunk 3: This chunk of code does not work. The reason is that the fill argument (not fill aesthetic) of the
geom_bar()
function is mapped to a variable (legrep
here) instead of a constant. - Chunk 4: generates the same output as CHUNK 2. Instead of putting the fill aesthetic in the
ggplot()
call, it is placed inside thegeom_bar()
function.
Exercise 2
Preparation:
library(ggplot2) data(diamonds)
a) Determine the number of observations and variables in the diamonds dataset.
# The number of observations nrow(diamonds) [1] 53940 # The number of variables length(colnames(diamonds)) [1] 10 # Or use dim(<data set>) dim(diamonds) [1] 53940 10
With the aid of appropriate graphical displays and/or summary statistics, complete the following tasks.
(b) Perform univariate exploration of the price of diamonds (price) and the quality of the cut (cut). Determine if any of these two variables should be transformed and, if so, what transformation should be made. Do your recommended transformation(s), if any, and delete the original variable(s).
# summary(diamonds$price) Min. 1st Qu. Median Mean 3rd Qu. Max. 326 950 2401 3933 5324 18823
The variable ranges from 326
to 18,823
and its mean is much higher than its median, an indication of its pronounced right skewness. We use histogram to prove it:
ggplot(diamonds, aes(x = price)) + geom_histogram()
To deal with the right skewness of price, we can use a log transformation. The following commands create the log-transformed price and delete the original price variable.
diamonds$Lprice <- log(diamonds$price) diamonds$price <- NULL ggplot(diamonds, aes(x = Lprice)) + geom_histogram()
The resulting histogram shows that the distribution of the log-transformed price is closer to symmetric, although it is not bell-shaped.
The cut variable is a 5-level categorical variable; its levels are “Fair”, “Good”, ”Very Good”, “Premium”, and “Ideal”.
levels(diamonds$cut) [1] "Fair" "Good" "Very Good" "Premium" "Ideal" # The following table shows the counts for each level: table(diamonds$cut) Fair Good Very Good Premium Ideal 1610 4906 12082 13791 21551 # The following table shows the percentage for each level: table(diamonds$cut)/nrow(diamonds) Fair Good Very Good Premium Ideal 0.02984798 0.09095291 0.22398962 0.25567297 0.39953652
Since cut
is a categorical variable, numeric transformations such as the log transformation are not applicable and so we would prefer to leave it as it is.
(c) Explore the relationship between the price of diamonds and the weight of diamonds (carat).
Because Lprice
and carat
are both numeric variables, a scatterplot for the two variables is appropriate for exploring their relationship.
ggplot(diamonds, aes(x = carat, y = Lprice)) + geom_point(alpha = 0.5)
Highlights:
- The scatterplot shows that the two variables are strongly positively related; the heavier the diamond, the more expensive it is, conforming to our intuition.
(d) Explore the relationship between the price of diamonds and the quality of the cut.
Since carat
is a categorical variable, we use boxplot to explore their relationship.
ggplot(diamonds, aes(x = cut, y = Lprice)) + geom_boxplot(fill = "red")
Highlights:
- The split boxplot, however, suggests the counter-intuitive idea that diamonds of a higher quality tend to be cheaper (though only slightly). How can this be the case?
(e) Reconcile the apparent contradiction between what you get in parts (c) and (d).
To reconcile the contradiction between parts (c) and (d), one can look at the relationship between carat and cut. Again, as carat is numeric and cut is categorical, a split boxplot for carat split by cut will serve our purpose.
ggplot(diamonds, aes(x = cut, y = carat)) + geom_boxplot(fill = "red")
Highlights:
- The split boxplot shows that the weight of a diamond tends to drop as the quality of the cut becomes higher (“Premium” is an exception). According to part (c), the
carat
is an important predictor ofLprice
, so the findings in part (d) may be a result of the negative relationship between carat and cut — higher quality diamonds may be less pricey because they weigh less.
Summary of Plot Use
Categorical | Numerical | |
Categorical | Filled Bar Charts | Split Boxplots, Dodged Histograms |
Numerical | – | Scatterplot |