I love data visualization. My initial forays into data vizualization led me to the package
ggplot2, which seems to be quickly becoming the standard in the R community (or at least the one I see most often). When I first started using it, the syntax seemed pretty daunting (all the
geom_whatever arguments were confusing to me), but I was motivated enough by some pretty examples and helpful documentation to hack through other people’s code and make my own plots. I slowly started to get the syntax down, and now making plots with
ggplot is mostly intuitive and “fun”. I still have to look specific stuff up once in a while, but I usually know what I’m looking for at least. Anyway, I think
ggplot2 is great, but the learning curve can feel a bit steep, so here is a simple tutorial.
I’ll use some data on musicals I found because I was recently reminded of my childhood love for musicals (especially Rodgers & Hammerstein) and I’m feeling sentimental.
I took the inflation-adjusted box office numbers and release year of the 27 top grossing musicals from this website and put them in a spreadsheet. (Why did they only include the top 27 musicals? Seems like a strange number to decide on.)
Then, for each musical I typed “[musical name] tracklisting” and “[musical name] runtime” into google and recorded what came up for each search in my spreadsheet, which can be downloaded on my github (feel free to contribute data!). There’s probably a way I could have written some code to automatically scrape that data and more, but that would’ve been a whole thing.
Anyway, here is the simple dataset:
|Annie Get Your Gun||127245278||31||1950||107|
|Bye Bye Birdie||130212874||11||1963||120|
|Into The Woods||130894237||21||2014||125|
|Flower Drum Song||131191774||16||1961||132|
|The Rocky Horror Picture Show||140576117||14||1975||101|
|Gentlemen Prefer Blondes||168600000||6||1953||91|
|The Music Man||180087030||18||1962||155|
|Paint Your Wagon||188064853||15||1969||164|
|The Best Little Whorehouse In Texas||199858769||15||1982||114|
|The King And I||359118000||21||1956||144|
|Fiddler On The Roof||408170029||17||1971||181|
|West Side Story||533899997||15||1957||153|
|My Fair Lady||652645154||15||1964||175|
|The Sound Of Music||1362273686||16||1965||174|
The variable names are pretty self-explanatory. Across these elite (i.e., truncated range of) musicals, the average number of songs is 18.56 (SD = 6.95) and the average runtime is 138.59 (SD = 26.49)1.
Exploring musical “quality”
For the sake of this exercise, let’s imagine that the amount of money a musical makes is a decent reflection of its quality (I’m sure there are many reasons why this isn’t valid, but the same could be said for a lot of operationalizations in psychology). Let’s try to find some interesting relationships between some other objective metrics and musical quality with
Number of songs?
A reasonable person might guess that a good musical would have a lot of music. Maybe there’s a relationship between the number of songs in a musical and how much money it made at the box office. Let’s make a basic scatterplot exploring that relationship.
At minimum, we have to specify the data and the overall aesthetics (
aes), which are usually at least the x (
song_quantity) and y (
BoxOffice) axis. One cool thing about
ggplot is that you just add layers to this backbone using a
+ for each layer. Markers can be added using
geom_jitter to prevent points from overlapping too much). If you save the backbone plot (or any additions thereafter) as an object (e.g.,
myplot), you can simply add additional layers to that object as well (illustrated below).
# specify the backbone myplot <- ggplot(data = df, # specify data aes(x = song_quantity, # specify x axis variable y = BoxOffice)) # specify y axis variable # add markers to the backbone myplot + geom_point()
So there’s the visualized relationship between song quantity and musical quality: there isn’t one.
Maybe musical fans just want a good escape from their real life, so musicals that are simply longer—providing more escape—are “better”. Here’s a scatterplot exploring the relationship between runtime (
length_minutes) and box office earnings (
BoxOffice). I changed the point
color just to show how.
plot2 <- ggplot(df, aes(length_minutes, BoxOffice, label = Name)) + geom_point(color = 'blue', # change color of points shape = 'triangle', # change point shape size = 3) # change the point size plot2
If there is a relationship, it’s probably driven by that outlier that is the highest-grossing musical. Which musical is it? We could look in the dataframe to see, but this is a plotting tutorial so let’s make the plot tell us. We can add text labels to the points using
geom_text and change the distance of the labels from the points using
hjust. Because this adds another layer that depends on the data, we have to provide the aesthetics, using the
label arguement in this case.
# add text labels with the musical name plot2 + geom_text(aes(label = Name), vjust = -.1)
Most those labels are pretty bunched (very ugly) up but we can see that the highest grossing musical is The Sound of Music. We can’t very well get rid of one of the most iconic musicals of all time, so let’s just keep the outlier for the sake of the exercise. The labels look terrible, though, so we should probably get rid of them (we can bring them back later in a pretty way). While we’re at it, let’s make the plot generally prettier by changing the theme and setting the axis labels to something more clear.
There are a number of ways to change the axis labels, scales, and tick marks, but the ones that seem to give the most control are
library(scales) plot2_revised <- plot2 + scale_x_continuous(name = "Runtime (in minutes)", # new x-axis label # setting tick mark every 10 minutes breaks = seq(90, 190, 10)) + scale_y_continuous(name = "Gross Box Office Earnings (USD)", # changes y-axis tick labels to dollar values labels = dollar, breaks = seq(min(df$BoxOffice), max(df$BoxOffice), 50000000)) plot2_revised
There are a whole bunch of preset themes in
ggplot (see full list of default themes here). I like
plot_final <- plot2_revised + theme_classic() plot_final
Now we’ve basically got a publication-ready plot!
Bonus: Add interactive labels
I liked being able to see which dot belonged to each musical, but the labels were too ugly. Luckily there is an interactive plotting package called
plotly that has a wrapper to turn any
ggplot object into an interactive plot.
library(plotly) int_plot <- ggplotly(plot_final) style(int_plot, hoverinfo = df$Name)
Now we can see the name of the musical when we hover over each point. That’s cool. I’ve played around with
plotly a bit and it has much more functionality than I can go into here; it plays well enough with
ggplot for stuff like this, but it looks like it might be worth just learning
plotly too (maybe in a future blog post).
Go forth and ggplot
So that’s some basic
ggplot stuff (with a bonus of
ggplotly). I haven’t even scratched the surface of what’s possible with
ggplot here but hopefully it helps someone get started. With enough practice and googling you can feel as free in ggplot as Julie Andrews did in the alps.
Thanks for tuning in! My semester starts next week, and this blogging thing always takes me more time than I think it does, so I think I’ll probably dial it back for a while.
In R Markdown you can programatically write these statistics using backticks, specifying the language
`r`and then writing a function, e.g.,
round(mean(df$var), 2)within the backticks, but after the