I love data visualization. My initial forays into data visualization led me to the package ggplot2, which seems to be quickly becoming the standard in the R community (or at least the one I see most often). When I first started using it, the syntax seemed pretty daunting (all the aes and geom_whatever arguments were confusing to me), but I was motivated enough by some pretty examples and helpful documentation to hack through other people’s code and make my own plots. I slowly started to get the syntax down, and now making plots with ggplot is mostly intuitive and “fun”. I still have to look specific stuff up once in a while, but I usually know what I’m looking for at least. Anyway, I think ggplot2 is great, but the learning curve can feel a bit steep, so here is a simple tutorial.
library(ggplot2)
The data
I’ll use some data on musicals I found because I was recently reminded of my childhood love for musicals (especially Rodgers & Hammerstein) and I’m feeling sentimental.
I took the inflation-adjusted box office numbers and release year of the 27 top grossing musicals from this website and put them in a spreadsheet. (Why did they only include the top 27 musicals? Seems like a strange number to decide on.)
Then, for each musical I typed “[musical name] tracklisting” and “[musical name] runtime” into Google and recorded what came up for each search in my spreadsheet, which can be downloaded on my GitHub (feel free to contribute data!). There’s probably a way I could have written some code to automatically scrape that data and more, but that would’ve been a whole thing.
Anyway, here is the simple dataset:
Name | BoxOffice | song_quantity | year | length_minutes |
---|---|---|---|---|
Annie Get Your Gun | 127245278 | 31 | 1950 | 107 |
Dreamgirls | 128941402 | 39 | 2006 | 130 |
Bye Bye Birdie | 130212874 | 11 | 1963 | 120 |
Into The Woods | 130894237 | 21 | 2014 | 125 |
Flower Drum Song | 131191774 | 16 | 1961 | 132 |
Gypsy | 133397795 | 16 | 1962 | 153 |
The Rocky Horror Picture Show | 140576117 | 14 | 1975 | 101 |
Hairspray | 145652571 | 17 | 2007 | 120 |
Les Miserables | 156592957 | 34 | 2012 | 160 |
Annie | 163607956 | 16 | 1982 | 128 |
Gentlemen Prefer Blondes | 168600000 | 6 | 1953 | 91 |
Mamma Mia | 169222345 | 24 | 2008 | 109 |
The Music Man | 180087030 | 18 | 1962 | 155 |
Paint Your Wagon | 188064853 | 15 | 1969 | 164 |
The Best Little Whorehouse In Texas | 199858769 | 15 | 1982 | 114 |
Cabaret | 204930552 | 17 | 1972 | 124 |
Camelot | 218495610 | 12 | 1967 | 179 |
Chicago | 239112061 | 18 | 2002 | 113 |
Oliver | 240691792 | 20 | 1968 | 153 |
The King And I | 359118000 | 21 | 1956 | 144 |
Funny Girl | 376454194 | 17 | 1968 | 155 |
Fiddler On The Roof | 408170029 | 17 | 1971 | 181 |
South Pacific | 456211764 | 16 | 1958 | 171 |
West Side Story | 533899997 | 15 | 1957 | 153 |
Grease | 602892685 | 24 | 1978 | 111 |
My Fair Lady | 652645154 | 15 | 1964 | 175 |
The Sound Of Music | 1362273686 | 16 | 1965 | 174 |
The variable names are pretty self-explanatory. Across these elite (i.e., truncated range of) musicals, the average number of songs is 18.56 (SD = 6.95) and the average runtime is 138.59 minutes (SD = 26.49).[1]
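The code in the rest of this post refers to a data frame called df, which is never created on the page. Here is a minimal sketch of one way to get there, assuming the spreadsheet was exported as a CSV. The file name musicals.csv, the object df_sample, and the helper describe are made up for illustration. A few rows from the table are read from an inline string so the example runs on its own.

```r
# Assumption: the spreadsheet was exported as CSV with the column names
# shown in the table. Four rows are inlined here as a stand-in.
csv_text <- "Name,BoxOffice,song_quantity,year,length_minutes
Annie Get Your Gun,127245278,31,1950,107
Dreamgirls,128941402,39,2006,130
Bye Bye Birdie,130212874,11,1963,120
The Sound Of Music,1362273686,16,1965,174"

df_sample <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# With the downloaded file (hypothetical name), this would be:
# df <- read.csv("musicals.csv", stringsAsFactors = FALSE)

# Mean and standard deviation, rounded to two decimals as in the text
describe <- function(x) c(mean = round(mean(x), 2), sd = round(sd(x), 2))
describe(df_sample$song_quantity)
describe(df_sample$length_minutes)
```

Running describe on the full 27-row data frame is one way to reproduce the averages and SDs quoted above.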
Exploring musical “quality”
For the sake of this exercise, let’s imagine that the amount of money a musical makes is a decent reflection of its quality (I’m sure there are many reasons why this isn’t valid, but the same could be said for a lot of operationalizations in psychology). Let’s try to find some interesting relationships between some other objective metrics and musical quality with ggplot.
Number of songs?
A reasonable person might guess that a good musical would have a lot of music. Maybe there’s a relationship between the number of songs in a musical and how much money it made at the box office. Let’s make a basic scatterplot exploring that relationship.
At minimum, we have to specify the data and the overall aesthetics (aes), which are usually at least the x (song_quantity) and y (BoxOffice) axes. One cool thing about ggplot is that you just add layers to this backbone using a + for each layer. Markers can be added using geom_point (or geom_jitter to prevent points from overlapping too much). If you save the backbone plot (or any additions thereafter) as an object (e.g., myplot), you can simply add additional layers to that object as well (illustrated below).
# specify the backbone
myplot <- ggplot(data = df,                 # specify data
                 aes(x = song_quantity,     # specify x axis variable
                     y = BoxOffice))        # specify y axis variable
# add markers to the backbone
myplot + geom_point()
So there’s the visualized relationship between song quantity and musical quality: there isn’t one.
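To put a number on that impression, one option is to pair the scatterplot with a fitted line and a correlation coefficient. The sketch below is an addition, not part of the original analysis. It rebuilds only the two columns it needs from the table above (df_songs, song_cor, and songs_plot are made-up names) and swaps in geom_jitter plus geom_smooth with a linear fit.

```r
library(ggplot2)

# Two columns copied from the table above, in table order
df_songs <- data.frame(
  song_quantity = c(31, 39, 11, 21, 16, 16, 14, 17, 34, 16, 6, 24, 18, 15,
                    15, 17, 12, 18, 20, 21, 17, 17, 16, 15, 24, 15, 16),
  BoxOffice = c(127245278, 128941402, 130212874, 130894237, 131191774,
                133397795, 140576117, 145652571, 156592957, 163607956,
                168600000, 169222345, 180087030, 188064853, 199858769,
                204930552, 218495610, 239112061, 240691792, 359118000,
                376454194, 408170029, 456211764, 533899997, 602892685,
                652645154, 1362273686)
)

# Pearson correlation between song count and box office earnings
song_cor <- cor(df_songs$song_quantity, df_songs$BoxOffice)
song_cor

# Same backbone as before, with jittered points and a linear trend line
songs_plot <- ggplot(df_songs, aes(x = song_quantity, y = BoxOffice)) +
  geom_jitter(width = 0.2, height = 0) +      # nudge points sideways only
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)
songs_plot
```

A correlation near zero and a roughly flat line would back up the eyeball verdict. With only 27 musicals, treat either as a rough guide.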
Runtime?
Maybe musical fans just want a good escape from their real life, so musicals that are simply longer (providing more escape) are “better”. Here’s a scatterplot exploring the relationship between runtime (length_minutes) and box office earnings (BoxOffice). I changed the point size, shape, and color just to show how.
plot2 <- ggplot(df, aes(length_minutes, BoxOffice, label = Name)) +
  geom_point(color = 'blue',      # change color of points
             shape = 'triangle',  # change point shape
             size = 3)            # change the point size
plot2
If there is a relationship, it’s probably driven by that outlier that is the highest-grossing musical. Which musical is it? We could look in the dataframe to see, but this is a plotting tutorial so let’s make the plot tell us. We can add text labels to the points using geom_text and change the distance of the labels from the points using vjust and hjust. Because this adds another layer that depends on the data, we have to provide the aesthetics, using the label argument in this case.
# add text labels with the musical name
plot2 + geom_text(aes(label = Name), vjust = -.1)
Most of those labels are bunched up (very ugly), but we can see that the highest-grossing musical is The Sound of Music. We can’t very well get rid of one of the most iconic musicals of all time, so let’s just keep the outlier for the sake of the exercise. The labels look terrible, though, so we should probably get rid of them (we can bring them back later in a pretty way). While we’re at it, let’s make the plot generally prettier by changing the theme and setting the axis labels to something clearer.
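A middle ground between labeling everything and labeling nothing is to label only the point of interest. This is a sketch of that idea rather than what the post does next. It relies on the fact that each ggplot2 layer can take its own data argument, and it uses a four-row stand-in data frame (df_small, top_musical, and labeled_plot are made-up names) so it runs on its own.

```r
library(ggplot2)

# Four rows copied from the table above, enough to show the idea
df_small <- data.frame(
  Name = c("Grease", "My Fair Lady", "West Side Story", "The Sound Of Music"),
  BoxOffice = c(602892685, 652645154, 533899997, 1362273686),
  length_minutes = c(111, 175, 153, 174)
)

# Keep only the row with the largest box office for the label layer
top_musical <- df_small[which.max(df_small$BoxOffice), ]

labeled_plot <- ggplot(df_small, aes(length_minutes, BoxOffice)) +
  geom_point() +
  # this layer gets its own data, so only the outlier is labeled
  geom_text(data = top_musical, aes(label = Name), vjust = -0.5)
labeled_plot
```

Another option is geom_text(check_overlap = TRUE), which skips any label that would overlap one already drawn. That leaves some points unlabeled, so it suits a quick look more than a final figure.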
There are a number of ways to change the axis labels, scales, and tick marks, but the ones that seem to give the most control are scale_x_continuous and scale_y_continuous.
library(scales)
plot2_revised <- plot2 +
  scale_x_continuous(name = "Runtime (in minutes)", # new x-axis label
                     # setting tick mark every 10 minutes
                     breaks = seq(90, 190, 10)) +
  scale_y_continuous(name = "Gross Box Office Earnings (USD)",
                     # changes y-axis tick labels to dollar values
                     labels = dollar,
                     breaks = seq(min(df$BoxOffice), max(df$BoxOffice), 50000000))
plot2_revised
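It can help to look at what the breaks and labels arguments actually receive. The sketch below evaluates the same seq call outside the plot, with the minimum and maximum box office values typed in from the table, and formats the result with dollar from the scales package. The round_breaks alternative is a suggestion rather than something the post uses.

```r
library(scales)

# Tick positions as in the plot: start at the smallest box office value
# in the table and step by 50 million up to the largest
raw_breaks <- seq(127245278, 1362273686, 50000000)
length(raw_breaks)
dollar(head(raw_breaks, 3))   # the labels these ticks would carry

# Hedged alternative: round-number breaks every 200 million, starting at 0
round_breaks <- seq(0, 1.4e9, by = 2e8)
dollar(round_breaks)
```

Because the first sequence starts at the minimum of the data, every tick lands on a non-round number. Passing something like round_breaks to the breaks argument of scale_y_continuous gives tidier labels, though ggplot2 only draws the breaks that fall inside the axis limits.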
There are a whole bunch of preset themes in ggplot (see full list of default themes here). I like theme_classic.
plot_final <- plot2_revised + theme_classic()
plot_final
Now we’ve basically got a publication-ready plot!
Bonus: Add interactive labels
I liked being able to see which dot belonged to each musical, but the labels were too ugly. Luckily there is an interactive plotting package called plotly that has a wrapper to turn any ggplot object into an interactive plot.
library(plotly)
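# ggplotly() builds the hover text from the mapped aesthetics; the
# label = Name mapping set in plot2 supplies the musical's name
int_plot <- ggplotly(plot_final, tooltip = c("label", "x", "y"))
int_plot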
Now we can see the name of the musical when we hover over each point. That’s cool. I’ve played around with plotly a bit and it has much more functionality than I can go into here; it plays well enough with ggplot for stuff like this, but it looks like it might be worth just learning plotly too (maybe in a future blog post).
Go forth and ggplot
So that’s some basic ggplot stuff (with a bonus of ggplotly). I haven’t even scratched the surface of what’s possible with ggplot here, but hopefully it helps someone get started. With enough practice and googling you can feel as free in ggplot as Julie Andrews did in the Alps.
Thanks for tuning in! My semester starts next week, and this blogging thing always takes me more time than I think it does, so I think I’ll probably dial it back for a while.
[1] In R Markdown you can programmatically write these statistics with inline code: open a backtick, write r followed by a space, then a function call such as round(mean(df$var), 2), and close the backtick.