Data visualization basics

I love data visualization. My initial forays into data visualization led me to the package ggplot2, which seems to be quickly becoming the standard in the R community (or at least the one I see most often). When I first started using it, the syntax seemed pretty daunting (all the aes and geom_whatever arguments were confusing to me), but I was motivated enough by some pretty examples and helpful documentation to hack through other people’s code and make my own plots. I slowly got the syntax down, and now making plots with ggplot2 is mostly intuitive and “fun”. I still have to look specific stuff up once in a while, but I usually know what I’m looking for at least. Anyway, I think ggplot2 is great, but the learning curve can feel a bit steep, so here is a simple tutorial.


The data

I’ll use some data on musicals I found because I was recently reminded of my childhood love for musicals (especially Rodgers & Hammerstein) and I’m feeling sentimental.

I took the inflation-adjusted box office numbers and release years of the 27 top-grossing musicals from this website and put them in a spreadsheet. (Why did they only include the top 27 musicals? Seems like a strange number to decide on.)

Then, for each musical I typed “[musical name] tracklisting” and “[musical name] runtime” into Google and recorded what came up for each search in my spreadsheet, which can be downloaded from my GitHub (feel free to contribute data!). There’s probably a way I could have written some code to automatically scrape that data and more, but that would’ve been a whole thing.

Anyway, here is the simple dataset:

Name | BoxOffice | song_quantity | year | length_minutes
Annie Get Your Gun | 127245278 | 31 | 1950 | 107
Dreamgirls | 128941402 | 39 | 2006 | 130
Bye Bye Birdie | 130212874 | 11 | 1963 | 120
Into The Woods | 130894237 | 21 | 2014 | 125
Flower Drum Song | 131191774 | 16 | 1961 | 132
Gypsy | 133397795 | 16 | 1962 | 153
The Rocky Horror Picture Show | 140576117 | 14 | 1975 | 101
Hairspray | 145652571 | 17 | 2007 | 120
Les Miserables | 156592957 | 34 | 2012 | 160
Annie | 163607956 | 16 | 1982 | 128
Gentlemen Prefer Blondes | 168600000 | 6 | 1953 | 91
Mamma Mia | 169222345 | 24 | 2008 | 109
The Music Man | 180087030 | 18 | 1962 | 155
Paint Your Wagon | 188064853 | 15 | 1969 | 164
The Best Little Whorehouse In Texas | 199858769 | 15 | 1982 | 114
Cabaret | 204930552 | 17 | 1972 | 124
Camelot | 218495610 | 12 | 1967 | 179
Chicago | 239112061 | 18 | 2002 | 113
Oliver | 240691792 | 20 | 1968 | 153
The King And I | 359118000 | 21 | 1956 | 144
Funny Girl | 376454194 | 17 | 1968 | 155
Fiddler On The Roof | 408170029 | 17 | 1971 | 181
South Pacific | 456211764 | 16 | 1958 | 171
West Side Story | 533899997 | 15 | 1957 | 153
Grease | 602892685 | 24 | 1978 | 111
My Fair Lady | 652645154 | 15 | 1964 | 175
The Sound Of Music | 1362273686 | 16 | 1965 | 174

The variable names are pretty self-explanatory. Across these elite (i.e., truncated range of) musicals, the average number of songs is 18.56 (SD = 6.95) and the average runtime is 138.59 minutes (SD = 26.49) [1].
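
If you want to check those numbers yourself, here is a minimal base R sketch. I typed the two columns in as plain vectors (same row order as the table) so it runs on its own; in practice you would read the spreadsheet into a data frame instead.

```r
# Song counts and runtimes copied from the table above, in the same row order
song_quantity <- c(31, 39, 11, 21, 16, 16, 14, 17, 34, 16, 6, 24, 18, 15,
                   15, 17, 12, 18, 20, 21, 17, 17, 16, 15, 24, 15, 16)
length_minutes <- c(107, 130, 120, 125, 132, 153, 101, 120, 160, 128, 91,
                    109, 155, 164, 114, 124, 179, 113, 153, 144, 155, 181,
                    171, 153, 111, 175, 174)

# Mean and sample standard deviation, rounded to two decimals
round(mean(song_quantity), 2)   # 18.56
round(sd(song_quantity), 2)     # 6.95
round(mean(length_minutes), 2)  # 138.59
round(sd(length_minutes), 2)    # 26.49
```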

Exploring musical “quality”

For the sake of this exercise, let’s imagine that the amount of money a musical makes is a decent reflection of its quality (I’m sure there are many reasons why this isn’t valid, but the same could be said for a lot of operationalizations in psychology). Let’s try to find some interesting relationships between some other objective metrics and musical quality with ggplot.

Number of songs?

A reasonable person might guess that a good musical would have a lot of music. Maybe there’s a relationship between the number of songs in a musical and how much money it made at the box office. Let’s make a basic scatterplot exploring that relationship.

At minimum, we have to specify the data and the overall aesthetics (aes), which are usually at least the x (song_quantity) and y (BoxOffice) axes. One cool thing about ggplot is that you just add layers to this backbone using a + for each layer. Markers can be added using geom_point (or geom_jitter to prevent points from overlapping too much). If you save the backbone plot (or any additions thereafter) as an object (e.g., myplot), you can simply add additional layers to that object as well (illustrated below).

library(ggplot2)

# df is the data frame holding the dataset shown above
# specify the backbone
myplot <- ggplot(data = df, # specify data
                 aes(x = song_quantity, # specify x axis variable
                     y = BoxOffice)) # specify y axis variable
# add markers to the backbone
myplot + geom_point()

So there’s the visualized relationship between song quantity and musical quality: there isn’t one.
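
To make the "save the backbone, then add layers" idea a bit more concrete, here is a small sketch. The demo_df data frame is just a handful of rows from the table that I typed in so the example runs by itself, and the jitter width of 0.2 is an arbitrary choice, not a recommendation.

```r
library(ggplot2)

# A few rows from the table above, just to keep the example self-contained
demo_df <- data.frame(
  Name = c("Grease", "Chicago", "Cabaret", "Funny Girl", "Fiddler On The Roof"),
  BoxOffice = c(602892685, 239112061, 204930552, 376454194, 408170029),
  song_quantity = c(24, 18, 17, 17, 17)
)

# Save the backbone once (no layers yet, so printing it shows empty axes)
backbone <- ggplot(demo_df, aes(x = song_quantity, y = BoxOffice))

# Reuse the same backbone with different marker layers
with_points <- backbone + geom_point()

# Jitter horizontally only, so the box office values are not shifted
with_jitter <- backbone + geom_jitter(width = 0.2, height = 0)

with_jitter  # printing the object draws the plot
```

Three of these musicals have 17 songs, so the jittered version spreads them out a little along the x axis while the backbone object itself stays unchanged.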


Maybe musical fans just want a good escape from their real lives, so musicals that are simply longer (and so provide more escape) are “better”. Here’s a scatterplot exploring the relationship between runtime (length_minutes) and box office earnings (BoxOffice). I changed the point size, shape, and color just to show how.

plot2 <- ggplot(df, aes(length_minutes, BoxOffice, label = Name)) +
  geom_point(color = 'blue', # change color of points
             shape = 'triangle', # change point shape
             size = 3) # change the point size
plot2 # display the plot

If there is a relationship, it’s probably driven by the outlier, which is the highest-grossing musical. Which musical is it? We could look in the dataframe to see, but this is a plotting tutorial so let’s make the plot tell us. We can add text labels to the points using geom_text and change the distance of the labels from the points using vjust and hjust. Because this adds another layer that depends on the data, we have to provide the aesthetics, using the label argument in this case.

# add text labels with the musical name
plot2 + geom_text(aes(label = Name), vjust = -.1) 

Most of those labels are pretty bunched up (very ugly), but we can see that the highest-grossing musical is The Sound of Music. We can’t very well get rid of one of the most iconic musicals of all time, so let’s just keep the outlier for the sake of the exercise. The labels look terrible, though, so we should probably get rid of them (we can bring them back later in a pretty way). While we’re at it, let’s make the plot generally prettier by changing the theme and setting the axis labels to something clearer.
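
As an aside, if you only care about naming the outlier, one option is to give geom_text its own, smaller data frame. This is just a sketch: demo_df and top_musical are names I made up, and demo_df holds only four rows from the table so the example is self-contained.

```r
library(ggplot2)

# Hypothetical mini data frame with a few rows from the table above
demo_df <- data.frame(
  Name = c("The Sound Of Music", "My Fair Lady", "Grease", "Chicago"),
  BoxOffice = c(1362273686, 652645154, 602892685, 239112061),
  length_minutes = c(174, 175, 111, 113)
)

# Keep only the row with the largest box office
top_musical <- demo_df[which.max(demo_df$BoxOffice), ]

ggplot(demo_df, aes(x = length_minutes, y = BoxOffice)) +
  geom_point() +
  # this layer gets its own data, so only one label is drawn;
  # vjust/hjust nudge the text below and to the left of the point
  geom_text(data = top_musical, aes(label = Name),
            vjust = 1.5, hjust = 1)
```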

There are a number of ways to change the axis labels, scales, and tick marks, but the ones that seem to give the most control are scale_x_continuous and scale_y_continuous.

plot2_revised <- plot2 +
  scale_x_continuous(name = "Runtime (in minutes)", # new x-axis label
                     # set a tick mark every 10 minutes
                     breaks = seq(90, 190, 10)) +
  scale_y_continuous(name = "Gross Box Office Earnings (USD)",
                     # format y-axis tick labels as dollar values
                     # (dollar comes from the scales package)
                     labels = scales::dollar,
                     breaks = seq(min(df$BoxOffice), max(df$BoxOffice), 50000000))
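
If all you need is new axis titles (no custom breaks or tick formatting), labs is one of the simpler alternatives. Here is a minimal sketch on a made-up four-row data frame (demo_df, values taken from the table above).

```r
library(ggplot2)

# Four rows from the table above, just for illustration
demo_df <- data.frame(
  length_minutes = c(111, 113, 174, 175),
  BoxOffice = c(602892685, 239112061, 1362273686, 652645154)
)

# labs() only renames the axes; breaks and tick labels keep their defaults
labelled <- ggplot(demo_df, aes(length_minutes, BoxOffice)) +
  geom_point() +
  labs(x = "Runtime (in minutes)",
       y = "Gross Box Office Earnings (USD)")

labelled
```

The trade-off is that you give up the control over breaks and labels that the scale_ functions offer, which is why I reach for those when I want to tweak the ticks too.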

There are a whole bunch of preset themes in ggplot (see full list of default themes here). I like theme_classic.

plot_final <- plot2_revised + theme_classic()
plot_final # display the plot


Now we’ve basically got a publication-ready plot!

Bonus: Add interactive labels

I liked being able to see which dot belonged to each musical, but the labels were too ugly. Luckily there is an interactive plotting package called plotly that has a wrapper to turn any ggplot object into an interactive plot.


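library(plotly)

int_plot <- ggplotly(plot_final)

# show only the musical's name on hover (assumes the points form a single
# trace in the same row order as df)
style(int_plot, text = df$Name, hoverinfo = "text")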

Now we can see the name of the musical when we hover over each point. That’s cool. I’ve played around with plotly a bit and it has much more functionality than I can go into here; it plays well enough with ggplot for stuff like this, but it looks like it might be worth just learning plotly too (maybe in a future blog post).
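
Another common way to control the hover text, as far as I can tell, is to map the name to a text aesthetic in ggplot and then ask ggplotly to show only that aesthetic. A small sketch, again on a made-up mini data frame (demo_df and int_demo are my own names):

```r
library(ggplot2)
library(plotly)

# A few rows from the table above, just for illustration
demo_df <- data.frame(
  Name = c("The Sound Of Music", "My Fair Lady", "Grease", "Chicago"),
  BoxOffice = c(1362273686, 652645154, 602892685, 239112061),
  length_minutes = c(174, 175, 111, 113)
)

# geom_point does not draw the text aesthetic; it is there for ggplotly to pick up
p <- ggplot(demo_df, aes(length_minutes, BoxOffice, text = Name)) +
  geom_point()

# tooltip = "text" restricts the hover box to the text aesthetic
int_demo <- ggplotly(p, tooltip = "text")
int_demo
```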

Go forth and ggplot

So that’s some basic ggplot stuff (with a bonus of ggplotly). I haven’t even scratched the surface of what’s possible with ggplot here, but hopefully it helps someone get started. With enough practice and googling you can feel as free in ggplot as Julie Andrews did in the Alps.

Thanks for tuning in! My semester starts next week, and this blogging thing always takes me more time than I think it will, so I’ll probably dial it back for a while.

  1. In R Markdown you can programmatically write these statistics with inline code: open a backtick, type r followed by a space, then an expression such as round(mean(df$var), 2), and close the backtick.