Worked Example: Cars (2024)

Johan Larsson

Behnaz Pirzamanbein

If you haven’t done so yet, you need to install the tidyverse collection of packages. To do so, simply call the following line. (Note that this will install several R packages and may take a long time if you haven’t already installed these packages.)

install.packages("tidyverse")

If you already have tidyverse installed, make sure that all of your packages are up to date, either by calling

update.packages(ask = FALSE) # ask = FALSE is to avoid being prompted

or by going to the Packages tab in the lower-right RStudio pane and clicking on Update.

R is extremely versatile when it comes to handling data of different formats and types. Much of this functionality is made available via R packages, which let you import data into R from Microsoft Excel, Stata, SAS, or SPSS files in addition to standard file types such as comma-separated files (.csv) and tab-separated files (.tsv). In this course, most of the data sets we use will be available directly through R and R packages, but knowing how to import data directly is a useful skill.[1]

First we need to find some data to import. Download the US Cars dataset that we have provided in the git repository for the course. The data is available here (this link may not work if you’re browsing this page from inside Canvas). Alternatively, you can call the following lines of code to download the dataset.

# check that we have a data folder in our working directory, and if not create
# one
if (!dir.exists("data")) {
  dir.create("data")
}

# download the dataset to data/us_cars.csv
download.file(
  "https://raw.githubusercontent.com/stat-lu/dataviz/main/data/us_cars.csv",
  file.path("data", "us_cars.csv"),
  mode = "wb"
)

Now we load this dataset into R via the readr package (part of tidyverse). This is a comma-separated file (note the file extension), so we use the read_csv() function.[2]

library(tidyverse)

us_cars <- read_csv(file.path("data", "us_cars.csv"))

When you call read_csv(), the readr package helpfully prints a message to your console with information about how the columns in the dataset were formatted when you imported the data. Take a look at this information. Does everything look okay?
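If the guessed column types turn out not to be what you want, one option is to state them explicitly through the col_types argument. The following is only a minimal sketch: it assumes a tiny, made-up CSV supplied as literal text (wrapped in I(), which needs readr 2.0 or later) rather than the real us_cars.csv file, and the object names csv_text and demo are hypothetical.

```r
library(readr)

# Hypothetical two-row CSV, supplied as literal text for illustration only
csv_text <- I("price,brand,year\n6300,toyota,2008\n2899,ford,2011\n")

# Declare the column types up front instead of relying on readr's guesses
demo <- read_csv(
  csv_text,
  col_types = cols(
    price = col_double(),
    brand = col_character(),
    year = col_integer()
  )
)

# spec() returns the column specification that was used for the import
spec(demo)
```

Supplying col_types also silences the column message, which can be convenient once you have checked the types.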

An alternative to this approach is to use readr to read data directly from a URL into R. To do so, you simply use the URL instead of the file name in the call to read_csv(), like this.

us_cars <- read_csv(
  "https://raw.githubusercontent.com/stat-lu/dataviz/main/data/us_cars.csv"
)

2.1 Taking a Glimpse

Whenever you start working with new data, always start by taking a look at the data in raw form. The best first step is usually just to print the data to the console or to call head() (to see the first few lines).

us_cars

## # A tibble: 2,499 × 12
##    price brand model year title_status mileage color vin lot state country
##    <dbl> <chr> <chr> <dbl> <chr> <dbl> <chr> <chr> <dbl> <chr> <chr>
##  1 6300 toyo… crui… 2008 clean vehic… 274117 black jtez… 1.59e8 new … usa
##  2 2899 ford se 2011 clean vehic… 190552 silv… 2fmd… 1.67e8 tenn… usa
##  3 5350 dodge mpv 2018 clean vehic… 39590 silv… 3c4p… 1.68e8 geor… usa
##  4 25000 ford door 2014 clean vehic… 64146 blue 1ftf… 1.68e8 virg… usa
##  5 27700 chev… 1500 2018 clean vehic… 6654 red 3gcp… 1.68e8 flor… usa
##  6 5700 dodge mpv 2018 clean vehic… 45561 white 2c4r… 1.68e8 texas usa
##  7 7300 chev… pk 2010 clean vehic… 149050 black 1gcs… 1.68e8 geor… usa
##  8 13350 gmc door 2017 clean vehic… 23525 gray 1gks… 1.68e8 cali… usa
##  9 14600 chev… mali… 2018 clean vehic… 9371 silv… 1g1z… 1.68e8 flor… usa
## 10 5250 ford mpv 2017 clean vehic… 63418 black 2fmp… 1.68e8 texas usa
## # ℹ 2,489 more rows
## # ℹ 1 more variable: condition <chr>

If you want to see more rows of the data than are printed by default, try calling print() with the n argument set to the number of rows you want, or use head() in the same way.
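As a small illustration of the two options, here is a sketch that uses a made-up 30-row tibble (the name demo and its columns are hypothetical stand-ins, not the real us_cars data).

```r
library(tibble)

# Hypothetical stand-in for us_cars with 30 rows
demo <- tibble(id = 1:30, price = seq(1000, 30000, by = 1000))

# print the first 15 rows instead of the default 10
print(demo, n = 15)

# head() instead returns the first 15 rows as a new, shorter tibble
first_rows <- head(demo, 15)
first_rows
```

The difference is that print() only changes what is shown, whereas head() gives you a smaller object that you can keep working with.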

Another useful function, particularly when there are many columns in the dataset, is glimpse().

glimpse(us_cars)

## Rows: 2,499
## Columns: 12
## $ price        <dbl> 6300, 2899, 5350, 25000, 27700, 5700, 7300, 13350, 14600,…
## $ brand        <chr> "toyota", "ford", "dodge", "ford", "chevrolet", "dodge", …
## $ model        <chr> "cruiser", "se", "mpv", "door", "1500", "mpv", "pk", "doo…
## $ year         <dbl> 2008, 2011, 2018, 2014, 2018, 2018, 2010, 2017, 2018, 201…
## $ title_status <chr> "clean vehicle", "clean vehicle", "clean vehicle", "clean…
## $ mileage      <dbl> 274117, 190552, 39590, 64146, 6654, 45561, 149050, 23525,…
## $ color        <chr> "black", "silver", "silver", "blue", "red", "white", "bla…
## $ vin          <chr> "jtezu11f88k007763", "2fmdk3gc4bbb02217", "3c4pdcgg5jt346…
## $ lot          <dbl> 159348797, 166951262, 167655728, 167753855, 167763266, 16…
## $ state        <chr> "new jersey", "tennessee", "georgia", "virginia", "florid…
## $ country      <chr> "usa", "usa", "usa", "usa", "usa", "usa", "usa", "usa", "…
## $ condition    <chr> "10 days left", "6 days left", "2 days left", "22 hours l…

3.1 Pivoting

One of the most important data wrangling tools when it comes to data visualization is pivoting, which can be used to transform messy data into tidy data, which in turn will make visualization much easier for us.

Let’s see what this can entail by looking at a simple dataset, table2 from the tidyr package, which contains information on tuberculosis cases in a few countries in 1999 and 2000.

table2

## # A tibble: 12 × 4
##    country      year type            count
##    <chr>       <dbl> <chr>           <dbl>
##  1 Afghanistan  1999 cases             745
##  2 Afghanistan  1999 population   19987071
##  3 Afghanistan  2000 cases            2666
##  4 Afghanistan  2000 population   20595360
##  5 Brazil       1999 cases           37737
##  6 Brazil       1999 population  172006362
##  7 Brazil       2000 cases           80488
##  8 Brazil       2000 population  174504898
##  9 China        1999 cases          212258
## 10 China        1999 population 1272915272
## 11 China        2000 cases          213766
## 12 China        2000 population 1280428583

This data is not tidy, but why not? Take a moment to consider this for yourself before moving on.

Did you figure it out? The problem is that cases and population are two different variables but don’t have their own separate columns. To fix this, we need to reshape the data by pivoting it to a wider form using pivot_wider(). Before you continue, take a look at the documentation for the function by calling help(pivot_wider) (or ?pivot_wider) in R and see if you can make sense of the manual entry for the function.

The function has many arguments, but we only need to concern ourselves with data, names_from, and values_from right now. names_from should indicate which column stores the names of the variables we want to pivot, while values_from should indicate the column that contains those variables’ values. Putting this together, we get the following.

table2_tidy <- pivot_wider(
  table2,
  names_from = "type",
  values_from = "count"
)
table2_tidy

## # A tibble: 6 × 4
##   country      year  cases population
##   <chr>       <dbl>  <dbl>      <dbl>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

Now it’s easy to visualize this data!

ggplot(table2_tidy, aes(year, cases, fill = country)) +
  geom_col(position = "dodge")
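Another payoff of the tidy form is that quantities combining both variables become easy to compute. As a rough sketch (the rate per 10,000 inhabitants and the names table2_rate and rate are our own illustrative choices, not part of the original example), once cases and population sit side by side we can write:

```r
library(tidyr)
library(dplyr)

# table2 ships with tidyr; pivot it to one column per variable
table2_tidy <- pivot_wider(table2, names_from = "type", values_from = "count")

# With cases and population in separate columns, a rate is a one-line
# computation (here: cases per 10,000 inhabitants)
table2_rate <- mutate(table2_tidy, rate = cases / population * 10000)
table2_rate
```

Doing the same on the original table2 would require matching up pairs of rows by hand.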

3.2 Manipulation

In the tidyverse vocabulary, filtering refers to the process of selecting a subset of rows (observations if the data is tidy) from your dataset, whereas selecting means selecting a subset of columns (variables if the data is tidy). We use the aptly named filter() and select() respectively for these tasks.

3.3 Filtering

filter() is used on a dataset (tidyverse functions typically take the data as their first argument) together with a number of logical expressions that specify which rows to keep in the dataset. Recall that a logical expression is a binary expression that relates the left-hand side to the right-hand side in some way, for instance to check equality or inequality, like so:

c(1, 2, 3) < c(0.2, 0.5, 3.8)

## [1] FALSE FALSE TRUE

1:3 == 3:1

## [1] FALSE TRUE FALSE

c("a", "b", "c") %in% c("a", "c")

## [1] TRUE FALSE TRUE

Let’s say that we only want to look at cars from Tennessee from 2015 onward. Then we can use filter() in the following way.

filter(us_cars, state == "tennessee", year >= 2015)

## # A tibble: 17 × 12
##    price brand model year title_status mileage color vin lot state country
##    <dbl> <chr> <chr> <dbl> <chr> <dbl> <chr> <chr> <dbl> <chr> <chr>
##  1 31900 chev… 1500 2018 clean vehic… 22909 black 3gcu… 1.68e8 tenn… usa
##  2 16500 buick enco… 2018 clean vehic… 20002 red kl4c… 1.68e8 tenn… usa
##  3 19200 chev… 1500 2018 clean vehic… 2430 white 1gcn… 1.68e8 tenn… usa
##  4 30500 chev… 1500 2018 clean vehic… 30442 red 3gcu… 1.68e8 tenn… usa
##  5 19000 chev… mali… 2017 clean vehic… 18414 white 1g1z… 1.68e8 tenn… usa
##  6 27000 buick encl… 2017 clean vehic… 32107 no_c… 5gak… 1.68e8 tenn… usa
##  7 29400 bmw x3 2017 clean vehic… 23765 black 5uxw… 1.68e8 tenn… usa
##  8 8500 ford fusi… 2017 clean vehic… 78739 black 3fa6… 1.68e8 tenn… usa
##  9 30300 ford f-150 2019 clean vehic… 31899 white 1fte… 1.68e8 tenn… usa
## 10 39200 ford max 2019 clean vehic… 43617 black 1fmj… 1.68e8 tenn… usa
## 11 41200 ford srw 2019 clean vehic… 25638 gray 1ft7… 1.68e8 tenn… usa
## 12 29000 ford door 2017 clean vehic… 42291 black 1fte… 1.68e8 tenn… usa
## 13 14500 niss… rogue 2017 clean vehic… 8677 blue knma… 1.68e8 tenn… usa
## 14 18200 niss… sport 2018 clean vehic… 28009 black jn1b… 1.68e8 tenn… usa
## 15 10900 niss… sent… 2018 clean vehic… 28880 silv… 3n1a… 1.68e8 tenn… usa
## 16 12600 niss… sent… 2017 clean vehic… 11837 gray 3n1a… 1.68e8 tenn… usa
## 17 12700 niss… door 2017 clean vehic… 17738 black 1n4a… 1.68e8 tenn… usa
## # ℹ 1 more variable: condition <chr>
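Conditions separated by commas are combined with a logical AND. If you instead need OR, or want to match against several values, the operators from the logical expressions above can be used inside filter() as well. Here is a minimal sketch on a made-up miniature of the data (demo and its values are hypothetical, chosen only for illustration).

```r
library(dplyr)

# Hypothetical miniature of us_cars, for illustration only
demo <- tibble(
  state = c("tennessee", "texas", "georgia", "tennessee", "florida"),
  year = c(2018, 2012, 2019, 2010, 2016),
  price = c(31900, 5700, 5350, 8500, 27700)
)

# Comma-separated conditions are combined with AND
recent_tn <- filter(demo, state == "tennessee", year >= 2015)

# Use | for OR ...
tn_or_cheap <- filter(demo, state == "tennessee" | price < 6000)

# ... and %in% to match against several values at once
southern <- filter(demo, state %in% c("georgia", "florida"))
```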

3.4 Selecting

If you think of filtering as slicing a dataset horizontally, select() does the opposite, slicing a dataset vertically, by selecting a subset of the columns. The interface is similar to filter()’s. You begin with the data and then, through various arguments, select which columns you want to keep.

There is a plethora of possibilities when it comes to select(). We’ll try to cover some of the most common ones here, but see the documentation for select() if you want to learn more.

The simplest option is simply to list the columns you want to keep.

select(us_cars, brand, "model") # notice that you may omit quotation marks

## # A tibble: 2,499 × 2
##    brand     model
##    <chr>     <chr>
##  1 toyota    cruiser
##  2 ford      se
##  3 dodge     mpv
##  4 ford      door
##  5 chevrolet 1500
##  6 dodge     mpv
##  7 chevrolet pk
##  8 gmc       door
##  9 chevrolet malibu
## 10 ford      mpv
## # ℹ 2,489 more rows

When using select(), it’s actually possible to change the names of the columns by using a new_name = old_name pair, like this:

select(us_cars, vintage = vin, "price_in_dollars" = price)

## # A tibble: 2,499 × 2
##    vintage           price_in_dollars
##    <chr>                        <dbl>
##  1 jtezu11f88k007763             6300
##  2 2fmdk3gc4bbb02217             2899
##  3 3c4pdcgg5jt346413             5350
##  4 1ftfw1et4efc23745            25000
##  5 3gcpcrec2jg473991            27700
##  6 2c4rdgeg9jr237989             5700
##  7 1gcsksea1az121133             7300
##  8 1gks2gkc3hr326762            13350
##  9 1g1zd5st5jf191860            14600
## 10 2fmpk3j92hbc12542             5250
## # ℹ 2,489 more rows

If you instead want to drop a particular column, you just preface it with - or !.

select(us_cars, !brand, -model)

## # A tibble: 2,499 × 10
##    price year title_status mileage color vin lot state country condition
##    <dbl> <dbl> <chr> <dbl> <chr> <chr> <dbl> <chr> <chr> <chr>
##  1 6300 2008 clean vehicle 274117 black jtez… 1.59e8 new … usa 10 days …
##  2 2899 2011 clean vehicle 190552 silver 2fmd… 1.67e8 tenn… usa 6 days l…
##  3 5350 2018 clean vehicle 39590 silver 3c4p… 1.68e8 geor… usa 2 days l…
##  4 25000 2014 clean vehicle 64146 blue 1ftf… 1.68e8 virg… usa 22 hours…
##  5 27700 2018 clean vehicle 6654 red 3gcp… 1.68e8 flor… usa 22 hours…
##  6 5700 2018 clean vehicle 45561 white 2c4r… 1.68e8 texas usa 2 days l…
##  7 7300 2010 clean vehicle 149050 black 1gcs… 1.68e8 geor… usa 22 hours…
##  8 13350 2017 clean vehicle 23525 gray 1gks… 1.68e8 cali… usa 20 hours…
##  9 14600 2018 clean vehicle 9371 silver 1g1z… 1.68e8 flor… usa 22 hours…
## 10 5250 2017 clean vehicle 63418 black 2fmp… 1.68e8 texas usa 2 days l…
## # ℹ 2,489 more rows

Notice that, when you only have negative indexing, the function assumes that you want to keep all of the remaining columns.

It can, however, be tedious to manually select every column, in which case you may use the : operator to specify a range instead.

select(us_cars, title_status:state)

## # A tibble: 2,499 × 6
##    title_status  mileage color  vin                     lot state
##    <chr>           <dbl> <chr>  <chr>                 <dbl> <chr>
##  1 clean vehicle  274117 black  jtezu11f88k007763 159348797 new jersey
##  2 clean vehicle  190552 silver 2fmdk3gc4bbb02217 166951262 tennessee
##  3 clean vehicle   39590 silver 3c4pdcgg5jt346413 167655728 georgia
##  4 clean vehicle   64146 blue   1ftfw1et4efc23745 167753855 virginia
##  5 clean vehicle    6654 red    3gcpcrec2jg473991 167763266 florida
##  6 clean vehicle   45561 white  2c4rdgeg9jr237989 167655771 texas
##  7 clean vehicle  149050 black  1gcsksea1az121133 167753872 georgia
##  8 clean vehicle   23525 gray   1gks2gkc3hr326762 167692494 california
##  9 clean vehicle    9371 silver 1g1zd5st5jf191860 167763267 florida
## 10 clean vehicle   63418 black  2fmpk3j92hbc12542 167656121 texas
## # ℹ 2,489 more rows

Finally, it can sometimes be useful to match columns by name in some way, for instance if a number of the columns that you want contain a specific word. To be able to do this, the tidyverse provides a set of helper functions (from the tidyselect package, available once dplyr or tidyr is loaded), such as

starts_with(),
ends_with(), and
contains().

The manual entry for select() contains several examples using these helper functions, as do the individual entries for each function.

Consider taking a little time to play around with select() and its helper functions on the us_cars dataset.
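To get started, here is a small sketch of the three helpers. It assumes a made-up one-row tibble (demo) that borrows a few column names from us_cars; the comments list which columns each call keeps.

```r
library(dplyr)

# Hypothetical one-row tibble using some of the us_cars column names
demo <- tibble(
  price = 6300,
  title_status = "clean vehicle",
  mileage = 274117,
  state = "new jersey",
  country = "usa"
)

by_prefix <- select(demo, starts_with("st")) # keeps state
by_suffix <- select(demo, ends_with("e"))    # keeps price, mileage, state
by_substr <- select(demo, contains("t"))     # keeps title_status, state, country
```

The helpers can be mixed with the other forms of selection, for instance select(demo, price, starts_with("st")).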

3.5 Mutating

Frequently when visualizing data you will want to transform that data in some way. Perhaps you’re more interested in the proportion than a number, want to convert a value to a different unit, or just need to change the names of some factor variables. In these cases, mutate() is your best friend.

Let’s start with an example. Say that we want to convert the prices of the cars in the us_cars dataset to Swedish kronor instead. Here’s how to do this:

mutate(us_cars, price = price * 8.92)

## # A tibble: 2,499 × 12
##    price brand model year title_status mileage color vin lot state
##    <dbl> <chr> <chr> <dbl> <chr> <dbl> <chr> <chr> <dbl> <chr>
##  1 56196 toyota cruiser 2008 clean vehic… 274117 black jtez… 1.59e8 new …
##  2 25859. ford se 2011 clean vehic… 190552 silv… 2fmd… 1.67e8 tenn…
##  3 47722 dodge mpv 2018 clean vehic… 39590 silv… 3c4p… 1.68e8 geor…
##  4 223000 ford door 2014 clean vehic… 64146 blue 1ftf… 1.68e8 virg…
##  5 247084 chevrolet 1500 2018 clean vehic… 6654 red 3gcp… 1.68e8 flor…
##  6 50844 dodge mpv 2018 clean vehic… 45561 white 2c4r… 1.68e8 texas
##  7 65116 chevrolet pk 2010 clean vehic… 149050 black 1gcs… 1.68e8 geor…
##  8 119082 gmc door 2017 clean vehic… 23525 gray 1gks… 1.68e8 cali…
##  9 130232 chevrolet malibu 2018 clean vehic… 9371 silv… 1g1z… 1.68e8 flor…
## 10 46830 ford mpv 2017 clean vehic… 63418 black 2fmp… 1.68e8 texas
## # ℹ 2,489 more rows
## # ℹ 2 more variables: country <chr>, condition <chr>

In this instance we simply overwrote the price variable with a new value, but mutate() can also be used to create new variables.

mutate(us_cars, price_sek = price * 8.92)

## # A tibble: 2,499 × 13
##    price brand model year title_status mileage color vin lot state country
##    <dbl> <chr> <chr> <dbl> <chr> <dbl> <chr> <chr> <dbl> <chr> <chr>
##  1 6300 toyo… crui… 2008 clean vehic… 274117 black jtez… 1.59e8 new … usa
##  2 2899 ford se 2011 clean vehic… 190552 silv… 2fmd… 1.67e8 tenn… usa
##  3 5350 dodge mpv 2018 clean vehic… 39590 silv… 3c4p… 1.68e8 geor… usa
##  4 25000 ford door 2014 clean vehic… 64146 blue 1ftf… 1.68e8 virg… usa
##  5 27700 chev… 1500 2018 clean vehic… 6654 red 3gcp… 1.68e8 flor… usa
##  6 5700 dodge mpv 2018 clean vehic… 45561 white 2c4r… 1.68e8 texas usa
##  7 7300 chev… pk 2010 clean vehic… 149050 black 1gcs… 1.68e8 geor… usa
##  8 13350 gmc door 2017 clean vehic… 23525 gray 1gks… 1.68e8 cali… usa
##  9 14600 chev… mali… 2018 clean vehic… 9371 silv… 1g1z… 1.68e8 flor… usa
## 10 5250 ford mpv 2017 clean vehic… 63418 black 2fmp… 1.68e8 texas usa
## # ℹ 2,489 more rows
## # ℹ 2 more variables: condition <chr>, price_sek <dbl>

Inside mutate(), you can basically use any operation that would work on a typical vector in R. You can, for instance, convert between different vector types, or use arbitrary functions (as long as they return a new vector).

mutate(
  us_cars,
  year = as.integer(year),          # convert year from a double to an integer
  state = toupper(state),           # capitalize state names
  price = round(price, -2),         # round to hundreds
  brand = fct_lump_prop(brand, 0.1) # lump together low-frequent brands
)

## # A tibble: 2,499 × 12
##    price brand model year title_status mileage color vin lot state country
##    <dbl> <fct> <chr> <int> <chr> <dbl> <chr> <chr> <dbl> <chr> <chr>
##  1 6300 Other crui… 2008 clean vehic… 274117 black jtez… 1.59e8 NEW … usa
##  2 2900 ford se 2011 clean vehic… 190552 silv… 2fmd… 1.67e8 TENN… usa
##  3 5400 dodge mpv 2018 clean vehic… 39590 silv… 3c4p… 1.68e8 GEOR… usa
##  4 25000 ford door 2014 clean vehic… 64146 blue 1ftf… 1.68e8 VIRG… usa
##  5 27700 chev… 1500 2018 clean vehic… 6654 red 3gcp… 1.68e8 FLOR… usa
##  6 5700 dodge mpv 2018 clean vehic… 45561 white 2c4r… 1.68e8 TEXAS usa
##  7 7300 chev… pk 2010 clean vehic… 149050 black 1gcs… 1.68e8 GEOR… usa
##  8 13400 Other door 2017 clean vehic… 23525 gray 1gks… 1.68e8 CALI… usa
##  9 14600 chev… mali… 2018 clean vehic… 9371 silv… 1g1z… 1.68e8 FLOR… usa
## 10 5200 ford mpv 2017 clean vehic… 63418 black 2fmp… 1.68e8 TEXAS usa
## # ℹ 2,489 more rows
## # ℹ 1 more variable: condition <chr>

3.6 Rename

When you produce visualizations, you need to make sure that annotations in the plot, such as the axis titles, are legible. Many datasets, however, have variable (column) names that are not. You can always fix this when you set up your plot, but if you’re producing many visualizations it may become tedious to rename the axes with every plot.

That is why it is often useful to rename your variables during the data wrangling step. This also has the benefit of making your data wrangling steps more readable. There are multiple ways to rename variables with the tidyverse approach. We’ve already seen that you can use select() to rename variables, but if you only want to rename and not select, you should instead use rename(). Just as with select(), you use a new_name = old_name pair to do so. Let’s rename “vin” to “vintage”.

us_cars %>% rename(vintage = vin)

## # A tibble: 2,499 × 12
##    price brand model year title_status mileage color vintage lot state
##    <dbl> <chr> <chr> <dbl> <chr> <dbl> <chr> <chr> <dbl> <chr>
##  1 6300 toyota cruiser 2008 clean vehic… 274117 black jtezu1… 1.59e8 new …
##  2 2899 ford se 2011 clean vehic… 190552 silv… 2fmdk3… 1.67e8 tenn…
##  3 5350 dodge mpv 2018 clean vehic… 39590 silv… 3c4pdc… 1.68e8 geor…
##  4 25000 ford door 2014 clean vehic… 64146 blue 1ftfw1… 1.68e8 virg…
##  5 27700 chevrolet 1500 2018 clean vehic… 6654 red 3gcpcr… 1.68e8 flor…
##  6 5700 dodge mpv 2018 clean vehic… 45561 white 2c4rdg… 1.68e8 texas
##  7 7300 chevrolet pk 2010 clean vehic… 149050 black 1gcsks… 1.68e8 geor…
##  8 13350 gmc door 2017 clean vehic… 23525 gray 1gks2g… 1.68e8 cali…
##  9 14600 chevrolet malibu 2018 clean vehic… 9371 silv… 1g1zd5… 1.68e8 flor…
## 10 5250 ford mpv 2017 clean vehic… 63418 black 2fmpk3… 1.68e8 texas
## # ℹ 2,489 more rows
## # ℹ 2 more variables: country <chr>, condition <chr>

You can have spaces in your variable names if you want to, but this is bad practice, as is including special characters like %, &, or $. Such names can have undesired side effects that you do best to avoid.
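One concrete inconvenience is that such names have to be wrapped in backticks every time they are used. The sketch below illustrates this on a made-up one-row tibble; demo and the names `price in $` and price_usd are hypothetical and chosen only for this example.

```r
library(dplyr)

# Hypothetical one-row tibble, for illustration only
demo <- tibble(vin = "jtezu11f88k007763", price = 6300)

# A name with spaces or special characters needs backticks wherever it appears
awkward <- rename(demo, `price in $` = price)
mutate(awkward, price_sek = `price in $` * 8.92)

# A syntactic name can be typed as-is
clean <- rename(demo, price_usd = price)
mutate(clean, price_sek = price_usd * 8.92)
```

If you need a nicely formatted label, it is usually easier to keep the simple variable name and set the label when you build the plot.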

3.7 Grouping, Mutating, and Summarizing

Another common operation in data visualization is to group and summarize data. This is useful when you want to visualize a summary statistic rather than the raw data. Before taking this route in data visualization, however, make it a habit to ask yourself whether this aggregation is needed. If you are able to craft a visualization where you showcase all your data without sacrificing the communicative properties of the visualization, then this is always a better alternative.

3.8 Grouping

To group data using the tidyverse methodology, we use group_by() (from dplyr). This is a very simple function. You simply list all the variables (columns) that you want to group by. Let’s group the us_cars dataset by country (USA or Canada).

group_by(us_cars, country)

## # A tibble: 2,499 × 12
## # Groups: country [2]
##    price brand model year title_status mileage color vin lot state country
##    <dbl> <chr> <chr> <dbl> <chr> <dbl> <chr> <chr> <dbl> <chr> <chr>
##  1 6300 toyo… crui… 2008 clean vehic… 274117 black jtez… 1.59e8 new … usa
##  2 2899 ford se 2011 clean vehic… 190552 silv… 2fmd… 1.67e8 tenn… usa
##  3 5350 dodge mpv 2018 clean vehic… 39590 silv… 3c4p… 1.68e8 geor… usa
##  4 25000 ford door 2014 clean vehic… 64146 blue 1ftf… 1.68e8 virg… usa
##  5 27700 chev… 1500 2018 clean vehic… 6654 red 3gcp… 1.68e8 flor… usa
##  6 5700 dodge mpv 2018 clean vehic… 45561 white 2c4r… 1.68e8 texas usa
##  7 7300 chev… pk 2010 clean vehic… 149050 black 1gcs… 1.68e8 geor… usa
##  8 13350 gmc door 2017 clean vehic… 23525 gray 1gks… 1.68e8 cali… usa
##  9 14600 chev… mali… 2018 clean vehic… 9371 silv… 1g1z… 1.68e8 flor… usa
## 10 5250 ford mpv 2017 clean vehic… 63418 black 2fmp… 1.68e8 texas usa
## # ℹ 2,489 more rows
## # ℹ 1 more variable: condition <chr>

Okay, so not much happened! If you look closely, however, you’ll see that this tibble is now grouped, but what does this mean? Well, the truth is that grouping is only useful when combined with functions that modify the data, especially summarize().

3.9 Summarizing

summarize() takes a dataset (usually a grouped dataset) and a set of functions that take vectors as input and return summary statistics, such as

mean(),
median(),
sum(), or
quantile(..., probs = 0.25).

Let’s see how this looks in practice by computing the median and mean prices of cars in the USA and Canada.

us_cars_grouped <- group_by(us_cars, country)

summarize(
  us_cars_grouped,
  median_price = median(price),
  mean_price = mean(price),
  number_of_cars = n() # captures the size of the group
)

## # A tibble: 2 × 4
##   country median_price mean_price number_of_cars
##   <chr>          <dbl>      <dbl>          <int>
## 1 canada         30000     30357.              7
## 2 usa            16894     18735.           2492

Perfect! Note the use of the n() function that counts the number of observations in each group. In this case, it highlights the problem that can occur when we group, summarize, and plot, namely, that we lose information about the size of the groups. It would not be reasonable to draw conclusions about cars from Canada based on only seven observations.
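Grouping also changes how mutate() behaves: functions such as mean() and n() are then evaluated within each group, while every row is kept. The following is a minimal sketch on made-up data (demo, its values, and the new column names are hypothetical).

```r
library(dplyr)

# Hypothetical prices for five cars in two countries
demo <- tibble(
  country = c("usa", "usa", "usa", "canada", "canada"),
  price = c(10000, 20000, 30000, 25000, 35000)
)

demo_grouped <- group_by(demo, country)

with_group_stats <- mutate(
  demo_grouped,
  group_mean = mean(price),            # mean within each country
  diff_from_mean = price - group_mean, # each car relative to its group
  group_size = n()                     # group size, repeated on every row
)

# drop the grouping once it is no longer needed
with_group_stats <- ungroup(with_group_stats)
with_group_stats
```

Keeping the group size alongside each observation is one way to avoid losing sight of how small some groups are.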

Missing data is an issue that is prevalent in much of the real-life data that you may encounter, particularly when it comes to data involving people. Dealing with missing data is an important topic but, as we said in the introduction, it is a topic that lies mostly outside the scope of this course. We do, however, recommend that you always take time to consider the reasons why data may be missing and for the most part make it a habit to include missing data in your visualizations instead of simply removing it.

What you do need to know, however, is that missing data (coded as NA in R) may interfere with your work in R. Let’s say that you’re working with sleep data on mammals using the msleep dataset from ggplot2, and want to compute the mean number of hours in REM sleep:

mean(msleep$sleep_rem)

## [1] NA

The result here, NA, is a consequence of the presence of NA values in the column sleep_rem in msleep, and this behavior is actually the default for many functions in R. To deal with this, you have two options: 1) directly remove the observations with NA values from the dataset using drop_na() or 2) set na.rm = TRUE in the call to mean(), which excludes the NA observations in that particular function call.

The second option is usually the best choice, since it means that you can still keep the NA values around when you want to produce visualizations, but in this course we’ll sometimes use drop_na() to make working with the datasets a little bit smoother.
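The two options can be sketched as follows. To keep the numbers easy to check, this example assumes a made-up three-row tibble (demo, with hypothetical values) instead of the real msleep data.

```r
library(tibble)
library(tidyr)

# Hypothetical miniature of msleep with one missing value
demo <- tibble(
  name = c("cow", "dog", "owl"),
  sleep_rem = c(1, NA, 3)
)

mean(demo$sleep_rem)               # NA, because one value is missing
mean(demo$sleep_rem, na.rm = TRUE) # 2, computed from the two known values

# drop_na() instead removes the incomplete rows from the data itself
complete <- drop_na(demo, sleep_rem)
nrow(complete)                     # 2
```

Note that demo still contains all three rows after the call with na.rm = TRUE, whereas complete has lost the row with the missing value for good.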

The pipe operator, %>%, is a staple in data science workflows because it makes the process of managing data much more manageable, modular, and readable. The pipe operator can be hard to get to grips with at first, but once you do, you will never want to go back.

The pipe operator takes an object on its left-hand side (or on the line above) and pipes it through to an object (almost exclusively a function) on the right-hand side. The object on the left-hand side enters the function on the right-hand side as the first argument of that function, pushing back any other arguments that are already there.

Let’s say that we have the following function, my_frac(), which takes arguments x and y and returns x/y.

my_frac <- function(x, y) {
  x / y
}

Now we can use this function on two numbers, like so:

my_frac(2, 5)

## [1] 0.4

If we were to do this with the pipe operator instead, it would look like this:

# use the pipe operator
2 %>% my_frac(5) # equivalent to my_frac(2, 5)

## [1] 0.4

# or we can do
2 %>%
  my_frac(5)

## [1] 0.4

You’re probably thinking that this doesn’t seem very useful at all, given that we’ve now spent more code to accomplish exactly what my_frac() already accomplished very well without the pipe. Where the pipe operator comes into its own, however, is when we need to chain multiple functions.

5.1 Case Study using Pipes

Let’s say that we want to take our us_cars dataset and perform a set of operations, namely:

keep only the cars that are black,
group cars by state,
compute the mean mileage per state,
sort the states by mean mileage, and
plot a simple bar chart of state versus mean mileage.

There are more or less three ways to do this:

use intermediary objects,
use function composition, or
use pipes.

5.2 Intermediate Objects

To solve this by storing temporary objects, here’s what we would do:

black_cars <- filter(us_cars, color == "black")
black_cars_groupedby_states <- group_by(black_cars, state)
black_cars_state_mileage <- summarize(
  black_cars_groupedby_states,
  mean_mileage = mean(mileage)
)
arrange(black_cars_state_mileage, mean_mileage)

## # A tibble: 35 × 2
##    state         mean_mileage
##    <chr>                <dbl>
##  1 kentucky             7232
##  2 arizona             17987.
##  3 washington          23432
##  4 virginia            31971.
##  5 nevada              35147.
##  6 minnesota           36396.
##  7 pennsylvania        36828.
##  8 west virginia       37832.
##  9 california          39376.
## 10 illinois            40350.
## # ℹ 25 more rows

The largest downside to this approach is that you’re cluttering the workspace with names of intermediary objects that you need to keep track of and name with meaningful names, or alternatively name by incrementing a counter such as cars1, cars2, etc. This latter approach is actually what you’ll often end up doing, and (at least for me) often leads to mixing up the counters and ending up with the wrong results.

Storing intermediate results is sometimes precisely the right way to go about this, especially when you will later use the intermediary result, but in this case it leads to code that is hard to read and manage.

5.3 Function Composition

An alternative is to use composite functions for our call. The code above then looks like this.

arrange(
  summarize(
    group_by(
      filter(us_cars, color == "black"),
      state
    ),
    mean_mileage = mean(mileage)
  ),
  mean_mileage
)

## # A tibble: 35 × 2
##    state         mean_mileage
##    <chr>                <dbl>
##  1 kentucky             7232
##  2 arizona             17987.
##  3 washington          23432
##  4 virginia            31971.
##  5 nevada              35147.
##  6 minnesota           36396.
##  7 pennsylvania        36828.
##  8 west virginia       37832.
##  9 california          39376.
## 10 illinois            40350.
## # ℹ 25 more rows

Code like this is incredibly hard to manage (and read). Simply keeping track of which argument belongs to which function is strenuous in itself. On top of that, imagine that you want to swap the order of two of these operations or add another step somewhere, and it’s easy to see why this is not an attractive approach.

5.4 Pipes

With pipes, we get the following result:

us_cars %>%
  filter(color == "black") %>%
  group_by(state) %>%
  summarize(mean_mileage = mean(mileage)) %>%
  arrange(mean_mileage)

## # A tibble: 35 × 2
##    state         mean_mileage
##    <chr>                <dbl>
##  1 kentucky             7232
##  2 arizona             17987.
##  3 washington          23432
##  4 virginia            31971.
##  5 nevada              35147.
##  6 minnesota           36396.
##  7 pennsylvania        36828.
##  8 west virginia       37832.
##  9 california          39376.
## 10 illinois            40350.
## # ℹ 25 more rows

The code is clean and expressive. If we want to switch the order of the steps or add another, it’s only a matter of adding or removing a row of the code. On top of that, you don’t need to name any intermediate objects (but remember that doing so is sometimes actually desirable).
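The last step in the list of operations, the bar chart, is not part of the pipelines shown in this section. One possible way to finish the job is sketched below; it assumes a small made-up dataset (demo) in place of us_cars, and the choice of reorder() for sorting the bars and of the axis labels is ours.

```r
library(dplyr)
library(ggplot2)

# Hypothetical miniature of us_cars, for illustration only
demo <- tibble(
  state = c("texas", "texas", "ohio", "ohio", "utah"),
  color = c("black", "black", "black", "white", "black"),
  mileage = c(40000, 60000, 30000, 90000, 10000)
)

p <- demo %>%
  filter(color == "black") %>%
  group_by(state) %>%
  summarize(mean_mileage = mean(mileage)) %>%
  # the summarized data becomes the first argument of ggplot();
  # reorder() sorts the bars by mean mileage in the plot itself
  ggplot(aes(mean_mileage, reorder(state, mean_mileage))) +
  geom_col() +
  labs(x = "Mean mileage", y = "State")

p
```

Note that the steps are chained with %>% up to ggplot(), after which the layers are added with + as usual.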

The tidyverse is designed specifically with the pipe operator in mind. The reason that it works so well with the tidyverse is that virtually all of the main functions in the tidyverse take a data set as the first argument, which means that piping always works. The base R functions and many other packages out there don’t necessarily abide by this principle, however, so make sure you know what your functions are doing when you’re using pipes or you may end up with unexpected (and undesirable) results.

Note that the pipe operator that we have covered so far (%>%) is not part of the base R distribution. It comes from the magrittr package and is re-exported by several tidyverse packages, including dplyr, so simply calling library(tidyverse) or library(dplyr) will get you access to it.
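As an example of a function that does not take the data first, consider lm() from base R, whose first argument is a formula. The sketch below uses made-up data (demo) and relies on the . placeholder, which is a feature of the %>% operator from magrittr and tells the pipe where to put the left-hand side.

```r
library(dplyr)

# Hypothetical mileage and price for four cars
demo <- tibble(
  mileage = c(10000, 50000, 90000, 130000),
  price = c(30000, 22000, 15000, 7000)
)

# lm() expects a formula first and the data second, so a plain pipe would
# pass the data in the wrong position; `.` marks where the data should go
fit <- demo %>%
  lm(price ~ mileage, data = .)

coef(fit)
```

Without the placeholder, demo would be passed as the first argument and end up in the formula argument, which is not what we want here.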

5.5 The Native Pipe Operator `|>`

R has actually recently[3] implemented its own native pipe operator, which you don’t need any additional packages to use. It is written as |>. So in the last example we covered, you may as well have written the first two lines as the following.

us_cars |>
  filter(color == "black")

## # A tibble: 516 × 12
##    price brand model year title_status mileage color vin lot state country
##    <dbl> <chr> <chr> <dbl> <chr> <dbl> <chr> <chr> <dbl> <chr> <chr>
##  1 6300 toyo… crui… 2008 clean vehic… 274117 black jtez… 1.59e8 new … usa
##  2 7300 chev… pk 2010 clean vehic… 149050 black 1gcs… 1.68e8 geor… usa
##  3 5250 ford mpv 2017 clean vehic… 63418 black 2fmp… 1.68e8 texas usa
##  4 31900 chev… 1500 2018 clean vehic… 22909 black 3gcu… 1.68e8 tenn… usa
##  5 20700 ford door 2013 clean vehic… 100757 black 1ftf… 1.68e8 virg… usa
##  6 7300 kia forte 2018 clean vehic… 38823 black 3kpf… 1.68e8 nort… usa
##  7 15000 chev… door 2015 clean vehic… 61578 black 2gnf… 1.68e8 ohio usa
##  8 11900 gmc door 2017 clean vehic… 27965 black 1gks… 1.68e8 cali… usa
##  9 55000 ford srw 2017 clean vehic… 15273 black 1ft7… 1.68e8 penn… usa
## 10 12520 gmc door 2017 clean vehic… 28972 black 1gks… 1.68e8 cali… usa
## # ℹ 506 more rows
## # ℹ 1 more variable: condition <chr>

For all the purposes that we use it for in this course, you may as well use the native pipe operator if you prefer to. %>% has a bit of additional functionality, but we will not need it in this course.

If you want to read more about pipes or just need a longer introduction, we recommend taking a look at the piping section of the R for Data Science book.
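To convince yourself that the two operators are interchangeable for simple chains like ours, you can run the same steps with both and compare the results. This is only a small sketch on made-up data (demo is hypothetical), and it needs R 4.1 or later for |>.

```r
library(dplyr)

# Hypothetical miniature of us_cars, for illustration only
demo <- tibble(
  color = c("black", "white", "black"),
  mileage = c(40000, 90000, 10000)
)

with_magrittr <- demo %>%
  filter(color == "black") %>%
  summarize(mean_mileage = mean(mileage))

with_native <- demo |>
  filter(color == "black") |>
  summarize(mean_mileage = mean(mileage))

# compare the two results
all.equal(with_magrittr, with_native)
```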

Have fun piping in the tidyverse!

The source code for this document is available at https://github.com/stat-lu/dataviz/blob/main/worked-examples/cars.Rmd.

[1] And this might very well come in handy for your project, where you will choose your data yourself.
[2] There is a read_csv2() function as well, which is for semicolon-separated data.
[3] Since version 4.1.