dplyr is a powerful R package for data manipulation.
- It provides a consistent and intuitive set of “verbs” for common data-wrangling tasks.
- It is a core part of the tidyverse, a collection of packages designed for data science.
2025-08-20
The tidyverse Philosophy
Tidyverse code typically uses the pipe operator, %>%, to chain multiple operations together.
The Pipe (%>%)
The pipe operator, %>%, allows you to pass the result of one function to the next.
Example:
> library(magrittr)  # You get this with `dplyr`
> # Without the pipe
> mean(c(1, 2, 3, 4, 5))
[1] 3
>
> # With the pipe
> c(1, 2, 3, 4, 5) %>%
+   mean()
[1] 3
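As a rough sketch of what the pipe does with longer chains, the snippet below compares a nested call with the equivalent piped version. The vector x and the names nested and piped are made up for illustration.

```r
library(magrittr)  # provides %>%; loading dplyr makes it available too

x <- c(1, 2, 3, 4, 5)  # hypothetical input

# Nested calls are read inside-out
nested <- round(sqrt(sum(x)), 2)

# x %>% f(y) is evaluated as f(x, y): the left-hand side
# becomes the first argument of the function on the right
piped <- x %>%
  sum() %>%
  sqrt() %>%
  round(2)

identical(nested, piped)
```

Both forms compute the same value; the piped version simply reads top to bottom in the order the steps run.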
We’ll use the starwars dataset from the dplyr package for our examples.
Code:
> library(dplyr)
> starwars
# A tibble: 87 × 14
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr>
 1 Luke Sk…    172    77 blond      fair       blue            19   male  mascu…
 2 C-3PO       167    75 <NA>       gold       yellow         112   none  mascu…
 3 R2-D2        96    32 <NA>       white, bl… red             33   none  mascu…
 4 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
 5 Leia Or…    150    49 brown      light      brown           19   fema… femin…
 6 Owen La…    178   120 brown, gr… light      blue            52   male  mascu…
 7 Beru Wh…    165    75 brown      light      blue            47   fema… femin…
 8 R5-D4        97    32 <NA>       white, red red             NA   none  mascu…
 9 Biggs D…    183    84 black      light      brown           24   male  mascu…
10 Obi-Wan…    182    77 auburn, w… fair       blue-gray       57   male  mascu…
# ℹ 77 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>
Output: A tibble (a modern data frame) with 14 variables and 87 observations.
select(): Choosing Columns
The select() function allows you to choose columns by name.
Example: Selecting name, height, and mass.
> starwars %>% select(name, height, mass)
# A tibble: 87 × 3
   name               height  mass
   <chr>               <int> <dbl>
 1 Luke Skywalker        172    77
 2 C-3PO                 167    75
 3 R2-D2                  96    32
 4 Darth Vader           202   136
 5 Leia Organa           150    49
 6 Owen Lars             178   120
 7 Beru Whitesun Lars    165    75
 8 R5-D4                  97    32
 9 Biggs Darklighter     183    84
10 Obi-Wan Kenobi        182    77
# ℹ 77 more rows
select(): Excluding Columns
You can also use select() to drop columns by using a minus sign (-).
Example: Dropping films and vehicles.
> starwars %>% select(-films, -vehicles)
# A tibble: 87 × 12
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr>
 1 Luke Sk…    172    77 blond      fair       blue            19   male  mascu…
 2 C-3PO       167    75 <NA>       gold       yellow         112   none  mascu…
 3 R2-D2        96    32 <NA>       white, bl… red             33   none  mascu…
 4 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
 5 Leia Or…    150    49 brown      light      brown           19   fema… femin…
 6 Owen La…    178   120 brown, gr… light      blue            52   male  mascu…
 7 Beru Wh…    165    75 brown      light      blue            47   fema… femin…
 8 R5-D4        97    32 <NA>       white, red red             NA   none  mascu…
 9 Biggs D…    183    84 black      light      brown           24   male  mascu…
10 Obi-Wan…    182    77 auburn, w… fair       blue-gray       57   male  mascu…
# ℹ 77 more rows
# ℹ 3 more variables: homeworld <chr>, species <chr>, starships <list>
select() with Helper Functions
select() has helpers like starts_with(), ends_with(), and contains() to choose columns programmatically.
Example: Selecting all columns that end with color.
> starwars %>% select(ends_with("color"))
# A tibble: 87 × 3
   hair_color    skin_color  eye_color
   <chr>         <chr>       <chr>
 1 blond         fair        blue
 2 <NA>          gold        yellow
 3 <NA>          white, blue red
 4 none          white       yellow
 5 brown         light       brown
 6 brown, grey   light       blue
 7 brown         light       blue
 8 <NA>          white, red  red
 9 black         light       brown
10 auburn, white fair        blue-gray
# ℹ 77 more rows
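The other two helpers work the same way. A minimal sketch on a made-up tibble (toy and its column names are hypothetical, chosen so the three helpers give different answers):

```r
library(dplyr)

# Hypothetical toy data, used only for illustration
toy <- tibble(
  id = 1:2,
  hair_color = c("brown", "black"),
  eye_color = c("blue", "green"),
  color_code = c("A1", "B2")
)

starts <- toy %>% select(starts_with("color"))  # name begins with "color"
ends   <- toy %>% select(ends_with("color"))    # name ends with "color"
has    <- toy %>% select(contains("color"))     # "color" anywhere in the name

names(starts)
names(ends)
names(has)
```

Here starts_with() keeps only color_code, ends_with() keeps hair_color and eye_color, and contains() keeps all three.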
filter(): Choosing Rows
The filter() function allows you to choose rows based on a condition.
Example: Filtering for characters with a height greater than 150.
> starwars %>% filter(height > 150)
# A tibble: 69 × 14
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr>
 1 Luke Sk…    172    77 blond      fair       blue            19   male  mascu…
 2 C-3PO       167    75 <NA>       gold       yellow         112   none  mascu…
 3 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
 4 Owen La…    178   120 brown, gr… light      blue            52   male  mascu…
 5 Beru Wh…    165    75 brown      light      blue            47   fema… femin…
 6 Biggs D…    183    84 black      light      brown           24   male  mascu…
 7 Obi-Wan…    182    77 auburn, w… fair       blue-gray       57   male  mascu…
 8 Anakin …    188    84 blond      fair       blue            41.9 male  mascu…
 9 Wilhuff…    180    NA auburn, g… fair       blue            64   male  mascu…
10 Chewbac…    228   112 brown      unknown    blue           200   male  mascu…
# ℹ 59 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>
filter() with Multiple Conditions
You can combine conditions using logical operators:
- & for AND (both conditions must be true)
- | for OR (at least one condition must be true)
Example: Filtering for characters with height > 150 AND mass > 50.
> starwars %>% filter(height > 150 & mass > 50)
# A tibble: 46 × 14
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr>
 1 Luke Sk…    172    77 blond      fair       blue            19   male  mascu…
 2 C-3PO       167    75 <NA>       gold       yellow         112   none  mascu…
 3 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
 4 Owen La…    178   120 brown, gr… light      blue            52   male  mascu…
 5 Beru Wh…    165    75 brown      light      blue            47   fema… femin…
 6 Biggs D…    183    84 black      light      brown           24   male  mascu…
 7 Obi-Wan…    182    77 auburn, w… fair       blue-gray       57   male  mascu…
 8 Anakin …    188    84 blond      fair       blue            41.9 male  mascu…
 9 Chewbac…    228   112 brown      unknown    blue           200   male  mascu…
10 Han Solo    180    80 brown      fair       brown           29   male  mascu…
# ℹ 36 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>
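The OR operator is not shown above, so here is a small sketch contrasting & and | on a made-up tibble (toy and its values are hypothetical). It also illustrates that filter() keeps only rows where the condition is TRUE, so rows where it evaluates to NA are dropped.

```r
library(dplyr)

# Hypothetical toy data with some missing masses
toy <- tibble(
  name   = c("a", "b", "c", "d", "e"),
  height = c(170, 160, 140, 180, 140),
  mass   = c(70, 40, 60, NA, NA)
)

# AND: both conditions must be TRUE
both <- toy %>% filter(height > 150 & mass > 50)

# OR: at least one condition must be TRUE
either <- toy %>% filter(height > 150 | mass > 50)

both$name    # only "a"; row "d" gives TRUE & NA = NA and is dropped
either$name  # "a" "b" "c" "d"; row "e" gives FALSE | NA = NA and is dropped
```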
arrange(): Reordering Rows
The arrange() function sorts rows by a specified column.
Example: Arranging rows by height in ascending order.
> starwars %>% arrange(height)
# A tibble: 87 × 14
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr>
 1 Yoda         66    17 white      green      brown            896 male  mascu…
 2 Ratts T…     79    15 none       grey, blue unknown           NA male  mascu…
 3 Wicket …     88    20 brown      brown      brown              8 male  mascu…
 4 Dud Bolt     94    45 none       blue, grey yellow            NA male  mascu…
 5 R2-D2        96    32 <NA>       white, bl… red               33 none  mascu…
 6 R4-P17       96    NA none       silver, r… red, blue         NA none  femin…
 7 R5-D4        97    32 <NA>       white, red red               NA none  mascu…
 8 Sebulba     112    40 none       grey, red  orange            NA male  mascu…
 9 Gasgano     122    NA none       white, bl… black             NA male  mascu…
10 Watto       137    NA black      blue, grey yellow            NA male  mascu…
# ℹ 77 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>
arrange(): Descending Order
Use the desc() function inside arrange() to sort in descending order.
Example: Arranging rows by height in descending order.
> starwars %>% arrange(desc(height))
# A tibble: 87 × 14
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr>
 1 Yarael …    264    NA none       white      yellow          NA   male  mascu…
 2 Tarfful     234   136 brown      brown      blue            NA   male  mascu…
 3 Lama Su     229    88 none       grey       black           NA   male  mascu…
 4 Chewbac…    228   112 brown      unknown    blue           200   male  mascu…
 5 Roos Ta…    224    82 none       grey       orange          NA   male  mascu…
 6 Grievous    216   159 none       brown, wh… green, y…       NA   male  mascu…
 7 Taun We     213    NA none       grey       black           NA   fema… femin…
 8 Rugor N…    206    NA none       green      orange          NA   male  mascu…
 9 Tion Me…    206    80 none       grey       black           NA   male  mascu…
10 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
# ℹ 77 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>
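arrange() also accepts several sort keys, with later keys breaking ties in earlier ones. A minimal sketch on a hypothetical tibble (toy is made up for illustration); note that missing values are placed last, even inside desc().

```r
library(dplyr)

# Hypothetical toy data with one missing height
toy <- tibble(
  name    = c("a", "b", "c", "d"),
  species = c("Human", "Droid", "Human", "Droid"),
  height  = c(170, 96, NA, 167)
)

# Sort by species (ascending), then by height (descending) within species
sorted <- toy %>% arrange(species, desc(height))

sorted$name  # "d" "b" "a" "c": the NA height ends up last in its species
```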
mutate(): Adding New Columns
The mutate() function adds new columns to your data frame.
Example: Creating a new column height_meters.
> starwars %>% mutate(height_meters = height / 100) %>% select(name, height, height_meters)
# A tibble: 87 × 3
   name               height height_meters
   <chr>               <int>         <dbl>
 1 Luke Skywalker        172          1.72
 2 C-3PO                 167          1.67
 3 R2-D2                  96          0.96
 4 Darth Vader           202          2.02
 5 Leia Organa           150          1.5
 6 Owen Lars             178          1.78
 7 Beru Whitesun Lars    165          1.65
 8 R5-D4                  97          0.97
 9 Biggs Darklighter     183          1.83
10 Obi-Wan Kenobi        182          1.82
# ℹ 77 more rows
mutate() with a Calculation
You can use mutate() to create a new column based on a calculation involving existing columns. Columns created earlier in the same mutate() call, such as height_meters here, can be used right away.
Example: Creating a bmi column.
> starwars %>% mutate(height_meters = height / 100, bmi = mass / (height_meters^2)) %>% select(name, bmi)
# A tibble: 87 × 2
   name                 bmi
   <chr>              <dbl>
 1 Luke Skywalker      26.0
 2 C-3PO               26.9
 3 R2-D2               34.7
 4 Darth Vader         33.3
 5 Leia Organa         21.8
 6 Owen Lars           37.9
 7 Beru Whitesun Lars  27.5
 8 R5-D4               34.0
 9 Biggs Darklighter   25.1
10 Obi-Wan Kenobi      23.2
# ℹ 77 more rows
summarize(): Condensing Data
summarize() (or summarise()) collapses a data frame into a single row.
Use the na.rm = TRUE option if there is missing data.
Example: Finding the mean_height.
> starwars %>% summarize(mean_height = mean(height, na.rm = TRUE))
# A tibble: 1 × 1
  mean_height
        <dbl>
1        175.
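To see why na.rm = TRUE matters, here is a small sketch on a hypothetical tibble with one missing value (toy and the column names in the result are made up for illustration):

```r
library(dplyr)

# Hypothetical toy data with one missing height
toy <- tibble(height = c(150, 170, NA))

res <- toy %>%
  summarize(
    mean_default = mean(height),                # NA: one missing value poisons the mean
    mean_na_rm   = mean(height, na.rm = TRUE),  # 160: mean of the non-missing values
    n_missing    = sum(is.na(height))           # 1: how many values were skipped
  )

res
```

Reporting the number of missing values alongside the summary is one way to keep track of how much data was skipped.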
summarize() with Multiple Statistics
You can calculate multiple summary statistics at once.
Example: Calculating mean height and mass.
> starwars %>%
+   summarize(
+     mean_height = mean(height, na.rm = TRUE),
+     mean_mass = mean(mass, na.rm = TRUE)
+   )
# A tibble: 1 × 2
  mean_height mean_mass
        <dbl>     <dbl>
1        175.      97.3
group_by(): The Key to Grouped Summaries
The group_by() function doesn’t change the data immediately, but it sets up a special grouping structure.
Any subsequent dplyr verb will operate on these groups independently.
Example: Grouping the data by gender.
> starwars %>% group_by(gender)
# A tibble: 87 × 14
# Groups:   gender [3]
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr>
 1 Luke Sk…    172    77 blond      fair       blue            19   male  mascu…
 2 C-3PO       167    75 <NA>       gold       yellow         112   none  mascu…
 3 R2-D2        96    32 <NA>       white, bl… red             33   none  mascu…
 4 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
 5 Leia Or…    150    49 brown      light      brown           19   fema… femin…
 6 Owen La…    178   120 brown, gr… light      blue            52   male  mascu…
 7 Beru Wh…    165    75 brown      light      blue            47   fema… femin…
 8 R5-D4        97    32 <NA>       white, red red             NA   none  mascu…
 9 Biggs D…    183    84 black      light      brown           24   male  mascu…
10 Obi-Wan…    182    77 auburn, w… fair       blue-gray       57   male  mascu…
# ℹ 77 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>
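The grouping structure can be inspected directly. A minimal sketch on a hypothetical tibble (toy is made up for illustration), showing that the rows are untouched and that ungroup() removes the grouping again:

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  gender = c("feminine", "masculine", "masculine"),
  height = c(150, 172, 202)
)

grouped <- toy %>% group_by(gender)

group_vars(grouped)           # "gender": the grouping column
n_groups(grouped)             # 2: one group per distinct gender
nrow(grouped) == nrow(toy)    # TRUE: no rows were added, removed, or reordered

# ungroup() drops the grouping structure when it is no longer needed
ungrouped <- grouped %>% ungroup()
is_grouped_df(ungrouped)      # FALSE
```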
group_by() + summarize()
This pairing is one of the most common patterns in dplyr.
You group_by() a categorical variable, then you summarize() to get a summary statistic for each group.
Example: Finding the mean height for each gender.
> starwars %>% group_by(gender) %>% summarize(mean_height = mean(height, na.rm = TRUE))
# A tibble: 3 × 2
  gender    mean_height
  <chr>           <dbl>
1 feminine         167.
2 masculine        177.
3 <NA>             175
group_by() with mutate()
When you use mutate() after group_by(), the new column is calculated within each group.
Example: Calculating a character’s mass as a percentage of their species’ average mass.
> starwars %>%
+   group_by(species) %>%
+   mutate(species_average_mass = mean(mass, na.rm = TRUE)) %>%
+   mutate(mass_proportion = (mass / species_average_mass) * 100) %>%
+   select(name, species, mass, species_average_mass, mass_proportion)
# A tibble: 87 × 5
# Groups:   species [38]
   name               species  mass species_average_mass mass_proportion
   <chr>              <chr>   <dbl>                <dbl>           <dbl>
 1 Luke Skywalker     Human      77                 81.3            94.7
 2 C-3PO              Droid      75                 69.8           108.
 3 R2-D2              Droid      32                 69.8            45.9
 4 Darth Vader        Human     136                 81.3           167.
 5 Leia Organa        Human      49                 81.3            60.3
 6 Owen Lars          Human     120                 81.3           148.
 7 Beru Whitesun Lars Human      75                 81.3            92.2
 8 R5-D4              Droid      32                 69.8            45.9
 9 Biggs Darklighter  Human      84                 81.3           103.
10 Obi-Wan Kenobi     Human      77                 81.3            94.7
# ℹ 77 more rows
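To make the difference between grouped and ungrouped mutate() concrete, the sketch below computes the same mean() before and after group_by() on a hypothetical tibble (toy and the new column names are made up for illustration):

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  name    = c("a", "b", "c", "d"),
  species = c("Human", "Human", "Droid", "Droid"),
  mass    = c(80, 60, 30, 90)
)

result <- toy %>%
  mutate(overall_mean = mean(mass)) %>%   # ungrouped: one mean for all rows (65)
  group_by(species) %>%
  mutate(species_mean = mean(mass)) %>%   # grouped: one mean per species
  ungroup()                               # drop the grouping once it is no longer needed

result$overall_mean  # 65 65 65 65
result$species_mean  # 70 70 60 60
```

Unlike summarize(), mutate() keeps every row and its original order; only the value of the new column depends on the grouping.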
Counting with n()
The n() function is a helper that returns the number of rows in the current group. It is typically used inside summarize().
Example: Counting the number of characters for each gender and species.
> starwars %>% group_by(gender, species) %>% summarize(count = n())
# A tibble: 42 × 3
# Groups:   gender [3]
   gender    species    count
   <chr>     <chr>      <int>
 1 feminine  Clawdite       1
 2 feminine  Droid          1
 3 feminine  Human          9
 4 feminine  Kaminoan       1
 5 feminine  Mirialan       2
 6 feminine  Tholothian     1
 7 feminine  Togruta        1
 8 feminine  Twi'lek        1
 9 masculine Aleena         1
10 masculine Besalisk       1
# ℹ 32 more rows
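The output above is still grouped by gender, because summarize() removes only the last grouping level by default. As a sketch on a hypothetical tibble (toy is made up for illustration), the .groups = "drop" argument returns an ungrouped result, and dplyr's count() gives the same counts in one step:

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  gender  = c("feminine", "masculine", "masculine", "masculine"),
  species = c("Human", "Human", "Droid", "Droid")
)

# group_by() + summarize(n()), dropping all grouping afterwards
by_hand <- toy %>%
  group_by(gender, species) %>%
  summarize(count = n(), .groups = "drop")

# count() is a shortcut for the same group-and-count pattern
shortcut <- toy %>% count(gender, species, name = "count")

by_hand$count   # 1 2 1
shortcut$count  # 1 2 1
```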
Let’s find the mean mass of all masculine, human characters taller than 170 cm.
> starwars %>% filter(gender == "masculine" & species == "Human" & height > 170) %>%
+   summarize(mean_mass = mean(mass, na.rm = TRUE))
# A tibble: 1 × 1
  mean_mass
      <dbl>
1      87.0
dplyr Verbs
- select(): Choose columns.
- filter(): Choose rows.
- arrange(): Reorder rows.
- mutate(): Add new columns.
- summarize(): Condense data to a single row.
- group_by(): Perform operations on groups.
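As a closing sketch, the verbs listed above can be chained into a single pipeline. The tibble toy and all column names in the result are hypothetical, chosen only so the output is easy to check by hand.

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  name    = c("a", "b", "c", "d", "e"),
  species = c("Human", "Human", "Droid", "Droid", "Human"),
  height  = c(172, 150, 96, 167, 188),
  mass    = c(77, 49, 32, 75, NA)
)

result <- toy %>%
  select(species, height, mass) %>%          # keep only the columns we need
  filter(height > 100) %>%                   # drops the 96 cm droid
  mutate(height_meters = height / 100) %>%   # add a derived column
  group_by(species) %>%
  summarize(
    n             = n(),
    mean_height_m = mean(height_meters),
    mean_mass     = mean(mass, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(desc(mean_mass))                   # heaviest species first

result
```

With this data the result has two rows: Droid (n = 1, mean mass 75) followed by Human (n = 3, mean mass 63).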