2025-08-20

What is dplyr?

  • dplyr is a powerful R package for data manipulation.
  • It provides a consistent and intuitive set of “verbs” for common data-wrangling tasks.
  • It is a core part of the tidyverse, a collection of packages designed for data science.

The tidyverse Philosophy

  • Focus on consistency and readability.
  • Use the pipe operator, %>%, to chain multiple operations together.
  • The output of one function becomes the first argument of the next, making code easier to read from left to right.

The Pipe Operator (%>%)

The pipe operator, %>%, passes the result on its left to the function on its right as that function's first argument, so x %>% f(y) is equivalent to f(x, y).

Example:

> library(magrittr) # dplyr re-exports %>%, so library(dplyr) alone also works
> # Without the pipe
> mean(c(1, 2, 3, 4, 5))
[1] 3
> 
> # With the pipe
> c(1, 2, 3, 4, 5) %>%
+   mean()
[1] 3
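
To make the "first argument" rule concrete, here is a minimal sketch that chains three base R functions. The vector x holds made-up values chosen only for illustration.

```r
library(magrittr)  # provides %>%; library(dplyr) would work as well

x <- c(4, 9, 16)  # made-up values for illustration

# Each step receives the previous result as its first argument,
# so the piped form reads left to right ...
piped <- x %>% sqrt() %>% sum() %>% round(1)

# ... while the equivalent nested form reads inside out.
nested <- round(sum(sqrt(x)), 1)

identical(piped, nested)  # TRUE
```

Both forms compute the same value; the piped version simply lists the steps in the order they happen.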

Sample Data

We’ll use the starwars dataset from the dplyr package for our examples.

Code:

> library(dplyr)
> starwars
# A tibble: 87 × 14
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
 1 Luke Sk…    172    77 blond      fair       blue            19   male  mascu…
 2 C-3PO       167    75 <NA>       gold       yellow         112   none  mascu…
 3 R2-D2        96    32 <NA>       white, bl… red             33   none  mascu…
 4 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
 5 Leia Or…    150    49 brown      light      brown           19   fema… femin…
 6 Owen La…    178   120 brown, gr… light      blue            52   male  mascu…
 7 Beru Wh…    165    75 brown      light      blue            47   fema… femin…
 8 R5-D4        97    32 <NA>       white, red red             NA   none  mascu…
 9 Biggs D…    183    84 black      light      brown           24   male  mascu…
10 Obi-Wan…    182    77 auburn, w… fair       blue-gray       57   male  mascu…
# ℹ 77 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

Output: A tibble (a modern data frame) with 14 variables and 87 observations.

select(): Choosing Columns

  • The select() function allows you to choose columns by name.
  • The first argument is the data frame, followed by the columns you want to keep.

Example: Selecting name, height, and mass.

> starwars %>% select(name, height, mass)
# A tibble: 87 × 3
   name               height  mass
   <chr>               <int> <dbl>
 1 Luke Skywalker        172    77
 2 C-3PO                 167    75
 3 R2-D2                  96    32
 4 Darth Vader           202   136
 5 Leia Organa           150    49
 6 Owen Lars             178   120
 7 Beru Whitesun Lars    165    75
 8 R5-D4                  97    32
 9 Biggs Darklighter     183    84
10 Obi-Wan Kenobi        182    77
# ℹ 77 more rows

select(): Excluding Columns

You can also use select() to drop columns by using a minus sign (-).

Example: Dropping films and vehicles.

> starwars %>% select(-films, -vehicles)
# A tibble: 87 × 12
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
 1 Luke Sk…    172    77 blond      fair       blue            19   male  mascu…
 2 C-3PO       167    75 <NA>       gold       yellow         112   none  mascu…
 3 R2-D2        96    32 <NA>       white, bl… red             33   none  mascu…
 4 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
 5 Leia Or…    150    49 brown      light      brown           19   fema… femin…
 6 Owen La…    178   120 brown, gr… light      blue            52   male  mascu…
 7 Beru Wh…    165    75 brown      light      blue            47   fema… femin…
 8 R5-D4        97    32 <NA>       white, red red             NA   none  mascu…
 9 Biggs D…    183    84 black      light      brown           24   male  mascu…
10 Obi-Wan…    182    77 auburn, w… fair       blue-gray       57   male  mascu…
# ℹ 77 more rows
# ℹ 3 more variables: homeworld <chr>, species <chr>, starships <list>

select() with Helper Functions

select() has helpers like starts_with(), ends_with(), and contains() that choose columns by matching patterns in their names.

Example: Selecting all columns that end with color.

> starwars %>% select(ends_with("color"))
# A tibble: 87 × 3
   hair_color    skin_color  eye_color
   <chr>         <chr>       <chr>    
 1 blond         fair        blue     
 2 <NA>          gold        yellow   
 3 <NA>          white, blue red      
 4 none          white       yellow   
 5 brown         light       brown    
 6 brown, grey   light       blue     
 7 brown         light       blue     
 8 <NA>          white, red  red      
 9 black         light       brown    
10 auburn, white fair        blue-gray
# ℹ 77 more rows
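
The other two helpers work the same way. The sketch below applies them to starwars and inspects only the resulting column names; the object names by_prefix and by_substring are made up for illustration.

```r
library(dplyr)

# Keep name plus every column whose name starts with "h"
by_prefix <- starwars %>% select(name, starts_with("h"))

# Keep every column whose name contains an underscore
by_substring <- starwars %>% select(contains("_"))

names(by_prefix)     # "name" "height" "hair_color" "homeworld"
names(by_substring)  # "hair_color" "skin_color" "eye_color" "birth_year"
```

Helpers can be mixed with plain column names in the same select() call, as the first example shows.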

filter(): Choosing Rows

  • The filter() function keeps the rows that satisfy a condition.
  • The first argument is the data frame, followed by one or more logical conditions.
  • Rows where a condition evaluates to NA are dropped.

Example: Filtering for characters with a height greater than 150.

> starwars %>% filter(height > 150)
# A tibble: 69 × 14
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
 1 Luke Sk…    172    77 blond      fair       blue            19   male  mascu…
 2 C-3PO       167    75 <NA>       gold       yellow         112   none  mascu…
 3 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
 4 Owen La…    178   120 brown, gr… light      blue            52   male  mascu…
 5 Beru Wh…    165    75 brown      light      blue            47   fema… femin…
 6 Biggs D…    183    84 black      light      brown           24   male  mascu…
 7 Obi-Wan…    182    77 auburn, w… fair       blue-gray       57   male  mascu…
 8 Anakin …    188    84 blond      fair       blue            41.9 male  mascu…
 9 Wilhuff…    180    NA auburn, g… fair       blue            64   male  mascu…
10 Chewbac…    228   112 brown      unknown    blue           200   male  mascu…
# ℹ 59 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

filter() with Multiple Conditions

You can combine conditions using logical operators:

  • & for AND (both conditions must be true)
  • | for OR (at least one condition must be true)

Example: Filtering for characters with height > 150 AND mass > 50.

> starwars %>% filter(height > 150 & mass > 50)
# A tibble: 46 × 14
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
 1 Luke Sk…    172    77 blond      fair       blue            19   male  mascu…
 2 C-3PO       167    75 <NA>       gold       yellow         112   none  mascu…
 3 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
 4 Owen La…    178   120 brown, gr… light      blue            52   male  mascu…
 5 Beru Wh…    165    75 brown      light      blue            47   fema… femin…
 6 Biggs D…    183    84 black      light      brown           24   male  mascu…
 7 Obi-Wan…    182    77 auburn, w… fair       blue-gray       57   male  mascu…
 8 Anakin …    188    84 blond      fair       blue            41.9 male  mascu…
 9 Chewbac…    228   112 brown      unknown    blue           200   male  mascu…
10 Han Solo    180    80 brown      fair       brown           29   male  mascu…
# ℹ 36 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>
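
The OR operator is not shown above, so here is a small sketch on a hypothetical toy table (people is invented for this example and is not part of starwars). It also shows that separating conditions with a comma behaves like &.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration
people <- tibble(
  name   = c("A", "B", "C", "D"),
  height = c(140, 160, 180, NA),
  mass   = c(60, 45, 90, 40)
)

# OR: keep rows where at least one condition is TRUE
either <- people %>% filter(height > 150 | mass > 50)

# AND: a comma between conditions is equivalent to &
both_amp   <- people %>% filter(height > 150 & mass > 50)
both_comma <- people %>% filter(height > 150, mass > 50)

either$name      # "A" "B" "C"
both_amp$name    # "C"
both_comma$name  # "C"
```

Row D is absent from every result: its height is NA, so neither condition evaluates to TRUE and filter() drops it.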

arrange(): Reordering Rows

  • The arrange() function sorts rows by a specified column.
  • The first argument is the data frame, followed by the column(s) to sort by.

Example: Arranging rows by height in ascending order.

> starwars %>% arrange(height)
# A tibble: 87 × 14
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
 1 Yoda         66    17 white      green      brown            896 male  mascu…
 2 Ratts T…     79    15 none       grey, blue unknown           NA male  mascu…
 3 Wicket …     88    20 brown      brown      brown              8 male  mascu…
 4 Dud Bolt     94    45 none       blue, grey yellow            NA male  mascu…
 5 R2-D2        96    32 <NA>       white, bl… red               33 none  mascu…
 6 R4-P17       96    NA none       silver, r… red, blue         NA none  femin…
 7 R5-D4        97    32 <NA>       white, red red               NA none  mascu…
 8 Sebulba     112    40 none       grey, red  orange            NA male  mascu…
 9 Gasgano     122    NA none       white, bl… black             NA male  mascu…
10 Watto       137    NA black      blue, grey yellow            NA male  mascu…
# ℹ 77 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

arrange(): Descending Order

Use the desc() function inside arrange() to sort in descending order.

Example: Arranging rows by height in descending order.

> starwars %>% arrange(desc(height))
# A tibble: 87 × 14
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
 1 Yarael …    264    NA none       white      yellow          NA   male  mascu…
 2 Tarfful     234   136 brown      brown      blue            NA   male  mascu…
 3 Lama Su     229    88 none       grey       black           NA   male  mascu…
 4 Chewbac…    228   112 brown      unknown    blue           200   male  mascu…
 5 Roos Ta…    224    82 none       grey       orange          NA   male  mascu…
 6 Grievous    216   159 none       brown, wh… green, y…       NA   male  mascu…
 7 Taun We     213    NA none       grey       black           NA   fema… femin…
 8 Rugor N…    206    NA none       green      orange          NA   male  mascu…
 9 Tion Me…    206    80 none       grey       black           NA   male  mascu…
10 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
# ℹ 77 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>
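
arrange() also accepts several columns, using later ones to break ties in earlier ones. The sketch below uses a hypothetical toy table (toy is invented for illustration) and mixes ascending and descending order.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration
toy <- tibble(
  species = c("Human", "Droid", "Human", "Droid"),
  height  = c(172, 96, NA, 167)
)

# Sort by species (ascending), then by height (descending) within each species
sorted <- toy %>% arrange(species, desc(height))

sorted$species  # "Droid" "Droid" "Human" "Human"
sorted$height   # 167 96 172 NA
```

Note that the missing height ends up last within its species even though the sort is descending: arrange() always places NA values at the end.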

mutate(): Adding New Columns

  • The mutate() function adds new columns to your data frame.
  • The first argument is the data frame, followed by name = value pairs that define each new column.

Example: Creating a new column height_meters.

> starwars %>% mutate(height_meters = height / 100) %>% select(name, height, height_meters)
# A tibble: 87 × 3
   name               height height_meters
   <chr>               <int>         <dbl>
 1 Luke Skywalker        172          1.72
 2 C-3PO                 167          1.67
 3 R2-D2                  96          0.96
 4 Darth Vader           202          2.02
 5 Leia Organa           150          1.5 
 6 Owen Lars             178          1.78
 7 Beru Whitesun Lars    165          1.65
 8 R5-D4                  97          0.97
 9 Biggs Darklighter     183          1.83
10 Obi-Wan Kenobi        182          1.82
# ℹ 77 more rows

mutate() with a Calculation

You can use mutate() to create a new column based on a calculation involving existing columns. A column created earlier in the same mutate() call can be used right away by later expressions.

Example: Creating a bmi column.

> starwars %>% mutate(height_meters = height / 100, bmi = mass / (height_meters^2)) %>% select(name, bmi)
# A tibble: 87 × 2
   name                 bmi
   <chr>              <dbl>
 1 Luke Skywalker      26.0
 2 C-3PO               26.9
 3 R2-D2               34.7
 4 Darth Vader         33.3
 5 Leia Organa         21.8
 6 Owen Lars           37.9
 7 Beru Whitesun Lars  27.5
 8 R5-D4               34.0
 9 Biggs Darklighter   25.1
10 Obi-Wan Kenobi      23.2
# ℹ 77 more rows

summarize(): Condensing Data

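  • summarize() (or summarise()) collapses a data frame into a single row, or into one row per group when the data is grouped.
  • It’s useful for calculating summary statistics.
  • Pass na.rm = TRUE to functions such as mean() when the column contains missing values; otherwise the result is NA.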

Example: Finding the mean_height.

> starwars %>% summarize(mean_height = mean(height, na.rm = TRUE))
# A tibble: 1 × 1
  mean_height
        <dbl>
1        175.
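
To see why na.rm = TRUE matters, the following sketch computes the same mean with and without it. The heights table is made up for illustration.

```r
library(dplyr)

# Made-up values with one missing height
heights <- tibble(height = c(150, 170, NA))

# Without na.rm, a single NA makes the whole mean NA
with_na <- heights %>% summarize(mean_height = mean(height))

# With na.rm = TRUE, missing values are dropped before averaging
without_na <- heights %>% summarize(mean_height = mean(height, na.rm = TRUE))

with_na$mean_height     # NA
without_na$mean_height  # 160
```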

summarize() with Multiple Statistics

You can calculate multiple summary statistics at once.

Example: Calculating mean height and mass.

> starwars %>%
+   summarize(
+     mean_height = mean(height, na.rm = TRUE),
+     mean_mass = mean(mass, na.rm = TRUE)
+   )
# A tibble: 1 × 2
  mean_height mean_mass
        <dbl>     <dbl>
1        175.      97.3

group_by(): The Key to Grouped Summaries

  • The group_by() function doesn’t change the data immediately, but it sets up a special grouping structure.
  • Subsequent verbs such as summarize() and mutate() then operate on each group separately.

Example: Grouping the data by gender.

> starwars %>% group_by(gender)
# A tibble: 87 × 14
# Groups:   gender [3]
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
 1 Luke Sk…    172    77 blond      fair       blue            19   male  mascu…
 2 C-3PO       167    75 <NA>       gold       yellow         112   none  mascu…
 3 R2-D2        96    32 <NA>       white, bl… red             33   none  mascu…
 4 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
 5 Leia Or…    150    49 brown      light      brown           19   fema… femin…
 6 Owen La…    178   120 brown, gr… light      blue            52   male  mascu…
 7 Beru Wh…    165    75 brown      light      blue            47   fema… femin…
 8 R5-D4        97    32 <NA>       white, red red             NA   none  mascu…
 9 Biggs D…    183    84 black      light      brown           24   male  mascu…
10 Obi-Wan…    182    77 auburn, w… fair       blue-gray       57   male  mascu…
# ℹ 77 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>
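
One way to confirm that group_by() only attaches grouping information is to inspect the result directly. This is a small sketch; the object names grouped and ungrouped are made up for illustration.

```r
library(dplyr)

grouped <- starwars %>% group_by(gender)

# The rows and columns are untouched; only grouping metadata is added
same_rows <- nrow(grouped) == nrow(starwars)
same_cols <- identical(names(grouped), names(starwars))
vars      <- group_vars(grouped)

# ungroup() removes the grouping again
ungrouped <- grouped %>% ungroup()

same_rows                  # TRUE
same_cols                  # TRUE
vars                       # "gender"
is_grouped_df(ungrouped)   # FALSE
```

Calling ungroup() once the grouped work is done is a common habit, since it keeps later verbs from silently operating per group.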

group_by() + summarize()

  • This is one of the most powerful combinations in dplyr.
  • First, you group_by() a categorical variable, then you summarize() to get a summary statistic for each group.

Example: Finding the mean height for each gender.

> starwars %>% group_by(gender) %>% summarize(mean_height = mean(height, na.rm = TRUE))
# A tibble: 3 × 2
  gender    mean_height
  <chr>           <dbl>
1 feminine         167.
2 masculine        177.
3 <NA>             175 

group_by() with mutate()

  • When you use mutate() after group_by(), the new column is calculated within each group.
  • This is perfect for standardizing data or finding values relative to a group.

Example: Calculating a character’s mass as a percentage of their species’ average mass.

> starwars %>%
+   group_by(species) %>%
+   mutate(species_average_mass = mean(mass, na.rm = TRUE)) %>%
+   mutate(mass_proportion = (mass / species_average_mass) * 100) %>%
+   select(name, species, mass, species_average_mass, mass_proportion)
# A tibble: 87 × 5
# Groups:   species [38]
   name               species  mass species_average_mass mass_proportion
   <chr>              <chr>   <dbl>                <dbl>           <dbl>
 1 Luke Skywalker     Human      77                 81.3            94.7
 2 C-3PO              Droid      75                 69.8           108. 
 3 R2-D2              Droid      32                 69.8            45.9
 4 Darth Vader        Human     136                 81.3           167. 
 5 Leia Organa        Human      49                 81.3            60.3
 6 Owen Lars          Human     120                 81.3           148. 
 7 Beru Whitesun Lars Human      75                 81.3            92.2
 8 R5-D4              Droid      32                 69.8            45.9
 9 Biggs Darklighter  Human      84                 81.3           103. 
10 Obi-Wan Kenobi     Human      77                 81.3            94.7
# ℹ 77 more rows
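
The same pattern is easier to verify by hand on a tiny table. The sketch below uses hypothetical toy data (toy and mass_vs_group are invented for illustration) and expresses each mass as a difference from its group mean.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration
toy <- tibble(
  species = c("Human", "Human", "Droid", "Droid"),
  mass    = c(80, 60, 30, 70)
)

centered <- toy %>%
  group_by(species) %>%
  # mean(mass) is computed separately for Human (70) and Droid (50)
  mutate(mass_vs_group = mass - mean(mass)) %>%
  ungroup()

centered$mass_vs_group  # 10 -10 -20 20
```

Unlike summarize(), a grouped mutate() keeps every row; only the value that mean(mass) returns changes from group to group.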

Counting with n()

  • The n() function is a helper for use inside verbs such as summarize() and mutate().
  • It takes no arguments and returns the number of observations in the current group.

Example: Counting the number of characters for each gender and species.

> starwars %>% group_by(gender, species) %>% summarize(count = n())
# A tibble: 42 × 3
# Groups:   gender [3]
   gender    species    count
   <chr>     <chr>      <int>
 1 feminine  Clawdite       1
 2 feminine  Droid          1
 3 feminine  Human          9
 4 feminine  Kaminoan       1
 5 feminine  Mirialan       2
 6 feminine  Tholothian     1
 7 feminine  Togruta        1
 8 feminine  Twi'lek        1
 9 masculine Aleena         1
10 masculine Besalisk       1
# ℹ 32 more rows
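
dplyr also offers count() as a shortcut for this group-then-count pattern. The sketch below compares the two approaches on hypothetical toy data (toy is invented for illustration); .groups = "drop" is used so that the summarize() result comes back ungrouped.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration
toy <- tibble(
  gender  = c("feminine", "masculine", "masculine", "masculine"),
  species = c("Human", "Human", "Human", "Droid")
)

# Explicit version: group, then count rows per group with n()
via_summarize <- toy %>%
  group_by(gender, species) %>%
  summarize(count = n(), .groups = "drop")

# Shortcut: count() groups, counts, and ungroups in one step
via_count <- toy %>% count(gender, species, name = "count")

via_summarize$count  # 1 1 2
via_count$count      # 1 1 2
```

The rows appear in the order feminine/Human, masculine/Droid, masculine/Human, because both approaches sort the output by the grouping columns.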

Putting It All Together: A Chained Example

Let’s find the mean mass of all masculine human characters taller than 170 cm.

> starwars %>% filter(gender == "masculine" & species == "Human" & height > 170) %>%
+   summarize(mean_mass = mean(mass, na.rm = TRUE))
# A tibble: 1 × 1
  mean_mass
      <dbl>
1      87.0

Summary of dplyr Verbs

  • select(): Choose columns.
  • filter(): Choose rows.
  • arrange(): Reorder rows.
  • mutate(): Add new columns.
  • summarize(): Condense data to a single row, or to one row per group.
  • group_by(): Define groups for later verbs to operate on.