Is there a way to group data twice (i.e., by month and then by season) in R?

I am trying to answer this question:

Use the nycflights13 package and the flights data frame to answer the following questions: Which month had the highest proportion of canceled flights? Which month had the lowest? Interpret any seasonal patterns.

I have technically answered the question, but I am trying to produce a more concise summary than what I have now.

This is what I have so far:

#Load packages
library(nycflights13)
library(tidyverse)

#Data frame "cancprop" with three new variables ("canc" = flights that were canceled, "notc" = flights that were not canceled, and "canp" = proportion of all flights that were canceled)
cancprop <- flights %>%
  mutate(
    canc = is.na(dep_time),
    notc = !is.na(dep_time),
    canp = canc / (canc + notc)
  )

#A tibble showing the average proportion of all flights that were canceled by month sorted by descending average proportion.
cancprop %>%
  group_by(month) %>% 
  summarize(mcanp = mean(canp)) %>% 
  arrange(desc(mcanp))
# A tibble: 12 x 2
   month   mcanp
   <int>   <dbl>
 1     2 0.0505 
 2    12 0.0364 
 3     6 0.0357 
 4     7 0.0319 
 5     3 0.0299 
 6     4 0.0236 
 7     5 0.0196 
 8     1 0.0193 
 9     8 0.0166 
10     9 0.0164 
11    11 0.00854
12    10 0.00817

#Data frame "seas" with a new variable ("season" = the season corresponding with the month)
seas <- cancprop %>% 
  group_by(month) %>% 
  summarize(mcanp = mean(canp)) %>% 
  mutate(
    season = case_when(
      month %in% 3:5 ~ "Spring",
      month %in% 6:8 ~ "Summer",
      month %in% 9:11 ~ "Fall",
      TRUE ~ "Winter"
    ))
seas
# A tibble: 12 x 3
   month   mcanp season
   <int>   <dbl> <chr> 
 1     1 0.0193  Winter
 2     2 0.0505  Winter
 3     3 0.0299  Spring
 4     4 0.0236  Spring
 5     5 0.0196  Spring
 6     6 0.0357  Summer
 7     7 0.0319  Summer
 8     8 0.0166  Summer
 9     9 0.0164  Fall  
10    10 0.00817 Fall  
11    11 0.00854 Fall  
12    12 0.0364  Winter

#A plot showing the proportion of flights canceled
ggplot(seas, aes(x = factor(month), y = mcanp, fill = season)) +
  geom_bar(stat = "identity") +
  labs(x = "Month", y = "Proportion of Flights Canceled", fill = "Season")

What I want to create is a table showing the average proportion of flights canceled per season, like this one (with made-up, uncalculated proportions, since I am not sure how to actually get the results):

# A tibble: 4 x 2
       season   mcanp
        <chr>   <dbl> 
 1     Winter  0.0433
 2     Spring  0.0235
 3     Summer  0.0109
 4     Fall    0.0246

Any help is appreciated, thanks!

r dplyr

Taylor Lee 29.02.2020

comment

I think you need seas %>% group_by(season) %>% summarise(mcanp = mean(mcanp))? - Ronak Shah 29.02.2020

comment

That works well enough in this case, but it is not quite what I am looking for, since it takes the mean of the monthly means within each season rather than the seasonal mean. - Taylor Lee 29.02.2020

comment

Using seas %>% group_by(season) %>% summarise(mcanp = mean(mcanp)) gives 1 Winter 0.0354, 2 Summer 0.0281, 3 Spring 0.0243, 4 Fall 0.0110, whereas the answer I am looking for is 1 Winter 0.0350, 2 Summer 0.0280, 3 Spring 0.0243, 4 Fall 0.0110. - Taylor Lee 29.02.2020
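The gap between the two sets of numbers comes from group sizes: averaging monthly means weights every month equally, while the seasonal mean weights every flight equally. A minimal sketch with made-up data (the `toy` data frame and its values are hypothetical, not taken from nycflights13) shows the effect using base R only:

```r
# Hypothetical toy data: one "season" made of two months of very different sizes
toy <- data.frame(
  month     = c(rep(1, 4), rep(2, 16)),
  cancelled = c(TRUE, TRUE, FALSE, FALSE, rep(FALSE, 16))
)

# Proportion cancelled within each month (month 1: 2 of 4, month 2: 0 of 16)
monthly <- tapply(toy$cancelled, toy$month, mean)

# Unweighted mean of the monthly proportions: each month counts equally
mean_of_monthly_means <- mean(monthly)

# Proportion over all rows: each flight counts equally
overall_mean <- mean(toy$cancelled)

# Weighting the monthly proportions by month size recovers the overall proportion
weighted <- weighted.mean(monthly, as.numeric(table(toy$month)))

print(c(mean_of_monthly_means, overall_mean, weighted))
```

The two quantities agree only when every month has the same number of rows, which is not the case for the flights data.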

Answers (2)

If I understand correctly, you want the proportion of cancellations by season. If so, you have already done most of the work yourself. Do not group_by month and then season in sequence: as your comment correctly points out, that computes the mean of the monthly cancellation proportions within each season. Instead, create the season variable on the ungrouped data frame inside mutate, then group by season alone.

cancprop <- flights %>%
  mutate(
    canc = is.na(dep_time),
    notc = !is.na(dep_time),
    canp = canc / (canc + notc),
    season = case_when(
      month %in% 3:5 ~ "Spring",
      month %in% 6:8 ~ "Summer",
      month %in% 9:11 ~ "Fall",
      TRUE ~ "Winter"
    )
  )

cancprop %>%
  group_by(season) %>%
  summarize(mcanp = mean(canp)) %>%
  arrange(desc(mcanp))
# A tibble: 4 x 2
  season  mcanp
  <chr>   <dbl>
1 Winter 0.0350
2 Summer 0.0280
3 Spring 0.0243
4 Fall   0.0110

This is the proportion of cancellations by season, in descending order.
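If you do want to group twice, one possible approach is to carry counts rather than proportions through the first level, then pool them at the second level. The sketch below assumes dplyr 1.0.0 or later (for the `.groups` argument) and keeps the question's definition of a canceled flight as a missing `dep_time`; the names `by_season`, `n_canc`, and `n` are illustrative choices, not part of the original code.

```r
library(nycflights13)
library(dplyr)

by_season <- flights %>%
  mutate(
    season = case_when(
      month %in% 3:5  ~ "Spring",
      month %in% 6:8  ~ "Summer",
      month %in% 9:11 ~ "Fall",
      TRUE            ~ "Winter"
    )
  ) %>%
  # First level: cancellation and flight counts per month within each season
  group_by(season, month) %>%
  summarize(
    n_canc = sum(is.na(dep_time)),
    n      = n(),
    .groups = "drop_last"   # drop month, stay grouped by season
  ) %>%
  # Second level: pool the counts, so every flight carries equal weight
  summarize(mcanp = sum(n_canc) / sum(n)) %>%
  arrange(desc(mcanp))

by_season
```

Because the second summarize divides pooled counts, it should reproduce the seasonal proportions above rather than the mean of monthly means.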

Thomas Bilach 29.02.2020

I figured it out: I needed to start from the whole data frame rather than from the one grouped by month.

library(nycflights13)
library(tidyverse)

cancprop <- flights %>%
  mutate(
    canc = is.na(dep_time),
    notc = !is.na(dep_time),
    canp = canc / (canc + notc),
    season = case_when(
      month %in% 3:5 ~ "Spring",
      month %in% 6:8 ~ "Summer",
      month %in% 9:11 ~ "Fall",
      TRUE ~ "Winter"
    )
  )
cancprop
# A tibble: 336,776 x 23
    year month   day dep_time sched_dep_time
   <int> <int> <int>    <int>          <int>
 1  2013     1     1      517            515
 2  2013     1     1      533            529
 3  2013     1     1      542            540
 4  2013     1     1      544            545
 5  2013     1     1      554            600
 6  2013     1     1      554            558
 7  2013     1     1      555            600
 8  2013     1     1      557            600
 9  2013     1     1      557            600
10  2013     1     1      558            600
# ... with 336,766 more rows, and 18 more
#   variables: dep_delay <dbl>, arr_time <int>,
#   sched_arr_time <int>, arr_delay <dbl>,
#   carrier <chr>, flight <int>, tailnum <chr>,
#   origin <chr>, dest <chr>, air_time <dbl>,
#   distance <dbl>, hour <dbl>, minute <dbl>,
#   time_hour <dttm>, canc <lgl>, notc <lgl>,
#   canp <dbl>, season <chr>

mcp <- cancprop %>%
  group_by(month, season) %>%
  summarize(mcanp = mean(canp)) %>%
  arrange(desc(mcanp))
mcp
# A tibble: 12 x 3
# Groups:   month [12]
   month season   mcanp
   <int> <chr>    <dbl>
 1     2 Winter 0.0505 
 2    12 Winter 0.0364 
 3     6 Summer 0.0357 
 4     7 Summer 0.0319 
 5     3 Spring 0.0299 
 6     4 Spring 0.0236 
 7     5 Spring 0.0196 
 8     1 Winter 0.0193 
 9     8 Summer 0.0166 
10     9 Fall   0.0164 
11    11 Fall   0.00854
12    10 Fall   0.00817

ggplot(mcp, aes(x = factor(month), y = mcanp, fill = season)) +
  geom_bar(stat = "identity") +
  labs(x = "Month", y = "Proportion of Flights Canceled", fill = "Season")
# February had the highest proportion of canceled flights and October had the lowest.

scp <- cancprop %>%
  group_by(season) %>%
  summarize(mcanp = mean(canp)) %>%
  arrange(desc(mcanp))
scp
# A tibble: 4 x 2
  season  mcanp
  <chr>   <dbl>
1 Winter 0.0350
2 Summer 0.0280
3 Spring 0.0243
4 Fall   0.0110

ggplot(scp, aes(x = factor(season), y = mcanp, fill = season)) +
  geom_bar(stat = "identity") +
  labs(x = "Season", y = "Proportion of Flights Canceled", fill = "Season")
# Winter had the highest proportion of canceled flights and Fall had the lowest.
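Since the original goal was something more concise, here is a possible shorter variant. It relies on the fact that the mean of a logical vector is the share of TRUE values, so the helper columns `notc` and `canp` are arguably not needed. The name `season_canc` is illustrative, and a canceled flight is still assumed to be one with a missing `dep_time`, as in the code above.

```r
library(nycflights13)
library(dplyr)

season_canc <- flights %>%
  mutate(
    season = case_when(
      month %in% 3:5  ~ "Spring",
      month %in% 6:8  ~ "Summer",
      month %in% 9:11 ~ "Fall",
      TRUE            ~ "Winter"
    )
  ) %>%
  group_by(season) %>%
  # mean() of a logical vector gives the proportion of TRUE values,
  # here the share of flights with no recorded departure time
  summarize(mcanp = mean(is.na(dep_time))) %>%
  arrange(desc(mcanp))

season_canc
```

Replacing `group_by(season)` with `group_by(month)` in the same pipeline should give the monthly table.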

Taylor Lee 29.02.2020

comment

Looks like you figured it out yourself while I was writing my answer. Double grouping was not necessary to answer your question. Good job working it out on your own! - Thomas Bilach 29.02.2020
