Statistical Analysis | KWANGSIK LEE's log - Part 2

R for Developers: Lesson 8, tidyverse

December 15, 2017 | Category: R | Tags: machine learning, R, developer, statistical analysis

Overview

This lesson covers data manipulation in R that is practical enough for real-world work.

Setup

Install and load the sample data package and tidyverse, in the order shown below.

Loading the sample data

> install.packages("nycflights13")
> library(nycflights13)
> flights
# A tibble: 336,776 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin  dest air_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl>   <chr>  <int>   <chr>  <chr> <chr>    <dbl>
 1  2013     1     1      517            515         2      830            819        11      UA   1545  N14228    EWR   IAH      227
 2  2013     1     1      533            529         4      850            830        20      UA   1714  N24211    LGA   IAH      227
 3  2013     1     1      542            540         2      923            850        33      AA   1141  N619AA    JFK   MIA      160
 4  2013     1     1      544            545        -1     1004           1022       -18      B6    725  N804JB    JFK   BQN      183
 5  2013     1     1      554            600        -6      812            837       -25      DL    461  N668DN    LGA   ATL      116
 6  2013     1     1      554            558        -4      740            728        12      UA   1696  N39463    EWR   ORD      150
 7  2013     1     1      555            600        -5      913            854        19      B6    507  N516JB    EWR   FLL      158
 8  2013     1     1      557            600        -3      709            723       -14      EV   5708  N829AS    LGA   IAD       53
 9  2013     1     1      557            600        -3      838            846        -8      B6     79  N593JB    JFK   MCO      140
10  2013     1     1      558            600        -2      753            745         8      AA    301  N3ALAA    LGA   ORD      138
# ... with 336,766 more rows, and 4 more variables: distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Installing tidyverse

> install.packages("tidyverse")
> library(tidyverse)

Viewing the dataset

> View(flights)

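Key functions

mutate()
- Adds new variables computed as functions of existing variables
select()
- Picks columns, like SELECT in SQL
filter()
- Builds a subset of rows that match the given conditions
summarise()
- Similar to apply, but reduces multiple values to a single summary value
arrange()
- Sorts rows, like sorting in Excel
  -> All of the functions above can be used together with group_by()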
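As a rough illustration of the note on group_by(), the sketch below applies mutate() and filter() to grouped data. The toy data frame and its values are made up for this example and are not part of nycflights13.

```r
library(dplyr)

# Hypothetical toy data: two carriers with a few delay values each.
toy <- data.frame(
  carrier = c("AA", "AA", "UA", "UA", "UA"),
  delay   = c(10, 30, 6, 15, 42)
)

by_carrier <- group_by(toy, carrier)

# Grouped mutate(): mean() is computed separately within each carrier.
with_mean <- ungroup(mutate(by_carrier, carrier_mean = mean(delay)))

# Grouped filter(): max() is also per carrier, so this keeps
# each carrier's largest delay.
worst <- ungroup(filter(by_carrier, delay == max(delay)))
```

Without group_by(), both calls would use the mean and maximum of the whole column instead.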

filter

> filter(flights, month == 1, day == 1) # one or more conditions can be given; they are combined with AND
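The comma-separated form above covers AND conditions only. A minimal sketch of the other common cases (OR, set membership, and how NA is treated) follows, using a small made-up data frame rather than flights.

```r
library(dplyr)

# Hypothetical toy data; the last row has a missing month.
toy <- data.frame(
  month = c(1, 1, 2, 12, NA),
  day   = c(1, 2, 1, 25, 3)
)

# Comma-separated conditions are combined with AND.
both <- filter(toy, month == 1, day == 1)

# Use | for OR.
either <- filter(toy, month == 1 | day == 1)

# Use %in% to match against a set of values.
winter <- filter(toy, month %in% c(12, 1, 2))

# filter() keeps only rows where the condition is TRUE,
# so the row whose condition evaluates to NA is dropped in all three results.
```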

arrange

> arrange(flights, year, month, day) # sort by year, then month, then day
# A tibble: 336,776 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin  dest air_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl>   <chr>  <int>   <chr>  <chr> <chr>    <dbl>
 1  2013     1     1      517            515         2      830            819        11      UA   1545  N14228    EWR   IAH      227
 2  2013     1     1      533            529         4      850            830        20      UA   1714  N24211    LGA   IAH      227
 3  2013     1     1      542            540         2      923            850        33      AA   1141  N619AA    JFK   MIA      160
 4  2013     1     1      544            545        -1     1004           1022       -18      B6    725  N804JB    JFK   BQN      183
 5  2013     1     1      554            600        -6      812            837       -25      DL    461  N668DN    LGA   ATL      116
 6  2013     1     1      554            558        -4      740            728        12      UA   1696  N39463    EWR   ORD      150
 7  2013     1     1      555            600        -5      913            854        19      B6    507  N516JB    EWR   FLL      158
 8  2013     1     1      557            600        -3      709            723       -14      EV   5708  N829AS    LGA   IAD       53
 9  2013     1     1      557            600        -3      838            846        -8      B6     79  N593JB    JFK   MCO      140
10  2013     1     1      558            600        -2      753            745         8      AA    301  N3ALAA    LGA   ORD      138
# ... with 336,766 more rows, and 4 more variables: distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

To sort in descending order, wrap the column in desc() as follows.

> arrange(flights, desc(arr_delay))
# A tibble: 336,776 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin  dest air_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl>   <chr>  <int>   <chr>  <chr> <chr>    <dbl>
 1  2013     1     9      641            900      1301     1242           1530      1272      HA     51  N384HA    JFK   HNL      640
 2  2013     6    15     1432           1935      1137     1607           2120      1127      MQ   3535  N504MQ    JFK   CMH       74
 3  2013     1    10     1121           1635      1126     1239           1810      1109      MQ   3695  N517MQ    EWR   ORD      111
 4  2013     9    20     1139           1845      1014     1457           2210      1007      AA    177  N338AA    JFK   SFO      354
 5  2013     7    22      845           1600      1005     1044           1815       989      MQ   3075  N665MQ    JFK   CVG       96
 6  2013     4    10     1100           1900       960     1342           2211       931      DL   2391  N959DL    JFK   TPA      139
 7  2013     3    17     2321            810       911      135           1020       915      DL   2119  N927DA    LGA   MSP      167
 8  2013     7    22     2257            759       898      121           1026       895      DL   2047  N6716C    LGA   ATL      109
 9  2013    12     5      756           1700       896     1058           2020       878      AA    172  N5DMAA    EWR   MIA      149
10  2013     5     3     1133           2055       878     1250           2215       875      MQ   3744  N523MQ    EWR   ORD      112
# ... with 336,766 more rows, and 4 more variables: distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
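Ascending and descending keys can be mixed in one call. The sketch below uses a made-up data frame to show this, and also that missing values end up last even under desc().

```r
library(dplyr)

# Hypothetical toy data with one missing arrival delay.
toy <- data.frame(
  carrier   = c("UA", "AA", "UA", "AA"),
  arr_delay = c(5, NA, 20, -3)
)

# Sort by carrier ascending, then by arr_delay descending within each carrier.
sorted <- arrange(toy, carrier, desc(arr_delay))

# arrange() always places NA at the end of its sort key,
# so the AA row with NA comes after the AA row with -3.
```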

select

> select(flights, year, month, day, arr_delay)
# A tibble: 336,776 x 4
    year month   day arr_delay
   <int> <int> <int>     <dbl>
 1  2013     1     1        11
 2  2013     1     1        20
 3  2013     1     1        33
 4  2013     1     1       -18
 5  2013     1     1       -25
 6  2013     1     1        12
 7  2013     1     1        19
 8  2013     1     1       -14
 9  2013     1     1        -8
10  2013     1     1         8
# ... with 336,766 more rows

In practice, you would likely combine these functions as follows.

> select(arrange(flights, desc(arr_delay)), year, month, day, arr_delay)
# A tibble: 336,776 x 4
    year month   day arr_delay
   <int> <int> <int>     <dbl>
 1  2013     1     9      1272
 2  2013     6    15      1127
 3  2013     1    10      1109
 4  2013     9    20      1007
 5  2013     7    22       989
 6  2013     4    10       931
 7  2013     3    17       915
 8  2013     7    22       895
 9  2013    12     5       878
10  2013     5     3       875
# ... with 336,766 more rows
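Besides listing columns by name, select() accepts helper functions, ranges, and exclusions. The following is a small sketch on a made-up one-row data frame whose column names imitate flights.

```r
library(dplyr)

# Hypothetical one-row data frame with flights-like column names.
toy <- data.frame(
  year = 2013, month = 1, day = 1,
  dep_delay = 2, arr_delay = 11, carrier = "UA"
)

# Columns whose names end with "delay".
delays_only <- select(toy, ends_with("delay"))

# Everything except the range year through day.
no_date <- select(toy, -(year:day))

# Select and rename in one step (new_name = old_name).
renamed <- select(toy, carrier, delay = arr_delay)
```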

mutate

> flights_sml <- select(flights,
+                          year:day, # the colon selects the range of columns from year through day
+                          ends_with("delay"), # select columns whose names end with "delay"
+                          distance,
+                          air_time
+                          )
> mutate(flights_sml,
+             gain = arr_delay - dep_delay,
+             speed = distance / air_time * 60
+             ) # creates the new variables gain and speed and appends them to the existing data
# A tibble: 336,776 x 9
    year month   day dep_delay arr_delay distance air_time  gain    speed
   <int> <int> <int>     <dbl>     <dbl>    <dbl>    <dbl> <dbl>    <dbl>
 1  2013     1     1         2        11     1400      227     9 370.0441
 2  2013     1     1         4        20     1416      227    16 374.2731
 3  2013     1     1         2        33     1089      160    31 408.3750
 4  2013     1     1        -1       -18     1576      183   -17 516.7213
 5  2013     1     1        -6       -25      762      116   -19 394.1379
 6  2013     1     1        -4        12      719      150    16 287.6000
 7  2013     1     1        -5        19     1065      158    24 404.4304
 8  2013     1     1        -3       -14      229       53   -11 259.2453
 9  2013     1     1        -3        -8      944      140    -5 404.5714
10  2013     1     1        -2         8      733      138    10 318.6957
# ... with 336,766 more rows
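Within a single mutate() call, a later expression can refer to a column created earlier in the same call. The sketch below recomputes speed that way, using two distance and air_time pairs taken from the output above; the intermediate column name hours is made up for this example. In nycflights13, distance is in miles and air_time in minutes, so speed is in miles per hour.

```r
library(dplyr)

# Two (distance, air_time) pairs taken from rows 1 and 5 of the output above.
toy <- data.frame(
  distance = c(1400, 762),
  air_time = c(227, 116)
)

out <- mutate(toy,
  hours = air_time / 60,     # minutes converted to hours
  speed = distance / hours   # uses the hours column created just above
)
```

The result agrees with distance / air_time * 60 from the original call.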

summarise

> summarise(flights, delay = mean(dep_delay, na.rm = TRUE)) # mean of dep_delay with NA values removed, stored as delay
# A tibble: 1 x 1
     delay
     <dbl>
1 12.63907

> by_day <- group_by(flights, year, month, day) # group by year, month, day
> summarise(by_day, delay = mean(dep_delay, na.rm = TRUE)) # summarise each group separately
# A tibble: 365 x 4
# Groups:   year, month [?]
    year month   day     delay
   <int> <int> <int>     <dbl>
 1  2013     1     1 11.548926
 2  2013     1     2 13.858824
 3  2013     1     3 10.987832
 4  2013     1     4  8.951595
 5  2013     1     5  5.732218
 6  2013     1     6  7.148014
 7  2013     1     7  5.417204
 8  2013     1     8  2.553073
 9  2013     1     9  2.276477
10  2013     1    10  2.844995
# ... with 355 more rows
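To see why na.rm = TRUE matters in these summaries, here is a minimal sketch on a made-up data frame in which one group contains a missing value.

```r
library(dplyr)

# Hypothetical toy data: ATL has one missing arrival delay.
toy <- data.frame(
  dest      = c("ATL", "ATL", "ORD", "ORD"),
  arr_delay = c(10, NA, 4, 8)
)

by_dest <- group_by(toy, dest)

res <- summarise(by_dest,
  count      = n(),                             # rows per group, NA rows included
  with_na    = mean(arr_delay),                 # NA if any value in the group is NA
  without_na = mean(arr_delay, na.rm = TRUE)    # NA values dropped before averaging
)
```

For ATL, with_na is NA while without_na is 10. Note that n() still counts the row with the missing value.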

Skipping intermediate steps

The processing below takes three separate commands, but it can be done in a single command.

> by_dest <- group_by(flights, dest)
> delay <- summarise(by_dest,
+                      count = n(),
+                      dist = mean(distance, na.rm = TRUE),
+                      delay = mean(arr_delay, na.rm = TRUE)
+                      )
> delay <- filter(delay, count > 20, dest != "HNL")

%>% is the chain (pipe) operator: it passes the result of one step to the next step as its first argument.

> delays <- flights %>%
+     group_by(dest) %>%
+     summarise(
+       count = n(),
+       dist = mean(distance, na.rm = TRUE),
+       delay = mean(arr_delay, na.rm = TRUE)
+       ) %>%
+     filter(count > 20, dest != "HNL")
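Put differently, x %>% f(y) is evaluated as f(x, y). The following sketch, on a made-up data frame, writes the same computation once as nested calls and once as a pipeline, then compares the two.

```r
library(dplyr)

# Hypothetical toy data with three flights to two destinations.
toy <- data.frame(
  dest      = c("ATL", "ATL", "HNL"),
  arr_delay = c(10, 20, 5)
)

# Nested form: has to be read from the inside out.
nested <- filter(
  summarise(group_by(toy, dest), delay = mean(arr_delay)),
  dest != "HNL"
)

# Piped form: reads top to bottom in the order the steps run.
piped <- toy %>%
  group_by(dest) %>%
  summarise(delay = mean(arr_delay)) %>%
  filter(dest != "HNL")

# Both forms produce the same result.
same <- isTRUE(all.equal(nested, piped))
```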