This activity will focus on doing basic positional and filtering joins using some built-in data sets in R
to prepare you to be able to do spatial joins later in the workshop using georeferenced data sources. Data joins commonly fit within a larger work-flow in tidyverse—here are the slides & narrative for review.
For this activity, you will need the following packages:
library( knitr )
library( tidyverse )
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
✓ ggplot2 3.3.3 ✓ purrr 0.3.4
✓ tibble 3.1.2 ✓ dplyr 1.0.6
✓ tidyr 1.1.3 ✓ stringr 1.4.0
✓ readr 1.4.0 ✓ forcats 0.5.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
x dplyr::filter() masks stats::filter()
x dplyr::lag() masks stats::lag()
library( kableExtra )
Attaching package: 'kableExtra'
The following object is masked from 'package:dplyr':
group_rows
library( nycflights13 )
and for reference purposes, here are links to both the slides and the narrative for this topic.
flights %>%
mutate( no_flight = is.na(air_time) ) %>%
select( no_flight, carrier ) %>%
group_by( carrier ) %>%
summarize( Canceled = sum( no_flight == TRUE ),
Good = sum( no_flight == FALSE )) %>%
mutate( `Cancel Rate` = Canceled / Good ) %>%
left_join( airlines, by="carrier") %>%
select( Carrier = name,
`Cancel Rate`) %>%
arrange( -`Cancel Rate`) %>%
kable( digits = 3,
caption = "The frequency of flight cancelations by airline.") %>%
kable_styling( full_width=TRUE )
Carrier | Cancel Rate |
---|---|
Mesa Airlines Inc. | 0.105 |
SkyWest Airlines Inc. | 0.103 |
Endeavor Air Inc. | 0.067 |
ExpressJet Airlines Inc. | 0.060 |
Envoy Air | 0.054 |
US Airways Inc. | 0.036 |
AirTran Airways Corporation | 0.027 |
American Airlines Inc. | 0.024 |
Southwest Airlines Co. | 0.019 |
United Air Lines Inc. | 0.015 |
JetBlue Airways | 0.011 |
Delta Air Lines Inc. | 0.009 |
Virgin America | 0.009 |
Alaska Airlines Inc. | 0.007 |
Frontier Airlines Inc. | 0.006 |
Hawaiian Airlines Inc. | 0.000 |
flights %>%
group_by( tailnum ) %>%
summarize( AveDelay = mean( arr_delay, na.rm=TRUE) ) %>%
left_join( planes, by="tailnum" ) %>%
select( seats, AveDelay ) %>%
group_by( seats ) %>%
summarize( delay = mean( AveDelay, na.rm=TRUE) ) %>%
mutate( seats = as.numeric( seats ) ) %>%
filter( !is.na(delay),
!is.na(seats) ) -> df
summary( df )
seats delay
Min. : 2.0 Min. :-10.2500
1st Qu.: 19.0 1st Qu.: 0.5252
Median :146.0 Median : 4.6666
Mean :156.9 Mean : 8.6665
3rd Qu.:256.2 3rd Qu.: 10.4785
Max. :450.0 Max. :120.0000
df %>%
ggplot( aes(seats, delay) ) +
geom_point() +
stat_smooth(method="lm", formula="y ~ x")
fit.df <- lm( delay ~ seats, data=df)
summary( fit.df )
Call:
lm(formula = delay ~ seats, data = df)
Residuals:
Min 1Q Median 3Q Max
-27.862 -9.235 -2.199 5.170 95.362
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.11540 4.08539 0.028 0.978
seats 0.05449 0.02029 2.686 0.010 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 17.74 on 46 degrees of freedom
Multiple R-squared: 0.1356, Adjusted R-squared: 0.1168
F-statistic: 7.215 on 1 and 46 DF, p-value: 0.01002
flights %>%
left_join(weather) %>%
select( -(year:sched_dep_time),
-(arr_time:time_hour) ) %>%
filter( !is.na(dep_delay) ) %>%
gather( feature, value, temp:visib, factor_key = TRUE ) %>%
mutate( delayed = dep_delay > 30 ) -> df
Joining, by = c("year", "month", "day", "origin", "hour", "time_hour")
df %>%
group_by( feature ) %>%
summarize( on_time = mean( value[ df$delayed==FALSE], na.rm=TRUE),
delayed = mean( value[ df$delayed==TRUE], na.rm=TRUE),
t_test = t.test(value[ df$delayed==FALSE],
value[ df$delayed==TRUE])$statistic,
P = t.test(value[ df$delayed==FALSE],
value[ df$delayed==TRUE])$p.value ) %>%
kable( digits=3,
caption = "Relationship between environmenal features measured for flights that had ") %>%
kable_styling( full_width = TRUE )
feature | on_time | delayed | t_test | P |
---|---|---|---|---|
temp | 56.642 | 59.272 | -28.757 | 0 |
dewp | 40.807 | 45.692 | -49.755 | 0 |
humid | 58.359 | 64.284 | -58.578 | 0 |
wind_dir | 202.420 | 198.768 | 7.313 | 0 |
wind_speed | 10.946 | 11.734 | -28.076 | 0 |
wind_gust | 25.030 | 25.756 | -11.824 | 0 |
precip | 0.003 | 0.010 | -29.477 | 0 |
pressure | 1018.212 | 1015.851 | 58.490 | 0 |
visib | 9.356 | 8.893 | 39.489 | 0 |
flights %>%
filter( !is.na(air_time) ) %>%
group_by(carrier, tailnum ) %>%
summarize( flights = length( distance ), .groups="keep" ) %>%
filter( flights >= 100 ) %>%
left_join(airlines, by="carrier") %>%
group_by( carrier ) %>%
summarize( Planes = length(tailnum) ) %>%
left_join( airlines, by="carrier" ) %>%
arrange( -Planes ) %>%
select( Carrier = name, `Number of Flights` = Planes) %>%
knitr::kable( caption="Number of planes by carrier that have over 100 flights in 2013.") %>%
kableExtra::kable_styling(full_width = FALSE)
Carrier | Number of Flights |
---|---|
United Air Lines Inc. | 317 |
ExpressJet Airlines Inc. | 270 |
Delta Air Lines Inc. | 209 |
JetBlue Airways | 187 |
Envoy Air | 80 |
US Airways Inc. | 47 |
Endeavor Air Inc. | 41 |
American Airlines Inc. | 28 |
Virgin America | 26 |