How to Manage and Process Big Datasets in R - an Example with eBird Data
Journal Club - March 23rd
Who is this guy?
What are Big Data?
"Big data primarily refers to data sets that are too large or complex to be dealt with by traditional data-processing application software."
Wikipedia
Remote sensing data
Logger data
Citizen science projects
Have you ever tried to load an 8 GB .csv/.txt file into R?
Why though?
"The tools you learn in this book will easily handle (...) 1-2 Gb of data. If you’re routinely working with larger data (10-100 Gb, say), you should learn *something else*"
R for Data Science (2nd edition) (Wickham & Grolemund)
Apache Arrow
Tidyverse + Arrow
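A minimal sketch of how the two fit together, assuming the arrow and dplyr packages are installed. The toy table and its column names (`observer`, `year`) are made up for illustration and are not eBird data. dplyr verbs applied to an Arrow object only describe a query; the work happens when `collect()` is called.

```r
library(arrow)
library(dplyr)

# Toy stand-in for a large table (hypothetical values, not eBird data)
toy <- arrow_table(
  observer = c("obs1", "obs1", "obs2", "obs3"),
  year     = c(2019L, 2020L, 2020L, 2020L)
)

# dplyr verbs on an Arrow object only build a query; nothing is computed yet
query <- toy %>%
  filter(year == 2020L) %>%
  summarise(n_observers = n_distinct(observer))

class(query)  # a lazy query object rather than a data frame

# collect() runs the query in Arrow and returns a regular tibble
result <- collect(query)
result
```

Only the small summarised result is pulled into R's memory, which is why this pattern suits files larger than RAM.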
How did COVID-19 lockdowns affect birding in Portugal?
WTF is eBird?
eBird Basic Dataset (EBD) (PT)
Install Apache Arrow
install.packages("arrow")
Inspect the dataset
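One way to look inside a large tab-delimited export without loading it, sketched on a tiny toy file. The file name `ebd_toy.txt` and its three rows are hypothetical stand-ins for the real EBD download; the column names are the ones used later in these slides. The object name `data_export` matches the one used in the conversion step.

```r
library(arrow)
library(dplyr)

# Toy tab-delimited file standing in for the EBD download (hypothetical rows)
ebd_path <- file.path(tempdir(), "ebd_toy.txt")
toy <- data.frame(
  "SAMPLING EVENT IDENTIFIER" = c("S1", "S1", "S2"),
  "OBSERVER ID"               = c("obs1", "obs1", "obs2"),
  "OBSERVATION DATE"          = c("2019-05-01", "2019-05-01", "2020-04-12"),
  check.names = FALSE
)
write.table(toy, ebd_path, sep = "\t", quote = FALSE, row.names = FALSE)

# open_dataset() scans the file lazily instead of reading it into memory
data_export <- open_dataset(ebd_path, format = "tsv")

names(data_export)                      # column names
data_export$schema                      # column types inferred by Arrow
data_export %>% head(2) %>% collect()   # pull only the first rows into R
```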
Convert to Parquet format
arrow::write_dataset(data_export, "ebird_parquet", format = "parquet")
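A quick way to check that the conversion worked, sketched on a toy data frame. The output folder `ebird_parquet_toy` and the rows are hypothetical; with the real export, the same check would compare the Parquet dataset against the original source.

```r
library(arrow)

# Toy data frame standing in for the eBird export (hypothetical rows)
toy <- data.frame(
  "SAMPLING EVENT IDENTIFIER" = c("S1", "S1", "S2"),
  "OBSERVER ID"               = c("obs1", "obs1", "obs2"),
  check.names = FALSE
)

# Write the data as a Parquet dataset (a folder of Parquet files)
out_dir <- file.path(tempdir(), "ebird_parquet_toy")
write_dataset(toy, out_dir, format = "parquet")

# The folder should now contain at least one .parquet file
parquet_files <- list.files(out_dir, pattern = "\\.parquet$", recursive = TRUE)
parquet_files

# Re-open the dataset and confirm that no rows were lost
roundtrip <- open_dataset(out_dir)
nrow(roundtrip) == nrow(toy)
```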
Import Parquet dataset
library(arrow)
library(dplyr)
data <- open_dataset("ebird_parquet")
data <- data %>%
  mutate(year = lubridate::year(`OBSERVATION DATE`))
How did the number of eBird checklists change between 2008 and 2022? Were there fewer eBird lists in 2020 and in 2021?
Lockdown stringency index in Portugal (2020-2022)
eBird checklists between 2008 and 2022
data %>%
  select(year, `SAMPLING EVENT IDENTIFIER`) %>%
  group_by(year) %>%
  summarise(n_list = n_distinct(`SAMPLING EVENT IDENTIFIER`)) %>%
  filter(year >= 2008) %>%
  collect() %>%
  ggplot(aes(y = n_list, x = year)) +
  geom_point(size = 3) +
  labs(y = "Number of eBird lists",
       x = "Year",
       title = "Number of eBird lists submitted between 2008 and 2022 in Portugal",
       caption = "Source: eBird basic dataset (PT)") +
  theme_bw() +
  theme(text = element_text(size = 20, family = "roboto condensed"))
Calculate the number of eBird checklists between 2008 and 2022 and make a plot
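To put a number on "fewer lists in 2020 and 2021", one option is the year-over-year percentage change. This is a sketch on made-up counts (not real eBird results); with the real data, the same `mutate()` would be applied to the collected per-year summary. The names `lists_per_year` and `pct_change` are hypothetical.

```r
library(dplyr)

# Hypothetical yearly counts, for illustration only (not real eBird results)
lists_per_year <- tibble(
  year   = 2018:2021,
  n_list = c(100, 120, 90, 135)
)

# Percentage change relative to the previous year; the first year has no
# previous value, so its change is NA
yearly_change <- lists_per_year %>%
  arrange(year) %>%
  mutate(pct_change = 100 * (n_list - lag(n_list)) / lag(n_list))

yearly_change
```

A negative `pct_change` marks a year with fewer checklists than the year before.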
Ok, but what about the number of active birders?
Number of active birders between 2008 and 2022
data %>%
  select(year, `OBSERVER ID`) %>%
  group_by(year) %>%
  # count distinct observers per year (not checklists)
  summarise(n_birders = n_distinct(`OBSERVER ID`)) %>%
  filter(year >= 2008) %>%
  collect() %>%
  ggplot(aes(y = n_birders, x = year)) +
  geom_point(size = 3) +
  labs(y = "Number of active birders",
       x = "Year",
       title = "Number of active birders on eBird between 2008 and 2022",
       caption = "Source: eBird basic dataset") +
  theme_bw() +
  theme(text = element_text(size = 20, family = "roboto condensed"))
Number of active birders between 2008 and 2022
Scientific conclusions
Technical conclusion (I)
Technical conclusion (II)
Thank God it's over