How to Manage and Process Big Datasets in R - an Example with eBird Data
Journal Club - March 23rd
Who is this guy?
What are Big Data?
"Big data primarily refers to data sets that are too large or complex to be dealt with by traditional data-processing application software."
Wikipedia
Remote sensing data
Logger data
Citizen science projects
Have you ever tried to load an 8 GB .csv/.txt file into R?
Why though?
"The tools you learn in this book will easily handle (...) 1-2 Gb of data. If you’re routinely working with larger data (10-100 Gb, say), you should learn *something else*"
R for Data Science (2nd edition) (Wickham & Grolemund)
Apache Arrow
Tidyverse + Arrow
How did COVID-19 lockdowns affect birding in Portugal?
WTF is eBird?
eBird Basic Dataset (EBD) (PT)
Install Apache Arrow
install.packages("arrow")
Inspect the dataset
library(arrow)
library(tidyverse)

data <- open_dataset(
  sources = "ebd_PT_prv_relDec-2022.txt",
  format = "tsv")

data %>% glimpse()

FileSystemDataset with 1 csv file
8,896,216 rows x 50 columns
$ `GLOBAL UNIQUE IDENTIFIER` <string> "URN:CornellLabOfOrnithology:EBIRD:OBS1533374605", "URN:CornellLabO…
$ `LAST EDITED DATE` <timestamp[ns]> 2022-10-12 19:35:43, 2021-07-18 11:50:07, 2021-07-18 11:50:07, 2021…
$ `TAXONOMIC ORDER` <int64> 664, 6386, 6623, 6449, 444, 5689, 5433, 23729, 6469, 6563, 6973, 20…
$ CATEGORY <string> "species", "species", "species", "species", "species", "species", "…
$ `TAXON CONCEPT ID` <string> "avibase-B77377EE", "avibase-FB02DD96", "avibase-4D2FF6F1", "avibas…
$ `COMMON NAME` <string> "Common Eider", "Black-headed Gull", "Common Tern", "Yellow-legged …
$ `SCIENTIFIC NAME` <string> "Somateria mollissima", "Chroicocephalus ridibundus", "Sterna hirun…
$ `SUBSPECIES COMMON NAME` <string> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",…
$ `SUBSPECIES SCIENTIFIC NAME` <string> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",…
$ `EXOTIC CODE` <string> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",…
$ `OBSERVATION COUNT` <string> "6", "X", "X", "X", "X", "3", "1", "X", "X", "X", "X", "X", "X", "X…
$ `BREEDING CODE` <string> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",…
$ `BREEDING CATEGORY` <string> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",…
$ `BEHAVIOR CODE` <string> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",…
$ `AGE/SEX` <string> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",…
$ COUNTRY <string> "Portugal", "Portugal", "Portugal", "Portugal", "Portugal", "Portug…
$ `COUNTRY CODE` <string> "PT", "PT", "PT", "PT", "PT", "PT", "PT", "PT", "PT", "PT", "PT", "…
$ STATE <string> "Lisboa", "Região Autónoma dos Açores", "Região Autónoma dos Açores…
$ `STATE CODE` <string> "PT-11", "PT-20", "PT-20", "PT-20", "PT-14", "PT-14", "PT-20", "PT-…
$ COUNTY <string> "Cascais", "Lagoa", "Lagoa", "Lagoa", "Abrantes", "Abrantes", "Ribe…
$ `COUNTY CODE` <string> "PT-11-CS", "PT-20-LG", "PT-20-LG", "PT-20-LG", "PT-14-AB", "PT-14-…
$ `IBA CODE` <null> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ `BCR CODE` <null> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ `USFWS CODE` <null> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ `ATLAS BLOCK` <null> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ LOCALITY <string> "PN Sintra-Cascais--Cabo Raso", "St. Michaels, Azores", "St. Michae…
$ `LOCALITY ID` <string> "L930224", "L15809663", "L15809663", "L15809663", "L22107968", "L21…
$ `LOCALITY TYPE` <string> "H", "P", "P", "P", "P", "H", "P", "P", "P", "P", "H", "H", "H", "H…
$ LATITUDE <double> 38.70946, 37.75265, 37.75265, 37.75265, 39.40783, 39.31732, 37.7750…
$ LONGITUDE <double> -9.485836, -25.530767, -25.530767, -25.530767, -8.254034, -8.222752…
$ `OBSERVATION DATE` <date32[day]> 1891-10-01, 1900-10-01, 1900-10-01, 1900-10-01, 1939-01-03, 1944-04…
$ `TIME OBSERVATIONS STARTED` <time32[s]> 09:00:00, NA, NA, NA, NA, NA, N…
$ `OBSERVER ID` <string> "obsr3419517", "obsr237338", "obsr237338", "obsr237338", "obsr35299…
$ `SAMPLING EVENT IDENTIFIER` <string> "S120076980", "S91932209", "S91932209", "S91932209", "S125612599", …
$ `PROTOCOL TYPE` <string> "Incidental", "Historical", "Historical", "Historical", "Incidental…
$ `PROTOCOL CODE` <string> "P20", "P62", "P62", "P62", "P20", "P62", "P20", "P62", "P62", "P62…
$ `PROJECT CODE` <string> "EBIRD", "EBIRD", "EBIRD", "EBIRD", "EBIRD_POR", "EBIRD_POR", "EBIR…
$ `DURATION MINUTES` <int64> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ `EFFORT DISTANCE KM` <double> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ `EFFORT AREA HA` <double> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ `NUMBER OBSERVERS` <int64> 1, NA, NA, NA, NA, NA, 1, NA, NA, NA, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ `ALL SPECIES REPORTED` <int64> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ `GROUP IDENTIFIER` <string> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",…
$ `HAS MEDIA` <int64> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ APPROVED <int64> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ REVIEWED <int64> 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0…
$ REASON <null> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ `TRIP COMMENTS` <string> "", "", "", "", "", "", "Purple Gallinule records uploaded by Marsh…
$ `SPECIES COMMENTS` <string> "In Rei D. Carlos de Bragança. 2002. Inéditos - 1 das aves encontra…
$ `` <null> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…

data_export <- data %>%
  select(`OBSERVATION DATE`, `SAMPLING EVENT IDENTIFIER`, `OBSERVER ID`, COUNTY)
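The select() call on the dataset is lazy: Arrow records the query, but no rows are read until a result is requested. A minimal sketch of how such a selection might be pulled into memory, using a tiny made-up TSV file instead of the real EBD. The toy rows and the names toy_path, toy_rows, toy, toy_query and toy_export are assumptions for illustration only; they are not part of the eBird data or of the original slides.

```r
library(arrow)
library(dplyr)

# Toy stand-in for the EBD file: three made-up rows, tab-separated (hypothetical data).
toy_path <- tempfile(fileext = ".txt")
toy_rows <- data.frame(
  `OBSERVATION DATE` = as.Date(c("2019-04-01", "2020-04-01", "2020-04-02")),
  `SAMPLING EVENT IDENTIFIER` = c("S1", "S2", "S3"),
  `OBSERVER ID` = c("obsr1", "obsr1", "obsr2"),
  COUNTY = c("Cascais", "Lagoa", "Lagoa"),
  `COMMON NAME` = c("Common Tern", "Common Eider", "Common Tern"),
  check.names = FALSE
)
write.table(toy_rows, toy_path, sep = "\t", quote = FALSE, row.names = FALSE)

# open_dataset() registers the file and its schema; the rows stay on disk.
toy <- open_dataset(sources = toy_path, format = "tsv")

# Still lazy: this only describes which columns are wanted.
toy_query <- toy %>%
  select(`OBSERVATION DATE`, `SAMPLING EVENT IDENTIFIER`, `OBSERVER ID`, COUNTY)

# collect() runs the query and returns an ordinary tibble in memory.
toy_export <- collect(toy_query)
```

Because only the four selected columns are materialized, the in-memory result can be far smaller than the source file; with the real EBD, the same pattern would apply to data_export.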
library(arrow) library(tidyverse) data <- open_dataset( sources = "ebd_PT_prv_relDec-2022.txt", format = "tsv") data %>% glimpse() FileSystemDataset with 1 csv file 8,896,216 rows x 50 columns $ `GLOBAL UNIQUE IDENTIFIER` <string> "URN:CornellLabOfOrnithology:EBIRD:OBS1533374605", "URN:CornellLabO… $ `LAST EDITED DATE` <timestamp[ns]> 2022-10-12 19:35:43, 2021-07-18 11:50:07, 2021-07-18 11:50:07, 2021… $ `TAXONOMIC ORDER` <int64> 664, 6386, 6623, 6449, 444, 5689, 5433, 23729, 6469, 6563, 6973, 20… $ CATEGORY <string> "species", "species", "species", "species", "species", "species", "… $ `TAXON CONCEPT ID` <string> "avibase-B77377EE", "avibase-FB02DD96", "avibase-4D2FF6F1", "avibas… $ `COMMON NAME` <string> "Common Eider", "Black-headed Gull", "Common Tern", "Yellow-legged … $ `SCIENTIFIC NAME` <string> "Somateria mollissima", "Chroicocephalus ridibundus", "Sterna hirun… $ `SUBSPECIES COMMON NAME` <string> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",… $ `SUBSPECIES SCIENTIFIC NAME` <string> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",… $ `EXOTIC CODE` <string> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",… $ `OBSERVATION COUNT` <string> "6", "X", "X", "X", "X", "3", "1", "X", "X", "X", "X", "X", "X", "X… $ `BREEDING CODE` <string> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",… $ `BREEDING CATEGORY` <string> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",… $ `BEHAVIOR CODE` <string> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",… $ `AGE/SEX` <string> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",… $ COUNTRY <string> "Portugal", "Portugal", "Portugal", "Portugal", "Portugal", "Portug… $ `COUNTRY CODE` <string> "PT", "PT", "PT", "PT", "PT", "PT", "PT", "PT", "PT", "PT", "PT", "… $ STATE <string> "Lisboa", "Região Autónoma dos Açores", "Região Autónoma dos Açores… $ `STATE CODE` <string> "PT-11", "PT-20", "PT-20", 
"PT-20", "PT-14", "PT-14", "PT-20", "PT-… $ COUNTY <string> "Cascais", "Lagoa", "Lagoa", "Lagoa", "Abrantes", "Abrantes", "Ribe… $ `COUNTY CODE` <string> "PT-11-CS", "PT-20-LG", "PT-20-LG", "PT-20-LG", "PT-14-AB", "PT-14-… $ `IBA CODE` <null> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,… $ `BCR CODE` <null> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,… $ `USFWS CODE` <null> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,… $ `ATLAS BLOCK` <null> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,… $ LOCALITY <string> "PN Sintra-Cascais--Cabo Raso", "St. Michaels, Azores", "St. Michae… $ `LOCALITY ID` <string> "L930224", "L15809663", "L15809663", "L15809663", "L22107968", "L21… $ `LOCALITY TYPE` <string> "H", "P", "P", "P", "P", "H", "P", "P", "P", "P", "H", "H", "H", "H… $ LATITUDE <double> 38.70946, 37.75265, 37.75265, 37.75265, 39.40783, 39.31732, 37.7750… $ LONGITUDE <double> -9.485836, -25.530767, -25.530767, -25.530767, -8.254034, -8.222752… $ `OBSERVATION DATE` <date32[day]> 1891-10-01, 1900-10-01, 1900-10-01, 1900-10-01, 1939-01-03, 1944-04… $ `TIME OBSERVATIONS STARTED` <time32[s]> 09:00:00, NA, NA, NA, NA, NA, N… $ `OBSERVER ID` <string> "obsr3419517", "obsr237338", "obsr237338", "obsr237338", "obsr35299… $ `SAMPLING EVENT IDENTIFIER` <string> "S120076980", "S91932209", "S91932209", "S91932209", "S125612599", … $ `PROTOCOL TYPE` <string> "Incidental", "Historical", "Historical", "Historical", "Incidental… $ `PROTOCOL CODE` <string> "P20", "P62", "P62", "P62", "P20", "P62", "P20", "P62", "P62", "P62… $ `PROJECT CODE` <string> "EBIRD", "EBIRD", "EBIRD", "EBIRD", "EBIRD_POR", "EBIRD_POR", "EBIR… $ `DURATION MINUTES` <int64> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,… $ `EFFORT DISTANCE KM` <double> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,… $ `EFFORT AREA HA` <double> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA,… $ `NUMBER OBSERVERS` <int64> 1, NA, NA, NA, NA, NA, 1, NA, NA, NA, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,… $ `ALL SPECIES REPORTED` <int64> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… $ `GROUP IDENTIFIER` <string> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",… $ `HAS MEDIA` <int64> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… $ APPROVED <int64> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1… $ REVIEWED <int64> 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0… $ REASON <null> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,… $ `TRIP COMMENTS` <string> "", "", "", "", "", "", "Purple Gallinule records uploaded by Marsh… $ `SPECIES COMMENTS` <string> "In Rei D. Carlos de Bragança. 2002. Inéditos - 1 das aves encontra… $ `` <null> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,… data_export <- data %>% select(`OBSERVATION DATE`, `SAMPLING EVENT IDENTIFIER`, `OBSERVER ID`,COUNTY)
library(arrow) library(tidyverse) data <- open_dataset( sources = "ebd_PT_prv_relDec-2022.txt", format = "tsv") data %>% glimpse() FileSystemDataset with 1 csv file 8,896,216 rows x 50 columns $ `GLOBAL UNIQUE IDENTIFIER` <string> "URN:CornellLabOfOrnithology:EBIRD:OBS1533374605", "URN:CornellLabO… $ `LAST EDITED DATE` <timestamp[ns]> 2022-10-12 19:35:43, 2021-07-18 11:50:07, 2021-07-18 11:50:07, 2021… $ `TAXONOMIC ORDER` <int64> 664, 6386, 6623, 6449, 444, 5689, 5433, 23729, 6469, 6563, 6973, 20… $ CATEGORY <string> "species", "species", "species", "species", "species", "species", "… $ `TAXON CONCEPT ID` <string> "avibase-B77377EE", "avibase-FB02DD96", "avibase-4D2FF6F1", "avibas… $ `COMMON NAME` <string> "Common Eider", "Black-headed Gull", "Common Tern", "Yellow-legged … $ `SCIENTIFIC NAME` <string> "Somateria mollissima", "Chroicocephalus ridibundus", "Sterna hirun… $ `SUBSPECIES COMMON NAME` <string> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",… $ `SUBSPECIES SCIENTIFIC NAME` <string> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",… $ `EXOTIC CODE` <string> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",… $ `OBSERVATION COUNT` <string> "6", "X", "X", "X", "X", "3", "1", "X", "X", "X", "X", "X", "X", "X… $ `BREEDING CODE` <string> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",… $ `BREEDING CATEGORY` <string> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",… $ `BEHAVIOR CODE` <string> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",… $ `AGE/SEX` <string> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",… $ COUNTRY <string> "Portugal", "Portugal", "Portugal", "Portugal", "Portugal", "Portug… $ `COUNTRY CODE` <string> "PT", "PT", "PT", "PT", "PT", "PT", "PT", "PT", "PT", "PT", "PT", "… $ STATE <string> "Lisboa", "Região Autónoma dos Açores", "Região Autónoma dos Açores… $ `STATE CODE` <string> "PT-11", "PT-20", "PT-20", 
"PT-20", "PT-14", "PT-14", "PT-20", "PT-… $ COUNTY <string> "Cascais", "Lagoa", "Lagoa", "Lagoa", "Abrantes", "Abrantes", "Ribe… $ `COUNTY CODE` <string> "PT-11-CS", "PT-20-LG", "PT-20-LG", "PT-20-LG", "PT-14-AB", "PT-14-… $ `IBA CODE` <null> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,… $ `BCR CODE` <null> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,… $ `USFWS CODE` <null> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,… $ `ATLAS BLOCK` <null> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,… $ LOCALITY <string> "PN Sintra-Cascais--Cabo Raso", "St. Michaels, Azores", "St. Michae… $ `LOCALITY ID` <string> "L930224", "L15809663", "L15809663", "L15809663", "L22107968", "L21… $ `LOCALITY TYPE` <string> "H", "P", "P", "P", "P", "H", "P", "P", "P", "P", "H", "H", "H", "H… $ LATITUDE <double> 38.70946, 37.75265, 37.75265, 37.75265, 39.40783, 39.31732, 37.7750… $ LONGITUDE <double> -9.485836, -25.530767, -25.530767, -25.530767, -8.254034, -8.222752… $ `OBSERVATION DATE` <date32[day]> 1891-10-01, 1900-10-01, 1900-10-01, 1900-10-01, 1939-01-03, 1944-04… $ `TIME OBSERVATIONS STARTED` <time32[s]> 09:00:00, NA, NA, NA, NA, NA, N… $ `OBSERVER ID` <string> "obsr3419517", "obsr237338", "obsr237338", "obsr237338", "obsr35299… $ `SAMPLING EVENT IDENTIFIER` <string> "S120076980", "S91932209", "S91932209", "S91932209", "S125612599", … $ `PROTOCOL TYPE` <string> "Incidental", "Historical", "Historical", "Historical", "Incidental… $ `PROTOCOL CODE` <string> "P20", "P62", "P62", "P62", "P20", "P62", "P20", "P62", "P62", "P62… $ `PROJECT CODE` <string> "EBIRD", "EBIRD", "EBIRD", "EBIRD", "EBIRD_POR", "EBIRD_POR", "EBIR… $ `DURATION MINUTES` <int64> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,… $ `EFFORT DISTANCE KM` <double> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,… $ `EFFORT AREA HA` <double> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA,… $ `NUMBER OBSERVERS` <int64> 1, NA, NA, NA, NA, NA, 1, NA, NA, NA, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,… $ `ALL SPECIES REPORTED` <int64> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… $ `GROUP IDENTIFIER` <string> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",… $ `HAS MEDIA` <int64> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… $ APPROVED <int64> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1… $ REVIEWED <int64> 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0… $ REASON <null> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,… $ `TRIP COMMENTS` <string> "", "", "", "", "", "", "Purple Gallinule records uploaded by Marsh… $ `SPECIES COMMENTS` <string> "In Rei D. Carlos de Bragança. 2002. Inéditos - 1 das aves encontra… $ `` <null> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,… data_export <- data %>% select(`OBSERVATION DATE`, `SAMPLING EVENT IDENTIFIER`, `OBSERVER ID`,COUNTY)
library(arrow) library(tidyverse) data <- open_dataset( sources = "ebd_PT_prv_relDec-2022.txt", format = "tsv") data %>% glimpse() FileSystemDataset with 1 csv file 8,896,216 rows x 50 columns $ `GLOBAL UNIQUE IDENTIFIER` <string> "URN:CornellLabOfOrnithology:EBIRD:OBS1533374605", "URN:CornellLabO… $ `LAST EDITED DATE` <timestamp[ns]> 2022-10-12 19:35:43, 2021-07-18 11:50:07, 2021-07-18 11:50:07, 2021… $ `TAXONOMIC ORDER` <int64> 664, 6386, 6623, 6449, 444, 5689, 5433, 23729, 6469, 6563, 6973, 20… $ CATEGORY <string> "species", "species", "species", "species", "species", "species", "… $ `TAXON CONCEPT ID` <string> "avibase-B77377EE", "avibase-FB02DD96", "avibase-4D2FF6F1", "avibas… $ `COMMON NAME` <string> "Common Eider", "Black-headed Gull", "Common Tern", "Yellow-legged … $ `SCIENTIFIC NAME` <string> "Somateria mollissima", "Chroicocephalus ridibundus", "Sterna hirun… $ `SUBSPECIES COMMON NAME` <string> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",… $ `SUBSPECIES SCIENTIFIC NAME` <string> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",… $ `EXOTIC CODE` <string> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",… $ `OBSERVATION COUNT` <string> "6", "X", "X", "X", "X", "3", "1", "X", "X", "X", "X", "X", "X", "X… $ `BREEDING CODE` <string> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",… $ `BREEDING CATEGORY` <string> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",… $ `BEHAVIOR CODE` <string> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",… $ `AGE/SEX` <string> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",… $ COUNTRY <string> "Portugal", "Portugal", "Portugal", "Portugal", "Portugal", "Portug… $ `COUNTRY CODE` <string> "PT", "PT", "PT", "PT", "PT", "PT", "PT", "PT", "PT", "PT", "PT", "… $ STATE <string> "Lisboa", "Região Autónoma dos Açores", "Região Autónoma dos Açores… $ `STATE CODE` <string> "PT-11", "PT-20", "PT-20", 
"PT-20", "PT-14", "PT-14", "PT-20", "PT-… $ COUNTY <string> "Cascais", "Lagoa", "Lagoa", "Lagoa", "Abrantes", "Abrantes", "Ribe… $ `COUNTY CODE` <string> "PT-11-CS", "PT-20-LG", "PT-20-LG", "PT-20-LG", "PT-14-AB", "PT-14-… $ `IBA CODE` <null> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,… $ `BCR CODE` <null> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,… $ `USFWS CODE` <null> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,… $ `ATLAS BLOCK` <null> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,… $ LOCALITY <string> "PN Sintra-Cascais--Cabo Raso", "St. Michaels, Azores", "St. Michae… $ `LOCALITY ID` <string> "L930224", "L15809663", "L15809663", "L15809663", "L22107968", "L21… $ `LOCALITY TYPE` <string> "H", "P", "P", "P", "P", "H", "P", "P", "P", "P", "H", "H", "H", "H… $ LATITUDE <double> 38.70946, 37.75265, 37.75265, 37.75265, 39.40783, 39.31732, 37.7750… $ LONGITUDE <double> -9.485836, -25.530767, -25.530767, -25.530767, -8.254034, -8.222752… $ `OBSERVATION DATE` <date32[day]> 1891-10-01, 1900-10-01, 1900-10-01, 1900-10-01, 1939-01-03, 1944-04… $ `TIME OBSERVATIONS STARTED` <time32[s]> 09:00:00, NA, NA, NA, NA, NA, N… $ `OBSERVER ID` <string> "obsr3419517", "obsr237338", "obsr237338", "obsr237338", "obsr35299… $ `SAMPLING EVENT IDENTIFIER` <string> "S120076980", "S91932209", "S91932209", "S91932209", "S125612599", … $ `PROTOCOL TYPE` <string> "Incidental", "Historical", "Historical", "Historical", "Incidental… $ `PROTOCOL CODE` <string> "P20", "P62", "P62", "P62", "P20", "P62", "P20", "P62", "P62", "P62… $ `PROJECT CODE` <string> "EBIRD", "EBIRD", "EBIRD", "EBIRD", "EBIRD_POR", "EBIRD_POR", "EBIR… $ `DURATION MINUTES` <int64> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,… $ `EFFORT DISTANCE KM` <double> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,… $ `EFFORT AREA HA` <double> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA,… $ `NUMBER OBSERVERS` <int64> 1, NA, NA, NA, NA, NA, 1, NA, NA, NA, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,… $ `ALL SPECIES REPORTED` <int64> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… $ `GROUP IDENTIFIER` <string> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",… $ `HAS MEDIA` <int64> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… $ APPROVED <int64> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1… $ REVIEWED <int64> 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0… $ REASON <null> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,… $ `TRIP COMMENTS` <string> "", "", "", "", "", "", "Purple Gallinule records uploaded by Marsh… $ `SPECIES COMMENTS` <string> "In Rei D. Carlos de Bragança. 2002. Inéditos - 1 das aves encontra… $ `` <null> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,… data_export <- data %>% select(`OBSERVATION DATE`, `SAMPLING EVENT IDENTIFIER`, `OBSERVER ID`,COUNTY)
library(arrow) library(tidyverse) data <- open_dataset( sources = "ebd_PT_prv_relDec-2022.txt", format = "tsv") data %>% glimpse() FileSystemDataset with 1 csv file 8,896,216 rows x 50 columns $ `GLOBAL UNIQUE IDENTIFIER` <string> "URN:CornellLabOfOrnithology:EBIRD:OBS1533374605", "URN:CornellLabO… $ `LAST EDITED DATE` <timestamp[ns]> 2022-10-12 19:35:43, 2021-07-18 11:50:07, 2021-07-18 11:50:07, 2021… $ `TAXONOMIC ORDER` <int64> 664, 6386, 6623, 6449, 444, 5689, 5433, 23729, 6469, 6563, 6973, 20… $ CATEGORY <string> "species", "species", "species", "species", "species", "species", "… $ `TAXON CONCEPT ID` <string> "avibase-B77377EE", "avibase-FB02DD96", "avibase-4D2FF6F1", "avibas… $ `COMMON NAME` <string> "Common Eider", "Black-headed Gull", "Common Tern", "Yellow-legged … $ `SCIENTIFIC NAME` <string> "Somateria mollissima", "Chroicocephalus ridibundus", "Sterna hirun… $ `SUBSPECIES COMMON NAME` <string> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",… $ `SUBSPECIES SCIENTIFIC NAME` <string> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",… $ `EXOTIC CODE` <string> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",… $ `OBSERVATION COUNT` <string> "6", "X", "X", "X", "X", "3", "1", "X", "X", "X", "X", "X", "X", "X… $ `BREEDING CODE` <string> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",… $ `BREEDING CATEGORY` <string> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",… $ `BEHAVIOR CODE` <string> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",… $ `AGE/SEX` <string> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",… $ COUNTRY <string> "Portugal", "Portugal", "Portugal", "Portugal", "Portugal", "Portug… $ `COUNTRY CODE` <string> "PT", "PT", "PT", "PT", "PT", "PT", "PT", "PT", "PT", "PT", "PT", "… $ STATE <string> "Lisboa", "Região Autónoma dos Açores", "Região Autónoma dos Açores… $ `STATE CODE` <string> "PT-11", "PT-20", "PT-20", 
"PT-20", "PT-14", "PT-14", "PT-20", "PT-… $ COUNTY <string> "Cascais", "Lagoa", "Lagoa", "Lagoa", "Abrantes", "Abrantes", "Ribe… $ `COUNTY CODE` <string> "PT-11-CS", "PT-20-LG", "PT-20-LG", "PT-20-LG", "PT-14-AB", "PT-14-… $ `IBA CODE` <null> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,… $ `BCR CODE` <null> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,… $ `USFWS CODE` <null> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,… $ `ATLAS BLOCK` <null> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,… $ LOCALITY <string> "PN Sintra-Cascais--Cabo Raso", "St. Michaels, Azores", "St. Michae… $ `LOCALITY ID` <string> "L930224", "L15809663", "L15809663", "L15809663", "L22107968", "L21… $ `LOCALITY TYPE` <string> "H", "P", "P", "P", "P", "H", "P", "P", "P", "P", "H", "H", "H", "H… $ LATITUDE <double> 38.70946, 37.75265, 37.75265, 37.75265, 39.40783, 39.31732, 37.7750… $ LONGITUDE <double> -9.485836, -25.530767, -25.530767, -25.530767, -8.254034, -8.222752… $ `OBSERVATION DATE` <date32[day]> 1891-10-01, 1900-10-01, 1900-10-01, 1900-10-01, 1939-01-03, 1944-04… $ `TIME OBSERVATIONS STARTED` <time32[s]> 09:00:00, NA, NA, NA, NA, NA, N… $ `OBSERVER ID` <string> "obsr3419517", "obsr237338", "obsr237338", "obsr237338", "obsr35299… $ `SAMPLING EVENT IDENTIFIER` <string> "S120076980", "S91932209", "S91932209", "S91932209", "S125612599", … $ `PROTOCOL TYPE` <string> "Incidental", "Historical", "Historical", "Historical", "Incidental… $ `PROTOCOL CODE` <string> "P20", "P62", "P62", "P62", "P20", "P62", "P20", "P62", "P62", "P62… $ `PROJECT CODE` <string> "EBIRD", "EBIRD", "EBIRD", "EBIRD", "EBIRD_POR", "EBIRD_POR", "EBIR… $ `DURATION MINUTES` <int64> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,… $ `EFFORT DISTANCE KM` <double> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,… $ `EFFORT AREA HA` <double> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA,… $ `NUMBER OBSERVERS` <int64> 1, NA, NA, NA, NA, NA, 1, NA, NA, NA, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,… $ `ALL SPECIES REPORTED` <int64> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… $ `GROUP IDENTIFIER` <string> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",… $ `HAS MEDIA` <int64> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… $ APPROVED <int64> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1… $ REVIEWED <int64> 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0… $ REASON <null> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,… $ `TRIP COMMENTS` <string> "", "", "", "", "", "", "Purple Gallinule records uploaded by Marsh… $ `SPECIES COMMENTS` <string> "In Rei D. Carlos de Bragança. 2002. Inéditos - 1 das aves encontra… $ `` <null> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,… data_export <- data %>% select(`OBSERVATION DATE`, `SAMPLING EVENT IDENTIFIER`, `OBSERVER ID`,COUNTY)
library(arrow) library(tidyverse) data <- open_dataset( sources = "ebd_PT_prv_relDec-2022.txt", format = "tsv") data %>% glimpse() FileSystemDataset with 1 csv file 8,896,216 rows x 50 columns $ `GLOBAL UNIQUE IDENTIFIER` <string> "URN:CornellLabOfOrnithology:EBIRD:OBS1533374605", "URN:CornellLabO… $ `LAST EDITED DATE` <timestamp[ns]> 2022-10-12 19:35:43, 2021-07-18 11:50:07, 2021-07-18 11:50:07, 2021… $ `TAXONOMIC ORDER` <int64> 664, 6386, 6623, 6449, 444, 5689, 5433, 23729, 6469, 6563, 6973, 20… $ CATEGORY <string> "species", "species", "species", "species", "species", "species", "… $ `TAXON CONCEPT ID` <string> "avibase-B77377EE", "avibase-FB02DD96", "avibase-4D2FF6F1", "avibas… $ `COMMON NAME` <string> "Common Eider", "Black-headed Gull", "Common Tern", "Yellow-legged … $ `SCIENTIFIC NAME` <string> "Somateria mollissima", "Chroicocephalus ridibundus", "Sterna hirun… $ `SUBSPECIES COMMON NAME` <string> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",… $ `SUBSPECIES SCIENTIFIC NAME` <string> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",… $ `EXOTIC CODE` <string> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",… $ `OBSERVATION COUNT` <string> "6", "X", "X", "X", "X", "3", "1", "X", "X", "X", "X", "X", "X", "X… $ `BREEDING CODE` <string> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",… $ `BREEDING CATEGORY` <string> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",… $ `BEHAVIOR CODE` <string> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",… $ `AGE/SEX` <string> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",… $ COUNTRY <string> "Portugal", "Portugal", "Portugal", "Portugal", "Portugal", "Portug… $ `COUNTRY CODE` <string> "PT", "PT", "PT", "PT", "PT", "PT", "PT", "PT", "PT", "PT", "PT", "… $ STATE <string> "Lisboa", "Região Autónoma dos Açores", "Região Autónoma dos Açores… $ `STATE CODE` <string> "PT-11", "PT-20", "PT-20", 
"PT-20", "PT-14", "PT-14", "PT-20", "PT-… $ COUNTY <string> "Cascais", "Lagoa", "Lagoa", "Lagoa", "Abrantes", "Abrantes", "Ribe… $ `COUNTY CODE` <string> "PT-11-CS", "PT-20-LG", "PT-20-LG", "PT-20-LG", "PT-14-AB", "PT-14-… $ `IBA CODE` <null> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,… $ `BCR CODE` <null> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,… $ `USFWS CODE` <null> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,… $ `ATLAS BLOCK` <null> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,… $ LOCALITY <string> "PN Sintra-Cascais--Cabo Raso", "St. Michaels, Azores", "St. Michae… $ `LOCALITY ID` <string> "L930224", "L15809663", "L15809663", "L15809663", "L22107968", "L21… $ `LOCALITY TYPE` <string> "H", "P", "P", "P", "P", "H", "P", "P", "P", "P", "H", "H", "H", "H… $ LATITUDE <double> 38.70946, 37.75265, 37.75265, 37.75265, 39.40783, 39.31732, 37.7750… $ LONGITUDE <double> -9.485836, -25.530767, -25.530767, -25.530767, -8.254034, -8.222752… $ `OBSERVATION DATE` <date32[day]> 1891-10-01, 1900-10-01, 1900-10-01, 1900-10-01, 1939-01-03, 1944-04… $ `TIME OBSERVATIONS STARTED` <time32[s]> 09:00:00, NA, NA, NA, NA, NA, N… $ `OBSERVER ID` <string> "obsr3419517", "obsr237338", "obsr237338", "obsr237338", "obsr35299… $ `SAMPLING EVENT IDENTIFIER` <string> "S120076980", "S91932209", "S91932209", "S91932209", "S125612599", … $ `PROTOCOL TYPE` <string> "Incidental", "Historical", "Historical", "Historical", "Incidental… $ `PROTOCOL CODE` <string> "P20", "P62", "P62", "P62", "P20", "P62", "P20", "P62", "P62", "P62… $ `PROJECT CODE` <string> "EBIRD", "EBIRD", "EBIRD", "EBIRD", "EBIRD_POR", "EBIRD_POR", "EBIR… $ `DURATION MINUTES` <int64> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,… $ `EFFORT DISTANCE KM` <double> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,… $ `EFFORT AREA HA` <double> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA,… $ `NUMBER OBSERVERS` <int64> 1, NA, NA, NA, NA, NA, 1, NA, NA, NA, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,… $ `ALL SPECIES REPORTED` <int64> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… $ `GROUP IDENTIFIER` <string> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",… $ `HAS MEDIA` <int64> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0… $ APPROVED <int64> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1… $ REVIEWED <int64> 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0… $ REASON <null> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,… $ `TRIP COMMENTS` <string> "", "", "", "", "", "", "Purple Gallinule records uploaded by Marsh… $ `SPECIES COMMENTS` <string> "In Rei D. Carlos de Bragança. 2002. Inéditos - 1 das aves encontra… $ `` <null> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,… data_export <- data %>% select(`OBSERVATION DATE`, `SAMPLING EVENT IDENTIFIER`, `OBSERVER ID`,COUNTY)
Convert to Parquet format
arrow::write_dataset(data_export, "ebird_parquet",
format="parquet")
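To try the conversion step without the full eBird download, the same call can be run on a tiny made-up table. This is only a sketch: the data frame `toy`, its column names, and the output folder under `tempdir()` are assumptions for illustration and are not part of the eBird dataset.

```r
library(arrow)

# Toy stand-in for the selected eBird columns (all values are made up).
toy <- data.frame(
  observation_date = as.Date(c("2019-05-01", "2020-05-01", "2020-05-01", "2021-05-02")),
  checklist_id     = c("S1", "S2", "S2", "S3"),
  observer_id      = c("obsr1", "obsr1", "obsr1", "obsr2")
)

# Write the table as a Parquet dataset into a temporary folder.
out_dir <- file.path(tempdir(), "toy_parquet")
write_dataset(toy, out_dir, format = "parquet")

# The folder now holds one or more .parquet files.
parquet_files <- list.files(out_dir, pattern = "\\.parquet$", recursive = TRUE)

# Reopening the folder gives a Dataset object that points at the files on disk.
reopened <- open_dataset(out_dir)
n_rows <- nrow(reopened)
```

Parquet stores the data column by column and keeps the column types, so later queries that touch only a few columns do not need to re-parse the whole text file.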
Import Parquet dataset
data <- open_dataset("ebird_parquet")
data <- data %>% mutate(year = lubridate::year(`OBSERVATION DATE`))
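A point that is easy to miss here: with Arrow, `mutate()` does not compute anything yet. It only records the step, and the work happens when `collect()` is called. The sketch below shows this on a small made-up Arrow table; the names `toy`, `obs_date` and `checklist_id` are assumptions for illustration, and it assumes the lubridate package is installed.

```r
library(arrow)
library(dplyr)

# Small in-memory Arrow table standing in for the Parquet dataset (made-up values).
toy <- arrow_table(
  obs_date     = as.Date(c("2019-05-01", "2020-05-01", "2021-05-02")),
  checklist_id = c("S1", "S2", "S3")
)

# mutate() on an Arrow object returns a lazy query, not a data frame.
query <- toy %>% mutate(year = lubridate::year(obs_date))
query_class <- class(query)

# collect() runs the query and returns an ordinary tibble in R memory.
result <- collect(query)
```

Keeping the pipeline lazy for as long as possible is what lets the large file stay on disk until only a small result remains.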
How did the number of eBird checklists change between 2008 and 2022? Were there fewer eBird lists in 2020 and in 2021?
Lockdown stringency index in Portugal (2020-2022)
eBird checklists between 2008 and 2022
data %>%
  select(year, `SAMPLING EVENT IDENTIFIER`) %>%
  group_by(year) %>%
  summarise(n_list = n_distinct(`SAMPLING EVENT IDENTIFIER`)) %>%
  filter(year >= 2008) %>%
  # pull the small yearly summary into R before plotting
  collect() %>%
  ggplot(aes(y = n_list, x = year)) +
  geom_point(size = 3) +
  labs(y = "Number of eBird lists", x = "Year",
       title = "Number of eBird lists submitted between 2008 and 2022 in Portugal",
       caption = "Source: eBird basic dataset (PT)") +
  theme_bw() +
  theme(text = element_text(size = 20, family = "roboto condensed"))
Calculate the number of eBird checklists between 2008 and 2022 and make a plot
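Why count with `n_distinct()` rather than `n()`? In the glimpse output the same SAMPLING EVENT IDENTIFIER (for example "S91932209") appears on several rows, because each row is one species record and a checklist usually contains several. A minimal sketch on made-up data, where `toy`, `year` and `checklist_id` are hypothetical names used only for illustration:

```r
library(arrow)
library(dplyr)

# Five observation rows belonging to three checklists (made-up values).
toy <- arrow_table(
  year         = c(2020L, 2020L, 2020L, 2021L, 2021L),
  checklist_id = c("S1", "S1", "S2", "S3", "S3")
)

counts <- toy %>%
  group_by(year) %>%
  summarise(
    n_rows  = n(),                       # counts observation rows
    n_lists = n_distinct(checklist_id)   # counts checklists
  ) %>%
  arrange(year) %>%
  collect()
```

In this toy table, 2020 has three rows but only two checklists, so counting rows would overstate birding activity.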
Ok, but what about the number of active birders?
Number of active birders between 2008 and 2022
data %>%
  select(year, `OBSERVER ID`) %>%
  group_by(year) %>%
  summarise(n_birders = n_distinct(`OBSERVER ID`)) %>%
  filter(year >= 2008) %>%
  # pull the small yearly summary into R before plotting
  collect() %>%
  ggplot(aes(y = n_birders, x = year)) +
  geom_point(size = 3) +
  labs(y = "Number of active birders", x = "Year",
       title = "Number of active birders on eBird between 2008 and 2022",
       caption = "Source: eBird basic dataset") +
  theme_bw() +
  theme(text = element_text(size = 20, family = "roboto condensed"))
Number of active birders between 2008 and 2022
Scientific conclusions
Technical conclusion (I)
Technical conclusion (II)
Thank God it's over