Rosalie Bruel
July 18th, 2019
During my postdoc at the Rubenstein Ecosystem Science Laboratory, I’m working on building a food web model for Lake Champlain. I deal with some data coming in small to large amount, with one thing in common: the format is often heterogeneous.
To save time and hopefully increase reproducibility, I try doing most of my QA/QC and all of my data analyses in R.
I’m gathering here some code I find useful for future reference / other people’s use.
I use the Vermont DEC plankton data, a monitoring that started in 2006.
There are 258 different names associated to phytoplankton in the database. Some are in fact the same, with different spelling (or lower/upper case).Fuzzy matching is a method to search approximate matches in a string to a pattern.
I used the agrep function.
agrep(pattern, x, ignore.case = FALSE, value = FALSE,max.distance = 0.1)
# Saving which rows have been matched
phyto_summ$Row_matched <- rep(NA,nrow(phyto_summ))
# Saving the name that have been matched
phyto_summ$Name_matched <- rep(NA,nrow(phyto_summ))
# Saving the number of matches
phyto_summ$Number_matched <- rep(NA,nrow(phyto_summ))
unique_name <- seq_along(phyto_summ$Name)
for(i in seq_along(phyto_summ$Name)) {
fuzzy_match <-
agrep(phyto_summ$Name[i],phyto_summ$Name,
max = 3, ignore.case = TRUE)
fuzzy_match <- fuzzy_match[-which(fuzzy_match==i)]
unique_name <- unique_name[!unique_name %in% fuzzy_match[fuzzy_match>i]]
if(length(fuzzy_match)>0) {
phyto_summ$Row_matched[i] <- paste(fuzzy_match,collapse="-")
phyto_summ$Name_matched[i] <- paste(phyto_summ$Name[fuzzy_match],collapse=", ")
phyto_summ$Number_matched[i] <- length(fuzzy_match)
}
}
Fuzzy matching correctly identified these 4 items countaining the pattern “ougeoti” as possible same elements. So this fuzzy matching is working the same way than the grep function, without me having to specifically ask for the patterns.
Below are the rows with a lot of matches, and it looks more problematic.
phyto_summ[159,]
The value we chose for the argument max determines how many differences are allowed. Trying to make the other arguments ( insertions, deletions, substitutions) vary from 1 to 15, I found that the maximum number of match was 272, and that, as soon as all parameters are turned to 2. That makes me wonder whether I’m totally using this correctly, I will update this in the future if I realize / if I’m told I’ve used the function the wrong way.
The following plot shows the number of match using several values of insertions, deletions, and substitutions:
1. Deletion, insertion and coests vary together 2. Substitution of just 1 character results in not matching 5 names that are otherwise matched for ‘substitution ≥ 2’.
With that in mind, I re-run the same loop than above but using this time the flexible costs/insertions/deletions/substitutions arguments:
agrep(phyto_summ\(Name[i],phyto_summ\)Name, max = list(costs = 2 insertions = 2, deletions = 2, substitutions = 2), ignore.case = TRUE)
# Creating a new column for the species that generated different match
phyto_summ$different_match_by_method <-
phyto_summ$Number_matched - # species matched with max=2
phyto_summ$Number_matched_2 # species matched with max=list(all=2)
n1 <- length(which(phyto_summ$different_match_by_method>0)) # How many match were different - quantitative appproach
n2 <- nrow(phyto_summ[phyto_summ$Number_matched != phyto_summ$Number_matched_2,]) # qualitative appproach
n1==n2 # Should be the same value
## [1] TRUE
Different match were created by both method for 69 names from our initial dataset. The second method is more strict and I’m struggling to understand why some match are not made (even by allowing increasing values of costs/substitutions/etc.).
I think it is still better to use this second method and use “only” 272 matches (vs. 429 matches for the first method) for now.
The whole objective is to be able to match phytoplankton counts belonging to the same species/genus rapidly and systematically. I’m writting down below the code I’m using at the moment:
phyto_summ_3 <- NULL
unique_name <- seq_along(phyto_summ$Name)
for(i in seq_along(unique_name)) {
which_match <- as.numeric(unlist(strsplit(phyto_summ[i,"Row_matched_2"],"-")))
phyto_summ_3 <- c(phyto_summ_3,
as.character(phyto_summ[i,"Name"]),
paste(sum(phyto_summ[c(i, which_match),"Frequency_presence"], na.rm = T)),
paste(sum(phyto_summ[c(i, which_match),"Total_count"], na.rm = T)))
unique_name <- unique_name[!unique_name %in% which_match[which_match>i]]
}
phyto_summ_3 <- as.data.frame(matrix(phyto_summ_3,ncol=3,byrow=T))
phyto_summ_3 <- phyto_summ_3[unique_name,]
colnames(phyto_summ_3) <- c("Name","Frequency_presence","Total_count")
# Making frenquencies and counts numeric
phyto_summ_3$Frequency_presence <- as.numeric(paste(phyto_summ_3$Frequency_presence))
phyto_summ_3$Total_count <- as.numeric(paste(phyto_summ_3$Total_count))
Out of this smaller dataset, there are 45 names with no observations over the 2006-2018 period.
phyto_summ_3 <- phyto_summ_3[phyto_summ_3$Frequency_presence>0,]
# Removing interogation point - it was the first name used
phyto_summ_3$Name <- gsub('?','',phyto_summ_3$Name, fixed = TRUE)
phyto_summ_3 <- phyto_summ_3[order(phyto_summ_3$Name),]
Fuzzy matching is a rapid and convenient way of working with large datasets with potential small variability in data entry.
As part of our project on Lake Champlain, we’re interested in getting a biomass for each group of the food web. This approach is a first step towards getting the total biomass of phytoplankton.
Some of the code I included is hidden (echo=FALSE in the R Markdown I generated this html from). Don’t hesitate to go to my GitHub to get the raw file with all the code, or send me an email and I’ll share it with you if you’re uncomfortable with GitHub and can’t find it.
I use the Vermont DEC plankton data, a monitoring that started in 2006. The ‘visit dates’ are heterogeneous – to be fair they probably are not in the real database, but data were sent to me in two distincts excel files, which is probably the source of the different format.
# Visualization of the problem from a sample of data
plktn$PlanktonData.VisitDate[seq(1,nrow(plktn),3000)]
## [1] 05-May-00 19-Jun-06 12-Jun-08 16-Jun-08 23-Jun-09 17-Sep-98
## [7] 13-Jun-08 07-Aug-07 19-Jun-07 05-Oct-06 24-Jul-07 21-Jul-09
## [13] 03-Sep-08 10-Jul-08 04/08/2014 25/09/2015 27/09/2010 19/09/2011
## [19] 22/06/2012 09/05/2014 11/06/2015 21/06/2010 30/08/2010 19/08/2011
## [25] 12/06/2013 09/08/2012 26/08/2011
## 1047 Levels: 01-Aug-02 01-Aug-06 01-Aug-08 01-Aug-96 01-Jul-08 ... 31/08/2015
There are at least two types of format:
* d-b-y, e.g., 05-May-00
* d/m/Y, e.g., 30/08/2010
Not being able to easily access different years of the monitoring is a limitation.
The package lubridate, and its function parse_date_time, are perfect to handle different date time format.
Symbol | Meaning | Example |
---|---|---|
%d | day as a number (0-31) | 01-31 |
%a | abbreviated weekday | Mon |
%A | unabbreviated weekday | Monday |
%m | month (00-12) | 00-12 |
%b | abbreviated month | Jan |
%B | unabbreviated month | January |
%y | 2-digit year | 19 |
%Y | 4-digit year | 2019 |
plktn$VisitDate <- parse_date_time(x = plktn$PlanktonData.VisitDate,
orders = c("d-b-y","d/m/Y"))
head(plktn$VisitDate)
## [1] "2000-05-05 UTC" "2000-05-05 UTC" "2000-05-05 UTC" "2000-05-05 UTC"
## [5] "2000-05-05 UTC" "2000-05-05 UTC"
I initially found the solution to my problem here.
lubridate also allows to directly extract the year, month, day, or day of the year, from a date.
Code | Example | Output |
---|---|---|
– | plktn$VisitDate[1] | 2000-05-05 |
year() | year(plktn$VisitDate[1]) | 2000 |
month() | month(plktn$VisitDate[1]) | 5 |
day() | day(plktn$VisitDate[1]) | 5 |
yday() | yday(plktn$VisitDate[1]) | 126 |