Rosalie Bruel
July 18^th, 2019

During my postdoc at the Rubenstein Ecosystem Science Laboratory, I’m working on building a food web model for Lake Champlain. I deal with some data coming in small to large amount, with one thing in common: the format is often heterogeneous.
To save time and hopefully increase reproducibility, I try doing most of my QA/QC and all of my data analyses in R.

I’m gathering here some code I find useful for future reference / other people’s use.

1. Deal with inconsistent names with fuzzy matching

1.1. Problem

I use the Vermont DEC plankton data, a monitoring that started in 2006.

There are 258 different names associated to phytoplankton in the database. Some are in fact the same, with different spelling (or lower/upper case).
For example, searching for the pattern “ougeoti”, we find that the database countain 4 elements with Mougeotia spp. The “Frenquency_presence” column gives an information on how many times this name has been used in the database, and how many cells were attributed to that terminology (for the period 2006-2018).

1.2. Solution

Fuzzy matching is a method to search approximate matches in a string to a pattern.

I used the agrep function.

agrep(pattern, x, ignore.case = FALSE, value = FALSE,max.distance = 0.1)

pattern would be the specie name
x is the character string in which the function will try to find names that resemble the pattern
use of ignore.case to avoid problems with upper/lower case letters
use of max.distance input of how many differences we allow - if we put a larger number, could be convenient to match genus? It can be an integrer, a fraction of the pattern length (maybe an interesting option to explore if names are more complicated than the one I’m dealing with), or a list
- argument all: maximal (overall) distance. If all is missing, it is set to 10%, the other components default to all. The component names can be abbreviated.
- argument costs: maximum number/fraction of match cost (generalized Levenshtein distance)
- argument insertions: maximum number/fraction of insertions
- argument deletions: maximum number/fraction of deletions
- argument substitutions: maximum number/fraction of substitutions

1.2.1 Searching for good function parameters

# Saving which rows have been matched
phyto_summ$Row_matched <- rep(NA,nrow(phyto_summ))
# Saving the name that have been matched
phyto_summ$Name_matched <- rep(NA,nrow(phyto_summ))
# Saving the number of matches
phyto_summ$Number_matched <- rep(NA,nrow(phyto_summ))

unique_name  <-  seq_along(phyto_summ$Name)
for(i in seq_along(phyto_summ$Name)) {
  fuzzy_match <- 
    agrep(phyto_summ$Name[i],phyto_summ$Name,
          max = 3, ignore.case = TRUE)
  fuzzy_match <- fuzzy_match[-which(fuzzy_match==i)]
  unique_name <- unique_name[!unique_name %in% fuzzy_match[fuzzy_match>i]]
  if(length(fuzzy_match)>0) {
    phyto_summ$Row_matched[i] <- paste(fuzzy_match,collapse="-")
    phyto_summ$Name_matched[i] <- paste(phyto_summ$Name[fuzzy_match],collapse=", ")
    phyto_summ$Number_matched[i] <- length(fuzzy_match)     
  }
}

Fuzzy matching correctly identified these 4 items countaining the pattern “ougeoti” as possible same elements. So this fuzzy matching is working the same way than the grep function, without me having to specifically ask for the patterns.

Below are the rows with a lot of matches, and it looks more problematic.

phyto_summ[159,]

The value we chose for the argument max determines how many differences are allowed. Trying to make the other arguments ( insertions, deletions, substitutions) vary from 1 to 15, I found that the maximum number of match was 272, and that, as soon as all parameters are turned to 2. That makes me wonder whether I’m totally using this correctly, I will update this in the future if I realize / if I’m told I’ve used the function the wrong way.

The following plot shows the number of match using several values of insertions, deletions, and substitutions:
1. Deletion, insertion and coests vary together 2. Substitution of just 1 character results in not matching 5 names that are otherwise matched for ‘substitution ≥ 2’.

With that in mind, I re-run the same loop than above but using this time the flexible costs/insertions/deletions/substitutions arguments:

agrep(phyto_summ$Name[i],phyto_summ$Name, max = list(costs = 2 insertions = 2, deletions = 2, substitutions = 2), ignore.case = TRUE)

# Creating a new column for the species that generated different match
phyto_summ$different_match_by_method <-
  phyto_summ$Number_matched - # species matched with max=2
  phyto_summ$Number_matched_2 # species matched with max=list(all=2)
n1 <- length(which(phyto_summ$different_match_by_method>0)) # How many match were different - quantitative appproach
n2 <- nrow(phyto_summ[phyto_summ$Number_matched != phyto_summ$Number_matched_2,]) # qualitative appproach

n1==n2 # Should be the same value

## [1] TRUE

Different match were created by both method for 69 names from our initial dataset. The second method is more strict and I’m struggling to understand why some match are not made (even by allowing increasing values of costs/substitutions/etc.).

I think it is still better to use this second method and use “only” 272 matches (vs. 429 matches for the first method) for now.

1.2.2 Merging the phytoplankton count together using the max = list(2,2,2,2)

The whole objective is to be able to match phytoplankton counts belonging to the same species/genus rapidly and systematically. I’m writting down below the code I’m using at the moment:

phyto_summ_3 <- NULL
unique_name  <-  seq_along(phyto_summ$Name)
for(i in seq_along(unique_name)) {
  which_match <- as.numeric(unlist(strsplit(phyto_summ[i,"Row_matched_2"],"-")))
  phyto_summ_3 <- c(phyto_summ_3,
                  as.character(phyto_summ[i,"Name"]),
                  paste(sum(phyto_summ[c(i, which_match),"Frequency_presence"], na.rm = T)),
                  paste(sum(phyto_summ[c(i, which_match),"Total_count"], na.rm = T)))
  unique_name <- unique_name[!unique_name %in% which_match[which_match>i]]
}
phyto_summ_3 <- as.data.frame(matrix(phyto_summ_3,ncol=3,byrow=T))
phyto_summ_3 <- phyto_summ_3[unique_name,]
colnames(phyto_summ_3) <- c("Name","Frequency_presence","Total_count")

# Making frenquencies and counts numeric
phyto_summ_3$Frequency_presence <- as.numeric(paste(phyto_summ_3$Frequency_presence))
phyto_summ_3$Total_count <- as.numeric(paste(phyto_summ_3$Total_count))

Out of this smaller dataset, there are 45 names with no observations over the 2006-2018 period.

phyto_summ_3 <- phyto_summ_3[phyto_summ_3$Frequency_presence>0,]

# Removing interogation point - it was the first name used
phyto_summ_3$Name <- gsub('?','',phyto_summ_3$Name, fixed = TRUE)

phyto_summ_3 <- phyto_summ_3[order(phyto_summ_3$Name),]

1.3 Conclusions

Fuzzy matching is a rapid and convenient way of working with large datasets with potential small variability in data entry.
As part of our project on Lake Champlain, we’re interested in getting a biomass for each group of the food web. This approach is a first step towards getting the total biomass of phytoplankton.

Some of the code I included is hidden (echo=FALSE in the R Markdown I generated this html from). Don’t hesitate to go to my GitHub to get the raw file with all the code, or send me an email and I’ll share it with you if you’re uncomfortable with GitHub and can’t find it.

2. Handle different date time format

2.1 Problem

I use the Vermont DEC plankton data, a monitoring that started in 2006. The ‘visit dates’ are heterogeneous – to be fair they probably are not in the real database, but data were sent to me in two distincts excel files, which is probably the source of the different format.

# Visualization of the problem from a sample of data
plktn$PlanktonData.VisitDate[seq(1,nrow(plktn),3000)]

##  [1] 05-May-00  19-Jun-06  12-Jun-08  16-Jun-08  23-Jun-09  17-Sep-98 
##  [7] 13-Jun-08  07-Aug-07  19-Jun-07  05-Oct-06  24-Jul-07  21-Jul-09 
## [13] 03-Sep-08  10-Jul-08  04/08/2014 25/09/2015 27/09/2010 19/09/2011
## [19] 22/06/2012 09/05/2014 11/06/2015 21/06/2010 30/08/2010 19/08/2011
## [25] 12/06/2013 09/08/2012 26/08/2011
## 1047 Levels: 01-Aug-02 01-Aug-06 01-Aug-08 01-Aug-96 01-Jul-08 ... 31/08/2015

There are at least two types of format:
* d-b-y, e.g., 05-May-00
* d/m/Y, e.g., 30/08/2010

Not being able to easily access different years of the monitoring is a limitation.

2.2 Solution

The package lubridate, and its function parse_date_time, are perfect to handle different date time format.

Symbol	Meaning	Example
%d	day as a number (0-31)	01-31
%a	abbreviated weekday	Mon
%A	unabbreviated weekday	Monday
%m	month (00-12)	00-12
%b	abbreviated month	Jan
%B	unabbreviated month	January
%y	2-digit year	19
%Y	4-digit year	2019

plktn$VisitDate <- parse_date_time(x = plktn$PlanktonData.VisitDate,
                                   orders = c("d-b-y","d/m/Y"))
head(plktn$VisitDate)

## [1] "2000-05-05 UTC" "2000-05-05 UTC" "2000-05-05 UTC" "2000-05-05 UTC"
## [5] "2000-05-05 UTC" "2000-05-05 UTC"

I initially found the solution to my problem here.

2.3 Extract year, month, or day of the year

lubridate also allows to directly extract the year, month, day, or day of the year, from a date.

Code	Example	Output
–	plktn$VisitDate[1]	2000-05-05
year()	year(plktn$VisitDate[1])	2000
month()	month(plktn$VisitDate[1])	5
day()	day(plktn$VisitDate[1])	5
yday()	yday(plktn$VisitDate[1])	126

is loading comments…

Random but hopefully useful bits of code to deal with large ecological data set