name: title_slide
class: middle, title-slide, animated, fadeIn, faster

.titleslide-title[(Data Cleaning + Fuzzy Matching) in R]

<br>
### .title-font[Data Cleaning Brown Bag]
####.title-font[Faculty of Information (iSchool) at University of Toronto]

<br>
### <b> .title-font[Eugene Joh] </b>
.title-font[*February 28 2020*]
<br>

<div class = "my-footer"><span>Disclaimer: views and opinions expressed here are my own and do not represent my employer</div>

???

---
name: intro
class: animated, fadeIn, faster

.pull-left[  
# Outline

- Challenges with global health datasets
  - Geocoding locations using R  
    - Cleaning messy text  
    - Fuzzy matching  
    - Iteration  
  - Takeaways
]

.pull-right[
<br>
<img src="https://cdn.onlinewebfonts.com/svg/img_385487.png" width="100" style="display: block; margin: auto;" />
<br>
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/1/1b/R_logo.svg/724px-R_logo.svg.png" width="120px" style="display: block; margin: auto;" />
<br>
<img src="https://www.traillink.com/images/tl/custom-icons/new-map-icon.svg" width="170px" style="display: block; margin: auto;" />
]

???

---
name: tweet
class: animated, fadeIn, faster

<center>
<blockquote class="twitter-tweet" width="300" data-lang="en"><p lang="en" dir="ltr"><a href="https://twitter.com/causalinf/status/1230146663207096320">19, February 2020</a></blockquote>
</center>

???

---
name: globalhealthtitle
class: title-slide, middle, animated, fadeIn, faster

# Challenges with Cleaning Global Health Data

???

---
name: ghdata
class: animated, fadeIn, faster

# Global Health

*"improving the health of populations at the international level"*

.pull-left[
**Health Surveys**  
 - Sample-based
 - Nested in vital statistics  
 - Household-level information
 - E.g. malaria, HIV, maternal & child health, malnutrition, etc.
]

.pull-right[
<img src="https://2oqz471sa19h3vbwa53m33yj-wpengine.netdna-ssl.com/wp-content/uploads/2017/09/681viwphrfnz-1.gif" width="350" />
]

???

---
name: commonchallenges
class: animated, fadeIn, faster

# Common Data Challenges

.pull-left50[

- Data collection -> entry errors  
- Missing/incomplete data  
- Ambiguity due to...
  - Alternative/inconsistent spelling 
  - Multiple languages
  ]
--

.pull-right50[

```r
# I know what this means
"doughnut"
"donut"

# Hmmm... Google please help me
"Gothra" #a village?
"Gotra"  #a clan or lineage? maybe a village?

"Kaipukharia" #a village
"Kaipukuria"  #also a village, k
"Kaipukhuria" #and also a village..?
```

]

???

---
name: geocodingtitle
class: title-slide, middle, animated, fadeIn, faster

# Geocoding Locations using R

???

---
name: projectdescr
class: animated, fadeIn, faster

# Project Description
**Objective**
 - Assign latitude/longitude to survey sampling units

**Data Sources**
 - Sampling units master list
 - Village amenities
 - Town amenities

???

---
name: pic1
class: animated, fadeIn, faster
<br>
<div id="htmlwidget-7ebaf3a37c701ef04fe6" style="width:650px;height:500px;" class="grViz html-widget"></div>
<script type="application/json" data-for="htmlwidget-7ebaf3a37c701ef04fe6">{"x":{"diagram":"digraph {\ngraph [rankdir = LR]\nnode [shape = rectangle, fontsize = \"22\", style = filled, fillcolor = Linen, fontname = \"Arial\", shape = \"tab\" fontcolor = \"gray10\"]\n\n   sample [ label=<\n     <table border=\"0\" cellborder=\"0\" cellspacing=\"1\">\n       <tr><td align=\"left\"><b>Sampling Units         <\/b><\/td><\/tr>\n       <tr><td align=\"left\"><font point-size=\"15\"> n = 9,000<\/font><\/td><\/tr>\n       <tr><td align=\"left\"><font point-size=\"15\"> - Sampling Unit ID<\/font><\/td><\/tr>\n       <tr><td align=\"left\"><font point-size=\"15\"><b> - Village Names<\/b><\/font><\/td><\/tr>\n       <tr><td align=\"left\"><font point-size=\"15\"><b> - Town Names<\/b><\/font><\/td><\/tr>\n<tr><td align=\"left\"><font point-size=\"15\"> - Admin Levels<\/font><\/td><\/tr>\n<tr><td align=\"left\"> <\/td><\/tr>\n<tr><td align=\"left\"> <\/td><\/tr>\n<tr><td align=\"left\"> <\/td><\/tr>\n<tr><td align=\"left\"> <\/td><\/tr>\n<tr><td align=\"left\"> <\/td><\/tr>\n<tr><td align=\"left\"> <\/td><\/tr>\n     <\/table>>]\n\n   vill [ label=<\n     <table border=\"0\" cellborder=\"0\" cellspacing=\"1\">\n       <tr><td align=\"left\"><b>Village Amenities         <\/b><\/td><\/tr>\n       <tr><td align=\"left\"><font point-size=\"15\"> n = 640,000<\/font><\/td><\/tr>\n       <tr><td align=\"left\"><font point-size=\"15\"><b> - Village Names<\/b><\/font><\/td><\/tr>\n       <tr><td align=\"left\"><font point-size=\"15\" color = \"black\"> - Latitude and Longitude<\/font><\/td><\/tr>\n<tr><td align=\"left\"><font point-size=\"15\"> - Admin Levels<\/font><\/td><\/tr>\n       <tr><td align=\"left\"> <\/td><\/tr>\n     <\/table>>, shape = cylinder, fillcolor = lightsteelblue1]\n\n   twn [ label=<\n     <table border=\"0\" cellborder=\"0\" cellspacing=\"1\">\n       <tr><td align=\"left\"><b>Town Amenities         <\/b><\/td><\/tr>\n       <tr><td align=\"left\"><font point-size=\"15\"> n = 8,000<\/font><\/td><\/tr>\n       <tr><td align=\"left\"><font point-size=\"15\"><b> - Town Names<\/b><\/font><\/td><\/tr>\n       <tr><td align=\"left\"><font point-size=\"15\" color = \"black\"> - Latitude and Longitude<\/font><\/td><\/tr>\n<tr><td align=\"left\"><font point-size=\"15\"> - Admin Levels<\/font><\/td><\/tr>\n       <tr><td align=\"left\"> <\/td><\/tr>\n     <\/table>>, shape = cylinder, fillcolor = lightsteelblue1]\n\nsample -> {vill twn} [arrowhead=none]\n\n}","config":{"engine":"dot","options":null}},"evals":[],"jsHooks":[]}</script>

---
name: pic2
class: animated, fadeIn, faster
<div id="htmlwidget-fe9ac950feb1b9d9ac22" style="width:1000px;height:500px;" class="grViz html-widget"></div>
<script type="application/json" data-for="htmlwidget-fe9ac950feb1b9d9ac22">{"x":{"diagram":"digraph {\ngraph [rankdir = LR]\nnode [shape = rectangle, fontsize = \"22\", style = filled, fillcolor = Linen, fontname = \"Arial\", shape = \"tab\" fontcolor = \"gray10\"]\n\n  # survey [ label=<\n  #  <table border=\"0\" cellborder=\"0\" cellspacing=\"1\">\n  #    <tr><td align=\"left\"><b>Survey Forms         <\/b><\/td><\/tr>\n  #    <tr><td align=\"left\"><font point-size=\"15\"><b> - Sampling Unit ID<\/b><\/font><\/td><\/tr>\n  #    <tr><td align=\"left\"><font point-size=\"15\"> - Household Information<\/font><\/td><\/tr>\n  #    <tr><td align=\"left\"><font point-size=\"15\"> - Village Names<\/font><\/td><\/tr>\n  #    <tr><td align=\"left\"><font point-size=\"15\"> - Town Names<\/font><\/td><\/tr>\n  #    <tr><td align=\"left\"> <\/td><\/tr>\n  #    <tr><td align=\"left\"> <\/td><\/tr>\n  # <\/table>>]\n\n   sample [ label=<\n     <table border=\"0\" cellborder=\"0\" cellspacing=\"1\">\n       <tr><td align=\"left\"><b>Sampling Units         <\/b><\/td><\/tr>\n<tr><td align=\"left\"><font point-size=\"15\"> n = 9,000<\/font><\/td><\/tr>\n       <tr><td align=\"left\"><font point-size=\"15\"><b> - Sampling Unit ID<\/b><\/font><\/td><\/tr>\n       <tr><td align=\"left\"><font point-size=\"15\"> - Village Names<\/font><\/td><\/tr>\n       <tr><td align=\"left\"><font point-size=\"15\"> - Town Names<\/font><\/td><\/tr>\n<tr><td align=\"left\"><font point-size=\"15\"> - Admin Levels<\/font><\/td><\/tr>\n<tr><td align=\"left\"> <\/td><\/tr>\n<tr><td align=\"left\"> <\/td><\/tr>\n<tr><td align=\"left\"> <\/td><\/tr>\n<tr><td align=\"left\"> <\/td><\/tr>\n<tr><td align=\"left\"> <\/td><\/tr>\n<tr><td align=\"left\"> <\/td><\/tr>\n     <\/table>>]\n\n   vill [ label=<\n     <table border=\"0\" cellborder=\"0\" cellspacing=\"1\">\n       <tr><td align=\"left\"><b>Village Amenities         <\/b><\/td><\/tr>\n<tr><td align=\"left\"><font point-size=\"15\"> n = 640,000<\/font><\/td><\/tr>\n       <tr><td align=\"left\"><font point-size=\"15\"> - Village Names<\/font><\/td><\/tr>\n       <tr><td align=\"left\"><font point-size=\"15\" color = \"black\"><b> - Latitude and Longitude<\/b><\/font><\/td><\/tr>\n<tr><td align=\"left\"><font point-size=\"15\"> - Admin Levels<\/font><\/td><\/tr>\n       <tr><td align=\"left\"> <\/td><\/tr>\n     <\/table>>, shape = cylinder, fillcolor = lightsteelblue1]\n\n   twn [ label=<\n     <table border=\"0\" cellborder=\"0\" cellspacing=\"1\">\n       <tr><td align=\"left\"><b>Town Amenities         <\/b><\/td><\/tr>\n<tr><td align=\"left\"><font point-size=\"15\"> n = 8,000<\/font><\/td><\/tr>\n       <tr><td align=\"left\"><font point-size=\"15\"> - Town Names<\/font><\/td><\/tr>\n       <tr><td align=\"left\"><font point-size=\"15\" color = \"black\"><b> - Latitude and Longitude<\/b><\/font><\/td><\/tr>\n<tr><td align=\"left\"><font point-size=\"15\"> - Admin Levels<\/font><\/td><\/tr>\n       <tr><td align=\"left\"> <\/td><\/tr>\n     <\/table>>, shape = cylinder, fillcolor = lightsteelblue1]\n\n   out [ label=<\n     <table border=\"0\" cellborder=\"0\" cellspacing=\"1\">\n       <tr><td align=\"left\"><b>Desired Output         <\/b><\/td><\/tr>\n<tr><td align=\"left\"><font point-size=\"15\"> n = ??? <\/font><\/td><\/tr>\n       <tr><td align=\"left\"><font point-size=\"15\"> - <b>Sampling Unit ID<\/b><\/font><\/td><\/tr>\n       <tr><td align=\"left\"><font point-size=\"15\" color = \"black\"> - <b>Latitude and Longitude<\/b><\/font><\/td><\/tr>\n       <tr><td align=\"left\"> <\/td><\/tr>\n     <\/table>>, shape = cylinder, fillcolor = skyblue2]\n\n# survey -> sample \nsample -> {vill twn} [arrowhead=none]\n{sample vill twn} -> out\n}","config":{"engine":"dot","options":null}},"evals":[],"jsHooks":[]}</script>

???

---
name: probs
class: animated, fadeIn, faster

.pull-left[
# Problem(s)

**No foreign key for consistent joins...**

Village/town name variable  
 - Messy text (case, symbols, spaces)
 - Alternative spelling
 - True value vs. data entry error
]

.pull-right[

```r
# how do I join on this?
"Basaimeo(149)"
"1 Ffb"
"Hoogrijan T.E.116/64 & 66 Nlr"
"Lingzey (Assam)$"
"(5 K.M. dairy Mill Tinkengsung)"
" Panipat (M Cl + OG)"
"12 Miles It Road"
"Wagon Pathar IV (Maanbasti )"
"Vadiyavalasa @ Vaddevalasa"
"MEGHANA"
"(Maithong II/(Namphai II)"
  
# the same location?
"Kanakpur Pt. II (CT)"
" Kanakpur Pt Ii"
"Kanakpur Ii"

# is this entry error?
"Baniakuri"
"Banjakuri"
```

]

???

---
name: precedent
class: animated, fadeIn, faster
exclude: false

.pull-left[

# What Precedent Existed?

**SQL Queries**
 - Standardize names (letter case)
 - Basic text cleaning (removing spaces)
 - Joins on "cleaned" variable
 - Nearest-Neighbour search

]

.pull-right[
<br>
<img src="https://wiki.postgresql.org/images/thumb/a/a4/PostgreSQL_logo.3colors.svg/540px-PostgreSQL_logo.3colors.svg.png" width="220" style="display: block; margin: auto;" />
<br>
<img src="https://www.osgeo.org/wp-content/uploads/postgis-logo-171x250.png" width="150" style="display: block; margin: auto;" />

]

???

---
name: rtitle
class: title-slide, middle, animated, fadeIn, faster

## How should we tackle this problem?

???

---
name: rtoolkit
class: animated, fadeIn, faster

.pull-left[
# R Packages
 - Database Connection
    - `DBI`
    - `RPostgres`
 - Cleaning, Processing, and Joins
    - `stringr`
    - `dplyr`
 - Fuzzy Matching
    - `stringdist`
 - Iteration
    - `purrr`

]

.pull-right[
<br>
<img src="img/stringr.svg" width="140" style="display: block; margin: auto;" /><img src="img/dplyr.svg" width="140" style="display: block; margin: auto;" /><img src="img/purrr.svg" width="140" style="display: block; margin: auto;" />

]

???

---
name: cleantext
class: animated, fadeIn, faster

# Cleaning Text

.pull-left[
Village and Town Names

- Remove punctuation
 - Trim white space (leading and trailing)
 - Remove extra spaces
 - Convert to lowercase

]

.pull-right[

```r
my_data
```

```
## # A tibble: 6 x 2
##   id    vilname                        
##   <chr> <chr>                          
## 1 01    Basaimeo(149)                  
## 2 02    Hoogrijan T.E.116/64 & 66 Nlr  
## 3 03    (5 K.M. dairy Mill Tinkengsung)
## 4 04    Wagon Pathar IV (Maanbasti )   
## 5 05    Vadiyavalasa @ Vaddevalasa     
## 6 06    " MEGHANA"
```

]

---
name: rmpunct
class: animated, fadeIn, faster

# Remove Punctuation

.s80[`stringr::str_replace_all()` replaces all matches]

```r
my_data$vilname <- str_replace_all(my_data$vilname, "[[:punct:]]+", "")
```
--

```
## # A tibble: 6 x 2
##   id    vilname                     
##   <chr> <chr>                       
## 1 01    Basaimeo149                 
## 2 02    Hoogrijan TE11664  66 Nlr   
## 3 03    5 KM dairy Mill Tinkengsung 
## 4 04    "Wagon Pathar IV Maanbasti "
## 5 05    Vadiyavalasa  Vaddevalasa   
## 6 06    " MEGHANA"
```

???

---
name: rmextraspaces 
class: animated, fadeIn, faster

# Remove White Space

.s80[`stringr::str_trim()` to trim white space]  
.s80[`stringr::str_replace_all()` replaces many spaces with just one]

```r
my_data$vilname <- str_trim(my_data$vilname, "both")
my_data$vilname <- str_replace_all(my_data$vilname, "[[:space:]]+", " ")
```
--

```
## # A tibble: 6 x 2
##   id    vilname                    
##   <chr> <chr>                      
## 1 01    Basaimeo149                
## 2 02    Hoogrijan TE11664 66 Nlr   
## 3 03    5 KM dairy Mill Tinkengsung
## 4 04    Wagon Pathar IV Maanbasti  
## 5 05    Vadiyavalasa Vaddevalasa   
## 6 06    MEGHANA
```

???
str_replace_all(my_data$vilname, "[[:space:]]+", " ") doesnt remove leading spaces

---
name: tolower
class: animated, fadeIn, faster

# Convert to Lower Case

.s80[`tolower()` to convert everything to lower-case]

```r
my_data$vilname <- tolower(my_data$vilname)
```
--
.pull-left50[

```r
my_data_original #original
```

.pull-right50[

```r
my_data #cleaned-ish!
```

```
## # A tibble: 6 x 2
##   id    vilname                    
##   <chr> <chr>                      
## 1 01    basaimeo149                
## 2 02    hoogrijan te11664 66 nlr   
## 3 03    5 km dairy mill tinkengsung
## 4 04    wagon pathar iv maanbasti  
## 5 05    vadiyavalasa vaddevalasa   
## 6 06    meghana
```

]

???

---
name: fncver
class: animated, fadeIn, faster

# Make a Function

Put all the steps in a function, then clean all `character` type columns

.pull-left50[

```r
my_cleaner <- function(x) {
  x %>% 
    str_replace_all("[[:punct:]]+", "") %>% 
    str_trim(side = "both") %>% 
    str_replace_all("\\s+", " ") %>% 
    tolower()
}
```
]

.pull-right50[

```r
my_data %>% 
  mutate_if(is.character, my_cleaner)
```

]

???

---
name: titlestringdist
class: title-slide, middle, animated, fadeIn, faster

## ...but can we really use this 'cleaned' variable?

???

---
name: fuzzymatch
class: animated, fadeIn, faster

# Fuzzy Matching

*"finding an approximate match, not exact"*

**String Distance**
 - Metric that measures the asimilarity or difference between two strings
 - Levenshtein Distance, Cosine Similarity, Jaro-Winkler, etc.
 - Lower number -> better match
 - [`stringdist`](https://github.com/markvanderloo/stringdist) R package

---
name: stringdist
class: animated, fadeIn, faster

# String Distance

```r
# more donuts
a <- "donut"
b <- c("donut", "dounut", "doughnut", "tunod")
```
--

```r
stringdist(a, b, method = "lv") # Levenshtein Distance
```

```
## [1] 0 1 3 4
```

```r
stringdist(a, b, method = "cos") # Cosine Similarity
```

```
## [1] 0.0000000 0.0513167 0.1514719 0.0000000
```

```r
stringdist(a, b, method = "jw") # Jaro-Winkler
```

```
## [1] 0.0000000 0.1222222 0.1916667 0.5333333
```

???

---
name: stringdistindia
class: animated, fadeIn, faster

# Village Names

Compare `"kanakapur II"` to names in `c1` and `c1a`.

```r
my_data2
```

```
## # A tibble: 3 x 3
##   c1                   c1a               c2          
##   <chr>                <chr>             <chr>       
## 1 Kanakpur Pt. II (CT) kanakpur pt ii ct kanakapur II
## 2 " Kanakpur Pt Ii"    kanakpur pt ii    kanakapur II
## 3 Kanakpur Ii          kanakpur ii       kanakapur II
```
--

```r
mutate_stringdist <- function(x, a, b, method, rev = FALSE) {
  A <- sym(a); B <- sym(b)
  mutate(
    x, !!paste(method, a, b, sep = "_") := 
*      stringdist(!!A, !!B, method = method, p = 0.2)
    )
}
```

---
name: stringdistindia
class: animated, fadeIn, faster

# Village Names

```r
my_data2_match <- my_data2 %>%
  mutate_stringdist("c1", "c2", method = "jw") %>% 
  mutate_stringdist("c1a", "c2", method = "jw") %>% 
  arrange_at(vars(matches("jw")))
```
--

```
## # A tibble: 3 x 5
##   c1                   c1a               c2           jw_c1_c2 jw_c1a_c2
##   <chr>                <chr>             <chr>           <dbl>     <dbl>
## 1 Kanakpur Ii          kanakpur ii       kanakapur II    0.218    0.0288
## 2 Kanakpur Pt. II (CT) kanakpur pt ii ct kanakapur II    0.289    0.0480
## 3 " Kanakpur Pt Ii"    kanakpur pt ii    kanakapur II    0.291    0.0405
```

---
name: purrr
class: animated, fadeIn, faster

# Iteration

In each data source "block" by state, district, substrict:
 - Compute string distance for village/town names using `stringdist()`  
 - Filter metric by a match threshold `jw < 0.1`

```r
# block no matches in village amenities table by admin levels
vil_no_matches_list <- vil_no_matches %>% 
  dplyr::group_split(state_code14, dist_code, subdist_code)

# same for sample units table
srs_no_matches_list <- srs_no_matches %>% 
  dplyr::group_split(state_code14, dist_code, subdist_code)

#loop through each block, calculate string distance, filter based on string distance metric
*purrr::map2(
  vil_no_matches_list, 
  srs_no_matches_list, 
  function(...) my_stringdist(...) %>% filter(jw < 0.1) #wrapper function around `stringdist()`
  ) 
```

???
36 states 
655 districts
5000 subdistricts

---
name: stringdistplot
class: animated, fadeIn, faster

# Output
.pull-left50[
 - After cleaning and fuzzy matching:
   - Villages **~96%** match
   - Towns: **~94%** match
- Automation of SQL queries via `DBI`
- Reduced manual matching!!!
]

.pull-right50[

]
???
still have to manually review 15,000 records but better than 42,000

---
name: takeaway
class: animated, fadeIn, faster

# Closing Thoughts
.pull-left[
 - R is great at cleaning messy text
- Prioritize comprehensive cleaning
- Fuzzy matching is useful (w/caveats)
- Leverage existing work

]

--
.pull-right[
![](https://github.com/allisonhorst/stats-illustrations/raw/master/rstats-artwork/code_hero_rstats.png)
.tiny[https://github.com/allisonhorst/stats-illustrations]
]

---
name: lastslide
class: bottom, center, animated, fadeIn, faster
<link rel="stylesheet" href="https://use.fontawesome.com/releases/v5.7.2/css/all.css" integrity="sha384-fnmOCqbTlWIlj8LyTjo7mOUStjsKC4pOpQbqyi7RrhN7udi9RwhKkMHpvLbHG9Sr" crossorigin="anonymous">

# Thank you!
<br>
<h4><i class="fab fa-github fa-lg"></i> <i class="fab fa-twitter fa-lg"></i>.title-font[&nbsp;@eugejoh]</h4>

.small[[Slides](https://github.com/eugejoh/datacleaning-brownbag)]

<br>
<br>
<br>
.tiny[Created via the R package [xaringan](https://github.com/yihui/xaringan)]