name: title_slide class: middle, title-slide, animated, fadeIn, faster <div class="my-footer"></div> .titleslide-title[(Data Cleaning + Fuzzy Matching) in R] <br> ### .title-font[Data Cleaning Brown Bag] ####.title-font[Faculty of Information (iSchool) at University of Toronto] <br> ### <b> .title-font[Eugene Joh] </b> .title-font[*February 28 2020*] <br> <!-- <h4><i class="fab fa-github fa-lg"> </i> <i class="fab fa-twitter fa-lg"></i> .title-font[[ @eugejoh](https://twitter.com/eugejoh)] --> <div class = "my-footer"><span>Disclaimer: views and opinions expressed here are my own and do not represent my employer</div> ??? --- name: intro class: animated, fadeIn, faster .pull-left[ # Outline - Challenges with global health datasets - Geocoding locations using R - Cleaning messy text - Fuzzy matching - Iteration - Takeaways ] .pull-right[ <br> <img src="https://cdn.onlinewebfonts.com/svg/img_385487.png" width="100" style="display: block; margin: auto;" /> <br> <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/1/1b/R_logo.svg/724px-R_logo.svg.png" width="120px" style="display: block; margin: auto;" /> <br> <img src="https://www.traillink.com/images/tl/custom-icons/new-map-icon.svg" width="170px" style="display: block; margin: auto;" /> ] ??? --- name: tweet class: animated, fadeIn, faster <center> <blockquote class="twitter-tweet" width="300" data-lang="en"><p lang="en" dir="ltr"><a href="https://twitter.com/causalinf/status/1230146663207096320">19, February 2020</a></blockquote> </center> ??? --- name: globalhealthtitle class: title-slide, middle, animated, fadeIn, faster # Challenges with Cleaning Global Health Data ??? --- name: ghdata class: animated, fadeIn, faster # Global Health *"improving the health of populations at the international level"* .pull-left[ **Health Surveys** - Sample-based - Nested in vital statistics - Household-level information - E.g. malaria, HIV, maternal & child health, malnutrition, etc. ] .pull-right[ <img src="https://2oqz471sa19h3vbwa53m33yj-wpengine.netdna-ssl.com/wp-content/uploads/2017/09/681viwphrfnz-1.gif" width="350" /> ] ??? --- name: commonchallenges class: animated, fadeIn, faster # Common Data Challenges .pull-left50[ - Data collection -> entry errors - Missing/incomplete data - Ambiguity due to... - Alternative/inconsistent spelling - Multiple languages ] -- .pull-right50[ ```r # I know what this means "doughnut" "donut" # Hmmm... Google please help me "Gothra" #a village? "Gotra" #a clan or lineage? maybe a village? "Kaipukharia" #a village "Kaipukuria" #also a village, k "Kaipukhuria" #and also a village..? ``` ] ??? --- name: geocodingtitle class: title-slide, middle, animated, fadeIn, faster # Geocoding Locations using R ??? --- name: projectdescr class: animated, fadeIn, faster # Project Description **Objective** - Assign latitude/longitude to survey sampling units **Data Sources** - Sampling units master list - Village amenities - Town amenities ??? --- name: pic1 class: animated, fadeIn, faster <br>
--- name: pic2 class: animated, fadeIn, faster
??? --- name: probs class: animated, fadeIn, faster .pull-left[ # Problem(s) **No foreign key for consistent joins...** Village/town name variable - Messy text (case, symbols, spaces) - Alternative spelling - True value vs. data entry error ] -- .pull-right[ <br> <br> ```r # how do I join on this? "Basaimeo(149)" "1 Ffb" "Hoogrijan T.E.116/64 & 66 Nlr" "Lingzey (Assam)$" "(5 K.M. dairy Mill Tinkengsung)" " Panipat (M Cl + OG)" "12 Miles It Road" "Wagon Pathar IV (Maanbasti )" "Vadiyavalasa @ Vaddevalasa" "MEGHANA" "(Maithong II/(Namphai II)" # the same location? "Kanakpur Pt. II (CT)" " Kanakpur Pt Ii" "Kanakpur Ii" # is this entry error? "Baniakuri" "Banjakuri" ``` ] ??? --- name: precedent class: animated, fadeIn, faster exclude: false .pull-left[ # What Precedent Existed? **SQL Queries** - Standardize names (letter case) - Basic text cleaning (removing spaces) - Joins on "cleaned" variable - Nearest-Neighbour search ] .pull-right[ <br> <img src="https://wiki.postgresql.org/images/thumb/a/a4/PostgreSQL_logo.3colors.svg/540px-PostgreSQL_logo.3colors.svg.png" width="220" style="display: block; margin: auto;" /> <br> <img src="https://www.osgeo.org/wp-content/uploads/postgis-logo-171x250.png" width="150" style="display: block; margin: auto;" /> ] ??? --- name: rtitle class: title-slide, middle, animated, fadeIn, faster ## How should we tackle this problem? ??? --- name: rtoolkit class: animated, fadeIn, faster .pull-left[ # R Packages - Database Connection - `DBI` - `RPostgres` - Cleaning, Processing, and Joins - `stringr` - `dplyr` - Fuzzy Matching - `stringdist` - Iteration - `purrr` ] .pull-right[ <br> <img src="img/stringr.svg" width="140" style="display: block; margin: auto;" /><img src="img/dplyr.svg" width="140" style="display: block; margin: auto;" /><img src="img/purrr.svg" width="140" style="display: block; margin: auto;" /> ] ??? --- name: cleantext class: animated, fadeIn, faster # Cleaning Text .pull-left[ Village and Town Names - Remove punctuation - Trim white space (leading and trailing) - Remove extra spaces - Convert to lowercase ] .pull-right[ ```r my_data ``` ``` ## # A tibble: 6 x 2 ## id vilname ## <chr> <chr> ## 1 01 Basaimeo(149) ## 2 02 Hoogrijan T.E.116/64 & 66 Nlr ## 3 03 (5 K.M. dairy Mill Tinkengsung) ## 4 04 Wagon Pathar IV (Maanbasti ) ## 5 05 Vadiyavalasa @ Vaddevalasa ## 6 06 " MEGHANA" ``` ] --- name: rmpunct class: animated, fadeIn, faster # Remove Punctuation .s80[`stringr::str_replace_all()` replaces all matches] ```r my_data$vilname <- str_replace_all(my_data$vilname, "[[:punct:]]+", "") ``` -- ``` ## # A tibble: 6 x 2 ## id vilname ## <chr> <chr> ## 1 01 Basaimeo149 ## 2 02 Hoogrijan TE11664 66 Nlr ## 3 03 5 KM dairy Mill Tinkengsung ## 4 04 "Wagon Pathar IV Maanbasti " ## 5 05 Vadiyavalasa Vaddevalasa ## 6 06 " MEGHANA" ``` ??? --- name: rmextraspaces class: animated, fadeIn, faster # Remove White Space .s80[`stringr::str_trim()` to trim white space] .s80[`stringr::str_replace_all()` replaces many spaces with just one] ```r my_data$vilname <- str_trim(my_data$vilname, "both") my_data$vilname <- str_replace_all(my_data$vilname, "[[:space:]]+", " ") ``` -- ``` ## # A tibble: 6 x 2 ## id vilname ## <chr> <chr> ## 1 01 Basaimeo149 ## 2 02 Hoogrijan TE11664 66 Nlr ## 3 03 5 KM dairy Mill Tinkengsung ## 4 04 Wagon Pathar IV Maanbasti ## 5 05 Vadiyavalasa Vaddevalasa ## 6 06 MEGHANA ``` ??? str_replace_all(my_data$vilname, "[[:space:]]+", " ") doesnt remove leading spaces --- name: tolower class: animated, fadeIn, faster # Convert to Lower Case .s80[`tolower()` to convert everything to lower-case] ```r my_data$vilname <- tolower(my_data$vilname) ``` -- .pull-left50[ ```r my_data_original #original ``` ``` ## # A tibble: 6 x 2 ## id vilname ## <chr> <chr> ## 1 01 Basaimeo(149) ## 2 02 Hoogrijan T.E.116/64 & 66 Nlr ## 3 03 (5 K.M. dairy Mill Tinkengsung) ## 4 04 Wagon Pathar IV (Maanbasti ) ## 5 05 Vadiyavalasa @ Vaddevalasa ## 6 06 " MEGHANA" ``` ] .pull-right50[ ```r my_data #cleaned-ish! ``` ``` ## # A tibble: 6 x 2 ## id vilname ## <chr> <chr> ## 1 01 basaimeo149 ## 2 02 hoogrijan te11664 66 nlr ## 3 03 5 km dairy mill tinkengsung ## 4 04 wagon pathar iv maanbasti ## 5 05 vadiyavalasa vaddevalasa ## 6 06 meghana ``` ] ??? --- name: fncver class: animated, fadeIn, faster # Make a Function Put all the steps in a function, then clean all `character` type columns .pull-left50[ ```r my_cleaner <- function(x) { x %>% str_replace_all("[[:punct:]]+", "") %>% str_trim(side = "both") %>% str_replace_all("\\s+", " ") %>% tolower() } ``` ] .pull-right50[ ```r my_data %>% mutate_if(is.character, my_cleaner) ``` ``` ## # A tibble: 6 x 2 ## id vilname ## <chr> <chr> ## 1 01 basaimeo149 ## 2 02 hoogrijan te11664 66 nlr ## 3 03 5 km dairy mill tinkengsung ## 4 04 wagon pathar iv maanbasti ## 5 05 vadiyavalasa vaddevalasa ## 6 06 meghana ``` ] ??? --- name: titlestringdist class: title-slide, middle, animated, fadeIn, faster ## ...but can we really use this 'cleaned' variable? ??? --- name: fuzzymatch class: animated, fadeIn, faster # Fuzzy Matching *"finding an approximate match, not exact"* **String Distance** - Metric that measures the asimilarity or difference between two strings - Levenshtein Distance, Cosine Similarity, Jaro-Winkler, etc. - Lower number -> better match - [`stringdist`](https://github.com/markvanderloo/stringdist) R package -- <img src="https://media.giphy.com/media/CgKFTMMFDESNW/giphy.gif" width="190" style="display: block; margin: auto auto auto 0;" /> --- name: stringdist class: animated, fadeIn, faster # String Distance ```r # more donuts a <- "donut" b <- c("donut", "dounut", "doughnut", "tunod") ``` -- ```r stringdist(a, b, method = "lv") # Levenshtein Distance ``` ``` ## [1] 0 1 3 4 ``` ```r stringdist(a, b, method = "cos") # Cosine Similarity ``` ``` ## [1] 0.0000000 0.0513167 0.1514719 0.0000000 ``` ```r stringdist(a, b, method = "jw") # Jaro-Winkler ``` ``` ## [1] 0.0000000 0.1222222 0.1916667 0.5333333 ``` ??? --- name: stringdistindia class: animated, fadeIn, faster # Village Names Compare `"kanakapur II"` to names in `c1` and `c1a`. ```r my_data2 ``` ``` ## # A tibble: 3 x 3 ## c1 c1a c2 ## <chr> <chr> <chr> ## 1 Kanakpur Pt. II (CT) kanakpur pt ii ct kanakapur II ## 2 " Kanakpur Pt Ii" kanakpur pt ii kanakapur II ## 3 Kanakpur Ii kanakpur ii kanakapur II ``` -- ```r mutate_stringdist <- function(x, a, b, method, rev = FALSE) { A <- sym(a); B <- sym(b) mutate( x, !!paste(method, a, b, sep = "_") := * stringdist(!!A, !!B, method = method, p = 0.2) ) } ``` --- name: stringdistindia class: animated, fadeIn, faster # Village Names ```r my_data2_match <- my_data2 %>% mutate_stringdist("c1", "c2", method = "jw") %>% mutate_stringdist("c1a", "c2", method = "jw") %>% arrange_at(vars(matches("jw"))) ``` -- ``` ## # A tibble: 3 x 5 ## c1 c1a c2 jw_c1_c2 jw_c1a_c2 ## <chr> <chr> <chr> <dbl> <dbl> ## 1 Kanakpur Ii kanakpur ii kanakapur II 0.218 0.0288 ## 2 Kanakpur Pt. II (CT) kanakpur pt ii ct kanakapur II 0.289 0.0480 ## 3 " Kanakpur Pt Ii" kanakpur pt ii kanakapur II 0.291 0.0405 ``` --- name: purrr class: animated, fadeIn, faster # Iteration In each data source "block" by state, district, substrict: - Compute string distance for village/town names using `stringdist()` - Filter metric by a match threshold `jw < 0.1` -- ```r # block no matches in village amenities table by admin levels vil_no_matches_list <- vil_no_matches %>% dplyr::group_split(state_code14, dist_code, subdist_code) # same for sample units table srs_no_matches_list <- srs_no_matches %>% dplyr::group_split(state_code14, dist_code, subdist_code) #loop through each block, calculate string distance, filter based on string distance metric *purrr::map2( vil_no_matches_list, srs_no_matches_list, function(...) my_stringdist(...) %>% filter(jw < 0.1) #wrapper function around `stringdist()` ) ``` ??? 36 states 655 districts 5000 subdistricts --- name: stringdistplot class: animated, fadeIn, faster # Output .pull-left50[ - After cleaning and fuzzy matching: - Villages **~96%** match - Towns: **~94%** match - Automation of SQL queries via `DBI` - Reduced manual matching!!! ] .pull-right50[ <img src="img/fuzzy_match_dplot.png" width="850" style="display: block; margin: auto;" /> ] ??? still have to manually review 15,000 records but better than 42,000 --- name: takeaway class: animated, fadeIn, faster # Closing Thoughts .pull-left[ - R is great at cleaning messy text - Prioritize comprehensive cleaning - Fuzzy matching is useful (w/caveats) - Leverage existing work ] -- .pull-right[ ![](https://github.com/allisonhorst/stats-illustrations/raw/master/rstats-artwork/code_hero_rstats.png)<!-- --> .tiny[https://github.com/allisonhorst/stats-illustrations] ] --- name: lastslide class: bottom, center, animated, fadeIn, faster <link rel="stylesheet" href="https://use.fontawesome.com/releases/v5.7.2/css/all.css" integrity="sha384-fnmOCqbTlWIlj8LyTjo7mOUStjsKC4pOpQbqyi7RrhN7udi9RwhKkMHpvLbHG9Sr" crossorigin="anonymous"> # Thank you! <br> <h4><i class="fab fa-github fa-lg"></i> <i class="fab fa-twitter fa-lg"></i>.title-font[ @eugejoh]</h4> .small[[Slides](https://github.com/eugejoh/datacleaning-brownbag)] <br> <br> <br> .tiny[Created via the R package [xaringan](https://github.com/yihui/xaringan)]