Commit c3bc56b5 authored by pavlicag

Added FuzzyWuzzy Tutorial

parent b7578770
%% Cell type:markdown id:e63a9533-73f2-4db7-ba1e-5ac85d5a84d6 tags:
# FuzzyWuzzy Tutorial
## About FuzzyWuzzy
FuzzyWuzzy is a library for fuzzy string comparison. One place it is useful is comparing names across different datasets: some people go by their middle name, or by "Mike" instead of "Michael". With FuzzyWuzzy, we can compute a similarity score that accounts for either case.
## Tutorial
### Part 1 - Importing/Installing
See the code cell below for an example import. It's normal to receive a warning on import; it only affects performance and is not a big deal.
%% Cell type:code id:1cb37792-1e7e-4a4d-83b4-b4b1f0757046 tags:
``` python
#!pip install fuzzywuzzy levenshtein
from fuzzywuzzy import fuzz
```
%% Cell type:markdown id:c26e7835-3a61-455b-b11e-22d3462a2261 tags:
### Part 2 - Using Fuzz
The `fuzz` module of FuzzyWuzzy is useful for simple string comparison. It provides several scoring functions, each suited to differently formatted strings.
%% Cell type:markdown id:f9af6597-5242-4996-b81c-01946483b035 tags:
#### Simple Ratio
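Simple Ratio uses the Levenshtein distance to measure how different the two input strings are and returns a similarity score from 0 to 100. The strings are compared as-is, so case and punctuation count as differences.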
https://en.wikipedia.org/wiki/Levenshtein_distance
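
To make the idea concrete, here is a minimal pure-Python sketch of the Levenshtein distance itself. The function name `levenshtein` is our own, chosen for illustration, and is not part of FuzzyWuzzy. The library reports a 0-100 similarity score rather than a raw edit count, and its exact normalization is not reproduced here.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn `a` into `b` (illustrative helper)."""
    # prev[j] holds the distance between the processed prefix of `a` and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string costs i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or keep, if equal)
            ))
        prev = curr
    return prev[-1]


print(levenshtein("kitten", "sitting"))  # 3
print(levenshtein("Greg!", "gregory"))   # 4: case and "!" count as edits
```

Note that "G" versus "g" and the trailing "!" each cost an edit here, which is why untreated strings score lower than one might expect.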
%% Cell type:code id:e72ca0d5-e46b-4468-a95f-2c5b163a8c10 tags:
``` python
fuzz.ratio("Greg!","gregory")
```
%% Output
50
%% Cell type:markdown id:cd84f017-28a9-4483-a1b3-7ca3637cd198 tags:
#### Token Sort Ratio
In Token Sort Ratio, strings are lowercased and punctuation is removed; the remaining words (tokens) are then sorted alphabetically before comparison, so word order no longer matters. This is useful for filtering out noise in the data, because often we do not care about anything except the name itself.
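
The preprocessing can be sketched with the standard library alone. This is a rough approximation under stated assumptions, not FuzzyWuzzy's implementation: `sort_tokens` and `token_sort_score` are hypothetical names, the cleanup keeps only ASCII letters and digits, and the score comes from `difflib.SequenceMatcher`, so results may differ from `fuzz.token_sort_ratio`.

```python
import re
from difflib import SequenceMatcher


def sort_tokens(s: str) -> str:
    """Lowercase, replace punctuation with spaces, and sort the words.
    Hypothetical helper; ASCII-only cleanup is a simplification."""
    cleaned = re.sub(r"[^a-z0-9]+", " ", s.lower())
    return " ".join(sorted(cleaned.split()))


def token_sort_score(a: str, b: str) -> int:
    """Similarity (0-100) of the two normalized, token-sorted strings."""
    ratio = SequenceMatcher(None, sort_tokens(a), sort_tokens(b)).ratio()
    return round(100 * ratio)


print(sort_tokens("Okoniewski, Johnny"))  # johnny okoniewski
# Word order and the comma no longer matter after normalization:
print(token_sort_score("Johnny Okoniewski", "Okoniewski, Johnny"))  # 100
```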
%% Cell type:code id:f3771c43-db57-40fa-b994-b5d1a176bed5 tags:
``` python
fuzz.token_sort_ratio("Greg!", "gregory")
```
%% Output
73
%% Cell type:markdown id:3994a5ea-4648-425c-89ab-cd238315fc9e tags:
#### Token Set Ratio
Token Set Ratio is useful when somebody goes by a middle name. In addition to the lowercase and punctuation filtering done in Token Sort Ratio, it tokenizes each string (splitting it into words) and compares the resulting sets of tokens. If the tokens of one string are a subset of the other's, the score is 100.
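
The subset check can be illustrated with plain Python sets. This sketch only shows the tokenizing and set logic; the helper name `tokens` is hypothetical, the cleanup is ASCII-only, and it does not reproduce how the library turns the sets into a final score.

```python
import re


def tokens(s: str) -> set:
    """Lowercase, drop punctuation, and return the set of words.
    Hypothetical helper for illustration only."""
    return set(re.sub(r"[^a-z0-9]+", " ", s.lower()).split())


t1 = tokens("George Santos")
t2 = tokens("George Anthony Devolder Santos")

shared = t1 & t2   # tokens present in both strings
extra = t2 - t1    # tokens only in the longer name

print(sorted(shared))  # ['george', 'santos']
print(sorted(extra))   # ['anthony', 'devolder']
print(t1 <= t2)        # True: every token of the short name is in the long one
```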
You can see that it performs better than token sort ratio in the following example:
%% Cell type:code id:366ffae6-8068-46d5-a578-f3d3bbe77e74 tags:
``` python
s1 = "George Santos"
s2 = "George Anthony Devolder Santos"
print(fuzz.token_set_ratio(s1,s2))
```
%% Output
100
%% Cell type:code id:c44d3c3e-7111-42e4-a315-1a008190b008 tags:
``` python
print(fuzz.token_sort_ratio(s1,s2))
```
%% Output
60
%% Cell type:markdown id:d8684e99-1fdf-44b3-8246-35675985b897 tags:
### Part 3 - Using Process
Process can be used to extract the closest match from a list of strings.
%% Cell type:code id:3cbd00a0-3679-4266-8435-5e9ea8c84ee4 tags:
``` python
from fuzzywuzzy import process
list_of_strings = ["Gregory Zavalnitskiy", "Ben Ramsey", "Thao Nguyen", "Vivian Pavlica", "Okoniewski, Johnny"]
```
%% Cell type:markdown id:1906cc3d-a2f1-4d5e-85ac-7e816a3ecd0a tags:
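`process.extract` returns the best matches for a query. It takes a string and a list of choices, and returns a list of `(match, score)` tuples sorted from best to worst. By default the score comes from `fuzz.WRatio`, a weighted combination of several `fuzz` scorers; a different scorer can be passed with the `scorer` keyword argument. The number of results is capped by the `limit` keyword argument, which defaults to 5.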
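
The ranking step can be sketched with the standard library alone. `rank_matches` is a hypothetical helper, and it scores with `difflib.SequenceMatcher` instead of a FuzzyWuzzy scorer, so the numbers will not match those from `process.extract`; only the score-sort-truncate pattern is the point.

```python
from difflib import SequenceMatcher


def rank_matches(query, choices, limit=5):
    """Score every choice against `query` and return the top `limit`
    (choice, score) pairs, best first. Illustrative stand-in only."""
    scored = [
        (choice,
         round(100 * SequenceMatcher(None, query.lower(), choice.lower()).ratio()))
        for choice in choices
    ]
    # Python's sort is stable, so ties keep their original list order.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]


names = ["Gregory Zavalnitskiy", "Ben Ramsey", "Thao Nguyen",
         "Vivian Pavlica", "Okoniewski, Johnny"]

top3 = rank_matches("Viv", names, limit=3)
print(top3[0][0])  # Vivian Pavlica
```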
%% Cell type:code id:792009ed-6010-40be-a3eb-938c354210b3 tags:
``` python
process.extract("Viv", list_of_strings, limit=3)
```
%% Output
[('Vivian Pavlica', 90),
('Gregory Zavalnitskiy', 30),
('Okoniewski, Johnny', 30)]
%% Cell type:markdown id:421c4a11-e872-4df8-960a-a6bad34c53a4 tags:
`process.extractOne` extracts only one match. It takes a string and a list of potential matches, and returns the closest match as a `(name, score)` tuple. In practice, this is the same as the first element of the list returned by `process.extract` with `limit=1`.
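
When matching names across datasets, it is often helpful to reject weak matches instead of always accepting the top one. The sketch below shows that pattern with the standard library; `best_match` and its `cutoff` parameter are hypothetical names, and the scores come from `difflib.SequenceMatcher`, not from FuzzyWuzzy.

```python
from difflib import SequenceMatcher


def best_match(query, choices, cutoff=0):
    """Return the single best (choice, score) pair, or None when the
    best score is below `cutoff`. Illustrative stand-in only."""
    def score(choice):
        ratio = SequenceMatcher(None, query.lower(), choice.lower()).ratio()
        return round(100 * ratio)

    if not choices:
        return None
    winner = max(choices, key=score)  # first choice wins on ties
    winner_score = score(winner)
    return (winner, winner_score) if winner_score >= cutoff else None


names = ["Gregory Zavalnitskiy", "Ben Ramsey", "Thao Nguyen",
         "Vivian Pavlica", "Okoniewski, Johnny"]

print(best_match("Viv", names)[0])         # Vivian Pavlica
print(best_match("zzz", names, cutoff=50))  # None: nothing is close enough
```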
%% Cell type:code id:5930db0f-5a47-4159-a782-e1ac38c854e4 tags:
``` python
process.extractOne("Viv", list_of_strings)
```
%% Output
('Vivian Pavlica', 90)
%% Cell type:code id:17a22830-646b-4885-a2f3-0782ef1448f8 tags:
``` python
process.extractOne("Greg", list_of_strings)
```
%% Output
('Gregory Zavalnitskiy', 90)