"FuzzyWuzzy is a tool to do text comparison for strings. An example of where this may be used is for comparing names in different datasets. Some people may go by their middle name, or by \"Mike\" instead fo \"Michael\". With FuzzyWuzzy, we can get a similarity score that accounts for either case.\n",
"\n",
"## Tutorial\n",
"\n",
"### Part 1 - Importing/ install\n",
"See code cell below for example import! It's normal to recieve a warning - this only impacts performance, not a big deal."
]
},
{
"cell_type": "code",
"execution_count": 22,
"id": "1cb37792-1e7e-4a4d-83b4-b4b1f0757046",
"metadata": {},
"outputs": [],
"source": [
"#!pip install fuzzywuzzy levenshtein\n",
"from fuzzywuzzy import fuzz"
]
},
{
"cell_type": "markdown",
"id": "c26e7835-3a61-455b-b11e-22d3462a2261",
"metadata": {},
"source": [
"### Part 2 - Using Fuzz\n",
"\n",
"The fuzz portion of FuzzyWuzzy is useful for simple string comparison. It contains several options to work better with differently formatted strings.\n"
]
},
{
"cell_type": "markdown",
"id": "f9af6597-5242-4996-b81c-01946483b035",
"metadata": {},
"source": [
"#### Simple Ratio\n",
"Simple Ratio takes the Levenstein difference to calculate the difference between two strings that are passed in.\n",
"In token sort ratio, strings are set to lowercase and punctuation is removed before comparison. This is useful to filter out noise in the data, because often we do not care about anything except the name string."
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "f3771c43-db57-40fa-b994-b5d1a176bed5",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"73"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"fuzz.token_sort_ratio(\"Greg!\", \"gregory\")"
]
},
{
"cell_type": "markdown",
"id": "3994a5ea-4648-425c-89ab-cd238315fc9e",
"metadata": {},
"source": [
"\n",
"#### Token Set Ratio\n",
"\n",
"Token set ratio is usefull in the case that somebody goes by a middle name. In addition to the lowercase and punctuation filtering in Token Set Ratio, it tokenizes the string (sorting out each word) and checks for subsets. If the intersection of the two sets perfectly match, the score is 100%. \n",
"\n",
"You can see that it performs better than token sort ratio in the following example:"
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "366ffae6-8068-46d5-a578-f3d3bbe77e74",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"100\n"
]
}
],
"source": [
"s1 = \"George Santos\" \n",
"s2 = \"George Anthony Devolder Santos\"\n",
"\n",
"\n",
"print(fuzz.token_set_ratio(s1,s2))\n"
]
},
{
"cell_type": "code",
"execution_count": 20,
"id": "c44d3c3e-7111-42e4-a315-1a008190b008",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"60\n"
]
}
],
"source": [
"print(fuzz.token_sort_ratio(s1,s2))"
]
},
{
"cell_type": "markdown",
"id": "d8684e99-1fdf-44b3-8246-35675985b897",
"metadata": {},
"source": [
"### Part 3 - using Process\n",
"\n",
"Process can be used to extract the closest match from a list of strings. "
"`process.extract` extracts all matches. It takes in a string, and a list of choiches. It returns a list of tuples of matches and the corresponding Token Set Ratio score. A limit can be set with the `limit` keyword argument."
"`process.extractOne` Only extracts one match. It takes in a string, and a list of potential matches, and it returns the closest match as a tuple of name and score. Practically, this is the same as `process.extract` when `limit = 1`"