Added FuzzyWuzzy Tutorial

c3bc56b5 · pavlicag · b7578770 · c3bc56b5
Commit c3bc56b5 authored 1 year ago by pavlicag
--- a/FuzzyWuzzy.ipynb
+++ b/FuzzyWuzzy.ipynb
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "e63a9533-73f2-4db7-ba1e-5ac85d5a84d6",
+   "metadata": {},
+   "source": [
+    "# FuzzyWuzzy Tutorial\n",
+    "\n",
+    "## About FuzzyWuzzy\n",
+    "\n",
+    "FuzzyWuzzy is a tool to do text comparison for strings. An example of where this may be used is for comparing names in different datasets. Some people may go by their middle name, or by \"Mike\" instead fo \"Michael\". With FuzzyWuzzy, we can get a similarity score that accounts for either case.\n",
+    "\n",
+    "## Tutorial\n",
+    "\n",
+    "### Part 1 - Importing/ install\n",
+    "See code cell below for example import! It's normal to recieve a warning - this only impacts performance, not a big deal."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 22,
+   "id": "1cb37792-1e7e-4a4d-83b4-b4b1f0757046",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "#!pip install fuzzywuzzy levenshtein\n",
+    "from fuzzywuzzy import fuzz"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "c26e7835-3a61-455b-b11e-22d3462a2261",
+   "metadata": {},
+   "source": [
+    "### Part 2 - Using Fuzz\n",
+    "\n",
+    "The fuzz portion of FuzzyWuzzy is useful for simple string comparison. It contains several options to work better with differently formatted strings.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "f9af6597-5242-4996-b81c-01946483b035",
+   "metadata": {},
+   "source": [
+    "#### Simple Ratio\n",
+    "Simple Ratio takes the Levenstein difference to calculate the difference between two strings that are passed in.\n",
+    "\n",
+    "https://en.wikipedia.org/wiki/Levenshtein_distance"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 15,
+   "id": "e72ca0d5-e46b-4468-a95f-2c5b163a8c10",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "50"
+      ]
+     },
+     "execution_count": 15,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "fuzz.ratio(\"Greg!\",\"gregory\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "cd84f017-28a9-4483-a1b3-7ca3637cd198",
+   "metadata": {
+    "tags": []
+   },
+   "source": [
+    "#### Token Sort Ratio\n",
+    "\n",
+    "In token sort ratio, strings are set to lowercase and punctuation is removed before comparison. This is useful to filter out noise in the data, because often we do not care about anything except the name string."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 18,
+   "id": "f3771c43-db57-40fa-b994-b5d1a176bed5",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "73"
+      ]
+     },
+     "execution_count": 18,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "fuzz.token_sort_ratio(\"Greg!\", \"gregory\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "3994a5ea-4648-425c-89ab-cd238315fc9e",
+   "metadata": {},
+   "source": [
+    "\n",
+    "#### Token Set Ratio\n",
+    "\n",
+    "Token set ratio is usefull in the case that somebody goes by a middle name. In addition to the lowercase and punctuation filtering in Token Set Ratio, it tokenizes the string (sorting out each word) and checks for subsets. If the intersection of the two sets perfectly match, the score is 100%. \n",
+    "\n",
+    "You can see that it performs better than token sort ratio in the following example:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 19,
+   "id": "366ffae6-8068-46d5-a578-f3d3bbe77e74",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "100\n"
+     ]
+    }
+   ],
+   "source": [
+    "s1 = \"George Santos\" \n",
+    "s2 = \"George Anthony Devolder Santos\"\n",
+    "\n",
+    "\n",
+    "print(fuzz.token_set_ratio(s1,s2))\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 20,
+   "id": "c44d3c3e-7111-42e4-a315-1a008190b008",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "60\n"
+     ]
+    }
+   ],
+   "source": [
+    "print(fuzz.token_sort_ratio(s1,s2))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "d8684e99-1fdf-44b3-8246-35675985b897",
+   "metadata": {},
+   "source": [
+    "### Part 3 - using Process\n",
+    "\n",
+    "Process can be used to extract the closest match from a list of strings. "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 31,
+   "id": "3cbd00a0-3679-4266-8435-5e9ea8c84ee4",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from fuzzywuzzy import process\n",
+    "list_of_strings = [\"Gregory Zavalnitskiy\", \"Ben Ramsey\", \"Thao Nguyen\", \"Vivian Pavlica\", \"Okoniewski, Johnny\"]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "1906cc3d-a2f1-4d5e-85ac-7e816a3ecd0a",
+   "metadata": {},
+   "source": [
+    "`process.extract` extracts all matches. It takes in a string, and a list of choiches. It returns a list of tuples of matches and the corresponding Token Set Ratio score. A limit can be set with the `limit` keyword argument."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 34,
+   "id": "792009ed-6010-40be-a3eb-938c354210b3",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "[('Vivian Pavlica', 90),\n",
+       " ('Gregory Zavalnitskiy', 30),\n",
+       " ('Okoniewski, Johnny', 30)]"
+      ]
+     },
+     "execution_count": 34,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "process.extract(\"Viv\", list_of_strings, limit = 3)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "421c4a11-e872-4df8-960a-a6bad34c53a4",
+   "metadata": {},
+   "source": [
+    "`process.extractOne` Only extracts one match. It takes in a string, and a list of potential matches, and it returns the closest match as a tuple of name and score. Practically, this is the same as `process.extract` when `limit = 1`"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 35,
+   "id": "5930db0f-5a47-4159-a782-e1ac38c854e4",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "('Vivian Pavlica', 90)"
+      ]
+     },
+     "execution_count": 35,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "process.extractOne(\"Viv\", list_of_strings)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 29,
+   "id": "17a22830-646b-4885-a2f3-0782ef1448f8",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "('Gregory Zavalnitskiy', 90)"
+      ]
+     },
+     "execution_count": 29,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "process.extractOne(\"Greg\", list_of_strings)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "5da26873-7e9f-459e-b37b-bccda93e93af",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.9.16"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
+%% Cell type:markdown id:e63a9533-73f2-4db7-ba1e-5ac85d5a84d6 tags:
+
+# FuzzyWuzzy Tutorial
+
+## About FuzzyWuzzy
+
+FuzzyWuzzy is a tool to do text comparison for strings. An example of where this may be used is for comparing names in different datasets. Some people may go by their middle name, or by "Mike" instead fo "Michael". With FuzzyWuzzy, we can get a similarity score that accounts for either case.
+
+## Tutorial
+
+### Part 1 - Importing/ install
+See code cell below for example import! It's normal to recieve a warning - this only impacts performance, not a big deal.
+
+%% Cell type:code id:1cb37792-1e7e-4a4d-83b4-b4b1f0757046 tags:
+
+``` python
+#!pip install fuzzywuzzy levenshtein
+from fuzzywuzzy import fuzz
+```
+
+%% Cell type:markdown id:c26e7835-3a61-455b-b11e-22d3462a2261 tags:
+
+### Part 2 - Using Fuzz
+
+The fuzz portion of FuzzyWuzzy is useful for simple string comparison. It contains several options to work better with differently formatted strings.
+
+%% Cell type:markdown id:f9af6597-5242-4996-b81c-01946483b035 tags:
+
+#### Simple Ratio
+Simple Ratio takes the Levenstein difference to calculate the difference between two strings that are passed in.
+
+https://en.wikipedia.org/wiki/Levenshtein_distance
+
+%% Cell type:code id:e72ca0d5-e46b-4468-a95f-2c5b163a8c10 tags:
+
+``` python
+fuzz.ratio("Greg!","gregory")
+```
+
+%% Output
+
+    50
+
+%% Cell type:markdown id:cd84f017-28a9-4483-a1b3-7ca3637cd198 tags:
+
+#### Token Sort Ratio
+
+In token sort ratio, strings are set to lowercase and punctuation is removed before comparison. This is useful to filter out noise in the data, because often we do not care about anything except the name string.
+
+%% Cell type:code id:f3771c43-db57-40fa-b994-b5d1a176bed5 tags:
+
+``` python
+fuzz.token_sort_ratio("Greg!", "gregory")
+```
+
+%% Output
+
+    73
+
+%% Cell type:markdown id:3994a5ea-4648-425c-89ab-cd238315fc9e tags:
+
+
+#### Token Set Ratio
+
+Token set ratio is usefull in the case that somebody goes by a middle name. In addition to the lowercase and punctuation filtering in Token Set Ratio, it tokenizes the string (sorting out each word) and checks for subsets. If the intersection of the two sets perfectly match, the score is 100%.
+
+You can see that it performs better than token sort ratio in the following example:
+
+%% Cell type:code id:366ffae6-8068-46d5-a578-f3d3bbe77e74 tags:
+
+``` python
+s1 = "George Santos"
+s2 = "George Anthony Devolder Santos"
+
+
+print(fuzz.token_set_ratio(s1,s2))
+```
+
+%% Output
+
+    100
+
+%% Cell type:code id:c44d3c3e-7111-42e4-a315-1a008190b008 tags:
+
+``` python
+print(fuzz.token_sort_ratio(s1,s2))
+```
+
+%% Output
+
+    60
+
+%% Cell type:markdown id:d8684e99-1fdf-44b3-8246-35675985b897 tags:
+
+### Part 3 - using Process
+
+Process can be used to extract the closest match from a list of strings.
+
+%% Cell type:code id:3cbd00a0-3679-4266-8435-5e9ea8c84ee4 tags:
+
+``` python
+from fuzzywuzzy import process
+list_of_strings = ["Gregory Zavalnitskiy", "Ben Ramsey", "Thao Nguyen", "Vivian Pavlica", "Okoniewski, Johnny"]
+```
+
+%% Cell type:markdown id:1906cc3d-a2f1-4d5e-85ac-7e816a3ecd0a tags:
+
+`process.extract` extracts all matches. It takes in a string, and a list of choiches. It returns a list of tuples of matches and the corresponding Token Set Ratio score. A limit can be set with the `limit` keyword argument.
+
+%% Cell type:code id:792009ed-6010-40be-a3eb-938c354210b3 tags:
+
+``` python
+process.extract("Viv", list_of_strings, limit = 3)
+```
+
+%% Output
+
+    [('Vivian Pavlica', 90),
+     ('Gregory Zavalnitskiy', 30),
+     ('Okoniewski, Johnny', 30)]
+
+%% Cell type:markdown id:421c4a11-e872-4df8-960a-a6bad34c53a4 tags:
+
+`process.extractOne` Only extracts one match. It takes in a string, and a list of potential matches, and it returns the closest match as a tuple of name and score. Practically, this is the same as `process.extract` when `limit = 1`
+
+%% Cell type:code id:5930db0f-5a47-4159-a782-e1ac38c854e4 tags:
+
+``` python
+process.extractOne("Viv", list_of_strings)
+```
+
+%% Output
+
+    ('Vivian Pavlica', 90)
+
+%% Cell type:code id:17a22830-646b-4885-a2f3-0782ef1448f8 tags:
+
+``` python
+process.extractOne("Greg", list_of_strings)
+```
+
+%% Output
+
+    ('Gregory Zavalnitskiy', 90)
+
+%% Cell type:code id:5da26873-7e9f-459e-b37b-bccda93e93af tags:
+
+``` python
+```