Compare revisions

shawk masboob
--- a/Introduction.ipynb
+++ b/Introduction.ipynb
-%% Cell type:markdown id: tags:
-
-This project is composed of several jupyter notebooks that explain how to use topological data analysis with different machine learning techniques.
-
-While the notebooks are independent of one another, it might be beneficial if one reads through the notebook titled "TDA_Background," provides a general background on the theory. Additionally, it introduces the main TDA python functions.
-
-The remaining notebooks provide different applications of TDA.
--- a/README.md
+++ b/README.md

 # Topological Machine Learning

-The first 1/3 of this project will require the utilization of traditional machine learning methods on a dataset. The purpose of this portion of the project is promote the benefits of TDA. It is used to show data scientists that they can achieve very interesting results when using TDA. The second portion of this project will be incorporating TDA with machine learning. This portion of the project will be the most demanding due to lack of references. I will attempt to perform both a classification type project and prediction type project. Additionally, I may have to rely on HPCC to run my script (TDA is computationally expensive). The final portion of the project will be creating a Jupyter demo project for data scientists as well as a short semi-theoretical document that discusses the more important features of TDA.
+This project provides an easy-to-follow introduction to Topological Data Analysis (TDA). TDA is a machine learning tool that borrows concepts from topology and applies them to datasets. While the theory behind TDA is quite complex (though important and interesting!), this project focuses only on applications. There is, however, an introductory notebook that gives users a brief introduction to the main concepts of TDA as well as the scikit-tda package. The remaining notebooks apply TDA to different machine learning tasks such as prediction and classification.

-I hope to create a script that is easy for those who are not familiar with TDA to follow. Ideally, I would like to create a document that scikit-tda is willing to publish. Hence, the document needs to be constructed in a manner that is easy to interpret by anyone who is new to TDA.
+Future improvements include:
+1. Fixing the prediction notebook.
+2. Using the Ripser function to classify data.
+3. Providing deeper analysis on the results.

-Thankfully, python is already has the libraries that I will need. They are as follows: scipy, numpy, matplotlib, pandas, seaborn, scikit-tda
+## Prerequisites

-## Getting Started
+The environment file contains all of the required packages. You may either create and activate the environment from that file or install the required packages individually.

-* Update
-
-### Prerequisites
-
-You may need to install the Python TDA library. The installation can be done in Pypi using just one command: pip install scikit-tda. Necessary installations include seaborn, pandas, numpy, matplotlib, scipy, jupyter, Cython, scikit-tda, Ripser, and persim. All can be installed using:
-
-'''
-conda install 'package'
-'''
+To run the Jupyter notebooks successfully, you must run them through the project's environment.

 ## Authors

@@ -49,8 +44,4 @@ OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
 ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
 OTHER DEALINGS IN THE SOFTWARE.

-For more information, please refer to <https://unlicense.org>
-
-## Acknowledgments
-
-* UPDATE
\ No newline at end of file
+For more information, please refer to <https://unlicense.org>
\ No newline at end of file
--- a/Reports/0424-PROJECT-Final-Report.ipynb
+++ b/Reports/0424-PROJECT-Final-Report.ipynb
+%% Cell type:markdown id: tags:
+
+# <center> Topological Machine Learning </center>
+
+<center>By Shawk Masboob </center>
+
+%% Cell type:markdown id: tags:
+
+<img src="https://scikit-tda.org/_static/logo.png" width="20%">
+Image from: https://scikit-tda.org/#
+
+%% Cell type:markdown id: tags:
+
+---
+# Authors
+
+Shawk Masboob
+
+%% Cell type:markdown id: tags:
+
+---
+# Abstract
+
+Topological Data Analysis (TDA) is a relatively new field with many useful applications. Essentially, TDA borrows tools from topology in order to study data. That is, TDA seeks to determine whether a particular dataset has shape and what the shape of the dataset implies. It can be used independently or applied to other machine learning techniques. This project aims to apply TDA to various machine learning methods in order to demonstrate its benefits. To do so, this project first presents the general theory of TDA and introduces the available TDA software. Then, several notebooks illustrate how TDA can be applied to machine learning.
+
+%% Cell type:markdown id: tags:
+
+----
+# Statement of Need
+
+The purpose of this project is to introduce data scientists to Topological Data Analysis (TDA) through various examples. That is, this project aims to demonstrate applications of TDA to those who do not have a background in mathematics.
+
+%% Cell type:markdown id: tags:
+
+----
+# Installation instructions
+
+The environment.yml file contains all of the required dependencies. To install the required modules, run the following command in the terminal:
+
+`make init`
+
+%% Cell type:markdown id: tags:
+
+----
+# Unit Tests
+
+Unit testing is done on the following two functions: `lens_1d` and `uniform_sampling`. These tests verify that each function validates its inputs.
+
+To run a unit test, simply write `make test` in the terminal.
+
+%% Cell type:markdown id: tags:
+
+---
+# Methodology
+
+I was able to meet the majority of my initial goals. I created different notebooks that used TDA in different ways. Additionally, I created a background notebook that gave a basic introduction to TDA. While I was able to meet my general goals, I think I could have approached this project differently. For instance, Ripser and Mapper are the two most used libraries from the scikit-tda package. I think I should have introduced these libraries more thoroughly. Both of these libraries have many applications which my project did not highlight. For instance, Ripser can be used for nonlinear time series analysis, feature selection, classification, etc. While it is impossible to demonstrate all of the possible applications for Ripser, I could have discussed them more and linked some articles, though the vast majority of the papers I’ve come across focus on the mathematical side of TDA rather than the application.
+
+Mapper was incredibly challenging to use. I spent the vast majority of the semester trying to understand and apply it. I was able to do basic classification using Mapper, although I am still unsure what the best method is in order to improve the model. I also do not know how to get some sort of accuracy score. I am only able to look at the output graph and determine whether the data separated well. I have yet to figure out how topologists determine the overall accuracy of their classification. Additionally, I was not able to fully complete the prediction notebook. I know how to use Mapper in the sense that I was able to separate the data visually. The next task is to learn how to extract the data. After doing so, I can easily build a new predictive model. In general, I need to experiment more with Mapper in order to better understand how to implement it.
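One possible way to get an accuracy-like number out of a Mapper graph is sketched below. It assumes, as the tooltip loop in the classification notebook does, that `scomplex['nodes']` maps each node name to the list of row indices it contains. The helper names (`node_majority_labels`, `majority_vote_accuracy`) and the toy graph are hypothetical, and the score is an in-sample majority vote rather than an established topological metric.

```python
from collections import Counter


def node_majority_labels(nodes, y):
    """Map each node name to the most common label among its member rows."""
    return {
        name: Counter(y[i] for i in members).most_common(1)[0][0]
        for name, members in nodes.items()
        if members  # skip empty nodes
    }


def majority_vote_accuracy(nodes, y):
    """Predict each row by voting over the nodes that contain it, then score."""
    node_label = node_majority_labels(nodes, y)
    votes = {}
    for name, members in nodes.items():
        for i in members:
            # Each node casts one vote (its majority label) for every member row.
            votes.setdefault(i, Counter())[node_label[name]] += 1
    if not votes:
        raise ValueError("the graph contains no member rows")
    correct = sum(1 for i, c in votes.items() if c.most_common(1)[0][0] == y[i])
    return correct / len(votes)


# Hypothetical toy graph with the same shape as scomplex['nodes'].
toy_nodes = {
    "cube0_cluster0": [0, 1, 2],
    "cube1_cluster0": [3, 4, 5],
    "cube2_cluster0": [4, 5],
}
toy_y = [0, 0, 1, 1, 1, 1]

print(node_majority_labels(toy_nodes, toy_y))
print(majority_vote_accuracy(toy_nodes, toy_y))
```

Because the vote is scored on the same rows that built the graph, the number is optimistic. A held-out split would be needed before treating it as a real accuracy estimate.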
+
+%% Cell type:markdown id: tags:
+
+---
+# Concluding Remarks
+
+I have a much deeper understanding of TDA because of this project. Before this project, I thought Ripser was only used for persistence diagrams. That is, I thought one could only do an exploratory analysis with Ripser. I was unaware that Ripser is used in time series analysis or classification. As for Mapper, I had no idea how to work with it in general. I was not formally introduced to Mapper; everything I currently know comes from articles and papers I’ve read. My background in Mapper is still limited, but I feel more confident using it and experimenting with the parameters.
+
+For future work, I would like to expand my work with Mapper. I would like to create more wrapper functions that simplify the steps that go into building a graph. Additionally, I would like to learn how to color nodes so that I can gain more insight. For instance, I would like to learn how to color nodes based on the proportion of y1 to y2. My current understanding of Mapper is shallow, so I would like to expand it and add detailed applications to my project. Additionally, I would like to create more notebooks that do not use toy datasets. Toy datasets are easy to work with but do not add enough complexity. To fully demonstrate the applications of Mapper, I need to work with more challenging data.
+
+%% Cell type:markdown id: tags:
+
+----
+# References
+
+Individual notebooks have their own reference section. However, all notebooks use the scikit-tda package.
+
+Saul, Nathaniel and Tralie, Chris. (2019). Scikit-TDA: Topological Data Analysis for Python. Zenodo. http://doi.org/10.5281/zenodo.2533369
--- a/TDA_Background.ipynb
+++ b/TDA_Background.ipynb
--- a/TDA_Classification.ipynb
+++ b/TDA_Classification.ipynb
 %% Cell type:markdown id: tags:

 <h1><center>Classification using Topological Data Analysis</center></h1>
-<img src="https://i.pinimg.com/564x/42/23/f5/4223f59e7e0a7e69e2b73118c7a69cbb.jpg" width="50%">
-<p style="text-align: center;">Image from:https://www.pinterest.jp/pin/406942516319576339/</p>
+<img src="https://cdn.vox-cdn.com/thumbor/GcZR8_tOztIDAiSlX47_5oyZ-js=/0x0:1599x1066/1200x800/filters:focal(834x375:1088x629)/cdn.vox-cdn.com/uploads/chorus_image/image/55588811/King_Estate_Winery_NEXT_Amazon_Wine.0.jpg" width="70%">
+<p style="text-align: center;">Image from: https://www.vox.com/2017/7/6/15926476/amazon-next-wine-king-vintners-king-estate-winery</p>

 %% Cell type:markdown id: tags:

-Purpose of Notebook.
+In this notebook, we will classify wine quality using topological data analysis with Mapper.
+
+The general motivation of this notebook is to demonstrate how to use Mapper and how to apply TDA to classification-related models.

 %% Cell type:code id: tags:

 ``` python
-init
+# imports
+from Topological_ML import tda_function as tda
+import pandas as pd
+import numpy as np
+import sklearn
+from sklearn import cluster, ensemble
+import kmapper as km
+from kmapper.plotlyviz import *
+import matplotlib.pyplot as plt
+import plotly.graph_objs as go
+from ipywidgets import (HBox, VBox)
+import warnings
+warnings.filterwarnings("ignore")
+```
+
+%% Cell type:markdown id: tags:
+
+First, we load the wine dataset from scikit-learn. Its target column encodes three wine classes (0, 1, and 2), which this notebook stores as `quality`.
+
+%% Cell type:code id: tags:
+
+``` python
+# import wine dataset from scikit learn
+from sklearn.datasets import load_wine
+wine = load_wine()
+df = pd.DataFrame(wine['data'],columns = wine['feature_names'])
+df['quality'] = wine['target']
+df.head()
 ```

 %% Output

-    ---------------------------------------------------------------------------
-    NameError                                 Traceback (most recent call last)
-    <ipython-input-3-19f3d006bcd3> in <module>
-    ----> 1 init
+       alcohol  malic_acid   ash  alcalinity_of_ash  magnesium  total_phenols  \
+    0    14.23        1.71  2.43               15.6      127.0           2.80
+    1    13.20        1.78  2.14               11.2      100.0           2.65
+    2    13.16        2.36  2.67               18.6      101.0           2.80
+    3    14.37        1.95  2.50               16.8      113.0           3.85
+    4    13.24        2.59  2.87               21.0      118.0           2.80
+    
+       flavanoids  nonflavanoid_phenols  proanthocyanins  color_intensity   hue  \
+    0        3.06                  0.28             2.29             5.64  1.04
+    1        2.76                  0.26             1.28             4.38  1.05
+    2        3.24                  0.30             2.81             5.68  1.03
+    3        3.49                  0.24             2.18             7.80  0.86
+    4        2.69                  0.39             1.82             4.32  1.04
+    
+       od280/od315_of_diluted_wines  proline  quality
+    0                          3.92   1065.0        0
+    1                          3.40   1050.0        0
+    2                          3.17   1185.0        0
+    3                          3.45   1480.0        0
+    4                          2.93    735.0        0
+
+%% Cell type:markdown id: tags:

-    NameError: name 'init' is not defined
+Now that the data is loaded, we separate the response from the features and build a simplicial complex based on the features. The `lens_1d` function has other options that one can experiment with in order to build a better simplicial complex.

 %% Cell type:code id: tags:

 ``` python
+# separate features and response
+feature_names = [c for c in df.columns if c not in ["quality"]]
+X = np.array(df[feature_names])
+y = np.array(df["quality"])
+
+# you may choose any lens type here
+lens, mapper = tda.lens_1d(X,"max")
+
+# Define the simplicial complex
+scomplex = mapper.map(lens,
+                      X,
+                      nr_cubes=15,
+                      overlap_perc=0.7,
+                      clusterer=sklearn.cluster.KMeans(n_clusters=2,
+                                                       random_state=3471))
 ```
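To make the `nr_cubes` and `overlap_perc` arguments more concrete, here is a simplified sketch of an overlapping interval cover over a 1-D lens. The function names and the exact interval construction are assumptions made for illustration; kmapper's own cover may place its intervals differently.

```python
def overlapping_cover(lo, hi, nr_cubes, overlap_perc):
    """Return nr_cubes closed intervals over [lo, hi].

    Each interval shares a fraction overlap_perc of its length with its neighbour.
    """
    if nr_cubes < 1 or not (0 <= overlap_perc < 1):
        raise ValueError("need nr_cubes >= 1 and 0 <= overlap_perc < 1")
    width = (hi - lo) / nr_cubes              # spacing between interval centers
    half = width / (2 * (1 - overlap_perc))   # half-length after widening for overlap
    centers = [lo + width * (k + 0.5) for k in range(nr_cubes)]
    return [(c - half, c + half) for c in centers]


def cubes_containing(value, intervals):
    """Return the indices of every interval that contains the lens value."""
    return [k for k, (a, b) in enumerate(intervals) if a <= value <= b]


# Hypothetical lens range [0, 1] with the settings used in the notebook cell.
intervals = overlapping_cover(0.0, 1.0, nr_cubes=15, overlap_perc=0.7)
print(len(intervals), cubes_containing(0.5, intervals))
```

With a large overlap, a single point falls into several intervals at once. Clusters from neighbouring intervals that share points are what Mapper connects with edges.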
+
+%% Cell type:markdown id: tags:
+
+The following code, borrowed from scikit-TDA, uses the simplicial complex that we just defined to build a graph. The majority of the code is building an interactive plot within a notebook.
+
+%% Cell type:code id: tags:
+
+``` python
+# color scale
+pl_brewer = [[0.0, '#006837'],
+             [0.1, '#1a9850'],
+             [0.2, '#66bd63'],
+             [0.3, '#a6d96a'],
+             [0.4, '#d9ef8b'],
+             [0.5, '#ffffbf'],
+             [0.6, '#fee08b'],
+             [0.7, '#fdae61'],
+             [0.8, '#f46d43'],
+             [0.9, '#d73027'],
+             [1.0, '#a50026']]
+
+color_function = lens[:, 0] - lens[:, 0].min()
+
+my_colorscale = pl_brewer
+
+kmgraph, mapper_summary, colorf_distribution = get_mapper_graph(
+    scomplex,
+    color_function,
+    color_function_name='Distance to x-min',
+    colorscale=my_colorscale)
+
+# assign to node['custom_tooltips']  the node label: 0 - low quality, 1 - medium quality, 2 - high quality
+for node in kmgraph['nodes']:
+    node['custom_tooltips'] = y[scomplex['nodes'][node['name']]]
+
+bgcolor = 'rgba(10,10,10, 0.9)'
+
+# on a black background the gridlines are set to grey
+y_gridcolor = 'rgb(150,150,150)'
+
+plotly_graph_data = plotly_graph(kmgraph, graph_layout='fr', colorscale=my_colorscale,
+                                 factor_size=2.5, edge_linewidth=0.5)
+
+layout = plot_layout(title='Topological network representing the<br>  wine quality dataset',
+                     width=620, height=570,
+                     annotation_text=get_kmgraph_meta(mapper_summary),
+                     bgcolor=bgcolor)
+
+fw_graph = go.FigureWidget(data=plotly_graph_data, layout=layout)
+
+fw_hist = node_hist_fig(colorf_distribution, bgcolor=bgcolor,
+                        y_gridcolor=y_gridcolor)
+
+fw_summary = summary_fig(mapper_summary, height=300)
+
+dashboard = hovering_widgets(kmgraph,
+                             fw_graph,
+                             ctooltips=True,
+                             bgcolor=bgcolor,
+                             y_gridcolor=y_gridcolor,
+                             member_textbox_width=600)
+
+# Update the fw_graph colorbar, setting its title:
+fw_graph.data[1].marker.colorbar.title = 'dist to<br>x-min'
+
+dashboard
+```
+
+%% Output
+
+
+%% Cell type:markdown id: tags:
+
+Several observations can be made:
+1. The top half of the graph is composed of wine with quality 0
+2. The middle region is composed of all wine types
+3. The bottom region is composed of wine with quality 1 and 2
+
+We attained some separability; however, this model can be improved by using different clustering methods or a different filter function. The current graph does not separate quality 1 and 2 very well.
+
+Further insight can be gained (although it requires more coding). For instance, one can color the nodes so that they show the proportion of high-quality wine to the rest.
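The node-coloring idea can be sketched without any plotting code. The snippet below computes, for each node, the fraction of its members that belong to one class, assuming `scomplex['nodes']` maps node names to lists of row indices as in the tooltip loop above. `class_proportion_per_node` and the toy data are hypothetical names introduced here.

```python
def class_proportion_per_node(nodes, y, target_class):
    """Return {node name: fraction of member rows labelled target_class}."""
    return {
        name: sum(1 for i in members if y[i] == target_class) / len(members)
        for name, members in nodes.items()
        if members  # an empty node has no defined proportion
    }


# Hypothetical toy graph and labels standing in for scomplex['nodes'] and y.
toy_nodes = {"cube0_cluster0": [0, 1, 2, 3], "cube1_cluster0": [1, 4]}
toy_y = [0, 2, 2, 1, 2]

proportions = class_proportion_per_node(toy_nodes, toy_y, target_class=2)
print(proportions)
```

Such per-node values could then drive a color scale. How to pass them to the plotting helpers depends on the kmapper version in use.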


--- a/TDA_EDA.ipynb
+++ b/TDA_EDA.ipynb
--- a/TDA_Persistent_Homology.ipynb
+++ b/TDA_Persistent_Homology.ipynb
--- a/TDA_Prediction.ipynb
+++ b/TDA_Prediction.ipynb
--- a/TDA_Voting.ipynb
+++ b/TDA_Voting.ipynb
--- a/Topological_ML/tda_function.py
+++ b/Topological_ML/tda_function.py
 """
 This file contains all of the functions used within the notebooks.
-Date:
-Author:
+Date: April 24, 2020
+Author: Shawk Masboob
+
+The function `uniform_sampling` was borrowed from Luis Polancocontreras,
+a PhD candidate in the CMSE program at Michigan State University.
+It was slightly tweaked to fit this project.
 """
-import sklearn
-from sklearn.linear_model import LinearRegression
+from sklearn import ensemble
 import kmapper as km
-import pandas as pd
 import numpy as np
+import pandas as pd

 def numpy_to_pandas(sklearn_data):
-    """
-    Converts scikit-learn numpy data into pandas dataframe.
-    Input: name of dataframe
-    Output: pandas dataframe
+    """Converts scikit-learn numpy data into pandas dataframe
+
+    Args:
+        sklearn_data (Bunch): scikit-learn dataset object with `data`, `feature_names`, and `target`
+
+    Returns:
+        data: pandas dataframe
+
    """
    data = pd.DataFrame(data=sklearn_data.data, columns=sklearn_data.feature_names)
    data['target'] = pd.Series(sklearn_data.target)
    return data

-def linear_regression(feature, predictor):
-    """
-    Ordinary least squares Linear Regression.
-    input: x = independent variables
-           y = dependent variable
-    output: R^2
-    """
-    model = LinearRegression()
-    model.fit(feature, predictor)
-    return model.score(feature, predictor)
+def lens_1d(x_array, proj='l2norm', random_num=1729, verbosity=0):
+    """Creates a 2-D lens that pairs an Isolation Forest score with a KeplerMapper projection.
+
+    Args:
+        x_array (array): 2-D array of dataset features </br>
+        proj (string): projection type (default 'l2norm') </br>
+        random_num (int): random state for the Isolation Forest </br>
+        verbosity (int): KeplerMapper verbosity level </br>
+
+    Returns:
+        lens (array): two columns, [Isolation Forest score, projection] </br>
+        mapper: KeplerMapper object used to compute the projection </br>

-def lens_1d(features, random_num, verbosity):
-    """
-    input:
-    output:
    """
-    model = sklearn.ensemble.IsolationForest(random_state=random_num)
-    model.fit(features)
-    lens1 = model.decision_function(features).reshape((features.shape[0], 1))
+    if not isinstance(x_array, np.ndarray):
+        print("your input is not an array")
+        return None, None
+    if isinstance(x_array, np.ndarray) and len(x_array.shape) != 2:
+        print('your input needs to be a 2d array')
+        return None, None
+    proj_type = ['sum', 'mean', 'median', 'max', 'min', 'std', 'dist_mean',
+                 'l2norm', 'knn_distance_n']
+    if proj not in proj_type:
+        print("you may only use the following projections:", proj_type)
+        return None, None
+    # Create a custom 1-D lens with Isolation Forest
+    model = ensemble.IsolationForest(random_state=random_num)
+    model.fit(x_array)
+    lens1 = model.decision_function(x_array).reshape((x_array.shape[0], 1))
+    # Create another 1-D lens with the chosen projection (L2-norm by default)
    mapper = km.KeplerMapper(verbose=verbosity)
-    lens2 = mapper.fit_transform(features, projection="l2norm")
+    lens2 = mapper.fit_transform(x_array, projection=proj)
+    # Combine both lenses into a 2-D lens, i.e. [Isolation Forest, projection]
    lens = np.c_[lens1, lens2]
-    return lens
+    return lens, mapper
+
+def uniform_sampling(dist_matrix, n_sample):
+    """Given a distance matrix, returns a subsampling that preserves the distribution
+    of the original data set and the covering radius corresponding to
+    the subsampled set.
+
+    Args:
+        dist_matrix (array): Distance matrix </br>
+        n_sample (int): Size of subsample set </br>
+
+    Returns:
+        list_subsample (array): List of indices corresponding to the subsample set </br>
+        distance_to_l: Covering radius for the subsample set </br>

-def county_crosstab(data, county, year, index, columns):
-    """
-    input:
-    output:
    """
-    subset_df = data[data.year == year]
-    sub_df = subset_df[subset_df.county == county]
-    crosstab = pd.crosstab(index=sub_df[index], columns=sub_df[columns])
-    return crosstab
+    if not isinstance(dist_matrix, np.ndarray):
+        print("your input is not an array")
+        return None, None
+    if isinstance(dist_matrix, np.ndarray) and len(dist_matrix.shape) != 2:
+        print('your input needs to be a 2d array')
+        return None, None
+    n_subsample = int(n_sample)
+    if n_subsample <= 0:
+        print("Sampling size should be a positive integer.")
+        return None, None
+    num_points = dist_matrix.shape[0]
+    list_subsample = np.random.choice(num_points, n_subsample)
+    dist_to_l = np.min(dist_matrix[list_subsample, :], axis=0)
+    distance_to_l = np.max(dist_to_l)
+    return list_subsample, distance_to_l
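Review note: the covering-radius computation in `uniform_sampling` can be checked in isolation with a small standalone sketch. The helper name `subsample_with_radius`, the `seed` argument, and the use of `replace=False` are assumptions made for illustration; the function in this diff calls `np.random.choice` with its default, which samples with replacement and can therefore return repeated indices.

```python
import numpy as np

def subsample_with_radius(dist_matrix, n_sample, seed=0):
    """Hypothetical helper: pick landmark indices and their covering radius."""
    rng = np.random.default_rng(seed)
    num_points = dist_matrix.shape[0]
    # Assumption: sample without replacement so that no index is repeated.
    indices = rng.choice(num_points, size=n_sample, replace=False)
    # For each point, the distance to its nearest selected landmark.
    dist_to_landmarks = dist_matrix[indices, :].min(axis=0)
    # The covering radius is the largest of those nearest-landmark distances.
    return indices, dist_to_landmarks.max()

# Same three points as in test_uniform_sampling_pass.
points = np.array([[1.0, 1.0], [1.0, 2.0], [2.0, 3.0]])
diff = points[:, None, :] - points[None, :, :]
dist_matrix = np.sqrt((diff ** 2).sum(axis=-1))  # pairwise Euclidean distances

indices, radius = subsample_with_radius(dist_matrix, 2)
all_indices, full_radius = subsample_with_radius(dist_matrix, 3)
```

When every point is selected, each point is its own nearest landmark, so the covering radius is zero; this gives a simple sanity check that could sit alongside the existing type checks in the tests.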
--- a/Topological_ML/test/test_Prediction.py
+++ b/Topological_ML/test/test_Prediction.py
-from Topological_ML import TDA_Prediction as tdap
-from sklearn.datasets import fetch_california_housing
-from sklearn.model_selection import train_test_split
-from sklearn.linear_model import LinearRegression
-import kmapper as km
-import pandas as pd
-import numpy as np
-import matplotlib.pyplot as plt
-import sklearn
-from sklearn import ensemble
-
-def test_data_summary():
-    data = pd.DataFrame({"A": [1,2,3,4,5,6,7,8,9,10]})
-    correct_response = {"head": [1,2,3,4,5,6,7,8,9,10], "shape": [10, 1], "describe": 0}
-    values = tdap.data_summary(data, 5)
-    assert values == correct_response
-    return
-
-def test_linear_regression():
-    a = pd.DataFrame({"A": [1,2,3,4,5,6,7,8,9,10]})
-    b = pd.DataFrame({"B": [2,4,6,8,10,12,14,16,18,20]})
-    correct_response = 1
-    values = tdap.linear_regression(a, b)
-    assert values == correct_response
-    return
-
-def test_lens_1d():
-    a = pd.DataFrame({"A": [0,0]})
-    correct_response = np.array([[0., 0.],[0., 0.]])
-    values = tdap.lens_1d(a,123,1)
-    assert values == correct_response
-    return
-
--- a/Topological_ML/test/test_tda_functions.py
+++ b/Topological_ML/test/test_tda_functions.py
+'''
+This file contains tests for the functions in tda_function.py.
+'''
+import numpy as np
+import scipy.spatial.distance as distance
+from Topological_ML import tda_function as tda
+
+def test_lens_1d_pass():
+    '''
+    Testing whether the function works given correct parameters.
+    '''
+    data = np.ones((5, 2))
+    test, _ = tda.lens_1d(data, proj="sum")
+    assert isinstance(test, np.ndarray)
+
+def test_lens_1d_fail_1():
+    '''
+    Testing whether the function fails given incorrect projection parameter.
+    '''
+    data = np.ones((5, 2))
+    test, _ = tda.lens_1d(data, proj="Sum")
+    assert test is None
+
+def test_lens_1d_fail_2():
+    '''
+    Testing whether the function fails given incorrect feature (data) parameter.
+    '''
+    data = np.ones(5)
+    test, _ = tda.lens_1d(data, proj="sum")
+    assert test is None
+
+def test_uniform_sampling_pass():
+    '''
+    Testing whether the function works given correct parameters.
+    '''
+    x_array = np.array([[1, 1], [1, 2], [2, 3]])
+    dm_x = distance.cdist(x_array, x_array)
+    test, _ = tda.uniform_sampling(dm_x, 1)
+    assert isinstance(test, np.ndarray)
+
+def test_uniform_sampling_fail_1():
+    '''
+    Testing whether the function fails given incorrect distance matrix parameter.
+    '''
+    x_array = np.ones(5)
+    test, _ = tda.uniform_sampling(x_array, 1)
+    assert test is None
+
+def test_uniform_sampling_fail_2():
+    '''
+    Testing whether the function fails given incorrect sampling size parameter.
+    '''
+    x_array = np.array([[1, 1], [1, 2], [2, 3]])
+    dm_x = distance.cdist(x_array, x_array)
+    test, _ = tda.uniform_sampling(dm_x, -2)
+    assert test is None
--- a/calihousing.html
+++ b/calihousing.html
--- a/environment.yml
+++ b/environment.yml
@@ -6,7 +6,6 @@ dependencies:
  - matplotlib
  - jupyter
  - cython
-  - pytest
  - numpy
  - selenium
  - pandas
@@ -15,6 +14,14 @@ dependencies:
  - hypothesis
  - requests
  - plotly
+  - pdoc3
+  - pylint
+  - pytest
+  - autopep8
  - pip:
    - ripser
    - kmapper
+    - persim
+    - python-igraph
+    - plotly
+    - ipywidgets
\ No newline at end of file
--- a/makefile
+++ b/makefile
@@ -22,13 +22,13 @@ init:
 	conda env create --prefix ./envs --file environment.yml

 doc:
-	pdoc --force --html --output-dir ./docs Topological_ML
+	pdoc3 --force --html --output-dir ./docs Topological_ML

 lint:
-	pylint Topological_ML 
+	pylint -v Topological_ML

 test:
-	pytest Topological_ML 
+	pytest -v --disable-warnings Topological_ML

 .PHONY: init doc lint test 

--- a/mapper_visualization_output.html
+++ b/mapper_visualization_output.html
--- a/mi_df.csv
+++ b/mi_df.csv
--- a/michigan_df.csv
+++ b/michigan_df.csv