Compare revisions

shawk masboob
--- a/LICENSE
+++ b/LICENSE
+This is free and unencumbered software released into the public domain.
+
+Anyone is free to copy, modify, publish, use, compile, sell, or
+distribute this software, either in source code form or as a compiled
+binary, for any purpose, commercial or non-commercial, and by any
+means.
+
+In jurisdictions that recognize copyright laws, the author or authors
+of this software dedicate any and all copyright interest in the
+software to the public domain. We make this dedication for the benefit
+of the public at large and to the detriment of our heirs and
+successors. We intend this dedication to be an overt act of
+relinquishment in perpetuity of all present and future rights to this
+software under copyright law.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
+IN NO EVENT SHALL THE AUTHORS BE LIABLE FOR ANY CLAIM, DAMAGES OR
+OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
+ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
+OTHER DEALINGS IN THE SOFTWARE.
+
+For more information, please refer to <https://unlicense.org>
\ No newline at end of file
--- a/README.md
+++ b/README.md

 # Topological Machine Learning

-The first 1/3 of this project will require the utilization of traditional machine learning methods on a dataset. The purpose of this portion of the project is promote the benefits of TDA. It is used to show data scientists that they can achieve very interesting results when using TDA. The second portion of this project will be incorporating TDA with machine learning. This portion of the project will be the most demanding due to lack of references. I will attempt to perform both a classification type project and prediction type project. Additionally, I may have to rely on HPCC to run my script (TDA is computationally expensive). The final portion of the project will be creating a Jupyter demo project for data scientists as well as a short semi-theoretical document that discusses the more important features of TDA.
+This project provides an easy-to-follow introduction to Topological Data Analysis (TDA). TDA is a machine learning tool that borrows concepts from topology and applies them to datasets. While the theory behind TDA is quite complex (though important and interesting!), this project focuses only on applications. There is, however, an introductory notebook that gives users a brief introduction to the main concepts within TDA as well as the scikit-tda package. The remaining notebooks apply TDA alongside different machine learning tasks such as prediction and classification.

-I hope to create a script that is easy for those who are not familiar with TDA to follow. Ideally, I would like to create a document that scikit-tda is willing to publish. Hence, the document needs to be constructed in a manner that is easy to interpret by anyone who is new to TDA.
+Future improvements include:
+1. Fixing the prediction notebook.
+2. Using the Ripser function to classify data.
+3. Providing deeper analysis on the results.

-Thankfully, python is already has the libraries that I will need. They are as follows: scipy, numpy, matplotlib, pandas, seaborn, scikit-tda
+## Prerequisites

-## Getting Started
+The environment file contains all of the required packages. You may either create and activate the environment from this file or install the required packages individually.

-* Update
-
-### Prerequisites
-
-You may need to install the Python TDA library. The installation can be done in Pypi using just one command: pip install scikit-tda. Some may need to install seaborn, pandas, or numpy.
-
-```
-pip install scikit-tda
-```
+To run the Jupyter Notebooks successfully, you must run them through the project's environment.

 ## Authors

@@ -26,8 +21,27 @@ pip install scikit-tda

 ## License

-* UPDATE
-
-## Acknowledgments
-
-* UPDATE
\ No newline at end of file
+This is free and unencumbered software released into the public domain.
+
+Anyone is free to copy, modify, publish, use, compile, sell, or
+distribute this software, either in source code form or as a compiled
+binary, for any purpose, commercial or non-commercial, and by any
+means.
+
+In jurisdictions that recognize copyright laws, the author or authors
+of this software dedicate any and all copyright interest in the
+software to the public domain. We make this dedication for the benefit
+of the public at large and to the detriment of our heirs and
+successors. We intend this dedication to be an overt act of
+relinquishment in perpetuity of all present and future rights to this
+software under copyright law.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
+IN NO EVENT SHALL THE AUTHORS BE LIABLE FOR ANY CLAIM, DAMAGES OR
+OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
+ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
+OTHER DEALINGS IN THE SOFTWARE.
+
+For more information, please refer to <https://unlicense.org>
\ No newline at end of file
--- a/Reports/0313-REPORT-ML.ipynb
+++ b/Reports/0313-REPORT-ML.ipynb
+%% Cell type:markdown id: tags:
+
+# <center>Using Machine Learning (ML) in TDA</center>
+
+<center>by Shawk Masboob</center>
+
+%% Cell type:markdown id: tags:
+
+This project aims to incorporate Topological Data Analysis (TDA) with several machine learning methods in order to demonstrate the potential benefits of TDA within data science.
+
+Traditional clustering methods can be “enhanced” by using TDA. Clustering is concerned with distance, whereas TDA also uses other relationships to group data, such as the number of holes contained within the data [2]. MAPPER begins by clustering the data points within an interval. The user can choose whatever clustering method and metric they desire; for example, hierarchical clustering with the Euclidean distance metric. MAPPER then transforms the clusters into nodes within a graph. According to the developers of MAPPER, some points can exist within more than one node because the intervals overlap. When two nodes share members, an edge is drawn between them [2]. The visualization provided by MAPPER gives statistical results for each node that go beyond traditional clustering. It should also be noted that TDA is often used for classification. A quick Google search will reveal that TDA has been used to classify things such as animals, body parts, the presence of cancer, etc.
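
The Mapper steps described above (cover the lens range with overlapping intervals, cluster inside each interval, link clusters that share points) can be illustrated with a small pure-Python sketch. This is not KeplerMapper's implementation; the function names `mapper_1d` and `single_linkage`, the parameters, and the toy data are all hypothetical, and the clustering step is a simple "closer than `eps`" grouping chosen only to keep the example short.

```python
from itertools import combinations
from math import dist


def single_linkage(points, members, eps):
    """Group indices into connected components of the 'closer than eps' graph."""
    unvisited = set(members)
    clusters = []
    while unvisited:
        stack = [unvisited.pop()]
        cluster = set(stack)
        while stack:
            i = stack.pop()
            near = {j for j in unvisited if dist(points[i], points[j]) <= eps}
            unvisited -= near
            cluster |= near
            stack.extend(near)
        clusters.append(frozenset(cluster))
    return clusters


def mapper_1d(points, lens, n_intervals=3, overlap=0.25, eps=1.5):
    """Toy Mapper: cover the lens range, cluster each preimage, link shared members."""
    values = [lens(p) for p in points]
    lo, hi = min(values), max(values)
    length = (hi - lo) / n_intervals
    pad = length * overlap  # how far each interval extends into its neighbours

    nodes = []  # each node is a frozenset of point indices
    for k in range(n_intervals):
        start = lo + k * length - pad
        end = lo + (k + 1) * length + pad
        members = [j for j, v in enumerate(values) if start <= v <= end]
        nodes.extend(single_linkage(points, members, eps))

    # Draw an edge whenever two nodes share at least one point.
    edges = [(a, b) for a, b in combinations(range(len(nodes)), 2)
             if nodes[a] & nodes[b]]
    return nodes, edges


# Hypothetical data: seven evenly spaced points, with the x-coordinate as the lens.
points = [(float(x), 0.0) for x in range(7)]
nodes, edges = mapper_1d(points, lens=lambda p: p[0])
```

On this toy input the three overlapping intervals each produce one cluster, and neighbouring clusters share a boundary point, so the resulting graph is a chain of three nodes.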
+
+TDA can be used for prediction or, more specifically, feature selection. Suppose one is trying to build a regression model. An important intermediate step is to perform feature selection. While there are many machine learning (e.g. random forest) and statistical (e.g. stepwise regression) techniques that can be used for feature selection, one can also consider using TDA. In this project (within the TDA_Prediction Python notebook), TDA is used to find the most prominent features for the multiple linear model. KeplerMapper, a Python TDA library, allows one to build graphs composed of nodes. These nodes contain data points, and each node carries its own statistics. For instance, a study used TDA to determine how much people are willing to pay for air quality improvements [1]. The researchers generated eleven nodes using MAPPER. The size of each node is related to the number of observations within it. The researchers used the color of each node to represent “the mean value of all entries in that node with respect to the chosen variable” [1].
+
+TDA is in itself an unsupervised machine learning tool. One of the most popular tools within TDA is persistent homology. The notebook titled “TDA_Voting” uses the Python Ripser persistent homology package to analyze recent county-level presidential election voting results. The aim of this notebook is to determine whether there is a natural pattern in voting habits and whether this pattern dissolved during the 2016 presidential election. To perform this analysis, the birth-death diagram (provided by Ripser) is used to spot persistent features. In this example, a persistent feature is a loop of counties that “behaved” similarly. To see more interesting features, the radius needs to be increased; the current radius is extremely small, so nothing interesting appeared. It should also be noted that TDA is computationally expensive, so a large radius might take hours or even days to compute.
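
To make the idea of a birth-death diagram concrete, here is a minimal sketch that computes only the simplest case, connected components (dimension 0), by merging points as the distance threshold grows. It is an illustration under stated assumptions rather than what Ripser does internally: Ripser also handles loops and higher dimensions and is far more efficient. The function name `h0_persistence` and the point cloud are hypothetical.

```python
from itertools import combinations
from math import dist


def h0_persistence(points):
    """Return (birth, death) pairs for connected components as the threshold grows."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving keeps trees shallow
            i = parent[i]
        return i

    # Process edges from shortest to longest, mimicking an increasing radius.
    edges = sorted((dist(points[i], points[j]), i, j)
                   for i, j in combinations(range(len(points)), 2))
    pairs = []
    for length, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j
            pairs.append((0.0, length))  # a component born at 0 dies at this scale
    pairs.append((0.0, float("inf")))  # one component survives at every scale
    return pairs


# Hypothetical cloud: two tight pairs of points that are far from each other.
cloud = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
diagram = h0_persistence(cloud)
```

The pair with the large death value is the "persistent" feature here: it records that the cloud consists of two well-separated groups until the threshold becomes large enough to join them.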
+
+%% Cell type:markdown id: tags:
+
+---
+# References
+
+[1] Allen, Dylan. Topological Data Analysis: Giving Data Shape. Carroll, 13 May 2017, scholars.carroll.edu/cgi/viewcontent.cgi?article=1000&context=mathengcompsci_theses.
+
+[2] Keplermapper 1.2.0 Documentation a Scikit-tda Project
+https://kepler-mapper.scikit-tda.org/theory.html
+
+%% Cell type:code id: tags:
+
+``` python
+```
--- a/Reports/0327-REPORT-ODE.ipynb
+++ b/Reports/0327-REPORT-ODE.ipynb
+%% Cell type:markdown id: tags:
+
+# <center>Using Ordinary Differential Equations (ODEs) with Topological Data Analysis</center>
+
+<center>by Shawk Masboob</center>
+
+%% Cell type:markdown id: tags:
+
+Ordinary Differential Equations (ODEs) were not used in this project. However, ODEs can be combined with Topological Data Analysis (TDA). This report covers how TDA could be used with ODEs in this project and discusses whether researchers have used both ODEs and TDA outside of this project.
+
+An ODE model consists of a set of differential equations that involve functions of only one independent variable and their derivative(s) [1]. ODEs are used to model dynamical systems in science (including social science) and engineering. Specific examples include predator-prey models, how gossip spreads, or the spread of a virus. TDA, on the other hand, uses methods from topology to study datasets. Persistent homology, one of the main tools in TDA, is used to measure the topological features of data (i.e. the shape of the dataset). TDA is capable of working with data that are high-dimensional, incomplete, or noisy. Based on this, researchers can apply TDA tools to perform additional analysis on their ODE model. That is, after developing the ODE model, the researcher can apply topological methods to understand the outputs of the model, especially if there is reason to believe that the output data has complex dimensionality.
+
+This project provides several examples of how to apply TDA with several different machine learning techniques. An additional notebook can be added that incorporates both techniques. For example, an ODE model of the spread of disease can be developed. After changing the model slightly by adding different factors or more detail, the output data can be analyzed using TDA. Before using TDA, it is necessary to determine whether the data is linearly separable. If it is, other methods that work with linear data can be used (using TDA might produce “false” results because TDA will attempt to find shape/pattern within the data even if no such pattern/shape exists). If the data is compatible with TDA, the next step is to prepare the data so that there are no outliers. Then, the persistent homology of the space can be found. The last step is to determine which observations make up the persistent features. Understanding why those particular observations created the homology might shed more light on the model.
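
As a sketch of the first half of that workflow, the code below integrates a simple disease-spread model and collects its output as a list of states that could later be treated as a point cloud for TDA. The choice of the SIR model, the forward Euler scheme, the function name `simulate_sir`, and all parameter values are assumptions made for illustration; the report itself does not specify a model.

```python
def simulate_sir(beta, gamma, s0, i0, r0, dt=0.1, steps=1000):
    """Integrate the SIR equations with forward Euler; returns a list of (S, I, R)."""
    s, i, r = s0, i0, r0
    n = s + i + r  # total population, constant in this model
    trajectory = [(s, i, r)]
    for _ in range(steps):
        new_infections = beta * s * i / n * dt
        new_recoveries = gamma * i * dt
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        trajectory.append((s, i, r))
    return trajectory


# Hypothetical parameters: 1000 people, 10 initially infected.
trajectory = simulate_sir(beta=0.3, gamma=0.1, s0=990.0, i0=10.0, r0=0.0)
```

Each tuple in `trajectory` is one observation in three dimensions, so runs with different `beta` and `gamma` values could be stacked into a dataset and passed to a persistent homology or Mapper routine.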
+
+Another case in which TDA can be used with ODE modeling is when the ODE network is complicated. Generally, it is preferable to have a model that is not overly complex (heavily detailed). However, if a model is complex, TDA can be used to help make sense of the results. Persistent homology can be used to determine which sets of components within the model interact with one another.
+
+An attempt was made to find research projects that use both ODEs and TDA. Unfortunately, no examples were found. While it is not impossible to use both methods in a research project, it is neither common nor natural. TDA is used to analyze large, complex data. Besides analyzing the shape of the data provided by the ODE model, there is not much application. Additionally, the data must not be linearly separable. TDA is generally used with time series analysis, data reduction, agent-based modeling, etc. ODEs and TDA are both wonderful methods that can be utilized.
+
+%% Cell type:markdown id: tags:
+
+---
+# References
+
+[1] https://link.springer.com/referenceworkentry/10.1007%2F978-1-4419-9863-7_381
--- a/Reports/0410-REPORT-Stats.ipynb
+++ b/Reports/0410-REPORT-Stats.ipynb
+%% Cell type:markdown id: tags:
+
+# <center>Using Topological Data Analysis with Statistical Models</center>
+
+<center>by Shawk Masboob</center>
+
+%% Cell type:markdown id: tags:
+
+Topological data analysis (TDA), and more specifically Mapper, can be used to enhance statistical modeling. Mapper is a useful tool for visualizing large datasets because it transforms the data into a graph that can then be analyzed further. Standard statistical methods can often miss or “not see” the complexities that lie within data. Hence, using Mapper can greatly improve statistical modeling tasks such as classification, prediction, or forecasting.
+
+This project incorporates statistical modeling with TDA to demonstrate how Mapper works. The project used the California Housing dataset. The project begins by using standard statistical methods for exploratory data analysis and by reducing multicollinearity with tools such as the variance inflation factor. Stepwise regression was used to determine which features should be used in the model. After determining the necessary features, a multiple linear regression was used to predict house price based on attributes such as number of bedrooms and location.
+
+Mapper was used after building the linear model. While Mapper has multiple purposes, in this project it was used to evaluate the dataset and determine whether certain portions of the dataset behave differently. Mapper allows users to choose different lenses, covers, and clustering methods. The point was to find natural separation within the data. A visualization was created to show the separated nodes. The figure below shows the Mapper output for the California dataset.
+
+![alt text](mapper.png "simplex")
+
+Although the lenses used need to be improved because not enough separation is being shown, one can analyze the nodes individually. The idea is that separation between the nodes implies that the data behaves differently, and hence running a regression on the nodes individually might present interesting results. The next steps in this project are to (1) improve the separation by changing the lenses or clustering method and (2) run a regression on each node. The nodes might have different features; e.g. one node might exclude ‘number of bedrooms’ because it does not contribute to predicting house price, while another node might include it.
+
+There are plenty of examples that incorporate TDA into statistical analysis using Mapper. For example, one study examined Wisconsin cancer data. The aim of the study was to classify breast cancer patients; that is, the goal was to determine which of the 11 predictor variables have the most influence on the diagnosis [1]. The researchers began by building a logistic regression model. They concluded that although a logistic model is appropriate for their data and research topic, there was still room for improvement because misdiagnoses remained an issue [1]. After the statistical analysis was completed, the researchers added Mapper to improve the research. TDA was capable of separating people with benign tumors from those with malignant tumors. After this initial analysis, the researchers looked deeper at the nodes to understand what caused the separation [1]. Using an exhaustive search, the researchers found that the nodes do behave differently, and hence separate logistic models were created for each node [1]. To conclude, the researchers found that standard statistical modeling was prone to false negatives and false positives. Using TDA, the researchers found that “there is a subset of patients that share similarities in many attributes such as mean area and perimeter, but differ wildly in the smoothness of the tumor, and this observation leads to a different model and consequently a different diagnosis” [1].
+
+The main idea behind using TDA when building a statistical model (e.g. multiple linear regression) is that data can behave differently and sometimes relying on traditional statistical methods is not enough. That is, standard statistical methods do not always capture the behavior of the data and hence, the developed models do not lead to the best prediction.
+
+%% Cell type:markdown id: tags:
+
+---
+# References
+
+[1] Allen, Dylan. Topological Data Analysis: Giving Data Shape. Carroll, 13 May 2017, scholars.carroll.edu/cgi/viewcontent.cgi?article=1000&context=mathengcompsci_theses.
--- a/Reports/0424-PROJECT-Final-Report.ipynb
+++ b/Reports/0424-PROJECT-Final-Report.ipynb
+%% Cell type:markdown id: tags:
+
+# <center> Topological Machine Learning </center>
+
+<center>By Shawk Masboob </center>
+
+%% Cell type:markdown id: tags:
+
+<img src="https://scikit-tda.org/_static/logo.png" width="20%">
+Image from: https://scikit-tda.org/#
+
+%% Cell type:markdown id: tags:
+
+---
+# Authors
+
+Shawk Masboob
+
+%% Cell type:markdown id: tags:
+
+---
+# Abstract
+
+Topological Data Analysis (TDA) is a relatively new field with many useful applications. Essentially, TDA borrows tools from topology in order to study data. That is, TDA seeks to determine whether a particular dataset has shape and what the shape of the dataset implies. It can be used independently or applied alongside other machine learning techniques. This project aims to apply TDA to various machine learning methods in order to demonstrate its benefits. To do so, this project first presents the general theory of TDA and introduces the available TDA software. Then, several notebooks illustrate how TDA can be applied to machine learning.
+
+%% Cell type:markdown id: tags:
+
+----
+# Statement of Need
+
+The purpose of this project is to introduce data scientists to Topological Data Analysis (TDA) through various examples. That is, this project aims to demonstrate applications of TDA to those who do not have a background in mathematics.
+
+%% Cell type:markdown id: tags:
+
+----
+# Installation instructions
+
+The environment.yml file contains all of the required dependencies. To install the required modules, run the following command in the terminal:
+
+`make init`
+
+%% Cell type:markdown id: tags:
+
+----
+# Unit Tests
+
+Unit testing is done on the following two functions: `lens_1d` and `uniform_sampling`. These tests verify that each function accepts correct inputs.
+
+To run a unit test, simply write `make test` in the terminal.
+
+%% Cell type:markdown id: tags:
+
+---
+# Methodology
+
+I was able to meet the majority of my initial goals. I created different notebooks that used TDA in different ways. Additionally, I created a background notebook that gave a basic introduction to TDA. While I was able to meet my general goals, I think I could have approached this project differently. For instance, Ripser and Mapper are the two most used libraries from the scikit-tda package. I think I should have introduced these libraries more thoroughly. Both of these libraries have many applications which my project did not highlight. For instance, Ripser can be used for nonlinear time series analysis, feature selection, classification, etc. While it is impossible to demonstrate all of the possible applications for Ripser, I could have discussed them more and linked some articles - though the vast majority of the papers I’ve come across focus on the mathematical side of TDA rather than the application.
+
+Mapper was incredibly challenging to use. I spent the vast majority of the semester trying to understand and apply it. I was able to do basic classification using Mapper, although I am still unsure how best to improve the model. I also do not know how to compute an accuracy score. I am only able to look at the output graph and judge whether the data separated well. I have yet to figure out how topologists determine the overall accuracy of their classification. Additionally, I was not able to fully complete the prediction notebook. I know how to use Mapper in the sense that I was able to separate the data visually. The next task is to learn how to extract the data from the graph. After doing so, I can build a new predictive model. In general, I need to experiment more with Mapper in order to better understand how to implement it.
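
One possible way to turn the graph into a number is sketched below. It is not an established TDA accuracy metric. It assumes only that the Mapper output exposes a dictionary mapping each node to the row indices of its members, which is how `scomplex['nodes']` is used in the classification notebook. The function `node_purity` and the toy data are hypothetical and exist only for this illustration.

```python
import numpy as np


def node_purity(nodes, y):
    """Return ({node_id: (majority_label, purity)}, size-weighted mean purity).

    `nodes` maps a node id to the list of row indices in that node.
    Purity is the share of a node's members carrying its majority label.
    """
    y = np.asarray(y)
    per_node = {}
    total_members = 0
    total_majority = 0
    for node_id, members in nodes.items():
        labels, counts = np.unique(y[members], return_counts=True)
        best = counts.argmax()
        per_node[node_id] = (int(labels[best]), counts[best] / len(members))
        total_members += len(members)
        total_majority += int(counts[best])
    return per_node, total_majority / total_members


# Toy stand-in for scomplex['nodes']; row 2 belongs to both nodes,
# as can happen when Mapper intervals overlap.
toy_nodes = {"cube0_cluster0": [0, 1, 2], "cube1_cluster0": [2, 3, 4, 5]}
toy_y = np.array([0, 0, 0, 1, 1, 2])
per_node, overall = node_purity(toy_nodes, toy_y)
```

The same member lists are what one would use to extract rows for a downstream predictive model, for example by fitting a separate model on `X[members]` for each node.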
+
+%% Cell type:markdown id: tags:
+
+---
+# Concluding Remarks
+
+I have a much deeper understanding of TDA because of this project. Before this project, I thought Ripser was only used for persistence diagrams. That is, I thought Ripser could only be used for exploratory analysis. I was unaware that Ripser is also used in time series analysis and classification. As for Mapper, I had no idea how to work with it in general. I was not formally introduced to Mapper - everything I currently know comes from articles and papers I’ve read. My background in Mapper is still limited, but I feel more confident using it and experimenting with the parameters.
+
+For future work, I would like to expand my work with Mapper. I would like to create more wrapper functions that simplify the steps that go into building a graph. Additionally, I would like to learn how to color nodes so that I can gain more insight. For instance, I would like to learn how to color nodes based on the proportion of y1 to y2. My current understanding of Mapper is shallow, so I would like to deepen it and add detailed applications to my project. Additionally, I would like to create more notebooks that do not use toy datasets. Toy datasets are easy to work with but do not add enough complexity. To fully demonstrate the applications of Mapper, I need to work with more challenging data.
+
+%% Cell type:markdown id: tags:
+
+----
+# References
+
+Individual notebooks have their own reference section. However, all notebooks use the scikit-tda package.
+
+Saul, Nathaniel and Tralie, Chris. (2019). Scikit-TDA: Topological Data Analysis for Python. Zenodo. http://doi.org/10.5281/zenodo.2533369
+%% Cell type:markdown id: tags:
+
+# <center> Topological Machine Learning </center>
+
+<center>By Shawk Masboob </center>
+
+%% Cell type:markdown id: tags:
+
+<img src="https://scikit-tda.org/_static/logo.png" width="20%">
+Image from: https://scikit-tda.org/#
+
+%% Cell type:markdown id: tags:
+
+---
+# Authors
+
+Shawk Masboob
+
--- a/Reports/mapper.png
+++ b/Reports/mapper.png
--- a/TDA_Background.ipynb
+++ b/TDA_Background.ipynb
--- a/TDA_Classification.ipynb
+++ b/TDA_Classification.ipynb
+%% Cell type:markdown id: tags:
+
+<h1><center>Classification using Topological Data Analysis</center></h1>
+<img src="https://cdn.vox-cdn.com/thumbor/GcZR8_tOztIDAiSlX47_5oyZ-js=/0x0:1599x1066/1200x800/filters:focal(834x375:1088x629)/cdn.vox-cdn.com/uploads/chorus_image/image/55588811/King_Estate_Winery_NEXT_Amazon_Wine.0.jpg" width="70%">
+<p style="text-align: center;">Image from: https://www.vox.com/2017/7/6/15926476/amazon-next-wine-king-vintners-king-estate-winery</p>
+
+%% Cell type:markdown id: tags:
+
+In this notebook, we will classify wine quality using topological data analysis with Mapper.
+
+The motivation for this notebook is to demonstrate how to use Mapper and how to apply TDA to classification models.
+
+%% Cell type:code id: tags:
+
+``` python
+# imports
+from Topological_ML import tda_function as tda
+import pandas as pd
+import numpy as np
+import sklearn
+import sklearn.cluster  # explicit import: KMeans is referenced as sklearn.cluster.KMeans below
+from sklearn import ensemble
+import kmapper as km
+from kmapper.plotlyviz import *
+import matplotlib.pyplot as plt
+import plotly.graph_objs as go
+from ipywidgets import (HBox, VBox)
+import warnings
+warnings.filterwarnings("ignore")
+```
+
+%% Cell type:markdown id: tags:
+
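+First, we load the wine dataset that ships with scikit-learn. Note that the `target` of this dataset encodes one of three classes (0, 1, 2); this notebook stores it in a column named `quality` and treats it as the class label.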
+
+%% Cell type:code id: tags:
+
+``` python
+# import wine dataset from scikit learn
+from sklearn.datasets import load_wine
+wine = load_wine()
+df = pd.DataFrame(wine['data'],columns = wine['feature_names'])
+df['quality'] = wine['target']
+df.head()
+```
+
+%% Output
+
+       alcohol  malic_acid   ash  alcalinity_of_ash  magnesium  total_phenols  \
+    0    14.23        1.71  2.43               15.6      127.0           2.80
+    1    13.20        1.78  2.14               11.2      100.0           2.65
+    2    13.16        2.36  2.67               18.6      101.0           2.80
+    3    14.37        1.95  2.50               16.8      113.0           3.85
+    4    13.24        2.59  2.87               21.0      118.0           2.80
+    
+       flavanoids  nonflavanoid_phenols  proanthocyanins  color_intensity   hue  \
+    0        3.06                  0.28             2.29             5.64  1.04
+    1        2.76                  0.26             1.28             4.38  1.05
+    2        3.24                  0.30             2.81             5.68  1.03
+    3        3.49                  0.24             2.18             7.80  0.86
+    4        2.69                  0.39             1.82             4.32  1.04
+    
+       od280/od315_of_diluted_wines  proline  quality
+    0                          3.92   1065.0        0
+    1                          3.40   1050.0        0
+    2                          3.17   1185.0        0
+    3                          3.45   1480.0        0
+    4                          2.93    735.0        0
+
+%% Cell type:markdown id: tags:
+
+Now that the data is loaded, we separate the response from the features and build a simplicial complex based on the features. The `lens_1d` function accepts other projection types that one can experiment with in order to build a better simplicial complex.
+
+%% Cell type:code id: tags:
+
+``` python
+# separate features and response
+feature_names = [c for c in df.columns if c not in ["quality"]]
+X = np.array(df[feature_names])
+y = np.array(df["quality"])
+
+# you may choose any lens type here
+lens, mapper = tda.lens_1d(X,"max")
+
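+# Define the simplicial complex
+# (newer KeplerMapper releases deprecate nr_cubes/overlap_perc in favor of
+#  cover=km.Cover(n_cubes=..., perc_overlap=...))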
+scomplex = mapper.map(lens,
+                      X,
+                      nr_cubes=15,
+                      overlap_perc=0.7,
+                      clusterer=sklearn.cluster.KMeans(n_clusters=2,
+                                                       random_state=3471))
+```
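
The roles of the interval count and the overlap in the call above can be illustrated with a simplified one-dimensional cover. This is a sketch under stated assumptions and not KeplerMapper's actual cover implementation: `overlapping_intervals` and `interval_membership` are hypothetical helpers, and the lens values are made up. Each interval collects the points whose lens value falls inside it; a clusterer is then run inside each interval, and points shared by neighboring intervals are what create edges in the graph.

```python
import numpy as np


def overlapping_intervals(values, n_intervals, overlap):
    """Cover [min, max] of `values` with equal-width intervals.

    Consecutive intervals share the fraction `overlap` of their width.
    """
    lo, hi = float(np.min(values)), float(np.max(values))
    width = (hi - lo) / (1 + (n_intervals - 1) * (1 - overlap))
    step = width * (1 - overlap)
    return [(lo + i * step, lo + i * step + width) for i in range(n_intervals)]


def interval_membership(values, intervals):
    """For each interval, list the indices of the values that fall inside it."""
    values = np.asarray(values)
    return [np.flatnonzero((values >= a) & (values <= b)).tolist()
            for a, b in intervals]


# Made-up 1-D lens values for five points.
lens_values = np.array([0.0, 0.5, 1.5, 2.5, 3.0])
intervals = overlapping_intervals(lens_values, n_intervals=2, overlap=0.5)
members = interval_membership(lens_values, intervals)
```

Here the point at index 2 lands in both intervals, so any clusters that contain it would be joined by an edge. Raising the overlap increases the number of shared points and therefore the connectivity of the graph.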
+
+%% Cell type:markdown id: tags:
+
+The following code, borrowed from scikit-TDA, uses the simplicial complex that we just defined to build a graph. Most of the code builds an interactive plot within the notebook.
+
+%% Cell type:code id: tags:
+
+``` python
+# color scale
+pl_brewer = [[0.0, '#006837'],
+             [0.1, '#1a9850'],
+             [0.2, '#66bd63'],
+             [0.3, '#a6d96a'],
+             [0.4, '#d9ef8b'],
+             [0.5, '#ffffbf'],
+             [0.6, '#fee08b'],
+             [0.7, '#fdae61'],
+             [0.8, '#f46d43'],
+             [0.9, '#d73027'],
+             [1.0, '#a50026']]
+
+# color each point by the distance of its first lens coordinate from the minimum
+color_function = lens[:, 0] - lens[:, 0].min()
+
+my_colorscale = pl_brewer
+
+kmgraph,  mapper_summary, colorf_distribution = get_mapper_graph(scomplex,
+                                                                 color_function,
+                                                                 color_function_name='Distance to x-min',
+                                                                 colorscale=my_colorscale)
+
+# assign to node['custom_tooltips'] the labels of the node's members: 0 - low quality, 1 - medium quality, 2 - high quality
+for node in kmgraph['nodes']:
+    node['custom_tooltips'] = y[scomplex['nodes'][node['name']]]
+
+bgcolor = 'rgba(10,10,10, 0.9)'
+
+# on a dark background, draw the gridlines in grey
+y_gridcolor = 'rgb(150,150,150)'
+
+plotly_graph_data = plotly_graph(kmgraph, graph_layout='fr', colorscale=my_colorscale,
+                                 factor_size=2.5, edge_linewidth=0.5)
+
+layout = plot_layout(title='Topological network representing the<br>  wine quality dataset',
+                     width=620, height=570,
+                     annotation_text=get_kmgraph_meta(mapper_summary),
+                     bgcolor=bgcolor)
+
+fw_graph = go.FigureWidget(data=plotly_graph_data, layout=layout)
+
+fw_hist = node_hist_fig(colorf_distribution, bgcolor=bgcolor,
+                        y_gridcolor=y_gridcolor)
+
+fw_summary = summary_fig(mapper_summary, height=300)
+
+dashboard = hovering_widgets(kmgraph,
+                             fw_graph,
+                             ctooltips=True,
+                             bgcolor=bgcolor,
+                             y_gridcolor=y_gridcolor,
+                             member_textbox_width=600)
+
+#Update the fw_graph colorbar, setting its title:
+fw_graph.data[1].marker.colorbar.title = 'dist to<br>x-min'
+
+dashboard
+```
+
+%% Output
+
+
+%% Cell type:markdown id: tags:
+
+Several observations can be made:
+1. The top half of the graph is composed of wine with quality 0
+2. The middle region is composed of all wine types
+3. The bottom region is composed of wine with quality 1 and 2
+
+We attained some separability; however, this model can be improved by using a different clustering method or a different filter function. The current graph does not separate quality 1 and 2 very well.
+
+Further insight is possible, although it requires more code. For instance, one can color the nodes so that they show the proportion of high-quality wine relative to the rest.
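
A minimal sketch of that per-node proportion is shown below. It assumes only the node-to-member-index dictionary that `scomplex['nodes']` provides in the code above. The function name `class_proportion_per_node` and the toy data are hypothetical. How the resulting values are passed to the plotting helpers is left out, since that depends on the installed KeplerMapper version.

```python
import numpy as np


def class_proportion_per_node(nodes, y, target_class):
    """Return {node_id: share of the node's members labeled `target_class`}."""
    y = np.asarray(y)
    return {node_id: float(np.mean(y[members] == target_class))
            for node_id, members in nodes.items()}


# Toy stand-in for scomplex['nodes'] and the label vector y.
toy_nodes = {"a": [0, 1], "b": [1, 2, 3, 4]}
toy_y = np.array([0, 2, 2, 1, 2])
props = class_proportion_per_node(toy_nodes, toy_y, target_class=2)
```

With the notebook's labels, `target_class=2` would give the share of "high quality" wine in each node, which could then serve as a node color value.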
--- a/TDA_EDA.ipynb
+++ b/TDA_EDA.ipynb
--- a/TDA_Prediction.ipynb
+++ b/TDA_Prediction.ipynb
--- a/TDA_Voting.ipynb
+++ b/TDA_Voting.ipynb
--- a/Topological_ML/TDA_Prediction.py
+++ b/Topological_ML/TDA_Prediction.py
-from Topological_ML import TDA_Prediction as tdap
-from sklearn.datasets import fetch_california_housing
-from sklearn.model_selection import train_test_split
-from sklearn.linear_model import LinearRegression
-import kmapper as km
-import pandas as pd
-import numpy as np
-import matplotlib.pyplot as plt
-import sklearn
-from sklearn import ensemble
-
-def numpy_to_pandas(sklearn_data):
-    """
-    Converts scikit-learn numpy data into pandas dataframe.
-    Input: name of dataframe
-    Output: pandas dataframe
-    """
-    df = pd.DataFrame(data = sklearn_data.data, columns = sklearn_data.feature_names)
-    df['response'] = pd.Series(sklearn_data.target)
-    return df
-
-def data_summary(df, n):
-    """
-    Provides brief descriptive statistics on dataset. 
-    Input: name of dataframe
-    Output: dictionary 
-    """
-    d = dict()
-    d['head'] = df.head(n)
-    d['shape'] = df.shape
-    #d['missing values'] = df.isna().sum()
-    return d
-    
-def model_selection(df):
-    """
-    Takes dateframe as input. Performs foward/backward stepwise
-    regression. Returns best model for both methods.
-    """
-    null_fit = None
-    foward_step = None
-    backward_step = None
-    return foward_step, backward_step
-
-def MSE_fit(fit): 
-    """
-    Takes in a fitted model as the input.
-    Calculates the MSU of the fitted model.
-    Outputs the model's MSE.
-    """
-    MSE = None
-    return MSE
-
-def accuracy_metrics(fit, MSE, n, k):
-    """
-    This function is used for model validation. It returns a dictionary
-    of several regression model accuracy metrics. Its inputs are a fitted model
-    and the MSE of the fitted model.
-    """
-    d = dict()
-    y_hat = model.predict(X)
-    resid = y - y_hat
-    SSE = sum(resid**2)
-    n = None
-    p = None
-    pr = None
-    d['R2'] = None
-    d['R2ad'] = None
-    d['AIC'] = 2*k - 2*ln(SSE)
-    d['BIC'] = n*ln(SSE/n) + k*ln(n)
-    d['PRESS'] = None
-    d['Cp']= None
-    return None
-
-def linear_regression(x, y):
-    """
-    Ordinary least squares Linear Regression.
-    input: x = independent variables
-           y = dependent variable
-    output: R^2
-    """
-    model = LinearRegression()
-    model.fit(x, y)
-    return model.score(x ,y)
-
-def lens_1d(X, rs, v):
-    """
-    input:
-    output:
-    """
-    model = sklearn.ensemble.IsolationForest(random_state = rs)
-    model.fit(X)
-    lens1 = model.decision_function(X).reshape((X.shape[0], 1))
-    mapper = km.KeplerMapper(verbose = v)
-    lens2 = mapper.fit_transform(X, projection="l2norm")
-    lens = np.c_[lens1, lens2]
-    return lens
\ No newline at end of file
--- a/Topological_ML/__init__.py
+++ b/Topological_ML/__init__.py
-{
- "cells": [],
- "metadata": {},
- "nbformat": 4,
- "nbformat_minor": 2
-}
--- a/Topological_ML/tda_function.py
+++ b/Topological_ML/tda_function.py
+"""
+This file contains all of the functions used within the notebooks.
+Date: April 24, 2020
+Author: Shawk Masboob
+
+The function `uniform_sampling` was borrowed from Luis Polancocontreras,
+a PhD candidate in the CMSE program at Michigan State University.
+It was slightly tweaked to fit this project.
+"""
+from sklearn import ensemble
+import kmapper as km
+import numpy as np
+import pandas as pd
+
+def numpy_to_pandas(sklearn_data):
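+    """Converts a scikit-learn dataset into a pandas dataframe
+
+    Args:
+        sklearn_data: dataset object returned by a scikit-learn loader,
+            with `data`, `feature_names` and `target` attributes
+
+    Returns:
+        data: pandas dataframe with one column per feature plus a `target` column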
+
+    """
+    data = pd.DataFrame(data=sklearn_data.data, columns=sklearn_data.feature_names)
+    data['target'] = pd.Series(sklearn_data.target)
+    return data
+
+def lens_1d(x_array, proj='l2norm', random_num=1729, verbosity=0):
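+    """Builds a 2-D lens from an Isolation Forest score and a KeplerMapper projection.
+
+    Returns (None, None) and prints a message if the input is invalid.
+
+    Args:
+        x_array (array): features of dataset, as a 2-D NumPy array </br>
+        proj (string): projection type for the second lens column </br>
+        random_num: random state for the Isolation Forest </br>
+        verbosity: verbosity level passed to KeplerMapper </br>
+
+    Returns:
+        lens: 2-D array with columns [Isolation Forest score, projection] </br>
+        mapper: the KeplerMapper object used to compute the projection </br>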
+
+    """
+    if not isinstance(x_array, np.ndarray):
+        print("your input is not an array")
+        return None, None
+    if isinstance(x_array, np.ndarray) and len(x_array.shape) != 2:
+        print('your input needs to be a 2d array')
+        return None, None
+    proj_type = ['sum', 'mean', 'median', 'max', 'min', 'std', 'dist_mean',
+                 'l2norm', 'knn_distance_n']
+    if proj not in proj_type:
+        print("you may only use the following projections:", proj_type)
+        return None, None
+    # Create a custom 1-D lens with Isolation Forest
+    model = ensemble.IsolationForest(random_state=random_num)
+    model.fit(x_array)
+    lens1 = model.decision_function(x_array).reshape((x_array.shape[0], 1))
+    # Create another 1-D lens with the requested projection (L2-norm by default)
+    mapper = km.KeplerMapper(verbose=verbosity)
+    lens2 = mapper.fit_transform(x_array, projection=proj)
+    # Combine the two lenses column-wise to get a 2-D lens, i.e. [Isolation Forest, projection]
+    lens = np.c_[lens1, lens2]
+    return lens, mapper
+
+def uniform_sampling(dist_matrix, n_sample):
+    """Given a distance matrix, returns a subsampling that preserves the distribution
+    of the original data set and the covering radius corresponding to
+    the subsampled set.
+
+    Args:
+        dist_matrix (array): Distance matrix </br>
+        n_sample (int): Size of subsample set </br>
+
+    Returns:
+        list_subsample (array): List of indices corresponding to the subsample set </br>
+        distance_to_l: Covering radius for the subsample set </br>
+
+    """
+    if not isinstance(dist_matrix, np.ndarray):
+        print("your input is not an array")
+        return None, None
+    if isinstance(dist_matrix, np.ndarray) and len(dist_matrix.shape) != 2:
+        print('your input needs to be a 2d array')
+        return None, None
+    n_subsample = int(n_sample)
+    if n_subsample <= 0:
+        print("Sampling size should be a positive integer.")
+        return None, None
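+    num_points = dist_matrix.shape[0]
+    # Draw landmark indices uniformly at random (np.random.choice samples with replacement)
+    list_subsample = np.random.choice(num_points, n_subsample)
+    # Distance from every point to its nearest landmark
+    dist_to_l = np.min(dist_matrix[list_subsample, :], axis=0)
+    # Covering radius: the largest of those nearest-landmark distances
+    distance_to_l = np.max(dist_to_l)
+    return list_subsample, distance_to_l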
--- a/Topological_ML/test/__init__.py
+++ b/Topological_ML/test/__init__.py
--- a/Topological_ML/test/test_tda_functions.py
+++ b/Topological_ML/test/test_tda_functions.py
+'''
+This file contains the unit tests for the functions in tda_function.py.
+'''
+import numpy as np
+import scipy.spatial.distance as distance
+from Topological_ML import tda_function as tda
+
+def test_lens_1d_pass():
+    '''
+    Testing whether the function works given correct parameters.
+    '''
+    data = np.ones((5, 2))
+    test, _ = tda.lens_1d(data, proj="sum")
+    assert isinstance(test, np.ndarray)
+
+def test_lens_1d_fail_1():
+    '''
+    Testing whether the function fails given incorrect projection parameter.
+    '''
+    data = np.ones((5, 2))
+    test, _ = tda.lens_1d(data, proj="Sum")
+    assert test is None
+
+def test_lens_1d_fail_2():
+    '''
+    Testing whether the function fails given incorrect feature (data) parameter.
+    '''
+    data = np.ones(5)
+    test, _ = tda.lens_1d(data, proj="sum")
+    assert test is None
+
+def test_uniform_sampling_pass():
+    '''
+    Testing whether the function works given correct parameters.
+    '''
+    x_array = np.array([[1, 1], [1, 2], [2, 3]])
+    dm_x = distance.cdist(x_array, x_array)
+    test, _ = tda.uniform_sampling(dm_x, 1)
+    assert isinstance(test, np.ndarray)
+
+def test_uniform_sampling_fail_1():
+    '''
+    Testing whether the function fails given incorrect distance matrix parameter.
+    '''
+    x_array = np.ones(5)
+    test, _ = tda.uniform_sampling(x_array, 1)
+    assert test is None
+
+def test_uniform_sampling_fail_2():
+    '''
+    Testing whether the function fails given incorrect sampling size parameter.
+    '''
+    x_array = np.array([[1, 1], [1, 2], [2, 3]])
+    dm_x = distance.cdist(x_array, x_array)
+    test, _ = tda.uniform_sampling(dm_x, -2)
+    assert test is None
--- a/calihousing.html
+++ b/calihousing.html
--- a/docs/TDA_Prediction.html
+++ b/docs/TDA_Prediction.html
-<!doctype html>
-<html lang="en">
-<head>
-<meta charset="utf-8">
-<meta name="viewport" content="width=device-width, initial-scale=1, minimum-scale=1" />
-<meta name="generator" content="pdoc 0.7.2" />
-<title>TDA_Prediction API documentation</title>
-<meta name="description" content="" />
-<link href='https://cdnjs.cloudflare.com/ajax/libs/normalize/8.0.0/normalize.min.css' rel='stylesheet'>
-<link href='https://cdnjs.cloudflare.com/ajax/libs/10up-sanitize.css/8.0.0/sanitize.min.css' rel='stylesheet'>
-<link href="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/9.12.0/styles/github.min.css" rel="stylesheet">
-<style>.flex{display:flex !important}body{line-height:1.5em}#content{padding:20px}#sidebar{padding:30px;overflow:hidden}.http-server-breadcrumbs{font-size:130%;margin:0 0 15px 0}#footer{font-size:.75em;padding:5px 30px;border-top:1px solid #ddd;text-align:right}#footer p{margin:0 0 0 1em;display:inline-block}#footer p:last-child{margin-right:30px}h1,h2,h3,h4,h5{font-weight:300}h1{font-size:2.5em;line-height:1.1em}h2{font-size:1.75em;margin:1em 0 .50em 0}h3{font-size:1.4em;margin:25px 0 10px 0}h4{margin:0;font-size:105%}a{color:#058;text-decoration:none;transition:color .3s ease-in-out}a:hover{color:#e82}.title code{font-weight:bold}h2[id^="header-"]{margin-top:2em}.ident{color:#900}pre code{background:#f8f8f8;font-size:.8em;line-height:1.4em}code{background:#f2f2f1;padding:1px 4px;overflow-wrap:break-word}h1 code{background:transparent}pre{background:#f8f8f8;border:0;border-top:1px solid #ccc;border-bottom:1px solid #ccc;margin:1em 0;padding:1ex}#http-server-module-list{display:flex;flex-flow:column}#http-server-module-list div{display:flex}#http-server-module-list dt{min-width:10%}#http-server-module-list p{margin-top:0}.toc ul,#index{list-style-type:none;margin:0;padding:0}#index code{background:transparent}#index h3{border-bottom:1px solid #ddd}#index ul{padding:0}#index h4{font-weight:bold}#index h4 + ul{margin-bottom:.6em}@media (min-width:200ex){#index .two-column{column-count:2}}@media (min-width:300ex){#index .two-column{column-count:3}}dl{margin-bottom:2em}dl dl:last-child{margin-bottom:4em}dd{margin:0 0 1em 3em}#header-classes + dl > dd{margin-bottom:3em}dd dd{margin-left:2em}dd p{margin:10px 0}.name{background:#eee;font-weight:bold;font-size:.85em;padding:5px 10px;display:inline-block;min-width:40%}.name:hover{background:#e0e0e0}.name > span:first-child{white-space:nowrap}.name.class > span:nth-child(2){margin-left:.4em}.inherited{color:#999;border-left:5px solid #eee;padding-left:1em}.inheritance em{font-style:normal;font-weight:bold}.desc 
h2{font-weight:400;font-size:1.25em}.desc h3{font-size:1em}.desc dt code{background:inherit}.source summary,.git-link-div{color:#666;text-align:right;font-weight:400;font-size:.8em;text-transform:uppercase}.source summary > *{white-space:nowrap;cursor:pointer}.git-link{color:inherit;margin-left:1em}.source pre{max-height:500px;overflow:auto;margin:0}.source pre code{font-size:12px;overflow:visible}.hlist{list-style:none}.hlist li{display:inline}.hlist li:after{content:',\2002'}.hlist li:last-child:after{content:none}.hlist .hlist{display:inline;padding-left:1em}img{max-width:100%}.admonition{padding:.1em .5em;margin-bottom:1em}.admonition-title{font-weight:bold}.admonition.note,.admonition.info,.admonition.important{background:#aef}.admonition.todo,.admonition.versionadded,.admonition.tip,.admonition.hint{background:#dfd}.admonition.warning,.admonition.versionchanged,.admonition.deprecated{background:#fd4}.admonition.error,.admonition.danger,.admonition.caution{background:lightpink}</style>
-<style media="screen and (min-width: 700px)">@media screen and (min-width:700px){#sidebar{width:30%}#content{width:70%;max-width:100ch;padding:3em 4em;border-left:1px solid #ddd}pre code{font-size:1em}.item .name{font-size:1em}main{display:flex;flex-direction:row-reverse;justify-content:flex-end}.toc ul ul,#index ul{padding-left:1.5em}.toc > ul > li{margin-top:.5em}}</style>
-<style media="print">@media print{#sidebar h1{page-break-before:always}.source{display:none}}@media print{*{background:transparent !important;color:#000 !important;box-shadow:none !important;text-shadow:none !important}a[href]:after{content:" (" attr(href) ")";font-size:90%}a[href][title]:after{content:none}abbr[title]:after{content:" (" attr(title) ")"}.ir a:after,a[href^="javascript:"]:after,a[href^="#"]:after{content:""}pre,blockquote{border:1px solid #999;page-break-inside:avoid}thead{display:table-header-group}tr,img{page-break-inside:avoid}img{max-width:100% !important}@page{margin:0.5cm}p,h2,h3{orphans:3;widows:3}h1,h2,h3,h4,h5,h6{page-break-after:avoid}}</style>
-</head>
-<body>
-<main>
-<article id="content">
-<header>
-<h1 class="title">Module <code>TDA_Prediction</code></h1>
-</header>
-<section id="section-intro">
-<details class="source">
-<summary>
-<span>Expand source code</span>
-</summary>
-<pre><code class="python">def dataload():
-    &#34;&#34;&#34;
-    Load toy datasets from scikit-learn.
-    &#34;&#34;&#34;
-    data = None
-    return data
-
-def datafetch(file_name):
-    &#34;&#34;&#34;
-    Load real-world datasets from scikit-learn.
-    &#34;&#34;&#34;
-    data = None
-    print(&#34;reading data from:&#34;, file_name)
-    return data
-
-def descriptive_statistic(df):
-    &#34;&#34;&#34;
-    Provides brief descriptive statistics on dataset. 
-    Takes dataframe as input.
-    &#34;&#34;&#34;
-    print(&#34;Type : &#34;, None, &#34;\n\n&#34;)
-    print(&#34;Shape : &#34;, None)
-    print(&#34;Head -- \n&#34;, None)
-    print(&#34;\n\n Tail -- \n&#34;, None)
-    print(&#34;Describe : &#34;, None)
-    
-def model_selection(df):
-    &#34;&#34;&#34;
-    Takes dataframe as input. Performs forward/backward stepwise
-    regression. Returns best model for both methods.
-    &#34;&#34;&#34;
-    null_fit = None
-    foward_step = None
-    backward_step = None
-    return foward_step, backward_step
-
-def MSE_fit(fit): 
-    &#34;&#34;&#34;
-    Takes in a fitted model as the input.
-    Calculates the MSE of the fitted model.
-    Outputs the model&#39;s MSE.
-    &#34;&#34;&#34;
-    MSE = None
-    return MSE
-
-def accuracy_metrics(fit, MSE):
-    &#34;&#34;&#34;
-    This function is used for model validation. It returns a dictionary
-    of several regression model accuracy metrics. Its inputs are a fitted model
-    and the MSE of the fitted model.
-    &#34;&#34;&#34;
-    d = dict()
-    sumObj = None
-    SSE = None
-    n = None
-    p = None
-    pr = None
-    d[&#39;R2&#39;] = None
-    d[&#39;R2ad&#39;] = None
-    d[&#39;AIC&#39;] = None
-    d[&#39;BIC&#39;] = None
-    d[&#39;PRESS&#39;] = None
-    d[&#39;Cp&#39;]= None
-    return d</code></pre>
-</details>
-</section>
-<section>
-</section>
-<section>
-</section>
-<section>
-<h2 class="section-title" id="header-functions">Functions</h2>
-<dl>
-<dt id="TDA_Prediction.MSE_fit"><code class="name flex">
-<span>def <span class="ident">MSE_fit</span></span>(<span>fit)</span>
-</code></dt>
-<dd>
-<section class="desc"><p>Takes in a fitted model as the input.
-Calculates the MSE of the fitted model.
-Outputs the model's MSE.</p></section>
-<details class="source">
-<summary>
-<span>Expand source code</span>
-</summary>
-<pre><code class="python">def MSE_fit(fit): 
-    &#34;&#34;&#34;
-    Takes in a fitted model as the input.
-    Calculates the MSE of the fitted model.
-    Outputs the model&#39;s MSE.
-    &#34;&#34;&#34;
-    MSE = None
-    return MSE</code></pre>
-</details>
-</dd>
-<dt id="TDA_Prediction.accuracy_metrics"><code class="name flex">
-<span>def <span class="ident">accuracy_metrics</span></span>(<span>fit, MSE)</span>
-</code></dt>
-<dd>
-<section class="desc"><p>This function is used for model validation. It returns a dictionary
-of several regression model accuracy metrics. Its inputs are a fitted model
-and the MSE of the fitted model.</p></section>
-<details class="source">
-<summary>
-<span>Expand source code</span>
-</summary>
-<pre><code class="python">def accuracy_metrics(fit, MSE):
-    &#34;&#34;&#34;
-    This function is used for model validation. It returns a dictionary
-    of several regression model accuracy metrics. Its inputs are a fitted model
-    and the MSE of the fitted model.
-    &#34;&#34;&#34;
-    d = dict()
-    sumObj = None
-    SSE = None
-    n = None
-    p = None
-    pr = None
-    d[&#39;R2&#39;] = None
-    d[&#39;R2ad&#39;] = None
-    d[&#39;AIC&#39;] = None
-    d[&#39;BIC&#39;] = None
-    d[&#39;PRESS&#39;] = None
-    d[&#39;Cp&#39;]= None
-    return d</code></pre>
-</details>
-</dd>
-<dt id="TDA_Prediction.datafetch"><code class="name flex">
-<span>def <span class="ident">datafetch</span></span>(<span>file_name)</span>
-</code></dt>
-<dd>
-<section class="desc"><p>Load real-world datasets from scikit-learn.</p></section>
-<details class="source">
-<summary>
-<span>Expand source code</span>
-</summary>
-<pre><code class="python">def datafetch(file_name):
-    &#34;&#34;&#34;
-    Load real-world datasets from scikit-learn.
-    &#34;&#34;&#34;
-    data = None
-    print(&#34;reading data from:&#34;, file_name)
-    return data</code></pre>
-</details>
-</dd>
-<dt id="TDA_Prediction.dataload"><code class="name flex">
-<span>def <span class="ident">dataload</span></span>(<span>)</span>
-</code></dt>
-<dd>
-<section class="desc"><p>Load toy datasets from scikit-learn.</p></section>
-<details class="source">
-<summary>
-<span>Expand source code</span>
-</summary>
-<pre><code class="python">def dataload():
-    &#34;&#34;&#34;
-    Load toy datasets from scikit-learn.
-    &#34;&#34;&#34;
-    data = None
-    return data</code></pre>
-</details>
-</dd>
-<dt id="TDA_Prediction.descriptive_statistic"><code class="name flex">
-<span>def <span class="ident">descriptive_statistic</span></span>(<span>df)</span>
-</code></dt>
-<dd>
-<section class="desc"><p>Provides brief descriptive statistics on dataset.
-Takes dataframe as input.</p></section>
-<details class="source">
-<summary>
-<span>Expand source code</span>
-</summary>
-<pre><code class="python">def descriptive_statistic(df):
-    &#34;&#34;&#34;
-    Provides brief descriptive statistics on dataset. 
-    Takes dataframe as input.
-    &#34;&#34;&#34;
-    print(&#34;Type : &#34;, None, &#34;\n\n&#34;)
-    print(&#34;Shape : &#34;, None)
-    print(&#34;Head -- \n&#34;, None)
-    print(&#34;\n\n Tail -- \n&#34;, None)
-    print(&#34;Describe : &#34;, None)</code></pre>
-</details>
-</dd>
-<dt id="TDA_Prediction.model_selection"><code class="name flex">
-<span>def <span class="ident">model_selection</span></span>(<span>df)</span>
-</code></dt>
-<dd>
-<section class="desc"><p>Takes dataframe as input. Performs forward/backward stepwise
-regression. Returns best model for both methods.</p></section>
-<details class="source">
-<summary>
-<span>Expand source code</span>
-</summary>
-<pre><code class="python">def model_selection(df):
-    &#34;&#34;&#34;
-    Takes dataframe as input. Performs forward/backward stepwise
-    regression. Returns best model for both methods.
-    &#34;&#34;&#34;
-    null_fit = None
-    foward_step = None
-    backward_step = None
-    return foward_step, backward_step</code></pre>
-</details>
-</dd>
-</dl>
-</section>
-<section>
-</section>
-</article>
-<nav id="sidebar">
-<h1>Index</h1>
-<div class="toc">
-<ul></ul>
-</div>
-<ul id="index">
-<li><h3><a href="#header-functions">Functions</a></h3>
-<ul class="">
-<li><code><a title="TDA_Prediction.MSE_fit" href="#TDA_Prediction.MSE_fit">MSE_fit</a></code></li>
-<li><code><a title="TDA_Prediction.accuracy_metrics" href="#TDA_Prediction.accuracy_metrics">accuracy_metrics</a></code></li>
-<li><code><a title="TDA_Prediction.datafetch" href="#TDA_Prediction.datafetch">datafetch</a></code></li>
-<li><code><a title="TDA_Prediction.dataload" href="#TDA_Prediction.dataload">dataload</a></code></li>
-<li><code><a title="TDA_Prediction.descriptive_statistic" href="#TDA_Prediction.descriptive_statistic">descriptive_statistic</a></code></li>
-<li><code><a title="TDA_Prediction.model_selection" href="#TDA_Prediction.model_selection">model_selection</a></code></li>
-</ul>
-</li>
-</ul>
-</nav>
-</main>
-<footer id="footer">
-<p>Generated by <a href="https://pdoc3.github.io/pdoc"><cite>pdoc</cite> 0.7.2</a>.</p>
-</footer>
-<script src="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/9.12.0/highlight.min.js"></script>
-<script>hljs.initHighlightingOnLoad()</script>
-</body>
-</html>
\ No newline at end of file
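
Note on the removed `TDA_Prediction` stubs: `MSE_fit` and `accuracy_metrics` only ever assigned `None`, so the metrics their docstrings describe were never computed. The following is a minimal sketch of what those docstrings seem to intend, not the author's implementation. It assumes an ordinary least-squares setting and takes observed values, predictions, and a parameter count instead of a fitted model object; the lowercase names `mse_fit` and `n_params` are hypothetical. AIC and BIC use the common Gaussian form up to an additive constant, and PRESS and Cp are left out because they need leave-one-out residuals and a full-model MSE that the stubs never define.

```python
import math


def sum_squared_error(y_true, y_pred):
    """Sum of squared residuals (SSE)."""
    return sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))


def mse_fit(y_true, y_pred, n_params):
    """Residual mean square: SSE / (n - p).

    n_params (p) counts every estimated coefficient, including the
    intercept. This signature is an assumption; the original stub
    took a fitted model object instead.
    """
    n = len(y_true)
    return sum_squared_error(y_true, y_pred) / (n - n_params)


def accuracy_metrics(y_true, y_pred, n_params):
    """Return a dict of regression fit metrics.

    Keys mirror the original stub ('R2', 'R2ad', 'AIC', 'BIC').
    Requires SSE > 0, since AIC and BIC take log(SSE / n).
    """
    n = len(y_true)
    p = n_params
    mean_y = sum(y_true) / n
    sse = sum_squared_error(y_true, y_pred)
    sst = sum((yt - mean_y) ** 2 for yt in y_true)  # total sum of squares

    d = dict()
    d['R2'] = 1 - sse / sst
    d['R2ad'] = 1 - (sse / (n - p)) / (sst / (n - 1))
    d['AIC'] = n * math.log(sse / n) + 2 * p
    d['BIC'] = n * math.log(sse / n) + p * math.log(n)
    return d
```

Both functions accept plain Python sequences, so they can be checked by hand on a few points before being wired to a real model.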
--- a/environment.yml
+++ b/environment.yml
+channels:
+  - defaults
+dependencies:
+  - pip
+  - scipy
+  - matplotlib
+  - jupyter
+  - cython
+  - numpy
+  - selenium
+  - pandas
+  - scikit-learn
+  - seaborn
+  - hypothesis
+  - requests
+  - plotly
+  - pdoc3
+  - pylint
+  - pytest
+  - autopep8
+  - pip:
+    - ripser
+    - kmapper
+    - persim
+    - python-igraph
+    - plotly
+    - ipywidgets
\ No newline at end of file
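
Note on `environment.yml`: several of the listed packages are imported under a different name than the one they are installed by (`scikit-learn` as `sklearn`, `python-igraph` as `igraph`). As a rough sanity check after creating the environment, a sketch like the one below reports which modules cannot be found. The helper name `missing_modules` and the `MODULES` list are illustrative assumptions, not part of this change.

```python
import importlib.util

# Import names assumed for a subset of the packages in environment.yml.
MODULES = [
    "scipy", "numpy", "pandas", "sklearn",
    "ripser", "kmapper", "persim", "igraph",
]


def missing_modules(names):
    """Return the names in `names` with no importable top-level module."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(MODULES)
print("missing:", ", ".join(missing) if missing else "none")
```

`find_spec` only locates a module without importing it, so the check stays fast even for heavy packages.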