Compare revisions

Changes are shown as if the source revision was being merged into the target revision.

Commits on Source (60)
Showing with 877 additions and 220 deletions
File deleted
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class
# C extensions
*.so
# Distribution / packaging
.Python
env/
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
venv/
*.egg-info/
.installed.cfg
*.egg
# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec
# Installer logs
pip-log.txt
pip-delete-this-directory.txt
# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
.hypothesis/
# Translations
*.mo
*.pot
# Django stuff:
*.log
# Sphinx documentation
docs/build/
docs/source/generated/
# pytest
.pytest_cache/
# PyBuilder
target/
# Editor files
#mac
.DS_Store
*~
#vim
*.swp
*.swo
#pycharm
.idea/
#VSCode
.vscode/
#Ipython Notebook
.ipynb_checkpoints
This is free and unencumbered software released into the public domain.
Anyone is free to copy, modify, publish, use, compile, sell, or
distribute this software, either in source code form or as a compiled
binary, for any purpose, commercial or non-commercial, and by any
means.
In jurisdictions that recognize copyright laws, the author or authors
of this software dedicate any and all copyright interest in the
software to the public domain. We make this dedication for the benefit
of the public at large and to the detriment of our heirs and
successors. We intend this dedication to be an overt act of
relinquishment in perpetuity of all present and future rights to this
software under copyright law.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
IN NO EVENT SHALL THE AUTHORS BE LIABLE FOR ANY CLAIM, DAMAGES OR
OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
OTHER DEALINGS IN THE SOFTWARE.
For more information, please refer to <https://unlicense.org>
File deleted
%% Cell type:markdown id: tags:
# <center>Graph Theory within Topological Data Analysis</center>
<center>by Shawk Masboob</center>
%% Cell type:markdown id: tags:
Graph theory draws on a limited amount of topological thought (i.e. it can be considered a subset of topology). Topological objects consist of nodes, edges, and faces. In graph theory, these objects arise from natural geo-spatial data (cell tower data, election data, etc.) or from artificial construction (social complexity theory, political relations theory, etc.). Some applications of graph theory involve using graphs to model relations and processes in physical, biological, social, and information systems. Graph theory is also used to represent computer networks.
Graph theory and topology are related in that a graph is a 1-simplicial complex. A simplicial complex is a finite collection of simplices. [1]. A simplex is a generalized triangle in arbitrary dimension: the 0-simplex is a point, the 1-simplex is a line segment, the 2-simplex is a triangle, the 3-simplex is a tetrahedron, the 4-simplex is a 5-cell, and so on. The figure below provides a simple example of simplices.
%% Cell type:markdown id: tags:
| ![alt text](simplex.png "simplex") |
|:--:|
|__Figure 1.__ From left to right: a point (vertex), a line segment (an edge), a triangle, and a tetrahedron.
Image Source: [1] |
%% Cell type:markdown id: tags:
The shift from a 1-simplicial complex to a 2-simplicial complex provides a simple demonstration of the relationship between graph theory and topology. It is challenging to discuss concepts within topology without implicitly referencing properties of graph theory.
As mentioned above, there are many applications within graph theory. These same applications can be carried out using topology, specifically topological data analysis (TDA). TDA seeks to analyze datasets using various techniques from topology. It is beneficial to represent the data points with simplicial complexes based on some distance metric. A simple way of converting data points into a global object is to use each point as a vertex of a graph and to let the edges of the graph be determined by proximity. [2]. The graph at this stage (a 1-simplicial complex), while capturing connectivity within the data, fails to capture higher-order features beyond clustering. [2]. Moving up the “simplicial scale” allows the researcher to see other interesting features that may not be so obvious when using a standard graph. The Cech complex and the Rips complex are the two most common methods used to “fill in” the higher-dimensional simplices. [2] Figure 2 demonstrates the process of converting a dataset into a higher-order simplicial complex.
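The proximity-graph construction just described can be sketched in a few lines of Python. This is a toy illustration, not code from the project; the point cloud and the threshold `epsilon` are invented for the example:

```python
from itertools import combinations
from math import dist

def proximity_graph(points, epsilon):
    """1-skeleton of a point cloud: each point becomes a vertex, and an
    edge joins any two points whose Euclidean distance is <= epsilon."""
    vertices = list(range(len(points)))
    edges = [(i, j) for i, j in combinations(vertices, 2)
             if dist(points[i], points[j]) <= epsilon]
    return vertices, edges

# Four points at the corners of a unit square: the four sides become
# edges, while the diagonals (length ~1.41) exceed epsilon and do not.
cloud = [(0, 0), (1, 0), (0, 1), (1, 1)]
vertices, edges = proximity_graph(cloud, epsilon=1.0)
```

Raising `epsilon` adds more edges; the Cech and Rips constructions mentioned above then fill such a graph in with higher-dimensional simplices.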
%% Cell type:markdown id: tags:
| ![alt text](complex.png "complex") |
|:--:|
|__Figure 2.__ The bottom left image is the Cech Complex and the bottom right image is the Rips Complex.
Image Source: [2] |
%% Cell type:markdown id: tags:
As seen in Figure 2, any point cloud can be converted into a higher-order simplicial complex once a proximity measure is selected.
A real-world example that combines graph theory and TDA is as follows: the human brain can be visualized using graph theory. However, the human brain is a very complex network that is hard to visualize. MAPPER (a Python TDA package) can be used to reduce the high-dimensional dataset without having to make many assumptions about the data. [3]. Doing so allows the researcher to visualize the data as a graph.
TDA is used to understand complex and high-dimensional data problems. The simple technique mentioned above is capable of pointing out natural patterns within a dataset that a standard graph cannot. Graph theory is useful if one seeks to construct a network. If one seeks to go beyond network construction and clustering, they ought to consider using TDA. That said, TDA is not independent of graph theory; there are theoretical similarities and shared applications. As noted above, a graph is a 1-simplicial complex.
%% Cell type:markdown id: tags:
---
# References
[1] Edelsbrunner, Herbert. “Simplicial Complexes.” COMPUTATIONAL TOPOLOGY, www2.cs.duke.edu/courses/fall06/cps296.1/.
[2] Ghrist, Robert. “Barcodes: The Persistent Topology of Data.” Bulletin of the American Mathematical Society, vol. 45, no. 01, 2007, pp. 61–76., doi:10.1090/s0273-0979-07-01191-3.
[3] Saggar, M., Sporns, O., Gonzalez-Castillo, J. et al. Towards a new approach to reveal dynamical organization of the brain using topological data analysis. Nat Commun 9, 1399 (2018). https://doi.org/10.1038/s41467-018-03664-4
{
"cells": [],
"metadata": {},
"nbformat": 4,
"nbformat_minor": 2
}
def dataload():
    """Load toy datasets from scikit-learn."""
    data = None  # placeholder
    return data


def datafetch(file_name):
    """Load real-world datasets from scikit-learn."""
    data = None  # placeholder
    print("reading data from:", file_name)
    return data


def descriptive_statistic(df):
    """Print brief descriptive statistics for a dataframe."""
    print("Type : ", None, "\n\n")
    print("Shape : ", None)
    print("Head -- \n", None)
    print("\n\n Tail -- \n", None)
    print("Describe : ", None)


def model_selection(df):
    """Perform forward/backward stepwise regression on a dataframe.
    Returns the best model found by each method."""
    null_fit = None
    forward_step = None
    backward_step = None
    return forward_step, backward_step


def MSE_fit(fit):
    """Take a fitted model as input, calculate its MSE, and return it."""
    MSE = None
    return MSE


def accuracy_metrics(fit, MSE):
    """Model validation: return a dictionary of regression accuracy
    metrics, given a fitted model and its MSE."""
    d = dict()
    sumObj = None
    SSE = None
    n = None
    p = None
    pr = None
    d['R2'] = None
    d['R2ad'] = None
    d['AIC'] = None
    d['BIC'] = None
    d['PRESS'] = None
    d['Cp'] = None
    return d
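The metrics stubbed out in `accuracy_metrics` could be filled in along these lines. This is a hedged sketch using common textbook formulas computed from raw residuals rather than a fitted-model object; the function name and toy data are invented, and the AIC convention shown is one of several in use:

```python
from math import log

def regression_metrics(y, y_hat, p):
    """Return R2, adjusted R2, and AIC for a regression fit with
    p predictors, computed from observed y and fitted values y_hat."""
    n = len(y)
    y_bar = sum(y) / n
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # residual SS
    sst = sum((yi - y_bar) ** 2 for yi in y)               # total SS
    r2 = 1 - sse / sst
    r2_adj = 1 - (1 - r2) * (n - 1) / (n - p - 1)
    aic = n * log(sse / n) + 2 * (p + 1)  # one common convention
    return {'R2': r2, 'R2ad': r2_adj, 'AIC': aic}

# Toy example: a near-perfect linear fit with one predictor.
metrics = regression_metrics([1, 2, 3, 4], [1.1, 1.9, 3.2, 3.8], p=1)
```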
File deleted
File deleted
# Topological Machine Learning
The first third of this project will apply traditional machine learning methods to a dataset. The purpose of this portion is to promote the benefits of TDA: to show data scientists that they can achieve very interesting results when using TDA. The second portion of this project will incorporate TDA with machine learning. This portion will be the most demanding due to the lack of references. I will attempt both a classification-type project and a prediction-type project. Additionally, I may have to rely on the HPCC to run my scripts (TDA is computationally expensive). The final portion of the project will be a Jupyter demo project for data scientists, as well as a short semi-theoretical document that discusses the more important features of TDA.
This project provides an easy-to-follow introduction to Topological Data Analysis (TDA). TDA is a machine learning tool that borrows concepts from topology and applies them to datasets. While the theory behind TDA is quite complex (though important and interesting!), this project focuses only on applications. There is, however, an introductory notebook which provides users with a brief introduction to the main concepts within TDA as well as the scikit-tda package. The remaining notebooks apply TDA to different machine learning tasks such as prediction and classification.
I hope to create scripts that are easy to follow for those who are not familiar with TDA. Ideally, I would like to create a document that scikit-tda is willing to publish. Hence, the document needs to be written in a manner that is easy to interpret by anyone who is new to TDA.
Future improvements include:
1. Fixing the prediction notebook.
2. Using the Ripser function to classify data.
3. Providing deeper analysis on the results.
Thankfully, Python already has the libraries that I will need: scipy, numpy, matplotlib, pandas, seaborn, and scikit-tda.
## Prerequisites
## Getting Started
The environment file contains all of the required packages. You may either activate the environment file or install the required packages individually.
* Update
### Prerequisites
You may need to install the Python TDA library. Installation from PyPI takes a single command: `pip install scikit-tda`. Some users may also need to install seaborn, pandas, or numpy.
```
pip install scikit-tda
```
To run the Jupyter notebooks successfully, you must run them through the project's environment.
## Authors
## License
* UPDATE
## Acknowledgments
* UPDATE
File deleted
%% Cell type:markdown id: tags:
# <center>Applying Topological Data Analysis to Agent Based Modeling</center>
<center>by Shawk Masboob</center>
%% Cell type:markdown id: tags:
Topological Data Analysis (TDA) tools can be applied to Agent Based Modeling (ABM). TDA can be used for models that reside within complicated spaces. TDA can also be applied to ABM when the outputs are high-dimensional. This report will provide two examples of how TDA can be used with ABM.
#### Example 1 - Biological Aggregation Models
This project applied TDA tools to biological aggregation models such as bird flocks, fish schools, and insect swarms. In these models, the agents interact with each other based on alignment, attraction, and/or repulsion, with each simulation time frame being a point cloud in position-velocity space. [2]. The data used within this research are the numerical simulation outputs from the models. Biological aggregation data is complicated in that there is a large quantity of it, so powerful tools that respond well to the complexity of the data are needed.
The researchers analyzed the topological structure of the point clouds by calculating the Betti numbers and then interpreting the persistent homology. [2]. The Betti numbers capture interesting features by counting the connected components, topological circles, and trapped volumes contained in the data. [2].
By applying TDA to the simulations, the researchers found some interesting features: “the homological measures distinguish simulations that the usual alignment order parameter cannot, …, there is topological similarity between different order parameter time series, the topological calculations recognize the presence of a double mill state.” [2]. These results may have gone unseen if the researchers had relied on other methods.
#### Example 2 - Agent Taxonomy
This report analyzed agent trajectories from a disaster simulation. The simulation is known as National Planning Scenario 1: a nuclear device is detonated in Washington, DC. The model includes “agent demographics, household structures, daily activity patterns, road networks, and various kinds of locations such as workplaces, schools, government buildings, etc.” [1]. The simulation also includes multiple infrastructures such as power, communication, transportation, and health. The researchers also modeled interactions between human behaviors and the infrastructures. [1]. The researchers sampled 10,000 agents of the total 730,833 that were modeled in the simulation. The aim of this research is to generate a taxonomy of agents based on the results of the simulation. “A taxonomy is rich because it not only identifies meaningful types from the data set, but also establishes relationships among those types.” [1].
The researchers applied TDA to this simulation in order to gather insight about the structure of the data. TDA is used to create a topological space in which the data set naturally exists. [1]. This space is represented by a proximity graph, which can be enhanced into higher dimensions. Higher dimensions reveal more interesting features about the space. For instance, TDA is capable of capturing the shape of the data as a graph even when the data points do not fall in the same clusters.
Using TDA, the researchers were able to create a taxonomy based on the agents' movements: “agents who are (1) close to ground zero from the beginning, but have low exposure, (2) far from ground zero to begin with but move closer and have slightly more exposure, and (3) close to ground zero and have high exposure.” [1]. What is interesting about this taxonomy is that it emerged naturally from the agents' movements, behavior, and communication. TDA was able to capture the complicated structure of the data set by allowing the researchers to see the natural connections.
%% Cell type:markdown id: tags:
---
# References
[1] Rezazadegan, Reza, and Samarth Swarup. “Generating an Agent Taxonomy Using Topological Data Analysis.” AAMAS: International Conference on Autonomous Agents and Multiagent Systems, 13 May 2017.
[2] Topaz CM, Ziegelmeier L, Halverson T (2015) Topological Data Analysis of Biological Aggregation Models. PLoS ONE 10(5): e0126383. https://doi.org/10.1371/journal.pone.0126383
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In order to successfully complete this assignment you need to commit this report to your project git repository on or before **11:59pm on Friday February 21**."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# <center>Using Agent Based Models in #your research area here#</center>\n",
"\n",
"<center>by #Your name#</center>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"# Instructions\n",
"\n",
"Write a ~1-2 page report about how Agent Based Models (ABMs) could be used in your area of research. There are three basic forms to this report:\n",
"\n",
"1. ABMs are already used in your area of research. Provide a summary and include references. If possible include a short example that could be demonstrated in class. \n",
"2. Describe how you could incorporate ABMs into your research domain. What research questions might you be able to ask, improve or validate?\n",
"3. ABMs can not be used in your area of research. Explain in detail why they can't be used. \n",
"\n",
"To be clear, I think 1 and 2 will be the most common reports. If you choose 3 you need to make a very convincing and thought out argument. \n",
"\n",
"You can write your report in any tool you want. I recommend Jupyter notebooks but Word, Latex, Google docs are all fine. If you use Jupyter notebooks (which I tend to like) make sure you do not include these instructions. Write the report so someone outside of the course could make sense of how ODEs can be used.\n",
"\n",
"**NOTE:** Make sure you remove all instructions from your reports. Make them readable so someone outside of the course could make sense of them.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"# References\n",
"\n",
"In all three example provide some references to papers you found that help illustrate your argument. I prefer references with links to papers. \n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"-----\n",
"### Congratulations, you are done!\n",
"\n",
"Now, you just need to commit and push this report to your project git repository. "
]
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
}
},
"nbformat": 4,
"nbformat_minor": 1
}
%% Cell type:markdown id: tags:
# <center>Using Machine Learning (ML) in TDA</center>
<center>by Shawk Masboob</center>
%% Cell type:markdown id: tags:
This project aims to incorporate Topological Data Analysis (TDA) with several machine learning methods in order to demonstrate the potential benefits of TDA within data science.
Traditional clustering methods can be “enhanced” by using TDA. Clustering is concerned with distance, whereas TDA uses other relationships, such as the number of holes contained within the data, to group points together. [2]. MAPPER begins by clustering the data points within an interval. The user can choose whatever clustering method and metric they desire; for instance, hierarchical clustering with the Euclidean distance metric. MAPPER then transforms the clusters into nodes within a graph. According to the developers of MAPPER, some points can exist within more than one node due to overlap. When nodes share members, an edge is drawn between them. [2]. The visualization provided by MAPPER gives interesting statistical results for each node that go beyond traditional clustering. It should also be noted that TDA is often used for classification. A quick Google search will reveal that TDA is often used to classify things such as animals, body parts, the presence of cancer, etc.
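The Mapper pipeline described above (lens, overlapping cover, per-interval clustering, edges on shared members) can be imitated in miniature without the real library. This toy sketch is not KeplerMapper's API: the 1-D lens, the cover parameters, and the gap-based clusterer are simplifications invented for illustration:

```python
from itertools import combinations

def mapper_1d(values, n_intervals=4, overlap=0.25, gap=1.0):
    """Toy Mapper on 1-D lens values: cover the range with overlapping
    intervals, cluster each interval's points by splitting at gaps
    larger than `gap`, then connect clusters (nodes) sharing a point."""
    lo, hi = min(values), max(values)
    length = (hi - lo) / n_intervals
    nodes = []
    for k in range(n_intervals):
        a = lo + k * length - overlap * length        # interval start
        b = lo + (k + 1) * length + overlap * length  # interval end
        members = sorted((i for i, v in enumerate(values) if a <= v <= b),
                         key=lambda i: values[i])
        cluster = []
        for i in members:
            if cluster and values[i] - values[cluster[-1]] > gap:
                nodes.append(cluster)  # gap found: close this cluster
                cluster = []
            cluster.append(i)
        if cluster:
            nodes.append(cluster)
    # Nodes that share a data point (cover overlap) get an edge.
    edges = [(p, q) for p, q in combinations(range(len(nodes)), 2)
             if set(nodes[p]) & set(nodes[q])]
    return nodes, edges

# Evenly spread lens values: the two cover intervals overlap on point 2,
# so the two resulting nodes are joined by an edge.
nodes, edges = mapper_1d([0, 1, 2, 3, 4], n_intervals=2)
```

A real Mapper run differs mainly in scale: the lens is usually a learned projection, and the per-interval clusterer is a proper algorithm such as DBSCAN or hierarchical clustering.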
TDA can be used for prediction, or more specifically, feature selection. Suppose one is trying to build a regression model. An important intermediate step is to perform feature selection. While there are many machine learning techniques (e.g. random forest) and statistical techniques (e.g. stepwise regression) that can be used for feature selection, one can also consider TDA. As done in this project (within the TDA_Prediction Python notebook), TDA is used to find the most prominent features for the multiple linear model. KeplerMapper, a Python TDA library, allows one to build graphs composed of nodes. These nodes contain data points, and each node carries different statistics. For instance, a study used TDA to determine how much people are willing to pay for air quality improvements. [1]. The researchers generated eleven nodes using MAPPER. The size of each node is related to the number of observations within it. The researchers used the color of each node to represent “the mean value of all entries in that node with respect to the chosen variable.” [1].
TDA is in itself an unsupervised machine learning tool. One of the most popular tools within TDA is persistent homology. The notebook titled “TDA_Voting” uses the Python Ripser persistent homology package to analyze county-level voting results from recent presidential elections. The aim of this notebook is to determine whether there is a natural pattern in voting habits and whether this pattern dissolved during the 2016 presidential election. To perform this analysis, the birth-death diagram (provided by Ripser) is used to spot persistent features. In this example, a persistent feature is a loop of counties that “behaved” similarly. To see more interesting features, the radius needs to be increased. However, the current radius is extremely small, so nothing interesting appeared. It should also be noted that TDA is computationally expensive, so a large radius might take hours or even days to compute.
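The birth-death pairing can be illustrated for 0-dimensional homology (connected components) without a TDA package: as the radius grows, each edge that merges two components "kills" one of them. This single-linkage sketch stands in for what Ripser computes in dimension 0; it is not Ripser's API, and the three-point cloud is invented:

```python
from itertools import combinations
from math import dist, inf

def h0_persistence(points):
    """Birth-death pairs for 0-dimensional homology of a point cloud.
    Every component is born at radius 0; processing edges in order of
    increasing length, each edge that merges two components kills one
    at that length. A single component survives forever (death = inf)."""
    parent = list(range(len(points)))  # union-find over the points

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    edges = sorted((dist(points[i], points[j]), i, j)
                   for i, j in combinations(range(len(points)), 2))
    pairs = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj        # merge: a component dies at radius d
            pairs.append((0.0, d))
    pairs.append((0.0, inf))       # the last component never dies
    return pairs

# Three collinear points: components merge at distances 1 and 4, so the
# diagram shows two finite bars and one infinite bar.
diagram = h0_persistence([(0, 0), (1, 0), (5, 0)])
```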
%% Cell type:markdown id: tags:
---
# References
[1] Allen, Dylan. Topological Data Analysis: Giving Data Shape. Carroll, 13 May 2017, scholars.carroll.edu/cgi/viewcontent.cgi?article=1000&context=mathengcompsci_theses.
[2] Keplermapper 1.2.0 Documentation a Scikit-tda Project
https://kepler-mapper.scikit-tda.org/theory.html
%% Cell type:code id: tags:
``` python
```
%% Cell type:markdown id: tags:
# <center>Using Ordinary Differential Equations (ODEs) with Topological Data Analysis</center>
<center>by Shawk Masboob</center>
%% Cell type:markdown id: tags:
Ordinary Differential Equations (ODEs) were not used in this project. However, ODEs can be incorporated with Topological Data Analysis (TDA). This report will cover how TDA could be used with ODEs in this project and will discuss how ODEs and TDA might be combined outside of it.
An ODE model consists of a set of differential equations that involve functions of only one independent variable and its derivative(s). [1]. ODEs are used to model dynamical systems in science (including social science) and engineering. Specific examples include predator-prey models, how gossip spreads, or the spread of a virus. TDA, on the other hand, utilizes methods from topology to study datasets. Persistent homology, one of the main tools in TDA, is used to measure the topological features of data (i.e. the shape of the dataset). TDA is capable of working with data that are high-dimensional, incomplete, or noisy. Based on this, researchers can apply TDA tools to perform additional analysis on their ODE models. That is, after developing an ODE model, the researcher can apply topological methods to understand the outputs of the model, especially if there is reason to believe that the output data has complex dimensionality.
This project provides several examples of how to apply TDA with several different machine learning techniques. An additional notebook could be added which incorporates both techniques. For example, an ODE model of the spread of disease can be developed. After changing the model slightly by adding different factors or more detail, the output data can be analyzed using TDA. Before using TDA, it is necessary to determine whether the data is linearly separable. If it is, other methods that work with linear data can be used (using TDA might yield “false” results because TDA will attempt to find a shape/pattern within the data even if no such pattern exists). If the data is compatible with TDA, the next step is to prepare the data to guarantee that there are no outliers. Then, the persistent homology of the space can be computed. The last step is to determine which observations compose the persistent features. Understanding why those particular features created the homology might shed more insight on the model.
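As a hedged sketch of the first step suggested above, the classic SIR equations (an assumed form of the disease model; the parameter values are invented) can be integrated with Euler's method to produce a trajectory point cloud that TDA tools could then examine:

```python
def sir_euler(beta=0.3, gamma=0.1, s0=0.99, i0=0.01, dt=0.1, steps=1000):
    """Euler integration of the SIR equations
        dS/dt = -beta*S*I,  dI/dt = beta*S*I - gamma*I,  dR/dt = gamma*I.
    Returns the trajectory as a list of (S, I, R) points, i.e. a point
    cloud that TDA tools could analyze."""
    s, i, r = s0, i0, 0.0
    trajectory = [(s, i, r)]
    for _ in range(steps):
        new_infections = beta * s * i
        recoveries = gamma * i
        s -= dt * new_infections
        i += dt * (new_infections - recoveries)
        r += dt * recoveries
        trajectory.append((s, i, r))
    return trajectory

trajectory = sir_euler()
```

Because each step moves mass from S to I to R, every point lies on the plane S + I + R = 1, so any topology found in this cloud lives within that simplex.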
Another example in which TDA can be used with ODE modeling is when the ODE network is complicated. Generally, it is preferred to have a model that is not overly complex (heavily detailed). However, if a model is complex, TDA can be used to make sense of the results. Persistent homology can be used to determine which sets of components within the model interact with one another.
An attempt was made to find research projects that utilize both ODEs and TDA. Unfortunately, no examples were found. While it is not impossible to use both methods in a research project, it is neither common nor natural. TDA is used to analyze large, complex data. Besides analyzing the shape of the data produced by an ODE model, there is not much application. Additionally, the data must not be linearly separable. TDA is generally used with time series analysis, data reduction, agent-based modeling, etc. ODEs and TDA are both wonderful methods that can be utilized.
%% Cell type:markdown id: tags:
---
# References
[1] https://link.springer.com/referenceworkentry/10.1007%2F978-1-4419-9863-7_381
%% Cell type:markdown id: tags:
# <center>Using Topological Data Analysis with Statistical Models</center>
<center>by Shawk Masboob</center>
%% Cell type:markdown id: tags:
Topological data analysis (TDA), and more specifically Mapper, can be used to enhance statistical modeling. Mapper is a useful tool for visualizing large datasets because it transforms the data into a graph which can then be further analyzed. Standard statistical methods can often miss, or “not see,” the complexities that lie within data. Hence, using Mapper can greatly improve statistical modeling tasks such as classification, prediction, or forecasting.
This project incorporates statistical modeling with TDA to demonstrate how Mapper works. The project used the California Housing dataset. It begins by using standard statistical methods for exploratory data analysis and for removing multicollinearity, for example via the variance inflation factor. Stepwise regression was used to determine which features should be used in the model. After determining the necessary features, multiple linear regression was used to predict house price from attributes such as the number of bedrooms and location.
Mapper was used after building the linear model. While Mapper has multiple purposes, in this project it was used to evaluate the dataset and determine whether certain portions of it behave differently. Mapper allows users to choose different lenses, covers, and clustering methods. The point was to find natural separation within the data. A visualization was created to demonstrate the separated nodes. The figure below shows the Mapper output for the California dataset.
![alt text](mapper.png "simplex")
Although the lenses used need to be improved because not enough separation is being shown, one can analyze the nodes individually. The idea is that separation between the nodes implies the data behaves differently; hence, running a regression on each node individually might present interesting results. The next steps in this project are to (1) improve the separation by changing the lenses or clustering method and (2) run a regression on each node. The nodes might keep different features; e.g. one node might exclude ‘number of bedrooms’ because it does not contribute to predicting house price, while another node might include it.
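The per-node regression idea can be sketched with a closed-form simple linear fit. The two "nodes" and their data below are hypothetical stand-ins for the clusters a Mapper run might produce:

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y ~ slope*x + intercept."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    return slope, y_bar - slope * x_bar

# Two hypothetical Mapper nodes whose members follow different trends.
nodes = {
    'node_a': ([1, 2, 3], [2, 4, 6]),   # members trend as y = 2x
    'node_b': ([1, 2, 3], [5, 4, 3]),   # members trend as y = -x + 6
}
# One regression per node: each node gets its own (slope, intercept).
models = {name: fit_line(xs, ys) for name, (xs, ys) in nodes.items()}
```

If the fitted slopes differ sharply between nodes, that supports the claim that the separated portions of the data behave differently.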
There are plenty of examples which incorporate TDA into statistical analysis using Mapper. For example, one study examined Wisconsin cancer data. The aim of the research was to classify breast cancer patients, that is, to determine which of the 11 predictor variables have the most influence on the diagnosis [1]. The researchers began by building a logistic regression model. They concluded that although a logistic model is appropriate for their data and research topic, there was still room for improvement because misdiagnoses remained an issue [1]. After the statistical analysis was completed, the researchers added Mapper to improve the research. TDA was capable of sorting between people with benign tumors and those with malignant tumors. After analyzing the nodes at a high level, the researchers looked deeper at the nodes to understand what caused the separation [1]. The researchers, after using an exhaustive search, found that the nodes do behave differently, and hence separate logistic models were created for each node [1]. To conclude, the researchers found that standard statistical modeling was prone to false negatives and false positives. Using TDA, they found that “there is a subset of patients that share similarities in many attributes such as mean area and perimeter, but differ wildly in the smoothness of the tumor, and this observation leads to a different model and consequently a different diagnosis” [1].
The main idea behind using TDA when building a statistical model (e.g. multiple linear regression) is that data can behave differently and sometimes relying on traditional statistical methods is not enough. That is, standard statistical methods do not always capture the behavior of the data and hence, the developed models do not lead to the best prediction.
%% Cell type:markdown id: tags:
---
# References
[1] Allen, Dylan. Topological Data Analysis: Giving Data Shape. Carroll, 13 May 2017, scholars.carroll.edu/cgi/viewcontent.cgi?article=1000&context=mathengcompsci_theses.
%% Cell type:markdown id: tags:
# <center> Topological Machine Learning </center>
<center>By Shawk Masboob </center>
%% Cell type:markdown id: tags:
<img src="https://scikit-tda.org/_static/logo.png" width="20%">
Image from: https://scikit-tda.org/#
%% Cell type:markdown id: tags:
---
# Authors
Shawk Masboob
%% Cell type:markdown id: tags:
---
# Abstract
Topological Data Analysis (TDA) is a relatively new field with many useful applications. Essentially, TDA borrows tools from topology in order to study data. That is, TDA seeks to determine whether a particular dataset has shape and what the shape of the dataset implies. It can be used independently or applied alongside other machine learning techniques. This particular project aims at applying TDA to various machine learning methods in order to demonstrate its benefits. To do so, the project first presents the general theory of TDA and introduces the available TDA software. Then, several notebooks illustrate how TDA can be applied to machine learning.
%% Cell type:markdown id: tags:
----
# Statement of Need
The purpose of this project is to introduce data scientists to Topological Data Analysis (TDA) through various examples. That is, this project aims to demonstrate applications of TDA to those who do not have a background in mathematics.
%% Cell type:markdown id: tags:
----
# Installation instructions
The environment.yml file contains all of the required dependencies. To install the required modules, run the following command in the terminal:
`make init`
%% Cell type:markdown id: tags:
----
# Unit Tests
Unit testing is done on the following two functions: `lens_1d` and `uniform_sampling`. These tests verify that each function validates its inputs correctly.
To run a unit test, simply write `make test` in the terminal.
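The following is a hypothetical example of the kind of input-validation test that `make test` runs. The helper `check_projection` and its accepted projection names are illustrative assumptions, not the project's actual API; the real tests target `lens_1d` and `uniform_sampling`.

``` python
# Illustrative input validation of the sort a lens-building helper might do,
# plus checks that bad inputs are rejected.
import numpy as np

def check_projection(X, projection):
    """Validate inputs the way a lens-building helper might."""
    if not isinstance(X, np.ndarray) or X.ndim != 2:
        raise TypeError("X must be a 2-D numpy array")
    if projection not in {"max", "min", "sum", "mean"}:
        raise ValueError(f"unknown projection: {projection!r}")
    return True

# valid input passes
ok = check_projection(np.zeros((3, 2)), "max")

# invalid inputs raise
try:
    check_projection([[1, 2]], "max")   # list, not ndarray
    raised_type = False
except TypeError:
    raised_type = True
```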
%% Cell type:markdown id: tags:
---
# Methodology
I was able to meet the majority of my initial goals. I created different notebooks that used TDA in different ways. Additionally, I created a background notebook that gives a basic introduction to TDA. While I was able to meet my general goals, I think I could have approached this project differently. For instance, Ripser and Mapper are the two most used libraries from the scikit-tda package, and I think I should have introduced them more thoroughly. Both libraries have many applications which my project did not highlight. For instance, Ripser can be used for nonlinear time series analysis, feature selection, classification, etc. While it is impossible to demonstrate all of the possible applications of Ripser, I could have discussed them more and linked some articles - though the vast majority of the papers I’ve come across focus on the mathematical side of TDA rather than the applications.
Mapper was incredibly challenging to use. I spent the vast majority of the semester trying to understand and apply it. I was able to do basic classification using Mapper, although I am still unsure of the best method for improving the model. I also do not know how to compute an accuracy score; I am only able to look at the output graph and judge whether the data separated well. I have yet to figure out how topologists determine the overall accuracy of their classifications. Additionally, I was not able to fully complete the prediction notebook. I know how to use Mapper in the sense that I was able to separate the data visually; the next task is to learn how to extract the data from the graph. After doing so, I can easily build a new predictive model. In general, I need to experiment more with Mapper in order to better understand how to implement it.
%% Cell type:markdown id: tags:
---
# Concluding Remarks
I have a much deeper understanding of TDA because of this project. Before this project, I thought Ripser was only used for persistence diagrams. That is, I thought you could only do an exploratory analysis with Ripser. I was unaware that Ripser is also used in time series analysis and classification. As for Mapper, I had no idea how to work with it in general. I was not formally introduced to Mapper - everything I currently know comes from articles/papers I’ve read. My background in Mapper is still limited, but I feel more confident using it and experimenting with its parameters.
For future work, I would like to expand my work with Mapper. I would like to create more wrapper functions that simplify the steps that go into building a graph. Additionally, I would like to learn how to color nodes so that I can gain more insight. For instance, I would like to learn how to color nodes based on the proportion of y1 to y2. My current understanding of Mapper is shallow, so I would like to deepen it and add detailed applications to my project. Additionally, I would like to create more notebooks that do not use toy datasets. Toy datasets are easy to work with but do not add enough complexity. To fully demonstrate the applications of Mapper, I need to work with more challenging data.
%% Cell type:markdown id: tags:
----
# References
Individual notebooks have their own reference section. However, all notebooks use the scikit-tda package.
Saul, Nathaniel and Tralie, Chris. (2019). Scikit-TDA: Topological Data Analysis for Python. Zenodo. http://doi.org/10.5281/zenodo.2533369
%% Cell type:markdown id: tags:
<h1><center>Classification using Topological Data Analysis</center></h1>
<img src="https://cdn.vox-cdn.com/thumbor/GcZR8_tOztIDAiSlX47_5oyZ-js=/0x0:1599x1066/1200x800/filters:focal(834x375:1088x629)/cdn.vox-cdn.com/uploads/chorus_image/image/55588811/King_Estate_Winery_NEXT_Amazon_Wine.0.jpg" width="70%">
<p style="text-align: center;">Image from: https://www.vox.com/2017/7/6/15926476/amazon-next-wine-king-vintners-king-estate-winery</p>
%% Cell type:markdown id: tags:
In this notebook, we will classify wine quality using topological data analysis with Mapper.
The general motivation of this notebook is to demonstrate how to use Mapper and how to apply TDA to classification related models.
%% Cell type:code id: tags:
``` python
# imports
from Topological_ML import tda_function as tda
import pandas as pd
import numpy as np
import sklearn
from sklearn import ensemble
import kmapper as km
from kmapper.plotlyviz import *
import matplotlib.pyplot as plt
import plotly.graph_objs as go
from ipywidgets import (HBox, VBox)
import warnings
warnings.filterwarnings("ignore")
```
%% Cell type:markdown id: tags:
First, we download the wine dataset from Scikit Learn.
%% Cell type:code id: tags:
``` python
# import wine dataset from scikit learn
from sklearn.datasets import load_wine
wine = load_wine()
df = pd.DataFrame(wine['data'],columns = wine['feature_names'])
df['quality'] = wine['target']
df.head()
```
%% Output
alcohol malic_acid ash alcalinity_of_ash magnesium total_phenols \
0 14.23 1.71 2.43 15.6 127.0 2.80
1 13.20 1.78 2.14 11.2 100.0 2.65
2 13.16 2.36 2.67 18.6 101.0 2.80
3 14.37 1.95 2.50 16.8 113.0 3.85
4 13.24 2.59 2.87 21.0 118.0 2.80
flavanoids nonflavanoid_phenols proanthocyanins color_intensity hue \
0 3.06 0.28 2.29 5.64 1.04
1 2.76 0.26 1.28 4.38 1.05
2 3.24 0.30 2.81 5.68 1.03
3 3.49 0.24 2.18 7.80 0.86
4 2.69 0.39 1.82 4.32 1.04
od280/od315_of_diluted_wines proline quality
0 3.92 1065.0 0
1 3.40 1050.0 0
2 3.17 1185.0 0
3 3.45 1480.0 0
4 2.93 735.0 0
%% Cell type:markdown id: tags:
Now that the data is downloaded, we separate the response from the features and build a simplicial complex based on the features. The `lens_1d` function has other options that one can experiment with in order to build a better simplicial complex.
%% Cell type:code id: tags:
``` python
# separate features and response
feature_names = [c for c in df.columns if c not in ["quality"]]
X = np.array(df[feature_names])
y = np.array(df["quality"])

# you may choose any lens type here
lens, mapper = tda.lens_1d(X, "max")

# Define the simplicial complex
scomplex = mapper.map(lens,
                      X,
                      nr_cubes=15,
                      overlap_perc=0.7,
                      clusterer=sklearn.cluster.KMeans(n_clusters=2,
                                                       random_state=3471))
```
%% Cell type:markdown id: tags:
The following code, borrowed from scikit-TDA, uses the simplicial complex that we just defined to build a graph. The majority of the code is building an interactive plot within a notebook.
%% Cell type:code id: tags:
``` python
# color scale
pl_brewer = [[0.0, '#006837'],
             [0.1, '#1a9850'],
             [0.2, '#66bd63'],
             [0.3, '#a6d96a'],
             [0.4, '#d9ef8b'],
             [0.5, '#ffffbf'],
             [0.6, '#fee08b'],
             [0.7, '#fdae61'],
             [0.8, '#f46d43'],
             [0.9, '#d73027'],
             [1.0, '#a50026']]

color_function = lens[:, 0] - lens[:, 0].min()
my_colorscale = pl_brewer

kmgraph, mapper_summary, colorf_distribution = get_mapper_graph(scomplex,
                                                                color_function,
                                                                color_function_name='Distance to x-max',
                                                                colorscale=my_colorscale)

# assign to node['custom_tooltips'] the node label:
# 0 - low quality, 1 - medium quality, 2 - high quality
for node in kmgraph['nodes']:
    node['custom_tooltips'] = y[scomplex['nodes'][node['name']]]

bgcolor = 'rgba(10,10,10, 0.9)'
# on a black background the gridlines are set on grey
y_gridcolor = 'rgb(150,150,150)'

plotly_graph_data = plotly_graph(kmgraph, graph_layout='fr', colorscale=my_colorscale,
                                 factor_size=2.5, edge_linewidth=0.5)
layout = plot_layout(title='Topological network representing the<br> wine quality dataset',
                     width=620, height=570,
                     annotation_text=get_kmgraph_meta(mapper_summary),
                     bgcolor=bgcolor)

fw_graph = go.FigureWidget(data=plotly_graph_data, layout=layout)
fw_hist = node_hist_fig(colorf_distribution, bgcolor=bgcolor,
                        y_gridcolor=y_gridcolor)
fw_summary = summary_fig(mapper_summary, height=300)

dashboard = hovering_widgets(kmgraph,
                             fw_graph,
                             ctooltips=True,
                             bgcolor=bgcolor,
                             y_gridcolor=y_gridcolor,
                             member_textbox_width=600)

# Update the fw_graph colorbar, setting its title:
fw_graph.data[1].marker.colorbar.title = 'dist to<br>x-min'
dashboard
```
%% Output
%% Cell type:markdown id: tags:
Several observations can be made:
1. The top half of the graph is composed of wine with quality 0
2. The middle region is composed of all wine types
3. The bottom region is composed of wine with quality 1 and 2
We attained some separability; however, this model can be improved by using different clustering methods or a different filter function. The current graph does not separate quality 1 and 2 very well.
Further insight can be gained, although it requires more coding. For instance, one can color the nodes so that they show the proportion of high-quality wine relative to the rest.
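The node-coloring idea above can be sketched without any plotting. Mapper's output exposes a `nodes` dictionary mapping each node name to the row indices of its members; the tiny hand-made dictionary and labels below are a stand-in for the real `scomplex['nodes']` and `y` built in the cells above.

``` python
# Compute, for each node, the fraction of members in a target class.
# These per-node proportions could then drive a node color scale.
import numpy as np

y_demo = np.array([0, 0, 1, 1, 2, 2, 1, 0])   # stand-in class labels
nodes = {                                      # stand-in for scomplex['nodes']
    'cube0_cluster0': [0, 1, 2],
    'cube1_cluster0': [2, 3, 6],
    'cube2_cluster1': [4, 5, 7],
}

def class_proportion(nodes, y, target_class):
    """Fraction of each node's members belonging to `target_class`."""
    return {name: float(np.mean(y[idx] == target_class))
            for name, idx in nodes.items()}

props = class_proportion(nodes, y_demo, target_class=1)
for name, p in sorted(props.items()):
    print(f"{name}: {p:.2f} of members are class 1")
```

With the wine graph, passing `scomplex['nodes']` and the `y` array from earlier cells would give one proportion per node, ready to feed into a color function.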