
Compare revisions

Commits on Source (20)
Showing with 1831 additions and 466048 deletions
%% Cell type:markdown id: tags:
This project is composed of several jupyter notebooks that explain how to use topological data analysis with different machine learning techniques.
While the notebooks are independent of one another, it is helpful to first read the notebook titled "TDA_Background," which provides a general background on the theory. It also introduces the main TDA Python functions.
The remaining notebooks provide different applications of TDA.
# Topological Machine Learning
The first third of this project will require the use of traditional machine learning methods on a dataset. The purpose of this portion of the project is to promote the benefits of TDA. It is used to show data scientists that they can achieve very interesting results when using TDA. The second portion of this project will incorporate TDA with machine learning. This portion of the project will be the most demanding due to the lack of references. I will attempt to perform both a classification-type project and a prediction-type project. Additionally, I may have to rely on HPCC to run my script (TDA is computationally expensive). The final portion of the project will be creating a Jupyter demo project for data scientists as well as a short semi-theoretical document that discusses the more important features of TDA.
This project provides an easy-to-follow introduction to Topological Data Analysis (TDA). TDA is a machine learning tool that borrows concepts from topology and applies them to datasets. While the theory behind TDA is quite complex (though important and interesting!), this project will only focus on applications. There is, however, an introductory notebook which provides the user with a brief introduction to the main concepts within TDA as well as the scikit-tda package. The remaining notebooks apply TDA with different machine learning methods such as prediction and classification.
I hope to create a script that is easy for those who are not familiar with TDA to follow. Ideally, I would like to create a document that scikit-tda is willing to publish. Hence, the document needs to be constructed in a manner that is easy to interpret by anyone who is new to TDA.
Future improvements include:
1. Fixing the prediction notebook.
2. Using the Ripser function to classify data.
3. Providing deeper analysis on the results.
Thankfully, Python already has the libraries that I will need. They are as follows: scipy, numpy, matplotlib, pandas, seaborn, scikit-tda
## Prerequisites
## Getting Started
The environment file contains all of the required packages. You may either activate the environment file or install the required packages individually.
* Update
### Prerequisites
You may need to install the Python TDA library. It can be installed from PyPI with a single command: `pip install scikit-tda`. The necessary packages are seaborn, pandas, numpy, matplotlib, scipy, jupyter, Cython, scikit-tda, Ripser, and persim. All can be installed using:
```
conda install 'package'
```
To run the Jupyter notebooks successfully, you must run them through the project's environment.
## Authors
......@@ -49,8 +44,4 @@ OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
OTHER DEALINGS IN THE SOFTWARE.
For more information, please refer to <https://unlicense.org>
## Acknowledgments
* UPDATE
\ No newline at end of file
For more information, please refer to <https://unlicense.org>
\ No newline at end of file
%% Cell type:markdown id: tags:
# <center> Topological Machine Learning </center>
<center>By Shawk Masboob </center>
%% Cell type:markdown id: tags:
<img src="https://scikit-tda.org/_static/logo.png" width="20%">
Image from: https://scikit-tda.org/#
%% Cell type:markdown id: tags:
---
# Authors
Shawk Masboob
%% Cell type:markdown id: tags:
---
# Abstract
Topological Data Analysis (TDA) is a relatively new field with many useful applications. Essentially, TDA borrows tools from topology in order to study data. That is, TDA seeks to determine whether a particular dataset has shape and what the shape of the dataset implies. It can be used independently or applied to other machine learning techniques. This project aims to apply TDA to various machine learning methods in order to demonstrate its benefits. To do so, this project first presents the general theory of TDA and introduces the available TDA software. Then, several notebooks illustrate how TDA can be applied to machine learning.
%% Cell type:markdown id: tags:
----
# Statement of Need
The purpose of this project is to introduce data scientists to Topological Data Analysis (TDA) through various examples. That is, this project aims to demonstrate applications of TDA to those who do not have a background in mathematics.
%% Cell type:markdown id: tags:
----
# Installation instructions
The environment.yml file contains all of the required dependencies. To install the required modules, run the following command in the terminal:
`make init`
%% Cell type:markdown id: tags:
----
# Unit Tests
Unit testing is done on the following two functions: `lens_1d` and `uniform_sampling`. The tests verify that each function returns a result for valid inputs and rejects invalid ones.
To run a unit test, simply write `make test` in the terminal.
%% Cell type:markdown id: tags:
---
# Methodology
I was able to meet the majority of my initial goals. I created different notebooks that used TDA in different ways. Additionally, I created a background notebook that gave a basic introduction to TDA. While I was able to meet my general goals, I think I could have approached this project differently. For instance, Ripser and Mapper are the two most used libraries from the scikit-tda package. I think I should have introduced these libraries more thoroughly. Both of these libraries have many applications which my project did not highlight. For instance, Ripser can be used for nonlinear time series analysis, feature selection, classification, etc. While it is impossible to demonstrate all of the possible applications for Ripser, I could have discussed them more and linked some articles, though the vast majority of the papers I’ve come across focus on the mathematical side of TDA rather than the application.
Mapper was incredibly challenging to use. I spent the vast majority of the semester trying to understand and apply it. I was able to do basic classification using Mapper, although I am still unsure what the best method is in order to improve the model. I also do not know how to get some sort of accuracy score. I am only able to look at the output graph and determine whether the data separated well. I have yet to figure out how topologists determine the overall accuracy of their classification. Additionally, I was not able to fully complete the prediction notebook. I know how to use Mapper in the sense that I was able to separate the data visually. The next task is to learn how to extract the data. After doing so, I can easily build a new predictive model. In general, I need to experiment more with Mapper in order to better understand how to implement it.
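One possible way to put a number on how well a Mapper graph separates the classes is a node purity score: for every node, count how many of its members carry the node's majority label, then divide by the total number of memberships. The sketch below is only a heuristic under stated assumptions, not an established accuracy measure and not something used in these notebooks. The function name `mapper_purity` and the toy data are hypothetical; the only thing taken from the project is that the Mapper output stores, for each node, a list of member row indices. Points that fall in several overlapping nodes are counted once per node.

```python
from collections import Counter


def mapper_purity(nodes, labels):
    """Share of node memberships whose label equals the node's majority label."""
    matched = 0
    total = 0
    for member_ids in nodes.values():
        if not member_ids:
            continue  # skip empty nodes
        counts = Counter(labels[i] for i in member_ids)
        matched += counts.most_common(1)[0][1]  # size of the majority class
        total += len(member_ids)
    return matched / total if total else 0.0


# Hypothetical stand-ins for the node dictionary and the label array
toy_nodes = {"a": [0, 1, 2], "b": [2, 3, 4]}
toy_labels = [0, 0, 1, 1, 1]
purity = mapper_purity(toy_nodes, toy_labels)
print(round(purity, 3))
```

A score near 1 means most nodes contain a single class; a lower score means the classes are mixed within nodes.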
%% Cell type:markdown id: tags:
---
# Concluding Remarks
I have a much deeper understanding of TDA because of this project. Before this project, I thought Ripser was only used for persistence diagrams. That is, I thought you can only do an exploratory analysis with Ripser. I was unaware that Ripser is used in time series analysis or classification. As for Mapper, I had no idea how to work with it in general. I was not formally introduced to Mapper - everything I currently know comes from articles/papers I’ve read. My background in Mapper is still limited but I feel more confident using it and experimenting with the parameters.
For future work, I would like to expand my work with Mapper. I would like to create more wrapper functions that simplify the steps that go into building a graph. Additionally, I would like to learn how to color nodes so that I can gain more insight. For instance, I would like to learn how to color nodes based on the proportion of y1 to y2. My current understanding of Mapper is shallow, so I would like to expand it and add detailed applications to my project. Additionally, I would like to create more notebooks that do not use toy datasets. Toy datasets are easy to work with but do not add enough complexity. To fully demonstrate the applications of Mapper, I need to work with more challenging data.
%% Cell type:markdown id: tags:
----
# References
Individual notebooks have their own reference section. However, all notebooks use the scikit-tda package.
Saul, Nathaniel and Tralie, Chris. (2019). Scikit-TDA: Topological Data Analysis for Python. Zenodo. http://doi.org/10.5281/zenodo.2533369
This diff is collapsed.
%% Cell type:markdown id: tags:
<h1><center>Classification using Topological Data Analysis</center></h1>
<img src="https://i.pinimg.com/564x/42/23/f5/4223f59e7e0a7e69e2b73118c7a69cbb.jpg" width="50%">
<p style="text-align: center;">Image from: https://www.pinterest.jp/pin/406942516319576339/</p>
<img src="https://cdn.vox-cdn.com/thumbor/GcZR8_tOztIDAiSlX47_5oyZ-js=/0x0:1599x1066/1200x800/filters:focal(834x375:1088x629)/cdn.vox-cdn.com/uploads/chorus_image/image/55588811/King_Estate_Winery_NEXT_Amazon_Wine.0.jpg" width="70%">
<p style="text-align: center;">Image from: https://www.vox.com/2017/7/6/15926476/amazon-next-wine-king-vintners-king-estate-winery</p>
%% Cell type:markdown id: tags:
Purpose of Notebook.
In this notebook, we will classify wine quality using topological data analysis with Mapper.
The general motivation of this notebook is to demonstrate how to use Mapper and how to apply TDA to classification related models.
%% Cell type:code id: tags:
``` python
# imports
from Topological_ML import tda_function as tda
import pandas as pd
import numpy as np
import sklearn
from sklearn import ensemble
import kmapper as km
from kmapper.plotlyviz import *
import matplotlib.pyplot as plt
import plotly.graph_objs as go
from ipywidgets import (HBox, VBox)
import warnings
warnings.filterwarnings("ignore")
```
%% Cell type:markdown id: tags:
First, we load the wine dataset from scikit-learn.
%% Cell type:code id: tags:
``` python
# import wine dataset from scikit learn
from sklearn.datasets import load_wine
wine = load_wine()
df = pd.DataFrame(wine['data'],columns = wine['feature_names'])
df['quality'] = wine['target']
df.head()
```
%% Output
alcohol malic_acid ash alcalinity_of_ash magnesium total_phenols \
0 14.23 1.71 2.43 15.6 127.0 2.80
1 13.20 1.78 2.14 11.2 100.0 2.65
2 13.16 2.36 2.67 18.6 101.0 2.80
3 14.37 1.95 2.50 16.8 113.0 3.85
4 13.24 2.59 2.87 21.0 118.0 2.80
flavanoids nonflavanoid_phenols proanthocyanins color_intensity hue \
0 3.06 0.28 2.29 5.64 1.04
1 2.76 0.26 1.28 4.38 1.05
2 3.24 0.30 2.81 5.68 1.03
3 3.49 0.24 2.18 7.80 0.86
4 2.69 0.39 1.82 4.32 1.04
od280/od315_of_diluted_wines proline quality
0 3.92 1065.0 0
1 3.40 1050.0 0
2 3.17 1185.0 0
3 3.45 1480.0 0
4 2.93 735.0 0
%% Cell type:markdown id: tags:
Now that the data is loaded, we separate the response from the features and build a simplicial complex based on the features. The `lens_1d` function has other projection options that one can experiment with in order to build a better simplicial complex.
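To give some intuition for how the cover is built, here is a minimal sketch that splits a 1-D range of lens values into overlapping intervals. It is a simplified illustration only, not KeplerMapper's internal implementation: the helper names `interval_cover` and `members` and the lens values are made up for this example, and the lens used in this notebook is 2-D rather than 1-D. The arguments `n_intervals` and `overlap` loosely play the role of `nr_cubes` and `overlap_perc` in the `mapper.map` call.

```python
def interval_cover(lo, hi, n_intervals, overlap):
    """Return n_intervals overlapping (start, end) intervals covering [lo, hi].

    Neighbouring intervals share a fraction `overlap` of their length.
    """
    if n_intervals < 1 or not 0 <= overlap < 1:
        raise ValueError("need n_intervals >= 1 and 0 <= overlap < 1")
    # Solve length * (1 + (n - 1) * (1 - overlap)) = hi - lo for the length.
    length = (hi - lo) / (1 + (n_intervals - 1) * (1 - overlap))
    step = length * (1 - overlap)
    return [(lo + i * step, lo + i * step + length) for i in range(n_intervals)]


def members(values, interval):
    """Indices of the lens values that fall inside one interval."""
    start, end = interval
    return [i for i, v in enumerate(values) if start <= v <= end]


lens_values = [0.0, 0.1, 0.35, 0.5, 0.8, 1.0]  # made-up 1-D lens values
cover = interval_cover(min(lens_values), max(lens_values), n_intervals=3, overlap=0.5)
for interval in cover:
    print(interval, members(lens_values, interval))
```

Mapper then clusters the points inside each interval separately and connects two clusters whenever they share points, which is why a larger overlap tends to produce more edges in the graph.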
%% Cell type:code id: tags:
``` python
# separate features and response
feature_names = [c for c in df.columns if c not in ["quality"]]
X = np.array(df[feature_names])
y = np.array(df["quality"])
# you may choose any lens type here
lens, mapper = tda.lens_1d(X,"max")
# Define the simplicial complex
scomplex = mapper.map(lens,
X,
nr_cubes=15,
overlap_perc=0.7,
clusterer=sklearn.cluster.KMeans(n_clusters=2,
random_state=3471))
```
%% Cell type:markdown id: tags:
The following code, borrowed from scikit-TDA, uses the simplicial complex that we just defined to build a graph. The majority of the code is building an interactive plot within a notebook.
%% Cell type:code id: tags:
``` python
# color scale
pl_brewer = [[0.0, '#006837'],
[0.1, '#1a9850'],
[0.2, '#66bd63'],
[0.3, '#a6d96a'],
[0.4, '#d9ef8b'],
[0.5, '#ffffbf'],
[0.6, '#fee08b'],
[0.7, '#fdae61'],
[0.8, '#f46d43'],
[0.9, '#d73027'],
[1.0, '#a50026']]
# color each point by the distance of its first lens component to the minimum
color_function = lens[:, 0] - lens[:, 0].min()
my_colorscale = pl_brewer
kmgraph, mapper_summary, colorf_distribution = get_mapper_graph(scomplex,
                                                                color_function,
                                                                color_function_name='Distance to x-min',
                                                                colorscale=my_colorscale)
# assign to node['custom_tooltips'] the node label: 0 - low quality, 1 - medium quality, 2 - high quality
for node in kmgraph['nodes']:
    node['custom_tooltips'] = y[scomplex['nodes'][node['name']]]
bgcolor = 'rgba(10,10,10, 0.9)'
# on a black background the gridlines are set on grey
y_gridcolor = 'rgb(150,150,150)'
plotly_graph_data = plotly_graph(kmgraph, graph_layout='fr', colorscale=my_colorscale,
factor_size=2.5, edge_linewidth=0.5)
layout = plot_layout(title='Topological network representing the<br> wine quality dataset',
width=620, height=570,
annotation_text=get_kmgraph_meta(mapper_summary),
bgcolor=bgcolor)
fw_graph = go.FigureWidget(data=plotly_graph_data, layout=layout)
fw_hist = node_hist_fig(colorf_distribution, bgcolor=bgcolor,
y_gridcolor=y_gridcolor)
fw_summary = summary_fig(mapper_summary, height=300)
dashboard = hovering_widgets(kmgraph,
fw_graph,
ctooltips=True,
bgcolor=bgcolor,
y_gridcolor=y_gridcolor,
member_textbox_width=600)
#Update the fw_graph colorbar, setting its title:
fw_graph.data[1].marker.colorbar.title = 'dist to<br>x-min'
dashboard
```
%% Output
%% Cell type:markdown id: tags:
Several observations can be made:
1. The top half of the graph is composed of wine with quality 0
2. The middle region is composed of all wine types
3. The bottom region is composed of wine with quality 1 and 2
We attained some separability; however, this model can be improved by using different clustering methods or a different filter function. The current graph does not separate quality 1 and 2 very well.
Further insight can be made (although it requires more coding). For instance, one can color the nodes so that they show the proportion of high quality wine to the rest.
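As a rough sketch of that idea, the proportion of one class inside each node can be computed directly from the node membership lists. The function name `node_class_proportion` and the toy data below are hypothetical; the sketch only assumes, as in the cell above, that `scomplex['nodes']` maps each node name to a list of row indices and that `y` holds the labels. The resulting values could then be used as a per-node color value, although how to pass them to the plotting helpers is not shown here.

```python
def node_class_proportion(nodes, labels, target_class):
    """Fraction of each node's members whose label equals target_class."""
    proportions = {}
    for node_id, member_ids in nodes.items():
        hits = sum(1 for i in member_ids if labels[i] == target_class)
        proportions[node_id] = hits / len(member_ids) if member_ids else 0.0
    return proportions


# Hypothetical stand-ins for scomplex['nodes'] and y
toy_nodes = {
    "cube0_cluster0": [0, 1, 2],
    "cube1_cluster0": [2, 3],
    "cube1_cluster1": [4, 5, 6, 7],
}
toy_labels = [0, 0, 2, 2, 1, 2, 2, 2]
props = node_class_proportion(toy_nodes, toy_labels, target_class=2)
print(props)
```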
......
This diff is collapsed.
"""
This file contains all of the functions used within the notebooks.
Date: April 24, 2020
Author: Shawk Masboob
The function `uniform_sampling` was borrowed from Luis Polancocontreras,
a PhD candidate in the CMSE program at Michigan State University.
It was slightly tweaked to fit this project.
"""
import sklearn
from sklearn.linear_model import LinearRegression
from sklearn import ensemble
import kmapper as km
import pandas as pd
import numpy as np


def numpy_to_pandas(sklearn_data):
    """Converts scikit-learn numpy data into pandas dataframe

    Args:
        sklearn_data: scikit-learn dataset object (with data, feature_names and target attributes)

    Returns:
        data: pandas dataframe
    """
    data = pd.DataFrame(data=sklearn_data.data, columns=sklearn_data.feature_names)
    data['target'] = pd.Series(sklearn_data.target)
    return data


def linear_regression(feature, predictor):
    """
    Ordinary least squares Linear Regression.
    input: feature = independent variables
           predictor = dependent variable
    output: R^2
    """
    model = LinearRegression()
    model.fit(feature, predictor)
    return model.score(feature, predictor)


def lens_1d(x_array, proj='l2norm', random_num=1729, verbosity=0):
    """Creates a 2-D lens from an Isolation Forest score and a chosen projection
    (L^2-Norm by default). This lens highlights expected features in the data.

    Args:
        x_array (array): features of dataset </br>
        proj (string): projection type </br>
        random_num: random state </br>
        verbosity: verbosity </br>

    Returns:
        lens: [Isolation Forest, projection] lens </br>
        mapper: KeplerMapper object used for the projection </br>
    """
    # Validate the input before building the lens
    if not isinstance(x_array, np.ndarray):
        print("your input is not an array")
        return None, None
    if isinstance(x_array, np.ndarray) and len(x_array.shape) != 2:
        print('your input needs to be a 2d array')
        return None, None

    proj_type = ['sum', 'mean', 'median', 'max', 'min', 'std', 'dist_mean',
                 'l2norm', 'knn_distance_n']
    if proj not in proj_type:
        print("you may only use the following projections:", proj_type)
        return None, None

    # Create a custom 1-D lens with Isolation Forest
    model = ensemble.IsolationForest(random_state=random_num)
    model.fit(x_array)
    lens1 = model.decision_function(x_array).reshape((x_array.shape[0], 1))

    # Create another 1-D lens with the chosen projection
    mapper = km.KeplerMapper(verbose=verbosity)
    lens2 = mapper.fit_transform(x_array, projection=proj)

    # Combine lenses pairwise to get a 2-D lens i.e. [Isolation Forest, projection] lens
    lens = np.c_[lens1, lens2]
    return lens, mapper


def county_crosstab(data, county, year, index, columns):
    """
    Cross-tabulates two columns of the data for a single county and year.
    """
    subset_df = data[data.year == year]
    sub_df = subset_df[subset_df.county == county]
    crosstab = pd.crosstab(index=sub_df[index], columns=sub_df[columns])
    return crosstab


def uniform_sampling(dist_matrix, n_sample):
    """Given a distance matrix, returns a subsample that preserves the distribution
    of the original data set and the covering radius corresponding to
    the subsampled set.

    Args:
        dist_matrix (array): Distance matrix </br>
        n_sample (int): Size of subsample set </br>

    Returns:
        list_subsample (array): List of indices corresponding to the subsample set </br>
        distance_to_l: Covering radius for the subsample set </br>
    """
    # Validate the inputs before sampling
    if not isinstance(dist_matrix, np.ndarray):
        print("your input is not an array")
        return None, None
    if isinstance(dist_matrix, np.ndarray) and len(dist_matrix.shape) != 2:
        print('your input needs to be a 2d array')
        return None, None

    n_subsample = int(n_sample)
    if n_subsample <= 0:
        print("Sampling size should be a positive integer.")
        return None, None

    num_points = dist_matrix.shape[0]
    list_subsample = np.random.choice(num_points, n_subsample)
    # Distance from every point to its nearest sampled point
    dist_to_l = np.min(dist_matrix[list_subsample, :], axis=0)
    # Covering radius: the largest of those nearest-sample distances
    distance_to_l = np.max(dist_to_l)
    return list_subsample, distance_to_l
from Topological_ML import TDA_Prediction as tdap
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import kmapper as km
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn
from sklearn import ensemble


def test_data_summary():
    data = pd.DataFrame({"A": [1,2,3,4,5,6,7,8,9,10]})
    correct_response = {"head": [1,2,3,4,5,6,7,8,9,10], "shape": [10, 1], "describe": 0}
    values = tdap.data_summary(data, 5)
    assert values == correct_response
    return


def test_linear_regression():
    a = pd.DataFrame({"A": [1,2,3,4,5,6,7,8,9,10]})
    b = pd.DataFrame({"B": [2,4,6,8,10,12,14,16,18,20]})
    correct_response = 1
    values = tdap.linear_regression(a, b)
    assert values == correct_response
    return


def test_lens_1d():
    a = pd.DataFrame({"A": [0,0]})
    correct_response = np.array([[0., 0.],[0., 0.]])
    values = tdap.lens_1d(a,123,1)
    assert values == correct_response
    return
'''
This file contains the functions from tda_function.py that will be tested.
'''
import numpy as np
import scipy.spatial.distance as distance
from Topological_ML import tda_function as tda


def test_lens_1d_pass():
    '''
    Testing whether the function works given correct parameters.
    '''
    data = np.ones((5, 2))
    test, _ = tda.lens_1d(data, proj="sum")
    assert isinstance(test, np.ndarray)


def test_lens_1d_fail_1():
    '''
    Testing whether the function fails given incorrect projection parameter.
    '''
    data = np.ones((5, 2))
    test, _ = tda.lens_1d(data, proj="Sum")
    assert test is None


def test_lens_1d_fail_2():
    '''
    Testing whether the function fails given incorrect feature (data) parameter.
    '''
    data = np.ones(5)
    test, _ = tda.lens_1d(data, proj="sum")
    assert test is None


def test_uniform_sampling_pass():
    '''
    Testing whether the function works given correct parameters.
    '''
    x_array = np.array([[1, 1], [1, 2], [2, 3]])
    dm_x = distance.cdist(x_array, x_array)
    test, _ = tda.uniform_sampling(dm_x, 1)
    assert isinstance(test, np.ndarray)


def test_uniform_sampling_fail_1():
    '''
    Testing whether the function fails given incorrect distance matrix parameter.
    '''
    x_array = np.ones(5)
    test, _ = tda.uniform_sampling(x_array, 1)
    assert test is None


def test_uniform_sampling_fail_2():
    '''
    Testing whether the function fails given incorrect sampling size parameter.
    '''
    x_array = np.array([[1, 1], [1, 2], [2, 3]])
    dm_x = distance.cdist(x_array, x_array)
    test, _ = tda.uniform_sampling(dm_x, -2)
    assert test is None
This diff is collapsed.
......@@ -6,7 +6,6 @@ dependencies:
- matplotlib
- jupyter
- cython
- pytest
- numpy
- selenium
- pandas
......@@ -15,6 +14,14 @@ dependencies:
- hypothesis
- requests
- plotly
- pdoc3
- pylint
- pytest
- autopep8
- pip:
- ripser
- kmapper
- persim
- python-igraph
- plotly
- ipywidgets
\ No newline at end of file
......@@ -22,13 +22,13 @@ init:
conda env create --prefix ./envs --file environment.yml
doc:
pdoc --force --html --output-dir ./docs Topological_ML
pdoc3 --force --html --output-dir ./docs Topological_ML
lint:
pylint Topological_ML
pylint -v Topological_ML
test:
pytest Topological_ML
pytest -v --disable-warnings Topological_ML
.PHONY: init doc lint test
This diff is collapsed.