Commit 6debac52 authored by Colbry, Dirk's avatar Colbry, Dirk

Merge branch 'oscar' into 'main'

fixing timeout issue

See merge request CMSE/datatools_tutorial_demo!39
parents 09930260 28d8ef8c
%% Cell type:markdown id:29cb46d7 tags:
### Run this cell to install the required packages
%% Cell type:code id:eaee9fda tags:
``` python
%pip install torch
%pip install xgboost
%pip install tpot
```
%% Output
Requirement already satisfied: torch in /Users/izaanys/opt/anaconda3/lib/python3.8/site-packages (1.13.1)
Requirement already satisfied: typing-extensions in /Users/izaanys/opt/anaconda3/lib/python3.8/site-packages (from torch) (3.10.0.2)
Note: you may need to restart the kernel to use updated packages.
Requirement already satisfied: xgboost in /Users/izaanys/opt/anaconda3/lib/python3.8/site-packages (1.7.3)
Requirement already satisfied: numpy in /Users/izaanys/opt/anaconda3/lib/python3.8/site-packages (from xgboost) (1.21.2)
Requirement already satisfied: scipy in /Users/izaanys/opt/anaconda3/lib/python3.8/site-packages (from xgboost) (1.7.3)
Note: you may need to restart the kernel to use updated packages.
Requirement already satisfied: tpot in /Users/izaanys/opt/anaconda3/lib/python3.8/site-packages (0.11.7)
Requirement already satisfied: tqdm>=4.36.1 in /Users/izaanys/opt/anaconda3/lib/python3.8/site-packages (from tpot) (4.62.3)
Requirement already satisfied: xgboost>=1.1.0 in /Users/izaanys/opt/anaconda3/lib/python3.8/site-packages (from tpot) (1.7.3)
Requirement already satisfied: deap>=1.2 in /Users/izaanys/opt/anaconda3/lib/python3.8/site-packages (from tpot) (1.3.3)
Requirement already satisfied: pandas>=0.24.2 in /Users/izaanys/opt/anaconda3/lib/python3.8/site-packages (from tpot) (1.3.5)
Requirement already satisfied: update-checker>=0.16 in /Users/izaanys/opt/anaconda3/lib/python3.8/site-packages (from tpot) (0.18.0)
Requirement already satisfied: stopit>=1.1.1 in /Users/izaanys/opt/anaconda3/lib/python3.8/site-packages (from tpot) (1.1.2)
Requirement already satisfied: scipy>=1.3.1 in /Users/izaanys/opt/anaconda3/lib/python3.8/site-packages (from tpot) (1.7.3)
Requirement already satisfied: scikit-learn>=0.22.0 in /Users/izaanys/opt/anaconda3/lib/python3.8/site-packages (from tpot) (1.0.1)
Requirement already satisfied: joblib>=0.13.2 in /Users/izaanys/opt/anaconda3/lib/python3.8/site-packages (from tpot) (1.1.0)
Requirement already satisfied: numpy>=1.16.3 in /Users/izaanys/opt/anaconda3/lib/python3.8/site-packages (from tpot) (1.21.2)
Requirement already satisfied: python-dateutil>=2.7.3 in /Users/izaanys/opt/anaconda3/lib/python3.8/site-packages (from pandas>=0.24.2->tpot) (2.8.2)
Requirement already satisfied: pytz>=2017.3 in /Users/izaanys/opt/anaconda3/lib/python3.8/site-packages (from pandas>=0.24.2->tpot) (2021.3)
Requirement already satisfied: six>=1.5 in /Users/izaanys/opt/anaconda3/lib/python3.8/site-packages (from python-dateutil>=2.7.3->pandas>=0.24.2->tpot) (1.16.0)
Requirement already satisfied: threadpoolctl>=2.0.0 in /Users/izaanys/opt/anaconda3/lib/python3.8/site-packages (from scikit-learn>=0.22.0->tpot) (2.2.0)
Requirement already satisfied: requests>=2.3.0 in /Users/izaanys/opt/anaconda3/lib/python3.8/site-packages (from update-checker>=0.16->tpot) (2.27.1)
Requirement already satisfied: idna<4,>=2.5 in /Users/izaanys/opt/anaconda3/lib/python3.8/site-packages (from requests>=2.3.0->update-checker>=0.16->tpot) (3.3)
Requirement already satisfied: charset-normalizer~=2.0.0 in /Users/izaanys/opt/anaconda3/lib/python3.8/site-packages (from requests>=2.3.0->update-checker>=0.16->tpot) (2.0.4)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /Users/izaanys/opt/anaconda3/lib/python3.8/site-packages (from requests>=2.3.0->update-checker>=0.16->tpot) (1.26.7)
Requirement already satisfied: certifi>=2017.4.17 in /Users/izaanys/opt/anaconda3/lib/python3.8/site-packages (from requests>=2.3.0->update-checker>=0.16->tpot) (2022.9.24)
Note: you may need to restart the kernel to use updated packages.
%% Cell type:markdown id:60aa22cd tags:
* What is TPOT?
* TPOT (Tree-based Pipeline Optimization Tool) is an automated machine learning tool that uses genetic programming to optimize machine learning pipelines. <br/>
<br/>
* What/Who is it good for?
* TPOT takes care of one of the most tedious parts of machine learning: it explores many candidate pipelines and selects the one that performs best on the data you are working with.
* This AutoML tool is a practical way to reach competitive classification accuracy with little manual effort. It can also construct new features and identify novel pipeline operators that further improve accuracy; these operators are chained together into a series of operations acting on the given dataset. <br/>
* TPOT can be used for both classification and regression.
<br/>
<br/>
* How to Install
* We installed TPOT with `pip install tpot`. Installing PyTorch as well is suggested but not required; it can be installed with `pip install torch`. <br/>
<br/>
Link to another TPOT tutorial: https://machinelearningmastery.com/tpot-for-automated-machine-learning-in-python/
Links to additional documentation: <br/>
http://epistasislab.github.io/tpot/ <br/>
https://github.com/EpistasisLab/tpot
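
TPOT's genetic-programming search is controlled by a handful of constructor arguments. The sketch below is only illustrative (the values shown are TPOT's documented defaults, not recommendations for this tutorial):
``` python
from tpot import TPOTClassifier

tpot = TPOTClassifier(
    generations=100,      # rounds of evolution to run
    population_size=100,  # candidate pipelines evaluated each generation
    mutation_rate=0.9,    # chance a pipeline is randomly altered
    crossover_rate=0.1,   # chance two pipelines are recombined
    cv=5,                 # cross-validation folds used to score each pipeline
    scoring='accuracy',   # metric the search tries to maximize
    verbosity=2,
)
# tpot.fit(X, y) then evolves pipelines and keeps the one with the best CV score.
```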
#### Rename the target column to 'class' – an important step
%% Cell type:markdown id:ec644eb3 tags:
### Run this cell to import all of the necessary libraries
%% Cell type:code id:022ab1ce tags:
``` python
from tpot import TPOTClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
import pandas as pd
import numpy as np
```
%% Cell type:markdown id:f1b0e863 tags:
## Example 1
%% Cell type:code id:2898aaa9 tags:
``` python
#load in all of the data
iris = load_iris()
iris.data[0:5], iris.target
```
%% Output
(array([[5.1, 3.5, 1.4, 0.2],
[4.9, 3. , 1.4, 0.2],
[4.7, 3.2, 1.3, 0.2],
[4.6, 3.1, 1.5, 0.2],
[5. , 3.6, 1.4, 0.2]]),
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]))
%% Cell type:code id:1d1ac27a tags:
``` python
#split data into a test and train data set
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, train_size=0.75, test_size=0.25)
X_train.shape, X_test.shape, y_train.shape, y_test.shape
```
%% Output
((112, 4), (38, 4), (112,), (38,))
%% Cell type:code id:015515d0 tags:
``` python
# Fit the model on the training data, then score it on the testing data.
# Reports the score of the best pipeline found.
# Raise max_time_mins to give TPOT more time to run without interruption. # issue number 25
# It is currently set to 4 minutes so the example does not take too long.
tpot = TPOTClassifier(verbosity=2, max_time_mins=4)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
```
%% Output
2.02 minutes have elapsed. TPOT will close down.
TPOT closed during evaluation in one generation.
WARNING: TPOT may not provide a good pipeline if TPOT is stopped/interrupted in a early generation.
TPOT closed prematurely. Will use the current best pipeline.
Best pipeline: LogisticRegression(input_matrix, C=25.0, dual=False, penalty=l2)
0.9736842105263158
%% Cell type:markdown id:22bb780f tags:
TPOT issued a warning that it closed prematurely, so `max_time_mins` is increased to 4 to let it run to completion and give more reliable results.
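
If runtime is less of a concern, an alternative (not run here) is to drop the wall-clock limit and let TPOT finish a fixed number of generations, optionally stopping early when the score stops improving. A minimal sketch with illustrative values:
``` python
from tpot import TPOTClassifier

tpot = TPOTClassifier(
    generations=5,       # run exactly 5 rounds of evolution
    population_size=50,  # 50 candidate pipelines per generation
    early_stop=3,        # stop if no improvement for 3 consecutive generations
    verbosity=2,
    random_state=42,
)
# tpot.fit(X_train, y_train); tpot.score(X_test, y_test)
```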
%% Cell type:code id:fedcae2c tags:
``` python
#export the pipeline created for future use
tpot.export('tpot_iris_pipeline.py')
```
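%% Cell type:markdown tags:
The exported file is a standalone Python script that rebuilds and refits the best pipeline without TPOT. For the pipeline found above it looks roughly like the sketch below (the data path and separator are placeholders that TPOT leaves for you to fill in):
``` python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# NOTE: the exported script expects the outcome column to be named 'target'
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)
features = tpot_data.drop('target', axis=1)
training_features, testing_features, training_target, testing_target = \
    train_test_split(features, tpot_data['target'], random_state=None)

exported_pipeline = LogisticRegression(C=25.0, dual=False, penalty="l2")
exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)
```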
%% Cell type:markdown id:efa3269b tags:
## Example 2
%% Cell type:code id:f2ac0eda tags:
``` python
#read in data
titanic = pd.read_csv('titanic_train.csv')
titanic.head(5)
```
%% Output
   PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin Embarked
0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   NaN        S
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833   C85        C
2            3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN        S
3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000  C123        S
4            5         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500   NaN        S
%% Cell type:code id:30eeb3aa tags:
``` python
# rename the target variable to 'class'; in this case that is the 'Survived' column
titanic.rename(columns={'Survived': 'class'}, inplace=True)
```
%% Cell type:code id:bcc561a3 tags:
``` python
# Find out how many different categories there are for each of these 5 features
for cat in ['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked']:
    print("Number of levels in category '{0}': \b {1:2.2f} ".format(cat, titanic[cat].unique().size))
```
%% Output
Number of levels in category 'Name':  891.00
Number of levels in category 'Sex':  2.00
Number of levels in category 'Ticket':  681.00
Number of levels in category 'Cabin':  148.00
Number of levels in category 'Embarked':  4.00
%% Cell type:code id:5be7251f tags:
``` python
#print out what those categories are for 'Sex' and 'Embarked'
for cat in ['Sex', 'Embarked']:
    print("Levels for category '{0}': {1}".format(cat, titanic[cat].unique()))
```
%% Output
Levels for category 'Sex': ['male' 'female']
Levels for category 'Embarked': ['S' 'C' 'Q' nan]
%% Cell type:code id:8bb50fa8 tags:
``` python
# Map the categories to numerical values
titanic['Sex'] = titanic['Sex'].map({'male':0,'female':1})
titanic['Embarked'] = titanic['Embarked'].map({'S':0,'C':1,'Q':2})
```
%% Cell type:code id:11c06d01 tags:
``` python
# fill NA values and then double-check there are none left
titanic = titanic.fillna(-999)
pd.isnull(titanic).any()
```
%% Output
PassengerId False
class False
Pclass False
Name False
Sex False
Age False
SibSp False
Parch False
Ticket False
Fare False
Cabin False
Embarked False
dtype: bool
%% Cell type:code id:aa017a2a tags:
``` python
# Encode categorical features, specifically 'Cabin'
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
CabinTrans = mlb.fit_transform([{str(val)} for val in titanic['Cabin'].values])
```
%% Cell type:code id:34567d07 tags:
``` python
CabinTrans
```
%% Output
array([[1, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
[1, 0, 0, ..., 0, 0, 0],
...,
[1, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
[1, 0, 0, ..., 0, 0, 0]])
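%% Cell type:markdown tags:
Each row of `CabinTrans` is a one-hot indicator vector for that passenger's cabin value. A toy sketch (with made-up cabin values) of what `MultiLabelBinarizer` is doing here:
``` python
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
# each sample is a *set* of labels; every passenger has exactly one cabin label
demo = mlb.fit_transform([{'C85'}, {'nan'}, {'C123'}, {'C85'}])
print(mlb.classes_)  # ['C123' 'C85' 'nan'] -- the column order of the output
print(demo)
# [[0 1 0]
#  [0 0 1]
#  [1 0 0]
#  [0 1 0]]
```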
%% Cell type:code id:f31db644 tags:
``` python
# drop features that we won't use
titanic_new = titanic.drop(['Name','Ticket','Cabin','class'], axis=1)
```
%% Cell type:code id:e8ccd33c tags:
``` python
# check that the encoding produced one column per unique Cabin value
assert (len(titanic['Cabin'].unique()) == len(mlb.classes_)), "Not Equal"
```
%% Cell type:code id:594776f6 tags:
``` python
#add CabinTrans to the features we kept
# stack the arrays column wise
titanic_new = np.hstack((titanic_new.values,CabinTrans))
```
%% Cell type:code id:e8d47aef tags:
``` python
# make sure there are no nas in the data
np.isnan(titanic_new).any()
```
%% Output
False
%% Cell type:code id:6fb6b7df tags:
``` python
titanic_new[0].size
```
%% Output
156
%% Cell type:code id:09fe6803 tags:
``` python
# get the class ('Survived') values
titanic_class = titanic['class'].values
```
%% Cell type:code id:a14bb5fd tags:
``` python
# split the data into training and testing sets - this gives us indices into the original dataframe
training_indices, validation_indices = train_test_split(titanic.index, stratify=titanic_class, train_size=0.75, test_size=0.25)
training_indices.size, validation_indices.size
```
%% Output
(668, 223)
%% Cell type:code id:227863a0 tags:
``` python
# create the classifier and fit the model, reports the best pipeline
# Parameters within the TPOT Classifier can be changed to allow for longer run time across more models
tpot = TPOTClassifier(verbosity=2, max_time_mins=4, max_eval_time_mins=0.04, population_size=40)
tpot.fit(titanic_new[training_indices], titanic_class[training_indices])
```
%% Output
%% Cell type:markdown id:910d6e80 tags:
TPOT issued a warning that it closed prematurely; increasing `max_time_mins` lets it run to completion and gives more reliable results.
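
Whether or not the search finishes, the best pipeline found so far can be inspected directly. A minimal sketch (attribute names as documented for TPOT 0.11):
``` python
# the best pipeline TPOT found, exposed as an ordinary scikit-learn Pipeline
print(tpot.fitted_pipeline_)

# every pipeline TPOT evaluated, with its internal cross-validation score
for name, info in list(tpot.evaluated_individuals_.items())[:3]:
    print(info['internal_cv_score'], name)
```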
%% Cell type:code id:ca18f35b tags:
``` python
#gives the score from the best pipeline
tpot.score(titanic_new[validation_indices], titanic.loc[validation_indices, 'class'].values)
```
%% Output
0.7533632286995515
%% Cell type:code id:4d710e07 tags:
``` python
#export the best pipeline for future use
tpot.export('tpot_titanic_pipeline.py')
```
%% Cell type:code id:d7ff45ed tags:
``` python
#Read in the test set that hasn't been touched yet
titanic_sub = pd.read_csv('titanic_test.csv')
titanic_sub.describe()
```
%% Output
       PassengerId      Pclass         Age       SibSp       Parch        Fare
count   418.000000  418.000000  332.000000  418.000000  418.000000  417.000000
mean   1100.500000    2.265550   30.272590    0.447368    0.392344   35.627188
std     120.810458    0.841838   14.181209    0.896760    0.981429   55.907576
min     892.000000    1.000000    0.170000    0.000000    0.000000    0.000000
25%     996.250000    1.000000   21.000000    0.000000    0.000000    7.895800
50%    1100.500000    3.000000   27.000000    0.000000    0.000000   14.454200
75%    1204.750000    3.000000   39.000000    1.000000    0.000000   31.500000
max    1309.000000    3.000000   76.000000    8.000000    9.000000  512.329200
%% Cell type:code id:8264d2ed tags:
``` python
# replace Cabin values that appear only in the test set (unseen during training) with -999
for var in ['Cabin']: #,'Name','Ticket']:
    new = list(set(titanic_sub[var]) - set(titanic[var]))
    titanic_sub.loc[titanic_sub[var].isin(new), var] = -999
```
%% Cell type:code id:fe8198e3 tags:
``` python
# encode sex and embarked to numerical values
titanic_sub['Sex'] = titanic_sub['Sex'].map({'male':0,'female':1})
titanic_sub['Embarked'] = titanic_sub['Embarked'].map({'S':0,'C':1,'Q':2})
```
%% Cell type:code id:13204313 tags:
``` python
# fill the nas and double check none are left
titanic_sub = titanic_sub.fillna(-999)
pd.isnull(titanic_sub).any()
```
%% Output
PassengerId False
Pclass False
Name False
Sex False
Age False
SibSp False
Parch False
Ticket False
Fare False
Cabin False
Embarked False
dtype: bool
%% Cell type:code id:82e8d3fb tags:
``` python
# Encode categorical features, specifically 'Cabin', drop select columns to create a new dataframe
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
SubCabinTrans = mlb.fit([{str(val)} for val in titanic['Cabin'].values]).transform([{str(val)} for val in titanic_sub['Cabin'].values])
titanic_sub = titanic_sub.drop(['Name','Ticket','Cabin'], axis=1)
```
%% Cell type:code id:185ba8c1 tags:
``` python
# combine slimmed dataframe with cabin now encoded
titanic_sub_new = np.hstack((titanic_sub.values,SubCabinTrans))
```
%% Cell type:code id:359c8b6b tags:
``` python
np.any(np.isnan(titanic_sub_new))
```
%% Output
False
%% Cell type:code id:e73a0c32 tags:
``` python
assert (titanic_new.shape[1] == titanic_sub_new.shape[1]), "Not Equal"
```
%% Cell type:code id:d868e452 tags:
``` python
# predict the class based on the data given
submission = tpot.predict(titanic_sub_new)
```
%% Cell type:code id:1666f357 tags:
``` python
submission[:10]
```
%% Output
array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0])
%% Cell type:code id:3d91d737 tags:
``` python
# create a data frame with passenger id and what class they belong to (if they survived or not)
# save as csv; create the output directory first so to_csv does not fail
import os
os.makedirs('data', exist_ok=True)
final = pd.DataFrame({'PassengerId': titanic_sub['PassengerId'], 'Survived': submission})
final.to_csv('data/submission.csv', index=False)
```
%% Output
%% Cell type:code id:240feb73 tags:
``` python
final.shape
```
%% Cell type:markdown id:12aac202 tags:
### References
https://github.com/EpistasisLab/tpot