Commit 6debac52 authored by Colbry, Dirk's avatar Colbry, Dirk

Merge branch 'oscar' into 'main'

fixing timeout issue

See merge request CMSE/datatools_tutorial_demo!39
parents 09930260 28d8ef8c
%% Cell type:markdown id:29cb46d7 tags:
### Run this cell to install the required packages
%% Cell type:code id:eaee9fda tags:
``` python
%pip install torch
%pip install xgboost
%pip install tpot
```
%% Output
Requirement already satisfied: torch in /Users/izaanys/opt/anaconda3/lib/python3.8/site-packages (1.13.1)
Requirement already satisfied: typing-extensions in /Users/izaanys/opt/anaconda3/lib/python3.8/site-packages (from torch) (3.10.0.2)
Note: you may need to restart the kernel to use updated packages.
Requirement already satisfied: xgboost in /Users/izaanys/opt/anaconda3/lib/python3.8/site-packages (1.7.3)
Requirement already satisfied: numpy in /Users/izaanys/opt/anaconda3/lib/python3.8/site-packages (from xgboost) (1.21.2)
Requirement already satisfied: scipy in /Users/izaanys/opt/anaconda3/lib/python3.8/site-packages (from xgboost) (1.7.3)
Note: you may need to restart the kernel to use updated packages.
Requirement already satisfied: tpot in /Users/izaanys/opt/anaconda3/lib/python3.8/site-packages (0.11.7)
Requirement already satisfied: tqdm>=4.36.1 in /Users/izaanys/opt/anaconda3/lib/python3.8/site-packages (from tpot) (4.62.3)
Requirement already satisfied: xgboost>=1.1.0 in /Users/izaanys/opt/anaconda3/lib/python3.8/site-packages (from tpot) (1.7.3)
Requirement already satisfied: deap>=1.2 in /Users/izaanys/opt/anaconda3/lib/python3.8/site-packages (from tpot) (1.3.3)
Requirement already satisfied: pandas>=0.24.2 in /Users/izaanys/opt/anaconda3/lib/python3.8/site-packages (from tpot) (1.3.5)
Requirement already satisfied: update-checker>=0.16 in /Users/izaanys/opt/anaconda3/lib/python3.8/site-packages (from tpot) (0.18.0)
Requirement already satisfied: stopit>=1.1.1 in /Users/izaanys/opt/anaconda3/lib/python3.8/site-packages (from tpot) (1.1.2)
Requirement already satisfied: scipy>=1.3.1 in /Users/izaanys/opt/anaconda3/lib/python3.8/site-packages (from tpot) (1.7.3)
Requirement already satisfied: scikit-learn>=0.22.0 in /Users/izaanys/opt/anaconda3/lib/python3.8/site-packages (from tpot) (1.0.1)
Requirement already satisfied: joblib>=0.13.2 in /Users/izaanys/opt/anaconda3/lib/python3.8/site-packages (from tpot) (1.1.0)
Requirement already satisfied: numpy>=1.16.3 in /Users/izaanys/opt/anaconda3/lib/python3.8/site-packages (from tpot) (1.21.2)
Requirement already satisfied: python-dateutil>=2.7.3 in /Users/izaanys/opt/anaconda3/lib/python3.8/site-packages (from pandas>=0.24.2->tpot) (2.8.2)
Requirement already satisfied: pytz>=2017.3 in /Users/izaanys/opt/anaconda3/lib/python3.8/site-packages (from pandas>=0.24.2->tpot) (2021.3)
Requirement already satisfied: six>=1.5 in /Users/izaanys/opt/anaconda3/lib/python3.8/site-packages (from python-dateutil>=2.7.3->pandas>=0.24.2->tpot) (1.16.0)
Requirement already satisfied: threadpoolctl>=2.0.0 in /Users/izaanys/opt/anaconda3/lib/python3.8/site-packages (from scikit-learn>=0.22.0->tpot) (2.2.0)
Requirement already satisfied: requests>=2.3.0 in /Users/izaanys/opt/anaconda3/lib/python3.8/site-packages (from update-checker>=0.16->tpot) (2.27.1)
Requirement already satisfied: idna<4,>=2.5 in /Users/izaanys/opt/anaconda3/lib/python3.8/site-packages (from requests>=2.3.0->update-checker>=0.16->tpot) (3.3)
Requirement already satisfied: charset-normalizer~=2.0.0 in /Users/izaanys/opt/anaconda3/lib/python3.8/site-packages (from requests>=2.3.0->update-checker>=0.16->tpot) (2.0.4)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /Users/izaanys/opt/anaconda3/lib/python3.8/site-packages (from requests>=2.3.0->update-checker>=0.16->tpot) (1.26.7)
Requirement already satisfied: certifi>=2017.4.17 in /Users/izaanys/opt/anaconda3/lib/python3.8/site-packages (from requests>=2.3.0->update-checker>=0.16->tpot) (2022.9.24)
Note: you may need to restart the kernel to use updated packages.
%% Cell type:markdown id:60aa22cd tags:
* What is TPOT?
* TPOT (Tree-based Pipeline Optimization Tool) is an automated machine learning tool that uses genetic programming to optimize machine learning pipelines. <br/>
<br/>
* What/Who is it good for?
* TPOT takes care of one of the most tedious parts of machine learning: it explores many candidate pipelines and selects the one that performs best on the data you are working with.
* This AutoML tool is a practical way to reach competitive classification accuracy with little manual effort. It can also construct new features and identify novel pipeline operators that further improve accuracy; these operators are chained together into a series of operations acting on the given dataset. <br/>
* TPOT can be used for both classification and regression.
<br/>
<br/>
* How to Install
* We installed TPOT with `pip install tpot`. Installing PyTorch as well is suggested but not required; it can be installed with `pip install torch`. <br/>
<br/>
Link to another TPOT tutorial: https://machinelearningmastery.com/tpot-for-automated-machine-learning-in-python/
Links to additional documentation: <br/>
http://epistasislab.github.io/tpot/ <br/>
https://github.com/EpistasisLab/tpot
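
TPOT's genetic-programming search is controlled by a handful of constructor arguments. The sketch below is only illustrative (the values shown are TPOT's documented defaults, not recommendations for this tutorial):
``` python
from tpot import TPOTClassifier

tpot = TPOTClassifier(
    generations=100,      # rounds of evolution to run
    population_size=100,  # candidate pipelines evaluated each generation
    mutation_rate=0.9,    # chance a pipeline is randomly altered
    crossover_rate=0.1,   # chance two pipelines are recombined
    cv=5,                 # cross-validation folds used to score each pipeline
    scoring='accuracy',   # metric the search tries to maximize
    verbosity=2,
)
# tpot.fit(X, y) then evolves pipelines and keeps the one with the best CV score.
```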
#### Rename the target column to 'class' – an important step
%% Cell type:markdown id:ec644eb3 tags:
### Run this cell to import all of the necessary libraries
%% Cell type:code id:022ab1ce tags:
``` python
from tpot import TPOTClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
import pandas as pd
import numpy as np
```
%% Cell type:markdown id:f1b0e863 tags:
## Example 1
%% Cell type:code id:2898aaa9 tags:
``` python
#load in all of the data
iris = load_iris()
iris.data[0:5], iris.target
```
%% Output
(array([[5.1, 3.5, 1.4, 0.2],
[4.9, 3. , 1.4, 0.2],
[4.7, 3.2, 1.3, 0.2],
[4.6, 3.1, 1.5, 0.2],
[5. , 3.6, 1.4, 0.2]]),
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]))
%% Cell type:code id:1d1ac27a tags:
``` python
#split data into a test and train data set
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, train_size=0.75, test_size=0.25)
X_train.shape, X_test.shape, y_train.shape, y_test.shape
```
%% Output
((112, 4), (38, 4), (112,), (38,))
%% Cell type:code id:015515d0 tags:
``` python
# Fit the model on the training data, then score it on the testing data.
# Reports the score of the best pipeline found.
# Raise max_time_mins to give TPOT more time to run without interruption. # issue number 25
# It is currently set to 4 minutes so the example does not take too long.
tpot = TPOTClassifier(verbosity=2, max_time_mins=4)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
```
%% Output
2.02 minutes have elapsed. TPOT will close down.
TPOT closed during evaluation in one generation.
WARNING: TPOT may not provide a good pipeline if TPOT is stopped/interrupted in a early generation.
TPOT closed prematurely. Will use the current best pipeline.
Best pipeline: LogisticRegression(input_matrix, C=25.0, dual=False, penalty=l2)
0.9736842105263158
%% Cell type:markdown id:22bb780f tags:
TPOT issued a warning that it closed prematurely, so `max_time_mins` is increased to 4 to let it run to completion and give more reliable results.
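
If runtime is less of a concern, an alternative (not run here) is to drop the wall-clock limit and let TPOT finish a fixed number of generations, optionally stopping early when the score stops improving. A minimal sketch with illustrative values:
``` python
from tpot import TPOTClassifier

tpot = TPOTClassifier(
    generations=5,       # run exactly 5 rounds of evolution
    population_size=50,  # 50 candidate pipelines per generation
    early_stop=3,        # stop if no improvement for 3 consecutive generations
    verbosity=2,
    random_state=42,
)
# tpot.fit(X_train, y_train); tpot.score(X_test, y_test)
```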
%% Cell type:code id:fedcae2c tags:
``` python
#export the pipeline created for future use
tpot.export('tpot_iris_pipeline.py')
```
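%% Cell type:markdown tags:
The exported file is a standalone Python script that rebuilds and refits the best pipeline without TPOT. For the pipeline found above it looks roughly like the sketch below (the data path and separator are placeholders that TPOT leaves for you to fill in):
``` python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# NOTE: the exported script expects the outcome column to be named 'target'
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)
features = tpot_data.drop('target', axis=1)
training_features, testing_features, training_target, testing_target = \
    train_test_split(features, tpot_data['target'], random_state=None)

exported_pipeline = LogisticRegression(C=25.0, dual=False, penalty="l2")
exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)
```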
%% Cell type:markdown id:efa3269b tags:
## Example 2
%% Cell type:code id:f2ac0eda tags:
``` python
#read in data
titanic = pd.read_csv('titanic_train.csv')
titanic.head(5)
```
%% Output
   PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin Embarked
0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   NaN        S
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833   C85        C
2            3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN        S
3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000  C123        S
4            5         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500   NaN        S
%% Cell type:code id:30eeb3aa tags:
``` python
# rename the target variable to 'class'; in this case that is the 'Survived' column
titanic.rename(columns={'Survived': 'class'}, inplace=True)
```
%% Cell type:code id:bcc561a3 tags:
``` python
# Find out how many different categories there are for each of these 5 features
for cat in ['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked']:
    print("Number of levels in category '{0}': \b {1:2.2f} ".format(cat, titanic[cat].unique().size))
```
%% Output
Number of levels in category 'Name':  891.00
Number of levels in category 'Sex':  2.00
Number of levels in category 'Ticket':  681.00
Number of levels in category 'Cabin':  148.00
Number of levels in category 'Embarked':  4.00
%% Cell type:code id:5be7251f tags:
``` python
#print out what those categories are for 'Sex' and 'Embarked'
for cat in ['Sex', 'Embarked']:
    print("Levels for category '{0}': {1}".format(cat, titanic[cat].unique()))
```
%% Output
Levels for category 'Sex': ['male' 'female']
Levels for category 'Embarked': ['S' 'C' 'Q' nan]
%% Cell type:code id:8bb50fa8 tags:
``` python
# Map the categories to numerical values
titanic['Sex'] = titanic['Sex'].map({'male':0,'female':1})
titanic['Embarked'] = titanic['Embarked'].map({'S':0,'C':1,'Q':2})
```
%% Cell type:code id:11c06d01 tags:
``` python
# fill NA values and then double-check there are none left
titanic = titanic.fillna(-999)
pd.isnull(titanic).any()
```
%% Output
PassengerId False
class False
Pclass False
Name False
Sex False
Age False
SibSp False
Parch False
Ticket False
Fare False
Cabin False
Embarked False
dtype: bool
%% Cell type:code id:aa017a2a tags:
``` python
# Encode categorical features, specifically 'Cabin'
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
CabinTrans = mlb.fit_transform([{str(val)} for val in titanic['Cabin'].values])
```
%% Cell type:code id:34567d07 tags:
``` python
CabinTrans
```
%% Output
array([[1, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
[1, 0, 0, ..., 0, 0, 0],
...,
[1, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
[1, 0, 0, ..., 0, 0, 0]])
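%% Cell type:markdown tags:
Each row of `CabinTrans` is a one-hot indicator vector for that passenger's cabin value. A toy sketch (with made-up cabin values) of what `MultiLabelBinarizer` is doing here:
``` python
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
# each sample is a *set* of labels; every passenger has exactly one cabin label
demo = mlb.fit_transform([{'C85'}, {'nan'}, {'C123'}, {'C85'}])
print(mlb.classes_)  # ['C123' 'C85' 'nan'] -- the column order of the output
print(demo)
# [[0 1 0]
#  [0 0 1]
#  [1 0 0]
#  [0 1 0]]
```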
%% Cell type:code id:f31db644 tags:
``` python
# drop features that we won't use
titanic_new = titanic.drop(['Name','Ticket','Cabin','class'], axis=1)
```
%% Cell type:code id:e8ccd33c tags:
``` python
# check that the encoding produced one column per unique Cabin value
assert (len(titanic['Cabin'].unique()) == len(mlb.classes_)), "Not Equal"
```
%% Cell type:code id:594776f6 tags:
``` python
#add CabinTrans to the features we kept
# stack the arrays column wise
titanic_new = np.hstack((titanic_new.values,CabinTrans))
```
%% Cell type:code id:e8d47aef tags:
``` python
# make sure there are no nas in the data
np.isnan(titanic_new).any()
```
%% Output
False
%% Cell type:code id:6fb6b7df tags:
``` python
titanic_new[0].size
```
%% Output
156
%% Cell type:code id:09fe6803 tags:
``` python
# get the class ('Survived') values
titanic_class = titanic['class'].values
```
%% Cell type:code id:a14bb5fd tags:
``` python
# split the data into training and testing sets - this gives us indices into the original dataframe
training_indices, validation_indices = train_test_split(titanic.index, stratify=titanic_class, train_size=0.75, test_size=0.25)
training_indices.size, validation_indices.size
```
%% Output
(668, 223)
%% Cell type:code id:227863a0 tags:
``` python
# create the classifier and fit the model, reports the best pipeline
# Parameters within the TPOT Classifier can be changed to allow for longer run time across more models
tpot = TPOTClassifier(verbosity=2, max_time_mins=4, max_eval_time_mins=0.04, population_size=40)
tpot.fit(titanic_new[training_indices], titanic_class[training_indices])
```
%% Output
%% Cell type:markdown id:910d6e80 tags:
TPOT issued a warning that it closed prematurely; increasing `max_time_mins` lets it run to completion and gives more reliable results.
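
Whether or not the search finishes, the best pipeline found so far can be inspected directly. A minimal sketch (attribute names as documented for TPOT 0.11):
``` python
# the best pipeline TPOT found, exposed as an ordinary scikit-learn Pipeline
print(tpot.fitted_pipeline_)

# every pipeline TPOT evaluated, with its internal cross-validation score
for name, info in list(tpot.evaluated_individuals_.items())[:3]:
    print(info['internal_cv_score'], name)
```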
%% Cell type:code id:ca18f35b tags:
``` python
#gives the score from the best pipeline
tpot.score(titanic_new[validation_indices], titanic.loc[validation_indices, 'class'].values)
```
%% Output
0.7533632286995515
%% Cell type:code id:4d710e07 tags:
``` python
#export the best pipeline for future use
tpot.export('tpot_titanic_pipeline.py')
```
%% Cell type:code id:d7ff45ed tags:
``` python
#Read in the test set that hasn't been touched yet
titanic_sub = pd.read_csv('titanic_test.csv')
titanic_sub.describe()
```
%% Output
       PassengerId      Pclass         Age       SibSp       Parch        Fare
count   418.000000  418.000000  332.000000  418.000000  418.000000  417.000000
mean   1100.500000    2.265550   30.272590    0.447368    0.392344   35.627188
std     120.810458    0.841838   14.181209    0.896760    0.981429   55.907576
min     892.000000    1.000000    0.170000    0.000000    0.000000    0.000000
25%     996.250000    1.000000   21.000000    0.000000    0.000000    7.895800
50%    1100.500000    3.000000   27.000000    0.000000    0.000000   14.454200
75%    1204.750000    3.000000   39.000000    1.000000    0.000000   31.500000
max    1309.000000    3.000000   76.000000    8.000000    9.000000  512.329200
%% Cell type:code id:8264d2ed tags:
``` python
# replace Cabin values that appear only in the test set (unseen during training) with -999
for var in ['Cabin']: #,'Name','Ticket']:
    new = list(set(titanic_sub[var]) - set(titanic[var]))
    titanic_sub.loc[titanic_sub[var].isin(new), var] = -999
```
%% Cell type:code id:fe8198e3 tags:
``` python
# encode sex and embarked to numerical values
titanic_sub['Sex'] = titanic_sub['Sex'].map({'male':0,'female':1})
titanic_sub['Embarked'] = titanic_sub['Embarked'].map({'S':0,'C':1,'Q':2})
```
%% Cell type:code id:13204313 tags:
``` python
# fill the nas and double check none are left
titanic_sub = titanic_sub.fillna(-999)
pd.isnull(titanic_sub).any()
```
%% Output
PassengerId False
Pclass False
Name False
Sex False
Age False
SibSp False
Parch False
Ticket False
Fare False
Cabin False
Embarked False
dtype: bool
%% Cell type:code id:82e8d3fb tags:
``` python
# Encode categorical features, specifically 'Cabin', drop select columns to create a new dataframe
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
SubCabinTrans = mlb.fit([{str(val)} for val in titanic['Cabin'].values]).transform([{str(val)} for val in titanic_sub['Cabin'].values])
titanic_sub = titanic_sub.drop(['Name','Ticket','Cabin'], axis=1)
```
%% Cell type:code id:185ba8c1 tags:
``` python
# combine slimmed dataframe with cabin now encoded
titanic_sub_new = np.hstack((titanic_sub.values,SubCabinTrans))
```
%% Cell type:code id:359c8b6b tags:
``` python
np.any(np.isnan(titanic_sub_new))
```
%% Output
False
%% Cell type:code id:e73a0c32 tags:
``` python
assert (titanic_new.shape[1] == titanic_sub_new.shape[1]), "Not Equal"
```
%% Cell type:code id:d868e452 tags:
``` python
# predict the class based on the data given
submission = tpot.predict(titanic_sub_new)
```
%% Cell type:code id:1666f357 tags:
``` python
submission[:10]
```
%% Output
array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0])
%% Cell type:code id:3d91d737 tags:
``` python
# create a data frame with passenger id and what class they belong to (if they survived or not)
# save as csv; create the output directory first so to_csv does not fail
import os
os.makedirs('data', exist_ok=True)
final = pd.DataFrame({'PassengerId': titanic_sub['PassengerId'], 'Survived': submission})
final.to_csv('data/submission.csv', index=False)
```
%% Output
%% Cell type:code id:240feb73 tags:
``` python
final.shape
```
%% Cell type:markdown id:12aac202 tags:
### References
https://github.com/EpistasisLab/tpot