Mordred and Morgan

# Compute Mordred and Morgan features and build a model from them
%matplotlib inline

import Utils as model_helpers
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

Computing Features

half_enantiomer_data = pd.read_csv("half_enantiomer_data.csv")
# Remove line separators; chain the replacements so both sequences are stripped in one pass
half_enantiomer_data["SMILES String"] = half_enantiomer_data["SMILES String"].apply(lambda x: x.replace("\\n", "").replace("\\r", ""))
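The cleanup step can also be written with pandas' vectorized string methods. The sketch below is a minimal illustration on made-up strings; the helper name `strip_separators` is hypothetical and not part of `Utils`. It removes both real control characters and the literal two-character sequences, since it is not clear from the file alone which form the CSV contains.

```python
import pandas as pd

def strip_separators(smiles: pd.Series) -> pd.Series:
    """Remove line-separator debris from a Series of SMILES strings."""
    cleaned = smiles
    # Cover real control characters and literal backslash sequences alike.
    for token in ("\n", "\r", "\\n", "\\r"):
        cleaned = cleaned.str.replace(token, "", regex=False)
    return cleaned

# Toy inputs (assumed for illustration only).
toy = pd.Series(["\rCC[C@H](C)Oc1cnccn1\r", "CCO\\n", "C(CC(=O)O)[C@@H](C(=O)O)N"])
print(strip_separators(toy).tolist())
```

One caveat: a backslash is also a valid bond symbol in SMILES, so removing the literal sequence backslash + `n` could in principle alter a string in which a directional bond is followed by an aromatic nitrogen. Whether that occurs in this dataset is not checked here.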
# Calculate the mordred features
mordred_data = model_helpers.calculate_features(half_enantiomer_data, "mordred")
Convering SMILES string to Mol format...
Computing 3D coordinates...
Removing CC1=C2CC(CC1C3(CC4CC(=C3C)C4(C)C)C5[C@@H](O)C6CC(=C5C)C6(C)C)C2(C)C due to failed embedding
Computing Mordred features...
100%|██████████| 206/206 [00:12<00:00, 16.57it/s]
There are 206 molecules and 1826 features
# Calculate the morgan features
morgan_data = model_helpers.calculate_features(half_enantiomer_data, "morgan")
[11:40:49] WARNING: not removing hydrogen atom without neighbors
[11:40:52] Explicit valence for atom # 1 Si, 8, is greater than permitted
[11:40:52] Explicit valence for atom # 3 Si, 8, is greater than permitted
[11:40:53] Explicit valence for atom # 1 Si, 8, is greater than permitted
[11:40:53] Explicit valence for atom # 1 Si, 8, is greater than permitted
[11:41:00] WARNING: not removing hydrogen atom without neighbors
207 similarity features for 21685 molecules
[11:41:07] WARNING: not removing hydrogen atom without neighbors
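The log above reports "similarity features" for the Morgan run. One common way to turn fingerprints into such features is the Tanimoto similarity between a molecule's fingerprint and each of a set of reference fingerprints; whether `calculate_features` uses exactly this measure is an assumption. A minimal, dependency-free sketch, with made-up bit indices standing in for real Morgan fingerprints:

```python
def tanimoto(bits_a: set, bits_b: set) -> float:
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    if not bits_a and not bits_b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(bits_a & bits_b)
    return shared / (len(bits_a) + len(bits_b) - shared)

# Hypothetical on-bit indices; real ones would come from a fingerprinting library.
query = {1, 5, 9, 42}
references = {"ref_a": {1, 5, 9, 42}, "ref_b": {5, 42, 77}, "ref_c": {100, 200}}

# One similarity feature per reference molecule.
features = {name: tanimoto(query, bits) for name, bits in references.items()}
print(features)
```

Each reference molecule contributes one column, which would explain why the number of similarity features can be far larger than the number of molecules being modelled.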
# Disabled: drop zero-variance Mordred feature columns
#zero_var_cols = [col for col in mordred_data.iloc[:, 11:] if mordred_data[col].var() == 0]
#mordred_data = mordred_data.drop(columns=zero_var_cols)
# Keep only the molecules that have both Mordred and Morgan features computed
common_index = mordred_data.index.intersection(morgan_data.index)
mordred_data = mordred_data.loc[common_index]
morgan_data = morgan_data.loc[common_index]
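The Mordred run above dropped one molecule after a failed embedding, so the two feature frames need not contain the same rows. The sketch below shows the same index-intersection pattern on two toy frames (the labels and values are made up) and checks that both frames end up with identical row labels.

```python
import pandas as pd

# Toy frames: "m2" is missing from the first frame, "m4" from the second.
left = pd.DataFrame({"feat_a": [1.0, 2.0, 3.0]}, index=["m1", "m3", "m4"])
right = pd.DataFrame({"feat_b": [10.0, 20.0, 30.0]}, index=["m1", "m2", "m3"])

common = left.index.intersection(right.index)  # labels present in both frames
left_aligned = left.loc[common]
right_aligned = right.loc[common]

print(list(common))
print(left_aligned.index.equals(right_aligned.index))
```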
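# Preview the features indexed by molecule name (the trailing ';' suppresses the output; the frames themselves are not modified)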
mordred_data.set_index('Molecule Name').head().iloc[:, 10:];
morgan_data.set_index('Molecule Name').head().iloc[:, 10:];
# Data frame that has both the mordred and morgan features
both = mordred_data.join(morgan_data.iloc[:,10:], how="inner", rsuffix='morg_')
both.head()
(index) | Unnamed: 0 | Unnamed: 0.1 | Molecule Name | Pubchem ID # | Note | Other SMILES | Method | Contributor | Detection Threshold | Detection Units | ... | O.[OH-].[O].[O-][Si](=O)[O-].[Mg].[Al].[Fe] | CCCCOC(=O)/C=C\C(=O)OCCCC.CC(=O)OC=C.C[C@]12CC[C@H](C1(C)C)C[C@@H]2OC(=O)C=C | C=CC(=O)N.C=CC(=O)O.[Na] | C[Si](C)(Cl)Cl.O=[Si]=O | CCCC/C=C/CCO.CC(=O)O | CN(C)CCCNC(=O)NCCCN(C)C.C(CCl)OCCCl | C(=O)([O-])[O-].O.[OH-].[OH-].[OH-].[OH-].[OH-].[OH-].[OH-].[Mg+2].[Al+3] | CCC/C=C/CO.C1=CC=C(C=C1)C(=O)O | CC(=C)C(=O)NCCC[N+](C)(C)C.COC(=O)C=C.C=CC(=O)O.[Cl-] | C(C(C(F)(F)F)(OC(C(C(F)(F)F)(OC(C(C(F)(F)F)(O)F)(F)F)F)(F)F)F)(F)F
\rCC(=O)C(/C)=C/[C@H]1C(=CCCC1(C)C)C\r | 406 | 406 | (+)-(6R)-alpha-Iso-methylionone (ee = 94%) | 5372174 | NaN | \rCC(=O)C(/C)=C/[C@H]1C(=CCCC1(C)C)C\r | chem draw + cactus | LC | 7.90E-02 | ng/L air | ... | 0.032258 | 0.202247 | 0.140845 | 0.093750 | 0.157303 | 0.104478 | 0.090909 | 0.142857 | 0.189655 | 0.000000
\rCCCCCC[C@H](C)Oc1cnccn1 | 320 | 320 | (+)-2-{[(1S)-1-methylheptyl]oxy}pyrazine | 122379392 | NaN | \rCCCCCC[C@H](C)Oc1cnccn1\r | chem draw + cactus | LC | 3.00E+01 | NaN | ... | 0.000000 | 0.134715 | 0.000000 | 0.050633 | 0.211538 | 0.134228 | 0.000000 | 0.176991 | 0.106870 | 0.026846
\rCC[C@H](C)Oc1cnccn1\r | 314 | 314 | (+)-2-{[(1S)-1-methylpropyl]oxy}pyrazine | 520098 | CAS No.24168-70-5 from https://www.sigmaaldric... | \rCC[C@H](C)Oc1cnccn1\r | chem draw + cactus | LC | 1.00E+02 | NaN | ... | 0.000000 | 0.059524 | 0.000000 | 0.074074 | 0.101266 | 0.064516 | 0.000000 | 0.159091 | 0.075472 | 0.032258
\rC[C@H]1OCC[C@@H]1SC(C)=O | 366 | 366 | (2R,3S)-(+)-2-methyltetrahydrofuran-3-thiol ac... | 7157263 | https://pubchem.ncbi.nlm.nih.gov/compound/2S_3... | \rC[C@H]1OCC[C@@H]1SC(C)=O\r | chem draw + cactus | LC | 1.86E+01 | ng/L air | ... | 0.044444 | 0.136646 | 0.111111 | 0.127660 | 0.166667 | 0.085470 | 0.122449 | 0.098765 | 0.121212 | 0.000000
C(CC(=O)O)[C@@H](C(=O)O)N | 24 | 24 | L-(-)-glutamatic acid | 33032 | NaN | NaN | NaN | NaN | 4.00E-01 | ug/L in water | ... | 0.047619 | 0.101266 | 0.352941 | 0.090909 | 0.231884 | 0.087719 | 0.130435 | 0.205128 | 0.208333 | 0.035088

5 rows × 23533 columns
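`DataFrame.join` matches rows on index labels and only applies `rsuffix` to columns whose names clash between the two frames. The sketch below illustrates this on toy data; the frame and column names are made up for the example.

```python
import pandas as pd

meta_and_a = pd.DataFrame({"name": ["x", "y"], "score": [0.1, 0.2]}, index=["m1", "m2"])
other = pd.DataFrame({"score": [0.9, 0.8], "extra": [5, 6]}, index=["m2", "m1"])

# Rows are matched on index labels, not position; the clashing "score"
# column from `other` has the suffix appended to its name.
joined = meta_and_a.join(other, how="inner", rsuffix="_morgan")
print(joined.columns.tolist())
print(joined.loc["m1", "score_morgan"])
```

Because the suffix is appended to the end of the column name, a value such as `'morg_'` produces names that end in `morg_` rather than start with it.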

# Zero-variance columns still need to be dropped; check the range of the column variances
print(both.var().max())
print(both.var().min())
340658911179732.4
0.0
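A minimum variance of 0.0 means at least one column is constant, and the very large maximum shows that the columns live on very different scales. A minimal sketch of dropping constant columns, using a made-up frame:

```python
import pandas as pd

toy = pd.DataFrame({
    "constant": [3.0, 3.0, 3.0, 3.0],
    "varying": [1.0, 2.0, 3.0, 4.0],
    "huge_scale": [0.0, 1e7, 2e7, 3e7],
})

variances = toy.var()                   # sample variance per column (ddof=1)
keep = variances[variances > 0].index   # columns that actually vary
filtered = toy[keep]

print(filtered.columns.tolist())
```

The large-scale column survives this filter; differences in scale are handled separately by standardization.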
# Get the Mordred and Morgan features whose values are all numeric and non-null
# Combine the two sets of finite features
finite_mordred = model_helpers.finite_features(mordred_data)
finite_morgan = model_helpers.finite_features(morgan_data)
both_features = finite_mordred | finite_morgan
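The implementation of `finite_features` is not shown in this notebook. The sketch below is a hypothetical stand-in, assuming it returns a set of column names whose values are all finite numbers, in which case `|` is a set union. The function name `finite_feature_names` and the toy frames are made up for illustration.

```python
import numpy as np
import pandas as pd

def finite_feature_names(df: pd.DataFrame) -> set:
    """Names of numeric columns whose values are all finite (no NaN or inf).

    Hypothetical stand-in for illustration; not the code in `Utils`.
    """
    numeric = df.select_dtypes(include=[np.number])
    mask = np.isfinite(numeric.to_numpy(dtype=float)).all(axis=0)
    return set(numeric.columns[mask])

a = pd.DataFrame({"ok": [1.0, 2.0], "has_nan": [np.nan, 1.0], "label": ["x", "y"]})
b = pd.DataFrame({"fine": [0.5, 0.7], "has_inf": [np.inf, 1.0]})

both_names = finite_feature_names(a) | finite_feature_names(b)  # set union
print(sorted(both_names))
```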

Model

# Illustrate the magnitude differences across enantiomeric pairs in the dataset
model_helpers.fold_difference_of_enantiomers(half_enantiomer_data)
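What `fold_difference_of_enantiomers` computes is not shown here. As a rough illustration of what a fold difference between two enantiomers' detection thresholds could mean, the sketch below takes the ratio of the larger to the smaller threshold and its base-10 logarithm. Both the definition and the numbers are assumptions for the example, not the helper's actual behaviour.

```python
import math

def fold_difference(threshold_a: float, threshold_b: float) -> float:
    """Ratio of the larger to the smaller detection threshold (always >= 1)."""
    if threshold_a <= 0 or threshold_b <= 0:
        raise ValueError("thresholds must be positive")
    return max(threshold_a, threshold_b) / min(threshold_a, threshold_b)

# Made-up thresholds for one enantiomeric pair, expressed in the same units.
plus_form, minus_form = 0.5, 50.0
fold = fold_difference(plus_form, minus_form)
print(fold, math.log10(fold))
```

Working on a log scale keeps pairs that differ by several orders of magnitude comparable with pairs that barely differ.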
# Prepare the inputs for the Mordred model:
# keep the finite (non-null) feature columns and the log_abs target, restricted to rows with log_abs < 10
X_mordred = mordred_data[finite_mordred]
y = mordred_data['log_abs']
X_mordred = X_mordred[y < 10]
y_mordred = y[y < 10]
Xn_mordred = pd.DataFrame(StandardScaler().fit_transform(X_mordred), index=X_mordred.index, columns=X_mordred.columns)
# Prepare the inputs for the Morgan model in the same way:
# keep the finite (non-null) feature columns and the log_abs target, restricted to rows with log_abs < 10
x_morgan = morgan_data[finite_morgan]
y_morgan = morgan_data["log_abs"]
x_morgan = x_morgan[y_morgan < 10]
y_morgan = y_morgan[y_morgan < 10]
Xn_morgan = pd.DataFrame(StandardScaler().fit_transform(x_morgan), index=x_morgan.index, columns=x_morgan.columns)
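`StandardScaler` rescales each column to zero mean and unit variance, using the population standard deviation. The sketch below reproduces that arithmetic with NumPy on a made-up matrix so the effect on columns of very different scale is visible; it is an illustration of the transform, not a replacement for the scikit-learn call.

```python
import numpy as np

# Toy feature matrix: the second column is on a much larger scale.
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 600.0]])

mean = X.mean(axis=0)
std = X.std(axis=0)      # population standard deviation (ddof=0)
std[std == 0] = 1.0      # leave constant columns unscaled instead of dividing by zero
Xn = (X - mean) / std

print(Xn.mean(axis=0))
print(Xn.std(axis=0))
```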
# Model for Morgan data
model_helpers.create_model(Xn_morgan, y_morgan)
100%|██████████| 13/13 [23:58<00:00, 110.63s/it]
_images/MordredAndMorgan_18_1.png
model_helpers.cross_val(Xn_morgan, y_morgan)
_images/MordredAndMorgan_19_0.png
# Model for Mordred data
model_helpers.create_model(Xn_mordred, y_mordred)
100%|██████████| 13/13 [01:02<00:00,  4.77s/it]
_images/MordredAndMorgan_20_1.png
model_helpers.cross_val(Xn_mordred, y_mordred)
_images/MordredAndMorgan_21_0.png
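The estimator and scoring used inside `create_model` and `cross_val` are not shown in this notebook. As a rough sketch of what a cross-validated score measures, the example below runs shuffled k-fold cross-validation with a ridge regression written in NumPy. The ridge model, the R^2 score, the function name `kfold_r2`, and the synthetic data are all assumptions chosen for illustration.

```python
import numpy as np

def kfold_r2(X, y, n_splits=5, alpha=1.0, seed=0):
    """Mean out-of-fold R^2 of a ridge regression, using shuffled k-fold splits."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_splits)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        X_tr, y_tr = X[train_idx], y[train_idx]
        # Centre with training statistics only, then solve the ridge normal equations.
        x_mean, y_mean = X_tr.mean(axis=0), y_tr.mean()
        A = (X_tr - x_mean).T @ (X_tr - x_mean) + alpha * np.eye(X.shape[1])
        coef = np.linalg.solve(A, (X_tr - x_mean).T @ (y_tr - y_mean))
        pred = (X[test_idx] - x_mean) @ coef + y_mean
        ss_res = np.sum((y[test_idx] - pred) ** 2)
        ss_tot = np.sum((y[test_idx] - y[test_idx].mean()) ** 2)
        scores.append(1.0 - ss_res / ss_tot)
    return float(np.mean(scores))

# Synthetic data: a linear signal plus a little noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=100)
score = kfold_r2(X, y)
print(round(score, 3))
```

Each row is predicted only by a model that never saw it during fitting, which is what makes the score an estimate of performance on unseen molecules.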