Note

Go to the end to download the full example code

1. EDA

import matplotlib.pyplot as plt
plt.rcParams["font.family"] = "Times New Roman"

import pandas as pd

from ai4water.eda import EDA
from easy_mpl import plot, boxplot, hist
from easy_mpl.utils import create_subplots

from utils import make_data, box_violin, \
    DYE_TYPES, ADSORBENT_TYPES

/home/docs/checkouts/readthedocs.org/user_builds/ai4adsorption/envs/latest/lib/python3.7/site-packages/sklearn/experimental/enable_hist_gradient_boosting.py:17: UserWarning: Since version 1.0, it is not needed to import enable_hist_gradient_boosting anymore. HistGradientBoostingClassifier and HistGradientBoostingRegressor are now stable and can be normally imported from sklearn.ensemble.
  "Since version 1.0, "
IPython could not be loaded!

Loading the original dataset

data, _, _ = make_data()

Here, we are printing the shape of original dataset. The first value shows the number of samples/examples/datapoints and the second one shows the number of features.

print(data.shape)

(1514, 13)

The first five samples are

data.head()

	Adsorption Time (min)	Pyrolysis Temperature	Initial Concentration	Solution pH	Adsorbent Loading	Volume (L)	Adsorption Temperature	Surface Area	Pore Volume	Adsorbent	Dye	Adsorption
0	0.0	25	200.0	2.8	10.0	1.0	25.0	2.75	0.0001	GIC	CR	0.0000
1	5.0	25	200.0	2.8	10.0	1.0	25.0	2.75	0.0001	GIC	CR	6.0000
2	10.0	25	200.0	2.8	10.0	1.0	25.0	2.75	0.0001	GIC	CR	11.4310
3	15.0	25	200.0	2.8	10.0	1.0	25.0	2.75	0.0001	GIC	CR	18.3542
4	20.0	25	200.0	2.8	10.0	1.0	25.0	2.75	0.0001	GIC	CR	19.2639

The last five samples are

data.tail()

	Adsorption Time (min)	Pyrolysis Temperature	Pyrolysis Time (min)	Initial Concentration	Solution pH	Adsorbent Loading	Volume (L)	Adsorption Temperature	Surface Area	Pore Volume	Adsorbent	Dye	Adsorption
1509	60.0	700	360.0	10.0	7.0	0.05	0.025	25.0	1085.0	0.19	MZ-Biochar	FG FCF	4.860404
1510	60.0	700	360.0	10.0	7.0	0.05	0.025	35.0	1085.0	0.19	MZ-Biochar	FG FCF	4.906988
1511	60.0	700	360.0	10.0	7.0	0.05	0.025	15.0	1085.0	0.19	MZ-Biochar	FG FCF	4.347981
1512	60.0	700	360.0	10.0	7.0	0.05	0.025	25.0	1085.0	0.19	MZ-Biochar	FG FCF	4.798292
1513	60.0	700	360.0	10.0	7.0	0.05	0.025	35.0	1085.0	0.19	MZ-Biochar	FG FCF	4.860404

The names of different adsorbents and their counts

data['Adsorbent'].value_counts()

Exfoliated GIC    213
GIC               141
MN-Biochar        114
M-Biochar         114
MZ-Biochar        114
PSB-LDHMgAl        99
APAC               98
PSB                93
CMCAC              72
TSAC               46
SAC                34
WSAC               32
CAC                26
CBAC               26
HAC                26
MC550              24
AC800              21
AC600              21
AC700              21
AC900              21
CS                 13
BGBHAC             10
CS-AC-H4P2O7        9
CS-AC-H3PO4         9
CS-AC-NaOH          9
CS-AC-KOH           9
TWAC                7
VAC                 6
MCNaOH40            5
MCNaOH50            5
MC30/1              5
MCNaOH30            5
MC0.478             5
MC0.75              5
MC0.659             5
MC0.569             5
MC20/1              5
MC25/1              5
MC35/1              5
MCNaOH10            5
MC600               4
MC500               4
MC450               4
MC400               4
MC350               4
TRAC                4
GSAC-Ce-1           1
GSAC                1
Name: Adsorbent, dtype: int64

The names of different dyes and their counts

data['Dye'].value_counts()

CR        378
MB        326
MO        269
FG FCF    171
NR        104
SYF        99
GR         36
Rhd B      32
MV         24
AR         24
RB5        22
YD          9
AM          7
BV14        7
RD          4
AB25        2
Name: Dye, dtype: int64

Removing the categorical features from our dataframe

data.pop("Adsorbent")
data.pop("Dye")

         CR
         CR
         CR
         CR
         CR
         ...
  FG FCF
  FG FCF
  FG FCF
  FG FCF
  FG FCF
Name: Dye, Length: 1514, dtype: object

get statistical summary of data

pd.set_option('display.max_columns', None)

print(data.describe())

       Adsorption Time (min)  Pyrolysis Temperature  Pyrolysis Time (min)  \
count            1514.000000            1514.000000           1514.000000
mean              171.938573             575.769485            119.175028
std               269.035920             260.147498            136.463100
min                 0.000000              25.000000              0.000000
25%                60.000000             550.000000              0.333300
50%                60.000000             700.000000             60.000000
75%               150.000000             700.000000            120.000000
max              1440.000000             900.000000            360.000000

       Initial Concentration  Solution pH  Adsorbent Loading   Volume (L)  \
count            1514.000000  1514.000000        1514.000000  1514.000000
mean              164.771493     5.976427           0.931035     0.152318
std               188.311013     1.758724           2.296184     0.229913
min                 1.005333     2.800000           0.001000     0.020000
25%                10.000000     5.150000           0.037500     0.025000
50%               100.000000     7.000000           0.050000     0.050000
75%               300.000000     7.000000           0.500000     0.250000
max               900.000000    11.670000          10.000000     1.000000

       Adsorption Temperature  Surface Area  Pore Volume   Adsorption
count             1514.000000   1514.000000  1514.000000  1514.000000
mean                25.772787    537.959901     0.237928   180.774681
std                  4.827209    697.224000     0.297145   419.203955
min                 15.000000      2.750000     0.000100     0.000000
25%                 25.000000     26.550000     0.016000     4.952459
50%                 25.000000    181.100000     0.183000    20.000000
75%                 25.000000    893.000000     0.389000    98.487500
max                 60.000000   2430.000000     1.321000  2235.000000

initializing an instance of EDA class from AI4Water in order to get some insights of the data

eda = EDA(data = data, save=False, show=False)

plot correlation between numerical features

ax = eda.correlation(figsize=(9,9))
ax.set_xticklabels(ax.get_xticklabels(), fontsize=12, weight='bold')
ax.set_yticklabels(ax.get_yticklabels(), fontsize=12, weight='bold')
plt.tight_layout()
plt.show()

findfont: Font family ['Times New Roman'] not found. Falling back to DejaVu Sans.
findfont: Font family ['Times New Roman'] not found. Falling back to DejaVu Sans.
findfont: Font family ['Times New Roman'] not found. Falling back to DejaVu Sans.

making a line plot for numerical features

fig, axes = create_subplots(data.shape[1])

for ax, col, label  in zip(axes.flat, data, data.columns):

    plot(data[col].values, ax=ax, ax_kws=dict(ylabel=col),
         lw=0.9,
         color='darkcyan', show=False)
plt.tight_layout()
plt.show()

fig, axes = create_subplots(data.shape[1])
for ax, col in zip(axes.flat, data.columns):
    boxplot(data[col].values, ax=ax, vert=False, fill_color='lightpink',
            flierprops={"ms": 1.0}, show=False, patch_artist=True,
            widths=0.6, medianprops={"color": "gray"},
            ax_kws=dict(xlabel=col, xlabel_kws={'weight': "bold"}))
plt.tight_layout()
plt.show()

findfont: Font family ['Times New Roman'] not found. Falling back to DejaVu Sans.

show the box and (half) violin plots together

fig, axes = create_subplots(data.shape[1])
for ax, col in zip(axes.flat, data.columns):
    box_violin(ax=ax, data=data[col], palette="Set2")
    ax.set_xlabel(xlabel=col, weight='bold')
    ax.set_yticklabels(ax.get_yticklabels(), weight='bold')
plt.tight_layout()
plt.show()

fig, axes = create_subplots(data.shape[1])

for ax, col, label  in zip(axes.flat, data, data.columns):

    hist(data[col].values, ax=ax, bins=10,  show=False,
         grid=False,linewidth=0.5, edgecolor="k", color="khaki",
         ax_kws=dict(ylabel="Counts", xlabel=col))
plt.tight_layout()
plt.show()

data, _, _ = make_data()
data.pop('Dye')
feature = data['Adsorbent']
d = {k:ADSORBENT_TYPES[k] for k in feature.unique()}
feature = feature.map(d)
data['Adsorbent'] = feature

df_gb = data.loc[data['Adsorbent']=="GB"]
df_ac = data.loc[data['Adsorbent']=="AC"]
df_bio = data.loc[data['Adsorbent']=="Biochar"]
data.pop('Adsorbent')

fig, axes = create_subplots(data.shape[1])

for ax, col in zip(axes.flat, data.columns):

    boxplot([df_gb[col], df_ac[col], df_bio[col]],
            labels=["GB", "AC", "BC"],
                ax=ax,
                flierprops={"ms": 0.6},
                fill_color='lightpink',
                patch_artist=True,
                widths=0.5,
            medianprops={"color": "gray"},
            vert=False,
            show=False,
            ax_kws=dict(xlabel=col, xlabel_kws={'weight': 'bold'})
                )
    ax.set_yticklabels(ax.get_yticklabels(), weight='bold')
plt.tight_layout()
plt.show()

data, _, _ = make_data()
data.pop('Adsorbent')
feature = data['Dye']
d = {k:DYE_TYPES[k] for k in feature.unique()}
feature = feature.map(d)
data['Dye'] = feature

df_an = data.loc[data['Dye']=="Anionic"]
df_cat = data.loc[data['Dye']=="Cationic"]
data.pop('Dye')

fig, axes = create_subplots(data.shape[1])

for ax, col in zip(axes.flat, data.columns):

    boxplot([df_an[col], df_cat[col]],
            labels=["AN", "CT"],
                ax=ax,
                flierprops={"ms": 0.6},
            medianprops={"color": "gray"},
                fill_color='lightpink',
            patch_artist=True,
                vert=False,
                widths=0.5,
            show=False,
            ax_kws=dict(xlabel=col, xlabel_kws={"weight": "bold"})
                )
    ax.set_yticklabels(ax.get_yticklabels(), weight='bold')
plt.tight_layout()
plt.show()

data, _, _ = make_data()
feature = data['Adsorbent']
d = {k:ADSORBENT_TYPES[k] for k in feature.unique()}
feature = feature.map(d)
data['Adsorbent'] = feature

df_ac = data.loc[data['Adsorbent']=="AC"]
df_ac['code'] = "AC"
df_bc = data.loc[data['Adsorbent']=="Biochar"]
df_bc['code'] = "BC"
df_gb = data.loc[data['Adsorbent']=="GB"]
df_gb['code'] = "GB"


df_ac.describe()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

	Adsorption Time (min)	Pyrolysis Temperature	Pyrolysis Time (min)	Initial Concentration	Solution pH	Adsorbent Loading	Volume (L)	Adsorption Temperature	Surface Area	Pore Volume	Adsorption
count	626.000000	626.000000	626.000000	626.000000	626.000000	626.000000	626.000000	626.00000	626.000000	626.000000	626.000000
mean	223.546326	579.856230	54.632588	226.609190	6.588994	0.516677	0.087077	26.86901	962.381933	0.475778	380.680009
std	336.329779	280.604026	33.982713	214.136733	0.896097	0.835641	0.068823	6.97117	830.197539	0.325806	593.047870
min	0.000000	25.000000	0.000000	1.911000	3.000000	0.010000	0.020000	20.00000	6.500000	0.006400	0.000000
25%	42.500000	550.000000	30.000000	50.000000	6.500000	0.040000	0.040000	25.00000	222.000000	0.183200	15.000000
50%	90.000000	650.000000	60.000000	104.677713	7.000000	0.100000	0.050000	25.00000	623.000000	0.389000	73.123885
75%	200.000000	800.000000	60.000000	400.000000	7.000000	0.500000	0.100000	25.00000	1814.000000	0.530000	431.595500
max	1440.000000	900.000000	120.000000	900.000000	11.670000	3.995000	0.250000	60.00000	2430.000000	1.321000	2235.000000

df_ac.median()

Dropping of nuisance columns in DataFrame reductions (with 'numeric_only=None') is deprecated; in a future version this will raise TypeError.  Select only valid columns before calling the reduction.

Adsorption Time (min)      90.000000
Pyrolysis Temperature     650.000000
Pyrolysis Time (min)       60.000000
Initial Concentration     104.677713
Solution pH                 7.000000
Adsorbent Loading           0.100000
Volume (L)                  0.050000
Adsorption Temperature     25.000000
Surface Area              623.000000
Pore Volume                 0.389000
Adsorption                 73.123885
dtype: float64

df_bc.describe()

	Adsorption Time (min)	Pyrolysis Temperature	Pyrolysis Time (min)	Initial Concentration	Solution pH	Adsorbent Loading	Volume (L)	Adsorption Temperature	Surface Area	Pore Volume	Adsorption
count	534.000000	534.000000	534.000000	534.000000	534.000000	534.000000	5.340000e+02	534.000000	534.000000	534.000000	534.000000
mean	189.775281	664.044944	273.707865	13.308404	6.865169	0.031174	2.500000e-02	25.000000	385.725281	0.115803	6.937043
std	244.521100	48.031863	115.276472	16.902007	0.890108	0.017613	3.472700e-18	2.598888	395.796516	0.087568	11.987670
min	30.000000	600.000000	120.000000	1.005333	3.000000	0.001000	2.500000e-02	15.000000	8.500000	0.016000	0.024606
25%	60.000000	600.000000	120.000000	5.000000	7.000000	0.010000	2.500000e-02	25.000000	147.200000	0.031000	0.499613
50%	60.000000	700.000000	360.000000	10.000000	7.000000	0.037500	2.500000e-02	25.000000	181.100000	0.120000	1.322144
75%	360.000000	700.000000	360.000000	10.000000	7.000000	0.050000	2.500000e-02	25.000000	495.700000	0.190000	9.060564
max	1440.000000	700.000000	360.000000	80.053333	9.000000	0.050000	2.500000e-02	35.000000	1085.000000	0.239000	88.525132

df_bc.median()

Dropping of nuisance columns in DataFrame reductions (with 'numeric_only=None') is deprecated; in a future version this will raise TypeError.  Select only valid columns before calling the reduction.

Adsorption Time (min)      60.000000
Pyrolysis Temperature     700.000000
Pyrolysis Time (min)      360.000000
Initial Concentration      10.000000
Solution pH                 7.000000
Adsorbent Loading           0.037500
Volume (L)                  0.025000
Adsorption Temperature     25.000000
Surface Area              181.100000
Pore Volume                 0.120000
Adsorption                  1.322144
dtype: float64

df_gb.describe()

	Adsorption Time (min)	Pyrolysis Temperature	Pyrolysis Time (min)	Initial Concentration	Solution pH	Adsorbent Loading	Volume (L)	Adsorption Temperature	Surface Area	Pore Volume	Adsorption
count	354.000000	354.000000	354.000000	354.000000	354.000000	354.000000	354.000000	354.0	354.000000	354.000000	354.000000
mean	53.771186	435.381356	0.200545	283.898305	3.552542	3.021186	0.459746	25.0	17.070339	0.001544	89.499437
std	21.570128	339.992391	0.163398	121.057900	1.699194	3.930781	0.302206	0.0	11.667756	0.001177	73.113378
min	0.000000	25.000000	0.000000	100.000000	2.800000	0.250000	0.250000	25.0	2.750000	0.000100	0.000000
25%	60.000000	25.000000	0.000000	200.000000	2.800000	0.500000	0.250000	25.0	2.750000	0.000100	22.411250
50%	60.000000	700.000000	0.333300	200.000000	2.800000	0.500000	0.250000	25.0	26.550000	0.002500	63.866667
75%	60.000000	700.000000	0.333300	400.000000	2.800000	6.000000	0.500000	25.0	26.550000	0.002500	146.732500
max	120.000000	900.000000	0.333300	500.000000	9.000000	10.000000	1.000000	25.0	26.550000	0.002500	246.835000

df_gb.median()

Dropping of nuisance columns in DataFrame reductions (with 'numeric_only=None') is deprecated; in a future version this will raise TypeError.  Select only valid columns before calling the reduction.

Adsorption Time (min)      60.000000
Pyrolysis Temperature     700.000000
Pyrolysis Time (min)        0.333300
Initial Concentration     200.000000
Solution pH                 2.800000
Adsorbent Loading           0.500000
Volume (L)                  0.250000
Adsorption Temperature     25.000000
Surface Area               26.550000
Pore Volume                 0.002500
Adsorption                 63.866667
dtype: float64

COLUMNS = ['Pyrolysis Temperature', 'Pyrolysis Time (min)', 'Surface Area', 'Pore Volume']

fig, axes = create_subplots(len(COLUMNS))

for ax, col in zip(axes.flat, COLUMNS):
    df_ads_feat = pd.concat([df_ac[[col, 'code']],
                             df_bc[[col, 'code']],
                             df_gb[[col, 'code']]])

    boxplot([df_ac[col], df_bc[col], df_gb[col]],
            labels=["AC", "BC", "GB"],
            ax=ax,
            flierprops={"ms": 0.6},
            medianprops={"color": "black"},
            fill_color='lightpink',
            patch_artist=True,
            vert=False,
            widths=0.5,
            show=False,
            ax_kws=dict(xlabel=col, xlabel_kws={"weight": "bold"})
                )
    ax.set_yticklabels(ax.get_yticklabels(), weight='bold')
plt.tight_layout()
plt.show()

Total running time of the script: (0 minutes 15.233 seconds)

Gallery generated by Sphinx-Gallery