Scaled & Efficient Supervised Learning with AutoML
Accelerating time-to-value by automating modeling tasks on beer consumer data—using H2O.ai in R and Python.
Continuing the binary classification project hypothetically chartered by the Blue Moon Brewing Company 12 years ago, this project is a longitudinal study to re-validate the relevance of Blue Moon’s STP (Segmentation, Targeting, Positioning) marketing strategy on today’s evolving beer consumers.
The objective of this data analysis is the same: to infer whether demographic data around gender, age, marital status, and income continue to indicate a consumer preference for light beer. To achieve this, I collected survey data from 1,500 beer consumers. Employing automated machine learning (AutoML), this project explores a broad set of alternative classification methods, beyond the baseline logistic regression applied in the prior project.
Data Understanding
For data understanding, I imported a CSV file with 1,500 records and 5 columns. These columns are gender (0 for female, 1 for male), marital status (0 for unmarried, 1 for married), income, age, and beer preference (0 for regular, 1 for light). This initial analysis is critical for understanding the dataset’s structure and preparing for subsequent data exploration and modeling.
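As a minimal sketch of this import step (the file name beer.csv is an assumption for illustration; substitute the actual path):
# Import the survey data into an R data frame
beer_r <- read.csv(file="beer.csv")

# Inspect structure: 1,500 observations across 5 variables
str(beer_r)
dim(beer_r)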
Exploratory Data Analysis
Exploratory Data Analysis (EDA) is essential for understanding single variables (univariate), pairs (bivariate), and multiple (multivariate) interactions. It reveals trends, patterns, and anomalies, informing subsequent analysis and hypothesis development.
Univariate Analysis
Univariate analysis summarizes and identifies patterns in individual variables. It informs subsequent analysis, revealing insights into distribution, central tendency, and variability. By examining one variable at a time, it detects outliers, assesses data quality, and sets the stage for more complex bivariate and multivariate analyses.
summary(beer_r)
gender married income age prefer_light
Min. :0.00 Min. :0.00 Min. :24796 Min. :21 Min. :0.0
1st Qu.:0.00 1st Qu.:0.00 1st Qu.:46279 1st Qu.:49 1st Qu.:0.0
Median :0.00 Median :0.00 Median :52306 Median :56 Median :0.5
Mean :0.43 Mean :0.45 Mean :52561 Mean :55 Mean :0.5
3rd Qu.:1.00 3rd Qu.:1.00 3rd Qu.:58955 3rd Qu.:62 3rd Qu.:1.0
Max. :1.00 Max. :1.00 Max. :84031 Max. :87 Max. :1.0
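Beyond the numerical summary, a univariate distribution can be visualized with ggplot2 (Grammar of Graphics). A minimal sketch for income, assuming beer_r is loaded as above:
# Histogram of income to inspect shape, skew, and potential outliers
ggplot2::ggplot(data=beer_r, mapping=ggplot2::aes(x=income)) +
  ggplot2::geom_histogram(bins=30, fill="steelblue", color="white") +
  ggplot2::labs(title="Distribution of Income", x="Income (USD)", y="Count")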
Bivariate Analysis
Bivariate analysis examines the relationship between two variables at a time—for example, how income or age relates to the preference for light beer—bridging the univariate summaries above and the multivariate view below.
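A minimal sketch of one such bivariate view, using ggplot2 and assuming beer_r is loaded as above (the 0/1 coding of prefer_light is treated as a factor here for plotting only):
# Boxplot of income by beer preference (0 = regular, 1 = light)
ggplot2::ggplot(data=beer_r, mapping=ggplot2::aes(x=factor(prefer_light), y=income)) +
  ggplot2::geom_boxplot(fill="steelblue") +
  ggplot2::labs(title="Income by Beer Preference", x="Prefers Light Beer", y="Income (USD)")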
Multivariate Analysis
Multivariate analysis plays a pivotal role in understanding intricate relationships within datasets. By exploring patterns, identifying outliers, and revealing the underlying structure, it provides valuable insights. One essential tool in multivariate analysis is the correlation matrix, which quantifies the strength and direction of relationships between variables. Positive values indicate direct associations, while negative values imply inverse relationships. Leveraging insights from the correlation matrix, we can make informed decisions about feature selection, hypothesis testing, and model building. It also makes detecting multicollinearity and identifying potential predictors more effective.
correlation_matrix_r <- round(cor(beer_r), 2)
head(correlation_matrix_r[5:1, 1:5])
gender married income age prefer_light
prefer_light -0.09 0.05 0.40 -0.41 1.00
age 0.19 0.22 0.11 1.00 -0.41
income 0.03 0.33 1.00 0.11 0.40
married -0.04 1.00 0.33 0.22 0.05
gender 1.00 -0.04 0.03 0.19 -0.09
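To make these pairwise relationships easier to scan, the same matrix can be rendered as a heatmap with ggplot2. A minimal sketch, assuming correlation_matrix_r from above:
# Reshape the correlation matrix to long format for plotting
correlation_long_r <- as.data.frame(as.table(correlation_matrix_r))
colnames(correlation_long_r) <- c("var_x", "var_y", "correlation")

# Heatmap of pairwise correlations, annotated with the rounded values
ggplot2::ggplot(data=correlation_long_r, mapping=ggplot2::aes(x=var_x, y=var_y, fill=correlation)) +
  ggplot2::geom_tile() +
  ggplot2::geom_text(mapping=ggplot2::aes(label=correlation)) +
  ggplot2::scale_fill_gradient2(low="steelblue", mid="white", high="firebrick", limits=c(-1, 1)) +
  ggplot2::labs(title="Correlation Matrix of Beer Consumer Variables", x=NULL, y=NULL)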
Data Preparation
Data Frame Conversion to H2O Data Frame
# Recode the 0/1 columns into more readable types before handing off to H2O
beer_r$gender <- factor(beer_r$gender, levels=c(0, 1), labels=c("Female", "Male"))
beer_r$married <- as.logical(beer_r$married)
beer_r$prefer_light <- as.logical(beer_r$prefer_light)

# Start a local H2O cluster and convert the R data frame to an H2O frame
local_h2o <- h2o.init()
beer_h2o_r <- as.h2o(beer_r)
dim(beer_h2o_r)
[1] 1500 5
head(beer_h2o_r)
gender married income age prefer_light
1 Male TRUE 35885 48 FALSE
2 Female FALSE 37737 66 FALSE
3 Female FALSE 26388 62 FALSE
4 Female TRUE 43483 61 FALSE
5 Female FALSE 38079 54 FALSE
6 Female FALSE 44328 41 TRUE
Train & Test Data Splitting
# Split into ~70% training and ~30% test sets (seeded for reproducibility)
beer_splits_h2o_r <- h2o.splitFrame(data=beer_h2o_r, ratios=0.7, seed=1754) #RoarLionRoar 🦁
beer_train_h2o_r <- beer_splits_h2o_r[[1]]
beer_test_h2o_r <- beer_splits_h2o_r[[2]]
dim(beer_train_h2o_r)
[1] 1040 5
head(beer_train_h2o_r)
gender married income age prefer_light
1 Male TRUE 35885 48 FALSE
2 Female FALSE 37737 66 FALSE
3 Female TRUE 43483 61 FALSE
4 Female FALSE 44328 41 TRUE
5 Female TRUE 40865 64 FALSE
6 Female FALSE 54499 45 TRUE
dim(beer_test_h2o_r)
[1] 460 5
head(beer_test_h2o_r)
gender married income age prefer_light
1 Female FALSE 26388 62 FALSE
2 Female FALSE 38079 54 FALSE
3 Male TRUE 62118 62 TRUE
4 Male TRUE 67201 85 FALSE
5 Female FALSE 40382 62 FALSE
6 Male TRUE 54339 38 TRUE
Data Modeling
AutoML Classification Models: Training
# Predictor and response columns for the classification task
models_classification_predictors_r <- c("gender", "married", "income", "age")
models_classification_response_r <- "prefer_light"

# Train up to 12 models (stacked ensembles are added on top) with AutoML
models_classification_r <- h2o.automl(
  x=models_classification_predictors_r,
  y=models_classification_response_r,
  training_frame=beer_train_h2o_r,
  max_models=12,
  seed=1754 #RoarLionRoar 🦁
)
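Once training finishes, the top-ranked model can be pulled out directly from the AutoML object; a brief sketch:
# The leader is the best model on the leaderboard's default sort metric (AUC here)
models_classification_leader_r <- models_classification_r@leader
models_classification_leader_r@model_id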
AutoML Regression Models: Training
Suppose, within its brewery & restaurant in the RiNo district of Denver, Blue Moon seeks to optimize upselling opportunities by predicting a patron’s income. A numerical prediction can be made from the remaining data points: gender, marital status, age, and preference for light beer. Using AutoML, I can perform this regression task efficiently and accurately, going beyond a baseline linear regression.
# Predictor and response columns for the regression task
models_regression_predictors_r <- c("gender", "married", "age", "prefer_light")
models_regression_response_r <- "income"

# Train as many models as fit within 30 seconds of runtime
models_regression_r <- h2o.automl(
  x=models_regression_predictors_r,
  y=models_regression_response_r,
  training_frame=beer_train_h2o_r,
  leaderboard_frame=beer_test_h2o_r,
  max_runtime_secs=30,
  seed=1754 #RoarLionRoar 🦁
)
Model Evaluation
AutoML Classification Models
# print(models_classification_r@leaderboard, n=nrow(models_classification_r@leaderboard))
h2o.get_leaderboard(object=models_classification_r, extra_columns="ALL")
model_id auc logloss aucpr mean_per_class_error rmse mse training_time_ms predict_time_per_row_ms algo
1 StackedEnsemble_BestOfFamily_1_AutoML_3_20240529_150054 0.85 0.47 0.85 0.23 0.39 0.16 944 0.0196 StackedEnsemble
2 StackedEnsemble_AllModels_1_AutoML_3_20240529_150054 0.85 0.47 0.85 0.23 0.39 0.16 946 0.0194 StackedEnsemble
3 GLM_1_AutoML_3_20240529_150054 0.85 0.47 0.85 0.22 0.39 0.16 30 0.0033 GLM
4 GBM_1_AutoML_3_20240529_150054 0.85 0.49 0.84 0.24 0.40 0.16 406 0.0095 GBM
5 DeepLearning_1_AutoML_3_20240529_150054 0.85 0.49 0.85 0.26 0.40 0.16 53 0.0050 DeepLearning
6 XGBoost_1_AutoML_3_20240529_150054 0.84 0.49 0.84 0.23 0.40 0.16 124 0.0041 XGBoost
[14 rows x 10 columns]
models_classification_predictions_r <- h2o.predict(models_classification_r, beer_test_h2o_r)
head(models_classification_predictions_r)
predict FALSE TRUE
1 FALSE 0.987 0.013
2 FALSE 0.878 0.122
3 TRUE 0.366 0.634
4 FALSE 0.884 0.116
5 FALSE 0.945 0.055
6 TRUE 0.082 0.918
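To sanity-check the leader against the held-out labels, the predictions can be pulled back into R; a minimal sketch (this simple accuracy calculation is an illustration, not part of the original analysis):
# Bring the H2O prediction frame and test frame back into R
models_classification_predictions_df_r <- as.data.frame(models_classification_predictions_r)
beer_test_df_r <- as.data.frame(beer_test_h2o_r)

# Proportion of test rows where the predicted class matches the actual preference
mean(
  as.character(models_classification_predictions_df_r$predict) ==
  as.character(beer_test_df_r$prefer_light)
)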
models_classification_performance_r <- h2o.performance(models_classification_r@leader, beer_test_h2o_r)
models_classification_performance_r
H2OBinomialMetrics: stackedensemble
MSE: 0.15
RMSE: 0.38
LogLoss: 0.45
Mean Per-Class Error: 0.19
AUC: 0.87
AUCPR: 0.84
Gini: 0.74
Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
FALSE TRUE Error Rate
FALSE 188 57 0.232653 =57/245
TRUE 32 183 0.148837 =32/215
Totals 220 240 0.193478 =89/460
Maximum Metrics: Maximum metrics at their respective thresholds
metric threshold value idx
1 max f1 0.461769 0.804396 206
2 max f2 0.262300 0.863986 269
3 max f0point5 0.740746 0.794224 129
4 max accuracy 0.461769 0.806522 206
5 max precision 0.996241 1.000000 0
6 max recall 0.102316 1.000000 343
7 max specificity 0.996241 1.000000 0
8 max absolute_mcc 0.461769 0.617777 206
9 max min_per_class_accuracy 0.506297 0.795349 188
10 max mean_per_class_accuracy 0.461769 0.809255 206
11 max tns 0.996241 245.000000 0
12 max fns 0.996241 214.000000 0
13 max fps 0.002607 245.000000 399
14 max tps 0.102316 215.000000 343
15 max tnr 0.996241 1.000000 0
16 max fnr 0.996241 0.995349 0
17 max fpr 0.002607 1.000000 399
18 max tpr 0.102316 1.000000 343
Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
AutoML Regression Models
print(models_regression_r@leaderboard, n=nrow(models_regression_r@leaderboard))
model_id auc logloss aucpr mean_per_class_error rmse mse
1 DeepLearning_grid_3_AutoML_4_20240529_150107_model_3 0.87 0.46 0.85 0.19 0.39 0.15
2 GLM_1_AutoML_4_20240529_150107 0.87 0.45 0.85 0.20 0.38 0.15
3 DeepLearning_grid_1_AutoML_4_20240529_150107_model_4 0.87 0.45 0.85 0.19 0.38 0.15
4 DeepLearning_grid_1_AutoML_4_20240529_150107_model_3 0.87 0.44 0.85 0.20 0.38 0.14
5 DeepLearning_grid_1_AutoML_4_20240529_150107_model_5 0.87 0.48 0.85 0.20 0.39 0.15
6 DeepLearning_grid_2_AutoML_4_20240529_150107_model_3 0.87 0.45 0.85 0.20 0.38 0.14
7 DeepLearning_grid_1_AutoML_4_20240529_150107_model_1 0.87 0.46 0.85 0.20 0.38 0.15
8 DeepLearning_grid_2_AutoML_4_20240529_150107_model_1 0.87 0.48 0.84 0.20 0.40 0.16
9 StackedEnsemble_BestOfFamily_4_AutoML_4_20240529_150107 0.87 0.46 0.85 0.19 0.39 0.15
10 StackedEnsemble_BestOfFamily_3_AutoML_4_20240529_150107 0.87 0.46 0.84 0.19 0.39 0.15
11 StackedEnsemble_BestOfFamily_5_AutoML_4_20240529_150107 0.87 0.46 0.85 0.19 0.38 0.15
12 DeepLearning_grid_3_AutoML_4_20240529_150107_model_2 0.87 0.47 0.84 0.20 0.39 0.15
13 StackedEnsemble_BestOfFamily_2_AutoML_4_20240529_150107 0.87 0.46 0.84 0.19 0.39 0.15
14 StackedEnsemble_BestOfFamily_1_AutoML_4_20240529_150107 0.87 0.46 0.84 0.19 0.39 0.15
15 StackedEnsemble_AllModels_1_AutoML_4_20240529_150107 0.87 0.46 0.84 0.19 0.39 0.15
16 StackedEnsemble_AllModels_2_AutoML_4_20240529_150107 0.87 0.46 0.84 0.19 0.39 0.15
17 DeepLearning_grid_2_AutoML_4_20240529_150107_model_2 0.87 0.52 0.84 0.21 0.41 0.17
18 DeepLearning_1_AutoML_4_20240529_150107 0.87 0.46 0.84 0.19 0.39 0.15
19 StackedEnsemble_AllModels_4_AutoML_4_20240529_150107 0.87 0.46 0.84 0.19 0.39 0.15
20 StackedEnsemble_AllModels_3_AutoML_4_20240529_150107 0.87 0.46 0.84 0.19 0.39 0.15
21 DeepLearning_grid_1_AutoML_4_20240529_150107_model_2 0.87 0.45 0.84 0.21 0.38 0.15
22 DeepLearning_grid_3_AutoML_4_20240529_150107_model_1 0.87 0.46 0.83 0.20 0.39 0.15
23 XGBoost_grid_1_AutoML_4_20240529_150107_model_14 0.86 0.47 0.81 0.21 0.39 0.15
24 XGBoost_grid_1_AutoML_4_20240529_150107_model_6 0.86 0.47 0.82 0.20 0.39 0.15
25 GBM_grid_1_AutoML_4_20240529_150107_model_19 0.86 0.48 0.82 0.21 0.40 0.16
26 GBM_grid_1_AutoML_4_20240529_150107_model_17 0.86 0.47 0.83 0.21 0.40 0.16
27 GBM_1_AutoML_4_20240529_150107 0.86 0.48 0.81 0.21 0.40 0.16
28 XGBoost_1_AutoML_4_20240529_150107 0.85 0.48 0.80 0.20 0.40 0.16
29 GBM_grid_1_AutoML_4_20240529_150107_model_13 0.85 0.49 0.81 0.21 0.40 0.16
30 XGBoost_grid_1_AutoML_4_20240529_150107_model_3 0.85 0.48 0.81 0.22 0.40 0.16
31 XGBoost_grid_1_AutoML_4_20240529_150107_model_7 0.85 0.48 0.79 0.21 0.39 0.16
32 XGBoost_grid_1_AutoML_4_20240529_150107_model_2 0.85 0.49 0.80 0.22 0.40 0.16
33 XGBoost_grid_1_AutoML_4_20240529_150107_model_1 0.85 0.49 0.81 0.22 0.40 0.16
34 GBM_grid_1_AutoML_4_20240529_150107_model_6 0.85 0.49 0.79 0.22 0.40 0.16
35 XGBoost_grid_1_AutoML_4_20240529_150107_model_10 0.85 0.49 0.82 0.22 0.40 0.16
36 XGBoost_grid_1_AutoML_4_20240529_150107_model_11 0.85 0.49 0.80 0.23 0.40 0.16
37 XGBoost_2_AutoML_4_20240529_150107 0.85 0.49 0.80 0.21 0.40 0.16
38 GBM_grid_1_AutoML_4_20240529_150107_model_2 0.85 0.49 0.81 0.23 0.40 0.16
39 XGBoost_grid_1_AutoML_4_20240529_150107_model_15 0.85 0.49 0.80 0.21 0.40 0.16
40 GBM_grid_1_AutoML_4_20240529_150107_model_10 0.85 0.49 0.79 0.23 0.40 0.16
41 XGBoost_3_AutoML_4_20240529_150107 0.85 0.49 0.81 0.22 0.40 0.16
42 GBM_grid_1_AutoML_4_20240529_150107_model_11 0.84 0.49 0.80 0.23 0.40 0.16
43 XGBoost_grid_1_AutoML_4_20240529_150107_model_13 0.84 0.50 0.80 0.21 0.40 0.16
44 XGBoost_grid_1_AutoML_4_20240529_150107_model_8 0.84 0.50 0.80 0.24 0.41 0.17
45 GBM_grid_1_AutoML_4_20240529_150107_model_5 0.84 0.50 0.81 0.22 0.40 0.16
46 XGBoost_grid_1_AutoML_4_20240529_150107_model_4 0.84 0.50 0.80 0.25 0.40 0.16
47 GBM_grid_1_AutoML_4_20240529_150107_model_16 0.84 0.51 0.79 0.23 0.41 0.17
48 GBM_2_AutoML_4_20240529_150107 0.84 0.50 0.80 0.22 0.41 0.16
49 GBM_grid_1_AutoML_4_20240529_150107_model_8 0.84 0.50 0.80 0.25 0.41 0.17
50 GBM_4_AutoML_4_20240529_150107 0.84 0.50 0.80 0.22 0.41 0.17
51 GBM_grid_1_AutoML_4_20240529_150107_model_15 0.83 0.52 0.81 0.25 0.41 0.17
52 GBM_5_AutoML_4_20240529_150107 0.83 0.51 0.79 0.22 0.41 0.17
53 GBM_3_AutoML_4_20240529_150107 0.83 0.51 0.79 0.22 0.41 0.17
54 GBM_grid_1_AutoML_4_20240529_150107_model_7 0.83 0.53 0.79 0.25 0.42 0.18
55 XRT_1_AutoML_4_20240529_150107 0.83 0.52 0.79 0.23 0.41 0.17
56 XGBoost_grid_1_AutoML_4_20240529_150107_model_9 0.83 0.55 0.81 0.26 0.42 0.18
57 GBM_grid_1_AutoML_4_20240529_150107_model_3 0.83 0.52 0.79 0.23 0.42 0.17
58 XGBoost_grid_1_AutoML_4_20240529_150107_model_12 0.82 0.55 0.80 0.25 0.43 0.18
59 GBM_grid_1_AutoML_4_20240529_150107_model_14 0.82 0.54 0.77 0.25 0.42 0.18
60 DRF_1_AutoML_4_20240529_150107 0.81 0.56 0.79 0.26 0.43 0.18
61 XGBoost_grid_1_AutoML_4_20240529_150107_model_5 0.81 0.64 0.79 0.25 0.44 0.20
62 GBM_grid_1_AutoML_4_20240529_150107_model_18 0.81 0.54 0.79 0.25 0.42 0.18
63 GBM_grid_1_AutoML_4_20240529_150107_model_9 0.81 0.54 0.78 0.29 0.43 0.18
64 GBM_grid_1_AutoML_4_20240529_150107_model_1 0.81 0.54 0.78 0.26 0.43 0.18
65 GBM_grid_1_AutoML_4_20240529_150107_model_12 0.75 0.60 0.73 0.35 0.45 0.21
66 GBM_grid_1_AutoML_4_20240529_150107_model_4 0.74 0.60 0.71 0.35 0.46 0.21
[66 rows x 7 columns]
models_regression_predictions_r <- h2o.predict(models_regression_r, beer_test_h2o_r)
head(models_regression_predictions_r)
predict FALSE TRUE
1 FALSE 0.992 0.0076
2 FALSE 0.940 0.0596
3 TRUE 0.315 0.6848
4 FALSE 0.946 0.0538
5 FALSE 0.979 0.0212
6 TRUE 0.041 0.9590
models_regression_performance_r <- h2o.performance(models_regression_r@leader, beer_test_h2o_r)
models_regression_performance_r
H2OBinomialMetrics: deeplearning
MSE: 0.15
RMSE: 0.39
LogLoss: 0.46
Mean Per-Class Error: 0.19
AUC: 0.87
AUCPR: 0.85
Gini: 0.75
Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
FALSE TRUE Error Rate
FALSE 183 62 0.253061 =62/245
TRUE 29 186 0.134884 =29/215
Totals 212 248 0.197826 =91/460
Maximum Metrics: Maximum metrics at their respective thresholds
metric threshold value idx
1 max f1 0.498087 0.803456 213
2 max f2 0.236614 0.867521 273
3 max f0point5 0.803307 0.807382 133
4 max accuracy 0.546356 0.804348 200
5 max precision 0.989163 1.000000 0
6 max recall 0.039902 1.000000 351
7 max specificity 0.989163 1.000000 0
8 max absolute_mcc 0.498087 0.612631 213
9 max min_per_class_accuracy 0.589536 0.795349 186
10 max mean_per_class_accuracy 0.546356 0.806360 200
11 max tns 0.989163 245.000000 0
12 max fns 0.989163 214.000000 0
13 max fps 0.004281 245.000000 399
14 max tps 0.039902 215.000000 351
15 max tnr 0.989163 1.000000 0
16 max fnr 0.989163 0.995349 0
17 max fpr 0.004281 1.000000 399
18 max tpr 0.039902 1.000000 351
Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
Appendix A: Environment, Language & Package Versions, and Coding Style
If you are interested in reproducing this work, here are the versions of R and Python that I used (as well as the respective packages for each). Additionally, my coding style here is intentionally verbose, so it is easy to trace where functions/methods and variables originate, and to make this a learning experience for everyone, including me. Finally, the data visualizations are mostly (if not entirely) implemented using the Grammar of Graphics framework.
cat(
R.version$version.string, "-", R.version$nickname,
"\nOS:", Sys.info()["sysname"], R.version$platform,
"\nCPU:", benchmarkme::get_cpu()$no_of_cores, "x", benchmarkme::get_cpu()$model_name
)
R version 4.2.3 (2023-03-15) - Shortstop Beagle
OS: Darwin x86_64-apple-darwin17.0
CPU: 8 x Intel(R) Core(TM) i5-8259U CPU @ 2.30GHz
require(devtools)
devtools::install_version("dplyr", version="1.1.4", repos="http://cran.us.r-project.org")
devtools::install_version("ggplot2", version="3.5.0", repos="http://cran.us.r-project.org")
devtools::install_version("h2o", version="3.44.0.3", repos="http://cran.us.r-project.org")
library(package=dplyr)
library(package=ggplot2)
library(package=h2o)
import sys
import platform
import os
import cpuinfo
print(
"Python", sys.version,
"\nOS:", platform.system(), platform.platform(),
"\nCPU:", os.cpu_count(), "x", cpuinfo.get_cpu_info()["brand_raw"]
)
Python 3.11.4 (v3.11.4:d2340ef257, Jun 6 2023, 19:15:51) [Clang 13.0.0 (clang-1300.0.29.30)]
OS: Darwin macOS-10.16-x86_64-i386-64bit
CPU: 8 x Intel(R) Core(TM) i5-8259U CPU @ 2.30GHz
!pip install numpy==1.25.1
!pip install pandas==2.0.3
!pip install scipy==1.11.1
!pip install h2o==3.46.0.2
import numpy
import pandas
from scipy import stats
import h2o
Appendix B: H2O.ai Initiation
# Start the H2O cluster (locally)
h2o.init()
Connection successful!
R is connected to the H2O cluster:
H2O cluster uptime: 32 minutes 41 seconds
H2O cluster timezone: America/Denver
H2O data parsing timezone: UTC
H2O cluster version: 3.44.0.3
H2O cluster version age: 5 months and 8 days
H2O cluster name: H2O_started_from_R_michael_vdz997
H2O cluster total nodes: 1
H2O cluster total memory: 1.85 GB
H2O cluster total cores: 8
H2O cluster allowed cores: 8
H2O cluster healthy: TRUE
H2O Connection ip: localhost
H2O Connection port: 54321
H2O Connection proxy: NA
H2O Internal Security: FALSE
R Version: R version 4.2.3 (2023-03-15)
# Start the H2O cluster (locally)
h2o.init()
Checking whether there is an H2O instance running at http://localhost:54321..... not found.
Attempting to start a local H2O server...
Java Version: java version "21.0.2" 2024-01-16 LTS; Java(TM) SE Runtime Environment (build 21.0.2+13-LTS-58); Java HotSpot(TM) 64-Bit Server VM (build 21.0.2+13-LTS-58, mixed mode, sharing)
Starting server from /Volumes/Personal/Mami/__Netlify/hello@michaelmallari.com/www.michaelmallari.com/pythonenv/v3.11.4/lib/python3.11/site-packages/h2o/backend/bin/h2o.jar
Ice root: /var/folders/b8/074_xp1n1kdcp6m3ljdf5shh0000gn/T/tmp9r0fuzbf
JVM stdout: /var/folders/b8/074_xp1n1kdcp6m3ljdf5shh0000gn/T/tmp9r0fuzbf/h2o_michael_started_from_python.out
JVM stderr: /var/folders/b8/074_xp1n1kdcp6m3ljdf5shh0000gn/T/tmp9r0fuzbf/h2o_michael_started_from_python.err
Server is running at http://127.0.0.1:54325
Connecting to H2O server at http://127.0.0.1:54325 ... successful.
-------------------------- ------------------------------
H2O_cluster_uptime: 03 secs
H2O_cluster_timezone: America/Denver
H2O_data_parsing_timezone: UTC
H2O_cluster_version: 3.46.0.2
H2O_cluster_version_age: 15 days
H2O_cluster_name: H2O_from_python_michael_tuyuco
H2O_cluster_total_nodes: 1
H2O_cluster_free_memory: 1.983 Gb
H2O_cluster_total_cores: 8
H2O_cluster_allowed_cores: 8
H2O_cluster_status: locked, healthy
H2O_connection_url: http://127.0.0.1:54325
H2O_connection_proxy: {"http": null, "https": null}
H2O_internal_security: False
Python_version: 3.11.4 final
-------------------------- ------------------------------
Appendix C: A Case for H2O.ai for AutoML
Advantages
- H2O.ai provides an easy-to-use interface that automates the end-to-end data science process.
- It has efficient AutoML capabilities, making machine learning more accessible and saving time in model development.
- H2O.ai is scalable, accommodating various data volumes for businesses.
- It provides a “leaderboard” of H2O models, any of which can be easily exported for use in production (as sketched below).
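A minimal sketch of that export step for this project’s classification leader (the destination path is an assumption):
# Save the leader as a binary H2O model; h2o.download_mojo() is an alternative
# when a portable, production-grade MOJO artifact is preferred
h2o.saveModel(object=models_classification_r@leader, path="./models", force=TRUE)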
Disadvantages
- One downside of H2O.ai is that it can be difficult to determine who is accountable for the decisions made by automated machine learning models. This could be a concern in regulated industries like financial services or healthcare.
Further Readings
- H2O AutoML: Automatic Machine Learning. (n.d.). H2O 3.46.0.2 documentation. https://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html