Scaled & Efficient Supervised Learning with AutoML
Accelerating time-to-value by automating modeling tasks on beer consumer data—using H2O.ai in R and Python.
Continuing the binary classification project hypothetically chartered by the Blue Moon Brewing Company 12 years ago, this project is a longitudinal study to re-validate the relevance of Blue Moon’s STP (Segmentation, Targeting, Positioning) marketing strategy on today’s evolving beer consumers.
The objective of this data analysis is the same: to infer whether demographic data around gender, age, marital status, and income continue to indicate a consumer preference for light beer. To achieve this, I collected survey data from 1,500 beer consumers. Employing automated machine learning (AutoML), this project explores a broad set of alternative classification methods, beyond the baseline logistic regression applied in the prior project.
Data Understanding
For data understanding, I imported a CSV file with 1,500 records and 5 columns. These columns are gender (0 for female, 1 for male), marital status (0 for unmarried, 1 for married), income, age, and beer preference (0 for regular, 1 for light). This initial analysis is critical for understanding the dataset’s structure and preparing for subsequent data exploration and modeling.
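As a minimal sketch of this import step (the file name beer.csv is an assumption for illustration; substitute the actual path):
# Import the survey data into an R data frame
beer_r <- read.csv(file="beer.csv")

# Inspect structure: 1,500 observations across 5 variables
str(beer_r)
dim(beer_r)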
Exploratory Data Analysis
Exploratory Data Analysis (EDA) is essential for understanding single variables (univariate), pairs (bivariate), and multiple (multivariate) interactions. It reveals trends, patterns, and anomalies, informing subsequent analysis and hypothesis development.
Univariate Analysis
Univariate analysis summarizes and identifies patterns in individual variables. It informs subsequent analysis, revealing insights into distribution, central tendency, and variability. By examining one variable at a time, it detects outliers, assesses data quality, and sets the stage for more complex bivariate and multivariate analyses.
summary(beer_r)
gender married income age prefer_light
Min. :0.00 Min. :0.00 Min. :24796 Min. :21 Min. :0.0
1st Qu.:0.00 1st Qu.:0.00 1st Qu.:46279 1st Qu.:49 1st Qu.:0.0
Median :0.00 Median :0.00 Median :52306 Median :56 Median :0.5
Mean :0.43 Mean :0.45 Mean :52561 Mean :55 Mean :0.5
3rd Qu.:1.00 3rd Qu.:1.00 3rd Qu.:58955 3rd Qu.:62 3rd Qu.:1.0
Max. :1.00 Max. :1.00 Max. :84031 Max. :87 Max. :1.0
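Beyond the numerical summary, a univariate distribution can be visualized with ggplot2 (Grammar of Graphics). A minimal sketch for income, assuming beer_r is loaded as above:
# Histogram of income to inspect shape, skew, and potential outliers
ggplot2::ggplot(data=beer_r, mapping=ggplot2::aes(x=income)) +
  ggplot2::geom_histogram(bins=30, fill="steelblue", color="white") +
  ggplot2::labs(title="Distribution of Income", x="Income (USD)", y="Count")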
Bivariate Analysis
Bivariate analysis examines the relationship between two variables at a time—for example, how income or age relates to the preference for light beer—bridging the univariate summaries above and the multivariate view below.
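A minimal sketch of one such bivariate view, using ggplot2 and assuming beer_r is loaded as above (the 0/1 coding of prefer_light is treated as a factor here for plotting only):
# Boxplot of income by beer preference (0 = regular, 1 = light)
ggplot2::ggplot(data=beer_r, mapping=ggplot2::aes(x=factor(prefer_light), y=income)) +
  ggplot2::geom_boxplot(fill="steelblue") +
  ggplot2::labs(title="Income by Beer Preference", x="Prefers Light Beer", y="Income (USD)")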
Multivariate Analysis
Multivariate analysis plays a pivotal role in understanding intricate relationships within datasets. By exploring patterns, identifying outliers, and revealing the underlying structure, it provides valuable insights. One essential tool in multivariate analysis is the correlation matrix, which quantifies the strength and direction of relationships between variables. Positive values indicate direct associations, while negative values imply inverse relationships. Leveraging insights from the correlation matrix, we can make informed decisions about feature selection, hypothesis testing, and model building. It also makes detecting multicollinearity and identifying potential predictors more effective.
correlation_matrix_r <- round(cor(beer_r), 2)
head(correlation_matrix_r[5:1, 1:5])
gender married income age prefer_light
prefer_light -0.09 0.05 0.40 -0.41 1.00
age 0.19 0.22 0.11 1.00 -0.41
income 0.03 0.33 1.00 0.11 0.40
married -0.04 1.00 0.33 0.22 0.05
gender 1.00 -0.04 0.03 0.19 -0.09
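To make these pairwise relationships easier to scan, the same matrix can be rendered as a heatmap with ggplot2. A minimal sketch, assuming correlation_matrix_r from above:
# Reshape the correlation matrix to long format for plotting
correlation_long_r <- as.data.frame(as.table(correlation_matrix_r))
colnames(correlation_long_r) <- c("var_x", "var_y", "correlation")

# Heatmap of pairwise correlations, annotated with the rounded values
ggplot2::ggplot(data=correlation_long_r, mapping=ggplot2::aes(x=var_x, y=var_y, fill=correlation)) +
  ggplot2::geom_tile() +
  ggplot2::geom_text(mapping=ggplot2::aes(label=correlation)) +
  ggplot2::scale_fill_gradient2(low="steelblue", mid="white", high="firebrick", limits=c(-1, 1)) +
  ggplot2::labs(title="Correlation Matrix of Beer Consumer Variables", x=NULL, y=NULL)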
Data Preparation
Data Frame Conversion to H2O Data Frame
# Recode the 0/1 columns into more readable types before handing off to H2O
beer_r$gender <- factor(beer_r$gender, levels=c(0, 1), labels=c("Female", "Male"))
beer_r$married <- as.logical(beer_r$married)
beer_r$prefer_light <- as.logical(beer_r$prefer_light)

# Start a local H2O cluster and convert the R data frame to an H2O frame
local_h2o <- h2o.init()
beer_h2o_r <- as.h2o(beer_r)
dim(beer_h2o_r)
[1] 1500 5
head(beer_h2o_r)
gender married income age prefer_light
1 Male TRUE 35885 48 FALSE
2 Female FALSE 37737 66 FALSE
3 Female FALSE 26388 62 FALSE
4 Female TRUE 43483 61 FALSE
5 Female FALSE 38079 54 FALSE
6 Female FALSE 44328 41 TRUE
Train & Test Data Splitting
# Split into ~70% training and ~30% test sets (seeded for reproducibility)
beer_splits_h2o_r <- h2o.splitFrame(data=beer_h2o_r, ratios=0.7, seed=1754) #RoarLionRoar 🦁
beer_train_h2o_r <- beer_splits_h2o_r[[1]]
beer_test_h2o_r <- beer_splits_h2o_r[[2]]
dim(beer_train_h2o_r)
[1] 1040 5
head(beer_train_h2o_r)
gender married income age prefer_light
1 Male TRUE 35885 48 FALSE
2 Female FALSE 37737 66 FALSE
3 Female TRUE 43483 61 FALSE
4 Female FALSE 44328 41 TRUE
5 Female TRUE 40865 64 FALSE
6 Female FALSE 54499 45 TRUE
dim(beer_test_h2o_r)
[1] 460 5
head(beer_test_h2o_r)
gender married income age prefer_light
1 Female FALSE 26388 62 FALSE
2 Female FALSE 38079 54 FALSE
3 Male TRUE 62118 62 TRUE
4 Male TRUE 67201 85 FALSE
5 Female FALSE 40382 62 FALSE
6 Male TRUE 54339 38 TRUE
Data Modeling
AutoML Classification Models: Training
# Predictor and response columns for the classification task
models_classification_predictors_r <- c("gender", "married", "income", "age")
models_classification_response_r <- "prefer_light"

# Train up to 12 models (stacked ensembles are added on top) with AutoML
models_classification_r <- h2o.automl(
  x=models_classification_predictors_r,
  y=models_classification_response_r,
  training_frame=beer_train_h2o_r,
  max_models=12,
  seed=1754 #RoarLionRoar 🦁
)
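Once training finishes, the top-ranked model can be pulled out directly from the AutoML object; a brief sketch:
# The leader is the best model on the leaderboard's default sort metric (AUC here)
models_classification_leader_r <- models_classification_r@leader
models_classification_leader_r@model_id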
AutoML Regression Models: Training
Suppose, within its brewery & restaurant in the RiNo district of Denver, Blue Moon seeks to optimize upselling opportunities by predicting a patron’s income. A numerical prediction can be made from the remaining data points: gender, marital status, age, and preference for light beer. Using AutoML, I can perform this regression task efficiently and accurately, going beyond a baseline linear regression.
# Predictor and response columns for the regression task
models_regression_predictors_r <- c("gender", "married", "age", "prefer_light")
models_regression_response_r <- "income"

# Train as many models as fit within 30 seconds of runtime
models_regression_r <- h2o.automl(
  x=models_regression_predictors_r,
  y=models_regression_response_r,
  training_frame=beer_train_h2o_r,
  leaderboard_frame=beer_test_h2o_r,
  max_runtime_secs=30,
  seed=1754 #RoarLionRoar 🦁
)
Model Evaluation
AutoML Classification Models
# print(models_classification_r@leaderboard, n=nrow(models_classification_r@leaderboard))
h2o.get_leaderboard(object=models_classification_r, extra_columns="ALL")
model_id auc logloss aucpr mean_per_class_error rmse mse training_time_ms predict_time_per_row_ms algo
1 StackedEnsemble_BestOfFamily_1_AutoML_3_20240529_150054 0.85 0.47 0.85 0.23 0.39 0.16 944 0.0196 StackedEnsemble
2 StackedEnsemble_AllModels_1_AutoML_3_20240529_150054 0.85 0.47 0.85 0.23 0.39 0.16 946 0.0194 StackedEnsemble
3 GLM_1_AutoML_3_20240529_150054 0.85 0.47 0.85 0.22 0.39 0.16 30 0.0033 GLM
4 GBM_1_AutoML_3_20240529_150054 0.85 0.49 0.84 0.24 0.40 0.16 406 0.0095 GBM
5 DeepLearning_1_AutoML_3_20240529_150054 0.85 0.49 0.85 0.26 0.40 0.16 53 0.0050 DeepLearning
6 XGBoost_1_AutoML_3_20240529_150054 0.84 0.49 0.84 0.23 0.40 0.16 124 0.0041 XGBoost
[14 rows x 10 columns]
models_classification_predictions_r <- h2o.predict(models_classification_r, beer_test_h2o_r)
head(models_classification_predictions_r)
predict FALSE TRUE
1 FALSE 0.987 0.013
2 FALSE 0.878 0.122
3 TRUE 0.366 0.634
4 FALSE 0.884 0.116
5 FALSE 0.945 0.055
6 TRUE 0.082 0.918
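To sanity-check the leader against the held-out labels, the predictions can be pulled back into R; a minimal sketch (this simple accuracy calculation is an illustration, not part of the original analysis):
# Bring the H2O prediction frame and test frame back into R
models_classification_predictions_df_r <- as.data.frame(models_classification_predictions_r)
beer_test_df_r <- as.data.frame(beer_test_h2o_r)

# Proportion of test rows where the predicted class matches the actual preference
mean(
  as.character(models_classification_predictions_df_r$predict) ==
  as.character(beer_test_df_r$prefer_light)
)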
models_classification_performance_r <- h2o.performance(models_classification_r@leader, beer_test_h2o_r)
models_classification_performance_r
H2OBinomialMetrics: stackedensemble
MSE: 0.15
RMSE: 0.38
LogLoss: 0.45
Mean Per-Class Error: 0.19
AUC: 0.87
AUCPR: 0.84
Gini: 0.74
Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
FALSE TRUE Error Rate
FALSE 188 57 0.232653 =57/245
TRUE 32 183 0.148837 =32/215
Totals 220 240 0.193478 =89/460
Maximum Metrics: Maximum metrics at their respective thresholds
metric threshold value idx
1 max f1 0.461769 0.804396 206
2 max f2 0.262300 0.863986 269
3 max f0point5 0.740746 0.794224 129
4 max accuracy 0.461769 0.806522 206
5 max precision 0.996241 1.000000 0
6 max recall 0.102316 1.000000 343
7 max specificity 0.996241 1.000000 0
8 max absolute_mcc 0.461769 0.617777 206
9 max min_per_class_accuracy 0.506297 0.795349 188
10 max mean_per_class_accuracy 0.461769 0.809255 206
11 max tns 0.996241 245.000000 0
12 max fns 0.996241 214.000000 0
13 max fps 0.002607 245.000000 399
14 max tps 0.102316 215.000000 343
15 max tnr 0.996241 1.000000 0
16 max fnr 0.996241 0.995349 0
17 max fpr 0.002607 1.000000 399
18 max tpr 0.102316 1.000000 343
Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
AutoML Regression Models
print(models_regression_r@leaderboard, n=nrow(models_regression_r@leaderboard))
model_id auc logloss aucpr mean_per_class_error rmse mse
1 DeepLearning_grid_3_AutoML_4_20240529_150107_model_3 0.87 0.46 0.85 0.19 0.39 0.15
2 GLM_1_AutoML_4_20240529_150107 0.87 0.45 0.85 0.20 0.38 0.15
3 DeepLearning_grid_1_AutoML_4_20240529_150107_model_4 0.87 0.45 0.85 0.19 0.38 0.15
4 DeepLearning_grid_1_AutoML_4_20240529_150107_model_3 0.87 0.44 0.85 0.20 0.38 0.14
5 DeepLearning_grid_1_AutoML_4_20240529_150107_model_5 0.87 0.48 0.85 0.20 0.39 0.15
6 DeepLearning_grid_2_AutoML_4_20240529_150107_model_3 0.87 0.45 0.85 0.20 0.38 0.14
7 DeepLearning_grid_1_AutoML_4_20240529_150107_model_1 0.87 0.46 0.85 0.20 0.38 0.15
8 DeepLearning_grid_2_AutoML_4_20240529_150107_model_1 0.87 0.48 0.84 0.20 0.40 0.16
9 StackedEnsemble_BestOfFamily_4_AutoML_4_20240529_150107 0.87 0.46 0.85 0.19 0.39 0.15
10 StackedEnsemble_BestOfFamily_3_AutoML_4_20240529_150107 0.87 0.46 0.84 0.19 0.39 0.15
11 StackedEnsemble_BestOfFamily_5_AutoML_4_20240529_150107 0.87 0.46 0.85 0.19 0.38 0.15
12 DeepLearning_grid_3_AutoML_4_20240529_150107_model_2 0.87 0.47 0.84 0.20 0.39 0.15
13 StackedEnsemble_BestOfFamily_2_AutoML_4_20240529_150107 0.87 0.46 0.84 0.19 0.39 0.15
14 StackedEnsemble_BestOfFamily_1_AutoML_4_20240529_150107 0.87 0.46 0.84 0.19 0.39 0.15
15 StackedEnsemble_AllModels_1_AutoML_4_20240529_150107 0.87 0.46 0.84 0.19 0.39 0.15
16 StackedEnsemble_AllModels_2_AutoML_4_20240529_150107 0.87 0.46 0.84 0.19 0.39 0.15
17 DeepLearning_grid_2_AutoML_4_20240529_150107_model_2 0.87 0.52 0.84 0.21 0.41 0.17
18 DeepLearning_1_AutoML_4_20240529_150107 0.87 0.46 0.84 0.19 0.39 0.15
19 StackedEnsemble_AllModels_4_AutoML_4_20240529_150107 0.87 0.46 0.84 0.19 0.39 0.15
20 StackedEnsemble_AllModels_3_AutoML_4_20240529_150107 0.87 0.46 0.84 0.19 0.39 0.15
21 DeepLearning_grid_1_AutoML_4_20240529_150107_model_2 0.87 0.45 0.84 0.21 0.38 0.15
22 DeepLearning_grid_3_AutoML_4_20240529_150107_model_1 0.87 0.46 0.83 0.20 0.39 0.15
23 XGBoost_grid_1_AutoML_4_20240529_150107_model_14 0.86 0.47 0.81 0.21 0.39 0.15
24 XGBoost_grid_1_AutoML_4_20240529_150107_model_6 0.86 0.47 0.82 0.20 0.39 0.15
25 GBM_grid_1_AutoML_4_20240529_150107_model_19 0.86 0.48 0.82 0.21 0.40 0.16
26 GBM_grid_1_AutoML_4_20240529_150107_model_17 0.86 0.47 0.83 0.21 0.40 0.16
27 GBM_1_AutoML_4_20240529_150107 0.86 0.48 0.81 0.21 0.40 0.16
28 XGBoost_1_AutoML_4_20240529_150107 0.85 0.48 0.80 0.20 0.40 0.16
29 GBM_grid_1_AutoML_4_20240529_150107_model_13 0.85 0.49 0.81 0.21 0.40 0.16
30 XGBoost_grid_1_AutoML_4_20240529_150107_model_3 0.85 0.48 0.81 0.22 0.40 0.16
31 XGBoost_grid_1_AutoML_4_20240529_150107_model_7 0.85 0.48 0.79 0.21 0.39 0.16
32 XGBoost_grid_1_AutoML_4_20240529_150107_model_2 0.85 0.49 0.80 0.22 0.40 0.16
33 XGBoost_grid_1_AutoML_4_20240529_150107_model_1 0.85 0.49 0.81 0.22 0.40 0.16
34 GBM_grid_1_AutoML_4_20240529_150107_model_6 0.85 0.49 0.79 0.22 0.40 0.16
35 XGBoost_grid_1_AutoML_4_20240529_150107_model_10 0.85 0.49 0.82 0.22 0.40 0.16
36 XGBoost_grid_1_AutoML_4_20240529_150107_model_11 0.85 0.49 0.80 0.23 0.40 0.16
37 XGBoost_2_AutoML_4_20240529_150107 0.85 0.49 0.80 0.21 0.40 0.16
38 GBM_grid_1_AutoML_4_20240529_150107_model_2 0.85 0.49 0.81 0.23 0.40 0.16
39 XGBoost_grid_1_AutoML_4_20240529_150107_model_15 0.85 0.49 0.80 0.21 0.40 0.16
40 GBM_grid_1_AutoML_4_20240529_150107_model_10 0.85 0.49 0.79 0.23 0.40 0.16
41 XGBoost_3_AutoML_4_20240529_150107 0.85 0.49 0.81 0.22 0.40 0.16
42 GBM_grid_1_AutoML_4_20240529_150107_model_11 0.84 0.49 0.80 0.23 0.40 0.16
43 XGBoost_grid_1_AutoML_4_20240529_150107_model_13 0.84 0.50 0.80 0.21 0.40 0.16
44 XGBoost_grid_1_AutoML_4_20240529_150107_model_8 0.84 0.50 0.80 0.24 0.41 0.17
45 GBM_grid_1_AutoML_4_20240529_150107_model_5 0.84 0.50 0.81 0.22 0.40 0.16
46 XGBoost_grid_1_AutoML_4_20240529_150107_model_4 0.84 0.50 0.80 0.25 0.40 0.16
47 GBM_grid_1_AutoML_4_20240529_150107_model_16 0.84 0.51 0.79 0.23 0.41 0.17
48 GBM_2_AutoML_4_20240529_150107 0.84 0.50 0.80 0.22 0.41 0.16
49 GBM_grid_1_AutoML_4_20240529_150107_model_8 0.84 0.50 0.80 0.25 0.41 0.17
50 GBM_4_AutoML_4_20240529_150107 0.84 0.50 0.80 0.22 0.41 0.17
51 GBM_grid_1_AutoML_4_20240529_150107_model_15 0.83 0.52 0.81 0.25 0.41 0.17
52 GBM_5_AutoML_4_20240529_150107 0.83 0.51 0.79 0.22 0.41 0.17
53 GBM_3_AutoML_4_20240529_150107 0.83 0.51 0.79 0.22 0.41 0.17
54 GBM_grid_1_AutoML_4_20240529_150107_model_7 0.83 0.53 0.79 0.25 0.42 0.18
55 XRT_1_AutoML_4_20240529_150107 0.83 0.52 0.79 0.23 0.41 0.17
56 XGBoost_grid_1_AutoML_4_20240529_150107_model_9 0.83 0.55 0.81 0.26 0.42 0.18
57 GBM_grid_1_AutoML_4_20240529_150107_model_3 0.83 0.52 0.79 0.23 0.42 0.17
58 XGBoost_grid_1_AutoML_4_20240529_150107_model_12 0.82 0.55 0.80 0.25 0.43 0.18
59 GBM_grid_1_AutoML_4_20240529_150107_model_14 0.82 0.54 0.77 0.25 0.42 0.18
60 DRF_1_AutoML_4_20240529_150107 0.81 0.56 0.79 0.26 0.43 0.18
61 XGBoost_grid_1_AutoML_4_20240529_150107_model_5 0.81 0.64 0.79 0.25 0.44 0.20
62 GBM_grid_1_AutoML_4_20240529_150107_model_18 0.81 0.54 0.79 0.25 0.42 0.18
63 GBM_grid_1_AutoML_4_20240529_150107_model_9 0.81 0.54 0.78 0.29 0.43 0.18
64 GBM_grid_1_AutoML_4_20240529_150107_model_1 0.81 0.54 0.78 0.26 0.43 0.18
65 GBM_grid_1_AutoML_4_20240529_150107_model_12 0.75 0.60 0.73 0.35 0.45 0.21
66 GBM_grid_1_AutoML_4_20240529_150107_model_4 0.74 0.60 0.71 0.35 0.46 0.21
[66 rows x 7 columns]
models_regression_predictions_r <- h2o.predict(models_regression_r, beer_test_h2o_r)
head(models_regression_predictions_r)
predict FALSE TRUE
1 FALSE 0.992 0.0076
2 FALSE 0.940 0.0596
3 TRUE 0.315 0.6848
4 FALSE 0.946 0.0538
5 FALSE 0.979 0.0212
6 TRUE 0.041 0.9590
models_regression_performance_r <- h2o.performance(models_regression_r@leader, beer_test_h2o_r)
models_regression_performance_r
H2OBinomialMetrics: deeplearning
MSE: 0.15
RMSE: 0.39
LogLoss: 0.46
Mean Per-Class Error: 0.19
AUC: 0.87
AUCPR: 0.85
Gini: 0.75
Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
FALSE TRUE Error Rate
FALSE 183 62 0.253061 =62/245
TRUE 29 186 0.134884 =29/215
Totals 212 248 0.197826 =91/460
Maximum Metrics: Maximum metrics at their respective thresholds
metric threshold value idx
1 max f1 0.498087 0.803456 213
2 max f2 0.236614 0.867521 273
3 max f0point5 0.803307 0.807382 133
4 max accuracy 0.546356 0.804348 200
5 max precision 0.989163 1.000000 0
6 max recall 0.039902 1.000000 351
7 max specificity 0.989163 1.000000 0
8 max absolute_mcc 0.498087 0.612631 213
9 max min_per_class_accuracy 0.589536 0.795349 186
10 max mean_per_class_accuracy 0.546356 0.806360 200
11 max tns 0.989163 245.000000 0
12 max fns 0.989163 214.000000 0
13 max fps 0.004281 245.000000 399
14 max tps 0.039902 215.000000 351
15 max tnr 0.989163 1.000000 0
16 max fnr 0.989163 0.995349 0
17 max fpr 0.004281 1.000000 399
18 max tpr 0.039902 1.000000 351
Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
Appendix A: Environment, Language & Package Versions, and Coding Style
If you are interested in reproducing this work, here are the versions of R and Python that I used (as well as the respective packages for each). Additionally, my coding style here is intentionally verbose, so it is easy to trace where functions/methods and variables originate, and to make this a learning experience for everyone, including me. Finally, the data visualizations are mostly (if not entirely) implemented using the Grammar of Graphics framework.
cat(
R.version$version.string, "-", R.version$nickname,
"\nOS:", Sys.info()["sysname"], R.version$platform,
"\nCPU:", benchmarkme::get_cpu()$no_of_cores, "x", benchmarkme::get_cpu()$model_name
)
R version 4.2.3 (2023-03-15) - Shortstop Beagle
OS: Darwin x86_64-apple-darwin17.0
CPU: 8 x Intel(R) Core(TM) i5-8259U CPU @ 2.30GHz
require(devtools)
devtools::install_version("dplyr", version="1.1.4", repos="http://cran.us.r-project.org")
devtools::install_version("ggplot2", version="3.5.0", repos="http://cran.us.r-project.org")
devtools::install_version("h2o", version="3.44.0.3", repos="http://cran.us.r-project.org")
library(package=dplyr)
library(package=ggplot2)
library(package=h2o)
import sys
import platform
import os
import cpuinfo
print(
"Python", sys.version,
"\nOS:", platform.system(), platform.platform(),
"\nCPU:", os.cpu_count(), "x", cpuinfo.get_cpu_info()["brand_raw"]
)
Python 3.11.4 (v3.11.4:d2340ef257, Jun 6 2023, 19:15:51) [Clang 13.0.0 (clang-1300.0.29.30)]
OS: Darwin macOS-10.16-x86_64-i386-64bit
CPU: 8 x Intel(R) Core(TM) i5-8259U CPU @ 2.30GHz
!pip install numpy==1.25.1
!pip install pandas==2.0.3
!pip install scipy==1.11.1
!pip install h2o==3.46.0.2
import numpy
import pandas
from scipy import stats
import h2o
Appendix B: H2O.ai Initiation
# Start the H2O cluster (locally)
h2o.init()
Connection successful!
R is connected to the H2O cluster:
H2O cluster uptime: 32 minutes 41 seconds
H2O cluster timezone: America/Denver
H2O data parsing timezone: UTC
H2O cluster version: 3.44.0.3
H2O cluster version age: 5 months and 8 days
H2O cluster name: H2O_started_from_R_michael_vdz997
H2O cluster total nodes: 1
H2O cluster total memory: 1.85 GB
H2O cluster total cores: 8
H2O cluster allowed cores: 8
H2O cluster healthy: TRUE
H2O Connection ip: localhost
H2O Connection port: 54321
H2O Connection proxy: NA
H2O Internal Security: FALSE
R Version: R version 4.2.3 (2023-03-15)
# Start the H2O cluster (locally)
h2o.init()
Checking whether there is an H2O instance running at http://localhost:54321..... not found.
Attempting to start a local H2O server...
Java Version: java version "21.0.2" 2024-01-16 LTS; Java(TM) SE Runtime Environment (build 21.0.2+13-LTS-58); Java HotSpot(TM) 64-Bit Server VM (build 21.0.2+13-LTS-58, mixed mode, sharing)
Starting server from /Volumes/Personal/Mami/__Netlify/hello@michaelmallari.com/www.michaelmallari.com/pythonenv/v3.11.4/lib/python3.11/site-packages/h2o/backend/bin/h2o.jar
Ice root: /var/folders/b8/074_xp1n1kdcp6m3ljdf5shh0000gn/T/tmp9r0fuzbf
JVM stdout: /var/folders/b8/074_xp1n1kdcp6m3ljdf5shh0000gn/T/tmp9r0fuzbf/h2o_michael_started_from_python.out
JVM stderr: /var/folders/b8/074_xp1n1kdcp6m3ljdf5shh0000gn/T/tmp9r0fuzbf/h2o_michael_started_from_python.err
Server is running at http://127.0.0.1:54325
Connecting to H2O server at http://127.0.0.1:54325 ... successful.
-------------------------- ------------------------------
H2O_cluster_uptime: 03 secs
H2O_cluster_timezone: America/Denver
H2O_data_parsing_timezone: UTC
H2O_cluster_version: 3.46.0.2
H2O_cluster_version_age: 15 days
H2O_cluster_name: H2O_from_python_michael_tuyuco
H2O_cluster_total_nodes: 1
H2O_cluster_free_memory: 1.983 Gb
H2O_cluster_total_cores: 8
H2O_cluster_allowed_cores: 8
H2O_cluster_status: locked, healthy
H2O_connection_url: http://127.0.0.1:54325
H2O_connection_proxy: {"http": null, "https": null}
H2O_internal_security: False
Python_version: 3.11.4 final
-------------------------- ------------------------------
Appendix C: A Case for H2O.ai for AutoML
Advantages
- H2O.ai provides an easy-to-use interface that automates the end-to-end data science process.
- It has efficient AutoML capabilities, making machine learning more accessible and saving time in model development.
- H2O.ai is scalable, accommodating various data volumes for businesses.
- It provides a “leaderboard” of H2O models, any of which can be easily exported for use in production (as sketched below).
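A minimal sketch of that export step for this project’s classification leader (the destination path is an assumption):
# Save the leader as a binary H2O model; h2o.download_mojo() is an alternative
# when a portable, production-grade MOJO artifact is preferred
h2o.saveModel(object=models_classification_r@leader, path="./models", force=TRUE)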
Disadvantages
- One downside of H2O.ai is that it can be difficult to determine who is accountable for the decisions made by automated machine learning models. This could be a concern in regulated industries like financial services or healthcare.
Further Readings
- H2O AutoML: Automatic Machine Learning. (n.d.). H2O 3.46.0.2 documentation. https://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html