Data-Informed Thinking + Doing

Categorical Predictions Based on Probability

Inferring consumer demographics with preference for light beer—using logistic regression in R, Python, and Julia.

Suppose the Blue Moon Brewing Company has noticed a trend that suggests a shift in consumer preferences to light beer, which may be associated with specific demographic segments. To validate this, the business has initiated a customer segmentation study, collecting relevant data through customer surveys. The research has been designed to uncover patterns and correlations that can inform new product development and targeted marketing campaigns.

The objective of this data analysis is to infer whether demographic factors such as gender, age, marital status, and income significantly influence the preference for light beer compared to regular beer. This understanding will guide Blue Moon in optimizing its segmentation, target market, and positioning (STP) strategies to cater to the evolving tastes of their customer base.


Data Understanding

To understand the data, I’ve imported a CSV file containing 100 records and 5 columns. The columns represent gender (0 for female, 1 for male), marital status (0 for unmarried, 1 for married), income, age, and beer preference (0 for regular, 1 for light). This preliminary analysis is crucial for recognizing the dataset’s structure and preparing for further data exploration and modeling.

str(beer_r)
'data.frame':	100 obs. of  5 variables:
 $ gender      : int  0 1 1 1 1 1 1 1 0 1 ...
 $ married     : int  0 1 1 1 1 0 0 1 1 0 ...
 $ income      : int  31779 32739 24302 64709 41882 38990 22408 25440 30784 31916 ...
 $ age         : int  46 50 46 70 54 36 40 51 52 43 ...
 $ prefer_light: int  0 0 0 0 0 0 0 0 0 0 ...
beer_py.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   gender        100 non-null    int64
 1   married       100 non-null    int64
 2   income        100 non-null    int64
 3   age           100 non-null    int64
 4   prefer_light  100 non-null    int64
dtypes: int64(5)
memory usage: 4.0 KB
beer_py.head(n=8)
   gender  married  income  age  prefer_light
0       0        0   31779   46             0
1       1        1   32739   50             0
2       1        1   24302   46             0
3       1        1   64709   70             0
4       1        1   41882   54             0
5       1        0   38990   36             0
6       1        0   22408   40             0
7       1        1   25440   51             0
beer_py.tail(n=8)
    gender  married  income  age  prefer_light
92       0        1   46273   56             1
93       0        0   33088   22             1
94       0        0   41391   54             1
95       0        1   45133   51             1
96       0        1   39942   21             1
97       1        1   47495   35             1
98       1        1   70981   45             1
99       0        1   48259   32             1
beer_jl
100×5 DataFrame
 Row │ gender  married  income  age    prefer_light
     │ Int64   Int64    Int64   Int64  Int64
─────┼──────────────────────────────────────────────
   1 │      0        0   31779     46             0
   2 │      1        1   32739     50             0
   3 │      1        1   24302     46             0
   4 │      1        1   64709     70             0
   5 │      1        1   41882     54             0
   6 │      1        0   38990     36             0
  ⋮  │   ⋮        ⋮       ⋮       ⋮         ⋮
  96 │      0        1   45133     51             1
  97 │      0        1   39942     21             1
  98 │      1        1   47495     35             1
  99 │      1        1   70981     45             1
 100 │      0        1   48259     32             1
                                     89 rows omitted

Exploratory Data Analysis

Exploratory Data Analysis (EDA) is vital in providing a comprehensive view of single variables (univariate), pairs (bivariate), and multiple (multivariate) interactions. It uncovers trends, patterns, and anomalies, guiding further analysis and hypothesis formulation.

Univariate Analysis

Univariate analysis on individual variables summarizes and finds patterns in the data. It’s the foundation for all subsequent analysis, providing insights into distribution, central tendency, and variability. By examining one variable at a time, it helps identify outliers, understand data distribution, and assess data quality. This analysis is essential for setting the stage for more complex bivariate and multivariate analyses.

Numerical Variables

The summary statistics of income (a discrete number) suggest a right-skewed trend, as the mean is slightly higher than the median. Most incomes cluster around the lower to mid-range, with fewer high earners, indicated by the maximum value being significantly higher than the mean and median.

The summary statistics of age (a discrete number) indicate a moderately right-skewed distribution, with a mean slightly higher than the median. The data points are more concentrated in the middle age range, with a gradual decrease towards older ages, as reflected by the maximum age of 87.

# Income
AnalyzeUnivariate(vector=beer_r$income, categorical=FALSE)
[1] "Summary Statistics"
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  20945   33073   39646   40064   45776   70981 
# Age
AnalyzeUnivariate(vector=beer_r$age, categorical=FALSE)
[1] "Summary Statistics"
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
     21      36      43      44      51      87 
Categorical Variables
# Gender
AnalyzeUnivariate(vector=beer_r$gender, categorical=TRUE)
[1] "Distribution"
vector
 0  1 
43 57 
# Married
AnalyzeUnivariate(vector=beer_r$married, categorical=TRUE)
[1] "Distribution"
vector
 0  1 
45 55 
# Prefer Light
AnalyzeUnivariate(beer_r$prefer_light, categorical=TRUE)
[1] "Distribution"
vector
 0  1 
50 50 

Bivariate Analysis

Bivariate analysis is pivotal for identifying statistical relationships between two variables, especially between predictors and the target variable. It informs on tendencies and feature selection, while enhancing predictive model accuracy by clarifying how variables interact with each other.

# Gender vs. Prefer Light
# Age vs. Prefer Light
# Income vs. Prefer Light
# Age vs. Prefer Light

Multivariate Analysis

Multivariate analysis is crucial for grasping complex relationships within data. It enables the detection of patterns, outliers, and the intrinsic structure of data sets. A correlation matrix, a key multivariate analysis tool, reveals the strength and direction of relationships between variables. Positive values indicate a direct relationship, while negative values suggest an inverse one. Insights from a correlation matrix can guide feature selection, inform hypothesis testing, and enhance model accuracy by identifying multicollinearity or potential predictors for further analysis.

correlation_matrix_r <- round(cor(beer_r), 2)
head(correlation_matrix_r[5:1, 1:5])
             gender married income   age prefer_light
prefer_light  -0.14    0.10   0.51 -0.51         1.00
age            0.22    0.26   0.09  1.00        -0.51
income         0.05    0.40   1.00  0.09         0.51
married       -0.01    1.00   0.40  0.26         0.10
gender         1.00   -0.01   0.05  0.22        -0.14
ggcorrplot::ggcorrplot(correlation_matrix_r, outline.col=palette_michaelmallari_r[1], colors = c(palette_michaelmallari_r[3], palette_michaelmallari_r[1], palette_michaelmallari_r[2]), lab=TRUE)

ggplot2::ggplot(beer_r, aes(x=age, y=income, color=factor(prefer_light))) +
    ggplot2::geom_point() +
    ggplot2::stat_ellipse(type="t") +
    ggplot2::scale_y_continuous(breaks = c(15000, 30000,  45000, 60000, 75000), limits=c(15000, 75000), expand=c(0, 0), position="right") +
    ggplot2::scale_color_manual(values=c(palette_michaelmallari_r[3], palette_michaelmallari_r[2]), labels=c("Regular Beer", "Light Beer")) +
    ggplot2::labs(
        title="Responsible Drinking With Light Beer",
        subtitle="Age and income as proxies for level of responsibilities",
        x="Age",
        y="Income (USD)",
        color=""
    ) +
    theme_michaelmallari_r()


Data Modeling

Data Preparation: Split Data for Training & Testing

Splitting the dataset into training and testing sets is crucial to evaluate the model’s ability to generalize to new data. For a small dataset with 100 observations, an ideal split ratio is often 70:30 or 80:20 for training and testing, respectively. This ensures sufficient data for learning while retaining enough unique instances to assess performance accurately without overfitting.

set.seed(1856)  #FearTheTurtle 🐢
index_train <- createDataPartition(beer_r$prefer_light, p=0.7, list=FALSE)
beer_train_r <- beer_r[index_train, ]
beer_test_r  <- beer_r[-index_train, ]

Model Training

# All features
model_1 <- glm(formula=prefer_light ~ ., family="binomial", data=beer_train_r)
summary(model_1)

Call:
glm(formula = prefer_light ~ ., family = "binomial", data = beer_train_r)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.4408  -0.4498   0.0009   0.3410   2.0903  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)  0.0555189  2.1395814    0.03  0.97930    
gender      -0.6796768  0.8336018   -0.82  0.41487    
married      0.1062429  0.9507667    0.11  0.91103    
income       0.0002499  0.0000677    3.69  0.00022 ***
age         -0.2174801  0.0612084   -3.55  0.00038 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 97.041  on 69  degrees of freedom
Residual deviance: 41.988  on 65  degrees of freedom
AIC: 51.99

Number of Fisher Scoring iterations: 6
# Significance features only: income and age
model_2 <- glm(formula=prefer_light ~ income + age, family="binomial", data=beer_train_r)
summary(model_2)

Call:
glm(formula = prefer_light ~ income + age, family = "binomial", 
    data = beer_train_r)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.2352  -0.4191  -0.0043   0.3391   2.2329  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -0.1370452  1.9659827   -0.07  0.94443    
income       0.0002387  0.0000611    3.91 0.000093 ***
age         -0.2097528  0.0556144   -3.77  0.00016 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 97.041  on 69  degrees of freedom
Residual deviance: 42.753  on 67  degrees of freedom
AIC: 48.75

Number of Fisher Scoring iterations: 6
# Marital status, income, and age
model_3 <- glm(formula=prefer_light ~ married + income + age, family="binomial", data=beer_train_r)
summary(model_3)

Call:
glm(formula = prefer_light ~ married + income + age, family = "binomial", 
    data = beer_train_r)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.2747  -0.4176  -0.0044   0.3166   2.2517  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)  0.0954511  2.1404852    0.04  0.96443    
married      0.2586641  0.9342008    0.28  0.78187    
income       0.0002360  0.0000619    3.81  0.00014 ***
age         -0.2152624  0.0597939   -3.60  0.00032 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 97.041  on 69  degrees of freedom
Residual deviance: 42.676  on 66  degrees of freedom
AIC: 50.68

Number of Fisher Scoring iterations: 6

Model Evaluation

Accuracy is the preferred metric for evaluating our model’s performance in predicting beer preference because it reflects the overall correctness of predictions. Precision and recall are less relevant as they focus on positive predictions, which isn’t as critical when distinguishing between two equally important classes—regular and light beer. The F1 score, which balances precision and recall, is unnecessary for similar reasons. Specificity, which measures true negatives, is also less pertinent because the cost of false positives is not higher than false negatives in this context. Thus, accuracy provides a clear and balanced evaluation of the model’s predictive power.

Model 1: All Features

GetAccuracyCutoff(model=model_1, df=beer_test_r, target="prefer_light") +
    ggplot2::labs(
        title="Model 1: Maximum Predictive Accuracy at 93.33%",
        alt="Model 1: Maximum Predictive Accuracy at 93.33%",
        subtitle="When 58%+ probability = light beer preference",
        x="Predicted Probability Cut-off for Light Beer Preference",
        y="Model Accuracy",
        fill=NULL,
        caption=NULL
    )

GetEvalMetrics(model=model_1, df=beer_test_r, target="prefer_light", type="response", cutoff=0.58)
Confusion Matrix

         Prediction
Actual    Regular Light
  Regular      14     1
  Light         1    14

               TN    FP
               FN    TP

------------------

Evaluation Metrics

Accuracy = 93.33 %
Precision = 93.33 %
Recall/Sensitivity/True Positive Rate = 93.33 %
F1 Score = 93.33 %
Specificity/True Negative Rate = 93.33 %

Model 2: Significant Features Only (Income and Age)

GetAccuracyCutoff(model=model_2, df=beer_test_r, target="prefer_light") +
    ggplot2::labs(
        title="Model 2: Maximum Predictive Accuracy at 90.00%",
        alt="Model 2: Maximum Predictive Accuracy at 90.00%",
        subtitle="When 57%+ probability = light beer preference",
        x="Predicted Probability Cut-off for Light Beer Preference",
        y="Model Accuracy",
        fill=NULL,
        caption=NULL
    )

GetEvalMetrics(model=model_2, df=beer_test_r, target="prefer_light", type="response", cutoff=0.57)
Confusion Matrix

         Prediction
Actual    Regular Light
  Regular      14     1
  Light         2    13

               TN    FP
               FN    TP

------------------

Evaluation Metrics

Accuracy = 90.00 %
Precision = 92.86 %
Recall/Sensitivity/True Positive Rate = 86.67 %
F1 Score = 89.66 %
Specificity/True Negative Rate = 93.33 %

Model 3: Marital Status, Income, and Age

GetAccuracyCutoff(model=model_3, df=beer_test_r, target="prefer_light") +
    ggplot2::labs(
        title="Model 3: Maximum Predictive Accuracy at 90.00%",
        alt="Model 3: Maximum Predictive Accuracy at 90.00%",
        subtitle="When 50%+ probability = light beer preference",
        x="Predicted Probability Cut-off for Light Beer Preference",
        y="Model Accuracy",
        fill=NULL,
        caption=NULL
    )

GetEvalMetrics(model=model_3, df=beer_test_r, target="prefer_light", type="response", cutoff=0.50)
Confusion Matrix

         Prediction
Actual    Regular Light
  Regular      13     2
  Light         1    14

               TN    FP
               FN    TP

------------------

Evaluation Metrics

Accuracy = 90.00 %
Precision = 87.50 %
Recall/Sensitivity/True Positive Rate = 93.33 %
F1 Score = 90.32 %
Specificity/True Negative Rate = 86.67 %

Model Selection

Although I was primarily interested in evaluating the predictive models based on accuracy, model_1 outperformed model_2 and model_3 (on all selected performance metrics) across all evaluation metrics of interest—indicating more consistent and reliable predictions across all aspects of the binary classification performance.

While model_2 outperformed model_3 in predicting a preference for regular beer, model_3 was better in predicting a preference for light beer.


Business Understanding: Leveling-Up

In conclusion, this analysis has illuminated a clear preference for light beer among women and higher-income earners—the target market attributes for introducing light beer under the Blue Moon brand. By integrating this insight with a cluster analysis (for data-driven customer segmentation), we’ve empowered the business to strategically explore customer segments with these desired attributes. Subsequently, the go-to-market (GTM) team will then be equipped to position a light beer that resonates with the selected target market’s preferences.

With a sound, data-informed STP marketing strategy, the Blue Moon Brewing Company is poised to successfully introduce and market a new light beer to the most receptive audiences—ensuring product-market fit.


Appendix A: Environment, Language & Package Versions, and Coding Style

If you are interested in reproducing this work, here are the versions of R, Python, and Julia that I used (as well as the respective packages for each). Additionally, my coding style here is verbose, in order to trace back where functions/methods and variables are originating from, and make this a learning experience for everyone—including me. Finally, the data visualizations are mostly (if not entirely) implemented using the Grammar of Graphics framework.

cat(
    R.version$version.string, "-", R.version$nickname,
    "\nOS:", Sys.info()["sysname"], R.version$platform,
    "\nCPU:", benchmarkme::get_cpu()$no_of_cores, "x", benchmarkme::get_cpu()$model_name
)
R version 4.2.3 (2023-03-15) - Shortstop Beagle 
OS: Darwin x86_64-apple-darwin17.0 
CPU: 8 x Intel(R) Core(TM) i5-8259U CPU @ 2.30GHz
require(devtools)
devtools::install_version("dplyr", version="1.1.4", repos="http://cran.us.r-project.org")
devtools::install_version("ggplot2", version="3.5.0", repos="http://cran.us.r-project.org")
devtools::install_version("ggcorrplot", version="0.1.4.1", repos="http://cran.us.r-project.org")
devtools::install_version("caret", version="6.0.94", repos="http://cran.us.r-project.org")

library(package=dplyr)
library(package=ggplot2)
library(package=ggcorrplot)
library(package=caret)
import sys
import platform
import os
import cpuinfo
print(
    "Python", sys.version,
    "\nOS:", platform.system(), platform.platform(),
    "\nCPU:", os.cpu_count(), "x", cpuinfo.get_cpu_info()["brand_raw"]
)
Python 3.11.4 (v3.11.4:d2340ef257, Jun  6 2023, 19:15:51) [Clang 13.0.0 (clang-1300.0.29.30)] 
OS: Darwin macOS-10.16-x86_64-i386-64bit 
CPU: 8 x Intel(R) Core(TM) i5-8259U CPU @ 2.30GHz
!pip install numpy==1.25.1
!pip install pandas==2.0.3
!pip install scipy==1.11.1

import numpy
import pandas
from scipy import stats
using InteractiveUtils
InteractiveUtils.versioninfo()
Julia Version 1.9.2
Commit e4ee485e909 (2023-07-05 09:39 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin22.4.0)
  CPU: 8 × Intel(R) Core(TM) i5-8259U CPU @ 2.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, skylake)
  Threads: 1 on 8 virtual cores
Environment:
  DYLD_FALLBACK_LIBRARY_PATH = /Library/Frameworks/R.framework/Resources/lib:/Library/Java/JavaVirtualMachines/jdk-21.jdk/Contents/Home/lib/server
using Pkg
Pkg.add(name="HTTP", version="1.10.2")
Pkg.add(name="CSV", version="0.10.13")
Pkg.add(name="DataFrames", version="1.6.1")
Pkg.add(name="CategoricalArrays", version="0.10.8")
Pkg.add(name="StatsBase", version="0.34.2")

using HTTP
using CSV
using DataFrames
using CategoricalArrays
using StatsBase

Appendix B: A Case for Logistic Regression

Advantages

  • Simplicity: Logistic regression is a simple model that is easy to understand and interpret. The output can be interpreted as the probability of the positive class.
  • Efficiency: It requires less computational resources as it’s not a complex algorithm. It’s relatively quick and easy to train.
  • Robust: It’s less prone to overfitting and can handle noise in the data.
  • Outputs Probabilities: Unlike other classification algorithms, logistic regression provides actual probabilities, which can be useful in cases where we want to rank predictions.
  • Feature Importance: Logistic regression allows for understanding the influence of several independent variables on a single outcome variable.

Disadvantages

  • Linearity Assumption: Logistic regression assumes a linear relationship between the independent variables and the logit of the dependent variable. If this assumption is violated, the model’s predictive performance can suffer.
  • Outliers: Logistic regression is sensitive to outliers in the data. Outliers can have a large effect on the decision boundary.
  • Multicollinearity: Logistic regression can overfit on highly correlated features. It’s important to check for multicollinearity in the data before applying logistic regression.
  • Limited to Binary Classification: Logistic regression is typically used for binary classification. For multi-class problems, one would have to resort to techniques like “one-vs-all” or “one-vs-one”, which involve training multiple models.
  • Performance: Logistic regression may not perform as well as other more complex models like Random Forests or Gradient Boosting on complex datasets that have many non-linear relationships.

Further Readings

  • Albright, S. C., Winston, W. L., & Zappe, C. (2003). Data Analysis for Managers with Microsoft Excel (2nd ed.). South-Western College Publishing.
  • James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). An Introduction to Statistical Learning: With Applications in R (2nd ed.). Springer. https://doi.org/10.1007/978-1-0716-1418-1
  • Shmueli, G., Patel, N. R., & Bruce, P. C. (2007). Data Mining for Business Intelligence. Wiley.
Recent Thoughts