Data-Informed Thinking + Doing

Categorical Predictions Based on Past Events

Predicting whether the stock market will move up or down, using Naïve Bayes in R, Python, and Julia.

Getting Started

To reproduce this work, use the versions of R, Python, and Julia listed below, along with the packages pinned for each. The visualizations follow Leland Wilkinson’s Grammar of Graphics. My coding style here is deliberately verbose, so that it is easy to trace where functions, methods, and variables originate, which makes this a learning experience for everyone, including me.

# R: report the interpreter version
cat(R.version$version.string, R.version$nickname)
R version 4.2.3 (2023-03-15) Shortstop Beagle
# R: install pinned package versions, then load them
require(devtools)
devtools::install_version("dplyr", version="1.1.2", repos="https://cran.us.r-project.org")
devtools::install_version("ggplot2", version="3.4.2", repos="https://cran.us.r-project.org")
library(dplyr)
library(ggplot2)
# Python: report the interpreter version
import sys
print(sys.version)
3.11.4 (v3.11.4:d2340ef257, Jun  6 2023, 19:15:51) [Clang 13.0.0 (clang-1300.0.29.30)]
# Python: install pinned package versions (notebook shell syntax), then import them
!pip install pandas==2.0.3
!pip install plotnine==0.12.1
import pandas
import plotnine
# Julia: report the version and platform details
using InteractiveUtils
InteractiveUtils.versioninfo()
Julia Version 1.9.2
Commit e4ee485e909 (2023-07-05 09:39 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin22.4.0)
  CPU: 8 × Intel(R) Core(TM) i5-8259U CPU @ 2.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, skylake)
  Threads: 1 on 8 virtual cores
Environment:
  DYLD_FALLBACK_LIBRARY_PATH = /Library/Frameworks/R.framework/Resources/lib:/Library/Java/JavaVirtualMachines/jdk1.8.0_241.jdk/Contents/Home/jre/lib/server
# Julia: install pinned package versions
using Pkg
Pkg.add(name="CSV", version="0.10.11")
Pkg.add(name="DataFrames", version="1.5.0")
Pkg.add(name="CategoricalArrays", version="0.10.8")
Pkg.add(name="Colors", version="0.12.10")
Pkg.add(name="Cairo", version="1.0.5")
Pkg.add(name="Gadfly", version="1.3.4")
# Julia: load the Dates standard library and the installed packages
using Dates
using CSV
using DataFrames
using CategoricalArrays
using Colors
using Cairo
using Gadfly

Importing and Examining Dataset

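Importing the dataset and examining it in each language shows a data frame of 1,250 rows and 9 columns: Year, Lag1 through Lag5, Volume, Today, and the categorical Direction, which takes the values Down and Up.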

# R: read the CSV (string columns become factors) and inspect its structure
stock_market_r <- read.csv("../../dataset/stock-market.csv", stringsAsFactors=TRUE)
str(object=stock_market_r)
'data.frame':	1250 obs. of  9 variables:
 $ Year     : int  2001 2001 2001 2001 2001 2001 2001 2001 2001 2001 ...
 $ Lag1     : num  0.381 0.959 1.032 -0.623 0.614 ...
 $ Lag2     : num  -0.192 0.381 0.959 1.032 -0.623 ...
 $ Lag3     : num  -2.624 -0.192 0.381 0.959 1.032 ...
 $ Lag4     : num  -1.055 -2.624 -0.192 0.381 0.959 ...
 $ Lag5     : num  5.01 -1.055 -2.624 -0.192 0.381 ...
 $ Volume   : num  1.19 1.3 1.41 1.28 1.21 ...
 $ Today    : num  0.959 1.032 -0.623 0.614 0.213 ...
 $ Direction: Factor w/ 2 levels "Down","Up": 2 2 1 2 2 2 1 2 2 2 ...
head(x=stock_market_r, n=7)
  Year  Lag1  Lag2  Lag3  Lag4  Lag5 Volume Today Direction
1 2001  0.38 -0.19 -2.62 -1.05  5.01    1.2  0.96        Up
2 2001  0.96  0.38 -0.19 -2.62 -1.05    1.3  1.03        Up
3 2001  1.03  0.96  0.38 -0.19 -2.62    1.4 -0.62      Down
4 2001 -0.62  1.03  0.96  0.38 -0.19    1.3  0.61        Up
5 2001  0.61 -0.62  1.03  0.96  0.38    1.2  0.21        Up
6 2001  0.21  0.61 -0.62  1.03  0.96    1.3  1.39        Up
7 2001  1.39  0.21  0.61 -0.62  1.03    1.4 -0.40      Down
tail(x=stock_market_r, n=7)
     Year   Lag1   Lag2   Lag3   Lag4   Lag5 Volume  Today Direction
1244 2005 -0.024 -0.584 -0.285 -0.141  0.419    2.0  0.252        Up
1245 2005  0.252 -0.024 -0.584 -0.285 -0.141    2.1  0.422        Up
1246 2005  0.422  0.252 -0.024 -0.584 -0.285    1.9  0.043        Up
1247 2005  0.043  0.422  0.252 -0.024 -0.584    1.3 -0.955      Down
1248 2005 -0.955  0.043  0.422  0.252 -0.024    1.5  0.130        Up
1249 2005  0.130 -0.955  0.043  0.422  0.252    1.4 -0.298      Down
1250 2005 -0.298  0.130 -0.955  0.043  0.422    1.4 -0.489      Down
# Python: read the CSV and summarize column types and non-null counts
stock_market_py = pandas.read_csv("../../dataset/stock-market.csv")
stock_market_py.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1250 entries, 0 to 1249
Data columns (total 9 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Year       1250 non-null   int64  
 1   Lag1       1250 non-null   float64
 2   Lag2       1250 non-null   float64
 3   Lag3       1250 non-null   float64
 4   Lag4       1250 non-null   float64
 5   Lag5       1250 non-null   float64
 6   Volume     1250 non-null   float64
 7   Today      1250 non-null   float64
 8   Direction  1250 non-null   object 
dtypes: float64(7), int64(1), object(1)
memory usage: 88.0+ KB
stock_market_py.head(n=8)
   Year   Lag1   Lag2   Lag3   Lag4   Lag5  Volume  Today Direction
0  2001  0.381 -0.192 -2.624 -1.055  5.010  1.1913  0.959        Up
1  2001  0.959  0.381 -0.192 -2.624 -1.055  1.2965  1.032        Up
2  2001  1.032  0.959  0.381 -0.192 -2.624  1.4112 -0.623      Down
3  2001 -0.623  1.032  0.959  0.381 -0.192  1.2760  0.614        Up
4  2001  0.614 -0.623  1.032  0.959  0.381  1.2057  0.213        Up
5  2001  0.213  0.614 -0.623  1.032  0.959  1.3491  1.392        Up
6  2001  1.392  0.213  0.614 -0.623  1.032  1.4450 -0.403      Down
7  2001 -0.403  1.392  0.213  0.614 -0.623  1.4078  0.027        Up
stock_market_py.tail(n=8)
      Year   Lag1   Lag2   Lag3   Lag4   Lag5   Volume  Today Direction
1242  2005 -0.584 -0.285 -0.141  0.419  0.555  2.20881 -0.024      Down
1243  2005 -0.024 -0.584 -0.285 -0.141  0.419  1.99669  0.252        Up
1244  2005  0.252 -0.024 -0.584 -0.285 -0.141  2.06517  0.422        Up
1245  2005  0.422  0.252 -0.024 -0.584 -0.285  1.88850  0.043        Up
1246  2005  0.043  0.422  0.252 -0.024 -0.584  1.28581 -0.955      Down
1247  2005 -0.955  0.043  0.422  0.252 -0.024  1.54047  0.130        Up
1248  2005  0.130 -0.955  0.043  0.422  0.252  1.42236 -0.298      Down
1249  2005 -0.298  0.130 -0.955  0.043  0.422  1.38254 -0.489      Down
# Julia: read the CSV and pipe it into a DataFrame (the result is displayed in full)
stock_market_jl = CSV.File("../../dataset/stock-market.csv") |> DataFrames.DataFrame
1250×9 DataFrame
  Row │ Year   Lag1     Lag2     Lag3     Lag4     Lag5     Volume   Today    Direction
      │ Int64  Float64  Float64  Float64  Float64  Float64  Float64  Float64  String7
──────┼─────────────────────────────────────────────────────────────────────────────────
    1 │  2001    0.381   -0.192   -2.624   -1.055    5.01   1.1913     0.959  Up
    2 │  2001    0.959    0.381   -0.192   -2.624   -1.055  1.2965     1.032  Up
    3 │  2001    1.032    0.959    0.381   -0.192   -2.624  1.4112    -0.623  Down
    4 │  2001   -0.623    1.032    0.959    0.381   -0.192  1.276      0.614  Up
    5 │  2001    0.614   -0.623    1.032    0.959    0.381  1.2057     0.213  Up
    6 │  2001    0.213    0.614   -0.623    1.032    0.959  1.3491     1.392  Up
    7 │  2001    1.392    0.213    0.614   -0.623    1.032  1.445     -0.403  Down
    8 │  2001   -0.403    1.392    0.213    0.614   -0.623  1.4078     0.027  Up
    9 │  2001    0.027   -0.403    1.392    0.213    0.614  1.164      1.303  Up
   10 │  2001    1.303    0.027   -0.403    1.392    0.213  1.2326     0.287  Up
   11 │  2001    0.287    1.303    0.027   -0.403    1.392  1.309     -0.498  Down
   12 │  2001   -0.498    0.287    1.303    0.027   -0.403  1.258     -0.189  Down
   13 │  2001   -0.189   -0.498    0.287    1.303    0.027  1.098      0.68   Up
   14 │  2001    0.68    -0.189   -0.498    0.287    1.303  1.0531     0.701  Up
   15 │  2001    0.701    0.68    -0.189   -0.498    0.287  1.1498    -0.562  Down
   16 │  2001   -0.562    0.701    0.68    -0.189   -0.498  1.2953     0.546  Up
   17 │  2001    0.546   -0.562    0.701    0.68    -0.189  1.1188    -1.747  Down
   18 │  2001   -1.747    0.546   -0.562    0.701    0.68   1.0484     0.359  Up
   19 │  2001    0.359   -1.747    0.546   -0.562    0.701  1.013     -0.151  Down
   20 │  2001   -0.151    0.359   -1.747    0.546   -0.562  1.0596    -0.841  Down
   21 │  2001   -0.841   -0.151    0.359   -1.747    0.546  1.1583    -0.623  Down
   22 │  2001   -0.623   -0.841   -0.151    0.359   -1.747  1.1072    -1.334  Down
   23 │  2001   -1.334   -0.623   -0.841   -0.151    0.359  1.0755     1.183  Up
   24 │  2001    1.183   -1.334   -0.623   -0.841   -0.151  1.0391    -0.865  Down
   25 │  2001   -0.865    1.183   -1.334   -0.623   -0.841  1.0752    -0.218  Down
   26 │  2001   -0.218   -0.865    1.183   -1.334   -0.623  1.1503     0.812  Up
   27 │  2001    0.812   -0.218   -0.865    1.183   -1.334  1.1537    -1.891  Down
   28 │  2001   -1.891    0.812   -0.218   -0.865    1.183  1.2572    -1.736  Down
   29 │  2001   -1.736   -1.891    0.812   -0.218   -0.865  1.1122    -1.851  Down
   30 │  2001   -1.851   -1.736   -1.891    0.812   -0.218  1.2085    -0.195  Down
   31 │  2001   -0.195   -1.851   -1.736   -1.891    0.812  1.3659    -0.556  Down
   32 │  2001   -0.556   -0.195   -1.851   -1.736   -1.891  1.2313     1.749  Up
   33 │  2001    1.749   -0.556   -0.195   -1.851   -1.736  1.1308    -0.766  Down
   34 │  2001   -0.766    1.749   -0.556   -0.195   -1.851  1.1141    -1.431  Down
   35 │  2001   -1.431   -0.766    1.749   -0.556   -0.195  1.2253     0.104  Up
   36 │  2001    0.104   -1.431   -0.766    1.749   -0.556  1.2949    -0.568  Down
   37 │  2001   -0.568    0.104   -1.431   -0.766    1.749  1.294      0.586  Up
   38 │  2001    0.586   -0.568    0.104   -1.431   -0.766  0.9292     0.998  Up
   39 │  2001    0.998    0.586   -0.568    0.104   -1.431  1.0918     0.645  Up
   40 │  2001    0.645    0.998    0.586   -0.568    0.104  1.1322     0.226  Up
   41 │  2001    0.226    0.645    0.998    0.586   -0.568  1.1141    -2.476  Down
   42 │  2001   -2.476    0.226    0.645    0.998    0.586  1.0859    -4.318  Down
   43 │  2001   -4.318   -2.476    0.226    0.645    0.998  1.229      1.483  Up
   44 │  2001    1.483   -4.318   -2.476    0.226    0.645  1.3609    -2.584  Down
   45 │  2001   -2.584    1.483   -4.318   -2.476    0.226  1.3974     0.587  Up
   46 │  2001    0.587   -2.584    1.483   -4.318   -2.476  1.2595    -1.962  Down
  ⋮   │   ⋮       ⋮        ⋮        ⋮        ⋮        ⋮        ⋮        ⋮         ⋮
 1206 │  2005   -0.431   -0.237    1.678    0.152   -1.502  2.46775   -1.048  Down
 1207 │  2005   -1.048   -0.431   -0.237    1.678    0.152  2.39537    1.655  Up
 1208 │  2005    1.655   -1.048   -0.431   -0.237    1.678  2.3794     0.718  Up
 1209 │  2005    0.718    1.655   -1.048   -0.431   -0.237  2.56747   -0.352  Down
 1210 │  2005   -0.352    0.718    1.655   -1.048   -0.431  2.45785    0.998  Up
 1211 │  2005    0.998   -0.352    0.718    1.655   -1.048  2.64809    0.426  Up
 1212 │  2005    0.426    0.998   -0.352    0.718    1.655  2.71663    0.016  Up
 1213 │  2005    0.016    0.426    0.998   -0.352    0.718  2.05051    0.219  Up
 1214 │  2005    0.219    0.016    0.426    0.998   -0.352  1.98758   -0.345  Down
 1215 │  2005   -0.345    0.219    0.016    0.426    0.998  1.96505    0.169  Up
 1216 │  2005    0.169   -0.345    0.219    0.016    0.426  2.21446    0.845  Up
 1217 │  2005    0.845    0.169   -0.345    0.219    0.016  2.37846    0.305  Up
 1218 │  2005    0.305    0.845    0.169   -0.345    0.219  1.77314   -0.078  Down
 1219 │  2005   -0.078    0.305    0.845    0.169   -0.345  1.89978   -0.385  Down
 1220 │  2005   -0.385   -0.078    0.305    0.845    0.169  2.35937    0.179  Up
 1221 │  2005    0.179   -0.385   -0.078    0.305    0.845  2.12158    0.941  Up
 1222 │  2005    0.941    0.179   -0.385   -0.078    0.305  2.29804    0.44   Up
 1223 │  2005    0.44     0.941    0.179   -0.385   -0.078  2.45329    0.527  Up
 1224 │  2005    0.527    0.44     0.941    0.179   -0.385  2.11735    0.508  Up
 1225 │  2005    0.508    0.527    0.44     0.941    0.179  2.29142    0.347  Up
 1226 │  2005    0.347    0.508    0.527    0.44     0.941  1.9854     0.209  Up
 1227 │  2005    0.209    0.347    0.508    0.527    0.44   0.72494   -0.851  Down
 1228 │  2005   -0.851    0.209    0.347    0.508    0.527  2.0169     0.002  Up
 1229 │  2005    0.002   -0.851    0.209    0.347    0.508  2.26834   -0.636  Down
 1230 │  2005   -0.636    0.002   -0.851    0.209    0.347  2.37469    1.216  Up
 1231 │  2005    1.216   -0.636    0.002   -0.851    0.209  2.61483    0.032  Up
 1232 │  2005    0.032    1.216   -0.636    0.002   -0.851  2.12558   -0.236  Down
 1233 │  2005   -0.236    0.032    1.216   -0.636    0.002  2.32584    0.128  Up
 1234 │  2005    0.128   -0.236    0.032    1.216   -0.636  2.11074   -0.501  Down
 1235 │  2005   -0.501    0.128   -0.236    0.032    1.216  2.09383   -0.122  Down
 1236 │  2005   -0.122   -0.501    0.128   -0.236    0.032  2.1783     0.281  Up
 1237 │  2005    0.281   -0.122   -0.501    0.128   -0.236  1.89629    0.084  Up
 1238 │  2005    0.084    0.281   -0.122   -0.501    0.128  1.87655    0.555  Up
 1239 │  2005    0.555    0.084    0.281   -0.122   -0.501  2.39002    0.419  Up
 1240 │  2005    0.419    0.555    0.084    0.281   -0.122  2.14552   -0.141  Down
 1241 │  2005   -0.141    0.419    0.555    0.084    0.281  2.18059   -0.285  Down
 1242 │  2005   -0.285   -0.141    0.419    0.555    0.084  2.58419   -0.584  Down
 1243 │  2005   -0.584   -0.285   -0.141    0.419    0.555  2.20881   -0.024  Down
 1244 │  2005   -0.024   -0.584   -0.285   -0.141    0.419  1.99669    0.252  Up
 1245 │  2005    0.252   -0.024   -0.584   -0.285   -0.141  2.06517    0.422  Up
 1246 │  2005    0.422    0.252   -0.024   -0.584   -0.285  1.8885     0.043  Up
 1247 │  2005    0.043    0.422    0.252   -0.024   -0.584  1.28581   -0.955  Down
 1248 │  2005   -0.955    0.043    0.422    0.252   -0.024  1.54047    0.13   Up
 1249 │  2005    0.13    -0.955    0.043    0.422    0.252  1.42236   -0.298  Down
 1250 │  2005   -0.298    0.13    -0.955    0.043    0.422  1.38254   -0.489  Down
                                                                       1159 rows omitted
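
The printed rows above suggest two regularities worth confirming before any modeling: each row's Lag1 equals the previous row's Today, and Direction is simply the sign of Today. Here is a minimal sketch of that check. It uses only Python's standard library and the eight rows copied from the stock_market_py.head(n=8) output, so it runs without the dataset file. The helper names (load_rows, lag1_matches_previous_today, direction_matches_sign, direction_counts) are my own illustrative choices, not part of any package used in this work.

```python
import csv
import io

# The first eight rows, copied from the stock_market_py.head(n=8) output.
SAMPLE_CSV = """Year,Lag1,Lag2,Lag3,Lag4,Lag5,Volume,Today,Direction
2001,0.381,-0.192,-2.624,-1.055,5.010,1.1913,0.959,Up
2001,0.959,0.381,-0.192,-2.624,-1.055,1.2965,1.032,Up
2001,1.032,0.959,0.381,-0.192,-2.624,1.4112,-0.623,Down
2001,-0.623,1.032,0.959,0.381,-0.192,1.2760,0.614,Up
2001,0.614,-0.623,1.032,0.959,0.381,1.2057,0.213,Up
2001,0.213,0.614,-0.623,1.032,0.959,1.3491,1.392,Up
2001,1.392,0.213,0.614,-0.623,1.032,1.4450,-0.403,Down
2001,-0.403,1.392,0.213,0.614,-0.623,1.4078,0.027,Up
"""


def load_rows(text):
    """Parse CSV text into a list of dicts keyed by column name (hypothetical helper)."""
    return list(csv.DictReader(io.StringIO(text)))


def lag1_matches_previous_today(rows):
    """True if every row's Lag1 equals the Today value of the row before it."""
    return all(
        float(current["Lag1"]) == float(previous["Today"])
        for previous, current in zip(rows, rows[1:])
    )


def direction_matches_sign(rows):
    """True if Direction is "Up" exactly when Today is positive."""
    return all(
        (float(row["Today"]) > 0) == (row["Direction"] == "Up")
        for row in rows
    )


def direction_counts(rows):
    """Count how often each Direction label occurs."""
    counts = {}
    for row in rows:
        counts[row["Direction"]] = counts.get(row["Direction"], 0) + 1
    return counts


rows = load_rows(SAMPLE_CSV)
print(lag1_matches_previous_today(rows))  # True
print(direction_matches_sign(rows))       # True
print(direction_counts(rows))             # {'Up': 6, 'Down': 2}
```

On the full data frame, the same idea could be expressed in pandas by comparing stock_market_py["Lag1"] with stock_market_py["Today"].shift(1). I have only verified the pattern on the rows printed here, so treat it as an observation to confirm rather than an established property of all 1,250 rows. If Direction does follow directly from Today, then Today should presumably be left out of the predictors when modeling Direction.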

Wrangling Data


References

  • James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). An Introduction to Statistical Learning: With Applications in R (2nd ed.). Springer. https://doi.org/10.1007/978-1-0716-1418-1
  • Shmueli, G., Patel, N. R., & Bruce, P. C. (2007). Data Mining for Business Intelligence. Wiley.
  • Albright, S. C., Winston, W. L., & Zappe, C. (2003). Data Analysis for Managers with Microsoft Excel (2nd ed.). South-Western College Publishing.