Data-Informed Thinking + Doing
Categorical Predictions Based on Past Events
Predicting whether the stock market will move up or down—using Naïve Bayes in R, Python, and Julia.
Getting Started
If you would like to reproduce this work, the versions of R, Python, and Julia used, along with the packages for each, are listed below. The visualizations follow Leland Wilkinson’s Grammar of Graphics. My coding style here is deliberately verbose, with functions and methods qualified by their package or module, so that it is easy to trace where each function, method, and variable originates. The aim is to make this a learning experience for everyone, including me.
cat(R.version$version.string, R.version$nickname)
R version 4.2.3 (2023-03-15) Shortstop Beagle
library(devtools)  # unlike require(), library() stops with an error if devtools is not installed
devtools::install_version("dplyr", version="1.1.2", repos="http://cran.us.r-project.org")
devtools::install_version("ggplot2", version="3.4.2", repos="http://cran.us.r-project.org")
library(dplyr)
library(ggplot2)
import sys
print(sys.version)
3.11.4 (v3.11.4:d2340ef257, Jun 6 2023, 19:15:51) [Clang 13.0.0 (clang-1300.0.29.30)]
!pip install pandas==2.0.3
!pip install plotnine==0.12.1
import pandas
import plotnine
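To confirm that a Python environment matches the versions pinned above, a small check along these lines may help. It is only a sketch: the `PINNED` dictionary and the `check_versions` helper are names I made up for illustration, and it relies on the standard library's `importlib.metadata` (Python 3.8 or later).

```python
from importlib import metadata

# Versions pinned in this post (taken from the pip install lines above).
PINNED = {"pandas": "2.0.3", "plotnine": "0.12.1"}


def check_versions(pinned):
    """Return {package: (expected, installed_or_None, matches)} for each pinned package."""
    report = {}
    for name, expected in pinned.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Package is not installed in this environment.
            installed = None
        report[name] = (expected, installed, installed == expected)
    return report


for name, (expected, installed, ok) in check_versions(PINNED).items():
    status = "OK" if ok else "MISMATCH"
    print(f"{name}: expected {expected}, found {installed} [{status}]")
```

A mismatch does not necessarily mean the code will fail; it only flags that output may differ from what is shown here.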
using InteractiveUtils
InteractiveUtils.versioninfo()
Julia Version 1.9.2
Commit e4ee485e909 (2023-07-05 09:39 UTC)
Platform Info:
OS: macOS (x86_64-apple-darwin22.4.0)
CPU: 8 × Intel(R) Core(TM) i5-8259U CPU @ 2.30GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-14.0.6 (ORCJIT, skylake)
Threads: 1 on 8 virtual cores
Environment:
DYLD_FALLBACK_LIBRARY_PATH = /Library/Frameworks/R.framework/Resources/lib:/Library/Java/JavaVirtualMachines/jdk1.8.0_241.jdk/Contents/Home/jre/lib/server
using Pkg
Pkg.add(name="CSV", version="0.10.11")
Pkg.add(name="DataFrames", version="1.5.0")
Pkg.add(name="CategoricalArrays", version="0.10.8")
Pkg.add(name="Colors", version="0.12.10")
Pkg.add(name="Cairo", version="1.0.5")
Pkg.add(name="Gadfly", version="1.3.4")
using Dates
using CSV
using DataFrames
using CategoricalArrays
using Colors
using Cairo
using Gadfly
Importing and Examining Dataset
Upon importing and examining the dataset, we can see that the data frame has 1250 rows and 9 columns.
stock_market_r <- read.csv("../../dataset/stock-market.csv", stringsAsFactors=TRUE)
str(object=stock_market_r)
'data.frame': 1250 obs. of 9 variables:
$ Year : int 2001 2001 2001 2001 2001 2001 2001 2001 2001 2001 ...
$ Lag1 : num 0.381 0.959 1.032 -0.623 0.614 ...
$ Lag2 : num -0.192 0.381 0.959 1.032 -0.623 ...
$ Lag3 : num -2.624 -0.192 0.381 0.959 1.032 ...
$ Lag4 : num -1.055 -2.624 -0.192 0.381 0.959 ...
$ Lag5 : num 5.01 -1.055 -2.624 -0.192 0.381 ...
$ Volume : num 1.19 1.3 1.41 1.28 1.21 ...
$ Today : num 0.959 1.032 -0.623 0.614 0.213 ...
$ Direction: Factor w/ 2 levels "Down","Up": 2 2 1 2 2 2 1 2 2 2 ...
head(x=stock_market_r, n=7)
Year Lag1 Lag2 Lag3 Lag4 Lag5 Volume Today Direction
1 2001 0.38 -0.19 -2.62 -1.05 5.01 1.2 0.96 Up
2 2001 0.96 0.38 -0.19 -2.62 -1.05 1.3 1.03 Up
3 2001 1.03 0.96 0.38 -0.19 -2.62 1.4 -0.62 Down
4 2001 -0.62 1.03 0.96 0.38 -0.19 1.3 0.61 Up
5 2001 0.61 -0.62 1.03 0.96 0.38 1.2 0.21 Up
6 2001 0.21 0.61 -0.62 1.03 0.96 1.3 1.39 Up
7 2001 1.39 0.21 0.61 -0.62 1.03 1.4 -0.40 Down
tail(x=stock_market_r, n=7)
Year Lag1 Lag2 Lag3 Lag4 Lag5 Volume Today Direction
1244 2005 -0.024 -0.584 -0.285 -0.141 0.419 2.0 0.252 Up
1245 2005 0.252 -0.024 -0.584 -0.285 -0.141 2.1 0.422 Up
1246 2005 0.422 0.252 -0.024 -0.584 -0.285 1.9 0.043 Up
1247 2005 0.043 0.422 0.252 -0.024 -0.584 1.3 -0.955 Down
1248 2005 -0.955 0.043 0.422 0.252 -0.024 1.5 0.130 Up
1249 2005 0.130 -0.955 0.043 0.422 0.252 1.4 -0.298 Down
1250 2005 -0.298 0.130 -0.955 0.043 0.422 1.4 -0.489 Down
stock_market_py = pandas.read_csv("../../dataset/stock-market.csv")
stock_market_py.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1250 entries, 0 to 1249
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Year 1250 non-null int64
1 Lag1 1250 non-null float64
2 Lag2 1250 non-null float64
3 Lag3 1250 non-null float64
4 Lag4 1250 non-null float64
5 Lag5 1250 non-null float64
6 Volume 1250 non-null float64
7 Today 1250 non-null float64
8 Direction 1250 non-null object
dtypes: float64(7), int64(1), object(1)
memory usage: 88.0+ KB
stock_market_py.head(n=8)
Year Lag1 Lag2 Lag3 Lag4 Lag5 Volume Today Direction
0 2001 0.381 -0.192 -2.624 -1.055 5.010 1.1913 0.959 Up
1 2001 0.959 0.381 -0.192 -2.624 -1.055 1.2965 1.032 Up
2 2001 1.032 0.959 0.381 -0.192 -2.624 1.4112 -0.623 Down
3 2001 -0.623 1.032 0.959 0.381 -0.192 1.2760 0.614 Up
4 2001 0.614 -0.623 1.032 0.959 0.381 1.2057 0.213 Up
5 2001 0.213 0.614 -0.623 1.032 0.959 1.3491 1.392 Up
6 2001 1.392 0.213 0.614 -0.623 1.032 1.4450 -0.403 Down
7 2001 -0.403 1.392 0.213 0.614 -0.623 1.4078 0.027 Up
stock_market_py.tail(n=8)
Year Lag1 Lag2 Lag3 Lag4 Lag5 Volume Today Direction
1242 2005 -0.584 -0.285 -0.141 0.419 0.555 2.20881 -0.024 Down
1243 2005 -0.024 -0.584 -0.285 -0.141 0.419 1.99669 0.252 Up
1244 2005 0.252 -0.024 -0.584 -0.285 -0.141 2.06517 0.422 Up
1245 2005 0.422 0.252 -0.024 -0.584 -0.285 1.88850 0.043 Up
1246 2005 0.043 0.422 0.252 -0.024 -0.584 1.28581 -0.955 Down
1247 2005 -0.955 0.043 0.422 0.252 -0.024 1.54047 0.130 Up
1248 2005 0.130 -0.955 0.043 0.422 0.252 1.42236 -0.298 Down
1249 2005 -0.298 0.130 -0.955 0.043 0.422 1.38254 -0.489 Down
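The printed rows also show two patterns worth checking: Direction agrees with the sign of Today, and each row's Lag1 equals the previous row's Today. The sketch below tests both on the eight head rows shown above, inlined as CSV text so that it runs without the dataset file. The helper names are hypothetical, and whether the patterns hold for all 1250 rows is an assumption that should be tested on the full file.

```python
import csv
import io

# First eight rows copied from the head() output above; the full file has 1250 rows.
SAMPLE_CSV = """Year,Lag1,Lag2,Lag3,Lag4,Lag5,Volume,Today,Direction
2001,0.381,-0.192,-2.624,-1.055,5.010,1.1913,0.959,Up
2001,0.959,0.381,-0.192,-2.624,-1.055,1.2965,1.032,Up
2001,1.032,0.959,0.381,-0.192,-2.624,1.4112,-0.623,Down
2001,-0.623,1.032,0.959,0.381,-0.192,1.2760,0.614,Up
2001,0.614,-0.623,1.032,0.959,0.381,1.2057,0.213,Up
2001,0.213,0.614,-0.623,1.032,0.959,1.3491,1.392,Up
2001,1.392,0.213,0.614,-0.623,1.032,1.4450,-0.403,Down
2001,-0.403,1.392,0.213,0.614,-0.623,1.4078,0.027,Up
"""


def load_rows(text):
    """Parse CSV text into a list of dictionaries, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


def dimensions(rows):
    """Return (number of rows, number of columns)."""
    return (len(rows), len(rows[0]) if rows else 0)


def direction_matches_today(rows):
    """True if Direction is 'Up' exactly when Today is positive."""
    return all((float(r["Today"]) > 0) == (r["Direction"] == "Up") for r in rows)


def lag1_matches_previous_today(rows):
    """True if each row's Lag1 equals the previous row's Today."""
    return all(
        float(rows[i]["Lag1"]) == float(rows[i - 1]["Today"])
        for i in range(1, len(rows))
    )


rows = load_rows(SAMPLE_CSV)
print(dimensions(rows))                   # (8, 9) for this eight-row sample
print(direction_matches_today(rows))      # True
print(lag1_matches_previous_today(rows))  # True
```

With the full data frame loaded in pandas, `stock_market_py.shape` gives the row and column counts directly.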
stock_market_jl = CSV.File("../../dataset/stock-market.csv") |> DataFrames.DataFrame
1250×9 DataFrame
Row │ Year Lag1 Lag2 Lag3 Lag4 Lag5 Volume Today Direction
│ Int64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 String7
──────┼─────────────────────────────────────────────────────────────────────────────────
1 │ 2001 0.381 -0.192 -2.624 -1.055 5.01 1.1913 0.959 Up
2 │ 2001 0.959 0.381 -0.192 -2.624 -1.055 1.2965 1.032 Up
3 │ 2001 1.032 0.959 0.381 -0.192 -2.624 1.4112 -0.623 Down
4 │ 2001 -0.623 1.032 0.959 0.381 -0.192 1.276 0.614 Up
5 │ 2001 0.614 -0.623 1.032 0.959 0.381 1.2057 0.213 Up
6 │ 2001 0.213 0.614 -0.623 1.032 0.959 1.3491 1.392 Up
7 │ 2001 1.392 0.213 0.614 -0.623 1.032 1.445 -0.403 Down
8 │ 2001 -0.403 1.392 0.213 0.614 -0.623 1.4078 0.027 Up
9 │ 2001 0.027 -0.403 1.392 0.213 0.614 1.164 1.303 Up
10 │ 2001 1.303 0.027 -0.403 1.392 0.213 1.2326 0.287 Up
11 │ 2001 0.287 1.303 0.027 -0.403 1.392 1.309 -0.498 Down
12 │ 2001 -0.498 0.287 1.303 0.027 -0.403 1.258 -0.189 Down
13 │ 2001 -0.189 -0.498 0.287 1.303 0.027 1.098 0.68 Up
14 │ 2001 0.68 -0.189 -0.498 0.287 1.303 1.0531 0.701 Up
15 │ 2001 0.701 0.68 -0.189 -0.498 0.287 1.1498 -0.562 Down
16 │ 2001 -0.562 0.701 0.68 -0.189 -0.498 1.2953 0.546 Up
17 │ 2001 0.546 -0.562 0.701 0.68 -0.189 1.1188 -1.747 Down
18 │ 2001 -1.747 0.546 -0.562 0.701 0.68 1.0484 0.359 Up
19 │ 2001 0.359 -1.747 0.546 -0.562 0.701 1.013 -0.151 Down
20 │ 2001 -0.151 0.359 -1.747 0.546 -0.562 1.0596 -0.841 Down
21 │ 2001 -0.841 -0.151 0.359 -1.747 0.546 1.1583 -0.623 Down
22 │ 2001 -0.623 -0.841 -0.151 0.359 -1.747 1.1072 -1.334 Down
23 │ 2001 -1.334 -0.623 -0.841 -0.151 0.359 1.0755 1.183 Up
24 │ 2001 1.183 -1.334 -0.623 -0.841 -0.151 1.0391 -0.865 Down
25 │ 2001 -0.865 1.183 -1.334 -0.623 -0.841 1.0752 -0.218 Down
26 │ 2001 -0.218 -0.865 1.183 -1.334 -0.623 1.1503 0.812 Up
27 │ 2001 0.812 -0.218 -0.865 1.183 -1.334 1.1537 -1.891 Down
28 │ 2001 -1.891 0.812 -0.218 -0.865 1.183 1.2572 -1.736 Down
29 │ 2001 -1.736 -1.891 0.812 -0.218 -0.865 1.1122 -1.851 Down
30 │ 2001 -1.851 -1.736 -1.891 0.812 -0.218 1.2085 -0.195 Down
31 │ 2001 -0.195 -1.851 -1.736 -1.891 0.812 1.3659 -0.556 Down
32 │ 2001 -0.556 -0.195 -1.851 -1.736 -1.891 1.2313 1.749 Up
33 │ 2001 1.749 -0.556 -0.195 -1.851 -1.736 1.1308 -0.766 Down
34 │ 2001 -0.766 1.749 -0.556 -0.195 -1.851 1.1141 -1.431 Down
35 │ 2001 -1.431 -0.766 1.749 -0.556 -0.195 1.2253 0.104 Up
36 │ 2001 0.104 -1.431 -0.766 1.749 -0.556 1.2949 -0.568 Down
37 │ 2001 -0.568 0.104 -1.431 -0.766 1.749 1.294 0.586 Up
38 │ 2001 0.586 -0.568 0.104 -1.431 -0.766 0.9292 0.998 Up
39 │ 2001 0.998 0.586 -0.568 0.104 -1.431 1.0918 0.645 Up
40 │ 2001 0.645 0.998 0.586 -0.568 0.104 1.1322 0.226 Up
41 │ 2001 0.226 0.645 0.998 0.586 -0.568 1.1141 -2.476 Down
42 │ 2001 -2.476 0.226 0.645 0.998 0.586 1.0859 -4.318 Down
43 │ 2001 -4.318 -2.476 0.226 0.645 0.998 1.229 1.483 Up
44 │ 2001 1.483 -4.318 -2.476 0.226 0.645 1.3609 -2.584 Down
45 │ 2001 -2.584 1.483 -4.318 -2.476 0.226 1.3974 0.587 Up
46 │ 2001 0.587 -2.584 1.483 -4.318 -2.476 1.2595 -1.962 Down
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮
1206 │ 2005 -0.431 -0.237 1.678 0.152 -1.502 2.46775 -1.048 Down
1207 │ 2005 -1.048 -0.431 -0.237 1.678 0.152 2.39537 1.655 Up
1208 │ 2005 1.655 -1.048 -0.431 -0.237 1.678 2.3794 0.718 Up
1209 │ 2005 0.718 1.655 -1.048 -0.431 -0.237 2.56747 -0.352 Down
1210 │ 2005 -0.352 0.718 1.655 -1.048 -0.431 2.45785 0.998 Up
1211 │ 2005 0.998 -0.352 0.718 1.655 -1.048 2.64809 0.426 Up
1212 │ 2005 0.426 0.998 -0.352 0.718 1.655 2.71663 0.016 Up
1213 │ 2005 0.016 0.426 0.998 -0.352 0.718 2.05051 0.219 Up
1214 │ 2005 0.219 0.016 0.426 0.998 -0.352 1.98758 -0.345 Down
1215 │ 2005 -0.345 0.219 0.016 0.426 0.998 1.96505 0.169 Up
1216 │ 2005 0.169 -0.345 0.219 0.016 0.426 2.21446 0.845 Up
1217 │ 2005 0.845 0.169 -0.345 0.219 0.016 2.37846 0.305 Up
1218 │ 2005 0.305 0.845 0.169 -0.345 0.219 1.77314 -0.078 Down
1219 │ 2005 -0.078 0.305 0.845 0.169 -0.345 1.89978 -0.385 Down
1220 │ 2005 -0.385 -0.078 0.305 0.845 0.169 2.35937 0.179 Up
1221 │ 2005 0.179 -0.385 -0.078 0.305 0.845 2.12158 0.941 Up
1222 │ 2005 0.941 0.179 -0.385 -0.078 0.305 2.29804 0.44 Up
1223 │ 2005 0.44 0.941 0.179 -0.385 -0.078 2.45329 0.527 Up
1224 │ 2005 0.527 0.44 0.941 0.179 -0.385 2.11735 0.508 Up
1225 │ 2005 0.508 0.527 0.44 0.941 0.179 2.29142 0.347 Up
1226 │ 2005 0.347 0.508 0.527 0.44 0.941 1.9854 0.209 Up
1227 │ 2005 0.209 0.347 0.508 0.527 0.44 0.72494 -0.851 Down
1228 │ 2005 -0.851 0.209 0.347 0.508 0.527 2.0169 0.002 Up
1229 │ 2005 0.002 -0.851 0.209 0.347 0.508 2.26834 -0.636 Down
1230 │ 2005 -0.636 0.002 -0.851 0.209 0.347 2.37469 1.216 Up
1231 │ 2005 1.216 -0.636 0.002 -0.851 0.209 2.61483 0.032 Up
1232 │ 2005 0.032 1.216 -0.636 0.002 -0.851 2.12558 -0.236 Down
1233 │ 2005 -0.236 0.032 1.216 -0.636 0.002 2.32584 0.128 Up
1234 │ 2005 0.128 -0.236 0.032 1.216 -0.636 2.11074 -0.501 Down
1235 │ 2005 -0.501 0.128 -0.236 0.032 1.216 2.09383 -0.122 Down
1236 │ 2005 -0.122 -0.501 0.128 -0.236 0.032 2.1783 0.281 Up
1237 │ 2005 0.281 -0.122 -0.501 0.128 -0.236 1.89629 0.084 Up
1238 │ 2005 0.084 0.281 -0.122 -0.501 0.128 1.87655 0.555 Up
1239 │ 2005 0.555 0.084 0.281 -0.122 -0.501 2.39002 0.419 Up
1240 │ 2005 0.419 0.555 0.084 0.281 -0.122 2.14552 -0.141 Down
1241 │ 2005 -0.141 0.419 0.555 0.084 0.281 2.18059 -0.285 Down
1242 │ 2005 -0.285 -0.141 0.419 0.555 0.084 2.58419 -0.584 Down
1243 │ 2005 -0.584 -0.285 -0.141 0.419 0.555 2.20881 -0.024 Down
1244 │ 2005 -0.024 -0.584 -0.285 -0.141 0.419 1.99669 0.252 Up
1245 │ 2005 0.252 -0.024 -0.584 -0.285 -0.141 2.06517 0.422 Up
1246 │ 2005 0.422 0.252 -0.024 -0.584 -0.285 1.8885 0.043 Up
1247 │ 2005 0.043 0.422 0.252 -0.024 -0.584 1.28581 -0.955 Down
1248 │ 2005 -0.955 0.043 0.422 0.252 -0.024 1.54047 0.13 Up
1249 │ 2005 0.13 -0.955 0.043 0.422 0.252 1.42236 -0.298 Down
1250 │ 2005 -0.298 0.13 -0.955 0.043 0.422 1.38254 -0.489 Down
1159 rows omitted
Wrangling Data
References
- James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). An Introduction to Statistical Learning: With Applications in R (2nd ed.). Springer. https://doi.org/10.1007/978-1-0716-1418-1
- Shmueli, G., Patel, N. R., & Bruce, P. C. (2007). Data Mining for Business Intelligence. Wiley.
- Albright, S. C., Winston, W. L., & Zappe, C. (2003). Data Analysis for Managers with Microsoft Excel (2nd ed.). South-Western College Publishing.