Data-Informed Thinking + Doing

Summary Statistics

Extracting an initial overview from the 2000 U.S. Census Bureau data—using R, Python, and Julia.

Summary statistics provide a concise overview of large datasets, simplifying complex information for initial analysis. Using a 2000 U.S. Census Bureau dataset (as an example), key summary statistics like mean, median, and mode help understand population characteristics, and offer insights into demographic trends. These statistics condense vast data into understandable figures—enabling policymakers to make informed decisions based on the population’s central tendencies and distributions.

Let’s explore implementing these ideas across three popular programming languages for statistical analysis.

Getting Started

If you are interested in reproducing this work, here are the versions of R, Python, and Julia that I used (as well as the respective packages for each). Additionally, my coding style here is verbose, in order to trace back where functions/methods and variables are originating from, and make this a learning experience for everyone—including me.

cat(
    R.version$version.string, "-", R.version$nickname,
    "\nOS:", Sys.info()["sysname"], R.version$platform,
    "\nCPU:", benchmarkme::get_cpu()$no_of_cores, "x", benchmarkme::get_cpu()$model_name
)
R version 4.2.3 (2023-03-15) - Shortstop Beagle 
OS: Darwin x86_64-apple-darwin17.0 
CPU: 8 x Intel(R) Core(TM) i5-8259U CPU @ 2.30GHz
require(devtools)
devtools::install_version("dplyr", version="1.1.4", repos="http://cran.us.r-project.org")
library(dplyr)
import sys
import platform
import os
import cpuinfo
print(
    "Python", sys.version,
    "\nOS:", platform.system(), platform.platform(),
    "\nCPU:", os.cpu_count(), "x", cpuinfo.get_cpu_info()["brand_raw"]
)
Python 3.11.4 (v3.11.4:d2340ef257, Jun  6 2023, 19:15:51) [Clang 13.0.0 (clang-1300.0.29.30)] 
OS: Darwin macOS-10.16-x86_64-i386-64bit 
CPU: 8 x Intel(R) Core(TM) i5-8259U CPU @ 2.30GHz
!pip install numpy==1.25.1
!pip install pandas==2.0.3
!pip install scipy==1.11.1
import numpy
import pandas
from scipy import stats
using InteractiveUtils
InteractiveUtils.versioninfo()
Julia Version 1.9.2
Commit e4ee485e909 (2023-07-05 09:39 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin22.4.0)
  CPU: 8 × Intel(R) Core(TM) i5-8259U CPU @ 2.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, skylake)
  Threads: 1 on 8 virtual cores
Environment:
  DYLD_FALLBACK_LIBRARY_PATH = /Library/Frameworks/R.framework/Resources/lib:/Library/Java/JavaVirtualMachines/jdk1.8.0_241.jdk/Contents/Home/jre/lib/server
using Pkg
Pkg.add(name="HTTP", version="1.10.2")
Pkg.add(name="CSV", version="0.10.13")
Pkg.add(name="DataFrames", version="1.6.1")
Pkg.add(name="CategoricalArrays", version="0.10.8")
Pkg.add(name="StatsBase", version="0.34.2")
using HTTP
using CSV
using DataFrames
using CategoricalArrays
using StatsBase

Importing and Examining Dataset

As always, let’s kick-off by importing and examining the dataset.

employment_r <- read.csv("https://huggingface.co/datasets/michaelmallari/us-census/raw/main/2000-employment.csv")
str(object=employment_r)
'data.frame':	3219 obs. of  6 variables:
 $ state              : chr  "Alabama" "Alabama" "Alabama" "Alabama" ...
 $ county             : chr  "Autauga" "Baldwin" "Barbour" "Bibb" ...
 $ employed_civilian  : int  19600 62760 10190 7970 22700 3440 8010 47530 15700 10180 ...
 $ unemployed_civilian: int  970 2670 610 530 980 320 950 3010 1090 410 ...
 $ armed_forces       : int  570 210 20 20 40 10 30 170 20 20 ...
 $ not_in_labor_force : int  11100 42330 9000 6350 15180 3770 7230 35850 11380 8340 ...
head(x=employment_r, n=8)
    state  county employed_civilian unemployed_civilian armed_forces not_in_labor_force
1 Alabama Autauga             19600                 970          570              11100
2 Alabama Baldwin             62760                2670          210              42330
3 Alabama Barbour             10190                 610           20               9000
4 Alabama    Bibb              7970                 530           20               6350
5 Alabama  Blount             22700                 980           40              15180
6 Alabama Bullock              3440                 320           10               3770
7 Alabama  Butler              8010                 950           30               7230
8 Alabama Calhoun             47530                3010          170              35850
tail(x=employment_r, n=8)
           state                  county employed_civilian unemployed_civilian armed_forces not_in_labor_force
3212 Puerto Rico Trujillo Alto Municipio             23230                3020           60              28660
3213 Puerto Rico        Utuado Municipio              6170                2660            0              16730
3214 Puerto Rico     Vega Alta Municipio              8610                1950           30              16740
3215 Puerto Rico     Vega Baja Municipio             14150                3720           30              27090
3216 Puerto Rico       Vieques Municipio              1710                 670           10               4270
3217 Puerto Rico      Villalba Municipio              5120                2160            0              11630
3218 Puerto Rico       Yabucoa Municipio              7240                2260            0              19170
3219 Puerto Rico         Yauco Municipio             10180                3410           10              20300
employment_py = pandas.read_csv("https://huggingface.co/datasets/michaelmallari/us-census/raw/main/2000-employment.csv")
employment_py.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3219 entries, 0 to 3218
Data columns (total 6 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   state                3219 non-null   object
 1   county               3219 non-null   object
 2   employed_civilian    3219 non-null   int64 
 3   unemployed_civilian  3219 non-null   int64 
 4   armed_forces         3219 non-null   int64 
 5   not_in_labor_force   3219 non-null   int64 
dtypes: int64(4), object(2)
memory usage: 151.0+ KB
employment_py.head(n=8)
     state   county  employed_civilian  unemployed_civilian  armed_forces  not_in_labor_force
0  Alabama  Autauga              19600                  970           570               11100
1  Alabama  Baldwin              62760                 2670           210               42330
2  Alabama  Barbour              10190                  610            20                9000
3  Alabama     Bibb               7970                  530            20                6350
4  Alabama   Blount              22700                  980            40               15180
5  Alabama  Bullock               3440                  320            10                3770
6  Alabama   Butler               8010                  950            30                7230
7  Alabama  Calhoun              47530                 3010           170               35850
employment_py.tail(n=8)
            state                   county  employed_civilian  unemployed_civilian  armed_forces  not_in_labor_force
3211  Puerto Rico  Trujillo Alto Municipio              23230                 3020            60               28660
3212  Puerto Rico         Utuado Municipio               6170                 2660             0               16730
3213  Puerto Rico      Vega Alta Municipio               8610                 1950            30               16740
3214  Puerto Rico      Vega Baja Municipio              14150                 3720            30               27090
3215  Puerto Rico        Vieques Municipio               1710                  670            10                4270
3216  Puerto Rico       Villalba Municipio               5120                 2160             0               11630
3217  Puerto Rico        Yabucoa Municipio               7240                 2260             0               19170
3218  Puerto Rico          Yauco Municipio              10180                 3410            10               20300
employment_jl = CSV.File(HTTP.get("https://huggingface.co/datasets/michaelmallari/us-census/raw/main/2000-employment.csv").body) |> DataFrames.DataFrame
3219×6 DataFrame
  Row │ state        county                   employed_civilian  unemployed_civilian  armed_forces  not_in_labor_force
      │ String31     String                   Int64              Int64                Int64         Int64
──────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────
    1 │ Alabama      Autauga                              19600                  970           570               11100
    2 │ Alabama      Baldwin                              62760                 2670           210               42330
    3 │ Alabama      Barbour                              10190                  610            20                9000
    4 │ Alabama      Bibb                                  7970                  530            20                6350
    5 │ Alabama      Blount                               22700                  980            40               15180
    6 │ Alabama      Bullock                               3440                  320            10                3770
  ⋮   │      ⋮                  ⋮                     ⋮                   ⋮                ⋮                ⋮
 3215 │ Puerto Rico  Vega Baja Municipio                  14150                 3720            30               27090
 3216 │ Puerto Rico  Vieques Municipio                     1710                  670            10                4270
 3217 │ Puerto Rico  Villalba Municipio                    5120                 2160             0               11630
 3218 │ Puerto Rico  Yabucoa Municipio                     7240                 2260             0               19170
 3219 │ Puerto Rico  Yauco Municipio                      10180                 3410            10               20300
                                                                                                      3208 rows omitted

Handling Missing Data

Handling missing data ensures accurate analyses. Let’s pinpoint any missing data and determine appropriate actions.

Measuring the Center of Numerical Variables

Measuring the center of numerical variables, often through metrics like mean or median, provides a central tendency that summarizes data points. This helps in understanding typical or average values, aiding in analysis, comparison, and making informed decisions based on the core representation of the dataset. Reminder: Each observation (or row) in the dataset is at the county-level—a total of 3219 in the U.S., including Washington, DC and Puerto Rico.

# Create a custom function for mode, since there in none included in Base R
get_mode <- function(v) {
   uniqv <- unique(v)
   uniqv[which.max(tabulate(match(v, uniqv)))]
}

employment_r %>%
    dplyr::summarize(
        "Mean" = mean(employment_r$unemployed_civilian),
        "Median" = median(employment_r$unemployed_civilian),
        "Mode" = get_mode(employment_r$unemployed_civilian)
    )
  Mean Median Mode
1 2280    620  130
numpy.mean(employment_py.unemployed_civilian)
2280.0155327741536
numpy.median(employment_py.unemployed_civilian)
620.0
stats.mode(employment_py.unemployed_civilian)
ModeResult(mode=130, count=42)
Statistics.mean(employment_jl.unemployed_civilian)
2280.0155327741536
Statistics.median(employment_jl.unemployed_civilian)
620.0
StatsBase.mode(employment_jl.unemployed_civilian)
130

In this example, looking at the central tendencies in absolute values across all counties in the U.S. can be misleading. After all, a mean of 100 unemployed civilians in the state of Hawaii tells a different, and a mean of 100 unemployed civilians in California. Let’s look at the same data, but as a percentage of the total population for each state.

employment_state_r <- employment_r %>%
    dplyr::group_by(state) %>%
    summarize(
        total_population = sum(employed_civilian) + sum(unemployed_civilian) + sum(armed_forces) + not_in_labor_force,
        pct_employed_civilian = sum(employed_civilian) / total_population,
        pct_unemployed_civilian = sum(unemployed_civilian) / total_population,
        pct_armed_forces = sum(armed_forces) / total_population,
        pct_not_in_labor_force = sum(not_in_labor_force) / total_population
    )
Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in dplyr 1.1.0.
ℹ Please use `reframe()` instead.
ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()` always returns an ungrouped data frame and adjust accordingly.
Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
`summarise()` has grouped output by 'state'. You can override using the `.groups` argument.
employment_state_r
# A tibble: 3,219 × 6
# Groups:   state [52]
   state   total_population pct_employed_civilian pct_unemployed_civilian pct_armed_forces pct_not_in_labor_force
   <chr>              <int>                 <dbl>                   <dbl>            <dbl>                  <dbl>
 1 Alabama          2047500                 0.932                  0.0565          0.00594                  0.636
 2 Alabama          2078730                 0.918                  0.0557          0.00585                  0.626
 3 Alabama          2045400                 0.933                  0.0566          0.00595                  0.636
 4 Alabama          2042750                 0.934                  0.0567          0.00596                  0.637
 5 Alabama          2051580                 0.930                  0.0564          0.00593                  0.635
 6 Alabama          2040170                 0.935                  0.0568          0.00597                  0.638
 7 Alabama          2043630                 0.934                  0.0567          0.00596                  0.637
 8 Alabama          2072250                 0.921                  0.0559          0.00587                  0.628
 9 Alabama          2047780                 0.932                  0.0565          0.00594                  0.636
10 Alabama          2044740                 0.933                  0.0566          0.00595                  0.637
# ℹ 3,209 more rows

Measuring the Spread of Numerical Variables

Measuring the spread of numerical variables, typically through metrics like standard deviation or interquartile range, quantifies data variability. This offers insights into the dispersion around the central value, highlighting data consistency or variability, enabling better understanding, comparison, and risk assessment within the dataset.

# employment_r %>% dplyr::summarize(
#     "Mean Absolute Deviation (MAD)" = mad(employment_r),
#     "Variance" = var(employment_r),
#     "Standard Deviation" = sd(employment_r),
#     "Interquartile Range (IQR)" = IQR(employment_r)
# )
# StatsBase.mad(employment_jl.population)
# Statistics.var(employment_jl.population)
# Statistics.std(employment_jl.population)
# StatsBase.iqr(employment_jl.population)

Measuring the Range of Numerical Variables

Measuring the range of numerical variables provides the difference between the highest and lowest values, offering a quick snapshot of data spread. This simple metric highlights the extent of variability within the dataset, aiding in understanding the data’s breadth and potential outliers.

# employment_r %>% dplyr::summarize(
#     min = min(unemployment),
#     max = max(unemployment),
#     quantile = quantile(unemployment)
# )
# Statistics.quantile(employment_jl.population)

Using Built-In, Aggregate Summary Statistic Functions

To simply this task, R, Python, and Julia have built in functions that can be used to quickly summarize a dataset. These functions are summary() in R, describe() in Python, and describe() in Julia.

summary(employment_r)
    state              county          employed_civilian unemployed_civilian  armed_forces   not_in_labor_force
 Length:3219        Length:3219        Min.   :     40   Min.   :     0      Min.   :    0   Min.   :     10   
 Class :character   Class :character   1st Qu.:   4605   1st Qu.:   270      1st Qu.:    0   1st Qu.:   3240   
 Mode  :character   Mode  :character   Median :  10360   Median :   620      Median :   10   Median :   7180   
                                       Mean   :  40259   Mean   :  2280      Mean   :  250   Mean   :  23182   
                                       3rd Qu.:  27065   3rd Qu.:  1540      3rd Qu.:   40   3rd Qu.:  16740   
                                       Max.   :3927780   Max.   :331060      Max.   :46050   Max.   :2689920   
employment_py.describe()
       employed_civilian  unemployed_civilian  armed_forces  not_in_labor_force
count       3.219000e+03          3219.000000   3219.000000        3.219000e+03
mean        4.025878e+04          2280.015533    249.785648        2.318159e+04
std         1.282390e+05          9010.641408   1496.824456        7.822213e+04
min         4.000000e+01             0.000000      0.000000        1.000000e+01
25%         4.605000e+03           270.000000      0.000000        3.240000e+03
50%         1.036000e+04           620.000000     10.000000        7.180000e+03
75%         2.706500e+04          1540.000000     40.000000        1.674000e+04
max         3.927780e+06        331060.000000  46050.000000        2.689920e+06
DataFrames.describe(employment_jl)
6×7 DataFrame
 Row │ variable             mean     min        median   max      nmissing  eltype
     │ Symbol               Union…   Any        Union…   Any      Int64     DataType
─────┼───────────────────────────────────────────────────────────────────────────────
   1 │ state                         Alabama             Wyoming         0  String31
   2 │ county                        Abbeville           Ziebach         0  String
   3 │ employed_civilian    40258.8  40         10360.0  3927780         0  Int64
   4 │ unemployed_civilian  2280.02  0          620.0    331060          0  Int64
   5 │ armed_forces         249.786  0          10.0     46050           0  Int64
   6 │ not_in_labor_force   23181.6  10         7180.0   2689920         0  Int64

Further Readings

  • Hildebrand, D. K., Ott, R. L., & Gray, J. B. (2005). Basic Statistical Ideas for Managers (2nd ed.). Thompson Brooks/Cole.
  • Stine, R. & Foster, D. (2017). Statistics for Business: Decision Making and Analysis (3rd ed.). Pearson.
  • Tamhane, A. C. & Dunlop, D. D. (2000). Statistics and Data Analysis: From Elementary to Intermediate. Prentice Hall.
  • US Census Bureau. (2023, August 4). Decennial census of population and housing by decades. Census.gov. https://www.census.gov/programs-surveys/decennial-census/decade.2000.html
Applied Advanced Analytics & AI in Sports