Summary Statistics
Extracting an initial overview from the 2000 U.S. Census Bureau data—using R, Python, and Julia.
Summary statistics provide a concise overview of large datasets, simplifying complex information for initial analysis. Using a 2000 U.S. Census Bureau dataset (as an example), key summary statistics like mean, median, and mode help understand population characteristics, and offer insights into demographic trends. These statistics condense vast data into understandable figures—enabling policymakers to make informed decisions based on the population’s central tendencies and distributions.
Let’s explore implementing these ideas across three popular programming languages for statistical analysis.
Getting Started
If you are interested in reproducing this work, here are the versions of R, Python, and Julia that I used (as well as the respective packages for each). Additionally, my coding style here is verbose, in order to trace back where functions/methods and variables are originating from, and make this a learning experience for everyone—including me.
cat(
R.version$version.string, "-", R.version$nickname,
"\nOS:", Sys.info()["sysname"], R.version$platform,
"\nCPU:", benchmarkme::get_cpu()$no_of_cores, "x", benchmarkme::get_cpu()$model_name
)
R version 4.2.3 (2023-03-15) - Shortstop Beagle
OS: Darwin x86_64-apple-darwin17.0
CPU: 8 x Intel(R) Core(TM) i5-8259U CPU @ 2.30GHz
require(devtools)
devtools::install_version("dplyr", version="1.1.4", repos="http://cran.us.r-project.org")
library(dplyr)
import sys
import platform
import os
import cpuinfo
print(
"Python", sys.version,
"\nOS:", platform.system(), platform.platform(),
"\nCPU:", os.cpu_count(), "x", cpuinfo.get_cpu_info()["brand_raw"]
)
Python 3.11.4 (v3.11.4:d2340ef257, Jun 6 2023, 19:15:51) [Clang 13.0.0 (clang-1300.0.29.30)]
OS: Darwin macOS-10.16-x86_64-i386-64bit
CPU: 8 x Intel(R) Core(TM) i5-8259U CPU @ 2.30GHz
!pip install numpy==1.25.1
!pip install pandas==2.0.3
!pip install scipy==1.11.1
import numpy
import pandas
from scipy import stats
using InteractiveUtils
InteractiveUtils.versioninfo()
Julia Version 1.9.2
Commit e4ee485e909 (2023-07-05 09:39 UTC)
Platform Info:
OS: macOS (x86_64-apple-darwin22.4.0)
CPU: 8 × Intel(R) Core(TM) i5-8259U CPU @ 2.30GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-14.0.6 (ORCJIT, skylake)
Threads: 1 on 8 virtual cores
Environment:
DYLD_FALLBACK_LIBRARY_PATH = /Library/Frameworks/R.framework/Resources/lib:/Library/Java/JavaVirtualMachines/jdk1.8.0_241.jdk/Contents/Home/jre/lib/server
using Pkg
Pkg.add(name="HTTP", version="1.10.2")
Pkg.add(name="CSV", version="0.10.13")
Pkg.add(name="DataFrames", version="1.6.1")
Pkg.add(name="CategoricalArrays", version="0.10.8")
Pkg.add(name="StatsBase", version="0.34.2")
using HTTP
using CSV
using DataFrames
using CategoricalArrays
using StatsBase
Importing and Examining Dataset
As always, let’s kick-off by importing and examining the dataset.
employment_r <- read.csv("https://huggingface.co/datasets/michaelmallari/us-census/raw/main/2000-employment.csv")
str(object=employment_r)
'data.frame': 3219 obs. of 6 variables:
$ state : chr "Alabama" "Alabama" "Alabama" "Alabama" ...
$ county : chr "Autauga" "Baldwin" "Barbour" "Bibb" ...
$ employed_civilian : int 19600 62760 10190 7970 22700 3440 8010 47530 15700 10180 ...
$ unemployed_civilian: int 970 2670 610 530 980 320 950 3010 1090 410 ...
$ armed_forces : int 570 210 20 20 40 10 30 170 20 20 ...
$ not_in_labor_force : int 11100 42330 9000 6350 15180 3770 7230 35850 11380 8340 ...
head(x=employment_r, n=8)
state county employed_civilian unemployed_civilian armed_forces not_in_labor_force
1 Alabama Autauga 19600 970 570 11100
2 Alabama Baldwin 62760 2670 210 42330
3 Alabama Barbour 10190 610 20 9000
4 Alabama Bibb 7970 530 20 6350
5 Alabama Blount 22700 980 40 15180
6 Alabama Bullock 3440 320 10 3770
7 Alabama Butler 8010 950 30 7230
8 Alabama Calhoun 47530 3010 170 35850
tail(x=employment_r, n=8)
state county employed_civilian unemployed_civilian armed_forces not_in_labor_force
3212 Puerto Rico Trujillo Alto Municipio 23230 3020 60 28660
3213 Puerto Rico Utuado Municipio 6170 2660 0 16730
3214 Puerto Rico Vega Alta Municipio 8610 1950 30 16740
3215 Puerto Rico Vega Baja Municipio 14150 3720 30 27090
3216 Puerto Rico Vieques Municipio 1710 670 10 4270
3217 Puerto Rico Villalba Municipio 5120 2160 0 11630
3218 Puerto Rico Yabucoa Municipio 7240 2260 0 19170
3219 Puerto Rico Yauco Municipio 10180 3410 10 20300
employment_py = pandas.read_csv("https://huggingface.co/datasets/michaelmallari/us-census/raw/main/2000-employment.csv")
employment_py.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3219 entries, 0 to 3218
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 state 3219 non-null object
1 county 3219 non-null object
2 employed_civilian 3219 non-null int64
3 unemployed_civilian 3219 non-null int64
4 armed_forces 3219 non-null int64
5 not_in_labor_force 3219 non-null int64
dtypes: int64(4), object(2)
memory usage: 151.0+ KB
employment_py.head(n=8)
state county employed_civilian unemployed_civilian armed_forces not_in_labor_force
0 Alabama Autauga 19600 970 570 11100
1 Alabama Baldwin 62760 2670 210 42330
2 Alabama Barbour 10190 610 20 9000
3 Alabama Bibb 7970 530 20 6350
4 Alabama Blount 22700 980 40 15180
5 Alabama Bullock 3440 320 10 3770
6 Alabama Butler 8010 950 30 7230
7 Alabama Calhoun 47530 3010 170 35850
employment_py.tail(n=8)
state county employed_civilian unemployed_civilian armed_forces not_in_labor_force
3211 Puerto Rico Trujillo Alto Municipio 23230 3020 60 28660
3212 Puerto Rico Utuado Municipio 6170 2660 0 16730
3213 Puerto Rico Vega Alta Municipio 8610 1950 30 16740
3214 Puerto Rico Vega Baja Municipio 14150 3720 30 27090
3215 Puerto Rico Vieques Municipio 1710 670 10 4270
3216 Puerto Rico Villalba Municipio 5120 2160 0 11630
3217 Puerto Rico Yabucoa Municipio 7240 2260 0 19170
3218 Puerto Rico Yauco Municipio 10180 3410 10 20300
employment_jl = CSV.File(HTTP.get("https://huggingface.co/datasets/michaelmallari/us-census/raw/main/2000-employment.csv").body) |> DataFrames.DataFrame
3219×6 DataFrame
Row │ state county employed_civilian unemployed_civilian armed_forces not_in_labor_force
│ String31 String Int64 Int64 Int64 Int64
──────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ Alabama Autauga 19600 970 570 11100
2 │ Alabama Baldwin 62760 2670 210 42330
3 │ Alabama Barbour 10190 610 20 9000
4 │ Alabama Bibb 7970 530 20 6350
5 │ Alabama Blount 22700 980 40 15180
6 │ Alabama Bullock 3440 320 10 3770
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮
3215 │ Puerto Rico Vega Baja Municipio 14150 3720 30 27090
3216 │ Puerto Rico Vieques Municipio 1710 670 10 4270
3217 │ Puerto Rico Villalba Municipio 5120 2160 0 11630
3218 │ Puerto Rico Yabucoa Municipio 7240 2260 0 19170
3219 │ Puerto Rico Yauco Municipio 10180 3410 10 20300
3208 rows omitted
Handling Missing Data
Handling missing data ensures accurate analyses. Let’s pinpoint any missing data and determine appropriate actions.
Measuring the Center of Numerical Variables
Measuring the center of numerical variables, often through metrics like mean or median, provides a central tendency that summarizes data points. This helps in understanding typical or average values, aiding in analysis, comparison, and making informed decisions based on the core representation of the dataset. Reminder: Each observation (or row) in the dataset is at the county-level—a total of 3219
in the U.S., including Washington, DC and Puerto Rico.
# Create a custom function for mode, since there in none included in Base R
get_mode <- function(v) {
uniqv <- unique(v)
uniqv[which.max(tabulate(match(v, uniqv)))]
}
employment_r %>%
dplyr::summarize(
"Mean" = mean(employment_r$unemployed_civilian),
"Median" = median(employment_r$unemployed_civilian),
"Mode" = get_mode(employment_r$unemployed_civilian)
)
Mean Median Mode
1 2280 620 130
numpy.mean(employment_py.unemployed_civilian)
2280.0155327741536
numpy.median(employment_py.unemployed_civilian)
620.0
stats.mode(employment_py.unemployed_civilian)
ModeResult(mode=130, count=42)
Statistics.mean(employment_jl.unemployed_civilian)
2280.0155327741536
Statistics.median(employment_jl.unemployed_civilian)
620.0
StatsBase.mode(employment_jl.unemployed_civilian)
130
In this example, looking at the central tendencies in absolute values across all counties in the U.S. can be misleading. After all, a mean of 100 unemployed civilians in the state of Hawaii tells a different, and a mean of 100 unemployed civilians in California. Let’s look at the same data, but as a percentage of the total population for each state.
employment_state_r <- employment_r %>%
dplyr::group_by(state) %>%
summarize(
total_population = sum(employed_civilian) + sum(unemployed_civilian) + sum(armed_forces) + not_in_labor_force,
pct_employed_civilian = sum(employed_civilian) / total_population,
pct_unemployed_civilian = sum(unemployed_civilian) / total_population,
pct_armed_forces = sum(armed_forces) / total_population,
pct_not_in_labor_force = sum(not_in_labor_force) / total_population
)
Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in dplyr 1.1.0.
ℹ Please use `reframe()` instead.
ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()` always returns an ungrouped data frame and adjust accordingly.
Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
`summarise()` has grouped output by 'state'. You can override using the `.groups` argument.
employment_state_r
# A tibble: 3,219 × 6
# Groups: state [52]
state total_population pct_employed_civilian pct_unemployed_civilian pct_armed_forces pct_not_in_labor_force
<chr> <int> <dbl> <dbl> <dbl> <dbl>
1 Alabama 2047500 0.932 0.0565 0.00594 0.636
2 Alabama 2078730 0.918 0.0557 0.00585 0.626
3 Alabama 2045400 0.933 0.0566 0.00595 0.636
4 Alabama 2042750 0.934 0.0567 0.00596 0.637
5 Alabama 2051580 0.930 0.0564 0.00593 0.635
6 Alabama 2040170 0.935 0.0568 0.00597 0.638
7 Alabama 2043630 0.934 0.0567 0.00596 0.637
8 Alabama 2072250 0.921 0.0559 0.00587 0.628
9 Alabama 2047780 0.932 0.0565 0.00594 0.636
10 Alabama 2044740 0.933 0.0566 0.00595 0.637
# ℹ 3,209 more rows
Measuring the Spread of Numerical Variables
Measuring the spread of numerical variables, typically through metrics like standard deviation or interquartile range, quantifies data variability. This offers insights into the dispersion around the central value, highlighting data consistency or variability, enabling better understanding, comparison, and risk assessment within the dataset.
# employment_r %>% dplyr::summarize(
# "Mean Absolute Deviation (MAD)" = mad(employment_r),
# "Variance" = var(employment_r),
# "Standard Deviation" = sd(employment_r),
# "Interquartile Range (IQR)" = IQR(employment_r)
# )
# StatsBase.mad(employment_jl.population)
# Statistics.var(employment_jl.population)
# Statistics.std(employment_jl.population)
# StatsBase.iqr(employment_jl.population)
Measuring the Range of Numerical Variables
Measuring the range of numerical variables provides the difference between the highest and lowest values, offering a quick snapshot of data spread. This simple metric highlights the extent of variability within the dataset, aiding in understanding the data’s breadth and potential outliers.
# employment_r %>% dplyr::summarize(
# min = min(unemployment),
# max = max(unemployment),
# quantile = quantile(unemployment)
# )
# Statistics.quantile(employment_jl.population)
Using Built-In, Aggregate Summary Statistic Functions
To simply this task, R, Python, and Julia have built in functions that can be used to quickly summarize a dataset. These functions are summary()
in R, describe()
in Python, and describe()
in Julia.
summary(employment_r)
state county employed_civilian unemployed_civilian armed_forces not_in_labor_force
Length:3219 Length:3219 Min. : 40 Min. : 0 Min. : 0 Min. : 10
Class :character Class :character 1st Qu.: 4605 1st Qu.: 270 1st Qu.: 0 1st Qu.: 3240
Mode :character Mode :character Median : 10360 Median : 620 Median : 10 Median : 7180
Mean : 40259 Mean : 2280 Mean : 250 Mean : 23182
3rd Qu.: 27065 3rd Qu.: 1540 3rd Qu.: 40 3rd Qu.: 16740
Max. :3927780 Max. :331060 Max. :46050 Max. :2689920
employment_py.describe()
employed_civilian unemployed_civilian armed_forces not_in_labor_force
count 3.219000e+03 3219.000000 3219.000000 3.219000e+03
mean 4.025878e+04 2280.015533 249.785648 2.318159e+04
std 1.282390e+05 9010.641408 1496.824456 7.822213e+04
min 4.000000e+01 0.000000 0.000000 1.000000e+01
25% 4.605000e+03 270.000000 0.000000 3.240000e+03
50% 1.036000e+04 620.000000 10.000000 7.180000e+03
75% 2.706500e+04 1540.000000 40.000000 1.674000e+04
max 3.927780e+06 331060.000000 46050.000000 2.689920e+06
DataFrames.describe(employment_jl)
6×7 DataFrame
Row │ variable mean min median max nmissing eltype
│ Symbol Union… Any Union… Any Int64 DataType
─────┼───────────────────────────────────────────────────────────────────────────────
1 │ state Alabama Wyoming 0 String31
2 │ county Abbeville Ziebach 0 String
3 │ employed_civilian 40258.8 40 10360.0 3927780 0 Int64
4 │ unemployed_civilian 2280.02 0 620.0 331060 0 Int64
5 │ armed_forces 249.786 0 10.0 46050 0 Int64
6 │ not_in_labor_force 23181.6 10 7180.0 2689920 0 Int64
Further Readings
- Hildebrand, D. K., Ott, R. L., & Gray, J. B. (2005). Basic Statistical Ideas for Managers (2nd ed.). Thompson Brooks/Cole.
- Stine, R. & Foster, D. (2017). Statistics for Business: Decision Making and Analysis (3rd ed.). Pearson.
- Tamhane, A. C. & Dunlop, D. D. (2000). Statistics and Data Analysis: From Elementary to Intermediate. Prentice Hall.
- US Census Bureau. (2023, August 4). Decennial census of population and housing by decades. Census.gov. https://www.census.gov/programs-surveys/decennial-census/decade.2000.html