Base R can handle .csv and .txt files “out of the box”, but these can
have limitations. Note: With Rmd, comments/text do not
require the “#” comment symbol. Instead we specify where the
code
will go using “code chunks”, as below.
my_data <- read.csv("example_data.csv")
colnames(my_data)
## [1] "SEX" "EDATTAIN" "EMPSTAT" "AGE"
table(my_data$SEX)
##
## 1 2
## 2568 2432
table(my_data$EDATTAIN)
##
## 1 2 3 4 5
## 2488 1973 130 289 120
table(my_data$EMPSTAT)
##
## 1 2 3 4 5
## 1096 1873 121 1900 10
To read in different file types, we will need to download R packages.
R packages are add-ons that expand the functionality of base R. Its helpful to think of R as a series of books. “Base R” is the starter set of “books” you get when you first install R. install.packages() downloads a new “book” and adds it to your collection. This step needs to be done only once, or periodically to update.
Note: quotation marks (““) ARE REQUIRED to specify a package name.
if(!require("openxlsx")) {
install.packages("openxlsx")
}
## Loading required package: openxlsx
if (!(require("haven"))) {
install.packages("haven")
}
## Loading required package: haven
if(!require("ggplot2")) {
install.packages("ggplot2")
}
## Loading required package: ggplot2
if (!(require("gtsummary"))) {
install.packages("gtsummary")
}
## Loading required package: gtsummary
openxlsx provides functions to read and write Microsoft Excel files.
haven provides functions to read from various stats packages, including SPSS, Stata, and SAS. Notably, haven is able to read in values and value labels.
ggplot2 streamlines the creation of plots to visualize your data.
gtsummary streamlines the creation of tables to summarize your data.
In order to load the package, we use library(). This is like taking the book off your shelf and opening it to read/use. This step needs to be done each time you open R/RStudio, so it’s helpful to have a section at the top of the script that loads all packages needed.
Note: quotation marks (““) are NOT required for the library function.
library(haven)
library(openxlsx)
library(ggplot2)
library(gtsummary)
Excel can be convenient for data management, but like .csv files, it does not allow you to store a numeric code and a text label in a single cell. Understanding coded values will require referring to a separate metadata file.
my_data_xlsx <- read.xlsx("example_data.xlsx")
head(my_data_xlsx)
## SEX EDATTAIN EMPSTAT AGE
## 1 1 2 2 29
## 2 1 1 2 17
## 3 1 2 2 32
## 4 1 1 4 69
## 5 2 1 1 8
## 6 2 2 4 27
#### Summarize (Frequency Tables)
table(my_data_xlsx$SEX)
##
## 1 2
## 2568 2432
table(my_data_xlsx$EDATTAIN)
##
## 1 2 3 4 5
## 2488 1973 130 289 120
table(my_data_xlsx$EMPSTAT)
##
## 1 2 3 4 5
## 1096 1873 121 1900 10
The haven package provides functions to read and write the formatted data files used by Stata, SPSS, and SAS. This package makes it easier to get your data into R, and also provides support for LABELED VALUES.
haven labels are complex; converting to the factor format, which is native to R, will preserve the labels and work more smoothly with other R functions.
We use as_factor
from the haven
package to
simplify the labeled values.
my_data <- read_dta("example_data.dta")
head(my_data) # It's still numeric! and confusing
## # A tibble: 6 × 4
## SEX EDATTAIN EMPSTAT AGE
## <dbl+lbl> <dbl+lbl> <dbl+lbl> <dbl>
## 1 1 [Male] 2 [Primary completed] 2 [Employed] 29
## 2 1 [Male] 1 [Less than primary completed] 2 [Employed] 17
## 3 1 [Male] 2 [Primary completed] 2 [Employed] 32
## 4 1 [Male] 1 [Less than primary completed] 4 [Inactive] 69
## 5 2 [Female] 1 [Less than primary completed] 1 [NIU (not in universe)] 8
## 6 2 [Female] 2 [Primary completed] 4 [Inactive] 27
## as_factor in the haven package to convert to factors for further analysis
my_data$SEX <- as_factor(my_data$SEX)
my_data$EDATTAIN <- as_factor(my_data$EDATTAIN)
my_data$EMPSTAT <- as_factor(my_data$EMPSTAT)
head(my_data) # Labels present!
## # A tibble: 6 × 4
## SEX EDATTAIN EMPSTAT AGE
## <fct> <fct> <fct> <dbl>
## 1 Male Primary completed Employed 29
## 2 Male Less than primary completed Employed 17
## 3 Male Primary completed Employed 32
## 4 Male Less than primary completed Inactive 69
## 5 Female Less than primary completed NIU (not in universe) 8
## 6 Female Primary completed Inactive 27
table(my_data$SEX)
##
## Male Female
## 2568 2432
table(my_data$EDATTAIN)
##
## Less than primary completed Primary completed
## 2488 1973
## Secondary completed University completed
## 130 289
## Unknown
## 120
table(my_data$EMPSTAT)
##
## NIU (not in universe) Employed Unemployed
## 1096 1873 121
## Inactive Unknown/missing
## 1900 10
summary(my_data$AGE)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 11.00 23.00 26.36 39.00 99.00
table(my_data$EDATTAIN, my_data$SEX)
##
## Male Female
## Less than primary completed 1270 1218
## Primary completed 1019 954
## Secondary completed 57 73
## University completed 163 126
## Unknown 59 61
Use the subset()
function specifying a DATA.FRAME
followed by an expression that will return a LOGICAL (TRUE or FALSE)
value.
sub_data <- subset(my_data, EDATTAIN != "Unknown")
table(sub_data$EDATTAIN) ## empty levels present
##
## Less than primary completed Primary completed
## 2488 1973
## Secondary completed University completed
## 130 289
## Unknown
## 0
R factors will keep empty levels, unless we explicitly tell R to drop them.
sub_data <- droplevels(sub_data) ## drop removed labels
# This time we save the resulting table to an object for further analysis
crosstab <- table(sub_data$EDATTAIN, sub_data$SEX)
Calling an R object will print the contents of that object. Small objects may be printed in the console/Rmd report, but larger onces might get truncated:
## A reasonable table
crosstab
##
## Male Female
## Less than primary completed 1270 1218
## Primary completed 1019 954
## Secondary completed 57 73
## University completed 163 126
## Large objects may get truncated/summarized
my_data
## # A tibble: 5,000 × 4
## SEX EDATTAIN EMPSTAT AGE
## <fct> <fct> <fct> <dbl>
## 1 Male Primary completed Employed 29
## 2 Male Less than primary completed Employed 17
## 3 Male Primary completed Employed 32
## 4 Male Less than primary completed Inactive 69
## 5 Female Less than primary completed NIU (not in universe) 8
## 6 Female Primary completed Inactive 27
## 7 Male Less than primary completed NIU (not in universe) 8
## 8 Female Primary completed Employed 20
## 9 Male Less than primary completed Inactive 12
## 10 Female Secondary completed Employed 30
## # … with 4,990 more rows
Options to write R objects out to various file types. Note the similarities and differences in syntax. In most file-writing functions, the first argument is the R object to be written out.
write.csv(crosstab, file = "xtab_emp_by_sex.csv")
write.xlsx(crosstab, file = "xtab_emp_by_sex.xlsx")
write_dta(as.data.frame(crosstab), path = "xtab_emp_by_sex.dta")
## note different syntax (argument name is `path` instead of `file`)
Creating tables in base R can be a bit clunky. If you want a table of proportions or percentages, it may take a few steps.
crosstab ## print frequencies
##
## Male Female
## Less than primary completed 1270 1218
## Primary completed 1019 954
## Secondary completed 57 73
## University completed 163 126
ppn_crosstab <- prop.table(crosstab) ## get proportions
ppn_crosstab
##
## Male Female
## Less than primary completed 0.26024590 0.24959016
## Primary completed 0.20881148 0.19549180
## Secondary completed 0.01168033 0.01495902
## University completed 0.03340164 0.02581967
ppn_crosstab <- round(ppn_crosstab, 2) # round to 2 digits
ppn_crosstab
##
## Male Female
## Less than primary completed 0.26 0.25
## Primary completed 0.21 0.20
## Secondary completed 0.01 0.01
## University completed 0.03 0.03
gtsummary
produces nice all-in-one tables that integrate
nicely into HTML reports created with R Markdown.
tbl_summary(sub_data, by = "SEX")
Characteristic | Male, N = 2,5091 | Female, N = 2,3711 |
---|---|---|
EDATTAIN | ||
Less than primary completed | 1,270 (51%) | 1,218 (51%) |
Primary completed | 1,019 (41%) | 954 (40%) |
Secondary completed | 57 (2.3%) | 73 (3.1%) |
University completed | 163 (6.5%) | 126 (5.3%) |
EMPSTAT | ||
NIU (not in universe) | 585 (23%) | 511 (22%) |
Employed | 1,183 (47%) | 583 (25%) |
Unemployed | 72 (2.9%) | 49 (2.1%) |
Inactive | 667 (27%) | 1,222 (52%) |
Unknown/missing | 2 (<0.1%) | 6 (0.3%) |
AGE | 22 (10, 39) | 23 (11, 39) |
1 n (%); Median (IQR) |
barplot(crosstab)
Base R provides MANY built in functions, each with many options to customize your plots. The flexibility/customization comes at the trade-off of user input. Adding a legend takes two extra steps in Base R.
First, we use layout() to tell R to treat the plot window as a matrix(), consisting of 1 row and 4 columns: 1,1,2,2. The first plot will occupy the first two columns (as a merged column), while the second plot will occupy the last set of columns.
Make an empty plot (of a single point) to use as the base for the legend. “n” indicates “null” for many arguments to the plot function.
## plot matrix
layout(matrix(c(1,1,2,2), ncol = 4))
## barplot as normal
barplot(crosstab, col = rainbow(4), main = "Education by Sex")
## Make an empty plot
plot(
1,
type = "n", ## do not plot the points
bty = "n", ## do not plot a bounding box
xaxt = "n", ## do not plot an X axis
xlab = "", ## label the x axis as "", in other words, as blank
yaxt = "n", ## do not plot a Y axis
ylab ="", ## label the Y axis as "", in other words, as blank
)
legend(
"center",
legend = levels(sub_data$EDATTAIN),
pch = 22,
pt.bg = rainbow(4),
inset = c(-0.2,0),
title = "Ed Attainment"
)
ggplot2
is EXTREMELY popular for quickly generating
polished looking plots. It does use a slightly different syntax than
Base R, but there is ample user-support.
qplot(x = SEX, fill = EDATTAIN, data = sub_data, geom = "bar")