A guide to bar charts for Likert-type psychometric scales built with ggplot2.
As we’ve mentioned throughout this series, one of the most important focus areas of the new PMA COVID-19 survey has to do with perceptions of risk expressed by women during the early months of the pandemic. Because all respondents to the COVID-19 survey are participants in a multi-year panel study examining broad topics in reproductive health, analysts will soon be able to link women’s attitudes and beliefs about COVID-19 during the summer of 2020 to longer-term health and family planning outcomes.
In this post, we’ll examine one of the most common data visualization tools used to explore attitudinal data: the bar chart. In the PMA COVID-19 survey, women are asked to rate their level of concern for several different types of risk associated with the pandemic. The survey uses a four-point scale for such questions, and it includes the following response options:
This type of scale is common in psychometric research, particularly where analysts want to compare attitudes about a wide range of topics. You might notice that the responses follow a bi-polar format, where more neutral responses are organized at the center, and more extreme responses are listed on either side. This type for scale is sometimes called the Likert scale after the pioneering social psychologist, Rensis Likert.
The bar chart is typically used for Likert-type data because:
We’ll discuss some of the many choices you’ll have to make about layout, and we’ll show how to implement them with the tidyverse
package ggplot2.
You’ll find the data featured in this post if you navigate to the new COVID-19 Unit of Analysis in the IPUMS PMA data extract system. Our examples feature data from all four samples (Female Respondents only).
To follow along, make sure to create an extract that includes these variables:
You’ll also need to install the following packages as needed (current versions are recommended):
Load those packages and your data extract into R (be sure to change the file paths to match the location of your own extract):
To start, let’s take a look at just one of the variables that uses the Likert-type scale shown above. In COVIDCONCERN
, women who have not already been infected with COVID-19 are asked to rate their level of concern for becoming infected:
How concerned are you about getting infected yourself?
(Read all options)
[] Very concerned
[] Concerned
[] A little concerned
[] Not concerned
[] I am currently / was infected with COVID-19
[] No response
Let’s break down the responses to this question by COUNTRY
. First, following the explanation in our last post, we’d strongly recommend transforming both variables into factor objects (this will ensure that their value labels are displayed in graphics output). We’ll also edit the COUNTRY
labels for DRC and Nigeria, and we’ll describe the NIU
cases for COVIDCONCERN
as women who Never heard or read about COVID-19
.
covid <- covid %>%
mutate(
across(where(is.labelled), ~as_factor(.x) %>% fct_drop),
COUNTRY = COUNTRY %>%
fct_recode(
`DRC (Kinshasa)` = "Congo, Democratic Republic",
`Nigeria (Lagos & Kano)` = "Nigeria"
),
COVIDCONCERN = COVIDCONCERN %>%
fct_recode(
`Never heard or read about COVID-19` = "NIU (not in universe)"
)
)
Using the gtsummary
package featured in our last post, you might preview the breakdown of COVIDCONCERN
by COUNTRY
in a table as follows:
covid %>% tbl_summary(by = COUNTRY, include = COVIDCONCERN)
Characteristic | Burkina Faso, N = 3,5281 | DRC (Kinshasa), N = 1,3241 | Kenya, N = 5,9861 | Nigeria (Lagos & Kano), N = 1,3461 |
---|---|---|---|---|
Concerned about getting infected | ||||
Not concerned | 238 (6.7%) | 259 (20%) | 236 (3.9%) | 39 (2.9%) |
A little concerned | 365 (10%) | 161 (12%) | 171 (2.9%) | 30 (2.2%) |
Concerned | 752 (21%) | 210 (16%) | 759 (13%) | 170 (13%) |
Very concerned | 2,168 (61%) | 689 (52%) | 4,818 (80%) | 1,100 (82%) |
Currently / previously infected with COVID-19 | 2 (<0.1%) | 1 (<0.1%) | 0 (0%) | 2 (0.1%) |
No response or missing | 0 (0%) | 2 (0.2%) | 2 (<0.1%) | 1 (<0.1%) |
Never heard or read about COVID-19 | 3 (<0.1%) | 2 (0.2%) | 0 (0%) | 4 (0.3%) |
1
n (%)
|
Now we’re ready to begin arranging these summary data into a bar chart with ggplot2
.
As you might know, ggplot2 is part of the tidyverse
family of packages. For regular readers of this blog, this means that you’ll be able to use the same grammar that you’re used to seeing elsewhere, but with one important difference: while you’ll be able to pipe functions to ggplot()
with the familiar %>%
operator, functions within the package use their own pipe-like operator +
.
This +
operator allows the user to assemble multiple layers of visual information onto the same plot. These layers are built from functions that start with the prefix geom_
, because each layer is more-or-less defined by the geometric shapes that convey information about the data.
While these geom_
functions are simple to use and combine, you’ll first need to define some common parameters with the function ggplot()
. This function initializes a kind of “skeleton” plot - or canvas - onto which you’ll layer each geom_
function. Usually, you’ll identify variables here that you’ll want to map onto the x and y-axes, or onto the “fill” styles (e.g. colors and shadings) within your plot’s geometric shapes.
We’ll use the geom_bar()
function after we define some basic parameters for our plot in ggplot()
:
covid %>%
ggplot(aes(x = COUNTRY, fill = COVIDCONCERN)) +
geom_bar()
In the above function, we initialize our plot with ggplot()
and define its basic aesthetic qualities with aes()
: we specify that we’ll plot each COUNTRY
on the x-axis, and - in whatever geometric shapes we draw next - we’ll fill its segments with colors defined by COVIDCONCERN
. Note that the ggplot()
function doesn’t draw anything, itself. Instead, we pipe ggplot
to geom_bar()
, which is responsible for drawing and stacking the bars.
But what about the values that appeared on the y-axis? We didn’t specify anything in our data, but it seems like ggplot()
automatically calculated the number of women in each country who selected each response. While this might be a useful default in some situations, here we’d much rather normalize these bars as a percentage of the total number of responses for each sample. We’ll do this by manipulating the position
argument in geom_bar()
.
The position
argument in geom_bar()
determines how the bars representing each response should be arranged on our plot. This argument can take one of several position adjustment functions, and its default behavior uses position_stack() to “stack” bars representing the frequency of each response. This kind of bar chart is known as a stacked bar chart.
If we want to normalize the size of our bars to the size of each sample, we can use position_fill() to stretch each stack of bars to an equal length. The result allows us to compare the proportion of responses across samples:
covid %>%
ggplot(aes(x = COUNTRY, fill = COVIDCONCERN)) +
geom_bar(position = position_fill())
This arrangement is helpful for comparing more extreme responses, but you may notice that it’s still a bit hard to compare the proportion of moderate responses in the middle of each stack. For this reason, you might consider using position_dodge() to transform our stacked bar chart into a grouped bar chart.
covid %>%
ggplot(aes(x = COUNTRY, fill = COVIDCONCERN)) +
geom_bar(position = position_dodge())
Unfortunately, when we switch position
from position_fill()
to position_dodge()
, we’re no longer able to stretch each bar to a normalized length. Instead, we’ll need to pre-calculate the proportion of each response and pass it to geom_bar()
via the stat
argument.
In each of the above plots, we’ve relied on the default behavior of geom_bar()
to calculate the frequency of each response and - when requested - to stretch each bar to a normalized length. There are many reasons why you might want to pass your own statistics to geom_bar()
, and you can do so with the argument stat = "identity"
.
For example, we might create a table of summary statistics showing the proportion of responses to COVIDCONCERN
by COUNTRY
:
concern_tbl <- covid %>%
as_survey_design() %>%
group_by(COUNTRY, COVIDCONCERN) %>%
summarise(PERCENT = 100 * survey_mean(vartype = NULL))
concern_tbl
# A tibble: 25 x 3
# Groups: COUNTRY [4]
COUNTRY COVIDCONCERN PERCENT
<fct> <fct> <dbl>
1 Burkina Faso Not concerned 6.75
2 Burkina Faso A little concerned 10.3
3 Burkina Faso Concerned 21.3
4 Burkina Faso Very concerned 61.5
5 Burkina Faso Currently / previously infected with COVID-… 0.0567
6 Burkina Faso Never heard or read about COVID-19 0.0850
7 DRC (Kinshasa) Not concerned 19.6
8 DRC (Kinshasa) A little concerned 12.2
9 DRC (Kinshasa) Concerned 15.9
10 DRC (Kinshasa) Very concerned 52.0
# … with 15 more rows
Now, if we pass our summary table concern_tbl
to ggplot()
, we’ll be able to map response percentages in the PERCENT
column to the y-axis. In the geom_bar()
function, we’ll use stat = "identity"
to ensure that our pre-calculated statistics are displayed:
concern_tbl %>%
ggplot(aes(x = COUNTRY, fill = COVIDCONCERN, y = PERCENT)) +
geom_bar(
position = position_dodge(),
stat = "identity"
)
You might also consider pre-calculating statistics if you want to add layers of text or error bars to your plot. As we’ve discussed elsewhere, we love using as_survey_design()
and survey_mean()
from the srvyr package to generate population-level estimates with cluster-robust standard errors. Here, we’ll use CVQWEIGHT
as a weighting variable and EAID
as the identification number for each sample cluster, thus creating a population-level summary table called concern_pop
:
concern_pop <- covid %>%
as_survey_design(weight = CVQWEIGHT, id = EAID) %>%
group_by(COUNTRY, COVIDCONCERN) %>%
summarise(PERCENT = 100 * survey_mean(vartype = "ci"))
concern_pop
# A tibble: 25 x 5
# Groups: COUNTRY [4]
COUNTRY COVIDCONCERN PERCENT PERCENT_low PERCENT_upp
<fct> <fct> <dbl> <dbl> <dbl>
1 Burkina F… Not concerned 6.18 3.70 8.66
2 Burkina F… A little concerned 8.35 5.61 11.1
3 Burkina F… Concerned 17.7 14.3 21.2
4 Burkina F… Very concerned 67.7 62.1 73.3
5 Burkina F… Currently / previously … 0.0251 -0.0103 0.0604
6 Burkina F… Never heard or read abo… 0.0484 -0.0149 0.112
7 DRC (Kins… Not concerned 18.8 14.9 22.7
8 DRC (Kins… A little concerned 9.77 7.71 11.8
9 DRC (Kins… Concerned 17.0 13.0 21.1
10 DRC (Kins… Very concerned 54.0 48.5 59.5
# … with 15 more rows
Note the addition of PERCENT_low
and PERCENT_upp
, representing the lower and upper bounds of a 95% confidence interval for each population-level estimate of PERCENT
. We’ll use these in a new layer created by geom_errorbar()
:
concern_pop %>%
ggplot(aes(x = COUNTRY, fill = COVIDCONCERN, y = PERCENT)) +
geom_bar(
position = position_dodge(),
stat = "identity"
) +
geom_errorbar(
aes(ymin = PERCENT_low, ymax = PERCENT_upp),
width = 0.2,
position = position_dodge(width = 0.9)
)
Likewise, pre-calculating statistics in a table like concern_pop
makes it easy to access statistics by name in geom_text()
. In this example, adding the text label for each value of PERCENT
is redundant with the y-axis (not recommended), but you could also include text from any column in the pre-calculated table:
concern_pop %>%
ggplot(aes(x = COUNTRY, fill = COVIDCONCERN, y = PERCENT)) +
geom_bar(
position = "dodge",
stat = "identity"
) +
geom_text(
aes(label = round(PERCENT, 0)),
position = position_dodge(0.9),
vjust = -0.5
)
So far, we’ve focused all of our attention on passing the correct statistics to geom_bar()
. Unfortunately, this is only half the battle: our plots still aren’t very readable!
You may have noticed, for example, that the legend in each of our plots seems to take up about one third of the usable space. In a blog like ours - where many of you might be reading this post on a mobile phone - this layout is certainly not ideal. Instead, we’ll flip the x and y-axes, and then we’ll position the legend below the plot.
For example, let’s return to the stacked bar chart showing the population-level percentages for each response:
concern_pop %>%
ggplot(aes(x = COUNTRY, fill = COVIDCONCERN, y = PERCENT)) +
geom_bar(stat = "identity") +
coord_flip() +
theme(legend.position = "bottom")
The function coord_flip() pivots our plot into a horizontal orientation, and another new function - theme() - allows us to move our legend. However, the legend now occupies too much horizontal space: one of the responses appears to be cut-off by the right-hand margin of the page.
It is possible to manipulate the layout of your legend with another function, guides(). For example, you might arrange the response codes into two separate columns:
concern_pop %>%
ggplot(aes(x = COUNTRY, fill = COVIDCONCERN, y = PERCENT)) +
geom_bar(stat = "identity") +
coord_flip() +
theme(legend.position = "bottom") +
guides(fill = guide_legend(ncol = 2))
However, in our particular case, it might make more sense to simply drop the three types of non-response completely. We might do so, in part, because the remaining responses are part of an ordinal set. If we restrict the plotted values only to those ordinal responses, we’ll also be able to add an ordinal color scheme to our plot making the relationship between each response much clearer.
An easy way to drop non-response options in our particular case is to filter
only those four responses containing the word “concern” (upper or lower case). Then, when only the four ordinal responses remain, we’ll use scale_fill_brewer() to select an ordinal color scheme (“blues” by default). This time, we’ll use guides()
to reverse the order of the responses in our legend:
You can choose from several color palettes with scale_fill_brewer()
, or you can define your own colors using scale_fill_manual(), where you’ll assign a color to each response via a named character vector.
For example, here we’ll use some of the hex color codes you’ll see in the CSS throughout this blog:
concern_pop %>%
filter(grepl("concern", COVIDCONCERN, ignore.case = TRUE)) %>%
ggplot(aes(x = COUNTRY, fill = COVIDCONCERN, y = PERCENT)) +
geom_bar(stat = "identity") +
coord_flip() +
theme(legend.position = "bottom") +
scale_fill_manual(
values = alpha(
colour = c(
"Very concerned" = "#00263A", # IPUMS Navy
"Concerned" = "#4E6C7D", # IPUMS Dark-Grey
"A little concerned" = "#7A99AC", # IPUMS Blue-Grey
"Not concerned" = "#F1F5F7" # IPUMS Light-Grey
)
)
) +
guides(fill = guide_legend(reverse = TRUE))
There are several ways to add text labels to a plot, but we find it easiest to define every label together in a single function, labs(). If you want to omit a particular label, you can simply set it to NULL
:
concern_pop %>%
filter(grepl("concern", COVIDCONCERN, ignore.case = TRUE)) %>%
ggplot(aes(x = COUNTRY, fill = COVIDCONCERN, y = PERCENT)) +
geom_bar(stat = "identity") +
coord_flip() +
theme(legend.position = "bottom") +
scale_fill_manual(
values = alpha(
colour = c(
"Very concerned" = "#00263A",
"Concerned" = "#4E6C7D",
"A little concerned" = "#7A99AC",
"Not concerned" = "#F1F5F7"
)
)
) +
guides(fill = guide_legend(reverse = TRUE)) +
labs(
title = "CONCERN ABOUT GETTING INFECTED WITH COVID-19",
subtitle = "Estimated percentage for populations of women age 15-49 (summer 2020)",
x = NULL,
y = NULL,
fill = NULL
)
As you can see, the default label fonts for ggplot2
do not match the fonts used on our blog. If this is an important consideration, you can download a .ttf
file for your preferred font from a repository like Google Fonts, and then load that file into R with font_add()
.
font_add(
family = "cabrito",
regular = "fonts/cabritosansnormregular-webfont.ttf"
)
Once you’ve loaded a font into R, you can make it accessible to ggplot2
for the remainder of your R session with the function showtext::showtext_auto().
Now, we can build on our custom theme()
by defining a general font family and size in text
. We can also tweak specific details for the title
and plot.subtitle
:
concern_pop %>%
filter(grepl("concern", COVIDCONCERN, ignore.case = TRUE)) %>%
ggplot(aes(x = COUNTRY, fill = COVIDCONCERN, y = PERCENT)) +
geom_bar(stat = "identity") +
coord_flip() +
scale_fill_manual(
values = alpha(
colour = c(
"Very concerned" = "#00263A",
"Concerned" = "#4E6C7D",
"A little concerned" = "#7A99AC",
"Not concerned" = "#F1F5F7"
)
)
) +
guides(fill = guide_legend(reverse = TRUE)) +
labs(
title = "CONCERN ABOUT GETTING INFECTED WITH COVID-19",
subtitle = "Estimated percentage for populations of women age 15-49 (summer 2020)",
x = NULL,
y = NULL,
fill = NULL
) +
theme(
text = element_text(family = "cabrito", size = 10),
title = element_text(size = 14, color = "#00263A"),
plot.subtitle = element_text(size = 12),
legend.position = "bottom"
)
In the previous section, we decided to drop the three types of non-response for COVIDCONCERN
so that we could adopt an ordinal color scheme (each color corresponds with an ordinal level of concern). This improved the readability of our plot by making the relationship between response options more clear. However, this decision also came with a small cost: because our bars no longer represent 100% of each population, it’s a bit harder to compare the percentage of women represented by the responses on the right side of the plot (“Not concerned”).
In this case, you might consider the divergent stacked bar chart, where “positive” and “negative” levels of concern are plotted in opposite directions from an origin point on our x-axis. You might also consider this if you wanted to directly juxtapose the most extreme responses on our scale: “Very concerned” and “Not concerned”.
Note that three of the responses on our scale reflect some level of concern about getting infected with COVID-19; we’ll plot these responses in the positive direction on our x-axis. The negative response - “Not concerned” - will be plotted in the negative direction if we multiply PERCENT
by -1 for those cases. We’ll also give our negative response a secondary color (“PMA Pink”) and draw a vertical line at the origin to provide extra clarity. Finally, we’ll use a new function breaks()
to fully customize the order of responses in our legend:
concern_pop <- concern_pop %>%
mutate(PERCENT = if_else(
COVIDCONCERN == "Not concerned",
-PERCENT, # Multiply by -1
PERCENT
)) %>%
filter(grepl("concern", COVIDCONCERN, ignore.case = T))
concern_pop
# A tibble: 16 x 5
# Groups: COUNTRY [4]
COUNTRY COVIDCONCERN PERCENT PERCENT_low PERCENT_upp
<fct> <fct> <dbl> <dbl> <dbl>
1 Burkina Faso Not concerned -6.18 3.70 8.66
2 Burkina Faso A little concer… 8.35 5.61 11.1
3 Burkina Faso Concerned 17.7 14.3 21.2
4 Burkina Faso Very concerned 67.7 62.1 73.3
5 DRC (Kinshasa) Not concerned -18.8 14.9 22.7
6 DRC (Kinshasa) A little concer… 9.77 7.71 11.8
7 DRC (Kinshasa) Concerned 17.0 13.0 21.1
8 DRC (Kinshasa) Very concerned 54.0 48.5 59.5
9 Kenya Not concerned -4.84 2.44 7.23
10 Kenya A little concer… 3.45 2.12 4.79
11 Kenya Concerned 13.1 10.4 15.7
12 Kenya Very concerned 78.6 74.6 82.7
13 Nigeria (Lagos & … Not concerned -3.17 1.45 4.89
14 Nigeria (Lagos & … A little concer… 2.04 0.899 3.19
15 Nigeria (Lagos & … Concerned 14.1 6.18 22.0
16 Nigeria (Lagos & … Very concerned 80.3 72.6 87.9
concern_pop %>%
ggplot(aes(x = PERCENT, y = COUNTRY, fill = COVIDCONCERN)) +
geom_bar(stat = "identity") +
# draws a vertical line at 0 on the x axis
geom_vline(xintercept = 0) +
# define fill colors (values) and the arrangement of the legend (breaks)
scale_fill_manual(
values = alpha(
colour = c(
"Very concerned" = "#00263A", # PMA Pink
"Concerned" = "#4E6C7D", # IPUMS Dark-Grey
"A little concerned" = "#7A99AC", # IPUMS Blue-Grey
"Not concerned" = "#98579B" # IPUMS Light-Grey
)
),
breaks = c(
"Not concerned",
"Very concerned",
"Concerned",
"A little concerned"
)
) +
# define labels (labs) and format them as desired (theme)
labs(
title = "CONCERN ABOUT GETTING INFECTED WITH COVID-19",
subtitle = "Estimated percentage for populations of women age 15-49 (summer 2020)",
x = NULL,
y = NULL,
fill = NULL
) +
theme(
text = element_text(family = "cabrito", size = 10),
title = element_text(size = 14, color = "#00263A"),
plot.subtitle = element_text(size = 12),
legend.position = "bottom",
)
While it is still difficult to compare women who are “Concerned” or “A little concerned”, this type of chart makes it easy to compare the most extreme response while also comparing the full set of negative response to the full set of positive responses.
A word of caution: data visualization experts disagree about what to do with middle / neutral responses. While it’s possible to distribute these responses in halves on the outside or in the middle of each bar stack, we much prefer a facet showing both neutral and non-response options to the side.
All of the plots we’ve explored so far have contained a single panel, where both the x and y-axes are uninterrupted for the full width of the display. In some cases, you may want to facet multiple panels together.
For example, consider the variable PREGFEELNOW
, in which women describe how they would feel if they became pregnant “now”. As we’ll see, this variable contains both a large number of middle / neutral responses (“Mixed happy and unhappy”) and a large number of non-responses (e.g. women who were pregnant at the time, or who simply did not respond). We’ll use a facet to show these responses in a separate panel alongside those who provided a positive or negative opinion.
First, we’ll create a summary table and clean up the factor levels for clarity. As we’ve shown above, we’ll make PERCENT
a negative value for negative responses (“Very unhappy” or “Sort of unhappy”):
pg_tbl <- covid %>%
as_survey_design(weight = CVQWEIGHT, id = EAID) %>%
group_by(COUNTRY, PREGFEELNOW) %>%
summarise(PERCENT = 100 * survey_mean(vartype = NULL)) %>%
mutate(
PREGFEELNOW = factor(
PREGFEELNOW,
levels = c(
"Sort of unhappy",
"Very unhappy",
"Sort of happy",
"Very happy",
"No response or missing",
"NIU (not in universe)",
"Mixed happy and unhappy"
)
) %>% fct_recode(`Currently Pregnant` = "NIU (not in universe)"),
PERCENT = if_else(
PREGFEELNOW %in% c("Very unhappy", "Sort of unhappy"),
-PERCENT,
PERCENT
)
)
pg_tbl
# A tibble: 28 x 3
# Groups: COUNTRY [4]
COUNTRY PREGFEELNOW PERCENT
<fct> <fct> <dbl>
1 Burkina Faso Very unhappy -29.1
2 Burkina Faso Sort of unhappy -11.1
3 Burkina Faso Mixed happy and unhappy 5.65
4 Burkina Faso Sort of happy 15.0
5 Burkina Faso Very happy 30.8
6 Burkina Faso No response or missing 0.333
7 Burkina Faso Currently Pregnant 7.99
8 DRC (Kinshasa) Very unhappy -44.6
9 DRC (Kinshasa) Sort of unhappy -11.2
10 DRC (Kinshasa) Mixed happy and unhappy 5.90
# … with 18 more rows
Next, we’ll create a new column that indicates whether we want each response to appear in the second panel in our faceted display. Let’s call this column ASIDE
:
pg_tbl <- pg_tbl %>%
mutate(ASIDE = PREGFEELNOW %in% c(
"Mixed happy and unhappy",
"No response or missing",
"Currently Pregnant"
))
pg_tbl
# A tibble: 28 x 4
# Groups: COUNTRY [4]
COUNTRY PREGFEELNOW PERCENT ASIDE
<fct> <fct> <dbl> <lgl>
1 Burkina Faso Very unhappy -29.1 FALSE
2 Burkina Faso Sort of unhappy -11.1 FALSE
3 Burkina Faso Mixed happy and unhappy 5.65 TRUE
4 Burkina Faso Sort of happy 15.0 FALSE
5 Burkina Faso Very happy 30.8 FALSE
6 Burkina Faso No response or missing 0.333 TRUE
7 Burkina Faso Currently Pregnant 7.99 TRUE
8 DRC (Kinshasa) Very unhappy -44.6 FALSE
9 DRC (Kinshasa) Sort of unhappy -11.2 FALSE
10 DRC (Kinshasa) Mixed happy and unhappy 5.90 TRUE
# … with 18 more rows
Our plot will look similar to the divergent bar chart we made in the previous section, but will now add a new function facet_grid() that divides pg_tbl
into separate panels defined by ASIDE
:
pg_tbl %>%
ggplot(aes(x = COUNTRY, fill = PREGFEELNOW, y = PERCENT)) +
geom_bar(stat = "identity") +
coord_flip() +
# facet_grid() can distribute facets in rows and/or columns
facet_grid(
cols = vars(ASIDE), # here, we choose columns defined by ASIDE
scales = "free", # "free" scales allows for independent facet scales
space = "free" # "free" space allows for independent facet widths
) +
# we define fill colors (values) and the arrangement of the legend (breaks)
scale_fill_manual(
values = alpha(c(
"Very unhappy" = "#98579B",
"Sort of unhappy" = "#e8bce8",
"Mixed happy and unhappy" = "#969696",
"Sort of happy" = "#7A99AC",
"Very happy" = "#00263A",
"Currently Pregnant" = "#cccccc",
"No response or missing" = "#F1F5F7"
)),
breaks = c(
"Sort of unhappy",
"Very unhappy",
"Very happy",
"Sort of happy",
"Mixed happy and unhappy",
"Currently Pregnant",
"No response or missing"
)
) +
guides(fill = guide_legend(nrow = 2, byrow = T)) +
# In theme(), we control labeling for each facet:
theme(
text = element_text(family = "cabrito", size = 10),
title = element_text(size = 14, color = "#00263A"),
plot.subtitle = element_text(size = 12),
legend.position = "bottom",
strip.text = element_blank(), # leaves facet labels blank
strip.background = element_blank() # removes background for facet labels
) +
# All other labels are defined in labs()
labs(
title = "IF YOU GOT PREGNANT NOW, HOW WOULD YOU FEEL?",
subtitle = "Estimated percentage for populations of women age 15-49 (summer 2020)",
x = NULL,
y = NULL,
fill = NULL
)
Another reason you might want to use facets is to align responses to questions that use a common response scale. For example, the variable COMMCOVIDWORRY
uses the same response options shown in COVIDCONCERN
, and it reflects each woman’s level of concern for the spread of COVID-19 in her community. If we align two bar charts for COVIDCONCERN
and COMMCOVIDWORRY
with facet_grid()
, we’ll be able to easily compare women’s concerns for personal and communal health.
First, we’ll use pivot_longer
to organize responses to COVIDCONCERN
and COMMCOVIDWORRY
in separate rows. Then, we’ll pre-calculate our summary statistics in a table called covid_pop
.
covid_pop <- covid %>%
select(COUNTRY, CVQWEIGHT, EAID, COVIDCONCERN, COMMCOVIDWORRY) %>%
pivot_longer(
c(COVIDCONCERN, COMMCOVIDWORRY),
names_to = "QUESTION",
values_to = "RESPONSE"
) %>%
as_survey_design(weight = CVQWEIGHT, id = EAID) %>%
group_by(COUNTRY, QUESTION, RESPONSE) %>%
summarise(PERCENT = survey_mean(vartype = NULL)) %>%
filter(grepl("concern", RESPONSE, ignore.case = T))
covid_pop
# A tibble: 32 x 4
# Groups: COUNTRY, QUESTION [8]
COUNTRY QUESTION RESPONSE PERCENT
<fct> <chr> <fct> <dbl>
1 Burkina Faso COMMCOVIDWORRY Not concerned 0.0422
2 Burkina Faso COMMCOVIDWORRY A little concerned 0.0741
3 Burkina Faso COMMCOVIDWORRY Concerned 0.215
4 Burkina Faso COMMCOVIDWORRY Very concerned 0.668
5 Burkina Faso COVIDCONCERN Not concerned 0.0618
6 Burkina Faso COVIDCONCERN A little concerned 0.0835
7 Burkina Faso COVIDCONCERN Concerned 0.177
8 Burkina Faso COVIDCONCERN Very concerned 0.677
9 DRC (Kinshasa) COMMCOVIDWORRY Not concerned 0.161
10 DRC (Kinshasa) COMMCOVIDWORRY A little concerned 0.0943
# … with 22 more rows
This time, we’ll build separate facets for each QUESTION
. We’ll also arrange facets in the direction perpendicular to the direction of the bars (i.e. in rows).
covid_pop %>%
ggplot(aes(x = COUNTRY, y = PERCENT, fill = RESPONSE)) +
geom_bar(stat = "identity") +
geom_vline(xintercept = 0) +
coord_flip() +
# This time, we'll add labels to each facet with labeller()
facet_grid(
rows = vars(QUESTION),
scales = "free",
space = "free",
labeller = labeller(QUESTION = c(
COMMCOVIDWORRY = "Getting infected",
COVIDCONCERN = "Spread in community"
))
) +
# Define fill colors (values) and legend orientation
scale_fill_manual(
values = alpha(colour = c(
"Very concerned" = "#00263A", # IPUMS Navy
"Concerned" = "#4E6C7D", # IPUMS Dark-Grey
"A little concerned" = "#7A99AC", # IPUMS Blue-Grey
"Not concerned" = "#F1F5F7" # IPUMS Light-Grey
))
) +
guides(fill = guide_legend(reverse = TRUE)) +
# We'll format the labels defined above in strip.text.y
# We also increase the panel.spacing by 1 "line"
theme(
text = element_text(family = "cabrito", size = 10),
title = element_text(size = 14, color = "#00263A"),
plot.subtitle = element_text(size = 12),
legend.position = "bottom",
strip.background = element_blank(),
strip.text.y = element_text(size = 12, angle = 0),
panel.spacing = unit(1, "lines")
) +
# All other labels are defined in labs()
labs(
title = "COVID-19 CONCERNS: PERSONAL VS COMMUNAL",
subtitle = "Estimated percentage for populations of women age 15-49 (summer 2020)",
x = NULL,
y = NULL,
fill = NULL
)
Of course, bar charts are only one of the many ways you might choose to visualize Likert-type data from the PMA COVID-19 survey. We think faceted bar charts are a great way to compare data from several questions that use the same response scale, or to showcase the different types of non-response you’ll find in the top-codes used throughout all of the IPUMS PMA data series.
The customization options afforded by ggplot2
are incredibly powerful, but they can also be overwhelming! We’ll practice using tools from ggplot2
again in our next post, where we’ll be thinking about ways to visualize larger batches of related variables.
If you see mistakes or want to suggest changes, please create an issue on the source repository.