Please learn from my mistakes

Reflections on a data visualisation faux-pas

Yaning Wu
5 min read · Oct 23, 2021

This article refers to this blog post.

A previous visualisation with a red “x” and “misleading” in capital letters on top of the image.

Last week, I couldn’t hold back my eagerness about the radical ideas that I was exposed to during the first few weeks of term — ideas about what’s wrong with international development and the problematic nature of measurements in social science.

I was so excited about them that I located two datasets, merged them, and plotted the most convenient statistics: a comparison between how well expert-devised global health security measures predicted countries' outcomes during the pandemic and whether those countries belonged to the "West" or the "rest". I said, in no uncertain terms, that these health security measures were more likely to be optimistic about Western countries (those in Europe and North America) and pessimistic about African ones. In the same vein, I implied that biases introduced by historical imbalances in power caused these inequalities.

Though I have no doubt that these imbalances exist, and affect the everyday lives of the residents in both country groups, I was wrong about the data. I did what you should never do — I picked the data points that showed the clearest correlation and didn’t present any alternatives.

In more detail …

Here’s the original visualisation, including context:

I used the visual indicator of slope to illustrate the relationship between prediction optimism and actual COVID-19 outcomes in two contexts relevant to international development. I was especially shocked (though maybe not surprised) that no “Western” country had done better than its predicted rank; African countries, of course, had a greater variety of fates but were generally perceived pessimistically according to these data.

So what went wrong?

The numerical measure I used to compare the GHS index and COVID-19 outcomes is rank. Rank takes into account neither the actual index figures nor the actual number of COVID-19 cases, which are arguably more relevant for practical purposes. For example, country A could have ten cases while country B has a million, or country A could have five hundred while country B has six hundred; both pairings produce exactly the same ranks. The conclusions and visuals I drew from rankings, shown above, are therefore misleading.
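To make that concrete, here's a minimal sketch in base R (the case counts are invented for illustration) showing that rank() keeps only the ordering and discards the size of the gap:

# Two hypothetical country pairs with invented case counts
pair_1 <- c(a = 10, b = 1000000)  # an enormous gap
pair_2 <- c(a = 500, b = 600)     # a tiny gap
rank(pair_1)  # a = 1, b = 2
rank(pair_2)  # a = 1, b = 2: identical ranks despite wildly different gaps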

How did I find out?

I’m a member of the international nonprofit Data Visualization Society’s Slack, which is a virtual coworking and collaboration space where tens of thousands of dataviz enthusiasts and practitioners gather for advice, discussion, and friendly competition. I submitted my original visualisation to the space’s “critique” forum, seeking insights about how I could more clearly communicate the point I was trying to make. By then, I was convinced that my point was indisputably correct.

First, designer J.K. Dru generously gave her time to suggest some improvements to the design of the visualisation and the way I labelled my axes. I amended the piece according to her guidance and didn't think much more about it for a week.

Then, Data Visualization Society Executive Director and public health professional Amanda Makulec noted that rankings were not the best idea, explaining succinctly what I summarised above and raising concerns about data quality by country. I then realised that my representation of the data may have been misleading, and was inspired to write this separate reflection.

How can I fix this?

There was, at least, a simple solution. Instead of using rankings for both vertical axes, I could use standardised (z-scored) versions of the GHS index scores and of cases per 100,000 population, putting the two measures on a comparable scale (you can't place a score of 65.3 and 500 cases per 100,000 on the same axis). Here's the R code I used:

library(tidyverse)

# Read the merged dataset and give its columns readable names
ghs <- read.csv("~/ghs - ghs.csv")
names(ghs) <- c("ghs_rank", "covid_rank", "country", "ghs_score", "region",
                "pop_cat", "income_cat", "covid_cases_oct8", "pop_2020",
                "cases_per_pop")

# Order income categories from high to low
# (assumes these exact labels appear in the data)
ghs$income_cat <- factor(ghs$income_cat,
                         levels = c("High income", "Upper middle income",
                                    "Lower middle income", "Low income"))

# Standardise (z-score) the actual values so both measures share one scale;
# as.numeric() drops the matrix attributes that scale() returns
ghs$cases_scaled <- as.numeric(scale(ghs$cases_per_pop))
ghs$ghs_scaled <- as.numeric(scale(ghs$ghs_score))

# Reshape to long format: one row per country per measure
ghs <- ghs %>% pivot_longer(cases_scaled:ghs_scaled, names_to = "value_type")

# Subsets for the comparisons of interest
ghs_afr <- subset(ghs, region == "Africa")
ghs_west <- subset(ghs, region == "North America" | region == "Europe")
ghs_rest <- subset(ghs, region != "North America" & region != "Europe")
ghs_lmics <- subset(ghs, income_cat == "Low income" | income_cat == "Lower middle income")
ghs_hics <- subset(ghs, income_cat == "High income")
ghs_no_small_states <- subset(ghs, pop_2020 > 150000000)

# Draw a slope chart: one line per country, coloured by income category
ghs_slopes <- function(df){
  p <- ggplot(df, aes(x = value_type, y = value, group = country)) +
    geom_line(aes(color = income_cat), size = 0.5) +
    scale_color_manual(values = c("#6C779A", "#C1C5D5", "#E9AFAB", "#ca6362")) +
    labs(color = "Income category") +
    theme_void() +
    theme(legend.position = "none")
  ggsave("ghs.png", plot = p, dpi = 1000)  # each call overwrites ghs.png
}

ghs_slopes(ghs_west)
ghs_slopes(ghs_afr)
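As a quick sanity check, the two standardised measures should each come out with a mean near zero and a standard deviation of one, which is exactly what lets them share an axis; a couple of lines of dplyr can confirm this on the long-format data:

# Confirm both measures are on the same standardised scale
ghs %>%
  group_by(value_type) %>%
  summarise(mean = mean(value, na.rm = TRUE), sd = sd(value, na.rm = TRUE))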

And here are the resulting plots, with Africa to the left and the “West” to the right:

What still holds?

Different patterns certainly emerge from my revised charts. But I can still say that the Global Health Security index is an imperfect predictor of pandemic performance, especially for outlier cases such as the United States. I just can't argue, using these data, that biases due to power imbalances are present in the index's assessments.

What’s the takeaway?

My first attempt at visualising these data did not draw a surprising conclusion. It wasn't as if my charts showed that an overwhelming majority of Democrats supported Trump during election season, or that the climate was taking a turn for the colder. That's why I put little thought into questioning the relationship I noticed. And although my audience is tiny, most of my readers didn't question it either.

But regardless of my intent and the nature of the data, I should have been more careful. I allowed my preconceived notions to sway me into believing that things were as clear-cut as they seemed to be. So the next time you visualise information, do it more than once.


Yaning Wu

she/her. Population Health student @ UCL. Perpetual dataviz nerd. Published on Towards Data Science and UX Collective.