Chris (Evans) R SAFAQ: Mapping dates to episodes

Chris Evans

Mapping dates to episodes

R programming R tricks Tidyverse

Illustrates how to use join_by() in tidyverse R to do this.

Author: Chris Evans

Affiliation: PSYCTC.org

Published: Feb. 3, 2025

Citation: Evans, 2025

Started 3.ii.25

Background and illustrative data

This came out of a large piece of practice-oriented research (POR) where some horribly familiar issues with the service’s software systems meant that the mapping of data to episodes was wrong for some participants. I had correct dates for observations and correct start and end dates for episodes, but I had to remap all the data!

So (simplifying a bit to make it digestible here) I’m showing 207 rows of data from 13 participants, which looked like this.

tmpTibDat %>%
  filter(row_number() < 35) %>%
  flextable() %>%
  autofit()

tmpTibDat %>%
  group_by(ID) %>%
  summarise(n = n()) -> tmpTibDatCounts

That’s the first 34 of the 207 rows. “Fecha” is Spanish for “date” so “Fecha.Estudio” was the date on which data was collected.

The challenge was to map those “Fecha.Estudio” dates onto the episodes in this separate dataset of episode dates.

tmpTibEpisodes %>%
  as_grouped_data(groups = "ID") %>%
  flextable() %>%
  autofit()

Yes, “Fecha.ini” and “Fecha.fin” are the dates of the opening and ending of the episode respectively, and “Where” is the location of the work the participant was doing in that episode: “Hosp” = “Inpatient”, “HDD” = “Day hospital”, “Comm” = “Community work”.

So here are the counts of episodes per participant.

tmpTibEpisodes %>%
  count(nEpisode) %>%
  flextable() %>%
  autofit()

Yes! I have used data from participants in the actual study who had lots of episodes!

Using `join_by()`

The first nice trick is to use `join_by()` from the R package dplyr (part of the “tidyverse”) to create a “within” join. Here’s the code.

byWithin <- join_by(ID,  # says to do the next bit per value of ID in the first dataset
                    ### and this is the within bit:
                    within(Fecha.Estudio, Fecha.Estudio, Fecha.ini, Fecha.fin))

That just creates a within join instruction. What it says is that Fecha.Estudio in the first dataset must be later than or equal to Fecha.ini in the second dataset and before or equal to Fecha.fin in the second dataset.

You can see that is using Fecha.Estudio twice. That’s because this can be used to see if an interval, rather than a single date, lies within the range in the other dataset. So if your data were collected over a period from, say, Fecha.Estudio1 to Fecha.Estudio2 you could see if that interval lay within a therapy episode. However, our data were all collected on one date so I used Fecha.Estudio twice to create a single date to test.
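To make the interval idea concrete, here is a minimal sketch on made-up data. All the names here (obs, episodes, start, end, epStart, epEnd, episode) are hypothetical and not from the real study; the third observation uses the same date for start and end, which is the single-date situation described above. It assumes dplyr 1.1.0 or later, which is where join_by() arrived.

```r
library(dplyr)

# Hypothetical toy data: three observation intervals for one participant
obs <- tibble(
  ID    = c(1, 1, 1),
  start = as.Date(c("2024-01-05", "2024-01-28", "2024-03-01")),
  end   = as.Date(c("2024-01-10", "2024-02-03", "2024-03-01"))
)

# Hypothetical toy episodes, with a gap between episode A and episode B
episodes <- tibble(
  ID      = c(1, 1),
  epStart = as.Date(c("2024-01-01", "2024-02-15")),
  epEnd   = as.Date(c("2024-01-31", "2024-03-31")),
  episode = c("A", "B")
)

# within(x_lower, x_upper, y_lower, y_upper) means
# x_lower >= y_lower and x_upper <= y_upper
res <- left_join(obs, episodes,
                 join_by(ID, within(start, end, epStart, epEnd)))
res
```

The first interval sits wholly inside episode A, the second runs past the end of A and so matches nothing (its episode columns are missing), and the single-date third observation falls inside episode B.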

byWithin <- join_by(ID,
                    within(Fecha.Estudio, Fecha.Estudio, Fecha.ini, Fecha.fin))

tmpTibDat %>%
  left_join(tmpTibEpisodes, 
            ### now tell the left_join() to use the within join that you defined:
            byWithin) -> tmpTibDat2

tmpTibDat2 %>%
  filter(row_number() < 35) %>%
  as_grouped_data(groups = "ID") %>%
  flextable() %>%
  autofit()

Watch out for duplicated data!

That looks good but we had 207 rows of data before doing that join and now we have 210. What’s happened?

Well, it’s a little gotcha; in programming jargon it’s a “corner case”: a problem arising where values of two different variables can catch you out.

Of course R and the join have done what they should, so how have we got three new rows of data?

tmpTibDat2 %>%
  group_by(rowN) %>%
  mutate(nRowN = n(),
         rowNN = row_number()) %>%
  ungroup() -> tmpTibDat2

tmpTibDat2 %>%
  filter(nRowN > 1) %>%
  as_grouped_data(groups = "rowN") %>%
  flextable()

You can see what happened there: three of the Fecha.Estudio fell on the Fecha.fin ending one episode but in each case the next episode started on the same date (pretty common when episodes are defined by transfers between levels of support). The code did the correct thing and said that the data fell within both episodes.

I fixed that by removing the mappings to the Fecha.fin, i.e. by treating all those data which were collected on a day that was both the end of an episode and the start of the next episode as coming from the second of the episodes. (That’s realistic for our data.)

tmpTibDat2 %>%
  ### remove rows where a row has been duplicated, removing the first one
  filter(!(nRowN > 1 & rowNN == 1)) -> tmpTibDat2

That’s fixed it and now nrow(tmpTibDat2) = 207.
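Here is a small sketch of the same duplication and fix on made-up data, in case it helps to see it in miniature. The names obs, episodes, obsDate, epStart, epEnd and byWithinToy are hypothetical. The last step shows a possible alternative, not what I did above: the multiple = "last" argument of left_join() keeps only the last matching episode, which gives the same answer only on the assumption that the episodes are sorted by start date within each ID.

```r
library(dplyr)

# Hypothetical toy data: the second observation falls on the day
# episode 1 ends and episode 2 starts
obs <- tibble(rowN = 1:3, ID = 1,
              obsDate = as.Date(c("2024-01-10", "2024-01-31", "2024-02-10")))
episodes <- tibble(ID = 1,
                   epStart  = as.Date(c("2024-01-01", "2024-01-31")),
                   epEnd    = as.Date(c("2024-01-31", "2024-02-28")),
                   nEpisode = 1:2)

byWithinToy <- join_by(ID, within(obsDate, obsDate, epStart, epEnd))

# The inclusive join maps rowN 2 to both episodes
joined <- left_join(obs, episodes, byWithinToy)

# Find observations that were mapped more than once
dupes <- joined %>%
  count(rowN) %>%
  filter(n > 1)

# Same fix as above: for duplicated rows drop the first (earlier) episode
fixed <- joined %>%
  group_by(rowN) %>%
  filter(!(n() > 1 & row_number() == 1)) %>%
  ungroup()

# Possible alternative (assumes episodes are sorted by epStart within ID):
# keep only the last matching episode at join time
joinedLast <- left_join(obs, episodes, byWithinToy, multiple = "last")
```

Either way it is still worth comparing nrow() before and after, as the alternative silently depends on the ordering of the episode data.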

Avoiding the corner case can be worse …

You could avoid this by redefining your limits.

tmpTibEpisodes %>%
  ### define Fecha.fin.eve as the day before Fecha.fin (uses R date arithmetic which assumes we are counting days)
  mutate(Fecha.fin.eve = Fecha.fin - 1) -> tmpTibEpisodes2

byWithin <- join_by(ID,
                    within(Fecha.Estudio, Fecha.Estudio, Fecha.ini, Fecha.fin.eve))

tmpTibDat %>%
  left_join(tmpTibEpisodes2, 
            ### now tell the left_join() to use the within join that you defined:
            byWithin) -> tmpTibDat3

That seems fine: nrow(tmpTibDat3) = 207 but it’s not fine as there are data rows where the Fecha.Estudio was on the Fecha.fin that aren’t mapped to episodes.

tmpTibDat3 %>%
  filter(is.na(Fecha.ini)) %>%
  flextable() %>%
  colformat_date(na_str = "NA") %>%
  colformat_char(na_str = "NA") %>%
  colformat_num(na_str = "NA") %>%
  bg(i = ~ is.na(Fecha.ini), j = 5:9, bg = "red") %>%
  autofit()

That shows what happen to be ten rows of data that couldn’t be mapped by the within join: they are included but there are missing values where the join should have mapped the rows to episodes.
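The same thing can be reproduced on made-up data. In this sketch the names (obs, episodes, obsDate, epStart, epEnd, epEndEve) are hypothetical: the first observation falls on a date shared by two abutting episodes and the second on the last day of the final episode.

```r
library(dplyr)

# Hypothetical toy data
obs <- tibble(rowN = 1:2, ID = 1,
              obsDate = as.Date(c("2024-01-31", "2024-02-28")))
episodes <- tibble(ID = 1,
                   epStart  = as.Date(c("2024-01-01", "2024-01-31")),
                   epEnd    = as.Date(c("2024-01-31", "2024-02-28")),
                   nEpisode = 1:2) %>%
  # end each episode a day early, as in the code above
  mutate(epEndEve = epEnd - 1)

joinedEve <- left_join(obs, episodes,
                       join_by(ID, within(obsDate, obsDate, epStart, epEndEve)))

# Rows that were kept by the left join but not mapped to any episode
unmapped <- filter(joinedEve, is.na(nEpisode))
```

The observation on the shared date now maps only to the second episode, which is what we wanted, but the observation on the very last day of the final episode has nowhere to go and ends up in unmapped.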

Now I can plot it!

But the correct mapping (in tmpTibDat2) makes it easy, using ggplot, to plot the data collection dates against the episodes.

### create a variable firstFechaIni by which to sort the participants on the y axis
tmpTibDat2 %>%
  group_by(ID) %>%
  mutate(firstFechaIni = first(Fecha.ini)) %>%
  ungroup() -> tmpTibDat2

### create colour mapping for regimes
vecWhereColours <- c("Hosp" = "red",
                     "HDD" = "orange",
                     "Comm" = "green")

ggplot(data = tmpTibDat2,
       aes(x = Fecha.Estudio, 
           ### reorder the ID values so the earliest starts lowest
           y = reorder(ID, firstFechaIni))) +
  ### plot the episodes, linewidth makes the lines bars
  geom_linerange(aes(xmin = Fecha.ini, xmax = Fecha.fin,
                     colour = where),
                 linewidth = 5) +
  ### superimpose the data collection dates
  geom_point(aes(x = Fecha.Estudio),
             size = 1) +
  ylab("ID") +
  ### use the colour mapping
  scale_color_manual("Where",
                       values = vecWhereColours) +
  ### nice easy date axis mapping
  ### though I never remember it!
  scale_x_date(name = "Date",
               breaks = "3 months",
               date_labels = "%b-%y") +
  ### cosmetics for that date mapping on the x axis ...
  theme(axis.text.x = element_text(angle = 70, hjust = 1)) +
  ggtitle("Mapping data collection to episodes",
          subtitle = "Data collection left censored by change of recording system")

### save to make a png available to distill ()
# ggsave("TimeMap.png")

As the subtitle says, the data collection is left censored: it looks as if a lot of early data are missing but they aren’t really. They were just collected before the services changed their data collection software and I didn’t bother to merge the earlier data in for this little explanation of the way of mapping the dates to the episodes.

Learning points

join_by() and a “within join” are wonderful for mapping data to episodes.
But the join is inclusive, so watch that you haven’t duplicated rows of data which map to more than one episode (where episodes abut and an observation was made on the date that is both the end of the first episode and the start of the second).
Watch out: data that don’t map into the available episode dates aren’t excluded from the created dataset; they just have missing episode data.
So always check that you have kept the correct number of observations and that you have episode mappings for all the data.
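Those checks could be wrapped up in a few lines. This is only a sketch: checkMapping() is a hypothetical helper, not a dplyr function, and the toy data frames just stand in for the real datasets. It assumes each original observation has a unique row identifier that survives the join.

```r
# Hypothetical helper (not from any package): summarise the three checks
checkMapping <- function(original, mapped, rowCol, episodeCol) {
  list(
    # same number of rows before and after the join?
    sameN       = nrow(original) == nrow(mapped),
    # how many rows are repeats of an observation already seen?
    nDuplicated = sum(duplicated(mapped[[rowCol]])),
    # how many rows have no episode data?
    nUnmapped   = sum(is.na(mapped[[episodeCol]]))
  )
}

# Toy data: observation 2 was mapped twice and observation 3 not at all
toyOriginal <- data.frame(rowN = 1:3)
toyMapped <- data.frame(
  rowN    = c(1, 2, 2, 3),
  epStart = as.Date(c("2024-01-01", "2024-01-01", "2024-01-31", NA))
)

toyChecks <- checkMapping(toyOriginal, toyMapped,
                          rowCol = "rowN", episodeCol = "epStart")
toyChecks
```

With the data in this post the call would be something like checkMapping(tmpTibDat, tmpTibDat2, "rowN", "Fecha.ini"), assuming rowN identifies each original observation.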

But using such a join is hugely easier than coding your own and orders of magnitude faster than any code to do that I at least might write!

History

3.ii.25: created.

Last updated

17/02/2025 at 17:15

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY-SA 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Evans (2025, Feb. 3). Chris (Evans) R SAFAQ: Mapping dates to episodes. Retrieved from https://www.psyctc.org/R_blog/posts/2025-02-03-mapping-dates-to-episodes/

BibTeX citation

@misc{evans2025mapping,
  author = {Evans, Chris},
  title = {Chris (Evans) R SAFAQ: Mapping dates to episodes},
  url = {https://www.psyctc.org/R_blog/posts/2025-02-03-mapping-dates-to-episodes/},
  year = {2025}
}
