Illustrates how to use join_by() in tidyverse R to do this.
Started 3.ii.25
### this is just the code that creates the "copy to clipboard" function in the code blocks
htmltools::tagList(
xaringanExtra::use_clipboard(
button_text = "<i class=\"fa fa-clone fa-2x\" style=\"color: #301e64\"></i>",
success_text = "<i class=\"fa fa-check fa-2x\" style=\"color: #90BE6D\"></i>",
error_text = "<i class=\"fa fa-times fa-2x\" style=\"color: #F94144\"></i>"
),
rmarkdown::html_dependency_font_awesome()
)
This came out of a large piece of practice oriented research (POR) where some, horribly familiar, issues with the service’s software systems meant the mapping of data to episodes was wrong for some participants so I had correct dates for observations and correct start and end dates for episodes but I had remap all the data!
So (simplifying a bit to make it digestible here) I’m showing 207 rows of data from 13 which looked like this.
rowN | ID | Fecha.Estudio |
---|---|---|
1 | 1 | 2020-12-17 |
2 | 1 | 2020-12-18 |
3 | 1 | 2021-02-04 |
4 | 1 | 2021-03-24 |
5 | 1 | 2021-07-22 |
6 | 1 | 2021-09-30 |
7 | 1 | 2022-03-23 |
8 | 1 | 2022-08-31 |
9 | 1 | 2022-11-21 |
10 | 1 | 2023-01-19 |
11 | 1 | 2023-02-07 |
12 | 1 | 2023-04-27 |
13 | 1 | 2023-04-27 |
14 | 1 | 2023-07-20 |
15 | 1 | 2023-07-20 |
16 | 1 | 2023-10-20 |
17 | 1 | 2023-10-20 |
18 | 1 | 2023-11-24 |
19 | 1 | 2023-12-13 |
20 | 1 | 2024-03-14 |
21 | 1 | 2024-05-09 |
22 | 1 | 2024-06-14 |
23 | 1 | 2024-11-07 |
24 | 1 | 2024-12-13 |
25 | 2 | 2020-11-30 |
26 | 2 | 2021-04-21 |
27 | 2 | 2021-05-06 |
28 | 2 | 2021-06-22 |
29 | 2 | 2021-08-04 |
30 | 2 | 2021-11-10 |
31 | 2 | 2023-03-31 |
32 | 2 | 2023-06-12 |
33 | 3 | 2021-01-20 |
34 | 3 | 2021-01-21 |
That’s the first 34 of the 207 rows. “Fecha” is Spanish for “date” so “Fecha.Estudio” was the date on which data was collected.
The challenge was to map those “Fecha.Estudio” into these dates from this separate dataset of episode dates.
ID | Fecha.ini | Fecha.fin | where | nEpisode | EpisodN |
---|---|---|---|---|---|
1 | |||||
2020-02-05 | 2021-09-30 | Comm | 5 | 1 | |
2021-10-18 | 2022-03-23 | Comm | 5 | 2 | |
2022-08-01 | 2023-11-30 | HDD | 5 | 3 | |
2023-12-01 | 2024-06-30 | Comm | 5 | 4 | |
2024-10-08 | 2025-01-25 | Comm | 5 | 5 | |
2 | |||||
2020-11-16 | 2021-05-03 | HDD | 5 | 1 | |
2021-05-04 | 2021-10-01 | Hosp | 5 | 2 | |
2021-10-04 | 2022-03-31 | HDD | 5 | 3 | |
2023-03-06 | 2023-06-09 | Hosp | 5 | 4 | |
2023-06-12 | 2023-08-01 | HDD | 5 | 5 | |
3 | |||||
2020-01-13 | 2021-03-31 | Hosp | 5 | 1 | |
2021-04-06 | 2022-06-07 | HDD | 5 | 2 | |
2022-10-11 | 2023-07-31 | HDD | 5 | 3 | |
2023-09-01 | 2024-03-31 | Comm | 5 | 4 | |
2024-04-02 | 2025-01-25 | Comm | 5 | 5 | |
4 | |||||
2019-11-27 | 2021-02-03 | Hosp | 5 | 1 | |
2021-02-08 | 2022-08-30 | Comm | 5 | 2 | |
2022-08-30 | 2022-10-06 | Hosp | 5 | 3 | |
2022-10-06 | 2023-05-12 | Comm | 5 | 4 | |
2023-05-12 | 2023-05-19 | Hosp | 5 | 5 | |
5 | |||||
2020-08-05 | 2021-01-25 | HDD | 6 | 1 | |
2021-01-25 | 2021-07-09 | Hosp | 6 | 2 | |
2021-07-12 | 2022-11-30 | HDD | 6 | 3 | |
2022-12-02 | 2023-01-31 | Comm | 6 | 4 | |
2023-02-01 | 2023-06-30 | Comm | 6 | 5 | |
2023-10-20 | 2025-01-25 | HDD | 6 | 6 | |
6 | |||||
2020-11-23 | 2020-12-18 | Hosp | 5 | 1 | |
2020-12-22 | 2021-01-25 | HDD | 5 | 2 | |
2021-01-26 | 2021-03-19 | Hosp | 5 | 3 | |
2021-03-22 | 2021-04-14 | HDD | 5 | 4 | |
2021-04-15 | 2021-10-01 | HDD | 5 | 5 | |
7 | |||||
2021-01-11 | 2021-03-31 | HDD | 6 | 1 | |
2021-04-01 | 2021-04-27 | Hosp | 6 | 2 | |
2021-04-28 | 2021-07-05 | HDD | 6 | 3 | |
2021-07-06 | 2021-07-23 | Hosp | 6 | 4 | |
2021-07-26 | 2021-08-31 | HDD | 6 | 5 | |
2021-09-01 | 2024-01-31 | Comm | 6 | 6 | |
8 | |||||
2020-09-09 | 2021-05-08 | Hosp | 5 | 1 | |
2021-05-10 | 2021-12-31 | HDD | 5 | 2 | |
2022-03-01 | 2023-07-31 | HDD | 5 | 3 | |
2023-08-01 | 2023-09-13 | HDD | 5 | 4 | |
2024-01-15 | 2025-01-25 | Comm | 5 | 5 | |
9 | |||||
2021-09-03 | 2021-10-15 | Hosp | 9 | 1 | |
2021-10-18 | 2021-12-10 | HDD | 9 | 2 | |
2021-12-13 | 2022-02-16 | Comm | 9 | 3 | |
2022-02-16 | 2022-02-25 | Hosp | 9 | 4 | |
2022-10-03 | 2022-11-29 | HDD | 9 | 5 | |
2022-11-30 | 2022-12-23 | Hosp | 9 | 6 | |
2023-01-03 | 2023-03-22 | HDD | 9 | 7 | |
2023-04-03 | 2024-04-30 | Comm | 9 | 8 | |
2024-07-12 | 2025-01-25 | Comm | 9 | 9 | |
10 | |||||
2020-04-29 | 2020-12-18 | Hosp | 6 | 1 | |
2020-12-21 | 2021-02-17 | HDD | 6 | 2 | |
2021-02-17 | 2021-09-10 | Hosp | 6 | 3 | |
2021-09-15 | 2023-06-20 | HDD | 6 | 4 | |
2023-06-21 | 2023-12-01 | Hosp | 6 | 5 | |
2024-03-12 | 2025-01-25 | HDD | 6 | 6 | |
11 | |||||
2019-09-24 | 2021-04-09 | HDD | 5 | 1 | |
2021-04-12 | 2021-09-17 | Hosp | 5 | 2 | |
2021-09-17 | 2021-11-19 | HDD | 5 | 3 | |
2021-11-22 | 2022-06-01 | HDD | 5 | 4 | |
2022-06-01 | 2022-08-31 | Comm | 5 | 5 | |
12 | |||||
2020-08-03 | 2021-03-26 | Hosp | 6 | 1 | |
2021-03-31 | 2022-05-05 | HDD | 6 | 2 | |
2022-05-06 | 2022-07-15 | Hosp | 6 | 3 | |
2022-07-18 | 2022-10-24 | HDD | 6 | 4 | |
2022-11-01 | 2023-03-31 | Comm | 6 | 5 | |
2024-01-16 | 2024-03-05 | HDD | 6 | 6 | |
13 | |||||
2020-08-11 | 2020-12-03 | HDD | 5 | 1 | |
2020-12-04 | 2021-04-17 | Hosp | 5 | 2 | |
2021-04-19 | 2022-05-31 | HDD | 5 | 3 | |
2022-06-01 | 2024-07-08 | Comm | 5 | 4 | |
2024-07-09 | 2024-10-31 | Hosp | 5 | 5 |
Yes, “Fecha.ini” and “Fecha.fin” are dates of the opening of the episode and ending of it respectively and “Where” is location of the work the participant was doing in that episode: “Hosp” = “Inpatient”, “HDD” = “Day hospital”, “Comm” = “Community work”.
So here are the counts of episodes per participant.
Yes! I have used data from participants in the actual study who had lots of episodes!
join_by()
The first nice trick is to use `join_by()” from the R package dplyr (part of the “tidyverse”) to create a “within” join. Here’s the code.
byWithin <- join_by(ID, # says to do the next bit per value of ID in the first dataset
### and this is the within bit:
within(Fecha.Estudio, Fecha.Estudio, Fecha.ini, Fecha.fin))
That just creates a within join instruction, what it says is that Fecha.Estudio in the first dataset must be later than or equal to Fecha.ini in the second dataset and Fecha.Estudio is before or equal to Fecha.fin in the second dataset.
You can see that is using Fecha.Estudio twice. That’s because this can be used to see if an interval, rather than a single date, lies within the range in the other datase. So if your data were collected over a period from say Fecha.Estudio1 to Fecha.Estudio2 you could see if that interval lay within a therapy episode. However, our data were all collected on one date so I used Fecha.Estudio twice to create a single date to test.
byWithin <- join_by(ID,
within(Fecha.Estudio, Fecha.Estudio, Fecha.ini, Fecha.fin))
tmpTibDat %>%
left_join(tmpTibEpisodes,
### now tell the left_join() to use the within join that you defined:
byWithin) -> tmpTibDat2
tmpTibDat2 %>%
filter(row_number() < 35) %>%
as_grouped_data(groups = "ID") %>%
flextable() %>%
autofit()
ID | rowN | Fecha.Estudio | Fecha.ini | Fecha.fin | where | nEpisode | EpisodN |
---|---|---|---|---|---|---|---|
1 | |||||||
1 | 2020-12-17 | 2020-02-05 | 2021-09-30 | Comm | 5 | 1 | |
2 | 2020-12-18 | 2020-02-05 | 2021-09-30 | Comm | 5 | 1 | |
3 | 2021-02-04 | 2020-02-05 | 2021-09-30 | Comm | 5 | 1 | |
4 | 2021-03-24 | 2020-02-05 | 2021-09-30 | Comm | 5 | 1 | |
5 | 2021-07-22 | 2020-02-05 | 2021-09-30 | Comm | 5 | 1 | |
6 | 2021-09-30 | 2020-02-05 | 2021-09-30 | Comm | 5 | 1 | |
7 | 2022-03-23 | 2021-10-18 | 2022-03-23 | Comm | 5 | 2 | |
8 | 2022-08-31 | 2022-08-01 | 2023-11-30 | HDD | 5 | 3 | |
9 | 2022-11-21 | 2022-08-01 | 2023-11-30 | HDD | 5 | 3 | |
10 | 2023-01-19 | 2022-08-01 | 2023-11-30 | HDD | 5 | 3 | |
11 | 2023-02-07 | 2022-08-01 | 2023-11-30 | HDD | 5 | 3 | |
12 | 2023-04-27 | 2022-08-01 | 2023-11-30 | HDD | 5 | 3 | |
13 | 2023-04-27 | 2022-08-01 | 2023-11-30 | HDD | 5 | 3 | |
14 | 2023-07-20 | 2022-08-01 | 2023-11-30 | HDD | 5 | 3 | |
15 | 2023-07-20 | 2022-08-01 | 2023-11-30 | HDD | 5 | 3 | |
16 | 2023-10-20 | 2022-08-01 | 2023-11-30 | HDD | 5 | 3 | |
17 | 2023-10-20 | 2022-08-01 | 2023-11-30 | HDD | 5 | 3 | |
18 | 2023-11-24 | 2022-08-01 | 2023-11-30 | HDD | 5 | 3 | |
19 | 2023-12-13 | 2023-12-01 | 2024-06-30 | Comm | 5 | 4 | |
20 | 2024-03-14 | 2023-12-01 | 2024-06-30 | Comm | 5 | 4 | |
21 | 2024-05-09 | 2023-12-01 | 2024-06-30 | Comm | 5 | 4 | |
22 | 2024-06-14 | 2023-12-01 | 2024-06-30 | Comm | 5 | 4 | |
23 | 2024-11-07 | 2024-10-08 | 2025-01-25 | Comm | 5 | 5 | |
24 | 2024-12-13 | 2024-10-08 | 2025-01-25 | Comm | 5 | 5 | |
2 | |||||||
25 | 2020-11-30 | 2020-11-16 | 2021-05-03 | HDD | 5 | 1 | |
26 | 2021-04-21 | 2020-11-16 | 2021-05-03 | HDD | 5 | 1 | |
27 | 2021-05-06 | 2021-05-04 | 2021-10-01 | Hosp | 5 | 2 | |
28 | 2021-06-22 | 2021-05-04 | 2021-10-01 | Hosp | 5 | 2 | |
29 | 2021-08-04 | 2021-05-04 | 2021-10-01 | Hosp | 5 | 2 | |
30 | 2021-11-10 | 2021-10-04 | 2022-03-31 | HDD | 5 | 3 | |
31 | 2023-03-31 | 2023-03-06 | 2023-06-09 | Hosp | 5 | 4 | |
32 | 2023-06-12 | 2023-06-12 | 2023-08-01 | HDD | 5 | 5 | |
3 | |||||||
33 | 2021-01-20 | 2020-01-13 | 2021-03-31 | Hosp | 5 | 1 | |
34 | 2021-01-21 | 2020-01-13 | 2021-03-31 | Hosp | 5 | 1 |
That looks good but we had 207 rows of data before doing that join and now we have 210. What’s happened?
Well it’s a little gotcha, in programming jargon it’s an “corner case”: a problem arising where values of two different variables can catch you out.
Of course the R and the join have done what they should so how have we got three new rows of data?
rowN | ID | Fecha.Estudio | Fecha.ini | Fecha.fin | where | nEpisode | EpisodN | nRowN | rowNN |
---|---|---|---|---|---|---|---|---|---|
58 | |||||||||
4 | 2022-08-30 | 2021-02-08 | 2022-08-30 | Comm | 5 | 2 | 2 | 1 | |
4 | 2022-08-30 | 2022-08-30 | 2022-10-06 | Hosp | 5 | 3 | 2 | 2 | |
60 | |||||||||
4 | 2022-10-06 | 2022-08-30 | 2022-10-06 | Hosp | 5 | 3 | 2 | 1 | |
4 | 2022-10-06 | 2022-10-06 | 2023-05-12 | Comm | 5 | 4 | 2 | 2 | |
134 | |||||||||
9 | 2022-02-16 | 2021-12-13 | 2022-02-16 | Comm | 9 | 3 | 2 | 1 | |
9 | 2022-02-16 | 2022-02-16 | 2022-02-25 | Hosp | 9 | 4 | 2 | 2 |
You can see what happened there: three of the Fecha.Estudio fell on the Fecha.fin ending one episode but in each case the next episode started on the same date (pretty common when episodes are defined by transfers between levels of support). The code did the correct thing and said that the data fell within both episodes.
I fixed that by removing the mappings to the Fecha.fin, i.e. by treating all those data which were collected on a day that was both the end of an episode and the start of the next episode as coming from the second of the episodes. (That’s realistic for our data.)
That’s fixed it and now nrow(tmpTibDat2)
= 207.
You could avoid this by redefining your limits.
tmpTibEpisodes %>%
### define Fecha.fin.eve as the day before Fecha.fin (uses R date arithmetic which assumes we are counting days)
mutate(Fecha.fin.eve = Fecha.fin - 1) -> tmpTibEpisodes2
byWithin <- join_by(ID,
within(Fecha.Estudio, Fecha.Estudio, Fecha.ini, Fecha.fin.eve))
tmpTibDat %>%
left_join(tmpTibEpisodes2,
### now tell the left_join() to use the within join that you defined:
byWithin) -> tmpTibDat3
That seems fine: nrow(tmpTibDat3)
= 207 but it’s not fine as there are data rows where the Fecha.Estudio was on the Fecha.fin that aren’t mapped to episodes.
rowN | ID | Fecha.Estudio | Fecha.ini | Fecha.fin | where | nEpisode | EpisodN | Fecha.fin.eve |
---|---|---|---|---|---|---|---|---|
6 | 1 | 2021-09-30 | NA | NA | NA | NA | NA | NA |
7 | 1 | 2022-03-23 | NA | NA | NA | NA | NA | NA |
98 | 7 | 2021-07-05 | NA | NA | NA | NA | NA | NA |
131 | 9 | 2021-10-15 | NA | NA | NA | NA | NA | NA |
150 | 10 | 2020-12-18 | NA | NA | NA | NA | NA | NA |
174 | 11 | 2021-04-09 | NA | NA | NA | NA | NA | NA |
185 | 11 | 2022-08-31 | NA | NA | NA | NA | NA | NA |
194 | 12 | 2022-05-05 | NA | NA | NA | NA | NA | NA |
201 | 12 | 2023-03-31 | NA | NA | NA | NA | NA | NA |
220 | 13 | 2024-10-31 | NA | NA | NA | NA | NA | NA |
That shows what happens to be ten rows of data that couldn’t be mapped by the join_with()
: they are included but there are missing values where the mapping should have mapped the rows to episodes.
But the correct mapping makes it easy, using ggplot, to map the data collection to the episodes.
### create a variable firstFechaIni by which to sort the participants on the y axis
tmpTibDat2 %>%
group_by(ID) %>%
mutate(firstFechaIni = first(Fecha.ini)) %>%
ungroup() -> tmpTibDat2
### create colour mapping for regimes
vecWhereColours <- c("Hosp" = "red",
"HDD" = "orange",
"Comm" = "green")
ggplot(data = tmpTibDat2,
aes(x = Fecha.Estudio,
### reorder the ID values so the earliest starts lowest
y = reorder(ID, firstFechaIni))) +
### plot the episodes, linewidth makes the lines bars
geom_linerange(aes(xmin = Fecha.ini, xmax = Fecha.fin,
colour = where),
linewidth = 5) +
### superimpose the data collection dates
geom_point(aes(x = Fecha.Estudio),
size = 1) +
ylab("ID") +
### use the colour mapping
scale_color_manual("Where",
values = vecWhereColours) +
### nice easy date axis mapping
### though I never remember it!
scale_x_date(name = "Date",
breaks = "3 months",
date_labels = "%b-%y") +
### cosmetics for that date mapping on the x axis ...
theme(axis.text.x = element_text(angle = 70, hjust = 1)) +
ggtitle("Mapping data collection to episodes",
subtitle = "Data collection left censored by change of recording system")
### save to make a png available to distill ()
# ggsave("TimeMap.png")
As the title says, that looks as if a lot of early data is missing, it’s not really, it was just collected before the services changed their data collection software and I didn’t bother to merge the earlier data in for this little explanation of the way of mapping the dates to the episodes.
join_by()
and a “within join” is wonderful for mapping data to episodes.But using such a join is hugely easier than coding your own and orders of magnitude faster than any code to do that I at least might write!
17/02/2025 at 17:15
Text and figures are licensed under Creative Commons Attribution CC BY-SA 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Evans (2025, Feb. 3). Chris (Evans) R SAFAQ: Mapping dates to episodes. Retrieved from https://www.psyctc.org/R_blog/posts/2025-02-03-mapping-dates-to-episodes/
BibTeX citation
@misc{evans2025mapping, author = {Evans, Chris}, title = {Chris (Evans) R SAFAQ: Mapping dates to episodes}, url = {https://www.psyctc.org/R_blog/posts/2025-02-03-mapping-dates-to-episodes/}, year = {2025} }