Mapping dates to episodes

R programming R tricks Tidyverse

Illustrates how to use join_by() in tidyverse R to do this.

Chris Evans https://www.psyctc.org/R_blog/ (PSYCTC.org)https://www.psyctc.org/psyctc/
02-03-2025

Started 3.ii.25

Show code
### this is just the code that creates the "copy to clipboard" function in the code blocks
htmltools::tagList(
  xaringanExtra::use_clipboard(
    button_text = "<i class=\"fa fa-clone fa-2x\" style=\"color: #301e64\"></i>",
    success_text = "<i class=\"fa fa-check fa-2x\" style=\"color: #90BE6D\"></i>",
    error_text = "<i class=\"fa fa-times fa-2x\" style=\"color: #F94144\"></i>"
  ),
  rmarkdown::html_dependency_font_awesome()
)

Background and illustrative data

This came out of a large piece of practice oriented research (POR) where some, horribly familiar, issues with the service’s software systems meant the mapping of data to episodes was wrong for some participants so I had correct dates for observations and correct start and end dates for episodes but I had remap all the data!

So (simplifying a bit to make it digestible here) I’m showing 207 rows of data from 13 which looked like this.

Show code
tmpTibDat %>%
  filter(row_number() < 35) %>%
  flextable() %>%
  autofit()

rowN

ID

Fecha.Estudio

1

1

2020-12-17

2

1

2020-12-18

3

1

2021-02-04

4

1

2021-03-24

5

1

2021-07-22

6

1

2021-09-30

7

1

2022-03-23

8

1

2022-08-31

9

1

2022-11-21

10

1

2023-01-19

11

1

2023-02-07

12

1

2023-04-27

13

1

2023-04-27

14

1

2023-07-20

15

1

2023-07-20

16

1

2023-10-20

17

1

2023-10-20

18

1

2023-11-24

19

1

2023-12-13

20

1

2024-03-14

21

1

2024-05-09

22

1

2024-06-14

23

1

2024-11-07

24

1

2024-12-13

25

2

2020-11-30

26

2

2021-04-21

27

2

2021-05-06

28

2

2021-06-22

29

2

2021-08-04

30

2

2021-11-10

31

2

2023-03-31

32

2

2023-06-12

33

3

2021-01-20

34

3

2021-01-21

Show code
tmpTibDat %>%
  group_by(ID) %>%
  summarise(n = n()) -> tmpTibDatCounts

That’s the first 34 of the 207 rows. “Fecha” is Spanish for “date” so “Fecha.Estudio” was the date on which data was collected.

The challenge was to map those “Fecha.Estudio” into these dates from this separate dataset of episode dates.

Show code
tmpTibEpisodes %>%
  as_grouped_data(groups = "ID") %>%
  flextable() %>%
  autofit()

ID

Fecha.ini

Fecha.fin

where

nEpisode

EpisodN

1

2020-02-05

2021-09-30

Comm

5

1

2021-10-18

2022-03-23

Comm

5

2

2022-08-01

2023-11-30

HDD

5

3

2023-12-01

2024-06-30

Comm

5

4

2024-10-08

2025-01-25

Comm

5

5

2

2020-11-16

2021-05-03

HDD

5

1

2021-05-04

2021-10-01

Hosp

5

2

2021-10-04

2022-03-31

HDD

5

3

2023-03-06

2023-06-09

Hosp

5

4

2023-06-12

2023-08-01

HDD

5

5

3

2020-01-13

2021-03-31

Hosp

5

1

2021-04-06

2022-06-07

HDD

5

2

2022-10-11

2023-07-31

HDD

5

3

2023-09-01

2024-03-31

Comm

5

4

2024-04-02

2025-01-25

Comm

5

5

4

2019-11-27

2021-02-03

Hosp

5

1

2021-02-08

2022-08-30

Comm

5

2

2022-08-30

2022-10-06

Hosp

5

3

2022-10-06

2023-05-12

Comm

5

4

2023-05-12

2023-05-19

Hosp

5

5

5

2020-08-05

2021-01-25

HDD

6

1

2021-01-25

2021-07-09

Hosp

6

2

2021-07-12

2022-11-30

HDD

6

3

2022-12-02

2023-01-31

Comm

6

4

2023-02-01

2023-06-30

Comm

6

5

2023-10-20

2025-01-25

HDD

6

6

6

2020-11-23

2020-12-18

Hosp

5

1

2020-12-22

2021-01-25

HDD

5

2

2021-01-26

2021-03-19

Hosp

5

3

2021-03-22

2021-04-14

HDD

5

4

2021-04-15

2021-10-01

HDD

5

5

7

2021-01-11

2021-03-31

HDD

6

1

2021-04-01

2021-04-27

Hosp

6

2

2021-04-28

2021-07-05

HDD

6

3

2021-07-06

2021-07-23

Hosp

6

4

2021-07-26

2021-08-31

HDD

6

5

2021-09-01

2024-01-31

Comm

6

6

8

2020-09-09

2021-05-08

Hosp

5

1

2021-05-10

2021-12-31

HDD

5

2

2022-03-01

2023-07-31

HDD

5

3

2023-08-01

2023-09-13

HDD

5

4

2024-01-15

2025-01-25

Comm

5

5

9

2021-09-03

2021-10-15

Hosp

9

1

2021-10-18

2021-12-10

HDD

9

2

2021-12-13

2022-02-16

Comm

9

3

2022-02-16

2022-02-25

Hosp

9

4

2022-10-03

2022-11-29

HDD

9

5

2022-11-30

2022-12-23

Hosp

9

6

2023-01-03

2023-03-22

HDD

9

7

2023-04-03

2024-04-30

Comm

9

8

2024-07-12

2025-01-25

Comm

9

9

10

2020-04-29

2020-12-18

Hosp

6

1

2020-12-21

2021-02-17

HDD

6

2

2021-02-17

2021-09-10

Hosp

6

3

2021-09-15

2023-06-20

HDD

6

4

2023-06-21

2023-12-01

Hosp

6

5

2024-03-12

2025-01-25

HDD

6

6

11

2019-09-24

2021-04-09

HDD

5

1

2021-04-12

2021-09-17

Hosp

5

2

2021-09-17

2021-11-19

HDD

5

3

2021-11-22

2022-06-01

HDD

5

4

2022-06-01

2022-08-31

Comm

5

5

12

2020-08-03

2021-03-26

Hosp

6

1

2021-03-31

2022-05-05

HDD

6

2

2022-05-06

2022-07-15

Hosp

6

3

2022-07-18

2022-10-24

HDD

6

4

2022-11-01

2023-03-31

Comm

6

5

2024-01-16

2024-03-05

HDD

6

6

13

2020-08-11

2020-12-03

HDD

5

1

2020-12-04

2021-04-17

Hosp

5

2

2021-04-19

2022-05-31

HDD

5

3

2022-06-01

2024-07-08

Comm

5

4

2024-07-09

2024-10-31

Hosp

5

5

Yes, “Fecha.ini” and “Fecha.fin” are dates of the opening of the episode and ending of it respectively and “Where” is location of the work the participant was doing in that episode: “Hosp” = “Inpatient”, “HDD” = “Day hospital”, “Comm” = “Community work”.

So here are the counts of episodes per participant.

Show code
tmpTibEpisodes %>%
  count(nEpisode) %>%
  flextable() %>%
  autofit()

nEpisode

n

5

40

6

24

9

9

Yes! I have used data from participants in the actual study who had lots of episodes!

Using join_by()

The first nice trick is to use `join_by()” from the R package dplyr (part of the “tidyverse”) to create a “within” join. Here’s the code.

byWithin <- join_by(ID,  # says to do the next bit per value of ID in the first dataset
                    ### and this is the within bit:
                    within(Fecha.Estudio, Fecha.Estudio, Fecha.ini, Fecha.fin))

That just creates a within join instruction, what it says is that Fecha.Estudio in the first dataset must be later than or equal to Fecha.ini in the second dataset and Fecha.Estudio is before or equal to Fecha.fin in the second dataset.

You can see that is using Fecha.Estudio twice. That’s because this can be used to see if an interval, rather than a single date, lies within the range in the other datase. So if your data were collected over a period from say Fecha.Estudio1 to Fecha.Estudio2 you could see if that interval lay within a therapy episode. However, our data were all collected on one date so I used Fecha.Estudio twice to create a single date to test.

Show code
byWithin <- join_by(ID,
                    within(Fecha.Estudio, Fecha.Estudio, Fecha.ini, Fecha.fin))

tmpTibDat %>%
  left_join(tmpTibEpisodes, 
            ### now tell the left_join() to use the within join that you defined:
            byWithin) -> tmpTibDat2

tmpTibDat2 %>%
  filter(row_number() < 35) %>%
  as_grouped_data(groups = "ID") %>%
  flextable() %>%
  autofit()

ID

rowN

Fecha.Estudio

Fecha.ini

Fecha.fin

where

nEpisode

EpisodN

1

1

2020-12-17

2020-02-05

2021-09-30

Comm

5

1

2

2020-12-18

2020-02-05

2021-09-30

Comm

5

1

3

2021-02-04

2020-02-05

2021-09-30

Comm

5

1

4

2021-03-24

2020-02-05

2021-09-30

Comm

5

1

5

2021-07-22

2020-02-05

2021-09-30

Comm

5

1

6

2021-09-30

2020-02-05

2021-09-30

Comm

5

1

7

2022-03-23

2021-10-18

2022-03-23

Comm

5

2

8

2022-08-31

2022-08-01

2023-11-30

HDD

5

3

9

2022-11-21

2022-08-01

2023-11-30

HDD

5

3

10

2023-01-19

2022-08-01

2023-11-30

HDD

5

3

11

2023-02-07

2022-08-01

2023-11-30

HDD

5

3

12

2023-04-27

2022-08-01

2023-11-30

HDD

5

3

13

2023-04-27

2022-08-01

2023-11-30

HDD

5

3

14

2023-07-20

2022-08-01

2023-11-30

HDD

5

3

15

2023-07-20

2022-08-01

2023-11-30

HDD

5

3

16

2023-10-20

2022-08-01

2023-11-30

HDD

5

3

17

2023-10-20

2022-08-01

2023-11-30

HDD

5

3

18

2023-11-24

2022-08-01

2023-11-30

HDD

5

3

19

2023-12-13

2023-12-01

2024-06-30

Comm

5

4

20

2024-03-14

2023-12-01

2024-06-30

Comm

5

4

21

2024-05-09

2023-12-01

2024-06-30

Comm

5

4

22

2024-06-14

2023-12-01

2024-06-30

Comm

5

4

23

2024-11-07

2024-10-08

2025-01-25

Comm

5

5

24

2024-12-13

2024-10-08

2025-01-25

Comm

5

5

2

25

2020-11-30

2020-11-16

2021-05-03

HDD

5

1

26

2021-04-21

2020-11-16

2021-05-03

HDD

5

1

27

2021-05-06

2021-05-04

2021-10-01

Hosp

5

2

28

2021-06-22

2021-05-04

2021-10-01

Hosp

5

2

29

2021-08-04

2021-05-04

2021-10-01

Hosp

5

2

30

2021-11-10

2021-10-04

2022-03-31

HDD

5

3

31

2023-03-31

2023-03-06

2023-06-09

Hosp

5

4

32

2023-06-12

2023-06-12

2023-08-01

HDD

5

5

3

33

2021-01-20

2020-01-13

2021-03-31

Hosp

5

1

34

2021-01-21

2020-01-13

2021-03-31

Hosp

5

1

Watch out for duplicated data!

That looks good but we had 207 rows of data before doing that join and now we have 210. What’s happened?

Well it’s a little gotcha, in programming jargon it’s an “corner case”: a problem arising where values of two different variables can catch you out.

Of course the R and the join have done what they should so how have we got three new rows of data?

Show code
tmpTibDat2 %>%
  group_by(rowN) %>%
  mutate(nRowN = n(),
         rowNN = row_number()) %>%
  ungroup() -> tmpTibDat2

tmpTibDat2 %>%
  filter(nRowN > 1) %>%
  as_grouped_data(groups = "rowN") %>%
  flextable()

rowN

ID

Fecha.Estudio

Fecha.ini

Fecha.fin

where

nEpisode

EpisodN

nRowN

rowNN

58

4

2022-08-30

2021-02-08

2022-08-30

Comm

5

2

2

1

4

2022-08-30

2022-08-30

2022-10-06

Hosp

5

3

2

2

60

4

2022-10-06

2022-08-30

2022-10-06

Hosp

5

3

2

1

4

2022-10-06

2022-10-06

2023-05-12

Comm

5

4

2

2

134

9

2022-02-16

2021-12-13

2022-02-16

Comm

9

3

2

1

9

2022-02-16

2022-02-16

2022-02-25

Hosp

9

4

2

2

You can see what happened there: three of the Fecha.Estudio fell on the Fecha.fin ending one episode but in each case the next episode started on the same date (pretty common when episodes are defined by transfers between levels of support). The code did the correct thing and said that the data fell within both episodes.

I fixed that by removing the mappings to the Fecha.fin, i.e. by treating all those data which were collected on a day that was both the end of an episode and the start of the next episode as coming from the second of the episodes. (That’s realistic for our data.)

Show code
tmpTibDat2 %>%
  ### remove rows where a row has been duplicated, removing the first one
  filter(!(nRowN > 1 & rowNN == 1)) -> tmpTibDat2

That’s fixed it and now nrow(tmpTibDat2) = 207.

Avoiding the corner case can be worse …

You could avoid this by redefining your limits.

Show code
tmpTibEpisodes %>%
  ### define Fecha.fin.eve as the day before Fecha.fin (uses R date arithmetic which assumes we are counting days)
  mutate(Fecha.fin.eve = Fecha.fin - 1) -> tmpTibEpisodes2

byWithin <- join_by(ID,
                    within(Fecha.Estudio, Fecha.Estudio, Fecha.ini, Fecha.fin.eve))

tmpTibDat %>%
  left_join(tmpTibEpisodes2, 
            ### now tell the left_join() to use the within join that you defined:
            byWithin) -> tmpTibDat3

That seems fine: nrow(tmpTibDat3) = 207 but it’s not fine as there are data rows where the Fecha.Estudio was on the Fecha.fin that aren’t mapped to episodes.

Show code
tmpTibDat3 %>%
  filter(is.na(Fecha.ini)) %>%
  flextable() %>%
  colformat_date(na_str = "NA") %>%
  colformat_char(na_str = "NA") %>%
  colformat_num(na_str = "NA") %>%
  bg(i = ~ is.na(Fecha.ini), j = 5:9, bg = "red") %>%
  autofit()

rowN

ID

Fecha.Estudio

Fecha.ini

Fecha.fin

where

nEpisode

EpisodN

Fecha.fin.eve

6

1

2021-09-30

NA

NA

NA

NA

NA

NA

7

1

2022-03-23

NA

NA

NA

NA

NA

NA

98

7

2021-07-05

NA

NA

NA

NA

NA

NA

131

9

2021-10-15

NA

NA

NA

NA

NA

NA

150

10

2020-12-18

NA

NA

NA

NA

NA

NA

174

11

2021-04-09

NA

NA

NA

NA

NA

NA

185

11

2022-08-31

NA

NA

NA

NA

NA

NA

194

12

2022-05-05

NA

NA

NA

NA

NA

NA

201

12

2023-03-31

NA

NA

NA

NA

NA

NA

220

13

2024-10-31

NA

NA

NA

NA

NA

NA

That shows what happens to be ten rows of data that couldn’t be mapped by the join_with(): they are included but there are missing values where the mapping should have mapped the rows to episodes.

Now I can plot it!

But the correct mapping makes it easy, using ggplot, to map the data collection to the episodes.

Show code
### create a variable firstFechaIni by which to sort the participants on the y axis
tmpTibDat2 %>%
  group_by(ID) %>%
  mutate(firstFechaIni = first(Fecha.ini)) %>%
  ungroup() -> tmpTibDat2

### create colour mapping for regimes
vecWhereColours <- c("Hosp" = "red",
                     "HDD" = "orange",
                     "Comm" = "green")

ggplot(data = tmpTibDat2,
       aes(x = Fecha.Estudio, 
           ### reorder the ID values so the earliest starts lowest
           y = reorder(ID, firstFechaIni))) +
  ### plot the episodes, linewidth makes the lines bars
  geom_linerange(aes(xmin = Fecha.ini, xmax = Fecha.fin,
                     colour = where),
                 linewidth = 5) +
  ### superimpose the data collection dates
  geom_point(aes(x = Fecha.Estudio),
             size = 1) +
  ylab("ID") +
  ### use the colour mapping
  scale_color_manual("Where",
                       values = vecWhereColours) +
  ### nice easy date axis mapping
  ### though I never remember it!
  scale_x_date(name = "Date",
               breaks = "3 months",
               date_labels = "%b-%y") +
  ### cosmetics for that date mapping on the x axis ...
  theme(axis.text.x = element_text(angle = 70, hjust = 1)) +
  ggtitle("Mapping data collection to episodes",
          subtitle = "Data collection left censored by change of recording system")
Show code
### save to make a png available to distill ()
# ggsave("TimeMap.png")

As the title says, that looks as if a lot of early data is missing, it’s not really, it was just collected before the services changed their data collection software and I didn’t bother to merge the earlier data in for this little explanation of the way of mapping the dates to the episodes.

Learning points

But using such a join is hugely easier than coding your own and orders of magnitude faster than any code to do that I at least might write!

History

Visit count

free web counters

Last updated

Show code
cat(paste(format(Sys.time(), "%d/%m/%Y"), "at", format(Sys.time(), "%H:%M")))
17/02/2025 at 17:15

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY-SA 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Evans (2025, Feb. 3). Chris (Evans) R SAFAQ: Mapping dates to episodes. Retrieved from https://www.psyctc.org/R_blog/posts/2025-02-03-mapping-dates-to-episodes/

BibTeX citation

@misc{evans2025mapping,
  author = {Evans, Chris},
  title = {Chris (Evans) R SAFAQ: Mapping dates to episodes},
  url = {https://www.psyctc.org/R_blog/posts/2025-02-03-mapping-dates-to-episodes/},
  year = {2025}
}