[Started 16.ii.21, I will try to remember to put latest update date here and perhaps to mark any major changes or additions as this is more of an evolving resource than a post. Latest (cosmetic) tweak 10.iv.21]
This is a collection of practices and ways of doing things that I wish I had learned much, much earlier, as they do make life easier. One thing I didn’t realise when I started in this game was how often you have to go back to your earlier work. It happens when referees come back wanting things done differently for a paper, when you are recycling aspects of one paper or project for another (particularly when you are reusing code), and, occasionally, when someone writes to you wanting details about a project that are not in a publication. I had one of those recently and found myself digging out old data from a collaborative project where the data collection started 20 years ago and I have (fully anonymous) data from 30 years ago.
Big principles and tasks
- Talking of “fully anonymous”: have a data management plan and be clear about what you can do with the data given the laws of the time and the consent obtained at the time
- Your data management plan should cover what will happen if you die or become incapable of doing your tasks. (Ouch, I’ve got work to do on this one.)
- Make up a data analysis plan (DAP) and ideally register it somewhere before you look at the data
- If you are uncertain about whether the design will do what you wish, run some simulations if you possibly can (there is a small sketch after this list), but try not to get persecuted by inappropriate ideas of a priori “statistical power calculation”. Those ideas are absolutely right for tight experimental designs where a hypothesis testing approach fits what you want to learn, but very little of what I do really fits that paradigm, and psychology, and research more generally, is finally growing out of the obsession with very poor research design driven by the idea that all that is useful is p < .05.
- If the project is fairly large, it is probably wise to think of writing a protocol paper to go with your DAP. (I have come round to this idea only recently and remain a bit sceptical about protocol papers as just a way to get another publication: they must serve a purpose beyond what could have been served by a simple project registration and a good introduction to the first or only paper).
- Be realistic about the time that data management, analyses and writing up will take (I still almost never get this right and my estimates still range from .01 to about .9 of the time things turn out to take).
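A minimal sketch of the kind of simulation I mean, assuming a simple two-group comparison; the group size, effect size, number of simulations and the choice of a t-test here are purely illustrative, not a recommendation:
library(tidyverse)

set.seed(12345)    # make the simulation itself reproducible
nSims <- 1000
valNPerGroup <- 60 # illustrative design choice
valEffect <- 0.4   # assumed standardised between-group difference

vecP <- replicate(nSims, {
  tmpTib <- tibble(group = rep(c("A", "B"), each = valNPerGroup),
                   score = rnorm(2 * valNPerGroup,
                                 mean = ifelse(group == "B", valEffect, 0)))
  t.test(score ~ group, data = tmpTib)$p.value
})

mean(vecP < .05) # proportion of simulated studies detecting the assumed effect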
Data handling
- Data protection:
- make sure you can encrypt safely any data that needs that (but don’t encrypt everything “to be safe”: that’s part of an approach to data protection that thinks in blanket prescriptions, not about the real risks)
- don’t assume that pseudonymisation (replacing real ID codes with “hashed” codes) will protect confidentiality: it may be possible to work out the identity of some people from their combination of other variables: hunchbacked Tudor courtiers who died in the UK midlands really might only be Richard III, for a silly example. Think about “jigsaw attack” identification (reduce the possibilities to a small enough number that there is only one piece that fits) and learn about “k-anonymity” (is every combination of potentially identifying variables shared by at least k individuals in the dataset, or are there combinations that pin down a single person?)
- it’s probably wise to make sure that any storage medium holding data that is not clearly zero risk is well encrypted, and that working machines handling data are encrypted (the entire disk: unlikely to matter for my work but it’s theoretically possible to extract data from unencrypted swap space)
- if you need to be able to deanonymise/decrypt data: store the lookup tables very safely and quite separately from the data itself (doh!)
- go for “open data” if you can but don’t be naive about deanonymisation: my own position is that “clinical” data should only go into open access if you are genuinely sure no single participant can be identified. If you can’t be sure, my position is that it can’t go into an open repository no matter how good the principles of open data are.
- Data cleaning: you will need data cleaning: most non-researchers, and many researchers, are simply not good at ensuring good, clean, data collection and data entry so you will need to monitor that. If you possibly can, make data checking a continuous process that starts as early as possible after data is collected. You may never be able to correct clear errors if they are only detected later.
- Make your data cleaning as comprehensive as possible:
- go through every variable and check for impossible values
- are some apparently fixed variables not quite so fixed, meaning that you need to be able to log changes somehow (e.g. gender)?
- if the variable is measured more than once within individuals, are there impossible changes?
- now go through every variable and think whether it creates logical impossibilities for others: that, as far as I know, males cannot be pregnant is an easy example; some are more subtle
- now, if you possibly can, set up ways to trap impossible or dubious looking data and get the impossibilities changed and the dubious values double checked (there is a small sketch of one way to do this just below)
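A minimal sketch of that kind of trap in R, assuming a tibble tibDat; the variables, the plausible ranges and the gender codings here are purely illustrative:
library(tidyverse)

### purely illustrative data: in reality tibDat would come from your data collection
tibble(ID = 1:4,
       age = c(34, -2, 212, 15),
       gender = c("F", "M", "F", "Fem")) -> tibDat

tibDat %>%
  mutate(flagAge = case_when(is.na(age) ~ "missing",
                             age < 0 | age > 110 ~ "impossible",
                             age < 16 ~ "dubious: check against the consent age",
                             TRUE ~ "OK"),
         flagGender = if_else(gender %in% c("F", "M", "Other", "Prefer not to say"),
                              "OK", "check coding")) %>%
  filter(flagAge != "OK" | flagGender != "OK") -> tibProblems

tibProblems # rows to query and double check, not to delete silently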
- Data handling/coding
- Keep “raw data” quite distinct from “data to be analysed”
- Raw data:
- Have a “data log” of anything noteworthy: e.g. “CV-19 prevented data collection between d1/m1/yyy1 and d2/m2/yyy2” or “service changes forced a change in coding of variable y on dd/mm/yyyy” and keep it with the data.
- Keep it in as simple a storage format as possible and one that is congruent with how it was collected/entered, e.g. CSV (comma separated values) rather than a spreadsheet format.
- Make sure every row of data in any data table is identifiable: that may be simple sequential numbering or a hash code but you must always be able to get back to an individual row of data. I now create a variable holding the row number for every row I pull in from raw data to working data (there is a small sketch of this at the end of this list).
- If you have ID codes for units of data creation, e.g. individual research participants, make sure there are no missing values for them (!)
- If you have ID codes, ditto, make sure they are unique: no two individuals should have the same ID. If you have a data set which should only contain one row per participant,
table(table(ID))
is a great tool: it should show “1”!
- Document the variables, i.e. create a “data dictionary” and store that with the raw data (but don’t store any ID decrypting data with the raw data!)
- Try to keep the creation of “derived codings”, e.g. recoding, say ethnicity, as something done within the “data to be analysed”, not on the raw data
- In line with that last one: try to ensure that beyond necessary cleaning, the raw data is not manipulated at all.
- Have multiple, probably encrypted, copies of your raw data in multiple places: one fire or one company going bust must not lose your backups. (But make sure the backups are identical: don’t keep creating new ones, particularly of changed data, if you feel uneasy about the security of other backups.)
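A minimal sketch of those last points together, assuming the raw data sits in a CSV file with an ID variable; the file name is purely illustrative:
library(tidyverse)

read_csv("rawData.csv") %>%
  mutate(rowNumRaw = row_number()) -> tibDat # always able to get back to the raw row

sum(is.na(tibDat$ID))   # should be 0: no missing ID codes
table(table(tibDat$ID)) # should only show "1" if each participant has exactly one row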
- Data to be analysed:
- Put all your renaming, recoding and any creation of new variables at the start, before any actual analyses. If you have a good DAP, this should be easy and it prevents you burying such things within the code for actual analyses.
- It is perfect if each of your raw data files produces one data file for analyses, but with relational data that may be impossible and you may need to pull data from multiple raw data files into one file for analyses: e.g. with multiple questionnaires over time from each participant plus one demographic information file per participant, do the data merging to create just one file for analyses from the two raw data files. Again, with more complex data even that may not be possible: in those situations your DAP should ideally have defined the different data files (tables) you will need for your analyses and how they are made up from the raw data.
- Quite often, even with single data files, you may have to reshape data for particular analyses, select particular variables only to reduce RAM needs (very rare for me), or do analyses only on a subset of participants (very common for me). Don’t create lots of new datasets each time you do this: use well commented code before the actual analyses to create a temporary data set that you delete after analysing it (there is a small sketch after this list). This feels slow at the time but it makes returning to even quite simple analyses of even quite simple data easy; having multiple derived datasets can make returning to the project an absolute nightmare.
- When you have got the data you want to analyse, you should probably do all the data cleanliness checks you did during data acquisition just in case something has gone wrong.
- When you have done that data massaging, you can finally start your analyses. I try always to start with my data description analyses. These may be much more comprehensive than what will go into a paper and it may be good to make them available as supplementary material to a publication. It can be boring to do!
- Now you can do the fun bit: your final analyses!
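A minimal sketch of that temporary dataset approach; the data and variable names here are purely illustrative:
library(tidyverse)

### purely illustrative data standing in for the real analysis dataset
tibble(ID = 1:6,
       completedTherapy = c(TRUE, TRUE, FALSE, TRUE, FALSE, TRUE),
       firstScore = c(20, 18, 25, 22, 19, 24),
       lastScore = c(12, 15, 24, 10, 18, 14)) -> tibDat

### analyses for completers only: build a temporary dataset, use it, then delete it
tibDat %>%
  filter(completedTherapy) %>%
  select(ID, firstScore, lastScore) -> tmpTib

tmpTib %>%
  summarise(n = n(),
            meanChange = mean(lastScore - firstScore))

rm(tmpTib) # tidy away: no derived dataset left lying around to confuse me later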
Coding
- Do put comments in, but clear coding style is just as important. Some of my old R code is heavily commented but terribly coded and a nightmare to return to despite all the comments. If you are starting to comment because you know what you’re doing is messy or unclear, it may be better to stop and rethink!
- I don’t follow the tidyverse naming/vocabulary style guide obsessionally but it’s worth reading. I do:
- try to make functions have verb names: getCSC(), getNNA()
- try to make other object names start with letters that say something about the class of the object:
- listInfo: a list (of information!)
- tibDat: a tibble of data (this is usually my main data file for my analyses)
- dfDat or datDat: if I have a data frame not a tibble (increasingly rare)
- vecCutpoints: vector of cutting points
- valN: single values, typically the total n (yes, I know that single values are just vectors of length 1)
- matLoadings: matrices
- arr1: arrays (I think I’ve only ever used one once)
- I often use temporary objects in which case I prefix with “tmp”:
- tmpTib: any old temporary tibble
- tmpTibMeans: (better) temporary tibble of means, typically to add points for means to a violin plot or something like that
- often I will use tmp objects to insert values into the text after a block of data analysis, e.g. to put a mean into the text after getting basic descriptives (there is a tiny example at the end of this item).
- I tidy these away from time to time when I’m starting a new block of analyses with
rm(list = ls(pattern = glob2rx("tmp*")))
but it’s not really necessary as the principle is that I know I can overwrite an object starting “tmp”
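A tiny example of that use of tmp objects, assuming an Rmarkdown file and an illustrative age variable in tibDat. In a code chunk I might have
tmpValMeanAge <- mean(tibDat$age, na.rm = TRUE)
and then in the text of the Rmarkdown: The mean age was `r round(tmpValMeanAge, 1)` years.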
- I do like to have spaces around all operators:
y <- m * x + c
(though I confess the spaces around “*” still feel a bit odd to me, all the other ones I am sure help me when re-reading my code or other people’s code)
- I always use “<-” or “->” for assignments, never “=”, reserving “=” for setting the values of arguments in function calls
- I try now mostly to assign forwards, i.e. using “->” at the end of a constructor though I still use simple assignments like
bootReps <- 500
“backwards”. No logic to that!
- I’m not sure if this is good practice or not but I now try to distinguish between packages I will use repeatedly through the analyses, e.g. tidyverse and ggplot2, which I load at the top of my markdown file, and things I might use less often, which I try to call using the
package::function()
syntax. For example, I use the binconf()
function from the excellent Hmisc package quite often but I use it by calling Hmisc::binconf()
rather than by loading all of Hmisc. This minimises the mess from name collisions.
- Talking of name collisions, do watch out for warnings about a function overwriting another when you do load a package: they can be important.
- If you have an error that’s baffling you, try making calls to packaged functions explicit, e.g.
dplyr::summarise()
instead of summarise()
. Sometimes that reveals a name collision. Sometimes getting help on a function will also show a name collision by offering you two packages with the same function name. I think that trick only works if all packages have documented the function.
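A minimal illustration of that kind of collision, using the well known plyr/dplyr clash (both export summarise()); this assumes both packages are installed and is purely illustrative, not from my own code:
library(dplyr)
library(plyr) # loaded second: note the warning that it masks dplyr::summarise()

### this now silently uses plyr::summarise(), which ignores the grouping
mtcars %>% group_by(cyl) %>% summarise(meanMPG = mean(mpg))

### being explicit removes the ambiguity and restores the per-group result
mtcars %>% group_by(cyl) %>% dplyr::summarise(meanMPG = mean(mpg))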
Randomising (including bootstrapping and jackknifing)
- If you’re ever doing anything which involves randomisation (and remember that all bootstrapping does, though the classic leave-one-out jackknife is actually deterministic), then use
set.seed()
to set a seed and make sure you get reproducible results.
- You may want just to put a
set.seed()
at the top of any R or Rmarkdown you write. (Mostly just to remind yourself!)
- However, it’s probably good practice to set the seed again before any block of code that calls a random number generator: every call to a randomiser moves the random number state on, so if an earlier block makes a variable number of calls, the next block may no longer be reproducible (there is a small sketch just below).
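A minimal sketch of what I mean, using an illustrative bootstrap of a mean with the boot package (assumed installed); the seed values and R = 500 are arbitrary:
library(boot)

set.seed(12345) # at the top of the file: the whole run is reproducible

vecX <- rnorm(100) # illustrative data: this call also uses the random number generator

set.seed(12345) # reset just before the randomising block so this block is
                # reproducible on its own, whatever ran (or changed) before it
boot(vecX, statistic = function(x, i) mean(x[i]), R = 500)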
Functions
- If I am writing a function, I try always to put a comment explaining it after the function declaration unless the function is utterly trivial
- If I am writing a function I now try to think whether it really is a one off (in which case I might name it tmpGetNonsense() to underline that it’s a throwaway); if it’s not, I really want to get into the habit of adding it to my emerging CPCEfuns package so I can easily access it again later and so I can always find the latest version of it
- Where it’s not a throwaway function, I try to put some input sanity checking in. That can seem silly if the function is going to be used in a pipe where the pipe into it ought to make sure the input is sane. However, if my sanity checking throws an error, e.g.
stop("that input was not OK: must be numeric")
it really can help when debugging (there is a small sketch at the end of this section).
- Almost never use warnings: if there’s any possibility that what you’re warning about might lead to incorrect results from a function, use
stop()
. Where something really could never be fatal, I think using message()
may be wiser than using warning()
.
- I am starting to put messages in my functions a bit more as they can be suppressed in Rmarkdown output and they can help me understand things, particularly when coming back to code.
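A minimal sketch of a function with that kind of input checking and a message; getRescaled() and everything about it is purely illustrative, not from CPCEfuns:
getRescaled <- function(x, newMax = 10) {
  ### rescales a numeric vector to run from 0 to newMax
  if (!is.numeric(x)) stop("that input was not OK: x must be numeric")
  if (length(x) < 2) stop("that input was not OK: need at least two values")
  if (!is.numeric(newMax) | length(newMax) != 1) stop("newMax must be a single number")
  message("Rescaling ", length(x), " values to run from 0 to ", newMax)
  newMax * (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}

getRescaled(c(2, 5, 9)) # fine
# getRescaled("oops")   # throws the explanatory error rather than failing obscurely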