3 Preparation

In this section, we show all code that was used to process the data.

3.1 MRI data to BIDS

3.1.1 Overview

3.1.1.1 Data availability

The data is freely available from https://github.com/lnnrtwttkhn/highspeed-bids and https://gin.g-node.org/lnnrtwttkhn/highspeed-bids.

3.1.1.2 License

The dataset is licensed under Creative Commons Attribution-ShareAlike 4.0. Please see https://creativecommons.org/licenses/by-sa/4.0/ for details.

3.1.2 Step 1: Conversion of MRI data to the Brain Imaging Data Structure (BIDS) using HeuDiConv

3.1.2.1 Overview

After the MRI data were acquired at the scanner, we converted all data to adhere to the Brain Imaging Data Structure (BIDS) standard. Please see the paper by Gorgolewski et al., 2016, Scientific Data for details. In short, BIDS is a community standard for organizing and describing MRI datasets.
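
For orientation, the top level of a BIDS dataset generally looks like the following sketch (a generic illustration, not the actual file listing of this dataset):

# generic BIDS layout (illustrative only):
# dataset_description.json
# participants.tsv
# sub-01/
#     ses-01/
#         anat/    structural images (e.g., T1w)
#         func/    functional task and resting-state runs
#         fmap/    fieldmaps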

3.1.2.2 Container: heudiconv container, version 0.6.0

We used HeuDiConv, version 0.6.0, to convert our MRI DICOM data to the BIDS structure. First, we created a Singularity container with HeuDiConv in our cluster environment at the Max Planck Institute for Human Development, Berlin, Germany. The Singularity container is separately available at https://github.com/lnnrtwttkhn/tools and was created using:

singularity pull docker://nipy/heudiconv:0.6.0

For the conversion of the DICOM data acquired at the MRI scanner into BIDS-formatted NIfTI files, the following scripts were used (they can be found in the code/heudiconv/ directory):

3.1.2.3 Mapping raw DICOMS to BIDS: highspeed-heudiconv-heuristic.py

highspeed-heudiconv-heuristic.py is a Python script that creates a mapping between the DICOMs and the NIfTI-converted files in the BIDS structure.

#!/usr/bin/env python


def create_key(template, outtype=('nii.gz',), annotation_classes=None):
    if template is None or not template:
        raise ValueError('Template must be a valid format string')
    return template, outtype, annotation_classes


def infotodict(seqinfo):

    # paths in BIDS format
    anat = create_key('sub-{subject}/{session}/anat/sub-{subject}_{session}_rec-{rec}_T1w')
    rest_pre = create_key('sub-{subject}/{session}/func/sub-{subject}_{session}_task-rest_rec-{rec}_run-pre_bold')
    rest_post = create_key('sub-{subject}/{session}/func/sub-{subject}_{session}_task-rest_rec-{rec}_run-post_bold')
    task = create_key('sub-{subject}/{session}/func/sub-{subject}_{session}_task-highspeed_rec-{rec}_run-{item:02d}_bold')
    fmap_topup = create_key('sub-{subject}/{session}/fmap/sub-{subject}_{session}_rec-{rec}_dir-{dir}_epi')
    info = {anat: [], rest_pre: [], rest_post: [], task: [], fmap_topup: []}

    for s in seqinfo:

        if 'NORM' in s.image_type:
            rec = 'prenorm'
        else:
            rec = 'nonorm'

        if ('t1' in s.series_description):
            info[anat].append({'item': s.series_id, 'rec': rec})
        if ('FM_' in s.series_description) and ('prenorm' in rec):
            info[fmap_topup].append({'item': s.series_id, 'rec': rec, 'dir': 'AP'})
        if ('FMInv_' in s.series_description) and ('prenorm' in rec):
            info[fmap_topup].append({'item': s.series_id, 'rec': rec, 'dir': 'PA'})
        if ('Rest_Pre' in s.series_description) and ('prenorm' in rec):
            info[rest_pre].append({'item': s.series_id, 'rec': rec})
        if ('Rest_Post' in s.series_description) and ('prenorm' in rec):
            info[rest_post].append({'item': s.series_id, 'rec': rec})
        # some participants have one post resting state labelled "Rest":
        if ('Rest' in s.series_description) and ('prenorm' in rec):
            info[rest_post].append({'item': s.series_id, 'rec': rec})
        if ('Run' in s.series_description) and ('prenorm' in rec):
            info[task].append({'item': s.series_id, 'rec': rec})

    return info
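
To illustrate, the create_key templates above translate into BIDS file names of the following form (hypothetical subject and session identifiers; only the pre-normalized reconstruction is shown):

# illustrative output paths implied by the heuristic templates above:
# sub-01/ses-01/anat/sub-01_ses-01_rec-prenorm_T1w.nii.gz
# sub-01/ses-01/fmap/sub-01_ses-01_rec-prenorm_dir-AP_epi.nii.gz
# sub-01/ses-01/fmap/sub-01_ses-01_rec-prenorm_dir-PA_epi.nii.gz
# sub-01/ses-01/func/sub-01_ses-01_task-rest_rec-prenorm_run-pre_bold.nii.gz
# sub-01/ses-01/func/sub-01_ses-01_task-rest_rec-prenorm_run-post_bold.nii.gz
# sub-01/ses-01/func/sub-01_ses-01_task-highspeed_rec-prenorm_run-01_bold.nii.gz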

3.1.2.4 Changing participant IDs: highspeed-heudiconv-anonymizer.py

highspeed-heudiconv-anonymizer.py is an executable Python script that is used in combination with the --anon-cmd flag of the heudiconv command. It turns the original participant IDs (which we used when running the study in the lab) into consecutive zero-padded numbers (see this thread on neurostars.org for details).

#!/usr/bin/env python
# ==============================================================================
# SCRIPT INFORMATION:
# ==============================================================================
# SCRIPT: ANONYMIZE PARTICIPANT IDS DURING BIDS-CONVERSION WITH HEUDICONV
# PROJECT: HIGHSPEED
# WRITTEN BY LENNART WITTKUHN, 2018 - 2020
# CONTACT: WITTKUHN AT MPIB HYPHEN BERLIN DOT MPG DOT DE
# MAX PLANCK RESEARCH GROUP NEUROCODE
# MAX PLANCK INSTITUTE FOR HUMAN DEVELOPMENT
# MAX PLANCK UCL CENTRE FOR COMPUTATIONAL PSYCHIATRY AND AGEING RESEARCH
# LENTZEALLEE 94, 14195 BERLIN, GERMANY
# ==============================================================================
# import relevant packages:
import sys
import os
# define paths depending on the operating system:
if 'linux' in sys.platform:
    # define the path to the text file containing the subject IDs:
    path_sublist = os.path.join("/code", "highspeed-participant-list.txt")
# read the list of original subject IDs from the text file:
ids_orig = open(path_sublist, "r").read().splitlines()
# create new consecutive zero-padded subject IDs:
ids_new = ["%02d" % t for t in range(1, len(ids_orig)+1)]
# create a dictionary mapping original ids to anonymized ids:
subj_map = dict(zip(ids_orig, ids_new))
# get the original subject ID passed by heudiconv as the last argument:
sid = sys.argv[-1]
# print the corresponding zero-padded ID so heudiconv uses it as the new ID:
if sid in subj_map:
    print(subj_map[sid])

As a side note, this anonymization step is not strictly necessary, since zero-padded numbers are not required by the BIDS standard.
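
For illustration, heudiconv calls this script via the --anon-cmd flag, passing the original subject ID as the last command-line argument and using whatever the script prints as the new ID (ORIGINAL_ID below is a placeholder, not an actual ID from the participant list):

# hypothetical stand-alone call; prints e.g. "01" if ORIGINAL_ID is the first
# entry in highspeed-participant-list.txt:
./highspeed-heudiconv-anonymizer.py ORIGINAL_ID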

3.1.2.5 Running heudiconv on the cluster: highspeed-heudiconv-cluster.sh

highspeed-heudiconv-cluster.sh is a bash script that parallelizes the HeuDiConv command for each participant on the high-performance cluster of the Max Planck Institute for Human Development, Berlin, Germany.

Note that, due to privacy concerns, we only saved the BIDS-converted data in the repository after running the defacing procedure (see details below).

#!/usr/bin/bash
# ==============================================================================
# SCRIPT INFORMATION:
# ==============================================================================
# SCRIPT: PARALLELIZE BIDS CONVERSION USING HEUDICONV ON THE MPIB CLUSTER
# PROJECT NAME: HIGHSPEED
# WRITTEN BY LENNART WITTKUHN, 2018 - 2020
# CONTACT: WITTKUHN AT MPIB HYPHEN BERLIN DOT MPG DOT DE
# MAX PLANCK RESEARCH GROUP NEUROCODE
# MAX PLANCK INSTITUTE FOR HUMAN DEVELOPMENT
# MAX PLANCK UCL CENTRE FOR COMPUTATIONAL PSYCHIATRY AND AGEING RESEARCH
# LENTZEALLEE 94, 14195 BERLIN, GERMANY
# ==============================================================================
# DEFINE ALL PATHS:
# ==============================================================================
PATH_BASE="${HOME}"
# path to the project root directory
PATH_ROOT="${PATH_BASE}/highspeed"
# define the name of the project:
PROJECT_NAME="highspeed-bids"
# define the path to the project folder:
PATH_PROJECT="${PATH_ROOT}/${PROJECT_NAME}"
# define the path to the input directory:
PATH_INPUT="${PATH_PROJECT}/input/mri"
# define the path to the output directory
PATH_OUTPUT="${PATH_PROJECT}"
# define the path to the singularity container:
PATH_CONTAINER="${PATH_PROJECT}/tools/heudiconv/heudiconv_0.6.0.sif"
# define the path to the code main directory:
PATH_CODE="${PATH_PROJECT}/code/heudiconv"
# path to the heudiconv heuristic file:
HEURISTIC_FILE="highspeed-heudiconv-heuristic.py"
# define path to the python executable file that anonymizes the subject ids:
ANON_FILE="highspeed-heudiconv-anonymizer.py"
# make the anonymizer file executable:
chmod +x "${PATH_CODE}/$ANON_FILE"
# path to the directory where error and out path_logs of cluster jobs are saved:
PATH_LOGS="${PATH_PROJECT}/logs/heudiconv/$(date '+%Y%m%d_%H%M%S')"
# path to the text file with all subject ids:
PATH_SUB_LIST="${PATH_CODE}/highspeed-participant-list.txt"
# ==============================================================================
# CREATE RELEVANT DIRECTORIES:
# ==============================================================================
# create directory for log files:
if [ ! -d ${PATH_LOGS} ]; then
    mkdir -p ${PATH_LOGS}
    echo "created ${PATH_LOGS}"
fi
# ==============================================================================
# DEFINE PARAMETERS:
# ==============================================================================
# maximum number of cpus per process:
N_CPUS=1
# memory demand in *GB*
MEM_GB=6
# memory demand in *MB*
MEM_MB="$((${MEM_GB} * 1000))"
# read subject ids from the list of the text file
SUB_LIST=$(cat ${PATH_SUB_LIST} | tr '\n' ' ')
# ==============================================================================
# RUN HEUDICONV:
# ==============================================================================
# initialize a subject counter:
SUB_COUNT=0
# loop over all subjects:
for SUB in ${SUB_LIST}; do
    # update the subject counter:
    let SUB_COUNT=SUB_COUNT+1
    # get the subject number with zero padding:
    SUB_PAD=$(printf "%02d\n" $SUB_COUNT)
    # loop over all sessions:
    for SES in `seq 1 2`; do
        # get the session number with zero padding:
        SES_PAD=$(printf "%02d\n" $SES)
        # define the dicom template for the heudiconv command:
        DICOM_DIR_TEMPLATE="HIGHSPEED_{subject}_HIGHSPEED_{subject}_${SES}*/*/*/*IMA"
        # check the existence of the input files and continue if data is missing:
        if [ ! -d ${PATH_INPUT}/HIGHSPEED_${SUB}_HIGHSPEED_${SUB}_${SES}_* ]; then
            echo "No data input available for sub-${SUB} ses-${SES_PAD}!"
            continue
        fi
        # start slurm job file:
        echo "#!/bin/bash" > job
        # name of the job:
        echo "#SBATCH --job-name heudiconv_sub-${SUB_PAD}_ses-${SES_PAD}" >> job
        # select partition:
        echo "#SBATCH --partition gpu" >> job
        # set the expected maximum running time for the job:
        echo "#SBATCH --time 12:00:00" >> job
        # determine how much RAM your operation needs:
        echo "#SBATCH --mem ${MEM_GB}GB" >> job
        # request multiple cpus
        echo "#SBATCH --cpus-per-task ${N_CPUS}" >> job
        # write output and error log to log folder:
        echo "#SBATCH --output ${PATH_LOGS}/slurm-heudiconv-%j.out" >> job
        # email notification on abort/end, use 'n' for no notification:
        echo "#SBATCH --mail-type NONE" >> job
        # set working directory
        echo "#SBATCH --workdir ." >> job
        # define the heudiconv command:
        echo "singularity run --contain -B ${PATH_INPUT}:/input:ro \
        -B ${PATH_OUTPUT}:/output:rw -B ${PATH_CODE}:/code:ro \
        ${PATH_CONTAINER} -d /input/${DICOM_DIR_TEMPLATE} -s ${SUB} \
        --ses ${SES_PAD} -o /output -f /code/${HEURISTIC_FILE} \
        --anon-cmd /code/${ANON_FILE} -c dcm2niix -b --overwrite" >> job
        # submit job to cluster queue and remove it to avoid confusion:
        sbatch job
        rm -f job
    done
done

We acquired both pre-normalized and non-normalized MRI data (see, e.g., here for more information on pre-normalization). All analyses reported in the paper were based on the pre-normalized data. Only the pre-normalized dataset is published, because uploading the dataset in two versions (with and without pre-normalization) would cause interference when running fMRIPrep (see here).

3.1.3 Step 2: Removal of facial features using pydeface

3.1.3.1 Overview

Facial features need to be removed from structural images before sharing the data online. See the statement from openfmri.org below regarding the importance of defacing:

To protect the privacy of the individuals who have been scanned we require that all subjects be de-identified before publishing a dataset. For the purposes of fMRI de-facing is the preferred method of de-identification of scan data. Skull stripped data will not be accepted for publication.

and a second statement from the openneuro.org FAQs:

Yes. We recommend using pydeface. Defacing is strongly preferred over skullstripping, because the process is more robust and yields lower chance of accidentally removing brain tissue.

3.1.3.2 Container: pydeface container, version 2.0.0

Defacing of all structural images was performed using pydeface, version 2.0.0 (with nipype version 1.3.0-rc1).

To ensure robustness of the defacing procedure, we used a Singularity container for pydeface, which we installed as follows:

singularity pull docker://poldracklab/pydeface:37-2e0c2d

All scripts that we used for defacing can be found in the code/defacing directory.
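
For reference, the core container call that the cluster script below wraps into SLURM jobs looks roughly as follows for a single structural image (the input path and file name are hypothetical placeholders):

# minimal sketch of a single pydeface call through the container; the defaced
# image is written next to the input file with the suffix _defaced.nii.gz:
singularity run --contain -B /path/to/bids/sub-01/ses-01/anat:/input:rw \
    tools/pydeface/pydeface_37-2e0c2d.sif \
    pydeface /input/sub-01_ses-01_rec-prenorm_T1w.nii.gz --force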

3.1.3.3 Defacing structural images: highspeed-defacing-cluster.sh

First, we ran highspeed-defacing-cluster.sh to deface all structural images found in the BIDS dataset. Note that this script was optimized to run on the high-performance cluster of the Max Planck Institute for Human Development, Berlin, Germany.

#!/bin/bash
# ==============================================================================
# SCRIPT INFORMATION:
# ==============================================================================
# SCRIPT: DEFACING ANATOMICAL MRI DATA IN A BIDS DATASET
# PROJECT: HIGHSPEED
# WRITTEN BY LENNART WITTKUHN, 2018 - 2020
# CONTACT: WITTKUHN AT MPIB HYPHEN BERLIN DOT MPG DOT DE
# MAX PLANCK RESEARCH GROUP NEUROCODE
# MAX PLANCK INSTITUTE FOR HUMAN DEVELOPMENT
# MAX PLANCK UCL CENTRE FOR COMPUTATIONAL PSYCHIATRY AND AGEING RESEARCH
# LENTZEALLEE 94, 14195 BERLIN, GERMANY
# ==============================================================================
# DEFINE ALL PATHS:
# ==============================================================================
# define home directory:
PATH_BASE="${HOME}"
# define the name of the current task:
TASK_NAME="pydeface"
# define the name of the project:
PATH_ROOT="${PATH_BASE}/highspeed"
# define the name of the project:
PROJECT_NAME="highspeed-bids"
# define the path to the project folder:
PATH_PROJECT="${PATH_ROOT}/${PROJECT_NAME}"
# define the path to the singularity container:
PATH_CONTAINER="${PATH_PROJECT}/tools/${TASK_NAME}/${TASK_NAME}_37-2e0c2d.sif"
# path to the log directory:
PATH_LOG="${PATH_PROJECT}/logs/${TASK_NAME}"
# path to the data directory (in BIDS format):
PATH_BIDS="${PATH_PROJECT}"
# ==============================================================================
# CREATE RELEVANT DIRECTORIES:
# ==============================================================================
# create directory for log files:
if [ ! -d ${PATH_LOG} ]; then
    mkdir -p ${PATH_LOG}
else
    # remove old log files inside the log container:
    rm -r ${PATH_LOG}/*
fi
# ==============================================================================
# DEFINE PARAMETERS:
# ==============================================================================
# maximum number of cpus per process:
N_CPUS=1
# memory demand in *MB*
MEM_MB=500
# memory demand in *KB*
MEM_KB="$((${MEM_MB} * 1000))"
# ==============================================================================
# RUN PYDEFACE:
# ==============================================================================
for FILE in ${PATH_BIDS}/*/*/anat/*T1w.nii.gz; do
    # to just get filename from a given path:
    FILE_BASENAME="$(basename -- $FILE)"
    # get the parent directory:
    FILE_PARENT="$(dirname "$FILE")"
    # create cluster job:
    echo "#!/bin/bash" > job
    # name of the job
    echo "#SBATCH --job-name pydeface_${FILE_BASENAME}" >> job
    # set the expected maximum running time for the job:
    echo "#SBATCH --time 1:00:00" >> job
    # determine how much RAM your operation needs:
    echo "#SBATCH --mem ${MEM_MB}MB" >> job
    # email notification on abort/end, use 'n' for no notification:
    echo "#SBATCH --mail-type NONE" >> job
    # write log to log folder:
    echo "#SBATCH --output ${PATH_LOG}/slurm-%j.out" >> job
    # request multiple cpus
    echo "#SBATCH --cpus-per-task ${N_CPUS}" >> job
    # define the main command:
    echo "singularity run --contain -B ${FILE_PARENT}:/input:rw ${PATH_CONTAINER} \
    pydeface /input/${FILE_BASENAME} --force" >> job
    # submit job to cluster queue and remove it to avoid confusion:
    sbatch job
    rm -f job
done

3.1.3.4 Replacing defaced with original images: highspeed-defacing-cleanup.sh

pydeface creates a new file ending in T1w_defaced.nii.gz. As fMRIPrep, MRIQC and other tools need to use the defaced instead of the original images, we need to replace the original images with the defaced ones. This can be done in a separate step after pydeface has run.
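
Concretely, the replacement amounts to the following for a single file (hypothetical file names):

# overwrite the original image with its defaced counterpart:
mv sub-01/ses-01/anat/sub-01_ses-01_rec-prenorm_T1w_defaced.nii.gz \
    sub-01/ses-01/anat/sub-01_ses-01_rec-prenorm_T1w.nii.gz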

In order to replace the original structural images with the defaced ones (as recommended), we ran highspeed-defacing-cleanup.sh.

#!/bin/bash
# ==============================================================================
# SCRIPT INFORMATION:
# ==============================================================================
# SCRIPT: REPLACING ORIGINAL STRUCTURAL IMAGES WITH DEFACED STRUCTURAL IMAGES
# PROJECT: HIGHSPEED
# WRITTEN BY LENNART WITTKUHN, 2018 - 2020
# CONTACT: WITTKUHN AT MPIB HYPHEN BERLIN DOT MPG DOT DE
# MAX PLANCK RESEARCH GROUP NEUROCODE
# MAX PLANCK INSTITUTE FOR HUMAN DEVELOPMENT
# MAX PLANCK UCL CENTRE FOR COMPUTATIONAL PSYCHIATRY AND AGEING RESEARCH
# LENTZEALLEE 94, 14195 BERLIN, GERMANY
# ==============================================================================
# DEFINE ALL PATHS:
# ==============================================================================
# define home directory
PATH_BASE="${HOME}"
# define the name of the current task:
TASK_NAME="pydeface"
# define the name of the project:
PROJECT_NAME="highspeed"
# path to the data directory (in bids format):
PATH_BIDS="${PATH_BASE}/${PROJECT_NAME}/highspeed-bids"
# ==============================================================================
# REMOVE ORIGINAL T1W IMAGES AND REPLACE WITH DEFACED ONES:
# ==============================================================================
for FILE in ${PATH_BIDS}/*/*/anat/*T1w_defaced.nii.gz; do
    # to just get filename from a given path:
    FILE_BASENAME="$(basename -- $FILE)"
    # get the parent path of directories:
    FILE_PARENT="$(dirname "$FILE")"
    # get the file name without the _defaced extension:
    FILE_NEW="${FILE_BASENAME//_defaced}"
    # remove the undefaced T1w file:
    rm -rf ${FILE_PARENT}/${FILE_NEW}
    echo "removed ${FILE_PARENT}/${FILE_NEW}"
    # replace the original T1w image with the defaced version:
    mv ${FILE} ${FILE_PARENT}/${FILE_NEW}
    echo "replaced with ${FILE}"
done

3.1.3.5 Resources

3.1.4 Step 3: Adding BIDS *events.tsv files

3.1.4.1 Overview

According to the BIDS specification

The purpose of this file is to describe timing and other properties of events recorded during the scan.

Thus, we transformed all behavioral data acquired during the scanning sessions into run-specific *events.tsv files that are compatible with the BIDS standard.

3.1.4.2 Code and software

We mainly used two MATLAB files that were run in MATLAB version R2017b. The code takes the original behavioral data files acquired during scanning as input and returns run-specific BIDS-compatible *events.tsv files.
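
As a sketch of the expected result, each generated file can be inspected as follows (the subject, session, and run in the file name are hypothetical; the column set follows the event structure assembled in the code below):

# inspect the header of one generated events file:
head -n 1 sub-01/ses-01/func/sub-01_ses-01_task-highspeed_rec-prenorm_run-01_events.tsv
# expected columns: onset, duration, subject, session, run_session, run_study,
# trial, condition, trial_type, serial_position, interval_time, stim_orient,
# stim_index, stim_label, target, nontarget, key_down, key_id, key_target,
# accuracy, response_time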

%% SCRIPT: CREATE EVENT.TSV FILES FROM THE BEHAVIORAL DATA FOR BIDS
% =========================================================================
% PROJECT: HIGHSPEED
% WRITTEN BY LENNART WITTKUHN 2018 - 2020
% CONTACT: WITTKUHN AT MPIB HYPHEN BERLIN DOT MPG DOT DE
% MAX PLANCK RESEARCH GROUP NEUROCODE
% MAX PLANCK INSTITUTE FOR HUMAN DEVELOPMENT
% MAX PLANCK UCL CENTRE FOR COMPUTATIONAL PSYCHIATRY AND AGEING RESEARCH
% LENTZEALLEE 94, 14195 BERLIN, GERMANY
% =========================================================================
%% DEFINE PATHS AND IMPORTANT VARIABLES:
% clear the workspace and command window:
clear variables; clc;
% define the data root path
path_root = strsplit(pwd, 'code');
path_root = path_root{1};
% define the input path:
path_input = fullfile(path_root, 'input', 'behavior', 'main');
% define the output path:
path_output = path_root;
% get the contents of the output directory:
path_output_dir = dir(path_output);
% check how many subjects are in the root directory:
num_subs_found = sum(contains({path_output_dir.name},'sub'));
% extended output path used to check for old files:
path_old_files = fullfile(path_output,'*','*','func');
% find all existing events.tsv files in the output directory:
prev_files = dir(fullfile(path_old_files,'*events.tsv'));
% delete all previous events files:
for old_file = 1:length(prev_files)
      delete(fullfile(prev_files(old_file).folder,prev_files(old_file).name))
end
% define the script path:
path_script = fullfile(path_root, 'code');
% read the text file containing a list of subject ids:
sub_list = dlmread(fullfile(path_script, 'heudiconv', 'highspeed-participant-list.txt'));
% turn the array with ids into a strings in a cell array:
sub_list = cellstr(num2str(sub_list));
%check if the number of subjects in the list matches the target directory
if numel(sub_list) ~= num_subs_found
    warning(['Number of subjects in the data dir does not match ' ...
        'number of subjects in the subject text file!']);
    sub_alt_list = cellfun(@num2str,num2cell(1:length(sub_list)),'un',0);
else
    sub_alt_list = sub_list;
    sub_alt_list = cellfun(@num2str,num2cell(1:num_subs_found),'un',0);
end
% determine the number of study sessions:
num_ses = 2;
% determine the number of task runs per study session:
num_run = 4;
% define a cell array containing the stimulus labels in german:
key_set = {'Gesicht','Haus','Katze','Schuh','Stuhl'};
% define a cell array containing the stimulus labels in english:
value_set = {'Face','House','Cat','Shoe','Chair'};
% create a dictionary that translates the stimulus labels:
label_dict = containers.Map(key_set,value_set);
% create a 2d-array of run indices ordered by run (row) and sessions (col):
run_array = reshape(1:num_run * num_ses, num_run, num_ses);
% define the names of the four different task conditions:
task_names = {'oddball','sequence','repetition','repetition'};
%%
for sub = 1:length(sub_alt_list)
%for sub = 1:1
    % initialize the maximum repetition trial index:
    max_rep = 0;
    % get the current subject id:
    sub_id = sub_list{sub};
    % print progress:
    fprintf('Running sub %d of %d\n', sub, length(sub_alt_list))
    % define a template string that takes subject, session and run id:
    template_string = '*sub_%s_session_%d*run_%d*';
    % put in the current subject, session and run id:
    file_string = sprintf(template_string,sub_id,num_ses,num_run);
    % read behavioral data files of all participants:
    path_file = dir(fullfile(path_input,file_string));
    % load the behavioral data into the workspace:
    load(fullfile(path_input,path_file.name));
    for session = 1:num_ses
        % create a subject identifier (in bids format):
        pad_sub = sprintf('sub-%02d',str2double(sub_alt_list{sub}));
        % create a session identifier (in bids format):
        pad_ses = ['ses-0', num2str(session)];
        % combine the two identifiers as the first part of file names:
        sub_file_name = strcat(pad_sub,'_',pad_ses);
        % create the subject output path:
        path_output_sub = (fullfile(path_output,pad_sub,pad_ses,'func'));
        % create the subject directory if it does not exist yet:
        if ~exist(path_output_sub,'dir')
            system(sprintf('mkdir -p %s',path_output_sub));
        end
        for run = 1:num_run
            events = table;
            for cond = 1:4                
                event_all = extract_bids_events(Data, Basics, Sets, pad_sub, run, session, cond);
                events = [events;event_all];
                
            end
            % sort by event onset (i.e., in chronological order):
            events = sortrows(events,{'onset'});
            % make two copies of the repetition trials:
            rep_trials_old = events.trial(contains(events.condition, 'repetition'));
            rep_trials_new = rep_trials_old;
            % get the old trial indices while maintaining their order:
            trial_old = unique(rep_trials_old, 'stable');
            % get the number of repetition trials in the current run:
            n_rep_trials = length(trial_old);
            % create new trial indices depending on the running number of
            % repetition trials:
            trial_new = max_rep+1:max_rep+n_rep_trials;
            % change the old trial indices
            for i = 1:n_rep_trials
                rep_trials_new(rep_trials_old == trial_old(i)) = trial_new(i);
            end
            % update the repetition trials of the events files:
            events.trial(contains(events.condition, 'repetition')) = rep_trials_new;
            % update the counter of the maximum repetition trial index:
            max_rep = max(unique(events.trial(contains(events.condition, 'repetition'))));
            % create template string file for data output (tsv format):
            string_template = '_task-highspeed_rec-prenorm_run-0%d_events';
            % write condition and run information into the string:
            string_written = sprintf(string_template,run);
            % create the full filenames:
            outfile_name = strcat(sub_file_name,string_written);
            % create paths of the tsv and csv files:
            path_tsv = fullfile(path_output_sub,strcat(outfile_name,'.tsv'));
            path_csv = fullfile(path_output_sub,strcat(outfile_name,'.csv'));
            % write the events table as csv file:
            writetable(events,path_csv,'Delimiter','\t');
            % copy the created file from csv to tsv file:
            copyfile(path_csv,path_tsv)
            % delete the csv file:
            delete(path_csv);
        end
    end
end
function event_all = extract_bids_events(Data, Basics, Sets, pad_sub, run, ses, cond)
%% CONVERT DATASET TO TABLE AND GET THE VARIABLE NAMES:
% convert data table from dataset to table type:
data = dataset2table(Data(cond).data);
% get the variable names of the current data table:
var_names = data.Properties.VariableNames;
% get all events we want to convert:
all_events = var_names(contains(var_names, {'tFlip'}));
%% BASIC TASK STATS
% determine the number of study sessions:
num_ses = 2;
% determine the number of task runs per study session:
num_run = 4;
% get the indices of the current sessions (as booleans):
idx_session = Basics.runInfo.session == ses;
% get the indices of the current run (as booleans):
idx_run = Basics.runInfo.run == run;
% get the timestamp of the first scanner trigger:
t_trigger = Basics.runInfo.tTrigger(idx_session & idx_run);
% get the data indices of the current session:
idx_data_ses = data.session == ses;
% get the data indices of the current run within session:
idx_data_run = data.run == run;
% combine the indices to get the correct data indices:
index = idx_data_ses & idx_data_run;
% create a 2d-array of run indices ordered by run (row) and sessions (col):
run_array = reshape(1:num_run * num_ses, num_run, num_ses);
% define the names of the four different task conditions:
task_names = {'oddball','sequence','repetition','repetition'};
%% DEFINE DICTIONARY FOR THE STIMULUS LABELS:
% define a cell array containing the stimulus labels in german:
keys_stim = {'Gesicht','Haus','Katze','Schuh','Stuhl'};
% define a cell array containing the stimulus labels in english:
value_stim = {'face','house','cat','shoe','chair'};
% create a dictionary that translates the stimulus labels:
dict_stim = containers.Map(keys_stim,value_stim);
%% DEFINE DICTIONARY FOR THE EVENTS:
% define a cell array containing the stimulus labels in german:
keys_type = {'tFlipCue','tFlipBlank','tFlipFix','tFlipStim','tFlipITI','tFlipDelay','tFlipResp','tResponse'};
% define a cell array containing the stimulus labels in english:
value_type = {'cue','blank','fixation','stimulus','interval','delay','choice','response'};
% create a dictionary that translates the stimulus labels:
dict_type = containers.Map(keys_type,value_type);
%% LOOP OVER ALL EVENTS AND GATHER THE EVENT INFORMATION
event_all = table;
for i = 1:length(all_events)
    % get the current event:
    event_type = all_events{i};
    % get the number of sequential stimuli of that event:
    num_seq_stim = size(data{:,event_type},2);
    % number of trials of cond in the current run and session:
    num_events = sum(index) * num_seq_stim;
    % initialize empty events struct
    event = struct;
    
    % onsets, in seconds from first trigger:
    event.onset = data{index, event_type} - t_trigger;
    event.onset = reshape(transpose(event.onset),[],1);
    
    % duration, in seconds
    if strcmp(event_type,'tFlipCue')
        event.duration = repmat(Basics.tTargetCue,num_events,1);
    elseif strcmp(event_type, 'tFlipBlank')
        event.duration = repmat(Basics.tPreFixation,num_events,1);
    elseif strcmp(event_type, 'tFlipFix')
        event.duration = repmat(Basics.tFixation,num_events,1);
    elseif strcmp(event_type, 'tFlipStim')
        event.duration = repmat(Sets(cond).set.tStim,num_events,1);
    elseif strcmp(event_type, 'tFlipITI')
        event.duration = repelem(data.tITI(index,:),num_seq_stim,1);
    elseif strcmp(event_type, 'tFlipDelay')
        event.duration = (data{index, 'tFlipResp'} - t_trigger) - event.onset;
    elseif strcmp(event_type, 'tFlipResp')
        event.duration = repmat(Basics.tResponseLimit,num_events,1);
    end
    
    % participant id
    event.subject = repmat({pad_sub},num_events,1);
    % add column that contains the session identifier:
    event.session = repmat(ses,num_events,1);
    % run within session:
    event.run_session = repmat(run,num_events,1);
    % run across the entire experiment:
    event.run_study = repmat(run_array(run,ses),num_events,1);
    % add column that contains the trial counter
    if cond == 4
        trial_indices = 41:1:45;
        event.trial = repelem(trial_indices(index)',num_seq_stim,1);
    else
        event.trial = repelem(find(index),num_seq_stim,1);
    end
    % add column that contains the condition:
    event.condition = repmat(task_names(cond),num_events,1);
    % add column that contains the trial type:
    event.trial_type = (repmat({dict_type(event_type)},num_events,1));
    
    % initialize all other event information:
    event.serial_position = nan(num_events,1);
    event.interval_time = nan(num_events,1);
    event.stim_orient = nan(num_events,1);
    event.stim_index = nan(num_events,1);
    event.stim_label = event.trial_type;
    %event.stim_file = strcat('images/',event.stim_label,'.jpg');
    event.target = nan(num_events,1);
    event.nontarget = nan(num_events,1);
    event.key_down = nan(num_events,1);
    event.key_id = repmat({NaN},num_events,1);
    event.key_target = repmat({NaN},num_events,1);
    event.accuracy = nan(num_events,1);
    event.response_time = nan(num_events,1);
    
    if strcmp(event_type, 'tFlipStim')
        % add column that contains the sequential position:
        event.serial_position = repmat(1:num_seq_stim,1,sum(index))';
        % add column that contains the inter-stimulus interval:
        event.interval_time = repelem(data.tITI(index,:),num_seq_stim,1);
        % add column that contains the stimulus orientation:
        event.stim_orient = repelem(data.orient(index,:),num_seq_stim,1);
        % get stimulus labels of the current run:
        event.stim_index = data.stimIndex(index,:);
        event.stim_index = reshape(transpose(event.stim_index),[],1);
        % add column that contains the path to the stimulus folder:
        event.stim_label = transpose(value_stim(event.stim_index));
        %event.stim_file = strcat('images/',event.stim_label,'.jpg');
        % add column that indicates whether stimulus is a target:
        if cond == 1
            event.target = double(event.stim_orient == 180);
            event.nontarget = nan(sum(index) * num_seq_stim,1);
        elseif cond == 2 || cond == 3 || cond == 4
            A = data.stimIndex(index,:);
            V = data.targetPos(index,:);
            W = data.targetPosAlt(index,:);
            event.target = bsxfun(@eq, cumsum(ones(size(A)), 2), V);
            event.target = reshape(transpose(event.target),[],1);
            event.nontarget = bsxfun(@eq, cumsum(ones(size(A)), 2), W);
            event.nontarget = reshape(transpose(event.nontarget),[],1);
        end
    end
    
    % add participant responses:
    if (strcmp(event_type, 'tFlipStim') && strcmp(task_names{cond}, 'oddball')) || ...
        (strcmp(event_type, 'tFlipResp') && ~strcmp(task_names{cond}, 'oddball'))
        % key press
        event.key_down = repelem(data.keyIsDown(index,:),num_seq_stim,1);
        % key identity
        event.key_id = repelem(data.keyIndex(index,:),num_seq_stim,1);
        if ~isempty(event.key_id)
            event.key_id = cellstr(num2str(event.key_id));
            event.key_id(strcmp(strrep(event.key_id,' ',''),'90')) = {'left'};
            event.key_id(strcmp(strrep(event.key_id,' ',''),'71')) = {'right'};
            event.key_id(~strcmp(event.key_id,'left') & ...
                ~strcmp(event.key_id,'right')) = {NaN};
        end
        % key target
        if ismember('keyTarget',data.Properties.VariableNames)
            event.key_target = repelem(data.keyTarget(index,:),num_seq_stim,1);
        else
            event.key_target = repmat({NaN},sum(index) * num_seq_stim,1);
        end
        % accuracy
        event.accuracy = repelem(data.acc(index,:),num_seq_stim,1);
        % response time
        event.response_time = repelem(data.rt(index,:),num_seq_stim,1);

    end
    events = struct2table(event);
    event_all = [event_all;events];
end
% remove all events that have no onset:
event_all(isnan(event_all.onset),:) = [];
end

3.1.5 Step 4: Adding additional information to the BIDS dataset

3.1.5.1 Code and software

First, we created a virtual environment using Python version 3.8.5 to install all dependencies and required packages.

A list of all required packages can be found in requirements.txt (created using pip freeze > requirements.txt) and installed via pip install -r requirements.txt.

$ mkvirtualenv highspeed-bids -p $(which python3)
$ python --version
Python 3.8.5
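
For completeness, the dependency handling mentioned above amounts to the following two commands (assuming the virtual environment is active and requirements.txt lives in the project root):

# freeze the installed packages into requirements.txt:
$ pip freeze > requirements.txt
# install the same dependencies in a fresh environment:
$ pip install -r requirements.txt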

3.1.5.2 Updating dataset_description.json: python highspeed-bids-description.py

highspeed-bids-description.py is a short Python script that loads the dataset_description.json file that is pre-generated by HeuDiConv and populates it with the relevant study information.

datalad run -m "create dataset_description.json" --output "dataset_description.json" "python3 code/bids_conversion/highspeed-bids-description.py"
# ======================================================================
# SCRIPT INFORMATION:
# ======================================================================
# SCRIPT: UPDATE OF BIDS DIRECTORY
# PROJECT: HIGHSPEED
# WRITTEN BY LENNART WITTKUHN, 2018 - 2020
# CONTACT: WITTKUHN AT MPIB HYPHEN BERLIN DOT MPG DOT DE
# MAX PLANCK RESEARCH GROUP NEUROCODE
# MAX PLANCK INSTITUTE FOR HUMAN DEVELOPMENT
# MAX PLANCK UCL CENTRE FOR COMPUTATIONAL PSYCHIATRY AND AGEING RESEARCH
# LENTZEALLEE 94, 14195 BERLIN, GERMANY
# ======================================================================
# IMPORT RELEVANT PACKAGES
# ======================================================================
import json
import os
# ======================================================================
# DEFINE PATHS
# ======================================================================
# path to the project root:
project_name = 'highspeed-bids'
path_root = os.getcwd().split(project_name)[0] + project_name
path_desc = os.path.join(path_root, 'dataset_description.json')
# ======================================================================
# UPDATE DATA-SET DESCRIPTION FILE
# ======================================================================
# open the dataset_description.json file:
with open(path_desc) as json_file:
    json_desc = json.load(json_file)
json_file.close()
# update fields of the json file:
json_desc["Acknowledgements"] = "This work was funded by a research group grant awarded to NWS by the Max Planck Society (M.TN.A.BILD0004). We thank Eran Eldar, Sam Hall-McMaster and Ondrej Zika for helpful comments on a previous version of this manuscript, Gregor Caregnato for help with participant recruitment and data collection, Anika Loewe, Sonali Beckmann and Nadine Taube for assistance with MRI data acquisition, Lion Schulz for help with behavioral data analysis, Michael Krause for support with cluster computing and all participants for their participation. Lennart Wittkuhn is a pre-doctoral fellow of the International Max Planck Research School on Computational Methods in Psychiatry and Ageing Research (IMPRS COMP2PSYCH). The participating institutions are the Max Planck Institute for Human Development, Berlin, Germany, and University College London, London, UK. For more information, see https://www.mps-ucl-centre.mpg.de/en/comp2psych."
json_desc["Authors"] = ["Lennart Wittkuhn", "Nicolas W. Schuck"]
json_desc["Funding"] = ["M.TN.A.BILD0004"]
json_desc["DatasetDOI"] = "https://gin.g-node.org/lnnrtwttkhn/highspeed-bids/"
json_desc["License"] = "Creative Commons Attribution-NonCommercial-ShareAlike 4.0"
json_desc["Name"] = "Faster than thought: Detecting sub-second activation sequences with sequential fMRI pattern analysis"
json_desc["ReferencesAndLinks"] = ["Wittkuhn, L. and Schuck, N. W. (2020). Faster than thought: Detecting sub-second activation sequences with sequential fMRI pattern analysis. bioRxiv. doi: 10.1101/2020.02.15.950667"]
json_desc["HowToAcknowledge"] = "Please cite: Wittkuhn, L. and Schuck, N. W. (2020). Faster than thought: Detecting sub-second activation sequences with sequential fMRI pattern analysis. bioRxiv. doi: 10.1101/2020.02.15.950667"
json_desc["EthicsApprovals"] = ["The research protocol was approved by the ethics commission of the German Psychological Society (DPGs), reference number: NS 012018"]
# save updated data-set_description.json file:
with open(path_desc, 'w') as outfile:
    json.dump(json_desc, outfile, indent=4)
outfile.close()

3.1.5.3 Updating the .json files of fieldmap data: python highspeed-bids-fieldmaps.py

During later pre-processing of MRI data with fMRIPrep, we want to use our fieldmap data for distortion correction. To ensure that fMRIPrep detects and uses the fieldmap data, we need to add the IntendedFor field to the .json files of the fieldmap data and provide relative paths to the functional task data (see here for more information).
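
The result looks roughly like the following excerpt of a fieldmap .json file (hypothetical subject; the paths are relative to the subject directory, as written by the script below):

# excerpt of e.g. sub-01/ses-01/fmap/sub-01_ses-01_rec-prenorm_dir-AP_epi.json
# after running the script:
#   "IntendedFor": [
#       "ses-01/func/sub-01_ses-01_task-highspeed_rec-prenorm_run-01_bold.nii.gz",
#       "ses-01/func/sub-01_ses-01_task-rest_rec-prenorm_run-pre_bold.nii.gz"
#   ]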

This is done using the code/bids_conversion/highspeed-bids-fieldmaps.py file. To record the provenance of this step in our DataLad dataset, we run:

datalad run -m "create IntendedFor fields in fieldmap .json files" \
--input "./*/*/*/*.nii.gz" --output "./*/*/fmap/*.json" \
"python3 code/code/bids_conversion/highspeed_bids_fieldmaps.py"
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# ======================================================================
# SCRIPT INFORMATION:
# ======================================================================
# SCRIPT: UPDATE THE FIELDMAP JSON FILES
# PROJECT: HIGHSPEED
# WRITTEN BY LENNART WITTKUHN, 2018 - 2020
# CONTACT: WITTKUHN AT MPIB HYPHEN BERLIN DOT MPG DOT DE
# MAX PLANCK RESEARCH GROUP NEUROCODE
# MAX PLANCK INSTITUTE FOR HUMAN DEVELOPMENT
# MAX PLANCK UCL CENTRE FOR COMPUTATIONAL PSYCHIATRY AND AGEING RESEARCH
# LENTZEALLEE 94, 14195 BERLIN, GERMANY
# ======================================================================
# IMPORT RELEVANT PACKAGES
# ======================================================================
import os
import glob
import json
import stat
# ======================================================================
# DEFINE PATHS
# ======================================================================
# note: this script is meant to be run from within the project directory;
# the path to the BIDS root is derived from the current working directory below
# path to the project root:
project_name = 'highspeed-bids'
path_root = os.getcwd().split(project_name)[0] + project_name
path_fmap = os.path.join(path_root, '*', '*', 'fmap', '*.json')
path_func = os.path.join(path_root, '*', '*', 'func', '*.nii.gz')
# ======================================================================
# UPDATE FIELDMAP JSON FILES
# ======================================================================
# get all fieldmap files in the data-set:
files_fmap = glob.glob(path_fmap)
# loop over all field-map files:
for file_path in files_fmap:
    # open the .json file of the fieldmap acquisition:
    with open(file_path, 'r') as in_file:
        json_info = json.load(in_file)
    in_file.close()
    # get the path to the session folder of a specific participant:
    file_base = os.path.dirname(os.path.dirname(file_path))
    # get the path to all functional acquisitions in that session:
    files_func = glob.glob(os.path.join(file_base, 'func', '*nii.gz'))
    session = os.path.basename(file_base)
    up_dirs = os.path.join(session, 'func')
    intended_for = [os.path.join(up_dirs, os.path.basename(file)) for file in files_func]
    json_info["IntendedFor"] = sorted(intended_for)
    # change file permissions to read:
    permissions = os.stat(file_path).st_mode
    os.chmod(path=file_path, mode=permissions | stat.S_IWUSR)
    # save updated fieldmap json-file:
    with open(file_path, 'w') as out_file:
        json.dump(json_info, out_file, indent=2, sort_keys=True)
    out_file.close()
    # change file permissions back to read-only:
    os.chmod(path=file_path, mode=permissions)

3.1.5.4 Creating the participants.tsv and participants.json files

According to the BIDS specification, version 1.1.0

[…] the purpose of this file is to describe properties of participants such as age, handedness, sex, etc. In case of single session studies this file has one compulsory column participant_id that consists of sub- , followed by a list of optional columns describing participants. Each participant needs to be described by one and only one row.

In our case, this information was saved in the original behavioral data files that were created when participants performed the task during scanning.

To extract this information from the behavioral files into the participants.tsv file, we use the code/events/highspeed_bids_participants.m file. Since wrapping the execution of this script in a datalad run command proved challenging, we simply ran the script (with paths that are not self-contained) and saved the resulting changes to the files afterwards.

First, we had to unlock the participants.tsv file using datalad unlock participants.tsv.

Then we ran highspeed_bids_participants.m using MATLAB version R2017b and saved the changes using datalad save.
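
In other words, the sequence of steps was roughly as follows (the exact MATLAB invocation and the commit message are illustrative, not a verbatim record):

# unlock the file so MATLAB can overwrite it:
datalad unlock participants.tsv
# run the MATLAB script (illustrative invocation):
matlab -nodesktop -nosplash -r "run('code/events/highspeed_bids_participants.m'); exit"
# record the changes in the dataset history:
datalad save -m "update participants.tsv"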

%% HIGHSPEED: GET DATA OF THE HIGHSPEED TASK
clear variables; clc; % clear workspace and command window

path_base = strsplit(pwd,'code');
path_base = path_base{1};
path_input = fullfile(path_base, 'input', 'behavior', 'main');
path_output = path_base;
path_digitspan = fullfile(...
    path_base, 'input', 'behavior', 'digitspan');
allID = dlmread(fullfile(...
    path_base, 'code', 'heudiconv', 'highspeed-participant-list.txt'));
num_subs = length(allID);

% get data
dirData = dir(path_input);
dirData = {dirData.name};
dataFiles = dirData(...
    contains(dirData,'session_1_run_4') & ...
    contains(dirData,cellstr(num2str(allID))));

covariates = table;
covariates.participant_id = cell(num_subs,1);
covariates.age = nan(num_subs,1);
covariates.sex = cell(num_subs,1);
covariates.handedness = cell(num_subs,1);
covariates.digit_span = nan(num_subs,1);
covariates.randomization = nan(num_subs,1);
covariates.session_interval = nan(num_subs,1);

% study intervals, ordered by participant ids:
intervals = {
    1, 13, 4, 4, 17, 8, 14, 6, 7, 10, ...
    7, 6, 18, 4, 8, 5, 23, 3, 1, 12, ...
    9, 8, 24, 21, 17, 21, 14, 4, 4, 9, ...
    7, 7, 11, 7, 14, 2, 1, 5, 3, 3};
% create a dictionary that maps IDs to intervals:
interval_dict = containers.Map(allID,intervals);

filetemplate = 'highspeed_task_mri_sub_%d_session_%d_run_%d.mat';
fprintf('List of missing data:\n')
for sub = 1:num_subs
    % get correct ids:
    id_orig = allID(sub);
    id_new = sprintf('sub-%02d', sub);
    % load task statistics
    session = 1; run = 4;
    filename = sprintf(filetemplate,allID(sub),session,run);
    dataframe = dirData(contains(dirData,filename));
    if ~isempty(dataframe)
        load(fullfile(path_input,filename));
        covariates.participant_id{sub} = id_new;
        covariates.age(sub) = Parameters.subjectInfo.age;
        covariates.sex{sub} = Parameters.subjectInfo.gender;
        covariates.handedness{sub} = 'right';
        covariates.randomization(sub) = Parameters.subjectInfo.cbal;
        covariates.session_interval(sub) = interval_dict(id_orig);
    else
        % report missing behavioral data for this participant:
        fprintf('%s: all behavioral data missing\n', id_new);
    end
    % digit span
    digitspan_file = sprintf('DigitSpan_%d.mat',allID(sub));
    digitspan_dir = dir(fullfile(path_digitspan));
    if any(contains({digitspan_dir.name},digitspan_file))
        load(fullfile(path_digitspan,digitspan_file))
        covariates.digit_span(sub) = nansum(Data.acc);
    end
end

% WRITE DATA
writetable(covariates,fullfile(path_output,'participants.csv'), ...
    'Delimiter','\t','WriteRowNames',true, ...
    'QuoteStrings',true,'WriteVariableNames',true)
copyfile(fullfile(path_output,'participants.csv'), ...
    fullfile(path_output,'participants.tsv'));
delete(fullfile(path_output,'participants.csv'));

Finally, we create the participants.json file using:

datalad run -m "create participants.json" \
--output "participants.json" \
"python3 code/bids_conversion/highspeed-bids-participants.py"
# ======================================================================
# SCRIPT INFORMATION:
# ======================================================================
# SCRIPT: UPDATE PARTICIPANTS .JSON FILE
# PROJECT: HIGHSPEED
# WRITTEN BY LENNART WITTKUHN, 2018 - 2019
# CONTACT: WITTKUHN AT MPIB HYPHEN BERLIN DOT MPG DOT DE
# MAX PLANCK RESEARCH GROUP NEUROCODE
# MAX PLANCK INSTITUTE FOR HUMAN DEVELOPMENT
# MAX PLANCK UCL CENTRE FOR COMPUTATIONAL PSYCHIATRY AND AGEING RESEARCH
# LENTZEALLEE 94, 14195 BERLIN, GERMANY
# ======================================================================
# IMPORT RELEVANT PACKAGES
# ======================================================================
import json
import os
# ======================================================================
# DEFINE PATHS
# ======================================================================
# path to the project root:
project_name = 'highspeed-bids'
path_root = os.getcwd().split(project_name)[0] + project_name
path_desc = os.path.join(path_root, 'participants.json')
# ======================================================================
# UPDATE DATA-SET DESCRIPTION FILE
# ======================================================================
# update fields of the json file:
json_desc = dict()
json_desc["participant_id"] = {
        "Description": "Participant identifier"
        }
json_desc["age"] = {
        "Description": "Age, in years as in the first session",
        "Units": "years"
        }
json_desc["sex"] = {
        "Description": "Sex, self-rated by participant",
        "Levels": {
            "m": "male",
            "f": "female",
            "o": "other"
            }
        }
json_desc["handedness"] = {
        "Description": "Handedness, self-rated by participant; note that participants were required to be right-handed",
        "Levels": {
            "right": "right",
            "left": "left"
            }
        }
json_desc["digit_span"] = {
        "Description": "Total score in Digit-Span Test (Petermann & Wechsler, 2012), assessing working memory capacity",
        "Units": "total scores"
        }
json_desc["randomization"] = {
        "Description": "Pseudo-randomized group assignment for selection of sequences in sequence trials"
        }
json_desc["session_interval"] = {
        "Description:": "Interval in days between the two experimental sessions",
        "Units": "days"
        }
# save updated data-set_description.json file:
with open(path_desc, 'w') as outfile:
    json.dump(json_desc, outfile, indent=4)
outfile.close()

3.2 MRI quality control: Running MRIQC

3.2.1 Overview

MRIQC extracts no-reference IQMs (image quality metrics) from structural (T1w and T2w) and functional MRI (magnetic resonance imaging) data.

Please see the official MRIQC documentation for details and refer to the paper listed in the References section.

3.2.1.1 Data availability

The data is freely available from https://github.com/lnnrtwttkhn/highspeed-mriqc and https://gin.g-node.org/lnnrtwttkhn/highspeed-mriqc.

3.2.1.2 License

The dataset is licensed under Creative Commons Attribution-ShareAlike 4.0. Please see https://creativecommons.org/licenses/by-sa/4.0/ for details.

3.2.2 Container: mriqc container, version 0.15.2rc1

MRIQC quality control was performed using mriqc, version 0.15.2rc1. To run MRIQC, we created a Singularity container based on the MRIQC docker image:

singularity pull docker://poldracklab/mriqc:0.15.2rc1

3.2.3 MRIQC subject-level reports: highspeed-mriqc-subject-level.sh

First, MRIQC has to be run at the level of individual subjects (and, where available, sessions), modalities, and runs. This is achieved by running the highspeed-mriqc-subject-level.sh shell script, which parallelizes the MRIQC jobs across subjects and sessions on the cluster.

#!/usr/bin/bash
# ==============================================================================
# SCRIPT INFORMATION:
# ==============================================================================
# SCRIPT: CREATE PARTICIPANT-LEVEL MRIQC REPORTS FOR A BIDS-STRUCTURED DATASET
# PROJECT: HIGHSPEED
# WRITTEN BY LENNART WITTKUHN, 2018 - 2020
# CONTACT: WITTKUHN AT MPIB HYPHEN BERLIN DOT MPG DOT DE
# MAX PLANCK RESEARCH GROUP NEUROCODE
# MAX PLANCK INSTITUTE FOR HUMAN DEVELOPMENT (MPIB)
# MAX PLANCK UCL CENTRE FOR COMPUTATIONAL PSYCHIATRY AND AGEING RESEARCH
# LENTZEALLEE 94, 14195 BERLIN, GERMANY
# ACKNOWLEDGEMENTS: THANKS TO ALEXANDER SKOWRON AND NIR MONETA @ MPIB FOR HELP
# ==============================================================================
# DEFINE ALL PATHS:
# ==============================================================================
# path to the base directory:
PATH_BASE="${HOME}"
# path to the project root directory
PATH_ROOT="${PATH_BASE}/highspeed"
# define the name of the project:
PROJECT_NAME="highspeed-mriqc"
# define the path to the project folder:
PATH_PROJECT="${PATH_ROOT}/${PROJECT_NAME}"
# define the name of the current task:
TASK_NAME="mriqc"
# define the path to the script main directory:
PATH_CODE="${PATH_PROJECT}/code/${TASK_NAME}"
# cd into the directory of the current task:
cd "${PATH_CODE}"
# define the path to the singularity container:
PATH_CONTAINER="${PATH_PROJECT}/tools/${TASK_NAME}/${TASK_NAME}_0.15.2rc1.sif"
# define the path for the templateflow cache
PATH_TEMPLATEFLOW="${PATH_BASE}/.cache/templateflow"
# path to the data directory (in bids format):
PATH_INPUT="${PATH_PROJECT}/bids"
# path to the output directory:
PATH_OUTPUT=${PATH_PROJECT}/${TASK_NAME}
# path to the working directory:
PATH_WORK=${PATH_PROJECT}/work
# path to the log directory:
PATH_LOG=${PATH_PROJECT}/logs/$(date '+%Y%m%d_%H%M%S')
# path to the text file with all subject ids:
PATH_SUB_LIST="${PATH_CODE}/highspeed-participant-list.txt"
# ==============================================================================
# CREATE RELEVANT DIRECTORIES:
# ==============================================================================
# create working directory:
if [ ! -d ${PATH_WORK} ]
then
    mkdir -p ${PATH_WORK}
fi
# create directory for log files:
if [ ! -d ${PATH_LOG} ]
then
    mkdir -p ${PATH_LOG}
else
    # remove old log files inside the log container:
    rm -r ${PATH_LOG}/*
fi
# ==============================================================================
# DEFINE PARAMETERS:
# ==============================================================================
# maximum number of cpus per process:
N_CPUS=5
# memory demand in *GB*
MEM_GB=10
# read subject ids from the list of the text file:
SUB_LIST=$(cat ${PATH_SUB_LIST} | tr '\n' ' ')
# declare an array with sessions you want to run:
declare -a SESSIONS=("01" "02")
# ==============================================================================
# RUN MRIQC:
# ==============================================================================
# initialize a subject counter:
SUB_COUNT=0
for SUB in ${SUB_LIST}; do
    # update the subject counter:
    let SUB_COUNT=SUB_COUNT+1
    # create the subject number with zero-padding:
    SUB_PAD=$(printf "%02d\n" ${SUB_COUNT})
    # loop over all sessions:
    for SES in ${SESSIONS[@]}; do
        # create a new job file:
        echo "#!/bin/bash" > job
        # name of the job
        echo "#SBATCH --job-name mriqc_sub-${SUB_PAD}_ses-${SES}" >> job
        # add partition to job
        echo "#SBATCH --partition gpu" >> job
        # set the expected maximum running time for the job:
        echo "#SBATCH --time 24:00:00" >> job
        # determine how much RAM your operation needs:
        echo "#SBATCH --mem ${MEM_GB}GB" >> job
        # email notification on abort/end, use 'n' for no notification:
        echo "#SBATCH --mail-type NONE" >> job
        # write log to log folder:
        echo "#SBATCH --output ${PATH_LOG}/slurm-mriqc-%j.out" >> job
        # request multiple cpus:
        echo "#SBATCH --cpus-per-task ${N_CPUS}" >> job
        # export template flow environment variable:
        echo "export SINGULARITYENV_TEMPLATEFLOW_HOME=/templateflow" >> job
        # define the main command:
        echo "singularity run --contain -B ${PATH_INPUT}:/input:ro \
        -B ${PATH_OUTPUT}:/output:rw -B ${PATH_WORK}:/work:rw \
        -B ${PATH_TEMPLATEFLOW}:/templateflow:rw \
        ${PATH_CONTAINER} /input/ /output/ participant --participant-label ${SUB_PAD} \
        --session-id ${SES} -w /work/ --verbose-reports --write-graph \
        --n_cpus ${N_CPUS} --mem_gb ${MEM_GB} --no-sub" >> job
        # submit job to cluster queue and remove it to avoid confusion:
        sbatch job
        rm -f job
    done
done

3.2.4 MRIQC group-level reports: highspeed-mriqc-group-level.sh

Afterwards, one can run the highspeed-mriqc-group-level.sh script to obtain group-level summaries of the image quality metrics.

#!/usr/bin/bash
# ==============================================================================
# SCRIPT INFORMATION:
# ==============================================================================
# SCRIPT: CREATE GROUP-LEVEL MRIQC REPORTS FOR A BIDS-STRUCTURED DATASET
# ONLY RUN AFTER ALL PARTICIPANT-LEVEL REPORTS ARE FINISHED!
# PROJECT: HIGHSPEED
# WRITTEN BY LENNART WITTKUHN, 2018 - 2020
# CONTACT: WITTKUHN AT MPIB HYPHEN BERLIN DOT MPG DOT DE
# MAX PLANCK RESEARCH GROUP NEUROCODE
# MAX PLANCK INSTITUTE FOR HUMAN DEVELOPMENT (MPIB)
# MAX PLANCK UCL CENTRE FOR COMPUTATIONAL PSYCHIATRY AND AGEING RESEARCH
# LENTZEALLEE 94, 14195 BERLIN, GERMANY
# ACKNOWLEDGEMENT: THANKS TO ALEXANDER SKOWRON AND NIR MONETA @ MPIB FOR HELP
# ==============================================================================
# DEFINE ALL PATHS:
# ==============================================================================
# path to the base directory:
PATH_BASE="${HOME}"
# path to the project root directory
PATH_ROOT="${PATH_BASE}/highspeed"
# define the name of the project:
PROJECT_NAME="highspeed-mriqc"
# define the path to the project folder:
PATH_PROJECT="${PATH_ROOT}/${PROJECT_NAME}"
# define the name of the current task:
TASK_NAME="mriqc"
# define the path to the script main directory:
PATH_CODE="${PATH_PROJECT}/code"
# cd into the directory of the current task:
cd "${PATH_CODE}"
# define the path to the singularity container:
PATH_CONTAINER="${PATH_PROJECT}/tools/${TASK_NAME}/${TASK_NAME}_0.15.2rc1.sif"
# define the path for the templateflow cache
PATH_TEMPLATEFLOW="${PATH_BASE}/.cache/templateflow"
# path to the data directory (in bids format):
PATH_INPUT="${PATH_PROJECT}/bids"
# path to the output directory:
PATH_OUTPUT=${PATH_PROJECT}/${TASK_NAME}
# path to the working directory:
PATH_WORK=${PATH_PROJECT}/work
# path to the log directory:
PATH_LOG=${PATH_PROJECT}/logs
# ==============================================================================
# CREATE RELEVANT DIRECTORIES:
# ==============================================================================
# create working directory:
if [ ! -d ${PATH_WORK} ]
then
    mkdir -p ${PATH_WORK}
fi
# create directory for log files:
if [ ! -d ${PATH_LOG} ]
then
    mkdir -p ${PATH_LOG}
fi
# ==============================================================================
# RUN MRIQC TO CREATE THE GROUP REPORTS:
# ==============================================================================
# create group reports for the functional data:
singularity run -B ${PATH_INPUT}:/input:ro \
-B ${PATH_OUTPUT}:/output:rw -B ${PATH_WORK}:/work:rw \
-B ${PATH_TEMPLATEFLOW}:/templateflow:rw \
${PATH_CONTAINER} /input/ /output/ group --no-sub
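
After the group-level run has finished, MRIQC writes tabular summaries of the image quality metrics (IQMs) to the output directory in addition to the HTML reports (by default named, e.g., group_bold.tsv and group_T1w.tsv). The following minimal sketch (not part of the pipeline; the file path and column names follow the default MRIQC output and are meant as examples) shows how these tables could be inspected:

import pandas as pd

# load the group-level IQMs for the functional data (path relative to the
# project directory; adjust as needed):
iqms = pd.read_csv('mriqc/group_bold.tsv', sep='\t')
# e.g., summarize temporal signal-to-noise ratio and mean framewise displacement:
print(iqms[['bids_name', 'tsnr', 'fd_mean']].describe(include='all'))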

3.2.5 References

Esteban, O., Birman, D., Schaer, M., Koyejo, O. O., Poldrack, R. A., & Gorgolewski, K. J. (2017). MRIQC: Advancing the automatic prediction of image quality in MRI from unseen sites. PLoS ONE, 12(9), e0184661. doi: 10.1371/journal.pone.0184661

3.3 MRI pre-processing: Running fMRIPrep

3.3.1 Overview

We used fMRIPrep, version 1.2.2, to pre-process the BIDS-converted MRI data.

According to the fMRIPrep documentation:

fMRIPrep is a functional magnetic resonance imaging (fMRI) data preprocessing pipeline that is designed to provide an easily accessible, state-of-the-art interface that is robust to variations in scan acquisition protocols and that requires minimal user input, while providing easily interpretable and comprehensive error and output reporting. It performs basic processing steps (coregistration, normalization, unwarping, noise component extraction, segmentation, skullstripping etc.) providing outputs that can be easily submitted to a variety of group level analyses, including task-based or resting-state fMRI, graph theory measures, surface or volume-based statistics, etc.

Please see the fMRIPrep documentation for details and refer to the papers listed in the References section.

3.3.1.1 Data availability

The data is freely available from https://github.com/lnnrtwttkhn/highspeed-fmriprep and https://gin.g-node.org/lnnrtwttkhn/highspeed-fmriprep.

3.3.1.2 License

The dataset is licensed under Creative Commons Attribution-ShareAlike 4.0. Please see https://creativecommons.org/licenses/by-sa/4.0/ for details.

3.3.2 Software: fmriprep container, version 1.2.2

We first generated a Singularity container based on the fMRIPrep docker image:

singularity pull docker://poldracklab/fmriprep:1.2.2

3.3.3 Run fMRIPrep on HPC

We then ran fMRIPrep on the high-performance cluster (HPC) of the Max Planck Institute for Human Development, Berlin, Germany using the code in highspeed-fmriprep-cluster.sh:

#!/usr/bin/bash
# ==============================================================================
# SCRIPT INFORMATION:
# ==============================================================================
# SCRIPT: RUN FMRIPREP ON THE MPIB CLUSTER (TARDIS)
# PROJECT: HIGHSPEED
# WRITTEN BY LENNART WITTKUHN, 2018 - 2020
# CONTACT: WITTKUHN AT MPIB HYPHEN BERLIN DOT MPG DOT DE
# MAX PLANCK RESEARCH GROUP NEUROCODE
# MAX PLANCK INSTITUTE FOR HUMAN DEVELOPMENT
# MAX PLANCK UCL CENTRE FOR COMPUTATIONAL PSYCHIATRY AND AGEING RESEARCH
# LENTZEALLEE 94, 14195 BERLIN, GERMANY
# ACKNOWLEDGEMENTS: THANKS HRVOJE STOJIC (UCL) AND ALEX SKOWRON (MPIB) FOR HELP
# ==============================================================================
# DEFINE ALL PATHS:
# ==============================================================================
# path to the base directory:
PATH_BASE="${HOME}"
# path to the project root directory
PATH_ROOT="${PATH_BASE}/highspeed"
# define the name of the project:
PROJECT_NAME="highspeed-fmriprep"
# define the path to the project folder:
PATH_PROJECT="${PATH_ROOT}/${PROJECT_NAME}"
# define the name of the current task:
TASK_NAME="fmriprep"
# path to the current shell script:
PATH_CODE=${PATH_PROJECT}/code
# cd into the directory of the current script:
cd "${PATH_CODE}/${TASK_NAME}"
# path to the fmriprep resources folder:
PATH_FMRIPREP=${PATH_PROJECT}/tools/${TASK_NAME}
# path to the fmriprep singularity image:
PATH_CONTAINER=${PATH_FMRIPREP}/${TASK_NAME}_1.2.2.sif
# path to the freesurfer license file on tardis:
PATH_FS_LICENSE=${PATH_FMRIPREP}/fs_600_license.txt
# path to the data directory (in bids format):
PATH_BIDS=${PATH_PROJECT}/bids
# path to the output directory:
PATH_OUT=${PATH_PROJECT}
# path to the freesurfer output directory:
PATH_FREESURFER="${PATH_OUT}/freesurfer"
# path to the working directory:
PATH_WORK=${PATH_PROJECT}/work
# path to the log directory:
PATH_LOG=${PATH_PROJECT}/logs/$(date '+%Y%m%d_%H%M%S')
# path to the text file with all subject ids:
PATH_SUB_LIST="${PATH_CODE}/fmriprep/highspeed-participant-list.txt"
# ==============================================================================
# CREATE RELEVANT DIRECTORIES:
# ==============================================================================
# create output directory:
if [ ! -d ${PATH_OUT} ]; then
    mkdir -p ${PATH_OUT}
fi
# create freesurfer output directory
if [ ! -d ${PATH_FREESURFER} ]; then
    mkdir -p ${PATH_FREESURFER}
fi
# create directory for work:
if [ ! -d ${PATH_WORK} ]; then
    mkdir -p ${PATH_WORK}
fi
# create directory for log files:
if [ ! -d ${PATH_LOG} ]; then
    mkdir -p ${PATH_LOG}
fi
# ==============================================================================
# DEFINE PARAMETERS:
# ==============================================================================
# maximum number of cpus per process:
N_CPUS=8
# maximum number of threads per process:
N_THREADS=8
# memory demand in *GB*
MEM_GB=35
# memory demand in *MB*
MEM_MB="$((${MEM_GB} * 1000))"
# user-defined subject list
PARTICIPANTS=$1
# check if user input was supplied:
if [ -z "$PARTICIPANTS" ]; then
    echo "No participant label supplied."
    # read subject ids from the text file:
    SUB_LIST=$(cat ${PATH_SUB_LIST} | tr '\n' ' ')
else
    SUB_LIST=$PARTICIPANTS
fi
# ==============================================================================
# RUN FMRIPREP:
# ==============================================================================
# initialize a subject counter:
SUB_COUNT=0
# loop over all subjects:
for SUB in ${SUB_LIST}; do
    # update the subject counter:
    let SUB_COUNT=SUB_COUNT+1
    # get the subject number with zero padding:
    SUB_PAD=$(printf "%02d\n" $SUB_COUNT)
    # create participant-specific working directory:
    PATH_WORK_SUB="${PATH_WORK}/sub-${SUB_PAD}"
    if [ ! -d ${PATH_WORK_SUB} ]; then
        mkdir -p ${PATH_WORK_SUB}
    fi
    # create a new job file:
    echo "#!/bin/bash" > job
    # name of the job
    echo "#SBATCH --job-name fmriprep_sub-${SUB_PAD}" >> job
    # add partition to job
    echo "#SBATCH --partition gpu" >> job
    # set the expected maximum running time for the job:
    echo "#SBATCH --time 40:00:00" >> job
    # determine how much RAM your operation needs:
    echo "#SBATCH --mem ${MEM_GB}GB" >> job
    # email notification on abort/end ('NONE' disables notifications):
    echo "#SBATCH --mail-type NONE" >> job
    # write log to log folder
    echo "#SBATCH --output ${PATH_LOG}/slurm-fmriprep-%j.out" >> job
    # request multiple cpus
    echo "#SBATCH --cpus-per-task ${N_CPUS}" >> job
    # define the fmriprep command:
    echo "singularity run --cleanenv -B ${PATH_BIDS}:/input:ro \
    -B ${PATH_OUT}:/output:rw -B ${PATH_FMRIPREP}:/utilities:ro \
    -B ${PATH_FREESURFER}:/output/freesurfer:rw \
    -B ${PATH_WORK_SUB}:/work:rw ${PATH_CONTAINER} \
    --fs-license-file /utilities/fs_600_license.txt \
    /input/ /output/ participant --participant_label ${SUB_PAD} -w /work/ \
    --mem_mb ${MEM_MB} --nthreads ${N_CPUS} --omp-nthreads $N_THREADS \
    --write-graph --stop-on-first-crash \
    --output-space T1w fsnative template fsaverage \
    --notrack --verbose --resource-monitor" >> job
    # submit job to cluster queue and remove it to avoid confusion:
    sbatch job
    rm -f job
done

3.3.4 References

Esteban, O., Markiewicz, C. J., Blair, R. W., Moodie, C. A., Isik, A. I., Erramuzpe, A., Kent, J. D., Goncalves, M., DuPre, E., Snyder, M., and et al. (2019). fMRIPrep 1.2.2. doi:10.5281/zenodo.852659

Esteban, O., Markiewicz, C. J., Blair, R. W., Moodie, C. A., Isik, A. I., Erramuzpe, A., Kent, J. D., Goncalves, M., DuPre, E., Snyder, M., and et al. (2018). fMRIPrep: A robust preprocessing pipeline for functional MRI. Nature Methods, 16(1):111–116. doi:10.1038/s41592-018-0235-4

Esteban, O., Ciric, R., Finc, K., Blair, R. W., Markiewicz, C. J., Moodie, C. A., Kent, J. D., Goncalves, M., DuPre, E., Gomez, D. E. P., and et al. (2020). Analysis of task-based functional MRI data preprocessed with fMRIPrep. Nature Protocols. doi:10.1038/s41596-020-0327-3

3.4 Feature selection: Anatomical masks

3.4.1 Overview

As described in the paper, we used a feature selection approach that combined binarized anatomical ROIs with functional ROIs based on first-level GLMs.
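
The following minimal sketch (not part of the pipeline; file names and the threshold value are hypothetical) illustrates the general idea of this combination: a voxel enters the final feature set if it lies within the binarized anatomical ROI and exceeds a threshold in the first-level GLM statistical map. The actual selection used in the study is implemented in the code shown in this and the following sections.

from nilearn import image

# hypothetical inputs: a binarized anatomical ROI and a first-level GLM t-map
anat_mask = 'mask_visual.nii.gz'
tmap = 'spmT_0001.nii.gz'
# keep voxels inside the anatomical mask whose t-value exceeds an (arbitrary,
# illustrative) threshold:
functional_roi = image.math_img('anat * (tmap > 2.3)', anat=anat_mask, tmap=tmap)
functional_roi.to_filename('roi_intersection.nii.gz')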

3.4.1.1 Data availability

The data is freely available from https://github.com/lnnrtwttkhn/highspeed-masks and https://gin.g-node.org/lnnrtwttkhn/highspeed-masks.

3.4.1.2 License

The dataset is licensed under Creative Commons Attribution-ShareAlike 4.0. Please see https://creativecommons.org/licenses/by-sa/4.0/ for details.

3.4.2 Creating binary anatomical masks using highspeed-masks.py

We created binarized anatomical masks of occipito-temporal cortex and hippocampus based on the participant-specific FreeSurfer parcellation using the Nipype workflow below (a short standalone illustration of the binarization step follows the workflow code):

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# ======================================================================
# SCRIPT INFORMATION:
# ======================================================================
# SCRIPT: CREATE BINARIZED MASKS FROM SEGMENTED FUNCTIONAL IMAGES
# PROJECT: HIGHSPEED
# WRITTEN BY LENNART WITTKUHN, 2018 - 2020
# CONTACT: WITTKUHN AT MPIB HYPHEN BERLIN DOT MPG DOT DE
# MAX PLANCK RESEARCH GROUP NEUROCODE
# MAX PLANCK INSTITUTE FOR HUMAN DEVELOPMENT
# MAX PLANCK UCL CENTRE FOR COMPUTATIONAL PSYCHIATRY AND AGEING RESEARCH
# LENTZEALLEE 94, 14195 BERLIN, GERMANY
# ======================================================================
# IMPORT RELEVANT PACKAGES
# ======================================================================
# import basic libraries:
import os
import sys
import glob
import warnings
from os.path import join as opj
# import nipype libraries:
from nipype.interfaces.utility import IdentityInterface
from nipype.interfaces.io import SelectFiles, DataSink
from nipype.pipeline.engine import Workflow, Node, MapNode
# import libraries for bids interaction:
from bids.layout import BIDSLayout
# import freesurfer interfaces:
from nipype.interfaces.freesurfer import Binarize
# import fsl interfaces:
from niflow.nipype1.workflows.fmri.fsl import create_susan_smooth
import datalad.api as dl
# ======================================================================
# ENVIRONMENT SETTINGS (DEALING WITH ERRORS AND WARNINGS):
# ======================================================================
# set the fsl output type environment variable:
os.environ['FSLOUTPUTTYPE'] = 'NIFTI_GZ'
# set the freesurfer subject directory:
os.environ['SUBJECTS_DIR'] = '/opt/software/freesurfer/6.0.0/subjects'
# deal with nipype-related warnings:
os.environ['TF_CPP_MIN_LOG_LEVEL'] = "3"
# filter out warnings related to the numpy package:
warnings.filterwarnings("ignore", message="numpy.dtype size changed*")
warnings.filterwarnings("ignore", message="numpy.ufunc size changed*")
# ======================================================================
# DEFINE PBS CLUSTER JOB TEMPLATE (NEEDED WHEN RUNNING ON THE CLUSTER):
# ======================================================================
job_template = """
#PBS -l walltime=5:00:00
#PBS -j oe
#PBS -o $HOME/highspeed/highspeed-masks/logs
#PBS -m n
#PBS -v FSLOUTPUTTYPE=NIFTI_GZ
source /etc/bash_completion.d/virtualenvwrapper
workon highspeed-masks
module load fsl/5.0
module load freesurfer/6.0.0
"""
# ======================================================================
# DEFINE PATHS AND SUBJECTS
# ======================================================================
path_root = None
sub_list = None
# path to the project root:
project_name = 'highspeed-masks'
path_root = os.getenv('PWD').split(project_name)[0] + project_name
# grab the list of subjects from the bids data set:
path_bids = opj(path_root, 'bids')
dl.get(opj('bids', 'participants.json'))
dl.get(glob.glob(opj(path_root, 'bids', 'sub-*', 'ses-*', '*', '*.json')))
dl.get(glob.glob(opj(path_root, 'bids', 'sub-*', 'ses-*', '*.json')))
dl.get(glob.glob(opj(path_root, 'bids', '*.json')))
layout = BIDSLayout(path_bids)
# get all subject ids:
sub_list = sorted(layout.get_subjects())
# create a template to add the "sub-" prefix to the ids
sub_template = ['sub-'] * len(sub_list)
# add the prefix to all ids:
sub_list = ["%s%s" % t for t in zip(sub_template, sub_list)]
# if user defined to run specific subject
sub_list = sub_list[int(sys.argv[1]):int(sys.argv[2])]
# ======================================================================
# DEFINE NODE: INFOSOURCE
# ======================================================================
# define the infosource node that collects the data:
infosource = Node(IdentityInterface(
    fields=['subject_id']), name='infosource')
# let the node iterate (parallelize) over all subjects:
infosource.iterables = [('subject_id', sub_list)]
# ======================================================================
# DEFINE SELECTFILES NODE
# ======================================================================
path_func = opj(path_root, 'fmriprep', 'fmriprep', '*', '*',
        'func', '*space-T1w*preproc_bold.nii.gz')
path_func_parc = opj(path_root, 'fmriprep', 'fmriprep', '*',
        '*', 'func', '*space-T1w*aparcaseg_dseg.nii.gz')
path_wholemask = opj(path_root, 'fmriprep', 'fmriprep', '*',
        '*', 'func', '*space-T1w*brain_mask.nii.gz')
dl.get(glob.glob(path_func))
dl.get(glob.glob(path_func_parc))
dl.get(glob.glob(path_wholemask))
# define all relevant files paths:
templates = dict(
    func=opj(path_root, 'fmriprep', 'fmriprep', '{subject_id}', '*',
             'func', '*space-T1w*preproc_bold.nii.gz'),
    func_parc=opj(path_root, 'fmriprep', 'fmriprep', '{subject_id}',
                  '*', 'func', '*space-T1w*aparcaseg_dseg.nii.gz'),
    wholemask=opj(path_root, 'fmriprep', 'fmriprep', '{subject_id}',
                  '*', 'func', '*space-T1w*brain_mask.nii.gz'),
)
# define the selectfiles node:
selectfiles = Node(SelectFiles(templates, sort_filelist=True),
                   name='selectfiles')
# set expected thread and memory usage for the node:
selectfiles.interface.num_threads = 1
selectfiles.interface.estimated_memory_gb = 0.1
# selectfiles.inputs.subject_id = 'sub-20'
# selectfiles_results = selectfiles.run()
# ======================================================================
# DEFINE CREATE_SUSAN_SMOOTH WORKFLOW NODE
# ======================================================================
# define the susan smoothing node and specify the smoothing fwhm:
susan = create_susan_smooth()
# set the smoothing kernel:
susan.inputs.inputnode.fwhm = 4
# set expected thread and memory usage for the nodes:
susan.get_node('inputnode').interface.num_threads = 1
susan.get_node('inputnode').interface.estimated_memory_gb = 0.1
susan.get_node('median').interface.num_threads = 1
susan.get_node('median').interface.estimated_memory_gb = 3
susan.get_node('mask').interface.num_threads = 1
susan.get_node('mask').interface.estimated_memory_gb = 3
susan.get_node('meanfunc2').interface.num_threads = 1
susan.get_node('meanfunc2').interface.estimated_memory_gb = 3
susan.get_node('merge').interface.num_threads = 1
susan.get_node('merge').interface.estimated_memory_gb = 3
susan.get_node('multi_inputs').interface.num_threads = 1
susan.get_node('multi_inputs').interface.estimated_memory_gb = 3
susan.get_node('smooth').interface.num_threads = 1
susan.get_node('smooth').interface.estimated_memory_gb = 3
susan.get_node('outputnode').interface.num_threads = 1
susan.get_node('outputnode').interface.estimated_memory_gb = 0.1
# ======================================================================
# DEFINE BINARIZE NODE
# ======================================================================
mask_visual_labels = [
    1005, 2005,  # cuneus
    1011, 2011,  # lateral occipital
    1021, 2021,  # pericalcarine
    1029, 2029,  # superiorparietal
    1013, 2013,  # lingual
    1008, 2008,  # inferiorparietal
    1007, 2007,  # fusiform
    1009, 2009,  # inferiortemporal
    1016, 2016,  # parahippocampal
    1015, 2015,  # middle temporal
]
mask_hippocampus_labels = [
    17, 53,  # left and right hippocampus
]
mask_mtl_labels = [
    17, 53,  # left and right hippocampus
    1016, 2016,  # parahippocampal
    1006, 2006,  # ctx-entorhinal
]
# function: use freesurfer mri_binarize to threshold an input volume
mask_visual = MapNode(
        interface=Binarize(), name='mask_visual', iterfield=['in_file'])
# input: match instead of threshold
mask_visual.inputs.match = mask_visual_labels
# optimize the efficiency of the node:
mask_visual.plugin_args = {'qsub_args': '-l nodes=1:ppn=1', 'overwrite': True}
mask_visual.plugin_args = {'qsub_args': '-l mem=100MB', 'overwrite': True}
# function: use freesurfer mri_binarize to threshold an input volume
mask_hippocampus = MapNode(
        interface=Binarize(), name='mask_hippocampus', iterfield=['in_file'])
# input: match instead of threshold
mask_hippocampus.inputs.match = mask_hippocampus_labels
# optimize the efficiency of the node:
mask_hippocampus.plugin_args = {
    'qsub_args': '-l nodes=1:ppn=1', 'overwrite': True}
mask_hippocampus.plugin_args = {
    'qsub_args': '-l mem=100MB', 'overwrite': True}
# function: use freesurfer mri_binarize to threshold an input volume
mask_mtl = MapNode(
        interface=Binarize(), name='mask_mtl', iterfield=['in_file'])
# input: match instead of threshold
mask_mtl.inputs.match = mask_mtl_labels
# optimize the efficiency of the node:
mask_mtl.plugin_args = {'qsub_args': '-l nodes=1:ppn=1', 'overwrite': True}
mask_mtl.plugin_args = {'qsub_args': '-l mem=100MB', 'overwrite': True}
# ======================================================================
# CREATE DATASINK NODE
# ======================================================================
# create a node of the function:
datasink = Node(DataSink(), name='datasink')
# assign the path to the base directory:
datasink.inputs.base_directory = opj(path_root, 'masks')
# create a list of substitutions to adjust the filepaths of datasink:
substitutions = [('_subject_id_', '')]
# assign the substitutions to the datasink command:
datasink.inputs.substitutions = substitutions
# determine whether to store output in parameterized form:
datasink.inputs.parameterization = True
# set expected thread and memory usage for the node:
datasink.interface.num_threads = 1
datasink.interface.estimated_memory_gb = 0.2
# ======================================================================
# DEFINE WORKFLOW PIPELINE:
# ======================================================================
# initiation of the 1st-level analysis workflow:
wf = Workflow(name='masks')
# stop execution of the workflow if an error is encountered:
wf.config = {'execution': {'stop_on_first_crash': True}}
# define the base directory of the workflow:
wf.base_dir = opj(path_root, 'work')
# connect infosource to selectfiles node:
wf.connect(infosource, 'subject_id', selectfiles, 'subject_id')
# connect functional files to smoothing workflow:
wf.connect(selectfiles, 'func', susan, 'inputnode.in_files')
wf.connect(selectfiles, 'wholemask', susan, 'inputnode.mask_file')
wf.connect(susan, 'outputnode.smoothed_files', datasink, 'smooth')
# connect segmented functional files to visual mask node
wf.connect(selectfiles, 'func_parc', mask_visual, 'in_file')
wf.connect(mask_visual, 'binary_file', datasink, 'mask_visual.@binary')
wf.connect(mask_visual, 'count_file', datasink, 'mask_visual.@count')
# connect segmented functional files to hippocampus node
wf.connect(selectfiles, 'func_parc', mask_hippocampus, 'in_file')
wf.connect(mask_hippocampus, 'binary_file', datasink, 'mask_hippocampus.@binary')
wf.connect(mask_hippocampus, 'count_file', datasink, 'mask_hippocampus.@count')
# connect segmented functional files to mtl node
wf.connect(selectfiles, 'func_parc', mask_mtl, 'in_file')
wf.connect(mask_mtl, 'binary_file', datasink, 'mask_mtl.@binary')
wf.connect(mask_mtl, 'count_file', datasink, 'mask_mtl.@count')
# ======================================================================
# WRITE GRAPH AND EXECUTE THE WORKFLOW
# ======================================================================
# write the graph:
wf.write_graph(graph2use='colored', simple_form=True)
# set the maximum resources the workflow can utilize:
# args_dict = {'status_callback' : log_nodes_cb}
# execute the workflow depending on the operating system:
if 'darwin' in sys.platform:
    # will execute the workflow using all available cpus:
    wf.run(plugin='MultiProc')
elif 'linux' in sys.platform:
    wf.run(plugin='PBS', plugin_args=dict(template=job_template))
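
The Binarize nodes in the workflow above call FreeSurfer's mri_binarize, which sets all voxels whose aparc+aseg label matches one of the requested values to 1 and everything else to 0. As a minimal standalone illustration (the input file name is hypothetical; the pipeline itself uses the Nipype interface shown above, not this code):

import numpy as np
import nibabel as nib

# hypothetical aparc+aseg segmentation resampled to the functional space:
parc = nib.load('sub-01_task-highspeed_run-01_space-T1w_desc-aparcaseg_dseg.nii.gz')
labels = np.asarray(parc.dataobj)
# e.g., left and right hippocampus (FreeSurfer labels 17 and 53):
binary = np.isin(labels, [17, 53]).astype(np.uint8)
nib.Nifti1Image(binary, parc.affine).to_filename('mask_hippocampus.nii.gz')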

3.4.3 Plotting masked data using highspeed-masks-plot.py

We generated plots of the masks and of the average patterns for illustration using the following code:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# ======================================================================
# SCRIPT INFORMATION:
# ======================================================================
# SCRIPT: PLOT MASKS FOR ILLUSTRATION
# PROJECT: HIGHSPEED
# WRITTEN BY LENNART WITTKUHN, 2020
# CONTACT: WITTKUHN AT MPIB HYPHEN BERLIN DOT MPG DOT DE
# MAX PLANCK RESEARCH GROUP NEUROCODE
# MAX PLANCK INSTITUTE FOR HUMAN DEVELOPMENT
# MAX PLANCK UCL CENTRE FOR COMPUTATIONAL PSYCHIATRY AND AGEING RESEARCH
# LENTZEALLEE 94, 14195 BERLIN, GERMANY
# ======================================================================
# IMPORT RELEVANT PACKAGES
# ======================================================================
from nilearn import plotting
from nilearn import image
import os
from os.path import join as opj
import glob
import re
import matplotlib.pyplot as plt
import statistics

sub = 'sub-01'
path_root = os.getcwd()
path_anat = opj(path_root, 'data', 'bids', sub, '*', 'anat', '*.nii.gz')
anat = sorted(glob.glob(path_anat))[0]
path_patterns = opj(path_root, 'data', 'decoding', sub, 'data', '*.nii.gz')
path_patterns = sorted(glob.glob(path_patterns))
path_patterns = [x for x in path_patterns if 'hpc' not in x]
path_union = opj(path_root, 'data', 'decoding', sub, 'masks', '*union.nii.gz')
path_union = glob.glob(path_union)

# path_fmriprep = opj(path_root, 'derivatives', 'fmriprep')
# path_masks = opj(path_root, 'derivatives', 'decoding', sub)
# path_anat = opj(path_fmriprep, sub, 'anat', sub + '_desc-preproc_T1w.nii.gz')
# path_visual = opj(path_masks, 'masks', '*', '*.nii.gz')
# vis_mask = glob.glob(path_visual)[0]
# vis_mask_smooth = image.smooth_img(vis_mask, 4)

plotting.plot_roi(path_union[0], bg_img=anat,
                  cut_coords=[30, 10, 15],
                  title="Region-of-interest", black_bg=True,
                  display_mode='ortho', cmap='red_transparent',
                  draw_cross=False)

# calculate average patterns across all trials:
mean_patterns = [image.mean_img(i) for i in path_patterns]
# check the shape of the mean patterns (should be 3D):
[print(image.get_data(i).shape) for i in mean_patterns]
# extract labels of patterns
labels = [re.search('union_(.+?).nii.gz', i).group(1) for i in path_patterns]


# function used to plot individual patterns:
def plot_patterns(pattern, name):
    display = plotting.plot_stat_map(
            pattern,
            bg_img=anat,
            #cut_coords=[30, 29, -6],
            title=name,
            black_bg=True,
            colorbar=True,
            display_mode='ortho',
            draw_cross=False
            )
    path_save = opj(path_root, 'figures', 'pattern_{}.pdf').format(name)
    display.savefig(filename=path_save)
    display.close()
    return(display)


# plot individual patterns and save coordinates:
coords = []
for pattern, name in zip(mean_patterns, labels):
    display = plot_patterns(pattern, name)
    coords.append(display.cut_coords)
# mean_coords = [sum(x)/len(x) for x in zip(*coords)]
mean_coords = [statistics.median(x) for x in zip(*coords)]

# create subplot with all patterns using mean coordinates:
fig, axes = plt.subplots(nrows=len(path_patterns), ncols=1, figsize=(14, 20))
for pattern, name, ax in zip(mean_patterns, labels,  axes):
    display = plotting.plot_stat_map(
            pattern, bg_img=anat, cut_coords=mean_coords, title=name,
            black_bg=True, colorbar=True, display_mode='ortho',
            draw_cross=False, axes=ax, symmetric_cbar=True, vmax=1)
fig.subplots_adjust(wspace=0, hspace=0)
fig.savefig(opj(path_root, 'figures', 'pattern_all.pdf'), bbox_inches='tight')

3.4.4 Software: Required packages

The requirements.txt file lists the required packages, which can be installed, for example, using pip install -r requirements.txt:

annexremote==1.4.3
appdirs==1.4.4
bids-validator==1.3.12
boto==2.49.0
certifi==2019.11.28
cffi==1.14.3
chardet==3.0.4
Click==7.0
cryptography==3.2.1
datalad==0.13.5
decorator==4.4.1
Deprecated==1.2.10
docopt==0.6.2
etelemetry==0.1.2
fasteners==0.15
filelock==3.0.12
future==0.18.2
humanize==2.6.0
idna==2.8
importlib-metadata==2.0.0
iso8601==0.1.13
isodate==0.6.0
jeepney==0.4.3
joblib==0.14.1
jsmin==2.2.2
keyring==20.0.1
keyrings.alt==3.2.0
lxml==4.4.2
monotonic==1.5
msgpack==1.0.0
networkx==2.4
neurdflib==5.0.1
nibabel==3.0.0
niflow-nipype1-workflows==0.0.4
nilearn==0.6.0
nipype==1.4.0
num2words==0.5.10
numpy==1.18.0
packaging==20.0
pandas==0.25.3
patool==1.12
patsy==0.5.1
pkg-resources==0.0.0
prov==1.5.3
pybids==0.10.0
pycparser==2.20
pydot==1.4.1
pydotplus==2.0.2
PyGithub==1.53
PyJWT==1.7.1
pyparsing==2.4.6
python-dateutil==2.8.1
pytz==2019.3
rdflib==4.2.2
requests==2.22.0
scikit-learn==0.22.1
scipy==1.4.1
SecretStorage==3.2.0
simplejson==3.17.0
six==1.13.0
sklearn==0.0
SQLAlchemy==1.3.12
tqdm==4.51.0
traits==5.2.0
urllib3==1.25.7
Whoosh==2.7.4
wrapt==1.12.1
zipp==1.2.0

3.5 Feature selection: First-level GLMs

3.5.1 Overview

As described above, we used a feature selection approach that combined binarized anatomical ROIs with functional ROIs based on first-level GLMs. Below, we show the code that was used to run the first-level GLMs using Nipype:

3.5.1.1 Data availability

The data is freely available from https://github.com/lnnrtwttkhn/highspeed-glm and https://gin.g-node.org/lnnrtwttkhn/highspeed-glm.

3.5.1.2 License

The dataset is licensed under Creative Commons Attribution-ShareAlike 4.0. Please see https://creativecommons.org/licenses/by-sa/4.0/ for details.

3.5.2 Main GLM workflow (highspeed-glm-main.py)

The code below shows the main Nipype workflow to run first-level GLMs used for feature selection:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# ======================================================================
# SCRIPT INFORMATION:
# ======================================================================
# SCRIPT: FIRST LEVEL GLM
# PROJECT: HIGHSPEED
# WRITTEN BY LENNART WITTKUHN, 2018 - 2020
# CONTACT: WITTKUHN AT MPIB HYPHEN BERLIN DOT MPG DOT DE
# MAX PLANCK RESEARCH GROUP NEUROCODE
# MAX PLANCK INSTITUTE FOR HUMAN DEVELOPMENT
# MAX PLANCK UCL CENTRE FOR COMPUTATIONAL PSYCHIATRY AND AGEING RESEARCH
# LENTZEALLEE 94, 14195 BERLIN, GERMANY
# ACKNOWLEDGEMENTS: THANKS TO HRVOJE STOJIC (UCL) FOR HELP
# ======================================================================
# IMPORT RELEVANT PACKAGES
# ======================================================================
# import basic libraries:
import os
import glob
import sys
import warnings
from os.path import join as opj
# import nipype libraries:
from nipype.interfaces.utility import Function, IdentityInterface
from nipype.interfaces.io import SelectFiles, DataSink
from nipype.pipeline.engine import Workflow, Node, MapNode
from nipype.utils.profiler import log_nodes_cb
from nipype import config, logging
# import spm and matlab interfaces:
from nipype.algorithms.modelgen import SpecifySPMModel
from nipype.interfaces.spm.model import (
    Level1Design, EstimateModel, EstimateContrast, ThresholdStatistics,
    Threshold)
from nipype.interfaces.matlab import MatlabCommand
from nipype.interfaces import spm
# import fsl interfaces:
from nipype.workflows.fmri.fsl import create_susan_smooth
from nipype.interfaces.fsl.utils import ExtractROI
# import libraries for bids interaction:
from bids.layout import BIDSLayout
# import freesurfer interfaces:
# import custom functions:
from highspeed_glm_functions import (
    get_subject_info, plot_stat_maps, leave_one_out)
import datalad.api as dl
# ======================================================================
# ENVIRONMENT SETTINGS (DEALING WITH ERRORS AND WARNINGS):
# ======================================================================
# set the fsl output type environment variable:
os.environ['FSLOUTPUTTYPE'] = 'NIFTI_GZ'
# deal with nipype-related warnings:
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"
# inhibit CTF lock
os.environ['MCR_INHIBIT_CTF_LOCK'] = '1'
# filter out warnings related to the numpy package:
warnings.filterwarnings("ignore", message="numpy.dtype size changed*")
warnings.filterwarnings("ignore", message="numpy.ufunc size changed*")
# ======================================================================
# SET PATHS AND SUBJECTS
# ======================================================================
# define paths depending on the operating system (OS) platform:
project = 'highspeed'
# initialize empty paths:
path_root = None
sub_list = None
# path to the project root:
project_name = 'highspeed-glm'
path_root = os.getenv('PWD').split(project_name)[0] + project_name
if 'darwin' in sys.platform:
    path_spm = '/Users/Shared/spm12'
    path_matlab = '/Applications/MATLAB_R2017a.app/bin/matlab -nodesktop -nosplash'
    # set paths for spm:
    spm.SPMCommand.set_mlab_paths(paths=path_spm, matlab_cmd=path_matlab)
    MatlabCommand.set_default_paths(path_spm)
    MatlabCommand.set_default_matlab_cmd(path_matlab)
    sub_list = ['sub-01']
elif 'linux' in sys.platform:
    # path_matlab = '/home/mpib/wittkuhn/spm12.simg eval \$SPMMCRCMD'
    # path_matlab = opj('/home', 'beegfs', 'wittkuhn', 'tools', 'spm', 'spm12.simg eval \$SPMMCRCMD')
    singularity_cmd = 'singularity run -B /home/mpib/wittkuhn -B /mnt/beegfs/home/wittkuhn /home/mpib/wittkuhn/highspeed/highspeed-glm/tools/spm/spm12.simg'
    singularity_spm = 'eval \$SPMMCRCMD'
    path_matlab = ' '.join([singularity_cmd, singularity_spm])
    spm.SPMCommand.set_mlab_paths(matlab_cmd=path_matlab, use_mcr=True)
# grab the list of subjects from the bids data set:
layout = BIDSLayout(opj(path_root, 'bids'))
# get all subject ids:
sub_list = sorted(layout.get_subjects())
# create a template to add the "sub-" prefix to the ids
sub_template = ['sub-'] * len(sub_list)
# add the prefix to all ids:
sub_list = ["%s%s" % t for t in zip(sub_template, sub_list)]
# if user defined to run specific subject
sub_list = sub_list[int(sys.argv[1]):int(sys.argv[2])]
# print the SPM version:
print('using SPM version %s' % spm.SPMCommand().version)
# ======================================================================
# DEFINE PBS CLUSTER JOB TEMPLATE (NEEDED WHEN RUNNING ON THE CLUSTER):
# ======================================================================
job_template = """
#PBS -l walltime=10:00:00
#PBS -j oe
#PBS -o /home/mpib/wittkuhn/highspeed/highspeed-glm/logs/glm
#PBS -m n
#PBS -v FSLOUTPUTTYPE=NIFTI_GZ
source /etc/bash_completion.d/virtualenvwrapper
workon highspeed-glm
module load fsl/5.0
module load matlab/R2017b
module load freesurfer/6.0.0
"""
# ======================================================================
# SETTING UP LOGGING
# ======================================================================
#path_log = opj(path_root, 'logs', 'l1analyis')
# enable debug mode for logging and configuration:
#config.enable_debug_mode()
# enable logging to file and provide path to the logging file:
#config.update_config({'logging': {'log_directory': path_log,
#                                  'log_to_file': True},
#                      'execution': {'stop_on_first_crash': False,
#                                    'keep_unnecessary_outputs': 'false'},
#                      'monitoring': {'enabled': True}
#                      })
# update the global logging settings:
# logging.update_logging(config)
# callback_log_path = opj(path_log,'ressources.log')
# logger = logging.getLogger('callback')
# logger.setLevel(logging.DEBUG)
# handler = logging.FileHandler(callback_log_path)
# logger.addHandler(handler)
# ======================================================================
# SETTING UP LOGGING
# ======================================================================
#path_log = opj(path_root, 'logs', 'l1analyis')
## create directory to save the log files if it does not exist yet:
#if not os.path.exists(path_log):
#    os.makedirs(path_log)
## configure logging:
#logging.basicConfig(
#        filename=opj(path_log,'log_l1analysis.log'),
#        level=logging.DEBUG,
#        filemode = "a+",
#        format='%(asctime)s - %(levelname)s - %(message)s',
#        datefmt='%d/%m/%Y %H:%M:%S')
#logging.getLogger().addHandler(logging.StreamHandler())
# ======================================================================
# LOG INFORMATION ABOUT THE EXECUTION
# ======================================================================
# print the loaded SPM version:
#analysis_name = "Level 1 GLM analysis"
#logging.info("--------------------------------------------------------")
#logging.info("Analysis: " + analysis_name)
#logging.info("SPM version: " + (spm.SPMCommand().version))
#logging.info("List of subjects:")
#logging.info(sub_list)
# ======================================================================
# DEFINE SETTINGS
# ======================================================================
# time of repetition, in seconds:
time_repetition = 1.25
# total number of runs:
num_runs = 8
# smoothing kernel, in mm:
fwhm = 4
# number of dummy variables to remove from each run:
num_dummy = 0
# ======================================================================
# DEFINE NODE: INFOSOURCE
# ======================================================================
# define the infosource node that collects the data:
infosource = Node(IdentityInterface(
    fields=['subject_id']), name='infosource')
# let the node iterate (parallelize) over all subjects:
infosource.iterables = [('subject_id', sub_list)]
# ======================================================================
# DEFINE SELECTFILES NODE
# ======================================================================
path_confounds = glob.glob(opj(
        path_root, 'fmriprep', 'fmriprep', '*', '*',
        'func', '*highspeed*confounds_regressors.tsv'))
path_events = glob.glob(opj(
        path_root, 'bids', '*', '*', 'func', '*events.tsv'))
path_func = glob.glob(opj(
        path_root, 'fmriprep', 'fmriprep', '*', '*',
        'func', '*highspeed*space-T1w*preproc_bold.nii.gz'))
path_anat = glob.glob(opj(
        path_root, 'fmriprep', 'fmriprep', '*',
        'anat', '*_desc-preproc_T1w.nii.gz'))
path_wholemask = glob.glob(opj(
        path_root, 'fmriprep', 'fmriprep', '*', '*',
        'func', '*highspeed*space-T1w*brain_mask.nii.gz'))
dl.get(path_confounds)
dl.get(path_events)
dl.get(path_func)
dl.get(path_anat)
dl.get(path_wholemask)
# define all relevant files paths:
templates = dict(
    confounds=opj(path_root, 'fmriprep', 'fmriprep', '{subject_id}',
                  '*', 'func', '*highspeed*confounds_regressors.tsv'),
    events=opj(path_root, 'bids', '{subject_id}', '*', 'func',
               '*events.tsv'),
    func=opj(path_root, 'fmriprep', 'fmriprep', '{subject_id}', '*',
             'func', '*highspeed*space-T1w*preproc_bold.nii.gz'),
    anat=opj(path_root, 'fmriprep', 'fmriprep', '{subject_id}',
             'anat', '{subject_id}_desc-preproc_T1w.nii.gz'),
    wholemask=opj(path_root, 'fmriprep', 'fmriprep', '{subject_id}',
                  '*', 'func', '*highspeed*space-T1w*brain_mask.nii.gz'),
)
# define the selectfiles node:
selectfiles = Node(SelectFiles(templates, sort_filelist=True),
                   name='selectfiles')
# set expected thread and memory usage for the node:
selectfiles.interface.num_threads = 1
selectfiles.interface.mem_gb = 0.1
# selectfiles.inputs.subject_id = 'sub-20'
# selectfiles_results = selectfiles.run()
# ======================================================================
# DEFINE CREATE_SUSAN_SMOOTH WORKFLOW NODE
# ======================================================================
# define the susan smoothing node and specify the smoothing fwhm:
susan = create_susan_smooth()
# set the smoothing kernel:
susan.inputs.inputnode.fwhm = fwhm
# set expected thread and memory usage for the nodes:
susan.get_node('inputnode').interface.num_threads = 1
susan.get_node('inputnode').interface.mem_gb = 0.1
susan.get_node('median').interface.num_threads = 1
susan.get_node('median').interface.mem_gb = 3
susan.get_node('mask').interface.num_threads = 1
susan.get_node('mask').interface.mem_gb = 3
susan.get_node('meanfunc2').interface.num_threads = 1
susan.get_node('meanfunc2').interface.mem_gb = 3
susan.get_node('merge').interface.num_threads = 1
susan.get_node('merge').interface.mem_gb = 3
susan.get_node('multi_inputs').interface.num_threads = 1
susan.get_node('multi_inputs').interface.mem_gb = 3
susan.get_node('smooth').interface.num_threads = 1
susan.get_node('smooth').interface.mem_gb = 3
susan.get_node('outputnode').interface.num_threads = 1
susan.get_node('outputnode').interface.mem_gb = 0.1
# ======================================================================
# DEFINE NODE: FUNCTION TO GET THE SUBJECT-SPECIFIC INFORMATION
# ======================================================================
subject_info = MapNode(Function(
    input_names=['events', 'confounds'],
    output_names=['subject_info', 'event_names'],
    function=get_subject_info),
    name='subject_info', iterfield=['events', 'confounds'])
# set expected thread and memory usage for the node:
subject_info.interface.num_threads = 1
subject_info.interface.mem_gb = 0.1
# subject_info.inputs.events = selectfiles_results.outputs.events
# subject_info.inputs.confounds = selectfiles_results.outputs.confounds
# subject_info_results = subject_info.run()
# subject_info_results.outputs.subject_info
# ======================================================================
# DEFINE NODE: REMOVE DUMMY VARIABLES (USING FSL ROI)
# ======================================================================
# function: extract region of interest (ROI) from an image
trim = MapNode(ExtractROI(), name='trim', iterfield=['in_file'])
# define index of the first selected volume (i.e., minimum index):
trim.inputs.t_min = num_dummy
# define the number of volumes selected starting at the minimum index:
trim.inputs.t_size = -1
# define the fsl output type:
trim.inputs.output_type = 'NIFTI'
# set expected thread and memory usage for the node:
trim.interface.num_threads = 1
trim.interface.mem_gb = 3
# ======================================================================
# DEFINE NODE: LEAVE-ONE-RUN-OUT SELECTION OF DATA
# ======================================================================
leave_one_run_out = Node(Function(
    input_names=['subject_info', 'event_names', 'data_func', 'run'],
    output_names=['subject_info', 'data_func', 'contrasts'],
    function=leave_one_out),
    name='leave_one_run_out')
# define the number of rows as an iterable:
leave_one_run_out.iterables = ('run', range(num_runs))
# ======================================================================
# DEFINE NODE: SPECIFY SPM MODEL (GENERATE SPM-SPECIFIC MODEL)
# ======================================================================
# function: makes a model specification compatible with spm designers
# adds SPM specific options to SpecifyModel
l1model = Node(SpecifySPMModel(), name="l1model")
# input: concatenate runs to a single session (boolean, default: False):
l1model.inputs.concatenate_runs = False
# input: units of event onsets and durations (secs or scans):
l1model.inputs.input_units = 'secs'
# input: units of design event onsets and durations (secs or scans):
l1model.inputs.output_units = 'secs'
# input: time of repetition (a float):
l1model.inputs.time_repetition = time_repetition
# high-pass filter cutoff in secs (a float, default = 128 secs):
l1model.inputs.high_pass_filter_cutoff = 128
# ======================================================================
# DEFINE NODE: LEVEL 1 DESIGN (GENERATE AN SPM DESIGN MATRIX)
# ======================================================================
# function: generate an SPM design matrix
l1design = Node(Level1Design(), name="l1design")
# input: (a dictionary with keys which are 'hrf' or 'fourier' or
# 'fourier_han' or 'gamma' or 'fir' and with values which are any value)
l1design.inputs.bases = {'hrf': {'derivs': [0, 0]}}
# input: units for specification of onsets ('secs' or 'scans'):
l1design.inputs.timing_units = 'secs'
# input: interscan interval / repetition time in secs (a float):
l1design.inputs.interscan_interval = time_repetition
# input: Model serial correlations AR(1), FAST or none:
l1design.inputs.model_serial_correlations = 'AR(1)'
# input: number of time-bins per scan in secs (an integer):
l1design.inputs.microtime_resolution = 16
# input: the onset/time-bin in seconds for alignment (a float):
l1design.inputs.microtime_onset = 1
# set expected thread and memory usage for the node:
l1design.interface.num_threads = 1
l1design.interface.mem_gb = 2
# ======================================================================
# DEFINE NODE: ESTIMATE MODEL (ESTIMATE THE PARAMETERS OF THE MODEL)
# ======================================================================
# function: use spm_spm to estimate the parameters of a model
l1estimate = Node(EstimateModel(), name="l1estimate")
# input: (a dictionary with keys which are 'Classical' or 'Bayesian2'
# or 'Bayesian' and with values which are any value)
l1estimate.inputs.estimation_method = {'Classical': 1}
# set expected thread and memory usage for the node:
l1estimate.interface.num_threads = 1
l1estimate.interface.mem_gb = 2
# ======================================================================
# DEFINE NODE: ESTIMATE CONTRASTS (ESTIMATES THE CONTRASTS)
# ======================================================================
# function: use spm_contrasts to estimate contrasts of interest
l1contrasts = Node(EstimateContrast(), name="l1contrasts")
# input: list of contrasts with each contrast being a list of the form:
# [('name', 'stat', [condition list], [weight list], [session list])]:
# l1contrasts.inputs.contrasts = l1contrasts_list
# node input: overwrite previous results:
l1contrasts.overwrite = True
# set expected thread and memory usage for the node:
l1contrasts.interface.num_threads = 1
l1contrasts.interface.mem_gb = 1.5
# ======================================================================
# DEFINE NODE: FUNCTION TO PLOT CONTRASTS
# ======================================================================
plot_contrasts = MapNode(Function(
    input_names=['anat', 'stat_map', 'thresh'],
    output_names=['out_path'],
    function=plot_stat_maps),
    name='plot_contrasts', iterfield=['thresh'])
# input: plot data with set of different thresholds:
plot_contrasts.inputs.thresh = [None, 1, 2, 3]
# set expected thread and memory usage for the node:
plot_contrasts.interface.num_threads = 1
plot_contrasts.interface.mem_gb = 0.2
# ======================================================================
# DEFINE NODE: THRESHOLD
# ======================================================================
# function: Topological FDR thresholding based on cluster extent/size.
# Smoothness is estimated from GLM residuals but is assumed to be the
# same for all of the voxels.
thresh = Node(Threshold(), name="thresh")
# input: whether to use FWE (Bonferroni) correction for initial threshold
# (a boolean, nipype default value: True):
thresh.inputs.use_fwe_correction = False
# input: whether to use FDR over cluster extent probabilities (boolean)
thresh.inputs.use_topo_fdr = True
# input: value for initial thresholding (defining clusters):
thresh.inputs.height_threshold = 0.05
# input: is the cluster forming threshold a stat value or p-value?
# ('p-value' or 'stat', nipype default value: p-value):
thresh.inputs.height_threshold_type = 'p-value'
# input: which contrast in the SPM.mat to use (an integer):
thresh.inputs.contrast_index = 1
# input: p threshold on FDR corrected cluster size probabilities (float):
thresh.inputs.extent_fdr_p_threshold = 0.05
# input: minimum cluster size in voxels (an integer, default = 0):
thresh.inputs.extent_threshold = 0
# set expected thread and memory usage for the node:
thresh.interface.num_threads = 1
thresh.interface.mem_gb = 0.2
# ======================================================================
# DEFINE NODE: THRESHOLD STATISTICS
# ======================================================================
# function: Given height and cluster size threshold calculate
# theoretical probabilities concerning false positives
thresh_stat = Node(ThresholdStatistics(), name="thresh_stat")
# input: which contrast in the SPM.mat to use (an integer):
thresh_stat.inputs.contrast_index = 1
# ======================================================================
# CREATE DATASINK NODE (OUTPUT STREAM):
# ======================================================================
# create a node of the function:
l1datasink = Node(DataSink(), name='datasink')
# assign the path to the base directory:
l1datasink.inputs.base_directory = opj(path_root, 'l1pipeline')
# create a list of substitutions to adjust the file paths of datasink:
substitutions = [('_subject_id_', '')]
# assign the substitutions to the datasink command:
l1datasink.inputs.substitutions = substitutions
# determine whether to store output in parameterized form:
l1datasink.inputs.parameterization = True
# set expected thread and memory usage for the node:
l1datasink.interface.num_threads = 1
l1datasink.interface.mem_gb = 0.2
# ======================================================================
# DEFINE THE LEVEL 1 ANALYSIS SUB-WORKFLOW AND CONNECT THE NODES:
# ======================================================================
# initiation of the 1st-level analysis workflow:
l1analysis = Workflow(name='l1analysis')
# connect the 1st-level analysis components
l1analysis.connect(l1model, 'session_info', l1design, 'session_info')
l1analysis.connect(l1design, 'spm_mat_file', l1estimate, 'spm_mat_file')
l1analysis.connect(l1estimate, 'spm_mat_file', l1contrasts, 'spm_mat_file')
l1analysis.connect(l1estimate, 'beta_images', l1contrasts, 'beta_images')
l1analysis.connect(l1estimate, 'residual_image', l1contrasts, 'residual_image')
# ======================================================================
# DEFINE META-WORKFLOW PIPELINE:
# ======================================================================
# initiation of the 1st-level analysis workflow:
l1pipeline = Workflow(name='l1pipeline')
# stop execution of the workflow if an error is encountered:
l1pipeline.config = {'execution': {'stop_on_first_crash': True,
                                   'hash_method': 'timestamp'}}
# define the base directory of the workflow:
l1pipeline.base_dir = opj(path_root, 'work')
# ======================================================================
# ENABLE LOGGING:
# ======================================================================
# enable logging to file:
#config.enable_debug_mode()
#config.update_config({'logging': {'log_directory': os.getcwd(),
#                                  'log_to_file': True}})
#logging.update_logging(config)
# ======================================================================
# CONNECT WORKFLOW NODES:
# ======================================================================
# connect infosource to selectfiles node:
l1pipeline.connect(infosource, 'subject_id', selectfiles, 'subject_id')
# generate subject specific events and regressors to subject_info:
l1pipeline.connect(selectfiles, 'events', subject_info, 'events')
l1pipeline.connect(selectfiles, 'confounds', subject_info, 'confounds')
# connect functional files to smoothing workflow:
l1pipeline.connect(selectfiles, 'func', susan, 'inputnode.in_files')
l1pipeline.connect(selectfiles, 'wholemask', susan, 'inputnode.mask_file')
l1pipeline.connect(susan, 'outputnode.smoothed_files', l1datasink, 'smooth')
# connect smoothed functional data to the trimming node:
l1pipeline.connect(susan, 'outputnode.smoothed_files', trim, 'in_file')
# ======================================================================
# INPUT AND OUTPUT STREAM FOR THE LEVEL 1 SPM ANALYSIS SUB-WORKFLOW:
# ======================================================================
# connect regressors to the subsetting node:
l1pipeline.connect(subject_info, 'subject_info', leave_one_run_out, 'subject_info')
# connect event_names to the subsetting node:
l1pipeline.connect(subject_info, 'event_names', leave_one_run_out, 'event_names')
# connect smoothed and trimmed data to subsetting node:
l1pipeline.connect(trim, 'roi_file', leave_one_run_out, 'data_func')
# connect regressors to the level 1 model specification node:
l1pipeline.connect(leave_one_run_out, 'subject_info', l1analysis, 'l1model.subject_info')
# connect smoothed and trimmed data to the level 1 model specification:
l1pipeline.connect(leave_one_run_out, 'data_func', l1analysis, 'l1model.functional_runs')
# connect l1 contrast specification to contrast estimation
l1pipeline.connect(leave_one_run_out, 'contrasts', l1analysis, 'l1contrasts.contrasts')
# connect the anatomical image to the plotting node:
l1pipeline.connect(selectfiles, 'anat', plot_contrasts, 'anat')
# connect spm t-images to the plotting node:
l1pipeline.connect(l1analysis, 'l1contrasts.spmT_images', plot_contrasts, 'stat_map')
# connect the t-images and spm mat file to the threshold node:
l1pipeline.connect(l1analysis, 'l1contrasts.spmT_images', thresh, 'stat_image')
l1pipeline.connect(l1analysis, 'l1contrasts.spm_mat_file', thresh, 'spm_mat_file')
# connect all output results of the level 1 analysis to the datasink:
l1pipeline.connect(l1analysis, 'l1estimate.beta_images', l1datasink, 'estimates.@beta_images')
l1pipeline.connect(l1analysis, 'l1estimate.residual_image', l1datasink, 'estimates.@residual_image')
l1pipeline.connect(l1analysis, 'l1contrasts.spm_mat_file', l1datasink, 'contrasts.@spm_mat')
l1pipeline.connect(l1analysis, 'l1contrasts.spmT_images', l1datasink, 'contrasts.@spmT')
l1pipeline.connect(l1analysis, 'l1contrasts.con_images', l1datasink, 'contrasts.@con')
l1pipeline.connect(plot_contrasts, 'out_path', l1datasink, 'contrasts.@out_path')
l1pipeline.connect(thresh, 'thresholded_map', l1datasink, 'thresh.@threshhold_map')
l1pipeline.connect(thresh, 'pre_topo_fdr_map', l1datasink, 'thresh.@pre_topo_fdr_map')
# ======================================================================
# WRITE GRAPH AND EXECUTE THE WORKFLOW
# ======================================================================
# write the graph:
l1pipeline.write_graph(graph2use='colored', simple_form=True)
# set the maximum resources the workflow can utilize:
# args_dict = {'status_callback' : log_nodes_cb}
# execute the workflow depending on the operating system:
if 'darwin' in sys.platform:
    # will execute the workflow using all available cpus:
    l1pipeline.run(plugin='MultiProc')
elif 'linux' in sys.platform:
    l1pipeline.run(plugin='PBS', plugin_args=dict(template=job_template))

3.5.3 Extra GLM functions (highspeed-glm-functions.py)

The main GLM Nipype workflow shown above uses the following custom functions (a small toy example illustrating get_subject_info follows the code):

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

# ======================================================================
# DEFINE FUNCTION: FUNCTION TO GET THE SUBJECT-SPECIFIC INFORMATION
# ======================================================================


def get_subject_info(events, confounds):
    """
    FUNCTION TO GET THE SUBJECT-SPECIFIC INFORMATION
    :param events: list with paths to events files
    :param confounds: list with paths to confounds files
    :return: Bunch object with event onsets, durations and regressors
    """

    # import libraries (needed to be done in the function):
    import pandas as pd
    from nipype.interfaces.base import Bunch

    # event types we consider:
    event_spec = {
        'correct_rejection': {'target': 0, 'key_down': 0},
        'hit': {'target': 1, 'key_down': 1},
        'false_alarm': {'target': 0, 'key_down': 1},
        'miss': {'target': 1, 'key_down': 0},
    }

    #event_names = ['correct_rejection']

    # read the events and confounds files of the current run:
    #events = selectfiles_results.outputs.events[0]
    #confounds = selectfiles_results.outputs.confounds[0]
    run_events = pd.read_csv(events, sep="\t")
    run_confounds = pd.read_csv(confounds, sep="\t")

    # define confounds to include as regressors:
    confounds = ['trans', 'rot', 'a_comp_cor', 'framewise_displacement']

    # search for confounds of interest in the confounds data frame:
    regressor_names = [col for col in run_confounds.columns if
                       any([conf in col for conf in confounds])]

    def replace_nan(regressor_values):
        # calculate the mean value of the regressor:
        mean_value = regressor_values.mean(skipna=True)
        # replace all values containing nan with the mean value:
        regressor_values[regressor_values.isnull()] = mean_value
        # return list of the regressor values:
        return list(regressor_values)

    # create a nested list with regressor values
    regressors = [replace_nan(run_confounds[conf]) for conf in regressor_names]

    onsets = []
    durations = []
    event_names = []

    for event in event_spec:

        onset_list = list(
            run_events['onset']
            [(run_events['condition'] == 'oddball') &
             (run_events['target'] == event_spec[event]['target']) &
             (run_events['key_down'] == event_spec[event]['key_down'])])

        duration_list = list(
            run_events['duration']
            [(run_events['condition'] == 'oddball') &
             (run_events['target'] == event_spec[event]['target']) &
             (run_events['key_down'] == event_spec[event]['key_down'])])

        if (onset_list != []) & (duration_list != []):
            event_names.append(event)
            onsets.append(onset_list)
            durations.append(duration_list)

    # create a bunch for each run:
    subject_info = Bunch(
        conditions=event_names, onsets=onsets, durations=durations,
        regressor_names=regressor_names, regressors=regressors)

    return subject_info, sorted(event_names)

# ======================================================================
# DEFINE FUNCTION: FUNCTION TO PLOT THE CONTRASTS AGAINST ANATOMICAL
# ======================================================================


def plot_stat_maps(anat, stat_map, thresh):
    """

    :param anat:
    :param stat_map:
    :param thresh:
    :return:
    """
    # import libraries (needed to be done in the function):
    from nilearn.plotting import plot_stat_map
    from os import path as op

    out_path = op.join(op.dirname(stat_map), 'contrast_thresh_%s.png' % thresh)
    plot_stat_map(
            stat_map, title=('Threshold: %s' % thresh),
            bg_img=anat, threshold=thresh, dim=-1, display_mode='ortho',
            output_file=out_path)

    return out_path


def leave_one_out(subject_info, event_names, data_func, run):
    """
    Select subsets of lists in leave-one-out fashion:
    :param subject_info: list of subject info bunch objects
    :param data_func: list of functional runs
    :param run: current run
    :return: return list of subject info and data excluding the current run
    """

    # create new list with event_names of all runs except current run:
    event_names = [info for i, info in enumerate(event_names) if i != run]
    num_events = [len(i) for i in event_names]
    max_events = event_names[num_events.index(max(num_events))]

    # create list of contrasts:
    stim = 'correct_rejection'
    contrast1 = (stim, 'T', max_events, [1 if stim in s else 0 for s in max_events])
    contrasts = [contrast1]

    # create new list with subject info of all runs except current run:
    subject_info = [info for i, info in enumerate(subject_info) if i != run]
    # create new list with functional data of all runs except current run:
    data_func = [info for i, info in enumerate(data_func) if i != run]

    # return the new lists
    return subject_info, data_func, contrasts

3.5.4 Software: Required packages

The requirements.txt file lists the required packages, which can be installed, e.g., using pip install -r requirements.txt (see the installation sketch after the package list):

alabaster==0.7.10
annexremote==1.4.3
apipkg==1.4
appdirs==1.4.4
asn1crypto==0.24.0
atomicwrites==1.1.5
attrs==18.1.0
Babel==2.6.0
bcrypt==3.1.4
boto==2.49.0
certifi==2018.4.16
cffi==1.11.5
chardet==3.0.4
citeproc-py==0.4.0
click==6.7
codecov==2.0.15
configparser==3.5.0
coverage==4.5.1
cryptography==2.2.2
cycler==0.10.0
datalad==0.13.5
decorator==4.3.0
Deprecated==1.2.10
dipy==0.14.0
docutils==0.14
duecredit==0.6.3
execnet==1.5.0
fasteners==0.15
funcsigs==1.0.2
future==0.16.0
futures==3.1.1
grabbit==0.2.5
h5py==2.8.0
humanize==2.6.0
idna==2.6
imagesize==1.0.0
importlib-metadata==2.0.0
iso8601==0.1.13
isodate==0.6.0
jeepney==0.4.3
Jinja2==2.10
jsmin==2.2.2
keyring==20.0.1
keyrings.alt==3.2.0
kiwisolver==1.0.1
lxml==4.2.1
MarkupSafe==1.0
matplotlib==2.2.2
mock==2.0.0
monotonic==1.5
more-itertools==4.2.0
mpmath==1.0.0
msgpack==1.0.0
networkx==2.1
neurdflib==5.0.0.post1
nibabel==2.2.1
nilearn==0.4.1
nipy==0.4.2
nipype==1.1.5
nitime==0.7
num2words==0.5.6
numpy==1.14.4
numpydoc==0.8.0
packaging==17.1
pandas==0.23.4
paramiko==2.4.1
patool==1.12
patsy==0.5.0
pbr==4.0.4
pkg-resources==0.0.0
pluggy==0.6.0
prov==1.5.2
psutil==5.4.6
py==1.5.3
pyasn1==0.4.3
pybids==0.6.5
pycparser==2.18
pydeface==2.0.0
pydot==1.2.4
pydotplus==2.0.2
PyGithub==1.53
Pygments==2.2.0
PyJWT==1.7.1
PyNaCl==1.2.1
pyparsing==2.2.0
pytest==3.6.1
pytest-cov==2.5.1
pytest-env==0.6.2
pytest-forked==0.2
pytest-xdist==1.22.2
python-dateutil==2.7.3
pytz==2018.4
rdflib==4.2.2
requests==2.18.4
scikit-learn==0.20.3
scipy==1.1.0
SecretStorage==3.2.0
simplejson==3.15.0
six==1.11.0
sklearn==0.0
snowballstemmer==1.2.1
Sphinx==1.7.5
sphinxcontrib-websupport==1.1.0
sympy==1.1.1
tqdm==4.52.0
traits==4.6.0
urllib3==1.22
Whoosh==2.7.4
wrapt==1.12.1
xvfbwrapper==0.2.9
yapf==0.22.0
zipp==1.2.0
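
As a minimal installation sketch (assuming a Python version compatible with the pinned package versions above; the name and location of the virtual environment are arbitrary), the packages could be installed into a fresh virtual environment from the repository root:

# create and activate a fresh virtual environment (name is arbitrary):
python3 -m venv env
source env/bin/activate
# install the pinned package versions listed in requirements.txt:
pip install -r requirements.txt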

3.6 Multivariate decoding

3.6.1 Overview

3.6.1.1 Data availability

The data is freely available from https://github.com/lnnrtwttkhn/highspeed-decoding and https://gin.g-node.org/lnnrtwttkhn/highspeed-decoding.
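
As a minimal sketch of how the dataset could be obtained (assuming DataLad and git-annex are installed and the repository is a DataLad dataset, consistent with the use of datalad in the decoding pipeline below; the participant path in the last step is only a hypothetical example):

# clone the dataset from GIN (fetches metadata and small files only):
datalad clone https://gin.g-node.org/lnnrtwttkhn/highspeed-decoding
cd highspeed-decoding
# retrieve the actual file contents on demand, e.g., for a single participant:
datalad get decoding/sub-01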

3.6.1.2 License

The dataset is licensed under Creative Commons Attribution-ShareAlike 4.0. Please see https://creativecommons.org/licenses/by-sa/4.0/ for details.

3.6.2 Main decoding pipeline

Here, we show the Python code used to run the decoding pipeline:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# ======================================================================
# SCRIPT INFORMATION:
# ======================================================================
# SCRIPT: DECODING ANALYSIS
# PROJECT: HIGHSPEED
# WRITTEN BY LENNART WITTKUHN, 2018 - 2019
# CONTACT: WITTKUHN AT MPIB HYPHEN BERLIN DOT MPG DOT DE
# MAX PLANCK RESEARCH GROUP NEUROCODE
# MAX PLANCK INSTITUTE FOR HUMAN DEVELOPMENT
# MAX PLANCK UCL CENTRE FOR COMPUTATIONAL PSYCHIATRY AND AGEING RESEARCH
# LENTZEALLEE 94, 14195 BERLIN, GERMANY
'''
========================================================================
IMPORT RELEVANT PACKAGES:
========================================================================
'''
import glob
import os
import yaml
import logging
import time
from os.path import join as opj
import sys
import copy
from pprint import pformat
import numpy as np
from nilearn.input_data import NiftiMasker
from nilearn import plotting, image, masking
import pandas as pd
from matplotlib import pyplot as plt
import warnings
from nilearn.signal import clean
from sklearn.svm import SVC
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.preprocessing import StandardScaler
from collections import Counter
import datalad.api as dl
'''
========================================================================
DEAL WITH ERRORS, WARNINGS, ETC.
========================================================================
'''
warnings.filterwarnings('ignore', message='numpy.dtype size changed*')
warnings.filterwarnings('ignore', message='numpy.ufunc size changed*')
mpl_logger = logging.getLogger('matplotlib')
mpl_logger.setLevel(logging.WARNING)
'''
========================================================================
SETUP NECESSARY PATHS ETC:
========================================================================
'''
# get start time of the script:
start = time.time()
# name of the current project:
project = 'highspeed'
# initialize empty variables:
sub = None
path_tardis = None
path_server = None
path_local = None
# define paths depending on the operating system (OS) platform:
if 'darwin' in sys.platform:
    # define the path to the cluster:
    path_tardis = opj(os.environ['HOME'], 'Volumes', 'tardis_beegfs', project)
    # define the path to the server:
    path_server = opj('/Volumes', 'MPRG-Neurocode', 'Data', project)
    # define the path to the local computer:
    path_local = opj(os.environ['HOME'], project, '*', '*')
    # define the subject id:
    sub = 'sub-14'
elif 'linux' in sys.platform:
    # path to the project root:
    project_name = 'highspeed-decoding'
    path_root = os.getenv('PWD').split(project_name)[0] + project_name
    # define the path to the cluster:
    path_tardis = path_root
    # define the path to the server:
    path_server = path_tardis
    # define the path to the local computer:
    path_local = opj(path_tardis, 'code', 'decoding')
    # define the subject id:
    sub = 'sub-%s' % sys.argv[1]
'''
========================================================================
LOAD PROJECT PARAMETERS:
========================================================================
'''
path_params = glob.glob(opj(path_local, '*parameters.yaml'))[0]
with open(path_params, 'rb') as f:
    params = yaml.load(f, Loader=yaml.FullLoader)
'''
========================================================================
CREATE PATHS TO OUTPUT DIRECTORIES:
========================================================================
'''
path_decoding = opj(path_tardis, 'decoding')
path_out = opj(path_decoding, sub)
path_out_figs = opj(path_out, 'plots')
path_out_data = opj(path_out, 'data')
path_out_logs = opj(path_out, 'logs')
path_out_masks = opj(path_out, 'masks')
'''
========================================================================
CREATE OUTPUT DIRECTORIES IF THEY DO NOT EXIST YET:
========================================================================
'''
for path in [path_out_figs, path_out_data, path_out_logs, path_out_masks]:
    if not os.path.exists(path):
        os.makedirs(path)
'''
========================================================================
SETUP LOGGING:
========================================================================
'''
# remove all handlers associated with the root logger object:
for handler in logging.root.handlers[:]:
    logging.root.removeHandler(handler)
# get current date and time as a string:
timestr = time.strftime("%Y-%m-%d-%H-%M-%S")
# create path for the logging file
log = opj(path_out_logs, '%s-%s.log' % (timestr, sub))
# start logging:
logging.basicConfig(
    filename=log, level=logging.DEBUG, format='%(asctime)s %(message)s',
    datefmt='%d/%m/%Y %H:%M:%S')
'''
========================================================================
DEFINE DECODING SPECIFIC PARAMETERS:
========================================================================
'''
# define the mask to be used:
mask = 'visual'  # visual or whole
# applied time-shift to account for the BOLD delay, in seconds:
bold_delay = 4  # 4, 5 or 6 secs
# define the degree of smoothing of the functional data
smooth = 4
'''
========================================================================
ADD BASIC SCRIPT INFORMATION TO THE LOGGER:
========================================================================
'''
logging.info('Running decoding script')
logging.info('operating system: %s' % sys.platform)
logging.info('project name: %s' % project)
logging.info('participant: %s' % sub)
logging.info('mask: %s' % mask)
logging.info('bold delay: %d secs' % bold_delay)
logging.info('smoothing kernel: %d mm' % smooth)
'''
========================================================================
DEFINE RELEVANT VARIABLES:
========================================================================
'''
# time of repetition (TR), in seconds:
t_tr = params['mri']['tr']
# number of volumes (TRs) for each functional task run:
n_tr_run = 530
# acquisition time window of one sequence trial, in seconds:
t_win = 16
# number of measurements that are considered per sequence time window:
n_tr_win = round(t_win / t_tr)
# number of oddball trials in the experiment:
n_tr_odd = 600
# number of sequence trials in the experiment:
n_tr_seq = 75
# number of repetition trials in the experiment:
n_tr_rep = 45
# number of scanner triggers before the experiment starts:
n_tr_wait = params['mri']['num_trigger']
# number of functional task runs in total:
n_run = params['mri']['num_runs']
# number of experimental sessions in total:
n_ses = params['mri']['num_sessions']
# number of functional task runs per session:
n_run_ses = int(n_run / n_ses)
'''
========================================================================
LOAD BEHAVIORAL DATA (THE EVENTS.TSV FILES OF THE SUBJECT):
========================================================================
Load the events files for the subject. The 'oddball' data serves as the
training set of the classifier. The 'sequence' data constitutes the
test set.
'''
# paths to all events files of the current subject:
path_events = opj(path_tardis, 'bids', sub, 'ses-*', 'func', '*tsv')
dl.get(glob.glob(path_events))
path_events = sorted(glob.glob(path_events), key=lambda f: os.path.basename(f))
logging.info('found %d event files' % len(path_events))
logging.info('paths to events files (sorted):\n%s' % pformat(path_events))
# import events file and save data in dataframe:
df_events = pd.concat((pd.read_csv(f, sep='\t') for f in path_events),
                      ignore_index=True)
'''
========================================================================
CREATE PATHS TO THE MRI DATA:
========================================================================
'''
# define path to input directories:
path_fmriprep = opj(path_tardis, 'fmriprep', 'fmriprep', sub)
path_level1 = opj(path_tardis, 'glm', 'l1pipeline')
path_masks = opj(path_tardis, 'masks', 'masks')

logging.info('path to fmriprep files: %s' % path_fmriprep)
logging.info('path to level 1 files: %s' % path_level1)
logging.info('path to mask files: %s' % path_masks)
paths = {
    'fmriprep': opj(path_tardis, 'fmriprep', 'fmriprep', sub),
    'level1': opj(path_tardis, 'glm', 'l1pipeline'),
    'masks': opj(path_tardis, 'masks', 'masks')
}

# load the visual mask task files:
path_mask_vis_task = opj(path_masks, 'mask_visual', sub, '*', '*task-highspeed*.nii.gz')
path_mask_vis_task = sorted(glob.glob(path_mask_vis_task), key=lambda f: os.path.basename(f))
logging.info('found %d visual mask task files' % len(path_mask_vis_task))
logging.info('paths to visual mask task files:\n%s' % pformat(path_mask_vis_task))
dl.get(path_mask_vis_task)

# load the hippocampus mask task files:
path_mask_hpc_task = opj(path_masks, 'mask_hippocampus', sub, '*', '*task-highspeed*.nii.gz')
path_mask_hpc_task = sorted(glob.glob(path_mask_hpc_task), key=lambda f: os.path.basename(f))
logging.info('found %d hpc mask files' % len(path_mask_hpc_task))
logging.info('paths to hpc mask task files:\n%s' % pformat(path_mask_hpc_task))
dl.get(path_mask_hpc_task)

# load the whole brain mask files:
path_mask_whole_task = opj(path_fmriprep, '*', 'func', '*task-highspeed*T1w*brain_mask.nii.gz')
path_mask_whole_task = sorted(glob.glob(path_mask_whole_task), key=lambda f: os.path.basename(f))
logging.info('found %d whole-brain masks' % len(path_mask_whole_task))
logging.info('paths to whole-brain mask files:\n%s' % pformat(path_mask_whole_task))
dl.get(path_mask_whole_task)

# load the functional mri task files:
path_func_task = opj(path_level1, 'smooth', sub, '*', '*task-highspeed*nii.gz')
path_func_task = sorted(glob.glob(path_func_task), key=lambda f: os.path.basename(f))
logging.info('found %d functional mri task files' % len(path_func_task))
logging.info('paths to functional mri task files:\n%s' % pformat(path_func_task))
dl.get(path_func_task)

# define path to the functional resting state runs:
path_rest = opj(path_tardis, 'masks', 'masks', 'smooth', sub, '*', '*task-rest*nii.gz')
path_rest = sorted(glob.glob(path_rest), key=lambda f: os.path.basename(f))
logging.info('found %d functional mri rest files' % len(path_rest))
logging.info('paths to functional mri rest files:\n%s' % pformat(path_rest))
dl.get(path_rest)

# load the anatomical mri file:
path_anat = opj(path_fmriprep, 'anat', '%s_desc-preproc_T1w.nii.gz' % sub)
path_anat = sorted(glob.glob(path_anat), key=lambda f: os.path.basename(f))
logging.info('found %d anatomical mri file' % len(path_anat))
logging.info('paths to anatomical mri files:\n%s' % pformat(path_anat))
dl.get(path_anat)

# load the confounds files:
path_confs_task = opj(path_fmriprep, '*', 'func', '*task-highspeed*confounds_regressors.tsv')
path_confs_task = sorted(glob.glob(path_confs_task), key=lambda f: os.path.basename(f))
logging.info('found %d confounds files' % len(path_confs_task))
logging.info('paths to confounds files:\n%s' % pformat(path_confs_task))
dl.get(path_confs_task)

# load the spm.mat files:
path_spm_mat = opj(path_level1, 'contrasts', sub, '*', 'SPM.mat')
path_spm_mat = sorted(glob.glob(path_spm_mat), key=lambda f: os.path.dirname(f))
logging.info('found %d spm.mat files' % len(path_spm_mat))
logging.info('paths to spm.mat files:\n%s' % pformat(path_spm_mat))
dl.get(path_spm_mat)

# load the t-maps of the first-level glm:
path_tmap = opj(path_level1, 'contrasts', sub, '*', 'spmT*.nii')
path_tmap = sorted(glob.glob(path_tmap), key=lambda f: os.path.dirname(f))
logging.info('found %d t-maps' % len(path_tmap))
logging.info('paths to t-maps files:\n%s' % pformat(path_tmap))
dl.get(path_tmap)
'''
========================================================================
LOAD THE MRI DATA:
========================================================================
'''
anat = image.load_img(path_anat[0])
logging.info('successfully loaded %s' % path_anat[0])
# load visual mask:
mask_vis = image.load_img(path_mask_vis_task[0]).get_data().astype(int)
logging.info('successfully loaded one visual mask file!')
# load tmap data:
tmaps = [image.load_img(i).get_data().astype(float) for i in copy.deepcopy(path_tmap)]
logging.info('successfully loaded the tmaps!')
# load hippocampus mask:
mask_hpc = [image.load_img(i).get_data().astype(int) for i in copy.deepcopy(path_mask_hpc_task)]
logging.info('successfully loaded the hippocampus mask files!')
'''
FEATURE SELECTION
'''
# plot raw-tmaps on an anatomical background:
logging.info('plotting raw tmaps with anatomical as background:')
for i, path in enumerate(path_tmap):
    logging.info('plotting raw tmap %s (%d of %d)' % (path, i+1, len(path_tmap)))
    path_save = opj(path_out_figs, '%s_run-%02d_tmap_raw.png' % (sub, i+1))
    plotting.plot_roi(path, anat, title=os.path.basename(path_save),
                      output_file=path_save, colorbar=True)

'''
========================================================================
FEATURE SELECTION: MASK THE T-MAPS WITH THE ANATOMICAL MASKS
We create a combination of the t-maps and the anatomical mask.
To this end, we multiply the anatomical mask with each t-map.
As the anatomical mask consists of binary values (zeros and ones),
the multiplication results in a t-map masked by the anatomical ROI.
========================================================================
'''
#v= tmaps_masked[0]
#fig = plt.figure()
#ax = fig.add_subplot(111, projection='3d')
#ax.scatter(v[:,0],v[:,1],v[:,2], zdir='z', c= 'red')
#plt.show()

# check if any value in the supposedly binary mask is bigger than 1:
if np.any(mask_vis > 1):
    logging.info('WARNING: detected values > 1 in the anatomical ROI!')
    sys.exit("Values > 1 in the anatomical ROI!")
# get combination of anatomical mask and t-map
tmaps_masked = [np.multiply(mask_vis, i) for i in copy.deepcopy(tmaps)]
# masked tmap into image like object:
tmaps_masked_img = [image.new_img_like(i, j) for (i, j) in zip(path_tmap, copy.deepcopy(tmaps_masked))]

for i, path in enumerate(tmaps_masked_img):
    path_save = opj(path_out_masks, '%s_run-%02d_tmap_masked.nii.gz' % (sub, i + 1))
    path.to_filename(path_save)

# plot masked t-maps
logging.info('plotting masked tmaps with anatomical as background:')
for i, path in enumerate(tmaps_masked_img):
    logging.info('plotting masked tmap %d of %d' % (i+1, len(tmaps_masked_img)))
    path_save = opj(path_out_figs, '%s_run-%02d_tmap_masked.png' % (sub, i+1))
    plotting.plot_roi(path, anat, title=os.path.basename(path_save),
                      output_file=path_save, colorbar=True)
'''
========================================================================
FEATURE SELECTION: THRESHOLD THE MASKED T-MAPS
We threshold the masked t-maps, keeping only voxels with t-values
above the threshold. We then extract these data and count how many
voxels were above or below the threshold in total.
========================================================================
'''
# set the threshold:
threshold = params['mri']['thresh']
logging.info('thresholding t-maps with a threshold of %s' % str(threshold))
# threshold the masked tmap image:
tmaps_masked_thresh_img = [image.threshold_img(i, threshold) for i in copy.deepcopy(tmaps_masked_img)]

logging.info('plotting thresholded tmaps with anatomical as background:')
for i, path in enumerate(tmaps_masked_thresh_img):
    path_save = opj(path_out_figs, '%s_run-%02d_tmap_masked_thresh.png' % (sub, i+1))
    logging.info('plotting masked tmap %s (%d of %d)'
                 % (path_save, i + 1, len(tmaps_masked_thresh_img)))
    plotting.plot_roi(path, anat, title=os.path.basename(path_save),
                      output_file=path_save, colorbar=True)

# extract data from the thresholded images
tmaps_masked_thresh = [image.load_img(i).get_data().astype(float) for i in tmaps_masked_thresh_img]

# calculate the number of tmap voxels:
num_tmap_voxel = [np.size(i) for i in copy.deepcopy(tmaps_masked_thresh)]
logging.info('number of feature selected voxels: %s' % pformat(num_tmap_voxel))

num_above_voxel = [np.count_nonzero(i) for i in copy.deepcopy(tmaps_masked_thresh)]
logging.info('number of voxels above threshold: %s' % pformat(num_above_voxel))

num_below_voxel = [np.count_nonzero(i == 0) for i in copy.deepcopy(tmaps_masked_thresh)]
logging.info('number of voxels below threshold: %s' % pformat(num_below_voxel))

# plot the distribution of t-values:
for i, run_mask in enumerate(tmaps_masked_thresh):
    masked_tmap_flat = run_mask.flatten()
    masked_tmap_flat = masked_tmap_flat[~np.isnan(masked_tmap_flat)]
    masked_tmap_flat = masked_tmap_flat[~np.isnan(masked_tmap_flat) & ~(masked_tmap_flat == 0)]
    path_save = opj(path_out_figs, '%s_run-%02d_tvalue_distribution.png' % (sub, i+1))
    logging.info('plotting thresholded t-value distribution %s (%d of %d)'
                 % (path_save, i+1, len(tmaps_masked_thresh)))
    fig = plt.figure()
    plt.hist(masked_tmap_flat, bins='auto')
    plt.xlabel('t-values')
    plt.ylabel('number')
    plt.title('t-value distribution (%s, run-%02d)' % (sub, i+1))
    plt.savefig(path_save)

# create a dataframe with the number of voxels
df_thresh = pd.DataFrame({
    'id': [sub] * n_run,
    'run': np.arange(1,n_run+1),
    'n_total': num_tmap_voxel,
    'n_above': num_above_voxel,
    'n_below': num_below_voxel
})
file_name = opj(path_out_data, '%s_thresholding.csv' % (sub))
df_thresh.to_csv(file_name, sep=',', index=False)

'''
========================================================================
FEATURE SELECTION: BINARIZE THE THRESHOLDED MASKED T-MAPS
We set all voxels that survived thresholding (i.e., voxels with
t-values above the threshold) to 1 and all other voxels to 0.
========================================================================
'''
# replace all NaNs with 0:
tmaps_masked_thresh_bin = [np.where(np.isnan(i), 0, i) for i in copy.deepcopy(tmaps_masked_thresh)]
# replace all other values with 1:
tmaps_masked_thresh_bin = [np.where(i > 0, 1, i) for i in copy.deepcopy(tmaps_masked_thresh_bin)]
# turn the 3D-array into booleans:
tmaps_masked_thresh_bin = [i.astype(bool) for i in copy.deepcopy(tmaps_masked_thresh_bin)]
# create image like object:
masks_final = [image.new_img_like(path_func_task[0], i.astype(np.int)) for i in copy.deepcopy(tmaps_masked_thresh_bin)]

logging.info('plotting final masks with anatomical as background:')
for i, path in enumerate(masks_final):
    filename = '%s_run-%02d_tmap_masked_thresh.nii.gz'\
               % (sub, i + 1)
    path_save = opj(path_out_masks, filename)
    logging.info('saving final mask %s (%d of %d)'
                 % (path_save, i+1, len(masks_final)))
    path.to_filename(path_save)
    path_save = opj(path_out_figs, '%s_run-%02d_visual_final_mask.png'
                    % (sub, i + 1))
    logging.info('plotting final mask %s (%d of %d)'
                 % (path_save, i + 1, len(masks_final)))
    plotting.plot_roi(path, anat, title=os.path.basename(path_save),
                      output_file=path_save, colorbar=True)
'''
========================================================================
LOAD SMOOTHED FMRI DATA FOR ALL FUNCTIONAL TASK RUNS:
========================================================================
'''
# load smoothed functional mri data for all eight task runs:
logging.info('loading %d functional task runs ...' % len(path_func_task))
data_task = [image.load_img(i) for i in path_func_task]
logging.info('loading successful!')
'''
========================================================================
DEFINE THE CLASSIFIERS
========================================================================
'''
class_labels = ['cat', 'chair', 'face', 'house', 'shoe']
# create a dictionary with all values as independent instances:
# see here: https://bit.ly/2J1DvZm
clf_set = {key: LogisticRegression(
    C=1., penalty='l2', multi_class='ovr', solver='lbfgs',
    max_iter=4000, class_weight='balanced', random_state=42) for key in class_labels}
classifiers = {
    'log_reg': LogisticRegression(
        C=1., penalty='l2', multi_class='multinomial', solver='lbfgs',
        max_iter=4000, class_weight='balanced', random_state=42)}
clf_set.update(classifiers)
'''
========================================================================
1. SPLIT THE EVENTS DATAFRAME FOR EACH TASK CONDITION
2. RESET THE INDICES OF THE DATAFRAMES
3. SORT THE ROWS OF ALL DATAFRAMES IN CHRONOLOGICAL ORDER
4. PRINT THE NUMBER OF TRIALS OF EACH TASK CONDITION
========================================================================
'''


class TaskData:
    """

    """
    def __init__(
            self, events, condition, name, trial_type, bold_delay=0,
            interval=1, t_tr=1.25, num_vol_run=530):
        import pandas as pd
        # define name of the task data subset:
        self.name = name
        # define the task condition the task data subset is from:
        self.condition = condition
        # define the delay (in seconds) by which onsets are moved:
        self.bold_delay = bold_delay
        # define the number of TRs from event onset that should be selected:
        self.interval = interval
        # define the repetition time (TR) of the mri data acquisition:
        self.t_tr = t_tr
        # define the number of volumes per task run:
        self.num_vol_run = num_vol_run
        # select events: upright stimulus, correct answer trials only:
        if trial_type == 'stimulus':
            self.events = events.loc[
                          (events['condition'] == condition) &
                          (events['trial_type'] == trial_type) &
                          (events['stim_orient'] == 0) &
                          (events['serial_position'] == 1) &
                          (events['accuracy'] != 0),
                          :]
        elif trial_type == 'cue':
            self.events = events.loc[
                          (events['condition'] == condition) &
                          (events['trial_type'] == trial_type),
                          :]
        # reset the indices of the data frame:
        self.events = self.events.reset_index()
        # sort all values by session and run:
        self.events = self.events.sort_values(by=['session', 'run_session'])
        # call further function upon initialization:
        self.define_trs()
        self.get_stats()

    def define_trs(self):
        # import relevant functions:
        import numpy as np
        # select all events onsets:
        self.event_onsets = self.events['onset']
        # add the selected delay to account for the peak of the hrf
        self.bold_peaks = self.events['onset'] + self.bold_delay
        # divide the expected time-point of bold peaks by the repetition time:
        self.peak_trs = self.bold_peaks / self.t_tr
        # add the number of run volumes to the tr indices:
        run_volumes = (self.events['run_study']-1) * self.num_vol_run
        # add the number of volumes of each run:
        trs = round(self.peak_trs + run_volumes)
        # copy the relevant trs as often as specified by the interval:
        a = np.transpose(np.tile(trs, (self.interval, 1)))
        # create same-sized matrix with trs to be added:
        b = np.full((len(trs), self.interval), np.arange(self.interval))
        # assign the final list of trs:
        self.trs = np.array(np.add(a, b).flatten(), dtype=int)
        # save the TRs of the stimulus presentations
        self.stim_trs = round(self.event_onsets / self.t_tr + run_volumes)

    def get_stats(self):
        import numpy as np
        self.num_trials = len(self.events)
        self.runs = np.repeat(np.array(self.events['run_study'], dtype=int), self.interval)
        self.trials = np.repeat(np.array(self.events['trial'], dtype=int), self.interval)
        self.sess = np.repeat(np.array(self.events['session'], dtype=int), self.interval)
        self.stim = np.repeat(np.array(self.events['stim_label'], dtype=object), self.interval)
        self.itis = np.repeat(np.array(self.events['interval_time'], dtype=float), self.interval)
        self.stim_trs = np.repeat(np.array(self.stim_trs, dtype=int), self.interval)

    def zscore(self, signals, run_list, t_tr=1.25):
        from nilearn.signal import clean
        import numpy as np
        # get boolean indices for all run indices in the run list:
        run_indices = np.isin(self.runs, list(run_list))
        # standardize the data of all runs in the run list:
        self.data_zscored = clean(
            signals=signals[self.trs[run_indices]],
            sessions=self.runs[run_indices],
            t_r=t_tr,
            detrend=False,
            standardize=True)

    def predict(self, clf, run_list):
        # import packages:
        import pandas as pd
        # get classifier class predictions:
        pred_class = clf.predict(self.data_zscored)
        # get classifier probabilistic predictions:
        pred_proba = clf.predict_proba(self.data_zscored)
        # get the classes of the classifier:
        classes_names = clf.classes_
        # get boolean indices for all run indices in the run list:
        run_indices = np.isin(self.runs, list(run_list))
        # create a dataframe with the probabilities of each class:
        df = pd.DataFrame(pred_proba, columns=classes_names)
        # get the number of predictions made:
        num_pred = len(df)
        # get the number of trials in the test set:
        num_trials = int(num_pred / self.interval)
        # add the predicted class label to the dataframe:
        df['pred_label'] = pred_class
        # add the true stimulus label to the dataframe:
        df['stim'] = self.stim[run_indices]
        # add the volume number (TR) to the dataframe:
        df['tr'] = self.trs[run_indices]
        # add the sequential TR to the dataframe:
        df['seq_tr'] = np.tile(np.arange(1, self.interval + 1), num_trials)
        # add the counter of trs on which the stimulus was presented
        df['stim_tr'] = self.stim_trs[run_indices]
        # add the trial number to the dataframe:
        df['trial'] = self.trials[run_indices]
        # add the run number to the dataframe:
        df['run_study'] = self.runs[run_indices]
        # add the session number to the dataframe:
        df['session'] = self.sess[run_indices]
        # add the inter trial interval to the dataframe:
        df['tITI'] = self.itis[run_indices]
        # add the participant id to the dataframe:
        df['id'] = np.repeat(self.events['subject'].unique(), num_pred)
        # add the name of the classifier to the dataframe:
        df['test_set'] = np.repeat(self.name, num_pred)
        # return the dataframe:
        return df


train_odd_peak = TaskData(
        events=df_events, condition='oddball', trial_type='stimulus',
        bold_delay=4, interval=1, name='train-odd_peak')
test_odd_peak = TaskData(
        events=df_events, condition='oddball', trial_type='stimulus',
        bold_delay=4, interval=1, name='test-odd_peak')
test_odd_long = TaskData(
        events=df_events, condition='oddball', trial_type='stimulus',
        bold_delay=0, interval=7, name='test-odd_long')
test_seq_long = TaskData(
        events=df_events, condition='sequence', trial_type='stimulus',
        bold_delay=0, interval=13, name='test-seq_long')
test_rep_long = TaskData(
        events=df_events, condition='repetition', trial_type='stimulus',
        bold_delay=0, interval=13, name='test-rep_long')
test_seq_cue = TaskData(
        events=df_events, condition='sequence', trial_type='cue',
        bold_delay=0, interval=5, name='test-seq_cue')
test_rep_cue = TaskData(
        events=df_events, condition='repetition', trial_type='cue',
        bold_delay=0, interval=5, name='test-rep_cue')
test_sets = [
        test_odd_peak, test_odd_long, test_seq_long, test_rep_long,
        test_seq_cue, test_rep_cue
        ]
'''
========================================================================
SEPARATE RUN-WISE STANDARDIZATION (Z-SCORING) OF ALL TASK CONDITIONS
Standardize features by removing the mean and scaling to unit variance.
Centering and scaling happen independently on each feature by computing
the relevant statistics on the samples in the training set. Here,
we standardize the features of all trials run-wise but separated for
each task condition (oddball, sequence, and repetition condition).
========================================================================
'''


def detrend(data, t_tr=1.25):
    from nilearn.signal import clean
    data_detrend = clean(
        signals=data, t_r=t_tr, detrend=True, standardize=False)
    return data_detrend


def show_weights(array):
    # https://stackoverflow.com/a/50154388
    import numpy as np
    import seaborn as sns
    n_samples = array.shape[0]
    classes, bins = np.unique(array, return_counts=True)
    n_classes = len(classes)
    weights = n_samples / (n_classes * bins)
    sns.barplot(classes, weights)
    plt.xlabel('class label')
    plt.ylabel('weight')
    plt.show()


def melt_df(df, melt_columns):
    # save the column names of the dataframe in a list:
    column_names = df.columns.tolist()
    # remove the stimulus classes from the column names;
    id_vars = [x for x in column_names if x not in melt_columns]
    # melt the dataframe creating one column with value_name and var_name:
    df_melt = pd.melt(
            df, value_name='probability', var_name='class', id_vars=id_vars)
    # return the melted dataframe:
    return df_melt


data_list = []
runs = list(range(1, n_run+1))
#mask_label = 'cv'

logging.info('starting leave-one-run-out cross-validation')
for mask_label in ['cv', 'cv_hpc']:
    logging.info('testing in mask %s' % (mask_label))
    for run in runs:
        logging.info('testing on run %d of %d ...' % (run, len(runs)))
        # define the run indices for the training and test set:
        train_runs = [x for x in runs if x != run]
        test_runs = [x for x in runs if x == run]
        # get the feature selection mask of the current run:
        if mask_label == 'cv':
            mask_run = masks_final[runs.index(run)]
        elif mask_label == 'cv_hpc':
            mask_run = path_mask_hpc_task[runs.index(run)]
        # extract smoothed fMRI data from the mask for the cross-validation fold:
        masked_data = [masking.apply_mask(i, mask_run) for i in data_task]
        # detrend the masked fMRI data separately for each run:
        data_detrend = [detrend(i) for i in masked_data]
        # combine the detrended data of all runs:
        data_detrend = np.vstack(data_detrend)
        # loop through all classifiers in the classifier set:
        for clf_name, clf in clf_set.items():
            # print classifier:
            logging.info('classifier: %s' % clf_name)
            # standardize the training data of the current cross-validation fold:
            train_odd_peak.zscore(signals=data_detrend, run_list=train_runs)
            # get the example labels:
            train_stim = copy.deepcopy(train_odd_peak.stim[train_odd_peak.runs != run])
            # replace labels for single-label classifiers:
            if clf_name in class_labels:
                # replace all other labels with other
                train_stim = ['other' if x != clf_name else x for x in train_stim]
                # turn into a numpy array
                train_stim = np.array(train_stim, dtype=object)
            # check weights:
            #show_weights(array=train_stim)
            # train the classifier
            clf.fit(train_odd_peak.data_zscored, train_stim)
            # classifier prediction: predict on test data and save the data:
            for test_set in test_sets:
                logging.info('testing on test set %s' % test_set.name)
                test_set.zscore(signals=data_detrend, run_list=test_runs)
                # skip the test set if it contains no data:
                if test_set.data_zscored.size == 0:
                    continue
                # create dataframe containing classifier predictions:
                df_pred = test_set.predict(clf=clf, run_list=test_runs)
                # add the current classifier as a new column:
                df_pred['classifier'] = np.repeat(clf_name, len(df_pred))
                # add a label that indicates the mask / training regime:
                df_pred['mask'] = np.repeat(mask_label, len(df_pred))
                # melt the data frame:
                df_pred_melt = melt_df(df=df_pred, melt_columns=train_stim)
                # append dataframe to list of dataframe results:
                data_list.append(df_pred_melt)
'''
========================================================================
DECODING ON RESTING STATE DATA:
========================================================================
'''
logging.info('Loading fMRI data of %d resting state runs ...' % len(path_rest))
data_rest = [image.load_img(i) for i in path_rest]
logging.info('loading successful!')

# combine all masks from the feature selection by intersection:
def multimultiply(arrays):
    import copy
    # start with the first array:
    array_union = copy.deepcopy(arrays[0].astype(np.int))
    # loop through all arrays:
    for i in range(len(arrays)):
        # multiply the running product with the current array:
        array_union = np.multiply(array_union, copy.deepcopy(arrays[i].astype(np.int)))
    # return the intersection (element-wise product) of all arrays:
    return(array_union)


for mask_label in ['union', 'union_hpc']:
    # calculate the intersection of all masks by multiplication:
    if mask_label == 'union':
        masks_union = multimultiply(tmaps_masked_thresh_bin).astype(int).astype(bool)
    elif mask_label == 'union_hpc':
        masks_union = multimultiply(mask_hpc).astype(int).astype(bool)
    # masked tmap into image like object:
    masks_union_nii = image.new_img_like(path_func_task[0], masks_union)
    path_save = opj(path_out_masks, '{}_task-rest_mask-{}.nii.gz'.format(sub, mask_label))
    masks_union_nii.to_filename(path_save)
    # mask all resting state runs with the intersected feature selection masks:
    data_rest_masked = [masking.apply_mask(i, masks_union_nii) for i in data_rest]
    # detrend and standardize each resting state run separately:
    data_rest_final = [clean(i, detrend=True, standardize=True) for i in data_rest_masked]
    # mask all functional task runs separately:
    data_task_masked = [masking.apply_mask(i, masks_union_nii) for i in data_task]
    # detrend each task run separately:
    data_task_masked_detrend = [clean(i, detrend=True, standardize=False) for i in data_task_masked]
    # combine the detrended data of all runs:
    data_task_masked_detrend = np.vstack(data_task_masked_detrend)
    # select only oddball data and standardize:
    train_odd_peak.zscore(signals=data_task_masked_detrend, run_list=runs)
    # write session and run labels:
    ses_labels = [i.split(sub + "_")[1].split("_task")[0] for i in path_rest]
    run_labels = [i.split("prenorm_")[1].split("_space")[0] for i in path_rest]
    file_names = ['_'.join([a, b]) for (a, b) in zip(ses_labels, run_labels)]
    rest_interval = 1
    # save the voxel patterns:
    num_voxels = len(train_odd_peak.data_zscored[0])
    voxel_labels = ['v' + str(x) for x in range(num_voxels)]
    df_patterns = pd.DataFrame(train_odd_peak.data_zscored, columns=voxel_labels)
    # add the stimulus labels to the dataframe:
    df_patterns['label'] = copy.deepcopy(train_odd_peak.stim)
    # add the participant id to the dataframe:
    df_patterns['id'] = np.repeat(df_events['subject'].unique(), len(train_odd_peak.stim))
    # add the mask label:
    df_patterns['mask'] = np.repeat(mask_label, len(train_odd_peak.stim))
    # split pattern dataframe by label:
    df_pattern_list = [df_patterns[df_patterns['label'] == i] for i in df_patterns['label'].unique()]
    # create file path to save the dataframe:
    file_paths = [opj(path_out_data, '{}_voxel_patterns_{}_{}'.format(sub, mask_label, i)) for i in df_patterns['label'].unique()]
    # save the final dataframe to a .csv-file:
    [i.to_csv(j + '.csv', sep=',', index=False) for (i, j) in zip(df_pattern_list, file_paths)]
    # save only the voxel patterns as niimg-like objects
    [masking.unmask(X=i.loc[:, voxel_labels].to_numpy(), mask_img=masks_union_nii).to_filename(j + '.nii.gz') for (i, j) in zip(df_pattern_list, file_paths)]
    #[image.new_img_like(path_func_task[0], i.loc[:, voxel_labels].to_numpy()).to_filename(j + '.nii.gz') for (i, j) in zip(df_pattern_list, file_paths)]
    # decoding on resting state data:
    for clf_name, clf in clf_set.items():
        # print classifier name:
        logging.info('classifier: %s' % clf_name)
        # get the example labels for all slow trials:
        train_stim = copy.deepcopy(train_odd_peak.stim)
        # replace labels for single-label classifiers:
        if clf_name in class_labels:
            # replace all other labels with other
            train_stim = ['other' if x != clf_name else x for x in train_stim]
            # turn into a numpy array
            train_stim = np.array(train_stim, dtype=object)
        # train the classifier
        clf.fit(train_odd_peak.data_zscored, train_stim)
        # classifier prediction: predict on test data and save the data:
        pred_rest_class = [clf.predict(i) for i in data_rest_final]
        pred_rest_proba = [clf.predict_proba(i) for i in data_rest_final]
        # get the class names of the classifier:
        classes_names = clf.classes_
        # save classifier predictions on resting state scans
        for t, name in enumerate(pred_rest_proba):
            # create a dataframe with the probabilities of each class:
            df_rest_pred = pd.DataFrame(
                    pred_rest_proba[t], columns=classes_names)
            # get the number of predictions made:
            num_pred = len(df_rest_pred)
            # get the number of trials in the test set:
            num_trials = int(num_pred / rest_interval)
            # add the predicted class label to the dataframe:
            df_rest_pred['pred_label'] = pred_rest_class[t]
            # add the true stimulus label to the dataframe:
            df_rest_pred['stim'] = np.full(num_pred, np.nan)
            # add the volume number (TR) to the dataframe:
            df_rest_pred['tr'] = np.arange(1, num_pred + 1)
            # add the sequential TR to the dataframe:
            df_rest_pred['seq_tr'] = np.arange(1, num_pred + 1)
            # add the trial number to the dataframe:
            df_rest_pred['trial'] = np.tile(
                    np.arange(1, rest_interval + 1), num_trials)
            # add the run number to the dataframe:
            df_rest_pred['run_study'] = np.repeat(run_labels[t], num_pred)
            # add the session number to the dataframe:
            df_rest_pred['session'] = np.repeat(ses_labels[t], len(df_rest_pred))
            # add the inter trial interval to the dataframe:
            df_rest_pred['tITI'] = np.tile('rest', num_pred)
            # add the participant id to the dataframe:
            df_rest_pred['id'] = np.repeat(df_events['subject'].unique(), num_pred)
            # add the name of the test set to the dataframe:
            df_rest_pred['test_set'] = np.repeat('rest', num_pred)
            # add the name of the classifier to the dataframe:
            df_rest_pred['classifier'] = np.repeat(clf_name, num_pred)
            # add a label that indicates the mask / training regime:
            df_rest_pred['mask'] = np.repeat(mask_label, len(df_rest_pred))
            # melt the data frame:
            df_pred_melt = melt_df(df=df_rest_pred, melt_columns=train_stim)
            # append dataframe to list of dataframe results:
            data_list.append(df_pred_melt)
        # run classifier trained on all runs across all test sets:
        for test_set in test_sets:
            logging.info('testing on test set %s' % test_set.name)
            test_set.zscore(signals=data_task_masked_detrend, run_list=runs)
            # skip the test set if it contains no data:
            if test_set.data_zscored.size == 0:
                continue
            # create dataframe containing classifier predictions:
            df_pred = test_set.predict(clf=clf, run_list=runs)
            # add the current classifier as a new column:
            df_pred['classifier'] = np.repeat(clf_name, len(df_pred))
            # add a label that indicates the mask / training regime:
            df_pred['mask'] = np.repeat(mask_label, len(df_pred))
            # melt the data frame:
            df_pred_melt = melt_df(df=df_pred, melt_columns=train_stim)
            # append dataframe to list of dataframe results:
            data_list.append(df_pred_melt)

# concatenate all decoding dataframes in one final dataframe:
df_all = pd.concat(data_list, sort=False)
# create file path to save the dataframe:
file_path = opj(path_out_data, '{}_decoding.csv'.format(sub))
# save the final dataframe to a .csv-file:
df_all.to_csv(file_path, sep=',', index=False)

'''
========================================================================
STOP LOGGING:
========================================================================
'''
end = time.time()
total_time = (end - start)/60
logging.info('total running time: %0.2f minutes' % total_time)
logging.shutdown()
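
For reference, the script reads the participant number from its first command-line argument (sys.argv[1], see above) and constructs the participant label as sub-<argument>. Assuming the script is saved as highspeed_decoding.py (the file name referenced by the cluster script in the next section) and is run from within the highspeed-decoding project directory, decoding for a single participant could be started as follows:

# run the decoding pipeline for participant sub-01:
python3 highspeed_decoding.py 01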

3.6.3 Parallelization on HPC

The decoding pipeline was parallelized across participants on the local high-performance computing cluster (Tardis) of the Max Planck Institute for Human Development, Berlin, Germany, using the following code:

#!/usr/bin/bash
# ==============================================================================
# SCRIPT INFORMATION:
# ==============================================================================
# SCRIPT: PARALLELIZE THE DECODING SCRIPT ON THE MPIB CLUSTER (TARDIS)
# PROJECT: HIGHSPEED
# WRITTEN BY LENNART WITTKUHN, 2018 - 2020
# CONTACT: WITTKUHN AT MPIB HYPHEN BERLIN DOT MPG DOT DE
# MAX PLANCK RESEARCH GROUP NEUROCODE
# MAX PLANCK INSTITUTE FOR HUMAN DEVELOPMENT (MPIB)
# MAX PLANCK UCL CENTRE FOR COMPUTATIONAL PSYCHIATRY AND AGEING RESEARCH
# LENTZEALLEE 94, 14195 BERLIN, GERMANY
# ==============================================================================
# DEFINE ALL PATHS:
# ==============================================================================
# define the name of the current task:
TASK_NAME="highspeed-decoding"
# define the name of the project:
PROJECT_NAME="highspeed"
# path to the home directory:
PATH_HOME=/home/mpib/wittkuhn
# path to the current code folder:
PATH_CODE=${PATH_HOME}/${PROJECT_NAME}/${TASK_NAME}/code/decoding
# cd into the directory of the current script:
cd ${PATH_CODE}
# path to the current script:
PATH_SCRIPT=${PATH_CODE}/highspeed_decoding.py
# path to the output directory:
PATH_OUT=${PATH_HOME}/${PROJECT_NAME}/${TASK_NAME}/decoding
# path to the working directory:
PATH_WORK=${PATH_HOME}/${PROJECT_NAME}/${TASK_NAME}/work
# path to the log directory:
PATH_LOG=${PATH_HOME}/${PROJECT_NAME}/${TASK_NAME}/logs/$(date '+%Y%m%d_%H%M%S')
# ==============================================================================
# CREATE RELEVANT DIRECTORIES:
# ==============================================================================
# create output directory:
if [ ! -d ${PATH_OUT} ]; then
    mkdir -p ${PATH_OUT}
fi
# create working directory:
if [ ! -d ${PATH_WORK} ]; then
    mkdir -p ${PATH_WORK}
fi
# create directory for log files:
if [ ! -d ${PATH_LOG} ]; then
    mkdir -p ${PATH_LOG}
fi
# ==============================================================================
# DEFINE PARAMETERS:
# ==============================================================================
# maximum number of cpus per process:
N_CPUS=1
# memory demand in *GB*
MEM=20GB
# user-defined subject list
PARTICIPANTS=$1
# ==============================================================================
# RUN THE DECODING:
# ==============================================================================
# initialize a counter for the subjects:
SUB_COUNT=0
for i in {1..40}; do
    # turn the subject id into a zero-padded number:
    SUB=$(printf "%02d\n" ${i})
    # start slurm job file:
    echo "#!/bin/bash" > job
    # name of the job
    echo "#SBATCH --job-name highspeed-decoding-sub-${SUB}" >> job
    # set the expected maximum running time for the job:
    echo "#SBATCH --time 12:00:00" >> job
    # determine how much RAM your operation needs:
    echo "#SBATCH --mem ${MEM}" >> job
    # request multiple cpus:
    echo "#SBATCH --cpus-per-task ${N_CPUS}" >> job
    # write (output) log to log folder:
    echo "#SBATCH --output ${PATH_LOG}/highspeed-decoding-sub-${SUB}-%j.out" >> job
    # email notifications (disabled here by setting the mail type to NONE):
    echo "#SBATCH --mail-type NONE" >> job
    # start in the current directory:
    echo "#SBATCH --workdir ." >> job
    # activate virtual environment on cluster:
    echo "source /etc/bash_completion.d/virtualenvwrapper" >> job
    echo "workon highspeed-decoding" >> job
    # define the main command:
    echo "python3 ${PATH_SCRIPT} ${SUB}" >> job
    # submit job to cluster queue and remove it to avoid confusion:
    sbatch job
    rm -f job
done

3.6.4 Software: Required packages

The requirements.txt file lists the required packages, which can be installed, e.g., using pip install -r requirements.txt (see the example after the package list):

annexremote==1.4.4
appdirs==1.4.4
attrs==20.3.0
boto==2.49.0
certifi==2020.11.8
cffi==1.14.3
chardet==3.0.4
Counter==1.0.0
cryptography==3.2.1
cycler==0.10.0
datalad==0.13.5
Deprecated==1.2.10
distro==1.5.0
fasteners==0.15
future==0.18.2
humanize==3.1.0
idna==2.10
iniconfig==1.1.1
iso8601==0.1.13
jeepney==0.6.0
joblib==0.17.0
jsmin==2.2.2
keyring==21.5.0
keyrings.alt==4.0.1
kiwisolver==1.3.1
matplotlib==3.3.3
monotonic==1.5
msgpack==1.0.0
nibabel==3.2.0
nilearn==0.7.0
numpy==1.19.4
packaging==20.4
pandas==1.1.4
patool==1.12
Pillow==8.0.1
pkg-resources==0.0.0
pluggy==0.13.1
py==1.9.0
pycparser==2.20
PyGithub==1.53
PyJWT==1.7.1
pyparsing==2.4.7
pytest==6.1.2
python-dateutil==2.8.1
pytz==2020.4
PyYAML==5.3.1
requests==2.25.0
scikit-learn==0.23.2
scipy==1.5.4
SecretStorage==3.2.0
simplejson==3.17.2
six==1.15.0
sklearn==0.0
threadpoolctl==2.1.0
toml==0.10.2
tqdm==4.53.0
urllib3==1.26.2
Whoosh==2.7.4
wrapt==1.12.1
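
On the cluster, the job script above activates a named virtual environment via workon highspeed-decoding. As a minimal sketch (assuming virtualenvwrapper is set up; mkvirtualenv activates the newly created environment automatically), such an environment could be created and populated from the repository root:

# create (and activate) a named virtual environment with virtualenvwrapper:
mkvirtualenv highspeed-decoding
# install the pinned package versions listed in requirements.txt:
pip install -r requirements.txt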