Not directly, no. The problem is that the link between the two images can be defined in multiple ways. Many of these have no unit (e.g. comparing T2W vs T1W images in MRI), whereas others do (e.g. time points in dynamic contrast enhanced series, b-values in DWI).
However, features can be extracted from multiple input images using the same mask (be sure to enable correctMask when not resampling). Each feature then has one value per input image, and these values can be combined into a single predictive model. This is currently how I use multimodal input in radiomics research.
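
A minimal sketch of that workflow, assuming PyRadiomics is installed and all images are co-registered with the mask. The helper names (extract_per_image, combine_features), the modality labels, and the file paths are hypothetical and only for illustration; prefixing feature names with the label is one possible way to keep the per-image values apart, not the only one.

```python
from collections import OrderedDict


def extract_per_image(image_paths, mask_path):
    """Run PyRadiomics once per input image, reusing the same mask.

    image_paths: dict mapping a label (e.g. "T1W") to an image file path.
    """
    # Imported here so combine_features can be used without PyRadiomics.
    from radiomics import featureextractor

    # correctMask=True asks PyRadiomics to resample the mask onto the image
    # grid when the geometries do not match and resampling is not enabled.
    extractor = featureextractor.RadiomicsFeatureExtractor(correctMask=True)
    return {label: extractor.execute(path, mask_path)
            for label, path in image_paths.items()}


def combine_features(per_image):
    """Flatten {label: {feature: value}} into one row with prefixed names."""
    row = OrderedDict()
    for label, features in per_image.items():
        for name, value in features.items():
            if name.startswith("diagnostics_"):
                continue  # skip provenance entries, keep feature values only
            row[f"{label}_{name}"] = value
    return row


# Example call (hypothetical paths):
# per_image = extract_per_image({"T1W": "t1.nrrd", "T2W": "t2.nrrd"}, "mask.nrrd")
# row = combine_features(per_image)  # one row per case for the model
```

Each case then yields a single row in which every feature appears once per input image, which can be fed to whatever modelling approach you prefer.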