Video diffusion transformers (DiTs) have advanced video generation, yet they still struggle to model
multi-instance or subject-object interactions. This raises a key question: How
do these models internally represent interactions? To answer this, we curate
MATRIX-11K, a video dataset with interaction-aware captions and multi-instance
mask tracks. Using this dataset, we conduct a systematic analysis that formalizes
two perspectives of video DiTs: semantic grounding, via video-to-text attention,
which evaluates whether noun and verb tokens capture instances and their re-
lations; and semantic propagation, via video-to-video attention, which assesses
whether instance bindings persist across frames. We find that both effects concentrate
in a small subset of interaction-dominant layers. Motivated by this, we introduce
MATRIX, a simple and effective regularization that aligns attention in specific
layers of video DiTs with multi-instance mask tracks from the MATRIX-11K
dataset, enhancing both grounding and propagation. We further propose InterGenEval, an evaluation protocol for interaction-aware video generation. In experiments, MATRIX improves both interaction fidelity and semantic alignment
while reducing drift and hallucination. Extensive ablations validate our design
choices. Code and weights will be released.
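
For concreteness, below is a minimal PyTorch-style sketch of the attention-alignment regularization described above. The tensor shapes, the helper name `matrix_attention_loss`, and the choice of a cross-entropy objective are illustrative assumptions rather than the paper's exact formulation.

```python
import torch

def matrix_attention_loss(attn, inst_masks, eps=1e-6):
    """Hypothetical sketch: align video-to-text attention in one
    interaction-dominant DiT layer with multi-instance mask tracks.

    attn:       (B, H, L_vid, N) attention weights from video tokens to the
                N text tokens (e.g., nouns/verbs) associated with N instances;
                assumed to be softmax-normalized over the last dimension.
    inst_masks: (B, L_vid, N) binary mask tracks flattened over frames and
                spatial positions; inst_masks[b, p, i] == 1 if video token p
                belongs to instance i.
    """
    attn = attn.mean(dim=1)                                         # pool heads -> (B, L_vid, N)
    target = inst_masks / (inst_masks.sum(-1, keepdim=True) + eps)  # masks -> distributions
    # Cross-entropy between the mask-derived target and the attention map;
    # an analogous term on video-to-video attention would encourage
    # instance bindings to propagate across frames.
    return -(target * (attn + eps).log()).sum(-1).mean()
```

Presumably a term of this form is added, with some weight, to the standard diffusion training objective, and only at the identified interaction-dominant layers.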