Decision Flow Tracing and Word Impact Analysis in Hybrid Transformer-Conditioned Diffusion Models for Text-to-Image Generation
- Title
- Decision Flow Tracing and Word Impact Analysis in Hybrid Transformer-Conditioned Diffusion Models for Text-to-Image Generation
- Creator
- Thomas, Aldrin P.; George, Shiju; Raj, N. Anand; Ajaz, S. Mohemmed; Shaju, Midhun; Nasim, V. Akil
- Description
- Text-to-image diffusion models have become a cornerstone of modern generative AI, offering high-quality synthesis yet remaining constrained by their black-box nature, which limits controllability and interpretability. In this work, we propose a hybrid transformer-conditioned diffusion model that integrates UNet-based denoising with multi-head cross-attention transformer blocks at critical latent stages of the diffusion process. The architecture is trained on a curated set of 50,000 samples from DiffusionDB with a 200-step latent diffusion schedule. Text prompts are encoded using a 16-token BERT encoder and mapped into a 256-dimensional latent feature space. Cross-attention layers with eight heads are interlaced within the UNet bottleneck and decoder, enabling token-to-region correspondence and fine-grained semantic propagation. To ensure interpretability, we design an explainability framework that combines hierarchical token-level attention heat maps, temporal attention rollouts, and perceptual ablation studies based on learned image patch similarity. Analysis reveals that object tokens remain spatially and temporally consistent, while attribute tokens demonstrate sharper temporal volatility. Jensen-Shannon divergence quantifies this redistribution of attention across diffusion steps. Experimental evaluation against a standard UNet diffusion baseline demonstrates clear improvements: Fréchet Inception Distance decreases by 19.6, CLIP alignment score increases by 5.4, and Inception Score improves by 18.6. Moreover, attention coherence improves by 22%, underscoring the gains in explainability. The proposed framework establishes a pathway toward accountable, high-fidelity, and interpretable text-to-image synthesis. Beyond performance, it supports critical tasks such as bias evaluation, fairness auditing, and quality assurance, offering a robust foundation for the next generation of explainable generative AI systems.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2026.
- Source
- Lecture Notes in Networks and Systems; Volume 1927 LNNS; pp. 163-174
- Date
- 01-01-2026
- Publisher
- Springer Science and Business Media Deutschland GmbH
- Subject
- Cross-Attention; DiffusionDB; Hybrid Transformer Diffusion; Interpretable Generative Modeling; Prompt Engineering; Semantic Propagation
- Coverage
- Thomas A.P., AI and Data Science Engineering, Christ University, Karnataka, Bangalore, India; George S., AI and Data Science Engineering, Christ University, Karnataka, Bangalore, India; Raj N.A., AI and Data Science Engineering, Christ University, Karnataka, Bangalore, India; Ajaz S.M., AI and Data Science Engineering, Christ University, Karnataka, Bangalore, India; Shaju M., AI and Data Science Engineering, Christ University, Karnataka, Bangalore, India; Nasim V.A., AI and Data Science Engineering, Christ University, Karnataka, Bangalore, India
- Rights
- Restricted Access; Hardcopy may be available in the library
- Relation
- ISSN: 2367-3370; ISBN: 978-303222913-7
- Format
- online
- Language
- English
- Type
- Conference paper
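
The description above states that Jensen-Shannon divergence quantifies how attention is redistributed across diffusion steps. The following is a minimal sketch of that idea, not the authors' implementation: it assumes each token's cross-attention map at a given step has been flattened into a list of non-negative weights, and it compares consecutive steps. The function names (`js_divergence`, `attention_drift`) and the toy 2x2 maps are hypothetical and chosen only for illustration.

```python
import math


def normalize(weights):
    """Scale non-negative attention weights so they sum to 1."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("attention weights must have a positive sum")
    return [w / total for w in weights]


def kl_divergence(p, q):
    """KL(p || q) in bits; terms where p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits (0 = identical, 1 = disjoint)."""
    p, q = normalize(p), normalize(q)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # mixture distribution
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def attention_drift(maps_per_step):
    """JSD between each pair of consecutive steps for one token."""
    return [js_divergence(a, b) for a, b in zip(maps_per_step, maps_per_step[1:])]


# Hypothetical flattened 2x2 attention maps for one token over three steps.
# These numbers are illustrative only and are not taken from the paper.
object_token = [
    [0.70, 0.10, 0.10, 0.10],
    [0.68, 0.12, 0.10, 0.10],
    [0.70, 0.10, 0.10, 0.10],
]
attribute_token = [
    [0.70, 0.10, 0.10, 0.10],
    [0.10, 0.70, 0.10, 0.10],
    [0.10, 0.10, 0.10, 0.70],
]

print("object drift:   ", attention_drift(object_token))
print("attribute drift:", attention_drift(attribute_token))
```

Under these assumptions, a token whose attention stays on the same region yields drift values near zero, while a token whose attention moves between regions yields larger values, which is one way to read the "temporal volatility" contrast between object and attribute tokens.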
Collection
Citation
Thomas, Aldrin P.; George, Shiju; Raj, N. Anand; Ajaz, S. Mohemmed; Shaju, Midhun; Nasim, V. Akil, “Decision Flow Tracing and Word Impact Analysis in Hybrid Transformer-Conditioned Diffusion Models for Text-to-Image Generation,” CHRIST (Deemed To Be University) Institutional Repository, accessed June 18, 2026, https://archives.christuniversity.in/items/show/25416.
