How can Multimodal Large Language Models transform Science Education?

Multimodal Large Language Models (MLLMs) are advanced AI systems that can process text, sound, and visual inputs. Because science education is inherently multimodal, MLLMs such as GPT-4 Vision are promising tools for shifting content between modes and making complex information more accessible and personalized. They also raise concerns, however, such as cognitive overload and over-reliance on automation. In response, Bewersdorff et al. (2025) propose a theoretical framework for integrating MLLMs into science education, grounded in the Cognitive Theory of Multimedia Learning (CTML).

In their framework, Bewersdorff et al. (2025) integrate MLLMs into science education by aligning their capabilities with CTML. Positioned between the verbal (textual) and non-verbal (visual) processing channels, MLLMs act as adaptive tools that help learners build coherent mental models. They serve two core functions: first, transforming content across modalities, for example by converting text into images or diagrams; and second, adding a mode to existing content, for example by enriching visuals with textual explanations. Both functions are designed to reduce cognitive load and to support learners in selecting, organizing, and integrating information, the key processes identified in CTML.
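To make the two functions concrete, here is a minimal sketch of how a tutoring application might invoke them through the OpenAI Python SDK. The model name, prompts, and helper functions are illustrative assumptions, not part of the authors' framework.

```python
# A minimal sketch of the framework's two functions via the OpenAI Python SDK.
# Model name, prompts, and function names are assumptions for illustration.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def transform_text_to_diagram(explanation: str) -> str:
    """Function 1: transform content across modalities (text -> diagram).

    Requests Mermaid diagram code that a client can render as an image;
    a hypothetical stand-in for direct image generation.
    """
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed model; any multimodal chat model would do
        messages=[{
            "role": "user",
            "content": f"Convert this explanation into a Mermaid diagram:\n{explanation}",
        }],
    )
    return response.choices[0].message.content


def enrich_visual_with_text(image_url: str, learner_level: str) -> str:
    """Function 2: add a mode by enriching an existing visual with text."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"Explain this figure for a {learner_level} student."},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    )
    return response.choices[0].message.content
```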
MLLMs align closely with personalized and adaptive learning principles by enabling real-time adjustments to individual learners' needs. Their ability to flexibly transform and supplement learning materials across modalities supports tailored instruction that accommodates differences in prior knowledge, learning preferences, and cognitive load. For example, an MLLM can generate a diagram to illustrate a chemical reaction, or explain a complex biological structure like the anatomy of an insect's eye using images and customized text. When grounded in multimedia learning principles, MLLMs have the potential to create dynamic, learner-centered environments that adapt content, format, and complexity to optimize understanding.
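As a sketch of what such adaptation could look like in practice, the following hypothetical helper composes a single MLLM request from a learner profile; all names and fields are assumptions for illustration, not a method from the paper.

```python
# Hypothetical sketch of an adaptive layer that parameterizes MLLM requests
# from a learner profile; profile fields and wording are assumptions.
from dataclasses import dataclass


@dataclass
class LearnerProfile:
    prior_knowledge: str   # e.g. "novice", "intermediate", "advanced"
    preferred_mode: str    # e.g. "diagram", "text", "both"
    topic: str


def build_prompt(profile: LearnerProfile) -> str:
    """Compose one request that adapts content, format, and complexity."""
    depth = {
        "novice": "using everyday analogies and no jargon",
        "intermediate": "with standard terminology, defined on first use",
        "advanced": "at textbook depth, including edge cases",
    }[profile.prior_knowledge]
    fmt = {
        "diagram": "Respond with a labeled diagram description.",
        "text": "Respond with a short written explanation.",
        "both": "Respond with a diagram description plus a short explanation.",
    }[profile.preferred_mode]
    return f"Explain {profile.topic} {depth}. {fmt}"


# Example: an insect-eye anatomy lesson for a novice who prefers visuals
print(build_prompt(LearnerProfile("novice", "diagram", "the anatomy of an insect's eye")))
```

Keeping the adaptation logic outside the model call, as in this sketch, makes it easy to inspect exactly which adjustments were made for each learner.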

Challenges and Future Directions

While promising, MLLMs raise concerns about bias, data protection, and over-reliance on automation. The authors emphasize the continued importance of educators in guiding learners. Future research should empirically test the framework and explore how MLLMs reshape educator roles and learning dynamics across subjects. 

Why this matters in EdTech

This paper shifts the focus of educational technology from ‘AI as a tool’ to ‘AI as a multimodal learning partner’, offering a theoretical framework that can be applied in practice while remaining mindful of ethical boundaries. 

References
  • Bewersdorff, A., Hartmann, C., Hornberger, M., Seßler, K., Bannert, M., Kasneci, E., Kasneci, G., Zhai, X., & Nerdel, C. (2025). Taking the next step with generative artificial intelligence: The transformative role of multimodal large language models in science education. Learning and Individual Differences, 118, 102601. https://doi.org/10.1016/j.lindif.2024.102601
