The next generation of large language models (LLMs) will be able to take input in any modality and generate content in any modality, according to NExT, a research collaboration between the National University of Singapore and Tsinghua University focused on data analytics and artificial intelligence (AI).
In a blog post published by the initiative, the project's researchers claim that the new model, NExT-GPT, will be able to “perceive inputs and generate outputs in arbitrary combinations of text, images, videos, and audio”.
“By leveraging the existing well-trained highly-performing encoders and decoders, NExT-GPT is tuned with only a small amount of parameter (1%) of certain projection layers, which not only benefits low-cost training and also facilitates convenient expansion to more potential modalities,” the blog states.
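In practice, this means the large pretrained encoders, decoders, and language-model backbone stay frozen, and only thin projection layers between them are trained. The sketch below, written in PyTorch with hypothetical module names and dimensions (none of it taken from the NExT-GPT codebase), illustrates how such a setup keeps the trainable parameter count to a small fraction of the total.

```python
# Illustrative sketch only: module names and sizes are hypothetical,
# not drawn from the NExT-GPT implementation.
import torch.nn as nn

class MultimodalBridge(nn.Module):
    def __init__(self, encoder: nn.Module, llm: nn.Module,
                 enc_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.encoder = encoder          # pretrained modality encoder (e.g. image/audio)
        self.llm = llm                  # pretrained language-model backbone
        # Small trainable projection mapping encoder features into the LLM's space
        self.input_projection = nn.Linear(enc_dim, llm_dim)

        # Freeze the large pretrained components; only the projection is trained
        for p in self.encoder.parameters():
            p.requires_grad = False
        for p in self.llm.parameters():
            p.requires_grad = False

    def forward(self, x):
        feats = self.encoder(x)                 # frozen feature extraction
        tokens = self.input_projection(feats)   # trainable alignment layer
        return self.llm(tokens)                 # frozen generation backbone


def trainable_fraction(model: nn.Module) -> float:
    """Fraction of parameters that will actually receive gradients."""
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return trainable / total
```

With encoders and a backbone of billions of parameters frozen, the single trainable projection layer accounts for only a tiny share of the model, which is the kind of ratio the roughly 1% figure quoted above refers to.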
Currently popular generative AI tools can produce results in response to only one type of input at a time, such as text, audio, or video. The researchers at NExT, however, are aiming to build an AI tool that understands universal modalities, which they say will make human-computer interaction more natural.
NExT-GPT is not fully operational yet, but demos of its capabilities are available on GitHub.