Just over two months ago, OpenAI released ChatGPT to the public, sparking widespread discussion of how the AI-powered chatbot could affect areas such as business and education. Now, Amazon has entered the fray with a generative AI model that reportedly outperforms OpenAI's GPT-3.5, intensifying competition in the field.
Shortly thereafter, Google and Baidu announced chatbots of their own to showcase their "generative AI" capabilities: systems that can produce conversational text, graphics, and other forms of content.
On the ScienceQA benchmark, Amazon's latest language models outperform GPT-3.5 by roughly 16 percentage points, reaching 91.68% accuracy versus GPT-3.5's 75.17%, a score that even surpasses human performance on the benchmark.
The ScienceQA benchmark comprises over 21,000 multimodal multiple-choice questions covering a wide range of science topics, each annotated with its answer and a corresponding explanation.
How does it work?
Multimodal-CoT (multimodal chain-of-thought) is a technique that decomposes a multi-step problem into intermediate reasoning steps, a rationale, that lead to the final answer, and it does so even when the inputs span different modalities such as language and vision.
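To make the decomposition concrete, here is a minimal sketch of the two-stage pattern the approach builds on: a rationale-generation step followed by an answer-inference step conditioned on that rationale. The `generate()` helper is hypothetical, a stand-in for any sequence-to-sequence model call, and the vision input is shown as plain text only for simplicity.

```python
def generate(prompt: str) -> str:
    """Hypothetical stand-in for a seq2seq model call (e.g., a fine-tuned T5)."""
    raise NotImplementedError

def two_stage_cot(question: str, context: str, vision_input: str) -> str:
    # Stage 1: rationale generation. Produce the intermediate reasoning
    # steps from the full multimodal input.
    rationale = generate(
        f"Question: {question}\nContext: {context}\nVision: {vision_input}\n"
        "Solution:"
    )
    # Stage 2: answer inference. The rationale is appended to the original
    # input, so the final answer is conditioned on the reasoning chain.
    return generate(
        f"Question: {question}\nContext: {context}\nVision: {vision_input}\n"
        f"Rationale: {rationale}\nAnswer:"
    )
```

Separating the two stages means the answer step sees a complete rationale, rather than having to reason and answer in a single pass.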
The most common way to implement Multimodal-CoT is to convert all inputs into a single modality, typically by captioning images into text, and then prompt an LLM to perform CoT over the merged text. The drawback is information loss: a short caption rarely preserves the fine-grained visual detail the reasoning may depend on.
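As an illustration of that single-format baseline, the sketch below captions the image and hands the result to a text-only model. The Hugging Face model names are illustrative choices, not the ones used in the paper, and the comment marks the point where visual detail is discarded.

```python
from transformers import pipeline

# Illustrative model choices; any image captioner and text LLM would do.
captioner = pipeline("image-to-text", model="nlpconnect/vit-gpt2-image-captioning")
reasoner = pipeline("text2text-generation", model="google/flan-t5-large")

def caption_then_cot(image_path: str, question: str) -> str:
    # The image is collapsed into one sentence of text here; fine-grained
    # visual detail (layout, counts, small objects) is lost at this step.
    caption = captioner(image_path)[0]["generated_text"]
    prompt = (
        f"Image description: {caption}\n"
        f"Question: {question}\n"
        "Let's think step by step."
    )
    return reasoner(prompt, max_new_tokens=128)[0]["generated_text"]
```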
Alternatively, smaller language models can be fine-tuned to perform CoT reasoning over multimodal inputs by fusing language and vision features directly.
The major challenge with this route is that models at this scale tend to produce hallucinated rationales, and a misleading rationale in turn derails the answer inference.
The Amazon researchers' study shows that incorporating visual features helps the model generate more faithful rationales, which in turn leads to more accurate answer inference.
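As a rough illustration of what "incorporating visual features" can look like, here is a simplified PyTorch sketch, not the authors' code, in the spirit of the gated cross-attention fusion the paper describes: text hidden states attend over image patch features, and a learned gate decides how much visual evidence to mix in before the decoder writes the rationale. The `GatedVisionFusion` module and its dimensions are hypothetical.

```python
import torch
import torch.nn as nn

class GatedVisionFusion(nn.Module):
    """Hypothetical fusion block: text tokens attend to image patch features."""

    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        # Cross-attention with text as query and vision as key/value.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Linear(2 * d_model, d_model)

    def forward(self, text_h: torch.Tensor, vision_h: torch.Tensor) -> torch.Tensor:
        # text_h: (batch, text_len, d_model); vision_h: (batch, n_patches, d_model)
        attended, _ = self.cross_attn(text_h, vision_h, vision_h)
        # The gate controls, per position, how much visual evidence is mixed in.
        lam = torch.sigmoid(self.gate(torch.cat([text_h, attended], dim=-1)))
        return (1 - lam) * text_h + lam * attended

# Usage: fuse encoder outputs, then feed the fused states to the decoder
# that generates the rationale (and, in stage 2, the answer).
fusion = GatedVisionFusion(d_model=768)
fused = fusion(torch.randn(2, 32, 768), torch.randn(2, 49, 768))
print(fused.shape)  # torch.Size([2, 32, 768])
```

Because the visual evidence enters as features rather than as a lossy caption, the rationale generator can ground its reasoning steps in the image itself, which is the intuition behind the accuracy gains reported above.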