Apple researchers have driven recent progress by focusing on multimodal large language models (LLMs), most notably with their MM1 models. These models are particularly capable at processing and understanding combined visual and textual information.

Training on a diverse dataset that includes image-caption pairs, interleaved image-text data, and text-only data enables these models to perform remarkably well on tasks like image captioning, where the model generates descriptive text for a given image.
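
Apple's exact training mixture isn't reproduced here, but a minimal sketch of how such a mixed multimodal training stream might be sampled, with purely illustrative weights, looks like this:

```python
import random

# Illustrative data-mixture sampler for multimodal pre-training.
# The category names mirror the article; the weights are assumptions
# chosen for illustration, not Apple's published ratios.
MIXTURE = [
    ("image_caption_pairs", 0.45),    # image plus short descriptive text
    ("interleaved_image_text", 0.45), # documents mixing images and prose
    ("text_only", 0.10),              # plain text, preserves language skills
]

def sample_source(rng: random.Random) -> str:
    """Pick which data source the next training example comes from."""
    sources, weights = zip(*MIXTURE)
    return rng.choices(sources, weights=weights, k=1)[0]

rng = random.Random(0)
counts = {name: 0 for name, _ in MIXTURE}
for _ in range(10_000):
    counts[sample_source(rng)] += 1
print(counts)  # counts land roughly in proportion to the weights
```

Keeping some text-only data in an otherwise multimodal stream is commonly credited with helping a model retain strong language skills alongside its visual ones.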

They also excel at visual question answering, where the model answers questions based on visual content, and natural language inference, where the model determines whether one sentence entails, contradicts, or is neutral toward another.

Visual components drive multimodal AI performance

Apple’s image encoder, the core component that converts images into a representation the model can process, is key in this context. Two factors matter most: the resolution of input images and the image token count, meaning the number of distinct pieces of visual information the model processes per image. Apple’s findings indicate that optimizing these components leads to tangible improvements in the models’ capabilities.
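
To make the relationship between resolution and token count concrete, here is a back-of-the-envelope sketch for a ViT-style encoder, which turns each image patch into one token; the patch size and resolutions are illustrative assumptions, not Apple's published configuration:

```python
def image_token_count(resolution: int, patch_size: int = 14) -> int:
    """Number of visual tokens a ViT-style encoder emits for a square image.

    Each non-overlapping patch becomes one token, so the count grows
    quadratically with resolution. Patch size 14 is an assumption here,
    a common choice in vision transformers.
    """
    patches_per_side = resolution // patch_size
    return patches_per_side ** 2

for res in (224, 336, 448):
    print(f"{res}x{res} -> {image_token_count(res)} image tokens")
# 224x224 -> 256, 336x336 -> 576, 448x448 -> 1024
```

Because the token count grows quadratically with resolution, raising resolution rapidly expands the visual context the language model must attend over, which is why resolution and token count are treated as separate tuning levers.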

The MM1 model with 30 billion parameters shows strong in-context learning, engaging in multi-step reasoning over multiple input images, a capability demonstrated through few-shot “chain-of-thought” prompting.
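
As a rough illustration of what few-shot “chain-of-thought” prompting over multiple images might look like, here is a sketch of how such a prompt could be assembled; the <image:...> placeholders, the worked example, and the prompt format are illustrative assumptions, since MM1 is not publicly exposed through an interface like this:

```python
# Hypothetical few-shot chain-of-thought prompt over multiple images.
# The worked example and <image:...> placeholders are illustrative
# assumptions, not MM1's actual input format.
FEW_SHOT_EXAMPLES = [
    {
        "images": ["menu.jpg", "table_photo.jpg"],
        "question": "How much would the drinks on the table cost?",
        "reasoning": "The menu lists each drink's price. Two drinks "
                     "appear on the table, at $6 and $7, so the total "
                     "is $13.",
    },
]

def build_prompt(examples: list[dict], images: list[str], question: str) -> str:
    """Interleave worked examples with the new query so the model can
    imitate the step-by-step reasoning pattern."""
    parts: list[str] = []
    for ex in examples:
        parts.extend(f"<image:{name}>" for name in ex["images"])
        parts.append(f"Q: {ex['question']}")
        parts.append(f"A: Let's think step by step. {ex['reasoning']}")
    parts.extend(f"<image:{name}>" for name in images)
    parts.append(f"Q: {question}")
    parts.append("A: Let's think step by step.")
    return "\n".join(parts)

print(build_prompt(FEW_SHOT_EXAMPLES,
                   ["fridge_photo.jpg"],
                   "How many eggs are left in the fridge?"))
```

The worked example shows the model the pattern of reasoning step by step before answering, which is what lets it chain observations across several images in the final query.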

This multi-image reasoning capability suggests that large multimodal models can address complex, open-ended problems, grounding their understanding and generation of language in scenarios that require a nuanced interpretation of both visual and textual data. Apple’s advancements in this area point toward a future where AI can tackle more sophisticated and intricate tasks, paving the way for innovative applications in the coming years.

Apple’s financial commitment and development initiatives

Apple is strongly committed to AI, earmarking $1 billion each year for AI development. The company is channeling these funds into several key projects, including the creation of an AI framework named “Ajax” and a chatbot known as “Apple GPT.”

The “Ajax” framework is intended as a foundational layer, likely designed to support the development and deployment of AI-powered applications and services across Apple’s ecosystem.

“Apple GPT,” meanwhile, represents Apple’s push into natural language processing and conversational AI. Integrating this technology into services like Siri and Apple Music could greatly improve the user experience. For instance, Siri could offer more nuanced, context-aware responses, while Apple Music might build personalized playlists from a user’s interactions and preferences.

Apple’s current position in the AI industry

Apple’s decision to ramp up its investment in AI hints at a shift in its approach to technology innovation. Known for adopting technologies only after early adopters have validated them, Apple is now taking a proactive stance on AI development.

The tech giant recognizes the power of AI and its potential to influence its range of products and services. In bolstering its AI capabilities, Apple aims to gain a competitive edge in an industry where AI is becoming a central element of innovation and product differentiation.

Worldwide Developers Conference and AI features

Expectations are running high for Apple’s upcoming Worldwide Developers Conference (WWDC), with industry experts and enthusiasts eagerly anticipating the unveiling of new AI-powered features and developer tools. The focus is on improving existing products and on introducing entirely new capabilities that could redefine how users and developers engage with Apple’s ecosystem.

At this event, Apple is likely to present advancements in AI that could include more sophisticated machine learning models, improved frameworks for developers to build AI-driven applications, and perhaps even a closer look at how the company intends to integrate its multimodal AI research into consumer products.

Tim Boesen

March 25, 2024
