FOD#32: Mixture of Experts — What is it?

A history of MoE and concise coverage of the remarkably rich week in ML research and innovations

Ksenia Se
7 min read · Dec 12, 2023

It should be illegal to ship that many updates and releases so close to the holidays, but here we are, two weeks before Christmas, with our hands full of news and research papers (thank you very much, Conference on Neural Information Processing Systems (NeurIPS)!). Let’s dive in; it was truly a fascinating week.

But first, a reminder: we are piecing together expert views on the trajectory of ML & AI for 2024. Send your thoughts on what you believe 2024 will bring to ks@turingpost.com!

Many many thanks to those who already shared their views.

Now, to the news. Everybody is talking about Mixture of Experts (MoE) these days, thanks to Mistral AI’s almost punkish release of their new model via torrent, announced with nothing more than a bare magnet link.

The concept of MoE, though, has been around for a while. To be exact, it was first mentioned in 1988 at the Connectionist Models Summer School. The idea, introduced by Robert Jacobs and Geoffrey Hinton, involves using several specialized networks, called ‘expert’ networks, each handling different tasks, along with a controlling network that chooses the right expert for each task. This approach was suggested because using one network for all tasks often leads to interference between tasks and slow learning. By dividing tasks among experts, learning becomes faster and more efficient. This idea is the basis of the Mixture of Experts model, where different networks learn different things more effectively, emphasizing specialized learning over a one-size-fits-all strategy in neural networks. The first paper about MoE, ‘Adaptive Mixtures of Local Experts’, was published in 1991.

Despite its initial promise, MoE’s complexity and computational demands led to it being overshadowed by more straightforward algorithms during the early days of AI’s resurgence. However, with the advent of more powerful computing resources and vast datasets, MoE has experienced a renaissance, proving integral to advancements in neural network architectures.

The MoE Framework

The essence of MoE lies in its structure. Unlike traditional neural networks that push every input through a single, monolithic model, MoE employs a range of specialized sub-models. Each ‘expert’ is adept at handling specific types of data or tasks. A gating network then directs input data to the most appropriate expert(s). This division of labor not only enhances model accuracy but also scales efficiently: only the selected experts run for a given input, and experts can be trained in parallel.
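To make the routing idea concrete, here is a minimal sketch in plain Python. It is an illustration under assumptions, not any particular paper's or library's implementation: the class name `SparseMoE`, the single-linear-layer experts, the random weights, and the `top_k` parameter are all hypothetical choices made for brevity. The gate scores every expert, keeps the `top_k` highest-scoring ones, and mixes their outputs using softmax weights.

```python
import math
import random


def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, x):
    """Multiply a (rows x cols) weight matrix by the vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def make_matrix(rows, cols, rng):
    """Random weights stand in for trained parameters (assumption)."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


class SparseMoE:
    """Hypothetical toy layer: a gate plus several tiny linear experts."""

    def __init__(self, dim, num_experts, top_k, seed=0):
        rng = random.Random(seed)
        self.top_k = top_k
        # The gating network produces one score per expert.
        self.gate = make_matrix(num_experts, dim, rng)
        # Each expert is a single linear layer, purely for illustration.
        self.experts = [make_matrix(dim, dim, rng) for _ in range(num_experts)]

    def route(self, x):
        """Return (expert_index, weight) pairs for the top_k experts."""
        scores = linear(self.gate, x)
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        top = ranked[: self.top_k]
        weights = softmax([scores[i] for i in top])
        return list(zip(top, weights))

    def forward(self, x):
        """Run only the selected experts and mix their outputs."""
        out = [0.0] * len(x)
        for idx, weight in self.route(x):
            y = linear(self.experts[idx], x)
            out = [o + weight * yi for o, yi in zip(out, y)]
        return out


moe = SparseMoE(dim=4, num_experts=8, top_k=2)
x = [1.0, 0.5, -0.3, 2.0]
print(moe.route(x))    # which experts were chosen, and with what weights
print(moe.forward(x))  # the mixed output vector
```

Only `top_k` of the experts do any work for a given input, which is where the compute savings of sparse variants come from.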

Google Research has been especially dedicated to researching the topic.

Sparse Mixture-of-Experts model Mixtral 8x7B by Mistral

This week, MoE is on the rise due to Mistral’s release of their open-source Sparse Mixture-of-Experts model, Mixtral 8x7B. According to Mistral, the model outperforms Llama 2 70B on most benchmarks with six times faster inference, and matches or outperforms GPT-3.5 on most standard benchmarks. Licensed under Apache 2.0, Mixtral strikes an efficient balance between cost and performance. It handles five languages, excels in code generation, and can be fine-tuned for instruction-following tasks.
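The “8x7B” name can suggest 56B parameters, but in a sparse MoE only the feed-forward experts are replicated, and only a few of them run per token. A rough back-of-the-envelope count is sketched below. The hyperparameters are assumptions taken from the publicly released model configuration rather than from this article (hidden size 4096, 32 layers, expert hidden size 14336, 8 experts with 2 active per token, a 32,000-token vocabulary, and grouped-query attention with 8 key/value heads of size 128), so treat the result as an estimate.

```python
# Assumed hyperparameters (from the public model config, not from this article).
DIM = 4096            # hidden size
N_LAYERS = 32         # transformer blocks
FFN_HIDDEN = 14336    # hidden size inside each expert
N_EXPERTS = 8         # experts per layer
ACTIVE_EXPERTS = 2    # experts the router selects per token
VOCAB = 32000         # vocabulary size
KV_DIM = 8 * 128      # grouped-query attention: 8 key/value heads of size 128


def count_params(experts_per_layer):
    """Count parameters when `experts_per_layer` experts are included."""
    expert = 3 * DIM * FFN_HIDDEN                 # gate, up and down projections
    attention = 2 * DIM * DIM + 2 * DIM * KV_DIM  # q and o, plus k and v
    router = DIM * N_EXPERTS                      # gating network
    norms = 2 * DIM                               # two normalization layers
    per_layer = experts_per_layer * expert + attention + router + norms
    embeddings = 2 * VOCAB * DIM                  # input embedding + output head (assumed untied)
    return N_LAYERS * per_layer + embeddings + DIM  # + final normalization


total = count_params(N_EXPERTS)
active = count_params(ACTIVE_EXPERTS)
print(f"total:  {total / 1e9:.1f}B parameters")
print(f"active: {active / 1e9:.1f}B parameters per token")
```

Under these assumptions the count comes to roughly 46.7B parameters in total, of which about 12.9B are used for any single token. That gap is the cost side of the cost/performance balance: memory must hold every expert, but per-token compute looks closer to that of a much smaller dense model.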

The community is buzzing with excitement. Meanwhile, Mistral is raising another $415 million at a $2 billion valuation, instantly joining the AI Unicorn Family. (Welcome! We’ll be covering you shortly.)

Additional read: to nerd out more on Mixtral and MoE, please refer to Hugging Face’s blog, Interconnects newsletter, and Mistral’s own release post.

News from The Usual Suspects ©

Elon Musk’s “Grok” is available for Premium+

  • Musk’s foray into AI with “Grok” is a classic Musk move — disruptive and headline-grabbing. Its integration with Twitter/X for real-time data sourcing is a game changer (many like how well it summarizes the daily news). However, its positioning as an uncensored, ‘anti-woke’ alternative raises questions about content moderation and the handling of misinformation.

Google: A Mosaic of Success and Shortcomings

  • With the overwhelming demand for GPUs, Google is working on alternatives, and pretty successfully. Google Cloud just announced TPU v5p and AI Hypercomputer, enhancing AI workloads with powerful, scalable accelerators and integrated systems. TPU v5p offers improved training speeds for large models, while AI Hypercomputer combines optimized hardware, open software, and ML frameworks for efficient AI management.
  • Google’s NotebookLM is also a solid addition. It introduces many new features to enhance the process of combining ideas from various sources, including a noteboard space for pinning quotes and notes, and dynamic suggested actions for reading and note-taking. NotebookLM also offers formats for different writing projects, and ensures personal data remains private. This AI-native application continues to evolve with user feedback.
  • Google’s journey with Gemini, however, has not been entirely rosy. The advancements, particularly in leveraging TPUs and in multimodal capabilities, are impressive. But the controversy surrounding the demo and the delayed release of Gemini Ultra highlight the challenges even tech giants face in the rapidly evolving AI landscape. As the MoE papers show, Google is (was) ahead in research on many fronts. Now, the competition is no longer just about technological prowess or innovative research; it’s about demonstrating leadership.

HoneyBee from Intel Labs and Mila

  • This new LLM for materials science is notable for being the first billion-parameter scale open-source LLM in this field, delivering top-notch performance on the MatSci-NLP benchmark. This collaboration aims to advance AI tools for materials discovery, tackling challenges like climate change and sustainable semiconductor production. Available on Hugging Face.

CoreWeave’s Funding

  • GPU-rich CoreWeave, having washed its hands of crypto just in time, saw amazing results from dedicating itself to AI: its valuation hit $7 billion after a minority investment of $642 million led by Fidelity Management and Research Co. The investment underscores the growing interest in AI infrastructure and cloud computing, and CoreWeave’s focus on GPUs and AI cloud services is a testament to the increasing demand for high-powered computing in AI development.

Meta AI’s Codec Avatars

  • Meta’s advancement in creating relightable Gaussian Codec Avatars is a fascinating development in the realm of virtual reality and 3D modeling. The level of detail and real-time performance capabilities they’re achieving could have far-reaching implications for VR and AR experiences.

E.U.’s AI Act

  • This is a significant development, marking a bold step by the EU in regulating AI technologies. The focus on high-risk and commercial AI applications, coupled with stringent regulations, is a clear indicator of the EU’s commitment to ethical AI practices. The implications for open-source AI projects are particularly intriguing, as they could reshape the landscape of AI development beyond commercial entities.
  • Additional Read: A Framework for U.S. AI Governance from MIT

Other news, categorized for your convenience

An exceptionally rich week! As always, we offer you only the freshest and the most relevant research papers of the week. Truly, the best curated selection:

Language Models and Code Generation

  1. Magicoder: Introduces LLMs for code with Magicoder and MagicoderS, offering superior performance in various coding benchmarks →paper
  2. Chain of Code (CoC): Combines code-writing with code execution emulation for enhanced reasoning in LMs →paper
  3. CYBERSECEVAL: Evaluates the cybersecurity aspects of LLMs used as coding assistants →paper

Video and Image Synthesis

  1. DeepCache: Accelerates diffusion models in image synthesis by caching and reusing features across stages →paper
  2. Alchemist: Edits material attributes in real images using generative text-to-image models →paper
  3. Kandinsky 3.0: A large-scale text-to-image generation model developed in Russia that represents a significant advancement in image generation quality and realism →paper
  4. Alpha-CLIP: Focuses on specific regions in images, improving the region-based recognition capabilities of the original CLIP model →paper

Advances in Learning and Training Methods

  1. URIAL by Allen AI: A tuning-free method for aligning LLMs through in-context learning →paper
  2. Nash Learning from Human Feedback: Fine-tunes LLMs by learning a pairwise preference model from human feedback and seeking a policy that is a Nash equilibrium of that model →paper
  3. GIVT: A new approach for generative modeling using real-valued vector sequences →paper
  4. Analyzing and Improving the Training Dynamics of Diffusion Models: Proposes modifications to stabilize ADM diffusion model training →paper
  5. SPARQ ATTENTION: Reduces memory bandwidth requirements in LLMs during inference →paper
  6. Efficient Monotonic Multihead Attention: Improves simultaneous translation performance with stable and unbiased monotonic alignment estimation →paper

Multimodal and General AI

  1. OneLLM: An MLLM that aligns multiple modalities to language using a unified framework →paper
  2. Concordia by Google DeepMind: Integrates LLMs into Generative Agent-Based Models for advanced simulations →paper
  3. Multimodal Data and Resource Efficient Device-directed Speech Detection: Explores natural interaction with virtual assistants using a multimodal approach →paper

Reinforcement Learning and Reranking

  1. Pearl by Meta: A versatile framework for real-world intelligent systems using Reinforcement Learning →paper
  2. RankZephyr: An LLM for zero-shot listwise reranking that demonstrates both effectiveness and robustness →paper

Pathfinding and Reasoning

  1. PATHFINDER: A tree-search-based method for generating reasoning paths in language models →paper

Thank you for reading! Please feel free to share with your friends and colleagues. In the next couple of weeks, we are announcing our referral program 🤍

*We thank SuperAnnotate for their insights and ongoing support of Turing Post.

Another week with fascinating innovations! We call this overview “Froth on the Daydream” — or simply, FOD. It’s a reference to the surrealistic and experimental novel by Boris Vian — after all, AI is experimental and feels quite surrealistic, and a lot of writing on this topic is just a froth on the daydream.

Written by Ksenia Se

I build Turing Post, equipping you with in-depth knowledge and analysis to make smarter decisions about AI & ML -> https://www.turingpost.com/subscribe
