TITLE / TITRE
On Mixture of Experts in Large-Scale Statistical Machine Learning Applications
´¡µþ³§°Õ¸é´¡°ä°Õ/¸éɳ§±«²ÑÉÌý
Mixtures of experts (MoEs), a class of statistical machine learning models that combine multiple models, known as experts, to form more complex and accurate models, have been combined into deep learning architectures to improve the ability of these architectures and AI models to capture the heterogeneity of the data and to scale up these architectures without increasing the computational cost. In mixtures of experts, each expert specializes in a different aspect of the data, which is then combined with a gating function to produce the final output. Therefore, parameter and expert estimates play a crucial role by enabling statisticians and data scientists to articulate and make sense of the diverse patterns present in the data. However, the statistical behaviors of parameters and experts in a mixture of experts have remained unsolved, which is due to the complex interaction between gating function and expert parameters.
In the first part of the talk, we investigate the performance of the least squares estimators (LSE) under a deterministic MoEs model where the data are sampled according to a regression model, a setting that has remained largely unexplored. We establish a condition called strong identifiability to characterize the convergence behavior of various types of expert functions. We demonstrate that the rates for estimating strongly identifiable experts, namely the widely used feed-forward networks with activation functions sigmoid(·) and tanh(·), are substantially faster than those of polynomial experts, which we show to exhibit a surprising slow estimation rate.
In the second part of the talk, we show that the insights from theories shed light into understanding and improving important practical applications in machine learning and artificial intelligence (AI), in- cluding effectively scaling up massive AI models with several billion parameters, efficiently finetuning large-scale AI models for downstream tasks, and enhancing the performance of Transformer model, state-of-the-art deep learning architecture, with a novel self-attention mechanism.
|