Publications

My publications

  <div class="row">
    <div class="col-sm-2 abbr"><abbr class="badge">Arxiv</abbr></div>

    <!-- Entry bib key -->
    <div id="yin2023OWL" class="col-sm-8">
    
      <!-- Title -->
      <div class="title">Outlier Weighed Layerwise Sparsity OWL: A Missing Secret Sauce for Pruning LLMs to High Sparsity</div>
      <!-- Author -->
      <div class="author"><b>Yin, Lu</b>;&nbsp;You, Wu;&nbsp;Zhenyu, Zhang;&nbsp;Cheng-Yu, Hsieh;&nbsp;Yaqing, Wang;&nbsp;Yiling, Jia;&nbsp;Mykola, Pechenizkiy;&nbsp;Yi, Liang;&nbsp;Zhangyang, Wang;&nbsp;and Shiwei, Liu
      </div>

      <!-- Journal/Book title and date -->
      <div class="periodical">
        <em>In Arxiv</em> 2023
      </div>
    
      <!-- Links/Buttons -->
      <div class="links">
        <a class="abstract btn btn-sm z-depth-0" role="button">Abs</a>
        <a href="https://arxiv.org/pdf/2310.05175.pdf" class="btn btn-sm z-depth-0" role="button">HTML</a>
      </div>

      <!-- Hidden abstract block -->
      <div class="abstract hidden">
        <p>Large Language Models (LLMs), renowned for their remarkable performance, present a challenge due to their colossal model size when it comes to practical deployment. In response to this challenge, efforts have been directed toward the application of traditional network pruning techniques to LLMs, uncovering a massive number of parameters can be pruned in one-shot without hurting performance. Building upon insights gained from pre-LLM models, prevailing LLM pruning strategies have consistently adhered to the practice of uniformly pruning all layers at equivalent sparsity. However, this observation stands in contrast to the prevailing trends observed in the field of vision models, where non-uniform layerwise sparsity typically yields substantially improved results. To elucidate the underlying reasons for this disparity, we conduct a comprehensive analysis of the distribution of token features within LLMs. In doing so, we discover a strong correlation with the emergence of outliers, defined as features exhibiting significantly greater magnitudes compared to their counterparts in feature dimensions. Inspired by this finding, we introduce a novel LLM pruning methodology that incorporates a tailored set of non-uniform layerwise sparsity ratios specifically designed for LLM pruning, termed as Outlier Weighed Layerwise sparsity (OWL). The sparsity ratio of OWL is directly proportional to the outlier ratio observed within each layer, facilitating a more effective alignment between layerwise weight sparsity and outlier ratios. Our empirical evaluation, conducted across the LLaMA-V1 family and OPT, spanning various benchmarks, demonstrates the distinct advantages offered by OWL over previous methods. For instance, our approach exhibits a remarkable performance gain, surpassing the state-of-the-art Wanda and SparseGPT by 61.22 and 6.80 perplexity at a high sparsity level of 70%, respectively. </p>
      </div>
    </div>
  </div>

</li>

    2. AAAI
      Lottery Pools: Winning More by Interpolating Tickets without Increasing Training or Inference Cost
      Yin, Lu; Liu, Shiwei; Fang, Meng; Huang, Tianjin; Menkovski, Vlado; and Pechenizkiy, Mykola
      In Thirty-Seventh AAAI Conference on Artificial Intelligence 2023
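
    The OWL abstract above describes the paper's core mechanism: score how outlier-heavy each layer is and spread a global sparsity budget non-uniformly across layers, rather than pruning every layer at the same ratio. The sketch below is a simplified illustration of that allocation idea only, assuming a magnitude-only outlier criterion and a hypothetical clamp of ±lam around the global target; it is not the authors' code, and the exact outlier metric and allocation formula are given in the paper.

    # A minimal, illustrative sketch of outlier-aware layerwise sparsity allocation.
    # The outlier criterion, the +/- lam clamp, and all names here are simplifying
    # assumptions for illustration; see the OWL paper for the actual method.
    import numpy as np

    def layer_outlier_ratio(weight: np.ndarray, m: float = 5.0) -> float:
        """Fraction of weights whose magnitude exceeds m times the layer's mean magnitude
        (a simplified stand-in for the paper's activation-aware outlier score)."""
        mag = np.abs(weight)
        return float((mag > m * mag.mean()).mean())

    def allocate_layerwise_sparsity(weights, target_sparsity=0.7, lam=0.08):
        """Return one sparsity ratio per layer, each within [target - lam, target + lam],
        with the mean across layers equal to the global target."""
        ratios = np.array([layer_outlier_ratio(w) for w in weights])
        centered = ratios - ratios.mean()           # zero-mean deviation per layer
        span = np.abs(centered).max()
        adjust = np.zeros_like(centered) if span == 0 else centered / span * lam
        # Assumed direction (one reading of the method): layers with more outliers
        # are pruned less, i.e. they receive a lower sparsity ratio.
        return list(np.clip(target_sparsity - adjust, 0.0, 1.0))

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        layers = [rng.normal(size=(64, 64)) for _ in range(3)]
        layers[1][:4, :4] *= 50.0                   # inject outliers into the second layer
        print(allocate_layerwise_sparsity(layers))  # the second layer gets the lowest ratio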

    2022

    1. LoG (Best Paper)
      You Can Have Better Graph Neural Networks by Not Training Weights at All: Finding Untrained GNNs Tickets
      Huang, Tianjin; Chen, Tianlong; Fang, Meng; Menkovski, Vlado; Zhao, Jiaxu; Yin, Lu; Pei, Yulong; Mocanu, Decebal Constantin; Wang, Zhangyang; Pechenizkiy, Mykola; and Liu, Shiwei
      In Learning on Graphs Conference 2022
    2. UAI
      Superposing Many Tickets into One: A Performance Booster for Sparse Neural Network Training
      Yin, Lu; Menkovski, Vlado; Fang, Meng; Huang, Tianjin; Pei, Yulong; Pechenizkiy, Mykola; Mocanu, Decebal Constantin; and Liu, Shiwei
      In The 38th Conference on Uncertainty in Artificial Intelligence 2022

    2021

    1. ICML
      Do We Actually Need Dense Over-Parameterization? In-Time Over-Parameterization in Sparse Training
      Liu, Shiwei; Yin, Lu; Mocanu, Decebal Constantin; and Pechenizkiy, Mykola
      In International Conference on Machine Learning 2021
    2. NeurIPS
      Sparse Training via Boosting Pruning Plasticity with Neuroregeneration
      Liu, Shiwei; Chen, Tianlong; Chen, Xiaohan; Atashgahi, Zahra; Yin, Lu; Kou, Huanyu; Shen, Li; Pechenizkiy, Mykola; Wang, Zhangyang; and Mocanu, Decebal Constantin
      Advances in Neural Information Processing Systems 2021

    Data Efficiency and Knowledge Elicitation

    2021

    1. AAAI (Workshop)
      Semantic-Based Few-Shot Learning by Interactive Psychometric Testing
      Yin, Lu; Menkovski, Vlado; Pei, Yulong; and Pechenizkiy, Mykola
      AAAI 2022 Workshop on Interactive Machine Learning (IML@AAAI22) 2021
    2. ACML (Long Oral)
      Hierarchical Semantic Segmentation using Psychometric Learning
      Yin, Lu; Menkovski, Vlado; Liu, Shiwei; and Pechenizkiy, Mykola
      Proceedings of Machine Learning Research 2021
    3. IJCAI (DC)
      Beyond labels: knowledge elicitation using deep metric learning and psychometric testing
      Yin, Lu
      In Proceedings of the Twenty-Ninth International Conference on International Joint Conferences on Artificial Intelligence 2021

    2020

    1. ECML
      Knowledge Elicitation Using Deep Metric Learning and Psychometric Testing
      Yin, Lu; Menkovski, Vlado; and Pechenizkiy, Mykola
      In Joint European Conference on Machine Learning and Knowledge Discovery in Databases 2020
