
Daily Papers

The project automatically fetches the latest papers from arXiv based on keywords.

The subheadings in the README file represent the search keywords.

Only the most recent articles for each keyword are retained, up to a maximum of 100 papers.

You can click the 'Watch' button to receive daily email notifications.

Last update: 2024-12-26

Thanks to zezhishao/DailyArXiv and Cool Paper.

Index

Time Series

Back to Index

Title Date Cool Paper Abstract Comment
SIGMA: Selective Gated Mamba for Sequential Recommendation 2024-12-24 Go

In various domains, Sequential Recommender Systems (SRS) have become essential due to their superior capability to discern intricate user preferences. Typically, SRS utilize transformer-based architectures to forecast the subsequent item within a sequence. Nevertheless, the quadratic computational complexity inherent in these models often leads to inefficiencies, hindering the achievement of real-time recommendations. Mamba, a recent advancement, has exhibited exceptional performance in time series prediction, significantly enhancing both efficiency and accuracy. However, integrating Mamba directly into SRS poses several challenges. Its inherently unidirectional nature may constrain the model's capacity to capture the full context of user-item interactions, while its instability in state estimation can compromise its ability to detect short-term patterns within interaction sequences. To overcome these issues, we introduce a new framework named Selective Gated Mamba (SIGMA) for Sequential Recommendation. This framework leverages a Partially Flipped Mamba (PF-Mamba) to construct a bidirectional architecture specifically tailored to improve contextual modeling. Additionally, an input-sensitive Dense Selective Gate (DS Gate) is employed to optimize directional weights and enhance the processing of sequential information in PF-Mamba. For short sequence modeling, we have also developed a Feature Extract GRU (FE-GRU) to efficiently capture short-term dependencies. Empirical results indicate that SIGMA outperforms current models on five real-world datasets. Our implementation code is available at https://github.com/ziwliu-cityu/SIMGA to ease reproducibility.
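
The gated bidirectional idea above can be illustrated with a short sketch. Everything here is an illustrative assumption: GRUs stand in for the Mamba blocks, and `GatedBiDirectionalFusion` is a hypothetical name, not the paper's PF-Mamba/DS-Gate implementation.

```python
import torch
import torch.nn as nn

class GatedBiDirectionalFusion(nn.Module):
    """Illustrative stand-in for SIGMA's bidirectional design: process the
    sequence forward and flipped, then mix the two passes with an
    input-sensitive gate. GRUs substitute for the Mamba blocks."""

    def __init__(self, d_model: int):
        super().__init__()
        self.fwd = nn.GRU(d_model, d_model, batch_first=True)
        self.bwd = nn.GRU(d_model, d_model, batch_first=True)
        # Dense gate conditioned on the input: per-step directional weights.
        self.gate = nn.Sequential(nn.Linear(d_model, d_model), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h_f, _ = self.fwd(x)                        # left-to-right context
        h_b, _ = self.bwd(torch.flip(x, dims=[1]))  # right-to-left context
        h_b = torch.flip(h_b, dims=[1])             # re-align to original order
        g = self.gate(x)                            # input-sensitive mixing weights
        return g * h_f + (1.0 - g) * h_b

seq = torch.randn(8, 50, 64)                # (batch, sequence, embedding)
out = GatedBiDirectionalFusion(64)(seq)     # same shape as the input
```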

Subsampling, aligning, and averaging to find circular coordinates in recurrent time series 2024-12-24 Go

We introduce a new algorithm for finding robust circular coordinates on data that is expected to exhibit recurrence, such as that which appears in neuronal recordings of C. elegans. Techniques exist to create circular coordinates on a simplicial complex from a dimension 1 cohomology class, and these can be applied to the Rips complex of a dataset when it has a prominent class in its dimension 1 cohomology. However, this approach is known to be extremely sensitive to uneven sampling density. Our algorithm comes with a new method to correct for uneven sampling density, adapting our prior work on averaging coordinates in manifold learning. We use rejection sampling to correct for inhomogeneous sampling and then apply Procrustes matching to align and average the subsamples. In addition to providing a more robust coordinate than other approaches, this subsampling and averaging approach has better efficiency. We validate our technique on both synthetic data sets and neuronal activity recordings. Our results reveal a topological model of neuronal trajectories for C. elegans that is constructed from loops in which different regions of the brain state space can be mapped to specific and interpretable macroscopic behaviors in the worm.

From CNN to CNN + RNN: Adapting Visualization Techniques for Time-Series Anomaly Detection 2024-12-24 Go

Deep neural networks are highly effective in solving complex problems but are often viewed as "black boxes," limiting their adoption in contexts where transparency and explainability are essential. This lack of visibility raises ethical and legal concerns, particularly in critical areas like security, where automated decisions can have significant consequences. The General Data Protection Regulation (GDPR) underscores the importance of justifying these decisions. In this work, we explore visualization techniques to improve the understanding of anomaly detection models based on convolutional recurrent neural networks (CNN + RNN) with a TimeDistributed layer. Our model combines VGG19 for convolutional feature extraction and a GRU layer for sequential analysis of real-time video data. While suitable for temporal data, this structure complicates gradient propagation, as sequence elements are processed independently, dissociating temporal information. We adapt visualization techniques such as saliency maps and Grad-CAM to address these challenges. This article highlights the difficulties in visually interpreting video-based models and demonstrates how techniques for static images can be adapted to recurrent architectures, offering a transitional solution in the absence of dedicated methods.

Time-Series Foundation Model for Value-at-Risk Forecasting 2024-12-24 Go

This study is the first to explore the performance of a time-series foundation model for Value-at-Risk (VaR) forecasting. Foundation models, pre-trained on vast and varied datasets, can be used in a zero-shot setting with relatively minimal data or further improved through finetuning. We compare the performance of Google's model, called TimesFM, against conventional parametric and non-parametric models, including GARCH, Generalized Autoregressive Score (GAS), and empirical quantile estimates, using daily returns from the S&P 100 index and its constituents over 19 years. Our backtesting results indicate that in terms of the actual-over-expected ratio, the fine-tuned TimesFM model consistently outperforms traditional methods. Regarding the quantile score loss function, it achieves performance comparable to the best econometric approach, the GAS model. Overall, the foundation model is either the best or among the top performers in forecasting VaR across the 0.01, 0.025, 0.05, and 0.1 VaR levels. Fine-tuning significantly improves accuracy, indicating that zero-shot use is not optimal for VaR forecasting.
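
The two backtesting measures named above are standard and easy to reproduce. A minimal sketch (the function names and toy data are ours, not the paper's):

```python
import numpy as np

def quantile_score(returns, var_forecast, alpha):
    """Pinball (quantile) loss of a VaR forecast at level alpha."""
    u = returns - var_forecast
    return np.mean(u * (alpha - (u < 0)))

def actual_over_expected(returns, var_forecast, alpha):
    """Ratio of observed VaR violations to the expected count alpha * N."""
    violations = np.sum(returns < var_forecast)
    return violations / (alpha * len(returns))

rng = np.random.default_rng(0)
r = rng.standard_normal(1000) * 0.01             # toy daily returns
var_1pct = np.full(1000, np.quantile(r, 0.01))   # naive constant 1% VaR
print(actual_over_expected(r, var_1pct, 0.01), quantile_score(r, var_1pct, 0.01))
```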

SCKF-LSTM Based Trajectory Tracking for Electricity-Gas Integrated Energy System 2024-12-24 Go

This paper introduces a novel approach for tracking the dynamic trajectories of integrated natural gas and power systems, leveraging a Kalman filter-based structure. To predict the states of the system, Holt's exponential smoothing technique and the nonlinear dynamic equations of gas pipelines are applied to establish the power and gas system equations, respectively. The square-root cubature Kalman filter algorithm is utilized to address the numerical challenges posed by the strongly nonlinear system equations. The boundary conditions in the gas system include the flow balances at sink nodes, and the mass flow rates of loads have to be predicted at each computation step. For the prediction of load mass flows, the long short-term memory network is employed, known for its effectiveness in time series prediction. Consequently, a combined method based on the square-root cubature Kalman filter and the long short-term memory network is proposed for tracking integrated gas and power systems. To evaluate the tracking performance of the proposed method, the IEEE 39-bus power system and GasLib-40 node gas system are used to form the testing system. Simulation results demonstrate high precision in tracking the dynamic states of power and gas systems. Two indices are introduced for a numerical analysis of the tracking results, indicating that the accuracy of this method surpasses that of traditional measurements.

Accepted by IEEE Transactions on Industrial Informatics

Hierarchical Classification Auxiliary Network for Time Series Forecasting 2024-12-24 Go

Deep learning has significantly advanced time series forecasting through its powerful capacity to capture sequence relationships. However, training these models with the Mean Square Error (MSE) loss often results in over-smooth predictions, making it challenging to handle the complexity and learn high-entropy features from time series data with high variability and unpredictability. In this work, we introduce a novel approach by tokenizing time series values to train forecasting models via cross-entropy loss, while considering the continuous nature of time series data. Specifically, we propose a Hierarchical Classification Auxiliary Network, HCAN, a general model-agnostic component that can be integrated with any forecasting model. HCAN is based on a Hierarchy-Aware Attention module that integrates multi-granularity high-entropy features at different hierarchy levels. At each level, we assign a class label for timesteps to train an Uncertainty-Aware Classifier. This classifier mitigates the over-confidence in softmax loss via evidence theory. We also implement a Hierarchical Consistency Loss to maintain prediction consistency across hierarchy levels. Extensive experiments integrating HCAN with state-of-the-art forecasting models demonstrate substantial improvements over baselines on several real-world datasets.
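
The core trick of training a forecaster with cross-entropy is discretizing continuous values into class labels. A minimal sketch of that tokenization step, assuming simple uniform binning (HCAN's hierarchy and evidence-theory classifier are omitted):

```python
import torch
import torch.nn.functional as F

# Discretize continuous values into B bins so a forecaster can be trained
# with cross-entropy instead of MSE. Uniform binning is an assumption here,
# not necessarily the scheme used by HCAN.
B = 32
values = torch.randn(16, 96)                              # (batch, timesteps)
edges = torch.linspace(float(values.min()), float(values.max()), B - 1)
labels = torch.bucketize(values, edges)                   # class index per timestep

logits = torch.randn(16, 96, B, requires_grad=True)       # stand-in model output
loss = F.cross_entropy(logits.reshape(-1, B), labels.reshape(-1))
loss.backward()
```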

Zero-Shot Conditioning of Score-Based Diffusion Models by Neuro-Symbolic Constraints 2024-12-24 Go

Score-based diffusion models have emerged as effective approaches for both conditional and unconditional generation. Still, conditional generation is based on either a specific training of a conditional model or on classifier guidance, which requires training a noise-dependent classifier even when a classifier for uncorrupted data is given. We propose a method that, given a pre-trained unconditional score-based generative model, samples from the conditional distribution under arbitrary logical constraints, without requiring additional training. Unlike other zero-shot techniques, which rather aim at generating valid conditional samples, our method is designed to approximate the true conditional distribution. First, we show how to manipulate the learned score in order to sample from an un-normalized distribution conditional on a user-defined constraint. Then, we define a flexible and numerically stable neuro-symbolic framework for encoding soft logical constraints. Combining these two ingredients, we obtain a general, but approximate, conditional sampling algorithm. We further develop effective heuristics aimed at improving the approximation. Finally, we show the effectiveness of our approach in approximating conditional distributions for various types of constraints and data: tabular data, images, and time series.
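
The score manipulation mentioned above typically rests on the standard Bayes decomposition of the conditional score; the paper's contribution is to supply the second term through a neuro-symbolic soft-constraint model rather than a trained classifier:

$$
\nabla_x \log p_t(x \mid c) \,=\, \nabla_x \log p_t(x) + \nabla_x \log p_t(c \mid x)
$$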

Developing Cryptocurrency Trading Strategy Based on Autoencoder-CNN-GANs Algorithms 2024-12-24 Go

This paper leverages machine learning algorithms to forecast and analyze financial time series. The process begins with a denoising autoencoder that filters out random noise fluctuations from the main contract price data. Then, one-dimensional convolution reduces the dimensionality of the filtered data and extracts key information. The filtered and dimensionality-reduced price data is fed into a GAN, and its output serves as the input of a fully connected network. Through cross-validation, a model is trained to capture features that precede large price fluctuations. The model predicts the likelihood and direction of significant price changes in real-time price sequences, placing trades at moments of high prediction accuracy. Empirical results demonstrate that using autoencoders and convolution to filter and denoise financial data, combined with GANs, achieves a certain level of predictive performance, validating the capability of machine learning algorithms to discover underlying patterns in financial sequences. Keywords: CNN; GANs; cryptocurrency; prediction.

The paper was accepted by 2024 4th International Conference on Artificial Intelligence, Robotics, and Communication(ICAIRC 2024)

Explainable AI for Multivariate Time Series Pattern Exploration: Latent Space Visual Analytics with Temporal Fusion Transformer and Variational Autoencoders in Power Grid Event Diagnosis 2024-12-24 Go

Detecting and analyzing complex patterns in multivariate time-series data is crucial for decision-making in urban and environmental system operations. However, challenges arise from the high dimensionality, intricate complexity, and interconnected nature of complex patterns, which hinder the understanding of their underlying physical processes. Existing AI methods often face limitations in interpretability, computational efficiency, and scalability, reducing their applicability in real-world scenarios. This paper proposes a novel visual analytics framework that integrates two generative AI models, Temporal Fusion Transformer (TFT) and Variational Autoencoders (VAEs), to reduce complex patterns into lower-dimensional latent spaces and visualize them in 2D using dimensionality reduction techniques such as PCA, t-SNE, and UMAP with DBSCAN. These visualizations, presented through coordinated and interactive views and tailored glyphs, enable intuitive exploration of complex multivariate temporal patterns, identifying pattern similarities and uncovering their potential correlations for better interpretability of the AI outputs. The framework is demonstrated through a case study on power grid signal data, where it identifies multi-label grid event signatures, including faults and anomalies with diverse root causes. Additionally, novel metrics and visualizations are introduced to validate the models and evaluate the performance, efficiency, and consistency of latent maps generated by TFT and VAE under different configurations. These analyses provide actionable insights for model parameter tuning and reliability improvements. Comparative results highlight that TFT achieves shorter run times and superior scalability to diverse time-series data shapes compared to VAE. This work advances fault diagnosis in multivariate time series, fostering explainable AI to support critical system operations.

Neural Conformal Control for Time Series Forecasting 2024-12-24 Go

We introduce a neural network conformal prediction method for time series that enhances adaptivity in non-stationary environments. Our approach acts as a neural controller designed to achieve desired target coverage, leveraging auxiliary multi-view data with neural network encoders in an end-to-end manner to further enhance adaptivity. Additionally, our model is designed to enhance the consistency of prediction intervals in different quantiles by integrating monotonicity constraints and leverages data from related tasks to boost few-shot learning performance. Using real-world datasets from epidemics, electric demand, weather, and others, we empirically demonstrate significant improvements in coverage and probabilistic accuracy, and find that our method is the only one that combines good calibration with consistency in prediction intervals.
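
For orientation, the classic fixed-rule baseline that such neural controllers generalize is adaptive conformal inference, which nudges the miscoverage level after each miss. A sketch of that baseline, not of the paper's learned controller:

```python
import numpy as np

def adaptive_conformal(scores, target=0.9, gamma=0.01):
    """Classic adaptive conformal inference: adjust the miscoverage level
    online so empirical coverage tracks the target. A fixed-rule baseline,
    not the paper's learned neural controller."""
    alpha = 1.0 - target
    alpha_t, covered_history = alpha, []
    for t in range(1, len(scores)):
        level = 1.0 - np.clip(alpha_t, 1e-6, 1 - 1e-6)
        q = np.quantile(scores[:t], level)          # calibrated threshold so far
        covered = scores[t] <= q
        covered_history.append(covered)
        alpha_t += gamma * (alpha - (not covered))  # widen after a miss
    return np.mean(covered_history)

rng = np.random.default_rng(1)
print(adaptive_conformal(np.abs(rng.standard_normal(500))))  # ~0.9 coverage
```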

EF-LLM: Energy Forecasting LLM with AI-assisted Automation, Enhanced Sparse Prediction, Hallucination Detection 2024-12-24 Go

Accurate prediction helps to achieve supply-demand balance in energy systems, supporting decision-making and scheduling. Traditional models, lacking AI-assisted automation, rely on experts, incur high costs, and struggle with sparse data prediction. To address these challenges, we propose the Energy Forecasting Large Language Model (EF-LLM), which integrates domain knowledge and temporal data for time-series forecasting, supporting both pre-forecast operations and post-forecast decision-support. EF-LLM's human-AI interaction capabilities lower the entry barrier in forecasting tasks, reducing the need for extra expert involvement. To achieve this, we propose a continual learning approach with updatable LoRA and a multi-channel architecture for aligning heterogeneous multimodal data, enabling EF-LLM to continually learn heterogeneous multimodal knowledge. In addition, EF-LLM enables accurate predictions under sparse data conditions through its ability to process multimodal data. We propose a Fusion Parameter-Efficient Fine-Tuning (F-PEFT) method to effectively leverage both time-series data and text for this purpose. EF-LLM is also the first energy-specific LLM to detect hallucinations and quantify their occurrence rate, achieved via multi-task learning, semantic similarity analysis, and ANOVA. We have achieved success in energy prediction scenarios for load, photovoltaic, and wind power forecasting.

TNNGen: Automated Design of Neuromorphic Sensory Processing Units for Time-Series Clustering 2024-12-23 Go

Temporal Neural Networks (TNNs), a special class of spiking neural networks, draw inspiration from the neocortex in utilizing spike-timings for information processing. Recent works proposed a microarchitecture framework and custom macro suite for designing highly energy-efficient application-specific TNNs. These recent works rely on manual hardware design, a labor-intensive and time-consuming process. Further, there is no open-source functional simulation framework for TNNs. This paper introduces TNNGen, a pioneering effort towards the automated design of TNNs from PyTorch software models to post-layout netlists. TNNGen comprises a novel PyTorch functional simulator (for TNN modeling and application exploration) coupled with a Python-based hardware generator (for PyTorch-to-RTL and RTL-to-Layout conversions). Seven representative TNN designs for time-series signal clustering across diverse sensory modalities are simulated and their post-layout hardware complexity and design runtimes are assessed to demonstrate the effectiveness of TNNGen. We also highlight TNNGen's ability to accurately forecast silicon metrics without running hardware process flow.

Published in IEEE Transactions on Circuits and Systems II: Express Briefs, May 2024

The FIX Benchmark: Extracting Features Interpretable to eXperts 2024-12-23 Go

Feature-based methods are commonly used to explain model predictions, but these methods often implicitly assume that interpretable features are readily available. However, this is often not the case for high-dimensional data, and it can be hard even for domain experts to mathematically specify which features are important. Can we instead automatically extract collections or groups of features that are aligned with expert knowledge? To address this gap, we present FIX (Features Interpretable to eXperts), a benchmark for measuring how well a collection of features aligns with expert knowledge. In collaboration with domain experts, we propose FIXScore, a unified expert alignment measure applicable to diverse real-world settings across cosmology, psychology, and medicine domains in vision, language, and time series data modalities. With FIXScore, we find that popular feature-based explanation methods have poor alignment with expert-specified knowledge, highlighting the need for new methods that can better identify features interpretable to experts.

VITRO: Vocabulary Inversion for Time-series Representation Optimization 2024-12-23 Go

Although LLMs have demonstrated remarkable capabilities in processing and generating textual data, their pre-trained vocabularies are ill-suited for capturing the nuanced temporal dynamics and patterns inherent in time series. The discrete, symbolic nature of natural language tokens, which these vocabularies are designed to represent, does not align well with the continuous, numerical nature of time series data. To address this fundamental limitation, we propose VITRO. Our method adapts textual inversion optimization from the vision-language domain in order to learn a new time series per-dataset vocabulary that bridges the gap between the discrete, semantic nature of natural language and the continuous, numerical nature of time series data. We show that learnable time series-specific pseudo-word embeddings represent time series data better than existing general language model vocabularies, with VITRO-enhanced methods achieving state-of-the-art performance in long-term forecasting across most datasets.

Accepted to ICASSP 2025

On the Optimization of Singular Spectrum Analyses: A Pragmatic Approach 2024-12-23 Go

Singular Spectrum Analysis (SSA) occupies a prominent place in the real signal analysis toolkit alongside Fourier and Wavelet analysis. In addition to the two aforementioned analyses, SSA allows the separation of patterns directly from the data space into the data space, with data that need not be strictly stationary, continuous, or even normally sampled. In most cases, SSA relies on a combination of Hankel or Toeplitz matrices and Singular Value Decomposition (SVD). Like Fourier and Wavelet analysis, SSA has its limitations. The main bottleneck of the method can be summarized in three points. The first is the diagonalization of the Hankel/Toeplitz matrix, which can become a major problem from a memory and/or computational point of view if the time series to be analyzed is very long or heavily sampled. The second point concerns the size of the analysis window, typically denoted as 'L', which will affect the detection of patterns in the time series as well as the dimensions of the Hankel/Toeplitz matrix. Finally, the third point concerns pattern reconstruction: how to easily identify in the eigenvector/eigenvalue space which patterns should be grouped. We propose to address each of these issues by describing a hopefully effective approach that we have been developing for over 10 years and that has yielded good results in our research work.

28 pages, 11 figures
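
For readers new to the method, the basic SSA pipeline the paper optimizes (Hankel embedding, SVD, grouping, diagonal averaging) fits in a few lines. This is a textbook sketch, not the authors' optimized approach:

```python
import numpy as np
from scipy.linalg import hankel

def ssa_components(x, L, rank):
    """Basic SSA: Hankel embedding, SVD, rank-one grouping, and diagonal
    averaging (Hankelization) back into time-domain components."""
    N = len(x)
    X = hankel(x[:L], x[L - 1:])                 # L x (N-L+1) trajectory matrix
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    comps = []
    for k in range(rank):
        Xk = s[k] * np.outer(U[:, k], Vt[k])     # rank-one component matrix
        # Anti-diagonal means of Xk recover a length-N series.
        comp = np.array([np.mean(Xk[::-1].diagonal(i - L + 1)) for i in range(N)])
        comps.append(comp)
    return comps

t = np.linspace(0, 10, 500)
x = np.sin(2 * np.pi * t) + 0.1 * np.random.default_rng(2).standard_normal(500)
leading = ssa_components(x, L=100, rank=2)       # dominant oscillatory pair
```
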
EasyTime: Time Series Forecasting Made Easy 2024-12-23 Go

Time series forecasting has important applications across diverse domains. EasyTime, the system we demonstrate, facilitates easy use of time-series forecasting methods by researchers and practitioners alike. First, EasyTime enables one-click evaluation, enabling researchers to evaluate new forecasting methods using the suite of diverse time series datasets collected in the preexisting time series forecasting benchmark (TFB). This is achieved by leveraging TFB's flexible and consistent evaluation pipeline. Second, when practitioners must perform forecasting on a new dataset, a nontrivial first step is often to find an appropriate forecasting method. EasyTime provides an Automated Ensemble module that combines the promising forecasting methods to yield superior forecasting accuracy compared to individual methods. Third, EasyTime offers a natural language Q&A module leveraging large language models. Given a question like "Which method is best for long term forecasting on time series with strong seasonality?", EasyTime converts the question into SQL queries on the database of results obtained by TFB and then returns an answer in natural language and charts. By demonstrating EasyTime, we intend to show how it is possible to simplify the use of time series forecasting and to offer better support for the development of new generations of time series forecasting methods.

Accepted by ICDE2025
Quantum Time-Series Learning with Evolutionary Algorithms 2024-12-23 Go

Variational quantum circuits have arisen as an important method in quantum computing. A crucial step is parameter optimization, which is typically tackled through gradient-descent techniques. We instead explore the use of evolutionary algorithms for such optimization, specifically for time-series forecasting. We perform a comparison, for diverse instances of real-world data, between gradient-descent parameter optimization and the covariance matrix adaptation evolution strategy (CMA-ES). We observe that gradient descent becomes permanently trapped in local minima that the evolutionary algorithm avoids in all tested datasets, reaching up to a six-fold decrease in prediction error. Finally, the combined use of evolutionary and gradient-based techniques is explored, aiming at retaining advantages of both. The results are particularly applicable in scenarios sensitive to gains in accuracy.

7 pages, 2 figures
Are Self-Attentions Effective for Time Series Forecasting? 2024-12-23 Go

Time series forecasting is crucial for applications across multiple domains and various scenarios. Although Transformer models have dramatically advanced the landscape of forecasting, their effectiveness remains debated. Recent findings have indicated that simpler linear models might outperform complex Transformer-based approaches, highlighting the potential for more streamlined architectures. In this paper, we shift the focus from evaluating the overall Transformer architecture to specifically examining the effectiveness of self-attention for time series forecasting. To this end, we introduce a new architecture, Cross-Attention-only Time Series transformer (CATS), that rethinks the traditional Transformer framework by eliminating self-attention and leveraging cross-attention mechanisms instead. By establishing future horizon-dependent parameters as queries and enhanced parameter sharing, our model not only improves long-term forecasting accuracy but also reduces the number of parameters and memory usage. Extensive experiments across various datasets demonstrate that our model achieves superior performance with the lowest mean squared error and uses fewer parameters compared to existing models. The implementation of our model is available at: https://github.com/dongbeank/CATS.

Accepted at NeurIPS 2024
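
The cross-attention-only design can be sketched compactly: learnable horizon-dependent queries attend to embedded past patches, with no self-attention anywhere. All names and dimensions below are illustrative assumptions, not the released CATS code:

```python
import torch
import torch.nn as nn

class CrossAttentionForecaster(nn.Module):
    """Sketch in the spirit of CATS: learnable per-horizon queries cross-attend
    to embedded past patches (the paper's parameter-sharing details differ)."""

    def __init__(self, d_model=64, horizon=24, patch_dim=16, heads=4):
        super().__init__()
        self.embed = nn.Linear(patch_dim, d_model)
        self.queries = nn.Parameter(torch.randn(horizon, d_model))  # one per future step
        self.attn = nn.MultiheadAttention(d_model, heads, batch_first=True)
        self.head = nn.Linear(d_model, 1)

    def forward(self, patches):                 # (batch, n_patches, patch_dim)
        kv = self.embed(patches)
        q = self.queries.expand(patches.size(0), -1, -1)
        out, _ = self.attn(q, kv, kv)           # cross-attention only
        return self.head(out).squeeze(-1)       # (batch, horizon)

y_hat = CrossAttentionForecaster()(torch.randn(8, 21, 16))
print(y_hat.shape)                              # torch.Size([8, 24])
```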

Improving the Noise Estimation of Latent Neural Stochastic Differential Equations 2024-12-23 Go

Latent neural stochastic differential equations (SDEs) have recently emerged as a promising approach for learning generative models from stochastic time series data. However, they systematically underestimate the noise level inherent in such data, limiting their ability to capture stochastic dynamics accurately. We investigate this underestimation in detail and propose a straightforward solution: by including an explicit additional noise regularization in the loss function, we are able to learn a model that accurately captures the diffusion component of the data. We demonstrate our results on a conceptual model system that highlights the improved latent neural SDE's capability to model stochastic bistable dynamics.

xPatch: Dual-Stream Time Series Forecasting with Exponential Seasonal-Trend Decomposition 2024-12-23 Go

In recent years, the application of transformer-based models in time-series forecasting has received significant attention. While often demonstrating promising results, the transformer architecture encounters challenges in fully exploiting the temporal relations within time series data due to its attention mechanism. In this work, we design eXponential Patch (xPatch for short), a novel dual-stream architecture that utilizes exponential decomposition. Inspired by the classical exponential smoothing approaches, xPatch introduces the innovative seasonal-trend exponential decomposition module. Additionally, we propose a dual-flow architecture that consists of an MLP-based linear stream and a CNN-based non-linear stream. This model investigates the benefits of employing patching and channel-independence techniques within a non-transformer model. Finally, we develop a robust arctangent loss function and a sigmoid learning rate adjustment scheme, which prevent overfitting and boost forecasting performance. The code is available at the following repository: https://github.com/stitsyuk/xPatch.
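
The arctangent loss is only named in the abstract; one plausible bounded, outlier-robust form is sketched below, with the caveat that the exact definition in xPatch may differ:

```python
import torch

def arctan_loss(pred, target):
    """A bounded loss that saturates large residuals through arctan, damping
    the influence of outliers. One plausible form only; the arctangent loss
    defined in the xPatch paper may differ."""
    return torch.mean(torch.atan(torch.abs(pred - target)))

pred, target = torch.randn(32, 96), torch.randn(32, 96)
print(arctan_loss(pred, target))
```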

Collaborative Optimization in Financial Data Mining Through Deep Learning and ResNeXt 2024-12-23 Go

This study proposes a multi-task learning framework based on ResNeXt, aiming to solve the problem of feature extraction and task collaborative optimization in financial data mining. Financial data usually has the complex characteristics of high dimensionality, nonlinearity, and time series structure, and is accompanied by potential correlations between multiple tasks, making it difficult for traditional methods to meet the needs of data mining. This study introduces the ResNeXt model into the multi-task learning framework and makes full use of its group convolution mechanism to achieve efficient extraction of local patterns and global features of financial data. At the same time, through the design of task-sharing layers and dedicated layers, deep collaborative optimization relationships are established between multiple related tasks. Through flexible multi-task loss weight design, the model can effectively balance the learning needs of different tasks and improve overall performance. Experiments are conducted on a real S&P 500 financial data set, verifying the significant advantages of the proposed framework in classification and regression tasks. The results indicate that, when compared to other conventional deep learning models, the proposed method delivers superior performance in terms of accuracy, F1 score, root mean square error, and other metrics, highlighting its outstanding effectiveness and robustness in handling complex financial data. This research provides an efficient and adaptable solution for financial data mining, and at the same time opens up a new research direction for the combination of multi-task learning and deep learning, which has important theoretical significance and practical application value.

MSHyper: Multi-Scale Hypergraph Transformer for Long-Range Time Series Forecasting 2024-12-23 Go

Demystifying interactions between temporal patterns of different scales is fundamental to precise long-range time series forecasting. However, previous works lack the ability to model high-order interactions. To promote more comprehensive pattern interaction modeling for long-range time series forecasting, we propose a Multi-Scale Hypergraph Transformer (MSHyper) framework. Specifically, a multi-scale hypergraph is introduced to provide foundations for modeling high-order pattern interactions. Then by treating hyperedges as nodes, we also build a hyperedge graph to enhance hypergraph modeling. In addition, a tri-stage message passing mechanism is introduced to aggregate pattern information and learn the interaction strength between temporal patterns of different scales. Extensive experiments on five real-world datasets demonstrate that MSHyper achieves state-of-the-art (SOTA) performance across various settings.

12 pages, 8 figures
On the Feasibility of Vision-Language Models for Time-Series Classification 2024-12-23 Go

We build upon time-series classification by leveraging the capabilities of Vision Language Models (VLMs). We find that VLMs produce competitive results after two or fewer epochs of fine-tuning. We develop a novel approach that incorporates graphical data representations as images in conjunction with numerical data. This approach is rooted in the hypothesis that graphical representations can provide additional contextual information that numerical data alone may not capture. Additionally, providing a graphical representation can circumvent issues such as the limited context length faced by LLMs. To further advance this work, we implemented a scalable end-to-end pipeline for training on different scenarios, allowing us to isolate the most effective strategies for transferring learning capabilities from LLMs to Time Series Classification (TSC) tasks. Our approach works with univariate and multivariate time-series data. In addition, we conduct extensive and practical experiments to show how this approach works for time-series classification and generative labels.

Enabling Time-series Foundation Model for Building Energy Forecasting via Contrastive Curriculum Learning 2024-12-23 Go

Advances in time-series forecasting are driving a shift from conventional machine learning models to foundation models (FMs) that are trained with generalized knowledge. However, existing FMs still perform poorly in energy fields such as building energy forecasting (BEF). This paper studies the adaptation of FMs to BEF tasks. We demonstrate the shortcomings of straightforwardly fine-tuning FMs from the perspectives of both the FM and the data. To overcome these limitations, we propose a new *contrastive curriculum learning*-based training method. Our method optimizes the ordering of training data in the context of TSFM adaptation. Experiments show that our method can improve the zero/few-shot performance by 14.6% compared to the existing FMs. Our code and new TSFM will be available at .

Architecture-Aware Learning Curve Extrapolation via Graph Ordinary Differential Equation 2024-12-23 Go

Learning curve extrapolation predicts neural network performance from early training epochs and has been applied to accelerate AutoML, facilitating hyperparameter tuning and neural architecture search. However, existing methods typically model the evolution of learning curves in isolation, neglecting the impact of neural network (NN) architectures, which influence the loss landscape and learning trajectories. In this work, we explore whether incorporating neural network architecture improves learning curve modeling and how to effectively integrate this architectural information. Motivated by the dynamical system view of optimization, we propose a novel architecture-aware neural differential equation model to forecast learning curves continuously. We empirically demonstrate its ability to capture the general trend of fluctuating learning curves while quantifying uncertainty through variational parameters. Our model outperforms current state-of-the-art learning curve extrapolation methods and pure time-series modeling approaches for both MLP and CNN-based learning curves. Additionally, we explore the applicability of our method in Neural Architecture Search scenarios, such as training configuration ranking.

Accepted to AAAI'25
DUET: Dual Clustering Enhanced Multivariate Time Series Forecasting 2024-12-23 Go

Multivariate time series forecasting is crucial for various applications, such as financial investment, energy management, weather forecasting, and traffic optimization. However, accurate forecasting is challenging due to two main factors. First, real-world time series often show heterogeneous temporal patterns caused by distribution shifts over time. Second, correlations among channels are complex and intertwined, making it hard to model the interactions among channels precisely and flexibly. In this study, we address these challenges by proposing a general framework called DUET, which introduces dual clustering on the temporal and channel dimensions to enhance multivariate time series forecasting. First, we design a Temporal Clustering Module (TCM) that clusters time series into fine-grained distributions to handle heterogeneous temporal patterns. For different distribution clusters, we design various pattern extractors to capture their intrinsic temporal patterns, thus modeling the heterogeneity. Second, we introduce a novel Channel-Soft-Clustering strategy and design a Channel Clustering Module (CCM), which captures the relationships among channels in the frequency domain through metric learning and applies sparsification to mitigate the adverse effects of noisy channels. Finally, DUET combines TCM and CCM to incorporate both the temporal and channel dimensions. Extensive experiments on 25 real-world datasets from 10 application domains demonstrate the state-of-the-art performance of DUET.

Accepted by KDD 2025
WPMixer: Efficient Multi-Resolution Mixing for Long-Term Time Series Forecasting 2024-12-22 Go

Time series forecasting is crucial for various applications, such as weather forecasting, power load forecasting, and financial analysis. In recent studies, MLP-mixer models for time series forecasting have been shown to be a promising alternative to transformer-based models. However, the performance of these models has yet to reach its potential. In this paper, we propose Wavelet Patch Mixer (WPMixer), a novel MLP-based model for long-term time series forecasting, which leverages the benefits of patching, multi-resolution wavelet decomposition, and mixing. Our model is based on three key components: (i) multi-resolution wavelet decomposition, (ii) patching and embedding, and (iii) MLP mixing. Multi-resolution wavelet decomposition efficiently extracts information in both the frequency and time domains. Patching allows the model to capture an extended history with a look-back window and enhances capturing local information, while MLP mixing incorporates global information. Our model significantly outperforms state-of-the-art MLP-based and transformer-based models for long-term time series forecasting in a computationally efficient way, demonstrating its efficacy and potential for practical applications.

12 pages, 3 Figures, AAAI-2025
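
The first of the three components, multi-resolution wavelet decomposition, is readily reproduced with PyWavelets. A sketch under the assumption of a db4 wavelet at three levels (the patching and MLP-mixing stages are omitted):

```python
import numpy as np
import pywt

# Decompose a series into one coarse approximation band and three detail
# bands; each band would then be patched and mixed separately in WPMixer.
x = (np.sin(np.linspace(0, 8 * np.pi, 512))
     + 0.2 * np.random.default_rng(3).standard_normal(512))
coeffs = pywt.wavedec(x, wavelet="db4", level=3)  # [approx, detail_3, detail_2, detail_1]
for i, c in enumerate(coeffs):
    print(f"band {i}: {len(c)} coefficients")

x_rec = pywt.waverec(coeffs, wavelet="db4")       # decomposition is invertible
print(np.allclose(x, x_rec[: len(x)]))            # True
```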

Separating Drone Point Clouds From Complex Backgrounds by Cluster Filter -- Technical Report for CVPR 2024 UG2 Challenge 2024-12-22 Go

The increasing deployment of small drones as tools of conflict and disruption has amplified their threat, highlighting the urgent need for effective anti-drone measures. However, the compact size of most drones presents a significant challenge, as traditional supervised point cloud or image-based object detection methods often fail to identify such small objects effectively. This paper proposes a simple UAV detection method using an unsupervised pipeline. It uses spatial-temporal sequence processing to fuse multiple lidar datasets effectively, tracking and determining the position of UAVs so as to detect and track them in challenging environments. Our method performs foreground and background segmentation of point clouds through a global-local sequence clusterer and parses point cloud data from both the spatial-temporal density and spatial-temporal voxels of the point cloud. Furthermore, a scoring mechanism for point cloud moving targets is proposed, using time series detection to improve accuracy and efficiency. We used the MMAUD dataset, and our method achieved 4th place in the CVPR 2024 UG2+ Challenge, confirming the effectiveness of our method in practical applications.

7 pages, 4 figures
Enhancing Supply Chain Transparency in Emerging Economies Using Online Contents and LLMs 2024-12-22 Go

In the current global economy, supply chain transparency plays a pivotal role in ensuring security by enabling companies to monitor supplier performance and fostering accountability and responsibility. Despite the advancements in supply chain relationship datasets like Bloomberg and FactSet, supply chain transparency remains a significant challenge in emerging economies due to issues such as information asymmetry and institutional gaps in regulation. This study proposes a novel approach to enhance supply chain transparency in emerging economies by leveraging online content and large language models (LLMs). We develop a Supply Chain Knowledge Graph Mining System that integrates advanced LLMs with web crawler technology to automatically collect and analyze supply chain information. The system's effectiveness is validated through a case study focusing on the semiconductor supply chain, a domain that has recently gained significant attention due to supply chain risks. Our results demonstrate that the proposed system provides greater applicability for emerging economies, such as mainland China, complementing the data gaps in existing datasets. However, challenges including the accurate estimation of monetary and material flows, the handling of time series data, synonym disambiguation, and mitigating biases from online content still remain. Future research should focus on addressing these issues to further enhance the system's capabilities and broaden its application to other emerging economies and industries.

6 pages
Preventing Non-intrusive Load Monitoring Privacy Invasion: A Precise Adversarial Attack Scheme for Networked Smart Meters 2024-12-22 Go

Smart grid, through networked smart meters employing the non-intrusive load monitoring (NILM) technique, can considerably discern the usage patterns of residential appliances. However, this technique also incurs privacy leakage. To address this issue, we propose an innovative scheme based on adversarial attack in this paper. The scheme effectively prevents NILM models from violating appliance-level privacy, while also ensuring accurate billing calculation for users. To achieve this objective, we overcome two primary challenges. First, as NILM models fall under the category of time-series regression models, direct application of traditional adversarial attacks designed for classification tasks is not feasible. To tackle this issue, we formulate a novel adversarial attack problem tailored specifically for NILM and provide a theoretical foundation for utilizing the Jacobian of the NILM model to generate imperceptible perturbations. Leveraging the Jacobian, our scheme can produce perturbations that effectively mislead the signal prediction of NILM models to safeguard users' appliance-level privacy. The second challenge pertains to fundamental utility requirements, where existing adversarial attack schemes struggle to achieve accurate billing calculation for users. To handle this problem, we introduce an additional constraint, mandating that the sum of added perturbations within a billing period must be precisely zero. Experimental validation on real-world power datasets REDD and UK-DALE demonstrates the efficacy of our proposed solutions, which can significantly amplify the discrepancy between the output of the targeted NILM model and the actual power signal of appliances, and enable accurate billing at the same time. Additionally, our solutions exhibit transferability, making the generated perturbation signal from one target model applicable to other diverse NILM models.
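
The billing constraint reduces to forcing the perturbation to sum to zero over the billing period, which can be imposed by mean-subtraction after each gradient step. A simplified sketch with a toy objective and toy model, not the paper's full Jacobian-based scheme:

```python
import torch

def zero_sum_perturbation(model, window, step=0.01, iters=10):
    """Simplified sketch: ascend a toy objective on the NILM model's output
    while keeping the perturbation's sum exactly zero, so billed energy over
    the period is unchanged. Not the paper's full Jacobian-based scheme."""
    delta = torch.zeros_like(window, requires_grad=True)
    for _ in range(iters):
        loss = model(window + delta).pow(2).sum()   # toy stand-in objective
        loss.backward()
        with torch.no_grad():
            delta += step * delta.grad.sign()       # gradient-sign ascent
            delta -= delta.mean()                   # enforce zero-sum billing constraint
        delta.grad.zero_()
    return delta.detach()

nilm = torch.nn.Sequential(torch.nn.Linear(128, 64), torch.nn.ReLU(), torch.nn.Linear(64, 1))
d = zero_sum_perturbation(nilm, torch.rand(1, 128))
print(float(d.sum()))   # ~0: billed energy over the window is preserved
```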

♣ CLOVER ♣: Probabilistic Forecasting with Coherent Learning Objective Reparameterization 2024-12-21 Go

Obtaining accurate probabilistic forecasts is an operational challenge in many applications, such as energy management, climate forecasting, supply chain planning, and resource allocation. Many of these applications present a natural hierarchical structure over the forecasted quantities; and forecasting systems that adhere to this hierarchical structure are said to be coherent. Furthermore, operational planning benefits from the accuracy at all levels of the aggregation hierarchy. However, building accurate and coherent forecasting systems is challenging: classic multivariate time series tools and neural network methods are still being adapted for this purpose. In this paper, we augment an MQForecaster neural network architecture with a modified multivariate Gaussian factor model that achieves coherence by construction. The factor model samples can be differentiated with respect to the model parameters, allowing optimization on arbitrary differentiable learning objectives that align with the forecasting system's goals, including quantile loss and the scaled Continuous Ranked Probability Score (CRPS). We call our method the Coherent Learning Objective Reparametrization Neural Network (CLOVER). In comparison to state-of-the-art coherent forecasting methods, CLOVER achieves significant improvements in scaled CRPS forecast accuracy, with average gains of 15%, as measured on six publicly-available datasets.

10 pages of main text. Updated method and results
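
Since the factor model produces differentiable samples, sample-based CRPS estimation is the natural evaluation. The standard estimator, shown on illustrative draws:

```python
import numpy as np

def sample_crps(samples, y):
    """Sample-based CRPS estimator: E|X - y| - 0.5 * E|X - X'|, computed
    from forecast draws such as those a factor model would produce."""
    samples = np.asarray(samples)
    term1 = np.mean(np.abs(samples - y))
    term2 = 0.5 * np.mean(np.abs(samples[:, None] - samples[None, :]))
    return term1 - term2

draws = np.random.default_rng(4).normal(loc=1.0, scale=0.5, size=200)
print(sample_crps(draws, y=1.2))
```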

TimeRAG: BOOSTING LLM Time Series Forecasting via Retrieval-Augmented Generation 2024-12-21 Go

Although the rise of large language models (LLMs) has introduced new opportunities for time series forecasting, existing LLM-based solutions require excessive training and exhibit limited transferability. In view of these challenges, we propose TimeRAG, a framework that incorporates Retrieval-Augmented Generation (RAG) into time series forecasting LLMs, which constructs a time series knowledge base from historical sequences, retrieves reference sequences from the knowledge base that exhibit similar patterns to the query sequence measured by Dynamic Time Warping (DTW), and combines these reference sequences and the prediction query as a textual prompt to the time series forecasting LLM. Experiments on datasets from various domains show that the integration of RAG improved the prediction accuracy of the original model by 2.97% on average.
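
TimeRAG's retrieval step is classical: score stored sequences against the query with DTW and keep the nearest ones. A self-contained sketch (knowledge-base construction and prompt assembly are omitted):

```python
import numpy as np

def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic-time-warping distance."""
    D = np.full((len(a) + 1, len(b) + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[-1, -1]

def retrieve(query, knowledge_base, k=3):
    """Return the k stored sequences most similar to the query under DTW,
    mirroring the retrieval step described above."""
    dists = [dtw_distance(query, seq) for seq in knowledge_base]
    return [knowledge_base[i] for i in np.argsort(dists)[:k]]

rng = np.random.default_rng(5)
kb = [np.cumsum(rng.standard_normal(48)) for _ in range(100)]
references = retrieve(np.cumsum(rng.standard_normal(48)), kb)
```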

Enhancing Startup Success Predictions in Venture Capital: A GraphRAG Augmented Multivariate Time Series Method 2024-12-21 Go

In the Venture Capital (VC) industry, predicting the success of startups is challenging due to limited financial data and the need for subjective revenue forecasts. Previous methods based on time series analysis often fall short as they fail to incorporate crucial inter-company relationships such as competition and collaboration. To fill the gap, this paper aims to introduce a novel approach using GraphRAG augmented time series model. With GraphRAG, time series predictive methods are enhanced by integrating these vital relationships into the analysis framework, allowing for a more dynamic understanding of the startup ecosystem in venture capital. Our experimental results demonstrate that our model significantly outperforms previous models in startup success predictions.

VisTR: Visualizations as Representations for Time-series Table Reasoning 2024-12-21 Go

Time-series table reasoning interprets temporal patterns and relationships in data to answer user queries. Despite recent advancements leveraging large language models (LLMs), existing methods often struggle with pattern recognition, context drift in long time-series data, and the lack of visual-based reasoning capabilities. To address these challenges, we propose VisTR, a framework that places visualizations at the core of the reasoning process. Specifically, VisTR leverages visualizations as representations to bridge raw time-series data and human cognitive processes. By transforming tables into fixed-size visualization references, it captures key trends, anomalies, and temporal relationships, facilitating intuitive and interpretable reasoning. These visualizations are aligned with user input, i.e., charts, text, and sketches, through a fine-tuned multimodal LLM, ensuring robust cross-modal alignment. To handle large-scale data, VisTR integrates pruning and indexing mechanisms for scalable and efficient retrieval. Finally, an interactive visualization interface supports seamless multimodal exploration, enabling users to interact with data through both textual and visual modalities. Quantitative evaluations demonstrate the effectiveness of VisTR in aligning multimodal inputs and improving reasoning accuracy. Case studies further illustrate its applicability to various time-series reasoning and exploration tasks.

14 pages, 10 figures
Breaking the Context Bottleneck on Long Time Series Forecasting 2024-12-21 Go

Long-term time-series forecasting is essential for planning and decision-making in economics, energy, and transportation, where long foresight is required. To obtain such long foresight, models must be both efficient and effective in processing long sequences. Recent advancements have enhanced the efficiency of these models; however, the challenge of effectively leveraging longer sequences persists. This is primarily due to the tendency of these models to overfit when presented with extended inputs, necessitating the use of shorter input lengths to maintain tolerable error margins. In this work, we investigate the multiscale modeling method and propose the Logsparse Decomposable Multiscaling (LDM) framework for the efficient and effective processing of long sequences. We demonstrate that by decoupling patterns at different scales in time series, we can enhance predictability by reducing non-stationarity, improve efficiency through a compact long input representation, and simplify the architecture by providing clear task assignments. Experimental results demonstrate that LDM not only outperforms all baselines in long-term forecasting benchmarks but also reduces both training time and memory costs.

Time series forecasting algorithm based on multi-scale analysis

VSFormer: Value and Shape-Aware Transformer with Prior-Enhanced Self-Attention for Multivariate Time Series Classification 2024-12-21 Go

Multivariate time series classification is a crucial task in data mining, attracting growing research interest due to its broad applications. While many existing methods focus on discovering discriminative patterns in time series, real-world data does not always present such patterns, and sometimes raw numerical values can also serve as discriminative features. Additionally, the recent success of Transformer models has inspired many studies. However, when applied to time series classification, the self-attention mechanisms in Transformer models could introduce classification-irrelevant features, thereby compromising accuracy. To address these challenges, we propose a novel method, VSFormer, that incorporates both discriminative patterns (shape) and numerical information (value). In addition, we extract class-specific prior information derived from supervised information to enrich the positional encoding and provide classification-oriented self-attention learning, thereby enhancing its effectiveness. Extensive experiments on all 30 UEA archived datasets demonstrate the superior performance of our method compared to SOTA models. Through ablation studies, we demonstrate the effectiveness of the improved encoding layer and the proposed self-attention mechanism. Finally, we provide a case study on a real-world time series dataset without discriminative patterns to interpret our model.

MCGAN: Enhancing GAN Training with Regression-Based Generator Loss 2024-12-21 Go

Generative adversarial networks (GANs) have emerged as a powerful tool for generating high-fidelity data. However, the main bottleneck of existing approaches is the lack of supervision on the generator training, which often results in undamped oscillation and unsatisfactory performance. To address this issue, we propose an algorithm called Monte Carlo GAN (MCGAN). This approach, utilizing an innovative generative loss function, termed the regression loss, reformulates the generator training as a regression task and enables the generator training by minimizing the mean squared error between the discriminator's output on real data and the expected discriminator output on fake data. We demonstrate the desirable analytic properties of the regression loss, including discriminability and optimality, and show that our method requires a weaker condition on the discriminator for effective generator training. These properties justify the strength of this approach in improving the training stability while retaining the optimality of GAN by leveraging the strong supervision of the regression loss. Extensive experiments on diverse datasets, including image data (CIFAR-10/100, FFHQ256, ImageNet, and LSUN Bedroom), time series data (VAR and stock data), and video data, are conducted to demonstrate the flexibility and effectiveness of our proposed MCGAN. Numerical results show that MCGAN is versatile in enhancing a variety of backbone GAN models and achieves consistent and significant improvements in terms of quality, accuracy, training stability, and learned latent space.

Iterative Encoding-Decoding VAEs Anomaly Detection in NOAA's DART Time Series: A Machine Learning Approach for Enhancing Data Integrity for NASA's GRACE-FO Verification and Validation 2024-12-20 Go

NOAA's Deep-ocean Assessment and Reporting of Tsunamis (DART) data are critical for NASA-JPL's tsunami detection, real-time operations, and oceanographic research. However, these time-series data often contain spikes, steps, and drifts that degrade data quality and obscure essential oceanographic features. To address these anomalies, the work introduces an Iterative Encoding-Decoding Variational Autoencoders (Iterative Encoding-Decoding VAEs) model to improve the quality of DART time series. Unlike traditional filtering and thresholding methods that risk distorting inherent signal characteristics, Iterative Encoding-Decoding VAEs progressively remove anomalies while preserving the data's latent structure. A hybrid thresholding approach further retains genuine oceanographic features near boundaries. Applied to complex DART datasets, this approach yields reconstructions that better maintain key oceanic properties compared to classical statistical techniques, offering improved robustness against spike removal and subtle step changes. The resulting high-quality data supports critical verification and validation efforts for the GRACE-FO mission at NASA-JPL, where accurate surface measurements are essential to modeling Earth's gravitational field and global water dynamics. Ultimately, this data processing method enhances tsunami detection and underpins future climate modeling with improved interpretability and reliability.

Preprint
Representation Learning of Daily Movement Data Using Text Encoders 2024-12-20 Go

Time-series representation learning is a key area of research for remote healthcare monitoring applications. In this work, we focus on a dataset of recordings of in-home activity from people living with dementia. We design a representation learning method based on converting activity to text strings that can be encoded using a language model fine-tuned to transform data from the same participants within a 30-day window to similar embeddings in the vector space. This allows for clustering and vector searching over participants and days, and the identification of activity deviations to aid with personalised delivery of care.

Accepted at ICLR 2024 Workshop on Learning from Time Series For Health: https://openreview.net/forum?id=mmxNNwxvWG

Learning ECG Signal Features Without Backpropagation Using Linear Laws 2024-12-20 Go

This paper introduces LLT-ECG, a novel method for electrocardiogram (ECG) signal classification that leverages concepts from theoretical physics to automatically generate features from time series data. Unlike traditional deep learning approaches, LLT-ECG operates in a forward manner, eliminating the need for backpropagation and hyperparameter tuning. By identifying linear laws that capture shared patterns within specific classes, the proposed method constructs a compact and verifiable representation, enhancing the effectiveness of downstream classifiers. We demonstrate LLT-ECG's state-of-the-art performance on real-world ECG datasets from PhysioNet, underscoring its potential for medical applications where speed and verifiability are crucial.

35 pages, 3 figures, 3 tables

CNN-LSTM Hybrid Deep Learning Model for Remaining Useful Life Estimation 2024-12-20 Go

Remaining Useful Life (RUL) of a component or a system is defined as the length from the current time to the end of the useful life. Accurate RUL estimation plays a crucial role in Predictive Maintenance applications. Traditional regression methods, both linear and non-linear, have struggled to achieve high accuracy in this domain. While Convolutional Neural Networks (CNNs) have shown improved accuracy, they often overlook the sequential nature of the data, relying instead on features derived from sliding windows. Since RUL prediction inherently involves multivariate time series analysis, robust sequence learning is essential. In this work, we propose a hybrid approach combining Convolutional Neural Networks with Long Short-Term Memory (LSTM) networks for RUL estimation. Although CNN-based LSTM models have been applied to sequence prediction tasks in financial forecasting, this is the first attempt to adopt this approach for RUL estimation in prognostics. In this approach, CNN is first employed to efficiently extract features from the data, followed by LSTM, which uses these extracted features to predict RUL. This method effectively leverages sensor sequence information, uncovering hidden patterns within the data, even under multiple operating conditions and fault scenarios. Our results demonstrate that the hybrid CNN-LSTM model achieves the highest accuracy, offering a superior score compared to the other methods.

conference paper
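
The described pipeline, convolutional feature extraction followed by LSTM sequence modeling and a regression head, can be sketched as follows. The layer sizes and sensor count are illustrative assumptions, not the paper's configuration:

```python
import torch
import torch.nn as nn

class CNNLSTMRul(nn.Module):
    """Sketch of the hybrid described above: 1-D convolutions extract local
    features from multivariate sensor windows, an LSTM models their order,
    and a linear head regresses the remaining useful life."""

    def __init__(self, n_sensors=14, hidden=64):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv1d(n_sensors, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=5, padding=2), nn.ReLU(),
        )
        self.lstm = nn.LSTM(64, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):                    # x: (batch, window, n_sensors)
        f = self.cnn(x.transpose(1, 2))      # convolve over time: (batch, 64, window)
        h, _ = self.lstm(f.transpose(1, 2))  # back to (batch, window, 64)
        return self.head(h[:, -1])           # RUL from the last hidden state

rul = CNNLSTMRul()(torch.randn(8, 30, 14))
print(rul.shape)                              # torch.Size([8, 1])
```
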
Using matrix-product states for time-series machine learning 2024-12-20 Go
Show

Matrix-product states (MPS) have proven to be a versatile ansatz for modeling quantum many-body physics. For many applications, and particularly in one-dimension, they capture relevant quantum correlations in many-body wavefunctions while remaining tractable to store and manipulate on a classical computer. This has motivated researchers to also apply the MPS ansatz to machine learning (ML) problems where capturing complex correlations in datasets is also a key requirement. Here, we develop and apply an MPS-based algorithm, MPSTime, for learning a joint probability distribution underlying an observed time-series dataset, and show how it can be used to tackle important time-series ML problems, including classification and imputation. MPSTime can efficiently learn complicated time-series probability distributions directly from data, requires only moderate maximum MPS bond dimension $\chi_{\rm max}$, with values for our applications ranging between $\chi_{\rm max} = 20-150$, and can be trained for both classification and imputation tasks under a single logarithmic loss function. Using synthetic and publicly available real-world datasets, spanning applications in medicine, energy, and astronomy, we demonstrate performance competitive with state-of-the-art ML approaches, but with the key advantage of encoding the full joint probability distribution learned from the data. By sampling from the joint probability distribution and calculating its conditional entanglement entropy, we show how its underlying structure can be uncovered and interpreted. This manuscript is supplemented with the release of a publicly available code package MPSTime that implements our approach. The efficiency of the MPS-based ansatz for learning complex correlation structures from time-series data is likely to underpin interpretable advances to challenging time-series ML problems across science, industry, and medicine.

27 pages, 13 figures
Learned Compression of Nonlinear Time Series With Random Access 2024-12-20 Go
Show

Time series play a crucial role in many fields, including finance, healthcare, industry, and environmental monitoring. The storage and retrieval of time series can be challenging due to their unstoppable growth. In fact, these applications often sacrifice precious historical data to make room for new data. General-purpose compressors can mitigate this problem with their good compression ratios, but they lack efficient random access on compressed data, thus preventing real-time analyses. Ad-hoc streaming solutions, instead, typically optimise only for compression and decompression speed, while giving up compression effectiveness and random access functionality. Furthermore, all these methods lack awareness of certain special regularities of time series, whose trends over time can often be described by some linear and nonlinear functions. To address these issues, we introduce NeaTS, a randomly-accessible compression scheme that approximates the time series with a sequence of nonlinear functions of different kinds and shapes, carefully selected and placed by a partitioning algorithm to minimise the space. The approximation residuals are bounded, which allows storing them in little space and thus recovering the original data losslessly, or simply discarding them to obtain a lossy time series representation with maximum error guarantees. Our experiments show that NeaTS improves the compression ratio of the state-of-the-art lossy compressors that use linear or nonlinear functions (or both) by up to 14%. Compared to lossless compressors, NeaTS emerges as the only approach to date providing, simultaneously, compression ratios close to or better than the best existing compressors, a much faster decompression speed, and orders of magnitude more efficient random access, thus enabling the storage and real-time analysis of massive and ever-growing amounts of (historical) time series data.

Accepted for publication in Proceedings of the 41st IEEE International Conference on Data Engineering (ICDE 2025)
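
A much-simplified sketch of bounded-error piecewise approximation in the same spirit: it greedily grows linear segments whose maximum residual stays within a fixed bound. NeaTS itself selects among several nonlinear function families and partitions to minimise space, which this toy does not attempt.

```python
# Greedy bounded-error piecewise LINEAR approximation (simplified sketch).
import numpy as np

def greedy_pla(y, eps):
    """Split y into segments whose best-fit line errs by at most eps."""
    segments, start = [], 0
    while start < len(y):
        end = start + 1
        coef = (0.0, float(y[start]))          # constant fallback
        while end < len(y):
            x = np.arange(start, end + 1)
            a, b = np.polyfit(x, y[start:end + 1], 1)
            if np.abs(a * x + b - y[start:end + 1]).max() > eps:
                break                          # segment can't grow further
            coef, end = (a, b), end + 1
        segments.append((start, end, coef))    # y[t] ~ a*t + b on [start, end)
        start = end
    return segments

y = np.sin(np.linspace(0, 6, 500)) + 0.01 * np.random.randn(500)
segs = greedy_pla(y, eps=0.05)
print(len(segs), "segments for", len(y), "points")
```

Keeping the residuals alongside the segment coefficients would give lossless recovery; discarding them yields the lossy, max-error-bounded representation the abstract mentions.
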

Identifying Macro Conditional Independencies and Macro Total Effects in Summary Causal Graphs with Latent Confounding 2024-12-20 Go
Show

Understanding causal relations in dynamic systems is essential in epidemiology. While causal inference methods have been extensively studied, they often rely on fully specified causal graphs, which may not always be available in complex dynamic systems. Partially specified causal graphs, and in particular summary causal graphs (SCGs), provide a simplified representation of causal relations between time series when working with spatio-temporal data, omitting temporal information and focusing on causal structures between clusters of temporal variables. Unlike fully specified causal graphs, SCGs can contain cycles, which complicate their analysis and interpretation. In addition, their cluster-based nature introduces new challenges concerning the types of queries of interest: macro queries, which involve relationships between clusters represented as vertices in the graph, and micro queries, which pertain to relationships between variables that are not directly visible through the vertices of the graph. In this paper, we first clearly distinguish between macro conditional independencies and micro conditional independencies and between macro total effects and micro total effects. Then, we demonstrate the soundness and completeness of d-separation to identify macro conditional independencies in SCGs. Furthermore, we establish that the do-calculus is sound and complete for identifying macro total effects in SCGs. Finally, we give a graphical characterization for the non-identifiability of macro total effects in SCGs.

Accepted CI4TS Workshop at UAI2024. Accepted at AAAI25

PoisonCatcher: Revealing and Identifying LDP Poisoning Attacks in IIoT 2024-12-20 Go
Show

Local Differential Privacy (LDP) is widely adopted in the Industrial Internet of Things (IIoT) for its lightweight, decentralized, and scalable nature. However, its perturbation-based privacy mechanism makes it difficult to distinguish between uncontaminated and tainted data, encouraging adversaries to launch poisoning attacks. While LDP provides some resilience against minor poisoning, it lacks robustness in IIoT environments with dynamic networks and substantial real-time data flows. Effective countermeasures for such attacks are still underdeveloped. This work narrows the critical gap by revealing and identifying LDP poisoning attacks in IIoT. We begin by deepening the understanding of such attacks, revealing novel threats that arise from the interplay between LDP indistinguishability and IIoT complexity. This exploration uncovers a novel rule-poisoning attack and presents a general attack formulation by unifying it with input-poisoning and output-poisoning. Furthermore, two key attack impacts, i.e., Statistical Query Result (SQR) accuracy degradation and inter-dataset correlation disruption, along with two characteristics, unstable attack patterns and stealthy poisoned data, are revealed. From this, we propose PoisonCatcher, a four-stage solution that detects LDP poisoning attacks and identifies specific contaminated data points. It utilizes temporal similarity, attribute correlation, and time-series stability analysis to detect datasets exhibiting SQR accuracy degradation, inter-dataset disruptions, and unstable patterns. Enhanced feature engineering is used to extract subtle poisoning signatures, enabling machine learning models to identify specific contamination. Experimental evaluations show its effectiveness, achieving state-of-the-art performance with average precision and recall rates of 86.17% and 97.5%, respectively, across six representative attack scenarios.

12 pages, 5 figures, 3 tables

System Safety Monitoring of Learned Components Using Temporal Metric Forecasting 2024-12-20 Go
Show

In learning-enabled autonomous systems, safety monitoring of learned components is crucial to ensure their outputs do not lead to system safety violations, given the operational context of the system. However, developing a safety monitor for practical deployment in real-world applications is challenging. This is due to limited access to internal workings and training data of the learned component. Furthermore, safety monitors should predict safety violations with low latency, while consuming a reasonable amount of computation. To address these challenges, we propose a safety monitoring method based on probabilistic time series forecasting. Given the learned component outputs and an operational context, we empirically investigate different Deep Learning (DL)-based probabilistic forecasting methods to predict the objective measure capturing the satisfaction or violation of a safety requirement (safety metric). We empirically evaluate the safety metric and violation prediction accuracy, as well as the inference latency and resource usage, of four state-of-the-art models, with varying horizons, using autonomous aviation and autonomous driving case studies. Our results suggest that probabilistic forecasting of safety metrics, given learned component outputs and scenarios, is effective for safety monitoring. Furthermore, for both case studies, the Temporal Fusion Transformer (TFT) was the most accurate model for predicting imminent safety violations, with acceptable latency and resource consumption.

Accepted for publication by ACM Transactions on Software Engineering and Methodology (TOSEM)

Zero Shot Time Series Forecasting Using Kolmogorov Arnold Networks 2024-12-19 Go
Show

Accurate energy price forecasting is crucial for participants in day-ahead energy markets, as it significantly influences their decision-making processes. While machine learning-based approaches have shown promise in enhancing these forecasts, they often remain confined to the specific markets on which they are trained, thereby limiting their adaptability to new or unseen markets. In this paper, we introduce a cross-domain adaptation model designed to forecast energy prices by learning market-invariant representations across different markets during the training phase. We propose a doubly residual N-BEATS network with Kolmogorov-Arnold networks at its core for time series forecasting. These networks, grounded in the Kolmogorov-Arnold representation theorem, offer a powerful way to approximate multivariate continuous functions. The cross-domain adaptation model was trained within an adversarial framework. The model's effectiveness was tested in predicting day-ahead electricity prices in a zero-shot fashion. In comparison with baseline models, our proposed framework shows promising results. By leveraging the Kolmogorov-Arnold networks, our model can potentially enhance its ability to capture complex patterns in energy price data, thus improving forecast accuracy across diverse market conditions. This addition not only enriches the model's representational capacity but also contributes to a more robust and flexible forecasting tool adaptable to various energy markets.

Published In: 2024 NeurIPS Workshop on Time Series in the Age of Large Models

Granger Causality Detection with Kolmogorov-Arnold Networks 2024-12-19 Go
Show

Discovering causal relationships in time series data is central in many scientific areas, ranging from economics to climate science. Granger causality is a powerful tool for causality detection. However, its original formulation is limited by its linear form, and only recently have nonlinear machine-learning generalizations been introduced. This study contributes to the definition of neural Granger causality models by investigating the application of Kolmogorov-Arnold networks (KANs) to Granger causality detection and comparing their capabilities against multilayer perceptrons (MLPs). In this work, we develop a framework called Granger Causality KAN (GC-KAN) along with a tailored training approach designed specifically for Granger causality detection. We test this framework on both Vector Autoregressive (VAR) models and chaotic Lorenz-96 systems, analysing the ability of KANs to sparsify input features by identifying Granger causal relationships, providing a concise yet accurate model for Granger causality detection. Our findings show the potential of KANs to outperform MLPs in discerning interpretable Granger causal relationships, particularly in identifying sparse Granger causality patterns in high-dimensional settings, and, more generally, the potential of AI in causality discovery for the dynamical laws in physical systems.

8 pages, 2 figures, 2 tables
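
For contrast with the neural GC-KAN framework, the classical linear Granger test runs in a few lines with statsmodels; the synthetic system below makes x Granger-cause y at lag 2.

```python
# Classical linear Granger causality test (the baseline GC-KAN generalizes).
import numpy as np
from statsmodels.tsa.stattools import grangercausalitytests

rng = np.random.default_rng(0)
n = 500
x = rng.standard_normal(n)
y = np.zeros(n)
for t in range(2, n):                      # y depends on x lagged by 2
    y[t] = 0.6 * y[t - 1] + 0.8 * x[t - 2] + 0.1 * rng.standard_normal()

# Column order is (effect, cause); small p-values reject the null that
# x does not Granger-cause y at the tested lags.
res = grangercausalitytests(np.column_stack([y, x]), maxlag=3)
```
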

Noisereduce: Domain General Noise Reduction for Time Series Signals 2024-12-19 Go
Show

Extracting signals from noisy backgrounds is a fundamental problem in signal processing across a variety of domains. In this paper, we introduce Noisereduce, an algorithm for minimizing noise across a variety of domains, including speech, bioacoustics, neurophysiology, and seismology. Noisereduce uses spectral gating to estimate a frequency-domain mask that effectively separates signals from noise. It is fast, lightweight, requires no training data, and handles both stationary and non-stationary noise, making it both a versatile tool and a convenient baseline for comparison with domain-specific applications. We provide a detailed overview of Noisereduce and evaluate its performance on a variety of time-domain signals.

Python library: https://github.com/timsainb/noisereduce or pip install noisereduce
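
Typical usage of the released library is a few lines; the WAV file name below is a placeholder, and the keyword arguments shown (y, sr, stationary) are part of the library's documented API.

```python
# Usage sketch for the noisereduce package (pip install noisereduce).
import noisereduce as nr
from scipy.io import wavfile

rate, data = wavfile.read("recording.wav")     # placeholder input file
# Spectral gating with the noise profile estimated from the signal itself;
# set stationary=True for the stationary-noise mode.
reduced = nr.reduce_noise(y=data, sr=rate, stationary=False)
wavfile.write("recording_denoised.wav", rate, reduced)  # float32 output
```
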

DroughtSet: Understanding Drought Through Spatial-Temporal Learning 2024-12-19 Go
Show

Drought is one of the most destructive and expensive natural disasters, severely impacting natural resources and risks by depleting water resources and diminishing agricultural yields. Under climate change, accurately predicting drought is critical for mitigating drought-induced risks. However, the intricate interplay among the physical and biological drivers that regulate droughts limits the predictability and understanding of drought, particularly at a subseasonal to seasonal (S2S) time scale. While deep learning has been demonstrated with potential in addressing climate forecasting challenges, its application to drought prediction has received relatively less attention. In this work, we propose a new dataset, DroughtSet, which integrates relevant predictive features and three drought indices from multiple remote sensing and reanalysis datasets across the contiguous United States (CONUS). DroughtSet specifically provides the machine learning community with a new real-world dataset to benchmark drought prediction models and more generally, time-series forecasting methods. Furthermore, we propose a spatial-temporal model SPDrought to predict and interpret S2S droughts. Our model learns from the spatial and temporal information of physical and biological features to predict three types of droughts simultaneously. Multiple strategies are employed to quantify the importance of physical and biological features for drought prediction. Our results provide insights for researchers to better understand the predictability and sensitivity of drought to biological and physical conditions. We aim to contribute to the climate field by proposing a new tool to predict and understand the occurrence of droughts and provide the AI community with a new benchmark to study deep learning applications in climate science.

Accepted by AAAI25
Enhancing Masked Time-Series Modeling via Dropping Patches 2024-12-19 Go
Show

This paper explores how to enhance existing masked time-series modeling by randomly dropping sub-sequence level patches of time series. On this basis, a simple yet effective method named DropPatch is proposed, which has two remarkable advantages: 1) It improves the pre-training efficiency by a square-level advantage; 2) It provides additional advantages for modeling in scenarios such as in-domain, cross-domain, few-shot learning and cold start. This paper conducts comprehensive experiments to verify the effectiveness of the method and analyze its internal mechanism. Empirically, DropPatch strengthens the attention mechanism, reduces information redundancy and serves as an efficient means of data augmentation. Theoretically, it is proved that DropPatch slows down the rate at which the Transformer representations collapse into the rank-1 linear subspace by randomly dropping patches, thus optimizing the quality of the learned representations.
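
A minimal sketch of the core operation, randomly dropping patch tokens before masked modeling; the keep ratio and tensor shapes are illustrative assumptions, not the paper's configuration.

```python
# Random sub-sequence patch dropping (illustrative sketch).
import torch

def drop_patches(tokens, keep_ratio=0.5):
    """Randomly keep a subset of patch tokens per sample.

    tokens: (batch, n_patches, dim) patch embeddings.
    Returns the kept tokens and their original positions.
    """
    b, n, d = tokens.shape
    n_keep = max(1, int(n * keep_ratio))
    scores = torch.rand(b, n, device=tokens.device)
    keep_idx = scores.argsort(dim=1)[:, :n_keep]   # random subset per sample
    kept = torch.gather(tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, d))
    return kept, keep_idx

tokens = torch.randn(4, 32, 128)    # 32 patches of a series, embedding dim 128
kept, idx = drop_patches(tokens)
print(kept.shape)                   # torch.Size([4, 16, 128])
```
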

PCA-Featured Transformer for Jamming Detection in 5G UAV Networks 2024-12-19 Go
Show

Jamming attacks pose a threat to Unmanned Aerial Vehicle (UAV) wireless communication systems, potentially disrupting essential services and compromising network reliability. Current detection approaches struggle with sophisticated artificial intelligence (AI) jamming techniques that adapt their patterns, while existing machine learning solutions often require extensive feature engineering and fail to capture complex temporal dependencies in attack signatures. Furthermore, 5G networks using either Time Division Duplex (TDD) or Frequency Division Duplex (FDD) methods can face service degradation from intentional interference sources. To address these challenges, we present a novel transformer-based deep learning framework for jamming detection with Principal Component Analysis (PCA)-added features. Our architecture leverages the transformer's self-attention mechanism to capture complex temporal dependencies and spatial correlations in wireless signal characteristics, enabling more robust jamming detection. The U-shaped model incorporates a modified transformer encoder that processes signal features, including received signal strength indicator (RSSI) and signal-to-interference-plus-noise ratio (SINR) measurements, alongside a specialized positional encoding scheme that accounts for the periodic nature of wireless signals. In addition, we propose a batch size scheduler and implement chunking techniques to optimize training convergence for time series data. These advancements contribute to achieving up to a tenfold improvement in training speed within the advanced U-shaped encoder-decoder model introduced. Simulation results demonstrate that our approach achieves a detection accuracy of 90.33% in Line-of-Sight (LoS) and 84.35% in non-Line-of-Sight (NLoS) conditions, and outperforms machine learning methods and existing deep learning solutions such as the XGBoost (XGB) classifier by approximately 4%.

Clustering of timed sequences -- Application to the analysis of care pathways 2024-12-19 Go
Show

Improving the future of healthcare starts by better understanding the current actual practices in hospital settings. This motivates the objective of discovering typical care pathways from patient data. Revealing typical care pathways can be achieved through clustering. The difficulty in clustering care pathways, represented by sequences of timestamped events, lies in defining a semantically appropriate metric and clustering algorithms. In this article, we adapt two methods developed for time series to the clustering of timed sequences: the drop-DTW metric and the DBA approach for the construction of averaged time sequences. These methods are then applied in clustering algorithms to propose original and sound clustering algorithms for timed sequences. This approach is experimented with and evaluated on synthetic and real-world data.

Unveiling social vibrancy in urban spaces with app usage 2024-12-19 Go
Show

Urban vibrancy is an important measure of the energetic nature of a city that is related to why and how people use urban spaces, and it is inherently connected with our social behaviour. Increasingly, people use a wide range of mobile phone apps in their daily lives to connect socially, search for information, make decisions, and arrange travel, amongst many other reasons. However, the relationship between online app usage and urban vibrancy remains unclear, particularly regarding how sociospatial behaviours interact with urban features. Here, we use app-usage data as a digital signature to investigate this question. To do this, we use a high-resolution data source of mobile service-level traffic volumes across eighteen cities in France. We investigate the social component of cities using socially relevant urban features constructed from OpenStreetMap 'Points of Interest'. We developed a methodology for identifying and classifying multidimensional app usage time series based on similarity. We used these in predictive models to interpret the results for each city and across France. Across cities, there were spatial behavioural archetypes, characterised by multidimensional properties. We found patterns between the week and the weekend, and across cities, and the country. These archetypes correspond to changes in socially relevant urban features that impact urban vibrancy. Our results add further evidence for the importance of using computational approaches to understand urban environments, the use of sociological concepts in computational science, and urban vibrancy in cities.

40 pages, 10 figures. Submitted to Computers, Environment and Urban Systems

Hybridization of Persistent Homology with Neural Networks for Time-Series Prediction: A Case Study in Wave Height 2024-12-19 Go
Show

Time-series prediction is an active area of research across various fields, often challenged by the fluctuating influence of short-term and long-term factors. In this study, we introduce a feature engineering method that enhances the predictive performance of neural network models. Specifically, we leverage computational topology techniques to derive valuable topological features from input data, boosting the predictive accuracy of our models. Our focus is on predicting wave heights, utilizing models based on topological features within feedforward neural networks (FNNs), recurrent neural networks (RNNs), long short-term memory networks (LSTM), and RNNs with gated recurrent units (GRU). For time-ahead predictions, the enhancements in $R^2$ score were significant for FNNs, RNNs, LSTM, and GRU models. Additionally, these models also showed significant reductions in maximum errors and mean squared errors.

the paper contains errors

A Comprehensive Forecasting Framework based on Multi-Stage Hierarchical Forecasting Reconciliation and Adjustment 2024-12-19 Go
Show

Ads demand forecasting for Walmart's ad products plays a critical role in enabling effective resource planning, allocation, and management of ads performance. In this paper, we introduce a comprehensive demand forecasting system that tackles hierarchical time series forecasting in business settings. Though traditional hierarchical reconciliation methods ensure forecasting coherence, they often trade off accuracy for coherence, especially at lower levels, and fail to capture the seasonality unique to each time-series in the hierarchy. Thus, we propose a novel framework "Multi-Stage Hierarchical Forecasting Reconciliation and Adjustment (Multi-Stage HiFoReAd)" to address the challenges of preserving seasonality, ensuring coherence, and improving accuracy. Our system first utilizes diverse models, ensembled through Bayesian Optimization (BO), achieving base forecasts. The generated base forecasts are then passed into the Multi-Stage HiFoReAd framework. The initial stage refines the hierarchy using Top-Down forecasts and "harmonic alignment." The second stage aligns the higher levels' forecasts using the MinTrace algorithm, following which the last two levels undergo "harmonic alignment" and "stratified scaling", to eventually achieve accurate and coherent forecasts across the whole hierarchy. Our experiments on Walmart's internal Ads-demand dataset and 3 other public datasets, each with 4 hierarchical levels, demonstrate that the average Absolute Percentage Error from the cross-validation sets improves from 3% to 40% across levels against the BO-ensemble of models (LGBM, MSTL+ETS, Prophet), as well as from 1.2% to 92.9% against state-of-the-art models. In addition, the forecasts at all hierarchical levels are proved to be coherent. The proposed framework has been deployed and leveraged by Walmart's ads, sales and operations teams to track future demands, make informed decisions and plan resources.

Published in 2024 IEEE International Conference on Big Data (BigData)
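
For readers new to reconciliation, the toy below shows the simplest (OLS) projection that makes incoherent base forecasts sum correctly over a two-level hierarchy; the Multi-Stage HiFoReAd pipeline above layers top-down refinement, MinTrace, harmonic alignment, and stratified scaling on top of this basic idea. The hierarchy and numbers are toy assumptions.

```python
# OLS forecast reconciliation (the identity-weight special case of MinT).
import numpy as np

# Hierarchy: total = A + B. Rows of S map bottom series to all levels.
S = np.array([[1, 1],     # total
              [1, 0],     # A
              [0, 1]])    # B

base = np.array([105.0, 40.0, 58.0])   # incoherent: 40 + 58 != 105

# Project base forecasts onto the coherent subspace spanned by S.
P = S @ np.linalg.inv(S.T @ S) @ S.T
coherent = P @ base
print(coherent)    # first entry now equals the sum of the other two
```
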

Learning Deep Dissipative Dynamics 2024-12-19 Go
Show

This study addresses the challenge of strictly guaranteeing "dissipativity" of a dynamical system represented by neural networks learned from given time-series data. Dissipativity is a crucial indicator for dynamical systems that generalizes stability and input-output stability, known to be valid across various systems including robotics, biological systems, and molecular dynamics. By analytically proving the general solution to the nonlinear Kalman-Yakubovich-Popov (KYP) lemma, which is the necessary and sufficient condition for dissipativity, we propose a differentiable projection that transforms any dynamics represented by neural networks into dissipative ones, and a learning method for the transformed dynamics. Utilizing the generality of dissipativity, our method strictly guarantees stability, input-output stability, and energy conservation of trained dynamical systems. Finally, we demonstrate the robustness of our method against out-of-domain input through applications to robotic arms and fluid dynamics. Code is available at https://github.com/kojima-r/DeepDissipativeModel

AAAI 2025
DualDynamics: Synergizing Implicit and Explicit Methods for Robust Irregular Time Series Analysis 2024-12-19 Go
Show

Real-world time series analysis faces significant challenges when dealing with irregular and incomplete data. While Neural Differential Equation (NDE)-based methods have shown promise, they struggle with limited expressiveness, scalability issues, and stability concerns. Conversely, Neural Flows offer stability but falter with irregular data. We introduce 'DualDynamics', a novel framework that synergistically combines an NDE-based method and a Neural Flow-based method. This approach enhances expressive power while balancing computational demands, addressing critical limitations of existing techniques. We demonstrate DualDynamics' effectiveness across diverse tasks: classification with robustness to dataset shift, irregularly-sampled series analysis, interpolation of missing data, and forecasting with partial observations. Our results show consistent outperformance over state-of-the-art methods, indicating DualDynamics' potential to advance irregular time series analysis significantly.

Published at the 39th Annual AAAI Conference on Artificial Intelligence (AAAI 2025)

Leveraging Time Series Categorization and Temporal Fusion Transformers to Improve Cryptocurrency Price Forecasting 2024-12-19 Go
Show

Organizing and managing cryptocurrency portfolios and decision-making on transactions is crucial in this market. Optimal selection of assets is one of the main challenges that requires accurate prediction of the price of cryptocurrencies. In this work, we categorize the financial time series into several similar subseries to increase prediction accuracy by learning each subseries category with similar behavior. For each category of the subseries, we create a deep learning model based on the attention mechanism to predict the next step of each subseries. Due to the limited amount of cryptocurrency data for training models, if the number of categories increases, the amount of training data for each model will decrease, and some complex models will not be trained well due to the large number of parameters. To overcome this challenge, we propose to combine the time series data of other cryptocurrencies to increase the amount of data for each category, hence increasing the accuracy of the models corresponding to each category.

Cherry-Picking in Time Series Forecasting: How to Select Datasets to Make Your Model Shine 2024-12-19 Go
Show

The importance of time series forecasting drives continuous research and the development of new approaches to tackle this problem. Typically, these methods are introduced through empirical studies that frequently claim superior accuracy for the proposed approaches. Nevertheless, concerns are rising about the reliability and generalizability of these results due to limitations in experimental setups. This paper addresses a critical limitation: the number and representativeness of the datasets used. We investigate the impact of dataset selection bias, particularly the practice of cherry-picking datasets, on the performance evaluation of forecasting methods. Through empirical analysis with a diverse set of benchmark datasets, our findings reveal that cherry-picking datasets can significantly distort the perceived performance of methods, often exaggerating their effectiveness. Furthermore, our results demonstrate that by selectively choosing just four datasets - what most studies report - 46% of methods could be deemed best in class, and 77% could rank within the top three. Additionally, recent deep learning-based approaches show high sensitivity to dataset selection, whereas classical methods exhibit greater robustness. Finally, our results indicate that, when empirically validating forecasting algorithms on a subset of the benchmarks, increasing the number of datasets tested from 3 to 6 reduces the risk of incorrectly identifying an algorithm as the best one by approximately 40%. Our study highlights the critical need for comprehensive evaluation frameworks that more accurately reflect real-world scenarios. Adopting such frameworks will ensure the development of robust and reliable forecasting methods.

Proceedings of the 39th AAAI Conference on Artificial Intelligence (AAAI-25), February 25-March 4, 2025, Philadelphia, Pennsylvania, USA

DTW+S: Shape-based Comparison of Time-series with Ordered Local Trend 2024-12-18 Go
Show

Measuring distance or similarity between time-series data is a fundamental aspect of many applications including classification, clustering, and ensembling/alignment. Existing measures may fail to capture similarities among local trends (shapes) and may even produce misleading results. Our goal is to develop a measure that looks for similar trends occurring around similar times and is easily interpretable for researchers in applied domains. This is particularly useful for applications where time-series have a sequence of meaningful local trends that are ordered, such as in epidemics (a surge to an increase to a peak to a decrease). We propose a novel measure, DTW+S, which creates an interpretable "closeness-preserving" matrix representation of the time-series, where each column represents local trends, and then it applies Dynamic Time Warping to compute distances between these matrices. We present a theoretical analysis that supports the choice of this representation. We demonstrate the utility of DTW+S in several tasks. For the clustering of epidemic curves, we show that DTW+S is the only measure able to produce good clustering compared to the baselines. For ensemble building, we propose a combination of DTW+S and barycenter averaging that results in the best preservation of characteristics of the underlying trajectories. We also demonstrate that our approach results in better classification compared to Dynamic Time Warping for a class of datasets, particularly when local trends rather than scale play a decisive role.

Longer version of the paper "Aligning Time-series by Local Trends: Applications in Public Health" accepted at The 39th Annual AAAI Conference on Artificial Intelligence (AAAI 2025)
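
For reference, a plain DTW distance is a few lines of dynamic programming; DTW+S applies this warping to the columns of an interpretable shape-descriptor matrix rather than to raw values.

```python
# Classic O(len(a) * len(b)) DTW with absolute-difference cost.
import numpy as np

def dtw(a, b):
    """Dynamic-time-warping distance between two 1D sequences."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

t = np.linspace(0, 2 * np.pi, 100)
print(dtw(np.sin(t), np.sin(t + 0.3)))   # small despite the phase shift
```
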

Spatio-Temporal Forecasting of PM2.5 via Spatial-Diffusion guided Encoder-Decoder Architecture 2024-12-18 Go
Show

In many problem settings that require spatio-temporal forecasting, the values in the time-series not only exhibit spatio-temporal correlations but are also influenced by spatial diffusion across locations. One such example is forecasting the concentration of fine particulate matter (PM2.5) in the atmosphere, which is influenced by many complex factors, the most important ones being diffusion due to meteorological factors as well as transport across vast distances over a period of time. We present a novel Spatio-Temporal Graph Neural Network architecture that specifically captures these dependencies to forecast the PM2.5 concentration. Our model is based on an encoder-decoder architecture where the encoder and decoder parts leverage gated recurrent units (GRU) augmented with a graph neural network (TransformerConv) to account for spatial diffusion. Our model can also be seen as a generalization of various existing models for time-series or spatio-temporal forecasting. We demonstrate the model's effectiveness on two real-world PM2.5 datasets: (1) data collected by us using a recently deployed network of low-cost PM2.5 sensors from 511 locations spanning the entirety of the Indian state of Bihar over a period of one year, and (2) another publicly available dataset that covers severely polluted regions in China for a period of 4 years. Our experimental results show our model's impressive ability to account for both spatial as well as temporal dependencies precisely.

9 pages, 4 figures, International Conference on Data Science and Management of Data (CODS-COMAD), IIT Jodhpur, 2024

TimeCMA: Towards LLM-Empowered Multivariate Time Series Forecasting via Cross-Modality Alignment 2024-12-18 Go
Show

Multivariate time series forecasting (MTSF) aims to learn temporal dynamics among variables to forecast future time series. Existing statistical and deep learning-based methods suffer from limited learnable parameters and small-scale training data. Recently, large language models (LLMs) combining time series with textual prompts have achieved promising performance in MTSF. However, we discovered that current LLM-based solutions fall short in learning disentangled embeddings. We introduce TimeCMA, an intuitive yet effective framework for MTSF via cross-modality alignment. Specifically, we present a dual-modality encoding with two branches: the time series encoding branch extracts disentangled yet weak time series embeddings, and the LLM-empowered encoding branch wraps the same time series with text as prompts to obtain entangled yet robust prompt embeddings. As a result, such a cross-modality alignment retrieves both disentangled and robust time series embeddings, "the best of both worlds", from the prompt embeddings based on time series and prompt modality similarities. As another key design, to reduce the computational costs of time series with their lengthy textual prompts, we design an effective prompt to encourage the most essential temporal information to be encapsulated in the last token: only the last token is passed to downstream prediction. We further store the last token embeddings to accelerate inference speed. Extensive experiments on eight real datasets demonstrate that TimeCMA outperforms state-of-the-art methods.

Accepted by AAAI 2025 (Main Technical Track)

QuLTSF: Long-Term Time Series Forecasting with Quantum Machine Learning 2024-12-18 Go
Show

Long-term time series forecasting (LTSF) involves predicting a large number of future values of a time series based on past values and is an essential task in a wide range of domains, including weather forecasting, stock market analysis, and disease outbreak prediction. Over the decades, LTSF algorithms have transitioned from statistical models to deep learning models such as transformers. Despite the complex architecture of transformer-based LTSF models, "Are Transformers Effective for Time Series Forecasting?" (Zeng et al., 2023) showed that simple linear models can outperform the state-of-the-art transformer-based LTSF models. Recently, quantum machine learning (QML) has been evolving as a domain to enhance the capabilities of classical machine learning models. In this paper, we initiate the application of QML to LTSF problems by proposing QuLTSF, a simple hybrid QML model for multivariate LTSF. Through extensive experiments on a widely used weather dataset, we show the advantages of QuLTSF over state-of-the-art classical linear models in terms of reduced mean squared error and mean absolute error.

submitted for conference publication

Context Matters: Leveraging Contextual Features for Time Series Forecasting 2024-12-18 Go
Show

Time series forecasts are often influenced by exogenous contextual features in addition to their corresponding history. For example, in financial settings, it is hard to accurately predict a stock price without considering public sentiments and policy decisions in the form of news articles, tweets, etc. Though this is common knowledge, the current state-of-the-art (SOTA) forecasting models fail to incorporate such contextual information, owing to its heterogeneity and multimodal nature. To address this, we introduce ContextFormer, a novel plug-and-play method to surgically integrate multimodal contextual information into existing pre-trained forecasting models. ContextFormer effectively distills forecast-specific information from rich multimodal contexts, including categorical, continuous, time-varying, and even textual information, to significantly enhance the performance of existing base forecasters. ContextFormer outperforms SOTA forecasting models by up to 30% on a range of real-world datasets spanning energy, traffic, environmental, and financial domains.

AnchorInv: Few-Shot Class-Incremental Learning of Physiological Signals via Representation Space Guided Inversion 2024-12-18 Go
Show

Deep learning models have demonstrated exceptional performance in a variety of real-world applications. These successes are often attributed to strong base models that can generalize to novel tasks with limited supporting data while keeping prior knowledge intact. However, these impressive results are based on the availability of a large amount of high-quality data, which is often lacking in specialized biomedical applications. In such fields, models are usually developed with limited data that arrive incrementally with novel categories. This requires the model to adapt to new information while preserving existing knowledge. Few-Shot Class-Incremental Learning (FSCIL) methods offer a promising approach to addressing these challenges, but they also depend on strong base models that face the same aforementioned limitations. To overcome these constraints, we propose AnchorInv following the straightforward and efficient buffer-replay strategy. Instead of selecting and storing raw data, AnchorInv generates synthetic samples guided by anchor points in the feature space. This approach protects privacy and regularizes the model for adaptation. When evaluated on three public physiological time series datasets, AnchorInv exhibits efficient knowledge forgetting prevention and improved adaptation to novel classes, surpassing state-of-the-art baselines.

AAAI-25 Extended Version

PreMixer: MLP-Based Pre-training Enhanced MLP-Mixers for Large-scale Traffic Forecasting 2024-12-18 Go
Show

In urban computing, precise and swift forecasting of multivariate time series data from traffic networks is crucial. This data incorporates additional spatial contexts such as sensor placements and road network layouts, and exhibits complex temporal patterns that amplify challenges for predictive learning in traffic management, smart mobility demand, and urban planning. Consequently, there is an increasing need to forecast traffic flow across broader geographic regions and for higher temporal coverage. However, current research encounters limitations because of the inherent inefficiency of existing models and their unsuitability for large-scale traffic network applications due to model complexity. This paper proposes a novel framework, named PreMixer, designed to bridge this gap. It features a predictive model and a pre-training mechanism, both based on the principles of Multi-Layer Perceptrons (MLPs). PreMixer comprehensively considers the temporal dependencies of traffic patterns in different time windows and processes the spatial dynamics as well. Additionally, we integrate spatio-temporal positional encoding to manage spatiotemporal heterogeneity without relying on predefined graphs. Furthermore, our innovative pre-training model uses a simple patch-wise MLP to conduct masked time series modeling, learning from long-term historical data segmented into patches to generate enriched contextual representations. This approach enhances the downstream forecasting model without incurring significant time consumption or computational resource demands, owing to improved learning efficiency and data handling flexibility. Our framework achieves comparable state-of-the-art performance while maintaining high computational efficiency, as verified by extensive experiments on large-scale traffic datasets.

Timer-XL: Long-Context Transformers for Unified Time Series Forecasting 2024-12-18 Go
Show

We present Timer-XL, a generative Transformer for unified time series forecasting. To uniformly predict 1D and 2D time series, we generalize next token prediction, predominantly adopted for causal generation of 1D sequences, to multivariate next token prediction. The proposed paradigm uniformly formulates various forecasting scenarios as a long-context generation problem. We opt for the generative Transformer, which can capture global-range and causal dependencies while providing contextual flexibility, to implement unified forecasting on univariate series characterized by non-stationarity, multivariate time series with complicated dynamics and correlations, and covariate-informed contexts that include both endogenous and exogenous time series. Technically, we propose a universal TimeAttention to facilitate generative Transformers on multiple time series, which can effectively capture fine-grained intra- and inter-series dependencies of flattened time series tokens (patches), and is further enhanced by deftly designed position embeddings for the temporal and variable dimensions. Timer-XL achieves state-of-the-art performance across challenging forecasting benchmarks through a unified approach. Based on large-scale pre-training, Timer-XL also demonstrates notable zero-shot performance, making it a promising architecture for large time series models.

Geolocation Representation from Large Language Models are Generic Enhancers for Spatio-Temporal Learning 2024-12-18 Go
Show

In the geospatial domain, universal representation models are significantly less prevalent than their extensive use in natural language processing and computer vision. This discrepancy arises primarily from the high costs associated with the input of existing representation models, which often require street views and mobility data. To address this, we develop a novel, training-free method that leverages large language models (LLMs) and auxiliary map data from OpenStreetMap to derive geolocation representations (LLMGeovec). LLMGeovec can represent the geographic semantics at city, country, and global scales, which acts as a generic enhancer for spatio-temporal learning. Specifically, by direct feature concatenation, we introduce a simple yet effective paradigm for enhancing multiple spatio-temporal tasks including geographic prediction (GP), long-term time series forecasting (LTSF), and graph-based spatio-temporal forecasting (GSTF). LLMGeovec can seamlessly integrate into a wide spectrum of spatio-temporal learning models, providing immediate enhancements. Experimental results demonstrate that LLMGeovec achieves global coverage and significantly boosts the performance of leading GP, LTSF, and GSTF models. Our code is available at https://github.com/Umaruchain/LLMGeovec.

Accepted at AAAI25 main track
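
Because the enhancement paradigm is plain feature concatenation, integration is a one-liner; the random vectors below stand in for LLM-derived geolocation embeddings.

```python
# Direct feature concatenation, the enhancement paradigm described above.
import numpy as np

n_locations, d_task, d_geo = 100, 16, 32
task_features = np.random.randn(n_locations, d_task)   # model's own inputs
geo_embeddings = np.random.randn(n_locations, d_geo)   # LLMGeovec stand-in

# Concatenate per location; any downstream model consumes the wider input.
enhanced = np.concatenate([task_features, geo_embeddings], axis=1)
print(enhanced.shape)   # (100, 48)
```
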

RelCon: Relative Contrastive Learning for a Motion Foundation Model for Wearable Data 2024-12-17 Go
Show

We present RelCon, a novel self-supervised Relative Contrastive learning approach that uses a learnable distance measure in combination with a softened contrastive loss for training a motion foundation model from wearable sensors. The learnable distance measure captures motif similarity and domain-specific semantic information such as rotation invariance. The learned distance provides a measure of semantic similarity between a pair of accelerometer time-series segments, which is used to measure the distance between an anchor and various other sampled candidate segments. The self-supervised model is trained on 1 billion segments from 87,376 participants from a large wearables dataset. The model achieves strong performance across multiple downstream tasks, encompassing both classification and regression. To our knowledge, we are the first to show the generalizability of a self-supervised learning model with motion data from wearables across distinct evaluation tasks.

Synthetic Time Series Data Generation for Healthcare Applications: A PCG Case Study 2024-12-17 Go
Show

The generation of high-quality medical time series data is essential for advancing healthcare diagnostics and safeguarding patient privacy. Specifically, synthesizing realistic phonocardiogram (PCG) signals offers significant potential as a cost-effective and efficient tool for cardiac disease pre-screening. Despite its potential, the synthesis of PCG signals for this specific application received limited attention in research. In this study, we employ and compare three state-of-the-art generative models from different categories - WaveNet, DoppelGANger, and DiffWave - to generate high-quality PCG data. We use data from the George B. Moody PhysioNet Challenge 2022. Our methods are evaluated using various metrics widely used in the previous literature in the domain of time series data generation, such as mean absolute error and maximum mean discrepancy. Our results demonstrate that the generated PCG data closely resembles the original datasets, indicating the effectiveness of our generative models in producing realistic synthetic PCG data. In our future work, we plan to incorporate this method into a data augmentation pipeline to synthesize abnormal PCG signals with heart murmurs, in order to address the current scarcity of abnormal data. We hope to improve the robustness and accuracy of diagnostic tools in cardiology, enhancing their effectiveness in detecting heart murmurs.

TKAN: Temporal Kolmogorov-Arnold Networks 2024-12-17 Go
Show

Recurrent Neural Networks (RNNs) have revolutionized many areas of machine learning, particularly in natural language and data sequence processing. Long Short-Term Memory (LSTM) has demonstrated its ability to capture long-term dependencies in sequential data. Inspired by Kolmogorov-Arnold Networks (KANs), a promising alternative to Multi-Layer Perceptrons (MLPs), we propose a new neural network architecture inspired by KAN and the LSTM: the Temporal Kolmogorov-Arnold Networks (TKANs). TKANs combine the strengths of both networks; they are composed of Recurring Kolmogorov-Arnold Network (RKAN) layers that embed memory management. This innovation enables us to perform multi-step time series forecasting with enhanced accuracy and efficiency. By addressing the limitations of traditional models in handling complex sequential patterns, the TKAN architecture offers significant potential for advancements in fields requiring more than one-step-ahead forecasting.

Dual Interpretation of Machine Learning Forecasts 2024-12-17 Go
Show

Machine learning predictions are typically interpreted as the sum of contributions of predictors. Yet, each out-of-sample prediction can also be expressed as a linear combination of in-sample values of the predicted variable, with weights corresponding to pairwise proximity scores between current and past economic events. While this dual route leads nowhere in some contexts (e.g., large cross-sectional datasets), it provides sparser interpretations in settings with many regressors and little training data, like macroeconomic forecasting. In this case, the sequence of contributions can be visualized as a time series, allowing analysts to explain predictions as quantifiable combinations of historical analogies. Moreover, the weights can be viewed as those of a data portfolio, inspiring new diagnostic measures such as forecast concentration, short position, and turnover. We show how weights can be retrieved seamlessly for (kernel) ridge regression, random forest, boosted trees, and neural networks. Then, we apply these tools to analyze post-pandemic forecasts of inflation, GDP growth, and recession probabilities. In all cases, the approach opens the black box from a new angle and demonstrates how machine learning models leverage history partly repeating itself.
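
The ridge case makes the dual route concrete: every prediction is a weighted sum of in-sample target values, and the weight vector is available in closed form. The data below is synthetic, and the two printed diagnostics are simple stand-ins for the portfolio-style measures the abstract mentions.

```python
# Dual reading of a ridge forecast: prediction = w @ y_train, with w
# recovered in closed form. Synthetic data for illustration.
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 50))            # little data, many regressors
y = X @ rng.standard_normal(50) + 0.1 * rng.standard_normal(200)
x_new = rng.standard_normal(50)
lam = 1.0

A = np.linalg.solve(X.T @ X + lam * np.eye(50), X.T)  # (X'X + lam*I)^-1 X'

coefs = A @ y                     # primal view: ridge coefficients
pred_primal = x_new @ coefs

w = x_new @ A                     # dual view: weights over training targets
pred_dual = w @ y
assert np.isclose(pred_primal, pred_dual)

# Simple "data portfolio" diagnostics computed from w:
print("short position:", -w[w < 0].sum())
print("concentration :", np.sort(np.abs(w))[-10:].sum() / np.abs(w).sum())
```
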

The Temporal Vadalog System: Temporal Datalog-based Reasoning 2024-12-17 Go
Show

In the wake of the recent resurgence of the Datalog language of databases, together with its extensions for ontological reasoning settings, this work aims to bridge the gap between the theoretical studies of DatalogMTL (Datalog extended with metric temporal logic) and the development of production-ready reasoning systems. In particular, we lay out the functional and architectural desiderata of a modern reasoner and propose our system, Temporal Vadalog. Leveraging the vast amount of experience from the database community, we go beyond the typical chase-based implementations of reasoners, and propose a set of novel techniques and a system that adopts a modern data pipeline architecture. We discuss crucial architectural choices, such as how to guarantee termination when infinitely many time intervals are possibly generated, how to merge intervals, and how to sustain a limited memory footprint. We discuss advanced features of the system, such as the support for time series, and present an extensive experimental evaluation. This paper is a substantially extended version of "The Temporal Vadalog System" as presented at RuleML+RR '22. Under consideration in Theory and Practice of Logic Programming (TPLP).

LLM

Back to Index

Title Date Cool Paper Abstract Comment
A Paragraph is All It Takes: Rich Robot Behaviors from Interacting, Trusted LLMs 2024-12-24 Go
Show

Large Language Models (LLMs) are compact representations of all public knowledge of our physical environment and animal and human behaviors. The application of LLMs to robotics may offer a path to highly capable robots that perform well across most human tasks with limited or even zero tuning. Aside from increasingly sophisticated reasoning and task planning, networks of (suitably designed) LLMs offer ease of upgrading capabilities and allow humans to directly observe the robot's thinking. Here we explore the advantages, limitations, and particularities of using LLMs to control physical robots. The basic system consists of four LLMs communicating via a human language data bus implemented via web sockets and ROS2 message passing. Surprisingly, rich robot behaviors and good performance across different tasks could be achieved despite the robot's data fusion cycle running at only 1Hz and the central data bus running at the extremely limited rates of the human brain, of around 40 bits/s. The use of natural language for inter-LLM communication allowed the robot's reasoning and decision making to be directly observed by humans and made it trivial to bias the system's behavior with sets of rules written in plain English. These rules were immutably written into Ethereum, a global, public, and censorship resistant Turing-complete computer. We suggest that by using natural language as the data bus among interacting AIs, and immutable public ledgers to store behavior constraints, it is possible to build robots that combine unexpectedly rich performance, upgradability, and durable alignment with humans.

10 pages, 1 figure
How Well Do LLMs Generate Code for Different Application Domains? Benchmark and Evaluation 2024-12-24 Go
Show

Recently, an increasing number of AI-driven programming assistants powered by code LLMs have been integrated into various real-world software development environments, significantly boosting developer productivity. However, existing code generation benchmarks primarily focus on general-purpose scenarios, leaving the code generation performance of LLMs for specific application domains largely unknown. In this paper, we introduce a new benchmark, MultiCodeBench, to fill this gap. MultiCodeBench comprises 2,400 programming tasks, covering 12 popular software development domains and 15 programming languages. Specifically, we perform in-depth research to identify these 12 application domains. Given that each domain may involve multiple technical frameworks, and that different frameworks present distinct challenges in the coding process, we categorize the commonly used frameworks and platforms within each domain. We then sample programming problems from GitHub repositories related to these subdomains. To ensure the quality of the tasks and mitigate data leakage issues, we invite annotators to rewrite the docstrings for each task in MultiCodeBench. Additionally, we build a static analysis-based dependency parsing tool to extract the dependencies in the ground truth for each task, enabling deeper performance analysis. Through extensive experiments on MultiCodeBench with eleven representative mainstream LLMs, we reveal the code generation performance of the LLMs across different application domains, providing practical insights for developers in downstream fields when selecting LLMs. Furthermore, we analyze the reasons behind the models' failures in completing software application development tasks, offering guidance for model developers to enhance domain-specific code generation capabilities.

Can LLMs Obfuscate Code? A Systematic Analysis of Large Language Models into Assembly Code Obfuscation 2024-12-24 Go
Show

Malware authors often employ code obfuscations to make their malware harder to detect. Existing tools for generating obfuscated code often require access to the original source code (e.g., C++ or Java), and adding new obfuscations is a non-trivial, labor-intensive process. In this study, we ask the following question: Can Large Language Models (LLMs) potentially generate a new obfuscated assembly code? If so, this poses a risk to anti-virus engines and potentially increases the flexibility of attackers to create new obfuscation patterns. We answer this in the affirmative by developing the MetamorphASM benchmark comprising MetamorphASM Dataset (MAD) along with three code obfuscation techniques: dead code, register substitution, and control flow change. The MetamorphASM systematically evaluates the ability of LLMs to generate and analyze obfuscated code using MAD, which contains 328,200 obfuscated assembly code samples. We release this dataset and analyze the success rate of various LLMs (e.g., GPT-3.5/4, GPT-4o-mini, Starcoder, CodeGemma, CodeLlama, CodeT5, and LLaMA 3.1) in generating obfuscated assembly code. The evaluation was performed using established information-theoretic metrics and manual human review to ensure correctness and provide the foundation for researchers to study and develop remediations to this risk. The source code can be found at the following GitHub link: https://github.com/mohammadi-ali/MetamorphASM.

To appear in AAAI 2025, Main Track

Zero-resource Speech Translation and Recognition with LLMs 2024-12-24 Go
Show

Despite recent advancements in speech processing, zero-resource speech translation (ST) and automatic speech recognition (ASR) remain challenging problems. In this work, we propose to leverage a multilingual Large Language Model (LLM) to perform ST and ASR in languages for which the model has never seen paired audio-text data. We achieve this by using a pre-trained multilingual speech encoder, a multilingual LLM, and a lightweight adaptation module that maps the audio representations to the token embedding space of the LLM. We perform several experiments in both ST and ASR to understand how to best train the model and what data has the most impact on performance in previously unseen languages. In ST, our best model is capable of achieving BLEU scores over 23 on CoVoST2 for two previously unseen languages, while in ASR, we achieve WERs of up to 28.2%. We finally show that the performance of our system is bounded by the ability of the LLM to output text in the desired language.

ICASSP 2025, 5 pages, 2 figures, 2 tables

Token-Budget-Aware LLM Reasoning 2024-12-24 Go
Show

Reasoning is critical for large language models (LLMs) to excel in a wide range of tasks. While methods like Chain-of-Thought (CoT) reasoning enhance LLM performance by decomposing problems into intermediate steps, they also incur significant overhead in token usage, leading to increased costs. We find that the reasoning process of current LLMs is unnecessarily lengthy and it can be compressed by including a reasonable token budget in the prompt, but the choice of token budget plays a crucial role in the actual compression effectiveness. We then propose a token-budget-aware LLM reasoning framework, which dynamically estimates token budgets for different problems based on reasoning complexity and uses the estimated token budgets to guide the reasoning process. Experiments show that our method effectively reduces token costs in CoT reasoning with only a slight performance reduction, offering a practical solution to balance efficiency and accuracy in LLM reasoning. Code: https://github.com/GeniusHTX/TALE.
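
For readers who want the flavor of the idea, here is a minimal Python sketch of budget-aware prompting; the budget heuristic and the `complete` client below are illustrative assumptions, not the paper's estimator (see the TALE repository above for the real method).

```python
# A minimal sketch of budget-aware prompting in the spirit of TALE.
def complete(prompt: str) -> str:
    raise NotImplementedError("plug in any chat-completion client here")

def estimate_budget(problem: str) -> int:
    # The paper estimates budgets from reasoning complexity; as a crude
    # stand-in, scale the budget with problem length.
    return max(50, min(500, 10 * len(problem.split())))

def budgeted_cot(problem: str) -> str:
    budget = estimate_budget(problem)
    prompt = (f"{problem}\n"
              f"Let's think step by step and use less than {budget} tokens.")
    return complete(prompt)
```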

Think or Remember? Detecting and Directing LLMs Towards Memorization or Generalization 2024-12-24 Go
Show

In this paper, we explore the foundational mechanisms of memorization and generalization in Large Language Models (LLMs), inspired by the functional specialization observed in the human brain. Our investigation serves as a case study leveraging specially designed datasets and experimental-scale LLMs to lay the groundwork for understanding these behaviors. Specifically, we aim to first enable LLMs to exhibit both memorization and generalization by training with the designed dataset, then (a) examine whether LLMs exhibit neuron-level spatial differentiation for memorization and generalization, (b) predict these behaviors using model internal representations, and (c) steer the behaviors through inference-time interventions. Our findings reveal that neuron-wise differentiation of memorization and generalization is observable in LLMs, and targeted interventions can successfully direct their behavior.

Security Attacks on LLM-based Code Completion Tools 2024-12-24 Go
Show

The rapid development of large language models (LLMs) has significantly advanced code completion capabilities, giving rise to a new generation of LLM-based Code Completion Tools (LCCTs). Unlike general-purpose LLMs, these tools possess unique workflows, integrating multiple information sources as input and prioritizing code suggestions over natural language interaction, which introduces distinct security challenges. Additionally, LCCTs often rely on proprietary code datasets for training, raising concerns about the potential exposure of sensitive data. This paper exploits these distinct characteristics of LCCTs to develop targeted attack methodologies on two critical security risks: jailbreaking and training data extraction attacks. Our experimental results expose significant vulnerabilities within LCCTs, including a 99.4% success rate in jailbreaking attacks on GitHub Copilot and a 46.3% success rate on Amazon Q. Furthermore, we successfully extracted sensitive user data from GitHub Copilot, including 54 real email addresses and 314 physical addresses associated with GitHub usernames. Our study also demonstrates that these code-based attack methods are effective against general-purpose LLMs, such as the GPT series, highlighting a broader security misalignment in the handling of code by modern LLMs. These findings underscore critical security challenges associated with LCCTs and suggest essential directions for strengthening their security frameworks. The example code and attack samples from our research are provided at https://github.com/Sensente/Security-Attacks-on-LCCTs.

Paper accepted at AAAI 2025

Explainable Multi-Modal Data Exploration in Natural Language via LLM Agent 2024-12-24 Go
Show

International enterprises, organizations, or hospitals collect large amounts of multi-modal data stored in databases, text documents, images, and videos. While there has been recent progress in the separate fields of multi-modal data exploration as well as in database systems that automatically translate natural language questions to database query languages, the research challenge of querying database systems combined with other unstructured modalities such as images in natural language is widely unexplored. In this paper, we propose XMODE - a system that enables explainable, multi-modal data exploration in natural language. Our approach is based on the following research contributions: (1) Our system is inspired by a real-world use case that enables users to explore multi-modal information systems. (2) XMODE leverages an LLM-based agentic AI framework to decompose a natural language question into subtasks such as text-to-SQL generation and image analysis. (3) Experimental results on multi-modal datasets over relational data and images demonstrate that our system outperforms state-of-the-art multi-modal exploration systems, excelling not only in accuracy but also in various performance metrics such as query latency, API costs, planning efficiency, and explanation quality, thanks to the more effective utilization of the reasoning capabilities of LLMs.

Multilingual Mathematical Reasoning: Advancing Open-Source LLMs in Hindi and English 2024-12-24 Go
Show

Large Language Models (LLMs) excel in linguistic tasks but struggle with mathematical reasoning, particularly in non-English languages like Hindi. This research aims to enhance the mathematical reasoning skills of smaller, resource-efficient open-source LLMs in both Hindi and English. We evaluate models like OpenHathi 7B, LLaMA-2 7B, WizardMath 7B, Mistral 7B, LLeMMa 7B, MAmmoTH 7B, Gemini Pro, and GPT-4 using zero-shot, few-shot chain-of-thought (CoT) methods, and supervised fine-tuning. Our approach incorporates curriculum learning, progressively training models on increasingly difficult problems, a novel Decomposition Strategy to simplify complex arithmetic operations, and a Structured Solution Design that divides solutions into phases. Our experiments result in notable performance enhancements. WizardMath 7B exceeds Gemini's accuracy on English datasets by +6% and matches Gemini's performance on Hindi datasets. Adopting a bilingual approach that combines English and Hindi samples achieves results comparable to individual language models, demonstrating the capability to learn mathematical reasoning in both languages. This research highlights the potential for improving mathematical reasoning in open-source LLMs.

Accepted at AAAI 2025

A Statistical Framework for Ranking LLM-Based Chatbots 2024-12-24 Go
Show

Large language models (LLMs) have transformed natural language processing, with frameworks like Chatbot Arena providing pioneering platforms for evaluating these models. By facilitating millions of pairwise comparisons based on human judgments, Chatbot Arena has become a cornerstone in LLM evaluation, offering rich datasets for ranking models in open-ended conversational tasks. Building upon this foundation, we propose a statistical framework that incorporates key advancements to address specific challenges in pairwise comparison analysis. First, we introduce a factored tie model that enhances the ability to handle ties -- an integral aspect of human-judged comparisons -- significantly improving the model's fit to observed data. Second, we extend the framework to model covariance between competitors, enabling deeper insights into performance relationships and facilitating intuitive groupings into performance tiers. Third, we resolve optimization challenges arising from parameter non-uniqueness by introducing novel constraints, ensuring stable and interpretable parameter estimation. Through rigorous evaluation and extensive experimentation, our framework demonstrates substantial improvements over existing methods in modeling pairwise comparison data. To support reproducibility and practical adoption, we release leaderbot, an open-source Python package implementing our models and analyses.
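
As a rough illustration of modeling ties in pairwise comparisons, the sketch below fits a Davidson-style extension of Bradley-Terry with SciPy; the paper's factored tie model and covariance extensions are more elaborate, and the comparison counts here are made up.

```python
# A hedged sketch of a Bradley-Terry model extended with a tie parameter
# (Davidson-style), fit by maximum likelihood on toy counts.
import numpy as np
from scipy.optimize import minimize

def nll(params, wins, ties):
    n = wins.shape[0]
    p, nu = np.exp(params[:n]), np.exp(params[n])  # strengths, tie parameter
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            denom = p[i] + p[j] + nu * np.sqrt(p[i] * p[j])
            total -= wins[i, j] * np.log(p[i] / denom)
            if i < j:  # count each tie pair once
                total -= ties[i, j] * np.log(nu * np.sqrt(p[i] * p[j]) / denom)
    return total

wins = np.array([[0, 30, 45], [20, 0, 25], [15, 10, 0]])  # wins[i, j]: i beat j
ties = np.array([[0, 5, 8], [0, 0, 4], [0, 0, 0]])        # upper triangle
res = minimize(nll, np.zeros(4), args=(wins, ties), method="L-BFGS-B")
print("strengths (identified up to a common scale):", np.exp(res.x[:3]).round(3))
```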

ChaI-TeA: A Benchmark for Evaluating Autocompletion of Interactions with LLM-based Chatbots 2024-12-24 Go
Show

The rise of LLMs has deflected a growing portion of human-computer interactions towards LLM-based chatbots. The remarkable abilities of these models allow users to interact using long, diverse natural language text covering a wide range of topics and styles. Phrasing these messages is a time- and effort-consuming task, calling for an autocomplete solution to assist users. We introduce the task of chatbot interaction autocomplete. We present ChaI-TeA: CHat InTEraction Autocomplete, an autocomplete evaluation framework for LLM-based chatbot interactions. The framework includes a formal definition of the task, coupled with suitable datasets and metrics. Using the framework, we test 9 models on the defined autocompletion task, finding that while current off-the-shelf models perform fairly well, there is still much room for improvement, mainly in ranking the generated suggestions. We provide insights for practitioners working on this task and open new research directions for researchers in the field. We release our framework to serve as a foundation for future research.

Quo Vadis, Anomaly Detection? LLMs and VLMs in the Spotlight 2024-12-24 Go
Show

Video anomaly detection (VAD) has witnessed significant advancements through the integration of large language models (LLMs) and vision-language models (VLMs), addressing critical challenges such as interpretability, temporal reasoning, and generalization in dynamic, open-world scenarios. This paper presents an in-depth review of cutting-edge LLM-/VLM-based methods in 2024, focusing on four key aspects: (i) enhancing interpretability through semantic insights and textual explanations, making visual anomalies more understandable; (ii) capturing intricate temporal relationships to detect and localize dynamic anomalies across video frames; (iii) enabling few-shot and zero-shot detection to minimize reliance on large, annotated datasets; and (iv) addressing open-world and class-agnostic anomalies by using semantic understanding and motion features for spatiotemporal coherence. We highlight their potential to redefine the landscape of VAD. Additionally, we explore the synergy between visual and textual modalities offered by LLMs and VLMs, highlighting their combined strengths and proposing future directions to fully exploit the potential in enhancing video anomaly detection.

Research report
Pirates of the RAG: Adaptively Attacking LLMs to Leak Knowledge Bases 2024-12-24 Go
Show

The growing ubiquity of Retrieval-Augmented Generation (RAG) systems in several real-world services triggers severe concerns about their security. A RAG system improves the generative capabilities of a Large Language Model (LLM) with a retrieval mechanism which operates on a private knowledge base, whose unintended exposure could lead to severe consequences, including breaches of private and sensitive information. This paper presents a black-box attack to force a RAG system to leak its private knowledge base which, differently from existing approaches, is adaptive and automatic. A relevance-based mechanism and an attacker-side open-source LLM favor the generation of effective queries to leak most of the (hidden) knowledge base. Extensive experimentation proves the quality of the proposed algorithm in different RAG pipelines and domains, compared to very recent related approaches, which turn out to be either not fully black-box, not adaptive, or not based on open-source models. The findings from our study underscore the urgent need for more robust privacy safeguards in the design and deployment of RAG systems.

LLMs Enable Context-Aware Augmented Reality in Surgical Navigation 2024-12-24 Go
Show

Wearable Augmented Reality (AR) technologies are gaining recognition for their potential to transform surgical navigation systems. As these technologies evolve, selecting the right interaction method to control the system becomes crucial. Our work introduces a voice-controlled user interface (VCUI) for surgical AR assistance systems (ARAS), designed for pancreatic surgery, that integrates Large Language Models (LLMs). Employing a mixed-method research approach, we assessed the usability of our LLM-based design in both simulated surgical tasks and during pancreatic surgeries, comparing its performance against conventional VCUI for surgical ARAS using speech commands. Our findings demonstrated the usability of our proposed LLM-based VCUI, yielding a significantly lower task completion time and cognitive workload compared to speech commands. Additionally, qualitative insights from interviews with surgeons aligned with the quantitative data, revealing a strong preference for the LLM-based VCUI. Surgeons emphasized its intuitiveness and highlighted the potential of LLM-based VCUI in expediting decision-making in surgical environments.

32 pages, 9 figures
Molar: Multimodal LLMs with Collaborative Filtering Alignment for Enhanced Sequential Recommendation 2024-12-24 Go
Show

Sequential recommendation (SR) systems have evolved significantly over the past decade, transitioning from traditional collaborative filtering to deep learning approaches and, more recently, to large language models (LLMs). While the adoption of LLMs has driven substantial advancements, these models inherently lack collaborative filtering information, relying primarily on textual content data while neglecting other modalities, and thus failing to achieve optimal recommendation performance. To address this limitation, we propose Molar, a Multimodal large language sequential recommendation framework that integrates multiple content modalities with ID information to capture collaborative signals effectively. Molar employs an MLLM to generate unified item representations from both textual and non-textual data, facilitating comprehensive multimodal modeling and enriching item embeddings. Additionally, it incorporates collaborative filtering signals through a post-alignment mechanism, which aligns user representations from content-based and ID-based models, ensuring precise personalization and robust performance. By seamlessly combining multimodal content with collaborative filtering insights, Molar captures both user interests and contextual semantics, leading to superior recommendation accuracy. Extensive experiments validate that Molar significantly outperforms traditional and LLM-based baselines, highlighting its strength in utilizing multimodal data and collaborative signals for sequential recommendation tasks. The source code is available at https://anonymous.4open.science/r/Molar-8B06/.

INVESTORBENCH: A Benchmark for Financial Decision-Making Tasks with LLM-based Agent 2024-12-24 Go
Show

Recent advancements have underscored the potential of large language model (LLM)-based agents in financial decision-making. Despite this progress, the field currently encounters two main challenges: (1) the lack of a comprehensive LLM agent framework adaptable to a variety of financial tasks, and (2) the absence of standardized benchmarks and consistent datasets for assessing agent performance. To tackle these issues, we introduce InvestorBench, the first benchmark specifically designed for evaluating LLM-based agents in diverse financial decision-making contexts. InvestorBench enhances the versatility of LLM-enabled agents by providing a comprehensive suite of tasks applicable to different financial products, including single equities like stocks, cryptocurrencies and exchange-traded funds (ETFs). Additionally, we assess the reasoning and decision-making capabilities of our agent framework using thirteen different LLMs as backbone models, across various market environments and tasks. Furthermore, we have curated a diverse collection of open-source, multi-modal datasets and developed a comprehensive suite of environments for financial decision-making. This establishes a highly accessible platform for evaluating financial agents' performance across various scenarios.

Enhancing LLM-based Hatred and Toxicity Detection with Meta-Toxic Knowledge Graph 2024-12-24 Go
Show

The rapid growth of social media platforms has raised significant concerns regarding online content toxicity. When Large Language Models (LLMs) are used for toxicity detection, two key challenges emerge: 1) the absence of domain-specific toxic knowledge leads to false negatives; 2) the excessive sensitivity of LLMs to toxic speech results in false positives, limiting freedom of speech. To address these issues, we propose a novel method called MetaTox, leveraging graph search on a meta-toxic knowledge graph to enhance hatred and toxicity detection. First, we construct a comprehensive meta-toxic knowledge graph by utilizing LLMs to extract toxic information through a three-step pipeline, with toxic benchmark datasets serving as corpora. Second, we query the graph via retrieval and ranking processes to supplement accurate, relevant toxic knowledge. Extensive experiments and in-depth case studies across multiple datasets demonstrate that our MetaTox significantly decreases the false positive rate while boosting overall toxicity detection performance. Our code will be available soon.

8 pages of content
Stepwise Reasoning Error Disruption Attack of LLMs 2024-12-24 Go
Show

Large language models (LLMs) have made remarkable strides in complex reasoning tasks, but their safety and robustness in reasoning processes remain underexplored. Existing attacks on LLM reasoning are constrained by specific settings or lack of imperceptibility, limiting their feasibility and generalizability. To address these challenges, we propose the Stepwise rEasoning Error Disruption (SEED) attack, which subtly injects errors into prior reasoning steps to mislead the model into producing incorrect subsequent reasoning and final answers. Unlike previous methods, SEED is compatible with zero-shot and few-shot settings, maintains the natural reasoning flow, and ensures covert execution without modifying the instruction. Extensive experiments on four datasets across four different models demonstrate SEED's effectiveness, revealing the vulnerabilities of LLMs to disruptions in reasoning processes. These findings underscore the need for greater attention to the robustness of LLM reasoning to ensure safety in practical applications.

EF-LLM: Energy Forecasting LLM with AI-assisted Automation, Enhanced Sparse Prediction, Hallucination Detection 2024-12-24 Go
Show

Accurate prediction helps to achieve supply-demand balance in energy systems, supporting decision-making and scheduling. Traditional models, lacking AI-assisted automation, rely on experts, incur high costs, and struggle with sparse data prediction. To address these challenges, we propose the Energy Forecasting Large Language Model (EF-LLM), which integrates domain knowledge and temporal data for time-series forecasting, supporting both pre-forecast operations and post-forecast decision-support. EF-LLM's human-AI interaction capabilities lower the entry barrier in forecasting tasks, reducing the need for extra expert involvement. To achieve this, we propose a continual learning approach with updatable LoRA and a multi-channel architecture for aligning heterogeneous multimodal data, enabling EF-LLM to continually learn heterogeneous multimodal knowledge. In addition, EF-LLM enables accurate predictions under sparse data conditions through its ability to process multimodal data. We propose Fusion Parameter-Efficient Fine-Tuning (F-PEFT) method to effectively leverage both time-series data and text for this purpose. EF-LLM is also the first energy-specific LLM to detect hallucinations and quantify their occurrence rate, achieved via multi-task learning, semantic similarity analysis, and ANOVA. We have achieved success in energy prediction scenarios for load, photovoltaic, and wind power forecasting.

Sim911: Towards Effective and Equitable 9-1-1 Dispatcher Training with an LLM-Enabled Simulation 2024-12-24 Go
Show

Emergency response services are vital for enhancing public safety by safeguarding the environment, property, and human lives. As frontline members of these services, 9-1-1 dispatchers have a direct impact on response times and the overall effectiveness of emergency operations. However, traditional dispatcher training methods, which rely on role-playing by experienced personnel, are labor-intensive, time-consuming, and often neglect the specific needs of underserved communities. To address these challenges, we introduce Sim911, the first training simulation for 9-1-1 dispatchers powered by Large Language Models (LLMs). Sim911 enhances training through three key technical innovations: (1) knowledge construction, which utilizes archived 9-1-1 call data to generate simulations that closely mirror real-world scenarios; (2) context-aware controlled generation, which employs dynamic prompts and vector bases to ensure that LLM behavior aligns with training objectives; and (3) validation with looped correction, which filters out low-quality responses and refines the system performance.

Tackling the Dynamicity in a Production LLM Serving System with SOTA Optimizations via Hybrid Prefill/Decode/Verify Scheduling on Efficient Meta-kernels 2024-12-24 Go
Show

Meeting growing demands for low latency and cost efficiency in production-grade large language model (LLM) serving systems requires integrating advanced optimization techniques. However, dynamic and unpredictable input-output lengths of LLMs, compounded by these optimizations, exacerbate the issues of workload variability, making it difficult to maintain high efficiency on AI accelerators, especially DSAs with tile-based programming models. To address this challenge, we introduce XY-Serve, a versatile, Ascend native, end-to-end production LLM-serving system. The core idea is an abstraction mechanism that smooths out the workload variability by decomposing computations into unified, hardware-friendly, fine-grained meta primitives. For attention, we propose a meta-kernel that computes the basic pattern of matmul-softmax-matmul with architecture-aware tile sizes. For GEMM, we introduce a virtual padding scheme that adapts to dynamic shape changes while using highly efficient GEMM primitives with assorted fixed tile sizes. XY-Serve sits harmoniously with vLLM. Experimental results show up to 89% end-to-end throughput improvement compared with current publicly available baselines on Ascend NPUs. Additionally, our approach outperforms existing GEMM (average 14.6% faster) and attention (average 21.5% faster) kernels relative to existing libraries. While the work is Ascend native, we believe the approach can be readily applicable to SIMT architectures as well.

EvoPat: A Multi-LLM-based Patents Summarization and Analysis Agent 2024-12-24 Go
Show

The rapid growth of scientific techniques and knowledge is reflected in the exponential increase in new patents filed annually. While these patents drive innovation, they also present a significant burden for researchers and engineers, especially newcomers. To avoid the tedious work of navigating a vast and complex landscape to identify trends and breakthroughs, researchers urgently need efficient tools to summarize, evaluate, and contextualize patents, revealing their innovative contributions and underlying scientific principles. To address this need, we present EvoPat, a multi-LLM-based patent agent designed to assist users in analyzing patents through Retrieval-Augmented Generation (RAG) and advanced search strategies. EvoPat leverages multiple Large Language Models (LLMs), each performing specialized roles such as planning, identifying innovations, and conducting comparative evaluations. The system integrates data from local databases, including patents, literature, product catalogs, and company repositories, together with online searches to provide up-to-date insights, and can automatically collect information not included in the original databases. Through extensive testing in the natural language processing (NLP) domain, we demonstrate that EvoPat outperforms GPT-4 in tasks such as patent summarization, comparative analysis, and technical evaluation. EvoPat represents a significant step toward creating AI-powered tools that empower researchers and engineers to efficiently navigate the complexities of the patent landscape.

15 pages,2 figures,8 tables

Online Learning from Strategic Human Feedback in LLM Fine-Tuning 2024-12-24 Go
Show

Reinforcement learning from human feedback (RLHF) has become an essential step in fine-tuning large language models (LLMs) to align them with human preferences. However, human labelers are selfish and have diverse preferences. They may strategically misreport their online feedback to influence the system's aggregation towards their own preferences. Current practice simply averages labelers' feedback per time slot and fails to identify the most accurate human labeler, leading to linear regret $\mathcal{O}(T)$ for $T$ time slots. To the best of our knowledge, we are the first to study online learning mechanisms against strategic human labelers in the LLM fine-tuning process. We formulate a new dynamic Bayesian game and dynamically adjust human labelers' weights in the preference aggregation, ensuring their truthful feedback and sublinear regret $\mathcal{O}(T^{1/2})$. Simulation results demonstrate our mechanism's great advantages over the existing benchmark schemes.
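
The sketch below conveys the re-weighting idea with a standard multiplicative-weights update over simulated labelers; it is a stand-in for, not a reproduction of, the paper's dynamic Bayesian mechanism.

```python
# A hedged sketch: down-weight inaccurate labelers instead of plain averaging.
import numpy as np

rng = np.random.default_rng(0)
n_labelers, T, eta = 5, 1000, 0.1
noise = np.array([0.1, 1.0, 1.0, 1.0, 1.0])    # labeler 0 is the most accurate
weights = np.ones(n_labelers)
for t in range(T):
    truth = rng.normal()
    reports = truth + rng.normal(0.0, noise)   # possibly misreported feedback
    estimate = float(weights @ reports / weights.sum())  # aggregated estimate
    losses = (reports - truth) ** 2            # loss revealed after the fact
    weights *= np.exp(-eta * losses)           # multiplicative-weights update
    weights /= weights.sum()
print("final weights:", weights.round(3))      # mass concentrates on labeler 0
```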

Is Parameter Collision Hindering Continual Learning in LLMs? 2024-12-24 Go
Show

Large Language Models (LLMs) often suffer from catastrophic forgetting when learning multiple tasks sequentially, making continual learning (CL) essential for their dynamic deployment. Existing state-of-the-art (SOTA) methods, such as O-LoRA, typically focus on constructing orthogonality tasks to decouple parameter interdependence from various domains. In this paper, we reveal that building non-collision parameters is a more critical factor in addressing CL challenges. Our theoretical and experimental analyses demonstrate that non-collision parameters can provide better task orthogonality, which is a sufficient but not necessary condition. Furthermore, knowledge from multiple domains will be preserved in non-collision parameter subspaces, making it more difficult to forget previously seen data. Leveraging this insight, we propose Non-collision Low-Rank Adaptation (N-LoRA), a simple yet effective approach leveraging low collision rates to enhance CL in LLMs. Experimental results on multiple CL benchmarks indicate that N-LoRA achieves superior performance (+2.9), higher task orthogonality (4.1 times higher), and lower parameter collision (58.1 times lower) than SOTA methods.
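
A toy illustration of the collision notion: two task updates "collide" wherever both place non-negligible mass on the same coordinates, and sparsity drives the rate down. The threshold and measure below are assumptions, not the paper's definitions.

```python
# A minimal sketch of measuring parameter collision between two task updates.
import numpy as np

def collision_rate(delta_a: np.ndarray, delta_b: np.ndarray, eps: float = 1e-6) -> float:
    a_active = np.abs(delta_a) > eps
    b_active = np.abs(delta_b) > eps
    return float((a_active & b_active).mean())  # fraction of shared coordinates

rng = np.random.default_rng(0)
dense_a, dense_b = rng.normal(size=1000), rng.normal(size=1000)
sparse_a, sparse_b = dense_a.copy(), dense_b.copy()
sparse_a[rng.random(1000) > 0.05] = 0.0   # keep ~5% of coordinates active
sparse_b[rng.random(1000) > 0.05] = 0.0
print("dense collision: ", collision_rate(dense_a, dense_b))    # ~1.0
print("sparse collision:", collision_rate(sparse_a, sparse_b))  # ~0.0025
```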

Thinking Assistants: LLM-Based Conversational Assistants that Help Users Think By Asking rather than Answering 2024-12-24 Go
Show

Many AI systems focus solely on providing solutions or explaining outcomes. However, complex tasks like research and strategic thinking often benefit from a more comprehensive approach that augments the thinking process rather than passively delivering information. We introduce the concept of the "Thinking Assistant", a new genre of assistants that help users improve decision-making by asking reflection questions grounded in expert knowledge. Through our lab study (N=80), these Large Language Model (LLM)-based Thinking Assistants were better able to guide users to make important decisions, compared with conversational agents that only asked questions, provided advice, or neither. Based on the results, we develop a Thinking Assistant for academic career development, supporting users in determining a research trajectory or developing a unique research identity, tasks that require deliberation, reflection, and expert advice. In a longitudinal deployment with 223 conversations, participants responded positively to approximately 65% of the responses. Our work proposes directions for developing more effective LLM agents. Rather than adhering to the prevailing authoritative approach of generating definitive answers, LLM agents aimed at assisting with cognitive enhancement should prioritize fostering reflection. They should initially provide responses designed to prompt thoughtful consideration through inquiry, followed by offering advice only after gaining a deeper understanding of the user's context and needs.

Factuality or Fiction? Benchmarking Modern LLMs on Ambiguous QA with Citations 2024-12-23 Go
Show

Benchmarking modern large language models (LLMs) on complex and realistic tasks is critical to advancing their development. In this work, we evaluate the factual accuracy and citation performance of state-of-the-art LLMs on the task of Question Answering (QA) in ambiguous settings with source citations. Using three recently published datasets-DisentQA-DupliCite, DisentQA-ParaCite, and AmbigQA-Cite-featuring a range of real-world ambiguities, we analyze the performance of two leading LLMs, GPT-4o-mini and Claude-3.5. Our results show that larger, recent models consistently predict at least one correct answer in ambiguous contexts but fail to handle cases with multiple valid answers. Additionally, all models perform equally poorly in citation generation, with citation accuracy consistently at 0. However, introducing conflict-aware prompting leads to large improvements, enabling models to better address multiple valid answers and improve citation accuracy, while maintaining their ability to predict correct answers. These findings highlight the challenges and opportunities in developing LLMs that can handle ambiguity and provide reliable source citations. Our benchmarking study provides critical insights and sets a foundation for future improvements in trustworthy and interpretable QA systems.
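
The abstract does not give the authors' exact prompt, so the template below is only a plausible guess at what conflict-aware prompting could look like.

```python
# A hypothetical conflict-aware prompt template; wording is an assumption.
CONFLICT_AWARE_TEMPLATE = (
    "The following question may be ambiguous and admit multiple valid "
    "answers depending on interpretation. List every valid answer you can "
    "support, and cite the source passage for each one.\n\n"
    "Question: {question}\n"
    "Sources:\n{sources}"
)

print(CONFLICT_AWARE_TEMPLATE.format(
    question="Who holds the record?",
    sources="[1] ...\n[2] ...",
))
```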

Trustworthy and Efficient LLMs Meet Databases 2024-12-23 Go
Show

In the rapidly evolving AI era with large language models (LLMs) at the core, making LLMs more trustworthy and efficient, especially in output generation (inference), has gained significant attention. This is to reduce plausible but faulty LLM outputs (a.k.a. hallucinations) and meet the sharply increased inference demands. This tutorial explores such efforts and makes them transparent to the database community. Understanding these efforts is essential in harnessing LLMs in database tasks and adapting database techniques to LLMs. Furthermore, we delve into the synergy between LLMs and databases, highlighting new opportunities and challenges in their intersection. This tutorial aims to share with database researchers and practitioners essential concepts and strategies around LLMs, reduce the unfamiliarity of LLMs, and inspire joining in the intersection between LLMs and databases.

StructTest: Benchmarking LLMs' Reasoning through Compositional Structured Outputs 2024-12-23 Go
Show

The rapid development of large language models (LLMs) necessitates robust, unbiased, and scalable methods for evaluating their capabilities. However, human annotations are expensive to scale, model-based evaluations are prone to biases in answer style, while target-answer-based benchmarks are vulnerable to data contamination and cheating. To address these limitations, we propose StructTest, a novel benchmark that evaluates LLMs on their ability to produce compositionally specified structured outputs as an unbiased, cheap-to-run and difficult-to-cheat measure. The evaluation is done deterministically by a rule-based evaluator, which can be easily extended to new tasks. By testing structured outputs across diverse task domains -- including Summarization, Code, HTML and Math -- we demonstrate that StructTest serves as a good proxy for general reasoning abilities, as producing structured outputs often requires internal logical reasoning. We believe that StructTest offers a critical, complementary approach to objective and robust model evaluation.
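
A minimal sketch of what a deterministic, rule-based check of a compositionally specified output might look like; the instruction and constraints are invented examples, not tasks from StructTest.

```python
# A hedged sketch of a rule-based structured-output evaluator.
import json

def check_summary_output(text: str) -> bool:
    """Check: output must be JSON with exactly 3 bullets, each under 20 words."""
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        return False
    bullets = obj.get("bullets")
    if not isinstance(bullets, list) or len(bullets) != 3:
        return False
    return all(isinstance(b, str) and len(b.split()) < 20 for b in bullets)

print(check_summary_output('{"bullets": ["a b c", "d e", "f"]}'))  # True
print(check_summary_output('{"bullets": ["only two", "items"]}'))  # False
```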

AgentBoard: An Analytical Evaluation Board of Multi-turn LLM Agents 2024-12-23 Go
Show

Evaluating Large Language Models (LLMs) as general-purpose agents is essential for understanding their capabilities and facilitating their integration into practical applications. However, the evaluation process presents substantial challenges. A primary obstacle is the benchmarking of agent performance across diverse scenarios within a unified framework, especially in maintaining partially-observable environments and ensuring multi-round interactions. Moreover, current evaluation frameworks mostly focus on the final success rate, revealing few insights during the process and failing to provide a deep understanding of the model abilities. To address these challenges, we introduce AgentBoard, a pioneering comprehensive benchmark and accompanied open-source evaluation framework tailored to analytical evaluation of LLM agents. AgentBoard offers a fine-grained progress rate metric that captures incremental advancements as well as a comprehensive evaluation toolkit that features easy assessment of agents for multi-faceted analysis. This not only sheds light on the capabilities and limitations of LLM agents but also propels the interpretability of their performance to the forefront. Ultimately, AgentBoard serves as a step towards demystifying agent behaviors and accelerating the development of stronger LLM agents.

NeurIPS 2024 (Oral)
Future Events as Backdoor Triggers: Investigating Temporal Vulnerabilities in LLMs 2024-12-23 Go
Show

Backdoors are hidden behaviors that are only triggered once an AI system has been deployed. Bad actors looking to create successful backdoors must design them to avoid activation during training and evaluation. Since data used in these stages often only contains information about events that have already occurred, a component of a simple backdoor trigger could be a model recognizing data that is in the future relative to when it was trained. Through prompting experiments and by probing internal activations, we show that current large language models (LLMs) can distinguish past from future events, with probes on model activations achieving 90% accuracy. We train models with backdoors triggered by a temporal distributional shift; they activate when the model is exposed to news headlines beyond their training cut-off dates. Fine-tuning on helpful, harmless and honest (HHH) data does not work well for removing simpler backdoor triggers but is effective on our backdoored models, although this distinction is smaller for the larger-scale model we tested. We also find that an activation-steering vector representing a model's internal representation of the date influences the rate of backdoor activation. We take these results as initial evidence that, at least for models at the modest scale we test, standard safety measures are enough to remove these backdoors.

LLM for Barcodes: Generating Diverse Synthetic Data for Identity Documents 2024-12-23 Go
Show

Accurate barcode detection and decoding in identity documents is crucial for applications like security, healthcare, and education, where reliable data extraction and verification are essential. However, building robust detection models is challenging due to the lack of diverse, realistic datasets, an issue often tied to privacy concerns and the wide variety of document formats. Traditional tools like Faker rely on predefined templates, making them less effective for capturing the complexity of real-world identity documents. In this paper, we introduce a new approach to synthetic data generation that uses LLMs to create contextually rich and realistic data without relying on predefined fields. Using the vast knowledge LLMs have about different documents and content, our method creates data that reflects the variety found in real identity documents. This data is then encoded into barcodes and overlaid on templates for documents such as driver's licenses, insurance cards, and student IDs. Our approach simplifies the process of dataset creation, eliminating the need for extensive domain knowledge or predefined fields. Compared to traditional methods like Faker, data generated by LLMs demonstrates greater diversity and contextual relevance, leading to improved performance in barcode detection models. This scalable, privacy-first solution is a big step forward in advancing machine learning for automated document processing and identity verification.

5 pages, 1 figure
LLM-Driven Feedback for Enhancing Conceptual Design Learning in Database Systems Courses 2024-12-23 Go
Show

The integration of LLM-generated feedback into educational settings has shown promise in enhancing student learning outcomes. This paper presents a novel LLM-driven system that provides targeted feedback for conceptual designs in a Database Systems course. The system converts student-created entity-relationship diagrams (ERDs) into JSON format, allows the student to prune the diagram by isolating a relationship, extracts relevant requirements for the selected relationship, and utilizes a large language model (LLM) to generate detailed feedback. Additionally, the system creates a tailored set of questions and answers to further aid student understanding. Our pilot implementation in a Database System course demonstrates effective feedback generation that helped the students improve their design skills.

Sloth: scaling laws for LLM skills to predict multi-benchmark performance across families 2024-12-23 Go
Show

Scaling laws for large language models (LLMs) predict model performance based on parameters like size and training data. However, differences in training configurations and data processing across model families lead to significant variations in benchmark performance, making it difficult for a single scaling law to generalize across all LLMs. On the other hand, training family-specific scaling laws requires training models of varying sizes for every family. In this work, we propose Skills Scaling Laws (SSLaws, pronounced as Sloth), a novel scaling law that leverages publicly available benchmark data and assumes LLM performance is driven by low-dimensional latent skills, such as reasoning and instruction following. These latent skills are influenced by computational resources like model size and training tokens but with varying efficiencies across model families. Sloth exploits correlations across benchmarks to provide more accurate and interpretable predictions while alleviating the need to train multiple LLMs per family. We present both theoretical results on parameter identification and empirical evaluations on 12 prominent benchmarks, from Open LLM Leaderboard v1/v2, demonstrating that Sloth predicts LLM performance efficiently and offers insights into scaling behaviors for downstream tasks such as coding and emotional intelligence applications.
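
The core assumption, that benchmark scores are driven by a few latent skills, can be illustrated with a truncated SVD of a models-by-benchmarks score matrix; this is an illustration of the low-rank premise, not Sloth's actual parameterization.

```python
# A hedged sketch: recover a low-dimensional skill structure from synthetic
# benchmark scores via truncated SVD.
import numpy as np

rng = np.random.default_rng(0)
n_models, n_benchmarks, n_skills = 20, 12, 3
true_skills = rng.random((n_models, n_skills))    # e.g., reasoning, coding, ...
loadings = rng.random((n_skills, n_benchmarks))   # how each benchmark weighs skills
scores = true_skills @ loadings + 0.01 * rng.normal(size=(n_models, n_benchmarks))

u, s, vt = np.linalg.svd(scores, full_matrices=False)
recon = u[:, :n_skills] * s[:n_skills] @ vt[:n_skills]  # rank-3 approximation
print("rank-3 reconstruction error:",
      round(float(np.abs(recon - scores).mean()), 4))   # near zero
```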

SCBench: A Sports Commentary Benchmark for Video LLMs 2024-12-23 Go
Show

Recently, significant advances have been made in Video Large Language Models (Video LLMs) in both academia and industry. However, methods to evaluate and benchmark the performance of different Video LLMs, especially their fine-grained, temporal visual capabilities, remain very limited. On one hand, current benchmarks use relatively simple videos (e.g., subtitled movie clips) where the model can understand the entire video by processing just a few frames. On the other hand, their datasets lack diversity in task format, comprising only QA or multi-choice QA, which overlooks the models' capacity for generating in-depth and precise texts. Sports videos, which feature intricate visual information, sequential events, and emotionally charged commentary, present a critical challenge for Video LLMs, making sports commentary an ideal benchmarking task. Inspired by these challenges, we propose a novel task, sports video commentary generation, and develop $\textbf{SCBench}$ for Video LLMs. To construct such a benchmark, we introduce (1) $\textbf{SCORES}$, a six-dimensional metric specifically designed for our task, upon which we propose a GPT-based evaluation method, and (2) $\textbf{CommentarySet}$, a dataset consisting of 5,775 annotated video clips and ground-truth labels tailored to our metric. Based on SCBench, we conduct comprehensive evaluations on multiple Video LLMs (e.g., VILA, Video-LLaVA) and chain-of-thought baseline methods. Our results show that InternVL-Chat-2 achieves the best performance with 5.44, surpassing the second-best by 1.04. Our work provides a fresh perspective for future research, aiming to enhance models' overall capabilities in complex visual understanding tasks. Our dataset will be released soon.

Tracking the Feature Dynamics in LLM Training: A Mechanistic Study 2024-12-23 Go
Show

Understanding training dynamics and feature evolution is crucial for the mechanistic interpretability of large language models (LLMs). Although sparse autoencoders (SAEs) have been used to identify features within LLMs, a clear picture of how these features evolve during training remains elusive. In this study, we: (1) introduce SAE-Track, a method to efficiently obtain a continual series of SAEs; (2) formulate the process of feature formation and conduct a mechanistic analysis; and (3) analyze and visualize feature drift during training. Our work provides new insights into the dynamics of features in LLMs, enhancing our understanding of training mechanisms and feature evolution.

LiveIdeaBench: Evaluating LLMs' Scientific Creativity and Idea Generation with Minimal Context 2024-12-23 Go
Show

While Large Language Models (LLMs) have demonstrated remarkable capabilities in scientific tasks, existing evaluation frameworks primarily assess their performance using rich contextual inputs, overlooking their ability to generate novel ideas from minimal information. We introduce LiveIdeaBench, a comprehensive benchmark that evaluates LLMs' scientific creativity and divergent thinking capabilities using single-keyword prompts. Drawing from Guilford's creativity theory, our framework employs a dynamic panel of state-of-the-art LLMs to assess generated ideas across four key dimensions: originality, feasibility, fluency, and flexibility. Through extensive experimentation with 20 leading models across 1,180 keywords spanning 18 scientific domains, we reveal that scientific creative ability shows distinct patterns from general intelligence metrics. Notably, our results demonstrate that models like QwQ-32B-preview achieve comparable creative performance to top-tier models like o1-preview, despite significant gaps in their general intelligence scores. These findings highlight the importance of specialized evaluation frameworks for scientific creativity and suggest that the development of creative capabilities in LLMs may follow different trajectories than traditional problem-solving abilities.

Leveraging Memory Retrieval to Enhance LLM-based Generative Recommendation 2024-12-23 Go
Show

Leveraging Large Language Models (LLMs) to harness user-item interaction histories for item generation has emerged as a promising paradigm in generative recommendation. However, the limited context window of LLMs often restricts them to focusing on recent user interactions only, leading to the neglect of long-term interests involved in the longer histories. To address this challenge, we propose a novel Automatic Memory-Retrieval framework (AutoMR), which is capable of storing long-term interests in the memory and extracting relevant information from it for next-item generation within LLMs. Extensive experimental results on two real-world datasets demonstrate the effectiveness of our proposed AutoMR framework in utilizing long-term interests for generative recommendation.
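
A skeletal sketch of the memory-retrieval loop described above: embed past interactions, retrieve the top-k most relevant, and splice them into the generation prompt. `embed` is a hypothetical embedding function, and AutoMR's actual storage and extraction are more involved.

```python
# A minimal sketch of long-term interest retrieval for prompt construction.
import numpy as np

def embed(text: str) -> np.ndarray:
    raise NotImplementedError("plug in any sentence-embedding model here")

class InteractionMemory:
    def __init__(self):
        self.items, self.vecs = [], []

    def add(self, interaction: str) -> None:
        v = embed(interaction)
        self.items.append(interaction)
        self.vecs.append(v / np.linalg.norm(v))

    def retrieve(self, query: str, k: int = 5) -> list:
        q = embed(query)
        q = q / np.linalg.norm(q)
        sims = np.array([float(np.dot(q, v)) for v in self.vecs])
        return [self.items[i] for i in np.argsort(-sims)[:k]]
```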

A Framework for Effective Invocation Methods of Various LLM Services 2024-12-23 Go
Show

Large Language Models (LLMs) have shown impressive abilities in solving various natural language processing tasks and are now widely offered as services. LLM services enable users to accomplish tasks without requiring specialized knowledge, simply by paying service providers. However, numerous providers offer various LLM services with variations in pricing, latency, and performance. These factors are also affected by different invocation methods, such as the choice of context and the use of cache, which lead to unpredictable and uncontrollable service cost and quality. Consequently, utilizing various LLM service invocation methods to construct an effective (cost-saving, low-latency and high-performance) invocation strategy that best meets task demands becomes a pressing challenge. This paper provides a comprehensive overview of methods that help LLM services be invoked efficiently. Technically, we define the problem of constructing an effective LLM services invocation strategy, and based on this, propose a unified LLM service invocation framework. The framework classifies existing methods into four categories: input abstraction, semantic cache, solution design, and output enhancement, which can be used separately or jointly during the invocation life cycle. We discuss the methods in each category and compare them to provide valuable guidance for researchers. Finally, we emphasize the open challenges in this domain and shed light on future research.
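
As one concrete instance of the survey's "semantic cache" category, here is a minimal sketch that reuses a cached answer when a new query embeds close to a previous one; `embed` and `call_llm` are hypothetical stand-ins for a real embedding model and LLM service.

```python
# A hedged sketch of a semantic cache for LLM service invocation.
import numpy as np

def embed(text: str) -> np.ndarray:
    raise NotImplementedError("plug in any embedding model here")

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in any LLM service here")

class SemanticCache:
    def __init__(self, threshold: float = 0.92):
        self.keys, self.values, self.threshold = [], [], threshold

    def query(self, prompt: str) -> str:
        v = embed(prompt)
        v = v / np.linalg.norm(v)
        for k, answer in zip(self.keys, self.values):
            if float(np.dot(v, k)) >= self.threshold:  # cosine similarity
                return answer                          # cache hit: no API cost
        answer = call_llm(prompt)                      # cache miss: pay once
        self.keys.append(v)
        self.values.append(answer)
        return answer
```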

DiffusionAttacker: Diffusion-Driven Prompt Manipulation for LLM Jailbreak 2024-12-23 Go
Show

Large Language Models (LLMs) are susceptible to generating harmful content when prompted with carefully crafted inputs, a vulnerability known as LLM jailbreaking. As LLMs become more powerful, studying jailbreak methods is critical to enhancing security and aligning models with human values. Traditionally, jailbreak techniques have relied on suffix addition or prompt templates, but these methods suffer from limited attack diversity. This paper introduces DiffusionAttacker, an end-to-end generative approach for jailbreak rewriting inspired by diffusion models. Our method employs a sequence-to-sequence (seq2seq) text diffusion model as a generator, conditioning on the original prompt and guiding the denoising process with a novel attack loss. Unlike previous approaches that use autoregressive LLMs to generate jailbreak prompts, which limit the modification of already generated tokens and restrict the rewriting space, DiffusionAttacker utilizes a seq2seq diffusion model, allowing more flexible token modifications. This approach preserves the semantic content of the original prompt while producing harmful content. Additionally, we leverage the Gumbel-Softmax technique to make the sampling process from the diffusion model's output distribution differentiable, eliminating the need for iterative token search. Extensive experiments on Advbench and Harmbench demonstrate that DiffusionAttacker outperforms previous methods across various evaluation metrics, including attack success rate (ASR), fluency, and diversity.

Beyond Silent Letters: Amplifying LLMs in Emotion Recognition with Vocal Nuances 2024-12-23 Go
Show

Emotion recognition in speech is a challenging multimodal task that requires understanding both verbal content and vocal nuances. This paper introduces a novel approach to emotion detection using Large Language Models (LLMs), which have demonstrated exceptional capabilities in natural language understanding. To overcome the inherent limitation of LLMs in processing audio inputs, we propose SpeechCueLLM, a method that translates speech characteristics into natural language descriptions, allowing LLMs to perform multimodal emotion analysis via text prompts without any architectural changes. Our method is minimal yet impactful, outperforming baseline models that require structural modifications. We evaluate SpeechCueLLM on two datasets: IEMOCAP and MELD, showing significant improvements in emotion recognition accuracy, particularly for high-quality audio data. We also explore the effectiveness of various feature representations and fine-tuning strategies for different LLMs. Our experiments demonstrate that incorporating speech descriptions yields a more than 2% increase in the average weighted F1 score on IEMOCAP (from 70.111% to 72.596%).
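
A minimal sketch of the core translation step, turning acoustic statistics into a natural-language description prepended to the transcript; the feature names and thresholds are illustrative assumptions, not SpeechCueLLM's actual features.

```python
# A hedged sketch: speech characteristics rendered as text for an LLM prompt.
def describe_speech(pitch_hz: float, energy_db: float, rate_wps: float) -> str:
    pitch = "high" if pitch_hz > 200 else "low"
    volume = "loud" if energy_db > -20 else "quiet"
    speed = "fast" if rate_wps > 3.0 else "slow"
    return f"The speaker talks in a {pitch}-pitched, {volume}, {speed} voice."

transcript = "I can't believe you did that."
prompt = (f"{describe_speech(310.0, -12.5, 3.8)}\n"
          f'Utterance: "{transcript}"\n'
          f"What emotion is the speaker expressing?")
print(prompt)
```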

Connecting the Dots: LLMs can Infer and Verbalize Latent Structure from Disparate Training Data 2024-12-23 Go
Show

One way to address safety risks from large language models (LLMs) is to censor dangerous knowledge from their training data. While this removes the explicit information, implicit information can remain scattered across various training documents. Could an LLM infer the censored knowledge by piecing together these implicit hints? As a step towards answering this question, we study inductive out-of-context reasoning (OOCR), a type of generalization in which LLMs infer latent information from evidence distributed across training documents and apply it to downstream tasks without in-context learning. Using a suite of five tasks, we demonstrate that frontier LLMs can perform inductive OOCR. In one experiment we finetune an LLM on a corpus consisting only of distances between an unknown city and other known cities. Remarkably, without in-context examples or Chain of Thought, the LLM can verbalize that the unknown city is Paris and use this fact to answer downstream questions. Further experiments show that LLMs trained only on individual coin flip outcomes can verbalize whether the coin is biased, and those trained only on pairs $(x,f(x))$ can articulate a definition of $f$ and compute inverses. While OOCR succeeds in a range of cases, we also show that it is unreliable, particularly for smaller LLMs learning complex structures. Overall, the ability of LLMs to "connect the dots" without explicit in-context learning poses a potential obstacle to monitoring and controlling the knowledge acquired by LLMs.

Accepted at NeurIPS 2024. 10 pages, 8 figures

SWAN: SGD with Normalization and Whitening Enables Stateless LLM Training 2024-12-23 Go
Show

Adaptive optimizers such as Adam (Kingma & Ba, 2015) have been central to the success of large language models. However, they often require maintaining optimizer states throughout training, which can result in memory requirements several times greater than the model footprint. This overhead imposes constraints on scalability and computational efficiency. Stochastic Gradient Descent (SGD), in contrast, is a stateless optimizer, as it does not track state variables during training. Consequently, it achieves optimal memory efficiency. However, its capability in LLM training is limited (Zhao et al., 2024b). In this work, we show that pre-processing SGD in a stateless manner can achieve the same performance as the Adam optimizer for LLM training, while drastically reducing the memory cost. Specifically, we propose to pre-process the instantaneous stochastic gradients using normalization and whitening. We show that normalization stabilizes gradient distributions, and whitening counteracts the local curvature of the loss landscape. This results in SWAN (SGD with Whitening And Normalization), a stochastic optimizer that eliminates the need to store any optimizer states. Empirically, SWAN has the same memory footprint as SGD, achieving an $\approx 50\%$ reduction in total end-to-end memory compared to Adam. In language modeling tasks, SWAN demonstrates comparable or even better performance than Adam: when pre-training the LLaMA model with 350M and 1.3B parameters, SWAN achieves a 2x speedup by reaching the same evaluation perplexity using half as many tokens.

In v2 we have revised the related work, added more comprehensive citations, and clarified our key contributions
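
A hedged NumPy sketch of the two preprocessing steps the abstract names, applied to a 2-D gradient; the normalization axis and the whitening solver are assumptions, not the paper's implementation.

```python
# A minimal sketch of normalize-then-whiten gradient preprocessing.
import numpy as np

def normalize(g: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    # standardize each row of the gradient to stabilize its distribution
    return (g - g.mean(axis=1, keepdims=True)) / (g.std(axis=1, keepdims=True) + eps)

def whiten(g: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    # left-multiply by (G G^T)^(-1/2) to counteract local curvature
    u, s, _ = np.linalg.svd(g @ g.T)
    inv_sqrt = u @ np.diag(1.0 / np.sqrt(s + eps)) @ u.T
    return inv_sqrt @ g

def stateless_step(w: np.ndarray, g: np.ndarray, lr: float = 1e-3) -> np.ndarray:
    # stateless update: no moments or other optimizer state carried over
    return w - lr * whiten(normalize(g))
```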

AutoLife: Automatic Life Journaling with Smartphones and LLMs 2024-12-23 Go
Show

This paper introduces a novel mobile sensing application - life journaling - designed to generate semantic descriptions of users' daily lives. We present AutoLife, an automatic life journaling system based on commercial smartphones. AutoLife only inputs low-cost sensor data (without photos or audio) from smartphones and can automatically generate comprehensive life journals for users. To achieve this, we first derive time, motion, and location contexts from multimodal sensor data, and harness the zero-shot capabilities of Large Language Models (LLMs), enriched with commonsense knowledge about human lives, to interpret diverse contexts and generate life journals. To manage the task complexity and long sensing duration, a multilayer framework is proposed, which decomposes tasks and seamlessly integrates LLMs with other techniques for life journaling. This study establishes a real-life dataset as a benchmark and extensive experiment results demonstrate that AutoLife produces accurate and reliable life journals.

13 pages
Enhancing Reasoning Capabilities of LLMs via Principled Synthetic Logic Corpus 2024-12-23 Go
Show

Large language models (LLMs) are capable of solving a wide range of tasks, yet they have struggled with reasoning. To address this, we propose $\textbf{Additional Logic Training (ALT)}$, which aims to enhance LLMs' reasoning capabilities by program-generated logical reasoning samples. We first establish principles for designing high-quality samples by integrating symbolic logic theory and previous empirical insights. Then, based on these principles, we construct a synthetic corpus named $\textbf{Formal Logic Deduction Diverse}$ ($\textbf{FLD}{\times 2}$), comprising numerous samples of multi-step deduction with unknown facts, diverse reasoning rules, diverse linguistic expressions, and challenging distractors. Finally, we empirically show that ALT on FLD${\times 2}$ substantially enhances the reasoning capabilities of state-of-the-art LLMs, including LLaMA-3.1-70B. Improvements include gains of up to 30 points on logical reasoning benchmarks, up to 10 points on math and coding benchmarks, and 5 points on the benchmark suite BBH.

NeurIPS 2024
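
A toy generator in the spirit of program-generated deduction samples; real FLD×2 samples use formal logic rules, distractors, and diverse linguistic expressions rather than this bare modus ponens chain.

```python
# A minimal sketch of programmatically generating a multi-step deduction sample.
import random

def make_deduction_sample(depth: int = 3, seed: int = 0) -> dict:
    rng = random.Random(seed)
    facts = [f"P{i}" for i in range(depth + 1)]
    rules = [f"If {facts[i]} then {facts[i + 1]}." for i in range(depth)]
    rng.shuffle(rules)  # shuffle so the chain must be reassembled by reasoning
    context = " ".join(rules) + f" Fact: {facts[0]}."
    return {"context": context,
            "question": f"Can {facts[-1]} be deduced?",
            "answer": "yes"}

print(make_deduction_sample())
```
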
LayoutCopilot: An LLM-powered Multi-agent Collaborative Framework for Interactive Analog Layout Design 2024-12-23 Go
Show

Analog layout design heavily involves interactive processes between humans and design tools. Electronic Design Automation (EDA) tools for this task are usually designed to use scripting commands or visualized buttons for manipulation, especially for interactive automation functionalities, which have a steep learning curve and cumbersome user experience, making a notable barrier to designers' adoption. Aiming to address such a usability issue, this paper introduces LayoutCopilot, a pioneering multi-agent collaborative framework powered by Large Language Models (LLMs) for interactive analog layout design. LayoutCopilot simplifies human-tool interaction by converting natural language instructions into executable script commands, and it interprets high-level design intents into actionable suggestions, significantly streamlining the design process. Experimental results demonstrate the flexibility, efficiency, and accessibility of LayoutCopilot in handling real-world analog designs.

8 pages, 8 figures
Assessing Human Editing Effort on LLM-Generated Texts via Compression-Based Edit Distance 2024-12-23 Go
Show

Assessing the extent of human edits on texts generated by Large Language Models (LLMs) is crucial to understanding the human-AI interactions and improving the quality of automated text generation systems. Existing edit distance metrics, such as Levenshtein, BLEU, ROUGE, and TER, often fail to accurately measure the effort required for post-editing, especially when edits involve substantial modifications, such as block operations. In this paper, we introduce a novel compression-based edit distance metric grounded in the Lempel-Ziv-77 algorithm, designed to quantify the amount of post-editing applied to LLM-generated texts. Our method leverages the properties of text compression to measure the informational difference between the original and edited texts. Through experiments on real-world human edits datasets, we demonstrate that our proposed metric is highly correlated with actual edit time and effort. We also show that LLMs exhibit an implicit understanding of editing speed that aligns well with our metric. Furthermore, we compare our metric with existing ones, highlighting its advantages in capturing complex edits with linear computational efficiency. Our code and data are available at: https://github.com/NDV-tiime/CompressionDistance
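
The idea is straightforward to prototype. The sketch below uses zlib's DEFLATE (an LZ77 variant) as a convenient proxy for the paper's LZ77-based metric, so exact values will differ, but the qualitative behavior carries over: cheap for light or block-move edits, expensive for rewrites.

```python
# A hedged sketch of a compression-based edit distance.
import zlib

def compressed_size(s: str) -> int:
    return len(zlib.compress(s.encode("utf-8"), level=9))

def compression_edit_distance(original: str, edited: str) -> int:
    # Extra information the edited text adds beyond the original: compressing
    # the concatenation is cheap when `edited` mostly repeats `original`,
    # even under block moves that word-level metrics penalize heavily.
    return compressed_size(original + edited) - compressed_size(original)

draft = "The quick brown fox jumps over the lazy dog. " * 5
light_edit = draft.replace("quick", "fast")
heavy_edit = "A completely rewritten paragraph about something else entirely."
print(compression_edit_distance(draft, light_edit))   # small
print(compression_edit_distance(draft, heavy_edit))   # larger
```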

RATT: A Thought Structure for Coherent and Correct LLM Reasoning 2024-12-23 Go
Show

Large Language Models (LLMs) gain substantial reasoning and decision-making capabilities from thought structures. However, existing methods such as Tree of Thought and Retrieval Augmented Thoughts often fall short in complex tasks due to the limitations of insufficient local retrieval of factual knowledge and inadequate global selection of strategies. These limitations make it challenging for these methods to balance factual accuracy and comprehensive logical optimization effectively. To address these limitations, we introduce the Retrieval Augmented Thought Tree (RATT), a novel thought structure that considers both overall logical soundness and factual correctness at each step of the thinking process. Specifically, at every point of a thought branch, RATT performs planning and lookahead to explore and evaluate multiple potential reasoning steps, and integrates the fact-checking ability of Retrieval-Augmented Generation (RAG) with the LLM's ability to assess overall strategy. Through this combination of factual knowledge and strategic feasibility, RATT adjusts and integrates the thought tree structure to search for the most promising branches within the search space. This thought structure significantly enhances the model's coherence in logical inference and efficiency in decision-making, raising the limit of LLMs' capacity to generate reliable inferences and decisions based on thought structures. A broad range of experiments on different types of tasks showcases that the RATT structure significantly outperforms existing methods in factual correctness and logical coherence.

Accepted by AAAI 2025

FFT: Towards Harmlessness Evaluation and Analysis for LLMs with Factuality, Fairness, Toxicity 2024-12-23 Go
Show

The widespread use of generative artificial intelligence has heightened concerns about the potential harms posed by AI-generated texts, primarily stemming from factoid, unfair, and toxic content. Previous researchers have invested much effort in assessing the harmlessness of generative language models. However, existing benchmarks are struggling in the era of large language models (LLMs), due to the stronger language generation and instruction following capabilities, as well as wider applications. In this paper, we propose FFT, a new benchmark with 2,116 elaborately designed instances, for LLM harmlessness evaluation with factuality, fairness, and toxicity. To investigate the potential harms of LLMs, we evaluate 9 representative LLMs covering various parameter scales, training stages, and creators. Experiments show that the harmlessness of LLMs is still unsatisfactory, and extensive analysis derives some insightful findings that could inspire future work on harmless LLMs.

Accepted by KDD workshop on Evaluation and Trustworthiness of Generative AI Models

LegalAgentBench: Evaluating LLM Agents in Legal Domain 2024-12-23 Go
Show

With the increasing intelligence and autonomy of LLM agents, their potential applications in the legal domain are becoming increasingly apparent. However, existing general-domain benchmarks cannot fully capture the complexity and subtle nuances of real-world judicial cognition and decision-making. Therefore, we propose LegalAgentBench, a comprehensive benchmark specifically designed to evaluate LLM Agents in the Chinese legal domain. LegalAgentBench includes 17 corpora from real-world legal scenarios and provides 37 tools for interacting with external knowledge. We designed a scalable task construction framework and carefully annotated 300 tasks. These tasks span various types, including multi-hop reasoning and writing, and range across different difficulty levels, effectively reflecting the complexity of real-world legal scenarios. Moreover, beyond evaluating final success, LegalAgentBench incorporates keyword analysis during intermediate processes to calculate progress rates, enabling more fine-grained evaluation. We evaluated eight popular LLMs, highlighting the strengths, limitations, and potential areas for improvement of existing models and methods. LegalAgentBench sets a new benchmark for the practical application of LLMs in the legal domain, with its code and data available at https://github.com/CSHaitao/LegalAgentBench.

23 pages
SyNeg: LLM-Driven Synthetic Hard-Negatives for Dense Retrieval 2024-12-23 Go
Show

The performance of Dense retrieval (DR) is significantly influenced by the quality of negative sampling. Traditional DR methods primarily depend on naive negative sampling techniques or on mining hard negatives through an external retriever and meticulously crafted strategies. However, naive negative sampling often fails to adequately capture the accurate boundaries between positive and negative samples, whereas existing hard negative sampling methods are prone to false negatives, resulting in performance degradation and training instability. Recent advancements in large language models (LLMs) offer an innovative solution to these challenges by generating contextually rich and diverse negative samples. In this work, we present a framework that harnesses LLMs to synthesize high-quality hard negative samples. We first devise a *multi-attribute self-reflection prompting strategy* to direct LLMs in hard negative sample generation. Then, we implement a *hybrid sampling strategy* that integrates these synthetic negatives with traditionally retrieved negatives, thereby stabilizing the training process and improving retrieval performance. Extensive experiments on five benchmark datasets demonstrate the efficacy of our approach, and code is also publicly available.
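
As a rough illustration of the hybrid sampling idea, the sketch below blends synthetic and retrieved negatives at a fixed ratio. The `synthesize_hard_negatives` stub is entirely hypothetical, standing in for the paper's LLM-driven multi-attribute self-reflection prompting:

```python
import random

def synthesize_hard_negatives(query: str, positive: str, k: int) -> list[str]:
    # Hypothetical stand-in for an LLM call; returns canned perturbations.
    return [f"{positive} (plausible-but-wrong variant {i})" for i in range(k)]

def hybrid_negatives(query: str, positive: str, mined: list[str],
                     n_total: int = 6, synth_ratio: float = 0.5) -> list[str]:
    """Mix LLM-synthesized negatives with traditionally mined ones to
    stabilize training while keeping the decision boundary sharp."""
    n_synth = int(n_total * synth_ratio)
    synthetic = synthesize_hard_negatives(query, positive, n_synth)
    retrieved = random.sample(mined, min(n_total - n_synth, len(mined)))
    return synthetic + retrieved

print(hybrid_negatives(
    "what causes ocean tides?",
    "Tides are caused mainly by the Moon's gravitational pull.",
    ["Tides are caused by seasonal winds.",
     "Tides occur only near the equator.",
     "Ocean tides are driven by plate tectonics."]))
```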

LLM-based relevance assessment still can't replace human relevance assessment 2024-12-22 Go
Show

The use of large language models (LLMs) for relevance assessment in information retrieval has gained significant attention, with recent studies suggesting that LLM-based judgments provide comparable evaluations to human judgments. Notably, based on TREC 2024 data, Upadhyay et al. make a bold claim that LLM-based relevance assessments, such as those generated by the UMBRELA system, can fully replace traditional human relevance assessments in TREC-style evaluations. This paper critically examines this claim, highlighting practical and theoretical limitations that undermine the validity of this conclusion. First, we question whether the evidence provided by Upadhyay et al. really supports their claim, particularly if a test collection is used as a benchmark for future improvements. Second, through a submission deliberately intended to do so, we demonstrate the ease with which automatic evaluation metrics can be subverted, showing that systems designed to exploit these evaluations can achieve artificially high scores. Theoretical challenges -- such as the inherent narcissism of LLMs, the risk of overfitting to LLM-based metrics, and the potential degradation of future LLM performance -- must be addressed before LLM-based relevance assessments can be considered a viable replacement for human judgments.

A Multi-AI Agent System for Autonomous Optimization of Agentic AI Solutions via Iterative Refinement and LLM-Driven Feedback Loops 2024-12-22 Go
Show

Agentic AI systems use specialized agents to handle tasks within complex workflows, enabling automation and efficiency. However, optimizing these systems often requires labor-intensive, manual adjustments to refine roles, tasks, and interactions. This paper introduces a framework for autonomously optimizing Agentic AI solutions across industries, such as NLP-driven enterprise applications. The system employs agents for Refinement, Execution, Evaluation, Modification, and Documentation, leveraging iterative feedback loops powered by an LLM (Llama 3.2-3B). The framework achieves optimal performance without human input by autonomously generating and testing hypotheses to improve system configurations. This approach enhances scalability and adaptability, offering a robust solution for real-world applications in dynamic environments. Case studies across diverse domains illustrate the transformative impact of this framework, showcasing significant improvements in output quality, relevance, and actionability. All data for these case studies, including original and evolved agent codes, along with their outputs, are here: https://anonymous.4open.science/r/evolver-1D11/

LLM Agent for Fire Dynamics Simulations 2024-12-22 Go
Show

Significant advances have been achieved in leveraging foundation models, such as large language models (LLMs), to accelerate complex scientific workflows. In this work we introduce FoamPilot, a proof-of-concept LLM agent designed to enhance the usability of FireFOAM, a specialized solver for fire dynamics and fire suppression simulations built using OpenFOAM, a popular open-source toolbox for computational fluid dynamics (CFD). FoamPilot provides three core functionalities: code insight, case configuration and simulation evaluation. Code insight is an alternative to traditional keyword searching leveraging retrieval-augmented generation (RAG) and aims to enable efficient navigation and summarization of the FireFOAM source code for developers and experienced users. For case configuration, the agent interprets user requests in natural language and aims to modify existing simulation setups accordingly to support intermediate users. FoamPilot's job execution functionality seeks to manage the submission and execution of simulations in high-performance computing (HPC) environments and provide preliminary analysis of simulation results to support less experienced users. Promising results were achieved for each functionality, particularly for simple tasks, and opportunities were identified for significant further improvement for more complex tasks. The integration of these functionalities into a single LLM agent is a step aimed at accelerating the simulation workflow for engineers and scientists employing FireFOAM for complex simulations critical for improving fire safety.

NeurIPS 2024 Foundation Models for Science Workshop (38th Conference on Neural Information Processing Systems). 12 pages, 8 figures

Hate Speech Detection and Target Identification in Devanagari Languages via Parameter Efficient Fine-Tuning of LLMs 2024-12-22 Go
Show

The detection of hate speech has become increasingly important in combating online hostility and its real-world consequences. Despite recent advancements, there is limited research addressing hate speech detection in Devanagari-scripted languages, where resources and tools are scarce. While large language models (LLMs) have shown promise in language-related tasks, traditional fine-tuning approaches are often infeasible given the size of the models. In this paper, we propose a Parameter-Efficient Fine-Tuning (PEFT) based solution for hate speech detection and target identification. We evaluate multiple LLMs on the Devanagari dataset provided by (Thapa et al., 2025), which contains annotated instances in two languages: Hindi and Nepali. The results demonstrate the efficacy of our approach in handling Devanagari-scripted content.

VIoTGPT: Learning to Schedule Vision Tools in LLMs towards Intelligent Video Internet of Things 2024-12-22 Go
Show

Video Internet of Things (VIoT) has shown full potential in collecting an unprecedented volume of video data. How to schedule the domain-specific perceiving models and analyze the collected videos uniformly, efficiently, and especially intelligently to accomplish complicated tasks is challenging. To address the challenge, we build VIoTGPT, the framework based on LLMs to correctly interact with humans, query knowledge videos, and invoke vision models to analyze multimedia data collaboratively. To support VIoTGPT and related future works, we meticulously crafted the VIoT-Tool dataset, including the training dataset and the benchmark involving 11 representative vision models across three categories based on semi-automatic annotations. To guide LLM to act as the intelligent agent towards intelligent VIoT, we resort to the ReAct instruction tuning method based on VIoT-Tool to learn the tool capability. Quantitative and qualitative experiments and analyses demonstrate the effectiveness of VIoTGPT. We believe VIoTGPT contributes to improving human-centered experiences in VIoT applications. The project website is https://github.com/zhongyy/VIoTGPT.

AAAI 2025, 12 pages
Analysis on LLMs Performance for Code Summarization 2024-12-22 Go
Show

Code summarization aims to generate concise natural language descriptions for source code. Deep learning has increasingly been used in software engineering, particularly for tasks like code creation and summarization, and the most recent code-capable Large Language Models (LLMs) appear to perform well on these tasks. LLMs have significantly advanced the field of code summarization, providing sophisticated methods for generating concise and accurate summaries of source code. This study performs a comparative analysis of several open-source LLMs, namely LLaMA-3, Phi-3, Mistral, and Gemma. These models' performance is assessed using standard metrics such as BLEU and ROUGE. Through this analysis, we seek to identify the strengths and weaknesses of each model, offering insights into their applicability and effectiveness in code summarization tasks. Our findings contribute to the ongoing development and refinement of LLMs, supporting their integration into tools that enhance software development and maintenance processes.

The HalluRAG Dataset: Detecting Closed-Domain Hallucinations in RAG Applications Using an LLM's Internal States 2024-12-22 Go
Show

Detecting hallucinations in large language models (LLMs) is critical for enhancing their reliability and trustworthiness. Most research focuses on hallucinations as deviations from information seen during training. However, the opaque nature of an LLM's parametric knowledge complicates the understanding of why generated texts appear ungrounded: The LLM might not have picked up the necessary knowledge from large and often inaccessible datasets, or the information might have been changed or contradicted during further training. Our focus is on hallucinations involving information not used in training, which we determine by using recency to ensure the information emerged after a cut-off date. This study investigates these hallucinations by detecting them at sentence level using different internal states of various LLMs. We present HalluRAG, a dataset designed to train classifiers on these hallucinations. Depending on the model and quantization, MLPs trained on HalluRAG detect hallucinations with test accuracies of up to 75%, with Mistral-7B-Instruct-v0.1 achieving the highest test accuracies. Our results show that IAVs detect hallucinations as effectively as CEVs, and reveal that answerable and unanswerable prompts are encoded differently, as separate classifiers for these categories improved accuracy. However, HalluRAG showed some limited generalizability, advocating for more diversity in datasets on hallucinations.

19 pages, 3 figures
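
A minimal sketch of the classifier side of this setup appears below: an MLP trained on per-sentence internal-state vectors with binary hallucination labels. The random features and the 512-dimensional size are placeholders, not the paper's actual activations:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Placeholder data: in the real setup, X would hold internal states
# (e.g., intermediate activation vectors) extracted per generated sentence.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 512))       # one internal-state vector per sentence
y = rng.integers(0, 2, size=2000)      # 1 = hallucinated, 0 = grounded

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = MLPClassifier(hidden_layer_sizes=(128,), max_iter=100, random_state=0)
clf.fit(X_tr, y_tr)
print(f"test accuracy: {clf.score(X_te, y_te):.3f}")  # ~0.5 on random placeholders
```
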
MINTQA: A Multi-Hop Question Answering Benchmark for Evaluating LLMs on New and Tail Knowledge 2024-12-22 Go
Show

Large language models (LLMs) have demonstrated impressive capabilities in various reasoning tasks but face significant challenges with complex, knowledge-intensive multi-hop queries, particularly those involving new or long-tail knowledge. Existing benchmarks often fail to fully address these challenges. To bridge this gap, we introduce MINTQA (Multi-hop Question Answering on New and Tail Knowledge), a comprehensive benchmark to evaluate LLMs' capabilities in multi-hop reasoning across four critical dimensions: question handling strategy, sub-question generation, retrieval-augmented generation, and iterative or dynamic decomposition and retrieval. MINTQA comprises 10,479 question-answer pairs for evaluating new knowledge and 17,887 pairs for assessing long-tail knowledge, with each question equipped with corresponding sub-questions and answers. Our systematic evaluation of 22 state-of-the-art LLMs on MINTQA reveals significant limitations in their ability to handle complex knowledge base queries, particularly in handling new or unpopular knowledge. Our findings highlight critical challenges and offer insights for advancing multi-hop reasoning capabilities. The MINTQA benchmark is available at https://github.com/probe2/multi-hop/.

LLM-Powered User Simulator for Recommender System 2024-12-22 Go
Show

User simulators can rapidly generate a large volume of timely user behavior data, providing a testing platform for reinforcement learning-based recommender systems, thus accelerating their iteration and optimization. However, prevalent user simulators generally suffer from significant limitations, including the opacity of user preference modeling and the incapability of evaluating simulation accuracy. In this paper, we introduce an LLM-powered user simulator to simulate user engagement with items in an explicit manner, thereby enhancing the efficiency and effectiveness of reinforcement learning-based recommender systems training. Specifically, we identify the explicit logic of user preferences, leverage LLMs to analyze item characteristics and distill user sentiments, and design a logical model to imitate real human engagement. By integrating a statistical model, we further enhance the reliability of the simulation, proposing an ensemble model that synergizes logical and statistical insights for user interaction simulations. Capitalizing on the extensive knowledge and semantic generation capabilities of LLMs, our user simulator faithfully emulates user behaviors and preferences, yielding high-fidelity training data that enrich the training of recommendation algorithms. We conduct quantitative and qualitative experiments on five datasets to validate the simulator's effectiveness and stability across various recommendation scenarios.

Cannot or Should Not? Automatic Analysis of Refusal Composition in IFT/RLHF Datasets and Refusal Behavior of Black-Box LLMs 2024-12-22 Go
Show

Refusals - instances where large language models (LLMs) decline or fail to fully execute user instructions - are crucial for both AI safety and AI capabilities, and for the reduction of hallucinations in particular. These behaviors are learned during post-training, especially in instruction fine-tuning (IFT) and reinforcement learning from human feedback (RLHF). However, existing taxonomies and evaluation datasets for refusals are inadequate, often focusing solely on should-not-related (instead of cannot-related) categories, and lacking tools for auditing refusal content in black-box LLM outputs. We present a comprehensive framework for classifying LLM refusals: (a) a taxonomy of 16 refusal categories, (b) a human-annotated dataset of over 8,600 instances from publicly available IFT and RLHF datasets, (c) a synthetic dataset with 8,000 examples for each refusal category, and (d) classifiers trained for refusal classification. Our work enables precise auditing of refusal behaviors in black-box LLMs and automatic analyses of refusal patterns in large IFT and RLHF datasets. This facilitates the strategic adjustment of LLM refusals, contributing to the development of safer and more reliable LLMs.

NeurIPS 2024 Workshop SFLLM

Technical Report: Enhancing LLM Reasoning with Reward-guided Tree Search 2024-12-22 Go
Show

Recently, test-time scaling has garnered significant attention from the research community, largely due to the substantial advancements of the o1 model released by OpenAI. By allocating more computational resources during the inference phase, large language models (LLMs) can extensively explore the solution space by generating more thought tokens or diverse solutions, thereby producing more accurate responses. However, developing an o1-like reasoning approach is challenging, and researchers have been making various attempts to advance this open area of research. In this paper, we present a preliminary exploration into enhancing the reasoning abilities of LLMs through reward-guided tree search algorithms. This framework is implemented by integrating the policy model, reward model, and search algorithm. It is primarily constructed around a tree search algorithm, where the policy model navigates a dynamically expanding tree guided by a specially trained reward model. The implemented framework is denoted as **STILL-1**. We thoroughly explore various design considerations necessary for implementing this framework and provide a detailed report of the technical aspects. To assess the effectiveness of our approach, we focus on mathematical reasoning tasks and conduct extensive evaluations on four challenging datasets, significantly enhancing the reasoning abilities of LLMs.

Technical Report on Slow Thinking with LLMs: I

Evaluating LLM Reasoning in the Operations Research Domain with ORQA 2024-12-22 Go
Show

In this paper, we introduce and apply Operations Research Question Answering (ORQA), a new benchmark designed to assess the generalization capabilities of Large Language Models (LLMs) in the specialized technical domain of Operations Research (OR). This benchmark evaluates whether LLMs can emulate the knowledge and reasoning skills of OR experts when confronted with diverse and complex optimization problems. The dataset, developed by OR experts, features real-world optimization problems that demand multistep reasoning to construct their mathematical models. Our evaluations of various open source LLMs, such as LLaMA 3.1, DeepSeek, and Mixtral, reveal their modest performance, highlighting a gap in their ability to generalize to specialized technical domains. This work contributes to the ongoing discourse on LLMs generalization capabilities, offering valuable insights for future research in this area. The dataset and evaluation code are publicly available.

12 pages, 10 figures. Accepted and to be published in AAAI25

Enhancing Supply Chain Transparency in Emerging Economies Using Online Contents and LLMs 2024-12-22 Go
Show

In the current global economy, supply chain transparency plays a pivotal role in ensuring security by enabling companies to monitor supplier performance and fostering accountability and responsibility. Despite the advancements in supply chain relationship datasets like Bloomberg and FactSet, supply chain transparency remains a significant challenge in emerging economies due to issues such as information asymmetry and institutional gaps in regulation. This study proposes a novel approach to enhance supply chain transparency in emerging economies by leveraging online content and large language models (LLMs). We develop a Supply Chain Knowledge Graph Mining System that integrates advanced LLMs with web crawler technology to automatically collect and analyze supply chain information. The system's effectiveness is validated through a case study focusing on the semiconductor supply chain, a domain that has recently gained significant attention due to supply chain risks. Our results demonstrate that the proposed system provides greater applicability for emerging economies, such as mainland China, complementing the data gaps in existing datasets. However, challenges including the accurate estimation of monetary and material flows, the handling of time series data, synonym disambiguation, and mitigating biases from online content still remain. Future research should focus on addressing these issues to further enhance the system's capabilities and broaden its application to other emerging economies and industries.

6 pages
KG-FPQ: Evaluating Factuality Hallucination in LLMs with Knowledge Graph-based False Premise Questions 2024-12-22 Go
Show

Recent studies have demonstrated that large language models (LLMs) are susceptible to being misled by false premise questions (FPQs), leading to errors in factual knowledge, known as factuality hallucination. Existing benchmarks that assess this vulnerability primarily rely on manual construction, resulting in limited scale and lack of scalability. In this work, we introduce an automated, scalable pipeline to create FPQs based on knowledge graphs (KGs). The first step is modifying true triplets extracted from KGs to create false premises. Subsequently, utilizing the state-of-the-art capabilities of GPTs, we generate semantically rich FPQs. Based on the proposed method, we present a comprehensive benchmark, the Knowledge Graph-based False Premise Questions (KG-FPQ), which contains approximately 178k FPQs across three knowledge domains, at six levels of confusability, and in two task formats. Using KG-FPQ, we conduct extensive evaluations on several representative LLMs and provide valuable insights. The KG-FPQ dataset and code are available at https://github.com/yanxuzhu/KG-FPQ.

COLING2025 main
Transformer Block Coupling and its Correlation with Generalization in LLMs 2024-12-22 Go
Show

Large Language Models (LLMs) have made significant strides in natural language processing, and a precise understanding of the internal mechanisms driving their success is essential. In this work, we trace the trajectories of individual tokens as they pass through transformer blocks, and linearize the system along these trajectories through their Jacobian matrices. By examining the relationships between these Jacobians, we uncover a **transformer block coupling** phenomenon in a variety of LLMs, characterized by the coupling of their top singular vectors across tokens and depth. Our findings reveal that coupling *positively correlates* with model performance, and that this relationship is stronger than with other hyperparameters, namely parameter budget, model depth, and embedding dimension. We further investigate the emergence of these properties through training, noting the development of coupling, as well as an increase in linearity and layer-wise exponential growth in the token trajectories. These collective insights provide a novel perspective on the interactions between token embeddings, and prompt further approaches to study training and generalization in LLMs.
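
The sketch below illustrates the kind of measurement involved, on toy residual blocks rather than a real LLM: compute the Jacobian of each block's update along a token's trajectory and compare top singular vectors. The alignment score is a simplified proxy, not the paper's exact coupling statistic:

```python
import torch

torch.manual_seed(0)
d = 16
block1 = torch.nn.Sequential(torch.nn.Linear(d, d), torch.nn.GELU(), torch.nn.Linear(d, d))
block2 = torch.nn.Sequential(torch.nn.Linear(d, d), torch.nn.GELU(), torch.nn.Linear(d, d))

x = torch.randn(d)          # a single token's hidden state
h = x + block1(x)           # residual update, as in a transformer block

# Linearize each block along the trajectory via its Jacobian.
J1 = torch.autograd.functional.jacobian(lambda v: v + block1(v), x)
J2 = torch.autograd.functional.jacobian(lambda v: v + block2(v), h)

# Coupling proxy: alignment of the top left singular vectors across blocks.
U1, _, _ = torch.linalg.svd(J1)
U2, _, _ = torch.linalg.svd(J2)
alignment = torch.abs(U1[:, 0] @ U2[:, 0]).item()
print(f"top singular-vector alignment: {alignment:.3f}")  # 1.0 = fully coupled
```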

PsychAdapter: Adapting LLM Transformers to Reflect Traits, Personality and Mental Health 2024-12-22 Go
Show

Artificial intelligence-based language generators are now a part of most people's lives. However, by default, they tend to generate "average" language without reflecting the ways in which people differ. Here, we propose a lightweight modification to the standard language model transformer architecture - "PsychAdapter" - that uses empirically derived trait-language patterns to generate natural language for specified personality, demographic, and mental health characteristics (with or without prompting). We applied PsychAdapters to modify OpenAI's GPT-2, Google's Gemma, and Meta's Llama 3 and found generated text to reflect the desired traits. For example, expert raters evaluated PsychAdapter's generated text output and found it matched intended trait levels with 87.3% average accuracy for Big Five personalities, and 96.7% for depression and life satisfaction. PsychAdapter is a novel method to introduce psychological behavior patterns into language models at the foundation level, independent of prompting, by influencing every transformer layer. This approach can create chatbots with specific personality profiles, clinical training tools that mirror language associated with psychological conditions, and machine translations that match an author's reading or education level without taking up LLM context windows. PsychAdapter also allows for the exploration of psychological constructs through natural language expression, extending the natural language processing toolkit to study human psychology.

Teaching LLMs to Refine with Tools 2024-12-22 Go
Show

Large language models (LLMs) can refine their responses based on feedback, enabling self-improvement through iterative training or test-time refinement. However, existing methods predominantly focus on refinement within the same reasoning format, which may lead to non-correcting behaviors. We propose CaP, a novel approach that uses external tools to refine chain-of-thought (CoT) responses generated by the same or other LLMs. CaP employs a two-stage training process: supervised fine-tuning followed by preference optimization with DPO variants. Our observations highlight the critical role of preference optimization in enabling effective refinement. Additionally, we compare several sampling strategies to leverage CoT and tools at inference time. Experimental results demonstrate CaP's potential for effective cross-reasoning refinement and efficient inference.

From Intention To Implementation: Automating Biomedical Research via LLMs 2024-12-22 Go
Show

Conventional biomedical research is increasingly labor-intensive due to the exponential growth of scientific literature and datasets. Artificial intelligence (AI), particularly Large Language Models (LLMs), has the potential to revolutionize this process by automating various steps. Still, significant challenges remain, including the need for multidisciplinary expertise, logicality of experimental design, and performance measurements. This paper introduces BioResearcher, the first end-to-end automated system designed to streamline the entire biomedical research process involving dry lab experiments. BioResearcher employs a modular multi-agent architecture, integrating specialized agents for search, literature processing, experimental design, and programming. By decomposing complex tasks into logically related sub-tasks and utilizing a hierarchical learning approach, BioResearcher effectively addresses the challenges of multidisciplinary requirements and logical complexity. Furthermore, BioResearcher incorporates an LLM-based reviewer for in-process quality control and introduces novel evaluation metrics to assess the quality and automation of experimental protocols. BioResearcher successfully achieves an average execution success rate of 63.07% across eight previously unmet research objectives. The generated protocols outperform typical agent systems by an average of 22.0% on five quality metrics. The system demonstrates significant potential to reduce researchers' workloads and accelerate biomedical discoveries, paving the way for future innovations in automated research systems.

Ask-Before-Detection: Identifying and Mitigating Conformity Bias in LLM-Powered Error Detector for Math Word Problem Solutions 2024-12-22 Go
Show

The rise of large language models (LLMs) offers new opportunities for automatic error detection in education, particularly for math word problems (MWPs). While prior studies demonstrate the promise of LLMs as error detectors, they overlook the presence of multiple valid solutions for a single MWP. Our preliminary analysis reveals a significant performance gap between conventional and alternative solutions in MWPs, a phenomenon we term conformity bias in this work. To mitigate this bias, we introduce the Ask-Before-Detect (AskBD) framework, which generates adaptive reference solutions using LLMs to enhance error detection. Experiments on 200 examples of GSM8K show that AskBD effectively mitigates bias and improves performance, especially when combined with reasoning-enhancing techniques like chain-of-thought prompting.

12 pages, 4 figures
KG4Diagnosis: A Hierarchical Multi-Agent LLM Framework with Knowledge Graph Enhancement for Medical Diagnosis 2024-12-22 Go
Show

Integrating Large Language Models (LLMs) in healthcare diagnosis demands systematic frameworks that can handle complex medical scenarios while maintaining specialized expertise. We present KG4Diagnosis, a novel hierarchical multi-agent framework that combines LLMs with automated knowledge graph construction, encompassing 362 common diseases across medical specialties. Our framework mirrors real-world medical systems through a two-tier architecture: a general practitioner (GP) agent for initial assessment and triage, coordinating with specialized agents for in-depth diagnosis in specific domains. The core innovation lies in our end-to-end knowledge graph generation methodology, incorporating: (1) semantic-driven entity and relation extraction optimized for medical terminology, (2) multi-dimensional decision relationship reconstruction from unstructured medical texts, and (3) human-guided reasoning for knowledge expansion. KG4Diagnosis serves as an extensible foundation for specialized medical diagnosis systems, with capabilities to incorporate new diseases and medical knowledge. The framework's modular design enables seamless integration of domain-specific enhancements, making it valuable for developing targeted medical diagnosis systems. We provide architectural guidelines and protocols to facilitate adoption across medical contexts.

10 pages, 5 figures, published in the AAAI-25 Bridge Program

SubData: A Python Library to Collect and Combine Datasets for Evaluating LLM Alignment on Downstream Tasks 2024-12-21 Go
Show

With the release of ever more capable large language models (LLMs), researchers in NLP and related disciplines have started to explore the usability of LLMs for a wide variety of different annotation tasks. Very recently, a lot of this attention has shifted to tasks that are subjective in nature. Given that the latest generations of LLMs have digested and encoded extensive knowledge about different human subpopulations and individuals, the hope is that these models can be trained, tuned or prompted to align with a wide range of different human perspectives. While researchers already evaluate the success of this alignment via surveys and tests, there is a lack of resources to evaluate the alignment on what often matters most in NLP: the actual downstream tasks. To fill this gap we present SubData, a Python library that offers researchers working on topics related to subjectivity in annotation tasks a convenient way of collecting, combining and using a range of suitable datasets.

4 pages, 1 figure
Creating an LLM-based AI-agent: A high-level methodology towards enhancing LLMs with APIs 2024-12-21 Go
Show

Large Language Models (LLMs) have revolutionized various aspects of engineering and science. Their utility is often bottlenecked by the lack of interaction with the external digital environment. To overcome this limitation and achieve integration of LLMs and Artificial Intelligence (AI) into real-world applications, customized AI agents are being constructed. Based on the technological trends and techniques, we extract a high-level approach for constructing these AI agents, focusing on their underlying architecture. This thesis serves as a comprehensive guide that elucidates a multi-faceted approach for empowering LLMs with the capability to leverage Application Programming Interfaces (APIs). We present a 7-step methodology that begins with the selection of suitable LLMs and the task decomposition that is necessary for complex problem-solving. This methodology includes techniques for generating training data for API interactions and heuristics for selecting the appropriate API among a plethora of options. These steps eventually lead to the generation of API calls that are both syntactically and semantically aligned with the LLM's understanding of a given task. Moreover, we review existing frameworks and tools that facilitate these processes and highlight the gaps in current attempts. In this direction, we propose an on-device architecture that aims to exploit the functionality of carry-on devices by using small models from the Hugging Face community. We examine the effectiveness of these approaches on real-world applications of various domains, including the generation of a piano sheet. Through an extensive analysis of the literature and available technologies, this thesis aims to set a compass for researchers and practitioners to harness the full potential of LLMs augmented with external tool capabilities, thus paving the way for more autonomous, robust, and context-aware AI agents.

in Greek language. This Master's Thesis was supervised by Prof. Nikolaos Papaspyrou (National Technical University of Athens) and Dr. Aifen Sui (Huawei Munich Research Center). English version: pages 57-104. Original submission link: http://artemis.cslab.ece.ntua.gr:8080/jspui/handle/123456789/19180

LLM-Based Intent Processing and Network Optimization Using Attention-Based Hierarchical Reinforcement Learning 2024-12-21 Go
Show

Intent-based network automation is a promising tool to enable easier network management; however, certain challenges need to be effectively addressed. These are: 1) processing intents, i.e., identification of logic and necessary parameters to fulfill an intent, 2) validating an intent to align it with current network status, and 3) satisfying intents via network optimizing functions like xApps and rApps in O-RAN. This paper addresses these points via a three-fold strategy to introduce intent-based automation for O-RAN. First, intents are processed via a lightweight Large Language Model (LLM). Secondly, once an intent is processed, it is validated against future incoming traffic volume profiles (high or low). Finally, a series of network optimization applications (rApps and xApps) have been developed. With their machine learning-based functionalities, they can improve certain key performance indicators such as throughput, delay, and energy efficiency. In this final stage, using an attention-based hierarchical reinforcement learning algorithm, these applications are optimally initiated to satisfy the intent of an operator. Our simulations show that the proposed method can achieve at least a 12% increase in throughput, a 17.1% increase in energy efficiency, and a 26.5% decrease in network delay compared to the baseline algorithms.

Accepted paper at WCNC 2025

Syzygy: Dual Code-Test C to (safe) Rust Translation using LLMs and Dynamic Analysis 2024-12-21 Go
Show

Despite extensive usage in high-performance, low-level systems programming applications, C is susceptible to vulnerabilities due to manual memory management and unsafe pointer operations. Rust, a modern systems programming language, offers a compelling alternative. Its unique ownership model and type system ensure memory safety without sacrificing performance. In this paper, we present Syzygy, an automated approach to translate C to safe Rust. Our technique uses a synergistic combination of LLM-driven code and test translation guided by dynamic-analysis-generated execution information. This paired translation runs incrementally in a loop over the program in dependency order of the code elements while maintaining per-step correctness. Our approach exposes novel insights on combining the strengths of LLMs and dynamic analysis in the context of scaling and combining code generation with testing. We apply our approach to successfully translate Zopfli, a high-performance compression library with ~3000 lines of code and 98 functions. We validate the translation by testing equivalence with the source C program on a set of inputs. To our knowledge, this is the largest automated and test-validated C to safe Rust code translation achieved so far.

Project webpage at https://syzygy-project.github.io/. Preliminary version accepted at LLM4Code 2025, 34 pages

From Correlation to Causation: Understanding Climate Change through Causal Analysis and LLM Interpretations 2024-12-21 Go
Show

This research presents a three-step causal inference framework that integrates correlation analysis, machine learning-based causality discovery, and LLM-driven interpretations to identify socioeconomic factors influencing carbon emissions and contributing to climate change. The approach begins with identifying correlations, progresses to causal analysis, and enhances decision making through LLM-generated inquiries about the context of climate change. The proposed framework offers adaptable solutions that support data-driven policy-making and strategic decision-making in climate-related contexts, uncovering causal relationships within the climate change domain.

The Task Shield: Enforcing Task Alignment to Defend Against Indirect Prompt Injection in LLM Agents 2024-12-21 Go
Show

Large Language Model (LLM) agents are increasingly being deployed as conversational assistants capable of performing complex real-world tasks through tool integration. This enhanced ability to interact with external systems and process various data sources, while powerful, introduces significant security vulnerabilities. In particular, indirect prompt injection attacks pose a critical threat, where malicious instructions embedded within external data sources can manipulate agents to deviate from user intentions. While existing defenses based on rule constraints, source spotlighting, and authentication protocols show promise, they struggle to maintain robust security while preserving task functionality. We propose a novel and orthogonal perspective that reframes agent security from preventing harmful actions to ensuring task alignment, requiring every agent action to serve user objectives. Based on this insight, we develop Task Shield, a test-time defense mechanism that systematically verifies whether each instruction and tool call contributes to user-specified goals. Through experiments on the AgentDojo benchmark, we demonstrate that Task Shield reduces attack success rates to 2.07% while maintaining high task utility (69.79%) on GPT-4o.
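
A toy sketch of the task-alignment idea: every proposed action is checked against the user's stated goal before execution. The `aligned_with_goal` judge here is a hypothetical keyword stub; in practice it would be an LLM or trained verifier rather than this simple filter:

```python
def aligned_with_goal(user_goal: str, proposed_action: str) -> bool:
    # Hypothetical stand-in for an LLM-based alignment judge.
    suspicious = ["transfer funds", "send credentials", "delete all"]
    return not any(s in proposed_action.lower() for s in suspicious)

def guarded_step(user_goal: str, proposed_action: str) -> None:
    """Execute a tool call only if it plausibly serves the user's objective."""
    if aligned_with_goal(user_goal, proposed_action):
        print(f"EXECUTE: {proposed_action}")
    else:
        print(f"BLOCKED (does not serve '{user_goal}'): {proposed_action}")

guarded_step("summarize my inbox", "read_emails(folder='inbox')")
# An instruction injected via a malicious email body would be caught here:
guarded_step("summarize my inbox", "Transfer funds to account 1234")
```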

How Multi-Modal LLMs Reshape Visual Deep Learning Testing? A Comprehensive Study Through the Lens of Image Mutation 2024-12-21 Go
Show

Visual deep learning (VDL) systems have shown significant success in real-world applications like image recognition, object detection, and autonomous driving. To evaluate the reliability of VDL, a mainstream approach is software testing, which requires diverse mutations over image semantics. The rapid development of multi-modal large language models (MLLMs) has introduced revolutionary image mutation potentials through instruction-driven methods. Users can now freely describe desired mutations and let MLLMs generate the mutated images. Hence, parallel to large language models' (LLMs) recent success in traditional software fuzzing, one may also expect MLLMs to be promising for VDL testing in terms of offering unified, diverse, and complex image mutations. However, the quality and applicability of MLLM-based mutations in VDL testing remain largely unexplored. We present the first study, aiming to assess MLLMs' adequacy from 1) the semantic validity of MLLM mutated images, 2) the alignment of MLLM mutated images with their text instructions (prompts), and 3) the faithfulness of how different mutations preserve semantics that ought to remain unchanged. With large-scale human studies and quantitative evaluations, we identify MLLM's promising potentials in expanding the covered semantics of image mutations. Notably, while SoTA MLLMs (e.g., GPT-4V) fail to support or perform worse in editing existing semantics in images (as in traditional mutations like rotation), they generate high-quality test inputs using "semantic-replacement" mutations (e.g., "dress a dog with clothes"), which bring extra semantics to images; these were infeasible for past approaches. Hence, we view MLLM-based mutations as a vital complement to traditional mutations, and advocate future VDL testing tasks to combine MLLM-based methods and traditional image mutations for comprehensive and reliable testing.

TimeRAG: BOOSTING LLM Time Series Forecasting via Retrieval-Augmented Generation 2024-12-21 Go
Show

Although the rise of large language models (LLMs) has introduced new opportunities for time series forecasting, existing LLM-based solutions require excessive training and exhibit limited transferability. In view of these challenges, we propose TimeRAG, a framework that incorporates Retrieval-Augmented Generation (RAG) into time series forecasting LLMs, which constructs a time series knowledge base from historical sequences, retrieves reference sequences from the knowledge base that exhibit similar patterns to the query sequence measured by Dynamic Time Warping (DTW), and combines these reference sequences and the prediction query as a textual prompt to the time series forecasting LLM. Experiments on datasets from various domains show that the integration of RAG improved the prediction accuracy of the original model by 2.97% on average.
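
To illustrate the retrieval step, here is a small sketch with a textbook DTW implementation and a toy knowledge base of sine segments; the prompt format and data are invented for illustration, not taken from the paper:

```python
import numpy as np

def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Textbook O(len(a) * len(b)) dynamic time warping distance."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[n, m])

def retrieve_references(query: np.ndarray, knowledge_base: list, k: int = 2) -> list:
    """Return the k stored sequences most similar to the query under DTW."""
    return sorted(knowledge_base, key=lambda seq: dtw_distance(query, seq))[:k]

rng = np.random.default_rng(1)
kb = [np.sin(np.linspace(0, 6, 48) + rng.normal(0, 0.5)) for _ in range(20)]
query = np.sin(np.linspace(0, 6, 48))
refs = retrieve_references(query, kb)

# The retrieved patterns are prepended to the forecasting prompt as context.
prompt = (f"Reference patterns: {[r.round(2).tolist() for r in refs]}\n"
          f"Given the series {query.round(2).tolist()}, forecast the next 8 steps.")
print(prompt[:120], "...")
```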

MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool 2024-12-21 Go
Show

Large language model (LLM) serving has transformed from stateless to stateful systems, utilizing techniques like context caching and disaggregated inference. These optimizations extend the lifespan and domain of the KV cache, necessitating a new architectural approach. We present MemServe, a unified system that integrates both inter-request and intra-request optimizations. MemServe introduces MemPool, an elastic memory pool managing distributed memory and KV caches across serving instances. Using MemPool APIs, MemServe combines context caching with disaggregated inference for the first time, supported by a global scheduler that enhances cache reuse through a global prompt tree-based locality-aware policy. Tests show that MemServe significantly improves job completion time and time-to-first-token.

TestART: Improving LLM-based Unit Testing via Co-evolution of Automated Generation and Repair Iteration 2024-12-21 Go
Show

Unit testing is crucial for detecting bugs in individual program units but consumes time and effort. Recently, large language models (LLMs) have demonstrated remarkable capabilities in generating unit test cases. However, several problems limit their ability to generate high-quality unit test cases: (1) compilation and runtime errors caused by the hallucination of LLMs; (2) lack of testing and coverage feedback information restricting the increase of code coverage; (3) the repetitive suppression problem causing invalid LLM-based repair and generation attempts. To address these limitations, we propose TestART, a novel unit test generation method. TestART improves LLM-based unit testing via co-evolution of automated generation and repair iteration, representing a significant advancement in automated unit test generation. TestART leverages the template-based repair strategy to effectively fix bugs in LLM-generated test cases for the first time. Meanwhile, TestART extracts coverage information from successful test cases and uses it as coverage-guided testing feedback. It also incorporates positive prompt injection to prevent repetition suppression, thereby enhancing the sufficiency of the final test case. This synergy between generation and repair elevates the correctness and sufficiency of the produced test cases significantly beyond previous methods. In comparative experiments, TestART demonstrates an 18% improvement in pass rate and a 20% enhancement in coverage across three types of datasets compared to baseline models. Additionally, it achieves better coverage rates than EvoSuite with only half the number of test cases. These results demonstrate TestART's superior ability to produce high-quality unit test cases by harnessing the power of LLMs while overcoming their inherent flaws.

Regulator-Manufacturer AI Agents Modeling: Mathematical Feedback-Driven Multi-Agent LLM Framework 2024-12-21 Go
Show

The increasing complexity of regulatory updates from global authorities presents significant challenges for medical device manufacturers, necessitating agile strategies to sustain compliance and maintain market access. Concurrently, regulatory bodies must effectively monitor manufacturers' responses and develop strategic surveillance plans. This study employs a multi-agent modeling approach, enhanced with Large Language Models (LLMs), to simulate regulatory dynamics and examine the adaptive behaviors of key actors, including regulatory bodies, manufacturers, and competitors. These agents operate within a simulated environment governed by regulatory flow theory, capturing the impacts of regulatory changes on compliance decisions, market adaptation, and innovation strategies. Our findings illuminate the influence of regulatory shifts on industry behaviour and identify strategic opportunities for improving regulatory practices, optimizing compliance, and fostering innovation. By leveraging the integration of multi-agent systems and LLMs, this research provides a novel perspective and offers actionable insights for stakeholders navigating the evolving regulatory landscape of the medical device industry.

Evaluating and Enhancing LLMs for Multi-turn Text-to-SQL with Multiple Question Types 2024-12-21 Go
Show

Recent advancements in large language models (LLMs) have significantly advanced text-to-SQL systems. However, most LLM-based methods often narrowly focus on SQL generation, neglecting the complexities of real-world conversational queries. This oversight can lead to unreliable responses, particularly for ambiguous questions that cannot be directly addressed with SQL. To bridge this gap, we propose MMSQL, a comprehensive test suite designed to evaluate the question classification and SQL generation capabilities of LLMs by simulating real-world scenarios with diverse question types and multi-turn Q&A interactions. Using MMSQL, we assessed the performance of popular LLMs, including both open-source and closed-source models, and identified key factors impacting their performance in such scenarios. Moreover, we introduce an LLM-based multi-agent framework that employs specialized agents to identify question types and determine appropriate answering strategies. Our experiments demonstrate that this approach significantly enhances the model's ability to navigate the complexities of conversational dynamics, effectively handling the diverse and complex nature of user queries.

20 pages, 3 figures
InsightLens: Augmenting LLM-Powered Data Analysis with Interactive Insight Management and Navigation 2024-12-21 Go
Show

The proliferation of large language models (LLMs) has revolutionized the capabilities of natural language interfaces (NLIs) for data analysis. LLMs can perform multi-step and complex reasoning to generate data insights based on users' analytic intents. However, these insights often entangle with an abundance of contexts in analytic conversations such as code, visualizations, and natural language explanations. This hinders efficient recording, organization, and navigation of insights within the current chat-based LLM interfaces. In this paper, we first conduct a formative study with eight data analysts to understand their general workflow and pain points of insight management during LLM-powered data analysis. Accordingly, we introduce InsightLens, an interactive system to overcome such challenges. Built upon an LLM-agent-based framework that automates insight recording and organization along with the analysis process, InsightLens visualizes the complex conversational contexts from multiple aspects to facilitate insight navigation. A user study with twelve data analysts demonstrates the effectiveness of InsightLens, showing that it significantly reduces users' manual and cognitive effort without disrupting their conversational data analysis workflow, leading to a more efficient analysis experience.

Accepted to IEEE TVCG (PacificVis 2025)

$C^2$: Scalable Auto-Feedback for LLM-based Chart Generation 2024-12-21 Go
Show

Generating high-quality charts with Large Language Models (LLMs) presents significant challenges due to limited data and the high cost of scaling through human curation. $\langle \text{instruction}, \text{data}, \text{code} \rangle$ triplets are scarce and expensive to manually curate as their creation demands technical expertise. To address this scalability challenge, we introduce a reference-free automatic feedback generator, which eliminates the need for costly human intervention. Our novel framework, C$^2$, consists of (1) an automatic feedback provider (ChartAF) and (2) a diverse, reference-free dataset (ChartUIE-8K). The results are compelling: in our first experiment, 74% of respondents strongly preferred, and 10% preferred, the results after feedback. The second post-feedback experiment demonstrates that ChartAF outperforms nine baselines. Moreover, ChartUIE-8K significantly improves data diversity by increasing queries, datasets, and chart types by 5982%, 1936%, and 91%, respectively, over benchmarks. Finally, a study of LLM users revealed that 94% of participants preferred ChartUIE-8K's queries, with 93% deeming them aligned with real-world use cases. Core contributions are available as open-source at chartsquared.github.io, with ample qualitative examples.

Preprint
HDMoLE: Mixture of LoRA Experts with Hierarchical Routing and Dynamic Thresholds for Fine-Tuning LLM-based ASR Models 2024-12-21 Go
Show

Recent advancements in integrating Large Language Models (LLM) with automatic speech recognition (ASR) have performed remarkably in general domains. While supervised fine-tuning (SFT) of all model parameters is often employed to adapt pre-trained LLM-based ASR models to specific domains, it imposes high computational costs and notably reduces their performance in general domains. In this paper, we propose *HDMoLE*, a novel parameter-efficient multi-domain fine-tuning method that adapts pre-trained LLM-based ASR models to multi-accent domains without catastrophic forgetting. HDMoLE leverages hierarchical routing and dynamic thresholds, combining low-rank adaptation (LoRA) with a mixture of experts (MoE), and can be generalized to any linear layer. Hierarchical routing establishes a clear correspondence between LoRA experts and accent domains, improving cross-domain collaboration among the LoRA experts. Unlike the static Top-K strategy for activating LoRA experts, dynamic thresholds can adaptively activate varying numbers of LoRA experts at each MoE layer. Experiments on the multi-accent and standard Mandarin datasets demonstrate the efficacy of HDMoLE. Applying HDMoLE to an LLM-based ASR model projector module achieves similar performance to full fine-tuning in the target multi-accent domains while using only 9.6% of the trainable parameters required for full fine-tuning and minimal degradation in the source general domain.

Accepted by ICASSP 2025
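
A minimal sketch of a dynamic-threshold router (the MoE-gating half of the idea) appears below; the hierarchical accent routing and the LoRA experts themselves are omitted, and the hard threshold comparison skips the differentiability tricks a real implementation would need:

```python
import torch

class DynamicThresholdGate(torch.nn.Module):
    """Activate every expert whose gate probability exceeds a learnable
    per-layer threshold, instead of a fixed Top-K."""
    def __init__(self, d_model: int, n_experts: int):
        super().__init__()
        self.router = torch.nn.Linear(d_model, n_experts)
        self.threshold = torch.nn.Parameter(torch.tensor(1.0 / n_experts))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        probs = torch.softmax(self.router(x), dim=-1)   # (batch, n_experts)
        mask = (probs >= self.threshold).float()        # variable count per input
        weights = probs * mask
        # Renormalize over the active experts (guard against an empty mask).
        return weights / weights.sum(dim=-1, keepdim=True).clamp_min(1e-9)

gate = DynamicThresholdGate(d_model=32, n_experts=4)
w = gate(torch.randn(5, 32))
print((w > 0).sum(dim=-1))  # a *varying* number of active experts per input
```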

Transducer-Llama: Integrating LLMs into Streamable Transducer-based Speech Recognition 2024-12-21 Go
Show

While large language models (LLMs) have been applied to automatic speech recognition (ASR), the task of making the model streamable remains a challenge. This paper proposes a novel model architecture, Transducer-Llama, that integrates LLMs into a Factorized Transducer (FT) model, naturally enabling streaming capabilities. Furthermore, given that the large vocabulary of LLMs can cause data sparsity issues and increased training costs for spoken language systems, this paper introduces an efficient vocabulary adaptation technique to align LLMs with speech system vocabularies. The results show that directly optimizing the FT model with a strong pre-trained LLM-based predictor using the RNN-T loss yields some but limited improvements over a smaller pre-trained LM predictor. Therefore, this paper proposes a weak-to-strong LM swap strategy, using a weak LM predictor during RNN-T loss training and then replacing it with a strong LLM. After LM replacement, the minimum word error rate (MWER) loss is employed to finetune the integration of the LLM predictor with the Transducer-Llama model. Experiments on the LibriSpeech and large-scale multi-lingual LibriSpeech corpora show that the proposed streaming Transducer-Llama approach gave a 17% relative WER reduction (WERR) over a strong FT baseline and a 32% WERR over an RNN-T baseline.

Accepted by ICASSP 2025

Has LLM Reached the Scaling Ceiling Yet? Unified Insights into LLM Regularities and Constraints 2024-12-21 Go
Show

Large Language Models (LLMs) have demonstrated remarkable capabilities, yet their scalability raises a critical question: Have we reached the scaling ceiling? This paper addresses this pivotal question by developing a unified theoretical framework that integrates mathematical and statistical insights to explain the scaling dynamics of LLMs. We present: 1. Central Limit Theorem (CLT) for Hidden Representations: We show that noise in hidden representations scales inversely with context size, explaining stabilization effects and the limits of context length improvements. 2. Bias-Variance Decomposition: We decompose next-token prediction loss into irreducible entropy, capacity-driven bias, and finite sample variance, revealing trade-offs where scaling yields diminishing returns. 3. Emergent SNR Thresholds: By defining signal-to-noise ratio (SNR), we quantify how capabilities emerge abruptly once SNR surpasses a threshold, offering insights into when scaling becomes less effective. Through this framework, we conclude that while LLMs have not reached an absolute scaling ceiling, practical constraints are increasingly prominent: diminishing returns, resource inefficiencies, and data limitations. Future progress will require a shift from brute-force scaling to innovations in architecture, data quality, and training paradigms. This work provides a roadmap for guiding the efficient development of next-generation LLMs and advancing the field beyond traditional scaling strategies. Keywords: Large Language Models; Scaling Ceiling; Central Limit Theorem; Bias-Variance Trade-Off; Signal-to-Noise Ratio; Emergent Capabilities
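
Schematically, and with notation assumed rather than taken from the paper (N for model capacity, n for training data, tau for the emergence threshold), the decomposition and threshold claim might be written as:

```latex
% Schematic only: the symbols N (capacity), n (data), and \tau are assumptions.
\mathbb{E}\left[\mathcal{L}(N, n)\right]
  \;\approx\; \underbrace{H(p^{*})}_{\text{irreducible entropy}}
  \;+\; \underbrace{B(N)}_{\text{capacity-driven bias}}
  \;+\; \underbrace{V(N, n)}_{\text{finite-sample variance}},
\qquad
\text{a capability emerges once } \mathrm{SNR}(N, n) > \tau .
```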

REAL: Response Embedding-based Alignment for LLMs 2024-12-21 Go
Show

Aligning large language models (LLMs) to human preferences is a crucial step in building helpful and safe AI tools, which usually involve training on supervised datasets. Popular algorithms such as Direct Preference Optimization rely on pairs of AI-generated responses ranked according to human feedback. The response pair annotation process is the most labor-intensive and costly part of the alignment pipeline, and improving its efficiency and annotation quality would have a meaningful impact on AI development. We propose REAL: Response Embedding-based Alignment for LLMs, a strategy for constructing a high-quality training dataset that focuses on acquiring the most informative response pairs for labeling out of a set of response candidates. Our selection process is based on embedding responses independently of prompts. Experimental results on the real-world dataset SHP2 and synthetic HH-RLHF benchmarks indicate that choosing dissimilar response pairs enhances the direct alignment of LLMs while reducing inherited labeling errors. The model aligned on dissimilar response pairs obtained a better margin and win rate on the dialogue task. Our findings suggest that focusing on distinct pairs can reduce the label error to improve the efficiency of LLM alignment, saving up to 65% of annotators' work.
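
A minimal sketch of the selection step, assuming responses have already been embedded (the embedding dimensionality and data here are placeholders): pick the pair with the lowest cosine similarity as the most informative pair to annotate:

```python
import numpy as np

def most_dissimilar_pair(embeddings: np.ndarray) -> tuple[int, int]:
    """Return indices of the two responses with the lowest cosine similarity,
    i.e. the most informative pair to send to human annotators."""
    E = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = E @ E.T
    np.fill_diagonal(sims, np.inf)   # ignore self-similarity
    i, j = np.unravel_index(np.argmin(sims), sims.shape)
    return int(i), int(j)

rng = np.random.default_rng(0)
candidate_embeddings = rng.normal(size=(6, 384))  # e.g. 6 sampled responses
print(most_dissimilar_pair(candidate_embeddings))
```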

SYMPHONY: Improving Memory Management for LLM Inference Workloads 2024-12-21 Go
Show

Large Language Models (LLMs) are increasingly being deployed in applications such as chatbots, code editors, and conversational agents. A key feature of LLMs is their ability to engage in multi-turn interactions with humans or external tools, enabling a wide range of tasks. Each new request in a multi-turn interaction depends on the intermediate state, specifically the key-value (K,V) caches, from previous requests in the ongoing interaction. Existing serving engines either recompute the K,V caches or offload them to main memory. Profiling reveals that recomputation can result in over 99% of processed tokens being redundant. On the other hand, offloading K,V caches from GPU memory makes inference serving stateful, leading to load imbalances across the cluster. To address these challenges, we developed SYMPHONY. SYMPHONY leverages the observation that multi-turn workloads provide additional hints that allow K,V caches to be migrated off the critical serving path. By utilizing these hints, SYMPHONY dynamically migrates K,V caches to enable fine-grained scheduling of inference requests. Our experiments demonstrate that SYMPHONY can handle over 8x the number of requests compared to state-of-the-art baselines, with a similar latency profile.
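
A toy sketch of the hint-driven migration idea, with all names invented: an end-of-turn hint lets the session's K,V cache move to the least-loaded worker before the next request arrives, keeping the transfer off the critical serving path.

```python
from dataclasses import dataclass, field

@dataclass
class Worker:
    """Toy stand-in for an inference worker holding per-session caches."""
    name: str
    caches: dict = field(default_factory=dict)

def migrate_on_turn_end(session_id: str, cache: object,
                        src: Worker, workers: list[Worker]) -> Worker:
    # Invented logic: at end of turn, pick the least-loaded worker and
    # move the cache there ahead of the next request.
    dst = min(workers, key=lambda w: len(w.caches))
    src.caches.pop(session_id, None)
    dst.caches[session_id] = cache
    return dst
```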

What if LLMs Have Different World Views: Simulating Alien Civilizations with LLM-based Agents 2024-12-20 Go
Show

In this study, we introduce "CosmoAgent," an innovative artificial intelligence framework utilizing Large Language Models (LLMs) to simulate complex interactions between human and extraterrestrial civilizations, with a special emphasis on Stephen Hawking's cautionary advice about not sending radio signals haphazardly into the universe. The goal is to assess the feasibility of peaceful coexistence while considering potential risks that could threaten well-intentioned civilizations. Employing mathematical models and state transition matrices, our approach quantitatively evaluates the development trajectories of civilizations, offering insights into future decision-making at critical points of growth and saturation. Furthermore, the paper acknowledges the vast diversity in potential living conditions across the universe, which could foster unique cosmologies, ethical codes, and worldviews among various civilizations. Recognizing the Earth-centric bias inherent in current LLM designs, we propose the novel concept of using LLMs with diverse ethical paradigms and simulating interactions between entities with distinct moral principles. This innovative research provides a new way to understand complex inter-civilizational dynamics, expanding our perspective while pioneering novel strategies for conflict resolution, which are crucial for preventing interstellar conflicts. We have also released the code and datasets to enable further academic investigation into this interesting area of research. The code is available at https://github.com/MingyuJ666/Simulating-Alien-Civilizations-with-LLM-based-Agents.
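
The state-transition modeling the abstract mentions can be illustrated with a toy Markov chain; the states and probabilities below are invented for illustration, not taken from the paper.

```python
import numpy as np

# Toy state-transition simulation of a civilization's development stage;
# states and transition probabilities are made up for illustration.
states = ["emerging", "expanding", "saturated"]
T = np.array([[0.8, 0.2, 0.0],   # each row sums to 1
              [0.1, 0.7, 0.2],
              [0.0, 0.1, 0.9]])

rng = np.random.default_rng(0)
state, trajectory = 0, ["emerging"]
for _ in range(10):
    state = rng.choice(len(states), p=T[state])
    trajectory.append(states[state])
print(trajectory)
```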

GLIDER: Grading LLM Interactions and Decisions using Explainable Ranking 2024-12-20 Go
Show

The LLM-as-judge paradigm is increasingly being adopted for automated evaluation of model outputs. While LLM judges have shown promise on constrained evaluation tasks, closed-source LLMs display critical shortcomings when deployed in real-world applications due to challenges with fine-grained metrics and explainability, while task-specific evaluation models lack cross-domain generalization. We introduce GLIDER, a powerful 3B evaluator LLM that can score any text input and associated context on arbitrary user-defined criteria. GLIDER shows higher Pearson's correlation than GPT-4o on FLASK and greatly outperforms prior evaluation models, achieving comparable performance to LLMs 17x its size. GLIDER supports fine-grained scoring, multilingual reasoning, span highlighting and was trained on 685 domains and 183 criteria. Extensive qualitative analysis shows that GLIDER scores are highly correlated with human judgments, with 91.3% human agreement. We have open-sourced GLIDER to facilitate future research.

Human-Readable Adversarial Prompts: An Investigation into LLM Vulnerabilities Using Situational Context 2024-12-20 Go
Show

Previous research on LLM vulnerabilities often relied on nonsensical adversarial prompts, which were easily detectable by automated methods. We address this gap by focusing on human-readable adversarial prompts, a more realistic and potent threat. Our key contributions are situation-driven attacks leveraging movie scripts to create contextually relevant, human-readable prompts that successfully deceive LLMs, adversarial suffix conversion to transform nonsensical adversarial suffixes into meaningful text, and AdvPrompter with p-nucleus sampling, a method to generate diverse, human-readable adversarial suffixes, improving attack efficacy in models like GPT-3.5 and Gemma 7B. Our findings demonstrate that LLMs can be tricked by sophisticated adversaries into producing harmful responses with human-readable adversarial prompts, and that there remains considerable room for improving LLM robustness.

Understanding Layer Significance in LLM Alignment 2024-12-20 Go
Show

Aligning large language models (LLMs) through fine-tuning is essential for tailoring them to specific applications. Therefore, understanding what LLMs learn during the alignment process is crucial. Recent studies suggest that alignment primarily adjusts a model's presentation style rather than its foundational knowledge, indicating that only certain components of the model are significantly impacted. To delve deeper into LLM alignment, we propose to identify which layers within LLMs are most critical to the alignment process, thereby uncovering how alignment influences model behavior at a granular level. We propose a novel approach to identify the important layers for LLM alignment (ILA). It involves learning a binary mask for each incremental weight matrix in the LoRA algorithm, indicating the significance of each layer. ILA consistently identifies important layers across various alignment datasets, with nearly 90% overlap even with substantial dataset differences, highlighting fundamental patterns in LLM alignment. Experimental results indicate that freezing non-essential layers improves overall model performance, while selectively tuning the most critical layers significantly enhances fine-tuning efficiency with minimal performance loss.
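
A simplified sketch of the masking idea behind ILA (hypothetical names; the paper's procedure is more involved): attach a learnable gate to each LoRA incremental weight matrix and read layer importance off the learned gates.

```python
import torch
import torch.nn as nn

class MaskedLoRALinear(nn.Module):
    """Sketch of per-layer importance masking on a LoRA update: effective
    weight is W + m * (B @ A), where the gate m can be binarized to judge
    whether this layer's update matters. Illustrative, not the paper's code.
    """
    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.gate = nn.Parameter(torch.zeros(1))  # mask logit for this layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        mask = torch.sigmoid(self.gate)           # relaxed binary mask in (0, 1)
        return self.base(x) + mask * (x @ self.A.T @ self.B.T)
```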

Benchmarking LLMs and SLMs for patient reported outcomes 2024-12-20 Go
Show

LLMs have transformed the execution of numerous tasks, including those in the medical domain. Among these, summarizing patient-reported outcomes (PROs) into concise natural language reports is of particular interest to clinicians, as it enables them to focus on critical patient concerns and spend more time in meaningful discussions. While existing work with LLMs like GPT-4 has shown impressive results, real breakthroughs could arise from leveraging SLMs as they offer the advantage of being deployable locally, ensuring patient data privacy and compliance with healthcare regulations. This study benchmarks several SLMs against LLMs for summarizing patient-reported Q&A forms in the context of radiotherapy. Using various metrics, we evaluate their precision and reliability. The findings highlight both the promise and limitations of SLMs for high-stakes medical tasks, fostering more efficient and privacy-preserving AI-driven healthcare solutions.

10 pages
Offline Reinforcement Learning for LLM Multi-Step Reasoning 2024-12-20 Go
Show

Improving the multi-step reasoning ability of large language models (LLMs) with offline reinforcement learning (RL) is essential for quickly adapting them to complex tasks. While Direct Preference Optimization (DPO) has shown promise in aligning LLMs with human preferences, it is less suitable for multi-step reasoning tasks because (1) DPO relies on paired preference data, which is not readily available for multi-step reasoning tasks, and (2) it treats all tokens uniformly, making it ineffective for credit assignment in multi-step reasoning tasks, which often come with sparse reward. In this work, we propose OREO (Offline Reasoning Optimization), an offline RL method for enhancing LLM multi-step reasoning. Building on insights from previous works of maximum entropy reinforcement learning, it jointly learns a policy model and value function by optimizing the soft Bellman Equation. We show in principle that it reduces the need to collect pairwise data and enables better credit assignment. Empirically, OREO surpasses existing offline learning methods on multi-step reasoning benchmarks, including mathematical reasoning tasks (GSM8K, MATH) and embodied agent control (ALFWorld). The approach can be extended to a multi-iteration framework when additional resources are available. Furthermore, the learned value function can be leveraged to guide the tree search for free, which can further boost performance during test time.
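
For reference, the soft Bellman equation from maximum-entropy RL that OREO builds on has the standard form below (temperature β, discount γ); this is the textbook equation, not the paper's exact training objective.

```latex
% Standard maximum-entropy (soft) Bellman equations:
Q(s, a) = r(s, a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\big[ V(s') \big],
\qquad
V(s) = \beta \log \sum_{a} \exp\big( Q(s, a) / \beta \big)
```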

What is the Role of Small Models in the LLM Era: A Survey 2024-12-20 Go
Show

Large Language Models (LLMs) have made significant progress in advancing artificial general intelligence (AGI), leading to the development of increasingly large models such as GPT-4 and LLaMA-405B. However, scaling up model sizes results in exponentially higher computational costs and energy consumption, making these models impractical for academic researchers and businesses with limited resources. At the same time, Small Models (SMs) are frequently used in practical settings, although their significance is currently underestimated. This raises important questions about the role of small models in the era of LLMs, a topic that has received limited attention in prior research. In this work, we systematically examine the relationship between LLMs and SMs from two key perspectives: Collaboration and Competition. We hope this survey provides valuable insights for practitioners, fostering a deeper understanding of the contribution of small models and promoting more efficient use of computational resources. The code is available at https://github.com/tigerchen52/role_of_small_models

a survey paper of small models

PromptOptMe: Error-Aware Prompt Compression for LLM-based MT Evaluation Metrics 2024-12-20 Go
Show

Evaluating the quality of machine-generated natural language content is a challenging task in Natural Language Processing (NLP). Recently, large language models (LLMs) like GPT-4 have been employed for this purpose, but they are computationally expensive due to the extensive token usage required by complex evaluation prompts. In this paper, we propose a prompt optimization approach that uses a smaller, fine-tuned language model to compress the input data for the evaluation prompt, thus reducing token usage and computational cost when using larger LLMs for downstream evaluation. Our method involves a two-stage fine-tuning process: supervised fine-tuning followed by preference optimization to refine the model's outputs based on human preferences. We focus on Machine Translation (MT) evaluation and utilize the GEMBA-MQM metric as a starting point. Our results show a $2.37\times$ reduction in token usage without any loss in evaluation quality. This work makes state-of-the-art LLM-based metrics like GEMBA-MQM more cost-effective and efficient, enhancing their accessibility for broader use.

Deciphering the Underserved: Benchmarking LLM OCR for Low-Resource Scripts 2024-12-20 Go
Show

This study investigates the potential of Large Language Models (LLMs), particularly GPT-4o, for Optical Character Recognition (OCR) in low-resource scripts such as Urdu, Albanian, and Tajik, with English serving as a benchmark. Using a meticulously curated dataset of 2,520 images incorporating controlled variations in text length, font size, background color, and blur, the research simulates diverse real-world challenges. Results emphasize the limitations of zero-shot LLM-based OCR, particularly for linguistically complex scripts, highlighting the need for annotated datasets and fine-tuned models. This work underscores the urgency of addressing accessibility gaps in text digitization, paving the way for inclusive and robust OCR solutions for underserved languages.

The Evolution of LLM Adoption in Industry Data Curation Practices 2024-12-20 Go
Show

As large language models (LLMs) grow increasingly adept at processing unstructured text data, they offer new opportunities to enhance data curation workflows. This paper explores the evolution of LLM adoption among practitioners at a large technology company, evaluating the impact of LLMs in data curation tasks through participants' perceptions, integration strategies, and reported usage scenarios. Through a series of surveys, interviews, and user studies, we provide a timely snapshot of how organizations are navigating a pivotal moment in LLM evolution. In Q2 2023, we conducted a survey to assess LLM adoption in industry for development tasks (N=84), and facilitated expert interviews to assess evolving data needs (N=10) in Q3 2023. In Q2 2024, we explored practitioners' current and anticipated LLM usage through a user study involving two LLM-based prototypes (N=12). While each study addressed distinct research goals, they revealed a broader narrative about evolving LLM usage in aggregate. We discovered an emerging shift in data understanding from heuristic-first, bottom-up approaches to insights-first, top-down workflows supported by LLMs. Furthermore, to respond to a more complex data landscape, data practitioners now supplement traditional subject-expert-created 'golden datasets' with LLM-generated 'silver' datasets and rigorously validated 'super golden' datasets curated by diverse experts. This research sheds light on the transformative role of LLMs in large-scale analysis of unstructured data and highlights opportunities for further tool development.

19 pages, 4 tables, 3 figures

All-in-One Tuning and Structural Pruning for Domain-Specific LLMs 2024-12-20 Go
Show

Existing pruning techniques for large language models (LLMs) targeting domain-specific applications typically follow a two-stage process: pruning the pretrained general-purpose LLMs and then fine-tuning the pruned LLMs on specific domains. However, the pruning decisions, derived from the pretrained weights, remain unchanged during fine-tuning, even if the weights have been updated. Therefore, such a combination of the pruning decisions and the fine-tuned weights may be suboptimal, leading to non-negligible performance degradation. To address these limitations, we propose ATP: All-in-One Tuning and Structural Pruning, a unified one-stage structural pruning and fine-tuning approach that dynamically identifies the current optimal substructure throughout the fine-tuning phase via a trainable pruning decision generator. Moreover, given the limited available data for domain-specific applications, Low-Rank Adaptation (LoRA) becomes a common technique to fine-tune the LLMs. In ATP, we introduce LoRA-aware forward and sparsity regularization to ensure that the substructures corresponding to the learned pruning decisions can be directly removed after the ATP process. ATP outperforms the state-of-the-art two-stage pruning methods on tasks in the legal and healthcare domains. More specifically, ATP recovers up to 88% and 91% performance of the dense model when pruning 40% parameters of LLaMA2-7B and LLaMA3-8B models, respectively.

Updated a typo in the author list;
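
A compact sketch of the trainable-pruning-decision idea (hypothetical names; ATP's decision generator and LoRA-aware forward are more elaborate): each structural unit gets a relaxed binary gate, and a sparsity term pushes gates toward zero during fine-tuning so the gated units can later be removed.

```python
import torch
import torch.nn as nn

class GatedUnits(nn.Module):
    """Relaxed binary gates over structural units (e.g., heads or channels),
    learned jointly with fine-tuning. Illustrative, not the paper's design."""
    def __init__(self, num_units: int):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(num_units))

    def forward(self, unit_outputs: torch.Tensor) -> torch.Tensor:
        # unit_outputs: (batch, num_units, dim); one gate per unit
        gates = torch.sigmoid(self.logits).view(1, -1, 1)
        return unit_outputs * gates

    def sparsity_loss(self) -> torch.Tensor:
        return torch.sigmoid(self.logits).sum()  # L1-style penalty on gates

# total_loss = task_loss + lam * gated.sparsity_loss()
```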

Image Restoration

Back to Index

Title Date Cool Paper Abstract Comment
Underwater Image Restoration via Polymorphic Large Kernel CNNs 2024-12-24 Go
Show

Underwater Image Restoration (UIR) remains a challenging task in computer vision due to the complex degradation of images in underwater environments. While recent approaches have leveraged various deep learning techniques, including Transformers and complex, parameter-heavy models to achieve significant improvements in restoration effects, we demonstrate that pure CNN architectures with lightweight parameters can achieve comparable results. In this paper, we introduce UIR-PolyKernel, a novel method for underwater image restoration that leverages Polymorphic Large Kernel CNNs. Our approach uniquely combines large kernel convolutions of diverse sizes and shapes to effectively capture long-range dependencies within underwater imagery. Additionally, we introduce a Hybrid Domain Attention module that integrates frequency and spatial domain attention mechanisms to enhance feature importance. By leveraging the frequency domain, we can capture hidden features that may not be perceptible to humans but are crucial for identifying patterns in both underwater and on-air images. This approach enhances the generalization and robustness of our UIR model. Extensive experiments on benchmark datasets demonstrate that UIR-PolyKernel achieves state-of-the-art performance in underwater image restoration tasks, both quantitatively and qualitatively. Our results show that well-designed pure CNN architectures can effectively compete with more complex models, offering a balance between performance and computational efficiency. This work provides new insights into the potential of CNN-based approaches for challenging image restoration tasks in underwater environments. The code is available at https://github.com/CXH-Research/UIR-PolyKernel.

Accepted by ICASSP2025
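
A minimal sketch of the polymorphic-large-kernel idea (the kernel set and fusion below are assumptions, not the paper's configuration): run depthwise convolutions with diverse large shapes in parallel, square and strip, then fuse.

```python
import torch
import torch.nn as nn

class PolymorphicLargeKernelBlock(nn.Module):
    """Parallel depthwise convolutions with large kernels of diverse
    shapes, summed and fused; illustrative, not the paper's exact block."""
    def __init__(self, channels: int):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, 31, padding=15, groups=channels),             # large square
            nn.Conv2d(channels, channels, (1, 31), padding=(0, 15), groups=channels),   # horizontal strip
            nn.Conv2d(channels, channels, (31, 1), padding=(15, 0), groups=channels),   # vertical strip
        ])
        self.fuse = nn.Conv2d(channels, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fuse(sum(branch(x) for branch in self.branches)) + x

# PolymorphicLargeKernelBlock(32)(torch.randn(1, 32, 64, 64))
```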

UNet--: Memory-Efficient and Feature-Enhanced Network Architecture based on U-Net with Reduced Skip-Connections 2024-12-24 Go
Show

U-Net models with encoder, decoder, and skip-connection components have demonstrated effectiveness in a variety of vision tasks. The skip-connections transmit fine-grained information from the encoder to the decoder. It is necessary to maintain the feature maps used by the skip-connections in memory before the decoding stage. Therefore, they are not friendly to devices with limited resources. In this paper, we propose a universal method and architecture to reduce memory consumption while generating enhanced feature maps to improve network performance. To this end, we design a simple but effective Multi-Scale Information Aggregation Module (MSIAM) in the encoder and an Information Enhancement Module (IEM) in the decoder. The MSIAM aggregates multi-scale feature maps into single-scale with less memory. After that, the aggregated feature maps can be expanded and enhanced to multi-scale feature maps by the IEM. By applying the proposed method on NAFNet, a SOTA model in the field of image restoration, we design a memory-efficient and feature-enhanced network architecture, UNet--. The memory demand by the skip-connections in the UNet-- is reduced by 93.3%, while the performance is improved compared to NAFNet. Furthermore, we show that our proposed method can be generalized to multiple visual tasks, with consistent improvements in both memory consumption and network accuracy compared to the existing efficient architectures.

17 pages, 7 figures, accepted by ACCV2024
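
A rough sketch of the aggregate side of this idea (layer shapes are illustrative assumptions): project all skip feature maps to a common channel width and pool them to one scale, so only a single map must be held in memory until the decoder re-expands it.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleAggregator(nn.Module):
    """Aggregate multi-scale skip features into one single-scale map;
    a toy stand-in for the MSIAM idea, not the paper's module."""
    def __init__(self, channels: list[int], out_channels: int):
        super().__init__()
        self.proj = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in channels)

    def forward(self, skips: list[torch.Tensor]) -> torch.Tensor:
        target = skips[-1].shape[-2:]  # coarsest spatial size
        merged = sum(
            F.adaptive_avg_pool2d(p(s), target)   # pool each skip to one scale
            for p, s in zip(self.proj, skips)
        )
        return merged  # this single map replaces all stored skip tensors
```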

AWRaCLe: All-Weather Image Restoration using Visual In-Context Learning 2024-12-22 Go
Show

All-Weather Image Restoration (AWIR) under adverse weather conditions is a challenging task due to the presence of different types of degradations. Prior research in this domain relies on extensive training data but lacks the utilization of additional contextual information for restoration guidance. Consequently, the performance of existing methods is limited by the degradation cues that are learnt from individual training samples. Recent advancements in visual in-context learning have introduced generalist models that are capable of addressing multiple computer vision tasks simultaneously by using the information present in the provided context as a prior. In this paper, we propose All-Weather Image Restoration using Visual In-Context Learning (AWRaCLe), a novel approach for AWIR that innovatively utilizes degradation-specific visual context information to steer the image restoration process. To achieve this, AWRaCLe incorporates Degradation Context Extraction (DCE) and Context Fusion (CF) to seamlessly integrate degradation-specific features from the context into an image restoration network. The proposed DCE and CF blocks leverage CLIP features and incorporate attention mechanisms to adeptly learn and fuse contextual information. These blocks are specifically designed for visual in-context learning under all-weather conditions and are crucial for effective context utilization. Through extensive experiments, we demonstrate the effectiveness of AWRaCLe for all-weather restoration and show that our method advances the state-of-the-art in AWIR.

NeuroPump: Simultaneous Geometric and Color Rectification for Underwater Images 2024-12-20 Go
Show

Underwater image restoration aims to remove geometric and color distortions due to water refraction, absorption and scattering. Previous studies focus on restoring either color or the geometry, but, to the best of our knowledge, not both. However, in practice it may be cumbersome to address the two rectifications one-by-one. In this paper, we propose NeuroPump, a self-supervised method to simultaneously optimize and rectify underwater geometry and color as if water were pumped out. The key idea is to explicitly model refraction, absorption and scattering in the Neural Radiance Field (NeRF) pipeline, such that it not only performs simultaneous geometric and color rectification, but also enables synthesizing novel views and optical effects by controlling the decoupled parameters. In addition, to address the lack of real paired ground-truth images, we propose an underwater 360 benchmark dataset with real paired (i.e., with and without water) images. Our method clearly outperforms other baselines both quantitatively and qualitatively.

Multi-dimensional Visual Prompt Enhanced Image Restoration via Mamba-Transformer Aggregation 2024-12-20 Go
Show

Recent efforts on image restoration have focused on developing "all-in-one" models that can handle different degradation types and levels within a single model. However, most mainstream Transformer-based models face a dilemma between model capability and computational burden, since the self-attention mechanism grows quadratically in computational complexity with image size and has inadequacies in capturing long-range dependencies. Most Mamba-related models scan the feature map only in the spatial dimension for global modeling, failing to fully utilize information in the channel dimension. To address these problems, this paper proposes to fully exploit the complementary advantages of Mamba and Transformer without sacrificing computational efficiency. Specifically, the selective scanning mechanism of Mamba is employed for spatial modeling, enabling the capture of long-range spatial dependencies at linear complexity. The self-attention mechanism of the Transformer is applied to channel modeling, avoiding the high computational burden that grows quadratically with the image's spatial dimensions. Moreover, to enrich informative prompts for effective image restoration, multi-dimensional prompt learning modules are proposed to learn prompt-flows from multi-scale encoder/decoder layers, helping to reveal the underlying characteristics of various degradations from both spatial and channel perspectives and thereby enhancing the capability of the "all-in-one" model to solve various restoration tasks. Extensive experimental results on several image restoration benchmark tasks, such as image denoising, dehazing, and deraining, demonstrate that the proposed method achieves new state-of-the-art performance compared with many popular mainstream methods. Related source code and pre-trained parameters will be made public on GitHub: https://github.com/12138-chr/MTAIR.

Unified Image Restoration and Enhancement: Degradation Calibrated Cycle Reconstruction Diffusion Model 2024-12-19 Go
Show

Image restoration and enhancement are pivotal for numerous computer vision applications, yet unifying these tasks efficiently remains a significant challenge. Inspired by the iterative refinement capabilities of diffusion models, we propose CycleRDM, a novel framework designed to unify restoration and enhancement tasks while achieving high-quality mapping. Specifically, CycleRDM first learns the mapping relationships among the degraded domain, the rough normal domain, and the normal domain through a two-stage diffusion inference process. Subsequently, we transfer the final calibration process to the wavelet low-frequency domain using discrete wavelet transform, performing fine-grained calibration from a frequency domain perspective by leveraging task-specific frequency spaces. To improve restoration quality, we design a feature gain module for the decomposed wavelet high-frequency domain to eliminate redundant features. Additionally, we employ multimodal textual prompts and Fourier transform to drive stable denoising and reduce randomness during the inference process. After extensive validation, CycleRDM can be effectively generalized to a wide range of image restoration and enhancement tasks while requiring only a small number of training samples, and it is significantly superior on various benchmarks of reconstruction quality and perceptual quality. The source code will be available at https://github.com/hejh8/CycleRDM.

Personalized Generative Low-light Image Denoising and Enhancement 2024-12-18 Go
Show

While smartphone cameras today can produce astonishingly good photos, their performance in low light is still not completely satisfactory because of the fundamental limits in photon shot noise and sensor read noise. Generative image restoration methods have demonstrated promising results compared to traditional methods, but they suffer from hallucinatory content generation when the signal-to-noise ratio (SNR) is low. Recognizing the availability of personalized photo galleries on users' smartphones, we propose Personalized Generative Denoising (PGD) by building a diffusion model customized for different users. Our core innovation is an identity-consistent physical buffer that extracts the physical attributes of the person from the gallery. This ID-consistent physical buffer provides a strong prior that can be integrated with the diffusion model to restore the degraded images, without the need of fine-tuning. Over a wide range of low-light testing scenarios, we show that PGD achieves superior image denoising and enhancement performance compared to existing diffusion-based denoising approaches.

Distilled Pooling Transformer Encoder for Efficient Realistic Image Dehazing 2024-12-18 Go
Show

This paper proposes a lightweight neural network designed for realistic image dehazing, utilizing a Distilled Pooling Transformer Encoder, named DPTE-Net. Recently, while vision transformers (ViTs) have achieved great success in various vision tasks, their self-attention (SA) module's complexity scales quadratically with image resolution, hindering their applicability on resource-constrained devices. To overcome this, the proposed DPTE-Net substitutes traditional SA modules with efficient pooling mechanisms, significantly reducing computational demands while preserving ViTs' learning capabilities. To further enhance semantic feature learning, a distillation-based training process is implemented which transfers rich knowledge from a larger teacher network to DPTE-Net. Additionally, DPTE-Net is trained within a generative adversarial network (GAN) framework, leveraging the strong generalization of GAN in image restoration, and employs a transmission-aware loss function to dynamically adapt to varying haze densities. Experimental results on various benchmark datasets have shown that the proposed DPTE-Net can achieve competitive dehazing performance when compared to state-of-the-art methods while maintaining low computational complexity, making it a promising solution for resource-limited applications. The code of this work is available at https://github.com/tranleanh/dpte-net.

18 pages, 17 figures
Sharing Key Semantics in Transformer Makes Efficient Image Restoration 2024-12-18 Go
Show

Image Restoration (IR), a classic low-level vision task, has witnessed significant advancements through deep models that effectively model global information. Notably, the emergence of Vision Transformers (ViTs) has further propelled these advancements. When computing, the self-attention mechanism, a cornerstone of ViTs, tends to encompass all global cues, even those from semantically unrelated objects or regions. This inclusivity introduces computational inefficiencies, particularly noticeable with high input resolution, as it requires processing irrelevant information, thereby impeding efficiency. Additionally, for IR, it is commonly noted that small segments of a degraded image, particularly those closely aligned semantically, provide particularly relevant information to aid in the restoration process, as they contribute essential contextual cues crucial for accurate reconstruction. To address these challenges, we propose boosting IR's performance by sharing the key semantics via Transformer for IR (i.e., SemanIR) in this paper. Specifically, SemanIR initially constructs a sparse yet comprehensive key-semantic dictionary within each transformer stage by establishing essential semantic connections for every degraded patch. Subsequently, this dictionary is shared across all subsequent transformer blocks within the same stage. This strategy optimizes attention calculation within each block by focusing exclusively on semantically related components stored in the key-semantic dictionary. As a result, attention calculation achieves linear computational complexity within each window. Extensive experiments across 6 IR tasks confirm the proposed SemanIR's state-of-the-art performance, quantitatively and qualitatively showcasing advancements. The visual results, code, and trained models are available at https://github.com/Amazingren/SemanIR.

Accepted by NeurIPS2024
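
A toy version of building per-patch semantic neighbor lists (the paper's key-semantic dictionary is constructed once per transformer stage and shared across its blocks; the details here are assumptions): rank patches by feature similarity and keep only the top-k neighbors for attention.

```python
import torch
import torch.nn.functional as F

def key_semantic_topk(patch_feats: torch.Tensor, k: int = 8) -> torch.Tensor:
    """Toy construction of per-patch neighbor indices by cosine similarity.

    patch_feats: (num_patches, dim) -> (num_patches, k) neighbor indices;
    attention would then be restricted to these entries.
    """
    normed = F.normalize(patch_feats, dim=-1)
    sim = normed @ normed.T                 # cosine similarity between patches
    return sim.topk(k, dim=-1).indices

# key_semantic_topk(torch.randn(196, 64))
```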

DarkIR: Robust Low-Light Image Restoration 2024-12-18 Go
Show

Photography during night or in dark conditions typically suffers from noise, low light and blurring issues due to the dim environment and the common use of long exposure. Although Deblurring and Low-light Image Enhancement (LLIE) are related under these conditions, most approaches in image restoration solve these tasks separately. In this paper, we present an efficient and robust neural network for multi-task low-light image restoration. Instead of following the current tendency of Transformer-based models, we propose new attention mechanisms to enhance the receptive field of efficient CNNs. Our method reduces the computational costs in terms of parameters and MAC operations compared to previous methods. Our model, DarkIR, achieves new state-of-the-art results on the popular LOLBlur, LOLv2 and Real-LOLBlur datasets, being able to generalize on real-world night and dark images. Code and models at https://github.com/cidautai/DarkIR

Technical Report
Improving Diffusion Inverse Problem Solving with Decoupled Noise Annealing 2024-12-18 Go
Show

Diffusion models have recently achieved success in solving Bayesian inverse problems with learned data priors. Current methods build on top of the diffusion sampling process, where each denoising step makes small modifications to samples from the previous step. However, this process struggles to correct errors from earlier sampling steps, leading to worse performance in complicated nonlinear inverse problems, such as phase retrieval. To address this challenge, we propose a new method called Decoupled Annealing Posterior Sampling (DAPS) that relies on a novel noise annealing process. Specifically, we decouple consecutive steps in a diffusion sampling trajectory, allowing them to vary considerably from one another while ensuring their time-marginals anneal to the true posterior as we reduce noise levels. This approach enables the exploration of a larger solution space, improving the success rate for accurate reconstructions. We demonstrate that DAPS significantly improves sample quality and stability across multiple image restoration tasks, particularly in complicated nonlinear inverse problems.

Step-Calibrated Diffusion for Biomedical Optical Image Restoration 2024-12-17 Go
Show

High-quality, high-resolution medical imaging is essential for clinical care. Raman-based biomedical optical imaging uses non-ionizing infrared radiation to evaluate human tissues in real time and is used for early cancer detection, brain tumor diagnosis, and intraoperative tissue analysis. Unfortunately, optical imaging is vulnerable to image degradation due to laser scattering and absorption, which can result in diagnostic errors and misguided treatment. Restoration of optical images is a challenging computer vision task because the sources of image degradation are multi-factorial, stochastic, and tissue-dependent, preventing a straightforward method to obtain paired low-quality/high-quality data. Here, we present Restorative Step-Calibrated Diffusion (RSCD), an unpaired diffusion-based image restoration method that uses a step calibrator model to dynamically determine the number of steps required to complete the reverse diffusion process for image restoration. RSCD outperforms other widely used unpaired image restoration methods on both image quality and perceptual evaluation metrics for restoring optical images. Medical imaging experts consistently prefer images restored using RSCD in blinded comparison experiments and report minimal to no hallucinations. Finally, we show that RSCD improves performance on downstream clinical imaging tasks, including automated brain tumor diagnosis and deep tissue imaging. Our code is available at https://github.com/MLNeurosurg/restorative_step-calibrated_diffusion.

Neural Degradation Representation Learning for All-In-One Image Restoration 2024-12-17 Go
Show

Existing methods have demonstrated effective performance on a single degradation type. In practical applications, however, the degradation is often unknown, and the mismatch between the model and the degradation will result in a severe performance drop. In this paper, we propose an all-in-one image restoration network that tackles multiple degradations. Due to the heterogeneous nature of different types of degradations, it is difficult to process multiple degradations in a single network. To this end, we propose to learn a neural degradation representation (NDR) that captures the underlying characteristics of various degradations. The learned NDR decomposes different types of degradations adaptively, similar to a neural dictionary that represents basic degradation components. Subsequently, we develop a degradation query module and a degradation injection module to effectively recognize and utilize the specific degradation based on NDR, enabling the all-in-one restoration ability for multiple degradations. Moreover, we propose a bidirectional optimization strategy to effectively drive NDR to learn the degradation representation by optimizing the degradation and restoration processes alternately. Comprehensive experiments on representative types of degradations (including noise, haze, rain, and downsampling) demonstrate the effectiveness and generalization capability of our method.

Code: https://github.com/mdyao/NDR-Restore

TSD-SR: One-Step Diffusion with Target Score Distillation for Real-World Image Super-Resolution 2024-12-17 Go
Show

Pre-trained text-to-image diffusion models are increasingly applied to the real-world image super-resolution (Real-ISR) task. Given the iterative refinement nature of diffusion models, most existing approaches are computationally expensive. While methods such as SinSR and OSEDiff have emerged to condense inference steps via distillation, their performance in image restoration or detail recovery is not satisfactory. To address this, we propose TSD-SR, a novel distillation framework specifically designed for real-world image super-resolution, aiming to construct an efficient and effective one-step model. We first introduce the Target Score Distillation, which leverages the priors of diffusion models and real image references to achieve more realistic image restoration. Secondly, we propose a Distribution-Aware Sampling Module to make detail-oriented gradients more readily accessible, addressing the challenge of recovering fine details. Extensive experiments demonstrate that our TSD-SR has superior restoration results (achieving the best results on most metrics) and the fastest inference speed (e.g. 40 times faster than SeeSR) compared to past Real-ISR approaches based on pre-trained diffusion priors.

Dynamic Contrastive Knowledge Distillation for Efficient Image Restoration 2024-12-17 Go
Show

Knowledge distillation (KD) is a valuable yet challenging approach that enhances a compact student network by learning from a high-performance but cumbersome teacher model. However, previous KD methods for image restoration overlook the state of the student during the distillation, adopting a fixed solution space that limits the capability of KD. Additionally, relying solely on L1-type loss struggles to leverage the distribution information of images. In this work, we propose a novel dynamic contrastive knowledge distillation (DCKD) framework for image restoration. Specifically, we introduce dynamic contrastive regularization to perceive the student's learning state and dynamically adjust the distilled solution space using contrastive learning. Additionally, we also propose a distribution mapping module to extract and align the pixel-level category distribution of the teacher and student models. Note that the proposed DCKD is a structure-agnostic distillation framework, which can adapt to different backbones and can be combined with methods that optimize upper-bound constraints to further enhance model performance. Extensive experiments demonstrate that DCKD significantly outperforms the state-of-the-art KD methods across various image restoration tasks and backbones.

Consistent Diffusion: Denoising Diffusion Model with Data-Consistent Training for Image Restoration 2024-12-17 Go
Show

In this work, we address the limitations of denoising diffusion models (DDMs) in image restoration tasks, particularly the shape and color distortions that can compromise image quality. While DDMs have demonstrated a promising performance in many applications such as text-to-image synthesis, their effectiveness in image restoration is often hindered by shape and color distortions. We observe that these issues arise from inconsistencies between the training and testing data used by DDMs. Based on our observation, we propose a novel training method, named data-consistent training, which allows the DDMs to access images with accumulated errors during training, thereby ensuring the model to learn to correct these errors. Experimental results show that, across five image restoration tasks, our method has significant improvements over state-of-the-art methods while effectively minimizing distortions and preserving image fidelity.

Rethinking Transformer-Based Blind-Spot Network for Self-Supervised Image Denoising 2024-12-17 Go
Show

Blind-spot networks (BSN) have been prevalent neural architectures in self-supervised image denoising (SSID). However, most existing BSNs are built with convolution layers. Although transformers have shown the potential to overcome the limitations of convolutions in many image restoration tasks, the attention mechanisms may violate the blind-spot requirement, thereby restricting their applicability in BSN. To this end, we propose to analyze and redesign the channel and spatial attentions to meet the blind-spot requirement. Specifically, channel self-attention may leak the blind-spot information in multi-scale architectures, since the downsampling shuffles the spatial feature into channel dimensions. To alleviate this problem, we divide the channel into several groups and perform channel attention separately. For spatial self-attention, we apply an elaborate mask to the attention matrix to restrict and mimic the receptive field of dilated convolution. Based on the redesigned channel and window attentions, we build a Transformer-based Blind-Spot Network (TBSN), which shows strong local fitting and global perspective abilities. Furthermore, we introduce a knowledge distillation strategy that distills TBSN into smaller denoisers to improve computational efficiency while maintaining performance. Extensive experiments on real-world image denoising datasets show that TBSN largely extends the receptive field and exhibits favorable performance against state-of-the-art SSID methods.

AAAI 2025 Camera Ready
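
A compact sketch of the grouped channel attention described above (group count and scaling are assumed hyperparameters, not the paper's exact design): attention is computed within each channel group separately, so downsampled spatial information cannot mix across all channels at once.

```python
import torch
import torch.nn as nn

class GroupedChannelAttention(nn.Module):
    """Channel self-attention restricted to channel groups; illustrative
    of the blind-spot-safe grouping idea, not the paper's module."""
    def __init__(self, channels: int, groups: int = 4):
        super().__init__()
        assert channels % groups == 0
        self.groups = groups
        self.scale = (channels // groups) ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        g = self.groups
        feats = x.reshape(b, g, c // g, h * w)           # split channels into groups
        attn = torch.softmax(feats @ feats.transpose(-1, -2) * self.scale, dim=-1)
        out = attn @ feats                               # attend within each group only
        return out.reshape(b, c, h, w)

# GroupedChannelAttention(32)(torch.randn(1, 32, 16, 16))
```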

Haar Nuclear Norms with Applications to Remote Sensing Imagery Restoration 2024-12-16 Go
Show

Remote sensing image restoration aims to reconstruct missing or corrupted areas within images. To date, low-rank based models have garnered significant interest in this field. This paper proposes a novel low-rank regularization term, named the Haar nuclear norm (HNN), for efficient and effective remote sensing image restoration. It leverages the low-rank properties of wavelet coefficients derived from the 2-D frontal slice-wise Haar discrete wavelet transform, effectively modeling the low-rank prior for separated coarse-grained structure and fine-grained textures in the image. Experimental evaluations conducted on hyperspectral image inpainting, multi-temporal image cloud removal, and hyperspectral image denoising have revealed the HNN's potential. Typically, HNN achieves a performance improvement of 1-4 dB and a speedup of 10-28x compared to some state-of-the-art methods (e.g., tensor correlated total variation, and fully-connected tensor network) for inpainting tasks.
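
A toy scalar version of the idea (the paper's HNN acts on tensor slices and differs in detail): take a one-level 2-D Haar transform of an image and sum the nuclear norms of the four subbands.

```python
import numpy as np

def haar_nuclear_norm(x: np.ndarray) -> float:
    """Toy Haar-nuclear-norm of a single-channel image with even height
    and width: one-level orthonormal 2-D Haar transform, then the sum of
    subband nuclear norms. Illustrative, not the paper's exact HNN.
    """
    a = (x[0::2, 0::2] + x[1::2, 0::2] + x[0::2, 1::2] + x[1::2, 1::2]) / 2  # LL
    h = (x[0::2, 0::2] - x[1::2, 0::2] + x[0::2, 1::2] - x[1::2, 1::2]) / 2  # HL
    v = (x[0::2, 0::2] + x[1::2, 0::2] - x[0::2, 1::2] - x[1::2, 1::2]) / 2  # LH
    d = (x[0::2, 0::2] - x[1::2, 0::2] - x[0::2, 1::2] + x[1::2, 1::2]) / 2  # HH
    return sum(np.linalg.norm(band, ord="nuc") for band in (a, h, v, d))

# print(haar_nuclear_norm(np.random.rand(64, 64)))
```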

Instruct-IPT: All-in-One Image Processing Transformer via Weight Modulation 2024-12-16 Go
Show

Due to the unaffordable size and intensive computation costs of low-level vision models, All-in-One models that are designed to address a handful of low-level vision tasks simultaneously have been popular. However, existing All-in-One models are limited in terms of the range of tasks and performance. To overcome these limitations, we propose Instruct-IPT -- an All-in-One Image Processing Transformer (IPT) that could effectively address manifold image restoration tasks with large inter-task gaps, such as denoising, deblurring, deraining, dehazing, and desnowing. While most research propose feature adaptation methods, we reveal their failure in addressing highly distinct tasks, and suggest weight modulation that adapts weights to specific tasks. Firstly, we search for task-sensitive weights and introduce task-specific biases on top of them. Secondly, we conduct rank analysis for a good compression strategy and perform low-rank decomposition on the biases. Thirdly, we propose synchronous training that updates the task-general backbone model and the task-specific biases simultaneously. In this way, the model is instructed to learn both general and task-specific knowledge. Via our simple yet effective method that instructs the IPT to be task experts, Instruct-IPT could better cooperate between tasks with distinct characteristics at humble costs. As an additional feature, we enable Instruct-IPT to receive human prompts. We have conducted experiments on Instruct-IPT to demonstrate the effectiveness of our method on manifold tasks, and we have effectively extended our method to diffusion denoisers as well. The code is available at https://github.com/huawei-noah/Pretrained-IPT.

14 pages, 5 figures
Matrix Completion via Residual Spectral Matching 2024-12-16 Go
Show

Noisy matrix completion has attracted significant attention due to its applications in recommendation systems, signal processing and image restoration. Most existing works rely on (weighted) least squares methods under various low-rank constraints. However, minimizing the sum of squared residuals is not always efficient, as it may ignore the potential structural information in the residuals. In this study, we propose a novel residual spectral matching criterion that incorporates not only the numerical but also locational information of residuals. This criterion is the first in noisy matrix completion to adopt the perspective of low-rank perturbation of random matrices and exploit the spectral properties of sparse random matrices. We derive optimal statistical properties by analyzing the spectral properties of sparse random matrices and bounding the effects of low-rank perturbations and partial observations. Additionally, we propose algorithms that efficiently approximate solutions by constructing easily computable pseudo-gradients. The iterative process of the proposed algorithms ensures convergence at a rate consistent with the optimal statistical error bound. Our method and algorithms demonstrate improved numerical performance in both simulated and real data examples, particularly in environments with high noise levels.

23 pages, 6 figures
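
Read schematically, the proposal contrasts with the least-squares criterion roughly as below; this is one plausible reading of the abstract, not the paper's exact formulation.

```latex
% Entrywise least squares vs. a spectral score of the residual matrix,
% where P_\Omega keeps only the observed entries (schematic reading):
\min_{\operatorname{rank}(X) \le r} \big\| P_\Omega(M - X) \big\|_F^2
\quad\longrightarrow\quad
\min_{\operatorname{rank}(X) \le r} \big\| P_\Omega(M - X) \big\|_2
```

The spectral norm on the right is sensitive to how residuals are arranged across the matrix, not just to their magnitudes.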
Towards Context-aware Convolutional Network for Image Restoration 2024-12-15 Go
Show

Image restoration (IR) is a long-standing task to recover a high-quality image from its corrupted observation. Recently, transformer-based algorithms and some attention-based convolutional neural networks (CNNs) have presented promising results on several IR tasks. However, existing convolutional residual building modules for IR encounter limited ability to map inputs into high-dimensional and non-linear feature spaces, and their local receptive fields have difficulty in capturing long-range context information like Transformer. Besides, CNN-based attention modules for IR either face static abundant parameters or have limited receptive fields. To address the first issue, we propose an efficient residual star module (ERSM) that includes context-aware "star operation" (element-wise multiplication) to contextually map features into exceedingly high-dimensional and non-linear feature spaces, which greatly enhances representation learning. To further boost the extraction of contextual information, as for the second issue, we propose a large dynamic integration module (LDIM) which possesses an extremely large receptive field. Thus, LDIM can dynamically and efficiently integrate more contextual information that helps to further significantly improve the reconstruction performance. Integrating ERSM and LDIM into an U-shaped backbone, we propose a context-aware convolutional network (CCNet) with powerful learning ability for contextual high-dimensional mapping and abundant contextual information. Extensive experiments show that our CCNet with low model complexity achieves superior performance compared to other state-of-the-art IR methods on several IR tasks, including image dehazing, image motion deblurring, and image desnowing.

Boosting ViT-based MRI Reconstruction from the Perspectives of Frequency Modulation, Spatial Purification, and Scale Diversification 2024-12-14 Go
Show

The accelerated MRI reconstruction process presents a challenging ill-posed inverse problem due to the extensive under-sampling in k-space. Recently, Vision Transformers (ViTs) have become the mainstream for this task, demonstrating substantial performance improvements. However, three significant issues remain unaddressed: (1) ViTs struggle to capture high-frequency components of images, limiting their ability to detect local textures and edge information, thereby impeding MRI restoration; (2) Previous methods calculate multi-head self-attention (MSA) among both related and unrelated tokens in content, introducing noise and significantly increasing computational burden; (3) The naive feed-forward network in ViTs cannot model the multi-scale information that is important for image restoration. In this paper, we propose FPS-Former, a powerful ViT-based framework, to address these issues from the perspectives of frequency modulation, spatial purification, and scale diversification. Specifically, for issue (1), we introduce a frequency modulation attention module to enhance the self-attention map by adaptively re-calibrating the frequency information in a Laplacian pyramid. For issue (2), we customize a spatial purification attention module to capture interactions among closely related tokens, thereby reducing redundant or irrelevant feature representations. For issue (3), we propose an efficient feed-forward network based on a hybrid-scale fusion strategy. Comprehensive experiments conducted on three public datasets show that our FPS-Former outperforms state-of-the-art methods while requiring lower computational costs.

Are Conditional Latent Diffusion Models Effective for Image Restoration? 2024-12-13 Go
Show

Recent advancements in image restoration increasingly employ conditional latent diffusion models (CLDMs). While these models have demonstrated notable performance improvements in recent years, this work questions their suitability for IR tasks. CLDMs excel in capturing high-level semantic correlations, making them effective for tasks like text-to-image generation with spatial conditioning. However, in IR, where the goal is to enhance image perceptual quality, these models have difficulty modeling the relationship between degraded images and ground truth images using a low-level representation. To support our claims, we compare state-of-the-art CLDMs with traditional image restoration models through extensive experiments. Results reveal that despite the scaling advantages of CLDMs, they suffer from high distortion and semantic deviation, especially in cases with minimal degradation, where traditional methods outperform them. Additionally, we perform empirical studies to examine the impact of various CLDM design elements on their restoration performance. We hope this finding inspires a reexamination of current CLDM-based IR solutions, opening up more opportunities in this field.

OFTSR: One-Step Flow for Image Super-Resolution with Tunable Fidelity-Realism Trade-offs 2024-12-12 Go
Show

Recent advances in diffusion and flow-based generative models have demonstrated remarkable success in image restoration tasks, achieving superior perceptual quality compared to traditional deep learning approaches. However, these methods either require numerous sampling steps to generate high-quality images, resulting in significant computational overhead, or rely on model distillation, which usually imposes a fixed fidelity-realism trade-off and thus lacks flexibility. In this paper, we introduce OFTSR, a novel flow-based framework for one-step image super-resolution that can produce outputs with tunable levels of fidelity and realism. Our approach first trains a conditional flow-based super-resolution model to serve as a teacher model. We then distill this teacher model by applying a specialized constraint. Specifically, we force the predictions from our one-step student model for the same input to lie on the same sampling ODE trajectory of the teacher model. This alignment ensures that the student model's single-step predictions from initial states match the teacher's predictions from a closer intermediate state. Through extensive experiments on challenging datasets including FFHQ (256$\times$256), DIV2K, and ImageNet (256$\times$256), we demonstrate that OFTSR achieves state-of-the-art performance for one-step image super-resolution, while having the ability to flexibly tune the fidelity-realism trade-off. Code and pre-trained models are available at https://github.com/yuanzhi-zhu/OFTSR and https://huggingface.co/Yuanzhi/OFTSR, respectively.

ExpRDiff: Short-exposure Guided Diffusion Model for Realistic Local Motion Deblurring 2024-12-12 Go
Show

Removing blur caused by moving objects is challenging, as the moving objects are usually significantly blurry while the static background remains clear. Existing methods that rely on local blur detection often suffer from inaccuracies and cannot generate satisfactory results when focusing solely on blurred regions. To overcome these problems, we first design a context-based local blur detection module that incorporates additional contextual information to improve the identification of blurry regions. Considering that modern smartphones are equipped with cameras capable of providing short-exposure images, we develop a blur-aware guided image restoration method that utilizes sharp structural details from short-exposure images, facilitating accurate reconstruction of heavily blurred regions. Furthermore, to restore images that are realistic and visually pleasing, we develop a short-exposure guided diffusion model that explores useful features from short-exposure images and blurred regions to better constrain the diffusion process. Finally, we formulate the above components into a simple yet effective network, named ExpRDiff. Experimental results show that ExpRDiff performs favorably against state-of-the-art methods.

Project website: https://github.com/yzb1997/ExpRDiff

Analysis and Synthesis Denoisers for Forward-Backward Plug-and-Play Algorithms 2024-12-11 Go
Show

In this work we study the behavior of the forward-backward (FB) algorithm when the proximity operator is replaced by a sub-iterative procedure to approximate a Gaussian denoiser, in a Plug-and-Play (PnP) fashion. In particular, we consider both analysis and synthesis Gaussian denoisers within a dictionary framework, obtained by unrolling dual-FB iterations or FB iterations, respectively. We analyze the associated minimization problems as well as the asymptotic behavior of the resulting FB-PnP iterations. In particular, we show that the synthesis Gaussian denoising problem can be viewed as a proximity operator. For each case, analysis and synthesis, we show that the FB-PnP algorithms solve the same problem whether we use only one or an infinite number of sub-iteration to solve the denoising problem at each iteration. To this aim, we show that each "one sub-iteration" strategy within the FB-PnP can be interpreted as a primal-dual algorithm when a warm-restart strategy is used. We further present similar results when using a Moreau-Yosida smoothing of the global problem, for an arbitrary number of sub-iterations. Finally, we provide numerical simulations to illustrate our theoretical results. In particular we first consider a toy compressive sensing example, as well as an image restoration problem in a deep dictionary framework.
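
For orientation, the generic FB-PnP iteration that the paper analyzes variants of has the standard form below, where f is the data-fidelity term, γ a step size, and D_σ the Gaussian denoiser standing in for the proximity operator.

```latex
% Standard FB-PnP iteration: the proximal step of forward-backward
% splitting is replaced by a Gaussian denoiser D_\sigma:
x_{k+1} = D_\sigma\big( x_k - \gamma \nabla f(x_k) \big)
```

The paper's question is what happens when D_σ is itself only approximated by a finite number of analysis or synthesis sub-iterations rather than applied exactly.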

Exploring Real&Synthetic Dataset and Linear Attention in Image Restoration 2024-12-11 Go
Show

Image restoration (IR) aims to recover high-quality images from degraded inputs, with recent deep learning advancements significantly enhancing performance. However, existing methods lack a unified training benchmark for iterations and configurations. We also identify a bias in image complexity distributions between commonly used IR training and testing datasets, resulting in suboptimal restoration outcomes. To address this, we introduce a large-scale IR dataset called ReSyn, which employs a novel image filtering method based on image complexity to ensure a balanced distribution and includes both real and AIGC synthetic images. We establish a unified training standard that specifies iterations and configurations for image restoration models, focusing on measuring model convergence and restoration capability. Additionally, we enhance transformer-based image restoration models using linear attention mechanisms by proposing RWKV-IR, which integrates linear complexity RWKV into the transformer structure, allowing for both global and local receptive fields. Instead of directly using Vision-RWKV, we replace the original Q-Shift in RWKV with a Depth-wise Convolution shift to better model local dependencies, combined with Bi-directional attention for comprehensive linear attention. We also introduce a Cross-Bi-WKV module that merges two Bi-WKV modules with different scanning orders for balanced horizontal and vertical attention. Extensive experiments validate the effectiveness of our RWKV-IR model.

Unsupervised Variational Translator for Bridging Image Restoration and High-Level Vision Tasks 2024-12-11 Go
Show

Recent research tries to extend image restoration capabilities from human perception to machine perception, thereby enhancing the performance of high-level vision tasks in degraded environments. These methods, primarily based on supervised learning, typically involve the retraining of restoration networks or high-level vision networks. However, collecting paired data in real-world scenarios and retraining large-scale models are challenging. To this end, we propose an unsupervised learning method called Variational Translator (VaT), which does not require retraining existing restoration and high-level vision networks. Instead, it establishes a lightweight network that serves as an intermediate bridge between them. By variational inference, VaT approximates the joint distribution of restoration output and high-level vision input, dividing the optimization objective into preserving content and maximizing marginal likelihood associated with high-level vision tasks. By cleverly leveraging self-training paradigms, VaT achieves the above optimization objective without requiring labels. As a result, the translated images maintain a close resemblance to their original content while also demonstrating exceptional performance on high-level vision tasks. Extensive experiments in dehazing and low-light enhancement for detection and classification show the superiority of our method over other state-of-the-art unsupervised counterparts, even significantly surpassing supervised methods in some complex real-world scenarios. Code is available at https://github.com/Fire-friend/VaT.

Taming Diffusion Prior for Image Super-Resolution with Domain Shift SDEs 2024-12-11 Go
Show

Diffusion-based image super-resolution (SR) models have attracted substantial interest due to their powerful image restoration capabilities. However, prevailing diffusion models often struggle to strike an optimal balance between efficiency and performance. Typically, they either neglect to exploit the potential of existing extensive pretrained models, limiting their generative capacity, or they necessitate dozens of forward passes starting from random noise, compromising inference efficiency. In this paper, we present DoSSR, a Domain Shift diffusion-based SR model that capitalizes on the generative powers of pretrained diffusion models while significantly enhancing efficiency by initiating the diffusion process with low-resolution (LR) images. At the core of our approach is a domain shift equation that integrates seamlessly with existing diffusion models. This integration not only improves the use of the diffusion prior but also boosts inference efficiency. Moreover, we advance our method by transitioning the discrete shift process to a continuous formulation, termed DoS-SDEs. This advancement leads to fast, customized solvers that further enhance sampling efficiency. Empirical results demonstrate that our proposed method achieves state-of-the-art performance on synthetic and real-world datasets, while notably requiring only 5 sampling steps. Compared to previous diffusion prior based methods, our approach achieves a remarkable speedup of 5-7 times, demonstrating its superior efficiency. Code: https://github.com/QinpengCui/DoSSR.

This paper is accepted by NeurIPS 2024

Modeling Dual-Exposure Quad-Bayer Patterns for Joint Denoising and Deblurring 2024-12-10 Go
Show

Image degradation caused by noise and blur remains a persistent challenge in imaging systems, stemming from limitations in both hardware and methodology. Single-image solutions face an inherent tradeoff between noise reduction and motion blur. While short exposures can capture clear motion, they suffer from noise amplification. Long exposures reduce noise but introduce blur. Learning-based single-image enhancers tend to produce over-smoothed results due to the limited available information. Multi-image solutions using burst mode avoid this tradeoff by capturing more spatial-temporal information but often struggle with misalignment from camera/scene motion. To address these limitations, we propose a physical-model-based image restoration approach leveraging a novel dual-exposure Quad-Bayer pattern sensor. By capturing pairs of short and long exposures at the same starting point but with varying durations, this method integrates complementary noise-blur information within a single image. We further introduce a Quad-Bayer synthesis method (B2QB) to simulate sensor data from Bayer patterns to facilitate training. Based on this dual-exposure sensor model, we design a hierarchical convolutional neural network called QRNet to recover high-quality RGB images. The network incorporates input enhancement blocks and multi-level feature extraction to improve restoration quality. Experiments demonstrate superior performance over state-of-the-art deblurring and denoising methods on both synthetic and real-world datasets. The code, model, and datasets are publicly available at https://github.com/zhaoyuzhi/QRNet.

accepted by IEEE Transactions on Image Processing (TIP)

EchoIR: Advancing Image Restoration with Echo Upsampling and Bi-Level Optimization 2024-12-10 Go
Show

Image restoration represents a fundamental challenge in low-level vision, focusing on reconstructing high-quality images from their degraded counterparts. With the rapid advancement of deep learning technologies, transformer-based methods with pyramid structures have advanced the field by capturing long-range cross-scale spatial interactions. Despite their popularity, the degradation of essential features during the upsampling process notably compromises restoration performance, resulting in suboptimal reconstruction outcomes. We introduce EchoIR, a UNet-like image restoration network with a bilateral learnable upsampling mechanism to bridge this gap. Specifically, we propose the Echo-Upsampler, which optimizes the upsampling process by learning from the bilateral intermediate features of the U-Net, the "Echo", aiming for a more refined restoration by minimizing degradation during upsampling. To model the hierarchical relationship between image restoration and upsampling tasks, we propose the Approximated Sequential Bi-level Optimization (AS-BLO), an advanced bi-level optimization model that establishes a relationship between upsampling learning and image restoration tasks. Extensive experiments against state-of-the-art (SOTA) methods demonstrate that the proposed EchoIR surpasses existing methods, achieving SOTA performance in image restoration tasks.

A Progressive Image Restoration Network for High-order Degradation Imaging in Remote Sensing 2024-12-10 Go
Show

Recently, deep learning methods have gained remarkable achievements in the field of image restoration for remote sensing (RS). However, most existing RS image restoration methods focus mainly on conventional first-order degradation models, which may not effectively capture the imaging mechanisms of remote sensing images. Furthermore, many RS image restoration approaches that use deep learning are often criticized for their lack of architectural transparency and model interpretability. To address these problems, we propose a novel progressive restoration network for high-order degradation imaging (HDI-PRNet), to progressively restore different image degradations. HDI-PRNet is developed based on the theoretical framework of degradation imaging, offering the benefit of mathematical interpretability within the unfolding network. The framework is composed of three main components: a module for image denoising that relies on proximal mapping prior learning, a module for image deblurring that integrates Neumann series expansion with dual-domain degradation learning, and a module for super-resolution. Extensive experiments demonstrate that our method achieves superior performance on both synthetic and real remote sensing images.

14 pages
InstantRestore: Single-Step Personalized Face Restoration with Shared-Image Attention 2024-12-09 Go
Show

Face image restoration aims to enhance degraded facial images while addressing challenges such as diverse degradation types, real-time processing demands, and, most crucially, the preservation of identity-specific features. Existing methods often struggle with slow processing times and suboptimal restoration, especially under severe degradation, failing to accurately reconstruct finer-level identity details. To address these issues, we introduce InstantRestore, a novel framework that leverages a single-step image diffusion model and an attention-sharing mechanism for fast and personalized face restoration. Additionally, InstantRestore incorporates a novel landmark attention loss, aligning key facial landmarks to refine the attention maps, enhancing identity preservation. At inference time, given a degraded input and a small (~4) set of reference images, InstantRestore performs a single forward pass through the network to achieve near real-time performance. Unlike prior approaches that rely on full diffusion processes or per-identity model tuning, InstantRestore offers a scalable solution suitable for large-scale applications. Extensive experiments demonstrate that InstantRestore outperforms existing methods in quality and speed, making it an appealing choice for identity-preserving face restoration.

Project page: https://snap-research.github.io/InstantRestore/

Enhancing Sample Generation of Diffusion Models using Noise Level Correction 2024-12-07 Go
Show

The denoising process of diffusion models can be interpreted as a projection of noisy samples onto the data manifold. Moreover, the noise level in these samples approximates their distance to the underlying manifold. Building on this insight, we propose a novel method to enhance sample generation by aligning the estimated noise level with the true distance of noisy samples to the manifold. Specifically, we introduce a noise level correction network, leveraging a pre-trained denoising network, to refine noise level estimates during the denoising process. Additionally, we extend this approach to various image restoration tasks by integrating task-specific constraints, including inpainting, deblurring, super-resolution, colorization, and compressed sensing. Experimental results demonstrate that our method significantly improves sample quality in both unconstrained and constrained generation scenarios. Notably, the proposed noise level correction framework is compatible with existing denoising schedulers (e.g., DDIM), offering additional performance improvements.
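
A hedged sketch of the correction step this abstract describes: a hypothetical `correction_net` refines the scheduler's nominal noise level from the current sample before the denoiser is queried. Both callables are illustrative stand-ins, not the paper's actual models.

```python
import torch

def corrected_denoising_step(x_t, sigma_nominal, denoiser, correction_net):
    # Refine the nominal noise level so it better matches the sample's
    # actual distance to the data manifold (hypothetical correction_net).
    sigma_hat = correction_net(x_t, sigma_nominal)
    # Query the pre-trained denoiser at the corrected noise level and form
    # the usual estimate of the clean sample.
    eps_hat = denoiser(x_t, sigma_hat)
    x0_hat = x_t - sigma_hat * eps_hat
    return x0_hat, sigma_hat
```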

Beyond Pixels: Text Enhances Generalization in Real-World Image Restoration 2024-12-06 Go
Show

Generalization has long been a central challenge in real-world image restoration. While recent diffusion-based restoration methods, which leverage generative priors from text-to-image models, have made progress in recovering more realistic details, they still encounter "generative capability deactivation" when applied to out-of-distribution real-world data. To address this, we propose using text as an auxiliary invariant representation to reactivate the generative capabilities of these models. We begin by identifying two key properties of text input: richness and relevance, and examine their respective influence on model performance. Building on these insights, we introduce Res-Captioner, a module that generates enhanced textual descriptions tailored to image content and degradation levels, effectively mitigating response failures. Additionally, we present RealIR, a new benchmark designed to capture diverse real-world scenarios. Extensive experiments demonstrate that Res-Captioner significantly enhances the generalization abilities of diffusion-based restoration models, while remaining fully plug-and-play.

ReF-LDM: A Latent Diffusion Model for Reference-based Face Image Restoration 2024-12-06 Go
Show

While recent works on blind face image restoration have successfully produced impressive high-quality (HQ) images with abundant details from low-quality (LQ) input images, the generated content may not accurately reflect the real appearance of a person. To address this problem, incorporating well-shot personal images as additional reference inputs could be a promising strategy. Inspired by the recent success of the Latent Diffusion Model (LDM), we propose ReF-LDM, an adaptation of LDM designed to generate HQ face images conditioned on one LQ image and multiple HQ reference images. Our model integrates an effective and efficient mechanism, CacheKV, to leverage the reference images during the generation process. Additionally, we design a timestep-scaled identity loss, enabling our LDM-based model to focus on learning the discriminating features of human faces. Lastly, we construct FFHQ-Ref, a dataset consisting of 20,405 high-quality (HQ) face images with corresponding reference images, which can serve as both training and evaluation data for reference-based face restoration models.

NeurIPS 2024, project page https://chiweihsiao.github.io/refldm.github.io/

Equivariant Denoisers for Image Restoration 2024-12-06 Go
Show

One key ingredient of image restoration is to define a realistic prior on clean images to complete the missing information in the observation. State-of-the-art restoration methods rely on a neural network to encode this prior. Moreover, typical image distributions are invariant to some set of transformations, such as rotations or flips. However, most deep architectures are not designed to represent an invariant image distribution. Recent works have proposed to overcome this difficulty by including equivariance properties within a Plug-and-Play paradigm. In this work, we propose a unified framework named Equivariant Regularization by Denoising (ERED) based on equivariant denoisers and stochastic optimization. We analyze the convergence of this algorithm and discuss its practical benefit.

MetaFormer: High-fidelity Metalens Imaging via Aberration Correcting Transformers 2024-12-05 Go
Show

Metalens is an emerging optical system with an irreplaceable merit in that it can be manufactured in ultra-thin and compact sizes, which shows great promise for various applications such as medical imaging and augmented/virtual reality (AR/VR). Despite its advantage in miniaturization, its practicality is constrained by severe aberrations and distortions, which significantly degrade the image quality. Several prior works have attempted to address different types of aberrations, yet most of them are mainly designed for traditional bulky lenses and are not convincing enough to remedy the harsh aberrations of the metalens. While aberration correction methods designed specifically for the metalens exist, they still fall short in restoration quality. In this work, we propose MetaFormer, an aberration correction framework for metalens-captured images, harnessing Vision Transformers (ViT), which have shown remarkable restoration performance in diverse image restoration tasks. Specifically, we devise a Multiple Adaptive Filters Guidance (MAFG), where multiple Wiener filters enrich the degraded input images with various noise-detail balances, enhancing output restoration quality. In addition, we introduce a Spatial and Transposed self-Attention Fusion (STAF) module, which aggregates features from spatial self-attention and transposed self-attention modules to further ameliorate aberration correction. We conduct extensive experiments, including correcting aberrated images and videos, and clean 3D reconstruction from the degraded images. The proposed method outperforms prior works by a significant margin. We further fabricate a metalens and verify the practicality of MetaFormer by restoring the images captured with the manufactured metalens in the wild. Code and pre-trained models are available at https://benhenryl.github.io/MetaFormer

19 pages, 18 figures
Deep priors for satellite image restoration with accurate uncertainties 2024-12-05 Go
Show

Satellite optical images, upon their on-ground receipt, offer a distorted view of the observed scene. Their restoration, classically including denoising, deblurring, and sometimes super-resolution, is required before their exploitation. Moreover, quantifying the uncertainty related to this restoration could be valuable by lowering the risk of hallucination and avoiding propagating these biases in downstream applications. Deep learning methods are now state-of-the-art for satellite image restoration. However, they require training a specific network for each sensor and do not provide the associated uncertainties. This paper proposes a generic method involving a single network to restore images from several sensors and a scalable way to derive the uncertainties. We focus on deep regularization (DR) methods, which learn a deep prior on target images before plugging it into a model-based optimization scheme. First, we introduce VBLE-xz, which solves the inverse problem in the latent space of a variational compressive autoencoder, estimating the uncertainty jointly in the latent and in the image spaces. It enables scalable posterior sampling with relevant and calibrated uncertainties. Second, we propose the denoiser-based method SatDPIR, adapted from DPIR, which efficiently computes accurate point estimates. We conduct a comprehensive set of experiments on very high resolution simulated and real Pleiades images, confirming both the performance and robustness of the proposed methods. VBLE-xz and SatDPIR achieve state-of-the-art results compared to direct inversion methods. In particular, VBLE-xz is a scalable method to get realistic posterior samples and accurate uncertainties, while SatDPIR represents a compelling alternative to direct inversion methods when uncertainty quantification is not required.

Blind Underwater Image Restoration using Co-Operational Regressor Networks 2024-12-05 Go
Show

The exploration of underwater environments is essential for applications such as biological research, archaeology, and infrastructure maintenance. However, underwater imaging is challenging due to the water's unique properties, including scattering, absorption, color distortion, and reduced visibility. To address such visual degradations, a variety of approaches have been proposed, ranging from basic signal processing methods to deep learning models; however, none of them has proven to be consistently successful. In this paper, we propose a novel machine learning model, Co-Operational Regressor Networks (CoRe-Nets), designed to achieve the best possible underwater image restoration. A CoRe-Net consists of two co-operating networks: the Apprentice Regressor (AR), responsible for image transformation, and the Master Regressor (MR), which evaluates the Peak Signal-to-Noise Ratio (PSNR) of the images generated by the AR and feeds it back to the AR. CoRe-Nets are built on Self-Organized Operational Neural Networks (Self-ONNs), which offer a superior learning capability by modulating nonlinearity in kernel transformations. The effectiveness of the proposed model is demonstrated on the benchmark Large Scale Underwater Image (LSUI) dataset. Leveraging the joint learning capabilities of the two cooperating networks, the proposed model achieves state-of-the-art restoration performance with significantly reduced computational complexity, and often produces results that can even surpass the visual quality of the ground truth with a 2-pass application. Our results and the optimized PyTorch implementation of the proposed approach are publicly shared on GitHub.

11 pages
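
A hedged sketch of the two-network feedback described in the abstract above: the Apprentice restores the image while the Master learns to predict its PSNR, and the Master's prediction is fed back into the Apprentice's objective. The losses, the differentiable PSNR, and the plain modules are illustrative assumptions; the actual CoRe-Nets are built from Self-ONN layers.

```python
import torch
import torch.nn.functional as F

def batch_psnr(x, y, eps=1e-8):
    # Per-image PSNR, assuming intensities in [0, 1].
    mse = ((x - y) ** 2).flatten(1).mean(dim=1)
    return -10.0 * torch.log10(mse + eps)

def core_net_step(lq, hq, apprentice, master, opt_ar, opt_mr):
    restored = apprentice(lq)
    # Master Regressor: learn to predict the true PSNR of the AR output
    # (master is assumed to map an image batch to one score per image).
    mr_loss = F.mse_loss(master(restored.detach()).squeeze(-1),
                         batch_psnr(restored.detach(), hq))
    opt_mr.zero_grad(); mr_loss.backward(); opt_mr.step()
    # Apprentice Regressor: fidelity term plus the Master's feedback signal;
    # gradients flow through the Master only as a critic here.
    ar_loss = F.l1_loss(restored, hq) - 0.01 * master(restored).mean()
    opt_ar.zero_grad(); ar_loss.backward(); opt_ar.step()
    return mr_loss.item(), ar_loss.item()
```
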
LL-ICM: Image Compression for Low-level Machine Vision via Large Vision-Language Model 2024-12-05 Go
Show

Image Compression for Machines (ICM) aims to compress images for machine vision tasks rather than human viewing. Current works predominantly concentrate on high-level tasks like object detection and semantic segmentation. However, the quality of original images is usually not guaranteed in the real world, leading to even worse perceptual quality or downstream task performance after compression. Low-level (LL) machine vision models, like image restoration models, can help improve such quality, and thus their compression requirements should also be considered. In this paper, we propose a pioneering ICM framework for LL machine vision tasks, namely LL-ICM. By jointly optimizing compression and LL tasks, the proposed LL-ICM not only enriches its encoding ability in generalizing to versatile LL tasks but also optimizes the processing ability of downstream LL task models, achieving mutual adaptation for image codecs and LL task models. Furthermore, we integrate large-scale vision-language models into the LL-ICM framework to generate more universal and distortion-robust feature embeddings for LL vision tasks. Therefore, one LL-ICM codec can generalize to multiple tasks. We establish a solid benchmark to evaluate LL-ICM, which includes extensive objective experiments by using both full- and no-reference image quality assessments. Experimental results show that LL-ICM can achieve 22.65% BD-rate reductions over the state-of-the-art methods.

Chain-of-Restoration: Multi-Task Image Restoration Models are Zero-Shot Step-by-Step Universal Image Restorers 2024-12-04 Go
Show

While previous image restoration (IR) methods have often concentrated on isolated degradations, recent research has increasingly focused on addressing composite degradations, which involve a complex combination of multiple isolated degradations. However, current IR methods for composite degradations require training data that contain an exponential number of possible degradation combinations, which imposes a significant burden. To alleviate this issue, this paper proposes a new task setting, i.e., Universal Image Restoration (UIR). Specifically, UIR doesn't require training on all the degradation combinations but only on a set of degradation bases, and then removes any degradation that these bases can potentially compose in a zero-shot manner. Inspired by the Chain-of-Thought approach that prompts large language models (LLMs) to address problems step-by-step, we propose the Chain-of-Restoration (CoR) mechanism, which instructs models to remove unknown composite degradations step-by-step. By integrating a simple Degradation Discriminator into pre-trained multi-task models, CoR facilitates the process where models remove one degradation basis per step, continuing this process until the image is fully restored from the unknown composite degradation. Extensive experiments show that CoR can significantly improve model performance in removing composite degradations, achieving comparable or better results than state-of-the-art (SoTA) methods trained on all degradations.

code: https://github.com/toummHus/Chain-of-Restoration
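
The step-by-step control loop the abstract describes can be sketched in a few lines. The `discriminator` and `restorer` callables and the "clean" label are hypothetical interfaces for illustration; the paper's actual components may differ.

```python
def chain_of_restoration(image, discriminator, restorer, max_steps=5):
    """Remove one degradation basis per step until the image is judged clean."""
    for _ in range(max_steps):
        degradation = discriminator(image)    # e.g. "noise", "blur", "rain", or "clean"
        if degradation == "clean":
            break
        image = restorer(image, degradation)  # remove the detected degradation basis
    return image
```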

Semantic Segmentation Prior for Diffusion-Based Real-World Super-Resolution 2024-12-04 Go
Show

Real-world image super-resolution (Real-ISR) has achieved a remarkable leap by leveraging large-scale text-to-image models, enabling realistic image restoration guided by recognition-oriented textual prompts. However, these methods sometimes fail to recognize some salient objects, resulting in inaccurate semantic restoration in these regions. Additionally, the same region may have a strong response to more than one prompt, which leads to semantic ambiguity for image super-resolution. To alleviate the above two issues, in this paper, we propose to consider semantic segmentation as an additional control condition for diffusion-based image super-resolution. Compared to textual prompt conditions, semantic segmentation enables a more comprehensive perception of salient objects within an image by assigning class labels to each pixel. It also mitigates the risks of semantic ambiguities by explicitly allocating objects to their respective spatial regions. In practice, inspired by the fact that image super-resolution and segmentation can benefit each other, we propose SegSR, which introduces a dual-diffusion framework to facilitate interaction between the image super-resolution and segmentation diffusion models. Specifically, we develop a Dual-Modality Bridge module to enable updated information flow between these two diffusion models, achieving mutual benefit during the reverse diffusion process. Extensive experiments show that SegSR can generate realistic images while preserving semantic structures more effectively.

Phaseformer: Phase-based Attention Mechanism for Underwater Image Restoration and Beyond 2024-12-02 Go
Show

Quality degradation is observed in underwater images due to the effects of light refraction and absorption by water, leading to issues like color cast, haziness, and limited visibility. This degradation negatively affects the performance of autonomous underwater vehicles used in marine applications. To address these challenges, we propose a lightweight phase-based transformer network with 1.77M parameters for underwater image restoration (UIR). Our approach focuses on effectively extracting non-contaminated features using a phase-based self-attention mechanism. We also introduce an optimized phase attention block to restore structural information by propagating prominent attentive features from the input. We evaluate our method on both synthetic (UIEB, UFO-120) and real-world (UIEB, U45, UCCS, SQUID) underwater image datasets. Additionally, we demonstrate its effectiveness for low-light image enhancement using the LOL dataset. Through extensive ablation studies and comparative analysis, it is clear that the proposed approach outperforms existing state-of-the-art (SOTA) methods.

8 pages, 8 figures, conference

FoundIR: Unleashing Million-scale Training Data to Advance Foundation Models for Image Restoration 2024-12-02 Go
Show

Despite the significant progress made by all-in-one models in universal image restoration, existing methods suffer from a generalization bottleneck in real-world scenarios, as they are mostly trained on small-scale synthetic datasets with limited degradations. Therefore, large-scale high-quality real-world training data is urgently needed to facilitate the emergence of foundational models for image restoration. To advance this field, we spare no effort in contributing a million-scale dataset with two notable advantages over existing training data: larger-scale real-world samples and more diverse degradation types. By adjusting internal camera settings and external imaging conditions, we can capture aligned image pairs using our well-designed data acquisition system over multiple rounds and our data alignment criterion. Moreover, we propose a robust model, FoundIR, to better address a broader range of restoration tasks in real-world scenarios, taking a further step toward foundation models. Specifically, we first utilize a diffusion-based generalist model to remove degradations by learning degradation-agnostic common representations from diverse inputs, where an incremental learning strategy is adopted to better guide model training. To refine the model's restoration capability in complex scenarios, we introduce degradation-aware specialist models for achieving final high-quality results. Extensive experiments show the value of our dataset and the effectiveness of our method.

Project website: https://www.foundir.net

Blind Inverse Problem Solving Made Easy by Text-to-Image Latent Diffusion 2024-11-30 Go
Show

Blind inverse problems, where both the target data and forward operator are unknown, are crucial to many computer vision applications. Existing methods often depend on restrictive assumptions such as additional training, operator linearity, or narrow image distributions, thus limiting their generalizability. In this work, we present LADiBI, a training-free framework that uses large-scale text-to-image diffusion models to solve blind inverse problems with minimal assumptions. By leveraging natural language prompts, LADiBI jointly models priors for both the target image and operator, allowing for flexible adaptation across a variety of tasks. Additionally, we propose a novel posterior sampling approach that combines effective operator initialization with iterative refinement, enabling LADiBI to operate without predefined operator forms. Our experiments show that LADiBI is capable of solving a broad range of image restoration tasks, including both linear and nonlinear problems, on diverse target image distributions.

PromptHSI: Universal Hyperspectral Image Restoration Framework for Composite Degradation 2024-11-29 Go
Show

Recent developments in All-in-One (AiO) RGB image restoration and prompt learning have enabled the representation of distinct degradations through prompts, allowing degraded images to be effectively addressed by a single restoration model. However, this paradigm faces significant challenges when transferring to hyperspectral image (HSI) restoration tasks due to: 1) the domain gap between RGB and HSI features and the differences in their structures, 2) information loss in visual prompts under severe composite degradations, and 3) difficulties in capturing HSI-specific degradation representations through text prompts. To address these challenges, we propose PromptHSI, the first universal AiO HSI restoration framework. By leveraging frequency-aware feature modulation based on the characteristics of HSI degradations, we decompose text prompts into intensity and bias controllers to effectively guide the restoration process while avoiding domain gaps. Our unified architecture excels at both fine-grained recovery and global information restoration tasks. Experimental results demonstrate superior performance under various degradation combinations, indicating great potential for practical remote sensing applications. The source code and dataset will be publicly released.

11 pages, 8 figures
Hierarchical Information Flow for Generalized Efficient Image Restoration 2024-11-27 Go
Show

While vision transformers show promise in numerous image restoration (IR) tasks, the challenge remains in efficiently generalizing and scaling up a model for multiple IR tasks. To strike a balance between efficiency and model capacity for a generalized transformer-based IR method, we propose a hierarchical information flow mechanism for image restoration, dubbed Hi-IR, which progressively propagates information among pixels in a bottom-up manner. Hi-IR constructs a hierarchical information tree representing the degraded image across three levels. Each level encapsulates different types of information, with higher levels encompassing broader objects and concepts and lower levels focusing on local details. Moreover, the hierarchical tree architecture removes long-range self-attention, improves the computational efficiency and memory utilization, thus preparing it for effective model scaling. Based on that, we explore model scaling to improve our method's capabilities, which is expected to positively impact IR in large-scale training settings. Extensive experimental results show that Hi-IR achieves state-of-the-art performance in seven common image restoration tasks, affirming its effectiveness and generalizability.

Complexity Experts are Task-Discriminative Learners for Any Image Restoration 2024-11-27 Go
Show

Recent advancements in all-in-one image restoration models have revolutionized the ability to address diverse degradations through a unified framework. However, parameters tied to specific tasks often remain inactive for other tasks, making mixture-of-experts (MoE) architectures a natural extension. Despite this, MoEs often show inconsistent behavior, with some experts unexpectedly generalizing across tasks while others struggle within their intended scope. This hinders leveraging MoEs' computational benefits by bypassing irrelevant experts during inference. We attribute this undesired behavior to the uniform and rigid architecture of traditional MoEs. To address this, we introduce "complexity experts": flexible expert blocks with varying computational complexity and receptive fields. A key challenge is assigning tasks to each expert, as degradation complexity is unknown in advance. Thus, we execute tasks with a simple bias toward lower complexity. To our surprise, this preference effectively drives task-specific allocation, assigning tasks to experts with the appropriate complexity. Extensive experiments validate our approach, demonstrating the ability to bypass irrelevant experts during inference while maintaining superior performance. The proposed MoCE-IR model outperforms state-of-the-art methods, affirming its efficiency and practical applicability. The source will be made publicly available at https://eduardzamfir.github.io/moceir/

Adaptive Blind All-in-One Image Restoration 2024-11-27 Go
Show

Blind all-in-one image restoration models aim to recover a high-quality image from an input degraded with unknown distortions. However, these models require all the possible degradation types to be defined during the training stage while showing limited generalization to unseen degradations, which limits their practical application in complex cases. In this paper, we propose a simple but effective adaptive blind all-in-one restoration (ABAIR) model, which can address multiple degradations, generalizes well to unseen degradations, and efficiently incorporates new degradations by training a small fraction of parameters. First, we train our baseline model on a large dataset of natural images with multiple synthetic degradations, augmented with a segmentation head to estimate per-pixel degradation types, resulting in a powerful backbone able to generalize to a wide range of degradations. Second, we adapt our baseline model to varying image restoration tasks using independent low-rank adapters. Third, we learn to adaptively combine adapters for versatile images via a flexible and lightweight degradation estimator. Our model is both powerful in handling specific distortions and flexible in adapting to complex tasks; it not only outperforms the state-of-the-art by a large margin on five- and three-task IR setups, but also shows improved generalization to unseen degradations and composite distortions.

17 pages
Diffusion State-Guided Projected Gradient for Inverse Problems 2024-11-26 Go
Show

Recent advancements in diffusion models have been effective in learning data priors for solving inverse problems. They leverage diffusion sampling steps for inducing a data prior while using a measurement guidance gradient at each step to impose data consistency. For general inverse problems, approximations are needed when an unconditionally trained diffusion model is used since the measurement likelihood is intractable, leading to inaccurate posterior sampling. In other words, due to their approximations, these methods fail to preserve the generation process on the data manifold defined by the diffusion prior, leading to artifacts in applications such as image restoration. To enhance the performance and robustness of diffusion models in solving inverse problems, we propose Diffusion State-Guided Projected Gradient (DiffStateGrad), which projects the measurement gradient onto a subspace that is a low-rank approximation of an intermediate state of the diffusion process. DiffStateGrad, as a module, can be added to a wide range of diffusion-based inverse solvers to improve the preservation of the diffusion process on the prior manifold and filter out artifact-inducing components. We highlight that DiffStateGrad improves the robustness of diffusion models in terms of the choice of measurement guidance step size and noise while improving the worst-case performance. Finally, we demonstrate that DiffStateGrad improves upon the state-of-the-art on linear and nonlinear image restoration inverse problems.

preprint. under review. RZ and BT have equal contributions
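
A hedged sketch of the projection step DiffStateGrad describes, assuming the intermediate diffusion state can be treated as a 2D array whose truncated SVD supplies the low-rank subspace; the rank and the state layout are illustrative choices rather than the paper's exact recipe.

```python
import torch

def project_measurement_grad(grad, state, rank=16):
    # Low-rank bases from the truncated SVD of the intermediate state.
    U, S, Vh = torch.linalg.svd(state, full_matrices=False)
    Ur, Vr = U[:, :rank], Vh[:rank, :]
    # Project the measurement-guidance gradient onto the subspace spanned by
    # the top singular vectors, filtering artifact-inducing components.
    return Ur @ (Ur.T @ grad @ Vr.T) @ Vr
```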

Low-rank Adaptation-based All-Weather Removal for Autonomous Navigation 2024-11-26 Go
Show

All-weather image restoration (AWIR) is crucial for reliable autonomous navigation under adverse weather conditions. AWIR models are trained to address a specific set of weather conditions such as fog, rain, and snow. But this causes them to often struggle with out-of-distribution (OoD) samples or unseen degradations which limits their effectiveness for real-world autonomous navigation. To overcome this issue, existing models must either be retrained or fine-tuned, both of which are inefficient and impractical, with retraining needing access to large datasets, and fine-tuning involving many parameters. In this paper, we propose using Low-Rank Adaptation (LoRA) to efficiently adapt a pre-trained all-weather model to novel weather restoration tasks. Furthermore, we observe that LoRA lowers the performance of the adapted model on the pre-trained restoration tasks. To address this issue, we introduce a LoRA-based fine-tuning method called LoRA-Align (LoRA-A) which seeks to align the singular vectors of the fine-tuned and pre-trained weight matrices using Singular Value Decomposition (SVD). This alignment helps preserve the model's knowledge of its original tasks while adapting it to unseen tasks. We show that images restored with LoRA and LoRA-A can be effectively used for computer vision tasks in autonomous navigation, such as semantic segmentation and depth estimation.

Project page: https://sudraj2002.github.io/loraapage/
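
A minimal sketch of the SVD-based alignment idea behind LoRA-Align: compare the top left singular subspaces of the pre-trained and adapted weight matrices and penalize their misalignment during fine-tuning. The loss form and the rank k are assumptions for illustration, not the paper's exact objective.

```python
import torch

def singular_vector_alignment_loss(W_pre, W_ft, k=8):
    # W_pre is the frozen pre-trained weight; W_ft the LoRA-adapted weight.
    U0, _, _ = torch.linalg.svd(W_pre.detach(), full_matrices=False)
    U1, _, _ = torch.linalg.svd(W_ft, full_matrices=False)
    # Penalize low overlap between the two top-k left singular subspaces.
    overlap = U0[:, :k].T @ U1[:, :k]
    return 1.0 - (overlap ** 2).mean()
```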

GenDeg: Diffusion-Based Degradation Synthesis for Generalizable All-in-One Image Restoration 2024-11-26 Go
Show

Deep learning-based models for All-In-One Image Restoration (AIOR) have achieved significant advancements in recent years. However, their practical applicability is limited by poor generalization to samples outside the training distribution. This limitation arises primarily from insufficient diversity in degradation variations and scenes within existing datasets, resulting in inadequate representations of real-world scenarios. Additionally, capturing large-scale real-world paired data for degradations such as haze, low-light, and raindrops is often cumbersome and sometimes infeasible. In this paper, we leverage the generative capabilities of latent diffusion models to synthesize high-quality degraded images from their clean counterparts. Specifically, we introduce GenDeg, a degradation and intensity-aware conditional diffusion model capable of producing diverse degradation patterns on clean images. Using GenDeg, we synthesize over 550k samples across six degradation types: haze, rain, snow, motion blur, low-light, and raindrops. These generated samples are integrated with existing datasets to form the GenDS dataset, comprising over 750k samples. Our experiments reveal that image restoration models trained on the GenDS dataset exhibit significant improvements in out-of-distribution performance compared to those trained solely on existing datasets. Furthermore, we provide comprehensive analyses on the implications of diffusion model-based synthetic degradations for AIOR. The code will be made publicly available.

Project Page: https://sudraj2002.github.io/gendegpage/

Puzzle Similarity: A Perceptually-guided No-Reference Metric for Artifact Detection in 3D Scene Reconstructions 2024-11-26 Go
Show

Modern reconstruction techniques can effectively model complex 3D scenes from sparse 2D views. However, automatically assessing the quality of novel views and identifying artifacts is challenging due to the lack of ground truth images and the limitations of no-reference image metrics in predicting detailed artifact maps. The absence of such quality metrics hinders accurate predictions of the quality of generated views and limits the adoption of post-processing techniques, such as inpainting, to enhance reconstruction quality. In this work, we propose a new no-reference metric, Puzzle Similarity, which is designed to localize artifacts in novel views. Our approach utilizes image patch statistics from the input views to establish a scene-specific distribution that is later used to identify poorly reconstructed regions in the novel views. We test and evaluate our method in the context of 3D reconstruction; to this end, we collected a novel dataset of human quality assessment in unseen reconstructed views. Through this dataset, we demonstrate that our method can not only successfully localize artifacts in novel views, correlating with human assessment, but does so without direct references. Surprisingly, our metric outperforms both no-reference metrics and popular full-reference image metrics. We can leverage our new metric to enhance applications like automatic image restoration, guided acquisition, or 3D reconstruction from sparse inputs.

MWFormer: Multi-Weather Image Restoration Using Degradation-Aware Transformers 2024-11-26 Go
Show

Restoring images captured under adverse weather conditions is a fundamental task for many computer vision applications. However, most existing weather restoration approaches are only capable of handling a specific type of degradation, which is often insufficient in real-world scenarios, such as rainy-snowy or rainy-hazy weather. Towards being able to address these situations, we propose a multi-weather Transformer, or MWFormer for short, which is a holistic vision Transformer that aims to solve multiple weather-induced degradations using a single, unified architecture. MWFormer uses hyper-networks and feature-wise linear modulation blocks to restore images degraded by various weather types using the same set of learned parameters. We first employ contrastive learning to train an auxiliary network that extracts content-independent, distortion-aware feature embeddings that efficiently represent predicted weather types, of which more than one may occur. Guided by these weather-informed predictions, the image restoration Transformer adaptively modulates its parameters to conduct both local and global feature processing, in response to multiple possible weather types. Moreover, MWFormer allows for a novel way of tuning, during application, to either a single type of weather restoration or to hybrid weather restoration without any retraining, offering greater controllability than existing methods. Our experimental results on multi-weather restoration benchmarks show that MWFormer achieves significant performance improvements compared to existing state-of-the-art methods, without requiring much additional computational cost. Moreover, we demonstrate that our methodology of using hyper-networks can be integrated into various network architectures to further boost their performance. The code is available at: https://github.com/taco-group/MWFormer

Accepted by IEEE Transactions on Image Processing. The code is available at: https://github.com/taco-group/MWFormer

Text-guided Image Restoration and Semantic Enhancement for Text-to-Image Person Retrieval 2024-11-25 Go
Show

The goal of Text-to-Image Person Retrieval (TIPR) is to retrieve specific person images according to the given textual descriptions. A primary challenge in this task is bridging the substantial representational gap between visual and textual modalities. The prevailing methods map texts and images into a unified embedding space for matching, while the intricate semantic correspondences between texts and images are still not effectively constructed. To address this issue, we propose a novel TIPR framework to build fine-grained interactions and alignment between person images and the corresponding texts. Specifically, a visual-textual dual encoder is first constructed by fine-tuning the Contrastive Language-Image Pre-training (CLIP) model to preliminarily align image and text features. Secondly, a Text-guided Image Restoration (TIR) auxiliary task is proposed to map abstract textual entities to specific image regions, improving the alignment between local textual and visual embeddings. Additionally, a cross-modal triplet loss is presented to handle hard samples and further enhance the model's discriminability for minor differences. Moreover, a pruning-based text data augmentation approach is proposed to enhance focus on essential elements in descriptions, thereby avoiding excessive model attention to less significant information. The experimental results show our proposed method outperforms state-of-the-art methods on three popular benchmark datasets, and the code will be made publicly available at https://github.com/Delong-liu-bupt/SEN.

Mixed Degradation Image Restoration via Local Dynamic Optimization and Conditional Embedding 2024-11-25 Go
Show

Multiple-in-one image restoration (IR) has made significant progress, aiming to handle all types of single degraded image restoration with a single model. However, in real-world scenarios, images often suffer from combinations of multiple degradation factors. Existing multiple-in-one IR models encounter challenges related to degradation diversity and prompt singularity when addressing this issue. In this paper, we propose a novel multiple-in-one IR model that can effectively restore images with both single and mixed degradations. To address degradation diversity, we design a Local Dynamic Optimization (LDO) module which dynamically processes degraded areas of varying types and granularities. To tackle the prompt singularity issue, we develop an efficient Conditional Feature Embedding (CFE) module that guides the decoder in leveraging degradation-type-related features, significantly improving the model's performance in mixed degradation restoration scenarios. To validate the effectiveness of our model, we introduce a new dataset containing both single and mixed degradation elements. Experimental results demonstrate that our proposed model achieves state-of-the-art (SOTA) performance not only on mixed degradation tasks but also on classic single-task restoration benchmarks.

10 pages, 3 figures, 8 tables

Reconstructing Satellites in 3D from Amateur Telescope Images 2024-11-25 Go
Show

This paper proposes a framework for the 3D reconstruction of satellites in low-Earth orbit, utilizing videos captured by small amateur telescopes. The video data obtained from these telescopes differ significantly from data for standard 3D reconstruction tasks, characterized by intense motion blur, atmospheric turbulence, pervasive background light pollution, extended focal length and constrained observational perspectives. To address these challenges, our approach begins with a comprehensive pre-processing workflow that encompasses deep learning-based image restoration, feature point extraction and camera pose initialization. We apply a customized Structure from Motion (SfM) approach, followed by an improved 3D Gaussian splatting algorithm, to achieve high-fidelity 3D model reconstruction. Our technique supports simultaneous 3D Gaussian training and pose estimation, enabling the robust generation of intricate 3D point clouds from sparse, noisy data. The procedure is further bolstered by a post-editing phase designed to eliminate noise points inconsistent with our prior knowledge of a satellite's geometric constraints. We validate our approach on synthetic datasets and actual observations of China's Space Station and International Space Station, showcasing its significant advantages over existing methods in reconstructing 3D space objects from ground-based observations.

U2NeRF: Unsupervised Underwater Image Restoration and Neural Radiance Fields 2024-11-25 Go
Show

Underwater images suffer from colour shifts, low contrast, and haziness due to light absorption, refraction, and scattering, and restoring these images has warranted much attention. In this work, we present Unsupervised Underwater Neural Radiance Field (U2NeRF), a transformer-based architecture that learns to render and restore novel views conditioned on multi-view geometry simultaneously. Due to the absence of supervision, we attempt to implicitly bake restoring capabilities into the NeRF pipeline and disentangle the predicted color into several components - scene radiance, direct transmission map, backscatter transmission map, and global background light - which, when combined, reconstruct the underwater image in a self-supervised manner. In addition, we release an Underwater View Synthesis (UVS) dataset consisting of 12 underwater scenes, containing both synthetically-generated and real-world data. Our experiments demonstrate that when optimized on a single scene, U2NeRF outperforms several baselines by as much as 11% LPIPS, 5% UIQM, and 4% UCIQE (on average), and showcases improved rendering and restoration capabilities. Code will be made available upon acceptance.

ICLR Tiny Papers 2024. arXiv admin note: text overlap with arXiv:2207.13298

LTCF-Net: A Transformer-Enhanced Dual-Channel Fourier Framework for Low-Light Image Restoration 2024-11-24 Go
Show

We introduce LTCF-Net, a novel network architecture designed for enhancing low-light images. Unlike Retinex-based methods, our approach utilizes two color spaces - LAB and YUV - to efficiently separate and process color information, leveraging the separation of luminance from chromatic components in color images. In addition, our model incorporates the Transformer architecture to comprehensively understand image content while maintaining computational efficiency. To dynamically balance the brightness in output images, we also introduce a Fourier transform module that adjusts the luminance channel in the frequency domain. This mechanism uniformly balances brightness across different regions while eliminating background noise, thereby enhancing visual quality. By combining these innovative components, LTCF-Net effectively improves low-light image quality while keeping the model lightweight. Experimental results demonstrate that our method outperforms current state-of-the-art approaches across multiple evaluation metrics and datasets, achieving more natural color restoration and a balanced brightness distribution.

Gradient-Guided Parameter Mask for Multi-Scenario Image Restoration Under Adverse Weather 2024-11-23 Go
Show

Removing adverse weather conditions such as rain, raindrops, and snow from images is critical for various real-world applications, including autonomous driving, surveillance, and remote sensing. However, existing multi-task approaches typically rely on augmenting the model with additional parameters to handle multiple scenarios. While this enables the model to address diverse tasks, the introduction of extra parameters significantly complicates its practical deployment. In this paper, we propose a novel Gradient-Guided Parameter Mask for Multi-Scenario Image Restoration under adverse weather, designed to effectively handle image degradation under diverse weather conditions without additional parameters. Our method segments model parameters into common and specific components by evaluating the gradient variation intensity during training for each specific weather condition. This enables the model to precisely and adaptively learn relevant features for each weather scenario, improving both efficiency and effectiveness without compromising performance. The method constructs specific masks based on gradient fluctuations to isolate parameters influenced by other tasks, ensuring that the model achieves strong performance across all scenarios without adding extra parameters. We demonstrate the state-of-the-art performance of our framework through extensive experiments on multiple benchmark datasets. Specifically, our method achieves PSNR scores of 29.22 on the Raindrop dataset, 30.76 on the Rain dataset, and 29.56 on the Snow100K dataset. Code is available at: https://github.com/AierLab/MultiTask.
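
A hedged sketch of the gradient-guided masking idea described above: accumulate per-parameter gradient magnitudes while training on each weather task, then mark the parameters whose gradients vary most across tasks as task-specific and the rest as common. The variance statistic and the threshold ratio are assumptions for illustration.

```python
import torch

def build_parameter_masks(grad_stats, specific_ratio=0.2):
    # grad_stats: tensor [n_tasks, n_params] of accumulated |grad| per task.
    variation = grad_stats.var(dim=0)       # cross-task gradient variation intensity
    k = max(1, int(specific_ratio * variation.numel()))
    thresh = torch.topk(variation, k).values.min()
    specific_mask = variation >= thresh     # parameters tied to individual tasks
    common_mask = ~specific_mask            # parameters shared across all tasks
    return common_mask, specific_mask
```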

MS-Glance: Bio-Inspired Non-semantic Context Vectors and their Applications in Supervising Image Reconstruction 2024-11-23 Go
Show

Non-semantic context information is crucial for visual recognition, as the human visual perception system first uses global statistics to process scenes rapidly before identifying specific objects. However, while semantic information is increasingly incorporated into computer vision tasks such as image reconstruction, non-semantic information, such as global spatial structures, is often overlooked. To bridge the gap, we propose a biologically informed non-semantic context descriptor, **MS-Glance**, along with the Glance Index Measure for comparing two images. A Global Glance vector is formulated by randomly retrieving pixels based on a perception-driven rule from an image to form a vector representing non-semantic global context, while a local Glance vector is a flattened local image window, mimicking a zoom-in observation. The Glance Index is defined as the inner product of two standardized sets of Glance vectors. We evaluate the effectiveness of incorporating Glance supervision in two reconstruction tasks: image fitting with implicit neural representation (INR) and undersampled MRI reconstruction. Extensive experimental results show that MS-Glance outperforms existing image restoration losses across both natural and medical images. The code is available at https://github.com/Z7Gao/MSGlance.

Accepted by WACV 2025
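
Since the Glance Index reduces to an inner product of standardized pixel-retrieval vectors, it can be sketched in a few lines of numpy. The uniform random retrieval rule and vector length below are illustrative assumptions standing in for the paper's perception-driven rule.

```python
import numpy as np

def glance_index(img_a, img_b, n_pixels=1024, seed=0):
    rng = np.random.default_rng(seed)                # shared retrieval rule for both images
    idx = rng.choice(img_a.size, n_pixels, replace=False)
    va = img_a.ravel()[idx].astype(float)            # global Glance vectors
    vb = img_b.ravel()[idx].astype(float)
    va = (va - va.mean()) / (va.std() + 1e-8)        # standardize each vector
    vb = (vb - vb.mean()) / (vb.std() + 1e-8)
    return float(va @ vb) / n_pixels                 # normalized inner product
```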

Efficient Diffusion Model for Image Restoration by Residual Shifting 2024-11-23 Go
Show

While diffusion-based image restoration (IR) methods have achieved remarkable success, they are still limited by the low inference speed attributed to the necessity of executing hundreds or even thousands of sampling steps. Existing acceleration sampling techniques, though seeking to expedite the process, inevitably sacrifice performance to some extent, resulting in over-blurry restored outcomes. To address this issue, this study proposes a novel and efficient diffusion model for IR that significantly reduces the required number of diffusion steps. Our method avoids the need for post-acceleration during inference, thereby avoiding the associated performance deterioration. Specifically, our proposed method establishes a Markov chain that facilitates the transitions between the high-quality and low-quality images by shifting their residuals, substantially improving the transition efficiency. A carefully formulated noise schedule is devised to flexibly control the shifting speed and the noise strength during the diffusion process. Extensive experimental evaluations demonstrate that the proposed method achieves superior or comparable performance to current state-of-the-art methods on three classical IR tasks, namely image super-resolution, image inpainting, and blind face restoration, **even with only four sampling steps**. Our code and model are publicly available at https://github.com/zsyOAOA/ResShift.

Corrected a typo. TPAMI@2024. Code: https://github.com/zsyOAOA/ResShift
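
The residual-shifting forward process has a simple closed form: the chain moves from the high-quality image x0 toward the low-quality image y by shifting their residual with a monotone schedule eta_t, plus noise that grows with the shift. The sketch below uses a toy linear schedule and an illustrative kappa; the released code defines the exact values.

```python
import torch

def residual_shift_sample(x0, y, t, T=4, kappa=2.0):
    # q(x_t | x0, y) = N(x0 + eta_t * (y - x0), kappa^2 * eta_t * I)
    eta = (t + 1) / T                         # toy monotone shifting schedule
    mean = x0 + eta * (y - x0)                # shift the HQ->LQ residual
    std = kappa * (eta ** 0.5)                # noise strength grows with the shift
    return mean + std * torch.randn_like(x0)
```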

Frequency-Guided Posterior Sampling for Diffusion-Based Image Restoration 2024-11-22 Go
Show

Image restoration aims to recover high-quality images from degraded observations. When the degradation process is known, the recovery problem can be formulated as an inverse problem, and in a Bayesian context, the goal is to sample a clean reconstruction given the degraded observation. Recently, modern pretrained diffusion models have been used for image restoration by modifying their sampling procedure to account for the degradation process. However, these methods often rely on certain approximations that can lead to significant errors and compromised sample quality. In this paper, we provide the first rigorous analysis of this approximation error for linear inverse problems under distributional assumptions on the space of natural images, demonstrating cases where previous works can fail dramatically. Motivated by our theoretical insights, we propose a simple modification to existing diffusion-based restoration methods. Our approach introduces a time-varying low-pass filter in the frequency domain of the measurements, progressively incorporating higher frequencies during the restoration process. We develop an adaptive curriculum for this frequency schedule based on the underlying data distribution. Our method significantly improves performance on challenging image restoration tasks including motion deblurring and image dehazing.
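
A minimal sketch of the frequency curriculum this abstract proposes: at each restoration step the measurement residual is compared only below a cutoff frequency that grows over time, so low frequencies are enforced first. The linear cutoff schedule is an assumed stand-in for the paper's adaptive, data-driven curriculum.

```python
import torch

def lowpass(x, cutoff):
    # Keep spatial frequencies whose normalized radius is below `cutoff`.
    X = torch.fft.fftshift(torch.fft.fft2(x))
    h, w = x.shape[-2:]
    yy, xx = torch.meshgrid(torch.linspace(-1, 1, h),
                            torch.linspace(-1, 1, w), indexing="ij")
    mask = (yy ** 2 + xx ** 2).sqrt() <= cutoff
    return torch.fft.ifft2(torch.fft.ifftshift(X * mask)).real

def guidance_residual(pred_meas, meas, step, n_steps):
    cutoff = (step + 1) / n_steps   # time-varying low-pass: low frequencies first
    return lowpass(pred_meas - meas, cutoff)
```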

MambaIRv2: Attentive State Space Restoration 2024-11-22 Go
Show

The Mamba-based image restoration backbones have recently demonstrated significant potential in balancing global reception and computational efficiency. However, the inherent causal modeling limitation of Mamba, where each token depends solely on its predecessors in the scanned sequence, restricts the full utilization of pixels across the image and thus presents new challenges in image restoration. In this work, we propose MambaIRv2, which equips Mamba with non-causal modeling ability similar to ViTs to reach the attentive state space restoration model. Specifically, the proposed attentive state-space equation allows attending beyond the scanned sequence and facilitates image unfolding with just one single scan. Moreover, we further introduce a semantic-guided neighboring mechanism to encourage interaction between distant but similar pixels. Extensive experiments show our MambaIRv2 outperforms SRFormer by **0.35 dB** PSNR for lightweight SR even with **9.3% fewer** parameters and surpasses HAT on classic SR by **up to 0.29 dB**. Code is available at https://github.com/csguoh/MambaIR.

Technical report
Rotation Equivariant Proximal Operator for Deep Unfolding Methods in Image Restoration 2024-11-20 Go
Show

The deep unfolding approach has attracted significant attention in computer vision tasks, as it connects conventional image processing modeling with more recent deep learning techniques. Specifically, by establishing a direct correspondence between algorithm operators at each implementation step and network modules within each layer, one can rationally construct an almost "white box" network architecture with high interpretability. In this architecture, only the predefined component of the proximal operator, known as a proximal network, needs manual configuration, enabling the network to automatically extract intrinsic image priors in a data-driven manner. In current deep unfolding methods, such a proximal network is generally designed as a CNN architecture, whose necessity has been proven by a recent theory. That is, the CNN structure substantially delivers the translation-invariant image prior, which is the most universally possessed structural prior across various types of images. However, standard CNN-based proximal networks have essential limitations in capturing the rotation symmetry prior, another universal structural prior underlying general images. This leaves large room for further performance improvement in deep unfolding approaches. To address this issue, this study proposes a high-accuracy rotation equivariant proximal network that effectively embeds rotation symmetry priors into the deep unfolding framework. In particular, we deduce, for the first time, the theoretical equivariant error for such a designed proximal network with arbitrary layers under arbitrary rotation degrees. This analysis should be the most refined theoretical conclusion for such error evaluation to date and is also indispensable for supporting the rationale behind such networks with intrinsic interpretability requirements.

Published in TPAMI 2024

TSFormer: A Robust Framework for Efficient UHD Image Restoration 2024-11-19 Go
Show

Ultra-high-definition (UHD) image restoration is vital for applications demanding exceptional visual fidelity, yet existing methods often face a trade-off between restoration quality and efficiency, limiting their practical deployment. In this paper, we propose TSFormer, an all-in-one framework that integrates Trusted learning with Sparsification to boost both generalization capability and computational efficiency in UHD image restoration. The key is that only a small amount of token movement is allowed within the model. To efficiently filter tokens, we use Min-p with random matrix theory to quantify the uncertainty of tokens, thereby improving the robustness of the model. Our model can process a 4K image in real time (40 fps) with only 3.38M parameters. Extensive experiments demonstrate that TSFormer achieves state-of-the-art restoration quality while enhancing generalization and reducing computational demands. In addition, our token filtering method can be applied to other image restoration models to effectively accelerate inference and maintain performance.
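
The token-filtering step above can be illustrated with a generic Min-p style mask. This is a sketch under the assumption that per-token importance scores are already available; the paper's uncertainty measure, derived from random matrix theory, is not reproduced.

```python
import torch

def min_p_token_mask(scores: torch.Tensor, p: float = 0.1) -> torch.Tensor:
    """Keep tokens whose score is at least `p` times the strongest score.

    scores: (batch, num_tokens) per-token importance. Returns a boolean
    mask; masked-out tokens can be skipped by subsequent blocks, which is
    where the inference speed-up comes from.
    """
    threshold = p * scores.amax(dim=-1, keepdim=True)
    return scores >= threshold
```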

Frequency-Aware Guidance for Blind Image Restoration via Diffusion Models 2024-11-19 Go
Show

Blind image restoration remains a significant challenge in low-level vision tasks. Recently, denoising diffusion models have shown remarkable performance in image synthesis. Guided diffusion models, leveraging the potent generative priors of pre-trained models along with a differential guidance loss, have achieved promising results in blind image restoration. However, these models typically consider data consistency solely in the spatial domain, often resulting in distorted image content. In this paper, we propose a novel frequency-aware guidance loss that can be integrated into various diffusion models in a plug-and-play manner. Our proposed guidance loss, based on the 2D discrete wavelet transform, simultaneously enforces content consistency in both the spatial and frequency domains. Experimental results demonstrate the effectiveness of our method in three blind restoration tasks: blind image deblurring, imaging through turbulence, and blind restoration for multiple degradations. Notably, our method achieves a significant improvement in PSNR, with a remarkable enhancement of 3.72 dB in image deblurring. Moreover, our method exhibits superior capability in generating images with rich details and reduced distortion, leading to the best visual quality.

17 pages, 6 figures, has been accepted by the ECCV 2024: AIM workshop
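
The plug-and-play guidance loss can be sketched directly from the description above. A minimal version using PyWavelets and NumPy, where the measurement operator `A` mapping the current estimate to the observation space is an assumed placeholder:

```python
import numpy as np
import pywt

def frequency_aware_loss(x_hat, y, A=lambda x: x, w_spat=1.0, w_freq=1.0):
    """Enforce data consistency in both the spatial and 2D-DWT (Haar) domains."""
    r_spat = np.mean((A(x_hat) - y) ** 2)
    # one-level Haar decomposition of the estimate and the observation
    cA1, (cH1, cV1, cD1) = pywt.dwt2(A(x_hat), 'haar')
    cA2, (cH2, cV2, cD2) = pywt.dwt2(y, 'haar')
    r_freq = sum(np.mean((a - b) ** 2) for a, b in
                 [(cA1, cA2), (cH1, cH2), (cV1, cV2), (cD1, cD2)])
    return w_spat * r_spat + w_freq * r_freq
```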

Versatile Cataract Fundus Image Restoration Model Utilizing Unpaired Cataract and High-quality Images 2024-11-19 Go
Show

Cataract is one of the most common blinding eye diseases and can be treated by surgery. However, because cataract patients may also suffer from other blinding eye diseases, ophthalmologists must diagnose them before surgery. The cloudy lens of cataract patients forms a hazy degeneration in the fundus images, making it challenging to observe the patient's fundus vessels and complicating diagnosis. To address this issue, this paper introduces a new cataract image restoration method named Catintell. It contains a cataract image synthesizing model, Catintell-Syn, and a restoration model, Catintell-Res. Catintell-Syn uses a GAN architecture with fully unsupervised data to generate paired cataract-like images with realistic style and texture, rather than relying on the conventional Gaussian degradation algorithm. Meanwhile, Catintell-Res is an image restoration network that can improve the quality of real cataract fundus images using the knowledge learned from synthetic cataract images. Extensive experiments show that Catintell-Res outperforms other cataract image restoration methods, achieving a PSNR of 39.03 and an SSIM of 0.9476. Furthermore, the universal restoration ability that Catintell-Res gains from unpaired cataract images allows it to process cataract images from various datasets. We hope the models can help ophthalmologists identify other blinding eye diseases in cataract patients and inspire more medical image restoration methods in the future.

12 pages, 8 figures
Taming Generative Diffusion Prior for Universal Blind Image Restoration 2024-11-19 Go
Show

Diffusion models have been widely utilized for image restoration. However, previous blind image restoration methods still need to assume the type of degradation model while leaving the parameters to be optimized, limiting their real-world applications. Therefore, we aim to tame the generative diffusion prior for universal blind image restoration, dubbed BIR-D, which utilizes an optimizable convolutional kernel to simulate the degradation model and dynamically updates the parameters of the kernel in the diffusion steps, enabling it to achieve blind image restoration results even in various complex situations. Besides, based on mathematical reasoning, we provide an empirical formula for choosing the adaptive guidance scale, eliminating the need for a grid search for the optimal parameter. Experimentally, our BIR-D demonstrates greater practicality and versatility than off-the-shelf unsupervised methods across various tasks on both real-world and synthetic datasets, qualitatively and quantitatively. BIR-D is able to perform multi-guidance blind image restoration. Moreover, BIR-D can also restore images that undergo multiple and complicated degradations, demonstrating its practical applicability.

15 pages, 12 figures, 8 tables

HAIR: Hypernetworks-based All-in-One Image Restoration 2024-11-18 Go
Show

Image restoration aims to recover a high-quality clean image from its degraded version. Recent progress in image restoration has demonstrated the effectiveness of All-in-One image restoration models in addressing various unknown degradations simultaneously. However, these existing methods typically utilize the same parameters to tackle images with different types of degradation, forcing the model to balance performance between different tasks and limiting its performance on each task. To alleviate this issue, we propose HAIR, a Hypernetworks-based All-in-One Image Restoration plug-and-play method that generates parameters based on the input image and thus enables the model to adapt dynamically to the specific degradation. Specifically, HAIR consists of two main components, i.e., a Classifier and a Hyper Selecting Net (HSN). The Classifier is a simple image classification network used to generate a Global Information Vector (GIV) that contains the degradation information of the input image, and the HSN is a simple fully-connected neural network that receives the GIV and outputs parameters for the corresponding modules. Extensive experiments demonstrate that HAIR can significantly improve the performance of existing image restoration models in a plug-and-play manner, both in single-task and All-in-One settings. Notably, our proposed model Res-HAIR, which integrates HAIR into the well-known Restormer, achieves superior or comparable performance compared with current state-of-the-art methods. Moreover, we theoretically demonstrate that, to achieve a given sufficiently small error, our proposed HAIR requires fewer parameters than mainstream embedding-based All-in-One methods. The code is available at https://github.com/toummHus/HAIR.
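
The hypernetwork mechanism is compact enough to sketch: a degradation descriptor (the paper's GIV) is mapped by a small fully-connected net to the weights of a restoration module. A simplified, single-image PyTorch sketch, not the released implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HyperConv(nn.Module):
    """A conv layer whose weights are generated from a degradation vector."""

    def __init__(self, giv_dim: int, c_in: int, c_out: int, k: int = 3):
        super().__init__()
        self.c_in, self.c_out, self.k = c_in, c_out, k
        # stand-in for the paper's Hyper Selecting Net (HSN)
        self.hsn = nn.Linear(giv_dim, c_out * c_in * k * k)

    def forward(self, x: torch.Tensor, giv: torch.Tensor) -> torch.Tensor:
        # x: (1, c_in, H, W); giv: (giv_dim,) from the degradation classifier
        w = self.hsn(giv).view(self.c_out, self.c_in, self.k, self.k)
        return F.conv2d(x, w, padding=self.k // 2)
```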

HAT: Hybrid Attention Transformer for Image Restoration 2024-11-17 Go
Show

Transformer-based methods have shown impressive performance in image restoration tasks, such as image super-resolution and denoising. However, we find that these networks can only utilize a limited spatial range of input information through attribution analysis. This implies that the potential of Transformer is still not fully exploited in existing networks. In order to activate more input pixels for better restoration, we propose a new Hybrid Attention Transformer (HAT). It combines both channel attention and window-based self-attention schemes, thus making use of their complementary advantages. Moreover, to better aggregate the cross-window information, we introduce an overlapping cross-attention module to enhance the interaction between neighboring window features. In the training stage, we additionally adopt a same-task pre-training strategy to further exploit the potential of the model for further improvement. Extensive experiments have demonstrated the effectiveness of the proposed modules. We further scale up the model to show that the performance of the SR task can be greatly improved. Besides, we extend HAT to more image restoration applications, including real-world image super-resolution, Gaussian image denoising and image compression artifacts reduction. Experiments on benchmark and real-world datasets demonstrate that our HAT achieves state-of-the-art performance both quantitatively and qualitatively. Codes and models are publicly available at https://github.com/XPixelGroup/HAT.

Extended version of HAT. arXiv admin note: text overlap with arXiv:2205.04437

LoRA-IR: Taming Low-Rank Experts for Efficient All-in-One Image Restoration 2024-11-16 Go
Show

Prompt-based all-in-one image restoration (IR) frameworks have achieved remarkable performance by incorporating degradation-specific information into prompt modules. Nevertheless, handling the complex and diverse degradations encountered in real-world scenarios remains a significant challenge. To tackle this, we propose LoRA-IR, a flexible framework that dynamically leverages compact low-rank experts to facilitate efficient all-in-one image restoration. Specifically, LoRA-IR consists of two training stages: degradation-guided pre-training and parameter-efficient fine-tuning. In the pre-training stage, we enhance the pre-trained CLIP model by introducing a simple mechanism that scales it to higher resolutions, allowing us to extract robust degradation representations that adaptively guide the IR network. In the fine-tuning stage, we refine the pre-trained IR network through low-rank adaptation (LoRA). Built upon a Mixture-of-Experts (MoE) architecture, LoRA-IR dynamically integrates multiple low-rank restoration experts through a degradation-guided router. This dynamic integration mechanism significantly enhances our model's adaptability to diverse and unknown degradations in complex real-world scenarios. Extensive experiments demonstrate that LoRA-IR achieves SOTA performance across 14 IR tasks and 29 benchmarks, while maintaining computational efficiency. Code and pre-trained models will be available at: https://github.com/shallowdream204/LoRA-IR.

AllRestorer: All-in-One Transformer for Image Restoration under Composite Degradations 2024-11-16 Go
Show

Image restoration models often face the simultaneous interaction of multiple degradations in real-world scenarios. Existing approaches typically handle single or composite degradations based on scene descriptors derived from text or image embeddings. However, due to the varying proportions of different degradations within an image, these scene descriptors may not accurately differentiate between degradations, leading to suboptimal restoration in practical applications. To address this issue, we propose a novel Transformer-based restoration framework, AllRestorer. In AllRestorer, we enable the model to adaptively consider all image impairments, thereby avoiding errors from scene descriptor misdirection. Specifically, we introduce an All-in-One Transformer Block (AiOTB), which adaptively removes all degradations present in a given image by modeling the relationships between all degradations and the image embedding in latent space. To accurately address different variations potentially present within the same type of degradation and minimize ambiguity, AiOTB utilizes a composite scene descriptor consisting of both image and text embeddings to define the degradation. Furthermore, AiOTB includes an adaptive weight for each degradation, allowing for precise control of the restoration intensity. By leveraging AiOTB, AllRestorer avoids misdirection caused by inaccurate scene descriptors, achieving a 5.00 dB increase in PSNR compared to the baseline on the CDD-11 dataset.

12 pages, 11 figures
Probabilistic Prior Driven Attention Mechanism Based on Diffusion Model for Imaging Through Atmospheric Turbulence 2024-11-15 Go
Show

Atmospheric turbulence introduces severe spatial and geometric distortions, challenging traditional image restoration methods. We propose the Probabilistic Prior Turbulence Removal Network (PPTRN), which combines probabilistic diffusion-based prior modeling with Transformer-driven feature extraction to address this issue. PPTRN employs a two-stage approach: first, a latent encoder and Transformer are jointly trained on clear images to establish robust feature representations. Then, a Denoising Diffusion Probabilistic Model (DDPM) models prior distributions over latent vectors, guiding the Transformer in capturing diverse feature variations essential for restoration. A key innovation in PPTRN is the Probabilistic Prior Driven Cross Attention mechanism, which integrates the DDPM-generated prior with feature embeddings to reduce artifacts and enhance spatial coherence. Extensive experiments validate that PPTRN significantly improves restoration quality on turbulence-degraded images, setting a new benchmark in clarity and structural fidelity.

Deep Block Proximal Linearised Minimisation Algorithm for Non-convex Inverse Problems 2024-11-14 Go
Show

Image restoration is typically addressed through non-convex inverse problems, which are often solved using first-order block-wise splitting methods. In this paper, we consider a general type of non-convex optimisation model that captures many inverse image problems and present an inertial block proximal linearised minimisation (iBPLM) algorithm. Our new method unifies the Jacobi-type parallel and the Gauss-Seidel-type alternating update rules, and extends beyond these approaches. The inertial technique is also incorporated into each block-wise subproblem update, which can accelerate numerical convergence. Furthermore, we extend this framework with a plug-and-play variant (PnP-iBPLM) that integrates deep gradient denoisers, offering a flexible and robust solution for complex imaging tasks. We provide comprehensive theoretical analysis, demonstrating both subsequential and global convergence of the proposed algorithms. To validate our methods, we apply them to multi-block dictionary learning problems in image denoising and deblurring. Experimental results show that both iBPLM and PnP-iBPLM significantly enhance numerical performance and robustness in these applications.

6 figures, 3 tables
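
The core update of such an inertial block proximal scheme reads naturally as a few lines. The one-block sketch below omits the Jacobi/Gauss-Seidel block ordering, and the PnP variant would swap `prox_g` for a deep denoiser:

```python
import numpy as np

def inertial_prox_step(x, x_prev, grad_f, prox_g, step, beta=0.5):
    """One inertial proximal-gradient update on a single block: extrapolate
    with momentum, descend on the smooth term f, then apply the proximal
    operator of the non-smooth term g."""
    v = x + beta * (x - x_prev)          # inertial extrapolation
    return prox_g(v - step * grad_f(v), step)

# Example prox: soft-thresholding, the proximal operator of lam * ||.||_1
def prox_l1(z, step, lam=0.1):
    return np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)
```
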
Joint multi-dimensional dynamic attention and transformer for general image restoration 2024-11-12 Go
Show

Outdoor images often suffer from severe degradation due to rain, haze, and noise, impairing image quality and challenging high-level tasks. Current image restoration methods struggle to handle complex degradation while maintaining efficiency. This paper introduces a novel image restoration architecture that combines multi-dimensional dynamic attention and self-attention within a U-Net framework. To leverage the global modeling capabilities of transformers and the local modeling capabilities of convolutions, we use CNNs exclusively in the encoder-decoder and transformers exclusively in the latent layer. Additionally, we design convolutional kernels with selected multi-dimensional dynamic attention to capture diverse degraded inputs efficiently. A transformer block with transposed self-attention further enhances global feature extraction while maintaining efficiency. Extensive experiments demonstrate that our method achieves a better balance between performance and computational complexity across five image restoration tasks: deraining, deblurring, denoising, dehazing, and enhancement, as well as superior performance for high-level vision tasks. The source code will be available at https://github.com/House-yuyu/MDDA-former.

All-in-one Weather-degraded Image Restoration via Adaptive Degradation-aware Self-prompting Model 2024-11-12 Go
Show

Existing approaches for all-in-one weather-degraded image restoration suffer from inefficiencies in leveraging degradation-aware priors, resulting in sub-optimal performance in adapting to different weather conditions. To this end, we develop an adaptive degradation-aware self-prompting model (ADSM) for all-in-one weather-degraded image restoration. Specifically, our model employs the contrastive language-image pre-training model (CLIP) to facilitate the training of our proposed latent prompt generators (LPGs), which represent three types of latent prompts to characterize the degradation type, degradation property and image caption. Moreover, we integrate the acquired degradation-aware prompts into the time embedding of diffusion model to improve degradation perception. Meanwhile, we employ the latent caption prompt to guide the reverse sampling process using the cross-attention mechanism, thereby guiding the accurate image reconstruction. Furthermore, to accelerate the reverse sampling procedure of diffusion model and address the limitations of frequency perception, we introduce a wavelet-oriented noise estimating network (WNE-Net). Extensive experiments conducted on eight publicly available datasets demonstrate the effectiveness of our proposed approach in both task-specific and all-in-one applications.

Multi-scale Frequency Enhancement Network for Blind Image Deblurring 2024-11-11 Go
Show

Image deblurring is an essential image preprocessing technique, aiming to recover clear and detailed images from blurry ones. However, existing algorithms often fail to effectively integrate multi-scale feature extraction with frequency enhancement, limiting their ability to reconstruct fine textures. Additionally, non-uniform blur in images also restricts the effectiveness of image restoration. To address these issues, we propose a multi-scale frequency enhancement network (MFENet) for blind image deblurring. To capture the multi-scale spatial and channel information of blurred images, we introduce a multi-scale feature extraction module (MS-FE) based on depthwise separable convolutions, which provides rich target features for deblurring. We propose a frequency enhanced blur perception module (FEBP) that employs wavelet transforms to extract high-frequency details and utilizes multi-strip pooling to perceive non-uniform blur, combining multi-scale information with frequency enhancement to improve the restoration of image texture details. Experimental results on the GoPro and HIDE datasets demonstrate that the proposed method achieves superior deblurring performance in both visual quality and objective evaluation metrics. Furthermore, in downstream object detection tasks, the proposed blind image deblurring algorithm significantly improves detection accuracy, further validating its effectiveness and robustness in the field of image deblurring.

Dropout the High-rate Downsampling: A Novel Design Paradigm for UHD Image Restoration 2024-11-10 Go
Show

With the popularization of high-end mobile devices, Ultra-high-definition (UHD) images have become ubiquitous in our lives. The restoration of UHD images is a highly challenging problem due to the exaggerated pixel count, which often leads to memory overflow during processing. Existing methods either downsample UHD images at a high rate before processing or split them into multiple patches for separate processing. However, high-rate downsampling leads to significant information loss, while patch-based approaches inevitably introduce boundary artifacts. In this paper, we propose a novel design paradigm to solve the UHD image restoration problem, called D2Net. D2Net enables direct full-resolution inference on UHD images without the need for high-rate downsampling or dividing the images into several patches. Specifically, we ingeniously utilize the characteristics of the frequency domain to establish long-range dependencies of features. Taking into account the richer local patterns in UHD images, we also design a multi-scale convolutional group to capture local features. Additionally, during the decoding stage, we dynamically incorporate features from the encoding stage to reduce the flow of irrelevant information. Extensive experiments on three UHD image restoration tasks, including low-light image enhancement, image dehazing, and image deblurring, show that our model achieves better quantitative and qualitative results than state-of-the-art methods.

WACV2025
A Modular Conditional Diffusion Framework for Image Reconstruction 2024-11-08 Go
Show

Diffusion Probabilistic Models (DPMs) have been recently utilized to deal with various blind image restoration (IR) tasks, where they have demonstrated outstanding performance in terms of perceptual quality. However, the task-specific nature of existing solutions and the excessive computational costs related to their training make such models impractical and challenging to use for IR tasks other than those they were initially trained for. This hinders their wider adoption, especially by those who lack access to powerful computational resources and vast amounts of training data. In this work, we aim to address the above issues and enable the successful adoption of DPMs in practical IR-related applications. Towards this goal, we propose a modular diffusion probabilistic IR framework (DP-IR), which allows us to combine the performance benefits of existing pre-trained state-of-the-art IR networks and generative DPMs, while requiring only the additional training of a relatively small module (0.7M params) related to the particular IR task of interest. Moreover, the architecture of the proposed framework allows for a sampling strategy that leads to at least a fourfold reduction in neural function evaluations without any performance loss, and it can also be combined with existing acceleration techniques such as DDIM. We evaluate our model on four benchmarks for the tasks of burst JDD-SR, dynamic scene deblurring, and super-resolution. Our method outperforms existing approaches in terms of perceptual quality while retaining competitive performance with respect to fidelity metrics.

Blind Image Restoration via Fast Diffusion Inversion 2024-11-05 Go
Show

Image Restoration (IR) methods based on a pre-trained diffusion model have demonstrated state-of-the-art performance. However, they have two fundamental limitations: 1) they often assume that the degradation operator is completely known and 2) they alter the diffusion sampling process, which may result in restored images that do not lie on the data manifold. To address these issues, we propose Blind Image Restoration via fast Diffusion inversion (BIRD), a blind IR method that jointly optimizes for the degradation model parameters and the restored image. To ensure that the restored images lie on the data manifold, we propose a novel sampling technique on a pre-trained diffusion model. A key idea in our method is not to modify the reverse sampling, i.e., not to alter all the intermediate latents, once an initial noise is sampled. This is ultimately equivalent to casting the IR task as an optimization problem in the space of the input noise. Moreover, to mitigate the computational cost associated with inverting a fully unrolled diffusion model, we leverage the inherent capability of these models to skip ahead in the forward diffusion process using large time steps. We experimentally validate BIRD on several image restoration tasks and show that it achieves state-of-the-art performance on all of them. Our code is available at https://github.com/hamadichihaoui/BIRD.

Accepted to NeurIPS 2024
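
The "optimize the input noise" idea reads as a short loop. Below is a sketch assuming a differentiable, frozen `sampler` (the accelerated reverse process run with large time steps) and a parameterized `degrade` operator; in the blind setting the kernel parameters are optimized jointly with the noise:

```python
import torch

def bird_restore(z0, y, sampler, degrade, kernel0, steps=200, lr=1e-2):
    """Cast restoration as optimization over the initial noise z0 (and, for
    the blind case, the degradation kernel), with the diffusion model fixed."""
    z = z0.clone().requires_grad_(True)
    k = kernel0.clone().requires_grad_(True)
    opt = torch.optim.Adam([z, k], lr=lr)
    for _ in range(steps):
        x = sampler(z)                         # frozen reverse process
        loss = torch.mean((degrade(x, k) - y) ** 2)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return sampler(z).detach()
```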

Degradation-Aware Residual-Conditioned Optimal Transport for Unified Image Restoration 2024-11-03 Go
Show

All-in-one image restoration has emerged as a practical and promising low-level vision task for real-world applications. In this context, the key issue lies in how to deal with different types of degraded images simultaneously. In this work, we present a Degradation-Aware Residual-Conditioned Optimal Transport (DA-RCOT) approach that models (all-in-one) image restoration as an optimal transport (OT) problem for unpaired and paired settings, introducing the transport residual as a degradation-specific cue for both the transport cost and the transport map. Specifically, we formalize image restoration with a residual-guided OT objective by exploiting the degradation-specific patterns of the Fourier residual in the transport cost. More crucially, we design the transport map for restoration as a two-pass DA-RCOT map, in which the transport residual is computed in the first pass and then encoded as multi-scale residual embeddings to condition the second-pass restoration. This conditioning process injects intrinsic degradation knowledge (e.g., degradation type and level) and structural information from the multi-scale residual embeddings into the OT map, which thereby can dynamically adjust its behaviors for all-in-one restoration. Extensive experiments across five degradations demonstrate the favorable performance of DA-RCOT as compared to state-of-the-art methods, in terms of distortion measures, perceptual quality, and image structure preservation. Notably, DA-RCOT delivers superior adaptability to real-world scenarios even with multiple degradations and shows distinctive robustness to both degradation levels and the number of degradations.

Adversarial Purification and Fine-tuning for Robust UDC Image Restoration 2024-11-01 Go
Show

This study delves into the enhancement of Under-Display Camera (UDC) image restoration models, focusing on their robustness against adversarial attacks. Despite its innovative approach to seamless display integration, UDC technology faces unique image degradation challenges exacerbated by the susceptibility to adversarial perturbations. Our research initially conducts an in-depth robustness evaluation of deep-learning-based UDC image restoration models by employing several white-box and black-box attacking methods. This evaluation is pivotal in understanding the vulnerabilities of current UDC image restoration techniques. Following the assessment, we introduce a defense framework integrating adversarial purification with subsequent fine-tuning processes. First, our approach employs diffusion-based adversarial purification, effectively neutralizing adversarial perturbations. Then, we apply the fine-tuning methodologies to refine the image restoration models further, ensuring that the quality and fidelity of the restored images are maintained. The effectiveness of our proposed approach is validated through extensive experiments, showing marked improvements in resilience against typical adversarial attacks.

Failure to meet expectations

Fast Samplers for Inverse Problems in Iterative Refinement Models 2024-11-01 Go
Show

Constructing fast samplers for unconditional diffusion and flow-matching models has received much attention recently; however, existing methods for solving inverse problems, such as super-resolution, inpainting, or deblurring, still require hundreds to thousands of iterative steps to obtain high-quality results. We propose a plug-and-play framework for constructing efficient samplers for inverse problems, requiring only pre-trained diffusion or flow-matching models. We present Conditional Conjugate Integrators, which leverage the specific form of the inverse problem to project the respective conditional diffusion/flow dynamics into a more amenable space for sampling. Our method complements popular posterior approximation methods for solving inverse problems using diffusion/flow models. We evaluate the proposed method's performance on various linear image restoration tasks across multiple datasets, employing diffusion and flow-matching models. Notably, on challenging inverse problems like 4x super-resolution on the ImageNet dataset, our method can generate high-quality samples in as few as 5 conditional sampling steps and outperforms competing baselines requiring 20-1000 steps. Our code will be publicly available at https://github.com/mandt-lab/c-pigdm

43 pages, NeurIPS'24 Camera Ready

Aquatic-GS: A Hybrid 3D Representation for Underwater Scenes 2024-10-31 Go
Show

Representing underwater 3D scenes is a valuable yet complex task, as attenuation and scattering effects during underwater imaging significantly couple the information of the objects and the water. This coupling presents a significant challenge for existing methods in effectively representing both the objects and the water medium simultaneously. To address this challenge, we propose Aquatic-GS, a hybrid 3D representation approach for underwater scenes that effectively represents both the objects and the water medium. Specifically, we construct a Neural Water Field (NWF) to implicitly model the water parameters, while extending the latest 3D Gaussian Splatting (3DGS) to model the objects explicitly. Both components are integrated through a physics-based underwater image formation model to represent complex underwater scenes. Moreover, to construct more precise scene geometry and details, we design a Depth-Guided Optimization (DGO) mechanism that uses a pseudo-depth map as auxiliary guidance. After optimization, Aquatic-GS enables the rendering of novel underwater viewpoints and supports restoring the true appearance of underwater scenes, as if the water medium were absent. Extensive experiments on both simulated and real-world datasets demonstrate that Aquatic-GS surpasses state-of-the-art underwater 3D representation methods, achieving better rendering quality and real-time rendering performance with a 410x increase in speed. Furthermore, regarding underwater image restoration, Aquatic-GS outperforms representative dewatering methods in color correction, detail recovery, and stability. Our models, code, and datasets can be accessed at https://aquaticgs.github.io.

13 pages, 7 figures
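
The physics-based underwater image formation model referenced above is the standard attenuation-plus-backscatter equation, shown here as a NumPy sketch; per-channel attenuation `beta` and veiling light `B` are simplifying assumptions of this form:

```python
import numpy as np

def underwater_image(J, depth, beta, B):
    """I = J * t + B * (1 - t), with transmission t = exp(-beta * depth).

    J: (H, W, 3) clean radiance, depth: (H, W) scene distance,
    beta: (3,) attenuation coefficients, B: (3,) veiling-light color.
    """
    t = np.exp(-beta[None, None, :] * depth[:, :, None])  # transmission map
    return J * t + B[None, None, :] * (1.0 - t)
```

Restoring the scene's true appearance amounts to inverting this model once the water parameters are estimated, which is the role the Neural Water Field plays.
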
Chasing Better Deep Image Priors between Over- and Under-parameterization 2024-10-31 Go
Show

Deep Neural Networks (DNNs) are well-known to act as over-parameterized deep image priors (DIP) that regularize various image inverse problems. Meanwhile, researchers also proposed extremely compact, under-parameterized image priors (e.g., deep decoder) that are strikingly competent for image restoration too, despite a loss of accuracy. These two extremes push us to think whether there exists a better solution in the middle: between over- and under-parameterized image priors, can one identify "intermediate" parameterized image priors that achieve better trade-offs between performance, efficiency, and even preserving strong transferability? Drawing inspirations from the lottery ticket hypothesis (LTH), we conjecture and study a novel "lottery image prior" (LIP) by exploiting DNN inherent sparsity, stated as: given an over-parameterized DNN-based image prior, it will contain a sparse subnetwork that can be trained in isolation, to match the original DNN's performance when being applied as a prior to various image inverse problems. Our results validate the superiority of LIPs: we can successfully locate the LIP subnetworks from over-parameterized DIPs at substantial sparsity ranges. Those LIP subnetworks significantly outperform deep decoders under comparably compact model sizes (by often fully preserving the effectiveness of their over-parameterized counterparts), and they also possess high transferability across different images as well as restoration task types. Besides, we also extend LIP to compressive sensing image reconstruction, where a pre-trained GAN generator is used as the prior (in contrast to untrained DIP or deep decoder), and confirm its validity in this setting too. To the best of our knowledge, this is the first time that LTH is demonstrated to be relevant in the context of inverse problems or image priors.

Codes are available at https://github.com/VITA-Group/Chasing-Better-DIPs

Cycle-Constrained Adversarial Denoising Convolutional Network for PET Image Denoising: Multi-Dimensional Validation on Large Datasets with Reader Study and Real Low-Dose Data 2024-10-31 Go
Show

Positron emission tomography (PET) is a critical tool for diagnosing tumors and neurological disorders but poses radiation risks to patients, particularly to sensitive populations. While reducing injected radiation dose mitigates this risk, it often compromises image quality. To reconstruct full-dose-quality images from low-dose scans, we propose a Cycle-constrained Adversarial Denoising Convolutional Network (Cycle-DCN). This model integrates a noise predictor, two discriminators, and a consistency network, and is optimized using a combination of supervised loss, adversarial loss, cycle consistency loss, identity loss, and neighboring Structural Similarity Index (SSIM) loss. Experiments were conducted on a large dataset consisting of raw PET brain data from 1,224 patients, acquired using a Siemens Biograph Vision PET/CT scanner. Each patient underwent a 120-second brain scan. To simulate low-dose PET conditions, images were reconstructed from shortened scan durations of 30, 12, and 5 seconds, corresponding to 1/4, 1/10, and 1/24 of the full-dose acquisition, respectively, using a custom-developed GPU-based image reconstruction software. The results show that Cycle-DCN significantly improves average Peak Signal-to-Noise Ratio (PSNR), SSIM, and Normalized Root Mean Square Error (NRMSE) across three dose levels, with improvements of up to 56%, 35%, and 71%, respectively. Additionally, it achieves contrast-to-noise ratio (CNR) and Edge Preservation Index (EPI) values that closely align with full-dose images, effectively preserving image details, tumor shape, and contrast, while resolving issues with blurred edges. The results of reader studies indicated that the images restored by Cycle-DCN consistently received the highest ratings from nuclear medicine physicians, highlighting their strong clinical relevance.

This work has been submitted to the IEEE for possible publication

EnsIR: An Ensemble Algorithm for Image Restoration via Gaussian Mixture Models 2024-10-30 Go
Show

Image restoration has experienced significant advancements due to the development of deep learning. Nevertheless, it encounters challenges related to ill-posed problems, resulting in deviations between single model predictions and ground-truths. Ensemble learning, as a powerful machine learning technique, aims to address these deviations by combining the predictions of multiple base models. Most existing works adopt ensemble learning during the design of restoration models, while only limited research focuses on the inference-stage ensemble of pre-trained restoration models. Regression-based methods fail to enable efficient inference, leading researchers in academia and industry to prefer averaging as their choice for post-training ensemble. To address this, we reformulate the ensemble problem of image restoration into Gaussian mixture models (GMMs) and employ an expectation maximization (EM)-based algorithm to estimate ensemble weights for aggregating prediction candidates. We estimate the range-wise ensemble weights on a reference set and store them in a lookup table (LUT) for efficient ensemble inference on the test set. Our algorithm is model-agnostic and training-free, allowing seamless integration and enhancement of various pre-trained image restoration models. It consistently outperforms regression-based methods and averaging ensemble approaches on 14 benchmarks across 3 image restoration tasks, including super-resolution, deblurring and deraining. The codes and all estimated weights have been released on GitHub.

10 pages for main manuscript, additional 17 pages for appendix, 18 figures, 17MB
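
The EM step for the ensemble weights is small enough to sketch. This assumes a fixed noise level `sigma` and per-pixel predictions on a reference set; the paper's range-wise weights and lookup table are omitted:

```python
import numpy as np

def em_ensemble_weights(preds, target, iters=50, sigma=1.0):
    """Estimate ensemble weights via EM on a Gaussian mixture whose
    components are centered at each model's prediction.

    preds: (M, N) predictions of M models on N reference pixels.
    target: (N,) ground truth. Returns weights summing to one.
    """
    M = preds.shape[0]
    w = np.full(M, 1.0 / M)
    for _ in range(iters):
        # E-step: responsibility of each model for each pixel
        lik = np.exp(-0.5 * ((target[None, :] - preds) / sigma) ** 2)
        r = w[:, None] * lik
        r /= r.sum(axis=0, keepdims=True) + 1e-12
        # M-step: weights are the mean responsibilities
        w = r.mean(axis=1)
    return w

# Ensembled output on new data: x_hat = (w[:, None] * preds_test).sum(axis=0)
```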

DreamClear: High-Capacity Real-World Image Restoration with Privacy-Safe Dataset Curation 2024-10-29 Go
Show

Image restoration (IR) in real-world scenarios presents significant challenges due to the lack of high-capacity models and comprehensive datasets. To tackle these issues, we present a dual strategy: GenIR, an innovative data curation pipeline, and DreamClear, a cutting-edge Diffusion Transformer (DiT)-based image restoration model. GenIR, our pioneering contribution, is a dual-prompt learning pipeline that overcomes the limitations of existing datasets, which typically comprise only a few thousand images and thus offer limited generalizability for larger models. GenIR streamlines the process into three stages: image-text pair construction, dual-prompt based fine-tuning, and data generation & filtering. This approach circumvents the laborious data crawling process, ensuring copyright compliance and providing a cost-effective, privacy-safe solution for IR dataset construction. The result is a large-scale dataset of one million high-quality images. Our second contribution, DreamClear, is a DiT-based image restoration model. It utilizes the generative priors of text-to-image (T2I) diffusion models and the robust perceptual capabilities of multi-modal large language models (MLLMs) to achieve photorealistic restoration. To boost the model's adaptability to diverse real-world degradations, we introduce the Mixture of Adaptive Modulator (MoAM). It employs token-wise degradation priors to dynamically integrate various restoration experts, thereby expanding the range of degradations the model can address. Our exhaustive experiments confirm DreamClear's superior performance, underlining the efficacy of our dual strategy for real-world image restoration. Code and pre-trained models are available at: https://github.com/shallowdream204/DreamClear.

Accepted by NeurIPS 2024

Looks Too Good To Be True: An Information-Theoretic Analysis of Hallucinations in Generative Restoration Models 2024-10-25 Go
Show

The pursuit of high perceptual quality in image restoration has driven the development of revolutionary generative models, capable of producing results often visually indistinguishable from real data. However, as their perceptual quality continues to improve, these models also exhibit a growing tendency to generate hallucinations - realistic-looking details that do not exist in the ground truth images. Hallucinations in these models create uncertainty about their reliability, raising major concerns about their practical application. This paper investigates this phenomenon through the lens of information theory, revealing a fundamental tradeoff between uncertainty and perception. We rigorously analyze the relationship between these two factors, proving that the global minimal uncertainty in generative models grows in tandem with perception. In particular, we define the inherent uncertainty of the restoration problem and show that attaining perfect perceptual quality entails at least twice this uncertainty. Additionally, we establish a relation between distortion, uncertainty and perception, through which we prove the aforementioned uncertainty-perception tradeoff induces the well-known perception-distortion tradeoff. We demonstrate our theoretical findings through experiments with super-resolution and inpainting algorithms. This work uncovers fundamental limitations of generative models in achieving both high perceptual quality and reliable predictions for image restoration. Thus, we aim to raise awareness among practitioners about this inherent tradeoff, empowering them to make informed decisions and potentially prioritize safety over perceptual performance.

D3: Deep Deconvolution Deblurring for Natural Images 2024-10-25 Go
Show

In this paper, we propose to reformulate the blind image deblurring task to directly learn an inverse of the degradation model represented by a deep linear network. We introduce Deep Identity Learning (DIL), a novel learning strategy that includes a dedicated regularization term based on the properties of linear systems, to exploit the identity relation between the degradation and inverse degradation models. The salient aspect of our proposed framework is that it relies on neither a deblurring dataset nor a single input blurry image (e.g. Polyblur, a self-supervised method). This framework detours the typical degradation kernel estimation step involved in most of the existing blind deblurring solutions by the proposition of our Random Kernel Gallery (RKG) dataset. The proposed approach extends our previous Image Super-Resolution (ISR) work, NSSR-DIL, to the image deblurring task. In this work, we updated the regularization term in DIL based on Fourier transform properties of the identity relation, to deliver robust performance across a wide range of degradations. Besides the regularization term, we provide an explicit and compact representation of the learned deep linear network in matrix form, called the Deep Restoration Kernel (DRK), to perform image restoration. Our experiments show that the proposed method outperforms both traditional and deep learning-based deblurring methods, with at least 100x fewer computational resources. The D3 model, in both its LCNN and DRK forms, can be effortlessly extended to the Image Super-Resolution (ISR) task as well, restoring low-resolution images with fine detail. The D3 model and its kernel-form representation (DRK) are lightweight yet robust and restore the blurry input in a fraction of a second.

One-Step Effective Diffusion Network for Real-World Image Super-Resolution 2024-10-24 Go
Show

The pre-trained text-to-image diffusion models have been increasingly employed to tackle the real-world image super-resolution (Real-ISR) problem due to their powerful generative image priors. Most of the existing methods start from random noise to reconstruct the high-quality (HQ) image under the guidance of the given low-quality (LQ) image. While promising results have been achieved, such Real-ISR methods require multiple diffusion steps to reproduce the HQ image, increasing the computational cost. Meanwhile, the random noise introduces uncertainty in the output, which is unfriendly to image restoration tasks. To address these issues, we propose a one-step effective diffusion network, namely OSEDiff, for the Real-ISR problem. We argue that the LQ image contains rich information to restore its HQ counterpart, and hence the given LQ image can be directly taken as the starting point for diffusion, eliminating the uncertainty introduced by random noise sampling. We finetune the pre-trained diffusion network with trainable layers to adapt it to complex image degradations. To ensure that the one-step diffusion model could yield HQ Real-ISR output, we apply variational score distillation in the latent space to conduct KL-divergence regularization. As a result, our OSEDiff model can efficiently and effectively generate HQ images in just one diffusion step. Our experiments demonstrate that OSEDiff achieves comparable or even better Real-ISR results, in terms of both objective metrics and subjective evaluations, than previous diffusion model-based Real-ISR methods that require dozens or hundreds of steps. The source codes are released at https://github.com/cswry/OSEDiff.

Accepted by NeurIPS 2024

Resfusion: Denoising Diffusion Probabilistic Models for Image Restoration Based on Prior Residual Noise 2024-10-24 Go
Show

Recently, research on denoising diffusion models has expanded its application to the field of image restoration. Traditional diffusion-based image restoration methods utilize degraded images as conditional input to effectively guide the reverse generation process, without modifying the original denoising diffusion process. However, since the degraded images already include low-frequency information, starting from Gaussian white noise results in increased sampling steps. We propose Resfusion, a general framework that incorporates the residual term into the diffusion forward process, starting the reverse process directly from the noisy degraded images. The form of our inference process is consistent with the DDPM. We introduce a weighted residual noise, named resnoise, as the prediction target and explicitly provide the quantitative relationship between the residual term and the noise term in resnoise. By leveraging a smooth equivalence transformation, Resfusion determines the optimal acceleration step and maintains the integrity of existing noise schedules, unifying the training and inference processes. The experimental results demonstrate that Resfusion exhibits competitive performance on the ISTD, LOL, and Raindrop datasets with only five sampling steps. Furthermore, Resfusion can be easily applied to image generation and exhibits strong versatility. Our code and model are available at https://github.com/nkicsl/Resfusion.

NeurIPS 2024
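
One way to picture a residual-shifted forward process: the state drifts toward the degraded observation rather than toward pure noise, so the reverse process can start from the noisy degraded image. The sketch below is an illustrative parameterization only, not the paper's exact resnoise formulation:

```python
import torch

def residual_shifted_forward(x0, y, t, alpha_bar):
    """Illustrative forward step interpolating toward the degraded image y.

    x0: clean image, y: degraded observation, alpha_bar: (T,) cumulative
    noise schedule. As alpha_bar[t] -> 0 the state approaches a noisy y.
    """
    a = alpha_bar[t]
    eps = torch.randn_like(x0)
    return a.sqrt() * x0 + (1.0 - a.sqrt()) * y + (1.0 - a).sqrt() * eps
```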

Diffusion Model

Back to Index

Title Date Cool Paper Abstract Comment
PartGen: Part-level 3D Generation and Reconstruction with Multi-View Diffusion Models 2024-12-24 Go
Show

Text- or image-to-3D generators and 3D scanners can now produce 3D assets with high-quality shapes and textures. These assets typically consist of a single, fused representation, like an implicit neural field, a Gaussian mixture, or a mesh, without any useful structure. However, most applications and creative workflows require assets to be made of several meaningful parts that can be manipulated independently. To address this gap, we introduce PartGen, a novel approach that generates 3D objects composed of meaningful parts starting from text, an image, or an unstructured 3D object. First, given multiple views of a 3D object, generated or rendered, a multi-view diffusion model extracts a set of plausible and view-consistent part segmentations, dividing the object into parts. Then, a second multi-view diffusion model takes each part separately, fills in the occlusions, and uses those completed views for 3D reconstruction by feeding them to a 3D reconstruction network. This completion process considers the context of the entire object to ensure that the parts integrate cohesively. The generative completion model can make up for the information missing due to occlusions; in extreme cases, it can hallucinate entirely invisible parts based on the input 3D asset. We evaluate our method on generated and real 3D assets and show that it outperforms segmentation and part-extraction baselines by a large margin. We also showcase downstream applications such as 3D part editing.

Project Page: https://silent-chen.github.io/PartGen/

DrivingGPT: Unifying Driving World Modeling and Planning with Multi-modal Autoregressive Transformers 2024-12-24 Go
Show

World model-based searching and planning are widely recognized as a promising path toward human-level physical intelligence. However, current driving world models primarily rely on video diffusion models, which specialize in visual generation but lack the flexibility to incorporate other modalities like action. In contrast, autoregressive transformers have demonstrated exceptional capability in modeling multimodal data. Our work aims to unify both driving model simulation and trajectory planning into a single sequence modeling problem. We introduce a multimodal driving language based on interleaved image and action tokens, and develop DrivingGPT to learn joint world modeling and planning through standard next-token prediction. Our DrivingGPT demonstrates strong performance in both action-conditioned video generation and end-to-end planning, outperforming strong baselines on large-scale nuPlan and NAVSIM benchmarks.

Explaining in Diffusion: Explaining a Classifier Through Hierarchical Semantics with Text-to-Image Diffusion Models 2024-12-24 Go
Show

Classifiers are important components in many computer vision tasks, serving as the foundational backbone of a wide variety of models employed across diverse applications. However, understanding the decision-making process of classifiers remains a significant challenge. We propose DiffEx, a novel method that leverages the capabilities of text-to-image diffusion models to explain classifier decisions. Unlike traditional GAN-based explainability models, which are limited to simple, single-concept analyses and typically require training a new model for each classifier, our approach can explain classifiers that focus on single concepts (such as faces or animals) as well as those that handle complex scenes involving multiple concepts. DiffEx employs vision-language models to create a hierarchical list of semantics, allowing users to identify not only the overarching semantic influences on classifiers (e.g., the 'beard' semantic in a facial classifier) but also their sub-types, such as 'goatee' or 'Balbo' beard. Our experiments demonstrate that DiffEx is able to cover a significantly broader spectrum of semantics compared to its GAN counterparts, providing a hierarchical tool that delivers a more detailed and fine-grained understanding of classifier decisions.

DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion Transformer for Tuning-Free Multi-Prompt Longer Video Generation 2024-12-24 Go
Show

Sora-like video generation models have achieved remarkable progress with a Multi-Modal Diffusion Transformer (MM-DiT) architecture. However, current video generation models predominantly focus on single prompts, struggling to generate coherent scenes with multiple sequential prompts that better reflect real-world dynamic scenarios. While some pioneering works have explored multi-prompt video generation, they face significant challenges including strict training data requirements, weak prompt following, and unnatural transitions. To address these problems, we propose DiTCtrl, a training-free multi-prompt video generation method under MM-DiT architectures for the first time. Our key idea is to take the multi-prompt video generation task as temporal video editing with smooth transitions. To achieve this goal, we first analyze MM-DiT's attention mechanism, finding that the 3D full attention behaves similarly to that of the cross/self-attention blocks in the UNet-like diffusion models, enabling mask-guided precise semantic control across different prompts with attention sharing for multi-prompt video generation. Based on our careful design, the video generated by DiTCtrl achieves smooth transitions and consistent object motion given multiple sequential prompts without additional training. Besides, we also present MPVBench, a new benchmark specially designed for multi-prompt video generation to evaluate the performance of multi-prompt generation. Extensive experiments demonstrate that our method achieves state-of-the-art performance without additional training.

19 pages, 19 figures, Project page: https://onevfall.github.io/project_page/ditctrl ; GitHub repository: https://github.com/TencentARC/DiTCtrl

LatentCRF: Continuous CRF for Efficient Latent Diffusion 2024-12-24 Go
Show

Latent Diffusion Models (LDMs) produce high-quality, photo-realistic images, however, the latency incurred by multiple costly inference iterations can restrict their applicability. We introduce LatentCRF, a continuous Conditional Random Field (CRF) model, implemented as a neural network layer, that models the spatial and semantic relationships among the latent vectors in the LDM. By replacing some of the computationally-intensive LDM inference iterations with our lightweight LatentCRF, we achieve a superior balance between quality, speed and diversity. We increase inference efficiency by 33% with no loss in image quality or diversity compared to the full LDM. LatentCRF is an easy add-on, which does not require modifying the LDM.

Resolution-Robust 3D MRI Reconstruction with 2D Diffusion Priors: Diverse-Resolution Training Outperforms Interpolation 2024-12-24 Go
Show

Deep learning-based 3D imaging, in particular magnetic resonance imaging (MRI), is challenging because of limited availability of 3D training data. Therefore, 2D diffusion models trained on 2D slices are starting to be leveraged for 3D MRI reconstruction. However, as we show in this paper, existing methods pertain to a fixed voxel size, and performance degrades when the voxel size is varied, as it is often the case in clinical practice. In this paper, we propose and study several approaches for resolution-robust 3D MRI reconstruction with 2D diffusion priors. As a result of this investigation, we obtain a simple resolution-robust variational 3D reconstruction approach based on diffusion-guided regularization of randomly sampled 2D slices. This method provides competitive reconstruction quality compared to posterior sampling baselines. Towards resolving the sensitivity to resolution-shifts, we investigate state-of-the-art model-based approaches including Gaussian splatting, neural representations, and infinite-dimensional diffusion models, as well as a simple data-centric approach of training the diffusion model on several resolutions. Our experiments demonstrate that the model-based approaches fail to close the performance gap in 3D MRI. In contrast, the data-centric approach of training the diffusion model on various resolutions effectively provides a resolution-robust method without compromising accuracy.

3DEnhancer: Consistent Multi-View Diffusion for 3D Enhancement 2024-12-24 Go
Show

Despite advances in neural rendering, due to the scarcity of high-quality 3D datasets and the inherent limitations of multi-view diffusion models, view synthesis and 3D model generation are restricted to low resolutions with suboptimal multi-view consistency. In this study, we present a novel 3D enhancement pipeline, dubbed 3DEnhancer, which employs a multi-view latent diffusion model to enhance coarse 3D inputs while preserving multi-view consistency. Our method includes a pose-aware encoder and a diffusion-based denoiser to refine low-quality multi-view images, along with data augmentation and a multi-view attention module with epipolar aggregation to maintain consistent, high-quality 3D outputs across views. Unlike existing video-based approaches, our model supports seamless multi-view enhancement with improved coherence across diverse viewing angles. Extensive evaluations show that 3DEnhancer significantly outperforms existing methods, boosting both multi-view enhancement and per-instance 3D optimization tasks.

Project page: https://yihangluo.com/projects/3DEnhancer

Fashionability-Enhancing Outfit Image Editing with Conditional Diffusion Models 2024-12-24 Go
Show

Image generation in the fashion domain has predominantly focused on preserving body characteristics or following input prompts, but little attention has been paid to improving the inherent fashionability of the output images. This paper presents a novel diffusion model-based approach that generates fashion images with improved fashionability while maintaining control over key attributes. Key components of our method include: 1) fashionability enhancement, which ensures that the generated images are more fashionable than the input; 2) preservation of body characteristics, encouraging the generated images to maintain the original shape and proportions of the input; and 3) automatic fashion optimization, which does not rely on manual input or external prompts. We also employ two methods to collect training data for guidance while generating and evaluating the images. In particular, we rate outfit images using fashionability scores annotated by multiple fashion experts through OpenSkill-based and five critical aspect-based pairwise comparisons. These methods provide complementary perspectives for assessing and improving the fashionability of the generated images. The experimental results show that our approach outperforms the baseline Fashion++ in generating images with superior fashionability, demonstrating its effectiveness in producing more stylish and appealing fashion images.

11 pages, 6 figures
Discovery of 2D Materials via Symmetry-Constrained Diffusion Model 2024-12-24 Go
Show

Generative models for 2D materials have shown significant promise in accelerating the material discovery process. The stability and performance of these materials are strongly influenced by their underlying symmetry. However, existing generative models for 2D materials often neglect symmetry constraints, which limits both the diversity and quality of the generated structures. Here, we introduce a symmetry-constrained diffusion model (SCDM) that integrates space group symmetry into the generative process. By incorporating Wyckoff positions, the model ensures adherence to symmetry principles, leading to the generation of 2,000 candidate structures. DFT calculations were conducted to evaluate the convex hull energies of these structures after structural relaxation. From the generated samples, 843 materials that met the energy stability criteria (Ehull < 0.6 eV/atom) were identified. Among these, six candidates were selected for further stability analysis, including phonon band structure evaluations and electronic properties investigations, all of which exhibited phonon spectrum stability. To benchmark the performance of SCDM, a symmetry-unconstrained diffusion model was also evaluated via a crystal structure prediction model. The results highlight that incorporating symmetry constraints enhances the effectiveness of generated 2D materials, contributing to the discovery of 2D materials through generative modeling.

Zero-Shot Conditioning of Score-Based Diffusion Models by Neuro-Symbolic Constraints 2024-12-24 Go
Show

Score-based diffusion models have emerged as effective approaches for both conditional and unconditional generation. Still, conditional generation relies either on training a specific conditional model or on classifier guidance, which requires training a noise-dependent classifier even when a classifier for uncorrupted data is available. We propose a method that, given a pre-trained unconditional score-based generative model, samples from the conditional distribution under arbitrary logical constraints, without requiring additional training. Unlike other zero-shot techniques, which aim merely at generating valid conditional samples, our method is designed to approximate the true conditional distribution. First, we show how to manipulate the learned score in order to sample from an un-normalized distribution conditioned on a user-defined constraint. Then, we define a flexible and numerically stable neuro-symbolic framework for encoding soft logical constraints. Combining these two ingredients, we obtain a general, though approximate, conditional sampling algorithm. We further develop effective heuristics aimed at improving the approximation. Finally, we show the effectiveness of our approach in approximating conditional distributions for various types of constraints and data: tabular data, images, and time series.
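
The score-manipulation step described above is easy to illustrate. Below is a minimal PyTorch sketch in which the pretrained unconditional score is shifted by the gradient of a soft-constraint log-weight, which corresponds to sampling from an un-normalized conditional distribution. The toy score model, the constraint function, and all names are assumptions for illustration; the paper's neuro-symbolic constraint encoding is considerably richer.

```python
import torch

def constrained_score(score_model, x, t, constraint_logprob, guidance_scale=1.0):
    """Steer a pretrained unconditional score with the gradient of a soft
    constraint, i.e. target p(x) * exp(constraint(x)) up to normalization.
    `constraint_logprob` maps x to a per-sample log-weight.
    (Illustrative sketch only; not the paper's exact algorithm.)"""
    x = x.detach().requires_grad_(True)
    logw = constraint_logprob(x).sum()          # scalar for autograd
    grad = torch.autograd.grad(logw, x)[0]      # d/dx of the constraint log-weight
    return score_model(x, t) + guidance_scale * grad

# toy usage: a standard-normal "score model" and a soft constraint x[0] > 0
score_model = lambda x, t: -x
constraint = lambda x: torch.nn.functional.logsigmoid(5.0 * x[..., 0])
x = torch.randn(4, 2)
print(constrained_score(score_model, x, torch.tensor(0.5), constraint).shape)
```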

FameBias: Embedding Manipulation Bias Attack in Text-to-Image Models 2024-12-24 Go
Show

Text-to-Image (T2I) diffusion models have rapidly advanced, enabling the generation of high-quality images that align closely with textual descriptions. However, this progress has also raised concerns about their misuse for propaganda and other malicious activities. Recent studies reveal that attackers can embed biases into these models through simple fine-tuning, causing them to generate targeted imagery when triggered by specific phrases. This underscores the potential for T2I models to act as tools for disseminating propaganda, producing images aligned with an attacker's objective for end-users. Building on this concept, we introduce FameBias, a T2I biasing attack that manipulates the embeddings of input prompts to generate images featuring specific public figures. Unlike prior methods, FameBias operates solely on the input embedding vectors without requiring additional model training. We evaluate FameBias comprehensively using Stable Diffusion V2, generating a large corpus of images based on various trigger nouns and target public figures. Our experiments demonstrate that FameBias achieves a high attack success rate while preserving the semantic context of the original prompts across multiple trigger-target pairs.
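
As a rough intuition for embedding-level biasing, the toy sketch below blends the embedding rows at assumed trigger-token positions toward a target subject's embedding. Every name, position, and shape here is hypothetical, and FameBias's actual manipulation may differ.

```python
import torch

def bias_prompt_embeddings(prompt_emb, trigger_positions, target_emb, alpha=1.0):
    """Toy sketch of an embedding-manipulation bias attack: rows of the prompt
    embedding at the trigger-token positions are blended toward a target
    subject's embedding. No model fine-tuning is involved.
    (Hypothetical helper; the paper's exact manipulation may differ.)"""
    biased = prompt_emb.clone()
    for pos in trigger_positions:
        biased[pos] = (1 - alpha) * prompt_emb[pos] + alpha * target_emb
    return biased

# toy usage with random stand-ins for CLIP-style token embeddings
seq_len, dim = 77, 768
prompt_emb = torch.randn(seq_len, dim)
target_emb = torch.randn(dim)          # embedding of the target public figure
print(bias_prompt_embeddings(prompt_emb, [3, 4], target_emb).shape)
```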

BS-LDM: Effective Bone Suppression in High-Resolution Chest X-Ray Images with Conditional Latent Diffusion Models 2024-12-24 Go
Show

The interference of overlapping bones and pulmonary structures can reduce the effectiveness of Chest X-ray (CXR) examinations. Bone suppression techniques have been developed to improve diagnostic accuracy. Dual-energy subtraction (DES) imaging, a common method for bone suppression, is costly and exposes patients to higher radiation levels. Deep learning-based image generation methods have been proposed as alternatives; however, they often fail to produce high-quality and high-resolution images, resulting in the loss of critical lesion information and texture details. To address these issues, in this paper, we introduce an end-to-end framework for bone suppression in high-resolution CXR images, termed BS-LDM. This framework employs a conditional latent diffusion model to generate high-resolution soft tissue images with fine detail and critical lung pathology by performing bone suppression in the latent space. We implement offset noise during the noise addition phase of the training process to better render low-frequency information in soft tissue images. Additionally, we introduce a dynamic clipping strategy during the sampling process to refine pixel intensity in the generated soft tissue images. We compiled a substantial and high-quality bone suppression dataset, SZCH-X-Rays, including high-resolution paired CXR and DES soft tissue images from 818 patients, collected from our partner hospitals. Moreover, we pre-processed 241 pairs of CXR and DES soft tissue images from the JSRT dataset, the largest publicly available dataset. Comprehensive experimental and clinical evaluations demonstrate that BS-LDM exhibits superior bone suppression capabilities, highlighting its significant clinical potential.
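
Offset noise, credited above for better low-frequency rendering, is a small training-time change: the Gaussian noise is given a per-channel constant shift. A minimal sketch follows, with a commonly used strength value that is not necessarily the one used in BS-LDM.

```python
import torch

def offset_noise(latents, strength=0.1):
    """Offset noise as commonly used in diffusion training: standard Gaussian
    noise plus a per-channel constant shift, which helps the model render
    low-frequency (brightness-like) content. `strength` is a typical default,
    assumed here rather than taken from the paper."""
    b, c, h, w = latents.shape
    noise = torch.randn_like(latents)
    noise += strength * torch.randn(b, c, 1, 1, device=latents.device)
    return noise

latents = torch.randn(2, 4, 64, 64)
print(offset_noise(latents).shape)  # torch.Size([2, 4, 64, 64])
```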

9 pages, 6 figures
GDM4MMIMO: Generative Diffusion Models for Massive MIMO Communications 2024-12-24 Go
Show

Massive multiple-input multiple-output (MIMO) offers significant advantages in spectral and energy efficiencies, positioning it as a cornerstone technology of fifth-generation (5G) wireless communication systems and a promising solution for the burgeoning data demands anticipated in sixth-generation (6G) networks. In recent years, with the continuous advancement of artificial intelligence (AI), a multitude of task-oriented generative foundation models (GFMs) have emerged, achieving remarkable performance in various fields such as computer vision (CV), natural language processing (NLP), and autonomous driving. As a pioneering force, these models are driving the paradigm shift in AI towards generative AI (GenAI). Among them, the generative diffusion model (GDM), as one of the state-of-the-art families of generative models, demonstrates an exceptional capability to learn implicit prior knowledge and robust generalization capabilities, thereby enhancing its versatility and effectiveness across diverse applications. In this paper, we delve into the potential applications of GDM in massive MIMO communications. Specifically, we first provide an overview of massive MIMO communication, the framework of GFMs, and the working mechanism of GDM. Following this, we discuss recent research advancements in the field and present a case study of near-field channel estimation based on GDM, demonstrating its promising potential for facilitating efficient ultra-dimensional channel state information (CSI) acquisition in the context of massive MIMO communications. Finally, we highlight several pressing challenges in future mobile communications and identify promising research directions surrounding GDM.

6 pages, 3 figures
ConSinger: Efficient High-Fidelity Singing Voice Generation with Minimal Steps 2024-12-24 Go
Show

A singing voice synthesis (SVS) system is expected to generate high-fidelity singing voice from given music scores (lyrics, duration, and pitch). Recently, diffusion models have performed well in this field; however, sacrificing inference speed in exchange for high-quality sample generation limits their application scenarios. To obtain high-quality synthetic singing voice more efficiently, we propose ConSinger, a singing voice synthesis method based on the consistency model that achieves high-fidelity synthesis in minimal steps. The model is trained by applying a consistency constraint, and the generation quality is greatly improved at only a small cost in inference speed. Our experiments show that ConSinger is highly competitive with the baseline model in terms of generation speed and quality. Audio samples are available at https://keylxiao.github.io/consinger.
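
The consistency constraint can be stated compactly: a student network evaluated at a noisier point of a trajectory is regressed onto an EMA teacher evaluated at the adjacent, less noisy point. A generic PyTorch sketch follows; ConSinger's noise schedule, score conditioning, and other specifics are omitted, and the interfaces are assumptions.

```python
import torch
import torch.nn.functional as F

def consistency_loss(model, ema_model, x0, t_next, t_cur, sigma):
    """Minimal sketch of a consistency constraint: the model's output on a
    noisier point should match the EMA teacher's output on an adjacent,
    less-noisy point of the same trajectory. `sigma(t)` is the noise
    schedule; all details here are illustrative."""
    noise = torch.randn_like(x0)
    x_next = x0 + sigma(t_next) * noise   # same noise, two adjacent times
    x_cur = x0 + sigma(t_cur) * noise
    pred = model(x_next, t_next)
    with torch.no_grad():
        target = ema_model(x_cur, t_cur)  # stop-gradient EMA target
    return F.mse_loss(pred, target)

# toy usage with a trivial "model" that ignores time
net = lambda x, t: x * 0.9
sigma = lambda t: t
print(consistency_loss(net, net, torch.randn(8, 16),
                       torch.tensor(1.0), torch.tensor(0.9), sigma))
```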

Singing voice synthesis, Consistency models, Shallow Diffusion Mechanism; Accepted by ICASSP 2025

Variational Diffusion Posterior Sampling with Midpoint Guidance 2024-12-24 Go
Show

Diffusion models have recently shown considerable potential in solving Bayesian inverse problems when used as priors. However, sampling from the resulting denoising posterior distributions remains a challenge as it involves intractable terms. To tackle this issue, state-of-the-art approaches formulate the problem as that of sampling from a surrogate diffusion model targeting the posterior and decompose its scores into two terms: the prior score and an intractable guidance term. While the former is replaced by the pre-trained score of the considered diffusion model, the guidance term has to be estimated. In this paper, we propose a novel approach that utilises a decomposition of the transitions which, in contrast to previous methods, allows a trade-off between the complexity of the intractable guidance term and that of the prior transitions. We validate the proposed approach through extensive experiments on linear and nonlinear inverse problems, including challenging cases with latent diffusion models as priors. We then demonstrate its applicability to various modalities and its promising impact on public health by tackling cardiovascular disease diagnosis through the reconstruction of incomplete electrocardiograms. The code is publicly available at \url{https://github.com/yazidjanati/mgps}.
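
For orientation, the prior-score-plus-guidance decomposition this line of work starts from can be sketched generically. Below is a toy, DPS-style posterior score for a linear-Gaussian inverse problem; it is not the paper's midpoint decomposition, which trades off the complexity of exactly this guidance term against that of the prior transitions. All names and the Tweedie-like denoised estimate are assumptions.

```python
import torch

def guided_score(prior_score, x, t, y, forward_op, noise_std, scale=1.0):
    """Generic posterior-score sketch: posterior score ~ prior score +
    grad_x log p(y | x_t), with the intractable likelihood approximated
    around a crude one-step denoised estimate. Illustrative only."""
    x = x.detach().requires_grad_(True)
    s = prior_score(x, t)
    x0_hat = x + (noise_std ** 2) * s          # Tweedie-like denoised estimate
    residual = y - forward_op(x0_hat)
    loglik = -(residual ** 2).sum() / (2 * noise_std ** 2)
    guidance = torch.autograd.grad(loglik, x)[0]
    return s + scale * guidance

# toy usage: standard-normal prior, measurement = first coordinate
prior = lambda x, t: -x
A = lambda x: x[..., :1]
x, y = torch.randn(4, 2), torch.randn(4, 1)
print(guided_score(prior, x, 0.5, y, A, noise_std=0.5).shape)
```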

Singular Value Scaling: Efficient Generative Model Compression via Pruned Weights Refinement 2024-12-24 Go
Show

While pruning methods effectively maintain model performance without extra training costs, they often focus solely on preserving crucial connections, overlooking the impact of pruned weights on subsequent fine-tuning or distillation, leading to inefficiencies. Moreover, most compression techniques for generative models have been developed primarily for GANs, tailored to specific architectures like StyleGAN, and research into compressing Diffusion models has just begun. Furthermore, these methods are typically applicable only to either GANs or Diffusion models, highlighting the need for approaches that work across both model types. In this paper, we introduce Singular Value Scaling (SVS), a versatile technique for refining pruned weights, applicable to both model types. Our analysis reveals that pruned weights often exhibit dominant singular vectors, hindering fine-tuning efficiency and leading to suboptimal performance compared to random initialization. Our method enhances weight initialization by minimizing the disparities between singular values of pruned weights, thereby improving the fine-tuning process. This approach not only guides the compressed model toward superior solutions but also significantly speeds up fine-tuning. Extensive experiments on StyleGAN2, StyleGAN3 and DDPM demonstrate that SVS improves compression performance across model types without additional training costs. Our code is available at: https://github.com/LAIT-CVLab/Singular-Value-Scaling.
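
The core refinement, rescaling the singular values of a pruned weight so that no few directions dominate, takes only a few lines. A hedged PyTorch sketch follows; the interpolation rule and its `alpha` parameter are assumptions, and the paper's exact scaling may differ.

```python
import torch

def singular_value_scaling(w_pruned, alpha=0.5):
    """Sketch of refining a pruned weight matrix by compressing the spread
    of its singular values toward their mean, reducing the dominance of a
    few singular directions before fine-tuning. `alpha` interpolates between
    the original spectrum (1.0) and a flat spectrum (0.0); assumed here."""
    u, s, vh = torch.linalg.svd(w_pruned, full_matrices=False)
    s_scaled = alpha * s + (1 - alpha) * s.mean()
    return u @ torch.diag(s_scaled) @ vh

w = torch.randn(256, 128)
w[torch.rand_like(w) < 0.8] = 0.0   # crude stand-in for pruning
print(singular_value_scaling(w).shape)  # torch.Size([256, 128])
```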

Accepted to AAAI 2025

Schrödinger Bridge Type Diffusion Models as an Extension of Variational Autoencoders 2024-12-24 Go
Show

Generative diffusion models use time-forward and backward stochastic differential equations to connect the data and prior distributions. While conventional diffusion models (e.g., score-based models) only learn the backward process, more flexible frameworks have been proposed to also learn the forward process by employing the Schrödinger bridge (SB). However, due to the complexity of the mathematical structure behind SB-type models, we cannot easily give an intuitive understanding of their objective function. In this work, we propose a unified framework to construct diffusion models by reinterpreting SB-type models as an extension of variational autoencoders. In this context, the data processing inequality plays a crucial role. As a result, we find that the objective function consists of a prior loss and a drift-matching part.
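
One hedged reading of that decomposition, with all notation assumed here rather than taken from the paper (forward drift f, backward drift b, diffusion coefficient sigma), could be written as:

```latex
% A plausible form of "prior loss + drift matching"; notation assumed.
% Forward SDE: dX_t = f_\theta(X_t, t)\,dt + \sigma\,dW_t, backward drift b_\phi.
\mathcal{L}(\theta, \phi)
  = \underbrace{D_{\mathrm{KL}}\big( q_T^{\theta} \,\|\, \pi_{\mathrm{prior}} \big)}_{\text{prior loss}}
  + \underbrace{\mathbb{E} \int_0^T \frac{1}{2\sigma^2}
      \big\| f_{\theta}(X_t, t) - b_{\phi}(X_t, t) \big\|^2 \, dt}_{\text{drift matching}}
```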

Expand VSR Benchmark for VLLM to Expertize in Spatial Rules 2024-12-24 Go
Show

Distinguishing spatial relations is a basic part of human cognition that requires fine-grained perception across instances. Although benchmarks like MME, MMBench, and SEED have comprehensively evaluated various capabilities, including visual spatial reasoning (VSR), there is still a lack of sufficient-quantity, high-quality evaluation and optimization datasets for Vision Large Language Models (VLLMs) specifically targeting visual positional reasoning. To handle this, we first diagnosed current VLLMs with the VSR dataset and proposed a unified test set. We found current VLLMs to exhibit a contradiction: over-sensitivity to language instructions and under-sensitivity to visual positional information. By expanding the original benchmark in two aspects, tuning data and model structure, we mitigated this phenomenon. To our knowledge, we are the first to controllably expand spatially positioned image data using diffusion models, and we integrated the original visual encoding (CLIP) with three other powerful visual encoders (SigLIP, SAM, and DINO). After conducting combination experiments on scaling data and models, we obtained a VLLM VSR Expert (VSRE) that not only generalizes better to different instructions but also accurately distinguishes differences in visual positional information. VSRE achieved over a 27% increase in accuracy on the VSR test set. It becomes a performant VLLM on the position reasoning of both the VSR dataset and relevant subsets of other evaluation benchmarks. We open-sourced the expanded model with data and the Appendix at https://github.com/peijin360/vsre and hope it will accelerate advancements in VLLMs on VSR learning.

Accelerating AIGC Services with Latent Action Diffusion Scheduling in Edge Networks 2024-12-24 Go
Show

Artificial Intelligence Generated Content (AIGC) has gained significant popularity for creating diverse content. Current AIGC models primarily focus on content quality within a centralized framework, resulting in high service delay and negative user experiences. However, not only does the workload of an AIGC task depend on the AIGC model's complexity rather than the amount of data, but the large model and its multi-layer encoder structure also create a huge demand for computational and memory resources. These unique characteristics pose new challenges for its modeling, deployment, and scheduling at edge networks. Thus, we model an offloading problem among edge servers for providing real AIGC services and propose LAD-TS, a novel Latent Action Diffusion-based Task Scheduling method that orchestrates multiple edge servers for expedited AIGC services. LAD-TS generates a near-optimal offloading decision by leveraging the diffusion model's conditional generation capability and reinforcement learning's environment-interaction ability, thereby minimizing service delays under multiple resource constraints. Meanwhile, a latent action diffusion strategy is designed to guide decision generation by utilizing historical action probabilities, enabling rapid achievement of near-optimal decisions. Furthermore, we develop DEdgeAI, a prototype edge system with a refined AIGC model deployment to implement and evaluate our LAD-TS method. DEdgeAI provides a real AIGC service for users, demonstrating up to 29.18% shorter service delays than five representative current AIGC platforms. We release our open-source code at https://github.com/ChangfuXu/DEdgeAI/.

Under review
Adversarial Score identity Distillation: Rapidly Surpassing the Teacher in One Step 2024-12-24 Go
Show

Score identity Distillation (SiD) is a data-free method that has achieved SOTA performance in image generation by leveraging only a pretrained diffusion model, without requiring any training data. However, its ultimate performance is constrained by how accurately the pretrained model captures the true data scores at different stages of the diffusion process. In this paper, we introduce SiDA (SiD with Adversarial Loss), which not only enhances generation quality but also improves distillation efficiency by incorporating real images and adversarial loss. SiDA utilizes the encoder from the generator's score network as a discriminator, allowing it to distinguish between real images and those generated by SiD. The adversarial loss is batch-normalized within each GPU and then combined with the original SiD loss. This integration effectively incorporates the average "fakeness" per GPU batch into the pixel-based SiD loss, enabling SiDA to distill a single-step generator. SiDA converges significantly faster than its predecessor when distilled from scratch, and swiftly improves upon the original model's performance during fine-tuning from a pre-distilled SiD generator. This one-step adversarial distillation method establishes new benchmarks in generation performance when distilling EDM diffusion models, achieving an FID score of 1.110 on ImageNet 64x64. When distilling EDM2 models trained on ImageNet 512x512, our SiDA method surpasses even the largest teacher model, EDM2-XXL, which achieved an FID of 1.81 using classifier-free guidance (CFG) and 63 generation steps. In contrast, SiDA achieves FID scores of 2.156 for size XS, 1.669 for S, 1.488 for M, 1.413 for L, 1.379 for XL, and 1.366 for XXL, all without CFG and in a single generation step. These results highlight substantial improvements across all model sizes. Our code is available at https://github.com/mingyuanzhou/SiD/tree/sida.

10 pages (main text), 34 figures, and 10 tables

Stochastic Control for Fine-tuning Diffusion Models: Optimality, Regularity, and Convergence 2024-12-24 Go
Show

Diffusion models have emerged as powerful tools for generative modeling, demonstrating exceptional capability in capturing target data distributions from large datasets. However, fine-tuning these massive models for specific downstream tasks, constraints, and human preferences remains a critical challenge. While recent advances have leveraged reinforcement learning algorithms to tackle this problem, much of the progress has been empirical, with limited theoretical understanding. To bridge this gap, we propose a stochastic control framework for fine-tuning diffusion models. Building on denoising diffusion probabilistic models as the pre-trained reference dynamics, our approach integrates linear dynamics control with Kullback-Leibler regularization. We establish the well-posedness and regularity of the stochastic control problem and develop a policy iteration algorithm (PI-FT) for its numerical solution. We show that PI-FT achieves global convergence at a linear rate. Unlike existing work that assumes regularity throughout training, we prove that the control and value sequences generated by the algorithm maintain this regularity. Additionally, we explore extensions of our framework to parametric settings and continuous-time formulations.

28 pages
Dense-Face: Personalized Face Generation Model via Dense Annotation Prediction 2024-12-24 Go
Show

Text-to-image (T2I) personalization diffusion models can generate images of a novel concept based on a user-provided text caption. However, existing T2I personalization methods either require test-time fine-tuning or fail to generate images that align well with the given text caption. In this work, we propose a new T2I personalization diffusion model, Dense-Face, which can generate face images with an identity consistent with the given reference subject that also align well with the text caption. Specifically, we introduce a pose-controllable adapter for high-fidelity image generation while maintaining the text-based editing ability of pre-trained Stable Diffusion (SD). Additionally, we use internal features of the SD UNet to predict dense face annotations, enabling the proposed method to gain domain knowledge in face generation. Empirically, our method achieves state-of-the-art or competitive generation performance in image-text alignment, identity preservation, and pose control.

15 figures, 5 tables
Ensuring Consistency for In-Image Translation 2024-12-24 Go
Show

The in-image machine translation task involves translating text embedded within images, with the translated results presented in image format. While this task has numerous applications in various scenarios such as film poster translation and everyday scene image translation, existing methods frequently neglect the aspect of consistency throughout this process. We propose the need to uphold two types of consistency in this task: translation consistency and image generation consistency. The former entails incorporating image information during translation, while the latter involves maintaining consistency between the style of the text-image and the original image, ensuring background integrity. To address these consistency requirements, we introduce a novel two-stage framework named HCIIT (High-Consistency In-Image Translation) which involves text-image translation using a multimodal multilingual large language model in the first stage and image backfilling with a diffusion model in the second stage. Chain of thought learning is utilized in the first stage to enhance the model's ability to leverage image information during translation. Subsequently, a diffusion model trained for style-consistent text-image generation ensures uniformity in text style within images and preserves background details. A dataset comprising 400,000 style-consistent pseudo text-image pairs is curated for model training. Results obtained on both curated test sets and authentic image test sets validate the effectiveness of our framework in ensuring consistency and producing high-quality translated images.

A Phase Transition in Diffusion Models Reveals the Hierarchical Nature of Data 2024-12-24 Go
Show

Understanding the structure of real data is paramount in advancing modern deep-learning methodologies. Natural data such as images are believed to be composed of features organized in a hierarchical and combinatorial manner, which neural networks capture during learning. Recent advancements show that diffusion models can generate high-quality images, hinting at their ability to capture this underlying compositional structure. We study this phenomenon in a hierarchical generative model of data. We find that the backward diffusion process acting after a time $t$ is governed by a phase transition at some threshold time, where the probability of reconstructing high-level features, like the class of an image, suddenly drops. Instead, the reconstruction of low-level features, such as specific details of an image, evolves smoothly across the whole diffusion process. This result implies that at times beyond the transition, the class has changed, but the generated sample may still be composed of low-level elements of the initial image. We validate these theoretical insights through numerical experiments on class-unconditional ImageNet diffusion models. Our analysis characterizes the relationship between time and scale in diffusion models and puts forward generative models as powerful tools to model combinatorial data properties.

9 pages, 7 figures. Appendix: 11 pages, 9 figures

Multi-Agent Path Finding in Continuous Spaces with Projected Diffusion Models 2024-12-23 Go
Show

Multi-Agent Path Finding (MAPF) is a fundamental problem in robotics, requiring the computation of collision-free paths for multiple agents moving from their respective start to goal positions. Coordinating multiple agents in a shared environment poses significant challenges, especially in continuous spaces where traditional optimization algorithms struggle with scalability. Moreover, these algorithms often depend on discretized representations of the environment, which can be impractical in image-based or high-dimensional settings. Recently, diffusion models have shown promise in single-agent path planning, capturing complex trajectory distributions and generating smooth paths that navigate continuous, high-dimensional spaces. However, directly extending diffusion models to MAPF introduces new challenges since these models struggle to ensure constraint feasibility, such as inter-agent collision avoidance. To overcome this limitation, this work proposes a novel approach that integrates constrained optimization with diffusion models for MAPF in continuous spaces. This unique combination directly produces feasible multi-agent trajectories that respect collision avoidance and kinematic constraints. The effectiveness of our approach is demonstrated across various challenging simulated scenarios of varying dimensionality.
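
The integration pattern, alternating an ordinary reverse-diffusion step with a feasibility projection, can be sketched generically. The toy projection below only enforces a minimum inter-agent distance and stands in for the paper's constrained-optimization step; all names, shapes, and parameters are assumptions.

```python
import torch

def project_min_distance(traj, d_min=1.0, iters=20):
    """Toy projection: push apart any pair of agents closer than `d_min`
    at the same timestep. `traj` has shape (agents, timesteps, 2). A crude
    stand-in for a proper constrained-optimization projection."""
    x = traj.clone()
    for _ in range(iters):
        diff = x[:, None] - x[None, :]                        # (A, A, T, 2)
        dist = diff.norm(dim=-1, keepdim=True).clamp_min(1e-6)
        viol = (dist < d_min) & ~torch.eye(x.shape[0], dtype=torch.bool)[..., None, None]
        push = torch.where(viol, 0.5 * (d_min - dist) * diff / dist,
                           torch.zeros_like(diff))
        x = x + push.sum(dim=1)                               # net push per agent
    return x

def denoise_with_projection(denoise_step, x_t, timesteps):
    """Sketch: alternate a model denoising step with a feasibility projection."""
    for t in timesteps:
        x_t = denoise_step(x_t, t)        # ordinary diffusion reverse step
        x_t = project_min_distance(x_t)   # re-impose inter-agent clearance
    return x_t

# toy usage: 3 agents, 10 timesteps, a trivial "denoiser"
traj = torch.randn(3, 10, 2)
step = lambda x, t: x * 0.98
print(denoise_with_projection(step, traj, range(50)).shape)
```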

Label-Efficient Data Augmentation with Video Diffusion Models for Guidewire Segmentation in Cardiac Fluoroscopy 2024-12-23 Go
Show

The accurate segmentation of guidewires in interventional cardiac fluoroscopy videos is crucial for computer-aided navigation tasks. Although deep learning methods have demonstrated high accuracy and robustness in wire segmentation, they require substantial annotated datasets for generalizability, underscoring the need for extensive labeled data to enhance model performance. To address this challenge, we propose the Segmentation-guided Frame-consistency Video Diffusion Model (SF-VD) to generate large collections of labeled fluoroscopy videos, augmenting the training data for wire segmentation networks. SF-VD leverages videos with limited annotations by independently modeling scene distribution and motion distribution. It first samples the scene distribution by generating 2D fluoroscopy images with wires positioned according to a specified input mask, and then samples the motion distribution by progressively generating subsequent frames, ensuring frame-to-frame coherence through a frame-consistency strategy. A segmentation-guided mechanism further refines the process by adjusting wire contrast, ensuring a diverse range of visibility in the synthesized image. Evaluation on a fluoroscopy dataset confirms the superior quality of the generated videos and shows significant improvements in guidewire segmentation.

AAAI 2025
EraseDraw: Learning to Draw Step-by-Step via Erasing Objects from Images 2024-12-23 Go
Show

Creative processes such as painting often involve creating different components of an image one by one. Can we build a computational model to perform this task? Prior works often fail by making global changes to the image, inserting objects in unrealistic spatial locations, and generating inaccurate lighting details. We observe that while state-of-the-art models perform poorly on object insertion, they can remove objects and erase the background in natural images very well. Inverting the direction of object removal, we obtain high-quality data for learning to insert objects that are spatially, physically, and optically consistent with the surroundings. With this scalable automatic data generation pipeline, we can create a dataset for learning object insertion, which is used to train our proposed text-conditioned diffusion model. Qualitative and quantitative experiments have shown that our model achieves state-of-the-art results in object insertion, particularly for in-the-wild images. We show compelling results on diverse insertion prompts and images across various domains. In addition, we automate iterative insertion by combining our insertion model with beam search guided by CLIP.

Causal Composition Diffusion Model for Closed-loop Traffic Generation 2024-12-23 Go
Show

Simulation is critical for safety evaluation in autonomous driving, particularly in capturing complex interactive behaviors. However, generating realistic and controllable traffic scenarios in long-tail situations remains a significant challenge. Existing generative models suffer from the conflicting objective between user-defined controllability and realism constraints, which is amplified in safety-critical contexts. In this work, we introduce the Causal Compositional Diffusion Model (CCDiff), a structure-guided diffusion framework to address these challenges. We first formulate the learning of controllable and realistic closed-loop simulation as a constrained optimization problem. Then, CCDiff maximizes controllability while adhering to realism by automatically identifying and injecting causal structures directly into the diffusion process, providing structured guidance to enhance both realism and controllability. Through rigorous evaluations on benchmark datasets and in a closed-loop simulator, CCDiff demonstrates substantial gains over state-of-the-art approaches in generating realistic and user-preferred trajectories. Our results show CCDiff's effectiveness in extracting and leveraging causal structures, showing improved closed-loop performance based on key metrics such as collision rate, off-road rate, FDE, and comfort.

FaceLift: Single Image to 3D Head with View Generation and GS-LRM 2024-12-23 Go
Show

We present FaceLift, a feed-forward approach for rapid, high-quality, 360-degree head reconstruction from a single image. Our pipeline begins by employing a multi-view latent diffusion model that generates consistent side and back views of the head from a single facial input. These generated views then serve as input to a GS-LRM reconstructor, which produces a comprehensive 3D representation using Gaussian splats. To train our system, we develop a dataset of multi-view renderings using synthetic 3D human head assets. The diffusion-based multi-view generator is trained exclusively on synthetic head images, while the GS-LRM reconstructor undergoes initial training on Objaverse followed by fine-tuning on synthetic head data. FaceLift excels at preserving identity and maintaining view consistency across views. Despite being trained solely on synthetic data, FaceLift demonstrates remarkable generalization to real-world images. Through extensive qualitative and quantitative evaluations, we show that FaceLift outperforms state-of-the-art methods in 3D head reconstruction, highlighting its practical applicability and robust performance on real-world images. In addition to single image reconstruction, FaceLift supports video inputs for 4D novel view synthesis and seamlessly integrates with 2D reanimation techniques to enable 3D facial animation. Project page: https://weijielyu.github.io/FaceLift.

Project page: https://weijielyu.github.io/FaceLift

PepTune: De Novo Generation of Therapeutic Peptides with Multi-Objective-Guided Discrete Diffusion 2024-12-23 Go
Show

Peptide therapeutics, a major class of medicines, have achieved remarkable success across diseases such as diabetes and cancer, with landmark examples such as GLP-1 receptor agonists revolutionizing the treatment of type-2 diabetes and obesity. Despite their success, designing peptides that satisfy multiple conflicting objectives, such as target binding affinity, solubility, and membrane permeability, remains a major challenge. Classical drug development and structure-based design are ineffective for such tasks, as they fail to optimize global functional properties critical for therapeutic efficacy. Existing generative frameworks are largely limited to continuous spaces, unconditioned outputs, or single-objective guidance, making them unsuitable for discrete sequence optimization across multiple properties. To address this, we present PepTune, a multi-objective discrete diffusion model for the simultaneous generation and optimization of therapeutic peptide SMILES. Built on the Masked Discrete Language Model (MDLM) framework, PepTune ensures valid peptide structures with state-dependent masking schedules and penalty-based objectives. To guide the diffusion process, we propose a Monte Carlo Tree Search (MCTS)-based strategy that balances exploration and exploitation to iteratively refine Pareto-optimal sequences. MCTS integrates classifier-based rewards with search-tree expansion, overcoming gradient estimation challenges and data sparsity inherent to discrete spaces. Using PepTune, we generate diverse, chemically-modified peptides optimized for multiple therapeutic properties, including target binding affinity, membrane permeability, solubility, hemolysis, and non-fouling characteristics on various disease-relevant targets. In total, our results demonstrate that MCTS-guided discrete diffusion is a powerful and modular approach for multi-objective sequence design in discrete state spaces.

The Superposition of Diffusion Models Using the Itô Density Estimator 2024-12-23 Go
Show

The Cambrian explosion of easily accessible pre-trained diffusion models suggests a demand for methods that combine multiple different pre-trained diffusion models without incurring the significant computational burden of re-training a larger combined model. In this paper, we cast the problem of combining multiple pre-trained diffusion models at the generation stage under a novel proposed framework termed superposition. Theoretically, we derive superposition from rigorous first principles stemming from the celebrated continuity equation and design two novel algorithms tailor-made for combining diffusion models in SuperDiff. SuperDiff leverages a new scalable Itô density estimator for the log likelihood of the diffusion SDE which incurs no additional overhead compared to the well-known Hutchinson's estimator needed for divergence calculations. We demonstrate that SuperDiff is scalable to large pre-trained diffusion models as superposition is performed solely through composition during inference, and also enjoys painless implementation as it combines different pre-trained vector fields through an automated re-weighting scheme. Notably, we show that SuperDiff is efficient during inference time, and mimics traditional composition operators such as the logical OR and the logical AND. We empirically demonstrate the utility of using SuperDiff for generating more diverse images on CIFAR-10, more faithful prompt conditioned image editing using Stable Diffusion, and improved unconditional de novo structure design of proteins. https://github.com/necludov/super-diffusion
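
The inference-time composition can be caricatured in a few lines: each pretrained score field is weighted by a softmax over per-model log-density estimates. In the sketch below the log-densities are given scalars, whereas SuperDiff tracks them along the SDE with its Itô estimator; all interfaces are assumptions.

```python
import torch

def superposed_score(scores, logps, x, t, temperature=1.0):
    """Sketch of inference-time superposition: combine pretrained score
    fields with weights from each model's running log-density estimate
    (softmax over models, a logical-OR-like mixture). `logps` would come
    from a density estimator tracked along the SDE; here they are given."""
    w = torch.softmax(torch.stack(logps) / temperature, dim=0)  # (num_models,)
    fields = torch.stack([s(x, t) for s in scores])             # (M, *x.shape)
    return (w.view(-1, *([1] * x.dim())) * fields).sum(dim=0)

# toy usage: two Gaussian "models" centered at -2 and +2
s1 = lambda x, t: -(x + 2.0)
s2 = lambda x, t: -(x - 2.0)
x = torch.randn(4, 2)
logps = [torch.tensor(-1.0), torch.tensor(-3.0)]
print(superposed_score([s1, s2], logps, x, 0.5).shape)  # torch.Size([4, 2])
```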

DiffH2O: Diffusion-Based Synthesis of Hand-Object Interactions from Textual Descriptions 2024-12-23 Go
Show

Generating natural hand-object interactions in 3D is challenging, as the resulting hand and object motions are expected to be physically plausible and semantically meaningful. Furthermore, generalization to unseen objects is hindered by the limited scale of available hand-object interaction datasets. In this paper, we propose a novel method, dubbed DiffH2O, which can synthesize realistic, one- or two-handed object interactions from provided text prompts and the geometry of the object. The method introduces three techniques that enable effective learning from limited data. First, we decompose the task into a grasping stage and a text-based manipulation stage and use separate diffusion models for each. In the grasping stage, the model only generates hand motions, whereas in the manipulation phase both hand and object poses are synthesized. Second, we propose a compact representation that tightly couples hand and object poses and helps in generating realistic hand-object interactions. Third, we propose two different guidance schemes to allow more control of the generated motions: grasp guidance and detailed textual guidance. Grasp guidance takes a single target grasping pose and guides the diffusion model to reach this grasp at the end of the grasping stage, which provides control over the grasping pose. Given a grasping motion from this stage, multiple different actions can be prompted in the manipulation phase. For the textual guidance, we contribute comprehensive text descriptions to the GRAB dataset and show that they enable our method to have more fine-grained control over hand-object interactions. Our quantitative and qualitative evaluation demonstrates that the proposed method outperforms baseline methods and leads to natural hand-object motions.

Project Page: https://diffh2o.github.io/

A Bias-Free Training Paradigm for More General AI-generated Image Detection 2024-12-23 Go
Show

Successful forensic detectors can produce excellent results in supervised learning benchmarks but struggle to transfer to real-world applications. We believe this limitation is largely due to inadequate training data quality. While most research focuses on developing new algorithms, less attention is given to training data selection, despite evidence that performance can be strongly impacted by spurious correlations such as content, format, or resolution. A well-designed forensic detector should detect generator-specific artifacts rather than reflect data biases. To this end, we propose B-Free, a bias-free training paradigm, where fake images are generated from real ones using the conditioning procedure of stable diffusion models. This ensures semantic alignment between real and fake images, allowing any differences to stem solely from the subtle artifacts introduced by AI generation. Through content-based augmentation, we show significant improvements in both generalization and robustness over state-of-the-art detectors, and more calibrated results, across 27 different generative models, including recent releases like FLUX and Stable Diffusion 3.5. Our findings emphasize the importance of careful dataset curation, highlighting the need for further research in dataset design. Code and data will be publicly available at https://grip-unina.github.io/B-Free/

Benchmarking Generative AI Models for Deep Learning Test Input Generation 2024-12-23 Go
Show

Test Input Generators (TIGs) are crucial to assess the ability of Deep Learning (DL) image classifiers to provide correct predictions for inputs beyond their training and test sets. Recent advancements in Generative AI (GenAI) models have made them a powerful tool for creating and manipulating synthetic images, although these advancements also imply increased complexity and resource demands for training. In this work, we benchmark and combine different GenAI models with TIGs, assessing their effectiveness, efficiency, and quality of the generated test images, in terms of domain validity and label preservation. We conduct an empirical study involving three different GenAI architectures (VAEs, GANs, Diffusion Models), five classification tasks of increasing complexity, and 364 human evaluations. Our results show that simpler architectures, such as VAEs, are sufficient for less complex datasets like MNIST. However, when dealing with feature-rich datasets, such as ImageNet, more sophisticated architectures like Diffusion Models achieve superior performance by generating a higher number of valid, misclassification-inducing inputs.

Accepted at the 18th IEEE International Conference on Software Testing, Verification and Validation (ICST 2025)

DreamFit: Garment-Centric Human Generation via a Lightweight Anything-Dressing Encoder 2024-12-23 Go
Show

Diffusion models for garment-centric human generation from text or image prompts have garnered emerging attention for their great application potential. However, existing methods often face a dilemma: lightweight approaches, such as adapters, are prone to generating inconsistent textures, while finetune-based methods involve high training costs and struggle to maintain the generalization capabilities of pretrained diffusion models, limiting their performance across diverse scenarios. To address these challenges, we propose DreamFit, which incorporates a lightweight Anything-Dressing Encoder specifically tailored for garment-centric human generation. DreamFit has three key advantages: (1) **Lightweight training**: with the proposed adaptive attention and LoRA modules, DreamFit significantly minimizes the model complexity to 83.4M trainable parameters. (2) **Anything-Dressing**: our model generalizes surprisingly well to a wide range of (non-)garments, creative styles, and prompt instructions, consistently delivering high-quality results across diverse scenarios. (3) **Plug-and-play**: DreamFit is engineered for smooth integration with any community control plugins for diffusion models, ensuring easy compatibility and minimizing adoption barriers. To further enhance generation quality, DreamFit leverages pretrained large multi-modal models (LMMs) to enrich the prompt with fine-grained garment descriptions, thereby reducing the prompt gap between training and inference. We conduct comprehensive experiments on both 768 × 512 high-resolution benchmarks and in-the-wild images. DreamFit surpasses all existing methods, highlighting its state-of-the-art capabilities in garment-centric human generation.

Ditto: Motion-Space Diffusion for Controllable Realtime Talking Head Synthesis 2024-12-23 Go
Show

Recent advances in diffusion models have revolutionized audio-driven talking head synthesis. Beyond precise lip synchronization, diffusion-based methods excel in generating subtle expressions and natural head movements that are well-aligned with the audio signal. However, these methods are confronted by slow inference speed, insufficient fine-grained control over facial motions, and occasional visual artifacts largely due to an implicit latent space derived from Variational Auto-Encoders (VAE), which prevent their adoption in realtime interaction applications. To address these issues, we introduce Ditto, a diffusion-based framework that enables controllable realtime talking head synthesis. Our key innovation lies in bridging motion generation and photorealistic neural rendering through an explicit identity-agnostic motion space, replacing conventional VAE representations. This design substantially reduces the complexity of diffusion learning while enabling precise control over the synthesized talking heads. We further propose an inference strategy that jointly optimizes three key components: audio feature extraction, motion generation, and video synthesis. This optimization enables streaming processing, realtime inference, and low first-frame delay, which are the functionalities crucial for interactive applications such as AI assistants. Extensive experimental results demonstrate that Ditto generates compelling talking head videos and substantially outperforms existing methods in both motion control and realtime performance.

Project Page: https://digital-avatar.github.io/ai/Ditto/

Retention Score: Quantifying Jailbreak Risks for Vision Language Models 2024-12-23 Go
Show

The emergence of Vision-Language Models (VLMs) is a significant advancement in integrating computer vision with Large Language Models (LLMs) to enhance multi-modal machine learning capabilities. However, this progress has also made VLMs vulnerable to sophisticated adversarial attacks, raising concerns about their reliability. The objective of this paper is to assess the resilience of VLMs against jailbreak attacks that can compromise model safety compliance and result in harmful outputs. To evaluate a VLM's ability to maintain its robustness against adversarial input perturbations, we propose a novel metric called the **Retention Score**. Retention Score is a multi-modal evaluation metric that includes Retention-I and Retention-T scores for quantifying jailbreak risks in visual and textual components of VLMs. Our process involves generating synthetic image-text pairs using a conditional diffusion model. These pairs are then predicted for toxicity score by a VLM alongside a toxicity judgment classifier. By calculating the margin in toxicity scores, we can quantify the robustness of the VLM in an attack-agnostic manner. Our work has four main contributions. First, we prove that Retention Score can serve as a certified robustness metric. Second, we demonstrate that most VLMs with visual components are less robust against jailbreak attacks than the corresponding plain VLMs. Additionally, we evaluate black-box VLM APIs and find that the security settings in Google Gemini significantly affect the score and robustness. Moreover, the robustness of GPT4V is similar to the medium settings of Gemini. Finally, our approach offers a time-efficient alternative to existing adversarial attack methods and provides consistent model robustness rankings when evaluated on VLMs including MiniGPT-4, InstructBLIP, and LLaVA.

14 pages, 8 figures, AAAI 2025

DiffusionAttacker: Diffusion-Driven Prompt Manipulation for LLM Jailbreak 2024-12-23 Go
Show

Large Language Models (LLMs) are susceptible to generating harmful content when prompted with carefully crafted inputs, a vulnerability known as LLM jailbreaking. As LLMs become more powerful, studying jailbreak methods is critical to enhancing security and aligning models with human values. Traditionally, jailbreak techniques have relied on suffix addition or prompt templates, but these methods suffer from limited attack diversity. This paper introduces DiffusionAttacker, an end-to-end generative approach for jailbreak rewriting inspired by diffusion models. Our method employs a sequence-to-sequence (seq2seq) text diffusion model as a generator, conditioning on the original prompt and guiding the denoising process with a novel attack loss. Unlike previous approaches that use autoregressive LLMs to generate jailbreak prompts, which limit the modification of already generated tokens and restrict the rewriting space, DiffusionAttacker utilizes a seq2seq diffusion model, allowing more flexible token modifications. This approach preserves the semantic content of the original prompt while producing harmful content. Additionally, we leverage the Gumbel-Softmax technique to make the sampling process from the diffusion model's output distribution differentiable, eliminating the need for iterative token search. Extensive experiments on Advbench and Harmbench demonstrate that DiffusionAttacker outperforms previous methods across various evaluation metrics, including attack success rate (ASR), fluency, and diversity.
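
The Gumbel-Softmax step the abstract relies on is standard and worth seeing concretely: it makes token sampling differentiable, so an attack loss can backpropagate into the generator's logits. A minimal illustration follows; the shapes and the stand-in loss are assumptions, not the paper's setup.

```python
import torch
import torch.nn.functional as F

# Minimal illustration of the Gumbel-Softmax trick: sample token choices
# from output logits while keeping the operation differentiable.
vocab, seq_len = 1000, 16
logits = torch.randn(seq_len, vocab, requires_grad=True)

# hard=True gives one-hot samples in the forward pass but uses the soft
# relaxation's gradients in the backward pass (straight-through estimator)
onehot = F.gumbel_softmax(logits, tau=0.7, hard=True)   # (seq_len, vocab)

embedding = torch.nn.Embedding(vocab, 64)
token_emb = onehot @ embedding.weight                   # differentiable lookup

loss = token_emb.pow(2).mean()                          # stand-in attack loss
loss.backward()
print(logits.grad.shape)                                # torch.Size([16, 1000])
```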

CC-Diff: Enhancing Contextual Coherence in Remote Sensing Image Synthesis 2024-12-23 Go
Show

Accurately depicting real-world landscapes in remote sensing (RS) images requires precise alignment between objects and their environment. However, most existing synthesis methods for natural images prioritize foreground control, often reducing the background to plain textures. This neglects the interaction between foreground and background, which can lead to incoherence in RS scenarios. In this paper, we introduce CC-Diff, a Diffusion Model-based approach for RS image generation with enhanced Context Coherence. To capture spatial interdependence, we propose a sequential pipeline where background generation is conditioned on synthesized foreground instances. Distinct learnable queries are also employed to model both the complex background texture and its semantic relation to the foreground. Extensive experiments demonstrate that CC-Diff outperforms state-of-the-art methods in visual fidelity, semantic accuracy, and positional precision, excelling in both RS and natural image domains. CC-Diff also shows strong trainability, improving detection accuracy by 2.04 mAP on DOTA and 2.25 mAP on the COCO benchmark.

BudgetFusion: Perceptually-Guided Adaptive Diffusion Models 2024-12-23 Go
Show

Diffusion models have shown unprecedented success in the task of text-to-image generation. While these models are capable of generating high-quality and realistic images, the complexity of sequential denoising has raised societal concerns regarding high computational demands and energy consumption. In response, various efforts have been made to improve inference efficiency. However, most of the existing efforts have taken a fixed approach with neural network simplification or text prompt optimization. Are the quality improvements from all denoising computations equally perceivable to humans? We observed that images from different text prompts may require different computational efforts given the desired content. The observation motivates us to present BudgetFusion, a novel model that suggests the most perceptually efficient number of diffusion steps before a diffusion model starts to generate an image. This is achieved by predicting multi-level perceptual metrics relative to diffusion steps. With the popular Stable Diffusion as an example, we conduct both numerical analyses and user studies. Our experiments show that BudgetFusion saves up to five seconds per prompt without compromising perceptual similarity. We hope this work can initiate efforts toward answering a core question: how much do humans perceptually gain from images created by a generative model, per watt of energy?
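
The scheduling idea reduces to a small decision rule once a predictor of perceptual quality versus step count exists. A toy sketch follows, with the learned predictor stubbed out; the budgets, tolerance, and quality values are all assumptions for illustration.

```python
def pick_num_steps(predicted_quality, budget_steps=(10, 20, 30, 50), tol=0.02):
    """Sketch of the scheduling idea: given per-budget predictions of a
    perceptual-quality metric for a prompt, return the smallest step count
    whose predicted quality is within `tol` of the best. The predictor
    itself (prompt -> quality curve) is the learned part, stubbed out here."""
    best = max(predicted_quality)
    for steps, quality in zip(budget_steps, predicted_quality):
        if best - quality <= tol:
            return steps
    return budget_steps[-1]

# toy usage: a prompt whose predicted quality saturates early gets a small budget
print(pick_num_steps([0.90, 0.95, 0.955, 0.958]))  # -> 20
```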

Enhanced Generative Data Augmentation for Semantic Segmentation via Stronger Guidance 2024-12-23 Go
Show

Data augmentation is crucial for pixel-wise annotation tasks like semantic segmentation, where labeling requires significant effort and intensive labor. Traditional methods, involving simple transformations such as rotations and flips, create new images but often lack diversity along key semantic dimensions and fail to alter high-level semantic properties. To address this issue, generative models have emerged as an effective solution for augmenting data by generating synthetic images. Controllable generative models offer data augmentation methods for semantic segmentation tasks by using prompts and visual references from the original image. However, these models face challenges in generating synthetic images that accurately reflect the content and structure of the original image, due to difficulties in creating effective prompts and visual references. In this work, we introduce an effective data augmentation pipeline for semantic segmentation using a controllable diffusion model. Our proposed method includes efficient prompt generation using *Class-Prompt Appending* and *Visual Prior Blending* to enhance attention to labeled classes in real images, allowing the pipeline to generate a precise number of augmented images while preserving the structure of segmentation-labeled classes. In addition, we implement a *class balancing algorithm* to ensure a balanced training dataset when merging the synthetic and original images. Evaluated on the PASCAL VOC dataset, our pipeline demonstrates its effectiveness in generating high-quality synthetic images for semantic segmentation. Our code is available at https://github.com/chequanghuy/Enhanced-Generative-Data-Augmentation-for-Semantic-Segmentation-via-Stronger-Guidance.
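
As a rough illustration of two of the named components, the sketch below shows a toy prompt builder in the spirit of Class-Prompt Appending and a simple class-balancing merge; both are hypothetical stand-ins rather than the paper's exact procedures.

```python
import random
from collections import Counter

def class_prompt(base_caption, labeled_classes):
    """Toy version of Class-Prompt Appending: append the segmentation-labeled
    class names to the caption so the diffusion model attends to them.
    (Illustrative; the paper's prompt template may differ.)"""
    return base_caption + ", featuring " + ", ".join(labeled_classes)

def balance_merge(real_items, synth_items, get_classes):
    """Toy class-balancing merge: add synthetic images whose classes are
    under-represented until every class reaches the current maximum count."""
    counts = Counter(c for item in real_items for c in get_classes(item))
    target = max(counts.values())
    merged = list(real_items)
    random.shuffle(synth_items)
    for item in synth_items:
        if any(counts[c] < target for c in get_classes(item)):
            merged.append(item)
            for c in get_classes(item):
                counts[c] += 1
    return merged

print(class_prompt("a photo of a street", ["person", "bicycle"]))
```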

Accepted to ICPRAM 2025

Broadband Ground Motion Synthesis by Diffusion Model with Minimal Condition 2024-12-23 Go
Show

Earthquakes are rare. Hence, there is a fundamental call for reliable methods to generate realistic ground motion data for data-driven approaches in seismology. Recent GAN-based methods fall short of this call, as they either require special information such as geological traits or generate subpar waveforms that fail to satisfy seismological constraints such as phase arrival times. We propose a specialized Latent Diffusion Model (LDM) that reliably generates realistic waveforms after learning from real earthquake data with minimal conditions: location and magnitude. We also design a domain-specific training method that exploits the traits of earthquake datasets: multiple observed waveforms, time-aligned and paired to each earthquake source, tagged with seismological metadata comprising earthquake magnitude, depth of focus, and the locations of the epicenter and seismometers. We construct the time-aligned earthquake dataset using the Southern California Earthquake Data Center (SCEDC) API and train our model with the dataset and our proposed training method for performance evaluation. Our model surpasses all comparable data-driven methods on various test criteria, not only from the waveform generation domain but also from seismology, such as phase arrival time, GMPE analysis, and spectrum analysis. Our results open new research directions for deep learning applications in seismology.

CustomTTT: Motion and Appearance Customized Video Generation via Test-Time Training 2024-12-23 Go
Show

Benefiting from large-scale pre-training on text-video pairs, current text-to-video (T2V) diffusion models can generate high-quality videos from a text description. Moreover, given some reference images or videos, parameter-efficient fine-tuning methods such as LoRA can generate high-quality customized concepts, e.g., a specific subject or the motion from a reference video. However, combining multiple concepts trained from different references into a single network produces obvious artifacts. To this end, we propose CustomTTT, which allows us to jointly customize the appearance and the motion of a given video. In detail, we first analyze the influence of the prompt in current video diffusion models and find that LoRAs are only needed in specific layers for appearance and motion customization. Besides, since each LoRA is trained individually, we propose a novel test-time training technique that updates parameters after combination, utilizing the trained customized models. We conduct detailed experiments to verify the effectiveness of the proposed methods. Our method outperforms several state-of-the-art works in both qualitative and quantitative evaluations.

Accepted in AAAI 2025. Project Page: https://customttt.github.io/ Code: https://github.com/RongPiKing/CustomTTT

Free-viewpoint Human Animation with Pose-correlated Reference Selection 2024-12-23 Go
Show

Diffusion-based human animation aims to animate a human character based on a source human image as well as driving signals such as a sequence of poses. Leveraging the generative capacity of diffusion models, existing approaches are able to generate high-fidelity poses, but struggle with significant viewpoint changes, especially in zoom-in/zoom-out scenarios where camera-character distance varies. This limits applications such as cinematic shot-type planning or camera control. We propose a pose-correlated reference selection diffusion network, supporting substantial viewpoint variations in human animation. Our key idea is to enable the network to utilize multiple reference images as input, since significant viewpoint changes often lead to missing appearance details on the human body. To limit the computational cost, we first introduce a novel pose correlation module to compute similarities between non-aligned target and source poses, and then propose an adaptive reference selection strategy, utilizing the attention map to identify key regions for animation generation. To train our model, we curated a large dataset from public TED talks featuring varied shots of the same character, helping the model learn synthesis for different perspectives. Our experimental results show that with the same number of reference images, our model performs favorably compared to current SOTA methods under large viewpoint changes. We further show that the adaptive reference selection is able to choose the most relevant reference regions to generate humans under free viewpoints.

Under review
Enhancing Multi-Text Long Video Generation Consistency without Tuning: Time-Frequency Analysis, Prompt Alignment, and Theory 2024-12-23 Go
Show

Despite the considerable progress achieved in the long video generation problem, there is still significant room to improve the consistency of the videos, particularly in terms of smoothness and transitions between scenes. We address these issues to enhance the consistency and coherence of videos generated with either single or multiple prompts. We propose the Time-frequency based temporal Attention Reweighting Algorithm (TiARA), which meticulously edits the attention score matrix based on the Discrete Short-Time Fourier Transform. Our method is supported by a theoretical guarantee, the first-of-its-kind for frequency-based methods in diffusion models. For videos generated by multiple prompts, we further investigate key factors affecting prompt interpolation quality and propose PromptBlend, an advanced prompt interpolation pipeline. The efficacy of our proposed method is validated via extensive experimental results, exhibiting consistent and impressive improvements over baseline methods. The code will be released upon acceptance.

34 pages, 11 figures
OLiDM: Object-aware LiDAR Diffusion Models for Autonomous Driving 2024-12-23 Go
Show

To enhance autonomous driving safety in complex scenarios, various methods have been proposed to simulate LiDAR point cloud data. Nevertheless, these methods often face challenges in producing high-quality, diverse, and controllable foreground objects. To address the needs of object-aware tasks in 3D perception, we introduce OLiDM, a novel framework capable of generating high-fidelity LiDAR data at both the object and the scene levels. OLiDM consists of two pivotal components: the Object-Scene Progressive Generation (OPG) module and the Object Semantic Alignment (OSA) module. OPG adapts to user-specific prompts to generate desired foreground objects, which are subsequently employed as conditions in scene generation, ensuring controllable outputs at both the object and scene levels. This also facilitates the association of user-defined object-level annotations with the generated LiDAR scenes. Moreover, OSA aims to rectify the misalignment between foreground objects and background scenes, enhancing the overall quality of the generated objects. The broad effectiveness of OLiDM is demonstrated across various LiDAR generation tasks, as well as in 3D perception tasks. Specifically, on the KITTI-360 dataset, OLiDM surpasses prior state-of-the-art methods such as UltraLiDAR by 17.5 in FPD. Additionally, in sparse-to-dense LiDAR completion, OLiDM achieves a significant improvement over LiDARGen, with a 57.47% increase in semantic IoU. Moreover, OLiDM enhances the performance of mainstream 3D detectors by 2.4% in mAP and 1.9% in NDS, underscoring its potential in advancing object-aware 3D tasks. Code is available at: https://yanty123.github.io/OLiDM.

AAAI 2025, https://yanty123.github.io/OLiDM

CharGen: High Accurate Character-Level Visual Text Generation Model with MultiModal Encoder 2024-12-23 Go
Show

Recently, significant advancements have been made in diffusion-based visual text generation models. Although the effectiveness of these methods in visual text rendering is rapidly improving, they still encounter challenges such as inaccurate characters and strokes when rendering complex visual text. In this paper, we propose CharGen, a highly accurate character-level visual text generation and editing model. Specifically, CharGen employs a character-level multimodal encoder that not only extracts character-level text embeddings but also encodes glyph images character by character. This enables it to capture fine-grained cross-modality features more effectively. Additionally, we introduce a new perceptual loss in CharGen to enhance character shape supervision and address the issue of inaccurate strokes in generated text. It is worth mentioning that CharGen can be integrated into existing diffusion models to generate visual text with high accuracy. CharGen significantly improves text rendering accuracy, outperforming recent methods in public benchmarks such as AnyText-benchmark and MARIO-Eval, with improvements of more than 8% and 6%, respectively. Notably, CharGen achieved a 5.5% increase in accuracy on Chinese test sets.

Discriminative Image Generation with Diffusion Models for Zero-Shot Learning 2024-12-23 Go
Show
