TY - JOUR
T1 - AnoViT
T2 - Unsupervised Anomaly Detection and Localization With Vision Transformer-Based Encoder-Decoder
AU - Lee, Yunseung
AU - Kang, Pilsung
N1 - Funding Information:
This work was supported in part by the National Research Foundation of Korea (NRF) Grant by the Korea Government through MSIT under Grant NRF-2022R1A2C2005455, and in part by the Institute of Information and Communications Technology Planning and Evaluation (IITP) Grant by the Korea Government through MSIT (Clustering technologies of fragmented data for time-based data analysis) under Grant 2021-0-00034.
Publisher Copyright:
© 2013 IEEE.
PY - 2022
Y1 - 2022
N2 - Image anomaly detection aims to determine whether an image is abnormal and to detect anomalous areas. Such methods are actively used in various fields such as manufacturing, medical care, and intelligent information. Encoder-decoder structures have been widely used in anomaly detection because they can easily learn normal patterns in an unsupervised learning environment and compute an anomaly score through the reconstruction error, i.e., the difference between the input and reconstructed images. Accordingly, current image anomaly detection methods have commonly used convolutional encoder-decoders to extract normal information from the local features of images. However, owing to the characteristics of convolution operations, which use filters of fixed size, such models can utilize only local features of the image when constructing a normal representation. We therefore propose a vision transformer-based encoder-decoder model, named AnoViT, which reflects normal information by additionally learning the global relationships between image patches and is capable of both image anomaly detection and localization. Whereas existing vision transformers perform image classification using only a class token, the proposed approach constructs a feature map that preserves the location information of individual patches by using the embeddings of all patches passed through multiple self-attention layers. This feature map, reshaped into three dimensions, is then used for decoding. The design preserves spatial information by excluding the fully-connected layer that extracts latent vectors in existing convolution-based encoder-decoders. The proposed AnoViT model performed better than the convolution-based model on three benchmark datasets. On MVTecAD, a representative benchmark dataset for anomaly localization, it showed improved results on 10 of 15 classes compared with the baseline. Furthermore, the proposed method performed well regardless of the class and type of the anomalous area when localization results were evaluated qualitatively.
KW - Anomaly detection
KW - anomaly localization
KW - MVTecAD
KW - vision transformer
UR - http://www.scopus.com/inward/record.url?scp=85129657863&partnerID=8YFLogxK
U2 - 10.1109/ACCESS.2022.3171559
DO - 10.1109/ACCESS.2022.3171559
M3 - Article
AN - SCOPUS:85129657863
SN - 2169-3536
VL - 10
SP - 46717
EP - 46724
JO - IEEE Access
JF - IEEE Access
ER -