A CLIP-based dual-stream method for text-based vehicle search
Email:
thuybinh_ktdt@utc.edu.vn
Abstract
Text-based vehicle search refers to a system in which users find vehicles or route information by entering text queries. Its primary objective is to identify the most relevant vehicle in a given dataset using a natural language description as the query. The approach relies on natural language processing (NLP) to interpret the description and return relevant results. Despite significant progress, the task remains challenging because of the complexity and diversity of natural language and the inherent difficulties of the vision domain. Moreover, few studies have addressed tracked-vehicle retrieval, where vehicle tracklets are considered instead of single images. In this paper, we propose a novel framework for natural language-based tracked-vehicle retrieval built on the CLIP model, one of the most effective models for the image-text matching task. The framework leverages both appearance and motion information to improve the matching accuracy of vehicle tracklet retrieval. Experiments are conducted on the CityFlow-NL dataset, provided by the 6th AI City Challenge, an annual competition. The results are comparable to state-of-the-art methods, with an MRR score of 46.63%, Rank@5 of 67.02%, and Rank@10 of 81.82%.
References
[1]. S. Bai, Z. Zheng, X. Wang, J. Lin, Z. Zhang, C. Zhou, H. Yang, Y. Yang, Connecting Language and Vision for Natural Language-Based Vehicle Retrieval, in 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2021. https://doi.org/10.1109/CVPRW53098.2021.00455
[2]. Quang-Huy Can, Hong-Quan Nguyen, Thi-Ngoc-Diep Do, Hoai Phan, Thuy-Binh Nguyen, Thi Thanh Thuy Pham, Thanh-Hai Tran, Thi-Lan Le, Exploring the Effect of Vehicle Appearance and Motion for Natural Language-Based Vehicle Retrieval, in Asian Conference on Intelligent Information and Database Systems, 2022. https://doi.org/10.1007/978-981-19-8234-7_5
[3]. Y. Du, B. Zhang, X. Ruan, F. Su, Z. Zhao, H. Chen, OMG: Observe Multiple Granularities for Natural Language-Based Vehicle Retrieval, in Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2022. https://doi.org/10.1109/CVPRW56347.2022.00352
[4]. A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, et al., Learning transferable visual models from natural language supervision, in International Conference on Machine Learning, pp. 8748-8763, 2021.
[5]. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, in 31st International Conference on Neural Information Processing Systems, 2017. https://doi.org/10.48550/arXiv.1706.03762
[6]. A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al., An image is worth 16x16 words: Transformers for image recognition at scale, in 2021 International Conference on Learning Representations, 2021.
[7]. Jiasen Lu, Dhruv Batra, Devi Parikh, Stefan Lee, ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks, in 33rd International Conference on Neural Information Processing Systems, 2019. https://doi.org/10.48550/arXiv.1908.02265
[8]. H. Fang, P. Xiong, L. Xu, Y. Chen, CLIP2Video: Mastering Video-Text Retrieval via Image CLIP, in arXiv preprint arXiv:2106.11097, 2021. https://doi.org/10.48550/arXiv.2106.11097
[9]. C. Scribano, D. Sapienza, G. Franchini, M. Verucchi, M. Bertogna, All You Can Embed: Natural Language based Vehicle Retrieval with Spatio-Temporal Transformers, in 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2021. https://doi.org/10.1109/CVPRW53098.2021.00481
[10]. Kyunghyun Cho, B van Merrienboer, Caglar Gulcehre, F Bougares, H Schwenk, Yoshua Bengio, Learning phrase representations using RNN encoder-decoder for statistical machine translation, in Conference on Empirical Methods in Natural Language Processing (EMNLP 2014), 2014. https://doi.org/10.3115/v1/D14-1179
[11]. Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun, Deep Residual Learning for Image Recognition, in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, 2016. https://doi.org/10.1109/CVPR.2016.90
[12]. Sangrok Lee, Taekang Woo, Sang Hun Lee, SBNet: Segmentation-based Network for Natural Language-based Vehicle Search, in 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Nashville, TN, USA, 2021. https://doi.org/10.1109/CVPRW53098.2021.00457
[13]. P. Khorramshahi, S. S. Rambhatla, R. Chellappa, Towards Accurate Visual and Natural Language-Based Vehicle Retrieval Systems, in 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Nashville, TN, USA, 2021. https://doi.org/10.1109/CVPRW53098.2021.00472
[14]. Tam Minh Nguyen, Quang Huu Pham, Linh Bao Doan, Hoang Viet Trinh, Viet-Anh Nguyen, Viet-Hoang Phan, Contrastive Learning for Natural Language-Based Vehicle Retrieval, in 2021 Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Nashville, TN, USA, 2021. https://doi.org/10.1109/CVPRW53098.2021.00480
[15]. Clint Sebastian, Raffaele Imbriaco, Panagiotis Meletis, Gijs Dubbelman, Egor Bondarev, Peter H.N. de With, TIED: A Cycle Consistent Encoder-Decoder Model for Text-to-Image Retrieval, in 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Nashville, TN, USA, 2021. https://doi.org/10.1109/CVPRW53098.2021.00467
[16]. Steven Bird, Ewan Klein, Edward Loper, Natural language processing with Python: analyzing text with the natural language toolkit, O'Reilly Media, Inc, 2009.
[17]. Dan Hendrycks, Norman Mu, Ekin D. Cubuk, Barret Zoph, Justin Gilmer, Balaji Lakshminarayanan, AugMix: A Simple Data Processing Method to Improve Robustness and Uncertainty, in arXiv:1912.02781v2, 2019. https://doi.org/10.48550/arXiv.1912.02781
[18]. Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, David Lopez-Paz, Mixup: Beyond Empirical Risk Minimization, in arXiv preprint arXiv:1710.09412, 2018. https://doi.org/10.48550/arXiv.1710.09412
[19]. R. Sennrich, B. Haddow, A. Birch, Improving Neural Machine Translation Models with Monolingual Data, in arXiv preprint arXiv:1511.06709, 2016. https://doi.org/10.18653/v1/P16-1009
[20]. Cloud translation documentation, [Online]. Available: https://cloud.google.com/translate/docs. [Accessed 22 August 2024].
[21]. Translate any application with SysTran API, [Online]. Available: https://www.systransoft.com/translation-products/translate-api/. [Accessed 22 August 2024].
[22]. Rico Sennrich, Barry Haddow, Alexandra Birch, Neural Machine Translation of Rare Words with Subword Units, in 54th Annual Meeting of the Association for Computational Linguistics, 2016. https://doi.org/10.18653/v1/P16-1162
[23]. Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A Robustly Optimized BERT Pretraining Approach, in arXiv preprint arXiv:1907.11692, 2019.
[24]. Z. Zheng, L. Zheng, M. Garrett, Y. Yang, M. Xu, Y.-D. Shen, Dual-path convolutional image-text embeddings with instance loss, ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 16(2) (2020) 1–23. https://doi.org/10.1145/3383184
[25]. Qi Feng, Vitaly Ablavsky, Stan Sclaroff, CityFlow-NL: Tracking and retrieval of vehicles at city scale by natural language descriptions, arXiv preprint arXiv:2101.04741, 2021. https://doi.org/10.48550/arXiv.2101.04741
[26]. Chuyang Zhao, Haobo Chen, Wenyuan Zhang, Junru Chen, Symmetric Network with Spatial Relationship Modeling for Natural Language-based Vehicle Retrieval, in 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2022. https://doi.org/10.1109/CVPRW56347.2022.00364
Received
08/09/2024
Revised
22/10/2024
Accepted
10/01/2025
Published
15/01/2025
Section
Research articles
How to cite
Quang Huy, C., Phuong Dung, N., Thuy Binh, N., Hong Quan, N., Thien Linh, V., & Thi Lan, L. (2025). A CLIP-based dual-stream method for text-based vehicle search. Tạp Chí Khoa Học Giao Thông Vận Tải, 76(1), 16-30. https://doi.org/10.47869/tcsj.76.1.2