HIT4DAR: Holistic interaction transformer for driver action recognition
Email:
thuybinh_ktdt@utc.edu.vn
Keywords:
Driver Action Recognition, Transformer-based Action Recognition, Human-Object Interaction.
Abstract
Traffic accidents remain a critical issue worldwide, especially given the rapidly growing number of vehicles on the road. This motivates the need for an automatic system that recognizes distracted driver actions and provides early warnings to enhance road safety. Driver action recognition (DAR) can be considered a subfield of human action recognition (HAR). In addition to the common challenges of HAR, DAR must address further difficulties, including fine-grained and subtle hand movements, self-occlusion, and complex interactions with multiple objects. To overcome these challenges, we leverage the Holistic Interaction Transformer (HIT) network, originally designed for HAR, to recognize driver activities from video sequences. The proposed method is named HIT4DAR (Holistic Interaction Transformer for Driver Action Recognition). Experiments are conducted on the UTCDriverAct dataset to demonstrate the effectiveness of the HIT network on the DAR task. The overall performance across the six actions of interest reaches a Video-mAP of 47.8%; in particular, the recognition accuracy for the Texting activity reaches 78.9%. Furthermore, an ablation study is performed to investigate the influence of pose estimation models on recognition accuracy. The experimental results indicate a trade-off between recognition accuracy and computational efficiency across different pose estimation models. This analysis provides useful recommendations for the research community when deploying the proposed framework in real-world DAR systems.
References
[1]. C. Meurie, O. Lézoray, A comprehensive review of on-board action recognition models in public transportation systems, Expert Systems with Applications 290 (2025) 128311. https://doi.org/10.1016/j.eswa.2025.128311
[2]. Y. Xu, S. Jiang, Z. Cui, F. Su, Multi-view action recognition for distracted driver behavior localization, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2023) 5375-5380. https://doi.org/10.1109/CVPRW59228.2023.00567
[3]. T. Mewborne, L. Zhang, S. Tan, A wearable-based distracted driving detection leveraging BLE, in Proceedings of the 19th ACM Conference on Embedded Networked Sensor Systems, (2021) 365-366. https://doi.org/10.1145/3485730.3492872
[4]. B. Zhang, J. Wang, J. Fu, J. Xia, Driver action recognition using federated learning, in Proceedings of the 7th International Conference on Communication and Information Processing, (2021) 74-77. https://doi.org/10.1145/3507971.3507985
[5]. Hong-Quan Nguyen, Thuy-Binh Nguyen, Trung Kien Tran, Van-Nam Hoang, Thi-Lan Le, Thanh-Hai Tran, Hai Vu, End-to-end deep learning-based framework for driver action recognition, in Proceedings of the IEEE Conference on Multimedia Analysis and Pattern Recognition, (2022) 1-6. https://doi.org/10.1109/MAPR56351.2022.9924944
[6]. Y. Hu, M. Lu, C. Xie, X. Lu, Video-based driver action recognition via hybrid spatial–temporal deep learning framework, Multimedia Systems, 27 (2021) 483-501. https://doi.org/10.1007/s00530-020-00724-y
[7]. M. Lu, Y. Hu, X. Lu, Driver action recognition using deformable and dilated faster R-CNN with optimized region proposals, Applied Intelligence, 50 (2020) 1100-1111. https://doi.org/10.1007/s10489-019-01603-4
[8]. H. I. Qamar, U. Saeed, M. Hussain, Driver distraction detection using a multi-stream deep fusion network, in Proceedings of the International Conference on Computing & Emerging Technologies, (2025) 97-104. https://doi.org/10.1007/978-3-031-77617-5_9
[9]. G. Shan, Q. Ji, Y. Xie, Multi-view vision transformer for driver action recognition, in Proceedings of the International Conference on Intelligent Transportation Engineering, (2022) 970-981. https://doi.org/10.1007/978-981-19-2259-6_85
[10]. R. Pizarro, R. Valle, L. M. Bergasa, J. M. Buenaposada, L. Baumela, Pose-guided multi-task video transformer for driver action recognition, arXiv preprint arXiv:2407.13750, (2024). https://doi.org/10.48550/arXiv.2407.13750
[11]. N. Sengar, I. Kumari, J. Lee, D. Har, PoseViNet: Distracted driver action recognition framework using multi-view pose estimation and vision transformer, arXiv preprint arXiv:2312.14577, (2023). https://doi.org/10.48550/arXiv.2312.14577
[12]. Y. Ma, L. Yuan, A. Abdelraouf, K. Han, R. Gupta, Z. Li, Z. Wang, M2DAR: Multi-view multi-scale driver action recognition with vision transformer, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2023) 5287-5294. https://doi.org/10.1109/CVPRW59228.2023.00557
[13]. G. J. Faure, M.-H. Chen, S.-H. Lai, Holistic interaction transformer network for action detection, in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, (2023) 3340-3350. https://doi.org/10.1109/WACV56688.2023.00334
[14]. K. Simonyan, A. Zisserman, Two-stream convolutional networks for action recognition in videos, in Proceedings of the 28th International Conference on Neural Information Processing Systems, 1 (2014) 568-576. https://dl.acm.org/doi/10.5555/2968826.2968890
[15]. C. Feichtenhofer, A. Pinz, R. P. Wildes, Spatio-temporal residual networks for video action recognition, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (2017) 3468-3476. https://doi.org/10.48550/arXiv.1611.02155
[16]. K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (2016) 770-778. https://doi.org/10.1109/CVPR.2016.90
[17]. D. Kondratyuk, L. Yuan, Y. Li, L. Zhang, M. Tan, M. Brown, B. Gong, MoViNets: Mobile video networks for efficient video recognition, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2021) 16020-16030. https://doi.org/10.1109/CVPR46437.2021.01576
[18]. M. T. Tran, M. Q. Vu, N. D. Hoang, K.-H. N. Bui, An effective temporal localization method with multi-view 3D action recognition for untrimmed naturalistic driving videos, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2022) 3168-3173. https://doi.org/10.1109/CVPRW56347.2022.00357
[19]. C. Feichtenhofer, X3D: Expanding architectures for efficient video recognition, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2020) 203-213. https://doi.org/10.1109/CVPR42600.2020.00028
[20]. Z. Tong, Y. Song, J. Wang, L. Wang, VideoMAE: Masked autoencoders are data-efficient learners for self-supervised video pre-training, in Proceedings of the 36th International Conference on Neural Information Processing Systems, 35 (2022) 10078-10093. https://dl.acm.org/doi/10.5555/3600270.3601002
[21]. A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, N. Houlsby, An image is worth 16x16 words: Transformers for image recognition at scale, arXiv preprint arXiv:2010.11929, (2020). https://doi.org/10.48550/arXiv.2010.11929
[22]. Z. Xu, J. Xu, Spatio-temporal decoupling attention transformer for 3D skeleton-based driver action recognition, Complex & Intelligent Systems, 11 (2025) 1-12. https://doi.org/10.1007/s40747-025-01811-1
[23]. C. Feichtenhofer, H. Fan, J. Malik, K. He, SlowFast networks for video recognition, in Proceedings of the IEEE/CVF International Conference on Computer Vision, (2019) 6202-6211. https://doi.org/10.1109/ICCV.2019.00630
[24]. S. Ren, K. He, R. Girshick, J. Sun, Faster R-CNN: Towards real-time object detection with region proposal networks, Advances in Neural Information Processing Systems, 28 (2015) 91-99. https://doi.org/10.1109/TPAMI.2016.2577031
[25]. K. He, G. Gkioxari, P. Dollár, R. Girshick, Mask R-CNN, in Proceedings of the IEEE International Conference on Computer Vision, (2017) 2961-2969. https://doi.org/10.1109/ICCV.2017.322
[26]. D. Maji, S. Nagori, M. Mathew, D. Poddar, YOLO-Pose: Enhancing YOLO for multi-person pose estimation using object keypoint similarity loss, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2022) 2637-2646. https://doi.org/10.1109/CVPRW56347.2022.00297
Received: 20/02/2026
Revised: 22/03/2026
Accepted: 23/03/2026
Published: 15/05/2026
Section: Research articles
How to cite:
Hoang Hiep, B., Khanh Huyen, B., Thien Linh, V., Hong Quan, N., Thuy Binh, N., Thanh Toan, D., & Thi Lan, L. (2026). HIT4DAR: Holistic interaction transformer for driver action recognition. Tạp Chí Khoa Học Giao Thông Vận Tải, 77(4), 500-514. https://doi.org/10.47869/tcsj.77.4.12