MULTIMODAL IMAGE RECOGNITION SYSTEM BASED ON CLIP VIT-B/32 WITH SIMILARITY ANALYSIS AND VISUALIZATION

Authors

Sadriddinzoda Nekruz – PhD student, Department of Digital Economy, Polytechnic Institute of the Tajik Technical University named after Academician M.S. Osimi, Khujand, Republic of Tajikistan, nekruzjons2000@gmail.com

Abstract

This article presents the development and study of an intelligent system for automatic product recognition and search based on machine learning methods and vector representations (embeddings). The system uses the pre-trained multimodal CLIP (Contrastive Language-Image Pre-training) neural network model to extract features from product images and encode them as embeddings. It is implemented as a Python web application built on the Flask framework that supports indexing products by visual features, searching by image, managing the product and store database, and visually analyzing the feature space. The solution enables automatic image comparison and retrieval of the most similar objects, and it demonstrates high accuracy in searching for similar products. Similarity metrics were analyzed, and threshold values for filtering results were proposed. The key features of the system are modularity, scalability, and a REST API for integration with third-party services. The system can be used to automate inventory, logistics, and customer service processes in retail, and it can be scaled for practical use in trading and recommendation systems.
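The similarity search with threshold filtering described in the abstract could look roughly like the minimal sketch below. It is an illustration under stated assumptions, not the article's implementation: cosine similarity is assumed as the metric, the names `cosine_similarity` and `search_similar`, the product identifiers, and the threshold of 0.8 are all hypothetical, and short toy vectors stand in for the 512-dimensional embeddings that CLIP ViT-B/32 produces.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumed convention: a zero vector matches nothing
    return dot / (norm_a * norm_b)


def search_similar(query, index, threshold=0.8, top_k=5):
    """Return up to top_k (product_id, score) pairs whose score is at least
    the threshold, ordered from most to least similar.

    The threshold value is an illustrative assumption, not a value
    taken from the article.
    """
    scored = [(pid, cosine_similarity(query, emb)) for pid, emb in index.items()]
    hits = [(pid, score) for pid, score in scored if score >= threshold]
    hits.sort(key=lambda item: item[1], reverse=True)
    return hits[:top_k]


# Toy 3-dimensional vectors standing in for real image embeddings.
index = {
    "mug_red": [1.0, 0.0, 0.0],
    "mug_blue": [0.9, 0.1, 0.0],
    "notebook": [0.0, 1.0, 0.0],
}

# The query is close to both mugs and far from the notebook.
results = search_similar([1.0, 0.05, 0.0], index, threshold=0.8)
for product_id, score in results:
    print(f"{product_id}: {score:.4f}")
```

In a real deployment the linear scan over a dictionary would normally be replaced by a vector index, and the threshold would be tuned on labelled product pairs.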

Keywords

machine learning, image recognition, computer vision, image search, contrastive learning, CLIP, vector representations, Flask, embeddings.

Publish date

2026-04-03