Fig. 1. Data Flow Diagram
V. DATA COLLECTION
Several open-source datasets address this task, including MS COCO (with more than 120k captioned images), Flickr8k (8k images), and Flickr30k (30k images). The dataset considered for this case study is Flickr8k. It contains 8,000 images, each paired with five reference captions. The images are divided into three parts: a training set of 6,000 images, a development set of 1,000 images, and a test set of 1,000 images. The captions are manually annotated and designed to be both descriptive and informative. Literature surveys show that this dataset has been widely used in image-captioning research. Its main advantage is that it provides a large and diverse set of images, allowing researchers to train and test models on a wide range of visual content.
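To make this split concrete, the following minimal Python sketch shows how the Flickr8k caption file and the official split lists could be parsed. It assumes the standard distribution layout (Flickr8k.token.txt with tab-separated image_name#index and caption lines, plus Flickr_8k.trainImages.txt, Flickr_8k.devImages.txt, and Flickr_8k.testImages.txt); the paths and helper names are illustrative rather than taken from this work.

```python
# Minimal sketch: parse Flickr8k captions and group them by image,
# then load the official train/dev/test splits. File names follow the
# standard Flickr8k distribution; adjust paths to your local copy.
from collections import defaultdict

def load_captions(token_path="Flickr8k.token.txt"):
    """Return {image_id: [caption, ...]} with five captions per image."""
    captions = defaultdict(list)
    with open(token_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            # Each line looks like: "1000268201_693b08cb0e.jpg#0<TAB>A child in a pink dress ..."
            image_tag, caption = line.split("\t", 1)
            image_id = image_tag.split("#")[0]
            captions[image_id].append(caption)
    return captions

def load_split(split_path):
    """Read one of the official split files (a list of image file names)."""
    with open(split_path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

captions = load_captions()
train_ids = load_split("Flickr_8k.trainImages.txt")  # 6000 images
dev_ids = load_split("Flickr_8k.devImages.txt")      # 1000 images
test_ids = load_split("Flickr_8k.testImages.txt")    # 1000 images
train_captions = {k: v for k, v in captions.items() if k in train_ids}
print(len(train_ids), len(dev_ids), len(test_ids), len(train_captions))
```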
VI. CONCLUSION
We have reviewed deep-learning-based methods for image captioning, outlined the benefits and drawbacks of each approach, offered a taxonomy of image-captioning techniques, and shown broad block diagrams of the main categories. We also briefly discussed possible directions for future research in this field. Although deep-learning-based image-captioning algorithms have advanced significantly in recent years, a dependable method that can produce high-quality captions for nearly all images has not yet been developed. With the advent of new deep-learning network designs, automatic image captioning will remain an active research topic for some time.
The Flickr8k dataset used in this work comprises over 8,000 photographs, and their captions are provided in an accompanying text file. Up to this point in the project, we have completed text cleaning, image preprocessing, and dataset exploration. The potential of image captioning will only grow, because more individuals use social media every day and the majority of them post images.
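As an illustration of the text-cleaning and image-preprocessing steps mentioned above, the following minimal sketch shows one common way such steps are implemented for a CNN-RNN captioning pipeline. The specific cleaning rules, the startseq/endseq tokens, and the 224x224 input size are typical choices assumed for illustration, not necessarily the exact settings used in this work.

```python
# Minimal sketch of caption cleaning and image preprocessing; the cleaning
# rules and the 224x224 target size are common defaults, not this paper's
# exact pipeline.
import string
import numpy as np
from PIL import Image

def clean_caption(text):
    """Lowercase, strip punctuation, keep alphabetic words longer than one character."""
    table = str.maketrans("", "", string.punctuation)
    words = text.lower().translate(table).split()
    words = [w for w in words if len(w) > 1 and w.isalpha()]
    # Start/end tokens mark sequence boundaries for the RNN decoder.
    return "startseq " + " ".join(words) + " endseq"

def preprocess_image(path, size=(224, 224)):
    """Resize an image and scale pixel values to [0, 1] for a CNN feature extractor."""
    img = Image.open(path).convert("RGB").resize(size)
    return np.asarray(img, dtype=np.float32) / 255.0

print(clean_caption("A child in a pink dress is climbing up a set of stairs."))
```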
VII. FUTURE SCOPE
Image caption generation has considerable potential for future development and applications. To improve caption quality, multi-modal learning could be introduced, combining information from different sources such as images, videos, and text. In self-driving cars, captions could be generated in real time to enhance the functionality of such applications, and this capability could also benefit visually impaired individuals. Currently, many caption-generation models produce captions only in English, but there is growing demand for captions in multiple languages, which would enable communication across different cultures and communities. Today's models typically produce a single-line caption for an image; this could be improved so that the content of the image is explained in richer natural language.
There is also future scope in the medical field, such as disease diagnosis, where an image caption generator could automatically produce accurate and informative descriptions of medical images such as X-rays and CT scans.
ACKNOWLEDGEMENT
We express our appreciation to Pune Institute of Computer Technology for providing this project-based learning opportunity. We extend our sincere gratitude to our project guide, Dr. S. T. Gandhe, and our project co-guide, Prof. V. A. Patil, for their unwavering support and helpful suggestions throughout the work, which enabled us to complete stage 1 of the project, titled "Image description generator using recurrent neural and convolution network," successfully. We also appreciate the cooperation of the Director, Dr. P. T. Kulkarni, the Principal, Dr. S. T. Gandhe, and the Head of Department, Dr. Mousami Munot. Finally, we value the support and assistance of our friends in finishing the project.