Gen - AI Project Report
ON
GenAI-ModelHub
(SDG 9: Industry, Innovation and Infrastructure)
By
Submitted to
DEPARTMENT OF ARTIFICIAL INTELLIGENCE AND DATA SCIENCE
AISSMS INSTITUTE OF INFORMATION TECHNOLOGY (IOIT),
PUNE - 411001
Academic Year (2024-2025)
GenAI-ModelHub: Generative AI powered Data Science Automation platform
CERTIFICATE
GenAI-ModelHub
Submitted By
is a bonafide work carried out by them under the supervision of Mr. S. D. Kale and it is approved
for the partial fulfilment of the requirement of Savitribai Phule Pune University for the Project
Stage-1 in the Final Year of Artificial Intelligence and Data Science.
Place: Pune
ACKNOWLEDGEMENT
We would like to take this opportunity to thank everyone who contributed to this project
in numerous ways and who gave unending support right from the initial stage.
In particular, we wish to thank Mr. S. D. Kale, our internal project guide, for his timely
cooperation and valuable guidance, without which this project would not have been a
success. We thank him for painstakingly reviewing the entire project and for his
unerring ability to spot mistakes.
We would also like to thank our H.O.D., Dr. R.A. Jamadar, for his continuous
encouragement, support and guidance at each and every stage of the project.
Last but not least, we would like to thank all our friends who were associated with us
and helped us in preparing this project. The project named “GenAI-ModelHub” would not
have been possible without the extensive support of people who were directly or indirectly
involved in its successful execution.
ABSTRACT
Data science and machine learning are fields characterized by rapid growth and increasing complexity,
posing a range of challenges for practitioners. From inefficiencies in querying data with SQL and
Pandas to the complexities of model optimization and hyperparameter tuning, these tasks demand
considerable time and expertise. GenAI-ModelHub is designed to streamline these processes by offering an
integrated platform built on Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG).
By bridging the gap between SQL and Pandas for efficient data querying, generating robust baseline
models, and optimizing deep learning architectures, GenAI-ModelHub seeks to simplify and enhance the
data science workflow. The platform's contribution lies in reducing the technical barriers associated
with machine learning and data science tasks, so that users can build optimized models more efficiently.
This project aims to provide a practical solution to ongoing issues in model development and data
processing, thereby enhancing productivity and innovation in the field.
TABLE OF CONTENTS
CHAPTER 1
INTRODUCTION
1.1 Introduction
In the rapidly evolving fields of data science and machine learning, practitioners encounter numerous
challenges that hinder efficient workflow and optimal model performance. A significant issue lies in the
complexity and inefficiency of data querying, where transitioning between SQL databases and Pandas-
based data frames often leads to redundancies and errors. This inefficiency can slow down the entire data
processing pipeline, which is fundamental to model development. Furthermore, generating robust baseline
models, fine-tuning hyperparameters, and optimizing deep learning architectures are tasks that are not only
time-consuming but also demand a high level of expertise. Each of these tasks involves intricate knowledge
of various tools and techniques, from parameter optimization to model architecture design, which can
overwhelm even experienced data scientists.
The need for a new solution that can streamline these processes has become critical, particularly given the
growing demand for faster, more accurate model deployments. Existing platforms primarily focus on
specific aspects of model development, lacking comprehensive support that covers the full data science
workflow, from data querying to model optimization. GenAI-ModelHub is proposed as an integrated platform
to address this gap, leveraging Large Language Models (LLMs) and Retrieval-Augmented Generation
(RAG) to assist users across the entire data science pipeline. These technologies aim to make data querying
more efficient, help in generating robust baseline models, and facilitate fine-tuning and optimization in deep
learning models.
This report will discuss the detailed problem definition, justifying the need for an advanced platform like
GenAI-ModelHub. It will compare existing systems, highlighting how GenAI-ModelHub's use of
LLMs and RAG sets it apart. The report will also outline the organization of GenAI-ModelHub's features
and the benefits of its comprehensive approach, ultimately demonstrating how this platform can enhance
productivity and innovation in the data science community.
1.2 Motivation
The motivation for GenAI-ModelHub stems from the persistent challenges that data scientists and
machine learning practitioners face in managing complex workflows. In the data science pipeline, efficient
data querying and transformation are essential yet time-consuming tasks, especially when transitioning
between SQL-based databases and Pandas data frames for analysis. These transitions often require
duplicative efforts and a deep understanding of both environments, making it difficult to move quickly from
data retrieval to model building. Additionally, building robust baseline models and fine-tuning
hyperparameters involve intricate processes that require significant expertise. Deep learning models, while
powerful, further complicate the workflow with their demands on computational resources and knowledge
of architecture optimization. While there are individual tools for querying, model selection, and tuning, an
end-to-end platform that seamlessly integrates all these functionalities remains unavailable.
GenAI-ModelHub aims to address this gap by leveraging the latest advancements in Large Language
Models (LLMs) and Retrieval-Augmented Generation (RAG) to create a unified platform that supports
every stage of the data science process. By automating and simplifying data querying, GenAI-ModelHub
facilitates smoother transitions between SQL and Pandas, minimizing the time spent on redundant tasks.
The integration of LLMs allows the platform to assist in generating baseline models, offering intelligent
suggestions for hyperparameter tuning, and optimizing deep learning architectures, thus making advanced
techniques accessible to practitioners of all expertise levels. This comprehensive approach not only
improves productivity but also empowers data scientists to focus on deriving insights and creating
innovative solutions, freeing them from the complexities of managing fragmented tools and processes.
Data scientists and analysts frequently encounter challenges in efficiently transitioning between SQL and
Pandas for data manipulation and querying, leading to a workflow that is both cumbersome and prone to
errors. Moreover, the tendency to bypass the establishment of robust baseline models in favor of more
complex approaches often results in suboptimal outcomes and overlooked insights. Hyperparameter
tuning, a crucial yet complex process, demands expert knowledge and iterative experimentation, further
complicating the model development process. Additionally, the determination of optimal configurations
for convolutional layers, pooling layers, strides, and filters in deep learning models remains a significant
challenge, often relying on trial and error. This project addresses these critical challenges by developing
an integrated platform that seamlessly bridges the gap between SQL and Pandas, automates the creation
of baseline models, streamlines hyperparameter tuning, and optimizes deep learning architectures, all
while providing comprehensive guidance and support through advanced AI-driven tools.
1.4 Objectives
1. To convert user queries into SQL and Pandas queries using LLMs with RAG, supported by an
interactive chatbot for real-time assistance.
2. To efficiently generate baseline models with automated preprocessing, training, and accuracy
evaluation, complemented by data visualizations and detailed statistical insights.
3. To enable users to fine-tune model hyperparameters through an intuitive UI, providing explanations
on bias-variance trade-offs and recommending optimal settings.
4. To identify optimal deep learning parameters (e.g., convolution layers, pooling layers) for image
data through automated analysis and a user-friendly experimentation UI.
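Objective 4 depends on how convolution and pooling settings change the size of the feature maps. As a minimal sketch of that arithmetic, the standard output-size formula for a layer with input size n, kernel k, stride s and padding p is floor((n + 2p - k) / s) + 1. The helper names `conv_out` and `trace_shapes` and the example layer stack are hypothetical and are not taken from the project.

```python
def conv_out(n: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Spatial output size of a conv or pooling layer along one dimension."""
    if n + 2 * padding < kernel:
        raise ValueError("kernel larger than padded input")
    return (n + 2 * padding - kernel) // stride + 1


def trace_shapes(size: int, layers: list[dict]) -> list[int]:
    """Return the spatial size after each layer of an illustrative stack."""
    sizes = []
    for layer in layers:
        size = conv_out(
            size,
            layer["kernel"],
            layer.get("stride", 1),
            layer.get("padding", 0),
        )
        sizes.append(size)
    return sizes


# Illustrative stack for a 28x28 image:
# conv 3x3, max-pool 2x2 (stride 2), conv 3x3, max-pool 2x2 (stride 2).
stack = [
    {"kernel": 3},
    {"kernel": 2, "stride": 2},
    {"kernel": 3},
    {"kernel": 2, "stride": 2},
]
print(trace_shapes(28, stack))  # [26, 13, 11, 5]
```

An experimentation UI of the kind described in the objective could use a check like this to reject layer combinations that shrink the feature map below the kernel size before any training is started.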
CHAPTER 2
LITERATURE REVIEW
The paper by Rutuja Nikum, Vaishnavi Shinde, and Vijay Khadse focuses on a noise filter mechanism
within neural machine translation (NMT) for translating textual queries into Python source code using
transformer models. They developed a system employing a self-attention-based encoder-decoder
transformer architecture to translate English queries into Python code, achieving a BLEU score of 0.78. The
model was retrained with merged datasets to improve accuracy, and the resulting system includes a Flask-
based UI, enhancing user interaction and accessibility. This work demonstrates the potential of transformer-
based models in automated code generation for query-based programming assistance. [3]
Mohammed Latif Siddiq, Shafayat H. Majumder, Maisha R. Mim, Sourov Jajodia, and Joanna C. S. Santos
conducted an empirical study investigating the presence of code smells and security vulnerabilities in
transformer-based code generation techniques, specifically in models like GPT-Neo and GitHub Copilot.
Utilizing tools such as Pylint and Bandit, they discovered that code generated by these models often contains
issues rooted in the training datasets, which may have inadvertently included faulty code. Their findings
highlight a significant concern regarding the reliability of AI-generated code and underscore the need for
research on improving code quality in transformer-based generation tools. [4]
The research by M.R. Aadhil Rushdy and Uthayasanker Thayasivam proposes a Seq2Seq-based transformer
model with enhanced encoding and decoding techniques specifically for Text-to-SQL generation. Their
model achieved high performance on the Spider dataset, with an Exact Match accuracy of 72.7% and
Execution accuracy of 80.2%, illustrating its efficiency in converting natural language to SQL queries. This
study emphasizes the value of advanced encoding-decoding strategies in improving transformer-based
Text-to-SQL performance, a promising development for query translation applications in data science and
database management. [1]
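The two metrics quoted above measure different things, which a small example makes concrete. The sketch below is an illustration only: `exact_match` is a simplified string-level stand-in (the benchmark's own exact-match metric is more elaborate), `execution_match` compares returned rows, and the `emp` table and both queries are invented for the demonstration.

```python
import sqlite3


def exact_match(pred: str, gold: str) -> bool:
    """Compare two queries as text after trivial case/whitespace normalization."""
    def norm(query: str) -> str:
        return " ".join(query.lower().rstrip(";").split())
    return norm(pred) == norm(gold)


def execution_match(conn: sqlite3.Connection, pred: str, gold: str) -> bool:
    """Compare the rows the two queries return (row order ignored)."""
    try:
        pred_rows = conn.execute(pred).fetchall()
    except sqlite3.Error:
        return False  # a predicted query that fails to run counts as wrong
    gold_rows = conn.execute(gold).fetchall()
    return sorted(pred_rows) == sorted(gold_rows)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (name TEXT, dept TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO emp VALUES (?, ?, ?)",
    [("Asha", "ML", 90), ("Ravi", "ML", 70), ("Meera", "Ops", 60)],
)

gold = "SELECT name FROM emp WHERE salary > 65"
pred = "SELECT name FROM emp WHERE NOT (salary <= 65)"

print(exact_match(pred, gold))            # False: the query text differs
print(execution_match(conn, pred, gold))  # True: both return the same rows
```

This is why execution accuracy is usually the higher of the two figures: a generated query can be written differently from the reference and still return the correct result.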
In a comprehensive survey, Monica and Parul Agrawal review hyperparameter optimization methods
essential for enhancing machine learning model performance and generalization. Their study focuses on
techniques such as grid search, random search, and Bayesian optimization, as well as meta-heuristic
approaches like genetic and evolutionary algorithms. They emphasize the efficacy of Bayesian optimization
for complex tuning and propose that future work involving advanced meta-heuristic algorithms could
further streamline hyperparameter tuning. This survey offers valuable insights into the role of optimized
hyperparameters in achieving robust machine learning models and sets a foundation for exploring
innovative optimization strategies. [2]
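Of the methods surveyed, random search is the simplest to show in full: it samples configurations instead of enumerating them. The following is a minimal sketch under stated assumptions; `validation_score` is a made-up stand-in for a real validation metric, and the parameter names `lr` and `depth` and their ranges are hypothetical.

```python
import random


def validation_score(lr: float, depth: int) -> float:
    """Stand-in for a real validation metric; peaks at lr=0.1, depth=6."""
    return 1.0 - (abs(lr - 0.1) * 2 + abs(depth - 6) * 0.05)


def random_search(n_trials: int, seed: int = 0) -> tuple[dict, float]:
    """Sample n_trials random configurations and keep the best one."""
    rng = random.Random(seed)  # seeded so that runs are reproducible
    best_cfg, best_score = None, float("-inf")
    for _ in range(n_trials):
        cfg = {
            "lr": 10 ** rng.uniform(-3, 0),  # log-uniform between 0.001 and 1
            "depth": rng.randint(2, 10),     # inclusive on both ends
        }
        score = validation_score(**cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score


best_cfg, best_score = random_search(n_trials=50)
print(best_cfg, round(best_score, 3))
```

Sampling the learning rate on a log scale is a common choice because useful values often span several orders of magnitude. Bayesian optimization replaces the blind sampling step with a model that proposes the next configuration from the scores seen so far.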
Paper: SQL-PaLM: Improved large language model adaptation for Text-to-SQL (extended)
Authors: Ruoxi Sun, Sercan Ö. Arik, Alex Muzio, Lesly Miculicich, Satya Gundabathula, Pengcheng Yin,
Hanjun Dai, Hootan Nakhost, Rajarishi Sinha, Zifeng Wang, Tomas Pfister
Approach: SQL-PaLM uses the PaLM-2 LLM with few-shot prompting, instruction fine-tuning, and synthetic
data augmentation. Techniques like execution-based error filtering, selective column encoding, and
test-time refinement are applied to enhance Text-to-SQL accuracy on complex databases.
Findings: SQL-PaLM improves Text-to-SQL accuracy on Spider and BIRD by expanding training data
diversity, integrating relevant database content, and optimizing input size via column selection.
Test-time refinement further boosts accuracy, offering insights into model strengths and real-world
scaling.
CHAPTER 3
REQUIREMENT & ANALYSIS
The requirement analysis for GenAI-ModelHub involves identifying and outlining the technical, functional,
and operational needs necessary to create a robust platform that integrates data science and machine learning
workflows. This section outlines the specific requirements in steps and points under each step to ensure a
structured approach for development.
o Provide accurate data translation between SQL and Pandas without loss of data integrity or context.
The requirement analysis for GenAI-ModelHub focuses on creating a unified platform that simplifies
complex data science and machine learning workflows. A core requirement is the ability to support both
SQL and Pandas queries, enabling seamless data manipulation and integration from various sources without
redundant transformations. This dual-query capability will bridge the common gap between structured
database environments and Python’s data handling, saving practitioners time and effort. Another key
functional requirement is the generation of baseline machine learning models. By providing automated
options for creating initial models based on data characteristics, GenAI-ModelHub will streamline model
development for users with varying levels of expertise. To maximize usability, the platform will feature a
user-friendly interface that integrates real-time feedback and visualizations, allowing users to track data
insights and model performance effortlessly.
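The report does not fix which baseline models are generated. As a minimal sketch of the idea of choosing an initial model from data characteristics, the example below infers the task type from the target column and fits the simplest possible reference model, a constant prediction. The function names, the threshold of 10 distinct values, and the sample labels are all assumptions made for illustration.

```python
from collections import Counter
from statistics import mean


def infer_task(target: list) -> str:
    """Heuristic: numeric targets with many distinct values mean regression."""
    numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in target
    )
    if numeric and len(set(target)) > 10:  # threshold is an arbitrary assumption
        return "regression"
    return "classification"


def fit_baseline(train_target: list) -> tuple[str, object]:
    """Return (task, constant prediction): majority class or mean value."""
    task = infer_task(train_target)
    if task == "classification":
        return task, Counter(train_target).most_common(1)[0][0]
    return task, mean(train_target)


def score_baseline(task: str, prediction, test_target: list) -> float:
    """Accuracy for classification, mean absolute error for regression."""
    if task == "classification":
        return sum(y == prediction for y in test_target) / len(test_target)
    return mean(abs(y - prediction) for y in test_target)


labels = ["spam", "ham", "ham", "ham", "spam", "ham"]
task, pred = fit_baseline(labels[:4])
print(task, pred, score_baseline(task, pred, labels[4:]))  # classification ham 0.5
```

Any trained model the platform proposes should beat this score; if it does not, the features or the preprocessing deserve a second look before tuning begins.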
Beyond the core functional requirements, GenAI-ModelHub requires advanced technical capabilities,
particularly in hyperparameter optimization and deep learning model customization. Techniques such as
grid search, random search, and Bayesian optimization will be built into the platform, with options to apply
meta-heuristic algorithms for more effective tuning. The platform also needs robust backend processing for
handling data-intensive tasks and model training. Integrating Large Language Models (LLMs) and
Retrieval-Augmented Generation (RAG) technologies will enable automated translation of data queries and
boost user efficiency. Security and compliance are additional requirements, as the platform must safeguard
user data and restrict access to authorized users, ensuring that data and model integrity are maintained
throughout the workflow. Together, these requirements lay the foundation for a powerful, accessible tool
that addresses the pressing needs of modern data science and machine learning practitioners.
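The retrieval step of RAG can be sketched in a few lines. The example below is a simplified illustration and not the platform's implementation: it ranks schema descriptions against the user's question with bag-of-words cosine similarity, where a production system would more likely use embeddings and a vector store. The schema descriptions, the question and all function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> Counter:
    """Lowercased bag of words."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bags of words."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(question: str, documents: list[str], k: int = 2) -> list[str]:
    """Return the k documents most similar to the question."""
    q = tokenize(question)
    ranked = sorted(documents, key=lambda d: cosine(q, tokenize(d)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, context: list[str]) -> str:
    """Place the retrieved snippets ahead of the question (the augmentation step)."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Schema context:\n{joined}\n\nQuestion: {question}\nSQL:"


# Hypothetical schema descriptions standing in for an indexed knowledge base.
docs = [
    "table orders: order id, customer id, order date, total amount",
    "table customers: customer id, name, city",
    "table products: product id, title, price",
]
question = "total amount of orders per customer city"
print(build_prompt(question, retrieve(question, docs)))
```

Only the two relevant tables reach the prompt, which keeps the input short and gives the model less irrelevant schema to confuse with the tables it needs.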
Libraries: scikit-learn, TensorFlow, LangChain, Google.genAI, Flask, Pandas, NumPy, PyTorch
Database: SQL
Platform: Windows
Language: Python
CHAPTER 4
DESIGN
4.1 Architecture
1. Query Translator:
The Query Translator module serves as a bridge between SQL and Pandas, enabling users to translate
their natural language queries into executable SQL and Pandas code.
The steps involved are:
A) Upload Database: Users upload a database file (e.g., .db).
B) Natural Language Input: Users input their queries in natural language.
C) LLM Processing: The system processes the query using an LLM to understand the intent and
generate the appropriate SQL and Pandas code.
D) Output Generation: The generated output includes the SQL query and Pandas code along with a
description explaining the operations performed.
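Steps A to D above can be sketched as follows. This is a minimal illustration under stated assumptions: an in-memory SQLite database stands in for the uploaded .db file, `fake_llm` is a placeholder that returns a canned answer where the real system would call an LLM, and the read-only check is a deliberately naive guard. None of the function names come from the project.

```python
import sqlite3


def read_schema(conn: sqlite3.Connection) -> str:
    """Collect the CREATE TABLE statements of the uploaded database."""
    rows = conn.execute(
        "SELECT sql FROM sqlite_master WHERE type = 'table' AND sql IS NOT NULL"
    ).fetchall()
    return "\n".join(row[0] for row in rows)


def fake_llm(prompt: str) -> dict:
    """Placeholder for the LLM call; returns a canned answer for the demo."""
    return {
        "sql": "SELECT dept, AVG(salary) AS avg_salary FROM emp GROUP BY dept",
        "pandas": "df.groupby('dept', as_index=False)['salary'].mean()",
        "description": "Average salary per department.",
    }


def translate(conn: sqlite3.Connection, question: str, llm=fake_llm) -> dict:
    """Steps B-D: build the prompt, call the model, run the SQL read-only."""
    prompt = f"Schema:\n{read_schema(conn)}\n\nQuestion: {question}"
    answer = llm(prompt)
    # Naive guard: refuse anything that is not a plain SELECT statement.
    if not answer["sql"].lstrip().lower().startswith("select"):
        raise ValueError("only SELECT statements are executed")
    answer["rows"] = conn.execute(answer["sql"]).fetchall()
    return answer


conn = sqlite3.connect(":memory:")  # stands in for the uploaded .db file
conn.execute("CREATE TABLE emp (name TEXT, dept TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO emp VALUES (?, ?, ?)",
    [("Asha", "ML", 90), ("Ravi", "ML", 70), ("Meera", "Ops", 60)],
)
result = translate(conn, "What is the average salary in each department?")
print(result["sql"], result["pandas"], result["rows"], sep="\n")
```

Passing the schema in the prompt lets the model refer to real table and column names, and validating the generated SQL before execution keeps a faulty or malicious generation from modifying the user's database.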
3. Hyperparameter Tuning:
This module focuses on optimizing the hyperparameters of machine learning models to improve
performance.
The steps involved are:
A) Upload Cleaned Data: Users upload cleaned data files (e.g., .csv, .json).
B) Model Selection: Users select their preferred machine learning model.
C) Hyperparameter Tuning: Hyperparameters are tuned using techniques like Grid Search Cross-
Validation (CV) and Bayesian Optimization.
D) Accuracy Evaluation: The system provides accuracy metrics based on the tuned hyperparameters.
E) Code and Description: The system generates the overall code block and provides explanations on
how each hyperparameter affects the bias-variance trade-off.
F) Recommendation: The system recommends optimal hyperparameters for the user’s model.
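The mechanics of step C can be shown with a small pure-Python sketch of grid search with k-fold cross-validation. It is an illustration only: in practice scikit-learn's `GridSearchCV` would be the natural choice, and the k-nearest-neighbour classifier, the toy two-class data and the parameter grid below are assumptions chosen to keep the example self-contained.

```python
from collections import Counter
from itertools import product


def knn_predict(train, point, k, p):
    """Majority vote of the k nearest training rows.

    Uses the Minkowski-p distance without the final root, which does not
    change the neighbour ordering.
    """
    def dist(a, b):
        return sum(abs(x - y) ** p for x, y in zip(a, b))
    nearest = sorted(train, key=lambda row: dist(row[0], point))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def cv_accuracy(data, folds, **params):
    """Mean validation accuracy over `folds` interleaved splits."""
    scores = []
    for f in range(folds):
        val = data[f::folds]
        train = [row for i, row in enumerate(data) if i % folds != f]
        hits = sum(knn_predict(train, x, **params) == y for x, y in val)
        scores.append(hits / len(val))
    return sum(scores) / folds


def grid_search(data, grid, folds=3):
    """Score every combination in `grid`; return (best_params, best_score, results)."""
    results = []
    for values in product(*grid.values()):
        params = dict(zip(grid.keys(), values))
        results.append((params, cv_accuracy(data, folds, **params)))
    best_params, best_score = max(results, key=lambda r: r[1])
    return best_params, best_score, results


# Toy, well-separated two-class data; classes are interleaved so that every
# fold contains both labels.
class_a = [(0, 0), (0, 1), (1, 0), (1, 1), (0, 2), (2, 0)]
class_b = [(5, 5), (5, 6), (6, 5), (6, 6), (4, 5), (5, 4)]
data = [row for a, b in zip(class_a, class_b) for row in ((a, "a"), (b, "b"))]

grid = {"k": [1, 3, 5], "p": [1, 2]}
best_params, best_score, results = grid_search(data, grid)
for params, score in results:
    print(params, score)
```

For this classifier the bias-variance explanation in step E would read roughly as follows: a small k follows individual training points closely (low bias, high variance), while a larger k averages over more neighbours (higher bias, lower variance).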
4.3 Algorithm Strategy
For the GenAI-ModelHub project, the algorithm strategy involves combining a series of machine
learning and deep learning techniques to ensure an efficient, seamless process for data querying, model
generation, optimization, and deployment. Here's an outline of the algorithm strategy:
CHAPTER 5
CONCLUSION
5.1 Conclusion
In conclusion, the GenAI-ModelHub project offers a powerful, integrated solution to the challenges faced
by data scientists and machine learning practitioners. By combining seamless data querying, automated
baseline model generation, and advanced hyperparameter optimization techniques, the platform simplifies
and accelerates the process of building and refining machine learning models. Leveraging cutting-edge
technologies such as Large Language Models and Retrieval-Augmented Generation, it enhances both the
efficiency and user experience of model development. With a user-friendly interface and robust backend
support, GenAI-ModelHub streamlines workflows, making complex data science tasks more accessible and
less time-consuming, ultimately empowering users to build high-performance models more effectively.
CHAPTER 6
PROJECT STAGE-2 PLAN
CHAPTER 7
REFERENCES
7.1 Reference:
[1] M.R. Aadhil Rushdy and Uthayasanker Thayasivam, "Application of Noise Filter Mechanism for
T5-Based Text-to-SQL Generation," Department of Computer Science and Engineering, University of
Moratuwa, Katubedda, Sri Lanka.
[2] Monica and Parul Agrawal, "A Survey on Hyperparameter Optimization of Machine Learning Models,"
Department of CSE&IT, Jaypee Institute of Information Technology, Noida, India.
[3] Rutuja Nikum, Vaishnavi Shinde, and Vijay Khadse, "Textual Query Translation into Python Source
Code using Transformers," Computer Engineering, College of Engineering Pune, India.
[4] Mohammed Latif Siddiq, Shafayat H. Majumder, Maisha R. Mim, Sourov Jajodia, and Joanna C. S.
Santos, "An Empirical Study of Code Smells in Transformer-based Code Generation Techniques,"
Department of Computer Science and Engineering, University of Notre Dame, USA (Siddiq, Santos);
Department of Computer Science, Bangladesh University of Engineering and Technology, Dhaka,
Bangladesh (Majumder, Mim, Jajodia).
[5] Krishna Khadka, Jaganmohan Chandrasekaran, Yu Lei, Raghu N. Kacker, and D. Richard Kuhn,
"A Combinatorial Approach to Hyperparameter Optimization," University of Texas at Arlington,
Arlington, TX, USA (Khadka, Lei); National Security Institute, Virginia Tech, Arlington, VA, USA
(Chandrasekaran); Information Technology Laboratory, National Institute of Standards and
Technology (Kacker, Kuhn).
[6] Haidar Osman, Mohammad Ghafari, and Oscar Nierstrasz, "Hyperparameter Optimization to Improve
Bug Prediction Accuracy," Software Composition Group, University of Bern, Bern, Switzerland.
[7] Abhilasha Kate, Satish Kamble, Aishwarya Bodkhe, and Mrunal Joshi, "Conversion of Natural
Language Query to SQL Query," Dept. of IT Engineering, PVG’s COET, Pune, India.
[8] Quoc V. Le, Jiquan Ngiam, Adam Coates, Abhik Lahiri, Bobby Prochnow, and Andrew Y. Ng,
"On Optimization Methods for Deep Learning," Computer Science Department, Stanford University,
Stanford, CA 94305, USA.
APPENDIX
Map the project title with POs and PSOs on a scale of 3.
3 - Substantial mapping
2 - Moderate mapping
1 - Low mapping
Detailed mapping:
POs:
1. Engineering Knowledge: The project applies advanced AI techniques like LLMs and RAG to solve
complex data science problems, combining knowledge from computer science, data engineering, and AI to
develop robust solutions.
2. Problem Analysis: It addresses the need for a unified data science platform that simplifies the transition
between SQL and Pandas, providing substantiated solutions using data preprocessing and model generation.
3. Design/Development of Solutions: The platform is designed to meet user needs for efficient model
development and optimization, incorporating elements of usability, reliability, and adaptability.
4. Conduct Investigations of Complex Problems: By automating data processing and analysis, the project
uses a research-driven approach to streamline investigations and deliver actionable insights.
5. Modern Tool Usage: Leveraging modern AI tools and techniques, the platform facilitates complex tasks
such as data querying, model training, and tuning, making them accessible to users with diverse technical
skills.
7. Environment and Sustainability: The project promotes sustainable data science practices by reducing
redundant work and optimizing resource use, potentially lowering computational costs and environmental
impact.
8. Ethics: GenAI-ModelHub supports responsible data handling and transparency, ensuring AI-driven tasks
are ethically managed and users understand the impact of their models.
9. Individual and Team Work: The platform is designed to support both individual users and teams, enabling
collaborative and efficient workflows across different levels of expertise.
10. Communication: Through an intuitive chatbot interface, the project enhances communication with users
by explaining processes and providing real-time guidance, making data science more approachable.
11. Project Management and Finance: The phased development approach (research, design, development,
testing, deployment) reflects sound project management principles to deliver a high-quality platform
efficiently.
12. Life-long Learning: By providing an accessible and interactive data science platform, GenAI-ModelHub
fosters continued learning for users, especially those looking to improve their data science skills or explore
AI-driven tools.
PSOs:
1. Applying domain-specific knowledge:
o GenAI-ModelHub uses domain-specific AI technologies like Large Language Models (LLMs) and
Retrieval-Augmented Generation (RAG) to develop tools that enhance data science workflows.
o The platform is tailored to support data science applications, including model development,
optimization, and data querying, specific to electronics and telecommunication-related workflows in
data science.
2. Selecting and using software tools efficiently:
o The platform incorporates various software tools like SQL for data querying, Pandas for data
manipulation, and automated deep learning model generation tools, helping users efficiently manage
complex data science tasks.
o The intelligent chatbot and automation features simplify the use of multiple tools, making it easier
for users to navigate through data science workflows without needing in-depth expertise in each tool.