SOWNDAR Document
WORK
Submitted by
SOWNDARRAJAN N
of
MCA
(AUTONOMOUS)
JUNE 2023
BONAFIDE CERTIFICATE
SIGNATURE SIGNATURE
DECLARATION
I affirm that the project work titled “EDUCATIONAL DATA MINING TO SUPPORT
PROGRAMMING LEARNING USING PROBLEM-SOLVING DATA” being submitted
in partial fulfillment for the award of MASTER OF COMPUTER APPLICATIONS is the
original work carried out by me. It has not formed part of any other project work
submitted for the award of any degree or diploma, either in this or any other university.
Associate Professor,
ABSTRACT
(i) problem-solving data collection (logs and scores are collected from the OJ) and preprocessing;
(ii) the MK-means clustering algorithm is used to cluster the data in Euclidean space; statistical features are extracted from each cluster; the frequent pattern (FP)-growth algorithm is applied to each cluster to mine data patterns and association rules;
(iii) a set of suggestions is provided on the basis of the extracted features, data patterns, and rules. Different parameters are adjusted to achieve the best results for the clustering and association rule mining algorithms.
ACKNOWLEDGEMENT
I convey my deep sense of gratitude to the almighty, who has helped me all the way through
my life and molded me into what I am today.
I would like to express my profuse gratitude to our Founder, Correspondent and President of
K.S.R Group of Institutions, Theivathiru. Lion. Dr. K. S. RANGASAMY, M.J.F., for
providing extraordinary infrastructure, which helped me complete the project on time.
I wish to thank our Head of the Department Dr. P. ANITHA, M.C.A., M. Phil., Ph.D., for
giving me this opportunity with full encouragement to complete this project.
I would like to thank my project guide DR. M. GEETHA M.C.A., M. Phil, Ph.D.,
for guiding me at various stages of the project.
I express my sincere thanks to Mr. S. PRAKASH, M.C.A., IMMACULATE
TECHNOLOGIES, Coimbatore, for his valuable help and encouragement in this project.
I whole-heartedly thank my beloved friends for their suggestions and timely help throughout
my project work. Finally, I thank my parents for their moral support and encouragement,
without whom successful completion of this project would not have been possible.
SOWNDARRAJAN N
TABLE OF CONTENTS
ABSTRACT
ACKNOWLEDGEMENT
LIST OF TABLES
LIST OF FIGURES
1 INTRODUCTION
1.2 OBJECTIVES
2 LITERATURE OF STUDY
3 SYSTEM ANALYSIS
4 SYSTEM SPECIFICATIONS
5 SYSTEM STUDY
5.1 FEASIBILITY STUDY
6 SOFTWARE DESCRIPTION
6.1 ANACONDA
6.2 OVERVIEW
6.3 ANACONDA NAVIGATOR
6.4 THE NOTEBOOK INTERFACE
6.5 PYTHON
6.7 INTEGRATIONS
7 PROJECT DESCRIPTION
8 TESTING AND IMPLEMENTATION
8.1 IMPLEMENTATION
9 CONCLUSION AND FUTURE ENHANCEMENT
9.1 CONCLUSION
10 APPENDICES
10.1 SOURCE CODE
10.2 SCREENSHOTS
10.3 JOURNAL
10.5 RESUME
11 REFERENCES
LIST OF TABLES
7.4.1 STUDENT EDUCATIONAL DATASET DESIGN
7.4.2 STUDENT EDUCATIONAL DATASET DESIGN LEVEL-2
LIST OF FIGURES
CHAPTER 1
INTRODUCTION
The indispensable factors, which give a competitive advantage over others in the market,
may be stated as:
• Performance
• Pioneering efforts
• Client satisfaction
• Innovative concepts
• Constant Evaluations
• Improvisation
• Cost Effectiveness
As a team, we have a clear vision and work to realize it. As a statistical measure, the team has more than 40,000 hours of expertise in providing real-time solutions in the fields of Android mobile app development, networking, web designing, secure computing, mobile computing, cloud computing, image processing and implementation, networking with the OMNeT++ simulator, client-server technologies in Java (J2EE/J2ME/EJB), Android, .NET (ASP.NET, VB.NET, C#.NET), MATLAB, NS2, Simulink, embedded systems, power electronics, VB and VC++, Oracle, and operating system concepts with Linux.
OUR VISION:
1.2 OBJECTIVES
CHAPTER 2
LITERATURE OF STUDY
Huge amounts of educational data are being produced, and a common challenge that many educational organizations confront is finding an effective method to harness and analyze this data to continuously deliver enhanced education. Nowadays, educational data is evolving and has become large in volume, wide in variety, and high in velocity. This data needs to be handled in an efficient manner to extract value and make informed decisions. To that end, the paper treats such data as a big data challenge and presents a comprehensive platform tailored to perform educational big data analytical applications. It further presents an effective environment for non-data scientists and people in the educational sector to apply their demanding educational big data applications. The implementation stages of the educational big data platform on a cloud computing platform and the organization of educational data in a data lake architecture are highlighted. Furthermore, two analytical applications are performed to test the feasibility of the presented platform in discovering knowledge that can potentially benefit educational institutions.
Educational data mining (EDM) has wide applications and significance across the world. In this paper, a novel framework to extract hidden features and related association rules using a real-world dataset is proposed. An unsupervised k-means clustering algorithm is applied for data clustering, and then the frequent pattern-growth algorithm is used for association rule mining. Students' programming logs and academic scores are leveraged as an experimental dataset. The programming logs are collected from an online judge (OJ) system, as OJs play a key role in conducting programming practices, competitions, assignments, and tests. To explore the correlation between practical skills (e.g., programming and logical implementation) and overall academic performance, the statistical features of students are analyzed and the related results are presented. A number of useful recommendations are provided for students in each cluster based on the identified hidden features. In addition, the analytical results of this paper can help teachers prepare effective lesson plans, evaluate programs with special arrangements, and identify the academic weaknesses of students. Moreover, a prototype of the proposed approach and the data-driven analytical results can be applied to other practical courses in ICT or engineering disciplines.
CHAPTER 3
SYSTEM ANALYSIS
DISADVANTAGES
• The growing ratio between students and educators raises the question of how to
provide individual support to students to improve their problem-solving skills.
• Especially, when learning computer programming, students need a lot of practice and
individual tutoring to improve their programming knowledge and skills.
• Computer programming is one of the fundamental courses in the ICT discipline.
ADVANTAGES
CHAPTER 4
SYSTEM SPECIFICATIONS
• Mouse : Logitech.
• RAM : 4 GB.
CHAPTER 5
SYSTEM STUDY
5.1 FEASIBILITY STUDY
The feasibility of the project is analyzed in this phase, and a business proposal is put forth with a very general plan for the project and some cost estimates. During system analysis, the feasibility study of the proposed system is carried out. This is to ensure that the proposed system is not a burden to the company. For feasibility analysis, some understanding of the major requirements for the system is essential. The three key considerations involved in the feasibility analysis are:
• Economical Feasibility
• Technical Feasibility
• Social Feasibility
5.2 ECONOMICAL FEASIBILITY
This study is carried out to check the economic impact that the system will have on the organization. The amount of funds that the company can pour into the research and development of the system is limited. The expenditures must be justified. The developed system is well within the budget, and this was achieved because most of the technologies used are freely available. Only the customized products had to be purchased.
5.3 TECHNICAL FEASIBILITY
This study is carried out to check the technical feasibility, that is, the technical requirements of the system. Any system developed must not place a high demand on the available technical resources, as this would lead to high demands being placed on the client. The developed system must have modest requirements, as only minimal or no changes are required for implementing this system.
5.4 SOCIAL FEASIBILITY
This aspect of the study is to check the level of acceptance of the system by the user. This includes the process of training the user to use the system efficiently. The user must not feel threatened by the system, but must instead accept it as a necessity. The level of acceptance by the users solely depends on the methods that are employed to educate the user about the system and to make the user familiar with it. The user's level of confidence must be raised so that he or she is also able to make constructive criticism, which is welcomed, as he or she is the final user of the system.
CHAPTER 6
SOFTWARE ENVIRONMENT
6.1 ANACONDA:
Anaconda is a distribution of the Python and R programming languages for scientific computing that aims to simplify package management and deployment. Package versions in Anaconda are managed by the package management system conda. This package manager was spun out as a separate open-source package as it ended up being useful on its own and for things other than Python. There is also a small, bootstrap version of Anaconda called Miniconda, which includes only conda, Python, the packages they depend on, and a small number of other packages.
6.2 OVERVIEW:
Anaconda distribution comes with over 250 packages automatically installed, and over 7,500 additional open-source packages can be installed from PyPI, as well as the conda package and virtual environment manager. It also includes a GUI, Anaconda Navigator, as a graphical alternative to the command-line interface (CLI).

The big difference between conda and the pip package manager is in how package dependencies are managed, which is a significant challenge for Python data science and the reason conda exists. Before version 20.3, when pip installed a package, it automatically installed any dependent Python packages without checking whether these conflicted with previously installed packages. It would install a package and any of its dependencies regardless of the state of the existing installation. Because of this, a user with a working installation of, for example, TensorFlow could find that it stopped working after using pip to install a different package that requires a different version of the dependent numpy library than the one used by TensorFlow. In some cases, the package would appear to work but produce different results in detail. While pip has since implemented consistent dependency resolution, this difference accounts for a historical differentiation of the conda package manager.

In contrast, conda analyses the current environment, including everything currently installed, and, together with any version limitations specified (e.g., the user may wish to have TensorFlow version 2.0 or higher), works out how to install a compatible set of dependencies, and shows a warning if this cannot be done. Open-source packages can be individually installed from the Anaconda repository, Anaconda Cloud (anaconda.org), or the user's own private repository or mirror, using the conda install command.
6.3 ANACONDA NAVIGATOR:
Anaconda Navigator is a desktop graphical user interface (GUI) included in the Anaconda distribution that allows users to launch applications and manage conda packages, environments, and channels without using command-line commands. The following applications are available by default in Navigator:
• JupyterLab
• Jupyter Notebook
• QtConsole
• Spyder
• Glue
• Orange
• RStudio
• Visual Studio Code
CONDA:
JUPYTER NOTEBOOK:
A notebook integrates code and its output into a single document that combines
visualizations, narrative text, mathematical equations, and other rich media. In other words:
it's a single document where you can run code, display the output, and also add explanations,
formulas, charts, and make your work more transparent, understandable, repeatable, and
shareable. Using Notebooks is now a major part of the data science workflow at companies
across the globe. If your goal is to work with data, using a Notebook will speed up your
workflow and make it easier to communicate and share your results. Best of all, as part of the
open source Project Jupyter, Jupyter Notebooks are completely free. You can download the
software on its own, or as part of the Anaconda data science toolkit.
INSTALLATION:
The easiest way for a beginner to get started with Jupyter Notebooks is by installing
Anaconda. Anaconda is the most widely used Python distribution for data science and comes
pre-loaded with all the most popular libraries and tools. Some of the biggest Python libraries
included in Anaconda include NumPy, pandas, and Matplotlib, though the full 1000+ list is
exhaustive. Anaconda thus lets us hit the ground running with a fully stocked data science
workshop without the hassle of managing countless installations or worrying about
dependencies and OS-specific (read: Windows-specific) installation issues.
In this section, we’re going to learn to run and save notebooks, familiarize ourselves with
their structure, and understand the interface. We’ll become intimate with some core
terminology that will steer you towards a practical understanding of how to use Jupyter
Notebooks by yourself and set us up for the next section, which walks through an example
data analysis and brings everything we learn here to life.
RUNNING JUPYTER:
On Windows, you can run Jupyter via the shortcut Anaconda adds to your start menu,
which will open a new tab in your default web browser that should look something like the
following screenshot.
This isn’t a notebook just yet, but don’t panic! There’s not much to it. This is the
Notebook Dashboard, specifically designed for managing your Jupyter Notebooks. Think of it
as the launchpad for exploring, editing and creating your notebooks.
Be aware that the dashboard will give you access only to the files and sub-folders
contained within Jupyter’s start-up directory (i.e., where Jupyter or Anaconda is installed).
However, the start-up directory can be changed.
It is also possible to start the dashboard on any system via the command prompt (or terminal
on Unix systems) by entering the command jupyter notebook; in this case, the current
working directory will be the start-up directory. With Jupyter Notebook open in your browser,
you may have noticed that the URL for the dashboard is something like
http://localhost:8888/tree. Localhost is not a website, but indicates that the content is being
served from your local machine: your own computer. Jupyter’s Notebooks and dashboard are
web apps, and Jupyter starts up a local Python server to serve these apps to your web browser,
making it essentially platform-independent and opening the door to easier sharing on the web.
The short answer: each .ipynb file is one notebook, so each time you create a new notebook, a
new .ipynb file will be created.
The longer answer: Each .ipynb file is a text file that describes the contents of your notebook
in a format called JSON. Each cell and its contents, including image attachments that have
been converted into strings of text, is listed therein along with some metadata.
You can edit this yourself - if you know what you are doing! - by selecting “Edit > Edit
Notebook Metadata” from the menu bar in the notebook. You can also view the contents of
your notebook files by selecting “Edit” from the controls on the dashboard
However, the key word there is can. In most cases, there's no reason you should ever need to
edit your notebook metadata manually.
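Since each notebook is plain JSON, its cells can even be inspected programmatically. The short sketch below is only an illustration (the file name example.ipynb is a placeholder); it uses Python's standard json module to list the type of every cell in a notebook:

import json

# Open a notebook file; any .ipynb exported from Jupyter will do (the path is a placeholder)
with open("example.ipynb", "r", encoding="utf-8") as f:
    notebook = json.load(f)

# A notebook file is a JSON object whose "cells" list holds code and Markdown cells
for index, cell in enumerate(notebook["cells"]):
    preview = "".join(cell["source"])[:40]
    print(index, cell["cell_type"], preview)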
6.4 THE NOTEBOOK INTERFACE:
Now that you have an open notebook in front of you, its interface will hopefully not
look entirely alien. After all, Jupyter is essentially just an advanced word processor. Why not
take a look around? Check out the menus to get a feel for it, especially take a few moments to
scroll down the list of commands in the command palette, which is the small button with the
keyboard icon (or Ctrl + Shift + P).
There are two fairly prominent terms that you should notice, which are probably new to you:
cells and kernels are key both to understanding Jupyter and to what makes it more than just a
word processor. Fortunately, these concepts are not difficult to understand.
We’ll return to kernels a little later, but first let’s come to grips with cells. Cells form the
body of a notebook. In the screenshot of a new notebook in the section above, that box with
the green outline is an empty cell. There are two main cell types that we will cover:
• A code cell contains code to be executed in the kernel. When the code is run, the
notebook displays the output below the code cell that generated it.
• A Markdown cell contains text formatted using Markdown and displays its output
inplace when the Markdown cell is run.
6.5 PYTHON:
PYTHON HISTORY:
Python was invented by Guido van Rossum in 1991 at CWI in the Netherlands. The idea of the Python programming language was taken from the ABC programming language; in other words, ABC is a predecessor of Python. There is also a story behind the choice of the name Python. Guido van Rossum was a fan of the popular BBC comedy show of that time, "Monty Python's Flying Circus", so he decided to pick the name Python for his newly created programming language. Python has a vast community across the world and releases new versions within short periods.
• Python is Interactive − You can actually sit at a Python prompt and interact with the
interpreter directly to write your programs.
Python provides many useful features to the programmer. These features make it one of the most popular and widely used languages. A few essential features of Python are listed below. Python has a wide range of libraries and frameworks widely used in various fields such as machine learning, artificial intelligence, and web applications. Some popular frameworks and libraries of Python are described as follows.
• It provides very high-level dynamic data types and supports dynamic type checking.
• It can be easily integrated with C, C++, COM, ActiveX, CORBA, and Java.
TENSORFLOW:
TensorFlow is a free and open-source software library for machine learning and
artificial intelligence.
It can be used across a range of tasks but has a particular focus on training and
inference of deep neural networks.
TensorFlow was developed by the Google Brain team for internal Google use in
research and production.
The initial version was released under the Apache License 2.0 in 2015. Google released the updated version of TensorFlow, named TensorFlow 2.0, in September 2019. Version 1.0.0 was released on February 11, 2017. While the reference implementation runs on single devices, TensorFlow can run on multiple CPUs and GPUs (with optional CUDA and SYCL extensions for general-purpose computing on graphics processing units). TensorFlow is available on 64-bit Linux, macOS, Windows, and mobile computing
platforms including Android and iOS. Its flexible architecture allows for the easy deployment
of computation across a variety of platforms (CPUs, GPUs, TPUs), and from desktops to
clusters of servers to mobile and edge devices.
6.7 INTEGRATIONS:
Numpy:
NumPy is one of the most popular Python data libraries, and TensorFlow offers integration and compatibility with its data structures. NumPy NDarrays, the library's native datatype, are automatically converted to TensorFlow Tensors in TF operations, and the same is also true vice versa. This allows the two libraries to work in unison without requiring the user to write explicit data conversions. Moreover, the integration extends to memory optimization by having TF Tensors share the underlying memory representations of NumPy NDarrays whenever possible.
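As a minimal sketch of this interoperability (assuming only that TensorFlow and NumPy are installed; it is not part of the project code), the following lines pass a NumPy array directly into a TensorFlow operation and convert the result back:

import numpy as np
import tensorflow as tf

# A NumPy ndarray is accepted directly by TensorFlow operations and converted to a Tensor
array = np.array([[1.0, 2.0], [3.0, 4.0]])
tensor = tf.square(array)

# The Tensor can be turned back into an ndarray, sharing memory where possible
back_to_numpy = tensor.numpy()
print(type(tensor), type(back_to_numpy))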
Extensions:
TensorFlow also offers a variety of libraries and extensions to advance and extend the
models and methods used. For example, TensorFlow Recommenders and TensorFlow
Graphics are libraries for their respective functionalities in recommendation systems and
graphics, TensorFlow Federated provides a framework for decentralized data, and TensorFlow
Cloud allows users to directly interact with Google Cloud to integrate their local code to
Google Cloud. Other add-ons, libraries, and frameworks include TensorFlow Model
Optimization, TensorFlow Probability, TensorFlow Quantum, and TensorFlow Decision
Forests.
Google Colab:
Google also released Colaboratory, a TensorFlow Jupyter notebook environment that does not
require any setup. It runs on Google Cloud and allows users free access to GPUs and the
ability to store and share notebooks on Google Drive.
TensorBoard
Performance
Extend
CHAPTER 7
PROJECT DESCRIPTION
(i) Today's information and communication technology (ICT) industry demands highly skilled programmers for further development.
(v) Furthermore, the growing ratio between students and educators raises the question of
how to provide individual support to students to improve their problem-solving skills.
Especially, when learning computer programming, students need a lot of practice and
individual tutoring to improve their programming knowledge and skills.
E-learning platforms have become more popular for a variety of reasons and demands,
including teacher shortage, unbalanced student-teacher ratio, logistical and infrastructure
constraints, high cost of technical and professional courses, dissemination of education to a
large number of people, time saving and easy access to many courses. As the use of e-learning
systems increases, different types of data are being generated regularly.
Some data are structured whereas some are unstructured.
An EDM technique such as k-means clustering was used to evaluate students' activities in an e-learning system and to identify students' interests. It also identified the correlation between activity in the e-learning system and academic performance.
Recently, some recommender systems (RSs) have been using a mixed approach of content-based filtering and collaborative filtering to achieve high-quality results in specific contexts.
Clustering techniques are widely used in data analysis and play an important role in
the field of data mining.
With the diversification of data, many variations of clustering techniques have been
developed simultaneously to analyze different types of data.
Each clustering technique has its advantages and disadvantages for clustering data.
The usability and applicability of clustering techniques in the context of EDM have been described in a previous study.
To the best of our knowledge, there is no single clustering technique that can handle all types of data, including text and numbers.
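To make the clustering step concrete, the sketch below applies the standard k-means implementation from scikit-learn together with the elbow heuristic for choosing K. The feature matrix here is random placeholder data standing in for the preprocessed problem-solving features, and the sketch does not reproduce the exact MK-means variant used in this work:

import numpy as np
from sklearn.cluster import KMeans

# Placeholder feature matrix: rows = students, columns = numeric problem-solving features
rng = np.random.default_rng(42)
features = rng.random((200, 4))

# Elbow heuristic: watch how the within-cluster sum of squares (inertia) drops as K grows
inertia = []
for k in range(1, 10):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(features)
    inertia.append(model.inertia_)
print(inertia)

# Cluster with the chosen K and read the label assigned to each student
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(features)
labels = kmeans.labels_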
ARM is an unsupervised technique and was first introduced in earlier research. There are diverse applications of the ARM technique in various fields such as pattern mining, education, social, medical, census, market-basket, and big data analysis. ARM is an efficient technique for obtaining frequent items from large datasets. Among the many types of ARM algorithms, the Apriori and FP-growth algorithms are the most widely used. Comparing Apriori and FP-growth, Apriori requires repeated scanning of the database to form candidate itemsets, whereas the FP-growth algorithm is very fast because it only needs to scan the database twice to complete the process.
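As an illustration of this mining step, the sketch below runs FP-growth and association rule extraction with the third-party mlxtend library on a tiny hand-made transaction list; the transactions, item names, and thresholds are invented for illustration and do not come from the study's dataset:

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth, association_rules

# Toy transactions: each row lists the "items" observed for one student (illustrative only)
transactions = [
    ["high_score", "few_attempts", "early_submission"],
    ["high_score", "few_attempts"],
    ["low_score", "many_attempts"],
    ["low_score", "many_attempts", "late_submission"],
    ["high_score", "early_submission"],
]

# One-hot encode the transactions into a boolean DataFrame
encoder = TransactionEncoder()
onehot = pd.DataFrame(encoder.fit(transactions).transform(transactions), columns=encoder.columns_)

# Mine frequent itemsets above a minimum support (minSup) threshold
frequent_itemsets = fpgrowth(onehot, min_support=0.4, use_colnames=True)

# Derive association rules above a minimum confidence (minConf) threshold
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.7)
print(rules[["antecedents", "consequents", "support", "confidence"]])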
users (u),
TABLE DESIGN
LEVEL-2:
CHAPTER 8
TESTING AND IMPLEMENTATION
8.1 IMPLEMENTATION
Implementation is the stage of the project when the theoretical design is turned into a working system. Thus, it can be considered the most critical stage in achieving a successful new system and in giving the user confidence that the new system will work and be effective. The implementation stage involves careful planning, investigation of the existing system and its constraints on implementation, design of methods to achieve the changeover, and evaluation of the changeover methods.
INPUT DESIGN
The input design is the link between the information system and the user. It comprises developing specifications and procedures for data preparation, that is, the steps necessary to put transaction data into a usable form for processing. This can be achieved by having the computer read data from a written or printed document or by having people key the data directly into the system.
The design of input focuses on controlling the amount of input required, controlling errors, avoiding delay, avoiding extra steps, and keeping the process simple.
The input is designed in such a way that it provides security and ease of use while retaining privacy. Input design considered the following things:
OBJECTIVES
1. Input design is the process of converting a user-oriented description of the input into a computer-based system. This design is important to avoid errors in the data input process and to show the correct direction to the management for getting correct information from the computerized system.
2. It is achieved by creating user-friendly screens for data entry to handle large volumes of data. The goal of designing input is to make data entry easier and free from errors. The data entry screen is designed in such a way that all data manipulations can be performed. It also provides record viewing facilities.
3. When the data is entered, it is checked for validity. Data can be entered with the help of screens. Appropriate messages are provided as and when needed so that the user is not left confused at any instant. Thus the objective of input design is to create an input layout that is easy to follow.
OUTPUT DESIGN
A quality output is one which meets the requirements of the end user and presents the information clearly. In any system, the results of processing are communicated to the users and to other systems through outputs. In output design, it is determined how the information is to be displayed for immediate need, as well as the hard copy output. It is the most important and direct source of information to the user. Efficient and intelligent output design improves the system's relationship with the user and helps in decision-making.
1. Designing computer output should proceed in an organized, well thought out manner; the right output must be developed while ensuring that each output element is designed so that people will find the system easy to use and effective. When analysts design computer output, they should identify the specific output that is needed to meet the requirements.
3. Create document, report, or other formats that contain information produced by the system.
The output form of an information system should accomplish one or more of the following
objectives.
• Convey information about past activities, current status or projections of the Future.
• Signal important events, opportunities, problems, or warnings.
• Trigger an action.
• Confirm an action.
Testing ensures that the software system meets its requirements and user expectations and does not fail in an unacceptable manner.
There are various types of tests; each test type addresses a specific testing requirement.
Unit Testing
Unit testing is the testing of individual software units of the application. It is done after the completion of an individual unit and before integration. This is structural testing that relies on knowledge of the unit's construction and is invasive.
Unit tests perform basic tests at component level and test a specific business process,
application, and/or system configuration. Unit tests ensure that each unique path of a business
process performs accurately to the documented specifications and contains clearly defined
inputs and expected results.
Integration Testing
Integration tests are designed to test integrated software components to determine if they
actually run as one program.
Testing is event driven and is more concerned with the basic outcome of screens or fields.
Integration tests demonstrate that although the components were individually satisfactory, as shown by successful unit testing, the combination of components is correct and consistent.
Integration testing is specifically aimed at exposing the problems that arise from the combination
of components.
Functional Test
Functional tests provide systematic demonstrations that functions tested are available as specified
by the business and technical requirements, system documentation, and user manuals.
In addition, systematic coverage pertaining to identified business process flows, data fields, predefined processes, and successive processes must be considered for testing.
Before functional testing is complete, additional tests are identified and the effective value of
current tests is determined.
System Test
System testing ensures that the entire integrated software system meets requirements. It tests a
configuration to ensure known and predictable results.
System testing is based on process descriptions and flows, emphasizing pre-driven process links
and integration points.
White Box Testing
White box testing is a testing in which the software tester has knowledge of the inner workings, structure, and language of the software, or at least its purpose. It is used to test areas that cannot be reached from a black box level.
Black Box Testing
Black box testing is a testing in which the software under test is treated as a black box: you cannot "see" into it. Black box tests, like most other kinds of tests, must be written from a definitive source document, such as a specification or requirements document.
The test provides inputs and responds to outputs without considering how the software works.
Unit Testing:
Unit testing is usually conducted as part of a combined code and unit test phase of the software
lifecycle, although it is not uncommon for coding and unit testing to be conducted as two distinct
phases.
Test objectives
All field entries must work properly.
Pages must be activated from the identified link.
The entry screen, messages and responses must not be delayed.
Features to be tested
Verify that the entries are of the correct format
No duplicate entries should be allowed
All links should take the user to the correct page.
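A minimal sketch of how such checks could be automated with Python's built-in unittest module is given below; the validate_entry helper and the format rule it enforces are hypothetical and only stand in for the project's actual entry-validation logic:

import re
import unittest

def validate_entry(entry, existing_entries):
    # Hypothetical helper: accept an entry only if it is well formed and not a duplicate
    well_formed = bool(re.fullmatch(r"[A-Za-z0-9_-]+", entry))
    return well_formed and entry not in existing_entries

class EntryValidationTests(unittest.TestCase):
    def test_correct_format_is_accepted(self):
        self.assertTrue(validate_entry("student_01", existing_entries=set()))

    def test_bad_format_is_rejected(self):
        self.assertFalse(validate_entry("bad entry!", existing_entries=set()))

    def test_duplicates_are_rejected(self):
        self.assertFalse(validate_entry("student_01", existing_entries={"student_01"}))

if __name__ == "__main__":
    unittest.main()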
Integration Testing
Software integration testing is the incremental integration testing of two or more integrated
software components on a single platform to produce failures caused by interface defects.
The task of the integration test is to check that components or software applications, e.g.
components in a software system or – one step up – software applications at the company level –
interact without error.
Test Results: All the test cases mentioned above passed successfully. No defects
encountered.
Acceptance Testing
User Acceptance Testing is a critical phase of any project and requires significant participation by the
end user.
Test Results
All the test cases mentioned above passed successfully. No defects encountered.
CHAPTER 9
CONCLUSION AND FUTURE ENHANCEMENT
9.1 CONCLUSION
This study presented an EDM framework for data clustering, pattern mining, and rule mining using real-world problem-solving data. A mathematical model for data preprocessing and the MK-means and FP-growth algorithms were used to conduct this study. For programming education, OJ systems have been adopted by many institutions as academic tools. As a result, a huge number of programming-related resources (source codes, logs, scores, activities, etc.) are regularly accumulated in OJ systems. In this study, a large amount of real-world problem-solving data collected from the AOJ system was used in the experiments. Problem-solving data preprocessing is one of the main tasks in achieving accurate EDM results. Therefore, a mathematical model for problem-solving data preprocessing was developed. Then, the processed data were clustered using the Elbow and MK-means algorithms. Various statistical features, data patterns, and rules were extracted from each cluster based on different threshold values (K, minConf, minSup). These results can effectively contribute to the improvement of overall programming education. Moreover, based on the experimental results, some pertinent suggestions have been made. Furthermore, the proposed framework can be applied to other practical/exercise courses to demonstrate data patterns, statistical features, and rules. Besides, any third-party application with similar data resources, such as AlgoA, ProgA, FCT, and FPT, can use the proposed approach for EDM and analysis.
In the future, the experimental results of EDM using problem-solving data can be integrated to visualize different LA for programming platforms such as the OJ system. In addition, fuzzy estimation and polynomial approximation methods can be handy to dynamically select the optimal minSup values based on the dataset. Appropriate minSup values could help to generate the actual number of frequent elements and association rules from the dataset.
CHAPTER 10
APPENDICES
10.1 SOURCE CODE
#!/usr/bin/env python
# coding: utf-8
# # Import Libraries
# In[1]:
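# NOTE: this import cell was blank in the listing; the imports below are inferred
# (assumed) from the pandas/seaborn/cufflinks/scikit-learn calls used further down.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import cufflinks as cf
cf.go_offline()  # enables the DataFrame.iplot() calls used below
from sklearn.model_selection import train_test_split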
# # Dataset
# In[2]:
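# NOTE: the data-loading cell was blank in the listing; a typical load of the UCI student
# performance data would look like the line below (the file name is assumed).
stud = pd.read_csv('student-mat.csv')  # 'G3' is the final-grade column analysed below
stud.head()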
# In[4]:
stud['G3'].describe()
# In[5]:
# In[6]:
# In[7]:
# In[8]:
# In[9]:
# In[10]:
# In[13]:
# In[14]:
# In[16]:
stud.iplot(kind='box')
# In[17]:
stud['G3'].iplot(kind='hist',bins=100,color='blue')
# # Data Visualization
# In[18]:
# In[19]:
# # Student's Sex
# In[20]:
# In[21]:
# # Age of Students
# In[22]:
# In[23]:
# - The student age seems to range from 15-19, where the gender distribution is pretty even in each age group.
# - The age group above 19 may be outliers, year-back students or dropouts.
# In[24]:
# In[25]:
sns.set_style('whitegrid')
sns.countplot(x='address',data=stud,palette='magma')  # urban & rural representation on countplot
# - Approximately 77.72% students come from urban region and 22.28% from rural region.
# In[26]:
sns.countplot(x='address',hue='G3',data=stud,palette='Oranges')
# In[27]:
# - Plotting the distribution rather than statistics would help us better understand the data.
# - The above plot shows that the median grades of the three age groups (15, 16, 17) are similar. Note the skewness of age group 19 (may be due to sample size). Age group 20 seems to score the highest grades among all.
# In[28]:
# In[29]:
# - The above graph clearly shows there is not much difference between the grades based on location.
# In[30]:
stud.corr()['G3'].sort_values()
# In[31]:
# In[32]:
stud.head()
# In[33]:
stud.tail()
# In[34]:
# In[35]:
# In[36]:
# In[37]:
# In[38]:
b = sns.swarmplot(x=stud['failures'],y=stud['G3'],palette='autumn')
b.axes.set_title('Previous Failures vs Final Grade(G3)')
# **Observation :** Student with less previous failures usually score higher
# In[39]:
# In[40]:
b = sns.boxplot(x=stud['higher'],y=stud['G3'],palette='binary')
b.axes.set_title('Higher Education vs Final Grade(G3)')
# **Observation :** Students who wish to go for higher studies score more
# In[41]:
b = sns.countplot(x=stud['goout'],palette='OrRd')
b.axes.set_title('Go Out vs Final Grade(G3)')
# **Observation :** The students have an average score when it comes to going out with friends.
# In[42]:
b = sns.swarmplot(x=stud['goout'],y=stud['G3'],palette='autumn')
b.axes.set_title('Go Out vs Final Grade(G3)')
# In[43]:
b = sns.swarmplot(x=stud['romantic'],y=stud['G3'],palette='YlOrBr')
b.axes.set_title('Romantic Relationship vs Final Grade(G3)')
# - Here the romantic attribute value 0 means no relationship and value 1 means in a relationship
#
# **Observation :** Students with no romantic relationship score higher
# In[44]:
# In[45]:
# **Observation :** The students have an equally distributed average score when it comes to the reason attribute.
# In[46]:
# Metrics
from sklearn.metrics import mean_squared_error, mean_absolute_error, median_absolute_error
# Distributions
import scipy
# In[47]:
# splitting the data into training and testing data (75% and 25%)
# we mention the random state to achieve the same split every time we run the code
X_train, X_test, y_train, y_test = train_test_split(stud, stud['G3'], test_size = 0.25,
random_state=42)
# In[48]:
X_train.head()
# ## MAE - Mean Absolute Error & RMSE - Root Mean Square Error
# In[49]:
# In[50]:
# store the true G3 values for passing into the function
true = X_test['G3']
# In[51]:
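# NOTE: the metric-helper cell was blank in the listing; a sketch of the kind of helper
# these sections rely on (function and variable names assumed) is shown below.
def evaluate_predictions(predictions, true):
    # Mean absolute error and root mean squared error against the true G3 values
    mae = np.mean(abs(predictions - true))
    rmse = np.sqrt(np.mean((predictions - true) ** 2))
    return mae, rmse

# Naive baseline: predict the median training grade for every student
median_pred = X_train['G3'].median()
median_preds = np.full(len(true), median_pred)
mb_mae, mb_rmse = evaluate_predictions(median_preds, true)
print('Median Baseline  MAE:', round(mb_mae, 4), ' RMSE:', round(mb_rmse, 4))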
# In[52]:
# Evaluate several ml models by training on training set and testing on testing set
def evaluate(X_train, X_test, y_train, y_test):
    # Names of models
    model_name_list = ['Linear Regression', 'ElasticNet Regression',
                       'Random Forest', 'Extra Trees', 'SVM',
                       'Gradient Boosted', 'Baseline']
    X_train = X_train.drop('G3', axis='columns')
    X_test = X_test.drop('G3', axis='columns')
    # Metrics
    return results
# In[53]:
# In[54]:
plt.figure(figsize=(12, 7))
10.2 SCREENSHOTS
DATASET DESCRIPTION:
DATA VISUALIZATION:
AGE OF STUDENTS:
MAE - MEAN ABSOLUTE ERROR & RMSE - ROOT MEAN SQUARE ERROR:
10.3 JOURNAL
DR.M. GEETHA,
ASSOCIATE PROFESSOR,
DEPARTMENT OF COMPUTER APPLICATIONS (MCA), (AUTONOMOUS),
SOWNDARRAJAN N,
activities are primarily conversational, users' information requests might be described as implicit queries that emerge from the conversation in the background and are recovered by actual time listening, making use of speech recognition software (VRS) during a meeting. Such implicit searches are used to locate and suggest content from the web or a local storage facility, which users may opt to explore further if they are interested in it.

This study discusses how to create implicit inquiries for the just-in-time retrieval system that will be utilized in conference rooms. Our just-in-time retrieval system must build implicit inquiries from conversational input, which comprises a lot more words than a query, as compared to the explicit spoken queries potentially made to corporate digital search engines. Consider the following example, where four people are asked to name various items, such as "chocolate," "pistol," or "lighter," that will allow them to live in the mountains. This field of study is known as "automatic term recognition" in computational linguistics, and it is known as "automatic indexing" or "automatic keyword extraction" in information retrieval.

In order to give a limited sample of suggestions based on the most likely hypotheses, our goal is to maintain a variety of assumptions regarding users' information needs. This is due to the possibility of a wide range of topics, which might be made more challenging by probable ASR mistakes or speech stutters (like "whisk" in this case). Therefore, our objective is to obtain a broad and relevant keyword collection, organize them into topic-specific queries, rank them according to importance, and finally provide a sample of the results to the user. The diversity of keywords increases the chance that a minimum of one of the proposed papers contains the keywords, while topic-based grouping reduces the risk of ASR mistakes in the queries. A recommended document can give you the details you need or can point you in the direction of another useful document by clicking on its hyperlinks. For instance, depending on frequency of words, the Wikipedia articles "Light," "Lighting," and "Light My Fire" would be returned. Users would, however, favor a set that contained words like "Lighter," "Wool," and "Chocolate."

To guarantee relevance and diversity, three stages can be used: extracting the keywords, generating one or more implicit queries, and reordering the outcomes. This article focuses on the investigation of the first two strategies. A new study employing the third [1] demonstrates that reranking the outcomes of a single implicit query cannot increase the users' happiness with the recommended texts. Prior to choosing the peak-ranking keywords, keywords are ranked using word frequency or TF-IDF weights, or approaches for extracting implicit questions from text are used [2, 3]. This work introduces a unique method for collecting keywords from ASR output that increases coverage of users' prospective information demands and minimizes the usage of unnecessary phrases. After being extracted, a group of keywords is clustered to produce a number of topically-separated searches that can be used independently and have greater precision than a single, more complex topically-mixed query. Eventually, the findings are compiled into a ranked set before being made available to users as suggestions.
II. Comparable works
Prior studies found that the inclusion of POS and sentence score data facilitates extraction of keywords.
I. Proposed Work
p(z) = (1/N) Σ_{i=1..N} p(z | w_i), where N is the number of words in a fragment and p(z | w_i) is the probability of topic z given word w_i.
In document summarization, obtaining keywords is an essential approach. The primary goal of obtaining keywords is to extract terms based on how significant they are in the text. The initial action is to choose and preprocess the desired documents.

TF-IDF is a statistic used in information retrieval. The frequency of a word in the corpus reduces the value of TF-IDF, which grows directly in proportion to a word's frequency in the text. A measure of how frequently a phrase appears in a document is called term frequency (TF).

Keywords covering a larger range of topics are then gathered. The project uses conversations, and the data is categorized using a free application. RapidMiner is a platform for predictive analytics, business analytics, machine learning, text analysis, and data mining. Applications of the technology include research, discipline, teaching, creation of applications, fast prototyping, and business. It offers data loading and transformation methods, preprocessing, visualization, modelling, evaluation, and deployment for machine learning and data mining. This learning-based tool was developed using the Java programming language. The experimentation in the proposed work makes use of twenty presentations.

Collection of discussions is done by hand. The procedure makes effective use of TF-IDF and WordNet to derive keywords from the segments. The procedure is created using Java keyword research. The retrieved keywords are then saved for classification and clustering.

Keyword clustering performs less well than categorization. Classification is a component of supervised learning. The training duration and prediction accuracy are utilized to compare the precision of predictions of the various training models and assess their performance.

Conclusion
One of the primary applications of machine learning is text classification. The proposed technique uses text analysis algorithms to pull keywords out of journal articles. Calculating the semantic distances between terms makes use of the WordNet vocabulary. Terms that were found to be the most comparable are italicized. Based on the obtained keywords, documents are categorized employing machine learning techniques: Decision Trees, k-Nearest Neighbor, and Naive Bayes. The Decision Tree outperforms the other machine learning algorithms for text categorization; according to a performance evaluation, it outperforms the other two algorithms in terms of prediction accuracy.

References
[1] M. Habibi and A. Popescu-Belis, "Enforcing topic diversity in a document recommender for conversations," in Proc. 25th Int. Conf. Comput. Linguist. (Coling), 2014, pp. 588–599.
[2] S. Ye, T.-S. Chua, M.-Y. Kan, and L. Qiu, "Document concept lattice for text understanding and summarization," Inf. Process. Manage., vol. 43, no. 6, pp. 1643–1662, 2007.
[3] D. Harwath and T. J. Hazen, "Topic identification based extrinsic evaluation of summarization techniques applied to conversational speech," in Proc. Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 2012, pp. 5073–5076.
[4] Y. Matsuo and M. Ishizuka, "Keyword extraction from a single document using word co-occurrence statistical information," Int. J. Artif. Intell. Tools, vol. 13, no. 1, pp. 157–169, 2004.
[5] A. Nenkova and K. McKeown, "A survey of text summarization techniques," in Mining Text Data, C. C. Aggarwal and C. Zhai, Eds. New York, NY, USA: Springer, 2012, ch. 3, pp. 43–76.
[6] T. J. Hazen, "Latent topic modeling for audio corpus summarization," in Proc. 12th Annu. Conf. Int. Speech Commun. Assoc., 2011, pp. 913–916.
[7] J. Wang, J. Liu, and C. Wang, "Keyword extraction based on PageRank," in Proc. Adv. Knowl. Disc. Data Mining (PAKDD), 2007, pp. 857–864.
[8] Z. Liu, W. Huang, Y. Zheng, and M. Sun, "Automatic keyphrase extraction via topic
10.5 RESUME
E-mail: [email protected]
Ph-No: 9597316660
CAREER OBJECTIVE
I seek challenging opportunities where I can fully use my skills for the success of the
organization.
ACADEMIC DETAILS
SKILLS
Java
Python
Basics of C
MS office
PERSONAL SKILLS
Logical Thinking
Ability to perform in a Team Task
Hard Worker
Developing Communication skills
ACHIEVEMENTS
Certificate of Participation in the one-day National Level Seminar on Novel IoT Insights and its Artificial Intelligence.
Certificate of Participation in Design Thinking - Leveraging the Power of Your Mind.
Certificate of Participation in a Workshop at ASTHRA 2022, an international level technical symposium.
PROJECT
EXTRA-CURRICULAR ACTIVITIES
Playing cricket
Listening to music & songs
HOBBIES
PERSONAL INFORMATION
COMMUNICATION DETAILS
CHAPTER 11
REFERENCES