

EDUCATIONAL DATA MINING TO SUPPORT PROGRAMMING
LEARNING USING PROBLEM-SOLVING DATA

PROJECT WORK

Submitted by

SOWNDARRAJAN N

REGISTER NO: 73152162050

in partial fulfillment of the requirements for the award of the degree

of

MASTER OF COMPUTER APPLICATIONS (MCA)

K.S.R. COLLEGE OF ENGINEERING

(AUTONOMOUS)

ANNA UNIVERSITY: CHENNAI 600 025

JUNE 2023

ANNA UNIVERSITY: CHENNAI 600 025



BONAFIDE CERTIFICATE

Certified that this project report “EDUCATIONAL DATA MINING TO SUPPORT


PROGRAMMING LEARNING USING PROBLEM-SOLVING DATA” is the bonafide
work of SOWNDARRAJAN N who carried out the project work under my supervision.

SIGNATURE SIGNATURE

Dr. P. Anitha, Ph.D., Dr. M. Geetha, Ph.D.,

HEAD OF THE DEPARTMENT SUPERVISOR

Professor Associate Professor

Dept. of Computer Applications Dept. of Computer Applications

K.S.R. College of Engineering K.S.R. College of Engineering

Tiruchengode - 637 215 Tiruchengode - 637 215

Submitted for the Project Viva-Voce Examination held on ______________________.



Internal Examiner External Examiner



DECLARATION

I affirm that the project work titled “EDUCATIONAL DATA MINING TO SUPPORT
PROGRAMMING LEARNING USING PROBLEM-SOLVING DATA” being submitted
in partial fulfillment for the award of MASTER OF COMPUTER APPLICATIONS is the
original work carried out by me. It has not formed part of any other project work
submitted for the award of any degree or diploma, either in this or any other University.

Signature of the Candidate


SOWNDARRAJAN N

REGISTER NUMBER: 73152162050

I certify that the declaration made above by the candidate is true.

Signature of the Supervisor

Dr. M. Geetha, Ph.D.,

Associate Professor,

Department of Computer Applications,

K.S.R. College of Engineering,

Tiruchengode - 637 215



ABSTRACT

Computer programming has attracted a lot of attention in the development of


information and communication technologies in the real world. Meeting the growing demand
for highly skilled programmers in the ICT industry is one of the major challenges. In this
regard, online judge (OJ) systems enhance programming learning and practice opportunities in
addition to classroom-based learning. Consequently, OJ systems have accumulated large archives
of problem-solving data (solution codes, logs, and scores) that can serve as valuable raw
materials for programming education research. In this paper, we propose an educational data
mining framework to support programming learning using unsupervised algorithms. The
framework includes the following sequence of steps:

(i) problem-solving data collection (logs and scores are collected from the OJ)
and preprocessing;

(ii) the MK-means clustering algorithm is used for data clustering in Euclidean space;
statistical features are extracted from each cluster; and the frequent pattern (FP)-growth
algorithm is applied to each cluster to mine data patterns and association rules;

(iii) a set of suggestions is provided on the basis of the extracted features, data
patterns, and rules. Different parameters are adjusted to achieve the best
results for the clustering and association rule mining algorithms.

(iv) In the experiment, approximately 70,000 real-world problem-solving records from
537 students of a programming course (Algorithms and Data Structures) were
used.

In addition, synthetic data were leveraged in the experiments to demonstrate the
performance of the MK-means algorithm. The experimental results show that the proposed
framework effectively extracts useful features, patterns, and rules from problem-solving data.
Moreover, these extracted features, patterns, and rules highlight the weaknesses and the scope
of possible improvements in programming learning.

ACKNOWLEDGEMENT

I convey my deep sense of gratitude to the almighty, who has helped me all the way through
my life and molded me into what I am today.

I would like to express my profuse gratitude to our Founder, Correspondent and President of
K.S.R. Group of Institutions, Theivathiru. Lion. Dr. K. S. RANGASAMY, M.J.F., for
providing extraordinary infrastructure, which helped me complete the project on
time.

I would like to express my profuse gratitude to our Chairman of K.S.R. Group of
Institutions, Mr. R. SRINIVASAN, B.B.M., for providing extraordinary infrastructure,
which helped me complete the project on time.

I would like to thank Dr. P. SENTHILKUMAR, Ph.D., Principal, for providing me with an
opportunity to carry out this project.

I wish to thank our Head of the Department Dr. P. ANITHA, M.C.A., M. Phil., Ph.D., for
giving me this opportunity with full encouragement to complete this project.

I would like to thank my project guide Dr. M. GEETHA, M.C.A., M.Phil., Ph.D.,
for guiding me at various stages of the project.
I express my sincere thanks to Mr. S. PRAKASH, M.C.A., IMMACULATE
TECHNOLOGIES, Coimbatore, for his valuable help and encouragement in this project.

I whole-heartedly thank my beloved friends for their suggestions and timely help throughout
my project work. Finally, I thank my parents for their moral support and encouragement,
without whom successful completion of this project would not have been possible.

SOWNDARRAJAN N
TABLE OF CONTENTS

CHAPTER NO   TITLE                                   PAGE NO

             ABSTRACT                                      V
             ACKNOWLEDGEMENT                              VI
             LIST OF TABLES                               IX
             LIST OF FIGURES                               X

1            INTRODUCTION                                  1
             1.1  COMPANY PROFILE                          1
             1.2  OBJECTIVES                               2

2            LITERATURE OF STUDY                           3
             2.1  LITERATURE REVIEW                        3

3            SYSTEM ANALYSIS                               5
             3.1  EXISTING SYSTEM                          5
             3.2  PROPOSED SYSTEM                          6

4            SYSTEM SPECIFICATIONS                         7
             4.1  HARDWARE SPECIFICATIONS                  7
             4.2  SOFTWARE SPECIFICATIONS                  7

5            SYSTEM STUDY                                  8
             5.1  FEASIBILITY STUDY                        8
             5.2  ECONOMICAL FEASIBILITY                   8
             5.3  TECHNICAL FEASIBILITY                    8
             5.4  SOCIAL FEASIBILITY                       8

6            SOFTWARE DESCRIPTION                          9
             6.1  ANACONDA                                 9
             6.2  OVERVIEW                                 9
             6.3  ANACONDA NAVIGATOR                      10
             6.4  THE NOTEBOOK INTERFACE                  13
             6.5  PYTHON                                  15
             6.6  CHARACTERISTICS OF PYTHON               17
             6.7  INTEGRATIONS                            18

7            PROJECT DESCRIPTION                          20
             7.1  OVERVIEW OF PROJECT                     20
             7.2  SYSTEM ARCHITECTURE                     21
             7.3  MODULES DESCRIPTION                     21
             7.4  DATASET DESIGN                          24

8            TESTING AND IMPLEMENTATION                   25
             8.1  IMPLEMENTATION                          25
             8.2  INPUT AND OUTPUT DESIGN                 25
             8.3  SYSTEM TESTING                          27
             8.4  TYPES OF TESTING                        28

9            CONCLUSION AND FUTURE ENHANCEMENT            31
             9.1  CONCLUSION                              31
             9.2  FUTURE ENHANCEMENT                      31

10           APPENDICES                                   32
             10.1  SOURCE CODE                            32
             10.2  SCREENSHOTS                            51
             10.3  JOURNAL                                58
             10.5  RESUME                                 67

11           REFERENCES                                   69

LIST OF TABLES

TABLE NO     TABLE NAME                                       PAGE NO

7.4.1        STUDENT EDUCATIONAL DATASET DESIGN                    24
7.4.2        STUDENT EDUCATIONAL DATASET DESIGN LEVEL-2            24

LIST OF FIGURES

FIGURE NO    FIGURE NAME                  PAGE NO

6.3.1        ANACONDA NAVIGATOR                10
6.3.2        RUNNING JUPYTER                   12
6.4.1        CELL ENVIRONMENT                  14
7.2.1        SYSTEM ARCHITECTURE               21



CHAPTER 1

INTRODUCTION

1.1 COMPANY PROFILE

Founded in 2009, Immaculate Technologies, located in Salem, has a rich
background in developing academic student projects, especially in implementing the latest IEEE
papers and in software development, and it continues to focus its entire attention on achieving
transcending excellence in the development and maintenance of software projects and
products in many areas.

In today's modern, technologically competitive environment, students in the
computer science stream want to ensure that they are getting guidance in an
organization that can meet their professional needs. With our well-equipped team of
solid information systems professionals, who study, design, develop, enhance,
customize, implement, maintain and support various aspects of information technology,
students can be sure of that.

We understand the students' needs and develop their quality of professional
life by simply making the technology readily usable for them. We practice
exclusively in software development, network simulation, search engine optimization,
customization and system integration. Our project methodology includes techniques
for initiating a project, developing the requirements, making clear assignments to the
project team, developing a dynamic schedule, reporting status to executives and problem solving.

The indispensable factors, which give the competitive advantages over others in the market,
may be slated as:

• Performance
• Pioneering efforts
• Client satisfaction
• Innovative concepts
• Constant Evaluations
• Improvisation

• Cost Effectiveness

ABOUT THE PEOPLE:

As a team, we have a clear vision and work to realize it. As a statistical evaluation, the
team has more than 40,000 hours of expertise in providing real-time solutions in the fields
of Android mobile apps development, networking, web designing, secure computing,
mobile computing, cloud computing, image processing and implementation, networking
with the OMNET++ simulator, client-server technologies in Java (J2EE/J2ME/EJB), ANDROID,
DOTNET (ASP.NET, VB.NET, C#.NET), MATLAB, NS2, SIMULINK, EMBEDDED, POWER
ELECTRONICS, VB & VC++, Oracle, and operating system concepts with LINUX.

OUR VISION:

“Impossible as Possible” is our vision, and we work according to it.

1.2 OBJECTIVES

We propose an educational data mining framework to support programming learning
using unsupervised algorithms. The framework includes the following sequence of steps: (i)
problem-solving data collection (logs and scores are collected from the OJ) and
preprocessing; (ii) the MK-means clustering algorithm is used for data clustering in Euclidean
space; (iii) statistical features are extracted from each cluster; (iv) the frequent pattern
(FP)-growth algorithm is applied to each cluster to mine data patterns and association rules;
(v) a set of suggestions is provided on the basis of the extracted features, data patterns, and
rules.
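As an illustration of how this sequence of steps could be realized in Python, the following sketch uses scikit-learn's standard KMeans as a stand-in for the MK-means variant and mlxtend's FP-growth implementation; the file name, column names, and thresholds are assumptions for illustration only, not the actual OJ export.

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from mlxtend.frequent_patterns import fpgrowth, association_rules

# (i) collect and preprocess problem-solving logs/scores exported from the OJ
logs = pd.read_csv('oj_problem_solving_logs.csv')                 # hypothetical export
features = logs[['score', 'cpu_time', 'memory_usage', 'code_size']].dropna()
X = StandardScaler().fit_transform(features)

# (ii) cluster the records in Euclidean space
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
features = features.assign(cluster=kmeans.labels_)

# (iii) statistical features per cluster
print(features.groupby('cluster').describe())

# (iv) frequent patterns and association rules per cluster, on discretized items
items = pd.get_dummies(pd.qcut(features['score'], 3, labels=['low', 'mid', 'high'])).join(
        pd.get_dummies(pd.qcut(features['code_size'], 3, labels=['small', 'medium', 'large'])))
for c, idx in features.groupby('cluster').groups.items():
    frequent = fpgrowth(items.loc[idx].astype(bool), min_support=0.3, use_colnames=True)
    if not frequent.empty:
        rules = association_rules(frequent, metric='confidence', min_threshold=0.7)
        # (v) suggestions would be derived from these per-cluster patterns and rules
        print(c, rules[['antecedents', 'consequents', 'support', 'confidence']].head())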

CHAPTER 2

LITERATURE OF STUDY

2.1 LITERATURE REVIEW

1) BIG DATA PLATFORM FOR EDUCATIONAL ANALYTICS

AUTHORS: AMR A. MUNSHI AND AHMAD ALHINDI

Huge amounts of educational data are being produced, and a common challenge that
many educational organizations confront is finding an effective method to harness and
analyze this data for continuously delivering enhanced education. Nowadays, educational
data is evolving and has become large in volume, wide in variety and high in velocity. This
produced data needs to be handled in an efficient manner to extract value and make informed
decisions. For that, this paper treats such data as a big data challenge and presents a
comprehensive platform tailored to perform educational big data analytical applications.
Further, it presents an effective environment for non-data scientists and people in the educational
sector to apply their demanding educational big data applications. The implementation stages
of the educational big data platform on a cloud computing platform and the organization of
educational data in a data lake architecture are highlighted. Furthermore, two analytical
applications are performed to test the feasibility of the presented platform in discovering
knowledge that can potentially benefit educational institutions.

2) IMPACT OF PRACTICAL SKILLS ON ACADEMIC PERFORMANCE: A DATA-


DRIVEN ANALYSIS
AUTHORS: MD. MOSTAFIZER RAHMAN, YUTAKA WATANOBE AND RAGE
UDAY KIRAN

Most academic courses in information and communication technology (ICT) or engineering


disciplines are designed to improve practical skills; however, practical skills and theoretical
knowledge are equally important to achieve high academic performance. This research aims
to explore how practical skills are influential in improving students’ academic performance
by collecting real-world data from a computer programming course in the ICT discipline.
Today, computer programming has become an indispensable skill for its wide range of

applications and significance across the world. In this paper, a novel framework to extract
hidden features and related association rules using a real-world dataset is proposed. An
unsupervised k-means clustering algorithm is applied for data clustering, and then the
frequent pattern-growth algorithm is used for association rule mining. We leverage students’
programming logs and academic scores as an experimental dataset. The programming logs
are collected from an online judge (OJ) system, as OJs play a key role in conducting
programming practices, competitions, assignments, and tests. To explore the correlation
between practical (e.g., programming, logical implementations, etc.) skills and overall
academic performance, the statistical features of students are analyzed and the related results
are presented. A number of useful recommendations are provided for students in each cluster
based on the identified hidden features. In addition, the analytical results of this paper can
help teachers prepare effective lesson plans, evaluate programs with special arrangements,
and identify the academic weaknesses of students. Moreover, a prototype of the proposed
approach and data-driven analytical results can be applied to other practical courses in ICT or
engineering disciplines.

3) A NOVEL RULE-BASED ONLINE JUDGE RECOMMENDER SYSTEM TO PROMOTE


COMPUTER PROGRAMMING EDUCATION

AUTHORS: MD. MOSTAFIZER RAHMAN, YUTAKA WATANOBE AND UDAY


KIRAN RAGE

Reducing students’ high dropout rates in the computer programming courses is a


challenging problem of great concern in computer science education. Online Judge (OJ)
systems have recently been investigated to address this problem and promote computer
programming education. Most of the existing OJ systems have been confined to evaluation
purposes only and do not provide any personalized recommendations to enhance the
productivity of a student. With this motivation, this paper proposes a novel rule-based OJ
recommender system to promote computer programming education. The proposed system
involves the following five steps: (i) scoring the programs submitted by a student
automatically, (ii) generation of a transactional database, (iii) clustering the database with
respect to their scores and other evaluation parameters, (iv) discovering interesting
association rules that exist in each of the cluster's data, and (v) recommending appropriate
content to the student based on the discovered rules.

CHAPTER 3

SYSTEM ANALYSIS

3.1 EXISTING SYSTEM

• The conventional computer programming learning environment is insufficient to


prepare highly skilled programmers due to the limited number of exercise classes,
limited practice opportunities, and lack of individual tutoring.
• In addition, most educational institutions, such as schools, colleges, and universities,
are struggling to build more educational facilities to increase academic activity (e.g.,
additional exercise classes, practice, and individual tutoring) due to logistical and
organizational constraints.
• The growing number of people in classrooms in educational institutions, the large
number of students per class, and the fact that some lectures are conducted with more
than a thousand participants in massive open online courses complicate individual tutoring.

DISADVANTAGES

• The growing ratio between students and educators raises the question of how to
provide individual support to students to improve their problem-solving skills.
• Especially, when learning computer programming, students need a lot of practice and
individual tutoring to improve their programming knowledge and skills.
• Computer programming is one of the fundamental courses in ICT discipline.

3.2 PROPOSED SYSTEM



• Scaffolding strategies transform learning activities into smaller modules, maximizing


the use of tools and structures to support students to gain more knowledge.
• Scaffolding strategies can be of two types, namely, dynamic and static.
• In dynamic scaffolding strategies, TAL systems continuously analyze student activity
and provide the necessary support based on students’ problems and responses.
• In contrast, static scaffolding strategies provide static support to students based on the
analysis of students’ previous difficulties and responses.

ADVANTAGES

• An SPA (smart personal assistant) is an application of artificial intelligence (AI) that
provides assistance (e.g., answering questions, recommendations, executing actions,
suggestions, etc.) based on user input (e.g., voice, images, and other types of information).
• SPA systems are hosted by big technology giants to receive voice or text data and produce
the relevant output.
• Despite the many advantages of SPA systems for quick answers, they are difficult to
employ for the purpose of programming learning and evaluation.

CHAPTER 4

SYSTEM SPECIFICATIONS

4.1 HARDWARE SPECIFICATIONS

• System : Pentium IV 2.4 GHz.

• Hard Disk : 500 GB.

• Monitor : 15 VGA Colour.

• Mouse : Logitech.

• RAM : 4 GB.

4.2 SOFTWARE SPECIFICATIONS

• Operating System : Windows-10/11 (64-bit).

• Language : Python 3.10 (64-bit)

• IDE Tools : Anaconda 3.0

CHAPTER 5

SYSTEM STUDY

5.1 FEASIBILITY STUDY



The feasibility of the project is analyzed in this phase, and a business proposal is put forth with a very
general plan for the project and some cost estimates. During system analysis, the feasibility study of
the proposed system is to be carried out. This is to ensure that the proposed system is not a burden
to the company. For feasibility analysis, some understanding of the major requirements for the
system is essential. Three key considerations involved in the feasibility analysis are:

• Economical Feasibility
• Technical Feasibility
• Social Feasibility
5.2 ECONOMICAL FEASIBILITY
This study is carried out to check the economic impact that the system will have on the
organization. The amount of funds that the company can pour into the research and development
of the system is limited. The expenditures must be justified. Thus the developed system is well
within the budget, and this was achieved because most of the technologies used are freely
available. Only the customized products had to be purchased.
5.3 TECHNICAL FEASIBILITY
This study is carried out to check the technical feasibility, that is, the technical requirements
of the system. Any system developed must not place a high demand on the available technical
resources, as this would lead to high demands being placed on the client. The developed system
must have a modest requirement, as only minimal or null changes are required for implementing
this system.
5.4 SOCIAL FEASIBILITY
This aspect of the study is to check the level of acceptance of the system by the user. This
includes the process of training the user to use the system efficiently. The user must not feel
threatened by the system; instead, the user must accept it as a necessity. The level of acceptance by the
users solely depends on the methods that are employed to educate the user about the system and
to make him familiar with it. His level of confidence must be raised so that he is also able to
make some constructive criticism, which is welcomed, as he is the final user of the system.

CHAPTER 6

SOFTWARE ENVIRONMENT
6.1 ANACONDA:

Anaconda is a distribution of the Python and R programming languages for scientific


computing (data science, machine learning applications, large-scale data processing,
predictive analytics, etc.), that aims to simplify package management and deployment. The
distribution includes data-science packages suitable for Windows, Linux, and macOS. It is
developed and maintained by Anaconda, Inc., which was founded by Peter Wang and Travis
Oliphant in 2012. As an Anaconda, Inc. product, it is also known as Anaconda Distribution or
Anaconda Individual Edition, while other products from the company are Anaconda Team
Edition and Anaconda Enterprise Edition, both of which are not free.

Package versions in Anaconda are managed by the package management system conda. This
package manager was spun out as a separate open-source package as it ended up being useful
on its own and for things other than Python. There is also a small, bootstrap version of
Anaconda called Miniconda, which includes only conda, Python, the packages they depend
on, and a small number of other packages.

6.2 OVERVIEW:

Anaconda distribution comes with over 250 packages automatically installed, and
over 7,500 additional open-source packages can be installed from PyPI as well as the conda
package and virtual environment manager. It also includes a GUI, Anaconda Navigator, as a
graphical alternative to the command-line interface (CLI).

The big difference between conda and the pip package manager is in how package
dependencies are managed, which is a significant challenge for Python data science and the
reason conda exists. Before version 20.3, when pip installed a package, it automatically
installed any dependent Python packages without checking if these conflict with previously
installed packages. It would install a package and any of its dependencies regardless of the
state of the existing installation. Because of this, a user with a working installation of, for
example, TensorFlow, could find that it stopped working having used pip to install a different
package that requires a different version of the dependent numpy library than the one used by
TensorFlow. In some cases, the package would appear to work but produce different results in
detail. While pip has since implemented consistent dependency resolution, this difference
accounts for a historical differentiation of the conda package manager.

In contrast, conda analyses the current environment including everything currently
installed, and, together with any version limitations specified (e.g. the user may wish to have
TensorFlow version 2.0 or higher), works out how to install a compatible set of dependencies,
and shows a warning if this cannot be done. Open source packages can be individually
installed from the Anaconda repository, Anaconda Cloud (anaconda.org), or the user's own
private repository or mirror, using the conda install command.

6.3 ANACONDA NAVIGATOR:

Anaconda Navigator is a desktop graphical user interface (GUI) included in Anaconda


distribution that allows users to launch applications and manage conda packages,
environments and channels without using command-line commands. Navigator can search for
packages on Anaconda Cloud or in a local Anaconda Repository, install them in an
environment, run the packages and update them. It is available for Windows, macOS and
Linux.

Fig 6.3.1- Anaconda Navigator

The following applications are available by default in Navigator:

• JupyterLab
• Jupyter Notebook
• QtConsole
• Spyder
• Glue
• Orange
• RStudio
• Visual Studio Code

CONDA:

Conda is an open source, cross-platform, language-agnostic package manager and


environment management system that installs, runs, and updates packages and their
dependencies. It was created for Python programs, but it can package and distribute software
for any language (e.g., R), including multi-language projects. The conda package and
environment manager is included in all versions of Anaconda, Miniconda, and Anaconda
Repository.

JUPYTER NOTEBOOK:

A notebook integrates code and its output into a single document that combines
visualizations, narrative text, mathematical equations, and other rich media. In other words:
it's a single document where you can run code, display the output, and also add explanations,
formulas, charts, and make your work more transparent, understandable, repeatable, and
shareable. Using Notebooks is now a major part of the data science workflow at companies
across the globe. If your goal is to work with data, using a Notebook will speed up your
workflow and make it easier to communicate and share your results. Best of all, as part of the
open source Project Jupyter, Jupyter Notebooks are completely free. You can download the
software on its own, or as part of the Anaconda data science toolkit.

INSTALLATION:

The easiest way for a beginner to get started with Jupyter Notebooks is by installing
Anaconda. Anaconda is the most widely used Python distribution for data science and comes
pre-loaded with all the most popular libraries and tools. Some of the biggest Python libraries
included in Anaconda include NumPy, pandas, and Matplotlib, though the full 1000+ list is
exhaustive. Anaconda thus lets us hit the ground running with a fully stocked data science
workshop without the hassle of managing countless installations or worrying about
dependencies and OS-specific (read: Windows-specific) installation issues.

To get Anaconda, simply:

1. Download the latest version of Anaconda for Python 3.8.


2. Install Anaconda by following the instructions on the download page and/or in the
executable.
If you are a more advanced user with Python already installed and prefer to manage your
packages manually, you can just use pip: Cmd: pip3 install jupyter

In this section, we’re going to learn to run and save notebooks, familiarize ourselves with
their structure, and understand the interface. We’ll become intimate with some core
terminology that will steer you towards a practical understanding of how to use Jupyter
Notebooks by yourself and set us up for the next section, which walks through an example
data analysis and brings everything we learn here to life.

RUNNING JUPYTER:

On Windows, you can run Jupyter via the shortcut Anaconda adds to your start menu,
which will open a new tab in your default web browser that should look something like the
following screenshot.

This isn’t a notebook just yet, but don’t panic! There’s not much to it. This is the
Notebook Dashboard, specifically designed for managing your Jupyter Notebooks. Think of it
as the launchpad for exploring, editing and creating your notebooks.

Be aware that the dashboard will give you access only to the files and sub-folders
contained within Jupyter’s start-up directory (i.e., where Jupyter or Anaconda is installed).
However, the start-up directory can be changed.

Fig 6.3.2 - Running Jupyter

It is also possible to start the dashboard on any system via the command prompt (or terminal
on Unix systems) by entering the command jupyter notebook; in this case, the current
working directory will be the start-up directory. With Jupyter Notebook open in your browser,
you may have noticed that the URL for the dashboard is something like
http://localhost:8888/tree. Localhost is not a website, but indicates that the content is being
served from your local machine: your own computer. Jupyter’s Notebooks and dashboard are
web apps, and Jupyter starts up a local Python server to serve these apps to your web browser,
making it essentially platform-independent and opening the door to easier sharing on the web.

The short answer: each .ipynb file is one notebook, so each time you create a new notebook, a
new .ipynb file will be created.

The longer answer: Each .ipynb file is a text file that describes the contents of your notebook
in a format called JSON. Each cell and its contents, including image attachments that have
been converted into strings of text, is listed therein along with some metadata.

You can edit this yourself - if you know what you are doing! - by selecting “Edit > Edit
Notebook Metadata” from the menu bar in the notebook. You can also view the contents of
your notebook files by selecting “Edit” from the controls on the dashboard.

However, the key word there is can. In most cases, there's no reason you should ever need to
edit your notebook metadata manually.

6.4 THE NOTEBOOK INTERFACE:

Now that you have an open notebook in front of you, its interface will hopefully not
look entirely alien. After all, Jupyter is essentially just an advanced word processor. Why not
take a look around? Check out the menus to get a feel for it, especially take a few moments to
scroll down the list of commands in the command palette, which is the small button with the
keyboard icon (or Ctrl + Shift + P).

Fig 6.4.1-Cell Environment

There are two fairly prominent terms that you should notice, which are probably new to you:
cells and kernels are key both to understanding Jupyter and to what makes it more than just a
word processor. Fortunately, these concepts are not difficult to understand.

• A kernel is a “computational engine” that executes the code contained in a notebook


document.

• A cell is a container for text to be displayed in the notebook or code to be executed by


the notebook’s kernel.

6.4.1 CELL ENVIRONMENT:

We’ll return to kernels a little later, but first let’s come to grips with cells. Cells form the
body of a notebook. In the screenshot of a new notebook in the section above, that box with
the green outline is an empty cell. There are two main cell types that we will cover:

• A code cell contains code to be executed in the kernel. When the code is run, the
notebook displays the output below the code cell that generated it.
• A Markdown cell contains text formatted using Markdown and displays its output
in place when the Markdown cell is run.

6.5 PYTHON:

Python is an easy to learn, powerful programming language. It has efficient high-level


data structures and a simple but effective approach to object-oriented programming. Python’s
elegant syntax and dynamic typing, together with its interpreted nature, make it an ideal
language for scripting and rapid application development in many areas on most platforms.
The Python interpreter and the extensive standard library are freely available in source or
binary form for all major platforms from the Python web site, https://www.python.org/, and
may be freely distributed. The same site also contains distributions of and pointers to many
free third party Python modules, programs and tools, and additional documentation. The
Python interpreter is easily extended with new functions and data types implemented in C or
C++ (or other languages callable from C).

Python is also suitable as an extension language for customizable applications. This


tutorial introduces the reader informally to the basic concepts and features of the Python
language and system. It helps to have a Python interpreter handy for hands-on experience, but
all examples are self-contained, so the tutorial can be read off-line as well. For a description
of standard objects and modules, see The Python Standard Library. The Python Language
Reference gives a more formal definition of the language. To write extensions in C or C++,
read Extending and Embedding the Python Interpreter and Python/C API Reference Manual.
There are also several books covering Python in depth.

PYTHON HISTORY:

Python was invented by Guido van Rossum in 1991 at CWI in the Netherlands. The idea of
the Python programming language was taken from the ABC programming language, or we can say
that ABC is a predecessor of the Python language.

There is also a fact behind choosing the name Python. Guido van Rossum was a fan of the
popular BBC comedy show of that time, "Monty Python's Flying Circus", so he decided to
pick the name Python for his newly created programming language.

Python has a vast community across the world and releases new versions within short
periods.

Python is a high-level, interpreted, interactive and object-oriented scripting language. Python


is designed to be highly readable. It uses English keywords frequently, whereas other
languages use punctuation, and it has fewer syntactical constructions than other languages.

• Python is Interpreted − Python is processed at runtime by the interpreter. You do not


need to compile your program before executing it. This is similar to PERL and PHP.

• Python is Interactive − You can actually sit at a Python prompt and interact with the
interpreter directly to write your programs.

• Python is Object-Oriented − Python supports Object-Oriented style or technique of


programming that encapsulates code within objects.

• Python is a Beginner's Language − Python is a great language for beginner-level


programmers and supports the development of a wide range of applications from
simple text processing to WWW browsers to games.

WHY LEARN PYTHON?

Python provides many useful features to the programmer. These features make it the most
popular and widely used language. A few essential features of Python are listed below.

• Easy to use and Learn


• Expressive Language
• Interpreted Language
• Object-Oriented Language
• Open Source Language
• Extensible
• Large Standard Library
• GUI Programming Support
• Integrated
• Embeddable
• Dynamic Memory Allocation
PYTHON POPULAR FRAMEWORKS AND LIBRARIES:

Python has a wide range of libraries and frameworks widely used in various fields such as
machine learning, artificial intelligence, web applications, etc. Some popular
frameworks and libraries of Python are defined as follows.

• Web development (server-side) - Django, Flask, Pyramid, CherryPy
• GUI-based applications - Tk, PyGTK, PyQt, PyJs, etc.
• Machine Learning - TensorFlow, PyTorch, Scikit-learn, Matplotlib, Scipy, etc.
• Mathematics - NumPy, Pandas, etc.

6.6 CHARACTERISTICS OF PYTHON:

Following are important characteristics of Python Programming −

• It supports functional and structured programming methods as well as OOP.



• It can be used as a scripting language or can be compiled to byte-code for building


large applications.

• It provides very high-level dynamic data types and supports dynamic type checking.

• It supports automatic garbage collection.

• It can be easily integrated with C, C++, COM, ActiveX, CORBA, and Java.

TENSORFLOW:

TensorFlow is a free and open-source software library for machine learning and
artificial intelligence.

It can be used across a range of tasks but has a particular focus on training and
inference of deep neural networks.

TensorFlow was developed by the Google Brain team for internal Google use in
research and production.

The initial version was released under the Apache License 2.0 in 2015. Google
released the updated version of TensorFlow, named TensorFlow 2.0, in September 2019.

TensorFlow can be used in a wide variety of programming languages, most notably


Python, as well as Javascript, C++, and Java.

This flexibility lends itself to a range of applications in many different sectors.


TensorFlow is Google Brain's second-generation system.

Version 1.0.0 was released on February 11, 2017. While the reference implementation
runs on single devices, TensorFlow can run on multiple CPUs and GPUs (with optional
CUDA and SYCL extensions for general-purpose computing on graphics processing
units). TensorFlow is available on 64-bit Linux, macOS, Windows, and mobile computing
platforms including Android and iOS. Its flexible architecture allows for the easy deployment
of computation across a variety of platforms (CPUs, GPUs, TPUs), and from desktops to
clusters of servers to mobile and edge devices.

6.7 INTEGRATIONS:

Numpy:

Numpy is one of the most popular Python data libraries, and TensorFlow offers
integration and compatibility with its data structures. Numpy NDarrays, the library’s native
datatype, are automatically converted to TensorFlow Tensors in TF operations; the same is
also true vice versa. This allows the two libraries to work in unison without requiring the
user to write explicit data conversions. Moreover, the integration extends to memory
optimization by having TF Tensors share the underlying memory representations of Numpy
NDarrays whenever possible.
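A minimal sketch of the automatic conversion described above (assuming TensorFlow 2.x running in its default eager mode):

import numpy as np
import tensorflow as tf

a = np.array([[1.0, 2.0], [3.0, 4.0]])   # a NumPy NDarray
b = tf.multiply(a, 2.0)                  # the NumPy input is auto-converted to a tf.Tensor
print(type(b))                           # a TensorFlow EagerTensor
print(b.numpy())                         # ...and converted back to a NumPy NDarray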

Extensions:

TensorFlow also offers a variety of libraries and extensions to advance and extend the
models and methods used. For example, TensorFlow Recommenders and TensorFlow
Graphics are libraries for their respective functionalities in recommendation systems and
graphics, TensorFlow Federated provides a framework for decentralized data, and TensorFlow
Cloud allows users to directly interact with Google Cloud to integrate their local code to
Google Cloud. Other add-ons, libraries, and frameworks include TensorFlow Model
Optimization, TensorFlow Probability, TensorFlow Quantum, and TensorFlow Decision
Forests.

Google Colab:

Google also released Colaboratory, a TensorFlow Jupyter notebook environment that does not
require any setup. It runs on Google Cloud and allows users free access to GPUs and the
ability to store and share notebooks on Google Drive.

TensorBoard

• TensorBoard is a utility to visualize different aspects of machine learning. The


following guides explain how to use TensorBoard:
• TensorBoard: Visualizing Learning, which introduces TensorBoard.
• TensorBoard: Graph Visualization, which explains how to visualize the computational
graph.
• TensorBoard Histogram Dashboard, which demonstrates how to use TensorBoard's
histogram dashboard.

Performance

• Performance is an important consideration when training machine learning models.


Performance speeds up and scales research while also providing end users with near
instant predictions.
• Performance overview contains a collection of best practices for optimizing your
TensorFlow code.
• Data input pipeline describes the tf.data API for building efficient data input pipelines
for TensorFlow.
• Benchmarks contain a collection of benchmark results for a variety of hardware
configurations.
• Additionally, TensorFlow Lite has optimization techniques for mobile and embedded
devices.

Extend

• This section explains how developers can add functionality to TensorFlow's


capabilities.
• TensorFlow architecture presents an architectural overview.
• Create an op, which explains how to create your own operations.

CHAPTER 7

PROJECT DESCRIPTION

7.1 OVERVIEW OF THE PROJECT

(i) Today's information and communication technology (ICT) industry demands
highly skilled programmers for further development.

(ii) The conventional computer programming learning environment is insufficient to


prepare highly skilled programmers due to the limited number of exercise classes,
limited practice opportunities, and lack of individual tutoring.

(iii) In addition, most educational institutions, such as schools, colleges, and


universities are struggling to build more educational facilities to increase academic
activity (e.g., additional exercise classes, practice, and individual tutoring) due to
logistical and organizational constraints.

(iv) The growing number of people in classrooms in educational institutions, the large
number of students per class, and the fact that some lectures are conducted with more than a
thousand participants in massive open online courses complicate the
individual tutoring process.

(v) Furthermore, the growing ratio between students and educators raises the question of
how to provide individual support to students to improve their problem-solving skills.
Especially, when learning computer programming, students need a lot of practice and
individual tutoring to improve their programming knowledge and skills.

(vi) Computer programming is one of the fundamental courses in the ICT discipline.
Programming practice and competition can play an important role in acquiring good
programming skills.

7.2 SYSTEM ARCHITECTURE



FIG 7.2.1 - SYSTEM ARCHITECTURE

7.3 MODULES DESCRIPTION:

• Educational Data Collection


• Rule-Based Recommendation Module
• Clustering In Data Module
• Pattern and Association Rule Module
• Evaluation Logs Data

EDUCATIONAL DATA COLLECTION:

E-learning platforms have become more popular for a variety of reasons and demands,
including teacher shortage, unbalanced student-teacher ratio, logistical and infrastructure
constraints, high cost of technical and professional courses, dissemination of education to a
large number of people, time saving and easy access to many courses. As the use of e-learning
systems increases, different types of data are being generated regularly.
Some data are structured whereas some are unstructured.
An EDM technique such as k-means clustering was used to evaluate students'
activities in an e-learning system and to identify students' interests. It also identified the
correlation between activity in the e-learning system and academic performance.
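As a small, hypothetical illustration of such a correlation analysis (the column names activity_count and final_grade are assumptions, not fields of any particular dataset):

import pandas as pd

records = pd.DataFrame({
    'activity_count': [12, 45, 30, 8, 51],   # e-learning activities per student (toy values)
    'final_grade':    [55, 82, 70, 48, 90],  # academic score (toy values)
})
print(records.corr(method='pearson'))        # correlation between activity and performance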

RULE-BASED RECOMMENDATION MODULE:

The volume and variety of content on e-learning platforms are increasing at an


unprecedented rate, and at the same time the opportunities for research using the resources of
e-learning platforms are also increasing.

Recommending relevant and appropriate content to users (e.g., students, instructors,
and teachers) is a challenging and tough task for any e-learning platform. Personalized
recommender systems (RSs) are used to provide appropriate supportive content to users.

In their approach, the FP-growth algorithm is applied to generate frequent item patterns,
and fuzzy logic is used to partition the content into three levels.

Recently, some RSs have been using a mixed approach of content-based filtering and
collaborative filtering to achieve high-quality results in specific contexts.
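A minimal sketch of how mined rules could drive such a rule-based recommendation (item names are hypothetical, and mlxtend's fpgrowth and association_rules are used as illustrative implementations):

import pandas as pd
from mlxtend.frequent_patterns import fpgrowth, association_rules

# toy one-hot transactions: which "items" (behaviours) each student exhibits
data = pd.DataFrame([[True, False, True], [True, True, True], [False, True, True]],
                    columns=['low_score', 'many_wrong_answers', 'large_code_size'])
rules = association_rules(fpgrowth(data, min_support=0.5, use_colnames=True),
                          metric='confidence', min_threshold=0.7)

student_profile = {'low_score', 'large_code_size'}
for _, r in rules.iterrows():
    if set(r['antecedents']).issubset(student_profile):
        # recommend supportive content associated with the rule's consequents
        print('Suggest practice targeting:', set(r['consequents']))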

CLUSTERING IN DATA MODULE:



Clustering techniques are widely used in data analysis and play an important role in
the field of data mining.

With the diversification of data, many variations of clustering techniques have been
developed simultaneously to analyze different types of data.

Each clustering technique has its advantages and disadvantages for clustering data.

The usability and applicability of clustering techniques in the context of the EDM has
been described in a study.

To the best of our knowledge, there is no single clustering technique that can handle
all types of data, including text and numbers.
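A small sketch (using scikit-learn's KMeans on placeholder data as a stand-in for the MK-means variant) of the Elbow method that is used later in this work to choose the number of clusters K:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))            # placeholder for preprocessed problem-solving features

inertias = []
for k in range(1, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)         # within-cluster sum of squared distances

# The "elbow" is the K after which the inertia stops dropping sharply.
for k, sse in zip(range(1, 10), inertias):
    print(f'K={k}: SSE={sse:.1f}')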

PATTERN AND ASSOCIATION RULE MODULE:

Association rule mining (ARM) is an unsupervised technique that was first introduced in the
research literature. There are diverse applications of the ARM technique in various fields such as
pattern mining, education, social, medical, census, market-basket, and big data analysis.

ARM is an efficient technique for obtaining frequent items from large datasets.
Among the many types of ARM algorithms, the Apriori and FP-growth algorithms are the most
widely used. Comparing Apriori and FP-growth, Apriori requires repeated scanning of the
database to form candidate itemsets, whereas the FP-growth algorithm is very fast because it
only needs to scan the database twice to complete the process.
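A minimal mlxtend example of FP-growth on a toy one-hot transaction table; the item names and the minSup/minConf thresholds are illustrative only:

import pandas as pd
from mlxtend.frequent_patterns import fpgrowth, association_rules

transactions = pd.DataFrame(
    [[True, True, False], [True, True, True], [False, True, True], [True, True, False]],
    columns=['accepted', 'low_cpu_time', 'small_code_size'])

frequent = fpgrowth(transactions, min_support=0.5, use_colnames=True)        # minSup
rules = association_rules(frequent, metric='confidence', min_threshold=0.7)  # minConf
print(frequent)
print(rules[['antecedents', 'consequents', 'support', 'confidence']])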

EVALUATION LOGS DATA:

An online judge (OJ) system is a web-based service that automatically evaluates


solution code; an OJ system evaluates the solution code using a predefined test dataset and
returns a judge result with various parameters after testing. This judge result for a solution
code is called a solution evaluation log. The judgment parameters for a solution code
evaluation include,

• judge identifier (j),
• users (u),
• problem name (p),
• programming language (l),
• judge's verdict (v),
• CPU time (ct),
• memory usage (mu),
• code size (cs),
• solution submission date (sd),
• solution judgment date (jd).
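An illustrative sketch of loading evaluation logs with the judgment parameters listed above into a DataFrame for preprocessing (the field values and column short names are assumptions for illustration, not the actual OJ schema):

import pandas as pd

columns = ['j', 'u', 'p', 'l', 'v', 'ct', 'mu', 'cs', 'sd', 'jd']
logs = pd.DataFrame([
    ['J001', 'u001', 'ALDS1_1_A', 'Python3', 'Accepted', 0.02, 5600, 412, '2023-01-10', '2023-01-10'],
    ['J002', 'u001', 'ALDS1_1_B', 'C++', 'Wrong Answer', 0.01, 3200, 388, '2023-01-11', '2023-01-11'],
], columns=columns)

# typical preprocessing step: derive a numeric feature from the judge's verdict
logs['accepted'] = (logs['v'] == 'Accepted').astype(int)
print(logs[['u', 'p', 'v', 'ct', 'mu', 'cs', 'accepted']])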

7.4 DATASET DESIGN:

TABLE DESIGN

STUDENTS EDUCATIONAL DATASET DESIGN:

LEVEL-2:

CHAPTER 8

TESTING AND IMPLEMENTATION

8.1 IMPLEMENTATION

Implementation is the stage of the project when the theoretical design is turned
into a working system. Thus, it can be considered to be the most critical stage in achieving a
successful new system and in giving the user confidence that the new system will work and
be effective. The implementation stage involves careful planning, investigation of the existing
system and its constraints on implementation, designing of methods to achieve changeover,
and evaluation of changeover methods.

8.2 INPUT AND OUTPUT DESIGN

INPUT DESIGN

The input design is the link between the information system and the user. It comprises the
developing specification and procedures for data preparation, and those steps are necessary to
put transaction data into a usable form for processing. This can be achieved by inspecting the
computer to read data from a written or printed document, or it can occur by having people
key the data directly into the system.

The design of input focuses on controlling the amount of input required, controlling the
errors, avoiding delay, avoiding extra steps and keeping the process simple.

The input is designed in such a way so that it provides security and ease of use with retaining
the privacy. Input Design considered the following things:

• What data should be given as input?


• How the data should be arranged or coded?
• The dialog to guide the operating personnel in providing input.
• Methods for preparing input validations and steps to follow when errors occur.

OBJECTIVES

1. Input Design is the process of converting a user-oriented description of the input into a
computer-based system. This design is important to avoid errors in the data input process and
show the correct direction to the management for getting correct information from the
computerized system.
2. It is achieved by creating user-friendly screens for data entry to handle large volumes
of data. The goal of designing input is to make data entry easier and free from errors.
The data entry screen is designed in such a way that all the data manipulations can be
performed. It also provides record viewing facilities.
3. When the data is entered, it is checked for validity. Data can be entered with the help of
screens. Appropriate messages are provided as and when needed so that the user is not left in
a maze at any instant. Thus, the objective of input design is to create an input layout that is easy to
follow.

OUTPUT DESIGN

A quality output is one which meets the requirements of the end user and presents the
information clearly. In any system, the results of processing are communicated to the users and to
other systems through outputs. In output design, it is determined how the information is to be
displayed for immediate need and also as hard copy output. It is the most important and
direct source of information to the user. Efficient and intelligent output design improves the
system's relationship to help user decision-making.

1. Designing computer output should proceed in an organized, well thought out manner; the
right output must be developed while ensuring that each output element is designed so that
people will find the system easy to use effectively. When analysts design computer
output, they should identify the specific output that is needed to meet the requirements.

2. Select methods for presenting information.

3. Create document, report, or other formats that contain information produced by the system.

The output form of an information system should accomplish one or more of the following
objectives.

• Convey information about past activities, current status or projections of the future.
• Signal important events, opportunities, problems, or warnings.
• Trigger an action.
• Confirm an action.

8.3 SYSTEM TESTING


The purpose of testing is to discover errors. Testing is the process of trying to discover every
conceivable fault or weakness in a work product.

It provides a way to check the functionality of components, sub-assemblies, assemblies
and/or a finished product. It is the process of exercising software with the intent of ensuring that
the software system meets its requirements and user expectations and does not fail in an
unacceptable manner.

There are various types of tests; each test type addresses a specific testing requirement.

8.4 TYPES OF TESTING


Unit Testing
Unit testing involves the design of test cases that validate that the internal program logic is
functioning properly, and that program inputs produce valid outputs. All decision branches and
internal code flow should be validated.

It is the testing of individual software units of the application. It is done after the
completion of an individual unit before integration. This is structural testing that relies on
knowledge of its construction and is invasive.

Unit tests perform basic tests at component level and test a specific business process,
application, and/or system configuration. Unit tests ensure that each unique path of a business
process performs accurately to the documented specifications and contains clearly defined
inputs and expected results.

Integration Testing
Integration tests are designed to test integrated software components to determine if they
actually run as one program.

Testing is event driven and is more concerned with the basic outcome of screens or fields.

Integration tests demonstrate that although the components were individually satisfactory,
as shown by successful unit testing, the combination of components is correct and consistent.
Integration testing is specifically aimed at exposing the problems that arise from the combination
of components.

Functional Test
Functional tests provide systematic demonstrations that functions tested are available as specified
by the business and technical requirements, system documentation, and user manuals.

Functional testing is centered on the following items:


Valid Input : identified classes of valid input must be accepted.

Invalid Input : identified classes of invalid input must be rejected.

Functions : identified functions must be exercised.

Output : identified classes of application outputs must be exercised.


Systems/Procedures: interfacing systems or procedures must be invoked.

Organization and preparation of functional tests is focused on requirements, key functions, or


special test cases.

In addition, systematic coverage pertaining to identify Business process flows; data fields,
predefined processes, and successive processes must be considered for testing.

Before functional testing is complete, additional tests are identified and the effective value of
current tests is determined.

System Test
System testing ensures that the entire integrated software system meets requirements. It tests a
configuration to ensure known and predictable results.

An example of system testing is the configuration-oriented system integration test.

System testing is based on process descriptions and flows, emphasizing pre-driven process links
and integration points.

White Box Testing


White box testing is testing in which the software tester has knowledge of the inner
workings, structure and language of the software, or at least its purpose.

It is used to test areas that cannot be reached from a black-box level.

Black Box Testing


Black Box Testing is testing the software without any knowledge of the inner workings, structure
or language of the module being tested.

Black box tests, as most other kinds of tests, must be written from a definitive source
document, such as specification or requirements document, such as specification or requirements
document.

It is testing in which the software under test is treated as a black box; you cannot “see”
into it.

The test provides inputs and responds to outputs without considering how the software works.

Unit Testing:

Unit testing is usually conducted as part of a combined code and unit test phase of the software
lifecycle, although it is not uncommon for coding and unit testing to be conducted as two distinct
phases.

Test Strategy and Approach


Field testing will be performed manually and functional tests will be written in detail.

Test objectives
 All field entries must work properly.
 Pages must be activated from the identified link.
 The entry screen, messages and responses must not be delayed.

Features to be tested
 Verify that the entries are of the correct format
 No duplicate entries should be allowed
 All links should take the user to the correct page.

Integration Testing
Software integration testing is the incremental integration testing of two or more integrated
software components on a single platform to produce failures caused by interface defects.

The task of the integration test is to check that components or software applications, e.g.
components in a software system or – one step up – software applications at the company level –
interact without error.

Test Results: All the test cases mentioned above passed successfully. No defects
encountered.

Acceptance Testing
User Acceptance Testing is a critical phase of any project and requires significant participation by the
end user.

It also ensures that the system meets the functional requirements.

Test Results

All the test cases mentioned above passed successfully. No defects encountered.

CHAPTER 9
CONCLUSION AND FUTURE ENHANCEMENT

9.1 CONCLUSION

This study presented an EDM framework for data clustering, pattern, and rule mining using real-world
problem-solving data. A mathematical model for data preprocessing, together with the MK-means and
FP-growth algorithms, was used to conduct this study. For programming education, OJ
systems have been adopted by many institutions as academic tools. As a result, a huge
number of programming-related resources (source codes, logs, scores, activities, etc.) are
regularly accumulated in OJ systems. In this study, a large amount of real-world problem-
solving data collected from the AOJ system was used in the experiments. Problem-solving
data preprocessing is one of the main tasks to achieve accurate EDM results. Therefore, a
mathematical model for problem-solving data preprocessing is developed. Then, the
processed data are clustered using Elbow and MK-means algorithms. Various statistical
features, data patterns and rules are extracted from each cluster based on different threshold
values (K, minConf, minSup). These results can effectively contribute to the improvement of
overall programming education. Moreover, based on the experimental results, some pertinent
suggestions have been made. Furthermore, the proposed framework can be applied to other
practical/exercise courses to demonstrate data patterns, statistical features, and rules. Besides,
any third-party applications with similar data resources such as AlgoA, ProgA, FCT, and FPT,
can use the proposed approach for EDM and analysis.

9.2 FUTURE ENHANCEMENT

In the future, the experimental results of EDM using problem-solving data can be
integrated to visualize different LA for programming platforms such as the OJ system. In
addition, fuzzy estimation and polynomial approximation methods can be handy to
dynamically select the optimal minSup values based on the dataset. Appropriate minSup
values could help to generate the actual number of frequent elements and association rules
from the dataset.

CHAPTER 10

APPENDICES

10.1 SOURCE CODE

STUDENT EDUCATIONAL DATA MINING:

#!/usr/bin/env python
# coding: utf-8

# # Student Grade Analysis & Prediction

# # Import Libraries

# In[1]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
get_ipython().run_line_magic('matplotlib', 'inline')

# # Dataset

# In[2]:

stud = pd.read_csv('student-mat.csv') # Read the dataset

# In[3]:

print('Total number of students:',len(stud))

# In[4]:

stud['G3'].describe()

# In[5]:

stud.info() # Information on dataset

# In[6]:

stud.columns # Dataset Columns



# In[7]:

stud.describe() # Dataset description

# In[8]:

stud.head() # First 5 values of dataset

# In[9]:

stud.tail() # Last 5 values of dataset

# In[10]:

stud.isnull().any() # To check any null values present in dataset

# In[13]:

import cufflinks as cf
cf.go_offline()



# In[14]:

stud.iplot() # Plot for all the attributes

# In[15]:

stud.iplot(kind='scatter',x='age',y='G3',mode='markers',size=8) # Plot for age vs G3

# In[16]:

stud.iplot(kind='box')

# In[17]:

stud['G3'].iplot(kind='hist',bins=100,color='blue')

# # Data Visualization

# In[18]:

sns.heatmap(stud.isnull(), cmap="rainbow", yticklabels=False)  # To check any null values present in the dataset pictorially

# In[19]:

sns.heatmap(stud.isnull(),cmap="viridis",yticklabels=False) # Map color - viridis

# - There are no null values in the given dataset

# # Student's Sex

# In[20]:

f_stud = len(stud[stud['sex'] == 'F'])  # Number of female students
print('Number of female students:', f_stud)
m_stud = len(stud[stud['sex'] == 'M'])  # Number of male students
print('Number of male students:', m_stud)

# In[21]:

sns.set_style('whitegrid')  # Male & female student representation on countplot
sns.countplot(x='sex', data=stud, palette='plasma')

# - The gender distribution is pretty even.



# # Age of Students

# In[22]:

b = sns.kdeplot(stud['age'])  # Kernel Density Estimation
b.axes.set_title('Ages of students')
b.set_xlabel('Age')
b.set_ylabel('Count')
plt.show()

# In[23]:

b = sns.countplot(x='age', hue='sex', data=stud, palette='inferno')
b.axes.set_title('Number of Male & Female students in different age groups')
b.set_xlabel("Age")
b.set_ylabel("Count")
plt.show()

# - The student ages range from 15 to 19, and the gender distribution is fairly even in each
# age group.
# - The age groups above 19 may be outliers, year-back students or dropouts.

# # Students from Urban & Rural Areas

# In[24]:

u_stud = len(stud[stud['address'] == 'U'])  # Number of urban-area students
print('Number of Urban students:', u_stud)
r_stud = len(stud[stud['address'] == 'R'])  # Number of rural-area students
print('Number of Rural students:', r_stud)

# In[25]:

sns.set_style('whitegrid')
sns.countplot(x='address', data=stud, palette='magma')  # Urban & rural representation on countplot

# - Approximately 77.72% of the students come from urban areas and 22.28% from rural areas.

# In[26]:

sns.countplot(x='address',hue='G3',data=stud,palette='Oranges')

# # EDA - Exploratory Data Analysis

# ### 1. Does age affect final grade?

# In[27]:

b = sns.boxplot(x='age', y='G3', data=stud, palette='gist_heat')
b.axes.set_title('Age vs Final Grade')

# - Plotting the distribution rather than statistics would help us better understand the data.
# - The above plot shows that the median grades of the three age groups (15, 16, 17) are similar.
# Note the skewness of age group 19 (possibly due to sample size). Age group 20 seems to score
# the highest grades of all.

# In[28]:

b = sns.swarmplot(x='age', y='G3', hue='sex', data=stud, palette='PiYG')
b.axes.set_title('Does age affect final grade?')

# ## 2. Do urban students perform better than rural students?

# In[29]:

# Grade distribution by address


sns.kdeplot(stud.loc[stud['address'] == 'U', 'G3'], label='Urban', shade = True)
sns.kdeplot(stud.loc[stud['address'] == 'R', 'G3'], label='Rural', shade = True)
plt.title('Do urban students score higher than rural students?')
plt.xlabel('Grade'); plt.ylabel('Density')
plt.show()

# - The above graph clearly shows there is not much difference between the grades based on
location.

# In[30]:

stud.corr()['G3'].sort_values()

# ## Encoding categorical variables using LabelEncoder()

# In[31]:

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
stud.iloc[:,0]=le.fit_transform(stud.iloc[:,0])
stud.iloc[:,1]=le.fit_transform(stud.iloc[:,1])
stud.iloc[:,3]=le.fit_transform(stud.iloc[:,3])
stud.iloc[:,4]=le.fit_transform(stud.iloc[:,4])
stud.iloc[:,5]=le.fit_transform(stud.iloc[:,5])
stud.iloc[:,8]=le.fit_transform(stud.iloc[:,8])
stud.iloc[:,9]=le.fit_transform(stud.iloc[:,9])
stud.iloc[:,10]=le.fit_transform(stud.iloc[:,10])
stud.iloc[:,11]=le.fit_transform(stud.iloc[:,11])
stud.iloc[:,15]=le.fit_transform(stud.iloc[:,15])
stud.iloc[:,16]=le.fit_transform(stud.iloc[:,16])
stud.iloc[:,17]=le.fit_transform(stud.iloc[:,17])
stud.iloc[:,18]=le.fit_transform(stud.iloc[:,18])
stud.iloc[:,19]=le.fit_transform(stud.iloc[:,19])
stud.iloc[:,20]=le.fit_transform(stud.iloc[:,20])
stud.iloc[:,21]=le.fit_transform(stud.iloc[:,21])
stud.iloc[:,22]=le.fit_transform(stud.iloc[:,22])

# In[32]:

stud.head()

# In[33]:

stud.tail()

# In[34]:

stud.corr()['G3'].sort_values() # Correlation wrt G3

# In[35]:

# Drop the school and period-grade columns
stud = stud.drop(['school', 'G1', 'G2'], axis='columns')

# - Although G1 and G2 are period grades of a student and are highly correlated to the final
# grade G3, we drop them. It is more difficult to predict G3 without G1 and G2, but such a
# prediction is much more useful because we want to find the other factors that affect the grade.

# In[36]:

# Find correlations with the Grade
most_correlated = stud.corr().abs()['G3'].sort_values(ascending=False)

# Keep the top 8 features most correlated with the Grade (plus G3 itself)
most_correlated = most_correlated[:9]
most_correlated

# In[37]:

stud = stud.loc[:, most_correlated.index]
stud.head()

# ### Failure Attribute

# In[38]:

b = sns.swarmplot(x=stud['failures'], y=stud['G3'], palette='autumn')
b.axes.set_title('Previous Failures vs Final Grade (G3)')

# **Observation :** Students with fewer previous failures usually score higher

# ### Family Education Attribute ( Fedu + Medu )



# In[39]:

fa_edu = stud['Fedu'] + stud['Medu']
b = sns.swarmplot(x=fa_edu, y=stud['G3'], palette='summer')
b.axes.set_title('Family Education vs Final Grade (G3)')

# **Observation :** Students from more educated families tend to score higher grades

# ### Wish to go for Higher Education Attribute

# In[40]:

b = sns.boxplot(x=stud['higher'], y=stud['G3'], palette='binary')
b.axes.set_title('Higher Education vs Final Grade (G3)')

# **Observation :** Students who wish to go for higher studies score higher

# ## Going Out with Friends Attribute

# In[41]:

b = sns.countplot(x=stud['goout'], palette='OrRd')
b.axes.set_title('Go Out vs Final Grade (G3)')

# **Observation :** Most students report an average (moderate) level of going out with friends.

# In[42]:

b = sns.swarmplot(x=stud['goout'], y=stud['G3'], palette='autumn')
b.axes.set_title('Go Out vs Final Grade (G3)')

# **Observation :** Students who go out a lot tend to score lower

# ### Romantic relationship Attribute

# In[43]:

b = sns.swarmplot(x=stud['romantic'],y=stud['G3'],palette='YlOrBr')
b.axes.set_title('Romantic Relationship vs Final Grade(G3)')

# - Here the romantic attribute value 0 means no relationship and value 1 means in a
# relationship.
#
# **Observation :** Students with no romantic relationship score higher

# ### Reason Attribute



# In[44]:

b = sns.countplot(x='reason', data=stud, palette='gist_rainbow')  # Reason to choose this school
b.axes.set_title('Reason vs Students Count')

# In[45]:

b = sns.swarmplot(x='reason', y='G3', data=stud, palette='gist_rainbow')
b.axes.set_title('Reason vs Final grade')

# **Observation :** The average scores are fairly evenly distributed across the reason attribute.

# # Machine Learning Algorithms

# In[46]:

# Standard ML models for comparison
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import ElasticNet
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.svm import SVR

# Splitting data into training/testing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Metrics
from sklearn.metrics import mean_squared_error, mean_absolute_error, median_absolute_error

# Distributions
import scipy

# In[47]:

# Split the data into training and testing data (75% and 25%)
# We set the random state to achieve the same split every time we run the code
X_train, X_test, y_train, y_test = train_test_split(stud, stud['G3'], test_size=0.25,
                                                    random_state=42)

# In[48]:

X_train.head()
# ## MAE - Mean Absolute Error & RMSE - Root Mean Square Error

# In[49]:

# Calculate mae and rmse
def evaluate_predictions(predictions, true):
    mae = np.mean(abs(predictions - true))
    rmse = np.sqrt(np.mean((predictions - true) ** 2))

    return mae, rmse

# In[50]:

# Find the median
median_pred = X_train['G3'].median()

# Create a list with all values as the median
median_preds = [median_pred for _ in range(len(X_test))]

# Store the true G3 values for passing into the function
true = X_test['G3']

# In[51]:

# Display the naive baseline metrics


mb_mae, mb_rmse = evaluate_predictions(median_preds, true)
print('Median Baseline MAE: {:.4f}'.format(mb_mae))
print('Median Baseline RMSE: {:.4f}'.format(mb_rmse))

# In[52]:

# Evaluate several ML models by training on the training set and testing on the testing set
def evaluate(X_train, X_test, y_train, y_test):
    # Names of models
    model_name_list = ['Linear Regression', 'ElasticNet Regression',
                       'Random Forest', 'Extra Trees', 'SVM',
                       'Gradient Boosted', 'Baseline']
    X_train = X_train.drop('G3', axis='columns')
    X_test = X_test.drop('G3', axis='columns')

    # Instantiate the models
    model1 = LinearRegression()
    model2 = ElasticNet(alpha=1.0, l1_ratio=0.5)
    model3 = RandomForestRegressor(n_estimators=100)
    model4 = ExtraTreesRegressor(n_estimators=100)
    model5 = SVR(kernel='rbf', degree=3, C=1.0, gamma='auto')
    model6 = GradientBoostingRegressor(n_estimators=50)

    # Dataframe for results
    results = pd.DataFrame(columns=['mae', 'rmse'], index=model_name_list)

    # Train and predict with each model
    for i, model in enumerate([model1, model2, model3, model4, model5, model6]):
        model.fit(X_train, y_train)
        predictions = model.predict(X_test)

        # Metrics
        mae = np.mean(abs(predictions - y_test))
        rmse = np.sqrt(np.mean((predictions - y_test) ** 2))

        # Insert results into the dataframe
        model_name = model_name_list[i]
        results.loc[model_name, :] = [mae, rmse]

    # Median value baseline metrics
    baseline = np.median(y_train)
    baseline_mae = np.mean(abs(baseline - y_test))
    baseline_rmse = np.sqrt(np.mean((baseline - y_test) ** 2))

    results.loc['Baseline', :] = [baseline_mae, baseline_rmse]

    return results

# In[53]:

results = evaluate(X_train, X_test, y_train, y_test)
results

# In[54]:

plt.figure(figsize=(12, 7))

# Mean absolute error
ax = plt.subplot(1, 2, 1)
results.sort_values('mae', ascending=True).plot.bar(y='mae', color='violet', ax=ax)
plt.title('Model Mean Absolute Error')
plt.ylabel('MAE')

# Root mean squared error
ax = plt.subplot(1, 2, 2)
results.sort_values('rmse', ascending=True).plot.bar(y='rmse', color='pink', ax=ax)
plt.title('Model Root Mean Squared Error')
plt.ylabel('RMSE')

10.2 SCREEN SHOTS

IMPORT LIBRARY PACKAGES:

VIEW DATASET INFORMATION DETAILS:

DATASET DESCRIPTION:

CHECK NULL VALUES PRESENT IN DATASET:

PLOT FOR ALL THE ATTRIBUTES:

DATA VISUALIZATION:

AGE OF STUDENTS:

STUDENTS FROM URBAN & RURAL AREAS:

EDA - EXPLORATORY DATA ANALYSIS:

FAMILY EDUCATION ATTRIBUTE (FEDU + MEDU):

FRIENDS ATTRIBUTE ANALYSIS:

MAE - MEAN ABSOLUTE ERROR & RMSE - ROOT MEAN SQUARE ERROR:

FINAL RESULT ON MAE AND RMSE:

10.3 JOURNAL

A TEXT CLASSIFICATION OF YOUR DATASET THAT EXTRACTS THE KEY TOPICS

DR.M. GEETHA,
ASSOCIATE PROFESSOR,
DEPARTMENT OF COMPUTER APPLICATIONS (MCA), (AUTONOMOUS),

K.S.R. COLLEGE OF ENGINEERING, TIRUCHENGODE - 637 215

SOWNDARRAJAN N,

DEPARTMENT OF COMPUTER APPLICATIONS (MCA), (AUTONOMOUS),


K.S.R. COLLEGE OF ENGINEERING, TIRUCHENGODE - 637 215

Abstract— Using relevant keywords to manually identify the subject of a huge collection is
generally viewed as an extremely time-consuming job when editing text documents. With the goal
of employing these keywords to obtain a small number of potentially relevant publications for
each brief discussion fragment that may be recommended to participants, the problem of keyword
extraction from discussions is addressed. A suggested method begins by removing terms from the
dialogue using topic modelling approaches. The keywords are selected based on their similarity.
Then, a technique for generating numerous questions with different themes from this keyword
collection is demonstrated in order to raise the probability that at least one connected
suggestion will be returned while searching with these terms throughout English Wikipedia. The
technique's keyword extraction tasks are quite precise and match the cluster's subject matter.
The papers are organized using keywords that are taken from the documents.

Keywords— Identifying keywords, suggesting documents, topic modelling, grouping, and
categorization of texts.

I. Introduction

Unprecedented amounts of information, which can be found in texts, databases, or multimedia
resources, surround humanity. Users' ability to access this information depends on the presence
of the proper search engines, but even in those cases where they are present, users frequently
decide not to do a search because it would interfere with their current task or because they are
unaware that the information they are looking for is available. In this work, we employ the
instantaneous retrieval approach, which corrects this issue by suggesting books that are timely
and pertinent to readers' present activity, available on demand. When these
activities are primarily conversational, users' information requests might be described as
implicit queries that emerge from the conversation in the background and are recovered by
real-time listening, making use of speech recognition software (VRS) during a meeting. Such
implicit searches are used to locate and suggest content from the web or a local storage
facility, which users may opt to explore further if they are interested in it.

This study discusses how to create implicit inquiries for the just-in-time retrieval system that
will be utilized in conference rooms. Our just-in-time retrieval system must build implicit
inquiries from conversational input, which comprises a lot more words than a query. This is
compared to the explicit spoken queries potentially done with corporate digital search engines.
Consider the following example, where four people are asked to name various items, such as
"chocolate," "pistol," or "lighter," that will allow them to live in the mountains. This field
of study is known as "automatic term recognition" in computational linguistics, and as
"automatic indexing" or "automatic keyword extraction".

In order to give a limited sample of suggestions based on the most likely hypotheses, our goal
is to maintain a variety of assumptions regarding users' information needs. This is due to the
possibility of a wide range of topics, which might be made more challenging by probable ASR
mistakes or speech stutters (like "whisk" in this case). Therefore, our objective is to obtain a
broad and relevant keyword collection, organize it into topic-specific queries, rank them
according to importance, and finally provide a sample of the results to the user. The diversity
of keywords increases the chance that a minimum of one of the proposed papers contains the
keywords, while topic-based grouping reduces the risk of ASR mistakes in the queries. A
suggested document can give you the details you need or can point you in the direction of
another useful document by clicking on its hyperlinks. For instance, depending on the frequency
of words, the Wikipedia articles "Light," "Lighting," and "Light My Fire" would be returned.
Users would, however, favor a set that contained words like "Lighter," "Wool," and "Chocolate."

To guarantee relevance and diversity, three stages can be used: extracting the keywords,
generating one or more implicit queries, and reordering the outcomes. This article focuses on
the investigation of the first two strategies. A new study employing the third one [1]
demonstrates that reranking the outcomes of a single implicit query cannot increase the users'
happiness with the recommended texts. Prior to choosing the peak-ranking keywords, keywords are
ranked using word frequency or TF-IDF weights, or approaches for extracting implicit questions
from text are used [2, 3]. This work introduces a unique method for collecting keywords from ASR
output that increases the coverage of users' prospective information demands and minimizes the
usage of unnecessary phrases. After being extracted, the group of keywords is clustered to
produce a number of topically-separated searches that can be used independently and have greater
precision than a single, more complex, topically-mixed query. Eventually, the findings are
compiled into a ranked set before being made available to users as suggestions.
II. Comparable works

The studies on the use of keyword extraction software and categorization methods summarized in
the paragraphs below provide a quick overview of the related data mining work.

In their research, Jordan et al. introduced Latent Dirichlet Allocation (LDA), a generative
probability model for small data sets such as text databases. The probabilistic Latent Semantic
Analysis model and the unigram model are combined to analyze the results of document modelling,
text categorization, and collaborative filtering.

Ishizuka et al. [4] use a single document rather than a corpus for their new keyword extraction
approach. Frequent terms are initially retrieved before extracting a collection connecting each
phrase to the common phrases in terms of co-occurrences. The χ2-measure is used to evaluate a
combination allocation's bias level.

Cong Wang and co-workers reported on a WordNet and PageRank-based keyword extraction method. A
technique known as PageRank is used to assess the vertex relevance of a graph. The result shows
how effective and practical the approach is.

Shireen Ye et al. [2] suggested the Document Concept Lattice (DCL) summary that organizes the
surrounding phrases into a sequence of local themes tied to a set of common methods. Every
phrase in a document can be represented by a set of ideas in base nodes throughout the
construction of the DCL, and typically, thought collections extracted from those base nodes will
create subsequent nodes.

In their investigation of a graph-based method for evaluating a word's importance according to
how it connects to other terms or phrases, the results demonstrate that the straightforward
unsupervised TF-IDF technique performs largely satisfactorily, and that the inclusion of POS and
sentence score data facilitates keyword extraction.

A unique graph-based framework called Topical PageRank was presented by Zhiyuan Liu et al. [8]
to measure word relevance in relation to diverse subjects. Topical PageRank mixes topic
information into random walks for key phrase extraction. The top-scoring words are then
extracted as key phrases after further examining their ranking scores while taking into account
the thematic distribution of the content.

For tasks involving document summarization, a collection of submodular functions was created by
Jeff Bilmes and colleagues [11]. Each of these functions combines two terms: one that favors
variation and one that promotes the succinct as a corpus representative. The relevant submodular
optimization issue is solved efficiently and fast.

Ani Nenkova and others [5] provided a range of techniques for determining key information for
computerized text synthesis. The text is initially represented using topic identification
algorithms based on the input's subjects. Last but not least, a summary is created either by
employing a greedy method to individually select the phrases that will be included one at a
time, or by globally optimizing the selection to find the most effective sentence combinations.
According to David Harwath et al., document synthesis programs are routinely assessed based on
the inherent quality of the summaries they provide. The results indicate that an automatic
subject identification system, which seems to be related to the success of such a summarizing
system, can serve as a low-cost replacement for a human review during the early phases of the
system's implementation.

Three text classification algorithms, namely Naive Bayes (NB), Support Vector Machines (SVM),
and Decision Tree, were introduced by Wongkot Sriurai [13]. The word features are divided into a
number of residual themes using the Latent Dirichlet Allocation method. The greatest outcomes
are produced by the method that uses SVMs to learn the classification model and blends the
feature representation with a topical model.

Menaka [12] provided a description of the text classification technique using keyword
extraction. In order to extract keywords from documents, WordNet and TF-IDF are employed.
Finally, using machine learning algorithms (decision trees, k-Nearest Neighbor, and Naive
Bayes), documents are categorized based on the retrieved keywords. According to a performance
evaluation, the Decision Tree outperforms the other two algorithms in terms of prediction
accuracy for text categorization.

III. Proposed Work

The recommended strategy has the following stages:

1. Identifying the necessary keywords from a conversational fragment's transcript, which are
utilized to locate the suggested documents.
2. The second stage is clustering the keyword set.
3. Use k-Nearest Neighbor to order the keywords.

Diversified Keyword Extraction: Making a topical representation of a conversation fragment is
the aim of topic modelling approaches. From there, content terms are selected as keywords based
on topical similarity, and themes covering a wide range of subjects are rewarded. Diversified
keyword extraction has the advantage of covering as much of the main topics of the conversation
fragment as possible. In order to also cover other subjects, the suggested approach will choose
fewer key terms from each topic.
Fig. 1. A technique for categorizing documents based on keywords.

Fig. 1 depicts schematically the three stages of the suggested method for extracting various
keywords.

1. As shown in Fig. 1, a topic model is utilized to depict the occurrence of the conceptual
subject for every word observed.
2. In each conversation segment, these topic categories are utilized to give weights for each
abstract subject.
3. The next step is to choose the keyword list that comprises the most crucial subjects.

Modelling Conversational Topics: With Latent Dirichlet Allocation (LDA), a topic approach, it is
possible to determine the subject distribution of each word and note using a large number of
training texts [10]. The Mallet toolbox's LDA implementation is used for this study since the
overfitting problem with PLSA does not apply to it.

    βz = (1/N) · Σ_{i=1..N} p(z | wi)                                   (1)

where,
βz = weight of topic z
N = total no. of words
p(z | wi) = probability of word wi for the topic z

Multiple Keyword Extraction Challenge: The goal of the keyword extraction technique is to cover
as many topics as possible. If a conversation fragment covers a set of topics Z and each word
from the fragment can evoke a subset of the topics in Z, the goal is to find a subset of
distinct words S that maximizes the number of topics covered. The topic's contribution is
determined by adding up all of the probabilities of the words in each group of words, regardless
of size, in order to reach the goal. A reward function is then created for each set and topic to
replicate the contribution of the set to the subject. Last but not least, a collection is picked
that maximizes the combined reward values across all themes.

A diverse reward function is defined as follows. First, the contribution r(S, z) of the keyword
set S to the fragment topic z is introduced:

    r(S, z) = Σ_{wi ∈ S} p(z | wi)                                      (2)
The next step involves choosing the keyword set by maximizing the cumulative reward function
across all subjects, which is written as follows:

    R(S) = Σ_{z ∈ Z} βz · r(S, z)^λ                                     (3)

where R(S) is a monotone non-decreasing submodular function and the parameter λ ranges from 0 to
1. In the event where λ = 1, the incentive function is linear and solely evaluates how closely
words relate to the primary subjects of z. Yet when 0 < λ < 1, as soon as a selected term comes
from a subject, the additional reward obtained from that subject diminishes immediately.
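
To make Eqs. (1)-(3) concrete, the following hedged sketch computes topic weights, the
contributions r(S, z), and a greedy selection of keywords that maximizes the diverse reward
R(S). The word-topic probabilities are invented for illustration; this is plain Python and not
the authors' implementation.

# Hedged illustration of Eqs. (1)-(3): greedy diverse keyword selection.
# The word-topic probabilities below are invented; this is not the paper's code.
TOPICS = ['Z1', 'Z2']
p = {  # p(z | w) for each observed word w
    'chocolate': {'Z1': 0.9, 'Z2': 0.1},
    'lighter':   {'Z1': 0.8, 'Z2': 0.2},
    'wool':      {'Z1': 0.2, 'Z2': 0.8},
    'tent':      {'Z1': 0.1, 'Z2': 0.9},
}
LAMBDA = 0.75  # 0 < lambda < 1 makes the reward submodular, which enforces topic diversity

# Eq. (1): topic weight = average of p(z | w) over the N observed words
N = len(p)
beta = {z: sum(pw[z] for pw in p.values()) / N for z in TOPICS}

def r(S, z):
    # Eq. (2): contribution of the keyword set S to topic z
    return sum(p[w][z] for w in S)

def R(S):
    # Eq. (3): diverse reward summed over all topics
    return sum(beta[z] * (r(S, z) ** LAMBDA) for z in TOPICS)

# Greedily pick k keywords, each time adding the word with the largest reward gain
selected, candidates, k = [], set(p), 3
for _ in range(k):
    best = max(candidates, key=lambda w: R(selected + [w]) - R(selected))
    selected.append(best)
    candidates.remove(best)
print(selected)  # a mix of Z1- and Z2-related words rather than words from one topic only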

A. Combinations of keywords:
Retrievers of information typically employ document aggregation to increase the precision of the
findings. To organize similar pieces of data into categories or a collection is the goal of
clustering. Document clustering is the procedure of first figuring out how similar each pair of
documents is, and then grouping them together if the parallels are greater than a certain level
of similarity. Many applications have employed document aggregation. The database system has
made use of it to enhance accuracy and memory utilization, in addition to swiftly locating
related papers.

According to the method described in this study, stop words are first eliminated from the text
before nouns and verbs are found using a dictionary. Finding the document's keywords is made
simpler by the use of these verbs and nouns.

By employing keyword clustering to group documents, the clustering algorithm's feature
dimensions can be reduced. This approach clusters texts by grouping keywords that co-occur in
verbs and nouns and have equivalent probability distributions throughout the target words.

Algorithm 1: Extraction of Multiple Keywords

B. Classification of Text
Text categorization is one of the primary applications of machine learning. The task is the
assignment of a fresh, unlabeled text document to a certain class. The collection of feature
terms that convert into helpful keywords during the training phase is the first of two major
issues that occur in the method of text classification. The actual categorization of the text
during the testing phase utilizing these aspect keywords is the second problem. Prior to
classification of documents, preprocessing is completed. During preprocessing, stop words are
removed, and the words are stemmed. After that, for each term in the document, the TF-IDF and
word frequency are calculated.

Identifying keywords: You might think of keywords as a condensed form of papers and their
evaluation. For several applications linked to text mining, including document retrieval,
website retrieval, document clustering, and document summarization, obtaining keywords is an
essential approach. The primary goal of obtaining keywords is to extract terms based on how
significant they are in the text. The initial action is to choose and preprocess the desired
documents.

Stop Words are Eliminated: Stop words are common in everyday speech but aren't very important in
a retrieval strategy. Stop words are avoided in texts since they make them appear heavier and
less pertinent to readers and analyzers. When stop words are removed, the phrase space becomes
less three-dimensional. The words that appear most frequently in text documents but don't add to
the sense of the texts are prepositions, articles, and pronouns. Stop words are included in this
category. Examples of stop words are the, in, a, an, with, etc. Manuscripts are stripped of stop
words because text mining algorithms do not consider them to be keywords.

Stemming: The filtering process is used to determine a word's root or stem. Stemming strips the
affixes of words, which contain a substantial amount of linguistic data unique to a particular
language. As an example, words like connection, connects, and linked can be reduced to the stem
"connect". The Porter Stemmer technique, the most popular algorithm in English, is used in the
current work.

Term Frequency-Inverse Document Frequency: A statistical measure called Term Frequency-Inverse
Document Frequency (tf-idf) shows how pertinent a phrase is to each document in a collection.
Tf-idf is commonly used as a weighting component in text mining and information retrieval. Its
value grows directly in proportion to a word's frequency in the text and is reduced by the
frequency of the word in the corpus.

A measure of how frequently a phrase appears in a document is called Term Frequency (TF):

    tf(t, d) = 0.5 + 0.5 · f(t, d) / (max. word occurrence in d)        (4)

where the frequency of occurrences of word t in document d is denoted by f(t, d).

Inverse Document Frequency (IDF) is a statistical weight used to assess a term's importance
within a group of text documents:

    idf(t, D) = log( |D| / no. of documents containing term t )         (5)
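
A small hedged sketch of Eqs. (4) and (5) on a toy corpus follows; the documents are invented,
and this plain-Python computation stands in for the Java toolchain described in this article.

# Hedged sketch of Eqs. (4) and (5) on a toy corpus; the documents are invented.
import math
from collections import Counter

docs = [
    "keyword extraction from meeting transcripts",
    "document clustering groups similar documents",
    "keyword based document classification",
]
tokenized = [d.lower().split() for d in docs]

def tf(term, doc_tokens):
    # Eq. (4): augmented term frequency, normalized by the most frequent word of the document
    counts = Counter(doc_tokens)
    return 0.5 + 0.5 * counts[term] / max(counts.values())

def idf(term, corpus):
    # Eq. (5): log of (number of documents / number of documents containing the term)
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / df) if df else 0.0

term = "document"
for i, doc in enumerate(tokenized):
    print(f"doc {i}: tf-idf({term!r}) = {tf(term, doc) * idf(term, tokenized):.3f}")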

Experimentation Findings

The suggested model was constructed on an Intel Pentium i3 processor with 4 GB of RAM. The hard
disc has a capacity of 500 GB. Windows 8.1 serves as the OS. The algorithm is put into practice
with Java NetBeans.

Using the multiple keyword extraction techniques utilized in this section, more relevant
keywords covering a larger range of topics are gathered. The project uses conversations, and the
data is categorized using a free application. RapidMiner is a platform for predictive analytics,
business analytics, machine learning, text analysis, and data mining. Applications for the
technology include research, discipline, teaching, creation of applications, fast prototype
development, and business. It offers data loading and transformation, preprocessing,
visualization, modelling, evaluation, and deployment methods for machine learning and data
mining. This learning-based tool was developed using the Java programming language. The
experimentation in the proposed work makes use of twenty presentations. Collection of
discussions is done by hand. TF-IDF and WordNet are used effectively to derive keywords from the
segments. The procedure is created using Java keyword research. The retrieved keywords are then
saved for classification and clustering.

Keyword clustering performs less well than categorization. Classification is a component of
supervised learning. The training duration and prediction accuracy are utilized to compare the
precision of predictions of the various training models and assess their performance.

Conclusion

Text classification is one of the primary applications of machine learning. The proposed
technique uses text analysis algorithms to pull keywords out of journal articles. The WordNet
vocabulary is used to calculate the semantic distances between terms and to identify the terms
that are most comparable. Based on the obtained keywords, documents are then categorized
employing machine learning techniques: decision trees, k-Nearest Neighbor, and Naive Bayes.
According to a performance evaluation, the Decision Tree outperforms the other two algorithms in
terms of prediction accuracy for text categorization.

References

[1] M. Habibi and A. Popescu-Belis, "Enforcing topic diversity in a document recommender for
conversations," in Proc. 25th Int. Conf. Comput. Linguist. (Coling), 2014, pp. 588–599.
[2] S. Ye, T.-S. Chua, M.-Y. Kan, and L. Qiu, "Document concept lattice for text understanding
and summarization," Inf. Process. Manage., vol. 43, no. 6, pp. 1643–1662, 2007.
[3] D. Harwath and T. J. Hazen, "Topic identification based extrinsic evaluation of
summarization techniques applied to conversational speech," in Proc. Int. Conf. Acoust., Speech,
Signal Process. (ICASSP), 2012, pp. 5073–5076.
[4] Y. Matsuo and M. Ishizuka, "Keyword extraction from a single document using word
co-occurrence statistical information," Int. J. Artif. Intell. Tools, vol. 13, no. 1,
pp. 157–169, 2004.
[5] A. Nenkova and K. McKeown, "A survey of text summarization techniques," in Mining Text Data,
C. C. Aggarwal and C. Zhai, Eds. New York, NY, USA: Springer, 2012, ch. 3, pp. 43–76.
[6] T. J. Hazen, "Latent topic modeling for audio corpus summarization," in Proc. 12th Annu.
Conf. Int. Speech Commun. Assoc., 2011, pp. 913–916.
[7] J. Wang, J. Liu, and C. Wang, "Keyword extraction based on PageRank," in Proc. Adv. Knowl.
Disc. Data Mining (PAKDD), 2007, pp. 857–864.
[8] Z. Liu, W. Huang, Y. Zheng, and M. Sun, "Automatic keyphrase extraction via topic
decomposition," in Proc. Conf. Empir. Meth. Nat. Lang. Process. (EMNLP'10), 2010, pp. 366–376.
[9] F. Liu, D. Pennell, F. Liu, and Y. Liu, "Unsupervised approaches for automatic keyword
extraction using meeting transcripts," in Proc. Annu. Conf. North Amer. Chap. ACL (HLT-NAACL),
2009, pp. 620–628.
[10] D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent Dirichlet allocation," J. Mach. Learn.
Res., vol. 3, pp. 993–1022, 2003.
[11] H. Lin and J. Bilmes, "A class of submodular functions for document summarization," in
Proc. 49th Annu. Meeting Assoc. Comput. Linguist. (ACL), Portland, OR, USA, 2011, pp. 510–520.
[12] N. Radha and S. Menaka, "Text classification using keyword extraction technique," in Proc.
Int. Conf. Adv. Res. Computer Science and Software Engineering (IJARCSSE), 2013, pp. 734–740.
[13] Wongkot Sriurai, "Improving text classification by using a topical model," in Adv.

10.4 RESUME

E-mail: [email protected]
Ph-No: 9597316660

CAREER OBJECTIVE
I seek challenging opportunities where I can fully use my skills for the success of the
organization.

ACADEMIC DETAILS

COURSE   INSTITUTION                                  BOARD / UNIVERSITY    YEAR OF PASSING   PERCENTAGE
MCA      K.S.R College of Engineering (Autonomous)    Anna University       2023              74.5 (till 3rd semester)
BCA      Sowdeswari College, Salem                    Periyar University    2020              68.8
HSC      Sri Jothi Higher Secondary School,           State Board           2017              78.5
         Tharamangalam
SSLC     Government High School, Periyakadampatty     State Board           2015              90.5

SKILLS

 Java
 Python
 Basic of C
 MS office

PERSONAL SKILLS

 Logical Thinking
 Ability to perform in a Team Task
 Hard Worker
 Developing Communication skills

ACHIEVEMENTS
 Certificate of Participated in the one-day National Level Seminar on Novel IoT
Insights and its Artificial Intelligence.
 Certificate of Participated in Design Thinking-Leveraging the Power of your
Mind.
 Certificate of Participated in Workshop at ASTHRA 2022 an international level
technical symposium.

PROJECT

 Educational Data Mining to support programming learning using problemsolving


Data

EXTRA-CURRICULAR ACTIVITIES

 Playing cricket
 Hearing music & songs

HOBBIES

 Reading Books & Newspaper
 Searching about new Technologies

PERSONAL INFORMATION

Date of Birth : 12.02.2000


Gender : Male
Marital status : Single
Blood group : O+ve
Linguistic status : Tamil & English

COMMUNICATION DETAILS

Address: 8/119, s/o Natesan, Nallankadu,
Puthurkadampatty, Semmankoodal (Post),
Omalur (TK), Salem (DT) - 636304.
Place:
N. SOWNDARRAJAN
Date:


K.S.R. COLLEGE OF ENGINEERING (AUTONOMOUS)


Approved by AICTE, Accredited by NAAC with ‘A++’ grade
K.S.R. Kalvi Nagar, Tiruchengode-637-215
Namakkal District, Tamil Nadu
Ph: 04288-274213 Fax: 04288-274757

E-mail: [email protected]

POs & PSOs MAPPING

ACADEMIC YEAR: 2022-2023


Name of the Student : SOWNDARRAJAN N

Register Number : 73152162050

Department : MASTER OF COMPUTER APPLICATIONS

Batch : 2021- 2023


Title of the Project    Project Outcomes (Min. 3 Points)    POs Mapped    PSOs Mapped

Signature of the Supervisor Programme Coordinator Head of the Department
