International Journal of Computer Applications Technology and Research

Volume 8–Issue 09, 340-347, 2019, ISSN: 2319-8656

Design of Microstrip Reflective Array Antenna


Meng Li
School of Communication Engineering
Chengdu University of Information Technology
Chengdu, China

Abstract: With the development of communication technology, the requirements on communication quality keep rising. Traditional array antennas require complex feed networks and phase shifters, while conventional reflector antennas are bulky and difficult to manufacture. It is therefore necessary to analyze the characteristics of reflectarray antennas to meet future needs. The microstrip antenna has the advantages of small size, simple structure, and low profile. This paper therefore introduces the working principle and design process of the microstrip reflectarray antenna in detail. A double-ring unit antenna operating at f = 4.5 GHz was designed, which simplifies the shape of the antenna element and achieves a beam pointing of 30°. Compared with similar designs in the literature, the new unit antenna has a simple structure, realizes beam orientation without a phase shifter, works in the low-frequency range of 5G, and has high engineering value.

Keywords: Beam orientation; Simple structure; Phase shifter; Microstrip antenna; 5G communication

1. INTRODUCTION

With the rapid development of modern microwave communication, satellite communication, and 5G technology, parabolic antennas play an increasingly important role [1-3]. However, modern society requires flexible operation of communication systems, and traditional parabolic antennas are cumbersome and bulky [4]. The planar array antenna requires a complicated power-distribution feed network, phase shifters, and so on, which easily increase the transmission loss of the antenna and reduce its transmission efficiency [5]. The microstrip reflectarray combines the advantages of a reflector antenna and a microstrip antenna, and is highly valued for its light weight, small size, low price, and ease of manufacture. Reflective elements of different shapes are discussed in the survey by Dahri, M. Hashim, et al.; from their discussion of various design and architectural features it can be seen that reflectarrays of different structures have different phase-shifting ranges [6]. However, the element shapes in that work are complicated, which increases the difficulty of production. Therefore, this paper designs a microstrip unit with a simple double-ring structure: the structure is simple, compensates phase delays over 0-360°, and realizes a 30° beam-pointing function at f = 4.5 GHz. It can also be used in 5G mobile communication systems and other wireless communication systems in this frequency band, and has high engineering practical value.

2. WORKING PRINCIPLE

The microstrip reflectarray mainly consists of two parts: a feed source and a reflective dielectric plate printed with a large number of microstrip elements.


Figure.1 Reflective array antenna working diagram

Working mechanism of the microstrip planar reflectarray: assuming that the reflecting elements are all in the far-field region of the feed, the electromagnetic waves illuminating each reflection unit can be regarded as plane waves [7]. For a spherical wave, the phase is proportional to the distance between the feed phase center and each reflective unit. When the electromagnetic wave radiated by the feed travels from the feed phase center to the radiating elements on the reflectarray, the transmission distance from the feed to each unit is different, so there is a wave-path difference between the units and the incident field received by each unit has a different spatial phase delay. The size parameters of each unit are designed according to the phase center of the feed horn and the specified beam pointing, so that each unit properly compensates the incident field [8]. From ray theory, the phase that the ith unit in Figure 1 needs to compensate is:

    φi = 2πN + k0 (Ri − ri · r̂0)   (N = 0, 1, 2, ...)   (1)

where k0 is the free-space wavenumber, Ri is the distance from the feed phase center to the ith patch, ri is the position vector from the center of the array to the ith patch, r̂0 is the unit vector along the outgoing main beam, and the term 2πN indicates that the phase compensation period is 2π [8-10].

2.1 Spatial phase delay between array elements

The spatial phase delay in a planar reflectarray antenna arises primarily because the peripheral elements of the array are at unequal distances from the feed, which produces differences in the electric-field transmission paths.

Figure.2 Spatial wave path difference between the units of the reflective array antenna

Therefore, the spatial phase difference that needs to be compensated is:

    Δφi = k0 (Ri − R0)   (2)

where R0 is the distance from the feed phase center to the center of the array. Therefore, in the microstrip array design, this spatial phase delay must be compensated first, so that the incident-field phase is the same at every unit of the array.
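For concreteness, the following short sketch evaluates the compensation phase of (1) for every element of a small planar array. It is a minimal illustration under an assumed geometry (feed height, grid spacing, steering angle); none of these numeric values come from the paper.

    import numpy as np

    def compensation_phase(feed_pos, elem_pos, beam_dir, freq_hz):
        """Compensation phase of eq. (1), wrapped to [0, 2*pi) so 2*pi*N is absorbed."""
        k0 = 2 * np.pi * freq_hz / 3e8                   # free-space wavenumber k0
        R = np.linalg.norm(elem_pos - feed_pos, axis=1)  # feed-to-element distances R_i
        proj = elem_pos @ beam_dir                       # r_i . r0_hat along the main beam
        return np.mod(k0 * (R - proj), 2 * np.pi)

    # Assumed geometry: 4x4 grid, 30 mm spacing, feed 0.3 m above the array
    # center, beam steered to theta = 30 deg in the xz-plane.
    spacing = 0.030
    coords = (np.arange(4) - 1.5) * spacing
    gx, gy = np.meshgrid(coords, coords)
    elements = np.column_stack([gx.ravel(), gy.ravel(), np.zeros(16)])
    beam = np.array([np.sin(np.radians(30.0)), 0.0, np.cos(np.radians(30.0))])
    phases = compensation_phase(np.array([0.0, 0.0, 0.3]), elements, beam, 4.5e9)
    print(np.degrees(phases).reshape(4, 4).round(2))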


2.2 Analytical method for the phase-shift characteristics of microstrip reflectarray elements

2.2.1 Unit antenna model

The isolated unit model does not consider the mutual coupling effects of the surrounding elements; it directly excites an individual isolated element with plane waves and obtains the phase delay produced by the unit from the phase contrast between the reflected and the incident waves. F. Venneri et al. extracted a square


variable-size unit antenna using an isolated unit model. The model is fast to compute, but the drawback is that this analysis method applies only to large cell spacings, where mutual coupling can be ignored.

2.2.2 Infinite period model

In the infinite periodic cell model, Floquet theory is used to simulate an infinite-array environment, so the influence of the cell spacing on the unit reflection field can be calculated. Therefore, in the unit design, only a single unit model needs to be computed to complete the calculation of the infinite array.

Figure.3 Infinite period model

Similar to the simulation model of a general microstrip antenna, the model mainly consists of three parts: the excitation port, the surrounding periodic boundary, and the patch unit. The difference is that the radiation boundary condition of the general model is replaced by a periodic boundary condition.

(1) Waveguide approach (WGA)

Figure.4 Waveguide simulator model

(2) Master-slave boundary method

The infinite periodic array is simulated by two pairs of master-slave boundaries with a Floquet port. The excitation differs from an ordinary wave port in that two mutually perpendicular electric-field excitations are applied to the radiating element on the upper surface of the port, and a PML absorbing layer is placed between the master and slave boundaries. This makes up for the limitation that the waveguide simulator can model only normal incidence of plane waves. The master-slave boundary method can illuminate the reflecting unit from any direction, but it must be ensured that the fields on the two master-slave boundaries have the same amplitude and direction.

Figure.5 Master-slave boundary method model

2.3 Several typical phase compensation methods

During the analysis, the dielectric material, dielectric thickness, and cell spacing remain unchanged.

2.3.1 Loaded transmission line unit

Each patch unit in the array has the same shape and size, but the length of the microstrip line connected to it differs; the required phase shift is adjusted by changing the length of the microstrip line of each patch in the array.


Figure.6 Antenna model and its phase shift range

It is observed from the phase-shift curve in the figure that the rectangular patch of the loaded-line unit occupies part of the front surface, so the phase-shift range of this type of unit is limited (Δφ = 332°).

2.3.2 Variable size

The array consists of reflective elements of the same shape but different sizes; the appropriate amount of phase shift is provided by adjusting the patch size of each cell.

Figure.7 Antenna model and its phase shift range

The phase-shift curve of the square patch unit is close to an "S" shape. It can be seen from the figure that the phase-shift range is 327°.

2.3.3 Slotted unit

Each patch unit in the array has the same shape and size and is slotted on the ground plane on the back side. The size of the slot is determined by the amount of phase shift required.

Figure.8 Antenna model and its phase shift range

As seen in Figure 8, the phase-shift range is Δφ = 363°.

3. NEW REFLECTIVE ARRAY ANTENNA

3.1 Reflection unit structure

A new multi-layer unit is proposed in the literature [11], with a phase-shift range exceeding 450°. The literature [12] proposed a windmill-type unit with a phase-shift range of 700°. However, both units are complicated in structure and difficult to manufacture. Therefore, this paper designs a new type of unit with a double-ring structure, as follows:

Figure.9 Antenna structure

The unit consists of two annular patches: the inner ring is a square ring, the outer ring is a circular ring, and there is a gap between the upper and lower sides. The substrate is Rogers RT/duroid 5880 with thickness t = 2 mm and dielectric constant 2.2. (The relevant dimensions are marked in the figure.)

3.2 Influence of unit structure parameters on phase-shift characteristics

(1) Effect of thickness t on phase-shifting performance

Figure.10 Effect of different thicknesses t on phase shifting performance

As can be seen from Fig. 10, as the thickness of the substrate increases, the resonance points of the phase-shift curve gradually approach each other and the phase-shift range decreases, but it still covers 360°. Finally, t = 4 mm is taken.

(2) Effect of parameter d on phase-shifting performance

Figure.11 Effect of different parameters d on phase shifting performance

As can be seen from Fig. 11, as the parameter d increases, the outer-ring width of the unit antenna increases, the curve becomes steeper, and the distance between the resonance points increases. Finally, d = 1.9 mm is taken.

(3) Effect of parameter g on phase-shifting performance


Figure.12 Effect of different parameters g on phase shifting performance

As can be seen from Fig. 12, as g gradually increases, that is, as the gap between the inner and outer rings becomes larger, the phase-shift range does not change much. Finally, g = 3.3 mm is taken. In summary, the optimized unit parameters are:

Table 1. Optimized unit parameters (mm)

Variable   a    t    hs      h    l    d     g
Value      30   4    0.035   15   20   1.9   3.3

The optimized phase-shift curve is shown below:

Figure.13 Optimized curve

It can be seen that the phase-shift range is Δφ = 395°, greater than 360°, which meets the design requirements.

3.3 Microstrip reflectarray design

The analysis above shows that the phase-shift performance of the new reflection unit covers 0-360° and that the linearity of the phase-shift curve is good. Therefore, this reflecting unit is used to design a reflectarray antenna. When the feed is far enough away from the planar array, its radiation can be regarded as a plane wave. Assuming that the beam points to θ = 30°, the phase delay between adjacent array units is:

    Δφ = (2π f a / c) sin θ   (3)

where f is the working frequency, c is the speed of light in vacuum, and a is the spacing between adjacent units.

The array adopts a 4×4 arrangement with cell spacing a = 30 mm and dielectric constant 2.2. The beam is directed to θ = 30°, so the phase of the cells in each row and column needs to be compensated as listed in Table 2:


Table 2. Required compensation phase of each cell (degrees)

Row \ Column       1          2          3          4
1               -33.24    -114.24    -195.24    -276.24
2              -114.24    -195.24    -276.24    -357.24
3              -195.24    -276.24    -357.24    -438.24
4              -276.24    -357.24    -438.24    -519.24

According to Table 2, the corresponding dimensions of each unit are as follows:

Table 3. Corresponding dimensions of each unit (mm)

Row \ Column       1        2        3        4
1               19       21.75    22.25    23
2               21.75    22.25    23       24.9
3               22.25    23       24.9     21.35
4               23       24.8     21.25    28.1
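As a consistency check, the phase steps in Table 2 follow directly from (3): with f = 4.5 GHz, a = 30 mm, and θ = 30°, the progressive phase between adjacent cells is 81°, which is exactly the difference between neighboring Table 2 entries. A short sketch (the -33.24° reference phase of the first cell is taken from Table 2; everything else is computed):

    import numpy as np

    f, a, theta, c = 4.5e9, 0.030, np.radians(30.0), 3e8
    dphi = np.degrees(2 * np.pi * f * a / c * np.sin(theta))  # progressive phase per cell
    print(round(float(dphi), 2))  # 81.0 degrees

    # 4x4 compensation phases referenced to the first cell's -33.24 deg
    # (reference value taken from Table 2 of the paper).
    m, n = np.meshgrid(np.arange(4), np.arange(4), indexing="ij")
    print(-33.24 - dphi * (m + n))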

A reflection array is made based on the dimensional data in Table 3:

Figure.14 Microstrip reflective array antenna

The reflectarray antenna is simulated and optimized using the Ansoft HFSS software:

Figure.15 Gain curve of the reflective array



It can be seen from the gain curve (Figure 15) that the main beam points in the 30° direction, consistent with the original design, thus verifying that the double-ring unit can achieve beam orientation.

4. CONCLUSION

This paper proposes a microstrip reflectarray antenna that can be used in the 5G FR1 band; it compensates the phase of the incident field by changing the structure of the antenna unit. The beam-pointing angle is set, and the sizes of the reflectarray units are calculated accordingly. Finally, the main beam of the antenna is accurately oriented to the preset 30°, achieving beam directivity. Compared with similar designs in the literature, the unit antenna has a simple structure and is easy to process and design. It can be used in 5G FR1 mobile communication systems and other wireless communication systems, and has high engineering practical value.

5. REFERENCES

[1] Ta, Son Xuat, H. Choo, and I. Park. "Broadband Printed-Dipole Antenna and Its Arrays for 5G Applications." IEEE Antennas and Wireless Propagation Letters PP.99 (2017): 1-1.

[2] Li, Xichun, et al. "The Future of Mobile Wireless Communication Networks." International Conference on Communication Software & Networks, 2009.

[3] Sharma, Pankaj. "Evolution of Mobile Wireless Communication Networks-1G to 5G as well as Future Prospective of Next Generation Communication Network." International Journal of Computer Science & Mobile Computing 2.8 (2013).

[4] Huang, J. "Analysis of a microstrip reflectarray antenna for microspacecraft application." TDA Progress Report 120 (1995).

[5] Chang, Zhuang, et al. "A Reconfigurable Graphene Reflectarray for Generation of Vortex THz Waves." IEEE Antennas and Wireless Propagation Letters (2016): 1-1.

[6] Dahri, M. Hashim, et al. "A Review of Wideband Reflectarray Antennas for 5G Communication Systems." IEEE Access PP.99 (2017): 1-1.

[7] Qin, Pei Yuan, Y. J. Guo, and A. R. Weily. "Broadband Reflectarray Antenna Using Sub-wavelength Elements Based on Double Square Meander-Line Rings." IEEE Transactions on Antennas and Propagation 64.1 (2015): 1-1.

[8] Chaharmir, M. R., and J. Shaker. "Design of a broadband, dual-band, large reflectarray using multi open loop elements." Antennas & Propagation Society International Symposium, IEEE, 2010.

[9] Venneri, F., S. Costanzo, and M. G. Di. "Bandwidth Behavior of Closely Spaced Aperture-Coupled Reflectarrays." International Journal of Antennas and Propagation 2012 (2012): 1-11.

[10] Li, Qin Yi, Y. C. Jiao, and G. Zhao. "A Novel Microstrip Rectangular-Patch/Ring-Combination Reflectarray Element and Its Application." IEEE Antennas & Wireless Propagation Letters 8.4 (2009): 1119-1122.

[11] José A. Encinar. "Design of two-layer printed reflectarray using patches of variable size." IEEE Transactions on Antennas and Propagation 49.10 (2001): 1403-1410.

[12] Encinar, J. A., and J. A. Zornoza. "Broadband design of three-layer printed reflectarrays." IEEE Transactions on Antennas and Propagation 51.7 (2003): 1662-1664.
International Journal of Computer Applications Technology and Research
Volume 8–Issue 09, 348-352, 2019, ISSN: 2319-8656

The “Promotion” and “Call for Service” Features in the Android-Based Motorcycle Repair Shop Marketplace
Ketut Wahyu Kartika Nugraha, I Made Sukarsa, Ni Putu Sutramiani
Department of Information Technology, Faculty of Engineering
Udayana University
Badung, Bali, Indonesia

Abstract: The motorcycle repair shop business continues to grow, along with the growing number of motorcycle riders in Indonesia. However, most riders do not know where repair shops are, especially in remote locations or in areas they have never visited before. This problem can keep such businesses from lasting long. The Motorcycle Repair Shop Information System Application is useful for answering problems related to motorcycle repair shops. "Call for Service" and "Promotion" are the two main features of the application that implement E-CRM. The "Call for Service" feature is used to make emergency calls to the nearest repair shop when an unexpected situation occurs on the road. The "Promotion" feature is used as a medium to attract as many customers as possible and to increase customer loyalty by offering attractive promotions to application users. The implementation uses computers with React Native, SQLyog, XAMPP, and Visual Studio Code, together with Android smartphones. Black box testing of the application shows that users can use the "Call for Service" and "Promotion" features. The data-growth analysis shows that the application requires only about 73.7 megabytes (73,746.133 kilobytes) of storage within a year, assuming 25 new data records every day.

Keywords: E-CRM; mobile application; emergency call; promotion; customer loyalty.

1. INTRODUCTION

The number of motorcycle riders throughout Indonesia has increased by an average of 7.5 million vehicles per year, calculated from 2010 to 2017 [1]. The growing use of motorcycles in Indonesia has opened up opportunities to start repair shop businesses at both small and medium scale. Motorcycle riders often face difficult situations, for example a sudden flat tire, an engine that will not start, or a sudden breakdown. Riders are usually not aware of the nearest small repair shop, and the lack of promotion media for such businesses also makes it likely that they will not last long.

The solution created is an Android-based Motorcycle Repair Shop Information System application aimed at motorcycle riders and at the owners of repair shops. The "Call for Service" feature is intended to overcome the problems riders experience in emergency situations, and the "Promotion" feature helps repair shop owners attract as many customers as possible. Both features are implementations of E-CRM, intended to maintain good relations between the repair shops and the users of the Android application.

2. LITERATURE REVIEW

A study by Amrapali Dabhade, K.V. Kale, and Yogesh Gedam discussed an application that can determine the closest route to a hospital. That study was used as a reference for the "Call for Service" feature of the Motorcycle Repair Shop application, which finds the shortest route to a motorcycle rider [2].

A study by Mwangala Mwiya, Jackson Phiri, and Gift Lyoko similarly used GIS (Geographic Information System) technology to report criminal acts to the Zambian police. That research was used in implementing the "Call for Service" feature, which provides the location of the user's position [3].

A study by Trinh Le Tan explains the success factors of implementing E-CRM in e-commerce companies. That study was used as a reference for implementing E-CRM in the promotional features of the application [4].

3. RESEARCH METHODS

There are four steps in conducting the research. The first is analyzing the needs of both the repair shop and the customer; this analysis determines the design of the application so that it can answer the needs of both parties. The second step is designing the system workflow, to verify that the system performs according to the specified procedures. The third step is building the system, both an Android application and a web service aimed at the admin for managing data. The fourth step is testing the system: the application is tested to find the errors contained in it, and if there are many errors or malfunctions, the workflow is redesigned to fix them.

3.1 General Overview of the System

The Android-based Motorcycle Repair Shop Information System has the general overview shown in Figure 1.

Figure. 1 General Overview of the System

The Motorcycle Repair Shop Information System is connected to a database whose data is managed by the admin; this includes motorcycle repair data, application user data, repair shop location data, transaction data, and so on. Customers can use the application to register as users, log in, search for the nearest repair shop, view the data of all repair shops, call a repair shop technician using the "Call for Service" feature with the help of the Geographic Information System (GIS), view promotions, and so on. Repair shops can use the application to register their business, log in and see the emergency call notifications sent by users, give promotions, and more. The user and the repair shop are connected within the application with the help of GIS mapping.

4. CONCEPTS AND THEORIES

This section contains the concepts and theories that support the research: Android, GIS (Geographic Information System), the Google Maps API, customer loyalty, and E-CRM. They are discussed as follows.

4.1 Android

Android is a Linux-based operating system used for mobile phones such as smartphones and tablet computers. It provides an open platform for developers to create their own applications for various mobile devices [5]. Android version 1.1 appeared on March 9th, 2009, and the latest version, 9.0 Pie, was released in 2018. Android is used in everyday life and reaches into all areas of life. It can facilitate transaction activities; for example, in the culinary field, a restaurant transaction can now be completed from an Android smartphone [6]. Game Explore Bali is an educational application that teaches children about culture in Bali [7].

4.2 GIS (Geographic Information System)

GIS (Geographic Information System), called Sistem Informasi Geografis in Indonesian, is an information system designed to work with data that carries spatial information (a spatial reference). It works by capturing, checking, integrating, manipulating, analyzing, and displaying data that spatially refer to the condition of the earth. The main function of GIS is spatial data analysis. From the point of view of geographic data processing, GIS is not a new invention; geographic data has been processed for a long time by various fields of science, the only difference now being the use of digital data [8].

4.3 Google Maps API

Google Maps, a popular provider of digital map services, provides an API. The Google Maps API can be implemented on the web or in an Android/iOS application, and provides a map service that can display real satellite images of the earth, a navigation system for travel routes, and search for registered places such as businesses, recreation areas, and so on [9]. The map and navigation system of Google Maps has begun to be developed in the form of augmented reality; this is intended to improve driving safety, because users can still see the road through the smartphone camera while using the map to navigate their route [10].
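Both the nearest-shop search and the routing features build on geographic coordinates. As a rough illustration of how the nearest repair shop could be selected from stored coordinates, here is a hedged sketch using the haversine great-circle distance; the shop records and the find_nearest_shop helper are hypothetical and are not part of the application's actual source code.

    import math

    def haversine_km(lat1, lon1, lat2, lon2):
        """Great-circle distance between two (lat, lon) points in kilometers."""
        r = 6371.0  # mean Earth radius, km
        p1, p2 = math.radians(lat1), math.radians(lat2)
        dp = math.radians(lat2 - lat1)
        dl = math.radians(lon2 - lon1)
        a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
        return 2 * r * math.asin(math.sqrt(a))

    def find_nearest_shop(rider, shops):
        """Return the shop record closest to the rider's (lat, lon)."""
        return min(shops, key=lambda s: haversine_km(rider[0], rider[1], s["lat"], s["lon"]))

    # Hypothetical shop records around Badung, Bali.
    shops = [
        {"name": "Bengkel A", "lat": -8.65, "lon": 115.18},
        {"name": "Bengkel B", "lat": -8.70, "lon": 115.21},
    ]
    print(find_nearest_shop((-8.67, 115.20), shops)["name"])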
4.4 Customer Loyalty

Customer loyalty (in Indonesian, loyalitas pelanggan) is the desire of customers to continue their relationship with a particular company for a long time; loyal customers are those who buy the goods or services of the company from time to time. Loyalty can be interpreted as a customer's willingness to remain a regular customer for a long time, to buy and use the goods of the chosen company, and to recommend them to friends and colleagues. It is evidence of consumers who keep being customers and who hold a strong, positive attitude toward the company. Each customer has a different basis for loyalty, depending on their point of view [11].

4.5 E-CRM

E-CRM is CRM (Customer Relationship Management) implemented electronically using a web browser, the internet, and other electronic media such as e-mail, call centers, and personalization. It is a technique that companies apply online to strengthen the relationship between the company and its customers, with the aim of increasing customer satisfaction and gaining consumer loyalty. It can also be defined as the use of digital communication technology to maximize customer sales and encourage the use of online services [12].

5. RESULT AND DISCUSSION

The results and discussion of the Motorcycle Repair Shop Information System application cover the results of testing the system directly, the results of black box testing, and the results of the data-growth analysis. These three results are discussed as follows.

5.1 System Testing

The customer can make emergency calls to nearby repair shops, and the repair shop can receive the emergency calls made by the application user. The system is tested directly using the Motorcycle Repair Shop Information System application. The call from the customer to the nearest repair shop is displayed in Figure 2.


Figure. 2 “Call For Service” Feature

Figure 2 shows the step of selecting the customer's location before making an emergency call from the customer's application. This location selection is intended to make the customer's position more accurate. Figure 2(a) is the step of displaying a map so that the customer can select their location when making an emergency call. In Figure 2(b), the customer is asked to choose one type of damage and can include notes for the repair shop technician. The "Call for Service" display from the repair shop application's point of view is shown in Figure 3.

Figure. 3 “Call for Service” Feature

Figure 3(a) is the display of incoming emergency calls in the repair shop's application. The repair shop can accept the call by pressing the "TERIMA ORDERAN" button, or ignore the call if they do not want to receive it. Figure 3(b) is a navigation display giving directions to the location of the customer who made the emergency call. A study by Yuli Fauziah, Heru Cahya Rustamaji, and Rihadina Ramadhan created an application that predicts the arrival of Trans Jogja buses by broadcasting locations to passengers, so that the estimated arrival time can be predicted [13]. A study by Made Yudha Putra Mahendra, I Nyoman Piarsa, and Dwi Putra Githa produced a public complaint application that uses the Geographic Information System to record the location of a complaint, so that the admin can read all community complaints and find their locations [14]. Both studies are used as references to predict the mechanic's arrival time to the customer and to find the customer's location in the Motorcycle Repair Shop Information System application.

5.2 Black Box Testing Analysis

Black box testing, also referred to as functional testing, is a testing technique that examines the functions of a system based on particular test cases. Those who perform black box testing do not have direct access to the application source code; they focus only on the output produced in response to the inputs chosen by the examiner and on the execution conditions of the system [14]. The black box test results can be seen in Table 1.

Table 1. Black Box Testing Analysis

1. Adding repair shop promotions.
   Expected result: the added promotion data appears in the promotion menu of the repair shop's application.
   Realization: the new promotion data was successfully added and appears in the repair shop's application. Status: Accepted.

2. Changing repair shop promotion data.
   Expected result: the promotion data is successfully changed in the repair shop's application.
   Realization: the repair shop promotion data was successfully changed. Status: Accepted.

3. Removing repair shop promotion data.
   Expected result: the promotion data selected for deletion is successfully deleted in the application's promotion menu.
   Realization: the deleted promotion data disappears from the promotion menu in the repair shop's application. Status: Accepted.

4. Looking at the list of repair shops that have promotions from the customer's application.
   Expected result: repair shops that have promotions are marked with a green indicator which means "Promotion".
   Realization: a green "Promotion" indicator appears at each repair shop that has a promotion. Status: Accepted.

5. Spotting promotions from the Promotions menu in the customer's application.
   Expected result: the list of promotions provided by the repair shops appears in the application.
   Realization: the list of promotion data provided by the repair shops appears. Status: Accepted.

6. Booking a service from certain promotions.
   Expected result: the customer successfully books a service with a certain promotion.
   Realization: the booking was successfully made, but the promotion calendar system has not functioned properly. Status: Accepted.

7. Making an emergency call to the nearest technician through the customer's application.
   Expected result: the customer can choose their location, fill out a complaint about their vehicle, and find the nearest repair shop from their location.
   Realization: the customer successfully chooses their location, includes their complaint, and makes an emergency call. Status: Accepted.

8. Receiving emergency calls from customers through the repair shop's application.
   Expected result: the technicians can notice emergency calls from the customers and can receive them.
   Realization: the technicians successfully see the emergency call from the customers and receive it. Status: Accepted.

9. Reviewing the list of emergency calls from the repair shop's application.
   Expected result: the repair shop can see all received emergency calls, along with the status of each call.
   Realization: the repair shop can see all received emergency calls, but cannot see the status of the calls. Status: Accepted.

10. Navigating to the customer's location from the repair shop's application.
    Expected result: the technicians can navigate to the customer's location through digital maps.
    Realization: the technicians can navigate to the customer's location on a digital map. Status: Accepted.

11. Tracking the location of the technician from the customer's application.
    Expected result: the customer can monitor, in real time, the presence of the technician who received the emergency call.
    Realization: the customer cannot monitor the presence of the technician. Status: Accepted.

12. Looking at the emergency call transaction history from the customer's application.
    Expected result: the customer can see the emergency call transaction history, along with the total price charged.
    Realization: the customer cannot see the emergency call transaction history. Status: Accepted.

13. Looking at the emergency call transaction history from the repair shop's application.
    Expected result: the repair shop can see the history of received emergency call transactions, showing the services performed and the total price charged.
    Realization: the repair shop can see the history of received emergency call transactions, but cannot see the services performed or the total price charged. Status: Accepted.
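A black box test exercises only inputs and observable outputs, never the implementation. As a purely hypothetical illustration of the first row of Table 1, the sketch below checks only the observable behavior of promotion creation; the add_promotion and list_promotions functions stand in for the application's API, which the paper does not show.

    def add_promotion(db, shop_id, promo):
        """Hypothetical stand-in for the app's promotion-creation endpoint."""
        db.setdefault(shop_id, []).append(promo)
        return True

    def list_promotions(db, shop_id):
        """Hypothetical stand-in for the promotion-menu query."""
        return db.get(shop_id, [])

    # Black box check: chosen input, observed output, no source inspection.
    db = {}
    assert add_promotion(db, "shop-1", "10% off oil change")
    assert "10% off oil change" in list_promotions(db, "shop-1")
    print("Accepted")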

The black box test results in Table 1 indicate that the repair shop can create new promotion data to be offered to customers, change existing promotion data, and delete it from the promotion menu of their application. The customer can see the list of repair shops that provide promotions, both from the repair shop list menu and when booking a service in the booking menu, and can see the various promotion lists that appear in the Promotions menu of their application. In addition, they can directly order services based on certain promotions in the Promotion menu. The customer can make an emergency call to the nearest repair shop technician and track them. The transaction history of an emergency call can also be seen by the customer. The repair shop can receive an emergency call from the customer and navigate to their location through the digital maps. They can also see the status and the history of the received emergency calls.

5.3 Data Growth Analysis

This section explains the estimated storage space requirements of the system database. The estimates are used to predict the database's ability to store data. The analysis calculates the storage space requirements based on the data of each table required by the system. The tables in the Motorcycle Repair Shop Information System database are classified into two groups, the Master Tables and the Transaction Tables. The data-growth analysis for both groups can be seen in Table 2.

Table 2. Data Growth Analysis

                                           Master Tables   Transaction Tables
Number of tables                           4               8
Storage per data row (kilobytes)           2               8.849
New data rows per day                      25              25
Estimated storage,   1 day (kilobytes)     55              221.225
Estimated storage,  30 days (kilobytes)    1,652.25        6,636.75
Estimated storage, 365 days (kilobytes)    20,102.39       73,746.133


Table 2 shows the analysis of data growth for the Master Table and Transaction Table groups, assuming there are 25 new data records per day. The results reveal that the Master Tables require a storage space of 55 kilobytes for a day, 1,652.25 kilobytes for 30 days, and 20,102.39 kilobytes for 365 days. On the other hand, the Transaction Tables require 221.225 kilobytes for a day, 6,636.75 kilobytes for 30 days, and 73,746.133 kilobytes for 365 days.
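As a quick arithmetic check on Table 2's daily and monthly figures, computed here for the transaction tables (the per-row size and daily rate are taken from Table 2):

    # Transaction tables: 8.849 KB per row, 25 new rows per day (values from Table 2).
    kb_per_row, rows_per_day = 8.849, 25
    per_day = kb_per_row * rows_per_day
    print(round(per_day, 3))        # 221.225 KB per day, matching Table 2
    print(round(per_day * 30, 2))   # 6636.75 KB for 30 days, matching Table 2
    # A straight-line 365-day extrapolation gives about 80,747 KB, above the
    # 73,746.133 KB reported in Table 2, so the yearly figure appears to have
    # been estimated separately rather than by linear extrapolation.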
[7] D. P. A. Sanjaya, I. K. A. Purnawan, and N. K. D.
6. CONCLUSION Rusjayanthi, “An Introduction to Balinese Cultural
The Motorcycle Repair Shop Information System Application Traditions through the Android-Based Game Explore
is an Android-based marketplace application that aims to Bali Application (In Indonesian: Pengenalan Tradisi
improve the economic level of repair shop business, and help Budaya Bali melalui Aplikasi Game Explore Bali
the riders in everywhere and at any time by implementing E- Berbasis Android),” Lontar Komput. J. Ilm. Teknol. Inf.,
CRM on the “Call for Service” feature through the vol. 7, no. 3, pp. 162–173, 2016.
application. The Black Box Testing in the application shows
that the user can use the “Call For Service” feature, and [8] E. A. Sholarin and J. L. Awange, “Geographical
reveals that the application has successfully implemented E- information system (GIS),” in Environmental Science
CRM in that feature in order to make an emergency call. In and Engineering (Subseries: Environmental Science),
testing data development analysis, it shows that the Motor 2015.
Repair Shop Information System application only requires a
storage space of 73,746,133 kiloBytes (73,746 MegaBytes) [9] A. Rahmi, I. N. Piarsa, and P. W. Buana, “FinDoctor –
for 365 days, if it is assumed that there are 25 new data per Interactive Android Clinic Geographical Information
day. In the future, the application can still be developed both System Using Firebase and Google Maps API,” Int. J.
in terms of display and new features, such as a “live chat” New Technol. Res., vol. 3, no. 7, pp. 8–12, 2017.
feature with the mechanics when the customers use the “Call
for Service” feature in order to make it easier to communicate [10] I. N. Piarsa, P. W. Buana, and I. G. A. Mahasadhu,
with both parties. “Android Navigation Application with Location-Based
Augmented Reality,” Int. J. Comput. Sci. Issues, vol. 13,
no. 4, 2016.
REFERENCES
[1] POLRI, "The Number of Motorcycle Developments [11] M. Išoraitė, “Customer Loyalty Theoretical Aspects,”
Based on Its Type, from 1987-2008 (In Indonesian: Ecoforum, vol. 5, no. 2, pp. 292–299, 2016.
“Perkembangan Jumlah Kendaraan Bermotor Menurut
Jenis tahun 1987-2008),” 2009. [Online]. Available: [12] A. B. Ramadhan, “The Role Of E-Crm (Electronic
http://www.bps.go.id/tab_sub/view.php?tabel=1&daftar= Customer Relationship Management) in Improving
1&id_subyek=17&notab=12. [Accessed: 06-May-2019]. Service Quality (Study at Harris Hotel & Conventions
Malang) (In Indonesian: Peran E-CRM (Electronic
[2] A. Dabhade, K. V Kale, and Y. Gedam, “Network Customer Relationship Managemen) dalam
Analysis for Finding Shortest Path in Hospital Meningkatkan Kualitas Pelayanan ( Studi pada Harris
Information System,” Int. J. Adv. Res. Comput. Sci. Hotel & Conventions Malang )),” J. Adm. Bisnis, vol. 40,
Softw. Eng., vol. 5, no. 7, pp. 618–623, 2015. no. 1, pp. 194–198, 2016.

[3] M. Mwiya, J. Phiri, and G. Lyoko, “Public Crime [13] Y. Fauziah, H. C. Rustamaji, and R. P. Ramadhan, “The
Reporting and Monitoring System Model Using GSM Implementation of Mobile Crowdsourcing for Estimating
and GIS Technologies : A Case of Zambia Police Bus Arrival Times Based on Community Information (In
Service,” Int. J. Comput. Sci. Mob. Comput., vol. 4, no. Indonesian: Penerapan Mobile Crowdsourching Untuk
11, pp. 207–226, 2015. Estimasi Waktu Kedatangan Bis Berdasarkan Informasi
Masyarakat),” Lontar Komput. J. Ilm. Teknol. Inf., vol.
[4] T. Le Tan, “Successful Factors of Implementation 7, no. 3, p. 150, 2017.
Electronic Customer Relationship Management (e-CRM)
on E-commerce Company,” Am. J. Softw. Eng. Appl., [14] M. Y. P. Mahendra, I. N. Piarsa, and D. Putra Githa,
vol. 6, no. 5, p. 121, 2017. “Geographic Information System of Public Complaint
Testing Based On Mobile Web (Public Complaint),”
[5] T. Cui, Y. Wu, and Y. Tong, “Exploring ideation and Lontar Komput. J. Ilm. Teknol. Inf., vol. 9, no. 2, p. 95,
implementation openness in open innovation projects: 2018.

International Journal of Computer Applications Technology and Research
Volume 8–Issue 09, 353-357, 2019, ISSN: 2319-8656

Optimization of Lift Gas Allocation using Evolutionary Algorithms

Sofía López
University of Barcelona, Spain

Urhan Koç
Istanbul Polytechnic, Turkey

Emma Bakker
Leiden University, Netherlands

Javad Rahmani
Islamic Azad University, Iran
Abstract: In this paper, the particle swarm optimization (PSO) algorithm is proposed to solve the lift gas optimization problem in the crude oil production industry. Two evolutionary algorithms, the genetic algorithm (GA) and PSO, are applied to optimize the gas distribution for the oil-lifting problem at a 6-well and a 56-well site. The performance plots of the gas intakes are estimated with the artificial neural network (ANN) method in MATLAB. A comparison of the simulation results between the evolutionary optimization algorithms and the classical methods shows the better performance and faster convergence of the evolutionary methods over the classical approaches. Moreover, the convergence of PSO is 13 times faster than GA's for this problem.

Keywords: particle swarm optimization; crude oil lifting; lift gas allocation; optimization; artificial neural network; genetic algorithm.

1. INTRODUCTION

There exist a wide variety of natural mechanisms that drive crude oil from underground reservoirs to the surface, including gas expansion and water pressure. When the natural energy available to produce crude oil from a well is not sufficient, artificial lift procedures are used to accomplish the production process. In general, artificial lift processes are divided into two main categories: gas-based lift and pump-based lift [1-8]. Gas lift technology is known as an efficient and economical procedure in the oil production industry. In a gas lift process, the optimal gas injection rate is determined such that it compensates for the hydrostatic and frictional pressure drops in the well [9]. The optimum injection rate is important mainly because of the operating constraints on the available gas intake.

One of the very first studies on gas allocation optimization was conducted by Redden et al. in 1974 [10], who optimized the gas distribution among 30 wells in Venezuela. Their approach was based on gas lift performance (GLP) diagrams, and the optimization criterion was a higher profit rate. Their strategy did not consider any optimization constraints; i.e., they assumed an unlimited amount of gas is available. A similar study was conducted by Mayhill in 1974 [11]. In 1981, Kanu and his colleagues introduced a parameter called the economic slope, a measure of the economic efficiency of a gas lift process; in their approach, the optimal gas allocation was analyzed with and without constraints, i.e., with limited and unlimited gas intake [12]. In a further study, Nishikiori et al. developed a strategy based on the economic slope parameters, in which the optimum amount of gas injection was determined through a pseudo-Newtonian method [13]. In [14], the authors optimized a controller tuning process using particle swarm optimization; the objective in their approach was to maximize the production rate, and they also utilized GLP diagrams. In another study, [15] developed a distributed algorithm to optimize energy allocation in a building environment. In [16], the rate of lift gas injection is determined based on net present value (NPV); that study shows that the maximum profit from production does not necessarily occur at maximum production, and proves that the oil price is an important parameter in the optimization, so an appropriate optimization scenario should be picked considering the oil price. However, the authors did not provide a well-designed model for their strategy. [17] applied control theory principles to optimize the lift gas distribution through a cascaded control strategy. [18] developed an algorithm based on the ant colony algorithm (known as continuous ant colony optimization, or CACO) to solve the gas allocation problem.

In this paper, the optimum amount of lift gas is distributed over a set of wells based on an evolutionary optimization algorithm. It is the first time that the particle swarm optimization (PSO) algorithm is used to find the optimal gas injection rate for the oil lift process. PSO is known to be more efficient and faster in solving such optimization problems than similar evolutionary algorithms such as the genetic algorithm (GA). Moreover, in this study, the artificial neural network (ANN) method is utilized to estimate the performance plots of the gas lift process.

The rest of the paper is organized as follows. The next section explains the two evolutionary algorithms, the genetic algorithm (GA) and particle swarm optimization (PSO). Section 3 describes the challenges of the PSO algorithm. The proposed strategy is presented in section 4, and section 5 includes the simulation results. The work finishes with conclusions in section 6.


2. EVOLUTIONARY ALGORITHMS

In this section, the genetic algorithm (GA) and the particle swarm optimization (PSO) algorithm are explained in detail.

2.1 Genetic algorithm

The genetic algorithm is one of the most important meta-heuristic algorithms, first introduced by Holland in 1975 [19]. It is a type of evolutionary algorithm commonly used in artificial intelligence (AI) and computing. The genetic algorithm maintains a set of candidate solutions to the optimization problem in each generation. The selection process chooses the individuals with the best fitness; these individuals mutate and reproduce new genes [20-26]. The best solutions are thus attained by mimicking the natural processes of gene mutation, selection, and reproduction. In the genetic algorithm, the final goal of selection and mutation is to maximize the fitness or minimize the cost of each individual. The genes adapt themselves to the environmental conditions so that they survive or mutate into genes with higher fitness. The crossover operator is used to produce new offspring from every two parents.
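To make the GA mechanics concrete, the following minimal Python sketch applies elitist truncation selection, per-gene blend crossover, and Gaussian mutation to a toy concave "production" objective on six normalized decision variables. The objective, bounds, and parameter values are illustrative assumptions, not the paper's implementation (which is in MATLAB).

    import random

    def ga_maximize(fitness, n_genes, pop_size=40, gens=100, pc=0.8, pm=0.1):
        """Tiny real-coded GA: elitist truncation selection, blend crossover, gaussian mutation."""
        pop = [[random.random() for _ in range(n_genes)] for _ in range(pop_size)]
        for _ in range(gens):
            elite = sorted(pop, key=fitness, reverse=True)[: pop_size // 2]
            children = []
            while len(children) < pop_size - len(elite):
                a, b = random.sample(elite, 2)
                # Per-gene blend crossover with probability pc.
                child = [(x + y) / 2 if random.random() < pc else x for x, y in zip(a, b)]
                # Gaussian mutation with probability pm, clamped to [0, 1].
                child = [min(1.0, max(0.0, g + random.gauss(0, 0.1)))
                         if random.random() < pm else g for g in child]
                children.append(child)
            pop = elite + children
        return max(pop, key=fitness)

    # Toy fitness: concave "production" curves, genes = normalized gas rates.
    best = ga_maximize(lambda x: sum(g - g * g / 2 for g in x), n_genes=6)
    print([round(g, 2) for g in best])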

2.2 Particle swarm optimization algorithm

Extensive studies have investigated the social behavior of various types of creatures, such as bird flocks and schools of whales, fish, and sharks. The particle swarm optimization (PSO) algorithm is a meta-heuristic computational method that mimics the social behavior of animal swarms; it optimizes a problem by iteratively improving the candidate solutions. The algorithm was first introduced by Kennedy and Eberhart in 1995 [27]. Swarm intelligence is the collective behavior of self-organized systems, and the algorithms in artificial intelligence (AI) follow such a hierarchy directly or indirectly. In the PSO algorithm, two main quantities are updated in each iteration: the velocity term and the position term. The particle's velocity and position are updated through the following equations, respectively:

    vi(t+1) = w vi(t) + c1 r1 (yi(t) − xi(t)) + c2 r2 (ŷ(t) − xi(t))
    xi(t+1) = xi(t) + vi(t+1)

where vi(t) and xi(t) denote the velocity and position of particle i at time t; yi and ŷ represent the personal best solution of the particle and the global best solution, respectively; r1 and r2 are random vectors uniformly distributed in the [0,1] interval; and w, c1, and c2 are the inertia coefficient, the personal learning coefficient, and the collective learning coefficient, respectively. Besides the velocity and position updates, the personal best and global best parameters must also be updated in a standard PSO algorithm.
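A compact sketch of the velocity and position updates above, together with the personal-best and global-best bookkeeping, applied to the same toy objective as the GA sketch. The parameter values (w = 0.7, c1 = c2 = 1.5, swarm size 30) are illustrative assumptions rather than the paper's tuned settings.

    import random

    def pso_maximize(fitness, dim, n=30, iters=200, w=0.7, c1=1.5, c2=1.5):
        """Standard PSO: velocity/position updates plus personal/global best tracking."""
        x = [[random.random() for _ in range(dim)] for _ in range(n)]
        v = [[0.0] * dim for _ in range(n)]
        pbest = [p[:] for p in x]
        gbest = max(pbest, key=fitness)[:]
        for _ in range(iters):
            for i in range(n):
                for d in range(dim):
                    r1, r2 = random.random(), random.random()
                    v[i][d] = (w * v[i][d] + c1 * r1 * (pbest[i][d] - x[i][d])
                               + c2 * r2 * (gbest[d] - x[i][d]))
                    x[i][d] = min(1.0, max(0.0, x[i][d] + v[i][d]))  # clamp to [0, 1]
                if fitness(x[i]) > fitness(pbest[i]):
                    pbest[i] = x[i][:]
                    if fitness(pbest[i]) > fitness(gbest):
                        gbest = pbest[i][:]
        return gbest

    print([round(g, 2) for g in pso_maximize(lambda s: sum(g - g * g / 2 for g in s), dim=6)])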

3. PARTICLE SWARM OPTIMIZATION ALGORITHM CHALLENGES

The particle swarm optimization algorithm has several drawbacks. PSO can easily fall into local optima in high-dimensional optimization problems. Although PSO is fast compared to similar evolutionary algorithms, its convergence rate does not improve with a higher number of iterations; the prominent reason is that the particles converge toward the personal best and global best solutions. To address this issue, the inertia weight w is used to modify the algorithm [28]. Another main drawback of this algorithm is that the quality of the solutions depends strongly on the weighting coefficients and algorithm parameters [29]. Therefore, the PSO parameters should be tuned as well as possible.

4. PROPOSED STRATEGY

To define the optimization problem, we first need to estimate the performance diagrams of the wells at different levels of gas injection. The artificial neural network (ANN) algorithm is utilized in this step to attain the GLP-based performance diagrams. The trained model is then used as the fitness function in the optimization process. Once the convergence criteria are met, the algorithm stops. The PSO algorithm is simulated in the MATLAB environment.


5. SIMULATION RESULTS

In this section, the results of gas allocation optimization using the PSO algorithm are presented and discussed. Two scenarios are considered in our simulations: a low-dimension problem with six wells and a high-dimension problem with 56 wells. Constraints on the amount of available lift gas are included (a limited amount of lift gas is available). The optimization is implemented on the datasets from the research of Buitrago et al. As mentioned, the ANN approach is employed to estimate the performance diagrams of the lift gas. The objective of the constrained optimization problem is to maximize oil production; the upper limit on gas consumption appears only as a constraint, and gas consumption is not a term in the objective function. The objective function and constraints are as in (5):

    maximize  Σi Qi(qi)   subject to  Σi qi ≤ Qmax,  qi ≥ 0   (5)

where Qi(qi) is the oil production of well i at gas injection rate qi (estimated by the ANN model) and Qmax is the total amount of available lift gas.
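One common way to impose the gas budget of (5) inside a GA/PSO fitness function is a simple penalty term. The sketch below illustrates this with the toy quadratic well curves from the earlier sketches standing in for the paper's ANN well models, and an assumed budget q_max; it is not the paper's implementation.

    def constrained_fitness(rates, q_max=3.0, penalty=100.0):
        """Total production minus a penalty when the gas budget of (5) is violated."""
        production = sum(q - q * q / 2 for q in rates)  # stand-in for the ANN well curves
        excess = max(0.0, sum(rates) - q_max)           # constraint: sum of rates <= q_max
        return production - penalty * excess

    print(round(constrained_fitness([0.5] * 6), 3))  # feasible: no penalty applied
    print(round(constrained_fitness([1.0] * 6), 3))  # infeasible: 6.0 > 3.0 is penalized

With this fitness in place, the pso_maximize or ga_maximize sketches above can search the feasible region directly.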

Moreover, the estimates of the performance plots obtained with the neural network approach are illustrated in Figure 1.

Figure 1: The performance plot estimates through the ANN algorithm.

Table 1: Simulation results on a 6-well problem using the proposed method and GA

Table 2: Simulation results on a 56-well problem using the proposed method and GA

The simulation results for the six-well problem and the 56-well problem, using the proposed approach and GA, are shown in Tables 1 and 2, respectively. From Table 1, the optimum oil production is 3425 barrels in the constrained optimization problem with both the PSO and GA approaches. In the 6-well optimization problem, the results from the two evolutionary algorithms, GA and PSO, are almost the same, since it is a low-dimension problem. Obviously, in a higher-dimension optimization problem with more computational complications, the performance of the evolutionary optimization methods becomes recognizably different. Comparing the simulation results on the 56-well problem shows that the proposed evolutionary algorithms performed more than 3% (more than 700 barrels) better than the classical approaches. Therefore, the higher the dimension of the problem, the greater the advantage of the evolutionary optimization algorithms over the classical methods.

Although the results from the GA and PSO approaches are the same, we recommend PSO for the gas allocation optimization problem. To show the superiority of PSO over GA, Figure 2 plots the number of iterations needed to solve the same problem with the two algorithms. From the iteration graph, PSO converges much faster (13 times faster) than GA and requires fewer iterations to solve the same optimization problem, so the operational cost of solving the problem with GA is significantly higher than the cost associated with PSO. The parameter update process in


PSO enhances the convergence pace of the algorithm. The main drawback of GA in this regard is that it does not update its parameters and does not include any tunable parameter in its process.

Figure 2: The number of iterations in PSO and GA for solving the 56-well problem

6. CONCLUSIONS

The gas distribution optimization problem is studied in this paper. The particle swarm optimization (PSO) approach is used for the first time for this problem. The performance plots are attained through artificial neural network (ANN) learning. The proposed strategy is implemented on a high-dimensional (56-well) and a low-dimensional (6-well) problem. The better performance of the evolutionary optimization methods (GA and PSO) over the classical approaches is more recognizable when the problem is of higher dimension (like the 56-well problem). PSO and GA showed similar performances; however, PSO performed much faster (13 times faster) and required fewer iterations than GA.

7. REFERENCES

[1] F. Rahmani, F. Razaghian, and A. Kashaninia, "High Power Two-Stage Class-AB/J Power Amplifier with High Gain and Efficiency," 2014.

[2] M. Ketabdar, "Numerical and Empirical Studies on the Hydraulic Conditions of 90 degree converged Bend with Intake," International Journal of Science and Engineering Applications, vol. 5, pp. 441-444, 2016.

[3] A. Hamedi, M. Ketabdar, M. Fesharaki, and A. Mansoori, "Nappe Flow Regime Energy Loss in Stepped Chutes Equipped with Reverse Inclined Steps: Experimental Development," Florida Civil Engineering Journal, vol. 2, pp. 28-37, 2016.

[4] R. Eini and A. R. Noei, "Identification of Singular Systems under Strong Equivalency," International Journal of Control Science and Engineering, vol. 3, pp. 73-80, 2013.

[5] Rostaghi-Chalaki, Mojtaba, A. Shayegani-Akmal, and H. Mohseni. "Harmonic analysis of leakage current of silicon rubber insulators in clean-fog and salt-fog." 18th International Symposium on High Voltage Engineering, 2013.

[6] Rahimikelarijani, Behnam, et al. "Optimal Ship Channel Closure Scheduling for a Bridge Construction." IIE Annual Conference Proceedings, Institute of Industrial and Systems Engineers (IISE), 2017.

[7] F. Rahmani, F. Razaghian, and A. Kashaninia, "Novel Approach to Design of a Class-EJ Power Amplifier Using High Power Technology," World Academy of Science, Engineering and Technology, International Journal of Electrical, Computer, Energetic, Electronic and Communication Engineering, vol. 9, pp. 541-546, 2015.

[8] Rostaghi-Chalaki, Mojtaba, A. Shayegani-Akmal, and H. Mohseni. "A study on the relation between leakage current and specific creepage distance." 18th International Symposium on High Voltage Engineering (ISH 2013), 2013.

[9] M. Golan and C. H. Whitson, Well Performance, Norwegian University of Science and Technology (NTNU), Trondheim, Norway, Prentice-Hall, Inc., 1991.

[10] J. Redden, T. A. Sherman, and J. Blann, "Optimizing Gas-Lift Systems," in Proceedings of Fall Meeting of the Society of Petroleum Engineers of AIME, 1974.

[11] T. D. Mayhill, "Simplified Method for Gas-Lift Well Problem Identification and Diagnosis," in Fall Meeting of the Society of Petroleum Engineers of AIME, 1974.

[12] E. Kanu, J. Mach, and K. Brown, "Economic Approach to Oil Production and Gas Allocation in Continuous Gas Lift (includes associated papers 10858 and 10865)," J. Pet. Technol., vol. 33, no. 10, pp. 1887–1892, Oct. 1981.

[13] N. Nishikiori, R. A. Redner, D. R. Doty, and Z. Schmidt, "An Improved Method for Gas Lift Allocation Optimization," in Proceedings of SPE Annual Technical Conference and Exhibition, 1989.

[14] R. Eini, "Flexible Beam Robust Loop Shaping Controller Design Using Particle Swarm Optimization," Journal of Advances in Computer Research, vol. 5, pp. 55-67, 2014.

[15] R. Eini and S. Abdelwahed. "Distributed Model Predictive Control Based on Goal Coordination for Multi-Zone Building Temperature." In 2019 IEEE Green Technologies Conference (GreenTech), Lafayette, LA, 2019.

[16] B. T. Hyman, Z. Alisha, and S. Gordon, "Secure Controls for Smart Cities; Applications in Intelligent Transportation Systems and Smart Buildings," International Journal of Science and Engineering Applications, vol. 8, pp. 167-171, 2019. doi: 10.7753/IJSEA0806.1004.

[17] Heng, Li Jun, and Abesh Rahman. "Designing a robust controller for a missile autopilot based on Loop shaping approach." arXiv preprint arXiv:1905.00958 (2019).

[18] Patel, Dev, Li Jun Heng, Abesh Rahman, and Deepika Bharti Singh. "Servo Actuating System Control Using Optimal Fuzzy Approach Based on Particle Swarm Optimization." arXiv preprint arXiv:1809.04125 (2018).

www.ijcat.com 356
International Journal of Computer Applications Technology and Research
Volume 8–Issue 09, 353-357, 2019, ISSN:-2319–8656

[19] J. H. Holland, Adaptation in natural and artificial


systems: an introductory analysis with applications to
biology, control, and artificial intelligence. University
of Michigan Press, 1975.
[20] Bakker, V. Deljou, and J. Rahmani, "Optimal
Placement of Capacitor Bank in Reorganized
Distribution Networks Using Genetic Algorithm,"
International Journal of Computer Applications
Technology and Research (IJCATR), vol. 8, pp. 2319-
8656, 2019.
[21] F. Rahmani, "Electric Vehicle Charger based on
DC/DC Converter Topology," International Journal of
Engineering Science, vol. 18879, 2018.
[22] F. Rahmani and M. Barzegaran, "Dynamic wireless
power charging of electric vehicles using optimal
placement of transmitters," in 2016 IEEE Conference
on Electromagnetic Field Computation (CEFC), 2016,
pp. 1-1.
[23] M. Ketabdar and A. Hamedi, "Intake Angle
Optimization in 90-degree Converged Bends in the
Presence of Floating Wooden Debris: Experimental
Development," Florida Civ. Eng. J, vol. 2, pp. 22-
27.2016, 2016.
[24] M. Ketabdar, A. K. Moghaddam, S. A. Ahmadian, P.
Hoseini, and M. Pishdadakhgari, "Experimental
Survey of Energy Dissipation in Nappe Flow Regime
in Stepped Spillway Equipped with Inclined Steps and
Sill," International Journal of Research and
Engineering, vol. 4, pp. 161-165, 2017
[25] A. Hamedi and M. Ketabdar, "Energy Loss
Estimation and Flow Simulation in the skimming flow
Regime of Stepped Spillways with Inclined Steps and
End Sill: A Numerical Model," International Journal
of Science and Engineering Applications, vol. 5, pp.
399-407, 2016.
[26] Rahimikelarijani, Behnam, Mohammad Saidi-
Mehrabad, and Farnaz Barzinpour. "A mathematical
model for multiple-load AGVs in Tandem layout."
Journal of Optimization in Industrial Engineering
(2019).
[27] J. Kennedy and R. Eberhart, “Particle swarm
optimization,” in Proceedings of ICNN’95 -
International Conference on Neural Networks, 1995,
vol. 4, pp. 1942–1948.
[28] N. Sfeir, H. Sharifi, "Internet of Things Solutions in
Smart Cities," doi: 10.13140/RG.2.2.26015.51367
August 2019.
[29] H. Sharifi, "Singular Identification of a Constrained
Rigid Robot," International Research Journal of
Engineering and Technology (IRJET), vol. 5, pp. 941-
946, 2018.

www.ijcat.com 357
International Journal of Computer Applications Technology and Research
Volume 8–Issue 09, 358-362, 2019, ISSN:-2319–8656

Campus Placement Analyzer: Using Supervised Machine Learning Algorithms

Shubham Khandale, Student, M.Sc. (Big Data Analytics), School of Computer Science, Faculty of Science, MIT-WPU, Pune, Maharashtra, India
Sachin Bhoite, Assistant Professor, School of Computer Science, Faculty of Science, MIT-WPU, Pune, Maharashtra, India

Abstract -- The main aim of every academic aspirant is placement in a reputed MNC, and the reputation and yearly admissions of an institute depend upon the placements it provides to its students. So, any system that can predict students' placements will have a positive impact on an institute, increase its intake, and reduce some of the workload of the institute's training and placement office (TPO). With the help of machine learning techniques, knowledge can be extracted from previously placed students and the placement of upcoming students can be predicted. The data used for training is taken from the same institute for which the placement prediction is done. Suitable data pre-processing methods are applied, along with feature selection. Some domain expertise is used for pre-processing as well as for handling outliers that crept into the dataset. We have used various machine learning algorithms like Logistic Regression, SVM, KNN, Decision Tree and Random Forest, and advanced techniques like Bagging, Boosting and Voting classifiers, achieving 78% accuracy with both the XGBoost and AdaBoost classifiers.

Keywords: Pre-processing, Feature Selection, Domain expertise, Outliers, Bagging, Boosting, SVM, KNN, Logistic Regression

1. INTRODUCTION
Nowadays placement plays an important role in this world full of unemployment. Even the ranking and rating of institutes depend upon the average package and the number of placements they provide.
So, the main objective of this model is to predict whether a student might get placed or not. Different kinds of classifiers were applied, i.e. Logistic Regression, SVM, Decision Tree, Random Forest, KNN, AdaBoost, Gradient Boosting and XGBoost. For this, the overall academic record of students is taken into consideration. As placement activity takes place in the last year of academics, the last-year semesters are not taken into consideration.

2. RELATED WORK
Various researchers and students have published related work in national and international research papers and theses; we studied these to understand the objectives, the types of algorithms used, and the various techniques for pre-processing and feature selection.

Pothuganti Manvitha, Neelam Swaroopa (2019) used Random Forest and Decision Tree. The accuracy obtained after analysis for the Decision Tree is 84% and for the Random Forest is 86%. Hence, from the above-said analysis and prediction, it is better if the Random Forest algorithm is used to predict the placement results [1].

Senthil Kumar Thangavel, Divya Bharathi P, Abijith Sankar (2017) used Decision Tree, Logistic Regression, Metabagging Classifier and Naïve Bayes, and obtained the highest accuracy, 84.42%, with the Decision Tree. The objective is to predict the placement status that B.Tech students are most likely to have at the end of their final-year placements. An accuracy of 71.66% on tested real-life data indicates that the system is reliable for carrying out its major objective, which is to help teachers and the placement cell [2].

Ajay Kumar Pal, Saurabh Pal (2013) predicted the placement of students after completing an MCA using three selected classification algorithms in Weka. The best algorithm on the placement data is Naïve Bayes classification with an accuracy of 86.15%, and the total time taken to build the model is 0 seconds. The Naïve Bayes classifier also has the lowest average error, 0.28, compared to the others [3].

Syed Ahmed, Aditya Zade, Shubham Gore, Prashant Gaikwad, Mangesh Kolhal (2017): their objective is to analyze previous years' historical student data and predict the placement chance of current students and the percentage placement chance of the institution. They have used the Decision Tree C4.5 algorithm, applied to the companies' previous-year data and current requirements to generate a model that can be used to predict students' eligibility in various companies. According to company eligibility criteria, they send a notification to those candidates who are eligible for that campus interview and check the eligibility of candidates on the basis of percentage and technology [4].

Apoorva Rao R, Deeksha K C, Vishal Prajwal R, Vrushak K, Nandini M S (2018) have used techniques like clustering, along with the classification rule Naïve Bayes algorithm, to classify students into five different statuses, i.e. Dream company, Core company, Mass recruiters, Not eligible and Not interested [5].

3. DATASET DESCRIPTION AND SYSTEM FLOW
The approach followed is shown in Figure 3: Data Gathering → Pre-processing → Feature Selection → Training Different Models → Model Selection → Prediction.

Figure 3. Flow chart

3.1 Data gathering and Pre-processing
The data was collected from the training and placement department of MIT, which consists of all the students of Bachelor of Engineering (B.E.) from 3 different colleges of their campus. The data consists of 2338 records with 31 different attributes. The following pre-processing steps were applied (a code sketch of these steps follows this list):
• The dataset contains academic information of students. As some students have completed their 12th and some are from a diploma background who have directly taken admission to the second year, we have merged 12th and diploma marks into a single column for both.
• Some of the tuples were from an M.Tech background, so we have dropped them; in the "current_aggregate" column we have also dropped the NA values, because the whole row was NA.
• Replaced all NA values in the columns "Current_Back_Papers", "Current_Pending_Back_Papers" and all the semester-wise "Sem_Back_Papers", "Sem_Pending_Back_Papers" with 0, because these were null only if the student had no backlogs.
• Using LabelEncoder from the preprocessing API in sklearn, we encoded the labels of the columns "Degree_Specializations", "Campus", "Gender", "year_down", "educational_gap".
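A minimal sketch of these pre-processing steps with pandas and scikit-learn. The encoded column names follow the ones quoted above; the file name, the "12th_marks"/"diploma_marks"/"Degree" columns, and the merge rule (take whichever of the two mark columns is present) are illustrative assumptions, since the paper does not spell them out.

    import pandas as pd
    from sklearn.preprocessing import LabelEncoder

    df = pd.read_csv("placement_data.csv")  # hypothetical file name

    # Merge 12th and diploma marks into one column (assumed rule:
    # use whichever of the two is available for a given student).
    df["12th_/_Diploma_Aggre_marks"] = df["12th_marks"].fillna(df["diploma_marks"])

    # Drop M.Tech rows and rows whose current aggregate is entirely missing.
    df = df[df["Degree"] != "M.Tech"]
    df = df.dropna(subset=["Current_Aggregate_Marks"])

    # Back-paper columns are null only when a student has no backlog.
    back_cols = [c for c in df.columns if "Back_Papers" in c]
    df[back_cols] = df[back_cols].fillna(0)

    # Encode the categorical columns named in the paper.
    for col in ["Degree_Specializations", "Campus", "Gender",
                "year_down", "educational_gap"]:
        df[col] = LabelEncoder().fit_transform(df[col].astype(str))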

3.2 Feature Selection
Using machine-learning feature-selection methods (Ridge, Lasso, RFE, plot importance, F1 score and feature importance) we obtained the following outputs.

• "Feature importance" with Decision Tree:

Figure 3.2.1 Feature importance with DT

• "Feature importance" with Random Forest:

Figure 3.2.2 Feature importance with Random Forest

• "F1 score":

Feature Names: F1 score
Sem4_Aggregate_Marks: 312.063809
Current_Aggregate_Marks: 286.086537
Sem3_Aggregate_Marks: 255.771833
Sem2_Aggregate_Marks: 164.183078
12th_/_Diploma_Aggre_marks: 142.208129
Sem1_Aggregate_Marks: 139.183936
Sem6_Aggregate_Marks: 136.333959
Sem5_Aggregate_Marks: 131.988165
10th_Aggregate_Marks: 128.526784
Sem6_Back_Papers: 128.526784
live_atkt: 47.908927
Sem5_Back_Papers: 45.382049
Sem4_Back_Papers: 43.547352

• "RFE":

Num Features: 5
Features support: [False False False False False False False False True False False True False False False True False True False False False False True False False False False False False False]
Features Ranking: [25 6 4 3 24 8 13 22 1 10 11 1 23 17 2 1 19 1 5 18 7 21 1 26 16 20 15 12 14 9]
Selected Features: ['Sem1_Pending_Back_Papers', 'Sem2_Pending_Back_Papers', 'Sem4_Aggregate_Marks', 'Sem4_Pending_Back_Papers', 'Sem6_Back_Papers']
Selected features index: [8, 11, 15, 17, 22]

• "Ridge":

Figure 3.2.3 Feature Selection Using Ridge
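A sketch of how two of these selectors are typically produced with scikit-learn: tree-based feature importances and RFE keeping 5 features, matching the outputs above. X and y stand for the pre-processed feature matrix and the placed/not-placed label; the estimator passed to RFE is an assumption, as the paper does not name it.

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_selection import RFE
    from sklearn.linear_model import LogisticRegression

    # X, y: pre-processed features and placement labels from Section 3.1

    # Tree-based "feature importance" (as in Figures 3.2.1 / 3.2.2)
    forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
    ranked = sorted(zip(forest.feature_importances_, X.columns), reverse=True)

    # Recursive Feature Elimination keeping 5 features, as in the RFE output
    rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5).fit(X, y)
    print("Num Features:", rfe.n_features_)
    print("Features support:", rfe.support_)
    print("Features Ranking:", rfe.ranking_)
    print("Selected:", list(X.columns[rfe.support_]))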


• "Lasso":

Figure 3.2.4 Feature Selection Using Lasso

But as per domain knowledge, we have selected all the features which are important for our model.

4. EXPLORATORY DATA ANALYSIS

Figure 4.1 Total number of students placed

Figure 4.2 Campus-wise number of students who got placed

Figure 4.3 Gender-wise student placement

5. BAGGING AND BOOSTING
Bagging is nothing but bootstrap aggregating; it is an ensemble method to improve the accuracy and stability of models. Random samples are taken with replacement, a model is trained on every new sample that is generated, and the ensemble makes a prediction for a new instance by simply aggregating the predictions of all predictors.

Boosting is an ensemble method that can combine different weak learners into a strong learner. Its main aim is to train predictors sequentially. The most popular variants are AdaBoost and Gradient Boosting.


Figure 5.1 Layering of Classifiers (base classifier: Decision Tree, wrapped by an AdaBoost classifier, wrapped by a Bagging classifier)

We have used a Decision Tree as the base classifier; over that we have used an AdaBoost classifier, and over that a Bagging classifier, because we want to tune the accuracy of the model. A code sketch of this layering follows.
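A minimal sketch of the layering in Figure 5.1 with scikit-learn; the hyperparameter values are illustrative assumptions, not the paper's tuned settings.

    from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    # X, y: pre-processed features and placement labels from Section 3

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                        random_state=0)

    # Decision Tree -> AdaBoost -> Bagging, as in Figure 5.1
    ada = AdaBoostClassifier(DecisionTreeClassifier(max_depth=3),
                             n_estimators=50, random_state=0)
    model = BaggingClassifier(ada, n_estimators=10, random_state=0)
    model.fit(X_train, y_train)
    print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))

Wrapping the boosted trees in a bagger trades extra training time for lower variance, which is the accuracy-tuning motivation stated above.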
6. RESULT AND CONCLUSION

Algorithm: Accuracy
Logistic Regression: 58%
Support Vector Machine: 69%
KNN: 63.22%
Decision Tree: 69%
Random Forest: 75.25%
AdaBoost (DT): 77%
Gradient Boosting: 77%
Voting Classifier (soft): 69.11%
Voting Classifier (hard): 68.43%
XGBoost: 78%

In this model, we have considered various academic records, along with all semesters' aggregates, live backlogs, dead backlogs, education gap and year down. This model will help teachers find out, as early as the 3rd year, whether a student will get placed or not, so that they can pay special attention to those students who are predicted as not getting placed. The institute can also take major steps to improve the qualities of those students before their final placement. Various algorithms were used, but the final model is built on the AdaBoost classifier along with Bagging and a Decision Tree as the base classifier, as its accuracy is very high.

The existing dataset covers only 3 colleges; further colleges' datasets can be added to it for prediction. In future, we are going to implement deep learning algorithms, which may give better accuracy than the machine learning models.

7. REFERENCES
[1] Pothuganti Manvitha, Neelam Swaroopa, "Campus Placement Prediction Using Supervised Machine Learning Techniques", International Journal of Applied Engineering Research, ISSN 0973-4562, Volume 14, Sept 2019.
[2] Senthil Kumar Thangavel, Divya Bharathi P, Abijith Sankar, "Student Placement Analyzer: A Recommendation System Using Machine Learning", 2017 International Conference on Advanced Computing and Communication Systems (ICACCS-2017), Coimbatore, India, Jan. 06-07, 2017.
[3] Ajay Kumar Pal, Saurabh Pal, "Classification Model of Prediction for Placement of Students", I.J. Modern Education and Computer Science, 2013, 11, 49-56, published online 11 November 2013.
[4] Syed Ahmed, Aditya Zade, Shubham Gore, Prashant Gaikwad, Mangesh Kolhal, "Smart System for Placement Prediction using Data Mining", International Journal for Research in Applied Science & Engineering Technology (IJRASET), ISSN 2321-9653, Dec 2017.
[5] Apoorva Rao R, Deeksha K C, Vishal Prajwal R, Vrushak K, Nandini M S, "Student Placement Analyzer: A Recommendation System Using Machine Learning", IJARIIE, ISSN(O) 2395-4396, Jan 2018.


Customer Churn Analysis and Prediction

Aditya Kulkarni [1], M.Sc. (Big Data Analytics), MIT-WPU, Pune, India
Amruta Patil [2], M.Sc. (Big Data Analytics), MIT-WPU, Pune, India
Madhushree Patil [3], M.Sc. (Big Data Analytics), MIT-WPU, Pune, India
Sachin Bhoite [4], Assistant Professor, Computer Science, MIT-WPU, Pune, India

Abstract: When talking about any company's growth within a market, customers play an essential role; having correct insights about customer behaviour and requirements is the current need in this customer-driven market. Preserving the interests of customers by providing new services and products helps in maintaining business relations. Customer churn is a great problem faced by companies nowadays, due to lagging in understanding customer behaviour and finding solutions for it. In this project we have found causes of churn for a telecom industry by taking into consideration its past records, and then recommending new services to retain customers and avoid churn in future. We used pie charts to check the churning percentage, analysed whether there are any outliers (using box plots), dropped some features which were of less importance, converted all categorical data into numerical form (Label Encoding for multi-category data and the map function for two-category data), plotted the ROC curve to learn the true-positive and false-negative rates (getting a line at 0.8), and then split the data using a train-test split. We used the Decision Tree and Random Forest algorithms for feature selection, from which we got feature importances, then used Logistic Regression and found the feature with the highest weight assigned, leading to the cause of churn. In order to retain customers, we can then recommend new services to them.

Keywords: Customer churn analysis telecom, Customer churn prediction & prevention, naïve bayes, logistic regression, decision tree, random forest

1. INTRODUCTION
The telecom industry is growing day by day; hence users as well as operators are investing in this industry. Such a customer-driven industry faces a huge financial issue if customers tend to leave its services. By using machine learning we can analyse and predict the way customers respond to these services; research has proven that this can be accomplished using past data [2].

In this customer churn prediction and retention work, we analyse the past behaviour of customers, accordingly find the real cause of churn, and then predict whether customers will churn in future. Details like monthly charges, the services customers have subscribed for, tenure and contract contribute to the end result, i.e. the prediction.

Our aim is to use machine learning concepts not only to predict churn and retain customers but also to avoid further churn, which would be beneficial to the industry.

2. RELATED WORKS
We went through various articles and research papers, and found that many researchers have worked on customer churn, as it is a major problem faced by industries nowadays. We found the following papers most promising.

In "A comparison of machine learning techniques for customer churn prediction", Praveen Asthana used decision tree, SVM, naïve bayes and ANN, and compared which model gives the best accuracy and would help in prediction of customer churn to achieve better performance [1].

Scott A. Neslin, Sunil Gupta, Wagner Kamakura, Junxiang Lu, and Charlotte H. Mason, "Defection Detection: Measuring and Understanding the Predictive Accuracy of Customer Churn Models" [2]: here they have worked on measuring and increasing accuracy for churn prediction, using logistic and tree approaches.

We went through one more paper, "Customer churn prediction in telecom using machine learning in big data platform", by Abdelrahim Kasem Ahmad, Assef Jafar and Kadan Aljoumaa [3]: they used decision tree, random forest and XGBoost for classification in predicting customer churn, getting better accuracy.

S-Y. Hung, D. C. Yen, and H.-Y. Wang, "Applying data mining to telecom churn management": here they have used a predictive model in a bank with personalized actions to retain customers, and have also used a recommender system [4].

K. Coussement and D. Van den Poel, "Improving customer attrition prediction by integrating emotions from client/company interaction emails and evaluating multiple classifiers": they have used logistic regression, SVM and random forest classification algorithms to filter out the churners from the non-churners [5].


L. Miguel APM, "Measuring the impact of data mining on churn management": they have proposed an analysis framework which prefigures the impact of data mining for churn management [6].

Adnan Amin, Babar Shah, Awais Adnan, "Customer churn prediction in telecommunication industry using data certainty" [7]: the dataset is grouped into different zones based on the distance factor, which are then divided into two categories, data with high certainty and data with low certainty, for predicting customers exhibiting churn and non-churn behaviour.

3. PROCESS FLOW
The data we got was mostly balanced and categorical, so we began with data cleaning, pre-processing, removing unwanted columns, feature selection and label encoding. The overall flow is: Dataset → Data Pre-processing → Feature Selection → Train-Test Split → Models Applied → Model Tuning → Final Output.

3.1 DATASET
We took this telecom dataset from an online website source and took all the insights regarding the data.
Attributes of the dataset: customerID, gender, SeniorCitizen, Partner, Dependents, tenure, PhoneService, MultipleLines, InternetService, OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, StreamingTV, Contract, PaperlessBilling, PaymentMethod, MonthlyCharges, TotalCharges, Churn.

3.2 DATA PRE-PROCESSING
Data pre-processing is an important task in machine learning; it converts raw data into clean data. Following are the techniques we applied on the data (a code sketch follows this section):
• Missing Values – Here we had missing values in the TotalCharges feature, which we eliminated and adjusted with mean values. These missing row values, if not handled, would later lead to errors when converting the data type, as empty spaces are read as string values.
• Label Encoder – For categorical variables this is the perfect method to convert them into numeric values, best used when there are multiple categories. We had various categorical values and converted them into numeric form for further use in the algorithms.
• Drop Columns – As we took insights from the data, we came to know that some of the features were of less importance, so we dropped them to reduce the number of features.

3.3 FEATURE SELECTION
As we had a number of features and most of them were of great importance, we used feature selection to get to know which of them contribute towards the accuracy of the model.
We used Decision Tree and Random Forest for feature selection: using Decision Tree we got an accuracy of 77% and using Random Forest we got 80%, so Random Forest gave us the top features:

Index(['tenure', 'Contract', 'MonthlyCharges', 'TotalCharges'], dtype='object')
[(0.2251735641431145, 'Contract'), (0.1687558104226648, 'tenure'), (0.12539865168020692, 'OnlineSecurity'), (0.1128092761196452, 'TechSupport'), (0.10731999001345587, 'TotalCharges'), (0.08573112448285626, 'MonthlyCharges'), ...]

Here we can see that Contract has the highest importance, a resulting factor for churn.

[Bar chart: feature importances for Contract, tenure, OnlineSecurity, TechSupport]

We also used a heat map for correlation checking.
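A minimal sketch of the pre-processing and feature-importance steps described above, assuming the usual CSV layout of this public telco dataset; the file name is a placeholder.

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.preprocessing import LabelEncoder

    df = pd.read_csv("telco_churn.csv")  # placeholder file name

    # TotalCharges is read as a string because of blank entries;
    # coerce to numeric and fill the gaps with the mean.
    df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")
    df["TotalCharges"] = df["TotalCharges"].fillna(df["TotalCharges"].mean())

    # Drop an identifier column that carries no signal, then encode the rest.
    df = df.drop(columns=["customerID"])
    df["Churn"] = df["Churn"].map({"No": 0, "Yes": 1})   # two-category: map
    for col in df.select_dtypes(include="object").columns:
        df[col] = LabelEncoder().fit_transform(df[col])  # multi-category

    X, y = df.drop(columns=["Churn"]), df["Churn"]
    forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
    print(sorted(zip(forest.feature_importances_, X.columns), reverse=True)[:6])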


4. EXPLORATORY DATA ANALYSIS
In this phase we look at those features which we did not consider in feature selection but which are contributing factors for prediction.
Here we can see that customers who took fibre optic service on a month-to-month contract, whether male or female, tended to churn.
We also visualised all the features within the dataset, came to know their distributions, and obtained the ROC curve.

Confusion Matrix:
[[3816  295]
 [ 924  590]]

5. RESULT AND DISCUSSION
Now, after all the cleaning up and pre-processing of the data, we separate our data before applying the algorithms, using:
1. Train-Test Split
2. Modelling
3. Model Tuning

5.1 Train-Test Split:
To create the model we train on the training dataset, while the testing dataset is used to test the performance. We split our data into 80% training data and 20% testing data, because more training data makes the classification model better, whilst more test data makes the error estimate more accurate.

5.2 Modelling:
Following are the models we applied to check which gives better accuracy:

• Support Vector Classifier (SVC):
This algorithm is used for classification problems. The main objective of SVC is to fit the data you provide, returning a "best fit" hyperplane that divides, or categorizes, your data. From there, having obtained the hyperplane, you can feed features to your classifier to see what the "predicted" class is.

• Decision Tree:
A decision tree is a non-parametric supervised learning method, used for both classification and regression problems. It is a flowchart-like structure in which each internal node represents a "test" on an attribute, each branch represents an outcome of the test, and each leaf node represents a class label. The path between root and leaf represents classification rules. It creates a comprehensive analysis along each branch and identifies decision nodes that need further analysis.

• Random Forest:
Random Forest is a meta-estimator that uses a number of decision trees fitted to various sub-samples drawn from the original dataset. We can also draw data with replacement as per the requirements.

• K-Nearest Neighbours (KNN):



K-Nearest Neighbours (KNN) is a supervised learning algorithm used to solve both regression and classification problems, where 'K' is the number of nearest neighbours. It is simple to implement, easy to understand, and it is a lazy algorithm: it does not need any training data points for model generation, and all the training data is used in the testing phase.

• Naïve Bayes:
A Naïve Bayes classifier is a supervised machine learning algorithm which uses Bayes' theorem under the assumption that features are statistically independent. It finds many uses in probability theory and statistics. It suits simple machine learning problems where we need to learn our model from a given set of attributes (in training examples) and then form a hypothesis, or a relation, to a response variable.

• Logistic Regression:
Logistic regression is a classification algorithm used to assign observations to a discrete set of classes. It is a machine learning algorithm used for classification problems; it is a predictive analysis algorithm based on the concept of probability.
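A sketch of how such a model comparison is commonly run with scikit-learn, using the 80/20 split described in Section 5.1; the hyperparameters are library defaults, not the authors' tuned settings.

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier

    # X, y: encoded features and churn labels from Section 3
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                        random_state=0)
    models = {
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "Decision Tree": DecisionTreeClassifier(random_state=0),
        "Random Forest": RandomForestClassifier(random_state=0),
        "Naive Bayes": GaussianNB(),
        "SVM": SVC(),
        "KNN": KNeighborsClassifier(),
    }
    for name, model in models.items():
        score = model.fit(X_train, y_train).score(X_test, y_test)
        print(f"{name}: {score:.2%}")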
Models used and their accuracy:

Model: Accuracy
Logistic Regression: 80.38%
Decision Tree: 77.81%
Random Forest: 80.02%
Naïve Bayes: 74.91%
SVM: 80.1%
K-Nearest Neighbour: 76.61%
XGBoost: 80%

Figure 5: Accuracy for Different Models

5.3 MODEL TUNING:
Here we tune the model to increase model performance without overfitting the model.

• XGBoost:
XGBoost stands for eXtreme Gradient Boosting. XGBoost is an implementation of gradient-boosted decision trees designed for speed and performance [3].

We used XGBoost to check the error function and reduce it (accuracy: 80%). Using cross-validation, we checked for a reducing RMSE; XGBoost also handles the missing values. Initially our error was [0] validation_0-error: 0.208955, and later it came down to [10] validation_0-error: 0.200426.

6. CONCLUSION
Here we had past records of customers who had churned, and using that data we predicted whether a new customer would tend to churn or not. This will help companies get to know the behaviour of customers and how to maintain their interest in the services of the company. Further, the company can also use a recommender system to retain customers and avoid further churn. We used various algorithms, wherein Logistic Regression gave us the highest accuracy; close to this accuracy were Random Forest and SVM.

The dataset did not consist of records which would tell us whether a customer has switched services, which would help in recommending new services further. We are now going to build a recommender system to avoid churn and retain old customers.

7. REFERENCES
[1] Praveen Asthana, "A comparison of machine learning techniques for customer churn prediction", International Journal of Pure and Applied Mathematics, Volume 119, No. 10, 2018, 1149-1169, ISSN 1311-8080.
[2] Scott A. Neslin, Sunil Gupta, Wagner Kamakura, Junxiang Lu, and Charlotte H. Mason, "Defection Detection: Measuring and Understanding the Predictive Accuracy of Customer Churn Models", Journal of Marketing Research, Vol. XLIII (May 2006), 204-211, ISSN 0022-2437.
[3] Abdelrahim Kasem Ahmad, Assef Jafar and Kadan Aljoumaa, "Customer churn prediction in telecom using machine learning in big data platform", Journal of Big Data, Volume 6, Article 28 (2019), published 20 March 2019.
[4] S-Y. Hung, D. C. Yen, and H.-Y. Wang, "Applying data mining to telecom churn management", Expert Systems with Applications, vol. 31, no. 3, pp. 515-524, 2006.
[5] K. Coussement and D. Van den Poel, "Improving customer attrition prediction by integrating emotions from client/company interaction emails and evaluating multiple classifiers", Expert Systems with Applications, vol. 36, no. 3, pp. 6127-6134, 2009.
[6] L. Miguel APM, "Measuring the impact of data mining on churn management", Internet Research, vol. 11, no. 5, pp. 375-387, 2001.
[7] Adnan Amin, Babar Shah, Awais Adnan, "Customer churn prediction in telecommunication industry using data certainty", Journal of Business Research, Volume 94, January 2019, Pages 290-301.


Air Quality Prediction using Machine Learning Algorithms

Pooja Bhalgat, Student, M.Sc. (Big Data Analytics), MIT-WPU, Pune, India
Sejal Pitale, Student, M.Sc. (Big Data Analytics), MIT-WPU, Pune, India
Sachin Bhoite, Assistant Professor, Computer Science, MIT-WPU, Pune, India

-----------------------------------------------------------------------------------------------------------------------------

Abstract: Examining and protecting air quality has become one of the most essential activities for the government in many industrial and urban areas today. Meteorological and traffic factors, the burning of fossil fuels, and industrial parameters play significant roles in air pollution. With this increasing air pollution, we need to implement models which will record information about concentrations of air pollutants (SO2, NO2, etc.). The deposition of these harmful gases in the air is affecting the quality of people's lives, especially in urban areas. Lately, many researchers have begun to use a Big Data Analytics approach, as environmental sensing networks and sensor data have become available. In this paper, machine learning techniques are used to predict the concentration of SO2 in the environment. Sulphur dioxide irritates the skin and the mucous membranes of the eyes, nose, throat and lungs. Time series models are employed to predict the SO2 readings in the coming years or months.
Keywords: Machine Learning, Time Series, Prediction, Air Quality, SO2

1. INTRODUCTION
In developing countries like India, the rapid increase in population and the economic upswing in cities have led to environmental problems such as air pollution, water pollution, noise pollution and many more. Air pollution has a direct impact on human health, and there has been increased public awareness about it in our country. Global warming, acid rain and an increase in the number of asthma patients are some of the long-term consequences of air pollution. Precise air quality forecasting can reduce the effect of peak pollution on humans and the biosphere; hence, enhancing air quality forecasting is one of the prime targets for society.
Sulphur dioxide is a gas and one of the major pollutants present in air. It is colorless and has a nasty, sharp smell. It combines easily with other chemicals to form harmful substances like sulphuric acid, sulfurous acid, etc. Sulphur dioxide affects human health when it is breathed in: it irritates the nose, throat and airways, causing coughing, wheezing, shortness of breath, or a tight feeling around the chest. The concentration of sulphur dioxide in the atmosphere can also influence habitat suitability for plant communities, as well as animal life.
The proposed system is capable of predicting the concentration of sulphur dioxide for forthcoming months/years.

2. RELATED WORK
In this research paper the students forecasted the air quality of India by using machine learning algorithms to predict the air quality index (AQI) of a given area. The air quality index is a standard measure to determine the quality of air. Concentrations of gases such as SO2, NO2, CO2, RSPM, SPM, etc. are recorded by the agencies. These students developed a model to predict the air quality index based on historical data of previous years, predicting over a particular upcoming year as a gradient-boosted multivariable regression problem. They improved the efficiency of the model by applying cost estimation to the predictive problem. They say that this model is capable of successfully predicting the air quality index of a total county or any state or any bounded region, provided the historical data of pollutant concentration is available [1].

This paper presents an integrated model using Artificial Neural Networks and Kriging to predict the level of air pollutants at various locations in Mumbai and Navi Mumbai, using past data available from the meteorological department and the Pollution Control Board. The proposed model is implemented and tested using MATLAB for the ANN and R for Kriging, and the results are presented [2].

This system used Linear Regression and a Multilayer Perceptron (ANN) for prediction of the next day's pollution. The system helps to predict the next day's pollution details based on basic parameters, analyzing pollution details and forecasting future pollution. Time series analysis was also used for recognition of future data points and air pollution prediction [3].

This proposed system does two important tasks: (i) it detects the level of PM2.5 based on given atmospheric values; (ii) it predicts the level of PM2.5 for a particular date. Logistic regression is used to detect whether a data sample is polluted or not. Autoregression is employed to predict future values of PM2.5 based on the previous PM2.5 readings. The primary goal is to predict the air pollution level in a city with the ground data set [4].

The major objective of this paper was to provide a snapshot of the vast research work and a useful review of the current state-of-the-art applicable big data approaches and machine learning techniques for air quality evaluation and prediction. Air quality maps were illustrated and visualized using data from Shenzhen, China. Artificial neural network (ANN), a Genetic Algorithm ANN model, Random Forest, decision tree and deep belief networks are the algorithms which were used, and the various pros and cons of the models were presented [5].

www.ijcat.com 367
International Journal of Computer Applications Technology and Research
Volume 8–Issue 09, 367-370, 2019, ISSN:-2319–8656

3. DATASET
3.1 Dataset/Source: Kaggle
Structured/Unstructured data: structured data in CSV format.

Dataset Description:
The dataset consists of around 450,000 records from all the states of India. We worked only on the dataset of Maharashtra, so we had 60,383 records. This dataset consists of the 13 attributes listed below.
1) stn_code
2) sampling_date
3) state
4) location
5) agency
6) type
7) so2
8) no2
9) rspm
10) spm
11) location_monitoring_station
12) pm2_5
13) date

Station code is a code given to each station that recorded the data; sampling date is the date when the data was recorded; state and location represent the state and cities whose data was recorded; and agency is the name of the agency that recorded the data. Type states the type of area where the data was recorded, such as industrial, residential, etc. so2, no2, rspm and spm are the measured amounts of sulphur dioxide, nitrogen dioxide, respirable suspended particulate matter and suspended particulate matter, respectively. date is a cleaner version of sampling_date. PM2.5 refers to atmospheric particulate matter (PM) that has a diameter of less than 2.5 micrometres, which is about 3% of the diameter of a human hair; however, the majority of values in this column are null.

Splitting for Testing: data splitting was done as 80% for training and 20% for testing.

Preprocessing and Feature Selection:
We only studied and applied algorithms on the data of Maharashtra state; hence the number of rows was reduced to 60,383 and the state column is automatically of no more use. All the values in pm2_5 were null, so we dropped the column. The agency's name has nothing to do with how polluted the state is; similarly, stn_code is also not useful. date is a cleaner representation of the sampling_date attribute, so we eliminate the redundancy by removing the latter. The location_monitoring_station attribute is again unnecessary, as it contains the location of the monitoring station, which we do not need to consider for the analysis.

So, to summarize, we have deleted the following features from our dataset: state, pm2_5, agency, stn_code, sampling_date and location_monitoring_station. (A code sketch of these steps follows the exploratory analysis below.)

We have simplified the type attribute to contain only one of three categories: industrial, residential, other. For SO2 and NO2, we replaced NaN values with the mean. For date, we dropped the NaN values, as there were only 3 of them. So, after pre-processing, our dataset contains 60,380 rows and 7 columns.

4. EXPLORATORY DATA ANALYSIS
• The concentration of SO2 over the years was highest in 1997 and 2001 and lowest in 1988 and 2003; however, it is stable for the latest years.
• The amount of SO2 is highest in the industrial areas.
• Nagpur has the deadliest amount of SO2 compared to other cities, whereas Akole and Amravati are sparsely polluted, followed by Jalna and Kolhapur.
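The pre-processing above, sketched with pandas; the column names are the ones listed in Section 3, the file name is a placeholder, and the type-simplification mapping (anything not industrial/residential becomes "other") is an assumption.

    import pandas as pd

    df = pd.read_csv("india_air_quality.csv")       # placeholder file name
    df = df[df["state"] == "Maharashtra"]

    # Drop the columns identified as unhelpful in Section 3
    df = df.drop(columns=["state", "pm2_5", "agency", "stn_code",
                          "sampling_date", "location_monitoring_station"])

    # Collapse the many recorded area types into three categories
    def simplify(t):
        t = str(t).lower()
        if "industrial" in t:
            return "industrial"
        if "residential" in t:
            return "residential"
        return "other"                               # assumed catch-all
    df["type"] = df["type"].apply(simplify)

    # Mean-fill the pollutant readings, drop the 3 rows with no date
    df[["so2", "no2"]] = df[["so2", "no2"]].fillna(df[["so2", "no2"]].mean())
    df = df.dropna(subset=["date"])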


5. RESULT AND DISCUSSION:
We are able to identify the future data points using time series analysis. The models used for this are:

1) AR model (autoregressive model)
Test MSE: 166.358
Autoregression is a time series model that uses observations from previous time steps as input to a regression equation to predict the value at the next time step. It is a very simple idea that can result in accurate forecasts on a range of time series problems.

yhat = b0 + b1*X1

where yhat is the prediction, b0 and b1 are coefficients found by optimizing the model on training data, and X1 is an input value. This technique can be used on time series where input variables are taken as observations at previous time steps, called lag variables. For example, we can predict the value for the next time step (t+1) given the observations at the last two time steps (t-1 and t-2). As a regression model, this would look as follows:

X(t+1) = b0 + b1*X(t-1) + b2*X(t-2)

Because the regression model uses data from the same input variable at previous time steps, it is referred to as an autoregression (regression of self) [6].

2) ARIMA model
An ARIMA model is a class of statistical models for analyzing and forecasting time series data. ARIMA is a generalization of the simpler autoregressive moving average model and adds the notion of integration.
AR: Autoregression. A model that uses the dependent relationship between an observation and some number of lagged observations.
I: Integrated. The use of differencing of raw observations (e.g. subtracting an observation from the observation at the previous time step) in order to make the time series stationary.
MA: Moving Average. A model that uses the dependency between an observation and the residual errors from a moving average model applied to lagged observations.
Each of these components is explicitly specified in the model as a parameter. A standard notation, ARIMA(p,d,q), is used, where the parameters are substituted with integer values to quickly indicate the specific ARIMA model being used. The parameters of the ARIMA model are defined as follows:
p: the number of lag observations included in the model, also called the lag order.
d: the number of times that the raw observations are differenced, also called the degree of differencing.
q: the size of the moving average window, also called the order of moving average [7].
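A minimal sketch of fitting such a model on the SO2 series with statsmodels; the monthly resampling, the one-year holdout and the (p,d,q) values are assumptions for illustration, not the paper's exact configuration.

    import pandas as pd
    from sklearn.metrics import mean_squared_error
    from statsmodels.tsa.arima.model import ARIMA

    # df: the cleaned frame from Section 3, with a 'date' column
    series = (df.set_index(pd.to_datetime(df["date"]))["so2"]
                .resample("M").mean().dropna())        # assumed monthly mean

    train, test = series[:-12], series[-12:]           # hold out final year
    model = ARIMA(train, order=(2, 1, 1)).fit()        # illustrative (p,d,q)
    forecast = model.forecast(steps=len(test))
    print("Test MSE:", mean_squared_error(test, forecast))

Setting order=(p, 0, 0) reduces this to the pure AR model of item 1), which is how the two models above relate in code.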
6. CONCLUSION
Based on the bar plots, we come to the conclusion that some cities are highly polluted and need urgent attention. Also, for cities like Pune and Mumbai, where the concentration of SO2 is increasing, we can take measures from now on so as not to face problems later. We used the AR model and the ARIMA model for predicting values of SO2. Features such as location_monitoring_station or station code were of no use, as they have nothing to do with SO2 predictions.

SO2 safe levels are as follows: 0.20 ppm (parts per million) averaged over a one-hour period; 0.08 ppm averaged over a 24-hour period; 0.02 ppm averaged over a one-year period.

In order to predict air quality, pm2_5 is also an important attribute. Its values must be recorded in future, as these particulates are responsible for various health effects, including cardiovascular effects such as cardiac arrhythmias and heart attacks, and respiratory effects such as asthma attacks and bronchitis.

This model is not able to show the expected output, as the data is not in sequence as per the date column; the same is the problem for cities. If we predict for the entire state, it won't be helpful, so we will now be calculating AQI and using classification models further. This work also makes us aware of the challenges and research needs ahead, such as PM2.5, AQI, etc.

7. REFERENCES
[1] Mrs. A. GnanaSoundari, M.Tech, (PhD), Mrs. J. GnanaJeslin, M.E., (PhD), Akshaya A.C., "Indian Air Quality Prediction And Analysis Using Machine Learning", International Journal of Applied Engineering Research, ISSN 0973-4562, Volume 14, Number 11, 2019 (Special Issue).
[2] Suhasini V. Kottur, Dr. S. S. Mantha, "An Integrated Model Using Artificial Neural Network (ANN) And Kriging For Forecasting Air Pollutants Using Meteorological Data", International Journal of Advanced Research in Computer and Communication Engineering, ISSN (Online) 2278-1021, ISSN (Print) 2319-5940, Vol. 4, Issue 1, January 2015.
[3] Ruchi Raturi, Dr. J.R. Prasad, "Recognition Of Future Air Quality Index Using Artificial Neural Network", International Research Journal of Engineering and Technology (IRJET), e-ISSN 2395-0056, p-ISSN 2395-0072, Volume 05, Issue 03, March 2018.
[4] Aditya C R, Chandana R Deshmukh, Nayana D K, Praveen Gandhi Vidyavastu, "Detection and Prediction of Air Pollution using Machine Learning Models", International Journal of Engineering Trends and Technology (IJETT), Volume 59, Issue 4, May 2018.
[5] Gaganjot Kaur Kang, Jerry Zeyu Gao, Sen Chiao, Shengqiang Lu, and Gang Xie, "Air Quality Prediction: Big Data and Machine Learning Approaches", International Journal of Environmental Science and Development, Vol. 9, No. 1, January 2018.
[6] https://machinelearningmastery.com/autoregression-models-time-series-forecasting-python/
[7] https://machinelearningmastery.com/arima-for-time-series-forecasting-with-python/


Individual Household Electric Power Consumption Forecasting using Machine Learning Algorithms

Aaditi Parate, Student, M.Sc. (Big Data Analytics), MIT-WPU, Pune, India
Sachin Bhoite, Assistant Professor, Computer Science, MIT-WPU, Pune, India

--------------------------------------------------------------------****************------------------------------------------------------------------

Abstract: Electric energy consumption is the actual energy demand made on the existing electricity supply. Mismanagement of its utilisation can lead to a fall in the supply of electricity; it is therefore imperative that everybody be concerned about the efficient use of energy in order to reduce consumption [1]. The purposes of this research are to find a model to forecast the electricity consumption in a household and to find the most suitable forecasting period, whether daily, weekly, monthly or quarterly. The time series data in our study is the individual household electric power consumption [4]. To explore and understand the dataset, I used line plots for the series data and histograms for the data distribution. The data analysis has been performed with the ARIMA (Autoregressive Integrated Moving Average) model.

Keywords: Energy consumption prediction, ARIMA, AR, MA, Python.

---------------------------------------------------------------------------------------------------------------------------------------------

1. INTRODUCTION
Electricity load forecasting has gained substantial importance nowadays in modern electrical power management systems with elements of smart grid technology. A reliable forecast of electrical power consumption represents a starting point in policy development and the improvement of energy production and distribution. At the level of individual households, the ability to accurately predict consumption of electric power significantly reduces prices through appropriate systems for energy storage. Therefore, the energy-efficient power networks of the future will require entirely new ways of forecasting demand at the scale of individual households [2]. The analysis of a time series uses forecasting techniques to identify models from past data. With the assumption that the information will resemble itself in the future, we can forecast future events from the data that has occurred. There are several forecasting techniques, and they provide forecasting models of different accuracy. The accuracy of the prediction is based on the minimum error of the forecast. The appropriate prediction methods are chosen by considering several factors, such as the prediction interval, prediction period, characteristics of the time series, and size of the time series [4].

In this research, we are interested in time series analysis with the popular forecasting technique ARIMA (Autoregressive Integrated Moving Average). I applied this method for detecting patterns and trends of the electric power consumption in the household, with real time series periods of daily, weekly, monthly and quarterly. I used a Python program for constructing the model.

2. RELATED WORK
How to Load and Explore Household Electricity Usage Data: in this tutorial, you will discover a household power consumption dataset for multi-step time series forecasting, and how to better understand the raw data using exploratory analysis [5].

How to Develop an Autoregression Forecast Model for Household Electricity Consumption: in this tutorial, you will discover how to develop and evaluate an autoregression model for multi-step forecasting of household power consumption [6].

Time Series Analysis of Household Electric Consumption with ARIMA and ARMA Models: in this research, the authors use time series analysis with the most popular method, the Box-Jenkins method. The resulting model of this method is quite accurate compared to other methods and can be applied to all types of data movement. Two forecasting techniques were used in that study: Autoregressive Integrated Moving Average (ARIMA) and Autoregressive Moving Average (ARMA) [1].

3. DATASET DESCRIPTION:
Source: https://archive.ics.uci.edu/ml/datasets/individual+household+electric+power+consumption


The Household Power Consumption dataset is a multivariate time series dataset that describes the electricity consumption for a single household over four years.

This archive contains 2,075,259 measurements gathered in a house located in Sceaux (7 km from Paris, France) between December 2006 and November 2010 (47 months).

It is a multivariate series comprised of seven variables (besides the date and time); they are:
• global_active_power: the total active power consumed by the household (kilowatts).
• global_reactive_power: the total reactive power consumed by the household (kilowatts).
• voltage: average voltage (volts).
• global_intensity: average current intensity (amps).
• sub_metering_1: active energy for the kitchen (watt-hours of active energy).
• sub_metering_2: active energy for the laundry (watt-hours of active energy).
• sub_metering_3: active energy for climate control systems (watt-hours of active energy).

4. PRE-PROCESSING:
The dataset contains some missing values in the measurements (nearly 1.25% of the rows). All calendar timestamps are present in the dataset, but for some timestamps the measurement values are missing: a missing value is represented by the absence of a value between two consecutive semi-colon attribute separators. For instance, the dataset shows missing values on April 28, 2007. We cannot ignore the missing values in this dataset, and therefore cannot delete those rows. I copied the observation from the same time the day before, and implemented this in a function named fill_missing() that takes the NumPy array of the data and copies values from exactly 24 hours ago (a sketch is given below). Then we saved a cleaned-up version of the dataset to a new file, 'household_power_consumption.csv' [3].
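A minimal sketch of the fill_missing() idea described above, following the approach in the tutorial cited as [3]/[6]: at one-minute resolution, one day is 1440 rows, so a missing cell is filled from the value 1440 rows earlier (this assumes, as the tutorial does, that the first day has no gaps).

    import numpy as np

    ONE_DAY = 60 * 24  # rows per day at one-minute resolution

    def fill_missing(values):
        """Replace NaN cells with the value at the same time one day earlier."""
        for row in range(values.shape[0]):
            for col in range(values.shape[1]):
                if np.isnan(values[row, col]):
                    values[row, col] = values[row - ONE_DAY, col]

    # usage: fill_missing(df.values) on the parsed measurement columns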
5. EXPLORATORY DATA ANALYSIS:
1) Global active power distribution plots: the normal probability plot shows the data is far from normally distributed.

2) Mean global active power by year, by quarter, by month and by day. These plots confirmed our previous discoveries: by year, consumption was steady; by quarter, the lowest average power consumption was in the 3rd quarter; by month, the lowest average power consumption was in July and August; by day, the lowest average power consumption was around the 8th of the month.

3) Global active power by year, this time with the year 2006 removed.


The pattern is similar every year from 2007 to 2010.

4) Line plot of active power for years [4].

5) Line plots of active power for all months in a year.

6) Histogram plots for each variable in the power consumption dataset.

5. RESULT AND DISCUSSION:
I developed an autoregression model for a univariate series of daily power consumption. I used the statsmodels library, which provides multiple ways of developing an AR model, such as the AR, ARMA, ARIMA and SARIMAX classes. I used the ARIMA implementation, as it allows easy expansion into differencing and moving averages.

First, the history data, comprised of weeks of prior observations, is converted into a univariate time series of daily power consumption. I specified an AR(7) model, which in ARIMA notation is ARIMA(7,0,0).

Running the example first prints the performance of the AR(7) model on the test dataset. We can see that the model achieves an overall RMSE of about 381 kilowatts. This model has skill when compared to naive forecast models, such as a model that forecasts the week ahead using observations from the same time one year ago, which achieved an overall RMSE of about 465 kilowatts.

A line plot of the forecast is also created, showing the RMSE in kilowatts for each of the seven lead times of the forecast. We can see an interesting pattern: we might expect earlier lead times to be easier to forecast than later lead times, as the error at each successive lead time compounds; instead, we see that Friday (lead time +6) is the easiest to forecast and Saturday (lead time +7) is the most challenging. We can also see that the remaining lead times all have a similar error, in the mid- to high-300 kilowatt range [3].
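A sketch of the AR(7) evaluation described above, assuming the daily series has already been produced from the cleaned file and split into standing weeks; the walk-forward loop refits ARIMA(7,0,0) as each test week's history becomes available.

    import numpy as np
    from statsmodels.tsa.arima.model import ARIMA

    def evaluate_ar7(train_daily, test_weeks):
        """train_daily: 1-D array of daily kW; test_weeks: list of 7-day arrays."""
        history = list(train_daily)
        errors = []
        for week in test_weeks:
            model = ARIMA(history, order=(7, 0, 0)).fit()   # AR(7)
            forecast = model.forecast(steps=7)
            errors.append((np.asarray(week) - np.asarray(forecast)) ** 2)
            history.extend(week)                            # walk forward
        per_day_rmse = np.sqrt(np.mean(errors, axis=0))     # seven lead times
        overall_rmse = np.sqrt(np.mean(errors))
        return overall_rmse, per_day_rmse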


6. CONCLUSION:
Many researchers have written about the ARIMA, AR and MA models, and have also applied these models to the consumption of electricity. I find the ARIMA model easy compared to the other models, and the ARIMA model also gives better accuracy than the other models.

Therefore, in this research I used the ARIMA model on the individual household electricity consumption dataset, then chose the suitable forecasting method and identified the most suitable forecasting period by considering the smallest values of RMSE. In this data set, the consumption of electricity is higher in the month of December and regular during the other periods of the year. The results showed that the ARIMA model represents the most suitable forecasting periods in monthly and quarterly, daily and weekly terms.

The ARIMA model result: arima: [465.902] 428.0, 448.9, 395.8, 522.3, 450.4, 380.5, 597.9. Here the overall RMSE is 465.902.

7. REFERENCES:
[1] C. Kamunda, "A Study on Efficient Energy Use for Household Appliances in Malawi".
[2] Naser Farag Abed and Milan M. Milosavljevic (Singidunum University, Belgrade, 11000, Serbia; School of Electrical Engineering, Belgrade University, Belgrade, 11000, Serbia), "Single Home Electricity Power Consumption Forecast Using Neural Networks Model", IJISET - International Journal of Innovative Science, Engineering & Technology, Vol. 3, Issue 1, January 2016, ISSN 2348-7968.
[3] https://machinelearningmastery.com/how-to-develop-an-autoregression-forecast-model-for-household-electricity-consumption/
[4] Pasapitch Chujai, Nittaya Kerdprasop, and Kittisak Kerdprasop, "Time Series Analysis of Household Electric Consumption with ARIMA and ARMA Models", Proceedings of the International MultiConference of Engineers and Computer Scientists 2013, Vol. I, IMECS 2013, March 13-15, 2013, Hong Kong.
[5] https://machinelearningmastery.com/how-to-load-and-explore-household-electricity-usage-data/
[6] https://machinelearningmastery.com/how-to-develop-an-autoregression-forecast-model-for-household-electricity-consumption/

Restaurants Rating Prediction using Machine Learning Algorithms

Atharva Kulkarni [1], Student, M.Sc. (BDA), MIT-WPU, Pune, Maharashtra, India
Divya Bhandari [2], Student, M.Sc. (BDA), MIT-WPU, Pune, Maharashtra, India
Sachin Bhoite [3], Assistant Professor, Department of Computer Science, MIT-WPU, Pune, Maharashtra, India

----------------------------------------------------------------------------------------------------------------------------- --------------------------------
Abstract: The restaurant rating has become the most commonly used parameter for any individual judging a restaurant. A lot of research has been done on different restaurants and the quality of the food they serve. The rating of a restaurant depends on factors like reviews, the area it is situated in, the average cost for two people, votes, cuisines and the type of restaurant.
The main goal of this work is to get insights on the restaurants which people like to visit and to identify the rating of a restaurant. In this article we study different predictive models like Support Vector Machine (SVM), Random Forest, Linear Regression, XGBoost and Decision Tree, and have achieved a score of 83% with AdaBoost.

Key Words: Pre-processing, EDA, SVM Regressor, Linear Regression, XGBoost Regressor, Boosting.
-------------------------------------------------------------------------------------------------------------------------------------------------------------

1. INTRODUCTION
Zomato is the most reputed company in the field of food reviews. Founded in 2008, this company started in India and is now in 24 different countries. It is so big that people now use it as a verb: "Did you know about this restaurant? Zomato it." The rating is the most important feature of any restaurant, as it is the first parameter that people look into while searching for a place to eat. It portrays the quality, hygiene and environment of the place. Higher ratings lead to higher profit margins. Ratings are usually notated as stars or numbers on a scale between 1 and 5.

Zomato has changed the way people browse through restaurants. It has helped customers find good places with respect to their dining budget.

Different machine learning algorithms like SVM, Linear Regression, Decision Tree and Random Forest can be used to predict the ratings of restaurants.

2. RELATED WORK
Various researchers and students have published related work in national and international research papers and theses; we studied these to understand the objectives, the types of algorithms used, and the various techniques for pre-processing and feature selection.

[1] Shina, Sharma S. and Singha A. used Random Forest and Decision Tree to classify restaurants into several classes based on their service parameters. Their results say that the Decision Tree classifier is more effective, with 63.5% accuracy, than Random Forest, whose accuracy is merely 56%.

[2] Chirath Kumarasiri and Cassim Faroo focus on a Part-of-Speech (POS) Tagger based NLP technique for aspect identification from reviews. A Naïve Bayes (NB) classifier is then used to classify identified aspects into meaningful categories.

[3] I. K. C. U. Perera and H.A. Caldera used data mining techniques like opinion mining and sentiment analysis to automate the analysis and extraction of opinions in restaurant reviews.

[4] Rrubaa Panchendrarajan, Nazick Ahamed, Prakhash Sivakumar, Brunthavan Murugaiah, Surangika Ranathunga and Akila Pemasiri wrote a paper on 'Eatery, a multi-aspect restaurant rating system' that identifies rating values for different aspects of a restaurant by means of aspect-level sentiment analysis. This research introduced a new taxonomy to the restaurant domain that captures the hierarchical relationships among entities and aspects.

[5] Neha Joshi wrote a paper in 2012 on a study of customer preference and satisfaction towards restaurants in Dehradun city, which aims to contribute to the limited research in this area and provide insight into the consumer decision-making process, specifically for the Indian foodservice industry. She did hypothesis testing using the chi-square test.

[6] Bidisha Das Baksi, Harrsha P, Medha, Mohinishree Asthana, Dr. Anitha C wrote a paper that studies various attributes of existing restaurants and analyses them to predict an appropriate location for a higher success rate of a new restaurant. The study of existing restaurants in a particular location and the growth rate of that location is important prior to selection of the optimal location. The aim is to create a web application that determines the location suitable for establishing a new restaurant unit, using machine learning and data mining techniques.

3. DATA SET DESCRIPTION
This is a Kaggle dataset (https://www.kaggle.com/himanshupoddar/zomato-bangalore-restaurants). It represents information on restaurants in the city of Bangalore, and contains 17 columns and 51,000 rows.


3.1 PreProcessing
The dataset contained 17 attributes.
 Records with null values were dropped from the ratings column and were replaced in the other columns with a numerical value.
 Values in the 'Rating' column were changed: the '/5' string was deleted. For example, if the rating of a restaurant was 3.5/5, it was changed to 3.5.
 Using LabelEncoding from the sklearn library, encoding was done on columns like book_table, online_order, rest_type and listed_in(city) (see the sketch after Section 3.2).

3.2 Feature Selection
We did not use any feature selection algorithms, but eliminated some columns based on available domain knowledge and a thorough study of the system.
Dropped columns are mentioned below:
 URL
 Address
 Dish_liked
 Phone
 Menu
 Review_list
 Location
 Cuisine
Some of these columns may look like they are important, but all of the same information could be found in other columns with lesser complexity.
The columns being used are as follows:
 Name
 Online_order
 Book_table
 Votes
 Rest_type
 Approx. cost of two people
 Listed_in(type)
 Listed_in(city)
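A minimal pandas/scikit-learn sketch of the preprocessing described in Section 3.1 follows. It assumes the Kaggle CSV with a 'rate' column holding strings like '3.5/5'; the exact column names used in the authors' own code are not given in the paper.

    import pandas as pd
    from sklearn.preprocessing import LabelEncoder

    df = pd.read_csv('zomato.csv')                 # Kaggle Zomato Bangalore data

    # Drop records whose rating is missing, then strip the '/5' suffix
    df = df.dropna(subset=['rate'])
    df['rate'] = df['rate'].astype(str).str.replace('/5', '', regex=False).str.strip()
    df = df[df['rate'].str.replace('.', '', regex=False).str.isdigit()]  # keep numeric ratings
    df['rate'] = df['rate'].astype(float)

    # Label-encode the categorical columns mentioned above
    for col in ['book_table', 'online_order', 'rest_type', 'listed_in(city)']:
        df[col] = LabelEncoder().fit_transform(df[col].astype(str))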
4. EXPLORATORY DATA ANALYSIS
A lot of effort went into the EDA, as it gives us detailed knowledge of our data. Exploratory Data Analysis (EDA) is an approach/philosophy for data analysis that employs a variety of techniques (mostly graphical) to
 maximize insight into a data set;
 uncover underlying structure;
 extract important variables;
 detect outliers and anomalies;
 test underlying assumptions;
 develop parsimonious models; and
 determine optimal factor settings.

1) Restaurant Rate Distribution
We can see that the number of restaurants with a rating between 3.5 and 4 is the highest. We will look into its dependencies further.

2) Approximate Cost of Two People
This is a graph of the approximate cost for 2 people dining in a restaurant. Restaurants with this cost below 1000 Rupees are the most numerous.

3) Online Ordering with respect to Rating (Finding Outliers)
This box plot helps us look into the outliers. We can also see that the online ordering service affects the rating: restaurants with an online ordering service have a rating from 3.5 to 4.


4) Booking Table with respect to Rating (Finding Outliers)
This box plot also helps us look into the outliers. It shows how table booking availability is seen in restaurants with a rating over 4.

5) Top Rated Restaurants
This graph just showcases the best restaurants in Bangalore along with their ratings.

6) Cost and Rate Distribution according to Online Ordering and Booking Table
A very important scatterplot shows the correspondence between the cost, online ordering, bookings and rating of the restaurant.

4.1 Key Findings

online_order   Votes         approx_cost (for two people)   Rating
No             367.992471    716.025190                     3.658071
Yes            343.228663    544.365434                     3.722440

book_table     Votes         approx_cost (for two people)   Rating
No             204.580566    482.404625                     3.620801
Yes            1171.342957   1276.491117                    4.143464


5. RESULTS

Algorithm                  Accuracy
Linear Regression          30%
KNN                        44%
Support Vector Machine     43%
Decision Tree              69%
Random Forest              81%
ADA Boost (DT)             83%
XGBoost                    72.26%
Gradient Boosting          52%

In this model, we have considered various restaurant records with features like the name, average cost, locality, whether it accepts online orders, whether we can book a table, and the type of restaurant. This model will help business owners predict their rating on the parameters considered in our model and improve the customer experience. Different algorithms were used, but in the end the final model selected is the AdaBoost Regressor, which gives the highest accuracy compared to the others.
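As a hedged sketch of the winning setup, the snippet below trains an AdaBoost regressor over decision trees; the tree depth and number of estimators are illustrative choices, not the authors' reported settings. X and y are assumed to be the encoded feature matrix and the cleaned rating column from the preprocessing sketch above.

    from sklearn.ensemble import AdaBoostRegressor
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.model_selection import train_test_split

    # X: encoded features (votes, online_order, book_table, ...); y: cleaned ratings
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    model = AdaBoostRegressor(DecisionTreeRegressor(max_depth=8),
                              n_estimators=100, random_state=42)
    model.fit(X_train, y_train)
    print('R^2 on held-out data:', model.score(X_test, y_test))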

6. CONCLUSIONS
This paper studies a number of features of existing restaurants in different areas of a city and analyses them to predict the rating of a restaurant. This makes the rating an important aspect to be considered before making a dining decision, and such analysis is an essential part of planning before establishing a venture like a restaurant.
A lot of research has been done on factors which affect sales and the market in the restaurant industry. Various dine-scape factors have been analysed to improve customer satisfaction levels.
If the data for other cities is also collected, such predictions could be made more accurate.

7. REFERENCES
[1] Chirath Kumarasiri, Cassim Faroo, "User Centric Mobile Based Decision-Making System Using Natural Language Processing (NLP) and Aspect Based Opinion Mining (ABOM) Techniques for Restaurant Selection". Springer 2018. DOI: 10.1007/978-3-030-01174-1_4
[2] Shina, Sharma, S. & Singha, A. (2018). "A Study of Tree Based Machine Learning Techniques for Restaurant Reviews". 2018 4th International Conference on Computing Communication and Automation (ICCCA). DOI: 10.1109/CCAA.2018.8777649
[3] I. K. C. U. Perera and H. A. Caldera, "Aspect based opinion mining on restaurant reviews," 2017 2nd IEEE International Conference on Computational Intelligence and Applications (ICCIA), Beijing, 2017, pp. 542-546. DOI: 10.1109/CIAPP.2017.8167276
[4] Rrubaa Panchendrarajan, Nazick Ahamed, Prakhash Sivakumar, Brunthavan Murugaiah, Surangika Ranathunga and Akila Pemasiri. "Eatery – A Multi-Aspect Restaurant Rating System". The 28th ACM Conference.
[5] Neha Joshi. "A Study on Customer Preference and Satisfaction towards Restaurant in Dehradun City". Global Journal of Management and Business Research (2012). Link: https://pdfs.semanticscholar.org/fef5/88622c39ef76dd773fcad8bb5d233420a270.pdf
[6] Bidisha Das Baksi, Harrsha P, Medha, Mohinishree Asthana, Dr. Anitha C. (2018). "Restaurant Market Analysis". International Research Journal of Engineering and Technology (IRJET). Link: https://www.irjet.net/archives/V5/i5/IRJET-V5I5489.pdf

Engineering College Admission Preferences Based on Student Performance

Dhruvesh Kalathiya, Rashmi Padalkar, Rushabh Shah
Student, M.Sc. (BDA), MIT-WPU, Pune, India

Sachin Bhoite
Assistant Professor, Department of Computer Science, Faculty of Science, MIT-WPU, Pune, India

Abstract: As we know, after the 12th board results the main problem for a student is to find an appropriate college for their further education. It is a tough decision for many students as to which college they should apply to. We have built a system that compares the student's data with past admission data and suggests colleges in sequence of preference. We have used Decision Tree, Support Vector Classifier, Extra Tree Classifier, Naïve Bayes, KNN and Random Forest as our statistical models to predict the probability of getting admission to a college. It was observed that Random Forest achieved the highest performance among all.

Keywords: Decision Tree, Random Forest, KNN, Extra Tree Classifier, SVC, Probabilities

1. INTRODUCTION
Education plays a vital role in today's era. When we talk about a career, a person's degree, course, university and the knowledge that he possesses are the key factors on which a firm hires a fresher. As soon as a student completes his/her Higher Secondary Schooling, the first goal of any student is to get into an appropriate college so that he can get a better education and guidance for his future. For that, students seek help from many sources like online sites or career experts to get the best options for their future. A good career counselor charges a huge amount for providing such solutions. Online sources are also not as reliable, as the data from a particular source is not always accurate. Students also perform their own analysis before applying to any institution, but this method is slow, certainly not consistent for getting actual results, and possibly includes human error. Since the number of applications to different universities each year is way too high, there is a need to build a system that is more accurate or precise to provide proper suggestions to students.

Our aim is to use machine learning concepts to predict the probability of a student getting admission into the preferred colleges and to suggest a list of colleges in sequence of the probability of getting admission to each specific college. The following are the steps that comprise the work we have done, in sequence of implementation.

2. RELATED WORKS
One of the researchers has done work on predicting the university and the students applying to explicit universities (Jay Bibodi). The first one is the university selection model, and the second one is a student selection model. They came across some issues like noisy data and unformatted text, but after cleaning the data they proceeded to 'model selection' with some important features. "University Selection Model" – a classification problem with a priori probability output; they found just two universities giving a higher probability output. "Student Selection Model" – classification using supervised learning like linear and kernel methods, Decision Tree and Random Forest. Random Forest provided better accuracy than the other algorithms, i.e. 90% accuracy.[1]

There is one more researcher – Himanshu Sonawane – who has researched a 'Student Admission Predictor'. It is a system built to help students who are studying abroad. This system helps students find the best foreign universities/colleges based on their performance in GRE, IELTS, 12th, graduation marks, statement of purpose, letter of recommendation, etc. Based on this information, it recommends the best-suited university/college. They have used three algorithms: KNN (76% accuracy), Decision Tree (80% accuracy) and Logistic Regression (68% accuracy). In the case of the decision tree, accuracy was nearly the same for both the training and testing datasets.[2]

From another research paper, we got to know what affects the likelihood of enrolling (Ahmad Slim – Predicting Student Enrolment Based on Student and College Characteristics). They have used machine learning to analyze the enrolment.


This work intends to provide decision-makers in the enrolment management administration a better understanding of the factors that are highly correlated with the enrolment process. They have used real data of the applicants who were admitted to the University of New Mexico (UNM). In their dataset, they have different features like gender, GPA, parent's income and student's income. They had data issues like missing values and categorical variables. They divided classification into two parts – classification at the individual level and classification at a cohort level. For classification at the individual level, the model was used to check the probability of enrolment and whether the applicant is enrolled or not. Logistic Regression (LR) provided an accuracy of 89% and Support Vector Machine (SVM) provided an accuracy of 91%, which was used in the classification at the individual level. The total enrolment in 2016 was actually 3402 but the prediction was 3478, obtained by using past year records (2015) with time series for classification at the cohort level.[3]

These researchers – Heena, Mayur, and Prashant from Mumbai – have used data mining and ML techniques to analyze the current scenario of admission by predicting the enrolment behavior of students. They have used the Apriori technique to analyze the behavior of students who are seeking admission to a particular college. They have also used the Naïve Bayes algorithm, which will help students to choose the course and help them in the admission procedure. In their project, they conducted a test for students who were seeking admissions and then, based on their performance, they suggested students a course branch using the Naïve Bayes algorithm.[4]

One more researcher has made a project for helping students by suggesting them the best-suited colleges in the USA based on their profile. He collected the data from online sources, where it was reported by students. He used 5-6 algorithms for his project. Naïve Bayes was the one which gave the highest accuracy among all of them. He predicted students' chances (probabilities) of getting admission in 5 different universities in the USA.[5]

Other researchers were predicting student admission (Students' Admission Prediction using GRBST with Distributed Data Mining – Dinesh Kumar B Vaghela). They have used the Global Rule Binary Search Tree (GRBST). While researching, they identified some problems, like the difficulty of maintaining a single database for all the colleges. This paper has two phases, i.e. a training phase and a testing phase. In the training phase, the J48 algorithm was used for all local sites; in the testing phase, users can interact with the system with the help of the application layer. They have used consolidation techniques in two ways, i.e. using the If…Then… rules format and a Decision Table. They have also used binary search tree construction. After applying this technique, they found the time complexity of generating the Binary Search Tree from the Decision Table is very low, and this BST also has efficient time complexity to predict the result. They conclude that data mining techniques can be useful in deriving patterns to improve the education system.[6]

The GRADE system was developed to help the graduate admission committee at the University of Texas at Austin Department of Computer Science (UTCS) by Austin Waters and Risto Miikkulainen, Department of Computer Science, 1 University Station C0500, University of Texas, Austin, TX 78712. This system first reads the applicant's files from the database and encodes them as a high-dimensional feature vector, and then a logistic regression classifier is trained on that data. It then predicts the probability for a binary classification. The feature vector encoding of a student's file indicates whether the applicant was rejected or admitted. The system was used to predict the probability of the admissions committee accepting an applicant or not; but in our model, we are trying to make it easy for the applicants to understand whether they should apply to that college or not.[7]

3. DATA EXTRACTION AND TRANSFORMATION
We have achieved our goals step-by-step to make the data steady, fitting it into our models and finding suitable machine learning algorithms for our system.

This step mainly contains data extraction, data cleaning, pre-processing, removing unwanted columns, feature selection and label encoding. These steps are shown in Figure 1.

Figure 1: Architecture (Raw Data/Dataset -> Data Pre-processing -> Feature Selection -> Train-Test Split -> Model Applied -> Model Tuning -> Final Output)

3.1 Dataset
Knowing about this use case, we need past admission data of multiple colleges to work on. We have extracted data from three different colleges, which includes information about a student's academic scores and the reservation category he falls in. Data has been mined from college registries. We have extracted 2054 records that include 13 attributes.
Attributes of the dataset are: First Name, Last Name, Email ID, Gender, Address, Date of Birth, Category, S.S.C. Percentage, H.S.C. Percentage, Diploma Percentage, Branch, Education Gap, and Nationality.

1. Data Preprocessing
Data preprocessing is an important task in machine learning. It converts raw data into clean data. The following are the techniques we applied to the data (a pandas sketch follows the list):


● Missing Values – Missing values are those values where information failed to load or the data itself was corrupted. There are different techniques to handle missing values. The one we applied is deleting rows, because some of the rows were blank and they may mislead the classification.
● Label Encoder – This is one of the most frequently used techniques for categorical variables. A label encoder converts labels into a numeric format so that the machine can recognize them. In our data, there are many attributes which are categorical variables, like gender, category and branch.
● Change in Data Type – Some attributes didn't include proper input. For example, the Nationality attribute included values like Indian, India and IND, which all meant the same country. For that purpose, we needed to change such values into a single format. 'Object' data type values in some attributes had to be changed into the 'float' data type. Some records included CGPA for S.S.C. scores, so we converted those records into a percentage. We made all these changes so that they don't affect our accuracy.
● Drop Columns – As per domain knowledge, we removed some columns which were not needed in our model.
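A minimal pandas sketch of these preprocessing steps, assuming hypothetical file and column names (the paper does not publish its code):

    import pandas as pd
    from sklearn.preprocessing import LabelEncoder

    df = pd.read_csv('admissions.csv')             # hypothetical file name
    df = df.dropna(how='all')                      # delete fully blank rows

    # Normalize inconsistent nationality spellings into a single format
    df['Nationality'] = df['Nationality'].replace({'India': 'Indian', 'IND': 'Indian'})

    # Label-encode categorical attributes such as gender, category and branch
    for col in ['Gender', 'Category', 'Branch']:
        df[col] = LabelEncoder().fit_transform(df[col].astype(str))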
2. Feature Selection
As we proceed further, before fitting our model we must make sure that all the features we have selected contribute to the model properly and the weights assigned to them are good enough so that our model gives satisfactory accuracy. For that, we tried 4 feature selection techniques: Lasso, Ridge, F1 Score and Extra Tree Classifier. Lasso, Ridge and F1 Score were removing the features that we needed the most, while the Extra Tree Classifier gave an acceptable importance score, which is shown below.

Extra Tree Classifier:
The Extra Tree Classifier fits randomized decision trees and uses averaging to improve the predictive accuracy and control over-fitting. We have used this to learn the important features of our data (a code sketch follows).

Figure 2: Feature Selection using Extra Tree Classifier

As we can see, our feature selection model gives more importance to the S.S.C. percentage, which is not appropriate. So, in this case, our domain knowledge is also helpful for making decisions in this type of situation.
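A short sketch of scoring feature importances with an Extra Trees model, along the lines described above; X and y are assumed to be the encoded attributes and the college label from the preprocessing sketch:

    from sklearn.ensemble import ExtraTreesClassifier

    model = ExtraTreesClassifier(n_estimators=100, random_state=42)
    model.fit(X, y)

    # Print features from most to least important
    for name, score in sorted(zip(X.columns, model.feature_importances_),
                              key=lambda pair: pair[1], reverse=True):
        print('%-20s %.4f' % (name, score))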

4. EXPLORATORY DATA ANALYSIS
As we saw in feature selection, some features which seemed not so important were contributing to our model. So, to understand those features, we need to do exploratory analysis on this data. We did exploratory analysis on a few features by grouping them and plotting them on graphs.

EDA on the Gender Column:
By grouping by gender and plotting the admissions in different colleges per gender, we identified some relations between a student's admission and his or her gender. As shown in Figure 4, for different genders most students lie in different bins for different colleges; even for different colleges we get different bell curves. Looking at this, we can confirm that the gender column is contributing to our model. For the Extra Tree Classifier, Gender contributes 1.3092% to the model.

Figure 4: Admission with respect to Gender

EDA on the Category Column:
By grouping by category, we calculated the percentage of students who got admission with respect to their category, as shown in Figure 3. This percentage of students was matching the reservation criteria as per Indian laws. This shows that the Category column is contributing to our model. For the Extra Tree Classifier, Category contributes 9.6582% to the model.

Figure 3: Admission with respect to Category

5. RESULT AND DISCUSSION


After removing all the noise from the data and after selecting appropriate features for our model, the next step is to find the best model, i.e. the one which gives us the most accuracy on both train and test. But before that, we must split our data into 2 parts, as we don't have any testing dataset right now.
So, we have divided this modeling section into 3 parts:
1. Train-Test Split
2. Modeling
3. Tuning Model

5.1 Train-Test Split


The training data set is used to create the model, while the testing data set is used to assess its performance. The training data's output is available to the model, while the test data is unseen. In our case, we have split the data into 70% training data and 30% testing data, because this makes the classification model better while the test data makes the error estimate more accurate. (A sketch follows.)
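A one-line version of this split with scikit-learn (X and y as prepared earlier):

    from sklearn.model_selection import train_test_split

    # 70% training, 30% testing, as described above
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)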

5.2 Modeling
The following are the models we applied to check which model gives better accuracy:

● Support Vector Classifier (SVC):


This algorithm is used for classification problems. The main objective of SVC is to fit the data you provide, returning a "best fit" hyperplane that divides or categorizes your data. From there, having obtained the hyperplane, you can feed features to your classifier to see what the "predicted" class is.

● Decision Tree:
A decision tree is non-parametric supervised
learning. It is used for both classification and regression
problems. It is a flowchart-like structure in which each
internal node represents a “test” on an attribute, each
branch represents the outcome of the test, and each leaf
node represents a class label. The path between root and
leaf represents classification rules. It creates a
comprehensive analysis along with each branch and
identifies decision nodes that need further analysis.

● Random Forest:
Random Forest is a meta estimator that uses a number of decision trees fitted on various sub-samples

www.ijcat.com 382
International Journal of Computer Applications Technology and Research
Volume 8–Issue 09, 379-384, 2019, ISSN:-2319–8656

drawn from the original dataset. We can also draw the data with replacement as per the requirements.

● K-Nearest Neighbors (KNN):
K-Nearest Neighbors (KNN) is a supervised learning algorithm that is used to solve regression as well as classification problems, where 'K' is the number of nearest neighbors around the query. It is simple to implement, easy to understand, and it is a lazy algorithm. A lazy algorithm means it does not need any training data points for model generation [8]; all the training data is used in the testing phase. This makes the training faster and the testing phase slower and costlier. By a costly testing phase we mean it requires more time and more memory.

● Naïve Bayes:
A Naïve Bayes classifier is a supervised machine-learning algorithm that uses Bayes' Theorem, in which the features are assumed statistically independent. It builds on probability theory and statistics: we teach our model from a given set of attributes (in training examples) and then form a hypothesis or a relation to a response variable. We then use this to predict a response, given the attributes of a new instance.

● Extra Tree Classifier:
The main objective of the Extra Tree Classifier is randomizing tree building further in the context of numerical input features, where the choice of the optimal cut-point is responsible for a large proportion of the variance of the induced tree.

We used all these models to fit our data and checked the accuracy, which is shown in Figure 5.

Model                              Accuracy
Support Vector Classifier (SVC)    47.79%
Decision Tree                      56.13%
Random Forest                      58.87%
K-Nearest Neighbors (KNN)          52.35%
Naïve Bayes                        42.29%
Extra Tree Classifier              58.33%

Figure 5: Accuracy for Different Models

5.3 Model Tuning
As we can see, our accuracy is not going beyond 60%. For that reason, we have tuned our model. Tuning is the method of increasing a model's performance without overfitting the data or making the variance too high. Hyperparameters differ from other model parameters in that they are not learned by the model automatically through training.

The following are the models we applied to check which model gives better accuracy:

● XGBoost:
XGBoost stands for eXtreme Gradient Boosting. XGBoost is an implementation of gradient boosted decision trees designed for speed and performance [9]. Using this we achieved 64% accuracy.

● AdaBoost:
AdaBoost is one of the first boosting algorithms to be adopted in practice. AdaBoost helps you combine multiple "weak classifiers" into one "strong classifier". AdaBoost is best used to boost the performance of decision trees on binary classification problems. Using this we achieved 61% accuracy.

We have used XGBoost and AdaBoost just for improving our accuracy. The accuracy of XGBoost is higher, and it improves our accuracy by 6%. But as our problem statement suggests, we do not need plain accuracy, as we are calculating the probabilities of getting admission to all the colleges and referring the top probabilities to the student; a sketch of this ranking follows.
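A sketch of that ranking step, assuming a fitted classifier whose classes are college identifiers and a single student's feature row; the variable names are illustrative:

    from sklearn.ensemble import RandomForestClassifier

    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)                    # y_train: college admitted to

    # Probability of admission to each college for one student, highest first
    probs = model.predict_proba(student_row.values.reshape(1, -1))[0]
    for college, p in sorted(zip(model.classes_, probs),
                             key=lambda pair: pair[1], reverse=True):
        print('%s: %.2f%%' % (college, 100 * p))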
6. CONCLUSIONS
The objective of this project is achieved in this process flow, which can be used by students to identify the appropriate colleges based on their performance. The main aspects of students which are taken under consideration are their 10th and 12th percentages, and their diploma percentage too if applicable. Besides that, gender, category, education gap and the branch in which a student wants to get admission also contribute to admission. The final model for our project is Random Forest, as it gives a satisfactory output.

As we looked at our data, we observed that this is just the data of students who took admission. There is no data of rejected students, nor do we have students' choice of college. We can ask students for their choice of college to get better data for accuracy. We were also about to consider the address of the student, as we know that different seats are reserved for students who belong to different states; but in our data, most of the address columns are not filled properly, so we removed that column. For that, we can keep dropdown buttons on the online application form for cities, states and countries so that we get proper data. We can also ask for entrance exam results, which can help us predict more accurately.

7. REFERENCES
[1] Bibodi, J., Vadodaria, A., Rawat, A. and Patel, J. (n.d.). "Admission Prediction System Using Machine Learning". California State University, Sacramento.
[2] Himanshu Sonawane, Mr. Pierpaolo Dondio. "Student Admission Predictor". School of Computing, National College of Ireland. Unpublished.
[3] A. Slim, D. Hush, T. Ojah, T. Babbitt. [EDM-2018] "Predicting Student Enrollment Based on Student and College Characteristics". University of New Mexico, Albuquerque, USA.
[4] Heena Sabnani, Mayur More, Prashant Kudale. "Prediction of Student Enrolment Using Data Mining Techniques". Dept. of Computer Engineering, Terna Engineering College, Maharashtra, India.


[5] Bhavya Ghai. "Analysis & Prediction of American Graduate Admissions Process". Department of Computer Science, Stony Brook University, Stony Brook, New York.
[6] Dineshkumar B Vaghela, Priyanka Sharma. "Students' Admission Prediction using GRBST with Distributed Data Mining". Gujarat Technological University, Chandkheda.
[7] Austin Waters, Risto Miikkulainen. "GRADE: Machine Learning Support for Graduate Admissions". University of Texas, Austin, Texas.
[8] https://www.datacamp.com/community/tutorials/k-nearest-neighbor-classification-scikit-learn
[9] https://www.meetup.com/Big-Data-Analytics-and-Machine-Learning/events/257926117/


Wine Quality Prediction using Machine Learning Algorithms

Devika Pawar[1], Aakanksha Mahajan[2], Sachin Bhoithe[3]
[1][2] M.Sc. (Big Data Analytics), MIT-WPU, Pune, India
[3] Faculty of Science, MIT-WPU, Pune, India

Abstract: Wine classification is a difficult task since taste is the least understood of the human senses. A good wine quality
prediction can be very useful in the certification phase, since currently the sensory analysis is performed by human tasters, being
clearly a subjective approach. An automatic predictive system can be integrated into a decision support system, helping the
speed and quality of the performance. Furthermore, a feature selection process can help to analyze the impact of the analytical
tests. If it is concluded that several input variables are highly relevant to predict the wine quality, since in the production process
some variables can be controlled, this information can be used to improve the wine quality. Classification models used here are
1) Random Forest 2) Stochastic Gradient Descent 3) SVC 4)Logistic Regression .

Keywords: Machine Learning, Classification, Random Forest, SVM, Prediction.


I. INTRODUCTION
The aim of this project is to predict the quality of wine on a scale of 0–10 given a set of features as inputs. The dataset used is the Wine Quality Data Set from the UCI Machine Learning Repository. The input variables are fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulphur dioxide, total sulphur dioxide, density, pH, sulphates and alcohol, and the output variable is quality (a score between 0 and 10). We are dealing only with red wine. We have quality being one of these values: [3, 4, 5, 6, 7, 8]; the higher the value, the better the quality. In this project we treat each class of the wine separately, and the aim is to be able to find decision boundaries that work well for new unseen data. These are the classifiers.

In this paper we explain, in a simple non-technical way, the steps we followed to build our models for predicting the quality of red wine. We are dealing only with red wine. We would follow a similar process for white wine, or we could even mix them together and include a binary attribute red/white, but our domain knowledge about wines suggests that we shouldn't. Classification is used to classify the wine as good or bad. Because the classes are determined before examining the data, this is referred to as supervised learning.

II. RELATED WORK
Various researchers and students have published related work in national and international research papers and theses; we studied these to understand the objective, the types of algorithms they have used and various techniques for pre-processing.

The College of Intelligent Science and Engineering, China has written a paper on an Evaluation and Analysis Model of Wine Quality Based on a Mathematical Model. They have used various mathematical tests to predict the quality of wine. The Mann-Whitney U test is used to analyze the wine evaluation results of the two wine tasters, and it is found that the significant difference between the two is small. The paper then uses the Cronbach Alpha coefficient method to analyze the credibility of the two groups of data.[1]

Paulo Cortez, Juliana Teixeira, António Cerdeira, Fernando Almeida, Telmo Matos and José Reis wrote a paper on wine quality assessment using data mining techniques. In this paper, they proposed a data mining approach to predict wine preferences that is based on easily available analytical tests at the certification step. A large dataset was considered, with white vinho verde samples from the Minho region of Portugal. Wine quality is modeled under a regression approach, which preserves the order of the grades. 95% accuracy was obtained using these data mining techniques.[2]

The study of this paper was done at the International Journal of Intelligent Systems and Applications in Engineering, and the paper was published on 3rd September 2016. The main objective of this research paper was to predict wine quality based on physicochemical data. In this study, two large separate data sets taken from the UC Irvine Machine Learning Repository were used. The instances were successfully classified as red wine and white wine with an accuracy of 99.5229% by using the Random Forests algorithm.[3]

III. PROPOSED WORK
A. Data Set:
Dataset/Source: Kaggle (https://www.kaggle.com/uciml/red-wine-quality-cortez-et-al-2009)
Structured/Unstructured data: Structured data in CSV format.


Dataset Description: The two datasets are related to red wine of the Portuguese "Vinho Verde" wine. For more details, consult [Web Link] or the reference [Cortez et al., 2009]. Due to privacy and logistic issues, only physicochemical (input) and sensory (output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.). These datasets can be viewed as classification or regression tasks. The classes are ordered and not balanced (e.g. there are many more normal wines than excellent or poor ones). Outlier detection algorithms could be used to detect the few excellent or poor wines. Also, we are not sure if all input variables are relevant, so it could be interesting to test feature selection methods.

Input variables:
1) fixed acidity
2) volatile acidity
3) citric acid
4) residual sugar
5) chlorides
6) free sulfur dioxide
7) total sulfur dioxide
8) density
9) pH
10) sulphates
11) alcohol
Output variable (based on sensory data):
12) quality (score between 0 and 10)
IV. DATA PROCESSING METHODS

For making automated decisions on model selection we need to quantify the performance of our model and give it a score. For that reason, for the classifiers, we are using the F1 score, which combines two metrics: precision, which expresses how accurate the model was in predicting a certain class, and recall, which expresses how completely the model retrieves the instances of that class (penalizing missed, misclassified instances). Since we have multiple classes, we have multiple F1 scores. We will be using the unweighted mean of the F1 scores for our final scoring. This is a business decision, because we want our models to be optimized to classify instances that belong to the minority side, such as wine qualities of 3 or 8, equally well with the rest of the qualities that are represented in larger numbers. For the regression task we score based on the coefficient of determination, which is basically a measurement of whether the predictions and the actual values are highly correlated; the larger this coefficient, the better. For regressors we can also get an F1 score if we first round our predictions. (A scoring sketch follows.)
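These scores map directly onto scikit-learn metrics; a small sketch, with y_test/y_pred (classification) and y_test_reg/y_pred_reg (regression) assumed to exist:

    from sklearn.metrics import f1_score, r2_score

    # Unweighted (macro) mean of the per-class F1 scores, so rare qualities
    # such as 3 or 8 weigh as much as the common ones
    macro_f1 = f1_score(y_test, y_pred, average='macro')

    # Coefficient of determination for the regression task
    r2 = r2_score(y_test_reg, y_pred_reg)
    print(macro_f1, r2)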

Splitting for Testing: We are keeping 20% of our dataset to treat as unseen data, to be able to test the performance of our models. We are splitting our dataset in such a way that all of the wine qualities are represented proportionally equally in both the training and testing datasets. Other than that, the selection is done randomly with uniform distribution.

Various classification and regression algorithms are used to fit the model. The classification algorithms used in this paper are as follows:
Random Forest Decision Trees classifier
Support Vector Machine classifier
Stochastic gradient descent
Logistic Regression classifier

Preprocessing: Label encoding is used to convert the labels into numeric form, so as to convert them into machine-readable form. It is an important pre-processing step for a structured dataset in supervised learning. We have used label encoding to label the quality of the wine as good or bad, assigning 1 to good and 0 to bad.

Feature Selection: As we can clearly see, volatile acidity and residual sugar are both not very impactful on the quality of wine; hence we can eliminate these features. Though we are selecting these features now, they may change according to the domain experts. (A sketch of the labeling and split follows.)
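A sketch of the good/bad labeling and the stratified split described above, assuming the Kaggle CSV and a quality threshold of 7 (the paper does not state the exact cut-off used):

    import pandas as pd
    from sklearn.model_selection import train_test_split

    df = pd.read_csv('winequality-red.csv')

    # Binarize quality: 1 = good, 0 = bad (threshold assumed here)
    df['label'] = (df['quality'] >= 7).astype(int)

    X = df.drop(columns=['quality', 'label'])
    y = df['label']

    # 80/20 split, stratified so each class keeps its proportion in both sets
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.20, stratify=y, random_state=42)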


Result and Discussion:

Exploratory Data Analysis:
 The bar plot of the label shows the count of data which is good or bad. We can see 80% of the data is classified as good wine quality and 20% as bad wine quality.
 A bar plot shows a directly proportional relation between citric acid and quality. As the quality of wine increases, the amount of citric acid also increases, which shows that citric acid is an important feature on which the quality of wine depends.
 Free sulphur dioxide also contributes greatly to the quality of wine; its bar plot gives us a clearer picture.

Algorithms used for classification are:
1) Logistic Regression
2) Stochastic gradient descent
3) Support Vector Classifier
4) Random Forest

 Logistic Regression gave us an accuracy of 86%.
Performance matrix of Logistic Regression:

Class  Precision  Recall  F1-Score  Support
0      0.88       0.98    0.93      273
1      0.71       0.26    0.37      47

 Stochastic gradient descent was able to give an average accuracy of 81%.
Performance matrix of SGD:

Class  Precision  Recall  F1-Score  Support
0      0.89       0.93    0.91      273
1      0.42       0.30    0.35      47

 Support Vector Classifier has given an accuracy of 85%.
Performance matrix of SVC:

Class  Precision  Recall  F1-Score  Support
0      0.89       0.93    0.91      273
1      0.71       0.26    0.37      47

 Random Forest gave us an accuracy of 87.33%.
Performance matrix of Random Forest:

Class  Precision  Recall  F1-Score  Support
0      0.90       0.97    0.93      273
1      0.68       0.40    0.51      47
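Per-class tables like the performance matrices above can be produced with scikit-learn's classification_report; a sketch using the split from the previous section:

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score, classification_report

    clf = RandomForestClassifier(n_estimators=200, random_state=42)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)

    print('Accuracy:', accuracy_score(y_test, y_pred))
    print(classification_report(y_test, y_pred))   # per-class precision/recall/F1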

CONCLUSION
Based on the bar plots, we come to the conclusion that not all input features are essential and affect the data. For example, from the bar plot of quality against residual sugar, we see that as the quality increases the residual sugar stays moderate and does not change drastically. So this feature is not as essential as others like alcohol and citric acid, and we can drop it during feature selection.

For classifying the wine quality, we have implemented multiple algorithms, namely:
1) Logistic Regression
2) Stochastic gradient descent
3) Support Vector Classifier
4) Random Forest

We were able to achieve a maximum accuracy of 88% using Random Forest. Stochastic gradient descent gives an accuracy of 81%, SVC has an accuracy of 85%, and Logistic Regression 86%.

References:
[1] Yunhui Zeng, Yingxia Liu, Lubin Wu, Hanjiang Dong. "Evaluation and Analysis Model of Wine Quality Based on Mathematical Model". ISSN 2330-2038, E-ISSN 2330-2046. Jinan University, Zhuhai, China.
[2] Paulo Cortez, Juliana Teixeira, António Cerdeira. "Using Data Mining for Wine Quality Assessment".
[3] Yesim Er, Ayten Atasoy. "The Classification of White Wine and Red Wine According to Their Physicochemical Qualities". ISSN 2147-6799, 3rd September 2016.


Expert System for Student Placement Prediction

Krishna Gandhi, Aadesh Dalvi, Aniket Walse, Sachin Bhoite
M.Sc. Data Science and Big Data Analytics, MIT-WPU, Kothrud, Pune, Maharashtra, India

Abstract: Data mining is a process of extracting and identifying previously unknown and potentially useful information or patterns from large amounts of data using different methods and techniques. Data mining in the domain of education is known as Educational Data Mining (EDM). This paper discusses an expert system which can be used as a student placement prediction system. A statistical model is applied to a reputed college's past data after data pre-processing and feature selection. This model can be used to predict the percentage chance of a student getting selected in campus placement. It will help students evaluate themselves and identify which skills are essential.

Keywords: Data Mining, Educational Data Mining, Expert system, Statistical model, Data pre-processing, Feature selection.
1. INTRODUCTION
This model concerns those students who want a better placement for a better future. Sometimes a student gets sidetracked from studies in the initial semesters, and later they realize the importance of their marks/CGPA. Basically, this will help them enhance their performance and make them believe that they can achieve their dream job. Application of EDM is an evolving trend worldwide [1]. This will help college faculty show the precise roadmap to students when it comes to placement and choosing their career path.

It will guide colleges and institutions in maintaining their reputation by making the most of the placements. It drives students to ask questions regarding what can nurture them. It can give an overview to junior college students while selecting their stream:
1) What subjects to target?
2) What skills to improvise on?
3) Probability of getting placed after choosing their specialization?

Data visualization will help college students get a clearer view regarding which stream they should choose. This can be done using different Python libraries like Matplotlib, where students and faculties can visualize an overview of each stream.

This paper describes a model to predict the percentage of skills required by engineering students pursuing Bachelor's and Master's degrees with respect to a company's skillset necessity. According to the rules generated, the percentage of selection of students will vary. These rules are generated with the help of a Domain Expert.

Percentage of Selection = (Criteria Satisfied / Number of Criteria) * 100

1.1 Literature Review
The researchers have studied several related national & international research papers and theses to understand aims, techniques used, various expert systems, datasets, data preprocessing approaches, feature selection methods, etc.

Siddu P. Algur, Prashant Bhat and Nitin Kulkarni used two algorithms – Random Tree and J48 – to construct classification models using the Decision Tree concept. The Random Tree classification model is more effective as compared to the J48 classification model [2].

Machine learning algorithms are applied in the Weka environment and R Studio by K. Sreenivasa Rao, N. Swapna and P. Praveen Kumar. Results are tabulated and analyzed; they show the Random Tree algorithm gives 100% accuracy in prediction on their dataset, and also that in the R environment Recursive Partitioning & Regression Tree performs better and gives 90% accuracy. We also accept that performance depends on the nature of the dataset [3].

V. Ramesh, P. Parkavi and P. Yasodha also proved that the Multilayer Perceptron algorithm is most suitable for predicting student performance. MLP gives 87% prediction, which is comparatively higher than the other algorithms [4].

K. Sripath Roy, K. Roopkanth, V. Uday Teja, V. Bhavana and J. Priyanka: the data is trained and tested with all three algorithms, and out of all, SVM gave the most accuracy with 90.3%, followed by XGBoost with 88.33% accuracy [5].

Ajay Kumar Pal met his goal and proved that the top algorithm is Naïve Bayes classification, with an accuracy of 86.15% and an average error of 0.28 compared with the others. He also conveyed that Naïve Bayes has the potential to outperform conventional classification methods [6].

Sudheep Elayidom, Dr. Suman Mary Idikkula and Joseph Alexander studied past data and followed the trend; based on that, the judgment for the future is given [7].

2. DATASET DESCRIPTION
The data used in this model is supplied by a well-known Engineering College situated in Pune, Maharashtra. The data is collected from the details given by graduates, post graduates and diploma holders in engineering of various streams during the year 2019. It includes students' 10th, 12th or Diploma scores and semester-wise aggregates for Bachelor's and Master's. The dataset contains 2330 tuples and 81 attributes holding stream-wise data of the students.

2.1 Data Pre-Processing
The data has redundant, incomplete, inconsistent and inaccurate entries. We discovered that there were many different attributes which seemed to be superfluous and which won't affect our results. By consulting our Domain Expert, we decided to remove those attributes as well as tuples, using tools like Excel. Entries with human errors seem to be illusory, so as per discussion with the Expert we decided to apply the mean to the data using Python.

2.2 Feature Selection
Attributes impacting the placements of students were taken into consideration with the help of Expert advice.


Factors like 10th, 12th or Diploma and Degree aggregation also affect the placement predictions for students; non-educational attributes like work experience, projects and external certifications were also taken into consideration.

3. PROPOSED EXPERT SYSTEM
After analyzing a lot, we came to the idea of creating our own system, which will help us to understand the genuine accuracies for placement.

1) A student fills the form and it gets stored in a CSV.
2) The CSV contains all the details of the students of all streams.
3) Then the system filters all streams and writes the details one by one into a new CSV for each stream.
4) Then the domain expert analyzes each stream and sets criteria for all the attributes.
5) The system checks whether students satisfy the specific criteria set.
6) Then the criteria decide which students are going to be placed in a campus placement and which are not.

3.1 Pseudo Code of the System
This is basically the demo logic which we have decided to implement in our model. It is a rule-setting methodology where a student has to fulfill all the criteria for getting a reputed placement. (A runnable Python sketch follows the pseudocode.)

1. Load all student data.
2. Form rules based on the stream with the help of the expert.
3. Enter the data of new students for the prediction of the placement.
4. Calculate the number of criteria satisfied by that student:

   Flag = 0
   If rule 1 satisfied:
       Flag = Flag + 1
   If rule 2 satisfied:
       Flag = Flag + 1

5. Perform this operation for all the criteria created.
6. Calculate our prediction by:

   Prediction percentage = (Flag / Number of rules) * 100
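A runnable Python sketch of this rule-based scoring, with illustrative criteria (the actual field names and thresholds come from the domain expert and vary per stream):

    def prediction_percentage(student, rules):
        # Share of stream-specific rules the student satisfies, as a percentage
        flag = sum(1 for rule in rules if rule(student))
        return flag / len(rules) * 100

    # Illustrative rules for one stream; names and thresholds are assumptions
    rules = [
        lambda s: s['ssc_pct'] >= 60,
        lambda s: s['hsc_or_diploma_pct'] >= 60,
        lambda s: s['degree_aggregate'] >= 60,
        lambda s: s['education_gap'] == 0,
    ]

    student = {'ssc_pct': 72, 'hsc_or_diploma_pct': 65,
               'degree_aggregate': 68, 'education_gap': 0}
    print(prediction_percentage(student, rules))   # prints 100.0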

4. EXPLORATORY ANALYSIS
A pie chart is a circular statistical illustration, which is divided into different parts to demonstrate numerical proportion.

Figure.3.1 Pie chart of Class wise placement for Petroleum


Figure.3.2 Pie chart of Class wise placement for Chemical Engineering
Figure.3.3 Pie chart of Class wise placement for Civil Engineering
Figure.3.4 Pie chart of Class wise placement for E&TC Engineering
Figure.3.5 Pie chart of Class wise placement for IT Engineering
Figure.3.6 Pie chart of Class wise placement for Mechanical Engineering

The pie charts shown above give information about students placed through campus interviews based on their grade, for each stream. This gives students an overview of each stream and of the importance of aggregates according to the stream.

5. CONCLUSION
This paper examines an application of Educational Data Mining (EDM). It elaborates a model to create awareness for students so they can build a better career pathway for their future. Students, with the help of their professors and placement team, can make use of this model to get better placement opportunities and enhance their skillsets.

In future, this model can be compared with existing machine learning algorithms like Linear Regression, Logistic Regression and Decision Tree, which will help us to understand the accuracy of


the percentage of Machine Learning and Statistics. We will come to know what accurate percentage to rely on by comparing statistics and Machine Learning.

6. REFERENCES
[1] Dr. Mohd Maqsood Ali,” ROLE OF DATA
MINING IN EDUCATION SECTOR”,
IJCSMC, Vol. 2, Issue. 4, April 2013, pg.374
– 383
[2] Siddu P. Algur, Prashant Bhat and Nitin Kulkarni, "EDUCATIONAL DATA MINING: CLASSIFICATION TECHNIQUES FOR RECRUITMENT ANALYSIS", I.J. Modern Education and Computer Science, 2016, 2, 59-65.
[3] K. Sreenivasa Rao, N. Swapna, P. Praveen
Kumar “EDUCATIONAL DATA MINING
FOR STUDENT PLACEMENT
PREDICTION USING MACHINE
LEARNING ALGORITHMS”, International
Journal of Engineering & Technology, 7
(1.2) (2018) 43-46

[4] V. Ramesh, P. Parkavi and P. Yasodha,”


PERFORMANCE ANALYSIS OF DATA
MINING TECHNIQUES FOR
PLACEMENT CHANCE PREDICTION”,
International Journal of Scientific &
Engineering Research Volume 2, Issue 8,
August-2011.

[5] K. Sripath Roy 1, K. Roopkanth, V. Uday


Teja, V. Bhavana, J. Priyanka,” STUDENT
CAREER PREDICTION USING
ADVANCED MACHINE LEARNING
TECHNIQUES”, International Journal of
Engineering & Technology, 7 (2.20) (2018)
26-29.

[6] Ajay Kumar Pal, "Classification Model of Prediction for Placement of Students", Published Online November 2013 in MECS (https://www.mecs-press.org/), DOI: 10.5815/ijmecs.2013.11.07.

[7] Sudheep Elayidom, Dr. Suman Mary Idikkula and Joseph Alexander, "Applying Data Mining using Statistical Technique for Career Selection", International Journal of Research Trends in Engineering, Vol. 1, No. 1, May 2009.

Applications of Machine Learning for Prediction of Liver Disease

Khan Idris, Student (MSc Big Data Analytics), MIT-WPU, Pune, Maharashtra, India
Sachin Bhoite, Assistant Professor, MIT-WPU, Pune, Maharashtra, India

Abstract: Patients in India with liver disease are continuously increasing because of excessive consumption of alcohol, inhalation of harmful gases, and intake of contaminated food, pickles and drugs. It is expected that by 2025 India may become the World Capital for Liver Diseases. The widespread occurrence of liver infection in India is attributed to a sedentary lifestyle, increased alcohol consumption and smoking. There are about 100 types of liver infections. Therefore, building a model that will help doctors predict, at an early stage, whether a patient is likely to have liver disease will be a great advantage. Diagnosis of liver disease at a preliminary stage is important for better treatment. We also compare different algorithms for better accuracy.

Keywords: Indian Liver Patients, Machine Learning, Logistic regression, Support Vector Machine, Random Forest, AdaBoost,
Bagging.

1. INTRODUCTION:

As there is growth in liver patients in India, and it is estimated that by the year 2025 India may be the World Capital for Liver Diseases, we should have a solution for this kind of problem, and for this it is very important for doctors to identify liver disease at an early stage. To identify liver disease at an early stage, we are building a machine learning model which will predict whether a patient should be diagnosed or not. We will be using different algorithms as well as ensemble methods, as liver disease can be diagnosed by analyzing the levels of enzymes in the blood. The objective of this model is to increase the survival rate of liver patients by using previous data about the levels of enzymes in their bodies. We have records of 583 patients, of which 416 are records of liver patients and 167 are records of non-liver patients.

2. RELATED WORK:

Ramana made a critical study on liver disease diagnosis by evaluating some selected classification algorithms, such as the naïve Bayes classifier, C4.5, back propagation neural network, K-NN and support vector machine. The authors obtained 51.59% accuracy with the Naïve Bayes classifier, 55.94% with the C4.5 algorithm, 66.66% with the back propagation neural network, 62.6% with KNN and 62.6% accuracy with the support vector machine. The poor performance in the training and testing of the liver disorder dataset resulted from an insufficiency in the dataset [1]. We have also gone through the research paper Diagnosis of Liver Disease Using Machine Learning Techniques by Joel Jacob, Joseph Chakkalakal Mathew, Johns Mathew and Elizabeth Issac. They have used FOUR classification algorithms – Logistic Regression, Support Vector Machines (SVM), K Nearest Neighbor (KNN) and artificial neural networks (ANN) – which were considered for comparing their performance based on the liver patient data. The authors obtained 73.23% accuracy on Logistic Regression, 72.05% on k-NN and 75.04% accuracy on the support vector machine [2]. We have also gone through the paper Liver Patient Classification using Intelligence Techniques by Jankisharan Pahareeya, Rajan Vohra, Jagdish Makhijani and Sanjay Patsariya. In this paper the authors have used six intelligence techniques on the ILPD (Indian Liver Patient) Data Set; throughout the study, ten-fold cross validation is performed [3]. In "Machine Learning Techniques on Liver Disease", the authors have shown different types of techniques for disease prediction; the algorithms Logistic Regression, SVM, Decision Tree, Random Forest and ensemble techniques are used [4]. In "Liver Classification Using Modified Rotation Forest", the authors have gone through various classification algorithms to increase the accuracy and have done feature selection. Accuracy in this paper was 73.33% [5].

3. PROCESS IMPLEMENTATION:

The work flow is: first we preprocess the data, then do some data visualization, then we train the model with different algorithms, selecting the algorithm with the best output.


3.1 Dataset

The Indian Liver Patient Dataset consists of 10 attributes, namely Age, Gender, Total Bilirubin, Direct Bilirubin, Alkaline Phosphatase, Alamine Aminotransferase, Aspartate Aminotransferase, Total Proteins, Albumin, and Albumin and Globulin Ratio, together with a Dataset (result) field, for 583 patients (416 records of liver patients and 167 of non-liver patients). Patients are labeled either 1 or 2 on the basis of liver disease. The feature Gender is converted to a numeric value (0 and 1) in the data pre-processing step.

3.2 Data preprocessing

Data pre-processing is an essential step in solving any machine learning problem; it is said that 80% of a data scientist's time is spent on preprocessing. The most commonly used preprocessing techniques are few, such as missing value imputation, encoding and scaling, but every dataset is different and poses unique challenges. All features except Gender are real-valued, so in that column males are labeled '1' and females '0'.

The last column, the Dataset (result) field, is the label, with '1' representing presence of disease and '2' representing absence of disease. This column is relabeled as '1' for liver patients and '0' for non-liver patients. The total number of records is 583, with 416 liver patient records and 167 non-liver patient records. During data visualization it was observed that some values in the Albumin and Globulin Ratio column are null; these null values are replaced with the median value of the column.
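As an illustration, the preprocessing steps above could be coded with pandas as in the following sketch. The file name and exact column names are assumptions (they follow the common CSV distribution of the ILPD data), not something fixed by this paper.

import pandas as pd

# Load the ILPD records; file name and header layout are assumed here.
df = pd.read_csv("indian_liver_patient.csv")

# Encode Gender: males as 1, females as 0.
df["Gender"] = (df["Gender"] == "Male").astype(int)

# Relabel the target: 1 (liver patient) stays 1, 2 (non-patient) becomes 0.
df["Dataset"] = df["Dataset"].map({1: 1, 2: 0})

# Replace the nulls in Albumin_and_Globulin_Ratio with the column median.
ratio = "Albumin_and_Globulin_Ratio"
df[ratio] = df[ratio].fillna(df[ratio].median())

# Split into features and label for the classifiers below.
X = df.drop(columns="Dataset")
y = df["Dataset"]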
3.3 Classification Techniques

a. Logistic Regression: Logistic regression is a supervised machine learning algorithm that predicts a binary outcome from past data. It usually returns results in a very short time, hence it is often preferred as a benchmarking model [4]. Logistic regression is one of the simpler classification models; because of its parametric nature it can, to some extent, be interpreted by looking at its parameters, which is useful when experimenters want to examine relationships between variables. A parametric model can be described entirely by a vector of parameters θ = (β0, β1, ..., βp). An example of a parametric model is the straight line y = mx + c, where the parameters are c and m; with known parameters the entire model can be recreated. Logistic regression is a parametric model whose parameters are the coefficients of the predictor variables, written as β0 + β1X1 + ... + βpXp, where β0 is called the intercept [2]. Since the accuracy achieved by plain logistic regression was 73.23%, we then apply AdaBoost to the logistic regression model.
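As a minimal sketch (not the authors' exact code), the benchmark logistic regression could be trained as follows, continuing from the X and y produced in the preprocessing step; the 80/20 split ratio and the fixed random seed are our assumptions.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hold out a stratified test set; split ratio and seed are assumed.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Plain logistic regression as the benchmark classifier.
logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)
print("Logistic Regression accuracy:", logreg.score(X_test, y_test))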
b. Support Vector Machines: The support vector machine (SVM) is a supervised learning algorithm that can be used for classification and regression problems, as support vector classification (SVC) and support vector regression (SVR) respectively. It is used for smaller datasets, as it takes too long to train on large ones.

Fig. 1 Support Vector Machine
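A hedged sketch of the SVM classifier, reusing the train/test split above. The StandardScaler step is our addition (SVMs are sensitive to feature scale), and the kernel and C value are assumptions, not settings the paper specifies.

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Standardize features before the RBF-kernel SVM; kernel and C are assumed.
svm_clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
svm_clf.fit(X_train, y_train)
print("SVM accuracy:", svm_clf.score(X_test, y_test))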
c. Random Forest: Random Forest is a meta-estimator that fits a number of decision trees on various sub-samples drawn, with replacement, from the original dataset. A decision tree is a non-parametric supervised learning method used for both classification and regression problems. It is a flowchart-like structure in which each internal node represents a "test" on an attribute, each branch represents the outcome of the test, and each leaf node represents a class label; the path between root and leaf represents a classification rule. It provides a comprehensible analysis along each branch and identifies the decision nodes that need further analysis.
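A corresponding random forest sketch follows; the number of trees is an assumption, since the paper does not report its hyper-parameters.

from sklearn.ensemble import RandomForestClassifier

# Each tree is fit on a bootstrap sample of the training data;
# the forest predicts by majority vote over the trees.
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
print("Random Forest accuracy:", rf.score(X_test, y_test))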
d. AdaBoost: AdaBoost was one of the first boosting algorithms to be adopted in practice. It combines multiple "weak classifiers" into a single "strong classifier"; the classical weak learners in AdaBoost are decision trees with a single split, called decision stumps. AdaBoost works by putting more weight on instances that are difficult to classify and less on those already handled well. AdaBoost algorithms can be used for both classification and regression problems [6].
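Since this paper boosts the logistic regression model rather than decision stumps, a sketch of that combination might look as follows; the number of boosting rounds is assumed, and scikit-learn releases before 1.2 name the estimator parameter base_estimator instead.

from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import LogisticRegression

# Each boosting round re-weights the training instances that the
# previous rounds misclassified, then fits a fresh base model.
# (`estimator` is the scikit-learn >= 1.2 parameter name.)
ada_lr = AdaBoostClassifier(
    estimator=LogisticRegression(max_iter=1000),
    n_estimators=50, random_state=42)
ada_lr.fit(X_train, y_train)
print("AdaBoost(LogReg) accuracy:", ada_lr.score(X_test, y_test))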


e. Bagging: Bootstrap aggregating, also called bagging, is a machine learning ensemble meta-algorithm designed to improve the stability and accuracy of machine learning algorithms used in statistical classification and regression. It also reduces variance and helps to avoid overfitting.
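The paper applies bagging to the random forest model; a sketch of that, with an assumed ensemble size, could be the following (as with AdaBoost, older scikit-learn releases call the parameter base_estimator).

from sklearn.ensemble import BaggingClassifier, RandomForestClassifier

# Each member forest sees a different bootstrap sample of the
# training set; predictions are aggregated by voting.
bag_rf = BaggingClassifier(
    estimator=RandomForestClassifier(n_estimators=100, random_state=42),
    n_estimators=10, random_state=42)
bag_rf.fit(X_train, y_train)
print("Bagging(RandomForest) accuracy:", bag_rf.score(X_test, y_test))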
4. RESULTS AND EVALUATION:

The main objective was to predict, at an early stage, whether a patient should be diagnosed, using algorithms such as SVM, Logistic Regression and Random Forest. These algorithms were also used in previous studies; here we improve their accuracy by using Bagging and AdaBoost.

Fig. 2 Accuracies of Algorithms

As we can see in the figure above, the accuracies of the algorithms increase. We got an accuracy of 73.5% for Logistic Regression; by applying the AdaBoost classifier the accuracy increased to 74.35%.

For the Support Vector Machine we got 70.94%, and for Random Forest classification 66.67%; here we obtained a considerable increase in accuracy by using Bagging, which reached 72.64%.
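A comparison like Fig. 2 can be reproduced by tabulating the held-out accuracies of all the models fitted in the sketches above; this is illustrative only, and the exact numbers will depend on the split and seeds assumed earlier.

from sklearn.metrics import accuracy_score

# Compare every fitted model on the same held-out test set.
models = {
    "Logistic Regression": logreg,
    "AdaBoost(LogReg)": ada_lr,
    "SVM": svm_clf,
    "Random Forest": rf,
    "Bagging(RandomForest)": bag_rf,
}
for name, model in models.items():
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: {acc:.2%}")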

5. CONCLUSION:

We applied machine learning algorithms to the Indian Liver Patient dataset to predict, from the enzyme content in their blood, which patients have liver disease at an early stage. We used different classification algorithms, namely Logistic Regression, SVC and Random Forest, and further applied Bagging to Random Forest and AdaBoost to Logistic Regression. Logistic Regression is fast in processing and gave an accuracy of 73.5%; to increase its accuracy we used AdaBoost and obtained an accuracy of 74.35%.

6. REFERENCES:

[1] Bendi Venkata Ramana, M. Surendra Prasad Babu, N. B. Venkateswarlu, "A Critical Study of Selected Classification Algorithms for Liver Disease Diagnosis", International Journal of Database Management Systems (IJDMS), Vol. 3, No. 2, May 2011.

[2] Joel Jacob, Joseph Chakkalakal Mathew, Johns Mathew, Elizabeth Issac, "Diagnosis of Liver Disease Using Machine Learning Techniques", International Research Journal of Engineering and Technology (IRJET), Dept. of Computer Science and Engineering, MACE, Kerala, India, Vol. 5, Issue 4, Apr. 2018.

[3] Jankisharan Pahareeya, Rajan Vohra, Jagdish Makhijani, Sanjay Patsariya, "Liver Patient Classification using Intelligence Techniques", International Journal of Advanced Research in Computer Science and Software Engineering, Vol. 4, Issue 2, February 2014.

[4] V. V. Ramalingam, A. Pandian, R. Ragavendran, "Machine Learning Techniques on Liver Disease - A Survey", Department of Computer Science and Engineering, SRMIST, Kattankulathur, International Journal of Engineering & Technology, 7 (4.19) (2018) 485-495.

[5] Bendi Venkata Ramana, M. Surendra Prasad Babu, "Liver Classification Using Modified Rotation Forest", Dept. of IT, AITAM, Tekkali, A.P., India; Dept. of CS&SE, Andhra University, Visakhapatnam-530 003, A.P., India.

[6] https://towardsdatascience.com/understanding-adaboost-2f94f22d5bfe
