Flood Forecasting
Introduction
With the evolution of the Internet, the ways in which businesses, economies, stock
markets, and even governments function and operate have also evolved significantly,
and so has the way people live. With all of this happening, there has been an
observable rise in the amount of information floating around; there is more of it
than ever before. This outburst of data is relatively new. Until fairly recently,
most data was stored on paper, film, or other analog media; only about one-quarter
of all the world's stored information was digital. With the exponential increase in
data, however, the idea of storing and handling it manually no longer holds any
appeal. The following sections discuss applications and examples of big data.
The first trace of big data was evident way back in 1663. It was during
the bubonic plague that John Graunt dealt with overwhelming amounts of
information during his study of the disease. He was the first person ever to
make use of statistical data analysis. The field of statistics later expanded to data
collection and analysis in the early 1800s. The US Census Bureau estimated that
it would take eight years to handle and process the data collected during the
census program in 1880, which was the first overwhelming collection of raw
data. The Hollerith Tabulating Machine was invented to reduce the calculation
work in the subsequent 1890 census.
After that, data evolved at an unprecedented rate throughout the 20th century.
There were machines that stored information magnetically. Scanning patterns in
messages and computers were also prominent during that time. In 1965, the first
data center was built with the aim to store millions of fingerprint sets and tax
returns.
Everyone knows that customers are the most important asset of any
business. However, even with a solid customer base, it is unwise to disregard the
competition. A business should be aware of what its customers are looking for.
This is where big data comes in. Applying big data allows businesses to identify
and monitor customer-related trends and patterns, which contributes toward
gaining loyalty. The more data that is collected, the more patterns and trends can be
identified. With a proper customer data analytics mechanism in place, critical
behavioral insights can be derived and acted upon to retain the customer base; this is
the most basic step in retaining customers. Big data analytics is strongly behind
customer retention at Coca-Cola: in 2015, Coca-Cola strengthened its data
strategy by building a digital-led loyalty program.
A good example of a brand that uses big data for targeted advertising is
Netflix: its analytics give insights into what interests subscribers the most.
Risk Management
Big data analytics has contributed immensely toward the development of risk
management solutions. Tools allow businesses to quantify and model regular
risks. The rising availability and diversity of statistics have made it possible for
big data analytics to enhance the quality of risk management models, thus
achieving better risk mitigation strategies and decisions. UOB in Singapore uses
big data for risk management. The risk management system allows the bank to
reduce the calculation time of the value at risk.
Big data has become a smart way of creating additional revenue streams
through innovation and product improvement. Organizations first collect as
much data as possible before moving on to designing new product lines and
redesigning existing ones. The design processes have to encompass the
requirements and needs of customers. Various channels are available to help
study these customer needs, and big data analytics helps a business identify the
best ways to capitalize on those needs. Amazon Fresh and Whole Foods are the
perfect examples of how big data can help improve innovation and product
development. Data-driven logistics provides companies with the required
knowledge and information to help achieve greater value.
Supply Chain Management
Big data offers improved clarity, accuracy, and insights to supplier networks.
Through big data analytics, it is possible to achieve contextual intelligence
across supply chains. Suppliers are now able to avoid the constraints and
challenges that they faced earlier. Suppliers incurred huge losses and were prone
to making errors when they were using traditional enterprise and supply chain
management systems. However, approaches based on big data have made it possible
for suppliers to achieve success with higher levels of contextual
intelligence. PepsiCo depends on enormous amounts of data for efficient
supply chain management. The company tries to ensure that it replenishes
retailers' shelves with the appropriate numbers and types of products. Data is used
to reconcile and forecast production and shipment needs.
Structured Data
Any data that can be stored, accessed, and processed in a fixed format is known
as structured data. Businesses can get the most out of this type of data by
performing analysis. Advanced technologies help generate data-driven insights
to make better decisions from structured data.
Unstructured Data
Data that has an unknown structure or form is unstructured data. Processing and
analyzing this type of data for data-driven insights can be difficult and
challenging, because the items fall under many different categories and simply
lumping them together adds no value. A combination of simple text files, images,
videos, etc., is an example of unstructured data.
Semi-structured data
Semi-structured data, as you may have already guessed, contains elements of both
structured and unstructured data. Semi-structured data may seem structured in form,
but it is not defined by a table schema as in a relational DBMS. Log files and
transaction history files produced by web applications are typical examples of
such data.
Every time someone opens an application on a phone, visits a web page, signs
up on an online platform, or even types a query into a search engine, a piece of data
is gathered. So, whenever we turn to our search engines for answers, a lot of data is
created and collected.
As users, however, we are usually more focused on the outcomes of what we are
doing on the web; we do not dwell on what happens behind the scenes. For
example, we might have opened a browser, looked up 'big data', and then followed a
link to read an article about it. That alone contributes to the vast
amount of big data. Now imagine the number of people spending time on the
Internet visiting different web pages, uploading pictures, and so on.
There are some terms associated with big data that help make things clearer.
These are the characteristics of big data, termed volume, velocity, and variety,
giving rise to the popular name '3Vs of big data', which most of us have probably
heard before. But if it feels new to you, do not worry; we are going to discuss
them in detail here. As people understand more and more about this ever-evolving
field, it should not come as a shock that further characteristics have been added to
the list beyond the 3Vs. These are called veracity and value.
Characteristics of Big Data
Volume: Organizations have to constantly scale their storage solutions, since big data requires a large amount of space to be stored.
Velocity: Since big data is generated every second, organisations need to respond in real time to deal with it.
Variety: Big data comes in a variety of forms. It could be structured or unstructured, or in different formats such as text, videos, images, and more.
Veracity: Big data, as large as it is, can contain wrong data too. Uncertainty of data is something organisations have to consider while dealing with big data.
Value: Just collecting and storing big data is of no consequence unless the data is analyzed and a useful output is produced.
It must be clear by now that one cannot talk about big data without acknowledging
the challenges that come with it. So, moving forward, let us address some of those
challenges.
Data growing at such a quick rate makes it a challenge to derive insights from
it. More and more data is generated every second, and the data that is actually
relevant and useful has to be picked out of it for further analysis.
Storage
Security
Large amounts of data in organizations can easily become a target for advanced
persistent threats, so here lies another challenge: organizations must keep their
data secure through proper authentication, data encryption, and similar measures.
Unreliable Data
We can’t deny the fact that big data can’t be 100 percent accurate. It might
contain redundant or incomplete data, along with contradictions.
Miscellaneous Challenges
These are some other challenges that come forward while dealing with big data,
like the integration of data, skill and talent availability, solution expenses, and
processing a large amount of data in time and with accuracy so that the data is
available for data consumers whenever they need it.
Before we go further into the technologies that can help manage
big data, we should first get familiar with a very popular programming
paradigm called MapReduce.
What it does is allow computations on huge data sets to be performed across multiple
systems in parallel.
MapReduce mainly consists of two parts: the Map and the Reduce. It’s kind of
obvious! Anyway, let’s see what these two parts are used for:
Map: It sorts, filters, and then categorizes the data so that it is easy to
analyze.
Reduce: It merges all data together and provides the summary.
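As a rough illustration of the idea (not tied to Hadoop or any particular framework), the following minimal Python sketch performs a word count with explicit map, shuffle, and reduce steps; the function names are purely illustrative.

# A minimal, framework-free sketch of the MapReduce idea in plain Python.
from collections import defaultdict

def map_phase(documents):
    # Map: emit (key, value) pairs -- here, (word, 1) for every word seen.
    for doc in documents:
        for word in doc.split():
            yield word.lower(), 1

def shuffle(pairs):
    # Group all values belonging to the same key together.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reduce: merge the values for each key into a summary (a count).
    return {key: sum(values) for key, values in grouped.items()}

docs = ["big data is big", "data is everywhere"]
print(reduce_phase(shuffle(map_phase(docs))))  # e.g. {'big': 2, 'data': 2, ...}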
These are some of the many technologies used to handle and manage
big data; Hadoop is the most widely used among them.
There are many real-life Big Data applications in various industries. Let’s find
out some of them in brief.
Fraud Detection
Big data helps in risk analysis, management, fraud detection, and abnormal
trading analysis.
Advertising and Marketing
Big data helps advertising agencies understand the patterns of user behavior and
then gather information about consumers’ motivations.
Agriculture
Big data, combined with sensor data, can be used to increase crop efficiency. This
can be done by planting test crops to record and store data about how crops react to
various environmental changes, and then using that data to plan crop plantation
accordingly.
Knowledge of big data is one of the most important skills required for some
of the hottest job profiles, which are in high demand right now. That demand is
unlikely to drop any time soon, because the accumulation of data is only going to
increase over time, increasing the need for talent in this field and opening up
multiple doors of opportunity. Some of the hot job profiles are given below:
Data analysts analyze and interpret data, visualize it, and build reports to
help make better business decisions.
Data scientists mine data by assessing data sources and using algorithms
and machine learning techniques.
Data architects design database systems and tools.
Database managers control database system performance, perform
troubleshooting and upgrade hardware and software.
Big data engineers design, maintain and support big data solutions.
Once we learn about big data and understand its use, we will come to know that
there are many analytics problems we can solve, which were not possible earlier
due to technological limitations. Organizations are now relying more and more
on this cost-effective and robust method for easy data processing and storage.
Identify the problem
Data lifecycles
For example, social media data may be combined with internal customer
data to uncover new insights. This new data could be discarded once it has
served its immediate use, or it could be retained for a period. If an organisation
decides to retain the data, it needs to consider not only the regulatory
implications, but also the technical question of how and where to store that data.
There are no definitive answers to these questions. They will vary depending on
the contents of the data, an organisation’s policies and any legal or regulatory
restrictions. However, these considerations underline the need for robust data
lifecycles when adopting Big Data solutions. In effect, data is created (arrives),
is maintained (exists), and then is deleted (disappears) at some point in the
future. That data needs to be managed in a way that ensures it is kept for the
right length of time and no longer. But Big Data’s potential to find new insights
and generate value from old data prompts another question — shouldn’t
organisations be keeping it forever, as long as they can do so securely, cost-
effectively and legally?
When selecting tools for Big Data analysis, organisations face a number of
considerations:
The chart above has been designed to help readers of this book identify the
key Big Data priorities for their businesses. At the heart of business is the
drive for increased profit, represented here in the centre of the target.
Working outwards, businesses can either increase profit by increasing
revenue or reducing costs. Both of these methods can be achieved through
either direct or indirect actions, but by combining the two we move outwards
towards the appropriate segment. The outer circle shows the various actions
a business can take. The left hemisphere contains revenue-increasing actions,
while the right side contains cost-reducing actions. The diagram also splits
horizontally to show direct actions (in the top half) and indirect actions
(bottom half). From this it is easy to see the possible actions a business can
take to increase revenues or reduce costs, either directly or indirectly. These
are also listed below with examples:
By shading the outer grey segments of the diagram on the previous page
— green (for high value), yellow (for medium value), blue (for low
value) or leaving them blank (for no value) — it would be straightforward
for organisations to identify their business priorities. They can then use
the table on the pages overleaf to identify the actions that need to be taken.
Ensuring success for a Big Data project
In common with any business change initiative, a Big Data project needs to be
business-led and (ideally) executive-sponsored — it will never work as an isolated IT
project. Even more importantly, Big Data is a complex area that requires a wide range
of skills — it spans the whole organisation, and the entire executive team needs to
work together (not just the CIO). In addition, there is a dearth of data scientists,
and it may be necessary to fill gaps with cross-training. The next two chapters of
this book look at the changing role of the executive team and the rise of the data
scientist.
1. Supervised Learning
2. Unsupervised Learning
How it works: In this algorithm, we do not have any target or outcome variable
to predict or estimate. It is used for clustering a population into different groups,
which is widely applied for segmenting customers into groups for specific
interventions. Examples of unsupervised learning: the Apriori algorithm and K-means.
3. Reinforcement Learning:
How it works: Using this algorithm, the machine is trained to make specific
decisions. It works this way: the machine is exposed to an environment where it
trains itself continually using trial and error. This machine learns from past
experience and tries to capture the best possible knowledge to make accurate
business decisions. Example of Reinforcement Learning: Markov Decision
Process
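To make the trial-and-error idea concrete, here is a toy Q-learning sketch in Python (a standard reinforcement learning method, offered only as an illustration, not something taken from the original text): an agent in a five-state corridor learns, purely from exploration and reward, that moving right pays off. All names and numbers are illustrative.

# Toy Q-learning on a 5-state corridor; reward only at the rightmost state.
import random

n_states, actions = 5, [0, 1]           # 0 = move left, 1 = move right
Q = [[0.0, 0.0] for _ in range(n_states)]
alpha, gamma, epsilon = 0.1, 0.9, 0.2   # learning rate, discount, exploration

for episode in range(500):
    state = 0
    while state != n_states - 1:
        # Trial and error: sometimes explore, otherwise exploit what is known so far.
        if random.random() < epsilon:
            action = random.choice(actions)
        else:
            action = Q[state].index(max(Q[state]))
        next_state = max(0, state - 1) if action == 0 else state + 1
        reward = 1.0 if next_state == n_states - 1 else 0.0
        # Update the estimate of the action's long-term value.
        Q[state][action] += alpha * (reward + gamma * max(Q[next_state]) - Q[state][action])
        state = next_state

print(Q)  # moving right should end up with higher values than moving left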
1. Linear Regression
2. Logistic Regression
3. Decision Tree
4. SVM
5. Naive Bayes
6. kNN
7. K-Means
8. Random Forest
9. Dimensionality Reduction Algorithms
10. Gradient Boosting algorithms
1. GBM
2. XGBoost
3. LightGBM
4. CatBoost
1. Linear Regression
It is used to estimate real values (cost of houses, number of calls, total sales, etc.)
based on continuous variable(s). Here, we establish a relationship between the
independent and dependent variables by fitting a best-fit line. This best-fit line is
known as the regression line and is represented by the linear equation Y = a*X + b.
In this equation:
Y – Dependent Variable
a – Slope
X – Independent variable
b – Intercept
The coefficients a and b are derived by minimizing the sum of the squared
distances between the data points and the regression line.
Look at the below example. Here we have identified the best fit line having
linear equation y=0.2811x+13.9. Now using this equation, we can find the
weight, knowing the height of a person.
Here is how you can try your hand at building your own linear
regression model in Python:
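A minimal sketch, assuming scikit-learn is available; the height/weight values below are invented for illustration and will not reproduce the exact coefficients (0.2811 and 13.9) quoted above.

# Fit a simple linear regression (weight ~ height) on made-up data.
import numpy as np
from sklearn.linear_model import LinearRegression

height = np.array([[4.0], [4.5], [5.0], [5.2], [5.4], [5.8], [6.1], [6.2]])  # feet
weight = np.array([42, 44, 49, 55, 53, 58, 60, 64])                          # kg

model = LinearRegression().fit(height, weight)
print("slope a:", model.coef_[0], "intercept b:", model.intercept_)
print("predicted weight for 5.5 ft:", model.predict([[5.5]])[0])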
2. Logistic Regression
Let's say your friend gives you a puzzle to solve. There are only two outcome
scenarios – either you solve it or you don't. Now imagine that you are given a
wide range of puzzles and quizzes in an attempt to understand which subjects you
are good at. The outcome of this study would be something like this: if you are
given a trigonometry-based tenth-grade problem, you are 70% likely to solve it; on
the other hand, if it is a fifth-grade history question, the probability of getting
the answer right is only 30%. This is what Logistic Regression provides you.
Coming to the math, the log odds of the outcome is modeled as a linear
combination of the predictor variables.
Now, you may ask, why take a log? For the sake of simplicity, let's just say that
this is one of the best mathematical ways to replicate a step function. I could go
into more detail, but that would defeat the purpose of this article.
Build your own logistic regression model in Python here and check the
accuracy:
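A minimal sketch, assuming scikit-learn is available; the single "hours of practice" feature and the solved/not-solved labels are invented purely to mirror the solve/don't-solve setting described above.

# Predict the probability of solving a puzzle from one illustrative feature.
import numpy as np
from sklearn.linear_model import LogisticRegression

hours = np.array([[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0]])
solved = np.array([0, 0, 0, 0, 1, 1, 1, 1])        # 1 = puzzle solved

clf = LogisticRegression().fit(hours, solved)
print(clf.predict_proba([[2.2]])[0][1])            # probability of solving
print(clf.score(hours, solved))                    # training accuracy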
There are many different steps that could be tried in order to improve the model:
3. Decision Tree
The best way to understand how a decision tree works is to play Jezzball – a
classic game from Microsoft (image below). Essentially, you have a room with
moving walls and you need to create walls such that the maximum area gets cleared
of balls.
So, every time you split the room with a wall, you are trying to create two
different populations within the same room. Decision trees work in a very similar
fashion, by dividing a population into groups that are as different from each other
as possible. (More: Simplified Version of Decision Tree Algorithms.)
Let’s get our hands dirty and code our own decision tree in Python!
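A minimal sketch, assuming scikit-learn is available; the Iris dataset is used here only because it ships with the library.

# Train and evaluate a shallow decision tree classifier.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

tree = DecisionTreeClassifier(max_depth=3, random_state=0)   # shallow tree, easier to inspect
tree.fit(X_train, y_train)
print("test accuracy:", tree.score(X_test, y_test))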
4. SVM (Support Vector Machine)
For example, if we only had two features, such as the height and hair length of an
individual, we would first plot these two variables in two-dimensional space, where
each point has two co-ordinates (the points lying closest to the separating boundary
are known as support vectors). Now, we will find a line that splits the data between
the two differently classified groups. This will be the line for which the distance
to the closest point in each of the two groups is as large as possible.
In the example shown above, the line which splits the data into two differently
classified groups is the black line, since the two closest points are the farthest
from it. This line is our classifier. Then, depending on which side of the line new
test data lands, that is the class we assign to it.
Think of it as a variation of the Jezzball game described above:
You can draw lines/planes at any angle (rather than just horizontal or
vertical as in the classic game).
The objective of the game is to segregate balls of different colors into
different rooms.
And the balls are not moving.
Try your hand and design an SVM model in Python through this coding
window:
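A minimal sketch, assuming scikit-learn is available; the two features mirror the height/hair-length illustration above, and all numbers are made up.

# Fit a linear SVM on two synthetic features and inspect its support vectors.
import numpy as np
from sklearn.svm import SVC

X = np.array([[5.9, 5], [6.1, 4], [5.8, 6], [5.2, 30], [5.0, 35], [5.4, 28]])  # [height, hair length]
y = np.array([0, 0, 0, 1, 1, 1])                                               # two classes

clf = SVC(kernel="linear").fit(X, y)
print(clf.support_vectors_)            # the points closest to the separating line
print(clf.predict([[5.6, 10]]))        # classify a new point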
5. Naive Bayes
The Naive Bayesian model is easy to build and particularly useful for very large data
sets. Along with its simplicity, Naive Bayes is known to outperform even highly
sophisticated classification methods. Bayes' theorem provides a way of
calculating the posterior probability P(c|x) from P(c), P(x), and P(x|c):
P(c|x) = P(x|c) * P(c) / P(x)
Here, P(c|x) is the posterior probability of class c given predictor x, P(c) is the
prior probability of the class, P(x|c) is the likelihood (the probability of the
predictor given the class), and P(x) is the prior probability of the predictor.
Step 3: Now, use Naive Bayesian equation to calculate the posterior probability
for each class. The class with the highest posterior probability is the outcome of
prediction.
Problem: Players will play if the weather is sunny. Is this statement correct?
Here we have P (Sunny |Yes) = 3/9 = 0.33, P(Sunny) = 5/14 = 0.36, P( Yes)=
9/14 = 0.64
Now, P(Yes | Sunny) = 0.33 * 0.64 / 0.36 = 0.60, which is the higher probability.
Naive Bayes uses a similar method to predict the probability of different classes
based on various attributes. This algorithm is mostly used in text classification
and in problems having multiple classes.
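The arithmetic above can be reproduced in a few lines of Python; this is only the calculation quoted in the text, not a general Naive Bayes implementation.

# P(Yes | Sunny) = P(Sunny | Yes) * P(Yes) / P(Sunny), using the counts from the text.
p_sunny_given_yes = 3 / 9     # about 0.33
p_sunny = 5 / 14              # about 0.36
p_yes = 9 / 14                # about 0.64

p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny
print(round(p_yes_given_sunny, 2))   # about 0.6, so playing when sunny is the likelier outcome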
6. kNN (k-Nearest Neighbours)
kNN can easily be mapped to our real lives. If you want to learn about a person of
whom you have no information, you might find out about their close friends and the
circles they move in, and thereby gain access to information about them!
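A minimal sketch, assuming scikit-learn is available: a new point is labelled by a majority vote among its k closest neighbours, much like the "close friends" analogy; the Iris dataset and the query point are purely illustrative.

# Classify a new sample by looking at its 5 nearest neighbours.
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
print(knn.predict([[5.8, 2.8, 4.5, 1.3]]))   # label chosen by the neighbours' majority vote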
7. K-Means
Remember figuring out shapes from ink blots? K-means is somewhat similar to this
activity. You look at the shape and spread to decipher how many different
clusters/populations are present!
In K-means, we have clusters, and each cluster has its own centroid. The sum of the
squares of the differences between the centroid and the data points within a cluster
constitutes the within-cluster sum of squares for that cluster. When the
within-cluster sums of squares of all the clusters are added together, we get the
total within-cluster sum of squares for the cluster solution.
We know that as the number of clusters increases, this value keeps decreasing,
but if you plot the result you may see that the sum of squared distances decreases
sharply up to some value of k, and then much more slowly after that. Here, we
can find the optimum number of clusters.
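A minimal sketch of this "elbow" idea, assuming scikit-learn is available; the blob data is synthetic, and inertia_ is scikit-learn's name for the total within-cluster sum of squares.

# Print the total within-cluster sum of squares for increasing k and look for the elbow.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))   # drops sharply until k matches the true cluster count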
8. Random Forest
For more details on this algorithm, on how it compares with a single decision tree,
and on tuning its model parameters, I would suggest you read these articles:
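As a quick hands-on reference (not drawn from the articles referred to above), here is a minimal scikit-learn sketch: a random forest is an ensemble of decision trees whose votes are combined, and the dataset here is only illustrative.

# Train a random forest (an ensemble of decision trees) and check its accuracy.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
print("test accuracy:", rf.score(X_test, y_test))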
9. Dimensionality Reduction Algorithms
In the last 4-5 years, there has been an exponential increase in data capture at
every possible stage. Corporations, government agencies, and research
organisations are not only coming up with new data sources but are also capturing
data in great detail.
As data scientists, the data we are offered also consists of many features. This
sounds good for building a robust model, but there is a challenge: how do you
identify the highly significant variable(s) out of 1,000 or 2,000? In such cases,
dimensionality reduction algorithms help us, along with various other approaches
such as decision trees, random forests, PCA, factor analysis, identification based on
the correlation matrix, the missing value ratio, and others.
To know more about these algorithms, you can read "Beginners Guide To Learn
Dimension Reduction Techniques".
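A minimal PCA sketch, assuming scikit-learn is available; the digits dataset (64 features per sample) is used only to show how a handful of components can retain most of the variance.

# Project a 64-feature dataset onto 10 principal components.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)          # 64 features per sample
pca = PCA(n_components=10).fit(X)
print(pca.explained_variance_ratio_.sum())   # share of variance kept by 10 components
X_reduced = pca.transform(X)
print(X_reduced.shape)                       # (n_samples, 10)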
10.1. GBM
GBM is a boosting algorithm used when we deal with plenty of data to make a
prediction with high prediction power. Boosting is actually an ensemble of
learning algorithms which combines the prediction of several base estimators in
order to improve robustness over a single estimator. It combines multiple weak
or average predictors to build a strong predictor. These boosting algorithms
always work well in data science competitions like Kaggle, AV Hackathon,
CrowdAnalytix.
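A minimal sketch with scikit-learn's GradientBoostingClassifier, assuming scikit-learn is available; the dataset and hyper-parameters are illustrative, not tuned.

# Gradient boosting: many shallow trees added sequentially, each correcting its predecessors.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

gbm = GradientBoostingClassifier(n_estimators=200, learning_rate=0.05, max_depth=3)
gbm.fit(X_train, y_train)
print("test accuracy:", gbm.score(X_test, y_test))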
10.2. XGBoost
XGBoost has immensely high predictive power, which makes it a strong choice when
accuracy matters, as it provides both a linear model solver and the tree learning
algorithm; it is also reported to be almost 10x faster than earlier gradient
boosting implementations.
One of the most interesting things about XGBoost is that it is also called a
regularized boosting technique. This helps to reduce overfitting, and it has
massive support for a range of languages such as Scala, Java, R, Python, Julia,
and C++.
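A minimal sketch, assuming the xgboost Python package is installed; it mirrors the GBM example above, with the same illustrative dataset and untuned hyper-parameters.

# XGBoost via its scikit-learn-compatible classifier interface.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

xgb = XGBClassifier(n_estimators=200, learning_rate=0.05, max_depth=3)
xgb.fit(X_train, y_train)
print("test accuracy:", xgb.score(X_test, y_test))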
10.3. LightGBM
LightGBM is a gradient boosting framework that uses tree-based learning
algorithms. It is designed to be distributed and efficient, with the following
advantages:
Since LightGBM is based on decision tree algorithms, it splits the tree leaf-wise
with the best fit, whereas other boosting algorithms split the tree depth-wise
or level-wise rather than leaf-wise. So, when growing on the same leaf, the
leaf-wise algorithm can reduce more loss than the level-wise algorithm and hence
results in much better accuracy, which can rarely be achieved by existing
boosting algorithms.
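A minimal sketch, assuming the lightgbm Python package is installed; num_leaves is the parameter that governs the leaf-wise growth described above, and the values shown are illustrative defaults.

# LightGBM classifier; leaf-wise growth is controlled chiefly by num_leaves.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
import lightgbm as lgb

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = lgb.LGBMClassifier(n_estimators=200, learning_rate=0.05, num_leaves=31)
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))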
3. LITERATURE SURVEY
“Characteristics of the 2011 Chao Phraya River flood in central Thailand”:
A massive flood, the maximum ever recorded in Thailand, struck the Chao
Phraya River in 2011. The total rainfall during the 2011 rainy season was 1,439
mm, which was 143% of the average rainy season rainfall during the period
1982–2002. Although the gigantic Bhumipol and Sirikit dams stored
approximately 10 billion m³ by early October, the total flood volume was
estimated to be 15 billion m³. This flood caused tremendous damage, including
813 dead nationwide, seven industrial estates, and 804 companies with
inundation damage, and total losses estimated at 1.36 trillion baht
(approximately 3.5 trillion yen). The Chao Phraya River watershed has
experienced many floods in the past, and floods on the same scale as the 2011
flood are expected to occur in the future. Therefore, to prepare for the next flood
disaster, it is essential to understand the characteristics of the 2011 Chao Phraya
River Flood. This paper proposes countermeasures for preventing major flood
damage in the future [3]. “The 2011 Great Flood in Thailand: Climate Diagnostics
and Implications from Climate Change”: Severe flooding occurred in Thailand
during the 2011 summer season, which resulted in more than 800 deaths and
affected 13.6 million people. The unprecedented nature of this flood in the Chao
Phraya River basin (CPRB) was examined and compared with historical flood
years. Climate diagnostics were conducted to understand the meteorological
conditions and climate forcing that led to the magnitude and duration of this
flood. Neither the monsoon rainfall nor the tropical cyclone frequency
anomalies alone was sufficient to cause the 2011 flooding event. Instead, a
series of abnormal conditions collectively contributed to the intensity of the
2011 flood: anomalously high rainfall in the premonsoon season, especially
during March; record-high soil moisture content throughout the year; elevated
sea level height in the Gulf of Thailand, which constrained drainage; and other
water management factors. In the context of climate change, the substantially
increased premonsoon rainfall in CPRB after 1980 and the continual sea level
rise in the river outlet have both played a role. The rainfall increase is associated
with a strengthening of the premonsoon northeasterly winds that come from
East Asia. Attribution analysis using phase 5 of the Coupled Model
Intercomparison Project historical experiments pointed to anthropogenic
greenhouse gases as the main external climate forcing leading to the rainfall
increase. Together, these findings suggest increasing odds for potential flooding
of similar intensity to that of the 2011 flood [4]. “Adaptive hydrological flow
field modeling based on water body extraction and surface information”:
Hydrological flow characteristics are among the prime indicators for assessing
floods. They play a major part in determining the drainage capability of the affected
basin and also in the subsequent simulation and rainfall-runoff prediction. Thus
far, flow directions were typically derived from terrain data which for flat
landscapes are obscured by other man-made structures, hence undermining the
practical potential. In the absence (or near-absence) of terrain slopes, water
passages have a more pronounced effect on flow directions than elevations. This
paper, therefore, presents detailed analyses and implementation of hydrological
flow modeling from satellite and topographic images. Herein, gradual
assignment based on support vector machine was applied to modified
normalized difference water index and a digital surface model, in order to
ensure reliable water labeling while suppressing modality-inherited artifacts and
noise. Gradient vector flow was subsequently employed to reconstruct the flow
field. Experiments comparing the proposed scheme with conventional water
boundary delineation and flow reconstruction were presented. Respective
assessments revealed its advantage over the generic stream burning.
Specifically, it could extract water body from studied areas with 98.70%
precision, 99.83% recall, 98.76% accuracy, and 99.26% F-measure. The
correlations between resultant flows and those obtained from the stream burning
HARDWARE REQUIREMENTS:
• System : i3 processor
• Hard Disk : 300 GB
• Floppy Drive : 1.44 MB
• Monitor : 15" VGA colour
• Mouse : Logitech
• RAM : 4 GB
SOFTWARE REQUIREMENTS:
CLASS DIAGRAM
SEQUENCE DIAGRAM
The sequence diagrams are an easy and intuitive way of describing the
system’s behavior, which focuses on the interaction between the system and the
environment. This notational diagram shows the interaction arranged in a time
sequence. The sequence diagram has two dimensions: the vertical dimension
represents the time; the horizontal dimension represents different objects. The
vertical line also called the object’s lifeline represents the object’s existence
during the interaction.
ACTIVITY DIAGRAM
6. CONCLUSION:
This paper proposed a distributed flood forecasting system based on integrating
meteorological, hydrological, geospatial, and crowdsourced data. Big data made
available by prominent organizations were obtained through various cross-platform
APIs. Forecasting was then performed on these data using current ML techniques,
namely decision tree, RF, Naïve Bayes, MLP and RBF ANN, SVM, and fuzzy logic.
Evaluation results on the studied areas indicated that the system could forecast
flood events with notable accuracy; the three best-performing ML methods were MLP
ANN, SVM, and RF, respectively. It was demonstrated empirically that the developed
system could be used to alert the public and government alike not only to a current
flood but also to future ones. The system also enhanced the user experience through
responsive graphical interfaces that are interoperable across different computing
devices, including mobiles. This advantage effectively encouraged greater
contribution of crowdsourced data from the public, enriching data aggregation and
consequently increasing system accuracy and reliability. As such, the developed
system is adaptive, in the sense that as it becomes more "experienced" (i.e.,
through learning), its forecasting becomes more realistic.
In prospect, the system can readily be employed in existing flood management
schemes, e.g., those led by government agencies or non-profit organizations.
Moreover, thanks to its distributed architecture, the system can reach a wider
public and therefore serves as an effective means of communicating with them (and
especially with flood victims) regarding the current status and development of the
disaster. Future improvements of the system include adapting the initial flood
representation and its extent to the current location of the device, so that its
user is immediately made aware of it. Moreover, a flooded area pinned by an icon may
be augmented with color-coded regions, so that the conditions (e.g., levels and
extents) of affected areas can be better comprehended. Some data considered in this
study, such as GLOFAS, were not of intrinsically high spatial resolution; however,
they were accurate. The accumulated precipitations and their probabilities at
different levels, for instance, corresponded to the real event, and their API
service was also reliable. These traits were desirable for the proposed system, and
their resolution shortcomings were remedied by incorporating other, more detailed
layers as well as crowdsourced elements into the ML framework. Possible improvements
on this issue include involving the Internet of Things (IoT) to measure actual
meteorological data with the desired coverage.
7. REFERENCES:
[1] Asian Disaster Reduction Center, “Natural Disaster Data Book,” Japan,
Asian Disaster Reduction Center (ADRC), 2012.
[2] Asian Disaster Reduction Center, “Natural Disaster Data Book,” Japan,
Asian Disaster Reduction Center (ADRC), 2015.
[3] D. Komori, S. Nakamura, M. Kiguchi, A. Nishijima, D. Yamazaki, S.
Suzuki and T. Oki, “Characteristics of the 2011 Chao Phraya River flood in
central Thailand,” Hydrological Research Letters, Vol. 6, pp. 41-46, 2012.
[4] P. Promchote, S. Y. Simon Wang and P. G. Johnson, “The 2011 great flood
in Thailand: Climate diagnostics and Implications from climate change,”
Journal of Climate, Vol. 29, no. 1, pp. 367-379, Jan. 2016.
[5] S. Puttinaovarat, P. Horkaew, K. Khaimook and W. Polnigongit, “Adaptive
hydrological flow field modeling based on water body extraction and surface
information,” Journal of Applied Remote Sensing, Vol. 9, no. 1, pp. 095041,
Jan. 2015.
[6] S. K. Jain, P. Mani, S. K. Jain, P. Prakash, V. P. Singh, D. Tullos and A. P.
Dimri, “A Brief review of flood forecasting techniques and their applications,”
International journal of river basin management, Vol. 16, no. 3, pp. 329-344,
Jan. 2018
[7] N. Belabid, F. Zhao, L. Brocca, Y. Huang and Y. Tan, “Near-real time flood
forecasting based on satellite precipitation products,” Remote Sensing, vol. 11,
no. 3, pp. 252, Jan. 2019.
[8] A. K. Kar, A. K. Lohani, N. K. Goel and G. P. Roy, “Rain gauge network
design for flood forecasting using multi-criteria decision analysis and clustering
techniques in lower Mahanadi river basin, India,” Journal of Hydrology:
Regional Studies, vol. 4, pp. 313-332, Sep. 2015.
[9] M. Dembélé and S. J. Zwart, “Evaluation and comparison of satellite based
rainfall products in Burkina Faso, West Africa,” International Journal of
Remote Sensing, Vol. 37, no. 17, pp. 3995-4014, Jul. 2016.
[10] A. Foehn, J. G. Hernández, B. Schaefli and G. De Cesare, “Spatial
interpolation of precipitation from multiple rain gauge networks and weather
radar data for operational applications in Alpine catchments,” Journal of
hydrology, Vol. 563, pp. 1092-1110, Jul. 2018.
[11] F. Cecinati, A. Moreno-Ródenas, M. Rico-Ramirez, M. C. Veldhuis and J.
Langeveld, “Considering Rain Gauge Uncertainty Using Kriging for Uncertain
Data,” Atmosphere, Vol. 9, no. 11, pp. 446, Nov. 2018.
[12] S. Stisen and M. Tumbo, “Interpolation of daily raingauge data for
hydrological modelling in data sparse regions using pattern information from
satellite data,” Hydrological Sciences Journal, Vol. 60, no. 11, pp. 1911-1926,
Sep. 2015.
[13] Q. Hu, Z. Li, L. Wang, Y. Huang, Y. Wang and L. Li, “Rainfall Spatial
Estimations: A Review from Spatial Interpolation to MultiSource Data
Merging,” Water, Vol. 11, no. 3, pp. 579, May. 2019.
[14] C. Berndt and U. Haberlandt, “Spatial interpolation of climate variables in
Northern Germany—Influence of temporal resolution and network density,”
Journal of Hydrology: Regional Studies, Vol. 15, pp. 184-202, Feb. 2018.
[15] S. Zhang, J. Zhang, Y. Liu and Y. Liu, “A mathematical spatial
interpolation method for the estimation of convective rainfall distribution over
small watersheds,” Environmental Engineering Research, Vol. 21, no. 3, pp.
226-232, Sep. 2016.
[16] A. Mosavi, P. Ozturk and K. W. Chau, “Flood Prediction Using Machine
Learning Models: Literature Review,” Water, Vol. 10, no. 11, pp. 20734441,
Nov. 2018.
[17] H. L. Cloke and F. Pappenberger, “Ensemble flood forecasting: A review,”
Journal of hydrology, Vol. 375, no. 3, pp. 613-626, Sep. 2009.
[18] T. E. Adams, and T. C. Pagano, “Flood Forecasting: A Global
Perspective,” Academic Press, 2016.
[19] Q. Liang, Y. Xing, X. Ming, X. Xia, H. Chen, X. Tong and G. Wang, “An
open-Source Modelling and Data System for near real-time flood,” In proc. 37th
IAHR World Congress, Newcastle University, 2017, pp. 1-10.
[20] L. C. Chang, F. J. Chang, S. N. Yang, I. Kao, Y. Y. Ku, C. L. Kuo and
M. Z. Bin Mat, “Building an Intelligent Hydroinformatics Integration Platform
for Regional Flood Inundation Warning Systems,” Water, Vol. 11, no. 1, pp.
20734441, Jan. 2019.
[21] S. Puttinaovarat, P. Horkaew and K. Khaimook, “Configuring ANN for
Inundation Areas Identification based on Relevant Thematic Layers,” ECTI
Transactions on Computer and Information Technology (ECTI-CIT), Vol. 8, no.
1, pp. 56-66, May. 2014.
[22] G. Tayfur, V. Singh, T. Moramarco and S. Barbetta, “Flood hydrograph
prediction using machine learning methods,” Water, Vol. 10, no. 8, pp. 968, Jul.
2018.
[23] C. Choi, J. Kim, J. Kim, D. Kim, Y. Bae and H. S. Kim, “Development of
heavy rain damage prediction model using machine learning based on big data,”
Advances in Meteorology, Vol.2018, pp. 1-11, May. 2018.
[24] J. A. Pollard, T. Spencer and S. Jude, “Big Data Approaches for coastal
flood risk assessment and emergency response,” Wiley Interdisciplinary
Reviews: Climate Change, Vol. 9, no. 5, pp. e543, Jul. 2018.
[25] N. Tkachenko, R. Procter and S. Jarvis, “Predicting the impact of urban
flooding using open data,” Royal Society open science, Vol. 3, no. 5, pp.
160013, May. 2016.
[26] K. Sene, “Flash floods: forecasting and warning,” Springer Science &
Business Media, 2012.
[27] World Meteorological Organization, “Manual on flood forecasting and
warning,” World Meteorological Organization (WMO), 2011.
[28] X. Chen, L. Zhang, C. J. Gippel, L. Shan, S. Chen and W. Yang,
“Uncertainty of flood forecasting based on radar rainfall data assimilation,”
Advances in Meteorology, Vol. 2016, pp. 1-12, Aug. 2016.
[29] L. Alfieri, M. Berenguer, V. Knechtl, K. Liechti, D. Sempere-Torres and
M. Zappa, “Flash flood forecasting based on rainfall thresholds,” Handbook of
Hydrometeorological Ensemble Forecasting, pp. 1223-1260, 2019.
[30] J. Ye, Y. Shao, and Z. Li, “Flood forecasting based on TIGGE
precipitation ensemble forecast,” Advances in Meteorology, Vol. 2016, pp. 1-9,
Nov. 2016.
[31] M. Santos and M. Fragoso, “Precipitation Thresholds for Triggering
Floods in the Corgo Basin, Portugal,” Water, Vol. 8, no. 9, pp. 376, Aug. 2016.
[32] F. Dottori, M. Kalas, P. Salamon, A. Bianchi, L. Alfieri and L. Feyen, “An
operational procedure for rapid flood risk assessment in Europe,” Natural
Hazards and Earth System Sciences, Vol. 17, no. 7, pp. 1111-1126, Jul. 2017.
[33] C. Li, X. Cheng, N. Li, X. Du, Q. Yu and G. Kan, “A framework for flood
risk analysis and benefit assessment of flood control measures in urban areas,”
International journal of environmental research and public health, Vol. 13, no.
8, pp. 787, Aug. 2016.
[34] D. Cane, S. Ghigo, D. Rabuffetti and M. Milelli, “Real-time flood
forecasting coupling different postprocessing techniques of precipitation
forecast ensembles with a distributed hydrological model, The case study of
may 2008 flood in western Piemonte, Italy,” Natural Hazards and Earth System
Sciences, Vol. 13, no. 2, pp. 211-220, Feb. 2013.
[35] Thailand Meteorological Department. TMD Big Data. Retrieved from
http://www.rnd.tmd.go.th/bigdata.php, 2019.
[36] Globalfloods. GLOFAS. Retrieved from http://www.globalfloods.eu
/accounts/login/?next=/glofas-forecasting/, 2019.
[37] D. D. Konadu and C. Fosu, “Digital elevation models and GIS for
watershed modelling and flood prediction–a case study of Accra Ghana.,” In
proc. Appropriate Technologies for Environmental Protection in the Developing
World Springer, Dordrecht, 2009, pp. 325- 332.
[38] J. García-Pintado, D. C. Mason, S. L. Dance, H. L. Cloke, J. C. Neal, J.
Freer, and P. D. Bates, “Satellite-supported flood forecasting in river networks:
A real case study,” Journal of Hydrology, Vol. 523, pp. 706-724, Mar. 2015.
[39] M. Shafapour Tehrany, F. Shabani, M. Neamah Jebur, H. Hong, W. Chen
and X. Xie, “GIS-based spatial prediction of flood prone areas using standalone
frequency ratio, logistic regression, weight of evidence and their ensemble
techniques. Geomatics,” Natural Hazards and Risk, Vol. 8, no. 2, pp. 1538-
1561, Aug. 2017.
[40] S. Ramly, W. Tahir and S. N. H. S. Yahya, “Enhanced Flood Forecasting
Based on Land-Use Change Model and Radar-Based Quantitative Precipitation
Estimation,” In proc. ISFRAM 2014, Springer, Singapore, 2015, pp. 305-317.
[41] R. E. Emerton, E. M. Stephens, F. Pappenberger, T. C. Pagano, A. H.
Weerts, A. W. Wood and C. A. Baugh, “Continental and global scale flood
forecasting systems,” Wiley Interdisciplinary Reviews: Water, Vol. 3, no. 3, pp.
391-418, Feb. 2016.
[42] Google, Google Map API. Retrieved from
https://developers.google.com/maps/documentation/javascript/tutorial, 2019.
[43] A. Pejic, S. Pletl and B. Pejic, “An expert system for tourists using Google
Maps API,” In proc. 2009 7th International Symposium on Intelligent Systems
and Informatics, IEEE, 2009, pp. 317-322.
[44] Y. Wang, G. Huynh and C. Williamson, “Integration of Google
Maps/Earth with microscale meteorology models and data visualization,”
Computers & Geosciences, Vol. 61, pp. 23-31, Dec. 2013.
[45] N. Mount, G. Harvey, P. Aplin and G. Priestnall, “Representing,
modeling, and visualizing the natural environment,” CRC Press, 2008.
[46] S. Nedkov and S. Zlatanova, “Google maps for crowdsourced emergency
routing,” In proc. XXII ISPRS Congress, Commission IV, Melbourne,
International Society for Photogrammetry and Remote Sensing (ISPRS), 2012,
pp. IAPRS XXXIX-B4.
[47] J. P. De Albuquerque, M. Eckle, B. Herfort and A. Zipf, “Crowdsourcing
geographic information for disaster management and improving urban resilience:
an overview of recent developments and lessons learned,” European Handbook
of Crowdsourced Geographic Information, 2016, pp. 309.
[48] J. Fohringer, D. Dransch, H. Kreibich and K. Schröter, “Social media as an
information source for rapid flood inundation mapping,” Natural Hazards and
Earth System Sciences, Vol. 15, no. 12, pp. 2725- 2738, Jul. 2015.
[49] G. Schimak, D. Havlik and J. Pielorz, “Crowdsourcing in crisis and
disaster management–challenges and considerations,” In proc. International
Symposium on Environmental Software Systems, Springer International
Publishing, 2015, pp. 56-70.
[50] T. Simon, A. Goldberg, and B. Adini, “Socializing in emergencies—A
review of the use of social media in emergency situations,” International Journal
of Information Management, Vol. 35, no. 5, pp. 609-619, Oct. 2015.
[51] F. E. A. Horita, L. C. Degrossi, L. F. G. de Assis, A. Zipf and J. P. De
Albuquerque, “The use of volunteered geographic information (VGI) and
crowdsourcing in disaster management: a systematic literature review,” In proc.
the Nineteenth Americas Conference on Information Systems, 2013, pp. 1-10.
[52] M. Zook, M. Graham, T. Shelton and S. Gorman, “Volunteered geographic
information and crowdsourcing disaster relief: a case study of the Haitian
earthquake,” World Medical & Health Policy, Vol. 2, no. 2, pp. 7-33, 2010.
[53] L. See, P. Mooney, G. Foody, L. Bastin, A. Comber, J. Estima and
H. Y. Liu, “Crowdsourcing, citizen science or volunteered geographic
information? The current state of crowdsourced geographic information,”
ISPRS International Journal of Geo-Information, Vol. 5, no. 5, pp. 55, May.
2016.
[54] P.-S. Yu, T.-C. Yang, S.-Y. Chen, C.-M. Kuo and H.-W. Tseng,
“Comparison of random forests and support vector machine for realtime radar-
derived rainfall forecasting,” J. Hydrol. Vol. 552, pp. 92– 104, Sep. 2017.
[55] M. Dehghani, B. Saghafian, F. Nasiri Saleh, A. Farokhnia and R. Noori,
“Uncertainty analysis of streamflow drought forecast using artificial neural
networks and Monte-Carlo simulation,” Int. J. Climatol, Vol. 34, pp. 1169–
1180, Aug. 2014.
[56] A.K. Lohani, N. Goel and K. Bhatia, “Improving real time flood
forecasting using fuzzy inference system,” J. Hydrol, Vol. 509, pp. 25– 41, Feb.
2014.
[57] M. Azam, H. San Kim and S. J. Maeng, "Development of flood alert
application in Mushim stream watershed Korea," International journal of
disaster risk reduction, Vol. 21, pp. 11-26, 2017.
[58] J. Xiaoming, H. Xiaoyan, D. Liuqian, L. Jiren, L. Hui, C. Fuxin and R.
Minglei, "Real-Time Flood Forecasting and Regulation System of Poyanghu
Lake Basin in China," EPiC Series in Engineering, Vol. 3, pp. 2368-2374, 2018.
[59] N. A. Sulaiman, N. F. Ab Aziz, N. M. Tarmizi, A. M. Samad and W. Z. W.
Jaafar, “Integration of geographic information system (GIS) and
hydraulic modelling to simulate floodplain inundation level for bandar
segamat," In 2014 IEEE 5th Control and System Graduate Research
Colloquium, pp. 114-119, IEEE, 2014.
[60] S. Ghosh, S. Karmakar, A. Saha, M. P. Mohanty, S. Ali, S. K. Raju and P.
L. N. Murty, "Development of India's first integrated expert urban flood
forecasting system for Chennai," CURRENT SCIENCE, 117(5), 741-745, 2019.
[61] Q. Ran, W. Fu, Y. Liu, T. Li, K. Shi and B. Sivakumar, "Evaluation of
Quantitative Precipitation Predictions by ECMWF, CMA, and UKMO for
Flood Forecasting: Application to Two Basins in China," Natural Hazards
Review, Vol. 19, No. 2, pp. 05018003, 2018.
[62] E. Mayoraz, and E. Alpaydin, "Support vector machines for multiclass
classification," In International Work-Conference on Artificial Neural
Networks, pp. 833-842, Springer, Berlin, Heidelberg, 1999.
[63] E. R. Davies, "Computer vision: principles, algorithms, applications,
learning," Academic Press, 2017