Water Quality Analysis Report

The Water Quality Analysis project report outlines the development of a machine learning-based web application designed to analyze and classify water quality using various physicochemical parameters. The application, built on Streamlit, allows users to upload datasets, visualize data, and apply classification algorithms to assess water safety. The project aims to enhance water resource management and pollution control through real-time monitoring and user-friendly data interpretation tools.

Uploaded by

Dani Jojo
Water Quality Analysis

A PROJECT REPORT

Submitted By

SURESH M: 950421104053

ATHISAYA MICHEAL PARALOGA RAJ: 950421104011

In partial fulfillment for the award of the degree of

BACHELOR OF TECHNOLOGY

In

INFORMATION TECHNOLOGY

DR. G. U. POPE COLLEGE OF ENGINEERING

SAWYERPURAM-628251
ANNA UNIVERSITY: CHENNAI 600025
MAY 2025

BONAFIDE CERTIFICATE

Certified that this project report “Water Quality Analysis” is the bonafide work of
“M SURESH (950421104053), ATHISAYA MICHEAL PARALOGA RAJ (950421104011)” who carried out
the project work under my supervision. Certified further that to the best of
my knowledge the work reported herein does not form part of any other
thesis or dissertation on the basis of which a degree or award was conferred
on an earlier occasion on this or any other candidate.

SIGNATURE SIGNATURE
DR. T. JASPERLINE
MR. E. STEPHEN JOSEPH
SUPERVISOR HEAD OF THE DEPARTMENT
Professor Professor & Head
Dept. of Computer Science Dept. of Computer Science and Engineering,
and Engineering,
Dr. G. U. Pope College of Engineering,
Tuticorin-628251. Dr. G. U. Pope College of Engineering,
Tuticorin-628251.

Submitted for Semester Mini-Project viva-voce examination held on

INTERNAL EXAMINER EXTERNAL EXAMINER

1. INTRODUCTION

1.1 ABOUT ORGANIZATION

The Water Quality Regression project aims to develop a machine learning-powered web
application using Streamlit to analyze and classify water quality based on various
physicochemical parameters. This project provides an interactive platform where users can
visualize water quality data through graphs and statistical insights, enabling better understanding
and decision-making. By implementing classification algorithms, the system predicts water
quality categories, helping to determine whether the water is safe for consumption or requires
treatment. Additionally, the application offers statistical analysis, including mean, median,
standard deviation, and correlation insights, to enhance data interpretation.

1.2 PROBLEM DEFINITION

The Water Quality Regression project is a machine learning-based system designed to
analyze and classify water quality based on various physicochemical parameters. The project
utilizes Streamlit to create an interactive web application where users can upload datasets,
visualize data through various plots, and apply classification models to determine water quality
levels. The system leverages statistical analysis and machine learning techniques to provide
accurate and insightful assessments, aiding researchers, environmentalists, and decision-makers
in evaluating water safety.

1.3 PROJECT OBJECTIVE

Water quality is a critical factor affecting public health and the environment. This
project aims to develop a web-based Water Quality Classification system that enables users to
analyze water quality data efficiently. The application provides data visualization tools such as
histograms, box plots, correlation heatmaps, and scatter plots to help users understand trends and
patterns in water quality parameters. Additionally, machine learning models, including
classification algorithms, predict water quality categories based on input data, assisting in
determining whether the water is potable or contaminated. The project integrates statistical
calculations to enhance data insights, making it a valuable tool for water resource management
and pollution control. By providing an easy-to-use interface through Streamlit, the project
ensures accessibility and promotes data-driven decision-making in water quality assessment.
1.4 PROJECT OVERVIEW

The system comprises six modules, each designed to enhance functionality and user experience:

1. Data Upload and Preprocessing: Users upload CSV/Excel files, which are validated
and cleaned (removing NaN values, imputing missing data). The module generates
quality reports detailing dataset shape, missing values, and column types, ensuring robust
data preparation.

2. Data Visualization: This module creates interactive plots, including correlation
heatmaps to reveal parameter relationships, distribution plots for individual metrics, and
time-series charts for temporal trends, using Plotly and Seaborn for dynamic exploration.

3. Model Training and Evaluation: Supports Linear Regression, Ridge, Lasso, Random
Forest, and SVR models with hyperparameter tuning (e.g., number of trees). It evaluates
performance using R², RMSE, and MAE, visualizing feature importance and actual vs.
predicted values.

4. Geospatial Mapping: Utilizes Folium to display water quality data on interactive maps,
requiring latitude and longitude inputs, enabling spatial analysis of environmental trends.

5. Real-Time IoT Integration: Fetches live data from IoT sensor APIs, displaying metrics
like pH and TDS in the sidebar for real-time monitoring.

6. Export Functionality: Allows downloading trained models (pickle), predictions (CSV),
and visualizations (HTML), facilitating further analysis and sharing.
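
The preprocessing step in module 1 can be sketched as follows. This is a minimal illustration that assumes pandas; the function names quality_report and clean, the median imputation strategy, and the sample columns are assumptions made for the example and are not taken from the project's code.

```python
import io

import pandas as pd


def quality_report(df: pd.DataFrame) -> dict:
    """Summarise dataset shape, missing values per column, and column types."""
    return {
        "shape": df.shape,
        "missing": df.isna().sum().to_dict(),
        "dtypes": df.dtypes.astype(str).to_dict(),
    }


def clean(df: pd.DataFrame, strategy: str = "drop") -> pd.DataFrame:
    """Drop rows containing NaN, or impute numeric columns with the column median."""
    if strategy == "drop":
        return df.dropna().reset_index(drop=True)
    if strategy == "median":
        numeric = df.select_dtypes(include="number").columns
        out = df.copy()
        out[numeric] = out[numeric].fillna(out[numeric].median())
        return out
    raise ValueError(f"unknown strategy: {strategy}")


# Hypothetical sample standing in for an uploaded CSV file.
sample_csv = "pH,TDS,Turbidity\n7.1,310,2.4\n,295,3.1\n6.8,,1.9\n"
df = pd.read_csv(io.StringIO(sample_csv))

report = quality_report(df)    # shape, missing counts, dtypes
dropped = clean(df, "drop")    # keeps only complete rows
imputed = clean(df, "median")  # fills gaps with column medians
```

Dropping rows is the safer default for small gaps, while imputation preserves more rows when missing values are spread across many records.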

2. SYSTEM SPECIFICATION

2.1 HARDWARE SPECIFICATION

To run the Water Quality Classification system efficiently, the following hardware specifications
are recommended:

• Processor: Intel Core i5 or higher

• RAM: 8 GB (16 GB recommended for large datasets)

• Storage: At least 2 GB of free space for data storage and model execution

2.2 SOFTWARE SPECIFICATION

The application is compatible with multiple operating systems, including Windows, macOS, and
Linux. The core software components include:

• Backend: Python, Streamlit

• Frontend: Streamlit for interactive UI

• Libraries: Scikit-learn, Pandas, NumPy, Matplotlib, Seaborn, Plotly, Folium

2.3 SOFTWARE DESCRIPTION

Streamlit

Streamlit is a lightweight and user-friendly Python framework for building interactive web
applications. It is used as the frontend for the Water Quality Classification system, allowing
users to upload data files, visualize results, and interpret water quality classifications in real
time.

• Real-time Visualizations: The application dynamically updates graphs and charts based
on user inputs, making it easy to understand trends in water quality data.

• CSV Upload & Processing: Users can upload water quality datasets, which are
automatically processed by the backend to generate classification results.

Scikit-learn

Scikit-learn is used to implement machine learning algorithms such as Random Forest, Decision
Tree, SVM, and Logistic Regression for water quality classification.

• Feature Engineering: The system extracts meaningful features from water quality
parameters to improve model accuracy.

• Model Training & Prediction: The trained models classify water quality into different
categories (e.g., safe, moderate, polluted) based on input parameters.

Matplotlib & Seaborn

These libraries are used for data visualization, allowing users to analyze trends and relationships
between water quality parameters. Graphs such as scatter plots, heatmaps, and histograms help
in understanding the impact of each parameter on water quality.

Pandas & NumPy

These libraries are used for data preprocessing, handling missing values, and performing
numerical computations required for model training.

3. SYSTEM STUDY

3.1 EXISTING SYSTEM WITH LIMITATIONS

In the current scenario, water quality assessment is often performed manually using
laboratory testing, which is time-consuming, expensive, and requires skilled professionals.
Traditional methods involve collecting water samples, analyzing them for chemical, physical,
and biological parameters, and interpreting the results based on predefined standards.

These processes lack real-time monitoring capabilities, leading to delays in detecting
contamination. Furthermore, existing water quality classification systems often do not leverage
advanced machine learning techniques, resulting in lower prediction accuracy. The lack of an
interactive and user-friendly interface also makes it difficult for non-experts to understand water
quality reports.

The existing systems for water quality analysis predominantly rely on manual laboratory
testing, standalone software tools, and fragmented data processing workflows, which present
significant challenges in efficiency, scalability, and accessibility. Laboratory-based methods
involve collecting water samples and conducting chemical tests for parameters like pH, TDS,
and turbidity, a process that is labor-intensive, time-consuming, and prone to human error.

Data analysis is often performed using tools like Microsoft Excel, R, or MATLAB, which
require users to manually preprocess datasets, apply statistical models, and generate
visualizations. These tools lack integration, forcing users to switch between multiple platforms
for data cleaning, modeling, and mapping, leading to inefficiencies and data inconsistencies.

Geospatial analysis, when required, is typically handled by separate GIS software like
ArcGIS or QGIS, which demands specialized expertise and additional licensing costs. Real-time
monitoring is rarely supported, as most systems do not integrate with IoT devices, limiting their
ability to provide live insights into water quality trends.

3.2 ADVANTAGES OF PROPOSED METHODOLOGY

The proposed Water Quality Classification system overcomes the limitations of the
existing methods by utilizing machine learning algorithms for accurate and automated
classification of water quality. The system is designed to analyze various water quality
parameters, such as pH, turbidity, dissolved oxygen (DO), biochemical oxygen demand (BOD),
total dissolved solids (TDS), and conductivity, to classify water into different quality categories.

It provides real-time data visualization through an intuitive Streamlit-based web
application, making it accessible to a wider audience. The machine learning models are trained
on extensive datasets, improving the precision of classification. Additionally, the system allows
users to upload CSV files containing water quality data for bulk processing, enabling large-scale
analysis.

The proposed Water Quality Regression System addresses the shortcomings of existing
systems by offering a unified, automated, and scalable web-based platform built on Streamlit
and Python. Unlike manual laboratory methods, the system automates data ingestion and
preprocessing, allowing users to upload CSV or Excel files and instantly clean datasets by
removing NaN values, imputing missing data, and scaling features.

It integrates multiple regression models (Linear Regression, Ridge, Lasso, Random
Forest, and SVR) within a single interface, eliminating the need for separate tools like R or
MATLAB. The system’s interactive visualizations, powered by Plotly and Seaborn, include
correlation heatmaps, distribution plots, and time-series charts, enabling users to explore data
trends intuitively without requiring advanced technical skills.
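
One way to expose the five regression models behind a single interface is a small registry keyed by display name. The sketch below assumes scikit-learn; the report names a get_model function in its unit tests, but the body shown here, the MODEL_REGISTRY name, and the example hyperparameters are assumptions for illustration only.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.svm import SVR

# Display name -> estimator class (registry name is hypothetical).
MODEL_REGISTRY = {
    "Linear Regression": LinearRegression,
    "Ridge": Ridge,
    "Lasso": Lasso,
    "Random Forest": RandomForestRegressor,
    "SVR": SVR,
}


def get_model(name: str, **params):
    """Instantiate the requested regressor with user-chosen hyperparameters."""
    try:
        model_cls = MODEL_REGISTRY[name]
    except KeyError:
        raise ValueError(f"unsupported model: {name!r}") from None
    return model_cls(**params)


# Example: a Random Forest with a user-selected number of trees.
model = get_model("Random Forest", n_estimators=50, random_state=0)
```

Keeping the mapping in one dictionary means a new model can be offered in the interface by adding a single entry.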

Geospatial mapping, facilitated by Folium, is seamlessly integrated, allowing users to
visualize water quality metrics on interactive maps without relying on external GIS software.
Real-time IoT integration via APIs provides live monitoring of parameters like pH and TDS, a
feature absent in most existing systems, ensuring timely insights for environmental monitoring.

4. SYSTEM DESIGN

4.1 SYSTEM FLOW DIAGRAM

The system flow diagram illustrates the system’s workflow at two levels:

• System Flow Diagram: Shows interactions between external entities (User, IoT Device)
and the system. Users provide datasets and API URLs, receiving reports, models,
visualizations, and maps. IoT Devices supply real-time data, displayed in the UI.

• Break Down Process: Breaks the system down into the following processes:

o Data Upload: User uploads CSV/Excel files to the system.

o Preprocessing: Cleans and validates data, storing it in memory.

o Model Training: Trains regression models on processed data.

o Visualization: Generates plots (heatmaps, distributions, time-series).

o Geospatial Mapping: Renders maps using latitude/longitude.

o IoT Integration: Fetches and displays live data.

o Export: Saves outputs (models, predictions, visualizations).

4.2 INPUT DESIGN

The input design focuses on how water quality data is collected from users. The inputs are
structured to ensure accuracy, reliability, and ease of use.

Water Quality Data Inputs

Users can upload a CSV file containing key water quality parameters. The input fields include:

• pH Level – Measures the acidity or alkalinity of the water.
• Turbidity – Indicates water clarity.
• Total Dissolved Solids (TDS) – Represents dissolved substances in water.
• Conductivity – Measures the ability of water to conduct electricity.
• Dissolved Oxygen (DO) – Determines oxygen levels for aquatic life.
• Biochemical Oxygen Demand (BOD) – Indicates organic pollution levels.
• Chemical Oxygen Demand (COD) – Measures the amount of organic and inorganic
compounds in water.

Each input undergoes validation checks to ensure:

• Numeric values are within the expected range.
• No missing or null values.
• CSV format integrity (correct column names and data structure).
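
The structural checks above (expected columns present, no empty cells, numeric values) could be sketched with the standard library as shown below. The column headers in REQUIRED_COLUMNS and the function name check_csv_structure are assumptions; the report lists the parameters but not the exact header names used in the uploaded files.

```python
import csv
import io

# Assumed header names for the seven input parameters.
REQUIRED_COLUMNS = ["pH", "Turbidity", "TDS", "Conductivity", "DO", "BOD", "COD"]


def check_csv_structure(text: str) -> list[str]:
    """Return a list of problems found in the CSV text; an empty list means OK."""
    reader = csv.DictReader(io.StringIO(text))
    header = reader.fieldnames or []
    problems = [f"missing column: {c}" for c in REQUIRED_COLUMNS if c not in header]
    if problems:
        return problems  # no point checking rows without the right columns
    for line_no, row in enumerate(reader, start=2):  # the header is line 1
        for col in REQUIRED_COLUMNS:
            value = (row[col] or "").strip()
            if value == "":
                problems.append(f"line {line_no}: {col} is empty")
                continue
            try:
                float(value)
            except ValueError:
                problems.append(f"line {line_no}: {col} is not numeric ({value!r})")
    return problems
```

Returning every problem at once, rather than stopping at the first, lets the interface show the user a complete list of corrections to make before re-uploading.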

Data Upload Mechanism

• Upload Button – Users can upload a CSV file for batch processing.

• Supported Formats – The system accepts .csv files only.

• Real-time Data Entry – Users can manually input values through a form if needed.

This input design ensures that the system collects high-quality, structured data, which is crucial
for accurate classification and analysis.

4.3 OUTPUT DESIGN

The output design defines how the results of water quality classification are presented to the
user. The system generates outputs after processing the uploaded water quality data and running
it through the machine learning model.

Water Quality Classification Output

Once the model analyzes the input parameters, the system presents the classification results in an
easy-to-understand format:

• Predicted Water Quality Class – The classification label (e.g., Safe, Heavily Polluted).
• Confidence Score – A numerical probability indicating the model’s certainty in its classification.
• Visual Indicators – A color-coded indicator (Green, Yellow, Red) for an intuitive understanding
of the water quality.
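
As an illustration of how the three output elements fit together, the sketch below takes per-class probabilities and derives the label, confidence score, and color. The pairing of classes to colors, the class names, and the function name summarise_prediction are assumptions; the report names the colors and example labels but does not state how they map to each other.

```python
# Assumed pairing of class labels to indicator colors.
CLASS_COLORS = {"Safe": "Green", "Moderate": "Yellow", "Polluted": "Red"}


def summarise_prediction(probabilities: dict[str, float]) -> dict:
    """Pick the most probable class and attach its confidence and color."""
    label = max(probabilities, key=probabilities.get)
    return {
        "label": label,
        "confidence": probabilities[label],
        "color": CLASS_COLORS.get(label, "Grey"),  # Grey for unknown labels
    }


# Example with hypothetical class probabilities from a classifier.
result = summarise_prediction({"Safe": 0.1, "Moderate": 0.2, "Polluted": 0.7})
```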

Additional Insights and Recommendations

To help users interpret the classification results, the system may also provide:

• Parameter Analysis – A breakdown of how each input (e.g., pH, TDS, DO) influences the
classification.
• Health & Environmental Impact – Information on the effects of water quality levels on human
health and ecosystems.
• Suggestions for Improvement – Possible corrective actions, such as filtration
methods, chemical treatments, or regulatory guidelines for maintaining safe water quality.

Data Visualization

The outputs are presented using interactive visualizations, such as:

• Graphs & Charts – Displaying trends in water quality over time.
• Comparative Analysis – Allowing users to compare different water samples.

The outputs are formatted using Streamlit's interactive components for a clean, responsive, and
user-friendly interface.

4.4 DATASET DESIGN

While the initial prototype of the Water Quality Classification system may not require a
persistent database, a well-structured database design will be useful for future enhancements,
enabling data storage, historical analysis, and improved model performance.

The database can be designed to store the following key entities:

• Water Quality Data: Stores attributes related to water samples, including pH level,
turbidity, total dissolved solids (TDS), conductivity, dissolved oxygen (DO), biochemical
oxygen demand (BOD), chemical oxygen demand (COD), and timestamps.

• Classification Results: Stores predictions from the Water Quality Classification Model,
including the predicted water quality category, confidence score, and associated input
data.

• User Data (Optional for Authentication): If user authentication is implemented, this table
will store user details such as username, email, and uploaded water quality records.

Relationships

• The User Data table (if included) has a one-to-many relationship with the Water Quality
Data table, meaning a user can submit multiple water quality records.

• The Water Quality Data table is linked to the Classification Results table, allowing easy
retrieval of past predictions based on specific input conditions.

• Entities and Attributes:

o Dataset: dataset_id (PK), filename (varchar), columns (text), rows (int), target
(varchar), features (text).

o Model: model_id (PK), name (varchar, e.g., "RandomForest"), parameters (text,
JSON), metrics (text, JSON: R², RMSE, MAE), dataset_id (FK).

o Visualization: viz_id (PK), type (varchar, e.g., "heatmap"), parameters (text),
output_file (varchar), dataset_id (FK).

o Geospatial Data: geo_id (PK), latitude (float), longitude (float), quality_params
(text, JSON), dataset_id (FK).

• Relationships:

o Dataset to Model: One-to-many (one dataset trains multiple models).

o Dataset to Visualization: One-to-many (one dataset generates multiple plots).

o Dataset to Geospatial Data: One-to-one (one dataset links to geospatial
coordinates).

• Importance:

o Ensures structured data organization.

o Reduces redundancy via normalization.

o Clarifies relationships for developers.
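
The entities and relationships listed above can be expressed as a relational schema. The sketch below uses Python's built-in sqlite3 module with an in-memory database; the choice of SQLite, the table names, and the sample rows are assumptions, since the report does not name a database engine. The one-to-one link is enforced with a UNIQUE foreign key.

```python
import json
import sqlite3

SCHEMA = """
CREATE TABLE dataset (
    dataset_id INTEGER PRIMARY KEY,
    filename   VARCHAR NOT NULL,
    "columns"  TEXT,
    "rows"     INTEGER,
    target     VARCHAR,
    features   TEXT
);
CREATE TABLE model (
    model_id   INTEGER PRIMARY KEY,
    name       VARCHAR NOT NULL,
    parameters TEXT,   -- JSON string
    metrics    TEXT,   -- JSON string: R2, RMSE, MAE
    dataset_id INTEGER NOT NULL REFERENCES dataset(dataset_id)
);
CREATE TABLE visualization (
    viz_id      INTEGER PRIMARY KEY,
    type        VARCHAR NOT NULL,
    parameters  TEXT,
    output_file VARCHAR,
    dataset_id  INTEGER NOT NULL REFERENCES dataset(dataset_id)
);
CREATE TABLE geospatial_data (
    geo_id         INTEGER PRIMARY KEY,
    latitude       FLOAT,
    longitude      FLOAT,
    quality_params TEXT,   -- JSON string
    dataset_id     INTEGER NOT NULL UNIQUE REFERENCES dataset(dataset_id)
);
"""

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite ignores FKs unless enabled
conn.executescript(SCHEMA)

# One dataset trains several models (one-to-many). Sample values are made up.
cur = conn.execute(
    'INSERT INTO dataset (filename, "rows", target) VALUES (?, ?, ?)',
    ("river_samples.csv", 120, "pH"),
)
dataset_id = cur.lastrowid
conn.executemany(
    "INSERT INTO model (name, metrics, dataset_id) VALUES (?, ?, ?)",
    [
        ("RandomForest", json.dumps({"R2": 0.9}), dataset_id),
        ("Ridge", json.dumps({"R2": 0.8}), dataset_id),
    ],
)
```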

5. SYSTEM TESTING

System testing and implementation are crucial phases in the development of the Water
Quality Regression system, ensuring that the model provides accurate predictions and the
application functions reliably. This section outlines the testing strategies used to validate the
system, including unit testing, integration testing, validation testing, and system-wide testing.
Additionally, the implementation process is described, focusing on the transition from
development to deployment.

5.1 TESTING METHODOLOGIES

The testing strategy employs a comprehensive approach to validate the system’s functionality,
performance, and security across various levels. Testing involves individual components of the
system to verify that each part functions correctly.

Water Quality Prediction Model:

• The regression model was tested using various water quality parameters such as pH,
Dissolved Oxygen (DO), Chemical Oxygen Demand (COD), Biochemical Oxygen
Demand (BOD), Total Dissolved Solids (TDS), and Turbidity.

• Different test cases were designed to verify that the model returns reasonable predictions
for water quality index (WQI) values.

• Edge cases, such as extreme values or missing data, were tested to ensure error handling
mechanisms function correctly.

Data Preprocessing Module:

• Tested for proper handling of missing values, scaling of numerical features, and feature
selection.

• Verified that input data is standardized before being passed to the regression model.
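
The standardization check can be illustrated with a short sketch of a prepare_data step. It assumes scikit-learn's StandardScaler and an 80-20 train-test split, both of which the report mentions in its unit tests; the function body, the random seed, and the synthetic data are assumptions. The scaler is fitted on the training split only, so no information from the test split leaks into training.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def prepare_data(X, y, test_size=0.2, random_state=42):
    """Split into train/test sets, then standardize features using training statistics."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std on train only
    X_test_scaled = scaler.transform(X_test)        # reuse the same mean/std
    return X_train_scaled, X_test_scaled, y_train, y_test, scaler


# Synthetic stand-in for two features (for example pH and TDS) and a target.
rng = np.random.default_rng(0)
X = rng.normal(loc=[7.0, 300.0], scale=[0.5, 40.0], size=(50, 2))
y = rng.normal(size=50)

X_train, X_test, y_train, y_test, scaler = prepare_data(X, y)
```

After this step each training feature has mean 0 and standard deviation 1, which is the property the test described above would verify.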

Frontend Input Validation:

• Ensured that user inputs are validated (e.g., pH should be between 0 and 14, TDS should
be non-negative).

• Implemented error messages and restrictions to prevent invalid inputs from being
submitted.
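
A range check of this kind might look like the sketch below. The pH and TDS limits come from the text above; treating Turbidity as non-negative, the LIMITS table, and the function name validate_inputs are assumptions added for illustration.

```python
import math

# (low, high) bounds per parameter. Only the pH and TDS rules are stated in
# the report; the Turbidity bound is an assumption.
LIMITS = {
    "pH": (0.0, 14.0),
    "TDS": (0.0, math.inf),
    "Turbidity": (0.0, math.inf),
}


def validate_inputs(values: dict) -> list[str]:
    """Return user-facing error messages; an empty list means the input is valid."""
    errors = []
    for name, (low, high) in LIMITS.items():
        value = values.get(name)
        if value is None or math.isnan(value):
            errors.append(f"{name} is required")
        elif not low <= value <= high:
            if math.isinf(high):
                errors.append(f"{name} must be non-negative")
            else:
                errors.append(f"{name} must be between {low:g} and {high:g}")
    return errors
```

For example, an input with a pH of 15.2 and a TDS of -5 would produce two messages, one per offending field, which the form can display next to the inputs.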

5.2 UNIT TESTING

Unit testing targets individual functions and methods within the system to confirm they perform
as expected under various conditions. Key functions tested include:

• load_and_preprocess_data: Validates file loading (CSV/Excel), handling of missing
values (NaN removal, imputation), and data type consistency. Test cases include valid
files, corrupted files, empty files, and files with excessive missing data.

• prepare_data: Tests feature scaling (StandardScaler), train-test splitting (80-20 ratio), and
target/feature selection. Edge cases include datasets with single columns, non-numeric
data, or mismatched feature-target pairs.

• calculate_metrics: Verifies computation of regression metrics (R², RMSE, MAE) against
known outputs. Tests include perfect predictions, random predictions, and edge cases
with zero variance.

• get_model: Ensures correct initialization of regression models (Linear Regression,
Random Forest, etc.) with specified hyperparameters. Tests cover invalid parameters and
model compatibility.

• generate_plot: Validates Plotly/Seaborn visualization outputs (heatmaps, time-series) for
correct data mapping and rendering. Edge cases include empty datasets or invalid plot
types.

• fetch_iot_data: Tests API calls for IoT data retrieval, parsing, and error handling for
invalid URLs, timeouts, or malformed JSON responses.
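
To make the calculate_metrics test cases concrete, the sketch below computes the three metrics from their standard definitions in plain Python. The function body is an assumption, not the project's implementation. In the zero-variance case, where R² is mathematically undefined, it returns 1.0 for a perfect fit and 0.0 otherwise, which is one common convention and is assumed here.

```python
import math


def calculate_metrics(y_true, y_pred) -> dict:
    """Compute R², RMSE, and MAE for paired sequences of numbers."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)               # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    if ss_tot == 0:
        # Constant target: R² is undefined, so fall back to a convention.
        r2 = 1.0 if ss_res == 0 else 0.0
    else:
        r2 = 1.0 - ss_res / ss_tot
    return {
        "r2": r2,
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
    }
```

A perfect prediction gives R² = 1 with zero RMSE and MAE, while always predicting the mean of the targets gives R² = 0, which makes both cases convenient fixed points for unit tests.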

5.3 INTEGRATION TESTING

Integration testing ensures that combined modules work cohesively, focusing on data flow and
interaction between components. Key workflows tested include:

• Upload to Preprocessing to Visualization: Verifies that a CSV file is uploaded, cleaned
(NaN removed, scaled), and visualized (e.g., correlation heatmap). Tests include large
datasets (10,000 rows), files with missing columns, and non-numeric data.

• Preprocessing to Model Training to Metrics: Confirms that cleaned data is split, used
to train models (e.g., Random Forest), and evaluated (R², RMSE). Edge cases include
small datasets, highly correlated features, or imbalanced splits.

• Geospatial Data to Mapping: Ensures latitude/longitude data is extracted, validated,
and rendered as Folium maps. Tests cover missing coordinates, invalid ranges (e.g.,
latitude > 90), and large geospatial datasets.

• IoT Integration to Display: Validates that IoT API data is fetched, parsed, and displayed
in the Streamlit sidebar. Tests include intermittent connectivity, invalid API keys, and
high-frequency data streams.

Integration tests are conducted using PyTest with mocked dependencies (e.g., IoT APIs) and real
data samples. Test scenarios simulate sequential module execution, checking for data
consistency and error propagation. Issues like memory leaks or caching errors are identified
using profiling tools (e.g., memory_profiler).
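
Mocking the IoT API, as described above, lets the fetch logic be tested without a live sensor. The sketch below uses only the standard library (urllib and unittest.mock) rather than any particular HTTP client; the body of fetch_iot_data, the payload keys pH and TDS, the helper fake_response, and the URL are all assumptions for illustration.

```python
import json
from unittest import mock
from urllib import request


def fetch_iot_data(url: str, timeout: float = 5.0) -> dict:
    """Fetch one JSON reading; return {'error': ...} instead of raising."""
    try:
        with request.urlopen(url, timeout=timeout) as response:
            payload = json.loads(response.read().decode("utf-8"))
    except (OSError, ValueError) as exc:
        # OSError covers network failures and timeouts; ValueError covers
        # malformed URLs and malformed JSON.
        return {"error": str(exc)}
    if not isinstance(payload, dict):
        return {"error": "unexpected payload"}
    return {"pH": payload.get("pH"), "TDS": payload.get("TDS")}


def fake_response(body: bytes):
    """Build a stand-in for the object returned by urlopen (test helper)."""
    resp = mock.MagicMock()
    resp.read.return_value = body
    resp.__enter__.return_value = resp
    return resp


# Simulate a healthy sensor and a timeout without touching the network.
with mock.patch("urllib.request.urlopen",
                return_value=fake_response(b'{"pH": 7.2, "TDS": 310}')):
    reading = fetch_iot_data("http://sensor.invalid/latest")

with mock.patch("urllib.request.urlopen", side_effect=TimeoutError("timed out")):
    failed = fetch_iot_data("http://sensor.invalid/latest")
```

Returning an error dictionary instead of raising keeps the sidebar display simple: it can show either the reading or the error text without its own exception handling.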

5.4 FUNCTIONAL TESTING

Functional testing validates the system’s features against user requirements, ensuring all
functionalities are intuitive and accurate. Key areas tested include:

• Data Upload: Confirms users can upload CSV/Excel files, with validation for file types
and error messages for invalid formats (e.g., .txt files).

• Data Preprocessing: Verifies automatic NaN handling, imputation options, and data
quality reports (dataset shape, missing value stats).

• Model Training: Tests model selection (Linear Regression, SVR), hyperparameter
tuning (e.g., Random Forest trees), and display of metrics (R², RMSE, MAE).

• Visualizations: Ensures interactive plots (correlation heatmaps, time-series, distribution
plots) render correctly, with zoom/pan functionality and export options (HTML).

• Geospatial Mapping: Validates Folium map rendering, marker placement, and
interactivity (zoom, click) for geospatial datasets.
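
Before markers are placed on a map, rows with missing or out-of-range coordinates need to be filtered out. The sketch below shows one way to do that in plain Python, independent of Folium; the function names and the latitude and longitude keys are assumptions, while the ranges used are the standard geographic bounds (latitude within ±90, longitude within ±180).

```python
def valid_coordinates(lat, lon) -> bool:
    """True if both values are numeric and within geographic bounds."""
    try:
        lat, lon = float(lat), float(lon)
    except (TypeError, ValueError):
        return False
    # NaN fails both comparisons, so it is rejected as well.
    return -90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0


def split_mappable(rows):
    """Separate rows that can be plotted from rows with bad coordinates."""
    mappable, rejected = [], []
    for row in rows:
        ok = valid_coordinates(row.get("latitude"), row.get("longitude"))
        (mappable if ok else rejected).append(row)
    return mappable, rejected


# Hypothetical sample rows: one valid, one out of range, one missing.
rows = [
    {"latitude": 8.7, "longitude": 78.0, "pH": 7.1},
    {"latitude": 95.0, "longitude": 78.0, "pH": 6.9},
    {"latitude": None, "longitude": 78.0, "pH": 7.4},
]
mappable, rejected = split_mappable(rows)
```

Keeping the rejected rows, rather than silently discarding them, allows the interface to report how many records could not be mapped.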

5.5 ACCEPTANCE TESTING

Acceptance testing validates the system against stakeholder requirements, ensuring it meets the
needs of environmental scientists, researchers, and policymakers. This phase involves:

• End-to-End Functionality: Tests complete workflows, such as uploading a water quality
dataset, preprocessing, training a Random Forest model, visualizing correlations,
mapping locations, and exporting results. Scenarios include real-world datasets with pH,
TDS, and geospatial data.

• UI Usability: Evaluates the Streamlit interface for intuitiveness, using feedback from
beta testers (e.g., researchers). Tests cover layout clarity, navigation ease, and error
message clarity (e.g., “Invalid file format”).

• Prediction Accuracy: Compares model predictions (e.g., pH values) against ground
truth data, ensuring R² > 0.8 for typical datasets. Cross-validation is used to verify
generalizability.

• IoT Display: Confirms real-time metrics (e.g., live pH readings) are accurate and update
seamlessly, tested with simulated IoT feeds.

• Stakeholder Feedback: Incorporates input from environmental domain experts to ensure
the system supports practical use cases, such as monitoring river water quality or
detecting pollution trends.

Acceptance testing is conducted in a staging environment, with user acceptance testing (UAT)
sessions involving faculty and peers. Defects are logged and prioritized for resolution before
deployment. A final validation report is prepared, documenting compliance with project
objectives.

6. SYSTEM IMPLEMENTATION AND MAINTENANCE

6.1 IMPLEMENTATION PROCEDURES

The Water Quality Regression system was implemented using Flask for the backend and
Streamlit for the frontend, providing an interactive web interface for users to input water quality
parameters and receive predictions.

Implementation Steps:

Backend Development (Flask & Machine Learning Models)

• The machine learning regression models were trained using Scikit-learn and saved as .pkl
files for efficient deployment.

• The Flask API was designed to load the trained models and process user inputs in real
time.

Frontend Development (Streamlit)

• A Streamlit web interface was built to allow users to enter water quality parameters such
as pH, TDS, BOD, COD, Turbidity, and DO.
• The frontend was designed to be user-friendly, providing real-time visualizations and
predictions.

Integration and Deployment

• The backend and frontend were integrated to enable seamless communication between
user inputs and model predictions.
• The system was initially tested on a local server using Replit for development and
debugging.
• For production deployment, the system can be hosted on cloud platforms such as AWS,
Hugging Face Spaces, or Heroku, ensuring accessibility and scalability.

Key Features Implemented:

• Real-time Predictions: Users receive immediate WQI (Water Quality Index) predictions
based on input values.
• Interactive Visualizations: Graphs and plots help users understand the trends and
relationships between water quality parameters.
• Scalability: The system architecture allows for easy integration of new features and
additional machine learning models.

6.2 SYSTEM MAINTENANCE

System maintenance ensures the long-term functionality, accuracy, and security of the Water
Quality Regression application. Maintenance activities include updating machine learning
models, monitoring system performance, and incorporating user feedback.

Maintenance Tasks:

Model Updates & Data Improvement

• The machine learning models will be periodically retrained with new water quality
datasets to improve prediction accuracy.

• Data preprocessing techniques will be refined to ensure the model adapts to changes in
environmental conditions.

Bug Fixes & Performance Optimization

• Regular debugging will be performed to fix any errors or inefficiencies in the Flask
backend and Streamlit interface.

• Performance tuning, such as optimizing prediction response time and reducing server
load, will be implemented.

Server & Deployment Monitoring

• Uptime monitoring ensures the system remains accessible without frequent downtime.
• The backend will be tested regularly to prevent server crashes or excessive computational
delays.

User Experience & Feature Enhancements

• Based on user feedback, improvements will be made to the UI/UX to ensure ease of use.
• New features, such as historical data tracking and alert notifications, can be integrated
over time.

API and Third-Party Service Updates

• If real-time water quality datasets or external APIs (such as government environmental
monitoring services) are integrated, they will be monitored for changes to ensure
continued compatibility.

• Updates to Streamlit and Flask versions will be applied to maintain security and
functionality.

7. CONCLUSION

The Water Quality Regression project successfully demonstrates how machine learning
can be leveraged to assess and predict water quality based on various physicochemical
parameters. By implementing regression models, the system provides accurate predictions of the
Water Quality Index (WQI), helping users evaluate the safety and usability of water sources. The
project integrates Flask for backend processing and Streamlit for an interactive web interface,
ensuring ease of use and accessibility. Users can input water quality parameters such as pH,
TDS, BOD, COD, Turbidity, and DO, visualize data through interactive plots, and obtain real-
time insights about water quality. The Water Quality Regression System represents a
transformative advancement in environmental monitoring, successfully addressing the
limitations of traditional water quality analysis methods through automation, integration, and
accessibility.

By leveraging Streamlit and Python, the system unifies data preprocessing, regression
modeling, interactive visualization, geospatial mapping, and real-time IoT integration into a
single, user-friendly platform, eliminating the inefficiencies of fragmented workflows. Its ability
to handle large datasets, train multiple regression models (e.g., Random Forest, SVR), and
generate actionable insights through dynamic charts and maps empowers environmental
scientists, researchers, and policymakers to make informed decisions for water resource
management.

8. SCOPE FOR FUTURE ENHANCEMENTS

The Water Quality Regression project provides a strong foundation for predicting water quality
using machine learning. However, several enhancements can be made to improve its accuracy,
usability, scalability, and real-world applicability. Below are some key areas for future
improvements:

1. Integration with IoT Sensors for Real-Time Monitoring

 Deploy IoT-based water quality sensors to collect real-time data from lakes, rivers, and
reservoirs.
 Automate data collection and feed it directly into the machine learning model for
continuous monitoring.
 Enable remote access to live water quality updates through cloud storage.

20
2. Advanced Machine Learning and Deep Learning Models

 Implement Deep Learning techniques (such as LSTMs or CNNs) to improve prediction


accuracy.
 Utilize ensemble learning methods to enhance regression performance.
 Train models on larger and more diverse datasets for better generalization across
different water bodies.

3. Geographic Mapping and Visualization

 Integrate GIS and geospatial mapping to display water quality across different locations.
 Use heat maps to visualize water contamination trends over time.
 Allow users to search for water quality reports based on specific locations.

4. Automated Water Quality Classification

• Instead of only predicting WQI values, introduce classification models to categorize
water as Safe, Moderate, or Contaminated.
• Provide recommendations for water treatment based on contamination levels.
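
A simple rule-based starting point for this categorization is sketched below. The cut-off values (70 and 40 on a 0-100 scale where higher is better) and the recommendation texts are assumptions chosen for illustration, not values taken from this report or from any standard; a trained classifier could later replace the thresholds.

```python
# Assumed cut-offs on a 0-100 WQI scale where higher means cleaner water.
SAFE_MIN = 70.0
MODERATE_MIN = 40.0

# Placeholder advice per category; real guidance should come from domain experts.
RECOMMENDATIONS = {
    "Safe": "Continue routine monitoring.",
    "Moderate": "Investigate the source and consider treatment before use.",
    "Contaminated": "Avoid use until the water has been treated and retested.",
}


def classify_wqi(wqi: float) -> str:
    """Map a predicted WQI value to one of three categories."""
    if wqi >= SAFE_MIN:
        return "Safe"
    if wqi >= MODERATE_MIN:
        return "Moderate"
    return "Contaminated"


def assess(wqi: float):
    """Return the category together with its recommendation text."""
    label = classify_wqi(wqi)
    return label, RECOMMENDATIONS[label]


results = [assess(w) for w in (85.0, 55.0, 20.0)]
```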

5. Mobile Application for Remote Accessibility

• Develop a mobile app where users can check water quality predictions on their
smartphones.
• Enable push notifications and alerts for high pollution levels.
• Allow users to report water contamination incidents for public awareness.
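
One possible shape for the alerting logic is sketched below. It uses two thresholds (hysteresis) so that a reading hovering near the limit does not trigger a notification on every update. The class name PollutionAlert and the threshold values are hypothetical, and actually delivering a push notification is outside the scope of the sketch.

```python
from typing import Optional


class PollutionAlert:
    """Raises an alert when WQI drops low and clears it only after recovery."""

    def __init__(self, trigger_below: float = 40.0, clear_above: float = 50.0):
        self.trigger_below = trigger_below  # assumed alert threshold
        self.clear_above = clear_above      # assumed recovery threshold
        self.active = False

    def update(self, wqi: float) -> Optional[str]:
        """Return "ALERT" or "CLEARED" on a state change, otherwise None."""
        if not self.active and wqi < self.trigger_below:
            self.active = True
            return "ALERT"
        if self.active and wqi > self.clear_above:
            self.active = False
            return "CLEARED"
        return None


alert = PollutionAlert()
events = [alert.update(w) for w in (65.0, 38.0, 35.0, 45.0, 55.0, 30.0)]
```

Only the state changes would be forwarded to users; the readings of 35 and 45 produce no new message because the alert is already active.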

BIBLIOGRAPHY

1. Tiwari, T. N., & Mishra, M. A. (1985). A new method for determining water quality
index for rivers. International Journal of Environmental Studies, 26(3), 237-245.
2. Kumar, M., & Puri, A. (2012). A review of permissible limits of drinking water quality
in India. Journal of Environmental Science & Engineering, 54(1), 94-100.
3. Sharma, S., & Bhardwaj, N. (2021). Machine learning-based water quality prediction
models: A review. Environmental Monitoring and Assessment, 193(12), 784.
4. Garg, S., & Gupta, R. (2020). Real-time water quality monitoring and prediction using
IoT and ML. Proceedings of the IEEE Conference on Smart Environments and
Innovative Applications, 125-132.
5. Chaudhary, R., Singh, D. K., & Yadav, R. (2019). A comparative study of regression
models for water quality prediction in Indian rivers. International Journal of Data Science
and Analytics, 7(3), 192-205.
6. Government of India, National Water Mission (NWM). (2023). National Framework for
Water Quality Management in India.

REFERENCES:

https://www.bis.gov.in/

https://cpcb.nic.in/

https://jalshakti.gov.in/

https://www.who.int/water_sanitation_health

APPENDIX

A. SYSTEM FLOW DIAGRAM

[Figure: System Flow Diagram]

[Figure: Break Down Process]

B. SCREENSHOTS

[Screenshots: Home page, File Upload, Heatmap Visualization, Model Selection, Training Model,
Model Performance, Export Trained Model, Actual vs Predicted Values, Mapping, Backend]

D. PLAGIARISM REPORT

C. SOURCE CODE

# preprocess.py: data loading and preprocessing helpers
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def load_data(file_path):
    """Read the water quality dataset from a CSV file."""
    df = pd.read_csv(file_path)
    return df


def preprocess_data(df):
    """Drop incomplete rows, split into train/test sets, and scale the features."""
    df = df.dropna()  # Remove rows with missing values
    X = df.drop(columns=['Water Quality Index'])
    y = df['Water Quality Index']
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)
    # Fit the scaler on the training split only, then reuse it for the test split
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)
    return X_train, X_test, y_train, y_test, scaler
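
To make the scaling step concrete, the sketch below reproduces by hand what standardization does to a single feature column: the mean and the population standard deviation are computed from the training values only and then reused for the test values. The pH values are made up, and fit_scaler and transform are hypothetical helper names used only for this illustration.

```python
from statistics import mean, pstdev


def fit_scaler(column):
    """Learn the mean and population standard deviation from training data."""
    return mean(column), pstdev(column)


def transform(column, mu, sigma):
    """Standardize values as z = (x - mu) / sigma."""
    return [(x - mu) / sigma for x in column]


# Illustrative pH values; not taken from the project dataset.
train_ph = [6.0, 7.0, 8.0]
test_ph = [9.0]

mu, sigma = fit_scaler(train_ph)             # statistics from the training split only
train_scaled = transform(train_ph, mu, sigma)
test_scaled = transform(test_ph, mu, sigma)  # reuse them; do not refit on test data
```

After the transform the training column has zero mean and unit variance, while the test value is expressed in the training split's units, which keeps information from the test set out of the fitted statistics.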



# Training script: fits a Random Forest regressor and saves it with the scaler
import os

import joblib
from sklearn.ensemble import RandomForestRegressor

from preprocess import load_data, preprocess_data

df = load_data("water_quality.csv")
X_train, X_test, y_train, y_test, scaler = preprocess_data(df)

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Make sure the output directory exists before saving the artifacts
os.makedirs("models", exist_ok=True)
joblib.dump(model, "models/water_quality_model.pkl")
joblib.dump(scaler, "models/scaler.pkl")
print("Model trained and saved successfully!")

# Streamlit prediction app
import streamlit as st
import pandas as pd
import numpy as np
import joblib
import matplotlib.pyplot as plt
import seaborn as sns

# Load the artifacts saved by the training script
model = joblib.load("models/water_quality_model.pkl")
scaler = joblib.load("models/scaler.pkl")
df = pd.read_csv("water_quality.csv")

st.title("Water Quality Prediction Web App")

# Sidebar inputs for the five water quality parameters
st.sidebar.header("Enter Water Quality Parameters")
ph = st.sidebar.slider("pH", 0.0, 14.0, 7.0)
dissolved_oxygen = st.sidebar.slider("Dissolved Oxygen (mg/L)", 0.0, 20.0, 7.0)
biological_oxygen_demand = st.sidebar.slider("BOD (mg/L)", 0.0, 10.0, 3.0)
nitrate = st.sidebar.slider("Nitrate (mg/L)", 0.0, 50.0, 10.0)
total_coliform = st.sidebar.slider("Total Coliform (MPN/100mL)", 0, 1000, 200)


def predict_quality(ph, dissolved_oxygen, biological_oxygen_demand, nitrate,
                    total_coliform):
    """Scale the inputs and return the predicted WQI rounded to two decimals."""
    # The feature order must match the column order the scaler was fitted on
    input_data = np.array([[ph, dissolved_oxygen, biological_oxygen_demand,
                            nitrate, total_coliform]])
    input_data_scaled = scaler.transform(input_data)
    prediction = model.predict(input_data_scaled)[0]
    return round(float(prediction), 2)


if st.sidebar.button("Predict"):
    prediction = predict_quality(ph, dissolved_oxygen, biological_oxygen_demand,
                                 nitrate, total_coliform)
    st.subheader(f"Predicted Water Quality Index: {prediction}")
    # Map the predicted index to a quality band
    if prediction > 80:
        st.success("Water Quality: Excellent")
    elif prediction > 60:
        st.info("Water Quality: Good")
    elif prediction > 40:
        st.warning("Water Quality: Poor")
    else:
        st.error("Water Quality: Unsafe")

if st.checkbox("Show Dataset"):
    st.write(df.head())

st.subheader("Water Quality Data Correlation")
fig, ax = plt.subplots(figsize=(8, 6))
# numeric_only avoids errors when the dataset contains non-numeric columns
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="coolwarm", ax=ax)
st.pyplot(fig)

# Model selection, training, and feature-importance helpers
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import numpy as np
import pandas as pd


def get_model(model_name, **kwargs):
    """Return the specified regression model with optional parameters."""
    # Only the hyperparameters listed for each model type are passed through
    if model_name == 'Random Forest':
        rf_params = {k: v for k, v in kwargs.items()
                     if k in ['n_estimators', 'max_depth']}
        return RandomForestRegressor(random_state=42, **rf_params)
    elif model_name == 'Ridge Regression':
        ridge_params = {k: v for k, v in kwargs.items() if k in ['alpha']}
        return Ridge(**ridge_params)
    elif model_name == 'Lasso Regression':
        lasso_params = {k: v for k, v in kwargs.items() if k in ['alpha']}
        return Lasso(**lasso_params)
    elif model_name == 'SVR':
        svr_params = {k: v for k, v in kwargs.items() if k in ['C', 'kernel']}
        return SVR(**svr_params)
    else:
        # Any unrecognised name falls back to ordinary linear regression
        return LinearRegression()


def train_model(model, X_train, y_train):
    """Train the selected model with error handling."""
    try:
        # Convert pandas objects to NumPy arrays before fitting
        X_train_np = X_train.to_numpy() if isinstance(X_train, pd.DataFrame) else X_train
        y_train_np = y_train.to_numpy() if isinstance(y_train, pd.Series) else y_train
        model.fit(X_train_np, y_train_np)
        return model
    except Exception as e:
        raise RuntimeError(f"Error during model training: {str(e)}") from e


def get_feature_importance(model, feature_names):
    """Get feature importance if available for the model."""
    try:
        if hasattr(model, 'coef_'):
            # Flatten first: a linear-kernel SVR stores coef_ as a 2-D array
            return dict(zip(feature_names, np.abs(np.ravel(model.coef_))))
        elif hasattr(model, 'feature_importances_'):
            return dict(zip(feature_names, model.feature_importances_))
        return None
    except Exception as e:
        print(f"Warning: Could not calculate feature importance: {str(e)}")
        return None


def prepare_data(df, target_column, feature_columns):
    """Preprocess data by splitting into training and testing sets."""
    try:
        # Select feature columns and target column
        X = df[feature_columns]
        y = df[target_column]
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.2, random_state=42)
        # Fit the scaler on the training split and apply it to both splits
        scaler = StandardScaler()
        X_train_scaled = scaler.fit_transform(X_train)
        X_test_scaled = scaler.transform(X_test)
        return X_train_scaled, X_test_scaled, y_train, y_test, scaler
    except Exception as e:
        raise RuntimeError(f"Error during data preparation: {str(e)}") from e

# Data loading, preparation, and metric helpers
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split


def load_and_preprocess_data(file):
    """Load and preprocess the uploaded data file."""
    try:
        if file.name.endswith('.csv'):
            df = pd.read_csv(file)
        elif file.name.endswith(('.xls', '.xlsx')):
            df = pd.read_excel(file)
        else:
            raise ValueError("Unsupported file format. Please upload a CSV or Excel file.")
        if df.empty:
            raise ValueError("The uploaded file is empty.")
        # Drop columns and rows that are entirely empty
        df = df.dropna(axis=1, how='all')
        df = df.dropna(how='all')
        return df
    except Exception as e:
        raise RuntimeError(f"Error loading data: {str(e)}") from e


def prepare_data(df, target_column, feature_columns):
    """Prepare data for modeling by handling missing values and scaling."""
    try:
        if target_column not in df.columns:
            raise ValueError(f"Target column '{target_column}' not found in dataset")
        for col in feature_columns:
            if col not in df.columns:
                raise ValueError(f"Feature column '{col}' not found in dataset")

        X = df[feature_columns].copy()
        y = df[target_column].copy()

        # Drop rows whose target value is missing
        target_missing = y.isna().sum()
        if target_missing > 0:
            valid_indices = ~y.isna()
            X = X[valid_indices]
            y = y[valid_indices]
            print(f"Removed {target_missing} rows with missing target values")

        # Treat infinite values as missing
        X = X.replace([np.inf, -np.inf], np.nan)

        numeric_cols = X.select_dtypes(include=[np.number]).columns
        if len(numeric_cols) == 0:
            raise ValueError("No numeric features found in the dataset")

        # Impute numeric columns with the column mean
        for col in numeric_cols:
            if X[col].isna().any():
                mean_val = X[col].mean()
                X[col] = X[col].fillna(mean_val)
                print(f"Imputed missing values in '{col}' with mean: {mean_val:.2f}")

        # Impute non-numeric columns with the most frequent value
        non_numeric_cols = X.select_dtypes(exclude=[np.number]).columns
        for col in non_numeric_cols:
            if X[col].isna().any():
                mode_val = X[col].mode()[0]
                X[col] = X[col].fillna(mode_val)
                print(f"Imputed missing values in '{col}' with mode: {mode_val}")

        # Note: StandardScaler requires numeric input, so any non-numeric
        # feature columns must be encoded before this step.
        scaler = StandardScaler()
        X_scaled = scaler.fit_transform(X)
        X_scaled = pd.DataFrame(X_scaled, columns=X.columns)

        X_train, X_test, y_train, y_test = train_test_split(
            X_scaled, y, test_size=0.2, random_state=42
        )
        return X_train, X_test, y_train, y_test, scaler
    except Exception as e:
        raise RuntimeError(f"Error preparing data: {str(e)}") from e


def calculate_metrics(y_true, y_pred):
    """Calculate regression metrics."""
    from sklearn.metrics import (r2_score, mean_squared_error,
                                 mean_absolute_error)
    r2 = r2_score(y_true, y_pred)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    mae = mean_absolute_error(y_true, y_pred)
    return {
        'R² Score': r2,
        'RMSE': rmse,
        'MAE': mae
    }
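
For reference, the three metrics returned by calculate_metrics can be written out by hand. The sketch below uses only the standard library and a small made-up set of actual and predicted WQI values; regression_metrics is a hypothetical name used only for this illustration.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute R², RMSE, and MAE directly from their definitions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n                # mean absolute error
    sse = sum(e * e for e in errors)                     # residual sum of squares
    rmse = math.sqrt(sse / n)                            # root mean squared error
    mean_true = sum(y_true) / n
    sst = sum((t - mean_true) ** 2 for t in y_true)      # total sum of squares
    r2 = 1.0 - sse / sst                                 # coefficient of determination
    return {"R2": r2, "RMSE": rmse, "MAE": mae}


# Illustrative values only; not results from the project.
y_true = [50.0, 60.0, 70.0, 80.0]
y_pred = [52.0, 58.0, 71.0, 79.0]
metrics = regression_metrics(y_true, y_pred)
```

RMSE penalizes large errors more heavily than MAE, so comparing the two gives a rough sense of whether a few predictions are far off.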

# Excerpt from the main Streamlit page. The enclosing if/try blocks and the
# imports (presumably streamlit as st and plotly.graph_objects as go) precede
# this excerpt and are not included in the listing; the indentation below is
# reconstructed from the nesting of the except/else clauses.
            st.download_button(
                label="Download Plot as HTML",
                data=fig.to_html(),
                file_name="actual_vs_predicted.html",
                mime="text/html"
            )

            # Residual plot: residuals against predicted values
            residuals = y_test - y_pred_test
            fig = go.Figure()
            fig.add_scatter(x=y_pred_test, y=residuals, mode='markers',
                            marker=dict(color='blue'))
            fig.add_hline(y=0, line_dash="dash", line_color="red")
            fig.update_layout(
                title='Residual Plot',
                xaxis_title='Predicted Values',
                yaxis_title='Residuals'
            )
            st.plotly_chart(fig)
        except Exception as e:
            st.error(f"An error occurred during model training: {str(e)}")
            st.info("Please ensure your data is properly formatted and "
                    "contains valid numerical values.")
    except Exception as e:
        st.error(f"An error occurred: {str(e)}")
else:
    st.info("Please upload a dataset to begin the analysis.")
