Water Quality Analysis Report
A PROJECT REPORT
Submitted By
SURESH M: 950421104053
BACHELOR OF TECHNOLOGY
In
INFORMATION TECHNOLOGY
BONAFIDE CERTIFICATE

SIGNATURE                                          SIGNATURE
DR. T. JASPERLINE                                  MR. E. STEPHEN JOSEPH
SUPERVISOR                                         HEAD OF THE DEPARTMENT
Professor                                          Professor & Head
Dept. of Computer Science and Engineering,         Dept. of Computer Science and Engineering,
Dr. G U Pope College of Engineering,               Dr. G U Pope College of Engineering,
Tuticorin-628251.                                  Tuticorin-628251.

INTERNAL EXAMINER                                  EXTERNAL EXAMINER
1. INTRODUCTION
The Water Quality Regression project aims to develop a machine learning-powered web
application using Streamlit to analyze and classify water quality based on various
physicochemical parameters. This project provides an interactive platform where users can
visualize water quality data through graphs and statistical insights, enabling better understanding
and decision-making. By implementing classification algorithms, the system predicts water
quality categories, helping to determine whether the water is safe for consumption or requires
treatment. Additionally, the application offers statistical analysis, including mean, median,
standard deviation, and correlation insights, to enhance data interpretation.
Water quality is a critical factor affecting public health and the environment. This
project aims to develop a web-based Water Quality Classification system that enables users to
analyze water quality data efficiently. The application provides data visualization tools such as
histograms, box plots, correlation heatmaps, and scatter plots to help users understand trends and
patterns in water quality parameters. Additionally, machine learning models, including
classification algorithms, predict water quality categories based on input data, assisting in
determining whether the water is potable or contaminated. The project integrates statistical
calculations to enhance data insights, making it a valuable tool for water resource management
and pollution control. By providing an easy-to-use interface through Streamlit, the project
ensures accessibility and promotes data-driven decision-making in water quality assessment.
1.4 PROJECT OVERVIEW
The system comprises six modules, each designed to enhance functionality and user experience:
1. Data Upload and Preprocessing: Users upload CSV/Excel files, which are validated
and cleaned (removing NaN values, imputing missing data). The module generates
quality reports detailing dataset shape, missing values, and column types, ensuring robust
data preparation.
3. Model Training and Evaluation: Supports Linear Regression, Ridge, Lasso, Random
Forest, and SVR models with hyperparameter tuning (e.g., number of trees). It evaluates
performance using R², RMSE, and MAE, visualizing feature importance and actual vs.
predicted values (a sketch of this workflow follows the list).
4. Geospatial Mapping: Utilizes Folium to display water quality data on interactive maps,
requiring latitude and longitude inputs, enabling spatial analysis of environmental trends.
5. Real-Time IoT Integration: Fetches live data from IoT sensor APIs, displaying metrics
like pH and TDS in the sidebar for real-time monitoring.
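The following is a minimal sketch of the training-and-evaluation workflow in the Model Training and Evaluation module, assuming a pandas DataFrame of numeric water quality parameters with a "WQI" target column (the column name and hyperparameters are illustrative assumptions, not the project's exact code):

# Sketch: train a regressor and report R², RMSE, and MAE (module 3).
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

def train_and_evaluate(df, target="WQI", n_estimators=100):
    X, y = df.drop(columns=[target]), df[target]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    model = RandomForestRegressor(n_estimators=n_estimators, random_state=42)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    metrics = {
        "R2": r2_score(y_test, y_pred),
        "RMSE": np.sqrt(mean_squared_error(y_test, y_pred)),
        "MAE": mean_absolute_error(y_test, y_pred),
    }
    importance = dict(zip(X.columns, model.feature_importances_))  # backs the feature-importance chart
    return model, metrics, importance

Ridge, Lasso, or SVR could be substituted for the Random Forest with the same evaluation step.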
2. SYSTEM SPECIFICATION
To run the Water Quality Classification system efficiently, the following hardware specifications
are recommended:
Storage: At least 2 GB of free space for data storage and model execution
The application is compatible with multiple operating systems, including Windows, macOS, and
Linux. The core software components include:
Streamlit
Streamlit is a lightweight and user-friendly Python framework for building interactive web
applications. It is used as the frontend for the Water Quality Classification system, allowing
users to upload data files, visualize results, and interpret water quality classifications in real
time.
Real-time Visualizations: The application dynamically updates graphs and charts based
on user inputs, making it easy to understand trends in water quality data.
CSV Upload & Processing: Users can upload water quality datasets, which are
automatically processed by the backend to generate classification results.
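A minimal sketch of this upload-and-visualize flow is shown below (widget labels and plotting choices are illustrative assumptions):

# Streamlit sketch: upload a water quality CSV and plot one parameter's distribution.
import streamlit as st
import pandas as pd
import matplotlib.pyplot as plt

st.title("Water Quality Analysis")
uploaded = st.file_uploader("Upload a water quality CSV file", type=["csv"])
if uploaded is not None:
    df = pd.read_csv(uploaded)
    st.write("Dataset preview:", df.head())
    column = st.selectbox("Parameter to plot", df.select_dtypes("number").columns)
    fig, ax = plt.subplots()
    ax.hist(df[column].dropna(), bins=20)
    ax.set_xlabel(column)
    ax.set_ylabel("Frequency")
    st.pyplot(fig)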
Scikit-learn
Feature Engineering: The system extracts meaningful features from water quality
parameters to improve model accuracy.
Model Training & Prediction: The trained models classify water quality into different
categories (e.g., safe, moderate, polluted) based on input parameters.
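A minimal Scikit-learn sketch of this classification step is given below, assuming the dataset carries a categorical label column (the file name and the "quality_class" column are assumptions for illustration):

# Sketch: classify water samples into quality categories with Scikit-learn.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

df = pd.read_csv("water_quality.csv")                 # assumed file name
X = df.drop(columns=["quality_class"])
y = df["quality_class"]                               # e.g. "safe", "moderate", "polluted"
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
clf = RandomForestClassifier(random_state=42).fit(X_train, y_train)
print("Test accuracy:", accuracy_score(y_test, clf.predict(X_test)))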
Visualization Libraries
Plotting libraries such as Matplotlib and Plotly are used for data visualization, allowing users to analyze trends and relationships between water quality parameters. Graphs such as scatter plots, heatmaps, and histograms help in understanding the impact of each parameter on water quality.
Pandas and NumPy
These libraries are used for data preprocessing, handling missing values, and performing numerical computations required for model training.
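A short example of the statistical analysis these libraries support (mean, median, standard deviation, and correlations), assuming the same dataset file name used elsewhere in this report:

# Summary statistics and correlations for the numeric water quality parameters.
import pandas as pd

df = pd.read_csv("water_quality.csv")            # assumed file name
numeric = df.select_dtypes("number")
stats = numeric.agg(["mean", "median", "std"])   # per-parameter summary statistics
corr = numeric.corr()                            # correlation matrix behind the heatmap
print(stats)
print(corr)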
3. SYSTEM STUDY
In the current scenario, water quality assessment is often performed manually using
laboratory testing, which is time-consuming, expensive, and requires skilled professionals.
Traditional methods involve collecting water samples, analyzing them for chemical, physical,
and biological parameters, and interpreting the results based on predefined standards.
The existing systems for water quality analysis predominantly rely on manual laboratory
testing, standalone software tools, and fragmented data processing workflows, which present
significant challenges in efficiency, scalability, and accessibility. Laboratory-based methods
involve collecting water samples and conducting chemical tests for parameters like pH, TDS,
and turbidity, a process that is labor-intensive, time-consuming, and prone to human error.
Data analysis is often performed using tools like Microsoft Excel, R, or MATLAB, which
require users to manually preprocess datasets, apply statistical models, and generate
visualizations. These tools lack integration, forcing users to switch between multiple platforms
for data cleaning, modeling, and mapping, leading to inefficiencies and data inconsistencies.
Geospatial analysis, when required, is typically handled by separate GIS software like
ArcGIS or QGIS, which demands specialized expertise and additional licensing costs. Real-time
monitoring is rarely supported, as most systems do not integrate with IoT devices, limiting their
ability to provide live insights into water quality trends.
The proposed Water Quality Classification system overcomes the limitations of the
existing methods by utilizing machine learning algorithms for accurate and automated
classification of water quality. The system is designed to analyze various water quality
parameters, such as pH, turbidity, dissolved oxygen (DO), biological oxygen demand (BOD),
total dissolved solids (TDS), and conductivity, to classify water into different quality categories.
The proposed Water Quality Regression System addresses the shortcomings of existing
systems by offering a unified, automated, and scalable web-based platform built on Streamlit
and Python. Unlike manual laboratory methods, the system automates data ingestion and
preprocessing, allowing users to upload CSV or Excel files and instantly clean datasets by
removing NaN values, imputing missing data, and scaling features.
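A condensed sketch of this automated ingestion and preprocessing is shown below (the imputation and scaling choices follow the description above; the helper name is an assumption):

# Sketch: load a CSV/Excel file, drop empty rows/columns, impute, and scale.
import pandas as pd
from sklearn.preprocessing import StandardScaler

def ingest(path):
    df = pd.read_excel(path) if path.endswith((".xls", ".xlsx")) else pd.read_csv(path)
    df = df.dropna(axis=1, how="all").dropna(how="all")   # remove fully empty columns/rows
    numeric = df.select_dtypes("number")
    numeric = numeric.fillna(numeric.mean())              # impute remaining gaps with column means
    scaled = StandardScaler().fit_transform(numeric)      # scale features for modelling
    return df, scaled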
4. SYSTEM DESIGN
4.1 SYSTEM FLOW DIAGRAM
The system flow diagram illustrates the system’s workflow across multiple levels:
System Flow Diagram: Shows interactions between external entities (User, IoT Device)
and the system. Users provide datasets and API URLs, receiving reports, models,
visualizations, and maps. IoT Devices supply real-time data, displayed in the UI.
The input design focuses on how water quality data is collected from users. The inputs are structured to ensure accuracy, reliability, and ease of use.
Water Quality Data Inputs
Users can upload a CSV file containing key water quality parameters. The input fields include:
No missing or null values.
Upload Button – Users can upload a CSV file for batch processing.
Real-time Data Entry – Users can manually input values through a form if needed.
This input design ensures that the system collects high-quality, structured data, which is crucial
for accurate classification and analysis.
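A minimal sketch of the manual real-time data entry form (parameter names, units, and default ranges are assumptions based on the parameters listed in this report):

# Streamlit form for manual entry of water quality parameters.
import streamlit as st

with st.form("water_quality_input"):
    ph = st.number_input("pH", min_value=0.0, max_value=14.0, value=7.0)
    tds = st.number_input("TDS (mg/L)", min_value=0.0, value=300.0)
    turbidity = st.number_input("Turbidity (NTU)", min_value=0.0, value=1.0)
    do = st.number_input("Dissolved Oxygen (mg/L)", min_value=0.0, value=6.0)
    submitted = st.form_submit_button("Submit")
if submitted:
    st.write({"pH": ph, "TDS": tds, "Turbidity": turbidity, "DO": do})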
The output design defines how the results of water quality classification are presented to the
user. The system generates outputs after processing the uploaded water quality data and running
it through the machine learning model.
Once the model analyzes the input parameters, the system presents the classification results in an
easy-to-understand format:
Predicted Water Quality Class – The classification label (e.g., Safe, Heavily Polluted).
Confidence Score – A numerical probability indicating the model’s certainty in its classification.
Visual Indicators – A color-coded indicator (Green, Yellow, Red) for an intuitive understanding
of the water quality.
To help users interpret the classification results, the system may also provide:
Parameter Analysis – A breakdown of how each input (e.g., pH, TDS, DO) influences the classification.
Health & Environmental Impact – Information on the effects of water quality levels on human health and ecosystems.
Suggestions for Improvement – Possible corrective actions, such as filtration methods, chemical treatments, or regulatory guidelines for maintaining safe water quality.
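A sketch of how the predicted class, confidence score, and colour-coded indicator could be rendered in Streamlit (the label names and their mapping to colours are illustrative):

# Render the predicted class with a colour-coded Streamlit message.
import streamlit as st

def show_result(label, confidence):
    message = f"Predicted class: {label} (confidence {confidence:.0%})"
    if label == "Safe":
        st.success(message)        # green indicator
    elif label == "Moderate":
        st.warning(message)        # yellow indicator
    else:
        st.error(message)          # red indicator, e.g. Heavily Polluted

show_result("Safe", 0.92)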
Data Visualization
Graphs & Charts – Displaying trends in water quality over time.
Comparative Analysis – Allowing users to compare different water samples.
The outputs are formatted using Streamlit's interactive components for a clean, responsive, and user-friendly interface.
While the initial prototype of the Water Quality Classification system may not require a
persistent database, a well-structured database design will be useful for future enhancements,
enabling data storage, historical analysis, and improved model performance.
Water Quality Data: Stores attributes related to water samples, including pH level,
turbidity, total dissolved solids (TDS), conductivity, dissolved oxygen (DO), biological
oxygen demand (BOD), chemical oxygen demand (COD), and timestamps.
Classification Results: Stores predictions from the Water Quality Classification Model,
including the predicted water quality category, confidence score, and associated input
data.
User Data (Optional for Authentication): If user authentication is implemented, this table
will store user details such as username, email, and uploaded water quality records.
Relationships
The User Data table (if included) has a one-to-many relationship with the Water Quality
Data table, meaning a user can submit multiple water quality records.
The Water Quality Data table is linked to the Classification Results table, allowing easy
retrieval of past predictions based on specific input conditions.
o Dataset: dataset_id (PK), filename (varchar), columns (text), rows (int), target
(varchar), features (text).
o Visualization: viz_id (PK), type (varchar, e.g., "heatmap"), parameters (text),
output_file (varchar), dataset_id (FK).
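The Dataset and Visualization tables above could be created with a lightweight schema such as the following sqlite3 sketch (a possible future design, not part of the current prototype):

# Possible schema for the Dataset and Visualization tables described above.
import sqlite3

conn = sqlite3.connect("water_quality.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS dataset (
    dataset_id INTEGER PRIMARY KEY,
    filename   VARCHAR,
    columns    TEXT,
    rows       INTEGER,
    target     VARCHAR,
    features   TEXT
);
CREATE TABLE IF NOT EXISTS visualization (
    viz_id      INTEGER PRIMARY KEY,
    type        VARCHAR,          -- e.g. "heatmap"
    parameters  TEXT,
    output_file VARCHAR,
    dataset_id  INTEGER REFERENCES dataset(dataset_id)
);
""")
conn.commit()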
5. SYSTEM TESTING
System testing and implementation are crucial phases in the development of the Water
Quality Regression system, ensuring that the model provides accurate predictions and the
application functions reliably. This section outlines the testing strategies used to validate the
system, including unit testing, integration testing, validation testing, and system-wide testing.
Additionally, the implementation process is described, focusing on the transition from
development to deployment.
The testing strategy employs a comprehensive approach to validate the system’s functionality, performance, and security across various levels. Testing begins with the individual components of the system, verifying that each part functions correctly.
The regression model was tested using various water quality parameters such as pH,
Dissolved Oxygen (DO), Chemical Oxygen Demand (COD), Biochemical Oxygen
Demand (BOD), Total Dissolved Solids (TDS), and Turbidity.
Different test cases were designed to verify that the model returns reasonable predictions
for water quality index (WQI) values.
Edge cases, such as extreme values or missing data, were tested to ensure error handling
mechanisms function correctly.
Tested for proper handling of missing values, scaling of numerical features, and feature
selection.
Verified that input data is standardized before being passed to the regression model.
Ensured that user inputs are validated (e.g., pH should be between 0 and 14, TDS should
be non-negative).
Implemented error messages and restrictions to prevent invalid inputs from being
submitted.
Unit testing targets individual functions and methods within the system to confirm they perform
as expected under various conditions. Key functions tested include:
load_and_preprocess_data: Validates file loading (CSV/Excel), handling of missing
values (NaN removal, imputation), and data type consistency. Test cases include valid
files, corrupted files, empty files, and files with excessive missing data.
prepare_data: Tests feature scaling (StandardScaler), train-test splitting (80-20 ratio), and
target/feature selection. Edge cases include datasets with single columns, non-numeric
data, or mismatched feature-target pairs.
fetch_iot_data: Tests API calls for IoT data retrieval, parsing, and error handling for
invalid URLs, timeouts, or malformed JSON responses.
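An illustrative PyTest case for load_and_preprocess_data is shown below; it assumes the function accepts an uploaded-file-like object with a .name attribute and returns a cleaned DataFrame (the import path is hypothetical):

# Illustrative unit test for load_and_preprocess_data.
import io
import pandas as pd
from preprocessing import load_and_preprocess_data   # hypothetical module name

class FakeUpload(io.BytesIO):
    # Mimics a Streamlit UploadedFile: a file-like object with a .name attribute.
    def __init__(self, data, name):
        super().__init__(data)
        self.name = name

def test_load_csv_drops_empty_rows():
    csv = b"pH,TDS\n7.1,250\n,\n6.8,310\n"
    df = load_and_preprocess_data(FakeUpload(csv, "sample.csv"))
    assert isinstance(df, pd.DataFrame)
    assert len(df) == 2          # the fully empty row should be removed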
Integration testing ensures that combined modules work cohesively, focusing on data flow and
interaction between components. Key workflows tested include:
Preprocessing to Model Training to Metrics: Confirms that cleaned data is split, used
to train models (e.g., Random Forest), and evaluated (R², RMSE). Edge cases include
small datasets, highly correlated features, or imbalanced splits.
IoT Integration to Display: Validates that IoT API data is fetched, parsed, and displayed
in the Streamlit sidebar. Tests include intermittent connectivity, invalid API keys, and
high-frequency data streams.
Integration tests are conducted using PyTest with mocked dependencies (e.g., IoT APIs) and real
data samples. Test scenarios simulate sequential module execution, checking for data
consistency and error propagation. Issues like memory leaks or caching errors are identified
using profiling tools (e.g., memory_profiler).
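As an example of such a mocked dependency, the IoT fetch could be exercised as below; this assumes fetch_iot_data calls the requests library and returns the parsed JSON (both assumptions, with a placeholder module name and URL):

# Illustrative integration test: mock the HTTP call behind fetch_iot_data.
from unittest.mock import patch, MagicMock
from iot import fetch_iot_data   # hypothetical module name

def test_fetch_iot_data_parses_json():
    fake_response = MagicMock(status_code=200)
    fake_response.json.return_value = {"ph": 7.2, "tds": 340}
    with patch("iot.requests.get", return_value=fake_response):
        data = fetch_iot_data("https://example.com/api/sensor")  # placeholder URL
    assert data["ph"] == 7.2
    assert data["tds"] == 340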
Functional testing validates the system’s features against user requirements, ensuring all
functionalities are intuitive and accurate. Key areas tested include:
Data Upload: Confirms users can upload CSV/Excel files, with validation for file types
and error messages for invalid formats (e.g., .txt files).
Data Preprocessing: Verifies automatic NaN handling, imputation options, and data
quality reports (dataset shape, missing value stats).
Acceptance testing validates the system against stakeholder requirements, ensuring it meets the
needs of environmental scientists, researchers, and policymakers. This phase involves:
IoT Display: Confirms real-time metrics (e.g., live pH readings) are accurate and update
seamlessly, tested with simulated IoT feeds.
Acceptance testing is conducted in a staging environment, with user acceptance testing (UAT)
sessions involving faculty and peers. Defects are logged and prioritized for resolution before
deployment. A final validation report is prepared, documenting compliance with project
objectives.
The Water Quality Regression system was implemented using Flask for the backend and
Streamlit for the frontend, providing an interactive web interface for users to input water quality
parameters and receive predictions.
Implementation Steps:
The machine learning regression models were trained using Scikit-learn and saved as .pkl
files for efficient deployment.
The Flask API was designed to load the trained models and process user inputs in real
time (a minimal endpoint sketch follows this list).
A Streamlit web interface was built to allow users to enter water quality parameters such
as pH, TDS, BOD, COD, Turbidity, and DO.
The frontend was designed to be user-friendly, providing real-time visualizations and
predictions.
The backend and frontend were integrated to enable seamless communication between
user inputs and model predictions.
The system was initially tested on a local server using Replit for development and
debugging.
For production deployment, the system can be hosted on cloud platforms such as AWS,
Hugging Face Spaces, or Heroku, ensuring accessibility and scalability.
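A minimal sketch of such a Flask prediction endpoint, assuming the saved model and scaler files from the training step and a fixed feature order (the feature list is an assumption):

# Minimal Flask API that loads the saved model/scaler and serves WQI predictions.
import joblib
import numpy as np
from flask import Flask, request, jsonify

app = Flask(__name__)
model = joblib.load("models/water_quality_model.pkl")
scaler = joblib.load("models/scaler.pkl")
FEATURES = ["pH", "TDS", "BOD", "COD", "Turbidity", "DO"]   # assumed feature order

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()
    values = np.array([[payload[f] for f in FEATURES]])
    wqi = model.predict(scaler.transform(values))[0]
    return jsonify({"wqi": round(float(wqi), 2)})

if __name__ == "__main__":
    app.run(debug=True)

The Streamlit frontend can then POST the entered parameters to /predict and display the returned WQI.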
Key features of the implemented system include:
Real-time Predictions: Users receive immediate WQI (Water Quality Index) predictions based on input values.
Interactive Visualizations: Graphs and plots help users understand the trends and
relationships between water quality parameters.
Scalability: The system architecture allows for easy integration of new features and
additional machine learning models.
6. SYSTEM MAINTENANCE
System maintenance ensures the long-term functionality, accuracy, and security of the Water Quality Regression application. Maintenance activities include updating machine learning models, monitoring system performance, and incorporating user feedback.
Maintenance Tasks:
The machine learning models will be periodically retrained with new water quality
datasets to improve prediction accuracy.
Data preprocessing techniques will be refined to ensure the model adapts to changes in
environmental conditions.
Regular debugging will be performed to fix any errors or inefficiencies in the Flask
backend and Streamlit interface.
Performance tuning, such as optimizing prediction response time and reducing server
load, will be implemented.
Uptime monitoring ensures the system remains accessible without frequent downtime.
The backend will be tested regularly to prevent server crashes or excessive computational
delays.
Based on user feedback, improvements will be made to the UI/UX to ensure ease of use.
New features, such as historical data tracking and alert notifications, can be integrated
over time.
Updates to Streamlit and Flask versions will be applied to maintain security and
functionality.
7. CONCLUSION
The Water Quality Regression project successfully demonstrates how machine learning
can be leveraged to assess and predict water quality based on various physicochemical
parameters. By implementing regression models, the system provides accurate predictions of the
Water Quality Index (WQI), helping users evaluate the safety and usability of water sources. The
project integrates Flask for backend processing and Streamlit for an interactive web interface,
ensuring ease of use and accessibility. Users can input water quality parameters such as pH,
TDS, BOD, COD, Turbidity, and DO, visualize data through interactive plots, and obtain real-
time insights about water quality. The Water Quality Regression System represents a
transformative advancement in environmental monitoring, successfully addressing the
limitations of traditional water quality analysis methods through automation, integration, and
accessibility.
By leveraging Streamlit and Python, the system unifies data preprocessing, regression
modeling, interactive visualization, geospatial mapping, and real-time IoT integration into a
single, user-friendly platform, eliminating the inefficiencies of fragmented workflows. Its ability
to handle large datasets, train multiple regression models (e.g., Random Forest, SVR), and
generate actionable insights through dynamic charts and maps empowers environmental
scientists, researchers, and policymakers to make informed decisions for water resource
management.
The Water Quality Regression project provides a strong foundation for predicting water quality
using machine learning. However, several enhancements can be made to improve its accuracy,
usability, scalability, and real-world applicability. Below are some key areas for future
improvements:
Deploy IoT-based water quality sensors to collect real-time data from lakes, rivers, and
reservoirs.
Automate data collection and feed it directly into the machine learning model for
continuous monitoring.
Enable remote access to live water quality updates through cloud storage.
2. Advanced Machine Learning and Deep Learning Models
Integrate GIS and geospatial mapping to display water quality across different locations.
Use heat maps to visualize water contamination trends over time.
Allow users to search for water quality reports based on specific locations.
Develop a mobile app where users can check water quality predictions on their
smartphones.
Enable push notifications and alerts for high pollution levels.
Allow users to report water contamination incidents for public awareness.
BIBLIOGRAPHY
1. Tiwari, T. N., & Mishra, M. A. (1985). A new method for determining water quality
index for rivers. International Journal of Environmental Studies, 26(3), 237-245.
2. Kumar, M., & Puri, A. (2012). A review of permissible limits of drinking water quality
in India. Journal of Environmental Science & Engineering, 54(1), 94-100.
3. Sharma, S., & Bhardwaj, N. (2021). Machine learning-based water quality prediction
models: A review. Environmental Monitoring and Assessment, 193(12), 784.
4. Garg, S., & Gupta, R. (2020). Real-time water quality monitoring and prediction using
IoT and ML. Proceedings of the IEEE Conference on Smart Environments and
Innovative Applications, 125-132.
5. Chaudhary, R., Singh, D. K., & Yadav, R. (2019). A comparative study of regression
models for water quality prediction in Indian rivers. International Journal of Data Science
and Analytics, 7(3), 192-205.
6. Government of India, National Water Mission (NWM). (2023). National Framework for
Water Quality Management in India.
REFERENCE:
https://www.bis.gov.in/
https://cpcb.nic.in/
https://jalshakti.gov.in/
https://www.who.int/water_sanitation_health
APPENDIX
A. Break Down Process
B. SCREENSHOTS
Home page
File Upload
Heatmap Visualization
Model Selection
Training Model
Model Performance
Export Trained Model
Mapping
Backend
D. PLAGIARISM REPORT
C. SOURCE CODE
# preprocess.py
import pandas as pd
from sklearn.preprocessing import StandardScaler

def load_data(file_path):
    # Read the water quality dataset from a CSV file.
    df = pd.read_csv(file_path)
    return df

def preprocess_data(X_train, X_test):
    # Standardize features; the scaler is fitted only on the training split.
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)
    return X_train, X_test, scaler
# train.py (script name assumed) -- assumes "WQI" is the target column.
import joblib
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from preprocess import load_data, preprocess_data

df = load_data("water_quality.csv")
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns=["WQI"]), df["WQI"], test_size=0.2, random_state=42)
X_train, X_test, scaler = preprocess_data(X_train, X_test)
model = RandomForestRegressor(random_state=42).fit(X_train, y_train)
joblib.dump(model, "models/water_quality_model.pkl")
joblib.dump(scaler, "models/scaler.pkl")
# app.py -- Streamlit frontend; parameter names assumed from the report.
import streamlit as st
import pandas as pd
import numpy as np
import joblib
import matplotlib.pyplot as plt

model = joblib.load("models/water_quality_model.pkl")
scaler = joblib.load("models/scaler.pkl")
df = pd.read_csv("water_quality.csv")

st.sidebar.header("Enter Water Quality Parameters")
params = ["pH", "TDS", "BOD", "COD", "Turbidity", "DO"]
values = [st.sidebar.number_input(p, value=0.0) for p in params]

def predict_wqi(values):
    # Scale the user input with the saved scaler and predict the WQI.
    input_data = np.array(values).reshape(1, -1)
    input_data_scaled = scaler.transform(input_data)
    prediction = model.predict(input_data_scaled)[0]
    return round(prediction, 2)

if st.sidebar.button("Predict"):
    st.success(f"Predicted Water Quality Index: {predict_wqi(values)}")
else:
    st.info("Enter parameter values in the sidebar and click Predict.")

if st.checkbox("Show Dataset"):
    st.write(df.head())
    fig, ax = plt.subplots()
    ax.hist(df.select_dtypes("number").iloc[:, 0].dropna(), bins=20)
    st.pyplot(fig)
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def get_model(model_name, rf_params=None, ridge_params=None, lasso_params=None, svr_params=None):
    # Return the regression model chosen in the UI (wrapper names reconstructed from fragments).
    if model_name == 'Random Forest':
        return RandomForestRegressor(**(rf_params or {}))
    elif model_name == 'Ridge':
        return Ridge(**(ridge_params or {}))
    elif model_name == 'Lasso':
        return Lasso(**(lasso_params or {}))
    elif model_name == 'SVR':
        return SVR(**(svr_params or {}))
    else:
        return LinearRegression()

def train_model(model, X_train, y_train):
    # Fit the selected model on NumPy arrays; return None if training fails.
    try:
        model.fit(np.asarray(X_train), np.asarray(y_train))
        return model
    except Exception as e:
        print(f"Training failed: {e}")
        return None

def get_feature_importance(model, feature_names):
    # Coefficients for linear models, feature_importances_ for tree models.
    try:
        if hasattr(model, 'coef_'):
            return pd.Series(model.coef_, index=feature_names)
        if hasattr(model, 'feature_importances_'):
            return pd.Series(model.feature_importances_, index=feature_names)
        return None
    except Exception:
        return None

def prepare_data(df, feature_columns, target_column, test_size=0.2):
    # Select features/target, split 80-20, and scale with StandardScaler.
    try:
        X = df[feature_columns]
        y = df[target_column]
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=42)
        scaler = StandardScaler()
        X_train_scaled = scaler.fit_transform(X_train)
        X_test_scaled = scaler.transform(X_test)
        return X_train_scaled, X_test_scaled, y_train, y_test, scaler
    except Exception as e:
        print(f"Data preparation failed: {e}")
        return None
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler

def load_and_preprocess_data(file):
    # Load an uploaded CSV/Excel file and drop fully empty rows and columns.
    try:
        if file.name.endswith('.csv'):
            df = pd.read_csv(file)
        elif file.name.endswith(('.xls', '.xlsx')):
            df = pd.read_excel(file)
        else:
            raise ValueError("Unsupported file type; please upload a CSV or Excel file.")
        if df.empty:
            raise ValueError("The uploaded file contains no data.")
        df = df.dropna(axis=1, how='all')   # drop columns that are entirely NaN
        df = df.dropna(how='all')           # drop rows that are entirely NaN
        return df
    except Exception as e:
        print(f"Error loading file: {e}")
        return None
def impute_and_scale(df, feature_columns, target_column):
    # Impute missing values and scale features (wrapper name assumed for this fragment).
    try:
        X = df[feature_columns].copy()
        y = df[target_column].copy()
        # Drop rows where the target itself is missing.
        target_missing = y.isna().sum()
        if target_missing > 0:
            valid_indices = ~y.isna()
            X = X[valid_indices]
            y = y[valid_indices]
        numeric_cols = X.select_dtypes(include=[np.number]).columns
        if len(numeric_cols) == 0:
            raise ValueError("No numeric feature columns found.")
        for col in numeric_cols:
            if X[col].isna().any():
                mean_val = X[col].mean()
                X[col] = X[col].fillna(mean_val)          # mean imputation for numeric columns
        non_numeric_cols = X.select_dtypes(exclude=[np.number]).columns
        for col in non_numeric_cols:
            if X[col].isna().any():
                mode_val = X[col].mode()[0]
                X[col] = X[col].fillna(mode_val)          # mode imputation for categorical columns
        scaler = StandardScaler()
        X_scaled = scaler.fit_transform(X[numeric_cols])  # scale the numeric features
        return X_scaled, y, scaler
    except Exception as e:
        print(f"Preprocessing failed: {e}")
        return None
import numpy as np
import plotly.graph_objects as go
import streamlit as st
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

def calculate_metrics(y_true, y_pred):
    # Regression metrics reported in the UI (helper names assumed for these fragments).
    r2 = r2_score(y_true, y_pred)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    mae = mean_absolute_error(y_true, y_pred)
    return {'R2': r2, 'RMSE': rmse, 'MAE': mae}

def show_evaluation_plots(y_true, y_pred, fig):
    # `fig` is the actual-vs-predicted Plotly figure built earlier in the app.
    try:
        st.download_button(
            label="Download Actual vs Predicted Plot",
            data=fig.to_html(),
            file_name="actual_vs_predicted.html",
            mime="text/html"
        )
        # Residual plot: predicted values against residuals.
        residuals = np.asarray(y_true) - np.asarray(y_pred)
        fig = go.Figure()
        fig.add_trace(go.Scatter(x=y_pred, y=residuals, mode='markers',
                                 marker=dict(color='blue')))
        fig.update_layout(
            title='Residual Plot',
            xaxis_title='Predicted Values',
            yaxis_title='Residuals'
        )
        st.plotly_chart(fig)
    except Exception as e:
        st.error(f"An error occurred: {str(e)}")