Big Data Progress Report (1)
Introduction
Our project focuses on developing a scalable machine learning pipeline using Apache Spark to analyze
wind turbine SCADA data for early fault detection. In this report, we share what we’ve accomplished
so far, what we’re currently working on, and our plans moving forward, along with the challenges
we’ve faced and how we’ve addressed them. We plan to make our progress and code publicly available
on GitHub[2], including the dataset, pre-processing scripts, and machine learning models used for
fault detection, to support reproducibility and collaboration.
Dataset Description
The dataset consists of 95 CSV files, containing 89 years of SCADA time series data distributed
across 36 different wind turbines from three wind farms: A, B, and C. Due to confidentiality, the data
has been anonymized.
Each dataset contains time series data for one wind turbine and is divided into training and
prediction data. The features include 10-minute averages, minima, maxima, and standard deviations of
sensor measurements. Additional columns provide metadata, such as timestamps, wind turbine IDs, and
operational statuses.
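To make the feature layout concrete, the sketch below shows how the four statistics for one sensor could be derived from the raw readings inside a single 10-minute window. The readings are hypothetical, and the use of the population standard deviation is an assumption; the dataset description does not say which variant was used.

```python
from statistics import mean, pstdev

def aggregate_window(samples):
    """Summarize the raw readings of one 10-minute window for one sensor."""
    return {
        "avg": mean(samples),
        "min": min(samples),
        "max": max(samples),
        # Assumption: population standard deviation; the source does not specify.
        "std": pstdev(samples),
    }

# Hypothetical raw readings of one sensor within a single window.
window = [1480.0, 1500.0, 1520.0, 1500.0]
row = aggregate_window(window)
print(row)
```

Each sensor therefore contributes four columns per timestamp, which is one reason the data is high-dimensional.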
Each dataset includes exactly one event, which can either be an anomaly event (indicating potential
failure) or normal operation. Events are further detailed in event information files, which provide labels,
start and end timestamps, and additional root cause descriptions for anomalies. The datasets also contain
feature descriptions outlining sensor names, measurement statistics, units, and additional attributes like
whether the sensor represents an angle or a counter.
The overall dataset is roughly balanced, with 44 datasets containing labeled anomaly events and 51
representing normal behavior. This balance makes it suitable for training models to distinguish
between normal and faulty operations.
Work Completed
We started by sourcing the dataset[1] from Kaggle and diving into it to understand its structure,
features, and scope. This involved calculating summary statistics for key variables to get a sense of
their distributions and ranges. Next, we set up an Apache Spark environment to handle large-scale
data efficiently. This included configuring the Spark cluster and ensuring connectivity for seamless data
processing.
We combined data from multiple files to create unified datasets. Event information was merged
with the main datasets using the event_id column, enriching the data with additional labels and
descriptions. Subsequently, we addressed missing values by identifying gaps and imputing them with
the mean of their respective columns. This approach minimized data loss and preserved overall
trends while keeping the computation inexpensive.
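The two steps described above (joining event information on event_id and filling gaps with column means) can be illustrated with a small plain-Python sketch. It is not the project's Spark code: the column names sensor_0_avg and event_label and the sample rows are hypothetical. In Spark, the equivalent steps would likely use DataFrame.join and pyspark.ml.feature.Imputer with the mean strategy.

```python
from statistics import mean

def attach_event_info(rows, events, key="event_id"):
    """Left-join event metadata onto sensor rows via the shared key."""
    by_id = {event[key]: event for event in events}
    return [{**row, **by_id.get(row[key], {})} for row in rows]

def impute_column_means(rows, columns):
    """Replace None in each listed column with the mean of that column's observed values."""
    filled = [dict(row) for row in rows]
    for col in columns:
        observed = [row[col] for row in rows if row[col] is not None]
        if not observed:
            continue  # nothing to average; leave the gaps in place
        col_mean = mean(observed)
        for row in filled:
            if row[col] is None:
                row[col] = col_mean
    return filled

# Hypothetical sample data: three sensor rows and two event records.
rows = [
    {"event_id": 1, "sensor_0_avg": 10.0},
    {"event_id": 1, "sensor_0_avg": None},
    {"event_id": 2, "sensor_0_avg": 14.0},
]
events = [
    {"event_id": 1, "event_label": "anomaly"},
    {"event_id": 2, "event_label": "normal"},
]

merged = attach_event_info(rows, events)
clean = impute_column_means(merged, ["sensor_0_avg"])
print(clean[1])
```

The missing reading in the second row is replaced by 12.0, the mean of the two observed values, and every row carries the label of its event.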
During this process, we encountered several challenges. One of the biggest challenges was managing
the computational overhead during preprocessing, especially given the high-dimensional nature of the
data. Additionally, handling missing or inconsistent sensor readings without introducing bias proved
tricky. Another ongoing effort has been identifying the most predictive features while ensuring the
models remain generalizable. These challenges have shaped our approach and have led us to implement
various solutions to enhance the model’s performance.
Finally, we conducted initial exploratory analysis to identify patterns and inconsistencies in the data.
This included visualizing the frequency and distribution of operational statuses, analyzing correlations
between key features, and summarizing event characteristics. These insights helped shape our subsequent
preprocessing and modeling strategies.
Progress Report December 22, 2024
Current Work
Our current focus is on the initial development of a machine learning model using only the Wind
Farm A dataset; we plan to incorporate data from Wind Farms B and C at later stages. We have begun
by splitting the data into training and test sets based on the train_test column and preparing
feature vectors using VectorAssembler in Spark MLlib. A RandomForestClassifier model is being
trained on the training data, and we are analyzing its predictions to identify potential
improvements. We are also evaluating feature importance to understand the contribution of different
variables and refining preprocessing techniques, such as scaling and normalization, to improve
model performance.
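The workflow described above could look roughly like the following PySpark sketch. It is an illustration under stated assumptions rather than the project's actual code: the metadata column names, the "train" marker value in the train_test column, and the numeric label column are all assumed.

```python
# Assumption: these metadata columns exist and are not model inputs (names are illustrative).
METADATA_COLS = {"time_stamp", "asset_id", "id", "train_test", "status_type_id", "label"}

def select_feature_columns(columns, exclude=METADATA_COLS):
    """Keep every column that is not listed as metadata, preserving order."""
    return [c for c in columns if c not in exclude]

def train_random_forest(df, feature_cols, label_col="label"):
    """Fit VectorAssembler + RandomForestClassifier on the rows marked as training data."""
    # Imported here so that select_feature_columns works without a Spark installation.
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import RandomForestClassifier
    from pyspark.ml.feature import VectorAssembler
    from pyspark.sql import functions as F

    # Assumption: train_test holds the string "train" for training rows.
    train_df = df.filter(F.col("train_test") == "train")
    test_df = df.filter(F.col("train_test") != "train")

    assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
    # label_col must be numeric (0.0 / 1.0) for Spark MLlib classifiers.
    forest = RandomForestClassifier(featuresCol="features", labelCol=label_col, seed=42)

    model = Pipeline(stages=[assembler, forest]).fit(train_df)
    return model, model.transform(test_df)

# Hypothetical column list, used only to demonstrate the feature selection helper.
feature_cols = select_feature_columns(
    ["time_stamp", "train_test", "sensor_0_avg", "sensor_1_avg", "label"]
)
print(feature_cols)
```

After fitting, model.stages[-1].featureImportances gives one importance value per entry of feature_cols, in the same order, which is one way to inspect the contribution of each variable.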
Through our exploratory analysis and initial model development, we have uncovered several insights
from the wind turbine SCADA dataset:
Operational Status Distribution: Normal operation is the most frequent operational status,
which aligns with the general expectation that wind turbines spend most of their time operating
normally. Statuses such as derated operation and downtime occur only sporadically, but they are
crucial for identifying early signs of malfunction or inefficiency.
RandomForestClassifier Performance: Our model’s performance on the Wind Farm A dataset
provides a promising starting point, with a training accuracy of 61.8% and a slightly higher test
accuracy of 63.4%. Because the test accuracy is not lower than the training accuracy, the model
does not appear to be overfitting, and the results suggest that it can distinguish between normal
operations and anomalies to some degree. However, there is still considerable room for
improvement. Moving forward, we plan to refine preprocessing steps, experiment with different
models, and fine-tune hyperparameters to improve performance.
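Accuracy figures like the ones above are easier to interpret alongside precision, recall, and F1-score, all of which follow from the confusion-matrix counts. The sketch below computes them in plain Python. The labels are hypothetical and are not project results. In Spark, MulticlassClassificationEvaluator and BinaryClassificationEvaluator would likely serve the same purpose.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn

def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall, and F1 for the positive (anomaly) class."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical labels: 1 = anomaly, 0 = normal operation.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 1]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

In this toy example the accuracy is 62.5%, yet only one of the three anomalies is detected (recall of about 33%), which shows why accuracy alone can be misleading for fault detection.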
Future Work
Looking ahead, we plan to experiment with different machine learning models, including
Gradient-Boosted Trees and Support Vector Machines, to benchmark their performance and determine
the most effective model for our use case. We will perform hyperparameter tuning using techniques
like grid search to maximize model performance. Additionally, we aim to implement feature selection
techniques to identify the most predictive variables, which will help reduce computational overhead
and enhance model interpretability. We also intend to integrate anomaly detection techniques, such
as Isolation Forest, to complement traditional classification models. Furthermore, we plan to
incorporate a comprehensive model evaluation framework, including metrics like precision, recall,
F1-score, and area under the ROC curve, to ensure the robustness of our models. Finally, we will
document the entire process to ensure reproducibility and facilitate knowledge sharing within the
team and with stakeholders.
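As an illustration of the grid search mentioned above, the sketch below enumerates every combination of a small hyperparameter grid. numTrees and maxDepth are real RandomForestClassifier parameters in Spark MLlib, but the candidate values are hypothetical. In Spark, ParamGridBuilder and CrossValidator from pyspark.ml.tuning would likely handle both the enumeration and the evaluation.

```python
from itertools import product

def expand_grid(grid):
    """Return one dict per combination of the candidate values in the grid."""
    names = list(grid)
    return [
        dict(zip(names, values))
        for values in product(*(grid[name] for name in names))
    ]

# Hypothetical candidate values for two RandomForestClassifier parameters.
param_grid = {"numTrees": [20, 50, 100], "maxDepth": [5, 10, 15]}
candidates = expand_grid(param_grid)

for params in candidates:
    # Each candidate would be trained and scored (for example by F1-score) here.
    print(params)
```

The number of candidates is the product of the list lengths (here 3 x 3 = 9), so the grid should stay small when each fit runs over the full SCADA dataset.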
Conclusion
In this report, we have outlined the progress made in developing a machine learning pipeline for
early fault detection in wind turbines using SCADA data. Our initial model shows a promising
start: the results indicate some ability to distinguish between normal and anomalous operations,
while the performance leaves room for improvement in both preprocessing and modeling. By adopting
a more comprehensive evaluation strategy and continuously refining our approach, we aim to achieve
a more robust model for early fault detection, ultimately contributing to more reliable and
efficient operation of wind turbines.
References
[1] Bachmann, Janio. "Wind Turbine SCADA Data For Early Fault Detection." Kaggle Notebook, 2024.
https://www.kaggle.com/datasets/azizkasimov/wind-turbine-scada-data-for-early-fault-detection
[2] "Wind Turbine Fault Detection." Our GitHub Repository.
https://github.com/beyzanurkeskin/Wind-Turbine-Fault-Detection