Big Data
STUDENT ID:
STUDENT NAME:
TABLE OF CONTENTS
ABSTRACT
1. INTRODUCTION
2. DATA ACQUISITION
3. DATA STORAGE
4. DATA ANALYSIS
5. RECOMMENDATIONS
6. CONCLUSION
REFERENCES
ABSTRACT
Data is important to many businesses, affecting a wide range of activities and procedures. The notion of "Big Data" arose from the exponential growth in data volume over time. Big Data comprises data that is frequently created in real time and in far larger volumes than typical datasets. As traditional techniques of data analysis and management fail to keep up with the sheer quantity and velocity of Big Data, collecting and storing such large and complex data have become major challenges in recent years. Data wrangling, which entails addressing issues of data quality, completeness, and compatibility, is one area of emphasis within Big Data analysis. In this report, we apply Big Data techniques to loan default analysis and prediction, using a Lending Club dataset to gather insightful knowledge and identify significant trends. Many operations, such as the analysis of sizable transactional databases, have been transformed by the arrival of Big Data, and the use of Big Data technologies and approaches within enterprises has increased significantly as a result. By harnessing the power of Big Data, organizations can discover subtle patterns and gain valuable insights that were previously unavailable. At the same time, it is important to understand that Big Data also presents substantial computational challenges.
1. INTRODUCTION
1. INTRODUCTION
Big Data has changed the way corporations handle and analyze massive quantities of data, transforming a number of sectors. Thanks to its capacity to acquire, store, and analyze enormous amounts of information, Big Data has created new opportunities for obtaining insights. This report examines how Big Data is affecting many businesses, stressing both its transformational potential and the difficulties it poses (Samaranayake, 2018). Modern tools and procedures for data management and analysis have been developed as a result of the exponential rise of data in recent years. The sheer volume, velocity, and variety of data created by many sources, including social media, sensors, and online transactions, are too much for traditional methods to handle. Big Data solutions, by contrast, offer scalable infrastructure, distributed computing frameworks, and advanced algorithms to maximize the value of data (Gray, 2018). While Big Data presents immense opportunities, it also poses challenges that need to be addressed:
Data privacy and security: Handling vast amounts of sensitive data requires robust security measures and strict access controls.
Data quality and integration: Ensuring data accuracy, completeness, and compatibility across disparate sources is a persistent challenge.
Scalability and infrastructure: Managing the infrastructure required to store, process, and analyze massive datasets demands careful planning and investment.
Finally, big data has emerged as a transformative force across industries, empowering
organizations to extract valuable insights, optimize processes, and make data-driven decisions.
By leveraging advanced technologies, such as distributed computing, machine learning, and data
visualization, organizations can unlock the full potential of Big Data. However, addressing
challenges related to data privacy, quality, and infrastructure remains crucial for successful
implementation. As technology continues to evolve, Big Data will continue to reshape industries and open new opportunities for innovation.
2. DATA ACQUISITION
Data acquisition refers to the process of collecting data from various sources to prepare a big
data dataset. In its raw form, data from the real world cannot be easily understood by computer
systems. Therefore, data acquisition involves converting the physical parametric data into a
digital format that computers can comprehend (Ahlburg, Arfaoui, Arling, Augustine, Barney,
Benoit & Wieduwilt, 2020). This digital format typically utilizes integers to represent the data.
Traditionally, organizations primarily focused on internal data sources for information. However,
with the advent of big data analytics and predictive analysis, organizations have realized the
value of incorporating external data to facilitate digital transformation. This necessitates the acquisition and integration of such external data. It is important to note that "data acquisition" is sometimes mistakenly used to refer only to the data generated within the organization, which is a misconception, since internal data alone rarely gives a complete picture.
The dataset used for Loan Default Analysis and Prediction was obtained from Kaggle's Lending
Club Python dataset repository. The data source provides comprehensive information necessary
for analyzing and predicting loan defaults. More details about the dataset can be accessed on the Kaggle repository page.
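As a minimal illustration of this acquisition step, the snippet below loads a local copy of the Lending Club export with pandas. The file name loan_data.csv is an assumption for illustration; the actual export from Kaggle may be named differently.

import pandas as pd

# Assumed local file name for the Lending Club CSV downloaded from Kaggle;
# adjust the path to wherever the export was saved.
CSV_PATH = "loan_data.csv"

# low_memory=False lets pandas infer column types over the whole file,
# which is safer for a wide dataset with mixed numeric and text columns.
df = pd.read_csv(CSV_PATH, low_memory=False)

print(df.head())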
3. DATA STORAGE
Currently, the dataset is stored on a local resource, which may have limitations in terms of scalability, accessibility, and data security. However, considering a "What-if?" scenario in which the data must be stored in a cloud-based system or data warehouse, several steps and considerations would be involved:
Data Assessment: Evaluate the size and structure of the dataset to estimate the storage requirements. Consider factors such as data volume, frequency of updates, and data retention policies.
Cloud Provider Selection: Choose a reliable cloud service provider that offers scalable storage, high availability, and robust security measures. Consider providers such as Amazon Web Services (AWS), Microsoft Azure, or Google Cloud Platform.
Data Transfer: Transfer the dataset from the local resource to the cloud-based system or
data warehouse. This may involve uploading the dataset to the cloud storage solution,
ensuring secure data transfer protocols, and optimizing the transfer process to minimize
downtime.
Data Modeling and Schema Design: Design a suitable data model and schema for the
cloud-based system or data warehouse. This may involve defining tables, relationships,
and data organization structures based on the specific requirements of the analysis or
reporting tasks.
Data Integration and ETL (Extract, Transform, Load): Implement an ETL process to migrate and integrate the data into the cloud-based system or data warehouse, transforming it into the target schema along the way.
Security and Access Controls: Establish appropriate security measures, such as encryption, authentication, and role-based access controls, to protect the data from unauthorized access or breaches.
Scalability and Performance Optimization: Leverage the scalability features of the cloud-
based system or data warehouse to accommodate future growth in data volume and user
demands. Optimize the data storage and retrieval processes to ensure efficient performance as data volumes grow.
Backup and Disaster Recovery: Implement backup and disaster recovery mechanisms to
ensure data resilience and business continuity. Regularly back up the data stored in the cloud-based system or data warehouse and establish procedures for restoring data in case of failure or data loss.
In summary, transitioning from local storage to a cloud-based system or data warehouse involves
assessing the dataset, selecting a suitable cloud provider, transferring the data, designing
appropriate data models and schemas, implementing data integration processes, ensuring security
measures, optimizing performance, and establishing backup and disaster recovery mechanisms.
These steps enable organizations to leverage the scalability, accessibility, and security benefits
offered by cloud-based storage solutions for their data analysis and reporting needs.
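As one concrete illustration of the data transfer step, the sketch below uploads the local CSV to Amazon S3 with boto3. The bucket name and object key are hypothetical, and AWS credentials are assumed to be configured in the environment.

import boto3  # AWS SDK for Python; credentials are read from the environment or ~/.aws

# Hypothetical bucket and object key, used purely for illustration.
BUCKET_NAME = "loan-default-analytics"
OBJECT_KEY = "raw/loan_data.csv"

s3 = boto3.client("s3")

# Server-side encryption is requested so the data is encrypted at rest,
# in line with the security considerations discussed above.
s3.upload_file(
    Filename="loan_data.csv",
    Bucket=BUCKET_NAME,
    Key=OBJECT_KEY,
    ExtraArgs={"ServerSideEncryption": "AES256"},
)
print(f"Uploaded loan_data.csv to s3://{BUCKET_NAME}/{OBJECT_KEY}")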
4. DATA ANALYSIS
The dataset contains 202 entries and consists of 142 columns. The columns represent various
attributes related to loan default analysis and prediction. Some of the columns have missing
values. The dataset includes information such as loan amount, interest rate, employment details,
home ownership, annual income, credit history, payment details, and loan status. It also provides
data on borrower demographics, credit scores, and financial indicators. The dataset includes a
mix of numerical and categorical data types. Further analysis and processing can be performed to
gain insights and develop models for loan default prediction based on this dataset.
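A quick way to confirm these characteristics is to inspect the DataFrame directly. A minimal sketch, assuming the dataset has already been loaded into df as in the acquisition step above:

# Overall dimensions: number of loans (rows) and attributes (columns).
print(df.shape)

# Mix of numerical and categorical column types.
print(df.dtypes.value_counts())

# Columns with missing values, sorted by how many entries are absent.
missing = df.isna().sum().sort_values(ascending=False)
print(missing[missing > 0].head(20))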
Data wrangling and cleaning are crucial steps in the data preparation process. Wrangling involves transforming and reshaping the data to make it suitable for analysis, while cleaning refers to identifying and dealing with missing, erroneous, or inconsistent data. In the code used here, df_filtered is created by dropping columns that have all missing values using dropna(axis=1, how='all'); columns that are not needed for the analysis are then dropped using drop(). Finally, dropna() with inplace=True is used to remove any remaining rows with missing values. These steps ensure that the resulting DataFrame, df_filtered, is cleaned and ready for analysis, as sketched below.
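The cleaning snippet itself is not reproduced in this report, so the following is a minimal reconstruction of it, assuming the data is in df. The specific columns passed to drop() are placeholders, since the original list is not shown.

# Drop columns that are entirely empty.
df_filtered = df.dropna(axis=1, how="all")

# Drop columns that are not needed for the analysis.
# The exact list used originally is not shown; these names are placeholders.
columns_to_drop = ["url", "zip_code", "policy_code"]
df_filtered = df_filtered.drop(columns=[c for c in columns_to_drop if c in df_filtered.columns])

# Remove any remaining rows that still contain missing values.
df_filtered.dropna(inplace=True)

print(df_filtered.shape)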
Descriptive statistics provide a statistical summary of the data, offering an overview of the dataset. They include measures such as the mean, median, and mode, which collectively represent the central tendencies of the data. While descriptive statistics provide a broad overview of the data, they do not by themselves uncover deeper insights. Descriptive statistics fall into two broad groups. Measures of dispersion, such as the variance, standard deviation, and quartiles, indicate the spread of the data and help show how it is distributed. Measures of central tendency describe the center of the data distribution and include the mean, median, and mode.
attribute                     count  mean       std        min  25%      50%       75%       max
fico_range_high               201    709.2239   28.6762    664  689      704       729       794
inq_last_6mths                201    0.840796   0.992234   0    0        1         1         5
open_acc                      201    8.81592    3.30771    2    7        8         11        20
pub_rec                       201    0.024876   0.156135   0    0        0         0         1
revol_bal                     201    13217.32   10289.3    0    6842     11095     16576     74351
total_acc                     201    19.47761   9.168465   3    12       18        26        51
out_prncp                     201    0          0          0    0        0         0         0
out_prncp_inv                 201    0          0          0    0        0         0         0
total_pymnt                   201    12198.01   7734.643   0    6858.7   10904.84  15451.16  40009.01
total_pymnt_inv               201    12067.64   7581.4     0    6858.7   10904.84  15352.48  40009.01
total_rec_prncp               201    9856.558   6456.934   0    5495.38  9000      12800     35000
total_rec_int                 201    2227.433   1976.093   0    858.7    1553.74   2949.9    10085.08
total_rec_late_fee            201    1.158122   5.294406   0    0        0         0         36.247
recoveries                    201    112.8629   472.9966   0    0        0         0         3874.79
collection_recovery_fee       201    12.88502   80.2134    0    0        0         0         670.8193
last_pymnt_amnt               201    2795.833   4393.42    0    240.64   536.81    3946.24   28412.43
last_fico_range_high          201    668.4279   76.69287   499  614      679       719       839
last_fico_range_low           201    649.6517   134.3322   0    610      675       715       835
collections_12_mths_ex_med    201    0          0          0    0        0         0         0
policy_code                   201    1          0          1    1        1         1         1
acc_now_delinq                201    0          0          0    0        0         0         0
chargeoff_within_12_mths      201    0          0          0    0        0         0         0
delinq_amnt                   201    0          0          0    0        0         0         0
pub_rec_bankruptcies          201    0.0199     0.140007   0    0        0         0         1
tax_liens                     201    0          0          0    0        0         0         0
The provided data consists of a summary statistics table for various attributes related to loans.
Count: The count column represents the number of observations available for each attribute. Most numerical attributes have 201 observations, with exceptions such as "dti" where some values are missing.
Mean: The mean column represents the average value for each attribute across the observations. For example, the average loan amount is approximately $11,546.66 and the average annual income is around $59,000.83.
Standard Deviation (Std): The standard deviation column measures the dispersion or variability of the data points around the mean. It provides information about the spread of the data; higher standard deviations indicate greater variability. For instance, the standard deviation of the revolving credit balance (about $10,289) is large relative to its mean, indicating wide variation across borrowers.
Minimum (Min): The minimum column represents the smallest value observed for each
attribute. It gives an idea about the lower boundary of the data. For example, the
minimum loan amount is $1,000, and the minimum FICO score range is 660.
25th Percentile (25%): The 25th percentile column represents the value below which 25% of the data falls. It provides information about the distribution of the data and is also known as the first quartile. For instance, 25% of the loan amounts are below $7,000.
50th Percentile (50%): The 50th percentile column represents the median value, which is
the middle value of the data. It indicates the point below which 50% of the data falls. For
example, the median loan amount is $10,000, and the median FICO score is 700.
75th Percentile (75%): The 75th percentile column represents the value below which 75%
of the data falls. It provides information about the distribution of the data and is also
known as the third quartile. For instance, 75% of the loan amounts are below $15,000.
Maximum (Max): The maximum column represents the largest value observed for each
attribute. It gives an idea about the upper boundary of the data. For example, the
maximum loan amount is $35,000, and the maximum FICO score is 790.
This summary statistics table provides a quick overview of the distribution and variation of the
loan numerical attributes. It can be helpful in understanding the range of values and identifying potential outliers or unusual patterns.
attribute               count  unique  top                                                                freq
term                    190    2       36 months                                                          143
int_rate                190    28      9.91%                                                              18
grade                   190    6       B                                                                  79
sub_grade               190    28      B1                                                                 18
emp_title               190    188     American Airlines                                                  2
emp_length              190    11      10+ years                                                          41
home_ownership          190    3       RENT                                                               129
verification_status     190    3       Not Verified                                                       75
issue_d                 190    1       Dec-11                                                             190
loan_status             190    2       Fully Paid                                                         154
pymnt_plan              190    1       n                                                                  190
url                     190    190     https://lendingclub.com/browse/loanDetail.action?loan_id=1077430   1
purpose                 190    12      debt_consolidation                                                 105
title                   190    129     Debt Consolidation Loan                                            19
zip_code                190    139     921xx                                                              4
addr_state              190    35      CA                                                                 46
earliest_cr_line        190    132     Sep-98                                                             4
revol_util              190    164     29.30%                                                             3
initial_list_status     190    1       f                                                                  190
last_pymnt_d            190    46      Jan-15                                                             54
last_credit_pull_d      190    71      May-20                                                             25
application_type        190    1       Individual                                                         190
hardship_flag           190    1       N                                                                  190
debt_settlement_flag    190    2       N                                                                  188
This summary statistics table provides a quick overview of the distribution and variation of the
loan categorical attributes. It can be helpful in understanding the range of values and identifying the most frequent categories in each column.
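Both kinds of summary table can be generated directly from the cleaned DataFrame. A minimal sketch, assuming df_filtered from the wrangling step above:

# Numerical summary: count, mean, std, min, quartiles, and max for numeric columns.
numeric_summary = df_filtered.describe().T
print(numeric_summary)

# Categorical summary: count, number of unique values, most frequent value, and its frequency.
categorical_summary = df_filtered.describe(include="object").T
print(categorical_summary)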
Predictive analysis, also known as predictive modeling, is a branch of data analytics that aims to
forecast or predict future outcomes based on historical data. It involves using statistical or
machine learning models to make predictions or classifications about unknown or future events.
In the context of the provided information, a predictive analysis was conducted using a decision
tree classifier model. Decision trees are a popular machine learning algorithm that uses a tree-
like structure to make decisions based on features or attributes of the data. The reported accuracy
of 100% indicates that the model predicted the outcomes perfectly for the given dataset.
Accuracy is a metric that measures the overall correctness of the model's predictions compared to
the actual outcomes. An accuracy of 100% suggests that the model classified all instances
correctly.
Precision and recall are performance metrics used for binary classification problems. Precision
measures the proportion of correctly predicted positive instances out of all instances predicted as
positive. Recall, also known as sensitivity, measures the proportion of correctly predicted
positive instances out of all actual positive instances. The reported precision of 100% suggests
that all instances predicted as positive were indeed positive. Similarly, the reported recall of
100% indicates that the model correctly identified all positive instances in the dataset. Attaining
100% accuracy, precision, and recall with a decision tree classifier is quite rare and may indicate
either an overfitting issue or potential data quality or sampling bias. It's important to carefully
evaluate the data, model, and evaluation process to ensure that the results are reliable and generalizable to unseen data.
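To make the evaluation concrete, the sketch below shows one way such a decision tree model could be trained and scored with scikit-learn. The target definition (loan_status equal to "Fully Paid"), the numeric-only feature selection, and the 70/30 split are assumptions for illustration; the original modelling code is not shown in the report.

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Assumed binary target: 1 for "Fully Paid", 0 for any other loan status.
y = (df_filtered["loan_status"] == "Fully Paid").astype(int)

# Assumed features: numeric columns only. Note that payment-related columns
# (e.g. total_rec_prncp, recoveries) can leak the outcome and inflate scores,
# which may explain the reported 100% metrics.
X = df_filtered.select_dtypes(include="number")

# Hold out a test set so metrics reflect unseen data rather than memorization.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))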
Diagnostic analysis is the examination and evaluation of data and results to identify patterns,
relationships, and potential issues, aiming to gain insights and make informed decisions based on
the findings.
Based on the given diagnostic analysis results, here is a summary of the findings:
Dataset Summary:
The dataset contains information related to loan applications, including various categorical and
numerical variables.
Categorical Variables:
There are several categorical variables in the dataset, including term, grade, sub_grade, emp_title, emp_length, home_ownership, verification_status, loan_status, and purpose.
Each categorical variable has different levels of uniqueness, with varying top values and
frequencies.
Loan Status:
The analysis shows that out of the 190 instances, 154 loans are labeled as "Fully Paid" and the remaining 36 fall under the second loan status category.
Model Performance:
The predictive model, trained using a decision tree classifier, achieved perfect performance on this dataset.
The accuracy, precision, and recall metrics all have a value of 100%, indicating that the model
accurately predicted the loan status for all instances in the dataset.
5. RECOMMENDATIONS
Validate the Model: Although the model has shown excellent performance on the current dataset, it is important to validate its performance on new, unseen data. This can be done by splitting the dataset into training and testing sets or by using cross-validation techniques such as k-fold cross-validation (see the sketch after this list).
Feature Importance: Determine the most important features that contribute to the accurate prediction of loan status. This can help in understanding the factors that significantly influence loan defaults.
Explore the decision tree structure to gain insights into the criteria used by the model to
classify loans. This can help in explaining the factors that contribute to loan approval or
rejection.
Data Quality and Reliability: Ensure the quality and reliability of the data used for training the model. Clean and preprocess the data, handle missing values and outliers, and ensure that the dataset is representative of the real-world scenario. High-quality, representative data is essential for trustworthy predictions.
Continuous Model Monitoring: As loan data and trends change over time, it is important to continuously monitor the model's performance. Regularly update the model with new data and evaluate its performance metrics. If the accuracy, precision, or recall starts to degrade, retrain or refine the model.
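As referenced in the validation recommendation above, the following is a minimal sketch of k-fold cross-validation and feature importance inspection, assuming the same X and y as in the earlier modelling sketch.

import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(random_state=42)

# 5-fold cross-validation gives a more honest estimate than a single split.
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print("Cross-validated accuracy per fold:", scores)
print("Mean accuracy:", scores.mean())

# Fit on the full data to inspect which attributes drive the predictions.
model.fit(X, y)
importances = pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importances.head(10))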
6. CONCLUSION
Based on the current analysis, the decision tree classifier has shown exceptional performance in
predicting loan status. However, it is important to note that the evaluation metrics alone
(accuracy, precision, recall) may not provide a complete picture of the model's performance. It is
necessary to consider other factors such as the dataset's representativeness, potential bias, and the
business context to make informed decisions. Further analysis, model validation, and ongoing
monitoring are recommended to ensure the model's reliability and effectiveness in real-world
scenarios. Additionally, domain expertise and collaboration with experts in the lending industry
can provide valuable insights and enhance the accuracy and interpretability of the model.
REFERENCES
[1] Ahlburg, P., Arfaoui, S., Arling, J.H., Augustin, H., Barney, D., Benoit, M., Bisanz, T., Corrin, E., Cussans, D., Dannheim, D. and Dreyling-Eschweiler, J., 2020. EUDAQ—a data acquisition software framework for common beam telescopes. Journal of Instrumentation, 15(01), p.P01038.
[2] Azeroual, O., 2020. Data wrangling in database systems: Purging of dirty data. Data, 5(2), p.50.
[3] Bathla, G., Rani, R. and Aggarwal, H., 2018. Comparative study of NoSQL databases for big data storage.
[4] Cartledge, C., 2018. ODU Big Data, Data Wrangling Boot Camp Software Overview, and Design.
[5] Clark, E.L., Resasco, J., Landers, A., Lin, J., Chung, L.T., Walton, A., Hahn, C., Jaramillo, T.F. and Bell, A.T., 2018. Standards and protocols for data acquisition and reporting for studies of the electrochemical reduction of carbon dioxide. ACS Catalysis.
[6] Contreras-Ochando, L., Ferri, C., Hernández-Orallo, J., Martínez-Plumed, F., Ramírez-Quintana, M.J. and Katayama, S., 2017. Domain specific induction for data wrangling automation.
[7] de Jesús Ramírez-Rivera, E., Díaz-Rivera, P., Ramón-Canul, L.G., Juárez-Barrientos, J.M., et al. Comparison of performance and quantitative descriptive analysis sensory profiling and its relationship to consumer liking between the artisanal cheese producers panel and the descriptive trained panel. Journal of Dairy Science.
[8] Fleuren, L.M., Klausch, T.L., Zwager, C.L., Schoonmade, L.J., Guo, T., Roggeveen, L.F., Swart, E.L., Girbes, A.R., Thoral, P., Ercole, A. and Hoogendoorn, M., 2020. Machine learning for the prediction of sepsis: a systematic review and meta-analysis of diagnostic test accuracy. Intensive Care Medicine, 46(3), pp.383-400.
[9] Alma Digit, S.R.L. A Cloud-Based System for Improving Retention Marketing Loyalty.
[10] Kim, D.W., Jang, H.Y., Kim, K.W., Shin, Y. and Park, S.H., 2019. Design characteristics of studies reporting the performance of artificial intelligence algorithms for diagnostic analysis of medical images. Korean Journal of Radiology.
[11] Loeb, S., Dynarski, S., McFarland, D., Morris, P., Reardon, S. and Reber, S., 2017. Descriptive Analysis in Education: A Guide for Researchers. NCEE 2017-4023. National Center for Education Evaluation and Regional Assistance.
[12] MacAvaney, S., Yates, A., Feldman, S., Downey, D., Cohan, A. and Goharian, N., 2021. Simplified data wrangling with ir_datasets. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval.
[13] Mazumdar, S., Seybold, D., Kritikos, K. and Verginadis, Y., 2019. A survey on data storage and placement methodologies for cloud-big data ecosystem. Journal of Big Data, 6(1), pp.1-37.
[14] McInnes, M.D., Moher, D., Thombs, B.D., McGrath, T.A., Bossuyt, P.M., Clifford, T., Gatsonis, C., Hooft, L. and Hunt, H.A., 2018. Preferred reporting items for a systematic review and meta-analysis of diagnostic test accuracy studies: the PRISMA-DTA statement. JAMA, 319(4), pp.388-396.
[15] Samaranayake, L., 2018. Big data is or big data are. British Dental Journal, 224, p.916. https://doi.org/10.1038/sj.bdj.2018.486
[16] Gray, M., 2018. Context for practice: The power of "Big Data". Journal of Wound, Ostomy and Continence Nursing.