Abstract
This internship project in Data Science focused on leveraging
Python programming and various machine learning techniques to
analyze and predict customer behavior within the e-commerce
domain. The primary objectives were to enhance customer
segmentation, optimize marketing strategies, and improve the overall
user experience.
Organization Info
Bharat Intern is a software development and IT services company.
They use their technical expertise and industry experience to help
clients anticipate what's next.
Methodologies:
We follow a structured methodology for our projects, spanning from
designing the solution to the implementation phase. A well-planned
project reduces delivery time and any additional ad-hoc costs to our
clients, so we dedicate the majority of our time to understanding our
clients' business and gathering requirements. This ground-up approach
helps us deliver not only the solution our clients need but also add
value to their investments.
Key parts of the report:
Under each division, we further provide specific industry solutions in
focused domains with cutting-edge technologies.
INDEX
S.no CONTENTS
1. Introduction
1.1 Modules
2. Analysis
3. Software requirements specifications
4. Technology
4.1 PYTHON
4.2 ML LIBRARIES
4.3 NUMPY & PANDAS
4.4 MATPLOTLIB & SEABORN
5. Coding
6. Screenshots
7. Conclusion
8. Bibliography
Learning Objective/Internship Objective
INTRODUCTION
Predicting the fate of the RMS Titanic, a tragic and iconic maritime disaster
that occurred in 1912, has long captivated the imaginations of data scientists and machine
learning enthusiasts. In this analysis, we will embark on a journey to forecast whether
passengers on the ill-fated Titanic survived or perished, based on a variety of features such as
age, gender, class, and more. By delving into the realm of predictive analytics, we hope to gain
insights into the factors that influenced survival rates and develop a model that can make
accurate predictions. This endeavor not only demonstrates the power of data-driven
decision-making but also pays tribute to the memories of those who were aboard the Titanic,
revealing the lessons that history can teach us through the lens of modern data science.
Data science plays a pivotal role in predicting the fate of passengers on the Titanic. Here are
some key ways in which data science contributes to this analysis:
1. Data Collection: Data scientists collect and curate the historical dataset containing
information about Titanic passengers, including details like age, gender, class, ticket fare, cabin,
and whether they survived or not. This dataset forms the foundation for analysis.
2. Data Preprocessing: The collected data may be messy and incomplete. Data scientists
clean and preprocess the dataset, handling missing values and outliers. They might also
perform feature engineering to create new features that can improve prediction accuracy.
3. Exploratory Data Analysis (EDA): EDA is a crucial step in understanding the data. Data
scientists use various visualization and statistical techniques to explore the relationships and
patterns within the data. EDA helps in identifying which features are likely to have an impact on
survival.
4. Feature Selection: Not all features are equally important. Data scientists use techniques to
select the most relevant features that influence the prediction. For example, they might find that
a passenger's gender or class is more indicative of survival.
5. Model Building: Data scientists apply machine learning algorithms to build predictive
models. Classification algorithms, such as logistic regression, decision trees, random forests,
and neural networks, are commonly used to predict whether a passenger survived or not based
on the selected features.
6. Model Evaluation: Models need to be evaluated to ensure their accuracy and generalization
to new data. Data scientists use metrics like accuracy, precision, recall, and F1-score to assess
the performance of their models. They also employ techniques like cross-validation to estimate
how well the model will perform on unseen data.
7. Hyperparameter Tuning: Data scientists fine-tune model parameters to optimize
performance. This process involves adjusting the parameters of the chosen algorithms to find
the best combination for accurate predictions.
8. Model Deployment: Once a satisfactory model is built, it can be deployed to make
predictions on new data. This could be used to assess the survival probabilities of passengers
who were not in the original dataset.
9. Interpretability: Data scientists may also use interpretability techniques to understand why
a model is making certain predictions. This can provide valuable insights into the factors that
influence survival.
10. Continuous Improvement: Data science is not a one-time task. It involves iterative
processes of refining models and incorporating new data or insights. This continuous
improvement ensures that predictions remain accurate and relevant over time.
In the case of the Titanic, data science techniques allow us to extract meaningful
information from a historical dataset and make educated guesses about the survival of
passengers, shedding light on the factors that contributed to their fates. This analysis serves as
a powerful demonstration of how data science can be applied to historical events, offering new
perspectives on the past and providing valuable insights.
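To make these steps concrete, below is a minimal end-to-end sketch of such a pipeline in
Python using scikit-learn. The file path, imputation strategy, and feature subset are illustrative
assumptions, not the project's exact configuration.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Load the historical passenger data (path is an assumption).
train = pd.read_csv('/content/train.csv')

# Minimal preprocessing: impute missing ages and encode gender numerically.
train['Age'] = train['Age'].fillna(train['Age'].median())
train['Sex'] = train['Sex'].map({'male': 0, 'female': 1})

# Illustrative feature subset; the full analysis may use more features.
X = train[['Pclass', 'Sex', 'Age', 'SibSp', 'Fare']]
y = train['Survived']

# Hold out a test set so evaluation reflects performance on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Report accuracy, precision, recall, and F1-score on the held-out set.
print(classification_report(y_test, model.predict(X_test)))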
1.1 MODULE DESCRIPTION
Pandas:
● Description: Pandas is a powerful data manipulation library in
Python. It provides easy-to-use data structures and data
analysis tools that are essential for handling and preprocessing
datasets.
NumPy:
● Description: NumPy is a fundamental package for numerical
computing in Python. It provides support for arrays and
matrices, which are crucial for many mathematical and
statistical operations.
Scikit-Learn:
● Description: Scikit-Learn is a versatile machine learning library
in Python. It offers a wide range of algorithms and tools for
tasks like classification, regression, clustering, and model
evaluation.
Matplotlib and Seaborn:
● Description: Matplotlib and Seaborn are visualization libraries in
Python. They enable the creation of static, animated, and
interactive visualizations to help analyze and communicate
insights from the data.
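As a brief illustration of how these modules fit together, the following
sketch (illustrative, not the project's exact code; the file path is an
assumption) loads data with Pandas, summarizes it with NumPy, and
visualizes it with Matplotlib and Seaborn.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Pandas: load the dataset into a DataFrame.
df = pd.read_csv('/content/train.csv')

# NumPy: compute a numerical summary while ignoring missing values.
mean_age = np.nanmean(df['Age'].to_numpy())
print(f'Mean passenger age: {mean_age:.1f}')

# Seaborn/Matplotlib: visualize survival counts by gender.
sns.countplot(x='Survived', hue='Sex', data=df)
plt.title('Survival counts by gender')
plt.show()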
SYSTEM ANALYSIS
System analysis for Titanic prediction involves a structured approach to understanding,
designing, and implementing a system that can accurately predict the survival or non-survival
of passengers on the Titanic. Here's a breakdown of the system analysis process:
1. Understanding the Problem:
○ Define the problem clearly: The primary objective is to predict whether a
passenger survived or not based on various attributes.
○ Understand the context: Learn about the historical context of the Titanic, its
passengers, and the available dataset.
2. Data Collection:
○ Identify data sources: Gather the Titanic dataset, which contains passenger
information.
○ Verify data quality: Ensure that the data is accurate, complete, and relevant.
3. Requirements Analysis:
○ Identify stakeholders: Determine who will use the prediction model (e.g.,
researchers, historians, or data scientists).
○ Gather user requirements: Understand what users expect from the system, such
as prediction accuracy and interpretability.
4. System Design:
○ Define the system architecture: Decide on the software tools, databases, and
frameworks to use for data processing and modeling.
○ Choose the modeling approach: Select the machine learning algorithms to build
predictive models.
○ Design the user interface (if applicable): Create a user-friendly interface for users
to interact with the system.
5. Data Preprocessing:
○ Data cleaning: Handle missing values, outliers, and inconsistencies in the
dataset.
○ Feature engineering: Create new features or transform existing ones to improve
predictive power.
○ Data splitting: Divide the dataset into training and testing sets for model
evaluation.
6. Model Development:
○ Select machine learning algorithms: Choose algorithms suitable for binary
classification, such as logistic regression, decision trees, or random forests.
○ Model training: Train the selected models on the training data.
○ Hyperparameter tuning: Optimize model parameters to improve prediction
performance (a sketch of this step appears after this list).
7. Model Evaluation:
○ Use appropriate evaluation metrics: Assess the model's performance using
metrics like accuracy, precision, recall, F1-score, and ROC AUC.
○ Cross-validation: Employ techniques like k-fold cross-validation to estimate the
model's performance on unseen data.
8. System Implementation:
○ Develop the system: Implement the data preprocessing, modeling, and
evaluation steps into a functional system.
○ User interface (if applicable): Create a user-friendly interface for users to input
data and receive predictions.
9. Testing and Validation:
○ Test the system: Ensure that the system works as expected and provides
accurate predictions.
○ Validate against real-world data (if available): If new Titanic passenger data
become available, validate the system's predictions against them.
10. Deployment:
○ Deploy the system for regular use or research purposes.
○ Monitor the system: Continuously monitor its performance and address any
issues or changes in data distribution.
11. Documentation:
○ Create documentation: Document the system's architecture, algorithms, and
processes.
○ User documentation: Prepare guides for users on how to interact with the system
and interpret results.
12. Maintenance and Improvement:
○ Regularly update the system to accommodate new data, algorithms, or user
requirements.
○ Keep improving the system's accuracy and efficiency.
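To illustrate steps 6 and 7 above (model development and evaluation), here is a minimal
sketch of hyperparameter tuning with k-fold cross-validation, reusing the X_train and y_train
variables from the sketch in the Introduction; the random forest and parameter grid are
assumptions, not the project's actual choices.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

# Illustrative parameter grid; the values are assumptions.
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [3, 5, None],
}

# 5-fold cross-validated grid search over the training data.
search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, cv=5, scoring='roc_auc')
search.fit(X_train, y_train)
print('Best parameters:', search.best_params_)

# Estimate generalization to unseen data with k-fold cross-validation.
scores = cross_val_score(search.best_estimator_, X_train, y_train,
                         cv=5, scoring='roc_auc')
print('Mean ROC AUC:', scores.mean())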
System analysis for Titanic prediction is a comprehensive process that involves understanding
the problem, designing the system, implementing it, and ensuring its accuracy and reliability.
The application of data science and machine learning techniques to historical data, like the
Titanic dataset, can yield valuable insights and serve as a model for similar predictive analysis
in other domains.
3. SOFTWARE REQUIREMENTS
SPECIFICATIONS
3.1 System configurations
The software requirements specification is produced at the culmination
of the analysis task. The function and performance allocated to
software as part of system engineering are refined by establishing a
complete information description, a detailed functional description, a
representation of system behavior, an indication of performance and
design constraints, appropriate validation criteria, and other information
pertinent to requirements.
Software Requirements
• Operating System: Windows 10
• Coding Language: Python
• Front-End: Google Colab
Hardware Requirements
• Hard Disk: 512 GB HDD + 256 GB SSD
• RAM: 4 GB
4. TECHNOLOGY USED
CODING
TITANIC CLASSIFICATION
IMPORT LIBRARIES
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
THE DATA
train = pd.read_csv('/content/train.csv')
MISSING DATA
# Visualize missing values: bright bands mark NaNs (Age and Cabin stand out).
sns.heatmap(train.isnull(), yticklabels=False, cbar=False, cmap='viridis')

sns.set_style('whitegrid')

# Count survivors versus non-survivors.
sns.countplot(x='Survived', data=train, palette='RdBu_r')

# Distribution of siblings/spouses aboard.
sns.countplot(x='SibSp', data=train)
DATA CLEANING
plt.figure(figsize=(12, 7))

# Age varies with passenger class, which motivates class-based imputation.
sns.boxplot(x='Pclass', y='Age', data=train, palette='winter')

def impute_age(cols):
    # Fill a missing age with a typical age for the passenger's class.
    Age = cols[0]
    Pclass = cols[1]
    if pd.isnull(Age):
        if Pclass == 1:
            return 37
        elif Pclass == 2:
            return 29
        else:
            return 24
    else:
        return Age

# Cabin is mostly missing, so the column is dropped entirely.
train.drop('Cabin', axis=1, inplace=True)

from sklearn.linear_model import LogisticRegression
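The listing defines impute_age and imports LogisticRegression but omits the steps that
connect them. A plausible reconstruction follows; the apply call is the conventional way to use
such a function, while the feature encoding and split parameters are assumptions rather than
the project's exact code.

# Apply the class-based age imputation defined above.
train['Age'] = train[['Age', 'Pclass']].apply(impute_age, axis=1)

# Encode 'Sex' numerically and select features (illustrative choices).
train['Sex'] = train['Sex'].map({'male': 0, 'female': 1})
X = train[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare']]
y = train['Survived']

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=101)

# Fit the logistic regression model imported above.
logmodel = LogisticRegression(max_iter=1000)
logmodel.fit(X_train, y_train)
predictions = logmodel.predict(X_test)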
EVALUATION
from sklearn.metrics import classification_report, confusion_matrix
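These metrics would then be applied to the held-out predictions, for example (reusing the
predictions variable from the reconstruction above):

print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))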
ANN
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

test = pd.read_csv('/content/test.csv')
# Predict survival probabilities and threshold at 0.5 to get class labels.
test_prediction = ann.predict(test)
test_prediction = [1 if y >= 0.5 else 0 for y in test_prediction]
new_test.head()
SCREENSHOTS
[Figure: Missing data]
[Figure: Data process]
[Figure: Test]
CONCLUSION
In conclusion, Titanic prediction is a compelling application of data science and machine
learning techniques to a historical event that continues to captivate the public's imagination.
Through a systematic analysis of the Titanic dataset, we can gain insights into the factors that
influenced passenger survival and build predictive models that can make educated guesses
about their fates.
In essence, Titanic prediction is a testament to the power of data-driven analysis in shedding
light on past events and understanding the intricate interplay of factors that influenced the
survival of the Titanic's passengers. It serves as a reminder that data science can uncover
hidden insights and contribute to a deeper appreciation of history.
BIBLIOGRAPHY
The following works were consulted during the analysis and execution phases of the project:
1. M. Lenzerini, "Data integration: A theoretical perspective," in PODS, 2002, pp. 233–246.
3. R. Hughes, Agile Data Warehousing: Delivering World-Class Business Intelligence Systems
Using Scrum and XP. iUniverse, 2008.
4. Y. Chen, S. Alspaugh, and R. Katz, "Interactive analytical processing in big data systems: A
cross-industry study of MapReduce workloads," Proceedings of the VLDB Endowment, vol. 5,
no. 12, pp. 1802–1813, 2012.