
Acknowledgment

I would like to express my deepest gratitude to all those who made my internship in Data Science an enriching and insightful experience.

First and foremost, I would like to thank Ms. NEHA GOYAL for providing me with the opportunity to be a part of BNCET. Her guidance, mentorship, and unwavering support throughout this internship have been invaluable. Her expertise in the field of Data Science has not only broadened my understanding but also inspired me to delve deeper into the subject.

Her willingness to share knowledge and her patience in answering my numerous questions have been instrumental in my growth during this internship. The collaborative and dynamic work environment provided an ideal platform for learning and development.

Furthermore, I am grateful to my colleagues who offered their support and encouragement. Their insights and different perspectives greatly contributed to the success of our projects. I must also acknowledge the role of the various departments within CSE who facilitated my access to resources and provided a conducive working environment.

Lastly, I would like to thank my family and friends for their unwavering support and encouragement throughout this journey. I am truly grateful for the experience I gained during this internship and look forward to applying the knowledge and skills I have acquired in my future endeavors.

Sincerely,
ABHAY SHUKLA

Abstract

This internship project in Data Science focused on leveraging Python programming and various machine learning techniques to analyze and predict customer behavior within the e-commerce domain. The primary objectives were to enhance customer segmentation, optimize marketing strategies, and improve the overall user experience.

The project involved the collection and preprocessing of a diverse dataset containing customer demographics, browsing behavior, and purchase history. Exploratory Data Analysis (EDA) techniques were applied to gain insights into the underlying patterns and trends. Python libraries such as Pandas, NumPy, and Matplotlib were extensively utilized for data manipulation, visualization, and statistical analysis.

Organization Info
Bharat Intern is a software development and IT services company. They use their technical expertise and industry experience to help clients anticipate what's next.

Bharat Intern's goals include:

● Providing opportunities for leadership growth
● Academic achievement
● Student engagement in common interests

Methodologies:
We follow a structured methodology for our projects, which starts from designing the solution and continues through the implementation phase. A well-planned project reduces delivery time and any additional ad-hoc costs to our clients; hence we dedicate the majority of our time to understanding our clients' business and gathering requirements. This ground-up approach helps us deliver not only the solution to our clients but also add value to their investments.

Key parts of the report:
Under each division we further provide specific industry solutions in focused domains with cutting-edge technologies.

Benefits of the Company/Institution through our report:
Under each division we further provide specific industry solutions in focused domains with cutting-edge technologies. We emphasize building relationships with our clients by delivering projects on time and within budget.

INDEX
S.no CONTENTS Page no
1. Introduction......................................................................................6
   1.1 Modules....................................................................................8
2. Analysis..........................................................................................9
3. Software requirements specifications ...........................................11
4. Technology.....................................................................................12
   4.1 PYTHON..................................................................................12
   4.2 ML LIBRARIES .......................................................................12
   4.3 NUMPY & PANDAS.................................................................12
   4.4 MATPLOT & SEABORN...........................................................12
5. Coding.............................................................................................13
6. Screenshots.....................................................................................16
7. Conclusion......................................................................................20
8. Bibliography....................................................................................21

Learning Objective/Internship Objective

➤ Skill Development and Application:
● Enhance proficiency in programming languages (e.g., Python, R) and data analysis tools (e.g., SQL, Excel) through hands-on projects and tasks.
● Apply statistical and machine learning techniques to real-world data to derive actionable insights.

➤ Project Contribution and Completion:
● Actively participate in ongoing projects, contributing to tasks such as data cleaning, analysis, and modeling, and ensuring deliverables are met within set timelines.

➤ Professional Networking and Collaboration:
● Establish meaningful connections with colleagues, supervisors, and professionals in the field of data science, fostering a collaborative work environment.

➤ Presentation and Communication Skills:
● Develop the ability to effectively communicate complex technical findings to non-technical stakeholders through clear and concise reports, presentations, and visualizations.

➤ Problem Solving and Critical Thinking:
● Cultivate problem-solving skills by tackling challenges related to data quality, model optimization, and deriving meaningful insights from complex datasets.

INTRODUCTION

Predicting the fate of the RMS Titanic, a tragic and iconic maritime disaster that occurred in 1912, has long captivated the imaginations of data scientists and machine learning enthusiasts. In this analysis, we will embark on a journey to forecast whether passengers on the ill-fated Titanic survived or perished, based on a variety of features such as age, gender, class, and more. By delving into the realm of predictive analytics, we hope to gain insights into the factors that influenced survival rates and develop a model that can make accurate predictions. This endeavor not only demonstrates the power of data-driven decision-making but also pays tribute to the memories of those who were aboard the Titanic, revealing the lessons that history can teach us through the lens of modern data science.

Data science plays a pivotal role in predicting the fate of passengers on the Titanic. Here are some key ways in which data science contributes to this analysis:

1. Data Collection: Data scientists collect and curate the historical dataset containing information about Titanic passengers, including details like age, gender, class, ticket fare, cabin, and whether they survived or not. This dataset forms the foundation for analysis.

2. Data Preprocessing: The collected data may be messy and incomplete. Data scientists clean and preprocess the dataset, handling missing values and outliers. They might also perform feature engineering to create new features that can improve prediction accuracy.

3. Exploratory Data Analysis (EDA): EDA is a crucial step in understanding the data. Data scientists use various visualization and statistical techniques to explore the relationships and patterns within the data. EDA helps in identifying which features are likely to have an impact on survival.

4. Feature Selection: Not all features are equally important. Data scientists use techniques to select the most relevant features that influence the prediction. For example, they might find that a passenger's gender or class is more indicative of survival.

5. Model Building: Data scientists apply machine learning algorithms to build predictive models. Classification algorithms, such as logistic regression, decision trees, random forests, and neural networks, are commonly used to predict whether a passenger survived or not based on the selected features.

6. Model Evaluation: Models need to be evaluated to ensure their accuracy and generalization to new data. Data scientists use metrics like accuracy, precision, recall, and F1-score to assess the performance of their models. They also employ techniques like cross-validation to estimate how well the model will perform on unseen data.

7. Hyperparameter Tuning: Data scientists fine-tune model parameters to optimize performance. This process involves adjusting the parameters of the chosen algorithms to find the best combination for accurate predictions.

8. Model Deployment: Once a satisfactory model is built, it can be deployed to make predictions on new data. This could be used to assess the survival probabilities of passengers who were not in the original dataset.

9. Interpretability: Data scientists may also use interpretability techniques to understand why a model is making certain predictions. This can provide valuable insights into the factors that influence survival.

10. Continuous Improvement: Data science is not a one-time task. It involves iterative processes of refining models and incorporating new data or insights. This continuous improvement ensures that predictions remain accurate and relevant over time.

In the case of Titanic prediction, data science techniques allow us to extract meaningful information from a historical dataset and make educated guesses about the survival of passengers, shedding light on the factors that contributed to their fates. This analysis serves as a powerful demonstration of how data science can be applied to historical events, shedding new light on the past and providing valuable insights.
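The model evaluation and cross-validation steps described above can be sketched with scikit-learn. This is a minimal illustration on synthetic data; the dataset, model choice, and split here are stand-ins, not the project's actual pipeline:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Stand-in binary-classification data (the real project uses the Titanic dataset)
X, y = make_classification(n_samples=500, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
pred = model.predict(X_test)

# The four metrics named in the text
metrics = {
    'accuracy':  accuracy_score(y_test, pred),
    'precision': precision_score(y_test, pred),
    'recall':    recall_score(y_test, pred),
    'f1':        f1_score(y_test, pred),
}

# k-fold cross-validation estimates performance on unseen data
cv_scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(metrics, cv_scores.mean())
```

Holding out a test set guards against one optimistic split; cross-validation averages over several splits for a more stable estimate.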

1.1 MODULE DESCRIPTION

Pandas:
● Description: Pandas is a powerful data manipulation library in Python. It provides easy-to-use data structures and data analysis tools that are essential for handling and preprocessing datasets.

NumPy:
● Description: NumPy is a fundamental package for numerical computing in Python. It provides support for arrays and matrices, which are crucial for many mathematical and statistical operations.

Scikit-Learn:
● Description: Scikit-Learn is a versatile machine learning library in Python. It offers a wide range of algorithms and tools for tasks like classification, regression, clustering, and model evaluation.

Matplotlib and Seaborn:
● Description: Matplotlib and Seaborn are visualization libraries in Python. They enable the creation of static, animated, and interactive visualizations to help analyze and communicate insights from the data.
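A few lines showing how three of these modules typically combine (a generic sketch with made-up numbers, not code from the project):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Pandas: tabular data handling
df = pd.DataFrame({'Age': [22, 38, 26, 35],
                   'Fare': [7.25, 71.28, 7.92, 53.10]})

# NumPy: numerical computing on the underlying arrays
age_mean = float(np.mean(df['Age'].to_numpy()))

# Scikit-Learn: fit a simple model (Fare as a function of Age)
model = LinearRegression().fit(df[['Age']], df['Fare'])
predicted = model.predict(pd.DataFrame({'Age': [30]}))
```

Pandas holds the table, NumPy does the arithmetic on its arrays, and Scikit-Learn consumes both directly, which is why the three are almost always used together.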

SYSTEM ANALYSIS

System analysis for Titanic prediction involves a structured approach to understand, design, and implement a system that can accurately predict the survival or non-survival of passengers on the Titanic. Here's a breakdown of the system analysis process:

1. Understanding the Problem:
○ Define the problem clearly: The primary objective is to predict whether a passenger survived or not based on various attributes.
○ Understand the context: Learn about the historical context of the Titanic, its passengers, and the available dataset.

2. Data Collection:
○ Identify data sources: Gather the Titanic dataset, which contains passenger information.
○ Verify data quality: Ensure that the data is accurate, complete, and relevant.

3. Requirements Analysis:
○ Identify stakeholders: Determine who will use the prediction model (e.g., researchers, historians, or data scientists).
○ Gather user requirements: Understand what users expect from the system, such as prediction accuracy and interpretability.

4. System Design:
○ Define the system architecture: Decide on the software tools, databases, and frameworks to use for data processing and modeling.
○ Choose the modeling approach: Select the machine learning algorithms to build predictive models.
○ Design the user interface (if applicable): Create a user-friendly interface for users to interact with the system.

5. Data Preprocessing:
○ Data cleaning: Handle missing values, outliers, and inconsistencies in the dataset.
○ Feature engineering: Create new features or transform existing ones to improve predictive power.
○ Data splitting: Divide the dataset into training and testing sets for model evaluation.

6. Model Development:
○ Select machine learning algorithms: Choose algorithms suitable for binary classification, such as logistic regression, decision trees, or random forests.
○ Model training: Train the selected models on the training data.
○ Hyperparameter tuning: Optimize model parameters to improve prediction performance.

7. Model Evaluation:
○ Use appropriate evaluation metrics: Assess the model's performance using metrics like accuracy, precision, recall, F1-score, and ROC AUC.
○ Cross-validation: Employ techniques like k-fold cross-validation to estimate the model's performance on unseen data.

8. System Implementation:
○ Develop the system: Implement the data preprocessing, modeling, and evaluation steps into a functional system.
○ User interface (if applicable): Create a user-friendly interface for users to input data and receive predictions.

9. Testing and Validation:
○ Test the system: Ensure that the system works as expected and provides accurate predictions.
○ Validate against real-world data (if available): If there are new Titanic passenger data, validate the system's predictions against them.

10. Deployment:
○ Deploy the system for regular use or research purposes.
○ Monitor the system: Continuously monitor its performance and address any issues or changes in data distribution.

11. Documentation:
○ Create documentation: Document the system's architecture, algorithms, and processes.
○ User documentation: Prepare guides for users on how to interact with the system and interpret results.

12. Maintenance and Improvement:
○ Regularly update the system to accommodate new data, algorithms, or user requirements.
○ Keep improving the system's accuracy and efficiency.

System analysis for Titanic prediction is a comprehensive process that involves understanding the problem, designing the system, implementing it, and ensuring its accuracy and reliability. The application of data science and machine learning techniques to historical data, like the Titanic dataset, can yield valuable insights and serve as a model for similar predictive analysis in other domains.
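The preprocessing, splitting, and modeling steps above can be illustrated end to end. The tiny dataset below is synthetic and only stands in for the real Titanic data; the imputation rule and classifier choice are illustrative assumptions:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the Titanic dataset
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    'Pclass': rng.integers(1, 4, n),
    'Sex':    rng.integers(0, 2, n),           # 0 = male, 1 = female (pre-encoded)
    'Age':    rng.normal(30, 12, n).clip(1, 80),
})
# Survival loosely driven by sex and class, as EDA on the real data suggests
df['Survived'] = ((df['Sex'] == 1) | (df['Pclass'] == 1)).astype(int)

# Step 5: data cleaning — inject some missing ages, then impute per class
df.loc[df.sample(frac=0.1, random_state=0).index, 'Age'] = np.nan
df['Age'] = df['Age'].fillna(df.groupby('Pclass')['Age'].transform('median'))

# Step 5: split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    df[['Pclass', 'Sex', 'Age']], df['Survived'],
    test_size=0.25, random_state=0)

# Steps 6-7: train and evaluate a binary classifier
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
acc = accuracy_score(y_test, clf.predict(X_test))
print(f'test accuracy: {acc:.2f}')
```

Imputing ages with the per-class median mirrors the report's own strategy of using class-wise typical ages rather than a single global value.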

3. SOFTWARE REQUIREMENTS SPECIFICATIONS

3.1 System configurations

The software requirements specification is produced at the culmination of the analysis task. The function and performance allocated to software as part of system engineering are refined by establishing a complete information description, a detailed functional description, a representation of system behavior, an indication of performance and design constraints, appropriate validation criteria, and other information pertinent to requirements.

Software Requirements
• Operating system: Windows 10
• Coding language: Python
• Front-end: Google Colab

Hardware Requirements
• Hard disk: 512 GB HDD + 256 GB SSD
• RAM: 4 GB

4. TECHNOLOGY USED

4.1 Python: Python is the primary programming language used for building the recommendation system. Its extensive libraries and frameworks make it a popular choice for ML projects.

4.2 Machine Learning Libraries:
● Scikit-Learn: This is a comprehensive ML library that provides a wide range of algorithms for classification, regression, clustering, and more. It's widely used for building recommendation models.
● Surprise (scikit-surprise): Specifically designed for building recommendation systems, Surprise offers a range of collaborative filtering algorithms and tools for model selection and evaluation.

4.3 NumPy and Pandas:
● NumPy: This library is essential for numerical computations in Python. It provides support for arrays and matrices, which are fundamental data structures in ML.

import numpy as np

● Pandas: Pandas is a powerful data manipulation library that facilitates data preprocessing and analysis.

import pandas as pd

4.4 Matplotlib and Seaborn:
● Matplotlib: This is a popular visualization library in Python. It's used for creating static, animated, and interactive visualizations.
● Seaborn: Built on top of Matplotlib, Seaborn provides a high-level interface for creating attractive and informative statistical graphics.
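A minimal sketch of the two plotting libraries working together, assuming off-screen rendering and made-up data (the filename and figure title are illustrative):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, safe outside notebooks
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up passenger records, just to give the plot something to show
df = pd.DataFrame({'Pclass':   [1, 2, 3, 1, 3, 3, 2, 1],
                   'Survived': [1, 0, 0, 1, 0, 1, 1, 1]})

sns.set_style('whitegrid')                     # Seaborn styling on top of Matplotlib
ax = sns.countplot(x='Pclass', hue='Survived', data=df)
ax.set_title('Survival counts by passenger class')
plt.savefig('survival_by_class.png')           # Matplotlib handles the output
```

Seaborn builds the statistical plot; Matplotlib owns the figure object and the saving, which is why both imports appear together throughout the project's code.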

CODING

TITANIC CLASSIFICATION

IMPORT LIBRARIES

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

THE DATA

train = pd.read_csv('/content/train.csv')

MISSING DATA

sns.heatmap(train.isnull(), yticklabels=False, cbar=False, cmap='viridis')
sns.set_style('whitegrid')
sns.countplot(x='Survived', data=train, palette='RdBu_r')
sns.countplot(x='SibSp', data=train)

DATA CLEANING

plt.figure(figsize=(12, 7))
sns.boxplot(x='Pclass', y='Age', data=train, palette='winter')

def impute_age(cols):
    # Fill a missing age with a typical age for that passenger class
    # (37, 29, and 24 approximate the class-wise median ages seen in the boxplot)
    Age = cols[0]
    Pclass = cols[1]
    if pd.isnull(Age):
        if Pclass == 1:
            return 37
        elif Pclass == 2:
            return 29
        else:
            return 24
    else:
        return Age

# Apply the imputer (this step was implied but missing from the original
# listing) and drop the sparsely populated Cabin column
train['Age'] = train[['Age', 'Pclass']].apply(impute_age, axis=1)
train.drop('Cabin', axis=1, inplace=True)

TRAINING AND PREDICTING

from sklearn.linear_model import LogisticRegression

EVALUATION

from sklearn.metrics import classification_report, confusion_matrix

ANN

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

test = pd.read_csv('/content/test.csv')

# 'ann' is the trained Keras model; its construction and training cells are
# not part of this listing. A minimal version might look like:
# ann = Sequential([Dense(32, activation='relu'), Dense(1, activation='sigmoid')])
# ann.compile(optimizer='adam', loss='binary_crossentropy')
# ann.fit(X_train, y_train, epochs=100)
test_prediction = ann.predict(test)

# Threshold the sigmoid outputs at 0.5 to get 0/1 survival labels
test_prediction = [1 if y >= 0.5 else 0 for y in test_prediction]
test_pred = pd.DataFrame(test_prediction, columns=['Survived'])
new_test = pd.concat([test, test_pred], axis=1, join='inner')
new_test.head()
SCREENSHOTS

[Screenshot: heatmap of missing data in the training set]

[Screenshot: output of the data-processing steps]

[Screenshot: test-set predictions]
CONCLUSION

In conclusion, Titanic prediction is a compelling application of data science and machine learning techniques to a historical event that continues to captivate the public's imagination. Through a systematic analysis of the Titanic dataset, we can gain insights into the factors that influenced passenger survival and build predictive models that can make educated guesses about their fates. Here are some key takeaways from Titanic prediction:

1. Data Science's Historical Perspective: Titanic prediction showcases how data science can breathe new life into historical events by leveraging available data to gain a deeper understanding of the past.

2. Feature Importance: The analysis often highlights the importance of specific features, such as passenger class, age, and gender, in determining survival rates. This serves as a reminder of the socio-economic dynamics of the era.

3. Machine Learning and Predictive Modeling: Machine learning algorithms, including logistic regression, decision trees, and random forests, play a critical role in building predictive models. These models enable us to make predictions about passenger survival based on historical data.

In essence, Titanic prediction is a testament to the power of data-driven analysis in shedding light on past events and understanding the intricate interplay of factors that influenced the survival of the Titanic's passengers. It serves as a reminder that data science can uncover hidden insights and contribute to a deeper appreciation of history.

BIBLIOGRAPHY

The following books were referred to during the analysis and execution phases of the project:

1. M. Lenzerini, "Data integration: A theoretical perspective," in PODS, 2002, pp. 233-246.

2. D. Caruso, "Bringing Agility to Business Intelligence," Information Management, February 2011, http://www.information-management.com/infodirect/2009191/business intelligence metadata analytics ETL data management-10019747-1.html.

3. R. Hughes, Agile Data Warehousing: Delivering World-Class Business Intelligence Systems Using Scrum and XP. iUniverse, 2008.

4. Y. Chen, S. Alspaugh, and R. Katz, "Interactive analytical processing in big data systems: A cross-industry study of MapReduce workloads," Proceedings of the VLDB Endowment, vol. 5, no. 12, pp. 1802-1813, 2012.

WEBLINKS:

1. https://www.kaggle.com/ - covering all the most important Python concepts. This tutorial is primarily for new users.

2. https://www.kaggle.com/ - what is data science all about? For sample projects.
