School of Engineering and Technology: Data Science
An Internship Report
On
“DATA SCIENCE”
2023-2024
SCHOOL OF ENGINEERING AND TECHNOLOGY
Department of Computer Science and Engineering
CERTIFICATE
This is to certify that the Internship work entitled “Data Science”, submitted to the CMR
University, Bangalore, in partial fulfillment of the requirements for the award of the degree of Bachelor
of Technology in Computer Science and Engineering is a record of work done by B. Mani Kowshik
bearing university register number 21BBTCS056 during the academic year 2023-24 at School of
Engineering and Technology, CMR University, Bangalore under my supervision and guidance. The
Internship report has been approved as it satisfies the academic requirement in respect of internship work
prescribed for the said degree.
(1)
Name Signature
(2)
Name Signature
DECLARATION
I further declare that the work reported in this internship has not been submitted and will
not be submitted, either in part or in full, for the award of any other degree in this university or
any other institute or University.
Date: 21BBTCS056
ACKNOWLEDGEMENT
The satisfaction that accompanies the successful completion of this project would be incomplete
without mention of the people who made it possible, and without whose constant guidance and
encouragement my efforts would have been in vain.
I consider myself privileged to express gratitude and respect towards all those who guided me through
the completion of the project. I express my thanks to my Internal Internship Guide Dr KOUSHIK, Asst.
Professor, Department of Computer Science and Engineering, IIT Bhuvaneshwar for his constant
support.
I would like to express my thanks to Dr Rubini P, Professor and Head, Department of CSE, School of
Engineering and Technology, CMR University, Bangalore, for her encouragement that motivated me
for the successful completion of internship work.
I express my heartfelt sincere gratitude to Dr N Kannan, Dean, Lake Side Campus, School of
Engineering and Technology, CMR University, for his support.
I would like to express my sincere thanks and gratitude to our internship coordinator for his support,
invaluable guidance and encouragement throughout the tenure of this internship.
Sai Charan Lakum
from CMR UNIVERSITY BANGALORE has successfully completed a 6-week online training on Data Science. The
training consisted of Introduction to Data Science, Python for Data Science, Understanding the Statistics for
Data Science, Predictive Modeling and Basics of Machine Learning, and The Final Project modules.
Sai Charan scored 100% marks in the final assessment and is a top performer in the training.
We wish Sai Charan all the best for future endeavours.
The data science internship journey uncovered these pivotal findings, illuminating the power and
potential of data-driven approaches. From the critical role of data quality to the diversity of machine
learning techniques and the ethical considerations that shape the field, the experience showcased that
data science is a dynamic and impactful discipline with the ability to transform how we perceive,
analyse, and make decisions based on data.
The institute provides this opportunity to build skills that are, or soon will be, in demand in the
software sector. Skills such as artificial intelligence, machine learning, and deep learning are
currently sought after in a rapidly growing market, and their applications are used almost everywhere.
Data science is mainly used to solve problems by drawing on a wide range of data through prediction
and analysis. Because we worked together, we gained different perspectives on the same topic and
tried our best to reach the best solution.
TABLE OF CONTENTS
3 Conclusion
4 Bibliography
LEARNING OBJECTIVES
Foundational Concepts: Understand the fundamental concepts of data science, including data
types, structures, and formats. Comprehend the importance of data quality, cleaning, and
preprocessing.
Learn to explore and visualize data to gain insights and identify patterns.
Basics of Machine Learning: Explore supervised learning techniques, such as regression and
classification, and their applications. Understand unsupervised learning methods, including
clustering and dimensionality reduction.
Gain exposure to ensemble methods and model evaluation techniques.
Ethics and Bias in Data Science: Understand ethical considerations surrounding data privacy,
security, and bias mitigation. Learn to identify and address potential biases in data and models.
Real-world Applications: Apply data science techniques to real-world problems and datasets
across various domains. Gain experience in collaborating on team projects to solve complex data-
driven challenges.
1. INTRODUCTION
1.1 About Internshala
Internshala is an internship and online training platform, based in Gurgaon, India.[1][2] Founded by
Sarvesh Agrawal, an IIT Madras alumnus, in 2011, the website helps students find internships with
organisations in India.
Our Vision
Our vision is to bring a technology-oriented, career-driven industrial experience of great value
into the aspirant’s career.
Data science entails the utilization of advanced techniques to extract meaningful insights and
predictions from complex datasets. It doesn't seek to replace human decision-making; rather, it
amplifies human judgment by providing valuable information. Data science tools, often referred to
as "analytical services," are sophisticated computational engines designed to process and analyze
data. These tools excel in pattern recognition and predictive modeling, enhancing their predictive
capabilities. Some employ machine learning algorithms, progressively improving predictions with
new data inputs. Others utilize intricate neural network structures, mirroring the human brain's
information processing.
Data science isn't about thinking; it's about intricate calculations. These calculations, fueled by data
analysis, hold immense potential. By uncovering hidden patterns and trends, data science empowers
us to make informed decisions. It might appear as simple predictions, but these insights hold
significant sway over human lives. From healthcare diagnostics to financial forecasting, the impact
of data science extends far beyond calculations – it influences the way we understand and shape the
world around us.
In classical machine learning, AI systems learn by ingesting data and getting better at recognizing
patterns. Such systems can predict things like the distance between points or the intensity of
values. Like all machine learning, the classical form depends on algorithms. Recall that an algorithm
is a procedure that takes an input and outputs a result. Classical machine learning uses a small
number of algorithms in a relatively simple arrangement. Sometimes machine learning algorithms are
binary, which means that they output one of only two values. Typical binary results might be a 1 or
a 0, a YES or a NO, or a TRUE or a FALSE.
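As a minimal sketch of a binary algorithm, the function below turns a numeric input into a 1 or a 0 by thresholding a logistic score. The function name, weight, bias, and threshold are illustrative assumptions and do not come from the training material.

```python
import math


def predict_binary(x, weight=1.5, bias=-3.0, threshold=0.5):
    """Return 1 or 0 for a single numeric input (weights are made up)."""
    # Logistic function squashes the weighted input into the range (0, 1)
    score = 1.0 / (1.0 + math.exp(-(weight * x + bias)))
    # The threshold converts the continuous score into a binary result
    return 1 if score >= threshold else 0


predictions = [predict_binary(x) for x in (0.0, 1.0, 3.0)]
print(predictions)
```

Small inputs fall below the threshold and map to 0, while larger inputs map to 1.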
Machine learning has evolved into a collection of powerful applications called the deep learning
ecosystem. The foundation for many of these applications is the neural network. A neural network
is a computational model inspired by the way neurons communicate in the human brain. Its basic
building block, called a perceptron, acts as the equivalent of a single neuron. Perceptrons are
arranged into an input layer, one or more hidden layers, and an output layer. A signal enters the
input layer, the hidden layers apply weighted computations to the signal, and the result is passed
to the output layer.
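The layered structure described above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the course's implementation: the weights are hand-picked or random, the hidden layer uses a ReLU activation, and the function names `perceptron` and `forward` are hypothetical.

```python
import numpy as np


def perceptron(inputs, weights, bias):
    """Single perceptron: weighted sum followed by a step activation."""
    return 1 if np.dot(inputs, weights) + bias > 0 else 0


def forward(x, w_hidden, b_hidden, w_out, b_out):
    """Forward pass: input layer -> one hidden layer (ReLU) -> output layer."""
    hidden = np.maximum(0.0, x @ w_hidden + b_hidden)
    return hidden @ w_out + b_out


# Hand-picked weights that make one perceptron behave like a logical AND
pairs = [(0, 0), (0, 1), (1, 0), (1, 1)]
and_gate = [perceptron(np.array(p), np.array([1.0, 1.0]), -1.5) for p in pairs]
print(and_gate)

# Random (untrained) weights, only to show how shapes flow through the layers
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))  # 4 samples, 3 input features
out = forward(x, rng.normal(size=(3, 5)), np.zeros(5),
              rng.normal(size=(5, 1)), np.zeros(1))
print(out.shape)
```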
2. INTERNSHIP DISCUSSION
Data Science is an interdisciplinary field that focuses on extracting insights from vast amounts of
structured and unstructured data. It involves a combination of statistical methods, computer science,
and domain expertise to analyze and interpret data trends effectively. The first step in Data Science
is data collection, which involves gathering data from multiple sources, such as databases, APIs, or
web scraping techniques. This is followed by preprocessing, cleaning, and transforming raw data into
meaningful and structured information.
Once data is cleaned, exploratory data analysis (EDA) techniques are used to understand patterns,
detect outliers, and summarize key characteristics of the dataset. Visualization techniques such as
histograms, scatter plots, and box plots are employed to better understand distributions and
relationships among variables. Predictive modeling and machine learning play a crucial role in Data
Science, where algorithms such as regression analysis, classification, and clustering are applied to
derive meaningful insights from data.
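A small exploratory pass of this kind might look like the following sketch. It assumes pandas is available and uses a made-up `amount` column; outliers are flagged with the common 1.5 × IQR rule, which is one convention among several.

```python
import pandas as pd

# Hypothetical data with one obviously extreme value at the end
df = pd.DataFrame({"amount": [12, 15, 14, 13, 16, 15, 14, 90]})

# Summary statistics: count, mean, std, min, quartiles, max
summary = df["amount"].describe()
print(summary)

# Flag values outside 1.5 * IQR from the quartiles
q1, q3 = df["amount"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (df["amount"] < q1 - 1.5 * iqr) | (df["amount"] > q3 + 1.5 * iqr)
outliers = df[is_outlier]
print(outliers)
```

A histogram or box plot of the same column (for example with `df["amount"].plot.box()`, which requires Matplotlib) would show the extreme value visually.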
Python and R are the most commonly used programming languages in data science, offering a wide
range of libraries for data manipulation, visualization, and predictive modeling. Data Science
applications are vast, ranging from healthcare and finance to marketing, social media analysis, and
urban planning. Organizations leverage data science to improve efficiency, optimize operations, and
make informed decisions based on real-time data. This module provided an overview of fundamental
concepts, illustrating how data-driven insights power modern industries.
The increasing reliance on data in modern industries has made data science an essential field for
technological advancement and innovation. Companies use data science techniques to enhance
business intelligence, optimize decision-making, and drive competitive strategies. The ability to
harness and analyze large-scale data enables organizations to predict future trends, mitigate risks, and
improve overall efficiency. This module laid the foundation for understanding the significance of data
science and its impact across various domains.
Python is a versatile and widely used programming language in the field of data science due to its
simplicity, flexibility, and powerful libraries. This module introduced the foundational elements of
Python programming, including variables, data types, operators, and control structures such as loops
and conditional statements. Additionally, it covered functions, modules, and file handling, which are
essential for developing scalable and modular code.
A significant portion of the module was dedicated to data manipulation using Pandas, a powerful
library that allows for easy handling and processing of structured datasets. Dataframes and series
were explored, along with essential operations like filtering, merging, and aggregation. NumPy,
another crucial library, was introduced for numerical computing, including array operations,
mathematical functions, and linear algebra.
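The operations named above can be illustrated with a short sketch. The tables, column names, and values are invented for the example and are not taken from the course material.

```python
import numpy as np
import pandas as pd

# Hypothetical tables
customers = pd.DataFrame({"customer_id": [1, 2, 3],
                          "city": ["Bangalore", "Chennai", "Bangalore"]})
orders = pd.DataFrame({"customer_id": [1, 1, 2, 3],
                       "amount": [250.0, 100.0, 400.0, 50.0]})

large = orders[orders["amount"] >= 100.0]          # filtering rows
merged = large.merge(customers, on="customer_id")  # merging two DataFrames
by_city = merged.groupby("city")["amount"].sum()   # aggregation per group
print(by_city)

# NumPy: array operations and basic linear algebra
a = np.array([[2.0, 0.0], [0.0, 4.0]])
b = np.array([2.0, 8.0])
x = np.linalg.solve(a, b)  # solves a @ x = b
print(x)
```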
Visualization techniques were explored using Matplotlib and Seaborn, enabling graphical
representation of data for better interpretation. The module also introduced Scikit-learn for machine
learning, covering topics such as supervised and unsupervised learning, model training, and
evaluation. Understanding Python’s role in data science is essential for processing large datasets
efficiently and building predictive models. This module provided hands-on experience in writing
Python programs, handling data, and implementing machine learning models to solve real-world
problems.
Python's adaptability and ease of use have contributed to its widespread adoption in data science and
analytics. Its rich ecosystem of libraries and frameworks has made it an indispensable tool for data
professionals, researchers, and developers. With increasing demand for data-driven solutions,
proficiency in Python has become a critical skill in the industry. This module provided a strong
foundation for understanding how Python supports data manipulation, visualization, and machine
learning applications.
Statistics is the backbone of data science, providing essential tools to analyze, interpret, and make
inferences from data. This module covered the fundamental concepts of descriptive and inferential
statistics, highlighting their importance in understanding data patterns and trends. Descriptive
statistics focused on measures of central tendency (mean, median, mode) and dispersion (range,
variance, standard deviation, and interquartile range) to summarize and describe datasets.
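These measures can be computed with Python's standard `statistics` module, as in the sketch below. The data values are made up, and the population forms of variance and standard deviation are used; the sample forms (`statistics.variance`, `statistics.stdev`) divide by n - 1 instead.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical observations

# Central tendency
mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)

# Dispersion
data_range = max(data) - min(data)
variance = statistics.pvariance(data)  # population variance
std_dev = statistics.pstdev(data)      # population standard deviation
q1, _, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1                          # interquartile range

print(mean, median, mode, data_range, variance, std_dev, iqr)
```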
The module then introduced probability theory, explaining concepts such as probability distributions,
random variables, and expected values. Common probability distributions, including normal,
binomial, and Poisson distributions, were explored in depth. Hypothesis testing, an integral part of
inferential statistics, was discussed to determine the statistical significance of observed data.
Concepts such as confidence intervals, p-values, and t-tests were introduced to help evaluate
hypotheses and draw meaningful conclusions from sample data.
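As a worked illustration, the sketch below runs a one-sample t-test by hand on a made-up sample, testing whether its mean differs from an assumed reference value of 50. The critical value 2.262 is the two-sided 5% value of the t distribution with 9 degrees of freedom, taken from a standard table; a library such as SciPy would normally compute an exact p-value instead.

```python
import math
import statistics

# Hypothetical sample; null hypothesis: the population mean equals mu0
sample = [52, 48, 55, 51, 49, 53, 54, 50, 52, 56]
mu0 = 50

n = len(sample)
mean = statistics.mean(sample)
s = statistics.stdev(sample)   # sample standard deviation (divides by n - 1)
se = s / math.sqrt(n)          # standard error of the mean
t_stat = (mean - mu0) / se

t_crit = 2.262                 # t table value: alpha = 0.05, two-sided, df = 9
ci = (mean - t_crit * se, mean + t_crit * se)  # 95% confidence interval
reject_null = abs(t_stat) > t_crit

print(round(t_stat, 3), ci, reject_null)
```

Here the t statistic exceeds the critical value and the confidence interval excludes 50, so the two views of the test agree.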
Additionally, data visualization techniques such as histograms, box plots, and scatter plots were
covered to analyze distributions and relationships between variables. The module emphasized the
importance of statistical methods in data-driven decision-making, helping businesses and researchers
gain insights from data. Understanding these statistical techniques is crucial for model evaluation,
risk assessment, and predictive analytics in real-world applications.
Machine Learning (ML) is a subset of artificial intelligence that enables systems to learn from data
and improve their performance over time without explicit programming. This module provided an
introduction to key machine learning concepts and different learning paradigms, including
supervised, unsupervised, and reinforcement learning.
Supervised learning was explored in detail, covering algorithms such as linear regression, logistic
regression, decision trees, support vector machines, and ensemble methods like random forests. These
algorithms are widely used in applications such as fraud detection, recommendation systems, and
customer segmentation. Unsupervised learning techniques, including K-means clustering,
hierarchical clustering, and principal component analysis (PCA), were introduced to identify patterns
and group similar data points without labeled outputs.
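The difference between the two paradigms can be sketched with scikit-learn, assuming it is installed. The data is synthetic and chosen so the outcome is easy to check: the regression targets follow y = 2x + 1 exactly, and the unlabeled points form two well-separated groups.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

# Supervised: labeled targets y are given, here y = 2x + 1
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
reg = LinearRegression().fit(X, y)
pred = reg.predict(np.array([[4.0]]))[0]
print(pred)

# Unsupervised: no labels, K-means groups nearby points together
points = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
                   [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(points)
print(labels)
```

The cluster numbers themselves are arbitrary; only the grouping is meaningful.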
The module also covered reinforcement learning, where an agent interacts with an environment and
learns optimal strategies through rewards and penalties. This approach is used in robotics, game
playing, and autonomous decision-making.
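The reward-driven loop can be illustrated with tabular Q-learning on a toy problem. Everything here is an assumption made for the example: a five-cell corridor, a reward of 1 for reaching the rightmost cell, and arbitrary values for the learning rate, discount factor, and exploration rate.

```python
import random

N_STATES, GOAL = 5, 4
ACTIONS = (-1, +1)                    # index 0 moves left, index 1 moves right
alpha, gamma, epsilon = 0.5, 0.9, 0.2

random.seed(0)
q = [[0.0, 0.0] for _ in range(N_STATES)]  # Q-table: q[state][action]

for _ in range(200):                  # episodes
    state = 0
    while state != GOAL:
        if random.random() < epsilon:
            a = random.randrange(2)   # explore: random action
        else:
            best = max(q[state])      # exploit: best known action, ties random
            a = random.choice([i for i in range(2) if q[state][i] == best])
        nxt = min(max(state + ACTIONS[a], 0), GOAL)
        reward = 1.0 if nxt == GOAL else 0.0
        # Q-learning update: move the estimate toward reward + discounted future value
        q[state][a] += alpha * (reward + gamma * max(q[nxt]) - q[state][a])
        state = nxt

policy = ["right" if q[s][1] > q[s][0] else "left" for s in range(GOAL)]
print(policy)
```

After training, the agent prefers moving right in every non-goal cell, which is the shortest path to the reward.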
Additionally, the module focused on the essential steps in building a machine learning model, such
as data preprocessing, feature selection, model training, hyperparameter tuning, and evaluation using
metrics like accuracy, precision, recall, and F1-score. Machine learning has transformed industries
by automating decision-making processes, enhancing predictive capabilities, and optimizing business
operations. This module provided hands-on experience in implementing ML models using Python
libraries such as Scikit-learn and TensorFlow. Machine learning has revolutionized numerous fields,
from healthcare and finance to marketing and automation. Its ability to process vast amounts of data
and identify patterns has led to advancements in artificial intelligence and predictive analytics. This
module laid the foundation for understanding machine learning techniques and their applications in
solving real-world problems.
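The evaluation metrics mentioned above follow directly from the counts of true and false positives and negatives. The sketch below computes them in plain Python for a made-up set of labels; the function name `classification_metrics` is hypothetical, and scikit-learn offers equivalent functions in `sklearn.metrics`.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical labels: 3 true positives, 1 false negative, 1 false positive, 5 true negatives
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```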
The final project allowed learners to apply all the skills acquired throughout the course to solve a
real-world business problem. The case study focused on a retail banking institution aiming to improve
its marketing strategies for selling term deposits.
The project involved extensive data preprocessing, including handling missing values, outlier
detection, and feature engineering to prepare the dataset for machine learning models. Exploratory
data analysis was conducted to identify key trends and insights, helping define the most relevant
features for predictive modeling.
Various machine learning algorithms, such as logistic regression, decision trees, random forests, and
support vector machines, were implemented to predict customer behavior and optimize marketing
outreach. The models were evaluated using performance metrics such as the confusion matrix, ROC
curves, and classification reports.
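A comparison of this kind could be sketched as follows. Because the banking dataset is not reproduced in this report, the example substitutes synthetic data from `make_classification`; the feature counts, model settings, and split size are assumptions, and only two of the four model families are shown to keep the sketch short.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the term-deposit data (not the real dataset)
X, y = make_classification(n_samples=500, n_features=8, n_informative=4,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    y_score = model.predict_proba(X_test)[:, 1]  # probability of the positive class
    results[name] = {
        "confusion_matrix": confusion_matrix(y_test, y_pred),
        "roc_auc": roc_auc_score(y_test, y_score),
    }
    print(name)
    print(results[name]["confusion_matrix"])
    print(classification_report(y_test, y_pred))
```

The area under the ROC curve summarizes the curve as a single number, which makes it convenient for comparing models side by side.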
The project also emphasized model deployment, demonstrating how trained models can be integrated
into business applications for real-time decision-making. The hands-on nature of this project provided
practical experience in applying data science methodologies to real-world problems, reinforcing the
importance of data-driven decision-making in today’s competitive business landscape.
Working on a practical project provided valuable experience in tackling real-world data challenges.
The hands-on approach ensured a deep understanding of data science workflows, reinforcing critical
problem-solving skills. The project served as a bridge between theoretical concepts and industry
applications, preparing learners for future opportunities in data science and analytics.
Furthermore, this project highlighted the importance of iterative model improvement and
optimization. Hyperparameter tuning and feature selection were performed to enhance model
accuracy and efficiency. Additionally, the integration of advanced techniques such as ensemble
learning and cross-validation helped ensure robust and reliable predictions. Beyond technical
execution, this project underscored the significance of effective communication of data insights.
Visualizations, dashboards, and detailed reports were created to present findings in a clear and
actionable manner, ensuring that stakeholders could make informed decisions based on data-driven
insights. This comprehensive experience equipped learners with the necessary skills to tackle industry
challenges and develop scalable, real-world data science solutions.
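Hyperparameter tuning with cross-validation might be sketched as below, again on synthetic data rather than the project's dataset. The parameter grid, the F1 scoring choice, and the five folds are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

# Synthetic stand-in data (not the real project dataset)
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           random_state=0)

# Try every combination in the grid, scoring each with 5-fold cross-validation
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid,
                      cv=5, scoring="f1")
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))

# Cross-validated F1 of the selected configuration
scores = cross_val_score(search.best_estimator_, X, y, cv=5, scoring="f1")
print(scores.mean())
```

Scores measured on the same data used to pick the hyperparameters tend to be optimistic, so a separate held-out test set is advisable for the final estimate.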
3. CONCLUSION
In conclusion, data science stands as a transformative discipline that has reshaped the landscape of
decision-making, innovation, and problem-solving across various industries. Through the strategic
collection, analysis, and interpretation of data, data science empowers organizations to extract
valuable insights, make informed decisions, and drive positive outcomes. The journey from raw data
to actionable insights involves a series of intricate steps, including data collection, preprocessing,
feature engineering, modeling, and interpretation.
Data science has proven its significance in addressing complex challenges, from predictive analytics
and customer segmentation to medical diagnosis and fraud detection. Its ability to uncover hidden
patterns, trends, and correlations within data has led to advancements in artificial intelligence,
machine learning, and deep learning. Furthermore, data science serves as a bridge between
technology and human understanding, enabling individuals to comprehend complex phenomena and
make data-driven choices.
However, the realm of data science is not without its challenges. From data quality and preprocessing
complexities to ethical considerations and model interpretability, navigating these obstacles demands
a holistic approach that combines technical expertise, domain knowledge, and effective
communication. Despite these challenges, data science's impact is undeniable, contributing to
enhanced efficiency, innovation, and competitive advantage.
As data continues to grow exponentially and technology evolves, the future of data science holds the
promise of unlocking even deeper insights and pushing the boundaries of what is achievable. With
its ability to transform raw data into meaningful knowledge, data science remains a cornerstone of
modern decision-making, shaping a world where information is not just a resource but a powerful
tool for progress. The continuous advancements in computational power, algorithmic development,
and data availability will further enhance the capabilities of data science. Organizations and
professionals must stay adaptive and proactive in harnessing these advancements to drive further
innovation and societal progress.
Moreover, as industries continue to integrate AI-driven solutions, data science will play an even more
critical role in automation, predictive analytics, and personalized experiences. The emergence of real-
time data processing and edge computing will enable faster and more efficient decision-making,
reducing latency and improving responsiveness in critical applications. Ethical AI and responsible
data usage will also become central themes, prompting professionals to address biases, ensure data
privacy, and build transparent machine learning models. With an increasing emphasis on
explainability and fairness, the future of data science will not only be about innovation but also about
fostering trust and accountability in AI-driven decision systems. As a result, interdisciplinary
collaboration between data scientists, ethicists, policymakers, and industry leaders will be essential
to shaping a responsible and forward-looking data science ecosystem.
BIBLIOGRAPHY