Introduction to Data Science

The document provides an introduction to data science, outlining its definition, need, components, life cycle, and basic tools. It emphasizes the importance of data science in informed decision-making, competitive advantage, and various applications across industries. Additionally, it distinguishes data science from business intelligence, highlighting their different focuses and methodologies.


Introduction to Data Science

Unit -1
Contents
• Introduction
• Need for Data Science
• Components of Data Science
• Data Acquisition and Data Science Life-Cycle
• Basic Tools of Data Science
• Difference between BI and Data Science
• Applications of Data Science
• Role of Data Scientist
What is Data Science?
• Data science is an
interdisciplinary field that uses
scientific methods, processes,
algorithms, and systems to
extract knowledge and insights
from structured and unstructured
data.
• It combines elements of statistics,
mathematics, programming, and
domain expertise to transform
raw data into actionable insights.
Need for Data Science
1. Informed Decision Making:
– Empowers data-driven decisions
– Enhances forecasting and planning

2. Competitive Advantage:
– Optimizes operations
– Improves customer experience

3. Efficiency and Automation:
– Streamlines routine tasks
– Boosts operational efficiency

4. Personalization:
– Tailors products and services
– Increases customer satisfaction

5. Risk Management:
– Assesses and mitigates risks
– Detects fraud and anomalies
Need for Data Science
6. Healthcare Improvements:
– Enables predictive diagnostics
– Enhances patient care

7. Scientific Research:
– Accelerates discoveries
– Validates hypotheses

8. Social Good:
– Informs public policy
– Supports humanitarian initiatives

9. Customer Insights:
– Understands customer behavior
– Enhances retention strategies

10. Innovation and Development:
– Identifies market gaps
– Drives product development
Components of Data Science
1. Statistics: Statistics is one of the most important
components of data science. It is the practice of collecting
and analysing large amounts of numerical data and
finding meaningful insights in it.

2. Domain Expertise: Domain expertise binds data
science together. It means specialized knowledge or skill
in a particular area, and data science draws on domain
experts from many different fields.

3. Data engineering: Data engineering is the part of
data science that involves acquiring, storing, retrieving,
and transforming data. It also includes adding
metadata (data about data) to the data.
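The statistics component above can be made concrete with a short sketch. Python's standard `statistics` module computes a few common summary measures; the sales figures below are invented purely for illustration:

```python
import statistics

# Hypothetical daily sales figures (illustrative data only)
sales = [120, 135, 128, 150, 210, 125, 132]

mean_sales = statistics.mean(sales)      # central tendency
median_sales = statistics.median(sales)  # robust to the 210 outlier
stdev_sales = statistics.stdev(sales)    # spread of the values

print(mean_sales, median_sales, stdev_sales)
```

Note how the median (132) sits below the mean here because the single large value (210) pulls the mean upward, which is exactly the kind of insight a statistical summary surfaces.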
Components of Data Science
4. Visualization: Data visualization means representing data in a visual context so that people can easily
understand its significance. Visualization makes large amounts of data accessible at a glance.

5. Advanced computing: Advanced computing does the heavy lifting of data science. It involves
designing, writing, debugging, and maintaining the source code of computer programs.

6. Mathematics: Mathematics is a critical part of data science. It involves the study of quantity,
structure, space, and change, and a good grasp of it is essential for a data scientist.

7. Machine learning: Machine learning is the backbone of data science. It is all about training a
machine so that it can act like a human brain. Data science uses various machine learning algorithms to
solve problems.
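As a tiny illustration of the machine-learning component described above, the sketch below implements a one-nearest-neighbour classifier in plain Python; the points and labels are invented for the example, and a real project would use a library such as scikit-learn instead:

```python
import math

# Toy training data: (feature_x, feature_y) -> label (invented for illustration)
training = [((1.0, 1.0), "small"), ((1.2, 0.8), "small"),
            ((8.0, 9.0), "large"), ((9.5, 8.5), "large")]

def predict(point):
    """Return the label of the training example closest to `point`."""
    nearest = min(training, key=lambda ex: math.dist(ex[0], point))
    return nearest[1]

print(predict((1.1, 0.9)))
print(predict((9.0, 9.0)))
```

The "training" here is simply memorising labelled examples; prediction generalises to new points by distance, which is the essence of how a machine learns from data.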
Data Acquisition
• Data acquisition is the comprehensive process of systematically collecting,
measuring, and recording data from various sources to facilitate analysis and
decision-making. This process encompasses a wide range of techniques and
tools designed to gather raw data from different origins, ensuring that the
data is accurate, relevant, and suitable for further analysis.
• Data acquisition, also known as the process of collecting data, relies on
specialized software that quickly captures, processes, and stores information.
It enables scientists and engineers to perform in-depth analysis for scientific
or engineering purposes.
• Data acquisition systems are available in handheld and remote versions to
cater to different measurement requirements. Handheld systems are suitable
for direct interaction with subjects, while remote systems excel at distant
measurements, providing versatility in data collection.
Components of Data Acquisition
• Sensors: Devices that gather information about physical or environmental conditions, such as
temperature, pressure, or light intensity.
• Signal Conditioning: To ensure accurate measurement, the raw sensor data undergoes preprocessing
to filter out noise and scale it appropriately.
• Data Logger: Hardware or software that records and stores the conditioned data over time.
• Analog-to-Digital Converter (ADC): Converts analog sensor signals into digital data that
computers can process.
• Interface: Connects the data acquisition system to a computer or controller for data transfer and
control.
• Power Supply: Provides the necessary electrical power to operate the system and sensors.
• Control Unit: Manages the overall operation of the data acquisition system, including tasks
such as triggering, timing, and synchronization.
• Software: Allows users to configure, monitor, and analyze the data collected by the system.
Components of Data Acquisition
• Communication Protocols: Govern the transmission and reception of data between the system and
external devices or networks.
• Storage: A range of options is available for storing recorded data, including memory
cards, hard drives, and cloud storage, providing both temporary and permanent storage
solutions.
• User Interface: Allows users to interact with and control the data acquisition
system effectively.
• Calibration and Calibration Standards: To ensure accuracy, the sensors and system are
periodically calibrated against known standards.
• Real-time Clock (RTC): Maintains accurate timing to ensure synchronized data
acquisition and timestamping.
• Triggering Mechanism: Initiates data capture based on predefined events or specific
conditions.
• Data Compression: Reduces the size of collected data for storage and
transmission in remote or resource-limited applications.
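To make the components above concrete, here is a minimal software-only sketch of an acquisition loop: a simulated sensor is sampled, each reading is timestamped (the real-time-clock role), and the result is appended to an in-memory log (the data-logger role). The sensor function and its values are invented for illustration.

```python
import time

def read_sensor(t):
    """Simulated temperature sensor: a fixed baseline plus a small ramp."""
    return 20.0 + 0.1 * t

log = []  # stands in for the data logger / storage component

for t in range(5):  # five sampling cycles
    sample = read_sensor(t)
    log.append({"timestamp": time.time(), "value": round(sample, 2)})

print([entry["value"] for entry in log])
```

In a real system the `read_sensor` call would go through signal conditioning and an ADC, and the log would be flushed to persistent storage rather than kept in memory.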
Key Elements of Data Acquisition

Sources of Data:
• Sensors and IoT devices
• Databases and data warehouses
• Web scraping and APIs
• Surveys and forms
• Social media platforms

Techniques:
• Manual data entry
• Automated data collection
• Streaming data collection
• Batch processing

Tools:
• Data acquisition systems
• ETL tools
• Data loggers and web scraping tools
Key Elements of Data Acquisition

Importance:
• Provides the raw data necessary for analysis
• Ensures data is accurate and up-to-date
• Facilitates real-time decision making
• Supports predictive analytics and machine learning models

Challenges:
• Data quality and integrity
• Handling large volumes of data
• Ensuring data privacy and security
• Integrating data from diverse sources
Advantages of Data Acquisition
• Advancing Scientific Exploration: Researchers across fields such as physics,
biology, and environmental science rely on data acquisition to collect information for
experiments, simulations, and observations, facilitating breakthroughs and new
insights.
• Enhancing Industrial Efficiency: Data acquisition systems play a pivotal role in
industrial settings by overseeing manufacturing processes, guaranteeing quality
assurance, and optimizing overall efficiency.
• Fostering Environmental Insights: Environmental monitoring benefits from data
acquisition by tracking critical factors like air quality, water levels, and soil conditions,
contributing to effective environmental management and timely disaster prediction.
• Revolutionizing Healthcare and Biomedical Studies: The realm of healthcare
leverages data acquisition in medical devices to monitor vital signs and acquire
physiological data, fueling diagnostic accuracy and propelling biomedical research
forward.
• Elevating Automotive Evaluation: Within the automotive industry, data acquisition
serves as an indispensable tool for testing vehicle performance and safety features.
Data Science Life Cycle
1. Identifying problems and understanding business:
• Identifying problems is one of the major steps necessary in the data science process to find a clear objective around which all
the following steps will be formulated. In short, it is important to understand the business objective early since it will decide the
final goal of your analysis.
• This phase should examine business trends, analyse case studies of similar projects, and study the industry's domain.
The team will assess in-house resources, infrastructure, total time, and technology needs. Once all these aspects are identified
and evaluated, they will prepare an initial hypothesis to resolve the business challenges in the current scenario.
• The phase should –
✔ Clearly state the problem that requires solutions and why it should be resolved at once
✔ Define the potential value of the business project
✔ Find risks, including ethical aspects involved in the project
✔ Build and communicate a highly integrated, flexible project plan
2. Data Collection/Data Gathering:
• Data collection is the next stage in the data science lifecycle to gather raw data from relevant sources. The data captured can be
either in structured or unstructured form.
• The methods of collecting the data might come from – logs from websites, social media data, data from online repositories, and
even data streamed from online sources via APIs, web scraping or data that could be present in excel or any other source.
• The person performing the task should know the difference between various data sets available and the data investment
strategy of an organisation.
• A major challenge faced by professionals in this step is tracking where each piece of data comes from and whether it is
up-to-date. It is important to keep track of this information throughout the entire lifecycle of a data science project, as it
can help when validating results later.
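Collection from a flat file, one of the sources mentioned above, can be sketched with the standard-library `csv` module; the column names and rows here are hypothetical, and an in-memory buffer stands in for a real file:

```python
import csv
import io

# In place of a real file, an in-memory CSV stands in for collected raw data.
raw = io.StringIO("user_id,country,amount\n1,IN,250\n2,US,120\n3,IN,310\n")

rows = list(csv.DictReader(raw))  # each row becomes a dict keyed by header
print(len(rows), rows[0]["country"])
```

Note that `csv` reads every field as a string; converting `amount` to a number is deliberately left to the processing stage that follows collection.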
Data Science Life Cycle
3. Data processing:
• In this phase, data scientists analyse the collected data for biases, patterns, ranges, and distribution of values. This is done to
determine the suitability of the data and to predict its usage in regression, machine learning, and deep learning
algorithms.
• The phase also involves the inspection of different types of data, including nominal, numerical, and categorical data.
• Data visualisation is also done to highlight the critical trends and patterns in the data, comprehended through simple bar and line
charts. Simply put, data processing might be the most time-consuming but arguably the most critical phase in the entire life
cycle of data analytics. The goodness of the model depends on this data processing stage.
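One fragment of the processing step can be sketched in plain Python: rows with a missing value are imputed with the mean of the observed values. The records are invented, and real projects would typically use pandas for this kind of cleaning:

```python
# Raw measurements with missing (None) entries, invented for illustration
records = [12.0, None, 15.0, 11.0, None, 14.0]

observed = [v for v in records if v is not None]
mean_value = sum(observed) / len(observed)  # value used for imputation

# Replace each missing entry with the mean of the observed values
cleaned = [v if v is not None else mean_value for v in records]
print(cleaned)
```

Mean imputation is only one option; dropping the rows or using the median are equally common choices, and the right one depends on why the values are missing.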
4. Data analysis:
• Data analysis, or exploratory data analysis (EDA), is another critical step for gaining ideas about the solution and the factors
affecting the data science lifecycle. There are no set guidelines for this methodology, and it has no shortcuts.
• The key aspect to remember here is that your input determines your output. In this stage, the data prepared in the
previous stage is explored further to examine the various features and their relationships, aiding the better feature
selection required for applying it to the model.
• Experts use statistical methods such as the mean and median to better understand the data. In addition, they also plot
the data and assess its distribution using histograms, spectrum analysis, and population distributions. Depending on
the issues, the data will be analysed.
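The mean, median, and distribution checks described above can be sketched with the standard `statistics` module and a crude text histogram; the observations are invented for the example:

```python
import statistics
from collections import Counter

# Invented observations for the EDA sketch
values = [3, 7, 7, 2, 9, 4, 7, 5, 6, 8]

print(statistics.mean(values), statistics.median(values))

# A crude text histogram: count observations per bucket of width 2
buckets = Counter(v // 2 for v in values)
for b in sorted(buckets):
    print(f"{b * 2}-{b * 2 + 1}: {'#' * buckets[b]}")
```

Even this tiny sketch reveals the shape of the data: most observations cluster in the 6–7 bucket, which is the kind of pattern EDA is meant to surface before modelling.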
Data Science Life Cycle
5. Data modelling:
• Data modelling is one of the major phases of the data science process and is often described as the heart of data analysis. A model
should use the prepared and analysed data to provide the desired output. The environment needed for executing the data model is
decided and created to meet the specific requirements.
• In this phase, the team works together to develop datasets for training and testing the model for production purposes. It also
involves tasks such as choosing the appropriate model type and learning whether the problem is a classification,
regression, or clustering problem. After analysing the model family, you must choose the algorithms to implement. This has to
be done carefully, since extracting the necessary insights from the prepared data is extremely important.
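The train/test dataset preparation mentioned above can be sketched in plain Python with a seeded shuffle; the dataset is invented, and real projects would typically reach for scikit-learn's `train_test_split` instead:

```python
import random

# Invented labelled examples: (feature, label)
dataset = [(i, i % 2) for i in range(20)]

random.seed(42)        # fixed seed so the split is reproducible
shuffled = dataset[:]  # copy so the original order is preserved
random.shuffle(shuffled)

split = int(0.8 * len(shuffled))  # 80/20 train/test split
train, test = shuffled[:split], shuffled[split:]
print(len(train), len(test))
```

The shuffle matters: splitting without it would let any ordering in the raw data (by time, by label) leak into the evaluation.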
6. Model deployment:
• We are now at the final stage of the data science lifecycle. After a rigorous evaluation process, the model is finally ready
to be deployed in the desired format and preferred channel. Remember, a machine learning model has no value until
it is deployed to production. In general, these models are integrated and coupled with products and applications.

• The model deployment stage involves creating the delivery mechanism needed to get the model out to the
users or to another system. Machine learning models are also deployed on devices, where they are gaining adoption and popularity in
the field of computing. From a simple model output in a Tableau dashboard to something as complex as scaling to the cloud in front of
millions of users, this step differs from project to project.
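A minimal sketch of the "package the model for delivery" idea above: serialising a trained artefact with the standard `pickle` module and loading it back, as a stand-in for real deployment tooling. The "model" here is just a dict of invented coefficients:

```python
import pickle

# Stand-in for a trained model: invented linear coefficients
model = {"intercept": 1.5, "slope": 0.75}

blob = pickle.dumps(model)     # serialise, as one would before shipping
restored = pickle.loads(blob)  # what the serving side would do on load

def predict(x, m=restored):
    """Apply the restored model to a new input."""
    return m["intercept"] + m["slope"] * x

print(predict(4))
```

Production systems generally prefer framework-specific formats (ONNX, joblib, SavedModel) over raw pickle, but the round-trip of save, ship, load, predict is the same.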
Basic Tools of Data Science
1. Programming Languages
• Python: Widely used for its simplicity and rich ecosystem of libraries for data analysis, visualization, and
machine learning.
• R: Popular in the statistics and data analysis community, with strong visualization capabilities.

2. Libraries and Frameworks


• Pandas: A Python library for data manipulation and analysis, providing data structures like DataFrames.
• NumPy: A Python library for numerical computing, essential for handling arrays and performing mathematical
operations.
• Matplotlib: A plotting library for Python, useful for creating static, interactive, and animated visualizations.
• Seaborn: Built on top of Matplotlib, it provides a high-level interface for drawing attractive statistical graphics.
• SciPy: A Python library used for scientific and technical computing.
• Scikit-learn: A machine learning library for Python, offering simple and efficient tools for data mining
and data analysis.
• TensorFlow and PyTorch: Popular frameworks for deep learning.
Basic Tools of Data Science
3. Data Management Tools
• SQL: A language for managing and querying relational databases.
• MySQL, PostgreSQL: Commonly used relational database management systems.
• MongoDB: A NoSQL database for handling unstructured data.
• Hadoop: A framework for distributed storage and processing of large datasets using the MapReduce
programming model.
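The querying described above can be sketched with Python's built-in `sqlite3` module; the table and rows are invented, and an in-memory database stands in for a real server such as MySQL or PostgreSQL:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("north", 100.0), ("south", 250.0), ("north", 50.0)])

# Aggregate with plain SQL: total sales per region
total_by_region = dict(conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region"))
print(total_by_region)
conn.close()
```

The same `SELECT ... GROUP BY` statement would run unchanged against most relational databases, which is what makes SQL such a portable skill.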

4. Data Visualization Tools


• Tableau: A powerful tool for creating interactive and shareable dashboards.
• Power BI: A business analytics tool by Microsoft for visualizing data and sharing insights.
• D3.js: A JavaScript library for producing dynamic, interactive data visualizations in web browsers.
Basic Tools of Data Science
5. Integrated Development Environments (IDEs) and Notebooks
• Jupyter Notebook: An open-source web application for creating and sharing documents that contain live code, equations, visualizations, and
narrative text.
• Spyder: An open-source IDE for scientific programming in Python.
• RStudio: An IDE for R that provides a user-friendly interface for data analysis.

6. Data Cleaning and Preprocessing Tools


• OpenRefine: A powerful tool for working with messy data, cleaning it, and transforming it from one format into another.
• Trifacta: A data wrangling tool for exploring and preparing diverse data for analysis.

7. Big Data Tools


• Apache Spark: An open-source unified analytics engine for large-scale data processing.
• Apache Hive: A data warehouse software that facilitates reading, writing, and managing large datasets residing in distributed storage.
Basic Tools of Data Science

8. Version Control Systems


• Git: A distributed version control system for tracking changes in source code during software development.
• GitHub, GitLab, Bitbucket: Platforms for hosting and collaborating on Git repositories.

9. Data Acquisition Tools


• Beautiful Soup: A Python library for pulling data out of HTML and XML files.
• Scrapy: An open-source and collaborative web crawling framework for Python.
• APIs: Application Programming Interfaces used for retrieving data from online sources.
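Beautiful Soup and Scrapy are third-party libraries, but the extraction idea behind them can be sketched with the standard-library `html.parser`; the HTML snippet below is invented for the example:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag encountered."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href")

page = '<html><body><a href="/about">About</a> <a href="/data">Data</a></body></html>'
collector = LinkCollector()
collector.feed(page)
print(collector.links)
```

Beautiful Soup wraps this kind of event-driven parsing in a much friendlier tree-search API, which is why it is the usual choice for real scraping work.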
Difference between BI and Data Science

1. Concept:
– Data Science: a field that uses mathematics, statistics, and various other tools to discover hidden patterns in data.
– Business Intelligence: a set of technologies, applications, and processes used by enterprises for business data analysis.

2. Focus:
– Data Science focuses on the future.
– Business Intelligence focuses on the past and present.

3. Data:
– Data Science deals with both structured and unstructured data.
– Business Intelligence mainly deals only with structured data.

4. Flexibility:
– Data Science is much more flexible, as data sources can be added as per requirement.
– Business Intelligence is less flexible, as data sources need to be pre-planned.

5. Method:
– Data Science makes use of the scientific method.
– Business Intelligence makes use of the analytic method.

6. Complexity:
– Data Science has higher complexity in comparison to Business Intelligence.
– Business Intelligence is much simpler when compared to Data Science.
Difference between BI and Data Science

7. Expertise:
– Data Science: the expert is the data scientist.
– Business Intelligence: the expert is the business user.

8. Questions:
– Data Science deals with the questions of what will happen and what if.
– Business Intelligence deals with the question of what happened.

9. Storage:
– Data Science: the data to be used is disseminated in real-time clusters.
– Business Intelligence: a data warehouse is utilised to hold data.

10. Integration of data:
– Data Science: the ELT (Extract-Load-Transform) process is generally used for data integration.
– Business Intelligence: the ETL (Extract-Transform-Load) process is generally used for data integration.

11. Tools:
– Data Science tools: SAS, BigML, MATLAB, Excel, etc.
– Business Intelligence tools: InsightSquared Sales Analytics, Klipfolio, ThoughtSpot, Cyfe, TIBCO Spotfire, etc.

12. Usage:
– Data Science: companies can harness their potential by anticipating future scenarios in order to reduce risk and increase income.
– Business Intelligence helps in performing root cause analysis on a failure or in understanding the current status.

13. Business Value:
– Data Science achieves greater business value, as it anticipates future events.
– Business Intelligence has lesser business value, as the extraction of business value is carried out statically by plotting charts and KPIs (Key Performance Indicators).

14. Handling data sets:
– Data Science: technologies such as Hadoop are available, and others are evolving, for handling large data sets.
– Business Intelligence: sufficient tools and technologies are not available for handling large data sets.
Applications of Data Science
1. Image recognition and speech recognition:
• Data science is currently used for image and speech recognition. When you upload an image on Facebook, you
start getting suggestions to tag your friends. This automatic tagging suggestion uses an image recognition
algorithm, which is part of data science.
• When you say something to "Ok Google", "Siri", "Cortana", etc., and these devices respond to your voice,
that is made possible by speech recognition algorithms.
2. Gaming world:
• In the gaming world, the use of machine learning algorithms is increasing day by day. EA Sports, Sony, and Nintendo
are widely using data science to enhance the user experience.
3. Internet search:
• When we want to search for something on the internet, we use search engines such as
Google, Yahoo, Bing, Ask, etc. All these search engines use data science technology to make the search
experience better, and you can get results within a fraction of a second.
4. Transport:
• Transport industries are also using data science technology to create self-driving cars.
Applications of Data Science
5. Healthcare:
• In the healthcare sector, data science provides many benefits. It is being used for tumor
detection, drug discovery, medical image analysis, virtual medical bots, etc.
6. Recommendation systems:
• Most companies, such as Amazon, Netflix, and Google Play, are using data science to create
a better user experience with personalized recommendations. For example, when you search for
something on Amazon, you start getting suggestions for similar products because of data
science technology.
7. Risk detection:
• Finance industries have always faced fraud and the risk of losses, but with the help of data science, these can
be reduced.
• Most finance companies look for data scientists to avoid risk and losses of any type while
increasing customer satisfaction.
Role of Data Scientist

Data scientist roles and responsibilities include:

• Data mining, or extracting usable data from valuable data sources
• Using machine learning tools to select features and to create and optimize classifiers
• Carrying out preprocessing of structured and unstructured data
• Enhancing data collection procedures to include all information relevant for developing analytic systems
• Processing, cleansing, and validating the integrity of data to be used for analysis
• Analyzing large amounts of information to find patterns and solutions
• Developing prediction systems and machine learning algorithms
• Presenting results in a clear manner
• Proposing solutions and strategies to tackle business challenges
• Collaborating with business and IT teams
