
Python for Data Science (PDS) (3150713)

Chapter - 2
Data Science & Python
Topics

Core competencies of a data scientist


Creating the Data Science Pipeline
Why Python?
Understanding Python's Role in Data Science
Considering Speed of Execution
Using the Python Ecosystem for Data Science
Performing fundamental scientific computing using NumPy
Performing data analysis using pandas
Accessing scientific tools using SciPy
Implementing machine learning using Scikit-learn
Going for deep learning with Keras and TensorFlow
Plotting the data using matplotlib
Creating graphs with NetworkX
Parsing HTML documents using Beautiful Soup
Core competencies of a data scientist
 A data scientist requires a vast range of skills to perform the required tasks.
 Most of the time, data scientists work in teams to provide the best results;
 for example, someone who is good at gathering data might team up with an analyst and someone gifted at presenting information.
 It would be hard to find a single person with all the required skills.
 Below are the areas in which a data scientist could find opportunities:
 Data Capture :
 Managing data sources (e.g. databases, Excel, PDF, text, etc.)
 Converting unstructured data to structured data.
 Analysis :
 Knowledge of basic statistical tools.
 Use of specialized math tricks and algorithms.
 Presentations :
 Provide graphical presentations of the patterns found in the data.
 Represent the results of the data analysis to the end users.

Creating the Data Science Pipeline
 The data science pipeline requires the data scientist to follow particular steps in the preparation,
analysis, and presentation of the data.
 The general steps in the pipeline are:
 Preparing the data
 The data we access from various sources may not arrive directly in a structured format.
 We need to transform the data into a structured format (a small pandas sketch follows this list).
 Transformation may require changing data types, changing the order in which data appears, and even creating missing data.
 Performing data analysis
 The results of the data analysis should be provable and consistent.
 Sometimes a single approach may not provide the desired output; we may need to apply multiple algorithms to get the result.
 The use of trial and error is part of the data science art.
 Learning from data
 As we iterate through various statistical analysis methods and apply algorithms to detect patterns, we begin learning from the data.
 The data might not tell the story that you originally thought it would.
 Visualizing
 Obtaining insights
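The preparation step can be illustrated with a small pandas sketch. This is a minimal example of our own, not from the slides, assuming a hypothetical raw table with mixed types, a missing value, and rows out of order:

import pandas as pd

# Hypothetical raw data: numeric values stored as strings, a missing
# entry, and rows out of order.
raw = pd.DataFrame({
    "price": ["10.5", "20.1", None, "15.0"],
    "date": ["2021-01-02", "2021-01-01", "2021-01-04", "2021-01-03"],
    "item": ["A", "B", "C", "D"],
})

# Changing data types: strings to numbers and datetimes.
raw["price"] = pd.to_numeric(raw["price"])
raw["date"] = pd.to_datetime(raw["date"])

# Creating missing data: impute the absent price with the column mean.
raw["price"] = raw["price"].fillna(raw["price"].mean())

# Changing the order in which data appears: sort rows, reorder columns.
structured = raw.sort_values("date")[["date", "item", "price"]]
print(structured)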
Why Python?
 Python is the vision of a single person, Guido van Rossum, who started the language in December 1989 as a replacement for the ABC language.
 However, Python has far exceeded that original vision: it can create applications of all types and, in contrast to ABC, boasts four programming styles (programming paradigms):
 Functional :
 Treats every statement as a mathematical equation and avoids any form of state or mutable data.
 The main advantage of this approach is having no side effects to consider.
 This coding style lends itself better than the others to parallel processing because there is no state to consider.
 Many developers prefer this coding style for recursion and for lambda calculus.
 Imperative :
 Performs computations as a direct change to program state.
 This style is especially useful when manipulating data structures and produces elegant but simple code.
 Object-oriented :
 Relies on data fields that are treated as objects and manipulated only through prescribed methods.
 Python doesn’t fully support this coding form because it can’t implement features such as data hiding.
 This coding style is useful for complex applications because it supports encapsulation and polymorphism.
 Procedural :
 Treats tasks as step-by-step iterations where common tasks are placed in functions that are called as needed.
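To make the four styles concrete, here is a minimal sketch (our own illustration, not from the slides) that computes the squares of a list of numbers in each style:

numbers = [1, 2, 3, 4]

# Functional: pure expressions, no mutable state or side effects.
squares_functional = list(map(lambda n: n ** 2, numbers))

# Imperative: direct changes to program state inside a loop.
squares_imperative = []
for n in numbers:
    squares_imperative.append(n ** 2)

# Object-oriented: data fields manipulated through prescribed methods.
class Squarer:
    def __init__(self, values):
        self.values = values

    def squares(self):
        return [n ** 2 for n in self.values]

squares_oo = Squarer(numbers).squares()

# Procedural: a common task placed in a function called as needed.
def square_all(values):
    return [n ** 2 for n in values]

squares_procedural = square_all(numbers)

# All four styles produce the same result.
assert squares_functional == squares_imperative == squares_oo == squares_procedural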
Understanding Python's Role in Data Science
 Python has unique attributes and is easy to use when it comes to quantitative and analytical computing.
 Python is widely used in data science and is a favorite tool, being a flexible and open-source language.
 Its massive collection of libraries is used for data manipulation and is very easy to learn, even for a beginner data analyst.
 Apart from being platform independent, Python also integrates easily with any existing infrastructure, which helps in solving the most complex problems.
 Python is preferred over other data science tools because of the following features:
 Powerful and Easy to use
 Open Source
 Choice of Libraries
 Flexibility
 Visualization and Graphics
 Well supported
Considering Speed of Execution
 Analysis takes considerable processing power.
 Datasets are often so large that they can bog down even an incredibly powerful system.
 The following factors control the speed of execution of a data science application:
 Dataset Size
 Loading Technique
 Coding Style
 Machine capabilities
 Analysis Algorithm
 We will explore each factor in detail in the following slides.

Considering Speed of Execution (Cont.)
 Dataset size :
 Data science relies on huge datasets in many cases.
The application type partly determines the size of the dataset, but dataset size also depends on the size of the source data.
 Underestimating the effect of dataset size is deadly in data science applications, especially those that need
to operate in real time (such as self-driving cars).
 Loading technique :
 The method we use to load data for analysis is critical, and we should always use the fastest one even if it
means upgrading the hardware to do so.
 Working with data in memory is always faster than working with data stored on disk.
 Accessing local data is always faster than accessing it across a network.
 Performing data science tasks that rely on network access is probably the slowest approach of all.
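As a rough illustration, the sketch below (our own example) times the difference between re-reading a file from disk for every operation and loading it into memory once. The file name data.csv and its "value" column are hypothetical; substitute any local CSV:

import time
import pandas as pd

PATH = "data.csv"  # hypothetical local file with a "value" column

# Slow pattern: re-read the file from disk for every operation.
start = time.perf_counter()
for _ in range(10):
    total = pd.read_csv(PATH)["value"].sum()
print("read from disk each time:", time.perf_counter() - start)

# Faster pattern: load once, then work with the data in memory.
start = time.perf_counter()
df = pd.read_csv(PATH)
for _ in range(10):
    total = df["value"].sum()
print("loaded once into memory:", time.perf_counter() - start)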

Considering Speed of Execution (Cont.)
 Coding Style :
 Anyone can create a slow application using any programming language by employing coding techniques that
don’t make the best use of programming language functionality.
 To create fast data science applications, you must use best-of-method coding techniques.
 Machine Capability :
 Running data science applications on a memory-constrained system with a slower processor is practically impossible.
 The system you use needs to have the best hardware you can afford.
 Given that data science applications are both processor and disk bound, you can’t really cut corners in any
area and expect great results.
 Analysis Algorithm :
 The algorithm you use determines the kind of result you obtain and controls execution speed.
 We must experiment to find the best algorithm for a particular dataset.

Using the Python Ecosystem for Data Science
 We need to load certain libraries in order to perform specific data science tasks in Python.
 Following is the list of libraries we are going to use in this subject:
1. Performing fundamental scientific computing using NumPy
2. Performing data analysis using pandas
3. Plotting the data using matplotlib
4. Accessing scientific tools using SciPy
5. Implementing machine learning using Scikit-learn
6. Going for deep learning with Keras and TensorFlow
7. Creating graphs with NetworkX
8. Parsing HTML documents using Beautiful Soup

1) NumPy
 NumPy is used to perform fundamental scientific computing.
 NumPy library provides the means for performing n-dimensional array manipulation, which is
critical for data science work.
 NumPy provides functions that include support for linear algebra, Fourier transformation, random-number generation, and many more.
 Explore the full listing of functions at https://numpy.org/doc/stable/reference/routines.html
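A short sketch (our own example) of the features listed above:

import numpy as np

# n-dimensional array manipulation: reshape and transpose.
a = np.arange(12).reshape(3, 4)
print(a.T.shape)                       # (4, 3)

# Linear algebra: solve the system Ax = b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([9.0, 8.0])
print(np.linalg.solve(A, b))           # [2. 3.]

# Fourier transformation of a simple sine signal.
signal = np.sin(2 * np.pi * np.arange(8) / 8)
print(np.fft.fft(signal).round(2))

# Random-number generation with a seeded generator.
rng = np.random.default_rng(seed=0)
print(rng.normal(size=3))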

2) pandas
 pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation
tool, built on top of the Python programming language.
It offers data structures and operations for manipulating numerical tables and time series.
 The library is optimized to perform data science tasks especially fast and efficiently.
 The basic principle behind pandas is to provide data analysis and modelling support for Python
that is similar to other languages such as R.
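A minimal sketch (our own example, with made-up numbers) showing a numerical table and a time series:

import pandas as pd

# A small numerical table (DataFrame) with illustrative values.
df = pd.DataFrame({
    "city": ["Ahmedabad", "Surat", "Rajkot"],
    "population_millions": [8.4, 6.5, 2.0],
})
print(df.describe())                                   # summary statistics
print(df.sort_values("population_millions", ascending=False))

# A simple time series: daily values indexed by date.
ts = pd.Series(
    [10, 12, 9, 15],
    index=pd.date_range("2021-01-01", periods=4, freq="D"),
)
print(ts.resample("2D").mean())                        # 2-day averages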

3) matplotlib
 The matplotlib library gives a MATLAB-like interface for creating data presentations of the analysis.
 The library was initially limited to 2-D output, but it still provides the means to express analyses graphically.
 Without this library, we could not create output that people outside the data science community could easily understand.
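A minimal sketch (our own example) of the MATLAB-like pyplot interface:

import numpy as np
import matplotlib.pyplot as plt

# Build a simple 2-D figure with MATLAB-style commands.
x = np.linspace(0, 2 * np.pi, 100)
plt.plot(x, np.sin(x), label="sin(x)")
plt.plot(x, np.cos(x), "--", label="cos(x)")
plt.xlabel("x")
plt.ylabel("value")
plt.title("A simple presentation of analysis results")
plt.legend()
plt.show()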

4) SciPy
 The SciPy stack contains a host of other libraries that we can also download separately.
 These libraries provide support for mathematics, science and engineering.
 When we obtain the SciPy stack, we get a set of libraries designed to work together to create applications of various sorts; these libraries include:
 NumPy
 Pandas
 matplotlib
 Jupyter
 SymPy
 etc.
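A small sketch (our own example) of two of the scientific tools SciPy itself provides:

import numpy as np
from scipy import integrate, stats

# Numerical integration: area under sin(x) on [0, pi] (exact answer: 2).
area, error = integrate.quad(np.sin, 0, np.pi)
print(area)

# Statistics: a two-sample t-test on synthetic data.
rng = np.random.default_rng(seed=0)
sample_a = rng.normal(loc=0.0, size=50)
sample_b = rng.normal(loc=0.5, size=50)
print(stats.ttest_ind(sample_a, sample_b))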

5) Keras and TensorFlow
 Keras is an application programming interface (API) that is used to train deep learning models.
 An API often specifies a model for doing something, but it doesn’t provide an implementation.
 TensorFlow is one implementation of the Keras API; there have been other backend implementations for Keras, such as:
 Microsoft's Cognitive Toolkit (CNTK)
 Theano
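A minimal sketch (our own example) using the Keras API as shipped with TensorFlow to fit a one-neuron model to the line y = 2x - 1:

import numpy as np
from tensorflow import keras

# Training data for the hypothetical relationship y = 2x - 1.
x = np.array([-1.0, 0.0, 1.0, 2.0, 3.0, 4.0])
y = 2 * x - 1

# One dense layer trained with stochastic gradient descent.
model = keras.Sequential([
    keras.Input(shape=(1,)),
    keras.layers.Dense(units=1),
])
model.compile(optimizer="sgd", loss="mean_squared_error")
model.fit(x, y, epochs=500, verbose=0)

print(model.predict(np.array([[10.0]])))  # should be close to 19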

6) Scikit-learn
 The Scikit-learn library is one of many scikit libraries that build on the capabilities provided by NumPy and SciPy to allow Python developers to perform domain-specific tasks.
 The Scikit-learn library focuses on data mining and data analysis; it provides access to the following sorts of functionality:
 Classification
 Regression
 Clustering
 Dimensionality reduction
 Model selection
 Pre-processing
 Scikit-learn is the most important library we are going to learn in this subject.
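A compact sketch (our own example) touching pre-processing, classification, and evaluation on scikit-learn's built-in iris dataset:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load the iris dataset and hold out a test set.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Pre-processing: standardize features to zero mean and unit variance.
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# Classification: fit a k-nearest-neighbors model and score it.
model = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))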

7) Beautiful Soup
 Beautiful Soup is a Python package for parsing HTML and XML documents.
 It creates a parse tree for parsed pages that can be used to extract data from HTML, which is
useful for web scraping.
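A short sketch (our own example) that parses a small HTML fragment; in real web scraping the HTML would come from an HTTP response:

from bs4 import BeautifulSoup

html = """
<html><body>
  <h1>Results</h1>
  <ul>
    <li class="score">10</li>
    <li class="score">20</li>
  </ul>
  <a href="https://example.com/next">next page</a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
print(soup.h1.text)                                    # Results
print([li.text for li in soup.find_all("li", class_="score")])
print(soup.a["href"])                                  # extract the link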
