
Python for Data Science (PDS) (3150713)

Chapter - 2
Data Science & Python
Topics

Core competencies of a data scientist


Creating the Data Science Pipeline
Why Python?
Understanding Python's Role in Data Science
Considering Speed of Execution
Using the Python Ecosystem for Data Science
Performing fundamental scientific computing using NumPy
Performing data analysis using pandas
Accessing scientific tools using SciPy
Implementing machine learning using Scikit-learn
Going for deep learning with Keras and TensorFlow
Plotting the data using matplotlib
Creating graphs with NetworkX
Parsing HTML documents using Beautiful Soup
Core competencies of a data scientist
 A data scientist requires a vast range of skills to perform the required tasks.
 Most of the time, data scientists work in teams to provide the best results;
 for example, someone who is good at gathering data might team up with an analyst and someone gifted at presenting information.
 It would be hard to find a single person with all the required skills.
 Below are the areas in which a data scientist could find opportunities:
 Data Capture :
 Managing data sources (e.g. databases, Excel, PDF, text, etc.)
 Converting unstructured data to structured data.
 Analysis :
 Knowledge of basic statistical tools.
 Use of specialized math tricks and algorithms.
 Presentations :
 Provide graphical presentations of the patterns found in the data.
 Represent the results of the data analysis to the end users.

Creating the Data Science Pipeline
 The data science pipeline requires the data scientist to follow particular steps in the preparation,
analysis, and presentation of the data.
 The general steps in the pipeline are:
 Preparing the data
 The data we access from various sources may not arrive directly in a structured format.
 We need to transform the data into a structured format (a small pandas sketch follows this list).
 Transformation may require changing data types, changing the order in which data appears, and even creating missing data.
 Performing data analysis
 The results of the data analysis should be provable and consistent.
 Sometimes a single approach may not provide the desired output; we may need to apply multiple algorithms to get the result.
 The use of trial and error is part of the data science art.
 Learning from data
 As we iterate through various statistical analysis methods and apply algorithms to detect patterns, we begin learning from the data.
 The data might not tell the story that you originally thought it would.
 Visualizing
 Obtaining insights
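The preparation step can be illustrated with a small pandas sketch. This is a minimal example of our own, not from the slides, assuming a hypothetical raw table with mixed types, a missing value, and rows out of order:

import pandas as pd

# Hypothetical raw data: numeric values stored as strings, a missing
# entry, and rows out of order.
raw = pd.DataFrame({
    "price": ["10.5", "20.1", None, "15.0"],
    "date": ["2021-01-02", "2021-01-01", "2021-01-04", "2021-01-03"],
    "item": ["A", "B", "C", "D"],
})

# Changing data types: strings to numbers and datetimes.
raw["price"] = pd.to_numeric(raw["price"])
raw["date"] = pd.to_datetime(raw["date"])

# Creating missing data: impute the absent price with the column mean.
raw["price"] = raw["price"].fillna(raw["price"].mean())

# Changing the order in which data appears: sort rows, reorder columns.
structured = raw.sort_values("date")[["date", "item", "price"]]
print(structured)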
Why Python?
 Python is the vision of a single person, Guido van Rossum, who started the language in December 1989 as a replacement for the ABC language.
 However, Python has far exceeded that original vision: it can create applications of all types and, in contrast to ABC, boasts four programming styles (programming paradigms):
 Functional :
 Treats every statement as a mathematical equation and avoids any form of state or mutable data.
 The main advantage of this approach is having no side effects to consider.
 This coding style lends itself better than the others to parallel processing because there is no state to consider.
 Many developers prefer this coding style for recursion and for lambda calculus.
 Imperative :
 Performs computations as a direct change to program state.
 This style is especially useful when manipulating data structures and produces elegant but simple code.
 Object-oriented :
 Relies on data fields that are treated as objects and manipulated only through prescribed methods.
 Python doesn’t fully support this coding form because it can’t implement features such as data hiding.
 This coding style is useful for complex applications because it supports encapsulation and polymorphism.
 Procedural :
 Treats tasks as step-by-step iterations where common tasks are placed in functions that are called as needed.
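To make the four styles concrete, here is a minimal sketch (our own illustration, not from the slides) that computes the squares of a list of numbers in each style:

numbers = [1, 2, 3, 4]

# Functional: pure expressions, no mutable state or side effects.
squares_functional = list(map(lambda n: n ** 2, numbers))

# Imperative: direct changes to program state inside a loop.
squares_imperative = []
for n in numbers:
    squares_imperative.append(n ** 2)

# Object-oriented: data fields manipulated through prescribed methods.
class Squarer:
    def __init__(self, values):
        self.values = values

    def squares(self):
        return [n ** 2 for n in self.values]

squares_oo = Squarer(numbers).squares()

# Procedural: a common task placed in a function called as needed.
def square_all(values):
    return [n ** 2 for n in values]

squares_procedural = square_all(numbers)

# All four styles produce the same result.
assert squares_functional == squares_imperative == squares_oo == squares_procedural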
Understanding Python's Role in Data Science
 Python has unique attributes and is easy to use when it comes to quantitative and analytical computing.
 Python is widely used in data science and is a favorite tool, being a flexible and open-source language.
 Its massive collection of libraries is used for data manipulation and is very easy to learn, even for a beginner data analyst.
 Apart from being platform independent, Python also integrates easily with any existing infrastructure, which helps in solving the most complex problems.
 Python is preferred over other data science tools because of the following features:
 Powerful and Easy to use
 Open Source
 Choice of Libraries
 Flexibility
 Visualization and Graphics
 Well supported
Considering Speed of Execution
 Analysis takes considerable processing power.
 Datasets are often so large that they can bog down even an incredibly powerful system.
 The following factors control the speed of execution of a data science application:
 Dataset Size
 Loading Technique
 Coding Style
 Machine capabilities
 Analysis Algorithm
 We will explore each factor in detail in the following slides.

Considering Speed of Execution (Cont.)
 Dataset size :
 Data science relies on huge datasets in many cases.
The application type partly determines the size of the dataset, but dataset size also depends on the size of the source data.
 Underestimating the effect of dataset size is deadly in data science applications, especially those that need
to operate in real time (such as self-driving cars).
 Loading technique :
 The method we use to load data for analysis is critical, and we should always use the fastest one even if it
means upgrading the hardware to do so.
 Working with data in memory is always faster than working with data stored on disk.
 Accessing local data is always faster than accessing it across a network.
 Performing data science tasks that rely on network access is probably the slowest approach of all.
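As a rough illustration, the sketch below (our own example) times the difference between re-reading a file from disk for every operation and loading it into memory once. The file name data.csv and its "value" column are hypothetical; substitute any local CSV:

import time
import pandas as pd

PATH = "data.csv"  # hypothetical local file with a "value" column

# Slow pattern: re-read the file from disk for every operation.
start = time.perf_counter()
for _ in range(10):
    total = pd.read_csv(PATH)["value"].sum()
print("read from disk each time:", time.perf_counter() - start)

# Faster pattern: load once, then work with the data in memory.
start = time.perf_counter()
df = pd.read_csv(PATH)
for _ in range(10):
    total = df["value"].sum()
print("loaded once into memory:", time.perf_counter() - start)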

Considering Speed of Execution (Cont.)
 Coding Style :
 Anyone can create a slow application using any programming language by employing coding techniques that
don’t make the best use of programming language functionality.
 To create fast data science applications, you must use best-of-method coding techniques.
 Machine Capability :
 Running data science applications on a memory-constrained system with a slower processor is practically impossible.
 The system you use needs to have the best hardware you can afford.
 Given that data science applications are both processor and disk bound, you can’t really cut corners in any
area and expect great results.
 Analysis Algorithm :
 The algorithm you use determines the kind of result you obtain and controls execution speed.
 We must experiment to find the best algorithm for a particular dataset.

Using the Python Ecosystem for Data Science
 We need to load certain libraries in order to perform specific data science tasks in Python.
 Following is the list of libraries we are going to use in this subject:
1. Performing fundamental scientific computing using NumPy
2. Performing data analysis using pandas
3. Plotting the data using matplotlib
4. Accessing scientific tools using SciPy
5. Implementing machine learning using Scikit-learn
6. Going for deep learning with Keras and TensorFlow
7. Creating graphs with NetworkX
8. Parsing HTML documents using Beautiful Soup

1) NumPy
 NumPy is used to perform fundamental scientific computing.
 NumPy library provides the means for performing n-dimensional array manipulation, which is
critical for data science work.
 NumPy provides functions that include support for linear algebra, Fourier transformation, random-number generation, and many more.
 Explore the full listing of functions at https://numpy.org/doc/stable/reference/routines.html
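A short sketch (our own example) of the features listed above:

import numpy as np

# n-dimensional array manipulation: reshape and transpose.
a = np.arange(12).reshape(3, 4)
print(a.T.shape)                       # (4, 3)

# Linear algebra: solve the system Ax = b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([9.0, 8.0])
print(np.linalg.solve(A, b))           # [2. 3.]

# Fourier transformation of a simple sine signal.
signal = np.sin(2 * np.pi * np.arange(8) / 8)
print(np.fft.fft(signal).round(2))

# Random-number generation with a seeded generator.
rng = np.random.default_rng(seed=0)
print(rng.normal(size=3))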

2) pandas
 pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation
tool, built on top of the Python programming language.
It offers data structures and operations for manipulating numerical tables and time series.
 The library is optimized to perform data science tasks especially fast and efficiently.
 The basic principle behind pandas is to provide data analysis and modelling support for Python
that is similar to other languages such as R.
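A minimal sketch (our own example, with made-up numbers) showing a numerical table and a time series:

import pandas as pd

# A small numerical table (DataFrame) with illustrative values.
df = pd.DataFrame({
    "city": ["Ahmedabad", "Surat", "Rajkot"],
    "population_millions": [8.4, 6.5, 2.0],
})
print(df.describe())                                   # summary statistics
print(df.sort_values("population_millions", ascending=False))

# A simple time series: daily values indexed by date.
ts = pd.Series(
    [10, 12, 9, 15],
    index=pd.date_range("2021-01-01", periods=4, freq="D"),
)
print(ts.resample("2D").mean())                        # 2-day averages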

3) matplotlib
 The matplotlib library gives a MATLAB-like interface for creating data presentations of the analysis.
 The library was initially limited to 2-D output, but it still provides the means to express analyses graphically.
 Without this library, we could not create output that people outside the data science community could easily understand.
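A minimal sketch (our own example) of the MATLAB-like pyplot interface:

import numpy as np
import matplotlib.pyplot as plt

# Build a simple 2-D figure with MATLAB-style commands.
x = np.linspace(0, 2 * np.pi, 100)
plt.plot(x, np.sin(x), label="sin(x)")
plt.plot(x, np.cos(x), "--", label="cos(x)")
plt.xlabel("x")
plt.ylabel("value")
plt.title("A simple presentation of analysis results")
plt.legend()
plt.show()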

4) SciPy
 The SciPy stack contains a host of other libraries that we can also download separately.
 These libraries provide support for mathematics, science and engineering.
 When we obtain the SciPy stack, we get a set of libraries designed to work together to create applications of various sorts; these libraries include:
 NumPy
 Pandas
 matplotlib
 Jupyter
 SymPy
 etc.
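A small sketch (our own example) of two of the scientific tools SciPy itself provides:

import numpy as np
from scipy import integrate, stats

# Numerical integration: area under sin(x) on [0, pi] (exact answer: 2).
area, error = integrate.quad(np.sin, 0, np.pi)
print(area)

# Statistics: a two-sample t-test on synthetic data.
rng = np.random.default_rng(seed=0)
sample_a = rng.normal(loc=0.0, size=50)
sample_b = rng.normal(loc=0.5, size=50)
print(stats.ttest_ind(sample_a, sample_b))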

5) Keras and TensorFlow
 Keras is an application programming interface (API) that is used to train deep learning models.
 An API often specifies a model for doing something, but it doesn’t provide an implementation.
 TensorFlow is one implementation of the Keras API; there have been other backend implementations for Keras, such as:
 Microsoft's Cognitive Toolkit (CNTK)
 Theano
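A minimal sketch (our own example) using the Keras API as shipped with TensorFlow to fit a one-neuron model to the line y = 2x - 1:

import numpy as np
from tensorflow import keras

# Training data for the hypothetical relationship y = 2x - 1.
x = np.array([-1.0, 0.0, 1.0, 2.0, 3.0, 4.0])
y = 2 * x - 1

# One dense layer trained with stochastic gradient descent.
model = keras.Sequential([
    keras.Input(shape=(1,)),
    keras.layers.Dense(units=1),
])
model.compile(optimizer="sgd", loss="mean_squared_error")
model.fit(x, y, epochs=500, verbose=0)

print(model.predict(np.array([[10.0]])))  # should be close to 19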

6) Scikit-learn
 The Scikit-learn library is one of many scikit libraries that build on the capabilities provided by NumPy and SciPy to allow Python developers to perform domain-specific tasks.
 The Scikit-learn library focuses on data mining and data analysis; it provides access to the following sorts of functionality:
 Classification
 Regression
 Clustering
 Dimensionality reduction
 Model selection
 Pre-processing
 Scikit-learn is the most important library we are going to learn in this subject.
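A compact sketch (our own example) touching pre-processing, classification, and evaluation on scikit-learn's built-in iris dataset:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load the iris dataset and hold out a test set.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Pre-processing: standardize features to zero mean and unit variance.
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# Classification: fit a k-nearest-neighbors model and score it.
model = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))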

7) Beautiful Soup
 Beautiful Soup is a Python package for parsing HTML and XML documents.
 It creates a parse tree for parsed pages that can be used to extract data from HTML, which is
useful for web scraping.
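A short sketch (our own example) that parses a small HTML fragment; in real web scraping the HTML would come from an HTTP response:

from bs4 import BeautifulSoup

html = """
<html><body>
  <h1>Results</h1>
  <ul>
    <li class="score">10</li>
    <li class="score">20</li>
  </ul>
  <a href="https://example.com/next">next page</a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
print(soup.h1.text)                                    # Results
print([li.text for li in soup.find_all("li", class_="score")])
print(soup.a["href"])                                  # extract the link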
