
Applied Data Science with Python

UNIT-1
Why is data science seen as a novel trend in business
reviews, in technology blogs, and at academic conferences?
• The novelty of data science is not rooted in the latest scientific
knowledge, but in a disruptive change in our society that has been
caused by the evolution of technology: datification.
• Datification is the process of rendering into data aspects of the world
that have never been quantified before.
• At the personal level, the list of datified concepts is very long and still
growing: business networks, the lists of books we are reading, the
films we enjoy, the food we eat, our physical activity, our purchases,
our driving behavior, and so on. Even our thoughts are datified when
we publish them on our favorite social network; and in a not-so-distant
future, your gaze could be datified by wearable vision-registering
devices.
• However, datification is not the only ingredient of the data science
revolution. The other ingredient is the democratization of data
analysis. Large companies such as Google, Yahoo, IBM, or SAS
were the only players in this field when data science had no name.
• Today, the analytical gap between those companies and the rest of the
world (companies and people) is shrinking. Access to cloud computing
allows any individual to analyze huge amounts of data in short periods
of time. Analytical knowledge is free and most of the crucial
algorithms that are needed to create a solution can be found, because
open-source development is the norm in this field. As a result, the
possibility of using rich data to make evidence-based decisions is open
to virtually any person or company.

Research Scope:
• Data Science enables innovation in research by offering tools to
handle massive datasets, apply advanced analytics, and develop
predictive models to gain new insights.
Introduction to Data Science
What is Data Science?
• Data science is commonly defined as a methodology by which
actionable insights can be inferred from data. It is an
interdisciplinary field that uses scientific methods, algorithms, and
systems to extract insights from structured and unstructured data. It
plays a pivotal role in solving real-world problems, from healthcare
diagnostics to financial fraud detection.
Introduction to Data Science
Key Components of Data Science:
1. Data Collection: Gathering data from diverse sources such as sensors,
social media, and databases.
2. Data Cleaning: Removing inconsistencies, handling missing values,
and preparing data for analysis.
3. Data Analysis: Applying statistical and computational techniques to
uncover patterns and relationships.
4. Data Visualization: Representing data insights through graphs,
charts, and dashboards for better understanding.
5. Machine Learning: Leveraging algorithms to predict outcomes and
automate decision-making.
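As a minimal illustration of the first three components, here is a hedged sketch using pandas; the column names and values are invented for the example:

```python
import pandas as pd

# 1. Data collection: in practice this would come from sensors,
#    APIs, or databases; here we use a hand-made dict.
raw = {"sensor_id": [1, 2, 2, 3],
       "reading": [20.5, None, 19.8, 21.1]}
df = pd.DataFrame(raw)

# 2. Data cleaning: drop rows with missing readings.
clean = df.dropna(subset=["reading"])

# 3. Data analysis: a simple statistical summary.
print(clean["reading"].mean())
```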
Introduction to Data Science

Applications of Data Science in Research:


∙ Healthcare: Predicting diseases using patient data and developing
personalized treatment plans.
∙ Environment: Climate modeling and prediction using large-scale
environmental data.
∙ Engineering: Optimization of processes in manufacturing and
materials analysis.
∙ Social Sciences: Understanding human behavior through sentiment
analysis and survey data.
Essentials of Python Programming

Python for Research:


• Python's simplicity and extensive libraries make it a powerful
language for engineering research. Its versatility supports tasks
ranging from prototyping to implementing complex algorithms.
• Learners gain in-demand skills such as how to design, develop, and
improve computer programs, methods for analyzing problems using
programming, programming best practices, and more.
Fundamentals of NumPy
NumPy (Numerical Python) is an open source Python library that’s
widely used in science and engineering.
The NumPy library contains multidimensional array data structures, such
as the homogeneous, N-dimensional ndarray, and a large library of
functions that operate efficiently on these data structures.
import numpy as np
a = np.array([[1, 2, 3],
              [4, 5, 6]])
a.shape
# Output: (2, 3)
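Beyond `shape`, the main draw of ndarray is vectorized arithmetic; a small sketch:

```python
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6]])

# Elementwise operations apply to the whole array at once.
doubled = a * 2            # [[2, 4, 6], [8, 10, 12]]

# Broadcasting: a 1-D row is stretched across each row of `a`.
shifted = a + np.array([10, 20, 30])

print(doubled.sum())       # 42
print(shifted[0])          # [11 22 33]
```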
Working with Pandas
Why Pandas for Research?
• Pandas simplifies handling and analyzing structured data, essential for
engineering projects involving large datasets.
DataFrames in Research:
import pandas as pd
# Research data
experiment_data = {
"Sample": ["A", "B", "C"],
"Weight": [10.2, 13.5, 15.8],
"Result": ["Pass", "Pass", "Fail"]}
dataframe = pd.DataFrame(experiment_data)
print("DataFrame:\n", dataframe)
# Statistical Summary
print("Summary:\n", dataframe.describe())
Data Wrangling

• Data wrangling is the process of gathering, collecting, and
transforming raw data into another format for better understanding,
decision-making, accessing, and analysis in less time. Data wrangling
is also known as data munging.
Importance Of Data Wrangling
• Data wrangling is a very important step in a data science project.
For example, a book-selling website wants to show the top-selling books
of different domains according to user preference: if a new user
searches for motivational books, the site wants to show the
motivational books that sell the most or have high ratings.
Data Wrangling

• But the website holds plenty of raw data from different users, and
this is where data munging, or data wrangling, comes in. Data wrangling
is not done by the system itself; it is done by data scientists. The
data scientist will wrangle the data so that the motivational books
that sell more, have high ratings, or are frequently bought together
with other books are surfaced first. On that basis, the new user can
make a choice.
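The book-store scenario above can be sketched with pandas; the column names (`title`, `rating`, `copies_sold`) are invented for the example:

```python
import pandas as pd

books = pd.DataFrame({
    "title": ["Book A", "Book B", "Book C"],
    "rating": [4.1, 4.8, 3.9],
    "copies_sold": [1200, 900, 3000]})

# Wrangle: surface the highest-rated, best-selling titles first.
top = books.sort_values(by=["rating", "copies_sold"],
                        ascending=False)
print(top.head())
```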
Data Wrangling in Python
• Data Wrangling is a crucial topic for Data Science and Data Analysis.
The pandas library in Python is used for data wrangling. Pandas is an
open-source library in Python specifically developed for data analysis
and data science. It is used for processes like data sorting or
filtration, data grouping, etc.
• Data wrangling in Python deals with the below functionalities:
1. Data exploration: In this process, the data is studied, analyzed, and
understood by visualizing representations of data.
2. Dealing with missing values: Most large datasets contain missing
(NaN) values, which need to be taken care of by replacing them with the
mean, the mode (the most frequent value of the column), or simply by
dropping the rows that contain a NaN value.
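The mean replacement and row dropping described above are usually done with pandas' `fillna` and `dropna`; a minimal sketch with made-up marks:

```python
import pandas as pd
import numpy as np

marks = pd.Series([90, 76, np.nan, 74, 65, np.nan, 71])

# Replace NaN with the column mean (NaN is ignored when computing it).
filled_mean = marks.fillna(marks.mean())

# Or simply drop the entries that contain NaN.
dropped = marks.dropna()

print(filled_mean.tolist())
```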
Data Wrangling in Python
3. Reshaping data: In this process, data is manipulated according to
the requirements, where new data can be added or pre-existing data
can be modified.
4. Filtering data: Sometimes datasets contain unwanted rows or
columns, which need to be removed or filtered.
5. Other: After dealing with the raw dataset using the above
functionalities, we get an efficient dataset as per our requirements,
which can then be used for a required purpose such as data analysis,
machine learning, data visualization, model training, etc.
Examples:
Here in Data exploration, we load the data into a dataframe, and then we visualize the data in a
tabular format.
# Import pandas package
import pandas as pd

# Assign data
data = {'Name': ['Jai', 'Princi', 'Gaurav',
                 'Anuj', 'Ravi', 'Natasha', 'Riya'],
        'Age': [17, 17, 18, 17, 18, 17, 17],
        'Gender': ['M', 'F', 'M', 'M', 'M', 'F', 'F'],
        'Marks': [90, 76, 'NaN', 74, 65, 'NaN', 71]}

# Convert into DataFrame
df = pd.DataFrame(data)

# Display data
df
Dealing with missing values in Python
There are NaN values present in the Marks column, i.e. missing values
in the dataframe, which are taken care of in data wrangling by
replacing them with the column mean.
# Compute average
c = avg = 0
for ele in df['Marks']:
    if str(ele).isnumeric():
        c += 1
        avg += ele
avg /= c

# Replace missing values
df = df.replace(to_replace="NaN",
                value=avg)

# Display data
df
Data Replacing in Data Wrangling
• In the Gender column, we can replace the data by categorizing it
into different numbers.
# Categorize gender
df['Gender'] = df['Gender'].map({'M': 0,
                                 'F': 1}).astype(float)

# Display data
df
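An alternative to the manual `map` above is pandas' categorical dtype, which assigns the integer codes automatically (categories are ordered alphabetically, so 'F' gets 0 and 'M' gets 1); a hedged sketch on toy data:

```python
import pandas as pd

gender = pd.Series(["M", "F", "M", "M", "F"])

# Categorical codes are assigned in sorted order: 'F' -> 0, 'M' -> 1.
codes = gender.astype("category").cat.codes
print(codes.tolist())   # [1, 0, 1, 1, 0]
```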
Filtering data in Data Wrangling
• Suppose there is a requirement for the details regarding the name,
gender, and marks of the top-scoring students. Here we need to filter
out the unwanted data using the pandas slicing method.
# Filter top scoring students
df = df[df['Marks'] >= 75].copy()

# Remove age column from filtered DataFrame
df.drop('Age', axis=1, inplace=True)

# Display data
df
Data Wrangling Using Merge Operation
• The merge operation is used to merge two raw datasets into the desired format.
pd.merge(data_frame1, data_frame2, on="field")

• Here field is the name of the column that is the same in both dataframes.
For example, suppose a teacher has two sets of data: the first consists
of details of students, and the second consists of pending-fees status
taken from the accounts office. The teacher will use the merge operation
here to merge the data and give it meaning, so that it can be analyzed
easily; it also saves the teacher the time and effort of merging
manually.
Creating First Dataframe to Perform Merge Operation
using Data Wrangling:
# Import module
import pandas as pd

# Creating DataFrame for student details
details = pd.DataFrame({
    'ID': [101, 102, 103, 104, 105, 106,
           107, 108, 109, 110],
    'NAME': ['Jagroop', 'Praveen', 'Harjot',
             'Pooja', 'Rahul', 'Nikita',
             'Saurabh', 'Ayush', 'Dolly', 'Mohit'],
    'BRANCH': ['CSE', 'CSE', 'CSE', 'CSE', 'CSE',
               'CSE', 'CSE', 'CSE', 'CSE', 'CSE']})

# Printing details
print(details)
Creating Second Dataframe to Perform Merge operation
using Data Wrangling:
# Import module
import pandas as pd

# Creating DataFrame for fees_status
fees_status = pd.DataFrame(
    {'ID': [101, 102, 103, 104, 105,
            106, 107, 108, 109, 110],
     'PENDING': ['5000', '250', 'NIL',
                 '9000', '15000', 'NIL',
                 '4500', '1800', '250', 'NIL']})

# Printing fees_status
print(fees_status)
Data Wrangling Using Merge Operation:

import pandas as pd

# Creating DataFrame
details = pd.DataFrame({
    'ID': [101, 102, 103, 104, 105,
           106, 107, 108, 109, 110],
    'NAME': ['Jagroop', 'Praveen', 'Harjot',
             'Pooja', 'Rahul', 'Nikita',
             'Saurabh', 'Ayush', 'Dolly', 'Mohit'],
    'BRANCH': ['CSE', 'CSE', 'CSE', 'CSE', 'CSE',
               'CSE', 'CSE', 'CSE', 'CSE', 'CSE']})
Data Wrangling Using Merge Operation:
# Creating DataFrame
fees_status = pd.DataFrame(
    {'ID': [101, 102, 103, 104, 105,
            106, 107, 108, 109, 110],
     'PENDING': ['5000', '250', 'NIL',
                 '9000', '15000', 'NIL',
                 '4500', '1800', '250', 'NIL']})

# Merging DataFrames
print(pd.merge(details, fees_status, on='ID'))
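By default `pd.merge` performs an inner join, keeping only keys present in both dataframes; the `how` parameter controls this. A small sketch on toy data (a subset of the IDs above, chosen so the joins differ):

```python
import pandas as pd

students = pd.DataFrame({"ID": [101, 102, 103],
                         "NAME": ["Jagroop", "Praveen", "Harjot"]})
fees = pd.DataFrame({"ID": [101, 103],
                     "PENDING": ["5000", "NIL"]})

# Inner join keeps only IDs present in both frames.
inner = pd.merge(students, fees, on="ID")

# Left join keeps every student; missing fees become NaN.
left = pd.merge(students, fees, on="ID", how="left")
print(left)
```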
Data Analysis
• Data analysis is the technique of collecting, transforming, and
organizing data to make future predictions and informed, data-driven
decisions. It also helps to find possible solutions to a business
problem. There are six steps in data analysis:
• Ask or Specify Data Requirements
• Prepare or Collect Data
• Clean and Process
• Analyze
• Share
• Act or Report
Free Dataset Sources to Use for Data Science Projects

1. Google Cloud Public Datasets
2. Amazon Web Services Open Data Registry
3. Data.gov
4. Kaggle
5. UCI Machine Learning Repository
6. National Center for Environmental Information
7. Global Health Observatory
8. Earthdata
Data Visualization

• In the new era, a lot of data is generated on a daily basis, and
analyzing it for trends and patterns can be difficult when it is in its
raw format. This is where data visualization comes into play: it
provides a good, organized pictorial representation of the data, which
makes it easier to understand, observe, and analyze.
Data Visualization
• Python provides various libraries for visualizing data, each with
different features and support for various types of graphs. In this
tutorial, we will discuss four such libraries.
• Matplotlib
• Seaborn
• Bokeh
• Plotly
import pandas as pd

# Reading the dataset
data = pd.read_csv("tips.csv")

# Printing the top 10 rows
display(data.head(10))
Matplotlib
• Matplotlib is an easy-to-use, low-level data visualization library that is
built on NumPy arrays. It consists of various plots like scatter plot,
line plot, histogram, etc. Matplotlib provides a lot of flexibility.
• To install it, type the below command in the terminal:
• !pip install matplotlib
Scatter Plot
• Scatter plots are used to observe relationships between variables,
using dots to represent them. The scatter() method in the matplotlib
library is used to draw a scatter plot.
Line Chart
• A line chart is used to represent a relationship between two sets of
data, X and Y, on different axes. It is plotted using the plot()
function.
Bar Chart

• A bar plot or bar chart is a graph that represents categories of data
with rectangular bars whose lengths and heights are proportional to the
values they represent. It can be created using the bar() method.
Histogram

• A histogram is basically used to represent data in the form of
groups. It is a type of bar plot where the X-axis represents the bin
ranges and the Y-axis gives information about frequency. The hist()
function is used to compute and create a histogram. If we pass
categorical data to a histogram, it will automatically compute the
frequency of that data, i.e. how often each value occurred.
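The four plot types above can be sketched in one script; the data is synthetic, and the non-interactive `Agg` backend is used so the figure is saved to a file instead of opening a window:

```python
import matplotlib
matplotlib.use("Agg")          # headless backend; saves to file only
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [2, 4, 1, 5, 3]

fig, axes = plt.subplots(2, 2, figsize=(8, 6))

axes[0, 0].scatter(x, y)                         # scatter plot
axes[0, 0].set_title("Scatter")

axes[0, 1].plot(x, y)                            # line chart
axes[0, 1].set_title("Line")

axes[1, 0].bar(["A", "B", "C"], [3, 7, 5])       # bar chart
axes[1, 0].set_title("Bar")

axes[1, 1].hist([1, 1, 2, 3, 3, 3, 4], bins=4)   # histogram
axes[1, 1].set_title("Histogram")

fig.tight_layout()
fig.savefig("plots.png")
```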
Seaborn
• Seaborn is a high-level interface built on top of Matplotlib. It
provides beautiful design styles and color palettes to make graphs more
attractive.
• To install Seaborn, type the below command in the terminal:
!pip install seaborn
Because Seaborn is built on top of Matplotlib, the two can be used
together, and doing so is very simple: we just invoke the Seaborn
plotting function as normal, and then use Matplotlib's customization
functions.
Problem Solving:
Essentials of Python Programming
• Introduction to Python
• Python Data Types and Operators
• Control Flow: Loops and Conditional Statements
• Functions and Modules
• Object-Oriented Programming in Python
Problem Solving:
Fundamentals of NumPy
• NumPy Arrays and Operations
• Mathematical Operations on NumPy Arrays
• Array Broadcasting
• Random Sampling with NumPy
• Array Slicing
• Array Indexing
Problem Solving:
Working with Pandas
• Data Frames and Series
• Data Cleaning and Manipulation
• Merging, Joining, and Grouping Data
• Handling Missing Data
Problem Solving:
Data Wrangling
• Data Transformation Techniques
• Data Aggregation and Grouping
• Time Series Data Handling
Problem Solving:
Data Visualization
• Introduction to Data Visualization Tools
• Matplotlib Basics
• Seaborn for Statistical Plots
• Plotly for Interactive Plots
• Customizing Plots and Charts
