ds with py
UNIT-1
Why is data science seen as a novel trend within business
reviews, in technology blogs, and at academic conferences?
• The novelty of data science is not rooted in the latest scientific
knowledge, but in a disruptive change in our society that has been
caused by the evolution of technology: datification.
• Datification is the process of rendering into data aspects of the world
that have never been quantified before.
• At the personal level, the list of datified concepts is very long and still
growing: business networks, the lists of books we are reading, the
films we enjoy, the food we eat, our physical activity, our purchases,
our driving behavior, and so on. Even our thoughts are datified when
we publish them on our favorite social network; and in a not so distant
future, your gaze could be datified by wearable vision registering
devices.
• However, datification is not the only ingredient of the data science
revolution. The other ingredient is the democratization of data
analysis. Large companies such as Google, Yahoo, IBM, or SAS
were the only players in this field when data science had no name.
• Today, the analytical gap between those companies and the rest of the
world (companies and people) is shrinking. Access to cloud computing
allows any individual to analyze huge amounts of data in short periods
of time. Analytical knowledge is free and most of the crucial
algorithms that are needed to create a solution can be found, because
open-source development is the norm in this field. As a result, the
possibility of using rich data to take evidence-based decisions is open
to virtually any person or company.
Research Scope:
• Data Science enables innovation in research by offering tools to
handle massive datasets, apply advanced analytics, and develop
predictive models to gain new insights.
Introduction to Data Science
What is Data Science?
• Data science is commonly defined as a methodology by which
actionable insights can be inferred from data. Data Science is an
interdisciplinary field that uses scientific methods, algorithms, and
systems to extract insights from structured and unstructured data. It
plays a pivotal role in solving real-world problems, from healthcare
diagnostics to financial fraud detection.
Key Components of Data Science:
1. Data Collection: Gathering data from diverse sources such as sensors,
social media, and databases.
2. Data Cleaning: Removing inconsistencies, handling missing values,
and preparing data for analysis.
3. Data Analysis: Applying statistical and computational techniques to
uncover patterns and relationships.
4. Data Visualization: Representing data insights through graphs,
charts, and dashboards for better understanding.
5. Machine Learning: Leveraging algorithms to predict outcomes and
automate decision-making.
• But on their website, there is plenty of raw data from different users.
Here the concept of data munging, or data wrangling, is used. Data
wrangling is not done by the system itself; it is done by data
scientists. The data scientist wrangles the data so that, for example,
motivational books are sorted by how many copies are sold or how highly
they are rated, or grouped by which other books users buy together with
them. On the basis of that, the new user makes a choice.
Data Wrangling in Python
• Data wrangling is a crucial topic for data science and data analysis.
The pandas library is used for data wrangling in Python. Pandas is an
open-source library developed specifically for data analysis
and data science. It is used for operations such as sorting, filtering,
and grouping data.
• Data wrangling in Python deals with the following functionalities:
1. Data exploration: The data is studied, analyzed, and
understood by visualizing representations of it.
2. Dealing with missing values: Most large datasets contain missing
values (NaN). They need to be handled by replacing them with the
mean, the mode (the most frequent value of the column), or simply
by dropping the rows that contain a NaN value.
3. Reshaping data: The data is manipulated according to
the requirements; new data can be added or pre-existing data
can be modified.
4. Filtering data: Sometimes datasets contain unwanted
rows or columns, which need to be removed or filtered out.
5. Other: After processing the raw dataset with the above
functionalities, we get a dataset that fits our requirements,
which can then be used for purposes such as data
analysis, machine learning, data visualization, or model training.
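The reshaping step described in item 3 can be sketched as follows. This is only an illustration: the sample data, the column names, and the cap of 85 are invented here, and melt() is just one of several pandas functions that could be used for reshaping.

```python
import pandas as pd

# Hypothetical sample data, invented for this illustration
df = pd.DataFrame({
    'Name': ['A', 'B', 'C'],
    'Maths': [80, 65, 90],
    'Science': [70, 85, 60],
})

# Add new data: a derived column holding each student's total
df['Total'] = df['Maths'] + df['Science']

# Modify pre-existing data: cap the Maths marks at 85
df['Maths'] = df['Maths'].clip(upper=85)

# Reshape from wide to long format: one row per (Name, Subject) pair
long_df = df.melt(id_vars='Name', value_vars=['Maths', 'Science'],
                  var_name='Subject', value_name='Marks')
print(long_df)
```

The long format is often more convenient for grouping and plotting, while the wide format is easier to read as a table.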
Examples:
Here, in data exploration, we load the data into a dataframe and then display it in
tabular format.
# Import pandas package
import pandas as pd
# Assign data (missing marks are recorded as the string 'NaN')
data = {'Name': ['Jai', 'Princi', 'Gaurav',
                 'Anuj', 'Ravi', 'Natasha', 'Riya'],
        'Age': [17, 17, 18, 17, 18, 17, 17],
        'Gender': ['M', 'F', 'M', 'M', 'M', 'F', 'F'],
        'Marks': [90, 76, 'NaN', 74, 65, 'NaN', 71]}
# Convert the dictionary into a dataframe
df = pd.DataFrame(data)
# Display data
df
Dealing with missing values in Python
The Marks column contains NaN values, which represent missing values in the
dataframe. Data wrangling takes care of them by replacing them with the
column mean.
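# Compute the average of the numeric marks
c = avg = 0
for ele in df['Marks']:
    if str(ele).isnumeric():
        c += 1
        avg += ele
avg /= c
# Replace the missing values with the average
df = df.replace(to_replace='NaN', value=avg)
# Display data
df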
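The same idea can be written more compactly with built-in pandas functions. The sketch below is an alternative, not the approach used in these slides: it assumes the missing marks are stored as the string 'NaN', as in the example data, and uses pd.to_numeric(), fillna(), and dropna().

```python
import pandas as pd

# Marks column from the example, with missing values stored as the string 'NaN'
marks = pd.Series([90, 76, 'NaN', 74, 65, 'NaN', 71], name='Marks')

# Convert to numbers; entries that cannot be parsed become real NaN values
numeric = pd.to_numeric(marks, errors='coerce')

# Option 1: replace missing values with the column mean (mean() skips NaN)
filled = numeric.fillna(numeric.mean())

# Option 2: drop the entries that have a missing value
dropped = numeric.dropna()

print(filled)
print(dropped)
```

Filling keeps every row at the cost of inventing values, whereas dropping keeps only observed values but shrinks the dataset; which one is appropriate depends on the analysis.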
Data Replacing in Data Wrangling
• In the Gender column, we can replace the values with
numeric category codes.
# Categorize gender
df['Gender'] = df['Gender'].map({'M': 0,
                                 'F': 1}).astype(float)
# Display data
df
Filtering data in Data Wrangling
• Suppose there is a requirement for the name, gender, and marks of
the top-scoring students. Here we need to remove the unwanted
data using the pandas slicing method.
# Filter top-scoring students
df = df[df['Marks'] >= 75].copy()
# Drop the Age column, which is not required
df = df.drop(columns='Age')
# Display data
df
Data Wrangling Using Merge Operation
• The merge operation combines two raw datasets into the desired format.
pd.merge(data_frame1, data_frame2, on="field")
• Here, field is the name of a column that is common to both dataframes.
For example, suppose a teacher has two sets of data: the first contains the
details of students, and the second contains the pending fees status, taken
from the accounts office. The teacher can use the merge operation to combine
the two and give the data meaning. This makes the data easier to analyze and
saves the teacher the time and effort of merging manually.
Creating First Dataframe to Perform Merge Operation
using Data Wrangling:
# Import module
import pandas as pd
# Creating Dataframe
details = pd.DataFrame({
    'ID': [101, 102, 103, 104, 105,
           106, 107, 108, 109, 110],
    'NAME': ['Jagroop', 'Praveen', 'Harjot',
             'Pooja', 'Rahul', 'Nikita',
             'Saurabh', 'Ayush', 'Dolly', 'Mohit'],
    'BRANCH': ['CSE', 'CSE', 'CSE', 'CSE', 'CSE',
               'CSE', 'CSE', 'CSE', 'CSE', 'CSE']})
# Printing details
print(details)
Creating Second Dataframe to Perform Merge Operation
using Data Wrangling:
# Creating Dataframe
fees_status = pd.DataFrame({
    'ID': [101, 102, 103, 104, 105,
           106, 107, 108, 109, 110],
    'PENDING': ['5000', '250', 'NIL',
                '9000', '15000', 'NIL',
                '4500', '1800', '250', 'NIL']})
# Printing fees_status
print(fees_status)
Data Wrangling Using Merge Operation:
# Merging Dataframe
print(pd.merge(details, fees_status, on='ID'))
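By default, pd.merge() performs an inner join, so only IDs present in both dataframes appear in the result. The sketch below uses smaller, hypothetical versions of the two dataframes (ID 104 has no fee record and ID 103 has no student details) to show how the how and indicator parameters change what is kept.

```python
import pandas as pd

# Hypothetical, shortened versions of the two dataframes
details = pd.DataFrame({'ID': [101, 102, 104],
                        'NAME': ['Jagroop', 'Praveen', 'Pooja']})
fees_status = pd.DataFrame({'ID': [101, 102, 103],
                            'PENDING': ['5000', '250', 'NIL']})

# Inner join (the default): only IDs found in both dataframes
inner = pd.merge(details, fees_status, on='ID')

# Left join: keep every student; missing fee records become NaN.
# indicator=True adds a _merge column saying where each row came from.
left = pd.merge(details, fees_status, on='ID', how='left', indicator=True)

print(inner)
print(left)
```

A left join is a reasonable choice when every student must appear in the report even if the accounts office has no entry for them.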
Data Analysis
• Data Analysis is the technique of collecting, transforming, and
organizing data to make future predictions and informed data-driven
decisions. It also helps to find possible solutions for a business
problem. There are six steps for Data Analysis. They are:
• Ask or Specify Data Requirements
• Prepare or Collect Data
• Clean and Process
• Analyze
• Share
• Act or Report
Free Dataset Sources to Use for Data Science Projects
• A bar plot or bar chart is a graph that represents categories of data
with rectangular bars whose lengths or heights are proportional to
the values they represent. It can be created using
the bar() method.
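A minimal sketch of a bar plot follows. The text does not say which library's bar() method is meant; this example assumes matplotlib's Axes.bar(), and the branch names and student counts are invented for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

# Hypothetical category data, invented for this illustration
branches = ['CSE', 'ECE', 'ME']
students = [120, 80, 60]

fig, ax = plt.subplots()

# One bar per category; each bar's height is proportional to its value
bars = ax.bar(branches, students)

ax.set_xlabel('Branch')
ax.set_ylabel('Number of students')
ax.set_title('Students per branch')

# Save the figure to a file (the file name is arbitrary)
fig.savefig('bar_plot.png')
```

In an interactive session, the backend line can be dropped and plt.show() used in place of savefig().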
Histogram