PDF Experiments-1 DADV
• Import all the libraries required for the analysis: data loading, statistical
analysis, visualization, data transformation, merges and joins, etc.
• The dataset is available at
https://www.kaggle.com/datasets/sukhmanibedi/cars4u/data
Pandas and NumPy are used for data manipulation and
numerical calculations.
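A minimal sketch of the import step, assuming only pandas and NumPy are needed at this point. Plotting libraries would be imported the same way if they are installed. The array values below are made up purely to check that both libraries load and work together.

```python
import numpy as np
import pandas as pd

# Build a tiny DataFrame from a NumPy array (made-up values)
# to confirm that both libraries are importable and interoperate.
values = np.array([[1.0, 2.5], [2.0, 3.5]])
demo = pd.DataFrame(values, columns=["a", "b"])

print(demo.dtypes)
```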
• The pandas library can load data into a DataFrame from many sources, including
.csv, .json, .xlsx, .pickle, .html, and .txt files, as well as SQL queries.
• Most tabular data is distributed as CSV files, a popular format that is easy to
work with. The read_csv() function reads such a file into a pandas DataFrame.
• This example uses a used-car price dataset. The goal of the EDA is to identify
the factors that influence a car's price. The data is stored in a DataFrame
named data.
data.shape
OUTPUT: (27, 7)
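The loading step and the shape check can be sketched as follows. This is an illustration under stated assumptions: the CSV text is a tiny made-up stand-in for the downloaded file, and the column names (S.No, Name, Year, Price) are assumed from the columns mentioned in these notes. Because the sample is so small, the printed shape differs from the one reported above.

```python
import io

import pandas as pd

# Hypothetical stand-in for the downloaded CSV file; the rows are made up.
csv_text = """S.No,Name,Year,Price
0,Maruti Wagon R,2010,1.75
1,Hyundai Creta,2015,12.5
2,Honda Jazz,2011,4.5
"""

# With a real file, pass its path instead of the StringIO object.
data = pd.read_csv(io.StringIO(csv_text))

# shape is a (rows, columns) tuple
print(data.shape)
```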
Step 3: Data Reduction
• Some columns or variables can be dropped if they do not add value to our analysis.
• In our dataset, the column S.No contains only ID values, which we assume have
no predictive power for the dependent variable.
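Dropping an ID column might look like the sketch below. The DataFrame is a small made-up sample, and the exact column label (S.No) is assumed from the text; check data.columns on the real dataset before dropping.

```python
import pandas as pd

# Made-up sample with an ID column that carries no information.
data = pd.DataFrame(
    {
        "S.No": [0, 1, 2],
        "Name": ["Maruti Wagon R", "Hyundai Creta", "Honda Jazz"],
        "Year": [2010, 2015, 2011],
        "Price": [1.75, 12.5, 4.5],
    }
)

# drop() returns a new DataFrame; reassign to keep the result.
data = data.drop(columns=["S.No"])

print(data.columns.tolist())
```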
Step 4: Feature Engineering
• Feature engineering refers to the process of using domain knowledge to select and
transform the most relevant variables from raw data when creating a predictive
model using machine learning or statistical modeling.
• The main goal of Feature engineering is to create meaningful data from raw data.
Step 5: Creating Features
• We will work with the variables Year and Name in our dataset. In the sample
data, the column “Year” holds the manufacturing year of the car.
• The age of the car is a contributing factor to its price, but age is hard to
read directly from a manufacturing year, so we derive an age feature from it.
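One way to derive such features is sketched below. The reference year, the new column names Car_Age and Brand, and the sample rows are all assumptions made for illustration. Taking the first word of Name as the brand is only one possible use of that column and will not be right for multi-word brand names.

```python
import pandas as pd

# Made-up sample rows.
data = pd.DataFrame(
    {
        "Name": ["Maruti Wagon R", "Hyundai Creta", "Honda Jazz"],
        "Year": [2010, 2015, 2011],
    }
)

# Assumed reference year; a fixed value keeps the result reproducible.
REFERENCE_YEAR = 2024

# Age of the car in years, relative to the reference year.
data["Car_Age"] = REFERENCE_YEAR - data["Year"]

# First word of the name, used here as a rough brand label.
data["Brand"] = data["Name"].str.split().str[0]

print(data)
```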
• EDA can be leveraged to check for outliers, patterns, and trends in the given
data.
• EDA helps to find meaningful patterns in data.
• EDA provides in-depth insights into the data sets to solve our business
problems.
• EDA gives a clue to impute missing values in the dataset.
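As a hedged illustration of the last point, counting missing values per column is a common first step before deciding how to impute them. The sample values and the Seats column are made up for this sketch.

```python
import numpy as np
import pandas as pd

# Made-up sample with a few missing entries.
data = pd.DataFrame(
    {
        "Year": [2010, 2015, 2011, 2012],
        "Seats": [5.0, np.nan, 5.0, 7.0],
        "Price": [1.75, 12.5, np.nan, 6.0],
    }
)

# Number of missing values in each column.
missing_counts = data.isnull().sum()

# Percentage of missing values in each column.
missing_share = data.isnull().mean() * 100

print(missing_counts)
print(missing_share)
```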
Step 7: Statistics Summary
• A statistics summary gives a quick and simple description of the data.
• It can include count, mean, standard deviation, median, mode,
minimum value, maximum value, range, etc.
• A statistics summary gives a high-level view that helps identify
outliers, data entry errors, and the shape of the distribution,
such as whether the data is normally distributed or left/right skewed.
Step 8: Statistics Summary…
In Python, this can be achieved using describe().
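A minimal sketch on made-up numbers follows. For numeric columns, describe() reports count, mean, standard deviation, minimum, quartiles (the 50% row is the median), and maximum. It does not report mode or range, so the sketch computes the range separately and uses skew() as a rough check of left/right skew.

```python
import pandas as pd

# Made-up sample values.
data = pd.DataFrame(
    {
        "Year": [2010, 2015, 2011, 2012],
        "Price": [1.75, 12.5, 4.5, 6.0],
    }
)

# Summary statistics for every numeric column.
summary = data.describe()
print(summary)

# Range is not part of describe(); derive it from max and min.
price_range = data["Price"].max() - data["Price"].min()

# Positive skew suggests a longer right tail, negative a longer left tail.
price_skew = data["Price"].skew()
print(price_range, price_skew)
```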
• Use pd.read_csv() for CSV files; similar functions exist for other data
formats (e.g., pd.read_excel() for .xlsx, pd.read_json() for .json).
3. Initial Inspection:
• Get an overview of the data using df.head(), df.tail(), and df.info().
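The inspection calls can be sketched on a small made-up DataFrame as follows. The column names and values are assumptions for illustration only.

```python
import pandas as pd

# Made-up sample rows.
df = pd.DataFrame(
    {
        "Name": ["Car A", "Car B", "Car C", "Car D", "Car E"],
        "Year": [2010, 2015, 2011, 2012, 2018],
        "Price": [1.75, 12.5, 4.5, 6.0, 9.0],
    }
)

# First and last rows; both default to 5 rows when no argument is given.
first_rows = df.head(2)
last_rows = df.tail(2)
print(first_rows)
print(last_rows)

# Column names, non-null counts, and dtypes (printed directly by info()).
df.info()
```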