
Group 03 BI Assignment

This document summarizes a group project analyzing used vehicle sales data from Craigslist. The group extracted data from the Austin Craigslist Cars and Trucks dataset, which contains information on vehicles listed in that area. They preprocessed the data by dropping unnecessary columns, identifying and removing columns with many null values, separating data into categorical and numerical types, and filtering the data. Visualizations were then created in Tableau to develop a dashboard presenting descriptive analytics insights from the cleaned data.

Uploaded by

Jay Rajapaksha

Bachelor of Business Science Degree Program

Year IV – Semester VIII

DA 4120 – Business Intelligence

Group Assignment
Reselling of used vehicles analysis

Submitted by:
D. D. R. R. Amaranath 196003T
J. D. S. K. Rajapaksha 196070T
K. D. R. Rajapakshe 196072C
K. S. Wishmitha 196095A

Supervisory Lecturer
Mr. Maninda Edirisooriya

Introduction

In today's digital age, the used vehicle trading industry has experienced significant growth,
providing consumers with numerous options to buy and sell used vehicles. Among the various
platforms facilitating these transactions, Craigslist has emerged as a prominent player. This
report aims to provide an analysis of Craigslist as a competitor in the used vehicle trading
market.

By leveraging big data analysis techniques, we have delved into various aspects of Craigslist's
operations, aiming to gain insights into its market presence, user behavior, and potential impact
on the used vehicle trading industry. Our analysis considers factors such as the volume and
variety of vehicle listings, pricing trends, and vehicle condition.

By examining these factors, we aim to provide a comprehensive understanding of Craigslist's
positioning and competitive advantage in the used vehicle trading market. This analysis will
assist our company in identifying potential strategies and areas of improvement to enhance our
own operations and compete effectively in the market.

Dataset

The Austin Reese Craigslist Cars and Trucks dataset, which has 26 columns, provides a
comprehensive overview of the automotive market. By looking closely at the dataset's columns,
we can learn important things about the cars being listed in this area. These insights help us
understand different aspects of the car market in this region.

The columns in this dataset cover a lot of information about the vehicles, such as their make,
model, year, condition, price, location, and other attributes.

Exploring some columns of the dataset

Make and Model: These columns tell us the brand and specific model of each vehicle.

Year: This column shows the manufacturing year of the vehicles.

Condition: This column indicates whether the vehicles are new or used, or fall into specific
categories such as excellent, good, fair, or salvage.

Price: Displays the listed prices for each vehicle.

Location: This column tells us where the vehicles are listed in Austin. This gives us geographic
insights, showing us how the listings are spread across different neighborhoods and indicating
which types of vehicles are more common in specific areas.

Mileage: The mileage column shows the distance traveled by each vehicle. This is important
information for buyers, as it helps them gauge the wear and tear on a particular vehicle.

Fuel Type: This column reveals the type of fuel used by each vehicle, such as gasoline, diesel,
hybrid, or electric. It allows us to examine the popularity of different fuel types and understand
the preferences of potential buyers.

Transmission: The transmission column tells us whether the vehicles have automatic or manual
transmission. By analyzing this data, we can see how the distribution of transmission types
relates to other variables in the dataset.

Title Status: The title status column provides information about the legal status of each vehicle's
title, including designations like clean title, salvage title, or rebuilt title. This helps us understand
the history and condition of the vehicles in the dataset.

Link to download the dataset

Creating the environment

We used Google Cloud Platform (GCP) for our big data analysis because it handles large amounts of
data efficiently. This section describes how we created our environment and carried out the
analysis.

First, we created a bucket in GCP to store our dataset and Python scripts. After storing the
necessary files, the next step was to create the cluster with the specifications given below.

We created the cluster with 4 CPUs, 16 GB of memory, and a 50 GB SSD for the master node, and
2 CPUs, 8 GB of memory, and a 50 GB SSD for the worker node.

In the meantime, we generated a sample dataset by randomly sampling our original dataset, and
then wrote the Python script containing the preprocessing tasks to be applied to our dataset.
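
The random sampling step can be sketched as follows. This is a minimal plain-Python illustration, not the group's actual script: the function name `sample_rows`, the 10% fraction, the seed, and the toy rows are all assumptions made for the example, since the report does not state the sample size used.

```python
import random

def sample_rows(rows, fraction, seed=42):
    """Return a random sample of `rows`, drawn without replacement.

    `fraction` and `seed` are illustrative assumptions; the report does
    not say how large the sample was.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    if not rows:
        return []
    rng = random.Random(seed)  # a fixed seed keeps the sample reproducible
    k = min(len(rows), max(1, round(len(rows) * fraction)))
    return rng.sample(rows, k)

# Toy rows standing in for vehicle listings (hypothetical data).
listings = [{"id": i, "price": 1000.0 + i} for i in range(100)]
sample = sample_rows(listings, fraction=0.1)
print(len(sample))
```

Working on a small sample like this makes it cheaper to debug the preprocessing script before it is run against the full dataset on the cluster.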

After the cluster was created, we pointed our Python script at the main dataset and submitted the
job to apply the preprocessing tasks. Once preprocessing had finished, the cleaned dataset was
stored as a CSV in BigQuery. We used BigQuery because it connects directly to Tableau, our data
visualization environment.

Finally, the data visualizations were created in Tableau by connecting it to BigQuery.

The following chapters explain the preprocessing tasks and the data visualizations in more
detail.

Preprocessing

The data was preprocessed using PySpark. After importing the relevant PySpark libraries and
functions, we created a SparkSession, which provides the entry point for communicating with Spark
and lets the preprocessing run without interruption. This allowed us to use distributed computing
capabilities for both data processing and analysis.

Dropping of columns

The first preprocessing step was to drop 10 columns that contained unnecessary details, such as
region URL, image URL, State, and VIN. These columns were removed because they would not add any
insight to the analysis.

After dropping these columns, the next step was to identify columns that had more than 50% null
values or contained only zeros. Two columns, size and country, exceeded this null-value limit and
were also dropped from the dataset.
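
The 50% rule can be illustrated with a short sketch. It is written in plain Python rather than PySpark so that it runs without a cluster, and it is not the group's script. The helper names, the toy rows, the choice to count both `None` and empty strings as null, and the reading of the rule as "more than 50% null, or only zeros" are assumptions.

```python
def null_fraction(rows, column):
    """Fraction of rows in which `column` is missing (None or empty string)."""
    missing = sum(1 for row in rows if row.get(column) in (None, ""))
    return missing / len(rows)

def drop_sparse_columns(rows, threshold=0.5):
    """Drop columns that are more than `threshold` null or that hold only zeros."""
    to_drop = []
    for col in rows[0]:
        present = [row.get(col) for row in rows if row.get(col) not in (None, "")]
        all_zero = bool(present) and all(value == 0 for value in present)
        if null_fraction(rows, col) > threshold or all_zero:
            to_drop.append(col)
    cleaned = [{k: v for k, v in row.items() if k not in to_drop} for row in rows]
    return cleaned, to_drop

# Toy rows (hypothetical values); `size` and `country` are mostly empty.
rows = [
    {"price": 5000, "size": None, "country": None, "fuel": "gas"},
    {"price": 7500, "size": None, "country": None, "fuel": "diesel"},
    {"price": None, "size": "compact", "country": None, "fuel": "gas"},
    {"price": 3200, "size": None, "country": None, "fuel": None},
]
cleaned, dropped = drop_sparse_columns(rows)
print(dropped)
```

In this toy input, `price` and `fuel` are each 25% null and are therefore kept, while `size` (75% null) and `country` (100% null) are removed.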

Splitting data as categorical and numerical

The third preprocessing step was to separate the columns into categorical and numerical data.
Numerical columns with more than 10% null values were dropped. For numerical columns with less
than 10% null values, the nulls were replaced with the median value of that column.

Similarly, in categorical columns with less than 10% null values, the nulls were replaced by the
mode of the column.
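
The split and the imputation rules above can be sketched as follows, again in plain Python rather than PySpark and only as an illustration. The function names, the toy columns (`odometer`, `lat`, `fuel`), and two boundary choices are assumptions: a column with exactly 10% nulls is imputed, and a categorical column above the threshold is left untouched because the report does not say how such columns were handled.

```python
from statistics import median, mode

def is_number(value):
    return isinstance(value, (int, float)) and not isinstance(value, bool)

def split_columns(rows):
    """Classify columns as numerical or categorical from their non-null values."""
    numerical, categorical = [], []
    for col in rows[0]:
        present = [row[col] for row in rows if row[col] is not None]
        if present and all(is_number(v) for v in present):
            numerical.append(col)
        else:
            categorical.append(col)
    return numerical, categorical

def impute(rows, threshold=0.10):
    """Drop sparse numerical columns; fill the rest with median or mode."""
    numerical, categorical = split_columns(rows)
    result = [dict(row) for row in rows]  # copy so the input is not modified
    for col in numerical + categorical:
        present = [row[col] for row in rows if row[col] is not None]
        null_share = (len(rows) - len(present)) / len(rows)
        if null_share == 0:
            continue
        if null_share > threshold:
            if col in numerical:  # numerical column with too many nulls: drop it
                for row in result:
                    del row[col]
            continue  # sparse categorical column: left as-is (assumption)
        fill = median(present) if col in numerical else mode(present)
        for row in result:
            if row[col] is None:
                row[col] = fill
    return result

# Toy data (hypothetical): 20 rows with a few nulls injected.
rows = [{"odometer": 10000.0 * (i + 1), "lat": 30.0 + i,
         "fuel": "diesel" if i % 4 == 0 else "gas"} for i in range(20)]
rows[3]["odometer"] = None          # 5% null  -> filled with the median
for i in range(5):
    rows[i]["lat"] = None           # 25% null -> column dropped
rows[7]["fuel"] = None              # 5% null  -> filled with the mode
cleaned = impute(rows)
print(cleaned[3]["odometer"], cleaned[7]["fuel"], "lat" in cleaned[0])
```

The median is used for numerical columns because it is less sensitive than the mean to extreme values, which are common in listed prices and odometer readings.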

Filtering

In this step, the region column was filtered to keep only rows whose value was not equal to
"low miles". The year, price, and odometer columns were then cast to float.

Once preprocessing was complete, the final dataset was exported as a CSV file.
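
A minimal plain-Python sketch of the filtering, casting, and export steps is given below. It is an illustration under assumptions, not the PySpark job itself: the helper names, the toy rows, and the choice to turn unparseable values into `None` (written as an empty field in the CSV) are all hypothetical.

```python
import csv
import io

def to_float(value):
    """Cast to float; return None when the value cannot be parsed (sketch choice)."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None

def filter_and_cast(rows, bad_region="low miles",
                    float_columns=("year", "price", "odometer")):
    """Keep rows whose region is not `bad_region` and cast the given columns."""
    cleaned = []
    for row in rows:
        if row.get("region") == bad_region:
            continue
        new_row = dict(row)
        for col in float_columns:
            new_row[col] = to_float(row.get(col))
        cleaned.append(new_row)
    return cleaned

def to_csv_text(rows):
    """Serialize the rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

# Toy rows (hypothetical); the second row has "low miles" in the region column.
rows = [
    {"region": "austin", "year": "2015", "price": "12500", "odometer": "60321"},
    {"region": "low miles", "year": "clean", "price": "gas", "odometer": "automatic"},
    {"region": "dallas", "year": "2009", "price": "4300", "odometer": ""},
]
cleaned = filter_and_cast(rows)
csv_text = to_csv_text(cleaned)
print(csv_text)
```

A row with "low miles" in the region column most likely indicates misaligned fields, so removing such rows before the cast keeps non-numeric text out of the year, price, and odometer columns.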

Visualizations

Effective visualization of data plays a significant role when presenting analytical findings to a
broad audience. It makes the content understandable to individuals from different domains and
gives users the ability to gain insights easily.

Tableau was the visual analytics platform used to visualize the findings about the used cars.
The CSV file obtained after preprocessing was connected to Tableau through BigQuery. Suitable
visualizations were then used to develop a dashboard providing descriptive analytics about the
dataset.
