FDS Practical 2

Uploaded by

federerroy01

Aim- Categorization and implementation of different data formats for data analysis.

Description
In data analysis, choosing and handling the right data format is crucial for
efficient processing, storage, and retrieval of data. Each format serves
different needs, so the choice depends on the structure of the data and the
requirements of the analysis.

Common data formats include:

• CSV (Comma-Separated Values): This format is widely used for simple tabular
data, making it easy to read and write. CSV files are lightweight and can be
easily imported into data analysis tools, but they lack support for hierarchical
or nested data structures.

• JSON (JavaScript Object Notation): Ideal for representing hierarchical or
structured data, JSON is commonly used for web applications and APIs. Its
readability and flexibility make it suitable for transmitting data between a
server and a web application.

• Excel: Known for its rich features, Excel is used for complex spreadsheets that
may include multiple sheets, charts, and advanced formulas. It is popular
among business analysts for data manipulation and visualization.

• Parquet: This columnar storage format is designed for big data processing,
enabling efficient compression and encoding schemes. Parquet files are
particularly suited for analytical queries and are often used with big data
frameworks like Apache Spark and Hadoop.

• SQL Databases: Relational databases store data in structured formats with
predefined schemas, making them suitable for complex queries and
transactions. SQL databases are essential for applications requiring data
integrity and consistency.
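
The contrast between flat CSV rows, nested JSON objects, and a schema-bound SQL table can be sketched with the Python standard library alone. The sample records, the `people` table, and all variable names below are made up for illustration and are not part of the datasets used in this practical.

```python
import csv
import io
import json
import sqlite3

# Made-up sample records (assumptions for illustration only).
csv_text = "id,name,language\n1,Ada,English\n2,Linus,Finnish\n"
json_text = '[{"id": 1, "name": "Ada", "skills": {"primary": "math", "years": 10}}]'

# CSV: every row is a flat mapping of column name -> string value.
csv_rows = list(csv.DictReader(io.StringIO(csv_text)))

# JSON: values may be nested objects or lists, and numbers keep their type.
json_rows = json.loads(json_text)

# SQL: the schema is declared up front and queries run inside the database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT NOT NULL, language TEXT)"
)
conn.executemany(
    "INSERT INTO people (id, name, language) VALUES (?, ?, ?)",
    [(int(r["id"]), r["name"], r["language"]) for r in csv_rows],
)
sql_rows = conn.execute(
    "SELECT name FROM people WHERE language = ?", ("Finnish",)
).fetchall()
conn.close()

print(csv_rows[0]["id"], type(csv_rows[0]["id"]).__name__)  # CSV values are text
print(json_rows[0]["skills"]["primary"])                    # nested JSON lookup
print(sql_rows)                                             # rows filtered by SQL
```

CSV values arrive as strings and need explicit conversion, JSON keeps nesting and basic types, and the SQL table rejects rows that violate its declared schema (for example a missing `name`).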
SOURCE CODE: -
import pandas as pd

# 1. Reading data from different formats

# Reading from a CSV file
csv_data = pd.read_csv('titanic_train.csv')

# Reading from an Excel file
excel_data = pd.read_excel('file_example_XLSX_100.xlsx', sheet_name='Sheet1')

# Reading from a JSON file
json_data = pd.read_json('test.json')

# 2. Categorizing data by its data types (Numerical, Categorical, DateTime, etc.)

def categorize_data(df):
    """Print the column names of df grouped by dtype family."""
    numerical = df.select_dtypes(include=['int64', 'float64'])
    # Text columns are stored with the generic 'object' dtype
    categorical = df.select_dtypes(include=['object'])
    # Named datetime_cols so it is not confused with the datetime module
    datetime_cols = df.select_dtypes(include=['datetime64'])

    print(f"Numerical Columns:\n{numerical.columns}\n")
    print(f"Categorical Columns:\n{categorical.columns}\n")
    print(f"DateTime Columns:\n{datetime_cols.columns}\n")

# Example of applying this function
print("Categorizing CSV Data:")
categorize_data(csv_data)

print("Categorizing Excel Data:")
categorize_data(excel_data)

print("Categorizing JSON Data:")
categorize_data(json_data)

# 3. Basic Data Analysis

# Checking for missing values (count of nulls per column)
print("\nMissing values in CSV Data:\n", csv_data.isnull().sum())
print("\nMissing values in Excel Data:\n", excel_data.isnull().sum())
print("\nMissing values in JSON Data:\n", json_data.isnull().sum())

# Descriptive statistics (numeric columns only by default)
print("\nDescriptive statistics of CSV Data:\n", csv_data.describe())
print("\nDescriptive statistics of Excel Data:\n", excel_data.describe())
print("\nDescriptive statistics of JSON Data:\n", json_data.describe())


OUTPUT: -
Categorizing CSV Data:
Numerical Columns:
Index(['PassengerId', 'Survived', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare'], dtype='object')

Categorical Columns:
Index(['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked'], dtype='object')

DateTime Columns:
Index([], dtype='object')

Categorizing Excel Data:
Numerical Columns:
Index([0, 'Age', 'Id'], dtype='object')

Categorical Columns:
Index(['First Name', 'Last Name', 'Gender', 'Country', 'Date'], dtype='object')

DateTime Columns:
Index([], dtype='object')

Categorizing JSON Data:
Numerical Columns:
Index(['version'], dtype='object')

Categorical Columns:
Index(['name', 'language', 'id', 'bio'], dtype='object')

DateTime Columns:
Index([], dtype='object')

Missing values in CSV Data:
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

Missing values in Excel Data:
0             0
First Name    0
Last Name     0
Gender        0
Country       0
Age           0
Date          0
Id            0
dtype: int64

Missing values in JSON Data:
name        0
language    0
id          0
bio         0
version     0
dtype: int64

Descriptive statistics of CSV Data:
       PassengerId    Survived      Pclass         Age       SibSp  \
count   891.000000  891.000000  891.000000  714.000000  891.000000
mean    446.000000    0.383838    2.308642   29.699118    0.523008
std     257.353842    0.486592    0.836071   14.526497    1.102743
min       1.000000    0.000000    1.000000    0.420000    0.000000
25%     223.500000    0.000000    2.000000   20.125000    0.000000
50%     446.000000    0.000000    3.000000   28.000000    0.000000
75%     668.500000    1.000000    3.000000   38.000000    1.000000
max     891.000000    1.000000    3.000000   80.000000    8.000000

            Parch        Fare
count  891.000000  891.000000
mean     0.381594   32.204208
std      0.806057   49.693429
min      0.000000    0.000000
25%      0.000000    7.910400
50%      0.000000   14.454200
75%      0.000000   31.000000
max      6.000000  512.329200

Descriptive statistics of Excel Data:
                0         Age           Id
count  100.000000  100.000000   100.000000
mean    50.500000   33.260000  4717.720000
std     29.011492    8.391458  2379.081421
min      1.000000   21.000000  1258.000000
25%     25.750000   26.000000  2587.000000
50%     50.500000   32.000000  3574.000000
75%     75.250000   38.000000  6540.000000
max    100.000000   58.000000  9654.000000


Descriptive statistics of JSON Data:
          version
count  197.000000
mean     5.605838
std      2.590350
min      1.010000
25%      3.600000
50%      5.360000
75%      7.860000
max      9.990000
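
In the Excel output above, the Date column is listed under Categorical Columns and the DateTime list is empty, which suggests the dates were read as plain text. The following is a minimal sketch of one way to handle this, not a change to the practical's code. It assumes the dates are day/month/year strings; the sample rows and the format string are assumptions and may not match the actual file.

```python
import pandas as pd

# Made-up rows standing in for the Excel sheet; the real file may differ.
df = pd.DataFrame({
    "Age": [32, 25, 36],
    "Country": ["United States", "Great Britain", "France"],
    "Date": ["15/10/2017", "16/08/2016", "21/05/2015"],  # assumed day/month/year text
})

# Before conversion, no column has a datetime dtype.
before = list(df.select_dtypes(include=["datetime64"]).columns)

# Parse the text into real timestamps; the format string is an assumption.
df["Date"] = pd.to_datetime(df["Date"], format="%d/%m/%Y")

# After conversion, the Date column is picked up by the datetime filter.
after = list(df.select_dtypes(include=["datetime64"]).columns)

# include="number" matches every numeric dtype, not only int64 and float64.
numeric = list(df.select_dtypes(include="number").columns)

print("DateTime columns before:", before)
print("DateTime columns after:", after)
print("Numerical columns:", numeric)
```

Once a column holds real timestamps, operations such as `df["Date"].dt.year` or sorting by date behave as expected, which is not the case while the values are still strings.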
