FDS Practical 2
Aim: Categorization and implementation of different data formats for data analysis.
Description
In data analysis, managing and using different data formats well is crucial for
efficient processing, storage, and retrieval of data. Each format suits different
needs and use cases, so the choice of format depends on the nature of the data and
the requirements of the analysis.
• CSV (Comma-Separated Values): This format is widely used for simple tabular
data, making it easy to read and write. CSV files are lightweight and can be
easily imported into data analysis tools, but they lack support for hierarchical
or nested data structures.
• Excel: Known for its rich features, Excel is used for complex spreadsheets that
may include multiple sheets, charts, and advanced formulas. It is popular
among business analysts for data manipulation and visualization.
• Parquet: This columnar storage format is designed for big data processing,
enabling efficient compression and encoding schemes. Parquet files are
particularly suited for analytical queries and are often used with big data
frameworks like Apache Spark and Hadoop.
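The point about CSV lacking support for nested structures can be illustrated with a minimal sketch that uses only the Python standard library. It compares CSV with JSON, the other format loaded later in this practical. The record and its field names are hypothetical and chosen only for illustration.

```python
import csv
import io
import json

# Hypothetical record with a nested field (not taken from the practical's data)
record = {"name": "Ada", "language": "Python", "skills": {"sql": 3, "stats": 4}}

# JSON keeps the nested dictionary intact across a write/read round trip
json_text = json.dumps(record)
from_json = json.loads(json_text)

# CSV stores every cell as text, so the nested dictionary becomes a plain string
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=list(record))
writer.writeheader()
writer.writerow(record)
buffer.seek(0)
from_csv = next(csv.DictReader(buffer))

print(type(from_json["skills"]).__name__)  # dict
print(type(from_csv["skills"]).__name__)   # str
```

After the CSV round trip the nested value is only a text representation, so it would have to be parsed again before it could be used as structured data.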
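import pandas as pd

# Load each data source into a DataFrame
csv_data = pd.read_csv('titanic_train.csv')
# Placeholder file name: the original Excel path is missing from the source
excel_data = pd.read_excel('data.xlsx')
json_data = pd.read_json('test.json')

def categorize_data(df):
    # Group the columns by dtype and print each group
    # ('number' is an assumed selector; the original line was lost)
    numerical = df.select_dtypes(include=['number'])
    categorical = df.select_dtypes(include=['object'])
    datetime = df.select_dtypes(include=['datetime64'])
    print(f"Numerical Columns:\n{numerical.columns}\n")
    print(f"Categorical Columns:\n{categorical.columns}\n")
    print(f"DateTime Columns:\n{datetime.columns}\n")

categorize_data(csv_data)
categorize_data(excel_data)
print("Categorizing JSON Data:")
categorize_data(json_data)

# Missing values per column (calls reconstructed from the output below)
for df in (csv_data, excel_data, json_data):
    print(df.isnull().sum())

# Descriptive statistics
for df in (csv_data, excel_data, json_data):
    print(df.describe())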
Numerical Columns:
Categorical Columns:
DateTime Columns:
Index([], dtype='object')
Numerical Columns:
Categorical Columns:
DateTime Columns:
Index([], dtype='object')
Numerical Columns:
Index(['version'], dtype='object')
Categorical Columns:
Index(['name', 'language', 'id', 'bio'], dtype='object')
DateTime Columns:
Index([], dtype='object')
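Every DateTime Columns result above is empty. A likely reason is that date-like columns, such as the Date column listed for the Excel data, were read as plain text and therefore fall under the categorical group. The sketch below shows one possible way to handle this with `pd.to_datetime`. The sample frame, the helper name `split_by_dtype`, and the date format are assumptions made for illustration, not part of the original practical.

```python
import pandas as pd

# Hypothetical sample frame that mimics a date column stored as text
df = pd.DataFrame({
    "Name": ["Asha", "Ben", "Chen"],
    "Age": [31, 45, 28],
    "Date": ["2021-05-14", "2020-11-02", "2019-08-23"],
})

def split_by_dtype(frame):
    """Return column names grouped as numerical, categorical, and datetime."""
    return {
        "numerical": list(frame.select_dtypes(include=["number"]).columns),
        "categorical": list(frame.select_dtypes(include=["object"]).columns),
        "datetime": list(frame.select_dtypes(include=["datetime64"]).columns),
    }

before = split_by_dtype(df)

# Parse the text dates; the format string is an assumption about the data
df["Date"] = pd.to_datetime(df["Date"], format="%Y-%m-%d")
after = split_by_dtype(df)

print("Before conversion:", before)
print("After conversion:", after)
```

Once the column is converted, it is reported in the datetime group instead of alongside the text columns, which makes date-based operations such as sorting or resampling possible.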
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64
0 0
First Name 0
Last Name 0
Gender 0
Country 0
Age 0
Date 0
Id 0
dtype: int64
name 0
language 0
id 0
bio 0
version 0
dtype: int64
0 Age Id
version
count 197.000000
mean 5.605838
std 2.590350
min 1.010000
25% 3.600000
50% 5.360000
75% 7.860000
max 9.990000