Data Analysis using Python - Complete Notes for 3rd Year B.Sc. Students
Unit 1: Introduction to Data Analysis and Python
- Data Analysis: Collecting, cleaning, and examining data with statistical methods to extract useful information.
- Importance: Helps in decision-making, pattern recognition, and forecasting.
- Types of Data:
* Qualitative (Nominal, Ordinal)
* Quantitative (Discrete, Continuous)
- Data Analysis Steps: Data Collection, Cleaning, Exploration, Modeling, Interpretation
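The four data types above can be represented in pandas roughly as follows. This is a minimal sketch; the column names and values are hypothetical and invented only for illustration.

```python
import pandas as pd

# Hypothetical survey records, invented purely for illustration.
df = pd.DataFrame({
    "city": ["Pune", "Delhi", "Pune"],        # qualitative, nominal
    "rating": ["low", "high", "medium"],      # qualitative, ordinal
    "children": [0, 2, 1],                    # quantitative, discrete
    "height_cm": [162.5, 175.0, 168.2],       # quantitative, continuous
})

# Nominal: categories with no inherent order.
df["city"] = df["city"].astype("category")

# Ordinal: categories with a meaningful order (low < medium < high).
rating_type = pd.CategoricalDtype(["low", "medium", "high"], ordered=True)
df["rating"] = df["rating"].astype(rating_type)

print(df.dtypes)
print(df["rating"].max())  # the declared order makes max() meaningful
```

Declaring the order explicitly matters because sorting an ordinal column as plain text would put "high" before "low".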
Python Libraries and Tools:
- NumPy: Numerical data operations
- Pandas: Data manipulation and analysis
- Matplotlib and Seaborn: Visualization tools
- Jupyter Notebook: Interactive coding environment
Unit 2: NumPy for Numerical Computation
- Arrays (ndarray): All elements share one data type, which allows compact storage and fast vectorized computation
- Creating Arrays: np.array(), np.zeros(), np.ones(), np.arange(), np.linspace()
- Indexing & Slicing: Accessing data subsets
- Operations: Arithmetic, broadcasting, aggregation functions
Code Example:
import numpy as np
a = np.array([[1, 2], [3, 4]])
print(np.mean(a)) # Output: 2.5
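The creation, slicing, broadcasting, and aggregation points listed for this unit can be sketched together as below. The array values are arbitrary examples, not taken from any dataset.

```python
import numpy as np

# Creating arrays
zeros = np.zeros((2, 3))           # 2x3 array filled with 0.0
steps = np.arange(0, 10, 2)        # [0 2 4 6 8]; the stop value is excluded
grid = np.linspace(0.0, 1.0, 5)    # 5 evenly spaced points, both endpoints included

# Indexing and slicing
a = np.array([[1, 2, 3], [4, 5, 6]])
first_row = a[0]                   # [1 2 3]
second_col = a[:, 1]               # [2 5]

# Broadcasting: the 1-D offsets array is added to every row of a
offsets = np.array([10, 20, 30])
shifted = a + offsets              # [[11 22 33] [14 25 36]]

# Aggregation along an axis
col_sums = a.sum(axis=0)           # [5 7 9]
row_means = a.mean(axis=1)         # [2. 5.]

print(shifted)
print(col_sums, row_means)
```

Broadcasting works here because the trailing dimension of a (3) matches the length of offsets (3).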
Unit 3: Pandas for Data Handling
- Data Structures: Series (1D), DataFrame (2D)
- Creating Series and DataFrames
- Reading Files: pd.read_csv(), pd.read_excel()
- Selecting Data: .loc[], .iloc[], conditions
- Manipulations: sort_values(), groupby(), merge(), concat()
Code Example:
import pandas as pd
df = pd.read_csv("data.csv")
print(df.describe())
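A short sketch of the selection and manipulation methods listed for this unit, built on small in-memory tables so it runs without any file. The sales and managers tables and all their values are hypothetical.

```python
import pandas as pd

# Hypothetical tables, invented for illustration.
sales = pd.DataFrame({
    "region": ["North", "South", "North", "East"],
    "units": [10, 7, 3, 5],
})
managers = pd.DataFrame({
    "region": ["North", "South", "East"],
    "manager": ["Asha", "Ravi", "Meera"],
})

# A Series is a labelled 1D array.
s = pd.Series([1.5, 2.5, 3.5], index=["a", "b", "c"])

# Selecting data
first_row = sales.iloc[0]                       # by integer position
north = sales.loc[sales["region"] == "North"]   # rows matching a condition
big = sales.loc[sales["units"] > 4, "units"]    # condition plus column label

# Manipulations
ranked = sales.sort_values("units", ascending=False)
totals = sales.groupby("region")["units"].sum()
merged = sales.merge(managers, on="region", how="left")
stacked = pd.concat([sales, sales], ignore_index=True)

print(totals)
print(merged)
```

merge() combines tables on a shared key column, while concat() simply stacks tables on top of each other.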
Unit 4: Data Cleaning and Preprocessing
- Missing Values: df.isnull(), df.dropna(), df.fillna()
- Data Types: df.dtypes, df.astype()
- Renaming Columns: df.rename()
- Duplicates: df.duplicated(), df.drop_duplicates()
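- Normalization & Standardization: Normalization (min-max) rescales values to the range [0, 1]; standardization rescales them to zero mean and unit standard deviation, so features are on comparable scales for modeling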
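The cleaning steps above might be chained as in the following sketch. The table, its column names, and the choice to fill missing marks with the column mean are assumptions made for illustration; the right strategy depends on the data.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data with one duplicate row and one missing value.
raw = pd.DataFrame({
    "Student Name": ["A", "B", "B", "C"],
    "Marks": [40.0, 55.0, 55.0, np.nan],
    "Year": ["3", "3", "3", "3"],          # numbers stored as text
})

print(raw.isnull().sum())                  # missing values per column
print(raw.duplicated().sum())              # number of repeated rows

clean = raw.drop_duplicates()
clean = clean.rename(columns={"Student Name": "name", "Marks": "marks", "Year": "year"})
clean["year"] = clean["year"].astype(int)  # text -> integer

# Assumed strategy: replace missing marks with the column mean.
clean["marks"] = clean["marks"].fillna(clean["marks"].mean())

# Normalization (min-max): rescale to the range [0, 1].
m = clean["marks"]
clean["marks_minmax"] = (m - m.min()) / (m.max() - m.min())

# Standardization (z-score): zero mean, unit standard deviation.
# Note: pandas std() uses the sample formula (ddof=1) by default.
clean["marks_z"] = (m - m.mean()) / m.std()

print(clean)
```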
Unit 5: Data Visualization
- Matplotlib:
* Line plots, Bar plots, Histograms, Scatter plots
- Seaborn:
* Distribution plots: histplot, kdeplot, displot (the older distplot is deprecated since seaborn 0.11)
* Categorical plots: boxplot, countplot
* Matrix plots: heatmap
Code Example:
import seaborn as sns
import matplotlib.pyplot as plt
sns.boxplot(data=df, x="category", y="value")  # df needs "category" and "value" columns
plt.show()
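The four Matplotlib plot types listed above can be drawn in one figure, as in this sketch. The data is synthetic, and the output filename plots.png is a hypothetical choice.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display; omit in Jupyter
import matplotlib.pyplot as plt
import numpy as np

# Synthetic data, generated only for illustration.
rng = np.random.default_rng(0)
x = np.arange(10)
y = x ** 2
samples = rng.normal(size=200)

fig, axes = plt.subplots(2, 2, figsize=(8, 6))

axes[0, 0].plot(x, y)
axes[0, 0].set_title("Line plot")

axes[0, 1].bar(["A", "B", "C"], [3, 7, 5])
axes[0, 1].set_title("Bar plot")

axes[1, 0].hist(samples, bins=20)
axes[1, 0].set_title("Histogram")

axes[1, 1].scatter(x, y)
axes[1, 1].set_title("Scatter plot")

fig.tight_layout()
fig.savefig("plots.png")  # hypothetical output filename
```

In a Jupyter Notebook the figure is shown inline, so the backend line and savefig() call are usually unnecessary.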
Unit 6: Basic Statistical Analysis
- Descriptive Statistics: mean(), median(), mode(), std(), var()
- Frequency Distribution: value_counts()
- Correlation & Covariance: df.corr(), df.cov() (pass numeric_only=True if the DataFrame has non-numeric columns)
- Inferential Stats (Basic): Hypothesis testing (t-test, chi-square)
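The statistics listed in this unit can be computed as in the sketch below. The notes do not name a library for hypothesis tests; SciPy (scipy.stats) is assumed here, and the study-hours data is hypothetical.

```python
import pandas as pd
from scipy import stats  # assumption: SciPy is used for the hypothesis tests

# Hypothetical data: study hours and exam scores for two groups.
df = pd.DataFrame({
    "group": ["A"] * 5 + ["B"] * 5,
    "hours": [1, 2, 3, 4, 5, 2, 3, 4, 5, 6],
    "score": [52, 55, 61, 64, 70, 58, 60, 67, 71, 75],
})

# Descriptive statistics
print(df["score"].mean(), df["score"].median(), df["score"].std())

# Frequency distribution
print(df["group"].value_counts())

# Correlation and covariance between the numeric columns
r = df["hours"].corr(df["score"])
cov = df[["hours", "score"]].cov()

# Two-sample t-test (Welch's version, which does not assume equal variances)
a = df.loc[df["group"] == "A", "score"]
b = df.loc[df["group"] == "B", "score"]
t_stat, p_value = stats.ttest_ind(a, b, equal_var=False)

# Chi-square test of independence on a contingency table.
# The sample is far too small for a reliable test; it only shows the call.
df["passed"] = df["score"] >= 60
table = pd.crosstab(df["group"], df["passed"])
chi2, chi_p, dof, expected = stats.chi2_contingency(table)

print(r, t_stat, p_value, chi2, chi_p)
```

A small p-value (commonly below 0.05) is taken as evidence against the null hypothesis, for example that the two groups have equal mean scores.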
Real-World Applications:
- Business analytics, Scientific research, Machine learning preprocessing, Financial forecasting
Best Practices:
- Always explore and understand the data
- Clean data before analysis
- Visualize before concluding
End of Notes