Introduction To Pandas - Loading and Exploring Data

Uploaded by

Daniel Mercer
Copyright
© All Rights Reserved

Lecture Notes: Python Pandas Tutorial - Getting Started with Data Analysis

### Instructor: Corey Schafer


- **Video Duration**: 30 minutes, 12 seconds
- **Published**: May 29, 2018
- **Objective**: Introduce beginners to Pandas, a Python library for data analysis, covering
installation, loading data into DataFrames, and basic data exploration techniques.

---

### 1. Introduction to Pandas (0:00 - 2:30)


- **What is Pandas?**
- Pandas is a Python library for data manipulation and analysis, ideal for handling structured
data (e.g., CSV, Excel, SQL).
- Built on NumPy, it provides data structures like Series (1D) and DataFrame (2D).
- Widely used in data science for cleaning, transforming, and analyzing data.
- **Why Use Pandas?**
- Simplifies working with tabular data compared to raw Python or NumPy.
- Offers powerful tools for data loading, filtering, grouping, and visualization.
- **Target Audience**:
- Beginners with basic Python knowledge (variables, lists, dictionaries).
- Those new to data analysis or transitioning from tools like Excel.
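
As a rough illustration of the two core data structures mentioned above, the sketch below builds a small Series and DataFrame by hand. The column names and values are made up for the example and do not come from the video.

```python
import pandas as pd

# A Series is a one-dimensional labeled array.
languages = pd.Series(["Python", "JavaScript", "Go"], name="Language")

# A DataFrame is a two-dimensional table; each of its columns is a Series.
# (Hypothetical toy data, not taken from the survey.)
developers = pd.DataFrame({
    "Language": ["Python", "JavaScript", "Go"],
    "YearsCoding": [5, 3, 8],
})

print(languages.ndim)    # 1
print(developers.ndim)   # 2
print(developers.shape)  # (3, 2)
print(type(developers["Language"]).__name__)  # Series
```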

---

### 2. Setting Up Pandas (2:30 - 7:00)


- **Installation**:
- Install Pandas via pip: `pip install pandas`.
- Recommended: Use Anaconda for a pre-configured environment with Pandas, NumPy, and
Jupyter Notebook.
- Download Anaconda: https://www.anaconda.com/
- Install Pandas in Anaconda: `conda install pandas`.
- **Jupyter Notebook**:
- Ideal for interactive data analysis.
- Launch: `jupyter notebook` in terminal/command prompt.
- Create a new notebook for coding.
- **Verifying Installation**:
- Import Pandas: `import pandas as pd`.
- Check version: `pd.__version__` (e.g., outputs `0.23.0` at the time of the video).
- **Environment Setup**:
- Use Jupyter Notebook for following along with the tutorial.
- NumPy is a Pandas dependency, so `pip install pandas` and `conda install pandas` install it automatically; run `pip install numpy` only if it is missing.
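
A quick way to confirm the setup is to import both libraries and print their versions. This is a minimal sketch; the version strings depend on whatever release is installed and will likely differ from the one shown in the video.

```python
import numpy as np
import pandas as pd

# If either import fails, the package is not installed in the active environment.
pandas_version = pd.__version__
numpy_version = np.__version__

print("pandas:", pandas_version)
print("numpy:", numpy_version)
```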

---
### 3. Loading Data into Pandas (7:00 - 15:00)
- **Dataset Used**: `stack-overflow-developer-survey-2018` (CSV file, available via GitHub or
Kaggle).
- Contains survey data on developers (e.g., salary, programming languages, job satisfaction).
- **Loading Data**:
- Read CSV into a DataFrame: `df = pd.read_csv('survey_results_public.csv')`.
- Path depends on file location (e.g., local directory or URL).
- **Other Input Methods**:
- Excel: `pd.read_excel('file.xlsx')`.
- SQL: `pd.read_sql('query', connection)`.
- JSON: `pd.read_json('file.json')`.
- **Viewing Data**:
- `df.head()`: Displays first 5 rows (or specify `df.head(10)` for 10 rows).
- `df.tail()`: Displays last 5 rows.
- Example: `df.head()` shows columns like `Respondent`, `Hobby`, `OpenSource`, `Country`,
`Salary`.
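
The loading and viewing calls above can be tried without downloading the survey. The sketch below assumes a tiny, made-up CSV held in a string (a stand-in for `survey_results_public.csv`) and an in-memory SQLite table named `survey`; both are illustrative and not part of the original tutorial.

```python
import io
import sqlite3

import pandas as pd

# Hypothetical stand-in for a CSV file on disk; the real survey file is far larger.
csv_text = """Respondent,Hobby,Country,Salary
1,Yes,Kenya,
2,Yes,United Kingdom,51000
3,No,United States,72000
4,Yes,India,
5,No,Germany,60000
6,Yes,United States,95000
"""

# read_csv accepts a file path, a URL, or a file-like object such as StringIO.
df = pd.read_csv(io.StringIO(csv_text))

print(df.head(3))  # first 3 rows
print(df.tail(2))  # last 2 rows

# read_sql takes a query and a connection; here the table is first
# written to an in-memory SQLite database so the example is self-contained.
conn = sqlite3.connect(":memory:")
df.to_sql("survey", conn, index=False)
from_sql = pd.read_sql(
    "SELECT Country, Salary FROM survey WHERE Hobby = 'Yes'", conn
)
conn.close()

print(from_sql)
```

Empty `Salary` cells are read as `NaN`, which is why that column comes back as a float type.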

---

### 4. Basic Data Exploration (15:00 - 25:00)


- **DataFrame Structure**:
- DataFrame is like a spreadsheet with rows (observations) and columns (variables).
- `df.shape`: Returns dimensions (e.g., `(98855, 129)` for 98,855 rows and 129 columns).
- `df.columns`: Lists column names as an Index object.
- **Data Information**:
- `df.info()`: Shows column names, data types (e.g., `object`, `float64`), and non-null counts.
- Example: `Hobby` (object), `Salary` (float64).
- Identifies missing data (e.g., columns with fewer non-null values than total rows).
- **Basic Statistics**:
- `df.describe()`: Summary statistics for numerical columns (e.g., count, mean, min, max).
- Example: `Salary` mean ~$56,000, but many missing values.
- Non-numerical columns (e.g., `Country`) are ignored by `describe()` by default.
- **Accessing Data**:
- Select column: `df['Country']` (returns a Series).
- Select multiple columns: `df[['Country', 'Salary']]` (returns a DataFrame).
- Value counts: `df['Country'].value_counts()` (counts occurrences of each country, e.g., USA:
~20,000 responses).
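
The exploration calls in this section can be sketched on a small hand-built DataFrame. The rows below are invented for illustration; only the column names echo the survey.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing salaries.
df = pd.DataFrame({
    "Respondent": [1, 2, 3, 4, 5],
    "Country": ["United States", "India", "United States", "Germany", "India"],
    "Salary": [72000.0, np.nan, 95000.0, 60000.0, np.nan],
})

print(df.shape)    # (5, 3)
print(df.columns)  # Index of column names

df.info()          # prints dtypes and non-null counts; returns None

summary = df.describe()  # numeric columns only by default
print(summary)

country = df["Country"]             # single column -> Series
subset = df[["Country", "Salary"]]  # list of columns -> DataFrame

counts = df["Country"].value_counts()  # occurrences of each country
print(counts)
```

Here `info()` reports 3 non-null values for `Salary` against 5 rows, which is how missing data shows up.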

---

### 5. Practical Example: Exploring the Survey Data (25:00 - 29:00)


- **Goal**: Understand the distribution of respondents by country and salary.
- **Steps**:
1. Load data: `df = pd.read_csv('survey_results_public.csv')`.
2. Check dimensions: `df.shape` → `(98855, 129)`.
3. View column info: `df.info()` (shows many columns with missing data).
4. Country distribution: `df['Country'].value_counts()` → Top countries: USA, India, Germany.
5. Salary stats: `df['Salary'].describe()` → Mean salary ~$56,000, computed only over respondents who reported a salary, since `describe()` skips missing values.
- **Insight**: Dataset is large and diverse, but missing values (e.g., in `Salary`) require careful
handling in future analysis.
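
One way to see how much missing data sits behind a summary statistic is to count the gaps alongside the mean. This is a sketch on made-up values, not the survey itself, and the variable names are chosen for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the real dataset has far more rows and columns.
df = pd.DataFrame({
    "Country": ["United States", "India", "United States",
                "Germany", "India", "United States"],
    "Salary": [72000.0, np.nan, 95000.0, 60000.0, np.nan, 48000.0],
})

missing = df["Salary"].isna().sum()  # number of missing salaries
n_reported = df["Salary"].count()    # number of non-missing salaries
mean_reported = df["Salary"].mean()  # NaN values are skipped by default

top = df["Country"].value_counts().head(2)  # two most frequent countries

print(missing)        # 2
print(n_reported)     # 4
print(mean_reported)  # 68750.0
print(top)
```

The mean here describes only the four respondents who gave a salary, so it should be read together with the missing count.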

---

### 6. Wrap-Up and Next Steps (29:00 - 30:12)


- **Key Takeaways**:
- Installed Pandas and set up a Jupyter Notebook environment.
- Loaded data into a DataFrame using `pd.read_csv()`.
- Explored data with `head()`, `shape`, `info()`, `describe()`, and `value_counts()`.
- **Next Steps**:
- Watch subsequent parts of the series for advanced Pandas features (e.g., filtering, grouping,
cleaning).
- Practice with the Stack Overflow survey dataset or other CSV files.
- Explore Pandas documentation: https://pandas.pydata.org/docs/
- **Tips**:
- Save Jupyter Notebooks to track code and results.
- Experiment with different datasets to build familiarity.

---

### Code Snippets (for Reference)


```python
# Import Pandas
import pandas as pd

# Check version
print(pd.__version__)

# Load data
df = pd.read_csv('survey_results_public.csv')

# Explore data
print(df.head())  # First 5 rows
print(df.shape)  # Dimensions: (98855, 129)
df.info()  # Column types and non-null counts (prints directly; returns None)
print(df.describe())  # Summary stats for numerical columns
print(df['Country'].value_counts())  # Occurrences of each value in the Country column
```
---

### Suggested Title for Notes


**"Getting Started with Pandas: Data Analysis Basics in Python"**

---

### Citation
- Video Source: "Python Pandas Tutorial (Part 1): Getting Started with Data Analysis -
Installation and Loading Data" by Corey Schafer, YouTube, May 29, 2018.

---
