The data scientist’s toolbox includes the essential tools and concepts used
for building, managing, and sharing data analysis software. These tools help
transform raw data into actionable insights by enabling data preparation,
analysis, collaboration, and reproducibility.
1. Version Control Systems
2. Markdown
3. Git
4. GitHub
5. R Programming Language
6. RStudio
1. Version Control Systems
Purpose:
Version control helps track and manage changes in files, especially when
multiple collaborators are working on the same project.
Key Features:
Facilitates collaboration.
Ensures reproducibility.
2. Markdown
What is Markdown?
Writing documentation.
Headers:
# Header 1
## Header 2
Bold and Italics:
**bold**, *italic*
Lists:
- Item 1
- Item 2
Code Blocks:
print("Code block")
Jupyter Notebooks.
3. Git
What is Git?
Git is a distributed version control system that tracks changes in files and
helps manage projects.
Core Concepts:
Repository (Repo): A storage location for the project and its history.
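As a minimal sketch of what a repository looks like on disk, the commands below create an empty repository in a temporary directory and ask Git where it stores the history. The temporary directory and the `repo_dir` variable are illustrative assumptions, not part of the original notes.

```shell
# Create a throwaway directory so the sketch does not touch real projects
repo_dir="$(mktemp -d)"
cd "$repo_dir"

# Turn the directory into a Git repository
git init --quiet

# Ask Git where the repository data (the project history) is kept
git rev-parse --git-dir

# The hidden .git directory appears alongside the project files
ls -a
```

Everything Git knows about the project's history lives in that hidden directory, so the project folder and its history travel together.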
4. GitHub
What is GitHub?
Key Features:
Document projects.
Manage workflows with GitHub Actions.
5. R Programming Language
What is R?
Key Features:
Why Learn R?
6. RStudio
What is RStudio?
Key Features:
2. Data Analysis:
5. Reporting:
Collaboration: Work efficiently with team members using Git and GitHub.
Setting up Git
2. Configure Git: Run these commands to set up your name and email:
1. Initialize a Repository:
mkdir my_project
cd my_project
git init
2. Add Files:
3. Commit Changes:
4. View History:
git log
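The steps above (configure, initialize, add, commit, view history) can be sketched end to end as follows. This is one possible sequence rather than the notes' exact commands: the name, email address, file contents, and commit message are placeholders, and the identity is set for this repository only so the sketch does not change any global settings. Adding `--global` to the two `git config` commands would apply the identity to every repository.

```shell
# Work in a throwaway directory (hypothetical location)
work_dir="$(mktemp -d)"
cd "$work_dir"

# Initialize a repository
mkdir my_project
cd my_project
git init --quiet

# Configure your name and email for this repository (placeholder values)
git config user.name "Your Name"
git config user.email "you@example.com"

# Add files: create a file and stage it
echo "# My Project" > README.md
git add README.md

# Commit changes: record the staged snapshot with a message
git commit --quiet -m "Add README"

# View history: one line per commit
git log --oneline
```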
3. Markdown
1. Open any text editor (or RStudio) and save the file with a .md
extension, e.g., README.md.
2. Markdown Example:
# Project Title
## Features
- Easy to use
- Highly efficient
## Code Example
```python
print("Hello, Markdown!")
```
3. Use a Markdown previewer (like VS Code or GitHub) to view the
formatted document.
4. Git
Collaborating on GitHub
Push changes:
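A hedged sketch of pushing is shown below. To keep it runnable without a GitHub account, a local bare repository stands in for the GitHub remote. That stand-in, the directory names, and the branch name `main` are assumptions made for illustration. With a real GitHub repository, the URL passed to `git remote add` would be the one shown on the repository page.

```shell
# A bare repository stands in for the GitHub remote (assumption for this sketch)
base_dir="$(mktemp -d)"
git init --quiet --bare "$base_dir/remote.git"

# Create a local repository with one commit
git init --quiet "$base_dir/local"
cd "$base_dir/local"
git config user.name "Your Name"
git config user.email "you@example.com"
echo "# My Project" > README.md
git add README.md
git commit --quiet -m "Add README"

# Name the branch main so the push target is predictable
git branch -M main

# Register the remote under the conventional name "origin" and push
git remote add origin "$base_dir/remote.git"
git push --quiet -u origin main
```

The `-u` flag records `origin/main` as the upstream branch, so later pushes from this branch can be written as a plain `git push`.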
5. R Programming
# Assign values
x <- 10
y <- 20
# Perform calculations
z <- x + y
print(z)
print(df)
3. Install a Package:
install.packages("ggplot2")
library(ggplot2)
4. RStudio
Creating a Project in RStudio
3. Example R Markdown:
---
output: html_document
---
## Analysis
```{r}
summary(cars)
```
4. Click the “Knit” button to generate an HTML report.
5. Example Workflow
library(dplyr)
summary <- mtcars %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg))
print(summary)
---
output: html_document
---
## Summary
```{r}
print(summary)
```
4. Push to GitHub:
git add .
Initialize Git.
1. Install R:
Visit CRAN and download R for your operating system.
2. Install RStudio:
3. Install Git:
4. Set up GitHub:
Create an account on GitHub.
1. Open RStudio.
# Load libraries and dataset
library(dplyr)
summary <- mtcars %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg))
print(summary)
---
output: html_document
---
## Introduction
## Summary
```{r}
# Load libraries
library(dplyr)
# Perform analysis
summary <- mtcars %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg))
print(summary)
```
3. Click the Knit button in RStudio to generate an HTML report.
git init
git add .
Commit changes:
git commit -m "Initial analysis and report"
## Files
library(ggplot2)
ggplot(summary, aes(x = factor(cyl), y = avg_mpg)) +
  geom_bar(stat = "identity")
Push to GitHub.
Practical example project: build a data analysis workflow using the tools in
the data scientist’s toolbox.
Open RStudio.
Go to File > New Project > New Directory > New Project.
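# Load libraries
library(dplyr)
# Load dataset
data(mtcars)
# Summarize data
summary <- mtcars %>%
  group_by(cyl) %>%
  summarise(
    avg_mpg = mean(mpg),
    avg_hp = mean(hp)
  )
print(summary)
# Save the summary to the project folder
write.csv(summary, "summary.csv", row.names = FALSE)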
5. Verify the output in the Console and check if the summary.csv file is
created in your project folder.
## Introduction
This report analyzes the `mtcars` dataset, summarizing mileage (MPG) and
horsepower (HP) by the number of cylinders.
## Data Summary
```{r}
# Load libraries
library(dplyr)
# Load dataset
data(mtcars)
# Summarize data
summary <- mtcars %>%
  group_by(cyl) %>%
  summarise(
    avg_mpg = mean(mpg),
    avg_hp = mean(hp)
  )
print(summary)
```
## Visualization
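# Load ggplot2
library(ggplot2)
# Bar chart of average MPG by number of cylinders
ggplot(summary, aes(x = factor(cyl), y = avg_mpg)) +
  geom_bar(stat = "identity") +
  labs(
    x = "Cylinders",
    y = "Average MPG"
  ) +
  theme_minimal()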
git status
git add .
Commit changes:
# Mtcars Analysis
This project analyzes the `mtcars` dataset to summarize mileage (MPG) and
horsepower (HP) by the number of cylinders.
## Files
git push
4. Open a pull request on GitHub to merge the updates into the main
branch.
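A pull request proposes merging one branch into another, so the updates need to live on a branch other than main. The sketch below shows one way that could look locally. The branch name `update-readme`, the file contents, and the commit messages are hypothetical, and the final command only lists what a pull request from this branch would contain. Opening the pull request itself happens on GitHub after the branch is pushed.

```shell
# Throwaway repository with an initial commit on main
pr_dir="$(mktemp -d)"
cd "$pr_dir"
git init --quiet
git config user.name "Your Name"
git config user.email "you@example.com"
echo "# Mtcars Analysis" > README.md
git add README.md
git commit --quiet -m "Initial commit"
git branch -M main

# Do the new work on a separate branch (hypothetical name)
git checkout --quiet -b update-readme
echo "## Files" >> README.md
git add README.md
git commit --quiet -m "Document project files"

# List the commits a pull request from this branch would propose for main
git log --oneline main..update-readme
```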