Summary DS231


Ch2

Summary of Data Science and Data Engineering


1. Understanding Data Science and Data Engineering

- Data Science: The science of extracting valuable insights from raw data and effectively
communicating those insights.
- Data Engineering: An engineering domain focused on building and maintaining systems that
process large volumes of data.

2. Types of Data
- Structured Data: Data stored in traditional databases, such as MySQL.
- Unstructured Data: Data that doesn't fit into a structured format, like emails.
- Semi-structured Data: Data organized by tags, such as XML and JSON files.
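As a quick illustration, the hedged sketch below (plain Python, standard library only) parses a
small semi-structured JSON record; the record itself is invented for the example.

    import json

    # A semi-structured record: values are organized by tags (keys),
    # but there is no fixed table schema as in a relational database.
    raw = '{"name": "Sara", "email": "sara@example.com", "tags": ["billing", "urgent"]}'

    record = json.loads(raw)      # parse the JSON text into a Python dict
    print(record["name"])         # look up a value by its tag
    print(len(record["tags"]))    # nested and repeated values are allowed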

3. Using Data Science in Organizations

- Data science has become essential for all organizations, not just tech companies, making data
literacy a core skill.

4. Components of Data Science

- Analytical Skills: Requires mathematical and statistical knowledge, as well as programming skills.
- Effective Communication: The ability to convey data insights clearly.

5. Different Roles in Data Science

- Data Implementer: Focuses on building solutions and needs strong math and coding skills.
- Data Leader: Guides teams and enjoys tangible outcomes from data science.
- Data Entrepreneur: Builds businesses using data science services and products, valuing innovation
and creative freedom.

Conclusion
Data science and data engineering highlight the importance of data in today's world. Required skills
include analysis, programming, and communication, enabling individuals and organizations to
achieve effective results in competitive environments.
Ch3
Comprehensive Summary of Big Data

1. Definition of Big Data


- Big Data: Refers to data that exceeds the processing capacity of traditional database systems due to
its large size, high speed, or lack of structural requirements.
- Data Engineering: Using Big Data requires specific storage and processing capabilities, often
involving investment in a Hadoop cluster.

2. Characteristics of Big Data (The Three Vs)


- Volume: The lower limit for Big Data starts at 1 terabyte, with no upper limit.
- Velocity: Data enters systems at rates ranging from 30 kilobytes per second to 30 gigabytes per
second.
- Variety: Comprises a wide range of data types, including structured and unstructured data.

3. Important Data Sources


- Vast amounts of data are continuously generated from humans, machines, and sensors.
- Common sources include social media, financial transactions, health records, click-streams, and the
Internet of Things.

4. Understanding the Differences Among Data Approaches


- Data Science: A scientific field dedicated to knowledge discovery through data analysis, requiring
expertise in math, statistics, programming, and domain-specific knowledge.
- Data Engineering: Focuses on building and maintaining data systems to overcome processing
bottlenecks and handle large volumes of data, often using languages like Java and Python.
- Machine Learning Engineering: Involves applying algorithms for automated predictions and requires
a strong background in software development.

5. Storing and Processing Data for Data Science


- Cloud Computing: Offers advantages such as faster time-to-market, enhanced flexibility, and
security.
- NoSQL Databases: Designed to handle the challenges of storing and processing Big Data, allowing
for non-SQL queries on non-relational data.
- Hadoop: A platform for storing and processing Big Data, built on components like HDFS and
MapReduce, and commonly used together with Spark.
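To make the MapReduce idea concrete, here is a minimal pure-Python sketch of the
map-shuffle-reduce pattern for counting words. It only illustrates the programming model; a real
Hadoop job would distribute this work across the cluster.

    from collections import defaultdict

    documents = ["big data needs big storage",
                 "data engineering handles big data"]

    # Map: emit a (word, 1) pair for every word in every document
    mapped = [(word, 1) for doc in documents for word in doc.split()]

    # Shuffle: group the emitted values by key (the word)
    grouped = defaultdict(list)
    for word, count in mapped:
        grouped[word].append(count)

    # Reduce: sum the counts for each word
    word_counts = {word: sum(counts) for word, counts in grouped.items()}
    print(word_counts)   # e.g. {'big': 3, 'data': 3, ...}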

6. Processing Data in Real-Time


- Real-Time Processing Frameworks: Handle data as it streams into the system, enhancing overall
efficiency.
- In-Memory Computing: Processes data within the computer's memory, providing faster results but
with limited capacity.

Conclusion: Big Data necessitates innovative storage and processing strategies to address challenges
arising from volume, velocity, and variety. This requires collaboration across data science, data
engineering, and machine learning engineering fields.
Ch4
Machine Learning Overview
Machine learning (ML) is the process of applying algorithmic models to data to identify hidden
patterns and make predictions, often referred to as algorithmic learning. Applications include
real-time Internet advertising, spam filtering, recommendation engines, natural language processing
and sentiment analysis, and automatic facial recognition.
Main Steps:
1. Setup: Acquiring, preprocessing, selecting features, and splitting data into training and test
datasets.
2. Learning: Experimenting, training, building, and testing models.
3. Application: Deploying models and making predictions.
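A minimal sketch of these three steps, assuming scikit-learn is available and using its bundled
iris dataset (any labeled dataset would work):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression

    # Setup: acquire data and split it into training and test sets
    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0)

    # Learning: train the model, then test it on held-out data
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)
    print("test accuracy:", model.score(X_test, y_test))

    # Application: make predictions on new instances
    print(model.predict(X_test[:5]))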
Key Terms:
• Instance: A row in a data table (also called observation or data point).
• Feature: A column in a data table (also called variable).
• Target variable: The variable to predict (dependent variable).
Learning Styles:
• Supervised Learning: Uses labeled data for training (e.g., logistic regression).
• Unsupervised Learning: Groups unlabeled data based on similarities (e.g., k-means clustering).
• Reinforcement Learning: Uses rewards to train models, akin to human learning.
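For the unsupervised style, a minimal k-means sketch (again assuming scikit-learn); the points are
made up and form two obvious groups:

    import numpy as np
    from sklearn.cluster import KMeans

    # Unlabeled points: two loose groups in two dimensions
    points = np.array([[1, 2], [1, 1], [2, 2], [8, 9], [9, 8], [8, 8]])

    # Group the points into 2 clusters based on similarity
    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
    print(kmeans.labels_)           # cluster assignment for each point
    print(kmeans.cluster_centers_)  # the two cluster centers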
Machine Learning in Practice:
• Deep Learning: Utilizes neural networks for advanced tasks like Gmail's Smart Reply and
Facebook's DeepFace.
• Apache Spark: An in-memory computing application for deploying ML algorithms on big data
sources in real-time.
Ch5

Exploring Probability and Inferential Statistics

• Statistics Overview:
o Descriptive Statistics: Describe characteristics of a dataset (e.g., mean,
standard deviation) without making causal claims.
o Inferential Statistics: Use a sample to infer information about a larger
population, often to predict or understand causation.
• Importance of Descriptive Statistics:
o Used to understand real-world measures.
o Example: Estimating future profits based on past data, considering variations.
• Applications of Descriptive Statistics:
o Detecting outliers.
o Planning data preprocessing.
o Identifying useful features for analysis.
• Inferential Statistics:
o Provide insights from a sample to infer about the population.
o Useful when collecting data for the entire population is impractical.
o Requires a representative sample to be valid.
• Probability and Random Variables:
o Used to model unpredictable variables.
o Events must be mutually exclusive.
o Probability ranges from 0.0 to 1.0 and sums to 1.0.
• Types of Probability Distributions:
o Discrete: Countable values (e.g., car colors).
o Continuous: Range of values (e.g., miles per gallon).
• Common Distributions:
o Normal Distribution: Symmetric bell curve, common in natural phenomena.
o Binomial Distribution: Models number of successes in a fixed number of
trials.
o Categorical Distribution: Non-numeric variables (e.g., service levels).
• Naive Bayes Method:
o Predicts event likelihood based on data features.
o Useful for text classification (e.g., spam detection).
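A minimal sketch of Naive Bayes text classification in the spirit of the spam example (assuming
scikit-learn; the tiny training set is invented for illustration):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    messages = ["win a free prize now", "meeting moved to friday",
                "free cash offer", "lunch at noon?"]
    labels = ["spam", "ham", "spam", "ham"]

    # Turn the text into word-count features
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(messages)

    # Train the classifier, then score a new message
    model = MultinomialNB().fit(X, labels)
    print(model.predict(vectorizer.transform(["free prize offer"])))  # likely 'spam'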
• Quantifying Correlation
• Correlation Coefficient (r):
o Ranges between -1 and 1.
o Closer to 1 or -1 indicates strong correlation.
o Close to 0 suggests little or no linear relationship between the variables.
• Pearson Correlation:
o Used for continuous, numeric variables.
o Assumes normal distribution and linear relationship.
• Spearman’s Rank Correlation:
o Used for ordinal variables.
o Converts numeric pairs into ranks.
o Does not assume a linear relationship (works for monotonic, non-linear relationships).
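A minimal sketch computing both coefficients with SciPy (assumed installed); the paired values are
made up:

    from scipy.stats import pearsonr, spearmanr

    hours_studied = [1, 2, 3, 4, 5, 6]
    exam_score = [52, 55, 61, 64, 70, 74]

    r, _ = pearsonr(hours_studied, exam_score)     # linear relationship, numeric data
    rho, _ = spearmanr(hours_studied, exam_score)  # rank-based, monotonic relationship
    print(round(r, 2), round(rho, 2))              # both close to 1: strong correlation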
• Reducing Data Dimensionality with Linear Algebra
• Importance of Linear Algebra:
o Essential for mathematical and statistical operations on large datasets.
o Uses arrays and matrices.
• Singular Value Decomposition (SVD):
o Reduces dataset dimensionality.
o Removes redundant information and noise.
o Used in Principal Component Analysis (PCA).
• Factor Analysis:
o Filters out redundant information and noise.
o Assumes metric features, homogeneity, and correlation (r > 0.3).
• Principal Component Analysis (PCA):
o Finds relationships between features.
o Reduces dataset to non-redundant principal components.
o Does not regress to find underlying causes of shared variance.
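A minimal PCA sketch (assuming scikit-learn) that reduces a small made-up four-feature dataset to
two principal components:

    import numpy as np
    from sklearn.decomposition import PCA

    # Four features per row; some columns carry largely redundant information
    X = np.array([[2.0, 4.1, 1.0, 0.9],
                  [3.0, 6.2, 1.1, 1.0],
                  [4.0, 7.9, 0.9, 1.1],
                  [5.0, 10.1, 1.0, 0.8]])

    pca = PCA(n_components=2)
    X_reduced = pca.fit_transform(X)        # four columns reduced to two components
    print(X_reduced.shape)                  # (4, 2)
    print(pca.explained_variance_ratio_)    # variance captured by each component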
• Modeling Decisions with Multiple Criteria Decision-Making (MCDM)
• Traditional MCDM:
o Used when multiple criteria are involved in decision-making.
o Assumes multiple criteria evaluation and a zero-sum system.
• Fuzzy MCDM:
o Evaluates suitability within a range.

o Uses a range of acceptability instead of binary criteria.

Introducing Regression Methods

• Overview:
o Regression methods from statistics help describe and quantify relationships
between variables.
o Useful for determining correlation strength and predicting future values,
assuming a cause-and-effect relationship.
• Linear Regression:
o Describes the relationship between a target variable y and predictor
variables.
o Simple form: y = mx + b.
o Limitations:
▪ Works only with numerical variables.
▪ Requires handling missing values and outliers.
▪ Assumes linear relationships and independent features.
▪ Residuals should be normally distributed.
• Logistic Regression:
o Estimates values for a categorical target variable.
o Target variable should be numeric and describe the target’s class.
o Provides probability estimates for each prediction.
• Ordinary Least Squares (OLS) Regression:
o Fits a linear regression line to a dataset.
o Useful for models with multiple independent variables.
o Estimates the target from dataset features.
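A minimal OLS sketch that fits y = mx + b with NumPy (assumed available); the data points are
invented:

    import numpy as np

    x = np.array([1, 2, 3, 4, 5], dtype=float)
    y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])

    # Least-squares fit of a straight line: returns slope m and intercept b
    m, b = np.polyfit(x, y, 1)
    print(f"y = {m:.2f}x + {b:.2f}")

    # Predict the target for a new x value
    print(m * 6 + b)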
Detecting Outliers

• Importance:
o Outliers can affect statistical and machine learning models.
o Important for detecting anomalies like fraud or equipment failure.
• Univariate Analysis:
o Inspects individual features for anomalous values.
o Methods: Tukey outlier labeling, Tukey boxplotting.
• Multivariate Analysis:
o Considers combinations of variables to detect outliers.
o Methods: Scatter-plot matrix, boxplotting, DBSCAN, Principal Component
Analysis (PCA).
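A minimal sketch of Tukey outlier labeling for a single feature (the univariate case above), using
NumPy and the usual 1.5 × IQR fences; the values are made up:

    import numpy as np

    values = np.array([10, 12, 11, 13, 12, 11, 95])    # 95 looks anomalous

    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr       # Tukey fences

    outliers = values[(values < lower) | (values > upper)]
    print(outliers)                                      # [95]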
• Introducing Time Series Analysis
• Overview:
o Time series: Data collected over time.
o Used to predict future values based on past data.
• Patterns:
o Time series exhibit specific patterns.
o Univariate time series analysis focuses on changes in a single variable over
time.
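A minimal univariate time-series sketch with pandas (assumed installed), using an invented daily
series and a rolling mean to surface the trend:

    import pandas as pd

    dates = pd.date_range("2024-01-01", periods=7, freq="D")
    sales = pd.Series([100, 102, 101, 105, 107, 110, 112], index=dates)

    # A 3-day rolling mean smooths day-to-day noise and shows the pattern over time
    print(sales.rolling(window=3).mean())
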
Ch6
Exploring Data Worldwide

• Data.gov (USA):
o Provides open access to nonclassified US government data.
o Over 100,000 datasets available by mid-2014.
o Indicators include economic, environmental, STEM industry, quality of life,
and legal.
o Offers over 60 open-source APIs for creating tools and apps.
• Canada Open Data:
o Over 200,000 datasets available.
o Popular datasets cover environmental issues, citizenship, and quality of life.
• Data.gov.uk (UK):
o Started in 2010, with about 20,000 datasets by mid-2014.
o Useful for data on environmental, government spending, societal, health,
education, and business indicators.
• US Census Bureau:
o Provides demographic data useful for marketing and advertising research.
o Classifications include age, income, household size, gender, race, and
education level.
• NASA:
o Publicly shares nonclassified project data.
o Generates 4 terabytes of new earth-science data per day.
o Data includes astronomy, climate, life sciences, geology, and engineering.
• World Bank Open Data:
o Offers data on agriculture, economy, environment, science, technology,
financial sector, and poverty.
o Data can be downloaded or viewed online, with an API available for access.
• Knoema:
o Houses over 500 databases and 150 million time series.
o Includes government data, national public data, UN data, international
organization data, and corporate data.
• Quandl:
o A search engine for numeric data with links to over 10 million datasets.
o Includes datasets from the UN, central banks, real estate organizations, and
think tanks.
• Exversion:
o Provides collaborative functionality for data similar to GitHub for code.
o Offers version control and hosting services for public data.
• OpenStreetMap (OSM):
o A crowd-sourced alternative to commercial mapping products.
o Users can contribute data by linking their GPS systems to the OSM
application.

Discovering Saudi Open Data Portal

• Overview:
o Aims to implement a public data hub and strategy for transparency,
e-participation, and innovation.
o Publishes datasets from ministries and government agencies in an open
format.
• Benefits:
o Central access point for finding, downloading, and using government data.
o Helps bridge the gap between governments and citizens.
o Enables research, feedback, and development of applications based on
government data.
• Statistics:
o As of August 2022, there are 6,544 datasets and 147 publishers.
o Open data license allows freedom to distribute, produce, and transform
datasets with proper attribution.

Why Python Is Hot

• Popularity:
o Python became the most popular language in the world in 2017 according to
IEEE Spectrum.
o It’s widely used in machine learning, robotics, artificial intelligence, and data
science.
• Reasons for Popularity:
o Ease of Learning: Python is relatively easy to learn.
o Free Resources: Everything needed to learn and use Python is free.
o Tools and Libraries: Python offers many ready-made tools for current
technologies like data science, machine learning, AI, and robotics.

Choosing the Right Python

• Version Selection:
o Visit the Python website to download the most current stable build
recommended for use.

Tools for Success

• Python Interpreter and Editor:


o A code editor lets you type code, and an interpreter runs it.
o Anaconda: A complete Python development environment with a user-friendly
interface, often used for data science.
• Installing Anaconda and VS Code:
o Download Anaconda from Anaconda’s website.
o Follow the instructions to install it, including Microsoft VS Code for a
powerful code editor.
• Using Anaconda Navigator:
o Anaconda Navigator helps you navigate and choose different features and
tools within the app.

Writing Python in VS Code


• Setup:
o Open VS Code from Anaconda.
o Ensure you have the necessary extensions: Anaconda Extension Pack, Python,
and YAML.

Using Jupyter Notebook for Coding

• Overview:
o Jupyter Notebook supports Python, Julia, and R.
o It’s popular for sharing code online and comes free with Anaconda.
Ch7
A brief summary of using Python's interactive mode with VS Code and Anaconda:

• Setting Up:
o Open Anaconda Navigator and launch VS Code.
o Open the Terminal pane in VS Code (View ➪ Terminal).
• Checking Python Version:
o In the Terminal, type python --version to check your Python version.
• Using the Python Interpreter:
o Enter python in the Terminal to start the Python interpreter (you’ll see the >>>
prompt).
o You can type commands directly into the interpreter.
• Getting Help:
o Type help() in the interpreter to access Python’s built-in help system.
o To see a list of Python keywords, type keywords at the help prompt.
o Exit help by typing q or pressing Ctrl+Z.
• Creating a Workspace:
o Set up a development environment in VS Code, referred to as a workspace.
o Follow the steps in your book (pages 37-39) to create and configure your
workspace.
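Putting those steps together, a Terminal session might look roughly like this (prompts and version
numbers will vary by machine):

    python --version
    Python 3.x.x
    python
    >>> 2 + 3
    5
    >>> help()
    help> keywords
    help> quit
    >>> exit()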

Here’s a summary of the steps for creating a folder for your Python code and working with
Python in VS Code and Jupyter Notebook:

Creating a Folder for Your Python Code

• Create a Folder:
o Create a folder to store all your Python code. You can name it and place it
anywhere you like.
o Associate this folder with your VS Code workspace to ensure you’re using the
correct Python interpreter and settings.

Typing, Editing, and Debugging Python Code

• Using the Code Editor:


o Write your Python code in the VS Code editor. Each file should have a .py
extension.
o Save your files in the Python folder you created.
• Saving Your Code:
o Manually save your code or enable AutoSave (File ➪ Auto Save).
• Running Python Code:
o Right-click the file name (e.g., hello.py) and choose “Run Python File in
Terminal” to execute your code.
• Debugging:
o Look for red indicators in the Explorer pane and wavy red underlines in your
code to identify errors.
o Use the built-in debugger in VS Code for more complex programs.
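For example, a file saved as hello.py in your code folder could be as small as this; right-clicking
it and choosing "Run Python File in Terminal" prints the greeting:

    # hello.py -- saved inside the folder tied to your VS Code workspace
    name = "world"
    print(f"Hello, {name}!")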

Writing Code in a Jupyter Notebook

• Creating and Saving a Notebook:


o Create a Jupyter notebook and save it in a folder.
o Open the notebook to see at least one cell for typing code.
• Typing and Running Code:
o Type your code in the cell and run it by pressing Alt + Enter (Windows) or
Option + Enter (Mac), or by clicking the Run button.
• Adding Markdown Text:
o Add text, pictures, and videos using Markdown tags.
o Run the cell to render the Markdown content.
• Saving and Reopening Notebooks:
o Save your notebook (File ➪ Save and Checkpoint).
o Reopen it by launching Jupyter Notebook from Anaconda and navigating to
the saved file.
