Summary DS231
-Data Science: The science of extracting valuable insights from raw data and effectively
communicating those insights.
-Data Engineering: An engineering domain focused on building and maintaining systems that
process large volumes of data.
Types of Data
-Structured Data: Data stored in traditional databases, such as MySQL.
-Unstructured Data: Data that doesn't fit into a structured format, like emails.
-Semi-structured Data: Data organized by tags, such as XML and JSON files.
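As a small illustration of semi-structured data, here is a minimal Python sketch using the standard json module (the record shown is invented); the structure is carried by the keys/tags themselves:

import json

# A semi-structured record: the keys (tags) describe the structure
raw = '{"sender": "alice@example.com", "subject": "Q3 report", "tags": ["finance", "urgent"]}'

record = json.loads(raw)      # parse the JSON text into a Python dict
print(record["subject"])      # -> Q3 report
print(record["tags"])         # -> ['finance', 'urgent']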
Conclusion
Data science and data engineering highlight the importance of data in today's world. Required skills
include analysis, programming, and communication, enabling individuals and organizations to
achieve effective results in competitive environments.
Ch3
Comprehensive Summary of Big Data
Conclusion: Big Data necessitates innovative storage and processing strategies to address challenges
arising from volume, velocity, and variety. This requires collaboration across data science, data
engineering, and machine learning engineering fields.
Ch4
Machine Learning Overview
Machine learning (ML) is the process of applying algorithmic models to data to identify hidden patterns and make predictions, often referred to as algorithmic learning. Applications include real-time Internet advertising, spam filtering, recommendation engines, natural language processing and sentiment analysis, and automatic facial recognition.
Main Steps:
1. Setup: Acquiring, preprocessing, selecting features, and splitting data into training and test datasets.
2. Learning: Experimenting, training, building, and testing models.
3. Application: Deploying models and making predictions.
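A minimal sketch of these three steps, assuming scikit-learn and its bundled iris sample dataset (not part of the original notes):

# 1. Setup: acquire data, select features, and split into training and test sets
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# 2. Learning: train and test a model
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))

# 3. Application: use the fitted model to make predictions on new data
print(model.predict(X_test[:5]))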
Key Terms:
• Instance: A row in a data table (also called observation or data point).
• Feature: A column in a data table (also called variable).
• Target variable: The variable to predict (dependent variable).
Learning Styles:
• Supervised Learning: Uses labeled data for training (e.g., logistic regression).
• Unsupervised Learning: Groups unlabeled data based on similarities (e.g., k-means clustering; a short sketch follows this list).
• Reinforcement Learning: Uses rewards to train models, akin to human learning.
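A short sketch of unsupervised learning with k-means, assuming scikit-learn (the sample points are invented):

import numpy as np
from sklearn.cluster import KMeans

# Unlabeled 2-D points forming two loose groups
points = np.array([[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],
                   [8.0, 8.2], [7.9, 8.1], [8.3, 7.8]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_)            # cluster assignment discovered for each point
print(kmeans.cluster_centers_)   # centers of the two groups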
Machine Learning in Practice:
• Deep Learning: Utilizes neural networks for advanced tasks like Gmail's Smart Reply and Facebook's DeepFace.
• Apache Spark: An in-memory computing application for deploying ML algorithms on big data sources in real time.
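A rough sketch of fitting a model with Spark's MLlib, assuming PySpark is installed and a local session suffices (the column names and data are invented for illustration):

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("ml-sketch").getOrCreate()

# Toy DataFrame with a label column and two feature columns
df = spark.createDataFrame(
    [(0.0, 1.2, 0.7), (1.0, 3.4, 2.1), (0.0, 0.9, 0.4), (1.0, 2.8, 1.9)],
    ["label", "f1", "f2"])

# Assemble feature columns into a single vector column, then fit the model
features = VectorAssembler(inputCols=["f1", "f2"], outputCol="features").transform(df)
model = LogisticRegression(featuresCol="features", labelCol="label").fit(features)
model.transform(features).select("label", "prediction").show()

spark.stop()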
Ch5
• Statistics Overview:
o Descriptive Statistics: Describe characteristics of a dataset (e.g., mean,
standard deviation) without making causal claims.
o Inferential Statistics: Use a sample to infer information about a larger
population, often to predict or understand causation.
• Importance of Descriptive Statistics:
o Used to understand real-world measures.
o Example: Estimating future profits based on past data, considering variations.
• Applications of Descriptive Statistics:
o Detecting outliers.
o Planning data preprocessing.
o Identifying useful features for analysis.
• Inferential Statistics:
o Provide insights from a sample to infer about the population.
o Useful when collecting data for the entire population is impractical.
o Requires a representative sample to be valid.
• Probability and Random Variables:
o Used to model unpredictable variables.
o Events must be mutually exclusive.
o Probability ranges from 0.0 to 1.0 and sums to 1.0.
• Types of Probability Distributions:
o Discrete: Countable values (e.g., car colors).
o Continuous: Range of values (e.g., miles per gallon).
• Common Distributions:
o Normal Distribution: Symmetric bell curve, common in natural phenomena.
o Binomial Distribution: Models number of successes in a fixed number of
trials.
o Categorical Distribution: Non-numeric variables (e.g., service levels).
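A minimal sketch of sampling from two of these distributions, assuming NumPy (the parameters are invented):

import numpy as np

rng = np.random.default_rng(0)

# Normal distribution: symmetric bell curve around a mean
heights = rng.normal(loc=170, scale=10, size=10_000)
print(heights.mean(), heights.std())

# Binomial distribution: number of successes in a fixed number of trials,
# e.g. heads in 20 fair coin flips, repeated 10,000 times
heads = rng.binomial(n=20, p=0.5, size=10_000)
print(heads.mean())   # close to n * p = 10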
• Naive Bayes Method:
o Predicts event likelihood based on data features.
o Useful for text classification (e.g., spam detection).
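A toy sketch of Naive Bayes text classification, assuming scikit-learn (the messages and labels are invented):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

messages = ["win a free prize now", "meeting moved to friday",
            "free offer claim now", "lunch tomorrow?"]
labels = ["spam", "ham", "spam", "ham"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(messages)    # word counts as features

classifier = MultinomialNB().fit(X, labels)
new = vectorizer.transform(["claim your free prize"])
print(classifier.predict(new))            # likely ['spam']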
• Quantifying Correlation
• Correlation Coefficient (r):
o Ranges between -1 and 1.
o Closer to 1 or -1 indicates strong correlation.
o Close to 0 suggests little or no linear relationship between the variables.
• Pearson Correlation:
o Used for continuous, numeric variables.
o Assumes normal distribution and linear relationship.
• Spearman’s Rank Correlation:
o Used for ordinal variables.
o Converts numeric pairs into ranks.
o Does not assume linearity; suitable for non-linear (monotonic) relationships.
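A minimal sketch of computing both coefficients, assuming SciPy (the data points are invented):

import numpy as np
from scipy.stats import pearsonr, spearmanr

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3])   # roughly linear in x

r, p_value = pearsonr(x, y)          # Pearson: numeric data, linear relationship
rho, p_rank = spearmanr(x, y)        # Spearman: rank-based, monotonic relationship
print(round(r, 3), round(rho, 3))    # both close to 1 for this data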
• Reducing Data Dimensionality with Linear Algebra
• Importance of Linear Algebra:
o Essential for mathematical and statistical operations on large datasets.
o Uses arrays and matrices.
• Singular Value Decomposition (SVD):
o Reduces dataset dimensionality.
o Removes redundant information and noise.
o Used in Principal Component Analysis (PCA).
• Factor Analysis:
o Filters out redundant information and noise.
o Assumes metric features, homogeneity, and correlation (r > 0.3).
• Principal Component Analysis (PCA):
o Finds relationships between features.
o Reduces dataset to non-redundant principal components.
o Does not regress to find underlying causes of shared variance.
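A minimal sketch of PCA-based dimensionality reduction, assuming scikit-learn (the synthetic data intentionally contains redundant features):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
base = rng.normal(size=(100, 2))
# Four features, but the last two are near-copies of the first two (redundant)
data = np.hstack([base, base + rng.normal(scale=0.01, size=(100, 2))])

pca = PCA(n_components=2)
reduced = pca.fit_transform(data)        # keep only 2 principal components
print(pca.explained_variance_ratio_)     # the kept components capture nearly all variance
print(reduced.shape)                     # (100, 2)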
• Modeling Decisions with Multiple Criteria Decision-Making (MCDM)
• Traditional MCDM:
o Used when multiple criteria are involved in decision-making.
o Assumes multiple criteria evaluation and a zero-sum system.
• Fuzzy MCDM:
o Evaluates suitability within a range.
• Overview:
o Regression methods from statistics help describe and quantify relationships
between variables.
o Useful for determining correlation strength and predicting future values,
assuming a cause-and-effect relationship.
• Linear Regression:
o Describes the relationship between a target variable y and predictor variables.
o Simple form: y = mx + b.
o Limitations:
▪ Works only with numerical variables.
▪ Requires handling missing values and outliers.
▪ Assumes linear relationships and independent features.
▪ Residuals should be normally distributed.
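A minimal sketch of fitting y = mx + b, assuming scikit-learn (the data points are invented):

import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([[1], [2], [3], [4], [5]], dtype=float)   # predictor
y = np.array([2.9, 5.1, 7.2, 8.8, 11.1])               # target, roughly y = 2x + 1

model = LinearRegression().fit(x, y)
print(model.coef_[0], model.intercept_)   # estimated slope m and intercept b
print(model.predict([[6.0]]))             # predicted y for a new x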
• Logistic Regression:
o Estimates values for a categorical target variable.
o The target variable should describe the target's class; classes are typically encoded as numeric values.
o Provides probability estimates for each prediction.
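A minimal sketch of logistic regression with probability estimates, assuming scikit-learn (the study-hours example is invented):

import numpy as np
from sklearn.linear_model import LogisticRegression

hours_studied = np.array([[1], [2], [3], [4], [5], [6]], dtype=float)
passed = np.array([0, 0, 0, 1, 1, 1])     # categorical target encoded as 0/1

clf = LogisticRegression(max_iter=1000).fit(hours_studied, passed)
print(clf.predict([[3.5]]))               # predicted class
print(clf.predict_proba([[3.5]]))         # probability estimate for each class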
• Ordinary Least Squares (OLS) Regression:
o Fits a linear regression line to a dataset.
o Useful for models with multiple independent variables.
o Estimates the target from dataset features.
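A minimal sketch of OLS with multiple independent variables, assuming the statsmodels package (the synthetic data is invented):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))                        # two independent variables
y = 3 * X[:, 0] - 2 * X[:, 1] + 5 + rng.normal(scale=0.1, size=50)

X_with_const = sm.add_constant(X)                   # add the intercept term
results = sm.OLS(y, X_with_const).fit()
print(results.params)                               # estimated intercept and coefficients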
Detecting Outliers
• Importance:
o Outliers can affect statistical and machine learning models.
o Important for detecting anomalies like fraud or equipment failure.
• Univariate Analysis:
o Inspects individual features for anomalous values.
o Methods: Tukey outlier labeling, Tukey boxplotting (a Tukey-fence sketch follows below).
• Multivariate Analysis:
o Considers combinations of variables to detect outliers.
o Methods: Scatter-plot matrix, boxplotting, DBSCAN, Principal Component Analysis (PCA).
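A minimal sketch of Tukey outlier labeling on a single feature, assuming NumPy (the values are invented):

import numpy as np

values = np.array([10, 12, 11, 13, 12, 11, 10, 95])    # 95 is the suspicious value

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr           # Tukey fences

outliers = values[(values < lower) | (values > upper)]
print(outliers)                                          # -> [95]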
• Introducing Time Series Analysis
• Overview:
o Time series: Data collected over time.
o Used to predict future values based on past data.
• Patterns:
o Time series exhibit specific patterns.
o Univariate time series analysis focuses on changes in a single variable over
time.
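A minimal sketch of a univariate time series, assuming pandas (the daily sales figures are invented):

import pandas as pd

dates = pd.date_range("2024-01-01", periods=6, freq="D")
sales = pd.Series([100, 110, 105, 120, 130, 125], index=dates)

print(sales.rolling(window=3).mean())    # smooth with a 3-day moving average
print(sales.resample("2D").sum())        # aggregate to 2-day totals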
Ch6
Exploring Data Worldwide
• Data.gov (USA):
o Provides open access to nonclassified US government data.
o Over 100,000 datasets available by mid-2014.
o Indicators include economic, environmental, STEM industry, quality of life,
and legal.
o Offers over 60 open-source APIs for creating tools and apps.
• Canada Open Data:
o Over 200,000 datasets available.
o Popular datasets cover environmental issues, citizenship, and quality of life.
• Data.gov.uk (UK):
o Started in 2010, with about 20,000 datasets by mid-2014.
o Useful for data on environmental, government spending, societal, health,
education, and business indicators.
• US Census Bureau:
o Provides demographic data useful for marketing and advertising research.
o Classifications include age, income, household size, gender, race, and
education level.
• NASA:
o Publicly shares nonclassified project data.
o Generates 4 terabytes of new earth-science data per day.
o Data includes astronomy, climate, life sciences, geology, and engineering.
• World Bank Open Data:
o Offers data on agriculture, economy, environment, science, technology,
financial sector, and poverty.
o Data can be downloaded or viewed online, with an API available for access (a small API sketch appears after this list of sources).
• Knoema:
o Houses over 500 databases and 150 million time series.
o Includes government data, national public data, UN data, international
organization data, and corporate data.
• Quandl:
o A search engine for numeric data with links to over 10 million datasets.
o Includes datasets from the UN, central banks, real estate organizations, and
think tanks.
• Exversion:
o Provides collaborative functionality for data similar to GitHub for code.
o Offers version control and hosting services for public data.
• OpenStreetMap (OSM):
o A crowd-sourced alternative to commercial mapping products.
o Users can contribute data by linking their GPS systems to the OSM
application.
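As a rough illustration of pulling open data programmatically (mentioned above for World Bank Open Data), here is a sketch using the requests library; the endpoint pattern and indicator code are assumptions that should be checked against the World Bank API documentation:

import requests

# Total US population (indicator SP.POP.TOTL) from the World Bank's public API (v2)
url = "https://api.worldbank.org/v2/country/US/indicator/SP.POP.TOTL"
response = requests.get(url, params={"format": "json", "per_page": 5})
metadata, records = response.json()      # first element is paging metadata

for row in records:
    print(row["date"], row["value"])     # year and population value (may be None)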
• Overview:
o Aims to implement a public data hub and strategy for transparency, e-participation, and innovation.
o Publishes datasets from ministries and government agencies in an open
format.
• Benefits:
o Central access point for finding, downloading, and using government data.
o Helps bridge the gap between governments and citizens.
o Enables research, feedback, and development of applications based on
government data.
• Statistics:
o As of August 2022, there are 6,544 datasets and 147 publishers.
o Open data license allows freedom to distribute, produce, and transform
datasets with proper attribution.
• Popularity:
o Python became the most popular language in the world in 2017 according to
IEEE Spectrum.
o It’s widely used in machine learning, robotics, artificial intelligence, and data
science.
• Reasons for Popularity:
o Ease of Learning: Python is relatively easy to learn.
o Free Resources: Everything needed to learn and use Python is free.
o Tools and Libraries: Python offers many ready-made tools for current
technologies like data science, machine learning, AI, and robotics.
• Version Selection:
o Visit the Python website to download the most current stable build
recommended for use.
• Overview:
o Jupyter Notebook supports Python, Julia, and R.
o It’s popular for sharing code online and comes free with Anaconda.
Ch7
Using Python's interactive mode with VS Code and Anaconda:
• Setting Up:
o Open Anaconda Navigator and launch VS Code.
o Open the Terminal pane in VS Code (View ➪ Terminal).
• Checking Python Version:
o In the Terminal, type python --version to check your Python version.
• Using the Python Interpreter:
o Enter python in the Terminal to start the Python interpreter (you’ll see the >>>
prompt).
o You can type commands directly into the interpreter.
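A short example of what an interactive session looks like (everything after >>> is typed by you; the other lines are the interpreter's responses):

>>> 2 + 3
5
>>> name = "data science"
>>> name.upper()
'DATA SCIENCE'
>>> exit()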
• Getting Help:
o Type help() in the interpreter to access Python’s built-in help system.
o To see a list of Python keywords, type keywords at the help prompt.
o Exit help by typing q (or quit) at the help prompt; to exit the Python interpreter itself, type exit() or press Ctrl+Z then Enter on Windows.
• Creating a Workspace:
o Set up a development environment in VS Code, referred to as a workspace.
o Follow the steps in your book (pages 37-39) to create and configure your
workspace.
Here’s a summary of the steps for creating a folder for your Python code and working with
Python in VS Code and Jupyter Notebook:
• Create a Folder:
o Create a folder to store all your Python code. You can name it and place it
anywhere you like.
o Associate this folder with your VS Code workspace to ensure you’re using the
correct Python interpreter and settings.