FDS Cheat Sheet and Read the Rule


Tools for a Data Scientist: Data scientists use diverse tools for data manipulation, modeling, visualization, and deployment. Key tools include:
- Programming Languages: Python offers libraries like Pandas for data manipulation, Matplotlib/Seaborn for visualization, and Scikit-learn for machine learning; R is well-suited for statistical analysis and visualization.
- Visualization Tools: Tableau and Power BI help create interactive dashboards and insights for non-technical stakeholders.
- Big Data Tools: Tools like Apache Spark and Hadoop handle large datasets efficiently.
- Machine Learning Frameworks: TensorFlow and PyTorch are widely used for deep learning.
- Version Control: Git/GitHub manages code and collaboration among team members.
Statistical data analysis is the process of interpreting data to uncover patterns, relationships, and trends. It involves:
- Descriptive Statistics: Summarizes data using measures like mean, median, standard deviation, and variance.
- Inferential Statistics: Makes predictions or inferences about a population based on a sample, using methods like hypothesis testing and confidence intervals.
For instance, analyzing the performance of students in an exam using mean scores provides a descriptive summary, while predicting overall class trends involves inferential analysis.
-------------------------------------------------------------
Hypothesis testing is a statistical method to determine the validity of a claim or hypothesis about a dataset based on sample data.
A bubble plot visually represents three dimensions of data: the x-axis, the y-axis, and the bubble size (representing a third variable).
Data cleaning involves identifying and rectifying errors, inconsistencies, and inaccuracies in a dataset to ensure quality and reliability.
Standard deviation measures the dispersion or spread of data points from the mean in a dataset.
Different types of attributes include nominal, ordinal, interval, and ratio.
A data object is an entity that holds data and is described by its attributes, such as a record in a database or an instance in programming.
Tools used for geospatial data include ArcGIS, QGIS, Google Earth Engine, and GeoPandas.
Methods of feature selection include filter methods (e.g., correlation), wrapper methods (e.g., recursive feature elimination), and embedded methods (e.g., LASSO).
Two Python libraries used for data analysis are Pandas and NumPy.
A nominal attribute is a qualitative attribute that categorizes data into distinct groups or labels without an inherent order (e.g., colors or gender).
One-hot encoding is a technique for representing categorical data as binary vectors, where each category is represented by a unique vector with a single 1.
Data visualization is the graphical representation of data using charts, graphs, or maps to identify trends, patterns, and insights.
Variance is a statistical measure that represents the average squared deviation of data points from their mean, indicating the spread of data.
-------------------------------------------------------------
A data cube organizes data into multidimensional arrays, enabling efficient querying and analysis.
- Purpose: Supports OLAP (Online Analytical Processing) operations like slicing (viewing specific dimensions) and dicing (creating smaller data cubes).
- Example: A retail data cube could analyze sales data across dimensions like time (days, months, years), location (regions, cities), and products (categories, brands).
Purpose of Data Preprocessing: Data preprocessing transforms raw data into a clean and usable format. Key steps include:
- Data Cleaning: Removes inconsistencies, missing values, and duplicates.
- Data Transformation: Normalizes or encodes data for compatibility with algorithms.
- Feature Engineering: Creates new features to enhance model performance.
- Purpose: Ensures data quality and improves the reliability of analyses or machine learning models.
Inferential statistics draws conclusions about a population using sample data.
- Example: Predicting election results based on a voter survey.
- Techniques: Hypothesis testing, regression analysis, and confidence intervals.
-------------------------------------------------------------
Purpose of Data Visualization: Data visualization translates complex data into visual formats like charts, graphs, and maps.
- Purpose: Identify trends, patterns, and outliers; communicate insights effectively to stakeholders.
- Example Libraries: Matplotlib is used for simple 2D plots like line graphs and scatter plots; Seaborn builds on Matplotlib for advanced statistical visualizations like heatmaps and pair plots.
- Visual Encoding: Attributes like color, shape, size, and position represent data elements, making relationships visually intuitive.
Applications of Data Science: Data science applies to various domains:
- Healthcare: Predictive analytics for diseases, drug discovery, and patient management.
- Retail: Inventory management, recommendation engines, and customer segmentation.
- Finance: Fraud detection, credit scoring, and stock market predictions.
- Transportation: Optimizing routes, analyzing traffic patterns, and enabling autonomous vehicles.
- Social Media: Sentiment analysis, content recommendation, and user behavior modeling.
-------------------------------------------------------------
Data science is a multidisciplinary field that uses techniques, tools, and algorithms to extract insights and knowledge from structured and unstructured data.
A data source refers to the origin of data, which can be databases, files, APIs, sensors, or web-based repositories used for analysis.
Missing values are data entries that are either absent or undefined in a dataset, often represented as `NaN` or blank spaces.
Visualization Libraries in Python: Some popular Python visualization libraries are Matplotlib, Seaborn, Plotly, Bokeh, and Altair.
Data transformation is the process of converting data into a suitable format for analysis, such as normalization, scaling, or aggregation.
-------------------------------------------------------------
Null Hypothesis (H₀): Assumes no effect or relationship exists. It represents the default assumption. For instance, "A new drug has no effect on patients' recovery times."
Alternate Hypothesis (H₁): Suggests an effect or relationship exists, challenging the null hypothesis. Example: "The new drug improves recovery times."
Hypothesis testing evaluates these claims using data to decide whether to reject H₀ in favor of H₁ based on significance levels (e.g., p-values); see the SciPy sketch below.
Noisy Data:
- Definition: Noisy data contains irrelevant or erroneous information that can distort analysis.
- Causes: Sensor errors (faulty devices, e.g., temperature sensors, produce inaccurate readings) and human input errors (typos or inconsistent data entry, e.g., "N/A" vs. "Not Applicable").
- Solution: Use data cleaning techniques like smoothing or filtering to mitigate noise.
The 3Vs represent the characteristics of big data:
- Volume: Refers to the sheer quantity of data generated daily (e.g., terabytes of social media posts).
- Variety: Includes structured data (databases), unstructured data (texts, images), and semistructured data (JSON, XML).
- Velocity: Represents the speed at which data is generated and processed (e.g., real-time streaming).
Two Ways in Which Data is Stored in Files:
- CSV (Comma-Separated Values): A simple text format where data is stored in tabular form, with rows and columns separated by commas. Widely used for structured data.
- JSON (JavaScript Object Notation): A lightweight format for representing semistructured data. Example: `{ "name": "John", "age": 30 }`.
------------------------------------------------------------
Role of Statistics in Data Science: Statistics provides foundational techniques for data science, such as:
- Descriptive Statistics: Summarizes data (e.g., mean, median, mode).
- Inferential Statistics: Generalizes findings from sample data to a larger population (e.g., hypothesis testing, confidence intervals).
- Data Distribution: Tools like histograms reveal data distribution, enabling better model selection.
Two Methods of Data Cleaning for Missing Values (see the Pandas sketch below):
- Imputation: Replace missing values with statistical measures like the mean, median, or mode, ensuring data continuity.
- Deletion: Remove rows or columns with excessive missing values, provided it doesn't significantly reduce dataset quality.
Word Clouds: A word cloud is a visual representation of textual data. Words appear in varying sizes, proportional to their frequency or importance.
- Uses: Quickly identifies key topics in large text datasets.
- Example: Analyzing customer feedback to determine common complaints or praises.
Differentiate Structured and Unstructured Data:
- Structured Data: Organized into rows and columns with a predefined schema (e.g., relational databases, CSV files).
- Unstructured Data: Lacks a fixed format (e.g., images, videos, free text).
- Example: Transaction records (structured) vs. social media posts (unstructured).
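A minimal hypothesis-testing sketch, assuming SciPy is available; the two samples of recovery times and the 0.05 significance level are invented for illustration, not taken from this sheet.

```python
# Two-sample t-test sketch with SciPy; the recovery-time values are invented.
from scipy import stats

placebo  = [12.1, 11.8, 13.0, 12.5, 12.9, 11.5, 12.7]   # H0: the drug has no effect
new_drug = [10.9, 11.2, 10.5, 11.8, 10.7, 11.0, 11.4]   # H1: the drug changes recovery time

t_stat, p_value = stats.ttest_ind(placebo, new_drug)

alpha = 0.05  # assumed significance level
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
if p_value < alpha:
    print("Reject H0: the difference is statistically significant.")
else:
    print("Fail to reject H0: no significant difference detected.")
```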
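A small Pandas sketch of the imputation and deletion methods for missing values described above; the DataFrame, column names, and values are hypothetical.

```python
# Handling missing values with Pandas: imputation vs. deletion on a hypothetical DataFrame.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [25, np.nan, 31, 40, np.nan],
    "salary": [50000, 62000, np.nan, 71000, 58000],
})

# Imputation: replace missing values with a statistical measure (here, the column mean).
imputed = df.fillna(df.mean(numeric_only=True))

# Deletion: drop rows that still contain missing values.
deleted = df.dropna()

print(imputed)
print(deleted)
```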
Types of Data:
1. Structured Data: Organized in rows and columns, easy to store in databases. - Example: Customer details with Name, Age, and Email.
2. Unstructured Data: No predefined format; harder to process. - Example: Social media posts, audio files, images.
3. Semi-structured Data: Partially organized with tags or markers. - Example: XML, JSON.
Types of Data Attributes:
1. Nominal: Categories without order. Example: Blood group (A, B, O).
2. Ordinal: Ordered categories. Example: Ratings (Poor, Average, Excellent).
3. Interval: Numeric, no true zero. Example: Temperature in Celsius.
4. Ratio: Numeric, has a true zero. Example: Height, age.
Cube Aggregation in the Context of Data Reduction: A data cube is a multi-dimensional structure used to summarize data by aggregating values.
- Example: Sales data summarized by Dimensions: Time (year), Region (state), Product (category); Measure: Total sales.
- Techniques: 1. Roll-Up: Aggregate data at a higher level (e.g., daily sales → monthly sales). 2. Drill-Down: Break down into details (e.g., state-level sales → city-level sales).
- Use: Reduces complexity while retaining meaningful information; speeds up analysis and improves insights.
------------------------------------------------------------
Four Data Visualization Tools:
1. Tableau: - Features: Drag-and-drop interface, interactive dashboards. - Use: Business analytics and reporting.
2. Matplotlib (Python): - Features: Basic plots like line graphs, bar charts, and scatter plots. - Use: Simple 2D visualizations.
3. Seaborn: - Features: Advanced statistical plots like heatmaps, pair plots. - Use: Exploring relationships and trends in data.
4. Power BI: - Features: Integrates with other Microsoft tools, supports real-time dashboards. - Use: Business intelligence and collaborative data analysis.
Data Science Life Cycle with Diagram: Stages:
1. Data Collection: Gathering raw data from various sources like APIs and logs.
2. Data Cleaning: Removing noise and duplicates, and handling missing values.
3. Exploratory Data Analysis (EDA): Identifying trends, patterns, and correlations.
4. Model Building: Applying algorithms to train predictive models.
5. Evaluation: Testing the model's accuracy using metrics (e.g., precision).
6. Deployment: Implementing the model into production.
Diagram: Represent this as a circular flow or pipeline with arrows showing interactions between the stages.
-------------------------------------------------------------
Primary data is original data collected directly by researchers for a specific purpose, such as surveys, experiments, or interviews.
Data quality refers to the degree to which data is accurate, consistent, complete, and suitable for its intended purpose.
An outlier is a data point that significantly deviates from the rest of the dataset, often indicating anomalies or errors.
The interquartile range (IQR) is the difference between the third quartile (Q3) and the first quartile (Q1), representing the middle 50% of a dataset.
Zip files are used to compress data, reduce storage requirements, and combine multiple files into a single package for easy sharing or archiving.
-------------------------------------------------------------
Concept and Use of Data Visualization: Data visualization transforms raw data into graphical formats like charts, graphs, and dashboards.
- Purpose: Identify trends and patterns, highlight anomalies, and improve decision-making.
- Libraries in Python: Matplotlib (basic 2D plotting), Seaborn (advanced statistical visuals), Plotly (interactive charts), Bokeh (real-time visualizations).
Outlier Detection Methods: Outliers are unusual data points that deviate significantly from the rest.
- Methods: 1. Statistical: Z-score identifies values far from the mean; IQR detects values outside the 1.5×IQR range. 2. Visualization: Boxplots highlight outliers with whiskers.
Data Transformation Techniques:
1. Normalization: Rescales data to fit within [0, 1]. - Example: Converting prices from the $10-$100 range to 0.1-1.
2. One-Hot Encoding: Converts categorical data into binary format. - Example: Category {Red, Blue, Green} → Encoding {1, 0, 0} (Red), {0, 1, 0} (Blue).
----------------------------------------------------------
An outlier is a data point that differs significantly from the other data points in a dataset. Outliers can affect statistical analyses and may need to be removed or treated depending on the analysis context.
Types of Outliers:
1. Univariate Outliers: Outliers in a single variable, identified using statistical methods like Z-scores or the IQR (Interquartile Range). - Example: A person with an extremely high income in a salary dataset.
2. Multivariate Outliers: Outliers in a combination of two or more variables, identified using multivariate statistical methods like Mahalanobis distance. - Example: A combination of age and income that doesn't fit any normal pattern in a dataset.
3. Contextual Outliers: Outliers that are considered unusual in a specific context but not in others. - Example: A temperature of 40°C in winter may be an outlier in one region but normal in a tropical region.
-----------------------------------------------------------
Three data transformation techniques (see the Pandas sketch below): Data transformation is the process of converting data from one format or structure into another to make it suitable for analysis. Some key techniques include:
1. Normalization: Rescaling data to fall within a specific range, usually [0, 1]. This helps eliminate bias when working with algorithms sensitive to varying scales (e.g., machine learning models). Formula: Normalized value = (Value − Min) / (Max − Min).
2. Log Transformation: Applying the logarithm to data values, which can help reduce skewness and make patterns more apparent. Example: A financial dataset with large values, like income or sales, can be log-transformed to make analysis more manageable.
3. One-Hot Encoding: Converts categorical variables into a series of binary (0 or 1) columns, where each column represents one possible category. This technique is useful for algorithms that require numerical input. Example: A "Color" column with categories {Red, Blue, Green} can be transformed into Red → [1, 0, 0], Blue → [0, 1, 0], Green → [0, 0, 1].
-------------------------------------------------------------
Data Transformation refers to the process of converting data into a format or structure that is better suited for analysis or modeling. It is an essential step in data preprocessing, ensuring that the data is clean, uniform, and compatible with the methods and algorithms applied later. Strategies for Data Transformation:
1. Scaling and Normalization: Techniques like Min-Max scaling or standardization are used to scale numerical values to a common range or distribution (mean = 0, std = 1), especially when dealing with machine learning models.
2. Feature Encoding: Transforming categorical data into numerical values using techniques such as One-Hot Encoding or Label Encoding, allowing machine learning algorithms to process them effectively.
3. Handling Missing Data: Techniques like imputation (replacing missing values with the mean, median, or mode) or removal of rows/columns with missing values are used to handle incomplete data.
-------------------------------------------------------------
Different methods for measuring data dispersion (see the NumPy sketch below): Data dispersion refers to the extent to which data points in a dataset spread out or cluster around the mean. Common measures of data dispersion include:
1. Range: The difference between the maximum and minimum values in a dataset. Formula: Range = Max Value − Min Value. - Example: Dataset {5, 10, 15, 20}: Range = 20 − 5 = 15.
2. Variance: Measures the average squared deviation from the mean; it tells how spread out the data is. Formula: Variance = Σ(Xᵢ − μ)² / N. - Example: A dataset with small variance means that most values are close to the mean.
3. Standard Deviation: The square root of the variance. It gives a measure of dispersion in the same units as the data, making it easier to interpret. - Example: In the previous example, the standard deviation gives a more interpretable measure of the spread of the data.
4. Interquartile Range (IQR): The range between the first quartile (25th percentile) and the third quartile (75th percentile). IQR measures the middle 50% of the data, excluding outliers. Formula: IQR = Q3 − Q1.
-------------------------------------------------------------
One technique of data transformation: One common technique of data transformation is Normalization. It scales the data to a specific range, typically [0, 1], by applying the formula: Normalized Value = (Value − Min) / (Max − Min). This technique is especially important when working with algorithms like k-nearest neighbors (KNN) and neural networks, which are sensitive to varying scales of input features.
Feature Extraction is the process of transforming raw data into a set of usable features or variables that can be fed into machine learning algorithms. It is a crucial step for reducing the dimensionality of the data and improving model performance. The goal is to retain the most important information while reducing noise and irrelevant details. Example: In text analysis, TF-IDF (Term Frequency-Inverse Document Frequency) is a common feature extraction technique used to represent text data based on word frequency and importance (see the scikit-learn sketch below).
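A NumPy sketch of the dispersion measures and outlier-detection methods above; the sample values are invented, the 1.5×IQR rule follows the sheet, and the |z| > 2 cut-off is an assumed convention.

```python
# Dispersion measures and simple outlier detection on a hypothetical sample.
import numpy as np

data = np.array([5, 10, 15, 20, 12, 14, 95])  # 95 is an obvious outlier

# Dispersion measures
data_range = data.max() - data.min()        # Range = Max - Min
variance   = data.var()                     # Variance = sum((x - mean)^2) / N
std_dev    = data.std()                     # Standard deviation = sqrt(variance)
q1, q3     = np.percentile(data, [25, 75])
iqr        = q3 - q1                        # IQR = Q3 - Q1

# Z-score method: flag points far from the mean (|z| > 2 is an assumed cut-off)
z_scores   = (data - data.mean()) / std_dev
z_outliers = data[np.abs(z_scores) > 2]

# IQR method: flag points outside 1.5 * IQR of the quartiles
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = data[(data < lower) | (data > upper)]

print(data_range, variance, std_dev, iqr)
print("Z-score outliers:", z_outliers)
print("IQR outliers:", iqr_outliers)
```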
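A Pandas/NumPy sketch of the three transformation techniques above (min-max normalization, log transformation, one-hot encoding); the column names and values are hypothetical.

```python
# Data transformation sketch on a hypothetical DataFrame.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "price":  [10, 25, 55, 100],
    "income": [30000, 45000, 120000, 900000],
    "color":  ["Red", "Blue", "Green", "Red"],
})

# 1. Normalization: (value - min) / (max - min) rescales to [0, 1]
df["price_norm"] = (df["price"] - df["price"].min()) / (df["price"].max() - df["price"].min())

# 2. Log transformation: reduces skewness of large-valued features
df["income_log"] = np.log(df["income"])

# 3. One-hot encoding: one binary column per category
df = pd.get_dummies(df, columns=["color"])

print(df)
```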
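A scikit-learn sketch of TF-IDF feature extraction as mentioned above; the short feedback snippets are invented for illustration.

```python
# TF-IDF feature extraction with scikit-learn on a few invented text snippets.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "delivery was fast and the product is great",
    "terrible delivery, the product arrived broken",
    "great product, great price",
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(docs)   # sparse matrix: documents x terms

print(vectorizer.get_feature_names_out())       # the learned vocabulary
print(tfidf_matrix.toarray().round(2))          # TF-IDF weights per document
```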
Exploratory Data Analysis (EDA) is the
process of analyzing and visualizing datasets
to summarize their main characteristics, often
with the help of graphical representations. The
goal of EDA is to understand the structure of
the data, detect patterns, spot anomalies, test
hypotheses, and check assumptions.
Steps in EDA: 1. Data Cleaning: Handle
missing values, outliers, and duplicates.
2. Univariate Analysis: Analyze the distribution
and summary statistics of individual variables
(e.g., histograms, box plots). 3. Bivariate
Analysis: Explore relationships between two
variables (e.g., scatter plots, correlation
matrices). 4. Multivariate Analysis: Investigate
interactions between multiple variables (e.g.,
pair plots, principal component analysis).
Tools for EDA: - Python Libraries: Pandas,
Matplotlib, Seaborn, Plotly. - Software:
Tableau, Power BI.
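A minimal EDA sketch following the steps above with Pandas, Matplotlib, and Seaborn; the file name `students.csv` and its columns are assumptions, not part of this sheet.

```python
# Minimal EDA sketch: summary statistics, univariate and bivariate views.
# "students.csv" and its columns are hypothetical.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("students.csv")

df.info()                        # column types and missing values
print(df.describe())             # summary statistics of numeric columns

df["score"].hist(bins=20)        # univariate: distribution of one variable
plt.show()

sns.scatterplot(data=df, x="hours_studied", y="score")  # bivariate relationship
plt.show()

print(df.corr(numeric_only=True))  # correlation matrix for numeric columns
```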
Data discretization is the process of
converting continuous data into discrete
categories or bins, often used in machine
learning.
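A short sketch of discretization with Pandas; the age values, bin edges, and labels are hypothetical.

```python
# Discretization: converting a continuous variable into categorical bins with pd.cut.
import pandas as pd

ages = pd.Series([3, 17, 25, 42, 67, 80])
age_groups = pd.cut(ages, bins=[0, 18, 35, 60, 100],
                    labels=["child", "young adult", "adult", "senior"])
print(age_groups)
```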
A tag cloud is a visual representation of text
data, where the size of each word indicates its
frequency or importance in the dataset.
Visual encoding refers to the use of visual
elements (such as color, size, shape, or
position) to represent data in visualizations.
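A Matplotlib sketch of visual encoding as a bubble plot (position encodes two variables and bubble size a third, as defined earlier); all values are invented.

```python
# Bubble plot: x/y position encode two variables, bubble size encodes a third.
import matplotlib.pyplot as plt

gdp        = [1.2, 2.5, 3.9, 5.4]   # x-axis (invented values)
life_exp   = [68, 72, 78, 82]       # y-axis
population = [50, 120, 300, 900]    # third variable -> bubble size

plt.scatter(gdp, life_exp, s=population, alpha=0.5)
plt.xlabel("GDP (trillions)")
plt.ylabel("Life expectancy")
plt.title("Bubble plot: size encodes population")
plt.show()
```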
------------------------------------------------------------
Volume Characteristic of Data in Reference
to Data Science: In data science, the volume
characteristic refers to the massive amount of
data generated from various sources,
measured in gigabytes, terabytes, or more.
Examples of semistructured data include
JSON, XML, and CSV files with irregular or
nested structures.
A quartile divides a dataset into four equal
parts. The first quartile (Q1) is the 25th
percentile, and the third quartile (Q3) is the
75th percentile.
XML (Extensible Markup Language) is a
data format used to structure data in a tree-
like format with custom tags, enabling data
exchange across systems.
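A standard-library sketch of reading the semistructured formats above; the JSON string and XML snippet are invented examples.

```python
# Parsing semistructured data (JSON and XML) with the Python standard library.
import json
import xml.etree.ElementTree as ET

json_text = '{ "name": "John", "age": 30 }'
record = json.loads(json_text)
print(record["name"], record["age"])

xml_text = "<person><name>John</name><age>30</age></person>"
root = ET.fromstring(xml_text)
print(root.find("name").text, root.find("age").text)
```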
Rule for copying:
- When the "-" sign appears in an answer, go to the next line.
- When numbers like 1., 2., 3. appear in an answer, go to the next line.
