
DATA SCIENCE

MADE BY VIKTORIIA BURKO


CONTENTS:

Data Analysis;
Data Types;
Data Formats;
Data Storage in Computers;
Manipulating Data Sets;
Data Cleansing;
Data Analysis Process;
DATA ANALYSIS
Data analysis involves examining,
cleaning, transforming, and modeling
data to discover useful information,
draw conclusions, and support decision-
making. It employs various statistical
and computational techniques to identify
patterns, trends, and relationships within
data sets.
Common steps include data collection, data
processing, exploratory data analysis, and
interpretation of results. Tools like
spreadsheets, programming languages (e.g.,
Python, R), and specialized software (e.g.,
Tableau, SAS) are frequently used. Effective
data analysis can lead to insights that drive
business strategy, scientific research, and policy
development.
DATA TYPES
A data type is an attribute associated with a piece of data that tells a computer
system how to interpret its value. Understanding data types ensures that data is
collected in the preferred format and the value of each property is as expected. Data
types should not be confused with the two types of data that are collectively
referred to as customer data: entity data and event data. To properly define event
properties and entity properties, you need a good understanding of data types. A
well-defined tracking plan must contain the data type of every property to ensure
data accuracy and prevent data loss.
QUALITATIVE DATA
Qualitative data refers to non-numeric information that
describes qualities or characteristics. It is often collected
through interviews, surveys, observations, and textual
analysis, capturing the subjective aspects of experiences and
perceptions. Unlike quantitative data, qualitative data is
typically categorized into themes or patterns rather than
measured in numbers. Examples include opinions,
behaviors, and descriptions, which can provide deep
insights into complex issues. Analyzing qualitative data
often involves coding and identifying recurring themes to
understand underlying meanings and motivations.
QUANTITATIVE DATA
Quantitative data refers to numerical information that can
be measured and quantified. It is often collected through
experiments, surveys, and databases, providing objective
data that can be analyzed statistically. This type of data is
used to identify patterns, test hypotheses, and make
predictions based on numerical trends. Examples of
quantitative data include height, weight, temperature, and
test scores. Analyzing quantitative data involves using
mathematical and statistical techniques to interpret the
numbers and draw conclusions about the relationships and
differences within the data.
COMMON DATA TYPES

Integer (int)

It is the most common numeric data type, used to store numbers without a fractional component (-707, 0, 707).

Floating Point (float)

It is also a numeric data type, used to store numbers that may have a fractional component, like monetary values do (707.07, 0.7, 707.00). Please note that number is often used as a data type that includes both int and float types.

Character (char)

It is used to store a single letter, digit, punctuation mark, symbol, or blank space.
String (str or text)

It is a sequence of characters and the most commonly used data type to store text. Additionally, a string can also include digits and symbols; however, it is always treated as text.

A phone number is usually stored as a string (+1-999-666-3333) but can also be stored as an integer (9996663333).

Boolean (bool)

It represents the values true and false. When working with the boolean data type, it is helpful to keep in mind that
sometimes a boolean value is also represented as 0 (for false) and 1 (for true).

Enumerated type (enum)

It contains a small set of predefined unique values (also known as elements or enumerators) that can be compared and
assigned to a variable of enumerated data type.

The values of an enumerated type can be text-based or numerical. In fact, the boolean data type is a pre-defined
enumeration of the values true and false.
Array
Also known as a list, an array is a data type that stores a number of elements in a specific order, typically all of the
same type.
Since an array stores multiple elements or values, the structure of data stored by an array is referred to as an array data
structure.
Each element of an array can be retrieved using an integer index (0, 1, 2,…), and the total number of elements in an
array represents the length of an array.
Date
Needs no explanation; typically stores a date in the YYYY-MM-DD format (ISO 8601 syntax).
Time
Stores a time in the hh:mm:ss format. Besides the time of the day, it can also be used to store the time elapsed or the
time interval between two events which could be more than 24 hours. For example, the time elapsed since an event
took place could be 72+ hours (72:00:59).
Datetime
Stores a value containing both date and time together in the YYYY-MM-DD hh:mm:ss format.
Timestamp
Typically represented in Unix time, a timestamp represents the number of seconds that have elapsed since 00:00:00 UTC on 1 January 1970.
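The types listed above map naturally onto Python's built-ins; the following sketch (with made-up example values) shows each one in turn:

```python
from enum import Enum
from datetime import date, time, datetime, timezone

# Numeric types: int has no fractional component, float may have one.
count = -707               # int
price = 707.07             # float

# A single character and a longer string; Python uses str for both.
grade = "A"                # a "char" is just a length-1 string in Python
phone = "+1-999-666-3333"  # phone numbers are usually stored as strings

# Boolean values; bool is also represented as 0 (false) and 1 (true).
active = True
assert int(active) == 1

# An enumerated type: a small set of predefined unique values.
class Color(Enum):
    RED = 1
    GREEN = 2
    BLUE = 3

# An array (list): ordered elements retrieved by integer index (0, 1, 2, ...).
scores = [88, 92, 75]
assert scores[0] == 88 and len(scores) == 3

# Date, time, datetime, and a Unix timestamp.
d = date(2024, 1, 31)       # YYYY-MM-DD
t = time(13, 45, 30)        # hh:mm:ss
dt = datetime(2024, 1, 31, 13, 45, 30, tzinfo=timezone.utc)
print(int(dt.timestamp()))  # seconds elapsed since the Unix epoch
```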
IMPORTANCE OF DATA TYPES

You might be wondering why it's important to know about all these data types when you are mainly concerned with understanding how to leverage customer data. There is only one main reason: to gather clean and consistent data.

Your knowledge of data types will come in handy in two stages of your data collection efforts, as described below.
EXAMPLE AND RECAP

Different programming languages offer various other data types for a variety of purposes; however, the most commonly used data types that you need to know to become data-led have been covered.

A good way to think about data types is when you come across any form or survey. Looking at a standard registration form, you should keep in mind that each field accepts values of a particular data type. A text field stores the input as a string, while a number field typically accepts an integer.

Names and email addresses are always of the type string, while numbers can be stored as a numerical type or as a string, since a string is a set of characters including digits. In single-option or multiple-option fields, where one has to select from predefined options, the enumerated and array data types come into play.
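The registration-form recap can be sketched as a small validator. All names and rules here are illustrative, not from any particular framework: a hypothetical form arrives with every value as raw text, and each field is coerced to its expected type:

```python
from enum import Enum

class Plan(Enum):
    """Single-option field: one of a small set of predefined values."""
    FREE = "free"
    PRO = "pro"
    ENTERPRISE = "enterprise"

def validate_registration(form: dict) -> dict:
    """Coerce raw form input into the data type each field expects."""
    return {
        "name": str(form["name"]).strip(),          # text field -> string
        "age": int(form["age"]),                    # number field -> integer
        "plan": Plan(form["plan"]),                 # single option -> enum
        "interests": list(form["interests"]),       # multiple option -> array
        "subscribed": form["subscribed"] == "yes",  # checkbox -> boolean
    }

record = validate_registration({
    "name": "  Ada Lovelace ",
    "age": "28",
    "plan": "pro",
    "interests": ["math", "computing"],
    "subscribed": "yes",
})
print(record["age"], record["plan"].name, record["subscribed"])
```

A tracking plan plays the same role for event and entity properties: declaring the type up front lets bad input be rejected before it pollutes the dataset.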
DATA FORMATS
Data formats refer to the structures in which data is
organized, stored, and transmitted, allowing for efficient
access and interpretation. Common data formats include
CSV (Comma-Separated Values), which is simple and
widely used for tabular data; JSON (JavaScript Object
Notation), which is lightweight and commonly used for
data interchange in web applications; and XML (eXtensible
Markup Language), which is versatile for hierarchical data
representation. Other formats like SQL databases are used
for structured data storage and retrieval, while formats such
as Parquet and Avro are optimized for large-scale data
processing. Choosing the appropriate data format depends
on the specific needs of the data handling, including
storage efficiency, ease of access, and compatibility with
analysis tools.
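The trade-off between formats is easy to see by serializing the same small record set two ways with Python's standard library (the records themselves are made up):

```python
import csv
import io
import json

# One small record set, serialized two ways.
rows = [
    {"id": 1, "name": "Ada", "score": 95},
    {"id": 2, "name": "Grace", "score": 88},
]

# CSV: flat and tabular -- simple and widely supported.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["id", "name", "score"])
writer.writeheader()
writer.writerows(rows)
csv_text = buf.getvalue()

# JSON: lightweight, nests naturally, common for web data interchange.
json_text = json.dumps({"students": rows}, indent=2)

# Round-trip both. CSV loses type information -- every field
# comes back as a string -- while JSON preserves numbers.
decoded_csv = list(csv.DictReader(io.StringIO(csv_text)))
decoded_json = json.loads(json_text)["students"]
print(decoded_csv[0]["score"], type(decoded_csv[0]["score"]).__name__)
print(decoded_json[0]["score"], type(decoded_json[0]["score"]).__name__)
```

This is one concrete reason format choice matters: a CSV pipeline must re-parse types on every read, whereas JSON (and typed formats like Parquet and Avro) carry that information with the data.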
STRUCTURED DATA
Structured data refers to information that is organized into a
defined format, making it easily searchable and analyzable
by computers. This type of data is typically stored in
relational databases and spreadsheets, where it is arranged
in tables with rows and columns. Each column represents a
variable, and each row contains a record, ensuring
consistency and enabling efficient querying and reporting.
Examples of structured data include customer information
in a CRM system, financial transactions in accounting
software, and inventory lists. The rigid structure of this data
type allows for straightforward data management and
analysis using SQL and other database management tools.
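A minimal sketch of structured data in a relational table, using Python's built-in sqlite3 module; the table, columns, and values are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customers (
        id    INTEGER PRIMARY KEY,  -- each row is one record
        name  TEXT NOT NULL,        -- each column is one variable
        spend REAL
    )
""")
conn.executemany(
    "INSERT INTO customers (name, spend) VALUES (?, ?)",
    [("Ada", 120.50), ("Grace", 340.00), ("Alan", 89.99)],
)

# The rigid row/column structure makes querying straightforward.
total, = conn.execute(
    "SELECT SUM(spend) FROM customers WHERE spend > 100"
).fetchone()
print(total)
conn.close()
```

The enforced schema (every row has the same columns, each with a declared type) is exactly what makes structured data "easily searchable and analyzable": the query engine can rely on it.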
SEMI-STRUCTURED DATA
Semi-structured data is a form of data that does not reside in a
traditional relational database but still has some organizational
properties that make it easier to analyze than unstructured data.
It contains tags or markers to separate semantic elements and
enforce hierarchies of records and fields within the data.
Common formats of semi-structured data include JSON
(JavaScript Object Notation) and XML (eXtensible Markup
Language); NoSQL databases are often used to store it. This type of data is often
found in web data, emails, and data streams from sensors. The
flexible schema allows for a more adaptable approach to data
storage and retrieval, accommodating varying types of data and
evolving requirements.
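The defining properties (tags separating semantic elements, hierarchy, but no fixed schema) show up clearly in a small JSON document; this example record is invented for illustration:

```python
import json

# Keys act as tags separating semantic elements, and nesting enforces
# a hierarchy -- but there is no fixed schema, so different records in
# the "events" stream carry different fields.
raw = """
{
  "user": "vika",
  "events": [
    {"type": "login",    "device": "mobile"},
    {"type": "purchase", "amount": 19.99, "currency": "USD"}
  ]
}
"""
doc = json.loads(raw)

# Navigate the hierarchy; fields absent from a given record are
# handled gracefully with .get() rather than causing an error.
for event in doc["events"]:
    print(event["type"], event.get("amount", "-"))
```

The `.get()` call is the flexible-schema trade-off in miniature: the reader, not the storage layer, decides what to do when a field is missing.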
UNSTRUCTURED DATA
Unstructured data refers to information that lacks a
predefined format or structure, making it more challenging
to collect, process, and analyze. Unlike structured data,
which fits neatly into tables, unstructured data is often text-
heavy and can include multimedia elements. Examples of
unstructured data include emails, social media posts, videos,
audio files, and images. This type of data is rich in
information but requires advanced techniques like natural
language processing (NLP), image recognition, and
machine learning to extract meaningful insights. Due to its
complexity and volume, managing unstructured data often
involves specialized tools and technologies for storage,
indexing, and analysis.
STORAGE IN COMPUTERS
Storage in computers refers to the components and devices used to
retain digital data, ensuring that information is available for
processing and retrieval as needed. There are two primary types of
storage: primary storage, also known as volatile memory or RAM
(Random Access Memory), which provides fast, temporary storage
for data actively being used by the CPU; and secondary storage,
which includes non-volatile memory such as hard drives (HDDs),
solid-state drives (SSDs), and external storage devices. Secondary
storage retains data even when the computer is turned off, making it
suitable for long-term storage. Additionally, there are cloud storage
solutions that allow data to be stored on remote servers accessed over
the internet, providing scalability and remote access capabilities.
Each type of storage has its own advantages in terms of speed,
capacity, and cost.
MANIPULATING DATA SETS
Manipulating data sets involves various techniques and processes to clean, transform, and analyze data to extract meaningful
insights and facilitate decision-making. Common tasks include:
1. Data Cleaning: Removing or correcting errors, handling missing values, and ensuring consistency in data formats.
2. Data Transformation: Changing the data’s format or structure, such as normalizing, aggregating, or reshaping data to fit
analytical requirements.
3. Filtering and Sorting: Selecting specific subsets of data based on criteria and arranging data in a particular order to highlight
trends or patterns.
4. Merging and Joining: Combining multiple data sets into a single, cohesive data set by aligning related information based on
common keys or indexes.
5. Aggregation: Summarizing data through operations like averaging, summing, or counting to condense large data sets into more
interpretable formats.
Tools such as Excel, SQL, Python (with libraries like pandas), and R are commonly used to perform these tasks, enabling analysts
to refine raw data into actionable insights.
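In practice these tasks are usually done with pandas or SQL; to keep the sketch dependency-free, here are standard-library versions of tasks 3, 4, and 5 (filtering/sorting, joining, aggregation) on made-up sales records:

```python
from itertools import groupby
from statistics import mean

sales = [
    {"region": "north", "amount": 120},
    {"region": "south", "amount": 300},
    {"region": "north", "amount": 80},
    {"region": "south", "amount": 150},
]
regions = [
    {"region": "north", "manager": "Ada"},
    {"region": "south", "manager": "Grace"},
]

# Filtering and sorting: select a subset, then order it to surface trends.
big = sorted((s for s in sales if s["amount"] >= 100),
             key=lambda s: s["amount"], reverse=True)

# Merging/joining: align the two data sets on the common key "region".
managers = {r["region"]: r["manager"] for r in regions}
joined = [{**s, "manager": managers[s["region"]]} for s in sales]

# Aggregation: average amount per region (groupby needs sorted input).
by_region = sorted(sales, key=lambda s: s["region"])
averages = {k: mean(s["amount"] for s in g)
            for k, g in groupby(by_region, key=lambda s: s["region"])}
print(averages)
```

With pandas the same three steps collapse to `df.sort_values`, `df.merge`, and `df.groupby("region")["amount"].mean()`, which is why it is the usual tool for this work.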
Manipulating data sets in Excel involves various techniques to clean, transform, and analyze data efficiently. Key tasks include:

Selecting Data: To select data, you can click and drag over cells, use keyboard shortcuts (like Ctrl + Shift + Arrow keys), or use the name box to jump to specific cell ranges. Excel also offers features like filters and tables to easily select subsets of data based on specific criteria.

Reordering Data: Reordering involves sorting data to arrange it in a meaningful order. You can sort columns in ascending or descending order by selecting the column header and using the Sort feature found in the Data tab. For more complex sorting, you can use the Sort dialog box to sort by multiple columns.

Reformatting Data: Reformatting changes the appearance and structure of data. This can include changing cell formats (like dates, currency, or percentages), applying conditional formatting to highlight specific data points, or using functions to convert text to uppercase or lowercase. The Text to Columns feature can split text data into multiple columns based on delimiters.
SELECTING A COLUMN

REORDERING A COLUMN

REFORMATTING A COLUMN

FILTERING AND SORTING ROWS

SUBSETTING DATA

REMOVING DUPLICATES
DATA ANALYSIS PROCESS
The data analysis process typically involves several key steps:

Defining the Problem: Clearly articulate the objectives of the analysis and the questions you seek to answer. Understanding the
purpose of the analysis helps guide subsequent steps.

Data Collection: Gather relevant data from various sources, ensuring its quality, completeness, and relevance to the analysis goals.
This may involve accessing databases, conducting surveys, or collecting data through experiments.

Data Cleaning and Preparation: Clean the data to remove errors, inconsistencies, and missing values. This step also involves
transforming and restructuring the data to make it suitable for analysis. Tasks may include standardizing formats, encoding
categorical variables, and scaling numerical data.

Exploratory Data Analysis (EDA): Explore the data to understand its characteristics, identify patterns, and detect outliers. EDA
techniques include summary statistics, data visualization (e.g., histograms, scatter plots), and correlation analysis to uncover
insights and hypotheses.
Hypothesis Testing and Modeling: Formulate hypotheses based on insights from EDA and use
statistical methods to test them. This step may involve building predictive models (e.g., regression,
classification) or conducting inferential analyses to draw conclusions about the data population.

Interpretation and Insights: Interpret the results of the analysis in the context of the problem
domain, drawing meaningful conclusions and actionable insights. Communicate findings effectively
to stakeholders through reports, presentations, or visualizations.

Iterative Refinement: Review and refine the analysis process iteratively, incorporating feedback
and additional data as needed. Continuously validate and update models to improve accuracy and
relevance over time.

Decision Making and Implementation: Use the insights gained from the analysis to inform
decision-making and drive actions or interventions. Monitor the impact of decisions and track
performance metrics to assess the effectiveness of the analysis in achieving desired outcomes.
DATA CLEANSING

Data cleansing, also known as data cleaning or data scrubbing, is the process of identifying and correcting errors, inconsistencies, and inaccuracies in a dataset to improve its quality and reliability for analysis. Key steps in the data cleansing process include:

Identifying Data Quality Issues: Review the dataset to identify common issues such as missing values, duplicate records, incorrect formats, and outliers. Understanding the nature and extent of these issues is essential for developing an effective cleansing strategy.

Handling Missing Data: Determine how to handle missing values, which can skew analysis results. Options include imputing missing values using statistical methods, deleting rows or columns with missing data, or flagging missing values for further investigation.

Removing Duplicates: Identify and remove duplicate records or observations from the dataset, ensuring that each entry is unique. This helps prevent duplication bias and ensures the accuracy of analysis results.

Standardizing Data Formats: Standardize data formats to ensure consistency and comparability across the dataset. This may involve converting data types, standardizing date formats, and normalizing text fields to remove variations.
Correcting Errors: Identify and correct errors in the dataset, such as typographical errors, inconsistencies in naming conventions, and invalid values. This may require manual review or automated algorithms to detect and rectify errors.

Handling Outliers: Identify and address outliers, which are data points that deviate significantly from the rest of the dataset. Depending on the analysis goals, outliers can be treated by removing them, transforming them, or analyzing them separately.

Validating Data Integrity: Validate the integrity of the cleansed dataset to ensure that it meets quality standards and is fit for analysis. This may involve cross-referencing data against external sources, conducting data validation checks, and performing quality assurance tests.

Documenting Changes: Document all changes made during the data cleansing process, including the rationale behind each decision and any assumptions made. Maintaining clear documentation helps ensure transparency and reproducibility of the analysis.
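Several of these steps (standardizing formats, normalizing text, removing duplicates, flagging missing values) fit in one short pass over some deliberately messy, made-up records:

```python
from datetime import datetime

# Raw records with typical quality issues: a duplicate, a missing value,
# inconsistent date formats, and inconsistent text casing.
raw = [
    {"name": "Ada",   "signup": "2024-01-05", "city": "london"},
    {"name": "ada",   "signup": "05/01/2024", "city": "London"},    # duplicate
    {"name": "Grace", "signup": None,         "city": "NEW YORK"},  # missing
]

def parse_date(value):
    """Standardize several date formats to ISO YYYY-MM-DD."""
    if value is None:
        return None  # flag for further investigation rather than guess
    for fmt in ("%Y-%m-%d", "%d/%m/%Y"):
        try:
            return datetime.strptime(value, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {value!r}")

seen, clean = set(), []
for rec in raw:
    normalized = {
        "name": rec["name"].title(),        # normalize text fields
        "signup": parse_date(rec["signup"]),
        "city": rec["city"].title(),
    }
    key = normalized["name"]                # dedupe on a chosen key
    if key not in seen:
        seen.add(key)
        clean.append(normalized)

print(clean)
```

Note the design choice on missing data: the sketch flags it (keeps `None`) rather than imputing or deleting, which matches the "further investigation" option above; a real pipeline would also log each change for the documentation step.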
THANK YOU FOR YOUR ATTENTION
