BA Data Science Foundations

Define a multidisciplinary practice with multiple meanings

Here are the key takeaways from the video "Define a multidisciplinary practice with
multiple meanings":

 Definition of Data Scientist: The term "data scientist" is broad and not yet fully
standardized. It encompasses professionals from various fields like statistics, data
analysis, mathematics, systems engineering, and even business and finance.
 Multidisciplinary Nature: Data science is still evolving as a discipline. It involves
a mix of different fields and practices, similar to early archaeology before it
became formalized.
 Empirical Approach: A key aspect of data science is using an empirical
approach—asking questions, conducting experiments, and making adjustments
based on data to gain insights.

These points emphasize the evolving nature of data science and the importance of a
scientific method in the field.

Use statistics and software


Sure, let's break down the key concepts from the video "Use statistics and software":
Key Concepts:

1. Categories of Tools:
 Storing Data: Tools like spreadsheets, relational databases (e.g.,
PostgreSQL), and distributed or key-value stores (e.g., Hadoop, Cassandra)
are used to store large amounts of data.
 Scrubbing Data: This involves cleaning and preparing data for analysis.
Tools include text editors, scripting tools, and programming languages like
Python.
 Analyzing Data: Statistical packages such as R, SPSS, and Python's data
libraries help analyze data and create visualizations.
2. Big Data:
 Definition: Big data refers to data sets so large that they can't fit into
traditional database management systems.
 Hadoop: An open-source software that uses a distributed file system to
store data across multiple servers (a Hadoop cluster). It processes data
using tools like MapReduce (batch processing) and Apache Spark (real-
time processing).
3. Data Scrubbing:
 Importance: Data scientists spend a significant amount of time (up to
90%) cleaning data to make it usable.
 Example: If collecting Tweets, you might create a script to separate text
from pictures to analyze them differently.
4. Statistical Analysis:
 R: A statistical programming language used to find connections and
correlations in data. It also has built-in data visualization features to create
reports with diagrams.
 Example: Analyzing Twitter data to see if there's a connection between
positive feedback and the time of day.

Simplified Explanation:

Think of data science tools like the brushes and pickaxes of archaeologists. They help you
dig through data to find valuable insights. The focus should be on the scientific method—
asking questions, conducting experiments, and analyzing results—rather than just the
tools themselves.
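
To make the scrubbing step concrete, here is a minimal Python sketch; it assumes tweets
arrive as dictionaries with hypothetical "text" and "media" fields, and the data is invented
for illustration:

tweets = [
    {"text": "Loved the new trail shoes!", "media": []},
    {"text": "Race day photo", "media": ["photo1.jpg"]},
]

# Separate text-only tweets from tweets with pictures, and normalize the text.
text_only = [t["text"].strip().lower() for t in tweets if not t["media"]]
with_media = [t for t in tweets if t["media"]]

print(len(text_only), "text-only tweets;", len(with_media), "tweets with pictures")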

Uncover insights and create knowledge


Sure, let's break down the key points from the video "Uncover insights and create
knowledge":

Key Concepts:

1. Exploratory Nature of Data Science:


 Unlike traditional business processes that focus on efficiency and
achieving specific objectives, data science is exploratory. It uses the
scientific method to gain useful business knowledge.
 Example: Instead of asking "How can we work faster?" data science asks
"What do we know about our customers?" or "How can we deliver a better
product?"
2. Asking the Right Questions:
 Data science requires asking higher-level, often skeptical questions to gain
deeper insights. These questions might seem annoying in a typical
business setting but are crucial for building organizational knowledge.
 Example: Questions like "Why are we doing it this way?" or "What makes
you think this will work?" are essential for uncovering new opportunities
and improving processes.
3. Operational vs. Scientific Focus:
 Many organizations initially focus on the technical side of data, such as
collecting and storing it. However, the real value comes from the scientific
approach—asking interesting questions and running experiments.
 Example: A website collecting data on customer interactions might start
by gathering data, but the real insights come from experiments like
changing the color of car images to see which gets more clicks.
4. Empirical Research:
 Data scientists should constantly run experiments, ask questions, and
produce well-designed reports to gain insights.
 Example: Running an experiment to see if fewer cars on a webpage
increase the likelihood of customer clicks, then analyzing the results to
inform business decisions.

Simplified Explanation: Think of data science as a way to explore and discover new
knowledge about your business. It's like being a detective—asking questions, running
experiments, and analyzing data to uncover hidden insights. This approach helps
organizations make better decisions and stay competitive.
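
As a rough illustration of the car-image experiment above, a few lines of Python can
compare click-through rates for two page variants; the counts below are made up, not
real results:

variants = {
    "red_car_image": {"views": 1200, "clicks": 96},
    "blue_car_image": {"views": 1180, "clicks": 71},
}

# Compute the click-through rate for each experimental variant.
for name, counts in variants.items():
    rate = counts["clicks"] / counts["views"]
    print(f"{name}: {rate:.1%} click-through rate")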

Make connections with relational databases


Let's break down the key points from the video "Make connections with relational
databases":
Key Concepts:

1. Origins of Modern Databases:


 Historical Context: Modern databases have roots in the Apollo Space
Mission of the late 1960s. NASA and IBM developed the Information
Management System (IMS) to handle the massive amounts of data
required for the mission.
 Early Databases: The early databases were like large spreadsheets with
columns and rows, but managing millions of rows was challenging.
2. Relational Databases:
 Development: In the mid-1970s, IBM developed Structured Query
Language (SQL) to help users pull data from these large systems. Around
the same time, relational databases were created.
 Structure: Relational databases divide data into smaller, related tables
instead of one massive table. For example, instead of one table with a
million parts, you might have 50 tables with 20,000 parts each.
3. Schemas and Design:
 Schemas: Engineers create schemas, or maps, to show how tables relate to
each other. Designing these schemas requires understanding the data and
anticipating future changes.
 Challenges: Designing relational databases requires a lot of upfront
planning. If the initial design is wrong, it can be difficult to redesign the
database later.
4. SQL and RDBMS:
 SQL: SQL is a powerful language that can pull data from multiple tables
and present it in a virtual table called a view. It's still one of the most
widely used query languages today.
 RDBMS: Relational Database Management Systems (RDBMS) like those
from IBM, Microsoft, and Oracle have added functionality over the years,
making them robust tools for managing relational databases.

Simplified Explanation: Think of a relational database like a library. Instead of having
one giant book with all the information, the library has many smaller books (tables)
organized by different topics. These books are related to each other through a catalog
(schema) that helps you find the information you need. SQL is like the librarian who helps
you pull information from different books and present it in a way that's easy to
understand.
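
To see tables, a schema relationship, and a SQL view in one place, here is a small
hypothetical sketch using Python's built-in sqlite3 module; the table and column names
are invented for illustration:

import sqlite3

con = sqlite3.connect(":memory:")

# Two related tables instead of one giant one; orders.customer_id points at customers.id.
con.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
con.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, part TEXT)")
con.execute("INSERT INTO customers VALUES (1, 'Ada')")
con.execute("INSERT INTO orders VALUES (1, 1, 'running shoe')")

# SQL joins the two tables and presents the result as a virtual table (a view).
con.execute("CREATE VIEW customer_orders AS "
            "SELECT c.name, o.part FROM customers c JOIN orders o ON o.customer_id = c.id")

for row in con.execute("SELECT * FROM customer_orders"):
    print(row)  # ('Ada', 'running shoe')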

Get data into warehouses using ETL


Sure, let's break down the key points from the video "Get data into warehouses using
ETL":
Key Concepts:

1. Relational Databases vs. Data Warehouses:


 Relational Databases (OLTP): These are optimized for real-time
transactions. For example, when a customer buys a shoe online, the
database quickly joins their shipping address with the shoe details to
process the order.
 Data Warehouses (OLAP): These are optimized for analyzing historical
data. For instance, you might analyze past sales to see if there's a trend in
shoe purchases based on customer location.
2. ETL Process:
 Extract: Pulling data from various sources, like different websites or
databases.
 Transform: Cleaning and converting the data into a format suitable for
the data warehouse. This might involve changing the data structure to
match the warehouse's schema.
 Load: Importing the transformed data into the data warehouse for
analysis.
3. Practical Example:
 Imagine your website sells running shoes and is bought by a larger
company that also sells sports clothing. The company will use ETL to
combine data from your website with their other websites. This helps them
analyze all their sales data together.
4. ETL in Data Science:
 Common Terminology: Terms like "ETL the data" mean transforming data
to fit into a new system, like a Hadoop cluster.
 Hadoop vs. Data Warehouses: Some companies are moving from
traditional data warehouses to Hadoop clusters to save costs, as Hadoop
can store data on cheaper hardware.

Simplified Explanation: Think of ETL as a process of moving and cleaning data to make
it useful for analysis. It's like taking ingredients from different stores (Extract), cleaning
and preparing them in your kitchen (Transform), and then cooking a meal (Load) that
you can analyze for nutritional value.
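
Here is a minimal Extract, Transform, Load sketch in Python; the source columns and the
warehouse table are invented stand-ins for the systems described above:

import csv
import io
import sqlite3

# Extract: pull rows from a source system (here, an inline CSV export).
source_csv = "ProductCode,Date,Price\nRUN-42,2017-01-03,89.99\nRUN-17,2017-01-04,64.50\n"
rows = list(csv.DictReader(io.StringIO(source_csv)))

# Transform: rename fields and convert types to match the warehouse schema.
cleaned = [(r["ProductCode"], r["Date"], float(r["Price"])) for r in rows]

# Load: insert the transformed rows into the warehouse table.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales (sku TEXT, sold_on TEXT, amount REAL)")
warehouse.executemany("INSERT INTO sales VALUES (?, ?, ?)", cleaned)
warehouse.commit()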

Let go of the past with NoSQL


Sure, let's break down the key points from the video "Let go of the past with NoSQL":
Key Concepts:

1. Relational Databases:
 Structure: Relational databases use a schema, meaning you need to know
the structure of your data (like tables and relationships) before storing it.
 Example: For a website selling shoes, you might have separate tables for
shoes, customers, addresses, and shipping. Each transaction involves
multiple tables, which can slow down performance.
2. NoSQL Databases:
 Flexibility: NoSQL databases are non-relational and schemaless, meaning
you don't need to predefine the structure. This makes them more flexible
and easier to change.
 Example: Instead of splitting data into multiple tables, you store
everything related to a transaction (shoe, customer, address, shipping) in a
single record.
3. Advantages of NoSQL:
 Performance: NoSQL databases can handle large amounts of data more
efficiently, especially for big websites and applications.
 Scalability: They are cluster-friendly, meaning you can distribute data
across many servers, making it easier to manage large datasets.
 Adaptability: Adding new fields or data types is simpler since there's no
rigid schema.
4. Real-World Application:
 Example: If your shoe website is bought by a larger company, integrating
new features like a frequent buyer program is easier with NoSQL. You can
add new fields without redesigning the entire database.

Simplified Explanation:
Think of a relational database like a well-organized library where you need to know
exactly where each book (data) goes. In contrast, a NoSQL database is like a flexible
storage room where you can quickly add new items without worrying about strict
organization.
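
As a rough sketch of the "everything in one record" idea, the Python snippet below builds
a single schemaless document for an order and then adds a hypothetical frequent-buyer
field without any redesign; all field names are invented:

import json

# One document holds the shoe, customer, and shipping details for a transaction.
order = {
    "shoe": {"model": "Trail Runner", "size": 10},
    "customer": {"name": "Ada"},
    "shipping": {"address": "12 Hill Rd", "method": "ground"},
}

# A new feature (frequent-buyer points) needs no table redesign; the field
# simply appears on new documents.
order["frequent_buyer_points"] = 150

print(json.dumps(order, indent=2))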

Address big data problems


Let's break down the key points from the video "Address big data problems":

Key Concepts:

1. Big Data vs. Data Science:


 Big Data: Refers to data sets that are too large to be handled by
traditional hardware and software.
 Data Science: Uses the scientific method to analyze data, regardless of its
size.
2. The Four Vs of Big Data:
 Volume: Do you have a very high amount of data? (e.g., petabytes of data)
 Variety: Is your data diverse? (e.g., text, images, videos)
 Velocity: Is your data coming in quickly? (e.g., real-time data like stock
prices)
 Veracity: Is your data reliable and accurate?
3. Identifying Big Data Problems:

 Volume: If you're collecting petabytes of data daily, you likely have a big
data problem.
 Variety: Having different types of data (text, images, videos) indicates a
big data problem.
 Velocity: High-speed data inflow, like real-time updates, suggests a big
data problem.
 Veracity: Ensuring data accuracy and reliability is crucial for meaningful
insights.
4. Practical Example:
 Self-Driving Cars: They collect massive amounts of data (video, audio,
GPS) in real-time to make decisions, which is a classic big data problem.

Simplified Explanation:
Big data is like having an overwhelming amount of information coming in from various
sources at high speeds. To determine if you have a big data problem, check if your data
meets the Four Vs: Volume, Variety, Velocity, and Veracity.

Keep things simple with structured data


Sure, let's break down the key points from the video "Keep things simple with structured
data":
Key Concepts:

1. Structured Data:
 Definition: Structured data follows a specific format and order, like a
spreadsheet where each column has a defined type (e.g., dates, numbers).
 Example: Imagine a spreadsheet with a column for "Purchase Date." Each
entry must follow a specific format (e.g., MM/DD/YYYY).
2. Data Models and Schemas:
 Data Model: Defines the structure of individual fields (e.g., a field for
dates, another for text).
 Schema: Describes the entire structure of the database, including tables
and relationships.
3. Importance of Structure:
 Consistency: Ensures data is entered in a consistent format, making it
easier to sort, filter, and analyze.
 Error Prevention: Prevents invalid data entries (e.g., entering "Tuesday" in
a date field).
4. Relational Databases:
 Optimization: Relational databases are optimized for structured data,
making them efficient for tasks like generating reports from consistent
data sets.

Simplified Explanation:
Think of structured data like a well-organized filing cabinet. Each drawer (column) is
labeled and contains specific types of documents (data). This organization makes it easy
to find and use the information later.
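
To illustrate the consistency point, here is a small Python sketch that accepts only values
matching a defined MM/DD/YYYY date format and rejects entries like "Tuesday"; the
function name is an invented example:

from datetime import datetime

def valid_purchase_date(value):
    # The column's data model says dates must look like MM/DD/YYYY.
    try:
        datetime.strptime(value, "%m/%d/%Y")
        return True
    except ValueError:
        return False

print(valid_purchase_date("03/14/2017"))  # True
print(valid_purchase_date("Tuesday"))     # False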

Share semistructured data


Let's break down the key points from the video "Share semistructured data":

Key Concepts:

1. Structured Data:
 Definition: Data that fits neatly into a predefined schema, like a
spreadsheet with fixed columns and rows.
 Example: A table with columns for "ZIPCode" and "PostalCode."
2. Semistructured Data:
 Definition: Data that has some structure but doesn't fit neatly into a rigid
schema. It includes tags or markers to separate data elements.
 Example: Email data where you have consistent fields like sender and
recipient, but the content varies.
3. Challenges with Semistructured Data:
 Schema Differences: Different systems might use different names for the
same data fields (e.g., "ZIPCode" vs. "PostalCode").
 Integration: Combining semistructured data from different sources can be
challenging because of these schema differences.
4. Common Formats:
 XML: An older format used for exchanging semistructured data.
 JSON: A more modern format often used for web services, making it
easier to exchange data between different systems.
5. Practical Example:
 Scenario: Your shoe website needs to integrate shipping data from a
carrier. Your database uses "ZIPCode" while the carrier uses "PostalCode."
 Solution: You need to map these fields correctly to exchange data
seamlessly.
6. Benefits of Semistructured Data:
 Flexibility: Easier to adapt and integrate with different systems.
 Richness: Allows for more detailed and varied data to be included, like
customer feedback from social media.

Simplified Explanation:
Think of semistructured data like a recipe book where each recipe has a consistent
structure (ingredients, steps) but the content varies. You can easily add new recipes
without needing a strict format.
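
A minimal Python sketch of the ZIPCode/PostalCode mapping described above; the JSON
payload and field map are invented for illustration:

import json

# The carrier's JSON uses "PostalCode", but our database expects "ZIPCode".
carrier_json = '{"recipient": "Ada", "PostalCode": "30301"}'
record = json.loads(carrier_json)

# Map the carrier's field names onto our own before loading the data.
field_map = {"PostalCode": "ZIPCode"}
mapped = {field_map.get(key, key): value for key, value in record.items()}

print(mapped)  # {'recipient': 'Ada', 'ZIPCode': '30301'}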

Collect unstructured data


Sure, let's break down the key points from the video "Collect unstructured data"
Key Concepts:

1. Unstructured Data:
 Definition: Data that doesn't have a predefined format or structure.
Examples include emails, social media posts, videos, and images.
 Example: Think about the variety of content you see when you search for
"cats" online—videos, images, articles, etc. All of this is unstructured data.
2. Challenges:
 Schemaless: Unlike structured data, unstructured data doesn't follow a
consistent format. For instance, a Microsoft Word document and a PDF
have different structures.
 Data Model: There's no consistent place to look for specific information
(e.g., document title) across different file types.
3. Handling Unstructured Data:
 NoSQL Databases: These databases can store large files like audio, video,
and text without requiring a predefined schema.
 Big Data Tools: Technologies like Hadoop and Apache Spark help process
and analyze large volumes of unstructured data.
4. Practical Application:
 Customer Insights: For a business, unstructured data can provide a 360-
degree view of customers. For example, analyzing social media posts to
understand customer preferences and behaviors.

Simplified Explanation:
Think of unstructured data like a messy room where items are scattered everywhere.
Unlike a neatly organized room (structured data), you need special tools to find and
make sense of everything in the messy room.

Sift through big garbage

Let's highlight the key takeaways from the video "Sift through big garbage":
Key Takeaways:

 Data Retention Dilemma:


 Keep Everything: Some argue it's cheaper and easier to store all data, as
storage costs are low.
 Delete Some Data: Others argue that too much data (or "data noise")
makes it harder to find valuable insights.
 Team Decision:
 It's crucial for your data science team to decide early on a data retention
policy. Consistency in this policy helps avoid data corruption and ensures
meaningful analysis.
 Practical Example:
 A company dealing with car buyer data faced challenges with obsolete
tags and data noise. They had to decide whether to keep all data or clean
up the obsolete parts.

By understanding these points, you can better manage your data and make informed
decisions about what to keep and what to discard.

Start out with descriptive statistics


Sure, let's break down the key points from the video "Start out with descriptive
statistics":
Key Concepts:

1. Descriptive Statistics:
 Definition: Tools used to summarize or describe a set of data. They help
tell a story about the data without going into every detail.
2. Mean (Average):
 Definition: The sum of all values divided by the number of values.
 Example: If you add up the incomes of all families and divide by the
number of families, you get the mean income.
3. Median:
 Definition: The middle value in a list of numbers sorted from smallest to
largest.
 Example: If you list all family incomes from lowest to highest, the median
is the income of the family in the middle.
4. Storytelling with Statistics:
 Example: One politician might say the average salary has increased by
$5,000, while another might say the median salary has decreased by
$10,000. Both can be true because they are using different statistics to tell
different stories.
5. Skewed Data:
 Definition: When there's a big difference between the mean and median,
it indicates that the data might be skewed by extreme values.
 Example: If a few families are extremely wealthy, their high incomes can
raise the mean but not affect the median much.

Simplified Explanation:
Think of descriptive statistics like different ways to summarize a story. The mean gives
you an overall average, while the median tells you what the middle looks like. Both are
useful, but they can tell different stories depending on the data.
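
To see how the mean and median can tell different stories, here is a short Python sketch
using made-up family incomes:

from statistics import mean, median

# One very wealthy family pulls the mean up but barely moves the median.
incomes = [32_000, 35_000, 38_000, 41_000, 45_000, 2_000_000]

print(mean(incomes))    # about 365167, raised by the single extreme value
print(median(incomes))  # 39500.0, still reflects the family in the middle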

Understand probability
Let's break down the key points from the video "Understand probability":
Key Concepts:

1. Probability Basics:
 Definition: Probability measures the likelihood that a specific event will
occur. It's expressed as a percentage or a fraction.
 Example: Flipping a coin has a 50% probability of landing on heads.
2. Probability Distribution:
 Definition: A mathematical function that provides the probabilities of
occurrence of different possible outcomes.
 Example: Rolling a six-sided die has six possible outcomes, each with a
probability of 1/6 (or about 17%).
3. Sequence of Events:
 Definition: The probability of multiple events occurring in sequence is the
product of their individual probabilities.
 Example: Rolling a specific number twice in a row on a die is 1/6 * 1/6 =
1/36 (or about 3%).
4. Practical Application:
 Example: A biotech company uses probability to predict participation in
clinical trials. Factors like fasting before the trial or fear of needles can
decrease participation likelihood.
5. Balancing Accuracy and Participation:
 Scenario: The company must decide between a more accurate blood test
(with fewer participants) and a less accurate saliva test (with more
participants). They use probability to weigh the trade-offs.
6. Unexpected Insights:
 Key Point: Probability can lead to surprising conclusions, such as
preferring a less accurate test to maximize participation and data points.

Simplified Explanation:
Think of probability like predicting the weather. If there's a 70% chance of rain, you
know it's more likely to rain than not. Similarly, in data science, probability helps predict
outcomes based on data.
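
A small Python sketch of the sequence rule above, using exact fractions for the die
example from the notes:

from fractions import Fraction

p_one_roll = Fraction(1, 6)       # probability of one specific number on a fair die
p_two_in_a_row = p_one_roll ** 2  # independent events multiply: 1/36, about 3%

print(p_two_in_a_row, float(p_two_in_a_row))  # 1/36 0.0277...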

Find a correlation
Sure, let's break down the key points from the video "Find a correlation":
Key Concepts:

1. Correlation:
 Definition: Correlation measures the relationship between two variables. It
tells you how one variable changes when the other one does.
 Scale: Correlation is measured on a scale from -1 to 1.
 1: Perfect positive correlation (as one variable increases, the other
also increases).
 0: No correlation (no relationship between the variables).
 -1: Perfect negative correlation (as one variable increases, the other
decreases).
2. Positive Correlation:
 Example: Height and weight. Generally, taller people tend to weigh more.
As height increases, weight also increases.
3. Negative Correlation:
 Example: Car weight and fuel efficiency. Heavier cars usually get fewer
miles per gallon. As car weight increases, fuel efficiency decreases.
4. Real-World Applications:
 Recommendation Systems: Companies like Netflix and Amazon use
correlation to recommend movies or products based on your past
behavior.
 LinkedIn: The "People You May Know" feature uses correlation to suggest
connections based on shared jobs, schools, or interests.
5. Correlation Coefficient:
 Definition: A numerical value that represents the strength and direction of
the correlation.
 Example: A correlation coefficient of 0.5 indicates a moderate positive
relationship, while -0.75 indicates a strong negative relationship.

Simplified Explanation:
Think of correlation like a friendship. If two friends (variables) always do things together
(positive correlation), they have a strong positive relationship. If they always do the
opposite (negative correlation), they have a strong negative relationship. If they don't
influence each other at all, there's no correlation.
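
As a rough illustration, the Python snippet below computes a correlation coefficient for
invented height and weight values (statistics.correlation requires Python 3.10 or later):

from statistics import correlation

heights = [150, 160, 165, 172, 180, 188]  # cm
weights = [52, 58, 63, 70, 80, 92]        # kg

# Pearson's r is close to 1 here: taller people in this made-up sample weigh more.
r = correlation(heights, weights)
print(round(r, 2))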

See how correlation does not imply causation


Let's break down the key points from the video "See how correlation does not imply
causation":
Key Concepts:

1. Correlation vs. Causation:


 Correlation: Indicates a relationship between two variables. For example,
ice cream sales and temperature are correlated because both tend to
increase together.
 Causation: Indicates that one variable directly affects the other. For
example, turning on a light switch causes the light to turn on.
2. Correlation Doesn't Imply Causation:
 Just because two things are correlated doesn't mean one causes the other.
There could be a third factor influencing both.
 Example: Living in a retirement community is strongly correlated with
hospital visits. That doesn't mean the community causes the visits; the real
driver is the residents' higher median age.
3. Spurious Correlation:
 Definition: A false relationship where two variables appear to be related
but are actually influenced by a third factor.
 Example: Increased sales of running shoes in January might be correlated
with New Year's resolutions rather than people having more money.
4. Scientific Method:
 To avoid false conclusions, follow the scientific method: ask good
questions, form hypotheses, and test them rigorously.
 Example: The data science team initially thought January shoe sales were
due to people having more money. After further analysis, they found it was
due to New Year's resolutions.

Simplified Explanation:
Think of correlation as two things happening together, like ice cream sales rising on hot
days. The correlation alone doesn't prove the heat causes the sales; you still have to test
the explanation (people buying ice cream to cool down) before treating it as the cause.

Comb techniques for predictive analytics


Sure, let's break down the key points from the video "Comb techniques for predictive
analytics":
Key Concepts:

1. Predictive Analytics:
 Definition: Uses historical data to predict future events. It's a subset of
data science.
 Example: Weather forecasting uses past weather data to predict future
conditions.
2. Difference from Data Science:
 Data Science: Applies the scientific method to data to uncover insights.
 Predictive Analytics: Takes these insights and makes actionable
predictions.
3. Practical Example:
 Weather Forecasting: Meteorologists use historical data and correlations
(like low pressure leading to storms) to predict future weather.
 Business Application: Imagine your team analyzes millions of Tweets
about running. By identifying influential runners, you can send them
promotions to boost your brand.
4. Importance of Data Quality:
 Key Point: The accuracy of predictions depends on the quality of the data
and the thoroughness of the analysis. Ensure your team understands the
past data well to make accurate future predictions.

Simplified Explanation:
Think of predictive analytics like using past experiences to make future decisions. For
example, if you know it usually rains when the sky is cloudy, you might predict rain and
carry an umbrella. In business, this means using past data to forecast trends and make
informed decisions.
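
As a rough sketch of using historical data to make a prediction, the snippet below fits a
straight line to invented monthly figures and forecasts from it (statistics.linear_regression
requires Python 3.10 or later); it illustrates the idea, not the exact method from the video:

from statistics import linear_regression

# Made-up history: monthly Tweet mentions of running vs. shoes sold.
mentions = [120, 150, 180, 210, 260]
sales = [400, 470, 540, 610, 720]

# Fit a line to the past, then predict sales if mentions climb to 300.
slope, intercept = linear_regression(mentions, sales)
predicted_sales = slope * 300 + intercept
print(round(predicted_sales))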
