BA Data Science Foundations
Here are the key takeaways from the video "Define a multidisciplinary practice with
multiple meanings":
Definition of Data Scientist: The term "data scientist" is broad and not yet fully
standardized. It encompasses professionals from various fields like statistics, data
analysis, mathematics, systems engineering, and even business and finance.
Multidisciplinary Nature: Data science is still evolving as a discipline. It involves
a mix of different fields and practices, similar to early archaeology before it
became formalized.
Empirical Approach: A key aspect of data science is using an empirical
approach—asking questions, conducting experiments, and making adjustments
based on data to gain insights.
These points emphasize the evolving nature of data science and the importance of a
scientific method in the field.
1. Categories of Tools:
Storing Data: Spreadsheets, databases, and key-value stores are used to
store large amounts of data. Examples include Hadoop (a distributed file
system), Cassandra (a NoSQL store), and PostgreSQL (a relational database).
Scrubbing Data: This involves cleaning and preparing data for analysis.
Tools include text editors, scripting tools, and programming languages like
Python.
Analyzing Data: Statistical packages such as R, SPSS, and Python's data
libraries help analyze data and create visualizations.
2. Big Data:
Definition: Big data refers to data sets so large that they can't fit into
traditional database management systems.
Hadoop: Open-source software that uses a distributed file system to
store data across multiple servers (a Hadoop cluster). It processes data
using tools like MapReduce (batch processing) and Apache Spark
(real-time processing).
3. Data Scrubbing:
Importance: Data scientists spend a significant amount of time (up to
90%) cleaning data to make it usable.
Example: If collecting Tweets, you might create a script to separate text
from pictures to analyze them differently.
4. Statistical Analysis:
R: A statistical programming language used to find connections and
correlations in data. It also has built-in data visualization features to create
reports with diagrams.
Example: Analyzing Twitter data to see if there's a connection between
positive feedback and the time of day.
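The scrubbing example in item 3 can be sketched as a short script. This is only an illustration under stated assumptions: the Tweet records are made up, and the field names `text` and `media_url` are hypothetical, not a real Twitter schema.

```python
# Hypothetical tweet records; the field names "text" and "media_url"
# are assumptions for illustration, not a real Twitter API schema.
raw_tweets = [
    {"text": "  Loving my new running shoes!  ", "media_url": None},
    {"text": "", "media_url": "https://example.com/photo1.jpg"},
    {"text": "Morning run done", "media_url": "https://example.com/photo2.jpg"},
]


def scrub_tweets(tweets):
    """Separate tweet text from pictures so each can be analyzed differently."""
    texts, pictures = [], []
    for tweet in tweets:
        text = (tweet.get("text") or "").strip()
        if text:  # skip tweets whose text is empty after trimming
            texts.append(text)
        media = tweet.get("media_url")
        if media:  # keep only tweets that actually carry a picture
            pictures.append(media)
    return texts, pictures


texts, pictures = scrub_tweets(raw_tweets)
print(texts)
print(pictures)
```

The cleaned text list could then go to a statistical tool, while the picture links are handled separately.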
Simplified Explanation:
Think of data science tools like the brushes and pickaxes of archaeologists. They help you
dig through data to find valuable insights. The focus should be on the scientific method—
asking questions, conducting experiments, and analyzing results—rather than just the
tools themselves.
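The MapReduce batch-processing model mentioned under Big Data can be illustrated on a single machine. The sketch below is not Hadoop's API; it only mimics the map, shuffle, and reduce steps in plain Python, and the function names and sample strings are assumptions made for illustration.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key (the word)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: sum the values for each key to get a count per word."""
    return {key: sum(values) for key, values in grouped.items()}


# Each string stands in for a chunk of data stored on a different server.
chunks = ["big data big insights", "data science"]
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
word_counts = reduce_phase(shuffle_phase(mapped))
print(word_counts)
```

In a real cluster the map step would run on each server in parallel and only the grouped results would travel over the network.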
Key Concepts:
Simplified Explanation: Think of data science as a way to explore and discover new
knowledge about your business. It's like being a detective—asking questions, running
experiments, and analyzing data to uncover hidden insights. This approach helps
organizations make better decisions and stay competitive.
Simplified Explanation: Think of a relational database like a library of books
organized by different topics. These books are related to each other through a catalog
(schema) that helps you find the information you need. SQL is like the librarian who helps
you pull information from different books and present it in a way that's easy to
understand.
Simplified Explanation: Think of ETL as a process of moving and cleaning data to make
it useful for analysis. It's like taking ingredients from different stores (Extract), cleaning
and preparing them in your kitchen (Transform), and then cooking a meal (Load) that
you can analyze for nutritional value.
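The three ETL steps can be sketched as three small functions. This is a minimal example under stated assumptions: the two source lists, the `sales` table, and all field names are hypothetical, and an in-memory SQLite database stands in for a real data warehouse.

```python
import sqlite3

# Extract: two hypothetical sources with inconsistent formatting.
store_a = [{"customer": " Alice ", "amount": "19.99"}]
store_b = [{"customer": "BOB", "amount": "5"}, {"customer": "", "amount": "7.50"}]


def extract():
    """Gather raw rows from every source."""
    return store_a + store_b


def transform(rows):
    """Clean names, convert amounts to numbers, drop rows with no customer."""
    cleaned = []
    for row in rows:
        name = row["customer"].strip().title()
        if not name:
            continue
        cleaned.append((name, float(row["amount"])))
    return cleaned


def load(rows, connection):
    """Write the cleaned rows into a table that is ready for analysis."""
    connection.execute("CREATE TABLE IF NOT EXISTS sales (customer TEXT, amount REAL)")
    connection.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    connection.commit()


connection = sqlite3.connect(":memory:")
load(transform(extract()), connection)
loaded = connection.execute(
    "SELECT customer, amount FROM sales ORDER BY customer"
).fetchall()
print(loaded)
```

Keeping the steps separate makes it easy to change one source or one cleaning rule without touching the rest.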
1. Relational Databases:
Structure: Relational databases use a schema, meaning you need to know
the structure of your data (like tables and relationships) before storing it.
Example: For a website selling shoes, you might have separate tables for
shoes, customers, addresses, and shipping. Each transaction involves
multiple tables, which can slow down performance.
2. NoSQL Databases:
Flexibility: NoSQL databases are non-relational and schemaless, meaning
you don't need to predefine the structure. This makes them more flexible
and easier to change.
Example: Instead of splitting data into multiple tables, you store
everything related to a transaction (shoe, customer, address, shipping) in a
single record.
3. Advantages of NoSQL:
Performance: NoSQL databases can handle large amounts of data more
efficiently, especially for big websites and applications.
Scalability: They are cluster-friendly, meaning you can distribute data
across many servers, making it easier to manage large datasets.
Adaptability: Adding new fields or data types is simpler since there's no
rigid schema.
4. Real-World Application:
Example: If your shoe website is bought by a larger company, integrating
new features like a frequent buyer program is easier with NoSQL. You can
add new fields without redesigning the entire database.
Simplified Explanation:
Think of a relational database like a well-organized library where you need to know
exactly where each book (data) goes. In contrast, a NoSQL database is like a flexible
storage room where you can quickly add new items without worrying about strict
organization.
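The shoe-website contrast can be sketched side by side. The table names, field names, and values below are hypothetical; SQLite stands in for a relational database, and a plain Python dictionary stands in for a NoSQL record (no real NoSQL product is used here).

```python
import json
import sqlite3

# Relational approach: the schema is declared before any data is stored.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE shoes (id INTEGER PRIMARY KEY, model TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id),
        shoe_id INTEGER REFERENCES shoes(id)
    );
    INSERT INTO customers VALUES (1, 'Alice');
    INSERT INTO shoes VALUES (1, 'Trail Runner');
    INSERT INTO orders VALUES (1, 1, 1);
""")

# Reading one transaction means joining several tables.
joined = db.execute("""
    SELECT customers.name, shoes.model
    FROM orders
    JOIN customers ON customers.id = orders.customer_id
    JOIN shoes ON shoes.id = orders.shoe_id
""").fetchone()
print(joined)

# Document-style approach: everything about the transaction in one record.
order_document = {
    "customer": {"name": "Alice"},
    "shoe": {"model": "Trail Runner"},
}
# Adding a new field later needs no schema change.
order_document["frequent_buyer_points"] = 120
print(json.dumps(order_document, indent=2))
```

The relational version would need a schema change to hold the frequent buyer points, while the document version simply gains a field.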
Key Concepts:
Volume: If you're collecting petabytes of data daily, you likely have a big
data problem.
Variety: Having different types of data (text, images, videos) indicates a
big data problem.
Velocity: High-speed data inflow, like real-time updates, suggests a big
data problem.
Veracity: Ensuring data accuracy and reliability is crucial for meaningful
insights.
4. Practical Example:
Self-Driving Cars: They collect massive amounts of data (video, audio,
GPS) in real-time to make decisions, which is a classic big data problem.
Simplified Explanation:
Big data is like having an overwhelming amount of information coming in from various
sources at high speeds. To determine if you have a big data problem, check if your data
meets the Four Vs: Volume, Variety, Velocity, and Veracity.
1. Structured Data:
Definition: Structured data follows a specific format and order, like a
spreadsheet where each column has a defined type (e.g., dates, numbers).
Example: Imagine a spreadsheet with a column for "Purchase Date." Each
entry must follow a specific format (e.g., MM/DD/YYYY).
2. Data Models and Schemas:
Data Model: Defines the structure of individual fields (e.g., a field for
dates, another for text).
Schema: Describes the entire structure of the database, including tables
and relationships.
3. Importance of Structure:
Consistency: Ensures data is entered in a consistent format, making it
easier to sort, filter, and analyze.
Error Prevention: Prevents invalid data entries (e.g., entering "Tuesday" in
a date field).
4. Relational Databases:
Optimization: Relational databases are optimized for structured data,
making them efficient for tasks like generating reports from consistent
data sets.
Simplified Explanation:
Think of structured data like a well-organized filing cabinet. Each drawer (column) is
labeled and contains specific types of documents (data). This organization makes it easy
to find and use the information later.
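The "Purchase Date" rule and the "Tuesday" error case can be sketched with Python's standard library. The function name `parse_purchase_date` is an assumption made for illustration; a real database would enforce the rule through the column type.

```python
from datetime import datetime


def parse_purchase_date(value):
    """Return a date if the value matches MM/DD/YYYY, otherwise None."""
    try:
        return datetime.strptime(value, "%m/%d/%Y").date()
    except ValueError:
        return None


valid = parse_purchase_date("03/15/2024")
invalid = parse_purchase_date("Tuesday")
print(valid, invalid)
```

Rejecting the bad value at entry time is what keeps later sorting and filtering reliable.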
Key Concepts:
1. Structured Data:
Definition: Data that fits neatly into a predefined schema, like a
spreadsheet with fixed columns and rows.
Example: A table with columns for "ZIPCode" and "PostalCode."
2. Semistructured Data:
Definition: Data that has some structure but doesn't fit neatly into a rigid
schema. It includes tags or markers to separate data elements.
Example: Email data where you have consistent fields like sender and
recipient, but the content varies.
3. Challenges with Semistructured Data:
Schema Differences: Different systems might use different names for the
same data fields (e.g., "ZIPCode" vs. "PostalCode").
Integration: Combining semistructured data from different sources can be
challenging because of these schema differences.
4. Common Formats:
XML: An older format used for exchanging semistructured data.
JSON: A more modern format often used for web services, making it
easier to exchange data between different systems.
5. Practical Example:
Scenario: Your shoe website needs to integrate shipping data from a
carrier. Your database uses "ZIPCode" while the carrier uses "PostalCode."
Solution: You need to map these fields correctly to exchange data
seamlessly.
6. Benefits of Semistructured Data:
Flexibility: Easier to adapt and integrate with different systems.
Richness: Allows for more detailed and varied data to be included, like
customer feedback from social media.
Simplified Explanation:
Think of semistructured data like a recipe book where each recipe has a consistent
structure (ingredients, steps) but the content varies. You can easily add new recipes
without needing a strict format.
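The "ZIPCode" versus "PostalCode" mapping from the practical example can be sketched as follows. The carrier payload is made up, and the `TrackingNumber` field, `FIELD_MAP`, and `to_internal` are hypothetical names used only for illustration.

```python
import json

# Hypothetical carrier payload; every field except "PostalCode" is an assumption.
carrier_json = '{"TrackingNumber": "ABC123", "PostalCode": "30301"}'

# Map the carrier's field names onto the names our own database uses.
FIELD_MAP = {"PostalCode": "ZIPCode"}


def to_internal(record):
    """Rename mapped fields and pass every other field through unchanged."""
    return {FIELD_MAP.get(key, key): value for key, value in record.items()}


shipping_record = to_internal(json.loads(carrier_json))
print(shipping_record)
```

Keeping the mapping in one dictionary means a new carrier with different field names only needs a new mapping, not new code.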
1. Unstructured Data:
Definition: Data that doesn't have a predefined format or structure.
Examples include emails, social media posts, videos, and images.
Example: Think about the variety of content you see when you search for
"cats" online—videos, images, articles, etc. All of this is unstructured data.
2. Challenges:
Schemaless: Unlike structured data, unstructured data doesn't follow a
consistent format. For instance, a Microsoft Word document and a PDF
have different structures.
Data Model: There's no consistent place to look for specific information
(e.g., document title) across different file types.
3. Handling Unstructured Data:
NoSQL Databases: These databases can store large files like audio, video,
and text without requiring a predefined schema.
Big Data Tools: Technologies like Hadoop and Apache Spark help process
and analyze large volumes of unstructured data.
4. Practical Application:
Customer Insights: For a business, unstructured data can provide a
360-degree view of customers. For example, analyzing social media posts
can reveal customer preferences and behaviors.
Simplified Explanation:
Think of unstructured data like a messy room where items are scattered everywhere.
Unlike a neatly organized room (structured data), you need special tools to find and
make sense of everything in the messy room.
Sift through big garbage
Let's highlight the key takeaways from the video "Sift through big garbage":
Key Takeaways:
By understanding these points, you can better manage your data and make informed
decisions about what to keep and what to discard.
1. Descriptive Statistics:
Definition: Tools used to summarize or describe a set of data. They help
tell a story about the data without going into every detail.
2. Mean (Average):
Definition: The sum of all values divided by the number of values.
Example: If you add up the incomes of all families and divide by the
number of families, you get the mean income.
3. Median:
Definition: The middle value in a list of numbers sorted from smallest to
largest.
Example: If you list all family incomes from lowest to highest, the median
is the income of the family in the middle.
4. Storytelling with Statistics:
Example: One politician might say the average salary has increased by
$5,000, while another might say the median salary has decreased by
$10,000. Both can be true because they are using different statistics to tell
different stories.
5. Skewed Data:
Definition: When there's a big difference between the mean and median,
it indicates that the data might be skewed by extreme values.
Example: If a few families are extremely wealthy, their high incomes can
raise the mean but not affect the median much.
Simplified Explanation:
Think of descriptive statistics like different ways to summarize a story. The mean gives
you an overall average, while the median tells you what the middle looks like. Both are
useful, but they can tell different stories depending on the data.
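A small calculation shows how a few extreme values pull the mean away from the median. The income figures below are invented for illustration.

```python
from statistics import mean, median

# Hypothetical family incomes; the last family is extremely wealthy.
incomes = [40_000, 45_000, 50_000, 55_000, 1_000_000]

mean_income = mean(incomes)      # sum of all values divided by the count
median_income = median(incomes)  # middle value of the sorted list
print(mean_income, median_income)
```

Here the mean is 238,000 while the median is 50,000, so the one wealthy family raises the mean without moving the median.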
Understand probability
Let's break down the key points from the video "Understand probability":
Key Concepts:
1. Probability Basics:
Definition: Probability measures the likelihood that a specific event will
occur. It's expressed as a percentage or a fraction.
Example: Flipping a coin has a 50% probability of landing on heads.
2. Probability Distribution:
Definition: A mathematical function that provides the probabilities of
occurrence of different possible outcomes.
Example: Rolling a six-sided die has six possible outcomes, each with a
probability of 1/6 (or about 17%).
3. Sequence of Events:
Definition: The probability of multiple independent events occurring in
sequence is the product of their individual probabilities.
Example: Rolling a specific number twice in a row on a die is 1/6 * 1/6 =
1/36 (or about 3%).
4. Practical Application:
Example: A biotech company uses probability to predict participation in
clinical trials. Factors like fasting before the trial or fear of needles can
decrease participation likelihood.
5. Balancing Accuracy and Participation:
Scenario: The company must decide between a more accurate blood test
(with fewer participants) and a less accurate saliva test (with more
participants). They use probability to weigh the trade-offs.
6. Unexpected Insights:
Key Point: Probability can lead to surprising conclusions, such as
preferring a less accurate test to maximize participation and data points.
Simplified Explanation:
Think of probability like predicting the weather. If there's a 70% chance of rain, you
know it's more likely to rain than not. Similarly, in data science, probability helps predict
outcomes based on data.
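The die arithmetic above can be checked with exact fractions. This sketch assumes a fair die and independent rolls.

```python
from fractions import Fraction

# Probability of one specific face on a fair six-sided die.
p_single = Fraction(1, 6)

# Independent events in sequence: multiply the individual probabilities.
p_twice = p_single * p_single

print(p_single, f"{float(p_single):.1%}")
print(p_twice, f"{float(p_twice):.1%}")
```

Using `Fraction` keeps the result exact (1/36) before it is rounded to a percentage.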
Find a correlation
Let's break down the key points from the video "Find a correlation":
Key Concepts:
1. Correlation:
Definition: Correlation measures the relationship between two variables. It
tells you how one variable changes when the other one does.
Scale: Correlation is measured on a scale from -1 to 1.
1: Perfect positive correlation (as one variable increases, the other
also increases).
0: No correlation (no linear relationship between the variables).
-1: Perfect negative correlation (as one variable increases, the other
decreases).
2. Positive Correlation:
Example: Height and weight. Generally, taller people tend to weigh more.
As height increases, weight also increases.
3. Negative Correlation:
Example: Car weight and fuel efficiency. Heavier cars usually get fewer
miles per gallon. As car weight increases, fuel efficiency decreases.
4. Real-World Applications:
Recommendation Systems: Companies like Netflix and Amazon use
correlation to recommend movies or products based on your past
behavior.
LinkedIn: The "People You May Know" feature uses correlation to suggest
connections based on shared jobs, schools, or interests.
5. Correlation Coefficient:
Definition: A numerical value that represents the strength and direction of
the correlation.
Example: A correlation coefficient of 0.5 indicates a moderate positive
relationship, while -0.75 indicates a strong negative relationship.
Simplified Explanation:
Think of correlation like a friendship. If two friends (variables) always do things together
(positive correlation), they have a strong positive relationship. If they always do the
opposite (negative correlation), they have a strong negative relationship. If they don't
influence each other at all, there's no correlation.
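One common way to compute a correlation coefficient is the Pearson formula, sketched here by hand. The notes do not say which coefficient the video uses, so Pearson is an assumption, and all measurements below are made up for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient, ranging from -1 to 1."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Made-up measurements for illustration only.
heights_cm = [150, 160, 170, 180, 190]
weights_kg = [52, 58, 69, 77, 88]
car_weight_kg = [1000, 1200, 1400, 1600, 1800]
miles_per_gallon = [40, 36, 31, 27, 22]

r_height_weight = pearson(heights_cm, weights_kg)
r_weight_mpg = pearson(car_weight_kg, miles_per_gallon)
print(round(r_height_weight, 3), round(r_weight_mpg, 3))
```

The first value comes out close to 1 (positive correlation) and the second close to -1 (negative correlation), matching the height/weight and car weight/fuel efficiency examples.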
Simplified Explanation:
Think of correlation like two events happening together, like more ice cream sales on
hot days. However, the correlation by itself doesn't prove that hot days cause ice cream
sales; it only shows that the two move together, and other factors could be involved.
1. Predictive Analytics:
Definition: Uses historical data to predict future events. It's a subset of
data science.
Example: Weather forecasting uses past weather data to predict future
conditions.
2. Difference from Data Science:
Data Science: Applies the scientific method to data to uncover insights.
Predictive Analytics: Takes these insights and makes actionable
predictions.
3. Practical Example:
Weather Forecasting: Meteorologists use historical data and correlations
(like low pressure leading to storms) to predict future weather.
Business Application: Imagine your team analyzes millions of Tweets
about running. By identifying influential runners, you can send them
promotions to boost your brand.
4. Importance of Data Quality:
Key Point: The accuracy of predictions depends on the quality of the data
and the thoroughness of the analysis. Ensure your team understands the
past data well to make accurate future predictions.
Simplified Explanation:
Think of predictive analytics like using past experiences to make future decisions. For
example, if you know it usually rains when the sky is cloudy, you might predict rain and
carry an umbrella. In business, this means using past data to forecast trends and make
informed decisions.
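The cloudy-sky example can be sketched as a prediction built from past observations. The history below is invented, and the "more than half" decision rule is an assumption chosen for simplicity, not a method from the video.

```python
# Hypothetical past observations: (sky_was_cloudy, it_rained).
history = [
    (True, True), (True, True), (True, False), (True, True),
    (False, False), (False, False), (False, True), (False, False),
]


def rain_rate(observations, cloudy):
    """Share of past days with the given sky condition on which it rained."""
    matching = [rained for was_cloudy, rained in observations if was_cloudy == cloudy]
    return sum(matching) / len(matching)


def predict_rain(observations, cloudy_today):
    """Predict rain when it rained on more than half of similar past days."""
    return rain_rate(observations, cloudy_today) > 0.5


p_rain_cloudy = rain_rate(history, cloudy=True)
p_rain_clear = rain_rate(history, cloudy=False)
print(p_rain_cloudy, p_rain_clear, predict_rain(history, cloudy_today=True))
```

With only eight observations the estimate is rough, which illustrates why prediction accuracy depends on the quality and amount of past data.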