Assignment DSBDS Insem

Assignment 1: DSBDA

What is Big Data? Explain the characteristics of Big Data.


Big Data refers to large and complex data sets that are difficult to process using traditional data
processing tools. The term is used to describe data that is high in volume, velocity, and variety,
requiring advanced technologies for storage, processing, and analysis.
Characteristics of Big Data (5 Vs)
1. Volume: The sheer amount of data generated every second is enormous, coming from sources
like social media, IoT devices, sensors, and enterprise applications.
2. Velocity: Data is generated at high speed and needs to be processed in real time for
immediate decision-making. Examples include stock market transactions and live social
media updates.
3. Variety: Data comes in multiple formats—structured (databases), semi-structured (XML,
JSON), and unstructured (videos, images, emails).
4. Veracity: The accuracy and reliability of data vary, and managing inconsistent data is a
challenge.
5. Value: The ultimate goal of Big Data is to extract useful insights and derive value from vast
datasets.

Explain Requisite Skill Set in Data Science.


Data Science is an interdisciplinary field that combines statistics, computer science, and domain
knowledge to extract insights from data. Essential skills include:
1. Programming Skills: Proficiency in Python, R, SQL, and Java is crucial for data
manipulation and analysis.
2. Statistics & Mathematics: Understanding probability, linear algebra, and statistical tests is
essential for building models.
3. Machine Learning & AI: Knowledge of supervised and unsupervised learning techniques,
including neural networks and deep learning.
4. Big Data Technologies: Familiarity with Hadoop, Spark, and cloud platforms like AWS,
Google Cloud, or Azure.
5. Data Visualization: Tools like Tableau, Power BI, and Matplotlib help in presenting insights
effectively.
6. Data Wrangling & Cleaning: Handling missing values, dealing with inconsistencies, and
ensuring data quality.
7. Domain Expertise: Understanding industry-specific data needs enhances decision-
making.

What is Data Explosion?


Data Explosion refers to the rapid increase in the amount of data generated worldwide due to
technological advancements. Sources of data explosion include:
1. Internet & Social Media: Platforms like Facebook, Twitter, and Instagram produce massive
amounts of data daily.
2. IoT Devices: Smart devices continuously collect and transmit real-time data.
3. Enterprise Data: Businesses generate large datasets related to sales, customer interactions,
and supply chain management.
4. Cloud Computing & Storage: The growth of cloud-based applications has further
contributed to data generation.

List and Explain Examples of Big Data.


Some real-world examples of Big Data include:
1. Healthcare: Hospitals use Big Data for predictive analytics, patient diagnostics, and
personalized medicine.
2. E-commerce: Companies like Amazon and Flipkart analyze customer behavior to
recommend products.
3. Finance: Banks use Big Data for fraud detection and risk analysis.
4. Social Media Analytics: Platforms analyze user activity to optimize content
recommendations.
5. Smart Cities: Governments use Big Data for traffic management and infrastructure planning.

Explain the 5 Vs of Big Data.


The five key characteristics of Big Data, also known as the 5 Vs, are:
1. Volume: The massive scale of data generated every second.
2. Velocity: The speed at which data is produced and processed.
3. Variety: Different data formats, including structured, semi-structured, and unstructured data.
4. Veracity: The reliability and quality of data.
5. Value: The meaningful insights extracted from data for business growth.

What are the Data Issues in Data Cleaning?


Data Cleaning is the process of detecting and correcting inaccurate records. Common issues include:
1. Missing Data: Incomplete records may lead to inaccurate analysis.
2. Duplicate Data: Repeated records can distort analytical results.
3. Inconsistent Data: Variations in data formats or naming conventions cause discrepancies.
4. Outliers: Extreme values can skew data distributions.
5. Incorrect Data: Errors in manual entry or automated systems can introduce inaccuracies.
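
A minimal pandas sketch of how these issues can be detected; the DataFrame and its
values are hypothetical, made up to exhibit each issue:

import pandas as pd

# Hypothetical records illustrating the issues listed above
df = pd.DataFrame({
    "name": ["Asha", "Ravi", "Ravi", None],          # missing value
    "age":  [25, 30, 30, 240],                       # 240 is an implausible outlier
    "city": ["Pune", "pune", "pune", "Delhi"],       # inconsistent casing
})

print(df.isna().sum())        # missing values per column
print(df.duplicated().sum())  # exact duplicate rows
print(df["city"].unique())    # "Pune" vs "pune" inconsistency

# Flag outliers in "age" with a simple 1.5 * IQR rule
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
print(df[(df["age"] < q1 - 1.5 * iqr) | (df["age"] > q3 + 1.5 * iqr)])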

Explain Various Cleaning Methods.


Data Cleaning techniques help refine data for accurate analysis:
1. Handling Missing Values: Use methods like mean imputation or deletion.
2. Removing Duplicates: Identify and eliminate redundant records.
3. Correcting Data Formats: Standardize date formats, capitalization, and numeric precision.
4. Outlier Detection: Use statistical methods to identify and handle anomalies.
5. Data Validation: Ensure correctness using predefined rules.
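
These methods can be sketched in pandas as follows; this is one possible approach,
with assumed example data, not the only correct pipeline:

import pandas as pd

df = pd.DataFrame({
    "city":  ["Pune", "pune", "Mumbai", "Mumbai"],
    "sales": [120.0, None, 95.5, 95.5],
})

df["sales"] = df["sales"].fillna(df["sales"].mean())  # 1. mean imputation
df["city"] = df["city"].str.title()                   # 3. standardize casing
df = df.drop_duplicates()                             # 2. remove duplicate rows
assert (df["sales"] >= 0).all()                       # 5. validate a predefined rule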

Write Short Notes on Data Transformation.


Data Transformation is the process of converting data into a suitable format for analysis. It includes:
1. Normalization: Scaling data to a standard range.
2. Encoding Categorical Data: Converting text-based data into numerical values.
3. Aggregation: Summarizing data for analysis.
4. Feature Engineering: Creating new features to enhance model performance.
5. Data Integration: Combining data from multiple sources.
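
A short illustration of normalization and categorical encoding in pandas; the
"color" and "price" columns are invented for the example:

import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "red"],
                   "price": [10.0, 20.0, 30.0]})

# Normalization: min-max scaling of "price" into the [0, 1] range
df["price_scaled"] = (df["price"] - df["price"].min()) / \
                     (df["price"].max() - df["price"].min())

# Encoding categorical data: one-hot encode "color" into numeric columns
df = pd.get_dummies(df, columns=["color"])
print(df)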

Assignment 2

What is the need for statistics in data science and big data analytics?
Statistics plays a crucial role in data science and big data analytics by providing methodologies to
collect, analyze, interpret, and present data meaningfully. Without statistical principles, data analysis
would lack reliability and scientific accuracy.
Importance of Statistics in Data Science
1. Data Collection & Sampling: Statistics helps in determining proper sampling
   techniques to extract meaningful insights from a dataset without analyzing the
   entire population.
2. Descriptive Analytics: Measures like mean, median, mode, variance, and standard
   deviation help summarize data distributions.
3. Inferential Analytics: Hypothesis testing, confidence intervals, and regression
   analysis help in making predictions and data-driven decisions.
4. Machine Learning & AI: Algorithms rely on statistical methods like probability
   distributions and Bayesian inference.
5. Data Cleaning & Transformation: Identifying missing values, handling outliers,
   and normalizing data require statistical techniques.
6. Predictive Modeling & Forecasting: Statistical techniques like regression,
   time-series analysis, and clustering help in forecasting trends.
7. Big Data Processing: Since big data is vast and unstructured, statistical models
   help in summarizing and analyzing it efficiently.

Explain Mean, Median, and Mode.


These three measures of central tendency help describe the distribution of a dataset.
Mean (Average)
Mean is the sum of all values in a dataset divided by the total number of values.
$\text{Mean} = \frac{\sum X}{N}$
• Example: Given numbers 4, 6, 8, 10, the mean is:
$\frac{4+6+8+10}{4} = 7$
• It is sensitive to outliers, meaning extreme values can distort the mean.
Median
The median is the middle value when data is arranged in ascending order.
• Example: 3, 7, 9, 12, 15 → Median = 9
• If there is an even number of elements, the median is the average of the two middle values.
• It is less affected by outliers than the mean.
Mode
The mode is the most frequently occurring value in a dataset.
• Example: 2, 4, 4, 6, 8, 4 → Mode = 4
• A dataset can be unimodal (one mode), bimodal (two modes), or multimodal (more than
two modes).
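
Python's standard statistics module reproduces the three examples above directly:

import statistics

print(statistics.mean([4, 6, 8, 10]))        # 7, matching the mean example
print(statistics.median([3, 7, 9, 12, 15]))  # 9, matching the median example
print(statistics.mode([2, 4, 4, 6, 8, 4]))   # 4, matching the mode example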

Explain Variance and Standard Deviation.


Variance and standard deviation are measures of dispersion that describe how data points spread
around the mean.
Variance (σ²)
Variance measures the average squared difference of each data point from the mean.
Formula:
$\sigma^2 = \frac{\sum (X_i - \mu)^2}{N}$
• Higher variance means data points are widely spread.
• Lower variance means values are close to the mean.
Standard Deviation (σ)
The standard deviation is the square root of the variance, making it easier to interpret.
$\sigma = \sqrt{\sigma^2}$
• It has the same unit as the original data, unlike variance.
• It helps in understanding data consistency and volatility.
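
A small sketch with the standard statistics module; the data values are assumed for
illustration (population formulas, dividing by N as above):

import statistics

data = [4, 6, 8, 10]                 # hypothetical sample, mean = 7

print(statistics.pvariance(data))    # population variance: (9+1+1+9)/4 = 5.0
print(statistics.pstdev(data))       # population std dev: sqrt(5) ≈ 2.236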

What is Hypothesis Testing?


Hypothesis testing is a statistical method used to make decisions or inferences about a population
based on sample data.
Steps in Hypothesis Testing
1. Formulate Hypotheses:
   - Null Hypothesis (H₀): Assumes no effect or difference.
   - Alternative Hypothesis (H₁): Indicates a significant effect or difference.
2. Set Significance Level (α): Typically 0.05 (5%), meaning a 5% risk of a false
   positive.
3. Choose a Test Statistic: Examples include the t-test, z-test, chi-square test,
   and ANOVA.
4. Calculate the p-value: If the p-value ≤ α, reject H₀ (a significant result).
5. Make a Conclusion: Based on the p-value, determine whether the result supports
   H₁.
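
A minimal sketch of these steps as a one-sample t-test with SciPy; the sample
values and the hypothesized mean of 50 are assumptions made for illustration:

from scipy import stats

sample = [52, 48, 55, 53, 49, 51, 54, 50]   # hypothetical measurements

# H0: population mean = 50 vs H1: population mean != 50
t_stat, p_value = stats.ttest_1samp(sample, popmean=50)

alpha = 0.05                                 # step 2: significance level
if p_value <= alpha:
    print(f"p = {p_value:.3f} <= {alpha}: reject H0")
else:
    print(f"p = {p_value:.3f} > {alpha}: fail to reject H0")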

What is Pearson Coefficient?


Pearson’s correlation coefficient (r) measures the strength and direction of the linear relationship
between two variables.
Formula
$r = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum (X_i - \bar{X})^2 \sum (Y_i - \bar{Y})^2}}$
• r ranges from -1 to +1:
  - +1 → Perfect positive correlation
  - 0 → No correlation
  - -1 → Perfect negative correlation
Example
If study hours (X) and exam scores (Y) have r = 0.85, it suggests a strong positive
correlation: more study hours are associated with higher scores.
Pearson’s coefficient is used in finance, health, and social sciences to study relationships between
variables.
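
The formula above translates directly into NumPy; the study-hour and exam-score
values below are invented for illustration, and SciPy's built-in pearsonr gives the
same result:

import numpy as np
from scipy import stats

x = np.array([1, 2, 3, 4, 5, 6])         # hypothetical study hours
y = np.array([52, 55, 61, 70, 74, 80])   # hypothetical exam scores

# Direct translation of the formula: covariance term over the
# product of the two sum-of-squares terms
num = ((x - x.mean()) * (y - y.mean())).sum()
den = np.sqrt(((x - x.mean()) ** 2).sum() * ((y - y.mean()) ** 2).sum())
print(num / den)

# Cross-check with SciPy's built-in implementation
r, p = stats.pearsonr(x, y)
print(r)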
