DS - Unit I
Data Science is an interdisciplinary field that uses scientific methods, processes, algorithms, and
systems to extract knowledge and insights from structured and unstructured data. It combines
techniques from statistics, machine learning, data mining, and big data technologies to analyze complex data.
3. Facets of Data
Structured Data: Data organized in rows and columns (e.g., SQL databases).
Unstructured Data: Data without a predefined format (e.g., text, images, videos).
Semi-structured Data: Data that doesn't have a rigid structure but has some level of
organization (e.g., JSON, XML).
Time-series Data: Data collected over time (e.g., stock prices, sensor data).
Spatial Data: Data related to locations and geography (e.g., maps, geospatial data).
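To make the first three facets concrete, here is a small sketch using only Python's standard library (the sample records and field names are made up for illustration), showing the same kind of data in structured and semi-structured form:

```python
import csv
import io
import json

# Structured: fixed rows and columns, like a SQL table or a CSV file
csv_text = "id,name,city\n1,Asha,Pune\n2,Ravi,Delhi\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))
print(rows[0]["name"])  # every row has exactly the same columns

# Semi-structured: organized with keys, but no rigid schema --
# the second record has a nested "orders" field that the first lacks
json_text = '[{"id": 1, "name": "Asha"}, {"id": 2, "name": "Ravi", "orders": [101, 102]}]'
records = json.loads(json_text)
print(records[1].get("orders", []))
```

Unstructured data (free text, images, video) would need techniques like natural language processing or computer vision before it fits into rows and keys at all.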
4. The Data Science Process
The data science process is a sequence of steps followed to convert raw data into actionable
insights. The main steps involved in the data science process are:
1. Defining the Problem: Clarify the problem you're solving and set clear goals.
2. Retrieving Data: Collect the data required for the analysis.
3. Data Cleansing and Transformation: Clean and prepare the data for analysis.
4. Exploratory Data Analysis (EDA): Investigate the data to understand its structure and
relationships.
5. Model Building: Develop machine learning models and evaluate them.
6. Presenting Findings: Communicate insights through reports, visualizations, and
presentations.
7. Building Applications: Deploy models and use them in real-world applications.
5. Big Data and Its Ecosystem
Big Data refers to datasets so large or complex that traditional data-processing software cannot
handle them. Big Data ecosystems support the storage, processing, and analysis of this data. Key
components include:
Data Sources: Big data comes from various sources such as social media, sensors, log
files, and transactional systems.
Data Storage: Technologies like Hadoop HDFS, NoSQL databases (e.g., MongoDB,
Cassandra), and cloud storage solutions store large datasets.
Data Processing: Frameworks like Apache Hadoop and Apache Spark are used to
process big data in a distributed manner across clusters.
Data Analytics: Tools such as Apache Hive, Apache Pig, Python, R, and SQL
are used to perform analytics on big data.
Machine Learning: Big data enables the use of more complex machine learning models
by providing large amounts of training data.
Data Visualization: Platforms like Tableau and Power BI, or custom visualizations built with
Python's Matplotlib and Seaborn, help in presenting insights from large datasets.
Big data and data science work hand in hand: data scientists use big data tools to extract
insights from vast datasets, build predictive models, and make data-driven decisions.
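Frameworks like Hadoop MapReduce and Spark split distributed processing into a map phase, a shuffle, and a reduce phase. The following is a toy single-machine sketch of that pattern in plain Python (the sample lines are made up), counting words the way the classic MapReduce word-count example does:

```python
from collections import defaultdict
from itertools import chain

lines = ["big data needs big tools", "data science uses big data"]

# Map: turn each line into (word, 1) pairs; on a cluster this runs on many nodes
mapped = chain.from_iterable(((word, 1) for word in line.split()) for line in lines)

# Shuffle: group the pairs by key (the framework moves data between nodes here)
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: sum the counts for each word
word_counts = {word: sum(counts) for word, counts in groups.items()}
print(word_counts["big"])   # 3
print(word_counts["data"])  # 3
```

Real frameworks add what this sketch omits: partitioning the data across machines, moving the grouped pairs over the network, and recovering from node failures.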
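The seven-step data science process above can be walked through end-to-end on a toy problem. This is a sketch in plain Python with made-up numbers (a real project would use libraries such as pandas and scikit-learn):

```python
# 1. Define the problem: predict y (e.g., sales) from x (e.g., ad spend)
raw = [(1.0, 2.1), (2.0, 3.9), (None, 5.0), (3.0, 6.1), (4.0, 8.0)]

# 2-3. Retrieve the data and cleanse it: drop records with missing values
data = [(x, y) for x, y in raw if x is not None and y is not None]

# 4. EDA: compute simple summary statistics to understand the data
xs = [x for x, _ in data]
ys = [y for _, y in data]
mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)

# 5. Model building: least-squares fit of y = a*x + b (closed form)
a = sum((x - mean_x) * (y - mean_y) for x, y in data) / sum((x - mean_x) ** 2 for x in xs)
b = mean_y - a * mean_x

# 6. Present findings
print(f"fit: y = {a:.2f}x + {b:.2f}")

# 7. Build an application: deploy the model as a reusable function
def predict(x):
    return a * x + b

print(round(predict(5.0), 1))  # 10.0
```

Each comment marks one step of the process; in practice steps 3-5 are iterated many times before anything is presented or deployed.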