Introduction To EDA
Unit I
Definition: EDA
| Aspect | R | Python |
| --- | --- | --- |
| **Primary Use** | Statistical analysis, data visualization | General-purpose programming, data science, machine learning |
| **Strengths** | Built-in stats packages<br>Great for academic and statistical modeling | Versatile<br>Strong machine learning libraries (scikit-learn, TensorFlow, etc.) |
| **Community** | Strong among statisticians and researchers | Very large and cross-disciplinary |
| **Ease of Use** | Easier for statistical analysis tasks | Easier for general programming & integration |
| **Flexibility** | Limited outside data analysis | High; used for web dev, automation, AI, etc. |
| **Data Handling** | Excellent for analysis | Strong for storage, reporting, and transactional processing |
| **Real-time Support** | Typically not built-in | Often paired with dashboards, alerts, and automation |
Bridging the Two
Modern enterprises combine both:
● Integrate with enterprise apps via APIs for automation (e.g., Python script sending predictions to Salesforce)
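As a minimal sketch of that kind of integration: the function below assembles a JSON payload of model predictions that could be POSTed to an enterprise API. The function name, field names, and endpoint URL are all hypothetical; a real CRM such as Salesforce defines its own REST schema and authentication.

```python
import json

def build_prediction_payload(record_id, prediction, model_version="v1"):
    """Assemble a JSON-serializable payload for an enterprise API.

    `record_id`, `prediction`, and `model_version` are hypothetical
    field names; a real endpoint defines its own schema.
    """
    return {
        "record_id": record_id,
        "prediction": round(float(prediction), 4),
        "model_version": model_version,
    }

payload = build_prediction_payload("acct-0042", 0.8731)
body = json.dumps(payload)
# A real integration would POST this body with an HTTP client, e.g.:
#   requests.post("https://example.my.salesforce.com/services/apexrest/score",
#                 data=body, headers={"Content-Type": "application/json"})
print(body)
```

Keeping payload construction separate from the network call makes the formatting logic easy to unit-test without touching the external service.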
LAB
Software/hardware covered in the book
OS requirements: Python 3.x on Windows, macOS, Linux, or any other OS
Python notebooks
There are several options:
Local: Jupyter: https://jupyter.org/
Local: Anaconda: https://www.anaconda.com/distribution/
Online: Google Colab: https://colab.research.google.com/
Python libraries: NumPy, pandas, scikit-learn, Matplotlib, Seaborn, statsmodels
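A quick way to check the lab environment is to test which of these packages can be imported. Note that some install names differ from their import names (scikit-learn imports as `sklearn`):

```python
import importlib.util

def available(packages):
    """Report which packages can be imported in the current environment."""
    return {name: importlib.util.find_spec(name) is not None for name in packages}

# Import names for the libraries listed above:
status = available(["numpy", "pandas", "sklearn", "matplotlib", "seaborn", "statsmodels"])
for name, ok in status.items():
    print(f"{name:12s} {'OK' if ok else 'missing'}")
```

Any package reported missing can be installed with `pip install <name>` (or via the Anaconda distribution, which bundles all of them).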
Understanding Data Science
Data science involves cross-disciplinary knowledge from computer science, statistics, and mathematics. There are several phases of data analysis, including data requirements, data collection, data processing, data cleaning, exploratory data analysis, modeling and algorithms, and data product and communication. These phases are similar to those of the CRoss-Industry Standard Process for Data Mining (CRISP-DM) framework.
1. Data requirements: It is important to comprehend what type of data the organization needs to collect, curate, and store. For example, an application tracking the sleeping patterns of patients suffering from dementia requires several types of sensor data, such as sleep data, the patient's heart rate, electrodermal activity, and user activity patterns. All of these data points are required to correctly diagnose the person's mental state, so they are mandatory requirements for the application. In addition, the data must be categorized (numerical or categorical), and the format of storage and dissemination must be defined.
2. Data collection: Data collected from several sources must be stored in the correct format and transferred to the right information technology personnel within a
company. As mentioned previously, data can be collected from several objects on several events using different types of sensors and storage tools.
3. Data processing: Preprocessing involves curating the dataset before actual analysis. Common tasks include correctly exporting the dataset, placing the data under the right tables, structuring it, and exporting it in the correct format.
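The structuring-and-exporting step can be sketched with the standard-library `csv` module. The sensor readings below are made-up illustration data, loosely echoing the dementia-monitoring example:

```python
import csv
import io

# Raw sensor readings as they might arrive from the collection step:
raw = [
    ("2024-01-01T22:10", "heart_rate", "61"),
    ("2024-01-01T22:10", "sleep_stage", "light"),
    ("2024-01-01T22:15", "heart_rate", "58"),
]

# Structure the records under named columns and export in a consistent format.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["timestamp", "measurement", "value"])
writer.writeheader()
for ts, kind, value in raw:
    writer.writerow({"timestamp": ts, "measurement": kind, "value": value})

exported = buffer.getvalue()
print(exported)
```

In practice the same structuring is usually done with a pandas `DataFrame` and `to_csv`; the principle, named columns and a consistent export format, is identical.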
4. Data cleaning: Preprocessed data is still not ready for detailed analysis. It must be correctly transformed and checked for incompleteness, duplicates, errors, and missing values. These tasks are performed in the data cleaning stage, which involves responsibilities such as matching the correct records, finding inaccuracies in the dataset, understanding the overall data quality, removing duplicate items, and filling in the missing values. Data cleaning depends on the type of data under study; hence, it is essential for data scientists or EDA experts to comprehend different types of datasets. An example of data cleaning would be using outlier detection methods for quantitative data cleaning.
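The three cleaning tasks named above, removing duplicates, filling missing values, and detecting outliers, can be sketched in plain Python. The readings are made-up, and the quartile calculation is a deliberately crude index-based approximation of the interquartile-range (IQR) rule:

```python
from statistics import mean

readings = [61, 61, None, 58, 59, 120, 60]  # has a duplicate, a missing value, an outlier

# 1. Remove exact duplicates while preserving order.
seen, deduped = set(), []
for r in readings:
    if r not in seen:
        seen.add(r)
        deduped.append(r)

# 2. Fill missing values with the mean of the observed values.
observed = [r for r in deduped if r is not None]
fill = mean(observed)
filled = [fill if r is None else r for r in deduped]

# 3. Flag outliers: values more than 1.5 * IQR outside the quartiles.
ordered = sorted(filled)
q1 = ordered[len(ordered) // 4]
q3 = ordered[(3 * len(ordered)) // 4]
iqr = q3 - q1
outliers = [x for x in filled if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]
clean = [x for x in filled if x not in outliers]
```

Real cleaning pipelines typically use pandas (`drop_duplicates`, `fillna`, `quantile`) for the same steps; the logic above is only meant to make each step visible.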
5. EDA: Exploratory data analysis, as mentioned before, is the stage where we actually start to understand the message contained in the data. It should be noted