Tools for a Data Scientist: Data scientists use diverse tools for data manipulation, modeling, visualization, and deployment. Key tools include:
- Programming Languages:
  - Python: Offers libraries like Pandas for data manipulation, Matplotlib/Seaborn for visualization, and Scikit-learn for machine learning.
  - R: Well-suited for statistical analysis and visualization.
- Visualization Tools: Tableau and Power BI help create interactive dashboards and insights for non-technical stakeholders.
- Big Data Tools: Tools like Apache Spark and Hadoop handle large datasets efficiently.
- Machine Learning Frameworks: TensorFlow and PyTorch are widely used for deep learning.
- Version Control: Git/GitHub manages code and collaboration among team members.

Statistical data analysis is the process of interpreting data to uncover patterns, relationships, and trends. It involves:
- Descriptive Statistics: Summarizes data using measures like mean, median, standard deviation, and variance.
- Inferential Statistics: Makes predictions or inferences about a population based on a sample, using methods like hypothesis testing and confidence intervals.
For instance, analyzing the performance of students in an exam using mean scores provides a descriptive summary, while predicting overall class trends involves inferential analysis.

A data cube organizes data into multidimensional arrays, enabling efficient querying and analysis.
- Purpose: Supports OLAP (Online Analytical Processing) operations like slicing (viewing specific dimensions) and dicing (creating smaller data cubes).
- Example: A retail data cube could analyze sales data across dimensions like time (days, months, years), location (regions, cities), and products (categories, brands).

Purpose of Data Preprocessing: Data preprocessing transforms raw data into a clean and usable format. Key steps include:
- Data Cleaning: Removes inconsistencies, missing values, and duplicates.
- Data Transformation: Normalizes or encodes data for compatibility with algorithms.
- Feature Engineering: Creates new features to enhance model performance.
- Purpose: Ensures data quality and improves the reliability of analyses or machine learning models.

Inferential statistics draws conclusions about a population using sample data.
- Example: Predicting election results based on a voter survey.
- Techniques: Hypothesis testing, regression analysis, and confidence intervals.

Noisy Data:
- Definition: Noisy data contains irrelevant or erroneous information that can distort analysis.
- Causes:
  - Sensor Errors: Faulty devices (e.g., temperature sensors) produce inaccurate readings.
  - Human Input Errors: Typos or inconsistent data entry (e.g., "N/A" vs. "Not Applicable").
- Solution: Use data cleaning techniques like smoothing or filtering to mitigate noise.

The 3Vs represent the characteristics of big data:
- Volume: Refers to the sheer quantity of data generated daily (e.g., terabytes of social media posts).
- Variety: Includes structured data (databases), unstructured data (texts, images), and semistructured data (JSON, XML).
- Velocity: Represents the speed at which data is generated and processed (e.g., real-time streaming).

Null Hypothesis (H₀): Assumes no effect or relationship exists. It represents the default assumption. For instance, "A new drug has no effect on patients' recovery times."
Alternate Hypothesis (H₁): Suggests an effect or relationship exists, challenging the null hypothesis. Example: "The new drug improves recovery times."
Hypothesis testing evaluates these claims using data to decide whether to reject H₀ in favor of H₁ based on significance levels (e.g., p-values).
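To make the H₀/H₁ decision rule concrete, here is a minimal sketch using SciPy's independent-samples t-test; the recovery-time numbers and the 0.05 significance level are made-up values for illustration, not data from any real study.

```python
# Minimal hypothesis-testing sketch (illustrative data, assumed alpha = 0.05).
from scipy import stats

# Hypothetical recovery times in days: control group vs. new-drug group.
control = [10.1, 9.8, 10.4, 10.0, 9.9, 10.2]
treated = [9.1, 8.8, 9.4, 9.0, 9.2, 8.9]

# H0: the drug has no effect (equal means); H1: mean recovery times differ.
t_stat, p_value = stats.ttest_ind(control, treated)

alpha = 0.05  # significance level
if p_value < alpha:
    print(f"p = {p_value:.4f} < {alpha}: reject H0 in favor of H1")
else:
    print(f"p = {p_value:.4f} >= {alpha}: fail to reject H0")
```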
Hypothesis testing is a statistical method to determine the validity of a claim or hypothesis about a dataset based on sample data.

Use of a Bubble Plot: A bubble plot visually represents three dimensions of data: the x-axis, the y-axis, and bubble size (representing a third variable).

Data cleaning involves identifying and rectifying errors, inconsistencies, and inaccuracies in a dataset to ensure quality and reliability.

Standard deviation measures the dispersion or spread of data points from the mean in a dataset.

Different types of attributes include nominal, ordinal, interval, and ratio.

A data object is an entity that holds data and is described by its attributes, such as a record in a database or an instance in programming.

Tools used for geospatial data include ArcGIS, QGIS, Google Earth Engine, and GeoPandas.

Methods of feature selection include filter methods (e.g., correlation), wrapper methods (e.g., recursive feature elimination), and embedded methods (e.g., LASSO).

Purpose of Data Visualization: Data visualization translates complex data into visual formats like charts, graphs, and maps.
- Purpose: Identify trends, patterns, and outliers; communicate insights effectively to stakeholders.
- Example Libraries:
  - Matplotlib: Used for simple 2D plots like line graphs and scatter plots.
  - Seaborn: Builds on Matplotlib for advanced statistical visualizations like heatmaps and pair plots.
- Visual Encoding: Attributes like color, shape, size, and position represent data elements, making relationships visually intuitive.

Applications of Data Science: Data science applies to various domains:
- Healthcare: Predictive analytics for diseases, drug discovery, and patient management.
- Retail: Inventory management, recommendation engines, and customer segmentation.
- Finance: Fraud detection, credit scoring, and stock market predictions.
- Transportation: Optimizing routes, analyzing traffic patterns, and enabling autonomous vehicles.
- Social Media: Sentiment analysis, content recommendation, and user behavior modeling.

Data science is a multidisciplinary field that uses techniques, tools, and algorithms to extract insights and knowledge from structured and unstructured data.

Two Ways in Which Data is Stored in Files:
- CSV (Comma-Separated Values): A simple text format where data is stored in tabular form, with rows and columns separated by commas. Widely used for structured data.
- JSON (JavaScript Object Notation): A lightweight format for representing semistructured data. Example: `{ "name": "John", "age": 30 }`.

Role of Statistics in Data Science: Statistics provides foundational techniques for data science, such as:
- Descriptive Statistics: Summarizes data (e.g., mean, median, mode).
- Inferential Statistics: Generalizes findings from sample data to a larger population (e.g., hypothesis testing, confidence intervals).
- Data Distribution: Tools like histograms reveal data distribution, enabling better model selection.

Word Clouds: A word cloud is a visual representation of textual data. Words appear in varying sizes, proportional to their frequency or importance.
- Uses: Quickly identifies key topics in large text datasets.
- Example: Analyzing customer feedback to determine common complaints or praises.

Two Methods of Data Cleaning for Missing Values:
- Imputation: Replace missing values with statistical measures like the mean, median, or mode, ensuring data continuity.
- Deletion: Remove rows or columns with excessive missing values, provided it doesn't significantly reduce dataset quality.
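A minimal Pandas sketch of the two methods above; the column names and values are invented for illustration.

```python
# Minimal missing-value cleaning sketch with Pandas (illustrative data).
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [25, np.nan, 31, 40, np.nan],
    "salary": [50_000, 62_000, np.nan, 58_000, 61_000],
})

# Imputation: replace missing ages with the column mean.
df["age"] = df["age"].fillna(df["age"].mean())

# Deletion: drop any remaining rows that still contain missing values.
df = df.dropna()

print(df)
```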
Two Python libraries used for data analysis are Pandas and NumPy.

A nominal attribute is a qualitative attribute that categorizes data into distinct groups or labels without an inherent order (e.g., colors or gender).

One-hot encoding is a technique for representing categorical data as binary vectors, where each category is represented by a unique vector with a single 1.

Data visualization is the graphical representation of data using charts, graphs, or maps to identify trends, patterns, and insights.

Variance is a statistical measure that represents the average squared deviation of data points from their mean, indicating the spread of data.

A data source refers to the origin of data, which can be databases, files, APIs, sensors, or web-based repositories used for analysis.

Missing values are data entries that are either absent or undefined in a dataset, often represented as `NaN` or blank spaces.

Visualization Libraries in Python: Some popular Python visualization libraries are Matplotlib, Seaborn, Plotly, Bokeh, and Altair.

Data transformation is the process of converting data into a suitable format for analysis, such as normalization, scaling, or aggregation.

Differentiate Structured and Unstructured Data:
- Structured Data: Organized into rows and columns with a predefined schema (e.g., relational databases, CSV files).
- Unstructured Data: Lacks a fixed format (e.g., images, videos, free text).
- Example: Transaction records (structured) vs. social media posts (unstructured).

Types of Data:
1. Structured Data: Organized in rows and columns, easy to store in databases. Example: Customer details with Name, Age, and Email.
2. Unstructured Data: No predefined format; harder to process. Example: Social media posts, audio files, images.
3. Semi-structured Data: Partially organized with tags or markers. Example: XML, JSON.

Types of Data Attributes:
1. Nominal: Categories without order. Example: Blood group (A, B, O).
2. Ordinal: Ordered categories. Example: Ratings (Poor, Average, Excellent).
3. Interval: Numeric, no true zero. Example: Temperature in Celsius.
4. Ratio: Numeric, has a true zero. Example: Height, age.

Concept and Use of Data Visualization: Data visualization transforms raw data into graphical formats like charts, graphs, and dashboards.
- Purpose: Identify trends and patterns; highlight anomalies; improve decision-making.
- Libraries in Python:
  - Matplotlib: Basic 2D plotting.
  - Seaborn: Advanced statistical visuals.
  - Plotly: Interactive charts.
  - Bokeh: Real-time visualizations.

Data Transformation refers to the process of converting data into a format or structure that is better suited for analysis or modeling. It is an essential step in data preprocessing, ensuring that the data is clean, uniform, and compatible with the methods and algorithms applied later. Strategies for Data Transformation:
1. Scaling and Normalization: Techniques like Min-Max scaling or Standardization are used to scale numerical values to a common range or distribution (mean = 0, std = 1), especially when dealing with machine learning models.
2. Feature Encoding: Transforming categorical data into numerical values using techniques such as One-Hot Encoding or Label Encoding, allowing machine learning algorithms to process them effectively.
3. Handling Missing Data: Techniques like imputation (replacing missing values with mean, median, or mode) or removal of rows/columns with missing values are used to handle incomplete data.

Outlier Detection Methods: Outliers are unusual data points that deviate significantly from the rest.
- Methods:
  1. Statistical:
     - Z-Score: Identifies values far from the mean.
     - IQR: Detects values outside the 1.5×IQR range.
  2. Visualization: Boxplots highlight outliers with whiskers.
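A minimal NumPy sketch of the two statistical detection rules above; the sample array is invented, and the cutoffs (|z| > 2, 1.5×IQR) are common conventions rather than fixed rules.

```python
# Minimal outlier-detection sketch: Z-score and IQR rules (illustrative data).
import numpy as np

data = np.array([10, 12, 11, 13, 12, 11, 95])  # 95 is an obvious outlier

# Z-score rule: flag points far from the mean (cutoff of 2 or 3 is typical).
z_scores = (data - data.mean()) / data.std()
z_outliers = data[np.abs(z_scores) > 2]

# IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
iqr_outliers = data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)]

print("Z-score outliers:", z_outliers)
print("IQR outliers:", iqr_outliers)
```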
Cube Aggregation in the Context of Data Reduction: A data cube is a multi-dimensional structure used to summarize data by aggregating values.
- Example: Sales data summarized by:
  - Dimensions: Time (year), Region (state), Product (category).
  - Measure: Total sales.
- Techniques:
  1. Roll-Up: Aggregate data at a higher level (e.g., daily sales → monthly sales).
  2. Drill-Down: Break down into details (e.g., state-level sales → city-level sales).
- Use: Reduces complexity while retaining meaningful information; speeds up analysis and improves insights.

Data Transformation Techniques:
1. Normalization: Rescales data to fit within [0, 1]. Example: With prices in the $10-$100 range, $10 maps to 0 and $100 maps to 1.
2. One-Hot Encoding: Converts categorical data into binary format. Example: Category {Red, Blue, Green} → {1, 0, 0} (Red), {0, 1, 0} (Blue), {0, 0, 1} (Green).

Four Data Visualization Tools:
1. Tableau: Features: Drag-and-drop interface, interactive dashboards. Use: Business analytics and reporting.
2. Matplotlib (Python): Features: Basic plots like line graphs, bar charts, and scatter plots. Use: Simple 2D visualizations.
3. Seaborn: Features: Advanced statistical plots like heatmaps, pair plots. Use: Exploring relationships and trends in data.
4. Power BI: Features: Integrates with other Microsoft tools, supports real-time dashboards. Use: Business intelligence and collaborative data analysis.

An outlier is a data point that differs significantly from the other data points in a dataset. Outliers can affect statistical analyses and may need to be removed or treated depending on the analysis context. Types of Outliers:
1. Univariate Outliers: Outliers in a single variable, identified using statistical methods like Z-scores or IQR (Interquartile Range). Example: A person with an extremely high income in a salary dataset.
2. Multivariate Outliers: Outliers in a combination of two or more variables, identified using multivariate statistical methods like Mahalanobis distance. Example: A combination of age and income that doesn't fit any normal pattern in a dataset.
3. Contextual Outliers: Outliers that are considered unusual in a specific context but not in others. Example: A temperature of 40°C in winter may be an outlier in one region but normal in a tropical region.

Different Methods for Measuring Data Dispersion: Data dispersion refers to the extent to which data points in a dataset spread out or cluster around the mean. Common measures of data dispersion include:
1. Range: The difference between the maximum and minimum values in a dataset. Formula: Range = Max Value − Min Value. Example: Dataset {5, 10, 15, 20}. Range = 20 − 5 = 15.
2. Variance: Measures the average squared deviation from the mean. It tells how spread out the data is. Formula: σ² = Σ(xᵢ − μ)² / N. Example: A dataset with small variance means that most values are close to the mean.
3. Standard Deviation: The square root of the variance. It gives a measure of dispersion in the same units as the data, making it easier to interpret. Example: In the previous example, the standard deviation gives a more interpretable measure of the spread of data.
4. Interquartile Range (IQR): The range between the first quartile (25th percentile) and the third quartile (75th percentile). IQR measures the middle 50% of the data, excluding outliers. Formula: IQR = Q3 − Q1.
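The four dispersion measures can be computed directly with NumPy; this sketch reuses the dataset {5, 10, 15, 20} from the Range example.

```python
# Dispersion measures for the example dataset {5, 10, 15, 20}.
import numpy as np

data = np.array([5, 10, 15, 20])

data_range = data.max() - data.min()   # Range = 20 - 5 = 15
variance = data.var()                  # population variance: sum((x - mean)^2) / N
std_dev = data.std()                   # standard deviation = sqrt(variance)
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1                          # IQR = Q3 - Q1

print(f"range={data_range}, variance={variance}, std={std_dev:.2f}, IQR={iqr}")
```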
Feature Extraction is the process of transforming raw data into a set of usable features or variables that can be fed into machine learning algorithms. It is a crucial step for reducing the dimensionality of the data and improving model performance. The goal is to retain the most important information while reducing noise and irrelevant details. Example: In text analysis, TF-IDF (Term Frequency-Inverse Document Frequency) is a common feature extraction technique used to represent text data based on word frequency and importance.

Data Science Life Cycle with Diagram: Stages:
1. Data Collection: Gathering raw data from various sources like APIs and logs.
2. Data Cleaning: Removing noise and duplicates, and handling missing values.
3. Exploratory Data Analysis (EDA): Identifying trends, patterns, and correlations.
4. Model Building: Applying algorithms to train predictive models.
5. Evaluation: Testing the model's accuracy using metrics (e.g., precision).
6. Deployment: Implementing the model into production.
Diagram: Represent this as a circular flow or pipeline with arrows showing interactions.

Primary data is original data collected directly by researchers for a specific purpose, such as surveys, experiments, or interviews.

Data quality refers to the degree to which data is accurate, consistent, complete, and suitable for its intended purpose.

An outlier is a data point that significantly deviates from the rest of the dataset, often indicating anomalies or errors.

The interquartile range (IQR) is the difference between the third quartile (Q3) and the first quartile (Q1), representing the middle 50% of a dataset.

One Technique of Data Transformation: One common technique of data transformation is normalization. It scales the data to a specific range, typically [0, 1], by applying the formula: Normalized value = (Value − Min) / (Max − Min). This technique is especially important when working with algorithms like k-nearest neighbors (KNN) and neural networks, which are sensitive to varying scales of input features.

Three Data Transformation Techniques: Data transformation is the process of converting data from one format or structure into another to make it suitable for analysis. Some key techniques include:
1. Normalization: Rescaling data to fall within a specific range, usually [0, 1]. This helps in eliminating bias when working with algorithms sensitive to varying scales (e.g., machine learning models). Formula: Normalized value = (Value − Min) / (Max − Min).
2. Log Transformation: Applying the logarithm to data values, which can help reduce skewness and make patterns more apparent. Example: A financial dataset with large values, like income or sales, can be log-transformed to make analysis more manageable.
3. One-Hot Encoding: Converts categorical variables into a series of binary (0 or 1) columns, where each column represents one possible category. This technique is useful for algorithms that require numerical input. Example: A "Color" column with categories {Red, Blue, Green} can be transformed into: Red → [1, 0, 0], Blue → [0, 1, 0], Green → [0, 0, 1].
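A minimal sketch of the three techniques above using Pandas and NumPy; the income and color values are invented for illustration.

```python
# Sketch of three data transformation techniques (illustrative data).
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "income": [20_000, 45_000, 120_000, 1_000_000],
    "color":  ["Red", "Blue", "Green", "Red"],
})

# 1. Normalization: (value - min) / (max - min) rescales to [0, 1].
col = df["income"]
df["income_norm"] = (col - col.min()) / (col.max() - col.min())

# 2. Log transformation: compresses large values and reduces skewness.
df["income_log"] = np.log(df["income"])

# 3. One-hot encoding: one binary column per category.
df = pd.concat([df, pd.get_dummies(df["color"], prefix="color")], axis=1)

print(df)
```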
Zip files are used to compress data, reduce storage requirements, and combine multiple files into a single package for easy sharing or archiving.

Data discretization is the process of converting continuous data into discrete categories or bins, often used in machine learning.

A tag cloud is a visual representation of text data, where the size of each word indicates its frequency or importance in the dataset.

Visual encoding refers to the use of visual elements (such as color, size, shape, or position) to represent data in visualizations.

Volume Characteristic of Data in Reference to Data Science: In data science, the volume characteristic refers to the massive amount of data generated from various sources, measured in gigabytes, terabytes, or more.

Examples of semistructured data include JSON, XML, and CSV files with irregular or nested structures.

A quartile divides a dataset into four equal parts. The first quartile (Q1) is the 25th percentile, and the third quartile (Q3) is the 75th percentile.

XML (Extensible Markup Language) is a data format used to structure data in a tree-like format with custom tags, enabling data exchange across systems.

Exploratory Data Analysis (EDA) is the process of analyzing and visualizing datasets to summarize their main characteristics, often with the help of graphical representations. The goal of EDA is to understand the structure of the data, detect patterns, spot anomalies, test hypotheses, and check assumptions. Steps in EDA:
1. Data Cleaning: Handle missing values, outliers, and duplicates.
2. Univariate Analysis: Analyze the distribution and summary statistics of individual variables (e.g., histograms, box plots).
3. Bivariate Analysis: Explore relationships between two variables (e.g., scatter plots, correlation matrices).
4. Multivariate Analysis: Investigate interactions between multiple variables (e.g., pair plots, principal component analysis).
Tools for EDA:
- Python Libraries: Pandas, Matplotlib, Seaborn, Plotly.
- Software: Tableau, Power BI.
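A minimal first-pass EDA sketch with Pandas covering the cleaning, univariate, and bivariate steps above; the small inline dataset is a stand-in for a real one.

```python
# Minimal first-pass EDA sketch (inline toy dataset stands in for real data).
import pandas as pd

df = pd.DataFrame({
    "age":    [25, 32, 47, 51, 38, 47],
    "income": [40_000, 52_000, 88_000, 95_000, 61_000, 88_000],
})

# Data cleaning checks: missing values and duplicate rows.
print(df.isna().sum())
print("duplicates:", df.duplicated().sum())

# Univariate analysis: distribution summary of each variable.
print(df.describe())

# Bivariate analysis: correlation between pairs of variables.
print(df.corr())
```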