
Data Structuring & Data Gathering
By Deepti
Introduction & Motivation
Talking Points:
• “Most of the hard work in machine learning isn’t the
modeling. It’s in collecting and preparing the data.”
• Poorly structured or noisy data leads to misleading
results, model drift, and failed predictions.
• Structured data is the foundation for successful ML—
just like good soil is critical for growing plants.
What is Data Gathering?
•The process of collecting raw data from relevant sources.
•In ML, data can be manually labeled, logged automatically, or collected via APIs, databases,
sensors, etc.
•Examples:
• User interaction logs from a mobile app
• Sensor data from IoT devices
• Survey forms, transaction logs, e-commerce activity
What is Data Gathering?
Data gathering is the process of collecting raw information from various sources for analysis.
It can come from:
• Internal systems (like databases or logs)
• External sources (APIs, web scraping, public datasets)
• User inputs (forms, surveys, feedback)

• Sensors or devices (IoT)

The goal is not just to collect a lot of data, but to collect the right data.

Why Data Gathering Matters
Poor data leads to poor insights. The phrase “garbage in, garbage out” is very real in data
science. Accurate, relevant, and timely data ensures:
• Better models
• Smarter decisions
• Increased trust in your results

Types of Data

During gathering, we deal with:


•Structured data: Numbers, categories, tables — easy to store in databases.
•Unstructured data: Text, images, videos, social media content — harder to manage but
extremely valuable.
•Semi-structured data: JSON, XML — somewhere in between.
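To make the distinction concrete, here is a minimal Python sketch (the order record and its fields are hypothetical) showing the same information as a structured table row and as semi-structured JSON:

```python
import json
import pandas as pd

# Structured: the record fits a fixed tabular schema.
structured = pd.DataFrame(
    [{"order_id": 101, "product": "Laptop", "price": 899.99}]
)

# Semi-structured: JSON carries the same facts plus a nested, optional field.
semi_structured = json.loads(
    '{"order_id": 101, "product": "Laptop", "price": 899.99,'
    ' "tags": ["electronics", "sale"]}'
)

print(structured.dtypes)        # fixed columns with fixed types
print(semi_structured["tags"])  # nested list a flat table cannot hold directly
```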
Common Data Gathering Methods

Let’s go through some key methods:


• Manual collection – Surveys, interviews, feedback forms
• Automated systems – Log files, transactions, sensors
• APIs – Pulling data from platforms like Twitter, Google, or public services
• Web scraping – Extracting data from websites (ethically and legally)
• Third-party datasets – Kaggle, government portals, research institutions
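As an illustration of the API method, here is a hedged sketch using the requests library; the endpoint URL, API key, query parameters, and response shape are all hypothetical placeholders, since every real service differs:

```python
import requests
import pandas as pd

# Hypothetical endpoint and API key; real services differ in URL,
# authentication, and response shape.
API_URL = "https://api.example.com/v1/events"
API_KEY = "YOUR_API_KEY"

response = requests.get(
    API_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    params={"since": "2025-05-01", "limit": 1000},
    timeout=30,
)
response.raise_for_status()  # fail loudly on HTTP errors

# Assume the API returns a JSON list of event records.
events = pd.DataFrame(response.json())
events.to_csv("raw_events.csv", index=False)
```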

Challenges in Data Gathering


•Incomplete or missing data
•Noisy or irrelevant information
•Data privacy and legal constraints
•Access limitations or API quotas
1. You're tasked with predicting customer churn for a subscription service. The CRM system
logs events like logins, payments, and cancellations. What's the best first step to gather
useful data for your model?
A. Train a churn model using only recent cancellation data
B. Export the event logs and structure them into user-level summaries
C. Ask customer support for anecdotal reasons why people churn
D. Use social media sentiment to guess why users are unhappy

Ans: B

2. You're building a recommendation system for an e-commerce app. The product team asks
you to include user interaction logs. The logs contain time stamps but are in different time
zones. What should you do before using the data?
A. Ignore time zones since timestamps are relative
B. Convert all timestamps to UTC or a consistent time zone
C. Use the raw timestamps as-is
D. Remove timestamps entirely to simplify the dataset

Ans: B
What is Data Structuring?
Slide: Structured vs Unstructured
Talking Points:
•Data structuring is converting raw, messy data into organized formats:
tables, arrays, JSON, etc.
•Essential for enabling feature extraction, analysis, and feeding into ML
models.
•Types of structuring:
• Tabular structuring
• Time-series transformation
• Categorical encoding
•Structured data improves reproducibility, automation, and traceability.
What is Data Structuring?
Data structuring is the process of organizing, cleaning, and formatting data so that it can be
easily accessed and analyzed. Think of it like cleaning and sorting ingredients before you
cook. You may have all the right data, but if it's not structured, it's hard to work with.

Why Data Structuring Matters?


Structured data:
•Improves data quality and consistency
•Enables faster and more accurate analysis
•Reduces errors in models and dashboards
•Makes collaboration and data sharing easier
Common Issues in Raw Data:
• Missing values
• Duplicate records
• Inconsistent formats (e.g., “NY” vs. “New York”)
• Outliers or invalid entries
• Unnecessary or redundant fields
Key Steps in Structuring Data
• Cleaning – Remove duplicates, fill or drop missing values, fix typos.
• Formatting – Standardize dates, currencies, categories, etc.
• Transforming – Convert columns, normalize values, create new features.
• Labeling – Add headers, metadata, or classification tags.
• Storing – Organize in tables, files, or databases that are easy to access.
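A minimal pandas sketch of these steps, assuming a hypothetical raw_orders.csv with the kinds of problems listed above:

```python
import pandas as pd

# Hypothetical raw export with the kinds of problems listed above.
df = pd.read_csv("raw_orders.csv")

# Cleaning: drop exact duplicates and rows missing the key fields.
df = df.drop_duplicates()
df = df.dropna(subset=["order_id", "order_date"])

# Formatting: standardize dates and category spellings.
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
df["state"] = df["state"].replace({"NY": "New York"})

# Transforming: derive a new feature from an existing column.
df["order_month"] = df["order_date"].dt.month

# Labeling and storing: clear column names, saved in an accessible format.
df = df.rename(columns={"amt": "order_amount_usd"})
df.to_csv("structured_orders.csv", index=False)
```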
Structured vs. Unstructured Data
Let’s quickly compare:
• Structured: Excel tables, SQL databases, CSV files. Easy to search, filter,
and model.
• Unstructured: PDFs, emails, images, audio. Harder to organize, but still
valuable; often structured later using NLP, OCR, etc.
Tools for Structuring Data
Some popular tools:
• Excel/Google Sheets – great for small datasets
• Python (Pandas, NumPy) – powerful for automation and cleaning
• SQL – ideal for relational data
• ETL tools – like Talend, Apache NiFi, or Alteryx
• Cloud platforms – AWS, Azure, GCP offer data pipelines

Real-World Example
Suppose we collected e-commerce data with thousands of transactions.
We’d need to:
•Standardize product names
•Remove incomplete records
•Convert all currencies to USD
•Split full names into first and last names
•Store everything in a consistent schema (like “date | product | price | customer ID”)
Only then can we perform reliable analysis or feed it into ML models.
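A possible pandas sketch of this example; the file name, column names, and exchange rates are illustrative assumptions, not a prescribed implementation:

```python
import pandas as pd

df = pd.read_csv("ecommerce_transactions.csv")  # hypothetical export

# Standardize product names.
df["product"] = df["product"].str.strip().str.title()

# Remove incomplete records.
df = df.dropna(subset=["product", "price", "currency", "customer_name"])

# Convert all currencies to USD (illustrative fixed rates).
rates_to_usd = {"USD": 1.0, "EUR": 1.09, "GBP": 1.27}
df["price_usd"] = df["price"] * df["currency"].map(rates_to_usd)

# Split full names into first and last names.
df[["first_name", "last_name"]] = df["customer_name"].str.split(" ", n=1, expand=True)

# Store everything in a consistent schema.
df = df[["date", "product", "price_usd", "customer_id", "first_name", "last_name"]]
df.to_parquet("transactions_clean.parquet", index=False)
```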
1. You're analyzing task logs for productivity trends. Each row contains a task description,
user ID, and timestamp, but no clear task categories. What should you do to make this
dataset usable for supervised ML?
A. Remove the task description column since it’s unstructured
B. Create a new "task category" feature by applying text classification or manual tagging
C. Ignore categories and treat all tasks as the same
D. Randomly assign category labels to balance the dataset
Ans: B

2. Your dataset has timestamps stored as strings in multiple formats (e.g., “05/31/25”,
“2025-05-31T14:00”). What should be your first step in structuring this data for ML
modeling?
A. Use the strings as-is since the date is visible
B. Drop all inconsistent rows
C. Standardize all timestamps to a common format and convert to datetime objects
D. Extract only the year from the strings for simplicity
Ans: C
Task Logs as Datasets
Task logs are chronological records of actions taken in a system—each line is an
event.
•Contains rich behavioral data:
• Who did what?
• When was it done?
• What kind of task was it?
• What was the outcome?
•In ML:
• Logs can be used to build classification, prediction, and recommendation
models.
• Examples: Predict next action, flag anomalies, estimate completion time.
Use Case: Project management system predicting delays based on task history.
• [Task ID] | [User] | [Action] | [Category] | [Time Stamp] | [Status]
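As a small illustration, the pipe-delimited format above could be parsed into a DataFrame roughly like this (the two log lines are invented examples):

```python
import io
import pandas as pd

# Two hypothetical log lines in the format shown above.
raw_log = io.StringIO(
    "T-101 | alice | Create Task | Reporting | 2025-05-31T09:05:00Z | Open\n"
    "T-101 | bob   | Complete    | Reporting | 2025-05-31T11:40:00Z | Done\n"
)

cols = ["task_id", "user", "action", "category", "timestamp", "status"]
logs = pd.read_csv(raw_log, sep="|", names=cols, skipinitialspace=True)

# Trim stray whitespace and parse the timestamp column.
logs = logs.apply(lambda s: s.str.strip() if s.dtype == "object" else s)
logs["timestamp"] = pd.to_datetime(logs["timestamp"], utc=True)
print(logs)
```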
What is a Task Log?
A Task Log is a structured record of the work being done in a project.
It keeps track of:

•What was done


•Who did it
•When it was done
•Why it was done

Why Maintain a Task Log?


There are several key reasons:
•Transparency – Everyone knows what’s happening
•Accountability – Each task has an owner
•Reproducibility – You or someone else can retrace your steps later
•Progress Tracking – Helps in managing deadlines and deliverables
•Error Handling – Easier to trace back where something went wrong
What to Include in a Task Log?
A simple task log might include:

•Task name or ID
•Description of the task
•Assigned person
•Start and end date
•Status (Pending, In Progress, Completed)
•Notes or observations
•Next steps (if any)
Importance of Time Stamps in ML
Slide: Time-Based Features in Machine Learning
•Timestamps provide a temporal context to data.
•Essential for time-series models, sequential pattern recognition, churn prediction.
•Can be used to derive:
• Task duration
• Task frequency
• Gaps between tasks
• Peak activity hours
•Considerations:
• Time zones
• Missing values
• Consistent formats (ISO 8601, Unix)
• Granularity (second/minute/hour)
ML Models That Use Time:
•LSTM
•ARIMA
•Prophet
•Transformer-based models
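A short pandas sketch of how such time-based features might be derived; the events table and its columns are hypothetical:

```python
import pandas as pd

# Hypothetical task events, already converted to UTC datetimes.
events = pd.DataFrame({
    "user": ["alice", "alice", "alice", "bob"],
    "start": pd.to_datetime(
        ["2025-05-31 09:00", "2025-05-31 11:30", "2025-06-01 10:00", "2025-05-31 14:00"]),
    "end": pd.to_datetime(
        ["2025-05-31 10:15", "2025-05-31 12:00", "2025-06-01 10:45", "2025-05-31 15:30"]),
})

# Task duration in minutes.
events["duration_min"] = (events["end"] - events["start"]).dt.total_seconds() / 60

# Gap since the user's previous task.
events = events.sort_values(["user", "start"])
events["gap_hours"] = (
    events.groupby("user")["start"].diff().dt.total_seconds() / 3600
)

# Peak-activity-hour and task-frequency style features.
events["start_hour"] = events["start"].dt.hour
tasks_per_day = events.groupby(["user", events["start"].dt.date]).size()
print(events)
print(tasks_per_day)
```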
What is a Timestamp?
A timestamp is a sequence of characters or encoded information that identifies when a certain
event occurred. In datasets, it usually represents the date and time something happened, like
when a user logged in, when a transaction was made, or when a sensor recorded data.
Examples:
•2025-05-31 10:15:30
•31/05/2025 10:15 AM
•May 31, 2025 – 10:15

Why Timestamps Matter?


•Track behavior over time (e.g., customer login patterns)
•Analyze trends (e.g., weekly sales, peak hours)
•Build time-based features (e.g., day of the week, hour of the day)
•Work with time series models (e.g., forecasting demand)
•Ensure data integrity (e.g., correct sequencing of events)
Common Timestamp Formats

Timestamps can appear in many formats, such as:

•ISO 8601: 2025-05-31T10:15:30Z


•UNIX Timestamp: 1748686530 (seconds since Jan 1, 1970; the same instant as the ISO 8601 example above)
•Custom formats: 31-May-2025 10:15 AM
When working with timestamps, standardizing the format is key to avoid
errors.
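A small pandas sketch that brings the example formats above onto one timezone-aware representation (the Unix value is the one quoted earlier, i.e., the same instant as the ISO 8601 example):

```python
import pandas as pd

# The example formats above, expressed as timezone-aware datetimes.
iso = pd.to_datetime("2025-05-31T10:15:30Z", utc=True)
custom = pd.to_datetime("31-May-2025 10:15 AM",
                        format="%d-%b-%Y %I:%M %p").tz_localize("UTC")
unix = pd.to_datetime(1748686530, unit="s", utc=True)

print(iso, custom, unix, sep="\n")
print(iso == unix)  # True: both encode 2025-05-31 10:15:30 UTC
```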
1. You’re working with a system that logs user activities like "Login", "Create Task", "Submit File", and "Logout".
Each log includes a timestamp and user ID. You need to predict user churn. What's your first structuring step?
A. Remove repetitive actions from logs
B. Aggregate logs into user sessions and extract features like session length and action frequency
C. Keep logs as-is and apply linear regression
D. Randomly sample logs for faster processing
Ans: B

2. Your task logs are collected from multiple microservices and stored in separate files. Some actions (e.g.,
“Approve Task”) span multiple services. What’s the best approach to make the logs useful for ML?
A. Analyze one microservice log at a time
B. Join the logs based on common task IDs and order them chronologically
C. Randomly merge logs
D. Drop entries with missing service info
Ans: B
Data Structuring Workflow for ML

Slide: Step-by-Step Pipeline


Workflow Steps:
1.Data Ingestion – From logs, sensors, APIs
2.Data Cleaning – Remove duplicates, nulls, inconsistencies
3.Data Normalization – Format timestamps, unify category names
4.Feature Engineering – Time since last task, task density, encoded categories
5.Splitting & Labeling – For training, validation, testing
Tools Mentioned:
•pandas, numpy, datetime, scikit-learn
•ETL tools: Apache Airflow, Talend
•Data validation: great_expectations, pandas-profiling
Data Ingestion
This is where we bring the data into our environment. Sources may include:
•CSV/Excel files
•Databases
•APIs
•Cloud storage
•Web scraping
Tools: pandas.read_csv(), SQL connectors, ETL platforms
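A minimal ingestion sketch, assuming a hypothetical CSV export and a local SQLite database; other sources only change the connector used:

```python
import sqlite3
import pandas as pd

# From a flat file (hypothetical path).
orders = pd.read_csv("exports/orders.csv")

# From a database via SQL (SQLite shown for simplicity; other databases
# only require a different connection object or SQLAlchemy engine).
conn = sqlite3.connect("company.db")
users = pd.read_sql_query("SELECT id, name, signup_date FROM users", conn)
conn.close()

print(orders.shape, users.shape)
```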

Data Inspection
Before touching the data, we need to understand its structure and quality.
•View sample rows
•Check data types
•Identify missing or inconsistent values
•Look at value distributions
Example: Use df.head(), df.info(), df.describe() in Python
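A quick inspection sketch in pandas (the file path is a placeholder carried over from the ingestion step):

```python
import pandas as pd

df = pd.read_csv("exports/orders.csv")  # hypothetical file from the ingestion step

print(df.head())        # sample rows
df.info()               # column names, dtypes, non-null counts (prints directly)
print(df.describe())    # distributions of numeric columns
print(df.isna().sum())  # missing values per column
print(df.nunique())     # distinct values per column; helps spot inconsistent labels
```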
Data Cleaning
“This is where we remove or fix problems in the raw data”
•Handle missing values
•Correct typos or anomalies
•Remove duplicates
•Standardize formats (dates, currencies, categories)
Cleaning ensures your data is accurate, consistent, and complete.
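A possible cleaning sketch; the column names and replacement rules are illustrative assumptions:

```python
import pandas as pd

df = pd.read_csv("exports/orders.csv")  # hypothetical raw data

# Handle missing values: fill where a default makes sense, drop otherwise.
df["discount"] = df["discount"].fillna(0)
df = df.dropna(subset=["order_id", "price"])

# Correct typos and inconsistent labels.
df["city"] = df["city"].str.strip().replace({"NYC": "New York", "N.Y.": "New York"})

# Remove duplicates.
df = df.drop_duplicates(subset=["order_id"])

# Standardize formats (dates and currency precision).
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
df["price"] = df["price"].round(2)
```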
Data Transformation
Now we reshape and enhance the data for analysis or modeling:
•Convert data types
•Normalize or scale numeric values
•Encode categorical data (e.g., One-Hot Encoding)
•Create new features (e.g., extracting “month” from a date)
This step adds analytical value to your dataset.
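A brief transformation sketch using pandas and scikit-learn; the toy DataFrame below stands in for real data:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({
    "price": [10.0, 250.0, 99.5],
    "category": ["books", "electronics", "books"],
    "order_date": pd.to_datetime(["2025-05-01", "2025-05-15", "2025-06-02"]),
})

# Normalize or scale numeric values to a common range.
df["price_scaled"] = MinMaxScaler().fit_transform(df[["price"]]).ravel()

# Encode categorical data (One-Hot Encoding).
df = pd.get_dummies(df, columns=["category"], prefix="cat")

# Create new features, e.g. extracting the month from a date.
df["order_month"] = df["order_date"].dt.month
print(df)
```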
Data Integration

If you have data from multiple sources, you’ll need to merge or join them:
•Combine tables based on keys (e.g., user ID, transaction ID)
• Ensure consistent naming and units
•Resolve conflicts in overlapping fields
Tools: pd.merge(), SQL JOIN, Power BI or Tableau connectors.
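A minimal integration sketch with pd.merge(); the two toy tables and the choice of a left join are illustrative:

```python
import pandas as pd

orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "user_id": [10, 11, 10],
    "amount_usd": [20.0, 35.5, 12.0],
})
users = pd.DataFrame({
    "user_id": [10, 11],
    "country": ["US", "DE"],
})

# Combine tables on a shared key. A left join keeps every order even if
# the matching user record is missing, which makes integration gaps visible.
merged = pd.merge(orders, users, on="user_id", how="left")
print(merged)
```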

Data Validation
Once transformed and integrated, we validate the dataset:
•Run sanity checks (e.g., no negative sales)
•Check column ranges and uniqueness
•Ensure row counts match expectations
Validation helps catch unnoticed errors before analysis or modeling.
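A simple validation sketch using plain assertions; the checks mirror the bullets above, and a library such as great_expectations could express the same rules more formally:

```python
import pandas as pd

df = pd.DataFrame({
    "order_id": [1, 2, 3],
    "sales": [20.0, 35.5, 12.0],
    "order_date": pd.to_datetime(["2025-05-01", "2025-05-15", "2025-06-02"]),
})

# Sanity checks: fail fast if an assumption about the data is violated.
assert (df["sales"] >= 0).all(), "Negative sales values found"
assert df["order_id"].is_unique, "Duplicate order IDs found"
assert df["order_date"].between("2020-01-01", "2030-12-31").all(), "Date out of range"
assert len(df) == 3, "Unexpected row count"
print("All validation checks passed")
```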
Data Storage
Finally, store the structured data for use:
• Save as CSV, JSON, or Parquet files
• Load into databases
• Push to cloud storage (e.g., S3, Azure Blob)
Choose a format that aligns with your workflow and team needs.
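A short storage sketch; the output directory and file names are placeholders, and the Parquet line assumes pyarrow or fastparquet is installed:

```python
from pathlib import Path
import pandas as pd

df = pd.DataFrame({"date": ["2025-05-31"], "product": ["Laptop"], "price_usd": [899.99]})

Path("clean").mkdir(exist_ok=True)
df.to_csv("clean/orders.csv", index=False)           # human-readable, universal
df.to_parquet("clean/orders.parquet", index=False)   # compact and typed (needs pyarrow)
df.to_json("clean/orders.json", orient="records")    # easy to exchange with services
```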

What is Feature Engineering?


Feature engineering is the process of:
•Creating new input variables (features)
•Transforming existing features
•Selecting the most relevant features
that help machine learning models perform better.
“It’s like crafting the right questions to get better answers from your data.”
Why Feature Engineering Matters
Feature engineering helps:
•Improve model accuracy
•Reduce overfitting
•Reveal patterns and relationships
•Simplify complex data
Without meaningful features, even the most advanced ML model won’t perform well.
Common Feature Engineering Techniques
1.Binning – Grouping continuous data (e.g., age into age groups)
2.Encoding – Converting categories into numbers (e.g., One-Hot, Label Encoding)
3.Normalization/Scaling – Bringing numeric features to the same range
4.Date/Time Extraction – From timestamps to day, hour, or season
5.Text Features – Word count, sentiment score, TF-IDF
6.Interaction Features – Multiplying or combining two variables
7.Log Transformation – Reducing skewness in numerical features
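A compact sketch of several of these techniques on a toy DataFrame (all column names and bin edges are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [19, 34, 52, 71],
    "city": ["NY", "LA", "NY", "SF"],
    "income": [30_000, 85_000, 120_000, 40_000],
    "signup": pd.to_datetime(["2025-01-05", "2025-03-20", "2025-05-31", "2025-02-14"]),
})

# Binning: group continuous age into labeled ranges.
df["age_group"] = pd.cut(df["age"], bins=[0, 25, 45, 65, 120],
                         labels=["<25", "25-44", "45-64", "65+"])

# Encoding: one-hot encode the city category.
df = pd.get_dummies(df, columns=["city"], prefix="city")

# Log transformation: reduce skew in income.
df["log_income"] = np.log1p(df["income"])

# Date/time extraction: day of week from the signup timestamp.
df["signup_dow"] = df["signup"].dt.dayofweek

# Interaction feature: combine two variables.
df["income_per_year_of_age"] = df["income"] / df["age"]
print(df.head())
```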
Splitting and Labeling
“What is Splitting?”
Splitting is the process of dividing your dataset into:
•Training Set – Data used to train the model
•Validation Set – (Optional) For tuning hyperparameters
•Test Set – Data used to evaluate the final model
A common split ratio is:
•70% Training
•15% Validation
•15% Testing.
Use train_test_split() in Python (from sklearn) to do this.
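A sketch of the 70/15/15 split using two calls to train_test_split, with stratification on a toy label column:

```python
from sklearn.model_selection import train_test_split
import pandas as pd

# Hypothetical structured dataset: X = features, y = label.
df = pd.DataFrame({"f1": range(100), "f2": range(100, 200), "label": [0, 1] * 50})
X, y = df[["f1", "f2"]], df["label"]

# First carve off 70% for training, then split the remaining 30% in half
# (15% validation, 15% test). stratify keeps class proportions stable.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.50, stratify=y_tmp, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 70, 15, 15
```

Splitting twice is simply one convenient way to get three sets from a single helper; fixing random_state makes the split reproducible.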
Why Splitting is Important?
Without splitting, the model might “memorize” the data and perform well only on
what it has already seen. This leads to overfitting: good performance on training
data, poor performance in the real world.
What is Labeling?
Labeling is the process of identifying the target variable — the value you want the model to
predict.
Examples:
•In a spam detection model:
•Input = Email text
•Label = Spam or Not Spam
•In a sales prediction model:
•Input = Product, Region, Month
•Label = Sales amount
Accurate and clean labels are essential for supervised learning.
Best Practices for Splitting & Labeling
•Always randomize before splitting
•Use stratified splitting for imbalanced classes (e.g., fraud detection)
•Ensure no data leakage — don’t use future data to predict the past
•Keep labels separate from features in your processing pipeline

What is ETL?
The ETL process involves three major steps:
1.Extract – Pulling data from various sources (databases, APIs, files)
2.Transform – Cleaning, formatting, and structuring data
3.Load – Moving the transformed data into a destination like a database, data
warehouse, or data lake.
ETL makes raw data usable by converting it into a structured, consistent format.
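A compact ETL sketch, assuming a hypothetical CSV source and a local SQLite database as the destination:

```python
import sqlite3
import pandas as pd

# Extract: pull raw data from a source file (hypothetical path).
raw = pd.read_csv("exports/raw_sales.csv")

# Transform: clean, format, and structure.
raw = raw.drop_duplicates()
raw["sale_date"] = pd.to_datetime(raw["sale_date"], errors="coerce")
raw = raw.dropna(subset=["sale_date", "amount"])

# Load: write the structured result into a destination database.
conn = sqlite3.connect("warehouse.db")
raw.to_sql("sales", conn, if_exists="replace", index=False)
conn.close()
```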
Case Study Example

Slide: Mini Case Study – Predicting Employee Overload from Task Logs
Scenario:
•Input: Task logs from an internal tool
•Structured features:
• Number of tasks/day
• Avg task completion time
• Category spread
• Time active per day
•Model: Random Forest Classifier to predict overload risk
Outcome:
•Model helps assign support proactively
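A rough sketch of how such a model could be trained; the feature values and the overload label below are synthetic stand-ins for real task-log features, not the actual case-study data:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Hypothetical per-employee features like those listed above.
rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame({
    "tasks_per_day": rng.poisson(6, n),
    "avg_completion_hours": rng.normal(3.0, 1.0, n).clip(0.2),
    "category_spread": rng.integers(1, 6, n),
    "active_hours_per_day": rng.normal(7.0, 1.5, n).clip(1),
})
# Synthetic overload label for illustration only.
y = ((X["tasks_per_day"] > 8) & (X["active_hours_per_day"] > 8)).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
```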
1)Your company wants to analyze productivity trends from an internal project management tool. It logs
all tasks with timestamps, user actions, and task categories. However, data is spread across multiple
inconsistent Excel sheets. What’s the best first step in data gathering?
A. Start training a model using one of the Excel files
B. Manually clean each file separately
C. Consolidate the files into a central structured format (e.g., CSV or database) and standardize the
schema
D. Request employees to input their tasks again using a new format
Ans: C

2. You’re reviewing task logs that include timestamped actions: “Start Task,” “Pause,” “Resume,” and
“Complete.” You need to structure this data to predict delays. What’s the best approach?
A. Count the total number of actions per task
B. Calculate actual task durations by pairing relevant timestamp events
C. Focus only on “Start” and “Complete” actions
D. Drop tasks that contain “Pause” or “Resume” events

Ans: B
3. While gathering user activity data for an ML model, you notice some logs are missing timestamps, and others
have them in mixed formats. What’s the correct approach during data structuring?
A. Ignore the missing timestamps—they’re not critical
B. Convert all timestamps to a consistent format (e.g., ISO 8601) and impute missing values based on nearest actions
C. Replace missing timestamps with zeros
D. Use the timestamp column as a text string
✅ Correct Answer: B
Why: Consistent, complete timestamps are crucial for time-based features and modeling.

4. You're building a classifier to categorize incoming tasks based on historical logs. The “Task Type” column has 300
unique values, many of which appear only once. What’s the most efficient way to handle this during structuring?
A. One-hot encode all 300 values
B. Group rare values into “Other” and apply frequency encoding
C. Drop the column
D. Encode all categories as integers from 1 to 300
✅ Correct Answer: B
Why: Grouping and encoding reduces noise and dimensionality while preserving information.
