Data Structuring & Data Gathering
Data Gathering
By Deepti
Introduction & Motivation
Talking Points:
• “Most of the hard work in machine learning isn’t the
modeling. It’s in collecting and preparing the data.”
• Poorly structured or noisy data leads to misleading
results, model drift, and failed predictions.
• Structured data is the foundation for successful ML—
just like good soil is critical for growing plants.
What is Data Gathering?
•The process of collecting raw data from relevant sources.
•In ML, data can be manually labeled, logged automatically, or collected via APIs, databases,
sensors, etc.
•Examples:
• User interaction logs from a mobile app
• Sensor data from IoT devices
• Survey forms, transaction logs, e-commerce activity
What is Data Gathering?
Data gathering is the process of collecting raw information from various sources for analysis.
It can come from:
• Internal systems (like databases or logs)
• External sources (APIs, web scraping, public datasets)
• User inputs (forms, surveys, feedback)
• Sensors or devices (IoT)
The goal is not just to collect a lot of data, but to collect the right data.
Why Data Gathering Matters
Poor data leads to poor insights. The phrase “garbage in, garbage out” is very real in data
science. Accurate, relevant, and timely data ensures:
• Better models
• Smarter decisions
• Increased trust in your results
Types of Data
Ans: B
2. You're building a recommendation system for an e-commerce app. The product team asks
you to include user interaction logs. The logs contain timestamps, but they are recorded in
different time zones. What should you do before using the data?
A. Ignore time zones since timestamps are relative
B. Convert all timestamps to UTC or a consistent time zone
C. Use the raw timestamps as-is
D. Remove timestamps entirely to simplify the dataset
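A minimal pandas sketch of option B, converting mixed time-zone timestamps to UTC; the column names and zone values are illustrative assumptions:

import pandas as pd

# Hypothetical interaction log: timestamps recorded in each user's local zone.
logs = pd.DataFrame({
    "user_id": [101, 102, 103],
    "event": ["click", "purchase", "click"],
    "timestamp": ["2025-05-31 10:15:00", "2025-05-31 19:45:00", "2025-05-31 02:30:00"],
    "time_zone": ["America/Los_Angeles", "Europe/Berlin", "Asia/Kolkata"],
})

# Localize each timestamp to its recorded zone, then convert everything to UTC
# so durations and orderings are comparable across users.
logs["timestamp"] = pd.to_datetime(logs["timestamp"])
logs["timestamp_utc"] = logs.apply(
    lambda row: row["timestamp"].tz_localize(row["time_zone"]).tz_convert("UTC"),
    axis=1,
)

print(logs[["user_id", "event", "timestamp_utc"]])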
What is Data Structuring?
Slide: Structured vs Unstructured
Talking Points:
•Data structuring is converting raw, messy data into organized formats:
tables, arrays, JSON, etc.
•Essential for enabling feature extraction, analysis, and feeding into ML
models.
•Types of structuring:
• Tabular structuring
• Time-series transformation
• Categorical encoding
•Structured data improves reproducibility, automation, and traceability.
What is Data Structuring?
Data structuring is the process of organizing, cleaning, and formatting data so that it can be
easily accessed and analyzed. Think of it like cleaning and sorting ingredients before you
cook. You may have all the right data, but if it’s not structured, it’s hard to work with.
Real-World Example
Suppose we collected e-commerce data with thousands of transactions.
We’d need to:
•Standardize product names
•Remove incomplete records
•Convert all currencies to USD
•Split full names into first and last names
•Store everything in a consistent schema (like “date | product | price | customer ID”)
Only then can we perform reliable analysis or feed it into ML models.
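A minimal pandas sketch of these structuring steps; the column names, exchange rates, and sample rows are illustrative assumptions:

import pandas as pd

# Hypothetical raw e-commerce transactions.
raw = pd.DataFrame({
    "date": ["2025-05-01", "2025-05-02", "2025-05-02"],
    "product": [" iphone 15 ", "IPHONE 15", None],
    "price": [799.0, 749.0, 20.0],
    "currency": ["USD", "EUR", "GBP"],
    "customer_name": ["Jane Doe", "John Smith", "Ana Lima"],
})

# Illustrative exchange rates; in practice these come from a reference table.
usd_rates = {"USD": 1.0, "EUR": 1.09, "GBP": 1.27}

df = raw.dropna(subset=["product"]).copy()                      # remove incomplete records
df["product"] = df["product"].str.strip().str.title()           # standardize product names
df["price_usd"] = df["price"] * df["currency"].map(usd_rates)   # convert all currencies to USD
df[["first_name", "last_name"]] = df["customer_name"].str.split(" ", n=1, expand=True)

# Store everything in a consistent schema.
structured = df[["date", "product", "price_usd", "first_name", "last_name"]]
print(structured)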
1. You're analyzing task logs for productivity trends. Each row contains a task description,
user ID, and timestamp, but no clear task categories. What should you do to make this
dataset usable for supervised ML?
A. Remove the task description column since it’s unstructured
B. Create a new "task category" feature by applying text classification or manual tagging
C. Ignore categories and treat all tasks as the same
D. Randomly assign category labels to balance the dataset
Ans: B
2. Your dataset has timestamps stored as strings in multiple formats (e.g., “05/31/25”,
“2025-05-31T14:00”). What should be your first step in structuring this data for ML
modeling?
A. Use the strings as-is since the date is visible
B. Drop all inconsistent rows
C. Standardize all timestamps to a common format and convert to datetime objects
D. Extract only the year from the strings for simplicity
Ans: C
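A minimal sketch of option C using pandas (format="mixed" requires pandas 2.0 or later); the sample strings follow the formats in the question:

import pandas as pd

# Hypothetical column mixing several string formats.
ts = pd.Series(["05/31/25", "2025-05-31T14:00", "May 31, 2025 10:15"])

# Parse each string into a proper datetime object; format="mixed" lets pandas
# infer the format per element, and errors="coerce" turns unparseable values
# into NaT so they can be inspected or imputed later.
parsed = pd.to_datetime(ts, format="mixed", errors="coerce")
print(parsed)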
Task Logs as Datasets
Task logs are chronological records of actions taken in a system—each line is an
event.
•Contains rich behavioral data:
• Who did what?
• When was it done?
• What kind of task was it?
• What was the outcome?
•In ML:
• Logs can be used to build classification, prediction, and recommendation
models.
• Examples: Predict next action, flag anomalies, estimate completion time.
Use Case: Project management system predicting delays based on task history.
• [Task ID] | [User] | [Action] | [Category] | [Time Stamp] | [Status]
What is a Task Log?
A Task Log is a structured record of the work being done in a project.
It keeps track of:
•Task name or ID
•Description of the task
•Assigned person
•Start and end date
•Status (Pending, In Progress, Completed)
•Notes or observations
•Next steps (if any)
Importance of Time Stamps in ML
Slide: Time-Based Features in Machine Learning
•Timestamps provide a temporal context to data.
•Essential for time-series models, sequential pattern recognition, churn prediction.
•Can be used to derive:
• Task duration
• Task frequency
• Gaps between tasks
• Peak activity hours
•Considerations:
• Time zones
• Missing values
• Consistent formats (ISO 8601, Unix)
• Granularity (second/minute/hour)
ML Models That Use Time:
•LSTM
•ARIMA
•Prophet
•Transformer-based models
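A minimal pandas sketch deriving some of the time-based features listed above (task frequency, gaps between tasks, peak activity hours); the event log and its column names are illustrative assumptions:

import pandas as pd

# Hypothetical per-user event log with consistent timestamps.
events = pd.DataFrame({
    "user": ["a", "a", "a", "b", "b"],
    "timestamp": pd.to_datetime([
        "2025-05-30 09:05", "2025-05-30 09:40", "2025-05-31 16:20",
        "2025-05-30 11:00", "2025-05-30 22:45",
    ]),
}).sort_values(["user", "timestamp"])

# Task frequency: events per user per day.
events["date"] = events["timestamp"].dt.date
frequency = events.groupby(["user", "date"]).size().rename("events_per_day")

# Gaps between consecutive events for the same user.
events["gap"] = events.groupby("user")["timestamp"].diff()

# Peak activity hours across the whole log.
peak_hours = events["timestamp"].dt.hour.value_counts().sort_index()

print(frequency, events[["user", "timestamp", "gap"]], peak_hours, sep="\n\n")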
What is a Timestamp?
A timestamp is a sequence of characters or encoded information that identifies when a certain
event occurred. In datasets, it usually represents the date and time something happened, like
when a user logged in, when a transaction was made, or when a sensor recorded data.
Examples:
•2025-05-31 10:15:30
•31/05/2025 10:15 AM
•May 31, 2025 – 10:15
2. Your task logs are collected from multiple microservices and stored in separate files. Some actions (e.g.,
“Approve Task”) span multiple services. What’s the best approach to make the logs useful for ML?
A. Analyze one microservice log at a time
B. Join the logs based on common task IDs and order them chronologically
C. Randomly merge logs
D. Drop entries with missing service info
Ans: B
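A minimal pandas sketch of option B, stacking hypothetical per-service logs and ordering them chronologically by task ID:

import pandas as pd

# Hypothetical logs emitted by two microservices; both carry a shared task_id.
service_a = pd.DataFrame({
    "task_id": [1, 2],
    "action": ["Create Task", "Create Task"],
    "timestamp": pd.to_datetime(["2025-05-31 09:00", "2025-05-31 09:05"]),
})
service_b = pd.DataFrame({
    "task_id": [1, 2],
    "action": ["Approve Task", "Approve Task"],
    "timestamp": pd.to_datetime(["2025-05-31 09:30", "2025-05-31 10:10"]),
})

# Stack the per-service logs, then order events chronologically within each task
# so cross-service actions line up as a single timeline.
combined = (
    pd.concat([service_a, service_b], ignore_index=True)
      .sort_values(["task_id", "timestamp"])
)
print(combined)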
Data Structuring Workflow for ML
Data Inspection
Before touching the data, we need to understand its structure and quality.
•View sample rows
•Check data types
•Identify missing or inconsistent values
•Look at value distributions
Example: Use df.head(), df.info(), df.describe() in Python
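A minimal sketch of this inspection step; the sample frame stands in for a real task-log export:

import pandas as pd

# A small illustrative task-log frame; in practice this would come from
# pd.read_csv("task_logs.csv") or a database query (both sources are assumptions).
df = pd.DataFrame({
    "task_id": [1, 2, 3],
    "category": ["Bug", "Feature", None],
    "hours": [2.5, 8.0, 4.0],
})

print(df.head())        # view sample rows
df.info()               # check data types and non-null counts
print(df.describe())    # look at value distributions for numeric columns
print(df.isna().sum())  # identify missing values per column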
Data Cleaning
“This is where we remove or fix problems in the raw data”
•Handle missing values
•Correct typos or anomalies
•Remove duplicates
•Standardize formats (dates, currencies, categories)
Cleaning ensures your data is accurate, consistent, and complete.
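A minimal pandas sketch of these cleaning steps on a small illustrative frame:

import pandas as pd

# Hypothetical raw frame with the kinds of problems listed above.
df = pd.DataFrame({
    "task": ["Review PR", "Review PR", "Write docs"],
    "category": ["Dev", " dev", "Docs"],
    "hours": [2.0, 2.0, None],
})

df["category"] = df["category"].str.strip().str.title()   # fix casing/whitespace anomalies
df["hours"] = df["hours"].fillna(df["hours"].median())     # handle missing values
df = df.drop_duplicates()                                  # remove duplicates
print(df)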
Data Transformation
Now we reshape and enhance the data for analysis or modeling.
•Convert data types
•Normalize or scale numeric values
•Encode categorical data (e.g., One-Hot Encoding)
•Create new features (e.g., extracting “month” from a date)
This step adds analytical value to your dataset.
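A minimal pandas sketch of these transformation steps; the columns and values are illustrative assumptions:

import pandas as pd

df = pd.DataFrame({
    "completed_on": pd.to_datetime(["2025-01-15", "2025-03-02", "2025-03-20"]),
    "category": ["Bug", "Feature", "Bug"],
    "hours": [2.0, 8.0, 5.0],
})

# Create a new feature: extract the month from a date column.
df["month"] = df["completed_on"].dt.month

# Normalize a numeric column to the 0-1 range (min-max scaling).
df["hours_scaled"] = (df["hours"] - df["hours"].min()) / (df["hours"].max() - df["hours"].min())

# One-hot encode the categorical column.
df = pd.get_dummies(df, columns=["category"])
print(df)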
Data Integration
If you have data from multiple sources, you’ll need to merge or join them.
•Combine tables based on keys (e.g., user ID, transaction ID)
• Ensure consistent naming and units
•Resolve conflicts in overlapping fields
Tools: pd.merge(), SQL JOIN, Power BI or Tableau connectors.
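A minimal sketch of a key-based join with pd.merge(); the tables and keys are illustrative assumptions:

import pandas as pd

tasks = pd.DataFrame({"task_id": [1, 2, 3], "user_id": [10, 11, 10]})
users = pd.DataFrame({"user_id": [10, 11], "team": ["Platform", "Mobile"]})

# Join the two sources on their shared key; "left" keeps every task even if
# a user record is missing, which makes gaps easy to spot afterwards.
merged = pd.merge(tasks, users, on="user_id", how="left")
print(merged)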
Data Validation
Once transformed and integrated, we validate the dataset:
•Run sanity checks (e.g., no negative sales)
•Check column ranges and uniqueness
•Ensure row counts match expectations
Validation helps catch unnoticed errors before analysis or modeling.
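A minimal sketch of such sanity checks using plain assertions; the expected row count is an assumption:

import pandas as pd

df = pd.DataFrame({
    "order_id": [1, 2, 3],
    "sales": [250.0, 99.0, 10.5],
})

# Sanity checks: fail loudly before the data reaches analysis or modeling.
assert (df["sales"] >= 0).all(), "Found negative sales values"
assert df["order_id"].is_unique, "Duplicate order IDs detected"
assert len(df) == 3, "Unexpected row count"   # expected count is an assumption
print("All validation checks passed")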
Data Storage
Finally, store the structured data for use:
• Save as CSV, JSON, or Parquet files
• Load into databases
• Push to cloud storage (e.g., S3, Azure Blob)
Choose a format that aligns with your workflow and team needs.
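A minimal sketch of the file-based storage options; writing Parquet assumes pyarrow (or fastparquet) is installed:

import pandas as pd

df = pd.DataFrame({"date": ["2025-05-31"], "product": ["Phone"], "price_usd": [799.0]})

df.to_csv("structured_data.csv", index=False)         # simple, human-readable
df.to_json("structured_data.json", orient="records")  # easy to exchange with APIs
df.to_parquet("structured_data.parquet")              # columnar, efficient for analytics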
What is ETL?
The ETL process involves three major steps:
1.Extract – Pulling data from various sources (databases, APIs, files)
2.Transform – Cleaning, formatting, and structuring data
3.Load – Moving the transformed data into a destination like a database, data
warehouse, or data lake.
ETL makes raw data usable by converting it into a structured, consistent format.
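A minimal sketch of the three ETL steps as small pandas functions; the file paths and column names are illustrative assumptions:

import pandas as pd

def extract(path: str) -> pd.DataFrame:
    """Extract: pull raw data from a source file (path is an assumption)."""
    return pd.read_csv(path)

def transform(raw: pd.DataFrame) -> pd.DataFrame:
    """Transform: clean, format, and structure the raw data."""
    df = raw.drop_duplicates().copy()
    df["date"] = pd.to_datetime(df["date"])              # column names are assumptions
    df["product"] = df["product"].str.strip().str.title()
    return df

def load(df: pd.DataFrame, destination: str) -> None:
    """Load: write the structured data to its destination (e.g., a warehouse path)."""
    df.to_parquet(destination)

# Example usage (paths are hypothetical):
# load(transform(extract("raw_transactions.csv")), "warehouse/transactions.parquet")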
Case Study Example
Slide: Mini Case Study – Predicting Employee Overload from Task Logs
Scenario:
•Input: Task logs from an internal tool
•Structured features:
• Number of tasks/day
• Avg task completion time
• Category spread
• Time active per day
•Model: Random Forest Classifier to predict overload risk
Outcome:
•Model helps assign support proactively
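A minimal sketch of this case study with scikit-learn; the feature values and overload labels are made up for illustration:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical per-employee features derived from task logs.
features = pd.DataFrame({
    "tasks_per_day": [4, 12, 6, 15, 3, 11, 7, 14],
    "avg_completion_hours": [2.0, 5.5, 3.0, 6.0, 1.5, 5.0, 2.5, 6.5],
    "category_spread": [2, 5, 3, 6, 1, 4, 3, 6],
    "active_hours_per_day": [6, 11, 7, 12, 5, 10, 7, 12],
})
overloaded = pd.Series([0, 1, 0, 1, 0, 1, 0, 1])  # illustrative labels

X_train, X_test, y_train, y_test = train_test_split(
    features, overloaded, test_size=0.25, random_state=42
)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
print("Overload risk predictions:", model.predict(X_test))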
1)Your company wants to analyze productivity trends from an internal project management tool. It logs
all tasks with timestamps, user actions, and task categories. However, data is spread across multiple
inconsistent Excel sheets. What’s the best first step in data gathering?
A. Start training a model using one of the Excel files
B. Manually clean each file separately
C. Consolidate the files into a central structured format (e.g., CSV or database) and standardize the
schema
D. Request employees to input their tasks again using a new format
ANS: C
2. You’re reviewing task logs that include timestamped actions: “Start Task,” “Pause,” “Resume,” and
“Complete.” You need to structure this data to predict delays. What’s the best approach?
A. Count the total number of actions per task
B. Calculate actual task durations by pairing relevant timestamp events
C. Focus only on “Start” and “Complete” actions
D. Drop tasks that contain “Pause” or “Resume” events
Ans: B
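A minimal pandas sketch of option B, pairing “Start Task”/“Resume” events with the next event so paused time is excluded from task duration; the log rows are illustrative:

import pandas as pd

# Hypothetical log where a task can be paused and resumed.
log = pd.DataFrame({
    "task_id": [1, 1, 1, 1],
    "action": ["Start Task", "Pause", "Resume", "Complete"],
    "timestamp": pd.to_datetime([
        "2025-05-31 09:00", "2025-05-31 10:00",
        "2025-05-31 13:00", "2025-05-31 14:30",
    ]),
}).sort_values(["task_id", "timestamp"])

# Active time = sum of intervals that start with "Start Task" or "Resume"
# and end at the next event, so paused periods are excluded.
log["next_timestamp"] = log.groupby("task_id")["timestamp"].shift(-1)
working = log[log["action"].isin(["Start Task", "Resume"])]
durations = (working["next_timestamp"] - working["timestamp"]).groupby(working["task_id"]).sum()
print(durations)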
3. While gathering user activity data for an ML model, you notice some logs are missing timestamps, and others
have them in mixed formats. What’s the correct approach during data structuring?
A. Ignore the missing timestamps—they’re not critical
B. Convert all timestamps to a consistent format (e.g., ISO 8601) and impute missing values based on nearest actions
C. Replace missing timestamps with zeros
D. Use the timestamp column as a text string
✅ Correct Answer: B
Why: Consistent, complete timestamps are crucial for time-based features and modeling.
4. You're building a classifier to categorize incoming tasks based on historical logs. The “Task Type” column has 300
unique values, many of which appear only once. What’s the most efficient way to handle this during structuring?
A. One-hot encode all 300 values
B. Group rare values into “Other” and apply frequency encoding
C. Drop the column
D. Encode all categories as integers from 1 to 300
✅ Correct Answer: B
Why: Grouping and encoding reduces noise and dimensionality while preserving information.
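A minimal pandas sketch of option B, grouping rare task types into “Other” and applying frequency encoding; the sample values are illustrative:

import pandas as pd

# Hypothetical "Task Type" column with a long tail of rare values.
task_type = pd.Series([
    "Bug", "Bug", "Bug", "Feature", "Feature", "Deploy Hotfix",
    "Migrate Legacy Report", "Rotate TLS Certs",
])

# Group values that appear fewer than 2 times into "Other".
counts = task_type.value_counts()
rare = counts[counts < 2].index
grouped = task_type.where(~task_type.isin(rare), "Other")

# Frequency encoding: replace each category with how often it occurs.
freq_encoded = grouped.map(grouped.value_counts(normalize=True))
print(pd.DataFrame({"task_type": grouped, "task_type_freq": freq_encoded}))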