UNIT - Introduction - DataScience - New
Data Science
T. Y. BTECH
e.g., Google Flu Trends: detecting outbreaks two weeks ahead of CDC data
“Big Data” Sources
It’s All Happening On-line
User Generated (Web & Mobile)
Every:
Click
Ad impression
Billing event
….
Fast Forward, pause,… .
Server request
Transaction
Network message
Fault
…
What can you do with the data?
to produce:
Ben Fry’s Model
Visualizing Data Process
1. Acquire
2. Parse
3. Filter
4. Mine
5. Represent
6. Refine
7. Interact
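The first five stages of this process can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `RAW` data, the column names, and every function name below are hypothetical, and the Refine and Interact stages are left out because they depend on a real display and a real user.

```python
import csv
import io
from collections import Counter

# Hypothetical raw input standing in for an acquired data source.
RAW = """city,temp_c
Pune,31
Mumbai,33
Pune,29
Delhi,
Mumbai,35
"""


def acquire():
    """Acquire: obtain the raw data (here, an in-memory string)."""
    return io.StringIO(RAW)


def parse(handle):
    """Parse: give the raw text structure (one dictionary per row)."""
    return list(csv.DictReader(handle))


def keep_valid(rows):
    """Filter: drop rows that have no temperature reading."""
    return [r for r in rows if r["temp_c"].strip()]


def mine(rows):
    """Mine: compute the mean temperature per city."""
    totals, counts = Counter(), Counter()
    for r in rows:
        totals[r["city"]] += float(r["temp_c"])
        counts[r["city"]] += 1
    return {city: totals[city] / counts[city] for city in totals}


def represent(means):
    """Represent: render a basic text bar chart, one bar per city."""
    return [
        f"{city:<8}{'#' * round(mean)} {mean:.1f}"
        for city, mean in sorted(means.items())
    ]


means = mine(keep_valid(parse(acquire())))
chart = represent(means)
print("\n".join(chart))
```

Each stage takes the previous stage's output, so a stage can be swapped (for example, a different filter) without touching the others.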
Jeff Hammerbacher’s Model
1. Identify problem
3. Collect data
5. Build model
6. Evaluate model
7. Communicate results
Data Scientist’s Practice
Clean, prep
Evaluate
Interpret
The Big Picture
Extract
Transform
Load
Data Science: Getting Value out of Data
Why the Increased Interest in Data Science?
Why Python for Data Science?
https://www.kdnuggets.com/2019/05/poll-top-data-science-machine-learning-platforms.html
Applications
ACID = Atomicity, Consistency, Isolation and Durability
CAP = Consistency, Availability, Partition Tolerance
Contrast: BI
Contrast: AI
Modern Data Science Skills
The Structure Spectrum
Solution:
Data Quality Problems
• (Source) Data is dirty on its own.
• Transformations corrupt the data (complexity of software
pipelines).
• Data sets are clean but integration (i.e., combining them)
screws them up.
• “Rare” errors can become frequent after transformation or
integration.
• Data sets are clean but suffer “bit rot”
• Old data loses its value/accuracy over time
Integrate
Clean
Extract
Transform
Load
Numeric Outliers
• Accuracy
– The data was recorded correctly.
• Completeness
– All relevant data was recorded.
• Uniqueness
– Entities are recorded once.
• Timeliness
– The data is kept up to date.
• Special problems in federated data: time consistency.
• Consistency
– The data agrees with itself.
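Several of these dimensions can be checked mechanically. The sketch below is a minimal illustration, assuming hypothetical records with made-up field names (`id`, `birth_year`, `age`, `updated`) and an arbitrary one-year staleness threshold. Accuracy is not covered, because checking it needs a trusted reference to compare against.

```python
from datetime import date

# Hypothetical records; the field names and values are illustrative only.
records = [
    {"id": 1, "name": "Asha", "birth_year": 1999, "age": 25, "updated": date(2024, 5, 1)},
    {"id": 2, "name": "Ravi", "birth_year": 2001, "age": None, "updated": date(2024, 4, 20)},
    {"id": 1, "name": "Asha", "birth_year": 1999, "age": 25, "updated": date(2024, 5, 1)},
    {"id": 3, "name": "Meera", "birth_year": 1990, "age": 20, "updated": date(2019, 1, 15)},
]
AS_OF = date(2024, 6, 1)  # assumed "today" so the checks are reproducible


def incomplete(rows):
    """Completeness: ids of rows with at least one missing value."""
    return [r["id"] for r in rows if any(v is None for v in r.values())]


def duplicated(rows):
    """Uniqueness: ids that appear more than once."""
    seen, dupes = set(), []
    for r in rows:
        if r["id"] in seen:
            dupes.append(r["id"])
        seen.add(r["id"])
    return dupes


def stale(rows, as_of, max_age_days=365):
    """Timeliness: ids of rows not updated within max_age_days."""
    return [r["id"] for r in rows if (as_of - r["updated"]).days > max_age_days]


def inconsistent(rows, as_of):
    """Consistency: ids where age disagrees with birth_year by more than a year."""
    return [
        r["id"]
        for r in rows
        if r["age"] is not None
        and abs((as_of.year - r["birth_year"]) - r["age"]) > 1
    ]


print("incomplete:", incomplete(records))
print("duplicated:", duplicated(records))
print("stale:", stale(records, AS_OF))
print("inconsistent:", inconsistent(records, AS_OF))
```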
How can we deal with noisy data?
• Data wrangling converts raw data into a format that is convenient
for consumption. It is typically performed while building an
interactive model.
• Steps:
– extract the data from different data sources
– sort the data using a chosen algorithm
– decompose the data into a different structured format
– finally, store the data in another database.
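The four wrangling steps can be sketched with the Python standard library. Everything specific here is an assumption made for illustration: the two in-memory sources, the `members` table, and the choice to split names and dates into separate columns.

```python
import csv
import io
import json
import sqlite3

# Two hypothetical sources in different formats.
CSV_SOURCE = "name,joined\nRavi Kumar,2021-03-04\nAsha Patil,2020-11-30\n"
JSON_SOURCE = '[{"name": "Meera Joshi", "joined": "2019-07-15"}]'


def extract():
    """Step 1: pull rows from the different sources into one list."""
    rows = list(csv.DictReader(io.StringIO(CSV_SOURCE)))
    rows += json.loads(JSON_SOURCE)
    return rows


def sort_rows(rows):
    """Step 2: sort by join date (ISO dates sort correctly as strings)."""
    return sorted(rows, key=lambda r: r["joined"])


def decompose(rows):
    """Step 3: split compound fields into a different structured format."""
    out = []
    for r in rows:
        first, last = r["name"].split(" ", 1)
        year, month, _day = r["joined"].split("-")
        out.append((first, last, int(year), int(month)))
    return out


def store(rows):
    """Step 4: store the result in another database (in-memory SQLite)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE members "
        "(first TEXT, last TEXT, join_year INTEGER, join_month INTEGER)"
    )
    conn.executemany("INSERT INTO members VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return conn


conn = store(decompose(sort_rows(extract())))
stored = conn.execute(
    "SELECT first, join_year FROM members ORDER BY rowid"
).fetchall()
print(stored)
```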
Operations
• map() functions
• filter (apply predicate to rows)
• sort/group by
• aggregate: sum, count, average, max, min
• Pivot or reshape
• Relational:
– union, intersection, difference, cartesian product (CROSS
JOIN)
– select/filter, project
– join: natural join (INNER JOIN), theta join, semi-join, etc.
– rename
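Most of these operations have a direct counterpart in plain Python. The sketch below uses two small hypothetical tables (`orders` and `customers`, with made-up values) to show map, filter, sort, group by with aggregation, project, and a natural join on the shared `customer_id` column.

```python
from collections import defaultdict

# Hypothetical tables, represented as lists of dictionaries.
orders = [
    {"order_id": 1, "customer_id": 10, "amount": 250.0},
    {"order_id": 2, "customer_id": 11, "amount": 90.0},
    {"order_id": 3, "customer_id": 10, "amount": 40.0},
    {"order_id": 4, "customer_id": 12, "amount": 500.0},
]
customers = [
    {"customer_id": 10, "city": "Pune"},
    {"customer_id": 11, "city": "Mumbai"},
]

# map: apply a function to every row (add the amount in paise).
with_paise = [{**o, "amount_paise": round(o["amount"] * 100)} for o in orders]

# filter: keep only the rows that satisfy a predicate.
large = [o for o in orders if o["amount"] >= 100]

# sort: order rows by a key, largest amount first.
by_amount = sorted(orders, key=lambda o: o["amount"], reverse=True)

# group by + aggregate: sum, count, and max of amount per customer.
groups = defaultdict(list)
for o in orders:
    groups[o["customer_id"]].append(o["amount"])
summary = {
    cid: {"sum": sum(a), "count": len(a), "max": max(a)}
    for cid, a in groups.items()
}

# project: keep only selected columns.
order_ids = [{"order_id": o["order_id"]} for o in orders]

# natural (inner) join on customer_id: orders with no matching customer drop out.
city_of = {c["customer_id"]: c["city"] for c in customers}
joined = [
    {**o, "city": city_of[o["customer_id"]]}
    for o in orders
    if o["customer_id"] in city_of
]

print(summary)
print(joined)
```

Order 4 belongs to customer 12, who is missing from `customers`, so the inner join returns three rows rather than four.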
Applications
Real-world Data Science applications:
Recommendation- Apps and websites such as Amazon, YouTube, and Flipkart recommend items based on
the viewer’s interests. Online music applications such as Spotify recommend tracks that match your
taste in music. These are good examples of data science applied to recommendation.
Search Results- Search engines such as Google use machine learning algorithms to find the most
relevant results for a query. Similar algorithms are used for the most visited sites in Google Chrome.
Intelligent Assistant- Google Assistant and Siri are examples of intelligent assistants. Advanced
machine learning algorithms convert voice input into text. These smart assistants recognize the
voice and provide the required information as both voice and text output.
Autonomous driving vehicles- Companies such as Waymo and Tesla are working on the next
generation of autonomous vehicles. Cameras capture 3D images, and that information is passed to
algorithms for further processing.
Piracy Detection- YouTube is an example of piracy detection using machine learning algorithms.
Because its database is so large, copied content cannot be detected manually. Machine learning
helps detect and remove copied content with less human effort.
Image Recognition- Facebook uses image recognition, built on data science and machine learning,
for friend suggestions. Google Lens also uses an image recognition algorithm to provide
information related to what it sees.