DE_Python

Data Engineering and Machine Learning Using Python

Module 1: Introduction to Machine Learning

▪ Introduction To Machine Learning
▪ Life Cycle of Machine Learning
▪ Skills required for Machine Learning
▪ Career Paths in Machine Learning
▪ Applications of Machine Learning

Module 3: Python for Machine Learning

▪ Python programming:
▪ Environment Setup
▪ Jupyter Notebook Overview
▪ Data types: Numbers, Strings, Printing, Lists, Dictionaries, Booleans, Tuples, Sets
▪ Comparison Operators
▪ if, elif, else Statements
▪ Loops: for Loops, while Loops
▪ range()
▪ list comprehension
▪ functions
▪ lambda expressions
▪ map and filter
▪ methods
▪ Programming Exercises.
▪ Object Oriented Programming
▪ Modules and packages
▪ Errors and Exception Handling
▪ Python Decorators
▪ Python generators
▪ Collections
▪ Regular Expressions
▪ Python for Exploratory Data Analysis:
▪ NumPy:
▪ Installing numpy
▪ Using numpy
▪ NumPy arrays
▪ Creating numpy arrays from python list
▪ Creating arrays using built-in methods (arange(), zeros(), ones(), linspace(), eye(), rand(), etc.)
▪ Array attributes: shape, dtype
▪ Array methods: reshape(), min(), max(), argmax(), argmin(), etc.
▪ Pandas:
▪ Introduction to Pandas
▪ Series
▪ DataFrames
▪ Missing Data
▪ GroupBy
▪ Merging, Joining and Concatenating
▪ Operations
▪ Data Input and Output
▪ Python for Data Visualization:
▪ Matplotlib:
▪ Installing Matplotlib, Basic Matplotlib commands
▪ Creating multiple plots on the same canvas
▪ Object Oriented Method: figure(), plot(), add_axes(), subplots(), etc.
▪ Matplotlib Exercise
▪ Seaborn:
▪ Categorical plot
▪ Distribution plot
▪ Regression plot
▪ Seaborn Exercise
▪ Pandas built-in visualization:
▪ Scatter plot
▪ Histograms
▪ Box plot
▪ CAPSTONE PROJECT FOR DATA ANALYSIS

Module 4: Deep dive into Machine Learning

▪ Introduction To Machine Learning:
▪ Relationship between Data Science and Machine Learning
▪ Supervised Learning
▪ Unsupervised Learning

Supervised Learning (Regression and Classification Algorithms):

▪ Linear Regression
▪ Ridge Regression
▪ Lasso Regression
▪ Polynomial Regression
▪ Support vector regression
▪ Decision Tree Regression
▪ Random Forest Regression
▪ Logistic Regression
▪ Support Vector Machines
▪ Kernel SVM
▪ Decision Trees and Random Forest
▪ Ensemble Of Decision Trees
▪ Model Evaluation and Improvement

Unsupervised Learning:

▪ Challenges in Unsupervised Learning
▪ Preprocessing and Scaling
▪ Dimensionality Reduction, Feature Extraction
▪ Principal Component Analysis (PCA)
▪ Clustering
▪ K-Means
▪ Model evaluation and improvement
▪ Cross validation, Grid search, Evaluation metrics and scoring
▪ Working with text data

Module 5: NLP & Recommender Systems:

▪ Corpus
▪ Text preprocessing using Bag of words technique
▪ TF (Term Frequency)
▪ IDF (Inverse Document Frequency)
▪ Normalization
▪ Vectorization
▪ NLP with Python
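
To make the TF, IDF, normalization and vectorization topics above concrete, here is a minimal sketch in plain Python. The three-sentence corpus and all function names are hypothetical and exist only for illustration. The IDF used is the plain log(N / df) variant; libraries commonly apply smoothed variants, so their values may differ.

```python
import math
from collections import Counter

# Hypothetical toy corpus, used only for illustration.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]

def tokenize(text):
    # Bag of words: lowercase and split on whitespace, ignoring word order.
    return text.lower().split()

def term_frequency(tokens):
    # TF: share of the document's tokens taken up by each term.
    counts = Counter(tokens)
    return {term: count / len(tokens) for term, count in counts.items()}

def inverse_document_frequency(tokenized_docs):
    # IDF: log(N / df), where df is the number of documents containing the term.
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}

def tfidf_vectors(docs):
    # Vectorization: one fixed-length vector per document over a sorted vocabulary.
    tokenized = [tokenize(doc) for doc in docs]
    idf = inverse_document_frequency(tokenized)
    vocabulary = sorted(idf)
    vectors = []
    for tokens in tokenized:
        tf = term_frequency(tokens)
        vec = [tf.get(term, 0.0) * idf[term] for term in vocabulary]
        # Normalization: scale each vector to unit L2 length.
        norm = math.sqrt(sum(v * v for v in vec))
        if norm > 0:
            vec = [v / norm for v in vec]
        vectors.append(vec)
    return vocabulary, vectors

vocabulary, vectors = tfidf_vectors(corpus)
```

Each row of `vectors` lines up with `vocabulary`, so a term that is absent from a document gets a weight of 0.0 in that document's vector.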

Hadoop Developer Course

During this course you will learn:

• Linux (Ubuntu/CentOS) - Tips and Tricks
• Basic Java Programming – Core Java OOP Concepts
• Introduction to Big Data and Hadoop
• Hadoop ecosystem concepts
• Hadoop MapReduce concepts and features
• Developing MapReduce applications
• Pig concepts
• Hive concepts
• Impala
• Oozie workflow concepts
• Sqoop Data Ingestion
• Flume Agents
• Tableau Visualization
• HBase concepts
• Real Time tools like Hue, PuTTY, FileZilla, Cloudera Manager
• Real Time Projects

Linux (Ubuntu/CentOS) - Tips and Tricks

Basic (Core) Java Programming Concepts – OOP

Introduction to Big Data and Hadoop


• What is Big Data?
• What are the challenges for processing big data?
• What is Hadoop?
• Why Hadoop?
• History of Hadoop
• Hadoop ecosystem
• HDFS
• MapReduce

Understanding the Cluster


• Hadoop 2.x Architecture
• Typical workflow
• HDFS Commands
• Writing files to HDFS
• Reading files from HDFS
• Rack awareness
• Hadoop daemons

Let's talk MapReduce


• Before MapReduce
• MapReduce overview
• Word count problem
• Word count flow and solution
• MapReduce flow
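
The word count problem and the MapReduce flow listed above can be sketched in plain Python without a Hadoop cluster. This is an illustrative single-process imitation of the map, shuffle-and-sort, and reduce stages, not Hadoop's API; the function names and sample lines are hypothetical.

```python
from collections import defaultdict

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in one input line.
    for word in line.lower().split():
        yield word, 1

def shuffle_and_sort(pairs):
    # Shuffle and sort: group all values by key, then order the keys.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())

def reduce_phase(key, values):
    # Reduce: collapse the list of counts for one word into a total.
    return key, sum(values)

def word_count(lines):
    mapped = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle_and_sort(mapped))

# Hypothetical input lines, used only for illustration.
lines = ["Hadoop stores data", "Hadoop processes data"]
counts = word_count(lines)
```

On a real cluster the map and reduce functions run in parallel on different nodes and the framework performs the grouping step between them; the sketch only mirrors the order of the stages.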

Developing the MapReduce Application


• Data Types
• File Formats
• Explain the Driver, Mapper and Reducer code
• Configuring development environment - Eclipse
• Writing unit tests
• Running locally
• Running on cluster
• Hands-on exercises

How MapReduce Works


• Anatomy of MapReduce job run
• Job submission
• Job initialization
• Task assignment
• Job completion
• Job scheduling
• Job failures
• Shuffle and sort
• Hands-on exercises

MapReduce Types and Formats


• File Formats – Sequence Files
• Compression Techniques
• Input Formats - Input splits & records, text input, binary input
• Output Formats - text output, binary output, lazy output
• Hands-on exercises

MapReduce Features

• Counters
• Side data distribution
• MapReduce combiner
• MapReduce partitioner
• MapReduce distributed cache
• Hands-on exercises

Hive
• Hive Architecture
• Types of Metastore
• Hive Data Types
• HiveQL
• File Formats – Parquet, ORC, Sequence and Avro Files Comparison
• Partitioning & Bucketing
• Hive JDBC Client
• Hive UDFs
• Hive SerDes
• Hive on Tez
• Hands-on exercises
• Integration with Tableau

Pig
• Pig Architecture
• Pig Data Types
• Load/Store Functions
• PigLatin
• Pig UDFs

HBase

• HBase architecture and concepts
• HBase Data Model
• HBase Shell Interface
• HBase Java API

Sqoop
• Sqoop Architecture
• Sqoop Import Command Arguments, Incremental Import
• Sqoop Export
• Sqoop Jobs
• Hands-on exercises

Flume
• Flume Architecture
• Flume Agent Setup
• Types of sources, channels, sinks
• Multi-Agent Flow
• Hands-on exercises

Oozie
• Oozie Fundamentals
• Oozie workflow creation
• Oozie Job submission, monitoring, debugging
• Concepts on Coordinators and Bundles
• Hands-on exercises

Case Study Discussions

Any One of the Four Projects

• Log File Analysis covering Flume, HDFS, MR/Pig, Hive, Tableau
• Crime Data Analysis covering Oozie, Sqoop, HDFS, Hive, HBase, RESTful Client
• Hadoop Use Cases in Insurance Domain
• Hadoop Use Cases in Retail Domain


Scala or Python, Spark
➢ Understand the difference between Apache Spark and Hadoop
➢ Learn Scala and its programming implementation

✓ Why Scala or Python
✓ Scala Installation
✓ Get deep insights into the functioning of Scala
✓ Execute Pattern Matching in Scala
✓ Functional Programming in Scala – Closures, Currying, Expressions,
Anonymous Functions
✓ Know the concepts of classes in Scala
✓ Object Orientation in Scala – Primary, Auxiliary Constructors, Singleton &
Companion Objects
✓ Traits and Abstract classes in Scala
✓ Scala Simple Build Tool – SBT
✓ Building with Maven

➢ Spark Basics

✓ What is Apache Spark?
✓ Spark Installation
✓ Spark Configuration
✓ Spark Context
✓ Using Spark Shell
✓ Resilient Distributed Datasets (RDDs) – Features, Partitions, Tuning Parallelism
✓ Functional Programming with Spark

➢ Working with RDDs


✓ RDD Operations - Transformations and Actions
✓ Types of RDDs
✓ Key-Value Pair RDDs – Transformations and Actions
✓ MapReduce and Pair RDD Operations
✓ Serialization

➢ Spark on a cluster

✓ Overview
✓ A Spark Standalone Cluster
✓ The Spark Standalone Web UI
✓ Executors & Cluster Manager
✓ Spark on YARN Framework

➢ Writing Spark Applications

✓ Spark Applications vs. Spark Shell
✓ Creating the SparkContext
✓ Configuring Spark Properties
✓ Building and Running a Spark Application
✓ Logging
✓ Spark Job Anatomy

➢ Caching and Persistence

✓ RDD Lineage
✓ Caching Overview
✓ Distributed Persistence

➢ Improving Spark Performance

✓ Shared Variables: Broadcast Variables
✓ Shared Variables: Accumulators
✓ Per Partition Processing
✓ Common Performance Issues

➢ Spark API for different File Formats & Compression Codecs

✓ Text
✓ CSV
✓ Sequence
✓ Parquet
✓ ORC
✓ Compression Techniques – Snappy, Zlib, Gzip

➢ Spark SQL
✓ Spark SQL Overview
✓ HiveContext
✓ SQL Datatypes
✓ Dataframes vs RDDs
✓ Operations on DFs
✓ Parquet Files with Spark SQL – Read, Write, Partitioning, Merging Schema
✓ ORC Files
✓ JSON Files
✓ Inferring Schema programmatically
✓ Custom Case Classes
✓ Temp Tables vs Persistent Tables
✓ Writing UDFs
✓ Hive Support
✓ JDBC Support - Examples
✓ HBase Support - Examples
➢ Spark Streaming

✓ Spark Streaming Overview
✓ Example: Streaming Word Count
✓ Other Streaming Operations
✓ Sliding Window Operations
✓ Developing Spark Streaming Applications – Integration with Kafka and HBase

Complementary Course: AWS