Big Data Analytics

Uploaded by

Sameer Jadhav
Big Data Analytics    Course Code: 22684
MSBTE - Final Copy Dt. 11.07.2023

Program Name : Diploma in Artificial Intelligence and Machine Learning / Diploma in Cloud Computing and Big Data
Program Code : AN/BD
Semester     : Sixth
Course Title : Big Data Analytics
Course Code  : 22684

1. RATIONALE
Data analytics techniques enable a business to take raw data and uncover patterns to extract valuable insights. Data analysis helps companies make informed decisions, create a more effective marketing strategy, improve customer experience and streamline operations.

2. COMPETENCY
The aim of this course is to help the student to attain the following industry identified competency through various teaching learning experiences:
• Use Big Data analytic technologies to process large amounts of heterogeneous raw data to retrieve information.

3. COURSE OUTCOMES (COs)
The theory, practical experiences and relevant soft skills associated with this course are to be taught and implemented, so that the student demonstrates the following industry oriented COs associated with the above mentioned competency:
a. Describe Big Data and Big Data Analytics.
b. Apply the Big Data Analytics procedure to work on datasets.
c. Describe Hadoop Distributed File System.
d. Analyze structured data using HIVE.
e. Analyze structured, semi-structured and unstructured data using SPARK.

4. TEACHING AND EXAMINATION SCHEME
Teaching Scheme (in hours): L = 3 | T = - | P = 2
Credit: 5

Examination Scheme
Component | Paper Hrs. | ESE Max | ESE Min | PA Max | PA Min | Total Max | Total Min
Theory    | 3          | 70      | 28      | 30*    | 00     | 100       | 40
Practical | -          | 25@     | 10      | 25     | 10     | 50        | 20

(**) marks should be awarded on the basis of internal end semester theory exam of 50 marks based on the specification table given in S. No. 9.
(~): For the practical only courses, the PA has two components under practical marks, i.e. the assessment of practicals (seen in section 6) has a weightage of 60% (i.e. 30 marks) and micro-project assessment (seen in section 12) has a weightage of 40% (i.e. 20 marks). This is designed to facilitate attainment of COs holistically, as there is no theory ESE.
Legends: L - Lecture; T - Tutorial/Teacher Guided Theory Practice; P - Practical; C - Credit; ESE - End Semester Examination; PA - Progressive Assessment; '#': No Theory Examination

5. COURSE MAP (with sample COs and Learning Outcomes i.e. LOs)
This course map illustrates an overview of the flow and linkages of the topics at various levels of outcomes (details in subsequent sections) to be attained by the student by the end of the course, in all domains of learning in terms of the industry/employer identified competency depicted at the centre of this map.

[Figure 1 - Course Map]

6. SUGGESTED PRACTICALS/EXERCISES
The practicals/exercises/tutorials in this section are psychomotor domain LOs (i.e. sub-components of the COs) that are to be developed and assessed in the student to lead to the attainment of the competency.

Sr. No. | Practical Exercises (Learning Outcomes to be achieved through practical) | Unit No. | Approx. Hrs. Required

1. Case study on Big Data and Big Data Analysis (Walmart, Uber, Netflix, eBay etc.).

2. Write a Pandas program: (Approx. hrs: 04*)
   a. To import given Excel data into a Pandas DataFrame.
   b. To get the data types of the given Excel data fields.
   c. To read specific columns from a given Excel file.
   d. To find the sum, mean, max, min value of a specific column of a given Excel file.
   e. To import some Excel data skipping some rows or columns.
   f. To select the specified columns and rows from a given DataFrame.
   g. To delete rows and columns from a DataFrame.
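The Pandas tasks in practical 2 can be sketched roughly as follows. This is a minimal illustration, not a prescribed solution: the helper names, the column names and the in-memory `sample` DataFrame are all hypothetical stand-ins for the "given Excel data", and `load_excel` assumes an Excel engine such as openpyxl is installed when it is actually called.

```python
import pandas as pd


def load_excel(path, columns=None, skip_rows=None):
    """Read an Excel sheet into a DataFrame (items a, c and e).

    `columns` restricts which columns are read and `skip_rows` skips rows.
    Calling this needs an Excel engine such as openpyxl.
    """
    return pd.read_excel(path, usecols=columns, skiprows=skip_rows)


def column_summary(df, column):
    """Return sum, mean, max and min of one numeric column (item d)."""
    return df[column].agg(["sum", "mean", "max", "min"]).to_dict()


def select_block(df, rows, columns):
    """Select rows by position and columns by label (item f)."""
    return df.iloc[rows][columns]


def drop_rows_and_columns(df, row_labels=(), column_labels=()):
    """Return a copy without the given rows and columns (item g)."""
    return df.drop(index=list(row_labels), columns=list(column_labels))


# Hypothetical data standing in for the "given Excel data".
sample = pd.DataFrame(
    {"name": ["A", "B", "C"], "qty": [2, 5, 3], "price": [10.0, 4.0, 6.5]}
)

dtypes = sample.dtypes  # item b: data type of each field
summary = column_summary(sample, "qty")
subset = select_block(sample, [0, 2], ["name", "price"])
trimmed = drop_rows_and_columns(sample, row_labels=[1], column_labels=["price"])
```

With a real workbook, `load_excel("data.xlsx", columns=["name", "qty"], skip_rows=2)` would replace the in-memory `sample`; the file name and arguments here are placeholders.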
3. Perform the Extract-Transform-Load (ETL) process:
   a. Import the functions and required modules.
   b. Download the source file.
   c. Extract the zip file.
   d. Set the path for the target files.
   e. Use the extract() function to extract data from multiple sources.
   f. Transform the data as per the given requirement using the transform() function.
   g. Load the data into the target file.
   h. Call the log function for each phase.

4. Study any one Hadoop use case. (Approx. hrs: 02*)

5. Create Hive table: (Approx. hrs: 02*)
   a. Create Hive External Table.
   b. Load data into Hive table.
   c. Create Hive Internal Table.

6. Load the data into Hive table:
   a. Load data from local file system.
   b. Load data from HDFS file system.
   c. Copy data to Hive table location.
   d. Sqoop Hive import to import table data.

7. Create Hive table with the following storage format specification:
   a. Hive Text File Format
   b. Hive Sequence File Format
   c. Hive RC File Format
   d. Hive AVRO File Format
   e. Hive ORC File Format
   f. Hive Parquet File Format

8. Consider the sample logs.txt shown in the figure. Write a Spark application to count the total number of WARN lines in the logs.txt file. (Implement using Scala / Python Programming)
   Sample logs.txt:
   ERROR This is an error message
   WARN This is a warning message

9. Implement using Scala / Python Programming: (Approx. hrs: 02*)
   a. Create the following data as logdata.log with comma delimiters as shown.
      [Figure: sample logdata.log rows]
      The schema for these data is Time, IP Address, URL and Location.
   b. Create a DataFrame of the created log file using spark.read.csv.
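Practicals 8 and 9 above can be sketched in PySpark roughly as follows. This is only an illustration under stated assumptions: the function names, the `LOG_COLUMNS` spelling of the schema and the `SAMPLE_ROWS` values are hypothetical (they are not the rows from the figure), and both Spark helpers expect an already created SparkSession passed in as `spark`.

```python
import csv

# Column names follow the schema stated in practical 9 (spelling assumed).
LOG_COLUMNS = ["Time", "IpAddress", "URL", "Location"]

# Hypothetical rows for illustration only; not the rows shown in the figure.
SAMPLE_ROWS = [
    ("10:24:25", "10.0.0.1", "http://www.example.com/search", "LOC1"),
    ("10:24:26", "10.0.0.2", "http://www.example.org/", "LOC1"),
    ("10:24:27", "10.0.0.3", "http://www.example.com/items", "LOC2"),
]


def write_log_file(path, rows=SAMPLE_ROWS):
    """Practical 9a: write comma-delimited rows (no header line) to `path`."""
    with open(path, "w", newline="") as handle:
        csv.writer(handle).writerows(rows)


def load_log_dataframe(spark, path):
    """Practical 9b: read the file with spark.read.csv and name the columns.

    Without a header or schema Spark names the columns _c0, _c1, ...,
    so toDF() assigns the schema names afterwards.
    """
    return spark.read.csv(path, header=False).toDF(*LOG_COLUMNS)


def count_warn_lines(spark, path):
    """Practical 8: count the lines of a text file that start with WARN."""
    lines = spark.read.text(path)  # one row per line, in a column named "value"
    return lines.filter(lines.value.startswith("WARN")).count()


# Usage, assuming PySpark is installed:
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.appName("logdata").getOrCreate()
#   write_log_file("logdata.log")
#   load_log_dataframe(spark, "logdata.log").show()
#   print(count_warn_lines(spark, "logs.txt"))
```

Passing the session in as a parameter keeps the helpers free of a module-level Spark import, so the file-writing step can be tried without a Spark installation.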
10. Write and run SparkSQL queries programmatically for the following requirements. (Implement using Scala / Python Programming)
   a. How many people accessed the Flipkart domain in each location?
   b. Who accessed the Flipkart domain in each location?
   c. List their IpAddress.
   d. How many distinct Internet users are available in each location?
   e. List the unique locations available.

11. Read and write data stored in Apache Hive through Spark SQL. (Implement using Scala / Python Programming)

Total: 32 hrs
'*': compulsory practicals to be performed.

Note
i. The above table gives a suggestive list of practical exercises. Teachers can add similar exercises.
ii. Assessment of the 'Process' and 'Product' related skills in the laboratory work should be done as per the suggested sample below:

Sr. No. | Performance Indicators | Weightage in %
1 | Import packages and libraries of Python / Scala / Hive / Spark | 20
2 | Use Python / Scala / Hive / Spark to create, edit, assemble and link the programs | 40
3 | Debug, test and execute the programs | 20
4 | Able to answer oral questions | 10
5 | Submission of report in time | 10
  | Total | 100

Additionally, the following affective domain LOs (social skills/attitudes) are also important constituents of the competency which can be best developed through the above mentioned laboratory/field based experiences:
a. Work with various libraries to handle data.
b. Demonstrate working as a leader/a team member.
c. Maintain tools and equipment.
d. Follow ethical practices.

In the development of the attitude related LOs of Krathwohl's 'Affective Domain Taxonomy', the achievement level may reach:
• 'Valuing Level' in 1st year
• 'Organizing Level' in 2nd year and
• 'Characterizing Level' in 3rd year.

7. MAJOR EQUIPMENT/INSTRUMENTS REQUIRED
The major equipment with broad specification mentioned here will usher in uniformity in conduct of experiments, as well as aid the authorities concerned in procuring equipment.

S. No. | Equipment Name with Broad Specifications | Expt. S. No.
1 | Hardware: Personal computer (i3 preferable), RAM minimum 4 GB onwards | For all experiments
2 | Operating system: Windows 10 onward | For all experiments
3 | Software: Editor: Python setup, Apache Hadoop and Hive | Practical 5 to 11
4 | Software: Editor: Scala setup | Practical 8 to 11

8. UNDERPINNING THEORY COMPONENTS
The following topics/subtopics should be taught and assessed in order to develop LOs in the cognitive domain for achieving the COs to attain the identified competency.

Unit-I: Introduction to Big Data Analytics
Major Learning Outcomes (in cognitive domain):
1a. Describe the characteristics of data.
1b. Define Big Data.
1c. Explain the challenges with Big Data.
1d. Define Big Data Analytics.
1e. Explain the challenges with Big Data Analytics.
1f. Explain Data Science.
1g. Write down responsibilities of a Data Scientist.
1h. Explain terminologies used in Big Data Environment.
Topics and Sub-topics:
1.1 Introduction:
• Characteristics of Data
• Evolution of Big Data
• Definition of Big Data
• Challenges with Big Data
• What is Big Data
• Why Big Data
1.2 Introduction to Big Data Analytics:
• What is Big Data Analytics
• Classification of Analytics
• Why is Big Data Analytics Important
• Data Science
• Responsibilities of a Data Scientist
• Terminologies Used in Big Data Environments

Unit-II: Data Analytics Process
Major Learning Outcomes (in cognitive domain):
2a. Explain any one domain specific example of Big Data.
2b. Explain analytics flow for Big Data.
2c. State different Big Data Stack.
2d. Describe mapping analytics flow to Big Data Stack.
2e. State different analytics patterns.
Topics and Sub-topics:
2.1 Domain Specific Examples of Big Data:
• Web Analytics
• Financial
• Healthcare
• Internet of Things
• Environment
• Logistics & Transportation
• Industry
• Retail
2.2 Analytics Flow for Big Data:
• Data Collection
• Data Preparation
• Analysis Types
• Analysis Modes
• Visualizations
2.3 Big Data Stack:
• Raw Data Sources
• Data Access Connectors
• Data Storage
• Batch Analytics
• Real-time Analytics
• Interactive Querying
• Serving Databases, Web & Visualization Frameworks
2.4 Mapping Analytics Flow to Big Data Stack
2.5 Case Study: Genome Data Analysis
2.6 Case Study: Weather Data Analysis
2.7 Analytics Patterns

Unit-III: The Big Data Technology: Hadoop
Major Learning Outcomes (in cognitive domain):
3a. State the features of Hadoop.
3b. Enlist key advantages of Hadoop.
3c. Compare RDBMS versus Hadoop.
3d. Explain Hadoop.
3e. Describe HDFS.
Topics and Sub-topics:
3.1 Introduction to Hadoop:
• Features of Hadoop
• Key Advantages of Hadoop
• Why Hadoop
• RDBMS versus Hadoop
3.2 Hadoop Overview
3.3 Use Case of Hadoop
3.4 HDFS
3.5 Processing Data with Hadoop

Unit-IV: Introduction to HIVE
Major Learning Outcomes (in cognitive domain):
4a. State the use of HIVE.
4b. Describe HIVE Architecture.
4c. Explain HIVE File Format.
4d. Execute HIVE Query Language commands.
4e. Explain SERDE.
4f. Describe User Defined Functions.
Topics and Sub-topics:
4.1 What is HIVE?
4.2 HIVE Architecture
4.3 HIVE Data Types
4.4 HIVE File Format
4.5 HIVE Query Language
4.6 RCFile Implementation
4.7 SERDE
4.8 User Defined Functions

Unit-V: Introduction to SPARK
Major Learning Outcomes (in cognitive domain):
5a. State the use of Apache Spark.
5b. Compare Spark and Hadoop MapReduce.
5c. Describe Apache Spark Architecture.
5d. State the Spark Components.
5e. Define RDD.
5f. State the RDD Operations.
5g. Execute commands of Spark SQL.
5h. Describe DataFrame Operations.
5i. Describe Generic Load and Save Functions.
5j. Write a code for Building Spark SQL Application with SBT.
5k. Explain Spark Real-Time Use Case.
Topics and Sub-topics:
5.1 What Is Apache Spark?
5.2 Why Apache Spark?
5.3 Spark vs. Hadoop MapReduce
5.4 Apache Spark Architecture
5.5 Spark Components
5.6 Spark Shell
5.7 Spark Core: RDD
• RDD Operations
• Creating an RDD
5.8 What Is Spark SQL?
5.9 Spark Session
5.10 Creating DataFrames
• DataFrame Operations
• Dataset Operations
5.11 Different Data Sources: Generic Load and Save Functions
5.12 Building Spark SQL Application with SBT
5.13 Spark Real-Time Use Case
• Data Analytics Project Architecture
• Use Cases

Note: To attain the COs and competency, the above listed Learning Outcomes (LOs) need to be undertaken to achieve the 'Application Level' of Bloom's 'Cognitive Domain Taxonomy'.

9. SUGGESTED SPECIFICATION TABLE FOR QUESTION PAPER DESIGN

Unit No. | Unit Title | Teaching Hours | Distribution of Theory Marks: R Level | U Level | A Level | Total Marks
I   | Introduction to Big Data Analytics | 08 | 08 | 04 | -  | 12
II  | Data Analytics Process             | 10 | 06 | 06 | 02 | 14
III | The Big Data Technology: Hadoop    | 08 | 06 | 06 | 02 | 14
IV  | Introduction to HIVE               | 10 | 02 | 06 | 06 | 14
V   | Introduction to SPARK              | 12 | 06 | 04 | 06 | 16
    | Total                              | 48 | 28 | 26 | 16 | 70

Legends: R = Remember, U = Understand, A = Apply and above (Bloom's Revised taxonomy)
Note: This specification table provides general guidelines to assist students in their learning and teachers in teaching and assessing students with respect to attainment of LOs. The actual distribution of marks at different taxonomy levels (of R, U and A) in the question paper may vary from the above table.
Note: This specification table also provides a general guideline for teachers to frame the internal end semester practical theory exam paper which students have to undertake.

10. SUGGESTED STUDENT ACTIVITIES
Other than the classroom and laboratory learning, following are the suggested student-related co-curricular activities which can be undertaken to accelerate the attainment of the various outcomes in this course:
a. Prepare journals based on practicals performed in the laboratory.
b. Library/E-Book survey regarding assembly language programming used in computer industries.
c. Prepare PowerPoint presentation for showing different types of Assembly language Programming Applications.

11. SUGGESTED SPECIAL INSTRUCTIONAL STRATEGIES (if any)
These are sample strategies, which the teacher can use to accelerate the attainment of the various outcomes in this course:
a. Massive open online courses (MOOCs) may be used to teach various topics/sub-topics.
b. 'L' in item No. 4 does not mean only the traditional lecture method, but different types of teaching methods and media that are to be employed to develop the outcomes.
c. About 15-20% of the topics/sub-topics which are relatively simpler or descriptive in nature are to be given to the students for self-directed learning, and the development of the LOs/COs is to be assessed through classroom presentations (see implementation guideline for details).
d. With respect to item No. 10, teachers need to ensure to create opportunities and provisions for co-curricular activities.
e. Guide student(s) in undertaking micro-projects.
f. The selection of practicals to be performed should cover all units.

12. SUGGESTED MICRO-PROJECTS
Only one micro-project is planned to be undertaken by a student, assigned to him/her in the beginning of the semester. S/he ought to submit it by the end of the semester to develop the industry oriented COs. Each micro-project should encompass two or more COs which are in fact an integration of practicals, cognitive domain and affective domain LOs. The micro-project could be industry application based, internet-based, workshop-based, laboratory-based or field-based.
Each student will have to maintain a dated work diary consisting of individual contributions in the project work and give a seminar presentation of it before submission. The total duration of the micro-project should not be less than 16 (sixteen) student engagement hours during the course. In the first four semesters, the micro-project could be group-based. However, in higher semesters, it should be individually undertaken to build up the skill and confidence in every student to become a problem solver so that s/he contributes to the projects of the industry.

A suggestive list is given here. Similar micro-projects could be added by the concerned faculty:
a. Study of Hadoop in the Financial Sector / Healthcare Sector / Retail Sector / for Telecom Industry / for Building Recommendation System.
b. Load the data set and store it in a data-frame using Pandas and perform the following operations (replace(), interpolate()):
   • Remove the missing data using list-wise deletion
   • Remove the missing data using pair-wise deletion
   • Remove the missing data using forward filling
   • Check for duplicate values
   • Separate categorical and numerical data
c. Write a Pandas program:
   • To join the two given dataframes along columns.
   • To join the two given dataframes along rows and merge with another dataframe along the common column.
   • To join the two dataframes using the common column of both dataframes.
d. Create Hive table, load data into Hive table and execute the following Hive built-in functions on the given Hive table (Simple Functions, Aggregate Functions, Date Function).
e. Create an RDD:
   • Use the parallelize method of SparkContext. Create an Array of integers and pass that as an argument to the parallelize method.
   • Using an external data source.
   • Using an external data source HDFS.
   • Create an RDD of a numeric list. Then apply map(func) to multiply each element by 2.
f. Implement Matrix algorithms in SparkSQL programming.
g. Perform untyped DataFrame operations of SparkSQL (Select, Filter and Aggregate Operations) on a given dataset.

13. SUGGESTED LEARNING RESOURCES

S. No. | Title of Book | Author | Publication
1 | Big Data and Analytics | Seema Acharya, Subhashini Chellappan | Wiley India, ISBN: 978-81-265-7951-8, ISBN: 978-81-265-8836-7 (ebk)
2 | Big Data Science & Analytics: A Hands-On Approach | Arshdeep Bahga, Vijay Madisetti |
3 | Data Analytics Using Python | Bharti Motwani | Wiley India, ISBN: 978-81-265-0295-0, ISBN: 978-81-265-8965-4 (ebk)
4 | Practical Apache Spark - Using the Scala API | Subhashini Chellappan, Dharanitharan Ganesan | Apress, ISBN: 978-1-4842-3652-9

14. SOFTWARE/LEARNING WEBSITES
a. https://spark.apache.org/docs/latest/rdd-programming-guide.html (For practicals on Spark) (As on 18 April 2023)
b. https://www.simplilearn.com/what-is-big-data-analytics-article (As on 18 April 2023)
c. https://www.analyticsvidhya.com/blog/2021/06/implementing-python-to-learn-data-engineering-etl-process/ (As on 18 April 2023)
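The ETL phases listed in practical 3 of Section 6 (extract, transform, load, with a log call for each phase) can be sketched with the Python standard library as below. This is a minimal sketch under assumptions, not the method of the referenced ETL article: the record fields `name` and `height_in`, the inches-to-metres requirement, the CSV and JSON-lines source formats and all function signatures are hypothetical, and the download and unzip steps are left out.

```python
import csv
import json
from datetime import datetime
from pathlib import Path


def log(message, log_file):
    """Append one timestamped entry to the log file."""
    stamp = datetime.now().isoformat(timespec="seconds")
    with open(log_file, "a") as handle:
        handle.write(f"{stamp},{message}\n")


def extract(source_dir):
    """Collect records from every CSV and JSON-lines file in source_dir."""
    records = []
    for path in sorted(Path(source_dir).glob("*.csv")):
        with open(path, newline="") as handle:
            records.extend(csv.DictReader(handle))
    for path in sorted(Path(source_dir).glob("*.json")):
        with open(path) as handle:
            records.extend(json.loads(line) for line in handle if line.strip())
    return records


def transform(records):
    """Assumed requirement: convert height from inches to metres."""
    return [
        {"name": r["name"], "height_m": round(float(r["height_in"]) * 0.0254, 2)}
        for r in records
    ]


def load(records, target_file):
    """Write the transformed records to the target CSV file."""
    with open(target_file, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=["name", "height_m"])
        writer.writeheader()
        writer.writerows(records)


def run_etl(source_dir, target_file, log_file):
    """Run the three phases in order, logging the start and end of each."""
    log("Extract phase started", log_file)
    raw = extract(source_dir)
    log("Extract phase ended", log_file)

    log("Transform phase started", log_file)
    cleaned = transform(raw)
    log("Transform phase ended", log_file)

    log("Load phase started", log_file)
    load(cleaned, target_file)
    log("Load phase ended", log_file)
    return cleaned
```

Keeping each phase in its own function mirrors the extract() and transform() naming used in the practical and lets the log calls sit in one place, `run_etl`.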
