
9/23/2020

Text Technologies for Data Science


INFR11145

Introduction
Instructor:
Walid Magdy

23-Sep-2020

Lecture Objectives
• Know about the course:
• Topic
• Objectives
• Requirements
• Format
• Logistics

• Note:
• Not much technical content today
• Don’t assume the next lectures will be the same!

Walid Magdy, TTDS 2020/2021


Text Technologies for Data Science

• Text = documents, words, terms, …
• Text ≠ images, videos, music (with no text)
• Technologies: Information Retrieval, Search Engines, Text Classification, Text Analytics


What is Information Retrieval (IR)?

IR is NOT just Web search

What is IR?

Speech - QA

What is IR?

Information Filtering

Recommendation

Social search

What is IR?

Library (book) search (1950s)

What is IR?

Legal search


What is IR?

Cross-Language search

What is IR?

Content-based music search



*Source: Matt Lease (IR Course at U Texas)

What is IR?
Advertising
Query suggestion / correction
Snippet selection / summarisation
Categorisation (search verticals)


What is IR? Find?

IR ≠ Find
• Sequential
• Exact match
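
To make the contrast concrete, here is a minimal Python sketch, not taken from the slides. It assumes a made-up three-document collection and hypothetical helper names (`sequential_find`, `build_index`, `index_search`). The first function behaves like an editor's Find: it scans every document for an exact substring. The second approach looks terms up in a small term-to-documents index, so word order in the query does not matter.

```python
from collections import defaultdict

# Toy collection; the documents are made up for illustration.
docs = {
    1: "information retrieval is more than web search",
    2: "find scans text sequentially for an exact match",
    3: "search engines retrieve documents using an index",
}

def sequential_find(pattern, docs):
    """Find-style lookup: scan every document for an exact substring."""
    return [doc_id for doc_id, text in docs.items() if pattern in text]

def build_index(docs):
    """Map each term to the set of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():
            index[term].add(doc_id)
    return index

def index_search(query, index):
    """Return documents containing every query term, in any order."""
    postings = [index.get(term, set()) for term in query.split()]
    return sorted(set.intersection(*postings)) if postings else []

index = build_index(docs)
print(sequential_find("search web", docs))  # exact phrase only
print(index_search("search web", index))    # terms in any order
```

With this toy data, the exact-match scan finds nothing for "search web", while the index lookup returns document 1, which contains both terms.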

What is IR?
• IR is finding material of an unstructured nature that
satisfies an information need from within large
collections

• Find → Task
• Unstructured → Nature
• Information need → Target
• Satisfies → Evaluation


Text classification


What is text classification?


• Text classification is the process of classifying documents into predefined categories based on their content.

- Input: Text (document, article, sentence)
- Task: Classify into one/multiple categories
- Categories:
- Binary: relevant/irrelevant, spam/not spam, etc.
- Few: sports/politics/comedy/technology
- Hierarchical: patents
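
The input/task/categories description above can be illustrated with a small Python sketch. This is an assumption-laden toy, not the course's method: it uses a hand-written multinomial Naive Bayes classifier with add-one smoothing, four made-up training sentences, and hypothetical function names (`fit`, `predict`).

```python
import math
from collections import Counter, defaultdict

# Made-up training examples: (text, category).
train = [
    ("the team won the football match", "sports"),
    ("a late goal decided the match", "sports"),
    ("parliament passed the new budget", "politics"),
    ("the minister announced an election", "politics"),
]

def fit(examples):
    """Count documents per class and word occurrences per class."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        class_docs[label] += 1
        for word in text.split():
            word_counts[label][word] += 1
            vocab.add(word)
    return class_docs, word_counts, vocab

def predict(text, model):
    """Pick the class with the highest log-probability (add-one smoothing)."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in class_docs:
        score = math.log(class_docs[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.split():
            if word in vocab:  # ignore words never seen in training
                count = word_counts[label][word]
                score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = fit(train)
print(predict("the match ended with a goal", model))
print(predict("the election budget", model))
```

This is the "Few" categories case with a single label per document; a binary classifier is the same code with two classes.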


In this course, we will learn

• How to build a search engine
• which search results to rank at the top
• how to do it fast and on a massive scale
• How to evaluate a search algorithm
• is system A really better than system B?
• How to work with text
• do two tweets talk about the same topic?
• handle misspellings, morphology, synonyms
• How to classify text
• into categories (sports, news, comedy, …)
• which features to use
• how to evaluate classification quality
• How to apply text analytics
• find what makes a set of documents different from others

How is this course different from others?


• ANLP, FNLP
• Some text processing
• Text laws
• No NLP (word/phrase level vs document level)
• ML practical
• Text classification
• No ML (using off-the-shelf ML tool)

• It does not overlap with others on:


• Search engines
• IR methods/models
• IR evaluation
• Text analysis


Some terms you will learn about


• Inverted index
• Vector space model
• Retrieval models: TF-IDF, BM25, LM
• PageRank
• Learning to rank (L2R)
• MAP, MRR, nDCG
• Mutual information, information gain, Chi-square
• SVMs: binary/multiclass classification, ranking, regression
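
As a preview of two of the evaluation terms above, the following Python sketch computes MRR (mean reciprocal rank) and MAP (mean average precision) using their standard definitions. The ranked lists, document IDs and relevance sets are hypothetical, and the function names are chosen for illustration only.

```python
def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def average_precision(ranked, relevant):
    """Sum of precision@k at each rank k holding a relevant document,
    divided by the total number of relevant documents."""
    hits, precision_sum = 0, 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0

# Hypothetical results for two queries: (ranked list, set of relevant documents).
runs = [
    (["d3", "d1", "d7", "d2"], {"d1", "d2"}),
    (["d5", "d9", "d4"], {"d5"}),
]

# MRR and MAP are the per-query values averaged over all queries.
mrr = sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)
map_score = sum(average_precision(r, rel) for r, rel in runs) / len(runs)
print(f"MRR = {mrr:.3f}, MAP = {map_score:.3f}")
```

For the first query the first relevant document is at rank 2, so its reciprocal rank is 0.5; the second query has its only relevant document at rank 1.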


This Course is Highly Practical


• 70% of the mark is on practical work
• You will implement 50+% of what you learn
• By W5, you should have developed a basic working search engine from scratch
• Practical lab every week
• Two pieces of coursework, mostly coding
• A course group project to develop a full system


Prerequisites (1/3)
• Maths requirements:
• Linear algebra: vectors/matrices (addition, multiplication, inverse, projections, etc.).
• Probability theory: Discrete and continuous univariate random variables. Bayes rule. Expectation, variance. Univariate Gaussian distribution.
• Calculus: Functions of several variables. Partial differentiation. Multivariate maxima and minima.
• Special functions: Log, Exp, Ln.


Prerequisites (2/3)
• Programming requirements:
• Python or Perl
• Knowledge of regular expressions
• Shell commands (cat, sort, grep, uniq, sed, ...)
• Data structures and software engineering for the course project.
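
As a rough indication of the level of Python and regular-expression knowledge involved, here is a short sketch that tokenises a string and counts term frequencies. The sample sentence and the token pattern are assumptions made for illustration, not a tokeniser prescribed by the course.

```python
import re
from collections import Counter

# Made-up sample text.
text = "Text technologies: search engines, text classification, and text analytics."

# Lowercase the text and extract runs of letters/digits as tokens.
tokens = re.findall(r"[a-z0-9]+", text.lower())

# Count occurrences of each token, roughly what `sort | uniq -c | sort -rn`
# does for a file with one token per line.
counts = Counter(tokens)
for term, freq in counts.most_common(3):
    print(freq, term)
```

The same counting can be done with the shell commands listed above; the Python version is easier to extend once the text needs further processing.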


Prerequisites (3/3)
• Team-work requirement:
• The final course project will be done in groups of 5-6 students.
• Working in a team for the project is a requirement.
• No exceptions will be allowed!


Skills to be gained!

• Working with large text collections
• A few shell commands
• Some Python programming
• Software engineering skills
• Building a text classifier in a few minutes
• TEAM WORK
• Project management
• Time management
• Task assignment + system integration


Course Structure
• 20 Lectures:
• 2 lectures → Introduction (today)
• 14 lectures → IR (50% practical lectures)
• 4 lectures → Text Analytics/Classification

• 8-10 Labs:
• Practice what you learn

• No Tutorials
• Some self-reading
• Lots of system implementation
• A few online videos


Course Instructors

• Walid Magdy: Reader (14 lectures)
• Steve Wilson: Research Associate (5 lectures)
• + 1 guest lecture

Lecture Format
• 2 Lectures at a time
• Questions are allowed any time. Feel free to interrupt
• 5-10 mins break after L1
• Feel free to go out and come back
• Discuss 1st lecture with friends
• Questions on L1 are allowed before starting L2
• Mind teaser math problem (for fun)
• Some lectures are interactive. Please participate
• Some lectures will include demos (running code)
• 2 tutorial lectures about using tools


Labs
• Online!
• How will it work?
• The relevant lab will be announced with each lecture on Wednesday
• You should implement the lab directly after the lecture
• Any issues → ask on Piazza (tag the question with the lab number)
• Produced output → share on Piazza (publicly)
• Demonstrators → answer questions + validate your output
• DO NOT ask a question before checking whether it has been asked before
• Tuesdays → optional Teams meetings for those who still require support

• Live lab times: Tuesday, 10am, 12pm, 6pm


• Demonstrators:
Marina Posti, Maysara Hammouda and Zheng Zhao


Assessments
• Coursework 1: 10%
The same as labs 1-3 → Build your first search engine
• Coursework 2: 20%
IR Evaluation, Text classification/analytics
• Group project: 40%
A full running search engine supported by text technologies
• Final Exam: 30%


Group Project
• The largest weight: 40% of the total mark
• Teamwork → groups of 5-6 (you select your group)
• Design a full end-to-end search engine that searches a large
collection of documents with many functionalities.
• Mark:
• 50% on project → the same for all team members
- How complete/effective/fast/nice is your search engine?
• 50% on individual contribution → different for each member
- How useful/much is your contribution? (Mark can be negative!)

• Project prize → a prize will be awarded to the best project


Example
• A search engine that retrieves quotes of movies and TV shows.
• Collection size: 77 million movie quotes
• Search options
• Phrase search of quotes
• Movie info search
• Advanced search: movie title, actor/actress, years, keywords
• Query suggestion
• Classifying results by genre
• Demo: http://167.71.139.222/


Timeline
• 2 Semesters (or one?)

[Timeline figure spanning Semester 1 and Semester 2. Labelled items: Lectures, Labs, Exam, CW 1 & 2, Group Project. Week markers: W5, W11, W11.]



Logistics (1/2)
• Lectures:
• Live on 2 Wednesdays, 12.00-14.30 (some exceptions might occur)
• Recording will be available
• Handouts to be posted on the day of the lecture
• Course webpage:
• Link: http://www.inf.ed.ac.uk/teaching/courses/tts/
• Handouts, Labs, CW details
• Learn:
• Lecture recordings
• Deadlines


Logistics (2/2)
• Piazza:
• All communication will be there
• Questions about lectures/labs/CW are there
• Feel free to answer each other's questions
• Lab support will be mainly there
• Please share your lab answers there
• Join NOW: link

• Microsoft Teams:
• Live lab support will be there
• Join NOW: link


FAQ
• How will the project be managed? What if one member does not work?
• I am not that solid in programming. Should I take this course?
• Can I audit the course?

• Anything else?


Next Lecture
• Definitions of the main IR concepts (more introduction)
