AnandKumar Course Intro IT356
Course Plan:
• Introductory concepts of Linguistic systems, Language Modeling
and Sequence tagging, Word stemming, tokenization,
normalization, Part of Speech tagging, Traditional models of
distributional semantics,
• Unstructured Text Management, Word and Sentence embeddings,
n-gram models, Maximum Entropy models, Hidden Markov
Models, Viterbi Algorithm, Neural Language Models;
• Information Extraction, Named Entity Recognition, Relation
Extraction; Understanding Semantics, word sense and word
similarity, Lesk Algorithm, Wordnets, Topic Modeling, Dialog
Systems,
• Emerging trends, Research issues, challenges, interesting
applications in various domains.
Texts and References:
∙ Daniel Jurafsky and James H. Martin. "Speech and Language Processing: An Introduction to
Natural Language Processing, Computational Linguistics and Speech Recognition". Second
Edition. Prentice Hall, 2008.
∙ Christopher D. Manning and Hinrich Schütze. "Foundations of Statistical Natural Language
Processing". MIT Press, 1999.
∙ Peter D. Turney and Patrick Pantel. "From Frequency to Meaning: Vector Space Models of
Semantics". Journal of Artificial Intelligence Research 37 (2010): 141-188.
∙ IMPORTANT NOTE:
1. Course Mini / Minor Project Proposal - Aug 16th, 2023
2. Mid Sem Project Progress Presentation - Sep 11th, 2023
3. Final Project Presentation and Demo - Oct 30th, 2023
Analysis of Algorithms 4
Books etc.
• Main Text(s):
– Speech and NLP: Jurafsky and Martin
– Foundations of Statistical NLP: Manning and Schütze
• Journals
– Computational Linguistics, Natural Language Engineering, AI,
AI Magazine, IEEE SMC, TALIP, Computer Speech and
Language
• Conferences
– ACL, EACL, COLING, MT Summit, EMNLP, IJCNLP, HLT,
ICON, SIGIR, WWW, ICML, ECML, *ACL, FIRE, SPELLL
Assessment Type and COs
Evaluation Plan
• Course Mini / Minor Project: 30%
• Continuous Evaluation & Assignments: 20%
• Mid Sem Exam: 20%
• End Sem Exam: 30%
Course Minor Project (30%)
• IEEE/ACM Reputed Journals as base papers
• Core Conferences/ Shared Tasks
• Implementation
– Title/Topic/Team Proposal (5)
– Midsem Eval (10)
– End sem Eval (15)
• Plagiarism-free report (not AI-generated)
• Conference/Journal Publication (Bonus Marks)
Collaboration Works
• Winnipeg University, Canada
• University of Galway, Ireland
• Legal Summarization – NIT Trichy
• Telugu NLP – NIT-AP / IIIT-Hyd
• LLMs – Eduminster US
• Conversation System – ISRO
Some open Topics
• Finance NLP
• Medical Documents – Clinical NLP
• Education Documents – NLP
• Social Media Comments – Depression – Mental
Well-being
• Legal Documents – Ontology – Document
Retrieval
• LLMs – Llama 2 – ChatGPT
• Conversation System – QA – Chatbot
Some open Topics
• Sign Language Translation
• Financial Document Causality Detection
• Multimodal Argument Mining
• Violence Inciting Text Detection
• Multi-lingual Multi-task Information
Retrieval
• Ontology-based SenticNet
NLP with AI and Deep learning
https://marutitech.com/use-cases-of-natural-language-processing-in-healthcare/
NLP in education
• Innovative Education Applications
• Educational Chatbots
• Automatic Essay/answer Grading – Quality
assessment
• Automatic Question/ Exercise generation
• Behavior analytics.
NLP for Finance and Agriculture
• Sentiment Analysis – Stock Prediction
• Chatbots for Financial/Investment suggestions
• Chatbots for Farmers (Regional Languages)
• Discovering crop disease trends using farmer
queries
• Terminology Extraction for Document
Matching and Open Data in the Agricultural
Domain
Sub domains
Bio-NLP
• Open Problems
https://towardsdatascience.com/summarising-the-latest-research-on-coronavirus-with-nlp-and-topic-modelling-28b867ad9860
The NLP Research Community
• Papers
– ACL Anthology has nearly everything, free!
• Over 60,000 papers!
• Free-text searchable
– Great way to learn about current research on a topic
– New search interfaces currently available in beta
» Find recent or highly cited work; follow citations
• Used as a dataset by various projects
– Analyzing the text of the papers (e.g., parsing it)
– Extracting a graph of papers, authors, and institutions
(Who wrote what? Who works where? What cites what?)
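As a rough illustration of the "graph of papers, authors, and institutions" idea, the sketch below builds two small lookup tables (who wrote what, what cites what) from paper records. The record layout, the function name `build_graph`, and all paper IDs and author names are made-up placeholders, not real Anthology data or an Anthology API.

```python
from collections import defaultdict

# Hypothetical records; a real project would extract these fields
# from the papers' metadata and reference lists.
papers = [
    {"id": "P1", "authors": ["Author A", "Author B"], "cites": []},
    {"id": "P2", "authors": ["Author B"], "cites": ["P1"]},
    {"id": "P3", "authors": ["Author C"], "cites": ["P1", "P2"]},
]

def build_graph(records):
    """Return (author -> papers written, paper -> papers that cite it)."""
    wrote = defaultdict(set)
    cited_by = defaultdict(set)
    for rec in records:
        for author in rec["authors"]:
            wrote[author].add(rec["id"])
        for target in rec["cites"]:
            cited_by[target].add(rec["id"])
    return wrote, cited_by

wrote, cited_by = build_graph(papers)
print(sorted(wrote["Author B"]))   # papers by Author B
print(sorted(cited_by["P1"]))      # papers citing P1
```

An institution layer ("Who works where?") could be added the same way with an extra author-to-affiliation mapping.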
The NLP Research Community
• Conferences
– Most work in NLP is published as 9-page conference papers
with 3 double-blind reviewers.
– Main annual conferences: ACL, EMNLP, NAACL
• Also EACL, IJCNLP, COLING … and LREC!
• + various specialized conferences and workshops
– Big events, and growing fast! ACL 2020:
• > 2000 attendees
• 2244 full-length papers submitted (25% accepted)
• 1185 short papers submitted (18% accepted)
• 19 workshops on various topics
• “Best paper” awards – worth reading these papers
The NLP Research Community
• Datasets
– Raw text or speech corpora
• Or just their n-gram counts, for super-big corpora
• Various languages and genres
• Usually there’s some metadata (each document’s date, author, etc.)
• Sometimes licensing restrictions (proprietary or copyright data)
– Text or speech with manual or automatic annotations
• What kind of annotations? That’s the rest of this lecture …
• May include translations into other languages
– Words and their relationships
• Morphological, semantic, translational, evolutionary
– Grammars
– World Atlas of Language Structures
– Parameters of statistical models (e.g., grammar weights)
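To make "n-gram counts" concrete: for very large corpora, a dataset may ship only a table of how often each word sequence occurs instead of the raw text. A minimal sketch, assuming whitespace tokenization and a toy sentence (both are simplifications chosen for illustration; `ngram_counts` is a hypothetical helper name):

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every contiguous sequence of n tokens."""
    return Counter(
        tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )

# Toy corpus; real corpora need proper tokenization and normalization.
text = "the cat sat on the mat and the cat slept"
tokens = text.split()

unigrams = ngram_counts(tokens, 1)
bigrams = ngram_counts(tokens, 2)

print(unigrams[("the",)])        # frequency of "the"
print(bigrams[("the", "cat")])   # frequency of "the cat"
```

Counts like these are the raw material for the n-gram language models listed in the course plan.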
The NLP Research Community
• Datasets
– Read papers to find out what datasets others are using
• Linguistic Data Consortium (searchable) hosts many large datasets
• Many projects and competitions post data on their websites
• But sometimes you have to email the author for a copy
– CORPORA mailing list is also a good place to ask around
– LREC Conference publishes papers about new datasets & metrics
– Amazon Mechanical Turk – pay humans (very cheaply) to annotate your
data or to correct automatic annotations
• Old task, new domain: Annotate parses etc. on your kind of data
• New task: Annotate something new that you want your system to find
• Auxiliary task: Annotate something new that your system may benefit from
finding (e.g., annotate subjunctive mood to improve translation)
– Can you make annotation so much fun or so worthwhile
that they’ll do it for free?
Thank You