Dr. Saravanan Thirumuruganathan, an expert in AI and data mining, introduces a course focused on practical applications of data science using Python. The course will utilize Moodle for communication, feature programming assignments, and cover various topics including visualization, classification, and clustering. Grading will be based on programming assignments, tests, and a final exam, with an emphasis on collaborative learning and applied techniques in an enterprise context.
01 Course Logistics
Course Intro and Logistics
Dr. Saravanan (Sara) Thirumuruganathan

My Background
• MS/PhD from University of Texas at Arlington
• Prev: 2016-2022: Senior Scientist at QCRI, Qatar
• Prev: 2020-2024: Co-founder and Chief Scientist of a Customer Onboarding startup
• Current: 2024- : Co-founder and CEO of a startup building customer engagement platform (chatbots and more) using AI/LLMs
• Other: Consultant to multiple governments, international organizations and large enterprises

(Research) Interests
• SLMs and LLMs
• Deep/Machine learning, Artificial Intelligence, Data mining
• Data Integration
• Cybersecurity
• Python, Golang, Typescript aficionado
• And purveyor of very many languages and frameworks
• Lectiophile especially of Fantasy, Science Fiction and Comics

Course Details
• Moodle LMS
  • Slides, Forums, etc
  • This will be the primary way to talk about the course
  • So if you have course related comments / doubts etc, please use Moodle
• Contact
  • [email protected]
  • Use this for non-public correspondence
• Office Hours
  • Before or after class
  • By appointment

Textbooks
• There is no single book to cover all course topics
• Slides will be the primary reference material
• The instructor will also share some reading materials (chapters from some book, research papers, blog posts etc)
• All the textbooks will be online and publicly available
• You are welcome to buy the books, but it is not needed

Textbooks (continued)
• [MMDS] Mining of Massive Datasets by Jure Leskovec, Anand Rajaraman, Jeff Ullman
  • http://www.mmds.org/
• [ISLP] An Introduction to Statistical Learning by Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani, Jonathan Taylor
  • https://www.statlearning.com/
• [IDM] Introduction to Data Mining by Pang-Ning Tan, Michael Steinbach, Anuj Karpatne, Vipin Kumar
  • https://www-users.cse.umn.edu/~kumar001/dmbook/index.php
• [IIR] Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze
  • https://nlp.stanford.edu/IR-book/

Grading
• 40% (four) Programming Assignments
• 20% Test 1 and Test 2
• 40% Final exam (comprehensive)

Topics relevant to Programming Assignments
• Scientific Python: numpy, pandas/modin, sklearn
• Env management: pyenv, pipx, poetry/pdm/rye
• Code quality: pylint, ruff, type annotations, pyright, pydantic, pytest
• Profiling/Debugging: scalene, wat, pyinstrument, loguru
• Web stuff: httpx, FastAPI
• Databases: sqlite, duckdb, PRQL
• Misc: tqdm, click/argparse
• Data formats: JSON, TOML, YAML
• Viz: matplotlib, seaborn, altair

Programming Assignments
• Goal: expose you to important data mining tools to make you productive as a data scientist
• Team based, 1-3 members
• Coding will be in Python
• All of them will be intensive
• Startup code and testing code will be provided
• Budget approx. 20 hours per assignment (Learning + Coding time)
• Teams might require less time based on how you split the tasks

Programming Assignments (continued)
• Start early
• Find good team members
• Okay to change teams per project
• Everyone in a team gets the same score
• Collaboration/Brainstorming is okay!
• Plagiarism is not ☺

Course Topics: Design Goals
• Breadth of topics rather than depth
• Biased selection based on how useful they are in an enterprise setting
• Applied focus: given a technique
  • When to use it?
  • How to use it?
  • When not to use it?

Course Topics
• Visualization
• Pattern mining
• Finding similar items
• Classification
  • Simple models: Decision trees, kNN, Naïve Bayes
  • Ensemble models: Random forests, Boosting and Bagging
  • Model Comparison and Evaluation
• Clustering + Dimensionality Reduction
• Recommenders
• Data mining in the wild: sampling, simulations, hypothesis testing, MAB, A/B
• Presenting data mining results: narrative storytelling
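
As a rough illustration of the Grading split listed earlier (40% programming assignments, 20% tests, 40% final exam), the overall score could be computed along these lines. This is only a sketch: the function name `course_score`, the 0-100 scale, equal weighting of the four assignments, and an equal split of the 20% between Test 1 and Test 2 are assumptions, not something the slides state.

```python
from statistics import mean

# Weights taken from the Grading slide; they sum to 1.0.
WEIGHTS = {"assignments": 0.40, "tests": 0.20, "final": 0.40}


def course_score(assignments, tests, final):
    """Return a weighted course score on a 0-100 scale.

    Assumption (not stated in the slides): each assignment counts equally
    within the 40% block, and each test counts equally within the 20% block.
    """
    return (
        WEIGHTS["assignments"] * mean(assignments)
        + WEIGHTS["tests"] * mean(tests)
        + WEIGHTS["final"] * final
    )


if __name__ == "__main__":
    # Hypothetical scores, for illustration only.
    print(course_score([90, 80, 85, 95], [70, 80], 88))
```

Because the final exam and the assignments carry equal weight, a weak final can offset strong assignment work, and the reverse also holds.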