Main textbooks:
A. Rajaraman, J. D. Ullman, Mining of Massive Datasets, Cambridge University Press, 2011
D. Jurafsky, J. H. Martin, Speech and Language Processing, 2020
D. Sarkar, Text Analytics with Python, Apress, 2019
Additional books:
I. Witten, Text Mining, 2004
C. D. Manning, P. Raghavan, H. Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008
Details on the availability of the books are on the course's Moodle page.
Learning Objectives
The course first aims to introduce the main Data Mining techniques for modeling large amounts of data and extracting useful information from them.
Secondly, we consider the problems that arise when extracting information from, and indexing, both textual and non-textual documents. To this end we introduce the main models and algorithms in Information Retrieval and Natural Language Processing.
Prerequisites
Knowledge of the topics typically taught in Algorithms and Data Structures classes is essential. Some knowledge of Machine Learning can be useful.
Teaching Methods
Classes, homework.
Further information
Oral exams are usually held after the report has been completed.
Type of Assessment
Study of one research paper and its presentation to the class. A short written report on the studied topic. Oral exam.
Course program
Data Mining
Data warehouse. Hardware. Disk organization. Access times.
Distributed file systems. MapReduce. Word count, matrix-vector and matrix-matrix multiplication with MapReduce.
The market-basket model. Association rules. Algorithms for computing frequent itemsets and association rules. Hash-based filtering. PCY algorithm, random sampling, SON algorithm, Apriori with MapReduce. Bloom filters.
Finding similar items. Document similarity, shingling, min-hashing
Locality sensitive hashing (LSH)
Families of hash functions. LSH for cosine distance. LSH for Euclidean distance.
Curse of dimensionality. Distance measures.
Clustering. Hierarchical clustering, k-means clustering, SOM clustering.
BFR algorithm, CURE algorithm. Dimensionality reduction. Principal Component Analysis (PCA). Singular Value Decomposition (SVD)
Text Mining. Information Retrieval. Boolean and Vector Space Model
Linguistic pre-processing: tagging, stop-word removal, lemmatization, stemming. Wildcard queries. N-grams, edit distance. Vector space model. TF-IDF. Inverted index.
Spelling correction. Performance evaluation in Information Retrieval (Precision, Recall).
Probabilistic language models. Text classification. Word meaning, vector semantics. Dense embeddings. Part-of-speech (POS) tagging. Named entity (NE) recognition.