Course catalogue | Università degli Studi di Firenze

Course year

First year - Second Semester

Department

Information Engineering (DINFO)

Course Type

Single education field course

Scientific Area

ING-INF/05 - INFORMATION PROCESSING SYSTEMS

Credits

6

Teaching Hours

48

Teaching Term

28/02/2022 ⇒ 17/06/2022

Attendance required

No

Type of Evaluation

Final Grade

Lecturer

MARINAI SIMONE

Shared course

Course taught as:
B031285 - DATA MINING
Second Cycle Degree in ARTIFICIAL INTELLIGENCE

Teaching Language

Lectures are in Italian, but all the teaching material is in English

Course Content

Data mining, clustering, locality sensitive hashing, frequent itemsets, text mining, linguistic pre-processing, language models, word embeddings

Learning Objectives

The course first aims at introducing the main Data Mining techniques that allow you to model large amounts of data and extract useful information.
Secondly, we consider the problems that arise when extracting information from, and indexing, both textual and non-textual documents. To this end we introduce the main models and algorithms in Information Retrieval and Natural Language Processing.

Prerequisites

Knowledge of the topics typically taught in Algorithms and Data Structures courses is essential. Some knowledge of Machine Learning is useful.

Teaching Methods

Classes, homework.

Further information

Oral exams are usually held after the report has been completed.

Type of Assessment

Study and presentation of one research paper to the class. Writing of a short report on the studied topic. Oral exam.

Course program

Data Mining
Data warehouse. Hardware. Disk organization. Access times.

Distributed file systems. MapReduce, word count, matrix-vector and matrix-matrix multiplication with MapReduce.

The market-basket model. Association rules. Algorithms for computing frequent itemsets and association rules. Hash-based filtering. PCY algorithm, random sampling, SON algorithm, Apriori with MapReduce. Bloom filters.

Finding similar items. Document similarity, shingling, min-hashing
Locality sensitive hashing (LSH)
Families of hash functions. LSH for cosine distance. LSH for Euclidean distance.

Curse of dimensionality. Distance measures.
Clustering. Hierarchical clustering, k-means clustering, SOM clustering.
BFR algorithm, CURE algorithm. Dimensionality reduction. Principal Component Analysis (PCA). Singular Value Decomposition (SVD)

Text Mining. Information Retrieval. Boolean and Vector Space Model
Linguistic pre-processing: tagging, stop-word removal, lemmatization, stemming. Wildcard queries. N-grams. Edit distance. Vector space model. TF-IDF. Inverted index.
Spelling correction. Performance evaluation in Information Retrieval (Precision, Recall).
Probabilistic language models. Text classification. Word meaning, vector semantics. Dense embeddings. Part-of-speech (POS) tagging. Named-entity (NE) recognition.

B031358 - DATA MINING

Academic Year 2021-22
