Intelligent Data Analysis / Intelligente Datenanalyse (IDA)

Lecture and labs in the summer semester 2016. Prof. Dr. Tobias Scheffer, Gerrit Gruben, Nuno Marques, Achmed Abdelwahab.


Dates

The course is worth 6 ECTS credits.

It was agreed to move the German lab from 16 June to Monday, 13 June, 16:00–18:00, in the same room.

There will be no lab session on Tuesday, 10 May. The German lab covering homework 3 will take place on Thursday, 12 May.

There are no lab sessions in the first lecture week (11–15 April 2016).

If you have already taken the course "Intelligente Datenanalyse in Matlab", note that the final exam for this lecture will include the content of Chapters 2, 3, 4, and 6 of the Bishop textbook.



Syllabus

This course gives a comprehensive introduction to the field of intelligent data analysis. While methods of data analytics focus on describing the status quo, we use machine learning algorithms to build predictive models that generalise to unseen data and previously unknown concepts.

The applications of these methods are both wide and deep, ranging from predicting credit risk, evaluating astronomical data, and detecting malware in HTTPS traffic logs to personalised recommendation of music, movies, or products.

More precisely, we cover decision-tree-based models, linear discriminative models, support vector machines, and basic clustering methods, and we include an introduction to Bayesian statistics and kernel methods. Additionally, a basic treatment of statistical model evaluation and selection is given. The nature of the field requires a good command of linear algebra, statistics and probability theory, and optimisation techniques, as well as fundamental knowledge of calculus (high-school level).

In the accompanying labs and exercises, students apply the methods they have learned, using Python and its ecosystem of libraries (e.g. NumPy, Pandas, scikit-learn), to data sets from various domains.
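As an illustrative sketch of this workflow (the dataset and hyperparameters below are placeholders, not taken from the course material), a typical lab exercise with scikit-learn might look like this:

```python
# Illustrative sketch: train and evaluate a random forest with scikit-learn.
# The data set (Iris) and the hyperparameters are placeholders, not lab data.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

# Evaluate on the held-out test split.
acc = accuracy_score(y_test, clf.predict(X_test))
print(f"test accuracy: {acc:.2f}")
```

The same fit/predict pattern carries over to the other model families covered in the lectures (linear models, SVMs, clustering).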


Literature / Auxiliary

Machine Learning: Pattern Recognition and Machine Learning by Christopher Bishop (plenty of copies in the library).

Tutorials Linear Algebra: 1, 2, 3, 4.

Gradient Descent: 1, 2.

Foundations of Mathematics [German]: [Slides] [Recording]
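As a companion to the gradient descent tutorials above, here is a minimal one-dimensional sketch; the objective f(w) = (w - 3)^2 is purely illustrative and not a formula from the lecture:

```python
# Minimal gradient descent sketch: repeatedly step against the gradient.
# Objective (illustrative): f(w) = (w - 3)^2, with gradient f'(w) = 2*(w - 3).
def gradient_descent(grad, w0, lr=0.1, steps=100):
    w = w0
    for _ in range(steps):
        w -= lr * grad(w)  # move a small step downhill
    return w

w_star = gradient_descent(lambda w: 2 * (w - 3), w0=0.0)
print(w_star)  # converges towards the minimum at w = 3
```

With learning rate 0.1 the error shrinks by a factor of 0.8 per step, so 100 steps bring w within about 1e-9 of the minimum; too large a learning rate would make the iterates diverge instead.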



Lecture Material

1. Models, Data, Learning Problems
2. Introduction to Python
3. Problem Analysis and Data Preprocessing
4. Decision Trees and Random Forests
5. Linear Models
6. Kernel Methods
7. Bayesian Learning
8. Neural Networks
9. Model Evaluation
10. Semester projects

The lecture on July 11 will focus on the semester projects. Students can present and discuss their ideas and ask questions regarding the projects.

11. Recap and exam questions

The lecture on July 18 will revisit the topics of the course. Students can ask questions regarding the exams. There will be another question answering session during the lab on July 19.



Projects

The projects will be handed out in the lab sessions in the respective language of the lab. Please write to Gerrit if you have questions, need help, or would like to request an English translation.

1. Feuer Naturpark/Fire in nature reserve: [Description] [zip]

2. Spam: [Description] [zip]

3. Versicherung/Insurance: [Description] [zip]

4. Kredit/Credit: [Description] [zip]

5. Einkommen/Income: [Description] [zip]

6. Krebserkrankung/Cancer: [Description] [zip]


Labs

1. Intro to Python: [zip]

2. Problem Analysis and Data Preprocessing: [zip]

3. Decision Trees: [zip]

4. Random Forests: [zip]

5. Linear Models (Classification): [zip]

6. Linear Models (Regression): [zip]

7. Kernel: [zip]

8. Bayesian Learning: [zip]

9. Logistic Regression: [zip][Exercise 4]
(Note: if h5py was not installed by the staff, try conda install h5py; if that does not work, try pip install --user h5py instead.)

10. Neural Networks: [zip]

11. Evaluation: [zip][Selected Solutions]

On July 19, there is a question answering lab session. There will be no lab session on July 21.


Extra material mentioned in the labs

[1] Loss function tumblr

[2] Platt scaling (to get confidence/class-probability outputs from an SVM)

[3] Shuffling Data, Algorithm P

[4] Detecting Terrorists (what improper experimental design and data interpretation look like)

[5] A few useful things to know about Machine Learning

[6] Convolutional Neural Networks in the browser (the interactive classification of the toy data illustrates how the network transforms the geometry of the representation)

[7] Dropout Paper (2014)

[8] Do we need hundreds of classifiers to solve real world problems?
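Item [3]'s Algorithm P is Knuth's name for the Fisher–Yates shuffle, which is short enough to sketch here; this implementation is illustrative and not the code handed out in the lab:

```python
# Knuth's Algorithm P (Fisher-Yates shuffle): uniform in-place shuffle.
import random

def fisher_yates_shuffle(items, rng=random):
    """Shuffle items in place; every permutation is equally likely."""
    for i in range(len(items) - 1, 0, -1):
        j = rng.randrange(i + 1)               # pick j uniformly from 0..i
        items[i], items[j] = items[j], items[i]  # swap into final position
    return items

data = list(range(10))
fisher_yates_shuffle(data)
print(data)  # a permutation of 0..9
```

The key point, also stressed in the linked material, is that j is drawn from 0..i (including i itself); drawing it from the full range 0..n-1 at every step would bias the shuffle.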