Research

Our work is focused on machine learning: the problem of automatically building models which explain systems that are being observed and predict their future behavior. Current research addresses adversarial learning problems, transfer learning, model-building for scientific problems, machine learning for computer security, and pattern recognition. In adversarial learning, an adversary exercises some control over the data-generation process; this reflects many security applications of machine learning. Transfer learning algorithms learn to perform a task, but do so using training data that reflect a different task. Machine learning has many diverse applications, and we are working on some of them: security (spam, phishing, botnets), model-building in the sciences, and face recognition.

Transfer learning, learning when training and test distributions differ.
Adversarial learning, machine learning for computer security.
Model-building in the sciences.
Face recognition, object recognition.

Funded Research Projects

Malware Detection by Analyzing HTTPS Logs

Principal Investigator: Tobias Scheffer
Funding: Cisco Systems
Duration: since 2016

In this project, we develop a technology that can identify malicious software by analyzing patterns in encrypted network communication.

Not Too Long; Did Read

Principal Investigator: Tobias Scheffer
Funding: Golem Media
Duration: since 2016

The goal of this project is to create a technology that is able to automatically expand documents by identifying related information that fits a given context.

Predictive Marketing

Principal Investigator: Tobias Scheffer
Funding: Datalovers AG
Duration: since 2015

We create a technology that analyzes CRM data and a wide range of corporate data in order to identify leads with a high chance of conversion.

Risk Analysis for Peer-to-Peer Loans

Principal Investigator: Tobias Scheffer
Funding: Bitbond GmbH
Duration: since 2015

In this project, we analyze the risk of online peer-to-peer loans based on PayPal and eBay account histories and other available information.

Embedded Face Identification

Principal Investigator: Tobias Scheffer
Funding: Asaphus Vision GmbH and IBB Business Team
Duration: 2015-2016

The goal of this project is to develop and improve algorithms for efficient landmark localization and real-time face identification on embedded systems.

Model Building from Experimental Data: Machine Learning and Model Evaluation with Non-IID Data

Principal Investigator: Niels Landwehr
Funding: German Science Foundation DFG, Emmy Noether Program (LA-3270/1-1)
Duration: 2013-2018

The project studies machine learning approaches for building predictive models from observational data in the sciences. Because of the way observational data is collected -- by longitudinal studies, by pooling different data sources, or by experimental protocols that influence what we can observe -- these data have specific statistical characteristics that often violate assumptions made in standard machine learning methods. We study how such characteristics affect the theoretical analysis and empirical results of machine learning methods, and how we can derive methods that better account for them. We also practically study the inference of predictive models from observational data in collaboration with researchers from cognitive psychology and geophysics.

For more details see the research group on Machine Learning and Scientific Data Analysis.

Prediction Games: Parallel Robust Machine Learning Algorithms

Principal Investigator: Tobias Scheffer
Funding: German Science Foundation DFG
Duration: 2014-2017

Machine learning addresses problem settings that involve automatic model building from data and predicting the future behavior of the system that is reflected in the data. Most results of this field are based on the assumption that available data and future behavior of the system are governed by the same probability distribution. This assumption is an oversimplification for applications in which an active adversary can influence the future behavior of the system. This is the case, for instance, for detecting phishing emails and fraudulent credit card transactions.

In the preceding project, we have modeled adversarial learning problems using paradigms of game theory. We have been able to identify conditions under which non-zero-sum prediction games have a unique equilibrium point; such points are an optimal solution when learner and adversary both aim at minimizing their own cost functions. We have derived primal and dual learning algorithms for static prediction games in which learner and adversary act simultaneously, without knowing the action which their respective opponent chooses. We also derived learning methods for learning problems in which the adversary can react to the model chosen by the learner and for learning problems in which the learner has some uncertainty about the exact cost function which the adversary is trying to minimize. In the context of email spam filtering, we can observe empirically that predictive models generated by game-theoretic learning algorithms maintain high accuracy for longer periods of time than models generated by learning algorithms that do not account for an active adversary. However, game-theoretic learning algorithms have to solve complex optimization problems; these methods are not immediately practical for large data sets.

In the succeeding project, we want to focus on scalable solutions to robust learning problems that can be executed in parallel on GPU and cluster architectures. The highest degree of parallel execution can be attained by algorithms that first solve subproblems entirely in parallel, and then aggregate the solutions to these subproblems into a total model in a final, single aggregation step. However, not all optimization problems can be solved by algorithms which have this structure. The goals of the project therefore are the theoretical analysis of this approach to parallel, robust learning and the development and empirical analysis of scalable, parallel, robust – in particular, game-theoretic – machine learning algorithms. The analysis of the convergence of algorithms which follow this structure towards the optimal solution of the underlying optimization problem is a central element of investigation. We will explore several approaches to splitting robust learning problems into subproblems that can be solved in parallel. We will study properties of parallel solutions to zero-sum, static non-zero-sum, and Stackelberg prediction games both theoretically and empirically.
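The "solve subproblems independently, then aggregate once" structure described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the learner is a plain one-dimensional ridge regression through the origin rather than a game-theoretic or robust method, the aggregation step is simple parameter averaging, and the names fit_shard, one_shot_average, and make_data are hypothetical. The thread pool only shows the control flow; it does not provide real speed-ups in Python.

```python
from concurrent.futures import ThreadPoolExecutor
import random


def fit_shard(shard, reg=1e-3):
    """Ridge regression through the origin on one shard.

    Closed form for a single weight: w = sum(x*y) / (sum(x*x) + reg).
    """
    sxy = sum(x * y for x, y in shard)
    sxx = sum(x * x for x, _ in shard)
    return sxy / (sxx + reg)


def one_shot_average(data, n_shards=4):
    """Solve every shard independently, then aggregate in one final step."""
    # Round-robin split so that each shard sees a similar sample.
    shards = [data[i::n_shards] for i in range(n_shards)]
    with ThreadPoolExecutor(max_workers=n_shards) as pool:
        weights = list(pool.map(fit_shard, shards))
    # Single aggregation step: average the per-shard solutions.
    return sum(weights) / len(weights)


def make_data(n=2000, true_w=3.0, noise=0.1, seed=0):
    """Synthetic data y = true_w * x + Gaussian noise (illustrative only)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        data.append((x, true_w * x + rng.gauss(0.0, noise)))
    return data


data = make_data()
w_parallel = one_shot_average(data)  # aggregated from 4 independent solves
w_global = fit_shard(data)           # reference: one solve on all data
```

For this simple convex problem, the averaged solution lands close to the solution computed on all data. As the text notes, that is not guaranteed for every optimization problem, which is why convergence of this scheme is a subject of analysis.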

Completed Projects

Detection of Malware and DDoS Attackers

Principal Investigator: Tobias Scheffer
Funding: Strato Rechenzentrum AG
Duration: 2005-2015

Strato is a European provider of webspace and server hosting services. Starting in 2005, we have developed large-scale spam detection methods, and have extended our focus to outbound spam and guestbook spam in the following years. More recent foci of our research are detection of PHP and JavaScript malware and the prevention of DDoS attacks.

Asaphus Embedded Face Recognition

Principal Investigator: Tobias Scheffer
Funding: Bundesministerium für Wirtschaft und Energie im EXIST-Programm
Duration: 2014-2015

The Asaphus Embedded Face Recognition Library is a software library for facial recognition on embedded systems. The software detects faces, localizes facial landmarks, infers the head orientation, and identifies faces based on reference images. The software uses extremely efficient algorithms that make real-time face identification on low-end embedded processors possible. The software features its own memory management; it does not require an operating system and is therefore executable on virtually every embedded system. Asaphus is optimized for automotive applications.

Prediction Games

Principal Investigator: Tobias Scheffer
Funding: German Science Foundation DFG
Duration: 2010-2014

Most results on machine learning rely on the assumption that training data reflect the future behavior of the system under investigation. This assumption over-simplifies reality when an active adversary can exercise some control over the future behavior of the system. This is the case, for instance, with the identification of phishing attacks or credit card fraud. Here, model building becomes a game between learner and adversary. Game theory models such interactions as interleaved optimization problems. Since data-dependent optimization criteria are not a focus of game theory, many questions remain open today. Based on game-theoretic paradigms that model various patterns of interaction between players, the project aims at analyzing prediction games. In particular, the project will investigate learning models that constitute optimal solutions to prediction games under defined circumstances.

Multimedia Retrieval

Principal Investigator: Tobias Scheffer
Funding: STRATO AG
Duration: 2009-2013

The goal of this project is to evaluate and develop technology that enables intuitive and intelligent ways of navigating large photo and video collections.

Scalable Ranking of Online Ads

Principal Investigator: Tobias Scheffer
Funding: nugg.ad AG
Duration: 2007-2013

nugg.ad is a leading provider of targeted online advertising. The goal of this project was to develop technology that enables scalable prediction of users' socio-demographic attributes.

In this project, we investigate efficient algorithms that predict which ad a user is most likely to click on, based on that user's past clicking behavior and all other information that is available.

Modelling and Optimization of Dialysis Treatment

Principal Investigator: Tobias Scheffer
Funding: Fresenius-affiliate NephroCare e-Services GmbH
Duration: 2008-2012

We investigate model-building and the generation of actionable knowledge from records of dialysis treatments.

Differing Training and Test Distributions in Active Learning

Principal Investigator: Tobias Scheffer
Funding: Google Research Award
Duration: 2009

Active learning reduces the labeling effort incurred by applying machine learning algorithms. Active learning procedures direct the attention of a labeler towards examples whose label is believed to convey a maximum of information. Labeled samples in active learning are governed by a distribution that differs from the natural test distribution for multiple reasons. An initial labeled sample may be compiled from auxiliary data sources; the natural input distribution may change over time, or may be altered by an adversary. In addition - and specific to active learning - an active instance selection procedure creates a labeled sample that is biased by the selection criterion. Treating the artificially selected sample in active learning as if it were governed by the test distribution is not necessarily the best course of action. We will understand, develop, and evaluate systematic approaches to active learning that account for this discrepancy between labeled training and test distributions.
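The selection bias described above can be made concrete with a small sketch. It uses standard importance weighting (reweighting each selected example by the ratio of test probability to selection probability), which is one well-known correction and is not claimed to be the approach developed in this project. The pool, the loss values, the selection rule, and the function names are all assumptions made for illustration.

```python
import random


def naive_risk(samples):
    """Plain average of the losses; treats the selected sample as unbiased."""
    return sum(loss for loss, _, _ in samples) / len(samples)


def importance_weighted_risk(samples):
    """Average of loss * p_test / q_select over the selected sample.

    Each sample is a tuple (loss, p_test, q_select), where p_test is the
    probability of the instance under the test distribution and q_select
    is the probability with which the selection procedure picked it.
    """
    return sum(loss * p / q for loss, p, q in samples) / len(samples)


rng = random.Random(0)
n = 1000

# Hypothetical pool: one loss value per instance, uniform test distribution.
losses = [rng.random() for _ in range(n)]
p_test = [1.0 / n] * n

# Hypothetical selection rule that prefers instances with high loss.
raw = [0.1 + loss for loss in losses]
total = sum(raw)
q_select = [r / total for r in raw]

# Draw the "actively selected" sample according to q_select.
idx = rng.choices(range(n), weights=q_select, k=20000)
samples = [(losses[i], p_test[i], q_select[i]) for i in idx]

true_risk = sum(losses) / n
biased_estimate = naive_risk(samples)
corrected_estimate = importance_weighted_risk(samples)
```

In this setup the naive estimate overstates the risk because high-loss instances are over-represented, while the weighted estimate stays close to the average loss over the whole pool.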

Mining Jazz Data to Assess Development Processes

Principal Investigators: Andreas Zeller, Tobias Scheffer
Funding: IBM, Jazz Faculty Grant
Duration: 01/2008-12/2008

What is it that makes a good development process? We want to develop a plug-in that learns from collaboration and defect data as tracked by Jazz, relates features of the collaborative development process to the defect density of individual components, and thereby automatically predicts code quality. For instance, the plug-in might advise that package P should be reviewed more, because a new dependency on compiler internals has been added shortly before the release date by a developer who is new to the team.

Text Mining: Knowledge Discovery in Text Databases and Efficient Document Processing

(German project title: Text Mining: Wissensentdeckung in Textsammlungen und Effizienz von Dokumentenverarbeitungsprozessen)

Principal Investigator: Tobias Scheffer
Funding: German Science Foundation DFG, Emmy Noether Program
Duration: June 2003 through June 2008

The number of documents available in archives and on the web is growing exponentially. This growth creates a demand for methods that automatically analyze large volumes of documents and discover and utilize the valuable knowledge contained in them. A substantial part of our working processes consists of processing (i.e., reading, writing, manipulating) documents. Many tools support the administration of text documents, such as file systems, databases, or document management systems. Much greater effort (and expense), however, is imposed by the actual document manipulation processes, such as writing documents. Any support of document manipulation processes requires substantial knowledge; it is therefore much more difficult to support document processing than document administration.

The goal of the "Text Mining" project is to develop and study text mining algorithms that discover knowledge in large document archives, and utilize this knowledge to support future text manipulation processes.

One project goal is to develop, and to study the properties of, efficient active learning algorithms that generate sequence models from example documents. Such statistical models are able to segment, classify, and extract information from documents. Statistical sequence models can be used in several ways to support document manipulation processes and make them more efficient.

While text mining methods make it possible to extract information from textual documents and to translate this information into a structured representation, data mining algorithms are able to discover knowledge (e.g., patterns or rules) in structured databases. Our goal is to study how these steps can be interleaved automatically, allowing discovery of knowledge that is hidden in textual archives.

Our ultimate project goal is to develop and study methods that discover knowledge in document archives, and utilize this knowledge to effectively support future document manipulation processes. As a prototype, we will develop a sentence completion function for natural language. Based on stored documents, the system is to generate a domain- and user-specific language model representing frequently used phrases and their semantic context. In the application phase, the system will analyze text fragments entered by the user of, for instance, a word processing system. Based on the analysis of the text fragment entered so far, the system will propose the most likely completion of a sentence, if this completion can be derived from the acquired knowledge.
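The sentence completion idea can be sketched with a deliberately simple stand-in for the language model. The sketch below assumes a bigram model with greedy decoding and a confidence threshold; the project description does not specify the model, so this choice, the toy documents, and the names train_bigrams and complete are hypothetical.

```python
from collections import Counter, defaultdict


def train_bigrams(documents):
    """Count, for every word, which words follow it in the stored documents."""
    successors = defaultdict(Counter)
    for doc in documents:
        words = doc.lower().split()
        for prev, nxt in zip(words, words[1:]):
            successors[prev][nxt] += 1
    return successors


def complete(fragment, successors, min_prob=0.5, max_words=10):
    """Greedily extend the fragment with the most likely next words.

    Returns None when no sufficiently confident completion can be derived
    from the counts, mirroring the "only if derivable" condition above.
    """
    words = fragment.lower().split()
    if not words:
        return None
    completion = []
    current = words[-1]
    for _ in range(max_words):
        counts = successors.get(current)
        if not counts:
            break  # the model has never seen this word followed by anything
        nxt, freq = counts.most_common(1)[0]
        if freq / sum(counts.values()) < min_prob:
            break  # the best continuation is not confident enough
        completion.append(nxt)
        if nxt.endswith("."):
            break  # reached the end of a sentence
        current = nxt
    return " ".join(completion) if completion else None


# Toy "document archive" (illustrative only).
docs = [
    "thank you for your order.",
    "thank you for your patience.",
    "thank you for your order.",
    "please find the invoice attached.",
]
model = train_bigrams(docs)
suggestion = complete("Thank you", model)  # "for your order."
no_suggestion = complete("hello", model)   # None: nothing learned about "hello"
```

The threshold min_prob trades coverage against precision: a higher value proposes fewer completions, but those that are proposed agree with a larger share of the stored phrases.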

Data and Text Mining in Quality and Service

Principal Investigator: Tobias Scheffer
Funding: DaimlerChrysler AG
Duration: 08/2005-07/2008

We study the problem of discovering trends and new developments in production and warranty databases as well as in workshop reports. We develop technologies that automatically identify such trends and discover their hidden causes. The goal of this project is the constructive analysis of data mining methods that lead to improved service processes by integrating and analyzing textual information and data from multiple, heterogeneous and distributed databases.