Project 16: Uncertainty Quantification in Active Learning

 

 

Project overview

Number of Students

1-2

Project Type

Master Student Project

Project Owner

Dr. Peter Kuchling

Dr. Alaa Othman

Prof. Dr.-Ing. Wolfram Schenck

Project Context

Project within the Center for Applied Data Science Gütersloh (CfADS) with internal university partners.

 

Summary

Despite the vast amounts of data generated by IoT factories, healthcare systems, and various industries, much of this data remains unlabelled, which limits its usefulness for machine learning (ML) models. Labelling data can be expensive and time-consuming, which hinders the development of effective models. Active learning (AL) addresses this problem by enabling models to selectively query the most informative and representative data points. This research proposal examines "uncertainty quantification in active learning" and focuses on improving the selection process within AL algorithms. Using methods such as Bayesian inference and Gaussian processes as well as ensemble techniques, it investigates how uncertainty quantification affects the efficiency of active learning. The study aims to improve the decision-making process in data selection and thereby advance applications in areas where obtaining labelled data is difficult and costly.

Short description

In the rapidly evolving machine learning landscape, raw or unlabelled data is abundant and often freely available in many areas such as IoT, healthcare, and industry. Despite the massive amount of data being generated, the cost and time associated with labelling it hinder effective model training. The key challenge is to identify which data points will provide the most value when labelled, ensuring efficient learning without overwhelming resources. Active learning (AL) offers a solution by querying the most informative and representative points to form a small but high-quality training set that improves the performance and generalization of learning models. However, many AL strategies exist, and their effectiveness depends on how informativeness is measured. In this proposal, we incorporate uncertainty quantification into AL algorithms. By using methods such as Bayesian inference and Gaussian processes, our approach enhances the ability of the model to assess its confidence in its predictions. This allows intelligent selection of the most informative data points rather than random sampling. As a result, we can optimise labelling efforts, reduce costs, and improve model performance.
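To make the idea concrete, the sketch below scores unlabelled points with a Gaussian process classifier and suggests the least confident ones for labelling. It is a minimal illustration on synthetic data, not the proposed framework itself; the helper name query_most_uncertain, the seed/pool split, and the query budget are assumptions made only to keep the example self-contained.

    # A minimal sketch, assuming a pool-based setting with synthetic data: a Gaussian
    # process classifier scores unlabelled points by how close their predicted class
    # probability is to 0.5 and proposes the least confident ones for labelling.
    # The helper name and the query budget are illustrative, not project decisions.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.gaussian_process import GaussianProcessClassifier
    from sklearn.gaussian_process.kernels import RBF

    def query_most_uncertain(model, X_pool, n_queries=5):
        """Return indices of the pool points where the model is least confident."""
        proba = model.predict_proba(X_pool)[:, 1]
        uncertainty = 1.0 - 2.0 * np.abs(proba - 0.5)   # 1 = maximally uncertain
        return np.argsort(uncertainty)[-n_queries:]

    # Synthetic data standing in for a small labelled seed set and a large unlabelled pool
    X, y = make_classification(n_samples=300, n_features=5, random_state=0)
    X_seed, y_seed, X_pool = X[:20], y[:20], X[20:]

    gp = GaussianProcessClassifier(kernel=1.0 * RBF(length_scale=1.0), random_state=0)
    gp.fit(X_seed, y_seed)

    print("Pool indices suggested for labelling:", query_most_uncertain(gp, X_pool))

In the actual framework, this acquisition score could be replaced or combined with other uncertainty measures, for example the predictive variance of a Gaussian process regressor or the disagreement within an ensemble.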

In practice, this methodology is particularly beneficial in environments where data is abundant but labelling is expensive. For example, in a healthcare setting, active learning can identify uncertainties in patient data and guide targeted labelling efforts to improve diagnostic accuracy. This strategic focus not only streamlines the learning process, but also maximizes the use of limited resources, paving the way for more effective machine learning applications across multiple domains.

Task definition

The student(s) will develop and implement a framework for uncertainty quantification in active learning algorithms. This framework will focus on enhancing the selection process for labeling data points in scenarios where unlabeled data is abundant. The goal is to integrate techniques such as Bayesian inference and Gaussian processes to accurately assess uncertainty in model predictions, thereby identifying the most informative data points for labeling. Students will work on refining the algorithm to ensure that it intelligently queries data based on uncertainty metrics, optimizing the labeling process and reducing costs associated with data annotation. The practical application of this framework will be demonstrated in a relevant industrial or healthcare setting, where the effectiveness of active learning can be evaluated in real-world scenarios. By focusing on uncertainty quantification, this project aims to improve model performance and efficiency, ultimately facilitating more effective machine learning applications in data-scarce environments.
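One plausible way to compute such uncertainty metrics, sketched here under the assumption of an ensemble-based approximation to Bayesian model uncertainty (the project may equally use Gaussian processes or other Bayesian methods), is to decompose the ensemble's predictive uncertainty into total entropy, expected member entropy, and their difference (mutual information), and to query the points on which the members disagree most:

    # A minimal sketch under the assumption of an ensemble-based uncertainty estimate:
    # decompose predictive uncertainty into total entropy, expected member entropy,
    # and their difference (mutual information), then query the most contested points.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import BaggingClassifier

    X, y = make_classification(n_samples=200, n_features=8, random_state=0)
    X_train, y_train, X_pool = X[:50], y[:50], X[50:]

    # Bagged decision trees stand in for a Bayesian posterior over models
    ensemble = BaggingClassifier(n_estimators=20, random_state=0).fit(X_train, y_train)

    # Per-member class probabilities for every pool point: shape (members, points, classes)
    member_proba = np.stack([m.predict_proba(X_pool) for m in ensemble.estimators_])

    def entropy(p):
        return -np.sum(p * np.log(p + 1e-12), axis=-1)

    total_entropy = entropy(member_proba.mean(axis=0))      # uncertainty of the averaged prediction
    expected_entropy = entropy(member_proba).mean(axis=0)   # average uncertainty of each member
    mutual_information = total_entropy - expected_entropy   # disagreement between members

    # Points where the members disagree most are the most informative to label
    print("suggested query indices:", np.argsort(mutual_information)[-5:])

The mutual-information term isolates disagreement between the models (reducible model uncertainty) from noise that additional labels would not remove, which is why it is a common acquisition score in Bayesian active learning.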

Reference to the topic of data science

The proposed research aligns closely with fundamental data science principles by integrating uncertainty quantification into active learning frameworks. This approach complements the research master's curriculum, which emphasizes advanced methodologies in predictive modeling, data analysis, and decision-making based on data insights. The project offers students the opportunity to explore and apply innovative data science techniques in a practical setting, enhancing their understanding of how uncertainty quantification can improve the efficiency and effectiveness of machine learning models. By focusing on the strategic selection of data points for labeling, students will gain valuable experience in optimizing data-driven processes, ultimately contributing to more effective applications in various fields, including IoT and healthcare.

 

Available Resources
  • Expert Supervision: The project benefits from the support of knowledgeable and experienced supervisors who are proficient in data science techniques, including active learning, as well as IoT and industrial applications. Students can draw on this expertise for guidance, feedback, and mentorship during the research process, helping to ensure the quality and relevance of their work.
  • Collaborative spaces: The CfADS group at HS Bielefeld offers collaborative spaces for students to connect with each other, exchange ideas, and access extra resources. These hubs create an atmosphere that supports interdisciplinary cooperation and encourages a comprehensive approach to solving problems.
  • No additional hardware required: The project utilizes the current hardware resources in the IoT factory, avoiding the necessity for further investment. Students can easily incorporate their research into the existing infrastructure, facilitating the implementation of adaptive sensor activation through active learning algorithms.
Project plan

First Semester: Project Setup and Exploration

  • Conduct a literature review on uncertainty quantification and active learning applications.
  • Identify challenges and gaps in integrating uncertainty quantification into active learning.
  • Become familiar with the theoretical foundations of uncertainty estimation techniques.
  • Write a literature review paper summarizing findings.

 

Second Semester: Initial Development and Prototyping

  • Design the architecture for the active learning framework.
  • Implement the framework using appropriate programming languages.
  • Integrate active learning algorithms for data selection.
  • Conduct initial experiments to evaluate performance, e.g. uncertainty-based versus random sampling (a minimal experiment of this kind is sketched after this list).
  • Analyze results to identify improvements.
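A minimal version of such an initial experiment, assuming a pool-based setting in which held-back labels simulate the human annotator, could look like the following sketch; the random-forest model, the entropy-based acquisition function, and the labelling budget are placeholders chosen only to make the example runnable.

    # A hedged sketch of an initial experiment: the same labelling budget is spent
    # with uncertainty sampling and with random sampling, and test accuracy is compared.
    # The "oracle" is simulated by revealing held-back labels of the queried points.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)

    def select_uncertain(model, X_unlabelled, k):
        """Pick the k points with the highest predictive entropy."""
        proba = model.predict_proba(X_unlabelled)
        entropy = -np.sum(proba * np.log(proba + 1e-12), axis=1)
        return np.argsort(entropy)[-k:]

    def select_random(model, X_unlabelled, k):
        """Baseline: pick k points uniformly at random."""
        return rng.choice(len(X_unlabelled), size=k, replace=False)

    def run_active_learning(select_fn, X_pool, y_pool, X_test, y_test, rounds=5, k=10):
        labelled = list(range(10))                     # small labelled seed set
        unlabelled = list(range(10, len(X_pool)))      # remaining unlabelled pool
        for _ in range(rounds):
            model = RandomForestClassifier(n_estimators=100, random_state=0)
            model.fit(X_pool[labelled], y_pool[labelled])
            picked = select_fn(model, X_pool[unlabelled], k)
            queried = [unlabelled[int(i)] for i in picked]
            labelled += queried                        # the simulated oracle reveals their labels
            unlabelled = [i for i in unlabelled if i not in queried]
        return accuracy_score(y_test, model.predict(X_test))

    X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
    X_pool, X_test, y_pool, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

    print("uncertainty sampling:", run_active_learning(select_uncertain, X_pool, y_pool, X_test, y_test))
    print("random sampling:     ", run_active_learning(select_random, X_pool, y_pool, X_test, y_test))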

 

Third Semester: System Refinement and Integration

  • Evaluate the model on relevant datasets.
  • Compare results with baseline and state-of-the-art methods.
  • Identify limitations and refine the framework based on findings.
  • Write a research paper detailing the evaluation and improvements.

 

Fourth Semester: Optimization and Final Evaluation

  • Explore various experimental scenarios, including varying dataset sizes.
  • Analyze the framework's impact on model performance, convergence speed, and labeling effort.
  • Investigate scalability on large datasets and specific applications.
  • Summarize results, draw conclusions, and recommend future research directions.
  • Write the Master Thesis documenting the entire research process and findings.
Necessary Competencies

Mandatory: Strong programming skills, particularly in languages suitable for ML applications (e.g., Python).

Optional:

  • Good mathematical background.
  • Proficiency in uncertainty quantification for active learning.
  • Experience with machine learning libraries and programming languages.
  • Skills in evaluating model performance against benchmarks.
  • Ability to conduct literature reviews and write research papers.
  • Experience in designing experiments to assess model variables.
Acquirable Competencies
  • Proficiency in uncertainty quantification for active learning.
  • Experience with machine learning libraries and programming languages.
  • Skills in evaluating model performance against benchmarks.
  • Ability to conduct literature reviews and write research papers.
  • Experience in designing experiments to assess model variables.