Statistical Hypothesis Testing in Positive Unlabelled Data

Research output: Chapter in Book/Conference proceeding › Conference contribution › peer-review

Abstract

We propose a set of novel methodologies which enable valid statistical hypothesis testing when we have only positive and unlabelled (PU) examples. This type of problem, a special case of semi-supervised data, is common in text mining, bioinformatics, and computer vision. Focusing on a generalised likelihood ratio test, we make three key contributions: (1) a proof that assuming all unlabelled examples are negative cases is sufficient for independence testing, but not for power analysis; (2) a new methodology that compensates for this and enables power analysis, allowing sample size determination for observing an effect with a desired power; and finally, (3) a new capability, supervision determination, which can determine a priori the number of labelled examples the user must collect before being able to observe a desired statistical effect. Beyond general hypothesis testing, we suggest the tools will additionally be useful for information theoretic feature selection and Bayesian Network structure learning.
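
As a rough illustration of the kind of test and power calculation the abstract refers to, the following Python sketch runs a generalised likelihood ratio (G) test between a binary feature X and a surrogate label S, where S = 1 marks a labelled positive and S = 0 marks an unlabelled example treated as negative. It then uses the textbook noncentral chi-square approximation (noncentrality 2·N·I, with I the mutual information in nats) to estimate power and a minimum sample size. All function names are hypothetical, the code is restricted to 2x2 tables (one degree of freedom), and it does not implement the paper's correction for power analysis in PU data, so the power figures apply only to the surrogate-label test.

```python
import math
from statistics import NormalDist


def g_statistic(table):
    """G statistic (2 * N * mutual information in nats) for a table of counts."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    g = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            if observed > 0:  # a zero cell contributes nothing to the sum
                expected = row_totals[i] * col_totals[j] / n
                g += 2.0 * observed * math.log(observed / expected)
    return g


def p_value_df1(g):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(max(g, 0.0) / 2.0))


def power_df1(n, mutual_info, alpha=0.05):
    """Approximate power of the 1-df G-test, assuming noncentrality 2 * n * I."""
    z = NormalDist()
    root_crit = z.inv_cdf(1.0 - alpha / 2.0)      # sqrt of the chi-square critical value
    root_lam = math.sqrt(2.0 * n * mutual_info)   # sqrt of the noncentrality parameter
    return z.cdf(root_lam - root_crit) + z.cdf(-root_lam - root_crit)


def min_sample_size(mutual_info, alpha=0.05, target_power=0.8):
    """Smallest n whose approximate power reaches target_power (linear search)."""
    if mutual_info <= 0:
        raise ValueError("mutual_info must be positive")
    n = 1
    while power_df1(n, mutual_info, alpha) < target_power:
        n += 1
    return n


if __name__ == "__main__":
    # Rows: X = 1, X = 0. Columns: S = 1 (labelled positive), S = 0 (unlabelled).
    # These counts are made up purely for illustration.
    counts = [[30, 70], [120, 780]]
    g = g_statistic(counts)
    print("G =", g, "p =", p_value_df1(g))
    print("n for 80% power at I = 0.01 nats:", min_sample_size(0.01))
```

The linear search in `min_sample_size` is kept for readability; for very small effect sizes a bisection over n would be a more practical choice.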
Original language: English
Title of host publication: host publication
Publication status: Published - Sept 2014
Event: European Conference on Machine Learning - ECML/PKDD 2014 - Nancy, France
Duration: 15 Sept 2014 to 19 Sept 2014

Conference

Conference: European Conference on Machine Learning - ECML/PKDD 2014
City: Nancy, France
Period: 15/09/14 to 19/09/14

Keywords

  • Hypothesis test; Positive unlabelled; Semi-supervised; Mutual information
