Abstract
We propose a set of novel methodologies that enable valid statistical hypothesis testing when only positive and unlabelled (PU) examples are available. This type of problem, a special case of semi-supervised data, is common in text mining, bioinformatics, and computer vision. Focusing on a generalised likelihood ratio test, we make three key contributions: (1) a proof that assuming all unlabelled examples are negative cases is sufficient for independence testing, but not for power analysis; (2) a new methodology that compensates for this and enables power analysis, allowing determination of the sample size needed to observe an effect with a desired power; and (3) a new capability, supervision determination, which can determine a priori the number of labelled examples the user must collect before being able to observe a desired statistical effect. Beyond general hypothesis testing, we suggest the tools will also be useful for information-theoretic feature selection and Bayesian network structure learning.
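As a rough illustration of the independence-testing idea in contribution (1), the sketch below computes a generalised likelihood ratio (G) statistic between a feature and a surrogate label that treats every unlabelled example as negative. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names `contingency_table` and `g_statistic` and the toy data are hypothetical, and the power-analysis correction from contribution (2) is not attempted here.

```python
import math
from collections import Counter


def contingency_table(x, s):
    """Count co-occurrences of feature values x and surrogate labels s.

    s is 1 for a labelled positive and 0 for an unlabelled example
    (assumption: unlabelled examples are treated as negatives).
    """
    counts = Counter(zip(x, s))
    x_vals = sorted(set(x))
    s_vals = sorted(set(s))
    return [[counts[(xv, sv)] for sv in s_vals] for xv in x_vals]


def g_statistic(table):
    """Return (G, degrees of freedom) for a two-way table of counts.

    G = 2 * sum(O * ln(O / E)), where E is the expected count under
    independence. Cells with zero observed count contribute nothing.
    """
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    g = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            if obs > 0:
                expected = row_tot[i] * col_tot[j] / n
                g += obs * math.log(obs / expected)
    dof = (len(row_tot) - 1) * (len(col_tot) - 1)
    return 2.0 * g, dof


# Hypothetical toy data: 40 labelled positives followed by 40 unlabelled examples.
x = [1] * 30 + [0] * 10 + [1] * 10 + [0] * 30
s = [1] * 40 + [0] * 40

table = contingency_table(x, s)
g, dof = g_statistic(table)
print(table, round(g, 2), dof)
```

In practice, G would be compared against a chi-squared critical value with `dof` degrees of freedom (about 3.84 at the 0.05 level when `dof` is 1). The statistic is also equal to 2N times the estimated mutual information in nats, which is the link to the mutual information keyword.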
Original language | English |
---|---|
Publication status | Published - Sept 2014 |
Event | European Conference on Machine Learning - ECML/PKDD 2014 - Nancy, France. Duration: 15 Sept 2014 → 19 Sept 2014 |
Conference
Conference | European Conference on Machine Learning - ECML/PKDD 2014 |
---|---|
City | Nancy, France |
Period | 15/09/14 → 19/09/14 |
Keywords
- Hypothesis test; Positive unlabelled; Semi supervised; Mutual information