Bias reduction in the population size estimation of large data sets

Jeffrey Chu, Yuanyuan Zhang, Stephen Chan, Saraleesan Nadarajah

Research output: Contribution to journal (Article, peer-reviewed)

Abstract

Estimating the population size of large data sets and hard-to-reach populations can be a significant problem. For example, in the military, manpower is limited and the manual processing of large data sets can be time-consuming. In addition, access to the full population of data may be restricted by factors such as cost, time, and safety. Four new population size estimators are proposed as extensions of existing methods, and their performance is compared, in terms of bias, with that of two existing methods from the big data literature. The proposed estimators would be particularly beneficial in the context of time-critical decisions or actions. The comparison is based on a simulation study and an application to five real network data sets (Twitter, LiveJournal, Pokec, YouTube, Wikipedia Talk). Whilst no single estimator among the four proposed generates the most accurate estimates overall, the proposed estimators are shown to produce more accurate population size estimates for small sample sizes, although in some cases they show more variability than existing estimators in the literature.

Original language: English
Article number: 106914
Number of pages: 32
Journal: Computational Statistics and Data Analysis
Volume: 145
Publication status: Published - 15 Jan 2020

Keywords

  • Relative bias
  • Twitter
  • Size estimator
  • YouTube
  • Random walk sampling
