Posts Tagged ‘stats’

IBM Research: Preserving Validity in Adaptive Data Analysis

Wednesday, September 23rd, 2015

Preserving Validity in Adaptive Data Analysis Using differential #privacy for correct #stats even w/ test-set reuse

“A common next step would be to use least-squares linear regression to check whether a simple linear combination of the three strongly correlated foods can predict the grade. It turns out that a little combination goes a long way: we discover that a linear combination of the three selected foods can explain a significant fraction of variance in the grade (plotted below). The regression analysis also reports that the p-value of this result is 0.00009, meaning that the probability of this happening purely by chance is less than 1 in 10,000.

Recall that no relationship exists in the true data distribution, so this discovery is clearly false. This spurious effect is known to experts as Freedman’s paradox. It arises since the variables (foods) used in the regression were chosen using the data itself.
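The effect is easy to reproduce. Here is a minimal simulation in the spirit of the example (the "foods" and "grades" are just noise, and all numbers are chosen for illustration): select the three noise variables most correlated with an equally noisy outcome, then regress on them, and the fit looks meaningful even though no relationship exists.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 50                      # 100 "students", 50 "foods" -- all pure noise
X = rng.standard_normal((n, p))     # food consumption (no real signal)
y = rng.standard_normal(n)          # grades, independent of X

# Step 1: pick the three foods most correlated with the grade -- using the data itself
corrs = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(p)])
top3 = np.argsort(-np.abs(corrs))[:3]

# Step 2: least-squares regression of the grade on the selected foods
A = np.column_stack([np.ones(n), X[:, top3]])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
resid = y - A @ beta
r2 = 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
print(f"R^2 = {r2:.3f}")  # inflated by the data-driven selection in Step 1
```

Because the same data both chose the variables and evaluated the fit, the R² (and any p-value computed from it) is biased upward — Freedman's paradox in miniature.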

We found that challenges of adaptivity can be addressed using techniques developed for privacy-preserving data analysis. These techniques rely on the notion of differential privacy that guarantees that the data analysis is not too sensitive to the data of any single individual. We rigorously demonstrated that ensuring differential privacy of an analysis also guarantees that the findings will be statistically valid. We then also developed additional approaches to the problem based on a new way to measure how much information an analysis reveals about a dataset.

The Thresholdout Algorithm

Using our new approach we designed an algorithm, called Thresholdout, that allows an analyst to reuse the holdout set of data for validating a large number of results, even when those results are produced by an adaptive analysis.
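The core mechanism can be sketched in a few lines. This is a simplified illustration only, not the full published algorithm (which also tracks a budget of "overfitting" events and refreshes the noisy threshold); the function name and parameter values here are invented for the example.

```python
import numpy as np

def thresholdout(train_vals, holdout_vals, threshold=0.04, sigma=0.01, seed=0):
    """For each query, release the training-set value when it agrees with the
    holdout-set value up to a noisy threshold; otherwise release a noisy
    holdout value.  The Laplace noise limits how much any sequence of
    adaptive queries can reveal about the holdout set."""
    rng = np.random.default_rng(seed)
    answers = []
    for t, h in zip(train_vals, holdout_vals):
        if abs(t - h) < threshold + rng.laplace(0.0, sigma):
            answers.append(t)                            # agreement: holdout untouched
        else:
            answers.append(h + rng.laplace(0.0, sigma))  # disagreement: noisy holdout
    return answers
```

The point of the design is that the analyst only "pays" for queries where training and holdout disagree, which is exactly when overfitting is being detected.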


Science Isn’t Broken | FiveThirtyEight

Saturday, August 22nd, 2015

Science Isn’t Broken by @cragcrest Great (but cynical) description of “p-hacking” & “researcher degrees of freedom”

Multiple hypothesis testing in genomics – Goeman – 2014 – Statistics in Medicine – Wiley Online Library

Monday, August 17th, 2015

Multiple hypothesis testing in genomics Nice overview, comparing familywise error & FDR control + FDP estimation

This paper presents an overview of the current state-of-the-art in multiple testing in genomics data from a user’s perspective. We describe methods for familywise error control, false discovery rate control and false discovery proportion estimation and confidence, both conceptually and practically, and explain when to use which type of error rate. We elaborate the assumptions underlying the methods, and discuss pitfalls in the interpretation of results. In our discussion we take into account the exploratory nature of genomics experiments, looking at selection of genes before or after testing, and at the role of validation experiments.
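Of the error rates the abstract compares, false discovery rate control is the one most genomics pipelines reach for first. A minimal sketch of the standard Benjamini–Hochberg step-up procedure (a generic illustration, not code from the paper):

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """Benjamini-Hochberg step-up procedure: reject the k smallest p-values,
    where k is the largest rank with p_(k) <= (k/m) * q.  Controls the
    false discovery rate at level q for independent tests."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    below = p[order] <= (np.arange(1, m + 1) / m) * q
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])   # last rank passing the threshold
        reject[order[: k + 1]] = True
    return reject
```

Familywise-error methods such as Bonferroni instead apply the much stricter per-test cutoff q/m, which is why the paper's distinction between the two error rates matters at genomic scale.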

Why Most Published Research Findings Are False

Saturday, February 7th, 2015

Why Most Published Research Findings are False Evaluating 2×2 confusion matrix, effects of bias & multiple studies

PLoS Medicine | August 2005 | Volume 2 | Issue 8 | e124

Published research findings are sometimes refuted by subsequent evidence, with ensuing confusion and disappointment. Refutation and controversy is seen across the range of research designs, from clinical trials and traditional epidemiological studies [1–3] to the most modern molecular research [4,5]. There is increasing concern that in modern research, false findings may be the majority or even the vast majority of published research claims [6–8]. However, this should not be surprising. It can be proven that most claimed research findings are false. Here I will examine the key…

Research findings are defined here as any relationship reaching formal statistical significance, e.g., effective interventions, informative predictors, risk factors, or associations. “Negative” research is also very useful. “Negative” is actually a misnomer, and the misinterpretation is widespread. However, here we will target relationships that investigators claim exist, rather than null findings. As has been shown previously, the probability that a research finding is indeed true depends on the prior probability of it being true (before doing the study), the statistical power of the study, and the level of statistical significance [10,11]. Consider a 2 × 2 table in which research findings are compared against the gold standard of true relationships in a scientific field. In a research field both true and false hypotheses can be made about the presence of relationships. Let R be the ratio of the number of “true relationships” to “no relationships” among those tested in the field.

R is characteristic of the field and can vary a lot depending on whether the field targets highly likely relationships or searches for only one or a few true relationships among thousands and millions of hypotheses that may be postulated. Let us also consider, for computational simplicity, circumscribed fields where either there is only one true relationship (among many that can be hypothesized) or the power is similar to find any of the several existing true relationships. The pre-study probability of a relationship being true is R/(R + 1). The probability of a study finding a true relationship reflects the power 1 − β (one minus the Type II error rate). The probability of claiming a relationship when none truly exists reflects the Type I error rate, α. Assuming that c relationships are being probed in the field, the expected values of the 2 × 2 table are given in Table 1. After a research finding has been claimed based on achieving formal statistical significance, the post-study probability that it is true is the positive predictive value, PPV. The PPV is also the complementary probability of what Wacholder et al. have called the false positive report probability [10]. According to the 2 × 2 table, one gets PPV = (1 − β)R/(R − βR + α). A research finding is thus…
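The closing formula from the excerpt is easy to evaluate directly. A small helper (the function name and example inputs are mine; the formula is the paper's):

```python
def ppv(R, beta, alpha=0.05):
    """Positive predictive value of a claimed finding:
    PPV = (1 - beta) * R / (R - beta * R + alpha),
    where R is the pre-study odds of a true relationship,
    1 - beta the power, and alpha the significance level."""
    return (1 - beta) * R / (R - beta * R + alpha)

# A well-powered study of a likely hypothesis vs. an exploratory search:
print(ppv(1.0, 0.2))    # R = 1:1 odds, 80% power  -> about 0.94
print(ppv(0.001, 0.4))  # 1 true per 1000 tested   -> well under 0.05
```

A finding is more likely true than false exactly when PPV > 1/2, i.e. when (1 − β)R > α — which is why low pre-study odds R drive the paper's conclusion.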

PLOS Genetics: Statistical Estimation of Correlated Genome Associations to a Quantitative Trait Network

Sunday, December 28th, 2014

Correlated Genome Associations to Quantitative Trait #Network (QTN)
Uses fused #lasso for estimation of relationships

Kim & Xing (’09) provide a new method for calculating how genetic
markers associate with phenotypes by incorporating phenotype
connectivity features into the correlation structure between markers
and phenotypes. Their model attempts to quantify pleiotropic
relationships between different phenotypes and assumes a common
genotypic origin for the existence of clusters of correlated
phenotypes, which their algorithm uses to reduce the number of
significant genetic markers. In particular, Kim and Xing present a
method for performing quantitative trait analysis that implements two
novel approaches to inferring the contribution of a
[marker/allele/SNP/gene/locus] to a quantitative trait. The first is
organization of traits into a quantitative trait network (QTN). The
second is the utilization of the fused lasso, a variation of multivariate
regression that jointly minimizes the least-squares error and the
number of non-zero coefficients. These two approaches are combined in an
attempt to minimize noise (in the form of small coefficients for SNPs
that don’t really make a contribution) and focus on truly relevant
SNPs while dealing with the correlated nature of quantitative
traits. Based on two datasets – simulated HapMap data and
data from the Severe Asthma Research Program – the authors show marked
improvement in accuracy and reduction of false positives over simpler
multivariate regression methods.
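The objective being minimized can be written down compactly. A sketch of a graph-guided fused-lasso objective in the spirit of Kim & Xing (function and variable names are mine; their GFlasso additionally weights each fusion term by the sign and strength of the trait correlation):

```python
import numpy as np

def gflasso_objective(B, X, Y, trait_edges, lam1, lam2):
    """B: (p markers x q traits) coefficient matrix.
    fit      -- squared error of the multivariate regression Y ~ X B
    sparsity -- L1 penalty shrinking small, noisy coefficients to zero
    fusion   -- couples each marker's effect across traits linked in the QTN"""
    fit = 0.5 * np.sum((Y - X @ B) ** 2)
    sparsity = lam1 * np.sum(np.abs(B))
    fusion = lam2 * sum(np.sum(np.abs(B[:, s] - B[:, t])) for s, t in trait_edges)
    return fit + sparsity + fusion
```

The fusion term is what encodes the QTN: markers are pushed to affect correlated traits jointly, so pleiotropic signals survive while isolated small coefficients are shrunk away.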

Is ecology explaining less and less?

Monday, September 15th, 2014

Is #ecology explaining less & less? Over 100yr & 18k papers: more #pvalues but falling mean r². What’s P for this trend?

Belles lettres Meets Big Data » American Scientist

Saturday, July 5th, 2014

Belles lettres Meets #BigData #Statistical analysis of literature pre-dating recent advent of digital #humanities

Data Science and Prediction | December 2013 | Communications of the ACM

Monday, June 2nd, 2014

#DataScience & Prediction: Nice overview of the field, emphasizing testable models, even when causation isn’t implied

Common SNPs explain a large proportion of the heritability for human height : Nature Genetics : Nature Publishing Group

Thursday, May 8th, 2014

Eight (No, Nine!) Problems With Big Data

Monday, April 14th, 2014

8 Problems With #BigData: correlation v causation, multiple testing, garbage in & out, gaming system, sample bias…