Posts Tagged ‘mining’

AlphaFold @ CASP13: “What just happened?”

Friday, February 8th, 2019

“Let me get the most important question out of the way: is AlphaFold’s advance really significant, or is it more of the same? I would characterize their advance as roughly two CASPs in one (really ~1.8x). Historically progress in CASP has ebbed and flowed, with a ten year period of almost absolute stagnation, finally broken by the advances seen at CASP11 and 12, which were substantial. What we’ve seen this year is roughly twice as much as the recent average rate of advance (measured in mean ΔGDT_TS from CASP10 to CASP12—GDT_TS is a measure of prediction accuracy ranging from 0 to 100, with 100 being perfect.) As I will explain later, there may actually be a good reason for this “two CASPs” effect, in terms of the underlying methodological breakdown. This can be seen not only in the CASP-over-CASP improvement, but also in terms of the size of the gap between AlphaFold and the second best performer, which is unusually large by CASP standards. Below is a plot that depicts this.”

Spam: A Shadow History of the Internet Excerpt, Part 2

Saturday, July 28th, 2018

Spam: A Shadow History of the Internet Fascinating discussion of #LitSpam: how the #spam arms race led to the development of Bayesian filters & then, in response, a bizarre mash-up of free literary texts meant to evade them


“Let us return to Turing, briefly, and introduce the fascinating Imitation Game, before we leave litspam and the world of robot-read/writable text. The idea of a quantifiable, machine-mediated method of describing qualities of human affect recurs in the literature of a variety of fields, including criminology, psychology, artificial intelligence, and computer science. Its applications often provide insight into the criteria by which different human states are determined—as described, for example, in Ken Alder’s fascinating work on polygraphs, or in the still understudied history of the “fruit machine,” ….is the so-called Turing Test. The goal of Turing’s 1950 thought experiment (which bears repeating, as it’s widely misunderstood today) was to “replace the question [of ‘Can machines think?’] by another, which is closely related to it and is expressed in relatively unambiguous words.” Turing considered the question of machines “thinking” or not to be “too meaningless to deserve discussion,” and, quite brilliantly, turned the question around to whether people think—or rather how we can be convinced that other people think. This project took the form of a parlor game: A and B, a man and a woman, communicate with an “interrogator,” C, by some intermediary such as a messenger or a teleprinter. C knows the two only as “X” and “Y”; after communicating with them, C is to render a verdict as to which is male and which female. A is tasked with convincing C that he, A, is female and B is male; B’s task is the same. “We now ask the question,” Turing continues, “‘What will happen when a machine takes the part of A in this game?’ …

What litspam has produced, remarkably, is a kind of parodic imitation game in which one set of algorithms is constantly trying to convince the other of their acceptable degree of salience—of being of interest and value to the humans. As Charles Stross puts it, “We have one faction that is attempting to write software that can generate messages that can pass a Turing test, and another faction that is attempting to write software that can administer an ad hoc Turing test.” …

Surrealist automatic writing has its particular associative rhythm, and the Burroughsian Cut-Up depends strongly on the taste for jarring juxtapositions favored by its authors (an article from Life, a sequence from The Waste Land, one of Burroughs’s “routines” in which mandrills from Venus kill Eisenhower). Litspam text, along with early comment spam and the strange spam blogs described in the next section, is the expression of an entirely different intentionality without the connotative structure produced by a human writer. The results returned by a probabilistically manipulated search engine, or the poisoned Bayesian spew of bot-generated spam, …

Spam: A Shadow History of the Internet [excerpt, Part 2]

The Moth | Stories | Data Mining for Dates

Tuesday, December 6th, 2016

Similarity network fusion for aggregating data types on a genomic scale : Nature Methods : Nature Publishing Group

Tuesday, February 9th, 2016

Similarity #network fusion for aggregating data types Combines mRNA, miRNA & gene fusions to classify cancer subtypes

Similarity network fusion for aggregating data types on a genomic scale : Nature Methods : Nature Publishing Group

Monday, June 1st, 2015

Similarity #network fusion for aggregating data types Combines mRNA, miRNA & gene fusions to classify cancer subtypes

Machine learning applications in genetics and genomics : Nature Reviews Genetics : Nature Publishing Group

Saturday, May 30th, 2015

#Machinelearning applications in…genomics Nice overview of key distinctions betw generative & discriminative models

In their review, “Machine learning in genetics and genomics”, Libbrecht and Noble survey key aspects of applying machine learning to genomic data. The review presents classical genomics problems where machine learning techniques have proven useful and describes the differences between supervised, semi-supervised and unsupervised learning, as well as between generative and discriminative models. The authors discuss what to consider when selecting a machine learning approach for a given biological problem and data set, provide general practical guidelines, and suggest possible solutions to common challenges.

Banjo Raises $100 Million to Detect World Events in Real Time

Saturday, May 9th, 2015

Banjo Raises $100 Million to Detect World Events in Real Time Will their global "crystal ball" notice this tweet?

Crime mining: Hidden history emerges from court data – 25 June 2014 – Control – New Scientist

Monday, April 27th, 2015

Hidden history emerges from [#mining] court data Diverging descriptions of types of #crime likened to genetic drift

Back to the future
Jennifer Ouellette
Available online 28 June 2014


Instead, he turned to information theory, invented by Claude Shannon
in the 1940s. DeDeo’s aim was to reveal gradual changes in the way
crimes were spoken about. He split all the trials into two categories
– trials for violent crimes like murder or assault and trials for
non-violent crimes like pickpocketing or fraud – and then he looked at
the actual words that people used in the courtroom. Information theory
lets you quantify the amount of information given by a word in a
specific context. Using a measure known as Jensen-Shannon divergence,
a word picked at random from the transcript of a trial can be given a
score based on how useful it is for predicting the type of the trial.
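
As a rough illustration of the scoring idea described above, here is a minimal Python sketch. The article does not spell out DeDeo's actual computation, so this only shows the standard Jensen-Shannon divergence between two word distributions, split into per-word terms. The two toy "transcripts" and the function names are invented for the example.

```python
import math
from collections import Counter

def word_distribution(tokens):
    """Relative frequency of each word in a list of tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def js_contributions(p, q):
    """Per-word terms of the Jensen-Shannon divergence (in bits) between p and q.

    The terms sum to the total divergence, which lies between 0 (identical
    distributions) and 1 (no words in common).
    """
    contrib = {}
    for w in set(p) | set(q):
        pw, qw = p.get(w, 0.0), q.get(w, 0.0)
        mw = 0.5 * (pw + qw)  # mixture of the two distributions
        term = 0.0
        if pw > 0:
            term += 0.5 * pw * math.log2(pw / mw)
        if qw > 0:
            term += 0.5 * qw * math.log2(qw / mw)
        contrib[w] = term
    return contrib

# Toy, made-up snippets standing in for the two trial categories (assumption).
violent = "he struck him with a knife and he was murdered".split()
theft = "he took the watch from his pocket and ran".split()

p = word_distribution(violent)
q = word_distribution(theft)
contrib = js_contributions(p, q)
jsd = sum(contrib.values())

# Words with the largest terms say the most about the trial type.
for word, score in sorted(contrib.items(), key=lambda kv: -kv[1])[:3]:
    print(f"{word}: {score:.3f}")
print(f"total divergence: {jsd:.3f} bits")
```

In this toy case a word such as "knife", which appears in only one category, gets a larger term than "he", which appears in both.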

So, for example, if you walked into the Old Bailey during Hall’s trial
and heard the word "murdered" uttered in court, how much information
about the type of trial underway would that single word convey? In the
early years of the period they looked at, most crimes involved some
level of violence. "There might be bloodshed, or an eye gouged out,
but the real crime is someone’s wallet got stolen," DeDeo says. "The
casual everyday violence of the past is remarkable."

Slowly, however, that changed. By the 1880s, the team found that the
majority of violent language was reserved for talking about crimes
like assault, murder or rape. So you could walk into the courtroom,
hear words like "murdered", "hit," "knife" and "struggled" – all words
from Martin’s testimony in 1801 – and be confident that you were
witnessing a trial for a violent crime rather than a trial for theft.

The analysis reveals a story of the gradual criminalisation of
violence. This is not necessarily evidence that we have become less
violent – as Steven Pinker argues, based on statistics for violent
crime, in his book The Better Angels of Our Nature. Rather, it is a
story of the state gaining a monopoly on violence and controlling its
occurrence among the public. "What is deemed criminal has changed,"
says Hitchcock.

DeDeo likens the shift to genetic drift. If you took two herds of
goats and isolated each for centuries, the herds would gradually
evolve into separate species. Similarly, he sees Old Bailey cases as
populations of violent and non-violent trials. Over time the two types
"speciate" and become distinct from one another (see chart). "In 1760,
the patterns of language used in both kinds of trial are almost
exactly identical," he says. "Over the next 150 years they diverge."

In Search of Bayesian Inference

Sunday, April 12th, 2015

In Search of #Bayesian Inference Nice intuition on priors in recovering air-crash wreckage & analyzing mammographs


In its most basic form, Bayes’ Law is a simple method for updating beliefs in the light of new evidence. Suppose there is some statement A that you initially believe has a probability P(A) of being correct (what Bayesians call the “prior” probability). If a new piece of evidence, B, comes along, then the probability that A is true given that B has happened (what Bayesians call the “posterior” probability) is given by

P(A|B) = P(B|A) P(A) / P(B)

where P(B|A) is the likelihood that B would occur if A is true, and P(B) is the likelihood that B would occur under any circumstances.

Consider an example described in Silver’s book The Signal and the Noise: A woman in her forties has a positive mammogram, and wants to know the probability she has breast cancer. Bayes’ Law says that to answer this question, we need to know three things: the probability that a woman in her forties will have breast cancer (about 1.4%); the probability that if a woman has breast cancer, the mammogram will detect it (about 75%); and the probability that any random woman in her forties will have a positive mammogram (about 11%). Putting these figures together, Bayes’ Law—named after the Reverend Thomas Bayes, whose manuscript on the subject was published posthumously in 1763—says the probability the woman has cancer, given her positive mammogram result, is just under 10%; in other words, about 9 out of 10 such mammogram results are false positives.
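
The arithmetic in the mammogram example can be checked with a few lines of Python. This is only a sketch that plugs the three figures quoted above into Bayes' Law; the function and variable names are my own.

```python
def posterior(prior, likelihood, evidence):
    """Bayes' Law: P(A|B) = P(B|A) * P(A) / P(B)."""
    if not 0 < evidence <= 1:
        raise ValueError("P(B) must be in (0, 1]")
    return likelihood * prior / evidence

p_cancer = 0.014            # P(A): a woman in her forties has breast cancer
p_pos_given_cancer = 0.75   # P(B|A): the mammogram detects an existing cancer
p_pos = 0.11                # P(B): any woman in her forties tests positive

p_cancer_given_pos = posterior(p_cancer, p_pos_given_cancer, p_pos)
print(f"P(cancer | positive mammogram) = {p_cancer_given_pos:.3f}")  # 0.095
```

The result, roughly 9.5%, matches the "just under 10%" quoted in the text.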

In this simple setting, it is clear how to construct the prior, since there is plenty of data available on cancer rates. In such cases, the use of Bayes’ Law is uncontroversial, and essentially a tautology—it simply says the woman’s probability of having cancer, in light of her positive mammogram result, is given by the proportion of positive mammograms that are true positives. Things get murkier when statisticians use Bayes’ rule to try to reason about one-time events, or other situations in which there is no clear consensus about what the prior probabilities are. For example, large passenger airplanes do not crash into the ocean very often, and when they do, the circumstances vary widely. In such cases, the very notion of prior probability is inherently subjective; it represents our best belief, based on previous experiences, about what is likely to be true in this particular case. If this initial belief is way off, we are likely to get bad inferences.


Twitter “Exhaust” Reveals Patterns of Unemployment | MIT Technology Review

Monday, December 1st, 2014

Social media fingerprints of unemployment, from detecting network components in tweet mining

Lots of press for an arXiv paper, viz:
Twitter “Exhaust” Reveals Patterns of Unemployment | MIT Technology Review


So the team analysed the rate at which messages were exchanged between regions using a standard community detection algorithm. This revealed 340 independent areas of economic activity, which largely coincide with other measures of geographic and economic distribution. “This result shows that the mobility detected from geolocated tweets and the communities obtained are a good description of economical areas,” they say.
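
The excerpt does not say which "standard community detection algorithm" the team used. Purely to illustrate the idea of grouping regions by how heavily they exchange messages, here is a sketch of weighted label propagation, one simple technique of this kind. The region names and message counts are made up.

```python
from collections import defaultdict

def label_propagation(edges, max_rounds=100):
    """Group nodes of a weighted, undirected graph into communities.

    edges maps (u, v) pairs to a weight, e.g. messages exchanged between
    two regions. Each node repeatedly adopts the label carrying the most
    total weight among its neighbours, until no label changes.
    """
    neighbors = defaultdict(dict)
    for (u, v), w in edges.items():
        neighbors[u][v] = neighbors[u].get(v, 0) + w
        neighbors[v][u] = neighbors[v].get(u, 0) + w

    labels = {node: node for node in neighbors}  # start: one label per node
    for _ in range(max_rounds):
        changed = False
        for node in sorted(neighbors):  # fixed order keeps runs repeatable
            score = defaultdict(float)
            for nb, w in neighbors[node].items():
                score[labels[nb]] += w
            best = max(score.values())
            candidates = sorted(l for l, s in score.items() if s == best)
            # Keep the current label on ties, else take the smallest candidate.
            new = labels[node] if labels[node] in candidates else candidates[0]
            if new != labels[node]:
                labels[node] = new
                changed = True
        if not changed:
            break
    return labels

# Hypothetical message counts between six regions (assumption, not real data).
edges = {
    ("A", "B"): 10, ("A", "C"): 8, ("B", "C"): 9,
    ("D", "E"): 12, ("D", "F"): 7, ("E", "F"): 11,
    ("C", "D"): 1,  # weak link between the two clusters
}
labels = label_propagation(edges)

communities = defaultdict(list)
for node, label in sorted(labels.items()):
    communities[label].append(node)
print(list(communities.values()))
```

On this toy graph the regions A, B, C end up in one community and D, E, F in another, because the single weak link between C and D is outweighed by the traffic inside each cluster.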

Finally, they looked at the unemployment figures in each of these regions and then mined their database for correlations with twitter activity.