All around us are technologies that seem to know us better than we know ourselves. We point our cell phone camera at friends and it focuses on their faces. We address an envelope and a postal machine deciphers our scribbles as names and places. Facebook scans photographs and tags our friends automatically.
This is machine learning — a field of computer science that involves teaching machines to recognize patterns by training them on large data sets. Already, the approach has transformed the way we work and play. But increasingly, biologists are looking to machine learning to answer fundamental questions about life.
In a recent study published in Nature Biotechnology, researchers at the Sloan Kettering Institute’s Computational Biology Program used machine learning to solve a common problem in genetics: Where, in our DNA, will a transcription factor bind?
Transcription factors are proteins that turn genes “on” or “off” by binding to precise regions of DNA. When a gene is on, the cell makes a protein that corresponds to that gene. It’s the particular set of genes turned on in a cell that determines that cell’s identity — as a liver cell, say, or a neuron.
Scientists know quite a bit about how transcription factors work. But when it comes to determining which DNA sequence a transcription factor will bind to, they usually rely on what you might call the spaghetti method: Throw the protein against a wall of several thousand DNA sequences and see where it sticks. (The technical name for this is a protein binding microarray.)
The new approach does something quite different. It trains a computer to learn the binding preferences of a family of transcription factors, so that it can then make predictions about a transcription factor it hasn’t seen before.
Lead scientist Christina Leslie likens the process to the way the online movie service Netflix makes movie recommendations to customers based on the preferences of other users who share similar characteristics. “What Netflix does is look at attributes of users and attributes of movies, and then draw conclusions about what sorts of movies appeal to what sorts of people,” she says.
So, for example, one conclusion Netflix might draw is that 30-year-old British men like comedies with Mr. Bean. Or that American teens like romantic movies featuring vampires.
“The recommender system that we use is conceptually quite similar, only instead of users and movies we’re looking at proteins and DNA,” says Dr. Leslie.
Her lab’s Netflix-inspired approach outperforms all existing methods for predicting the binding preferences of transcription factors. That’s a boon to basic researchers, and may also bring benefits to patients in the form of more precisely targeted medicines.
Teaching a Computer to Learn the Rules of Protein Binding
For the paper, Dr. Leslie and her colleagues didn’t actually do any DNA-binding experiments in the laboratory. Rather, they took advantage of the mountain of existing data on protein binding preferences, mining it in new ways.
From published sources, they had access to the binding preferences of 178 transcription factors for more than 40,000 specific DNA sequences. What they wanted to do was understand the rules that relate particular protein sequences to particular DNA sequences, which they call the “recognition code” of transcription factors. They hoped that sifting through the vast amount of data would reveal consistent underlying patterns.
Crunching this math is too hard for even the smartest human, but a powerful computer can do it without frying a microchip. “Our model has about six million unknowns and three million equations,” says Rafi Pelossof, the paper’s first author and a postdoctoral fellow in the Leslie lab. “But computers can solve that huge problem within seconds.”
The hardest part was setting up the equations that allow the computer to learn properly. Dr. Pelossof notes that modeling the problem as interactions between amino acid sequences and DNA sequences was the key to solving the transcription factor recognition code. But after that, it was a simple matter to run the program.
By viewing each protein as a set of short amino acid sequences and each DNA binding region as a sequence of nucleotides, their method learned the preferences of these short bits of protein for the corresponding bits of DNA.
Once the computer learns the rules governing amino-acid-to-DNA sequence binding, it can then be used to identify the binding preference of a new protein, based on its amino acid sequence alone. The approach works about as well as actually doing a laboratory screen (throwing the spaghetti) but without all the mess. It also identifies the precise amino acids within the protein that are responsible for the specificity, a piece of information that previously would have required the painstaking work of X-ray crystallography to decipher.
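For readers who want to see the idea in code, here is a minimal sketch of learning preferences of short protein fragments for short DNA fragments and then scoring a protein the model has not seen. It is an illustration under stated assumptions, not the study's actual algorithm: the sequences and binding scores are randomly generated, the fragment length `K = 2` and the names `kmer_counts`, `random_seq`, `P`, `D`, `W`, and `Y` are invented for this example, and plain least squares stands in for whatever solver the researchers used.

```python
import itertools
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
NUCLEOTIDES = "ACGT"
K = 2  # illustrative fragment (k-mer) length, chosen to keep the toy small


def kmer_counts(seq, alphabet, k=K):
    """Return a vector counting each length-k substring of seq."""
    index = {"".join(p): i
             for i, p in enumerate(itertools.product(alphabet, repeat=k))}
    counts = np.zeros(len(index))
    for start in range(len(seq) - k + 1):
        counts[index[seq[start:start + k]]] += 1
    return counts


def random_seq(rng, alphabet, length):
    """Draw a random sequence (stand-in for real data)."""
    return "".join(rng.choice(list(alphabet), size=length))


rng = np.random.default_rng(0)

# Made-up training set: 30 proteins and 50 DNA probes (not real data).
proteins = [random_seq(rng, AMINO_ACIDS, 60) for _ in range(30)]
probes = [random_seq(rng, NUCLEOTIDES, 20) for _ in range(50)]

P = np.array([kmer_counts(s, AMINO_ACIDS) for s in proteins])  # 30 x 400
D = np.array([kmer_counts(s, NUCLEOTIDES) for s in probes])    # 50 x 16

# Simulated binding scores generated from a hidden preference matrix.
W_true = rng.normal(size=(D.shape[1], P.shape[1]))
Y = D @ W_true @ P.T  # rows: DNA probes, columns: proteins

# Fit W in the bilinear model Y ~ D W P^T with two least-squares solves.
A = np.linalg.lstsq(D, Y, rcond=None)[0]      # 16 x 30
W = np.linalg.lstsq(P, A.T, rcond=None)[0].T  # 16 x 400

# Score every probe for a protein the model has never seen.
new_protein = random_seq(rng, AMINO_ACIDS, 60)
predicted = D @ W @ kmer_counts(new_protein, AMINO_ACIDS)
best_probe = probes[int(np.argmax(predicted))]
print(best_probe)
```

Each entry of `W` pairs one DNA fragment with one protein fragment, so large entries point to the fragments that drive a predicted preference. With far more unknowns than training proteins, a toy fit like this reproduces its training data but would need regularization and real measurements before its predictions for new proteins meant anything.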
A Technique with Many Uses
The new approach should appeal to biologists seeking to understand the evolution and function of transcription factors, and it may win converts in other areas of biology as well.
“The Netflix problem is sort of ubiquitous in biology,” Dr. Leslie says. “Something is interacting with something else and you want to learn a rule for what interacts with what.”
The algorithm they developed could theoretically be used to solve any problem of this sort, provided a sufficiently large dataset is available to train the computer.
For example, it could be used to identify which drugs would work best for which patients, given each patient’s genetic profile. Or it could be used to determine which interactions between cancer cells allow them to elude destruction by drugs. Even online dating sites might benefit from this approach by learning human-to-human interactions, though the two scientists say they currently have no plans to enter this area.
“These tools are deployed all around us and used in other fields,” says Dr. Leslie. “We’re now bringing them into biology.”