Theory background

A bit about Protein-DNA Complexes and methods for obtaining them.

Why are Protein-DNA complexes important?

DNA-binding proteins play an extremely important role in all aspects of metabolic activity within an organism, such as gene transcription, packaging, replication and repair. For this reason, examining the complexes formed between proteins and DNA is extremely important, as they form the basis of our understanding of how these processes take place.

Although the number of high-quality structures of DNA-binding proteins obtained has grown greatly in recent years, experimental determination of complex formation still requires sophisticated, expensive and time-consuming methods, which can be replaced with computational approaches.

Obtaining complex structures, especially those of proteins that bind DNA, provides valuable insight into the principles of binding and into how the DNA structure is recognized.

Methods for predicting the structure of a Protein-DNA complex fall into two principal categories: de novo methods and template-based methods.

De novo methods

Methods based on the de novo approach use an in silico recreation of physical forces to fold the primary sequence of the protein into secondary and then tertiary structure.

However, computationally predicting the structure of a protein from its primary sequence still faces many challenges, such as efficiently searching the whole conformational space accessible to proteins. Another main problem of these methods is accurately representing, in silico, the physical forces that drive protein folding.

Therefore, although these methods can be used on small peptides, they are not suitable for large proteins: the computational cost is enormous and the accuracy decreases with each particle added to the simulation. Protein-DNA complexes are consequently not suitable for prediction by de novo methods.

Template based methods

Template-based, or homology-based, methods use known structures of similar sequences, which are expected to adopt a similar conformation, as templates to generate the predicted conformation of the target structure.

They are based on the principle that two proteins longer than 50 residues that share more than 50% sequence identity are structurally similar.

Zones of structural alignment. Within the twilight zone, we cannot guarantee that the proteins are structurally similar.

Therefore, these methods take a known structure and try to thread the sequence of interest onto its tertiary/quaternary structure.

SbiRandM incorporates two template-based methods: superimposition of structures and a modeling-based approach.

Superimposition of structures

This approach is based on the concatenation of pairwise-interaction PDB structures, joining them through the atoms they share in common.

It starts with a complex (or at least two joined chains) and several files of pairwise interactions between chains (polypeptide chains or DNA/RNA strands). It then checks which atoms/residues the two structures have in common and generates a rotation matrix describing the spatial rotation needed to align the two sets of atoms.

This rotation is then applied to the structure, adding a chain to the complex. The process is iterated until there are no chains left to add.
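The alignment step described above can be sketched with the Kabsch algorithm, the standard way to compute an optimal rotation between two matched atom sets. This is a minimal NumPy illustration, not SbiRandM's actual code; the function names and toy coordinates are assumptions for the example:

```python
import numpy as np

def kabsch_rotation(mobile, target):
    """Rotation matrix that best aligns `mobile` onto `target`.

    Both arguments are (N, 3) arrays holding the coordinates of the
    atoms the two structures share, listed in the same order.
    """
    P = mobile - mobile.mean(axis=0)        # center both atom sets
    Q = target - target.mean(axis=0)
    H = P.T @ Q                             # 3x3 covariance matrix
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))  # guard against reflections
    return Vt.T @ np.diag([1.0, 1.0, d]) @ U.T

def superimpose(mobile, target):
    """Apply the optimal rotation and translation to `mobile`."""
    R = kabsch_rotation(mobile, target)
    return (mobile - mobile.mean(axis=0)) @ R.T + target.mean(axis=0)
```

Once the rotation for the shared atoms is known, the same transformation is applied to every atom of the chain being added, placing it in the complex's frame of reference.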

Modeling-based approach.

This method is based on threading a sequence onto template structures. It obtains the best structure for a sequence based on the alignment of that sequence with templates whose structures are known.

This approach generates a full alignment between the sequence and all the pairwise-interaction structures, which are the pieces of the complex, in order to model the full protein.
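As a toy illustration of alignment-based template selection, candidate templates could be ranked by their sequence similarity to the query. This is a stdlib-only sketch, not SbiRandM's actual scoring: the function `best_template`, the template ids and `difflib`'s similarity ratio (a stand-in for a proper alignment score) are all illustrative assumptions:

```python
from difflib import SequenceMatcher

def best_template(query_seq, templates):
    """Return the id of the template whose sequence is most similar
    to the query.  `templates` maps a template id to its sequence;
    difflib's ratio() plays the role of an alignment score here.
    """
    return max(
        templates,
        key=lambda tid: SequenceMatcher(None, query_seq, templates[tid]).ratio(),
    )
```

In the real method, each piece of the complex is modeled on the template its sequence aligns to best.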

About SbiRandM: why it is random and iterative

SbiRandM, in its structural-superimposition variant, uses an iterative approach to complex generation, which is faster than a recursive approach.

First of all, the algorithm determines the stoichiometry of the complex from the FASTA file. Then it generates a list of chains that need to be added to the complex.
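Conceptually, the stoichiometry check amounts to counting how many copies of each unique chain sequence the FASTA file contains. A minimal stdlib-only sketch (the function name and headers are illustrative, not SbiRandM's code):

```python
from collections import Counter

def chain_stoichiometry(fasta_text):
    """Map each unique chain sequence to its copy number in the FASTA."""
    seqs, header, chunks = {}, None, []
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                seqs[header] = "".join(chunks)   # close previous record
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        seqs[header] = "".join(chunks)           # close last record
    return Counter(seqs.values())
```

For example, a homodimer's FASTA would yield a count of 2 for its single unique sequence, so two copies of that chain go on the to-add list.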

After that, the PDB files of the pairwise interactions are parsed and matched to the FASTA chains by homology, so no naming convention is required for these files.

Once the algorithm has a dictionary that relates the chains and stores the structures reflecting their interactions, and knows how many chains must be added to the protein, the complex is started with two chains chosen completely at random (at each execution the starting point of the complex is different). Although this selection is random, the result is always the same.

After the initialization, it starts adding the remaining chains to the structure. To do so, it takes one of the remaining chains at random and consults the dictionary of relationships, checking, for each interaction stored in the PDB files, whether there is a chain in common between the pairwise PDB structure and the complex structure.

When it finds one, it checks for steric clashes. If clashes exist, it looks for another interaction between the chain to be added and the complex. If no steric clashes are present, the chain is added and the complex is updated. If the chain still cannot be added, because the part of the complex it attaches to has not yet been built, the algorithm skips randomly to another chain and tries to add that one instead.
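The add-and-retry loop above can be sketched as follows. This is a simplified illustration under assumed data structures, not SbiRandM's implementation: chains are bare coordinate arrays, `interactions` stands in for the dictionary of pairwise structures (with coordinates already in the complex frame), and the clash test is a plain distance cutoff:

```python
import random
import numpy as np

def has_clash(coords_a, coords_b, cutoff=1.5):
    """Crude steric-clash test: True if any atom of one chain comes
    closer than `cutoff` angstroms to any atom of the other."""
    d = np.linalg.norm(coords_a[:, None, :] - coords_b[None, :, :], axis=-1)
    return bool((d < cutoff).any())

def build_complex(complex_chains, remaining, interactions):
    """Iteratively add chains, moving on to another random chain when a
    placement clashes or no shared chain exists in the complex yet."""
    remaining = list(remaining)
    while remaining:
        random.shuffle(remaining)                # pick the next chain at random
        chain = remaining[0]
        placed = False
        for partner, coords in interactions.get(chain, []):
            if partner in complex_chains and not any(
                    has_clash(coords, other) for other in complex_chains.values()):
                complex_chains[chain] = coords   # placement simplified: coords pre-aligned
                placed = True
                break
        if placed:
            remaining.remove(chain)
        # otherwise loop again: a different random chain is tried next
    return complex_chains
```

In the real algorithm the chain is first superimposed onto the complex through the shared chain; here the coordinates are assumed pre-aligned to keep the control flow visible.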

Why we do not recommend the modeling-based approach, although we include it

Threading a sequence onto a homologous structure usually leads to good results. However, in the modeling-based variant of our algorithm, the model is generated by threading the sequence of the full complex, with Modeller, against many small templates.

This confuses the algorithm when modeling homodimers, and the resulting models do not compare well with the native structure.

In addition, the modeling-based variant of our algorithm is computationally extremely expensive when building large models.
