Chapter 1
Introduction

 1.1 Protein conformational landscapes encode functional information.
 1.2 The Folding@home platform allows access to protein motions at biologically relevant timescales.
  1.2.1 Folding@home distributes simulations across thousands of computers at once.
  1.2.2 Markov State Models allow for the construction of unified models from large simulation datasets.
  1.2.3 The scalable power of Folding@home has generated insights into protein behaviors.
 1.3 Allosteric communication is critical for protein function, but difficult to infer.
  1.3.1 Allosteric communication is universal and critical for biological function.
  1.3.2 Inferring allosteric communication in proteins remains a non-trivial task.
 1.4 Scope of thesis.

1.1 Protein conformational landscapes encode functional information.

Proteins are the machines that power cellular function and life. They allow us to see, smell, think, and carry out many of the basic functions required for us to live. However, when they malfunction or misbehave (usually through mutation), they can cause diseases such as cancer and heart disease. Proteins are also exploited by viruses and bacteria to infect host cells, replicate, and even break down the drugs we use to stop them.

Understanding protein behaviors relevant to health and disease depends on being able to model them in atomic detail. Atomistic-scale motions allow us to infer mechanisms, thermodynamic profiles, and kinetic rates. This level of detail can provide predictive models and explanations for why certain mutations cause disease, which is useful for targeting proteins with drug design methods. Furthermore, knowledge of chemical interactions allows us to design chemical groups against pockets in order to jam them open [4]. Atomic-scale knowledge may even provide guiding principles for designing proteins that perform novel functions, with implications for therapeutic design and industrial applications [5].

Structural biology methods have been transformative in allowing us to learn about the structure of proteins and their behaviors. The first views of protein structures came from X-ray crystallography [6]. However, static structures do not provide a complete picture of protein function. They may not reveal a protein’s stability [7], ligand affinities or specificities [8], or how different mutations could impact function [9]. Indeed, it has often been observed that crystal structures within the same protein family are too similar to explain the differences in their measured physiological parameters [10].

As the atoms of a protein move relative to one another, the protein is able to shift between an enormous number of different structures. Even small proteins of <100 amino acids have ~200 rotatable bonds along their backbone, granting access to more than 10^60 backbone conformations [11]. Many of these structures, however, are never accessed by the protein on physiologically relevant timescales. Of the fraction that are accessed, some may have relevance to the protein’s mechanism and biophysical behavior. Each structure a protein may shape-shift into has an associated energy that characterizes intra-protein and protein-environment interactions. Given that the probability of a protein adopting any one structure is proportional to the exponential of that structure’s energy, we can characterize how likely the protein is to adopt some states over others. The phase space of a protein’s energies (or probabilities) and their corresponding structures is often referred to as an “energy landscape”, with the most likely states (such as those observed by crystallography) named “ground” states because they are energy minima.
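This Boltzmann relationship can be sketched in a few lines of Python; the state energies below are hypothetical placeholder values for illustration, not measurements of any real protein:

```python
import numpy as np

# Hypothetical energies (kcal/mol) of four conformational states,
# relative to the ground state. Illustrative values only.
energies = np.array([0.0, 1.5, 3.0, 6.0])
kT = 0.593  # kcal/mol at ~298 K

# p_i is proportional to exp(-E_i / kT); normalize to get probabilities.
weights = np.exp(-energies / kT)
probs = weights / weights.sum()
```

With these numbers the ground state dominates the population, mirroring why crystallography primarily reports energy minima.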

From these ground states, a protein can also transition into less likely “excited” states, some of which may contain key functional information. These states can be characterized using a plethora of methods. Nuclear magnetic resonance (NMR) and hydrogen-deuterium exchange (HDX) have provided unique functional insight into the conformational heterogeneity a protein can have [12, 13]. Work on dihydrofolate reductase (DHFR) has been critical to understanding its complete catalytic cycle and residue-level configurations (and dynamics) [14]. NMR experiments on DHFR arrested in one stage of its catalytic cycle provided evidence that DHFR also adopts stages from the latter part of its cycle [14]. Mutational experiments also put forth a correlation between dynamics and catalytic ability [14, 15, 13]. Powerful combinations of NMR, crystallography, and computer simulations have rationalized cofactor and mutational effects on kinases [16, 17]. With advances in structural methods, it is even possible to resolve structural information about excited states [14, 15]. Electron paramagnetic resonance (EPR) experiments have granted unique insight into the conformational distributions of proteins [18]. NMR also has the additional power to measure the degree of conformational entropy of residues [19]. Recently, cryo-EM structures of even large complexes have revealed the degree of conformational heterogeneity in physiologically relevant systems [20-23]. However, each of these methods has tradeoffs due to resolution limits, labelling strategies that may perturb the system, or technical limitations (system size, material requirements, etc.). Altogether, this large body of work studying excited states of folded proteins suggests that the equilibrium motions of a protein may encode all of its functionally relevant states.

Molecular dynamics (MD) simulations have the potential to provide atomistic detail to explain complex biological processes. These simulations compute the movements of atoms over time by integrating Newton’s laws of motion for each atom. Thus, MD acts as a “computational microscope”, allowing us to observe the different conformations a protein adopts [24, 25]. A perfect simulation would completely describe a protein’s thermodynamic and kinetic behaviors at equilibrium. However, there are major limitations: (i) the accuracy of the atomic parameters (aka “force fields”) used to describe atomic interactions (a topic discussed extensively elsewhere [26]); (ii) simulations take femtosecond-sized timesteps, making it expensive to gather data at biologically relevant timescales (microseconds to milliseconds); and (iii) interpreting large datasets with thousands of unique structures to generate biologically meaningful predictions remains a daunting task.

1.2 The Folding@home platform allows access to protein motions at biologically relevant timescales.

1.2.1 Folding@home distributes simulations across thousands of computers at once.

A myriad of methods have been developed to improve sampling of protein motions out to biologically relevant timescales [27-30]. However, many perturb the thermodynamics or kinetics of a system, in turn biasing the predictions these simulations make. The advent of GPUs provided access to longer timescales, but they are bounded by an upper limit of parallelism in computing architectures [31-34]. Recently, novel adaptive sampling techniques have allowed for the mapping of slower motions [35], but they may require system-specific knowledge or a relevant order parameter. Specialized hardware has also been developed [36] that has allowed computational biophysicists to study processes at unprecedented timescales. However, utilizing and maintaining this kind of specialized hardware can be a costly endeavor.

To sample unbiased simulations at biologically relevant timescales using commodity hardware, the Folding@home platform was developed in 2000 [37]. Folding@home, headquartered at the Bowman lab at WUSTL, is a distributed computing network that runs MD simulations on computing power donated by thousands of citizen-scientists who download the app onto their hardware. Folding@home runs trajectories as small “work units”, smaller simulations on the order of nanoseconds. The starting file for a work unit is generated server-side and sent to a client computer somewhere in the world (Fig. 1.1). The client then runs the work unit and returns the result, which is used to generate the subsequent work unit (representing the next chunk of time in a simulation). Afterwards, work units are stitched together to generate a single trajectory. Folding@home allows for the generation of datasets containing hundreds of trajectories that, in aggregate, capture a large amount of a protein’s energy landscape. These datasets are so large that interpreting them presents a unique “big data” challenge and requires the development of new methods and software [38, 39].



Figure 1.1: Folding@home distributes work units worldwide to run simulations. A. Workflow schematic detailing the steps a citizen-scientist’s computer (“client”, green) takes to receive, run, and return a work unit successfully by communicating with the Assignment Server (blue) and the Work Server (yellow). B. Heatmap of completed work units returned from each country in the world in a representative 48-hour period. The number of returned work units is indicated by the color scale (right).

1.2.2 Markov State Models allow for the construction of unified models from large simulation datasets.

Markov state models (MSMs) are network representations of a protein’s free-energy landscape, providing maps of protein conformational space annotated with thermodynamic and kinetic properties taken from equilibrium simulations (Fig. 1.2). Rather than depending on a single long simulation that explores multiple states sequentially, MSMs are capable of stitching together many short trajectories into a single unified landscape. Thus MSMs capture slow events, and their intermediates, far beyond the reach of any individual simulation. Thanks to distributed networks like Folding@home, gathering large numbers of short trajectories is tractable in a reasonable time frame. Many reviews have been dedicated to providing accessible and in-depth explanations of MSM technology [40, 38, 41].



Figure 1.2: Diagram highlighting the conversion of a simulation dataset into a Markov state model (MSM). Simulation trajectories (left) are parsed into discrete states on a per-frame basis (red, blue, green, yellow), and estimation methods are used to convert the trajectories into an MSM (right). Each sphere’s radius is proportional to the population of that state, while arrow thickness denotes transition probability.

MSMs require two components to describe the dynamics of a biomolecular system: (i) a discretization of the high-dimensional state space into n conformational states, and (ii) a model of the stochastic transitions between those states, represented as a transition matrix P. The probability of transitioning from state i to state j, P_ij, is:

$$P_{ij}(\tau) = \mathrm{Prob}\left(x_{t+\tau} \in S_j \mid x_t \in S_i\right) \qquad (1.1)$$

where τ is a lag-time parameter across which transitions between states are observed, and S_i and S_j denote states i and j. The transition matrix P gives rise to a stationary distribution π via the eigenvalue equation:

$$\pi^{T} P = \pi^{T} \qquad (1.2)$$

where T denotes the transpose. In an MSM, the stationary distribution π represents the equilibrium probability of each state. Therefore, assuming sufficient statistics are collected to observe transitions between states, MSMs are able to appropriately capture the equilibrium thermodynamic and kinetic behaviors of a system. It is important to note that the underlying assumption behind MSM construction is that the discretized dynamics of biomolecules are memoryless (aka Markovian). That is, the probability of a transition from state i to state j at time t depends only on state i and not on any of the previously visited states.
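As a concrete sketch, the stationary distribution of a toy transition matrix can be recovered as the left eigenvector of P with eigenvalue 1; the matrix entries below are illustrative, not drawn from any real system:

```python
import numpy as np

# Toy 3-state transition matrix (rows sum to 1); entries are illustrative.
P = np.array([[0.90, 0.08, 0.02],
              [0.10, 0.80, 0.10],
              [0.05, 0.15, 0.80]])

# pi^T P = pi^T means pi is a left eigenvector of P with eigenvalue 1,
# i.e. a right eigenvector of P transposed.
evals, evecs = np.linalg.eig(P.T)
idx = np.argmin(np.abs(evals - 1.0))
pi = np.real(evecs[:, idx])
pi /= pi.sum()  # normalize into a probability distribution
```

The normalization step also fixes the arbitrary sign of the eigenvector, since all components of this eigenvector share the same sign for a connected chain.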

For any constructed MSM, the Markovian assumption is tested by examining implied-timescales plots. That is, for a given state decomposition, the molecular relaxation timescales are computed from the eigenvalues λ_i (with associated eigenvectors r_i) for a series of lag times τ:

$$t_i = -\frac{\tau}{\ln\left|\lambda_i(\tau)\right|} \qquad (1.3)$$

These timescales t_i are plotted for a series of lag times τ (C.8). If the timescales remain relatively unchanged at a given value of τ and higher, then the models constructed at those values of τ can be considered Markovian. The values of P_ij can then be estimated using a variety of estimation methods [42, 40].
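Eq. 1.3 amounts to a few lines of linear algebra; the two-state matrix below is a hypothetical example, not an estimate from real data:

```python
import numpy as np

def implied_timescales(P, tau):
    """Implied timescales t_i = -tau / ln|lambda_i(tau)| from the
    eigenvalues of a transition matrix estimated at lag time tau."""
    evals = np.sort(np.abs(np.linalg.eigvals(P)))[::-1]
    return -tau / np.log(evals[1:])  # skip the stationary eigenvalue (= 1)

# Hypothetical two-state matrix with eigenvalues 1.0 and 0.85.
P = np.array([[0.95, 0.05],
              [0.10, 0.90]])
timescales = implied_timescales(P, tau=1.0)
```

Repeating this for models estimated at several values of τ and plotting the results produces the implied-timescales plot described above.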

A critical component of high-quality MSM construction is the set of features selected for the discretization of state space. It is important to balance statistical error against systematic bias when choosing: narrowly defined features may result in poor statistics, while broadly defined features may carry large systematic errors due to the convolution of multiple motions into a single state. Choosing appropriate features can be challenging due to the sheer size of simulation datasets and the system-specific knowledge often needed to identify them.

While methods such as principal component analysis (PCA) provide some degree of dimensionality reduction, a major breakthrough in simulation featurization was the use of time-lagged independent component analysis (tICA) [43], which transforms input coordinates to identify the rarest (and, by assumption, slowest) motions. For protein folding simulations this provides excellent dimensionality reduction, since the slowest coordinate is often the rarest and most valuable [44, 45]. However, in simulations of folded proteins the rarest motions observed may be less interesting or may be sampling artifacts. Much success has previously been seen using geometric features, such as Cartesian coordinates or dihedral angles [46-48]. Other features, such as solvent-accessible surface area (SASA) or ligand residence time, have also been useful for the construction of predictive MSMs [49, 48].

Once appropriate features are selected, there are a myriad of approaches to cluster them into discrete states. One major method is a hybrid k-centers/k-medoids approach [50, 51]. In brief, a distance metric, such as the root-mean-square deviation (RMSD), is computed between pairs of conformations in feature space. The k-centers algorithm then (i) chooses an initial cluster center, either as a predetermined point or randomly, and assigns all points to this initial cluster; (ii) computes the distance between every point and its assigned cluster center; (iii) labels the point with the largest distance to its assigned cluster center as a new cluster center; and (iv) recalculates the distances between all points and all cluster centers, reassigning points to their closest cluster center. Steps (ii)-(iv) are repeated until the maximum distance from any point to its assigned cluster center falls below a specified threshold, or a maximum number of cluster centers is reached.
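The loop above can be sketched as follows; this minimal version assumes Euclidean distances on synthetic 2-D points rather than RMSD between conformations:

```python
import numpy as np

def k_centers(X, k, seed=0):
    """Minimal k-centers sketch. Assumes Euclidean distances on points
    in feature space (real MSM pipelines often use RMSD instead)."""
    rng = np.random.default_rng(seed)
    centers = [int(rng.integers(len(X)))]            # (i) initial center
    dist = np.linalg.norm(X - X[centers[0]], axis=1)
    labels = np.zeros(len(X), dtype=int)
    while len(centers) < k:
        far = int(np.argmax(dist))                   # (iii) farthest point becomes a center
        centers.append(far)
        d_new = np.linalg.norm(X - X[far], axis=1)
        closer = d_new < dist                        # (iv) reassign points that moved closer
        labels[closer] = len(centers) - 1
        dist = np.minimum(dist, d_new)
    return np.array(centers), labels

# Three well-separated synthetic clusters of 20 points each.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.1, (20, 2)) for c in (0.0, 5.0, 10.0)])
centers, labels = k_centers(X, k=3)
```

On well-separated data like this, each promoted center lands in a different cluster, which is why k-centers is a fast way to cover conformational space.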

To refine the cluster assignments, and to ensure that the “center” of each cluster is truly central among its assigned points, a k-medoids algorithm known as Partitioning Around Medoids (PAM) [51] is used. PAM proceeds by iterating through each cluster and proposing a new center from one of the currently assigned points. All points across all states are then reassigned based on this proposed cluster center. A “cost” is calculated (i.e., the sum of distances from each point to its respective center), and the proposed center placement is accepted if it reduces the cost. Once states are discretized, clustered, and assigned across a simulation trajectory, the transition probability matrix is estimated. As mentioned above, the matrix of transitions between states is counted based on some lag time τ.

One simple approach to estimate transitions is to count the number of transitions from state i to state j, and divide by the number of observations of state i. To maintain ergodicity, some methods estimate only over the largest connected subset of states [52], or use maximum-likelihood estimators that respect detailed balance [53]. Another method to estimate transition probabilities is to average the count matrix with its own transpose:

$$C_{ij}^{\mathrm{transpose}} = \frac{C_{ij} + C_{ji}}{2} \qquad (1.4)$$

where C_ij is the observed number of transitions from state i to state j, and C_ji represents the number of transitions in the reverse direction. Subsequent row normalization is used to calculate the equilibrium probabilities:

$$\pi_i = \frac{\sum_{j} C_{ij}^{\mathrm{transpose}}}{\sum_{k,j} C_{kj}^{\mathrm{transpose}}} \qquad (1.5)$$

where C_ij^transpose is the averaged number of transitions between states i and j, and C_kj^transpose is the number of transitions between states k and j. Recent success has been observed by simply adding a pseudocount C̃ to serve as an estimate of the system in the absence of data [42, 54]. This pseudocount is computed as a single observed transition divided up across all states:

$$\tilde{C} = \frac{1}{N} \qquad (1.6)$$

where N is the number of states.
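Taken together, Eqs. 1.4-1.6 can be sketched in a few lines; the count matrix below is illustrative, not estimated from real trajectories:

```python
import numpy as np

# Illustrative 3-state transition count matrix (counts at lag time tau).
counts = np.array([[80.,  5.,  1.],
                   [ 6., 60., 10.],
                   [ 2.,  9., 40.]])
n_states = counts.shape[0]

# Eq. 1.6: a single pseudocount spread evenly across all states.
C = counts + 1.0 / n_states

# Eq. 1.4: average with the transpose to enforce detailed balance.
C_sym = (C + C.T) / 2.0

# Row-normalize for transition probabilities; Eq. 1.5 gives the
# equilibrium probabilities from the symmetrized counts.
P = C_sym / C_sym.sum(axis=1, keepdims=True)
pi = C_sym.sum(axis=1) / C_sym.sum()
```

A model built this way automatically satisfies detailed balance (π_i P_ij = π_j P_ji), one motivation for the transpose-averaging estimator.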

1.2.3 The scalable power of Folding@home has generated insights into protein behaviors.

There are many success stories that have come out of the Folding@home platform, such as the observation of a millisecond-timescale folding event in 2010 [46]. Markov state models have been shown to quantitatively agree with experimental measurements [55, 40]. In particular, there is excellent agreement between microsecond-scale simulations and the properties of systems measured with NMR and room-temperature crystallography [56, 57]. Furthermore, simulations have been able to interpret the impact of mutations on diseases such as phenylketonuria [58] and to characterize the landscapes of a myriad of targets [59]. Folding@home and MSMs have also allowed for the assessment of families of protein homologs [10].

Folding@home has also made significant progress in developing predictive models that supplement experimental work. For example, Folding@home data was used to identify novel cryptic pockets in TEM-1 β-lactamase [49], the enzyme most directly involved in antibiotic breakdown and microbial resistance. Indeed, accounting for the dynamics within the active site of TEM-1 β-lactamase substantially improved the predictive ability of modern virtual screening technologies [9]. Similar approaches have yielded valuable insights into the pH dependence of protein-protein interactions [60], among other biological phenomena.

1.3 Allosteric communication is critical for protein function, but difficult to infer.

1.3.1 Allosteric communication is universal and critical for biological function.

One cellular phenomenon that occurs at long timescales is communication between distant structural elements of a protein. This behavior, referred to as allostery [61], was first recognized in hemoglobin [62], where the binding of oxygen to a single subunit increases the oxygen affinity of the other three subunits. Since then, the importance of allostery has been recognized in a myriad of cellular functions, such as transcription factors [63, 64] and cellular signaling [65].

A well-studied protein (and drug target) with allosteric behavior is the G protein-coupled receptor (GPCR), which transmits information from outside the cell to inside in response to a stimulus. This stimulus can be anything from ligand binding [66] to membrane deformations [67]. Structural methods revealed rearrangements of transmembrane helices that convert the GPCR into an “active” form [68]. However, recent data on different GPCR-Gα complexes have highlighted the conformational heterogeneity in the allostery and activation of GPCRs [20-23].

The idea that a protein’s conformational landscape can impact its allosteric behavior leads one to speculate whether all proteins have some degree of allosteric coupling [69]. Indeed, the ubiquity of allostery has been acknowledged in studies of natural and directed evolution [70, 71], where mutations distant from the active site can impact measured properties. Given the potentially universal nature of allostery, it is worth asking whether mutations and ligands work by tapping into existing allosteric networks to modulate the distribution of a protein’s structures and dynamics [41]. Thus, understanding allosteric coupling in proteins could present new opportunities for modulating biological processes, designing therapeutics, or even designing new proteins.

Mounting evidence highlights the value of leveraging allostery to modulate protein function. There are a multitude of biological systems where allostery is leveraged to inhibit or activate protein-protein interactions, so it may be possible to identify small molecules that achieve the same objective of modulating protein behaviors. An allosteric drug that modulates protein behavior could play a huge role in restoring lost functions or damping overactive ones. However, modern drug-design methods often require the presence of a cavity (or “pocket”) to successfully design small-molecule hits. Many surfaces involved in protein-protein interactions are too flat for a small molecule to bind tightly [72], and targeting the known ligand-binding sites of critical signaling proteins like GPCRs creates the risk of off-target effects. Identifying distant pockets that are less conserved between homologs could be a means to achieve specificity [73].

Hidden allosteric sites known as “cryptic pockets” could be promising targets for drug design methods. The shape-shifting nature of proteins implies the existence of states containing new pockets that are not observed in existing experimental structures. These hidden pockets may also be cryptic allosteric sites connected to key functional sites via the underlying allosteric network of a protein [74]. Successful methods have emerged to identify novel allosteric sites, some of which have been verified by experiments [49]. Indeed, the value of cryptic pockets has been supported by the discovery of small-molecule inhibitors that bind an allosteric pocket and modulate a protein’s function [9, 75-77]. Computer simulations provide promising avenues to hunt for and target cryptic pockets, an effort which has yielded promising results [78, 79], but it remains a challenge to apply these approaches to a wide variety of systems. Furthermore, the discovery of a distant pocket in a protein does not imply that it is “useful” as a drug target, because it is difficult to measure the degree of coupling between the cryptic pocket and a protein’s functional regions (like the active site). Understanding the communication between a cryptic pocket and the functional regions of a protein could further supplement drug design strategies. However, obtaining a complete picture of the allosteric network of a protein is often difficult.

1.3.2 Inferring allosteric communication in proteins remains a non-trivial task.

Methods for inferring allostery typically rely on observing concerted structural changes. A system with two distant sites may jump between alternative configurations in some coupled fashion. That is, the structure of site A may be coupled to the configuration of site B, and vice versa. This extreme example of conformational selection could be inferred by comparing structures of proteins before and after some perturbation to one of the sites is introduced (such as ligand binding). Indeed, crystallography and HDX methods have proved useful in revealing residues involved in allosteric networks of TIM-barrels and their catalytic domains [80]. Multiple NMR methods have also proven useful in studying the nature of allosteric communication between proteins [81].

Likewise, computational methods measure concerted structural changes using a variety of metrics on a myriad of protein features. Some methods utilize sequence coevolution to group protein regions into “sectors” that are coupled to one another [82]. Recent work has highlighted that molecular simulations can capture atomistically detailed pictures of allosteric coupling between sites. The underlying assumption that a protein’s functional states are encoded in its equilibrium motions implies that correlated motions observed in MD simulations are representative of the degree of coupling between residues. Indeed, a number of algorithms use a myriad of features and metrics to quantify coupling [83, 84]. One choice is to measure the degree of covariance between pairs of backbone Cα atoms [85]. Other methods apply mutual information to dihedral angles to quantify how much better one residue’s dihedral angle predicts the dihedral angle of another residue [86]. However, there has been growing recognition that allostery via concerted structural changes is not the only mechanism through which two sites may be coupled.
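As a toy illustration of the mutual-information idea (a generic sketch, not the specific algorithm of any cited method), consider two synthetic sequences of discrete rotamer states:

```python
import numpy as np
from collections import Counter

def mutual_information(a, b):
    """Mutual information (in nats) between two discrete state sequences,
    e.g. per-frame rotamer states of two residues."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    mi = 0.0
    for (x, y), c in pab.items():
        pxy = c / n
        mi += pxy * np.log(pxy / ((pa[x] / n) * (pb[y] / n)))
    return mi

# Synthetic data: residue B copies residue A's rotamer 90% of the time.
rng = np.random.default_rng(0)
a = rng.integers(0, 3, 5000)
b = np.where(rng.random(5000) < 0.9, a, rng.integers(0, 3, 5000))
coupled = mutual_information(a, b)
uncoupled = mutual_information(a, rng.integers(0, 3, 5000))
```

The coupled pair yields a substantially larger mutual information than the independent pair, which is the signal such methods use to rank residue-residue couplings.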

In recent years, the role of conformational entropy in allosteric communication has been increasingly acknowledged. The importance of conformational entropy was first described theoretically in 1984 by Cooper and Dryden [87]. Since then, experimental evidence for this “dynamical allostery” has grown. In particular, NMR data demonstrated that two sites on a transcription factor were coupled with no discernible structural changes [88]. Furthermore, intrinsically disordered regions can also play a role in allosteric coupling, as normally ordered regions of a protein may transition locally into higher-entropy excited states that rapidly hop between multiple thermodynamic minima. This flattening of the effective free-energy surface for a set of residues distinguishes dynamic allostery from the previously discussed mechanism of concerted structural changes. More recently, it has become apparent that understanding a protein’s allosteric network requires observing both concerted structural changes and altered conformational entropy [89]. The ability to construct allosteric networks, by measuring both structure and disorder, has the potential to explain the mechanism of coupling in many complex biological processes, and may present opportunities to identify promising druggable pockets and the roles they play in modulating protein function.

1.4 Scope of thesis.

Understanding allostery remains a critical challenge in completely describing biological behavior. The potential power of MD simulations to explain complex biological processes in atomistic detail presents a promising avenue toward these goals, but the tools to do so remain limited in scope, and generating simulation datasets that capture slow allosteric processes remains difficult. This thesis describes an approach to understanding allosteric communication and the conformational landscapes of proteins, and leverages these insights toward understanding fundamental biological phenomena and supplementing drug design efforts.

In this thesis I describe a method to infer allosteric coupling in MD simulations via both concerted structural changes and conformational entropy. This is done by measuring the Correlation of All Rotameric and Dynamical States (CARDS), a novel method presented in chapter 2. This algorithm builds upon previous works that infer allostery through structural changes by additionally capturing allostery through changes in conformational entropy. It parses dihedrals into dynamical states, capturing whether a rotamer is ordered (remaining in a single basin) or disordered (rapidly hopping between basins). We then describe our framework to measure coupling between every pair of residues by computing communication between rotameric states, between dynamical states, as well as cross-correlations. We apply the CARDS method to a system with known dynamic allostery, the Catabolite Activator Protein (CAP), a transcription factor whose allosteric behavior was previously measured in NMR and ITC studies.

Chapter 3 describes the application of CARDS and other MD/MSM methods to a known allosteric system of importance, the heterotrimeric G proteins. Heterotrimeric G proteins are molecular switches that regulate processes including vision, smell, and neurotransmission. Malfunctions in G protein activation are implicated in cancers such as uveal melanoma. While considerable work has characterized G protein thermodynamics and kinetics, a complete mechanism of activation remains unclear, including the allosteric network coupling the receptor- and nucleotide-binding sites. Here we describe in atomistic detail, for the first time, a complete mechanism of G protein activation, GDP release, and the conformational and dynamical changes driving this process.

Chapter 4 describes the discovery of a hidden cryptic pocket in the previously “undruggable” ebolavirus protein VP35. The seeding strategy described in chapter 3 is applied to the RNA-binding VP35 protein. A cryptic pocket is identified from a Folding@home dataset, and CARDS identifies the degree of coupling between the cryptic pocket and residues important for protein-protein and protein-nucleic-acid interactions (PPIs and PNIs, respectively). The existence of this cryptic pocket is supported by experiments, and the functional importance of this distant allosteric site is solidified using experimental techniques that observe VP35 inhibition when the pocket-open state is stabilized.

Chapters 5 and 6 describe recent efforts utilizing Folding@home to study SARS-CoV-2, the virus behind the COVID-19 pandemic. In chapter 5, we rapidly generate conformations of the Nucleocapsid protein’s folded domains and integrate them with Monte Carlo simulations and experiments. This study shows that the Nucleocapsid protein is dynamic and disordered, and undergoes liquid-liquid phase separation (LLPS). Folding@home simulations of the folded domains are used to seed Monte Carlo simulations of the intrinsically disordered regions of the Nucleocapsid protein, yielding a complete picture of the free-energy landscape; a feat that would not have been achieved using a single method. These simulations are integrated with experimental data to describe a model explaining how the N protein may package the genome.

In Chapter 6, I describe a recent effort in which Folding@home shifted focus to simulating potential drug targets across the proteome of SARS-CoV-2. Many citizen-scientists around the world rallied together, downloading the Folding@home app and running simulations of almost every possible protein from SARS-CoV-2. This effort generated a historic 0.1 seconds of equilibrium simulation data. Included is a description of how, through generous donations and partnerships, Folding@home surpassed the exascale barrier in computing speed, a feat never before achieved. We utilize this monumental computational power to study how the viral Spike protein uses conformational masking to evade an immune response, and to identify cryptic pockets that are not present in existing experimental snapshots. The data generated by Folding@home present new potential targets for drug design efforts, and new structural and mechanistic insights that may supplement the design of therapeutics.

Chapter 7 demonstrates how MD simulations can be integrated with standard structural biology techniques to explain mechanisms of antibiotic resistance. The cefotaximase enzyme CTX-M is responsible for the breakdown of many modern cephalosporins, which can result in microbial resistance and sustained infections. With each new drug introduced, such as ceftazidime (CAZ), CTX-M has been shown to accrue mutations that grant it enhanced resistance profiles. This chapter focuses on two mutations, D240G and P167S. Counter-intuitively, these mutations are not additive in their behavior, and the presence of both abrogates the CAZ resistance profile of CTX-M. Combining MD and crystallographic methods, along with biochemical approaches, we describe how these mutations alter the acyl-enzyme complex and modulate a key region of CTX-M known as the Ω-loop. We show that while each mutation individually modulates the Ω-loop to better accommodate CAZ, together they revert the conformational behavior of the Ω-loop back to its non-resistant state. This study demonstrates the unique power of combining MD with structural biology methods to better understand fundamental behaviors and biological phenomena.

Lastly, in chapter 8 I summarize the main advancements presented in this thesis. This chapter further explores the general implications of these findings and how allosteric coupling may be a universal phenomenon. I expand on promising future projects and open questions, and speculate on the prospect of further integrating simulations with experiments to better answer fundamental biological questions.