Students select one of the following project assignments and present their results to the class at the end of the term. Oral presentations should last 15 minutes in addition to a 5 minute question and answer session with the class.
New experimental evidences suggest that gene's location inside a nucleus is critical for its regulation and transcription. Genes that are close in space can be easily co-regulated, while those that are distant maybe hard to co-regulate. Genes that are close on the chromosome are close in space inside the nucleus and hence could be co-regulated. Correlation between genes' proximity and co-expression has been reported earlier (e.g. see papers by K. Wolfe). Here your goal is to go beyond simple correlation, to establish:
Proteins have to recognize and bind their targets. A small group of residues could be responsible for specific recognition. These residues are called specificity-determining residues (SDRs). Analysis of other proteins with different specificity from the same organism (paralogs) and proteins with the same specificity in other organisms (orthologs) allows to predict SDRs (see papers by Gelfand and Mirny). We hypothesize that SDRs must be responsible for formation of hetero-and homodimers. Your goal is to test this hypothesis on coiled-coils and epidermal growth factor receptors (EGFR).
Let P be a morphogenic protein, and S a (near) homogeneous population of cells which are P-responsive. Namely, S cells undergo observable changes in some phenotype in response to P after a specific time period T (A possible phenotypic transformation could be that S grows more heterogeneous). The negative control experiment is S subject to a vehicle protein C for the same length of time T. The general objective is to characterize the direct and indirect downstream molecular cascade/s of P in S, bioinformatically. The experimental data consists of transcriptome profiles of S under the conditions described above measured by micro array. An independent source of supplementary material may be the public BIND or DIP database of curated/predicted protein-protein interactions.
Some questions:
Let B and T be developmental transcriptome profiles ("time series") of mouse B and T cells respectively (>4 developmental stages each, measurement/biological replicates at each stage), measured by micro arrays.
The general objective is to find genes whose time profiles distinguish between B and T cell development, and/or gene pairs whose pair wise dissimilarity measure are different in the B versus T datasets. For example, genes G1 and G2 are close to one another (belong to a common cluster/partition C1) in B, but G1 and G2 are distant from one another (belong to disjoint clusters/partitions) in the T.
Some questions:
Bioinformatically characterize in trisomy 21 transcriptome datasets the interactions between presence of an extra chromosome 21 and loci of non-21 chromosomes in mouse and human. The general question is to assess the likelihood of an extra Chr: 21 affecting homologous genes in human and mouse, and to check whether a gene's chromosomal location is a modulating factor in its interaction with a Chr: 21 gene. Construct and justify a rational approach to resolving this problem.
Several recent papers claim that some genes have multiple amino acid replacements driven by positive selection after divergence of human and chimpanzee lineages. It was suggested earlier that pattern of amino acid substitutions in genes under positive selection differs from other genes. The goal of this project is to compare patterns of amino acid replacements in genes showing signs of positive selection. You can take any subset of genes evolving under positive selection from any of the papers below. To find substitutions you can compare human proteins to chimpanzee genomes using tblastn (to obtain translated alignments). Alternatively, you can translate alignments from Clark et al. 2003 (Science 302:1960-3). It is very important to exclude any doubtful alignments and gene annotations. There are three questions to be answered:
Many proteins that contain well-annotated domains also contain sequence regions that do not correspond to any known domains or sequence motifs. Some domains can be evolving fast which makes difficult their identification as conserved mobile sequence elements. This project is based on the hypothesis that in some cases domain organization of proteins may be conserved even if some domain sequences diverged beyond detectable level of sequence similarity. This would allow finding new domains by association with other domains (similarly to the idea of conserved gene neighborhood). The goal is to identify a novel domain (or show a lack of such) located between two known domains. First, you use existing databases to identify all proteins with the same domain organization as proteins in a given set (e.g. EGFR, insulin receptors etc). Second, you test a hypothesis that sequence regions between corresponding domains of these proteins also belong to the same uncharacterized domain. Weak sequence similarity, similarity in predicted secondary structure, length and amino acid composition should be explored.
Currently available annotation of protein coding genes in the genome is not perfect. As a result, many genes contain cryptic exons-exons which were not properly annotated. Most frequently these exons are alternatively spliced. The aim of this project is to discriminate between protein coding exons and conserved non-coding sequence regions using comparative genomics. MultiZ alignments of mammalian genomes needed for this project can be downloaded from the UCSC genome browser. For simplicity we can use three genomes (e.g. human, mouse and dog). Coding and non-coding sequences are expected to have very different patterns of sequence changes because of the following characteristics of coding sequences: biased codon composition, higher conservation, absence of stop codons and frameshifts, higher fraction of synonymous than nonsynonymous substitutions, higher frequency of substitutions corresponding to changes of chemically similar amino acids, higher proportion of mutable contexts. All these differences can be captured explicitly if substitution pattern for tri-nucleotides in multiple genomes would be compared in coding and non-coding sequences. The project aims at estimating frequencies fijk that tri-nucleotides i, j and k are observed in corresponding positions in human, mouse and dog genomes. These frequencies should be estimated for both coding exons in the correct frame and non-coding sequences included in the alignments. Since some fijk can be very small, the use of pseudo counts may be necessary. After estimating the above frequencies, we will have to test whether an additive score of logarithmic ratio of frequencies corresponding to coding exons and frequencies corresponding to non-coding sequences can discriminate between exons and non-coding sequence.
Mutation rate is a fundamental parameter in models of population genetics and molecular evolution. Estimation of neutral mutation rate is critical for many tests for natural selection. The purpose of this project is to test whether mutation rate is constant along the genome. Assume that the fraction of functionally important sequence in the genome is small and can be neglected. The projects will use whole genome alignment of human and chimpanzee, which can be downloaded from the UCSC genome browser. Homogeneity of mutation rate, can be estimated from variance of number of substitutions in windows of a given length. Is observed variance similar to the variance expected under the assumption of homogeneity of mutation rate along the genome for all window lengths? If mutation rate is not homogeneous along the genome, what is characteristic scale of this inhomogeneity? How does the result change if substitutions in hypermutable CpG dinucleotides were excluded?