BIO 373 -- Mechanisms of Evolution
P.O. Box 805
Grinnell, IA 50112
Do not circulate without permission!
Biology 373 -- Mechanisms of Evolution -- Spring 1998
LAB 1 -- METHODS OF EVOLUTIONARY ANALYSIS
Monday's activities, in classroom and lab, will focus on the various methods used in testing evolutionary hypotheses, as reviewed in Chapter 3 of the textbook. You should read this chapter carefully, as well as the accompanying handout on basic statistical techniques from the beginning through the description of regression analysis. [Do not worry about the sections on multiple regression and principal components analysis for the time being.] We will take up the subject matter in the following order:
1. Section 3.3 -- Reconstructing history.
2. Section 3.1 -- Experimental approaches
Lab (please meet in the computer lab, Science 2231)
1. Finish any discussion of issues from Section 3.1
2. Section 3.2 -- The comparative method
Consider the analysis of body weight vs. testis size in primates by Harvey and Harcourt. What is the hypothesis being tested by this analysis? What is the biological significance of the regression line in figure 3.8? Consult the attached photocopy of a discussion of allometry from Futuyma's Evolutionary Biology for a review of this concept.
Your handout includes a photocopy of the raw data used in this analysis. Using Minitab, perform the regression analysis shown in figure 3.8. Are body weight and testis size related allometrically? What are two ways to "correct" for this relationship when testing the primary hypothesis of the study? Perform these analyses using the raw data provided. Compare your result with that using values of relative testis size "uncorrected" for allometry.
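As a rough illustration of what the Minitab regression in this step computes, the following Python sketch fits a least-squares line to log-transformed body weight and testis weight and returns the residuals, which are one way of expressing testis size relative to the allometric expectation. The function name log_log_regression and the data values are hypothetical placeholders; use the raw data from your handout for the actual analysis.

```python
import math

def log_log_regression(body, testis):
    """Least-squares fit of log10(testis) on log10(body).

    Returns (slope, intercept, residuals). The slope estimates the
    allometric exponent b in testis = a * body**b; each residual shows
    how far a species lies above or below the fitted line.
    """
    x = [math.log10(v) for v in body]
    y = [math.log10(v) for v in testis]
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    return slope, intercept, residuals

# Hypothetical values (body in kg, testes in g), NOT the handout data.
body_kg = [0.3, 5.0, 12.0, 45.0, 70.0, 160.0]
testis_g = [1.2, 10.0, 50.0, 120.0, 40.0, 30.0]
slope, intercept, residuals = log_log_regression(body_kg, testis_g)
```

A slope different from 1 on the log-log scale indicates an allometric (non-proportional) relationship. Species with positive residuals have larger testes than expected for their body weight.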
Consider the analysis of breeding system and seed type in angiosperms performed by Givnish. What statistical methods were used to analyze the data? Summarize Donoghue's argument for why Givnish's method is flawed as a test of functional associations between states of different characters. Does this argument apply to the Harvey and Harcourt study as well?
3. Simulation studies
What particular aspects of evolutionary history (descent from common ancestors) create the problems described by Donoghue? Is it necessary in all cases to know the evolutionary history of a group of organisms in order to do comparative studies? (If so, a long history of comparative biology is practically worthless.) One approach to addressing this issue is to simulate the evolution of characters on a phylogeny assuming NO functional relationship between two characters, and then to test the resulting species in the manner of Harvey and Harcourt or Givnish to see if spurious significant results are generated. By varying certain aspects of the phylogeny, you can test whether the results are sensitive to particular features of evolutionary history.
You will be divided into groups of two, each of which will use one simulation program: either one that simulates the evolution of continuously varying characters (like body and testis size) or one that simulates discrete characters (like mating system and seed type). I will provide relevant details for each program. By the end of the lab you should be able to give a short report on the results of your research.
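The logic of the continuous-character simulation can be sketched in a few lines of Python. This is not the program used in lab; it is a minimal sketch that assumes Brownian-motion evolution of two independent traits on a very simple phylogeny of two clades (each clade shares an internal branch of length `shared`, followed by independent tip branches of length `tip`). All function names and parameter values are illustrative assumptions. The critical value 0.444 is the two-tailed 5% cutoff for a correlation coefficient with 20 species (18 degrees of freedom).

```python
import math
import random

def pearson_r(x, y):
    """Ordinary product-moment correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / math.sqrt(sxx * syy)

def simulate_tips(n_per_clade, shared, tip, rng):
    """Tip values of one trait for two clades under Brownian motion."""
    values = []
    for _ in range(2):
        # Change accumulated on the branch shared by the whole clade.
        clade_value = rng.gauss(0.0, math.sqrt(shared)) if shared > 0 else 0.0
        for _ in range(n_per_clade):
            # Independent change on each terminal branch.
            values.append(clade_value + rng.gauss(0.0, math.sqrt(tip)))
    return values

def false_positive_rate(shared, tip=1.0, n_per_clade=10, reps=500,
                        r_crit=0.444, seed=0):
    """Fraction of runs in which two UNRELATED traits look correlated."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        x = simulate_tips(n_per_clade, shared, tip, rng)
        y = simulate_tips(n_per_clade, shared, tip, rng)
        if abs(pearson_r(x, y)) > r_crit:
            hits += 1
    return hits / reps
```

Comparing false_positive_rate(shared=0.0), a "star" phylogeny in which species are independent, with false_positive_rate(shared=50.0), in which most change happened before the species within each clade diverged, shows how shared ancestry alone can generate apparently significant correlations.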
Lab 3 -- EVOLUTION IN STRUCTURED POPULATIONS
The Hardy-Weinberg theorem's assumption of random mating in populations is rarely, if ever, met in real populations. The resulting processes of inbreeding and genetic drift have an important effect on genotype frequencies within and among populations. Understanding how these effects depend on parameters such as population size and migration among subpopulations (gene flow), and how they interact with selection, will become important in understanding how species maintain "cohesiveness" -- or whether they cease to do so and thus form separate evolutionary lineages. The packaging of individuals into separately interacting units also creates opportunities for new levels at which selection might act, e.g. favoring traits that favor group rather than individual survivorship. Understanding these theoretical effects of population structure will be critical to developing empirically testable hypotheses about real organisms.
I -- Basic population structure
Choose "Differentiation Models" from the first menu, and "Population Structure" from the next menu. Read the introductory screens.
This simulation shows how populations differentiate under variable rates of migration and deme size. A deme is the same as a subpopulation, i.e., a population connected by migration to a number of other populations. Note that there is NO selection against the alleles.
1. Explore the effect of deme size on population structure over at least 100 generations. Make sure to look at the graph displaying the changes in FIS, FST, and FIT over time (hit the space bar to switch between graphs).
(a) What is the effect of increasing deme size on inbreeding within demes?
(b) Does increasing deme size have an effect on population differentiation as measured by FST? Explain why this occurs.
2. Holding deme size constant (e.g. at 12), explore the effect of increasing migration rates on differentiation.
(a) How does increasing the migration rate affect inbreeding within populations? Explain.
(b) How does the increase affect population differentiation as measured by FST? Explain.
3. Assuming that m is small, the equilibrium value of FST is given approximately as FST = 1/(4Nm + 1). Does this equation match your general observations of the effects of N and m in the above simulations?
Many population geneticists measure FST using studies of allozyme allele frequencies in order to estimate rates of migration (m) or number of migrants (Nm). Try to see whether the quantitative predictions of the above equation hold using Populus. If they do not, speculate on what factors prevent the equilibrium from being reached.
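For Wright's island model with small m, the equilibrium relationship is commonly written as FST = 1/(4Nm + 1). Assuming that this is the equation intended in question 3, the short Python sketch below tabulates the predicted FST for a few deme sizes and migration rates, and inverts the formula to estimate Nm from an observed FST. The function names and the parameter values are illustrative assumptions, not Populus settings.

```python
def fst_equilibrium(n, m):
    """Predicted equilibrium FST for deme size n and migration rate m."""
    return 1.0 / (4.0 * n * m + 1.0)

def migrants_from_fst(fst):
    """Number of migrants per generation (Nm) implied by an observed FST."""
    return (1.0 / fst - 1.0) / 4.0

def prediction_table(deme_sizes, migration_rates):
    """List of (N, m, predicted FST) for every combination."""
    return [(n, m, fst_equilibrium(n, m))
            for n in deme_sizes
            for m in migration_rates]

if __name__ == "__main__":
    for n, m, fst in prediction_table([6, 12, 24], [0.01, 0.05, 0.10]):
        print(f"N={n:3d}  m={m:.2f}  Nm={n * m:5.2f}  FST={fst:.3f}")
```

Note that the prediction depends only on the product Nm, so doubling deme size and halving the migration rate should leave the expected FST unchanged. Comparing these predictions with your Populus runs is one way to approach the last part of the question.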
II -- Selection and Gene Flow
Choose "Selection, Gene Flow and Clines" from the "Differentiation Models" menu. Read the introductory screens carefully.
1. Start with the default conditions, which use the "Gradient" model of changes in selection across the linear array of demes. After getting the first graphical output, press <CTRL>+<ENTER> to run the model to equilibrium.
2. Vary the steepness of the selection gradient across the array by varying the parameter s. What is the effect of this on the shape (width) of the cline? on the time to equilibrium?
3. Vary the rate of migration to adjacent demes. What is the effect of this on the shape (width) of the cline? on the time to equilibrium?
4. Does the shape of the cline itself indicate anything about selection? Explore many combinations of s and g, as well as other models of selection (e.g., heterozygous advantage and frequency dependence), in order to answer this question. Are these variations biologically plausible?
III -- Levels of selection
Certain forms of population structure provide the possibility for selection to act at levels higher than that of the individual; basically, genes that produce altruistic traits (by definition, traits that reduce the bearer's fitness but increase the fitness of others in the group) can evolve to higher frequencies through a process driven by the greater success of groups with more altruists vs. groups with fewer altruists.
1. Before running the models, review the alternative (and most commonly cited) way of viewing the mechanism by which altruistic traits evolve, i.e., kin selection. [Consult your book -- Chapt. 16 -- or your instructor, if you are unclear about this concept.] Below, describe the logic behind explanations of the evolution of altruism based on kin selection.
2. Choose "Selection" from the first menu, "Group selection" in the second, and then "Interdemic Group Selection." Read the introduction screens so you understand the assumptions of the model. Enter the model and press <F4> (which allows you to compare current results to previous ones).
3. Vary the strength of selection against altruists within demes (sAA). What effect does this have on the evolution of the altruistic allele? Explain.
4. Vary the values of a and b in order to alter the advantage of groups containing more altruists. What effect does this have on the evolution of the altruistic allele? Explain.
Are there biological situations that conform to the assumptions of this variation in a and b?
5. Alter deme size and rate of migration separately to determine the effect of each on the evolution of the altruistic allele. Explain what you observe.
6. Could you use the kin selection framework to explain all these phenomena? Do you think one framework is more useful than the other?
7. At the bottom of the parameter screen, change the number of runs to average to 1. What does the result demonstrate?
8. Escape from "Interdemic Group Selection" and choose "Intrademic Group Selection." Read the introductory screens and make sure you understand the differences between this and the previous model. Run the simulation at the default settings. Note that the frequency of the altruistic allele after dispersal is shown in larger yellow dots.
(a) How does the frequency of the altruistic allele evolve between dispersal events? Why?
(b) How does the frequency of the altruistic allele evolve when dispersal occurs? Why?
9. Raise the cost for the altruist from 0.05 to 0.1. Explain the difference from the previous result.
10. Reset s = 0.05 and N=10. Observe how the results change as you gradually decrease deme size (N). Can you explain why this occurs using the kin selection framework? using the "group selection" framework?
11. Set G=1 and N=4. Note that under these conditions, offspring of individuals colonizing a group do not interact. Explain the result using the logic of group or kin selection.
If you set N=2, what common biological phenomenon would you be modeling?
Lab 4 -- Multiple alleles and Multiple Loci
Evolution would be simple to understand if all populations could be modeled as 1-locus, 2-allele systems. However, this is unlikely to be a realistic assumption. Adding multiple alleles and loci to models creates several important complications to the evolution of populations -- most importantly, it is clear that the assumption that populations evolve to their highest possible fitness will not always be true.
I. Multiple alleles.
1. Choose Selection, then "Selection on a multi-allelic locus" from the Populus menu. Read the intro screens carefully, considering the following points (ask questions if you are confused about any notation or ideas):
(a) Note that Hardy-Weinberg genotype frequencies are easily calculated for multi-allele loci. Show how they are derived for a 3-allele locus below:
(b) Remember from class that for the two-allele case, an allele will increase in frequency when its average fitness (here called "marginal fitness") is greater than the average fitness of the population. This is still true for the multi-allele case -- make sure you know how "marginal fitness" is calculated.
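To make the idea of marginal fitness in point (b) concrete, here is a minimal Python sketch that assumes random mating. The marginal fitness of allele i is the average fitness of the genotypes that carry it, weighted by the frequency of the allele it is paired with. The fitness matrix and allele frequencies below are hypothetical and are not the values from the Populus intro screens.

```python
def marginal_fitnesses(p, w):
    """Marginal fitness of each allele: w_i = sum over j of p[j] * w[i][j]."""
    n = len(p)
    return [sum(p[j] * w[i][j] for j in range(n)) for i in range(n)]

def mean_fitness(p, w):
    """Population mean fitness: sum over i of p[i] * w_i."""
    return sum(pi * wi for pi, wi in zip(p, marginal_fitnesses(p, w)))

def next_frequencies(p, w):
    """Allele frequencies after one generation of selection."""
    w_bar = mean_fitness(p, w)
    return [pi * wi / w_bar for pi, wi in zip(p, marginal_fitnesses(p, w))]

# Hypothetical symmetric fitness matrix for alleles 1, 2, 3:
# w[i][j] is the fitness of the genotype carrying alleles i and j.
w = [[1.0, 1.2, 0.9],
     [1.2, 0.5, 0.8],
     [0.9, 0.8, 1.1]]
p = [0.6, 0.3, 0.1]

w_marginal = marginal_fitnesses(p, w)
w_bar = mean_fitness(p, w)
p_next = next_frequencies(p, w)
```

An allele increases in frequency exactly when its entry in w_marginal exceeds w_bar, which is the comparison you are asked to make for the C allele in exercise 3.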
2. Read the section on pages 2-3 on equilibria, including the part describing the conditions for stable polymorphic equilibria. Go to the parameter screen and determine by trial and error whether these criteria hold. Make sure you have shown that any equilibria are stable by starting from different allele frequencies.
3. Press F2 to retrieve the intro screens. Read page 3 about the sickle cell locus, paying attention to how the fitnesses of genotypes were calculated. Does this method have any drawbacks?
Run the model with the relative fitnesses given in the intro screen. Begin with starting allele frequencies for A=.998, S = .001 and C=.001. Describe what happens below. Calculate the marginal fitness of the C allele and the average fitness of the population to demonstrate why the C allele does not evolve to higher frequencies.
4. Find the threshold frequency at which the C allele will evolve by gradually altering its starting frequencies. Repeat this exercise with S=0. Note the difference in average fitness at the equilibrium point for the two starting assumptions.
II. Selection at two loci
1. Read the introduction screens to understand the model. Then run the default conditions, noting that no selection is occurring in this population. Do gamete frequencies change? Do allele frequencies change? Why or why not? Explain below in your own words.
2. Vary the recombination fraction to determine its effect on the decline of linkage disequilibrium (D) over time. Describe below.
3. Confirm (using simulations) the following examples from the text concerning linkage disequilibrium:
(a) If loci are in linkage equilibrium, selection at one locus does not affect the second locus.
(b) If loci are in linkage disequilibrium, selection at one locus DOES affect the other.
4. Read question #2 at the end of the chapter, and then confirm that epistatic effects of loci on fitness can result in linkage disequilibrium.
5. Consult the example below for one case where linkage and epistatic effects on fitness can lead to multiple (stable and/or unstable) equilibria. Enter the fitnesses and recombination fraction from Figure 6.12 below. Run the model under different starting allele frequencies to confirm that the stable polymorphic equilibria exist. Do these equilibria depend on the magnitude of R? Vary R to see.
6. Considering the fitness surface in Figure 6.12 below, describe why, under these conditions, fitness of a population is not always maximized.
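For exercises 1 and 2 in this section (no selection), the behavior of gamete frequencies can be checked against a small Python sketch of the standard two-locus recursion under random mating. It is written independently of Populus; the function names and starting frequencies are illustrative assumptions. With recombination fraction r, linkage disequilibrium D is expected to shrink by a factor of (1 - r) each generation, while allele frequencies stay constant.

```python
def disequilibrium(x):
    """D = x_AB * x_ab - x_Ab * x_aB for gametes (AB, Ab, aB, ab)."""
    x_ab_cap, x_a_cap_b, x_a_b_cap, x_ab = x
    return x_ab_cap * x_ab - x_a_cap_b * x_a_b_cap

def next_generation(x, r):
    """Gamete frequencies after one generation with no selection."""
    d = disequilibrium(x)
    x_ab_cap, x_a_cap_b, x_a_b_cap, x_ab = x
    # Recombination converts coupling gametes to repulsion gametes
    # (or the reverse) at a rate proportional to r * D.
    return (x_ab_cap - r * d, x_a_cap_b + r * d,
            x_a_b_cap + r * d, x_ab - r * d)

def allele_frequencies(x):
    """Frequencies of allele A and allele B."""
    return (x[0] + x[1], x[0] + x[2])

def simulate(x0, r, generations):
    """Return the list of D values for generations 0..generations."""
    x = x0
    history = [disequilibrium(x)]
    for _ in range(generations):
        x = next_generation(x, r)
        history.append(disequilibrium(x))
    return history

# Start with only AB and ab gametes (maximum D = 0.25).
d_history = simulate((0.5, 0.0, 0.0, 0.5), r=0.1, generations=10)
```

Running simulate with several values of r, including r = 0.5 for unlinked loci and values near 0 for tightly linked loci, gives a quick picture of how the rate of decay depends on recombination.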
Phenotypic Plasticity Lab
In this lab, we will explore the connections between the ideas of heritability and gene-environment (GxE) interaction. The fact that environmental as well as genetic variation contributes to phenotypic variation has more interesting consequences for evolutionary change than you most likely have been exposed to in BIO 136 or elsewhere. These complications revolve around two basic ideas:
(1) Since environmental conditions vary among populations of species, genotypes will express different phenotypes in different populations (i.e., they will show phenotypic plasticity). The range of phenotypes expressed by a genotype across environments is called the norm of reaction.
(2) Genotypes may not show the same response to a change in environmental conditions, a condition known as gene-environment interaction. As we will see, this has important consequences when thinking about how selection in different populations may lead to differentiation among those populations.
I. Measuring heritability in a clonal species
Read the attached handout from BIO136, which explains the basic concept of measuring heritability in a single population using a "common garden" experiment. [Note the heritability worksheet at the end of the lab, which you will need to use in analyzing your own experiment. You can easily adapt this worksheet to an Excel spreadsheet and let the computer do all the calculations for you. See me for tips on this, if you're not familiar with using formulas in Excel.] Also read the description of the organism we will be using in this lab, the fungus Schizophyllum commune.
II. Heritability and analysis of variance
Although students in BIO136 didn't know this, they were performing a statistical analysis called analysis of variance (usually called ANOVA) when calculating heritability. ANOVA is a commonly used statistical tool for understanding how factors influence variation in some measured variable. For example, someone who did a replicated experiment to determine the influence of different levels of fertilizer on growth of a plant would use ANOVA to test whether the different levels had a significant effect on growth. This is done by comparing the mean growth for the different levels of fertilizer (called the main effect) with the variation among replicates within a single level of fertilizer (the error effect).
The following web site has an excellent summary of the principles of ANOVA which you should now read:
http://www.statsoft.com/textbook/stathome.html (or go to BIO373 Web page for link)
In the right-hand frame, click on ANOVA/MANOVA. Then in the left frame click "Basic Ideas" and read through to the end of the section called "Interaction Effects." As you read, consider the following connections to our experiment:
(1) In the 136 experiment the "main effect" was "genotype", while the "error" is equivalent to the effects of environmental variation within the common garden.
(2) The statistical significance of an ANOVA would tell you whether variation among genotypes is significant, given variation within genotypes (i.e., environmental variation). This is equivalent to testing whether the heritability value obtained is significantly different from 0.
(3) It is possible to do experiments in which two factors are simultaneously varied and the effects of each evaluated -- these are "multi-factor" ANOVAs. If we expand our common garden experiment to measure heritability by replicating it under two different environmental conditions (which we control), e.g. temperature, our ANOVA would have two main effects, genotype and temperature. The error effect in this analysis refers to the
(4) When multiple factors are tested in an experiment, it is possible that the effects of one factor may depend on the condition at another factor, a so-called "interaction" effect in the ANOVA. In our experiment, this is equivalent to asking whether genotypes show different responses to changes in the environment.
III. ANOVA vs. Norms of reaction
Read the paper by Gupta and Lewontin on reserve for Monday's class discussion (and send in three questions). Pay particular attention to their discussion of how norms of reaction (pictures of HOW different genotypes respond to environmental variation) tell us something more than ANOVA alone does.
Jargon alert! "Fisher's fundamental theorem" is a population genetic theory that predicts that the increase in fitness over 1 generation is equal to the additive genetic variance for fitness.
IV. Plasticity and norms of reaction in Schizophyllum.
The experiment you will be running over the next two weeks will consist of an investigation of norms of reaction in 7 genotypes of a wood-decomposing fungus. You and your partners will be responsible for designing, setting up, taking data and analyzing the experiment. Run your proposal for an experiment by me before starting off on it. You should plan on reporting heritabilities of traits in each environment, reaction norms and genetic correlations between the same traits measured in different environments.
Since science is rarely done 3 hours a week on Monday afternoons, you may have to make plans to take data at other times. Coordination among the members of the team is crucial! Because of this, I won't expect you to be in the lab the entire time during the next two weeks, but I'd like a progress report each week on Monday afternoon, and will of course be available for troubleshooting and advice.
You will then individually write a paper in the style of a biological journal article describing your study. It is due by March 13th at 5pm (have a nice break!).
1. Temperature and light are likely to be important environmental influences on growth and reproduction in this species. Here are the options for controlled environmental conditions:
Light at 30° and any temp > room temp
Dark at 12° , 18° , 24° , 30° , 37° , 42° , 48°
Your group should decide on what environmental condition you want to vary (temp in dark, temp in light, light/dark at one temp). Don't forget to randomize the positions of genotypes and replicates within each environment!
2. Your group may use up to 84 plates for your experiment. The growth medium is called CYM, the recipe for which is below:
2.00 g yeast extract
15.00 g agar
in: 1L distilled water
3. Transfer small plugs from the stock plate to a new plate -- these are best taken from the edge of the growing mycelium. Use sterile technique when transferring plugs from the stocks to each of your plates. Place the plug in the center of the plate, taking care not to drop mycelium on any other part of the plate. Mark the initial position of the plug on the bottom of the plate. Place plates upside-down while incubating.
Analyzing your Fungal Experiment
I will ask you to do four types of analyses of your fungal experiments:
1. Heritability calculated within each environment. For each trait you measure (remember that the same feature measured at different times can be considered a different trait), calculate a heritability value within each environment separately using the matrix approach described in class.
2. Do an analysis of variance (ANOVA) to test for significant effects of Genotype, Temperature (or Light) and their interaction. Remember the latter is a measure of the significance of Gene-Environment interaction.
Set up your data sheet in Minitab in the following way:
Genotype  Clone  Temp  Trait1  Trait2  etc.
1         1      18    23      6       ..
1         2      18    19      8       ..
.         .      .
1         6      18    30      4       ..
2         1      18    26      8       ..
After all the data are entered, choose "Balanced ANOVA" from the "ANOVA" submenu of the Stats menu. In the "Responses" box choose all the traits you want to analyze. In the "Model" box type "Genotype Temp Genotype*Temp" -- the last term asks for the interaction effect in addition to the main effects. In the "Random" box type "Genotype" -- you didn't set the levels of genotype yourself (as you did for temperature), and the assumptions of the tests of significance are different for such random factors.
Here's what the output should look like:
Analysis of Variance (Balanced Designs)

Factor     Type    Levels  Values
Genotype   random  3       4  7  88
Temp       fixed   5       18  24  30  37  42

Analysis of Variance for DiaWk1

Source          DF       SS        MS       F      P
Genotype         2    384.53    192.27    0.61  0.567
Temp             4   6325.56   1581.39    5.01  0.026
Genotype*Temp    8   2525.91    315.74   18.15  0.000
Error           30    522.00     17.40
Total           44   9758.00
This is an analysis of Montagnea arenaria (a fungus from the Namib desert). The trait is colony diameter after 1 week of growth. Three clones of each of three genotypes were grown at 5 temperatures. Note that genotype is not significant here, but temperature and the interaction effect are. How would you interpret these results?
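It can help to see where the F values in the output come from. The Python sketch below recomputes them from the mean squares (MS) listed above. The choice of denominators is inferred from the printed numbers rather than taken from the Minitab documentation: with Genotype treated as random, both main effects appear to be tested against the Genotype*Temp mean square, and the interaction against the Error mean square. The function name f_ratios is an illustrative assumption.

```python
# Mean squares copied from the Minitab output for DiaWk1.
mean_squares = {
    "Genotype": 192.27,
    "Temp": 1581.39,
    "Genotype*Temp": 315.74,
    "Error": 17.40,
}

def f_ratios(ms):
    """F ratios implied by the output above (denominators inferred)."""
    interaction = ms["Genotype*Temp"]
    return {
        # Main effects divided by the interaction mean square.
        "Genotype": ms["Genotype"] / interaction,
        "Temp": ms["Temp"] / interaction,
        # Interaction divided by the within-cell (error) mean square.
        "Genotype*Temp": interaction / ms["Error"],
    }

f_values = f_ratios(mean_squares)
```

Rounded to two decimals, these ratios reproduce the F column (0.61, 5.01 and 18.15). Because a large interaction mean square sits in the denominator for the main effects, a strong gene-environment interaction makes it harder for the main effect of genotype to appear significant.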
3. Plot norms of reaction for each trait. Here's an example of the above data:
NOTE: I didn't put standard error bars on the means in this figure (with only 3 reps/genotype they are big), but you should on your figures.
4. One of the interesting implications of crossing norms of reaction is that a trait measured in two different environments may show negative genetic correlations -- this is related to the idea raised by Gupta and Lewontin that the phenotypic rank order of genotypes can be different in different environments. To do such an analysis, pair up average phenotypic values in two environments from each genotype and do a correlation analysis. With only 3 replicates/genotype in the above data, this is a VERY weak analysis statistically, but the correlation coefficient between mean phenotype at 24° and 37° is
The goal of this lab is to become familiar with some of the assumptions and complications of phylogenetic analysis via cladistic methods. To do so, we will analyze created, simulated and real data sets using two complementary programs, MacClade and Phylogenetic Analysis Using Parsimony (or PAUP). Both are Macintosh-only programs and have a common file format (called Nexus); MacClade is available in the Science Apps folder on the Science Building Macs, and PAUP can be run directly from a folder on my Storageserver account called Systematics Lab, where you will also find data files for this lab. You should copy these data files to your own disk or Storageserver account.
A. General principles
In class I made the point that Hennig's conceptual breakthrough concerning the inference of phylogeny consisted of the recognition that overall similarity was not as good a criterion for recognizing relative closeness of ancestry as was similarity of shared, derived character states, or synapomorphies. Thus, symplesiomorphies (shared, ancestral character states) and autapomorphies (non-shared derived character states) are not phylogenetically informative. This exercise will help you understand why this is true, and how the principle of parsimony is related to this idea. However, like phenetic approaches, cladistic inference can be mistaken if characters show convergent evolution; here, we suggest one solution to the recognition of such homoplasy and thus the avoidance of errors in the estimation of phylogenetic trees.
1. Constant rates. Open the file called "Constant rates". At the bottom is a tree of the TRUE relationships of some fictional taxa, with the evolution of 10 characters mapped onto the phylogeny. At the top of the figure are boxes color-coded yellow or blue for states 0 or 1 for the 10 characters. Switch to the "Data editor" screen via the Display menu to see the relationship between boxes and a data matrix.
Using the "arrow" tool, drag the branch leading to C onto the D branch, thus creating a (false) hypothesis for the evolution of these taxa -- note what happens to the overall treelength (the number of steps or character transitions on the tree), which is shown in the box at the bottom of the screen.
Can you explain why this occurs? It will help to use the "Trace character" command (Trace menu) to trace each character on the phylogeny one at a time; click on the arrows to go forward and back through all of the characters.
Which character(s) increase in steps when the phylogeny is altered (and which do NOT)? Relate your observations to the distinction between symplesiomorphies, synapomorphies and autapomorphies.
Finally, note that when rates of character evolution are constant across all lineages, overall similarity, as well as parsimony, will recover the true phylogeny. Confirm this for yourself before closing this file.
2. Variable rates -- In this example, there has been a burst of evolutionary change in taxon B (new characters 11-14). Note that taxa A and B are no longer most similar to each other; in fact, taxon B is more similar to taxon C than to A.
Explain why the phylogeny inferred using cladistic inference is not altered by this change. It will help to consider the number of steps in characters 11-14 when the position of taxon B is altered.
3. Homoplasy -- Open the file called "Homoplasy". In this example, we presume that taxa B and D have evolved the same character states convergently. [Convergent evolution can often occur due to natural selection under the same conditions, but as we'll see below it can also occur for non-adaptive reasons.]
What is the "most parsimonious" hypothesis for the evolution of this group?
Open the file called "Lysozyme" for an example of how this can occur with molecular data. The tree shown is the most parsimonious tree based on the amino acid sequence of lysozyme, a protein involved in digestion, and proposes that Hanuman langurs (a species of Asian monkey) are more closely related to cows than they are to humans. Trace characters (amino acid positions) to discover why.
4. Homoplasy happens: a solution -- Open the file called "Homoplasy 2" -- in this example, we've made the assumption that, with the exception of the convergently evolving characters 11-14, we've gathered data on 3X as many characters as before.
What is the most parsimonious solution here? Why doesn't homoplasy have the same effect on phylogenetic inference in this example as it did in the last? How would you use this principle in the lysozyme sequence example?
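The treelength that MacClade reports in the exercises above is a count of character steps. For unordered characters this count can be obtained with Fitch's algorithm, sketched below in Python. The sketch is not MacClade's code, and the four-taxon trees and characters are hypothetical examples rather than the contents of the lab files. Trees are written as nested pairs, and each character is a dictionary giving the state of every taxon.

```python
def fitch_steps(tree, states):
    """Minimum number of state changes for one character on a tree.

    tree   -- a taxon name (str) or a pair (left_subtree, right_subtree)
    states -- dict mapping each taxon name to its character state
    """
    def visit(node):
        if isinstance(node, str):
            return {states[node]}, 0
        left_set, left_steps = visit(node[0])
        right_set, right_steps = visit(node[1])
        shared = left_set & right_set
        if shared:
            # The two descendants can agree: no extra step needed here.
            return shared, left_steps + right_steps
        # No state in common: one change must be placed below this node.
        return left_set | right_set, left_steps + right_steps + 1

    return visit(tree)[1]

def tree_length(tree, characters):
    """Total steps over a list of characters (the treelength)."""
    return sum(fitch_steps(tree, ch) for ch in characters)

# Hypothetical example: A and B are each other's closest relatives.
true_tree = (("A", "B"), ("C", "D"))
false_tree = (("A", "C"), ("B", "D"))

synapomorphy = {"A": 1, "B": 1, "C": 0, "D": 0}     # derived state shared by A and B
autapomorphy = {"A": 1, "B": 0, "C": 0, "D": 0}     # derived state unique to A
symplesiomorphy = {"A": 0, "B": 0, "C": 0, "D": 0}  # ancestral state shared by all
characters = [synapomorphy, autapomorphy, symplesiomorphy]
```

In this example the synapomorphy requires 1 step on true_tree but 2 on false_tree, while the autapomorphy requires 1 step and the symplesiomorphy 0 steps on both trees. Only the synapomorphy distinguishes between the two hypotheses, so only it changes the treelength when the branches are rearranged.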
B. When parsimony fails
Convergent evolution can occur when organisms evolve the same adaptations to similar circumstances, but it can also arise by random processes. For example, each position in a DNA sequence can have one of 4 bases; given enough time, even with low rates of substitution, it is possible that parallel substitutions from an ancestral condition can occur in independent lineages. Here, we'll use MacClade to simulate these conditions and analyze the results using PAUP. We should demonstrate a point made by Joe Felsenstein (1978): that phylogenetic inference via parsimony can, under particular circumstances, be "positively misleading," i.e., getting more data gives us the wrong answer with even greater certainty. For this reason, Felsenstein is one of the leading advocates of maximum-likelihood approaches to phylogeny estimation.
1. Simulating the data
Open the file in MacClade called "Long branches." This file has been set up with a simple "comb" phylogeny of 10 taxa, referred to by the letters A through J, and a single, constant "dummy" character. Note that in the "true" phylogeny, A is most basal, followed by B etc. The branches of this phylogeny have been assigned different numbers of "branch segments" (shown in red), which are the number of simulation "rounds" of DNA substitutions along the branch; basically, I've set it up so that the amount of time for evolution to occur from any ancestor to any two descendant taxa is the same, which is the same as saying that the rate of sequence evolution doesn't vary among branches (except by stochastic processes).
Go to "Evolve Characters" in the Edit menu. Evolve 1000 characters, with random ancestral states (with equal probability for each base). On the middle diagonal of the matrix, enter "0.9" which is the probability that the character will NOT change along one "evolve segment." Click on the lock to the lower right of the matrix, and then on the "norming button" on the top right; this should set equal probabilities for the three possible substitutions. Click on the box for considering branch segments and then the "Create" button to simulate the data.
MacClade has now added the branch lengths to the tree, which are the number of inferred substitutions for this tree and this data set. Do you notice any unexpected patterns in the branch lengths, given your knowledge of how these data were simulated? Can you explain why this pattern occurs? Save this file under a new name, and then start up PAUP.
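The substitution process set up in this step can be imitated with a short Python sketch, which may help when you think about the branch-length question above. It assumes, as in the settings described, that in each "evolve segment" a site keeps its base with probability 0.9 and otherwise changes to one of the other three bases with equal probability. The function names are illustrative, and this is not MacClade's own algorithm.

```python
import random

BASES = "ACGT"

def expected_identity(n_segments, p_stay=0.9):
    """Probability that a site shows its starting base after n segments."""
    # Each segment shrinks the excess over 1/4 by the same factor.
    decay = 1.0 - 4.0 * (1.0 - p_stay) / 3.0
    return 0.25 + 0.75 * decay ** n_segments

def evolve_site(base, n_segments, p_stay, rng):
    """Evolve one site along a branch of n_segments segments."""
    for _ in range(n_segments):
        if rng.random() >= p_stay:
            base = rng.choice([b for b in BASES if b != base])
    return base

def observed_identity(n_segments, n_sites=1000, p_stay=0.9, seed=1):
    """Simulated fraction of sites that end with their starting base."""
    rng = random.Random(seed)
    same = 0
    for _ in range(n_sites):
        start = rng.choice(BASES)
        if evolve_site(start, n_segments, p_stay, rng) == start:
            same += 1
    return same / n_sites

def actual_changes_per_site(n_segments, p_stay=0.9):
    """Expected number of substitutions that really occurred per site."""
    return n_segments * (1.0 - p_stay)
```

On long branches many sites change more than once, so the visible difference between ancestor and descendant (1 minus expected_identity) falls increasingly short of actual_changes_per_site, and identity can never drop below the 0.25 expected for random sequences.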
2. Analyzing the data
Open and execute your new simulated file in PAUP. First, analyze these data with only 250 of the simulated characters: go to "Include-Exclude Characters" in the Data menu and exclude 750 of the simulated characters.
Choose "Heuristic" from the Search menu. [Review the differences between Exhaustive, Branch-and-Bound, and Heuristic searches in your text, including when each is used.] Click on the "Random Addition" button and enter "10" for the number of replicates and a seed for the pseudorandom number generator. As the program searches for most parsimonious trees, pay attention to whether any of the replicates find non-minimal trees; the text has a good explanation for why this occurs.
When your search is finished, look at the resulting tree by choosing "Show Trees" from the "Tree" menu. If you have multiple trees, you can compute a consensus tree (a tree with only nodes common to all trees) by choosing this option in the Tree menu. You may save your trees or consensus tree as well in a separate file.
One method of determining confidence in a tree (given the data at hand) is to apply a statistical procedure called bootstrapping, which is explained well in your textbook. Do a bootstrap analysis of this data set by choosing this option in the Search menu. When you get to the "addition sequence" menu choose "simple addition" (rather than "random addition") to save time.
Is the resulting tree the same as the true phylogeny? Can you explain why discrepancies have occurred? How would you test this? Open the "Long branches" file with MacClade and generate a 250 character data set with the "diagonal probabilities" (probability of no change) set at 0.96. Analyze it in the same way with PAUP. Does this help explain why parsimony didn't retrieve the true phylogeny in your first simulation?
Does adding more data solve the problems that long branches create for phylogenetic inference? Repeat the above search and bootstrapping on the data from your first simulation, but now use the entire 1000 characters. Do you get an answer that is closer to the truth?
C. Real data
In the simulation above, we assumed all possible nucleotide substitutions were equally likely (an example of Fitch parsimony). We know, however, that for molecular data, (1) certain types of substitutions are more likely than others (e.g., transitions more likely than transversions) and (2) certain characters are more likely to change than others (e.g., third positions in codons more likely than first or second positions). Morphological characters are also likely to vary in their probability of change.
What problems do you think such variability in the frequency of character changes can create for phylogenetic inference?
In this exercise, you'll analyze a data set of mine to learn how to recognize such phenomena, and how to implement some possible solutions. [Note, however, that this is a hotly debated area within the field at present.]
1. mtDNA of insects
Open up the "Damselfly mtDNA file" with PAUP and do a heuristic search. Save your most parsimonious trees to a file.
Open the same file with MacClade to analyze character evolution on one of these trees. Start by going to the Tree window (Display menu), then opening the tree file (Tree menu). Explore the "Character steps etc." and "State changes/stasis" options in the Chart menu. Are certain types of substitutions more likely than others? Are certain types of characters more likely to change (and thus show homoplasy) than others? Would this be likely to lead to errors in phylogenetic inference? Why?
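As a reminder of how the two classes of substitution are defined, here is a small Python sketch that tallies transitions (purine to purine, or pyrimidine to pyrimidine) and transversions (purine to pyrimidine, or the reverse) between two aligned sequences. It is a simple pairwise count, not the tree-based reconstruction that MacClade charts, and the sequences are invented.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_type(a, b):
    """Classify the difference between two bases."""
    if a == b:
        return "none"
    pair = {a, b}
    if pair <= PURINES or pair <= PYRIMIDINES:
        return "transition"        # A<->G or C<->T
    return "transversion"          # purine <-> pyrimidine


def count_substitutions(seq1, seq2):
    """Tally transitions and transversions between two aligned sequences."""
    counts = {"transition": 0, "transversion": 0}
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a in "ACGT" and b in "ACGT":    # skip gaps and ambiguity codes
            kind = substitution_type(a, b)
            if kind != "none":
                counts[kind] += 1
    return counts


# Hypothetical aligned sequences from two taxa.
print(count_substitutions("ACGTAC", "GCGTTT"))
# {'transition': 2, 'transversion': 1}
```

There are twice as many possible transversions as transitions, so an excess of transitions in real data indicates that the two classes do not occur at equal rates.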
2. Character weighting -- One suggested way to avoid the problems created by such variation in the rate of character change is to use generalized parsimony (see text for good explanation of this). The basic idea behind these approaches is to give lesser weight to character changes that are more likely to be homoplasious, in effect favoring synapomorphies of evolutionarily conservative characters (or types of changes) over synapomorphies of quickly changing characters.
Go back to PAUP and redo a heuristic search, but this time give third positions less weight than first or second positions in codons (or positions in the intervening tRNA sequence): Choose "Set character weights" in the Data menu, and then assign a weight to each character set (first, second, third and tRNA). Do you get the same answer as before?
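The effect of a priori weights on which tree is preferred can be illustrated with a short Python sketch. All of the numbers are hypothetical: the step counts for the two competing trees, the character-set assignments, and the weights are invented to show the arithmetic, and they are not taken from the damselfly data.

```python
def weighted_length(steps, char_sets, weights):
    """Tree length = sum over characters of (weight of its set) x (steps on the tree)."""
    return sum(weights[char_sets[i]] * s for i, s in enumerate(steps))


# Six hypothetical characters: two codons' worth of positions.
char_sets = ["first", "second", "third", "first", "second", "third"]

# Hypothetical number of changes each character requires on two competing trees.
steps_tree_x = [2, 1, 1, 2, 1, 1]
steps_tree_y = [1, 1, 3, 1, 1, 2]

equal = {"first": 1, "second": 1, "third": 1}
downweight_third = {"first": 2, "second": 2, "third": 1}

for name, weights in [("equal", equal), ("third downweighted", downweight_third)]:
    x = weighted_length(steps_tree_x, char_sets, weights)
    y = weighted_length(steps_tree_y, char_sets, weights)
    print(name, "-> tree X:", x, " tree Y:", y)
# equal -> tree X: 8  tree Y: 9
# third downweighted -> tree X: 14  tree Y: 13
```

With equal weights tree X is shorter, but once changes at third positions count for half as much as changes elsewhere, tree Y becomes the shorter tree. This is why a weighted search can return a different answer from an unweighted one.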
Some systematists dislike the idea of assigning weights for characters a priori, preferring to let the analysis of all characters suggest which should be given more or less weight. For example, while in this data set third codon positions are on average more variable, some third codon positions are in fact quite evolutionarily conservative (confirm this yourself using MacClade). One way to incorporate such information is called "successive weighting," or "Farris weighting" after its creator. After obtaining an initial tree (or trees) using equal weights for all characters, this protocol suggests weighting characters according to a measure of how homoplasious they are on the initial tree. This measure is called the consistency index (or C.I.), which equals the minimum number of changes a character could require on any tree divided by the actual number of changes it requires on the tree in question.
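The consistency index for a single character can be worked out with the following Python sketch. It assumes a rooted, fully bifurcating tree written as nested tuples and unordered (Fitch) character states, and takes the minimum number of changes to be the number of observed states minus one. The tree, the character states, and the function names are made up for illustration.

```python
def fitch_steps(tree, states):
    """Minimum number of changes for one character on a binary tree (Fitch's algorithm)."""
    steps = 0

    def walk(node):
        nonlocal steps
        if isinstance(node, str):              # a tip: its observed state
            return {states[node]}
        left, right = (walk(child) for child in node)
        common = left & right
        if common:                             # the children can agree: no change needed
            return common
        steps += 1                             # the children disagree: one change
        return left | right

    walk(tree)
    return steps


def consistency_index(tree, states):
    """C.I. = minimum possible changes / actual changes on this tree."""
    minimum = len(set(states.values())) - 1
    actual = fitch_steps(tree, states)
    if actual == 0:
        return None                            # invariant character: C.I. undefined
    return minimum / actual


tree = ((("A", "B"), "C"), ("D", "E"))
conservative = {"A": "G", "B": "G", "C": "A", "D": "A", "E": "A"}
homoplasious = {"A": "T", "B": "C", "C": "C", "D": "T", "E": "C"}

print(consistency_index(tree, conservative))   # 1.0
print(consistency_index(tree, homoplasious))   # 0.5
```

The first character changes once and has no homoplasy (C.I. = 1.0); the second has two states but needs two changes on this tree (C.I. = 0.5). Under successive weighting as described above, the second character would therefore receive less weight in the next round of analysis.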
Run the phylogenetic analysis again after successive weighting. Do you get the same results? What are some potential criticisms of this method?
Develop with your partner a phylogenetic hypothesis for a group of organisms based on data you get from the literature, or from on-line sources. There are two different ways that you can do this assignment. The first is to decide on a group you are interested in and use the resources in the library or on-line to develop a phylogenetic hypothesis. Alternatively, you may take a data matrix from a paper you find in the literature and use it without adding any of "your own" characters.
After working with your partner on data analysis, each of you should write a paper (individually) describing your findings. It should be approximately 3-5 pages long (not including figures) and is due on Monday, April 13, by 1 pm (10 pts. off per 24 hours late). It should have the following structure:
Like a regular introduction, this should provide context for the study and justify why it is interesting. Make sure you describe the group in general, and how it is related to other groups of the same rank (e.g., if you are working on ducks, briefly describe how ducks are thought to be related to other families of birds).
Briefly describe the origins of your data and how you analyzed it.
Describe the results of your phylogenetic analysis. If you worked on morphological data, present a data matrix of the characters and taxa in the paper (if you use a molecular data set, do not include the matrix in your paper). Describe the characters (or gene sequences) and make sure you cite your sources. In addition you should hand in a diskette containing a MacClade/PAUP file with your data matrix and trees on it. Describe any ambiguities that arose during your analysis or any particularly surprising results. Show any character reconstructions that you will discuss in your final section.
Discuss the implications of your results. In particular, discuss any character reconstructions or phylogenetic patterns that increase your understanding of the group's ecological features or evolutionary history.
If you've taken the option to analyze a "canned" data set, I expect you to do more than a book report of your source paper. Your paper should convince me that you've learned something new by manipulating the data matrix with MacClade. For this reason, I will ask you to hand in a copy of the paper from which any "canned" matrix has been taken.
Good sources of data for phylogenetic analysis:
Journals -- Systematic Botany, Evolution, American Naturalist, American Zoologist, Copeia (for Ichs and Herps), Systematic Biology (in the lab), Molecular Biology and Evolution (in the lab).
On-line resources -- Home pages for the journal Systematic Biology have references and downloadable Nexus data files. GenBank is the best source for amino acid and DNA sequence data. There are links to these resources on the class home page.
For DNA sequence data, you will need to download each sequence ("Save as . . .") and line up your sequences. There is a program "SeqPUP" in the Biology folder of the Science Apps that can read your downloaded sequences, which then can be edited and saved as a PAUP/MacClade file. [This program is a little buggy, so see me for help.]