How does allele frequency affect genotype frequency

Therefore, to allow for uncertainty in the designation of the minor allele, the likelihood function can be modified (Equation 3). Since the likelihood can be very small with big data sets, the conditional log-likelihoods are used instead: order the three conditional log-likelihoods as l1, l2, l3, where l1 is the largest one. In association studies, SNPs showing significant differences in allele frequency between cases and controls are said to be associated with the phenotype of interest.
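Ordering the three conditional log-likelihoods so that l1 is the largest is the usual first step of the log-sum-exp trick, which avoids numerical underflow when very small likelihoods are added. The following is a minimal Python sketch under that assumption; the function name is hypothetical, and it assumes that the total likelihood is the plain sum of the three conditional likelihoods (any weights are taken to be already included in the log terms).

```python
import math

def log_sum_exp3(log_liks):
    """Return log(exp(a) + exp(b) + exp(c)) for three log-likelihoods.

    Hypothetical helper: the largest term l1 is factored out so that the
    remaining exponentials are at most 1 and cannot overflow.
    """
    l1, l2, l3 = sorted(log_liks, reverse=True)  # l1 is the largest one
    if l1 == float("-inf"):
        return l1  # all three likelihoods are zero
    return l1 + math.log1p(math.exp(l2 - l1) + math.exp(l3 - l1))

# Direct exponentiation of -5000 would underflow to 0; this stays finite.
total = log_sum_exp3([-5000.0, -5001.0, -5002.0])
```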

Association mapping can be performed using data from next-generation sequencing studies. We first discuss approaches that call individual genotypes and then test for association using the called genotypes. In this approach, a genotype is first called for each individual. The genotypes can be filtered or unfiltered. The called genotypes are then counted in cases and controls, which leads to the well-known likelihood ratio test for independence, the G-test.
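For reference, the G-statistic for a contingency table has the standard form G = 2 Σ O ln(O / E), where O are observed counts and E are the counts expected under independence. The sketch below applies it to a 2×2 table of allele counts. It is an illustration only: the table layout and function names are assumptions, and the p-value uses the usual asymptotic chi-square distribution with one degree of freedom.

```python
import math

def g_statistic(table):
    """G = 2 * sum(O * ln(O / E)) for a table of counts.

    Assumed layout: [[minor_cases, major_cases],
                     [minor_controls, major_controls]]
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    g = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            if observed > 0:  # a zero cell contributes nothing to the sum
                g += observed * math.log(observed / expected)
    return 2.0 * g

def chi2_1df_pvalue(x):
    """Upper-tail probability of a chi-square variable with 1 df."""
    return math.erfc(math.sqrt(x / 2.0))

g = g_statistic([[30, 70], [10, 90]])
p = chi2_1df_pvalue(g)
```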

The well-known Pearson's chi-square test is asymptotically equivalent to the G-test. However, in our studies, we construct the G-statistic using "called" genotypes, so Hardy-Weinberg equilibrium (HWE) may not hold because of over- and under-calling of heterozygotes. Furthermore, constructing the test statistic by counting "called" genotypes instead of "observed" genotypes likely introduces extra variability.

Therefore, the statistical theory may not be valid any more. Instead of calling genotypes, the likelihood framework allows for uncertainty in the genotypes and tests at each site j whether the allele frequency is the same between cases and controls.

Assuming that the minor (m) and major (M) alleles are known, the likelihood of the minor allele frequency can be computed as described in Equation 2, and the likelihood ratio test statistic is computed from these likelihoods.
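Equation 2 and the test statistic are not reproduced in this text, so the following is only a sketch of how such a test can be set up. It assumes that each individual contributes three genotype likelihoods, P(data | g) for g = 0, 1 or 2 copies of the minor allele, and that genotypes follow Hardy-Weinberg proportions given the minor allele frequency p. The frequency is estimated with a standard EM iteration, and the statistic compares separate case and control frequencies against a single shared frequency. All function names are hypothetical.

```python
import math

def log_lik(p, gls):
    """Log-likelihood of frequency p given per-individual genotype likelihoods."""
    total = 0.0
    for g0, g1, g2 in gls:
        total += math.log(g0 * (1 - p) ** 2 + g1 * 2 * p * (1 - p) + g2 * p ** 2)
    return total

def ml_maf(gls, iters=200, start=0.2):
    """EM estimate of the minor allele frequency (assumes HWE)."""
    eps = 1e-12
    p = start
    for _ in range(iters):
        p = min(max(p, eps), 1 - eps)  # keep the E-step well defined
        expected_minor = 0.0
        for g0, g1, g2 in gls:
            w0 = g0 * (1 - p) ** 2
            w1 = g1 * 2 * p * (1 - p)
            w2 = g2 * p ** 2
            # posterior expected number of minor alleles in this individual
            expected_minor += (w1 + 2 * w2) / (w0 + w1 + w2)
        p = expected_minor / (2 * len(gls))
    return p

def lrt_statistic(cases, controls):
    """2 * (log L under separate frequencies - log L under one frequency)."""
    pooled = cases + controls
    alt = log_lik(ml_maf(cases), cases) + log_lik(ml_maf(controls), controls)
    null = log_lik(ml_maf(pooled), pooled)
    return 2.0 * (alt - null)

# Uncertain genotypes: each tuple is (P(data|MM), P(data|Mm), P(data|mm)).
cases = [(0.2, 0.7, 0.1), (0.6, 0.3, 0.1), (0.1, 0.8, 0.1)]
controls = [(0.9, 0.1, 0.0), (0.8, 0.2, 0.0), (0.7, 0.3, 0.0)]
stat = lrt_statistic(cases, controls)
```

Because the genotype likelihoods enter the likelihood directly, no genotype is ever called; a low-confidence individual simply contributes a flatter term.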

If the minor allele is unknown, the likelihood under the null hypothesis is computed as in Equation 3, and the LRT statistic is modified accordingly. Other notations are the same as in Equation 6. For rare SNPs, the minor allele type is often not apparent. When calling genotypes, the second most common nucleotide is assumed to be the minor allele. The ML method directly incorporates uncertainty in determining the minor allele and, unless otherwise stated, results using the unknown minor allele method (Equation 3) are shown.

Note that the unknown minor allele ML method performs similarly to the known minor allele ML method, but the former performs better for very rare SNPs (Additional file 1). Figure 1 shows boxplots of the distributions of estimated MAFs using the four different approaches. However, when the depth decreases, the estimates of the MAF obtained by first calling genotypes become biased.

The reason for the upward bias is that it becomes harder to call heterozygotes, since true heterozygotes often look like sequencing errors. Therefore, more heterozygotes than minor homozygotes tend to have missing genotypes. However, the overall bias in MAF estimates from called genotypes is not always in one direction (data not shown). This pattern may seem counter-intuitive, since filtering the genotype calls would seem to decrease the probability of calling a sequencing error a heterozygote.

However, the Call F method also results in a larger amount of missing data since many homozygotes for the major allele will not be called due to sequencing errors.

Thus, in this instance, calling genotypes without filtering seems to be a better strategy than filtering genotypes when trying to estimate the MAF. At each depth, 1, sites were simulated using individuals, and at each site, an estimate of allele frequency is computed using: (1) true genotypes (True); (2) called genotypes without filtering (Call NF); (3) called genotypes with filtering (Call F); and (4) the maximum likelihood method (ML).

For more details of the estimation methods, see Methods. The results are dramatically different for the new ML method. In particular, the MSE computed based on the Call F method is much higher than those from the other methods especially when the depth decreases. The MSE of the estimates of the MAF based on the true genotypes reflects the lower limit of the MSE and is not constant across depths due to sampling variance and a finite sample size.

Using 50 individuals, the MSE approaches 0. Mean squared error (MSE; the expected squared difference between the estimated and the true allele frequency) of four different types of allele frequency estimators for different sample sizes (left and right panels) and depths of coverage (x-axis).

We next examine how the different estimation approaches performed in estimating the proportion of SNPs at different frequencies in the population (similar to the site frequency spectrum, but based on population allele frequency instead of sample frequency). Here we simulated 20, SNPs where the distribution of the true MAFs followed the standard stationary distribution for an effective population size of 10, (see Methods). Note that in practice, however, it is very difficult to distinguish a very rare SNP from a sequencing error.

Distribution of allele frequencies of SNPs simulated assuming the standard stationary distribution of allele frequencies. In particular, these methods over-estimate the proportion of low-frequency SNPs. The over-estimation of the proportion of low-frequency SNPs occurs due to confusion of sequencing errors with true heterozygotes, which results in overcalling heterozygous genotypes.

The magnitude of this inflation differs across different filtering cutoffs, but a larger cutoff does not necessarily increase or decrease the inflation. The picture is entirely different for the ML method. The estimated MAF distribution obtained from the new ML method closely follows the true distribution even with shallow depths of coverage. Here there is almost no excess of low-frequency SNPs. Thus, more reliable estimates of the frequency spectrum can be made from low-coverage data by using our likelihood approach than by using the genotype calling approaches.

We compare the performance of methods that treat inferred genotypes as true genotypes in tests of association using a G-test to our likelihood ratio test (LRT), which accounts for uncertainty in the genotypes.

We examine the distribution of the test statistic under the null hypothesis of no allele frequency difference between cases and controls. We also compare the power of the different approaches. The effect is caused by an increase in the variance, due to overcalling homozygotes as heterozygotes, in the allelic test used here for detecting association. Genotypic tests such as the Armitage trend test, which are robust to deviations from Hardy-Weinberg equilibrium, do not show a similar increase in the false positive rate (Additional file 2).

Consistent with this observation, filtering the called genotypes results in a decrease in the fraction of significant tests when using the G-test, although filtering does not completely solve the problem. Each column corresponds to a different test statistic: (1) G-statistic computed using the true genotypes (True); (2) G-statistic computed using called genotypes without filtering (Call NF); (3) G-statistic computed using called genotypes with filtering (Call F); and (4) the likelihood ratio test statistic with unknown minor allele (LRT).

The "inflation" factor [44] is shown in the upper left corner of each figure. We also generated receiver operating characteristic (ROC) curves for each of the different association tests. These curves show the power of the test at different false-positive rates. The power is computed as the fraction of simulated disease loci that have a statistic exceeding the critical value. Overall, we find that the LRT performs better than the G-test based on either genotype calling method (Figure 5).

In particular, at low depth, the G-test applied to called genotypes with filtering performs very poorly (leftmost column in Figure 5). If we compare the power of the LRT to the Armitage trend test using called genotypes, we find that the LRT also has higher power than the Armitage trend test (Additional file 3).

This suggests that if one wishes to use called genotypes, filtering them based on call confidence can result in a loss of power. Receiver operating characteristic (ROC) curves of four tests of association. For the definition of the four statistics, see the caption of Figure 4. At each false positive rate (x-axis), the corresponding critical value was computed using the empirical null distribution. The true positive rate (power; y-axis) was obtained by computing the fraction of causative sites with test statistics that exceed the critical value.
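The ROC construction described in the caption can be sketched as follows: the critical value at a given false positive rate is read off the empirical null distribution, and power is the fraction of causative sites whose statistic exceeds it. The function name and the tie handling are assumptions of this sketch, not details taken from the study.

```python
import math

def power_at_fpr(null_stats, causal_stats, fpr):
    """Return (critical value, power) at one false positive rate.

    The critical value is chosen so that at most a fraction `fpr` of the
    null statistics lie strictly above it.
    """
    ordered = sorted(null_stats, reverse=True)
    k = math.floor(fpr * len(ordered) + 1e-9)  # null statistics allowed above the cutoff
    critical = ordered[k] if k < len(ordered) else float("-inf")
    power = sum(s > critical for s in causal_stats) / len(causal_stats)
    return critical, power

# One point of the curve; sweeping fpr over a grid traces the full ROC curve.
null_stats = list(range(100))
causal_stats = [50, 90, 95, 100]
critical, power = power_at_fpr(null_stats, causal_stats, 0.05)
```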

We used the genotype likelihoods generated by the "SOAPsnp" program [32] for our inference. For more details, see Methods. Both the estimates using the ML method and the genotype calling method without filtering are highly correlated with the estimates made from the Sequenom genotype data.

However, estimates based on genotype calling with filtering show poor correspondence to the frequencies estimated from the Sequenom genotype data, especially when sequencing depth is low. Specifically, the estimated MAF from the Sequenom genotype data is Individual examination shows that in many individuals, the highly supported genotype based on the sequencing data differs from the Sequenom genotypes.

Note that there are a couple of SNPs in which the estimated MAFs from the genotype calling approach without filtering seem to better correspond to the MAFs estimated from the Sequenom genotyping than the estimates from the ML approach do.

However, individual inspection reveals there are a few individuals for which the called genotype from the sequencing data differs from the Sequenom genotype. In these cases, the errors in the called genotypes canceled, giving the appearance of better correspondence with the Sequenom genotype data. Therefore, for these SNPs, it is hard to tell which method performs best.

Estimates of allele frequency computed from individuals using next-generation sequencing data vs. Sequenom genotype data. At each site, only individuals that have both Sequenom genotype data and sequencing data were used for estimation of allele frequency.

The standardized difference for each estimate was computed from the MAFs estimated from the sequencing data and from the Sequenom genotype data, together with n, the number of individuals used for the estimation.

We next examined the distribution of MAFs computed using several approaches across a range of sequencing depths from our next-generation exome sequencing data (Figure 7). We further removed sites in which there was a significant difference (p-value less than 10⁻⁵ using a rank-sum test [43]) in the quality score of read bases between the minor and major alleles.

These sites are likely to be artificial SNPs that may occur due to incorrect mapping or unknown biases introduced during the experimental procedure. Then we classified each site into bins based on the depth of coverage. The number of SNPs in each bin is shown in Table 1. This pattern mirrors what was seen in our simulation studies (Figure 3).

Also, for the genotype calling methods, the allele frequency distribution changes dramatically as sequencing depth changes. Therefore, as discussed previously, when depth is not very high, the genotype calling methods are likely to include many false SNPs that are sequencing errors. These errors appear as an excess of low-frequency SNPs in the frequency distribution.

Distribution of the minor allele frequency estimated from the exomes of sequenced individuals. For each site, the minor allele frequency was estimated using four different methods: (1) the ML method with unknown minor allele, (2) the ML method with a known or fixed minor allele, (3) calling genotypes without filtering (Call NF), and (4) calling genotypes with filtering (Call F).

Each site is classified into bins based on the depth of coverage. For the number of SNPs that were used for this analysis, see Table 1.

Finally, we used this exome-resequencing data to simulate a case-control association study. To examine the distribution of the association test statistics under the null hypothesis, we randomly assigned individuals to a case group and the others to the control group. The inflation factor [44] is 1. Phenotypes were randomly assigned to individuals in the exome resequencing dataset such that there are cases and controls.

For each site, three statistics were computed: the G-statistic using called genotypes without filtering (Call NF), the G-statistic using called genotypes with filtering (Call F), and the LRT statistic. For display purposes, results from sites on chromosome 2 are shown. Note that the inflation factor is shown in the upper left corner of each QQ-plot.
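The text does not spell out how the inflation factor is computed. A common definition, assumed here, is the genomic-control ratio: the median of the observed test statistics divided by the median of a chi-square distribution with one degree of freedom (about 0.4549). A value near 1 indicates a well-calibrated null distribution. The function name below is hypothetical.

```python
import statistics

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364

def inflation_factor(test_statistics):
    """Median of the observed statistics over the expected null median.

    Assumes each statistic is asymptotically chi-square with 1 df
    under the null hypothesis.
    """
    return statistics.median(test_statistics) / CHI2_1DF_MEDIAN

lam = inflation_factor([0.1, 0.3, 0.4549364, 0.9, 2.5])
```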

The likelihood method discussed here is an extension of our previous approach [30], which was similar to that of Lynch [29]. We have improved this approach by allowing for uncertainty in determining which allele is the minor allele.

Additionally, the present formulation includes base-specific error rates (see Equation 8). These additions may have a practical benefit, particularly when estimating the frequencies of rarer alleles, where it may not be obvious which allele is the minor allele and where sequencing errors may have the greatest effect on frequency estimation.

Though not surprising, it is important to note that with higher sequencing coverage, the particular approach used to estimate allele frequencies does not matter as much. Thus, with high depths of coverage, the traditional and simple method of calling genotypes and then treating those genotypes as being known with certainty is still effective. The reason for this is that with such high depth, the called genotypes are likely to be accurate.

With lower depths of coverage, however, there is considerable uncertainty regarding the true genotype. Often the most likely genotype will not be the true genotype, leading to biases in estimates of allele frequency and spurious signals of association in case-control studies. In this situation, the ML method is a superior approach. In our simulations, we compared the performance of our ML approach to a relatively simple genotype calling approach (see Methods).

It is possible that more sophisticated genotype calling approaches such as SOAPsnp [32], MAQ [23], and GATK [45] may show improved performance relative to the simple genotype calling approach used here. However, many of the same trends found in our simulations, where the simple genotype calling approach was used, were also seen in the exome sequencing data where genotypes were called using SOAPsnp.

We have explored whether it is better to call genotypes with filtering or without filtering when analyzing low-coverage data. Intuitively, one would expect that if there was uncertainty in the genotypes, it would be better to call genotypes only if one was very confident in that genotype and treat the other less confident genotypes as missing data. However, as discussed by Johnson et al.

Our simulations and analyses of real data show that for estimating allele frequencies, genotype calling methods perform better without any filtering because filtering creates a strong upward bias in the frequency estimates. For association studies, it is not always clear whether it is better to filter the genotypes. Not filtering can result in an excess of false-positive results for allelic-based tests, but filtering can result in a decrease in power.

Studies have suggested that genotype calling approaches that use LD information to call genotypes [21, 36] may result in more accurate inferences from low-coverage data. However, it is unclear whether using population genetic characteristics of the data, like LD patterns, to call genotypes biases downstream population genetic and evolutionary analyses.

Such an evaluation is beyond the scope of the present work. However, this is not a concern for our method to estimate allele frequencies because our approach does not use any LD information. As currently implemented our method does not tackle the problem of SNP calling itself. Such an approach is the subject of ongoing research.

The equation p² + 2pq + q² = 1, where p and q are the relative frequencies of the two alleles of a gene, is known as the Hardy-Weinberg equation, and it defines a population in which relative allele frequencies do not change over successive generations.

Such a population is said to be in equilibrium. This state of equilibrium represented by the Hardy-Weinberg equation is an ideal model against which to compare observed changes in relative allele and genotype frequencies in natural populations. The Hardy-Weinberg equation describes a population at equilibrium. This can only occur in the absence of disturbing factors and when mating between individuals is completely random. When mating is random in a large population, both the relative genotype and allele frequencies will remain constant.
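To make the link between allele frequency and genotype frequency concrete: for a gene with two alleles at relative frequencies p and q = 1 - p, the Hardy-Weinberg equation p² + 2pq + q² = 1 gives the expected relative genotype frequencies under random mating. A minimal sketch follows; the allele labels A and a and the function name are illustrative choices.

```python
def hardy_weinberg(p):
    """Expected genotype frequencies for allele A at relative frequency p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("allele frequency must lie between 0 and 1")
    q = 1.0 - p
    return {"AA": p * p, "Aa": 2 * p * q, "aa": q * q}

# With p = 0.7: AA = 0.49, Aa = 0.42, aa = 0.09 (up to rounding error)
expected = hardy_weinberg(0.7)
```

Note that heterozygotes are most common when p = q = 0.5, and that a rare allele is found almost entirely in heterozygotes, since q² shrinks much faster than 2pq.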

Hardy-Weinberg equilibrium in a population can be disturbed by a number of forces, including mutations, nonrandom mating, migration and genetic drift (random changes in allele frequencies from one generation to the next). These forces drive evolutionary change because they add to or take away from the relative allele frequencies in a population.

For instance, mutations can disrupt the equilibrium of relative allele frequencies by introducing new alleles into a population. Nonrandom mating can influence relative genotype frequencies within the mating group, because mate choice of the parents can cause a bias toward certain combinations of alleles among their progeny. Migration causes a phenomenon called gene flow that occurs when breeding between two populations leads to the transfer of alleles into a new population, thereby altering the equilibrium of relative allele frequencies.

Genetic drift, which typically occurs at a higher rate in small populations, takes place when relative allele frequencies increase or decrease by chance. Since all of these disruptive forces commonly occur in nature, Hardy-Weinberg equilibrium rarely holds exactly. Typically, populations can exist in equilibrium for short periods of time, but rarely stay there in perpetuity. Therefore, Hardy-Weinberg equilibrium describes an idealized state of a population, and genetic variations in nature can be measured as changes from this ideal.
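Genetic drift can be illustrated with a small simulation. The sketch below uses the Wright-Fisher model, a standard textbook idealization that is not named in the text: each generation, the 2N allele copies of a diploid population of size N are drawn at random from the previous generation's allele frequency. The function name and parameters are illustrative.

```python
import random

def wright_fisher(p0, n_individuals, generations, seed=0):
    """Simulate the frequency of one allele under drift alone.

    Returns the list of frequencies, starting with p0.
    """
    rng = random.Random(seed)  # seeded for reproducibility
    copies = 2 * n_individuals
    p = p0
    trajectory = [p]
    for _generation in range(generations):
        # each allele copy in the next generation is sampled independently
        count = sum(1 for _copy in range(copies) if rng.random() < p)
        p = count / copies
        trajectory.append(p)
    return trajectory

small = wright_fisher(0.5, 10, 50, seed=1)    # fluctuates strongly
large = wright_fisher(0.5, 1000, 50, seed=1)  # stays close to 0.5
```

Comparing the two trajectories shows why drift matters most in small populations: the per-generation sampling variance is p(1 - p) / (2N).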

The Hardy-Weinberg equation is therefore a tool for measuring real genetic variation in a population over time. Relative genotype frequency and relative allele frequency are the most important measures of genetic variation. Relative genotype frequency is the percentage of individuals in a population that have a specific genotype.

The relative genotype frequencies show the distribution of genetic variation in a population. Relative allele frequency is the percentage of all copies of a certain gene in a population that carry a specific allele. This is an accurate measurement of the amount of genetic variation in a population. Remember the Punnett square? The possible combinations of alleles can be represented mathematically by the Hardy-Weinberg equation.
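The two definitions above translate directly into counts. In a diploid population each individual carries two copies of the gene, so an AA individual contributes two copies of allele A and an Aa individual contributes one. A minimal sketch, with illustrative allele labels and a hypothetical function name:

```python
def relative_frequencies(n_AA, n_Aa, n_aa):
    """Relative genotype and allele frequencies from genotype counts."""
    total = n_AA + n_Aa + n_aa
    if total == 0:
        raise ValueError("at least one individual is required")
    genotype = {"AA": n_AA / total, "Aa": n_Aa / total, "aa": n_aa / total}
    freq_A = (2 * n_AA + n_Aa) / (2 * total)  # copies of A over all gene copies
    allele = {"A": freq_A, "a": 1.0 - freq_A}
    return genotype, allele

# 36 AA, 48 Aa and 16 aa individuals: allele A has relative frequency 0.6
genotype, allele = relative_frequencies(36, 48, 16)
```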

Principal component analysis (PCA; Figure 5A) revealed that the random variation between replicate samples from the same population was at least as large as the variation between samples from different populations.

When allele frequencies were averaged across the three replicates, this random variation appeared to be reduced (Figure 5B), indicating that a large part of the variation between samples was random.

Figure 5. Genetic differentiation between survivor populations sampled from four pure stand plots (Ps, red clover only) and four mixed stand plots (Ms, red clover growing in mixture with white clover, perennial ryegrass and tall fescue), sown at high (H) or low (L) seeding density (study 3).

MAF for Ps H populations was the average of the three replicate samples. The largest differences were between populations belonging to different stand types. There was also a difference between Ps populations sown at different seeding densities, possibly only detectable in Ps due to the higher number of individuals genotyped in Ps H populations.

The difference in allele frequency between the average Ps population and the average Ms population for the 11 SNPs ranged from 0. Thus, the differentiation between survival populations in this study was larger than the differentiation between the original population and survivor populations in study 1. Thirty-three of these were among those 42 that had been identified as being affected by stand type.

The 11 SNPs identified with the simple FST-based method were also identified in both BayeScan analyses, and were among those for which there was a significant effect of stand type. Table 4. Single nucleotide polymorphisms (SNPs) with different allele frequencies in red clover populations 2.

The identified SNPs under selection were spread across all seven red clover chromosomes, and were in some cases closely or moderately linked (Figure 3). We took a closer look at the chromosomal regions around the SNPs with the largest allele frequency differences between populations (further details are found in Supplementary Tables S2 and S4).

The SNP with the largest allele frequency difference, a difference of 0. This SNP was not located in a known gene. There were also several SNPs located toward the distal end of chromosome 4. The former one was located near an annexin and the latter one was located in an oxygenase and close to a transcription factor.

It was located in a stress-induced phosphoprotein and close to a syntaxin. Further up on chromosome 2 there was a region with many SNPs with moderate allele frequency differences between populations. This SNP was located in one of three adjacent membrane transport protein-like genes. Several statistical methods have been developed to scan large numbers of loci across many individuals and link patterns of genetic variation to environmental variation Holderegger et al.

These methods identify outlier loci, that is, loci with stronger differentiation in allele frequencies between populations than can be expected to occur due to random processes only, and which are therefore assumed to have been under selection.

Statistically significant associations between genetic variation in outlier loci and variation in environmental variables indicate a role of the outlier loci in local adaptation. Adaptive outlier loci may represent new beneficial mutations that have increased in frequency and eventually become fixed in the population (hard sweeps).

Alternatively, outlier loci represent alleles or haplotypes that have increased in frequency, but where some polymorphism is maintained (soft sweeps; Barrett and Schluter). Soft sweeps can occur when selection on standing variation acts on multiple haplotypes in the genome simultaneously.

Studies of local adaptation usually compare populations that have been exposed to contrasting conditions over many generations, and, in spite of migration, have evolved through repeated cycles of recombination and selection e.

In some cases, such studies include replicates of populations that have started out from a common pool and been exposed to the same conditions; these replicates can be used to separate consistent signs of selection from random changes like genetic drift (Wiberg et al.). This allows for the use of a simple FST-based test of changes in allele frequencies resulting from selection.
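The exact FST estimator used in the study is not given in this text. As an illustration of the quantity itself, the sketch below computes a basic FST for one biallelic SNP in two equally weighted populations as (HT - HS) / HT, where HS is the mean expected heterozygosity within populations and HT is the expected heterozygosity at the mean allele frequency. The function name is hypothetical, and no correction for sample size is applied.

```python
def fst_two_populations(p1, p2):
    """Basic FST for one biallelic SNP from two allele frequencies."""
    p_mean = (p1 + p2) / 2.0
    h_total = 2.0 * p_mean * (1.0 - p_mean)
    h_within = (2.0 * p1 * (1.0 - p1) + 2.0 * p2 * (1.0 - p2)) / 2.0
    if h_total == 0.0:
        return 0.0  # the SNP is monomorphic in both populations
    return (h_total - h_within) / h_total

# Original population vs. one survivor population at a single SNP
fst = fst_two_populations(0.2, 0.4)
```

With replicate survivor populations, a locus under selection is expected to show a consistently elevated value across replicates, whereas drift and sampling error shift frequencies in random directions.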

In contrast, in study 3, where a higher number of individuals were pooled in each population sample and the differentiation between populations was larger than in study 1, BayeScan identified more potential outliers than the simple FST-based method. In study 3, all outliers identified by the simple FST-based method were included among those identified by BayeScan. In order to be able to detect all loci with differences in allele frequency, it is necessary to have sufficient coverage of the genome.

Red clover has a relatively small genome approximately Mb , facilitating good read depth relative to the sequencing effort, but varieties tend to have limited LD. The LD along the different chromosomes in the original population studied here has previously been characterized by De Vega et al.

At Kb LD had decayed completely to background levels R 2 0. The likelihood of detecting a locus with significantly different allele frequency in different populations depends on the magnitude of the allele frequency difference, the distance between the gene conferring the effect on survival and a linked SNP, and the LD in that specific region.

Here, we obtained an average density of one SNP per 85 kb or 37 kb in study 1 and in study 3, respectively. The studied variety is a synthetic population with several possible haplotypes at any given chromosomal segment, thus all nearby SNPs might not necessarily be diagnostic, that is, distinguish between alleles with different effects on survival.

Therefore, with the SNP densities obtained in our study, we are likely to pick up a substantial number of the loci affecting survival, but not all, particularly not in study 1.

Pooling of individual DNA samples, or of individual leaf samples prior to DNA extraction, can increase the allele frequency information obtained per sequencing effort, and allow for comparison of a large number of populations (Turner et al.). While sequencing of individuals requires a certain read depth in order to call SNPs and distinguish between homozygotes and heterozygotes, sequencing pools requires an even higher read depth for allele frequencies to be estimated accurately.
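Why pools need high read depth can be seen from a simple two-stage sampling model, used here as an illustrative assumption rather than as the study's own analysis: the 2n chromosomes in the pool are a binomial sample from the population, and each of d reads is then drawn independently from those chromosomes, with equal contribution per individual and no sequencing error. Under this model the variance of the estimated allele frequency is p(1 - p) [1/(2n) + (1 - 1/(2n))/d].

```python
def pooled_estimate_variance(p, n_individuals, depth):
    """Variance of a pooled allele frequency estimate under two-stage sampling.

    Assumes equal DNA contribution per individual, independent reads and
    no sequencing error (all simplifications).
    """
    chromosomes = 2 * n_individuals
    individual_term = 1.0 / chromosomes           # sampling of individuals
    read_term = (1.0 - 1.0 / chromosomes) / depth  # sampling of reads
    return p * (1.0 - p) * (individual_term + read_term)

# For a pool of 88 individuals, the read-sampling term dominates at low depth.
low_depth = pooled_estimate_variance(0.3, 88, 20)
high_depth = pooled_estimate_variance(0.3, 88, 2000)
```

As depth grows, the variance approaches p(1 - p) / (2n), the value obtained from error-free genotyping of the same individuals.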

Moreover, information about haplotypes and population structure is lost when sequencing pools. In our study, a very good correlation was obtained between allele frequencies obtained from a DNA pool of 88 individuals and allele frequencies obtained from genotyping of individuals (Figure 4 and Table 3). Read depth was increased only 7 times in the pool relative to the 88 individual samples.

At the same MAF and read depth range, pooling of leaves of plants prior to DNA extraction led to an average correlation of 0.

This is slightly lower than that reported by Byrne et al. Pooling of individual leaf samples prior to DNA extraction reduces costs, but the accuracy of the allele frequency estimates is also reduced.

Estimates could possibly have been improved if we had used more uniform leaf material and taken more care in sampling equal amounts of tissue from each individual. However, the use of several replicate populations compensates to some extent for the reduced accuracy of allele frequency estimates.

The replicate samples from two of the populations in study 3 showed that there was considerable sampling error in our method. The 88 plants in the original population sample represent the sown populations, while the survivor populations represent subsets remaining in each plot after selection (survival) during 2. Such selection within one generation represents the environmental flexibility that the genetic variation within populations of outcrossing species can provide (Charles; Crossley and Bradshaw). Some alleles may contribute to yield in some environments, while other alleles contribute in other environments, making the population or cultivar robust to environmental variation.

Our analyses of the genetic variation in the survivor populations as compared to the original population that was sown (study 1) showed that the survivor populations in four different plots had diverged from the original population in different directions. Thus, although the first PC-axis separated the two harvesting regimes (Figure 1), most of the allele frequency variation was random.

This may reflect a response to unintended variation in the environment among plots, random selection of alleles at the majority of loci, or sampling error. The original population had only a very weak genetic structure, which remained in the survivor populations, indicating that there was no selection acting on the structure (Figure 2).

In study 3, the first PC-axis separated Ps from Ms, and within Ps it separated the two seeding densities, suggesting that differential selection had occurred due to the different treatments (Figure 5). If the original population has high genetic diversity and low LD (typical of forage cultivars), it cannot be expected that selection acting on a relatively limited number of loci will affect average genetic distance measured across the genome.

In order to identify such selection, each individual locus must be considered. Indeed, by looking for allelic shifts of individual SNPs in several replicate survivor populations, we identified loci that had been systematically selected under the prevailing conditions in the investigated field experiment (Figure 3).

These are candidate loci for establishment success or persistence. In study 1, 12 SNPs, representing 11 loci, had significantly altered allele frequencies, measured as FST, in Ps survivor populations (high seeding rate) relative to the original population.

These SNPs represent loci with alleles conferring a higher likelihood for survival under the conditions that are common to all four plots. They may be related to, e. The absolute average allele frequency changes detected ranged from 0.

It is located in the middle of the proximal half of Tp3. Interestingly, this is also the approximate location of the only QTL for persistence detected in a red clover mapping population by Herrmann et al. In study 3, survivor populations were not compared with the originally sown population. Instead, survivors from Ps populations were compared to survivors from Ms populations, and survivors from populations sown at high seeding density were compared to survivors from populations sown at low density.

A number of loci with allele frequencies indicating differential selection in Ps and Ms were identified. The absolute allele frequency changes detected were up to 0. Red clover in mixture with perennial ryegrass and tall fescue experiences earlier competition for light and possibly other resources, as the grasses grow and elongate earlier in the summer. Indeed, we have previously shown that offspring of survivor populations from Ms have earlier stem elongation than offspring of survivor populations from Ps (Ergon and Bakken), suggesting differential selection for earliness.

Later in the summer, red clover plants are likely to experience stronger competition in Ps than in Ms, as individual red clover plants grow very large. Another condition that may vary between Ps and Ms is a stronger dependence of red clover plants on nitrogen fixation in Ms, as grasses have a more efficient nitrogen uptake and less is left for the clover.

Breeding, variety testing and seed multiplication of red clover occur in Ps. Although the seeding rates used are usually much lower (2-4 kg ha⁻¹) than those in our experiment, our results suggest that unintended selection occurring in Ps during breeding and seed multiplication may not necessarily be in favor of good persistence in practical farming, where Ms are used.

Making use of replicate populations and a simple FST-based test, it was possible to identify loci that had been under selection within one generation in a red clover variety grown in a field experiment for two and a half years.

Pooling of individual DNA samples or leaf samples before sequencing and estimation of allele frequencies reduces costs substantially, allowing analysis of multiple populations and treatments simultaneously.

Sampling error must be controlled, e. Characterization of genomic changes in survival experiments may be utilized in identification of genomic regions, genes and alleles conferring survival in red clover and other species under various environmental conditions, which again can be utilized in breeding.

In addition to identifying loci associated with survival under the conditions prevailing in our field experiment, we have shown that there is differential selection occurring in pure stands of red clover as compared to red clover growing in species mixtures, suggesting that the use of pure stands in breeding might not identify the best genotypes for development of varieties to be used in species mixtures. OR initiated the research. All authors corrected and approved the final version.

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Abberton, M. Progress in breeding perennial clovers for temperate agriculture.
Annicchiarico, P. Achievements and challenges in improving temperate perennial forage legumes. Plant Sci.
Effect of selection under cultivation on morphological traits and yield of ladino white clover landraces. Crop Evol.
Barrett, R. Adaptation from standing genetic variation. Trends Ecol.
Boller, B. Boller, et al.


