Saturday, November 18, 2017

I took a Y-DNA test. Should I upgrade my Y-STRs?


If you've taken a Y-chromosome DNA test from Family Tree DNA, your first test showed the results of Short Tandem Repeats (STRs).  Below are the results of a 37-marker test:

STR results


These results are meaningless on their own; they are only useful when compared to the results of others.  So your list of Y-DNA matches becomes critical. Are 37 markers enough to tell us what we need, and is there any point in upgrading?


How Family Tree DNA determines matches


In addition to your STR results, Family Tree DNA will give you a list of men whose Y-DNA results match yours. These matches are based on the number of markers tested and on genetic distance. Genetic distance is the number of differences, or mutations, between two sets of test results.

If you order a 37-marker Y-STR test, FTDNA will show another man as a match to you if there is a genetic distance of 4 or less. If you test 67 markers, your match must have a genetic distance of 7 or less. At 111 markers, your match list contains people who match you at a genetic distance of 10 or less.
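The cutoffs above can be written as a small lookup. This is only a sketch of the rule as described in this post; the names `MATCH_THRESHOLDS` and `is_listed_match` are made up for illustration and are not part of any FTDNA tool.

```python
# Match thresholds quoted in the post: markers tested -> largest genetic
# distance that still appears on the match list.
MATCH_THRESHOLDS = {37: 4, 67: 7, 111: 10}


def is_listed_match(markers_tested: int, genetic_distance: int) -> bool:
    """Return True if a pair at this genetic distance would be listed."""
    if markers_tested not in MATCH_THRESHOLDS:
        raise ValueError(f"no threshold given for {markers_tested} markers")
    return genetic_distance <= MATCH_THRESHOLDS[markers_tested]


print(is_listed_match(37, 4))    # True
print(is_listed_match(37, 5))    # False
print(is_listed_match(111, 10))  # True
```

The same pair of men can therefore be on the list at one level and off it at another, which is what happens in the examples later in this post.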

Sometimes you have matches to other men with your surname. Sometimes you have no matches at all, and sometimes you have so many matches that it's difficult to determine which ones are actually related to you. We will examine these three types of results to see if it may be useful to upgrade STR markers.


Surname matches: How closely related are we?


If you have matches to other people with your surname, you want to determine how closely related they are. 

At 37 markers, my brother has 17 matches. Only people who tested 37 markers or more can appear on this match list. The first column of the match list shows the Genetic Distance (GD). Because the cutoff for 37 markers is a genetic distance of 4, anyone who has a GD of five or more at this level will not be on the list.


Y-DNA matches


The genetic distance is reported, but we can't see which markers are matching. We can only know the actual results by joining projects or by contacting each person on the list. 

In this case, my brother joined the Thompson DNA Project. Below are the results for some of the Thompsons who match him. Comparing kit 34484 to kit 38962, we can see a genetic distance of four--one mismatch each at DYS448, DYS456, DYS576, and CDY. 
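Here is a rough sketch of how a genetic distance like this can be counted from two sets of marker values. The marker names are real, but the kit labels and every repeat value below are invented for illustration. The simple step-count used here is an assumption, and FTDNA's own counting rules may differ for some markers.

```python
def genetic_distance(a: dict[str, int], b: dict[str, int]) -> int:
    """Sum of repeat-count differences over the markers both men tested."""
    shared = a.keys() & b.keys()
    return sum(abs(a[m] - b[m]) for m in shared)


def mismatched_markers(a: dict[str, int], b: dict[str, int]) -> list[str]:
    """Names of the shared markers where the two results differ."""
    return sorted(m for m in a.keys() & b.keys() if a[m] != b[m])


# Hypothetical values for two men; only five markers shown for brevity.
kit_a = {"DYS393": 13, "DYS448": 19, "DYS456": 16, "DYS576": 18, "CDY": 37}
kit_b = {"DYS393": 13, "DYS448": 20, "DYS456": 17, "DYS576": 17, "CDY": 38}

print(genetic_distance(kit_a, kit_b))    # 4
print(mismatched_markers(kit_a, kit_b))  # ['CDY', 'DYS448', 'DYS456', 'DYS576']
```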


surname project

There seem to be a few markers that are distinguishing between different groups of descendants from the common Thompson ancestor. At DYS448 four men have a 19, and the four below them have a 20.  The four men who have a 19 at DYS448 also have an 18 at DYS576, and the four men who have 20 at DYS448 have a 17 at DYS576. These markers can be considered signature markers that will allow us to help place the men within the Thompson family tree. 
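Sorting men by their signature markers can be sketched in a few lines. The DYS448 and DYS576 values below are the ones described above; the kit labels (`kit1` through `kit8`) are placeholders, not the real kit numbers.

```python
from collections import defaultdict

SIGNATURE_MARKERS = ("DYS448", "DYS576")


def group_by_signature(results: dict[str, dict[str, int]]) -> dict[tuple, list[str]]:
    """Group kits that share the same values at the signature markers."""
    groups = defaultdict(list)
    for kit, markers in results.items():
        key = tuple(markers[m] for m in SIGNATURE_MARKERS)
        groups[key].append(kit)
    return dict(groups)


# Placeholder kit labels; four men with 19/18 and four with 20/17.
results = {f"kit{i}": {"DYS448": 19, "DYS576": 18} for i in range(1, 5)}
results.update({f"kit{i}": {"DYS448": 20, "DYS576": 17} for i in range(5, 9)})

for signature, kits in group_by_signature(results).items():
    print(signature, kits)
# (19, 18) ['kit1', 'kit2', 'kit3', 'kit4']
# (20, 17) ['kit5', 'kit6', 'kit7', 'kit8']
```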

One of these men has a proven descent from Robert Thompson, who was the immigrant ancestor to Colonial Maryland. Perhaps all of these men descend from him. But the 37-marker test indicates that the men with a genetic distance of four could be more distantly related and may share a common ancestor further back than Robert Thompson. Family Tree DNA shows the likelihood of relationships in the table below:


Y-DNA relationships
Family Tree DNA Relationship Table

At a genetic distance of 4 at 37 markers, the interpretation column states "it is unlikely that you share a common ancestor in recent genealogical times (one to six generations). You may have a connection in more distant genealogical times (less than 15 generations)."

At 37 markers and a genetic distance of four, we have no way of knowing whether the relationship is closer to six or 15 generations. We could be discouraged and decide that there may be no point in spending a lot of time trying to find our common ancestor. There also may be no point in upgrading to 67 markers, where we will likely see more mismatches.

But one-by-one, we upgraded anyway.  Each person who upgraded encouraged others to do so. Here are the Thompson results for markers 38-67:


compare STR


In the panel of 38-67 markers, there is only a single mismatch in this Thompson group. Some of these men were a genetic distance of four at 37 markers. They are still a genetic distance of four at 67 markers. Looking at the Family Tree DNA Relationship Table above, we find the genetic distance of 3-4 in the Y-DNA 67 column. It now indicates, "Your degree of matching is within the range of most well-established surname lineages in Western Europe."  

Without these extra markers, we may not have tried to find the common ancestor. But with 67 markers, it appeared more likely that we may be able to find a common ancestor within the genealogical time frame, and, indeed, we did.  The common ancestor of almost all of these men has now been proven to be George Thompson, the son of Robert the immigrant. In this case more markers helped to more closely determine degrees of relationship. The upgrade to 67 markers was very important in this surname group.

When men in a surname project notice that others are upgrading their STRs, they are more likely to do the same.  In addition, when men compare STRs, there is one more important piece of information that helps encourage them to do further testing: an indication of SNP testing is included next to the STR results.

SNP testing


In the example above, seven of the men have not done any SNP testing. We know this because their haplogroup R-M269 is listed in red. A haplogroup in red is the predicted haplogroup provided by Family Tree DNA.  Haplogroups listed in green mean that these men have conducted SNP testing. Two of the men have tested positive for the SNP FGC11134. One of the men has taken the Big Y test, and as soon as this was discovered another Thompson quickly followed suit. We are now awaiting the second set of Big Y results.

Upgrading STR results not only encouraged others to upgrade their STRs, but also encouraged them to do SNP testing.


No matches


My cousin has 240 matches at 12 markers and 24 matches at 25 markers. At 37 markers, my cousin has no matches at all.


no Y-DNA matches


The number of matches has gone down at each level of testing. Of course, when he joined his surname project, he had no matches. Here is a case where testing 67 markers would appear to be futile. But again, we did it anyway:


67 markers


At 67 markers, he has one genealogically relevant match.  This man had not joined the surname project, so there would be no way to find this match without a 67-marker test.


Too many matches


In contrast to the cousin with no matches at 37 markers, another cousin has 1610.


many Y-DNA matches



To cut that list down to a more reasonable number, I upgraded the STRs to 67 markers:




The number of matches actually went up! He now has 2252 matches at 67 markers. The list contained multiple surnames, so in this case the haplogroup project was more useful than the surname project. 

Because STRs can mutate up in one generation and back down in another, STR results can indicate that people are more closely related than they actually are. This is called convergence. A match list of 2252 men (and multiple surnames) is an example of this. It is difficult to determine how closely these men are related. However, a significant number of these people have tested 111 markers. Will upgrading to 111 increase or decrease the number of people on the match list?




Upgrading to 111 markers made a big difference. It cut the match list from 2252 to 138. This is not because few of the men have tested 111 markers; it's because at 111 markers most of the men on the previous list were no longer matching. Their genetic distance at 111 markers was now greater than the threshold of 10 that FTDNA will consider to be a match. My cousin's closest match at 111 markers is a genetic distance of 7, so many of the men on the previous lists are now gone.


What does upgrading markers do?


  • Upgrading helps to more precisely determine time to the Most Recent Common Ancestor.
  • Upgrading may add new matches when someone previously had few or none.
  • Upgrading may cut down the list of matches when someone previously had too many.
  • Upgrading helps determine subgroups of men who may be related.  For example, some of the men in the Thompson project only tested 12 markers. But at 37 markers, we could see signature markers dividing these men into two groups. Upgrading them all to 111 could further differentiate the lines.
  • Upgrading STRs and conducting SNP tests encourages others to do the same.
  • Upgrading STRs provides model haplotypes that help others decide what testing to consider. Tests that have both 111 STRs and Big Y results are the most useful.


Summary


In every case above, whether the person had matches to others with his surname, no matches at all, or too many matches, upgrading STRs proved useful in finding men who may share a recent common ancestor. In some cases STRs are enough to prove lineages. But STR testing can be combined with SNP testing to prove the family tree and extend the lines even further. We will examine combining SNP results and STR results in future blog posts. In the meantime, please consider upgrading your STRs. You may find unexpected results.




Saturday, November 4, 2017

Evaluating new Big Y changes


Big Y conversion in process



I have been seeing many questions regarding changes to Family Tree DNA's Big Y results or the inability to access them. Here we will see why some people cannot access results, why others may be frustrated at their new results, and what we can do to understand what we have so far. 

I will be referring to my brother's Big Y test results as "my" results.


Background to Big Y testing



Let's review a little about how the Big Y test is processed. As you may remember, your DNA consists of two strands of DNA coiled into a double helix. The strands run in opposite directions.  One is called the forward strand, and the other is the reverse strand. The strands are connected by base pairs (bp) which are the As, Cs, Gs, and Ts that form your DNA sequence. All of these can be numbered to show their position on the chromosome.




During the testing process, your DNA is not read in one continuous stretch. Instead, your DNA is broken into random fragments. The test then reads these fragments from each end. Some fragments are read many more times than others. For example, one of your fragments may have been read two times, and another 56 times. Unfortunately, not all of the reads may give the same result. So a fragment that was read consistently many times will be reported as a high quality SNP, while one read a few times with different results will be considered to be a much less reliable SNP. 

After all the fragments are read, they must be reassembled, mapped to the human genome reference sequence, and given a precise location. Differences between your DNA results and the reference sequence are then reported. 

The human genome reference sequence is continually improving. Big Y results were formerly compared against the human genome reference sequence known as hg19 which was Build 37. 



What happened to my Big Y results?


When  FTDNA began the conversion process, all haplogroup designations were rolled back to what they were before Big Y testing.  In the image below my results are at the bottom with Cairns and another Thompson. Before the Big Y, I had tested the single SNP R-DF13, and Cairns and the other Thompson had tested positive for R-FGC11134.  




  
Our former Big Y results disappeared.



Big changes to Big Y



For the past several weeks, Family Tree DNA has been remapping all Big Y results to the most recent human genome reference sequence, hg38. This means that many of the SNP position numbers have changed. FTDNA is also adding a new Y-chromosome browser and a new matching system.

The remapping to hg38 will lead to more accurate identification of SNPs and even the discovery of new SNPs. The Y-chromosome browser will show more information about the test results, and the new matching system will lead to more accurate matches.

We will examine new SNP discoveries and the new Y-chromosome browser. We will not look at the new matching system because not all Big Y tests have been converted.



Former Big Y results


In the previous version of the Big Y results page, there were three tabs: Known SNPs, Novel Variants, and Matching. Your novel variants are newly-discovered SNPs that have not yet been seen by Family Tree DNA in any other tester. Family Tree DNA doesn't name SNPs until they are shared, so these novel variants are identified by their position number on the Y-chromosome.

In the former Big Y version I had three novel variants. The position numbers were based on the old hg19 human reference sequence.




Not all Big Y tests have been converted


My Big Y conversion is now complete. My haplogroup designation has changed, but the haplogroups for Kits N116392 and 34484 have not. All three of us have ordered the Big Y, but their conversions have not yet been completed, so they do not yet show up on my list of matches.





New SNP discoveries in Big Y Tests


On the new Big Y results page, the tab names have changed.  Instead of Known SNPs and Novel Variants, they are now called Named Variants and Unnamed Variants. The named variants are shared with others; the unnamed variants are, so far, unique to you.




When I click the Unnamed Variants tab, I can see the new Y-chromosome browser at the top. I now have six unnamed or novel variants instead of three as shown in the previous report. The position numbers have all changed. These position numbers are based on the hg38 human reference sequence:





Family Tree DNA does not list the old hg19 position numbers, so I can't tell which ones were previously reported and which are new.  So I had to use other resources to convert them.

hg38 position = hg19 position

11321844 = 13477520
11514480 = 13670156
11649109 = did not exist
12144610 = 14265316
19139783 = 21301669
56831461 = 58977608

Positions 11514480, 12144610, and 19139783 appeared in my former Novel Variants table as positions 13670156, 14265316, and 21301669. I had submitted my hg19 Big Y results to Full Genomes Corp (FGC) and to YFull. Position 11321844 did not appear in my original Big Y Novel Variants table, but it was recognized as a SNP and named by Full Genomes Corp. So these four Unnamed Variants, 11514480, 12144610, 19139783, and 11321844, appear to be genuine.

Positions 11649109 and 56831461 were not previously recognized as new SNPs by FTDNA, FGC, YFull, or any other analyst. These two need to be examined.
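The bookkeeping above is easy to get wrong by hand, so here is a small sketch of it. The position numbers are the ones listed in this post; the names `HG38_TO_HG19`, `FORMER_NOVEL_HG19`, `NAMED_BY_FGC_HG38`, and `classify` are my own labels for illustration, not anything provided by FTDNA, FGC, or YFull.

```python
# hg38 position -> hg19 position (None where no hg19 position existed).
HG38_TO_HG19 = {
    11321844: 13477520,
    11514480: 13670156,
    11649109: None,
    12144610: 14265316,
    19139783: 21301669,
    56831461: 58977608,
}

# hg19 positions from the former Novel Variants table.
FORMER_NOVEL_HG19 = {13670156, 14265316, 21301669}

# hg38 position that was recognized and named by Full Genomes Corp.
NAMED_BY_FGC_HG38 = {11321844}


def classify(hg38_pos: int) -> str:
    """Say how an Unnamed Variant was treated before the conversion."""
    if HG38_TO_HG19[hg38_pos] in FORMER_NOVEL_HG19:
        return "previously reported by FTDNA"
    if hg38_pos in NAMED_BY_FGC_HG38:
        return "named by FGC"
    return "needs examination"


for pos in HG38_TO_HG19:
    print(pos, classify(pos))
```

Running this flags only 11649109 and 56831461 as needing examination, which matches the reasoning above.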


Why do I have two new SNPs that were not previously recognized?


Position 11649109: I have submitted all of my unnamed SNPs to YSeq so that they can be verified by Sanger sequencing. YSeq accepted all but one: they rejected position 11649109 because it was located in a "high repetitive region." 

Position 56831461: FGC previously reported that in my results, this was a low quality SNP and that it had been seen in two prior scientific studies.


Mapping to the hg19 human reference sequence


Although FTDNA and FGC give all SNPs a quality rating, YFull shows the specific reason for their rating. Here is what was reported about position 56831461 in my YFull hg19 analysis:




The above screen states that position 56831461 was formerly called position 58977608. It was read 53 times in my Big Y test. 34 times I had a T reported for this position, and 19 times I had a C reported (which is the ancestral value). Because the reads disagreed so often, this SNP was rejected by all analysts.

If this position had been identified as a valid SNP by YFull, I would have been able to see it in their chromosome browser, where it would have had 53 segments aligned: 34 segments would have shown a T, and 19 would have shown the ancestral value. The chromosome browser would have appeared similar to the image below. Here the cursor is pointing to a position that had been read seven times; six of them showed an A in this position, and one showed a C.




If any of the above segments had been misaligned to the less accurate hg19 human reference sequence, the new hg38 reference sequence would show a different result.


Mapping to the hg38 human reference sequence


After the new mapping to the hg38 reference sequence, Family Tree DNA now rates position 56831461 as a high quality SNP. FTDNA's new Y-chromosome browser indicates that this position was read 18 times, and all of them were T. 
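To see why the same position looks so different under the two mappings, it helps to turn the read counts into fractions. This sketch uses only the counts reported in this post (34 T and 19 C out of 53 reads under hg19; 18 T out of 18 reads under hg38). The function names are hypothetical, and no analyst's actual quality cutoff is implied.

```python
from collections import Counter


def allele_fractions(calls: str) -> dict[str, float]:
    """Fraction of reads showing each base at a single position."""
    counts = Counter(calls)
    total = sum(counts.values())
    return {base: n / total for base, n in counts.items()}


def summarize(calls: str) -> str:
    """Format the fractions for printing, e.g. 'T: 0.64, C: 0.36'."""
    return ", ".join(f"{b}: {f:.2f}" for b, f in allele_fractions(calls).items())


hg19_reads = "T" * 34 + "C" * 19  # 53 reads in the hg19 analysis
hg38_reads = "T" * 18             # 18 reads in the hg38 browser

print(summarize(hg19_reads))  # T: 0.64, C: 0.36
print(summarize(hg38_reads))  # T: 1.00
```

Under hg19 about a third of the reads disagreed with the rest; under hg38 there are fewer reads, but they all agree.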




Perhaps some of the previous segments were more correctly mapped to a different location. We do not yet have access to the new BAM files, so we can't compare these results to the old BAM files to see why these changes occurred.

In FTDNA's new Y-chromosome browser the forward and reverse strands are color-coded. Notice also that in this browser image at position 56831423 the calls are identified with a blue color and no letter. If you click on the blue column you can see that this column is not a change from one base to another. These are insertions that appear in my sequence.


SNP may not be novel at all


If I look at position 56831461 at YBrowse.org, I see that it has been given two previous SNP names. This agrees with the FGC report that this SNP had been seen twice.  



So this SNP may not be truly novel, but is only a new SNP in the FTDNA database.


How can we verify that any new SNPs are genuine?

  • See if any of the current Unnamed Variants are shared by other testers when the Big Y conversions are finished.
  • Have your results further analyzed. For example, YFull has announced that they will convert any previously-submitted kit from the hg19 reference sequence to the hg38 reference sequence and provide a new analysis for only $15. I will definitely be ordering that.
  • Have your novel SNPs verified by Sanger sequencing. The least expensive way to do this is to submit each new SNP to YSeq using "Wish A SNP." Use the hg38 position numbers. Then order a test at YSeq for your new SNPs and submit a DNA sample.


Using the new Big Y chromosome browser for Named Variants


In the Named Variants table, the SNPs are listed in alphabetical order. In the image below the first SNP is A1207.




The Reference Column contains the ancestral value from the hg38 Human Genome Reference Sequence. The Genotype column shows my derived value.

If you click on the name of any SNP, you will be taken to the Y-chromosome browser. I clicked on A1207:




SNP A1207 is shown at position 10631919. When I click anywhere in that column a black reference box will appear to the right of the column. This box tells me that at position 10631919 the reference sequence has a G, and I have a T. It does not tell me how many times this position was read, but we can count down the column to find out. Move the scroll bar at the bottom of the chromosome browser all the way to the right to access the vertical scroll bar.




You can also zoom out in your Internet browser to see all segments at once.




We can count the number of segments in the column for Position 10631919. According to the chromosome browser this position was read 34 times, and all reads showed a T in my results. 


"Derived" vs "Mismatch"


It appears in the above chromosome browser that I have several SNPs in the same short region. I can click on any of these locations in the browser to get more information.  But if I click in the column for position 10631929 (ten positions to the right of 10631919), I notice that the Type does not say "Derived"; it says "Mismatch".  




The only column that has the notation "Derived" is the column for position 10631919. It is also the only column for which we can determine the SNP name (A1207). 

What does "Mismatch" mean?  Looking at the browser, position 10631929 sure looks like a genuine SNP, but that position is not on my list of Unnamed Variants.  So if it's a real SNP, it must be in my list of Named Variants.  Unfortunately, the chromosome browser only has the position numbers, and the Named Variants table only has the SNP names.

I wish all the tools were in one location so that this was not such a cumbersome process, but we currently need to use multiple tools for our evaluations. I can use YBrowse.org to look up known SNPs and find out more about them. YBrowse indicates that position 10631929 has been named BY23083.





Do I have SNP BY23083? When I go back to the Named Variants in my Big Y Results, I can enter BY23083 in the SNP Name Search Box.




This SNP immediately shows up in my list of Named Variants:





If I click that SNP name, I will be taken back to the Y-chromosome browser. Clicking anywhere in the 10631929 column, we see that now the Type is Derived instead of Mismatch.




When we looked at SNP A1207, position 10631919, the black information box indicated that this position was "Derived." On the screen for position 10631929 (SNP BY23083), position 10631919 (SNP A1207) is now listed as "Mismatch."




As we can see from the above browser images, only the SNP that is named at the top of each browser screen will be listed as "Derived." All others on that screen will be listed as "Mismatch."
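The labeling behavior can be summed up in a short sketch. This only models what I observed in the browser screens for these two SNPs; it is an assumption about the display rule, not FTDNA's actual code, and `column_type` and `SNP_NAMES` are names I made up.

```python
# Positions and names of the two SNPs discussed above (names from YBrowse).
SNP_NAMES = {10631919: "A1207", 10631929: "BY23083"}


def column_type(column_position: int, focal_position: int) -> str:
    """Label for a column that differs from the reference sequence.

    Only the SNP the browser screen was opened for is shown as "Derived";
    every other differing column on that screen is shown as "Mismatch".
    """
    return "Derived" if column_position == focal_position else "Mismatch"


for focal in SNP_NAMES:
    labels = {SNP_NAMES[p]: column_type(p, focal) for p in SNP_NAMES}
    print(f"Screen for {SNP_NAMES[focal]}: {labels}")
# Screen for A1207: {'A1207': 'Derived', 'BY23083': 'Mismatch'}
# Screen for BY23083: {'A1207': 'Mismatch', 'BY23083': 'Derived'}
```

So a "Mismatch" label by itself does not mean the position is not a real SNP; it may simply not be the SNP that screen was opened for.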


Evaluation of New Big Y results


It is still too early to tell the full impact of the Big Y conversion because we can't yet compare all of the people who match us, and we don't have access to the BAM files. The SNP names, hg38 position numbers, and hg19 position numbers are not fully cross-referenced, so understanding the recent changes can be frustrating.

But this change has huge potential. We should soon be discovering new SNPs, learning more about them, and finding more accurate matches. We won't see any changes to the current system until after all the results are processed, but we can make a few recommendations at a time. 


Suggested improvements to Big Y Results


Although we will have many suggestions in the near future, here are a few that can make the current results easier to use.


Unnamed Variants table:

  • Include hg38 and hg19 position numbers



Named Variants table:

  • Include the SNP name and its hg38 position



Y-chromosome browser:

  • Include SNP names and positions in the black information boxes 


Family Tree DNA has stated its commitment to making our results easier to evaluate.  I look forward to seeing how much we will soon learn!