ceyx.snp.filtering

1 Load packages and read in unfiltered vcf

This vcf comes straight out of mapping raw RADseq reads to a publicly available Ceyx reference genome and calling SNPs using Stacks (Rochette, Rivera-Colón, and Catchen 2019) (exact pipeline used for this dataset is available here).

Code

library(vcfR)
library(SNPfiltR)
library(StAMPP)
library(adegenet)
library(ggplot2)
#read vcf
v<-read.vcfR("~/Desktop/nmel.ceyx.rad/populations.snps.vcf.gz")

I will now follow the SNP filtering pipeline outlined in detail in the documents of the SNPfiltR R package (see SNPfiltR vignettes) (DeRaad 2022)

2 Implement quality filters that don’t involve missing data

This is because removing low data samples will alter percentage/quantile based missing data cutoffs, so we wait to implement those until after deciding on our final set of samples for downstream analysis

Code

#visualize distributions
hard_filter(v)

no depth cutoff provided, exploratory visualization will be generated.

no genotype quality cutoff provided, exploratory visualization will be generated.

Code

#hard filter to minimum depth of 3, and minimum genotype quality of 30
v<-hard_filter(vcfR=v, depth = 3, gq = 30)

13.84% of genotypes fall below a read depth of 3 and were converted to NA

2.59% of genotypes fall below a genotype quality of 30 and were converted to NA

***** Object of Class vcfR *****
135 samples
23234 CHROMs
179,940 variants
Object size: 659.6 Mb
61.66 percent missing data
*****        *****         *****

Use the function allele_balance() to filter for allele balance from Puritz SNP filtering tutorial “Allele balance: a number between 0 and 1 representing the ratio of reads showing the reference allele to all reads, considering only reads from individuals called as heterozygous, we expect that the allele balance in our data (for real loci) should be close to 0.5”

Code

#execute allele balance filter
v<-filter_allele_balance(v, min.ratio = .1, max.ratio = .9)

0.36% of het genotypes (0.01% of all genotypes) fall outside of 0.1 - 0.9 allele balance ratio and were converted to NA

We now want to implement a max depth filter (super high depth loci are likely multiple loci stuck together into a single paralogous locus, which we want to remove).

Code

#visualize and pick appropriate max depth cutoff
max_depth(v)

cutoff is not specified, exploratory visualization will be generated.

dashed line indicates a mean depth across all SNPs of 45.2

Code

#filter vcf by the max depth cutoff you chose
v<-max_depth(v, maxdepth = 250)

maxdepth cutoff is specified, filtered vcfR object will be returned

4.38% of SNPs were above a mean depth of 250 and were removed from the vcf

Remove SNPs from the vcfR that have become invariant following the removal of questionable genotypes above, and see how many SNPs we have left after this initial set of filters

Code

v<-min_mac(v, min.mac = 1)

24.49% of SNPs fell below a minor allele count of 1 and were removed from the VCF

Code

***** Object of Class vcfR *****
135 samples
19666 CHROMs
129,922 variants
Object size: 490.8 Mb
64.51 percent missing data
*****        *****         *****

3 Setting the missing data by sample threshold

Check out the exploratory visualizations and make decisions about which samples to keep for downstream analysis.

Code

#run function to visualize per sample missingness
miss<-missing_by_sample(v)

No popmap provided

Code

#run function to visualize per SNP missingness
by.snp<-missing_by_snp(v)

cutoff is not specified, exploratory visualizations will be generated

Picking joint bandwidth of 0.0279

Warning: Removed 135 rows containing non-finite values
(`stat_density_ridges()`).

Warning: Removed 1 rows containing missing values (`geom_point()`).

Based on these visualizations, we will want to drop the worst sequenced samples that are dragging down the rest of the dataset. Drop those samples based on an approximate missing data proportion cutoff here (this can always be revised if we end up feeling like this is too lenient or stringent later):

Code

#run function to drop samples above the threshold we want from the vcf
vcfR.trim<-missing_by_sample(v, cutoff = .8)

21 samples are above a 0.8 missing data cutoff, and were removed from VCF

Code

#remove invariant sites generated by sample trimming
vcfR.trim<-min_mac(vcfR.trim, min.mac = 1)

4.32% of SNPs fell below a minor allele count of 1 and were removed from the VCF

Code

#see what effect trimming samples had on missing data across the dataset
by.snp<-missing_by_snp(vcfR.trim)

cutoff is not specified, exploratory visualizations will be generated

Picking joint bandwidth of 0.053

4 Assess technical replicates

To investigate evidence for technical error, batch effects, or contamination, I will compare patterns of sample clustering among all samples with technical replicates still included in the dataset.

Code

#set 90% completeness per SNP threshold
test<-missing_by_snp(vcfR.trim, cutoff=.9)

cutoff is specified, filtered vcfR object will be returned

83.48% of SNPs fell below a completeness cutoff of 0.9 and were removed from the VCF

Code

#convert to genlight
gen<-vcfR2genlight(test)
#fix sample names to fit in <= 10 characters
gen@ind.names

  [1] "C_solitarius_4846"     "C_sacerdotis_5307"     "C_sacerdotis_5311"    
  [4] "C_sacerdotis_5317"     "C_solitarius_5447-8"   "C_solitarius_5565"    
  [7] "C_solitarius_6836"     "C_solitarius_6948"     "C_solitarius_7161"    
 [10] "C_solitarius_7256"     "C_solitarius_7604"     "C_solitarius_7629"    
 [13] "C_solitarius_9547"     "C_solitarius_9572"     "C_solitarius_9628"    
 [16] "C_solitarius_12865"    "C_nigromaxilla_15907"  "C_nigromaxilla_15908" 
 [19] "C_solitarius_16175"    "C_margarethae_27239"   "C_margarethae_27245"  
 [22] "C_margarethae_27254"   "C_margarethae_27261"   "C_margarethae_27264"  
 [25] "C_sacerdotis_27646"    "C_mulcatus_27674"      "C_mulcatus_27686"     
 [28] "C_mulcatus_27746"      "C_mulcatus_27839"      "C_mulcatus_27852"     
 [31] "C_solitarius_27884"    "C_sacerdotis_29529"    "C_meeki_32006"        
 [34] "C_meeki_32007"         "C_meeki_32014"         "C_meeki_32022"        
 [37] "C_meeki_32023"         "C_meeki_32024"         "C_meeki_32054"        
 [40] "C_meeki_32075"         "C_malaitae_32770"      "C_malaitae_32790"     
 [43] "C_nigromaxilla_32835"  "C_nigromaxilla_32846"  "C_nigromaxilla_32860" 
 [46] "C_collectoris_33756"   "C_collectoris_33757"   "C_collectoris_33759"  
 [49] "C_collectoris_33760"   "C_collectoris_33789"   "C_collectoris_33790"  
 [52] "C_collectoris_33822"   "C_collectoris_33832"   "C_collectoris_33842"  
 [55] "C_collectoris_33871"   "C_collectoris_33905"   "C_collectoris_33906"  
 [58] "C_meeki_34840"         "C_meeki_34848"         "C_meeki_34855"        
 [61] "C_meeki_34862"         "C_meeki_34865"         "C_meeki_34870"        
 [64] "C_meeki_5633-2"        "C_meeki_32075-2"       "C_meeki_32075-3"      
 [67] "C_gentianus_34953"     "C_gentianus_34969"     "C_gentianus_35042"    
 [70] "C_collectoris_36072"   "C_collectoris_36076"   "C_collectoris_36094"  
 [73] "C_collectoris_36095"   "C_collectoris_36096"   "C_collectoris_36105"  
 [76] "C_collectoris_36112"   "C_collectoris_36129"   "C_collectoris_36144"  
 [79] "C_collectoris_36145"   "C_collectoris_36156"   "C_collectoris_36212"  
 [82] "C_collectoris_36217"   "C_meeki_34887"         "C_collectoris_36249"  
 [85] "C_meeki_5633"          "C_dispar_5611"         "C_solitarius_5157"    
 [88] "C_solitarius_5192"     "C_solitarius_6977"     "C_solitarius_6982"    
 [91] "C_solitarius_7229"     "C_solitarius_7295"     "C_solitarius_7526"    
 [94] "C_solitarius_9539"     "C_gentianus_13530"     "C_margarethae_14484"  
 [97] "C_nigromaxilla_15880"  "C_nigromaxilla_15892"  "C_margarethae_19259"  
[100] "C_collectoris_33266"   "C_collectoris_33272"   "C_collectoris_33274"  
[103] "C_collectoris_33761"   "C_collectoris_33797"   "C_collectoris_33863"  
[106] "C_collectoris_33878"   "C_collectoris_33908"   "C_collectoris_33922"  
[109] "C_collectoris_36115"   "C_collectoris_36133"   "C_dispar_5611-2"      
[112] "C_malaitae_32790-2"    "C_collectoris_33272-2" "C_collectoris_33274-2"

Code

gen@ind.names<-gsub("C_collectoris","co", gen@ind.names)
gen@ind.names<-gsub("C_margarethae","mar", gen@ind.names)
gen@ind.names<-gsub("C_gentianus","gen", gen@ind.names)
gen@ind.names<-gsub("C_solitarius","sol", gen@ind.names)
gen@ind.names<-gsub("C_nigromaxilla","ni", gen@ind.names)
gen@ind.names<-gsub("C_malaitae","ml", gen@ind.names)
gen@ind.names<-gsub("C_meeki","me", gen@ind.names)
gen@ind.names<-gsub("C_dispar","dis", gen@ind.names)
gen@ind.names<-gsub("C_sacerdotis","sac", gen@ind.names)
gen@ind.names<-gsub("C_mulcatus","mul", gen@ind.names)
gen@ind.names

  [1] "sol_4846"   "sac_5307"   "sac_5311"   "sac_5317"   "sol_5447-8"
  [6] "sol_5565"   "sol_6836"   "sol_6948"   "sol_7161"   "sol_7256"  
 [11] "sol_7604"   "sol_7629"   "sol_9547"   "sol_9572"   "sol_9628"  
 [16] "sol_12865"  "ni_15907"   "ni_15908"   "sol_16175"  "mar_27239" 
 [21] "mar_27245"  "mar_27254"  "mar_27261"  "mar_27264"  "sac_27646" 
 [26] "mul_27674"  "mul_27686"  "mul_27746"  "mul_27839"  "mul_27852" 
 [31] "sol_27884"  "sac_29529"  "me_32006"   "me_32007"   "me_32014"  
 [36] "me_32022"   "me_32023"   "me_32024"   "me_32054"   "me_32075"  
 [41] "ml_32770"   "ml_32790"   "ni_32835"   "ni_32846"   "ni_32860"  
 [46] "co_33756"   "co_33757"   "co_33759"   "co_33760"   "co_33789"  
 [51] "co_33790"   "co_33822"   "co_33832"   "co_33842"   "co_33871"  
 [56] "co_33905"   "co_33906"   "me_34840"   "me_34848"   "me_34855"  
 [61] "me_34862"   "me_34865"   "me_34870"   "me_5633-2"  "me_32075-2"
 [66] "me_32075-3" "gen_34953"  "gen_34969"  "gen_35042"  "co_36072"  
 [71] "co_36076"   "co_36094"   "co_36095"   "co_36096"   "co_36105"  
 [76] "co_36112"   "co_36129"   "co_36144"   "co_36145"   "co_36156"  
 [81] "co_36212"   "co_36217"   "me_34887"   "co_36249"   "me_5633"   
 [86] "dis_5611"   "sol_5157"   "sol_5192"   "sol_6977"   "sol_6982"  
 [91] "sol_7229"   "sol_7295"   "sol_7526"   "sol_9539"   "gen_13530" 
 [96] "mar_14484"  "ni_15880"   "ni_15892"   "mar_19259"  "co_33266"  
[101] "co_33272"   "co_33274"   "co_33761"   "co_33797"   "co_33863"  
[106] "co_33878"   "co_33908"   "co_33922"   "co_36115"   "co_36133"  
[111] "dis_5611-2" "ml_32790-2" "co_33272-2" "co_33274-2"

Code

pop(gen)<-gen@ind.names
#assign populations (a StaMPP requirement)
gen@pop<-as.factor(gen@ind.names)
#generate pairwise divergence matrix
sample.div <- stamppNeisD(gen, pop = FALSE)
#export for splitstree
stamppPhylip(distance.mat=sample.div, file="~/Desktop/nmel.ceyx.rad/test.90.splits.txt")

#view tree
knitr::include_graphics("/Users/devonderaad/Desktop/ceyx.tree.png")

Code

#view zoomed patterns of clustering which show technical replicates tightly clustered
knitr::include_graphics("/Users/devonderaad/Desktop/tech.reps1.png")

Code

knitr::include_graphics("/Users/devonderaad/Desktop/tech.reps2.png")

Code

knitr::include_graphics("/Users/devonderaad/Desktop/tech.reps3.png")

Code

knitr::include_graphics("/Users/devonderaad/Desktop/tech.reps4.png")

I will now investigate the number of genotypes that differ between technical replicates which were derived from the same DNA extract, in this filtered, 85% complete dataset

Code

#define function to calculate degree of genotype conflict between technical replicates
assess_technical_reps<-function(vcf=NULL, rep1=NULL, rep2=NULL){
  #print percentage of called genotypes that differ between reps
  zz<-table(gsub(":.*","",(test@gt[,colnames(vcf@gt) == rep1])) == gsub(":.*","",(vcf@gt[,colnames(test@gt) == rep2])))
  print(paste0("percentage of called SNPs where genotypes differ between ", rep1," and ",rep2," = ",round(zz[1]*100/sum(zz), 3),"%."))
  #isolate the exact genotypes that conflict
  z<-gsub(":.*","",(vcf@gt[,colnames(vcf@gt) == rep1 | colnames(vcf@gt) ==   rep2]))[gsub(":.*","",(vcf@gt[,colnames(vcf@gt) == rep1])) !=   gsub(":.*","",(vcf@gt[,colnames(vcf@gt) == rep2])),]
  #print them with missing values removed
  print(paste0("Conflicting genotypes printed below"))
  z[complete.cases(z), ]
}

#assess all possible comparisons
assess_technical_reps(vcf=test, "C_collectoris_33272", "C_collectoris_33272-2")

[1] "percentage of called SNPs where genotypes differ between C_collectoris_33272 and C_collectoris_33272-2 = 0.039%."
[1] "Conflicting genotypes printed below"

     C_collectoris_33272 C_collectoris_33272-2
[1,] "0/0"               "0/1"                
[2,] "1/1"               "0/1"                
[3,] "0/1"               "0/0"                
[4,] "0/1"               "0/0"                
[5,] "0/0"               "0/1"                
[6,] "0/0"               "0/1"                
[7,] "0/1"               "0/0"                
[8,] "0/1"               "1/1"

Code

assess_technical_reps(vcf=test, "C_collectoris_33274", "C_collectoris_33274-2")

[1] "percentage of called SNPs where genotypes differ between C_collectoris_33274 and C_collectoris_33274-2 = 0.083%."
[1] "Conflicting genotypes printed below"

      C_collectoris_33274 C_collectoris_33274-2
 [1,] "0/1"               "0/0"                
 [2,] "0/0"               "0/1"                
 [3,] "0/1"               "0/0"                
 [4,] "0/1"               "0/0"                
 [5,] "0/0"               "0/1"                
 [6,] "0/1"               "0/0"                
 [7,] "0/0"               "0/1"                
 [8,] "0/1"               "0/0"                
 [9,] "0/0"               "0/1"                
[10,] "0/1"               "0/0"                
[11,] "0/0"               "0/1"                
[12,] "0/0"               "0/1"                
[13,] "0/1"               "0/0"                
[14,] "0/1"               "0/0"                
[15,] "0/1"               "0/0"                
[16,] "0/1"               "0/0"

Code

assess_technical_reps(vcf=test, "C_meeki_5633", "C_meeki_5633-2")

[1] "percentage of called SNPs where genotypes differ between C_meeki_5633 and C_meeki_5633-2 = 0.136%."
[1] "Conflicting genotypes printed below"

      C_meeki_5633-2 C_meeki_5633
 [1,] "0/0"          "0/1"       
 [2,] "0/1"          "0/0"       
 [3,] "0/1"          "0/0"       
 [4,] "0/1"          "0/0"       
 [5,] "0/1"          "1/1"       
 [6,] "0/0"          "0/1"       
 [7,] "0/1"          "0/0"       
 [8,] "0/1"          "0/0"       
 [9,] "0/0"          "0/1"       
[10,] "0/1"          "1/1"       
[11,] "0/1"          "0/0"       
[12,] "0/1"          "0/0"       
[13,] "0/1"          "0/0"       
[14,] "0/0"          "0/1"       
[15,] "1/1"          "0/1"       
[16,] "0/0"          "0/1"       
[17,] "0/0"          "0/1"       
[18,] "0/0"          "0/1"       
[19,] "0/1"          "1/1"       
[20,] "0/1"          "0/0"       
[21,] "0/0"          "0/1"       
[22,] "0/1"          "0/0"       
[23,] "0/0"          "0/1"       
[24,] "0/1"          "1/1"       
[25,] "0/1"          "0/0"       
[26,] "0/0"          "0/1"       
[27,] "0/0"          "0/1"

Code

assess_technical_reps(vcf=test, "C_meeki_32075", "C_meeki_32075-2")

[1] "percentage of called SNPs where genotypes differ between C_meeki_32075 and C_meeki_32075-2 = 0.116%."
[1] "Conflicting genotypes printed below"

      C_meeki_32075 C_meeki_32075-2
 [1,] "0/0"         "0/1"          
 [2,] "0/0"         "0/1"          
 [3,] "0/1"         "0/0"          
 [4,] "0/0"         "0/1"          
 [5,] "0/1"         "0/0"          
 [6,] "0/0"         "0/1"          
 [7,] "0/1"         "0/0"          
 [8,] "0/1"         "0/0"          
 [9,] "0/1"         "0/0"          
[10,] "0/1"         "0/0"          
[11,] "0/1"         "0/0"          
[12,] "0/1"         "0/0"          
[13,] "0/0"         "0/1"          
[14,] "1/1"         "0/1"          
[15,] "0/1"         "1/1"          
[16,] "0/1"         "0/0"          
[17,] "0/0"         "0/1"          
[18,] "0/0"         "0/1"          
[19,] "0/0"         "0/1"          
[20,] "0/0"         "0/1"          
[21,] "0/0"         "0/1"

Code

assess_technical_reps(vcf=test, "C_meeki_32075", "C_meeki_32075-3")

[1] "percentage of called SNPs where genotypes differ between C_meeki_32075 and C_meeki_32075-3 = 0.14%."
[1] "Conflicting genotypes printed below"

      C_meeki_32075 C_meeki_32075-3
 [1,] "0/0"         "0/1"          
 [2,] "0/1"         "0/0"          
 [3,] "0/1"         "0/0"          
 [4,] "0/1"         "0/0"          
 [5,] "0/1"         "0/0"          
 [6,] "0/0"         "0/1"          
 [7,] "0/1"         "1/1"          
 [8,] "0/1"         "0/0"          
 [9,] "0/1"         "0/0"          
[10,] "0/1"         "0/0"          
[11,] "0/0"         "0/1"          
[12,] "0/1"         "0/0"          
[13,] "0/0"         "0/1"          
[14,] "0/0"         "0/1"          
[15,] "0/1"         "0/0"          
[16,] "0/1"         "1/1"          
[17,] "0/0"         "0/1"          
[18,] "0/1"         "1/1"          
[19,] "0/1"         "0/0"          
[20,] "0/1"         "0/0"          
[21,] "0/0"         "0/1"          
[22,] "0/1"         "0/0"          
[23,] "0/0"         "0/1"          
[24,] "0/0"         "0/1"          
[25,] "0/0"         "0/1"

Code

assess_technical_reps(vcf=test, "C_meeki_32075-2", "C_meeki_32075-3")

[1] "percentage of called SNPs where genotypes differ between C_meeki_32075-2 and C_meeki_32075-3 = 0.12%."
[1] "Conflicting genotypes printed below"

      C_meeki_32075-2 C_meeki_32075-3
 [1,] "0/1"           "0/0"          
 [2,] "0/0"           "0/1"          
 [3,] "0/0"           "0/1"          
 [4,] "0/1"           "0/0"          
 [5,] "0/1"           "1/1"          
 [6,] "0/0"           "0/1"          
 [7,] "0/0"           "0/1"          
 [8,] "0/0"           "0/1"          
 [9,] "0/0"           "0/1"          
[10,] "0/0"           "0/1"          
[11,] "0/0"           "0/1"          
[12,] "0/1"           "0/0"          
[13,] "0/1"           "0/0"          
[14,] "0/1"           "0/0"          
[15,] "0/1"           "0/0"          
[16,] "0/0"           "0/1"          
[17,] "0/1"           "0/0"          
[18,] "0/1"           "0/0"          
[19,] "0/0"           "0/1"          
[20,] "0/0"           "0/1"

Code

assess_technical_reps(vcf=test, "C_malaitae_32790", "C_malaitae_32790-2")

[1] "percentage of called SNPs where genotypes differ between C_malaitae_32790 and C_malaitae_32790-2 = 0.12%."
[1] "Conflicting genotypes printed below"

      C_malaitae_32790 C_malaitae_32790-2
 [1,] "0/1"            "1/1"             
 [2,] "0/0"            "0/1"             
 [3,] "0/1"            "0/0"             
 [4,] "0/0"            "0/1"             
 [5,] "0/0"            "0/1"             
 [6,] "0/0"            "0/1"             
 [7,] "1/1"            "0/1"             
 [8,] "0/0"            "0/1"             
 [9,] "0/1"            "1/1"             
[10,] "0/0"            "0/1"             
[11,] "0/0"            "0/1"             
[12,] "0/1"            "0/0"             
[13,] "0/1"            "1/1"             
[14,] "0/1"            "0/0"             
[15,] "0/1"            "0/0"             
[16,] "0/0"            "0/1"             
[17,] "0/0"            "0/1"             
[18,] "0/1"            "0/0"             
[19,] "0/1"            "1/1"             
[20,] "0/1"            "0/0"             
[21,] "0/1"            "0/0"             
[22,] "0/1"            "1/1"             
[23,] "0/1"            "0/0"

Code

assess_technical_reps(vcf=test, "C_dispar_5611", "C_dispar_5611-2")

[1] "percentage of called SNPs where genotypes differ between C_dispar_5611 and C_dispar_5611-2 = 0.154%."
[1] "Conflicting genotypes printed below"

      C_dispar_5611 C_dispar_5611-2
 [1,] "0/0"         "0/1"          
 [2,] "0/1"         "0/0"          
 [3,] "0/0"         "0/1"          
 [4,] "0/0"         "0/1"          
 [5,] "0/0"         "0/1"          
 [6,] "0/1"         "0/0"          
 [7,] "0/1"         "0/0"          
 [8,] "0/0"         "0/1"          
 [9,] "0/0"         "0/1"          
[10,] "0/0"         "0/1"          
[11,] "0/1"         "0/0"          
[12,] "0/0"         "0/1"          
[13,] "0/1"         "1/1"          
[14,] "0/1"         "0/0"          
[15,] "0/1"         "0/0"          
[16,] "0/1"         "0/0"          
[17,] "0/0"         "0/1"          
[18,] "0/1"         "0/0"          
[19,] "0/1"         "0/0"          
[20,] "0/1"         "0/0"          
[21,] "0/1"         "0/0"          
[22,] "0/0"         "0/1"          
[23,] "0/0"         "0/1"          
[24,] "0/1"         "1/1"          
[25,] "0/1"         "1/1"          
[26,] "1/1"         "0/1"          
[27,] "0/1"         "0/0"          
[28,] "1/1"         "0/1"          
[29,] "0/0"         "0/1"          
[30,] "0/1"         "0/0"

These results suggest a technical error rate between .04 - .41 % based on genotype calling in technical replicates derived from the same DNA extracts. This error rate is likely highly dependent on depth of sequencing, as disagreements between called genotypes are nearly universally cases where one replicate is called heterozygous and the other is not. These cases depend greatly on the depth of sequencing to accurately recover enough reads for confident het calls. These results suggest an overall low technical error rate and no evidence for batch effects or contamination.

I will now remove the lower sequenced replicate for each technical replicate, leaving us with a dataset where each replicate comes from a unique biological sample

Code

#remove lower sequenced samples when there are multiple replicates
vcfR.trim<-vcfR.trim[,colnames(vcfR.trim@gt) != "C_collectoris_33272-2" &
      colnames(vcfR.trim@gt) != "C_collectoris_33274-2" &
       colnames(vcfR.trim@gt) != "C_meeki_5633-2" &
       colnames(vcfR.trim@gt) != "C_meeki_32075-3" &
       colnames(vcfR.trim@gt) != "C_meeki_32075-2" &
       colnames(vcfR.trim@gt) != "C_malaitae_32740-2" &
       colnames(vcfR.trim@gt) != "C_malaitae_32790" &
       colnames(vcfR.trim@gt) != "C_dispar_5611"]

#remove invariant SNPs
vcfR.trim<-min_mac(vcfR.trim, min.mac = 1)

4.08% of SNPs fell below a minor allele count of 1 and were removed from the VCF

5 Setting the missing data by SNP threshold

Now we will visualize different per SNP missing data thresholds and identify a value that optimizes the trade-off between amount of missing data and the total number of SNPs retained.

Code

#see what effect trimming samples and technical replicates has had on missing data across the dataset
by.snp<-missing_by_snp(vcfR.trim)

cutoff is not specified, exploratory visualizations will be generated

Picking joint bandwidth of 0.0526

Set a 90% per-SNP completeness cutoff which seems sufficient for a high quality dataset

Code

#set 90% completeness per SNP threshold
filt<-missing_by_snp(vcfR.trim, cutoff=.9)

cutoff is specified, filtered vcfR object will be returned

82.53% of SNPs fell below a completeness cutoff of 0.9 and were removed from the VCF

Code

bysamp<-missing_by_sample(filt)

No popmap provided

Code

filt

***** Object of Class vcfR *****
107 samples
2457 CHROMs
20,831 variants
Object size: 181.5 Mb
2.827 percent missing data
*****        *****         *****

Code

#convert to genlight
gen<-vcfR2genlight(filt)
#fix sample names to fit in <= 10 characters
gen@ind.names

  [1] "C_solitarius_4846"    "C_sacerdotis_5307"    "C_sacerdotis_5311"   
  [4] "C_sacerdotis_5317"    "C_solitarius_5447-8"  "C_solitarius_5565"   
  [7] "C_solitarius_6836"    "C_solitarius_6948"    "C_solitarius_7161"   
 [10] "C_solitarius_7256"    "C_solitarius_7604"    "C_solitarius_7629"   
 [13] "C_solitarius_9547"    "C_solitarius_9572"    "C_solitarius_9628"   
 [16] "C_solitarius_12865"   "C_nigromaxilla_15907" "C_nigromaxilla_15908"
 [19] "C_solitarius_16175"   "C_margarethae_27239"  "C_margarethae_27245" 
 [22] "C_margarethae_27254"  "C_margarethae_27261"  "C_margarethae_27264" 
 [25] "C_sacerdotis_27646"   "C_mulcatus_27674"     "C_mulcatus_27686"    
 [28] "C_mulcatus_27746"     "C_mulcatus_27839"     "C_mulcatus_27852"    
 [31] "C_solitarius_27884"   "C_sacerdotis_29529"   "C_meeki_32006"       
 [34] "C_meeki_32007"        "C_meeki_32014"        "C_meeki_32022"       
 [37] "C_meeki_32023"        "C_meeki_32024"        "C_meeki_32054"       
 [40] "C_meeki_32075"        "C_malaitae_32770"     "C_nigromaxilla_32835"
 [43] "C_nigromaxilla_32846" "C_nigromaxilla_32860" "C_collectoris_33756" 
 [46] "C_collectoris_33757"  "C_collectoris_33759"  "C_collectoris_33760" 
 [49] "C_collectoris_33789"  "C_collectoris_33790"  "C_collectoris_33822" 
 [52] "C_collectoris_33832"  "C_collectoris_33842"  "C_collectoris_33871" 
 [55] "C_collectoris_33905"  "C_collectoris_33906"  "C_meeki_34840"       
 [58] "C_meeki_34848"        "C_meeki_34855"        "C_meeki_34862"       
 [61] "C_meeki_34865"        "C_meeki_34870"        "C_gentianus_34953"   
 [64] "C_gentianus_34969"    "C_gentianus_35042"    "C_collectoris_36072" 
 [67] "C_collectoris_36076"  "C_collectoris_36094"  "C_collectoris_36095" 
 [70] "C_collectoris_36096"  "C_collectoris_36105"  "C_collectoris_36112" 
 [73] "C_collectoris_36129"  "C_collectoris_36144"  "C_collectoris_36145" 
 [76] "C_collectoris_36156"  "C_collectoris_36212"  "C_collectoris_36217" 
 [79] "C_meeki_34887"        "C_collectoris_36249"  "C_meeki_5633"        
 [82] "C_solitarius_5157"    "C_solitarius_5192"    "C_solitarius_6977"   
 [85] "C_solitarius_6982"    "C_solitarius_7229"    "C_solitarius_7295"   
 [88] "C_solitarius_7526"    "C_solitarius_9539"    "C_gentianus_13530"   
 [91] "C_margarethae_14484"  "C_nigromaxilla_15880" "C_nigromaxilla_15892"
 [94] "C_margarethae_19259"  "C_collectoris_33266"  "C_collectoris_33272" 
 [97] "C_collectoris_33274"  "C_collectoris_33761"  "C_collectoris_33797" 
[100] "C_collectoris_33863"  "C_collectoris_33878"  "C_collectoris_33908" 
[103] "C_collectoris_33922"  "C_collectoris_36115"  "C_collectoris_36133" 
[106] "C_dispar_5611-2"      "C_malaitae_32790-2"

Code

gen@ind.names<-gsub("C_collectoris","co", gen@ind.names)
gen@ind.names<-gsub("C_margarethae","mar", gen@ind.names)
gen@ind.names<-gsub("C_gentianus","gen", gen@ind.names)
gen@ind.names<-gsub("C_solitarius","sol", gen@ind.names)
gen@ind.names<-gsub("C_nigromaxilla","ni", gen@ind.names)
gen@ind.names<-gsub("C_malaitae","ml", gen@ind.names)
gen@ind.names<-gsub("C_meeki","me", gen@ind.names)
gen@ind.names<-gsub("C_dispar","dis", gen@ind.names)
gen@ind.names<-gsub("C_sacerdotis","sac", gen@ind.names)
gen@ind.names<-gsub("C_mulcatus","mul", gen@ind.names)
gen@ind.names

  [1] "sol_4846"   "sac_5307"   "sac_5311"   "sac_5317"   "sol_5447-8"
  [6] "sol_5565"   "sol_6836"   "sol_6948"   "sol_7161"   "sol_7256"  
 [11] "sol_7604"   "sol_7629"   "sol_9547"   "sol_9572"   "sol_9628"  
 [16] "sol_12865"  "ni_15907"   "ni_15908"   "sol_16175"  "mar_27239" 
 [21] "mar_27245"  "mar_27254"  "mar_27261"  "mar_27264"  "sac_27646" 
 [26] "mul_27674"  "mul_27686"  "mul_27746"  "mul_27839"  "mul_27852" 
 [31] "sol_27884"  "sac_29529"  "me_32006"   "me_32007"   "me_32014"  
 [36] "me_32022"   "me_32023"   "me_32024"   "me_32054"   "me_32075"  
 [41] "ml_32770"   "ni_32835"   "ni_32846"   "ni_32860"   "co_33756"  
 [46] "co_33757"   "co_33759"   "co_33760"   "co_33789"   "co_33790"  
 [51] "co_33822"   "co_33832"   "co_33842"   "co_33871"   "co_33905"  
 [56] "co_33906"   "me_34840"   "me_34848"   "me_34855"   "me_34862"  
 [61] "me_34865"   "me_34870"   "gen_34953"  "gen_34969"  "gen_35042" 
 [66] "co_36072"   "co_36076"   "co_36094"   "co_36095"   "co_36096"  
 [71] "co_36105"   "co_36112"   "co_36129"   "co_36144"   "co_36145"  
 [76] "co_36156"   "co_36212"   "co_36217"   "me_34887"   "co_36249"  
 [81] "me_5633"    "sol_5157"   "sol_5192"   "sol_6977"   "sol_6982"  
 [86] "sol_7229"   "sol_7295"   "sol_7526"   "sol_9539"   "gen_13530" 
 [91] "mar_14484"  "ni_15880"   "ni_15892"   "mar_19259"  "co_33266"  
 [96] "co_33272"   "co_33274"   "co_33761"   "co_33797"   "co_33863"  
[101] "co_33878"   "co_33908"   "co_33922"   "co_36115"   "co_36133"  
[106] "dis_5611-2" "ml_32790-2"

Code

pop(gen)<-gen@ind.names
#assign populations (a StaMPP requirement)
gen@pop<-as.factor(gen@ind.names)
#generate pairwise divergence matrix
sample.div <- stamppNeisD(gen, pop = FALSE)
#export for splitstree
stamppPhylip(distance.mat=sample.div, file="~/Desktop/nmel.ceyx.rad/ceyx.90.splits.txt")

#splitstree looks clean, no evidence of weird clustering driven by missing data
knitr::include_graphics("/Users/devonderaad/Desktop/ceyx.90.splits.png")

6 Remove overlapping SNPs

It is a known thing (see this) that Stacks will not merge SNPs if they are sequenced by separate (but physically overlapping) loci, and will instead retain the same SNP twice. To account for this, we will simply remove a SNP every time its chromosome and position have already been seen in the dataset, using the following code:

Code

#generate dataframe containing information for chromosome and bp locality of each SNP
df<-as.data.frame(filt@fix[,1:2])
#calc number of duplicated SNPs to remove
nrow(df) - length(unique(paste(df$CHROM,df$POS)))

[1] 17

Code

#remove duplicated SNPs
filt<-filt[!duplicated(paste(df$CHROM,df$POS)),]

7 Visualize depth and quality across all retained genotypes

Code

#plot depth per snp and per sample
dp <- extract.gt(filt, element = "DP", as.numeric=TRUE)
heatmap.bp(dp, rlabels = FALSE)

Code

#plot genotype quality per snp and per sample
gq <- extract.gt(filt, element = "GQ", as.numeric=TRUE)
heatmap.bp(gq, rlabels = FALSE)

8 linkage filter

Now filter for linkage

Code

#perform linkage filtering to get a reduced vcf with only unlinked SNPs
filt.thin<-distance_thin(filt, min.distance = 10000)


  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |                                                                      |   1%
  |                                                                            
  |=                                                                     |   1%
  |                                                                            
  |=                                                                     |   2%
  |                                                                            
  |==                                                                    |   2%
  |                                                                            
  |==                                                                    |   3%
  |                                                                            
  |==                                                                    |   4%
  |                                                                            
  |===                                                                   |   4%
  |                                                                            
  |===                                                                   |   5%
  |                                                                            
  |====                                                                  |   5%
  |                                                                            
  |====                                                                  |   6%
  |                                                                            
  |=====                                                                 |   6%
  |                                                                            
  |=====                                                                 |   7%
  |                                                                            
  |=====                                                                 |   8%
  |                                                                            
  |======                                                                |   8%
  |                                                                            
  |======                                                                |   9%
  |                                                                            
  |=======                                                               |   9%
  |                                                                            
  |=======                                                               |  10%
  |                                                                            
  |=======                                                               |  11%
  |                                                                            
  |========                                                              |  11%
  |                                                                            
  |========                                                              |  12%
  |                                                                            
  |=========                                                             |  12%
  |                                                                            
  |=========                                                             |  13%
  |                                                                            
  |=========                                                             |  14%
  |                                                                            
  |==========                                                            |  14%
  |                                                                            
  |==========                                                            |  15%
  |                                                                            
  |===========                                                           |  15%
  |                                                                            
  |===========                                                           |  16%
  |                                                                            
  |============                                                          |  16%
  |                                                                            
  |============                                                          |  17%
  |                                                                            
  |============                                                          |  18%
  |                                                                            
  |=============                                                         |  18%
  |                                                                            
  |=============                                                         |  19%
  |                                                                            
  |==============                                                        |  19%
  |                                                                            
  |==============                                                        |  20%
  |                                                                            
  |==============                                                        |  21%
  |                                                                            
  |===============                                                       |  21%
  |                                                                            
  |===============                                                       |  22%
  |                                                                            
  |================                                                      |  22%
  |                                                                            
  |================                                                      |  23%
  |                                                                            
  |================                                                      |  24%
  |                                                                            
  |=================                                                     |  24%
  |                                                                            
  |=================                                                     |  25%
  |                                                                            
  |==================                                                    |  25%
  |                                                                            
  |==================                                                    |  26%
  |                                                                            
  |===================                                                   |  26%
  |                                                                            
  |===================                                                   |  27%
  |                                                                            
  |===================                                                   |  28%
  |                                                                            
  |====================                                                  |  28%
  |                                                                            
  |====================                                                  |  29%
  |                                                                            
  |=====================                                                 |  29%
  |                                                                            
  |=====================                                                 |  30%
  |                                                                            
  |=====================                                                 |  31%
  |                                                                            
  |======================                                                |  31%
  |                                                                            
  |======================                                                |  32%
  |                                                                            
  |=======================                                               |  32%
  |                                                                            
  |=======================                                               |  33%
  |                                                                            
  |=======================                                               |  34%
  |                                                                            
  |========================                                              |  34%
  |                                                                            
  |========================                                              |  35%
  |                                                                            
  |=========================                                             |  35%
  |                                                                            
  |=========================                                             |  36%
  |                                                                            
  |==========================                                            |  36%
  |                                                                            
  |==========================                                            |  37%
  |                                                                            
  |==========================                                            |  38%
  |                                                                            
  |===========================                                           |  38%
  |                                                                            
  |===========================                                           |  39%
  |                                                                            
  |============================                                          |  39%
  |                                                                            
  |============================                                          |  40%
  |                                                                            
  |============================                                          |  41%
  |                                                                            
  |=============================                                         |  41%
  |                                                                            
  |=============================                                         |  42%
  |                                                                            
  |==============================                                        |  42%
  |                                                                            
  |==============================                                        |  43%
  |                                                                            
  |==============================                                        |  44%
  |                                                                            
  |===============================                                       |  44%
  |                                                                            
  |===============================                                       |  45%
  |                                                                            
  |================================                                      |  45%
  |                                                                            
  |================================                                      |  46%
  |                                                                            
  |=================================                                     |  46%
  |                                                                            
  |=================================                                     |  47%
  |                                                                            
  |=================================                                     |  48%
  |                                                                            
  |==================================                                    |  48%
  |                                                                            
  |==================================                                    |  49%
  |                                                                            
  |===================================                                   |  49%
  |                                                                            
  |===================================                                   |  50%
  |                                                                            
  |===================================                                   |  51%
  |                                                                            
  |====================================                                  |  51%
  |                                                                            
  |====================================                                  |  52%
  |                                                                            
  |=====================================                                 |  52%
  |                                                                            
  |=====================================                                 |  53%
  |                                                                            
  |=====================================                                 |  54%
  |                                                                            
  |======================================                                |  54%
  |                                                                            
  |======================================                                |  55%
  |                                                                            
  |=======================================                               |  55%
  |                                                                            
  |=======================================                               |  56%
  |                                                                            
  |========================================                              |  56%
  |                                                                            
  |========================================                              |  57%
  |                                                                            
  |========================================                              |  58%
  |                                                                            
  |=========================================                             |  58%
  |                                                                            
  |=========================================                             |  59%
  |                                                                            
  |==========================================                            |  59%
  |                                                                            
  |==========================================                            |  60%
  |                                                                            
  |==========================================                            |  61%
  |                                                                            
  |===========================================                           |  61%
  |                                                                            
  |===========================================                           |  62%
  |                                                                            
  |============================================                          |  62%
  |                                                                            
  |============================================                          |  63%
  |                                                                            
  |============================================                          |  64%
  |                                                                            
  |=============================================                         |  64%
  |                                                                            
  |=============================================                         |  65%
  |                                                                            
  |==============================================                        |  65%
  |                                                                            
  |==============================================                        |  66%
  |                                                                            
  |===============================================                       |  66%
  |                                                                            
  |===============================================                       |  67%
  |                                                                            
  |===============================================                       |  68%
  |                                                                            
  |================================================                      |  68%
  |                                                                            
  |================================================                      |  69%
  |                                                                            
  |=================================================                     |  69%
  |                                                                            
  |=================================================                     |  70%
  |                                                                            
  |=================================================                     |  71%
  |                                                                            
  |==================================================                    |  71%
  |                                                                            
  |==================================================                    |  72%
  |                                                                            
  |===================================================                   |  72%
  |                                                                            
  |===================================================                   |  73%
  |                                                                            
  |===================================================                   |  74%
  |                                                                            
  |====================================================                  |  74%
  |                                                                            
  |====================================================                  |  75%
  |                                                                            
  |=====================================================                 |  75%
  |                                                                            
  |=====================================================                 |  76%
  |                                                                            
  |======================================================                |  76%
  |                                                                            
  |======================================================                |  77%
  |                                                                            
  |======================================================                |  78%
  |                                                                            
  |=======================================================               |  78%
  |                                                                            
  |=======================================================               |  79%
  |                                                                            
  |========================================================              |  79%
  |                                                                            
  |========================================================              |  80%
  |                                                                            
  |========================================================              |  81%
  |                                                                            
  |=========================================================             |  81%
  |                                                                            
  |=========================================================             |  82%
  |                                                                            
  |==========================================================            |  82%
  |                                                                            
  |==========================================================            |  83%
  |                                                                            
  |==========================================================            |  84%
  |                                                                            
  |===========================================================           |  84%
  |                                                                            
  |===========================================================           |  85%
  |                                                                            
  |============================================================          |  85%
  |                                                                            
  |============================================================          |  86%
  |                                                                            
  |=============================================================         |  86%
  |                                                                            
  |=============================================================         |  87%
  |                                                                            
  |=============================================================         |  88%
  |                                                                            
  |==============================================================        |  88%
  |                                                                            
  |==============================================================        |  89%
  |                                                                            
  |===============================================================       |  89%
  |                                                                            
  |===============================================================       |  90%
  |                                                                            
  |===============================================================       |  91%
  |                                                                            
  |================================================================      |  91%
  |                                                                            
  |================================================================      |  92%
  |                                                                            
  |=================================================================     |  92%
  |                                                                            
  |=================================================================     |  93%
  |                                                                            
  |=================================================================     |  94%
  |                                                                            
  |==================================================================    |  94%
  |                                                                            
  |==================================================================    |  95%
  |                                                                            
  |===================================================================   |  95%
  |                                                                            
  |===================================================================   |  96%
  |                                                                            
  |====================================================================  |  96%
  |                                                                            
  |====================================================================  |  97%
  |                                                                            
  |====================================================================  |  98%
  |                                                                            
  |===================================================================== |  98%
  |                                                                            
  |===================================================================== |  99%
  |                                                                            
  |======================================================================|  99%
  |                                                                            
  |======================================================================| 100%

2580 out of 20814 input SNPs were not located within 10000 base-pairs of another SNP and were retained despite filtering

9 write vcf to disk for downstream analyses

Code

#get info for all SNPs passing filtering vcf dataset
filt

***** Object of Class vcfR *****
107 samples
2457 CHROMs
20,814 variants
Object size: 181.4 Mb
2.829 percent missing data
*****        *****         *****

Code

#write to disk
#vcfR::write.vcf(filt, file = "~/Desktop/nmel.ceyx.rad/filtered.snps.vcf.gz")

#get info for the thinned SNP dataset
filt.thin

***** Object of Class vcfR *****
107 samples
2457 CHROMs
2,580 variants
Object size: 22.4 Mb
3.187 percent missing data
*****        *****         *****

Code

#write to disk
#vcfR::write.vcf(filt.thin, file = "~/Desktop/nmel.ceyx.rad/filtered.unlinked.snps.vcf.gz")

References

DeRaad, Devon A. 2022. “SNPfiltR: An R Package for Interactive and Reproducible SNP Filtering.” Molecular Ecology Resources 22 (6): 2443–53. https://doi.org/10.1111/1755-0998.13618.

Rochette, Nicolas C., Angel G. Rivera-Colón, and Julian M. Catchen. 2019. “Stacks 2: Analytical Methods for Paired-End Sequencing Improve RADseq-Based Population Genomics.” Molecular Ecology 28 (21): 4737–54. https://doi.org/https://doi.org/10.1111/mec.15253.