Code
library(vcfR)
library(SNPfiltR)
library(StAMPP)
library(adegenet)
library(ggplot2)
#read vcf
<-read.vcfR("~/Desktop/nmel.ceyx.rad/populations.snps.vcf.gz") v
This vcf comes straight out of mapping raw RADseq reads to a publicly available Ceyx reference genome and calling SNPs using Stacks (Rochette, Rivera-Colón, and Catchen 2019) (exact pipeline used for this dataset is available here).
library(vcfR)
library(SNPfiltR)
library(StAMPP)
library(adegenet)
library(ggplot2)
#read vcf
<-read.vcfR("~/Desktop/nmel.ceyx.rad/populations.snps.vcf.gz") v
I will now follow the SNP filtering pipeline outlined in detail in the documents of the SNPfiltR R package (see SNPfiltR vignettes) (DeRaad 2022)
This is because removing low data samples will alter percentage/quantile based missing data cutoffs, so we wait to implement those until after deciding on our final set of samples for downstream analysis
#visualize distributions
hard_filter(v)
no depth cutoff provided, exploratory visualization will be generated.
no genotype quality cutoff provided, exploratory visualization will be generated.
#hard filter to minimum depth of 3, and minimum genotype quality of 30
<-hard_filter(vcfR=v, depth = 3, gq = 30) v
13.84% of genotypes fall below a read depth of 3 and were converted to NA
2.59% of genotypes fall below a genotype quality of 30 and were converted to NA
***** Object of Class vcfR *****
135 samples
23234 CHROMs
179,940 variants
Object size: 659.6 Mb
61.66 percent missing data
***** ***** *****
Use the function allele_balance()
to filter for allele balance from Puritz SNP filtering tutorial “Allele balance: a number between 0 and 1 representing the ratio of reads showing the reference allele to all reads, considering only reads from individuals called as heterozygous, we expect that the allele balance in our data (for real loci) should be close to 0.5”
#execute allele balance filter
<-filter_allele_balance(v, min.ratio = .1, max.ratio = .9) v
0.36% of het genotypes (0.01% of all genotypes) fall outside of 0.1 - 0.9 allele balance ratio and were converted to NA
We now want to implement a max depth filter (super high depth loci are likely multiple loci stuck together into a single paralogous locus, which we want to remove).
#visualize and pick appropriate max depth cutoff
max_depth(v)
cutoff is not specified, exploratory visualization will be generated.
dashed line indicates a mean depth across all SNPs of 45.2
#filter vcf by the max depth cutoff you chose
<-max_depth(v, maxdepth = 250) v
maxdepth cutoff is specified, filtered vcfR object will be returned
4.38% of SNPs were above a mean depth of 250 and were removed from the vcf
Remove SNPs from the vcfR that have become invariant following the removal of questionable genotypes above, and see how many SNPs we have left after this initial set of filters
<-min_mac(v, min.mac = 1) v
24.49% of SNPs fell below a minor allele count of 1 and were removed from the VCF
v
***** Object of Class vcfR *****
135 samples
19666 CHROMs
129,922 variants
Object size: 490.8 Mb
64.51 percent missing data
***** ***** *****
Check out the exploratory visualizations and make decisions about which samples to keep for downstream analysis.
#run function to visualize per sample missingness
<-missing_by_sample(v) miss
No popmap provided
#run function to visualize per SNP missingness
<-missing_by_snp(v) by.snp
cutoff is not specified, exploratory visualizations will be generated
Picking joint bandwidth of 0.0279
Warning: Removed 135 rows containing non-finite values
(`stat_density_ridges()`).
Warning: Removed 1 rows containing missing values (`geom_point()`).
Based on these visualizations, we will want to drop the worst sequenced samples that are dragging down the rest of the dataset. Drop those samples based on an approximate missing data proportion cutoff here (this can always be revised if we end up feeling like this is too lenient or stringent later):
#run function to drop samples above the threshold we want from the vcf
<-missing_by_sample(v, cutoff = .8) vcfR.trim
21 samples are above a 0.8 missing data cutoff, and were removed from VCF
#remove invariant sites generated by sample trimming
<-min_mac(vcfR.trim, min.mac = 1) vcfR.trim
4.32% of SNPs fell below a minor allele count of 1 and were removed from the VCF
#see what effect trimming samples had on missing data across the dataset
<-missing_by_snp(vcfR.trim) by.snp
cutoff is not specified, exploratory visualizations will be generated
Picking joint bandwidth of 0.053
To investigate evidence for technical error, batch effects, or contamination, I will compare patterns of sample clustering among all samples with technical replicates still included in the dataset.
#set 90% completeness per SNP threshold
<-missing_by_snp(vcfR.trim, cutoff=.9) test
cutoff is specified, filtered vcfR object will be returned
83.48% of SNPs fell below a completeness cutoff of 0.9 and were removed from the VCF
#convert to genlight
<-vcfR2genlight(test)
gen#fix sample names to fit in <= 10 characters
@ind.names gen
[1] "C_solitarius_4846" "C_sacerdotis_5307" "C_sacerdotis_5311"
[4] "C_sacerdotis_5317" "C_solitarius_5447-8" "C_solitarius_5565"
[7] "C_solitarius_6836" "C_solitarius_6948" "C_solitarius_7161"
[10] "C_solitarius_7256" "C_solitarius_7604" "C_solitarius_7629"
[13] "C_solitarius_9547" "C_solitarius_9572" "C_solitarius_9628"
[16] "C_solitarius_12865" "C_nigromaxilla_15907" "C_nigromaxilla_15908"
[19] "C_solitarius_16175" "C_margarethae_27239" "C_margarethae_27245"
[22] "C_margarethae_27254" "C_margarethae_27261" "C_margarethae_27264"
[25] "C_sacerdotis_27646" "C_mulcatus_27674" "C_mulcatus_27686"
[28] "C_mulcatus_27746" "C_mulcatus_27839" "C_mulcatus_27852"
[31] "C_solitarius_27884" "C_sacerdotis_29529" "C_meeki_32006"
[34] "C_meeki_32007" "C_meeki_32014" "C_meeki_32022"
[37] "C_meeki_32023" "C_meeki_32024" "C_meeki_32054"
[40] "C_meeki_32075" "C_malaitae_32770" "C_malaitae_32790"
[43] "C_nigromaxilla_32835" "C_nigromaxilla_32846" "C_nigromaxilla_32860"
[46] "C_collectoris_33756" "C_collectoris_33757" "C_collectoris_33759"
[49] "C_collectoris_33760" "C_collectoris_33789" "C_collectoris_33790"
[52] "C_collectoris_33822" "C_collectoris_33832" "C_collectoris_33842"
[55] "C_collectoris_33871" "C_collectoris_33905" "C_collectoris_33906"
[58] "C_meeki_34840" "C_meeki_34848" "C_meeki_34855"
[61] "C_meeki_34862" "C_meeki_34865" "C_meeki_34870"
[64] "C_meeki_5633-2" "C_meeki_32075-2" "C_meeki_32075-3"
[67] "C_gentianus_34953" "C_gentianus_34969" "C_gentianus_35042"
[70] "C_collectoris_36072" "C_collectoris_36076" "C_collectoris_36094"
[73] "C_collectoris_36095" "C_collectoris_36096" "C_collectoris_36105"
[76] "C_collectoris_36112" "C_collectoris_36129" "C_collectoris_36144"
[79] "C_collectoris_36145" "C_collectoris_36156" "C_collectoris_36212"
[82] "C_collectoris_36217" "C_meeki_34887" "C_collectoris_36249"
[85] "C_meeki_5633" "C_dispar_5611" "C_solitarius_5157"
[88] "C_solitarius_5192" "C_solitarius_6977" "C_solitarius_6982"
[91] "C_solitarius_7229" "C_solitarius_7295" "C_solitarius_7526"
[94] "C_solitarius_9539" "C_gentianus_13530" "C_margarethae_14484"
[97] "C_nigromaxilla_15880" "C_nigromaxilla_15892" "C_margarethae_19259"
[100] "C_collectoris_33266" "C_collectoris_33272" "C_collectoris_33274"
[103] "C_collectoris_33761" "C_collectoris_33797" "C_collectoris_33863"
[106] "C_collectoris_33878" "C_collectoris_33908" "C_collectoris_33922"
[109] "C_collectoris_36115" "C_collectoris_36133" "C_dispar_5611-2"
[112] "C_malaitae_32790-2" "C_collectoris_33272-2" "C_collectoris_33274-2"
@ind.names<-gsub("C_collectoris","co", gen@ind.names)
gen@ind.names<-gsub("C_margarethae","mar", gen@ind.names)
gen@ind.names<-gsub("C_gentianus","gen", gen@ind.names)
gen@ind.names<-gsub("C_solitarius","sol", gen@ind.names)
gen@ind.names<-gsub("C_nigromaxilla","ni", gen@ind.names)
gen@ind.names<-gsub("C_malaitae","ml", gen@ind.names)
gen@ind.names<-gsub("C_meeki","me", gen@ind.names)
gen@ind.names<-gsub("C_dispar","dis", gen@ind.names)
gen@ind.names<-gsub("C_sacerdotis","sac", gen@ind.names)
gen@ind.names<-gsub("C_mulcatus","mul", gen@ind.names)
gen@ind.names gen
[1] "sol_4846" "sac_5307" "sac_5311" "sac_5317" "sol_5447-8"
[6] "sol_5565" "sol_6836" "sol_6948" "sol_7161" "sol_7256"
[11] "sol_7604" "sol_7629" "sol_9547" "sol_9572" "sol_9628"
[16] "sol_12865" "ni_15907" "ni_15908" "sol_16175" "mar_27239"
[21] "mar_27245" "mar_27254" "mar_27261" "mar_27264" "sac_27646"
[26] "mul_27674" "mul_27686" "mul_27746" "mul_27839" "mul_27852"
[31] "sol_27884" "sac_29529" "me_32006" "me_32007" "me_32014"
[36] "me_32022" "me_32023" "me_32024" "me_32054" "me_32075"
[41] "ml_32770" "ml_32790" "ni_32835" "ni_32846" "ni_32860"
[46] "co_33756" "co_33757" "co_33759" "co_33760" "co_33789"
[51] "co_33790" "co_33822" "co_33832" "co_33842" "co_33871"
[56] "co_33905" "co_33906" "me_34840" "me_34848" "me_34855"
[61] "me_34862" "me_34865" "me_34870" "me_5633-2" "me_32075-2"
[66] "me_32075-3" "gen_34953" "gen_34969" "gen_35042" "co_36072"
[71] "co_36076" "co_36094" "co_36095" "co_36096" "co_36105"
[76] "co_36112" "co_36129" "co_36144" "co_36145" "co_36156"
[81] "co_36212" "co_36217" "me_34887" "co_36249" "me_5633"
[86] "dis_5611" "sol_5157" "sol_5192" "sol_6977" "sol_6982"
[91] "sol_7229" "sol_7295" "sol_7526" "sol_9539" "gen_13530"
[96] "mar_14484" "ni_15880" "ni_15892" "mar_19259" "co_33266"
[101] "co_33272" "co_33274" "co_33761" "co_33797" "co_33863"
[106] "co_33878" "co_33908" "co_33922" "co_36115" "co_36133"
[111] "dis_5611-2" "ml_32790-2" "co_33272-2" "co_33274-2"
pop(gen)<-gen@ind.names
#assign populations (a StaMPP requirement)
@pop<-as.factor(gen@ind.names)
gen#generate pairwise divergence matrix
<- stamppNeisD(gen, pop = FALSE)
sample.div #export for splitstree
stamppPhylip(distance.mat=sample.div, file="~/Desktop/nmel.ceyx.rad/test.90.splits.txt")
#view tree
::include_graphics("/Users/devonderaad/Desktop/ceyx.tree.png") knitr
#view zoomed patterns of clustering which show technical replicates tightly clustered
::include_graphics("/Users/devonderaad/Desktop/tech.reps1.png") knitr
::include_graphics("/Users/devonderaad/Desktop/tech.reps2.png") knitr
::include_graphics("/Users/devonderaad/Desktop/tech.reps3.png") knitr
::include_graphics("/Users/devonderaad/Desktop/tech.reps4.png") knitr
I will now investigate the number of genotypes that differ between technical replicates which were derived from the same DNA extract, in this filtered, 85% complete dataset
#define function to calculate degree of genotype conflict between technical replicates
<-function(vcf=NULL, rep1=NULL, rep2=NULL){
assess_technical_reps#print percentage of called genotypes that differ between reps
<-table(gsub(":.*","",(test@gt[,colnames(vcf@gt) == rep1])) == gsub(":.*","",(vcf@gt[,colnames(test@gt) == rep2])))
zzprint(paste0("percentage of called SNPs where genotypes differ between ", rep1," and ",rep2," = ",round(zz[1]*100/sum(zz), 3),"%."))
#isolate the exact genotypes that conflict
<-gsub(":.*","",(vcf@gt[,colnames(vcf@gt) == rep1 | colnames(vcf@gt) == rep2]))[gsub(":.*","",(vcf@gt[,colnames(vcf@gt) == rep1])) != gsub(":.*","",(vcf@gt[,colnames(vcf@gt) == rep2])),]
z#print them with missing values removed
print(paste0("Conflicting genotypes printed below"))
complete.cases(z), ]
z[
}
#assess all possible comparisons
assess_technical_reps(vcf=test, "C_collectoris_33272", "C_collectoris_33272-2")
[1] "percentage of called SNPs where genotypes differ between C_collectoris_33272 and C_collectoris_33272-2 = 0.039%."
[1] "Conflicting genotypes printed below"
C_collectoris_33272 C_collectoris_33272-2
[1,] "0/0" "0/1"
[2,] "1/1" "0/1"
[3,] "0/1" "0/0"
[4,] "0/1" "0/0"
[5,] "0/0" "0/1"
[6,] "0/0" "0/1"
[7,] "0/1" "0/0"
[8,] "0/1" "1/1"
assess_technical_reps(vcf=test, "C_collectoris_33274", "C_collectoris_33274-2")
[1] "percentage of called SNPs where genotypes differ between C_collectoris_33274 and C_collectoris_33274-2 = 0.083%."
[1] "Conflicting genotypes printed below"
C_collectoris_33274 C_collectoris_33274-2
[1,] "0/1" "0/0"
[2,] "0/0" "0/1"
[3,] "0/1" "0/0"
[4,] "0/1" "0/0"
[5,] "0/0" "0/1"
[6,] "0/1" "0/0"
[7,] "0/0" "0/1"
[8,] "0/1" "0/0"
[9,] "0/0" "0/1"
[10,] "0/1" "0/0"
[11,] "0/0" "0/1"
[12,] "0/0" "0/1"
[13,] "0/1" "0/0"
[14,] "0/1" "0/0"
[15,] "0/1" "0/0"
[16,] "0/1" "0/0"
assess_technical_reps(vcf=test, "C_meeki_5633", "C_meeki_5633-2")
[1] "percentage of called SNPs where genotypes differ between C_meeki_5633 and C_meeki_5633-2 = 0.136%."
[1] "Conflicting genotypes printed below"
C_meeki_5633-2 C_meeki_5633
[1,] "0/0" "0/1"
[2,] "0/1" "0/0"
[3,] "0/1" "0/0"
[4,] "0/1" "0/0"
[5,] "0/1" "1/1"
[6,] "0/0" "0/1"
[7,] "0/1" "0/0"
[8,] "0/1" "0/0"
[9,] "0/0" "0/1"
[10,] "0/1" "1/1"
[11,] "0/1" "0/0"
[12,] "0/1" "0/0"
[13,] "0/1" "0/0"
[14,] "0/0" "0/1"
[15,] "1/1" "0/1"
[16,] "0/0" "0/1"
[17,] "0/0" "0/1"
[18,] "0/0" "0/1"
[19,] "0/1" "1/1"
[20,] "0/1" "0/0"
[21,] "0/0" "0/1"
[22,] "0/1" "0/0"
[23,] "0/0" "0/1"
[24,] "0/1" "1/1"
[25,] "0/1" "0/0"
[26,] "0/0" "0/1"
[27,] "0/0" "0/1"
assess_technical_reps(vcf=test, "C_meeki_32075", "C_meeki_32075-2")
[1] "percentage of called SNPs where genotypes differ between C_meeki_32075 and C_meeki_32075-2 = 0.116%."
[1] "Conflicting genotypes printed below"
C_meeki_32075 C_meeki_32075-2
[1,] "0/0" "0/1"
[2,] "0/0" "0/1"
[3,] "0/1" "0/0"
[4,] "0/0" "0/1"
[5,] "0/1" "0/0"
[6,] "0/0" "0/1"
[7,] "0/1" "0/0"
[8,] "0/1" "0/0"
[9,] "0/1" "0/0"
[10,] "0/1" "0/0"
[11,] "0/1" "0/0"
[12,] "0/1" "0/0"
[13,] "0/0" "0/1"
[14,] "1/1" "0/1"
[15,] "0/1" "1/1"
[16,] "0/1" "0/0"
[17,] "0/0" "0/1"
[18,] "0/0" "0/1"
[19,] "0/0" "0/1"
[20,] "0/0" "0/1"
[21,] "0/0" "0/1"
assess_technical_reps(vcf=test, "C_meeki_32075", "C_meeki_32075-3")
[1] "percentage of called SNPs where genotypes differ between C_meeki_32075 and C_meeki_32075-3 = 0.14%."
[1] "Conflicting genotypes printed below"
C_meeki_32075 C_meeki_32075-3
[1,] "0/0" "0/1"
[2,] "0/1" "0/0"
[3,] "0/1" "0/0"
[4,] "0/1" "0/0"
[5,] "0/1" "0/0"
[6,] "0/0" "0/1"
[7,] "0/1" "1/1"
[8,] "0/1" "0/0"
[9,] "0/1" "0/0"
[10,] "0/1" "0/0"
[11,] "0/0" "0/1"
[12,] "0/1" "0/0"
[13,] "0/0" "0/1"
[14,] "0/0" "0/1"
[15,] "0/1" "0/0"
[16,] "0/1" "1/1"
[17,] "0/0" "0/1"
[18,] "0/1" "1/1"
[19,] "0/1" "0/0"
[20,] "0/1" "0/0"
[21,] "0/0" "0/1"
[22,] "0/1" "0/0"
[23,] "0/0" "0/1"
[24,] "0/0" "0/1"
[25,] "0/0" "0/1"
assess_technical_reps(vcf=test, "C_meeki_32075-2", "C_meeki_32075-3")
[1] "percentage of called SNPs where genotypes differ between C_meeki_32075-2 and C_meeki_32075-3 = 0.12%."
[1] "Conflicting genotypes printed below"
C_meeki_32075-2 C_meeki_32075-3
[1,] "0/1" "0/0"
[2,] "0/0" "0/1"
[3,] "0/0" "0/1"
[4,] "0/1" "0/0"
[5,] "0/1" "1/1"
[6,] "0/0" "0/1"
[7,] "0/0" "0/1"
[8,] "0/0" "0/1"
[9,] "0/0" "0/1"
[10,] "0/0" "0/1"
[11,] "0/0" "0/1"
[12,] "0/1" "0/0"
[13,] "0/1" "0/0"
[14,] "0/1" "0/0"
[15,] "0/1" "0/0"
[16,] "0/0" "0/1"
[17,] "0/1" "0/0"
[18,] "0/1" "0/0"
[19,] "0/0" "0/1"
[20,] "0/0" "0/1"
assess_technical_reps(vcf=test, "C_malaitae_32790", "C_malaitae_32790-2")
[1] "percentage of called SNPs where genotypes differ between C_malaitae_32790 and C_malaitae_32790-2 = 0.12%."
[1] "Conflicting genotypes printed below"
C_malaitae_32790 C_malaitae_32790-2
[1,] "0/1" "1/1"
[2,] "0/0" "0/1"
[3,] "0/1" "0/0"
[4,] "0/0" "0/1"
[5,] "0/0" "0/1"
[6,] "0/0" "0/1"
[7,] "1/1" "0/1"
[8,] "0/0" "0/1"
[9,] "0/1" "1/1"
[10,] "0/0" "0/1"
[11,] "0/0" "0/1"
[12,] "0/1" "0/0"
[13,] "0/1" "1/1"
[14,] "0/1" "0/0"
[15,] "0/1" "0/0"
[16,] "0/0" "0/1"
[17,] "0/0" "0/1"
[18,] "0/1" "0/0"
[19,] "0/1" "1/1"
[20,] "0/1" "0/0"
[21,] "0/1" "0/0"
[22,] "0/1" "1/1"
[23,] "0/1" "0/0"
assess_technical_reps(vcf=test, "C_dispar_5611", "C_dispar_5611-2")
[1] "percentage of called SNPs where genotypes differ between C_dispar_5611 and C_dispar_5611-2 = 0.154%."
[1] "Conflicting genotypes printed below"
C_dispar_5611 C_dispar_5611-2
[1,] "0/0" "0/1"
[2,] "0/1" "0/0"
[3,] "0/0" "0/1"
[4,] "0/0" "0/1"
[5,] "0/0" "0/1"
[6,] "0/1" "0/0"
[7,] "0/1" "0/0"
[8,] "0/0" "0/1"
[9,] "0/0" "0/1"
[10,] "0/0" "0/1"
[11,] "0/1" "0/0"
[12,] "0/0" "0/1"
[13,] "0/1" "1/1"
[14,] "0/1" "0/0"
[15,] "0/1" "0/0"
[16,] "0/1" "0/0"
[17,] "0/0" "0/1"
[18,] "0/1" "0/0"
[19,] "0/1" "0/0"
[20,] "0/1" "0/0"
[21,] "0/1" "0/0"
[22,] "0/0" "0/1"
[23,] "0/0" "0/1"
[24,] "0/1" "1/1"
[25,] "0/1" "1/1"
[26,] "1/1" "0/1"
[27,] "0/1" "0/0"
[28,] "1/1" "0/1"
[29,] "0/0" "0/1"
[30,] "0/1" "0/0"
These results suggest a technical error rate between .04 - .41 % based on genotype calling in technical replicates derived from the same DNA extracts. This error rate is likely highly dependent on depth of sequencing, as disagreements between called genotypes are nearly universally cases where one replicate is called heterozygous and the other is not. These cases depend greatly on the depth of sequencing to accurately recover enough reads for confident het calls. These results suggest an overall low technical error rate and no evidence for batch effects or contamination.
I will now remove the lower sequenced replicate for each technical replicate, leaving us with a dataset where each replicate comes from a unique biological sample
#remove lower sequenced samples when there are multiple replicates
<-vcfR.trim[,colnames(vcfR.trim@gt) != "C_collectoris_33272-2" &
vcfR.trimcolnames(vcfR.trim@gt) != "C_collectoris_33274-2" &
colnames(vcfR.trim@gt) != "C_meeki_5633-2" &
colnames(vcfR.trim@gt) != "C_meeki_32075-3" &
colnames(vcfR.trim@gt) != "C_meeki_32075-2" &
colnames(vcfR.trim@gt) != "C_malaitae_32740-2" &
colnames(vcfR.trim@gt) != "C_malaitae_32790" &
colnames(vcfR.trim@gt) != "C_dispar_5611"]
#remove invariant SNPs
<-min_mac(vcfR.trim, min.mac = 1) vcfR.trim
4.08% of SNPs fell below a minor allele count of 1 and were removed from the VCF
Now we will visualize different per SNP missing data thresholds and identify a value that optimizes the trade-off between amount of missing data and the total number of SNPs retained.
#see what effect trimming samples and technical replicates has had on missing data across the dataset
<-missing_by_snp(vcfR.trim) by.snp
cutoff is not specified, exploratory visualizations will be generated
Picking joint bandwidth of 0.0526
Set a 90% per-SNP completeness cutoff which seems sufficient for a high quality dataset
#set 90% completeness per SNP threshold
<-missing_by_snp(vcfR.trim, cutoff=.9) filt
cutoff is specified, filtered vcfR object will be returned
82.53% of SNPs fell below a completeness cutoff of 0.9 and were removed from the VCF
<-missing_by_sample(filt) bysamp
No popmap provided
filt
***** Object of Class vcfR *****
107 samples
2457 CHROMs
20,831 variants
Object size: 181.5 Mb
2.827 percent missing data
***** ***** *****
#convert to genlight
<-vcfR2genlight(filt)
gen#fix sample names to fit in <= 10 characters
@ind.names gen
[1] "C_solitarius_4846" "C_sacerdotis_5307" "C_sacerdotis_5311"
[4] "C_sacerdotis_5317" "C_solitarius_5447-8" "C_solitarius_5565"
[7] "C_solitarius_6836" "C_solitarius_6948" "C_solitarius_7161"
[10] "C_solitarius_7256" "C_solitarius_7604" "C_solitarius_7629"
[13] "C_solitarius_9547" "C_solitarius_9572" "C_solitarius_9628"
[16] "C_solitarius_12865" "C_nigromaxilla_15907" "C_nigromaxilla_15908"
[19] "C_solitarius_16175" "C_margarethae_27239" "C_margarethae_27245"
[22] "C_margarethae_27254" "C_margarethae_27261" "C_margarethae_27264"
[25] "C_sacerdotis_27646" "C_mulcatus_27674" "C_mulcatus_27686"
[28] "C_mulcatus_27746" "C_mulcatus_27839" "C_mulcatus_27852"
[31] "C_solitarius_27884" "C_sacerdotis_29529" "C_meeki_32006"
[34] "C_meeki_32007" "C_meeki_32014" "C_meeki_32022"
[37] "C_meeki_32023" "C_meeki_32024" "C_meeki_32054"
[40] "C_meeki_32075" "C_malaitae_32770" "C_nigromaxilla_32835"
[43] "C_nigromaxilla_32846" "C_nigromaxilla_32860" "C_collectoris_33756"
[46] "C_collectoris_33757" "C_collectoris_33759" "C_collectoris_33760"
[49] "C_collectoris_33789" "C_collectoris_33790" "C_collectoris_33822"
[52] "C_collectoris_33832" "C_collectoris_33842" "C_collectoris_33871"
[55] "C_collectoris_33905" "C_collectoris_33906" "C_meeki_34840"
[58] "C_meeki_34848" "C_meeki_34855" "C_meeki_34862"
[61] "C_meeki_34865" "C_meeki_34870" "C_gentianus_34953"
[64] "C_gentianus_34969" "C_gentianus_35042" "C_collectoris_36072"
[67] "C_collectoris_36076" "C_collectoris_36094" "C_collectoris_36095"
[70] "C_collectoris_36096" "C_collectoris_36105" "C_collectoris_36112"
[73] "C_collectoris_36129" "C_collectoris_36144" "C_collectoris_36145"
[76] "C_collectoris_36156" "C_collectoris_36212" "C_collectoris_36217"
[79] "C_meeki_34887" "C_collectoris_36249" "C_meeki_5633"
[82] "C_solitarius_5157" "C_solitarius_5192" "C_solitarius_6977"
[85] "C_solitarius_6982" "C_solitarius_7229" "C_solitarius_7295"
[88] "C_solitarius_7526" "C_solitarius_9539" "C_gentianus_13530"
[91] "C_margarethae_14484" "C_nigromaxilla_15880" "C_nigromaxilla_15892"
[94] "C_margarethae_19259" "C_collectoris_33266" "C_collectoris_33272"
[97] "C_collectoris_33274" "C_collectoris_33761" "C_collectoris_33797"
[100] "C_collectoris_33863" "C_collectoris_33878" "C_collectoris_33908"
[103] "C_collectoris_33922" "C_collectoris_36115" "C_collectoris_36133"
[106] "C_dispar_5611-2" "C_malaitae_32790-2"
@ind.names<-gsub("C_collectoris","co", gen@ind.names)
gen@ind.names<-gsub("C_margarethae","mar", gen@ind.names)
gen@ind.names<-gsub("C_gentianus","gen", gen@ind.names)
gen@ind.names<-gsub("C_solitarius","sol", gen@ind.names)
gen@ind.names<-gsub("C_nigromaxilla","ni", gen@ind.names)
gen@ind.names<-gsub("C_malaitae","ml", gen@ind.names)
gen@ind.names<-gsub("C_meeki","me", gen@ind.names)
gen@ind.names<-gsub("C_dispar","dis", gen@ind.names)
gen@ind.names<-gsub("C_sacerdotis","sac", gen@ind.names)
gen@ind.names<-gsub("C_mulcatus","mul", gen@ind.names)
gen@ind.names gen
[1] "sol_4846" "sac_5307" "sac_5311" "sac_5317" "sol_5447-8"
[6] "sol_5565" "sol_6836" "sol_6948" "sol_7161" "sol_7256"
[11] "sol_7604" "sol_7629" "sol_9547" "sol_9572" "sol_9628"
[16] "sol_12865" "ni_15907" "ni_15908" "sol_16175" "mar_27239"
[21] "mar_27245" "mar_27254" "mar_27261" "mar_27264" "sac_27646"
[26] "mul_27674" "mul_27686" "mul_27746" "mul_27839" "mul_27852"
[31] "sol_27884" "sac_29529" "me_32006" "me_32007" "me_32014"
[36] "me_32022" "me_32023" "me_32024" "me_32054" "me_32075"
[41] "ml_32770" "ni_32835" "ni_32846" "ni_32860" "co_33756"
[46] "co_33757" "co_33759" "co_33760" "co_33789" "co_33790"
[51] "co_33822" "co_33832" "co_33842" "co_33871" "co_33905"
[56] "co_33906" "me_34840" "me_34848" "me_34855" "me_34862"
[61] "me_34865" "me_34870" "gen_34953" "gen_34969" "gen_35042"
[66] "co_36072" "co_36076" "co_36094" "co_36095" "co_36096"
[71] "co_36105" "co_36112" "co_36129" "co_36144" "co_36145"
[76] "co_36156" "co_36212" "co_36217" "me_34887" "co_36249"
[81] "me_5633" "sol_5157" "sol_5192" "sol_6977" "sol_6982"
[86] "sol_7229" "sol_7295" "sol_7526" "sol_9539" "gen_13530"
[91] "mar_14484" "ni_15880" "ni_15892" "mar_19259" "co_33266"
[96] "co_33272" "co_33274" "co_33761" "co_33797" "co_33863"
[101] "co_33878" "co_33908" "co_33922" "co_36115" "co_36133"
[106] "dis_5611-2" "ml_32790-2"
pop(gen)<-gen@ind.names
#assign populations (a StaMPP requirement)
@pop<-as.factor(gen@ind.names)
gen#generate pairwise divergence matrix
<- stamppNeisD(gen, pop = FALSE)
sample.div #export for splitstree
stamppPhylip(distance.mat=sample.div, file="~/Desktop/nmel.ceyx.rad/ceyx.90.splits.txt")
#splitstree looks clean, no evidence of weird clustering driven by missing data
::include_graphics("/Users/devonderaad/Desktop/ceyx.90.splits.png") knitr
It is a known thing (see this) that Stacks will not merge SNPs if they are sequenced by separate (but physically overlapping) loci, and will instead retain the same SNP twice. To account for this, we will simply remove a SNP every time its chromosome and position have already been seen in the dataset, using the following code:
#generate dataframe containing information for chromosome and bp locality of each SNP
<-as.data.frame(filt@fix[,1:2])
df#calc number of duplicated SNPs to remove
nrow(df) - length(unique(paste(df$CHROM,df$POS)))
[1] 17
#remove duplicated SNPs
<-filt[!duplicated(paste(df$CHROM,df$POS)),] filt
#plot depth per snp and per sample
<- extract.gt(filt, element = "DP", as.numeric=TRUE)
dp heatmap.bp(dp, rlabels = FALSE)
#plot genotype quality per snp and per sample
<- extract.gt(filt, element = "GQ", as.numeric=TRUE)
gq heatmap.bp(gq, rlabels = FALSE)
Now filter for linkage
#perform linkage filtering to get a reduced vcf with only unlinked SNPs
<-distance_thin(filt, min.distance = 10000) filt.thin
|
| | 0%
|
| | 1%
|
|= | 1%
|
|= | 2%
|
|== | 2%
|
|== | 3%
|
|== | 4%
|
|=== | 4%
|
|=== | 5%
|
|==== | 5%
|
|==== | 6%
|
|===== | 6%
|
|===== | 7%
|
|===== | 8%
|
|====== | 8%
|
|====== | 9%
|
|======= | 9%
|
|======= | 10%
|
|======= | 11%
|
|======== | 11%
|
|======== | 12%
|
|========= | 12%
|
|========= | 13%
|
|========= | 14%
|
|========== | 14%
|
|========== | 15%
|
|=========== | 15%
|
|=========== | 16%
|
|============ | 16%
|
|============ | 17%
|
|============ | 18%
|
|============= | 18%
|
|============= | 19%
|
|============== | 19%
|
|============== | 20%
|
|============== | 21%
|
|=============== | 21%
|
|=============== | 22%
|
|================ | 22%
|
|================ | 23%
|
|================ | 24%
|
|================= | 24%
|
|================= | 25%
|
|================== | 25%
|
|================== | 26%
|
|=================== | 26%
|
|=================== | 27%
|
|=================== | 28%
|
|==================== | 28%
|
|==================== | 29%
|
|===================== | 29%
|
|===================== | 30%
|
|===================== | 31%
|
|====================== | 31%
|
|====================== | 32%
|
|======================= | 32%
|
|======================= | 33%
|
|======================= | 34%
|
|======================== | 34%
|
|======================== | 35%
|
|========================= | 35%
|
|========================= | 36%
|
|========================== | 36%
|
|========================== | 37%
|
|========================== | 38%
|
|=========================== | 38%
|
|=========================== | 39%
|
|============================ | 39%
|
|============================ | 40%
|
|============================ | 41%
|
|============================= | 41%
|
|============================= | 42%
|
|============================== | 42%
|
|============================== | 43%
|
|============================== | 44%
|
|=============================== | 44%
|
|=============================== | 45%
|
|================================ | 45%
|
|================================ | 46%
|
|================================= | 46%
|
|================================= | 47%
|
|================================= | 48%
|
|================================== | 48%
|
|================================== | 49%
|
|=================================== | 49%
|
|=================================== | 50%
|
|=================================== | 51%
|
|==================================== | 51%
|
|==================================== | 52%
|
|===================================== | 52%
|
|===================================== | 53%
|
|===================================== | 54%
|
|====================================== | 54%
|
|====================================== | 55%
|
|======================================= | 55%
|
|======================================= | 56%
|
|======================================== | 56%
|
|======================================== | 57%
|
|======================================== | 58%
|
|========================================= | 58%
|
|========================================= | 59%
|
|========================================== | 59%
|
|========================================== | 60%
|
|========================================== | 61%
|
|=========================================== | 61%
|
|=========================================== | 62%
|
|============================================ | 62%
|
|============================================ | 63%
|
|============================================ | 64%
|
|============================================= | 64%
|
|============================================= | 65%
|
|============================================== | 65%
|
|============================================== | 66%
|
|=============================================== | 66%
|
|=============================================== | 67%
|
|=============================================== | 68%
|
|================================================ | 68%
|
|================================================ | 69%
|
|================================================= | 69%
|
|================================================= | 70%
|
|================================================= | 71%
|
|================================================== | 71%
|
|================================================== | 72%
|
|=================================================== | 72%
|
|=================================================== | 73%
|
|=================================================== | 74%
|
|==================================================== | 74%
|
|==================================================== | 75%
|
|===================================================== | 75%
|
|===================================================== | 76%
|
|====================================================== | 76%
|
|====================================================== | 77%
|
|====================================================== | 78%
|
|======================================================= | 78%
|
|======================================================= | 79%
|
|======================================================== | 79%
|
|======================================================== | 80%
|
|======================================================== | 81%
|
|========================================================= | 81%
|
|========================================================= | 82%
|
|========================================================== | 82%
|
|========================================================== | 83%
|
|========================================================== | 84%
|
|=========================================================== | 84%
|
|=========================================================== | 85%
|
|============================================================ | 85%
|
|============================================================ | 86%
|
|============================================================= | 86%
|
|============================================================= | 87%
|
|============================================================= | 88%
|
|============================================================== | 88%
|
|============================================================== | 89%
|
|=============================================================== | 89%
|
|=============================================================== | 90%
|
|=============================================================== | 91%
|
|================================================================ | 91%
|
|================================================================ | 92%
|
|================================================================= | 92%
|
|================================================================= | 93%
|
|================================================================= | 94%
|
|================================================================== | 94%
|
|================================================================== | 95%
|
|=================================================================== | 95%
|
|=================================================================== | 96%
|
|==================================================================== | 96%
|
|==================================================================== | 97%
|
|==================================================================== | 98%
|
|===================================================================== | 98%
|
|===================================================================== | 99%
|
|======================================================================| 99%
|
|======================================================================| 100%
2580 out of 20814 input SNPs were not located within 10000 base-pairs of another SNP and were retained despite filtering
#get info for all SNPs passing filtering vcf dataset
filt
***** Object of Class vcfR *****
107 samples
2457 CHROMs
20,814 variants
Object size: 181.4 Mb
2.829 percent missing data
***** ***** *****
#write to disk
#vcfR::write.vcf(filt, file = "~/Desktop/nmel.ceyx.rad/filtered.snps.vcf.gz")
#get info for the thinned SNP dataset
filt.thin
***** Object of Class vcfR *****
107 samples
2457 CHROMs
2,580 variants
Object size: 22.4 Mb
3.187 percent missing data
***** ***** *****
#write to disk
#vcfR::write.vcf(filt.thin, file = "~/Desktop/nmel.ceyx.rad/filtered.unlinked.snps.vcf.gz")