Open
Description
Brent;
We've been trying to use the latest development in bcbio (to get the initial hg38 support) and running into a failure during pca. Is there a minimum number of input samples needed to run pca, or am I doing something else wrong?
2018-05-30 23:07:51 r3467 peddy.pca[22570] INFO loaded and subsetted thousand-genomes genotypes (shape: (2504, 1)) in 0.8 seconds
Traceback (most recent call last):
File "/g/data3/gx8/local/development/bcbio/anaconda/bin/peddy", line 11, in <module>
load_entry_point('peddy==0.4.0', 'console_scripts', 'peddy')()
File "/g/data/gx8/local/development/bcbio/anaconda/lib/python2.7/site-packages/click/core.py", line 722, in __call__
return self.main(*args, **kwargs)
File "/g/data/gx8/local/development/bcbio/anaconda/lib/python2.7/site-packages/click/core.py", line 697, in main
rv = self.invoke(ctx)
File "/g/data/gx8/local/development/bcbio/anaconda/lib/python2.7/site-packages/click/core.py", line 895, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/g/data/gx8/local/development/bcbio/anaconda/lib/python2.7/site-packages/click/core.py", line 535, in invoke
return callback(*args, **kwargs)
File "/g/data/gx8/local/development/bcbio/anaconda/lib/python2.7/site-packages/peddy/cli.py", line 204, in peddy
in ("ped_check", "het_check", "sex_check")]):
File "/g/data/gx8/local/development/bcbio/anaconda/lib/python2.7/site-packages/peddy/cli.py", line 43, in run
prefix=prefix, **kwargs)
File "/g/data/gx8/local/development/bcbio/anaconda/lib/python2.7/site-packages/peddy/peddy.py", line 871, in het_check
pca_df, background_pca_df = pca(pca_plot, sitesfile, gt_types, sites)
File "/g/data/gx8/local/development/bcbio/anaconda/lib/python2.7/site-packages/peddy/pca.py", line 61, in pca
clf.fit(genos1kg, background_target)
File "/g/data/gx8/local/development/bcbio/anaconda/lib/python2.7/site-packages/sklearn/pipeline.py", line 248, in fit
Xt, fit_params = self._fit(X, y, **fit_params)
File "/g/data/gx8/local/development/bcbio/anaconda/lib/python2.7/site-packages/sklearn/pipeline.py", line 213, in _fit
**fit_params_steps[name])
File "/g/data/gx8/local/development/bcbio/anaconda/lib/python2.7/site-packages/sklearn/externals/joblib/memory.py", line 362, in __call__
return self.func(*args, **kwargs)
File "/g/data/gx8/local/development/bcbio/anaconda/lib/python2.7/site-packages/sklearn/pipeline.py", line 581, in _fit_transform_one
res = transformer.fit_transform(X, y, **fit_params)
File "/g/data/gx8/local/development/bcbio/anaconda/lib/python2.7/site-packages/sklearn/decomposition/pca.py", line 348, in fit_transform
U, S, V = self._fit(X)
File "/g/data/gx8/local/development/bcbio/anaconda/lib/python2.7/site-packages/sklearn/decomposition/pca.py", line 394, in _fit
return self._fit_truncated(X, n_components, svd_solver)
File "/g/data/gx8/local/development/bcbio/anaconda/lib/python2.7/site-packages/sklearn/decomposition/pca.py", line 468, in _fit_truncated
% (n_components, n_features, svd_solver))
ValueError: n_components=4 must be between 1 and n_features=1 with svd_solver='randomized'
Ideally we'd love to be able to get ancestry prediction for single sample inputs. Thanks for any pointers/suggestions.
Activity
brentp commentedon May 30, 2018
hmm. maybe when there's a single sample we need to use
X[:, np.newaxis]
or something as the number of features should be ~ 20K.brentp commentedon May 30, 2018
I'll try to carve out some time on friday to work on peddy and get these resolved. thanks for reporting.
brentp commentedon May 30, 2018
I can run with 1 sample on master. It seems like maybe you're only finding a single snp out of the full set.
E.g. your log info says
(shape: (2504, 1))
. Are you using the right genome-build for your test-case?brentp commentedon May 30, 2018
I see the problem that's causing this. Will release 0.4.2 shortly.
brentp commentedon May 30, 2018
actually, I don't think that should affect what you are reporting here. You can try updating to cyvcf2 v0.9.0 to see if that resolves it, but I don't think it will.
chapmanb commentedon May 31, 2018
Brent;
Thanks for the help with this. I agree it's probably also not the parsing. It's possible there was 1 overlapping SNP; bcbio ends up throwing a lot at peddy: panels, somatic calls (when there is no germline, hoping for germline leakage). Is it possible for peddy to skip pca here if there are not enough sites? It's a bit hard to tell up from what and will not work, so I've been just doing garbage in/garbage out and handling outputs that aren't useful. Would skipping here be possible? Thanks again for taking a look at all this so quickly.
brentp commentedon May 31, 2018
It could, but it uses those sites for the heterozygosity stuff too, so the only thing left would be the sex plots.
chapmanb commentedon May 31, 2018
That would be helpful for me. I know the output is not especially useful, but in bcbio we often get not good enough data that goes in and would be great to have peddy exit cleanly just without the analysis we can't do. Right now I'm catching exceptions and doing other ugly things when it fails, so would love to improve on that. Thanks for considering this.
brentp commentedon May 31, 2018
I'm wary of writing software that catches all possible problems and gives a 0 exit code.
I'm not completely against what you're saying, but I'm not sure how to implement. My though would be that if you send peddy--which has a list of sites that it uses, a VCF that contains none (or 1) of those sites, it's reasonable to raise an error.
I suppose the alternative is to output an informative error message and still exit 0.
chapmanb commentedon May 31, 2018
Brent;
I definitely hear what you're saying about design. Raising a consistent error is also cool with me, it's just hard to tell up front if a dataset will have very few overlaps with the sites (without calculating it myself) so would be awesome if peddy told me this. Right now we essentially catch the symptom (index errors):
https://github.com/bcbio/bcbio-nextgen/blob/ebf096ea43c6464403bd656a1081bc322746a0aa/bcbio/variation/peddy.py#L91
which I'd like to improve on. Thanks again for this discussion.