Stitch together Nanopore tiled amplicon data without polishing a reference

Stitch together Nanopore tiled amplicon data using a reference guided approach

Tiled amplicon data, such as those produced with primers designed by primal scheme, are typically assembled by aligning the reads to a reference and polishing the reference into a sequence that represents the reads. This works very well for obtaining a genome with SNPs and small indels representative of the reads. However, where the reads cannot be mapped well to the reference (e.g. genomes containing hypervariable regions between primers), or where large structural variants are expected, this method may fail because polishing tools expect the reference to originate from the reads.

Lilo uses a reference only to assign reads to the amplicon they originated from and to order and orient the polished amplicons; no reference sequence is incorporated into the final assembly. Once reads are assigned to an amplicon, a read with high average base quality and roughly median length for that amplicon is selected as a reference and polished three times with medaka, using up to 300x coverage. The polished amplicons have primers removed with porechop (fork: https://github.com/sclamons/Porechop-1) and are then assembled with scaffold_builder.
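The reference-read selection described above can be sketched in a few lines of Python. This is only an illustration, not Lilo's actual code: the function names, the 5% length window around the median, and the use of a plain arithmetic mean of Phred scores are all assumptions made for the example.

```python
from statistics import median


def mean_phred(quality, offset=33):
    """Arithmetic mean of the Phred scores in a FASTQ quality string."""
    return sum(ord(c) - offset for c in quality) / len(quality)


def pick_reference_read(reads, length_tolerance=0.05):
    """Pick a draft reference from the reads assigned to one amplicon.

    reads: list of (name, sequence, quality) tuples.
    length_tolerance: hypothetical window around the median read length.
    """
    med = median(len(seq) for _, seq, _ in reads)
    # Keep reads whose length is close to the median for this amplicon.
    near_median = [
        r for r in reads if abs(len(r[1]) - med) <= length_tolerance * med
    ]
    # Fall back to all reads if nothing lands inside the window.
    candidates = near_median or reads
    # Among the candidates, take the read with the best average quality.
    return max(candidates, key=lambda r: mean_phred(r[2]))


reads = [
    ("a", "ACGT" * 25, "I" * 100),  # median length, high quality
    ("b", "ACGT" * 25, "5" * 100),  # median length, lower quality
    ("c", "ACGT" * 10, "I" * 40),   # too short
    ("d", "ACGT" * 50, "~" * 200),  # too long, despite top quality
]
print(pick_reference_read(reads)[0])
```

In this toy input, read "a" is chosen because it sits at the median length and has a higher mean quality than "b"; "d" is excluded by length even though its quality is highest.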

Lilo has been tested on SARS-CoV-2 with ARTIC V3 primers. It has also been tested on 7 kb and 4 kb amplicons with ~100-1000 bp overlaps for ASFV, PRRSV-1 and PRRSV-2, schemes for which will be made available in the near future.

Requirements not covered by conda

Install Conda :)
Install this fork of porechop and make sure it is in your path: https://github.com/sclamons/Porechop-1

Installation

git clone https://github.com/amandawarr/Lilo  
cd Lilo  
conda env create --file LILO.yaml  
conda env create --file scaffold_builder.yaml

Usage

Lilo assumes your reads are in a folder called raw/ and have the suffix .fastq.gz. Multiple samples can be processed at the same time.
Lilo requires a config file detailing the location of a reference, a primer scheme (in the form of a primal scheme style bed file), and a primers.csv file (described below).

conda activate LILO
snakemake -k -s /path/to/LILO --configfile /path/to/config.file --cores N

It is recommended to run with -k so that one sample with insufficient coverage will not stop the other jobs completing.

Input specifications

  • config.file: an example config file has been provided.
  • Primer scheme: As output by primal scheme, with alt primers removed. Bed file of primer alignment locations. Columns: reference name, start, end, primer name, pool (must end with 1 or 2).
  • Primers.csv: Comma delimited, includes alt primers, with header line. Columns: amplicon_name, F_primer_name, F_primer_sequence, R_primer_name, R_primer_sequence. If any of the primers contain many degenerate bases, it is recommended to expand these; the script expand.py will expand the described csv into a longer csv with IUPAC codes expanded.
  • reference.fasta Same reference used to make the scheme file.
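
To make the idea of expanding degenerate bases concrete, here is a minimal Python sketch. It is not the expand.py script shipped with Lilo and does not read or write the csv; the function name `expand_primer` is hypothetical. It only shows how one primer containing IUPAC ambiguity codes maps to the set of unambiguous sequences it represents.

```python
from itertools import product

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expand_primer(seq):
    """Return every unambiguous sequence encoded by a degenerate primer."""
    choices = [IUPAC[base] for base in seq.upper()]
    return ["".join(combo) for combo in product(*choices)]


print(expand_primer("ACR"))       # R stands for A or G
print(len(expand_primer("NY")))   # 4 options for N times 2 for Y
```

The number of expanded sequences multiplies with each degenerate position, so a primer with many ambiguity codes produces a much longer list.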

Output

Lilo uses the names from raw/ to name the output file. For a file named "sample.fastq.gz", the final assembly will be named "sample_Scaffold.fasta", and files produced during the pipeline will be in a folder called "sample". The output will contain amplicons that had at least 40X full length coverage. Missing amplicons will be represented by Ns. Any ambiguity at overlaps will be indicated with IUPAC codes.
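
As a small illustration of the naming convention and the N padding described above, the following Python sketch derives the expected output names for an input file and reports what fraction of a sequence is N. Both helper names are hypothetical and are not part of Lilo.

```python
from pathlib import Path


def expected_outputs(fastq_path):
    """Derive the output names Lilo is described as using for one input file."""
    name = Path(fastq_path).name
    suffix = ".fastq.gz"
    if not name.endswith(suffix):
        raise ValueError(f"expected a {suffix} file, got {name}")
    sample = name[: -len(suffix)]
    return {
        "assembly": f"{sample}_Scaffold.fasta",  # final assembly
        "workdir": sample,                       # intermediate files
    }


def n_fraction(sequence):
    """Fraction of bases that are N, i.e. positions from missing amplicons."""
    return sequence.upper().count("N") / len(sequence)


print(expected_outputs("raw/sample.fastq.gz"))
print(n_fraction("ACNNT"))
```

A high N fraction in the final assembly suggests that several amplicons fell below the coverage threshold for that sample.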

Note

  • Use of the wrong fork for porechop will cause the pipeline to fail.
  • Lilo is a work in progress and has been tested on a limited number of references, amplicon sizes, and overlap sizes; I recommend checking the results carefully for each new scheme.
  • The pipeline currently assumes that any structural variants are contained between the primers of an amplicon and do not change the length of the amplicon by more than 5%. If alt amplicons produce a product of a different length to the original amplicon they may not be allocated to their amplicon. Editing it to work better with alt amplicons is on my to do list.
  • Lilo should not be used with reads produced with rapid kits, as the pipeline assumes the reads are the length of the amplicons.
  • Do let me know if it destroys any cities or steals everyone's left shoe.
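
The length assumption in the notes above (a read should not differ from its amplicon length by more than 5%) can be expressed as a simple check. This is a sketch of the stated assumption only; the function name is hypothetical and how Lilo applies the limit internally is not shown here.

```python
def within_length_tolerance(read_length, amplicon_length, tolerance=0.05):
    """True if the read length is within `tolerance` of the amplicon length."""
    return abs(read_length - amplicon_length) <= tolerance * amplicon_length


# For a 7 kb amplicon, the 5% window is 350 bp either side.
print(within_length_tolerance(7300, 7000))  # 300 bp longer: inside the window
print(within_length_tolerance(7400, 7000))  # 400 bp longer: outside the window
```

Fragmented reads from rapid kits would mostly fail a check like this, which is why they are not suitable input.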
Comments
  • Error in rule reporechop:

    Hello, while running the sample dataset, I encountered the following error messages. I have made sure that porechop is installed correctly and in the path.

    Any help is greatly appreciated.

    Error in rule reporechop: jobid: 2 output: FAT94769_pass_barcode02_66883b35_0/polished_trimmed.fa shell: porechop --adapter_threshold 72 --end_threshold 70 --end_size 30 --extra_end_trim 5 --min_trim_size 3 -f ASFV.primers.csv -i FAT94769_pass_barcode02_66883b35_0/polished_clipped_amplicons.fa --threads 8 --no_split -o FAT94769_pass_barcode02_66883b35_0/polished_trimmed.fa (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

    opened by tboonf 1
  • Error while running LILO

    Dear, I get the following error while running LILO. Any idea what could be the problem?

    /bin/bash: /home/minion/anaconda3/envs/LILO/etc/profile.d/conda.sh: No such file or directory
    
    CommandNotFoundError: Your shell has not been properly configured to use 'conda activate'.
    To initialize your shell, run
    
        $ conda init <SHELL_NAME>
    
    Currently supported shells are:
      - bash
      - fish
      - tcsh
      - xonsh
      - zsh
      - powershell
    
    See 'conda init --help' for more information and options.
    
    IMPORTANT: You may need to close and restart your shell after running 'conda init'.
    
    
    /bin/bash: line 2: scaffold_builder.py: command not found
    sed: can't read reads_24h_Scaffold.fasta: No such file or directory
    [Wed Aug 10 11:12:28 2022]
    Error in rule scaffold:
        jobid: 1
        output: reads_24h_Scaffold.fasta
        shell:
            source $CONDA_PREFIX/etc/profile.d/conda.sh
                    conda activate scaffold_builder
                    scaffold_builder.py -i 75 -t 3693 -g 80000 -r /home/minion/lilo-test/ASFV.reference.fasta -q reads_24h/polished_trimmed.fa -p reads_24h
                    sed -i '1 s/^.*$/>reads_24h_Lilo_scaffold/' reads_24h_Scaffold.fasta
            (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
    
    Job failed, going on with independent jobs.
    Exiting because a job execution failed. Look above for error message
    Complete log: /home/minion/lilo-test/.snakemake/log/2022-08-10T111227.425486.snakemake.log
    

    Kind regards, Elisabeth

    opened by el-mat 1
  • LILO with SLURM

    Hi there,

    I'm trying to run LILO on a SLURM HPC and I'm not sure what the errors are related to. Do you have an idea? It seems really environment dependent, but maybe you stumbled across something similar.

    Call:

    snakemake -k -s [...]/tools/Lilo/LILO --configfile $CONFIG --profile [...]/tools/config-snippets/snake-cookies/slurm
    

    Log:

    [...]
    MissingOutputException in line 84 of [...]/tools/Lilo/LILO:
    Job Missing files after 30 seconds:
    FAR95540_pass_unclassified_7f618209_73/split/amplicon51.fastq
    This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
    Job id: 133673 completed successfully, but some output files are missing. 133673
    Trying to restart job 133673.
    [...]
    Error in rule assign:
        jobid: 133673
        output: FAR95540_pass_unclassified_7f618209_73/split/amplicon51.fastq
        shell:
            bedtools intersect -F 0.9 -wa -wb -bed -abam FAR95540_pass_unclassified_7f618209_73/alignments/reads_to_ref.bam -b amplicons.bed  | grep amplicon51 - | awk '{print $4}' - | seqtk subseq porechop/FAR95540_pass_unclassified_7f618209_73.fastq.gz - > FAR95540_pass_unclassified_7f618209_73/split/amplicon51.fastq
            (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
        cluster_jobid: 210115
    
    Error executing rule assign on cluster (jobid: 133673, external: 210115, jobscript: [...]/.snakemake/tmp.cssfeg5e/snakejob.assign.133673.sh). For error details see the cluster log and the log files of the involved rule(s).
    [...]
    Traceback (most recent call last):
      File "/scratch/lataretum/miniconda3/envs/LILO/lib/python3.8/site-packages/snakemake/__init__.py", line 701, in snakemake
        success = workflow.execute(
      File "/scratch/lataretum/miniconda3/envs/LILO/lib/python3.8/site-packages/snakemake/workflow.py", line 1077, in execute
        success = self.scheduler.schedule()
      File "/scratch/lataretum/miniconda3/envs/LILO/lib/python3.8/site-packages/snakemake/scheduler.py", line 441, in schedule
        self._error_jobs()
      File "/scratch/lataretum/miniconda3/envs/LILO/lib/python3.8/site-packages/snakemake/scheduler.py", line 557, in _error_jobs
        self._handle_error(job)
      File "/scratch/lataretum/miniconda3/envs/LILO/lib/python3.8/site-packages/snakemake/scheduler.py", line 615, in _handle_error
        self.running.remove(job)
    KeyError: assign
    

    I set --latency-wait 90, but it again breaks after some time at an assign rule, with a KeyError: read_select from the snakemake scheduler.

    Let me know which input/config files might be interesting to solve this. :)

    opened by MarieLataretu 7
Releases(v0.2)
Owner
Amanda Warr