Last updated: 2022-09-15


Knit directory: chromap_vs_cellranger_scATAC_exploration_10x/

This reproducible R Markdown analysis was created with workflowr (version 1.7.0). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


The results in this page were generated with repository version 10fdcb0. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.


These are the previous versions of the repository in which changes were made to the R Markdown (analysis/index.Rmd) and HTML (docs/index.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File Version Author Date Message
Rmd 10fdcb0 jeremymsimon 2022-09-14 Initial commit
Rmd 112fdea jeremymsimon 2022-09-12 Start workflowr project.

Summary

This site contains the code and results of my comparative analysis of chromap v0.1 vs cellranger-atac v2.0.0 in the processing and analysis of 10x scATAC-seq data.

For illustration purposes, I utilize publicly-accessible 10x scATAC-seq data from PBMCs and HGMM (GM12878 B-lymphocytes).

All the code and results of this analysis are available from GitHub at https://github.com/jeremymsimon/chromap_vs_cellranger_scATAC_exploration_10x.

Follow the links below to access different pieces of the analysis.

Motivation

My personal preference is to support open-source software whenever possible, while being cognizant of computational efficiency. Cellranger is rather slow (runtimes are included in the relevant chapter below), in part because it does more downstream analysis than I desire (clustering, marker gene/region detection, etc.). I also like my approaches to remain flexible and modular, so that new versions and tools can be easily incorporated as they are introduced.

Chromap is one such new approach that aligns and pre-processes scATAC-seq (and other) data very quickly. By default it outputs a fragments file akin to cellranger's, which contains the chromosome, start, and end coordinates of each aligned fragment, along with the cellular barcode associated with that fragment and the number of read pairs supporting it. The associated manuscript states it is over 10 times (ironically, 10x, lol) faster than traditional approaches, e.g. cellranger. Their paper already does a great job of comparing the alignments and resulting clusters to conventional approaches, so that should be your starting point before diving into what I’ve done here. I duplicate some of those efforts here because A) their approach utilized MAESTRO as an intermediate for all processing, and B) they did not utilize a dataset with multiple conditions or replicates. Although MAESTRO is itself a very useful tool, I wished to take either the cellranger or chromap outputs and follow a more direct path to R. Here, I utilize the Signac/Seurat framework, but the logic should be generalizable to other ecosystems.
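To make the fragments file layout concrete, here is a minimal Python sketch that reads the five tab-separated columns described above and counts fragments per barcode. The function names, the example barcodes and coordinates, and the handling of `#` header lines are my own assumptions for illustration. This is not code from the analysis itself, which works with these files through Signac in R.

```python
import io
from collections import Counter
from typing import Iterator, NamedTuple, TextIO


class Fragment(NamedTuple):
    chrom: str    # chromosome name
    start: int    # fragment start coordinate
    end: int      # fragment end coordinate
    barcode: str  # cellular barcode
    count: int    # number of read pairs supporting this fragment


def read_fragments(handle: TextIO) -> Iterator[Fragment]:
    """Yield one Fragment per data line of a tab-separated fragments file.

    Blank lines and lines starting with '#' are skipped (assumption: any
    header lines are comment-style).
    """
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        chrom, start, end, barcode, count = line.split("\t")[:5]
        yield Fragment(chrom, int(start), int(end), barcode, int(count))


def fragments_per_barcode(handle: TextIO) -> Counter:
    """Count how many distinct fragments were assigned to each barcode."""
    counts = Counter()
    for frag in read_fragments(handle):
        counts[frag.barcode] += 1
    return counts


# Hypothetical three-fragment example; barcodes and coordinates are made up.
example = io.StringIO(
    "# hypothetical header line\n"
    "chr1\t10073\t10210\tAAACGAAAGAGCGAAA\t2\n"
    "chr1\t10085\t10339\tAAACGAAAGAGCGAAA\t1\n"
    "chr2\t5000\t5200\tTTTGTGTTCAAGCCGC\t3\n"
)
per_barcode = fragments_per_barcode(example)
```

For a compressed file, the same functions should work on a handle opened with `gzip.open(path, "rt")`.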

Though to my knowledge this hasn’t been fully demonstrated in the literature, I also wish to incorporate an atlas of accessible chromatin from 222 human cell types. The rationale is that peak-calling on an entire sample at once can miss accessible regions that are specific to rare cell types or to a particular condition. As an analogy, for a differential splicing analysis, we may utilize a database of known splice junctions alongside junctions detected de novo in our own data. My thinking here is the same: we can increase our power to detect important regulatory elements by utilizing genomic loci known to play a regulatory role alongside regions that may be important in our own data. This sort of approach is supported in MAESTRO with --custompeak, so other groups may have a similar philosophy.
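The idea of combining known regulatory loci with peaks called on one's own data can be sketched as a simple interval union. The Python below is only an illustration under my own assumptions: the function name, the toy coordinates, and the choice to merge intervals that overlap or touch are all hypothetical. The actual union set in this analysis may be built differently (see the peak-calling chapter).

```python
from collections import defaultdict
from typing import Iterable, List, Tuple

Interval = Tuple[str, int, int]  # (chromosome, start, end)


def union_peaks(*peak_sets: Iterable[Interval]) -> List[Interval]:
    """Pool several peak sets and merge intervals that overlap or touch."""
    by_chrom = defaultdict(list)
    for peaks in peak_sets:
        for chrom, start, end in peaks:
            by_chrom[chrom].append((start, end))

    merged: List[Interval] = []
    for chrom in sorted(by_chrom):
        current_start, current_end = None, None
        for start, end in sorted(by_chrom[chrom]):
            if current_start is None:
                # First interval on this chromosome
                current_start, current_end = start, end
            elif start <= current_end:
                # Overlaps or touches the running interval: extend it
                current_end = max(current_end, end)
            else:
                # Gap found: emit the running interval and start a new one
                merged.append((chrom, current_start, current_end))
                current_start, current_end = start, end
        merged.append((chrom, current_start, current_end))
    return merged


# Toy inputs standing in for atlas regions and de novo peak calls.
atlas_peaks = [("chr1", 100, 300), ("chr1", 1000, 1200)]
de_novo_peaks = [("chr1", 250, 400), ("chr2", 50, 150)]
union = union_peaks(atlas_peaks, de_novo_peaks)
```

In this toy case the overlapping chr1 regions collapse into one interval, while the atlas-only and data-only regions are both retained, which is the behavior the splicing analogy above is meant to convey.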

There are certainly other ways of aligning scATAC-seq reads, quantifying signal across the genome, and identifying critical regulatory elements for downstream analysis that are not used or mentioned here. The comparative analysis presented here is not meant to be comprehensive or definitive, nor will I claim my approach is the best or the right-est or the efficient-est.

The analysis presented here merely serves as a proof-of-principle that a cellranger-independent approach is feasible, that chromap seems like a viable alternative, and that this framework can serve as the foundation for future improvements in peak-calling and feature selection.

Analysis Overview

  1. scATAC-seq data download
  2. Running cellranger
    1. PBMC
    2. HGMM
  3. Running chromap
    1. PBMC
    2. HGMM
    3. Compressing fragments files for downstream analysis
  4. Peak-calling on chromap alignments
    1. Running MACS2
    2. Creation of union set for downstream analysis
  5. Analysis of chromap data in R
    1. Basic processing in Signac
    2. Compute QC metrics
    3. Integration, clustering, and estimating Gene Activity
  6. Analysis of cellranger data in R
    1. Basic processing and QC
    2. Integration and clustering
  7. Comparative analyses
    1. Cluster memberships, gene activity, and peak overlaps: approach-specific peaks
    2. Signal concordance under one unified peak set for both approaches
  8. Conclusions